fix: parse dates with a numeric offset and redundant tz abbreviation (#1227) #1348
Open
Sanjays2402 wants to merge 1 commit into
…crapinghub#1227)

RFC 2822 email dates carry both a numeric UTC offset and an equivalent timezone abbreviation in parentheses, e.g. ``Thu, 30 May 2024 10:13:10 -0500 (CDT)``. ``dateparser.parse`` returned None for these, while Python's own ``email.utils.parsedate_to_datetime`` handles them.

Root cause: ``strip_braces`` turns ``(CDT)`` into a bare ``CDT`` token, so the string now contains two timezone tokens (``-0500`` and ``CDT``). ``pop_tz_offset_from_string`` removed only the first token it matched (``CDT``), leaving the numeric ``-0500`` stranded in the string; the absolute parser then failed on the leftover ``0500`` and the whole parse returned None. The equivalent GMT-prefixed form (``GMT+0800 (CST)``) already worked only because its offset regex greedily spans the trailing abbreviation.

Fix: after removing the first timezone token, strip a second, adjacent token too when it denotes the same UTC offset (the parenthesised abbreviation is informational; the numeric offset is authoritative). A conflicting second timezone is left in place, preserving current behavior. The remainder is right-stripped before the follow-up search because the numeric-offset regexes are anchored at the end of the string.

Adds regression tests for both token orderings at the pop-timezone level and at the full-parse level; each fails without the fix and passes with it.
Summary
Fixes #1227.
`dateparser.parse` returned `None` for RFC 2822 email dates that carry both a numeric UTC offset and a redundant, equivalent timezone abbreviation in parentheses, for example `Thu, 30 May 2024 10:13:10 -0500 (CDT)`.

Python's own `email.utils.parsedate_to_datetime` parses this string fine, so dateparser silently failing on a very common email-header form is surprising.
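For reference, the standard-library behavior mentioned above can be checked without dateparser installed. This is a minimal sketch using only `email.utils` and `datetime`; the variable names are illustrative and not taken from the PR.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

# The RFC 2822 date from the report: numeric offset plus a parenthesised
# abbreviation that names the same offset.
header_date = "Thu, 30 May 2024 10:13:10 -0500 (CDT)"

# email.utils reads the numeric offset and ignores the trailing comment.
parsed = parsedate_to_datetime(header_date)

print(parsed.isoformat())
print(parsed.utcoffset() == timedelta(hours=-5))
```

The resulting datetime is timezone-aware and carries the numeric `-05:00` offset, which is the value the fixed `dateparser.parse` is expected to return for the same input.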
Root cause
`dateparser.utils.strip_braces` turns `(CDT)` into a bare `CDT` token, so the string reaching the timezone step contains two timezone tokens: `-0500` and `CDT`. `pop_tz_offset_from_string` removed only the first token it matched (`CDT`), leaving the numeric `-0500` stranded in the string. The absolute parser then choked on the leftover (`ValueError: Unable to parse: 0500`) and the whole parse returned `None`.

The equivalent GMT-prefixed form, for example `Fri Sep 23 2016 10:34:51 GMT+0800 (CST)`, already worked, but only by accident: the offset regex for the `GMT+0800` form is `(?:UTC|GMT)\+08:?00.*$`, whose trailing `.*$` greedily swallows the `CST` abbreviation, so both tokens are consumed by one match. A bare numeric offset like `-0500` has no such spanning regex, so its redundant abbreviation is orphaned.
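The difference between the two forms can be reproduced with the `re` module alone. In this sketch the GMT pattern is the one quoted above; `bare_offset` is a hypothetical end-anchored pattern written for illustration and is not dateparser's actual regex. The inputs are shown as they would look after the parentheses have been stripped.

```python
import re

# Pattern quoted in the PR description for the GMT+0800 form.
gmt_offset = re.compile(r"(?:UTC|GMT)\+08:?00.*$")

# Hypothetical end-anchored pattern for a bare numeric offset (an assumption
# for illustration, not the library's real regex).
bare_offset = re.compile(r"(?:^|\s)-05:?00$")

gmt_input = "Fri Sep 23 2016 10:34:51 GMT+0800 CST"
bare_input = "Thu, 30 May 2024 10:13:10 -0500 CDT"

# The trailing .*$ lets one match span both the offset and the abbreviation.
gmt_match = gmt_offset.search(gmt_input)
gmt_remainder = gmt_offset.sub("", gmt_input)

# The end-anchored pattern cannot match while the abbreviation follows it.
bare_match = bare_offset.search(bare_input)

print(repr(gmt_match.group(0)))
print(repr(gmt_remainder))
print(bare_match)
```

One match removes both `GMT+0800` and `CST` in the first case, while the bare offset is not matched at all as long as `CDT` trails it.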
Fix
After removing the first timezone token, strip a second, adjacent token too when it denotes the same UTC offset: the parenthesised abbreviation is purely informational and the numeric offset is authoritative (this matches `email.utils`). A conflicting second timezone is deliberately left in place, so existing behavior is preserved for contradictory input. The remainder is right-stripped before the follow-up search because the numeric-offset regexes are anchored at the end of the string.
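The rule described above can be sketched as a standalone function. This is not the code from the PR: `pop_tz_tokens`, `_pop_trailing_tz`, the tiny `ABBREVIATIONS` table, and both regexes are hypothetical, and the sketch only looks at trailing tokens, which is a simplification of what `pop_tz_offset_from_string` does.

```python
import re
from datetime import timedelta

# Hypothetical abbreviation table, just large enough for the examples.
ABBREVIATIONS = {
    "CDT": timedelta(hours=-5),
    "CST": timedelta(hours=-6),
    "UTC": timedelta(0),
}

# Both patterns are anchored at the end of the string (assumed shapes).
NUMERIC_OFFSET = re.compile(r"(?:^|\s)([+-])(\d{2}):?(\d{2})$")
ABBREVIATION = re.compile(r"(?:^|\s)(" + "|".join(ABBREVIATIONS) + r")$")


def _pop_trailing_tz(text):
    """Remove one trailing timezone token; return (remainder, offset or None)."""
    match = NUMERIC_OFFSET.search(text)
    if match:
        sign = -1 if match.group(1) == "-" else 1
        offset = sign * timedelta(
            hours=int(match.group(2)), minutes=int(match.group(3))
        )
        return text[: match.start()], offset
    match = ABBREVIATION.search(text)
    if match:
        return text[: match.start()], ABBREVIATIONS[match.group(1)]
    return text, None


def pop_tz_tokens(text):
    """Pop a timezone token, plus an adjacent one if it names the same offset."""
    remainder, offset = _pop_trailing_tz(text.rstrip())
    if offset is None:
        return text, None
    # Right-strip first: the patterns only match at the very end of the string.
    candidate, second = _pop_trailing_tz(remainder.rstrip())
    if second is not None and second == offset:
        remainder = candidate  # redundant token: drop it as well
    # A conflicting second token stays in the remainder untouched.
    return remainder, offset


print(pop_tz_tokens("30 May 2024 10:13:10 -0500 CDT"))
print(pop_tz_tokens("30 May 2024 10:13:10 CDT -0500"))
print(pop_tz_tokens("30 May 2024 10:13:10 -0500 CST"))
```

In the first two calls both tokens are removed and the offset is five hours behind UTC; in the third, the conflicting `-0500` is left in the returned string, mirroring the "left in place" rule.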
The change is behavior-preserving for every existing case: I verified the new `pop_tz_offset_from_string` returns byte-identical `(string, offset)` results to the old implementation across all 40 inputs in the existing `test_extracting_valid_offset` suite plus assorted trailing-whitespace edge cases.
Before / after
| Input | Before | After |
| --- | --- | --- |
| `Thu, 30 May 2024 10:13:10 -0500 (CDT)` | `None` | `2024-05-30 10:13:10-05:00` |
| `30 May 2024 10:13:10 -0500 CDT` | `None` | `2024-05-30 10:13:10-05:00` |
| `30 May 2024 10:13:10 CDT -0500` | `None` | `2024-05-30 10:13:10-05:00` |
| `Mon, 15 Jan 2024 09:30:00 +0000 (UTC)` | `None` | `2024-01-15 09:30:00+00:00` |
| `Fri Sep 23 2016 10:34:51 GMT+0800 (CST)` | `2016-09-23 10:34:51+08:00` | |
| `30 May 2024 10:13:10 -0500 CST` (conflicting) | `None` | `None` (unchanged) |

Tests
Added regression coverage at two levels:
- `tests/test_timezone_parser.py::TestTZPopping::test_timezone_deleted_from_string`: both token orderings (`-0500 CDT` and `CDT -0500`) must leave the string clean.
- `tests/test_date_parser.py::TestDateParser::test_parsing_with_utc_offsets`: full-parse cases converted to UTC (the reported string plus a `+0000 (UTC)` variant).

Each new case fails on `master` and passes with the fix (verified by stashing only the source change).
Full suite green with the fix: `24205 passed, 1 skipped, 1 xfailed` (baseline was `24200 passed` + 5 new cases).

`ruff check` and `ruff format --check` are clean on all changed files.
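Since the full-parse cases are compared after conversion to UTC, the expected values follow from plain offset arithmetic. The sketch below uses only the standard library to show that arithmetic for the two inputs named above; it does not reproduce the actual test parameters, and the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# "Thu, 30 May 2024 10:13:10 -0500 (CDT)" as an aware datetime.
reported = datetime(2024, 5, 30, 10, 13, 10, tzinfo=timezone(timedelta(hours=-5)))

# "Mon, 15 Jan 2024 09:30:00 +0000 (UTC)" as an aware datetime.
utc_variant = datetime(2024, 1, 15, 9, 30, 0, tzinfo=timezone.utc)

# Converting to UTC shifts the wall-clock time by the offset.
reported_utc = reported.astimezone(timezone.utc)
utc_variant_utc = utc_variant.astimezone(timezone.utc)

print(reported_utc.isoformat())
print(utc_variant_utc.isoformat())
```

The `-0500` input moves five hours forward when expressed in UTC, while the `+0000` input is unchanged.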