fix(filters): escape - in regex_url character class (#1601)#1880
fix(filters): escape - in regex_url character class (#1601)#1880SAY-5 wants to merge 4 commits intoNVIDIA-NeMo:mainfrom
- in regex_url character class (#1601)#1880Conversation
The `regex_url` pattern used in heuristic string filters contained
the unescaped class `[$-_@.&+]`. Python's re engine reads `$-_` as
a *range* spanning U+0024 ($) through U+005F (_), which silently
includes `<`, `>`, `;`, `:`, `=`, and other punctuation. Matches
therefore bled past the end of legitimate URLs into surrounding
HTML/punctuation:
re.findall(URL_RE, 'see http://x.com<bad> end')
# before: ['http://x.com<bad>']
# after: ['http://x.com']
Escaping the `-` makes the regex engine treat it as a literal char,
so the class is exactly the set of URL-safe characters the original
author intended ([$_@.&+]).
Adds regression tests under tests/stages/text/utils/test_regex_url.py
covering: angle-bracket bleed, semicolon bleed, retention of all
explicitly-allowed characters, and parity for `http`/`https`.
Signed-off-by: SAY-5 <say.apm35@gmail.com>
Greptile SummaryThis PR fixes a long-standing regex bug where the character class
Confidence Score: 5/5Safe to merge — the change is a one-character regex escape fix with regression tests and a backward-compatible API extension. The regex correction is minimal and well-understood: escaping No files require special attention. Important Files Changed
Reviews (4): Last reviewed commit: "Merge branch 'main' into fix/regex-url-o..." | Re-trigger Greptile |
sarahyurick
left a comment
There was a problem hiding this comment.
Hi @SAY-5 thanks for opening! Left some requests.
There was a problem hiding this comment.
Please add the parameters as described here: #1601 (comment)
Then tests can be added to the existing pytest file. I do not think we need to add a separate test file here.
Signed-off-by: Sai Asish Y <saiasish.cnp@gmail.com>
|
@sarahyurick addressed both review points:
The escaped-dash regex itself was already in place from the original commit; left that alone since it produces the same compiled pattern as your suggestion form. Happy to switch to your literal form if you'd prefer that diff. Thanks for the review! |
sarahyurick
left a comment
There was a problem hiding this comment.
This is great, thank you @SAY-5 ! I left some tiny comments.
| # `[$-_…]` the regex engine treats `$-_` as a *range* spanning U+0024 | ||
| # through U+005F, which silently includes `<`, `>`, `;`, `:`, etc., so | ||
| # matches bleed past the actual URL into surrounding HTML/punctuation. | ||
| # See https://github.com/NVIDIA-NeMo/Curator/issues/1601. |
| # Regression for #1601: previously the `[$-_…]` range matched `<`, | ||
| # `>`, `;`, `:`, etc., so a URL match bled past the URL into the | ||
| # surrounding HTML/punctuation. | ||
| from nemo_curator.stages.text.utils.constants import regex_url |
There was a problem hiding this comment.
Can you move this import to the top of the file instead?
| # The fix must not regress on characters the original character | ||
| # class intended to allow: letters, digits, `$`, `_`, `@`, `.`, | ||
| # `&`, `+`, `-`, `!`, `*`, `(`, `)`, `,`, and percent-encoded escapes. | ||
| from nemo_curator.stages.text.utils.constants import regex_url |
| def test_urls_filter_accepts_compiled_pattern(self) -> None: | ||
| # The custom regex argument should also accept a pre-compiled | ||
| # `re.Pattern` instance, not just a string. | ||
| import re |
Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>
|
/ok to test 5661033 |
Closes #1601.
Summary
regex_urlpattern innemo_curator/stages/text/utils/constants.pyhad the unescaped class[$-_@.&+]. Python'sreengine reads$-_as a range from U+0024 ($) through U+005F (_), which silently includes<,>,;,:,=, etc.regex_url.findall("see http://x.com<bad> end")previously returned['http://x.com<bad>']instead of['http://x.com']. The pattern is used byUrlsFilterandRepeatingTopNGramsFilter(called transitively viarepeating_n_grampaths), so this bleed inflates theurls_ratioscore on documents that interleave URLs with HTML or punctuation.-([$\-_@.&+]) so it is treated as a literal character. The class now contains exactly the set of URL-safe characters the original author intended.Behavior change
Characters that the buggy range accidentally admitted (
<,>,;,:,=, ASCII 0x25–0x5F that weren't otherwise listed) are no longer captured. Characters the original code intentionally listed ($,_,@,.,&,+,-,!,*,(,),,, percent-encoded escapes) all still match.Test plan
tests/stages/text/utils/test_regex_url.pywith four regression cases:<bad>after a URL no longer extends the match,;nextafter a URL no longer extends the match,http://andhttps://continue to match.rereproduction; full pytest run requires the Ray/cosmos_xenna stack which CI will exercise).Authorship
Signed off per DCO. The patch is one line; the regression tests document why.