fix(filters): escape `-` in regex_url character class (#1601) by SAY-5 · Pull Request #1880 · NVIDIA-NeMo/Curator

SAY-5 · 2026-04-28T08:48:06Z

Closes #1601.

Summary

The regex_url pattern in nemo_curator/stages/text/utils/constants.py had the unescaped class [$-_@.&+]. Python's re engine reads $-_ as a range from U+0024 ($) through U+005F (_), which silently includes <, >, ;, :, =, etc.
Result: regex_url.findall("see http://x.com<bad> end") previously returned ['http://x.com<bad>'] instead of ['http://x.com']. The pattern is used by UrlsFilter and RepeatingTopNGramsFilter (called transitively via repeating_n_gram paths), so this bleed inflates the urls_ratio score on documents that interleave URLs with HTML or punctuation.
Fix: escape the - ([$\-_@.&+]) so it is treated as a literal character. The class now contains exactly the set of URL-safe characters the original author intended.

Behavior change

Characters that the buggy range accidentally admitted (<, >, ;, :, =, ASCII 0x25–0x5F that weren't otherwise listed) are no longer captured. Characters the original code intentionally listed ($, _, @, ., &, +, -, !, *, (, ), ,, percent-encoded escapes) all still match.

Test plan

Added tests/stages/text/utils/test_regex_url.py with four regression cases:
- <bad> after a URL no longer extends the match,
- ;next after a URL no longer extends the match,
- all explicitly-allowed characters still match,
- both http:// and https:// continue to match.
Verified the new tests fail against the unfixed regex and pass against the fixed one (standalone Python re reproduction; full pytest run requires the Ray/cosmos_xenna stack which CI will exercise).

Authorship

Signed off per DCO. The patch is one line; the regression tests document why.

The `regex_url` pattern used in heuristic string filters contained the unescaped class `[$-_@.&+]`. Python's re engine reads `$-_` as a *range* spanning U+0024 ($) through U+005F (_), which silently includes `<`, `>`, `;`, `:`, `=`, and other punctuation. Matches therefore bled past the end of legitimate URLs into surrounding HTML/punctuation: re.findall(URL_RE, 'see http://x.com<bad> end') # before: ['http://x.com<bad>'] # after: ['http://x.com'] Escaping the `-` makes the regex engine treat it as a literal char, so the class is exactly the set of URL-safe characters the original author intended ([$_@.&+]). Adds regression tests under tests/stages/text/utils/test_regex_url.py covering: angle-bracket bleed, semicolon bleed, retention of all explicitly-allowed characters, and parity for `http`/`https`. Signed-off-by: SAY-5 <say.apm35@gmail.com>

copy-pr-bot · 2026-04-28T08:48:09Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-04-28T08:49:24Z

Greptile Summary

This PR fixes a long-standing regex bug where the character class [$-_@.&+] was treated as a range ($ through _, U+0024–U+005F), silently matching characters like <, >, ;, :, and = and causing URL matches to bleed into surrounding HTML or punctuation. The fix escapes the hyphen ([$\\-_@.&+]) and adds an optional url_regex parameter to UrlsFilter and PornographicUrlsFilter for callers that need a different pattern.

constants.py: Single-character fix (- → \\- in the character class) plus an inline comment documenting why the escape is intentional.
string.py: UrlsFilter and PornographicUrlsFilter gain an optional url_regex parameter (string or pre-compiled pattern); both store it as self._url_regex and use it in score_document.
test_filters.py: Four new tests covering HTML-tag bleed regression, allowed-character correctness, custom string regex, and acceptance of a pre-compiled re.Pattern.

Confidence Score: 5/5

Safe to merge — the change is a one-character regex escape fix with regression tests and a backward-compatible API extension.

The regex correction is minimal and well-understood: escaping - in a character class changes only the specific unintended range matches that the PR documents. The new url_regex parameter defaults to None (falling back to the existing compiled regex), so existing call sites are unaffected. The four new tests directly cover the fixed behavior and the new parameter paths.

No files require special attention.

Important Files Changed

Filename	Overview
nemo_curator/stages/text/utils/constants.py	Escapes the `-` in `[$-_@.&+]` to `[$\-_@.&+]`, fixing the silent character-range bug; adds explanatory comment.
nemo_curator/stages/text/filters/heuristic/string.py	Adds optional `url_regex` parameter to `UrlsFilter` and `PornographicUrlsFilter`; stores it as `self._url_regex` and uses it in `score_document`.
tests/stages/text/modules/test_filters.py	Adds four regression and feature tests: HTML tag bleed, allowed characters, custom string regex, and compiled-pattern acceptance.

_{Reviews (4): Last reviewed commit: "Merge branch 'main' into fix/regex-url-o..." | Re-trigger Greptile}

sarahyurick

Hi @SAY-5 thanks for opening! Left some requests.

sarahyurick · 2026-05-04T17:54:35Z

Please add the parameters as described here: #1601 (comment)

Then tests can be added to the existing pytest file. I do not think we need to add a separate test file here.

Signed-off-by: Sai Asish Y <saiasish.cnp@gmail.com>

SAY-5 · 2026-05-06T21:27:16Z

@sarahyurick addressed both review points:

Parameterized regex_url on UrlsFilter and PornographicUrlsFilter (per the issue #1601 thread). Both accept url_regex: re.Pattern | str | None = None and fall back to the project default when omitted, so existing call sites are unaffected.
Folded the tests into the existing tests/stages/text/modules/test_filters.py suite alongside test_urls, and removed the standalone test_regex_url.py. New cases cover the HTML-tag/semicolon regression, the allowed-character set, and the new url_regex parameter (string + compiled re.Pattern forms).

The escaped-dash regex itself was already in place from the original commit; left that alone since it produces the same compiled pattern as your suggestion form. Happy to switch to your literal form if you'd prefer that diff.

Thanks for the review!

sarahyurick

This is great, thank you @SAY-5 ! I left some tiny comments.

sarahyurick · 2026-05-07T21:37:26Z

+# `[$-_…]` the regex engine treats `$-_` as a *range* spanning U+0024
+# through U+005F, which silently includes `<`, `>`, `;`, `:`, etc., so
+# matches bleed past the actual URL into surrounding HTML/punctuation.
+# See https://github.com/NVIDIA-NeMo/Curator/issues/1601.


Remove link please.

sarahyurick · 2026-05-07T21:37:40Z

+        # Regression for #1601: previously the `[$-_…]` range matched `<`,
+        # `>`, `;`, `:`, etc., so a URL match bled past the URL into the
+        # surrounding HTML/punctuation.
+        from nemo_curator.stages.text.utils.constants import regex_url


Can you move this import to the top of the file instead?

sarahyurick · 2026-05-07T21:37:59Z

+        # The fix must not regress on characters the original character
+        # class intended to allow: letters, digits, `$`, `_`, `@`, `.`,
+        # `&`, `+`, `-`, `!`, `*`, `(`, `)`, `,`, and percent-encoded escapes.
+        from nemo_curator.stages.text.utils.constants import regex_url


Same comment as above.

sarahyurick · 2026-05-07T21:38:38Z

+    def test_urls_filter_accepts_compiled_pattern(self) -> None:
+        # The custom regex argument should also accept a pre-compiled
+        # `re.Pattern` instance, not just a string.
+        import re


Same comment as above.

Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>

sarahyurick · 2026-05-08T21:20:39Z

/ok to test 5661033

SAY-5 requested a review from a team as a code owner April 28, 2026 08:48

SAY-5 requested review from huvunvidia and removed request for a team April 28, 2026 08:48

github-actions Bot added the community-request label Apr 28, 2026

svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label Apr 30, 2026

sarahyurick requested changes May 4, 2026

View reviewed changes

svcnvidia-nemo-ci added waiting-on-customer Waiting on the original author to respond and removed waiting-on-maintainers Waiting on maintainers to respond labels May 4, 2026

address review: parameterize url_regex, fold tests into existing suite

9abd3fe

Signed-off-by: Sai Asish Y <saiasish.cnp@gmail.com>

svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label May 7, 2026

sarahyurick reviewed May 7, 2026

View reviewed changes

svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 8, 2026

test(filters): address remaining review nits

4004624

Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>

sarahyurick approved these changes May 8, 2026

View reviewed changes

Merge branch 'main' into fix/regex-url-overly-permissive-range-1601

5661033

copy-pr-bot Bot deployed to test May 8, 2026 21:21 Active

copy-pr-bot Bot requested a deployment to nemo-ci May 8, 2026 21:21 Queued

copy-pr-bot Bot temporarily deployed to nemo-ci May 8, 2026 21:21 Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(filters): escape `-` in regex_url character class (#1601)#1880

fix(filters): escape `-` in regex_url character class (#1601)#1880
SAY-5 wants to merge 4 commits intoNVIDIA-NeMo:mainfrom
SAY-5:fix/regex-url-overly-permissive-range-1601

SAY-5 commented Apr 28, 2026

Uh oh!

copy-pr-bot Bot commented Apr 28, 2026

Uh oh!

greptile-apps Bot commented Apr 28, 2026 •

edited

Loading

Uh oh!

sarahyurick left a comment

Uh oh!

Uh oh!

sarahyurick May 4, 2026

Uh oh!

SAY-5 commented May 6, 2026

Uh oh!

sarahyurick left a comment

Uh oh!

sarahyurick May 7, 2026

Uh oh!

sarahyurick May 7, 2026

Uh oh!

sarahyurick May 7, 2026

Uh oh!

sarahyurick May 7, 2026

Uh oh!

sarahyurick commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

SAY-5 commented Apr 28, 2026

Summary

Behavior change

Test plan

Authorship

Uh oh!

copy-pr-bot Bot commented Apr 28, 2026

Uh oh!

greptile-apps Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

sarahyurick left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sarahyurick May 4, 2026

Choose a reason for hiding this comment

Uh oh!

SAY-5 commented May 6, 2026

Uh oh!

sarahyurick left a comment

Choose a reason for hiding this comment

Uh oh!

sarahyurick May 7, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick May 7, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick May 7, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick May 7, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps Bot commented Apr 28, 2026 •

edited

Loading