Skip to content

fix(filters): escape - in regex_url character class (#1601)#1880

Open
SAY-5 wants to merge 4 commits intoNVIDIA-NeMo:mainfrom
SAY-5:fix/regex-url-overly-permissive-range-1601
Open

fix(filters): escape - in regex_url character class (#1601)#1880
SAY-5 wants to merge 4 commits intoNVIDIA-NeMo:mainfrom
SAY-5:fix/regex-url-overly-permissive-range-1601

Conversation

@SAY-5
Copy link
Copy Markdown

@SAY-5 SAY-5 commented Apr 28, 2026

Closes #1601.

Summary

  • The regex_url pattern in nemo_curator/stages/text/utils/constants.py had the unescaped class [$-_@.&+]. Python's re engine reads $-_ as a range from U+0024 ($) through U+005F (_), which silently includes <, >, ;, :, =, etc.
  • Result: regex_url.findall("see http://x.com<bad> end") previously returned ['http://x.com<bad>'] instead of ['http://x.com']. The pattern is used by UrlsFilter and RepeatingTopNGramsFilter (called transitively via repeating_n_gram paths), so this bleed inflates the urls_ratio score on documents that interleave URLs with HTML or punctuation.
  • Fix: escape the - ([$\-_@.&+]) so it is treated as a literal character. The class now contains exactly the set of URL-safe characters the original author intended.

Behavior change

Characters that the buggy range accidentally admitted (<, >, ;, :, =, ASCII 0x25–0x5F that weren't otherwise listed) are no longer captured. Characters the original code intentionally listed ($, _, @, ., &, +, -, !, *, (, ), ,, percent-encoded escapes) all still match.

Test plan

  • Added tests/stages/text/utils/test_regex_url.py with four regression cases:
    • <bad> after a URL no longer extends the match,
    • ;next after a URL no longer extends the match,
    • all explicitly-allowed characters still match,
    • both http:// and https:// continue to match.
  • Verified the new tests fail against the unfixed regex and pass against the fixed one (standalone Python re reproduction; full pytest run requires the Ray/cosmos_xenna stack which CI will exercise).

Authorship

Signed off per DCO. The patch is one line; the regression tests document why.

The `regex_url` pattern used in heuristic string filters contained
the unescaped class `[$-_@.&+]`. Python's re engine reads `$-_` as
a *range* spanning U+0024 ($) through U+005F (_), which silently
includes `<`, `>`, `;`, `:`, `=`, and other punctuation. Matches
therefore bled past the end of legitimate URLs into surrounding
HTML/punctuation:

    re.findall(URL_RE, 'see http://x.com<bad> end')
    # before: ['http://x.com<bad>']
    # after:  ['http://x.com']

Escaping the `-` makes the regex engine treat it as a literal char,
so the class is exactly the set of URL-safe characters the original
author intended ([$_@.&+]).

Adds regression tests under tests/stages/text/utils/test_regex_url.py
covering: angle-bracket bleed, semicolon bleed, retention of all
explicitly-allowed characters, and parity for `http`/`https`.

Signed-off-by: SAY-5 <say.apm35@gmail.com>
@SAY-5 SAY-5 requested a review from a team as a code owner April 28, 2026 08:48
@SAY-5 SAY-5 requested review from huvunvidia and removed request for a team April 28, 2026 08:48
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

This PR fixes a long-standing regex bug where the character class [$-_@.&+] was treated as a range ($ through _, U+0024–U+005F), silently matching characters like <, >, ;, :, and = and causing URL matches to bleed into surrounding HTML or punctuation. The fix escapes the hyphen ([$\\-_@.&+]) and adds an optional url_regex parameter to UrlsFilter and PornographicUrlsFilter for callers that need a different pattern.

  • constants.py: Single-character fix (-\\- in the character class) plus an inline comment documenting why the escape is intentional.
  • string.py: UrlsFilter and PornographicUrlsFilter gain an optional url_regex parameter (string or pre-compiled pattern); both store it as self._url_regex and use it in score_document.
  • test_filters.py: Four new tests covering HTML-tag bleed regression, allowed-character correctness, custom string regex, and acceptance of a pre-compiled re.Pattern.

Confidence Score: 5/5

Safe to merge — the change is a one-character regex escape fix with regression tests and a backward-compatible API extension.

The regex correction is minimal and well-understood: escaping - in a character class changes only the specific unintended range matches that the PR documents. The new url_regex parameter defaults to None (falling back to the existing compiled regex), so existing call sites are unaffected. The four new tests directly cover the fixed behavior and the new parameter paths.

No files require special attention.

Important Files Changed

Filename Overview
nemo_curator/stages/text/utils/constants.py Escapes the - in [$-_@.&+] to [$\-_@.&+], fixing the silent character-range bug; adds explanatory comment.
nemo_curator/stages/text/filters/heuristic/string.py Adds optional url_regex parameter to UrlsFilter and PornographicUrlsFilter; stores it as self._url_regex and uses it in score_document.
tests/stages/text/modules/test_filters.py Adds four regression and feature tests: HTML tag bleed, allowed characters, custom string regex, and compiled-pattern acceptance.

Reviews (4): Last reviewed commit: "Merge branch 'main' into fix/regex-url-o..." | Re-trigger Greptile

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label Apr 30, 2026
Copy link
Copy Markdown
Contributor

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @SAY-5 thanks for opening! Left some requests.

Comment thread nemo_curator/stages/text/utils/constants.py
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the parameters as described here: #1601 (comment)

Then tests can be added to the existing pytest file. I do not think we need to add a separate test file here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added waiting-on-customer Waiting on the original author to respond and removed waiting-on-maintainers Waiting on maintainers to respond labels May 4, 2026
Signed-off-by: Sai Asish Y <saiasish.cnp@gmail.com>
@SAY-5
Copy link
Copy Markdown
Author

SAY-5 commented May 6, 2026

@sarahyurick addressed both review points:

  • Parameterized regex_url on UrlsFilter and PornographicUrlsFilter (per the issue #1601 thread). Both accept url_regex: re.Pattern | str | None = None and fall back to the project default when omitted, so existing call sites are unaffected.
  • Folded the tests into the existing tests/stages/text/modules/test_filters.py suite alongside test_urls, and removed the standalone test_regex_url.py. New cases cover the HTML-tag/semicolon regression, the allowed-character set, and the new url_regex parameter (string + compiled re.Pattern forms).

The escaped-dash regex itself was already in place from the original commit; left that alone since it produces the same compiled pattern as your suggestion form. Happy to switch to your literal form if you'd prefer that diff.

Thanks for the review!

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label May 7, 2026
Copy link
Copy Markdown
Contributor

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thank you @SAY-5 ! I left some tiny comments.

# `[$-_…]` the regex engine treats `$-_` as a *range* spanning U+0024
# through U+005F, which silently includes `<`, `>`, `;`, `:`, etc., so
# matches bleed past the actual URL into surrounding HTML/punctuation.
# See https://github.com/NVIDIA-NeMo/Curator/issues/1601.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove link please.

# Regression for #1601: previously the `[$-_…]` range matched `<`,
# `>`, `;`, `:`, etc., so a URL match bled past the URL into the
# surrounding HTML/punctuation.
from nemo_curator.stages.text.utils.constants import regex_url
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this import to the top of the file instead?

# The fix must not regress on characters the original character
# class intended to allow: letters, digits, `$`, `_`, `@`, `.`,
# `&`, `+`, `-`, `!`, `*`, `(`, `)`, `,`, and percent-encoded escapes.
from nemo_curator.stages.text.utils.constants import regex_url
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

def test_urls_filter_accepts_compiled_pattern(self) -> None:
# The custom regex argument should also accept a pre-compiled
# `re.Pattern` instance, not just a string.
import re
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 8, 2026
Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>
@sarahyurick
Copy link
Copy Markdown
Contributor

/ok to test 5661033

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-request waiting-on-customer Waiting on the original author to respond

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Overly permissive url regex

3 participants