OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4)#1104
krickert wants to merge 5 commits into
OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to
Supersedes #1101.
Pull request overview
Adds the next slice of OPENNLP-1850 by introducing a Unicode UAX #29 word segmenter/tokenizer implementation and a new layered normalization “Term” model (Term + TermAnalyzer), plus a language-to-normalization profile registry and the associated Unicode data/license attributions.
Changes:
- Implement UAX #29 word boundary segmentation (WordSegmenter) and a word tokenizer (WordTokenizer) with typed tokens (WordType, WordToken), including Extended_Pictographic support.
- Introduce the layered Term normalization stack (Term, TermAnalyzer, Dimension) and a language-based registry (NormalizationProfile, NormalizationProfiles).
- Add comprehensive JUnit tests (including official Unicode conformance) and update NOTICE/LICENSE/RAT exclusions for bundled Unicode data.
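For readers unfamiliar with span-based word tokenization, the idea can be sketched with the JDK's own `java.text.BreakIterator`. This is a stand-in for illustration only: it is not the PR's `WordSegmenter` or `WordTokenizer`, its rules are not guaranteed to match UAX #29 on every input, and the class name `WordSpanSketch` and the "keep segments containing a letter or digit" filter are assumptions made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordSpanSketch {

  /**
   * Returns half-open [start, end) offsets of word-like segments.
   * Uses the JDK BreakIterator as a stand-in boundary engine (assumption:
   * this is NOT the segmenter added by the PR).
   */
  public static List<int[]> wordSpans(String text) {
    BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
    it.setText(text);
    List<int[]> spans = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      // Boundaries also delimit whitespace and punctuation; keep only
      // segments that contain at least one letter or digit.
      if (containsLetterOrDigit(text, start, end)) {
        spans.add(new int[] {start, end});
      }
    }
    return spans;
  }

  private static boolean containsLetterOrDigit(String text, int start, int end) {
    for (int i = start; i < end; ) {
      int cp = text.codePointAt(i);
      if (Character.isLetterOrDigit(cp)) {
        return true;
      }
      i += Character.charCount(cp);
    }
    return false;
  }

  public static void main(String[] args) {
    String text = "Hello, world";
    for (int[] s : wordSpans(text)) {
      System.out.println("[" + s[0] + "," + s[1] + ") " + text.substring(s[0], s[1]));
    }
  }
}
```

Returning offsets rather than substrings keeps the original text addressable, which is what lets a later layer attach a type or a normalized form to each token.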
Reviewed changes
Copilot reviewed 25 out of 27 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/license/NOTICE.template | Expands Unicode data attribution text for additional bundled UCD/UTS resources. |
| rat-excludes | Excludes newly bundled Unicode data files from RAT header checks. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/TermAnalyzerTest.java | Tests for TermAnalyzer layering, ordering, lazy dimensions, and tokenization behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/NormalizationProfilesTest.java | Tests language-to-profile resolution and search analyzer behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/ConfusablesTest.java | Tests confusable skeleton folding behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordTokenizerTest.java | Tests tokenizer output, typed tokens, and max-length chopping behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordSegmenterTest.java | Tests segmentation boundaries on representative UAX #29 cases. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordBreakPropertyTest.java | Tests Word_Break property lookup behavior and edge cases. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordBoundaryConformanceTest.java | Runs the official Unicode WordBreakTest.txt conformance suite against WordSegmenter. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/ExtendedPictographicTest.java | Tests Extended_Pictographic membership checks and bounds safety. |
| opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/tokenize/uax29/ExtendedPictographic.txt | Bundled derived Unicode data for Extended_Pictographic property membership. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/TermAnalyzer.java | Implements configurable token segmentation + ordered normalization dimension pipeline. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Term.java | Represents a token with cached/lazy normalization layers. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NormalizationProfiles.java | Registry mapping language codes to normalization/stemming profiles with detection dispatch. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NormalizationProfile.java | Per-language profile record and searchAnalyzer() builder. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Dimension.java | Javadoc updates aligning Dimension docs with the new Term/TermAnalyzer model. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordType.java | Adds token categorization for downstream handling (scripts, numeric, emoji, etc.). |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordTokenizer.java | Implements UAX #29-based word tokenization with spans and optional typed streaming. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordToken.java | Typed token record (span + type) produced by WordTokenizer. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordSegmenter.java | Implements the UAX #29 word boundary algorithm with fast-path transition tables. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordBreakProperty.java | Loads and looks up Unicode Word_Break property values from bundled data. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordBreak.java | Enum for Word_Break property values + parser for property names in the data file. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/ExtendedPictographic.java | Loads Extended_Pictographic membership from bundled data for WB3c behavior. |
| NOTICE | Updates top-level NOTICE with expanded Unicode attribution details. |
| LICENSE | Updates top-level LICENSE to include Unicode License V3 applicability for added data files. |
dab5605 to 67c922a
Thx for the PR. Here are some suggestions:
81aa6c5 to 36de08f
- Term.at() ordering footgun.
- WordType.IDEOGRAPHIC javadoc overstates.
- WordTokenizer implements Tokenizer directly.
- Nit: IntList capacity doubling overflow.
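The ordering concern is easiest to see with two normalization steps that do not commute. The sketch below uses plain `UnaryOperator<String>` steps and does not call the PR's `Term` or `TermAnalyzer` API; `LOWER`, `STRIP_TRAILING_S`, and `applyInOrder` are hypothetical names chosen for this illustration.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

public class OrderingSketch {

  // Hypothetical step: locale-independent lowercasing.
  static final UnaryOperator<String> LOWER = s -> s.toLowerCase(Locale.ROOT);

  // Hypothetical step: a toy suffix stripper that only matches a lowercase 's'.
  static final UnaryOperator<String> STRIP_TRAILING_S =
      s -> s.endsWith("s") ? s.substring(0, s.length() - 1) : s;

  /** Applies {@code first}, then {@code second}, to the input. */
  static String applyInOrder(String input, UnaryOperator<String> first,
                             UnaryOperator<String> second) {
    return second.apply(first.apply(input));
  }

  public static void main(String[] args) {
    // Lowercase first, then strip: the suffix rule sees "cats".
    System.out.println(applyInOrder("CATS", LOWER, STRIP_TRAILING_S)); // cat
    // Strip first, then lowercase: the suffix rule sees "CATS" and does nothing.
    System.out.println(applyInOrder("CATS", STRIP_TRAILING_S, LOWER)); // cats
  }
}
```

When one step depends on the output shape of another, applying a step "on top of" an already normalized value can give a different result than running it at its canonical position in the pipeline.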
8c2451a to 3f06095
Status: rebased onto the updated foundation, no content change.
} catch (IOException e) {
  throw new UncheckedIOException("Unable to read Extended_Pictographic data resource", e);
}
3f06095 to 0ec5a36
Builds on the normalization foundation.
- opennlp-runtime tokenize/uax29: the UAX #29 word segmenter and Tokenizer implementation (WordSegmenter, WordTokenizer, WordType, WordBreak, boundary engine) with bundled Unicode WordBreakProperty and emoji ExtendedPictographic data, validated against the official WordBreakTest conformance suite (1944/1944).
- The layered Term model (Term, TermAnalyzer) that tokenizes then normalizes per token over the Dimension ladder, the per-language NormalizationProfile registry, and the confusable-fold coverage.
- Extends the bundled-Unicode attribution (NOTICE, NOTICE.template, LICENSE, rat-excludes) to the WordBreakProperty / ExtendedPictographic / WordBreakTest data files, and restores Dimension's javadoc cross-links now that the Term layer is present.
- WordBoundaryConformanceTest: guard the conformance resource stream with Objects.requireNonNull and a clear message instead of an opaque NPE in InputStreamReader, and remove the unused NO_BOUNDARY constant.
- NormalizationProfiles.forLanguage: fail loud on a null language argument at the public entry point, with a null-rejection test.
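The resource guard described above follows a common pattern, sketched here in isolation. `Class.getResourceAsStream` returns `null` for a missing resource, and passing that `null` to `InputStreamReader` fails with a message-less `NullPointerException`. The class name `ResourceGuardSketch`, the method `open`, and the message wording are assumptions for this example, not the test's actual code.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

public class ResourceGuardSketch {

  /**
   * Opens a classpath resource as UTF-8 text, failing with a descriptive
   * message when the resource is absent.
   */
  static BufferedReader open(String resourcePath) {
    InputStream in = ResourceGuardSketch.class.getResourceAsStream(resourcePath);
    // Fail here, naming the resource, rather than inside InputStreamReader.
    Objects.requireNonNull(in, "Missing classpath resource: " + resourcePath);
    return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    try {
      open("/no/such/resource.txt");
    } catch (NullPointerException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

The same `Objects.requireNonNull(value, message)` call works for rejecting a null argument at a public entry point.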
- Term.at: document that an unconfigured dimension is applied on top of normalized()
rather than in canonical pipeline order, with a non-commutative example.
- WordType.IDEOGRAPHIC: soften javadoc ('a token containing a Han ideograph', not 'a
single Han ideograph').
- WordTokenizer: note the deliberate choice to implement Tokenizer directly instead of
extending AbstractTokenizer.
- WordSegmenter.IntList: overflow-aware 1.5x growth instead of length*2.
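The IntList growth change can be illustrated with a standalone sketch. Doubling an `int` capacity wraps to a negative number once the capacity passes about 2^30, whereas computing 1.5x in `long` and clamping avoids that. The names `GrowthSketch`, `grownCapacity`, and `MAX_CAPACITY`, and the specific cap value, are assumptions for this example rather than the PR's actual code.

```java
public class GrowthSketch {

  // Assumed cap: slightly below Integer.MAX_VALUE, a common conservative
  // limit for array sizes on the JVM.
  static final int MAX_CAPACITY = Integer.MAX_VALUE - 8;

  /**
   * Returns a new capacity of roughly 1.5x the old one, at least
   * {@code minCapacity}, never exceeding {@code MAX_CAPACITY}.
   */
  static int grownCapacity(int oldCapacity, int minCapacity) {
    if (minCapacity < 0 || minCapacity > MAX_CAPACITY) {
      throw new IllegalStateException("Required capacity too large: " + minCapacity);
    }
    // Do the arithmetic in long so the intermediate value cannot wrap.
    long candidate = (long) oldCapacity + (oldCapacity >> 1);
    if (candidate < minCapacity) {
      candidate = minCapacity;
    }
    if (candidate > MAX_CAPACITY) {
      candidate = MAX_CAPACITY;
    }
    return (int) candidate;
  }

  public static void main(String[] args) {
    System.out.println(grownCapacity(16, 17));                       // 24
    System.out.println(1_500_000_000 * 2);                           // wraps negative
    System.out.println(grownCapacity(1_500_000_000, 1_500_000_001)); // clamped to the cap
  }
}
```

The 1.5x factor trades a few more reallocations for less wasted space; the overflow safety comes from the `long` arithmetic and the clamp, not from the factor itself.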
…moji WordType classifies every Extended_Pictographic code point as EMOJI, which includes symbol-like characters (copyright, trademark, double-exclamation, arrows), so the word tokenizer keeps them rather than dropping them as punctuation. State this in the WordTokenizer javadoc and add a test.
090593f to 9f2622e
0ec5a36 to b150056
NormalizationProfiles.detect now rejects a null text or detector with a clear NullPointerException instead of failing deeper inside language detection. The TermAnalyzer caseFold(Locale) builder step rejects a null locale up front. ExtendedPictographic names the missing resource in its read-failure message, matching WordBreakProperty.
Part 2/4 of OPENNLP-1850. Stacked on the foundation branch (base is OPENNLP-1850-1-foundation, so the diff is only this slice).
UAX #29 word segmenter and Tokenizer impl with bundled WordBreakProperty/ExtendedPictographic data (conformance 1944/1944), the layered Term model (Term, TermAnalyzer), the NormalizationProfile registry, and the WordBreak data's License V3 attribution.