Skip to content

feat(trie): TrieCharLegacy, expanded dict-op benchmarks, and real-text e2e benchmarks#105

Open
Copilot wants to merge 16 commits intomainfrom
copilot/optimize-triechar-memory
Open

feat(trie): TrieCharLegacy, expanded dict-op benchmarks, and real-text e2e benchmarks#105
Copilot wants to merge 16 commits intomainfrom
copilot/optimize-triechar-memory

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 4, 2026

Preserves the original HashSet-backed trie as TrieCharLegacy while promoting the leaner trie (no HashSet) as the default TrieChar. Adds a comprehensive benchmark suite covering all three DictBackend implementations across real Thai text corpora.

Backend changes

  • TrieCharLegacy — original TrieChar (trie + parallel HashSet<String>) renamed and kept as a named alternative. O(1) contain() vs O(k) trie walk in new TrieChar; ~12% higher memory (~43 MB vs ~38 MB for 62k words).
  • All three backends (TrieChar, TrieCharLegacy, FstDict) implement DictBackend and are interchangeable:
let tok: NewmmTokenizer<TrieChar>       = NewmmTokenizer::new("dict.txt")?;
let tok: NewmmTokenizer<TrieCharLegacy> = NewmmTokenizer::<TrieCharLegacy>::from_word_list(words);
let tok: NewmmFstTokenizer              = NewmmFstTokenizer::new("dict.txt")?;

New benchmarks

dict_operations group

contain / add / remove for all 3 backends × 4 dictionaries (500-short, 500-long, 10k, words_th).

e2e_tokenization group (new — uses real text files from tests/data/)

Scope Backends Dictionaries Texts
Small texts TrieChar, TrieCharLegacy, FstDict all 4 wikipedia-s, wikipedia-m
Large texts TrieChar, TrieCharLegacy 10k, words_th wikipedia-l, wisesight-sentiment
Large texts FstDict words_th only wikipedia-l, wisesight-sentiment
Parallel all 3 words_th wisesight-sentiment

FstDict with small vocabularies (≤10k words) on large texts is excluded — a low-match-rate vocabulary causes O(n·k·B) fallback processing where FST byte overhead dominates, yielding hundreds of seconds per iteration.

Key findings (measured)

TrieChar TrieCharLegacy FstDict
Memory (62k words) ~38 MB ~43 MB ~0.87 MB
prefix_ref (hot path) identical identical 29–54× slower
contain() O(k) trie O(1) hash O(1) hash
End-to-end throughput identical identical 9–15× slower

contain() is not on the tokenization hot path; TrieChar and TrieCharLegacy have statistically identical end-to-end throughput.

Docs updated

BENCHMARK_RESULTS.md, docs/implementation.md, docs/design.md, CHANGELOG.md — backend comparison tables, complexity analysis, when-to-use guidance.

Copilot AI and others added 8 commits April 3, 2026 03:14
Sync all changes from main (PR #102 simplify-api and subsequent commits):
- New parallel API: segment_with_options(), segment_parallel(), ParallelOptions
- Simplified Tokenizer trait (segment only)
- tests/data/ benchmark dictionaries (500-short, 500-long, 10k, words_th)
- docs/ directory (design.md, implementation.md, BENCHMARK_RESULTS.md)
- nlpo3-js renamed from nlpo3-nodejs; Node.js/Python binding updates
- CLI dict moved to nlpo3-cli/

Branch-specific changes preserved:
- TrieChar memory optimization (HashSet removed, word_count counter used)
- Updated BENCHMARK_RESULTS with TrieChar optimization numbers

CHANGELOG.md conflict resolved: keep main concise summary + add
TrieChar optimization bullet.

Co-authored-by: bact <128572+bact@users.noreply.github.com>
…char-memory

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copilot AI assigned Copilot and bact Apr 4, 2026
@bact bact added enhancement New feature or request refactoring Refactoring, reformatting, with no new functionality. nlpo3-python and removed enhancement New feature or request nlpo3-python labels Apr 4, 2026
@bact
Copy link
Copy Markdown
Member

bact commented Apr 7, 2026

@copilot fix this error.

error[E0034]: multiple applicable items in scope
Error:    --> src/main.rs:101:65
    |
101 |                 None => TokenizerWrapper::Newmm(NewmmTokenizer::from_word_list(
    |                                                                 ^^^^^^^^^^^^^^ multiple `from_word_list` found
    |
    = note: candidate #1 is defined in an impl for the type `NewmmTokenizer`
    = note: candidate #2 is defined in an impl for the type `NewmmTokenizer<TrieCharLegacy>`

For more information about this error, try `rustc --explain E0034`.
error: could not compile `nlpo3-cli` (bin "nlpo3" test) due to 1 previous error

see full log at https://github.com/PyThaiNLP/nlpo3/actions/runs/23973204520/job/70328448425?pr=105

Copy link
Copy Markdown
Contributor Author

Copilot AI commented Apr 7, 2026

@copilot fix this error.

...

Fixed in f03f70b. The ambiguity arose because adding TrieCharLegacy introduced a second from_word_list impl. I disambiguated the CLI call site with an explicit type parameter: NewmmTokenizer::<TrieChar>::from_word_list(...) and added the necessary use nlpo3::tokenizer::trie_char::TrieChar import.

Copilot AI requested a review from bact April 7, 2026 21:46
Copilot finished work on behalf of bact April 7, 2026 21:46
Copilot finished work on behalf of bact April 7, 2026 21:50
@bact bact marked this pull request as ready for review April 7, 2026 22:04
bact added 5 commits April 11, 2026 01:12
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
bact added 2 commits April 12, 2026 08:12
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

refactoring Refactoring, reformatting, with no new functionality.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants