Vocabulary Pruning Engine for Multilingual GLiNER Models #366

Open
ALI-AL-MARJANI wants to merge 12 commits into urchade:main from ALI-AL-MARJANI:feature/vocab-pruning-engine

@ALI-AL-MARJANI

Problem

Multilingual GLiNER models (mDeBERTa-v3) carry a 250k-token embedding matrix.
For single-language deployments, >60% of these embeddings are never accessed.
This creates an unnecessary memory and cold-start bottleneck for edge/CPU deployments.

Solution

scripts/prune_gliner_vocab.py tokenises a target-language corpus, identifies the
active token intersection, slices word_embeddings.weight, rebuilds tokenizer.json,
and saves a fully self-contained pruned model loadable via GLiNER.from_pretrained().
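The pipeline described above can be illustrated with a minimal, library-free sketch. It is not the code in scripts/prune_gliner_vocab.py: `prune_vocab`, `toy_tokenize`, and the six-token vocabulary are hypothetical stand-ins, and the embedding matrix is a plain list of rows rather than a tensor. The idea it shows is the same, though: collect the token IDs the corpus actually uses, keep only those rows, and record an old-ID to new-ID map for the rebuilt tokenizer.

```python
from collections.abc import Callable, Iterable


def prune_vocab(
    corpus: Iterable[str],
    tokenize: Callable[[str], list[int]],
    embeddings: list[list[float]],
    special_ids: Iterable[int] = (),
) -> tuple[list[list[float]], dict[int, int]]:
    """Keep only the embedding rows whose token IDs occur in the corpus.

    Returns the pruned matrix and an old-ID -> new-ID map that a rebuilt
    tokenizer would need in order to emit the new, compact IDs.
    """
    # Special tokens are always kept, even if the corpus never contains them.
    active = set(special_ids)
    for text in corpus:
        active.update(tokenize(text))

    # Sorting preserves the original relative order of the surviving tokens.
    kept = sorted(active)
    old_to_new = {old: new for new, old in enumerate(kept)}
    pruned = [embeddings[old] for old in kept]
    return pruned, old_to_new


# Toy data (hypothetical): a 6-token vocabulary with 2-dimensional embeddings.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "bonjour": 3, "world": 4, "monde": 5}
toy_embeddings = [[float(i), float(i) / 10] for i in range(len(toy_vocab))]


def toy_tokenize(text: str) -> list[int]:
    # Whitespace tokenizer standing in for the real subword tokenizer.
    return [toy_vocab.get(word, toy_vocab["[UNK]"]) for word in text.split()]


pruned, old_to_new = prune_vocab(
    ["hello world", "world hello"], toy_tokenize, toy_embeddings, special_ids=(0, 1)
)
```

In this toy run the French tokens are dropped and "world" moves from ID 4 to ID 3, which is why the tokenizer has to be rebuilt together with the embedding matrix.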

Benchmark — urchade/gliner_multi-v2.1 (English Wikipedia, 100k articles)

| Metric | Original | Pruned | Change |
|---|---|---|---|
| Vocabulary | 250,105 | 90,840 | −63.7% |
| Model size | 1,155.8 MB | 666.5 MB | −42.3% (−489 MB) |
| Entity correctness | | 6/6 PASS ✓ | Lossless |
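The percentages in the benchmark can be re-derived from the raw figures, and a rough estimate suggests why a 63.7% vocabulary cut yields a 42.3% size cut. The sketch below assumes a hidden size of 768, 4-byte (fp32) weights, and 1 MB = 10^6 bytes; none of these is stated in the PR, so the last figure is an estimate rather than a measurement.

```python
# Figures taken from the benchmark above.
vocab_before, vocab_after = 250_105, 90_840
size_before_mb, size_after_mb = 1155.8, 666.5

# Assumptions, not stated in the PR: hidden size 768, fp32 weights,
# and 1 MB = 10**6 bytes.
hidden_size = 768
bytes_per_weight = 4

removed_tokens = vocab_before - vocab_after
vocab_reduction_pct = 100 * removed_tokens / vocab_before
size_reduction_pct = 100 * (size_before_mb - size_after_mb) / size_before_mb

# Estimated size of the embedding rows that pruning removes.
embedding_savings_mb = removed_tokens * hidden_size * bytes_per_weight / 1e6

print(f"vocabulary reduction: {vocab_reduction_pct:.1f}%")
print(f"size reduction:       {size_reduction_pct:.1f}%")
print(f"embedding rows saved: {embedding_savings_mb:.1f} MB")
```

Under those assumptions the removed rows come to roughly 489 MB, which is consistent with the reported saving coming almost entirely from the embedding matrix while the rest of the model is unchanged.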

Files changed

  • scripts/prune_gliner_vocab.py — pruning engine (new)
  • scripts/validate_pruned_model.py — 3-tier correctness validator (new)
  • docs/vocab_pruning.md — documentation with benchmarks (new)
  • docs/index.md — added to toctree
  • gliner/modeling/encoder.py — bugfix: token_lengths kwarg leaked into BiEncoder
    labels encoder forward pass
  • gliner/model.py — bugfix: bare import onnxruntime replaced with try/except
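For the gliner/model.py fix, a guarded import generally looks like the sketch below. This is a generic pattern, not the code in the PR: `optional_import` and `require_onnxruntime` are hypothetical helper names, and the actual change may simply wrap the import statement inline.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the named module, or None when it cannot be imported."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Stays None on platforms where onnxruntime is not installed.
ort = optional_import("onnxruntime")


def require_onnxruntime() -> ModuleType:
    """Raise a clear error only when an ONNX code path is actually used."""
    if ort is None:
        raise ImportError("onnxruntime is required for ONNX inference but is not installed.")
    return ort
```

Deferring the failure to `require_onnxruntime()` keeps the package importable for users who never touch the ONNX path.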

Usage

python scripts/prune_gliner_vocab.py \
    --model_id urchade/gliner_multi-v2.1 \
    --dataset_for_vocab wikipedia \
    --output_dir ./pruned_en \
    --lang en
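Once the pruned model is written, its predictions can be compared against the original. The commit message names three validator outcomes (PASS, SCORE_DRIFT, ENTITY_FAIL) and a --score_tol flag; the sketch below is one plausible reading of those tiers, not the logic of scripts/validate_pruned_model.py. The `Entity` type, the `classify` function, and the default tolerance are assumptions made for illustration.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int
    end: int
    label: str
    score: float


def _span_key(entity: Entity) -> tuple[int, int, str]:
    # Two entities count as "the same" when span and label both match.
    return (entity.start, entity.end, entity.label)


def classify(original: list[Entity], pruned: list[Entity], score_tol: float = 1e-3) -> str:
    """Return PASS, SCORE_DRIFT, or ENTITY_FAIL for one test sentence.

    Assumed semantics: differing spans or labels fail outright; identical
    entities whose scores differ by more than score_tol count as drift.
    """
    orig_scores = {_span_key(e): e.score for e in original}
    pruned_scores = {_span_key(e): e.score for e in pruned}

    if orig_scores.keys() != pruned_scores.keys():
        return "ENTITY_FAIL"

    max_drift = max(
        (abs(orig_scores[k] - pruned_scores[k]) for k in orig_scores),
        default=0.0,
    )
    return "PASS" if max_drift <= score_tol else "SCORE_DRIFT"
```

Separating drift from failure lets a small numerical difference in confidence be reported without being treated as a wrong prediction.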

Ali322O added 11 commits May 18, 2026 15:47
  - scripts/prune_gliner_vocab.py: prune multilingual GLiNER vocab to target language
  - scripts/validate_pruned_model.py: 3-tier validator (PASS/SCORE_DRIFT/ENTITY_FAIL),
  --score_tol flag
  - docs/vocab_pruning.md: full documentation page with benchmark results
  - docs/index.md: add vocab_pruning to toctree
  - gliner/modeling/encoder.py: fix token_lengths kwarg leak in BiEncoder.encode_labels
  - gliner/model.py: wrap onnxruntime import in try/except for ARM compatibility

  Benchmarked on urchade/gliner_multi-v2.1 (mDeBERTa-v3, 250k vocab):
    English Wikipedia corpus -> 250,105 -> 90,840 tokens (63.7% reduction)
    Model size: 1155.8 -> 666.5 MB (42.3% smaller), ALL PASS entity correctness
@Ingvarstep

Collaborator

@ALI-AL-MARJANI, thanks for your contribution. Please fix the ruff errors first so we can move forward.

Fix 25 ruff errors blocking the lint job requested by maintainer:
- Sort import blocks (I001) and __all__ (RUF022) across 9 files
- Merge duplicate get_train_dataloader (F811) — curriculum path integrated
  into the single authoritative implementation
- Add missing docstring args (D417) in config.py, long_doc.py, model.py
- Remove unused imports (F401): math, torch, Dict, numpy
- Fix for-loop variable overwrite (PLW2901) in descriptions.py
- Remove unused neg_mask variable (F841) in loss_functions.py
- Fix EN dash in string literal (RUF001) in trainer.py
- Fix docstring summary blank line (D205) in hard_negatives.py
- Break line > 120 chars (E501) in model.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ALI-AL-MARJANI

Author

CI should pass now, I guess.

@Ingvarstep

Collaborator

@ALI-AL-MARJANI, please merge from main; there are conflicts that need to be resolved.
