What:
Add an HTML extraction stage that uses a small language model (0.6B param) to classify HTML elements via sequence labeling — distinguishing main content, boilerplate, navigation, ads, and structured elements (code blocks, tables, formulas) — and converts the classified structure to clean Markdown.
Why:
The AICC paper (Nov 2025, Shanghai AI Lab) shows the effect directly: a 3B model trained on MinerU-HTML-extracted data achieves 50.82% average accuracy across 13 benchmarks, vs 49.74% when trained on Trafilatura-extracted data. That is a 1.08pp improvement from extraction quality alone, with no other change. MinerU-HTML achieves 81.82% ROUGE-N vs Trafilatura's 63.58% on structured elements (code: 90.93%, formulas: 93.99%, tables preserved). NeMo currently supports Trafilatura, Resiliparse, and JusText, all of which are heuristic-based.
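For readers unfamiliar with the metric, ROUGE-N scores like those quoted above measure n-gram overlap between extracted text and a reference extraction. The following is a minimal sketch, assuming whitespace tokenization and reporting precision, recall, and F1; the exact variant and tokenizer used in the paper are not stated here, and the function names `ngrams` and `rouge_n` are illustrative only.

```python
from collections import Counter
from typing import Dict


def ngrams(tokens: list, n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1) -> Dict[str, float]:
    """N-gram overlap between candidate and reference text.

    Assumption: whitespace tokenization; a real benchmark may tokenize differently.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

An extractor that drops part of the main content loses recall, while one that keeps boilerplate loses precision, so F1 penalizes both failure modes.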
Definition of Done:
- ModelBasedHTMLExtractionStage under nemo_curator/stages/text/download/html_extractors/
- Loads a 0.6B sequence labeling model (MinerU-HTML or equivalent) to classify HTML elements
- Two-stage pipeline: element classification → Markdown conversion with semantic element handling
- Preserves code blocks (fenced markdown), math formulas (LaTeX), and table structures
- Falls back to Trafilatura for pages where model confidence is low
- GPU-batched inference; processes 10K pages/hour on a single A100
- Configurable: output format (markdown vs plain text), fallback threshold, GPU/CPU mode
- Benchmark: ROUGE-N on held-out web pages vs Trafilatura baseline (documented in README)
- Test: verify code blocks, math formulas, and tables are correctly preserved end-to-end
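The control flow described in the list above (classify elements, fall back when confidence is low, otherwise render Markdown) could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed stage: `ExtractionConfig`, `render`, `extract`, the label names, and the use of mean confidence as the fallback signal are all hypothetical, and the classifier and fallback extractor are passed in as plain callables rather than tied to any real model or library API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical label set; the real model's label scheme may differ.
KEEP_LABELS = {"main", "code", "table", "formula"}
FENCE = "`" * 3  # Markdown code fence, built this way to keep the sketch readable

# One classified element: (label, confidence, text)
Element = Tuple[str, float, str]


@dataclass
class ExtractionConfig:
    output_format: str = "markdown"  # "markdown" or "plain"
    fallback_threshold: float = 0.5  # mean confidence below this triggers fallback
    device: str = "cuda"             # "cuda" or "cpu"; unused in this sketch


def render(elements: List[Element], output_format: str) -> str:
    """Stage 2: turn classified elements into Markdown or plain text."""
    parts = []
    for label, _conf, text in elements:
        if label not in KEEP_LABELS:
            continue  # drop boilerplate, navigation, ads
        if output_format == "markdown" and label == "code":
            parts.append(f"{FENCE}\n{text}\n{FENCE}")
        elif output_format == "markdown" and label == "formula":
            parts.append(f"$${text}$$")  # assumes text is already LaTeX
        else:
            parts.append(text)  # main content and tables pass through unchanged
    return "\n\n".join(parts)


def extract(
    html: str,
    classify: Callable[[str], List[Element]],
    fallback: Callable[[str], str],
    config: ExtractionConfig,
) -> str:
    """Stage 1 (classification) plus the low-confidence fallback decision."""
    elements = classify(html)
    if not elements:
        return fallback(html)
    # Assumption: page-level confidence is the mean of per-element confidences.
    mean_conf = sum(conf for _, conf, _ in elements) / len(elements)
    if mean_conf < config.fallback_threshold:
        return fallback(html)
    return render(elements, config.output_format)
```

Passing the classifier and fallback as callables keeps the decision logic testable without a GPU or a model checkpoint; in the actual stage, `fallback` would wrap the existing Trafilatura extractor and `classify` would wrap batched model inference.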