Model-Based HTML Extraction

What:
Add an HTML extraction stage that uses a small language model (0.6B parameters) to classify HTML elements via sequence labeling. The model distinguishes main content, boilerplate, navigation, ads, and structured elements (code blocks, tables, formulas); the stage then converts the classified structure to clean Markdown.
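To make the second half of this concrete, here is a minimal sketch of turning already-classified elements into Markdown. Everything in it is an assumption for illustration: the `LabeledElement` type, the label names (`main`, `code`, `formula`, `table`), and the helper functions are hypothetical and do not reflect the MinerU-HTML output format or any existing NeMo Curator API.

```python
from dataclasses import dataclass, field

FENCE = "`" * 3  # Markdown code fence, built this way to keep the sketch readable


@dataclass
class LabeledElement:
    """One HTML element plus the label a classifier assigned (hypothetical schema)."""
    label: str                      # e.g. "main", "nav", "ad", "code", "formula", "table"
    text: str = ""                  # text content for non-table elements
    rows: list[list[str]] = field(default_factory=list)  # cell text, tables only


def table_to_markdown(rows: list[list[str]]) -> str:
    """Render rows as a pipe table, treating the first row as the header."""
    if not rows:
        return ""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def to_markdown(elements: list[LabeledElement]) -> str:
    """Keep content-bearing elements, drop everything else, join as Markdown blocks."""
    blocks = []
    for el in elements:
        if el.label == "main":
            blocks.append(el.text.strip())
        elif el.label == "code":
            blocks.append(FENCE + "\n" + el.text.rstrip("\n") + "\n" + FENCE)
        elif el.label == "formula":
            blocks.append("$$\n" + el.text.strip() + "\n$$")
        elif el.label == "table":
            blocks.append(table_to_markdown(el.rows))
        # Any other label (boilerplate, nav, ads, ...) is dropped.
    return "\n\n".join(b for b in blocks if b)
```

Keeping the conversion as a pure function over labeled elements would let it be unit-tested without loading the model.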

Why:
The AICC paper (Nov 2025, Shanghai AI Lab) shows the effect directly: a 3B model trained on MinerU-HTML-extracted data achieves 50.82% average accuracy across 13 benchmarks, versus 49.74% when trained on Trafilatura-extracted data. That is a 1.08pp improvement from extraction quality alone, with no other change. MinerU-HTML achieves 81.82% ROUGE-N vs Trafilatura's 63.58%, and it preserves structured elements well (code: 90.93%, formulas: 93.99%, tables preserved). NeMo Curator currently supports Trafilatura, Resiliparse, and JusText, all of which are heuristic-based.
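For reference, ROUGE-N scores an extraction by n-gram overlap with a ground-truth text. The sketch below is a generic implementation under stated assumptions: whitespace tokenization, bigrams by default, and reporting precision, recall, and F1. The exact variant used in the paper (n, tokenizer, which of the three scores) is not specified here, so this is not a reproduction of its metric.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 2) -> dict[str, float]:
    """ROUGE-N precision, recall, and F1 using whitespace tokens (an assumption)."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped overlap).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

A benchmark harness could average `rouge_n(extracted, gold)["f1"]` over held-out pages for each extractor.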

Definition of Done:
  - ModelBasedHTMLExtractionStage under nemo_curator/stages/text/download/html_extractors/
  - Loads a 0.6B sequence labeling model (MinerU-HTML or equivalent) to classify HTML elements
  - Two-stage pipeline: element classification → Markdown conversion with semantic element handling
  - Preserves code blocks (fenced markdown), math formulas (LaTeX), and table structures
  - Falls back to Trafilatura for pages where model confidence is low
  - GPU-batched inference; target throughput of 10K pages/hour on a single A100
  - Configurable: output format (markdown vs plain text), fallback threshold, GPU/CPU mode
  - Benchmark: ROUGE-N on held-out web pages vs Trafilatura baseline (documented in README)
  - Test: verify code blocks, math formulas, and tables are correctly preserved end-to-end
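The configuration and fallback items above could fit together roughly as follows. This is only a sketch: `ExtractionConfig`, its field names, the default threshold of 0.5, and the choice of mean per-element confidence as the page-level score are all hypothetical, not an existing NeMo Curator interface. The model and fallback extractors are passed in as plain callables so the routing logic stays independent of any specific library.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ExtractionConfig:
    """Hypothetical knobs mirroring the 'Configurable' item in the list above."""
    output_format: str = "markdown"   # "markdown" or "plain_text"
    fallback_threshold: float = 0.5   # assumed default; pages scoring below this fall back
    device: str = "cuda"              # "cuda" or "cpu"

    def __post_init__(self) -> None:
        if self.output_format not in ("markdown", "plain_text"):
            raise ValueError(f"unknown output_format: {self.output_format!r}")
        if not 0.0 <= self.fallback_threshold <= 1.0:
            raise ValueError("fallback_threshold must be in [0, 1]")
        if self.device not in ("cuda", "cpu"):
            raise ValueError(f"unknown device: {self.device!r}")


def page_confidence(element_confidences: Sequence[float]) -> float:
    """Mean per-element confidence; an empty page scores 0.0 so it falls back."""
    if not element_confidences:
        return 0.0
    return sum(element_confidences) / len(element_confidences)


def extract_with_fallback(
    html: str,
    config: ExtractionConfig,
    model_extract: Callable[[str], tuple[str, Sequence[float]]],
    fallback_extract: Callable[[str], str],
) -> tuple[str, str]:
    """Return (text, source), where source is 'model' or 'fallback'."""
    text, confidences = model_extract(html)
    if page_confidence(confidences) < config.fallback_threshold:
        return fallback_extract(html), "fallback"
    return text, "model"
```

In practice `fallback_extract` could wrap Trafilatura's extraction call, and the returned source tag could be logged to track how often the fallback path is taken.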

Model-Based HTML Extraction #1723
