Model-Based HTML Extraction #1723

@arhamm1

What:
Add an HTML extraction stage that uses a small language model (0.6B parameters) to classify HTML elements via sequence labeling, distinguishing main content, boilerplate, navigation, ads, and structured elements (code blocks, tables, formulas), and then converts the classified structure to clean Markdown.

Why:
The AICC paper (Nov 2025, Shanghai AI Lab) shows the effect directly: a 3B model trained on MinerU-HTML-extracted data achieves 50.82% average accuracy across 13 benchmarks, versus 49.74% when trained on Trafilatura-extracted data. That is a 1.08pp improvement from extraction quality alone, with no other change. On extraction quality itself, MinerU-HTML achieves 81.82% ROUGE-N vs Trafilatura's 63.58%, and it preserves structured elements well (code: 90.93%, formulas: 93.99%, tables preserved). NeMo Curator currently supports Trafilatura, Resiliparse, and JusText, all of which are heuristic-based.
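For readers unfamiliar with the metric, here is a minimal sketch of ROUGE-N as n-gram overlap between extracted text and a reference. It assumes whitespace tokenization and reports precision, recall, and F1. The paper's exact tokenization, choice of N, and aggregation are not stated in this issue, so the function name `rouge_n` and its defaults are illustrative assumptions only.

```python
from collections import Counter
from typing import Tuple


def ngram_counts(tokens: list, n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 2) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) of n-gram overlap.

    Assumption: whitespace tokenization; a real benchmark may normalize
    text or tokenize differently.
    """
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    # Counter intersection keeps the minimum count per shared n-gram (clipped overlap).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A benchmark run would average this score over held-out pages for each extractor and compare the means.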

Definition of Done:

  • ModelBasedHTMLExtractionStage under nemo_curator/stages/text/download/html_extractors/
  • Loads a 0.6B sequence labeling model (MinerU-HTML or equivalent) to classify HTML elements
  • Two-stage pipeline: element classification → Markdown conversion with semantic element handling
  • Preserves code blocks (fenced markdown), math formulas (LaTeX), and table structures
  • Falls back to Trafilatura for pages where model confidence is below the configurable fallback threshold
  • GPU-batched inference; processes 10K pages/hour on a single A100
  • Configurable: output format (markdown vs plain text), fallback threshold, GPU/CPU mode
  • Benchmark: ROUGE-N on held-out web pages vs Trafilatura baseline (documented in README)
  • Test: verify code blocks, math formulas, and tables are correctly preserved end-to-end
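The control flow described in the list above (classify elements, fall back on low confidence, convert the kept elements to Markdown) could be sketched roughly as follows. Everything here is hypothetical: `Element`, `extract_page`, the label names, and the use of mean per-element confidence as the page-level score are assumptions for illustration, not NeMo Curator or MinerU-HTML APIs. The classifier and the fallback extractor are passed in as plain callables so the sketch runs without a model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

FENCE = "`" * 3  # Markdown code fence, built indirectly so this example nests cleanly


@dataclass
class Element:
    """One parsed HTML element (hypothetical container)."""
    text: str = ""
    rows: List[List[str]] = field(default_factory=list)  # table cells, header row first


# One (label, confidence) pair per element. Assumed labels:
# "content", "code", "formula", "table", "boilerplate".
Prediction = Tuple[str, float]
Classifier = Callable[[Sequence[Element]], List[Prediction]]


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a pipe table, treating the first row as the header."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def element_to_markdown(element: Element, label: str) -> str:
    """Convert one kept element according to its predicted label."""
    if label == "code":
        return FENCE + "\n" + element.text.strip("\n") + "\n" + FENCE
    if label == "formula":
        return "$$" + element.text.strip() + "$$"  # text assumed to hold LaTeX source
    if label == "table" and element.rows:
        return table_to_markdown(element.rows)
    return element.text.strip()


def extract_page(raw_html: str, elements: Sequence[Element], classify: Classifier,
                 fallback: Callable[[str], str], threshold: float = 0.8) -> str:
    """Return Markdown for a page, or the fallback output if confidence is low."""
    predictions = classify(elements)
    if not predictions or len(predictions) != len(elements):
        return fallback(raw_html)
    # Assumption: page confidence is the mean of per-element confidences.
    page_confidence = sum(conf for _, conf in predictions) / len(predictions)
    if page_confidence < threshold:
        return fallback(raw_html)
    blocks = [element_to_markdown(el, label)
              for el, (label, _) in zip(elements, predictions)
              if label != "boilerplate"]
    return "\n\n".join(block for block in blocks if block)
```

In a real stage, `classify` would wrap batched GPU inference and `fallback` could wrap Trafilatura (for example its `extract` function). Keeping both as injected callables also makes the end-to-end preservation test easy to write with a stub classifier.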

Metadata

Labels: enhancement (New feature or request)
