
Nemotron OCR SDG Pipeline #1899

Open
suiyoubi wants to merge 43 commits into main from aot/omni_sdg

Conversation

Contributor

@suiyoubi suiyoubi commented Apr 30, 2026

Description

Adds the Nemotron OCR SDG pipeline — a multimodal synthetic data generation pipeline that converts images into structured OCR + QA conversation data for vision-language model training.

Pipeline stages

| Stage | Model | Output |
| --- | --- | --- |
| `NemotronOCRV2Stage` | NemotronOCR-v2 | Dense word-level OCR with bounding boxes |
| `OCRScoringQAStage` | Gemini 3 Pro (NVIDIA Inference API) | Scoring, validation, and missing-region detection |
| `OCRConversationalizeStage` | | 11 output format variants → `ConversationSample` |
| `OCRDenseQAStage` | | 6 QA types (bbox↔text, point↔text, dense dump) |
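For a rough mental model of how these stages chain, here is a minimal, self-contained sketch. The stage names come from this PR, but the `Task`/`process()` interface and the stubbed outputs are illustrative assumptions, not NeMo Curator's actual stage API.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    # Stand-in for a pipeline task; the real task classes carry image data.
    data: dict = field(default_factory=dict)

class Stage:
    def process(self, task: Task) -> Task:
        raise NotImplementedError

class NemotronOCRV2Stage(Stage):
    def process(self, task: Task) -> Task:
        # Dense word-level OCR with bounding boxes (stubbed result).
        task.data["words"] = [{"text": "hello", "bbox": [0, 0, 10, 5]}]
        return task

class OCRScoringQAStage(Stage):
    def process(self, task: Task) -> Task:
        # Scoring / validation via a VLM judge (stubbed result).
        task.data["score"] = 1.0
        return task

def run_pipeline(stages: list[Stage], task: Task) -> Task:
    # Each stage consumes the previous stage's output.
    for stage in stages:
        task = stage.process(task)
    return task

result = run_pipeline([NemotronOCRV2Stage(), OCRScoringQAStage()], Task())
print(result.data)
```

The conversationalize and dense-QA stages would slot in after scoring in the same fashion, turning the scored OCR output into `ConversationSample` records.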

Key components

  • `nemo_curator/models/omni/` — `NVInferenceModel` base class for NVIDIA Inference API-backed VLMs; `Gemini3Pro` concrete implementation
  • `nemo_curator/models/client/nvinference_client.py` — thin streaming client helpers (`get_nvinference_api_key`, `create_openai_client`, `stream_chat_completion_text`)
  • `nemo_curator/stages/synthetic/omni/base.py` — `VLMProcessingStage` and `ModelProcessingStage` base classes with batched inference, per-prompt error isolation, and setup/teardown lifecycle
  • `nemo_curator/stages/synthetic/omni/io.py` — `HFDatasetImageReader`, `TarImageReader`, `ParquetReader`, `SkipProcessedStage`, `ResultWriterStage`
  • `nemo_curator/tasks/ocr.py` — `OCRDenseWord` and `OCRData` task data classes
  • `docker/Dockerfile` — installs `nemotron-ocr-v2` from source (no-build-isolation, CUDA arch list for A100/A10/RTX Ada/H100)
  • `tutorials/synthetic/omni/hf_ocr_pipeline.py` — end-to-end example using HuggingFace datasets

Tests

63 new unit tests across `tests/tasks/`, `tests/models/`, and `tests/stages/synthetic/omni/` — all CPU-only, no GPU required.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

suiyoubi and others added 30 commits March 4, 2026 10:40
Signed-off-by: Ao Tang <aot@nvidia.com>
- Updated the `transformers` dependency from `<=4.55.2` to `==4.57.0` in `pyproject.toml` and `uv.lock` to ensure compatibility with the Cosmos Embed imports.
- Added a new `Gemini3Pro` model class in `gemini.py` utilizing the NVIDIA Inference API.
- Introduced `DescriptionOutputStage` and `DescriptionValidatorStage` for processing and validating image descriptions, respectively.
- Enhanced `VLMProcessingStage` to improve GPU resource handling and added a `num_workers` parameter to `DescriptionStage` for better scalability.

This commit enhances the model's capabilities and ensures that dependencies are up-to-date for optimal performance.

Removes description-specific stages (description*.py, description
pipeline tutorials) that belong on aot/omni_description.
Adds OCR result inspection/review scripts and shared design docs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… pipeline tutorial

…Data and OCRDenseWord classes

--metrics-dir wires Ray metrics into the running Prometheus/Grafana instance.
--run-name sets SLURM_JOB_NAME so Xenna labels the run on the
ray_pipeline_input_tasks metric for human-readable identification in Grafana.

… timing

- OCRConversationData.to_dict(): call conversation.to_dict() explicitly
  instead of relying on dataclasses.asdict(), which bypasses the custom
  ConversationSample serialization and drops the "t" media-type field from
  image fragments.
- RayClient: move Prometheus service-discovery registration to after Ray
  is started and responsive; add _wait_for_ray_service_discovery_file()
  so the SD file exists before Prometheus is told to watch it.

Remove internal-only I/O stages from io.py (InputFormat, ImageReaderStage,
ImageFolderReaderStage, TarImageReaderStage, JsonlTarImageReaderStage,
OcrJsonlReaderStage, JsonlPipelineOutputReaderStage, TarImageReader,
ParquetImageReaderStage, ParquetImageReader).  These classes are only
used by the internal ocr_pipeline.py and will live on aot/omni_sdg_internal.

Public io.py now exports: HFDatasetImageReaderStage, SkipProcessedStage,
ResultWriterStage, merge_output_shards, ImageWriterStage, and the
FileReader helpers (load_image_from_task, TarFileReader, etc.).

Add tests/stages/test_hf_dataset_image_reader.py covering HFDatasetImageReaderStage.

ImageWriterStage is not used by hf_ocr_pipeline.py and only makes sense
with the internal JSONL+tar pipeline; moved to aot/omni_sdg_internal.

TarFileReader, ParquetFileReader, _file_readers dispatcher, and the
deprecated _parse_tar_slice_path wrapper are removed.  In the public HF
pipeline images are always regular JPEG files on disk, so load_image_from_task
is simplified to a single RegularFileReader call.

io.py: 870 → 631 lines.

…e_url

SUPPORTED_IMAGE_EXTENSIONS was only used by the internal reader classes
removed in the previous commit.  FileReader.read_image_url() was never
called in this branch — drop it and its now-unused base64 import.

…ng in nvinference_client.py

… overwrite scenarios

@suiyoubi
Contributor Author

/ok to test 3829629

Comment on lines +203 to +204
image_path = self.image_dir / f"{image_id}.jpg"
if not image_path.exists():
Contributor

P1: `image_id` used as filename without path sanitization

`f"{image_id}.jpg"` is used directly as a filename. If the `id_column` value contains a `/` (e.g. `"en/doc_001"`), `Path(image_dir) / "en/doc_001.jpg"` silently points into a subdirectory `en/` instead of a flat file. That subdirectory won't be created by `self.image_dir.mkdir(parents=True, exist_ok=True)` above (that only creates `image_dir` itself), so the subsequent `pil_image.save(image_path)` will fail with `FileNotFoundError`.

Consider sanitizing:

safe_id = image_id.replace("/", "_").replace("\\", "_")
image_path = self.image_dir / f"{safe_id}.jpg"
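The same idea as a self-contained helper, sketched a bit more defensively with a regex so backslashes and other non-filename characters are collapsed too. The function name and the allowed-character set are choices made here for illustration, not code from the PR:

```python
import re

def safe_image_filename(image_id: str, ext: str = "jpg") -> str:
    """Map an arbitrary id_column value to a flat, filesystem-safe filename.

    Replaces anything outside [A-Za-z0-9_.-] with "_", so path separators
    can never escape the target directory, e.g. "en/doc_001" -> "en_doc_001.jpg".
    """
    safe_id = re.sub(r"[^\w.-]", "_", image_id)
    return f"{safe_id}.{ext}"

print(safe_image_filename("en/doc_001"))  # en_doc_001.jpg
```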

pass


class NVInferenceModel(VLMModel):

For discussion: we could call it "nvinference", or something generic like "OpenAI-API-compatible"? Basically we're not specifically targeting nvinference, but any OpenAI-compatible API. Although internally we would of course focus on nvinference.

Contributor Author

I think we should just leave it as nvinference for now. Promoting NV inference infra isn't a bad idea.

Comment thread nemo_curator/models/omni/base.py
content.append({"type": "text", "text": prompt})
return content

def generate(

Meaning, we only have static batching for now, no dynamic batching?

Contributor Author

That's true. Do we expect a perf gain from using dynamic batching against the inference API?

Comment thread nemo_curator/tasks/ocr.py


@dataclass(kw_only=True)
class OCRDenseWord:

I guess we should rename. It's not necessarily a word, but can also be a line (if using the line prediction mode).

DEFAULT_MODEL_ID = "gcp/google/gemini-3-pro"


class Gemini3Pro(NVInferenceModel):

Actually, I was wondering if we should set our just-released omni model as the default?
