
Nemotron OCR SDG Pipeline #1899

Open
suiyoubi wants to merge 43 commits into main from aot/omni_sdg

Conversation

Contributor

@suiyoubi suiyoubi commented Apr 30, 2026

Description

Adds the Nemotron OCR SDG pipeline — a multimodal synthetic data generation pipeline that converts images into structured OCR + QA conversation data for vision-language model training.

Pipeline stages

| Stage | Model | Output |
| --- | --- | --- |
| `NemotronOCRV2Stage` | NemotronOCR-v2 | Dense word-level OCR with bounding boxes |
| `OCRScoringQAStage` | Gemini 3 Pro (NVIDIA Inference API) | Scoring, validation, and missing-region detection |
| `OCRConversationalizeStage` | | 11 output format variants → `ConversationSample` |
| `OCRDenseQAStage` | | 6 QA types (bbox↔text, point↔text, dense dump) |
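For a rough mental model of how these stages chain, here is a minimal, self-contained sketch. The stage names come from this PR, but the `Task`/`process()` interface and the stubbed outputs are illustrative assumptions, not NeMo Curator's actual stage API.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    # Stand-in for a pipeline task; the real task classes carry image data.
    data: dict = field(default_factory=dict)

class Stage:
    def process(self, task: Task) -> Task:
        raise NotImplementedError

class NemotronOCRV2Stage(Stage):
    def process(self, task: Task) -> Task:
        # Dense word-level OCR with bounding boxes (stubbed result).
        task.data["words"] = [{"text": "hello", "bbox": [0, 0, 10, 5]}]
        return task

class OCRScoringQAStage(Stage):
    def process(self, task: Task) -> Task:
        # Scoring / validation via a VLM judge (stubbed result).
        task.data["score"] = 1.0
        return task

def run_pipeline(stages: list[Stage], task: Task) -> Task:
    # Each stage consumes the previous stage's output.
    for stage in stages:
        task = stage.process(task)
    return task

result = run_pipeline([NemotronOCRV2Stage(), OCRScoringQAStage()], Task())
print(result.data)
```

The conversationalize and dense-QA stages would slot in after scoring in the same fashion, turning the scored OCR output into `ConversationSample` records.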

Key components

  • `nemo_curator/models/omni/` — `NVInferenceModel` base class for NVIDIA Inference API-backed VLMs; `Gemini3Pro` concrete implementation
  • `nemo_curator/models/client/nvinference_client.py` — thin streaming client helpers (`get_nvinference_api_key`, `create_openai_client`, `stream_chat_completion_text`)
  • `nemo_curator/stages/synthetic/omni/base.py` — `VLMProcessingStage` and `ModelProcessingStage` base classes with batched inference, per-prompt error isolation, and setup/teardown lifecycle
  • `nemo_curator/stages/synthetic/omni/io.py` — `HFDatasetImageReader`, `TarImageReader`, `ParquetReader`, `SkipProcessedStage`, `ResultWriterStage`
  • `nemo_curator/tasks/ocr.py` — `OCRDenseWord` and `OCRData` task data classes
  • `docker/Dockerfile` — installs `nemotron-ocr-v2` from source (no-build-isolation, CUDA arch list for A100/A10/RTX Ada/H100)
  • `tutorials/synthetic/omni/hf_ocr_pipeline.py` — end-to-end example using HuggingFace datasets

Tests

63 new unit tests across `tests/tasks/`, `tests/models/`, and `tests/stages/synthetic/omni/` — all CPU-only, no GPU required.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

suiyoubi and others added 30 commits March 4, 2026 10:40
Signed-off-by: Ao Tang <aot@nvidia.com>
- Updated the `transformers` dependency from `<=4.55.2` to `==4.57.0` in `pyproject.toml` and `uv.lock` to ensure compatibility with the Cosmos Embed imports.
- Added a new `Gemini3Pro` model class in `gemini.py` utilizing the NVIDIA Inference API.
- Introduced `DescriptionOutputStage` and `DescriptionValidatorStage` for processing and validating image descriptions, respectively.
- Enhanced `VLMProcessingStage` to improve GPU resource handling and added a `num_workers` parameter to `DescriptionStage` for better scalability.

This commit enhances the model's capabilities and ensures that dependencies are up-to-date for optimal performance.

Removes description-specific stages (description*.py, description
pipeline tutorials) that belong on aot/omni_description.
Adds OCR result inspection/review scripts and shared design docs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… pipeline tutorial

…Data and OCRDenseWord classes

--metrics-dir wires Ray metrics into the running Prometheus/Grafana instance.
--run-name sets SLURM_JOB_NAME so Xenna labels the run on the
ray_pipeline_input_tasks metric for human-readable identification in Grafana.

… timing

- OCRConversationData.to_dict(): call conversation.to_dict() explicitly
  instead of relying on dataclasses.asdict(), which bypasses the custom
  ConversationSample serialization and drops the "t" media-type field from
  image fragments.
- RayClient: move Prometheus service-discovery registration to after Ray
  is started and responsive; add _wait_for_ray_service_discovery_file()
  so the SD file exists before Prometheus is told to watch it.

Remove internal-only I/O stages from io.py (InputFormat, ImageReaderStage,
ImageFolderReaderStage, TarImageReaderStage, JsonlTarImageReaderStage,
OcrJsonlReaderStage, JsonlPipelineOutputReaderStage, TarImageReader,
ParquetImageReaderStage, ParquetImageReader).  These classes are only
used by the internal ocr_pipeline.py and will live on aot/omni_sdg_internal.

Public io.py now exports: HFDatasetImageReaderStage, SkipProcessedStage,
ResultWriterStage, merge_output_shards, ImageWriterStage, and the
FileReader helpers (load_image_from_task, TarFileReader, etc.).

Add tests/stages/test_hf_dataset_image_reader.py covering HFDatasetImageReaderStage.

ImageWriterStage is not used by hf_ocr_pipeline.py and only makes sense
with the internal JSONL+tar pipeline; moved to aot/omni_sdg_internal.

TarFileReader, ParquetFileReader, _file_readers dispatcher, and the
deprecated _parse_tar_slice_path wrapper are removed.  In the public HF
pipeline images are always regular JPEG files on disk, so load_image_from_task
is simplified to a single RegularFileReader call.

io.py: 870 → 631 lines.

…e_url

SUPPORTED_IMAGE_EXTENSIONS was only used by the internal reader classes
removed in the previous commit.  FileReader.read_image_url() was never
called in this branch — drop it and its now-unused base64 import.

…ng in nvinference_client.py

… overwrite scenarios

@suiyoubi
Contributor Author

/ok to test 3829629

Comment on lines +203 to +204
image_path = self.image_dir / f"{image_id}.jpg"
if not image_path.exists():
Contributor

P1: `image_id` used as filename without path sanitization

`f"{image_id}.jpg"` is used directly as a filename. If the `id_column` value contains a `/` (e.g. `"en/doc_001"`), `Path(image_dir) / "en/doc_001.jpg"` silently points into a subdirectory `en/` instead of a flat file. That subdirectory won't be created by `self.image_dir.mkdir(parents=True, exist_ok=True)` above (that only creates `image_dir` itself), so the subsequent `pil_image.save(image_path)` will fail with `FileNotFoundError`.

Consider sanitizing:

safe_id = image_id.replace("/", "_").replace("\\", "_")
image_path = self.image_dir / f"{safe_id}.jpg"
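The same idea as a self-contained helper, sketched a bit more defensively with a regex so backslashes and other non-filename characters are collapsed too. The function name and the allowed-character set are choices made here for illustration, not code from the PR:

```python
import re

def safe_image_filename(image_id: str, ext: str = "jpg") -> str:
    """Map an arbitrary id_column value to a flat, filesystem-safe filename.

    Replaces anything outside [A-Za-z0-9_.-] with "_", so path separators
    can never escape the target directory, e.g. "en/doc_001" -> "en_doc_001.jpg".
    """
    safe_id = re.sub(r"[^\w.-]", "_", image_id)
    return f"{safe_id}.{ext}"

print(safe_image_filename("en/doc_001"))  # en_doc_001.jpg
```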

pass


class NVInferenceModel(VLMModel):

For discussion: we could call it "nvinference", or something generic like "OpenAI-API-compatible"? Basically we're not specifically targeting nvinference, but any OpenAI-compatible API. Although internally we would of course focus on nvinference.

Contributor Author

I think we should just leave it as nvinference for now. Promoting NV inference infra isn't a bad idea.

Comment thread nemo_curator/models/omni/base.py
content.append({"type": "text", "text": prompt})
return content

def generate(

Meaning, we only have static batching for now, no dynamic batching?

Contributor Author

That's true. Do we expect a perf gain from using dynamic batching against the inference API?

Comment thread nemo_curator/tasks/ocr.py


@dataclass(kw_only=True)
class OCRDenseWord:

I guess we should rename. It's not necessarily a word, but can also be a line (if using the line prediction mode).

DEFAULT_MODEL_ID = "gcp/google/gemini-3-pro"


class Gemini3Pro(NVInferenceModel):

Actually, I was wondering if we should set our just-released omni model as the default?
