Conversation
- Updated the `transformers` dependency from `<=4.55.2` to `==4.57.0` in `pyproject.toml` and `uv.lock` to ensure compatibility with the Cosmos Embed imports.
- Added a new `Gemini3Pro` model class in `gemini.py` utilizing the NVIDIA Inference API.
- Introduced `DescriptionOutputStage` and `DescriptionValidatorStage` for processing and validating image descriptions, respectively.
- Enhanced `VLMProcessingStage` to improve GPU resource handling and added a `num_workers` parameter to `DescriptionStage` for better scalability.

This commit enhances the model's capabilities and ensures that dependencies are up to date for optimal performance.

Signed-off-by: Ao Tang <aot@nvidia.com>
Removes description-specific stages (description*.py, description pipeline tutorials) that belong on aot/omni_description. Adds OCR result inspection/review scripts and shared design docs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… pipeline tutorial Signed-off-by: Ao Tang <aot@nvidia.com>
…Data and OCRDenseWord classes Signed-off-by: Ao Tang <aot@nvidia.com>
--metrics-dir wires Ray metrics into the running Prometheus/Grafana instance. --run-name sets SLURM_JOB_NAME so Xenna labels the run on the ray_pipeline_input_tasks metric for human-readable identification in Grafana. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
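The flag wiring described above could look like the following hedged sketch. The flag names and the `SLURM_JOB_NAME` variable come from the comment; everything else (function names, defaults) is illustrative, not the actual CLI implementation.

```python
import argparse
import os

def parse_args(argv=None):
    # Hypothetical argument parser mirroring the two flags described above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--metrics-dir", default=None,
                        help="Directory where Ray service-discovery files are "
                             "written for the running Prometheus/Grafana instance")
    parser.add_argument("--run-name", default=None,
                        help="Human-readable run label surfaced in Grafana")
    return parser.parse_args(argv)

def apply_run_name(run_name):
    # Xenna reads SLURM_JOB_NAME to label the ray_pipeline_input_tasks metric,
    # so setting it here propagates the label to Grafana.
    if run_name:
        os.environ["SLURM_JOB_NAME"] = run_name

args = parse_args(["--run-name", "ocr-demo"])
apply_run_name(args.run_name)
```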
… timing - OCRConversationData.to_dict(): call conversation.to_dict() explicitly instead of relying on dataclasses.asdict(), which bypasses the custom ConversationSample serialization and drops the "t" media-type field from image fragments. - RayClient: move Prometheus service-discovery registration to after Ray is started and responsive; add _wait_for_ray_service_discovery_file() so the SD file exists before Prometheus is told to watch it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
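The `dataclasses.asdict()` pitfall mentioned above can be shown in miniature. The class and field names below are simplified stand-ins for `ConversationSample`/`OCRConversationData`, not the real definitions; only the mechanism is the point.

```python
from dataclasses import dataclass, asdict, field

@dataclass
class Sample:
    fragments: list = field(default_factory=list)

    def to_dict(self):
        # Custom serialization that adds the "t" media-type key.
        return {"fragments": [{"t": "image", **f} for f in self.fragments]}

@dataclass
class Container:
    conversation: Sample

    def to_dict(self):
        # Correct: call the nested custom serializer explicitly.
        return {"conversation": self.conversation.to_dict()}

c = Container(conversation=Sample(fragments=[{"url": "a.jpg"}]))

# asdict() recurses with its own generic logic and never calls
# Sample.to_dict(), so the "t" field is silently dropped.
via_asdict = asdict(c)
via_to_dict = c.to_dict()
```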
Remove internal-only I/O stages from io.py (InputFormat, ImageReaderStage, ImageFolderReaderStage, TarImageReaderStage, JsonlTarImageReaderStage, OcrJsonlReaderStage, JsonlPipelineOutputReaderStage, TarImageReader, ParquetImageReaderStage, ParquetImageReader). These classes are only used by the internal ocr_pipeline.py and will live on aot/omni_sdg_internal. Public io.py now exports: HFDatasetImageReaderStage, SkipProcessedStage, ResultWriterStage, merge_output_shards, ImageWriterStage, and the FileReader helpers (load_image_from_task, TarFileReader, etc.). Add tests/stages/test_hf_dataset_image_reader.py covering HFDatasetImageReaderStage. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ImageWriterStage is not used by hf_ocr_pipeline.py and only makes sense with the internal JSONL+tar pipeline; moved to aot/omni_sdg_internal. TarFileReader, ParquetFileReader, _file_readers dispatcher, and the deprecated _parse_tar_slice_path wrapper are removed. In the public HF pipeline images are always regular JPEG files on disk, so load_image_from_task is simplified to a single RegularFileReader call. io.py: 870 → 631 lines. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e_url SUPPORTED_IMAGE_EXTENSIONS was only used by the internal reader classes removed in the previous commit. FileReader.read_image_url() was never called in this branch — drop it and its now-unused base64 import. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng in nvinference_client.py Signed-off-by: Ao Tang <aot@nvidia.com>
… overwrite scenarios Signed-off-by: Ao Tang <aot@nvidia.com>
/ok to test 3829629
```python
image_path = self.image_dir / f"{image_id}.jpg"
if not image_path.exists():
```
image_id used as filename without path sanitization
f"{image_id}.jpg" is used directly as a filename. If the id_column value contains a / (e.g. "en/doc_001"), Path(image_dir) / "en/doc_001.jpg" silently creates a subdirectory en/ instead of a flat file. The subdirectory won't be created by self.image_dir.mkdir(parents=True, exist_ok=True) above (that only creates image_dir itself), so the subsequent pil_image.save(image_path) will fail with FileNotFoundError.
Consider sanitizing:
```python
safe_id = image_id.replace("/", "_").replace("\\", "_")
image_path = self.image_dir / f"{safe_id}.jpg"
```
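A small self-contained demonstration of the failure mode and the suggested fix (the directory and id values are illustrative):

```python
from pathlib import Path

image_dir = Path("/data/images")
image_id = "en/doc_001"

# Unsanitized: the "/" in the id silently becomes a subdirectory component,
# which mkdir(parents=True) on image_dir alone will not have created.
unsafe = image_dir / f"{image_id}.jpg"   # /data/images/en/doc_001.jpg

# Sanitized: path separators are flattened so the file stays in image_dir.
safe_id = image_id.replace("/", "_").replace("\\", "_")
safe = image_dir / f"{safe_id}.jpg"      # /data/images/en_doc_001.jpg
```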
```python
class NVInferenceModel(VLMModel):
```
For discussion: we could keep the name "nvinference", or go generic with something like "OpenAI-API-compatible"? Basically we're not specifically targeting nvinference, but rather any OpenAI-compatible API. Although internally we would of course focus on nvinference.
I think we should just leave it as nvinference for now. Promoting the NV inference infra isn't a bad idea.
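To make the naming question above concrete: an "nvinference" client and a generic OpenAI-compatible client differ only in the base URL, since the wire format is the same. The class below is a hypothetical illustration, not the actual `NVInferenceModel` API; the NVIDIA endpoint URL and model id are taken from context, the method names are invented.

```python
import json
import urllib.request

class OpenAICompatClient:
    """Minimal chat client for any OpenAI-compatible endpoint."""

    def __init__(self, base_url, api_key, model):
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key
        self.model = model

    def build_request(self, messages):
        # Same request shape regardless of vendor; only base_url changes.
        payload = json.dumps({"model": self.model, "messages": messages}).encode()
        return urllib.request.Request(
            f"{self.base_url}/chat/completions",
            data=payload,
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json",
            },
        )

# Targeting nvinference vs. any other OpenAI-compatible server is a
# one-argument change:
nv = OpenAICompatClient("https://integrate.api.nvidia.com/v1", "KEY",
                        "gcp/google/gemini-3-pro")
req = nv.build_request([{"role": "user", "content": "hi"}])
```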
```python
content.append({"type": "text", "text": prompt})
return content
```
```python
def generate(
```
Meaning, we only have static batching for now, no dynamic batching?
That's true. Do we expect a perf gain for the inference API from using dynamic batching?
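For reference, the core of what dynamic batching would add is a timeout-bounded drain: requests are collected until the batch is full or a deadline passes, instead of waiting for a fixed-size batch. This is a generic hedged sketch, not the current `generate` implementation; all names and defaults are illustrative.

```python
import queue
import time

def drain_batch(q, max_batch=8, max_wait_s=0.05):
    # Block for the first request, then opportunistically fill the batch
    # until it is full or max_wait_s has elapsed.
    batch = [q.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(3):
    q.put(f"req-{i}")
batch = drain_batch(q, max_batch=8, max_wait_s=0.01)
```

Whether this helps over static batching for a remote inference API depends on whether the server itself batches across requests; for a local engine it usually improves GPU utilization under uneven arrival rates.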
```python
@dataclass(kw_only=True)
class OCRDenseWord:
```
I guess we should rename this. It's not necessarily a word; it can also be a line (when using the line prediction mode).
```python
DEFAULT_MODEL_ID = "gcp/google/gemini-3-pro"
```
```python
class Gemini3Pro(NVInferenceModel):
```
Actually, I was wondering whether we should set our just-released omni model as the default?
Description
Adds the Nemotron OCR SDG pipeline — a multimodal synthetic data generation pipeline that converts images into structured OCR + QA conversation data for vision-language model training.
Pipeline stages
Key components
Tests
63 new unit tests across `tests/tasks/`, `tests/models/`, and `tests/stages/synthetic/omni/` — all CPU-only, no GPU required.
Checklist