feat: Granary v2 pipeline — vLLM 0.19.1, throughput metadata export, standardized audio stages#1881
Conversation
* Add shard-level checkpointing with .done markers and corpus_id support
* Minor fixes and ruff fixes
* Make the output filepath match the input filepath structure
* Add case-insensitive corpus name matching in manifest path extraction
Converts spoken-form text to written form (e.g. "fourteen dollars" → "$14") using batched vLLM inference with Qwen3.5-35B-A3B-FP8. Runs after PnC restoration, validates outputs against 3 hallucination checks (word insertion, novel content, excessive deletion), and falls back to input on failure. Includes bundled default prompt with 18 ITN conversion rule categories. Enabled via --enable_itn in run_pipeline.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
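The hallucination checks themselves are not shown in this excerpt; below is a rough sketch of what that validation and fallback could look like. The function names, thresholds, and tokenization are illustrative assumptions, not the PR's actual implementation.

```python
def passes_itn_checks(spoken: str, written: str,
                      max_growth_ratio: float = 0.2,
                      max_deletion_ratio: float = 0.4) -> bool:
    """Illustrative versions of the three checks named above; thresholds are assumed."""
    src_words = spoken.lower().split()
    out_words = written.lower().split()

    # 1. Word insertion: the written form should not grow far beyond the input.
    if len(out_words) > len(src_words) * (1 + max_growth_ratio):
        return False

    # 2. Novel content: purely alphabetic words that never appeared in the input
    #    (digits/symbols like "$14" are expected ITN output and excluded here).
    novel = [w for w in out_words if w.isalpha() and w not in set(src_words)]
    if len(novel) > max(1, int(len(src_words) * max_growth_ratio)):
        return False

    # 3. Excessive deletion: too much of the input disappeared.
    if len(out_words) < len(src_words) * (1 - max_deletion_ratio):
        return False

    return True


def apply_itn_result(spoken: str, written: str) -> str:
    # Fall back to the spoken-form input when any check fails.
    return written if passes_itn_checks(spoken, written) else spoken


if __name__ == "__main__":
    print(apply_itn_result("it costs fourteen dollars", "it costs $14"))  # -> "it costs $14"
```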
- Remove transformers<5.0 cap from constraint-dependencies
- Remove huggingface-hub<1.0 cap from override-dependencies
- Override nemo-toolkit's transformers<4.58 pin
- Override video_cuda12's torch<=2.9.1 pin
- Point transformers/vllm/huggingface-hub to PyPI (NVIDIA mirror behind)
- Remove torch<=2.9.1 from the build dependency-group

Resulting versions:
- vLLM: 0.16.0 -> 0.19.1
- torch: 2.9.1+cu128 -> 2.11.0+cu128
- transformers: 4.57.6 -> 5.7.0
- huggingface-hub: 0.36.2 -> 1.13.0
vLLM 0.19.1 wheels are compiled against torch 2.10.0; torch 2.11.0 has an incompatible C++ ABI, causing: undefined symbol: _ZN3c1013MessageLoggerC1EPKciib
Hi @mohammadaaftabv, since the PR is very large, are you planning to split it up by stage/module?
    def test_followup_prompt_stores_disfluency() -> None:
Mock return value unpacks to wrong arity — tests will fail
QwenOmni.generate() returns a 3-tuple (pred_texts, disfluency_texts, skipped_indices), but every test mock here supplies only a 2-tuple. process_batch unpacks with pred_texts, disfluency_texts, skipped_indices = self._model.generate(...), so all affected tests will raise ValueError: not enough values to unpack (expected 3, got 2).
Affected mocks (at minimum lines 62, 74, 81, 99, 112) should each be updated to include the third element:
    stage._model.generate.return_value = (["hello world"], [""], set())
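To make the failure mode concrete, here is a minimal, self-contained illustration of the unpack; the values are the same placeholders used in the suggestion above, not code from the PR.

```python
# A 2-tuple mock cannot satisfy the 3-way unpack performed in process_batch.
two_tuple = (["hello world"], [""])
three_tuple = (["hello world"], [""], set())

try:
    pred_texts, disfluency_texts, skipped_indices = two_tuple
except ValueError as err:
    print(err)  # not enough values to unpack (expected 3, got 2)

# The fixed mock unpacks cleanly, with no skipped utterances.
pred_texts, disfluency_texts, skipped_indices = three_tuple
assert skipped_indices == set()
```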
    if cfg.get("enable_itn", False):
        itn_model = cfg.get("itn_model_id", "Qwen/Qwen3.5-35B-A3B-FP8")
        if itn_model != cfg.get("pnc_model_id", "Qwen/Qwen3.5-35B-A3B-FP8"):
            tasks.append((f"snapshot:{itn_model}", lambda m=itn_model: snapshot_download(m)))
ITN model silently skipped when skip_pnc=True and models share default ID
_prefetch_models avoids a double-download by comparing itn_model against pnc_model_id regardless of whether PnC itself is being fetched. When skip_pnc=True and itn_model_id equals pnc_model_id (both default to "Qwen/Qwen3.5-35B-A3B-FP8"), the PnC download is skipped, the comparison itn_model != cfg.get("pnc_model_id", ...) evaluates to False, and the ITN model is never queued. The parallel pre-fetch is silently a no-op for the ITN model. The pipeline falls back to setup_on_node() downloading it sequentially, erasing the 10-15 min advantage the pre-fetch was designed to provide.
| if cfg.get("enable_itn", False): | |
| itn_model = cfg.get("itn_model_id", "Qwen/Qwen3.5-35B-A3B-FP8") | |
| if itn_model != cfg.get("pnc_model_id", "Qwen/Qwen3.5-35B-A3B-FP8"): | |
| tasks.append((f"snapshot:{itn_model}", lambda m=itn_model: snapshot_download(m))) | |
| if cfg.get("enable_itn", False): | |
| itn_model = cfg.get("itn_model_id", "Qwen/Qwen3.5-35B-A3B-FP8") | |
| already_queued = any(name == f"snapshot:{itn_model}" for name, _ in tasks) | |
| if not already_queued: | |
| tasks.append((f"snapshot:{itn_model}", lambda m=itn_model: snapshot_download(m))) |
    def num_workers(self) -> int | None:
        return 1

    def xenna_stage_spec(self) -> dict[str, Any]:
        return {"num_workers": 1}
ManifestWriterStage missing IS_ACTOR_STAGE spec for Ray
ManifestWriterStage defines num_workers() = 1 and an Xenna-spec {"num_workers": 1}, but has no ray_stage_spec() override. In the Ray Data backend, without IS_ACTOR_STAGE: True, the executor may instantiate multiple concurrent workers for this stage. Each worker appends to the same output path, producing interleaved JSONL lines that corrupt the output file. ShardedManifestWriterStage (the parallel writer in this PR) correctly pairs num_workers() = 1 with ray_stage_spec() = {RayStageSpecKeys.IS_ACTOR_STAGE: True}; ManifestWriterStage needs the same.
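A minimal sketch of the missing override, assuming the base class and other methods of ManifestWriterStage stay as they are and that RayStageSpecKeys is imported from the same module ShardedManifestWriterStage already uses (the import is deliberately omitted below because its path is not shown in this excerpt):

```python
from typing import Any

class ManifestWriterStage:  # base class and existing methods elided in this sketch
    def num_workers(self) -> int | None:
        return 1

    def xenna_stage_spec(self) -> dict[str, Any]:
        return {"num_workers": 1}

    def ray_stage_spec(self) -> dict[str, Any]:
        # Mirror ShardedManifestWriterStage: pin the stage to a single Ray actor
        # so only one worker ever appends to the output manifest.
        # RayStageSpecKeys: import exactly as the sharded writer does.
        return {RayStageSpecKeys.IS_ACTOR_STAGE: True}
```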
    def outputs(self) -> tuple[list[str], list[str]]:
        keys = [self.pred_text_key, self.skip_me_key]
        if self.followup_prompt:
            keys.append(self.disfluency_text_key)
        return [], keys
followup_prompt_file is resolved inside _create_model() but outputs() and process_batch() both gate on self.followup_prompt (the raw dataclass field, which stays None). If a user sets only followup_prompt_file, QwenOmni will run full two-turn inference consuming GPU time and tokens, but disfluency_text_key is never written to any task and is not declared as an output — downstream stages relying on it receive tasks with the field entirely absent.
Current:

    def outputs(self) -> tuple[list[str], list[str]]:
        keys = [self.pred_text_key, self.skip_me_key]
        if self.followup_prompt:
            keys.append(self.disfluency_text_key)
        return [], keys

Suggested:

    def outputs(self) -> tuple[list[str], list[str]]:
        keys = [self.pred_text_key, self.skip_me_key]
        if self.followup_prompt or self.followup_prompt_file:
            keys.append(self.disfluency_text_key)
        return [], keys
Summary
_log_metrics() in InferenceQwenOmniStage and PnCRestorationStage, with perf_summary.json aggregate output in ShardedManifestWriterStage

Changes
- nemo_curator/stages/audio/inference/qwen_omni.py: add inference timing + utterance count metrics
- nemo_curator/stages/audio/text_filtering/pnc_restoration.py: add inference timing + restoration count metrics
- nemo_curator/stages/audio/io/sharded_manifest_writer.py: add perf_summary.json + per-shard _perf.jsonl export
- tutorials/audio/qwen_omni_inprocess/main.py: log perf summary after pipeline completion
- pyproject.toml / uv.lock: dependency pins

Test plan
- perf_summary.json output in next Kratos run
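For reference, a tiny sketch of how the exported summary might be inspected after a run; the output path is an assumption and no particular field names are relied on.

```python
import json
from pathlib import Path

# Illustrative location: perf_summary.json is written by ShardedManifestWriterStage
# alongside the output shards; adjust the path to the actual output directory.
summary_path = Path("output/perf_summary.json")

perf = json.loads(summary_path.read_text())
print(json.dumps(perf, indent=2))  # dump whatever the writer exported
```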