91 commits
4ea4f8a
Add in-process vLLM inference pipeline for Qwen3-Omni audio transcrip…
melllinia Apr 9, 2026
98a5c05
Make audio stage imports lazy to avoid pulling heavy optional depende…
melllinia Apr 9, 2026
c3587c2
move to lhotse (#1789)
nithinraok Apr 10, 2026
7cf1c79
Add deployment scripts and fix reader for multi-node support
melllinia Apr 15, 2026
c0ef5b7
Add deployment scripts and fix reader for multi-node support
melllinia Apr 15, 2026
dfd2c94
Combining two stages
Apr 17, 2026
a749615
waveform change
Apr 17, 2026
d18ac7e
Fixes
Apr 20, 2026
cac3b8a
Updates
Apr 20, 2026
92f3263
Updates
Apr 20, 2026
030ea44
Update run_pipeline.py
nune-tadevosyan Apr 20, 2026
7047f85
Merge PR 1824 Combining two stages and text filtering
melllinia Apr 20, 2026
37c4f4c
pnc restoration
Apr 21, 2026
e312832
working changes for english
Apr 22, 2026
89a93af
Final pnc
Apr 22, 2026
8e10503
Fix for different stages
Apr 23, 2026
b7d6d0e
Add shard-level checkpointing with .done markers (#1836)
melllinia Apr 23, 2026
ec445fd
Merging checkpointing to pipeline
Apr 23, 2026
1a4638f
remove tests
Apr 23, 2026
8b5b401
remove low threshold check in hallucination detection
Apr 23, 2026
f33d95e
batch support for pnc
Apr 23, 2026
0c1070e
clean up
Apr 23, 2026
74c2b2c
ruff update
Apr 23, 2026
1d014dc
ruff update
Apr 23, 2026
4dab67a
fix
Apr 23, 2026
31e8c5d
FP8 model
Apr 23, 2026
3c04350
Updates
Apr 23, 2026
bdd1f3b
ruff update
Apr 23, 2026
7682aa6
Add QwenASR hallucination recovery and unified _skip_me tracking
melllinia Apr 23, 2026
3914ae4
Add ITNRestorationStage for inverse text normalization
nithinraok Apr 23, 2026
085dedb
Fix ruff lint and format issues in ITN stage
nithinraok Apr 23, 2026
f1e40ff
Preserve source_lang through pipeline for per-sample language-aware p…
nithinraok Apr 23, 2026
d516bd8
Merge remote-tracking branch 'nithin/add_ITN' into nkoluguri/integrat…
nithinraok Apr 24, 2026
53c4bc8
Merge remote-tracking branch 'nithin/nithinraok/add-ml-prompt' into n…
nithinraok Apr 24, 2026
0cf8e6c
match max model len for ITN and PnC
nithinraok Apr 26, 2026
7e32df1
add Qwen3ASR for all
nithinraok Apr 26, 2026
a6594b4
add ITN args to run_pipeline
nithinraok Apr 27, 2026
15424e3
updated prompt for ITN
nithinraok Apr 27, 2026
5fdfa0a
additional notes key, skip witing to keys after skip_me found, add pn…
nithinraok Apr 27, 2026
633acc7
FastText and Hallucination update
Apr 27, 2026
727fb46
Resolve a conflict
Apr 27, 2026
e6a75ab
ruff fixes
Apr 27, 2026
d7a7802
ruff fixes
Apr 27, 2026
50f2c0b
ruff fixes
Apr 27, 2026
3137a94
Merge pull request #1 from nune-tadevosyan/fasttext-hallucination-update
nithinraok Apr 27, 2026
caccd37
Add min word count for FastText
nithinraok Apr 27, 2026
bb23549
Default source_lang to en, base followup stages to use this key, remo…
nithinraok Apr 27, 2026
ed453b5
Add per-language prompt selection: --en_prompt_file and --ml_prompt_f…
nithinraok Apr 27, 2026
53656ab
update QwenASR to take language from source_key
nithinraok Apr 27, 2026
f084293
doc string update
nithinraok Apr 27, 2026
0e7bcbe
reduce prompt for itn
nithinraok Apr 28, 2026
2e38ee4
reduce ITN prompt and remove english only validation
nithinraok Apr 28, 2026
1cbb4a6
increase gpu utilization
nithinraok Apr 28, 2026
9224f3f
standardize audio stages for upstream compatibility
mohammadaaftabv Apr 28, 2026
06a7173
align new stages with developer guide conventions
mohammadaaftabv Apr 28, 2026
9746720
fix bugs and convention violations found via cross-PR review audit
mohammadaaftabv Apr 28, 2026
2e33275
fix ruff lint errors introduced by prior commits
mohammadaaftabv Apr 28, 2026
0099b7c
fix all ruff errors in audio/ and add comprehensive test suite
mohammadaaftabv Apr 28, 2026
75afa97
address PM tutorial critique and task-list gaps
mohammadaaftabv Apr 28, 2026
9d11dec
fix audio test failures: correct imports and align tests with source
mohammadaaftabv Apr 28, 2026
5de5d83
feat: add Hydra entry point for Granary v2 Qwen-Omni pipeline (Kratos…
mohammadaaftabv Apr 29, 2026
03b82ac
Add hf_token handling and update Hydra config for Kratos
mohammadaaftabv Apr 29, 2026
44da20b
feat: add parallel model prefetch, pipeline deep dive docs, and s3 en…
mohammadaaftabv Apr 29, 2026
8734bbb
refactor: remove s3_endpoint_url from Curator config, update pipeline…
mohammadaaftabv Apr 29, 2026
792abde
feat: add vllm, qwen-omni-utils, qwen-asr to audio_cuda12 extras
mohammadaaftabv Apr 29, 2026
7e0cb08
feat: add fasttext to audio_cuda12 extras
mohammadaaftabv Apr 29, 2026
93c1d87
feat: bump transformers to >=5.0 for qwen3_5_moe support
mohammadaaftabv Apr 29, 2026
0f204cb
revert: restore transformers <5.0 constraint
mohammadaaftabv Apr 29, 2026
398e67a
docs: add detailed branch comparison with nkoluguri/integration-test
mohammadaaftabv Apr 30, 2026
a4c4822
Align vLLM API usage with reference branch
mohammadaaftabv Apr 30, 2026
0af2d8b
Regenerate uv.lock after aligning pyproject.toml with reference
mohammadaaftabv Apr 30, 2026
41d4fc6
Add vllm to audio_cuda12 extra and regenerate lockfile
mohammadaaftabv Apr 30, 2026
46af49b
Add pyopenssl>=26.0.0 to fix cryptography incompatibility
mohammadaaftabv Apr 30, 2026
12df5af
Fix config paths: /src/Curator -> /opt/Curator
mohammadaaftabv Apr 30, 2026
f6a837d
Upgrade vLLM 0.15.1 -> 0.16.0 for qwen3_5_moe support
mohammadaaftabv Apr 30, 2026
ac71bd8
feat: upgrade vLLM to 0.19.1 for Qwen3.5 MoE support
mohammadaaftabv May 1, 2026
0d0a8ef
fix: pin torch==2.10.0 to match vLLM 0.19.1 compiled ABI
mohammadaaftabv May 1, 2026
58e27f1
feat: add throughput metadata export for perf analysis
mohammadaaftabv May 1, 2026
89aeaf8
docs: add Granary v2 pipeline architecture explanation
mohammadaaftabv May 1, 2026
6ba525d
docs: remove personal reference from branch comparison
mohammadaaftabv May 4, 2026
7be68b1
chore: remove personal paths from curator examples
mohammadaaftabv May 4, 2026
45d0672
feat: add granular Granary throughput metrics
mohammadaaftabv May 4, 2026
e97d5cc
Add Granary v2 throughput experiment controls
mohammadaaftabv May 5, 2026
e59bbb8
Add runtime GPU resource overrides
mohammadaaftabv May 5, 2026
f1f51f6
Make Qwen ASR cache sizing follow stage batch size
mohammadaaftabv May 5, 2026
ef9c13d
Improve Granary v2 ASR cache sizing and perf summaries
mohammadaaftabv May 5, 2026
566e66b
docs: refresh granary v2 pipeline deep dive
mohammadaaftabv May 5, 2026
0d2c25a
refactor: reuse audio throughput metrics
mohammadaaftabv May 5, 2026
cb1f994
support ray data qwen omni backend
mohammadaaftabv May 6, 2026
ad4cf19
stabilize qwen backend startup
mohammadaaftabv May 6, 2026
e3d3463
configure qwen omni processors from yaml
mohammadaaftabv May 7, 2026
565 changes: 565 additions & 0 deletions CURATOR_BRANCH_COMPARISON.md


767 changes: 767 additions & 0 deletions GRANARY_V2_PIPELINE_EXPLAINED.md


@@ -0,0 +1,30 @@
You receive audio in English.

MAIN GOAL: faithfully transcribe the audio exactly as spoken, keeping all disfluencies present in the audio.
- Do NOT remove, correct, or "clean up" any speech artifacts.
- Do NOT paraphrase, edit grammar, or make the speech more polished.

FILLER WORDS:
- Include hesitation markers such as "um", "uh", "hm", "ah", etc., exactly as spoken in the audio.

REPETITIONS:
- Consecutive instances of the same word or short phrase spoken unintentionally — keep all repetitions as-is.
- Example: "I I think", "the the problem"

FALSE STARTS:
- Incomplete words or phrases the speaker abandons — keep them as-is, marking the abandoned fragment with a hyphen.
- Example: "I was go going to the store." → "I was go- going to the store."

COLLOQUIAL REDUCTIONS:
- Preserve forms such as "wanna", "gonna", "kinda", "lemme", "lotta", "outta", "Imma", "sorta", "ya", "m'kay", "finna", "tryna", etc exactly as spoken. Do NOT expand them into standard forms.

WRONG GRAMMAR:
- Grammatical errors should be faithfully captured in the transcript — do NOT correct them.
- You MUST NOT fix subject-verb agreement, tense errors, or any other grammatical issues.

NUMERICALS:
- Keep numbers as spoken, in words. Do NOT convert them to digits. For example, "oh eleven" should stay "oh eleven" as spoken in the audio, not "zero eleven".

Output format:
- Return ONLY the transcription text.
- No explanations, no JSON, no lists.
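
This prompt file is consumed verbatim at run time. A minimal sketch of how such a file might be read and passed as the per-sample context strings that the QwenASR wrapper later in this diff accepts; the load_prompt helper and the en_prompt.txt filename are illustrative, not part of the PR:

from pathlib import Path

def load_prompt(path: str) -> str:
    # The prompt is an instruction string consumed verbatim, so read it as-is.
    return Path(path).read_text(encoding="utf-8").strip()

# One context string per waveform in a batch (a batch of 4 is arbitrary):
en_context = load_prompt("en_prompt.txt")
contexts = [en_context] * 4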
@@ -0,0 +1 @@
Transcribe the {language} audio into text exactly as the speaker says it. Write numbers as spoken words.
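
The multilingual prompt carries a {language} placeholder filled in per sample from the source_lang key. A small sketch of that substitution, assuming a simple code-to-name mapping; LANG_NAMES and ml_context are hypothetical helpers, and the default-to-English fallback mirrors the "Default source_lang to en" commit above:

# Illustrative code-to-name mapping; the pipeline's real table may differ.
LANG_NAMES = {"en": "English", "de": "German", "fr": "French", "es": "Spanish"}

ML_TEMPLATE = (
    "Transcribe the {language} audio into text exactly as the speaker says it. "
    "Write numbers as spoken words."
)

def ml_context(source_lang: str) -> str:
    # Fall back to English when source_lang is missing or unknown.
    return ML_TEMPLATE.format(language=LANG_NAMES.get(source_lang, "English"))

print(ml_context("de"))  # -> "Transcribe the German audio into text ..."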
474 changes: 474 additions & 0 deletions examples/audio/qwen_omni_inprocess/run_pipeline.py


203 changes: 203 additions & 0 deletions nemo_curator/models/qwen_asr.py
@@ -0,0 +1,203 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Qwen3-ASR model wrapper for in-process vLLM inference.

Uses the ``qwen_asr`` library which wraps vLLM internally and exposes a
high-level ``transcribe()`` API that accepts in-memory numpy waveforms.
"""

from __future__ import annotations

import gc
import time
from typing import TYPE_CHECKING, Any

from loguru import logger

from nemo_curator.models.base import ModelInterface

if TYPE_CHECKING:
import numpy as np

_QWEN3_ASR_MODEL_ID = "Qwen/Qwen3-ASR-0.6B"


class QwenASR(ModelInterface):
"""Qwen3-ASR model via the ``qwen_asr`` library with vLLM backend.

Audio is accepted as in-memory numpy arrays (mono, any sample rate).
The ``qwen_asr`` library handles resampling to 16 kHz, chunking long
audio, and batched vLLM inference internally.
"""

def __init__(
self,
model_id: str = _QWEN3_ASR_MODEL_ID,
gpu_memory_utilization: float = 0.7,
max_model_len: int | None = None,
max_new_tokens: int = 4096,
max_inference_batch_size: int = 128,
):
self.model_id = model_id
self.gpu_memory_utilization = gpu_memory_utilization
self.max_model_len = max_model_len
self.max_new_tokens = max_new_tokens
self.max_inference_batch_size = max_inference_batch_size

self._model: Any = None

@property
def model_id_names(self) -> list[str]:
return [self.model_id]

# ------------------------------------------------------------------
# Lifecycle
# ------------------------------------------------------------------

@staticmethod
def _patch_transformers_compat() -> None:
"""Patch transformers.check_model_inputs for qwen-asr compatibility.

Newer transformers changed check_model_inputs from a decorator factory
(called with parentheses) to a plain decorator. The qwen-asr package
uses the old ``@check_model_inputs()`` syntax which breaks on newer
versions. This wraps it to accept both styles.
"""
try:
import transformers
original = getattr(transformers, "check_model_inputs", None)
if original is None:
return
import inspect
sig = inspect.signature(original)
params = list(sig.parameters.values())
if params and params[0].name == "func":
def compat_check_model_inputs(*args): # noqa: ANN202
if args and callable(args[0]):
return original(args[0])
return original
transformers.check_model_inputs = compat_check_model_inputs
except Exception: # noqa: BLE001, S110
pass

def setup(self) -> None:
setup_t0 = time.perf_counter()
self._patch_transformers_compat()

try:
from qwen_asr import Qwen3ASRModel
except ImportError:
msg = "qwen_asr is required for QwenASR. Install it: pip install qwen-asr[vllm]"
raise ImportError(msg) from None

logger.info(
f"Loading QwenASR model={self.model_id} "
f"gpu_mem={self.gpu_memory_utilization} "
f"max_model_len={self.max_model_len} "
f"max_new_tokens={self.max_new_tokens} "
f"max_batch={self.max_inference_batch_size}"
)

llm_kwargs: dict[str, Any] = {
"model": self.model_id,
"gpu_memory_utilization": self.gpu_memory_utilization,
"max_inference_batch_size": self.max_inference_batch_size,
"max_new_tokens": self.max_new_tokens,
"trust_remote_code": True,
"enforce_eager": True,
"enable_prefix_caching": True,
"prefix_caching_hash_algo": "xxhash",
}
if self.max_model_len is not None:
llm_kwargs["max_model_len"] = self.max_model_len

self._model = Qwen3ASRModel.LLM(**llm_kwargs)

logger.info("QwenASR model loaded in {:.3f}s", time.perf_counter() - setup_t0)

def teardown(self) -> None:
del self._model
self._model = None
gc.collect()
try:
import torch

torch.cuda.empty_cache()
except Exception: # noqa: BLE001, S110
pass

# ------------------------------------------------------------------
# Generation
# ------------------------------------------------------------------

def generate(
self,
waveforms: list[np.ndarray],
sample_rates: list[int],
contexts: list[str] | None = None,
languages: list[str | None] | None = None,
) -> tuple[list[str], list[str]]:
"""Run batched ASR inference on in-memory audio waveforms.

Args:
waveforms: List of 1-D mono numpy float32 arrays.
sample_rates: Corresponding sample rates for each waveform.
contexts: Optional per-sample instruction strings for
``with_instruction`` mode.
languages: Per-sample language names (e.g. ``"English"``).

Returns:
``(texts, languages)`` -- transcribed text and detected
language for each input.
"""
if self._model is None:
msg = "Model not initialized. Call setup() first."
raise RuntimeError(msg)

n = len(waveforms)
valid_indices = [i for i, w in enumerate(waveforms) if w.size > 0]

if not valid_indices:
logger.warning(f"All {n} audio samples are empty, returning empty predictions")
return [""] * n, [""] * n

if len(valid_indices) < n:
logger.warning(f"Skipping {n - len(valid_indices)}/{n} empty waveforms in QwenASR batch")

valid_waveforms = [waveforms[i] for i in valid_indices]
valid_rates = [sample_rates[i] for i in valid_indices]
valid_langs = [languages[i] for i in valid_indices] if languages else None
valid_contexts = [contexts[i] for i in valid_indices] if contexts else None

audio_inputs: list[tuple[np.ndarray, int]] = list(
zip(valid_waveforms, valid_rates, strict=True)
)

kwargs: dict[str, Any] = {
"audio": audio_inputs,
"language": valid_langs,
}
if valid_contexts is not None:
kwargs["context"] = valid_contexts

results = self._model.transcribe(**kwargs)

texts: list[str] = [""] * n
detected_langs: list[str] = [""] * n
for idx, r in zip(valid_indices, results, strict=True):
texts[idx] = getattr(r, "text", str(r))
detected_langs[idx] = getattr(r, "language", "")

return texts, detected_langs
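
A minimal usage sketch for the wrapper above, assuming soundfile for decoding and the qwen-asr[vllm] extra installed; sample.wav and the language hint are illustrative:

import soundfile as sf

from nemo_curator.models.qwen_asr import QwenASR

# Any mono waveform at any sample rate works; qwen_asr resamples to 16 kHz.
waveform, sample_rate = sf.read("sample.wav", dtype="float32")

model = QwenASR(gpu_memory_utilization=0.7, max_inference_batch_size=128)
model.setup()  # loads the vLLM-backed Qwen3-ASR model
try:
    texts, detected = model.generate(
        waveforms=[waveform],
        sample_rates=[sample_rate],
        languages=["English"],  # optional per-sample language hints
    )
    print(texts[0], detected[0])
finally:
    model.teardown()  # releases GPU memory between stages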