```mermaid
graph LR
Orchestration_Entrypoint["Orchestration & Entrypoint"]
Audio_Processing["Audio Processing"]
Core_Model_Tokenizer["Core Model & Tokenizer"]
Inference_Decoding_Engine["Inference & Decoding Engine"]
Post_processing_Formatting["Post-processing & Formatting"]
Orchestration_Entrypoint -- "Initiates audio processing" --> Audio_Processing
Audio_Processing -- "Returns spectrogram to" --> Orchestration_Entrypoint
Orchestration_Entrypoint -- "Invokes transcription with spectrogram" --> Inference_Decoding_Engine
Inference_Decoding_Engine -- "Utilizes for token generation" --> Core_Model_Tokenizer
Inference_Decoding_Engine -- "Returns raw transcription to" --> Orchestration_Entrypoint
Orchestration_Entrypoint -- "Applies final formatting" --> Post_processing_Formatting
click Core_Model_Tokenizer href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/whisper/Core_Model_Tokenizer.md" "Details"
click Inference_Decoding_Engine href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/whisper/Inference_Decoding_Engine.md" "Details"
click Post_processing_Formatting href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/whisper/Post_processing_Formatting.md" "Details"
```
The whisper library is built on a component-based architecture that separates the user-facing interface from the core machine learning workflow. The process begins at the Orchestration & Entrypoint layer, which manages the entire transcription pipeline. It first calls the Audio Processing component to load an audio file and convert it into a standardized log-Mel spectrogram. The spectrogram is then passed to the Inference & Decoding Engine, which uses the Core Model & Tokenizer to perform language detection and the actual transcription. Finally, the raw output is sent to the Post-processing & Formatting component, which normalizes the text, adds timestamps, and generates user-friendly output files such as SRT or VTT.
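As a minimal sketch of that flow, the whole pipeline can be driven through the high-level Python API; the model name and audio path below are placeholders:

```python
import whisper

# Load a pretrained checkpoint; "base" is an arbitrary choice here.
model = whisper.load_model("base")

# transcribe() drives the whole pipeline described above: audio loading,
# spectrogram conversion, decoding, and post-processing into segments.
result = model.transcribe("audio.mp3")  # placeholder path

print(result["text"])  # the full transcript
for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```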
Orchestration & Entrypoint

Provides high-level API and CLI entry points. It coordinates the entire transcription pipeline, from loading audio to formatting the final output.
Related Classes/Methods:
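A sketch of how options set at this single entry point steer the downstream components; the path and option values are illustrative:

```python
import whisper

model = whisper.load_model("base")

# Keyword arguments given here are forwarded to the downstream components.
result = model.transcribe(
    "audio.mp3",           # placeholder path
    language="en",         # skip automatic language detection
    task="transcribe",     # or "translate" for speech-to-English
    word_timestamps=True,  # enable word-level timestamp alignment
)

# Roughly equivalent CLI invocation:
#   whisper audio.mp3 --model base --language en --task transcribe --word_timestamps True
```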
Audio Processing

Handles loading audio from files, resampling, and converting it into a log-Mel spectrogram, the standardized input format required by the model.
Related Classes/Methods:
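A sketch of this stage in isolation, using whisper's public audio helpers (the file path is a placeholder):

```python
import whisper

# Decode the file into a mono float32 waveform resampled to 16 kHz.
audio = whisper.load_audio("audio.mp3")

# Pad or trim the waveform to the model's fixed 30-second input window.
audio = whisper.pad_or_trim(audio)

# Convert the waveform into the log-Mel spectrogram the encoder expects.
mel = whisper.log_mel_spectrogram(audio)
print(mel.shape)  # (n_mels, n_frames), e.g. torch.Size([80, 3000])
```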
Core Model & Tokenizer [Expand]
Contains the essential AI assets: the Whisper neural network (an encoder-decoder Transformer) and the tokenizer for converting text to and from token ids.
Related Classes/Methods:
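A sketch of round-tripping text through the tokenizer and inspecting the model's dimensions (the model choice is arbitrary):

```python
import whisper
from whisper.tokenizer import get_tokenizer

model = whisper.load_model("base")

# The tokenizer converts text to token ids and back.
tokenizer = get_tokenizer(multilingual=model.is_multilingual)
tokens = tokenizer.encode("hello world")
print(tokens)                    # a list of integer token ids
print(tokenizer.decode(tokens))  # "hello world"

# The model itself is an encoder-decoder whose sizes are inspectable.
print(model.dims)  # n_mels, n_audio_state, n_text_state, ...
```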
Inference & Decoding Engine [Expand]
Manages the core transcription task. It runs the decoding loop, performs language detection, and applies strategies such as beam search, using the Core Model & Tokenizer to generate text from audio.
Related Classes/Methods:
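A sketch of driving the engine directly, mirroring the low-level example in the whisper README (the path and model name are placeholders):

```python
import whisper

model = whisper.load_model("base")

# Prepare a single 30-second log-Mel window (see Audio Processing above).
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Language detection runs on the encoded spectrogram.
_, probs = model.detect_language(mel)
print("detected language:", max(probs, key=probs.get))

# Decode with an explicit strategy, e.g. beam search with 5 beams.
options = whisper.DecodingOptions(beam_size=5, fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)
```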
Post-processing & Formatting [Expand]
A suite of tools to refine the raw model output. It handles text normalization, word-level timestamp alignment, and formats the final results into SRT/VTT files.
Related Classes/Methods:
- `whisper/normalizers/` (1:1)
- `whisper/timing.py` (1:1)
- `whisper/utils.py` (1:1)
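A sketch of these utilities in use; the audio path and output directory are placeholders:

```python
import whisper
from whisper.normalizers import EnglishTextNormalizer
from whisper.utils import get_writer

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")  # placeholder path

# Normalize text, e.g. before comparing against a reference transcript.
normalizer = EnglishTextNormalizer()
print(normalizer(result["text"]))

# Write the segments as subtitles; "srt" can be swapped for "vtt",
# "txt", "tsv", or "json".
writer = get_writer("srt", ".")  # "." is the output directory
writer(result, "audio.mp3")      # produces ./audio.srt
```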