Skip to content

DistilWhisperModel Documentation

Overview

The DistilWhisperModel is a Python class designed to handle English speech recognition tasks. It leverages the capabilities of the Whisper model, which is fine-tuned for speech-to-text processes. It is designed for both synchronous and asynchronous transcription of audio inputs, offering flexibility for real-time applications or batch processing.

Installation

Before you can use DistilWhisperModel, ensure you have the required libraries installed:

pip3 install --upgrade swarms

Initialization

The DistilWhisperModel class is initialized with the following parameters:

Parameter Type Description Default
model_id str The identifier for the pre-trained Whisper model "distil-whisper/distil-large-v2"

Example of initialization:

from swarms.models import DistilWhisperModel

# Initialize with default model
model_wrapper = DistilWhisperModel()

# Initialize with a specific model ID
model_wrapper = DistilWhisperModel(model_id="distil-whisper/distil-large-v2")

Attributes

After initialization, the DistilWhisperModel has several attributes:

Attribute Type Description
device str The device used for computation ("cuda:0" for GPU or "cpu").
torch_dtype torch.dtype The data type used for the Torch tensors.
model_id str The model identifier string.
model torch.nn.Module The actual Whisper model loaded from the identifier.
processor transformers.AutoProcessor The processor for handling input data.

Methods

transcribe

Transcribes audio input synchronously.

Arguments:

Argument Type Description
inputs Union[str, dict] File path or audio data dictionary.

Returns: str - The transcribed text.

Usage Example:

# Synchronous transcription
transcription = model_wrapper.transcribe("path/to/audio.mp3")
print(transcription)

async_transcribe

Transcribes audio input asynchronously.

Arguments:

Argument Type Description
inputs Union[str, dict] File path or audio data dictionary.

Returns: Coroutine - A coroutine that when awaited, returns the transcribed text.

Usage Example:

import asyncio

# Asynchronous transcription
transcription = asyncio.run(model_wrapper.async_transcribe("path/to/audio.mp3"))
print(transcription)

real_time_transcribe

Simulates real-time transcription of an audio file.

Arguments:

Argument Type Description
audio_file_path str Path to the audio file.
chunk_duration int Duration of audio chunks in seconds.

Usage Example:

# Real-time transcription simulation
model_wrapper.real_time_transcribe("path/to/audio.mp3", chunk_duration=5)

Error Handling

The DistilWhisperModel class incorporates error handling for file not found errors and generic exceptions during the transcription process. If a non-recoverable exception is raised, it is printed to the console in red to indicate failure.

Conclusion

The DistilWhisperModel offers a convenient interface to the powerful Whisper model for speech recognition. Its design supports both batch and real-time transcription, catering to different application needs. The class's error handling and retry logic make it robust for real-world applications.

Additional Notes

  • Ensure you have appropriate permissions to read audio files when using file paths.
  • Transcription quality depends on the audio quality and the Whisper model's performance on your dataset.
  • Adjust chunk_duration according to the processing power of your system for real-time transcription.

For a full list of models supported by transformers.AutoModelForSpeechSeq2Seq, visit the Hugging Face Model Hub.