Whisper.cpp Rust Bindings
Rust bindings for whisper.cpp with stdin/PCM streaming support
Overview
Safe Rust bindings for the OperatorKit whisper.cpp PCM fork. The crate wraps whisper.cpp inference, adds ergonomic streaming types, and exposes raw PCM streaming for agents or services that already have normalized audio bytes.
Highlights
- PCM streaming - WhisperStreamPcm reads raw PCM from any Rust Read source.
- Thread-safe context - share WhisperContext with workers while each transcription owns its state.
- VAD options - use built-in fixed-step processing or Silero-backed segmentation.
- Enhanced transcription - optional VAD aggregation and temperature fallback for difficult audio.
Best fit
Use this when your Rust app already handles capture, transport, or decoding and needs local Whisper inference without shelling out. For command-line piping and direct stdin processing, use the underlying PCM fork binary.
use whisper_cpp_plus::WhisperContext;
let ctx = WhisperContext::new("models/ggml-base.en.bin")?;
let audio: Vec<f32> = load_audio("audio.wav");
let text = ctx.transcribe(&audio)?;
println!("{}", text);
Synced from operator-kit/whisper-cpp-plus-rs README
Quick Start
use whisper_cpp_plus::WhisperContext;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let ctx = WhisperContext::new("models/ggml-base.en.bin")?;
    // Audio must be 16kHz mono f32; load_audio stands in for your own loader
    let audio: Vec<f32> = load_audio("audio.wav");
    let text = ctx.transcribe(&audio)?;
    println!("{}", text);
    Ok(())
}
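The Quick Start calls a load_audio helper that this README does not define. As a minimal sketch of the normalization such a helper would need, assuming the decoded input is interleaved 16-bit PCM that is already sampled at 16 kHz, the hypothetical function below downmixes to mono and scales into [-1, 1]. The name pcm16_to_mono_f32 is illustrative and not part of whisper-cpp-plus, and the function does not resample.

```rust
/// Convert interleaved 16-bit PCM into mono f32 samples in [-1.0, 1.0].
/// Hypothetical helper, not part of whisper-cpp-plus. It does not resample:
/// the input is assumed to be 16 kHz already.
pub fn pcm16_to_mono_f32(interleaved: &[i16], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be at least 1");
    interleaved
        // A trailing incomplete frame is dropped by chunks_exact.
        .chunks_exact(channels)
        .map(|frame| {
            // Average the channels of one frame, then scale by 1/32768.
            let sum: f32 = frame.iter().map(|&s| f32::from(s)).sum();
            sum / channels as f32 / 32768.0
        })
        .collect()
}
```

Decoding WAV containers (for example with the optional hound dependency) and resampling remain the caller's responsibility.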
Features
- Thread-safe - WhisperContext is Send + Sync, share via Arc
- Streaming - real-time transcription via WhisperStream and WhisperStreamPcm
- VAD - simple energy VAD by default plus Silero Voice Activity Detection integration
- Enhanced VAD - segment aggregation for optimal transcription chunks
- Temperature fallback - quality-based retry with multiple temperatures
- Async - tokio::task::spawn_blocking wrappers (feature = async)
- Cross-platform - Windows (MSVC), Linux, macOS (Intel & Apple Silicon)
- Quantization - model compression via WhisperQuantize (feature = quantization)
- Hardware acceleration - SIMD auto-detected, GPU via feature flags
Installation
[dependencies]
whisper-cpp-plus = "0.1.5"
# Optional
hound = "3.5" # WAV file loading
System Requirements
- Rust 1.70.0+
- CMake 3.14+
- C++ compiler (MSVC on Windows, GCC/Clang on Linux/macOS)
Feature Flags
whisper-cpp-plus = { version = "0.1.5", features = ["quantization"] } # Model quantization
whisper-cpp-plus = { version = "0.1.5", features = ["async"] } # Async API
whisper-cpp-plus = { version = "0.1.5", features = ["cuda"] } # NVIDIA GPU
whisper-cpp-plus = { version = "0.1.5", features = ["metal"] } # macOS GPU
CUDA GPU Acceleration
Install the CUDA Toolkit and build:
cargo build --features cuda
The build script uses CMake to compile whisper.cpp with CUDA support automatically. The CUDA toolkit is located via CUDA_PATH → CUDA_HOME → standard install paths.
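The lookup order described above can be sketched in plain Rust. This is not the crate's build script: the function name, the injected lookup and exists closures, and the list of fallback directories are all assumptions made so the precedence logic can be read and tested in isolation.

```rust
use std::path::PathBuf;

/// Resolve a CUDA toolkit directory in the order the README describes:
/// CUDA_PATH, then CUDA_HOME, then fallback install paths.
/// `lookup` abstracts environment access; `fallbacks` and `exists` are
/// assumptions of this sketch, not the crate's actual build logic.
pub fn find_cuda_root<L, E>(lookup: L, fallbacks: &[&str], exists: E) -> Option<PathBuf>
where
    L: Fn(&str) -> Option<String>,
    E: Fn(&PathBuf) -> bool,
{
    for var in ["CUDA_PATH", "CUDA_HOME"] {
        if let Some(value) = lookup(var) {
            // An empty variable is treated as unset.
            if !value.is_empty() {
                return Some(PathBuf::from(value));
            }
        }
    }
    // No environment hint: take the first fallback directory that exists.
    fallbacks
        .iter()
        .map(|p| PathBuf::from(*p))
        .find(|p| exists(p))
}
```

In a real build script the closures would typically be `|k| std::env::var(k).ok()` and `|p| p.is_dir()`.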
Advanced: prebuilt libraries — for CI or to skip recompilation, set WHISPER_PREBUILT_PATH to a directory containing pre-compiled static libs. See docs/CACHING_GUIDE.md.
API Overview
Core Types
| Type | Description | whisper.cpp equivalent |
|---|---|---|
| WhisperContext | Model context (Send + Sync) | whisper_context* |
| WhisperState | Transcription state (Send only) | whisper_state* |
| FullParams | Transcription parameters | whisper_full_params |
| TranscriptionResult | Text + timestamped segments | — |
| WhisperStream | Chunked real-time streaming | — |
| WhisperStreamPcm | Streaming from raw PCM input | stream-pcm.cpp |
| WhisperVadProcessor | Silero voice activity detection | whisper_vad_* |
| EnhancedWhisperVadProcessor | VAD + segment aggregation | — |
| EnhancedWhisperState | Transcription with temperature fallback | — |
| WhisperQuantize | Model quantization (feature) | quantize.cpp |
Examples
Transcription with parameters:
use whisper_cpp_plus::{WhisperContext, TranscriptionParams};
let ctx = WhisperContext::new("model.bin")?;
let params = TranscriptionParams::builder()
    .language("en")
    .temperature(0.0)
    .enable_timestamps()
    .n_threads(4)
    .build();
let result = ctx.transcribe_with_params(&audio, params)?;
for segment in &result.segments {
    println!("[{:.2}s - {:.2}s] {}",
        segment.start_seconds(), segment.end_seconds(), segment.text);
}
Concurrent transcription:
use std::sync::Arc;
use whisper_cpp_plus::WhisperContext;
let ctx = Arc::new(WhisperContext::new("model.bin")?);
// Each thread gets its own WhisperState internally
let handles: Vec<_> = files.iter().map(|file| {
    let ctx = Arc::clone(&ctx);
    std::thread::spawn(move || ctx.transcribe(&load_audio(file)))
}).collect();
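The snippet above spawns the workers but never collects their results. A minimal, std-only sketch of the full spawn-and-join pattern follows. FakeContext is a hypothetical stand-in that performs no inference; it exists only so the example compiles without a model file, and its String error type is an assumption.

```rust
use std::sync::Arc;
use std::thread;

/// Stand-in for a shared, read-only model context. Hypothetical: it only
/// mimics the `Send + Sync` sharing pattern and performs no inference.
pub struct FakeContext {
    pub name: String,
}

impl FakeContext {
    /// Pretend to transcribe by reporting how many samples were received.
    pub fn transcribe(&self, audio: &[f32]) -> Result<String, String> {
        if audio.is_empty() {
            return Err("empty audio".to_string());
        }
        Ok(format!("{}: {} samples", self.name, audio.len()))
    }
}

/// Run one transcription per input on its own thread, then join the
/// workers in input order so results line up with their inputs.
pub fn transcribe_all(
    ctx: Arc<FakeContext>,
    inputs: Vec<Vec<f32>>,
) -> Vec<Result<String, String>> {
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|audio| {
            let ctx = Arc::clone(&ctx);
            thread::spawn(move || ctx.transcribe(&audio))
        })
        .collect();

    handles
        .into_iter()
        // A panicked worker is reported as an error instead of propagating.
        .map(|h| h.join().unwrap_or_else(|_| Err("worker panicked".to_string())))
        .collect()
}
```

Spawning one OS thread per file is fine for a handful of inputs; for large batches a bounded worker pool is usually a better fit.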
Streaming:
use whisper_cpp_plus::{FullParams, SamplingStrategy, WhisperContext, WhisperStream};
let ctx = WhisperContext::new("model.bin")?;
let params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
let mut stream = WhisperStream::new(&ctx, params)?;
loop {
    let chunk = get_audio_chunk(); // your audio source
    stream.feed_audio(&chunk);
    let segments = stream.process_pending()?;
    for seg in &segments {
        println!("{}", seg.text);
    }
}
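get_audio_chunk in the loop above is a placeholder for your own audio source. When the audio is already in memory, one way to produce fixed-duration chunks is sketched below using only the standard library. Both function names are hypothetical, and the chunk duration is a free choice rather than a value the crate prescribes.

```rust
/// Number of samples in `ms` milliseconds at `sample_rate` Hz.
pub fn samples_for_ms(sample_rate: u32, ms: u32) -> usize {
    (sample_rate as u64 * ms as u64 / 1000) as usize
}

/// Split a buffer into consecutive chunks of at most `chunk_ms` milliseconds.
/// The final chunk may be shorter than the others.
pub fn chunk_audio(samples: &[f32], sample_rate: u32, chunk_ms: u32) -> Vec<&[f32]> {
    // Guard against a zero-length chunk, which `chunks` would reject.
    let len = samples_for_ms(sample_rate, chunk_ms).max(1);
    samples.chunks(len).collect()
}
```

Each returned slice could then be passed to a streaming feed call in turn.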
PCM streaming (WhisperStreamPcm):
use whisper_cpp_plus::{
    FullParams, PcmFormat, PcmReader, PcmReaderConfig, SamplingStrategy, WhisperContext,
    WhisperStreamPcm, WhisperStreamPcmConfig, WhisperVadProcessor,
};
let ctx = WhisperContext::new("model.bin")?;
let vad = WhisperVadProcessor::new("models/ggml-silero-vad.bin")?;
let params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
let config = WhisperStreamPcmConfig {
    use_vad: true,
    vad_thold: 0.6,
    vad_silence_ms: 800,
    vad_pre_roll_ms: 300,
    length_ms: 10000,
    ..Default::default()
};
// The source must already yield raw PCM bytes matching PcmReaderConfig.
// For this config that means 16 kHz mono f32 little-endian PCM.
let source = std::fs::File::open("audio_f32_16khz_mono.pcm")?;
let reader = PcmReader::new(
    Box::new(source),
    PcmReaderConfig {
        buffer_len_ms: 10000,
        sample_rate: 16000,
        format: PcmFormat::F32,
    },
);
let mut stream = WhisperStreamPcm::with_vad(&ctx, params, config, reader, vad)?;
stream.run(|segments, _start_ms, _end_ms| {
    for seg in segments {
        println!("{}", seg.text.trim());
    }
})?;
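The example above reads a file of raw f32 little-endian samples. To make that byte layout concrete, here is a std-only sketch that encodes samples into such bytes and decodes them back from any Read source. The helper names are hypothetical and this is not how PcmReader is implemented; it only illustrates the 4-bytes-per-sample format the comments describe.

```rust
use std::io::{self, Read};

/// Encode f32 samples as raw little-endian PCM bytes (4 bytes per sample).
pub fn f32_to_le_bytes(samples: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(samples.len() * 4);
    for s in samples {
        out.extend_from_slice(&s.to_le_bytes());
    }
    out
}

/// Read all remaining bytes from `source` and decode them as little-endian
/// f32 samples. Trailing bytes that do not form a whole sample are ignored.
pub fn read_f32_le<R: Read>(mut source: R) -> io::Result<Vec<f32>> {
    let mut bytes = Vec::new();
    source.read_to_end(&mut bytes)?;
    Ok(bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect())
}
```

Writing the output of f32_to_le_bytes to disk yields a file in the shape the example expects, provided the samples are already 16 kHz mono.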
Notes:
- PcmReader does not decode WAV/MP3, resample audio, or convert stereo to mono. Your Read source must already be normalized to the format described by PcmReaderConfig.
- WhisperStreamPcm::new(...) uses fixed-step mode or simple built-in VAD depending on use_vad.
- WhisperStreamPcm::with_vad(...) uses an explicit WhisperVadProcessor (Silero VAD) and is the recommended path when you want Silero-based segmentation.
- In VAD mode, no_context is forced internally to match stream-pcm.cpp.
- In VAD mode, run() emits the next completed speech chunk in chronological order, and callers can usually append those segments directly.
- In fixed-step mode, callbacks are produced from overlapping windows, so callers that build a cumulative transcript may need to reconcile repeated text across callbacks.
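For the fixed-step case, one simple way to reconcile repeated text is to drop the longest run of words at the start of a new callback that repeats the end of the transcript so far. The sketch below is an assumption about how a caller might do this, not something the crate provides. It compares whole words case-insensitively and will miss overlaps where Whisper re-transcribes the same audio with different wording or punctuation.

```rust
/// Append `next` to `transcript`, dropping the longest run of words at the
/// start of `next` that repeats the end of `transcript`.
/// Hypothetical caller-side helper; words are compared ignoring ASCII case.
pub fn append_deduped(transcript: &mut Vec<String>, next: &str) {
    let words: Vec<&str> = next.split_whitespace().collect();
    let max = transcript.len().min(words.len());
    // Try the longest possible overlap first; an overlap of 0 always matches.
    let overlap = (0..=max)
        .rev()
        .find(|&n| {
            let tail = &transcript[transcript.len() - n..];
            tail.iter()
                .zip(&words[..n])
                .all(|(a, b)| a.eq_ignore_ascii_case(b))
        })
        .unwrap_or(0);
    transcript.extend(words[overlap..].iter().map(|w| w.to_string()));
}
```

Calling it once per callback with the concatenated segment text keeps a running word list that can be joined with spaces for display.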
VAD preprocessing:
use whisper_cpp_plus::{WhisperVadProcessor, VadParams};
let mut vad = WhisperVadProcessor::new("models/ggml-silero-vad.bin")?;
let params = VadParams::default();
let segments = vad.segments_from_samples(&audio, &params)?;
for (start, end) in segments.get_all_segments() {
    let start_sample = (start * 16000.0) as usize;
    let end_sample = (end * 16000.0) as usize;
    let text = ctx.transcribe(&audio[start_sample..end_sample])?;
    println!("[{:.1}s-{:.1}s] {}", start, end, text);
}
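The slice in the loop above panics if a segment boundary falls past the end of the buffer or if end precedes start. A defensive variant of the seconds-to-samples conversion is sketched below. It assumes segment times are f32 seconds, which the snippet suggests but does not state, and segment_to_range is a hypothetical helper name.

```rust
use std::ops::Range;

/// Convert a segment given in seconds into a sample index range that is
/// guaranteed to lie inside a buffer of `len` samples.
pub fn segment_to_range(start_s: f32, end_s: f32, sample_rate: u32, len: usize) -> Range<usize> {
    let to_index = |t: f32| -> usize {
        // Negative or NaN times map to 0; the float-to-usize cast saturates.
        let idx = (t.max(0.0) * sample_rate as f32).round() as usize;
        idx.min(len)
    };
    let start = to_index(start_s);
    // Never let the end precede the start, so slicing cannot panic.
    let end = to_index(end_s).max(start);
    start..end
}
```

With this helper the loop body could slice `&audio[segment_to_range(start, end, 16000, audio.len())]` and skip empty ranges.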
Enhanced VAD with segment aggregation:
use whisper_cpp_plus::enhanced::{EnhancedWhisperVadProcessor, EnhancedVadParams};
let mut vad = EnhancedWhisperVadProcessor::new("models/ggml-silero-vad.bin")?;
let params = EnhancedVadParams::default();
let chunks = vad.process_with_aggregation(&audio, &params)?;
for chunk in &chunks {
    let text = ctx.transcribe(&chunk.audio)?;
    println!("[{:.1}s, {:.1}s long] {}", chunk.offset_seconds, chunk.duration_seconds, text);
}
Temperature fallback for difficult audio:
let params = TranscriptionParams::builder()
    .language("en")
    .build();
let result = ctx.transcribe_with_params_enhanced(&audio, params)?;
// Automatically retries with higher temperatures if quality thresholds aren't met
More examples in whisper-cpp-plus/examples/.
Enhanced Features
Beyond standard whisper.cpp bindings, this crate provides optimizations inspired by faster-whisper:
Intelligent VAD Preprocessing
EnhancedWhisperVadProcessor aggregates Silero VAD speech segments into optimal-sized chunks for transcription. Instead of transcribing hundreds of tiny segments, it merges adjacent speech into configurable windows — 2-3x faster on audio with significant silence.
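To illustrate the idea of aggregation, here is a simplified greedy merge over time-ordered segments. It is a sketch of the general technique, not the algorithm EnhancedWhisperVadProcessor uses: the Segment type, the max_gap and max_len parameters, and the merge rule are all assumptions.

```rust
/// A speech segment in seconds. Illustrative type, not the crate's.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Segment {
    pub start: f32,
    pub end: f32,
}

/// Greedily merge time-ordered segments into chunks. A segment joins the
/// current chunk when the gap before it is at most `max_gap` seconds and
/// the merged chunk would not exceed `max_len` seconds.
pub fn aggregate(segments: &[Segment], max_gap: f32, max_len: f32) -> Vec<Segment> {
    let mut chunks: Vec<Segment> = Vec::new();
    for seg in segments {
        let merged = match chunks.last_mut() {
            Some(cur) if seg.start - cur.end <= max_gap && seg.end - cur.start <= max_len => {
                cur.end = seg.end; // extend the open chunk
                true
            }
            _ => false,
        };
        if !merged {
            chunks.push(*seg); // start a new chunk
        }
    }
    chunks
}
```

Fewer, longer chunks mean fewer inference calls, which is where the speedup on silence-heavy audio comes from.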
Temperature Fallback
EnhancedWhisperState automatically retries transcription at higher temperatures when quality thresholds aren’t met (compression ratio, log probability, no-speech probability). Handles noisy/difficult audio without manual intervention.
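The retry loop can be sketched independently of the crate. Everything below is an assumption for illustration: the Attempt and Thresholds types, the rule for combining the three signals, and the decode closure that stands in for a real transcription call. How EnhancedWhisperState actually combines its thresholds is documented in the crate, not here.

```rust
/// Quality signals for one decoding attempt (names are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub struct Attempt {
    pub text: String,
    pub compression_ratio: f32,
    pub avg_logprob: f32,
    pub no_speech_prob: f32,
}

/// Thresholds used to accept or reject an attempt.
pub struct Thresholds {
    pub max_compression_ratio: f32,
    pub min_avg_logprob: f32,
    pub no_speech_threshold: f32,
}

impl Thresholds {
    pub fn accepts(&self, a: &Attempt) -> bool {
        // Probable silence: retrying at a higher temperature will not help.
        if a.no_speech_prob > self.no_speech_threshold {
            return true;
        }
        // Reject repetitive (highly compressible) or low-confidence output.
        a.compression_ratio <= self.max_compression_ratio
            && a.avg_logprob >= self.min_avg_logprob
    }
}

/// Try each temperature in order and stop at the first accepted attempt.
/// Returns the last attempt made, or None if `temperatures` is empty.
pub fn decode_with_fallback<F>(
    temperatures: &[f32],
    thresholds: &Thresholds,
    mut decode: F,
) -> Option<(f32, Attempt)>
where
    F: FnMut(f32) -> Attempt,
{
    let mut last = None;
    for &t in temperatures {
        let attempt = decode(t);
        let ok = thresholds.accepts(&attempt);
        last = Some((t, attempt));
        if ok {
            break;
        }
    }
    last
}
```

If every temperature fails the checks, the sketch returns the final attempt rather than nothing, on the assumption that an imperfect transcript is more useful than none.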
Both features are orthogonal — use one, both, or neither. See docs/ARCHITECTURE.md for design details.
Models
Downloading
The easiest way to get test models:
cargo xtask test-setup
This downloads ggml-tiny.en.bin and the Silero VAD model into whisper-cpp-plus-sys/whisper.cpp/models/ using whisper.cpp’s own download scripts.
For production models, download from Hugging Face:
curl -L -o models/ggml-base.en.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin
Available Models
| Model | Size | English-only | Multilingual |
|---|---|---|---|
| tiny | 39 MB | tiny.en | tiny |
| base | 142 MB | base.en | base |
| small | 466 MB | small.en | small |
| medium | 1.5 GB | medium.en | medium |
| large-v3 | 3.1 GB | — | large-v3 |
Safety
Thread Safety
- WhisperContext: Send + Sync, share via Arc
- WhisperState: Send only, one per thread
- FullParams: not Send/Sync, create per transcription
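The difference between "Send + Sync" and "Send only" can be checked at compile time with empty generic functions. The sketch below uses standard-library types plus a hypothetical WorkerState to stand in for a per-thread state object; applying the same helpers to the crate's own types is an easy way to confirm the bounds listed above.

```rust
use std::cell::Cell;
use std::sync::{Arc, Mutex};

/// Compiles only for types that can be shared across threads.
pub fn assert_send_sync<T: Send + Sync>() {}

/// Compiles only for types that can be moved to another thread.
pub fn assert_send<T: Send>() {}

/// Hypothetical per-worker state: movable between threads, but not
/// shareable by reference because `Cell` is not `Sync`.
pub struct WorkerState {
    pub segments_done: Cell<usize>,
}

/// Every call below is checked by the compiler; nothing happens at runtime.
pub fn check_bounds() -> bool {
    assert_send_sync::<Arc<Vec<f32>>>(); // shared read-only data
    assert_send::<WorkerState>(); // one instance per thread
    assert_send_sync::<Arc<Mutex<WorkerState>>>(); // sharing requires a lock
    // assert_send_sync::<WorkerState>(); // would fail to compile
    true
}
```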
Memory Safety
All unsafe FFI operations are encapsulated with null pointer checks, lifetime enforcement, and RAII cleanup.
Troubleshooting
“Failed to load model” — check file path, permissions, available memory
“Invalid audio format” — must be 16kHz mono f32, normalized to [-1, 1]
Linking errors on Windows — install Visual Studio Build Tools 2022, ensure x64 MSVC toolchain. See docs/TECHNICAL_REFERENCE.md.
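For the "Invalid audio format" case, a quick pre-flight check on the sample buffer can narrow down the cause before calling into the library. The sketch below is a hypothetical helper, not part of the crate. It can only verify that values are finite and within [-1, 1]; sample rate and channel count cannot be recovered from a bare f32 buffer.

```rust
/// Reasons a buffer is unsuitable as normalized f32 input.
#[derive(Debug, PartialEq)]
pub enum AudioError {
    Empty,
    NonFinite { index: usize },
    OutOfRange { index: usize, value: f32 },
}

/// Check that the buffer is non-empty and every sample is a finite value
/// in [-1.0, 1.0]. Reports the first offending sample.
pub fn validate_samples(samples: &[f32]) -> Result<(), AudioError> {
    if samples.is_empty() {
        return Err(AudioError::Empty);
    }
    for (index, &value) in samples.iter().enumerate() {
        if !value.is_finite() {
            return Err(AudioError::NonFinite { index });
        }
        if !(-1.0..=1.0).contains(&value) {
            return Err(AudioError::OutOfRange { index, value });
        }
    }
    Ok(())
}
```

An OutOfRange result often means integer PCM was cast to f32 without dividing by the integer range.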