Whisper.cpp Rust Bindings

Rust bindings for whisper.cpp with stdin/PCM streaming support

Overview

Safe Rust bindings for the OperatorKit whisper.cpp PCM fork. The crate wraps whisper.cpp inference, adds ergonomic streaming types, and exposes raw PCM streaming for agents or services that already have normalized audio bytes.

Highlights

PCM streaming - WhisperStreamPcm reads raw PCM from any Rust Read source.
Thread-safe context - share WhisperContext with workers while each transcription owns its state.
VAD options - use built-in fixed-step processing or Silero-backed segmentation.
Enhanced transcription - optional VAD aggregation and temperature fallback for difficult audio.

Best fit

Use this when your Rust app already handles capture, transport, or decoding and needs local Whisper inference without shelling out. For command-line piping and direct stdin processing, use the underlying PCM fork binary.

use whisper_cpp_plus::WhisperContext;

let ctx = WhisperContext::new("models/ggml-base.en.bin")?;
let audio: Vec<f32> = load_audio("audio.wav");
let text = ctx.transcribe(&audio)?;
println!("{}", text);

Synced from operator-kit/whisper-cpp-plus-rs README

Quick Start

use whisper_cpp_plus::WhisperContext;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let ctx = WhisperContext::new("models/ggml-base.en.bin")?;

    // Audio must be 16kHz mono f32
    let audio: Vec<f32> = load_audio("audio.wav");
    let text = ctx.transcribe(&audio)?;
    println!("{}", text);

    Ok(())
}
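
The Quick Start leaves `load_audio` undefined and only states that the input must be 16 kHz mono f32. As a rough sketch of the normalization step alone, assuming the decoded input is interleaved signed 16-bit PCM that is already sampled at 16 kHz, a helper along these lines could produce the expected buffer. The name `pcm_i16_to_mono_f32` is hypothetical and not part of the crate, and the helper does no resampling.

```rust
/// Converts interleaved signed 16-bit PCM into mono f32 samples in [-1.0, 1.0].
/// `channels` is the number of interleaved channels (1 = mono, 2 = stereo).
/// Resampling to 16 kHz is NOT handled here; a trailing partial frame is dropped.
pub fn pcm_i16_to_mono_f32(samples: &[i16], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channels must be at least 1");
    samples
        .chunks_exact(channels)
        .map(|frame| {
            // Average the channels of one frame, then scale by 1/32768.
            let sum: f32 = frame.iter().map(|&s| s as f32).sum();
            (sum / channels as f32) / 32768.0
        })
        .collect()
}
```

The resulting `Vec<f32>` has the same shape as the `audio` buffer passed to `ctx.transcribe(&audio)` above.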

Features

Thread-safe — WhisperContext is Send + Sync, share via Arc
Streaming — real-time transcription via WhisperStream and WhisperStreamPcm
VAD — simple energy VAD by default plus Silero Voice Activity Detection integration
Enhanced VAD — segment aggregation for optimal transcription chunks
Temperature fallback — quality-based retry with multiple temperatures
Async — tokio::task::spawn_blocking wrappers (feature = async)
Cross-platform — Windows (MSVC), Linux, macOS (Intel & Apple Silicon)
Quantization — model compression via WhisperQuantize (feature = quantization)
Hardware acceleration — SIMD auto-detected, GPU via feature flags

Installation

[dependencies]
whisper-cpp-plus = "0.1.5"

# Optional
hound = "3.5"  # WAV file loading

System Requirements

Rust 1.70.0+
CMake 3.14+
C++ compiler (MSVC on Windows, GCC/Clang on Linux/macOS)

Feature Flags

# Alternatives: keep a single whisper-cpp-plus entry and list the features you need
whisper-cpp-plus = { version = "0.1.5", features = ["quantization"] }  # Model quantization
whisper-cpp-plus = { version = "0.1.5", features = ["async"] }         # Async API
whisper-cpp-plus = { version = "0.1.5", features = ["cuda"] }          # NVIDIA GPU
whisper-cpp-plus = { version = "0.1.5", features = ["metal"] }         # macOS GPU

CUDA GPU Acceleration

Install the CUDA Toolkit and build:

cargo build --features cuda

The build script uses CMake to compile whisper.cpp with CUDA support automatically. The CUDA toolkit is located via CUDA_PATH → CUDA_HOME → standard install paths.
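
To illustrate the lookup order described above, here is a minimal sketch of how a build script might resolve the toolkit directory. It is not the crate's actual build logic: `find_cuda_root` is a hypothetical name, and the fallback directories are supplied by the caller because the README does not list the standard install paths it checks.

```rust
use std::ffi::OsString;
use std::path::PathBuf;

/// Resolves a CUDA toolkit directory: CUDA_PATH first, then CUDA_HOME,
/// then the first fallback directory that exists on disk.
/// `get_env` is injected so the order can be exercised without touching
/// the real process environment.
pub fn find_cuda_root(
    get_env: impl Fn(&str) -> Option<OsString>,
    fallbacks: &[&str],
) -> Option<PathBuf> {
    for var in ["CUDA_PATH", "CUDA_HOME"] {
        if let Some(value) = get_env(var) {
            // Treat an empty variable as unset.
            if !value.is_empty() {
                return Some(PathBuf::from(value));
            }
        }
    }
    fallbacks
        .iter()
        .map(|p| PathBuf::from(*p))
        .find(|p| p.is_dir())
}
```

In a real build script the call might look like `find_cuda_root(|k| std::env::var_os(k), &["/usr/local/cuda"])`, where the fallback path is only an example of a common Linux location.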

Advanced: prebuilt libraries — for CI or to skip recompilation, set WHISPER_PREBUILT_PATH to a directory containing pre-compiled static libs. See docs/CACHING_GUIDE.md.

API Overview

Core Types

Type	Description	whisper.cpp equivalent
`WhisperContext`	Model context (`Send + Sync`)	`whisper_context*`
`WhisperState`	Transcription state (`Send` only)	`whisper_state*`
`FullParams`	Transcription parameters	`whisper_full_params`
`TranscriptionResult`	Text + timestamped segments	—
`WhisperStream`	Chunked real-time streaming	—
`WhisperStreamPcm`	Streaming from raw PCM input	`stream-pcm.cpp`
`WhisperVadProcessor`	Silero voice activity detection	`whisper_vad_*`
`EnhancedWhisperVadProcessor`	VAD + segment aggregation	—
`EnhancedWhisperState`	Transcription with temperature fallback	—
`WhisperQuantize`	Model quantization (feature)	`quantize.cpp`

Examples

Transcription with parameters:

use whisper_cpp_plus::{WhisperContext, TranscriptionParams};

let ctx = WhisperContext::new("model.bin")?;
let params = TranscriptionParams::builder()
    .language("en")
    .temperature(0.0)
    .enable_timestamps()
    .n_threads(4)
    .build();

let result = ctx.transcribe_with_params(&audio, params)?;
for segment in &result.segments {
    println!("[{:.2}s - {:.2}s] {}",
        segment.start_seconds(), segment.end_seconds(), segment.text);
}

Concurrent transcription:

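use std::sync::Arc;
// `files` is assumed to be a Vec<String> of audio paths
let ctx = Arc::new(WhisperContext::new("model.bin")?);

// Each thread gets its own WhisperState internally
let handles: Vec<_> = files.iter().cloned().map(|file| {
    let ctx = Arc::clone(&ctx);
    // `file` is owned here, so the closure meets spawn's 'static bound
    std::thread::spawn(move || ctx.transcribe(&load_audio(&file)))
}).collect();

// Join the workers and surface each thread's Result
for handle in handles {
    let text = handle.join().expect("worker thread panicked")?;
    println!("{}", text);
}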

Streaming:

use whisper_cpp_plus::{FullParams, SamplingStrategy, WhisperContext, WhisperStream};

let ctx = WhisperContext::new("model.bin")?;
let params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
let mut stream = WhisperStream::new(&ctx, params)?;

loop {
    let chunk = get_audio_chunk(); // your audio source
    stream.feed_audio(&chunk);
    let segments = stream.process_pending()?;
    for seg in &segments {
        println!("{}", seg.text);
    }
}

PCM streaming (WhisperStreamPcm):

use whisper_cpp_plus::{
    FullParams, PcmFormat, PcmReader, PcmReaderConfig, SamplingStrategy, WhisperContext,
    WhisperStreamPcm, WhisperStreamPcmConfig, WhisperVadProcessor,
};

let ctx = WhisperContext::new("model.bin")?;
let vad = WhisperVadProcessor::new("models/ggml-silero-vad.bin")?;
let params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });

let config = WhisperStreamPcmConfig {
    use_vad: true,
    vad_thold: 0.6,
    vad_silence_ms: 800,
    vad_pre_roll_ms: 300,
    length_ms: 10000,
    ..Default::default()
};

// The source must already yield raw PCM bytes matching PcmReaderConfig.
// For this config that means 16 kHz mono f32 little-endian PCM.
let source = std::fs::File::open("audio_f32_16khz_mono.pcm")?;
let reader = PcmReader::new(
    Box::new(source),
    PcmReaderConfig {
        buffer_len_ms: 10000,
        sample_rate: 16000,
        format: PcmFormat::F32,
    },
);

let mut stream = WhisperStreamPcm::with_vad(&ctx, params, config, reader, vad)?;

stream.run(|segments, _start_ms, _end_ms| {
    for seg in segments {
        println!("{}", seg.text.trim());
    }
})?;
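
The example above reads from a file that already contains 16 kHz mono f32 little-endian PCM. As a sketch of what that byte layout means, the helpers below serialize an in-memory `f32` buffer into the same raw form and wrap it in a `std::io::Cursor`, which implements `Read`. The names `f32_to_le_bytes` and `pcm_source` are hypothetical, and whether `PcmReader::new` accepts a boxed `Cursor` depends on trait bounds not shown in this README.

```rust
use std::io::Cursor;

/// Serializes f32 samples as raw little-endian bytes: 4 bytes per sample,
/// no header, no padding.
pub fn f32_to_le_bytes(samples: &[f32]) -> Vec<u8> {
    samples.iter().flat_map(|s| s.to_le_bytes()).collect()
}

/// Wraps the serialized samples in an in-memory reader that implements
/// `std::io::Read`.
pub fn pcm_source(samples: &[f32]) -> Cursor<Vec<u8>> {
    Cursor::new(f32_to_le_bytes(samples))
}
```

One second of audio in this format is 16000 samples, or 64000 bytes.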

Notes:

PcmReader does not decode WAV/MP3, resample audio, or convert stereo to mono. Your Read source must already be normalized to the format described by PcmReaderConfig.
WhisperStreamPcm::new(...) uses fixed-step mode or simple built-in VAD depending on use_vad.
WhisperStreamPcm::with_vad(...) uses an explicit WhisperVadProcessor (Silero VAD) and is the recommended path when you want Silero-based segmentation.
In VAD mode, no_context is forced internally to match stream-pcm.cpp.
In VAD mode, run() emits the next completed speech chunk in chronological order, and callers can usually append those segments directly.
In fixed-step mode, callbacks are produced from overlapping windows, so callers that build a cumulative transcript may need to reconcile repeated text across callbacks.
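
For the fixed-step case, one simple way to reconcile repeated text is to drop the longest run of words at the start of each new callback that already ends the accumulated transcript. The sketch below assumes exact, whitespace-delimited word matches, which real Whisper output will not always provide (punctuation and casing can differ between windows). `append_deduped` is a hypothetical helper, not a crate API.

```rust
/// Appends `next` to `transcript`, skipping the longest run of words at the
/// start of `next` that already ends `transcript`. Matching is exact and
/// word-based, so punctuation or casing differences defeat it.
pub fn append_deduped(transcript: &mut String, next: &str) {
    let prev: Vec<&str> = transcript.split_whitespace().collect();
    let new: Vec<&str> = next.split_whitespace().collect();
    let max = prev.len().min(new.len());

    // Largest k such that the last k words of `prev` equal the first k of `new`.
    let overlap = (0..=max)
        .rev()
        .find(|&k| prev[prev.len() - k..] == new[..k])
        .unwrap_or(0);

    let tail = new[overlap..].join(" ");
    if tail.is_empty() {
        return; // nothing new in this callback
    }
    if !transcript.is_empty() {
        transcript.push(' ');
    }
    transcript.push_str(&tail);
}
```

Inside the `run` callback, each `seg.text` would be passed through this helper instead of being appended directly. VAD mode should not need it, since chunks arrive in order without overlap.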

VAD preprocessing:

use whisper_cpp_plus::{WhisperVadProcessor, VadParams};

let mut vad = WhisperVadProcessor::new("models/ggml-silero-vad.bin")?;
let params = VadParams::default();
let segments = vad.segments_from_samples(&audio, &params)?;

for (start, end) in segments.get_all_segments() {
    let start_sample = (start * 16000.0) as usize;
    let end_sample = (end * 16000.0) as usize;
    let text = ctx.transcribe(&audio[start_sample..end_sample])?;
    println!("[{:.1}s-{:.1}s] {}", start, end, text);
}
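
The loop above converts segment boundaries from seconds to sample indices and slices the buffer directly. A slightly more defensive version of that conversion, sketched here under the assumption that segment times are in seconds and the audio is 16 kHz, clamps both ends so the slice can never go out of bounds. `sample_range` and `SAMPLE_RATE_HZ` are hypothetical names.

```rust
use std::ops::Range;

/// Sample rate assumed throughout this README.
pub const SAMPLE_RATE_HZ: f32 = 16_000.0;

/// Maps a [start, end) span in seconds to a sample index range that is
/// always valid for a buffer of `len` samples.
pub fn sample_range(start_s: f32, end_s: f32, len: usize) -> Range<usize> {
    // Float-to-integer `as` casts saturate: negative and NaN inputs become 0.
    let start = ((start_s * SAMPLE_RATE_HZ) as usize).min(len);
    let end = ((end_s * SAMPLE_RATE_HZ) as usize).min(len);
    // Never return a reversed range; an inverted span yields an empty slice.
    start..end.max(start)
}
```

With this helper the slice becomes `&audio[sample_range(start, end, audio.len())]`.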

Enhanced VAD with segment aggregation:

use whisper_cpp_plus::enhanced::{EnhancedWhisperVadProcessor, EnhancedVadParams};

let mut vad = EnhancedWhisperVadProcessor::new("models/ggml-silero-vad.bin")?;
let params = EnhancedVadParams::default();
let chunks = vad.process_with_aggregation(&audio, &params)?;

for chunk in &chunks {
    let text = ctx.transcribe(&chunk.audio)?;
    println!("[{:.1}s, {:.1}s long] {}", chunk.offset_seconds, chunk.duration_seconds, text);
}

Temperature fallback for difficult audio:

let params = TranscriptionParams::builder()
    .language("en")
    .build();
let result = ctx.transcribe_with_params_enhanced(&audio, params)?;
// Automatically retries with higher temperatures if quality thresholds aren't met

More examples in whisper-cpp-plus/examples/.

Enhanced Features

Beyond standard whisper.cpp bindings, this crate provides optimizations inspired by faster-whisper:

Intelligent VAD Preprocessing

EnhancedWhisperVadProcessor aggregates Silero VAD speech segments into optimal-sized chunks for transcription. Instead of transcribing hundreds of tiny segments, it merges adjacent speech into configurable windows — 2-3x faster on audio with significant silence.
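
To make the aggregation idea concrete, here is a minimal greedy merge over time-ordered speech spans. It is an illustration of the general technique, not the crate's implementation: the `Span` type, the `aggregate` function, and the `max_gap_s` and `max_chunk_s` parameters are all assumptions made for this sketch.

```rust
/// A speech span in seconds, as a VAD might report it.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Span {
    pub start: f32,
    pub end: f32,
}

/// Greedily merges time-ordered spans into chunks. A span joins the current
/// chunk when the silence before it is at most `max_gap_s` and the merged
/// chunk would not exceed `max_chunk_s`; otherwise it starts a new chunk.
pub fn aggregate(spans: &[Span], max_gap_s: f32, max_chunk_s: f32) -> Vec<Span> {
    let mut chunks: Vec<Span> = Vec::new();
    for &span in spans {
        if let Some(cur) = chunks.last_mut() {
            let gap_ok = span.start - cur.end <= max_gap_s;
            let len_ok = span.end - cur.start <= max_chunk_s;
            if gap_ok && len_ok {
                cur.end = span.end; // extend the current chunk across the gap
                continue;
            }
        }
        chunks.push(span);
    }
    chunks
}
```

Fewer, longer chunks mean fewer inference calls, which is where a speedup on silence-heavy audio would come from.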

Temperature Fallback

EnhancedWhisperState automatically retries transcription at higher temperatures when quality thresholds aren’t met (compression ratio, log probability, no-speech probability). Handles noisy/difficult audio without manual intervention.
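
The retry loop can be sketched independently of the crate. The version below takes a caller-supplied `decode` closure and a list of temperatures, and returns the first attempt that passes two of the quality checks named above. Everything here is illustrative: the `Attempt` type, the function names, and the threshold values are assumptions, and the no-speech check is omitted for brevity.

```rust
/// Quality signals for one decoding attempt (field names are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub struct Attempt {
    pub text: String,
    pub avg_logprob: f32,
    pub compression_ratio: f32,
}

// Assumed thresholds, chosen for illustration only.
pub const MAX_COMPRESSION_RATIO: f32 = 2.4;
pub const MIN_AVG_LOGPROB: f32 = -1.0;

/// An attempt passes when its text is not overly repetitive and the decoder
/// was reasonably confident.
fn acceptable(a: &Attempt) -> bool {
    a.compression_ratio <= MAX_COMPRESSION_RATIO && a.avg_logprob >= MIN_AVG_LOGPROB
}

/// Runs `decode` at each temperature in order and returns the first attempt
/// that passes the checks, or the last attempt if none pass.
/// Returns `None` only when `temperatures` is empty.
pub fn decode_with_fallback(
    temperatures: &[f32],
    mut decode: impl FnMut(f32) -> Attempt,
) -> Option<(f32, Attempt)> {
    let mut last = None;
    for &t in temperatures {
        let attempt = decode(t);
        if acceptable(&attempt) {
            return Some((t, attempt));
        }
        last = Some((t, attempt));
    }
    last
}
```

Starting at temperature 0.0 keeps clean audio deterministic; higher temperatures are only tried when the checks fail.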

Both features are orthogonal — use one, both, or neither. See docs/ARCHITECTURE.md for design details.

Models

Downloading

The easiest way to get test models:

cargo xtask test-setup

This downloads ggml-tiny.en.bin and the Silero VAD model into whisper-cpp-plus-sys/whisper.cpp/models/ using whisper.cpp’s own download scripts.

For production models, download from Hugging Face:

curl -L -o models/ggml-base.en.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin

Available Models

Model	Size	English-only	Multilingual
tiny	75 MB	tiny.en	tiny
base	142 MB	base.en	base
small	466 MB	small.en	small
medium	1.5 GB	medium.en	medium
large-v3	3.1 GB	—	large-v3

Safety

Thread Safety

WhisperContext: Send + Sync — share via Arc
WhisperState: Send only — one per thread
FullParams: not Send/Sync — create per transcription
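
Because a `Sync` type can be borrowed across threads, `Arc` is not the only way to share a context. The sketch below shows the borrowing pattern with `std::thread::scope` (Rust 1.63+), using a stand-in `Model` type so the example is self-contained. `Model` and `transcribe_all` are hypothetical; applying the same pattern to `WhisperContext` is an assumption based on the `Send + Sync` claim above.

```rust
use std::thread;

/// Stand-in for a shared, read-only model handle. It is `Send + Sync`
/// automatically because it only holds plain data.
pub struct Model {
    pub name: String,
}

impl Model {
    /// Placeholder "transcription": reports how many samples were seen.
    pub fn transcribe(&self, audio: &[f32]) -> String {
        format!("{}: {} samples", self.name, audio.len())
    }
}

/// Runs one worker per input while only borrowing the model. No `Arc` is
/// needed because scoped threads are joined before `scope` returns.
pub fn transcribe_all(model: &Model, inputs: &[Vec<f32>]) -> Vec<String> {
    thread::scope(|s| {
        // Spawn every worker first so they run concurrently.
        let handles: Vec<_> = inputs
            .iter()
            .map(|audio| s.spawn(move || model.transcribe(audio)))
            .collect();
        // Join in spawn order so results line up with `inputs`.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

`Arc` remains the better fit when workers outlive the function that created the context, for example in a long-running thread pool.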

Memory Safety

All unsafe FFI operations are encapsulated behind safe wrappers that add null pointer checks, lifetime enforcement, and RAII cleanup.

Troubleshooting

“Failed to load model” — check file path, permissions, available memory

“Invalid audio format” — must be 16kHz mono f32, normalized to [-1, 1]

Linking errors on Windows — install Visual Studio Build Tools 2022, ensure x64 MSVC toolchain. See docs/TECHNICAL_REFERENCE.md.