Whisper API on Windows: How dictate.app Gets to 200ms

Most developers who build on Windows know OpenAI's Whisper model. It's state-of-the-art for speech recognition. The standard question is where you run it - local CPU, local GPU, or cloud API. Each path has a radically different latency profile.

dictate.app is a Windows dictation app that runs Whisper through Groq's inference infrastructure instead of OpenAI's API or local processing. This post explains why that matters, how the pipeline works, and what the privacy model looks like for developers who care about those details.

Three Ways to Run Whisper on Windows

Option 1: Local CPU (whisper.cpp, faster-whisper)

Running Whisper locally on CPU is free, fully private, and works offline. It's also slow. On a modern mid-range laptop without a dedicated GPU, a 10-second audio clip takes 8–15 seconds to transcribe using the medium or large model. Even the tiny model - with noticeably worse accuracy - takes 2–4 seconds.

For batch transcription of recordings, this is fine. For push-to-talk dictation where you expect text to appear immediately after speaking, it's unusable.

Option 2: Local GPU (CUDA-accelerated)

If you have an Nvidia GPU with CUDA support, local Whisper transcription drops to 0.5–2 seconds depending on model size and VRAM. The large-v3 model on an RTX 3080 takes roughly 0.8–1.2 seconds for a typical dictation clip.

That's workable. But it requires GPU hardware that most users don't have available for a secondary task, and setup involves CUDA drivers, Python environments, and model downloads (1–3 GB). It's not something you hand to a non-technical user.

Option 3: Groq Whisper API (what dictate.app uses)

Groq built custom silicon - LPUs (Language Processing Units) - specifically optimized for transformer inference. Their Whisper Large v3 Turbo deployment is among the fastest publicly available Whisper inference services. dictate.app sends audio to Groq's API and receives the transcript in roughly 150–250ms, including the network round-trip.

That's faster than most GPU setups, with zero local hardware requirement, and it works on any Windows machine with internet access.

Latency Comparison

Groq Whisper API: ~200ms
Local GPU (3080): ~800ms
OpenAI Whisper API: ~900ms
Local CPU (medium): 8–15 seconds

Approximate values for a 5–10 second audio clip on typical consumer hardware. Network conditions affect cloud results.
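
To put those figures in relative terms, the short sketch below divides each approximate latency by the Groq figure. The numbers are copied from the comparison above (with both ends of the CPU range); the names LATENCY_MS and slowdown_vs_groq are made up for this illustration.

```python
# Approximate latencies from the comparison above, in milliseconds.
# The CPU entry is a range, so every entry is stored as (low, high).
LATENCY_MS = {
    "Groq Whisper API": (200, 200),
    "Local GPU (3080)": (800, 800),
    "OpenAI Whisper API": (900, 900),
    "Local CPU (medium)": (8_000, 15_000),
}


def slowdown_vs_groq(name: str) -> tuple[float, float]:
    """Return the (low, high) latency of `name` as a multiple of the Groq figure."""
    groq_low, groq_high = LATENCY_MS["Groq Whisper API"]
    low, high = LATENCY_MS[name]
    return low / groq_high, high / groq_low


if __name__ == "__main__":
    for name in LATENCY_MS:
        low, high = slowdown_vs_groq(name)
        label = f"{low:.1f}x" if low == high else f"{low:.1f}x to {high:.1f}x"
        print(f"{name}: {label}")
```

On these numbers the OpenAI API comes out at about 4.5x the Groq latency, and local CPU at 40x to 75x.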

The dictate.app Pipeline

The transcription flow is straightforward:

  1. Hotkey press - Ctrl+Space triggers audio recording from the default input device.
  2. Audio capture - Recorded as PCM audio, converted to the format Groq expects (16kHz mono, compressed for transmission).
  3. API call - Audio chunk sent to Groq's whisper-large-v3-turbo endpoint.
  4. Response - JSON response with transcript text, typically within 150–250ms.
  5. Paste - Text injected at the current cursor position via clipboard paste (Ctrl+V).

Releasing the hotkey stops the recording and triggers the API call. By the time you've moved your hands back to the keyboard, the text is already there.
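
Step 2 is the only part of the flow that involves real signal handling. The sketch below shows one way to downmix 16-bit interleaved PCM to mono, resample it to 16 kHz, and wrap it in a WAV container using only the Python standard library. It is not dictate.app's actual code: the function names are hypothetical, linear interpolation is a simplification (a production resampler would low-pass filter before downsampling), and the "compressed for transmission" step is left out.

```python
import io
import wave
from array import array


def downmix_to_mono(samples: array, channels: int) -> array:
    """Average interleaved 16-bit frames down to a single channel."""
    if channels == 1:
        return array("h", samples)
    mono = array("h")
    for i in range(0, len(samples) - channels + 1, channels):
        mono.append(sum(samples[i:i + channels]) // channels)
    return mono


def resample_linear(samples: array, src_rate: int, dst_rate: int = 16_000) -> array:
    """Resample mono 16-bit PCM by linear interpolation (no anti-alias filter)."""
    if src_rate == dst_rate or len(samples) == 0:
        return array("h", samples)
    out_len = len(samples) * dst_rate // src_rate
    last = len(samples) - 1
    out = array("h")
    for n in range(out_len):
        pos = n * src_rate / dst_rate  # fractional index into the source
        i = min(int(pos), last)
        j = min(i + 1, last)
        frac = pos - i
        out.append(round(samples[i] * (1.0 - frac) + samples[j] * frac))
    return out


def to_wav_bytes(mono: array, rate: int = 16_000) -> bytes:
    """Wrap mono 16-bit samples in an in-memory WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(mono.tobytes())
    return buf.getvalue()
```

A one-second stereo capture at 48 kHz would go through downmix_to_mono(samples, 2), then resample_linear(mono, 48_000), then to_wav_bytes(...), giving a payload of 16,000 mono frames ready to attach to the API call in step 3.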

Why Not OpenAI's Whisper API?

OpenAI also exposes a Whisper API. The accuracy is comparable. The latency is not. OpenAI's API typically returns in 700ms–1.5 seconds for short clips - fast enough for batch work, but perceptibly slow for push-to-talk dictation.

Groq's LPU architecture runs the same model roughly 4–5x faster. For a dictation tool where every millisecond shows up as a feeling, that difference is the product.

For developers building similar tools: the Groq API is OpenAI-compatible. You can keep your existing OpenAI SDK code, swap the base URL to point at Groq's endpoint, and set the model name to whisper-large-v3-turbo.
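
For readers who would rather see the request itself than an SDK call, here is a minimal sketch that builds the multipart upload by hand with the Python standard library. The endpoint path reflects Groq's OpenAI-compatible API as publicly documented at the time of writing; the helper names, the clip.wav filename, and the explicit User-Agent header are assumptions made for this example, not a description of dictate.app's internals.

```python
import json
import urllib.request
import uuid

# OpenAI-compatible transcription endpoint on Groq (check current docs).
GROQ_URL = "https://api.groq.com/openai/v1/audio/transcriptions"
MODEL = "whisper-large-v3-turbo"


def build_request(wav_bytes: bytes, api_key: str) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying the model name and the audio."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{MODEL}\r\n"
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="file"; filename="clip.wav"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return urllib.request.Request(
        GROQ_URL,
        data=head + wav_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            # Assumption: an explicit User-Agent, in case a gateway
            # rejects urllib's default one.
            "User-Agent": "whisper-sketch/0.1",
        },
    )


def transcribe(wav_bytes: bytes, api_key: str, timeout: float = 10.0) -> str:
    """Send the clip and return the "text" field of the JSON response."""
    request = build_request(wav_bytes, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["text"]
```

Calling transcribe(wav_bytes, api_key) with a valid key performs the network request. With the official OpenAI Python SDK, the equivalent is to construct the client with base_url set to https://api.groq.com/openai/v1 and call client.audio.transcriptions.create with the same model name.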

Audio Quality Considerations

Whisper's accuracy is sensitive to audio quality. dictate.app shows a real-time waveform during recording - partly for UX feedback, but also to help users calibrate microphone placement. A few things that affect accuracy:

Privacy Architecture

For developers and security-conscious users, the data path matters. Here's exactly what happens to your audio:

The full details are in our privacy policy. The short version: your voice goes to Groq and nowhere else, and Groq doesn't keep it.

Language Detection

Whisper is a multilingual model. Groq's deployment supports the same 70+ languages as the original model, with automatic language detection. dictate.app defaults to auto-detect and lets you pin a specific language in settings if you want deterministic behavior in multilingual environments.
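
At the API level, pinning a language usually comes down to one optional request field. The sketch below shows how a settings value could map to the form fields of a transcription request: the language field (a two-letter ISO 639-1 code in OpenAI-compatible transcription APIs) is omitted for auto-detect and included when pinned. The function name and the way the setting is passed in are hypothetical, not dictate.app's implementation.

```python
from typing import Optional


def transcription_fields(pinned_language: Optional[str] = None) -> dict[str, str]:
    """Form fields for a transcription request.

    With no pinned language the "language" field is left out, which leaves
    detection to the model. A pinned value is expected to be a two-letter
    ISO 639-1 code such as "en" or "de".
    """
    fields = {"model": "whisper-large-v3-turbo", "response_format": "json"}
    if pinned_language:
        code = pinned_language.strip().lower()
        if len(code) != 2 or not (code.isascii() and code.isalpha()):
            raise ValueError(f"expected an ISO 639-1 code, got {pinned_language!r}")
        fields["language"] = code
    return fields
```

Validating the code before the request goes out keeps a bad settings value from surfacing as an API error in the middle of dictation.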

Using dictate.app vs Building Your Own

If you're a developer evaluating whether to use dictate.app or build a similar tool yourself, the honest answer is: building it is not that complicated. The Groq API is well-documented, and the core transcription loop is maybe 50–100 lines of code.

What dictate.app handles that takes more work: global hotkey registration across Windows applications, reliable text injection into all app types (some use non-standard input handling), the waveform visualization, settings persistence, and the trial/licensing system.

If you want to experiment with the Groq Whisper API directly, their documentation is solid. If you want a ready-to-use tool that's already solved the Windows-specific integration problems, that's what dictate.app is.

Try dictate.app Free for 30 Days

Groq Whisper. ~200ms. Any Windows app. No account required to start.

No credit card · $8.99/month after trial · Privacy policy

See how dictate.app compares to other options on the homepage, or reach out at support@dictate.app with technical questions.