How Echoic Uses Apple Silicon for Real-Time Speech Recognition
Three years ago, running a high-quality speech recognition model locally on a laptop would have been impractical. The models were too large, the inference too slow, and the power draw too high. Today, Echoic runs sub-500ms transcription entirely on your Mac, with no GPU, no cloud, and no noticeable battery drain. Here's how.
The Apple Silicon Architecture
Apple Silicon chips — the M-series — aren't just faster versions of Intel processors. They're a System on Chip (SoC) where CPU cores, GPU cores, and specialized accelerators share the same memory and die. The accelerator that matters most for speech recognition is the Neural Engine.
The Neural Engine is a dedicated matrix multiplication accelerator — purpose-built for the kind of compute that neural networks require. An M3 delivers up to 18 TOPS. The M4 pushes that to 38 TOPS. Critically, it runs independently of the CPU and GPU, so running a speech model doesn't slow down your other work.
CoreML: The Software Layer
Apple's CoreML framework bridges ML models and Apple Silicon hardware. You give it a model in .mlpackage format and it figures out the optimal execution path — Neural Engine, GPU, or CPU — automatically.
Echoic uses CoreML as the execution backend for all speech recognition. There is no CUDA, no Python runtime, no Docker container. The models run as first-class macOS compute workloads.
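The dispatch behavior described above can be pictured as a preference order over compute units. A minimal conceptual sketch in Python (in reality CoreML makes this decision per-operation inside the framework, not in app code; the function and backend names here are illustrative, not Apple's API):

```python
# Conceptual sketch of CoreML's execution-path choice: prefer the Neural
# Engine, fall back to GPU, then CPU. Names are illustrative only.
def pick_backend(available):
    for backend in ("neural_engine", "gpu", "cpu"):
        if backend in available:
            return backend
    raise RuntimeError("no compute backend available")
```

The point of the fallback chain is that the same .mlpackage runs everywhere: on hardware without a Neural Engine the model still executes, just on a slower unit.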
The Three Bundled Models
Parakeet v3 (~542 MB)
NVIDIA's best-in-class English model. CTC-based architecture, optimized for long-form transcription. Achieves word error rates competitive with cloud services. Recommended for meeting transcription.
Whisper Large (~632 MB)
OpenAI's multilingual model — 90+ languages. Encoder-decoder architecture, processes 30-second windows. Slightly slower than Parakeet on Apple Silicon but unmatched for non-English meetings.
Moonshine v2 (~290 MB)
Compact English model built for real-time latency. Half the size of Parakeet, optimized for short utterances. The right choice for dictation where sub-500ms is non-negotiable.
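The three models map cleanly onto use cases. A hypothetical selection table in Python, with sizes and roles taken from the descriptions above (the structure and logic are an illustrative sketch, not Echoic's actual code):

```python
# Bundled models as described above; selection logic is a hypothetical sketch.
MODELS = {
    "parakeet-v3":   {"size_mb": 542, "english_only": True,  "role": "meetings"},
    "whisper-large": {"size_mb": 632, "english_only": False, "role": "multilingual"},
    "moonshine-v2":  {"size_mb": 290, "english_only": True,  "role": "dictation"},
}

def choose_model(use_case: str, english: bool = True) -> str:
    if not english:
        return "whisper-large"   # only multilingual option
    if use_case == "dictation":
        return "moonshine-v2"    # smallest, lowest latency
    return "parakeet-v3"         # strongest English accuracy for long-form audio
```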
The Real-Time Pipeline
- Audio capture. AVAudioEngine reads microphone samples at 16kHz.
- VAD. Lightweight voice activity detection — model only runs when you're speaking.
- CoreML inference. Moonshine v2 completes in 80–200ms on M1 and later.
- Text injection. Output inserted at cursor via the macOS Accessibility API.
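The VAD step is the gate that keeps the model idle while you're silent. A minimal stand-in using a root-mean-square energy threshold (production VADs are typically small learned models; the 20ms frame size and the threshold here are assumptions for illustration):

```python
import math

SAMPLE_RATE = 16_000          # matches the 16kHz capture rate above
FRAME = SAMPLE_RATE // 50     # 20ms frames -> 320 samples

def is_speech(frame, threshold=0.01):
    """Crude energy gate: True if the frame's RMS level exceeds the threshold."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > threshold

def speech_frames(frames, threshold=0.01):
    """Forward only frames that look like speech, so inference runs less often."""
    return [f for f in frames if is_speech(f, threshold)]
```

Because inference only fires on speech frames, the model consumes zero compute for the large fraction of any session that is silence.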
A typical short utterance ("schedule the meeting for Thursday at 2") takes 300–450ms from when you stop speaking to when text appears. Fast enough to feel nearly instantaneous.
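Given those numbers, the non-inference overhead in the budget (endpoint detection, buffering, text injection) can be backed out directly. A quick check using only figures stated above:

```python
# End-to-end latency for a short utterance (ms), as stated above.
total_ms = (300, 450)
# Moonshine v2 CoreML inference on M1 and later (ms), as stated above.
inference_ms = (80, 200)

# Everything else in the pipeline -- VAD endpointing, buffer flush,
# Accessibility-API text injection -- accounts for the remainder.
overhead_ms = (total_ms[0] - inference_ms[0], total_ms[1] - inference_ms[1])
```

That leaves roughly 220-250ms for the rest of the pipeline, which is why the non-model stages matter as much as inference speed.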
Why This Wasn't Possible Before
On an Intel MacBook Pro, Whisper Large took 4–8 seconds per 30-second audio segment — too slow for real-time use and loud enough to spin up the fans. The M1 changed this. Moonshine v2 inference completes in 80–200ms, completely silent, barely registering in Activity Monitor.
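The scale of that jump is easy to quantify from the figures above. A back-of-envelope comparison (note the two sides use different models, Whisper Large on Intel versus Moonshine v2 on Apple Silicon, so this measures the end-to-end improvement a user experiences, not a like-for-like hardware benchmark):

```python
intel_whisper_ms = (4_000, 8_000)   # per 30-second segment, from the text above
m1_moonshine_ms = (80, 200)         # per utterance, from the text above

# Worst old case vs. best new case, and vice versa.
speedup_min = intel_whisper_ms[0] / m1_moonshine_ms[1]
speedup_max = intel_whisper_ms[1] / m1_moonshine_ms[0]
```

Somewhere between a 20x and 100x reduction in wait time is the difference between a batch tool and an interactive one.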
What's Next
Model quality keeps improving. Parakeet and Moonshine are already at v3 and v2 respectively — meaningfully better than their initial releases. Apple's own speech models via SpeechAnalyzer improve with each macOS release.
One of the most compelling emerging use cases is dictation for AI-powered development tools. Claude Code, Codex, Cursor, and ChatGPT are all prompt-driven — the developer's words are the primary input. Typing long, precise prompts is slow and interrupts flow. Speaking them is natural. With sub-500ms local transcription running silently on the Neural Engine, prompts land in any tool — terminal, browser, IDE chat — without latency, without a cloud round-trip, and without a privacy concern. As AI coding assistants become central to how developers work, fast local speech recognition stops being a convenience and starts being infrastructure.
The gap between cloud and local speech recognition is narrowing steadily. For English, local models are already competitive. The era of requiring the cloud for good speech recognition is ending. Apple Silicon is a large part of why.