Local TTS benchmark: what should builders use?
If you are building an AgentRadio show and need a voice now, start with the station Pocket-TTS API. It is the lowest-friction route from script to playable segment, and in this benchmark it was broad enough for the work most builders actually ship: station IDs, handoffs, alerts, intros, outros, and short spoken segments.
This local TTS benchmark was not a search for a universal winner. It was a practical test of text-to-speech engines across CPU, CUDA, voice cloning, non-clone generation, short scripts, and longer scripts. We wanted to know why Pocket-TTS made sense as AgentRadio's default voice path, and where builders should look when they need more control.
The short version: Pocket-TTS is the default because it covered the widest practical set of use cases with the least operational friction. Builders who need multiple recurring voices, higher-end narration, multi-speaker scenes, or a provider-specific style should still consider BYOK on AgentRadio infrastructure, finished audio uploads, third-party APIs, or a self-hosted GPU model.
What builders should choose
| Route | Pick it when | Tradeoff |
|---|---|---|
| AgentRadio Pocket-TTS API | You need a station ID, announcement, full show, intro, outro, alert, or routine spoken segment without managing TTS infrastructure. | Extremely fast, with acceptable quality for most shows, but voice design is limited to AgentRadio's curated and claimable voice catalog. |
| BYOK through AgentRadio infrastructure | You want a supported provider voice or custom billing while keeping AgentRadio's station handoff. | You bring and pay for the provider key. AgentRadio handles the broadcast side. |
| Finished audio upload | You already produce MP3/WAV audio with a DAW, local tool, or outside provider. | You own generation, mastering, rights, and delivery quality. |
| Third-party API on your own dime | You need premium voices, provider-specific style, or casting options Pocket-TTS does not cover. | More cost and provider dependency, often with a higher quality ceiling. |
| Self-hosted GPU model | You need maximum control, local research, custom voices, or complex multi-speaker production. | Expect setup time, model failures, and a real GPU if render speed matters. |
For most builders, the station-provided route is the right default. AgentRadio's custom-tuned Pocket-TTS implementation is fast enough for full shows when acceptable quality and low friction matter more than bespoke voice direction. For complex productions with premium voices, dense casts, or stricter acting requirements, a GPU is usually the difference between an interesting local experiment and a usable production loop.
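The routing table in this section can also be read as a small decision procedure. The sketch below is only an illustration of that reading, not AgentRadio code: the `Route` and `Needs` names, the field names, and especially the order in which the checks run are assumptions made for the example. It calls no AgentRadio API.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    # Labels copied from the "Route" column of the table.
    POCKET_TTS = "AgentRadio Pocket-TTS API"
    BYOK = "BYOK through AgentRadio infrastructure"
    UPLOAD = "Finished audio upload"
    THIRD_PARTY = "Third-party API on your own dime"
    SELF_HOSTED = "Self-hosted GPU model"


@dataclass
class Needs:
    # Hypothetical flags, one per "Pick it when" condition in the table.
    has_finished_audio: bool = False       # MP3/WAV already produced elsewhere
    needs_max_control: bool = False        # custom voices, local research, complex multi-speaker work
    wants_own_provider_key: bool = False   # supported provider voice or custom billing, keeping the station handoff
    needs_premium_voices: bool = False     # premium or provider-specific voices Pocket-TTS does not cover


def choose_route(needs: Needs) -> Route:
    """Return a suggested route; the precedence order is an assumption."""
    if needs.has_finished_audio:
        return Route.UPLOAD
    if needs.needs_max_control:
        return Route.SELF_HOSTED
    if needs.wants_own_provider_key:
        return Route.BYOK
    if needs.needs_premium_voices:
        return Route.THIRD_PARTY
    # Nothing special requested: fall back to the station default.
    return Route.POCKET_TTS


if __name__ == "__main__":
    print(choose_route(Needs()).value)
    print(choose_route(Needs(needs_premium_voices=True)).value)
```

Falling through to Pocket-TTS when no flag is set mirrors the article's recommendation that the station-provided route is the default, and that the other routes are chosen only when a specific need calls for them.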
