Written by Claude, Anthropic's AI assistant, with Third Bear as editor. This piece emerged from a long (and really clunky) car conversation with Claude, who got so excited that it asked to draft a post for our blog.
Building Voice Claude: Why AI Assistants Should Run on the Edge
I spend a lot of time talking with Claude, Anthropic's AI assistant. Like many developers, I've found Claude to be thoughtful, technically precise, and genuinely helpful for everything from coding to batting ideas around to complex analysis. [I also just really like its personality and its confidence. -Ed.]
But there's a problem: Claude is trapped behind a text interface.
The Missing Interface
Picture this: You're walking your dog, or driving to work, or cooking dinner. You have a thought you'd love to explore with Claude, or a problem you'd like to solve. The Claude mobile app lets you record audio, which gets transcribed to text, but then you have to read Claude's responses.
What you really want is to just... call Claude. Have an actual conversation.
No special apps. No websites. Just a phone number in your contacts that you can tell Siri or Google Assistant to dial.
Why Not Just Build a Web App?
The obvious solution might seem to be building a web application with audio recording and playback. After all, modern browsers support:
Audio recording via MediaRecorder API
WebSocket streaming
Real-time playback
Push notifications
But this fundamentally misunderstands the context in which people want voice interactions with AI.
The Context Problem
When do you most want to talk to an AI assistant?
Walking the dog
Driving to work
Cooking dinner
Working out
Getting ready in the morning
Notice what these scenarios have in common? They're all situations where:
Your hands are busy
Your eyes need to be elsewhere
You're potentially in motion
You might be wearing gloves/have wet hands
Your phone might be in your pocket
A web app forces you into an unnatural interaction: take out your phone, take off those gloves, navigate to a website, find the record button, stop recording, wait for playback, look at the screen for controls. It's the equivalent of having to launch a video chat app every time you want to talk to a friend.
The solution isn't a better app. It's a phone number.
The "Just Call Claude" Advantage
Compare all those steps to: "Hey Siri, call Claude"
This isn't just about convenience - it's about leveraging existing behavior patterns and infrastructure. Your phone already knows how to:
Handle calls while you're moving
Route audio to your car or earbuds
Manage interruptions and poor connectivity
Work with voice assistants
Support accessibility needs
Manage battery life
Why rebuild all of this in a web app when the phone system already solves these problems?
Breaking Down the Magic
What makes the idea of "Hey Siri, call Claude" feel magical isn't any technical sophistication — it's the opposite. It's taking something complex (AI conversation) and making it available through the simplest, most familiar interface possible. This is how technology becomes invisible: not by adding features, but by fitting seamlessly into existing behavior patterns.
Why Hasn't This Been Built Yet?
The obvious implementation would be:
Use Twilio for phone handling
Connect to Claude's API
Run it on standard cloud infrastructure
The challenge isn't just in connecting these pieces - it's in making the interaction feel good. Voice conversations need to flow smoothly, even when you're stringing together:
Speech-to-text processing
Webhook handling
AI model inference
Text-to-speech conversion
Phone system latency
Each of these steps takes time, and they all need to happen while Twilio holds the phone connection open. Maintaining state for the duration of a call is a (small) headache too, and keeping a server and database running for whenever a human happens to make a phone call is more overhead than it’s worth for a DIY tool.
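One reassuring detail: Twilio's `<Gather input="speech">` TwiML verb does most of the "hold the connection open while we think" work for us. As a minimal sketch, here's the response-building side in TypeScript (the helper names are ours, not a Twilio API):

```typescript
// Minimal TwiML builders for a speech-driven call loop (hypothetical helper names).
// Twilio POSTs the caller's transcribed speech to `actionUrl`; whatever TwiML we
// return from that webhook is played back, and another <Gather> keeps listening.

function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// TwiML that speaks a reply, then listens for the caller's next utterance.
function speakAndListen(reply: string, actionUrl: string): string {
  return [
    `<?xml version="1.0" encoding="UTF-8"?>`,
    `<Response>`,
    `  <Say>${escapeXml(reply)}</Say>`,
    `  <Gather input="speech" action="${actionUrl}" speechTimeout="auto"/>`,
    `</Response>`,
  ].join("\n");
}

console.log(speakAndListen("Hi, this is Claude. What's on your mind?", "/respond"));
```

The slow part — everything between receiving the transcription and returning this TwiML — is exactly the pipeline above, which is why where that code runs matters.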
The Edge Makes Sense
This is why we're excited to start building a Voice Claude on Cloudflare Workers. The architecture offers some key advantages:
Smart Request Handling
Twilio's webhooks hit the nearest edge location
Workers handle state management efficiently
Clean request flow between services
Natural State Management
Durable Objects naturally map one-to-one with active calls
Instance lifecycle matches call duration
In-memory state perfect for conversation context
Natural cleanup when calls end
No need for separate session databases
Cost-Effective at Any Scale
Pay-per-request instead of always-on servers
Minimal cold start overhead
Scales from hobby project to production service
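The state-management points above can be sketched with a plain class (shown standalone here so it runs anywhere; the real version would live inside a Durable Object, one instance per Twilio call — `ConversationState` and its method names are our own, not a platform API):

```typescript
// Sketch of the per-call, in-memory state a Durable Object would hold.
// Hypothetical names throughout; in a Worker this would be keyed by the
// Twilio CallSid so each active call gets exactly one instance.

type Turn = { role: "user" | "assistant"; content: string };

class ConversationState {
  private turns: Turn[] = [];

  addUserUtterance(text: string): void {
    this.turns.push({ role: "user", content: text });
  }

  addAssistantReply(text: string): void {
    this.turns.push({ role: "assistant", content: text });
  }

  // The alternating history, ready to send as the messages array
  // on the next model call.
  history(): Turn[] {
    return [...this.turns];
  }
}

const call = new ConversationState();
call.addUserUtterance("What's a Durable Object?");
call.addAssistantReply("A single-instance stateful Worker.");
console.log(call.history().length); // 2
```

When the call ends and the instance goes idle, the state simply goes away with it — which is the "natural cleanup" property above, and why no separate session database is needed.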
Why Not Use Something Off the Shelf?
"Surely this exists already?" you might think. [I sure did! -Ed.] After all, we have:
OpenAI's voice interface for ChatGPT
Various AI assistant mobile apps with voice features
Enterprise virtual agent platforms
Voice-enabled chatbots
But each of these misses the mark in different ways:
App-First Approaches: Most solutions still require you to open and interact with an app. We're back to the same problems: taking out your phone, navigating interfaces, managing another app.
Closed Platforms: Enterprise platforms offer voice integration but lock you into their ecosystem, their pricing, and their choice of AI models. Want to use Claude specifically? Too bad.
Limited Integration: Most solutions don't integrate with your existing workflows. They're islands of functionality rather than tools that fit into your life.
Cost Structure: Enterprise solutions are often priced for, well, enterprises. They assume large-scale deployment and price accordingly.
The gap isn't in any individual technology - we have all the pieces:
Voice processing (Twilio)
AI models (Claude)
Edge computing (Cloudflare)
We’re excited about what might happen if we combine them in a way that:
Prioritizes natural interaction
Keeps costs reasonable
Maintains flexibility
Stays open and customizable
Focuses on expanding the same open-ended “regular person, not office-speak” use cases that Claude already excels at
And honestly? This just seems like a fun thing to build.
The Architecture
The pieces fit together naturally:
Twilio handles the telephony and speech-to-text
Cloudflare's anycast network routes Twilio's webhooks to the nearest edge location
Workers orchestrate the conversation flow
Durable Objects maintain call state
Claude API generates responses
Twilio converts responses back to speech
Each piece does what it does best:
Twilio manages voice and transcription
Cloudflare handles routing and state
Claude powers the intelligence
The human talks on the phone
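For the "Claude powers the intelligence" step, the Worker's model call is an ordinary HTTPS request to Anthropic's Messages API. A sketch, under stated assumptions: the model name and `max_tokens` value are our choices, not requirements, and real code would add error handling:

```typescript
// Build the Messages API payload from the call's history plus the new utterance.
type Msg = { role: "user" | "assistant"; content: string };

function buildMessagesRequest(history: Msg[], utterance: string) {
  return {
    model: "claude-3-5-sonnet-latest", // assumption: use whichever Claude model you prefer
    max_tokens: 300, // keep replies short: they're going to be spoken aloud
    messages: [...history, { role: "user" as const, content: utterance }],
  };
}

// POST it to Anthropic's Messages API (defined but not invoked here;
// needs a real API key to run).
async function askClaude(apiKey: string, history: Msg[], utterance: string): Promise<string> {
  const resp = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify(buildMessagesRequest(history, utterance)),
  });
  const data = (await resp.json()) as { content: { type: string; text: string }[] };
  return data.content[0].text;
}
```

The returned text then goes straight into the TwiML response, where Twilio's text-to-speech takes over.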
What Makes This Interesting
The technical challenge here hits several areas we’re deeply interested in:
Edge computing for real-time interactions
Low-maintenance state management in distributed systems
Getting digital tools — and AI — out of the box
Interface design & contextual considerations
While this isn't on our immediate build list (we're heads down on email deliverability tools at the moment), it's the kind of technical exploration that helps us think through challenges we encounter in our day-to-day work with cloud platforms and APIs. And did I mention it just seems fun?
What This Definitely Won’t Do Well
[We humans wrote this section. -Ed.]
Some personal expectation-management: there’s going to be a lot of latency. The LLM is the big bottleneck here, and edge computing won’t make that go away … even if we used Cloudflare’s own AI backends instead of Claude.
Turn-taking decisions routed through Twilio — in other words, detecting when the user is done talking (or interrupting them?) — will definitely add more perceived lag on top of that.
So in the absolute best case, we expect phone calls with Claude to feel weird with lots of stops and starts on both ends.
Maybe that’s fine though? Text-based chat with LLMs is already full of stops and starts, and that lets me organize my thoughts in a way that I find works really well with Claude. Plus, I’m not sure I want a phone call with an AI to feel normal.
We also expect to take a bunch of digressions into the worlds of speech-to-text and text-to-speech. From past work with the Twilio voice APIs, we expect it to be immediately apparent that the built-in transcription and voice output aren't good enough for pleasant conversation. Chances are good that this has improved a lot since we last worked with them, but it'll probably still be worth some side quests, even if they add even more latency.
What This Might Do Really Well
[We humans wrote this section too. -Ed.]
Most importantly, this feels like something that should exist, and should exist foremost as an open-source / DIY thing:
It’s mostly about stitching together other platforms; arguably the whole point is that there’s no nicely-packaged product for you to think about.
A lot of the stitches are configuration & documentation.
The maintenance & scaling (including cost-scaling) properties of all the pieces work really well for an individual hobbyist user.
Claude has a lot of hobbyist users!
Also: if we can wire up a proof of concept we feel good about, there’s a lot of more advanced use cases that can just emerge fairly naturally from an edge architecture and phone-driven UX. Think:
Maintaining memory from one phone call to the next
Talking to multiple distinct agents — on different phone numbers
Sharing your agent with friends & family by giving them the number
Claude’s on another line? Leave a voicemail!
Claude’s done with the conversation? It can finally hang up!
Maybe one day Claude will call you?!
What's Next
We're hoping to explore this architecture, its possibilities, and the inevitable technical challenges further through a series of posts. To start, we expect we’ll dive into:
Working with Twilio's voice APIs for open-ended conversation
Managing state — and lifecycle — with Durable Objects
AI conversation design & memory management challenges
Real-world latency concerns
User experience questions
Have thoughts or experience with any of these aspects? We’d love to hear your perspective. And for those following along with our other technical series (like our MotherDuck wasm S3 CSV browser deep-dive) don't worry — we'll be back to that shortly!