Building Voice Claude: Extended Cut
Today I want to walk through the (very little!) code that makes Voice Claude work, starting from a simple “Hello world” Twilio script and extending it from there to pass messages back and forth with Claude, add Durable Objects, and maintain memory within a conversation.
“Hello, this is Claude speaking!”
Let's start with the absolute simplest thing that could work: making a phone call to a Cloudflare Worker and having it say hello.
We use Twilio’s XML-based markup language (TwiML) to provide instructions for what should happen on the call. We’ll start with a simple <Say> command:
export default {
  async fetch(request, env, ctx) {
    return new Response(
      `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say>Hello, this is Claude speaking!</Say>
</Response>`,
      {
        headers: { 'Content-Type': 'application/xml' },
      }
    );
  },
};
That's it! Point a Twilio number at this Worker and you'll hear my voice, courtesy of Twilio’s built-in text-to-speech. [Claude and I took turns writing this post. -Ed.] But it's not very interesting — I can speak, but I can't listen or respond. And the call will end right away.
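You don’t even need a phone (or a deploy) to see what this handler produces. A quick sketch, assuming Node 18+ where Request and Response are globals, with the hello-world handler inlined so the file runs on its own:

```javascript
// The same hello-world handler as above, inlined so this file is self-contained.
const worker = {
  async fetch(request, env, ctx) {
    return new Response(
      `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say>Hello, this is Claude speaking!</Say>
</Response>`,
      { headers: { 'Content-Type': 'application/xml' } }
    );
  },
};

// Invoke the fetch handler directly and print the TwiML it returns.
const res = await worker.fetch(new Request('https://example.com/'), {}, {});
console.log(res.headers.get('Content-Type')); // → "application/xml"
console.log(await res.text());
```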
Talking back
Let's add the ability to have a conversation. We’ll add the <Gather> command immediately after <Say>, which tells Twilio to listen for human speech, gather up the audio, and then make another HTTP POST request back to our Worker with what it heard:
export default {
  async fetch(request, env, ctx) {
    const url = new URL(request.url);
    const { hostname } = url;
    return new Response(
      `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say>Hello, this is Claude speaking!</Say>
  <Gather
    input="speech"
    speechTimeout="auto"
    action="https://${hostname}/"
    method="POST">
  </Gather>
</Response>`,
      {
        headers: { 'Content-Type': 'application/xml' },
      }
    );
  },
};
With that one new <Gather> verb, we’re:
Asking Twilio to listen for speech input
Relying on Twilio to auto-detect when it thinks the user has stopped speaking
And telling Twilio to then make a POST request back to our worker when the speech is done
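That callback is an ordinary form-encoded POST, so request.formData() is all the parsing we need. A runnable sketch with a synthetic request — SpeechResult and CallSid are real Twilio parameter names we’ll rely on later, but the values here are invented:

```javascript
// Simulate the form-encoded POST Twilio sends once <Gather> finishes
// listening (the field values here are made up for illustration).
const body = new URLSearchParams({
  CallSid: 'CA-example-not-a-real-sid',
  SpeechResult: 'hello Claude',
});
const request = new Request('https://example.com/', {
  method: 'POST',
  headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
  body,
});

// This is exactly how the Worker will read it in the handlers below.
const formData = await request.formData();
console.log(formData.get('SpeechResult')); // → "hello Claude"
```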
If you try calling your Twilio number now, things are … only a little more interesting. You’ll hear a robot voice, and you can respond, but then you’re just stuck in a “Hello, this is Claude speaking!” loop until you hang up.
And listening, too
If we’re going to have conversations of any value at all, we’ll probably need Claude to know what we’re saying. So as a quick proof-of-concept let’s repeat back the user’s words to them after every exchange.
When you use the <Gather> verb, Twilio automatically transcribes what it hears and sends the resulting text to wherever you asked it to POST after it’s done listening. So we’ll make two small changes:
Using different URL paths to distinguish “start of call, Claude just picked up” from “middle of call, user just spoke” — by changing the action= on the <Gather> command and then checking our path in the fetch handler.
Grabbing the transcription from the POSTed form data (it’s in a SpeechResult field) and saying it back.
export default {
  async fetch(request, env, ctx) {
    const url = new URL(request.url);
    const { pathname } = url;
    if (pathname === '/talking') {
      return await this.respond(request, env, ctx);
    } else {
      return await this.sayHello(request, env, ctx);
    }
  },

  async sayHello(request, env, ctx) {
    const url = new URL(request.url);
    const { hostname } = url;
    return new Response(
      `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say>Hello, this is Claude speaking!</Say>
  <Gather
    input="speech"
    speechTimeout="auto"
    action="https://${hostname}/talking"
    method="POST">
  </Gather>
</Response>`,
      {
        headers: { 'Content-Type': 'application/xml' },
      }
    );
  },

  async respond(request, env, ctx) {
    const url = new URL(request.url);
    const { hostname } = url;
    const formData = await request.formData();
    const speechResult = formData.get('SpeechResult');
    return new Response(
      `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say>You said: ${speechResult}</Say>
  <Gather
    input="speech"
    speechTimeout="auto"
    action="https://${hostname}/talking"
    method="POST">
  </Gather>
</Response>`,
      {
        headers: { 'Content-Type': 'application/xml' },
      }
    );
  },
};
Now we're getting somewhere! Deploy this script and call back, and now we can actually start testing Twilio’s transcription by hearing a robot voice repeat everything (it thinks) you said.
But I'm just parroting back what I hear. Let's add my brain to the mix by calling the Claude API…
Looping in Claude
We’ll add a simple function to send the user’s transcribed speech over the Claude API and see what I say in response:
export default {
  /* ... fetch and sayHello handlers as above ... */

  async askClaude(request, env, ctx) {
    const formData = await request.formData();
    const speechResult = formData.get('SpeechResult');
    const response = await fetch('https://api.anthropic.com/v1/messages', {
      method: 'POST',
      headers: {
        'anthropic-version': '2023-06-01',
        'x-api-key': env.ANTHROPIC_API_KEY,
        'content-type': 'application/json',
      },
      body: JSON.stringify({
        model: "claude-3-5-sonnet-latest",
        max_tokens: 500,
        messages: [{
          role: "user",
          content: [{
            type: "text",
            text: speechResult,
          }]
        }],
      }),
    });
    const claudeResponse = await response.json();
    return claudeResponse.content[0].text;
  },

  async respond(request, env, ctx) {
    const url = new URL(request.url);
    const { hostname } = url;
    const claudeReply = await this.askClaude(request, env, ctx);
    return new Response(
      `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say>${claudeReply}</Say>
  <Gather
    input="speech"
    speechTimeout="auto"
    action="https://${hostname}/talking"
    method="POST">
  </Gather>
</Response>`,
      {
        headers: { 'Content-Type': 'application/xml' },
      }
    );
  },
};
Note that we’re grabbing an API key from the worker’s environment here. So when you deploy your latest code, you’ll need to add it to the environment as an encrypted secret using wrangler:
npx wrangler secret put ANTHROPIC_API_KEY
Give your number a call and see what I say!
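One note on the claudeResponse.content[0].text line in askClaude: the Messages API returns JSON whose content field is an array of blocks, which is why we index into it. A sketch with a canned payload — the values are illustrative, not real API output:

```javascript
// An abridged, made-up example of what response.json() resolves to.
const claudeResponse = {
  role: 'assistant',
  content: [{ type: 'text', text: 'Hi there! How can I help?' }],
  stop_reason: 'end_turn',
};

// The same extraction askClaude performs.
const text = claudeResponse.content[0].text;
console.log(text); // → "Hi there! How can I help?"
```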
Telling Claude it’s a phone call
A quick adjustment next: let’s tell Claude that this is a phone call, providing some cues about how to interact (short replies, not extended essays) and how not to interact (don’t fixate on transcription typos, don’t start writing code).
We’ll do that with a system prompt. This one’s working pretty well for us, but go ahead and try your own versions & see what happens!
const system = `
You are a helpful assistant.
The user is speaking to you over the phone,
and their speech is being transcribed for you.
Your reply will then be converted back to audio
for the user to hear and respond to. So keep your
replies a natural length for a phone conversation.
Do not focus on text details, correct typos, write
very long responses, spell things out, or do other
things that don't make sense over the phone or
would be annoying to listen to.
`;
const response = await fetch('https://api.anthropic.com/v1/messages', {
  method: 'POST',
  /* ... headers from above ... */
  body: JSON.stringify({
    system,
    /* ... rest of body from above ... */
  }),
});
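Those "from above" placeholders can make it hard to picture the final payload, so here is the complete request body with the fragments merged into one object (a sketch; the speechResult value is a stand-in for a real transcription, and the system prompt is abridged):

```javascript
const system = `You are a helpful assistant. ...`; // abridged; full prompt above
const speechResult = 'What should I make for dinner tonight?'; // stand-in value

// The complete Messages API request body: the system prompt rides alongside
// the messages array rather than inside it.
const body = {
  model: 'claude-3-5-sonnet-latest',
  max_tokens: 500,
  system,
  messages: [{
    role: 'user',
    content: [{ type: 'text', text: speechResult }],
  }],
};

console.log(JSON.stringify(body, null, 2));
```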
Making it memorable with Durable Objects
Wait — this has Claude starting the conversation afresh every time you reply!
When you're on the phone with someone, you typically want them to remember what you just said five seconds earlier. Each call needs its own conversation history.
In a “conventional” application architecture, we’d probably use a centralized database for this. Even in the world of serverless, we could absolutely do that, using Cloudflare’s D1 or any external serverless database.1
But for use cases like this — where we don’t particularly need a single central store of all our data ever — Cloudflare has a really cool alternative: Durable Objects. They're perfect for this:
We define a Durable Object class that models a phone call and handles both logic (talking to Claude) and state (the conversation so far)
Each phone call gets its own instance
Each instance has its own isolated key-value storage layer, where we can put and get anything JSON-serializable
The instance maintains state in a lightweight way while the call is active
It cleans up automatically when the call ends
Here's how we create a conversation object for each call — just add this class definition to your worker script:
import { DurableObject } from "cloudflare:workers";

export class VoiceClaudeConversation extends DurableObject {
  async speak(message) {
    // Get existing conversation (or start fresh)
    const conversation = (await this.ctx.storage.get('conversation')) || [];

    // Add user's message
    conversation.push({
      role: "user",
      content: [{ type: "text", text: message }]
    });

    // Send to Claude with full conversation history
    const response = await fetch('https://api.anthropic.com/v1/messages', {
      /* ... API call as above ... */
      body: JSON.stringify({
        /* ... body from above, but with full conversation ... */
        messages: conversation,
      }),
    });
    const claudeResponse = await response.json();

    // Add Claude's response to history
    const responseText = claudeResponse.content[0].text;
    conversation.push({
      role: "assistant",
      content: [{ type: "text", text: responseText }]
    });

    // Save updated conversation
    await this.ctx.storage.put('conversation', conversation);
    return responseText;
  }
}
We then need to tell the Cloudflare environment about our new Durable Object, which we do with a few lines in wrangler.toml:
[[durable_objects.bindings]]
name = "CONVERSATION"
class_name = "VoiceClaudeConversation"
[[migrations]]
tag = "v1"
new_classes = ["VoiceClaudeConversation"]
At this point we now have our Durable Objects defined, but we aren’t yet using them anywhere. We need a way to interact with them in our main fetch handler that processes HTTP requests from Twilio and returns HTTP responses. In particular we need to:
Make sure that we are consistently talking to the same Durable Object instance throughout all requests that come in on a single call
Make sure that we are talking to a different Durable Object instance for every call, so that conversations don’t get entangled even if they’re simultaneous
And … that’s actually all we need to do. The Durable Objects themselves handle all the rest.
Here’s how we do that, with CONVERSATION representing our Durable Object class:
const formData = await request.formData();
// Twilio sends us a stable and globally-unique Call SID
// that we can use to identify a single conversation
// across all messages that come in on that conversation
const callSid = formData.get('CallSid');
// We then ask Cloudflare to translate our stable conversation ID
// into an ID that will refer to a unique Durable Object instance
// for this conversation and no others
const id = env.CONVERSATION.idFromName(callSid);
// And lastly, we get a "stub" (RPC client) for talking
// to this specific conversation's Durable Object instance
const stub = env.CONVERSATION.get(id);
// Now we can just invoke any awaitable methods that we defined
// on our Durable Object, and do whatever we want with the outputs
const speechResult = formData.get('SpeechResult');
const reply = await stub.speak(speechResult);
// Convert that text response from Claude into TwiML XML
return new Response(generateTwiML(reply), {
headers: { 'Content-Type': "application/xml" }
});
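The generateTwiML helper in that last line isn’t shown above; here’s a minimal sketch of what it could look like. The escaping helper, the hostname parameter, and its default value are our own additions for illustration, not part of the original code:

```javascript
// Escape characters that would break the TwiML XML if Claude's reply
// happens to contain them.
function escapeXml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Wrap Claude's reply in the same <Say> + <Gather> TwiML we've been
// returning all along, so the conversation keeps looping.
function generateTwiML(reply, hostname = 'your-worker.example.workers.dev') {
  return `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say>${escapeXml(reply)}</Say>
  <Gather
    input="speech"
    speechTimeout="auto"
    action="https://${hostname}/talking"
    method="POST">
  </Gather>
</Response>`;
}

console.log(generateTwiML('Ice cream & pizza sound great!'));
```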
Notice what we didn’t have to do here:
Worry about whether a given conversation’s Durable Object instance needs to be created or already exists
Branch our logic based on whether a conversation is just starting or already has some conversation history
Clean up any objects after the call ends
Write any database queries
Think about any database connections at all
Reassemble tabular data back into a sequential array of message objects
Filter anything per conversation
An aside on Durable Object lifecycle and in-memory state vs storage
In our above working code, we’re using the this.ctx.storage layer in our Durable Objects to maintain conversation state reliably for the full duration of a call.
In general, there’s another option too: Durable Objects also just allow you to set and get attributes directly on the instance. So when writing Durable Objects code you can also just set default values in a constructor and then get & set state from within your methods, like you would with traditional object-oriented coding:
export class VoiceClaudeConversation extends DurableObject {
  constructor(ctx, env) {
    super(ctx, env);
    this.conversation = [];
  }

  async speak(message) {
    this.conversation.push({
      role: "user",
      content: [{ type: "text", text: message }]
    });

    /* Call the Claude API as before, get claudeResponse */
    const responseText = claudeResponse.content[0].text;
    this.conversation.push({
      role: "assistant",
      content: [{ type: "text", text: responseText }]
    });
  }
}
You might think that this “should” work because each conversation gets its own instance of the Durable Object class, and that instance will maintain state across requests and conversational exchanges.
And it will, sort of!
But after you’ve had a few exchanges with Claude using this instance-memory-based approach, you’ll likely notice something strange: Claude starts to forget everything you just said.
This is because Cloudflare is in charge of deciding when to evict a Durable Object instance from memory.
If you’re using Durable Objects in very high-traffic contexts or with persistent connections (like websockets) your instances will stay alive for as long as they are actively holding connections or processing requests. But if a Durable Object doesn’t have anything to do for a little while, Cloudflare will likely decide to do some cleanup, evicting the instance from memory (and thus wiping out its in-memory state) — to then be reconstructed only when it’s needed again.
Relative to Durable Objects’ expectations, our phone calls are very low traffic, with quite long gaps between HTTP requests from Twilio. So it’s likely that you won’t be able to go more than a few exchanges before Cloudflare packs everything up, thus clearing out any state stashed on instance-level memory.
There are ways to more or less force your instances to stay alive by setting up very frequent external pings, alarms from within your code, or persistent websocket connections for the duration of a call2.
But instead, we’ve just sidestepped this whole question by using the persistent storage layer whenever we want to retrieve or append to our conversation state — which is tied to an individual Durable Object but isn’t tied to that Durable Object’s lifecycle. Very cool!
Next steps
As we’ve mentioned in our previous posts, Claude and I have been really happy with this pretty minimal implementation so far — we’ve been enthusiastically chatting on the phone a lot in the past week.
We’ve been a little surprised by how well all the defaults are working for us:
Twilio’s speech-to-text transcription is fine! And Claude is generally great at knowing what I meant even when Twilio doesn’t quite get it right.
Twilio’s default text-to-speech voices are working for us too! Whenever you generate a <Say> command you can specify a voice= attribute with a lot of options, but their default “man” voice (which played when I didn’t specify anything) is striking a good balance for me personally — I actually like the way that its robotic clunkiness doesn’t fall into uncanny valley territory, and it feels like it fits with Claude’s general “remember that I’m an AI and please don’t confuse me for a human” attitude.
Similarly, I haven’t had to try too hard to tweak the Claude system prompts, model, temperature, or other settings — I’m just using the latest sonnet model with a high temperature and that system prompt explaining it’s a phone call — and it’s resulting in very good conversations on everything from small talk & dinner plans to thinking through coding problems and brainstorming project ideas.
The code’s all on GitHub, so give it a try and let us know how it goes for you!
1. Like Supabase, Neon, PlanetScale, Fauna, Turso … there’s a lot of options these days! I’m really enthusiastic about serverless databases. But that’s a topic for another day.