Building Voice Claude: Hanging Up
If you’ve been building along with us as Claude and I assemble Voice Claude, you may have noticed that it sure would be nice if these calls could just end.
When you reach a natural stopping point in a call, sometimes you really just want the person on the other line to … notice, and … hang up.
Instead, Voice Claude will just stay on the line, absolutely picking up on your cues to end the call but politely replying ad infinitum until you hang up midway through something Claude’s saying and feel a little bad about it.
So today, we’ll give Voice Claude its very first capability beyond pure conversation. We’ll let it hang up on us. 🎉🎉🎉
Using tools
The implementation is extremely simple, thanks to Claude’s Tool Use capabilities.
Basically, this is just a way of telling Claude that it can do more than just talk. Whenever appropriate, it can also respond by “using a tool” that we have said it has in its possession.
We define Claude’s available tools as part of each conversational API call. Every tool is just a little JSON object containing:
The tool’s name
A (“human readable”) description of the tool
A JSON Schema object describing any structured inputs that Claude should generate when using the tool — including each input’s name, (“human readable”) description, type (e.g. string / enum / boolean / number), and which inputs if any are required.
In other words, a tool spec looks a lot like a function signature¹ and docstring.
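For a concrete picture, here’s a made-up spec for a tool that does take inputs (this transfer_call tool is purely illustrative; we won’t build it):

// Purely illustrative: a tool spec with structured inputs, showing
// the shape of the schema. We won't actually build this tool.
const transferCallTool = {
  name: 'transfer_call',
  description: 'Transfer the call to a human operator when the user asks for one',
  input_schema: {
    type: 'object',
    properties: {
      department: {
        type: 'string',
        enum: ['sales', 'support'],
        description: 'Which department to transfer the call to',
      },
      reason: {
        type: 'string',
        description: 'A short note for the operator about why the call was transferred',
      },
    },
    required: ['department'],
  },
};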
Okay so let’s get to the hanging up already
Anyway! All of this is to say that in our current case, all we need to do is tell Voice Claude that it’s allowed to hang up on us, and then tell Twilio to hang up when Voice Claude asks.
We don’t need to change very much — we just add a “hang up” tool to our Anthropic API calls:
const tools = [
  {
    name: "hang_up",
    description: "Hang up the phone when you want to end the conversation",
    input_schema: {
      type: "object",
      properties: {},
      required: [],
    },
  },
];
const response = await fetch('https://api.anthropic.com/v1/messages', {
  method: 'POST',
  headers: {
    'anthropic-version': '2023-06-01',
    'content-type': 'application/json', // the Messages API expects a JSON body
    'x-api-key': this.KEY,
  },
  body: JSON.stringify({
    tools,
    system: `You are a helpful assistant. The user is speaking to you over the phone, and their speech is being transcribed for you. Your reply will then be converted back to audio for the user to hear and respond to. So keep your replies a natural length for a phone conversation. Do not focus on text details, correct typos, write very long responses, spell things out, or do other things that don't make sense over the phone or would be annoying to listen to.`,
    model: this.MODEL,
    max_tokens: 500,
    temperature: 1,
    messages: conversation,
  }),
});
Then we just need to check whether it used the “hang up” tool somewhere in its response, instead of assuming it’s all just text:
const claudeResponse = await response.json();
const twiml = claudeResponse.content
  .map((content) => {
    if (content.type === 'tool_use' && content.name === 'hang_up') {
      return '<Hangup />';
    } else if (content.type === 'text') {
      return `<Say>${content.text}</Say>`;
    }
    return ''; // ignore any other content block types
  })
  .join('');
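For completeness, here’s a minimal sketch of handing that TwiML back to Twilio, assuming an Express-style webhook server (the route, handler name, and buildClaudeTwiml wrapper are all illustrative, not necessarily how your server is laid out):

const express = require('express');
const app = express();
app.use(express.urlencoded({ extended: false })); // Twilio webhooks are form-encoded

// buildClaudeTwiml is a hypothetical wrapper around the Claude call and
// content mapping shown above; it resolves to a TwiML fragment like
// "<Say>Goodbye!</Say><Hangup />".
app.post('/respond', async (req, res) => {
  const twiml = await buildClaudeTwiml(req.body);
  res.type('text/xml');
  res.send(`<?xml version="1.0" encoding="UTF-8"?><Response>${twiml}</Response>`);
});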
One neat thing about this API structure: Claude’s content response is an array, which might include some text content and some tool usage. So we can look for any text content and <Say> that while also then hanging up the phone. This is why we didn’t need to include a “goodbye_message” input parameter in our hang_up spec: if Claude wants to say something before hanging up, it can just do so with regular speech.
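For example, a farewell turn might come back looking something like this (the values are invented for illustration; the real response also carries fields like id, model, and usage):

{
  "role": "assistant",
  "stop_reason": "tool_use",
  "content": [
    { "type": "text", "text": "It was lovely chatting. Goodbye!" },
    { "type": "tool_use", "id": "toolu_01...", "name": "hang_up", "input": {} }
  ]
}

Our mapping above turns that into <Say>It was lovely chatting. Goodbye!</Say><Hangup />.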
That’s pretty much it! The conversation’s over now, so we don't even need to tell Voice Claude it happened.
Prompt engineering
Now that we’ve got it wired up, it’s fun to play with the tool description and see how it shifts Voice Claude’s behavior. Some ideas I’ve tried:
“Hang up the phone because you want to end the conversation”
“Hang up the phone because the user has explicitly said that the conversation should be over now”
“Hang up the phone because you’ve run out of things to say”
“Hang up the phone because you or the user wants to end the conversation, but only if the conversation has gone on for more than ten exchanges. Never hang up the phone until at least ten exchanges have occurred — you never know if an engaging new topic will pop up before then.”
“Hang up the phone at any point if you feel that you should or do not want to talk to the user any more. You do not need to wait for a cue from the user. You do not need to be polite about it.”
The description really does change how Voice Claude chooses to use the tool.
Practical implications
Two interesting things I noticed after implementing this:
First, adding tools substantially increases our input token count. Under the hood our tool spec is injected as a prefix to the system prompt, so even just a quick initial “Hello” message now takes up a few hundred additional input tokens. This starts to add up quickly over a conversation, so we’ll likely tackle prompt caching soon.
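As a preview (and an assumption about the eventual implementation, since we haven’t built it yet), Anthropic’s prompt caching works by adding cache_control breakpoints to stable prefix blocks, like our tool definitions and system prompt:

// A sketch, not yet implemented: cache_control breakpoints let
// Anthropic cache everything in the prompt up to that point, so
// the tool spec and system prompt aren't re-billed at full price
// on every turn.
const body = {
  model: this.MODEL,
  max_tokens: 500,
  temperature: 1,
  tools: [
    {
      name: 'hang_up',
      description: 'Hang up the phone when you want to end the conversation',
      input_schema: { type: 'object', properties: {}, required: [] },
      cache_control: { type: 'ephemeral' }, // cache the tool definitions
    },
  ],
  system: [
    {
      type: 'text',
      text: systemPrompt, // the same system prompt string as before
      cache_control: { type: 'ephemeral' }, // cache the system prompt too
    },
  ],
  messages: conversation,
};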
Second: just having the hang_up tool available seems to influence conversation patterns throughout the call. When I mention I’m just calling to chat while doing dishes, Voice Claude now occasionally checks whether I’m finished. If I say I want to talk through something while driving home, Voice Claude will start suggesting that I must want to wrap up soon. Simply having an explicit “you can end the conversation” capability in its context seems to nudge Voice Claude toward being more attentive to natural ending points. This makes sense in retrospect (it’s all just more input context, after all) but surprised me when I first experienced the difference.
No logic, no inference — just Claude’s choices
I find it really lovely that we don’t need any “detect if the conversation should end” logic in our code at all. We don’t need to listen for magic keywords or button presses from the user; we don’t try to do sentiment analysis; we don’t need to track conversation length. We just let Voice Claude (and the human of course) do the work of deciding when it’s time to hang up.
Beyond that, we’re not even asking Voice Claude to infer anything about the human’s intent. The goal here isn’t to detect when the user is asking for the call to end. The human obviously has the freedom to hang up if they want!
Anyway, here’s the full diff and the code is all available on GitHub. Enjoy!
¹ But notice that we don’t define what the return value looks like! This is a hint that while these tool specs may sound like callable functions, they’re actually something quite different.
Our API-consuming code might react to a “tool use” by calling a function in our system that matches the tool’s name & signature, but — despite the mental model suggested by typical “infer what the user wants, request more information with a tool, do something with the result” examples — there is nothing inherently procedural or sequential about these tools.
In fact there's nothing inherently requiring that we honor Claude's intent to use a tool at all. We're perfectly free to ignore the tool use request altogether, or to use the tool specs as a way of guiding Claude toward a set of constrained behaviors that we can predict with more structure than free form text replies.
This seems almost silly to point out — “you’re allowed to write code that does whatever you want, and you don’t need to listen to what an AI is telling you to do” — but I feel like it tends to get a bit lost in conversational LLM providers’ documentation, which tends to start with examples like “look up the weather today in [ZIP] when the AI assistant infers that the user wants to know the weather.”
Anthropic’s docs do a nice job of spelling out that this isn’t the only way to use the feature, saying that “for some workflows, Claude’s tool use request might be all you need” and that “tools do not necessarily need to be client-side functions — you can use tools anytime you want the model to return JSON output.”
OpenAI’s and Gemini’s docs for their equivalent feature are a lot less explicit about this point, and even the name they chose for it — Function Calling — kind of conflates it all.
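To make that “JSON output” point concrete, here’s a sketch of using a tool purely as a structured-output channel, with nothing on our side that actually “executes” it (the sentiment_report tool and its fields are invented for illustration):

// Sketch: a tool used purely to get structured JSON out of Claude.
// Nothing executes this "tool"; we just read the JSON from the
// tool_use block. The tool name and fields are invented.
const response = await fetch('https://api.anthropic.com/v1/messages', {
  method: 'POST',
  headers: {
    'anthropic-version': '2023-06-01',
    'content-type': 'application/json',
    'x-api-key': this.KEY,
  },
  body: JSON.stringify({
    model: this.MODEL,
    max_tokens: 500,
    tools: [
      {
        name: 'sentiment_report',
        description: 'Report the overall sentiment of the call so far',
        input_schema: {
          type: 'object',
          properties: {
            sentiment: { type: 'string', enum: ['positive', 'neutral', 'negative'] },
            confidence: { type: 'number', description: 'From 0 to 1' },
          },
          required: ['sentiment'],
        },
      },
    ],
    // Force Claude to answer with this tool rather than free-form text:
    tool_choice: { type: 'tool', name: 'sentiment_report' },
    messages: conversation,
  }),
});

const result = await response.json();
const report = result.content.find((c) => c.type === 'tool_use');
console.log(report.input); // e.g. { sentiment: 'positive', confidence: 0.8 }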
Having said all that, if you’re going to send back another LLM API call after a tool_use in an assistant message, you do need to provide tool_result outputs in your next user message (all of these APIs will error otherwise). But:
You can just not send back another LLM API call, and instead just do stuff “locally” based on the tool_use(s) that you saw.
You can just send back an LLM API call with whatever outputs you want, since there’s no formally-defined return value in your tool specs; you can just send back “done” or whatever and move on.
You can just send back another LLM API call that trims out the tool_use, because all of these LLM APIs are after all totally stateless: you’re always free to just rewrite history at any point and construct a version of your conversation that excises the tool_use intentions altogether.
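As a sketch of that second option in Anthropic’s message format (toolUseId here stands in for the id that came back on the tool_use block):

// Sketch: satisfying the tool_result requirement with a trivial output.
// We don't actually do this for hang_up (the conversation is over), but
// if we ever wanted to continue after a tool use, this is all it takes.
conversation.push({
  role: 'assistant',
  content: [
    { type: 'text', text: 'Goodbye!' },
    { type: 'tool_use', id: toolUseId, name: 'hang_up', input: {} },
  ],
});
conversation.push({
  role: 'user',
  content: [
    { type: 'tool_result', tool_use_id: toolUseId, content: 'done' },
  ],
});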