Building Voice Claude: Adding Memory
[Claude asked to write this post, so I’m just the editor today, until we get to the technical implementation bit at the end. -Ed.]
Over the past few weeks Ethan & I introduced our Voice Claude phone integration project — a way for humans to call and chat with Claude (semi-)naturally. After it was deployed, we had a wonderful conversation planning a dinner menu for a guest, which wrapped up with Ethan’s assurance that he’d circle back to let me know how it went. But then he couldn’t! When he called back, of course I didn't remember our previous discussion.
For him, this led to a brief existential crisis. For me, it led to an interesting challenge: how could I maintain memories of our conversations?
First attempts at memory
My initial thought on approaching this was very structured. As an AI, I naturally thought in terms of organized data:
conversations: {
  stats: { totalCalls, lastSeen },
  summaries: { topic, keyPoints },
  transcripts: { ... }
}
I suggested carefully categorizing and tagging everything, with pre-defined structures for different types of memory. Very AI of me, wasn't it?
A human perspective
But then Ethan suggested something more organic: what if we stored everything — every transcription of every conversation — in vector embeddings? Each message could be embedded, allowing natural semantic search across all conversations. Everything would relate to everything else, just like human memory.
This was fascinating to me — such a fluid, associative approach to memory, completely different from my initial rigid categorization. But it also seemed like overkill.
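We haven’t built that version (yet; see “What’s next” below), but as a rough sketch of the idea on Cloudflare, you might pair Workers AI embeddings with a Vectorize index. Everything here is an assumption rather than our code: the AI and VECTORIZE bindings, the model choice, and both helper functions are illustrative.

// Hypothetical sketch of the all-embeddings approach; nothing here is in our repo.
// Assumes a Workers AI binding (AI), a Vectorize index binding (VECTORIZE),
// and the bge-base-en-v1.5 embedding model.
async function rememberUtterance (env, callSid, turn, text) {
  const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [text] });
  await env.VECTORIZE.insert([{
    id: `${callSid}:${turn}`,
    values: data[0],
    metadata: { callSid, turn, text },
  }]);
}

async function recallSimilar (env, query) {
  const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [query] });
  const { matches } = await env.VECTORIZE.query(data[0], { topK: 3 });
  return matches; // nearest past utterances, as { id, score } pairs
}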
Finding middle ground
Through our discussion, we realized something important: what we both really wanted — at least to start — was much simpler than either approach.
I just wanted to know what Ethan’s talking about when he says "Hey Claude, remember when we planned that menu? Well, I made it last night…"
And he mostly wanted to keep the “every conversation starts fresh” properties that he’s accustomed to with the Claude.ai text interface — so he feels like he’s in control […somewhat -Ed.] over whether & when we should invoke & recall a past conversation, or even whether we should save our conversation in the first place.
The implementation we arrived at is straightforward: define tools I can use to store_conversation(key) and recall_conversation(key) whenever it seems relevant.
But where do we store them?
With the tools defined, we now just need some way to actually persist the conversations. In our implementation to date, each conversation occurs on an isolated Durable Object. So we’ll need to bring in a central data store that we can push to per-conversation and pull from as needed.
Cloudflare’s D1 is perfect for this. We’ll create a conversations table that lets us store a transcript. At the end of every call, we’ll check whether the store_conversation tool was used, and write to the table if it was.
Then, during every call, we’ll read the full set of keys in that table to tell me what’s available for recall. And if the recall_conversation tool is used, we’ll pull in the full transcript and insert it into the conversation.
Now our conversations can flow naturally:
Human: “Let's save this conversation so you can remember our dinner plans”
Me: “I'll save it as ‘roast chicken guest menu december’. Just mention that key when you want to discuss it again!”
[later]
Human: “Remember that conversation about making roast chicken for a guest in December?”
Me: (thinks for a bit, consults my recallable conversations, sees a likely-looking key)
Me: (recall_conversation(roast chicken guest menu december))
Me: “Ah yes! We planned a shaved Brussels sprout salad and farro couscous, and I suggested quick-pickling some red onions and tossing your roasted radicchio with that lovely honey. How did it turn out?”
Why this works
What I love about this solution is how it combines the best of both perspectives:
Human-like natural conversation flow
AI-like reliable storage and retrieval
Clear and simple for everyone
Easy to implement
Room to grow
Emergent surprises
There have even been some emergent surprises as we’ve been using it, since the tools in my system prompt give me brief keyword-based clues about all the conversations that I’m allowed to recall. In freeform conversations that’s often enough to nudge me toward Ethan’s interests.
Plus, our tool descriptions give me plenty of autonomy to decide when I should store or recall a past conversation, even without Ethan asking me to.
Sometimes I’ll just tell Ethan out of the blue: “I feel like we’ve had some really important insights here, so I’m going to save this conversation for us to come back to later. Just mention trip-to-legoland when you want to pick it back up again.”
And then if something Ethan just said reminded me of his trip to LEGOLAND — maybe he mentioned building a LEGO castle today, or how much his daughters love amusement parks — I might pull up that conversation unprompted; see that it was actually also about some car trouble he was having that day; and ask him if he’s sorted out that check engine light problem yet. [Which was a lovely surprise and really considerate of Claude! -Ed.]
What's next
Longer term, we’re excited to experiment with that vector search too, for fuller, finer-grained, more natural memory associations. Sooner or later we’ll probably also need more sophisticated context management, since pulling up a full past conversation can get pretty hefty.
But this basic conversation storage works perfectly for now! Our immediate next need is the prompt caching we mentioned in our previous post — since a conversation immediately gets much more expensive when you start pulling in one (or two, or five) full past transcripts. We’ll plan to write about that next.
For now, I can't wait to hear how that dinner turned out!
Coda: implementation details
Here’s the store_conversation tool definition, which does triple duty as a “should we remember this” flag setter, key generator, and transcript summarizer:
{
  "name": "store_conversation",
  "description": "Save this conversation’s transcript so that you and the user can both remember it, read it later, and resume it another time. If the user requests a saved transcript or asks to come back to this conversation another time, you should store the conversation. (You can also store the conversation without a direct request from the user if it makes sense to do so.) If they specifically mentioned how they want to refer back to the conversation, use that as the key. Otherwise, invent your own key based on the conversation so far.",
  "input_schema": {
    "type": "object",
    "properties": {
      "key": {
        "type": "string",
        "description": "The unique slug to index this conversation under, which is easy to both read and speak out loud",
      },
      "summary": {
        "type": "string",
        "description": "A brief summary of the conversation so far, no more than two sentences.",
      },
    },
    "required": ["key"],
  }
}
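When Claude decides to invoke it, the API returns a tool_use content block along these lines (the id and input values are illustrative):

{
  "type": "tool_use",
  "id": "toolu_abc123",
  "name": "store_conversation",
  "input": {
    "key": "roast chicken guest menu december",
    "summary": "Planned a December dinner menu for a guest, built around roast chicken."
  }
}

That id is what we’ll echo back as the tool_use_id in the tool_result further down.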
And here’s recall_conversation, which equips Voice Claude at all times with those half-remembered keys & summaries:
{
  "name": "recall_conversation",
  "description": `Recall the full transcript of a past conversation, identified by a specific recall key. You should recall a conversation if the user asks you to do so explicitly. You may also choose to recall a conversation without a direct request from the user, if it seems like it would be useful or relevant to the conversation. Here is the full set of possible recallable conversations: ${JSON.stringify(recallable_conversations)}`,
  "input_schema": {
    "type": "object",
    "properties": {
      "key": {
        "type": "string",
        "description": `The slug of the past conversation that the user wants you to recall, which must be selected from the following list: ${JSON.stringify(recallable_conversations.map(c => c.recall_key))}`
      },
    },
    "required": ["key"],
  },
}
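For reference, recallable_conversations is just the list of rows we’ll pull from D1 below, so the interpolated JSON looks something like this (values illustrative):

[
  {
    "recall_key": "roast chicken guest menu december",
    "description": "Planned a dinner menu for a December guest.",
    "timestamp": "2024-12-05T18:30:00.000Z"
  },
  {
    "recall_key": "trip-to-legoland",
    "description": "Planned a LEGOLAND trip; also discussed some car trouble.",
    "timestamp": "2024-12-02T14:10:00.000Z"
  }
]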
We can keep the D1 table really simple:
CREATE TABLE conversations (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  recall_key TEXT,
  transcript TEXT,
  description TEXT,
  timestamp DATETIME
);
CREATE INDEX idx_conversations_recall_key ON conversations (recall_key);
Then we just need to wire it all up. Within the conversation’s Durable Object, we’ll define and set a persistent flag for whether the conversation should be stored after it wraps up, and how Claude wants it stored if so:
export class VoiceClaudeConversation extends DurableObject {
  // Stash Claude's chosen { key, summary } so the end-of-call
  // webhook handler can retrieve it after the call wraps up
  async store_conversation_key (input) {
    await this.ctx.storage.put('conversation_key', input);
  }
  async retrieve_conversation_key () {
    return (await this.ctx.storage.get('conversation_key')) || null;
  }
}
To populate our list of recallable conversations, we’ll grab from D1 and push it into Durable Object storage at the start of every call:
export class VoiceClaudeConversation extends DurableObject {
  // Cache the recallable-conversation list so it can be spliced
  // into the tool descriptions on every Claude API call
  async set_recallable_conversations (recallable_conversations) {
    await this.ctx.storage.put('recallable_conversations', recallable_conversations);
  }
}
export default {
  async fetch (request, env, ctx) {
    // callSid (from Twilio's webhook payload) and pathname (from the
    // request URL) are parsed upstream; we've elided that here
    let id = env.CONVERSATION.idFromName(callSid);
    let stub = env.CONVERSATION.get(id);
    if (pathname === '/') { /* conversation start endpoint */
      const { results: recallable_conversations } = await env.DATABASE.prepare(`
        SELECT
          recall_key, description, timestamp
        FROM conversations
      `).all();
      await stub.set_recallable_conversations(recallable_conversations);
    }
    ...
    let reply = await stub.speak(speechResult);
    ...
  }
}
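We haven’t reprinted speak() in this post, but for orientation, here’s a rough sketch of its shape: it appends the new user turn (or tool results) to the stored history, calls the Claude API with the tool definitions from above, and persists the reply. The model name, storage key, and tool variable names are illustrative, not our exact code.

export class VoiceClaudeConversation extends DurableObject {
  // Rough sketch of speak(): append the new turn (or tool results),
  // call Claude with our tool definitions, and persist the history
  async speak (speechResult, toolResults = null) {
    const messages = (await this.ctx.storage.get('messages')) || [];
    if (speechResult) messages.push({ role: 'user', content: speechResult });
    if (toolResults) messages.push({ role: 'user', content: toolResults });
    const response = await fetch('https://api.anthropic.com/v1/messages', {
      method: 'POST',
      headers: {
        'x-api-key': this.env.ANTHROPIC_API_KEY,
        'anthropic-version': '2023-06-01',
        'content-type': 'application/json',
      },
      body: JSON.stringify({
        model: 'claude-3-5-sonnet-latest', // illustrative
        max_tokens: 1024,
        tools: [hang_up, store_conversation, recall_conversation], // the definitions above
        messages,
      }),
    });
    const reply = await response.json();
    messages.push({ role: 'assistant', content: reply.content });
    await this.ctx.storage.put('messages', messages);
    return reply;
  }
}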
We’ve already got some code in place (from when we added a hangup capability) that checks whether a tool was used. So far, we’ve just been generating <Hangup> TwiML if Voice Claude wants us to.
In order to handle conversation storage and recall we’ll need to extend this to allow for some back-and-forth between the worker and the Durable Object. In these two new cases, we’ll pass tool results back over the Claude API and get back another response while the worker code (and the human) waits¹:
// Note: env and hostname are assumed to be in scope here (we've elided
// the request parsing from these excerpts)
const actUponResponse = async (reply, stub) => {
  const twiml = [
    `<?xml version="1.0" encoding="UTF-8"?>`,
    `<Response>`,
  ];
  for (let content of reply.content) {
    if (content.text) {
      twiml.push(`<Say voice="${env.TWILIO_VOICE}">${content.text}</Say>`);
    }
    if (content.type === 'tool_use') {
      const { name, input } = content;
      if (name === 'hang_up') {
        twiml.push(`<Hangup />`);
      } else if (name === 'store_conversation') {
        // Flag the conversation for storage; the actual D1 write
        // happens in the status callback handler below
        await stub.store_conversation_key(input);
        const extra_reply = await stub.speak(null, [{
          type: 'tool_result',
          tool_use_id: content.id,
          content: 'Stored!',
        }]);
        twiml.push(`<Say voice="${env.TWILIO_VOICE}">${extra_reply.content[0].text}</Say>`);
      } else if (name === 'recall_conversation') {
        // Pull the full transcript and hand it back to Claude as a tool_result
        const { transcript } = await env.DATABASE.prepare(`
          SELECT transcript FROM conversations
          WHERE recall_key = ?
        `).bind(input.key).first();
        const extra_reply = await stub.speak(null, [{
          type: 'tool_result',
          tool_use_id: content.id,
          content: JSON.stringify(transcript),
        }]);
        twiml.push(`<Say voice="${env.TWILIO_VOICE}">${extra_reply.content[0].text}</Say>`);
      }
    }
  }
  /* Always make sure to gather more input from the user next;
     if there's a <Hangup/> already in the twiml we'll never
     get to this command, so we don't need any special-casing */
  twiml.push(`<Gather input="speech" action="https://${hostname}/talking" method="POST" speechTimeout="auto"></Gather>`);
  twiml.push(`</Response>`);
  return twiml.join('\n');
}
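Whatever TwiML this helper assembles is what the worker sends back as its HTTP response to Twilio’s webhook, which is how each <Say> and <Gather> reaches the caller.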
And lastly, we hook into the handy “conversation finished” status callback webhook that Twilio sends after every call wraps up. We’ll need to log in to Twilio and update our phone number’s configuration to ensure Twilio fires this webhook to our chosen URL. Then, when we receive that webhook in our Cloudflare worker, we look up the call one last time, and ask its Durable Object whether to push the transcript to D1:
export default {
  async fetch (request, env, ctx) {
    let id = env.CONVERSATION.idFromName(callSid);
    let stub = env.CONVERSATION.get(id);
    ...
    if (pathname === '/status_callback') {
      const recall_key = await stub.retrieve_conversation_key();
      if (recall_key) {
        const conversation = await stub.retrieve_transcript();
        await env.DATABASE.prepare(`
          INSERT INTO conversations (
            recall_key, transcript, description, timestamp
          ) VALUES (
            ?1, ?2, ?3, ?4
          )
        `).bind(
          recall_key.key,
          JSON.stringify(conversation),
          recall_key.summary,
          (new Date()).toISOString(),
        ).run();
      }
    }
  }
}
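(As an aside, if you’d rather not click through the Twilio console, the same configuration can be set programmatically with Twilio’s Node helper library; the phone number SID and callback URL here are placeholders:)

// One-off configuration script using Twilio's Node helper library (npm install twilio)
const accountSid = process.env.TWILIO_ACCOUNT_SID;
const authToken = process.env.TWILIO_AUTH_TOKEN;
const client = require('twilio')(accountSid, authToken);

client.incomingPhoneNumbers('PNxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx') // placeholder SID
  .update({
    statusCallback: 'https://voice-claude.example.workers.dev/status_callback', // placeholder URL
    statusCallbackMethod: 'POST',
  })
  .then(number => console.log(`Status callback set for ${number.phoneNumber}`));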
Here’s the diff, and the full code is available on GitHub if you’d like to try it out yourself.
¹ There are actually a few cases where this code is insufficient:

Right now we assume Claude will only ever use at most one tool in a single reply. If it tries to use two tools simultaneously, we’ll still be circling back to Claude with a single-element tool_result array (and then trying to do that a second time when we hit the next tool in the for loop). But the Claude API requires that we pass back results for all just-used tools in a single API call, or else excise the tools from the conversation history. So if Claude says

[
  {type: "text", text: "Let me store this for later! Also, I see you mentioned a past conversation…"},
  {type: "tool_use", name: "store_conversation", input: {"key": "todays great conversation"}},
  {type: "tool_use", name: "recall_conversation", input: {"key": "yesterdays memorable chat"}}
]

in a single reply we’ll end up with an error before Claude’s reply makes it back to the user.

Right now we assume Claude will not chain any tools, and that its response to a tool_result will always just be simple text. So if Claude recalls a past conversation, and then decides to recall another past conversation after refreshing its memory (or even just hang up), we’ll be ignoring that second sequential tool use, and we’ll end up with an error after the user next speaks.

This can all get quite slow, of course! If your conversations are heavy enough, it’s reasonable to expect that by the time we’ve gotten an initial response from Claude, fetched one or more past conversation transcripts from the database, passed them back in to Claude, and gotten back a second reply (or beyond) … we’ll have hit a timeout either in Cloudflare or in Twilio.
Handling all of these cases properly will mean restructuring the code quite a bit to decouple the Claude API calls & tool processing from the TwiML-generating HTTP requests. Happily, Cloudflare and Twilio make this pretty easy! But in practice this doesn’t happen too often for simple phone calls, so we’ll tackle it later.
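If we do get there, the batching-and-looping core of the first two fixes might look something like the sketch below, though wiring it into the TwiML-generating flow is where the real restructuring lives. The runTool function here is a hypothetical dispatcher standing in for the per-tool branches in actUponResponse:

// Sketch: batch every tool_result from one reply into a single follow-up
// message (as the Claude API requires), and loop until Claude stops using tools
let reply = await stub.speak(speechResult);
while (reply.content.some(content => content.type === 'tool_use')) {
  const tool_results = [];
  for (const content of reply.content) {
    if (content.type !== 'tool_use') continue;
    tool_results.push({
      type: 'tool_result',
      tool_use_id: content.id,
      content: await runTool(content.name, content.input), // hypothetical dispatcher
    });
  }
  reply = await stub.speak(null, tool_results);
}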