Building Voice Claude: adding prompt caching
The surprising cost patterns of conversational LLMs
Conversations with LLM APIs have a weird cost pattern.
Every time the human continues the conversation, your next API call to the AI includes:
The system prompt, including any tools
The full conversation history so far
The AI “assistant”’s most recent response
The human’s new speech input
So what does that mean for costs? The math here is straightforward but kind of unintuitive. Let’s suppose that:
You’ve got a 1,000-token system prompt;
Each new speech input from the user is about 100 tokens long;
And each new AI reply is about 100 tokens long too.
Here’s what your input token usage looks like in this scenario:
Turn 1: Process 1000 tokens (system prompt) + 100 tokens (initial user speech) = 1100 tokens
Turn 2: Process 1100 tokens (history) + 100 tokens (AI reply) + 100 tokens (next user speech) = 1300 tokens
Turn 3: Process 1300 + 200 = 1500 tokens
Turn 4: 1700 tokens
Turn 5: 1900 tokens
Turn N: 1100 + (200 × (N - 1)) tokens
Since these APIs charge you per token, the input cost of each turn grows linearly with the turn number, and the total conversation cost therefore grows roughly quadratically.
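To make that concrete, here's a tiny script (using the made-up numbers above, not any real conversation) that prints the per-turn and cumulative input token counts:

// Per-turn and cumulative input tokens for the hypothetical scenario above:
// a 1,000-token system prompt, ~100 tokens per user turn, ~100 tokens per AI reply.
const SYSTEM = 1000;
const USER = 100;
const ASSISTANT = 100;

let cumulative = 0;
for (let turn = 1; turn <= 10; turn++) {
  // Input = system prompt + all user messages so far + all previous AI replies,
  // i.e. 1100 + 200 × (turn - 1).
  const inputTokens = SYSTEM + USER * turn + ASSISTANT * (turn - 1);
  cumulative += inputTokens;
  console.log(`Turn ${turn}: ${inputTokens} input tokens (cumulative: ${cumulative})`);
}
// Per-turn input grows linearly, so the cumulative total grows quadratically.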


How prompt caching works
Anthropic’s¹ prompt caching fundamentally changes this pattern.
I had a 23-turn conversation with Voice Claude lasting 16 minutes. Without prompt caching, I spent $0.17 on the API calls’ input tokens — about one cent per minute on average. With prompt caching, those costs drop to $0.03 total.
The pricing mechanics are:
Pay 25% more to write tokens to the cache — so for Sonnet, instead of paying $3/mTok for sending uncached input tokens, you pay $3.75/mTok to send uncached input tokens and write them to the cache.
Then pay 90% less for reads of those tokens in later turns — so for Sonnet, instead of paying $3/mTok for sending uncached input tokens, you pay $0.30/mTok to read from the cache.
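As a made-up example of what those two rates mean for a 10,000-token block that gets reused exactly once:

// Hypothetical: a 10,000-token prefix (system prompt + history) reused on one later turn.
const UNCACHED_PER_MTOK = 3.00;   // Sonnet input price
const WRITE_PER_MTOK = 3.75;      // +25% to write the cache
const READ_PER_MTOK = 0.30;       // -90% to read it back

const tokens = 10_000;
const withoutCaching = (tokens / 1e6) * UNCACHED_PER_MTOK * 2;          // sent twice in full: $0.0600
const withCaching = (tokens / 1e6) * (WRITE_PER_MTOK + READ_PER_MTOK);  // written once, read once: $0.0405

console.log(withoutCaching.toFixed(4), withCaching.toFixed(4));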
At a high-ish level, here's how it works:
When making an API call, you get to include a cache_control parameter, which you attach to one or more individual messages within your API call. This parameter serves a dual purpose: it both reads from existing cached content and writes new cache entries.
Attaching the cache_control parameter to a message doesn’t tell Anthropic to get_or_set a cache entry for that specific message — it tells Anthropic to get_or_set a cache entry for the entire conversation represented by your API call up to and including that specific message. You can include it up to four times in a single API call to create different cache boundaries, which gives you a lot of flexibility. You can read as much as possible from the cache, while also writing a new cache entry if needed.
Anthropic will use the longest matching prefix that’s found in the cache, and will read that cache entry before writing to the cache.
This means that a single API call can tell their systems to fetch a large cached block, append an incremental new suffix to that cached entry, and write the new content to cache, with almost all of that being a cache read expense — you’ll only pay the cache write cost for the small amount of new content being appended.
Using it in a conversation
In practice, for a multi-turn conversational use case, you would typically want to:
Create an initial cache boundary at the end of your system prompt (assuming it doesn't change between turns)
Also cache at turn N-1, to read from the previous cache entry
Also cache at the current turn N, to write a new cache entry for the next turn
This creates a chain of cached content, where each API call can leverage already-cached content for the large conversation history — substantially reducing costs and latency — and only needs to generate and write a small additional new increment because of the sequential nature of the cache.
Note that you have a fourth cache-boundary opportunity which is unused here. You could put that in your tools definition if your tools vary less frequently than the system prompt, since tools come first.
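For illustration, here's roughly what a request that uses all four boundary opportunities could look like. This is a sketch rather than the actual Voice Claude payload, and my understanding (not shown in the post's own code) is that the tools boundary goes on the last tool definition; every value below is a placeholder:

// Sketch: one request body using all four cache-boundary opportunities.
// Everything here is a placeholder for illustration, not the real Voice Claude payload.
const MODEL = "some-cache-supporting-model";
const systemPrompt = "You are a helpful voice assistant...";
const lastTool = { name: "recall_conversation", description: "...", input_schema: { type: "object" } };
const previousUserTurn = "...";
const previousReply = "...";
const newUserTurn = "...";

const body = {
  model: MODEL,
  max_tokens: 500,
  tools: [
    // ...any earlier tool definitions...
    { ...lastTool, cache_control: { type: "ephemeral" } },                          // boundary 1: the tools prefix
  ],
  system: [
    { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },     // boundary 2: the system prompt
  ],
  messages: [
    // ...older turns, all covered by the boundaries below...
    {
      role: "user",
      content: [{ type: "text", text: previousUserTurn, cache_control: { type: "ephemeral" } }], // boundary 3: turn N-1, should already be cached
    },
    { role: "assistant", content: previousReply },
    {
      role: "user",
      content: [{ type: "text", text: newUserTurn, cache_control: { type: "ephemeral" } }],      // boundary 4: turn N, written for the next call
    },
  ],
};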
Limitations & gotchas
There are some important limitations too.
Cache-busting changes: I mentioned that each cache entry includes the full previous content up to that boundary. Well, the system prompt is always the first part of any API call. And tools are essentially injected into your system prompt — as a prefix — under the hood.
This means that if you change the system prompt between conversation turns, you’ll be starting fresh with a new large cache write — the best you can hope for is to leverage cached tool definitions.
And if you add, remove, or change tool definitions between turns, you are completely unable to leverage the cache on your next turn, again starting fresh with an expensive cache write.
This also means that if you rewrite conversation history, allow users to branch, or gradually summarize older messages, you will get a lot less benefit from the cache.
Minimum cacheable size: There’s a minimum cache size (1024 tokens for Sonnet and Opus, double that for Haiku) — and anything smaller won't be cached even if you mark it for caching.
Of course the nice thing about that get-or-set design is that you don’t really need to think about this; you can just optimistically provide cache_control and get cacheless behavior if your conversation isn’t long enough yet.
Note that the extremely-cheap-read pricing structure creates some slightly unintuitive incentives here: for even a two-turn conversation, you may be better off from a raw cost perspective inflating your system prompt to reach a 1024-token input on the first turn so that you can get a cache read on your second turn.²
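If you want a quick way to sanity-check that for your own numbers, here's a small sketch of the two-turn comparison worked through in the footnote at the end of this post, assuming Sonnet pricing and that the second turn reads the whole cached first turn back:

// Is it worth padding the first request up to Sonnet's 1,024-token cache minimum,
// for a two-turn conversation? (firstTurnTokens = system prompt + first user message.)
function worthInflating(firstTurnTokens) {
  const minPadding = 1024 - firstTurnTokens;                  // padding needed to become cacheable
  // Caching costs 1.35× the base price for the padded prefix (1.25 write + 0.10 read),
  // versus 2× for sending the original, unpadded prefix uncached on both turns.
  const maxUsefulPadding = (2 / 1.35 - 1) * firstTurnTokens;  // ≈ 0.48 × firstTurnTokens
  return minPadding > 0 && minPadding < maxUsefulPadding;
}

console.log(worthInflating(650));   // false: would need too much padding
console.log(worthInflating(700));   // true: the break-even is around 692 tokens
console.log(worthInflating(1100));  // false: already past the minimum, no padding needed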
Five-minute TTL: Each cache entry will expire after five minutes of non-use. If you manage to read from the cache within those five minutes, you’ll get another five-minute extension. More on this in a minute!
Adding caching to Voice Claude
Adding caching to our Voice Claude implementation was trivial, and it completely transforms the costs:
  body: JSON.stringify({
-   system: system_prompt,
+   system: [{
+     type: 'text',
+     text: system_prompt,
+     cache_control: {
+       type: 'ephemeral',
+     },
+   }],
    tools: tools(recallable_conversations),
    model: this.MODEL,
    max_tokens: 500,
    temperature: 1,
-   messages: conversation,
+   messages: (() => {
+     // First pass: find indices of last two user messages
+     const userIndices = [];
+     conversation.forEach((msg, index) => {
+       if (msg.role === "user") {
+         userIndices.push(index);
+       }
+     });
+
+     // Get the last two user message indices
+     const lastTwoUserIndices = new Set(userIndices.slice(-2));
+
+     // Second pass: add cache boundaries at turns N-1 and N
+     return conversation.map((message, index) => {
+       let m = JSON.parse(JSON.stringify(message));
+
+       // Add cache control if this is one of the last two user messages
+       if (lastTwoUserIndices.has(index) && m.content && m.content[0]) {
+         m.content[0].cache_control = {"type": "ephemeral"};
+       }
+
+       return m;
+     });
+   })(),
  }),
});
To make this more concrete, here’s the token usage for a real 23-turn conversation that I had with Voice Claude:
That’s a total of 55,928 input tokens across 23 API calls. Without prompt caching, the total cost of this conversation was around $0.17³. With prompt caching, the cost drops to around $0.03⁴.
That’s because the cost growth pattern completely changes. Before caching, each turn gets more expensive as history grows. With caching, the first turn costs a little more than it used to, and then all subsequent turns cost a lot less. And, each of those subsequent turns costs roughly the same as one another — just the slightly-more-expensive cost to cache incremental new content + tiny fee to reuse the large cached history. This flattens the cumulative cost curve from quadratic-ish to approximately linear:

The cost calculator
After Claude and I talked this to death, we realized we could really use a good way to visualize and calculate these costs for conversational use cases.
Most calculators we found online were for single-turn predictions. So Claude built me an interactive calculator to explore how different conversation patterns affect the costs:
It lets you adjust parameters like system prompt size, new tokens per turn, and expected conversation length to see how the costs climb quickly upward without caching and how caching flattens the cost curve. The visualization really drives home the difference between the ever-accelerating growth of uncached conversations’ cost and the controlled growth with caching.
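If you'd rather poke at the numbers in code, here's my own minimal sketch of the kind of model such a calculator implements. It assumes Sonnet pricing, the simplified token counts from the top of this post, and that every prefix is big enough to cache:

// Approximate total input cost (in dollars) of an N-turn conversation.
function conversationInputCost({ systemTokens, tokensPerTurn, turns, cached }) {
  const UNCACHED = 3.00 / 1e6, WRITE = 3.75 / 1e6, READ = 0.30 / 1e6;
  let total = 0;
  for (let n = 1; n <= turns; n++) {
    // New content this turn: the whole prompt on turn 1, then one AI reply + one user message.
    const newTokens = n === 1 ? systemTokens + tokensPerTurn : 2 * tokensPerTurn;
    // Everything before that is history, already cached by the previous turn's boundary.
    const historyTokens = n === 1 ? 0 : systemTokens + (2 * n - 3) * tokensPerTurn;
    total += cached
      ? historyTokens * READ + newTokens * WRITE   // cheap read of history, small write of the new suffix
      : (historyTokens + newTokens) * UNCACHED;    // resend everything at full price
  }
  return total;
}

console.log(conversationInputCost({ systemTokens: 1000, tokensPerTurn: 100, turns: 23, cached: false }).toFixed(3));
console.log(conversationInputCost({ systemTokens: 1000, tokensPerTurn: 100, turns: 23, cached: true }).toFixed(3));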
Unexpected economics
The break-even calculation on caching is surprising.
Given the 25% more / 90% less rates, you only need about 0.28 future turns to break even on the cache write premium.
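The arithmetic behind that number, as I understand it, is just the one-time write premium divided by the per-read saving:

// Break-even: how many future cache reads pay back the one-time write premium?
const writePremium = 0.25;  // pay 25% extra, once, to write the cache
const readSaving = 0.90;    // save 90% on each later turn that reads it

console.log((writePremium / readSaving).toFixed(2));  // 0.28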
In other words, even if there's only a 30% chance that the human will decide to take another turn within five minutes, caching from the start is still worth it.
If your conversation is likely to continue but gaps might exceed five minutes, you face an interesting decision: do you let the cache expire, or do you send periodic “ping” requests to keep it alive?
You can extend the cache by sending another API call that reuses the already-cached input verbatim, with no additional uncached input tokens. Since you don't care about the output in this case, you can request a response that stops after a single token (max_tokens = 1) and then discard it.
This means that a ping would cost exactly the “90% less” price of reading your cached content, plus one single output token.
So you'd at least break even with up to nine keep-alive pings, or ~12 if you compare it to the “25% more” cost of a fresh full cache write. That's at least 45-60 minutes of inactivity where keeping the cache warm with pings still pays off, provided the user wanders back into the chat within that time window.
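Here's roughly what such a keep-alive ping could look like. This is a sketch, not code that's in Voice Claude; it assumes you can simply re-send the previous request body unchanged, apart from capping max_tokens at 1:

// Sketch: keep a conversation's cache warm by replaying the previous request
// with max_tokens: 1 and throwing the answer away. Because the prefix is identical,
// almost the whole input is billed at the cheap cache-read rate.
async function pingCache(apiKey, previousRequestBody) {
  const response = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "content-type": "application/json",
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
    },
    body: JSON.stringify({ ...previousRequestBody, max_tokens: 1 }),
  });
  await response.json();  // we only wanted the side effect of refreshing the cache TTL
}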
In other words, for a conversational use case, you really want to cache, and you probably want to keep the cache warm as long as you think the conversation might continue. The only scenarios where you wouldn't want to cache as aggressively as possible are if:
You're almost 100% certain this is the final turn (with the certainty threshold being a function of how many more turns you think the conversation might last if you’re wrong)
You're below the minimum cacheable token count (1024 tokens for Sonnet) and also don’t think it’s worth inflating your input to make it cacheable
The next turn will almost definitely be more than 45 minutes from now
Note that this is an even more unambiguous decision with a phone call, where:
The user won’t wander off for five minutes — if they do, the call is over!
The conversation has a discrete start and end — the user can’t just resume a past conversation days or months later.
It's a fascinating instance where the naive approach (“just cache everything”) might actually be more cost-effective than more complex approaches.
Anyway, tl;dr, I've updated the Voice Claude code with prompt caching now.
¹ Other LLM APIs also implement prompt caching, but they all differ substantially enough that I’m not going to try to generalize at all here. For example, my understanding is that OpenAI doesn’t give you any control over the caching but instead does it automatically; and that Gemini has a lower-level caching interface that you write to and read from with more traditional-looking distinct API calls.
² I asked Claude 3.7 Sonnet “Extended Mode” to do the math on this for me purely out of casual curiosity. Claude thought for 62 seconds, repeatedly second-guessing the calculations and approaching it from several different angles, which made me feel a little bad; and then came up with an answer. With Claude’s permission I’ve pasted both the thought process and the final answer below. I have not verified this answer but “more or less 692” seems intuitively plausible to me. So if you ever start a Claude conversation with 693 - 1023 input tokens, you’re probably wasting an extremely small amount of money! (Or as Claude helpfully mused, “this is a very marginal saving.”)
THINKING FOR 62 SECONDS:
Let's analyze the cost implications of inflating the system prompt to reach a cacheable input size for a two-turn conversation, using Claude 3 Sonnet.
First, let's establish the pricing from the blog post:
Regular input tokens: $3/mTok (or $0.003 per token)
Cache write cost: 25% more than regular, so $3.75/mTok (or $0.00375 per token)
Cache read cost: 90% less than regular, so $0.30/mTok (or $0.0003 per token)
Minimum cacheable size: 1024 tokens for Sonnet
Let's use these variables:
S = system prompt tokens (original)
U₁ = first user message tokens
A₁ = first assistant response tokens
U₂ = second user message tokens
I = inflation tokens (added to system prompt)
For a two-turn conversation, we'll compare two scenarios:
Scenario 1: No inflation, no caching Turn 1: Process S + U₁ tokens Turn 2: Process S + U₁ + A₁ + U₂ tokens
Total cost = (S + U₁) × $0.003 + (S + U₁ + A₁ + U₂) × $0.003 = $0.003 × (2S + 2U₁ + A₁ + U₂)
Scenario 2: Inflation to reach caching threshold We inflate the system prompt by I tokens so that S + I + U₁ ≥ 1024
Turn 1: Process (S + I) + U₁ tokens with cache_control enabled (cache write) Turn 2: Read (S + I) + U₁ from cache + process A₁ + U₂ new tokens
Total cost = ((S + I) + U₁) × $0.00375 + ((S + I) + U₁) × $0.0003 + (A₁ + U₂) × $0.003 = $0.00375 × (S + I + U₁) + $0.0003 × (S + I + U₁) + $0.003 × (A₁ + U₂) = $0.00405 × (S + I + U₁) + $0.003 × (A₁ + U₂)
For inflation to be cost-advantageous, Scenario 2 must be cheaper than Scenario 1:
$0.00405 × (S + I + U₁) + $0.003 × (A₁ + U₂) < $0.003 × (2S + 2U₁ + A₁ + U₂)
Simplifying: $0.00405 × (S + I + U₁) < $0.003 × (2S + 2U₁) $0.00405 × (S + I + U₁) < $0.006 × (S + U₁) $0.00405 × (S + U₁ + I) < $0.006 × (S + U₁)
Let's separate the S + U₁ terms from the inflation I term: $0.00405 × (S + U₁) + $0.00405 × I < $0.006 × (S + U₁) $0.00405 × I < $0.006 × (S + U₁) - $0.00405 × (S + U₁) $0.00405 × I < $0.00195 × (S + U₁) I < $0.00195 / $0.00405 × (S + U₁) I < 0.4815 × (S + U₁)
So inflation is advantageous when I < 0.4815 × (S + U₁).
But we also need to ensure that S + I + U₁ ≥ 1024 for caching to work.
So, we have two constraints:
I < 0.4815 × (S + U₁) for cost advantage
S + I + U₁ ≥ 1024 for caching to work
From constraint 2: I ≥ 1024 - (S + U₁)
For both constraints to be satisfied: 1024 - (S + U₁) < 0.4815 × (S + U₁) 1024 < (S + U₁) + 0.4815 × (S + U₁) 1024 < (S + U₁) × (1 + 0.4815) 1024 < (S + U₁) × 1.4815 (S + U₁) > 1024 / 1.4815 (S + U₁) > 691.19...
So, inflation becomes cost-advantageous when:
The original system prompt plus first user message (S + U₁) exceeds approximately 691 tokens
The inflation amount I is less than 48.15% of the original tokens (S + U₁)
The inflated input reaches at least 1024 tokens
Let me provide a concrete example to verify:
If S + U₁ = 700 tokens
Maximum beneficial inflation: I < 0.4815 × 700 = 337 tokens
Minimum inflation needed: I ≥ 1024 - 700 = 324 tokens
So if S + U₁ = 700, you could add between 324 and 337 inflation tokens to achieve a cost advantage.
Let's verify with actual cost calculations:
Scenario 1: No inflation Turn 1: 700 tokens × $0.003 = $2.10 Turn 2: (700 + A₁ + U₂) tokens × $0.003 ≈ (700 + 200) × $0.003 = 900 × $0.003 = $2.70 Total: $4.80
Scenario 2: With 330 inflation tokens Turn 1: 1030 tokens × $0.00375 = $3.86 Turn 2: 1030 tokens × $0.0003 + 200 tokens × $0.003 = $0.31 + $0.60 = $0.91 Total: $4.77
So with inflation, we save a small amount ($0.03).
Hmm, I think I made an error in my calculations. Let me recalculate with the per-token costs in dollars:
Scenario 1: No inflation, no caching Turn 1: (S + U₁) tokens at $0.003/token Turn 2: (S + U₁ + A₁ + U₂) tokens at $0.003/token
Total cost = $0.003(S + U₁) + $0.003(S + U₁ + A₁ + U₂) = $0.003(2S + 2U₁ + A₁ + U₂)
Scenario 2: With inflation and caching Turn 1: (S + I + U₁) tokens at $0.00375/token (cache write) Turn 2: (S + I + U₁) tokens at $0.0003/token (cache read) + (A₁ + U₂) tokens at $0.003/token
Total cost = $0.00375(S + I + U₁) + $0.0003(S + I + U₁) + $0.003(A₁ + U₂) = ($0.00375 + $0.0003)(S + I + U₁) + $0.003(A₁ + U₂) = $0.00405(S + I + U₁) + $0.003(A₁ + U₂)
For inflation to be cost-advantageous: $0.00405(S + I + U₁) + $0.003(A₁ + U₂) < $0.003(2S + 2U₁ + A₁ + U₂)
Simplifying: $0.00405(S + I + U₁) < $0.003(2S + 2U₁) $0.00405(S + I + U₁) < $0.006(S + U₁)
Separating the inflation term: $0.00405(S + U₁) + $0.00405I < $0.006(S + U₁) $0.00405I < $0.006(S + U₁) - $0.00405(S + U₁) $0.00405I < $0.00195(S + U₁) I < ($0.00195/$0.00405)(S + U₁) I < 0.4815(S + U₁)
So inflation is beneficial when I < 0.4815(S + U₁) and S + I + U₁ ≥ 1024.
From the second constraint: I ≥ 1024 - (S + U₁)
For both constraints to be satisfied: 1024 - (S + U₁) < 0.4815(S + U₁) 1024 < (S + U₁) + 0.4815(S + U₁) 1024 < (S + U₁)(1 + 0.4815) 1024 < (S + U₁)(1.4815) (S + U₁) > 1024/1.4815 ≈ 691.2
So, inflation becomes cost-advantageous when:
The original system prompt plus first user message (S + U₁) exceeds approximately 691 tokens
The inflation amount I is less than 48.15% of the original tokens (S + U₁)
The inflated input reaches at least 1024 tokens
Let's do a quick check with an example: If S + U₁ = 700:
Maximum beneficial inflation: I < 0.4815 × 700 ≈ 337 tokens
Minimum inflation needed: I ≥ 1024 - 700 = 324 tokens
So if S + U₁ = 700, you could add between 324 and 337 inflation tokens to gain a cost advantage.
Let's calculate the costs with more realistic numbers (in dollars, not dollars per mTok):
Scenario 1: No inflation (S + U₁ = 700) Turn 1: 700 tokens × $0.003/1000 = $0.0021 Turn 2: 900 tokens (assuming A₁ + U₂ = 200) × $0.003/1000 = $0.0027 Total: $0.0048
Scenario 2: With 330 inflation tokens (S + I + U₁ = 1030) Turn 1: 1030 tokens × $0.00375/1000 = $0.0038625 Turn 2: 1030 tokens (cache read) × $0.0003/1000 + 200 tokens × $0.003/1000 = $0.000309 + $0.0006 = $0.000909 Total: $0.0047715
With inflation, we save approximately $0.0000285, or about 0.6% of the total cost. This is a very marginal saving.
I think I made an error in my unit conversion. Let me try again:
Scenario 1: No inflation (S + U₁ = 700) Turn 1: 700 tokens × ($3/1,000,000) = $0.0021 Turn 2: 900 tokens × ($3/1,000,000) = $0.0027 Total: $0.0048
Hmm, but the blog post said "$0.17 for a 4289-token input", which would suggest $0.17/4289 ≈ $0.000040 per token, which is much lower than my calculated $0.003 per token.
Let me reconsider the pricing information. The blog says "$3/mTok" which means $3 per million tokens, so $0.000003 per token.
Let me recalculate:
Scenario 1: No inflation, no caching
Regular input cost: $3/mTok = $0.000003 per token
Turn 1: (S + U₁) tokens at $0.000003/token
Turn 2: (S + U₁ + A₁ + U₂) tokens at $0.000003/token
Scenario 2: With inflation and caching
Cache write cost: $3.75/mTok = $0.00000375 per token
Cache read cost: $0.30/mTok = $0.0000003 per token
Turn 1: (S + I + U₁) tokens at $0.00000375/token (cache write)
Turn 2: (S + I + U₁) tokens at $0.0000003/token (cache read) + (A₁ + U₂) tokens at $0.000003/token
The math works out the same way, just with different units.
For a concrete example: If S + U₁ = 700:
Maximum beneficial inflation: I < 0.4815 × 700 ≈ 337 tokens
Minimum inflation needed: I ≥ 1024 - 700 = 324 tokens
Scenario 1: No inflation (S + U₁ = 700) Turn 1: 700 tokens × $0.000003 = $0.0021 Turn 2: 900 tokens × $0.000003 = $0.0027 Total: $0.0048
Scenario 2: With 330 inflation tokens (S + I + U₁ = 1030) Turn 1: 1030 tokens × $0.00000375 = $0.00386 Turn 2: 1030 tokens (cache read) × $0.0000003 + 200 tokens × $0.000003 = $0.000309 + $0.0006 = $0.000909 Total: $0.004769
Wait, that doesn't look right. Let me check the math again.
Turn 1: 1030 tokens × $0.00000375 = $0.0038625 Turn 2: 1030 tokens (cache read) × $0.0000003 = $0.000309 200 tokens (new) × $0.000003 = $0.0006 Turn 2 total = $0.000909 Total cost = $0.0038625 + $0.000909 = $0.0047715
OK, so with inflation, we save approximately $0.0000285, which is about 0.6% of the total cost.
This matches our earlier theoretical derivation: inflation becomes cost-advantageous when:
The original system prompt plus first user message (S + U₁) exceeds approximately 691 tokens
The inflation amount I is less than 48.15% of the original tokens (S + U₁)
The inflated input reaches at least 1024 tokens
In our example with S + U₁ = 700, we could add between 324 and 337 inflation tokens to gain a cost advantage. But the savings are very marginal.
Let me see if I can get a clearer solution.
The key question is: at what original token count does inflation start to make economic sense?
We established that for inflation to be cost-effective, these conditions must be met:
I < 0.4815(S + U₁)
S + I + U₁ ≥ 1024
From constraint 2, the minimum inflation needed is: I = 1024 - (S + U₁)
For this minimum inflation to be cost-effective, we need: 1024 - (S + U₁) < 0.4815(S + U₁) 1024 < (S + U₁) + 0.4815(S + U₁) 1024 < (S + U₁)(1 + 0.4815) 1024 < (S + U₁)(1.4815) (S + U₁) > 1024/1.4815 ≈ 691.2
So, the inflation becomes cost-advantageous when the original input (system prompt + first user message) exceeds approximately 691 tokens.
Let's verify with specific values:
When S + U₁ = 690 tokens:
Minimum inflation needed: I = 1024 - 690 = 334 tokens
Maximum beneficial inflation: I < 0.4815 × 690 ≈ 332 tokens Since 334 > 332, inflation is NOT cost-effective at 690 tokens.
When S + U₁ = 691 tokens:
Minimum inflation needed: I = 1024 - 691 = 333 tokens
Maximum beneficial inflation: I < 0.4815 × 691 ≈ 332.7 tokens Since 333 > 332.7, inflation is still NOT quite cost-effective.
When S + U₁ = 692 tokens:
Minimum inflation needed: I = 1024 - 692 = 332 tokens
Maximum beneficial inflation: I < 0.4815 × 692 ≈ 333.2 tokens Since 332 < 333.2, inflation becomes cost-effective starting at 692 tokens.
So, the precise answer is that system prompt inflation becomes cost-advantageous when the original system prompt plus first user message is at least 692 tokens.
Actually, let me solve this more precisely to avoid rounding errors.
We have:
S + I + U₁ ≥ 1024 (for caching to work)
I < 0.4815(S + U₁) (for cost advantage)
From constraint 1, minimum I = 1024 - (S + U₁)
For cost advantage with minimum inflation: 1024 - (S + U₁) < 0.4815(S + U₁) 1024 < (S + U₁) + 0.4815(S + U₁) 1024 < (S + U₁)(1.4815) (S + U₁) > 1024/1.4815
Let's calculate this exactly: (S + U₁) > 1024/1.4815 = 691.19135...
So, we need (S + U₁) > 691.19135... tokens. Since tokens are discrete, we need at least 692 tokens.
Let's double-check: At (S + U₁) = 692:
Minimum inflation: I = 1024 - 692 = 332 tokens
Maximum beneficial inflation: I < 0.4815 × 692 ≈ 333.198 tokens Since 332 < 333.198, inflation is cost-effective.
Final answer: System prompt inflation becomes cost-advantageous when the original system prompt plus first user message (S + U₁) is at least 692 tokens.
Let me simplify this formula further, because it's really a matter of solving this inequality: 1024 - (S + U₁) < 0.4815(S + U₁) Let's solve for S + U₁ directly: 1024 - (S + U₁) < 0.4815(S + U₁) 1024 < (S + U₁) + 0.4815(S + U₁) 1024 < (S + U₁)(1 + 0.4815) 1024 < 1.4815(S + U₁) 1024/1.4815 < (S + U₁) 691.19135... < (S + U₁)
So the threshold is indeed (S + U₁) > 691.19135..., which means (S + U₁) ≥ 692 tokens, since we need a whole number of tokens.
Therefore:
If (S + U₁) ≤ 691 tokens, inflation is not cost-effective.
If (S + U₁) ≥ 692 tokens, inflation can be cost-effective, as long as the added inflation tokens I satisfy both:
I ≥ 1024 - (S + U₁) (to reach cacheable size)
I < 0.4815(S + U₁) (to maintain cost advantage)
I think I now have the complete and correct solution.
FINAL REPLY:
To solve this problem, I'll determine when inflating a system prompt becomes cost-effective for a two-turn conversation using Claude 3 Sonnet.
From the blog post, here are the pricing details:
Regular input tokens: $3/mTok (or $0.000003 per token)
Cache write cost: 25% more than regular = $3.75/mTok ($0.00000375 per token)
Cache read cost: 90% less than regular = $0.30/mTok ($0.0000003 per token)
Minimum cacheable size: 1024 tokens for Sonnet
Let's define these variables:
S = original system prompt tokens
U₁ = first user message tokens
A₁ = first assistant response tokens
U₂ = second user message tokens
I = inflation tokens (added to system prompt)
Comparing the two scenarios:
Scenario 1: No inflation, no caching
Turn 1: Process (S + U₁) tokens at regular price
Turn 2: Process (S + U₁ + A₁ + U₂) tokens at regular price
Total cost = $0.000003(S + U₁) + $0.000003(S + U₁ + A₁ + U₂) = $0.000003(2S + 2U₁ + A₁ + U₂)
Scenario 2: With inflation and caching
Turn 1: Process (S + I + U₁) tokens with cache write
Turn 2: Read (S + I + U₁) from cache + process (A₁ + U₂) new tokens
Total cost = $0.00000375(S + I + U₁) + $0.0000003(S + I + U₁) + $0.000003(A₁ + U₂) = $0.00000405(S + I + U₁) + $0.000003(A₁ + U₂)
Finding the threshold:
For inflation to be cost-effective, Scenario 2 must be cheaper than Scenario 1: $0.00000405(S + I + U₁) + $0.000003(A₁ + U₂) < $0.000003(2S + 2U₁ + A₁ + U₂)
Simplifying the left side of the inequality: $0.00000405(S + I + U₁) < $0.000003(2S + 2U₁) $0.00000405(S + I + U₁) < $0.000006(S + U₁)
Isolating the inflation term: $0.00000405(S + U₁) + $0.00000405I < $0.000006(S + U₁) $0.00000405I < $0.000006(S + U₁) - $0.00000405(S + U₁) $0.00000405I < $0.00000195(S + U₁) I < 0.4815(S + U₁)
But we also need: S + I + U₁ ≥ 1024 (minimum cacheable size) So: I ≥ 1024 - (S + U₁)
For both constraints to be satisfied: 1024 - (S + U₁) < 0.4815(S + U₁) 1024 < (S + U₁) + 0.4815(S + U₁) 1024 < (S + U₁)(1.4815) (S + U₁) > 1024/1.4815 ≈ 691.19
Since tokens are discrete, we need (S + U₁) ≥ 692 tokens.
Conclusion:
System prompt inflation becomes cost-advantageous when:
The original system prompt plus first user message (S + U₁) is at least 692 tokens
The inflation amount I is between 1024-(S+U₁) and 0.4815(S+U₁) tokens
³ 55,928 input tokens × $3/mTok = $0.168
⁴ 4,289 cache writes × $3.75/mTok = $0.016, plus 51,639 cache reads × $0.30/mTok = $0.015