Part 1: Using vs. Building With AI
When you use Claude in claude.ai, you type a message and get a response. The interface handles everything else: the conversation history, the system context, the display. Building a product with the Claude API means you are responsible for all of that yourself.
You decide what's in the system prompt — and it applies to every user of your product. You decide how much conversation history to send with each request — and pay for every token. You decide when to use streaming, what to do when the model returns JSON that doesn't parse, and how to handle rate limits at 3am when you're not watching.
These are engineering problems, not prompting problems. The prompting skills from the rest of this site still apply — but they're now embedded in an application with real users, real costs, and real reliability requirements.
When you use AI, you are the context — you know the goal, the constraints, what you've tried. When you build with AI, your product must provide all of that context programmatically, for every user, on every request. System prompts are code, not suggestions.
Which Model to Use
The current Claude models (as of April 2026):
- claude-opus-4-6 — Most capable. Best for complex reasoning, nuanced tasks, and anything where output quality matters most. Highest cost.
- claude-sonnet-4-6 — Best balance of capability and cost. The right default for most product use cases. Excellent at structured output, coding tasks, and document processing.
- claude-haiku-4-5-20251001 — Fastest and cheapest. For high-volume, lower-complexity tasks: classification, summarization, extracting structured data from simple inputs.
The practical rule: start with Sonnet, benchmark the output quality on your actual task, then consider whether Haiku is good enough for cost reasons or whether Opus is needed for quality reasons. Most products end up on Sonnet.
Part 2: Your First API Call
The Anthropic SDK handles authentication, retries, and response parsing. Install it and make one call before building anything more complex.
# Python
pip install anthropic
# Node.js / TypeScript
npm install @anthropic-ai/sdk
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from environment

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Summarize this in one sentence: [text]"}
    ]
)
print(message.content[0].text)
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from environment

const message = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [
    { role: "user", content: "Summarize this in one sentence: [text]" }
  ],
});
console.log(message.content[0].text);
Set your API key as an environment variable — never hardcode it in your source files:
# .env (never commit this file)
ANTHROPIC_API_KEY=sk-ant-...
# Load it with python-dotenv (Python) or dotenv (Node)
# The SDK reads it automatically if the env var is set
Understanding the Response
# The response has these key fields:
message.id # unique request ID
message.model # model used
message.content # list of content blocks
message.content[0].text # the text response
message.usage.input_tokens # tokens you sent (what you pay for)
message.usage.output_tokens # tokens in the response (what you pay for)
message.stop_reason # "end_turn" | "max_tokens" | "tool_use"
Log usage.input_tokens and usage.output_tokens on every request in development. It's the fastest way to understand your token budget and catch prompt engineering mistakes — a prompt that's unexpectedly long will immediately show up in the token count.
Part 3: System Prompt Design
The system prompt is the most important engineering decision in an AI-powered product. It defines what your product does, how it behaves, and what it refuses to do — for every user, on every request.
What Goes in a System Prompt
A production system prompt typically has five components:
- Role and context. Who Claude is in this product and what it knows about the product domain.
- Task definition. What Claude should do. Be specific — "help users" is not a task definition.
- Output format. What the response should look like. If you need JSON, say so. If you need a specific structure, describe it exactly.
- Constraints and guardrails. What Claude should not do. Topics to avoid, things to refuse, scope limits.
- Examples. One or two examples of ideal input/output pairs for complex or nuanced tasks. These are the most reliable way to communicate the quality standard you expect.
You are a code review assistant for a team that builds TypeScript applications.
Your task is to review code changes and provide feedback in the following categories:
- Correctness: bugs, logic errors, type safety issues
- Security: SQL injection, XSS, secrets in code, missing auth checks
- Performance: obvious N+1 queries, unnecessary re-renders, sync operations that should be async
- Maintainability: unclear names, missing error handling, code duplication
Output format — respond with a JSON object:
{
  "summary": "one sentence overall assessment",
  "issues": [
    {
      "severity": "critical" | "major" | "minor" | "suggestion",
      "category": "correctness" | "security" | "performance" | "maintainability",
      "location": "filename:line",
      "description": "what the issue is",
      "suggestion": "how to fix it"
    }
  ],
  "approved": boolean
}
Rules:
- Only flag real issues — don't invent problems to seem thorough
- "critical" severity is for security vulnerabilities and definite bugs only
- If there are no issues in a category, don't mention that category
- "approved" is true only if there are no critical or major issues
- Do not explain your output format or add commentary outside the JSON
System Prompt as Code
Your system prompt is code — it changes how your product behaves, and those changes affect all users. Treat it accordingly:
- Store it in a file in your repository, not hardcoded in an API call
- Version control it with the same discipline as application code
- Test changes against a fixed set of inputs before deploying
- Use environment variables if you need different prompts for dev/staging/prod
import json
from pathlib import Path

SYSTEM_PROMPT = Path("prompts/code_review.txt").read_text()

def review_code(diff: str) -> dict:
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": diff}]
    )
    return json.loads(message.content[0].text)
Part 4: Context Management
The context window is the total number of tokens Claude can see in one request — your system prompt, the conversation history, and the current message. Claude Sonnet 4.6 has a 200,000-token context window, which sounds large until you're building a product that processes long documents or maintains long conversations.
The Context Window Budget
Think of the context window as a fixed budget. Every token you spend on history is a token you can't spend on documents. Every token the system prompt consumes is a token the user can't use. Most products allocate the budget something like:
- System prompt: 500–2000 tokens (keep it tight)
- Retrieved context / documents: variable, often 10,000–50,000 tokens
- Conversation history: 5–20 messages, pruned to fit
- Current user message: whatever the user sends
- Max output tokens: set explicitly, typically 1,024–4,096
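One way to make the budget concrete is a pre-flight check before each request. This sketch uses a rough 4-characters-per-token estimate, and the `fits_in_context` helper and its parameters are illustrative, not part of the SDK — for exact counts, the API's token-counting endpoint is the authoritative source:

```python
# Rough pre-flight budget check. The 4-chars-per-token heuristic is an
# approximation; use the API's token-counting endpoint for exact numbers.

CONTEXT_WINDOW = 200_000

def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token."""
    return len(text) // 4

def fits_in_context(system: str, documents: str, history: list[str],
                    user_message: str, max_output: int = 4_096) -> bool:
    """Return True if the request plausibly fits the context window."""
    total = (
        estimate_tokens(system)
        + estimate_tokens(documents)
        + sum(estimate_tokens(m) for m in history)
        + estimate_tokens(user_message)
        + max_output  # reserve room for the response
    )
    return total <= CONTEXT_WINDOW
```

Running this check before calling the API turns "the request failed because the context was too big" into a decision you make deliberately — prune history, retrieve fewer chunks, or reject the input.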
Managing Conversation History
For a chat product, you need to decide how much conversation history to include with each request. Sending the entire history gets expensive quickly. Three common strategies:
def build_messages(history: list[dict], new_message: str, max_history: int = 10) -> list[dict]:
    """Keep only the last N messages of history."""
    recent = history[-max_history:]  # slicing already handles short histories
    return recent + [{"role": "user", "content": new_message}]
def count_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token. For exact counts,
    use the API's token-counting endpoint (client.messages.count_tokens)."""
    return len(text) // 4

def build_messages_within_budget(
    history: list[dict],
    new_message: str,
    token_budget: int = 10_000
) -> list[dict]:
    """Add messages from newest to oldest until the budget is used."""
    messages = [{"role": "user", "content": new_message}]
    tokens_used = count_tokens(new_message)
    for msg in reversed(history):
        msg_tokens = count_tokens(msg["content"])
        if tokens_used + msg_tokens > token_budget:
            break
        messages.insert(0, msg)
        tokens_used += msg_tokens
    return messages
def summarize_old_history(history: list[dict]) -> str:
    """Summarize early conversation to a compact representation."""
    early_history = history[:-10]  # everything before the last 10 messages
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # cheap model for summarization
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in 2-3 sentences, keeping the key facts:\n\n{format_history(early_history)}"
        }]
    )
    return response.content[0].text
Use Haiku for the summarization step — it's cheap and fast for this kind of task, and you're compressing old history that doesn't need deep understanding.
Retrieval-Augmented Generation (RAG)
For products that answer questions about a document corpus, you don't send all documents in every request. You embed the user's question, retrieve the most relevant chunks, and send only those:
def answer_question(question: str, vector_db) -> str:
    # 1. Retrieve relevant chunks (using embeddings + vector search)
    relevant_chunks = vector_db.search(question, top_k=5)

    # 2. Build context from retrieved chunks
    context = "\n\n---\n\n".join(
        f"[Source: {chunk.source}]\n{chunk.text}"
        for chunk in relevant_chunks
    )

    # 3. Ask Claude with the retrieved context
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="""Answer questions using only the provided context.
If the answer is not in the context, say so — do not guess.
Cite the source for each claim.""",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.content[0].text
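The retrieval code assumes the corpus was already chunked and embedded at indexing time. A minimal chunking sketch — the chunk size and overlap values are arbitrary starting points, and character-based splitting is the simplest possible strategy (production systems often split on sentence or section boundaries):

```python
def chunk_document(text: str, chunk_size: int = 1_000, overlap: int = 200) -> list[str]:
    """Split a document into overlapping character-based chunks.
    Overlap keeps sentences that straddle a boundary retrievable
    from both neighboring chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Each chunk then gets embedded and stored in the vector database alongside its source metadata, which is what makes the `[Source: ...]` citations in the answer step possible.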
Part 5: Cost Optimization
At scale, prompt costs are real engineering costs. A product with 10,000 daily active users making 5 API calls each is 1.5 million calls a month — the unit economics have to work. The main levers: model selection, prompt caching, and prompt efficiency.
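To make that concrete, a back-of-the-envelope cost model. The per-million-token prices and the per-request token counts below are illustrative assumptions, not quoted pricing — substitute current rates and your own measured usage:

```python
# Back-of-the-envelope monthly cost estimate. All numbers here are
# assumptions for illustration; plug in real pricing and measured usage.

def monthly_cost(
    daily_users: int,
    calls_per_user: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,   # dollars per million input tokens
    output_price_per_mtok: float,  # dollars per million output tokens
) -> float:
    calls = daily_users * calls_per_user * 30
    input_cost = calls * input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = calls * output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost

# 10,000 DAU x 5 calls/day, 2,000 input + 500 output tokens per call,
# at a hypothetical $3 / $15 per million tokens:
cost = monthly_cost(10_000, 5, 2_000, 500, 3.0, 15.0)  # → 20250.0
```

Note how the input side dominates the call count: shaving 500 tokens off the prompt, or caching the system prompt, moves this number far more than most code optimizations will.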
Prompt Caching
Prompt caching is the single most impactful cost optimization for most products. When a request starts with a long prefix that's identical to a recent request (your system prompt, a large document, a reference corpus), Claude can reuse the cached computation instead of reprocessing it.
The cache TTL is 5 minutes by default. For a product with a fixed, long system prompt, caching typically reduces the prompt portion of your bill by 70–90%.
import anthropic

client = anthropic.Anthropic()

# Mark the system prompt for caching.
# Cache hits are billed at ~10% of normal input cost.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # your stable system prompt
            "cache_control": {"type": "ephemeral"}  # marks this for caching
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)

# Check whether the cache was hit or missed
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: LONG_SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: userMessage }],
});

// usage.cache_read_input_tokens > 0 means cache hit
console.log("Cache hit:", (response.usage.cache_read_input_tokens ?? 0) > 0);
Cache write costs slightly more than a regular input token. Cache reads cost ~10% of a regular input token. If the same prefix appears in more than 1 in 10 requests, caching saves money. For a fixed system prompt served to many users, the payback is essentially immediate.
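The economics are easy to sanity-check with arithmetic. The multipliers below are assumptions for illustration (a ~1.25x write surcharge, ~0.1x reads) — confirm the exact figures against current pricing before relying on them:

```python
def caching_cost_ratio(hit_rate: float,
                       write_multiplier: float = 1.25,
                       read_multiplier: float = 0.10) -> float:
    """Average cost of the cached prefix per request, relative to
    sending it uncached (= 1.0). Misses pay the write surcharge,
    hits pay the discounted read rate."""
    return (1.0 - hit_rate) * write_multiplier + hit_rate * read_multiplier

# Break-even hit rate = (write_multiplier - 1) / (write_multiplier - read_multiplier);
# the smaller the write surcharge, the lower the hit rate needed to profit.
# A fixed system prompt under steady traffic approaches the read-rate floor.
```

At a 90% hit rate with these assumed multipliers, the prefix costs about 0.22x its uncached price — which is where the "70–90% reduction" figure for a stable system prompt comes from.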
Model Selection by Task
Not every request in your product needs the same model. A common pattern:
def get_model_for_task(task_type: str) -> str:
    """Select the right model based on what we need to do."""
    routing = {
        # High quality needed
        "complex_reasoning": "claude-opus-4-6",
        "creative_writing": "claude-opus-4-6",
        # Good balance for most tasks
        "code_review": "claude-sonnet-4-6",
        "document_analysis": "claude-sonnet-4-6",
        "chat_response": "claude-sonnet-4-6",
        # Simple tasks where speed and cost matter
        "classification": "claude-haiku-4-5-20251001",
        "summarization": "claude-haiku-4-5-20251001",
        "data_extraction": "claude-haiku-4-5-20251001",
        "intent_detection": "claude-haiku-4-5-20251001",
    }
    return routing.get(task_type, "claude-sonnet-4-6")
Prompt Efficiency
Shorter prompts cost less. Every word in your system prompt costs money on every request. Audit your system prompts periodically — AI is good at making them shorter without losing meaning:
This is my system prompt. It currently uses about 800 tokens. Make it as short as possible while preserving every behavioral constraint — I don't care about tone, only that the AI behaves identically. Then tell me how many tokens the optimized version uses and what you removed.
[paste your current system prompt]
Part 6: Streaming
Streaming lets you display Claude's response as it's generated rather than waiting for the full response. For user-facing features, streaming dramatically improves perceived performance — users see output in under a second instead of waiting 5–10 seconds for a complete response.
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a short story about a robot."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    # After the stream:
    message = stream.get_final_message()
    print(f"\n\nTokens: {message.usage.input_tokens} in, {message.usage.output_tokens} out")
// app/api/chat/route.ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export async function POST(req: Request) {
  const { message } = await req.json();
  const stream = await client.messages.stream({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [{ role: "user", content: message }],
  });

  // Return as Server-Sent Events for the browser to consume
  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        if (
          chunk.type === "content_block_delta" &&
          chunk.delta.type === "text_delta"
        ) {
          controller.enqueue(
            encoder.encode(`data: ${JSON.stringify({ text: chunk.delta.text })}\n\n`)
          );
        }
      }
      controller.enqueue(encoder.encode("data: [DONE]\n\n"));
      controller.close();
    },
  });

  return new Response(readable, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    },
  });
}
"use client";
import { useState } from "react";

export function ChatInput() {
  const [response, setResponse] = useState("");

  async function sendMessage(message: string) {
    setResponse("");
    const res = await fetch("/api/chat", {
      method: "POST",
      body: JSON.stringify({ message }),
    });
    const reader = res.body!.getReader();
    const decoder = new TextDecoder();
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true avoids corrupting multi-byte characters split across chunks
      const lines = decoder.decode(value, { stream: true }).split("\n");
      for (const line of lines) {
        if (line.startsWith("data: ") && line !== "data: [DONE]") {
          const { text } = JSON.parse(line.slice(6));
          setResponse((prev) => prev + text);
        }
      }
    }
  }

  return (
    <div>
      <button onClick={() => sendMessage("Tell me something interesting")}>
        Ask
      </button>
      <p>{response}</p>
    </div>
  );
}
Part 7: Tool Use (Function Calling)
Tool use lets Claude call functions in your application — look up data, run calculations, trigger actions. Claude decides when to use a tool based on the user's request, calls it with structured arguments, and uses the result to generate its final response.
This is the pattern behind AI assistants that can actually do things: look up a user's order status, check a flight, query a database, send a notification.
Defining a Tool
tools = [
    {
        "name": "get_order_status",
        "description": "Look up the current status of a customer order by order ID. Use this when the user asks about their order.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The order ID (format: ORD-XXXXXXXX)"
                }
            },
            "required": ["order_id"]
        }
    },
    {
        "name": "list_customer_orders",
        "description": "Get a list of recent orders for the authenticated customer.",
        "input_schema": {
            "type": "object",
            "properties": {
                "limit": {
                    "type": "integer",
                    "description": "Maximum number of orders to return (default 5, max 20)"
                }
            }
        }
    }
]
The Tool Use Loop
import json

def run_agent(user_message: str, customer_id: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=f"You are a customer support agent for Acme Store. The authenticated customer ID is {customer_id}.",
            tools=tools,
            messages=messages,
        )

        # If Claude finished (no tool call), return the text
        if response.stop_reason == "end_turn":
            return response.content[0].text

        # If Claude wants to use a tool, run it
        if response.stop_reason == "tool_use":
            # Add Claude's response (including the tool call) to history
            messages.append({"role": "assistant", "content": response.content})

            # Find and execute each tool call
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input, customer_id)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(result)
                    })

            # Add tool results to history and continue the loop
            messages.append({"role": "user", "content": tool_results})
            continue

        # Any other stop reason (e.g. "max_tokens"): fail loudly
        # rather than looping forever
        raise RuntimeError(f"Unexpected stop_reason: {response.stop_reason}")

def execute_tool(name: str, inputs: dict, customer_id: str) -> dict:
    """Dispatch tool calls to the appropriate function."""
    if name == "get_order_status":
        return order_service.get_status(inputs["order_id"], customer_id)
    elif name == "list_customer_orders":
        limit = inputs.get("limit", 5)
        return order_service.list_orders(customer_id, limit=limit)
    else:
        return {"error": f"Unknown tool: {name}"}
Claude decides which tool to call based on the user's message. But the authorization check must happen in your tool function — not in the system prompt. Always verify that the authenticated user has permission to access the resource before returning data. Never rely on Claude to enforce authorization.
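Concretely, that means the ownership check lives inside the tool implementation itself. A sketch — the in-memory `ORDERS` dict stands in for whatever datastore `order_service` wraps, and is purely illustrative:

```python
# Hypothetical sketch: authorization enforced in code, not in the prompt.
ORDERS = {  # stand-in for a real datastore
    "ORD-00000001": {"customer_id": "cust_123", "status": "shipped"},
}

def get_order_status(order_id: str, customer_id: str) -> dict:
    order = ORDERS.get(order_id)
    if order is None:
        return {"error": "Order not found"}
    # The critical check: verify ownership before returning anything.
    # No prompt injection can bypass a comparison in your own code.
    if order["customer_id"] != customer_id:
        return {"error": "Order not found"}  # don't reveal that it exists
    return {"order_id": order_id, "status": order["status"]}
```

Returning the same "not found" error for missing and unauthorized orders is deliberate: a distinct "access denied" response would let a user enumerate valid order IDs by probing.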
Getting Structured JSON Output (Without Tool Use)
If you just need structured output and don't need Claude to call external functions, instruct it to return JSON directly:
import json

def extract_contact_info(text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system="""Extract contact information from the provided text.
Respond with a JSON object only — no prose, no explanation.

Schema:
{
  "name": string | null,
  "email": string | null,
  "phone": string | null,
  "company": string | null
}

If a field is not present in the text, use null.""",
        messages=[{"role": "user", "content": text}]
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # Ask Claude to fix the JSON if it malformed it
        fix_response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=512,
            messages=[
                {"role": "user", "content": response.content[0].text},
                {"role": "assistant", "content": "The above is invalid JSON. Fixed valid JSON:"}
            ]
        )
        return json.loads(fix_response.content[0].text)
Part 8: Production Patterns
A few patterns that distinguish production-grade API integrations from prototype-grade ones.
Retry on Rate Limits and Transient Errors
The Anthropic SDK has automatic retry built in, but you should configure it explicitly:
client = anthropic.Anthropic(
    max_retries=3,  # retry up to 3 times (default is 2)
    timeout=60.0,   # timeout per request in seconds
)
For high-volume production systems, wrap your API calls with circuit breaker logic — if Claude is returning errors, stop sending requests and degrade gracefully rather than hammering a service that's struggling.
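A minimal circuit-breaker sketch — the threshold and cooldown values are illustrative, and production systems usually reach for a battle-tested library rather than rolling their own:

```python
import time

class CircuitBreaker:
    """Open the circuit after N consecutive failures; refuse calls
    until a cooldown elapses, then allow a trial request through."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # circuit closed: normal operation
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True  # half-open: let one trial request through
        return False  # circuit open: degrade gracefully instead

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Wrap each API call in `allow_request()` / `record_success()` / `record_failure()`; when the circuit is open, serve a cached response or an honest "temporarily unavailable" message instead of queueing more failing requests.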
Log Everything You Need to Debug
import logging
import time

logger = logging.getLogger(__name__)

def call_claude(messages: list, system: str, **kwargs) -> anthropic.types.Message:
    start = time.time()
    try:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            system=system,
            messages=messages,
            **kwargs
        )
        duration_ms = int((time.time() - start) * 1000)
        logger.info(
            "claude_api_call",
            extra={
                "duration_ms": duration_ms,
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
                "cache_read_tokens": getattr(response.usage, "cache_read_input_tokens", 0),
                "stop_reason": response.stop_reason,
                "model": response.model,
            }
        )
        return response
    except anthropic.APIError as e:
        logger.error("claude_api_error", extra={"error": str(e), "status": getattr(e, "status_code", None)})
        raise
Handling max_tokens Truncation
If stop_reason is "max_tokens", the response was cut off. For user-facing features this is usually a bug — the user got a partial answer. Handle it explicitly:
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=messages
)

if response.stop_reason == "max_tokens":
    # Either increase max_tokens, or inform the user the response was cut off.
    # For structured output (JSON), never silently accept a truncated response.
    raise ValueError(f"Response was truncated after {response.usage.output_tokens} tokens")
Building with Claude API — Summary
- System prompts are code. Store them in files, version control them, test changes against real inputs before deploying. They define your product's behavior for all users.
- Enable prompt caching on your system prompt. If your system prompt is longer than a few hundred tokens and you serve many users, caching reduces that cost by ~90%. It's a one-line change.
- Route by task, not by habit. Use Haiku for classification and summarization. Use Sonnet for most product tasks. Reserve Opus for complex reasoning where output quality justifies the cost.
- Log input and output tokens on every request in development. Token counts are the fastest way to catch inefficient prompts and unexpected context size.
- Authorization lives in your code, not your prompts. When using tool use, verify user permissions in the tool function before returning data — never rely on Claude to enforce access control.
- Stream for user-facing features. Streaming reduces perceived latency from 5–10 seconds to under 1 second. Users notice the difference.
- Handle stop_reason == "max_tokens". A truncated response is a bug for structured output and a bad experience for users. Set max_tokens high enough, and detect truncation explicitly.
Related Guides
AI Evals in Production
Build golden datasets, run prompt regression in CI, and add release quality gates for AI-powered features.
Prompt Engineering for Python
FastAPI, Pydantic, SQLAlchemy, and pytest patterns for the Python backend that hosts your AI-powered product.
Debugging with AI
When your AI-powered feature behaves unexpectedly, the same investigation workflow applies: reproduce, isolate, hypothesize, then fix.
AI-Assisted CI/CD
Automate testing and deployment of your AI-powered product with GitHub Actions.