AI Agent Deep Learning Guide (Generated by Claude Opus 4.6)
A complete learning path from “calling SDK APIs” to “actually understanding agents”
Table of Contents
- The Essence of Agents
- Handwritten Agent Loop
- Prompt Engineering for Agents
- The Craft of Tool Design
- Core Papers and Mental Models
- Memory and Context Management
- Multi-Agent Orchestration
- Reliability Engineering
- Evaluation (Evals)
- Practical Project Ideas
- Recommended Resources
1. The Essence of Agents
One-line definition
Agent = LLM + Tool Calling + Loop
That’s it. Every framework (Vercel AI SDK, LangChain, CrewAI) essentially wraps this loop.
Pseudocode
```
function agent(userMessage, tools, maxSteps):
    messages = [systemPrompt, userMessage]
    for step in 1..maxSteps:
        response = LLM(messages, tools)
        if response.hasToolCalls:
            for toolCall in response.toolCalls:
                result = execute(toolCall.name, toolCall.input)
                messages.append(toolCall)  // record which tool the model wants
                messages.append(result)    // record what the tool returned
        else:
            return response.text  // no more tool calls, final answer
    return "Max steps reached, stop"
```

Key insight
The model does not “know” it is an agent.
It only predicts the next token based on the current messages array.
When tool definitions exist in context, the model may emit tool-call formatted output.
When tool results are appended, the model continues generation with updated context.
So-called “autonomous decision-making” is really:
- The model reads tool descriptions and knows available capabilities
- The model decides whether more tool calls are needed based on the user query and current results
- If no further calls are needed, it directly returns a text answer
No magic involved.
2. Handwrite the Agent Loop (Without SDK)
This is the most important step: implement an agent loop using plain HTTP calls, independent of frameworks.
2.1 Basic version: single-round tool calls
```typescript
// Handwritten agent loop - understand what SDKs do behind the scenes
import "dotenv/config";

const DEEPSEEK_API_KEY = process.env.DEEPSEEK_API_KEY;
const BASE_URL = "https://api.deepseek.com/v1";

// ----- Tool definitions -----
const toolDefinitions = [
  {
    type: "function" as const,
    function: {
      name: "getWeather",
      description: "Get current weather for a given city",
      parameters: {
        type: "object",
        properties: {
          city: { type: "string", description: "City name" },
        },
        required: ["city"],
      },
    },
  },
  {
    type: "function" as const,
    function: {
      name: "calculate",
      description: "Execute math calculations",
      parameters: {
        type: "object",
        properties: {
          expression: { type: "string", description: "Math expression" },
        },
        required: ["expression"],
      },
    },
  },
];

// ----- Tool implementations -----
const toolImplementations: Record<string, (args: any) => any> = {
  getWeather: ({ city }: { city: string }) => {
    const data: Record<string, any> = {
      Beijing: { temp: 5, condition: "Sunny", humidity: 30 },
      Tokyo: { temp: 8, condition: "Light rain", humidity: 75 },
    };
    return data[city] ?? { temp: 20, condition: "Unknown", humidity: 50 };
  },
  calculate: ({ expression }: { expression: string }) => {
    const sanitized = expression.replace(/[^0-9+\-*/().% ]/g, "");
    return { result: new Function(`return (${sanitized})`)() };
  },
};

// ----- LLM call -----
async function callLLM(messages: any[]) {
  const response = await fetch(`${BASE_URL}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${DEEPSEEK_API_KEY}`,
    },
    body: JSON.stringify({
      model: "deepseek-chat",
      messages,
      tools: toolDefinitions,
    }),
  });
  const data = await response.json();
  return data.choices[0].message;
}

// ----- Core: Agent loop -----
async function runAgent(userMessage: string, maxSteps = 10) {
  console.log(`\n🧑 User: ${userMessage}\n`);

  // This is the full agent state: one messages array
  const messages: any[] = [
    { role: "system", content: "You are a helpful assistant. Answer in Chinese." },
    { role: "user", content: userMessage },
  ];

  for (let step = 1; step <= maxSteps; step++) {
    // Step 1: call LLM
    const assistantMessage = await callLLM(messages);
    messages.push(assistantMessage);

    // Step 2: check tool calls
    if (!assistantMessage.tool_calls || assistantMessage.tool_calls.length === 0) {
      // No tool calls -> model believes it can answer directly
      console.log(`🤖 Assistant: ${assistantMessage.content}`);
      return;
    }

    // Step 3: execute each tool call
    for (const toolCall of assistantMessage.tool_calls) {
      const fnName = toolCall.function.name;
      const fnArgs = JSON.parse(toolCall.function.arguments);

      console.log(`🔧 [Step ${step}] ${fnName}(${JSON.stringify(fnArgs)})`);

      const result = toolImplementations[fnName](fnArgs);
      console.log(`   → ${JSON.stringify(result)}`);

      // Step 4: append tool result to messages (critical!)
      messages.push({
        role: "tool",
        tool_call_id: toolCall.id,
        content: JSON.stringify(result),
      });
    }

    // Loop again with updated context
  }

  console.log("⚠️ Max step limit reached");
}

// Run
runAgent("Check weather in Beijing and Tokyo, then calculate temperature difference");
```

Running this makes three things crystal clear:
- Each LLM input is the full `messages` array (including prior tool calls and results)
- The model’s “memory” is this array and nothing else
- What SDKs do is manage this array and automate this loop
2.2 Hands-on exercises
Exercise 1: Print full messages before every callLLM and observe how it grows.
Exercise 2: Intentionally make a vague tool description and observe whether tool selection degrades.
Exercise 3: Set maxSteps = 1 and observe how the model behaves with one-step constraints.
3. Prompt Engineering for Agents
Prompt engineering for chat and for agents are fundamentally different.
Chat only needs good answers; agents also need correct tool selection and invocation.
3.1 System prompt essentials
```
You are a [role].

## Capabilities
You can use the following tools:
- getWeather: query weather. Use when user asks weather-related questions.
- calculate: math calculation. Use for precise calculations, do not do mental math.
- searchKnowledge: search knowledge base. Use when facts are uncertain.

## Behavior rules
1. Think first whether tools are needed; do not call blindly.
2. If one tool result is insufficient, call additional tools.
3. After collecting required information, provide a complete Chinese answer.
4. If a tool returns an error or empty result, tell the user instead of fabricating.

## Constraints
- Do not invent capabilities for tools.
- Do not call tools more than 5 times in one response.
```

3.2 Common issues and fixes
| Issue | Symptom | Fix |
|---|---|---|
| Model skips tools | User asks weather, model hallucinates answer | Emphasize “must use tools for real-time facts” |
| Model overuses tools | Casual chat still triggers tools | Specify “use tools only when necessary” |
| Wrong parameters | Invalid argument formats | Improve tool and parameter descriptions |
| Infinite loop | Repeats same tool calls | Set maxSteps and add anti-repeat guidance |
| Ignores tool outputs | Tool returned data but model doesn’t use it | Require answers grounded in tool results |
3.3 Advanced technique: Chain of Thought
```
## Response workflow
Before deciding tool calls, think in <thinking> tags:
1. What does the user want?
2. What do I already know?
3. What is missing? Which tool should I call?
4. If info is sufficient, answer directly.
```

This improves the controllability and interpretability of tool decisions.
4. The Craft of Tool Design
Tool design directly determines your agent’s capability ceiling.
The model understands tools through description + schema, not your implementation code.
4.1 Good tools vs bad tools
Bad design:
```typescript
tool({
  description: "Database operations", // too vague
  inputSchema: z.object({
    sql: z.string(), // asking model to write SQL directly? risky
  }),
})
```

Good design:

```typescript
tool({
  description: "Get user profile by user ID; returns name, email, and signup time",
  inputSchema: z.object({
    userId: z.string().describe("Unique user identifier, format like user_123"),
  }),
})
```

4.2 Tool design principles
Principle 1: single responsibility
```
❌ processData(action: "create" | "read" | "update" | "delete", ...)
✅ createUser(name, email)
✅ getUser(userId)
✅ updateUser(userId, fields)
✅ deleteUser(userId)
```

Principle 2: description is model-facing documentation
- State when this tool should be used
- State parameter meaning and format
- State what the return payload contains
Principle 3: parameter design should reduce model error rates
```typescript
❌ date: z.string() // model may output "tomorrow", "Jan 1", etc.
✅ year: z.number(), month: z.number(), day: z.number() // structured, robust
```

Principle 4: return values need sufficient context

```typescript
// Bad: model can misinterpret the unit
❌ return { temp: 5 }

// Good: self-contained return
✅ return { temp: 5, unit: "°C", city: "Beijing", condition: "Sunny" }
```

4.3 Tool granularity trade-offs
| Granularity | Pros | Cons | Best for |
|---|---|---|---|
| Fine-grained (many small tools) | Flexible composition, easier testing | More steps, higher wrong-tool risk | General-purpose agents |
| Coarse-grained (few big tools) | Fewer steps, fewer decision errors | Less flexible, logic hardcoded | Specific workflows |
5. Core Papers and Mental Models
5.1 ReAct (Reasoning + Acting)
Paper: ReAct: Synergizing Reasoning and Acting in Language Models (2022)
This is the conceptual foundation of modern tool-calling agents.
Core idea: alternate between “reasoning” and “acting”.
```
User: Which city is warmer, Beijing or Tokyo?

Thought: I need both temperatures. Check Beijing first.
Action: getWeather(city="Beijing")
Observation: {temp: 5, condition: "Sunny"}

Thought: Beijing is 5°C. Now check Tokyo.
Action: getWeather(city="Tokyo")
Observation: {temp: 8, condition: "Light rain"}

Thought: 8°C > 5°C. I have enough information.
Answer: Tokyo (8°C) is warmer than Beijing (5°C); the difference is 3°C.
```

Why it matters: almost all mainstream agent tool loops are implementations of ReAct in practice.
5.2 Plan-and-Execute
Unlike step-by-step ReAct, Plan-and-Execute does full planning first.
```
User: Compare React vs Vue and write a report.

Plan:
  1. Search core features of React
  2. Search core features of Vue
  3. Search performance comparison data
  4. Search ecosystem comparison
  5. Synthesize and write final report

Execute:
  Step 1: search("React core features 2024") → ...
  Step 2: search("Vue core features 2024") → ...
  ...
```

Best for: complex, multi-step tasks requiring global planning.
5.3 Reflexion
Let the agent reflect after a failure and improve its next attempt.
```
Attempt 1:
  Action: search("Python sorting") → irrelevant results
  Reflection: query is too broad; should specify the algorithm

Attempt 2:
  Action: search("Python quicksort implementation") → useful results
```

5.4 Must-read paper list
| Paper | Year | Core contribution |
|---|---|---|
| ReAct | 2022 | Alternating reasoning + acting |
| Toolformer | 2023 | Learning when to use tools |
| Reflexion | 2023 | Self-reflection and iteration |
| Plan-and-Execute | 2023 | Plan-first agent architecture |
| LATS (Language Agent Tree Search) | 2023 | Tree-search-based decisions |
| Voyager | 2023 | Lifelong autonomous learning in Minecraft |
6. Memory and Context Management
6.1 The problem: context windows are finite
DeepSeek context windows are around 64K-128K tokens. Sounds large, but in agent workloads:
```
System Prompt:          ~500 tokens
Tool definitions (3):   ~800 tokens
User message:           ~100 tokens
Each tool roundtrip:    ~300-1000 tokens
─────────────────────────────────────
After 10 rounds:        ~5000-10000 tokens
```

Long-running agents (for example coding assistants) can quickly approach these limits.
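To know when you are getting close to the limit, you do not need an exact tokenizer: a rough character-based estimate is usually enough to trigger compaction. A minimal sketch — the ~4-characters-per-token ratio is a common rule of thumb, not DeepSeek's actual tokenizer, and `estimateTokens` / `shouldCompress` are illustrative names:

```typescript
// Rough token estimate (~4 characters per token is a common heuristic;
// real tokenizers vary, so treat this as an approximation).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface Message {
  role: string;
  content: string;
}

// Decide whether the history should be compacted before the next LLM call.
function shouldCompress(messages: Message[], budgetTokens: number): boolean {
  const total = messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  return total > budgetTokens;
}
```

When `shouldCompress` fires, hand the history to a compression step such as the one in the next subsection.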
6.2 Solution A: conversation compression
```typescript
// Compress history with another LLM call when messages grow too long
async function compressHistory(messages: Message[]): Promise<Message[]> {
  const summary = await generateText({
    model: deepseek("deepseek-chat"),
    prompt: `Summarize key facts and decisions from the conversation: ${JSON.stringify(messages)}`,
  });

  return [
    messages[0], // keep system prompt
    { role: "system", content: `Conversation summary: ${summary.text}` },
    ...messages.slice(-4), // keep the most recent 4 messages
  ];
}
```

6.3 Solution B: RAG (Retrieval-Augmented Generation)
Instead of stuffing everything into context, store knowledge in vector DB and retrieve on demand.
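Under the hood, vector search is just nearest-neighbor ranking over embeddings. A self-contained sketch, assuming embeddings are already computed — `Doc` and `topK` are illustrative names, and real vector DBs add indexing so they don't scan every document:

```typescript
interface Doc {
  content: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank docs by similarity to the query embedding and keep the top K.
function topK(query: number[], docs: Doc[], k: number): Doc[] {
  return [...docs]
    .sort(
      (x, y) =>
        cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding)
    )
    .slice(0, k);
}
```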
```
User question → vector search → insert relevant chunks into context → LLM answer
```

```typescript
// Conceptual example
const relevantDocs = await vectorDB.search(userQuery, { topK: 5 });
const context = relevantDocs.map((d) => d.content).join("\n");

const response = await generateText({
  model: deepseek("deepseek-chat"),
  system: `Answer based on the references below:\n${context}`,
  prompt: userQuery,
});
```

6.4 Three memory layers
| Layer | Implementation | Lifecycle | Example |
|---|---|---|---|
| Working memory | messages array | Single conversation | Current dialogue context |
| Short-term memory | DB/files | Cross-session (days/weeks) | User preferences, recent tasks |
| Long-term memory | Vector DB | Persistent | Knowledge base, historical decisions |
7. Multi-Agent Orchestration
A single agent has limits. For complex tasks, multiple agents should collaborate.
7.1 Pattern 1: Pipeline
```
Planner Agent → Executor Agent → Reviewer Agent
    plan            execute          verify
```

```typescript
// Conceptual example
const plan = await plannerAgent.generateText({
  prompt: "User wants a Todo API, produce an implementation plan",
});

const code = await executorAgent.generateText({
  prompt: `Implement code based on this plan: ${plan.text}`,
});

const review = await reviewerAgent.generateText({
  prompt: `Review code against plan:\nPlan: ${plan.text}\nCode: ${code.text}`,
});
```

7.2 Pattern 2: Delegation
A manager agent assigns subtasks to specialist agents.
```
           Manager Agent
          /      |      \
Search Agent  Code Agent  Test Agent
```

Implementation idea: the manager has “delegate-to-agent” tools.

```typescript
const managerTools = {
  delegateToSearchAgent: tool({
    description: "Delegate search tasks to search specialist",
    inputSchema: z.object({ query: z.string() }),
    execute: async ({ query }) => {
      const result = await searchAgent.generateText({ prompt: query });
      return result.text;
    },
  }),
  delegateToCodeAgent: tool({
    description: "Delegate coding tasks to code specialist",
    inputSchema: z.object({ task: z.string() }),
    execute: async ({ task }) => {
      const result = await codeAgent.generateText({ prompt: task });
      return result.text;
    },
  }),
};
```

7.3 Pattern 3: Debate / Consensus
Multiple agents analyze from different perspectives, then synthesize.
```
Agent A (optimistic) ──┐
Agent B (skeptical) ───┤→ Synthesizer Agent → final conclusion
Agent C (technical) ───┘
```

7.4 Pattern 4: Swarm collaboration
Agents share a task pool and pick tasks autonomously.
```
Task Pool: [task1, task2, task3, task4, task5]
     ↑
Agent A (takes task1)
Agent B (takes task2)
Agent C (takes task3)
```
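The pool itself can be modeled as a shared list plus a claim operation. A minimal in-memory sketch — `claimTask` is an illustrative name, and a real swarm would need a shared store with atomic claims so two agents cannot take the same task:

```typescript
type TaskStatus = "pending" | "claimed" | "done";

interface Task {
  id: string;
  status: TaskStatus;
  owner?: string;
}

// An agent claims the first pending task; returns null when the pool is drained.
function claimTask(pool: Task[], agentId: string): Task | null {
  const task = pool.find((t) => t.status === "pending");
  if (!task) return null;
  task.status = "claimed";
  task.owner = agentId;
  return task;
}
```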
| Scenario | Recommended pattern | Why |
|---|---|---|
| Linear workflow (code→test→deploy) | Pipeline | Each step depends on previous |
| Complex project (parallel modules) | Delegation / Swarm | Parallel work + coordination |
| High-stakes decision-making | Debate | Multi-angle risk reduction |
| Clear role-based decomposition | Delegation | Centralized scheduling |
8. Reliability Engineering
This is the biggest gap between demo agents and production agents.
8.1 Typical model failure modes
| Error type | Example | Mitigation |
|---|---|---|
| Hallucination | Invents unsupported tool args | Validate with Zod schema |
| Format errors | Tool args are invalid JSON | try-catch + retry |
| Wrong tool choice | Calculates when it should search | Improve tool descriptions |
| Error ignorance | Tool fails but model ignores | Prompt explicit error handling |
| Infinite loop | Repeats same tool call | maxSteps + repetition detection |
| Over-calling | Simple task calls many tools | Prompt to think before tool use |
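The infinite-loop row deserves a concrete guard: one cheap heuristic is to abort (or inject a warning message) when the model emits the exact same tool call several times in a row. A sketch — the threshold of 3 is arbitrary, and `isStuckInLoop` is an illustrative name:

```typescript
interface ToolCallRecord {
  name: string;
  args: string; // serialized JSON arguments
}

// Detect when the last `threshold` tool calls are identical —
// a common sign of a stuck agent loop.
function isStuckInLoop(history: ToolCallRecord[], threshold = 3): boolean {
  if (history.length < threshold) return false;
  const recent = history.slice(-threshold);
  const key = (c: ToolCallRecord) => `${c.name}:${c.args}`;
  return recent.every((c) => key(c) === key(recent[0]));
}
```

Run this check after each step alongside the `maxSteps` cap; the two guards catch different failure shapes.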
8.2 Defensive programming
```typescript
// Tool execution wrapper
async function safeExecute(
  toolName: string,
  args: unknown,
  impl: Function
): Promise<string> {
  try {
    const result = await Promise.race([
      impl(args),
      new Promise((_, reject) =>
        setTimeout(() => reject(new Error("Tool execution timeout")), 10000)
      ),
    ]);
    return JSON.stringify(result);
  } catch (error) {
    // Never let tool errors crash the whole agent.
    // Return a structured error to the model so it can react.
    return JSON.stringify({
      error: true,
      message: `Tool ${toolName} execution failed: ${error}`,
    });
  }
}
```

8.3 Output validation
```typescript
// Validate model final output with Zod
import { z } from "zod/v4";
import { generateObject } from "ai";

// generateObject forces schema-constrained structured outputs
const { object } = await generateObject({
  model: deepseek("deepseek-chat"),
  schema: z.object({
    answer: z.string(),
    confidence: z.number().min(0).max(1),
    sources: z.array(z.string()),
  }),
  prompt: "...",
});
// object is guaranteed to follow the schema
```

8.4 Reliability checklist
- All tools have timeout control
- All tools have `try-catch`
- Reasonable `maxSteps` configured
- Critical outputs validated by schema
- Retry mechanism for intermittent API failures
- Step-by-step logs for observability
- Sensitive actions (delete, payment) require human confirmation
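For the retry item on the checklist, exponential backoff is the standard shape. A sketch — the delay values and attempt count are placeholders to tune against the actual API, and in production you would also check whether the error is retryable at all:

```typescript
// Retry an async operation with exponential backoff.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Delays grow as baseDelayMs, 2x, 4x, ...
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
      }
    }
  }
  throw lastError;
}
```

Wrap the raw `callLLM` fetch in `withRetry` so transient 5xx or network errors don't kill a long agent run.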
9. Evaluation (Evals)
Agent outputs are stochastic: same input can produce different tool sequences and final responses.
So how do we evaluate quality?
9.1 Evaluation dimensions
| Dimension | What it measures | Method |
|---|---|---|
| Correctness | Is final answer correct? | Human labels + automated checks |
| Tool efficiency | Minimum steps used? | Average step count |
| Tool accuracy | Were correct tools selected? | Compare against expected sequence |
| Robustness | Recovers from tool errors? | Inject failures and observe |
| Latency | End-to-end time | Timing |
| Cost | Token usage | Usage statistics |
9.2 Simple eval framework
```typescript
interface TestCase {
  input: string;
  expectedToolCalls?: string[];      // expected tool names
  expectedOutputContains?: string[]; // expected output keywords
  maxStepsAllowed?: number;          // expected max steps
}

const testCases: TestCase[] = [
  {
    input: "How is Beijing weather",
    expectedToolCalls: ["getWeather"],
    expectedOutputContains: ["Beijing", "°C"],
    maxStepsAllowed: 2,
  },
  {
    input: "Temperature difference between Beijing and Tokyo",
    expectedToolCalls: ["getWeather", "getWeather", "calculate"],
    expectedOutputContains: ["difference", "3"],
    maxStepsAllowed: 5,
  },
];

async function runEval(testCases: TestCase[]) {
  let passed = 0;
  for (const tc of testCases) {
    const { text, steps } = await runAgent(tc.input);
    const toolsCalled = steps.flatMap((s) => s.toolCalls.map((c) => c.toolName));

    const toolsMatch = tc.expectedToolCalls
      ? JSON.stringify(toolsCalled) === JSON.stringify(tc.expectedToolCalls)
      : true;
    const outputMatch = tc.expectedOutputContains
      ? tc.expectedOutputContains.every((kw) => text.includes(kw))
      : true;
    const stepsOk = tc.maxStepsAllowed ? steps.length <= tc.maxStepsAllowed : true;

    if (toolsMatch && outputMatch && stepsOk) {
      passed++;
      console.log(`✅ PASS: "${tc.input}"`);
    } else {
      console.log(`❌ FAIL: "${tc.input}"`);
      if (!toolsMatch)
        console.log(`   Tools: expected ${tc.expectedToolCalls}, got ${toolsCalled}`);
      if (!outputMatch) console.log(`   Output missing expected keywords`);
      if (!stepsOk) console.log(`   Steps: ${steps.length} > ${tc.maxStepsAllowed}`);
    }
  }
  console.log(`\nResult: ${passed}/${testCases.length} passed`);
}
```

9.3 LLM-as-Judge
Use another LLM to score response quality:
```typescript
const judgment = await generateText({
  model: deepseek("deepseek-chat"),
  prompt: `You are a judge. Evaluate this assistant response.

User question: ${userQuestion}
Assistant answer: ${agentAnswer}

Scoring rubric (1-5):
1. Accuracy: Is the answer grounded in tool-returned data?
2. Completeness: Does it answer everything the user asked?
3. Conciseness: Is there unnecessary filler?

Provide scores and rationale.`,
});
```

10. Practical Project Ideas
Ordered by increasing difficulty. Each project exposes different real-world problems.
Project 1: Personal knowledge-base QA agent
Build: read local Markdown files and answer questions via RAG.
You’ll learn:
- Chunking
- Embeddings
- Similarity retrieval
- Injecting retrieval results into prompts
Suggested stack: AI SDK + local vector DB (for example vectra or orama)
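The chunking step from the list above can start as simple fixed-size windows with overlap. A character-based sketch — `chunkText` is an illustrative name; production chunkers usually split on headings or sentences and count tokens rather than characters:

```typescript
// Split a document into overlapping character windows.
// Overlap keeps context that straddles a chunk boundary retrievable.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

Each chunk then gets embedded and stored; at query time you retrieve the top-scoring chunks and inject them into the prompt.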
Project 2: CLI coding assistant
Build: a CLI agent that can read/write files and run commands (a simplified Claude Code style tool).
You’ll learn:
- Permission controls for dangerous operations
- File-system tool design
- Command sandboxing
- Error recovery
Example tools:
```
readFile(path)           → read a file
writeFile(path, content) → write a file
runCommand(cmd)          → run a command (requires confirmation)
listFiles(dir)           → list a directory
```

Project 3: Multi-agent research assistant
Build: given a topic, automatically search, read, summarize, and generate a report.
You’ll learn:
- Multi-agent orchestration
- Real API integration (search APIs)
- Long-context handling
- Structured outputs
Architecture:
```
Planner → [Searcher, Searcher, Searcher] (parallel) → Synthesizer → Writer
```

Project 4: Self-improving code-generation agent
Build: requirements → generate code → run tests → analyze failures → patch code → retry.
You’ll learn:
- Reflexion pattern
- Code execution sandbox
- Test-driven agent loops
- Failure analysis and recovery
11. Recommended Resources
Papers
| Paper | Link |
|---|---|
| ReAct | https://arxiv.org/abs/2210.03629 |
| Toolformer | https://arxiv.org/abs/2302.04761 |
| Reflexion | https://arxiv.org/abs/2303.11366 |
| Voyager | https://arxiv.org/abs/2305.16291 |
| LATS | https://arxiv.org/abs/2310.04406 |
| A Survey on LLM-based Agents | https://arxiv.org/abs/2308.11432 |
Documentation
| Resource | Link |
|---|---|
| Vercel AI SDK docs | https://ai-sdk.dev |
| Vercel AI SDK - Agents | https://ai-sdk.dev/docs/foundations/agents |
| OpenAI Function Calling | https://platform.openai.com/docs/guides/function-calling |
| DeepSeek API docs | https://platform.deepseek.com/api-docs |
| Anthropic Tool Use guide | https://docs.anthropic.com/en/docs/build-with-claude/tool-use |
Open-source projects worth reading
| Project | Why read it |
|---|---|
| Vercel AI SDK (vercel/ai) | Production-grade agent loop patterns |
| LangChain.js | Abstractions for chain/agent/memory |
| AutoGPT | Early autonomous agent lessons and limitations |
| OpenDevin | Open-source coding agent tool design + sandboxing |
| CrewAI | Multi-agent collaboration patterns |
Courses and blogs
| Resource | Note |
|---|---|
| DeepLearning.AI - AI Agents | Andrew Ng short course, free |
| Lilian Weng’s blog | OpenAI researcher; many agent surveys |
| Simon Willison’s blog | Practical LLM case studies |
Summary
```
SDK usage          → Entry point (you are here)
Handwritten loop   → Understand fundamentals
Tool design        → Sets capability ceiling
Prompt engineering → Sets stability ceiling
Multi-agent design → Handles complex tasks
Reliability        → Bridge from demo to production
Evaluation         → Quantify improvement direction
```

Most important advice: build an agent that solves your own real problem.
Demos always look smooth; real scenarios reveal the actual hard parts.