2639 words
13 min
AI Agent Deep Learning Guide
> A complete learning path from “calling SDK APIs” to “actually understanding agents”
2026-02-24

AI Agent Deep Learning Guide (Generated by Claude Opus 4.6)#

A complete learning path from “calling SDK APIs” to “actually understanding agents”

Table of Contents#

  1. The Essence of Agents
  2. Handwritten Agent Loop
  3. Prompt Engineering for Agents
  4. The Craft of Tool Design
  5. Core Papers and Mental Models
  6. Memory and Context Management
  7. Multi-Agent Orchestration
  8. Reliability Engineering
  9. Evaluation (Evals)
  10. Practical Project Ideas
  11. Recommended Resources

1. The Essence of Agents#

One-line definition#

Agent = LLM + Tool Calling + Loop

That’s it. Every framework (Vercel AI SDK, LangChain, CrewAI) is essentially wrapping this.

Pseudocode#

function agent(userMessage, tools, maxSteps):
messages = [systemPrompt, userMessage]
for step in 1..maxSteps:
response = LLM(messages, tools)
if response.hasToolCalls:
for toolCall in response.toolCalls:
result = execute(toolCall.name, toolCall.input)
messages.append(toolCall) // record what tool the model wants
messages.append(toolResult) // record what tool returned
else:
return response.text // no more tool calls, final answer
return "Max steps reached, stop"

Key insight#

The model does not “know” it is an agent.
It only predicts the next token based on the current messages array.
When tool definitions exist in context, the model may emit tool-call formatted output.
When tool results are appended, the model continues generation with updated context.

So-called “autonomous decision-making” is really:

  • The model reads tool descriptions and knows available capabilities
  • The model decides whether more tool calls are needed based on the user query and current results
  • If no further calls are needed, it directly returns a text answer

No magic involved.


2. Handwrite the Agent Loop (Without SDK)#

This is the most important step: implement an agent loop using plain HTTP calls, independent of frameworks.

2.1 Basic version: single-round tool calls#

src/manual-agent-basic.ts
// Handwritten agent loop - understand what SDKs do behind the scenes
import "dotenv/config";
const DEEPSEEK_API_KEY = process.env.DEEPSEEK_API_KEY;
const BASE_URL = "https://api.deepseek.com/v1";
// ----- Tool definitions -----
const toolDefinitions = [
{
type: "function" as const,
function: {
name: "getWeather",
description: "Get current weather for a given city",
parameters: {
type: "object",
properties: {
city: { type: "string", description: "City name" },
},
required: ["city"],
},
},
},
{
type: "function" as const,
function: {
name: "calculate",
description: "Execute math calculations",
parameters: {
type: "object",
properties: {
expression: { type: "string", description: "Math expression" },
},
required: ["expression"],
},
},
},
];
// ----- Tool implementations -----
const toolImplementations: Record<string, (args: any) => any> = {
getWeather: ({ city }: { city: string }) => {
const data: Record<string, any> = {
Beijing: { temp: 5, condition: "Sunny", humidity: 30 },
Tokyo: { temp: 8, condition: "Light rain", humidity: 75 },
};
return data[city] ?? { temp: 20, condition: "Unknown", humidity: 50 };
},
calculate: ({ expression }: { expression: string }) => {
const sanitized = expression.replace(/[^0-9+\-*/().% ]/g, "");
return { result: new Function(`return (${sanitized})`)() };
},
};
// ----- Core: Agent loop -----
async function callLLM(messages: any[]) {
const response = await fetch(`${BASE_URL}/chat/completions`, {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: `Bearer ${DEEPSEEK_API_KEY}`,
},
body: JSON.stringify({
model: "deepseek-chat",
messages,
tools: toolDefinitions,
}),
});
const data = await response.json();
return data.choices[0].message;
}
async function runAgent(userMessage: string, maxSteps = 10) {
console.log(`\n🧑 User: ${userMessage}\n`);
// This is the full agent state: one messages array
const messages: any[] = [
{ role: "system", content: "You are a helpful assistant. Answer in Chinese." },
{ role: "user", content: userMessage },
];
for (let step = 1; step <= maxSteps; step++) {
// Step 1: call LLM
const assistantMessage = await callLLM(messages);
messages.push(assistantMessage);
// Step 2: check tool calls
if (!assistantMessage.tool_calls || assistantMessage.tool_calls.length === 0) {
// No tool calls -> model believes it can answer directly
console.log(`🤖 Assistant: ${assistantMessage.content}`);
return;
}
// Step 3: execute each tool call
for (const toolCall of assistantMessage.tool_calls) {
const fnName = toolCall.function.name;
const fnArgs = JSON.parse(toolCall.function.arguments);
console.log(`🔧 [Step ${step}] ${fnName}(${JSON.stringify(fnArgs)})`);
const result = toolImplementations[fnName](fnArgs);
console.log(` → ${JSON.stringify(result)}`);
// Step 4: append tool result to messages (critical!)
messages.push({
role: "tool",
tool_call_id: toolCall.id,
content: JSON.stringify(result),
});
}
// Loop again with updated context
}
console.log("⚠️ Max step limit reached");
}
// Run
runAgent("Check weather in Beijing and Tokyo, then calculate temperature difference");

Running this makes three things crystal clear:

  1. Each LLM input is the full messages array (including prior tool calls and results)
  2. The model’s “memory” is this array and nothing else
  3. What SDKs do is managing this array and automating this loop

2.2 Hands-on exercises#

Exercise 1: Print full messages before every callLLM and observe how it grows.

Exercise 2: Intentionally make a vague tool description and observe whether tool selection degrades.

Exercise 3: Set maxSteps = 1 and observe how the model behaves with one-step constraints.


3. Prompt Engineering for Agents#

Prompt engineering for chat and for agents are fundamentally different.
Chat only needs good answers; agents also need correct tool selection and invocation.

3.1 System prompt essentials#

You are a [role].
## Capabilities
You can use the following tools:
- getWeather: query weather. Use when user asks weather-related questions.
- calculate: math calculation. Use for precise calculations, do not do mental math.
- searchKnowledge: search knowledge base. Use when facts are uncertain.
## Behavior rules
1. Think first whether tools are needed; do not call blindly.
2. If one tool result is insufficient, call additional tools.
3. After collecting required information, provide a complete Chinese answer.
4. If a tool returns error or empty result, tell the user instead of fabricating.
## Constraints
- Do not invent capabilities for tools.
- Do not call tools more than 5 times in one response.

3.2 Common issues and fixes#

IssueSymptomFix
Model skips toolsUser asks weather, model hallucinates answerEmphasize “must use tools for real-time facts”
Model overuses toolsCasual chat still triggers toolsSpecify “use tools only when necessary”
Wrong parametersInvalid argument formatsImprove tool and parameter descriptions
Infinite loopRepeats same tool callsSet maxSteps and add anti-repeat guidance
Ignores tool outputsTool returned data but model doesn’t use itRequire answers grounded in tool results

3.3 Advanced technique: Chain of Thought#

## Response workflow
Before deciding tool calls, think in <thinking> tags:
1. What does user want?
2. What do I already know?
3. What is missing? Which tool should I call?
4. If info is sufficient, answer directly.

This improves controllability and interpretability of tool decisions.


4. The Craft of Tool Design#

Tool design directly determines your agent’s capability ceiling.
The model understands tools through description + schema, not your implementation code.

4.1 Good tools vs bad tools#

Bad design:

tool({
description: "Database operations", // too vague
inputSchema: z.object({
sql: z.string(), // asking model to write SQL directly? risky
}),
})

Good design:

tool({
description: "Get user profile by user ID; returns name, email, and signup time",
inputSchema: z.object({
userId: z.string().describe("Unique user identifier, format like user_123"),
}),
})

4.2 Tool design principles#

Principle 1: single responsibility

❌ processData(action: "create" | "read" | "update" | "delete", ...)
✅ createUser(name, email)
✅ getUser(userId)
✅ updateUser(userId, fields)
✅ deleteUser(userId)

Principle 2: description is model-facing documentation

  • State when this tool should be used
  • State parameter meaning and format
  • State what the return payload contains

Principle 3: parameter design should reduce model error rates

❌ date: z.string() // model may output "tomorrow", "Jan 1", etc.
✅ year: z.number(), month: z.number(), day: z.number() // structured, robust

Principle 4: return values need sufficient context

// Bad: model can misinterpret unit
❌ return { temp: 5 }
// Good: self-contained return
✅ return { temp: 5, unit: "°C", city: "Beijing", condition: "Sunny" }

4.3 Tool granularity trade-offs#

GranularityProsConsBest for
Fine-grained (many small tools)Flexible composition, easier testingMore steps, higher wrong-tool riskGeneral-purpose agents
Coarse-grained (few big tools)Fewer steps, fewer decision errorsLess flexible, logic hardcodedSpecific workflows

5. Core Papers and Mental Models#

5.1 ReAct (Reasoning + Acting)#

Paper: ReAct: Synergizing Reasoning and Acting in Language Models (2022)

This is the conceptual foundation of modern tool-calling agents.

Core idea: alternate between “reasoning” and “acting”.

User: Which city is warmer, Beijing or Tokyo?
Thought: I need both temperatures. Check Beijing first.
Action: getWeather(city="Beijing")
Observation: {temp: 5, condition: "Sunny"}
Thought: Beijing is 5°C. Now check Tokyo.
Action: getWeather(city="Tokyo")
Observation: {temp: 8, condition: "Light rain"}
Thought: 8°C > 5°C. I have enough information.
Answer: Tokyo (8°C) is warmer than Beijing (5°C), difference is 3°C.

Why it matters: almost all mainstream agent tool loops are implementations of ReAct in practice.

5.2 Plan-and-Execute#

Unlike step-by-step ReAct, Plan-and-Execute does full planning first.

User: Compare React vs Vue and write a report.
Plan:
1. Search core features of React
2. Search core features of Vue
3. Search performance comparison data
4. Search ecosystem comparison
5. Synthesize and write final report
Execute:
Step 1: search("React core features 2024") → ...
Step 2: search("Vue core features 2024") → ...
...

Best for: complex, multi-step tasks requiring global planning.

5.3 Reflexion#

Let agent reflect after failure and improve next attempt.

Attempt 1:
Action: search("Python sorting") → irrelevant results
Reflection: query is too broad; should specify algorithm
Attempt 2:
Action: search("Python quicksort implementation") → useful results

5.4 Must-read paper list#

PaperYearCore contribution
ReAct2022Alternating reasoning + acting
Toolformer2023Learning when to use tools
Reflexion2023Self-reflection and iteration
Plan-and-Execute2023Plan-first agent architecture
LATS (Language Agent Tree Search)2023Tree-search-based decisions
Voyager2023Lifelong autonomous learning in Minecraft

6. Memory and Context Management#

6.1 The problem: context windows are finite#

DeepSeek context windows are around 64K-128K tokens. Sounds large, but in agent workloads:

System Prompt: ~500 tokens
Tool definitions (3): ~800 tokens
User message: ~100 tokens
Each tool roundtrip: ~300-1000 tokens
──────────────────────────
After 10 rounds: ~5000-10000 tokens

Long-running agents (for example coding assistants) can quickly approach limits.

6.2 Solution A: conversation compression#

// Compress history with another LLM call when messages grow too long
async function compressHistory(messages: Message[]): Promise<Message[]> {
const summary = await generateText({
model: deepseek("deepseek-chat"),
prompt: `Summarize key facts and decisions from the conversation:
${JSON.stringify(messages)}`,
});
return [
messages[0], // keep system prompt
{ role: "system", content: `Conversation summary: ${summary.text}` },
...messages.slice(-4), // keep most recent 4 messages
];
}

6.3 Solution B: RAG (Retrieval-Augmented Generation)#

Instead of stuffing everything into context, store knowledge in vector DB and retrieve on demand.

User question → vector search → insert relevant chunks into context → LLM answer
// Conceptual example
const relevantDocs = await vectorDB.search(userQuery, { topK: 5 });
const context = relevantDocs.map((d) => d.content).join("\n");
const response = await generateText({
model: deepseek("deepseek-chat"),
system: `Answer based on references below:\n${context}`,
prompt: userQuery,
});

6.4 Three memory layers#

LayerImplementationLifecycleExample
Working memorymessages arraySingle conversationCurrent dialogue context
Short-term memoryDB/filesCross-session (days/weeks)User preferences, recent tasks
Long-term memoryVector DBPersistentKnowledge base, historical decisions

7. Multi-Agent Orchestration#

A single agent has limits. For complex tasks, multiple agents should collaborate.

7.1 Pattern 1: Pipeline#

Planner Agent → Executor Agent → Reviewer Agent
plan execute verify
// Conceptual example
const plan = await plannerAgent.generateText({
prompt: "User wants a Todo API, produce an implementation plan",
});
const code = await executorAgent.generateText({
prompt: `Implement code based on this plan: ${plan.text}`,
});
const review = await reviewerAgent.generateText({
prompt: `Review code against plan:\nPlan: ${plan.text}\nCode: ${code.text}`,
});

7.2 Pattern 2: Delegation#

A manager agent assigns subtasks to specialist agents.

Manager Agent
/ | \
Search Agent Code Agent Test Agent

Implementation idea: manager has “delegate-to-agent” tools.

const managerTools = {
delegateToSearchAgent: tool({
description: "Delegate search tasks to search specialist",
inputSchema: z.object({ query: z.string() }),
execute: async ({ query }) => {
const result = await searchAgent.generateText({ prompt: query });
return result.text;
},
}),
delegateToCodeAgent: tool({
description: "Delegate coding tasks to code specialist",
inputSchema: z.object({ task: z.string() }),
execute: async ({ task }) => {
const result = await codeAgent.generateText({ prompt: task });
return result.text;
},
}),
};

7.3 Pattern 3: Debate / Consensus#

Multiple agents analyze from different perspectives, then synthesize.

Agent A (optimistic) ──┐
Agent B (skeptical) ───┤→ Synthesizer Agent → final conclusion
Agent C (technical) ───┘

7.4 Pattern 4: Swarm collaboration#

Agents share a task pool and pick tasks autonomously.

Task Pool: [task1, task2, task3, task4, task5]
Agent A (takes task1) | Agent B (takes task2) | Agent C (takes task3)

7.5 Which pattern to choose?#

ScenarioRecommended patternWhy
Linear workflow (code→test→deploy)PipelineEach step depends on previous
Complex project (parallel modules)Delegation / SwarmParallel work + coordination
High-stakes decision-makingDebateMulti-angle risk reduction
Clear role-based decompositionDelegationCentralized scheduling

8. Reliability Engineering#

This is the biggest gap between demo agents and production agents.

8.1 Typical model failure modes#

Error typeExampleMitigation
HallucinationInvents unsupported tool argsValidate with Zod schema
Format errorsTool args are invalid JSONtry-catch + retry
Wrong tool choiceCalculates when it should searchImprove tool descriptions
Error ignoranceTool fails but model ignoresPrompt explicit error handling
Infinite loopRepeats same tool callmaxSteps + repetition detection
Over-callingSimple task calls many toolsPrompt to think before tool use

8.2 Defensive programming#

// Tool execution wrapper
async function safeExecute(
toolName: string,
args: unknown,
impl: Function
): Promise<string> {
try {
const result = await Promise.race([
impl(args),
new Promise((_, reject) =>
setTimeout(() => reject(new Error("Tool execution timeout")), 10000)
),
]);
return JSON.stringify(result);
} catch (error) {
// Never let tool errors crash the whole agent
// Return structured error to model so it can react
return JSON.stringify({
error: true,
message: `Tool ${toolName} execution failed: ${error}`,
});
}
}

8.3 Output validation#

// Validate model final output with Zod
import { z } from "zod/v4";
import { generateObject } from "ai";
// generateObject forces schema-constrained structured outputs
const { object } = await generateObject({
model: deepseek("deepseek-chat"),
schema: z.object({
answer: z.string(),
confidence: z.number().min(0).max(1),
sources: z.array(z.string()),
}),
prompt: "...",
});
// object is guaranteed to follow schema

8.4 Reliability checklist#

  • All tools have timeout control
  • All tools have try-catch
  • Reasonable maxSteps configured
  • Critical outputs validated by schema
  • Retry mechanism for intermittent API failures
  • Step-by-step logs for observability
  • Sensitive actions (delete, payment) require human confirmation

9. Evaluation (Evals)#

Agent outputs are stochastic: same input can produce different tool sequences and final responses.
So how do we evaluate quality?

9.1 Evaluation dimensions#

DimensionWhat it measuresMethod
CorrectnessIs final answer correct?Human labels + automated checks
Tool efficiencyMinimum steps used?Average step count
Tool accuracyWere correct tools selected?Compare against expected sequence
RobustnessRecovers from tool errors?Inject failures and observe
LatencyEnd-to-end timeTiming
CostToken usageUsage statistics

9.2 Simple eval framework#

interface TestCase {
input: string;
expectedToolCalls?: string[]; // expected tool names
expectedOutputContains?: string[]; // expected output keywords
maxStepsAllowed?: number; // expected max steps
}
const testCases: TestCase[] = [
{
input: "How is Beijing weather",
expectedToolCalls: ["getWeather"],
expectedOutputContains: ["Beijing", "°C"],
maxStepsAllowed: 2,
},
{
input: "Temperature difference between Beijing and Tokyo",
expectedToolCalls: ["getWeather", "getWeather", "calculate"],
expectedOutputContains: ["difference", "3"],
maxStepsAllowed: 5,
},
];
async function runEval(testCases: TestCase[]) {
let passed = 0;
for (const tc of testCases) {
const { text, steps } = await runAgent(tc.input);
const toolsCalled = steps.flatMap((s) => s.toolCalls.map((c) => c.toolName));
const toolsMatch = tc.expectedToolCalls
? JSON.stringify(toolsCalled) === JSON.stringify(tc.expectedToolCalls)
: true;
const outputMatch = tc.expectedOutputContains
? tc.expectedOutputContains.every((kw) => text.includes(kw))
: true;
const stepsOk = tc.maxStepsAllowed
? steps.length <= tc.maxStepsAllowed
: true;
if (toolsMatch && outputMatch && stepsOk) {
passed++;
console.log(`✅ PASS: "${tc.input}"`);
} else {
console.log(`❌ FAIL: "${tc.input}"`);
if (!toolsMatch) console.log(` Tools: expected ${tc.expectedToolCalls}, got ${toolsCalled}`);
if (!outputMatch) console.log(` Output missing expected keywords`);
if (!stepsOk) console.log(` Steps: ${steps.length} > ${tc.maxStepsAllowed}`);
}
}
console.log(`\nResult: ${passed}/${testCases.length} passed`);
}

9.3 LLM-as-Judge#

Use another LLM to score response quality:

const judgment = await generateText({
model: deepseek("deepseek-chat"),
prompt: `You are a judge. Evaluate this assistant response.
User question: ${userQuestion}
Assistant answer: ${agentAnswer}
Scoring rubric (1-5):
1. Accuracy: Is answer grounded in tool-returned data?
2. Completeness: Does it answer all user asks?
3. Conciseness: Is there unnecessary filler?
Provide scores and rationale.`,
});

10. Practical Project Ideas#

Ordered by increasing difficulty. Each project exposes different real-world problems.

Project 1: Personal knowledge-base QA agent#

Build: read local Markdown files and answer questions via RAG.

You’ll learn:

  • Chunking
  • Embeddings
  • Similarity retrieval
  • Injecting retrieval results into prompts

Suggested stack: AI SDK + local vector DB (for example vectra or orama)

Project 2: CLI coding assistant#

Build: a CLI agent that can read/write files and run commands (a simplified Claude Code style tool).

You’ll learn:

  • Permission controls for dangerous operations
  • File-system tool design
  • Command sandboxing
  • Error recovery

Example tools:

readFile(path) → read file
writeFile(path, content) → write file
runCommand(cmd) → run command (requires confirmation)
listFiles(dir) → list directory

Project 3: Multi-agent research assistant#

Build: given a topic, automatically search, read, summarize, and generate a report.

You’ll learn:

  • Multi-agent orchestration
  • Real API integration (search APIs)
  • Long-context handling
  • Structured outputs

Architecture:

Planner → [Searcher, Searcher, Searcher] (parallel) → Synthesizer → Writer

Project 4: Self-improving code-generation agent#

Build: requirements → generate code → run tests → analyze failures → patch code → retry.

You’ll learn:

  • Reflexion pattern
  • Code execution sandbox
  • Test-driven agent loops
  • Failure analysis and recovery

Papers#

PaperLink
ReActhttps://arxiv.org/abs/2210.03629
Toolformerhttps://arxiv.org/abs/2302.04761
Reflexionhttps://arxiv.org/abs/2303.11366
Voyagerhttps://arxiv.org/abs/2305.16291
LATShttps://arxiv.org/abs/2310.04406
A Survey on LLM-based Agentshttps://arxiv.org/abs/2308.11432

Documentation#

ResourceLink
Vercel AI SDK docshttps://ai-sdk.dev
Vercel AI SDK - Agentshttps://ai-sdk.dev/docs/foundations/agents
OpenAI Function Callinghttps://platform.openai.com/docs/guides/function-calling
DeepSeek API docshttps://platform.deepseek.com/api-docs
Anthropic Tool Use guidehttps://docs.anthropic.com/en/docs/build-with-claude/tool-use

Open-source projects worth reading#

ProjectWhy read it
Vercel AI SDK (vercel/ai)Production-grade agent loop patterns
LangChain.jsAbstractions for chain/agent/memory
AutoGPTEarly autonomous agent lessons and limitations
OpenDevinOpen-source coding agent tool design + sandboxing
CrewAIMulti-agent collaboration patterns

Courses and blogs#

ResourceNote
DeepLearning.AI - AI AgentsAndrew Ng short course, free
Lilian Weng’s blogOpenAI researcher; many agent surveys
Simon Willison’s blogPractical LLM case studies

Summary#

SDK usage → Entry point (you are here)
Handwritten loop → Understand fundamentals
Tool design → Sets capability ceiling
Prompt engineering → Sets stability ceiling
Multi-agent design → Handles complex tasks
Reliability → Bridge from demo to production
Evaluation → Quantify improvement direction

Most important advice: build an agent that solves your own real problem.
Demos always look smooth; real scenarios reveal the actual hard parts.