AI Agent Deep Learning Guide

AI Agent Deep Learning Guide (Generated by Claude Opus 4.6)#

A complete learning path from “calling SDK APIs” to “actually understanding agents”

Table of Contents#

1. The Essence of Agents#

One-line definition#

Agent = LLM + Tool Calling + Loop

That’s it. Every framework (Vercel AI SDK, LangChain, CrewAI) is essentially wrapping this.

Pseudocode#

1
function agent(userMessage, tools, maxSteps):
2
    messages = [systemPrompt, userMessage]
3

4
    for step in 1..maxSteps:
5
        response = LLM(messages, tools)
6

7
        if response.hasToolCalls:
8
            for toolCall in response.toolCalls:
9
                result = execute(toolCall.name, toolCall.input)
10
                messages.append(toolCall)       // record what tool the model wants
11
                messages.append(toolResult)     // record what tool returned
12
        else:
13
            return response.text   // no more tool calls, final answer
14

15
    return "Max steps reached, stop"

Key insight#

The model does not “know” it is an agent.
It only predicts the next token based on the current messages array.
When tool definitions exist in context, the model may emit tool-call formatted output.
When tool results are appended, the model continues generation with updated context.

So-called “autonomous decision-making” is really:

The model reads tool descriptions and knows available capabilities
The model decides whether more tool calls are needed based on the user query and current results
If no further calls are needed, it directly returns a text answer

No magic involved.

2. Handwrite the Agent Loop (Without SDK)#

This is the most important step: implement an agent loop using plain HTTP calls, independent of frameworks.

2.1 Basic version: single-round tool calls#

1
// Handwritten agent loop - understand what SDKs do behind the scenes
2

3
import "dotenv/config";
4

5
const DEEPSEEK_API_KEY = process.env.DEEPSEEK_API_KEY;
6
const BASE_URL = "https://api.deepseek.com/v1";
7

8
// ----- Tool definitions -----
9
const toolDefinitions = [
10
  {
11
    type: "function" as const,
12
    function: {
13
      name: "getWeather",
14
      description: "Get current weather for a given city",
15
      parameters: {
16
        type: "object",
17
        properties: {
18
          city: { type: "string", description: "City name" },
19
        },
20
        required: ["city"],
21
      },
22
    },
23
  },
24
  {
25
    type: "function" as const,
26
    function: {
27
      name: "calculate",
28
      description: "Execute math calculations",
29
      parameters: {
30
        type: "object",
31
        properties: {
32
          expression: { type: "string", description: "Math expression" },
33
        },
34
        required: ["expression"],
35
      },
36
    },
37
  },
38
];
39

40
// ----- Tool implementations -----
41
const toolImplementations: Record<string, (args: any) => any> = {
42
  getWeather: ({ city }: { city: string }) => {
43
    const data: Record<string, any> = {
44
      Beijing: { temp: 5, condition: "Sunny", humidity: 30 },
45
      Tokyo: { temp: 8, condition: "Light rain", humidity: 75 },
46
    };
47
    return data[city] ?? { temp: 20, condition: "Unknown", humidity: 50 };
48
  },
49
  calculate: ({ expression }: { expression: string }) => {
50
    const sanitized = expression.replace(/[^0-9+\-*/().% ]/g, "");
51
    return { result: new Function(`return (${sanitized})`)() };
52
  },
53
};
54

55
// ----- Core: Agent loop -----
56
async function callLLM(messages: any[]) {
57
  const response = await fetch(`${BASE_URL}/chat/completions`, {
58
    method: "POST",
59
    headers: {
60
      "Content-Type": "application/json",
61
      Authorization: `Bearer ${DEEPSEEK_API_KEY}`,
62
    },
63
    body: JSON.stringify({
64
      model: "deepseek-chat",
65
      messages,
66
      tools: toolDefinitions,
67
    }),
68
  });
69
  const data = await response.json();
70
  return data.choices[0].message;
71
}
72

73
async function runAgent(userMessage: string, maxSteps = 10) {
74
  console.log(`\n🧑 User: ${userMessage}\n`);
75

76
  // This is the full agent state: one messages array
77
  const messages: any[] = [
78
    { role: "system", content: "You are a helpful assistant. Answer in Chinese." },
79
    { role: "user", content: userMessage },
80
  ];
81

82
  for (let step = 1; step <= maxSteps; step++) {
83
    // Step 1: call LLM
84
    const assistantMessage = await callLLM(messages);
85
    messages.push(assistantMessage);
86

87
    // Step 2: check tool calls
88
    if (!assistantMessage.tool_calls || assistantMessage.tool_calls.length === 0) {
89
      // No tool calls -> model believes it can answer directly
90
      console.log(`🤖 Assistant: ${assistantMessage.content}`);
91
      return;
92
    }
93

94
    // Step 3: execute each tool call
95
    for (const toolCall of assistantMessage.tool_calls) {
96
      const fnName = toolCall.function.name;
97
      const fnArgs = JSON.parse(toolCall.function.arguments);
98

99
      console.log(`🔧 [Step ${step}] ${fnName}(${JSON.stringify(fnArgs)})`);
100

101
      const result = toolImplementations[fnName](fnArgs);
102
      console.log(`   → ${JSON.stringify(result)}`);
103

104
      // Step 4: append tool result to messages (critical!)
105
      messages.push({
106
        role: "tool",
107
        tool_call_id: toolCall.id,
108
        content: JSON.stringify(result),
109
      });
110
    }
111

112
    // Loop again with updated context
113
  }
114

115
  console.log("⚠️ Max step limit reached");
116
}
117

118
// Run
119
runAgent("Check weather in Beijing and Tokyo, then calculate temperature difference");

Running this makes three things crystal clear:

Each LLM input is the full messages array (including prior tool calls and results)
The model’s “memory” is this array and nothing else
What SDKs do is managing this array and automating this loop

2.2 Hands-on exercises#

Exercise 1: Print full messages before every callLLM and observe how it grows.

Exercise 2: Intentionally make a vague tool description and observe whether tool selection degrades.

Exercise 3: Set maxSteps = 1 and observe how the model behaves with one-step constraints.

3. Prompt Engineering for Agents#

Prompt engineering for chat and for agents are fundamentally different.
Chat only needs good answers; agents also need correct tool selection and invocation.

3.1 System prompt essentials#

1
You are a [role].
2

3
## Capabilities
4
You can use the following tools:
5
- getWeather: query weather. Use when user asks weather-related questions.
6
- calculate: math calculation. Use for precise calculations, do not do mental math.
7
- searchKnowledge: search knowledge base. Use when facts are uncertain.
8

9
## Behavior rules
10
1. Think first whether tools are needed; do not call blindly.
11
2. If one tool result is insufficient, call additional tools.
12
3. After collecting required information, provide a complete Chinese answer.
13
4. If a tool returns error or empty result, tell the user instead of fabricating.
14

15
## Constraints
16
- Do not invent capabilities for tools.
17
- Do not call tools more than 5 times in one response.

3.2 Common issues and fixes#

Issue	Symptom	Fix
Model skips tools	User asks weather, model hallucinates answer	Emphasize “must use tools for real-time facts”
Model overuses tools	Casual chat still triggers tools	Specify “use tools only when necessary”
Wrong parameters	Invalid argument formats	Improve tool and parameter descriptions
Infinite loop	Repeats same tool calls	Set `maxSteps` and add anti-repeat guidance
Ignores tool outputs	Tool returned data but model doesn’t use it	Require answers grounded in tool results

3.3 Advanced technique: Chain of Thought#

1
## Response workflow
2
Before deciding tool calls, think in <thinking> tags:
3
1. What does user want?
4
2. What do I already know?
5
3. What is missing? Which tool should I call?
6
4. If info is sufficient, answer directly.

This improves controllability and interpretability of tool decisions.

4. The Craft of Tool Design#

Tool design directly determines your agent’s capability ceiling.
The model understands tools through description + schema, not your implementation code.

4.1 Good tools vs bad tools#

Bad design:

1
tool({
2
  description: "Database operations",  // too vague
3
  inputSchema: z.object({
4
    sql: z.string(),  // asking model to write SQL directly? risky
5
  }),
6
})

Good design:

1
tool({
2
  description: "Get user profile by user ID; returns name, email, and signup time",
3
  inputSchema: z.object({
4
    userId: z.string().describe("Unique user identifier, format like user_123"),
5
  }),
6
})

4.2 Tool design principles#

Principle 1: single responsibility

1
❌ processData(action: "create" | "read" | "update" | "delete", ...)
2
✅ createUser(name, email)
3
✅ getUser(userId)
4
✅ updateUser(userId, fields)
5
✅ deleteUser(userId)

Principle 2: description is model-facing documentation

State when this tool should be used
State parameter meaning and format
State what the return payload contains

Principle 3: parameter design should reduce model error rates

1
❌ date: z.string()  // model may output "tomorrow", "Jan 1", etc.
2
✅ year: z.number(), month: z.number(), day: z.number()  // structured, robust

Principle 4: return values need sufficient context

1
// Bad: model can misinterpret unit
2
❌ return { temp: 5 }
3

4
// Good: self-contained return
5
✅ return { temp: 5, unit: "°C", city: "Beijing", condition: "Sunny" }

4.3 Tool granularity trade-offs#

Granularity	Pros	Cons	Best for
Fine-grained (many small tools)	Flexible composition, easier testing	More steps, higher wrong-tool risk	General-purpose agents
Coarse-grained (few big tools)	Fewer steps, fewer decision errors	Less flexible, logic hardcoded	Specific workflows

5. Core Papers and Mental Models#

5.1 ReAct (Reasoning + Acting)#

Paper: ReAct: Synergizing Reasoning and Acting in Language Models (2022)

This is the conceptual foundation of modern tool-calling agents.

Core idea: alternate between “reasoning” and “acting”.

1
User: Which city is warmer, Beijing or Tokyo?
2

3
Thought: I need both temperatures. Check Beijing first.
4
Action: getWeather(city="Beijing")
5
Observation: {temp: 5, condition: "Sunny"}
6

7
Thought: Beijing is 5°C. Now check Tokyo.
8
Action: getWeather(city="Tokyo")
9
Observation: {temp: 8, condition: "Light rain"}
10

11
Thought: 8°C > 5°C. I have enough information.
12
Answer: Tokyo (8°C) is warmer than Beijing (5°C), difference is 3°C.

Why it matters: almost all mainstream agent tool loops are implementations of ReAct in practice.

5.2 Plan-and-Execute#

Unlike step-by-step ReAct, Plan-and-Execute does full planning first.

1
User: Compare React vs Vue and write a report.
2

3
Plan:
4
  1. Search core features of React
5
  2. Search core features of Vue
6
  3. Search performance comparison data
7
  4. Search ecosystem comparison
8
  5. Synthesize and write final report
9

10
Execute:
11
  Step 1: search("React core features 2024") → ...
12
  Step 2: search("Vue core features 2024") → ...
13
  ...

Best for: complex, multi-step tasks requiring global planning.

5.3 Reflexion#

Let agent reflect after failure and improve next attempt.

1
Attempt 1:
2
  Action: search("Python sorting") → irrelevant results
3
  Reflection: query is too broad; should specify algorithm
4

5
Attempt 2:
6
  Action: search("Python quicksort implementation") → useful results

5.4 Must-read paper list#

Paper	Year	Core contribution
ReAct	2022	Alternating reasoning + acting
Toolformer	2023	Learning when to use tools
Reflexion	2023	Self-reflection and iteration
Plan-and-Execute	2023	Plan-first agent architecture
LATS (Language Agent Tree Search)	2023	Tree-search-based decisions
Voyager	2023	Lifelong autonomous learning in Minecraft

6. Memory and Context Management#

6.1 The problem: context windows are finite#

DeepSeek context windows are around 64K-128K tokens. Sounds large, but in agent workloads:

1
System Prompt:         ~500 tokens
2
Tool definitions (3):  ~800 tokens
3
User message:          ~100 tokens
4
Each tool roundtrip:   ~300-1000 tokens
5
──────────────────────────
6
After 10 rounds:       ~5000-10000 tokens

Long-running agents (for example coding assistants) can quickly approach limits.

6.2 Solution A: conversation compression#

1
// Compress history with another LLM call when messages grow too long
2
async function compressHistory(messages: Message[]): Promise<Message[]> {
3
  const summary = await generateText({
4
    model: deepseek("deepseek-chat"),
5
    prompt: `Summarize key facts and decisions from the conversation:
6
    ${JSON.stringify(messages)}`,
7
  });
8

9
  return [
10
    messages[0],  // keep system prompt
11
    { role: "system", content: `Conversation summary: ${summary.text}` },
12
    ...messages.slice(-4),  // keep most recent 4 messages
13
  ];
14
}

6.3 Solution B: RAG (Retrieval-Augmented Generation)#

Instead of stuffing everything into context, store knowledge in vector DB and retrieve on demand.

1
User question → vector search → insert relevant chunks into context → LLM answer
2
// Conceptual example
3
const relevantDocs = await vectorDB.search(userQuery, { topK: 5 });
4
const context = relevantDocs.map((d) => d.content).join("\n");
5

6
const response = await generateText({
7
  model: deepseek("deepseek-chat"),
8
  system: `Answer based on references below:\n${context}`,
9
  prompt: userQuery,
10
});

6.4 Three memory layers#

Layer	Implementation	Lifecycle	Example
Working memory	`messages` array	Single conversation	Current dialogue context
Short-term memory	DB/files	Cross-session (days/weeks)	User preferences, recent tasks
Long-term memory	Vector DB	Persistent	Knowledge base, historical decisions

7. Multi-Agent Orchestration#

A single agent has limits. For complex tasks, multiple agents should collaborate.

7.1 Pattern 1: Pipeline#

1
Planner Agent → Executor Agent → Reviewer Agent
2
     plan            execute          verify
3
// Conceptual example
4
const plan = await plannerAgent.generateText({
5
  prompt: "User wants a Todo API, produce an implementation plan",
6
});
7

8
const code = await executorAgent.generateText({
9
  prompt: `Implement code based on this plan: ${plan.text}`,
10
});
11

12
const review = await reviewerAgent.generateText({
13
  prompt: `Review code against plan:\nPlan: ${plan.text}\nCode: ${code.text}`,
14
});

7.2 Pattern 2: Delegation#

A manager agent assigns subtasks to specialist agents.

1
Manager Agent
2
        /      |      \
3
  Search Agent  Code Agent  Test Agent

Implementation idea: manager has “delegate-to-agent” tools.

1
const managerTools = {
2
  delegateToSearchAgent: tool({
3
    description: "Delegate search tasks to search specialist",
4
    inputSchema: z.object({ query: z.string() }),
5
    execute: async ({ query }) => {
6
      const result = await searchAgent.generateText({ prompt: query });
7
      return result.text;
8
    },
9
  }),
10
  delegateToCodeAgent: tool({
11
    description: "Delegate coding tasks to code specialist",
12
    inputSchema: z.object({ task: z.string() }),
13
    execute: async ({ task }) => {
14
      const result = await codeAgent.generateText({ prompt: task });
15
      return result.text;
16
    },
17
  }),
18
};

7.3 Pattern 3: Debate / Consensus#

Multiple agents analyze from different perspectives, then synthesize.

1
Agent A (optimistic) ──┐
2
Agent B (skeptical) ───┤→ Synthesizer Agent → final conclusion
3
Agent C (technical) ───┘

7.4 Pattern 4: Swarm collaboration#

Agents share a task pool and pick tasks autonomously.

1
Task Pool: [task1, task2, task3, task4, task5]
2
                    ↑
3
  Agent A (takes task1) | Agent B (takes task2) | Agent C (takes task3)

7.5 Which pattern to choose?#

Scenario	Recommended pattern	Why
Linear workflow (code→test→deploy)	Pipeline	Each step depends on previous
Complex project (parallel modules)	Delegation / Swarm	Parallel work + coordination
High-stakes decision-making	Debate	Multi-angle risk reduction
Clear role-based decomposition	Delegation	Centralized scheduling

8. Reliability Engineering#

This is the biggest gap between demo agents and production agents.

8.1 Typical model failure modes#

Error type	Example	Mitigation
Hallucination	Invents unsupported tool args	Validate with Zod schema
Format errors	Tool args are invalid JSON	`try-catch` + retry
Wrong tool choice	Calculates when it should search	Improve tool descriptions
Error ignorance	Tool fails but model ignores	Prompt explicit error handling
Infinite loop	Repeats same tool call	`maxSteps` + repetition detection
Over-calling	Simple task calls many tools	Prompt to think before tool use

8.2 Defensive programming#

1
// Tool execution wrapper
2
async function safeExecute(
3
  toolName: string,
4
  args: unknown,
5
  impl: Function
6
): Promise<string> {
7
  try {
8
    const result = await Promise.race([
9
      impl(args),
10
      new Promise((_, reject) =>
11
        setTimeout(() => reject(new Error("Tool execution timeout")), 10000)
12
      ),
13
    ]);
14
    return JSON.stringify(result);
15
  } catch (error) {
16
    // Never let tool errors crash the whole agent
17
    // Return structured error to model so it can react
18
    return JSON.stringify({
19
      error: true,
20
      message: `Tool ${toolName} execution failed: ${error}`,
21
    });
22
  }
23
}

8.3 Output validation#

1
// Validate model final output with Zod
2
import { z } from "zod/v4";
3
import { generateObject } from "ai";
4

5
// generateObject forces schema-constrained structured outputs
6
const { object } = await generateObject({
7
  model: deepseek("deepseek-chat"),
8
  schema: z.object({
9
    answer: z.string(),
10
    confidence: z.number().min(0).max(1),
11
    sources: z.array(z.string()),
12
  }),
13
  prompt: "...",
14
});
15
// object is guaranteed to follow schema

8.4 Reliability checklist#

All tools have timeout control
All tools have try-catch
Reasonable maxSteps configured
Critical outputs validated by schema
Retry mechanism for intermittent API failures
Step-by-step logs for observability
Sensitive actions (delete, payment) require human confirmation

9. Evaluation (Evals)#

Agent outputs are stochastic: same input can produce different tool sequences and final responses.
So how do we evaluate quality?

9.1 Evaluation dimensions#

Dimension	What it measures	Method
Correctness	Is final answer correct?	Human labels + automated checks
Tool efficiency	Minimum steps used?	Average step count
Tool accuracy	Were correct tools selected?	Compare against expected sequence
Robustness	Recovers from tool errors?	Inject failures and observe
Latency	End-to-end time	Timing
Cost	Token usage	Usage statistics

9.2 Simple eval framework#

1
interface TestCase {
2
  input: string;
3
  expectedToolCalls?: string[];    // expected tool names
4
  expectedOutputContains?: string[]; // expected output keywords
5
  maxStepsAllowed?: number;          // expected max steps
6
}
7

8
const testCases: TestCase[] = [
9
  {
10
    input: "How is Beijing weather",
11
    expectedToolCalls: ["getWeather"],
12
    expectedOutputContains: ["Beijing", "°C"],
13
    maxStepsAllowed: 2,
14
  },
15
  {
16
    input: "Temperature difference between Beijing and Tokyo",
17
    expectedToolCalls: ["getWeather", "getWeather", "calculate"],
18
    expectedOutputContains: ["difference", "3"],
19
    maxStepsAllowed: 5,
20
  },
21
];
22

23
async function runEval(testCases: TestCase[]) {
24
  let passed = 0;
25
  for (const tc of testCases) {
26
    const { text, steps } = await runAgent(tc.input);
27
    const toolsCalled = steps.flatMap((s) => s.toolCalls.map((c) => c.toolName));
28

29
    const toolsMatch = tc.expectedToolCalls
30
      ? JSON.stringify(toolsCalled) === JSON.stringify(tc.expectedToolCalls)
31
      : true;
32
    const outputMatch = tc.expectedOutputContains
33
      ? tc.expectedOutputContains.every((kw) => text.includes(kw))
34
      : true;
35
    const stepsOk = tc.maxStepsAllowed
36
      ? steps.length <= tc.maxStepsAllowed
37
      : true;
38

39
    if (toolsMatch && outputMatch && stepsOk) {
40
      passed++;
41
      console.log(`✅ PASS: "${tc.input}"`);
42
    } else {
43
      console.log(`❌ FAIL: "${tc.input}"`);
44
      if (!toolsMatch) console.log(`   Tools: expected ${tc.expectedToolCalls}, got ${toolsCalled}`);
45
      if (!outputMatch) console.log(`   Output missing expected keywords`);
46
      if (!stepsOk) console.log(`   Steps: ${steps.length} > ${tc.maxStepsAllowed}`);
47
    }
48
  }
49
  console.log(`\nResult: ${passed}/${testCases.length} passed`);
50
}

9.3 LLM-as-Judge#

Use another LLM to score response quality:

1
const judgment = await generateText({
2
  model: deepseek("deepseek-chat"),
3
  prompt: `You are a judge. Evaluate this assistant response.
4

5
User question: ${userQuestion}
6
Assistant answer: ${agentAnswer}
7

8
Scoring rubric (1-5):
9
1. Accuracy: Is answer grounded in tool-returned data?
10
2. Completeness: Does it answer all user asks?
11
3. Conciseness: Is there unnecessary filler?
12

13
Provide scores and rationale.`,
14
});

10. Practical Project Ideas#

Ordered by increasing difficulty. Each project exposes different real-world problems.

Project 1: Personal knowledge-base QA agent#

Build: read local Markdown files and answer questions via RAG.

You’ll learn:

Chunking
Embeddings
Similarity retrieval
Injecting retrieval results into prompts

Suggested stack: AI SDK + local vector DB (for example vectra or orama)

Project 2: CLI coding assistant#

Build: a CLI agent that can read/write files and run commands (a simplified Claude Code style tool).

You’ll learn:

Permission controls for dangerous operations
File-system tool design
Command sandboxing
Error recovery

Example tools:

1
readFile(path) → read file
2
writeFile(path, content) → write file
3
runCommand(cmd) → run command (requires confirmation)
4
listFiles(dir) → list directory

Project 3: Multi-agent research assistant#

Build: given a topic, automatically search, read, summarize, and generate a report.

You’ll learn:

Multi-agent orchestration
Real API integration (search APIs)
Long-context handling
Structured outputs

Architecture:

1
Planner → [Searcher, Searcher, Searcher] (parallel) → Synthesizer → Writer

Project 4: Self-improving code-generation agent#

Build: requirements → generate code → run tests → analyze failures → patch code → retry.

You’ll learn:

Reflexion pattern
Code execution sandbox
Test-driven agent loops
Failure analysis and recovery

11. Recommended Resources#

Papers#

Paper	Link
ReAct	https://arxiv.org/abs/2210.03629
Toolformer	https://arxiv.org/abs/2302.04761
Reflexion	https://arxiv.org/abs/2303.11366
Voyager	https://arxiv.org/abs/2305.16291
LATS	https://arxiv.org/abs/2310.04406
A Survey on LLM-based Agents	https://arxiv.org/abs/2308.11432

Documentation#

Resource	Link
Vercel AI SDK docs	https://ai-sdk.dev
Vercel AI SDK - Agents	https://ai-sdk.dev/docs/foundations/agents
OpenAI Function Calling	https://platform.openai.com/docs/guides/function-calling
DeepSeek API docs	https://platform.deepseek.com/api-docs
Anthropic Tool Use guide	https://docs.anthropic.com/en/docs/build-with-claude/tool-use

Open-source projects worth reading#

Project	Why read it
Vercel AI SDK (vercel/ai)	Production-grade agent loop patterns
LangChain.js	Abstractions for chain/agent/memory
AutoGPT	Early autonomous agent lessons and limitations
OpenDevin	Open-source coding agent tool design + sandboxing
CrewAI	Multi-agent collaboration patterns

Courses and blogs#

Resource	Note
DeepLearning.AI - AI Agents	Andrew Ng short course, free
Lilian Weng’s blog	OpenAI researcher; many agent surveys
Simon Willison’s blog	Practical LLM case studies

Summary#

1
SDK usage            → Entry point (you are here)
2
Handwritten loop     → Understand fundamentals
3
Tool design          → Sets capability ceiling
4
Prompt engineering   → Sets stability ceiling
5
Multi-agent design   → Handles complex tasks
6
Reliability          → Bridge from demo to production
7
Evaluation           → Quantify improvement direction

Most important advice: build an agent that solves your own real problem.
Demos always look smooth; real scenarios reveal the actual hard parts.