Project overview
I rebuilt my Pixiv crawler (originally written in Python last year) using AI-assisted development and a serverless architecture. It now supports scheduled tasks, batch downloads, and proxied image access.
✨ Core features
- Serverless architecture: built on Vercel + Cloudflare Workers with near-zero ops overhead
- Smart crawling: supports popularity filtering and automatic quality discovery
- Image proxy: solves CORS issues and provides fast image delivery
- Data storage: integrated with Supabase, supports complex queries
- Batch download: bulk image download to Cloudflare R2
- ⏰ Scheduled jobs: auto-crawl rankings with no manual intervention
System architecture
```
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   Vercel API    │      │ Cloudflare Cron │      │   Supabase DB   │
│ (Main Service)  │◄────►│ (Scheduled Jobs)│◄────►│ (Data Storage)  │
└─────────────────┘      └─────────────────┘      └─────────────────┘
         │                        │                        │
         ▼                        ▼                        ▼
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   Image Proxy   │      │ Crawl Scheduler │      │ Data Analytics  │
│ (CORS Solution) │      │ (Task Dispatch) │      │       API       │
└─────────────────┘      └─────────────────┘      └─────────────────┘
```
Core implementation
1. Smart crawler engine
This crawler does more than fetch basic metadata; it also estimates content popularity:
```typescript
/**
 * Pixiv crawler core class
 * Supports recommendation and popularity scoring
 */
export class PixivCrawler {
  private headers: any;
  private logManager: any;
  private taskId: string;

  constructor(pid: string, headers: any, logManager: any, taskId: string) {
    this.headers = headers;
    this.logManager = logManager;
    this.taskId = taskId;
  }

  /**
   * Fetch illustration detail and compute popularity
   * @param pid Illustration ID
   * @returns Illustration info + popularity score
   */
  async getIllustDetail(pid: string) {
    try {
      const url = `https://www.pixiv.net/ajax/illust/${pid}`;
      const response = await fetch(url, { headers: this.headers });
      const data = await response.json();

      if (data.error) {
        throw new Error(`API error: ${data.message}`);
      }

      const illust = data.body;

      // Popularity scoring
      const popularity = this.calculatePopularity(
        illust.likeCount,
        illust.bookmarkCount,
        illust.viewCount
      );

      return {
        pid: illust.id,
        title: illust.title,
        tags: illust.tags.tags.map((tag: any) => tag.tag),
        likeCount: illust.likeCount,
        bookmarkCount: illust.bookmarkCount,
        viewCount: illust.viewCount,
        popularity,
        createDate: illust.createDate
      };
    } catch (error) {
      this.logManager.addLog(`Failed to fetch illustration ${pid}: ${error.message}`, 'error', this.taskId);
      throw error;
    }
  }

  /**
   * Popularity formula based on likes, bookmarks, and views
   */
  private calculatePopularity(likes: number, bookmarks: number, views: number): number {
    if (views === 0) return 0;

    const likeRate = likes / views;
    const bookmarkRate = bookmarks / views;

    // Weighted popularity: bookmarks weigh more than likes,
    // log-scaled views reward reach without letting it dominate
    return (likeRate * 0.3 + bookmarkRate * 0.7) * Math.log10(views + 1);
  }
}
```
2. Image proxy service
Solve CORS and serve images with fallback size strategy:
```typescript
/**
 * Pixiv image proxy service
 * Supports multi-size fallback and graceful downgrade
 */
export class PixivProxy {
  private headers: any;

  constructor() {
    this.headers = {
      'Referer': 'https://www.pixiv.net/',
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    };
  }

  /**
   * Proxy image from Pixiv
   * @param pid Illustration ID
   * @param size Preferred size
   * @returns Image stream
   */
  async proxyImage(pid: string, size: string = 'regular') {
    // Size priority: thumb_mini -> small -> regular -> original
    const sizeOptions = ['thumb_mini', 'small', 'regular', 'original'];
    const startIndex = sizeOptions.indexOf(size);

    if (startIndex === -1) {
      throw new Error(`Unsupported size: ${size}`);
    }

    // Try each size by priority
    for (let i = startIndex; i < sizeOptions.length; i++) {
      try {
        const currentSize = sizeOptions[i];
        const imageUrl = await this.getImageUrl(pid, currentSize);

        if (imageUrl) {
          const imageResponse = await fetch(imageUrl, { headers: this.headers });

          if (imageResponse.ok) {
            return {
              data: imageResponse.body,
              contentType: imageResponse.headers.get('content-type'),
              size: currentSize
            };
          }
        }
      } catch (error) {
        console.log(`Failed at size ${sizeOptions[i]}, trying next size`);
        continue;
      }
    }

    throw new Error(`No available image size found for ${pid}`);
  }

  /**
   * Get image URL for a specific size
   */
  private async getImageUrl(pid: string, size: string): Promise<string | null> {
    try {
      const response = await fetch(`https://www.pixiv.net/ajax/illust/${pid}`, {
        headers: this.headers
      });

      const data = await response.json();
      const urls = data.body?.urls;

      return urls?.[size] || null;
    } catch (error) {
      return null;
    }
  }
}
```
3. API design
A full RESTful API for crawling, proxying, stats, and UI:
```typescript
import { Readable } from 'node:stream';

/**
 * Main API handler
 * Supports crawl / download / proxy / stats
 */
export default async function handler(req: VercelRequest, res: VercelResponse) {
  // CORS headers
  res.setHeader('Access-Control-Allow-Origin', '*');
  res.setHeader('Access-Control-Allow-Methods', 'GET, POST, OPTIONS');
  res.setHeader('Access-Control-Allow-Headers', 'Content-Type');

  if (req.method === 'OPTIONS') {
    res.status(200).end();
    return;
  }

  const { action, pid, size } = req.query;

  try {
    switch (action) {
      case 'proxy-image': {
        // Proxy image endpoint
        if (!pid) {
          res.status(400).json({ error: 'Missing pid parameter' });
          return;
        }

        const proxy = new PixivProxy();
        const imageResult = await proxy.proxyImage(pid as string, size as string);

        res.setHeader('Content-Type', imageResult.contentType);
        res.setHeader('Cache-Control', 'public, max-age=86400'); // cache 1 day

        // fetch() gives a web ReadableStream, not a Node stream:
        // convert it before piping into the Vercel response
        Readable.fromWeb(imageResult.data as any).pipe(res);
        return;
      }

      case 'get-pic': {
        // Get illustration data
        const crawler = new PixivCrawler(pid as string, getPixivHeaders(), logManager, 'api_request');
        const illustInfo = await crawler.getIllustDetail(pid as string);

        res.status(200).json({ success: true, data: illustInfo });
        break;
      }

      case 'stats': {
        // Get aggregated stats
        const supabase = new SupabaseService();
        const stats = await supabase.getStats();

        res.status(200).json({ success: true, stats });
        break;
      }

      default: {
        // Return web UI
        const htmlContent = getWebInterface();
        res.setHeader('Content-Type', 'text/html');
        res.status(200).send(htmlContent);
      }
    }
  } catch (error) {
    res.status(500).json({ error: 'Internal server error', message: error.message });
  }
}
```
4. Scheduled crawling
Automated cron jobs with Cloudflare Workers:
```typescript
/**
 * Cloudflare Cron Worker
 * Executes scheduled crawl jobs
 */
export default {
  async scheduled(event: ScheduledEvent, env: Env, ctx: ExecutionContext) {
    console.log('Scheduled task started:', new Date().toISOString());

    try {
      // Daily ranking crawl
      if (shouldRunDailyRanking(event.cron)) {
        await triggerRankingCrawl('daily', env);
      }

      // Weekly ranking crawl
      if (shouldRunWeeklyRanking(event.cron)) {
        await triggerRankingCrawl('weekly', env);
      }

      // Cleanup expired logs
      if (shouldCleanLogs(event.cron)) {
        await cleanExpiredLogs(env);
      }
    } catch (error) {
      console.error('Scheduled task failed:', error);
    }
  }
};

/**
 * Trigger ranking crawl
 */
async function triggerRankingCrawl(type: 'daily' | 'weekly' | 'monthly', env: Env) {
  const endpoint = `${env.MAIN_SERVICE_URL}/api/?action=${type}`;

  try {
    const response = await fetch(endpoint, {
      method: 'GET',
      headers: { 'Authorization': `Bearer ${env.API_TOKEN}` }
    });

    if (response.ok) {
      console.log(`${type} ranking crawl triggered`);
    } else {
      throw new Error(`HTTP ${response.status}: ${response.statusText}`);
    }
  } catch (error) {
    console.error(`Failed to trigger ${type} ranking crawl:`, error);
  }
}
```
3-minute quick deployment
Step 1: Clone the project
```bash
git clone https://github.com/your-username/serverless_pixiv_crawler.git
cd serverless_pixiv_crawler
npm install
```
Step 2: Configure env vars
Copy .env.example to .env:
```bash
# Supabase config
SUPABASE_URL=your_supabase_url_here
SUPABASE_ANON_KEY=your_supabase_anon_key_here

# Pixiv config
PIXIV_COOKIE=your_pixiv_cookie_here

# Cloudflare R2 config (optional)
CLOUDFLARE_ACCOUNT_ID=your_account_id
CLOUDFLARE_ACCESS_KEY_ID=your_access_key
CLOUDFLARE_SECRET_ACCESS_KEY=your_secret_key
CLOUDFLARE_BUCKET_NAME=your_bucket_name
```
Step 3: Deploy to Vercel
```bash
# Install Vercel CLI
npm i -g vercel

# Login and deploy
vercel login
vercel --prod
```
Step 4: Deploy cron worker (optional)
```bash
cd cron_worker
npm install

# Configure Cloudflare Workers
npx wrangler login
npx wrangler deploy
```
Feature demo
1. Web admin dashboard
After deployment, open your Vercel domain and you will see a modern admin dashboard:
- Real-time stats: crawl count, success rate, and more
- Task management: start, stop, and monitor jobs
- Log viewer: real-time runtime logs
- Data search: quickly filter and locate crawled content
2. API examples
```javascript
// Get illustration info
fetch('https://your-domain.vercel.app/api/?action=get-pic&pid=123456')
  .then(res => res.json())
  .then(data => console.log(data));

// Proxy image access
const imageUrl = 'https://your-domain.vercel.app/api/?action=proxy-image&pid=123456&size=regular';
document.getElementById('image').src = imageUrl;

// Start crawl task
fetch('https://your-domain.vercel.app/api/', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    pid: '123456',
    targetNum: 1000,
    popularityThreshold: 0.22
  })
});
```
3. Data analytics
The system automatically collects and analyzes data:
```sql
-- Top tags
SELECT tag, COUNT(*) AS count
FROM illustrations
CROSS JOIN LATERAL unnest(tags) AS tag
GROUP BY tag
ORDER BY count DESC
LIMIT 20;

-- Popularity distribution
SELECT
  CASE
    WHEN popularity >= 0.8 THEN 'Very High'
    WHEN popularity >= 0.5 THEN 'High'
    WHEN popularity >= 0.2 THEN 'Medium'
    ELSE 'Low'
  END AS level,
  COUNT(*) AS count
FROM illustrations
GROUP BY level;
```
Advanced capabilities
1. Smart recommendation
Built-in recommendation logic provides:
- Content discovery: find similar high-quality works from existing data
- Trend prediction: estimate future popularity trends
- Style analysis: classify visual styles automatically
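The "content discovery" idea can be sketched as tag-set similarity over already-crawled works. This is a minimal illustration, not the project's actual recommendation code; the `jaccard` and `similarWorks` helpers are hypothetical names:

```typescript
// Jaccard similarity between two tag sets: |A ∩ B| / |A ∪ B|
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  const intersection = Array.from(setA).filter((t) => setB.has(t)).length;
  const union = new Set([...Array.from(setA), ...Array.from(setB)]).size;
  return union === 0 ? 0 : intersection / union;
}

// Rank stored works against a seed work by tag overlap
function similarWorks(
  seedTags: string[],
  works: { pid: string; tags: string[] }[],
  topN: number = 5
): { pid: string; score: number }[] {
  return works
    .map((w) => ({ pid: w.pid, score: jaccard(seedTags, w.tags) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topN);
}
```

In practice the candidate list would come from the Supabase `illustrations` table, and the score could be blended with the popularity value computed by the crawler.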
2. Anti-ban strategy
- Header rotation: mimic real browser behavior
- ⏱️ Adaptive delay: dynamic request intervals
- Retry logic: resilient handling of transient errors
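The adaptive-delay idea combines exponential backoff after failures with random jitter so requests never fall into a detectable fixed rhythm. A sketch under those assumptions (the function names are illustrative, not the project's API):

```typescript
// Adaptive delay: base interval doubles per consecutive failure,
// plus random jitter, capped at maxMs
function nextDelayMs(baseMs: number, consecutiveFailures: number, maxMs: number = 60_000): number {
  const backoff = baseMs * Math.pow(2, consecutiveFailures); // exponential backoff
  const jitter = Math.random() * baseMs;                     // 0..baseMs of jitter
  return Math.min(backoff + jitter, maxMs);
}

// Promise-based sleep for use between crawl requests
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));
```

Usage between requests would look like `await sleep(nextDelayMs(1000, failures))`, resetting `failures` to 0 after each success.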
3. Data quality assurance
- ✅ Automatic deduplication
- Content validation
- Quality scoring per item
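Deduplication can be sketched as a seen-set keyed by illustration ID. This toy `DedupFilter` class is illustrative only; in the deployed system the check would more likely live in Supabase (e.g. a unique constraint on `pid`):

```typescript
// In-memory dedup by pid: add() reports whether the pid was new
class DedupFilter {
  private seen: Set<string> = new Set();

  // Returns true if the pid is new (and records it), false if duplicate
  add(pid: string): boolean {
    if (this.seen.has(pid)) return false;
    this.seen.add(pid);
    return true;
  }
}
```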
Cost analysis
The biggest advantage: it can run fully on free tiers.
| Service | Free quota | Enough for |
|---|---|---|
| Vercel | 100 GB bandwidth/month | Small to medium projects |
| Supabase | 500MB DB | ~500,000 records |
| Cloudflare Workers | 100,000 req/day | Most use cases |
| Cloudflare R2 | 10GB storage | Tens of thousands of images |
Troubleshooting
Common problems
- Deployment fails
  - Check environment variable settings
  - Verify Supabase connectivity
- Crawl fails
  - Verify Pixiv cookie validity
  - Check network status
- Images not loading
  - Confirm proxy service is running
  - Check CORS configuration
Performance optimization tips
- Enable caching: tune CDN cache policies
- Monitor metrics: track performance regularly
- Clean up regularly: remove expired logs and stale data
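"Enable caching" can also apply inside the API itself: within one warm serverless instance, a tiny in-memory TTL cache avoids re-fetching the same Pixiv metadata. A sketch, assuming the `TtlCache` name and TTL value are illustrative rather than part of the project:

```typescript
// Minimal in-memory cache: entries expire ttlMs after being set
class TtlCache<V> {
  private store: Map<string, { value: V; expiresAt: number }> = new Map();
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // lazily evict expired entries
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```

Keyed by `pid`, this would sit in front of the `getIllustDetail` call; the `Cache-Control` header in the proxy endpoint already handles caching on the CDN side.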
Summary
This serverless Pixiv crawler demonstrates what modern web architecture can deliver:
- ✅ Zero ops cost: fully managed cloud services
- ✅ Elastic scalability: handles traffic spikes automatically
- ✅ Complete workflow: crawl, store, analyze, and serve in one system
- ✅ Easy deployment: launch in about 3 minutes
Whether you want to learn serverless architecture or need a practical data-collection tool, this project is a great starting point.
Related resources
Disclaimer: this project is for learning and research only. Please follow target-site terms of service and control crawl frequency responsibly.
If this article helped you, please give the project a ⭐ Star!
