Project overview
I rebuilt my Pixiv crawler (originally written in Python last year) using AI-assisted development and a serverless architecture. It now supports scheduled tasks, batch downloads, and proxied image access.
✨ Core features
- Serverless architecture: built on Vercel + Cloudflare Workers with near-zero ops overhead
- Smart crawling: supports popularity filtering and automatic quality discovery
- Image proxy: solves CORS issues and provides fast image delivery
- Data storage: integrated with Supabase, supports complex queries
- Batch download: bulk image download to Cloudflare R2
- ⏰ Scheduled jobs: auto-crawl rankings with no manual intervention
System architecture
```
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   Vercel API    │      │ Cloudflare Cron │      │   Supabase DB   │
│ (Main Service)  │◄────►│ (Scheduled Jobs)│◄────►│ (Data Storage)  │
└─────────────────┘      └─────────────────┘      └─────────────────┘
         │                        │                        │
         ▼                        ▼                        ▼
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   Image Proxy   │      │ Crawl Scheduler │      │ Data Analytics  │
│ (CORS Solution) │      │ (Task Dispatch) │      │       API       │
└─────────────────┘      └─────────────────┘      └─────────────────┘
```
Core implementation
1. Smart crawler engine
This crawler does more than fetch basic metadata; it also estimates content popularity:
```typescript
/**
 * Pixiv crawler core class
 * Supports recommendation and popularity scoring
 */
export class PixivCrawler {
  private headers: any;
  private logManager: any;
  private taskId: string;

  constructor(pid: string, headers: any, logManager: any, taskId: string) {
    this.headers = headers;
    this.logManager = logManager;
    this.taskId = taskId;
  }

  /**
   * Fetch illustration detail and compute popularity
   * @param pid Illustration ID
   * @returns Illustration info + popularity score
   */
  async getIllustDetail(pid: string) {
    try {
      const url = `https://www.pixiv.net/ajax/illust/${pid}`;
      const response = await fetch(url, { headers: this.headers });
      const data = await response.json();

      if (data.error) {
        throw new Error(`API error: ${data.message}`);
      }

      const illust = data.body;

      // Popularity scoring
      const popularity = this.calculatePopularity(
        illust.likeCount,
        illust.bookmarkCount,
        illust.viewCount
      );

      return {
        pid: illust.id,
        title: illust.title,
        tags: illust.tags.tags.map((tag: any) => tag.tag),
        likeCount: illust.likeCount,
        bookmarkCount: illust.bookmarkCount,
        viewCount: illust.viewCount,
        popularity,
        createDate: illust.createDate
      };
    } catch (error) {
      this.logManager.addLog(`Failed to fetch illustration ${pid}: ${error.message}`, 'error', this.taskId);
      throw error;
    }
  }

  /**
   * Popularity formula based on likes, bookmarks, and views
   */
  private calculatePopularity(likes: number, bookmarks: number, views: number): number {
    if (views === 0) return 0;

    const likeRate = likes / views;
    const bookmarkRate = bookmarks / views;

    // Weighted popularity: bookmarks weigh more than likes,
    // log-scaled views reward reach without letting it dominate
    return (likeRate * 0.3 + bookmarkRate * 0.7) * Math.log10(views + 1);
  }
}
```
2. Image proxy service
Solve CORS and serve images with fallback size strategy:
```typescript
/**
 * Pixiv image proxy service
 * Supports multi-size fallback and graceful downgrade
 */
export class PixivProxy {
  private headers: any;

  constructor() {
    this.headers = {
      'Referer': 'https://www.pixiv.net/',
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    };
  }

  /**
   * Proxy image from Pixiv
   * @param pid Illustration ID
   * @param size Preferred size
   * @returns Image stream
   */
  async proxyImage(pid: string, size: string = 'regular') {
    // Size priority: thumb_mini -> small -> regular -> original
    const sizeOptions = ['thumb_mini', 'small', 'regular', 'original'];
    const startIndex = sizeOptions.indexOf(size);

    if (startIndex === -1) {
      throw new Error(`Unsupported size: ${size}`);
    }

    // Try each size by priority
    for (let i = startIndex; i < sizeOptions.length; i++) {
      try {
        const currentSize = sizeOptions[i];
        const imageUrl = await this.getImageUrl(pid, currentSize);

        if (imageUrl) {
          const imageResponse = await fetch(imageUrl, { headers: this.headers });

          if (imageResponse.ok) {
            return {
              data: imageResponse.body,
              contentType: imageResponse.headers.get('content-type'),
              size: currentSize
            };
          }
        }
      } catch (error) {
        console.log(`Failed at size ${sizeOptions[i]}, trying next size`);
        continue;
      }
    }

    throw new Error(`No available image size found for ${pid}`);
  }

  /**
   * Get image URL for a specific size
   */
  private async getImageUrl(pid: string, size: string): Promise<string | null> {
    try {
      const response = await fetch(`https://www.pixiv.net/ajax/illust/${pid}`, {
        headers: this.headers
      });

      const data = await response.json();
      const urls = data.body?.urls;

      return urls?.[size] || null;
    } catch (error) {
      return null;
    }
  }
}
```
3. API design
A full RESTful API for crawling, proxying, stats, and UI:
```typescript
import { Readable } from 'node:stream';

/**
 * Main API handler
 * Supports crawl / download / proxy / stats
 */
export default async function handler(req: VercelRequest, res: VercelResponse) {
  // CORS headers
  res.setHeader('Access-Control-Allow-Origin', '*');
  res.setHeader('Access-Control-Allow-Methods', 'GET, POST, OPTIONS');
  res.setHeader('Access-Control-Allow-Headers', 'Content-Type');

  if (req.method === 'OPTIONS') {
    res.status(200).end();
    return;
  }

  const { action, pid, size } = req.query;

  try {
    switch (action) {
      case 'proxy-image': {
        // Proxy image endpoint
        if (!pid) {
          res.status(400).json({ error: 'Missing pid parameter' });
          return;
        }

        const proxy = new PixivProxy();
        const imageResult = await proxy.proxyImage(pid as string, size as string);

        res.setHeader('Content-Type', imageResult.contentType);
        res.setHeader('Cache-Control', 'public, max-age=86400'); // cache 1 day

        // fetch() gives a web ReadableStream, not a Node stream:
        // convert it before piping into the Vercel response
        Readable.fromWeb(imageResult.data as any).pipe(res);
        return;
      }

      case 'get-pic': {
        // Get illustration data
        const crawler = new PixivCrawler(pid as string, getPixivHeaders(), logManager, 'api_request');
        const illustInfo = await crawler.getIllustDetail(pid as string);

        res.status(200).json({ success: true, data: illustInfo });
        break;
      }

      case 'stats': {
        // Get aggregated stats
        const supabase = new SupabaseService();
        const stats = await supabase.getStats();

        res.status(200).json({ success: true, stats });
        break;
      }

      default: {
        // Return web UI
        const htmlContent = getWebInterface();
        res.setHeader('Content-Type', 'text/html');
        res.status(200).send(htmlContent);
      }
    }
  } catch (error) {
    res.status(500).json({ error: 'Internal server error', message: error.message });
  }
}
```
4. Scheduled crawling
Automated cron jobs with Cloudflare Workers:
```typescript
/**
 * Cloudflare Cron Worker
 * Executes scheduled crawl jobs
 */
export default {
  async scheduled(event: ScheduledEvent, env: Env, ctx: ExecutionContext) {
    console.log('Scheduled task started:', new Date().toISOString());

    try {
      // Daily ranking crawl
      if (shouldRunDailyRanking(event.cron)) {
        await triggerRankingCrawl('daily', env);
      }

      // Weekly ranking crawl
      if (shouldRunWeeklyRanking(event.cron)) {
        await triggerRankingCrawl('weekly', env);
      }

      // Cleanup expired logs
      if (shouldCleanLogs(event.cron)) {
        await cleanExpiredLogs(env);
      }
    } catch (error) {
      console.error('Scheduled task failed:', error);
    }
  }
};

/**
 * Trigger ranking crawl
 */
async function triggerRankingCrawl(type: 'daily' | 'weekly' | 'monthly', env: Env) {
  const endpoint = `${env.MAIN_SERVICE_URL}/api/?action=${type}`;

  try {
    const response = await fetch(endpoint, {
      method: 'GET',
      headers: { 'Authorization': `Bearer ${env.API_TOKEN}` }
    });

    if (response.ok) {
      console.log(`${type} ranking crawl triggered`);
    } else {
      throw new Error(`HTTP ${response.status}: ${response.statusText}`);
    }
  } catch (error) {
    console.error(`Failed to trigger ${type} ranking crawl:`, error);
  }
}
```
3-minute quick deployment
Step 1: Clone the project
```bash
git clone https://github.com/your-username/serverless_pixiv_crawler.git
cd serverless_pixiv_crawler
npm install
```
Step 2: Configure env vars
Copy .env.example to .env:
```bash
# Supabase config
SUPABASE_URL=your_supabase_url_here
SUPABASE_ANON_KEY=your_supabase_anon_key_here

# Pixiv config
PIXIV_COOKIE=your_pixiv_cookie_here

# Cloudflare R2 config (optional)
CLOUDFLARE_ACCOUNT_ID=your_account_id
CLOUDFLARE_ACCESS_KEY_ID=your_access_key
CLOUDFLARE_SECRET_ACCESS_KEY=your_secret_key
CLOUDFLARE_BUCKET_NAME=your_bucket_name
```
Step 3: Deploy to Vercel
```bash
# Install Vercel CLI
npm i -g vercel

# Login and deploy
vercel login
vercel --prod
```
Step 4: Deploy cron worker (optional)
```bash
cd cron_worker
npm install

# Configure Cloudflare Workers
npx wrangler login
npx wrangler deploy
```
Feature demo
1. Web admin dashboard
After deployment, open your Vercel domain and you will see a modern admin dashboard:
- Real-time stats: crawl count, success rate, and more
- Task management: start, stop, and monitor jobs
- Log viewer: real-time runtime logs
- Data search: quickly filter and locate crawled content
2. API examples
```javascript
// Get illustration info
fetch('https://your-domain.vercel.app/api/?action=get-pic&pid=123456')
  .then(res => res.json())
  .then(data => console.log(data));

// Proxy image access
const imageUrl = 'https://your-domain.vercel.app/api/?action=proxy-image&pid=123456&size=regular';
document.getElementById('image').src = imageUrl;

// Start crawl task
fetch('https://your-domain.vercel.app/api/', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    pid: '123456',
    targetNum: 1000,
    popularityThreshold: 0.22
  })
});
```
3. Data analytics
The system automatically collects and analyzes data:
```sql
-- Top tags
SELECT tag, COUNT(*) AS count
FROM illustrations
CROSS JOIN LATERAL unnest(tags) AS tag
GROUP BY tag
ORDER BY count DESC
LIMIT 20;

-- Popularity distribution
SELECT
  CASE
    WHEN popularity >= 0.8 THEN 'Very High'
    WHEN popularity >= 0.5 THEN 'High'
    WHEN popularity >= 0.2 THEN 'Medium'
    ELSE 'Low'
  END AS level,
  COUNT(*) AS count
FROM illustrations
GROUP BY level;
```
Advanced capabilities
1. Smart recommendation
Built-in recommendation logic provides:
- Content discovery: find similar high-quality works from existing data
- Trend prediction: estimate future popularity trends
- Style analysis: classify visual styles automatically
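The "content discovery" idea can be sketched as tag-set similarity over already-crawled works. This is a minimal illustration, not the project's actual recommendation code; the `jaccard` and `similarWorks` helpers are hypothetical names:

```typescript
// Jaccard similarity between two tag sets: |A ∩ B| / |A ∪ B|
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  const intersection = Array.from(setA).filter((t) => setB.has(t)).length;
  const union = new Set([...Array.from(setA), ...Array.from(setB)]).size;
  return union === 0 ? 0 : intersection / union;
}

// Rank stored works against a seed work by tag overlap
function similarWorks(
  seedTags: string[],
  works: { pid: string; tags: string[] }[],
  topN: number = 5
): { pid: string; score: number }[] {
  return works
    .map((w) => ({ pid: w.pid, score: jaccard(seedTags, w.tags) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topN);
}
```

In practice the candidate list would come from the Supabase `illustrations` table, and the score could be blended with the popularity value computed by the crawler.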
2. Anti-ban strategy
- Header rotation: mimic real browser behavior
- ⏱️ Adaptive delay: dynamic request intervals
- Retry logic: resilient handling of transient errors
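The adaptive-delay idea combines exponential backoff after failures with random jitter so requests never fall into a detectable fixed rhythm. A sketch under those assumptions (the function names are illustrative, not the project's API):

```typescript
// Adaptive delay: base interval doubles per consecutive failure,
// plus random jitter, capped at maxMs
function nextDelayMs(baseMs: number, consecutiveFailures: number, maxMs: number = 60_000): number {
  const backoff = baseMs * Math.pow(2, consecutiveFailures); // exponential backoff
  const jitter = Math.random() * baseMs;                     // 0..baseMs of jitter
  return Math.min(backoff + jitter, maxMs);
}

// Promise-based sleep for use between crawl requests
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));
```

Usage between requests would look like `await sleep(nextDelayMs(1000, failures))`, resetting `failures` to 0 after each success.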
3. Data quality assurance
- ✅ Automatic deduplication
- Content validation
- Quality scoring per item
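Deduplication can be sketched as a seen-set keyed by illustration ID. This toy `DedupFilter` class is illustrative only; in the deployed system the check would more likely live in Supabase (e.g. a unique constraint on `pid`):

```typescript
// In-memory dedup by pid: add() reports whether the pid was new
class DedupFilter {
  private seen: Set<string> = new Set();

  // Returns true if the pid is new (and records it), false if duplicate
  add(pid: string): boolean {
    if (this.seen.has(pid)) return false;
    this.seen.add(pid);
    return true;
  }
}
```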
Cost analysis
The biggest advantage: it can run fully on free tiers.
| Service | Free quota | Enough for |
|---|---|---|
| Vercel | 100 GB bandwidth/month | Small to medium projects |
| Supabase | 500MB DB | ~500,000 records |
| Cloudflare Workers | 100,000 req/day | Most use cases |
| Cloudflare R2 | 10GB storage | Tens of thousands of images |
Troubleshooting
Common problems
- Deployment fails
  - Check environment variable settings
  - Verify Supabase connectivity
- Crawl fails
  - Verify Pixiv cookie validity
  - Check network status
- Images not loading
  - Confirm proxy service is running
  - Check CORS configuration
Performance optimization tips
- Enable caching: tune CDN cache policies
- Monitor metrics: track performance regularly
- Clean up regularly: remove expired logs and stale data
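"Enable caching" can also apply inside the API itself: within one warm serverless instance, a tiny in-memory TTL cache avoids re-fetching the same Pixiv metadata. A sketch, assuming the `TtlCache` name and TTL value are illustrative rather than part of the project:

```typescript
// Minimal in-memory cache: entries expire ttlMs after being set
class TtlCache<V> {
  private store: Map<string, { value: V; expiresAt: number }> = new Map();
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // lazily evict expired entries
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```

Keyed by `pid`, this would sit in front of the `getIllustDetail` call; the `Cache-Control` header in the proxy endpoint already handles caching on the CDN side.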
Summary
This serverless Pixiv crawler demonstrates what modern web architecture can deliver:
- ✅ Zero ops cost: fully managed cloud services
- ✅ Elastic scalability: handles traffic spikes automatically
- ✅ Complete workflow: crawl, store, analyze, and serve in one system
- ✅ Easy deployment: launch in about 3 minutes
Whether you want to learn serverless architecture or need a practical data-collection tool, this project is a great starting point.
Related resources
Disclaimer: this project is for learning and research only. Please follow target-site terms of service and control crawl frequency responsibly.
If this article helped you, please give the project a ⭐ Star!
