*Posted on DEV Community by ddodxy*
# How I Built an AI-Powered Literature Review Tool for Thesis Students

From scraping 3 academic databases to AI summaries — a solo build story


## The Problem That Started It All

Every thesis student knows the pain. You sit down with a research topic, open Google Scholar, and spend the next 3-4 hours manually:

  • Searching across Google Scholar, Scopus, and Semantic Scholar separately
  • Downloading papers one by one
  • Copy-pasting metadata into a spreadsheet
  • Repeating this every time your advisor asks for "more references"

I was doing exactly this for my own thesis when I thought — this entire workflow is automatable. So I built LitAssist: a full-stack web app that scrapes journals from 3 sources, processes them through a Python pipeline, and generates AI literature reviews using Gemini.

Here's everything I learned building it.


## Tech Stack Overview

```text
Frontend:  Alpine.js + Tailwind CSS (MPA, no build framework)
Backend:   Node.js + Express 5 + Socket.IO
Database:  MongoDB + Mongoose
Scraping:  Puppeteer (Google Scholar) + Semantic Scholar API
AI:        Google Gemini 2.5 Flash
Infra:     Podman + Docker Compose
Tunnel:    ngrok (for public access during dev)
```

The key architectural decision: hybrid Node.js + Python pipeline. Node handles browser automation and the web server. Python handles data cleaning, deduplication, and classification. Each tool does what it's best at.


## Architecture Deep Dive

```text
User clicks "Start Scrape"
        │
        ▼
  Socket.IO event → scraper/index.js (Node.js)
        │
        ├── Google Scholar (Puppeteer + Chromium)
        ├── Scopus (Semantic Scholar API)
        └── Semantic Scholar API
        │
        ▼
  jurnal_mentah.json (raw data)
        │
        ▼
  processor/main.py (Python + Pandas)
  ├── Clean & normalize
  ├── Detect duplicates
  ├── Classify categories
  └── Calculate relevance scores
        │
        ▼
  MongoDB (via insertMany bulk)
        │
        ▼
  Dashboard updates via Socket.IO
```
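The actual "detect duplicates" step lives in the Python/Pandas processor, which the post doesn't show; the core idea, sketched here in JavaScript for consistency with the other snippets, is to normalize titles and keep the first occurrence of each normalized key.

```javascript
// Normalize a title: lowercase, strip punctuation, collapse whitespace.
function normalizeTitle(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9 ]/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// Keep the first paper seen for each normalized title.
function dedupe(papers) {
  const seen = new Set();
  return papers.filter((p) => {
    const key = normalizeTitle(p.title);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

Title normalization matters because the same paper arrives from different sources with different casing, punctuation, and spacing.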

### Why Socket.IO for Real-Time Updates?

The scraping process takes 1-5 minutes depending on the target count and whether Google Scholar triggers a CAPTCHA. A regular HTTP request would time out. Socket.IO lets me stream progress updates to the frontend in real time: the user sees exactly which source is being scraped and how many results are coming in.
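A progress update is just a small event payload pushed to the user's socket. The event name and payload shape below are illustrative, not the actual LitAssist protocol; `io` is anything that exposes `.to(id).emit(...)`.

```javascript
// Push a scrape-progress event to one client's socket.
function reportProgress(io, socketId, source, fetched, target) {
  io.to(socketId).emit('scrape_progress', {
    source,                                               // e.g. 'google_scholar'
    fetched,                                              // results collected so far
    target,                                               // requested result count
    percent: Math.min(100, Math.round((fetched / target) * 100)),
  });
}
```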


## The Hardest Problem: CAPTCHA

Google Scholar aggressively uses CAPTCHA to block bots. Most scraping tools either fail silently or get permanently IP-banned.

My solution: noVNC + xvfb + x11vnc running inside the container.

When Google Scholar serves a CAPTCHA:

  1. The scraper detects it and pauses
  2. Sends a Socket.IO event to the frontend
  3. Opens an embedded noVNC panel in the dashboard
  4. User solves the CAPTCHA visually, directly in the browser
  5. Scraper resumes automatically

This is the difference between a tool that works once and a tool that works in production.

```javascript
// Detect CAPTCHA and notify the client
const isCaptcha = await page.$('form#captcha-form') !== null;
if (isCaptcha) {
  io.to(socketId).emit('captcha_required', {
    message: 'CAPTCHA detected. Please solve it in the panel below.'
  });
  // Wait for the user to solve it
  await waitForCaptchaResolved(page, socketId);
}
```
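One way `waitForCaptchaResolved` could work (a sketch under assumptions, not the actual implementation, and ignoring the `socketId` notification plumbing): poll until the CAPTCHA form disappears from the page, with a timeout so a stuck session doesn't hang the scraper forever.

```javascript
// Poll the page until the CAPTCHA form is gone, or give up after timeoutMs.
async function waitForCaptchaResolved(page, { intervalMs = 2000, timeoutMs = 300000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    // The selector matches the check used to detect the CAPTCHA above.
    if (await page.$('form#captcha-form') === null) return true; // solved
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('CAPTCHA was not solved within the timeout');
}
```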

## Freemium Model in Practice

LitAssist has three roles: Free, Premium, and Admin.

| Feature | Free | Premium |
| --- | --- | --- |
| Lifetime quota | 10 journals | Unlimited |
| Scrapes/day | 2 | Unlimited |
| Max target per scrape | 25 | 50 |
| AI Summary | — | ✓ |
| Queue priority | Standard | Priority |

Implementing this was straightforward with MongoDB user documents and middleware:

```javascript
// Quota check middleware
async function checkQuota(req, res, next) {
  const user = await User.findById(req.session.userId);
  if (user.role === 'free' && user.quotaUsed >= 10) {
    return res.status(403).json({
      error: 'Quota exhausted. Upgrade to Premium.'
    });
  }
  next();
}
```
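The per-day limit from the table (2 scrapes/day for free users) can be enforced the same way, with a date-keyed counter on the user document. The field names here (`scrapeDate`, `scrapesToday`) are illustrative, not the actual schema.

```javascript
// Decide whether a user may start another scrape today.
// Resets implicitly when the stored date no longer matches today.
function canScrapeToday(user, now = new Date()) {
  const today = now.toISOString().slice(0, 10);          // 'YYYY-MM-DD'
  const usedToday = user.scrapeDate === today ? user.scrapesToday : 0;
  const limit = user.role === 'free' ? 2 : Infinity;
  return usedToday < limit;
}
```

Keeping the date alongside the counter avoids a scheduled midnight reset job: stale counters are simply ignored.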

## Security Hardening

After building the core features, I ran a full security audit:

  • OWASP ZAP baseline scan → fixed CSP headers, removed CDN wildcards
  • Nikto web server scan → disabled ETag inode leaks, removed X-Powered-By
  • Trivy dependency scan → 0 CVEs in npm packages
  • npm audit → 0 vulnerabilities
  • Helmet.js → full security header suite
  • express-rate-limit → rate limiting on auth endpoints (verified: 99.98% blocked in load test)

ZAP final score: 0 FAIL, 7 WARN (all remaining warnings are CDN trade-offs or false positives).
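For context, the rate limiting is a fixed-window count per client key, which is the same basic idea express-rate-limit implements. This is a conceptual sketch, not the library's actual code or configuration:

```javascript
// Minimal fixed-window rate limiter: allow up to `max` hits per key
// within each `windowMs`-long window.
function makeLimiter({ windowMs, max }) {
  const hits = new Map(); // key -> { count, windowStart }
  return function allow(key, now = Date.now()) {
    const entry = hits.get(key);
    if (!entry || now - entry.windowStart >= windowMs) {
      hits.set(key, { count: 1, windowStart: now });     // start a new window
      return true;
    }
    entry.count += 1;
    return entry.count <= max;
  };
}
```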


## Performance Results (Lighthouse Mobile)

| Category | Score |
| --- | --- |
| Performance | 87 |
| Accessibility | 93 |
| Best Practices | 100 |
| SEO | 100 |

Key optimizations that moved the needle:

  • Migrated from Tailwind Play CDN → PostCSS build (400KB → 13KB CSS)
  • Switched Alpine.js from CDN to local vendor file
  • Added gzip compression via compression middleware
  • Implemented font-display: swap for Google Fonts
  • Added proper cache headers (immutable for assets, 1hr for HTML)
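The cache-policy split in that last bullet can be expressed as a tiny helper. This is a sketch of the policy, not the actual middleware; the extension list is an assumption.

```javascript
// Long immutable caching for fingerprinted static assets,
// one hour for HTML (so deploys show up within the hour).
function cacheControlFor(path) {
  const isAsset = /\.(css|js|woff2?|png|jpg|svg)$/.test(path);
  return isAsset
    ? 'public, max-age=31536000, immutable'
    : 'public, max-age=3600';
}
```

`immutable` is safe only because the asset filenames change when their contents do; without fingerprinting it would pin stale files in browsers for a year.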

### Load Testing (k6)

```text
Scenario 1: 100 concurrent users, static pages
→ 7,493 requests | 0% error | 6.78ms avg response

Scenario 2: 10 concurrent users, full user journey
→ 1,665 requests | 0% error | 6ms avg response

Scenario 3: Rate limiter stress test (25 VUs hammering login)
→ 161,004 requests | 99.98% blocked after limit | 0 server crashes

Scenario 4: API stress test (20 VUs, all endpoints)
→ 3,668 requests | 0% error | 4ms avg response
```

The server handles real load with sub-10ms response times. The rate limiter successfully blocks brute force attempts without crashing.


## What I Would Do Differently

1. Start with a build pipeline for CSS. Using the Tailwind Play CDN for development is fine, but I had to migrate everything to PostCSS later. I should have set this up from day one.

2. Plan the quota system early. Adding freemium logic after the core was built required touching a lot of files.

3. Use TypeScript. The scraper logic is complex enough that TypeScript would have caught several bugs early.

4. Separate the scraper into a microservice. Right now it runs in the same process as the web server. Under heavy load, a long-running scrape job could block other requests.


## What's Next

  • [ ] Deploy to production (VPS with proper RAM for Chromium)
  • [ ] Add Zotero integration for direct export
  • [ ] Support more databases (PubMed, IEEE Xplore)
  • [ ] Batch processing for multiple topics
  • [ ] Mobile app wrapper

## Try It / Source Code

GitHub: github.com/ridhoajaaa/LitAssist-Public

If you're a thesis student who wants to automate your literature review process, feel free to try LitAssist. If you're a developer interested in the architecture, the full source is on GitHub.

Questions? Drop them in the comments — happy to go deeper on any part of the stack.


Built with Node.js, Python, Alpine.js, Tailwind CSS, MongoDB, Socket.IO, and Puppeteer.

Tags: #nodejs #python #webdev #showdev #opensource
