*Posted on DEV Community by ddodxy*
# How I Built an AI-Powered Literature Review Tool for Thesis Students

From scraping 3 academic databases to AI summaries — a solo build story


## The Problem That Started It All

Every thesis student knows the pain. You sit down with a research topic, open Google Scholar, and spend the next 3-4 hours manually:

  • Searching across Google Scholar, Scopus, and Semantic Scholar separately
  • Downloading papers one by one
  • Copy-pasting metadata into a spreadsheet
  • Repeating this every time your advisor asks for "more references"

I was doing exactly this for my own thesis when I thought — this entire workflow is automatable. So I built LitAssist: a full-stack web app that scrapes journals from 3 sources, processes them through a Python pipeline, and generates AI literature reviews using Gemini.

Here's everything I learned building it.


## Tech Stack Overview

```text
Frontend:  Alpine.js + Tailwind CSS (MPA, no build framework)
Backend:   Node.js + Express 5 + Socket.IO
Database:  MongoDB + Mongoose
Scraping:  Puppeteer (Google Scholar) + Semantic Scholar API
AI:        Google Gemini 2.5 Flash
Infra:     Podman + Docker Compose
Tunnel:    ngrok (for public access during dev)
```

The key architectural decision: hybrid Node.js + Python pipeline. Node handles browser automation and the web server. Python handles data cleaning, deduplication, and classification. Each tool does what it's best at.


## Architecture Deep Dive

```text
User clicks "Start Scrape"
        │
        ▼
  Socket.IO event → scraper/index.js (Node.js)
        │
        ├── Google Scholar (Puppeteer + Chromium)
        ├── Scopus (Semantic Scholar API)
        └── Semantic Scholar API
        │
        ▼
  jurnal_mentah.json (raw data)
        │
        ▼
  processor/main.py (Python + Pandas)
  ├── Clean & normalize
  ├── Detect duplicates
  ├── Classify categories
  └── Calculate relevance scores
        │
        ▼
  MongoDB (via insertMany bulk)
        │
        ▼
  Dashboard updates via Socket.IO
```
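The actual "detect duplicates" step lives in the Python/Pandas processor, which the post doesn't show; the core idea, sketched here in JavaScript for consistency with the other snippets, is to normalize titles and keep the first occurrence of each normalized key.

```javascript
// Normalize a title: lowercase, strip punctuation, collapse whitespace.
function normalizeTitle(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9 ]/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// Keep the first paper seen for each normalized title.
function dedupe(papers) {
  const seen = new Set();
  return papers.filter((p) => {
    const key = normalizeTitle(p.title);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

Title normalization matters because the same paper arrives from different sources with different casing, punctuation, and spacing.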

### Why Socket.IO for Real-Time Updates?

The scraping process takes 1-5 minutes depending on the target count and whether Google Scholar triggers a CAPTCHA. A regular HTTP request would time out. Socket.IO lets me stream progress updates to the frontend in real time: the user sees exactly which source is being scraped and how many results are coming in.
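A progress update is just a small event payload pushed to the user's socket. The event name and payload shape below are illustrative, not the actual LitAssist protocol; `io` is anything that exposes `.to(id).emit(...)`.

```javascript
// Push a scrape-progress event to one client's socket.
function reportProgress(io, socketId, source, fetched, target) {
  io.to(socketId).emit('scrape_progress', {
    source,                                               // e.g. 'google_scholar'
    fetched,                                              // results collected so far
    target,                                               // requested result count
    percent: Math.min(100, Math.round((fetched / target) * 100)),
  });
}
```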


## The Hardest Problem: CAPTCHA

Google Scholar aggressively uses CAPTCHA to block bots. Most scraping tools either fail silently or get permanently IP-banned.

My solution: noVNC + xvfb + x11vnc running inside the container.

When Google Scholar serves a CAPTCHA:

  1. The scraper detects it and pauses
  2. Sends a Socket.IO event to the frontend
  3. Opens an embedded noVNC panel in the dashboard
  4. User solves the CAPTCHA visually, directly in the browser
  5. Scraper resumes automatically

This is the difference between a tool that works once and a tool that works in production.

```javascript
// Detect CAPTCHA and notify the client
const isCaptcha = await page.$('form#captcha-form') !== null;
if (isCaptcha) {
  io.to(socketId).emit('captcha_required', {
    message: 'CAPTCHA detected. Please solve it in the panel below.'
  });
  // Wait for the user to solve it
  await waitForCaptchaResolved(page, socketId);
}
```
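One way `waitForCaptchaResolved` could work (a sketch under assumptions, not the actual implementation, and ignoring the `socketId` notification plumbing): poll until the CAPTCHA form disappears from the page, with a timeout so a stuck session doesn't hang the scraper forever.

```javascript
// Poll the page until the CAPTCHA form is gone, or give up after timeoutMs.
async function waitForCaptchaResolved(page, { intervalMs = 2000, timeoutMs = 300000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    // The selector matches the check used to detect the CAPTCHA above.
    if (await page.$('form#captcha-form') === null) return true; // solved
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('CAPTCHA was not solved within the timeout');
}
```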

## Freemium Model in Practice

LitAssist has three roles: Free, Premium, and Admin.

| Feature | Free | Premium |
| --- | --- | --- |
| Lifetime quota | 10 journals | Unlimited |
| Scrapes/day | 2 | Unlimited |
| Max target per scrape | 25 | 50 |
| AI Summary | — | ✓ |
| Queue priority | Standard | Priority |

Implementing this was straightforward with MongoDB user documents and middleware:

```javascript
// Quota check middleware
async function checkQuota(req, res, next) {
  const user = await User.findById(req.session.userId);
  if (user.role === 'free' && user.quotaUsed >= 10) {
    return res.status(403).json({
      error: 'Quota exhausted. Upgrade to Premium.'
    });
  }
  next();
}
```
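The per-day limit from the table (2 scrapes/day for free users) can be enforced the same way, with a date-keyed counter on the user document. The field names here (`scrapeDate`, `scrapesToday`) are illustrative, not the actual schema.

```javascript
// Decide whether a user may start another scrape today.
// Resets implicitly when the stored date no longer matches today.
function canScrapeToday(user, now = new Date()) {
  const today = now.toISOString().slice(0, 10);          // 'YYYY-MM-DD'
  const usedToday = user.scrapeDate === today ? user.scrapesToday : 0;
  const limit = user.role === 'free' ? 2 : Infinity;
  return usedToday < limit;
}
```

Keeping the date alongside the counter avoids a scheduled midnight reset job: stale counters are simply ignored.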

## Security Hardening

After building the core features, I ran a full security audit:

  • OWASP ZAP baseline scan → fixed CSP headers, removed CDN wildcards
  • Nikto web server scan → disabled ETag inode leaks, removed X-Powered-By
  • Trivy dependency scan → 0 CVEs in npm packages
  • npm audit → 0 vulnerabilities
  • Helmet.js → full security header suite
  • express-rate-limit → rate limiting on auth endpoints (verified: 99.98% blocked in load test)

ZAP final score: 0 FAIL, 7 WARN (all remaining warnings are CDN trade-offs or false positives).
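For context, the rate limiting is a fixed-window count per client key, which is the same basic idea express-rate-limit implements. This is a conceptual sketch, not the library's actual code or configuration:

```javascript
// Minimal fixed-window rate limiter: allow up to `max` hits per key
// within each `windowMs`-long window.
function makeLimiter({ windowMs, max }) {
  const hits = new Map(); // key -> { count, windowStart }
  return function allow(key, now = Date.now()) {
    const entry = hits.get(key);
    if (!entry || now - entry.windowStart >= windowMs) {
      hits.set(key, { count: 1, windowStart: now });     // start a new window
      return true;
    }
    entry.count += 1;
    return entry.count <= max;
  };
}
```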


## Performance Results (Lighthouse Mobile)

| Category | Score |
| --- | --- |
| Performance | 87 |
| Accessibility | 93 |
| Best Practices | 100 |
| SEO | 100 |

Key optimizations that moved the needle:

  • Migrated from Tailwind Play CDN → PostCSS build (400KB → 13KB CSS)
  • Switched Alpine.js from CDN to local vendor file
  • Added gzip compression via compression middleware
  • Implemented font-display: swap for Google Fonts
  • Added proper cache headers (immutable for assets, 1hr for HTML)
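The cache-policy split in that last bullet can be expressed as a tiny helper. This is a sketch of the policy, not the actual middleware; the extension list is an assumption.

```javascript
// Long immutable caching for fingerprinted static assets,
// one hour for HTML (so deploys show up within the hour).
function cacheControlFor(path) {
  const isAsset = /\.(css|js|woff2?|png|jpg|svg)$/.test(path);
  return isAsset
    ? 'public, max-age=31536000, immutable'
    : 'public, max-age=3600';
}
```

`immutable` is safe only because the asset filenames change when their contents do; without fingerprinting it would pin stale files in browsers for a year.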

### Load Testing (k6)

```text
Scenario 1: 100 concurrent users, static pages
→ 7,493 requests | 0% error | 6.78ms avg response

Scenario 2: 10 concurrent users, full user journey
→ 1,665 requests | 0% error | 6ms avg response

Scenario 3: Rate limiter stress test (25 VUs hammering login)
→ 161,004 requests | 99.98% blocked after limit | 0 server crashes

Scenario 4: API stress test (20 VUs, all endpoints)
→ 3,668 requests | 0% error | 4ms avg response
```

The server handles real load with sub-10ms response times. The rate limiter successfully blocks brute force attempts without crashing.


## What I Would Do Differently

1. Start with a build pipeline for CSS. Using the Tailwind Play CDN for development is fine, but I had to migrate everything to PostCSS later. I should have set this up from day one.

2. Plan the quota system early. Adding freemium logic after the core was built required touching a lot of files.

3. Use TypeScript. The scraper logic is complex enough that TypeScript would have caught several bugs early.

4. Separate the scraper into a microservice. Right now it runs in the same process as the web server. Under heavy load, a long-running scrape job could block other requests.


## What's Next

  • [ ] Deploy to production (VPS with proper RAM for Chromium)
  • [ ] Add Zotero integration for direct export
  • [ ] Support more databases (PubMed, IEEE Xplore)
  • [ ] Batch processing for multiple topics
  • [ ] Mobile app wrapper

## Try It / Source Code

GitHub: github.com/ridhoajaaa/LitAssist-Public

If you're a thesis student who wants to automate your literature review process, feel free to try LitAssist. If you're a developer interested in the architecture, the full source is on GitHub.

Questions? Drop them in the comments — happy to go deeper on any part of the stack.


Built with Node.js, Python, Alpine.js, Tailwind CSS, MongoDB, Socket.IO, and Puppeteer.

Tags: #nodejs #python #webdev #showdev #opensource
