Waddah Ali

Posted on Jul 3

Thessori: An Autonomous Literature Review Agent on Qwen Cloud

#ai #agents #research #webdev

Every research project starts the same way. You have a question, and between you and an answer sit a few hundred papers you haven't read. Finding the relevant ones, reading each, and noticing what nobody has tried yet is mechanical work — right up until the moment it isn't. Most tools that promise to help only kick in after you've already gathered the papers. Thessori starts one step earlier. You give it a research question, and it hands back a literature review you could actually put in front of someone.

Try Thessori Now

I built it for the Qwen Cloud Hackathon Series, specifically competing in the Autopilot Agent track. The pitch is short. You choose which Qwen model to start with, Plus or Max (latest versions), and then type one or more research questions. The agent searches arXiv and Semantic Scholar in parallel and has Qwen rank what it finds, then stops and shows you its picks so you can throw out the ones that don't belong. After that it pulls down the actual PDFs, reads them, writes a structured summary of each, works out the gaps across the whole set, and assembles everything into a review you can export as Markdown, LaTeX, or PDF. When it's done, you can sit and ask it questions about what it just wrote or tell it to go chase a thread it found.

The interesting part, it turned out, was almost never the model itself. Qwen did exactly what it was told, but the scaffolding required to make it reliable.

What it actually does

A run has two halves with a person in the middle.

In the first half you submit your questions. Before it searches anything, Qwen quietly rewrites what you typed into three sharper academic queries "attention layers from google" becomes "transformer self-attention mechanism," "scaled dot-product attention," and so on. This is optional; there's a checkbox, and if you turn it off, your wording goes through untouched. Either way, I keep your original phrasing and show you exactly what it expanded into, because silently searching for something other than what the user asked is the kind of thing that feels like a bug even when it's working.

For each query the agent fires an arXiv request and a Semantic Scholar request at the same time, merges everything, and drops duplicates by title. Those titles go to Qwen with one instruction: here are the papers, here are the queries, give me back the indices of the ten most relevant. The model returns something like [3, 0, 7, 12, ...] and the agent slices the list down to those ten.

Then it stops.

This is the part I cared most about getting right. A lot of "autonomous" agents are autonomous in the sense that they decide everything for you and you find out what they chose afterward. For a literature review that's the wrong shape. The ranking model is good but not psychic it'll occasionally rank a paper highly because the title overlaps with your query while the actual paper is about something else. So the agent puts its ten picks on screen as a checklist, every box ticked, and waits. You untick the ones that don't belong and hit generate. While it works, a row of checkpoints across the top fills in left to right fetch, review, summarize, gaps, and report with the live status underneath ("Downloading PDF 2 of 5…"), so you're never staring at a frozen spinner wondering if it died.

The second half runs on whatever survived. Here's where I spent real effort, and it doesn't show up in a screenshot: for each approved paper the agent doesn't summarize the abstract. It downloads the paper's PDF from arXiv, pulls the text out, and summarizes the actual body. Abstracts are marketing. They oversell the contribution and they bury the limitations, because the limitations are the part the authors least want you to read. The paper itself is more honest. So Thessori reads up to the first twenty-five pages, and when a paper is long it keeps a big chunk of the front intro, method and a slice of the back results and conclusion and the limitations the abstract skipped rather than chopping at a character count and losing the ending. Each paper comes back as four short labeled sections: contribution, method, findings, and limitations.

Then every summary is concatenated and sent back to Qwen once more with a different prompt: what problems are still open, which methods are missing,where the field could go. That becomes the closing section. Alongside the prose, the model hands back three concrete follow-up search queries, and those feed a "Deep Dive" button: one click and the agent starts a fresh run on the gaps it just found, which is about the closest thing to how a real person reads. You finish a paper, you notice a hole, you go looking again.

Everything is stitched into Markdown, which is the single source of truth the LaTeX and PDF exports are derived from. Math renders in the browser with KaTeX. And because the whole report sits in memory, there's a chat assistant in the corner that answers questions strictly about what it wrote, "Which of these used a transformer", "give me the gaps in one line" and it streams its answer token by token, so it feels like talking to something rather than submitting a form.

How it scales

Doing this manually is a chore. A thorough literature review usually means spending two or three hours searching, filtering out irrelevant titles, downloading PDFs, and skimming them to find the actual contributions. Thessori does the entire pipeline from query to a structured, formatted LaTeX draft in about three minutes.

It is also cheap. Because we rank the papers by title first and let the user filter the list before downloading the full texts, we only run the heavy summarization and gap analysis steps on the papers that actually matter. A full run costs less than five cents in API tokens, compared to the dollars it would cost to blindly feed dozens of raw, unverified PDFs into a model.

Why a state machine

The backend is a LangGraph graph wrapped in FastAPI. I chose an explicit state machine over a free-roaming agent loop for a boring reason: I wanted to know exactly what would happen on every run.

There are six steps. Each is a plain async function that takes the current state and returns the keys it changed. Expand rewrites the queries. Fetch writes the candidates. Rank overwrites them with the top ten. Summarize writes the summaries. Gap analysis writes a string and a list of follow-up queries. Report writes the Markdown. No tool-calling roulette, no step deciding at runtime to call itself three more times. When something broke, the stack trace pointed at one function, not at a planner buried under a reasoning loop.

The state itself is one typed dictionary that every node reads from and writes to. That's the whole contract. If you want to know what data exists at any point in the pipeline, you read one TypedDict.

The checkpoint problem

Here's where it got interesting. The pause for human approval is easy to describe and annoying to implement, because the graph has to actually stop, hand control back to a web UI, and resume later from a different process an HTTP request with the user's choices folded in.

LangGraph has machinery for this, with checkpointers and interrupts. I didn't want to stand up a checkpoint store for a hackathon, so I used a simpler trick: two graphs. The first runs expand, fetch, and rank, then ends. The API hands the ranked papers to the browser along with a session id and keeps the run's state in memory. When you approve a subset, a second request looks up that state, adds the approved papers, and invokes the second graph, which runs summarize, gaps, and report. The pause is just the boundary between two HTTP calls. No checkpointer, no persistence layer to babysit.

This worked on the second try, and the first try is worth describing because it cost me an hour.

LangGraph lets you declare your state schema as a bare dict. I did that, because every node reads and writes dictionary keys, so it seemed fine. The fetch-and-rank graph fetched twenty papers correctly and then died inside the ranking step with KeyError: 'queries'. The queries were right there in the initial state. They simply weren't reaching the second node.

The cause is subtle and worth knowing if you ever reach for LangGraph with a plain dict. With an untyped dict schema, the framework doesn't treat each key as a channel that persists across steps. A node returning a partial update doesn't merge into the running state; the keys it didn't return fall away. The fix is to give the graph a real schema. I'd already written a ResearchState TypedDict to document the shape, so I swapped StateGraph(dict) for StateGraph(ResearchState). With a typed schema every field becomes its own last-value channel, partial returns merge, and queries survives all the way through. One word changed and the bug was gone.

The schema isn't documentation. It's load-bearing.

class ResearchState(TypedDict):
    session_id: str
    queries: list[str]
    original_queries: list[str]
    use_ai_expansion: bool
    max_papers: int
    top_k_papers: int
    raw_papers: list[dict]
    approved_papers: list[dict]
    summaries: list[dict]
    gap_analysis: str
    deep_dive_queries: list[str]
    markdown_report: str
    status: str
    error: str | None
    timestamp: str | None
    model: str | None

Many jobs, one model

Qwen does several different things in this pipeline, and keeping them separate mattered.

Query expansion and ranking are selection tasks short, capped at a couple hundred tokens, parsed as JSON. Summarization is constrained writing, one call per paper, four labeled sections, told explicitly to be terse. Gap analysis is the one place I let it stretch: because Qwen has a large context window, I can dump every single paper summary into a single prompt. The model is asked to reason across the entire set to find contradictions and omissions rather than restating any one paper. The assistant chat is its own job conversational, with the report pinned in the system prompt so it can't wander off into things it didn't read. Same model, qwen-model, prompts shaped to the work.

Integration was genuinely a non-event, and I mean that as a compliment. Qwen exposes an OpenAI-compatible API, so the entire client is the standard OpenAI async SDK pointed at a different base URL. Two environment variables and a model name. Turning on streaming for the chat was one flag — stream=True — and reading the chunks off the response on the frontend. I've spent more time wiring up loggers than I spent wiring up Qwen.

The boring problems were the real ones

If you want to know where a project like this actually spends its time, it isn't the headline feature.

The first thing that broke was pip install, before I'd run a line of application code. The machine had a brand-new Python, new enough that my pinned dependency versions had no prebuilt wheels for it pip tried to compile Pydantic's Rust core from source and gave up. The honest fix was to stop pinning exact old versions and pin to floors instead, then let the resolver pull whatever current release ships a wheel for that interpreter. The public API I was using hadn't changed, so no application code moved with it. Pinned versions are a promise the rest of the ecosystem has to keep, and on a fresh interpreter nobody has made that promise yet.

Then Semantic Scholar. Their API is free and good and, without a key, rate-limited hard enough that it returned 429 on essentially every call while I was testing. My fetch step originally fired all the source requests with a plain asyncio.gather, which means one source erroring takes down the whole fetch. Fine until the afternoon of the demo, when Semantic Scholar decides you've had enough. I changed the gather to collect exceptions instead of raising them, so a throttled source is skipped and the run continues on whatever came back usually arXiv alone, which is plenty to show the thing working. The day someone drops in a Semantic Scholar key, the second source lights up with no code change.

# Fetching from multiple sources concurrently without letting one failure kill the run
results = await asyncio.gather(
    fetch_arxiv(queries),
    fetch_semantic_scholar(queries),
    return_exceptions=True
)

candidates = []
for res in results:
    if isinstance(res, Exception):
        logger.error(f"Source fetch failed: {res}")
        continue
    candidates.extend(res)

Then LaTeX, which was my own fault. The export turns the Markdown into a .tex file with a handful of regexes. My first escaper only handled headings, bold, and links and left the body prose alone. Fine until a summary contains a percent sign, which in LaTeX starts a comment and eats the rest of the line, or an underscore, which is illegal outside math mode. Real model output is full of both. So a converter that produced a beautiful .tex on my test string produced uncompilable garbage on actual summaries. I rewrote it to escape special characters in a single pass and walk each line as tokens, escaping plain text while leaving bold, links, and inline math intact. You only find that bug by feeding the thing real data, which is the whole argument for testing with real data early.

The front end had its own version of the same lesson. I'd swapped the landing-page background for an animated WebGL shader, and the entire app went black no background, no text, nothing. The build passed, so it was a runtime crash. The cause was React's StrictMode, which mounts every component twice in development. My cleanup code freed the WebGL context, then the immediate re-mount tried to initialize on the dead canvas and threw an uncaught throw inside a useEffect with no error boundary tears down the whole app, not just the background. The fix was to stop destroying the context on cleanup and wrap the setup so a graphics failure degrades to a plain dark page instead of taking the page with it. Two lines. An hour to find them.

The one I'm proudest of fixing is the quietest. The Deep Dive feature originally had the model write its three follow-up queries as a Markdown list inside the gap-analysis prose, and then I parsed them back out of that text with a regex. It worked in testing and then, of course, it didn't when the model formatted the list a little differently, the parser grabbed a whole paragraph as a single "query," and Deep Dive cheerfully searched arXiv for a 300-character sentence and came back with zero results. The lesson is one I keep relearning: don't round-trip structured data through prose. I changed the gap prompt to emit the queries as a delimited JSON array, parse that, and store them as a real list which is also what gets rendered as the clean numbered section in the report and what the button searches. Same data, one source, no regex archaeology.

Edges

I'd rather be straight about the edges.

The pipeline, the human checkpoint, the multi-source fetch, the query expansion, the ranking, the full-text PDF reading, the per-paper summaries, the gap analysis, the deep-dive follow-up, the streaming assistant, and the Markdown and LaTeX exports are all real and all working. Math in the report renders in the browser. Sessions are written to your browser, so a refresh or a server restart doesn't lose your review, the landing page just offers to resume and shows your research history instead of dropping you back to zero. In production the whole thing runs on one port, FastAPI handing out the built React app and answering the API on the same origin.

PDF export is honest about its one dependency. Turning LaTeX into a PDF needs a TeX engine on the server, and if one isn't installed the endpoint returns a clear message telling you to take the .tex to Overleaf instead of throwing a 500. Google Doc export is a stub that returns a placeholder URL; the real Drive upload is on the list, not in the box. None of this is hidden, it's the line I drew with the time I had.

Production Readiness

If you want to take this agent out of a sandbox and deploy it in a real production system, the graph-based state machine is already built for it. You wouldn't need to rewrite the core flow.

In a real environment, you'd make three concrete upgrades:

State Persistence

Instead of keeping the session state in a FastAPI memory dict or saving the sessions on the user's browser, you would swap out the checkpointer for a database like PostgreSQL or Redis. This lets a researcher start a search, go to lunch, and resume the review days later without losing their progress when a container restarts.

Task Queuing

Phase two (downloading PDFs, segmenting text, and running the gap analysis) is slow and compute-heavy. Running this inside a direct HTTP request will eventually cause timeouts. In production, you'd push these tasks onto a worker queue like Celery or Temporal and have the frontend query the progress over WebSockets.

Private Repositories

Most research teams don't just read public papers. They have internal document libraries, patents, and shared wikis. The fetch node can be extended to search these private data stores alongside the public APIs by plugging in a local vector search engine.

Because the state is governed by a strict TypedDict contract, these changes are structural swaps rather than logic rewrites.

Where this goes

The piece I'd keep no matter what is the checkpoint. The most useful thing this project does isn't the summarizing any capable model can do. It's that the agent searches for the exact papers you were trying to dig up from arXiv or Scholar and stops and shows its work before committing to it, and once it's done, you can interrogate what it wrote and send it back out after the threads it found. The obvious extensions all build on that. Let the assistant edit and re-rank without leaving the report. Follow a strong paper's citations the way a person would, instead of stopping at the first search. Persist across machines, not just across a refresh.

But the core is done, and the core is the part I wanted: a research question goes in one end, and a review you'd actually hand to someone comes out the other, with a human in the loop at the one moment it counts.