Peter Abolins

Building the Romanian NLP API that should already exist

If you've tried to do anything programmatic with Romanian text, you've probably hit the same wall I did.

There's no clean API for it. You end up scraping DEXonline, wrestling with incomplete library support, or calling a general-purpose LLM and hoping it gets the grammar right. None of that is good enough for production.

The specific gap

Given an arbitrary Romanian sentence, return for each token: its lemma, part of speech, grammatical case, number, gender, person, and tense.

This is what spaCy does for English, French, and German in a pip install. For Romanian, no equivalent exists as a callable REST API.
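To make the target concrete, here is a rough sketch of what one token's record could look like. The class name, field names, and tag values are my own illustrative assumptions (loosely modelled on Universal Dependencies labels), not a published LexicRo schema, and the example is hand-written rather than real API output.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class TokenAnalysis:
    """One token's analysis. Field names and tag values are illustrative only."""

    text: str
    lemma: str
    pos: str
    case: Optional[str] = None
    number: Optional[str] = None
    gender: Optional[str] = None
    person: Optional[str] = None  # stays None for tokens such as nouns
    tense: Optional[str] = None   # stays None for tokens such as nouns


# Hand-written example: "câinele" is the definite singular
# of the masculine noun "câine" (dog).
example = TokenAnalysis(
    text="câinele",
    lemma="câine",
    pos="NOUN",
    case="Nom/Acc",
    number="Sing",
    gender="Masc",
)

print(asdict(example))
```

Features that don't apply to a token (person and tense for a noun) are left as `None` here, so every record has the same keys.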

Romanian NLP tooling sits at roughly 15% of what exists for English. The academic resources are there — DEXonline (313k+ lemmas), RoLEX (330k morphosyntactic entries), the Universal Dependencies Romanian Treebank — they're just not packaged in a way developers can actually use.

But why not just use ChatGPT?

Fair question. The short answer: for production, an LLM is not infrastructure; it's an oracle. The definite nominative-accusative singular of câine is always câinele, regardless of the LLM's mood that day. LexicRo returns deterministic, structured linguistic data by contract. Beyond that, LLM costs at scale are unpredictable; a dedicated conjugation endpoint responds in under 50 ms, versus 500 ms to 3 s for an LLM call; and structured output from an LLM requires prompt engineering, validation, and retry logic. LexicRo returns clean JSON, every time.
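For a sense of what "validation and retry logic" means in practice, here is a minimal sketch of the wrapper you end up writing around an LLM call. Everything in it is hypothetical: `ask_llm` stands in for whatever client you use, `REQUIRED_KEYS` is a made-up minimal schema, and the fake model exists only to make the example runnable.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"lemma", "pos"}  # hypothetical minimal schema


def parse_llm_reply(raw: str) -> dict:
    """Parse a reply as JSON and check the expected keys; raise ValueError otherwise."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        raise ValueError(f"reply is missing required keys: {raw!r}")
    return data


def analyze_with_retry(
    ask_llm: Callable[[str], str], word: str, max_attempts: int = 3
) -> dict:
    """Call the model until a reply validates or the attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_llm_reply(ask_llm(word))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid reply after {max_attempts} attempts") from last_error


# Fake model for the demo: chatty prose first, valid JSON on the second call.
_replies = iter(["Sure! The lemma is câine.", '{"lemma": "câine", "pos": "NOUN"}'])


def flaky_llm(word: str) -> str:
    return next(_replies)


result = analyze_with_retry(flaky_llm, "câinele")
print(result)
```

Each retry is another paid, slow round trip, which is the overhead a deterministic endpoint avoids.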

What I'm building

LexicRo — an open-core, hosted API platform covering the endpoints Romanian developers actually need:

POST /analyze
→ lemma, POS, case, gender, number, person, tense per token

GET /conjugate/{verb}
→ full conjugation table — all moods and tenses including
  perfect simplu and viitor I (both tested at B1+)

GET /inflect/{word}
→ all inflected forms across cases, numbers, genders

GET /lookup/{word}
→ lexical data from DEXonline: definition, gender, plural, etymology

POST /difficulty
→ CEFR level scoring (A1–C2), calibrated to Romanian B1/B2 exams
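As a sketch of how a client might address these endpoints, the snippet below builds request URLs and an `/analyze` body using only the standard library. The host `api.lexicro.example` is a placeholder, and the `text` field in the POST body is my assumption; neither comes from a published spec. The one real detail worth showing is that Romanian diacritics in a path segment need percent-encoding.

```python
import json
from urllib.parse import quote, urljoin

BASE_URL = "https://api.lexicro.example/"  # placeholder host, not a real endpoint
GET_RESOURCES = {"conjugate", "inflect", "lookup"}


def endpoint_url(resource: str, word: str, base_url: str = BASE_URL) -> str:
    """Build a GET URL such as /lookup/{word}, percent-encoding the word as UTF-8."""
    if resource not in GET_RESOURCES:
        raise ValueError(f"unknown GET resource: {resource}")
    return urljoin(base_url, f"{resource}/{quote(word, safe='')}")


def analyze_body(sentence: str) -> bytes:
    """Encode a POST /analyze payload; the 'text' field name is an assumption."""
    return json.dumps({"text": sentence}, ensure_ascii=False).encode("utf-8")


print(endpoint_url("lookup", "câine"))
print(analyze_body("Câinele doarme."))
```

Passing `safe=''` to `quote` also encodes any `/` inside the word, so user input cannot change which route is hit.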

Technical approach

Not starting from scratch — the data and models are there:

  • Base model: bert-base-romanian-cased-v1 fine-tuned for morphological tagging
  • Conjugation: verbecc Romanian XML templates, extended with full B1+ tense coverage
  • Lexical: DEXonline database dump + RoLEX dataset
  • Infrastructure: FastAPI, Docker, full OpenAPI spec, Python and JS SDKs
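To illustrate the template idea behind the conjugation component, here is a deliberately simplified sketch: a fixed root plus a list of endings shared by every verb in the same class. The data structure and the split of "a cânta" into root and endings are hand-written for this example. They are not verbecc's actual XML format or API.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

PERSONS = ("eu", "tu", "el/ea", "noi", "voi", "ei/ele")


@dataclass(frozen=True)
class PresentTemplate:
    """Six present-tense endings, one per person, attached to a fixed root."""

    endings: Tuple[str, ...]


# Hand-written template for verbs like "a cânta" (to sing). The t/ț
# alternation in the "tu" form lives inside the endings, so the root is "cân".
CANTA_LIKE = PresentTemplate(endings=("t", "ți", "tă", "tăm", "tați", "tă"))


def conjugate_present(root: str, template: PresentTemplate) -> Dict[str, str]:
    """Attach each ending to the root, keyed by person."""
    return {person: root + ending for person, ending in zip(PERSONS, template.endings)}


table = conjugate_present("cân", CANTA_LIKE)
for person, form in table.items():
    print(person, form)
```

Keeping the consonant alternation inside the endings is what lets one template serve every verb in the class without special-case code.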

Licence and access

  • Code: MIT (self-hostable)
  • Model weights: CC BY-NC 4.0 (free for research/non-commercial, commercial use via hosted API)
  • Free tier: 1,000 req/day, no credit card, all endpoints
  • Paid tiers from €9/month for production use

Phase 1 ships first

The conjugation and lexical lookup endpoints are the straightforward part — wrapping verbecc and DEXonline cleanly. That's what ships first (~3 months). The morphological analyser (the hard part, requiring fine-tuned BERT) follows in phase 2.

What I'm looking for

I'm in pre-development and genuinely looking for:

  1. Feedback on the endpoint design — does this cover what you'd actually need?
  2. Early users working with Romanian text at any scale
  3. Academic connections — pursuing EU grant funding (Horizon Europe, CEF Digital)
  4. Anyone who's built adjacent to this — what did you learn?

Links: lexicro.com · github.com/LexicRo · contact@lexicro.com

Romanian deserves the same NLP infrastructure as French or German. Building it in public — feedback welcome.

Top comments (1)

AI Bug Slayer 🐞

LexicRo is filling a real gap — lower-resource languages often lack solid NLP tooling. Morphological analysis and conjugation for Romanian is complex work that benefits the whole community.