Building the Romanian NLP API that should already exist

#nlp #romanian #api #opensource

If you've tried to do anything programmatic with Romanian text, you've probably hit the same wall I did.

There's no clean API for it. You end up scraping DEXonline, wrestling with incomplete library support, or calling a general-purpose LLM and hoping it gets the grammar right. None of that is good enough for production.

The specific gap

Given an arbitrary Romanian sentence, return for each token: its lemma, part of speech, grammatical case, number, gender, person, and tense.
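
To make that concrete, here is a rough sketch of what a per-token result could look like. The field names, the `TokenAnalysis` class, and the Universal Dependencies-style labels are my assumptions for illustration, not a published LexicRo schema. The analysis of the sample sentence is written by hand, not produced by a model.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class TokenAnalysis:
    """One token's analysis. Field names are illustrative, not a published schema."""
    text: str
    lemma: str
    pos: str
    case: Optional[str] = None
    number: Optional[str] = None
    gender: Optional[str] = None
    person: Optional[str] = None
    tense: Optional[str] = None


# Hand-written analysis of "Câinele doarme." ("The dog sleeps."),
# using Universal Dependencies-style labels as an assumed convention.
tokens = [
    TokenAnalysis("Câinele", "câine", "NOUN", case="Nom", number="Sing", gender="Masc"),
    TokenAnalysis("doarme", "dormi", "VERB", number="Sing", person="3", tense="Pres"),
    TokenAnalysis(".", ".", "PUNCT"),
]

# ensure_ascii=False keeps Romanian diacritics readable in the JSON output.
payload = json.dumps({"tokens": [asdict(t) for t in tokens]}, ensure_ascii=False, indent=2)
print(payload)
```

Features that don't apply to a token (tense on a noun, case on a verb) come back as `null` here, so every token keeps the same shape.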

This is what spaCy gives you for English, French, and German after a pip install and a model download. For Romanian, I haven't found an equivalent exposed as a callable REST API.

By my rough estimate, Romanian NLP tooling sits at about 15% of what exists for English. The academic resources are there: DEXonline (313k+ lemmas), RoLEX (330k morphosyntactic entries), and the Universal Dependencies Romanian Treebank. They're just not packaged in a way developers can actually use.

But why not just use ChatGPT?

Fair question. The short answer: in production, an LLM is an oracle, not infrastructure. The definite singular form of câine in the nominative and accusative is always câinele, regardless of the LLM's mood that day. LexicRo returns deterministic, structured linguistic data by contract. Beyond that, LLM costs at scale are unpredictable; a dedicated conjugation endpoint can respond in under 50ms, versus 500ms–3s for an LLM call; and structured output from an LLM requires prompt engineering, validation, and retry logic. LexicRo returns clean JSON, every time.
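
For a sense of the validation and retry overhead, here is a minimal sketch of the wrapper you end up writing around an LLM call. Everything in it is hypothetical: `generate` stands in for whatever client function calls your model, and `REQUIRED_KEYS` is a made-up minimum schema.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"lemma", "pos"}  # hypothetical minimum schema for one token


def parse_token(raw: str) -> dict:
    """Parse one model reply, raising ValueError if it is not usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def ask_with_retries(generate: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call `generate` until its reply validates or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_token(generate())
        except ValueError as exc:
            last_error = exc  # a real client would also log and back off here
    raise RuntimeError(f"no valid reply after {max_attempts} attempts") from last_error


# Stand-in for an LLM call: two unusable replies, then well-formed JSON.
replies = iter([
    "Sure! The lemma is câine.",
    '{"lemma": "câine"}',
    '{"lemma": "câine", "pos": "NOUN"}',
])
result = ask_with_retries(lambda: next(replies))
print(result)
```

Each retry is another paid, slow model call, and the wrapper only checks the shape of the answer, not whether the grammar in it is right.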

What I'm building

LexicRo — an open-core, hosted API platform covering the endpoints Romanian developers actually need:

POST /analyze
→ lemma, POS, case, gender, number, person, tense per token

GET /conjugate/{verb}
→ full conjugation table — all moods and tenses including
  perfect simplu and viitor I (both tested at B1+)

GET /inflect/{word}
→ all inflected forms across cases, numbers, genders

GET /lookup/{word}
→ lexical data from DEXonline: definition, gender, plural, etymology

POST /difficulty
→ CEFR level scoring (A1–C2), calibrated to Romanian B1/B2 exams
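
One practical detail for the GET endpoints: Romanian words carry diacritics, so the word in the path has to be percent-encoded. A small client-side sketch, assuming a placeholder host (the real API base URL isn't published yet) and a hypothetical helper name `endpoint_url`:

```python
import unicodedata
from urllib.parse import quote

# Placeholder host: the real API base URL is not given in this post.
BASE_URL = "https://api.example.invalid"


def endpoint_url(endpoint: str, word: str, base_url: str = BASE_URL) -> str:
    """Build a GET URL such as /conjugate/{verb} with a safely encoded word."""
    # Compose characters first so "a" + combining circumflex and "â" encode identically.
    normalized = unicodedata.normalize("NFC", word.strip())
    # safe="" also encodes "/" so a word can never add extra path segments.
    return f"{base_url}/{endpoint}/{quote(normalized, safe='')}"


print(endpoint_url("conjugate", "învăța"))
print(endpoint_url("inflect", "câine"))
print(endpoint_url("lookup", "ca\u0302ine"))  # decomposed input, same path as "câine"
```

Normalizing to NFC before encoding means the same word always maps to the same URL, which matters if responses are cached by path.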

Technical approach

Not starting from scratch — the data and models are there:

Base model: bert-base-romanian-cased-v1 fine-tuned for morphological tagging
Conjugation: verbecc Romanian XML templates, extended with full B1+ tense coverage
Lexical: DEXonline database dump + RoLEX dataset
Infrastructure: FastAPI, Docker, full OpenAPI spec, Python and JS SDKs
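
For anyone unfamiliar with template-driven conjugation, the general idea is a stem plus a shared table of endings. The sketch below shows that idea only: the `Template` class, the `LUCRA` entry, and the function name are hand-written for illustration and are not verbecc's actual XML format, data, or API.

```python
from dataclasses import dataclass

PERSONS = ("eu", "tu", "el/ea", "noi", "voi", "ei/ele")


@dataclass(frozen=True)
class Template:
    """Endings for one tense, shared by all verbs that follow the same pattern."""
    name: str
    infinitive_suffix: str
    present_endings: tuple


# Illustrative template for verbs conjugated like "a lucra" (present indicative).
# Hand-written for this example; it is not copied from verbecc's XML.
LUCRA = Template("lucr:a", "a", ("ez", "ezi", "ează", "ăm", "ați", "ează"))


def conjugate_present(infinitive: str, template: Template) -> dict:
    """Strip the infinitive suffix to get the stem, then attach each ending."""
    if not infinitive.endswith(template.infinitive_suffix):
        raise ValueError(f"{infinitive!r} does not match template {template.name!r}")
    stem = infinitive[: -len(template.infinitive_suffix)]
    return {person: stem + ending for person, ending in zip(PERSONS, template.present_endings)}


for person, form in conjugate_present("lucra", LUCRA).items():
    print(f"{person}: {form}")
```

Verbs with consonant alternations (a cânta: eu cânt, tu cânți) need the stem cut earlier in the word so the changing consonant lives in the endings, which is part of why a curated template set is worth reusing.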

Licence and access

Code: MIT (self-hostable)
Model weights: CC BY-NC 4.0 (free for research/non-commercial, commercial use via hosted API)
Free tier: 1,000 req/day, no credit card, all endpoints
Paid tiers from €9/month for production use

Phase 1 ships first

The conjugation and lexical lookup endpoints are the straightforward part — wrapping verbecc and DEXonline cleanly. That's what ships first (~3 months). The morphological analyser (the hard part, requiring fine-tuned BERT) follows in phase 2.

What I'm looking for

I'm in pre-development and genuinely looking for:

Feedback on the endpoint design — does this cover what you'd actually need?
Early users working with Romanian text at any scale
Academic connections — pursuing EU grant funding (Horizon Europe, CEF Digital)
Anyone who's built adjacent to this — what did you learn?

Links: lexicro.com · github.com/LexicRo · contact@lexicro.com

Romanian deserves the same NLP infrastructure as French or German. Building it in public — feedback welcome.

Top comments (2)

AI Bug Slayer 🐞 • Apr 18

LexicRo is filling a real gap — lower-resource languages often lack solid NLP tooling. Morphological analysis and conjugation for Romanian is complex work that benefits the whole community.

Peter Abolins • Apr 19

Thanks — that's exactly the framing I am aiming for. Romanian sits in an awkward middle ground: large enough to have real demand (24M speakers, official EU language) but small enough that it's been overlooked by the major NLP tooling efforts. Phase 1 is conjugation and lexical lookup, which are the straightforward parts. The morphological analyser is where it will get interesting. Happy to keep you posted as it develops.