If you've tried to do anything programmatic with Romanian text, you've probably hit the same wall I did.
There's no clean API for it. You end up scraping DEXonline, wrestling with incomplete library support, or calling a general-purpose LLM and hoping it gets the grammar right. None of that is good enough for production.
The specific gap
Given an arbitrary Romanian sentence, return for each token: its lemma, part of speech, grammatical case, number, gender, person, and tense.
This is what spaCy gives you for English, French, and German with a single pip install. For Romanian, no equivalent exists as a callable REST API.
Romanian NLP tooling sits at roughly 15% of what exists for English. The academic resources are there — DEXonline (313k+ lemmas), RoLEX (330k morphosyntactic entries), the Universal Dependencies Romanian Treebank — they're just not packaged in a way developers can actually use.
But why not just use ChatGPT?
Fair question. The short answer: in production, an LLM is an oracle, not infrastructure. The definite nominative-accusative singular of câine is always câinele, regardless of the LLM's mood that day. LexicRo returns deterministic, structured linguistic data by contract. Beyond that, LLM costs at scale are unpredictable; a dedicated conjugation endpoint is designed to respond in under 50ms, versus 500ms–3s for an LLM call; and structured output from an LLM requires prompt engineering, validation, and retry logic. LexicRo returns clean JSON, every time.
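To make the "validation and retry logic" point concrete, here is a minimal sketch of the wrapper you typically end up writing around an LLM call. Everything in it is an assumption for illustration: the two-key schema, the `ask_llm` callable, and the canned replies stand in for whatever model client and prompt you would actually use.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"lemma", "pos"}  # assumed minimal schema for one token


def parse_token(raw: str) -> dict:
    """Validate one model reply; raise ValueError if it is not usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not JSON: {exc}") from exc
    if not isinstance(data, dict) or not REQUIRED_KEYS <= data.keys():
        raise ValueError("missing required keys")
    return data


def analyse_with_retries(
    ask_llm: Callable[[str], str], word: str, attempts: int = 3
) -> dict:
    """Call the model until a reply passes validation or attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            return parse_token(ask_llm(word))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid reply after {attempts} attempts") from last_error


# Stand-in for a real LLM client (hypothetical): the first reply is chatty
# prose, the second is valid JSON.
replies = iter(['Sure! The lemma is "câine".', '{"lemma": "câine", "pos": "NOUN"}'])
result = analyse_with_retries(lambda word: next(replies), "câinele")
```

Each retry is another paid, slow round trip, which is the cost a deterministic endpoint avoids.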
What I'm building
LexicRo — an open-core, hosted API platform covering the endpoints Romanian developers actually need:
POST /analyze
→ lemma, POS, case, gender, number, person, tense per token
GET /conjugate/{verb}
→ full conjugation table: all moods and tenses, including perfect simplu and viitor I (both tested at B1+)
GET /inflect/{word}
→ all inflected forms across cases, numbers, genders
GET /lookup/{word}
→ lexical data from DEXonline: definition, gender, plural, etymology
POST /difficulty
→ CEFR level scoring (A1–C2), calibrated to Romanian B1/B2 exams
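As a rough idea of how `POST /analyze` could look from the client side, here is a sketch in Python. Nothing here is a published contract: the host name, the `{"text": ...}` request body, the `{"tokens": [...]}` response envelope, and the field names are all assumptions made for illustration, and the sample payload is hand-written rather than real API output.

```python
import json
from dataclasses import dataclass
from typing import Optional
from urllib.request import Request

BASE_URL = "https://api.lexicro.example"  # placeholder host, not a real endpoint


@dataclass(frozen=True)
class TokenAnalysis:
    """One token of a hypothetical /analyze response."""
    text: str
    lemma: str
    pos: str
    case: Optional[str] = None
    number: Optional[str] = None
    gender: Optional[str] = None
    person: Optional[str] = None
    tense: Optional[str] = None


def build_analyze_request(sentence: str) -> Request:
    """Build (but do not send) a POST /analyze request with a JSON body."""
    body = json.dumps({"text": sentence}).encode("utf-8")
    return Request(
        f"{BASE_URL}/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_analyze_response(payload: str) -> list[TokenAnalysis]:
    """Turn the assumed {"tokens": [...]} JSON payload into typed records."""
    data = json.loads(payload)
    return [TokenAnalysis(**token) for token in data["tokens"]]


# Hand-written sample payload for illustration, not real API output.
sample = json.dumps({
    "tokens": [
        {"text": "Câinele", "lemma": "câine", "pos": "NOUN",
         "case": "Nom", "number": "Sing", "gender": "Masc"},
        {"text": "doarme", "lemma": "dormi", "pos": "VERB",
         "number": "Sing", "person": "3", "tense": "Pres"},
    ]
})

request = build_analyze_request("Câinele doarme.")
tokens = parse_analyze_response(sample)
```

Optional fields default to `None` because not every feature applies to every token: a noun has no tense, and a verb has no case.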
Technical approach
Not starting from scratch — the data and models are there:
- Base model: bert-base-romanian-cased-v1, fine-tuned for morphological tagging
- Conjugation: verbecc Romanian XML templates, extended with full B1+ tense coverage
- Lexical: DEXonline database dump + RoLEX dataset
- Infrastructure: FastAPI, Docker, full OpenAPI spec, Python and JS SDKs
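One step that sits between a fine-tuned BERT tagger and per-token JSON is mapping subword predictions back to whole words. The sketch below shows one common strategy (keep the label predicted for the first subword), assuming WordPiece-style `##` continuation markers. The tokenisation and tags are made up for illustration, and this is not necessarily how LexicRo will do it.

```python
def merge_wordpieces(pieces: list[str], labels: list[str]) -> list[tuple[str, str]]:
    """Rebuild words from WordPiece tokens, keeping the first piece's label."""
    if len(pieces) != len(labels):
        raise ValueError("pieces and labels must have the same length")
    words: list[tuple[str, str]] = []
    for piece, label in zip(pieces, labels):
        if piece.startswith("##") and words:
            # Continuation piece: append its text, ignore its label.
            text, first_label = words[-1]
            words[-1] = (text + piece[2:], first_label)
        else:
            # First piece of a new word: its label represents the word.
            words.append((piece, label))
    return words


# Made-up tokenisation and tags; a real tokenizer may split differently.
pieces = ["câi", "##nele", "doar", "##me"]
labels = ["NOUN", "X", "VERB", "X"]
merged = merge_wordpieces(pieces, labels)
```

Other strategies exist, such as voting across all pieces of a word. First-piece labelling is simply the easiest to reason about.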
Licence and access
- Code: MIT (self-hostable)
- Model weights: CC BY-NC 4.0 (free for research/non-commercial, commercial use via hosted API)
- Free tier: 1,000 req/day, no credit card, all endpoints
- Paid tiers from €9/month for production use
Phase 1 ships first
The conjugation and lexical lookup endpoints are the straightforward part — wrapping verbecc and DEXonline cleanly. That's what ships first (~3 months). The morphological analyser (the hard part, requiring fine-tuned BERT) follows in phase 2.
What I'm looking for
I'm in pre-development and genuinely looking for:
- Feedback on the endpoint design — does this cover what you'd actually need?
- Early users working with Romanian text at any scale
- Academic connections — pursuing EU grant funding (Horizon Europe, CEF Digital)
- Anyone who's built adjacent to this — what did you learn?
Links: lexicro.com · github.com/LexicRo · contact@lexicro.com
Romanian deserves the same NLP infrastructure as French or German. Building it in public — feedback welcome.
Top comments (1)
LexicRo is filling a real gap — lower-resource languages often lack solid NLP tooling. Morphological analysis and conjugation for Romanian is complex work that benefits the whole community.