DEV Community

Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

NLP Starter Kit

Natural language processing has evolved from rule-based tokenizers to transformer-based models that understand context, but the engineering complexity has grown with it. This starter kit gives you production-ready text processing pipelines, embedding workflows for semantic search and clustering, fine-tuning scripts for domain adaptation, and a complete RAG (Retrieval-Augmented Generation) implementation. Each component is modular — use the text preprocessor standalone, or wire everything together into an end-to-end NLP system. Built with Hugging Face Transformers, sentence-transformers, and PyTorch.

Key Features

  • Text Processing Pipeline — Configurable preprocessing with tokenization, normalization, stopword removal, and language detection.
  • Embedding Workflows — Generate, store, and search embeddings using sentence-transformers with FAISS, ChromaDB, and NumPy backends.
  • Fine-Tuning Scripts — LoRA-based fine-tuning for text classification, NER, and seq2seq tasks with automatic hyperparameter selection.
  • RAG Implementation — Retrieval-augmented generation with document ingestion, chunking, retrieval, and source attribution.
  • Text Classification — Train classifiers with TF-IDF + scikit-learn baselines and transformer models, with automatic selection.
  • Evaluation Suite — BLEU, ROUGE, BERTScore, and custom metrics with per-class breakdowns.
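To make the preprocessing options concrete, here is a minimal pure-Python sketch of the cleaning steps the `preprocessing` config controls. This is an illustration, not the kit's actual implementation (which lives in `src/nlp_starter_kit/core.py`); the function name and signature are assumptions for the example.

```python
import re

def preprocess(text, lowercase=True, remove_html=True,
               remove_urls=True, min_token_length=2):
    """Apply the cleaning steps from the `preprocessing` config block."""
    if remove_html:
        text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    if remove_urls:
        text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    if lowercase:
        text = text.lower()
    tokens = re.findall(r"\w+", text)              # simple word tokenizer
    return [t for t in tokens if len(t) >= min_token_length]

print(preprocess("<p>Visit https://example.com for NLP tips!</p>"))
# → ['visit', 'for', 'nlp', 'tips']
```

In production you would swap the regex tokenizer for the Hugging Face or spaCy tokenizer the kit uses downstream; the point is that each config flag maps to one independent, composable step.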

Quick Start

unzip nlp-starter-kit.zip && cd nlp-starter-kit
pip install -r requirements.txt

# Run text classification or generate embeddings
python src/nlp_starter_kit/core.py classify --config config.example.yaml
python src/nlp_starter_kit/core.py embed --input ./docs/ --output ./embeddings/
# config.example.yaml
preprocessing:
  lowercase: true
  remove_html: true
  remove_urls: true
  min_token_length: 2
  max_sequence_length: 512

embeddings:
  model: sentence-transformers/all-MiniLM-L6-v2
  batch_size: 64
  normalize: true
  vector_store: faiss  # faiss | chromadb | numpy

classification:
  approach: transformer  # tfidf_svm | tfidf_lr | transformer
  model: distilbert-base-uncased
  num_labels: 5
  max_epochs: 10
  learning_rate: 2e-5
  early_stopping_patience: 3

rag:
  chunking: { strategy: recursive, chunk_size: 512, chunk_overlap: 50 }
  retriever: { model: sentence-transformers/all-MiniLM-L6-v2, top_k: 5 }
  generator: { model: google/flan-t5-base, max_new_tokens: 256 }

fine_tuning:
  method: lora  # full | lora | prefix
  lora_r: 16
  lora_alpha: 32
  target_modules: [q_proj, v_proj]

Architecture

┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  Raw Text   │────>│  Preprocessor│────>│  Tokenizer   │
│  / Docs     │     │  (clean/norm)│     │  (HF/spaCy)  │
└─────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                    ┌──────────────┐     ┌───────▼───────┐
                    │  Vector      │<────│  Embedding    │
                    │  Store       │     │  Model        │
                    └──────┬───────┘     └───────────────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
       ┌──────▼────┐ ┌─────▼─────┐ ┌───▼──────┐
       │ Semantic  │ │ Classify  │ │  RAG     │
       │ Search    │ │ / NER     │ │ Generate │
       └───────────┘ └───────────┘ └──────────┘
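The flow in the diagram can be approximated end to end in a few lines of pure Python, with a toy bag-of-words counter standing in for the embedding model and a plain list standing in for the vector store. Every name here is illustrative, not the kit's API:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a sentence-transformer."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Vector store": a list of (document, embedding) pairs
docs = ["database indexing best practices",
        "connection pooling configuration",
        "tokenizer settings"]
store = [(d, embed(d)) for d in docs]

def search(query, top_k=2):
    q = embed(query)
    return sorted(store, key=lambda p: cosine(q, p[1]), reverse=True)[:top_k]

print([d for d, _ in search("how to configure connection pooling")])
# → ['connection pooling configuration', 'database indexing best practices']
```

Replacing `embed` with a real sentence-transformer and the list with FAISS gives the semantic-search branch of the diagram; the classification and RAG branches reuse the same embeddings.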

Usage Examples

Text Classification with Transformers

from nlp_starter_kit.core import TextClassifier

classifier = TextClassifier.from_config("config.example.yaml")
classifier.train(train_path="./data/train.csv", val_path="./data/val.csv",
                 text_column="text", label_column="category")

metrics = classifier.evaluate("./data/test.csv")
print(f"Accuracy: {metrics['accuracy']:.4f} | F1: {metrics['f1_macro']:.4f}")

predictions = classifier.predict(["The quarterly earnings exceeded expectations."])
print(f"{predictions[0]['label']} ({predictions[0]['confidence']:.2f})")

Semantic Search with Embeddings

from nlp_starter_kit.core import EmbeddingIndex

index = EmbeddingIndex(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    vector_store="faiss",
)

# Index documents and search
index.add_documents(documents_dir="./knowledge_base/", file_types=["md", "txt", "pdf"])
results = index.search("How do I configure database connection pooling?", top_k=5)
for r in results:
    print(f"[{r['score']:.4f}] {r['source']}: {r['text'][:100]}...")

Fine-Tune with LoRA

from nlp_starter_kit.core import FineTuner

tuner = FineTuner(base_model="distilbert-base-uncased", method="lora", lora_r=16, lora_alpha=32)
tuner.train(
    train_data="./data/domain_train.csv", val_data="./data/domain_val.csv",
    task="classification", num_labels=8, epochs=5, learning_rate=2e-5,
)
tuner.save_adapter("./models/domain_adapter/")  # ~10MB vs full model ~250MB

# Load and use
model = tuner.load_model("./models/domain_adapter/")
predictions = model.predict(["Sample domain text for classification."])

RAG Pipeline

from nlp_starter_kit.core import RAGPipeline

rag = RAGPipeline.from_config("config.example.yaml")
rag.ingest(documents_dir="./knowledge_base/", chunking_strategy="recursive", chunk_size=512)

response = rag.query("What are the best practices for database indexing?")
print(f"Answer: {response.answer}")
for source in response.sources:
    print(f"  - {source.title} (relevance: {source.score:.3f})")

Configuration Reference

Parameter               | Type | Default          | Description
------------------------|------|------------------|-------------------------------------
embeddings.model        | str  | all-MiniLM-L6-v2 | Sentence-transformer model
embeddings.vector_store | str  | faiss            | Vector store backend
classification.approach | str  | transformer      | tfidf_svm, tfidf_lr, or transformer
fine_tuning.method      | str  | lora             | full, lora, or prefix
rag.retriever.top_k     | int  | 5                | Number of chunks to retrieve

Best Practices

  1. Start with TF-IDF baselines — Before fine-tuning a transformer, try TF-IDF + SVM. If it hits 90% of target accuracy, you may not need transformers.
  2. Use LoRA for fine-tuning — LoRA adapters are 100x smaller than full fine-tuning and perform comparably with 1,000+ samples.
  3. Chunk by semantic boundaries — Use the recursive strategy, which respects paragraph and sentence boundaries instead of cutting mid-sentence at a fixed character count.
  4. Evaluate retrieval separately from generation — If the right context isn't retrieved, no generator can produce a good answer.
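Point 3 can be illustrated with a simplified recursive splitter. It prefers paragraph breaks, then line breaks, then sentence boundaries, then words, and hard-splits with overlap only as a last resort. This is a sketch of the idea under the same `chunk_size`/`chunk_overlap` semantics as the config, not the kit's implementation; for brevity it applies `chunk_overlap` only on hard splits.

```python
def recursive_chunk(text, chunk_size=512, chunk_overlap=50,
                    separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator present, merging pieces up to
    chunk_size; recurse with finer separators for oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep in text:
            chunks, current = [], ""
            for part in text.split(sep):
                candidate = f"{current}{sep}{part}" if current else part
                if len(candidate) <= chunk_size:
                    current = candidate      # keep merging small pieces
                elif current:
                    chunks.append(current)   # flush the full chunk
                    current = part
                else:
                    current = part           # single piece over chunk_size
            if current:
                chunks.append(current)
            # recurse into any piece that is still too large
            return [c for chunk in chunks
                    for c in recursive_chunk(chunk, chunk_size,
                                             chunk_overlap, separators)]
    # no separator left: hard character split with overlap
    step = max(1, chunk_size - chunk_overlap)
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With a two-paragraph document and `chunk_size=40`, the splitter returns one chunk per paragraph rather than slicing across the paragraph break, which is exactly the property that keeps retrieved chunks coherent.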

Troubleshooting

Issue                                       | Cause                                           | Fix
--------------------------------------------|-------------------------------------------------|--------------------------------------------------------------------------
Fine-tuning loss doesn't decrease           | Learning rate too high or data formatting issue | Reduce LR to 1e-5; verify the CSV has the expected text and label columns
Embedding search returns irrelevant results | Chunk size too large or wrong embedding model   | Reduce chunk_size to 256; try all-mpnet-base-v2 for better quality
CUDA OOM during fine-tuning                 | Model + optimizer states exceed VRAM            | Reduce batch_size to 4-8, enable gradient checkpointing, use LoRA
RAG answers are hallucinated                | Retrieved context is irrelevant                 | Increase top_k, improve chunking, add a reranker stage

This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete NLP Starter Kit with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.

Get the Complete Bundle →

