DEV Community

Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

NLP Starter Kit

Natural language processing has evolved from rule-based tokenizers to transformer-based models that understand context, but the engineering complexity has grown with it. This starter kit gives you production-ready text processing pipelines, embedding workflows for semantic search and clustering, fine-tuning scripts for domain adaptation, and a complete RAG (Retrieval-Augmented Generation) implementation. Each component is modular — use the text preprocessor standalone, or wire everything together into an end-to-end NLP system. Built with Hugging Face Transformers, sentence-transformers, and PyTorch.

Key Features

  • Text Processing Pipeline — Configurable preprocessing with tokenization, normalization, stopword removal, and language detection.
  • Embedding Workflows — Generate, store, and search embeddings using sentence-transformers with FAISS, ChromaDB, and NumPy backends.
  • Fine-Tuning Scripts — LoRA-based fine-tuning for text classification, NER, and seq2seq tasks with automatic hyperparameter selection.
  • RAG Implementation — Retrieval-augmented generation with document ingestion, chunking, retrieval, and source attribution.
  • Text Classification — Train classifiers with TF-IDF + scikit-learn baselines and transformer models, with automatic selection.
  • Evaluation Suite — BLEU, ROUGE, BERTScore, and custom metrics with per-class breakdowns.
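To make the preprocessing options concrete, here is a minimal pure-Python sketch of the cleaning steps the `preprocessing` config controls. This is an illustration, not the kit's actual implementation (which lives in `src/nlp_starter_kit/core.py`); the function name and signature are assumptions for the example.

```python
import re

def preprocess(text, lowercase=True, remove_html=True,
               remove_urls=True, min_token_length=2):
    """Apply the cleaning steps from the `preprocessing` config block."""
    if remove_html:
        text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    if remove_urls:
        text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    if lowercase:
        text = text.lower()
    tokens = re.findall(r"\w+", text)              # simple word tokenizer
    return [t for t in tokens if len(t) >= min_token_length]

print(preprocess("<p>Visit https://example.com for NLP tips!</p>"))
# → ['visit', 'for', 'nlp', 'tips']
```

In production you would swap the regex tokenizer for the Hugging Face or spaCy tokenizer the kit uses downstream; the point is that each config flag maps to one independent, composable step.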

Quick Start

unzip nlp-starter-kit.zip && cd nlp-starter-kit
pip install -r requirements.txt

# Run text classification or generate embeddings
python src/nlp_starter_kit/core.py classify --config config.example.yaml
python src/nlp_starter_kit/core.py embed --input ./docs/ --output ./embeddings/
# config.example.yaml
preprocessing:
  lowercase: true
  remove_html: true
  remove_urls: true
  min_token_length: 2
  max_sequence_length: 512

embeddings:
  model: sentence-transformers/all-MiniLM-L6-v2
  batch_size: 64
  normalize: true
  vector_store: faiss  # faiss | chromadb | numpy

classification:
  approach: transformer  # tfidf_svm | tfidf_lr | transformer
  model: distilbert-base-uncased
  num_labels: 5
  max_epochs: 10
  learning_rate: 2e-5
  early_stopping_patience: 3

rag:
  chunking: { strategy: recursive, chunk_size: 512, chunk_overlap: 50 }
  retriever: { model: sentence-transformers/all-MiniLM-L6-v2, top_k: 5 }
  generator: { model: google/flan-t5-base, max_new_tokens: 256 }

fine_tuning:
  method: lora  # full | lora | prefix
  lora_r: 16
  lora_alpha: 32
  target_modules: [q_proj, v_proj]

Architecture

┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  Raw Text   │────>│  Preprocessor│────>│  Tokenizer   │
│  / Docs     │     │  (clean/norm)│     │  (HF/spaCy)  │
└─────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                    ┌──────────────┐     ┌───────▼───────┐
                    │  Vector      │<────│  Embedding    │
                    │  Store       │     │  Model        │
                    └──────┬───────┘     └───────────────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
       ┌──────▼────┐ ┌─────▼─────┐ ┌───▼──────┐
       │ Semantic  │ │ Classify  │ │  RAG     │
       │ Search    │ │ / NER     │ │ Generate │
       └───────────┘ └───────────┘ └──────────┘
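The flow in the diagram can be approximated end to end in a few lines of pure Python, with a toy bag-of-words counter standing in for the embedding model and a plain list standing in for the vector store. Every name here is illustrative, not the kit's API:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a sentence-transformer."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Vector store": a list of (document, embedding) pairs
docs = ["database indexing best practices",
        "connection pooling configuration",
        "tokenizer settings"]
store = [(d, embed(d)) for d in docs]

def search(query, top_k=2):
    q = embed(query)
    return sorted(store, key=lambda p: cosine(q, p[1]), reverse=True)[:top_k]

print([d for d, _ in search("how to configure connection pooling")])
# → ['connection pooling configuration', 'database indexing best practices']
```

Replacing `embed` with a real sentence-transformer and the list with FAISS gives the semantic-search branch of the diagram; the classification and RAG branches reuse the same embeddings.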

Usage Examples

Text Classification with Transformers

from nlp_starter_kit.core import TextClassifier

classifier = TextClassifier.from_config("config.example.yaml")
classifier.train(train_path="./data/train.csv", val_path="./data/val.csv",
                 text_column="text", label_column="category")

metrics = classifier.evaluate("./data/test.csv")
print(f"Accuracy: {metrics['accuracy']:.4f} | F1: {metrics['f1_macro']:.4f}")

predictions = classifier.predict(["The quarterly earnings exceeded expectations."])
print(f"{predictions[0]['label']} ({predictions[0]['confidence']:.2f})")

Semantic Search with Embeddings

from nlp_starter_kit.core import EmbeddingIndex

index = EmbeddingIndex(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    vector_store="faiss",
)

# Index documents and search
index.add_documents(documents_dir="./knowledge_base/", file_types=["md", "txt", "pdf"])
results = index.search("How do I configure database connection pooling?", top_k=5)
for r in results:
    print(f"[{r['score']:.4f}] {r['source']}: {r['text'][:100]}...")

Fine-Tune with LoRA

from nlp_starter_kit.core import FineTuner

tuner = FineTuner(base_model="distilbert-base-uncased", method="lora", lora_r=16, lora_alpha=32)
tuner.train(
    train_data="./data/domain_train.csv", val_data="./data/domain_val.csv",
    task="classification", num_labels=8, epochs=5, learning_rate=2e-5,
)
tuner.save_adapter("./models/domain_adapter/")  # ~10MB vs full model ~250MB

# Load and use
model = tuner.load_model("./models/domain_adapter/")
predictions = model.predict(["Sample domain text for classification."])

RAG Pipeline

from nlp_starter_kit.core import RAGPipeline

rag = RAGPipeline.from_config("config.example.yaml")
rag.ingest(documents_dir="./knowledge_base/", chunking_strategy="recursive", chunk_size=512)

response = rag.query("What are the best practices for database indexing?")
print(f"Answer: {response.answer}")
for source in response.sources:
    print(f"  - {source.title} (relevance: {source.score:.3f})")

Configuration Reference

Parameter               | Type | Default          | Description
------------------------|------|------------------|-------------------------------------
embeddings.model        | str  | all-MiniLM-L6-v2 | Sentence-transformer model
embeddings.vector_store | str  | faiss            | Vector store backend
classification.approach | str  | transformer      | tfidf_svm, tfidf_lr, or transformer
fine_tuning.method      | str  | lora             | full, lora, or prefix
rag.retriever.top_k     | int  | 5                | Number of chunks to retrieve

Best Practices

  1. Start with TF-IDF baselines — Before fine-tuning a transformer, try TF-IDF + SVM. If it hits 90% of target accuracy, you may not need transformers.
  2. Use LoRA for fine-tuning — LoRA adapters are 100x smaller than full fine-tuning and perform comparably with 1,000+ samples.
  3. Chunk by semantic boundaries — Use the recursive strategy, which respects paragraph and sentence boundaries instead of cutting mid-sentence at a fixed character count.
  4. Evaluate retrieval separately from generation — If the right context isn't retrieved, no generator can produce a good answer.
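Point 3 can be illustrated with a simplified recursive splitter. It prefers paragraph breaks, then line breaks, then sentence boundaries, then words, and hard-splits with overlap only as a last resort. This is a sketch of the idea under the same `chunk_size`/`chunk_overlap` semantics as the config, not the kit's implementation; for brevity it applies `chunk_overlap` only on hard splits.

```python
def recursive_chunk(text, chunk_size=512, chunk_overlap=50,
                    separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator present, merging pieces up to
    chunk_size; recurse with finer separators for oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep in text:
            chunks, current = [], ""
            for part in text.split(sep):
                candidate = f"{current}{sep}{part}" if current else part
                if len(candidate) <= chunk_size:
                    current = candidate      # keep merging small pieces
                elif current:
                    chunks.append(current)   # flush the full chunk
                    current = part
                else:
                    current = part           # single piece over chunk_size
            if current:
                chunks.append(current)
            # recurse into any piece that is still too large
            return [c for chunk in chunks
                    for c in recursive_chunk(chunk, chunk_size,
                                             chunk_overlap, separators)]
    # no separator left: hard character split with overlap
    step = max(1, chunk_size - chunk_overlap)
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With a two-paragraph document and `chunk_size=40`, the splitter returns one chunk per paragraph rather than slicing across the paragraph break, which is exactly the property that keeps retrieved chunks coherent.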

Troubleshooting

Issue                                       | Cause                                           | Fix
--------------------------------------------|-------------------------------------------------|--------------------------------------------------------------------------
Fine-tuning loss doesn't decrease           | Learning rate too high or data formatting issue | Reduce LR to 1e-5; verify the CSV has the expected text and label columns
Embedding search returns irrelevant results | Chunk size too large or wrong embedding model   | Reduce chunk_size to 256; try all-mpnet-base-v2 for better quality
CUDA OOM during fine-tuning                 | Model + optimizer states exceed VRAM            | Reduce batch_size to 4-8, enable gradient checkpointing, use LoRA
RAG answers are hallucinated                | Retrieved context is irrelevant                 | Increase top_k, improve chunking, add a reranker stage

This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete NLP Starter Kit with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.

Get the Complete Bundle →

