# NLP Starter Kit
Natural language processing has evolved from rule-based tokenizers to transformer-based models that understand context, but the engineering complexity has grown with it. This starter kit gives you production-ready text processing pipelines, embedding workflows for semantic search and clustering, fine-tuning scripts for domain adaptation, and a complete RAG (Retrieval-Augmented Generation) implementation. Each component is modular — use the text preprocessor standalone, or wire everything together into an end-to-end NLP system. Built with Hugging Face Transformers, sentence-transformers, and PyTorch.
## Key Features
- Text Processing Pipeline — Configurable preprocessing with tokenization, normalization, stopword removal, and language detection.
- Embedding Workflows — Generate, store, and search embeddings using sentence-transformers with FAISS, ChromaDB, and NumPy backends.
- Fine-Tuning Scripts — LoRA-based fine-tuning for text classification, NER, and seq2seq tasks with automatic hyperparameter selection.
- RAG Implementation — Retrieval-augmented generation with document ingestion, chunking, retrieval, and source attribution.
- Text Classification — Train classifiers with TF-IDF + scikit-learn baselines and transformer models, with automatic selection.
- Evaluation Suite — BLEU, ROUGE, BERTScore, and custom metrics with per-class breakdowns.
## Quick Start

```bash
unzip nlp-starter-kit.zip && cd nlp-starter-kit
pip install -r requirements.txt

# Run text classification or generate embeddings
python src/nlp_starter_kit/core.py classify --config config.example.yaml
python src/nlp_starter_kit/core.py embed --input ./docs/ --output ./embeddings/
```
```yaml
# config.example.yaml
preprocessing:
  lowercase: true
  remove_html: true
  remove_urls: true
  min_token_length: 2
  max_sequence_length: 512

embeddings:
  model: sentence-transformers/all-MiniLM-L6-v2
  batch_size: 64
  normalize: true
  vector_store: faiss  # faiss | chromadb | numpy

classification:
  approach: transformer  # tfidf_svm | tfidf_lr | transformer
  model: distilbert-base-uncased
  num_labels: 5
  max_epochs: 10
  learning_rate: 2e-5
  early_stopping_patience: 3

rag:
  chunking: { strategy: recursive, chunk_size: 512, chunk_overlap: 50 }
  retriever: { model: sentence-transformers/all-MiniLM-L6-v2, top_k: 5 }
  generator: { model: google/flan-t5-base, max_new_tokens: 256 }

fine_tuning:
  method: lora  # full | lora | prefix
  lora_r: 16
  lora_alpha: 32
  target_modules: [q_proj, v_proj]
```
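A config like the one above is plain YAML, so it can be inspected outside the kit's CLI. A minimal sketch, assuming PyYAML is installed (the config string below is an excerpt, inlined so the snippet is self-contained):

```python
# Load a YAML config and read nested keys the way the kit's components would.
import yaml

CONFIG = """
embeddings:
  model: sentence-transformers/all-MiniLM-L6-v2
  batch_size: 64
  vector_store: faiss
rag:
  retriever: { model: sentence-transformers/all-MiniLM-L6-v2, top_k: 5 }
"""

cfg = yaml.safe_load(CONFIG)
print(cfg["embeddings"]["vector_store"])
print(cfg["rag"]["retriever"]["top_k"])
```

This is handy for validating overrides before a long training run.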
## Architecture

```
┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  Raw Text   │────>│ Preprocessor │────>│  Tokenizer   │
│   / Docs    │     │ (clean/norm) │     │  (HF/spaCy)  │
└─────────────┘     └──────────────┘     └──────┬───────┘
                                                │
                    ┌──────────────┐    ┌───────▼───────┐
                    │    Vector    │<───│   Embedding   │
                    │    Store     │    │     Model     │
                    └──────┬───────┘    └───────────────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
        ┌─────▼─────┐ ┌────▼──────┐ ┌───▼──────┐
        │ Semantic  │ │ Classify  │ │   RAG    │
        │  Search   │ │  / NER    │ │ Generate │
        └───────────┘ └───────────┘ └──────────┘
```
## Usage Examples

### Text Classification with Transformers
```python
from nlp_starter_kit.core import TextClassifier

classifier = TextClassifier.from_config("config.example.yaml")
classifier.train(train_path="./data/train.csv", val_path="./data/val.csv",
                 text_column="text", label_column="category")

metrics = classifier.evaluate("./data/test.csv")
print(f"Accuracy: {metrics['accuracy']:.4f} | F1: {metrics['f1_macro']:.4f}")

predictions = classifier.predict(["The quarterly earnings exceeded expectations."])
print(f"{predictions[0]['label']} ({predictions[0]['confidence']:.2f})")
```
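Before reaching for the transformer path, the kit's `tfidf_svm` approach is worth trying. A hedged sketch of what that baseline amounts to, written directly against scikit-learn (the toy training data below is illustrative only, not part of the kit):

```python
# TF-IDF + linear SVM baseline, the idea behind classification.approach: tfidf_svm.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "earnings beat expectations this quarter",
    "revenue and profit margins improved",
    "the team won the championship game",
    "a thrilling match decided in overtime",
]
train_labels = ["finance", "finance", "sports", "sports"]

# Unigrams + bigrams; LinearSVC is a strong, fast default for sparse text features.
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
baseline.fit(train_texts, train_labels)

pred = baseline.predict(["quarterly profit rose sharply"])[0]
print(pred)
```

On real data this trains in seconds and gives a floor that the transformer must beat to justify its cost.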
### Semantic Search with Embeddings

```python
from nlp_starter_kit.core import EmbeddingIndex

index = EmbeddingIndex(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    vector_store="faiss",
)

# Index documents and search
index.add_documents(documents_dir="./knowledge_base/", file_types=["md", "txt", "pdf"])
results = index.search("How do I configure database connection pooling?", top_k=5)
for r in results:
    print(f"[{r['score']:.4f}] {r['source']}: {r['text'][:100]}...")
```
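Under the hood, the `numpy` vector-store backend is just brute-force cosine similarity. A self-contained sketch of that idea, using tiny hand-made vectors in place of real sentence-transformer embeddings:

```python
# Brute-force cosine-similarity search, the essence of the numpy backend.
import numpy as np

docs = ["configure connection pooling", "install the package", "tune chunk size"]
doc_vecs = np.array([[0.9, 0.1, 0.0],
                     [0.1, 0.9, 0.1],
                     [0.2, 0.1, 0.9]], dtype=np.float32)

def search(query_vec: np.ndarray, k: int = 2):
    # Normalize both sides so the dot product equals cosine similarity.
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = d @ q
    top = np.argsort(scores)[::-1][:k]
    return [(docs[i], float(scores[i])) for i in top]

results = search(np.array([1.0, 0.2, 0.1], dtype=np.float32))
print(results)
```

FAISS does the same computation with index structures that scale past what a dense matrix multiply can handle.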
### Fine-Tune with LoRA

```python
from nlp_starter_kit.core import FineTuner

tuner = FineTuner(base_model="distilbert-base-uncased", method="lora", lora_r=16, lora_alpha=32)
tuner.train(
    train_data="./data/domain_train.csv", val_data="./data/domain_val.csv",
    task="classification", num_labels=8, epochs=5, learning_rate=2e-5,
)
tuner.save_adapter("./models/domain_adapter/")  # ~10MB vs full model ~250MB

# Load and use
model = tuner.load_model("./models/domain_adapter/")
predictions = model.predict(["Sample domain text for classification."])
```
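Why the adapter is so small: LoRA freezes the base weight `W` and learns only a low-rank delta `(alpha/r) * B @ A`. A NumPy illustration of the math (not the kit's API), showing that zero-initializing `B` makes the adapted layer start out identical to the base layer:

```python
# The LoRA update: W_eff = W + (alpha/r) * B @ A, with only A and B trainable.
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 32                # hidden size, lora_r, lora_alpha

W = rng.normal(size=(d, d))           # frozen base weight
A = rng.normal(size=(r, d)) * 0.01    # trainable, small random init
B = np.zeros((d, r))                  # trainable, zero init

def adapted(x):
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.normal(size=(d,))
# With B = 0, the delta vanishes: the adapted layer equals the base layer.
assert np.allclose(adapted(x), x @ W.T)

# Trainable parameters: 2*d*r for the adapter vs d*d for full fine-tuning.
print(2 * d * r, "adapter params vs", d * d, "full params")
```

At realistic sizes (d in the hundreds or thousands, r = 8–64) the same ratio is what shrinks a ~250MB checkpoint to a ~10MB adapter.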
### RAG Pipeline

```python
from nlp_starter_kit.core import RAGPipeline

rag = RAGPipeline.from_config("config.example.yaml")
rag.ingest(documents_dir="./knowledge_base/", chunking_strategy="recursive", chunk_size=512)

response = rag.query("What are the best practices for database indexing?")
print(f"Answer: {response.answer}")
for source in response.sources:
    print(f"  - {source.title} (relevance: {source.score:.3f})")
```
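Structurally, a RAG query is retrieve, assemble a prompt, generate. A runnable sketch of that flow with no models: retrieval here is naive word overlap and the "generator" step just returns the assembled prompt, where the real pipeline would call the configured LLM.

```python
# Minimal retrieve-then-assemble loop; chunk texts below are illustrative.
chunks = [
    {"source": "db.md", "text": "Create indexes on columns used in WHERE clauses."},
    {"source": "db.md", "text": "Avoid over-indexing; each index slows writes."},
    {"source": "ops.md", "text": "Rotate logs daily to save disk space."},
]

def retrieve(question: str, top_k: int = 2):
    # Score each chunk by shared lowercase words with the question.
    q = set(question.lower().split())
    scored = [(len(q & set(c["text"].lower().split())), c) for c in chunks]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for score, c in scored[:top_k] if score > 0]

def answer(question: str) -> str:
    context = retrieve(question)
    prompt = "Answer using only this context:\n"
    prompt += "\n".join(f"[{c['source']}] {c['text']}" for c in context)
    prompt += f"\nQuestion: {question}"
    return prompt  # a real pipeline sends this to the generator model

print(answer("Which columns should have indexes?"))
```

Keeping `retrieve` as its own function is what makes the "evaluate retrieval separately" advice below practical: you can score it against labeled question/chunk pairs without ever invoking the generator.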
## Configuration Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| `embeddings.model` | str | `all-MiniLM-L6-v2` | Sentence-transformer model |
| `embeddings.vector_store` | str | `faiss` | Vector backend (`faiss`, `chromadb`, `numpy`) |
| `classification.approach` | str | `transformer` | `tfidf_svm`, `tfidf_lr`, `transformer` |
| `fine_tuning.method` | str | `lora` | `full`, `lora`, `prefix` |
| `rag.retriever.top_k` | int | `5` | Number of chunks to retrieve |
## Best Practices

- Start with TF-IDF baselines — Before fine-tuning a transformer, try TF-IDF + SVM. If it hits 90% of target accuracy, you may not need transformers.
- Use LoRA for fine-tuning — LoRA adapters are 100x smaller than full fine-tuning and perform comparably with 1,000+ samples.
- Chunk by semantic boundaries — Use the `recursive` strategy, which respects paragraph and sentence boundaries.
- Evaluate retrieval separately from generation — If the right context isn't retrieved, no generator can produce a good answer.
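To make the chunking advice concrete, here is a simplified sketch of what a "recursive" chunker does: split on paragraph breaks first, and fall back to sentence-level splitting only for paragraphs that exceed the size limit. (The kit's version works on tokens and adds `chunk_overlap`; this character-based toy shows the idea.)

```python
# Paragraph-first chunking with a sentence-level fallback for long paragraphs.
def recursive_chunk(text: str, max_chars: int = 80):
    pieces = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for piece in pieces:
        if len(piece) <= max_chars:
            chunks.append(piece)          # whole paragraph fits: keep it intact
        else:
            current = ""
            # Crude sentence split: break after ". " boundaries.
            for sent in piece.replace(". ", ".\n").splitlines():
                if len(current) + len(sent) + 1 > max_chars and current:
                    chunks.append(current.strip())
                    current = ""
                current += sent + " "
            if current.strip():
                chunks.append(current.strip())
    return chunks

doc = "Short intro paragraph.\n\n" + "This is sentence one. " * 3 + "And a final sentence."
for c in recursive_chunk(doc):
    print(len(c), c)
```

The point of the fallback order is that chunks never straddle a paragraph boundary unless a single paragraph is itself too large, which keeps retrieved chunks semantically coherent.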
## Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| Fine-tuning loss doesn't decrease | Learning rate too high or data formatting issue | Reduce LR to 1e-5, verify CSV has correct text and label columns |
| Embedding search returns irrelevant results | Chunk size too large or wrong embedding model | Reduce chunk_size to 256, try all-mpnet-base-v2 for better quality |
| CUDA OOM during fine-tuning | Model + optimizer states exceed VRAM | Reduce batch_size to 4-8, enable gradient checkpointing, use LoRA |
| RAG answers are hallucinated | Retrieved context is irrelevant | Increase top_k, improve chunking, add a reranker stage |
This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete NLP Starter Kit with all files, templates, and documentation for $39.

Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.