From University Project to Paid API: How I Built and Shipped a Real AI Product in 3 Weeks

Dominic Jesse Kwame Addo — Mon, 07 Jul 2025 19:10:55 +0000

As an Applied AI student, my days are usually filled with lectures, studying, and work. But recently, I took on a different kind of challenge. I set out to go from a simple idea to a live, monetized Toxicity Scoring API, and I did it.

This isn't just a project report. This is the real story of how a student with a limited budget and a single PC built a production-ready AI service using DistilBERT, FastAPI, Docker, and a whole lot of late nights. If you've ever wondered what it really takes to ship an end-to-end ML product, this post is for you. I'll walk you through the why, the how, the code that broke, and the lessons I learned along the way.

The Spark: Why This Project?

My interest in toxicity detection started during my Applied AI course. My group and I worked on a Transformer-based sentiment analysis model for toxic comments using the OLID and HASOC datasets. The technical challenge was fascinating, but I kept thinking about the practical side. Most existing moderation tools are too blunt—they either block everything or nothing.

I wanted to build something different. A tool that could understand nuance and provide detailed feedback about different types of toxicity. Something I could afford to run for my own small projects.

My goals were simple:

Nuanced detection: Go beyond simple "toxic/not-toxic" labels
Lightweight: Run on my Intel Arc A770 GPU with 16GB VRAM
Simple integration: One API call, clear JSON response
Real deployment: Actually ship it and make it available to other developers

To keep my professional student life separate from my projects, I decided to build this under my development brand, Kwame Aseda.

Building the API: From Notebook to Production

Architecture Overview

The system architecture is straightforward but effective:

System architecture showing the complete request flow from user input through RapidAPI to the containerized FastAPI application running on Hugging Face Spaces.

The main pieces are:

Model: DistilBERT fine-tuned for multi-label classification
API: FastAPI with Pydantic schemas for validation
Deployment: Docker container running on Hugging Face Spaces, accessible via RapidAPI

The Data Balancing Act

I used the Jigsaw Toxic Comment Classification dataset, which immediately presented a challenge. The dataset was hugely imbalanced. The 'threat' category had very few examples compared to 'toxic' or 'obscene'. A naive model would just learn to ignore the rare categories.

To fix this, I implemented sample-level weighting where each training example gets a weight based on the average inverse frequency of its labels:

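import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# train_df, label_cols, train_dataset, training_args, collate_fn and
# CustomTrainer are defined earlier in the training script.

# Compute sample-level weights based on label frequency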
label_array = np.stack(train_df[label_cols].values)  # [N, L]
pos_counts = label_array.sum(axis=0)               # positives per label
neg_counts = len(label_array) - pos_counts

# Compute inverse frequencies to upweight rare examples
inv_pos = 1.0 / (pos_counts + 1e-6)
inv_neg = 1.0 / (neg_counts + 1e-6)

# Build a weight for each sample by averaging each label's inverse frequency
sample_weights = [
    float(((lbls * inv_pos) + ((1 - lbls) * inv_neg)).mean())
    for lbls in label_array
]

# Create weighted sampler for balanced training
sampler = WeightedRandomSampler(
    weights=torch.DoubleTensor(sample_weights),
    num_samples=len(sample_weights),
    replacement=True
)

# Use custom trainer with the weighted sampler
class BalancedTrainer(CustomTrainer):
    def get_train_dataloader(self):
        return DataLoader(
            train_dataset,
            batch_size=training_args.per_device_train_batch_size,
            sampler=sampler,
            collate_fn=collate_fn,
        )

In a multi-label setting, this worked better for me than traditional per-class weights: the weight is attached to the whole sample, so comments carrying rare labels or rare label combinations are drawn more often during training.
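To make the weighting concrete, here is a minimal pure-Python sketch of the same formula on a tiny made-up label matrix. The data and the `sample_weights` helper name are illustrative assumptions, not part of the training script; the only goal is to show which rows end up with the largest weight.

```python
# Toy label matrix: 6 samples x 2 labels (hypothetical data, not from Jigsaw).
# Column 0 is a common label, column 1 is a rare one.
labels = [
    [1, 0],
    [1, 0],
    [1, 0],
    [0, 0],
    [0, 0],
    [1, 1],  # the only sample carrying the rare label
]


def sample_weights(label_rows, eps=1e-6):
    """Average, per sample, the inverse frequency of each label's observed value."""
    n = len(label_rows)
    n_labels = len(label_rows[0])
    pos = [sum(row[j] for row in label_rows) for j in range(n_labels)]
    neg = [n - p for p in pos]
    inv_pos = [1.0 / (p + eps) for p in pos]
    inv_neg = [1.0 / (q + eps) for q in neg]

    weights = []
    for row in label_rows:
        per_label = [
            row[j] * inv_pos[j] + (1 - row[j]) * inv_neg[j]
            for j in range(n_labels)
        ]
        weights.append(sum(per_label) / n_labels)
    return weights


weights = sample_weights(labels)

# Normalizing the weights gives the probability of drawing each sample.
total = sum(weights)
sampling_probs = [w / total for w in weights]

for row, w, p in zip(labels, weights, sampling_probs):
    print(row, round(w, 3), round(p, 3))
```

With these toy numbers, the single row carrying the rare label gets a weight of about 0.625, against roughly 0.225 for rows with only the common label, so a weighted sampler would draw it noticeably more often than uniform sampling would.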

Model Training and the Intel Arc Experience

Training on my Intel Arc A770 was an adventure. I used Intel Extension for PyTorch (IPEX) to optimize for the XPU backend, which gave me decent performance:

Training time: ~67 seconds per 200 steps
Throughput: ~356 samples per second
Memory usage: Comfortably fit within 16GB VRAM

Here's how the training progressed over 6000 steps:

Step  | Train Loss | Val Loss | Accuracy | F1 Macro | F1 Weighted | Samples/sec
------|------------|----------|----------|----------|-------------|------------
200   | 0.6751     | 0.5951   | 0.7200   | 0.6488   | 0.7509      | 345.32
400   | 0.4442     | 0.4193   | 0.7967   | 0.7333   | 0.8172      | 349.19
800   | 0.3202     | 0.3045   | 0.8520   | 0.7963   | 0.8650      | 350.69
1400  | 0.2350     | 0.2619   | 0.8981   | 0.8487   | 0.9045      | 338.42
2000  | 0.2168     | 0.2394   | 0.8940   | 0.8446   | 0.9012      | 337.88
2800  | 0.1997     | 0.2738   | 0.9159   | 0.8669   | 0.9190      | 338.74
4000  | 0.1686     | 0.2774   | 0.9206   | 0.8760   | 0.9240      | 352.30
6000  | 0.1376     | 0.2695   | 0.9193   | 0.8757   | 0.9232      | 362.22

The model achieved strong final performance:

F1 Macro: 0.877 (good balance across all classes)
F1 Weighted: 0.923 (excellent overall performance)
ROC AUC: 0.983 (great discrimination ability)

You can see the steady improvement, especially in the first 1400 steps where F1 Macro jumped from 0.65 to 0.85. The Intel Arc handled the workload well, maintaining consistent throughput throughout training.
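If the difference between the two F1 numbers is unclear, the sketch below computes both from per-label counts. The counts are invented for illustration (they are not my evaluation results); they only show why macro F1 sits below weighted F1 when a rare label performs worse than a common one.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Hypothetical (tp, fp, fn) counts: one common label, one rare label.
per_label = {
    "toxic": (900, 50, 50),
    "threat": (10, 10, 10),
}

scores = {name: f1(*counts) for name, counts in per_label.items()}

# Support = number of true instances of the label (tp + fn).
support = {name: tp + fn for name, (tp, fp, fn) in per_label.items()}

# Macro: every label counts equally, so a weak rare label drags it down.
macro_f1 = sum(scores.values()) / len(scores)

# Weighted: each label counts in proportion to its support.
weighted_f1 = sum(scores[n] * support[n] for n in scores) / sum(support.values())

print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f}")
```

That is why I track macro F1 as the honest signal for the minority categories, and weighted F1 as the overall picture.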

Production API Design

FastAPI made building the API straightforward. Here's the core structure:

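from typing import Dict

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

# Pydantic models for request/response validation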
class TextInput(BaseModel):
    text: str = Field(..., example="Your comment here.")

class PredictionSummary(BaseModel):
    request_id: str
    overall_score: float
    classification: str
    confidence: float
    category_scores: Dict[str, float]

# FastAPI app with prediction endpoint
# (lifespan and predictor are set up elsewhere in the module)
app = FastAPI(title="Toxicity Detection API", version="1.0.0", lifespan=lifespan)

@app.post("/predict", response_model=PredictionSummary, tags=["Inference"])
async def predict(input_data: TextInput) -> PredictionSummary:
    if not input_data.text.strip():
        raise HTTPException(status_code=400, detail="Empty input")
    return predictor.predict_summary(input_data.text)

The API returns detailed toxicity scores for six categories: toxic, severe_toxic, obscene, threat, insult, and identity_hate. Each comes with a confidence score and a binary prediction based on optimized thresholds.
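As a rough illustration of how per-category scores and per-label thresholds can be combined into one summary, here is a minimal sketch. The `summarize` function, the "non_toxic" label, the example thresholds, and the choice of the maximum score as the overall score are all assumptions made for this example; this is not the actual `predict_summary` implementation.

```python
from typing import Dict


def summarize(
    category_scores: Dict[str, float],
    thresholds: Dict[str, float],
    default_threshold: float = 0.5,
) -> dict:
    """Turn raw scores into per-label flags and one overall verdict."""
    # A label is flagged when its score reaches that label's own threshold.
    flags = {
        label: score >= thresholds.get(label, default_threshold)
        for label, score in category_scores.items()
    }
    return {
        # Assumption: the overall score is the highest category score.
        "overall_score": max(category_scores.values()),
        # Assumption: any flagged label makes the whole text "toxic".
        "classification": "toxic" if any(flags.values()) else "non_toxic",
        "flags": flags,
    }


# Illustrative scores and hypothetical thresholds.
example_scores = {
    "toxic": 0.2127,
    "severe_toxic": 0.0026,
    "obscene": 0.0056,
    "threat": 0.0389,
    "insult": 0.0317,
    "identity_hate": 0.0243,
}
example_thresholds = {"toxic": 0.20, "threat": 0.10}

result = summarize(example_scores, example_thresholds)
print(result["classification"], result["overall_score"])
```

The point of the sketch is that a score well below 0.5 can still produce a "toxic" verdict once each label has its own threshold.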

Making It Work: Threshold Tuning

One of my biggest mistakes was starting with a single global threshold (0.5) for all categories. This completely missed threats and severely under-detected other categories.

The solution was per-label threshold optimization using precision-recall curves: for each label, I picked the threshold that maximizes the F1 score on the validation set.

import json

import numpy as np
import torch
from sklearn.metrics import precision_recall_curve

# Tune Optimal Thresholds
print("\n=== Tuning Optimal Thresholds ===")
# Gather probs & labels over the entire val set
val_loader = trainer.get_eval_dataloader()
all_probs, all_labels = [], []
model.eval()

with torch.no_grad():
    for batch in val_loader:
        # Move labels to CPU before numpy()
        labels = batch["labels"].cpu().numpy()
        # Send inputs to device
        inputs = {k: v.to(model.device) for k, v in batch.items() if k != "labels"}
        logits = model(**inputs).logits
        probs = torch.sigmoid(logits).cpu().numpy()

        all_probs.append(probs)
        all_labels.append(labels)

all_probs = np.vstack(all_probs)
all_labels = np.vstack(all_labels)

# Find the threshold per label that maximizes F1
optimal_thresholds = {}
for i, lbl in enumerate(label_cols):
    precision, recall, thresholds = precision_recall_curve(
        all_labels[:, i], all_probs[:, i]
    )
    # If there are no thresholds (e.g. label never appears), default to 0.5
    if thresholds.size == 0:
        best_thr = 0.5
    else:
        # thresholds has length len(precision)-1
        f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-8)
        best_idx = int(np.nanargmax(f1_scores))
        best_thr = float(thresholds[best_idx])

    optimal_thresholds[lbl] = best_thr

print("\nOptimal thresholds per label:")
for lbl, t in optimal_thresholds.items():
    print(f"  {lbl:15}: {t:.3f}")

# Save for use in evaluation and API
with open("thresholds.json", "w") as f:
    json.dump(optimal_thresholds, f, indent=2)
print("\nSaved thresholds.json")

This boosted performance significantly, especially for the minority classes.
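To see why a fixed 0.5 cutoff can miss a rare label entirely, here is a small pure-Python sketch of the same idea on made-up data. It swaps scikit-learn's `precision_recall_curve` for a brute-force sweep over the observed scores; the labels, probabilities, and helper names are assumptions for illustration only.

```python
def f1_at(labels, probs, threshold):
    """F1 when every score >= threshold is predicted positive."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < threshold)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def best_threshold(labels, probs):
    """Try each observed score as a cutoff and keep the one with the best F1."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at(labels, probs, t))


# Hypothetical rare label: only 2 positives, and the model is never very sure.
toy_labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
toy_probs = [0.02, 0.05, 0.03, 0.10, 0.04, 0.01, 0.08, 0.30, 0.06, 0.22]

tuned = best_threshold(toy_labels, toy_probs)
print("F1 at 0.5:  ", f1_at(toy_labels, toy_probs, 0.5))
print("tuned thr:  ", tuned)
print("F1 at tuned:", f1_at(toy_labels, toy_probs, tuned))
```

On this toy data, nothing scores above 0.5, so the global threshold yields an F1 of zero, while the tuned cutoff separates the two positives cleanly. Real validation data is noisier, but the mechanism is the same.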

Deployment: Docker and Hugging Face Spaces

Getting the model into production required containerization. Here's my Dockerfile:

# Use a specific Python runtime version for consistency
FROM python:3.10.12-slim-bullseye

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV MODEL_DIR=/app/enhanced_toxic_comment_model

# Set the working directory
WORKDIR /app

# Install essential build tools and clean up to reduce image size
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    git \
    curl && \
    rm -rf /var/lib/apt/lists/*

# Copy only the requirements file initially to leverage Docker cache
COPY requirements.txt .

# Upgrade pip and install dependencies
RUN pip install --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# Copy the application code and model directory
COPY core_api_refactored.py .
COPY enhanced_toxic_comment_model ./enhanced_toxic_comment_model

# Run the application using Uvicorn
CMD ["uvicorn", "core_api_refactored:app", "--host", "0.0.0.0", "--port", "7860"]

The deployment stack:

Base: Python 3.10 slim image
Dependencies: FastAPI 0.95+, PyTorch 2.1+, Transformers 4.40+
Runtime: Uvicorn ASGI server on port 7860 (the port set in the Dockerfile)
Hosting: Hugging Face Spaces (free tier!)

The "Oh No" Moments: Real Bugs and Fixes

Let me be honest about what went wrong. These weren't small hiccups—they were hours-long debugging sessions.

The Pydantic Type Disaster

My .env file defined TOXICITY_THRESHOLD=thresholds.json, but Pydantic expected a float. The validation error was cryptic and took me two hours to track down.

Fix: Split the configuration into two settings: toxicity_threshold: float (default 0.5) for the fallback value, and thresholds_path: Path for the JSON file.
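The shape of that fix can be sketched with the standard library alone. This is a stand-in using a dataclass rather than my actual Pydantic settings class, and the `THRESHOLDS_PATH` variable name and `load_settings` helper are assumptions for the example.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    toxicity_threshold: float  # numeric fallback threshold
    thresholds_path: Path      # location of the per-label thresholds file


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build settings from an environment-like mapping, one type per field."""
    raw_threshold = env.get("TOXICITY_THRESHOLD", "0.5")
    try:
        threshold = float(raw_threshold)
    except ValueError:
        # This is the failure mode from the bug: a file name in a float field.
        raise ValueError(
            f"TOXICITY_THRESHOLD must be a number, got {raw_threshold!r}"
        ) from None
    # THRESHOLDS_PATH is a hypothetical variable name for this sketch.
    path = Path(env.get("THRESHOLDS_PATH", "thresholds.json"))
    return Settings(toxicity_threshold=threshold, thresholds_path=path)


settings = load_settings({"TOXICITY_THRESHOLD": "0.5"})
print(settings)
```

Passing `{"TOXICITY_THRESHOLD": "thresholds.json"}` to `load_settings` raises a `ValueError` that names the offending variable, which is the kind of message I wish I had seen on day one.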

The Missing Dockerfile Mystery

I spent 30 minutes getting "no such file or directory" errors when running docker build. The issue? I was running the command from the wrong directory.

Fix: Always cd into the folder containing your Dockerfile, or point to it explicitly with docker build -f path/to/Dockerfile . (the final dot is the build context).

The Git LFS Upload Hang

My model file (~270 MB) would stall during upload to Hugging Face Spaces. "Uploading LFS objects..." would hang forever.

Fix: I ran git lfs install and huggingface-cli lfs-enable-largefiles . inside the repository, and learned to be patient with large file uploads.

The RapidAPI Studio NetworkError

The API worked perfectly in RapidAPI's playground but threw NetworkError in the Studio requests tab.

Fix: In the Hub Listing → Gateway, I switched Access Control to "RapidAPI Auth" and Runtime to "Rapid Runtime", then used the Gateway URL in requests.

The Intel Arc Float32 vs BFloat16 Chaos

IPEX defaulted to BFloat16 tensors, but my downstream code expected Float32. This caused random type errors that were hard to debug.

Fix: Explicitly cast the model and all tensors to torch.float32 before and after IPEX optimization.

Key lesson: Always test your API locally with the exact Docker command before you git push. It saves hours of debugging in production.

API Usage: How Others Can Use It

The API is now live and ready for integration. Here's how to use it:

# Example API request
curl -X POST "https://toxicity-score-api.p.rapidapi.com/predict" \
  -H "Content-Type: application/json" \
  -H "X-RapidAPI-Key: YOUR_API_KEY" \
  -H "X-RapidAPI-Host: toxicity-score-api.p.rapidapi.com" \
  -d '{
    "text": "I really hate you.."
  }'

Response

{
  "request_id": "7d542552-5707-404b-a64e-7ce0babbd1db",
  "overall_score": 0.2127,
  "classification": "toxic",
  "confidence": 0.2127,
  "category_scores": {
    "toxic": 0.2127,
    "severe_toxic": 0.0026,
    "obscene": 0.0056,
    "threat": 0.0389,
    "insult": 0.0317,
    "identity_hate": 0.0243
  }
}
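For Python users, the same request can be sketched with the standard library. The URL and headers are taken from the curl example above; the `build_request` and `score_text` helper names are my own for this sketch, and error handling is left out.

```python
import json
import urllib.request

API_URL = "https://toxicity-score-api.p.rapidapi.com/predict"
API_HOST = "toxicity-score-api.p.rapidapi.com"


def build_request(text: str, api_key: str) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-RapidAPI-Key": api_key,
        "X-RapidAPI-Host": API_HOST,
    }
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


def score_text(text: str, api_key: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response."""
    request = build_request(text, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `score_text("I really hate you..", "YOUR_API_KEY")` with a valid key should return a dictionary with the same fields as the JSON response above.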

Current pricing tiers:

Basic: Free tier for testing
Pro: $7/month (recommended for small projects)
Ultra: $15/month
Mega: $25/month

Lessons Learned: What I'd Tell My Past Self

Three weeks ago, I thought deployment was the easy part. I was wrong. Here's what I learned:

Always sanity-check your hard examples. My initial model couldn't detect obvious threats because I didn't manually review edge cases during training.

Environment parity is everything. Test locally with the exact same setup you'll use in production. Docker helps, but it's not magic.

Document your pipeline end-to-end. When bugs happen at 2 AM, you'll thank yourself for writing down every step.

Imbalanced data is harder than you think. Class weights and sampling strategies matter more than model architecture tweaks.

The biggest surprise? Going from a working notebook to a production API took longer than training the model. But shipping something real that other developers can use? That feeling is worth every debugging session.

Join the Journey

This project was a huge learning experience, juggling my studies at LTU with a real-world product launch. If you're a student or developer with questions, or if you use the API and have feedback, please reach out!

Try it now:

GitHub: github.com/Kwaseda/toxicity-score-api
RapidAPI: rapidapi.com/Kwaseda/api/toxicity-score-api
Hugging Face: huggingface.co/spaces/Kwaseda/toxicity-score-api

Connect with me:

Dev.to: @KwameAseda
GitHub: @Kwaseda

If this post helped you or if you're working on something similar, I'd love to hear about it. Building in public is scary, but the community support makes it worth it.

Star ⭐ the repo if you found this useful, and let me know what you build next!

DEV Community: Dominic Jesse Kwame Addo