Serhii Korol

Posted on Jul 30 • Edited on Aug 6

Train Your Own Model with ML.NET: A Step-by-Step Guide to Personalized AI

#dotnet #machinelearning #csharp

Hi friends!

Today we’re diving into something quietly revolutionary: how AI, when personalized, can drastically improve our digital experiences. As developers, many of us use AI tools daily. Often, they feel like magic — you ask a question, and AI responds with context-aware suggestions. Behind the scenes, this magic is powered by Large Language Models (LLMs).

These models are trained on massive datasets and can learn from feedback. But there's a catch: what’s relevant for one person might not be for another. Take email filtering, for example. One user’s spam is another’s important newsletter. General spam filters can’t always adapt to individual preferences.

Wouldn’t it be great to train your own model — one that evolves with your feedback, works offline, and respects your privacy? Thanks to ML.NET, Microsoft’s machine learning library for .NET, you can. In this article, we'll build a personalized spam detector using C# and ML.NET that learns from your emails and improves over time.

Step 1: Create a New Project

Create a simple Console Application:

dotnet new console -n SpamDetector

Step 2: Add Required Packages

Add the following NuGet packages to your project:

Microsoft.ML
Microsoft.ML.FastTree

dotnet add package Microsoft.ML
dotnet add package Microsoft.ML.FastTree

Step 3: Prepare the Dataset

Add a TSV (Tab-Separated Values) file for training data. TSV is preferred over CSV because commas often appear in email bodies and would be mistaken for column separators, while tabs rarely do. The full dataset can be found in the source code.

Example: email_dataset.tsv

Sender	Subject	Body	IsSpam
reports@company.com	Monthly Report	Attached is the report for the current month	False
meetings@calendar.com	Meeting Tomorrow	Don't forget about the meeting tomorrow at 10:00	False
hr@company.com	Documents	Sending the necessary documents	False
pm@projecthub.com	Project Ready	The project is completed and ready for review	False
vacations@company.com	Vacation	Submitting a vacation request	False
win@lottery-prize.com	Win a Million!	You won a million dollars! Click here!	True
loans@fastcash-now.com	Online Loan	Get a loan without documents in 5 minutes	True
deals@superdiscounts.com	90% Discount	Incredible 90% discount on all products! Hurry!	True
help@urgent-finance.org	Urgent Help	Urgent financial help without refusals	True
homejobs@easyprofit.net	Earn at Home	Make $5000 at home without leaving the house	True
promo@freeiphones.com	Free iPhone	Get a free iPhone right now	True
info@national-lottery.ua	Lottery	Congratulations! You won $1,000,000 in the lottery	True
cash@quickmoney.co	Quick Money	Quick money without checks and certificates	True
jobs@dream-career.net	Dream Job	Dream job with a salary of $100,000	True
ads@miraclepills.org	Miracle Pills	Lose 20 kg in a week with our pills	True
admin@company.com	Meeting	Tomorrow at 14:00 there is a meeting in the conference room	False
support@company.com	Tech Support	Your request to tech support has been processed	False
sales@onlinestore.com	Order	Your order #12345 is ready for pickup	False
schedule@university.edu	Schedule	New class schedule for next week	False
notices@subscription.com	Subscription	Your subscription expires in 3 days	False
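To see why tabs are the safer delimiter here, the following minimal sketch splits a single row on tabs. `TsvRow` and `ParseLine` are hypothetical names used only for illustration; the article itself relies on ML.NET's loader, shown later.

```csharp
using System;

public static class TsvRow
{
    // Splits one TSV line into (Sender, Subject, Body, IsSpam).
    // Assumes exactly four tab-separated fields, as in email_dataset.tsv.
    public static (string Sender, string Subject, string Body, bool IsSpam) ParseLine(string line)
    {
        string[] parts = line.Split('\t');
        if (parts.Length != 4)
            throw new FormatException($"Expected 4 fields, got {parts.Length}.");

        // bool.Parse accepts "True" and "False", ignoring case.
        return (parts[0], parts[1], parts[2], bool.Parse(parts[3]));
    }
}
```

Calling `TsvRow.ParseLine` on the lottery row keeps the body "Congratulations! You won $1,000,000 in the lottery" in one piece, whereas a naive split on commas would cut it into three fragments.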

Add to Project File

Optionally, add the TSV file to the .csproj file so that it is copied to the output directory on build.

    <ItemGroup>
        <None Update="email_dataset.tsv">
            <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
        </None>
    </ItemGroup>

Step 4: Define Data Models

Once you have added the dataset, create a class whose properties match its columns. The LoadColumn attribute on each property gives the zero-based index of the column it is read from.

Input Data Model

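using Microsoft.ML.Data;

public class EmailData
{
    // LoadColumn takes the zero-based index of the column in the TSV file.
    [LoadColumn(0)] public string Sender { get; set; } = string.Empty;

    [LoadColumn(1)] public string Subject { get; set; } = string.Empty;

    [LoadColumn(2)] public string Body { get; set; } = string.Empty;

    // The label the model learns to predict.
    [LoadColumn(3)] public bool IsSpam { get; set; }
}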

Another model is needed to show prediction information.

Prediction Output Model

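using Microsoft.ML.Data;

public class SpamPrediction
{
    // The predicted class: true means spam.
    [ColumnName("PredictedLabel")] public bool IsSpam { get; set; }

    // Calibrated probability that the email is spam (0 to 1).
    [ColumnName("Probability")] public float Probability { get; set; }

    // Raw, uncalibrated output of the trainer.
    [ColumnName("Score")] public float Score { get; set; }
}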

Step 5: Define File Paths

Add the paths for the input dataset, the trained model, and the user's feedback file.

static class Program
{
    private const string DataPath = "email_dataset.tsv";
    private const string ModelPath = "spam_model.zip";
    private const string FeedbackPath = "feedback.tsv";

    static void Main() {}
}

Step 6: Create ML Context

This is the entry point for using ML.NET — similar to a DbContext in EF Core.

    static void Main()
    {
        Console.WriteLine("=== System for checking emails for spam ===\n");

        var mlContext = new MLContext();
    }

Step 7: Build the Pipeline

The pipeline transforms the text in each column into numerical features. These features are then combined into a single feature vector. At the end of the pipeline, we specify the FastTree algorithm as our trainer. FastTree uses this feature vector along with the target label column, IsSpam, which is not included in the feature vector itself.

While there are other available algorithms, FastTree is a well-suited choice for this task. It is a gradient boosting machine (GBM) algorithm designed for binary classification, regression, and ranking problems. In this context, it is used for binary classification — specifically, to distinguish between spam and non-spam emails.

FastTree is optimized for speed and performs well on tabular data. However, it has an important limitation: it generally needs a reasonably large dataset, roughly 1,000 samples or more, to achieve acceptable accuracy. With smaller datasets, such as the 20-row sample shown above, expect poor or unstable results.

    static void Main()
    {
        Console.WriteLine("=== System for checking emails for spam ===\n");

        var mlContext = new MLContext();

        var pipeline = BuildPipeline(mlContext);
    }

    private static IEstimator<ITransformer> BuildPipeline(MLContext mlContext)
    {
        return mlContext.Transforms.Text
            .FeaturizeText("SenderFeatures", nameof(EmailData.Sender))
            .Append(mlContext.Transforms.Text.FeaturizeText("SubjectFeatures", nameof(EmailData.Subject)))
            .Append(mlContext.Transforms.Text.FeaturizeText("BodyFeatures", nameof(EmailData.Body)))
            .Append(mlContext.Transforms.Concatenate("Features", "SenderFeatures", "SubjectFeatures", "BodyFeatures"))
            .Append(mlContext.Transforms.NormalizeLpNorm("Features"))
            .Append(mlContext.BinaryClassification.Trainers.FastTree(
                labelColumnName: nameof(EmailData.IsSpam), featureColumnName: "Features"));
    }
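If you experiment with a very small dataset, one option is to shrink the trees so that each leaf needs fewer examples. The sketch below only shows where FastTree's hyperparameters go. The class name `SmallDataTrainer` is hypothetical, and the specific values (4 leaves, 20 trees, 2 examples per leaf) are illustrative assumptions, not tuned recommendations.

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers.FastTree;

public static class SmallDataTrainer
{
    // Builds a FastTree binary trainer with smaller trees.
    // All numeric values below are assumptions chosen for illustration.
    public static FastTreeBinaryTrainer Create(MLContext mlContext)
    {
        return mlContext.BinaryClassification.Trainers.FastTree(
            labelColumnName: "IsSpam",
            featureColumnName: "Features",
            numberOfLeaves: 4,               // fewer leaves per tree
            numberOfTrees: 20,               // fewer boosting rounds
            minimumExampleCountPerLeaf: 2,   // allow splits on very few rows
            learningRate: 0.2);
    }
}
```

The returned trainer can replace the final `FastTree(...)` call in the pipeline. Smaller trees do not make up for missing data, so treat the resulting metrics with caution.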

Step 8: Train the Model

Step 8.1: Load or Train

Check for a saved model. If one exists, load it; otherwise, load the dataset and train a new model. This avoids retraining unless necessary.

if (File.Exists(ModelPath))
...

Step 8.2: Load Datasets

If no trained model is available, we need to load the text data into a DataView using the correct separator. Failing to do so will result in an error during training.

In addition to the main email dataset, we also load the user's feedback dataset. This is important for scenarios where the trained model has been deleted and needs to be retrained — ensuring that any user-provided corrections are preserved and included in the new model.

var allData = LoadAllData(mlContext);
...

Step 8.3: Split the Data

The test set ratio determines how much of the data is reserved for evaluation.

In general, a smaller test fraction leaves more data for training, which tends to produce a better model. However, a test set is essential for measuring the model’s performance objectively. Without it, you cannot tell whether the model has simply "memorized" the training data, which gives a false impression of accuracy.

The test set helps validate how well the model generalizes to unseen data and is used to compute key performance metrics. For small datasets, it's acceptable to allocate less than 20% to testing. Otherwise, a test split of 20–30% is typically recommended.

var split = mlContext.Data.TrainTestSplit(allDataView, testFraction: 0.2);
...
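The split can be made reproducible by passing a seed. The sketch below, which uses a hypothetical `Row` class and `SplitDemo` helper on synthetic data, counts how many rows land in each set. ML.NET assigns rows to the test set at random, so the test size is only approximately `testFraction` times the total.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Hypothetical one-column row type, used only for this demonstration.
public class Row
{
    public float Value { get; set; }
}

public static class SplitDemo
{
    // Returns the number of rows in the train and test sets of a seeded split.
    public static (int Train, int Test) CountSplit(int totalRows, double testFraction, int seed)
    {
        var mlContext = new MLContext(seed: seed);

        List<Row> rows = Enumerable.Range(0, totalRows)
            .Select(i => new Row { Value = i })
            .ToList();
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // The same seed yields the same assignment of rows on every run.
        var split = mlContext.Data.TrainTestSplit(data, testFraction: testFraction, seed: seed);

        int train = mlContext.Data.CreateEnumerable<Row>(split.TrainSet, reuseRowObject: false).Count();
        int test = mlContext.Data.CreateEnumerable<Row>(split.TestSet, reuseRowObject: false).Count();
        return (train, test);
    }
}
```

For example, `SplitDemo.CountSplit(20, 0.2, 1)` returns two counts that always add up to 20, but the test set is not guaranteed to contain exactly 4 rows.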

Step 8.4: Train and Evaluate

After data splitting, we use the training set for training the model and the test set for evaluation.

Console.WriteLine("Training model...");
model = pipeline.Fit(split.TrainSet);

Console.WriteLine("Evaluating model...");
var predictions = model.Transform(split.TestSet);
...

Step 8.5: Generate Metrics

After training, we evaluate the model's performance using several key metrics based on its predictions. These include Accuracy, AUC, and F1 Score:

Accuracy measures the percentage of correct predictions made by the model.
AUC (Area Under the ROC Curve) indicates how well the model distinguishes between spam and non-spam emails. A value of 1.0 represents perfect classification.
F1 Score is the harmonic mean of precision and recall, providing a balanced measure of the model’s ability to correctly identify spam while minimizing false positives and false negatives.

var metrics = mlContext.BinaryClassification.Evaluate(predictions, labelColumnName: nameof(EmailData.IsSpam));
...
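To make the definitions of Accuracy and F1 Score concrete, here is a small sketch that computes them from raw counts. `MetricsSketch` is a hypothetical helper, not part of ML.NET, and returning 0 when a denominator is zero is a convention assumed for this sketch.

```csharp
public static class MetricsSketch
{
    // tp: spam flagged as spam, fp: legitimate mail flagged as spam,
    // tn: legitimate mail passed through, fn: spam that was missed.
    public static (double Accuracy, double Precision, double Recall, double F1) Compute(
        int tp, int fp, int tn, int fn)
    {
        int total = tp + fp + tn + fn;
        double accuracy = total == 0 ? 0 : (double)(tp + tn) / total;
        double precision = tp + fp == 0 ? 0 : (double)tp / (tp + fp);
        double recall = tp + fn == 0 ? 0 : (double)tp / (tp + fn);

        // Harmonic mean of precision and recall.
        double f1 = precision + recall == 0
            ? 0
            : 2 * precision * recall / (precision + recall);

        return (accuracy, precision, recall, f1);
    }
}
```

With made-up counts of 8 true positives, 2 false positives, 9 true negatives, and 1 false negative, accuracy is 17/20 = 0.85, precision is 8/10 = 0.8, recall is 8/9 ≈ 0.889, and F1 is 16/19 ≈ 0.842.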

Step 8.6: Save the Model

Once you have trained and evaluated the model, save it for later use.

mlContext.Model.Save(model, allDataView.Schema, ModelPath);
...

Optionally, you can copy the model to the project directory. If you did everything properly, you'll see a zip archive in your project.

CopyFileToProjectDirectory(ModelPath);
...

The complete code:

    static void Main()
    {
        Console.WriteLine("=== System for checking emails for spam ===\n");

        var mlContext = new MLContext();

        var pipeline = BuildPipeline(mlContext);

        ITransformer model = LoadOrTrainModel(mlContext, pipeline);
    }

    private static ITransformer LoadOrTrainModel(MLContext mlContext, IEstimator<ITransformer> pipeline)
    {
        if (File.Exists(ModelPath))
        {
            Console.WriteLine("Loading saved model...");
            return mlContext.Model.Load(ModelPath, out _);
        }

        Console.WriteLine("The model is not found. Training the new model...");
        var allData = LoadAllData(mlContext);
        return TrainEvaluateSaveModel(mlContext, pipeline, allData, saveFeedback: false);
    }

    private static List<EmailData> LoadAllData(MLContext mlContext)
    {
        IDataView originalData = mlContext.Data.LoadFromTextFile<EmailData>(
            DataPath, separatorChar: '\t', hasHeader: true);

        var allExamples = mlContext.Data
            .CreateEnumerable<EmailData>(originalData, reuseRowObject: false)
            .ToList();

        if (File.Exists(FeedbackPath))
        {
            Console.WriteLine("Found feedback data. Including it in training...");
            IDataView feedbackData = mlContext.Data.LoadFromTextFile<EmailData>(
                FeedbackPath, separatorChar: '\t', hasHeader: false);

            var feedbackList = mlContext.Data
                .CreateEnumerable<EmailData>(feedbackData, reuseRowObject: false)
                .ToList();

            allExamples.AddRange(feedbackList);
        }
        return allExamples;
    }

    private static ITransformer TrainEvaluateSaveModel(
        MLContext mlContext,
        IEstimator<ITransformer> pipeline,
        List<EmailData> allData,
        bool saveFeedback)
    {
        var allDataView = mlContext.Data.LoadFromEnumerable(allData);
        var split = mlContext.Data.TrainTestSplit(allDataView, testFraction: 0.2);

        Console.WriteLine("Training model...");
        var model = pipeline.Fit(split.TrainSet);

        Console.WriteLine("Evaluating model...");
        var predictions = model.Transform(split.TestSet);
        var metrics = mlContext.BinaryClassification.Evaluate(predictions, labelColumnName: nameof(EmailData.IsSpam));

        Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
        Console.WriteLine($"AUC: {metrics.AreaUnderRocCurve:P2}");
        Console.WriteLine($"F1 Score: {metrics.F1Score:P2}\n");

        mlContext.Model.Save(model, allDataView.Schema, ModelPath);
        CopyFileToProjectDirectory(ModelPath);
        if (saveFeedback)
            CopyFileToProjectDirectory(FeedbackPath);

        Console.WriteLine($"The model saved to {ModelPath}\n");
        return model;
    }

    private static void CopyFileToProjectDirectory(string fileName)
    {
        // Assumes the app runs from bin/<Configuration>/<TargetFramework>,
        // so the project directory is three levels up.
        string currentDir = Directory.GetCurrentDirectory();
        string projectDir = Path.GetFullPath(Path.Combine(currentDir, "..", "..", ".."));
        string sourcePath = Path.Combine(currentDir, fileName);
        string destPath = Path.Combine(projectDir, fileName);
        File.Copy(sourcePath, destPath, overwrite: true);
    }

Step 9: Make Predictions

Create a PredictionEngine to test individual emails.

    static void Main()
    {
        Console.WriteLine("=== System for checking emails for spam ===\n");

        var mlContext = new MLContext();

        var pipeline = BuildPipeline(mlContext);

        ITransformer model = LoadOrTrainModel(mlContext, pipeline);
        var predictionEngine = mlContext.Model.CreatePredictionEngine<EmailData, SpamPrediction>(model);
    }
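Once the engine exists, a single call to `Predict` classifies one email. The sketch below wraps that call in a hypothetical `PredictionSketch` helper. The two data classes repeat the ones from Step 4 so that the snippet compiles on its own; omit them if they are already in your project.

```csharp
using System.Globalization;
using Microsoft.ML;
using Microsoft.ML.Data;

// Same shape as the Step 4 classes, repeated for self-containment.
public class EmailData
{
    [LoadColumn(0)] public string Sender { get; set; } = string.Empty;
    [LoadColumn(1)] public string Subject { get; set; } = string.Empty;
    [LoadColumn(2)] public string Body { get; set; } = string.Empty;
    [LoadColumn(3)] public bool IsSpam { get; set; }
}

public class SpamPrediction
{
    [ColumnName("PredictedLabel")] public bool IsSpam { get; set; }
    [ColumnName("Probability")] public float Probability { get; set; }
    [ColumnName("Score")] public float Score { get; set; }
}

public static class PredictionSketch
{
    // Runs one email through the engine and returns a readable verdict.
    public static string Check(
        PredictionEngine<EmailData, SpamPrediction> engine,
        string sender, string subject, string body)
    {
        var email = new EmailData { Sender = sender, Subject = subject, Body = body };
        SpamPrediction prediction = engine.Predict(email);
        return Describe(prediction);
    }

    // Formats the verdict with the spam probability as a percentage.
    public static string Describe(SpamPrediction prediction)
    {
        string verdict = prediction.IsSpam ? "SPAM" : "NOT SPAM";
        string percent = (prediction.Probability * 100)
            .ToString("F1", CultureInfo.InvariantCulture);
        return $"{verdict} (spam probability: {percent}%)";
    }
}
```

Keep in mind that a PredictionEngine is not thread-safe, so this pattern suits a single-threaded console app like this one.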

Step 10: Interactive Input & Feedback Loop

The interactive loop lets you enter an email's sender, subject, and body and get a SPAM/NOT SPAM prediction.

    static void Main()
    {
        Console.WriteLine("=== System for checking emails for spam ===\n");

        var mlContext = new MLContext();

        var pipeline = BuildPipeline(mlContext);

        ITransformer model = LoadOrTrainModel(mlContext, pipeline);
        var predictionEngine = mlContext.Model.CreatePredictionEngine<EmailData, SpamPrediction>(model);

        RunInteractiveCheck(mlContext, pipeline, ref model, ref predictionEngine);

        Console.WriteLine("The app completed successfully. Goodbye!");
    }

Step 10.1: Correction

From time to time, you may encounter cases where you disagree with the model’s prediction. For example, an important email might be incorrectly marked as spam, and you manually reclassify it as not spam.

We’ve implemented a similar mechanism: users can correct incorrect predictions, and the system will use this feedback to retrain the model and save the updated version. This helps improve accuracy over time by learning from real-world corrections.

var feedback = PromptInput("Do you agree with the result? (y/n): ", toLower: true);
if (feedback == "n")
...

    private static string? PromptInput(string message, bool toLower = false)
    {
        Console.Write(message);
        var input = Console.ReadLine();
        if (input?.ToLower() == "q") return null;
        return toLower ? input?.ToLower() : input;
    }
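The body of the interactive loop is elided above, so here is a rough sketch of the control flow only. It is not the author's implementation: `FeedbackLoopSketch` and its three delegates are hypothetical stand-ins, where `prompt` plays the role of PromptInput (returning null to quit), `predict` stands for a call to the prediction engine, and `onCorrection` stands for saving the feedback and retraining.

```csharp
using System;

public static class FeedbackLoopSketch
{
    // Runs the prompt/predict/correct cycle until the user quits.
    // Returns the number of corrections the user made.
    public static int Run(
        Func<string, string?> prompt,
        Func<string, string, string, bool> predict,
        Action<string, string, string, bool> onCorrection)
    {
        int corrections = 0;
        while (true)
        {
            string? sender = prompt("Sender (q to quit): ");
            if (sender is null) break;
            string? subject = prompt("Subject: ");
            if (subject is null) break;
            string? body = prompt("Body: ");
            if (body is null) break;

            bool isSpam = predict(sender, subject, body);
            Console.WriteLine(isSpam ? "Result: SPAM" : "Result: NOT SPAM");

            string? feedback = prompt("Do you agree with the result? (y/n): ");
            if (feedback is null) break;

            if (feedback.Trim().ToLowerInvariant() == "n")
            {
                // The user disagrees, so the corrected label is the
                // opposite of what the model predicted.
                onCorrection(sender, subject, body, !isSpam);
                corrections++;
            }
        }
        return corrections;
    }
}
```

Passing the dependencies in as delegates keeps the loop independent of ML.NET, which also makes it easy to exercise with canned input.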

Step 10.2: Save the Feedback

Once you have made the correction, you should save this data for retraining the model in the future.

SaveFeedback(sender, subject, body, userLabel);
...
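The SaveFeedback method itself is not shown in this article. One possible shape, as a sketch under assumptions, appends a tab-separated row without a header, which matches the `hasHeader: false` setting used when feedback.tsv is loaded. The `FeedbackStore` class and the extra `path` parameter are hypothetical.

```csharp
using System;
using System.IO;

public static class FeedbackStore
{
    // Appends one corrected example as a single tab-separated row.
    public static void SaveFeedback(
        string path, string sender, string subject, string body, bool isSpam)
    {
        string line = string.Join(
            "\t", Clean(sender), Clean(subject), Clean(body), isSpam.ToString());
        File.AppendAllText(path, line + Environment.NewLine);
    }

    // Replaces tabs and line breaks with spaces so that a field
    // cannot spill into another column or row.
    private static string Clean(string value) =>
        value.Replace('\t', ' ').Replace('\r', ' ').Replace('\n', ' ');
}
```

Cleaning the fields matters because user-typed text may contain a tab or a line break, which would otherwise corrupt the file the next time it is parsed.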

Step 11: Testing

Let’s run the application and test it.
On the first run, no saved model exists, so the app trains a new one. After exiting, check the project directory: it now contains a .zip file with the trained model.

Now, let's run the app again.
This time, the app loads the saved model instead of retraining.

Next, let’s input some data. The machine learning model predicts that the email is not spam.
However, if a stranger sends you an email offering to sell you an elephant, the prediction is clearly wrong, and you'd likely disagree with it.

In this case, we need to correct the prediction. Type "n" to disagree with the result so that the email is recorded as spam.
The app then retrains the model with this correction and saves it. The user feedback dataset is also saved in your project directory.

The model has now learned from this correction.

Finally, enter the same data again. This time, the email is detected as SPAM.

Conclusion

The ML.NET library is a powerful tool for training custom machine learning models using your own datasets. You can also find a wide variety of high-quality datasets on platforms like Kaggle.com. Unlike hosted AI services such as OpenAI or Claude, which require sending your data to a third party, ML.NET trains and runs models locally, so your data stays on your own machine or server.

However, ML.NET does have some challenges. It has a relatively steep learning curve, requiring a solid understanding of machine learning algorithms. You also need to source or create appropriate datasets, and the quality of your trained model heavily depends on the quality of the data you use.

I hope you found this guide helpful and that it encourages you to implement similar solutions in your own projects.

For your convenience, the complete source code is available on my GitHub repository for reference and further exploration.

SergKorol / SpamDetector
