David Au Yeung
Unlock the Power of LLM-Driven ETL: Transform Variable CSV to Clean JSON with C#, Semantic Kernel & Llama 3.2-3B

Introduction

We'll feed a messy CSV file to a lightweight llama3.2-3B model and let it infer a column mapping at runtime, then stream the whole file into a fixed Customer schema and emit line-delimited JSON – all inside a single .NET 8 console app.

If you’re new to LLM chatbots, check out my earlier post:
Step-by-Step Guide: Write Your First AI Storyteller with Ollama (llama3.2) and Semantic Kernel in C#
(Have your Chatbot helper ready before continuing.)

Why ETL + LLM?

Traditional ETL pipelines crumble the moment a vendor renames a column. With an LLM in the loop we can teach the app to recognize columns on the fly:

  1. Sample the first n rows of the CSV.
  2. Ask the LLM: “Which column is Id, which is SignupDate...?”
  3. Use the returned JSON mapping to transform every remaining row.

No more fragile config files; let the model handle the heavy lifting for us!
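In code, the model's reply can be deserialized straight into the same Mapping record used later in this article. A minimal sketch (the MappingDemo class name and the hard-coded reply are illustrative only):

```csharp
using System.Text.Json;

// One entry per target property, exactly as we ask the model to return it.
public record Mapping(string CsvColumn, string TargetProperty);

public static class MappingDemo
{
    public static void Main()
    {
        // A well-behaved reply for the sample.csv headers used in this article.
        const string reply = """
            [
              {"CsvColumn":"customer_id","TargetProperty":"Id"},
              {"CsvColumn":"full_name","TargetProperty":"Name"},
              {"CsvColumn":"email_address","TargetProperty":"Email"},
              {"CsvColumn":"registration_date","TargetProperty":"SignupDate"},
              {"CsvColumn":"status","TargetProperty":"IsActive"}
            ]
            """;

        var mappings = JsonSerializer.Deserialize<List<Mapping>>(reply)!;
        foreach (var m in mappings)
            Console.WriteLine($"{m.CsvColumn} -> {m.TargetProperty}");
    }
}
```

Because Mapping is a positional record, System.Text.Json binds the JSON keys to its constructor parameters without any attributes.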

What You’ll Build

  1. A console tool that reads any CSV and infers a mapping to the fixed schema Customer(Id, Name, Email, SignupDate, IsActive).
  2. Streams the full file and emits JSON lines (.jsonl) ready for ingestion.

All powered by Semantic Kernel, Ollama llama3.2-3B, and ~150 lines of C#.

sample.csv

customer_id,full_name,email_address,registration_date,status,age,country,referral_code,notes,last_login,preferred_language,loyalty_points,marketing_opt_in
101,Jane Doe,jane.doe@example.com,2024-05-01,true,29,USA,REF-A1,"Loves cats",2024-07-20T12:30:00Z,en,150,yes
102,Bob Smith,bob.smith@example.com,2024-05-03,false,41,CAN,REF-B2,"",,fr,20,no
103,Alice Johnson,alice.j@example.com,2024-06-10,1,33,GBR,REF-C3,"VIP customer",2024-07-18T08:11:42Z,en,760,yes
104,李四,li.si@example.cn,2024-06-20,true,26,CHN,,,"2024-07-21 02:05:00",zh,0,yes
105,Maria García,m.garcia@example.es,2024-07-01,false,38,ESP,REF-E5,"Prefers phone contact",,es,55,no

output.jsonl

{"Id":101,"Name":"Jane Doe","Email":"jane.doe@example.com","SignupDate":"2024-05-01T00:00:00Z","IsActive":true}
{"Id":102,"Name":"Bob Smith","Email":"bob.smith@example.com","SignupDate":"2024-05-03T00:00:00Z","IsActive":false}

Step 1 – Prerequisites

  1. .NET 8 SDK
  2. Ollama with llama3.2-3B
  3. NuGet packages:
dotnet add package Microsoft.SemanticKernel
dotnet add package Microsoft.SemanticKernel.ChatCompletion
dotnet add package CsvHelper
dotnet add package OllamaSharp

Step 2 – Prepare The Fixed Schema

public record Customer(
    int      Id,
    string   Name,
    string   Email,
    DateTime SignupDate,
    bool     IsActive);
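With System.Text.Json, each Customer serializes straight to one compact JSON line. A quick sketch (the SchemaDemo class and the hard-coded values are illustrative only):

```csharp
using System.Text.Json;

public record Customer(int Id, string Name, string Email, DateTime SignupDate, bool IsActive);

public static class SchemaDemo
{
    public static void Main()
    {
        var c = new Customer(101, "Jane Doe", "jane.doe@example.com",
                             new DateTime(2024, 5, 1, 0, 0, 0, DateTimeKind.Utc), true);

        // One JSON object per line = one record in output.jsonl.
        Console.WriteLine(JsonSerializer.Serialize(c));
        // → {"Id":101,"Name":"Jane Doe","Email":"jane.doe@example.com","SignupDate":"2024-05-01T00:00:00Z","IsActive":true}
    }
}
```

Note that a DateTime with Kind = Utc is written in ISO 8601 with a trailing "Z", which is exactly the shape shown in output.jsonl above.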

Step 3 – Prompt Engineering

The prompt (see MappingSystemPrompt in the code) constrains the model as follows:

TARGET SCHEMA (exactly five properties, case-sensitive)
  Id           : int
  Name         : string
  Email        : string
  SignupDate   : datetime
  IsActive     : bool

RESPONSE RULES
1. Return one JSON array that contains EXACTLY five objects – one per
   target property above, in any order.
2. Each object MUST have these keys:
       "CsvColumn"       – the column name that best matches the property
       "TargetProperty"  – one of [Id,Name,Email,SignupDate,IsActive]
3. Do NOT include columns that don’t map; do NOT include comments,
   nulls, explanations, markdown fences, or any other text.
4. Your entire reply is the JSON array. Nothing else.
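Even with rule 4, small local models occasionally wrap the array in extra chatter, so the full code later extracts the first [...] block with a regex before parsing. That defensive step in isolation (ReplyCleaner is a hypothetical name; the sample reply is made up):

```csharp
using System.Text.RegularExpressions;

public static class ReplyCleaner
{
    // Grab the first "[...]" block and ignore anything around it.
    private static readonly Regex JsonArray = new(@"\[[\s\S]*\]", RegexOptions.Compiled);

    public static string ExtractArray(string raw)
    {
        var match = JsonArray.Match(raw);
        if (!match.Success)
            throw new InvalidOperationException("LLM reply contained no JSON array.");
        return match.Value;
    }

    public static void Main()
    {
        var raw = "Sure! Here is the mapping:\n[{\"CsvColumn\":\"customer_id\",\"TargetProperty\":\"Id\"}]\nHope this helps.";
        Console.WriteLine(ExtractArray(raw));
        // → [{"CsvColumn":"customer_id","TargetProperty":"Id"}]
    }
}
```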

Step 4 – Converting Dynamic -> Target

ToCustomer() does the heavy lifting:

var cust = new Customer(
    Id:         (Convert("Id",         Lookup("Id"))         as int?)      ?? 0,
    Name:       (Convert("Name",       Lookup("Name"))       as string)    ?? "",
    Email:      (Convert("Email",      Lookup("Email"))      as string)    ?? "",
    SignupDate: (Convert("SignupDate", Lookup("SignupDate")) as DateTime?) ?? DateTime.MinValue,
    IsActive:   (Convert("IsActive",   Lookup("IsActive"))   as bool?)     ?? false
);

Key points:

  1. Lookup finds the CSV value that maps to each target property.
  2. Convert normalises strings -> int, DateTime, bool.
  3. Nullable casts (as int?) prevent NullReferenceException; the ?? fallback keeps the pipeline alive even on bad data.
  4. Date-parsing super-power: Convert accepts 12+ date formats (20/6/2024, 2024-06-20, 20240620, ISO-8601…), which is perfect for real-world files.
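A condensed, standalone sketch of that multi-format parsing (the format list is trimmed for brevity, and AssumeUniversal is used here instead of the article's AssumeLocal so the result doesn't depend on the machine's time zone):

```csharp
using System.Globalization;

public static class DateDemo
{
    private static readonly string[] Formats =
    {
        "d/M/yyyy", "yyyy-MM-dd", "yyyyMMdd", "yyyy-MM-ddTHH:mm:ssK"
    };

    public static bool TryParseDate(string s, out DateTime result) =>
        // Fast path: exact, known formats…
        DateTime.TryParseExact(s, Formats, CultureInfo.InvariantCulture,
                               DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal,
                               out result)
        // …then a lenient invariant-culture fallback.
        || DateTime.TryParse(s, CultureInfo.InvariantCulture,
                             DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal,
                             out result);

    public static void Main()
    {
        foreach (var s in new[] { "20/6/2024", "2024-06-20", "20240620", "2024-07-20T12:30:00Z" })
            Console.WriteLine($"{s} -> {(TryParseDate(s, out var dt) ? dt.ToString("O") : "unparsable")}");
    }
}
```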

Step 5 – End-to-End Code

The full source is listed below; the only thing left is running it:

Program.cs

using MyPlaygroundApp.Utils;

namespace MyPlaygroundApp
{
    public class Program
    {
        public static async Task Main()
        {
            var srcCsv = @"C:\Users\User\source\repos\MyPlaygroundApp\MyPlaygroundApp\CSV\sample.csv";
            var dstJson = @"C:\Users\User\source\repos\MyPlaygroundApp\MyPlaygroundApp\Output\output.jsonl";

            var cts = new CancellationTokenSource();

            var chat = new Chatbot("http://localhost:11434", "You are a data-integration assistant that maps columns.");

            var etl = new ETLTools(chat);

            Console.CancelKeyPress += (_, e) => { e.Cancel = true; cts.Cancel(); };

            await etl.RunAsync(srcCsv, dstJson, sampleRows: 10, cts.Token);
        }
    }
}

and ETLTools.cs

using CsvHelper;
using CsvHelper.Configuration;
using System.Globalization;
using System.Runtime.CompilerServices;
using System.Text.Json;
using System.Text.RegularExpressions;

namespace MyPlaygroundApp.Utils;

/// <summary>End-to-end CSV -> fixed-schema JSON ETL.</summary>
public sealed class ETLTools
{
    // Target Model 
    public record Customer(int Id, string Name, string Email, DateTime SignupDate, bool IsActive);

    public record Mapping(string CsvColumn, string TargetProperty);

    //reference: https://dev.to/auyeungdavid_2847435260/step-by-step-guide-write-your-first-ai-storyteller-with-ollama-llama32-and-semantic-kernel-in-c-1h40
    private readonly IChatbot _chat;

    private const string MappingSystemPrompt =
        """
            TARGET SCHEMA (exactly five properties, case-sensitive)
              Id           : int
              Name         : string
              Email        : string
              SignupDate   : datetime
              IsActive     : bool

            RESPONSE RULES
            1. Return one JSON array that contains EXACTLY five objects – one per
               target property above, in any order.
            2. Each object MUST have these keys:
                   "CsvColumn"       – the column name that best matches the property
                   "TargetProperty"  – one of [Id,Name,Email,SignupDate,IsActive]
            3. Do NOT include columns that don’t map; do NOT include comments,
               nulls, explanations, markdown fences, or any other text.
            4. Your entire reply is the JSON array. Nothing else.
        """;

    private static readonly string[] _dateFormats =
    {
        // day-first
        "d/M/yyyy", "dd/MM/yyyy", "d/M/yy", "dd/MM/yy",
        // month-first
        "M/d/yyyy", "MM/dd/yyyy", "M/d/yy", "MM/dd/yy",
        // ISO / database
        "yyyy-MM-dd", "yyyy/MM/dd",
        "yyyy-MM-ddTHH:mm:ss", "yyyy-MM-ddTHH:mm:ssK",
        "yyyy-MM-ddTHH:mm:ss.fff", "yyyy-MM-ddTHH:mm:ss.fffK",
        // compact
        "yyyyMMdd"
    };

    public ETLTools(IChatbot chat) => _chat = chat;

    // Public facade
    public async Task RunAsync(string csvPath, string jsonPath, int sampleRows,
                               CancellationToken ct = default)
    {
        // 1. Inference sample
        var sample = await ReadCsvAsync(csvPath, sampleRows, ct).ToListAsync(ct); // ToListAsync comes from the System.Linq.Async package

        var mapping = (await InferMappingAsync(sample, ct))
            .ToDictionary(m => m.CsvColumn, m => m.TargetProperty,
                          StringComparer.OrdinalIgnoreCase);

        Console.WriteLine("Inferred mapping:");
        foreach (var kv in mapping) Console.WriteLine($"  {kv.Key} -> {kv.Value}");

        // 2. Stream full file, convert, write out
        await using var writer = new StreamWriter(jsonPath, append: false);
        var jsOpts = new JsonSerializerOptions { WriteIndented = false };

        await foreach (var row in ReadCsvAsync(csvPath, int.MaxValue, ct))
        {
            var cust = ToCustomer(row, mapping);
            await writer.WriteLineAsync(JsonSerializer.Serialize(cust, jsOpts));
        }

        Console.WriteLine($"✓ ETL finished → {jsonPath}");
    }

    // Private helpers 
    private Customer ToCustomer(Dictionary<string, object?> row,
                                IReadOnlyDictionary<string, string> map)
    {
        object? Lookup(string target) =>
            row.TryGetValue(map.First(m => m.Value == target).Key, out var v) ? v : null;

        return new Customer(
            Id: (Convert("Id", Lookup("Id")) as int?) ?? 0,
            Name: (Convert("Name", Lookup("Name")) as string) ?? "",
            Email: (Convert("Email", Lookup("Email")) as string) ?? "",
            SignupDate: (Convert("SignupDate", Lookup("SignupDate")) as DateTime?) ?? DateTime.MinValue,
            IsActive: (Convert("IsActive", Lookup("IsActive")) as bool?) ?? false
        );
    }

    private static object? Convert(string target, object? raw)
    {
        if (raw is null) return null;

        var s = raw.ToString()!.Trim();
        if (s.Length == 0) return null;

        return target switch
        {
            "Id" => int.TryParse(s, NumberStyles.Any, CultureInfo.InvariantCulture, out var id)
                        ? id : null,

            "SignupDate" => TryParseDate(s, out var dt) ? dt : null,

            "IsActive" => s.Equals("1") ||
                          s.Equals("true", StringComparison.OrdinalIgnoreCase) ||
                          s.Equals("yes", StringComparison.OrdinalIgnoreCase),

            _ => s
        };
    }

    private static bool TryParseDate(string s, out DateTime result)
    {
        // 1. exact known formats (fast)
        if (DateTime.TryParseExact(s, _dateFormats, CultureInfo.InvariantCulture,
                                   DateTimeStyles.AssumeLocal | DateTimeStyles.AdjustToUniversal,
                                   out result))
            return true;

        // 2. day-first culture
        if (DateTime.TryParse(s, CultureInfo.GetCultureInfo("en-GB"),
                              DateTimeStyles.AssumeLocal | DateTimeStyles.AdjustToUniversal,
                              out result))
            return true;

        // 3. month-first / invariant
        if (DateTime.TryParse(s, CultureInfo.InvariantCulture,
                              DateTimeStyles.AssumeLocal | DateTimeStyles.AdjustToUniversal,
                              out result))
            return true;

        // 4. whatever the machine’s current culture is
        return DateTime.TryParse(s, CultureInfo.CurrentCulture,
                                 DateTimeStyles.AssumeLocal | DateTimeStyles.AdjustToUniversal,
                                 out result);
    }

    private static readonly Regex _jsonArray =
        new(@"\[[\s\S]*\]", RegexOptions.Compiled); // first “[…]” block

    private async Task<IReadOnlyList<Mapping>> InferMappingAsync(
        IEnumerable<Dictionary<string, object?>> sample, CancellationToken ct = default)
    {
        var prompt =
            $"""
            {MappingSystemPrompt}

            CSV SAMPLE
            {JsonSerializer.Serialize(sample)}
        """;

        string raw = await _chat.AskQuestion(prompt);
        Console.WriteLine($"raw LLM reply >>>\n{raw}\n<<<");

        // Strip everything except the first JSON array
        var match = _jsonArray.Match(raw);
        if (!match.Success)
            throw new InvalidOperationException("LLM reply contained no JSON array.");

        string json = match.Value;

        // Parse & prune rows where TargetProperty is null/empty
        var opts = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
        var rows = JsonSerializer.Deserialize<List<Mapping>>(json, opts)
                  ?? throw new InvalidOperationException("JSON deserialised to null.");

        var clean = rows
            .Where(r => !string.IsNullOrWhiteSpace(r.TargetProperty))
            .ToList();

        if (clean.Count != 5)
            throw new InvalidOperationException(
                $"Expected 5 mappings, got {clean.Count}. JSON was:\n{json}");

        return clean;
    }

    private async IAsyncEnumerable<Dictionary<string, object?>> ReadCsvAsync(
        string path, int max,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        using var reader = new StreamReader(path);
        using var csv = new CsvReader(reader, new CsvConfiguration(CultureInfo.InvariantCulture));

        int i = 0;

        await foreach (var rec in csv.GetRecordsAsync<dynamic>(ct))
        {
            var dict = (IDictionary<string, object?>)rec; // runtime type is ExpandoObject
            yield return dict.ToDictionary(kv => kv.Key, kv => kv.Value, StringComparer.OrdinalIgnoreCase);

            if (++i == max) yield break;
        }
    }
}
dotnet run --project MyPlaygroundApp

Watch the console to verify that the inferred mapping is correct.

Insights & Lessons Learned

  1. Dynamic to Static – LLMs excel at pattern matching; harness them to map unpredictable feeds to strict schemas.
  2. Lightweight wins – llama3.2-3B is only 2 GB but nails column inference in a blink.
  3. Soft parsing – treat every conversion as nullable plus a reasonable default; ETL should never crash on dirty data.
  4. Streaming – process gigabyte files with bounded memory thanks to await foreach.
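Point 4 in isolation: the bounded-memory pattern behind ReadCsvAsync, reduced to plain line streaming (the StreamDemo class and file name are illustrative; assumes .NET 8 implicit usings):

```csharp
public static class StreamDemo
{
    // Yield one line at a time; the full file never sits in memory.
    public static async IAsyncEnumerable<string> ReadLinesAsync(string path)
    {
        using var reader = new StreamReader(path);
        while (await reader.ReadLineAsync() is { } line)
            yield return line;
    }

    public static async Task Main()
    {
        await foreach (var line in ReadLinesAsync("sample.csv"))
            Console.WriteLine(line);
    }
}
```

Because the async iterator pulls one record at a time, memory use stays flat no matter how large the input file grows.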

Conclusion

We just unboxed the power of LLM-driven ETL: zero configs, dynamic conversions, and pocket-friendly infrastructure. Feel free to fork the code, tweak the prompt, or extend the Customer schema.

REMEMBER, this demo is only an experiment, a proof of concept that shows what’s possible with a small local LLM and a few hundred lines of code. It doesn’t cover enterprise concerns such as schema-drift alerts, incremental loads, retry policies, data quality dashboards, or secure secret handling. Treat it as a playground to spark ideas, then harden, test, and monitor thoroughly before running it in production.

References

Semantic Kernel Quick-Start
Ollama llama3.2-3B

Have fun turning data chaos into order, and as always, Love C# & AI!

Top comments (1)

David Au Yeung

I have recently discovered the potential of lightweight LLMs (Large Language Models) or small language models (SLMs). Stay tuned for more fun experiments!