Introduction
We'll feed a messy CSV file to a lightweight llama3.2-3B model and let it infer a column mapping at runtime, then stream the whole file into a fixed Customer
schema and emit line-delimited JSON – all inside a single .NET 8 console app.
If you’re new to LLM chatbots, check out my earlier post:
Step-by-Step Guide: Write Your First AI Storyteller with Ollama (llama3.2) and Semantic Kernel in C#
(Have your Chatbot helper ready before continuing.)
Why ETL + LLM?
Traditional ETL pipelines crumble the moment a vendor renames a column.
With an LLM in the loop we can teach the app to recognize columns on the fly.
Sample the first n rows of the CSV
Ask the LLM: “Which column is Id, which is SignupDate...?”
Use the returned JSON mapping to transform every remaining row.
No more fragile config files, let the model handle the heavy lifting for us!
What You’ll Build
- A console tool that reads any CSV, infers a mapping to the fixed schema
Customer
(Id, Name, Email, SignupDate, IsActive). - Streams the full file and emits JSON lines (jsonl) ready for ingestion. All powered by Semantic Kernel, Ollama llama3.2-3B, and ~150 lines of C#.
sample.csv
customer_id,full_name,email_address,registration_date,status,age,country,referral_code,notes,last_login,preferred_language,loyalty_points,marketing_opt_in
101,Jane Doe,jane.doe@example.com,2024-05-01,true,29,USA,REF-A1,"Loves cats",2024-07-20T12:30:00Z,en,150,yes
102,Bob Smith,bob.smith@example.com,2024-05-03,false,41,CAN,REF-B2,"",,fr,20,no
103,Alice Johnson,alice.j@example.com,2024-06-10,1,33,GBR,REF-C3,"VIP customer",2024-07-18T08:11:42Z,en,760,yes
104,李四,li.si@example.cn,2024-06-20,true,26,CHN,,,"2024-07-21 02:05:00",zh,0,yes
105,Maria García,m.garcia@example.es,2024-07-01,false,38,ESP,REF-E5,"Prefers phone contact",,es,55,no
output.jsonl
{"Id":1001,"Name":"Alice Lee","Email":"alice@example.com","SignupDate":"2024-06-20T00:00:00Z","IsActive":true}
{"Id":1002,"Name":"Bob Chan","Email":"bob@example.com","SignupDate":"2024-07-05T00:00:00Z","IsActive":false}
Step 1 – Prerequisites
- .NET 8 SDK
- Ollama with llama3.2-3B
- NuGet packages:
dotnet add package Microsoft.SemanticKernel
dotnet add package Microsoft.SemanticKernel.ChatCompletion;
dotnet add package CsvHelper
dotnet add package OllamaSharp
Step 2 – Prepare The Fixed Schema
public record Customer(
int Id,
string Name,
string Email,
DateTime SignupDate,
bool IsActive);
Step 3 – Prompt Engineering
The prompt (see MappingSystemPrompt in the code) forces the model to:
TARGET SCHEMA (exactly five properties, case-sensitive)
Id : int
Name : string
Email : string
SignupDate : datetime
IsActive : bool
RESPONSE RULES
1. Return one JSON array that contains EXACTLY five objects – one per
target property above, in any order.
2. Each object MUST have these keys:
"CsvColumn" – the column name that best matches the property
"TargetProperty" – one of [Id,Name,Email,SignupDate,IsActive]
3. Do NOT include columns that don’t map; do NOT include comments,
nulls, explanations, markdown fences, or any other text.
4. Your entire reply is the JSON array. Nothing else.
Step 4 – Converting Dynamic -> Target
ToCustomer()
does the heavy lifting:
var cust = new Customer(
Id: (Convert("Id", Lookup("Id")) as int?) ?? 0,
Name: (Convert("Name", Lookup("Name")) as string?) ?? "",
Email: (Convert("Email", Lookup("Email")) as string?)?? "",
SignupDate: (Convert("SignupDate", Lookup("SignupDate"))as DateTime?)??DateTime.MinValue,
IsActive: (Convert("IsActive", Lookup("IsActive"))as bool?)??false
);
Key points:
- Lookup finds the CSV value that maps to each target property.
- Convert normalises strings -> int, DateTime, bool.
- Nullable casts (as int?) prevent NullReferenceException; the ?? fallback keeps the pipeline alive even on bad data.
- Date parsing super-power.
- Convert accepts 12+ date formats (20/6/2024, 2024-06-20, 20240620, ISO-8601…) which are perfect for real-world files.
Step 5 – End-to-End Code
You already have the full source (scroll up in the question).
The only thing left is running it:
Program.cs
using MyPlaygroundApp.Utils;
namespace MyPlaygroundApp
{
public class Program
{
public static async Task Main()
{
var srcCsv = @"C:\Users\User\source\repos\MyPlaygroundApp\MyPlaygroundApp\CSV\sample.csv";
var dstJson = @"C:\Users\User\source\repos\MyPlaygroundApp\MyPlaygroundApp\Output\output.jsonl";
var cts = new CancellationTokenSource();
var chat = new Chatbot("http://localhost:11434", "You are a data-integration assistant that maps columns.");
var etl = new ETLTools(chat);
Console.CancelKeyPress += (_, e) => { e.Cancel = true; cts.Cancel(); };
await etl.RunAsync(srcCsv, dstJson, sampleRows: 10, cts.Token);
}
}
}
and ETLTools.cs
using CsvHelper;
using CsvHelper.Configuration;
using System.Globalization;
using System.Runtime.CompilerServices;
using System.Text.Json;
using System.Text.RegularExpressions;
namespace MyPlaygroundApp.Utils;
/// <summary>End-to-end CSV -> fixed-schema JSON ETL.</summary>
public sealed class ETLTools
{
// Target Model
public record Customer(int Id, string Name, string Email, DateTime SignupDate, bool IsActive);
public record Mapping(string CsvColumn, string TargetProperty);
//reference: https://dev.to/auyeungdavid_2847435260/step-by-step-guide-write-your-first-ai-storyteller-with-ollama-llama32-and-semantic-kernel-in-c-1h40
private readonly IChatbot _chat;
private const string MappingSystemPrompt =
"""
TARGET SCHEMA (exactly five properties, case-sensitive)
Id : int
Name : string
Email : string
SignupDate : datetime
IsActive : bool
RESPONSE RULES
1. Return one JSON array that contains EXACTLY five objects – one per
target property above, in any order.
2. Each object MUST have these keys:
"CsvColumn" – the column name that best matches the property
"TargetProperty" – one of [Id,Name,Email,SignupDate,IsActive]
3. Do NOT include columns that don’t map; do NOT include comments,
nulls, explanations, markdown fences, or any other text.
4. Your entire reply is the JSON array. Nothing else.
""";
private static readonly string[] _dateFormats =
{
// day-first
"d/M/yyyy", "dd/MM/yyyy", "d/M/yy", "dd/MM/yy",
// month-first
"M/d/yyyy", "MM/dd/yyyy", "M/d/yy", "MM/dd/yy",
// ISO / database
"yyyy-MM-dd", "yyyy/MM/dd",
"yyyy-MM-ddTHH:mm:ss", "yyyy-MM-ddTHH:mm:ssK",
"yyyy-MM-ddTHH:mm:ss.fff", "yyyy-MM-ddTHH:mm:ss.fffK",
// compact
"yyyyMMdd"
};
public ETLTools(IChatbot chat) => _chat = chat;
// Public facade
public async Task RunAsync(string csvPath, string jsonPath, int sampleRows,
CancellationToken ct = default)
{
// 1️. Inference sample
var sample = await ReadCsvAsync(csvPath, sampleRows, ct).ToListAsync(ct);
var mapping = (await InferMappingAsync(sample)).ToDictionary(m => m.CsvColumn, m => m.TargetProperty,
StringComparer.OrdinalIgnoreCase);
Console.WriteLine("Inferred mapping:");
foreach (var kv in mapping) Console.WriteLine($" {kv.Key} → {kv.Value}");
// 2️. Stream full file, convert, write out
await using var writer = new StreamWriter(jsonPath, append: false);
var jsOpts = new JsonSerializerOptions { WriteIndented = false };
await foreach (var row in ReadCsvAsync(csvPath, int.MaxValue, ct))
{
var cust = ToCustomer(row, mapping);
await writer.WriteLineAsync(JsonSerializer.Serialize(cust, jsOpts));
}
Console.WriteLine($"✓ ETL finished → {jsonPath}");
}
// Private helpers
private Customer ToCustomer(Dictionary<string, object?> row,
IReadOnlyDictionary<string, string> map)
{
object? Lookup(string target) =>
row.TryGetValue(map.First(m => m.Value == target).Key, out var v) ? v : null;
return new Customer(
Id: (Convert("Id", Lookup("Id")) as int?) ?? 0,
Name: (Convert("Name", Lookup("Name")) as string) ?? "",
Email: (Convert("Email", Lookup("Email")) as string) ?? "",
SignupDate: (Convert("SignupDate", Lookup("SignupDate")) as DateTime?) ?? DateTime.MinValue,
IsActive: (Convert("IsActive", Lookup("IsActive")) as bool?) ?? false
);
}
private static object? Convert(string target, object? raw)
{
if (raw is null) return null;
var s = raw.ToString()!.Trim();
if (s.Length == 0) return null;
return target switch
{
"Id" => int.TryParse(s, NumberStyles.Any, CultureInfo.InvariantCulture, out var id)
? id : null,
"SignupDate" => TryParseDate(s, out var dt) ? dt : null,
"IsActive" => s.Equals("1") ||
s.Equals("true", StringComparison.OrdinalIgnoreCase) ||
s.Equals("yes", StringComparison.OrdinalIgnoreCase),
_ => s
};
}
private static bool TryParseDate(string s, out DateTime result)
{
// 1️. exact known formats (fast)
if (DateTime.TryParseExact(s, _dateFormats, CultureInfo.InvariantCulture,
DateTimeStyles.AssumeLocal | DateTimeStyles.AdjustToUniversal,
out result))
return true;
// 2️. day-first culture
if (DateTime.TryParse(s, CultureInfo.GetCultureInfo("en-GB"),
DateTimeStyles.AssumeLocal | DateTimeStyles.AdjustToUniversal,
out result))
return true;
// 3️. month-first / invariant
if (DateTime.TryParse(s, CultureInfo.InvariantCulture,
DateTimeStyles.AssumeLocal | DateTimeStyles.AdjustToUniversal,
out result))
return true;
// 4️. whatever the machine’s current culture is
return DateTime.TryParse(s, CultureInfo.CurrentCulture,
DateTimeStyles.AssumeLocal | DateTimeStyles.AdjustToUniversal,
out result);
}
private static readonly Regex _jsonArray =
new(@"\[[\s\S]*\]", RegexOptions.Compiled); // first “[…]” block
private async Task<IReadOnlyList<Mapping>> InferMappingAsync(
IEnumerable<Dictionary<string, object?>> sample, CancellationToken ct = default)
{
var prompt =
$"""
{MappingSystemPrompt}
CSV SAMPLE
{JsonSerializer.Serialize(sample)}
""";
string raw = await _chat.AskQuestion(prompt);
Console.WriteLine($"raw LLM reply >>>\n{raw}\n<<<");
// Strip everything except the first JSON array
var match = _jsonArray.Match(raw);
if (!match.Success)
throw new InvalidOperationException("LLM reply contained no JSON array.");
string json = match.Value;
// Parse & prune rows where TargetProperty is null/empty
var opts = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
var rows = JsonSerializer.Deserialize<List<Mapping>>(json, opts)
?? throw new InvalidOperationException("JSON deserialised to null.");
var clean = rows
.Where(r => !string.IsNullOrWhiteSpace(r.TargetProperty))
.ToList();
if (clean.Count != 5)
throw new InvalidOperationException(
$"Expected 5 mappings, got {clean.Count}. JSON was:\n{json}");
return clean;
}
private async IAsyncEnumerable<Dictionary<string, object?>> ReadCsvAsync(
string path, int max,
[EnumeratorCancellation] CancellationToken ct = default)
{
using var reader = new StreamReader(path);
using var csv = new CsvReader(reader, new CsvConfiguration(CultureInfo.InvariantCulture));
int i = 0;
await foreach (var rec in csv.GetRecordsAsync<dynamic>(ct))
{
var dict = (IDictionary<string, object?>)rec; // runtime type is ExpandoObject
yield return dict.ToDictionary(kv => kv.Key, kv => kv.Value, StringComparer.OrdinalIgnoreCase);
if (++i == max) yield break;
}
}
}
dotnet run --project MyPlaygroundApp
Watch the console to see the mapping is correct or not:
Result:
Insights & Lessons Learned
- Dynamic to Static – LLMs excel at pattern matching; harness them to map unpredictable feeds to strict schemas.
- Lightweight wins – llama3.2-3B is only 2 GB but nails column inference in a blink.
- Soft parsing – treat every conversion as nullable plus a reasonable default; ETL should never crash on dirty data.
- Streaming – process gigabyte files with bounded memory thanks to await foreach.
Conclusion
We just unboxed the power of LLM-driven ETL: zero configs, dynamic conversions, and pocket-friendly infrastructure. Feel free to fork the coding, tweak the prompt, or extend the Customer
schema.
REMEBER, this demo is only an experiment/a proof-of-concept that shows what’s possible with a small local LLM and a few hundred lines of code. It doesn’t cover enterprise functions such as schema-drift alerts, incremental loads, retry policies, data quality dashboards, or secure secret handling. Treat it as a playground to spark ideas, then harden, test, and monitor thoroughly before running it in production.
References
Semantic Kernel Quick-Start
Ollama llama3.2-3B
Have fun turning data chaos into order, and as always, Love C# & AI!
Top comments (1)
I have recently discovered the potential of lightweight LLMs (Large Language Models) or small language models (SLMs). Stay tuned for more fun experiments!