How I stabilized noisy 311-style requests with supervised training and reviewer preferences in Python
TL;DR
This write-up is an experimental account of how I built a small routing proof of concept for synthetic municipal-style service requests. The goal was not to ship a city-wide system. From my perspective, the interesting part is the training story: start with labeled text, fit a transparent classifier, then inject reviewer-style preferences so the policy moves toward routes that match operational nuance. The repository is public, fully synthetic, and designed to run on a laptop without calling a hosted large language model. If you are looking for a polished civic product, this is not it. If you are looking for a clean, inspectable playground that mirrors how I think about aligning lightweight agents before any serious conversation about production, this article walks through the motivation, design, code, and limitations in depth.
Introduction
I have spent a fair amount of time thinking about the gap between a clever prompt and a dependable workflow. Prompts can feel magical until the edge cases arrive, and edge cases are exactly what public-facing intake systems collect. In my opinion, the hardest part is not the first eighty percent of routing accuracy on obvious phrases. The hardest part is the long tail where departments disagree, citizens mix multiple issues in one sentence, and the right answer depends on local policy interpretation rather than dictionary matching.
This article documents a solo experiment. I wrote the code, generated the synthetic corpus, and iterated on evaluation plots in isolation. Nothing here reflects a real municipality, a vendor engagement, or a production deployment. I am describing a personal proof of concept that helped me reason about supervised fine-tuning style steps and preference alignment style steps without pretending I trained a billion-parameter model on private data.
I also want to be clear about scope. I am not claiming breakthrough accuracy numbers on messy real-world corpora. The dataset is intentionally templated so I can focus on architecture and training mechanics. In my experience, that trade-off is common in early research spikes: control the data so you can see whether your training loop behaves, then worry about realism later.
What is this article about?
The narrative centers on a routing policy that maps free-text citizen requests to a small set of department labels such as streets, parks, utilities, code enforcement, noise, and a catch-all other bucket. The policy is a multinomial logistic regression model on TF-IDF features. That choice is boring on purpose. Boring models are easy to explain, easy to diff between training iterations, and easy to pair with simple charts when I need to communicate results to someone outside machine learning circles.
The second thread is alignment in a practical, scaled-down sense. Large-scale preference optimization methods are fascinating, but they also come with engineering overhead that does not belong in every story. In my experiments, I approximate reviewer intent by augmenting the training set with duplicated examples that emphasize the chosen route and sparse contrastive hints that steer the model away from plausible wrong routes. The approximation is not a faithful implementation of direct preference optimization. It is a teaching device that still captures the intuition I care about: policies improve when human judgments are folded back into training data in a structured way.
Tech stack
- Python 3.10 or newer for broad compatibility with current scientific libraries.
- scikit-learn for TF-IDF vectorization and multinomial logistic regression.
- NumPy for lightweight numerical handling during evaluation.
- Matplotlib for offline charts that summarize accuracy and macro F1 movement between training phases.
- A small amount of standard library code for ASCII tables in the terminal, because I wanted the demo to feel credible when I record or share a terminal capture.
I deliberately avoided deep learning frameworks in this repository. That decision is philosophical as much as technical. In my opinion, a public write-up about routing should let a curious reader inspect coefficients and vocabulary without downloading CUDA drivers. If I later extend the project with embeddings or small transformer heads, that can be a separate milestone with separate disclosure about compute and data handling.
Why read it?
- If you are evaluating how to stage agent development, you might appreciate a story that separates policy training from prompt drafting. I structured the code so the “agent” is really a policy object with predictable inputs and outputs.
- If you care about reproducibility, the synthetic generator and fixed seeds make runs comparable across machines, which is helpful when you are sanity-checking a pipeline before investing in data contracts.
- If you are interested in alignment discussions but want a concrete anchor, the preference augmentation section translates abstract pairwise feedback into a dataset transformation you can read line by line.
- If you want a reminder about ethics and privacy, the article ends with a candid discussion of why synthetic data is the responsible choice for a public artifact in this domain.
Let us design
Before typing code, I sketched the constraints I wanted the PoC to respect. First, the system should fail gracefully in the sense that every prediction returns a label with an interpretable basis in term frequency. Second, training should be fast enough that I can iterate during a single evening session. Third, the evaluation should include more than accuracy because class imbalance can hide weakness in rare departments.
Architecture-wise, I imagined three cooperating layers: ingestion of normalized text, featurization, and a training loop that can run in two phases. The first phase is ordinary supervised training on labeled examples. The second phase reweights and augments the corpus using preference pairs that represent reviewer corrections. The diagram below highlights how I think about the information flow at a high level.
I also wanted a sequence-oriented view because stakeholders often think in terms of tickets rather than matrices. The sequence chart is simplified, yet it captures the idea that routing is a service interface problem, not only a modeling problem.
Finally, I drew a training flowchart to keep myself honest about order of operations. When I build without a flowchart, I tend to mix evaluation leakage into augmentation steps by accident. The flowchart is a personal guardrail.
Let us get cooking
The heart of the project lives in a handful of modules under src/civic_triage. Rather than dump the entire repository into this article, I am highlighting the pieces that taught me the most while implementing the experiment.
Module: labels and synthetic corpus
I started by fixing a small enumeration of departments. Keeping labels explicit avoids silent typos that destroy evaluation integrity. The synthetic generator fills templates with street names, cross streets, and park names drawn from small pools. That approach introduces variety without requiring personally identifiable information.
```python
from enum import Enum

class Department(str, Enum):
    STREETS = "streets"
    PARKS = "parks"
    UTILITIES = "utilities"
    CODE_ENFORCEMENT = "code_enforcement"
    NOISE = "noise"
    OTHER = "other"
```
This code is almost too simple to discuss, but that is the point. By constraining labels to an enum, I make downstream encoding and reporting consistent. When I wrote this, I was thinking about future refactors: if a label changes, I want a single source of truth rather than string literals scattered across scripts.
The synthetic data builder rotates through templates per department and adds occasional suffix phrases such as “Please route quickly.” Those suffixes inject light noise so the vectorizer cannot rely on a single memorized sentence. In my opinion, small perturbations matter when evaluating linear models because they reveal whether the model leans on a handful of accidental keywords.
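To make that concrete, here is a minimal sketch of the template rotation described above. The template strings, name pools, and the `generate_examples` helper are illustrative stand-ins, not the repository's actual generator.

```python
import random

# Illustrative template and name pools; the repository's real pools differ.
TEMPLATES = {
    "streets": ["Pothole on {street} near {cross}.", "Streetlight out on {street}."],
    "parks": ["Broken swing at {park}.", "Overflowing trash cans in {park}."],
}
STREETS = ["Elm Ave", "Oak St"]
CROSSES = ["3rd St", "Maple Dr"]
PARKS = ["Riverside Park", "Hillcrest Green"]
SUFFIXES = ["", " Please route quickly.", " This has been ongoing for a week."]

def generate_examples(label: str, n: int, seed: int = 0) -> list[str]:
    """Fill templates with pool values and append occasional noise suffixes."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        template = rng.choice(TEMPLATES[label])
        # str.format ignores unused keyword arguments, so one call covers
        # every template shape.
        text = template.format(
            street=rng.choice(STREETS),
            cross=rng.choice(CROSSES),
            park=rng.choice(PARKS),
        )
        out.append(text + rng.choice(SUFFIXES))
    return out
```

The suffix pool is what keeps the vectorizer from memorizing a single sentence per department: the same template appears with several tails, so no one token sequence is a perfect label predictor.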
Module: preference pairs
Preference pairs are where I tried to echo alignment ideas without importing a full preference optimization stack. For a subset of training rows, I simulate a reviewer disagreeing with a plausible wrong route. The chosen label remains the ground-truth department, while the rejected label is sampled from a hand-authored confusion map.
```python
import random

def iter_preference_pairs(requests, mistake_rate: float = 0.22, seed: int = 7):
    rng = random.Random(seed)
    confusion = {
        "streets": ("utilities", "noise"),
        "parks": ("noise", "other"),
        # ... additional mappings ...
    }
    for lr in requests:
        if rng.random() > mistake_rate:
            continue
        wrong_a, wrong_b = confusion[lr.label]
        rejected = rng.choice((wrong_a, wrong_b))
        yield PreferencePair(text=lr.text, chosen=lr.label, rejected=rejected)
```
I put it this way because I wanted the mistake rate to be explicit. If the rate is too high, augmentation dominates and can distort the base distribution. If the rate is too low, the second training phase barely differs from the first. In my experiments, a mid-teens to low-twenties rate produced visible dataset growth without drowning the original signal.
Module: modeling and alignment helper
The classifier pipeline combines TF-IDF with multinomial logistic regression. I kept regularization in a sensible default range and allowed sklearn to pick the multiclass strategy appropriate for the installed version. The alignment helper duplicates chosen-label rows multiple times and occasionally appends a short textual hint that reinforces the negation of a rejected route.
```python
import numpy as np

def apply_preference_alignment(base_texts, base_labels, pairs,
                               oversample_chosen: int = 3, seed: int = 42):
    rng = np.random.default_rng(seed)
    texts = list(base_texts)
    labels = list(base_labels)
    for text, chosen, rejected in pairs:
        for _ in range(oversample_chosen):
            texts.append(text)
            labels.append(chosen)
        if rng.random() < 0.15:
            texts.append(text + " [reviewer_note: not " + rejected + "]")
            labels.append(chosen)
    return texts, labels
```
When I wrote this, I was thinking about how real reviewers often repeat themselves when they correct a mistake. Duplication is a crude stand-in for importance weighting, but it behaves well with linear models and keeps the code approachable for readers who are not ready to implement custom loss functions.
Module: reporting
I wanted terminal output that looks like a serious batch job. ASCII tables are not glamorous, yet they photograph well in articles and presentations. The reporting helper measures column widths and draws horizontal rules with plus signs, similar to old-school fixed-width reports.
```python
def ascii_table(headers: list[str], rows: list[list[str | float]]) -> str:
    str_rows: list[list[str]] = [[str(h) for h in headers]]
    for r in rows:
        str_rows.append([f"{c:.4f}" if isinstance(c, float) else str(c) for c in r])
    widths = [max(len(row[i]) for row in str_rows) for i in range(len(headers))]
    lines: list[str] = []
    sep = "+-" + "-+-".join("-" * w for w in widths) + "-+"
    lines.append(sep)
    # ... render rows ...
    return "\n".join(lines)
```
This block taught me to separate presentation from computation. Metrics are computed once, then rendered. That separation makes it easier to swap the renderer later if I decide to integrate Rich or another library without touching training logic.
Entry point: orchestration
The main.py script wires everything together: generate data, split, train the supervised model, evaluate, build preference pairs, augment, retrain, and write charts. I kept the CLI small so the experiment stays legible.
```python
def run_pipeline(seed: int, n_per_class: int, pref_mistake_rate: float) -> int:
    data = generate_labeled_requests(n_per_class=n_per_class, seed=seed)
    split = int(len(data) * 0.8)
    train, test = data[:split], data[split:]
    train_texts = [x.text for x in train]
    train_labels = [x.label for x in train]
    test_texts = [x.text for x in test]
    test_labels = [x.label for x in test]
    sft = fit_sft(train_texts, train_labels, seed=seed)
    sft_metrics = metrics_for(sft, test_texts, test_labels)
    pairs = list(iter_preference_pairs(train, mistake_rate=pref_mistake_rate, seed=seed + 1))
    pair_tuples = [(p.text, p.chosen, p.rejected) for p in pairs]
    aug_texts, aug_labels = apply_preference_alignment(
        train_texts, train_labels, pair_tuples, oversample_chosen=3, seed=seed + 2)
    aligned = fit_sft(aug_texts, aug_labels, seed=seed + 3)
    aligned_metrics = metrics_for(aligned, test_texts, test_labels)
    plot_metric_bars(sft_metrics, aligned_metrics,
                     os.path.join(ROOT, "output", "metrics_compare.png"))
    return 0
```
Reading this after a break, I still like the explicit ordering. Augmentation happens only after the base model exists, which prevents me from accidentally comparing two augmented variants without a common baseline story.
Let us setup
Step by step details can be found in the repository README, and the canonical clone URL is https://github.com/aniket-work/CivicTriage-AI. I recommend creating a virtual environment inside the project folder so dependencies remain isolated from other work on the same laptop. On my machine, the setup sequence looks like the following.
- Clone the repository to a working directory of your choice.
- Create a virtual environment with `python3 -m venv venv` and activate it.
- Install dependencies with `pip install -r requirements.txt`.
- Run `python main.py` with default flags to verify charts appear under `output/`.
I also keep generated charts copied into images/ for documentation continuity. That step is not strictly required for execution, but it helps keep the README and article visuals aligned.
Let us run
When the pipeline runs successfully, the terminal prints ASCII tables comparing the supervised phase and the alignment-augmented phase. On my synthetic split, metrics often look strong because the dataset is separable by design. That outcome is useful for debugging plumbing, but it is not a claim about real civic text. Interpreting results responsibly matters more than chasing a flashy number.
The charts compare accuracy and macro F1 between phases. Macro F1 is particularly important when class counts differ, because accuracy alone can hide poor performance on rare labels.
Label distribution visualization is another sanity check. If one department dominates unexpectedly, I know to revisit sampling before trusting any headline metric.
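A distribution check like that is a few lines of standard library code. The `label_distribution` helper below is a hypothetical sketch, not the repository's reporting code.

```python
from collections import Counter

def label_distribution(labels: list[str]) -> dict[str, float]:
    """Return each label's share of the dataset, a quick sanity check
    before trusting any headline metric."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

Comparing this dict before and after augmentation shows at a glance whether preference duplication skewed any department's share of the training set.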
Theory interlude: why linear models still deserve respect
It is tempting to assume that only large models deserve the label “agent.” In my opinion, that assumption mixes capability with agency. A small linear policy can still be embedded inside a broader agentic system that handles tool calls, retrieval, and escalation. The routing policy in this PoC is a single decision node, not the entire automation story. Thinking about nodes separately helps me reason about failure isolation. If routing fails, I can swap the node without rewriting unrelated orchestration code.
From a mathematical angle, multinomial logistic regression estimates a convex problem under typical regularization assumptions. Convexity does not guarantee perfect generalization, but it does provide a stable training baseline when compared with some deep model training loops that require careful tuning of learning rates and batch sizes. Stability matters when you are iterating nightly on a side project without a cluster.
Edge cases I worry about even in a toy setup
- Multi-issue messages that require splitting or multi-label prediction. This PoC uses single-label classification only.
- Language diversity and informal spelling. The synthetic generator uses English templates with light noise, not multilingual corpora.
- Seasonal effects such as leaf pickup schedules or snow removal windows that change routing rules over time.
- Equity concerns when certain neighborhoods file more tickets simply because access channels differ. A routing model can inherit those structural biases if trained blindly.
None of those issues disappears because the code is short. They are reminders that a serious path forward requires collaboration with domain experts and ongoing monitoring.
Ethics and data handling
I chose synthetic data because citizen text can include names, addresses, and medical references even when the intake form is labeled as non-emergency. Public repositories are the wrong place for that material unless there is a rigorous governance process. In my experiments, synthetic templates let me discuss routing ideas without crossing privacy boundaries. If I ever move toward real data, I would treat consent, retention limits, and redaction as prerequisites rather than afterthoughts.
Future roadmap for myself
- Introduce a calibration layer so probability outputs map more reliably to operational thresholds.
- Explore multi-label classification for compound requests, likely with a different architecture than plain multinomial logistic regression.
- Add a human review queue simulation that measures how often uncertain predictions would escalate.
- Experiment with character n-grams or lightweight embeddings while keeping the repository easy to run on CPU hardware.
Deeper notes on TF-IDF and why I still reach for it
Term frequency-inverse document frequency is not new. In my opinion, that is a feature rather than a flaw when you are writing about routing policies that must be explained in a public meeting. TF-IDF highlights discriminative words relative to the corpus without requiring GPU memory. It pairs naturally with linear models, and linear models yield coefficients that can be inspected if someone asks why a particular ticket leaned toward utilities instead of streets.
I also appreciate the control it gives me over n-gram breadth. In this PoC, I allowed unigrams and bigrams through the vectorizer configuration in the modeling module. Bigrams capture short phrases such as “dog off leash” that unigrams might fragment. The trade-off is a larger feature space and a higher chance of spurious bigrams if the dataset is tiny. Because I generated hundreds of rows per class, the bigram signal remained reasonably stable across random seeds in my local tests.
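A pipeline along those lines might look like the following. The `fit_sft` name mirrors the function used in the orchestration code, but the hyperparameters shown here are illustrative defaults rather than the repository's exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def fit_sft(texts: list[str], labels: list[str], seed: int = 42) -> Pipeline:
    # Unigrams plus bigrams, as discussed above; regularization and
    # iteration budget are sensible defaults, not tuned values.
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("clf", LogisticRegression(max_iter=1000, random_state=seed)),
    ])
    pipe.fit(texts, labels)
    return pipe
```

Because the whole policy is one `Pipeline` object, swapping the vectorizer for character n-grams later means changing a single constructor argument rather than rewiring the training loop.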
There is an honest limitation: TF-IDF does not understand paraphrase. If a citizen writes “hydrant leaking” versus “fire plug dripping,” the model might treat those as unrelated unless the training distribution includes both phrasings. In a real program, I would expect continuous vocabulary drift and periodic retraining. For this experimental article, I accepted that limitation and focused on making the training loop legible.
What I mean by supervised fine-tuning in this context
When people say “supervised fine-tuning” around large language models, they usually mean updating many parameters on instruction data. Here, the phrase is intentionally more literal: supervised training of a classifier head on labeled examples. I use the SFT language because the staged story mirrors how larger agent stacks are discussed, even though the parameter count is tiny.
The staging matters psychologically. In my experience, separating a baseline fit from a later refinement step helps me debug where a regression was introduced. If the second stage suddenly collapses accuracy, I know to inspect augmentation or duplication rates rather than the raw tokenizer. That kind of isolation is harder when everything happens inside one opaque fine-tuning run.
Preference alignment without a full DPO implementation
Direct preference optimization and related algorithms deserve their place in the research landscape. They also deserve a caution label for small solo projects that cannot afford extensive hyperparameter sweeps. I chose a transparent approximation: duplicate chosen labels, sprinkle in occasional negation hints, and refit the classifier. The goal is not to reproduce a paper result. The goal is to capture the intuition that pairwise feedback shifts the decision boundary.
If I squint, the augmentation resembles importance sampling toward reviewer-approved actions. If I squint less generously, it is just oversampling. Both perspectives are useful. The first perspective keeps me aligned with how I talk about learning from feedback in agent design. The second perspective keeps me humble about claims I can make in a public write-up.
Evaluation choices and why macro averaging matters
Accuracy is a convenient headline number, but it can lie when classes are imbalanced. Macro-averaged F1 computes metrics per class and averages them, giving rare departments a louder voice in the aggregate. In civic routing, rare classes are often the ones with the highest operational risk if misrouted.
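Both numbers are one call each in scikit-learn; the helper name below is illustrative.

```python
from sklearn.metrics import accuracy_score, f1_score

def metrics_for_preds(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    # Macro averaging weights every class equally regardless of support,
    # so a rare department that is always misrouted drags the score down.
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
```

On a toy split where nine of ten tickets belong to one class and a model predicts that class everywhere, accuracy reads 0.9 while macro F1 falls below 0.5, which is exactly the gap macro averaging is meant to expose.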
I also log both phases on the same holdout split to avoid accidental optimism from resampling the test set. In my experiments, holding the test set fixed while changing training augmentation is the minimum bar for a fair comparison. I mention this because it is easy to cheat yourself with a sloppy split when synthetic data feels harmless.
Failure modes that showed up when I stress-tested my assumptions
Even with templated text, I found ways to break my own mental model. For example, if I lowered the number of examples per class too far, variance spiked and the confusion matrix looked ugly in ways that were not instructive, only noisy. If I pushed the preference pair rate too high, training time grew and the model began to overemphasize duplicated rows unless I watched the oversampling multiplier.
Another failure mode is more human: if I describe this PoC to someone as “AI that solves 311,” I am overselling it. Language shapes expectations. I prefer to describe it as a routing policy prototype with explicit training stages and visible metrics. That framing keeps the conversation grounded.
Monitoring and observability in a hypothetical next phase
If I were evolving this into a supervised pilot rather than a local script, I would want basic monitoring hooks even before considering fancy agents. At minimum, I would track label distribution over time, confidence histograms per department, and a sample of low-confidence predictions for manual review. None of that requires deep learning. It requires discipline.
I would also log model versions next to dataset hashes. In my opinion, reproducibility is part of safety. When someone asks why routing changed in March, I want to point to a dataset diff and a training configuration diff, not a shrug.
Security considerations for intake interfaces
Routing is only one layer. Real systems must handle authentication for staff tools, rate limits for public endpoints, and prompt-injection-like attacks where a citizen pastes instructions meant to confuse downstream automation. This PoC does not implement those protections. I am mentioning them because a public article about civic automation should not pretend the model exists in a vacuum.
Accessibility and channel fairness
Not everyone files a request through the same channel. Phone, web, and mobile apps produce different kinds of text and different kinds of errors. A model trained predominantly on web forms might underperform on transcribed phone calls. I did not simulate those channels separately in this repository. From my perspective, that is a future split worth modeling if the goal ever stops being educational.
Comparing this PoC to large-model fine-tuning stories
The high-level arc rhymes with bigger systems: train a policy, incorporate human preference signals, evaluate. The differences are scale, compute, and representation. Large models can generalize across phrasing with fewer explicit templates. Small models can be audited with a spreadsheet mindset. I am not arguing one replaces the other. I am arguing that practicing the arc at small scale sharpens the questions I ask when I read about large-scale alignment work.
What I would measure in a more realistic dataset pilot
If I ever graduate beyond synthetic templates, I would start with offline metrics on redacted logs, then move to shadow mode where predictions are logged but not acted upon, and only then consider limited automation with human escalation paths. Each gate exists to reduce the risk of silent harm. The sequencing is more important than any particular classifier architecture.
Personal reflection on solo experimentation
Working alone on this kind of spike has advantages and drawbacks. The advantage is speed. I can rename a module on a whim without coordinating across roles. The drawback is blind spots. I compensate by writing diagrams, running the same script under multiple seeds, and documenting limitations aggressively. That discipline does not eliminate bias, but it reduces the chance that I mistake a tidy synthetic world for the messiness of real operations.
Additional notes on plotting and communication
Charts are not ornamentation here. They are a contract with the reader. When I compare two training phases side by side, I force myself to confront whether the second phase genuinely moved the metrics I claim matter. If the plot looks flat, I do not hand-wave. I explain why flatness might be acceptable, such as a separable dataset where both models saturate, or I revisit the augmentation recipe.
Reproducibility checklist I used while preparing this article
- Pin random seeds in data generation and model fitting.
- Keep the evaluation split stable across training variants.
- Store charts as files so visual results are reviewable without rerunning.
- Avoid nondeterministic operations where possible, and accept that some BLAS operations may still introduce tiny drift.
How I think about versioning datasets and models together
One habit I picked up from earlier experiments is to treat datasets like code. If I change a template string in the synthetic generator, I should think of that as a dataset version bump even when the Git commit message talks about “just a wording tweak.” Small wording changes can shift term frequencies enough to alter which n-grams dominate. In a toy project, the stakes are low. In a pilot, the stakes are higher because departments may rely on trend lines that assume comparable distributions over time.
When I snapshot metrics, I try to record not only accuracy and macro F1 but also the training row counts before and after augmentation. Row counts tell a story about how aggressively preference duplication reshaped the effective loss landscape. If the augmented dataset balloons by an order of magnitude, I expect different regularization needs even within linear models.
Calibration and confidence: what I wish I added sooner
Probabilities from logistic regression can be overconfident, especially when features separate cleanly. I did not implement Platt scaling or isotonic regression in this repository because I wanted to keep the first iteration narrow. Looking back, a calibration section would make the PoC more instructive for readers who want to map scores to “send to human review if below threshold” workflows. That mapping is where many real systems spend their engineering time.
If I add calibration later, I would hold out a separate calibration split to avoid information leakage from evaluation metrics. The distinction sounds pedantic until you realize how easy it is to accidentally tune thresholds on the same rows you report as performance.
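If I do add it, scikit-learn's `CalibratedClassifierCV` keeps the sketch short: letting it carve its own internal cross-validation folds out of the training data is one way to keep the evaluation split untouched. The function below is a hypothetical shape; `cv=3` and the sigmoid (Platt-style) method are illustrative choices, not the repository's settings.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

def fit_calibrated(X, y, seed: int = 0):
    # Sigmoid calibration fitted on internal CV folds, so calibration never
    # sees the rows reported as test performance.
    model = CalibratedClassifierCV(
        LogisticRegression(max_iter=1000, random_state=seed),
        method="sigmoid", cv=3,
    )
    model.fit(X, y)
    return model
```

The payoff is that `predict_proba` outputs become more trustworthy inputs for a "send to human review below this threshold" rule, which is where the real engineering time goes.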
Narrative lessons from building the terminal output
The ASCII table formatting took longer than expected relative to its mathematical complexity. That is common when polish matters. I wanted the output to resemble a batch report because, in my experience, stakeholders trust artifacts that look like logs they already read. A wall of unstructured print statements signals hobby project. A bordered table signals intentionality.
The same principle applies to README quality. A repository with crisp diagrams and a clear run command earns attention in a way that scattered scripts do not. I am not claiming aesthetics replace correctness. I am claiming clarity reduces friction when someone else tries to reproduce your work months later.
What I would test if I add unit tests in a follow-up commit
- Label integrity: every generated label must be a member of the known department enumeration.
- Deterministic splits: with a fixed seed, the train and test partitions should be identical across runs.
- Metric sanity: accuracy should fall between zero and one, and macro F1 should not exceed one.
- Augmentation invariants: preference augmentation should never drop rows below the base training size unless explicitly intended.
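In pytest form, those invariants might look like the sketch below, written against minimal stand-in helpers rather than the repository's actual functions.

```python
# Hypothetical pytest sketch; generate_labeled and augment are stand-ins.
import random

DEPARTMENTS = {"streets", "parks", "utilities", "code_enforcement", "noise", "other"}

def generate_labeled(n_per_class: int, seed: int) -> list[tuple[str, str]]:
    rng = random.Random(seed)
    return [(f"request {rng.random():.6f}", label)
            for label in sorted(DEPARTMENTS) for _ in range(n_per_class)]

def augment(texts, labels, pairs, oversample: int):
    texts, labels = list(texts), list(labels)
    for text, chosen, _rejected in pairs:
        texts.extend([text] * oversample)
        labels.extend([chosen] * oversample)
    return texts, labels

def test_label_integrity():
    assert all(lbl in DEPARTMENTS for _, lbl in generate_labeled(5, seed=1))

def test_deterministic_generation():
    assert generate_labeled(4, seed=9) == generate_labeled(4, seed=9)

def test_augmentation_never_shrinks():
    t, l = augment(["a", "b"], ["streets", "parks"],
                   [("a", "streets", "noise")], oversample=3)
    assert len(t) >= 2 and len(t) == len(l)
```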
Tests like these are small, but they catch embarrassing regressions when refactoring. They also document assumptions for future me, who will not remember why a function behaved a certain way on a late-night edit.
Communication boundaries when writing about civic technology
Public sector language is sensitive. I avoided describing any real jurisdiction, and I avoided implying that a municipality endorsed this work. I also avoided framing the PoC as a replacement for human intake workers. In my opinion, the best technical articles in this space acknowledge labor realities. Routing assistance should reduce repetitive triage, not erase human judgment from escalations.
Why I kept the stack small even when larger libraries are available
There is a cultural pull toward using the newest toolkit on every project. I understand the impulse. I also know that dependency weight matters for readers who clone a repository on a work laptop with limited install privileges. scikit-learn and Matplotlib are widely approved in enterprise environments compared with some deep learning stacks. That practical fact influenced my choices as much as modeling purity did.
A longer note on class balance and synthetic generation
Balanced classes per department make classroom demonstrations easier, but they can mislead you about deployment conditions where some routes are rare. I balanced classes here because I wanted clean learning curves while iterating on augmentation logic. If I simulate imbalance later, I would adjust metrics accordingly and probably introduce class weights or resampling strategies. The point is not to chase one recipe forever. The point is to match the evaluation setup to the question being asked.
How I would document an escalation policy in a future iteration
An escalation policy belongs in prose first, then in code. For example, if confidence is below a threshold, route to a human queue and attach the top three candidate departments with scores. If two departments are within a small margin, attach both and avoid pretending the model is decisive. Writing those rules down forces me to confront ambiguity instead of hiding behind a single argmax label.
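That prose policy translates almost line for line into code. The wrapper below is a hypothetical sketch around any fitted scikit-learn pipeline; the threshold and margin defaults are placeholders, not tuned numbers.

```python
import numpy as np

def route_with_escalation(model, text: str,
                          threshold: float = 0.6, margin: float = 0.1) -> dict:
    """Route a request, or escalate with the top candidate departments
    attached when the model is not decisive."""
    probs = model.predict_proba([text])[0]
    order = np.argsort(probs)[::-1]  # class indices, highest probability first
    candidates = [(str(model.classes_[i]), float(probs[i])) for i in order[:3]]
    best, runner_up = probs[order[0]], probs[order[1]]
    # Low confidence, or two departments within a small margin: defer to a human.
    if best < threshold or best - runner_up < margin:
        return {"decision": "escalate_to_human", "candidates": candidates}
    return {"decision": str(model.classes_[order[0]]), "candidates": candidates}
```

Writing the rule as a wrapper keeps the escalation policy out of the model itself, so the threshold can change in review without retraining anything.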
Reflection on reading research papers versus shipping small prototypes
Reading about alignment methods is different from wiring even a simplified version into a repository. The distance between the two activities used to frustrate me. Over time, I reframed it. A simplified implementation is not a shallow imitation if the goal is to build intuition. The CivicTriage-AI PoC is my attempt to keep the wiring honest while staying within evenings-and-weekends effort.
What I track mentally when comparing training phases
Beyond headline metrics, I watch training row counts, augmentation counts, and whether the second model remains stable on obvious base cases. If the second model degrades on obvious cases, that is a sign that augmentation introduced conflicting signal or that duplication overwhelmed the original distribution. In my experiments, monitoring both phases on the same holdout made those conversations concrete rather than speculative.
Closing thoughts
This repository is a personal spike, not a recommendation for any city to adopt wholesale. I wrote it to practice structuring a training narrative that moves from supervised learning to preference-informed refinement without losing transparency. Along the way, I re-learned that the most persuasive demos are often the ones where the math is simple enough to inspect and the limitations are stated plainly.
If you fork the code, treat it as a starting sketch. Replace the synthetic generator with data that matches your governance constraints, expand evaluation beyond accuracy, and connect predictions to real workflows only after you have a monitoring plan. From my perspective, that is the difference between an educational artifact and a responsible pilot.
Repository
All source code and visual assets for this experimental article are available at https://github.com/aniket-work/CivicTriage-AI.
Disclaimer
The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented.
Tags: python, machinelearning, agents, civtech






