<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gagandeep Singh</title>
    <description>The latest articles on DEV Community by Gagandeep Singh (@gagan1985).</description>
    <link>https://dev.to/gagan1985</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4011900%2F9f962f5d-c2a2-49df-9433-d1acdb6d6bdf.jpeg</url>
      <title>DEV Community: Gagandeep Singh</title>
      <link>https://dev.to/gagan1985</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gagan1985"/>
    <language>en</language>
    <item>
      <title>Building an Open-Source Indian Address Parser: From Raw MCA/Bank Data to a Fine-Tuned LLM</title>
      <dc:creator>Gagandeep Singh</dc:creator>
      <pubDate>Thu, 02 Jul 2026 08:32:43 +0000</pubDate>
      <link>https://dev.to/gagan1985/building-an-open-source-indian-address-parser-from-raw-mcabank-data-to-a-fine-tuned-llm-2n9c</link>
      <guid>https://dev.to/gagan1985/building-an-open-source-indian-address-parser-from-raw-mcabank-data-to-a-fine-tuned-llm-2n9c</guid>
      <description>&lt;p&gt;&lt;em&gt;Cross-posting the full pipeline — data labeling, LoRA fine-tuning, cross-framework conversion, and a benchmark against an existing NER model — because most of the interesting bugs weren't in the ML at all.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Indian addresses are notoriously unstructured. A single line can look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;House number, building name, street, locality, district, state, and pincode — all jammed into one free-text string with zero consistent formatting. If you've worked with Indian company registry data, bank KYC records, or delivery logistics, you already know this pain.&lt;/p&gt;

&lt;p&gt;I set out to build something that turns strings like the above into:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"houseNumber"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FLAT NO.32"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"houseName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"UTTARA TOWERS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"street"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MG ROAD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"city"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GUWAHATI"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"district"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Kamrup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"state"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pincode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"781029"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"poi"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"subsubLocality"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"subLocality"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"locality"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"village"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"subDistrict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;13 fields, always present, &lt;code&gt;null&lt;/code&gt; when absent. Here's the whole pipeline, warts included.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting labeled data without a labeling budget
&lt;/h2&gt;

&lt;p&gt;Starting point: 4.37M raw addresses from two very differently-shaped sources — Indian MCA (Ministry of Corporate Affairs) company registrations, and bank/business-correspondent branch records. No labels.&lt;/p&gt;

&lt;p&gt;Manual labeling doesn't scale to that volume, so the pipeline is layered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Rule-based tagging&lt;/strong&gt; — regex + gazetteer cross-checks (pincode → district/state lookup from India Post's official pincode CSV) give every record a confidence score. High-confidence ones auto-accept as "silver" labels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-assisted labeling for the rest&lt;/strong&gt; — batched calls to an LLM via OpenRouter, with a system prompt that requires every extracted value to be copied &lt;em&gt;verbatim&lt;/em&gt; from the source text. If the model's field value isn't a substring of the input, it gets dropped rather than trusted. This alone eliminates a whole class of hallucination.&lt;/li&gt;
&lt;li&gt;A small &lt;strong&gt;human-reviewed slice&lt;/strong&gt; as a sanity check on the LLM's labeling accuracy before scaling up.&lt;/li&gt;
&lt;/ol&gt;
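&lt;p&gt;The verbatim-copy rule in step 2 can be sketched in a few lines. This is a minimal illustration rather than the pipeline's actual code: the helper name &lt;code&gt;keep_verbatim_fields&lt;/code&gt; and the sample LLM output are hypothetical, and the field list is taken from the example JSON above.&lt;/p&gt;

```python
import json

# The 13 schema fields, taken from the example output in this post.
FIELDS = (
    "houseNumber", "houseName", "street", "city", "district", "state",
    "pincode", "poi", "subsubLocality", "subLocality", "locality",
    "village", "subDistrict",
)


def keep_verbatim_fields(source: str, predicted: dict) -> dict:
    """Return all 13 fields, keeping a value only if it appears verbatim in source."""
    cleaned = {}
    for field in FIELDS:
        value = predicted.get(field)
        # Non-strings, empty strings, and values not copied from the source become None.
        if isinstance(value, str) and value and value in source:
            cleaned[field] = value
        else:
            cleaned[field] = None
    return cleaned


raw = "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
# Hypothetical LLM response: "Assam" was expanded by the model, not copied from the text.
llm_output = {
    "houseNumber": "FLAT NO.32",
    "city": "GUWAHATI",
    "state": "Assam",
    "pincode": "781029",
}
result = keep_verbatim_fields(raw, llm_output)
print(json.dumps(result, indent=2))
```

&lt;p&gt;The substring test is case-sensitive on purpose in this sketch, so a "helpfully" normalized value gets dropped just like an invented one.&lt;/p&gt;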

&lt;p&gt;One subtlety that actually mattered: MCA addresses have a machine-generated tail like &lt;code&gt;"...Kamrup Unclassified AS 781029"&lt;/code&gt;, where &lt;code&gt;"Unclassified"&lt;/code&gt; is a fixed placeholder meaning "no sub-district classification recorded" — not a place name. Early runs had the LLM tagging &lt;code&gt;"Unclassified"&lt;/code&gt; as a &lt;code&gt;subDistrict&lt;/code&gt; value. Fixed by explicitly teaching the model about this convention in the prompt. Small thing, but it's the kind of domain quirk no generic address parser would know to avoid.&lt;/p&gt;

&lt;p&gt;Also worth calling out: &lt;strong&gt;field taxonomy design is harder than model training&lt;/strong&gt;. The first schema (Google Maps' full geocoding component taxonomy, 35 types) was too granular for anyone — human or LLM — to label consistently. Collapsed it to 13 fields based on what a human reviewer could actually apply without agonizing over edge cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine-tuning
&lt;/h2&gt;

&lt;p&gt;LoRA on &lt;code&gt;Qwen/Qwen3-0.6B&lt;/code&gt;, trained via MLX on an M4 Mac (&lt;code&gt;mlx-lm&lt;/code&gt;'s &lt;code&gt;lora&lt;/code&gt; command — genuinely pleasant to work with on Apple Silicon, no CUDA/bitsandbytes wrangling).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;16, alpha=32, dropout=0.05&lt;/span&gt;
&lt;span class="py"&gt;target_modules&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj&lt;/span&gt;
&lt;span class="err"&gt;16&lt;/span&gt; &lt;span class="err"&gt;of&lt;/span&gt; &lt;span class="err"&gt;28&lt;/span&gt; &lt;span class="err"&gt;layers&lt;/span&gt; &lt;span class="err"&gt;fine-tuned,&lt;/span&gt; &lt;span class="err"&gt;2000&lt;/span&gt; &lt;span class="err"&gt;iterations,&lt;/span&gt; &lt;span class="err"&gt;~1.8&lt;/span&gt; &lt;span class="err"&gt;hours&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Results on a 237-example held-out gold test set:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON parse rate&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean per-field accuracy&lt;/td&gt;
&lt;td&gt;82.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overall exact match (all fields)&lt;/td&gt;
&lt;td&gt;30.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gap between per-field accuracy and exact match is the interesting bit. Digging into the disagreements, most of it isn't the model being wrong; it's &lt;strong&gt;schema ambiguity&lt;/strong&gt;. &lt;code&gt;locality&lt;/code&gt;/&lt;code&gt;subLocality&lt;/code&gt;/&lt;code&gt;subsubLocality&lt;/code&gt;/&lt;code&gt;village&lt;/code&gt; all represent the same "named area, different granularity" concept, and even the gold labels are sometimes inconsistent about which bucket a given place name belongs in (I found gold records where the &lt;em&gt;same string&lt;/em&gt; was labeled as both &lt;code&gt;locality&lt;/code&gt; and &lt;code&gt;village&lt;/code&gt;). Because exact match requires all 13 fields to agree, a single bucket disagreement fails the whole record. That's a taxonomy problem, not a model problem, and no amount of additional training fixes it without a firmer labeling convention.&lt;/p&gt;
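&lt;p&gt;To make the two metrics concrete, here is a small sketch of how they can diverge. It assumes per-field accuracy is averaged over all 13 fields of every record, nulls included; the post doesn't spell out the exact scoring rule, so treat &lt;code&gt;score&lt;/code&gt; and the toy records as illustrative only.&lt;/p&gt;

```python
FIELDS = (
    "houseNumber", "houseName", "street", "city", "district", "state",
    "pincode", "poi", "subsubLocality", "subLocality", "locality",
    "village", "subDistrict",
)


def score(gold: list, pred: list) -> tuple:
    """Return (mean per-field accuracy, exact-match rate) over paired records."""
    field_hits = 0
    exact_hits = 0
    for g, p in zip(gold, pred):
        # A missing key counts as None, so "absent" and "null" compare equal.
        matches = [g.get(f) == p.get(f) for f in FIELDS]
        field_hits += sum(matches)
        exact_hits += all(matches)
    n = len(gold)
    return field_hits / (n * len(FIELDS)), exact_hits / n


# Toy data: the second prediction puts the same string in a neighbouring bucket.
gold = [
    {"city": "GUWAHATI", "pincode": "781029"},
    {"city": "GUWAHATI", "locality": "SOME AREA"},
]
pred = [
    {"city": "GUWAHATI", "pincode": "781029"},
    {"city": "GUWAHATI", "village": "SOME AREA"},
]
mean_field_acc, exact_match = score(gold, pred)
print(f"per-field: {mean_field_acc:.3f}, exact match: {exact_match:.3f}")
```

&lt;p&gt;One misplaced place name costs two fields out of 26 here, but half of the exact-match score.&lt;/p&gt;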

&lt;h2&gt;
  
  
  Getting it to run outside MLX
&lt;/h2&gt;

&lt;p&gt;This is where most of the actual debugging time went, and none of it was ML.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mlx-lm&lt;/code&gt; produces its own adapter format — not PEFT-compatible. To make the model usable on CUDA/CPU (not just Apple Silicon), I had to hand-derive the weight conversion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# mlx-lm: lora_a [in_features, r], lora_b [r, out_features], used as x @ A @ B
# PEFT:   lora_A.weight [r, in_features], lora_B.weight [out_features, r]
# So: peft_A = mlx_a.T, peft_B = mlx_b.T
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I verified this against &lt;code&gt;mlx-lm&lt;/code&gt;'s own &lt;code&gt;fuse()&lt;/code&gt; source (&lt;code&gt;delta = (scale * lora_b.T) @ lora_a.T&lt;/code&gt;) rather than trusting my own derivation, then confirmed numerically — ran the same 15 addresses through both the original MLX adapter and the converted PEFT version. 13/15 identical outputs; the 2 mismatches landed exactly on the already-known-ambiguous fields, consistent with floating-point differences between backends on a near-tied softmax decision rather than a conversion bug.&lt;/p&gt;
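&lt;p&gt;The transpose relationship is easy to check on toy matrices. The sketch below uses plain Python lists and integers so the comparison is exact; real adapter tensors would go through numpy or torch, and how the scale is stored in each format is outside this sketch. All helper names are hypothetical.&lt;/p&gt;

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def scaled(m, s):
    return [[s * v for v in row] for row in m]


# Toy shapes: in_features=3, r=2, out_features=4.
mlx_a = [[1, 2], [3, 4], [5, 6]]      # [in_features, r]
mlx_b = [[1, 0, 2, 1], [0, 1, 1, 3]]  # [r, out_features]
scale = 2

# Conversion from the comment block above.
peft_A = transpose(mlx_a)  # [r, in_features]
peft_B = transpose(mlx_b)  # [out_features, r]

x = [[1, 1, 1]]  # a single input row, [1, in_features]

# mlx-lm style update: scale * (x @ A @ B)
mlx_out = scaled(matmul(matmul(x, mlx_a), mlx_b), scale)

# PEFT style: delta weight [out_features, in_features] = scale * (B @ A), applied as x @ delta.T
peft_delta = scaled(matmul(peft_B, peft_A), scale)
peft_out = matmul(x, transpose(peft_delta))

# The fuse formula quoted above: (scale * lora_b.T) @ lora_a.T
fused_delta = matmul(scaled(transpose(mlx_b), scale), transpose(mlx_a))

print(mlx_out == peft_out, fused_delta == peft_delta)
```

&lt;p&gt;Both comparisons agreeing on integers says the shapes and transposes line up; the floating-point differences mentioned above only show up with real weights and real backends.&lt;/p&gt;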

&lt;h2&gt;
  
  
  Publishing, and the dependency-floor whack-a-mole
&lt;/h2&gt;

&lt;p&gt;Published the model to Hugging Face (both formats — PEFT at root, MLX in a subfolder), then wrapped it as a &lt;code&gt;pip install&lt;/code&gt;-able package: &lt;a href="https://pypi.org/project/indian-address-parser/" rel="noopener noreferrer"&gt;&lt;code&gt;indian-address-parser&lt;/code&gt;&lt;/a&gt; on PyPI, source on &lt;a href="https://github.com/innerkorehq/indian-address-parser" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Then real users tried to install it into their existing environments (Anaconda base envs, specifically), and things broke in sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;peft&lt;/code&gt; imports &lt;code&gt;transformers.BloomPreTrainedModel&lt;/code&gt;&lt;/strong&gt;, whose lazy-loading chain unconditionally does &lt;code&gt;import tensorflow&lt;/code&gt;. In a conda env with a mismatched TF/numpy/h5py install, that import crashed at load time, even though the parser never uses TensorFlow. Fix: set &lt;code&gt;os.environ["USE_TF"] = "0"&lt;/code&gt; before any transformers/peft import, so transformers' TF detection short-circuits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;qwen3&lt;/code&gt; model type not recognized.&lt;/strong&gt; Turns out &lt;code&gt;transformers&lt;/code&gt; only added Qwen3 support at exactly version &lt;code&gt;4.51.0&lt;/code&gt; — verified by bisecting real PyPI releases (&lt;code&gt;4.50.0&lt;/code&gt;: no, &lt;code&gt;4.51.0&lt;/code&gt;: yes). My dependency floor (&lt;code&gt;&amp;gt;=4.45.0&lt;/code&gt;) was loose enough that pip left an old transformers in place instead of upgrading it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;hf_hub_download() got an unexpected keyword argument 'use_auth_token'&lt;/code&gt;.&lt;/strong&gt; &lt;code&gt;peft&amp;lt;0.18.0&lt;/code&gt; unconditionally passes &lt;code&gt;use_auth_token=None&lt;/code&gt; into &lt;code&gt;hf_hub_download&lt;/code&gt;, regardless of whether the caller asked for it. Recent &lt;code&gt;huggingface_hub&lt;/code&gt; (1.x) dropped that deprecated kwarg entirely. Bisected peft's source across ten versions to find the exact fix boundary (0.17.1: unconditional pass, 0.18.0: conditional via walrus operator).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each fix was verified against the &lt;em&gt;actual reported failure&lt;/em&gt;, not just judged plausible: I built a venv pinned to the exact stale dependency trio from the bug report, installed the patched package, confirmed that pip auto-upgraded everything, and ran real inference before calling it fixed.&lt;/p&gt;

&lt;p&gt;The lesson, if there is one: &lt;strong&gt;&lt;code&gt;&amp;gt;=X.Y.Z&lt;/code&gt; floors need to be the actual minimum that works, verified, not "whatever I happened to have installed while developing."&lt;/strong&gt; Loose floors don't fail for you — they fail for whoever has an older version already sitting in their environment.&lt;/p&gt;
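&lt;p&gt;A quick way to see whether an existing environment sits below those boundaries is a small preflight check. This is a sketch, not part of the package: the helper names are hypothetical, the floors are the version boundaries reported above, and the hand-rolled parser only reads the numeric release segment (a real project would more likely lean on &lt;code&gt;packaging.version&lt;/code&gt;).&lt;/p&gt;

```python
from importlib import metadata

# Version boundaries reported in this post; adjust if the package pins differently.
FLOORS = {"transformers": "4.51.0", "peft": "0.18.0"}


def release_tuple(version: str) -> tuple:
    """Parse the leading numeric release, e.g. '4.51.0.dev0' becomes (4, 51, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions such as '4.51' so tuple comparison behaves as expected.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_floor(installed: str, floor: str) -> bool:
    # Pre-release suffixes are ignored here, which is a simplification.
    return release_tuple(installed) >= release_tuple(floor)


def check_environment(floors: dict = FLOORS) -> dict:
    """Map each package to True/False, or None when it is not installed."""
    report = {}
    for name, floor in floors.items():
        try:
            report[name] = meets_floor(metadata.version(name), floor)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


print(check_environment())
```

&lt;p&gt;Running this inside the environment from a bug report shows which floor is violated before any model code gets imported.&lt;/p&gt;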

&lt;h2&gt;
  
  
  Benchmarking against an existing model
&lt;/h2&gt;

&lt;p&gt;Once things were stable, I compared against &lt;a href="https://huggingface.co/shiprocket-ai/open-tinybert-indian-address-ner" rel="noopener noreferrer"&gt;Shiprocket's &lt;code&gt;open-tinybert-indian-address-ner&lt;/code&gt;&lt;/a&gt; — a 6-layer TinyBERT doing BIO-tagged token classification, a fundamentally different architecture (and a different field taxonomy) than a 0.6B causal LM generating JSON.&lt;/p&gt;

&lt;p&gt;Built an explicit field mapping covering the 9 conceptually overlapping fields (their &lt;code&gt;house_details&lt;/code&gt; ↔ my &lt;code&gt;houseNumber&lt;/code&gt;, &lt;code&gt;road&lt;/code&gt; ↔ &lt;code&gt;street&lt;/code&gt;, etc.) and scored both models against the same 237-example held-out set:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Mine&lt;/th&gt;
&lt;th&gt;Shiprocket's&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;city&lt;/td&gt;
&lt;td&gt;91.3%&lt;/td&gt;
&lt;td&gt;17.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;state&lt;/td&gt;
&lt;td&gt;96.2%&lt;/td&gt;
&lt;td&gt;41.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pincode&lt;/td&gt;
&lt;td&gt;100.0%&lt;/td&gt;
&lt;td&gt;69.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;houseNumber&lt;/td&gt;
&lt;td&gt;84.5%&lt;/td&gt;
&lt;td&gt;27.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Higher accuracy on every shared field — but Shiprocket's model is &lt;strong&gt;~240x faster per address&lt;/strong&gt; (19ms vs 4.6s). That's not a quality artifact, it's architecture: a 6-layer classifier doing a single forward pass vs. autoregressive generation. If your use case needs high-throughput/low-latency parsing over perfect accuracy, that's a legitimate reason to pick the other model. I'd rather publish that tradeoff honestly than pretend the comparison only cuts one way.&lt;/p&gt;
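&lt;p&gt;For a sense of scale, here is the same latency gap applied to a corpus the size of the raw dataset. It's back-of-the-envelope arithmetic that assumes sequential, unbatched inference at the per-address latencies quoted above; batching or different hardware would change both numbers.&lt;/p&gt;

```python
QWEN_SECONDS = 4.6        # per address, from the comparison above
TINYBERT_SECONDS = 0.019  # 19 ms per address
CORPUS_SIZE = 4_370_000   # raw corpus size from this post


def hours_to_parse(n_addresses: int, seconds_each: float) -> float:
    """Wall-clock hours for a sequential, unbatched run."""
    return n_addresses * seconds_each / 3600


speedup = QWEN_SECONDS / TINYBERT_SECONDS
tinybert_hours = hours_to_parse(CORPUS_SIZE, TINYBERT_SECONDS)
qwen_days = hours_to_parse(CORPUS_SIZE, QWEN_SECONDS) / 24

print(f"speedup: ~{speedup:.0f}x")
print(f"TinyBERT: {tinybert_hours:.1f} hours")
print(f"Qwen3 LoRA: {qwen_days:.0f} days")
```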

&lt;h2&gt;
  
  
  Publishing the data too
&lt;/h2&gt;

&lt;p&gt;Also shipped the underlying data as two HF datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/datasets/gagan1985/indian-addresses-raw" rel="noopener noreferrer"&gt;&lt;code&gt;indian-addresses-raw&lt;/code&gt;&lt;/a&gt; — the full 4.37M-record unlabeled corpus&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/datasets/gagan1985/indian-addresses-gold" rel="noopener noreferrer"&gt;&lt;code&gt;indian-addresses-gold&lt;/code&gt;&lt;/a&gt; — 4,834 span-labeled training examples&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before publishing the raw corpus, I found something worth mentioning: bank/BC address records are KYC-style data, and some of them embed real customer phone numbers and relational-name markers (&lt;code&gt;S/O&lt;/code&gt;/&lt;code&gt;D/O&lt;/code&gt;/&lt;code&gt;W/O&lt;/code&gt;/&lt;code&gt;C/O&lt;/code&gt;: "son of", "daughter of", "wife of", "care of", all standard on Indian address forms). That's different from MCA's superficially similar &lt;code&gt;C/O &amp;lt;company director&amp;gt;&lt;/code&gt; convention, which is already public disclosure. Wrote a targeted redaction pass for the bank source (verified against the corpus, not assumed; caught a "Door No." vs "D/O [name]" false-positive collision along the way). For the gold dataset specifically, I &lt;strong&gt;dropped&lt;/strong&gt; the small number of affected records instead of redacting in place, since redacting text shifts the character offsets that the span labels depend on.&lt;/p&gt;
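&lt;p&gt;A rough sketch of what such a redaction pass might look like, and why in-place redaction breaks span labels. The patterns here are hypothetical stand-ins (the real rules aren't shown in this post), and the sample address, name, and phone number are made up for illustration.&lt;/p&gt;

```python
import re

# Ten-digit Indian mobile numbers start with 6-9; six-digit pincodes are left alone.
PHONE = re.compile(r"\b[6-9]\d{9}\b")
# Requiring the slash keeps "Door No." from being read as "D/O [name]".
# The match runs up to the next comma, or to the end of the string if there is none.
RELATION = re.compile(r"\b[SDW]\s*/\s*O\b\.?\s*[^,]*", re.IGNORECASE)

MASK = "[REDACTED]"


def redact(text: str) -> str:
    """Mask relational-name clauses first, then phone numbers."""
    text = RELATION.sub(MASK, text)
    return PHONE.sub(MASK, text)


sample = "S/O SHYAM LAL, Door No. 12, MG ROAD, PH 9876543210, GUWAHATI 781029"
redacted = redact(sample)
print(redacted)

# A character-offset span label computed on the original text...
start = sample.index("MG ROAD")
end = start + len("MG ROAD")
# ...no longer points at the same words once the text has been redacted.
print(repr(sample[start:end]), "becomes", repr(redacted[start:end]))
```

&lt;p&gt;The mask is shorter than the clause it replaces, so every offset after it moves. Dropping the affected gold records avoids having to recompute spans at all.&lt;/p&gt;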

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;indian-address-parser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;indian_address_parser&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AddressParser&lt;/span&gt;

&lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AddressParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# pulls weights from HF automatically
&lt;/span&gt;&lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything's open source and Apache 2.0: &lt;a href="https://huggingface.co/gagan1985/qwen3-0.6b-indian-address-parser" rel="noopener noreferrer"&gt;model&lt;/a&gt; · &lt;a href="https://github.com/innerkorehq/indian-address-parser" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://pypi.org/project/indian-address-parser/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; · &lt;a href="https://huggingface.co/datasets/gagan1985/indian-addresses-gold" rel="noopener noreferrer"&gt;datasets&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feedback and PRs welcome, especially on the locality/subLocality boundary ambiguity — I have a hypothesis for a firmer labeling convention that would help, but haven't tested whether it actually resolves the disagreement rate or just moves it around.&lt;/p&gt;

</description>
      <category>python</category>
      <category>huggingface</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
