Ansuman Jaiswal

I cleaned India's Census 2011 data so you never have to

Every Indian data scientist hits the same wall.

You need district-level population data. You go to censusindia.gov.in.
You find hundreds of inconsistent Excel files with merged headers,
footnote rows, and zero documentation.

You spend a full day just loading the data before doing any actual analysis.

I fixed that. Once. For everyone.

What I built

indiaset/census-2011
India's Census 2011 district data, clean, typed, and ready for pandas.
640 districts · 29 columns · 0 missing values
Validated against official India total · LGD codes attached

Load it in 4 lines

from huggingface_hub import hf_hub_download
import pandas as pd

path = hf_hub_download(
    repo_id="indiaset/census-2011",
    filename="census_2011_districts_final.parquet",
    repo_type="dataset"
)
df = pd.read_parquet(path)
print(df.shape)  # (640, 29)

What's in it

| Column | Description |
| --- | --- |
| state_code | Census 2011 state code |
| state_name | Official state/UT name |
| district_code | Census 2011 district code |
| district_name | District name as per Census |
| lgd_code | LGD permanent district code |
| district_name_lgd | District name as per LGD |
| pop_total | Total population |
| pop_male | Male population |
| pop_female | Female population |
| pop_under6_total | Children under 6 years |
| pop_sc | Scheduled Caste population |
| pop_st | Scheduled Tribe population |
| literate_total | Literate persons |
| literate_male | Literate males |
| literate_female | Literate females |
| illiterate_total | Illiterate persons |
| workers_total | Total workers |
| workers_male | Male workers |
| workers_female | Female workers |
| non_workers_total | Non-workers |
| literacy_rate | Literate / Total × 100 |
| sex_ratio | Females per 1000 males |
| workforce_participation | Workers / Total × 100 |
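
The three ratio columns can be recomputed from the raw counts, which is a handy sanity check. Here is a minimal sketch that follows the definitions in the table; the function name `add_derived_columns` and the toy numbers are made up for illustration, and the published file may round these values differently.

```python
import pandas as pd


def add_derived_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Recompute the ratio columns from raw counts (hypothetical helper)."""
    out = df.copy()
    # Literates as a share of total population, per the table's definition.
    out["literacy_rate"] = out["literate_total"] / out["pop_total"] * 100
    # Females per 1000 males.
    out["sex_ratio"] = out["pop_female"] / out["pop_male"] * 1000
    # Workers as a share of total population.
    out["workforce_participation"] = out["workers_total"] / out["pop_total"] * 100
    return out


# Toy rows with invented numbers, only to show the arithmetic.
toy = pd.DataFrame({
    "pop_total": [1000, 2000],
    "pop_male": [500, 1000],
    "pop_female": [500, 1000],
    "literate_total": [750, 1000],
    "workers_total": [400, 1000],
})
derived = add_derived_columns(toy)
print(derived[["literacy_rate", "sex_ratio", "workforce_participation"]])
```

One thing to keep in mind: with total population as the denominator, `literacy_rate` here is lower than the official Census literacy rate, which is computed over the population aged 7 and above.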

The validation

The most important test - do all 640 district populations
sum to India's official total?

print(df['pop_total'].sum())
# 1210854977 ✅ — exact match, zero discrepancy
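
The national total is the headline check, but a few row-level identities are worth testing too. The sketch below is not taken from the pipeline; `check_consistency` is a hypothetical helper, and it assumes the columns keep the usual census identities (male + female = total, literate + illiterate = total). The toy numbers are invented.

```python
import pandas as pd


def check_consistency(df: pd.DataFrame, expected_total: int) -> dict:
    """Return a name -> bool map of basic integrity checks (hypothetical helper)."""
    return {
        # District populations should add up to the national figure.
        "sum_matches_total": int(df["pop_total"].sum()) == expected_total,
        # Assumed identity: male + female = total, row by row.
        "male_plus_female": bool(
            (df["pop_male"] + df["pop_female"] == df["pop_total"]).all()
        ),
        # Assumed identity: literate + illiterate = total, row by row.
        "literate_plus_illiterate": bool(
            (df["literate_total"] + df["illiterate_total"] == df["pop_total"]).all()
        ),
        # No missing values anywhere in the frame.
        "no_missing_values": not bool(df.isna().any().any()),
    }


# Toy rows with invented numbers.
toy = pd.DataFrame({
    "pop_total": [1000, 2000],
    "pop_male": [520, 1010],
    "pop_female": [480, 990],
    "literate_total": [700, 1200],
    "illiterate_total": [300, 800],
})
report = check_consistency(toy, expected_total=3000)
print(report)
```

On the real frame you would call `check_consistency(df, expected_total=1_210_854_977)`.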

What the data actually shows

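Most literate district → Pathanamthitta, Kerala: 88.74%
Least literate district → Alirajpur, Madhya Pradesh: 28.77%
Literacy gap across India → about 60 points (88.74 - 28.77 = 59.97)

These literacy figures use the literacy_rate column, i.e. literates as a share of total population, so they run lower than the official Census rates computed over ages 7 and above.

Highest sex ratio → Mahe, Puducherry: 1176 females per 1000 males
Lowest sex ratio → Leh, Jammu & Kashmir: 690 females per 1000 males

National population → 1,210,854,977
Our district sum → 1,210,854,977
Difference → 0 ✅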
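
Numbers like these can be pulled out with `idxmax` and `idxmin`. A minimal sketch, assuming the column names from the table above; the `extremes` helper and the toy rows are invented for illustration.

```python
import pandas as pd


def extremes(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the rows holding the highest and lowest value of `column`."""
    rows = df.loc[
        [df[column].idxmax(), df[column].idxmin()],
        ["district_name", "state_name", column],
    ]
    rows.index = ["highest", "lowest"]
    return rows


# Toy rows with invented values, standing in for the real 640 districts.
toy = pd.DataFrame({
    "district_name": ["A", "B", "C"],
    "state_name": ["X", "X", "Y"],
    "literacy_rate": [61.5, 88.0, 30.25],
})
result = extremes(toy, "literacy_rate")
gap = result.loc["highest", "literacy_rate"] - result.loc["lowest", "literacy_rate"]
print(result)
print(gap)
```

The same call works for `sex_ratio` or any other numeric column.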

Why LGD codes matter

Every district in this dataset carries an LGD (Local Government Directory) code - the Government of India's permanent identifier for every administrative unit.

Without LGD codes, joining two Indian datasets is a nightmare:

# without LGD - name matching hell
df[df['district_name'] == 'Leh(Ladakh)']
# misses: "Leh Ladakh", "Leh", "LEH"

# with LGD - bulletproof
df[df['lgd_code'] == 9]
# always works, regardless of spelling

This dataset has LGD codes for all 640 districts,
including manual verification of Yanam and Mahe - two tiny Puducherry enclaves missing from the official LGD export.
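
Here is roughly what a join on LGD codes looks like in pandas. This is a sketch under assumptions: the second dataset, its `hospitals` column, and the LGD codes 101 to 104 are all invented placeholders, not real values.

```python
import pandas as pd

# Stand-in for the census frame (placeholder codes and names).
census = pd.DataFrame({
    "lgd_code": [101, 102, 103],
    "district_name": ["Alpha", "Beta", "Gamma"],
    "pop_total": [1000, 2000, 3000],
})

# A hypothetical second dataset that spells district names differently.
other = pd.DataFrame({
    "lgd_code": [102, 101, 104],
    "district": ["BETA", "Alpha Dist.", "Delta"],
    "hospitals": [7, 3, 5],
})

merged = census.merge(
    other[["lgd_code", "hospitals"]],
    on="lgd_code",
    how="left",
    validate="one_to_one",  # raise if either side repeats an LGD code
    indicator=True,         # adds a _merge column showing where each row matched
)

# Districts with no partner in the other dataset.
unmatched = merged.loc[merged["_merge"] == "left_only", "district_name"].tolist()
print(merged)
print(unmatched)
```

`validate` and `indicator` are cheap insurance: the first catches duplicated codes, the second shows exactly which districts failed to match instead of silently leaving NaNs.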

Known limitations

⚠️ This data reflects 2011 boundaries.

  • Telangana does not exist here - carved out of Andhra Pradesh in 2014
  • New districts post-2011 are not present - India had 640 then, 800+ now
  • Population figures are from 2011 - use for structural comparisons, not current headcounts

The cleaning pipeline

The full reproducible pipeline is on GitHub.
Clone it, run the notebook, get the exact same parquet file.

git clone https://github.com/indiaset/census-2011-pipeline

Raw file → filter → clean → validate → LGD join → parquet.
Every step documented. Every decision explained.
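
To give a feel for those steps, here is a toy version on an in-memory frame. It is not the code from the repository: the helper names `clean_raw` and `attach_lgd`, the raw layout, and every value below are assumptions made for illustration.

```python
import pandas as pd


def clean_raw(raw: pd.DataFrame) -> pd.DataFrame:
    """Filter to district rows, tidy names, cast counts (hypothetical helper)."""
    # Filter: footnote and blank rows have no numeric district code.
    codes = pd.to_numeric(raw["district_code"], errors="coerce")
    out = raw.loc[codes.notna()].copy()
    # Clean: integer codes, trimmed names, integer counts.
    out["district_code"] = codes[codes.notna()].astype(int)
    out["district_name"] = out["district_name"].str.strip()
    out["pop_total"] = pd.to_numeric(out["pop_total"]).astype(int)
    return out.reset_index(drop=True)


def attach_lgd(clean: pd.DataFrame, lookup: pd.DataFrame) -> pd.DataFrame:
    """Left-join LGD codes and fail loudly if any district has none."""
    joined = clean.merge(lookup, on="district_code", how="left", validate="one_to_one")
    missing = joined["lgd_code"].isna()
    if missing.any():
        names = joined.loc[missing, "district_name"].tolist()
        raise ValueError(f"no LGD code for: {names}")
    return joined.astype({"lgd_code": int})


# Invented raw rows: two districts, one footnote row, one blank row.
raw = pd.DataFrame({
    "district_code": ["001", "002", "Note: figures are final", None],
    "district_name": [" Alpha ", "Beta", None, None],
    "pop_total": ["1000", "2000", None, None],
})
lookup = pd.DataFrame({"district_code": [1, 2], "lgd_code": [501, 502]})

clean = clean_raw(raw)
assert clean["pop_total"].sum() == 3000  # validate against a known total
final = attach_lgd(clean, lookup)
print(final)
# Last step would be: final.to_parquet("districts.parquet")
```

Validating before the join keeps the two failure modes apart: a bad total points at the cleaning, a missing LGD code points at the lookup.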

What's next

This is dataset #1 under indiaset -
India's open data layer.

| Dataset | Status |
| --- | --- |
| Census 2011 districts | ✅ Live |
| Indian Elections 1951–2024 | 🔜 Coming |
| RBI Economic Series | 🔜 Coming |
| pip install indiaset | 🔜 Coming |

Citation

Jaiswal, Ansuman. (2026). India Census 2011 - District Level
[Dataset]. indiaset. Hugging Face.
https://huggingface.co/datasets/indiaset/census-2011

Licensed under CC-BY-4.0 - free to use, just credit the source.


🔗 Dataset → https://huggingface.co/datasets/indiaset/census-2011

🔗 Pipeline → https://github.com/indiaset/census-2011-pipeline

🔗 Follow → https://x.com/indiaset_data
