Ansuman Jaiswal

Posted on Jun 16

I cleaned India's Census 2011 data so you never have to

#python #data #opendata #opensource

Every Indian data scientist hits the same wall.

You need district-level population data. You go to censusindia.gov.in.
You find hundreds of inconsistent Excel files with merged headers,
footnote rows, and zero documentation.

You spend a full day just loading the data before doing any actual analysis.

I fixed that. Once. For everyone.

What I built

indiaset/census-2011
India's Census 2011 district data, clean, typed, and ready for pandas.
640 districts · 29 columns · 0 missing values
Validated against official India total · LGD codes attached

Load it in 4 lines

from huggingface_hub import hf_hub_download
import pandas as pd

path = hf_hub_download(
    repo_id="indiaset/census-2011",
    filename="census_2011_districts_final.parquet",
    repo_type="dataset"
)
df = pd.read_parquet(path)
print(df.shape)  # (640, 29)

What's in it

Column	Description
`state_code`	Census 2011 state code
`state_name`	Official state/UT name
`district_code`	Census 2011 district code
`district_name`	District name as per Census
`lgd_code`	LGD permanent district code
`district_name_lgd`	District name as per LGD
`pop_total`	Total population
`pop_male`	Male population
`pop_female`	Female population
`pop_under6_total`	Children under 6 years
`pop_sc`	Scheduled Caste population
`pop_st`	Scheduled Tribe population
`literate_total`	Literate persons
`literate_male`	Literate males
`literate_female`	Literate females
`illiterate_total`	Illiterate persons
`workers_total`	Total workers
`workers_male`	Male workers
`workers_female`	Female workers
`non_workers_total`	Non workers
`literacy_rate`	Literate / Total × 100
`sex_ratio`	Females per 1000 males
`workforce_participation`	Workers / Total × 100
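
The last three columns in the table are derived from the raw counts. Here is a minimal sketch of that arithmetic, assuming the formulas are applied exactly as the table describes. The two-row frame and the helper name `add_derived_columns` are made up for illustration; the numbers are not census figures.

```python
import pandas as pd

# Toy rows with invented counts; real values come from the parquet file.
toy = pd.DataFrame({
    "pop_total": [1000, 2000],
    "pop_male": [500, 1100],
    "pop_female": [500, 900],
    "literate_total": [800, 1000],
    "workers_total": [400, 900],
})

def add_derived_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Recompute the derived columns from the raw counts (hypothetical helper)."""
    out = frame.copy()
    out["literacy_rate"] = out["literate_total"] / out["pop_total"] * 100
    out["sex_ratio"] = out["pop_female"] / out["pop_male"] * 1000
    out["workforce_participation"] = out["workers_total"] / out["pop_total"] * 100
    return out

derived = add_derived_columns(toy)
print(derived[["literacy_rate", "sex_ratio", "workforce_participation"]].round(2))
```

Because `literacy_rate` divides by the total population, children under 6 included, it will generally read lower than the Census's official literacy rate, which is computed on the population aged 7 and above.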

The validation

The most important test: do all 640 district populations
sum to India's official total?

print(df['pop_total'].sum())
# 1210854977 ✅ - exact match, zero discrepancy
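
If you want the same kind of check in your own scripts, here is a rough sketch. The total-sum check mirrors the one above. The missing-value check and the male + female rule are my assumptions about what a reasonable validation would include, not a description of the actual pipeline. The helper name `validate` and the toy frame are invented.

```python
import pandas as pd

OFFICIAL_TOTAL_2011 = 1_210_854_977  # national total quoted in this post

def validate(frame: pd.DataFrame, expected_total: int) -> dict:
    """Return named checks; each value is True when the check passes."""
    return {
        "no_missing_values": not frame.isna().any().any(),
        "total_matches": int(frame["pop_total"].sum()) == expected_total,
        # Assumed internal consistency rule, not stated in the post.
        "male_plus_female": bool(
            (frame["pop_male"] + frame["pop_female"] == frame["pop_total"]).all()
        ),
    }

# Toy frame with invented numbers so the sketch runs without a download.
toy = pd.DataFrame({
    "pop_total": [1000, 2000],
    "pop_male": [500, 1100],
    "pop_female": [500, 900],
})
report = validate(toy, expected_total=3000)
print(report)
```

On the real data, the call would be `validate(df, OFFICIAL_TOTAL_2011)`.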

What the data actually shows

Most literate district → Pathanamthitta, Kerala : 88.74%
Least literate district → Alirajpur, Madhya Pradesh : 28.77%
Literacy gap across India : 59.97 points (about 60)

Highest sex ratio → Mahe, Puducherry : 1176 females per 1000 males
Lowest sex ratio → Leh, Jammu & Kashmir : 690 females per 1000 males

National population → 1,210,854,977
Our district sum → 1,210,854,977
Difference → 0 ✅
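
Extremes like the ones above can be pulled out with `idxmax` and `idxmin`. This is a small sketch on an invented three-row frame, so the district names and values are placeholders. The helper `extremes` is hypothetical; only the column names match the dataset.

```python
import pandas as pd

# Placeholder rows; swap in the real DataFrame to reproduce the figures above.
toy = pd.DataFrame({
    "state_name": ["State A", "State B", "State C"],
    "district_name": ["Alpha", "Beta", "Gamma"],
    "literacy_rate": [88.0, 30.0, 61.5],
    "sex_ratio": [1010.0, 950.0, 700.0],
})

def extremes(frame: pd.DataFrame, column: str):
    """Return the rows holding the highest and lowest value of `column`."""
    keep = ["district_name", "state_name", column]
    highest = frame.loc[frame[column].idxmax(), keep]
    lowest = frame.loc[frame[column].idxmin(), keep]
    return highest, lowest

top, bottom = extremes(toy, "literacy_rate")
gap = top["literacy_rate"] - bottom["literacy_rate"]
print(f"{top['district_name']} vs {bottom['district_name']}: gap {gap:.2f} points")
```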

Why LGD codes matter

Every district in this dataset carries an LGD code - the Government of India's permanent identifier for every administrative unit.

Without LGD codes, joining two Indian datasets is a nightmare:

# without LGD - name matching hell
df[df['district_name'] == 'Leh(Ladakh)']
# other datasets spell it "Leh Ladakh", "Leh", or "LEH", so name joins miss

# with LGD - bulletproof
df[df['lgd_code'] == 9]
# always works, regardless of spelling

This dataset has LGD codes for all 640 districts.
The codes for Yanam and Mahe, two tiny Puducherry enclaves missing from the official LGD export, were verified manually.
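
To make the join concrete, here is a sketch of merging census rows with a second dataset on `lgd_code`. Both frames, the codes in them, and the `hospitals` column are invented for illustration. `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Invented codes and values; `other` stands in for any dataset carrying LGD codes.
census = pd.DataFrame({
    "lgd_code": [101, 102, 103],
    "district_name": ["Leh(Ladakh)", "Alpha", "Beta"],
    "pop_total": [1000, 2000, 3000],
})
other = pd.DataFrame({
    "lgd_code": [101, 102, 104],
    "district": ["LEH", "alpha", "Delta"],  # spellings differ, codes do not
    "hospitals": [3, 7, 2],
})

merged = census.merge(
    other[["lgd_code", "hospitals"]],
    on="lgd_code",
    how="left",
    validate="one_to_one",  # raises if either side has duplicate codes
    indicator=True,         # adds a _merge column showing match status
)
unmatched = merged.loc[merged["_merge"] == "left_only", "district_name"].tolist()
print(merged)
print("Unmatched districts:", unmatched)
```

Checking the `_merge` column after a left join is a cheap way to spot districts the other dataset does not cover, instead of silently carrying NaN values forward.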

Known limitations

⚠️ This data reflects 2011 boundaries.

Telangana does not appear here - it was carved out of Andhra Pradesh in 2014.
Districts created after 2011 are not present - India had 640 then and has 800+ now.
Population figures are from 2011 - use them for structural comparisons, not current headcounts.

The cleaning pipeline

The full reproducible pipeline is on GitHub.
Clone it, run the notebook, get the exact same parquet file.

git clone https://github.com/indiaset/census-2011-pipeline

Raw file → filter → clean → validate → LGD join → parquet.
Every step documented. Every decision explained.
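
For a feel of how those stages chain together, here is a toy sketch using `DataFrame.pipe`. It is not the code from the repository. The raw frame, its column names (`level`, `name`), the LGD lookup, and all four helper functions are hypothetical stand-ins.

```python
import pandas as pd

# Invented stand-in for a raw sheet: a state summary row, a footnote row,
# and an untidy district name. Column names here are hypothetical.
raw = pd.DataFrame({
    "level": ["STATE", "DISTRICT", "DISTRICT", None],
    "name": ["State A", "  Alpha * ", "Beta", "Note: provisional"],
    "district_code": [0, 1, 2, None],
    "pop_total": [3000, 1000, 2000, None],
})
lgd = pd.DataFrame({"district_code": [1, 2], "lgd_code": [101, 102]})

def filter_rows(frame):
    """Keep only district-level rows, dropping summaries and footnotes."""
    return frame[frame["level"] == "DISTRICT"].copy()

def clean(frame):
    """Strip footnote markers and whitespace, then enforce integer types."""
    out = frame.copy()
    out["name"] = out["name"].str.replace("*", "", regex=False).str.strip()
    return out.astype({"district_code": "int64", "pop_total": "int64"})

def validate(frame, expected_total):
    """Fail loudly if the district sum drifts from the expected total."""
    total = int(frame["pop_total"].sum())
    if total != expected_total:
        raise ValueError(f"district sum {total} != expected {expected_total}")
    return frame

def join_lgd(frame, lookup):
    """Attach LGD codes; every district must match exactly one code."""
    out = frame.merge(lookup, on="district_code", how="left", validate="one_to_one")
    if out["lgd_code"].isna().any():
        raise ValueError("some districts have no LGD code")
    return out

result = (
    raw.pipe(filter_rows)
       .pipe(clean)
       .pipe(validate, expected_total=3000)
       .pipe(join_lgd, lgd)
)
print(result)
# A final result.to_parquet(...) call would write the file (needs pyarrow).
```

Having each stage return a DataFrame keeps the order of operations readable, and the validation step raises before any bad output is written.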

What's next

This is dataset #1 under indiaset, India's open data layer.

Dataset	Status
Census 2011 districts	✅ Live
Indian Elections 1951–2024	🔜 Coming
RBI Economic Series	🔜 Coming
`pip install indiaset`	🔜 Coming

Citation

Jaiswal, Ansuman. (2026). India Census 2011 - District Level
[Dataset]. indiaset. Hugging Face.
https://huggingface.co/datasets/indiaset/census-2011

Licensed under CC-BY-4.0 - free to use, just credit the source.

🔗 Dataset → https://huggingface.co/datasets/indiaset/census-2011

🔗 Pipeline → https://github.com/indiaset/census-2011-pipeline

🔗 Follow → https://x.com/indiaset_data
