Every Indian data scientist hits the same wall.
You need district-level population data. You go to censusindia.gov.in.
You find hundreds of inconsistent Excel files with merged headers,
footnote rows, and zero documentation.
You spend a full day just loading the data before doing any actual analysis.
I fixed that. Once. For everyone.
What I built
indiaset/census-2011
India's Census 2011 district data, clean, typed, and ready for pandas.
640 districts · 29 columns · 0 missing values
Validated against official India total · LGD codes attached
Load it in 4 lines
from huggingface_hub import hf_hub_download
import pandas as pd
path = hf_hub_download(
repo_id="indiaset/census-2011",
filename="census_2011_districts_final.parquet",
repo_type="dataset"
)
df = pd.read_parquet(path)
print(df.shape) # (640, 29)
What's in it
| Column | Description |
|---|---|
| `state_code` | Census 2011 state code |
| `state_name` | Official state/UT name |
| `district_code` | Census 2011 district code |
| `district_name` | District name as per Census |
| `lgd_code` | LGD permanent district code |
| `district_name_lgd` | District name as per LGD |
| `pop_total` | Total population |
| `pop_male` | Male population |
| `pop_female` | Female population |
| `pop_under6_total` | Children under 6 years |
| `pop_sc` | Scheduled Caste population |
| `pop_st` | Scheduled Tribe population |
| `literate_total` | Literate persons |
| `literate_male` | Literate males |
| `literate_female` | Literate females |
| `illiterate_total` | Illiterate persons |
| `workers_total` | Total workers |
| `workers_male` | Male workers |
| `workers_female` | Female workers |
| `non_workers_total` | Non-workers |
| `literacy_rate` | Literate / Total × 100 |
| `sex_ratio` | Females per 1000 males |
| `workforce_participation` | Workers / Total × 100 |
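To make the three derived columns concrete, here is a minimal sketch that recomputes them from the raw counts using the formulas in the table. The two districts and their numbers are made up for illustration, and rounding to two decimals is my assumption, not something the dataset card specifies. Note that dividing by total population gives a lower figure than the official Census literacy rate, which uses the population aged 7 and above as its denominator.

```python
import pandas as pd

# Two made-up districts; the numbers are illustrative, not Census figures.
df = pd.DataFrame({
    "district_name": ["Alpha", "Beta"],
    "pop_total": [1000, 2000],
    "pop_male": [500, 1100],
    "pop_female": [500, 900],
    "literate_total": [800, 1000],
    "workers_total": [400, 900],
})

# Same formulas as the last three rows of the column table.
# Rounding to 2 decimals is an assumption made for readability.
derived = pd.DataFrame({
    "literacy_rate": df["literate_total"] / df["pop_total"] * 100,
    "sex_ratio": df["pop_female"] / df["pop_male"] * 1000,
    "workforce_participation": df["workers_total"] / df["pop_total"] * 100,
}).round(2)

print(derived)
```

Running the same expressions against the real parquet file and comparing them with the stored columns is a quick way to confirm you are reading the definitions the way I intended.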
The validation
The most important test: do all 640 district populations sum to India's official total?
print(df['pop_total'].sum())
# 1210854977 ✅ — exact match, zero discrepancy
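The national total is one check. A few row-level identities are worth testing too. The sketch below is not the pipeline's actual validation code; it assumes that male plus female, literate plus illiterate, and workers plus non-workers each add up to `pop_total`, and it runs on made-up rows. `consistency_report` is a hypothetical helper name.

```python
import pandas as pd

# Made-up rows using the dataset's column names; not real Census figures.
df = pd.DataFrame({
    "pop_total": [1000, 2000],
    "pop_male": [500, 1100],
    "pop_female": [500, 900],
    "literate_total": [800, 1000],
    "illiterate_total": [200, 1000],
    "workers_total": [400, 900],
    "non_workers_total": [600, 1100],
})

# Pairs of columns assumed to add up to pop_total in every row.
IDENTITIES = [
    ("pop_male", "pop_female"),
    ("literate_total", "illiterate_total"),
    ("workers_total", "non_workers_total"),
]

def consistency_report(frame: pd.DataFrame) -> dict:
    """Count the rows that break each identity (0 means the check passes)."""
    report = {}
    for left, right in IDENTITIES:
        ok = frame[left] + frame[right] == frame["pop_total"]
        report[f"{left} + {right} == pop_total"] = int((~ok).sum())
    return report

report = consistency_report(df)
print(report)
print(int(df["pop_total"].sum()))  # compare this against the published total
```

On the real file you would pass the loaded `df` and expect every count to be zero.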
What the data actually shows
Most literate district → Pathanamthitta, Kerala : 88.74%
Least literate district → Alirajpur, Madhya Pradesh : 28.77%
Literacy gap across India : 60 points
Highest sex ratio → Mahe, Puducherry : 1176 per 1000 males
Lowest sex ratio → Leh, Jammu & Kashmir : 690 per 1000 males
National population → 1,210,854,977
Our district sum → 1,210,854,977
Difference → 0 ✅
Why LGD codes matter
Every district in this dataset carries an LGD (Local Government Directory) code - the Government of India's permanent identifier for every administrative unit.
Without LGD codes, joining two Indian datasets is a nightmare:
# without LGD - name matching hell
df[df['district'] == 'Leh(Ladakh)']
# misses: "Leh Ladakh", "Leh", "LEH"
# with LGD - bulletproof
df[df['lgd_code'] == 9]
# always works, regardless of spelling
This dataset has LGD codes for all 640 districts,
including manual verification of Yanam and Mahe - two tiny Puducherry enclaves missing from the official LGD export.
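Here is roughly what a code-based join looks like in practice. This is a sketch under assumptions: both frames are made up, the `lgd_code` values are placeholders rather than real LGD codes, and the `hospitals` column is hypothetical. The `validate` and `indicator` arguments are standard `DataFrame.merge` options that fail loudly on duplicate keys and flag rows with no match.

```python
import pandas as pd

# Made-up frames; the lgd_code values are placeholders, not real LGD codes.
census = pd.DataFrame({
    "lgd_code": [101, 102, 103],
    "district_name": ["Leh(Ladakh)", "Mahe", "Yanam"],
    "pop_total": [1000, 2000, 3000],
})
other = pd.DataFrame({
    "lgd_code": [101, 102, 104],
    "district": ["LEH", "Mahe", "Somewhere"],  # spelling differs, code does not
    "hospitals": [3, 1, 7],                    # hypothetical column
})

merged = census.merge(
    other,
    on="lgd_code",
    how="left",
    validate="one_to_one",  # raises MergeError if a code repeats on either side
    indicator=True,         # adds a _merge column: both / left_only
)

# Districts in the census frame that found no partner in the other dataset.
unmatched = merged.loc[merged["_merge"] == "left_only", "district_name"].tolist()
print(merged[["district_name", "hospitals", "_merge"]])
print(unmatched)
```

"Leh(Ladakh)" and "LEH" line up because the join never looks at the names.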
Known limitations
⚠️ This data reflects 2011 boundaries.
- Telangana does not exist here - carved out of Andhra Pradesh in 2014
- New districts post-2011 are not present - India had 640 then, 800+ now
- Population figures are from 2011 - use for structural comparisons, not current headcounts
The cleaning pipeline
The full reproducible pipeline is on GitHub.
Clone it, run the notebook, get the exact same parquet file.
git clone https://github.com/indiaset/census-2011-pipeline
Raw file → filter → clean → validate → LGD join → parquet.
Every step documented. Every decision explained.
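If you want a feel for the shape of those steps before opening the notebook, here is a toy sketch. It is not the code from the repository: the raw column names (`level`, `name`, `tot_p`), the function names, and all values are hypothetical, and the LGD lookup is joined on cleaned names purely for illustration.

```python
import pandas as pd

# Hypothetical raw extract: a state row, a footnote row, padded names,
# and numbers stored as text with thousands separators.
raw = pd.DataFrame({
    "level": ["DISTRICT", "DISTRICT", "STATE", "Note: provisional"],
    "name": [" Alpha ", "Beta", "Somestate", None],
    "tot_p": ["1,000", "2,000", "3,000", None],
})
# Hypothetical lookup table with placeholder codes.
lgd = pd.DataFrame({"district_name": ["Alpha", "Beta"], "lgd_code": [101, 102]})

def filter_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep district rows only; drops state totals and footnotes."""
    return frame[frame["level"] == "DISTRICT"].copy()

def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Strip whitespace from names and parse population as integers."""
    out = pd.DataFrame({
        "district_name": frame["name"].str.strip(),
        "pop_total": frame["tot_p"].str.replace(",", "", regex=False).astype("int64"),
    })
    return out.reset_index(drop=True)

def validate(frame: pd.DataFrame, expected_total: int) -> pd.DataFrame:
    """Fail fast if the district sum does not match the published total."""
    actual = int(frame["pop_total"].sum())
    if actual != expected_total:
        raise ValueError(f"sum {actual} != expected {expected_total}")
    return frame

def attach_lgd(frame: pd.DataFrame, lookup: pd.DataFrame) -> pd.DataFrame:
    """Left-join the LGD lookup; one_to_one guards against duplicate names."""
    return frame.merge(lookup, on="district_name", how="left", validate="one_to_one")

final = attach_lgd(validate(clean(filter_rows(raw)), expected_total=3000), lgd)
print(final)
# Last step would be final.to_parquet(...), which needs pyarrow or fastparquet.
```

Keeping each stage as a small function that takes and returns a DataFrame makes it easy to test the stages one at a time.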
What's next
This is dataset #1 under indiaset -
India's open data layer.
| Dataset | Status |
|---|---|
| Census 2011 districts | ✅ Live |
| Indian Elections 1951–2024 | 🔜 Coming |
| RBI Economic Series | 🔜 Coming |
| `pip install indiaset` | 🔜 Coming |
Citation
Jaiswal, Ansuman. (2026). India Census 2011 - District Level
[Dataset]. indiaset. Hugging Face.
https://huggingface.co/datasets/indiaset/census-2011
Licensed under CC-BY-4.0 - free to use, just credit the source.
🔗 Dataset → https://huggingface.co/datasets/indiaset/census-2011
🔗 Pipeline → https://github.com/indiaset/census-2011-pipeline
🔗 Follow → https://x.com/indiaset_data