Here is a thing that happens to every data scientist at least once a week.
You download a dataset. Clean CSV file, looks straightforward. You run pd.read_csv("data.csv"). Python explodes. Error you have never seen before. The file opened fine in Excel. What happened?
Or you call an API. JSON comes back. You try to load it. The structure is three levels deep and nothing is where you expected.
Or someone sends you an .xlsx file with five sheets, merged cells, and a color-coded header that means something only they understand.
Loading data is never as simple as the documentation makes it look. This post covers the real experience. Including the parts that break and why.
CSV: Simple in Theory
A CSV file is plain text. Values separated by commas. One row per line. Should be easy.
import pandas as pd
df = pd.read_csv("students.csv")
print(df.head())
That works about 60% of the time. The other 40%, something is off.
Problem 1: Wrong separator
Someone saved the file with semicolons instead of commas. Common in European locales where commas are used as decimal points.
df = pd.read_csv("data.csv", sep=";")
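If you are not sure which separator a file uses, the standard library's csv.Sniffer can guess it from a sample of the text. A minimal sketch (the sample string here is made up):

```python
import csv

# A hypothetical sample of the file's first few lines
sample = "name;age;score\nAlex;25;88\nPriya;30;92\n"

# Sniffer inspects the text and guesses the dialect, including the delimiter
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)  # ;
```

You can then pass sep=dialect.delimiter to read_csv. Pandas can also guess on its own with pd.read_csv(path, sep=None, engine="python"), at the cost of the slower Python engine.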
Problem 2: Encoding errors
You see characters like Ã© instead of é. The file was saved in a different encoding than UTF-8.
df = pd.read_csv("data.csv", encoding="latin-1")
# or try:
df = pd.read_csv("data.csv", encoding="cp1252")
When you are not sure, try UTF-8 first. If you see garbage characters, try latin-1. That covers most cases.
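That trial-and-error can be automated with a small fallback loop. This is a sketch, and read_csv_any_encoding is a made-up helper name. One caveat: latin-1 can decode any byte sequence, so it always "succeeds" even when the result is wrong, so eyeball the output.

```python
import os
import tempfile

import pandas as pd

def read_csv_any_encoding(path, encodings=("utf-8", "cp1252", "latin-1")):
    # Try each encoding in order until one decodes without errors
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")

# Demo with a hypothetical file saved on a Windows machine (cp1252)
with tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False) as f:
    f.write("name,city\nRené,Zürich\n".encode("cp1252"))
    path = f.name

df = read_csv_any_encoding(path)
print(df.loc[0, "name"])  # René
os.remove(path)
```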
Problem 3: Messy headers
df = pd.read_csv("data.csv")
print(df.columns.tolist())
# ['Name ', ' Age', 'Score ', ' Department'] <- spaces everywhere
Fix in one line:
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print(df.columns.tolist())
# ['name', 'age', 'score', 'department']
Clean column names before doing anything else. Spaces and inconsistent casing cause mysterious errors later.
Problem 4: First few rows are metadata, not headers
Some files have a title or description before the actual data starts.
df = pd.read_csv("report.csv", skiprows=3)
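When you do not know how many rows to skip, peek at the raw lines first. In this sketch an in-memory string (io.StringIO) stands in for a hypothetical report.csv:

```python
import io

import pandas as pd

# Hypothetical export with three metadata lines before the real header
raw = (
    "Quarterly Report\n"
    "Generated 2024-01-05\n"
    "\n"
    "name,score\n"
    "Alex,88\n"
)

# Peek at the raw lines to count how many to skip
for i, line in enumerate(io.StringIO(raw)):
    print(i, repr(line))

# Rows 0-2 are metadata, so skip three lines and the header follows
df = pd.read_csv(io.StringIO(raw), skiprows=3)
print(df.columns.tolist())  # ['name', 'score']
```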
Problem 5: The file is huge
Loading a 4GB CSV into memory crashes your laptop.
chunk_size = 10000
chunks = []
for chunk in pd.read_csv("huge_file.csv", chunksize=chunk_size):
    processed = chunk[chunk["value"] > 0]
    chunks.append(processed)
df = pd.concat(chunks, ignore_index=True)
print(f"Loaded {len(df)} rows after filtering")
chunksize tells Pandas to read the file in pieces. Process each piece, keep what you need, throw away the rest. You can handle files larger than your RAM this way.
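If all you need is an aggregate, you do not even have to keep the chunks; accumulate as you go. A sketch using an in-memory buffer in place of a huge file:

```python
import io

import pandas as pd

# io.StringIO stands in for a file too large to load in one go
raw = "value\n5\n-2\n7\n-1\n3\n"

total = 0
rows_kept = 0
for chunk in pd.read_csv(io.StringIO(raw), chunksize=2):
    positive = chunk[chunk["value"] > 0]
    total += positive["value"].sum()   # running sum of positive values
    rows_kept += len(positive)

print(total, rows_kept)  # 15 3
```

Memory use stays bounded by the chunk size no matter how long the file is.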
JSON: When the Structure Is a Surprise
JSON comes from APIs. It looks clean. It is structured. It is also almost never a flat table.
import json
with open("users.json", "r") as f:
    data = json.load(f)
print(type(data))
print(data[0] if isinstance(data, list) else data)
If it is a list of dictionaries, pd.DataFrame(data) usually just works.
df = pd.DataFrame(data)
print(df.head())
The problem is when it is not flat.
nested = {
"results": [
{"id": 1, "user": {"name": "Alex", "age": 25}, "score": 88},
{"id": 2, "user": {"name": "Priya", "age": 30}, "score": 92}
],
"total": 2,
"page": 1
}
pd.DataFrame(nested["results"]) gives you a column called user that contains dictionaries. Not what you want.
df = pd.json_normalize(nested["results"])
print(df)
Output:
   id  score user.name  user.age
0   1     88      Alex        25
1   2     92     Priya        30
pd.json_normalize flattens nested JSON into columns. The dot notation in column names shows the nesting depth. Clean, usable, one function call.
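json_normalize can also flatten one level deeper. When each record contains a list, record_path explodes that list into rows and meta pulls fields down from the parent. The payload below is invented for illustration:

```python
import pandas as pd

# Hypothetical payload: each result carries a list of score records
payload = {
    "results": [
        {"id": 1, "user": {"name": "Alex"},  "scores": [{"test": "a", "value": 88}]},
        {"id": 2, "user": {"name": "Priya"}, "scores": [{"test": "a", "value": 92}]},
    ]
}

df = pd.json_normalize(
    payload["results"],
    record_path="scores",           # explode this inner list into rows
    meta=["id", ["user", "name"]],  # carry these parent fields along
)
print(df.columns.tolist())  # ['test', 'value', 'id', 'user.name']
```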
Excel: The Format That Contains Surprises
df = pd.read_excel("report.xlsx")
That works for a clean single-sheet file. Real Excel files from real people are almost never that.
Multiple sheets:
all_sheets = pd.read_excel("report.xlsx", sheet_name=None)
print(all_sheets.keys())
for sheet_name, sheet_df in all_sheets.items():
print(f"{sheet_name}: {sheet_df.shape}")
sheet_name=None loads every sheet into a dictionary. Keys are sheet names, values are DataFrames.
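Often the sheets share one schema (one quarter per sheet, say) and you want a single table. pd.concat stacks the dictionary's values; the DataFrames below simulate the result of sheet_name=None so the sketch runs without an actual workbook:

```python
import pandas as pd

# Simulated result of pd.read_excel("report.xlsx", sheet_name=None)
all_sheets = {
    "Q1": pd.DataFrame({"region": ["N", "S"], "sales": [100, 80]}),
    "Q2": pd.DataFrame({"region": ["N", "S"], "sales": [120, 90]}),
}

# Stack every sheet, keeping the sheet name as a column
combined = pd.concat(
    [sheet.assign(sheet=name) for name, sheet in all_sheets.items()],
    ignore_index=True,
)
print(combined.shape)  # (4, 3)
```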
Specific sheet:
df = pd.read_excel("report.xlsx", sheet_name="Q3 Sales")
Skipping decorative rows:
Many Excel reports have a logo, title, and blank rows before the real data starts.
df = pd.read_excel("report.xlsx", skiprows=4, header=0)
Selecting specific columns:
df = pd.read_excel("report.xlsx", usecols="A:E") # by letter
df = pd.read_excel("report.xlsx", usecols=[0, 1, 4]) # by position
Install openpyxl if you see an import error with xlsx files:
pip install openpyxl
APIs: JSON You Did Not Save
import requests
response = requests.get("https://jsonplaceholder.typicode.com/users")
print(f"Status code: {response.status_code}")
print(f"Content type: {response.headers['content-type']}")
data = response.json()
print(f"Type: {type(data)}")
print(f"Records: {len(data)}")
df = pd.json_normalize(data)
print(df[["id", "name", "email", "address.city"]].head())
Output:
Status code: 200
Content type: application/json; charset=utf-8
Type: <class 'list'>
Records: 10
id name email address.city
0 1 Leanne Graham Sincere@april.biz Gwenborough
1 2 Ervin Howell Shanna@melissa.tv Wisokyburgh
2 3 Clementine B.. Nathan@yesenia.net McKenziehaven
Always check status code before processing. 200 means success. 404 means the URL is wrong. 429 means you are hitting the rate limit. 500 means their server broke.
Handle failures:
def safe_get(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.Timeout:
        print("Request timed out")
    except requests.exceptions.HTTPError as e:
        print(f"HTTP error: {e}")
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
    return None
raise_for_status() raises an exception on any 4xx or 5xx response. Always use it in production code.
Useful read_csv Parameters Worth Knowing
df = pd.read_csv(
"data.csv",
sep=",", # delimiter
header=0, # row to use as column names (0 = first row)
index_col="id", # use this column as row index
usecols=["name", "age", "score"], # only load these columns
dtype={"age": int, "score": float}, # specify dtypes upfront
na_values=["N/A", "null", "none", "?"], # treat these as NaN
parse_dates=["date_column"], # parse as datetime
nrows=1000, # only load first 1000 rows
skiprows=[1, 2], # skip specific rows
encoding="utf-8",
low_memory=False # prevents mixed dtype warnings on large files
)
You will not use all of these every time. But when a file misbehaves, one of these usually fixes it.
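To see na_values and parse_dates earn their keep, here is a small run on an in-memory sample (the data is invented):

```python
import io

import pandas as pd

raw = "id,score,date\n1,88,2024-01-05\n2,N/A,2024-01-06\n3,?,2024-01-07\n"

df = pd.read_csv(
    io.StringIO(raw),
    na_values=["N/A", "?"],   # turn placeholder strings into real NaN
    parse_dates=["date"],     # datetime64 instead of plain strings
)
print(df["score"].isna().sum())  # 2
print(df["date"].dtype)          # datetime64[ns]
```

Without na_values, "?" would stay a string and force the whole score column to dtype object.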
Checking What You Loaded
Never assume your file loaded correctly. Always verify.
def inspect_dataframe(df, name="DataFrame"):
    print(f"\n{'='*50}")
    print(f"{name}")
    print(f"{'='*50}")
    print(f"Shape: {df.shape}")
    print("\nColumn dtypes:")
    print(df.dtypes)
    print("\nMissing values:")
    print(df.isnull().sum())
    print("\nFirst 3 rows:")
    print(df.head(3))
    print("\nBasic stats:")
    print(df.describe())
df = pd.read_csv("students.csv")
inspect_dataframe(df, "Students Dataset")
Run this every time you load a new file. It takes five seconds and saves you from building analysis on corrupted or misloaded data.
A Blog Worth Reading on This
Towards Data Science has a post by Parul Pandey titled "A Beginner's Guide to Pandas" that covers real-world data loading patterns and common mistakes with good examples. Search "Parul Pandey pandas beginner guide towards data science" and it comes right up.
Will Koehrsen wrote a piece called "Pandas Tricks" on Towards Data Science that goes deeper into efficient data loading and the edge cases with CSV files. Search "Will Koehrsen pandas tricks".
Both are worth reading alongside this post. Not instead of practicing the code, but as companions.
Try This
Create loading_practice.py.
Part one: download any CSV dataset from Kaggle (the Titanic dataset or the Iris dataset are perfect starters, both are free and tiny).
Load it and run the inspect_dataframe function above. Fix any column name issues with .str.strip().str.lower(). Print the missing value counts and decide what to do with them.
Part two: call this free API that returns a list of countries with their populations and regions.
https://restcountries.com/v3.1/all?fields=name,population,region,subregion
Use requests.get. Parse the JSON. Turn it into a DataFrame. The name field is itself a nested dict with a "common" key. Use pd.json_normalize and then print the first 10 rows with name, population, and region visible.
Sort by population descending. Print the top 10 most populated countries.
What's Next
You can load data now. But data from the real world is almost never clean. Next post is entirely about dirty data: missing values, wrong types, duplicate rows, impossible values. How to find all of it systematically and what to do about each kind.