Here is a thing that happens to every data scientist at least once a week.
You download a dataset. Clean CSV file, looks straightforward. You run pd.read_csv("data.csv"). Python explodes. Error you have never seen before. The file opened fine in Excel. What happened?
Or you call an API. JSON comes back. You try to load it. The structure is three levels deep and nothing is where you expected.
Or someone sends you an .xlsx file with five sheets, merged cells, and a color-coded header that means something only they understand.
Loading data is never as simple as the documentation makes it look. This post covers the real experience. Including the parts that break and why.
CSV: Simple in Theory
A CSV file is plain text. Values separated by commas. One row per line. Should be easy.
import pandas as pd
df = pd.read_csv("students.csv")
print(df.head())
That works about 60% of the time. The other 40%, something is off.
Problem 1: Wrong separator
Someone saved the file with semicolons instead of commas. Common in European locales where commas are used as decimal points.
df = pd.read_csv("data.csv", sep=";")
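If you are not sure which separator a file uses, the standard library's csv.Sniffer can guess it from a sample of the text. A minimal sketch (the sample string here is made up):

```python
import csv

# A hypothetical sample of the file's first few lines
sample = "name;age;score\nAlex;25;88\nPriya;30;92\n"

# Sniffer inspects the text and guesses the dialect, including the delimiter
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)  # ;
```

You can then pass sep=dialect.delimiter to read_csv. Pandas can also guess on its own with pd.read_csv(path, sep=None, engine="python"), at the cost of the slower Python engine.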
Problem 2: Encoding errors
You see characters like Ã© instead of é. The file was saved in a different encoding than UTF-8.
df = pd.read_csv("data.csv", encoding="latin-1")
# or try:
df = pd.read_csv("data.csv", encoding="cp1252")
When you are not sure, try UTF-8 first. If you see garbage characters, try latin-1. That covers most cases.
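That trial-and-error can be automated with a small fallback loop. This is a sketch, and read_csv_any_encoding is a made-up helper name. One caveat: latin-1 can decode any byte sequence, so it always "succeeds" even when the result is wrong, so eyeball the output.

```python
import os
import tempfile

import pandas as pd

def read_csv_any_encoding(path, encodings=("utf-8", "cp1252", "latin-1")):
    # Try each encoding in order until one decodes without errors
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")

# Demo with a hypothetical file saved on a Windows machine (cp1252)
with tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False) as f:
    f.write("name,city\nRené,Zürich\n".encode("cp1252"))
    path = f.name

df = read_csv_any_encoding(path)
print(df.loc[0, "name"])  # René
os.remove(path)
```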
Problem 3: Messy headers
df = pd.read_csv("data.csv")
print(df.columns.tolist())
# ['Name ', ' Age', 'Score ', ' Department'] <- spaces everywhere
Fix in one line:
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print(df.columns.tolist())
# ['name', 'age', 'score', 'department']
Clean column names before doing anything else. Spaces and inconsistent casing cause mysterious errors later.
Problem 4: First few rows are metadata, not headers
Some files have a title or description before the actual data starts.
df = pd.read_csv("report.csv", skiprows=3)
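When you do not know how many rows to skip, peek at the raw lines first. In this sketch an in-memory string (io.StringIO) stands in for a hypothetical report.csv:

```python
import io

import pandas as pd

# Hypothetical export with three metadata lines before the real header
raw = (
    "Quarterly Report\n"
    "Generated 2024-01-05\n"
    "\n"
    "name,score\n"
    "Alex,88\n"
)

# Peek at the raw lines to count how many to skip
for i, line in enumerate(io.StringIO(raw)):
    print(i, repr(line))

# Rows 0-2 are metadata, so skip three lines and the header follows
df = pd.read_csv(io.StringIO(raw), skiprows=3)
print(df.columns.tolist())  # ['name', 'score']
```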
Problem 5: The file is huge
Loading a 4GB CSV into memory crashes your laptop.
chunk_size = 10000
chunks = []
for chunk in pd.read_csv("huge_file.csv", chunksize=chunk_size):
    processed = chunk[chunk["value"] > 0]
    chunks.append(processed)
df = pd.concat(chunks, ignore_index=True)
print(f"Loaded {len(df)} rows after filtering")
chunksize tells Pandas to read the file in pieces. Process each piece, keep what you need, throw away the rest. You can handle files larger than your RAM this way.
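If all you need is an aggregate, you do not even have to keep the chunks; accumulate as you go. A sketch using an in-memory buffer in place of a huge file:

```python
import io

import pandas as pd

# io.StringIO stands in for a file too large to load in one go
raw = "value\n5\n-2\n7\n-1\n3\n"

total = 0
rows_kept = 0
for chunk in pd.read_csv(io.StringIO(raw), chunksize=2):
    positive = chunk[chunk["value"] > 0]
    total += positive["value"].sum()   # running sum of positive values
    rows_kept += len(positive)

print(total, rows_kept)  # 15 3
```

Memory use stays bounded by the chunk size no matter how long the file is.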
JSON: When the Structure Is a Surprise
JSON comes from APIs. It looks clean. It is structured. It is also almost never a flat table.
import json
with open("users.json", "r") as f:
    data = json.load(f)
print(type(data))
print(data[0] if isinstance(data, list) else data)
If it is a list of dictionaries, pd.DataFrame(data) usually just works.
df = pd.DataFrame(data)
print(df.head())
The problem is when it is not flat.
nested = {
"results": [
{"id": 1, "user": {"name": "Alex", "age": 25}, "score": 88},
{"id": 2, "user": {"name": "Priya", "age": 30}, "score": 92}
],
"total": 2,
"page": 1
}
pd.DataFrame(nested["results"]) gives you a column called user that contains dictionaries. Not what you want.
df = pd.json_normalize(nested["results"])
print(df)
Output:
   id  score user.name  user.age
0   1     88      Alex        25
1   2     92     Priya        30
pd.json_normalize flattens nested JSON into columns. The dot notation in column names shows the nesting depth. Clean, usable, one function call.
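json_normalize can also flatten one level deeper. When each record contains a list, record_path explodes that list into rows and meta pulls fields down from the parent. The payload below is invented for illustration:

```python
import pandas as pd

# Hypothetical payload: each result carries a list of score records
payload = {
    "results": [
        {"id": 1, "user": {"name": "Alex"},  "scores": [{"test": "a", "value": 88}]},
        {"id": 2, "user": {"name": "Priya"}, "scores": [{"test": "a", "value": 92}]},
    ]
}

df = pd.json_normalize(
    payload["results"],
    record_path="scores",           # explode this inner list into rows
    meta=["id", ["user", "name"]],  # carry these parent fields along
)
print(df.columns.tolist())  # ['test', 'value', 'id', 'user.name']
```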
Excel: The Format That Contains Surprises
df = pd.read_excel("report.xlsx")
That works for a clean single-sheet file. Real Excel files from real people are almost never that.
Multiple sheets:
all_sheets = pd.read_excel("report.xlsx", sheet_name=None)
print(all_sheets.keys())
for sheet_name, sheet_df in all_sheets.items():
print(f"{sheet_name}: {sheet_df.shape}")
sheet_name=None loads every sheet into a dictionary. Keys are sheet names, values are DataFrames.
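Often the sheets share one schema (one quarter per sheet, say) and you want a single table. pd.concat stacks the dictionary's values; the DataFrames below simulate the result of sheet_name=None so the sketch runs without an actual workbook:

```python
import pandas as pd

# Simulated result of pd.read_excel("report.xlsx", sheet_name=None)
all_sheets = {
    "Q1": pd.DataFrame({"region": ["N", "S"], "sales": [100, 80]}),
    "Q2": pd.DataFrame({"region": ["N", "S"], "sales": [120, 90]}),
}

# Stack every sheet, keeping the sheet name as a column
combined = pd.concat(
    [sheet.assign(sheet=name) for name, sheet in all_sheets.items()],
    ignore_index=True,
)
print(combined.shape)  # (4, 3)
```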
Specific sheet:
df = pd.read_excel("report.xlsx", sheet_name="Q3 Sales")
Skipping decorative rows:
Many Excel reports have a logo, title, and blank rows before the real data starts.
df = pd.read_excel("report.xlsx", skiprows=4, header=0)
Selecting specific columns:
df = pd.read_excel("report.xlsx", usecols="A:E") # by letter
df = pd.read_excel("report.xlsx", usecols=[0, 1, 4]) # by position
Install openpyxl if you see an import error with xlsx files:
pip install openpyxl
APIs: JSON You Did Not Save
import requests
response = requests.get("https://jsonplaceholder.typicode.com/users")
print(f"Status code: {response.status_code}")
print(f"Content type: {response.headers['content-type']}")
data = response.json()
print(f"Type: {type(data)}")
print(f"Records: {len(data)}")
df = pd.json_normalize(data)
print(df[["id", "name", "email", "address.city"]].head())
Output:
Status code: 200
Content type: application/json; charset=utf-8
Type: <class 'list'>
Records: 10
id name email address.city
0 1 Leanne Graham Sincere@april.biz Gwenborough
1 2 Ervin Howell Shanna@melissa.tv Wisokyburgh
2 3 Clementine B.. Nathan@yesenia.net McKenziehaven
Always check status code before processing. 200 means success. 404 means the URL is wrong. 429 means you are hitting the rate limit. 500 means their server broke.
Handle failures:
def safe_get(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.Timeout:
        print("Request timed out")
    except requests.exceptions.HTTPError as e:
        print(f"HTTP error: {e}")
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
    return None
raise_for_status() raises an exception on any 4xx or 5xx response. Always use it in production code.
Useful read_csv Parameters Worth Knowing
df = pd.read_csv(
"data.csv",
sep=",", # delimiter
header=0, # row to use as column names (0 = first row)
index_col="id", # use this column as row index
usecols=["name", "age", "score"], # only load these columns
dtype={"age": int, "score": float}, # specify dtypes upfront
na_values=["N/A", "null", "none", "?"], # treat these as NaN
parse_dates=["date_column"], # parse as datetime
nrows=1000, # only load first 1000 rows
skiprows=[1, 2], # skip specific rows
encoding="utf-8",
low_memory=False # prevents mixed dtype warnings on large files
)
You will not use all of these every time. But when a file misbehaves, one of these usually fixes it.
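To see na_values and parse_dates earn their keep, here is a small run on an in-memory sample (the data is invented):

```python
import io

import pandas as pd

raw = "id,score,date\n1,88,2024-01-05\n2,N/A,2024-01-06\n3,?,2024-01-07\n"

df = pd.read_csv(
    io.StringIO(raw),
    na_values=["N/A", "?"],   # turn placeholder strings into real NaN
    parse_dates=["date"],     # datetime64 instead of plain strings
)
print(df["score"].isna().sum())  # 2
print(df["date"].dtype)          # datetime64[ns]
```

Without na_values, "?" would stay a string and force the whole score column to dtype object.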
Checking What You Loaded
Never assume your file loaded correctly. Always verify.
def inspect_dataframe(df, name="DataFrame"):
    print(f"\n{'='*50}")
    print(f"{name}")
    print(f"{'='*50}")
    print(f"Shape: {df.shape}")
    print("\nColumn dtypes:")
    print(df.dtypes)
    print("\nMissing values:")
    print(df.isnull().sum())
    print("\nFirst 3 rows:")
    print(df.head(3))
    print("\nBasic stats:")
    print(df.describe())
df = pd.read_csv("students.csv")
inspect_dataframe(df, "Students Dataset")
Run this every time you load a new file. It takes five seconds and saves you from building analysis on corrupted or misloaded data.
A Blog Worth Reading on This
Towards Data Science has a post by Parul Pandey titled "A Beginner's Guide to Pandas" that covers real-world data loading patterns and common mistakes with good examples. Search "Parul Pandey pandas beginner guide towards data science" and it comes right up.
Will Koehrsen wrote a piece called "Pandas Tricks" on Towards Data Science that goes deeper into efficient data loading and the edge cases with CSV files. Search "Will Koehrsen pandas tricks".
Both are worth reading alongside this post. Not instead of practicing the code, but as companions.
Try This
Create loading_practice.py.
Part one: download any CSV dataset from Kaggle (the Titanic dataset or the Iris dataset are perfect starters, both are free and tiny).
Load it and run the inspect_dataframe function above. Fix any column name issues with .str.strip().str.lower(). Print the missing value counts and decide what to do with them.
Part two: call this free API that returns a list of countries with their populations and regions.
https://restcountries.com/v3.1/all?fields=name,population,region,subregion
Use requests.get. Parse the JSON. Turn it into a DataFrame. The name field is itself a nested dict with a "common" key. Use pd.json_normalize and then print the first 10 rows with name, population, and region visible.
Sort by population descending. Print the top 10 most populated countries.
What's Next
You can load data now. But data from the real world is almost never clean. Next post is entirely about dirty data: missing values, wrong types, duplicate rows, impossible values. How to find all of it systematically and what to do about each kind.