Navas Herbert

Posted on Jun 9

Python Week 6: Our Programs Finally Read Real Files

#python #beginners #dataengineering #ai

Every program we'd built so far had one problem.

The moment you closed it, everything disappeared. The students we added to the registry. The scores we calculated. The transactions we analysed. Gone. You'd open the program again, and it would start from scratch.

Real software doesn't work like that. Real software reads from files, processes data, writes results back out. Real software persists.

Week 6 was the week we crossed that line.

I walked in with a file called students.csv - 30 rows of messy, real-looking student data. Names in wrong cases. Missing scores. Inconsistent course names. I put it on the projector and said: "By the end of today, Python will read this, clean it, and report on it. Without us touching a single row manually."

The room got quiet in the good way.

String Methods: Cleaning Up Messy Text

Before we touched any file, we needed sharper tools for working with text. Real data is almost always dirty - inconsistent capitalisation, rogue spaces, mixed formats.

These are the methods that fix it:

name = "  amina WAWERU  "

print(name.strip())       # remove leading/trailing spaces
print(name.lower())       # all lowercase
print(name.upper())       # all uppercase
print(name.title())       # Title Case - first letter of each word
print(name.strip().title())  # chain them together

amina WAWERU
  amina waweru  
  AMINA WAWERU  
  Amina Waweru  
Amina Waweru

That last line - chaining .strip() and .title() - is where students start thinking like data cleaners. One messy string in, one clean string out.
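One caveat with .title(): it treats an apostrophe as a word boundary, so a name like Ng'ang'a comes out wrong. A minimal sketch of one alternative, string.capwords from the standard library. The name below is an illustrative example, not a row from the class file:

```python
import string

raw = "  ng'ang'a WANJIRU  "   # hypothetical messy input

# .title() capitalises the letter after every apostrophe
titled = raw.strip().title()

# capwords() splits on whitespace only, capitalises each word, and trims the ends
capped = string.capwords(raw)

print(titled)   # Ng'Ang'A Wanjiru
print(capped)   # Ng'ang'a Wanjiru
```

For the names in this post the two give the same result, so .strip().title() is fine here. It is just worth checking before running it on a full class list.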

More essential methods:

course = "  data engineering "

print(course.strip())              # "data engineering"
print(course.strip().title())      # "Data Engineering"
print(course.strip().replace("engineering", "science"))  # "data science"
print("Data Engineering".startswith("Data"))    # True
print("Data Engineering".endswith("ing"))       # True
print("Data Engineering".count("a"))            # 2
print("Data Engineering".split(" "))            # ['Data', 'Engineering']

data engineering
Data Engineering
data science
True
True
2
['Data', 'Engineering']

The SQL connection came up again naturally. Wanjiku said: ".strip() is TRIM(). .title() is INITCAP(). We already know this."

Exactly. Different language, same ideas. This is why learning SQL first pays off.

String Checking Methods

These return True or False - perfect for validation:

score_input = "88"
name_input = "Amina123"
empty_input = "   "

print(score_input.isdigit())      # True - all digits
print(name_input.isalpha())       # False - has numbers
print(empty_input.strip() == "")  # True - effectively empty
print("amina@lux.co.ke".find("@"))  # 5 - position of @

True
False
True
5
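A detail about .find() that matters for validation: when the text is not there it returns -1, not False. A small sketch, using a made-up address with no @ as the failing case:

```python
email_ok = "amina@lux.co.ke"
email_bad = "amina.lux.co.ke"    # hypothetical input with no @

pos_ok = email_ok.find("@")      # index of the first match
pos_bad = email_bad.find("@")    # -1 means "not found"

has_at = "@" in email_bad        # plain membership test, no index needed

print(pos_ok, pos_bad, has_at)   # 5 -1 False
```

Because -1 counts as true in an if statement, a check like `if email.find("@"):` would pass the bad address. When you only need a yes or no answer, `in` is the safer choice.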

We used these to write a simple input validator - the kind of check every real form needs before saving data:

def validate_student_input(name, score_str):
    errors = []

    if not name.strip():
        errors.append("Name cannot be empty")
    if not score_str.isdigit():
        errors.append("Score must be a whole number")
    elif not (0 <= int(score_str) <= 100):
        errors.append("Score must be between 0 and 100")

    return errors

# Test it
print(validate_student_input("Amina", "88"))      # []
print(validate_student_input("", "88"))            # ['Name cannot be empty']
print(validate_student_input("Brian", "abc"))      # ['Score must be a whole number']
print(validate_student_input("Njeri", "150"))      # ['Score must be between 0 and 100']

[]
['Name cannot be empty']
['Score must be a whole number']
['Score must be between 0 and 100']

An empty list means everything is valid. A list with items means something failed. Clean, readable, reusable - everything we learned in Week 4 (functions) and Week 5 (lists) coming together.
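It helps to know exactly what .isdigit() accepts, since the validator leans on it. A quick sketch with a few sample inputs chosen for illustration:

```python
samples = ["88", "-5", "88.5", " 88 ", ""]

# Map each raw input to what .isdigit() says about it
results = {s: s.isdigit() for s in samples}

for raw, ok in results.items():
    print(f"{raw!r:>8} -> {ok}")
```

Only "88" passes. A minus sign, a decimal point, surrounding spaces, and an empty string all fail, so validate_student_input reports "Score must be a whole number" for each of them. If untrimmed input like " 88 " should be accepted, call .strip() on it before checking.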

File Handling: Reading From the Real World

Now the main event.

Python opens files with open(). Rather than calling open() on its own and remembering to close the file afterwards, we wrap it in a with statement: it opens the file, lets you work with it, and closes it automatically when the block ends, even if an error occurs partway through. No risk of leaving files hanging open.

with open("greeting.txt", "r") as file:
    content = file.read()
    print(content)

The modes:

"r" — read (file must exist)
"w" — write (creates file, overwrites if it exists)
"a" — append (adds to the end without deleting)

Writing a file:

with open("notes.txt", "w") as file:
    file.write("Python Week 6 Notes\n")
    file.write("Topic: File Handling\n")
    file.write("Status: Mind blown\n")

print("File written!")

File written!

Reading it back:

with open("notes.txt", "r") as file:
    for line in file:
        print(line.strip())    # .strip() removes the trailing \n (and any other surrounding whitespace)

Python Week 6 Notes
Topic: File Handling
Status: Mind blown

Simple. But the moment students saw their own data surviving after the program closed - there was a visible shift in the room.
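The "a" mode from the list above did not get its own example. A minimal sketch of how it differs from "w", using a throwaway temp file so nothing real gets overwritten. The path and file name are assumptions for illustration, not files from the session:

```python
import os
import tempfile

# A throwaway location so this sketch does not touch your real notes.txt
path = os.path.join(tempfile.mkdtemp(), "log.txt")

with open(path, "w") as file:      # "w" starts the file fresh
    file.write("first run\n")

with open(path, "a") as file:      # "a" keeps what is there and adds to the end
    file.write("second run\n")

with open(path, "r") as file:
    after_append = file.read().splitlines()

with open(path, "w") as file:      # "w" again wipes both earlier lines
    file.write("third run\n")

with open(path, "r") as file:
    after_overwrite = file.read().splitlines()

print(after_append)      # ['first run', 'second run']
print(after_overwrite)   # ['third run']
```

Use "a" for anything that grows over time, such as a log. Use "w" when the file should hold only the latest result.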

Reading a CSV File Manually

This is the moment the session had been building toward.

A CSV (Comma-Separated Values) file is one of the most common formats for tabular data. Spreadsheets, database exports, public datasets - almost all of them can be saved as CSV, and many pass through it at some point.

Here's a six-row sample of our students.csv, which the examples below use:

name,course,score
amina waweru,data engineering,88
BRIAN OTIENO,Data Science,74
njeri kamau ,DATA ENGINEERING,91
kamau mwangi,data science,
wanjiku achieng,Data Engineering,79
aisha baraka,DATA SCIENCE,85

Messy. Inconsistent capitalisation. One missing score. Trailing spaces. Real.

Reading it line by line:

with open("students.csv", "r") as file:
    header = file.readline()    # read the header row so the loop starts at the data
    print(f"Columns: {header.strip()}")

    for line in file:
        row = line.strip().split(",")
        print(row)

Columns: name,course,score
['amina waweru', 'data engineering', '88']
['BRIAN OTIENO', 'Data Science', '74']
['njeri kamau ', 'DATA ENGINEERING', '91']
['kamau mwangi', 'data science', '']
['wanjiku achieng', 'Data Engineering', '79']
['aisha baraka', 'DATA SCIENCE', '85']

Now we could see exactly what we were dealing with. Messy input, structured as lists. Time to clean it.
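One optional step before cleaning is to pair each value with its column name, so you can write row["name"] instead of row[0]. A minimal sketch, using two lines copied from the sample above in place of the open file:

```python
# Two lines copied from the students.csv sample, standing in for the open file
lines = [
    "name,course,score\n",
    "njeri kamau ,DATA ENGINEERING,91\n",
]

columns = lines[0].strip().split(",")   # ['name', 'course', 'score']
values = lines[1].strip().split(",")

# zip() pairs each column name with the value in the same position
record = dict(zip(columns, values))

print(record["name"])   # still uncleaned: the trailing space is there
print(record)
```

The pipeline below sticks with positions (parts[0], parts[1], parts[2]), which is fine for three columns. The dictionary version starts to pay off once a file has more columns than you can keep in your head.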

The Moment That Got Everyone: The Full Pipeline

We combined everything - file reading, string methods, validation, data structures - into one script. I walked through it step by step, building it live:

def clean_name(name):
    return name.strip().title()

def clean_course(course):
    return course.strip().title()

def parse_score(score_str):
    score_str = score_str.strip()
    if score_str.isdigit():
        return int(score_str)
    return None   # missing or invalid score

def calculate_grade(score):
    if score is None:
        return "N/A"   # no score to grade
    if score >= 80:
        return "A"
    elif score >= 70:
        return "B"
    elif score >= 60:
        return "C"
    elif score >= 50:
        return "D"
    else:
        return "F"

# Read and process the file
students = []

with open("students.csv", "r") as file:
    file.readline()   # skip header

    for line in file:
        parts = line.strip().split(",")
        if len(parts) != 3:
            continue  # skip malformed rows

        name   = clean_name(parts[0])
        course = clean_course(parts[1])
        score  = parse_score(parts[2])
        grade  = calculate_grade(score)

        students.append({
            "name": name,
            "course": course,
            "score": score,
            "grade": grade
        })

# Print the cleaned report
print(f"\n{'='*55}")
print(f"  LUX ACADEMY - CLEANED STUDENT REPORT")
print(f"{'='*55}")
print(f"  {'NAME':<20} {'COURSE':<20} {'SCORE':<7} GRADE")
print(f"  {'-'*50}")

for s in students:
    score_display = str(s['score']) if s['score'] is not None else "MISSING"
    print(f"  {s['name']:<20} {s['course']:<20} {score_display:<7} {s['grade']}")

print(f"{'='*55}")

# Summary
valid_scores = [s['score'] for s in students if s['score'] is not None]
print(f"\n  Total students : {len(students)}")
print(f"  Missing scores : {len(students) - len(valid_scores)}")
if valid_scores:   # avoid dividing by zero when every score is missing
    print(f"  Class average  : {sum(valid_scores)/len(valid_scores):.1f}")
    print(f"  Highest score  : {max(valid_scores)}")
    print(f"  Lowest score   : {min(valid_scores)}")
else:
    print("  No valid scores to summarise")

=======================================================
  LUX ACADEMY - CLEANED STUDENT REPORT
=======================================================
  NAME                 COURSE               SCORE   GRADE
  --------------------------------------------------
  Amina Waweru         Data Engineering     88      A
  Brian Otieno         Data Science         74      B
  Njeri Kamau          Data Engineering     91      A
  Kamau Mwangi         Data Science         MISSING N/A
  Wanjiku Achieng      Data Engineering     79      B
  Aisha Baraka         Data Science         85      A
=======================================================

  Total students : 6
  Missing scores : 1
  Class average  : 83.4
  Highest score  : 91
  Lowest score   : 74

The room went silent when that output appeared.

Then Brian said: "We just built a data pipeline."

He wasn't wrong. Read raw data from a file. Clean it. Validate it. Structure it. Report on it. That's a pipeline - the same pattern that powers every data engineering workflow, just without the framework around it yet.

I told them: "When you get to Airflow, you'll write DAGs that do exactly this. The logic is identical - the scaffolding just gets more powerful."

Writing Results to a File

The last piece - saving the cleaned output so it persists:

with open("cleaned_students.csv", "w") as output:
    output.write("name,course,score,grade\n")

    for s in students:
        score_str = str(s['score']) if s['score'] is not None else ""
        output.write(f"{s['name']},{s['course']},{score_str},{s['grade']}\n")

print("Cleaned data saved to cleaned_students.csv ✅")

Cleaned data saved to cleaned_students.csv ✅

Input file in. Cleaned file out. The program did in seconds what would take a human twenty minutes of careful spreadsheet work — and it never makes a typo.

Practice Problems

Easy:

# 1. Write a program that asks for 3 lines of text and saves them to a file called "notes.txt"
# 2. Read "notes.txt" back and print each line numbered (1. line one, 2. line two...)
# 3. Clean this list of messy names using string methods:
#    names = ["  AMINA waweru", "brian OTIENO  ", "NJERI kamau  "]

Medium:

# Write a function that reads a text file and returns:
# - Total number of lines
# - Total number of words
# - The longest line

def analyse_file(filename):
    pass  # your code here

# Test with any .txt file you create

Challenge:

# Build a contact book that saves to and loads from a CSV file
# contacts.csv format: name,phone,email
#
# Your program should:
# 1. Load existing contacts from file on startup
# 2. Allow adding new contacts
# 3. Allow searching by name
# 4. Save all contacts back to the file on exit
#
# Bonus: validate that the email contains "@" and the phone is digits only

What I Noticed Teaching This Session

1. "We just built a data pipeline" is the sentence I wait for every cohort. When a student names what they've built - without being told - it means they understood not just the code but the concept. Brian earned that moment.

2. The SQL comparison came up again unprompted. At this point, students with the SQL background are making connections constantly - strip() is TRIM(), split() is like parsing, filtering a list is a WHERE clause. This is the compounding return on teaching SQL first.

3. Messy data was a better teacher than clean data. I could have given them a perfectly formatted CSV. Instead I gave them one with caps inconsistency, a missing score, and trailing spaces. Every problem they hit was a real problem real data has. Fixing it felt like real work.

4. The with statement is a habit worth building early. Some students tried open() without with first. When I showed what happens if the program crashes mid-write without proper closure - they switched immediately and never looked back.
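A rough reconstruction of that crash demo, as a sketch rather than the exact code from the session. It writes to a throwaway temp file (the path is an assumption for illustration) and raises an error on purpose in the middle of the with block:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")   # throwaway file for the demo

try:
    with open(path, "w") as file:
        file.write("row 1\n")
        raise RuntimeError("simulated crash mid-write")   # deliberate failure
except RuntimeError:
    pass

closed_after_crash = file.closed   # True: the with block closed it on the way out

with open(path, "r") as check:
    saved = check.read()           # the line written before the crash was flushed

print(closed_after_crash, repr(saved))
```

With a bare open() and no close(), whether that first line reaches the disk depends on when Python gets around to flushing and closing the file. The with block removes the guesswork.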

What's Coming in Week 7: Error Handling & Practical Modules

So far our programs assume everything goes right. The file exists. The score is a number. The user types what we expect.

Real programs don't get that luxury:

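try:
    with open("students.csv", "r") as file:    # raises FileNotFoundError if missing
        header = file.readline()
    score = int(input("Enter score: "))        # raises ValueError if not a number
    print(f"Score accepted: {score}")
except ValueError:
    print("That's not a number. Please enter a valid score.")
except FileNotFoundError:
    print("File not found. Check the filename and try again.")
finally:
    print("Validation complete.")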

We'll also cover the csv module (which handles CSV files properly - quoted fields, commas in values, headers), json for API data, os for working with folders and paths, and datetime for timestamps.
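A small preview of why the csv module matters, as a sketch under one assumption: the row below is invented, with commas inside quoted fields, which is exactly the case the manual .split(",") approach cannot handle:

```python
import csv
import io

# Hypothetical row: both the name and the course contain a comma, so they are quoted
raw = 'name,course,score\n"Otieno, Brian","Data Science, Evening",74\n'

# The manual approach from this week breaks the quoted fields apart
naive = raw.splitlines()[1].split(",")

# DictReader respects the quotes and uses the header row as dictionary keys
rows = list(csv.DictReader(io.StringIO(raw)))

print(len(naive))          # 5 pieces instead of 3
print(rows[0]["name"])     # Otieno, Brian
print(rows[0]["course"])   # Data Science, Evening
```

io.StringIO only stands in for a real file here. With a file on disk you would pass the object from open() to csv.DictReader instead.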

Week 7 is where Python stops being fragile and starts being robust. See you then.

Try It Yourself

Create your own messy CSV - names in wrong cases, a missing value or two - and run the full pipeline on it. The messier you make it, the more satisfying the clean output.

I'm a data trainer in Nairobi running a full data programme -
Python foundations → Data Science or Data Engineering specialisations.
I write weekly about what we covered, what worked, and what surprised me.
Follow along or drop your questions in the comments.
