In the world of data analytics, how we store, exchange, and process data depends heavily on the data format used.
From simple text files like CSV to efficient binary formats like Parquet, each format makes different trade-offs in performance, scalability, and compatibility.
In this article, we’ll explore six common data formats used in analytics:
- CSV (Comma-Separated Values)
- SQL (Relational Table Format)
- JSON (JavaScript Object Notation)
- Parquet (Columnar Storage Format)
- XML (Extensible Markup Language)
- Avro (Row-based Storage Format)
We’ll use this simple dataset as an example:
| Name | Register_No | Subject | Marks |
| --- | --- | --- | --- |
| Hari | 101 | Math | 89 |
| Asha | 102 | Science | 92 |
| Kiran | 103 | English | 85 |
Step 1: Install Dependencies
!pip install pandas pyarrow fastavro lxml
Step 2: Create the Dataset
We’ll first create a small dataset using pandas.
import pandas as pd
data = {
    "Name": ["Hari", "Asha", "Kiran"],
    "Register_No": [101, 102, 103],
    "Subject": ["Math", "Science", "English"],
    "Marks": [89, 92, 85]
}
df = pd.DataFrame(data)
df
1. CSV (Comma-Separated Values)
CSV is one of the simplest and most common formats for storing tabular data.
Each line represents a row, and columns are separated by commas. It’s human-readable and works easily with Excel or Python.
# Save to CSV
df.to_csv("students.csv", index=False)
# Verify content
print(open("students.csv").read())
Output:
Name,Register_No,Subject,Marks
Hari,101,Math,89
Asha,102,Science,92
Kiran,103,English,85
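Because CSV is plain text, it carries no type information; pandas re-infers the column types when the file is read back. A minimal round-trip sketch using the same file:
# Read the CSV back; dtypes are inferred from the text on load
df_csv = pd.read_csv("students.csv")
print(df_csv.dtypes)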
2. SQL (Relational Table Format)
SQL organizes data into relational tables using columns and rows.
You can store, query, and join tables efficiently in databases like MySQL, SQLite, or PostgreSQL.
import sqlite3
# Create SQLite database and table
conn = sqlite3.connect("students.db")
df.to_sql("students", conn, if_exists="replace", index=False)
# Verify content
pd.read_sql("SELECT * FROM students", conn)
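Since the data now lives in a relational table, you can filter or aggregate it in SQL instead of loading everything into pandas first. A minimal sketch of a filtered query (the threshold 85 is just an illustrative value):
# Query only the students scoring above 85
pd.read_sql("SELECT Name, Marks FROM students WHERE Marks > 85", conn)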
3. JSON (JavaScript Object Notation)
JSON represents data as key-value pairs, making it ideal for APIs and web applications.
It’s lightweight and language-independent.
# Save to JSON (pretty format)
df.to_json("students.json", orient="records", indent=2)
# Verify content
print(open("students.json").read())
Output:
[
  {
    "Name": "Hari",
    "Register_No": 101,
    "Subject": "Math",
    "Marks": 89
  },
  {
    "Name": "Asha",
    "Register_No": 102,
    "Subject": "Science",
    "Marks": 92
  },
  {
    "Name": "Kiran",
    "Register_No": 103,
    "Subject": "English",
    "Marks": 85
  }
]
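Reading the records back is symmetric; pandas rebuilds the DataFrame directly from the list of objects. A minimal sketch:
# Load the JSON records back into a DataFrame
pd.read_json("students.json", orient="records")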
4. Parquet (Columnar Storage Format)
Parquet is a columnar storage format optimized for big data analytics.
It stores data by columns instead of rows, which improves compression and query performance — especially for analytical workloads.
# Save to Parquet
df.to_parquet("students.parquet", index=False)
# Verify by reading it back
pd.read_parquet("students.parquet")
Efficient and compact — great for distributed systems.
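Unlike CSV, a Parquet file carries its schema and per-column statistics in the file footer, which is what query engines use to skip data. You can inspect that metadata with pyarrow (installed in Step 1); a minimal sketch:
import pyarrow.parquet as pq

# Inspect the schema and row count stored in the Parquet footer
pf = pq.ParquetFile("students.parquet")
print(pf.schema_arrow)
print("Rows:", pf.metadata.num_rows)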
5. XML (Extensible Markup Language)
XML stores data using nested tags, similar to HTML.
It’s human-readable and useful for hierarchical or document-like data.
import xml.etree.ElementTree as ET
root = ET.Element("Students")
for _, row in df.iterrows():
    student = ET.SubElement(root, "Student")
    ET.SubElement(student, "Name").text = str(row["Name"])
    ET.SubElement(student, "Register_No").text = str(row["Register_No"])
    ET.SubElement(student, "Subject").text = str(row["Subject"])
    ET.SubElement(student, "Marks").text = str(row["Marks"])
tree = ET.ElementTree(root)
tree.write("students.xml", encoding="utf-8", xml_declaration=True)
# Verify content
print(open("students.xml").read())
Output:
<?xml version='1.0' encoding='utf-8'?>
<Students>
<Student><Name>Hari</Name><Register_No>101</Register_No><Subject>Math</Subject><Marks>89</Marks></Student>
<Student><Name>Asha</Name><Register_No>102</Register_No><Subject>Science</Subject><Marks>92</Marks></Student>
<Student><Name>Kiran</Name><Register_No>103</Register_No><Subject>English</Subject><Marks>85</Marks></Student>
</Students>
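pandas can also parse this structure back into tabular form with read_xml (available in pandas 1.3+ and backed by the lxml parser installed in Step 1). A minimal sketch:
# Parse each <Student> element back into a DataFrame row
pd.read_xml("students.xml", xpath=".//Student")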
6. Avro (Row-based Storage Format)
Avro is a binary row-based format developed by Apache for data serialization.
It includes the schema with the data, making it ideal for streaming systems like Apache Kafka.
from fastavro import writer, reader, parse_schema
schema = {
    "doc": "Student data",
    "name": "Student",
    "namespace": "example.avro",
    "type": "record",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Register_No", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"}
    ]
}

records = df.to_dict(orient="records")

with open("students.avro", "wb") as out:
    writer(out, parse_schema(schema), records)

print("students.avro created")
Verify Avro Data
from fastavro import reader
with open("students.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
Output:
{'Name': 'Hari', 'Register_No': 101, 'Subject': 'Math', 'Marks': 89}
{'Name': 'Asha', 'Register_No': 102, 'Subject': 'Science', 'Marks': 92}
{'Name': 'Kiran', 'Register_No': 103, 'Subject': 'English', 'Marks': 85}
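Because the writer schema is stored in the file, Avro also supports schema evolution: a newer reader schema can add fields with defaults and still read older files. A minimal sketch, where Grade is a hypothetical new field added only for illustration:
from fastavro import reader

# Reader schema adds an optional "Grade" field with a default;
# records written with the old schema get the default filled in on read
evolved_schema = {
    "name": "Student",
    "namespace": "example.avro",
    "type": "record",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Register_No", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"},
        {"name": "Grade", "type": "string", "default": "NA"}
    ]
}

with open("students.avro", "rb") as fo:
    for record in reader(fo, evolved_schema):
        print(record)  # each record now includes Grade='NA'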
Summary Check
print("Files created in Colab working directory:")
!ls -lh students.*
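On a dataset this tiny, the binary formats can actually be larger than CSV because of their schema and metadata overhead; their compression advantage only shows up at scale. The same check in plain Python:
import os

# Compare the on-disk size of each serialization of the same dataset
for path in ["students.csv", "students.db", "students.json",
             "students.parquet", "students.xml", "students.avro"]:
    print(f"{path}: {os.path.getsize(path)} bytes")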
Conclusion
Each data format serves a specific purpose in analytics:
| Format | Type | Best For |
| --- | --- | --- |
| CSV | Text | Simple tabular data |
| SQL | Relational | Structured, queryable data |
| JSON | Hierarchical | APIs, NoSQL data |
| Parquet | Columnar | Big data and analytics |
| XML | Tagged | Configs and document data |
| Avro | Binary (Row) | Streaming, schema evolution |
Understanding when to use each format helps you build scalable, efficient, and interoperable data systems in the cloud.
You can view and run this notebook yourself on Google Colab:
View Colab Notebook - https://colab.research.google.com/drive/1vEfM0Cv9ir_FXmjibvG1pNGUazFJKbNF?usp=sharing