In the world of data analytics, we deal with data in many forms — from simple spreadsheets to complex binary formats. Choosing the right data format can affect performance, storage efficiency, and compatibility.
In this post, I’ll show you 6 commonly used data formats — CSV, SQL, JSON, Parquet, XML, and Avro — with examples of the same dataset represented in each format.
Example dataset:
| Name | RegisterNo | Subject | Marks |
| ----- | ---------- | ------- | ----- |
| Alice | 101 | Math | 95 |
| Bob | 102 | Science | 88 |
| Carol | 103 | English | 92 |
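All of the code examples below assume this table is already loaded into a pandas DataFrame named df. A minimal Google Colab setup cell could look like this:

import pandas as pd

# Build the sample dataset shown above as a DataFrame named df
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "RegisterNo": [101, 102, 103],
    "Subject": ["Math", "Science", "English"],
    "Marks": [95, 88, 92]
})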
1. CSV (Comma-Separated Values)
Definition:
CSV is a simple text format where each row is a record and columns are separated by commas. It is easy to read and widely used.
Google Colab Code Example:
df.to_csv("data.csv", index=False)
print("✅ CSV file created.")
Sample Output (CSV Table):
| Name | RegisterNo | Subject | Marks |
| ----- | ---------- | ------- | ----- |
| Alice | 101 | Math | 95 |
| Bob | 102 | Science | 88 |
| Carol | 103 | English | 92 |
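To see the comma-separated layout itself, you can print the raw contents of the file created above:

# Print the raw text of data.csv: a header line, then one comma-separated line per record
# (Name,RegisterNo,Subject,Marks / Alice,101,Math,95 / ...)
with open("data.csv") as f:
    print(f.read())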
2. SQL
Definition:
SQL databases store data in structured, relational tables, and the SQL language lets you query and manage that data efficiently.
Google Colab Code Example:
import sqlite3
conn = sqlite3.connect("students.db")
df.to_sql("Student", conn, if_exists="replace", index=False)
print(pd.read_sql_query("SELECT * FROM Student", conn))
Sample Output (SQL Table):
| Name | RegisterNo | Subject | Marks |
| ----- | ---------- | ------- | ----- |
| Alice | 101 | Math | 95 |
| Bob | 102 | Science | 88 |
| Carol | 103 | English | 92 |
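The real advantage of SQL is querying. As a small sketch using the Student table created above, you can filter rows with a WHERE clause:

# Select only the students who scored above 90
query = "SELECT Name, Marks FROM Student WHERE Marks > 90"
print(pd.read_sql_query(query, conn))
conn.close()  # close the connection when done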
3. JSON (JavaScript Object Notation)
Definition:
JSON stores data in key-value pairs. It’s human-readable and widely used in APIs and web apps.
Google Colab Code Example:
df.to_json("data.json", orient="records", indent=4)
print("✅ JSON file created.")
Sample Output (JSON Data):
[
{"Name": "Alice", "RegisterNo": 101, "Subject": "Math", "Marks": 95},
{"Name": "Bob", "RegisterNo": 102, "Subject": "Science", "Marks": 88},
{"Name": "Carol", "RegisterNo": 103, "Subject": "English", "Marks": 92}
]
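Because JSON maps directly onto Python lists and dictionaries, reading it back is straightforward. A quick sketch using the standard json module on the data.json file created above:

import json

# Load the JSON file back into a list of dictionaries and access values by key
with open("data.json") as f:
    students = json.load(f)
print(students[0]["Name"], students[0]["Marks"])  # Alice 95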
4. Parquet (Columnar Storage Format)
Definition:
Parquet is a column-based (columnar) storage format widely used in big data analytics. Because a query can read only the columns it needs, it is highly efficient for analytical workloads on large datasets.
Google Colab Code Example:
!pip install pyarrow
df.to_parquet("data.parquet")
print("✅ Parquet file created.")
Sample Output (Read in Python):
| Name | RegisterNo | Subject | Marks |
| ----- | ---------- | ------- | ----- |
| Alice | 101 | Math | 95 |
| Bob | 102 | Science | 88 |
| Carol | 103 | English | 92 |
5. XML (Extensible Markup Language)
Definition:
XML uses tags to structure data. It is extensible and readable by both humans and machines.
Google Colab Code Example:
!pip install dicttoxml
from dicttoxml import dicttoxml
xml_data = dicttoxml(df.to_dict(orient='records'), custom_root='Students', attr_type=False)
with open("data.xml", "wb") as f:
f.write(xml_data)
print("✅ XML file created.")
Sample Output (XML Data, prettified for readability):
<?xml version="1.0" encoding="UTF-8" ?>
<Students>
  <item>
    <Name>Alice</Name>
    <RegisterNo>101</RegisterNo>
    <Subject>Math</Subject>
    <Marks>95</Marks>
  </item>
  <!-- the <item> elements for Bob and Carol follow the same pattern -->
</Students>
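If you need the data back in a DataFrame, recent pandas versions can parse the file directly. A sketch assuming pandas 1.3+ (with lxml available, as it usually is in Colab) and the <item> elements produced by dicttoxml:

# Parse the <item> records under the <Students> root back into a DataFrame
print(pd.read_xml("data.xml", xpath=".//item"))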
6. Avro (Row-based Storage Format)
Definition:
Avro is a binary, row-based format commonly used in big data pipelines. It stores its schema alongside the data, so records can be serialized compactly and read back without any external metadata.
Google Colab Code Example:
!pip install fastavro
import fastavro
schema = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "RegisterNo", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"}
    ]
}
records = df.to_dict(orient="records")
with open("data.avro", "wb") as out:
    fastavro.writer(out, schema, records)
print("✅ Avro file created.")
Sample Output (Read in Python):
| Name | RegisterNo | Subject | Marks |
| ----- | ---------- | ------- | ----- |
| Alice | 101 | Math | 95 |
| Bob | 102 | Science | 88 |
| Carol | 103 | English | 92 |
Conclusion
Each format serves different purposes:
CSV: Simple, readable, good for small datasets
SQL: Structured, relational, great for databases
JSON: Lightweight, perfect for web APIs
Parquet: Columnar, efficient for analytics on large data
XML: Extensible, ideal for data exchange
Avro: Row-based, optimized for big data pipelines