Hindu Narmatha

Posted on Oct 7

Data Formats Used in Data Analytics

#python #dataengineering

In the world of data analytics, we deal with data in many forms — from simple spreadsheets to complex binary formats. Choosing the right data format can affect performance, storage efficiency, and compatibility.

In this post, I’ll show you 6 commonly used data formats — CSV, SQL, JSON, Parquet, XML, and Avro — with examples of the same dataset represented in each format.

Example dataset:
| Name | RegisterNo | Subject | Marks |
| ----- | ---------- | ------- | ----- |
| Alice | 101 | Math | 95 |
| Bob | 102 | Science | 88 |
| Carol | 103 | English | 92 |
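
The snippets in this post all write from a pandas DataFrame called `df`. Its construction is not shown in the post, so the block below is only a minimal sketch of one way to build it from the table above (the column types are my assumption: text for Name and Subject, integers for RegisterNo and Marks).

```python
import pandas as pd

# Build the sample dataset; later snippets refer to this object as `df`.
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Carol"],
        "RegisterNo": [101, 102, 103],
        "Subject": ["Math", "Science", "English"],
        "Marks": [95, 88, 92],
    }
)

print(df)
```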

1. CSV (Comma-Separated Values)
Definition:
CSV is a simple text format where each row is a record and columns are separated by commas. It is easy to read and widely used.

Google Colab Code Example:

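# df is a pandas DataFrame holding the sample table above.
df.to_csv("data.csv", index=False)  # index=False leaves out the row-index column
print("✅ CSV file created.")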

Output (contents of data.csv):

Name,RegisterNo,Subject,Marks
Alice,101,Math,95
Bob,102,Science,88
Carol,103,English,92
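
To illustrate what "simple text format" means when reading the data back, here is a small sketch using Python's standard `csv` module. The in-memory string stands in for data.csv and is my assumption of what the file contains. Note that CSV carries no type information, so every value arrives as a string and has to be converted by the reader.

```python
import csv
import io

# Stand-in for data.csv: header row plus one comma-separated line per record.
csv_text = (
    "Name,RegisterNo,Subject,Marks\n"
    "Alice,101,Math,95\n"
    "Bob,102,Science,88\n"
    "Carol,103,English,92\n"
)

# DictReader uses the header row as keys for each record.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Every field is a string, so numeric columns need an explicit conversion.
marks_as_int = [int(row["Marks"]) for row in rows]

print(type(rows[0]["Marks"]).__name__)
print(marks_as_int)
```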

2. SQL
Definition:
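SQL is the query language of relational databases, which store data in structured tables with typed columns. It lets you query, filter, and manage data efficiently. The example below uses SQLite, a file-based database that ships with Python.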

Google Colab Code Example:

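import sqlite3
import pandas as pd

conn = sqlite3.connect("students.db")
# Write the DataFrame to a table named Student, replacing it if it already exists.
df.to_sql("Student", conn, if_exists="replace", index=False)
print(pd.read_sql_query("SELECT * FROM Student", conn))
conn.close()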

Sample Output (SQL Table):

Name RegisterNo Subject Marks
Alice 101 Math 95
Bob 102 Science 88
Carol 103 English 92
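
The real advantage of SQL is that filtering and sorting happen inside the database. A minimal sketch with the standard `sqlite3` module follows. It uses an in-memory database and creates the table by hand, and the threshold of 90 marks is just an illustrative value I picked.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Student (Name TEXT, RegisterNo INTEGER, Subject TEXT, Marks INTEGER)"
)
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?, ?)",
    [
        ("Alice", 101, "Math", 95),
        ("Bob", 102, "Science", 88),
        ("Carol", 103, "English", 92),
    ],
)

# Let the database filter and sort; the ? placeholder keeps the value out of the SQL text.
top = conn.execute(
    "SELECT Name, Marks FROM Student WHERE Marks > ? ORDER BY Marks DESC",
    (90,),
).fetchall()
conn.close()

print(top)
```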

3. JSON (JavaScript Object Notation)

Definition:
JSON represents data as objects (key-value pairs) and arrays, which can be nested. It’s human-readable and widely used in APIs and web apps.

Google Colab Code Example:

df.to_json("data.json", orient="records", indent=4)
print("✅ JSON file created.")

Sample Output (JSON Data, shown compactly; with indent=4 each key is written on its own line):

[
{"Name": "Alice", "RegisterNo": 101, "Subject": "Math", "Marks": 95},
{"Name": "Bob", "RegisterNo": 102, "Subject": "Science", "Marks": 88},
{"Name": "Carol", "RegisterNo": 103, "Subject": "English", "Marks": 92}
]
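
Reading JSON back is equally short. The sketch below parses the sample records with the standard `json` module. The string literal stands in for data.json, and `marks_by_name` is just a name I chose for the illustration.

```python
import json

# Stand-in for the contents of data.json.
json_text = """
[
    {"Name": "Alice", "RegisterNo": 101, "Subject": "Math", "Marks": 95},
    {"Name": "Bob", "RegisterNo": 102, "Subject": "Science", "Marks": 88},
    {"Name": "Carol", "RegisterNo": 103, "Subject": "English", "Marks": 92}
]
"""

# A JSON array of objects becomes a Python list of dicts.
students = json.loads(json_text)

# JSON distinguishes numbers from strings, so Marks is already an int here.
marks_by_name = {s["Name"]: s["Marks"] for s in students}

print(marks_by_name)
```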

4. Parquet (Columnar Storage Format)

Definition:
Parquet is a column-based storage format used in big data analytics. It is highly efficient for queries on large datasets.

Google Colab Code Example:

!pip install pyarrow
df.to_parquet("data.parquet")
print("✅ Parquet file created.")

Sample Output (Read in Python):

Name RegisterNo Subject Marks
Alice 101 Math 95
Bob 102 Science 88
Carol 103 English 92

5. XML (Extensible Markup Language)

Definition:
XML uses tags to structure data. It is extensible and readable by both humans and machines.

Google Colab Code Example:

!pip install dicttoxml
from dicttoxml import dicttoxml
xml_data = dicttoxml(df.to_dict(orient='records'), custom_root='Students', attr_type=False)
with open("data.xml", "wb") as f:
    f.write(xml_data)

print("✅ XML file created.")

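Sample Output (XML Data, indented here for readability; dicttoxml writes it on a single line):

<?xml version="1.0" encoding="UTF-8" ?>
<Students>
  <item><Name>Alice</Name><RegisterNo>101</RegisterNo><Subject>Math</Subject><Marks>95</Marks></item>
  <item><Name>Bob</Name><RegisterNo>102</RegisterNo><Subject>Science</Subject><Marks>88</Marks></item>
  <item><Name>Carol</Name><RegisterNo>103</RegisterNo><Subject>English</Subject><Marks>92</Marks></item>
</Students>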
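
To read the XML back, Python's standard `xml.etree.ElementTree` is enough. This is a sketch under an assumption about the layout: a `Students` root with one `item` element per record, which is how I expect dicttoxml to wrap a list of records. The string literal stands in for data.xml.

```python
import xml.etree.ElementTree as ET

# Assumed layout: <Students> root, one <item> per record, one child tag per column.
xml_text = """<Students>
  <item><Name>Alice</Name><RegisterNo>101</RegisterNo><Subject>Math</Subject><Marks>95</Marks></item>
  <item><Name>Bob</Name><RegisterNo>102</RegisterNo><Subject>Science</Subject><Marks>88</Marks></item>
  <item><Name>Carol</Name><RegisterNo>103</RegisterNo><Subject>English</Subject><Marks>92</Marks></item>
</Students>"""

root = ET.fromstring(xml_text)

# Element text is always a string, so numeric fields are converted explicitly.
students = [
    {"Name": item.findtext("Name"), "Marks": int(item.findtext("Marks"))}
    for item in root.findall("item")
]

print(students)
```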

6. Avro (Row-based Storage Format)

Definition:
Avro is a binary, row-based format commonly used in big data pipelines. It stores schema along with data for efficient processing.

Google Colab Code Example:

!pip install fastavro
import fastavro

schema = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "RegisterNo", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"}
    ]
}

records = df.to_dict(orient="records")

with open("data.avro", "wb") as out:
    fastavro.writer(out, schema, records)

print("✅ Avro file created.")

Sample Output (Read in Python):

Name RegisterNo Subject Marks
Alice 101 Math 95
Bob 102 Science 88
Carol 103 English 92

Conclusion

Each format serves different purposes:

CSV: Simple, readable, good for small datasets

SQL: Structured, relational, great for databases

JSON: Lightweight, perfect for web APIs

Parquet: Columnar, efficient for analytics on large data

XML: Extensible, ideal for data exchange

Avro: Row-based, optimized for big data pipelines
