DEV Community

Cover image for Data Formats Used in Data Analytics
Hindu Narmatha
Hindu Narmatha

Posted on

Data Formats Used in Data Analytics

In the world of data analytics, we deal with data in many forms — from simple spreadsheets to complex binary formats. Choosing the right data format can affect performance, storage efficiency, and compatibility.

In this post, I’ll show you 6 commonly used data formats — CSV, SQL, JSON, Parquet, XML, and Avro — with examples of the same dataset represented in each format.

example :
| Name | RegisterNo | Subject | Marks |
| ----- | ---------- | ------- | ----- |
| Alice | 101 | Math | 95 |
| Bob | 102 | Science | 88 |
| Carol | 103 | English | 92 |

1.CSV
Definition:
CSV is a simple text format where each row is a record and columns are separated by commas. It is easy to read and widely used.

Google Colab Code Example:

df.to_csv("data.csv", index=False)
print("✅ CSV file created.")
Enter fullscreen mode Exit fullscreen mode

Output (CSV Table):

Name RegisterNo Subject Marks
Alice 101 Math 95
Bob 102 Science 88
Carol 103 English 92

2. SQL
Definition:
SQL stores data in structured tables. It allows querying and managing data efficiently.

Google Colab Code Example:

import sqlite3
conn = sqlite3.connect("students.db")
df.to_sql("Student", conn, if_exists="replace", index=False)
print(pd.read_sql_query("SELECT * FROM Student", conn))
Enter fullscreen mode Exit fullscreen mode

Sample Output (SQL Table):

Name RegisterNo Subject Marks
Alice 101 Math 95
Bob 102 Science 88
Carol 103 English 92

3. JSON (JavaScript Object Notation)

Definition:
JSON stores data in key-value pairs. It’s human-readable and widely used in APIs and web apps.

Google Colab Code Example:

df.to_json("data.json", orient="records", indent=4)
print("✅ JSON file created.")
Enter fullscreen mode Exit fullscreen mode

Sample Output (JSON Data):

[
{"Name": "Alice", "RegisterNo": 101, "Subject": "Math", "Marks": 95},
{"Name": "Bob", "RegisterNo": 102, "Subject": "Science", "Marks": 88},
{"Name": "Carol", "RegisterNo": 103, "Subject": "English", "Marks": 92}
]

4. Parquet (Columnar Storage Format)

Definition:
Parquet is a column-based storage format used in big data analytics. It is highly efficient for queries on large datasets.

Google Colab Code Example:

!pip install pyarrow
df.to_parquet("data.parquet")
print("✅ Parquet file created.")
Enter fullscreen mode Exit fullscreen mode

Sample Output (Read in Python):

Name RegisterNo Subject Marks
Alice 101 Math 95
Bob 102 Science 88
Carol 103 English 92

5. XML (Extensible Markup Language)

Definition:
XML uses tags to structure data. It is extensible and readable by both humans and machines.

Google Colab Code Example:

!pip install dicttoxml
from dicttoxml import dicttoxml
xml_data = dicttoxml(df.to_dict(orient='records'), custom_root='Students', attr_type=False)
with open("data.xml", "wb") as f:
    f.write(xml_data)

print("✅ XML file created.")
Enter fullscreen mode Exit fullscreen mode

Sample Output (XML Data):
Name RegisterNo Subject Marks
Alice 101 Math 95
Bob 102 Science 88
Carol 103 English 92

6. Avro (Row-based Storage Format)

Definition:
Avro is a binary, row-based format commonly used in big data pipelines. It stores schema along with data for efficient processing.

Google Colab Code Example:

!pip install fastavro
import fastavro

schema = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "RegisterNo", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"}
    ]
}

records = df.to_dict(orient="records")

with open("data.avro", "wb") as out:
    fastavro.writer(out, schema, records)

print("✅ Avro file created.")
Enter fullscreen mode Exit fullscreen mode

Sample Output (Read in Python):

Name RegisterNo Subject Marks
Alice 101 Math 95
Bob 102 Science 88
Carol 103 English 92

Conclusion

Each format serves different purposes:

CSV: Simple, readable, good for small datasets

SQL: Structured, relational, great for databases

JSON: Lightweight, perfect for web APIs

Parquet: Columnar, efficient for analytics on large data

XML: Extensible, ideal for data exchange

Avro: Row-based, optimized for big data pipelines

Top comments (0)