Shrutti Kannan

Posted on Oct 8

🔍 Understanding 6 Common Data Formats in Data Analytics (With Examples)

#analytics #beginners #dataengineering

When working in data analytics, we often need to store, share, and transform data in various formats. Each format has its own strengths, ideal use cases, and limitations. Whether you're wrangling data for machine learning, storing logs, or building a data pipeline, understanding these formats is key.

In this article, we’ll walk through six widely-used data formats with a simple example dataset. We'll look at:

CSV (Comma Separated Values)

SQL (Relational Table Format)

JSON (JavaScript Object Notation)

Parquet (Columnar Storage Format)

XML (Extensible Markup Language)

Avro (Row-based Storage Format)

🎓 Example Dataset

To make things concrete, let’s use this small dataset of student exam scores:

Name Register No Subject Marks
Alice 1001 Math 89
Bob 1002 Science 92
Charlie 1003 History 85

We'll now represent this dataset in each format.

📄 CSV (Comma Separated Values) 🔍 What is CSV?

CSV is one of the simplest data formats. It stores tabular data as plain text, with rows separated by line breaks and columns separated by commas. It’s human-readable, easy to generate, and widely supported.

🧾 Example:

Name,Register No,Subject,Marks
Alice,1001,Math,89
Bob,1002,Science,92
Charlie,1003,History,85

✅ Pros:

Simple and readable

Easy to parse

Ideal for small data

❌ Cons:

No support for nested data

No data types (everything is text)

Not efficient for large datasets

🗄️ SQL (Relational Table Format) 🔍 What is SQL?

SQL is the language used for managing relational databases. Data is stored in tables with defined columns and types. You query it using SQL syntax like SELECT, INSERT, etc.

🧾 Example:

CREATE TABLE student_scores (
Name TEXT,
RegisterNo INT,
Subject TEXT,
Marks INT
);

INSERT INTO student_scores (Name, RegisterNo, Subject, Marks) VALUES
('Alice', 1001, 'Math', 89),
('Bob', 1002, 'Science', 92),
('Charlie', 1003, 'History', 85);

✅ Pros:

Strong data typing

Powerful querying capabilities

Supports constraints and indexing

❌ Cons:

Requires database engine

Less flexible for hierarchical data

🌐 JSON (JavaScript Object Notation) 🔍 What is JSON?

JSON is a lightweight format for storing structured data, often used in web APIs. It supports nested and hierarchical data.

🧾 Example:
[
{
"Name": "Alice",
"RegisterNo": 1001,
"Subject": "Math",
"Marks": 89
},
{
"Name": "Bob",
"RegisterNo": 1002,
"Subject": "Science",
"Marks": 92
},
{
"Name": "Charlie",
"RegisterNo": 1003,
"Subject": "History",
"Marks": 85
}
]

✅ Pros:

Human-readable and flexible

Supports nested structures

Common in APIs and web apps

❌ Cons:

Larger file size than binary formats

No schema enforcement

🧱 Parquet (Columnar Storage Format) 🔍 What is Parquet?

Parquet is an open-source columnar storage format optimized for performance, especially with big data tools like Apache Spark and Hive. It stores data by columns instead of rows, which allows efficient compression and retrieval.

🧾 Example (Pseudo view):

We can’t represent Parquet directly as plain text, but here’s how the schema might look:

parquet_schema:
Name: BYTE_ARRAY (UTF8)
RegisterNo: INT32
Subject: BYTE_ARRAY (UTF8)
Marks: INT32
Columnar View:
Name: ["Alice", "Bob", "Charlie"]
RegisterNo: [1001, 1002, 1003]
Subject: ["Math", "Science", "History"]
Marks: [89, 92, 85]

You’d use a tool like pandas.to_parquet() or Apache Spark to read/write this format.

✅ Pros:

Great for analytics (fast column reads)

Efficient storage and compression

Preferred in big data environments

❌ Cons:

Not human-readable

Complex to debug without tools

🪵 XML (Extensible Markup Language) 🔍 What is XML?

XML is a markup language that stores structured data with custom tags. It’s verbose but flexible and still used in enterprise and legacy systems.

🧾 Example:

Alice
1001
Math
89

Bob
1002
Science
92

Charlie
1003
History
85

✅ Pros:

Hierarchical and self-descriptive

Schema (XSD) support for validation

❌ Cons:

Verbose

Slower to parse compared to JSON

📦 Avro (Row-Based Storage Format) 🔍 What is Avro?

Avro is a row-based binary format developed within the Apache Hadoop ecosystem. It stores both data and schema, making it ideal for serialization and transport in distributed systems.

🧾 Example:

Avro is a binary format, but here’s a conceptual view using JSON schema + data.

Schema (JSON):

{
"type": "record",
"name": "Student",
"fields": [
{"name": "Name", "type": "string"},
{"name": "RegisterNo", "type": "int"},
{"name": "Subject", "type": "string"},
{"name": "Marks", "type": "int"}
]
}

Data (stored in binary, but conceptualized like):

[
{"Name": "Alice", "RegisterNo": 1001, "Subject": "Math", "Marks": 89},
{"Name": "Bob", "RegisterNo": 1002, "Subject": "Science", "Marks": 92},
{"Name": "Charlie", "RegisterNo": 1003, "Subject": "History", "Marks": 85}
]

✅ Pros:

Schema evolution support

Compact binary format

Ideal for streaming and Hadoop

❌ Cons:

Not human-readable

Requires Avro libraries to read/write

📊 Summary Comparison
Format Human-readable Supports Nested Best For Compression
CSV ✅ ❌ Simple data exchange ❌
SQL ✅ ❌ Structured storage ✅ (DB level)
JSON ✅ ✅ APIs, web apps ❌
Parquet ❌ ✅ Big data analytics ✅✅✅
XML ✅ ✅ Legacy systems ❌
Avro ❌ ✅ Serialization, Kafka ✅✅
🧠 Conclusion

Choosing the right data format depends on your use case. For small, simple data, CSV or JSON works well. For large-scale analytics, go with Parquet. If you're dealing with streaming systems or need schema evolution, Avro is your friend.

Understanding these formats—and how to work with them—will help you be more effective whether you're doing ETL, data science, or building data pipelines.

👉 Which format do you use the most in your projects? Let me know in the comments!

DEV Community

🔍 Understanding 6 Common Data Formats in Data Analytics (With Examples)

Top comments (0)