When we work with data in cloud platforms like Google Cloud, AWS, or Azure, we often see different file formats.
Each format is used for a different purpose — some are easy to read, some save a lot of space, and some are perfect for big-data analytics.
In this post, I’ll explain six commonly used data formats, each with a simple definition and the same small sample dataset shown in that format.
Sample Dataset Used in All Examples
| Name | Register_No | Subject | Marks |
|---|---|---|---|
| Sri | 101 | Math | 87 |
| Vardhan | 102 | Science | 92 |
| Keerthana | 103 | English | 89 |
1️⃣ CSV (Comma Separated Values)
CSV is the simplest and most widely used data format.
It is a plain text file where each value is separated by a comma.
✔ Easy to read
✔ Works with Excel, Google Sheets, and almost every tool
✔ Good for small datasets
✅ Example (CSV)
Name,Register_No,Subject,Marks
Sri,101,Math,87
Vardhan,102,Science,92
Keerthana,103,English,89
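As a quick check, the CSV above can be parsed with Python’s built-in `csv` module. This is a minimal sketch that reads the data from a string instead of a file:

```python
import csv
import io

# The CSV content from the example above (could also come from a file)
csv_text = """Name,Register_No,Subject,Marks
Sri,101,Math,87
Vardhan,102,Science,92
Keerthana,103,English,89
"""

# DictReader maps each row to a dict keyed by the header line
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(rows[0]["Name"])   # → Sri
print(rows[2]["Marks"])  # → 89 (note: CSV values are read as strings)
```

One thing to remember: CSV carries no type information, so `Marks` comes back as the string `"89"`, not the number `89`.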
2️⃣ SQL (Relational Table Format)
SQL represents data in a table structure (rows and columns), usually written as a script of CREATE and INSERT statements.
It is used in relational databases like MySQL, PostgreSQL, and SQL Server.
✔ Best for structured, organized data
✔ Allows powerful querying using SQL commands
✅ Example (SQL)
CREATE TABLE Students (
Name TEXT,
Register_No INT,
Subject TEXT,
Marks INT
);
INSERT INTO Students VALUES
('Sri', 101, 'Math', 87),
('Vardhan', 102, 'Science', 92),
('Keerthana', 103, 'English', 89);
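You can run the statements above end to end with Python’s standard `sqlite3` module (SQLite is used here only because it needs no server; MySQL or PostgreSQL would accept the same statements):

```python
import sqlite3

# In-memory SQLite database just to try the statements above
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""CREATE TABLE Students (
    Name TEXT,
    Register_No INT,
    Subject TEXT,
    Marks INT
)""")
cur.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?)",
    [("Sri", 101, "Math", 87),
     ("Vardhan", 102, "Science", 92),
     ("Keerthana", 103, "English", 89)],
)

# A simple query: students scoring above 88, highest first
cur.execute("SELECT Name FROM Students WHERE Marks > 88 ORDER BY Marks DESC")
top = [row[0] for row in cur.fetchall()]
print(top)  # → ['Vardhan', 'Keerthana']
```

This is where SQL shines compared to flat files: filtering and sorting are one declarative statement instead of hand-written loops.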
3️⃣ JSON (JavaScript Object Notation)
JSON stores data as key-value pairs.
It is widely used in APIs, web apps, and NoSQL databases like MongoDB.
✔ Human-readable
✔ Great for semi-structured data
✔ Works well with programming languages
✅ Example (JSON)
[
{ "Name": "Sri", "Register_No": 101, "Subject": "Math", "Marks": 87 },
{ "Name": "Vardhan", "Register_No": 102, "Subject": "Science", "Marks": 92 },
{ "Name": "Keerthana", "Register_No": 103, "Subject": "English", "Marks": 89 }
]
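Parsing the JSON above takes one call with Python’s standard `json` module. A small sketch:

```python
import json

json_text = """[
  { "Name": "Sri", "Register_No": 101, "Subject": "Math", "Marks": 87 },
  { "Name": "Vardhan", "Register_No": 102, "Subject": "Science", "Marks": 92 },
  { "Name": "Keerthana", "Register_No": 103, "Subject": "English", "Marks": 89 }
]"""

# json.loads turns the array into a list of dicts
students = json.loads(json_text)

# Unlike CSV, numeric fields keep their types (Marks is an int)
total = sum(s["Marks"] for s in students)
print(total)  # → 268
```

Notice the contrast with CSV: JSON preserves numbers as numbers, so no manual conversion is needed before doing arithmetic.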
4️⃣ Parquet (Columnar Storage Format)
Parquet is a binary, column-oriented file format used in big-data systems like Spark, Hive, and BigQuery.
✔ Extremely efficient for analytics
✔ Compresses data well
✔ Reads only required columns → very fast
Because Parquet is binary, it cannot be shown as plain text; here is the conceptual layout:
📌 Parquet (Columnar View)
Name: ["Sri", "Vardhan", "Keerthana"]
Register_No: [101, 102, 103]
Subject: ["Math", "Science", "English"]
Marks: [87, 92, 89]
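To make the row-to-column idea concrete, here is a tiny pure-Python sketch of the transposition Parquet performs internally. Real Parquet files add encoding and compression and are written with libraries such as pyarrow; this only illustrates the conceptual layout:

```python
# Row-oriented records, as CSV or JSON would store them
rows = [
    {"Name": "Sri", "Register_No": 101, "Subject": "Math", "Marks": 87},
    {"Name": "Vardhan", "Register_No": 102, "Subject": "Science", "Marks": 92},
    {"Name": "Keerthana", "Register_No": 103, "Subject": "English", "Marks": 89},
]

# Columnar layout: one contiguous list per column, as in the view above
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An analytics query touches only the one column it needs
avg_marks = sum(columns["Marks"]) / len(columns["Marks"])
print(columns["Marks"])  # → [87, 92, 89]
```

This is why Parquet is fast for analytics: computing an average over `Marks` never has to read the `Name` or `Subject` data at all.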
5️⃣ XML (Extensible Markup Language)
XML stores data using custom tags (similar to HTML).
It was widely used in older systems and is still common in some enterprise applications.
✔ Self-descriptive
✔ Structured
✔ Used in configurations and data interchange
✅ Example (XML)
<Students>
  <Student>
    <Name>Sri</Name>
    <Register_No>101</Register_No>
    <Subject>Math</Subject>
    <Marks>87</Marks>
  </Student>
  <Student>
    <Name>Vardhan</Name>
    <Register_No>102</Register_No>
    <Subject>Science</Subject>
    <Marks>92</Marks>
  </Student>
  <Student>
    <Name>Keerthana</Name>
    <Register_No>103</Register_No>
    <Subject>English</Subject>
    <Marks>89</Marks>
  </Student>
</Students>
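XML like this can be read with Python’s standard `xml.etree.ElementTree` module. A minimal sketch, with the document embedded as a string (the `Students`/`Student` tag names are just this example’s choice):

```python
import xml.etree.ElementTree as ET

xml_text = """<Students>
  <Student>
    <Name>Sri</Name><Register_No>101</Register_No>
    <Subject>Math</Subject><Marks>87</Marks>
  </Student>
  <Student>
    <Name>Vardhan</Name><Register_No>102</Register_No>
    <Subject>Science</Subject><Marks>92</Marks>
  </Student>
  <Student>
    <Name>Keerthana</Name><Register_No>103</Register_No>
    <Subject>English</Subject><Marks>89</Marks>
  </Student>
</Students>"""

root = ET.fromstring(xml_text)

# Every value in XML is text, so Marks must be converted explicitly
names = [s.findtext("Name") for s in root.findall("Student")]
marks = [int(s.findtext("Marks")) for s in root.findall("Student")]
print(names)  # → ['Sri', 'Vardhan', 'Keerthana']
```

Like CSV, XML is untyped: every element’s content is text until the program converts it.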
6️⃣ Avro (Row-Based Format with Schema)
Avro is a binary file format used mostly in streaming systems like Kafka and big-data pipelines.
✔ Stores data + schema together
✔ Good for fast data serialization
✔ Used in Hadoop ecosystems
📌 Avro Schema
{
"type": "record",
"name": "Student",
"fields": [
{ "name": "Name", "type": "string" },
{ "name": "Register_No", "type": "int" },
{ "name": "Subject", "type": "string" },
{ "name": "Marks", "type": "int" }
]
}
📌 Avro Data (Readable Preview)
[
{ "Name": "Sri", "Register_No": 101, "Subject": "Math", "Marks": 87 },
{ "Name": "Vardhan", "Register_No": 102, "Subject": "Science", "Marks": 92 },
{ "Name": "Keerthana", "Register_No": 103, "Subject": "English", "Marks": 89 }
]
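The key Avro idea is that every record is checked against the schema it travels with. Here is a pure-Python sketch of that schema/record pairing; real Avro encodes records in a compact binary form via libraries such as fastavro, and the `matches_schema` helper below is only an illustration, not part of any Avro library:

```python
schema = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Register_No", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"},
    ],
}

# Map Avro primitive type names to Python types (subset, for the sketch)
AVRO_TO_PY = {"string": str, "int": int}

def matches_schema(record, schema):
    """Check that a record has exactly the fields and types the schema declares."""
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    return all(isinstance(record[f["name"]], AVRO_TO_PY[f["type"]]) for f in fields)

ok = matches_schema(
    {"Name": "Sri", "Register_No": 101, "Subject": "Math", "Marks": 87}, schema)
bad = matches_schema(
    {"Name": "Sri", "Register_No": "101", "Subject": "Math", "Marks": 87}, schema)
print(ok, bad)  # → True False
```

Because the schema ships with the data, a Kafka consumer can reject the second record (a string where an int belongs) before it ever reaches downstream pipelines.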
Conclusion
Each data format has its own strengths:
CSV → simple and universal
SQL → perfect for structured data
JSON → great for APIs
Parquet → best for fast analytics
XML → still used in many systems
Avro → excellent for streaming data
Understanding these formats is essential for working with cloud computing, big data, and modern analytics tools.