When working with data in the cloud or in analytics pipelines, the format of your data matters. Choosing the right format can impact performance, storage cost, and ease of use.
In this article, we’ll explore six widely used data formats:
- CSV (Comma-Separated Values)
- SQL (Relational Tables)
- JSON (JavaScript Object Notation)
- Parquet (Columnar Storage)
- XML (Extensible Markup Language)
- Avro (Row-based Storage)
Sample dataset:

| Name | Register_No | Subject | Marks |
|---|---|---|---|
| Aisha | 101 | Math | 85 |
| Rahul | 102 | Science | 90 |
| Meera | 103 | English | 78 |
1️⃣ CSV (Comma-Separated Values)
What is CSV?
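CSV is one of the simplest data formats. Each line represents a record, values are separated by commas, and the first line usually holds the column names. It's human-readable and widely supported by Excel, Python, R, and databases, but it carries no type information, so every value is read as text until the consumer converts it.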
Dataset in CSV:
Name,Register_No,Subject,Marks
Aisha,101,Math,85
Rahul,102,Science,90
Meera,103,English,78
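To make the "no type information" point concrete, here is a minimal sketch that reads the dataset with Python's standard `csv` module. The names `CSV_TEXT` and `read_students` are illustrative choices, not part of any library.

```python
import csv
import io

# The sample dataset from this article, held in a string instead of a file.
CSV_TEXT = """Name,Register_No,Subject,Marks
Aisha,101,Math,85
Rahul,102,Science,90
Meera,103,English,78
"""


def read_students(text):
    """Parse CSV text into a list of dicts, converting numeric columns to int."""
    reader = csv.DictReader(io.StringIO(text))
    students = []
    for row in reader:
        # Every CSV value arrives as a string, so numbers are converted by hand.
        row["Register_No"] = int(row["Register_No"])
        row["Marks"] = int(row["Marks"])
        students.append(row)
    return students


students = read_students(CSV_TEXT)
average = sum(s["Marks"] for s in students) / len(students)
print(students[0]["Name"], round(average, 2))
```

For a real file, the same reader works on an object returned by `open(path, newline="")` in place of `io.StringIO`.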
2️⃣ SQL (Relational Table Format)
What is SQL format?
SQL is a query language rather than a file format. Relational databases store data in tables with rows and typed columns, and the data is defined and queried using SQL statements.
Dataset in SQL:
CREATE TABLE Students (
Name VARCHAR(50),
Register_No INT,
Subject VARCHAR(50),
Marks INT
);
INSERT INTO Students (Name, Register_No, Subject, Marks) VALUES
('Aisha', 101, 'Math', 85),
('Rahul', 102, 'Science', 90),
('Meera', 103, 'English', 78);
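The statements above can be tried without installing a database server. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; SQLite is my choice for illustration and the article does not prescribe an engine. The helper name `students_with_marks_at_least` is hypothetical.

```python
import sqlite3

ROWS = [
    ("Aisha", 101, "Math", 85),
    ("Rahul", 102, "Science", 90),
    ("Meera", 103, "English", 78),
]


def students_with_marks_at_least(threshold):
    """Create the Students table in memory and query it by marks."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE Students ("
            "Name VARCHAR(50), Register_No INT, Subject VARCHAR(50), Marks INT)"
        )
        # Placeholders (?) keep values separate from the SQL text.
        conn.executemany(
            "INSERT INTO Students (Name, Register_No, Subject, Marks) "
            "VALUES (?, ?, ?, ?)",
            ROWS,
        )
        cursor = conn.execute(
            "SELECT Name, Marks FROM Students WHERE Marks >= ? ORDER BY Marks DESC",
            (threshold,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


top = students_with_marks_at_least(80)
print(top)
```

The query returns only the rows and columns asked for, which is the main practical difference from scanning a CSV file by hand.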
3️⃣ JSON (JavaScript Object Notation)
What is JSON?
JSON is a lightweight, human-readable data format often used in APIs. It stores data as key-value pairs, arrays, and nested objects.
Dataset in JSON:
[
{
"Name": "Aisha",
"Register_No": 101,
"Subject": "Math",
"Marks": 85
},
{
"Name": "Rahul",
"Register_No": 102,
"Subject": "Science",
"Marks": 90
},
{
"Name": "Meera",
"Register_No": 103,
"Subject": "English",
"Marks": 78
}
]
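Unlike CSV, JSON distinguishes numbers from strings, so a parser returns typed values directly. A minimal sketch with Python's standard `json` module follows; `JSON_TEXT` and `marks_by_name` are names chosen for this example.

```python
import json

# The same dataset as a JSON array of objects.
JSON_TEXT = """
[
  {"Name": "Aisha", "Register_No": 101, "Subject": "Math", "Marks": 85},
  {"Name": "Rahul", "Register_No": 102, "Subject": "Science", "Marks": 90},
  {"Name": "Meera", "Register_No": 103, "Subject": "English", "Marks": 78}
]
"""

# json.loads turns the array into a list of dicts; numbers come back as int.
students = json.loads(JSON_TEXT)
marks_by_name = {s["Name"]: s["Marks"] for s in students}

# json.dumps serializes Python objects back to JSON text.
round_trip = json.loads(json.dumps(students))

print(marks_by_name)
```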
4️⃣ Parquet (Columnar Storage Format)
What is Parquet?
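Parquet is a binary columnar storage format optimized for big data analytics. Instead of storing data row by row, it stores it column by column. A query can then read only the columns it needs, which speeds up aggregations, and values of the same type sit next to each other, which compresses well and reduces storage.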
Dataset in Parquet (conceptual view):
Columns:
Name → ["Aisha", "Rahul", "Meera"]
Register_No → [101, 102, 103]
Subject → ["Math", "Science", "English"]
Marks → [85, 90, 78]
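The conceptual view above can be mimicked in plain Python. This is only a sketch of the row-to-column idea and not the actual Parquet encoding, which is a binary format normally written through a library; the function name `to_columns` is hypothetical.

```python
ROWS = [
    {"Name": "Aisha", "Register_No": 101, "Subject": "Math", "Marks": 85},
    {"Name": "Rahul", "Register_No": 102, "Subject": "Science", "Marks": 90},
    {"Name": "Meera", "Register_No": 103, "Subject": "English", "Marks": 78},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(ROWS)

# An aggregation over one column touches only that column's values.
marks = columns["Marks"]
average_marks = sum(marks) / len(marks)

print(columns["Marks"], round(average_marks, 2))
```

Computing the average here never looks at names or subjects, which is the access pattern a columnar layout is built for.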
5️⃣ XML (Extensible Markup Language)
What is XML?
XML is a markup language that uses tags to structure data. It’s verbose but still used in enterprise systems and web services.
Dataset in XML:
<Students>
<Student>
<Name>Aisha</Name>
<Register_No>101</Register_No>
<Subject>Math</Subject>
<Marks>85</Marks>
</Student>
<Student>
<Name>Rahul</Name>
<Register_No>102</Register_No>
<Subject>Science</Subject>
<Marks>90</Marks>
</Student>
<Student>
<Name>Meera</Name>
<Register_No>103</Register_No>
<Subject>English</Subject>
<Marks>78</Marks>
</Student>
</Students>
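As a rough illustration of reading this document, the sketch below uses Python's standard `xml.etree.ElementTree` module. Like CSV, XML element text is untyped, so the numeric fields are converted explicitly. `XML_TEXT` and `parse_students` are names made up for this example.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<Students>
  <Student>
    <Name>Aisha</Name>
    <Register_No>101</Register_No>
    <Subject>Math</Subject>
    <Marks>85</Marks>
  </Student>
  <Student>
    <Name>Rahul</Name>
    <Register_No>102</Register_No>
    <Subject>Science</Subject>
    <Marks>90</Marks>
  </Student>
  <Student>
    <Name>Meera</Name>
    <Register_No>103</Register_No>
    <Subject>English</Subject>
    <Marks>78</Marks>
  </Student>
</Students>
"""


def parse_students(text):
    """Turn each <Student> element into a dict with typed fields."""
    root = ET.fromstring(text.strip())
    students = []
    for node in root.findall("Student"):
        students.append(
            {
                "Name": node.findtext("Name"),
                # Element text is always a string, so convert numbers by hand.
                "Register_No": int(node.findtext("Register_No")),
                "Subject": node.findtext("Subject"),
                "Marks": int(node.findtext("Marks")),
            }
        )
    return students


students = parse_students(XML_TEXT)
print(students[1])
```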
6️⃣ Avro (Row-based Storage Format)
What is Avro?
Avro is a row-based binary format developed as an Apache project. It's schema-driven: the schema is defined in JSON, and Avro data files store it alongside the records. It is commonly used in Kafka and Hadoop ecosystems. Unlike Parquet, it stores data row by row.
Dataset in Avro (conceptual view):
Schema (JSON-based):
{
"type": "record",
"name": "Student",
"fields": [
{"name": "Name", "type": "string"},
{"name": "Register_No", "type": "int"},
{"name": "Subject", "type": "string"},
{"name": "Marks", "type": "int"}
]
}
Records (shown as JSON for readability; on disk Avro encodes them in binary):
{"Name": "Aisha", "Register_No": 101, "Subject": "Math", "Marks": 85}
{"Name": "Rahul", "Register_No": 102, "Subject": "Science", "Marks": 90}
{"Name": "Meera", "Register_No": 103, "Subject": "English", "Marks": 78}
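To show what "schema-driven" means in practice, here is a plain-Python sketch that checks records against the schema above. It does not produce or read real Avro files, which need an Avro library, and it handles only the two primitive types used in this dataset. The names `PYTHON_TYPES` and `conforms` are hypothetical.

```python
import json

SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "Register_No", "type": "int"},
    {"name": "Subject", "type": "string"},
    {"name": "Marks", "type": "int"}
  ]
}
""")

# Assumption: only the primitive types that appear in this schema are mapped.
PYTHON_TYPES = {"string": str, "int": int}


def conforms(record, schema):
    """Return True if the record has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {field["name"] for field in fields}:
        return False
    return all(
        type(record[field["name"]]) is PYTHON_TYPES[field["type"]]
        for field in fields
    )


good = {"Name": "Aisha", "Register_No": 101, "Subject": "Math", "Marks": 85}
bad = {"Name": "Rahul", "Register_No": "102", "Subject": "Science", "Marks": 90}

print(conforms(good, SCHEMA), conforms(bad, SCHEMA))
```

A writer that enforces the schema rejects a record like `bad` before it reaches a consumer, which is one reason schema-driven formats suit streaming pipelines.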
Conclusion
Each data format serves a different purpose:
CSV → Simple, universal
SQL → Structured, relational
JSON → Flexible, great for APIs
Parquet → Columnar, optimized for analytics
XML → Structured but verbose
Avro → Row-based, schema-driven
In modern cloud data platforms (such as AWS, GCP, and Azure), Parquet and Avro are heavily used for large-scale analytics and data pipelines, while CSV, JSON, and XML remain popular for data interchange and SQL tables remain the standard for structured, queryable storage.