Sri Vardhan
Common Data Formats Used in Cloud Data Analytics

When we work with data in cloud platforms like Google Cloud, AWS, or Azure, we often see different file formats.
Each format is used for a different purpose — some are easy to read, some save a lot of space, and some are perfect for big-data analytics.

In this post, I’ll explain 6 commonly used data formats with simple definitions and a small sample dataset represented in all formats.

Sample Dataset Used in All Examples

| Name      | Register_No | Subject | Marks |
|-----------|-------------|---------|-------|
| Sri       | 101         | Math    | 87    |
| Vardhan   | 102         | Science | 92    |
| Keerthana | 103         | English | 89    |

1️⃣ CSV (Comma Separated Values)

CSV is the simplest and most widely used data format.
It is a plain text file where each value is separated by a comma.

✔ Easy to read
✔ Works with Excel, Google Sheets, and almost every tool
✔ Good for small datasets

✅ Example (CSV)

```csv
Name,Register_No,Subject,Marks
Sri,101,Math,87
Vardhan,102,Science,92
Keerthana,103,English,89
```
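To see how a tool actually parses this format, here is a small sketch using Python's built-in `csv` module on the sample dataset (nothing cloud-specific is assumed):

```python
import csv
import io

# Inline copy of the sample dataset from this post.
data = """Name,Register_No,Subject,Marks
Sri,101,Math,87
Vardhan,102,Science,92
Keerthana,103,English,89
"""

# DictReader maps each data row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["Name"], rows[0]["Marks"])
```

Note that CSV has no type information: every value, including `Marks`, comes back as a string, which is one reason richer formats exist.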

2️⃣ SQL (Relational Table Format)

SQL format represents data in a table structure (rows and columns).
It is used in relational databases like MySQL, PostgreSQL, and SQL Server.

✔ Best for structured, organized data
✔ Allows powerful querying using SQL commands

✅ Example (SQL)

```sql
CREATE TABLE Students (
  Name        TEXT,
  Register_No INT,
  Subject     TEXT,
  Marks       INT
);

INSERT INTO Students VALUES
  ('Sri', 101, 'Math', 87),
  ('Vardhan', 102, 'Science', 92),
  ('Keerthana', 103, 'English', 89);
```
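You can run essentially the same DDL and DML with Python's built-in `sqlite3` module; I'm using SQLite here only because it needs no server, but the statements carry over (with minor dialect differences) to MySQL, PostgreSQL, and SQL Server:

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (Name TEXT, Register_No INT, Subject TEXT, Marks INT)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?)",
    [("Sri", 101, "Math", 87),
     ("Vardhan", 102, "Science", 92),
     ("Keerthana", 103, "English", 89)],
)

# The "powerful querying" part: aggregate over the whole table in one line.
avg_marks = conn.execute("SELECT AVG(Marks) FROM Students").fetchone()[0]
print(avg_marks)
```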

3️⃣ JSON (JavaScript Object Notation)

JSON stores data as key-value pairs.
It is widely used in APIs, web apps, and NoSQL databases like MongoDB.

✔ Human-readable
✔ Great for semi-structured data
✔ Works well with programming languages

✅ Example (JSON)

```json
[
  { "Name": "Sri", "Register_No": 101, "Subject": "Math", "Marks": 87 },
  { "Name": "Vardhan", "Register_No": 102, "Subject": "Science", "Marks": 92 },
  { "Name": "Keerthana", "Register_No": 103, "Subject": "English", "Marks": 89 }
]
```

4️⃣ Parquet (Columnar Storage Format)

Parquet is a binary, column-oriented file format used in big-data systems like Spark, Hive, and BigQuery.

✔ Extremely efficient for analytics
✔ Compresses data well
✔ Reads only required columns → very fast

Because Parquet is binary, it can't be shown as plain text, but here's the conceptual layout:

📌 Parquet (Columnar View)

```text
Name:        ["Sri", "Vardhan", "Keerthana"]
Register_No: [101, 102, 103]
Subject:     ["Math", "Science", "English"]
Marks:       [87, 92, 89]
```
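Writing real Parquet files needs a library such as pyarrow, but the columnar idea behind "reads only required columns" can be sketched in plain Python: once data is stored one array per column, a query like `AVG(Marks)` touches a single array and never deserializes names or subjects.

```python
# Row-oriented layout: each record stored together (as in CSV or Avro).
rows = [
    {"Name": "Sri", "Register_No": 101, "Subject": "Math", "Marks": 87},
    {"Name": "Vardhan", "Register_No": 102, "Subject": "Science", "Marks": 92},
    {"Name": "Keerthana", "Register_No": 103, "Subject": "English", "Marks": 89},
]

# Column-oriented layout: one array per column (the Parquet idea).
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An analytic query like AVG(Marks) reads exactly one of the four arrays.
avg_marks = sum(columns["Marks"]) / len(columns["Marks"])
print(avg_marks)
```

Same-typed values sitting next to each other is also why columnar files compress so well.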
5️⃣ XML (Extensible Markup Language)

XML stores data using custom tags (similar to HTML).
It was widely used in older systems and is still common in some enterprise applications.

✔ Self-descriptive
✔ Structured
✔ Used in configurations and data interchange

✅ Example (XML)

```xml
<Students>
  <Student>
    <Name>Sri</Name>
    <Register_No>101</Register_No>
    <Subject>Math</Subject>
    <Marks>87</Marks>
  </Student>
  <Student>
    <Name>Vardhan</Name>
    <Register_No>102</Register_No>
    <Subject>Science</Subject>
    <Marks>92</Marks>
  </Student>
  <Student>
    <Name>Keerthana</Name>
    <Register_No>103</Register_No>
    <Subject>English</Subject>
    <Marks>89</Marks>
  </Student>
</Students>
```
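Parsing XML is built into Python via `xml.etree.ElementTree`. A small sketch on one record of the sample dataset (the tag names here simply mirror the dataset's column names):

```python
import xml.etree.ElementTree as ET

doc = """
<Students>
  <Student>
    <Name>Sri</Name>
    <Register_No>101</Register_No>
    <Subject>Math</Subject>
    <Marks>87</Marks>
  </Student>
</Students>
"""

root = ET.fromstring(doc)
for student in root.findall("Student"):
    # findtext returns the text content of a child element (always a string).
    print(student.findtext("Name"), student.findtext("Marks"))
```

As with CSV, everything comes back as text; any typing has to be layered on top, e.g. with an XML Schema.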

6️⃣ Avro (Row-Based Format with Schema)

Avro is a binary file format used mostly in streaming systems like Kafka and big-data pipelines.

✔ Stores data + schema together
✔ Good for fast data serialization
✔ Used in Hadoop ecosystems

📌 Avro Schema

```json
{
  "type": "record",
  "name": "Student",
  "fields": [
    { "name": "Name", "type": "string" },
    { "name": "Register_No", "type": "int" },
    { "name": "Subject", "type": "string" },
    { "name": "Marks", "type": "int" }
  ]
}
```

📌 Avro Data (Readable Preview)

```json
[
  { "Name": "Sri", "Register_No": 101, "Subject": "Math", "Marks": 87 },
  { "Name": "Vardhan", "Register_No": 102, "Subject": "Science", "Marks": 92 },
  { "Name": "Keerthana", "Register_No": 103, "Subject": "English", "Marks": 89 }
]
```
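Real Avro files are read and written with a library such as fastavro, but the core idea — the schema travels with the data and constrains every record — can be sketched without any library. The `matches` helper below is my own illustrative function, not part of any Avro API:

```python
# The Avro schema from above, as a Python dict.
schema = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Register_No", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"},
    ],
}

# Map Avro primitive type names to Python types (just the two used here).
AVRO_TYPES = {"string": str, "int": int}

def matches(record, schema):
    """Return True if every field exists and has the schema's declared type."""
    return all(
        isinstance(record.get(f["name"]), AVRO_TYPES[f["type"]])
        for f in schema["fields"]
    )

good = {"Name": "Sri", "Register_No": 101, "Subject": "Math", "Marks": 87}
bad = {"Name": "Sri", "Register_No": 101, "Subject": "Math", "Marks": "87"}
print(matches(good, schema), matches(bad, schema))  # → True False
```

This self-describing property is what lets a Kafka consumer decode records it has never seen before, and it's the basis of Avro's schema-evolution rules.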

Conclusion

Each data format has its own strengths:

CSV → simple and universal

SQL → perfect for structured data

JSON → great for APIs

Parquet → best for fast analytics

XML → still used in many systems

Avro → excellent for streaming data

Understanding these formats is essential for working with cloud computing, big data, and modern analytics tools.
