DEV Community

Sujitha Selvaraj

Data in the Cloud — 6 Common Data Formats Every Analyst Should Know

1. CSV (Comma Separated Values)

What it is:
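CSV is one of the simplest and most widely supported data formats. It stores data as plain text: each line is a record, and the values within a line are separated by commas. The first line usually holds the column names, and every value is read as text unless the consumer converts it.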

Example:

name,reg_no,subject,marks
Asha Rao,R001,Maths,89
Vikram S,R002,Physics,76
Meera K,R003,Chemistry,92
Rohit P,R004,Maths,68

Where it’s used:
CSV is used in spreadsheets, data imports/exports, and small-scale analytics.
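
As a rough illustration of reading the sample above, here is a minimal sketch using Python's standard csv module. The function name load_students and the in-memory CSV_TEXT string are assumptions made for the example; a real script would typically open a file instead.

```python
import csv
import io

# Sample data copied from the example above; io.StringIO stands in for a real file.
CSV_TEXT = """name,reg_no,subject,marks
Asha Rao,R001,Maths,89
Vikram S,R002,Physics,76
Meera K,R003,Chemistry,92
Rohit P,R004,Maths,68
"""


def load_students(text):
    """Parse CSV text into a list of dicts, converting marks to int."""
    reader = csv.DictReader(io.StringIO(text))  # first line becomes the keys
    rows = []
    for row in reader:
        # CSV carries no type information, so every value arrives as a string.
        row["marks"] = int(row["marks"])
        rows.append(row)
    return rows


students = load_students(CSV_TEXT)
average = sum(s["marks"] for s in students) / len(students)
print(average)
```

The explicit int() conversion is the main point: the format itself cannot say that marks is a number.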

2. SQL (Relational Table Format)

What it is:
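SQL (Structured Query Language) is the language used to define and query relational databases; it is not a file format in itself. In a relational database, data is organized in tables with defined columns and data types, and each row represents one record.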

Example:

TABLE: students

name     | reg_no | subject   | marks
---------+--------+-----------+------
Asha Rao | R001   | Maths     | 89
Vikram S | R002   | Physics   | 76
Meera K  | R003   | Chemistry | 92
Rohit P  | R004   | Maths     | 68

Where it’s used:
Used in databases like MySQL, PostgreSQL, and SQL Server for structured data and transactional operations.
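
To make the idea of typed columns and queries concrete, the sketch below builds the students table in an in-memory SQLite database using Python's built-in sqlite3 module. SQLite is used only because it needs no server; the column types and the choice of reg_no as primary key are assumptions, not something the table above specifies.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Column types and the primary key are assumed for illustration.
conn.execute(
    """
    CREATE TABLE students (
        name    TEXT    NOT NULL,
        reg_no  TEXT    PRIMARY KEY,
        subject TEXT    NOT NULL,
        marks   INTEGER NOT NULL
    )
    """
)

conn.executemany(
    "INSERT INTO students (name, reg_no, subject, marks) VALUES (?, ?, ?, ?)",
    [
        ("Asha Rao", "R001", "Maths", 89),
        ("Vikram S", "R002", "Physics", 76),
        ("Meera K", "R003", "Chemistry", 92),
        ("Rohit P", "R004", "Maths", 68),
    ],
)

# A parameterized query: all Maths students, highest marks first.
maths = conn.execute(
    "SELECT name, marks FROM students WHERE subject = ? ORDER BY marks DESC",
    ("Maths",),
).fetchall()
print(maths)

conn.close()
```

The same statements should work with little or no change on MySQL, PostgreSQL, or SQL Server, although each has its own driver and type names.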

3. JSON (JavaScript Object Notation)

What it is:
JSON stores data as key-value pairs grouped into objects, together with ordered arrays, and both can be nested inside each other. It is lightweight, human-readable, and commonly used in APIs and modern web applications.

Example:

[
{"name": "Asha Rao", "reg_no": "R001", "subject": "Maths", "marks": 89},
{"name": "Vikram S", "reg_no": "R002", "subject": "Physics", "marks": 76},
{"name": "Meera K", "reg_no": "R003", "subject": "Chemistry", "marks": 92},
{"name": "Rohit P", "reg_no": "R004", "subject": "Maths", "marks": 68}
]

Where it’s used:
APIs, web applications, configuration files, and NoSQL databases like MongoDB.
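
A short sketch of parsing the sample with Python's standard json module follows. The JSON_TEXT string is just the example above held in memory; an API response body would be handled the same way.

```python
import json

# The example records from above, held as a string for illustration.
JSON_TEXT = """
[
  {"name": "Asha Rao", "reg_no": "R001", "subject": "Maths", "marks": 89},
  {"name": "Vikram S", "reg_no": "R002", "subject": "Physics", "marks": 76},
  {"name": "Meera K", "reg_no": "R003", "subject": "Chemistry", "marks": 92},
  {"name": "Rohit P", "reg_no": "R004", "subject": "Maths", "marks": 68}
]
"""

# json.loads turns the array into a list of dicts.
students = json.loads(JSON_TEXT)

# Unlike CSV, numbers keep their type: marks is already an int.
top = max(students, key=lambda s: s["marks"])
print(top["name"], top["marks"])

# Serializing back to text, with indentation for readability.
pretty = json.dumps(top, indent=2)
print(pretty)
```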

4. Parquet (Columnar Storage Format)

What it is:
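Parquet is a columnar storage format used for big data analytics. Unlike row-based formats, Parquet stores the values of each column together, so similar values compress well and a query can read only the columns it needs. This typically reduces storage space and speeds up analytical queries.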

Example (conceptual view):

name:    Asha Rao, Vikram S, Meera K, Rohit P
reg_no:  R001, R002, R003, R004
subject: Maths, Physics, Chemistry, Maths
marks:   89, 76, 92, 68

Where it’s used:
Big data platforms like Apache Spark, Hadoop, and Amazon Athena, for fast analytics and efficient cloud storage.
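
The difference between row-wise and column-wise layout can be sketched in plain Python. This does not read or write real Parquet files (that would normally be done with a library such as pyarrow); it only illustrates, under simplified assumptions, why a columnar layout helps analytics and compression.

```python
# Row-wise layout: each tuple is one complete record.
rows = [
    ("Asha Rao", "R001", "Maths", 89),
    ("Vikram S", "R002", "Physics", 76),
    ("Meera K", "R003", "Chemistry", 92),
    ("Rohit P", "R004", "Maths", 68),
]
names = ("name", "reg_no", "subject", "marks")

# Column-wise layout: all values of one column sit together.
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}

# An aggregate over one column touches only that column's values.
average = sum(columns["marks"]) / len(columns["marks"])
print(average)

# Simplified dictionary encoding: repeated values are replaced by small
# integer codes that point into a list of distinct values.
dictionary = sorted(set(columns["subject"]))
codes = [dictionary.index(value) for value in columns["subject"]]
print(dictionary, codes)
```

With only four rows the saving is invisible, but on millions of rows with a handful of distinct subjects, storing small codes instead of repeated strings is where much of the space reduction comes from.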

5. XML (Extensible Markup Language)

What it is:
XML is a tag-based format used to represent structured data. It is similar to HTML but designed to store and transport data rather than display it.

Example:

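<students>
  <student>
    <name>Asha Rao</name>
    <reg_no>R001</reg_no>
    <subject>Maths</subject>
    <marks>89</marks>
  </student>
  <student>
    <name>Vikram S</name>
    <reg_no>R002</reg_no>
    <subject>Physics</subject>
    <marks>76</marks>
  </student>
</students>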

Where it’s used:
Web services (SOAP), configuration files, and systems that require strong data validation through schemas.
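
Here is a minimal sketch of reading such records with Python's standard xml.etree.ElementTree module. The element names students, student, name, reg_no, subject, and marks are assumptions chosen to match the column names used elsewhere in this post.

```python
import xml.etree.ElementTree as ET

# Tag names are assumed for illustration; a real feed defines its own.
XML_TEXT = """
<students>
  <student>
    <name>Asha Rao</name>
    <reg_no>R001</reg_no>
    <subject>Maths</subject>
    <marks>89</marks>
  </student>
  <student>
    <name>Vikram S</name>
    <reg_no>R002</reg_no>
    <subject>Physics</subject>
    <marks>76</marks>
  </student>
</students>
"""

root = ET.fromstring(XML_TEXT)

# Turn each <student> element into a dict keyed by child tag name.
students = []
for student in root.findall("student"):
    record = {child.tag: child.text for child in student}
    record["marks"] = int(record["marks"])  # element text is always a string
    students.append(record)

print(students)
```

ElementTree only checks that the document is well-formed; validating against an XSD schema needs a separate library.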

6. Avro (Row-based Storage Format)

What it is:
Avro is a compact, row-based binary format. An Avro data file stores the schema alongside the data, so a reader always knows how to interpret the bytes. It is designed for fast data serialization and supports schema evolution, which makes it well suited to real-time data pipelines.

Example (logical representation):

{"name": "Asha Rao", "reg_no": "R001", "subject": "Maths", "marks": 89}
{"name": "Vikram S", "reg_no": "R002", "subject": "Physics", "marks": 76}
{"name": "Meera K", "reg_no": "R003", "subject": "Chemistry", "marks": 92}
{"name": "Rohit P", "reg_no": "R004", "subject": "Maths", "marks": 68}

Where it’s used:
Data streaming (Apache Kafka), data serialization, and large-scale data pipelines.
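
To show why Avro is compact, the sketch below hand-encodes one record the way Avro's binary encoding lays out strings and ints: a zigzag variable-length integer for numbers, and a length prefix followed by UTF-8 bytes for strings. It is an illustration only; it covers just these two types, skips the file header and block structure of a real Avro file, and the schema name Student is an assumption. In practice a library such as fastavro or the official avro package would do this work.

```python
import json

# Avro schemas are written in JSON. The record name "Student" is assumed.
SCHEMA = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "reg_no", "type": "string"},
        {"name": "subject", "type": "string"},
        {"name": "marks", "type": "int"},
    ],
}


def zigzag_varint(n):
    """Encode an integer as a zigzag variable-length value."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small codes
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_record(record, schema):
    """Concatenate field encodings in schema order; no field names are written."""
    out = bytearray()
    for field in schema["fields"]:
        value = record[field["name"]]
        if field["type"] == "string":
            data = value.encode("utf-8")
            out += zigzag_varint(len(data)) + data
        elif field["type"] == "int":
            out += zigzag_varint(value)
        else:
            raise ValueError("type not covered by this sketch")
    return bytes(out)


record = {"name": "Asha Rao", "reg_no": "R001", "subject": "Maths", "marks": 89}
encoded = encode_record(record, SCHEMA)
print(len(encoded), len(json.dumps(record)))
```

Because field names live in the schema rather than in every record, the encoded bytes cannot be read back without that schema, which is why Avro always ships the two together.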

Conclusion:

Each data format serves a different purpose in the analytics and cloud ecosystem:

  • CSV is simple and universal.
  • SQL tables enforce structure and relationships.
  • JSON adds flexibility and nesting.
  • Parquet optimizes analytical queries.
  • XML emphasizes structure and validation.
  • Avro focuses on efficient, schema-based data transport.

Understanding when and how to use these formats is a core skill for any data analyst, data engineer, or cloud professional.
