Rethan Kumar cv

Posted on Oct 7

📊 Understanding 6 Common Data Formats in Data Analytics

#dataengineering #analytics #datascience #database

Data formats are the backbone of analytics — they determine how efficiently data is stored, transferred, and processed. Whether you’re working with simple spreadsheets or massive big data systems, understanding data formats helps you pick the right tool for the job.

In this blog, let’s explore six popular data formats used in data analytics:

CSV (Comma Separated Values)
SQL (Relational Table Format)
JSON (JavaScript Object Notation)
Parquet (Columnar Storage Format)
XML (Extensible Markup Language)
Avro (Row-based Storage Format)

We’ll use a simple dataset throughout this article to represent the same data in all formats.

🧩 Our Sample Dataset

Let’s take a small dataset of student marks:

Name	Register Number	Subject	Marks
Riya Sharma	101	Math	95
Arjun Patel	102	Science	88
Meera Nair	103	English	92

1️⃣ CSV (Comma Separated Values)

CSV is one of the simplest and most human-readable data formats. It stores data as plain text, where each line represents a record and values are separated by commas; the first line usually holds the column names.

Advantages:

Easy to read and write
Supported by almost every data tool (Excel, Python, etc.)

Disadvantages:

No data types (every value is read as text until the consumer converts it)
Doesn’t handle nested or hierarchical data well

Example (CSV):

Name,Register Number,Subject,Marks
Riya Sharma,101,Math,95
Arjun Patel,102,Science,88
Meera Nair,103,English,92
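
To make the "everything is text" point concrete, here is a minimal sketch that reads the CSV above with Python's standard `csv` module. The `load_students` helper and the in-memory `CSV_TEXT` string are illustrative names, not part of any library; a real script would usually read from a file instead.

```python
import csv
import io

# The CSV text from the example above, held in memory instead of a file.
CSV_TEXT = """Name,Register Number,Subject,Marks
Riya Sharma,101,Math,95
Arjun Patel,102,Science,88
Meera Nair,103,English,92
"""


def load_students(text):
    """Parse CSV text into a list of dicts, converting numeric columns by hand."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # The csv module returns every value as str; the caller picks the types.
        row["Register Number"] = int(row["Register Number"])
        row["Marks"] = int(row["Marks"])
        rows.append(row)
    return rows


raw_rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
students = load_students(CSV_TEXT)

print(type(raw_rows[0]["Marks"]).__name__)   # str: straight from the file
print(type(students[0]["Marks"]).__name__)   # int: after explicit conversion
```

The conversion step is exactly the work that typed formats such as SQL tables, Parquet, or Avro do for you through their schemas.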

2️⃣ SQL (Relational Table Format)

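SQL is not a file format itself but the language of relational databases such as MySQL, PostgreSQL, and SQLite, which store data in tables with defined columns and data types. A SQL script like the one below is a common way to represent and exchange that tabular data.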

Advantages:

Structured and enforceable schema
Supports powerful querying

Disadvantages:

Rigid structure (changing the schema means altering the table)
Not suitable for unstructured data

Example (SQL):

CREATE TABLE Students (
    Name VARCHAR(50),
    RegisterNumber INT,
    Subject VARCHAR(30),
    Marks INT
);

INSERT INTO Students (Name, RegisterNumber, Subject, Marks) VALUES
('Riya Sharma', 101, 'Math', 95),
('Arjun Patel', 102, 'Science', 88),
('Meera Nair', 103, 'English', 92);
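
As a rough illustration of "powerful querying", the sketch below loads the same table into an in-memory SQLite database using Python's built-in `sqlite3` module and runs two queries. The threshold of 90 marks is an arbitrary value chosen for the example.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Students (
        Name VARCHAR(50),
        RegisterNumber INT,
        Subject VARCHAR(30),
        Marks INT
    )
    """
)

# Parameterized inserts avoid building SQL strings by hand.
conn.executemany(
    "INSERT INTO Students (Name, RegisterNumber, Subject, Marks) VALUES (?, ?, ?, ?)",
    [
        ("Riya Sharma", 101, "Math", 95),
        ("Arjun Patel", 102, "Science", 88),
        ("Meera Nair", 103, "English", 92),
    ],
)

# Filter and sort in the database instead of in application code.
top = conn.execute(
    "SELECT Name, Marks FROM Students WHERE Marks > ? ORDER BY Marks DESC",
    (90,),
).fetchall()

# Aggregate over the whole table.
average = conn.execute("SELECT AVG(Marks) FROM Students").fetchone()[0]

print(top)
print(average)
conn.close()
```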

3️⃣ JSON (JavaScript Object Notation)

JSON stores data as key-value pairs (objects) and ordered lists (arrays), and it is commonly used in APIs and web applications. It supports nested structures, and most programming languages can parse it with a standard library.

Advantages:

Human-readable
Supports nested and complex data structures
Widely used in web and API data

Disadvantages:

Larger file size than CSV, because key names are repeated in every record
More parsing overhead than CSV

Example (JSON):

[
  {
    "Name": "Riya Sharma",
    "RegisterNumber": 101,
    "Subject": "Math",
    "Marks": 95
  },
  {
    "Name": "Arjun Patel",
    "RegisterNumber": 102,
    "Subject": "Science",
    "Marks": 88
  },
  {
    "Name": "Meera Nair",
    "RegisterNumber": 103,
    "Subject": "English",
    "Marks": 92
  }
]
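
Here is a small sketch of reading that JSON with Python's standard `json` module. Unlike CSV, numbers come back as numbers. The `nested` record at the end is a hypothetical variant of the dataset, added only to show how one student could carry a list of results.

```python
import json

JSON_TEXT = """
[
  {"Name": "Riya Sharma", "RegisterNumber": 101, "Subject": "Math", "Marks": 95},
  {"Name": "Arjun Patel", "RegisterNumber": 102, "Subject": "Science", "Marks": 88},
  {"Name": "Meera Nair", "RegisterNumber": 103, "Subject": "English", "Marks": 92}
]
"""

# json.loads maps JSON arrays to lists, objects to dicts, numbers to int/float.
students = json.loads(JSON_TEXT)
marks = [student["Marks"] for student in students]
print(marks)

# Hypothetical nested variant: one student with a list of subject results.
nested = {
    "Name": "Riya Sharma",
    "RegisterNumber": 101,
    "Results": [{"Subject": "Math", "Marks": 95}],
}

# Serializing and parsing again preserves the nesting.
round_trip = json.loads(json.dumps(nested))
print(round_trip["Results"][0]["Subject"])
```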

4️⃣ Parquet (Columnar Storage Format)

Parquet is a columnar storage format used in big data frameworks like Apache Spark and Hadoop. It stores data by columns instead of rows, so an analytical query can read only the columns it needs, and similar values stored together compress well.

Advantages:

Highly compressed and efficient
Great for analytical queries (e.g., aggregate functions)
Supports complex data types

Disadvantages:

Not human-readable
Needs Parquet-aware tooling to read or write (e.g., Apache Spark)

Example (Conceptual Representation):

Column 1: Name -> ["Riya Sharma", "Arjun Patel", "Meera Nair"]
Column 2: RegisterNumber -> [101, 102, 103]
Column 3: Subject -> ["Math", "Science", "English"]
Column 4: Marks -> [95, 88, 92]

(Actual Parquet files are binary and not viewable as text.)
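
The column layout above can be imitated in plain Python to show why it suits aggregates. This is only a conceptual sketch, not real Parquet: the `to_columns` helper is a made-up name, and actual Parquet files add binary encoding, compression, and metadata on top of this idea.

```python
# Row-oriented data: one tuple per student.
FIELDS = ("Name", "RegisterNumber", "Subject", "Marks")
rows = [
    ("Riya Sharma", 101, "Math", 95),
    ("Arjun Patel", 102, "Science", 88),
    ("Meera Nair", 103, "English", 92),
]


def to_columns(rows, fields):
    """Pivot a list of row tuples into a dict of column lists."""
    return {field: [row[i] for row in rows] for i, field in enumerate(fields)}


columns = to_columns(rows, FIELDS)

# An aggregate over Marks touches one list and never looks at names or subjects.
average_marks = sum(columns["Marks"]) / len(columns["Marks"])

print(columns["Marks"])
print(average_marks)
```

In a row layout the same average would require walking every full record, which is the extra work a columnar format avoids.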

5️⃣ XML (Extensible Markup Language)

XML represents data using custom tags, similar to HTML. It’s structured and self-descriptive but more verbose than JSON.

Advantages:

Self-descriptive tags
Good for hierarchical data

Disadvantages:

Verbose syntax
Typically slower to parse than JSON

Example (XML):

<Students>
  <Student>
    <Name>Riya Sharma</Name>
    <RegisterNumber>101</RegisterNumber>
    <Subject>Math</Subject>
    <Marks>95</Marks>
  </Student>
  <Student>
    <Name>Arjun Patel</Name>
    <RegisterNumber>102</RegisterNumber>
    <Subject>Science</Subject>
    <Marks>88</Marks>
  </Student>
  <Student>
    <Name>Meera Nair</Name>
    <RegisterNumber>103</RegisterNumber>
    <Subject>English</Subject>
    <Marks>92</Marks>
  </Student>
</Students>
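
A minimal sketch of parsing that XML with Python's standard `xml.etree.ElementTree` module follows. As with CSV, element text arrives as strings, so the numeric fields are converted explicitly.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<Students>
  <Student>
    <Name>Riya Sharma</Name>
    <RegisterNumber>101</RegisterNumber>
    <Subject>Math</Subject>
    <Marks>95</Marks>
  </Student>
  <Student>
    <Name>Arjun Patel</Name>
    <RegisterNumber>102</RegisterNumber>
    <Subject>Science</Subject>
    <Marks>88</Marks>
  </Student>
  <Student>
    <Name>Meera Nair</Name>
    <RegisterNumber>103</RegisterNumber>
    <Subject>English</Subject>
    <Marks>92</Marks>
  </Student>
</Students>
"""

root = ET.fromstring(XML_TEXT)

students = []
for student in root.findall("Student"):
    students.append(
        {
            "Name": student.findtext("Name"),
            # Element text is always a string, so convert numbers by hand.
            "RegisterNumber": int(student.findtext("RegisterNumber")),
            "Subject": student.findtext("Subject"),
            "Marks": int(student.findtext("Marks")),
        }
    )

print(students[0])
```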

6️⃣ Avro (Row-based Storage Format)

Apache Avro is a row-based binary format often used for data serialization in big data pipelines. Data is always written against a schema (itself defined in JSON), which lets records be stored compactly for transmission between systems because field names are not repeated in each record.

Advantages:

Compact binary format
Schema-based (ensures consistency)
Ideal for streaming and big data

Disadvantages:

Not human-readable
Readers need the schema to decode the data, so schema changes must be managed carefully

Example (Conceptual Representation):

Schema:

{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "RegisterNumber", "type": "int"},
    {"name": "Subject", "type": "string"},
    {"name": "Marks", "type": "int"}
  ]
}

Data:

[
  {"Name": "Riya Sharma", "RegisterNumber": 101, "Subject": "Math", "Marks": 95},
  {"Name": "Arjun Patel", "RegisterNumber": 102, "Subject": "Science", "Marks": 88},
  {"Name": "Meera Nair", "RegisterNumber": 103, "Subject": "English", "Marks": 92}
]

(Actual Avro data is stored in binary format, not plain text.)
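
To give a feel for what "binary" means here, the sketch below hand-encodes one record the way Avro's binary encoding is specified for these two types: integers as zigzag variable-length values, strings as a length followed by UTF-8 bytes, and record fields written in schema order with no field names. It is a simplified illustration, not a replacement for an Avro library, and it skips the file header and sync markers that real Avro container files include. All function names are made up for this example.

```python
import json


def zigzag(n):
    """Map signed integers to unsigned ones so small magnitudes stay small."""
    return (n << 1) ^ (n >> 63)


def encode_varint(value):
    """Write an unsigned integer 7 bits at a time, low bits first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int(n):
    return encode_varint(zigzag(n))


def encode_string(text):
    data = text.encode("utf-8")
    return encode_int(len(data)) + data  # length prefix, then the bytes


# Field order and types mirror the schema shown above.
SCHEMA_FIELDS = [
    ("Name", "string"),
    ("RegisterNumber", "int"),
    ("Subject", "string"),
    ("Marks", "int"),
]
ENCODERS = {"string": encode_string, "int": encode_int}


def encode_record(record):
    """Concatenate field values in schema order; names are not stored."""
    return b"".join(ENCODERS[kind](record[name]) for name, kind in SCHEMA_FIELDS)


record = {"Name": "Riya Sharma", "RegisterNumber": 101, "Subject": "Math", "Marks": 95}
binary = encode_record(record)

# Compare the binary size with the same record as JSON text.
print(len(binary), "bytes as Avro-style binary")
print(len(json.dumps(record)), "bytes as JSON text")
```

Because the field names live only in the schema, the reader must have that schema to make sense of the bytes, which is the trade-off listed under disadvantages.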

🧠 Summary

Format	Structure	Readability	Common Use Case
CSV	Row-based	✅ Human-readable	Spreadsheets, simple data
SQL	Table	✅ Human-readable (as SQL scripts)	Relational databases
JSON	Key-value	✅ Human-readable	Web APIs, configurations
Parquet	Columnar	❌ Binary	Big data analytics
XML	Tag-based	✅ Human-readable	Legacy web services
Avro	Row-based (binary)	❌ Binary	Data streaming, serialization

🚀 Final Thoughts

Each data format has its own strengths — CSV is great for simplicity, JSON for flexibility, SQL for structure, and Parquet or Avro for performance in big data environments.

Choosing the right one depends on your data size, structure, and use case.

💡 In modern analytics, you’ll often see multiple formats working together — for example, CSV input files transformed into Parquet for efficient querying in Spark.

✍️ Author: Rethan Kumar

UI/UX Designer | Tech Enthusiast | Exploring Data and Design
