Understanding 6 Common Data Formats in Data Analytics
In the world of data analytics, data comes in many shapes and formats. Choosing the right one can make your analytics pipeline faster, more efficient, and easier to maintain.
In this blog, let's explore 6 commonly used data formats in data analytics with a simple example dataset.
Example Dataset
Let's consider a simple dataset of students and their marks:
| Name | Register Number | Subject | Marks |
|---|---|---|---|
| Hari | 101 | Math | 95 |
| Vignesh | 102 | Science | 88 |
| Priya | 103 | English | 92 |
1. CSV (Comma-Separated Values)
Explanation:
CSV is one of the simplest and most widely used formats. Each row represents a record, and columns are separated by commas. It's easy to create, read, and import into tools like Excel or Python pandas.
Example (data.csv):
Name,RegisterNumber,Subject,Marks
Hari,101,Math,95
Vignesh,102,Science,88
Priya,103,English,92
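The CSV above can be read with Python's built-in csv module (a minimal sketch; pandas `read_csv` would work equally well). Note that CSV carries no type information, so every value comes back as a string:

```python
import csv
import io

# Inline copy of the example file (data.csv)
raw = """Name,RegisterNumber,Subject,Marks
Hari,101,Math,95
Vignesh,102,Science,88
Priya,103,English,92
"""

# DictReader maps each row to a dict keyed by the header fields
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["Name"], rows[0]["Marks"])  # Marks is read as the string "95"
```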
2. SQL (Relational Table Format)
Explanation:
SQL format represents data stored in relational tables. You can use CREATE TABLE and INSERT INTO commands to define and populate your dataset.
Example (data.sql):
CREATE TABLE Students (
    Name VARCHAR(50),
    RegisterNumber INT,
    Subject VARCHAR(50),
    Marks INT
);
INSERT INTO Students VALUES
('Hari', 101, 'Math', 95),
('Vignesh', 102, 'Science', 88),
('Priya', 103, 'English', 92);
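You can run the exact statements above with Python's built-in sqlite3 module, using a throwaway in-memory database (a minimal sketch; any relational database would behave the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
conn.executescript("""
CREATE TABLE Students (
    Name VARCHAR(50),
    RegisterNumber INT,
    Subject VARCHAR(50),
    Marks INT
);
INSERT INTO Students VALUES
    ('Hari', 101, 'Math', 95),
    ('Vignesh', 102, 'Science', 88),
    ('Priya', 103, 'English', 92);
""")

# Unlike CSV, the table is typed and queryable: find the top scorer
top = conn.execute(
    "SELECT Name, Marks FROM Students ORDER BY Marks DESC LIMIT 1"
).fetchone()
print(top)  # ('Hari', 95)
```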
3. JSON (JavaScript Object Notation)
Explanation:
JSON is a lightweight, human-readable data format used extensively in APIs and NoSQL databases. Data is stored as key-value pairs, making it flexible and hierarchical.
Example (data.json):
[
  {
    "Name": "Hari",
    "RegisterNumber": 101,
    "Subject": "Math",
    "Marks": 95
  },
  {
    "Name": "Vignesh",
    "RegisterNumber": 102,
    "Subject": "Science",
    "Marks": 88
  },
  {
    "Name": "Priya",
    "RegisterNumber": 103,
    "Subject": "English",
    "Marks": 92
  }
]
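Parsing this with Python's built-in json module shows one advantage over CSV: numbers arrive as real integers, so you can compute on them directly (a minimal sketch):

```python
import json

raw = """[
  {"Name": "Hari", "RegisterNumber": 101, "Subject": "Math", "Marks": 95},
  {"Name": "Vignesh", "RegisterNumber": 102, "Subject": "Science", "Marks": 88},
  {"Name": "Priya", "RegisterNumber": 103, "Subject": "English", "Marks": 92}
]"""

students = json.loads(raw)  # a list of dicts, one per student

# Marks are already ints (no casting needed, unlike CSV)
average = sum(s["Marks"] for s in students) / len(students)
print(round(average, 2))
```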
4. Parquet (Columnar Storage Format)
Explanation:
Parquet is an optimized columnar storage format used in big data frameworks like Apache Spark, Hadoop, and BigQuery. It compresses data efficiently and allows faster analytical queries by reading only the required columns.
Example Representation (Conceptual):
Columnar Storage:
Name: ["Hari", "Vignesh", "Priya"]
RegisterNumber: [101, 102, 103]
Subject: ["Math", "Science", "English"]
Marks: [95, 88, 92]
Note: Parquet files are binary and not human-readable, but this is how they conceptually organize data for fast column-based access.
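To make the columnar idea concrete, here is a pure-Python sketch contrasting row-wise and column-wise layouts of the same table. Real Parquet I/O would go through a library such as pandas (`DataFrame.to_parquet`, which requires pyarrow or fastparquet); this sketch only illustrates why column layout helps:

```python
# Row-oriented layout: one dict per record (how CSV/JSON organize data)
rows = [
    {"Name": "Hari", "RegisterNumber": 101, "Subject": "Math", "Marks": 95},
    {"Name": "Vignesh", "RegisterNumber": 102, "Subject": "Science", "Marks": 88},
    {"Name": "Priya", "RegisterNumber": 103, "Subject": "English", "Marks": 92},
]

# Column-oriented layout: one list per column (how Parquet organizes data)
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregate over one column only touches that column's values,
# skipping Name/RegisterNumber/Subject entirely
max_marks = max(columns["Marks"])
print(max_marks)  # 95
```

In an actual Parquet file, each column is additionally compressed and encoded on its own, which is why column-level scans are cheap.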
5. XML (Extensible Markup Language)
Explanation:
XML stores data using custom tags, similar to HTML. It's self-descriptive and widely used in configurations and web data interchange.
Example (data.xml):
<Students>
  <Student>
    <Name>Hari</Name>
    <RegisterNumber>101</RegisterNumber>
    <Subject>Math</Subject>
    <Marks>95</Marks>
  </Student>
  <Student>
    <Name>Vignesh</Name>
    <RegisterNumber>102</RegisterNumber>
    <Subject>Science</Subject>
    <Marks>88</Marks>
  </Student>
  <Student>
    <Name>Priya</Name>
    <RegisterNumber>103</RegisterNumber>
    <Subject>English</Subject>
    <Marks>92</Marks>
  </Student>
</Students>
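The document above parses with Python's built-in xml.etree.ElementTree (a minimal sketch). As with CSV, all values are text, so numeric fields need an explicit cast:

```python
import xml.etree.ElementTree as ET

raw = """<Students>
  <Student><Name>Hari</Name><RegisterNumber>101</RegisterNumber>
    <Subject>Math</Subject><Marks>95</Marks></Student>
  <Student><Name>Vignesh</Name><RegisterNumber>102</RegisterNumber>
    <Subject>Science</Subject><Marks>88</Marks></Student>
  <Student><Name>Priya</Name><RegisterNumber>103</RegisterNumber>
    <Subject>English</Subject><Marks>92</Marks></Student>
</Students>"""

root = ET.fromstring(raw)
# findtext returns the text of the first matching child element
marks = {
    s.findtext("Name"): int(s.findtext("Marks"))
    for s in root.findall("Student")
}
print(marks)  # {'Hari': 95, 'Vignesh': 88, 'Priya': 92}
```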
6. Avro (Row-Based Storage Format)
Explanation:
Avro is a binary row-based storage format developed by Apache. It's efficient for serialization and supports schema evolution, making it popular for data pipelines and streaming platforms like Kafka.
Example Schema and Data (Conceptual):
{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "RegisterNumber", "type": "int"},
    {"name": "Subject", "type": "string"},
    {"name": "Marks", "type": "int"}
  ]
}
Data (conceptually):
Hari, 101, Math, 95
Vignesh, 102, Science, 88
Priya, 103, English, 92
Note: Avro data is stored in binary format, but it always includes the schema definition for decoding.
Summary Comparison
| Format | Type | Human Readable | Best Use Case |
|---|---|---|---|
| CSV | Row-based | Yes | Simple data, spreadsheets |
| SQL | Relational | Yes | Databases and queries |
| JSON | Semi-structured | Yes | APIs, NoSQL data |
| Parquet | Columnar | No | Big data analytics |
| XML | Tagged | Yes | Data interchange, configs |
| Avro | Row-based (binary) | No | Data pipelines, Kafka |
Final Thoughts
Each format has its own strengths:
- Use CSV or SQL for small datasets.
- Use JSON for flexible, nested data.
- Use Parquet or Avro for big data analytics and storage efficiency.
- Use XML when data needs strong structure and tagging.
Understanding these formats helps you pick the right tool for your data analytics workflows!
Written by Hari Venkatesh
#DataAnalytics #BigData #Cloud #DataFormats #Learning