In today's world of data analytics and cloud computing, how we store and exchange data can have a huge impact on performance, scalability, and compatibility. Whether you’re working on a data pipeline or exporting reports, understanding different data formats is essential.
In this article, I’ll walk you through six popular data formats — CSV, SQL, JSON, Parquet, XML, and Avro — using a simple example dataset.
Sample Dataset
Let’s use a small dataset of student marks:
Name     RegisterNumber  Subject  Marks
Alice    1001            Math     85
Bob      1002            Math     90
Charlie  1003            Math     78
CSV (Comma-Separated Values)
What is it?
CSV is one of the simplest data formats. Each row is a line, and each value is separated by a comma. It's lightweight and easy to read.
When to use:
Exporting data from spreadsheets or databases
Quick sharing and inspection of tabular data
Example:
Name,RegisterNumber,Subject,Marks
Alice,1001,Math,85
Bob,1002,Math,90
Charlie,1003,Math,78
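To consume the file programmatically, Python's built-in csv module parses each row into a dictionary. A minimal sketch, assuming the data above is saved as student_marks.csv (a hypothetical filename):
import csv

# Read the CSV back; DictReader maps each row to the header names.
with open("student_marks.csv", newline="") as f:
    for row in csv.DictReader(f):
        # CSV carries no type information, so Marks arrives as a string.
        print(row["Name"], int(row["Marks"]))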
SQL (Relational Table Format)
What is it?
SQL is the standard language for interacting with relational databases, where data lives in tables of rows and columns. Strictly speaking it's a query language rather than a file format, but tabular data is often exchanged as a .sql dump of CREATE TABLE and INSERT statements, as below.
When to use:
Structured data that fits into relational models
OLTP and OLAP systems
Example:
CREATE TABLE StudentMarks (
    Name TEXT,
    RegisterNumber INT,
    Subject TEXT,
    Marks INT
);
INSERT INTO StudentMarks VALUES ('Alice', 1001, 'Math', 85);
INSERT INTO StudentMarks VALUES ('Bob', 1002, 'Math', 90);
INSERT INTO StudentMarks VALUES ('Charlie', 1003, 'Math', 78);
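You can try these statements without a database server using Python's built-in sqlite3 module. A minimal sketch against an in-memory database, with the table and values taken from the example above:
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE StudentMarks (Name TEXT, RegisterNumber INT, Subject TEXT, Marks INT)")
conn.executemany(
    "INSERT INTO StudentMarks VALUES (?, ?, ?, ?)",
    [("Alice", 1001, "Math", 85), ("Bob", 1002, "Math", 90), ("Charlie", 1003, "Math", 78)],
)
# Query the table back, highest mark first.
for name, marks in conn.execute("SELECT Name, Marks FROM StudentMarks ORDER BY Marks DESC"):
    print(name, marks)
conn.close()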
JSON (JavaScript Object Notation)
What is it?
JSON is a lightweight text-based format used for data interchange. It’s structured like a dictionary or object and is widely used in APIs.
When to use:
Web services and REST APIs
Semi-structured data
Example:
[
  {
    "Name": "Alice",
    "RegisterNumber": 1001,
    "Subject": "Math",
    "Marks": 85
  },
  {
    "Name": "Bob",
    "RegisterNumber": 1002,
    "Subject": "Math",
    "Marks": 90
  },
  {
    "Name": "Charlie",
    "RegisterNumber": 1003,
    "Subject": "Math",
    "Marks": 78
  }
]
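A minimal sketch of consuming this payload in Python, assuming it's saved as students.json (a hypothetical filename):
import json

with open("students.json") as f:
    students = json.load(f)  # parses the array into a list of dicts

# Unlike CSV, JSON preserves numeric types, so no casting is needed.
average = sum(s["Marks"] for s in students) / len(students)
print(f"Average mark: {average:.2f}")  # 84.33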
Parquet (Columnar Storage Format)
What is it?
Parquet is an Apache columnar storage format, designed for efficient data compression and query performance. It’s a binary format and not human-readable.
When to use:
Big data workloads (Spark, Hive)
Analytics in cloud data lakes
Example:
Parquet files aren't meant to be read by eye, so here's how you'd build the dataset in pandas and write it out (pandas needs a Parquet engine such as pyarrow or fastparquet installed):
import pandas as pd

data = [
    {"Name": "Alice", "RegisterNumber": 1001, "Subject": "Math", "Marks": 85},
    {"Name": "Bob", "RegisterNumber": 1002, "Subject": "Math", "Marks": 90},
    {"Name": "Charlie", "RegisterNumber": 1003, "Subject": "Math", "Marks": 78},
]
df = pd.DataFrame(data)
df.to_parquet("student_marks.parquet")
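Reading it back is symmetric, and the column types survive the round trip:
df2 = pd.read_parquet("student_marks.parquet")
print(df2)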
XML (Extensible Markup Language)
What is it?
XML is a markup language that uses custom tags to define data. It's more verbose than JSON but was the dominant format for data exchange in the early days of the web.
When to use:
Legacy systems
Document-centric data (RSS feeds, SOAP APIs)
Example:
<Students>
  <Student>
    <Name>Alice</Name>
    <RegisterNumber>1001</RegisterNumber>
    <Subject>Math</Subject>
    <Marks>85</Marks>
  </Student>
  <Student>
    <Name>Bob</Name>
    <RegisterNumber>1002</RegisterNumber>
    <Subject>Math</Subject>
    <Marks>90</Marks>
  </Student>
  <Student>
    <Name>Charlie</Name>
    <RegisterNumber>1003</RegisterNumber>
    <Subject>Math</Subject>
    <Marks>78</Marks>
  </Student>
</Students>
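Python's standard xml.etree.ElementTree module can parse this document. A minimal sketch, assuming it's saved as students.xml (a hypothetical filename):
import xml.etree.ElementTree as ET

tree = ET.parse("students.xml")
for student in tree.getroot().findall("Student"):
    # Element text is always a string, so numeric fields need casting.
    print(student.findtext("Name"), int(student.findtext("Marks")))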
Avro (Row-based Binary Format)
What is it?
Avro is a row-based binary data format from Apache. It supports rich data structures and includes a schema definition, making it excellent for streaming and serialization.
When to use:
Kafka message serialization
Schema evolution
Compact binary storage
Example:
Here's how you define and write Avro data in Python. First, the schema, saved as a file named student.avsc:
{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "RegisterNumber", "type": "int"},
    {"name": "Subject", "type": "string"},
    {"name": "Marks", "type": "int"}
  ]
}
Then write records against that schema with the avro package:
import avro.schema
import avro.datafile
import avro.io

# Load the schema defined above.
with open("student.avsc") as schema_file:
    schema = avro.schema.parse(schema_file.read())

data = [
    {"Name": "Alice", "RegisterNumber": 1001, "Subject": "Math", "Marks": 85},
    {"Name": "Bob", "RegisterNumber": 1002, "Subject": "Math", "Marks": 90},
    {"Name": "Charlie", "RegisterNumber": 1003, "Subject": "Math", "Marks": 78},
]

# DataFileWriter embeds the schema in the output file alongside the records.
with open("students.avro", "wb") as out_file:
    writer = avro.datafile.DataFileWriter(out_file, avro.io.DatumWriter(), schema)
    for record in data:
        writer.append(record)
    writer.close()
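Reading the file back doesn't require student.avsc at all, because the schema travels inside the file. A minimal sketch:
from avro.datafile import DataFileReader
from avro.io import DatumReader

with open("students.avro", "rb") as in_file:
    reader = DataFileReader(in_file, DatumReader())  # iterates decoded records
    for record in reader:
        print(record["Name"], record["Marks"])
    reader.close()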
Conclusion
Understanding these formats helps you choose the right tool for the job, whether you're building a modern data pipeline or integrating with legacy systems.