Abinaya S

Posted on Oct 8

How the Cloud Stores Our Data

#cloudcomputing #aws #googlecloud

Introduction

Every photo you upload, message you send, or file you share — all of it lives somewhere beyond your device. That “somewhere” is the cloud.
Cloud technology allows massive amounts of data to be stored and accessed securely from anywhere in the world. Behind this simplicity lies a complex world of data formats — each designed for specific use cases in analytics, processing, and storage.
Let’s explore six of the most popular formats used across cloud platforms and analytics systems.

Data Formats in Cloud Analytics

Every time you store, share, or query data in the cloud, you’re likely dealing with one of these six formats:

CSV – Simple text-based, comma-separated data

SQL – Relational tables, defined and queried with Structured Query Language

JSON – Lightweight, flexible key-value data

Parquet – Efficient, columnar storage for big data

XML – Markup-based hierarchical data

Avro – Binary, schema-driven data for streaming

To make it easy to understand, let’s take a small dataset and represent it in all six formats.

Sample Dataset

Employee_ID	Name	Department	Salary
E101	Karthik	HR	52000
E102	Meena	IT	68000
E103	Varun	Finance	60000

1️⃣ CSV (Comma-Separated Values)

CSV is one of the simplest and most human-readable formats. Each record is written on its own line, and the fields within a record are separated by commas. The first line usually holds the column names.

Example:

Employee_ID,Name,Department,Salary
E101,Karthik,HR,52000
E102,Meena,IT,68000
E103,Varun,Finance,60000

✅ Pros

Simple and widely supported
Can be opened in Excel, Notepad, or any tool

⚠️ Cons

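No data types: every value is plain text until the reader converts it
Inefficient for big data analytics, because a query has to scan whole rows and the format has no built-in compression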
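
To see what "no data types" means in practice, here is a minimal sketch using Python's standard csv module. The CSV text is the sample dataset from this post, held in a string instead of a file; the variable names are only for illustration.

```python
import csv
import io

# The sample dataset from this post, kept in memory instead of a file.
CSV_TEXT = """Employee_ID,Name,Department,Salary
E101,Karthik,HR,52000
E102,Meena,IT,68000
E103,Varun,Finance,60000
"""

# DictReader uses the header row as keys; every value arrives as a string.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
print(type(rows[0]["Salary"]))  # <class 'str'>

# Numbers must be converted explicitly before doing arithmetic.
total_salary = sum(int(row["Salary"]) for row in rows)
print(total_salary)  # 180000
```

The reader hands back "52000" as text, so the conversion to a number is the job of whoever loads the file.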

2️⃣ SQL (Structured Query Language)

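SQL is the language of relational databases. Strictly speaking, it is a query language rather than a file format: the database stores data in tables with defined columns, and SQL is how you define, load, and query those tables.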

Example:

CREATE TABLE Employees (
  Employee_ID VARCHAR(10),
  Name VARCHAR(50),
  Department VARCHAR(30),
  Salary INT
);

INSERT INTO Employees VALUES
('E101', 'Karthik', 'HR', 52000),
('E102', 'Meena', 'IT', 68000),
('E103', 'Varun', 'Finance', 60000);

✅ Pros

Highly structured and queryable
Supports relationships and constraints

⚠️ Cons

Fixed schema: columns and types must be declared before data is loaded
Not flexible for nested data
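
The statements above can be tried without installing a database server. This is a small sketch that assumes SQLite (through Python's built-in sqlite3 module) as a stand-in for whichever relational database you actually use in the cloud; the 55000 threshold is just an example value.

```python
import sqlite3

# An in-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

conn.execute("""
    CREATE TABLE Employees (
      Employee_ID VARCHAR(10),
      Name VARCHAR(50),
      Department VARCHAR(30),
      Salary INT
    )
""")

# Parameter placeholders (?) keep values separate from the SQL text.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?)",
    [
        ("E101", "Karthik", "HR", 52000),
        ("E102", "Meena", "IT", 68000),
        ("E103", "Varun", "Finance", 60000),
    ],
)

# A query with a filter and a sort, which plain CSV cannot express by itself.
high_earners = conn.execute(
    "SELECT Name, Salary FROM Employees WHERE Salary > ? ORDER BY Salary DESC",
    (55000,),
).fetchall()
print(high_earners)  # [('Meena', 68000), ('Varun', 60000)]

conn.close()
```

Because Salary was declared as INT, the values come back as numbers and can be compared and sorted directly.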

3️⃣ JSON (JavaScript Object Notation)

JSON is the go-to format for APIs and NoSQL databases. It’s lightweight and great for representing hierarchical data.

Example:

[
  {"Employee_ID": "E101", "Name": "Karthik", "Department": "HR", "Salary": 52000},
  {"Employee_ID": "E102", "Name": "Meena", "Department": "IT", "Salary": 68000},
  {"Employee_ID": "E103", "Name": "Varun", "Department": "Finance", "Salary": 60000}
]

✅ Pros

Flexible and easy to parse
Perfect for modern web and mobile apps

⚠️ Cons

No built-in schema
Can become large, because the key names are repeated in every record
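
Here is a short sketch of the flexibility mentioned above, using Python's standard json module. The "Skills" field is a hypothetical addition that is not part of the sample dataset; it is there only to show that one record can carry nested data the others lack.

```python
import json

# The JSON example from this post as a string.
JSON_TEXT = """
[
  {"Employee_ID": "E101", "Name": "Karthik", "Department": "HR", "Salary": 52000},
  {"Employee_ID": "E102", "Name": "Meena", "Department": "IT", "Salary": 68000},
  {"Employee_ID": "E103", "Name": "Varun", "Department": "Finance", "Salary": 60000}
]
"""

employees = json.loads(JSON_TEXT)

# Unlike CSV, JSON distinguishes numbers from strings.
print(type(employees[0]["Salary"]))  # <class 'int'>

# Hypothetical nested field added to a single record; no schema forbids it.
employees[1]["Skills"] = ["Python", "AWS"]

# Serialize and parse again: the nested list survives the round trip.
round_tripped = json.loads(json.dumps(employees))
print(round_tripped[1]["Skills"])
print("Skills" in round_tripped[0])  # False: other records are unaffected
```

The same freedom is the reason for the "no built-in schema" con: nothing stops two records from disagreeing about their fields.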

4️⃣ Parquet (Columnar Storage Format)

Parquet is built for big data analytics. It stores data column by column, which improves compression and lets a query read only the columns it needs. That makes it a good fit for engines like Amazon Athena or Spark.

Conceptual View:

Employee_ID: ["E101", "E102", "E103"]
Name: ["Karthik", "Meena", "Varun"]
Department: ["HR", "IT", "Finance"]
Salary: [52000, 68000, 60000]

✅ Pros

High compression and query efficiency
Best for cloud-scale analytics

⚠️ Cons

Binary, so not readable without tools
Requires a Parquet-aware engine or library such as Spark or PyArrow
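
The conceptual view above can be reproduced in a few lines of plain Python. To be clear, this sketch does not write a real Parquet file (that needs a library such as PyArrow); it only pivots the sample rows into columns to show why an aggregate over one column can ignore the rest.

```python
# Row-oriented view: one dict per employee, as in CSV or JSON.
rows = [
    {"Employee_ID": "E101", "Name": "Karthik", "Department": "HR", "Salary": 52000},
    {"Employee_ID": "E102", "Name": "Meena", "Department": "IT", "Salary": 68000},
    {"Employee_ID": "E103", "Name": "Varun", "Department": "Finance", "Salary": 60000},
]

# Column-oriented view: one list per field, like the conceptual view above.
columns = {key: [row[key] for row in rows] for key in rows[0]}
print(columns["Department"])  # ['HR', 'IT', 'Finance']

# An aggregate over Salary touches a single list and never looks at
# Employee_ID, Name, or Department.
salaries = columns["Salary"]
average_salary = sum(salaries) / len(salaries)
print(average_salary)  # 60000.0
```

Values of the same type also sit next to each other in this layout, which is what makes column-wise compression effective.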

5️⃣ XML (Extensible Markup Language)

XML represents data using nested tags. It’s structured and self-descriptive, and is often used in web services and configuration files.

Example:

<Employees>
  <Employee>
    <Employee_ID>E101</Employee_ID>
    <Name>Karthik</Name>
    <Department>HR</Department>
    <Salary>52000</Salary>
  </Employee>
  <Employee>
    <Employee_ID>E102</Employee_ID>
    <Name>Meena</Name>
    <Department>IT</Department>
    <Salary>68000</Salary>
  </Employee>
  <Employee>
    <Employee_ID>E103</Employee_ID>
    <Name>Varun</Name>
    <Department>Finance</Department>
    <Salary>60000</Salary>
  </Employee>
</Employees>

✅ Pros

Highly structured
Excellent for document-based storage

⚠️ Cons

Verbose syntax
Typically slower to parse than JSON
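
Reading the XML example back into usable records takes only the standard library. This is a minimal sketch with xml.etree.ElementTree; it assumes the document is small enough to hold in memory and that every Employee element has the same four children.

```python
import xml.etree.ElementTree as ET

# The XML example from this post as a string.
XML_TEXT = """
<Employees>
  <Employee>
    <Employee_ID>E101</Employee_ID>
    <Name>Karthik</Name>
    <Department>HR</Department>
    <Salary>52000</Salary>
  </Employee>
  <Employee>
    <Employee_ID>E102</Employee_ID>
    <Name>Meena</Name>
    <Department>IT</Department>
    <Salary>68000</Salary>
  </Employee>
  <Employee>
    <Employee_ID>E103</Employee_ID>
    <Name>Varun</Name>
    <Department>Finance</Department>
    <Salary>60000</Salary>
  </Employee>
</Employees>
"""

root = ET.fromstring(XML_TEXT.strip())

# Each <Employee> becomes a dict keyed by the child tag names.
employees = [
    {child.tag: child.text for child in employee}
    for employee in root.findall("Employee")
]

# Element text is always a string, so numbers are converted by hand.
for employee in employees:
    employee["Salary"] = int(employee["Salary"])

print(employees[0])
```

Every value needs both an opening and a closing tag, which is where the "verbose syntax" con comes from.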

6️⃣ Avro (Row-Based Storage Format)

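Avro is a compact binary format often used in streaming pipelines built on Apache Kafka. Data is always written and read against a schema, which is itself defined in JSON, and Avro data files embed that schema alongside the records.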

Schema Example:

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "Employee_ID", "type": "string"},
    {"name": "Name", "type": "string"},
    {"name": "Department", "type": "string"},
    {"name": "Salary", "type": "int"}
  ]
}

✅ Pros

Compact and fast
Schema evolution supported

⚠️ Cons

Not human-readable
Needs Avro-compatible tools
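
To give a feel for why Avro is compact, here is a hand-rolled sketch of how one record body is laid out according to the binary encoding in the Avro specification: strings are a length followed by UTF-8 bytes, ints use zigzag variable-length coding, and a record is simply its fields in schema order. The function names are hypothetical, and the sketch writes no file header or schema. In a real pipeline you would use an Avro library rather than code like this.

```python
import json


def zigzag_varint(n: int) -> bytes:
    """Encode an integer with zigzag + variable-length coding (Avro int/long)."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes map to small values
    out = bytearray()
    while True:
        byte = z & 0x7F  # lowest 7 bits
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(text: str) -> bytes:
    """Encode a string as its byte length followed by UTF-8 data."""
    data = text.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_employee(emp: dict) -> bytes:
    """Concatenate the fields in the order the schema above declares them."""
    return (
        encode_string(emp["Employee_ID"])
        + encode_string(emp["Name"])
        + encode_string(emp["Department"])
        + zigzag_varint(emp["Salary"])
    )


employee = {"Employee_ID": "E101", "Name": "Karthik", "Department": "HR", "Salary": 52000}
encoded = encode_employee(employee)

print(len(encoded))               # 19 bytes for the whole record
print(len(json.dumps(employee)))  # the JSON text for the same record is longer
```

Field names never appear in the encoded bytes; the reader recovers them from the schema, which is why the schema has to travel with the data or be available from a registry.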

Conclusion

Each data format plays a critical role in how cloud systems store and process information.

Use Case	Format
Lightweight exports	CSV
Relational storage	SQL
APIs and NoSQL	JSON
Big data analytics	Parquet
Document hierarchy	XML
Streaming pipelines	Avro

Data is the foundation of the modern world — and the cloud is its home. Choosing the right format ensures efficiency, scalability, and smarter data handling.

DEV Community
