Abinaya S

How the Cloud Stores Our Data

Introduction

Every photo you upload, message you send, or file you share lives somewhere beyond your device. That “somewhere” is the cloud.
Cloud platforms store massive amounts of data and make it securely accessible from almost anywhere in the world. Behind this simplicity lies a range of data formats, each designed for specific use cases in analytics, processing, and storage.
Let’s explore six of the most popular formats used across cloud platforms and analytics systems.

Data Formats in Cloud Analytics

Every time you store, share, or query data in the cloud, you’re likely dealing with one of these six formats:

CSV – Simple text-based, comma-separated data

SQL – Relational tables, defined and queried with SQL

JSON – Lightweight, flexible key-value data

Parquet – Efficient, columnar storage for big data

XML – Markup-based hierarchical data

Avro – Binary, schema-driven data for streaming

To make it easy to understand, let’s take a small dataset and represent it in all six formats.

Sample Dataset

| Employee_ID | Name | Department | Salary |
| --- | --- | --- | --- |
| E101 | Karthik | HR | 52000 |
| E102 | Meena | IT | 68000 |
| E103 | Varun | Finance | 60000 |

1️⃣ CSV (Comma Separated Values)

CSV is one of the simplest and most human-readable formats. Each record is written on its own line, and fields are separated by commas.

Example:

Employee_ID,Name,Department,Salary
E101,Karthik,HR,52000
E102,Meena,IT,68000
E103,Varun,Finance,60000

✅ Pros

  • Simple and widely supported
  • Can be opened in Excel, Notepad, or any text editor

⚠️ Cons

  • No data types: every value is read as text
  • Inefficient for big data analytics, since whole rows must be scanned
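
To see the "no data types" point in practice, here is a minimal sketch using Python's built-in csv module. The CSV text is the example above held in a string; the names CSV_TEXT and read_employees are made up for this illustration.

```python
import csv
import io

# The CSV example from above, kept in memory instead of a file.
CSV_TEXT = """Employee_ID,Name,Department,Salary
E101,Karthik,HR,52000
E102,Meena,IT,68000
E103,Varun,Finance,60000
"""


def read_employees(text):
    """Parse CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_employees(CSV_TEXT)

# Every field arrives as a string, so numeric columns must be converted by hand.
raw_salary = rows[0]["Salary"]
total_salary = sum(int(row["Salary"]) for row in rows)

print(type(raw_salary).__name__, total_salary)
```

The salary comes back as the string "52000", not a number, which is why CSV readers in analytics tools usually need type hints or type inference.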

2️⃣ SQL (Structured Query Language)

SQL is the language of relational databases. The database stores data in tables with defined, typed columns, and SQL lets you run complex queries against them.

Example:

CREATE TABLE Employees (
  Employee_ID VARCHAR(10),
  Name VARCHAR(50),
  Department VARCHAR(30),
  Salary INT
);

INSERT INTO Employees VALUES
('E101', 'Karthik', 'HR', 52000),
('E102', 'Meena', 'IT', 68000),
('E103', 'Varun', 'Finance', 60000);



✅ Pros

  • Highly structured and queryable
  • Supports relationships and constraints

⚠️ Cons

  • Schema must be defined up front, and changing it later takes a migration
  • Not flexible for nested data
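
The statements above can be tried without installing a database server. This is a small sketch that assumes SQLite (through Python's built-in sqlite3 module) as a stand-in for a cloud relational database; the 55000 threshold is just an illustrative filter.

```python
import sqlite3

# An in-memory SQLite database, used here as a stand-in for a cloud database.
conn = sqlite3.connect(":memory:")

conn.execute(
    """
    CREATE TABLE Employees (
      Employee_ID VARCHAR(10),
      Name VARCHAR(50),
      Department VARCHAR(30),
      Salary INT
    )
    """
)

# Parameterized inserts keep values separate from the SQL text.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?)",
    [
        ("E101", "Karthik", "HR", 52000),
        ("E102", "Meena", "IT", 68000),
        ("E103", "Varun", "Finance", 60000),
    ],
)

# Query: who earns more than 55000, highest salary first?
high_earners = conn.execute(
    "SELECT Name, Salary FROM Employees WHERE Salary > ? ORDER BY Salary DESC",
    (55000,),
).fetchall()
conn.close()

print(high_earners)
```

Unlike CSV, the Salary column keeps its integer type, so filtering and sorting work without manual conversion.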

3️⃣ JSON (JavaScript Object Notation)

JSON is the go-to format for APIs and NoSQL databases. It’s lightweight and great for representing hierarchical data.

Example:

[
  {"Employee_ID": "E101", "Name": "Karthik", "Department": "HR", "Salary": 52000},
  {"Employee_ID": "E102", "Name": "Meena", "Department": "IT", "Salary": 68000},
  {"Employee_ID": "E103", "Name": "Varun", "Department": "Finance", "Salary": 60000}
]


✅ Pros

  • Flexible and easy to parse
  • Perfect for modern web and mobile apps

⚠️ Cons

  • No built-in schema
  • Can become large, since key names repeat in every record
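
Here is a short sketch of both sides of that trade-off, using Python's built-in json module. The "Skills" field is hypothetical and added only to show nesting; it is not part of the sample dataset.

```python
import json

# The JSON example from above as a string.
JSON_TEXT = """
[
  {"Employee_ID": "E101", "Name": "Karthik", "Department": "HR", "Salary": 52000},
  {"Employee_ID": "E102", "Name": "Meena", "Department": "IT", "Salary": 68000},
  {"Employee_ID": "E103", "Name": "Varun", "Department": "Finance", "Salary": 60000}
]
"""

employees = json.loads(JSON_TEXT)

# Numbers keep their type: Salary is an int, not a string.
first_salary = employees[0]["Salary"]

# Flexibility: one record can carry a nested list the others lack.
# "Skills" is a hypothetical field used only for this illustration.
employees[1]["Skills"] = ["Python", "SQL"]

# No built-in schema: nothing complains that the records now differ in shape.
round_trip = json.loads(json.dumps(employees))
keys_per_record = [sorted(record.keys()) for record in round_trip]

print(type(first_salary).__name__, keys_per_record[1])
```

The flexibility is convenient for APIs, but any consistency checks have to come from the application or a separate validation layer.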

4️⃣ Parquet (Columnar Storage Format)

Parquet is built for big data analytics. It stores data column by column, which improves compression and lets queries read only the columns they need. That makes it a good fit for tools like Amazon Athena or Apache Spark.

Conceptual View:

Employee_ID: ["E101", "E102", "E103"]
Name: ["Karthik", "Meena", "Varun"]
Department: ["HR", "IT", "Finance"]
Salary: [52000, 68000, 60000]

✅ Pros

  • High compression and query efficiency
  • Best for cloud-scale analytics

⚠️ Cons

  • Binary, so not readable without tools
  • Requires a library or engine such as PyArrow or Spark
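
The conceptual view can be sketched in plain Python. This is not the Parquet file format itself (a real file adds encoding, compression, and metadata, and would be written with a library such as PyArrow); it only shows the idea of pivoting rows into columns. The helper name to_columns is made up for this example.

```python
# Row-oriented records, as they would appear in CSV or JSON.
rows = [
    {"Employee_ID": "E101", "Name": "Karthik", "Department": "HR", "Salary": 52000},
    {"Employee_ID": "E102", "Name": "Meena", "Department": "IT", "Salary": 68000},
    {"Employee_ID": "E103", "Name": "Varun", "Department": "Finance", "Salary": 60000},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {key: [] for key in records[0]}
    for record in records:
        for key, value in record.items():
            columns[key].append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one column touches only that column's values;
# names and departments are never read.
average_salary = sum(columns["Salary"]) / len(columns["Salary"])

print(columns["Salary"], average_salary)
```

Because each column holds values of a single type, similar values sit next to each other, which is what makes columnar compression effective.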

5️⃣ XML (Extensible Markup Language)

XML represents data using nested tags. It’s structured and self-descriptive, and is often used in web services and configuration files.

Example:

<Employees>
  <Employee>
    <Employee_ID>E101</Employee_ID>
    <Name>Karthik</Name>
    <Department>HR</Department>
    <Salary>52000</Salary>
  </Employee>
  <Employee>
    <Employee_ID>E102</Employee_ID>
    <Name>Meena</Name>
    <Department>IT</Department>
    <Salary>68000</Salary>
  </Employee>
  <Employee>
    <Employee_ID>E103</Employee_ID>
    <Name>Varun</Name>
    <Department>Finance</Department>
    <Salary>60000</Salary>
  </Employee>
</Employees>


✅ Pros

  • Highly structured
  • Excellent for document-based storage

⚠️ Cons

  • Verbose syntax
  • Typically slower to parse than JSON
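
For comparison with the JSON case, this is a minimal sketch of reading the same document with Python's built-in xml.etree.ElementTree module. The XML text is shortened to two employees to keep the example compact.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the XML example from above.
XML_TEXT = """
<Employees>
  <Employee>
    <Employee_ID>E101</Employee_ID>
    <Name>Karthik</Name>
    <Department>HR</Department>
    <Salary>52000</Salary>
  </Employee>
  <Employee>
    <Employee_ID>E102</Employee_ID>
    <Name>Meena</Name>
    <Department>IT</Department>
    <Salary>68000</Salary>
  </Employee>
</Employees>
"""

root = ET.fromstring(XML_TEXT)

# Each <Employee> element becomes a dict of child tag -> text.
employees = [
    {child.tag: child.text for child in employee}
    for employee in root.findall("Employee")
]

# Element text is always a string, so numbers need explicit conversion.
salaries = [int(employee["Salary"]) for employee in employees]

print(employees[0]["Name"], salaries)
```

Every value is wrapped in an opening and a closing tag, which is where the verbosity listed in the cons comes from.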

6️⃣ Avro (Row-Based Storage Format)

Avro is a compact binary format often used in streaming pipelines like Apache Kafka. Every record is written against a schema, which is stored alongside the data or shared separately.

Schema Example:

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "Employee_ID", "type": "string"},
    {"name": "Name", "type": "string"},
    {"name": "Department", "type": "string"},
    {"name": "Salary", "type": "int"}
  ]
}

✅ Pros

  • Compact and fast
  • Schema evolution supported

⚠️ Cons

  • Not human-readable
  • Needs Avro-compatible tools
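
To show why Avro is compact and why a reader needs the schema, here is a simplified sketch of how one record body is laid out in Avro's binary encoding: integers are zigzag-encoded and written as variable-length bytes, strings are a length followed by UTF-8 bytes, and fields appear in schema order with no names. It handles only the string and int types from the schema above and leaves out file headers and everything else a real library provides, so treat it as an illustration rather than a replacement for an Avro library. All function names are made up for this example.

```python
import io

# The schema from the example above, as a Python dict.
SCHEMA = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "Employee_ID", "type": "string"},
        {"name": "Name", "type": "string"},
        {"name": "Department", "type": "string"},
        {"name": "Salary", "type": "int"},
    ],
}


def encode_long(n):
    """Zigzag-encode an integer and write it as variable-length bytes."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_long(stream):
    """Read variable-length bytes from the stream and undo the zigzag step."""
    z = 0
    shift = 0
    while True:
        byte = stream.read(1)[0]
        z |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return (z >> 1) ^ -(z & 1)
        shift += 7


def encode_record(schema, record):
    """Write field values in schema order; field names are never stored."""
    out = bytearray()
    for field in schema["fields"]:
        value = record[field["name"]]
        if field["type"] == "string":
            data = value.encode("utf-8")
            out += encode_long(len(data)) + data
        elif field["type"] == "int":
            out += encode_long(value)
        else:
            raise ValueError(f"unsupported type in this sketch: {field['type']}")
    return bytes(out)


def decode_record(schema, payload):
    """Rebuild the record; without the schema the bytes cannot be interpreted."""
    stream = io.BytesIO(payload)
    record = {}
    for field in schema["fields"]:
        if field["type"] == "string":
            length = decode_long(stream)
            record[field["name"]] = stream.read(length).decode("utf-8")
        elif field["type"] == "int":
            record[field["name"]] = decode_long(stream)
        else:
            raise ValueError(f"unsupported type in this sketch: {field['type']}")
    return record


employee = {"Employee_ID": "E101", "Name": "Karthik", "Department": "HR", "Salary": 52000}
payload = encode_record(SCHEMA, employee)
restored = decode_record(SCHEMA, payload)

print(len(payload), restored == employee)
```

The first employee fits in 19 bytes because no field names or delimiters are written. The flip side is that the bytes are meaningless without the schema, which is why Avro tooling is required to read them.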

Conclusion

Each data format plays a critical role in how cloud systems store and process information.

| Use Case | Format |
| --- | --- |
| Lightweight exports | CSV |
| Relational storage | SQL |
| APIs and NoSQL | JSON |
| Big data analytics | Parquet |
| Document hierarchy | XML |
| Streaming pipelines | Avro |

Data is the foundation of the modern world, and the cloud is its home. Choosing the right format for each workload leads to better efficiency, scalability, and smarter data handling.
