Nandhini D

Posted on Oct 6

DATA IN SKY(Data in cloud)

#cloud #aws #database

Introduction
In a world where everything is connected, from smart homes to self-driving cars, one invisible force powers it all: data. But the real magic isn’t just in the data itself; it’s in where it lives.
Welcome to the cloud, where information floats freely, accessible anytime, anywhere.
When we say “data in the sky,” it doesn’t mean our files are literally floating above us. Instead, they’re stored in large data centers around the world, managed by cloud providers who keep that data available whenever we need it.
What Is the Cloud, Really?
The term cloud often sounds like something abstract, but it’s actually a network of powerful servers distributed across the globe.
When you upload a photo, stream a movie, or collaborate on a document in Google Drive, your data is stored, processed, and served from these cloud data centers.
In short:
The cloud 🡪 someone else’s supercomputer that works for you over the internet.
Let’s dive into the six most commonly used data formats in data analytics and cloud systems.

Data Formats in Cloud Analytics

Every time you store, share, or query data in the cloud, you’re likely dealing with one of these six formats:

CSV – Simple text-based, comma-separated data

SQL – Relational, structured data tables

JSON – Lightweight, flexible key-value data

Parquet – Efficient, columnar storage for big data

XML – Markup-based hierarchical data

Avro – Binary, schema-driven data for streaming

To make it easy to understand, let’s take a small dataset and represent it in all six formats.

Sample Dataset

Name	Register_No	Subject	Marks
Aadhiran	101	AI	92
Kavin	102	ML	88
Rekha	103	DBMS	95

1. CSV (Comma-Separated Values)
CSV is one of the most popular and simplest data formats. Each line in a CSV file represents a record, and each value is separated by a comma.
It’s easy to read, portable, and supported across almost every software tool — from Excel to Python to Google Sheets.

Example:

Name,Register_No,Subject,Marks
Aadhiran,101,AI,92
Kavin,102,ML,88
Rekha,103,DBMS,95

✅ Pros:

Human-readable and easy to edit
Works with almost every data tool

⚠️ Cons:

No schema or metadata
Not efficient for large-scale analytics
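The "no schema" point is easiest to see when reading the file back. Here is a minimal sketch using Python's built-in csv module; the `load_students` helper and the inline `CSV_TEXT` string are assumptions made for illustration, standing in for a real file on disk.

```python
import csv
import io

# The sample dataset from above, inlined so the sketch is self-contained.
CSV_TEXT = """Name,Register_No,Subject,Marks
Aadhiran,101,AI,92
Kavin,102,ML,88
Rekha,103,DBMS,95
"""


def load_students(text):
    """Parse CSV text into a list of dicts, casting numeric columns by hand."""
    reader = csv.DictReader(io.StringIO(text))
    students = []
    for row in reader:
        # CSV carries no type information: every value arrives as a string,
        # so the reader has to know which columns are numbers.
        row["Register_No"] = int(row["Register_No"])
        row["Marks"] = int(row["Marks"])
        students.append(row)
    return students


students = load_students(CSV_TEXT)
print(students[0])
```

The manual `int(...)` casts are the price of having no schema: nothing in the file says that Marks is a number.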

2. SQL (Structured Query Language)
SQL is a query language rather than a file format, but it is the standard way to define, load, and query relational data, which is stored in tables with defined columns and data types.
This is the backbone of databases like MySQL, PostgreSQL, and Oracle.

Example:

CREATE TABLE Students (
  Name VARCHAR(50),
  Register_No INT,
  Subject VARCHAR(20),
  Marks INT
);

INSERT INTO Students VALUES 
('Aadhiran', 101, 'AI', 92),
('Kavin', 102, 'ML', 88),
('Rekha', 103, 'DBMS', 95);

✅ Pros:

Structured, relational, and queryable
Ideal for joins, filters, and aggregations

⚠️ Cons:

Rigid schema
Not suited for nested or semi-structured data
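To try the statements above without installing a database server, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. SQLite is an assumption chosen for convenience; the same SQL should work with minor changes in MySQL or PostgreSQL.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

conn.execute(
    """
    CREATE TABLE Students (
      Name VARCHAR(50),
      Register_No INT,
      Subject VARCHAR(20),
      Marks INT
    )
    """
)

rows = [
    ("Aadhiran", 101, "AI", 92),
    ("Kavin", 102, "ML", 88),
    ("Rekha", 103, "DBMS", 95),
]
# Parameter placeholders (?) keep values separate from the SQL text.
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?)", rows)

# A filter with ordering, then an aggregation.
top_scorers = conn.execute(
    "SELECT Name, Marks FROM Students WHERE Marks > ? ORDER BY Marks DESC",
    (90,),
).fetchall()
average_marks = conn.execute("SELECT AVG(Marks) FROM Students").fetchone()[0]

print(top_scorers)
print(average_marks)
conn.close()
```

Filtering and aggregation happen inside the database engine, which is what makes relational storage so convenient to query.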

3. JSON (JavaScript Object Notation)
JSON is the king of web APIs and NoSQL databases.
It’s lightweight, flexible, and great for hierarchical or nested data structures.

Example:

[
  {"Name": "Aadhiran", "Register_No": 101, "Subject": "AI", "Marks": 92},
  {"Name": "Kavin", "Register_No": 102, "Subject": "ML", "Marks": 88},
  {"Name": "Rekha", "Register_No": 103, "Subject": "DBMS", "Marks": 95}
]

✅ Pros:

Easy to use and parse
Excellent for APIs and web applications

⚠️ Cons:

No enforced schema
Can grow large for big datasets
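Here is a short sketch of reading and writing this data with Python's built-in json module. The `partial` record with a missing key is a made-up example, included only to show what "no enforced schema" means in practice.

```python
import json

JSON_TEXT = """
[
  {"Name": "Aadhiran", "Register_No": 101, "Subject": "AI", "Marks": 92},
  {"Name": "Kavin", "Register_No": 102, "Subject": "ML", "Marks": 88},
  {"Name": "Rekha", "Register_No": 103, "Subject": "DBMS", "Marks": 95}
]
"""

# Unlike CSV, JSON keeps basic types: numbers come back as int, not str.
students = json.loads(JSON_TEXT)
marks_by_name = {s["Name"]: s["Marks"] for s in students}

# No enforced schema: a record with a missing field still parses fine,
# so the reading code has to handle the gap itself.
partial = json.loads('{"Name": "Kavin"}')
missing_marks = partial.get("Marks")  # None, because the key is absent

# Serialize back to text, e.g. for an API response.
round_trip = json.dumps(students)

print(marks_by_name)
```

The key names are repeated in every record, which is a large part of why JSON grows quickly on big datasets.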

4. Parquet (Columnar Storage Format)
Apache Parquet is designed for big data analytics.
It stores data column-wise instead of row-wise, which improves compression and query performance.
It’s widely supported by analytics tools such as Apache Spark, Amazon Athena, and Google BigQuery.

Conceptual View:

Columns:
Name: ["Aadhiran", "Kavin", "Rekha"]
Register_No: [101, 102, 103]
Subject: ["AI", "ML", "DBMS"]
Marks: [92, 88, 95]

✅ Pros:

Highly compressed and efficient
Great for analytical queries

⚠️ Cons:

Not human-readable
Requires tools to read/write (e.g., PyArrow, Spark)
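The column-wise idea can be sketched in plain Python. Note that this is only an illustration of the layout, not real Parquet: it writes no file and does no compression, and the `to_columns` helper is a name invented for this example. Reading or writing actual Parquet files needs a library such as PyArrow.

```python
# Row-wise layout: one tuple per student, as in CSV or a SQL table.
rows = [
    ("Aadhiran", 101, "AI", 92),
    ("Kavin", 102, "ML", 88),
    ("Rekha", 103, "DBMS", 95),
]
column_names = ["Name", "Register_No", "Subject", "Marks"]


def to_columns(rows, names):
    """Pivot a list of row tuples into a dict of column lists."""
    # zip(*rows) transposes the rows, yielding one tuple per column.
    return {name: list(values) for name, values in zip(names, zip(*rows))}


columns = to_columns(rows, column_names)

# An analytical query such as "average marks" only needs one column.
# In a columnar layout the other three columns are never touched.
average_marks = sum(columns["Marks"]) / len(columns["Marks"])

print(columns["Marks"])
print(average_marks)
```

Because each column holds values of a single type, similar values sit next to each other, which is what lets columnar formats compress well.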

5. XML (Extensible Markup Language)
XML is a markup language that uses tags to define data structure.
It’s often used in web services, configuration files, and document exchange.

Example:

<Students>
  <Student>
    <Name>Aadhiran</Name>
    <Register_No>101</Register_No>
    <Subject>AI</Subject>
    <Marks>92</Marks>
  </Student>
  <Student>
    <Name>Kavin</Name>
    <Register_No>102</Register_No>
    <Subject>ML</Subject>
    <Marks>88</Marks>
  </Student>
  <Student>
    <Name>Rekha</Name>
    <Register_No>103</Register_No>
    <Subject>DBMS</Subject>
    <Marks>95</Marks>
  </Student>
</Students>

✅ Pros:

Self-descriptive and structured
Ideal for hierarchical data

⚠️ Cons:

Verbose
Typically slower to parse than JSON
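Here is a minimal sketch of parsing the document above with Python's built-in xml.etree.ElementTree module. The inline `XML_TEXT` string stands in for a real file and is trimmed to the same three students.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<Students>
  <Student>
    <Name>Aadhiran</Name><Register_No>101</Register_No>
    <Subject>AI</Subject><Marks>92</Marks>
  </Student>
  <Student>
    <Name>Kavin</Name><Register_No>102</Register_No>
    <Subject>ML</Subject><Marks>88</Marks>
  </Student>
  <Student>
    <Name>Rekha</Name><Register_No>103</Register_No>
    <Subject>DBMS</Subject><Marks>95</Marks>
  </Student>
</Students>
"""

root = ET.fromstring(XML_TEXT.strip())

students = []
for node in root.findall("Student"):
    students.append(
        {
            "Name": node.findtext("Name"),
            # Element text is always a string, so numbers are cast by hand.
            "Register_No": int(node.findtext("Register_No")),
            "Subject": node.findtext("Subject"),
            "Marks": int(node.findtext("Marks")),
        }
    )

print(students[1])
```

Every value is wrapped in an opening and a closing tag, which is where the verbosity comes from.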

6. Avro (Row-Based Storage Format)
Apache Avro is a binary row-based format used for data serialization — ideal for streaming and messaging systems like Apache Kafka.
An Avro data file stores its schema alongside the data, which keeps records consistent and lets the schema evolve over time.

Schema Example:

{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "Register_No", "type": "int"},
    {"name": "Subject", "type": "string"},
    {"name": "Marks", "type": "int"}
  ]
}

✅ Pros:

Compact binary format
Schema evolution supported
Excellent for data streaming

⚠️ Cons:

Not human-readable
Requires Avro libraries to use
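To see why Avro is so compact, here is a hand-rolled sketch of how one record body is laid out in Avro's binary encoding: integers as zig-zag variable-length values, strings as a length followed by UTF-8 bytes, and fields written in schema order. This is an illustration only, not the Avro library. The function names are invented for this example, and a real Avro file also carries a header with the schema, which this sketch leaves out.

```python
def zigzag_varint(n):
    """Encode an integer as zig-zag plus base-128 varint, as Avro does for int/long."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small numbers
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s):
    """Encode a string as its byte length followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_student(name, register_no, subject, marks):
    """Concatenate the fields in schema order: no field names, no delimiters."""
    return (
        encode_string(name)
        + zigzag_varint(register_no)
        + encode_string(subject)
        + zigzag_varint(marks)
    )


encoded = encode_student("Aadhiran", 101, "AI", 92)
print(len(encoded), encoded.hex())
```

The bytes contain no field names at all, so a reader cannot make sense of them without the schema. That is why Avro keeps the schema with the data, and why the format is not human-readable.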

Conclusion:
Each data format serves a unique purpose in the data ecosystem.

Use Case
Simple exports or logs 🡪 CSV
Relational storage 🡪 SQL
API responses or nested data 🡪 JSON
Cloud-scale analytics 🡪 Parquet
Hierarchical or document data 🡪 XML
Data pipelines or streaming 🡪 Avro
