Nandhini D

DATA IN SKY (Data in Cloud)

Introduction
In a world where everything is connected, from smart homes to self-driving cars, one invisible force powers it all: data. But the real magic isn’t just in the data itself; it’s in where it lives.
Welcome to the cloud, where information floats freely, accessible anytime, anywhere.
When we say “data in the sky,” it doesn’t mean our files are literally floating above us.
Instead, they’re safely stored in giant data centers around the world, managed by cloud providers who make sure your data is always available when you need it.

What Is the Cloud, Really?
The term cloud often sounds like something abstract, but it’s actually a network of powerful servers distributed across the globe.
When you upload a photo, stream a movie, or collaborate on a document in Google Drive, your data is being stored, processed, and served from these cloud data centers.
In short:
The cloud 🡪 someone else’s supercomputer that works for you over the internet.
Let’s dive into six commonly used data formats in data analytics and cloud systems.

Data Formats in Cloud Analytics

Every time you store, share, or query data in the cloud, you’re likely dealing with one of these six formats:

CSV – Simple text-based, comma-separated data

SQL – Relational, structured data tables

JSON – Lightweight, flexible key-value data

Parquet – Efficient, columnar storage for big data

XML – Markup-based hierarchical data

Avro – Binary, schema-driven data for streaming

To make it easy to understand, let’s take a small dataset and represent it in all six formats.

Sample Dataset

Name      Register_No  Subject  Marks
Aadhiran  101          AI       92
Kavin     102          ML       88
Rekha     103          DBMS     95

1. CSV (Comma-Separated Values)
CSV is one of the most popular and simplest data formats. Each line in a CSV file represents a record, and each value is separated by a comma.
It’s easy to read, portable, and supported across almost every software tool — from Excel to Python to Google Sheets.

Example:

Name,Register_No,Subject,Marks
Aadhiran,101,AI,92
Kavin,102,ML,88
Rekha,103,DBMS,95

✅ Pros:

  • Human-readable and easy to edit
  • Works with almost every data tool

⚠️ Cons:

  • No schema or metadata
  • Not efficient for large-scale analytics
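
To see the "no schema" point in practice, here is a minimal Python sketch that reads the CSV example above with the standard library's csv module. The text is kept in memory for simplicity (a real file would work the same way), and the variable names are only illustrative.

```python
import csv
import io

# The CSV example from above, held in memory instead of a file on disk.
csv_text = """Name,Register_No,Subject,Marks
Aadhiran,101,AI,92
Kavin,102,ML,88
Rekha,103,DBMS,95
"""

# DictReader uses the header row as keys; every value arrives as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV carries no type information, so numeric columns are converted by hand.
for row in rows:
    row["Register_No"] = int(row["Register_No"])
    row["Marks"] = int(row["Marks"])

average_marks = sum(row["Marks"] for row in rows) / len(rows)
print(rows[0])
print(average_marks)
```

Note that the reader has to know which columns are numbers; nothing in the file says so.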

2. SQL (Structured Query Language)
SQL is strictly a query language rather than a file format, but it is the standard way to define, load, and query relational data: tables with defined columns and data types.
It is the backbone of databases like MySQL, PostgreSQL, and Oracle.

Example:

CREATE TABLE Students (
  Name VARCHAR(50),
  Register_No INT,
  Subject VARCHAR(20),
  Marks INT
);

INSERT INTO Students VALUES 
('Aadhiran', 101, 'AI', 92),
('Kavin', 102, 'ML', 88),
('Rekha', 103, 'DBMS', 95);



✅ Pros:

  • Structured, relational, and queryable
  • Ideal for joins, filters, and aggregations

⚠️ Cons:

  • Rigid schema
  • Not suited for nested or semi-structured data
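
As a quick way to try the statements above without installing a database server, the sketch below runs them against an in-memory SQLite database using Python's built-in sqlite3 module. SQLite is my choice for illustration only; the same SQL should work with small or no changes on MySQL or PostgreSQL. The filter query at the end is an added example, not part of the original dataset.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

conn.execute("""
    CREATE TABLE Students (
        Name VARCHAR(50),
        Register_No INT,
        Subject VARCHAR(20),
        Marks INT
    )
""")

# Parameterized inserts keep the data separate from the SQL text.
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?)",
    [
        ("Aadhiran", 101, "AI", 92),
        ("Kavin", 102, "ML", 88),
        ("Rekha", 103, "DBMS", 95),
    ],
)

# Filtering and sorting are expressed declaratively in the query itself.
top_students = conn.execute(
    "SELECT Name, Marks FROM Students WHERE Marks >= 90 ORDER BY Marks DESC"
).fetchall()
print(top_students)

conn.close()
```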

3. JSON (JavaScript Object Notation)
JSON is the king of web APIs and NoSQL databases.
It’s lightweight, flexible, and great for hierarchical or nested data structures.

Example:

[
  {"Name": "Aadhiran", "Register_No": 101, "Subject": "AI", "Marks": 92},
  {"Name": "Kavin", "Register_No": 102, "Subject": "ML", "Marks": 88},
  {"Name": "Rekha", "Register_No": 103, "Subject": "DBMS", "Marks": 95}
]


✅ Pros:

  • Easy to use and parse
  • Excellent for APIs and web applications

⚠️ Cons:

  • No enforced schema
  • Can grow large for big datasets
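
Here is a small Python sketch, using only the standard json module, that parses the JSON example above. Unlike CSV, numbers come back as numbers. The "Electives" field added near the end is hypothetical and not part of the sample dataset; it is there only to show that nested data fits without any schema change.

```python
import json

# The JSON example from above as a string, as it might arrive from an API.
json_text = """
[
  {"Name": "Aadhiran", "Register_No": 101, "Subject": "AI", "Marks": 92},
  {"Name": "Kavin", "Register_No": 102, "Subject": "ML", "Marks": 88},
  {"Name": "Rekha", "Register_No": 103, "Subject": "DBMS", "Marks": 95}
]
"""

# json.loads turns the array into a list of dicts with native Python types.
students = json.loads(json_text)
marks_by_name = {student["Name"]: student["Marks"] for student in students}

# Hypothetical nested field: one record gains a list, the others stay as-is.
students[0]["Electives"] = ["NLP", "Vision"]

# Serialize and parse again to confirm the nested structure survives.
round_trip = json.loads(json.dumps(students))
print(marks_by_name)
print(round_trip[0])
```

The flexibility cuts both ways: nothing stops one record from having fields the others lack, which is the "no enforced schema" trade-off listed above.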

4. Parquet (Columnar Storage Format)
Apache Parquet is designed for big data analytics.
It stores data column-wise instead of row-wise, which improves compression and lets queries read only the columns they need.
It’s widely used with tools like Apache Spark and Amazon Athena, and Google BigQuery can load and query Parquet files.

Conceptual View:

Columns:
Name: ["Aadhiran", "Kavin", "Rekha"]
Register_No: [101, 102, 103]
Subject: ["AI", "ML", "DBMS"]
Marks: [92, 88, 95]


✅ Pros:

  • Highly compressed and efficient
  • Great for analytical queries

⚠️ Cons:

  • Not human-readable
  • Requires tools to read/write (e.g., PyArrow, Spark)
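
The conceptual view above can be sketched in plain Python. This is not real Parquet (an actual file would be written with a library such as PyArrow or Spark, and adds encoding, compression, and metadata); it only illustrates the row-to-column pivot and why a query over one column does not need to touch the others.

```python
# Row-oriented view: one dict per record, as in CSV or JSON.
rows = [
    {"Name": "Aadhiran", "Register_No": 101, "Subject": "AI", "Marks": 92},
    {"Name": "Kavin", "Register_No": 102, "Subject": "ML", "Marks": 88},
    {"Name": "Rekha", "Register_No": 103, "Subject": "DBMS", "Marks": 95},
]

# Column-oriented view: one list per field, values of the same type together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An analytical query such as "average marks" reads a single column.
marks = columns["Marks"]
average_marks = sum(marks) / len(marks)
highest_marks = max(marks)

print(columns)
print(average_marks, highest_marks)
```

Keeping same-typed values next to each other is also what makes columnar data compress well.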

5. XML (Extensible Markup Language)
XML is a markup language that uses tags to define data structure.
It’s often used in web services, configuration files, and document exchange.

Example:

<Students>
  <Student>
    <Name>Aadhiran</Name>
    <Register_No>101</Register_No>
    <Subject>AI</Subject>
    <Marks>92</Marks>
  </Student>
  <Student>
    <Name>Kavin</Name>
    <Register_No>102</Register_No>
    <Subject>ML</Subject>
    <Marks>88</Marks>
  </Student>
  <Student>
    <Name>Rekha</Name>
    <Register_No>103</Register_No>
    <Subject>DBMS</Subject>
    <Marks>95</Marks>
  </Student>
</Students>

✅ Pros:

  • Self-descriptive and structured
  • Ideal for hierarchical data

⚠️ Cons:

  • Verbose
  • Slower parsing than JSON
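
For comparison with the JSON case, here is a minimal sketch that parses the XML example above with Python's standard xml.etree.ElementTree module. As with CSV, element text is always a string, so the numeric fields are converted explicitly.

```python
import xml.etree.ElementTree as ET

# The XML example from above as a string.
xml_text = """
<Students>
  <Student>
    <Name>Aadhiran</Name>
    <Register_No>101</Register_No>
    <Subject>AI</Subject>
    <Marks>92</Marks>
  </Student>
  <Student>
    <Name>Kavin</Name>
    <Register_No>102</Register_No>
    <Subject>ML</Subject>
    <Marks>88</Marks>
  </Student>
  <Student>
    <Name>Rekha</Name>
    <Register_No>103</Register_No>
    <Subject>DBMS</Subject>
    <Marks>95</Marks>
  </Student>
</Students>
"""

root = ET.fromstring(xml_text.strip())

# Walk each <Student> element and pull out the text of its child tags.
students = [
    {
        "Name": student.findtext("Name"),
        "Register_No": int(student.findtext("Register_No")),
        "Subject": student.findtext("Subject"),
        "Marks": int(student.findtext("Marks")),
    }
    for student in root.findall("Student")
]
print(students)
```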

6. Avro (Row-Based Storage Format)
Apache Avro is a binary, row-based format used for data serialization. It is well suited to streaming and messaging systems like Apache Kafka.
Avro data files store the schema alongside the data, so readers always know how to decode the records, and the schema can evolve over time.

Schema Example:

{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "Register_No", "type": "int"},
    {"name": "Subject", "type": "string"},
    {"name": "Marks", "type": "int"}
  ]
}

✅ Pros:

  • Compact binary format
  • Schema evolution supported
  • Excellent for data streaming

⚠️ Cons:

  • Not human-readable
  • Requires Avro libraries to use
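
To give a feel for why Avro is compact, the sketch below hand-encodes one student record the way I understand Avro's binary encoding to work: integers as zig-zag variable-length numbers, strings as a length followed by UTF-8 bytes, and a record as its fields written back to back in schema order. This is a simplified illustration only. The function names are my own, it skips the file header, embedded schema, and data blocks of a real Avro file, and in practice you would use an Avro library instead.

```python
def encode_int(n: int) -> bytes:
    """Zig-zag the value, then emit it 7 bits at a time, low bits first."""
    z = (n << 1) ^ (n >> 63)  # zig-zag keeps small negative numbers short
    out = bytearray()
    while True:
        low_bits = z & 0x7F
        z >>= 7
        if z:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """Length prefix (encoded like an int) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_int(len(data)) + data


def encode_student(name: str, register_no: int, subject: str, marks: int) -> bytes:
    """Fields are concatenated in schema order; no field names are stored."""
    return (
        encode_string(name)
        + encode_int(register_no)
        + encode_string(subject)
        + encode_int(marks)
    )


record = encode_student("Aadhiran", 101, "AI", 92)
print(record.hex(" "))
print(len(record), "bytes")
```

Because field names live in the schema rather than in every record, each record holds little more than the raw values. That is also why a reader cannot make sense of the bytes without the schema.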

Conclusion:
Each data format serves a unique purpose in the data ecosystem.

Use Case 🡪 Format
Simple exports or logs 🡪 CSV
Relational storage 🡪 SQL
API responses or nested data 🡪 JSON
Cloud-scale analytics 🡪 Parquet
Hierarchical or document data 🡪 XML
Data pipelines or streaming 🡪 Avro
