In the world of data analytics, how we store, exchange, and process data depends heavily on the data format used.
From simple text files like CSV to efficient binary formats like Parquet, each format makes different trade-offs in performance, scalability, and compatibility.
In this article, we’ll explore six common data formats used in analytics:
- CSV (Comma-Separated Values)
- SQL (Relational Table Format)
- JSON (JavaScript Object Notation)
- Parquet (Columnar Storage Format)
- XML (Extensible Markup Language)
- Avro (Row-based Storage Format)
We’ll use this simple dataset as an example:
| Name | Register_No | Subject | Marks |
| --- | --- | --- | --- |
| Hari | 101 | Math | 89 |
| Asha | 102 | Science | 92 |
| Kiran | 103 | English | 85 |
Step 1: Install Dependencies
!pip install pandas pyarrow fastavro lxml
Step 2: Create the Dataset
We’ll first create a small dataset using pandas.
import pandas as pd
data = {
    "Name": ["Hari", "Asha", "Kiran"],
    "Register_No": [101, 102, 103],
    "Subject": ["Math", "Science", "English"],
    "Marks": [89, 92, 85]
}
df = pd.DataFrame(data)
df
1. CSV (Comma-Separated Values)
CSV is one of the simplest and most common formats for storing tabular data.
Each line represents a row, and columns are separated by commas. It’s human-readable and works easily with Excel or Python.
# Save to CSV
df.to_csv("students.csv", index=False)
# Verify content
print(open("students.csv").read())
Output:
Name,Register_No,Subject,Marks
Hari,101,Math,89
Asha,102,Science,92
Kiran,103,English,85
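Because CSV is plain text, it carries no type information; pandas re-infers the column types when the file is read back. A minimal round-trip sketch using the same file:
# Read the CSV back; dtypes are inferred from the text on load
df_csv = pd.read_csv("students.csv")
print(df_csv.dtypes)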
2. SQL (Relational Table Format)
SQL organizes data into relational tables using columns and rows.
You can store, query, and join tables efficiently in databases like MySQL, SQLite, or PostgreSQL.
import sqlite3
# Create SQLite database and table
conn = sqlite3.connect("students.db")
df.to_sql("students", conn, if_exists="replace", index=False)
# Verify content
pd.read_sql("SELECT * FROM students", conn)
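Since the data now lives in a relational table, you can filter or aggregate it in SQL instead of loading everything into pandas first. A minimal sketch of a filtered query (the threshold 85 is just an illustrative value):
# Query only the students scoring above 85
pd.read_sql("SELECT Name, Marks FROM students WHERE Marks > 85", conn)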
3. JSON (JavaScript Object Notation)
JSON represents data as key-value pairs, making it ideal for APIs and web applications.
It’s lightweight and language-independent.
# Save to JSON (pretty format)
df.to_json("students.json", orient="records", indent=2)
# Verify content
print(open("students.json").read())
Output:
[
  {
    "Name": "Hari",
    "Register_No": 101,
    "Subject": "Math",
    "Marks": 89
  },
  {
    "Name": "Asha",
    "Register_No": 102,
    "Subject": "Science",
    "Marks": 92
  },
  {
    "Name": "Kiran",
    "Register_No": 103,
    "Subject": "English",
    "Marks": 85
  }
]
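Reading the records back is symmetric; pandas rebuilds the DataFrame directly from the list of objects. A minimal sketch:
# Load the JSON records back into a DataFrame
pd.read_json("students.json", orient="records")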
4. Parquet (Columnar Storage Format)
Parquet is a columnar storage format optimized for big data analytics.
It stores data by columns instead of rows, which improves compression and query performance — especially for analytical workloads.
# Save to Parquet
df.to_parquet("students.parquet", index=False)
# Verify by reading it back
pd.read_parquet("students.parquet")
Efficient and compact — great for distributed systems.
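Unlike CSV, a Parquet file carries its schema and per-column statistics in the file footer, which is what query engines use to skip data. You can inspect that metadata with pyarrow (installed in Step 1); a minimal sketch:
import pyarrow.parquet as pq

# Inspect the schema and row count stored in the Parquet footer
pf = pq.ParquetFile("students.parquet")
print(pf.schema_arrow)
print("Rows:", pf.metadata.num_rows)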
5. XML (Extensible Markup Language)
XML stores data using nested tags, similar to HTML.
It’s human-readable and useful for hierarchical or document-like data.
import xml.etree.ElementTree as ET
root = ET.Element("Students")
for _, row in df.iterrows():
    student = ET.SubElement(root, "Student")
    ET.SubElement(student, "Name").text = str(row["Name"])
    ET.SubElement(student, "Register_No").text = str(row["Register_No"])
    ET.SubElement(student, "Subject").text = str(row["Subject"])
    ET.SubElement(student, "Marks").text = str(row["Marks"])
tree = ET.ElementTree(root)
tree.write("students.xml", encoding="utf-8", xml_declaration=True)
# Verify content
print(open("students.xml").read())
Output:
<?xml version='1.0' encoding='utf-8'?>
<Students>
<Student><Name>Hari</Name><Register_No>101</Register_No><Subject>Math</Subject><Marks>89</Marks></Student>
<Student><Name>Asha</Name><Register_No>102</Register_No><Subject>Science</Subject><Marks>92</Marks></Student>
<Student><Name>Kiran</Name><Register_No>103</Register_No><Subject>English</Subject><Marks>85</Marks></Student>
</Students>
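pandas can also parse this structure back into tabular form with read_xml (available in pandas 1.3+ and backed by the lxml parser installed in Step 1). A minimal sketch:
# Parse each <Student> element back into a DataFrame row
pd.read_xml("students.xml", xpath=".//Student")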
6. Avro (Row-based Storage Format)
Avro is a binary row-based format developed by Apache for data serialization.
It includes the schema with the data, making it ideal for streaming systems like Apache Kafka.
from fastavro import writer, reader, parse_schema
schema = {
    "doc": "Student data",
    "name": "Student",
    "namespace": "example.avro",
    "type": "record",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Register_No", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"}
    ]
}

records = df.to_dict(orient="records")

with open("students.avro", "wb") as out:
    writer(out, parse_schema(schema), records)

print("students.avro created")
Verify Avro Data
from fastavro import reader
with open("students.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
Output:
{'Name': 'Hari', 'Register_No': 101, 'Subject': 'Math', 'Marks': 89}
{'Name': 'Asha', 'Register_No': 102, 'Subject': 'Science', 'Marks': 92}
{'Name': 'Kiran', 'Register_No': 103, 'Subject': 'English', 'Marks': 85}
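Because the writer schema is stored in the file, Avro also supports schema evolution: a newer reader schema can add fields with defaults and still read older files. A minimal sketch, where Grade is a hypothetical new field added only for illustration:
from fastavro import reader

# Reader schema adds an optional "Grade" field with a default;
# records written with the old schema get the default filled in on read
evolved_schema = {
    "name": "Student",
    "namespace": "example.avro",
    "type": "record",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Register_No", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"},
        {"name": "Grade", "type": "string", "default": "NA"}
    ]
}

with open("students.avro", "rb") as fo:
    for record in reader(fo, evolved_schema):
        print(record)  # each record now includes Grade='NA'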
Summary Check
print("Files created in Colab working directory:")
!ls -lh students.*
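On a dataset this tiny, the binary formats can actually be larger than CSV because of their schema and metadata overhead; their compression advantage only shows up at scale. The same check in plain Python:
import os

# Compare the on-disk size of each serialization of the same dataset
for path in ["students.csv", "students.db", "students.json",
             "students.parquet", "students.xml", "students.avro"]:
    print(f"{path}: {os.path.getsize(path)} bytes")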
Conclusion
Each data format serves a specific purpose in analytics:
| Format | Type | Best For |
| --- | --- | --- |
| CSV | Text | Simple tabular data |
| SQL | Relational | Structured, queryable data |
| JSON | Hierarchical | APIs, NoSQL data |
| Parquet | Columnar | Big data and analytics |
| XML | Tagged | Configs and document data |
| Avro | Binary (Row) | Streaming, schema evolution |
Understanding when to use each format helps you build scalable, efficient, and interoperable data systems in the cloud.
You can view and run this notebook yourself on Google Colab:
View Colab Notebook - https://colab.research.google.com/drive/1vEfM0Cv9ir_FXmjibvG1pNGUazFJKbNF?usp=sharing