<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hari Venkatesh</title>
    <description>The latest articles on DEV Community by Hari Venkatesh (@hari_venkatesh_).</description>
    <link>https://dev.to/hari_venkatesh_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3457015%2F501b47dc-c3d8-4e92-8315-778eda435258.png</url>
      <title>DEV Community: Hari Venkatesh</title>
      <link>https://dev.to/hari_venkatesh_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hari_venkatesh_"/>
    <language>en</language>
    <item>
      <title>Understanding 6 Common Data Formats in Data Analytics</title>
      <dc:creator>Hari Venkatesh</dc:creator>
      <pubDate>Wed, 08 Oct 2025 13:44:43 +0000</pubDate>
      <link>https://dev.to/hari_venkatesh_/understanding-6-common-data-formats-in-data-analytics-4kf1</link>
      <guid>https://dev.to/hari_venkatesh_/understanding-6-common-data-formats-in-data-analytics-4kf1</guid>
      <description>

&lt;p&gt;In the world of &lt;strong&gt;data analytics&lt;/strong&gt;, data comes in many shapes and formats. Choosing the right one can make your analytics pipeline faster, more efficient, and easier to maintain.&lt;/p&gt;

&lt;p&gt;In this post, let’s explore &lt;strong&gt;6 commonly used data formats&lt;/strong&gt; in data analytics, using one small example dataset throughout.&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Example Dataset
&lt;/h2&gt;

&lt;p&gt;Let’s consider a simple dataset of students and their marks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Register Number&lt;/th&gt;
&lt;th&gt;Subject&lt;/th&gt;
&lt;th&gt;Marks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hari&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Math&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vignesh&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;Science&lt;/td&gt;
&lt;td&gt;88&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Priya&lt;/td&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;92&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  1️⃣ CSV (Comma Separated Values)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;📘 Explanation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
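CSV is one of the simplest and most widely used formats. Each line is a record, and the values in it are separated by commas; the first line usually names the columns. It’s easy to create, read, and import into tools like Excel or Python’s pandas, but it carries no type information, so every value is read as text until you convert it.&lt;/p&gt;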

&lt;p&gt;&lt;strong&gt;🧾 Example (data.csv):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name,RegisterNumber,Subject,Marks
Hari,101,Math,95
Vignesh,102,Science,88
Priya,103,English,92
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
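
&lt;p&gt;As a minimal sketch of reading this file, the snippet below parses the same rows with Python’s standard &lt;code&gt;csv&lt;/code&gt; module. The text is kept inline instead of in &lt;code&gt;data.csv&lt;/code&gt; so it runs on its own, and the helper name &lt;code&gt;load_students&lt;/code&gt; is just an illustration.&lt;/p&gt;

```python
import csv
import io

# The same text as data.csv above, kept inline so the sketch is self-contained.
CSV_TEXT = """Name,RegisterNumber,Subject,Marks
Hari,101,Math,95
Vignesh,102,Science,88
Priya,103,English,92
"""


def load_students(text):
    """Parse CSV text into a list of dicts, converting numeric columns to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # CSV carries no type information: every field arrives as a string.
        row["RegisterNumber"] = int(row["RegisterNumber"])
        row["Marks"] = int(row["Marks"])
        rows.append(row)
    return rows


students = load_students(CSV_TEXT)
average_marks = sum(s["Marks"] for s in students) / len(students)
print(students[0]["Name"], average_marks)
```

&lt;p&gt;The explicit &lt;code&gt;int()&lt;/code&gt; calls are the part to notice: with CSV, deciding which columns are numbers is left to the reader of the file.&lt;/p&gt;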






&lt;h2&gt;
  
  
  2️⃣ SQL (Relational Table Format)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;📘 Explanation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
SQL is a query language rather than a file format, but a &lt;code&gt;.sql&lt;/code&gt; script is a common way to store and share relational data as text. It uses &lt;code&gt;CREATE TABLE&lt;/code&gt; and &lt;code&gt;INSERT INTO&lt;/code&gt; statements to define and populate the table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧾 Example (data.sql):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;Students&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;Name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;RegisterNumber&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;Subject&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;Marks&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;Students&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Hari'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;101&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Math'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Vignesh'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;102&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Science'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;88&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Priya'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;103&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'English'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
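
&lt;p&gt;To try these statements without installing a database server, one option is Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module. This is a sketch under that assumption: it uses an in-memory SQLite database and adds an explicit column list to the insert, which the script above leaves out.&lt;/p&gt;

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students ("
    " Name VARCHAR(50), RegisterNumber INT, Subject VARCHAR(50), Marks INT)"
)
conn.executemany(
    "INSERT INTO Students (Name, RegisterNumber, Subject, Marks) VALUES (?, ?, ?, ?)",
    [
        ("Hari", 101, "Math", 95),
        ("Vignesh", 102, "Science", 88),
        ("Priya", 103, "English", 92),
    ],
)

# Query the table: highest mark first.
top = conn.execute(
    "SELECT Name, Marks FROM Students ORDER BY Marks DESC"
).fetchall()
print(top)
conn.close()
```

&lt;p&gt;Unlike CSV, the table definition records a type for each column, so &lt;code&gt;Marks&lt;/code&gt; comes back as a number and can be sorted numerically.&lt;/p&gt;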






&lt;h2&gt;
  
  
  3️⃣ JSON (JavaScript Object Notation)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;📘 Explanation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
JSON is a lightweight, human-readable data format used extensively in APIs and NoSQL databases. Data is stored as key-value pairs whose values can themselves be objects or arrays, which makes it flexible and hierarchical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧾 Example (data.json):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hari"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"RegisterNumber"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;101&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Subject"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Math"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Marks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Vignesh"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"RegisterNumber"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;102&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Subject"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Science"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Marks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;88&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Priya"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"RegisterNumber"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;103&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Subject"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"English"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Marks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
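
&lt;p&gt;A short sketch of loading this data with Python’s standard &lt;code&gt;json&lt;/code&gt; module follows. The nested &lt;code&gt;Contact&lt;/code&gt; object and its email address are hypothetical, added only to show the hierarchical side of JSON; they are not part of the example dataset.&lt;/p&gt;

```python
import json

JSON_TEXT = """
[
  {"Name": "Hari", "RegisterNumber": 101, "Subject": "Math", "Marks": 95},
  {"Name": "Vignesh", "RegisterNumber": 102, "Subject": "Science", "Marks": 88},
  {"Name": "Priya", "RegisterNumber": 103, "Subject": "English", "Marks": 92}
]
"""

students = json.loads(JSON_TEXT)

# Numbers keep their type: Marks is already an int, no conversion needed.
marks_by_name = {s["Name"]: s["Marks"] for s in students}

# Nesting is what makes JSON hierarchical. The "Contact" object here is a
# hypothetical addition for illustration, not part of the example dataset.
students[0]["Contact"] = {"email": "hari@example.com", "phones": []}
round_trip = json.loads(json.dumps(students))
print(marks_by_name, round_trip[0]["Contact"]["email"])
```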






&lt;h2&gt;
  
  
  4️⃣ Parquet (Columnar Storage Format)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;📘 Explanation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Parquet is an &lt;strong&gt;optimized columnar storage format&lt;/strong&gt; supported by big data tools such as Apache Spark, Hadoop, and BigQuery. Because values from the same column are stored together, it compresses data efficiently and allows faster analytical queries by reading only the required columns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧾 Example Representation (Conceptual):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Columnar Storage:

Name:           ["Hari", "Vignesh", "Priya"]
RegisterNumber: [101, 102, 103]
Subject:        ["Math", "Science", "English"]
Marks:          [95, 88, 92]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 Note: Parquet files are binary and not human-readable, but this is how they conceptually organize data for fast column-based access.&lt;/p&gt;
&lt;/blockquote&gt;
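
&lt;p&gt;The column-wise layout can be imitated in plain Python. This is only a conceptual sketch: it does not write a real Parquet file (that would need a library such as pyarrow), and the function name &lt;code&gt;to_columns&lt;/code&gt; is made up for the example.&lt;/p&gt;

```python
# Row-oriented records, as a CSV or JSON file would hold them.
rows = [
    {"Name": "Hari", "RegisterNumber": 101, "Subject": "Math", "Marks": 95},
    {"Name": "Vignesh", "RegisterNumber": 102, "Subject": "Science", "Marks": 88},
    {"Name": "Priya", "RegisterNumber": 103, "Subject": "English", "Marks": 92},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# An analytical query such as "average marks" touches a single column
# and never has to look at names or subjects.
average_marks = sum(columns["Marks"]) / len(columns["Marks"])
print(columns["Marks"], average_marks)
```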




&lt;h2&gt;
  
  
  5️⃣ XML (Extensible Markup Language)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;📘 Explanation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
XML stores data using custom tags, similar to HTML. It’s self-descriptive and widely used in configurations and web data interchange.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧾 Example (data.xml):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;Students&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;Student&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;Name&amp;gt;&lt;/span&gt;Hari&lt;span class="nt"&gt;&amp;lt;/Name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;RegisterNumber&amp;gt;&lt;/span&gt;101&lt;span class="nt"&gt;&amp;lt;/RegisterNumber&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;Subject&amp;gt;&lt;/span&gt;Math&lt;span class="nt"&gt;&amp;lt;/Subject&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;Marks&amp;gt;&lt;/span&gt;95&lt;span class="nt"&gt;&amp;lt;/Marks&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/Student&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;Student&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;Name&amp;gt;&lt;/span&gt;Vignesh&lt;span class="nt"&gt;&amp;lt;/Name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;RegisterNumber&amp;gt;&lt;/span&gt;102&lt;span class="nt"&gt;&amp;lt;/RegisterNumber&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;Subject&amp;gt;&lt;/span&gt;Science&lt;span class="nt"&gt;&amp;lt;/Subject&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;Marks&amp;gt;&lt;/span&gt;88&lt;span class="nt"&gt;&amp;lt;/Marks&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/Student&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;Student&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;Name&amp;gt;&lt;/span&gt;Priya&lt;span class="nt"&gt;&amp;lt;/Name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;RegisterNumber&amp;gt;&lt;/span&gt;103&lt;span class="nt"&gt;&amp;lt;/RegisterNumber&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;Subject&amp;gt;&lt;/span&gt;English&lt;span class="nt"&gt;&amp;lt;/Subject&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;Marks&amp;gt;&lt;/span&gt;92&lt;span class="nt"&gt;&amp;lt;/Marks&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/Student&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/Students&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
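
&lt;p&gt;Here is a small sketch using Python’s standard &lt;code&gt;xml.etree.ElementTree&lt;/code&gt; module. It builds the same Students/Student tree in code, serializes it, and parses it back, assuming the element names shown in the example.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

records = [
    ("Hari", 101, "Math", 95),
    ("Vignesh", 102, "Science", 88),
    ("Priya", 103, "English", 92),
]

# Build the same Students/Student tree as data.xml.
root = ET.Element("Students")
for name, register_number, subject, marks in records:
    student = ET.SubElement(root, "Student")
    ET.SubElement(student, "Name").text = name
    ET.SubElement(student, "RegisterNumber").text = str(register_number)
    ET.SubElement(student, "Subject").text = subject
    ET.SubElement(student, "Marks").text = str(marks)

xml_text = ET.tostring(root, encoding="unicode")

# Parse it back. Element text is always a string, so Marks needs int().
parsed = ET.fromstring(xml_text)
marks_by_name = {
    s.findtext("Name"): int(s.findtext("Marks")) for s in parsed.findall("Student")
}
print(marks_by_name)
```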






&lt;h2&gt;
  
  
  6️⃣ Avro (Row-based Storage Format)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;📘 Explanation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Avro is a &lt;strong&gt;binary row-based storage format&lt;/strong&gt; from the Apache Software Foundation. It’s efficient for serialization and supports schema evolution (readers and writers can use different versions of a schema), making it popular for data pipelines and streaming platforms like Kafka.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧾 Example Schema and Data (Conceptual):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"record"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Student"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RegisterNumber"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"int"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Subject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Marks"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"int"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Data (conceptually):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hari, 101, Math, 95
Vignesh, 102, Science, 88
Priya, 103, English, 92
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
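&lt;p&gt;💡 Avro data is stored in binary format. An Avro data file embeds the schema it was written with, so readers can always decode it; in streaming setups such as Kafka, the schema is often kept in a schema registry instead.&lt;/p&gt;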
&lt;/blockquote&gt;
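
&lt;p&gt;To make the role of the schema concrete, the sketch below checks records against it in plain Python. It is not an Avro encoder and does not use an Avro library; the &lt;code&gt;conforms&lt;/code&gt; helper and the type mapping are simplified assumptions that cover only the two primitive types used here.&lt;/p&gt;

```python
import json

SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "RegisterNumber", "type": "int"},
    {"name": "Subject", "type": "string"},
    {"name": "Marks", "type": "int"}
  ]
}
""")

# Only the two primitive types used by this schema are mapped here.
# A simplified check: it ignores unions, defaults, and Avro's 32-bit int range.
PYTHON_TYPES = {"string": str, "int": int}


def conforms(record, schema):
    """Return True if the record has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    return all(isinstance(record[f["name"]], PYTHON_TYPES[f["type"]]) for f in fields)


good = {"Name": "Hari", "RegisterNumber": 101, "Subject": "Math", "Marks": 95}
bad = {"Name": "Hari", "RegisterNumber": "101", "Subject": "Math", "Marks": 95}
print(conforms(good, SCHEMA), conforms(bad, SCHEMA))
```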




&lt;h2&gt;
  
  
  🔍 Summary Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Human Readable&lt;/th&gt;
&lt;th&gt;Best Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CSV&lt;/td&gt;
&lt;td&gt;Row-based&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;Simple data, spreadsheets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL&lt;/td&gt;
&lt;td&gt;Relational&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;Databases and queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;Semi-structured&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;APIs, NoSQL data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parquet&lt;/td&gt;
&lt;td&gt;Columnar&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;Big Data Analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XML&lt;/td&gt;
&lt;td&gt;Tagged&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;Data interchange, configs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avro&lt;/td&gt;
&lt;td&gt;Row-based (Binary)&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;Data pipelines, Kafka&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🚀 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Each format has its own strengths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;CSV&lt;/strong&gt; for small, simple datasets and &lt;strong&gt;SQL&lt;/strong&gt; when the data lives in a relational database.
&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;JSON&lt;/strong&gt; for flexible, nested data.
&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;Parquet&lt;/strong&gt; for &lt;strong&gt;big data analytics&lt;/strong&gt; and &lt;strong&gt;storage efficiency&lt;/strong&gt;, and &lt;strong&gt;Avro&lt;/strong&gt; for row-oriented data pipelines and streaming.
&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;XML&lt;/strong&gt; when data needs strong structure and tagging.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these formats helps you pick the right tool for your data analytics workflows! 💡&lt;/p&gt;




&lt;p&gt;✍️ &lt;em&gt;Written by Hari Venkatesh&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;#DataAnalytics #BigData #Cloud #DataFormats #Learning&lt;/p&gt;


</description>
      <category>analytics</category>
      <category>data</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Building a Simple Yelp-Style Dataset in MongoDB (Step-by-Step Guide)</title>
      <dc:creator>Hari Venkatesh</dc:creator>
      <pubDate>Mon, 25 Aug 2025 04:28:14 +0000</pubDate>
      <link>https://dev.to/hari_venkatesh_/building-a-simple-yelp-style-dataset-in-mongodb-step-by-step-guide-364</link>
      <guid>https://dev.to/hari_venkatesh_/building-a-simple-yelp-style-dataset-in-mongodb-step-by-step-guide-364</guid>
      <description>&lt;p&gt;When I first started exploring MongoDB, I wanted to try something practical instead of just going through documentation. So, I decided to build a mini Yelp-style dataset where businesses have reviews, and then run queries to analyze them.&lt;/p&gt;

&lt;p&gt;This blog walks you through how I created the dataset, inserted it into MongoDB, and ran useful queries like finding top-rated businesses and analyzing reviews. 🚀&lt;/p&gt;

&lt;p&gt;📌 Why This Project?&lt;/p&gt;

&lt;p&gt;Data is everywhere, and being able to store, organize, and query it effectively is an essential skill for developers. MongoDB’s flexible schema makes it perfect for datasets like business reviews, where each review might have slightly different details.&lt;/p&gt;

&lt;p&gt;📌 Deliverables&lt;/p&gt;

&lt;p&gt;Here’s what I aimed to achieve in this project:&lt;/p&gt;

&lt;p&gt;✅ Create a dataset of 25 businesses with reviews.&lt;br&gt;
✅ Insert data into MongoDB (using mongosh).&lt;br&gt;
✅ Run meaningful queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find top 5 businesses by rating&lt;/li&gt;
&lt;li&gt;Count reviews containing “good”&lt;/li&gt;
&lt;li&gt;Fetch reviews for a specific business&lt;/li&gt;
&lt;li&gt;Update and delete records&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ Export data if needed for further analysis&lt;/p&gt;

&lt;p&gt;📌 Step 1: Creating the Dataset&lt;/p&gt;

&lt;p&gt;I prepared a dataset of 25 businesses, each with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;business_id&lt;/li&gt;
&lt;li&gt;name&lt;/li&gt;
&lt;li&gt;rating&lt;/li&gt;
&lt;li&gt;review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s a small sample from the dataset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ "business_id": "B1", "name": "Cafe Mocha", "rating": 4, "review": "Good coffee and snacks" }
{ "business_id": "B2", "name": "Pizza House", "rating": 5, "review": "Excellent pizza, very good taste!" }
{ "business_id": "B3", "name": "Burger Point", "rating": 3, "review": "Average burger but good fries" }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;📌 Step 2: Setting Up MongoDB&lt;/p&gt;

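&lt;p&gt;I used MongoDB Compass for visualization and mongosh (MongoDB Shell) for running commands.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mongosh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Switch to a database (I called it yelp):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use yelp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;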

&lt;p&gt;📌 Step 3: Inserting Data&lt;/p&gt;

&lt;p&gt;Using insertMany(), I inserted all 25 records:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.reviews.insertMany([
  { "business_id": "B1", "name": "Cafe Mocha", "rating": 4, "review": "Good coffee and snacks" },
  { "business_id": "B2", "name": "Pizza House", "rating": 5, "review": "Excellent pizza, very good taste!" },
  ...
  { "business_id": "B25", "name": "Coffee Day", "rating": 3, "review": "Average coffee, good sitting place" }
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8se97gccv5eotehz9f9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8se97gccv5eotehz9f9.png" alt=" " width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📌 Step 4: Running Queries&lt;/p&gt;

&lt;p&gt;🔹 1. Top 5 Businesses by Rating&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.reviews.aggregate([
  { $group: { _id: "$business_id", name: { $first: "$name" }, avgRating: { $avg: "$rating" } } },
  { $sort: { avgRating: -1 } },
  { $limit: 5 }
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F538n26x336eaa0kjqarp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F538n26x336eaa0kjqarp.png" alt=" " width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👉 This gave me a leaderboard of the five highest-rated businesses. With one review per business in this dataset, each average is simply that business’s rating.&lt;/p&gt;
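
&lt;p&gt;To see what the three pipeline stages compute, here is a rough plain-Python equivalent. It does not talk to MongoDB. The second Cafe Mocha rating is a made-up value (an assumption for illustration) so that the averaging step has something to do, and &lt;code&gt;top_businesses&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
from collections import defaultdict

# Documents trimmed to the fields the pipeline uses. B1 gets a second,
# made-up rating so that averaging actually changes the result.
reviews = [
    {"business_id": "B1", "name": "Cafe Mocha", "rating": 4},
    {"business_id": "B1", "name": "Cafe Mocha", "rating": 5},
    {"business_id": "B2", "name": "Pizza House", "rating": 5},
    {"business_id": "B3", "name": "Burger Point", "rating": 3},
]


def top_businesses(docs, limit=5):
    """Group by business_id, average the ratings, sort descending, keep `limit`."""
    ratings = defaultdict(list)
    names = {}
    for doc in docs:
        ratings[doc["business_id"]].append(doc["rating"])
        names.setdefault(doc["business_id"], doc["name"])  # like $first
    grouped = [
        {"_id": bid, "name": names[bid], "avgRating": sum(r) / len(r)}
        for bid, r in ratings.items()
    ]
    grouped.sort(key=lambda g: g["avgRating"], reverse=True)  # like $sort: -1
    return grouped[:limit]  # like $limit


leaderboard = top_businesses(reviews)
print(leaderboard)
```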

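&lt;p&gt;🔹 2. Count Reviews Containing the Word “Good”&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.reviews.countDocuments({ review: /good/i })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;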

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptmtqgk7s9k0pth78yae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptmtqgk7s9k0pth78yae.png" alt=" " width="605" height="63"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👉 MongoDB’s regex search made it easy to count keyword matches, which is a rough first proxy for customer sentiment (a review saying “not good” would be counted too).&lt;/p&gt;
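
&lt;p&gt;The same case-insensitive match can be sketched with Python’s &lt;code&gt;re&lt;/code&gt; module. The last review in the list is hypothetical, added to show how a plain keyword count treats a negative comment.&lt;/p&gt;

```python
import re

reviews = [
    "Good coffee and snacks",
    "Excellent pizza, very good taste!",
    "Average burger but good fries",
    "Not good at all",  # hypothetical review, added to show a false positive
]

# Same idea as /good/i: a case-insensitive substring match.
pattern = re.compile("good", re.IGNORECASE)
matches = [text for text in reviews if pattern.search(text)]
print(len(matches))
```

&lt;p&gt;All four reviews match, including the negative one, so the count says how often a word appears rather than how customers feel.&lt;/p&gt;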

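&lt;p&gt;🔹 3. Get Reviews for a Specific Business&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.reviews.find({ business_id: "B2" })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;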

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdh4s6o1xsoh3vvqglps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdh4s6o1xsoh3vvqglps.png" alt=" " width="592" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👉 Perfect to pull all reviews for Pizza House. 🍕&lt;/p&gt;

&lt;p&gt;🔹 4. Update a Review&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.reviews.updateOne(
  { business_id: "B5" },
  { $set: { review: "Service improved, food is good now" } }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00wfwdjt96nzayktizt8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00wfwdjt96nzayktizt8.png" alt=" " width="754" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

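&lt;p&gt;🔹 5. Delete a Record&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.reviews.deleteOne({ business_id: "B25" })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;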

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0h79ucwqubd8w2yexm7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0h79ucwqubd8w2yexm7.png" alt=" " width="621" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📌 Step 5: Insights &amp;amp; Learnings&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MongoDB made it very easy to store flexible data (reviews don’t need a rigid schema).&lt;/li&gt;
&lt;li&gt;Regex queries gave a quick keyword count, a rough first step toward sentiment analysis.&lt;/li&gt;
&lt;li&gt;Aggregations are powerful for ranking and analytics.&lt;/li&gt;
&lt;li&gt;This small project gave me confidence to handle real-world datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📌 Next Steps&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add more fields like location, date, user_id.&lt;/li&gt;
&lt;li&gt;Perform sentiment analysis on reviews.&lt;/li&gt;
&lt;li&gt;Build a simple frontend to display results from MongoDB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚀 Final Thoughts&lt;/p&gt;

&lt;p&gt;This project may look small, but it’s a great stepping stone for anyone starting with databases. By simulating a real-world use case (like Yelp), I not only practiced MongoDB commands but also learned how to analyze data effectively.&lt;/p&gt;

&lt;p&gt;If you’re new to MongoDB, I highly recommend creating your own dataset and experimenting with queries. It’s one of the best ways to learn! 🙌&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
