<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Salma Aga Shaik</title>
    <description>The latest articles on DEV Community by Salma Aga Shaik (@salma_aga).</description>
    <link>https://dev.to/salma_aga</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3211368%2F9cfb9755-78f8-443d-9b01-43b5f8a5bc97.png</url>
      <title>DEV Community: Salma Aga Shaik</title>
      <link>https://dev.to/salma_aga</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/salma_aga"/>
    <language>en</language>
    <item>
      <title>I Thought High Current Always Meant a Fault Until I Came Across Transformer Inrush Current</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Thu, 11 Jun 2026 02:36:23 +0000</pubDate>
      <link>https://dev.to/salma_aga/-i-thought-high-current-always-meant-a-fault-until-i-came-across-transformer-inrush-current-2a66</link>
      <guid>https://dev.to/salma_aga/-i-thought-high-current-always-meant-a-fault-until-i-came-across-transformer-inrush-current-2a66</guid>
      <description>&lt;p&gt;For a long time, I had a simple rule in my mind: &lt;strong&gt;high current means fault&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If a transformer suddenly drew 5 times or 10 times its rated current, I would immediately think something was wrong. Maybe a short circuit, a protection issue, or some kind of system problem. Then I started learning about transformer energization and came across something interesting called &lt;strong&gt;Transformer Inrush Current&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What surprised me was that a transformer can draw a very large current when it is switched ON, even when there is no load connected and no fault in the system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhimmjg3vh8yw2dol4op8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhimmjg3vh8yw2dol4op8.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So why does this happen? The answer is inside the transformer core. When a transformer is switched OFF, a small amount of magnetism can remain in the core. This is called &lt;strong&gt;remanent flux&lt;/strong&gt; (also known as residual flux).&lt;/p&gt;

&lt;p&gt;When the transformer is switched ON again, the applied voltage drives new magnetic flux in the core. This flux adds to the remanent flux already present, and its peak also depends on the point on the voltage wave at which the breaker closes. If the total flux exceeds the saturation level, the transformer core becomes &lt;strong&gt;saturated&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Once the core saturates, its magnetizing inductance drops sharply, so the transformer draws a very large magnetizing current from the source. This temporary current is called &lt;strong&gt;inrush current&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;One thing I found interesting is that &lt;strong&gt;high current does not always mean a fault&lt;/strong&gt;. Inrush current is normal transformer behavior and can be &lt;strong&gt;2 to 10 times the full load current&lt;/strong&gt;. This is why transformer protection has to distinguish inrush from a real fault instead of tripping on current magnitude alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Large Can Inrush Current Be?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The magnitude of inrush current is often estimated as a multiple of the transformer full load current.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full Load Current&lt;/strong&gt; : IFL = S / (√3 × V)&lt;/p&gt;

&lt;p&gt;Where:&lt;br&gt;
S = Transformer Rating (MVA)&lt;br&gt;
V = Line Voltage (kV)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approximate Inrush Current&lt;/strong&gt; : Iinrush = K × IFL&lt;/p&gt;

&lt;p&gt;Where:&lt;br&gt;
K = 2 to 10&lt;br&gt;
IFL = Full Load Current&lt;/p&gt;

&lt;p&gt;For example, consider a 100 MVA, 220 kV transformer.&lt;/p&gt;

&lt;p&gt;IFL = 100 / (1.732 × 220)&lt;/p&gt;

&lt;p&gt;IFL = 0.262 kA&lt;/p&gt;

&lt;p&gt;If the transformer experiences a 10× inrush current:&lt;/p&gt;

&lt;p&gt;Iinrush = 10 × 0.262&lt;/p&gt;

&lt;p&gt;Iinrush = 2.62 kA&lt;/p&gt;

&lt;p&gt;This shows how a healthy transformer can temporarily draw several times its rated current during energization without any fault in the system.&lt;/p&gt;
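
&lt;p&gt;The same arithmetic can be written as a short Python sketch. The function names are my own, and K is only a rough multiplier from the 2 to 10 range above, not a value for any specific transformer.&lt;/p&gt;

```python
import math

def full_load_current_ka(s_mva, v_kv):
    """Full load line current in kA for a three-phase transformer (S in MVA, V in kV)."""
    return s_mva / (math.sqrt(3) * v_kv)

def inrush_estimate_ka(s_mva, v_kv, k):
    """Rough inrush estimate: K times the full load current."""
    return k * full_load_current_ka(s_mva, v_kv)

# The 100 MVA, 220 kV example from the text.
ifl = full_load_current_ka(100, 220)
print(f"IFL = {ifl:.3f} kA")
for k in (2, 5, 10):
    print(f"K = {k:2d}: Iinrush = {inrush_estimate_ka(100, 220, k):.2f} kA")
```

&lt;p&gt;With S in MVA and V in kV, the result comes out directly in kA, which is why no unit conversion appears in the code.&lt;/p&gt;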

&lt;p&gt;Imagine you are a differential relay. You suddenly see a very high current flowing through the transformer. How do you know if it is a fault or just transformer energization? The answer is &lt;strong&gt;second harmonic restraint&lt;/strong&gt;. Transformer inrush current contains a high second harmonic component, while internal fault current usually does not. Because of this, protection relays can tell the difference between inrush current and an actual fault.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High Current + High Second Harmonic&lt;/strong&gt; = Inrush Current → No Trip&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High Current + Low Second Harmonic&lt;/strong&gt; = Internal Fault → Trip&lt;/p&gt;
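
&lt;p&gt;Here is a minimal sketch of that decision in Python. It is not how any particular relay is implemented. The waveforms are synthetic, the 15% threshold is an assumed setting, and the names &lt;code&gt;harmonic_magnitude&lt;/code&gt; and &lt;code&gt;is_inrush&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import math

def harmonic_magnitude(samples, order):
    """Amplitude of one harmonic from exactly one cycle of samples (single-bin DFT)."""
    n = len(samples)
    re = sum(x * math.cos(2 * math.pi * order * i / n) for i, x in enumerate(samples))
    im = sum(x * math.sin(2 * math.pi * order * i / n) for i, x in enumerate(samples))
    return 2 * math.hypot(re, im) / n

def is_inrush(samples, threshold=0.15):
    """True when the 2nd harmonic is at least `threshold` of the fundamental (assumed setting)."""
    i1 = harmonic_magnitude(samples, 1)
    i2 = harmonic_magnitude(samples, 2)
    return i2 / i1 >= threshold

N = 64  # samples per cycle (assumption)
# Synthetic inrush: unipolar current pulses, which are rich in second harmonic.
inrush = [max(0.0, math.sin(2 * math.pi * i / N)) for i in range(N)]
# Synthetic internal fault: large but sinusoidal current.
fault = [10.0 * math.sin(2 * math.pi * i / N) for i in range(N)]

for name, wave in (("inrush", inrush), ("fault", fault)):
    print(name, "-> block trip" if is_inrush(wave) else "-> trip")
```

&lt;p&gt;The decision uses the ratio of second harmonic to fundamental, not the current magnitude, so the much larger fault current still trips while the smaller inrush current is restrained.&lt;/p&gt;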

&lt;p&gt;To understand this better, I created a simple transformer energization model in PSCAD using a three-phase source, transformer, circuit breaker, timed breaker logic, and saturation model. It was interesting to see the current spike immediately after the breaker closed. Seeing the waveform in PSCAD helped me understand the concept much better. I also varied the remanent flux, saturation settings, and air core reactance to see how they affect the magnitude of the inrush current.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftllzqapjyxu61kyszlej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftllzqapjyxu61kyszlej.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The biggest lesson for me was that not every high current event is a fault. Sometimes a transformer can draw several times its rated current and still be operating normally.&lt;/p&gt;

&lt;p&gt;That is one of the things I enjoy about power systems. Many things that look unusual at first actually have a good engineering explanation behind them.&lt;/p&gt;

&lt;h1&gt;
  
  
  How AI Can Help
&lt;/h1&gt;

&lt;p&gt;While learning about transformer inrush current, I started wondering how AI could help with this problem. Protection relays record voltage and current waveforms whenever a transformer is energized. Using FFT and machine learning models such as Random Forest, SVM, or LSTM, engineers can analyze harmonic patterns and distinguish between inrush current and actual fault conditions.&lt;/p&gt;

&lt;p&gt;AI can also help detect unusual transformer behavior, monitor equipment health, and support predictive maintenance. As power systems become more digital, it will be interesting to see how AI can be used alongside traditional protection methods for monitoring and diagnostics.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>powersystem</category>
      <category>electricalengineering</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>I Thought Harmonics Were a Grid Problem, Then I Realized They Were Everywhere</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 07 Jun 2026 18:32:30 +0000</pubDate>
      <link>https://dev.to/salma_aga/i-thought-harmonics-were-a-grid-problem-then-i-realized-they-were-everywhere-4emf</link>
      <guid>https://dev.to/salma_aga/i-thought-harmonics-were-a-grid-problem-then-i-realized-they-were-everywhere-4emf</guid>
      <description>&lt;p&gt;Whenever I heard about harmonics, I thought they were only related to large substations, transmission systems, and industrial facilities. I assumed harmonics were something utility engineers dealt with and not something connected to everyday devices.&lt;/p&gt;

&lt;p&gt;Then I realized they are everywhere. Phone chargers can create harmonics. Laptop chargers can create harmonics. LED lights can create harmonics. Even a UPS sitting under a desk can create harmonics.&lt;/p&gt;

&lt;p&gt;Modern power systems contain many power electronic devices and loads, such as EV chargers, solar inverters, battery energy storage systems (BESS), UPS systems, data center power supplies, and Variable Frequency Drives (VFDs). While these technologies bring many benefits, they can also introduce harmonic distortion.&lt;/p&gt;

&lt;p&gt;The more power electronic devices we connect to the grid, the more important harmonic analysis becomes.&lt;/p&gt;

&lt;p&gt;In this article, I will explain what harmonics are, what causes them, how they affect power quality, how they can be analyzed using PSCAD, and why they are becoming more important in modern power systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Felxg6nf0ldn5lrb5knys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Felxg6nf0ldn5lrb5knys.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Before we talk about harmonics, let's first understand electrical loads, because this is where harmonics usually begin.&lt;/p&gt;

&lt;h1&gt;
  
  
  What Is an Electrical Load?
&lt;/h1&gt;

&lt;p&gt;An electrical load is any device that uses electrical energy to perform useful work.&lt;/p&gt;

&lt;p&gt;For example, think about a typical evening at home. You turn on a ceiling fan, LED light, laptop, air conditioner, and phone charger. All of these devices use electricity, so they are called electrical loads. Examples of electrical loads include motors, heaters, fans, computers, air conditioners, lighting systems, and EV chargers.&lt;/p&gt;

&lt;p&gt;However, not all electrical loads use electricity in the same way. Some draw current smoothly, while others draw current in short pulses.&lt;/p&gt;

&lt;p&gt;This small difference is actually where the story of harmonics begins.&lt;/p&gt;




&lt;h1&gt;
  
  
  Linear vs Non-Linear Loads
&lt;/h1&gt;

&lt;p&gt;To understand harmonics, we first need to understand the difference between linear and non-linear loads.&lt;/p&gt;

&lt;p&gt;Although both types of loads consume electricity, they draw current from the power system in different ways. This difference has a direct impact on power quality and harmonic generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Linear Loads
&lt;/h2&gt;

&lt;p&gt;Linear loads draw current smoothly from the power system. The current waveform follows the voltage waveform and remains close to a pure sine wave.&lt;/p&gt;

&lt;p&gt;Examples: Electric heaters, toasters, electric stoves, and incandescent lamps. &lt;/p&gt;

&lt;p&gt;These loads have a nearly constant impedance. As voltage increases, current increases proportionally according to Ohm's Law: V = I × R&lt;/p&gt;

&lt;p&gt;Because the relationship between voltage and current remains proportional, the current waveform stays sinusoidal. As a result, linear loads generally do not create significant harmonics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Non-Linear Loads
&lt;/h2&gt;

&lt;p&gt;Non-linear loads draw current in short pulses instead of smoothly. Because of this, the current waveform becomes distorted and harmonics are generated.&lt;/p&gt;

&lt;p&gt;Examples: Mobile chargers, laptop chargers, LED lights, UPS systems, solar inverters, EV chargers, and Variable Frequency Drives (VFDs).&lt;/p&gt;

&lt;p&gt;These devices contain power electronic components such as rectifiers, diodes, transistors, and switching circuits. Instead of drawing current continuously, they draw current only during certain portions of the voltage waveform.&lt;/p&gt;

&lt;p&gt;Because the current no longer follows the voltage proportionally, additional frequencies are introduced into the system. These frequencies are known as harmonics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwskjevb082z8eaeeciv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwskjevb082z8eaeeciv.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;The difference between linear and non-linear loads may seem small, but it is actually the main reason harmonics exist in modern power systems.&lt;/p&gt;

&lt;h1&gt;
  
  
  What Are Harmonics?
&lt;/h1&gt;

&lt;p&gt;Electricity is supplied at a fundamental frequency of 50 Hz in many countries and 60 Hz in others, including the United States. Ideally, voltage and current are smooth sine waves at this frequency.&lt;br&gt;
However, when non-linear loads draw current in short pulses, they introduce additional frequencies into the system. These frequencies are known as harmonics.&lt;/p&gt;

&lt;p&gt;Harmonics are unwanted voltage or current components whose frequencies are integer multiples of the fundamental frequency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Harmonic Frequency Formula
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;fn = n × f1&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fn = Harmonic Frequency&lt;/li&gt;
&lt;li&gt;n = Harmonic Order&lt;/li&gt;
&lt;li&gt;f1 = Fundamental Frequency&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example for a 50 Hz System
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Harmonic Order&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1st&lt;/td&gt;
&lt;td&gt;50 Hz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3rd&lt;/td&gt;
&lt;td&gt;150 Hz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5th&lt;/td&gt;
&lt;td&gt;250 Hz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7th&lt;/td&gt;
&lt;td&gt;350 Hz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11th&lt;/td&gt;
&lt;td&gt;550 Hz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13th&lt;/td&gt;
&lt;td&gt;650 Hz&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
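
&lt;p&gt;The table follows directly from fn = n × f1. As a small sketch, the same rows can be generated for both 50 Hz and 60 Hz systems. The function name is my own.&lt;/p&gt;

```python
def harmonic_frequency(order, fundamental_hz):
    """fn = n x f1"""
    return order * fundamental_hz

for f1 in (50, 60):
    row = ", ".join(f"{n}: {harmonic_frequency(n, f1)} Hz" for n in (1, 3, 5, 7, 11, 13))
    print(f"{f1} Hz system -> {row}")
```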

&lt;p&gt;Think of harmonics like unwanted noise added to your favorite song. The song is still playing, but its quality is reduced. Similarly, harmonics distort the original electrical waveform.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdq7kwljpqp7tyovbvb6.png" alt=" " width="800" height="533"&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Why Are Harmonics Important?
&lt;/h1&gt;

&lt;p&gt;As more power electronic devices are connected to the grid, harmonic levels continue to increase. When harmonic levels become too high, they can create several problems in a power system. Transformers can overheat, cables can run hotter than normal, capacitor banks can fail, system losses can increase, and equipment life can be reduced. Harmonics can also affect power quality and may cause sensitive equipment to operate incorrectly or malfunction. This is why understanding and controlling harmonics is becoming increasingly important in modern power systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foebnd7t74t4n2wegraa4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foebnd7t74t4n2wegraa4.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  My Harmonic Analysis Study in PSCAD
&lt;/h1&gt;

&lt;p&gt;To better understand harmonics, I created a simple harmonic analysis model in PSCAD. The model included a Harmonic Current Injection Source, Three-Phase Voltage Source, Point of Common Coupling (PCC), Multimeter, FFT Analysis Block, and THD Calculator. The objective was simple: inject harmonic currents into the system and observe how they affect the voltage and current waveforms. This helped me analyze harmonic frequencies, study waveform distortion, and understand their impact on power quality.&lt;/p&gt;

&lt;h1&gt;
  
  
  Using FFT to Identify Harmonics
&lt;/h1&gt;

&lt;p&gt;After running the simulation, the next step was to identify which harmonic frequencies were present in the waveform. For this, I used FFT (Fast Fourier Transform). FFT converts a waveform from the time domain into the frequency domain and shows the frequency components present in the signal. Using FFT, engineers can identify the fundamental frequency along with the 3rd, 5th, 7th, and higher-order harmonics. This makes it easier to understand which frequencies are contributing to waveform distortion.&lt;/p&gt;
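
&lt;p&gt;To make the idea concrete, here is a small Python sketch. It is not the PSCAD FFT block. It uses a plain DFT on one cycle of a synthetic current, which gives the same amplitudes an FFT would for these bins. The waveform (a fundamental plus 20% 5th and 10% 7th harmonic) and the sample count are assumptions chosen for illustration.&lt;/p&gt;

```python
import math

F1 = 50.0   # fundamental frequency in Hz
N = 256     # samples per fundamental cycle (assumption)

def synthetic_current(i):
    """Hypothetical distorted current: fundamental plus 5th and 7th harmonics."""
    t = 2 * math.pi * i / N
    return 1.0 * math.sin(t) + 0.20 * math.sin(5 * t) + 0.10 * math.sin(7 * t)

samples = [synthetic_current(i) for i in range(N)]

def dft_magnitude(samples, k):
    """Amplitude of harmonic order k for exactly one cycle of samples."""
    n = len(samples)
    re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(samples))
    im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(samples))
    return 2 * math.hypot(re, im) / n

# Time domain in, frequency domain out: amplitude per harmonic order.
spectrum = {k: dft_magnitude(samples, k) for k in range(1, 14)}
present = {k: round(a, 3) for k, a in spectrum.items() if a > 0.01}
for k, a in present.items():
    print(f"order {k:2d} ({k * F1:.0f} Hz): amplitude {a}")
```

&lt;p&gt;Only the orders that were put into the waveform show up in &lt;code&gt;present&lt;/code&gt;. That is the property that makes frequency-domain analysis useful for finding which harmonics cause the distortion.&lt;/p&gt;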

&lt;h1&gt;
  
  
  Measuring Distortion Using THD
&lt;/h1&gt;

&lt;p&gt;Identifying harmonic frequencies is important, but it is also useful to know the overall level of distortion in the waveform. For this, engineers use THD (Total Harmonic Distortion). THD provides a single value that represents the amount of distortion caused by harmonics compared to the fundamental frequency. A lower THD value indicates better power quality, while a higher THD value indicates greater waveform distortion.&lt;/p&gt;

&lt;h2&gt;
  
  
  THD Formula
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;THD (%) = √(I₂² + I₃² + I₄² + ... + Iₙ²) / I₁ × 100&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I₁ = Fundamental Current&lt;/li&gt;
&lt;li&gt;I₂, I₃, I₄... = Harmonic Currents&lt;/li&gt;
&lt;/ul&gt;
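
&lt;p&gt;As a quick check of the formula, here is a minimal Python sketch. The current values are made up so that the arithmetic is easy to follow by hand.&lt;/p&gt;

```python
import math

def thd_percent(fundamental, harmonics):
    """THD (%) = sqrt(sum of squared harmonic magnitudes) / fundamental x 100."""
    return math.sqrt(sum(h * h for h in harmonics)) / fundamental * 100

# Hypothetical RMS currents in amperes: I1, then three harmonic components.
i1 = 100.0
harmonics = [4.0, 3.0, 0.0]
print(f"THD = {thd_percent(i1, harmonics):.1f} %")
```

&lt;p&gt;The harmonics add as a root sum of squares, so 4 A and 3 A together count as 5 A of distortion, not 7 A.&lt;/p&gt;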

&lt;h2&gt;
  
  
  Typical THD Levels
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;THD&lt;/th&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Less than 2%&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Less than 5%&lt;/td&gt;
&lt;td&gt;Acceptable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Greater than 8%&lt;/td&gt;
&lt;td&gt;Investigation Required&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

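&lt;p&gt;According to IEEE 519, the recommended voltage THD limit at the PCC depends on the bus voltage: 8% for systems at 1 kV and below, 5% for systems above 1 kV up to 69 kV, and lower values at higher voltages.&lt;/p&gt;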

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ipe6dg0p3vg049qldoj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ipe6dg0p3vg049qldoj.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  How Engineers Reduce Harmonics
&lt;/h1&gt;

&lt;p&gt;Once harmonics are identified, the next step is reducing their impact on the power system. Engineers use different methods depending on the type of load and the level of harmonic distortion.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Passive Harmonic Filters&lt;/td&gt;
&lt;td&gt;Remove specific harmonic frequencies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active Harmonic Filters&lt;/td&gt;
&lt;td&gt;Cancel harmonic currents in real time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;K-Rated Transformers&lt;/td&gt;
&lt;td&gt;Handle harmonic currents safely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Line Reactors&lt;/td&gt;
&lt;td&gt;Reduce harmonic distortion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12-Pulse / 18-Pulse Drives&lt;/td&gt;
&lt;td&gt;Reduce lower-order harmonics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;There is no single solution that works for every system. The best method depends on the load, harmonic levels, and system requirements.&lt;/p&gt;

&lt;h1&gt;
  
  
  How AI Can Help Monitor Harmonics
&lt;/h1&gt;

&lt;p&gt;Traditional harmonic studies help engineers understand how a power system behaves at a particular time. However, real power systems keep changing. Loads switch on and off, EV chargers connect to the grid, and solar power changes throughout the day. Because of this, harmonic levels can also change.&lt;/p&gt;

&lt;p&gt;This is where AI and Machine Learning can help.&lt;/p&gt;

&lt;p&gt;Modern substations collect large amounts of voltage, current, and THD data. AI can analyze this data in real time and identify unusual harmonic patterns. Some Machine Learning techniques include LSTM, Random Forest, SVM, and Anomaly Detection. These techniques can help with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time harmonic monitoring&lt;/li&gt;
&lt;li&gt;Harmonic source identification&lt;/li&gt;
&lt;li&gt;Predictive maintenance&lt;/li&gt;
&lt;li&gt;THD trend prediction&lt;/li&gt;
&lt;li&gt;Smart grid monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As power systems become more digital and use more power electronic devices, AI and Machine Learning can help engineers monitor harmonics more effectively.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdrz7n9ibrok698us0dx.png" alt=" " width="800" height="533"&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Key Takeaways
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Non-linear loads are the primary source of harmonics.&lt;/li&gt;
&lt;li&gt;Everyday devices such as phone chargers, laptop chargers, LED lights, UPS systems, and EV chargers can generate harmonics.&lt;/li&gt;
&lt;li&gt;Harmonics are integer multiples of the fundamental frequency.&lt;/li&gt;
&lt;li&gt;FFT helps identify harmonic frequencies.&lt;/li&gt;
&lt;li&gt;THD measures waveform distortion.&lt;/li&gt;
&lt;li&gt;IEEE 519 is one of the most widely used harmonic standards.&lt;/li&gt;
&lt;li&gt;PSCAD is a powerful tool for harmonic analysis.&lt;/li&gt;
&lt;li&gt;AI and Machine Learning may play an important role in future harmonic monitoring systems.&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;Harmonics are not just a utility or industrial problem. Many devices we use every day can generate harmonics, making power quality an increasingly important topic for modern power system engineers.&lt;/p&gt;

&lt;p&gt;Have you worked on harmonic analysis using PSCAD, ETAP, PowerFactory, or MATLAB? I'd love to hear about your experience in the comments.&lt;/p&gt;

</description>
      <category>powersystem</category>
      <category>electricalengineering</category>
      <category>powerengineering</category>
      <category>ai</category>
    </item>
    <item>
      <title>Beyond PSCAD: How AI Can Help Monitor Harmonics</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 07 Jun 2026 15:59:47 +0000</pubDate>
      <link>https://dev.to/salma_aga/beyond-pscad-how-ai-can-help-monitor-harmonics-3p0f</link>
      <guid>https://dev.to/salma_aga/beyond-pscad-how-ai-can-help-monitor-harmonics-3p0f</guid>
      <description>&lt;p&gt;While learning harmonic analysis in PSCAD, one question came to my mind:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens after the harmonic study is completed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A PSCAD simulation helps us understand harmonic behavior under specific operating conditions. However, real power systems do not remain the same.&lt;/p&gt;

&lt;p&gt;Loads switch on and off, EV chargers connect to the grid, solar generation changes throughout the day, and new equipment is added over time. Because of these changes, harmonic levels can also change continuously.&lt;/p&gt;

&lt;p&gt;This is where Artificial Intelligence (AI) and Machine Learning (ML) can help.&lt;/p&gt;

&lt;p&gt;Today, substations and power quality monitoring devices continuously collect large amounts of voltage, current, and THD data. AI can analyze this data in real time and help engineers identify harmonic issues sooner than periodic manual studies would.&lt;/p&gt;

&lt;p&gt;For example, if THD levels at a substation gradually increase over time, an AI-based monitoring system can alert engineers before the issue becomes serious and causes equipment overheating, reduced equipment life, or power quality problems.&lt;/p&gt;
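
&lt;p&gt;The early-warning idea can be sketched without any machine learning. The example below fits a plain least-squares line to recent THD readings and projects when a limit would be reached. It stands in for the AI model, it is not one. The readings, the 5% limit, and the function names are all hypothetical.&lt;/p&gt;

```python
def trend_slope(values):
    """Least-squares slope of evenly spaced readings (units per sample)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def samples_until_limit(values, limit):
    """Projected number of samples until the limit is reached, or None if not rising."""
    slope = trend_slope(values)
    if not slope > 0:
        return None
    return max(0.0, (limit - values[-1]) / slope)

# Hypothetical daily voltage THD readings in percent.
thd_history = [3.0, 3.1, 3.2, 3.3, 3.4, 3.5]
print(samples_until_limit(thd_history, limit=5.0))
```

&lt;p&gt;A real monitoring system would have to cope with noise, daily load cycles, and sudden changes, which is where the ML techniques listed below may do better than a straight line.&lt;/p&gt;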

&lt;p&gt;Some potential applications of AI and ML in harmonic monitoring include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time harmonic detection&lt;/li&gt;
&lt;li&gt;Harmonic source identification&lt;/li&gt;
&lt;li&gt;Predictive maintenance of transformers and cables&lt;/li&gt;
&lt;li&gt;Early warning of power quality issues&lt;/li&gt;
&lt;li&gt;Smart filter optimization&lt;/li&gt;
&lt;li&gt;Smart grid power quality monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Several Machine Learning techniques are being explored for these applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LSTM (Long Short-Term Memory):&lt;/strong&gt; Can forecast future THD levels from historical measurements and flag abnormal harmonic trends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Random Forest:&lt;/strong&gt; Can help identify possible sources of harmonics such as EV chargers, VFDs, and solar inverters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support Vector Machine (SVM):&lt;/strong&gt; Can classify different types of harmonic disturbances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anomaly Detection Models:&lt;/strong&gt; Can detect unusual harmonic behavior before equipment failures occur.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While PSCAD remains a powerful tool for studying harmonics through simulation, AI and ML have the potential to improve real-time monitoring and decision-making in modern power systems.&lt;/p&gt;

&lt;p&gt;As power systems become more digital and increasingly dependent on power electronics, the combination of PSCAD studies, harmonic analysis, Artificial Intelligence, and Machine Learning will play an important role in maintaining reliable and efficient electrical grids.&lt;/p&gt;

&lt;p&gt;In a future article, I plan to explore how LSTM networks and anomaly detection techniques can be applied to real-time harmonic monitoring and predictive maintenance in modern power systems.&lt;/p&gt;

</description>
      <category>powersystem</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>harmonic</category>
    </item>
    <item>
      <title>Understand Hadoop and Apache Spark</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:41:49 +0000</pubDate>
      <link>https://dev.to/salma_aga/understand-hadoop-and-apache-spark-f74</link>
      <guid>https://dev.to/salma_aga/understand-hadoop-and-apache-spark-f74</guid>
      <description>&lt;p&gt;Imagine a company that runs a very popular online platform. Every day, millions of users visit the website, make purchases, click on products, and generate application logs. All these activities produce &lt;strong&gt;a very large amount of data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After some time, the company collects &lt;strong&gt;terabytes of data&lt;/strong&gt;. This data includes customer transactions, website clicks, machine logs, and system events.&lt;/p&gt;

&lt;p&gt;Now the company wants to &lt;strong&gt;analyze this data&lt;/strong&gt; to answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which products are selling the most?&lt;/li&gt;
&lt;li&gt;What time do customers visit the website?&lt;/li&gt;
&lt;li&gt;Are there any system errors?&lt;/li&gt;
&lt;li&gt;How can the company improve its services?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At first, the company tries to process the data using &lt;strong&gt;one computer&lt;/strong&gt;, but the data is too large. The computer becomes slow and cannot process the data efficiently.&lt;/p&gt;

&lt;p&gt;To solve this problem, the company decides to use a &lt;strong&gt;distributed system&lt;/strong&gt;, where many machines work together to store and process the data.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;Hadoop&lt;/strong&gt; and &lt;strong&gt;Apache Spark&lt;/strong&gt; come into the picture.&lt;/p&gt;




&lt;h1&gt;
  
  
  Hadoop: Storing and Processing Large Data
&lt;/h1&gt;

&lt;p&gt;The company first starts using &lt;strong&gt;Hadoop&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Hadoop is a &lt;strong&gt;big data framework&lt;/strong&gt; that helps companies &lt;strong&gt;store and process large datasets using multiple machines&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;One important part of Hadoop is &lt;strong&gt;HDFS (Hadoop Distributed File System)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of storing a large file on one machine, Hadoop &lt;strong&gt;splits the file into smaller blocks&lt;/strong&gt; and stores those blocks across many machines in the cluster. This allows the system to &lt;strong&gt;store huge amounts of data reliably&lt;/strong&gt;.&lt;/p&gt;
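
&lt;p&gt;As a rough picture of this, the toy Python sketch below splits a file into blocks and spreads copies across nodes. It does not use any Hadoop API. The 128 MB block size and replication factor of 3 are the usual HDFS defaults, while the round-robin placement and node names are simplifications of my own (real HDFS placement is rack-aware).&lt;/p&gt;

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Sizes of the blocks a file of the given size would be split into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])

def place_blocks(num_blocks, nodes, replication=3):
    """Toy round-robin placement: each block is copied to `replication` nodes."""
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

blocks = split_into_blocks(1000)   # a hypothetical 1000 MB log file
placement = place_blocks(len(blocks), ["node1", "node2", "node3", "node4"])
print(blocks)
print(placement[0])
```

&lt;p&gt;Because every block lives on more than one machine, losing a single node does not lose data. That is what makes the storage reliable.&lt;/p&gt;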

&lt;p&gt;Hadoop also uses a processing model called &lt;strong&gt;MapReduce&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;MapReduce processes the data in two main phases, map and reduce, that run across the cluster. However, it &lt;strong&gt;writes intermediate data to disk between stages&lt;/strong&gt;, which makes multi-stage processing slower.&lt;/p&gt;
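
&lt;p&gt;The MapReduce idea itself is simple enough to sketch in plain Python. This is only a single-machine illustration of the map, shuffle, and reduce steps, not Hadoop code, and the log lines are made up.&lt;/p&gt;

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group values by key (Hadoop does this across disk and network)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

logs = ["error timeout", "login ok", "error disk full"]
counts = reduce_phase(shuffle(map_phase(logs)))
print(counts["error"])
```

&lt;p&gt;In a real cluster, each phase runs on many machines at once and the grouped data is written out between phases.&lt;/p&gt;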

&lt;p&gt;Hadoop works well for &lt;strong&gt;batch processing&lt;/strong&gt;, where large data is processed in stages.&lt;/p&gt;




&lt;h1&gt;
  
  
  Spark: Faster Data Processing
&lt;/h1&gt;

&lt;p&gt;Later, the company learns about &lt;strong&gt;Apache Spark&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Spark is a &lt;strong&gt;fast distributed data processing engine&lt;/strong&gt; designed to process large datasets quickly.&lt;/p&gt;

&lt;p&gt;Like Hadoop, Spark also processes data across &lt;strong&gt;multiple machines in a cluster&lt;/strong&gt;. However, Spark has a major advantage.&lt;/p&gt;

&lt;p&gt;Spark performs &lt;strong&gt;in-memory computation&lt;/strong&gt;, which means it keeps intermediate data in &lt;strong&gt;memory (RAM)&lt;/strong&gt; where possible instead of repeatedly writing it to disk.&lt;/p&gt;

&lt;p&gt;Because memory is much faster than disk, Spark can process data &lt;strong&gt;much faster than Hadoop MapReduce&lt;/strong&gt;, especially for jobs with many stages or repeated passes over the same data.&lt;/p&gt;




&lt;h1&gt;
  
  
  How Spark Works
&lt;/h1&gt;

&lt;p&gt;In a Spark system, many machines work together.&lt;/p&gt;

&lt;p&gt;At the center of the system is the &lt;strong&gt;Driver Program&lt;/strong&gt;. The driver acts like the &lt;strong&gt;manager&lt;/strong&gt; of the Spark application. It starts the job, creates the execution plan, and manages the processing.&lt;/p&gt;

&lt;p&gt;The actual data processing happens in &lt;strong&gt;Executors&lt;/strong&gt;. Executors run on worker machines in the cluster and perform the real computation.&lt;/p&gt;

&lt;p&gt;When a Spark job starts, the driver creates a plan called a &lt;strong&gt;DAG (Directed Acyclic Graph)&lt;/strong&gt;. This plan shows how the data will be processed step by step.&lt;/p&gt;

&lt;p&gt;Spark then divides the job into smaller tasks and sends those tasks to executors. The executors process the data in parallel and return the results to the driver.&lt;/p&gt;
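
&lt;p&gt;The driver and executor roles can be imitated on one machine with Python's standard library. This is only an analogy and not the Spark API. A thread pool plays the executors, and the partitioning scheme and function names are my own.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def make_partitions(data, num_partitions):
    """Driver side: split the data into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]

def task(partition):
    """Executor side: the work done on one partition."""
    return sum(x * x for x in partition)

data = list(range(1, 101))
partitions = make_partitions(data, 4)

# The pool stands in for executors running tasks in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(task, partitions))

# Driver side again: combine the partial results.
total = sum(partial_results)
print(total)
```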




&lt;h1&gt;
  
  
  Transformations and Actions in Spark
&lt;/h1&gt;

&lt;p&gt;Spark operations are divided into two types.&lt;/p&gt;

&lt;p&gt;The first type is &lt;strong&gt;Transformations&lt;/strong&gt;. These operations describe a new dataset derived from an existing one; they do not change the original data and do not execute immediately. Examples include filtering rows or selecting columns.&lt;/p&gt;

&lt;p&gt;The second type is &lt;strong&gt;Actions&lt;/strong&gt;. Actions trigger the actual execution of the Spark job. Examples include counting records or saving results.&lt;/p&gt;

&lt;p&gt;Spark waits until an action is called before executing the full computation. This concept is called &lt;strong&gt;lazy evaluation&lt;/strong&gt;. It helps performance because Spark can look at the whole plan and optimize it before running anything.&lt;/p&gt;
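
&lt;p&gt;To make the difference between transformations and actions concrete, here is a toy sketch in plain Python. It is not the Spark API: &lt;code&gt;LazyDataset&lt;/code&gt; is a hypothetical class written for this example, and the order records are made up. Transformations only record a step in a plan, and nothing runs until an action such as &lt;code&gt;collect&lt;/code&gt; or &lt;code&gt;count&lt;/code&gt; is called.&lt;/p&gt;

```python
class LazyDataset:
    """Toy dataset that records transformations and runs them only on an action."""

    def __init__(self, rows, plan=None):
        self._rows = rows
        self._plan = plan or []  # recorded steps; nothing is executed yet

    # Transformations: return a new LazyDataset with a longer plan.
    def filter(self, predicate):
        return LazyDataset(self._rows, self._plan + [("filter", predicate)])

    def map(self, func):
        return LazyDataset(self._rows, self._plan + [("map", func)])

    def _execute(self):
        """Run every recorded step, row by row."""
        for row in self._rows:
            keep = True
            for kind, func in self._plan:
                if kind == "filter":
                    if not func(row):
                        keep = False
                        break
                else:
                    row = func(row)
            if keep:
                yield row

    # Actions: trigger execution of the whole plan.
    def collect(self):
        return list(self._execute())

    def count(self):
        return sum(1 for _ in self._execute())


calls = []  # records every time the filter function actually runs


def is_large(order):
    calls.append(order["id"])
    return order["amount"] >= 100


orders = LazyDataset([
    {"id": 1, "country": "IN", "amount": 120},
    {"id": 2, "country": "US", "amount": 40},
    {"id": 3, "country": "IN", "amount": 300},
])

# Transformations only build the plan; is_large has not run yet.
large_countries = orders.filter(is_large).map(lambda o: o["country"])
ran_before_action = len(calls)

# The action triggers execution of the whole plan.
result = large_countries.collect()
print(ran_before_action, result)  # 0 ['IN', 'IN']
```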




&lt;h1&gt;
  
  
  Where Spark Is Used
&lt;/h1&gt;

&lt;p&gt;Spark is widely used in &lt;strong&gt;data engineering and analytics pipelines&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;Data Sources&lt;br&gt;
→ Streaming systems or APIs&lt;br&gt;
→ Spark processing&lt;br&gt;
→ Data lake (Amazon S3 or HDFS)&lt;br&gt;
→ Data warehouse (Redshift or Snowflake)&lt;br&gt;
→ BI tools like Power BI or Tableau&lt;/p&gt;

&lt;p&gt;Spark processes and transforms the data so that companies can analyze it and generate insights.&lt;/p&gt;




&lt;h1&gt;
  
  
  Difference Between Hadoop and Spark
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Hadoop&lt;/th&gt;
&lt;th&gt;Spark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What it is&lt;/td&gt;
&lt;td&gt;Hadoop is a big data framework used to store and process large data.&lt;/td&gt;
&lt;td&gt;Spark is a fast data processing engine used to process large data quickly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;How it processes data&lt;/td&gt;
&lt;td&gt;Hadoop processes data using MapReduce and writes data to disk many times.&lt;/td&gt;
&lt;td&gt;Spark processes data mostly in memory (RAM).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Hadoop is slower because it reads and writes data to disk frequently.&lt;/td&gt;
&lt;td&gt;Spark is faster because it processes data in memory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Main use&lt;/td&gt;
&lt;td&gt;Hadoop is mainly used for storing large data and batch processing.&lt;/td&gt;
&lt;td&gt;Spark is used for fast data processing and analytics.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type of processing&lt;/td&gt;
&lt;td&gt;Hadoop mostly supports batch processing.&lt;/td&gt;
&lt;td&gt;Spark supports batch processing, streaming, machine learning, and SQL.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ease of coding&lt;/td&gt;
&lt;td&gt;Hadoop MapReduce requires more code and is harder to write.&lt;/td&gt;
&lt;td&gt;Spark is easier to use because it offers high-level APIs in Python, Scala, Java, and SQL.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Where it is used&lt;/td&gt;
&lt;td&gt;Hadoop is often used for distributed storage using HDFS.&lt;/td&gt;
&lt;td&gt;Spark is used for ETL pipelines, real-time analytics, and big data processing.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Hadoop and Spark are both technologies used to process very large datasets using multiple machines&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Hadoop is mainly used for &lt;strong&gt;distributed storage and batch processing&lt;/strong&gt;, while Spark is designed for &lt;strong&gt;fast data processing using in-memory computation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Today, many companies use Spark with cloud platforms such as &lt;strong&gt;Amazon EMR, AWS Glue, and Databricks&lt;/strong&gt; to build modern data engineering and analytics systems.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Modern Data Engineering Architecture Across AWS, GCP, and Azure</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 15 Mar 2026 16:58:47 +0000</pubDate>
      <link>https://dev.to/salma_aga/modern-data-engineering-architecture-across-aws-gcp-and-azure-14o3</link>
      <guid>https://dev.to/salma_aga/modern-data-engineering-architecture-across-aws-gcp-and-azure-14o3</guid>
      <description>&lt;p&gt;In modern data platforms, organizations build &lt;strong&gt;end-to-end data pipelines&lt;/strong&gt; to &lt;strong&gt;collect, process, store, and analyze large volumes of data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Although different cloud providers offer different services, the &lt;strong&gt;core architecture pattern remains the same&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A typical &lt;strong&gt;data engineering architecture&lt;/strong&gt; contains the following stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Generation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Ingestion&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Processing&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Lake Storage&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SQL Query Layer&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Warehouse Analytics&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Business Intelligence Visualization&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  &lt;strong&gt;End-to-End Data Pipeline Architecture&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjrmmg045ub72gtwa1tt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjrmmg045ub72gtwa1tt.png" alt="Image" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9g0itmz2dgxwg1jninzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9g0itmz2dgxwg1jninzo.png" alt="Image" width="720" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwsoa30hqzihlr3qyofnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwsoa30hqzihlr3qyofnc.png" alt="Image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqao5135srzre2tsqzlcp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqao5135srzre2tsqzlcp.png" alt="Image" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagrams above represent a &lt;strong&gt;typical enterprise data pipeline architecture&lt;/strong&gt; used by modern companies.&lt;/p&gt;

&lt;p&gt;The goal of this architecture is to move data from &lt;strong&gt;operational systems&lt;/strong&gt; into &lt;strong&gt;analytics platforms&lt;/strong&gt; where it can generate &lt;strong&gt;business insights&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  &lt;strong&gt;Cloud Data Engineering Architecture Comparison&lt;/strong&gt;
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Architecture Layer&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;What Happens in This Layer&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;AWS Implementation&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;GCP Implementation&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Azure Implementation&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Data Sources&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data is generated from &lt;strong&gt;applications, IoT devices, databases, logs, and user transactions&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;Applications, &lt;strong&gt;RDS databases&lt;/strong&gt;, server logs, IoT sensors&lt;/td&gt;
&lt;td&gt;Applications, &lt;strong&gt;Cloud SQL&lt;/strong&gt;, logs, IoT devices&lt;/td&gt;
&lt;td&gt;Applications, &lt;strong&gt;Azure SQL&lt;/strong&gt;, logs, IoT devices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Data Ingestion (Streaming)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Real-time data&lt;/strong&gt; is continuously collected and streamed into the data pipeline.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Amazon Kinesis&lt;/strong&gt; or &lt;strong&gt;Managed Kafka (MSK)&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Google Cloud Pub/Sub&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Event Hubs&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Batch Data Ingestion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Batch data from &lt;strong&gt;files, APIs, or databases&lt;/strong&gt; is periodically ingested.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AWS Glue&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cloud Dataflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Data Factory&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Data Processing (ETL / Big Data Processing)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data is &lt;strong&gt;cleaned, transformed, and enriched&lt;/strong&gt; using distributed processing frameworks.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon EMR running Apache Spark&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Dataproc&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Databricks&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5. Data Lake Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Raw and processed data is stored in &lt;strong&gt;scalable object storage systems&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon S3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Google Cloud Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Data Lake Storage&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;6. Metadata &amp;amp; Catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stores &lt;strong&gt;metadata information&lt;/strong&gt; such as schema definitions and table structures.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Data Catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Microsoft Purview (formerly Azure Purview)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;7. SQL Query Engine&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Engineers and analysts run &lt;strong&gt;SQL queries on large datasets&lt;/strong&gt; stored in the data lake.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon Athena&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;BigQuery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Synapse Analytics&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;8. Data Warehouse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Processed data is loaded into a &lt;strong&gt;data warehouse optimized for analytics queries&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon Redshift&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;BigQuery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Synapse Analytics&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;9. Workflow Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pipelines are &lt;strong&gt;scheduled and automated&lt;/strong&gt; to manage dependencies.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AWS Step Functions / Amazon MWAA (Managed Airflow)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cloud Composer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Data Factory Pipelines&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;10. Monitoring &amp;amp; Logging&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pipeline performance and failures are tracked using &lt;strong&gt;monitoring tools&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon CloudWatch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cloud Monitoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Monitor&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;11. Visualization / BI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Business teams analyze data using &lt;strong&gt;dashboards and reports&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon QuickSight&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Looker&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Power BI&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  Data Pipeline Flow
&lt;/h1&gt;

&lt;p&gt;A typical &lt;strong&gt;data engineering pipeline&lt;/strong&gt; works like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Sources&lt;/strong&gt;: Applications, transaction systems, and log systems generate raw data.&lt;br&gt;
&lt;strong&gt;Streaming Ingestion&lt;/strong&gt;: Streaming platforms like Apache Kafka or Amazon Kinesis capture real-time events.&lt;br&gt;
&lt;strong&gt;Data Processing&lt;/strong&gt;: Processing engines such as Apache Spark perform data cleaning, transformation, and aggregation.&lt;br&gt;
&lt;strong&gt;Data Lake Storage&lt;/strong&gt;: Data is stored in scalable Data Lakes such as Amazon S3, Google Cloud Storage, or Azure Data Lake Storage.&lt;br&gt;
&lt;strong&gt;SQL Query Layer&lt;/strong&gt;: Tools like Amazon Athena, BigQuery, or Azure Synapse allow engineers to run SQL queries on big data.&lt;br&gt;
&lt;strong&gt;Data Warehouse Analytics&lt;/strong&gt;: Structured analytics data is stored in Amazon Redshift, BigQuery, or Azure Synapse Analytics.&lt;br&gt;
&lt;strong&gt;BI Dashboards&lt;/strong&gt;: Visualization tools such as Power BI, Looker, or Amazon QuickSight create interactive dashboards and reports.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Understanding Hadoop Architecture</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 15 Mar 2026 15:52:27 +0000</pubDate>
      <link>https://dev.to/salma_aga/understanding-hadoop-architecture-16al</link>
      <guid>https://dev.to/salma_aga/understanding-hadoop-architecture-16al</guid>
      <description>&lt;p&gt;Imagine a company that collects a &lt;strong&gt;large amount of data&lt;/strong&gt; every day, such as &lt;strong&gt;website logs, transactions, or user activity&lt;/strong&gt;. After some time, the data becomes &lt;strong&gt;too large for a single computer&lt;/strong&gt; to store and process. This is where &lt;strong&gt;Hadoop&lt;/strong&gt; helps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hadoop&lt;/strong&gt; is a &lt;strong&gt;big data framework&lt;/strong&gt; designed to &lt;strong&gt;store and process very large datasets across many machines&lt;/strong&gt;. Instead of using one powerful computer, Hadoop uses &lt;strong&gt;multiple machines working together&lt;/strong&gt;, which are called &lt;strong&gt;nodes&lt;/strong&gt;. These machines together form a &lt;strong&gt;Hadoop cluster&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  Storage Layer — &lt;strong&gt;HDFS&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The storage system used by Hadoop is called &lt;strong&gt;HDFS (Hadoop Distributed File System)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a &lt;strong&gt;large file&lt;/strong&gt; is stored in Hadoop, it is automatically &lt;strong&gt;split into smaller pieces called blocks&lt;/strong&gt;. These &lt;strong&gt;blocks&lt;/strong&gt; are then &lt;strong&gt;distributed across multiple machines&lt;/strong&gt; in the cluster.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Block 1 → Machine 1&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Block 2 → Machine 2&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Block 3 → Machine 3&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach is called &lt;strong&gt;distributed storage&lt;/strong&gt;, and it allows Hadoop to store &lt;strong&gt;very large datasets efficiently&lt;/strong&gt;.&lt;/p&gt;
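
&lt;p&gt;As a rough sketch of the splitting described above, the following Python snippet cuts a file into blocks and spreads them over three machines. The 128 MB block size and the round-robin placement are assumptions made for the illustration; a real cluster uses a configurable block size and its own placement and replication rules. The helper names &lt;code&gt;split_into_blocks&lt;/code&gt; and &lt;code&gt;place_blocks&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
BLOCK_SIZE_MB = 128  # assumed block size; configurable in a real cluster


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block for a file of the given size."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


def place_blocks(block_sizes, machines):
    """Assign blocks to machines in round-robin order (a simplification)."""
    return {
        f"Block {i + 1}": machines[i % len(machines)]
        for i in range(len(block_sizes))
    }


blocks = split_into_blocks(300)
placement = place_blocks(blocks, ["Machine 1", "Machine 2", "Machine 3"])
print(blocks)     # [128, 128, 44]
print(placement)  # {'Block 1': 'Machine 1', 'Block 2': 'Machine 2', 'Block 3': 'Machine 3'}
```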




&lt;h1&gt;
  
  
  Hadoop Nodes
&lt;/h1&gt;

&lt;p&gt;In a Hadoop cluster, there are two important types of nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;NameNode (Master Node)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;NameNode&lt;/strong&gt; acts like the &lt;strong&gt;manager of the system&lt;/strong&gt;. It stores &lt;strong&gt;metadata&lt;/strong&gt;, which includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;file names&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;block locations&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;which machine stores each block&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;NameNode does not store actual data&lt;/strong&gt;. It only &lt;strong&gt;manages the file system and keeps track of the data&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;DataNodes&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DataNodes&lt;/strong&gt; are the machines that &lt;strong&gt;store the actual data blocks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;DataNode 1 → Block 1&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DataNode 2 → Block 2&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DataNode 3 → Block 3&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layer is known as the &lt;strong&gt;storage layer&lt;/strong&gt; of Hadoop.&lt;/p&gt;
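
&lt;p&gt;The split of responsibilities between the NameNode and the DataNodes can be pictured with a few dictionaries. This is a simplified model, not how HDFS is implemented: the file name, block contents, and the &lt;code&gt;read_file&lt;/code&gt; helper are all hypothetical.&lt;/p&gt;

```python
# NameNode: metadata only (which blocks make up a file, and where each block lives).
namenode_files = {"logs.txt": ["Block 1", "Block 2", "Block 3"]}
namenode_block_locations = {
    "Block 1": "DataNode 1",
    "Block 2": "DataNode 2",
    "Block 3": "DataNode 3",
}

# DataNodes: the actual bytes of each block.
datanode_storage = {
    "DataNode 1": {"Block 1": b"GET /home\n"},
    "DataNode 2": {"Block 2": b"GET /cart\n"},
    "DataNode 3": {"Block 3": b"GET /pay\n"},
}


def read_file(name):
    """Ask the NameNode where the blocks are, then read them from the DataNodes."""
    data = b""
    for block in namenode_files[name]:
        node = namenode_block_locations[block]
        data += datanode_storage[node][block]
    return data


print(read_file("logs.txt"))  # b'GET /home\nGET /cart\nGET /pay\n'
```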




&lt;h1&gt;
  
  
  Processing Layer — &lt;strong&gt;MapReduce&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;After the data is stored, Hadoop needs to &lt;strong&gt;process the data&lt;/strong&gt;. This is done using &lt;strong&gt;MapReduce&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MapReduce&lt;/strong&gt; is a &lt;strong&gt;distributed data processing framework&lt;/strong&gt; that works in two main steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Map Phase&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the &lt;strong&gt;Map phase&lt;/strong&gt;, a &lt;strong&gt;large task is divided into smaller tasks&lt;/strong&gt;.&lt;br&gt;
Each machine processes a &lt;strong&gt;small part of the data&lt;/strong&gt; and emits intermediate key-value pairs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Reduce Phase&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the &lt;strong&gt;Reduce phase&lt;/strong&gt;, the &lt;strong&gt;results from all machines are grouped by key and combined&lt;/strong&gt; to produce the &lt;strong&gt;final output&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This process allows Hadoop to &lt;strong&gt;process huge datasets quickly&lt;/strong&gt; by using &lt;strong&gt;parallel processing&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  Example
&lt;/h1&gt;

&lt;p&gt;Imagine a company wants to analyze &lt;strong&gt;millions of website log records&lt;/strong&gt; to see &lt;strong&gt;how many users visited from each country&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;log data is stored in HDFS&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Hadoop &lt;strong&gt;splits the logs into blocks&lt;/strong&gt; and stores them across &lt;strong&gt;multiple DataNodes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MapReduce processes the data in parallel&lt;/strong&gt; on different machines.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Reduce phase combines the results&lt;/strong&gt; and generates the &lt;strong&gt;final report&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
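
&lt;p&gt;The steps above can be sketched in a few lines of Python. This is a single-process illustration of the map, shuffle, and reduce idea, not Hadoop code. The log format (&lt;code&gt;user_id,country&lt;/code&gt;) and the sample records are assumptions made for the example, and each record is counted as one visit.&lt;/p&gt;

```python
from collections import defaultdict


def map_phase(log_lines):
    """Map: emit a (country, 1) pair for every log record in one block."""
    return [(line.split(",")[1], 1) for line in log_lines]


def shuffle(mapped_pairs):
    """Shuffle: group the emitted values by key (country)."""
    grouped = defaultdict(list)
    for country, count in mapped_pairs:
        grouped[country].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a final total."""
    return {country: sum(counts) for country, counts in grouped.items()}


# Three blocks of log records, as if stored on three DataNodes.
blocks = [
    ["u1,IN", "u2,US"],
    ["u3,IN", "u4,DE"],
    ["u5,US", "u6,IN"],
]

# In Hadoop each block would be mapped on a different machine in parallel.
mapped = [pair for block in blocks for pair in map_phase(block)]
report = reduce_phase(shuffle(mapped))
print(report)  # {'IN': 3, 'US': 2, 'DE': 1}
```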




&lt;h1&gt;
  
  
  Architecture
&lt;/h1&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn82eqotbsoliwv04ih1m.png" alt=" " width="800" height="533"&gt;
&lt;/h2&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Hadoop architecture&lt;/strong&gt; works with two main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;HDFS → for distributed storage&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MapReduce → for distributed processing&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By &lt;strong&gt;splitting data across multiple machines and processing it in parallel&lt;/strong&gt;, Hadoop allows organizations to &lt;strong&gt;store and analyze massive datasets efficiently&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>beginners</category>
      <category>dataengineering</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>AWS S3 Storage Classes (Start to End)</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 22 Feb 2026 19:34:16 +0000</pubDate>
      <link>https://dev.to/salma_aga/aws-s3-storage-classes-start-to-end-258c</link>
      <guid>https://dev.to/salma_aga/aws-s3-storage-classes-start-to-end-258c</guid>
      <description>&lt;h2&gt;
  
  
  1) What is &lt;strong&gt;Amazon S3&lt;/strong&gt;?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Amazon S3 (Simple Storage Service)&lt;/strong&gt; is an AWS service used to store files like &lt;strong&gt;images, videos, logs, backups, datasets, and reports&lt;/strong&gt; as &lt;strong&gt;objects&lt;/strong&gt; inside &lt;strong&gt;buckets&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bucket&lt;/strong&gt; = main container (like a top-level folder)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object&lt;/strong&gt; = the actual file (data + metadata)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;S3 is widely used for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data lakes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Backups and disaster recovery&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Application logs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Static website files&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analytics and machine learning datasets&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Long-term archiving and compliance&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2) Why does S3 have &lt;strong&gt;multiple storage classes&lt;/strong&gt;?
&lt;/h2&gt;

&lt;p&gt;Not all data is used in the same way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some data is used &lt;strong&gt;daily&lt;/strong&gt; (hot data)&lt;/li&gt;
&lt;li&gt;Some data is used &lt;strong&gt;sometimes&lt;/strong&gt; (cold data)&lt;/li&gt;
&lt;li&gt;Some data is &lt;strong&gt;almost never&lt;/strong&gt; used (archive data)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So AWS provides different &lt;strong&gt;S3 storage classes&lt;/strong&gt; to help you balance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; – how much you pay for storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt; – how fast you can read data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability&lt;/strong&gt; – how often data is accessible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk&lt;/strong&gt; – multi-AZ vs single-AZ&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval fee&lt;/strong&gt; – extra cost when you download data in some classes&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3) Key Terms
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Simple Meaning&lt;/th&gt;
&lt;th&gt;Easy Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Durability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How safe your data is from being lost&lt;/td&gt;
&lt;td&gt;Even if disks fail, AWS still keeps your file safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;11 nines durability (99.999999999%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extremely high safety&lt;/td&gt;
&lt;td&gt;“Almost never lost”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How often data is accessible&lt;/td&gt;
&lt;td&gt;99.99% means very little downtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How fast you can access data&lt;/td&gt;
&lt;td&gt;Milliseconds = very fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How much data can be read/written per second&lt;/td&gt;
&lt;td&gt;Important for big analytics jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retrieval fee&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extra cost when you download data&lt;/td&gt;
&lt;td&gt;Some classes charge when you read&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability Zone (AZ)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One or more isolated data centers inside a region&lt;/td&gt;
&lt;td&gt;Multi-AZ is safer than single AZ&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
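
&lt;p&gt;To get a feel for what these percentages mean, a quick back-of-the-envelope calculation helps. The sketch below converts an availability percentage into expected downtime per year, and a durability percentage into the expected number of objects lost per year. The figure of 10 million stored objects is a hypothetical value chosen for the example, and the function names are made up.&lt;/p&gt;

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent):
    """Expected unavailable minutes per year for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def expected_objects_lost_per_year(num_objects, durability_percent):
    """Expected number of objects lost per year for a given durability."""
    return num_objects * (1 - durability_percent / 100)


downtime = round(downtime_minutes_per_year(99.99), 1)
lost = expected_objects_lost_per_year(10_000_000, 99.999999999)

print(downtime)  # 52.6 (minutes per year)
print(lost)      # roughly 0.0001, about one object every 10,000 years
```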




&lt;h2&gt;
  
  
  4) There are 8 Storage Classes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1  &lt;strong&gt;S3 Standard&lt;/strong&gt; – Hot Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Used for &lt;strong&gt;frequently accessed&lt;/strong&gt; and &lt;strong&gt;business-critical&lt;/strong&gt; data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Very fast access (milliseconds):&lt;/strong&gt; Suitable for real-time applications and user-facing systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High availability:&lt;/strong&gt; Designed to be available almost all the time for applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-AZ durability:&lt;/strong&gt; Data is stored redundantly across multiple Availability Zones (at least three).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No retrieval fee:&lt;/strong&gt; You don’t pay extra when reading or downloading data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Website images and videos served to users&lt;/li&gt;
&lt;li&gt;Daily application logs used by engineers&lt;/li&gt;
&lt;li&gt;Active analytics datasets queried many times per day&lt;/li&gt;
&lt;li&gt;Frequently used ML training and inference data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Today’s sales data used every hour → &lt;strong&gt;S3 Standard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Standard = &lt;strong&gt;Hot + Fast&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.2  &lt;strong&gt;S3 Intelligent-Tiering&lt;/strong&gt; – AWS Decides Automatically
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;For data where you &lt;strong&gt;don’t know&lt;/strong&gt; how often it will be accessed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic movement between tiers:&lt;/strong&gt; AWS moves objects to cheaper tiers when access reduces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No performance impact:&lt;/strong&gt; Applications access data the same way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small monitoring fee:&lt;/strong&gt; Charged for AWS to track access patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data lakes where new data is hot and old data becomes cold&lt;/li&gt;
&lt;li&gt;ML datasets where some features are used more than others&lt;/li&gt;
&lt;li&gt;Analytics history that changes in access frequency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Some months of logs are queried often, others not → &lt;strong&gt;Intelligent-Tiering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Intelligent = &lt;strong&gt;“I don’t know access pattern”&lt;/strong&gt;&lt;/p&gt;
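
&lt;p&gt;The automatic movement between tiers can be pictured as a simple rule based on how long an object has gone without being accessed. The sketch below assumes the default thresholds as I understand them from the AWS documentation (30 days without access for the Infrequent Access tier, 90 days for the Archive Instant Access tier) and leaves out the optional archive tiers. Treat the numbers as assumptions and check the current documentation; the function name is hypothetical.&lt;/p&gt;

```python
def intelligent_tier(days_since_last_access):
    """Return the access tier an object would sit in (assumed default thresholds)."""
    if days_since_last_access >= 90:
        return "Archive Instant Access"
    if days_since_last_access >= 30:
        return "Infrequent Access"
    return "Frequent Access"


for days in (0, 45, 200):
    print(days, intelligent_tier(days))
# 0 Frequent Access
# 45 Infrequent Access
# 200 Archive Instant Access
```

&lt;p&gt;When an object in a lower-cost tier is read again, it moves back to the Frequent Access tier, which corresponds to resetting &lt;code&gt;days_since_last_access&lt;/code&gt; to zero in this sketch.&lt;/p&gt;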




&lt;h3&gt;
  
  
  4.3  &lt;strong&gt;S3 Standard-IA&lt;/strong&gt; – Cold but Fast
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;For data that is accessed &lt;strong&gt;rarely&lt;/strong&gt; but must be available &lt;strong&gt;immediately&lt;/strong&gt; when needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lower storage cost than Standard:&lt;/strong&gt; Helps save money for infrequently used data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast access:&lt;/strong&gt; Still milliseconds when you retrieve data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval fee applies:&lt;/strong&gt; Extra cost when you download data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-AZ durability:&lt;/strong&gt; Safe across multiple data centers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backups used only during failures&lt;/li&gt;
&lt;li&gt;Disaster recovery data&lt;/li&gt;
&lt;li&gt;Old reports accessed occasionally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Weekly backups restored only during failure → &lt;strong&gt;Standard-IA&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; IA = &lt;strong&gt;Rare, but fast&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.4  &lt;strong&gt;S3 One Zone-IA&lt;/strong&gt; – Cheaper but Risky
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Same as Standard-IA, but stored in &lt;strong&gt;one Availability Zone only&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cheaper than Standard-IA:&lt;/strong&gt; Cost saving for non-critical data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single AZ risk:&lt;/strong&gt; If that AZ goes down, data can be unavailable, and if the AZ is destroyed, the data can be lost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast access:&lt;/strong&gt; Still millisecond latency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieval fee applies.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Re-creatable ETL outputs&lt;/li&gt;
&lt;li&gt;Temporary pipeline files&lt;/li&gt;
&lt;li&gt;Secondary backups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Temporary pipeline files → &lt;strong&gt;One Zone-IA&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; One Zone = &lt;strong&gt;Cheap + Risk&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.5 &lt;strong&gt;S3 Glacier Instant Retrieval&lt;/strong&gt; – Archive + Fast
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;S3 Glacier Instant Retrieval is a storage class for archived data that is rarely accessed, but when you need it, you can open it immediately. It is mainly used for long-term storage where data is kept for compliance or record-keeping, but still needs instant access sometimes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very low storage cost&lt;/li&gt;
&lt;li&gt;Instant (milliseconds) access&lt;/li&gt;
&lt;li&gt;Retrieval fee applies&lt;/li&gt;
&lt;li&gt;Multi-AZ durability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compliance documents that must open quickly&lt;/li&gt;
&lt;li&gt;Audit logs needed during investigations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Legal docs opened only during audits → &lt;strong&gt;Glacier Instant&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Glacier Instant = &lt;strong&gt;Archive + Fast&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.6 &lt;strong&gt;S3 Glacier Flexible Retrieval&lt;/strong&gt; – Archive + Wait
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;S3 Glacier Flexible Retrieval is for archived data that is almost never accessed and where waiting minutes to hours for a restore is acceptable. This class is mainly for long-term backups and historical data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very low cost for long-term storage&lt;/li&gt;
&lt;li&gt;Multiple retrieval speeds: expedited (typically minutes), standard (typically a few hours), bulk (slowest and cheapest, typically several hours)&lt;/li&gt;
&lt;li&gt;Suitable for large archive restores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Old backups&lt;/li&gt;
&lt;li&gt;Historical logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Flexible = &lt;strong&gt;Waiting is okay&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.7 &lt;strong&gt;S3 Glacier Deep Archive&lt;/strong&gt; – Cheapest + Slowest
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;S3 Glacier Deep Archive is the lowest-cost storage class in Amazon S3. It is used for data that must be kept for many years and is almost never accessed. This is mainly for legal, regulatory, and compliance requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cheapest storage class&lt;/li&gt;
&lt;li&gt;Retrieval time: typically within 12 hours (standard) or 48 hours (bulk)&lt;/li&gt;
&lt;li&gt;Best for compliance and legal retention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Financial records&lt;/li&gt;
&lt;li&gt;Government data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Deep Archive = &lt;strong&gt;Coldest + Slowest + Cheapest&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.8 &lt;strong&gt;S3 Express One Zone&lt;/strong&gt; – Extra Fast, Single AZ
&lt;/h3&gt;

&lt;p&gt;S3 Express One Zone is a storage class designed for very high-performance workloads. It is used when applications need very low latency and very high request rates for reading and writing data. Data is stored in only one Availability Zone, so it is faster but less resilient compared to multi-AZ classes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ultra-fast performance for request-heavy workloads&lt;/li&gt;
&lt;li&gt;High throughput for many small reads/writes&lt;/li&gt;
&lt;li&gt;Stored in one AZ only (less resilient)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time analytics&lt;/li&gt;
&lt;li&gt;ML feature stores&lt;/li&gt;
&lt;li&gt;Hot ETL intermediate data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Pipeline reading millions of small files → &lt;strong&gt;Express One Zone&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Express = &lt;strong&gt;Extra fast&lt;/strong&gt;, One Zone = &lt;strong&gt;Single AZ&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5) Comparison table for all 8 S3 Storage Classes
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Storage Class&lt;/th&gt;
&lt;th&gt;Access Pattern&lt;/th&gt;
&lt;th&gt;Retrieval Speed&lt;/th&gt;
&lt;th&gt;Storage Cost&lt;/th&gt;
&lt;th&gt;Extra Cost&lt;/th&gt;
&lt;th&gt;Availability / Risk&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Frequently accessed&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Multi-AZ, very safe&lt;/td&gt;
&lt;td&gt;Hot data, websites, active logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Intelligent-Tiering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unknown / changing&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Monitoring fee&lt;/td&gt;
&lt;td&gt;Multi-AZ&lt;/td&gt;
&lt;td&gt;Unpredictable workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Standard-IA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrequent but fast needed&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;td&gt;Retrieval fee&lt;/td&gt;
&lt;td&gt;Multi-AZ&lt;/td&gt;
&lt;td&gt;Backups, DR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 One Zone-IA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrequent, non-critical&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Cheaper&lt;/td&gt;
&lt;td&gt;Retrieval fee&lt;/td&gt;
&lt;td&gt;Single AZ risk&lt;/td&gt;
&lt;td&gt;Re-creatable data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Glacier Instant Retrieval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rare but instant needed&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Retrieval fee&lt;/td&gt;
&lt;td&gt;Multi-AZ&lt;/td&gt;
&lt;td&gt;Compliance archives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Glacier Flexible Retrieval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very rare access&lt;/td&gt;
&lt;td&gt;Minutes → Hours&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Retrieval fee&lt;/td&gt;
&lt;td&gt;Multi-AZ&lt;/td&gt;
&lt;td&gt;Old backups, logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Glacier Deep Archive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Almost never accessed&lt;/td&gt;
&lt;td&gt;12–48 hours&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;Retrieval fee&lt;/td&gt;
&lt;td&gt;Multi-AZ&lt;/td&gt;
&lt;td&gt;Legal &amp;amp; long-term records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Express One Zone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very frequent, high-performance&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Ultra-fast&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;td&gt;Request-based pricing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Single AZ&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-performance analytics, ML&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  6) How to Choose Quickly
&lt;/h2&gt;

&lt;p&gt;Ask yourself these &lt;strong&gt;3 simple questions&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  i) How often will the data be accessed?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Daily or many times a day&lt;/strong&gt; → &lt;strong&gt;S3 Standard&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not sure / changes over time&lt;/strong&gt; → &lt;strong&gt;S3 Intelligent-Tiering&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rarely&lt;/strong&gt; → Use IA or Glacier classes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ii) When needed, how fast must I get the data?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instant (milliseconds)&lt;/strong&gt; → Standard, Intelligent-Tiering, Standard-IA, One Zone-IA, Glacier Instant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can wait minutes or hours&lt;/strong&gt; → Glacier Flexible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can wait 1–2 days&lt;/strong&gt; → Glacier Deep Archive&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  iii) Is the data critical or can it be recreated?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Critical data&lt;/strong&gt; → Choose &lt;strong&gt;multi-AZ classes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-critical or re-creatable data&lt;/strong&gt; → Can use &lt;strong&gt;single-AZ classes&lt;/strong&gt; (One Zone-IA, Express One Zone)
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quick Mapping Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Best Choice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;App serving images every second&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;S3 Standard&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs with changing access patterns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Intelligent-Tiering&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weekly backups&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Standard-IA&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Temporary ETL output&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;One Zone-IA&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance docs needing instant access&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Glacier Instant&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large archive restores&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Glacier Flexible&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10-year legal retention&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Glacier Deep Archive&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High-performance ML feature reads&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;S3 Express One Zone&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
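&lt;p&gt;The three questions above can be sketched as a small Python function. This is only an illustration: the function name and the input values ("frequent", "unknown", "rare", "instant", "hours", "days") are assumptions made for this example, and Glacier Instant Retrieval and Express One Zone are left out because the three questions alone do not separate them from the other classes.&lt;/p&gt;

```python
def choose_storage_class(access: str, max_wait: str, recreatable: bool) -> str:
    """Map the three questions to a storage class name.

    access:      "frequent", "unknown", or "rare"
    max_wait:    "instant", "hours", or "days"
    recreatable: True if the data can be rebuilt after losing one AZ
    """
    # Question i: how often is the data accessed?
    if access == "frequent":
        return "S3 Standard"
    if access == "unknown":
        return "S3 Intelligent-Tiering"

    # From here on the data is rarely accessed.
    # Question ii: how long can we wait for it?
    if max_wait == "days":
        return "S3 Glacier Deep Archive"
    if max_wait == "hours":
        return "S3 Glacier Flexible Retrieval"

    # Rare access with instant retrieval.
    # Question iii: is the data critical or re-creatable?
    if recreatable:
        return "S3 One Zone-IA"
    return "S3 Standard-IA"


print(choose_storage_class("frequent", "instant", False))  # S3 Standard
print(choose_storage_class("rare", "days", False))         # S3 Glacier Deep Archive
print(choose_storage_class("rare", "instant", True))       # S3 One Zone-IA
```

&lt;p&gt;The order of the checks mirrors the order of the questions: access frequency first, then acceptable wait time, then how critical the data is.&lt;/p&gt;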




&lt;h2&gt;
  
  
  7) How to Remember
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hot&lt;/strong&gt; → Standard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unknown&lt;/strong&gt; → Intelligent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold&lt;/strong&gt; → IA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Very Cold&lt;/strong&gt; → Glacier&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coldest&lt;/strong&gt; → Deep Archive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ultra-fast hot data&lt;/strong&gt; → Express One Zone&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8) What is Amazon S3 and What is a Bucket?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Amazon S3 (Simple Storage Service)&lt;/strong&gt; is a &lt;strong&gt;cloud storage service&lt;/strong&gt; provided by &lt;strong&gt;AWS&lt;/strong&gt;. It is used to store &lt;strong&gt;files and data&lt;/strong&gt; such as &lt;strong&gt;images&lt;/strong&gt;, &lt;strong&gt;videos&lt;/strong&gt;, &lt;strong&gt;logs&lt;/strong&gt;, &lt;strong&gt;backups&lt;/strong&gt;, &lt;strong&gt;datasets&lt;/strong&gt;, and &lt;strong&gt;documents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;Amazon S3 bucket&lt;/strong&gt; is the &lt;strong&gt;main container&lt;/strong&gt; where all your &lt;strong&gt;files (objects)&lt;/strong&gt; are stored. You cannot upload a file directly to S3 without a bucket. Every file must be inside a bucket.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bucket&lt;/strong&gt; is like a &lt;strong&gt;main folder&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object&lt;/strong&gt; is like a &lt;strong&gt;file inside the folder&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; You create a bucket named &lt;strong&gt;company-data-bucket&lt;/strong&gt;.&lt;br&gt;
Inside this bucket, you store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;logs/app-logs-2026.json&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;reports/sales-jan.csv&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;images/profile.png&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here, &lt;strong&gt;company-data-bucket&lt;/strong&gt; is the &lt;strong&gt;bucket&lt;/strong&gt;, and each file is an &lt;strong&gt;object&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  9) Basic Structure of Amazon S3
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Meaning in Simple Words&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bucket&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The &lt;strong&gt;top-level container&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;company-analytics-bucket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Object&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The &lt;strong&gt;actual file&lt;/strong&gt; stored&lt;/td&gt;
&lt;td&gt;2026/jan/sales.csv&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Key&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The &lt;strong&gt;full path&lt;/strong&gt; of the file inside the bucket&lt;/td&gt;
&lt;td&gt;2026/jan/sales.csv&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Region&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The &lt;strong&gt;AWS location&lt;/strong&gt; where the bucket lives&lt;/td&gt;
&lt;td&gt;us-east-1, ap-south-1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Important points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each &lt;strong&gt;bucket belongs to one AWS region&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Your &lt;strong&gt;data is physically stored&lt;/strong&gt; in that region&lt;/li&gt;
&lt;li&gt;You can access the bucket from anywhere if &lt;strong&gt;permissions&lt;/strong&gt; allow it&lt;/li&gt;
&lt;/ul&gt;
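&lt;p&gt;To make the bucket and key terms concrete, here is a minimal Python sketch that splits an address into those two parts. The s3://bucket/key form is the style used by AWS tools such as the AWS CLI; it does not appear in the table above, and the helper name split_s3_uri is made up for this example.&lt;/p&gt;

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri}")
    # Everything up to the first "/" is the bucket; the rest is the key.
    bucket, _, key = uri[len(prefix):].partition("/")
    return bucket, key


bucket, key = split_s3_uri("s3://company-analytics-bucket/2026/jan/sales.csv")
print(bucket)  # company-analytics-bucket
print(key)     # 2026/jan/sales.csv
```

&lt;p&gt;Notice that the key is the whole path, including the parts that look like folders. S3 stores objects by key; the slashes are simply part of the key name.&lt;/p&gt;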




&lt;h2&gt;
  
  
  10) Why Do We Need Amazon S3 Buckets?
&lt;/h2&gt;

&lt;p&gt;Amazon S3 buckets are used to store and manage &lt;strong&gt;almost all types of data&lt;/strong&gt; in the cloud.&lt;/p&gt;

&lt;p&gt;Common real-world use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data lakes:&lt;/strong&gt; Store &lt;strong&gt;raw data&lt;/strong&gt;, &lt;strong&gt;logs&lt;/strong&gt;, &lt;strong&gt;CSV&lt;/strong&gt;, &lt;strong&gt;JSON&lt;/strong&gt;, and &lt;strong&gt;Parquet&lt;/strong&gt; files&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backups:&lt;/strong&gt; Store &lt;strong&gt;database backups&lt;/strong&gt;, &lt;strong&gt;server backups&lt;/strong&gt;, and &lt;strong&gt;application backups&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Application files:&lt;/strong&gt; Store &lt;strong&gt;images&lt;/strong&gt;, &lt;strong&gt;videos&lt;/strong&gt;, and &lt;strong&gt;documents&lt;/strong&gt; used by web and mobile apps&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Analytics and Big Data:&lt;/strong&gt; Store data for &lt;strong&gt;Athena&lt;/strong&gt;, &lt;strong&gt;Glue&lt;/strong&gt;, &lt;strong&gt;EMR&lt;/strong&gt;, and &lt;strong&gt;Redshift Spectrum&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Static website hosting:&lt;/strong&gt; Store &lt;strong&gt;HTML&lt;/strong&gt;, &lt;strong&gt;CSS&lt;/strong&gt;, and &lt;strong&gt;JavaScript&lt;/strong&gt; files for static websites&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, &lt;strong&gt;Amazon S3 buckets&lt;/strong&gt; are the &lt;strong&gt;foundation of data storage&lt;/strong&gt; in AWS.&lt;/p&gt;




&lt;h2&gt;
  
  
  11) Amazon S3 Bucket Naming Rules
&lt;/h2&gt;

&lt;p&gt;S3 bucket names follow &lt;strong&gt;strict global rules&lt;/strong&gt;. These rules exist because bucket names are used in &lt;strong&gt;URLs&lt;/strong&gt; and must work with the &lt;strong&gt;Domain Name System (DNS)&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 1: Globally Unique Name
&lt;/h3&gt;

&lt;p&gt;Every &lt;strong&gt;bucket name must be globally unique&lt;/strong&gt; across all AWS accounts and regions. If someone else has already created a bucket with a name, you cannot use that name.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mybucket may already be taken&lt;/li&gt;
&lt;li&gt;mycompany-analytics-2026 is more likely to be available&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Rule 2: Length Rules
&lt;/h3&gt;

&lt;p&gt;Bucket name length must be between &lt;strong&gt;3 and 63 characters&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Rule 3: Allowed Characters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You can use only: lowercase letters (a to z), numbers (0 to 9), hyphens, and dots&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You cannot use: uppercase letters, underscores, spaces, or other special characters&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Valid examples: my-data-bucket, company.logs.backup, analytics2026&lt;/p&gt;

&lt;p&gt;Invalid examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MyBucket&lt;/li&gt;
&lt;li&gt;my_bucket&lt;/li&gt;
&lt;li&gt;my bucket&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Rule 4: Start and End with Letter or Number
&lt;/h3&gt;

&lt;p&gt;A bucket name must &lt;strong&gt;start and end with a letter or number&lt;/strong&gt;. It cannot start or end with a hyphen or dot.&lt;/p&gt;




&lt;h3&gt;
  
  
  Rule 5: No IP Address Format
&lt;/h3&gt;

&lt;p&gt;Bucket names cannot look like an &lt;strong&gt;IP address&lt;/strong&gt; such as 192.168.1.1. This is because bucket names are used in &lt;strong&gt;URLs&lt;/strong&gt;.&lt;/p&gt;
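&lt;p&gt;Rules 2 to 5 can be checked locally before you try to create a bucket. The sketch below is a simplified validator written for this post, not AWS's own implementation: the function name and the two patterns are assumptions, it covers only the rules listed here (AWS documents a few more), and Rule 1 (global uniqueness) can only be checked by AWS itself.&lt;/p&gt;

```python
import re

# Rules 2, 3 and 4 in one pattern: lowercase letters, digits, hyphens
# and dots; first and last character a letter or digit; length 3 to 63.
NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")

# Rule 5: four groups of digits separated by dots looks like an IP address.
IP_PATTERN = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def is_valid_bucket_name(name: str) -> bool:
    """Check a candidate name against Rules 2 to 5."""
    if NAME_PATTERN.fullmatch(name) is None:
        return False
    if IP_PATTERN.fullmatch(name) is not None:
        return False
    return True


for candidate in ["my-data-bucket", "MyBucket", "my_bucket", "192.168.1.1"]:
    print(candidate, is_valid_bucket_name(candidate))
# my-data-bucket True
# MyBucket False
# my_bucket False
# 192.168.1.1 False
```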




&lt;h2&gt;
  
  
  12) Why These Rules Exist
&lt;/h2&gt;

&lt;p&gt;Amazon S3 buckets are accessed using &lt;strong&gt;web URLs&lt;/strong&gt; like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://my-data-bucket.s3.amazonaws.com/file.csv" rel="noopener noreferrer"&gt;https://my-data-bucket.s3.amazonaws.com/file.csv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To make sure these URLs work correctly with internet routing, the DNS system, and SSL certificates, AWS enforces strict &lt;strong&gt;bucket naming rules&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  13) Important Features of Amazon S3 Buckets
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Region
&lt;/h3&gt;

&lt;p&gt;When you create a &lt;strong&gt;bucket&lt;/strong&gt;, you select a &lt;strong&gt;region&lt;/strong&gt;. Your &lt;strong&gt;data stays in that region&lt;/strong&gt;. This helps with &lt;strong&gt;low latency&lt;/strong&gt;, &lt;strong&gt;cost control&lt;/strong&gt;, and &lt;strong&gt;legal compliance&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Access Control
&lt;/h3&gt;

&lt;p&gt;By default, &lt;strong&gt;buckets are private&lt;/strong&gt;. You control access using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IAM users and roles&lt;/li&gt;
&lt;li&gt;Bucket policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Public access is usually used only for &lt;strong&gt;public website content&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Versioning
&lt;/h3&gt;

&lt;p&gt;When enabled on a bucket, &lt;strong&gt;versioning&lt;/strong&gt; keeps &lt;strong&gt;multiple versions&lt;/strong&gt; of the same file. If someone overwrites or deletes a file, older versions are still stored. This helps with &lt;strong&gt;data recovery&lt;/strong&gt; and &lt;strong&gt;protection against accidental overwrites and deletes&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Encryption
&lt;/h3&gt;

&lt;p&gt;Amazon S3 supports &lt;strong&gt;encryption&lt;/strong&gt; to protect your data. Data can be encrypted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;at rest (while stored in S3)&lt;/li&gt;
&lt;li&gt;in transit (while moving over the network)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Encryption is important for &lt;strong&gt;security&lt;/strong&gt; and &lt;strong&gt;compliance requirements&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Lifecycle Rules
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Lifecycle rules&lt;/strong&gt; help you &lt;strong&gt;automate storage management&lt;/strong&gt;. You can move old data to &lt;strong&gt;cheaper storage classes&lt;/strong&gt; or &lt;strong&gt;delete data&lt;/strong&gt; after a fixed time. This helps reduce &lt;strong&gt;storage cost&lt;/strong&gt; automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  14) Real-Life Example from Data Engineering
&lt;/h2&gt;

&lt;p&gt;In a real data engineering project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New logs come &lt;strong&gt;every day&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Old logs are accessed &lt;strong&gt;rarely&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Compliance rules require keeping data for &lt;strong&gt;many years&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You may create different buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;company-raw-logs&lt;/strong&gt; for daily logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;company-processed-data&lt;/strong&gt; for transformed data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;company-archive-data&lt;/strong&gt; for long-term storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lifecycle rules can move old files automatically to &lt;strong&gt;cheaper storage classes&lt;/strong&gt;.&lt;/p&gt;
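&lt;p&gt;As a rough sketch, a lifecycle rule for the company-raw-logs bucket could be described with a Python dictionary like the one below. The rule ID, the logs/ prefix, and the day counts are hypothetical values chosen for illustration. The dictionary follows the shape that, to my understanding, boto3's put_bucket_lifecycle_configuration accepts as its LifecycleConfiguration argument, so check the AWS documentation before relying on it. No AWS call is made here; the code only builds the rule and prints the schedule.&lt;/p&gt;

```python
# Hypothetical rule: the ID, prefix and day counts are illustrative only.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-old-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            # Move objects to cheaper classes as they get older.
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            # Delete objects after roughly ten years.
            "Expiration": {"Days": 3650},
        }
    ]
}


def transition_schedule(config: dict) -> list[tuple[int, str]]:
    """Return (age in days, storage class) pairs for enabled rules, sorted by age."""
    pairs = []
    for rule in config["Rules"]:
        if rule["Status"] != "Enabled":
            continue
        for step in rule.get("Transitions", []):
            pairs.append((step["Days"], step["StorageClass"]))
    return sorted(pairs)


for days, storage_class in transition_schedule(lifecycle_configuration):
    print(f"after {days} days: {storage_class}")
# after 30 days: STANDARD_IA
# after 90 days: GLACIER
# after 365 days: DEEP_ARCHIVE
```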




&lt;h2&gt;
  
  
  15) How to Remember Amazon S3 Bucket Rules
&lt;/h2&gt;

&lt;p&gt;Use the word &lt;strong&gt;BUCKET&lt;/strong&gt; as a memory trick:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;B&lt;/strong&gt; means Bucket is the main container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;U&lt;/strong&gt; means Unique globally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C&lt;/strong&gt; means Characters allowed are lowercase letters, numbers, hyphens, and dots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;K&lt;/strong&gt; means Keep name length between 3 and 63&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E&lt;/strong&gt; means End (and start) with a letter or number&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T&lt;/strong&gt; means Tied to one AWS region&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>aws</category>
      <category>beginners</category>
      <category>cloud</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Data Engineering Basics: From What is Data to Modern Lakehouse Architecture</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 22 Feb 2026 05:19:08 +0000</pubDate>
      <link>https://dev.to/salma_aga/-data-engineering-basics-from-what-is-data-to-modern-lakehouse-architecture-1l10</link>
      <guid>https://dev.to/salma_aga/-data-engineering-basics-from-what-is-data-to-modern-lakehouse-architecture-1l10</guid>
      <description>&lt;p&gt;This post explains &lt;strong&gt;data fundamentals&lt;/strong&gt;, &lt;strong&gt;databases&lt;/strong&gt;, &lt;strong&gt;data warehousing&lt;/strong&gt;, &lt;strong&gt;data lakes&lt;/strong&gt;, and &lt;strong&gt;modern lakehouse architecture&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is &lt;strong&gt;Data&lt;/strong&gt;?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data&lt;/strong&gt; is &lt;strong&gt;raw facts or raw information&lt;/strong&gt; collected from &lt;strong&gt;applications, users, and machines&lt;/strong&gt;. On its own, data has little meaning. When we &lt;strong&gt;process, clean, and analyze data&lt;/strong&gt;, it becomes &lt;strong&gt;useful information&lt;/strong&gt; for &lt;strong&gt;business decisions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples of data:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer name&lt;/strong&gt;, &lt;strong&gt;email&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order amount&lt;/strong&gt;, &lt;strong&gt;order time&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Website clicks&lt;/strong&gt;, &lt;strong&gt;error logs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sensor readings&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An e-commerce app stores every order as data. When analysts look at monthly sales trends and top-selling products, that processed data becomes insights.&lt;/p&gt;




&lt;h2&gt;
  
  
  Types of &lt;strong&gt;Data&lt;/strong&gt; (Structured, Semi-Structured, Unstructured)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Structured Data&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Semi-Structured Data&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Unstructured Data&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What it means&lt;/td&gt;
&lt;td&gt;Data stored in &lt;strong&gt;rows and columns&lt;/strong&gt; with a &lt;strong&gt;fixed schema&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;Data with &lt;strong&gt;some structure&lt;/strong&gt; (keys/tags), but no fixed table schema.&lt;/td&gt;
&lt;td&gt;Data with &lt;strong&gt;no predefined structure&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Where it is stored&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Relational Databases&lt;/strong&gt;, &lt;strong&gt;Data Warehouses&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Data Lakes&lt;/strong&gt;, modern warehouses&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Object storage&lt;/strong&gt;, file systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;How easy to query&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Very easy&lt;/strong&gt; with &lt;strong&gt;SQL&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Needs &lt;strong&gt;parsing/flattening&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Needs &lt;strong&gt;preprocessing/AI-ML&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Examples&lt;/td&gt;
&lt;td&gt;Customer table, Orders table&lt;/td&gt;
&lt;td&gt;JSON from APIs, Web logs, Avro/Parquet&lt;/td&gt;
&lt;td&gt;Images, videos, PDFs, emails&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
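&lt;p&gt;The table says semi-structured data needs parsing and flattening before it can be queried like a table. Here is a minimal Python sketch of that idea. The sample order record and the dotted column naming are assumptions made for this example; real pipelines usually rely on a library or query engine for this step.&lt;/p&gt;

```python
import json


def flatten(record: dict, parent: str = "") -> dict:
    """Turn nested keys into dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their columns.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Made-up JSON record, as an API might return it.
raw = '{"order_id": 1, "customer": {"name": "Salma", "email": "salma@example.com"}}'
print(flatten(json.loads(raw)))
# {'order_id': 1, 'customer.name': 'Salma', 'customer.email': 'salma@example.com'}
```

&lt;p&gt;After flattening, every key maps to a single value, so the record fits into one row of a table.&lt;/p&gt;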




&lt;h2&gt;
  
  
  &lt;strong&gt;Databases and Data Storage&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Databases&lt;/strong&gt; are systems used to &lt;strong&gt;store and manage structured data&lt;/strong&gt; for applications.&lt;br&gt;
&lt;strong&gt;Data storage&lt;/strong&gt; includes databases plus &lt;strong&gt;file systems&lt;/strong&gt; and &lt;strong&gt;cloud object storage&lt;/strong&gt; (for raw files).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Databases:&lt;/strong&gt; PostgreSQL, MySQL, Oracle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object Storage:&lt;/strong&gt; AWS S3, Azure ADLS, Google GCS&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;SQL &amp;amp; Relational Databases&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What is &lt;strong&gt;SQL&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;SQL (Structured Query Language)&lt;/strong&gt; is used to &lt;strong&gt;read and write data&lt;/strong&gt; in &lt;strong&gt;relational databases&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  What is a &lt;strong&gt;Relational Database (RDBMS)&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;An &lt;strong&gt;RDBMS&lt;/strong&gt; stores data in &lt;strong&gt;tables with relationships&lt;/strong&gt; (primary keys and foreign keys).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt; &lt;strong&gt;PostgreSQL&lt;/strong&gt;, MySQL, SQL Server&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;DDL&lt;/strong&gt; vs &lt;strong&gt;DML&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DDL (Data Definition Language)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Defines or changes &lt;strong&gt;table structure&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;CREATE, ALTER, DROP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DML (Data Manipulation Language)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reads and modifies &lt;strong&gt;data inside tables&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;INSERT, UPDATE, DELETE, SELECT&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Example (DDL):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example (DML):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Salma'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;OLTP&lt;/strong&gt; vs &lt;strong&gt;OLAP&lt;/strong&gt; (Databases vs Analytics)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;OLTP (Online Transaction Processing)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;OLAP (Online Analytical Processing)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Main purpose&lt;/td&gt;
&lt;td&gt;Run &lt;strong&gt;daily transactions&lt;/strong&gt; for apps&lt;/td&gt;
&lt;td&gt;Run &lt;strong&gt;analytics and reporting&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query pattern&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Many small, fast writes/reads&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Large scans and aggregations&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Current operational data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Historical, aggregated data&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical systems&lt;/td&gt;
&lt;td&gt;PostgreSQL, MySQL&lt;/td&gt;
&lt;td&gt;Snowflake, BigQuery, Redshift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example&lt;/td&gt;
&lt;td&gt;Placing an order&lt;/td&gt;
&lt;td&gt;Yearly sales analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;ACID Transactions&lt;/strong&gt; (Why Databases are Reliable)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Atomicity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A transaction is &lt;strong&gt;all-or-nothing&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consistency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data stays &lt;strong&gt;valid and correct&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Isolation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Concurrent transactions &lt;strong&gt;do not interfere&lt;/strong&gt; with each other.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Durability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Once saved, data &lt;strong&gt;will not be lost&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
If a payment fails halfway, &lt;strong&gt;Atomicity&lt;/strong&gt; ensures the whole transaction is rolled back.&lt;/p&gt;
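&lt;p&gt;The payment example can be tried with Python's built-in sqlite3 module. This is a small sketch under assumed names: the accounts table, the buyer and seller rows, and the simulated failure are all invented for the demonstration.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('buyer', 100), ('seller', 0)")
conn.commit()


def pay(conn, amount, fail_halfway):
    """Move money from buyer to seller as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'buyer'",
                (amount,),
            )
            if fail_halfway:
                raise RuntimeError("payment failed halfway")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'seller'",
                (amount,),
            )
    except RuntimeError:
        pass  # the debit above has already been rolled back


def balances(conn):
    """Return the current balances as a dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


pay(conn, 40, fail_halfway=True)
print(balances(conn))  # {'buyer': 100, 'seller': 0}

pay(conn, 40, fail_halfway=False)
print(balances(conn))  # {'buyer': 60, 'seller': 40}
```

&lt;p&gt;In the failed payment, the buyer was debited inside the transaction, but the rollback undid that change, so neither balance moved.&lt;/p&gt;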




&lt;h2&gt;
  
  
  &lt;strong&gt;Data Warehouse vs Data Lake&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Data Warehouse&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Data Lake&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data types&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Mostly structured&lt;/strong&gt; (modern warehouses also handle some semi-structured data)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Structured, semi-structured, unstructured&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Schema-on-write&lt;/strong&gt; (define before load)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Schema-on-read&lt;/strong&gt; (define at query time)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Higher storage cost&lt;/td&gt;
&lt;td&gt;Lower storage cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Main use&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;BI reports, dashboards&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Raw storage, ML/AI, exploration&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Examples&lt;/td&gt;
&lt;td&gt;Snowflake, Redshift, BigQuery&lt;/td&gt;
&lt;td&gt;AWS S3, Azure ADLS, Google GCS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Data Formats&lt;/strong&gt;: Avro vs Parquet vs ORC
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Storage Style&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Example use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Avro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Row-based&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Streaming, fast writes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kafka pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Parquet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Column-based&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Analytics, fast reads&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BI queries in Spark&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ORC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Column-based&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Analytics with compression&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hive/Spark&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Row-Based vs Column-Based Storage&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Row-Based Storage&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Column-Based Storage&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;How data is stored&lt;/td&gt;
&lt;td&gt;Entire &lt;strong&gt;rows together&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Same &lt;strong&gt;columns together&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;OLTP transactions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;OLAP analytics&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical systems&lt;/td&gt;
&lt;td&gt;PostgreSQL, MySQL&lt;/td&gt;
&lt;td&gt;BigQuery, Redshift (and Parquet as a file format)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example&lt;/td&gt;
&lt;td&gt;Fetch one customer record&lt;/td&gt;
&lt;td&gt;Aggregate one column across millions of rows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
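
&lt;p&gt;To make the difference concrete, here is a small sketch in plain Python. The records and field names are invented, and real engines store data on disk in far more elaborate ways; the point is only to show why one layout suits lookups and the other suits aggregates.&lt;/p&gt;

```python
# The same three customer records, first in a row layout.
rows = [
    {"id": 1, "name": "Asha", "amount": 120},
    {"id": 2, "name": "Ben", "amount": 80},
    {"id": 3, "name": "Chen", "amount": 200},
]

# Column layout: one list per column, values kept in the same order as the rows.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# OLTP-style lookup: a row store hands back the whole record at once.
customer = next(row for row in rows if row["id"] == 2)

# OLAP-style aggregate: a column store only needs to read the "amount" column.
total = sum(columns["amount"])

print(customer)
print(total)
```

&lt;p&gt;The lookup touches one complete record, while the aggregate never reads the &lt;code&gt;name&lt;/code&gt; or &lt;code&gt;id&lt;/code&gt; values at all.&lt;/p&gt;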




&lt;h2&gt;
  
  
  &lt;strong&gt;RDBMS (Row-Based) vs Columnar Databases&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;RDBMS (Row-Based)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Columnar Databases&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Workload&lt;/td&gt;
&lt;td&gt;Transactions&lt;/td&gt;
&lt;td&gt;Analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Writes&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reads (aggregations)&lt;/td&gt;
&lt;td&gt;Slower&lt;/td&gt;
&lt;td&gt;Very fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;BigQuery, Redshift&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Data Warehousing Concepts: Facts and Dimensions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What are &lt;strong&gt;Fact Tables&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fact tables&lt;/strong&gt; store &lt;strong&gt;measurable numbers&lt;/strong&gt; (metrics).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt; sales_amount, quantity, revenue&lt;/p&gt;

&lt;h3&gt;
  
  
  What are &lt;strong&gt;Dimension Tables&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Dimension tables&lt;/strong&gt; store &lt;strong&gt;descriptive attributes&lt;/strong&gt; to analyze facts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt; customer, product, date, location&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Types of Facts&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transactional Fact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One row per transaction&lt;/td&gt;
&lt;td&gt;Each order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Snapshot Fact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;State at a point in time&lt;/td&gt;
&lt;td&gt;Daily inventory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Accumulating Fact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tracks process over time&lt;/td&gt;
&lt;td&gt;Order lifecycle&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Characteristics of Fact vs Dimension Tables&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Fact Table&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Dimension Table&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What it stores&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Metrics (numbers)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Descriptions (attributes)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Size&lt;/td&gt;
&lt;td&gt;Very large&lt;/td&gt;
&lt;td&gt;Smaller&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keys&lt;/td&gt;
&lt;td&gt;Foreign keys to dimensions&lt;/td&gt;
&lt;td&gt;Primary keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example&lt;/td&gt;
&lt;td&gt;Sales fact&lt;/td&gt;
&lt;td&gt;Customer dimension&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
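
&lt;p&gt;A tiny star schema shows how the two table types work together. This is only a sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module; the table names &lt;code&gt;fact_sales&lt;/code&gt; and &lt;code&gt;dim_customer&lt;/code&gt; and the sample rows are assumptions made for the example.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: descriptive attributes, one row per customer.
CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    city TEXT
);
-- Fact table: metrics plus a foreign key pointing at the dimension.
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    quantity INTEGER,
    sales_amount REAL
);
INSERT INTO dim_customer VALUES (1, 'Asha', 'Hyderabad'), (2, 'Ben', 'Pune');
INSERT INTO fact_sales VALUES (1, 1, 2, 40.0), (2, 1, 1, 25.0), (3, 2, 5, 100.0);
""")

# Analyze the facts (sales_amount) by a dimension attribute (city).
revenue_by_city = conn.execute("""
    SELECT d.city, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_customer AS d ON d.customer_id = f.customer_id
    GROUP BY d.city
    ORDER BY d.city
""").fetchall()

print(revenue_by_city)
```

&lt;p&gt;The fact table holds the numbers, and the join to the dimension table supplies the labels used for grouping.&lt;/p&gt;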




&lt;h2&gt;
  
  
  &lt;strong&gt;Data Lakehouse Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source → Ingestion → Data Lake Storage → Lakehouse Layer → BI / ML / Analytics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the Lakehouse layer adds:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ACID transactions&lt;/strong&gt; for reliability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexing&lt;/strong&gt; for faster queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt; for governance and discovery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance optimizations&lt;/strong&gt; for analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Examples of Lakehouse Technologies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake (Databricks)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Hudi&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What is Informatica?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Informatica&lt;/strong&gt; is an &lt;strong&gt;enterprise ETL tool&lt;/strong&gt; used to &lt;strong&gt;extract, transform, and load data&lt;/strong&gt; from source systems into &lt;strong&gt;data warehouses or data lakes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
Move sales data from PostgreSQL → clean it → load into Snowflake.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final End-to-End Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OLTP databases&lt;/strong&gt; run daily business transactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OLAP systems (data warehouses)&lt;/strong&gt; support analytics and reporting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data lakes&lt;/strong&gt; store raw data of all types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lakehouse architecture&lt;/strong&gt; combines low-cost storage with fast analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Facts and dimensions&lt;/strong&gt; organize data for reporting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avro/Parquet/ORC&lt;/strong&gt; and &lt;strong&gt;row vs column storage&lt;/strong&gt; strongly influence read and write performance.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>beginners</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Traditional vs Modern Data Architecture</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 22 Feb 2026 00:47:51 +0000</pubDate>
      <link>https://dev.to/salma_aga/traditional-vs-modern-data-architecture-37cn</link>
      <guid>https://dev.to/salma_aga/traditional-vs-modern-data-architecture-37cn</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;In many companies, data comes from different systems like ERP, CRM, application databases, and web logs. This data is used for reports, dashboards, and business decisions. To use this data properly, we need a data architecture.&lt;/p&gt;

&lt;p&gt;There are two main types of data architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Traditional Data Architecture (ETL + Data Warehouse)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Modern Data Architecture (ELT + Data Lake + Lakehouse)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This document explains both approaches. It also explains why we use tools like Data Lake, Data Warehouse, Spark, Databricks, Delta Lake, Iceberg, Snowflake, BigQuery, Redshift, ADLS, GCS, S3, and Datadog.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. High-Level Data Flow
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data Sources → Ingestion → Data Lake → Processing → Lakehouse Tables → Data Warehouse → BI &amp;amp; Reports → Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data comes from source systems.&lt;/li&gt;
&lt;li&gt;Data is ingested (copied) into the platform.&lt;/li&gt;
&lt;li&gt;Raw data is stored in a data lake.&lt;/li&gt;
&lt;li&gt;Data is cleaned and transformed using processing tools.&lt;/li&gt;
&lt;li&gt;Clean and reliable tables are created.&lt;/li&gt;
&lt;li&gt;Final data is loaded into a data warehouse for reports.&lt;/li&gt;
&lt;li&gt;The full system is monitored using monitoring tools.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  3. Data Sources (Where data comes from)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ERP systems:&lt;/strong&gt; Finance, HR, inventory data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CRM systems:&lt;/strong&gt; Customer and sales data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OLTP databases:&lt;/strong&gt; Application transaction data (orders, payments)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web logs:&lt;/strong&gt; Website or app activity (clicks, errors, requests)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why we use them:&lt;/strong&gt;&lt;br&gt;
These systems run the business. They create the data that we later analyze.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When we use them:&lt;/strong&gt;&lt;br&gt;
All the time. These are live systems used daily by the business.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Traditional Data Architecture (ETL)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 What is Traditional Architecture?
&lt;/h3&gt;

&lt;p&gt;In traditional architecture, data is transformed &lt;strong&gt;before&lt;/strong&gt; it is loaded into the data warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Sources → ETL Tool → Data Warehouse → BI/Reports&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffk85ui4agzxowy3ixojn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffk85ui4agzxowy3ixojn.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 What is ETL?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ETL = Extract → Transform → Load&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract:&lt;/strong&gt; Take data from source systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform:&lt;/strong&gt; Clean the data, fix formats, remove duplicates, join tables, and calculate metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load:&lt;/strong&gt; Put the clean data into the data warehouse.&lt;/li&gt;
&lt;/ul&gt;
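
&lt;p&gt;The three steps above can be sketched in a few lines of Python. This is an illustration only: the sample rows, the &lt;code&gt;orders&lt;/code&gt; table, and the function names are assumptions, and an in-memory SQLite database stands in for the data warehouse.&lt;/p&gt;

```python
import sqlite3


def extract():
    """Return raw rows; in practice these would come from a source system."""
    return [
        {"order_id": "1", "amount": " 19.99 ", "country": "in"},
        {"order_id": "2", "amount": "5.00", "country": "US"},
        {"order_id": "1", "amount": " 19.99 ", "country": "in"},  # duplicate
    ]


def transform(raw_rows):
    """Fix formats and remove duplicates before anything is loaded."""
    seen, clean = set(), []
    for row in raw_rows:
        order_id = int(row["order_id"])
        if order_id in seen:
            continue  # skip rows we have already kept
        seen.add(order_id)
        clean.append(
            (order_id, float(row["amount"].strip()), row["country"].upper())
        )
    return clean


def load(conn, clean_rows):
    """Write only the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean_rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))
loaded = warehouse.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```

&lt;p&gt;Notice that the warehouse never sees the raw rows, only the output of &lt;code&gt;transform&lt;/code&gt;.&lt;/p&gt;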

&lt;h3&gt;
  
  
  4.3 Why Traditional Architecture was used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Data warehouses were expensive.&lt;/li&gt;
&lt;li&gt;Storage and compute were limited.&lt;/li&gt;
&lt;li&gt;Only clean data was allowed in the warehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4.4 Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Not easy to scale for big data.&lt;/li&gt;
&lt;li&gt;Raw data is not kept after transformation, so it cannot be reprocessed later.&lt;/li&gt;
&lt;li&gt;Not flexible for machine learning and advanced analytics.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Modern Data Architecture (ELT + Data Lake + Lakehouse)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 What is Modern Architecture?
&lt;/h3&gt;

&lt;p&gt;In modern architecture, raw data is first stored in a data lake. Transformations happen later using powerful compute engines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Sources → Ingestion → Data Lake → Transform (Spark/Databricks) → Lakehouse Tables → Data Warehouse → BI &amp;amp; ML → Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcrgjv0ieqyjjvnws6xz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcrgjv0ieqyjjvnws6xz.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 What is ELT?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ELT = Extract → Load → Transform&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract:&lt;/strong&gt; Take data from sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load:&lt;/strong&gt; Store raw data directly in the data lake.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform:&lt;/strong&gt; Clean and process data later using Spark or Databricks.&lt;/li&gt;
&lt;/ul&gt;
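
&lt;p&gt;Here is a small sketch of the ELT order of operations. It is not Spark or Databricks code: an in-memory SQLite database plays the role of the data lake, and the table names &lt;code&gt;raw_orders&lt;/code&gt; and &lt;code&gt;clean_orders&lt;/code&gt; are invented for the example.&lt;/p&gt;

```python
import json
import sqlite3

# Extract: raw payloads exactly as the source produced them.
raw_events = [
    '{"order_id": "1", "amount": " 19.99 ", "country": "in"}',
    '{"order_id": "2", "amount": "5.00", "country": "US"}',
]

lake = sqlite3.connect(":memory:")

# Load: store the raw payloads untouched.
lake.execute("CREATE TABLE raw_orders (payload TEXT)")
lake.executemany("INSERT INTO raw_orders VALUES (?)", [(e,) for e in raw_events])

# Transform: later, build a clean table from the raw copy.
lake.execute("CREATE TABLE clean_orders (order_id INTEGER, amount REAL, country TEXT)")
for (payload,) in lake.execute("SELECT payload FROM raw_orders").fetchall():
    record = json.loads(payload)
    lake.execute(
        "INSERT INTO clean_orders VALUES (?, ?, ?)",
        (
            int(record["order_id"]),
            float(record["amount"].strip()),
            record["country"].upper(),
        ),
    )
lake.commit()

raw_count = lake.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
clean = lake.execute("SELECT * FROM clean_orders ORDER BY order_id").fetchall()
print(raw_count, clean)
```

&lt;p&gt;The raw table is still there after the transform, so a new use case can reprocess it without going back to the source.&lt;/p&gt;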

&lt;h3&gt;
  
  
  5.3 Why Modern Architecture is used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cloud storage is cheap and scalable.&lt;/li&gt;
&lt;li&gt;We can store raw data and use it later for new use cases.&lt;/li&gt;
&lt;li&gt;We can support both analytics and machine learning.&lt;/li&gt;
&lt;li&gt;Compute can scale up and down based on need.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  6. Data Lake (S3, ADLS, GCS)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is a Data Lake?&lt;/strong&gt;&lt;br&gt;
A data lake is a storage system that stores raw data in any format (CSV, JSON, images, logs).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cheap storage&lt;/li&gt;
&lt;li&gt;Store raw data for future use&lt;/li&gt;
&lt;li&gt;Useful for big data and machine learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where we use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS S3 (AWS cloud)&lt;/li&gt;
&lt;li&gt;Azure ADLS (Azure cloud)&lt;/li&gt;
&lt;li&gt;Google GCS (GCP cloud)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. Data Warehouse (Snowflake, BigQuery, Redshift, Synapse)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is a Data Warehouse?&lt;/strong&gt;&lt;br&gt;
A data warehouse stores clean, structured data for analytics and reporting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast SQL queries&lt;/li&gt;
&lt;li&gt;Business reports and dashboards&lt;/li&gt;
&lt;li&gt;Used by analysts and managers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake&lt;/li&gt;
&lt;li&gt;Google BigQuery&lt;/li&gt;
&lt;li&gt;AWS Redshift&lt;/li&gt;
&lt;li&gt;Azure Synapse&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8. Data Lakehouse (Delta Lake, Apache Iceberg)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is a Lakehouse?&lt;/strong&gt;&lt;br&gt;
A lakehouse combines the low-cost storage of a data lake with the reliability of a data warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we use Delta Lake and Iceberg:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACID transactions (safe updates and deletes)&lt;/li&gt;
&lt;li&gt;Schema changes without breaking pipelines&lt;/li&gt;
&lt;li&gt;Time travel (see old versions of data)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where we use it:&lt;/strong&gt;&lt;br&gt;
On top of the data lake, usually with Databricks and Spark.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Processing Layer (Spark and Databricks)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is Spark?&lt;/strong&gt;&lt;br&gt;
Spark is a fast distributed engine for processing large datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Databricks?&lt;/strong&gt;&lt;br&gt;
Databricks is a platform that manages Spark and provides notebooks, clusters, and job scheduling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we use them:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To clean and transform large datasets&lt;/li&gt;
&lt;li&gt;To run batch and streaming jobs&lt;/li&gt;
&lt;li&gt;To build machine learning pipelines&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  10. File Formats (Avro, Parquet, ORC)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Avro:&lt;/strong&gt;&lt;br&gt;
Used for data movement and streaming. Good for schema evolution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parquet:&lt;/strong&gt;&lt;br&gt;
Column-based format. Very fast for analytics queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ORC:&lt;/strong&gt;&lt;br&gt;
Column-based format. Used in big data systems like Hive and Spark.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. OLTP vs OLAP
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OLTP:&lt;/strong&gt;&lt;br&gt;
Used by applications for daily transactions (orders, payments).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OLAP:&lt;/strong&gt;&lt;br&gt;
Used for analytics and reporting (data warehouse queries).&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Monitoring with Datadog
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is Datadog?&lt;/strong&gt;&lt;br&gt;
Datadog is a monitoring and observability tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor data pipelines&lt;/li&gt;
&lt;li&gt;Monitor Spark jobs&lt;/li&gt;
&lt;li&gt;Monitor servers and applications&lt;/li&gt;
&lt;li&gt;Get alerts when something fails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When we use it:&lt;/strong&gt;&lt;br&gt;
In production environments to keep the system healthy and reliable.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. ETL vs ELT
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;ETL (Traditional)&lt;/th&gt;
&lt;th&gt;ELT (Modern)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transform&lt;/td&gt;
&lt;td&gt;Before load&lt;/td&gt;
&lt;td&gt;After load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Data Warehouse&lt;/td&gt;
&lt;td&gt;Data Lake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use Cases&lt;/td&gt;
&lt;td&gt;Reports&lt;/td&gt;
&lt;td&gt;Reports + ML&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  14. Example End-to-End Use Case
&lt;/h2&gt;

&lt;p&gt;Data from ERP and CRM systems and web logs is ingested into a data lake on AWS S3. Raw data is stored in Parquet format. Spark on Databricks processes and cleans the data. Clean tables are stored using Delta Lake. Final analytics data is loaded into Snowflake. Business users use dashboards to view reports. Datadog monitors the pipelines and sends alerts when jobs fail.&lt;/p&gt;




&lt;h2&gt;
  
  
  15. Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Traditional architecture uses &lt;strong&gt;ETL + Data Warehouse&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Modern architecture uses &lt;strong&gt;ELT + Data Lake + Lakehouse&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Data Lake stores raw data.&lt;/li&gt;
&lt;li&gt;Data Warehouse stores clean data for reporting.&lt;/li&gt;
&lt;li&gt;Spark and Databricks handle large-scale processing.&lt;/li&gt;
&lt;li&gt;Delta Lake and Iceberg make data lakes reliable.&lt;/li&gt;
&lt;li&gt;Datadog monitors the entire system.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How I Learned AI by Building an Offline PDF Chatbot with Local LLMs</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Mon, 23 Jun 2025 17:33:14 +0000</pubDate>
      <link>https://dev.to/salma_aga/how-i-learned-ai-by-building-an-offline-pdf-chatbot-with-local-llms-52lk</link>
      <guid>https://dev.to/salma_aga/how-i-learned-ai-by-building-an-offline-pdf-chatbot-with-local-llms-52lk</guid>
      <description>&lt;p&gt;Hey everyone! I’m &lt;strong&gt;Shaik Salma Aga&lt;/strong&gt;, and I love learning by building. Instead of just reading theory, I built something that helped me &lt;strong&gt;understand AI practically&lt;/strong&gt; and also &lt;strong&gt;prepare better for my interviews&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let me walk you through what I built, how it works, and how you can try it too.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Goal: Learning AI by Building, Not Just Reading
&lt;/h2&gt;

&lt;p&gt;I didn’t just want to use AI tools. I wanted to &lt;strong&gt;build one from scratch&lt;/strong&gt; and see what happens under the hood.&lt;/p&gt;

&lt;p&gt;I was exploring concepts like embeddings, vector search, and local LLMs, but theory alone wasn’t sticking. So I built this project, &lt;strong&gt;an Offline PDF Analyzer&lt;/strong&gt;, to learn how documents are split, embedded, and searched, and how local models generate smart responses.&lt;/p&gt;

&lt;p&gt;This project became my practical journey into AI, and now it helps others too, especially those preparing for interviews.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Project Does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Upload one or more PDFs through a simple web UI.&lt;/li&gt;
&lt;li&gt;Ask your questions in simple English.&lt;/li&gt;
&lt;li&gt;The system reads and understands the content, then gives you a relevant answer from the document.&lt;/li&gt;
&lt;li&gt;Everything runs &lt;strong&gt;locally&lt;/strong&gt;: no internet or API keys needed.&lt;/li&gt;
&lt;li&gt;It can also count how many questions are in the PDF, which is useful for exam prep.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Upload a PDF on Machine Learning and ask: “What is the difference between supervised and unsupervised learning?”&lt;br&gt;
You get a clear, to-the-point answer pulled directly from the relevant section of the document instantly.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;Below is the complete flow of how the &lt;strong&gt;Offline PDF Analyzer&lt;/strong&gt; works behind the scenes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;PDF Upload&lt;/strong&gt;: The user uploads one or more PDF files through the UI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text Extraction&lt;/strong&gt;: The app reads all pages using &lt;code&gt;PyMuPDF&lt;/code&gt; and extracts clean text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunking&lt;/strong&gt;: Long text is split into overlapping chunks using &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; to preserve context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embeddings&lt;/strong&gt;: Each chunk is converted into a vector (a list of numbers) using &lt;code&gt;OllamaEmbeddings&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS Vector Search&lt;/strong&gt;: When a question is asked, the chunks most similar to it are retrieved with a fast vector similarity search (for example, cosine similarity or L2 distance).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Answer Generation&lt;/strong&gt;: The selected chunks are passed to a &lt;strong&gt;local LLM&lt;/strong&gt; (like &lt;code&gt;phi&lt;/code&gt;, &lt;code&gt;mistral&lt;/code&gt;, or &lt;code&gt;llama2&lt;/code&gt;) to generate the final answer.&lt;/li&gt;
&lt;/ol&gt;
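
&lt;p&gt;To show the idea behind steps 3 to 5, here is a toy version in plain Python. It is not the project's code: a fixed-size splitter stands in for &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt;, word counts stand in for &lt;code&gt;OllamaEmbeddings&lt;/code&gt;, and a brute-force cosine ranking stands in for FAISS. All function names are made up for the sketch.&lt;/p&gt;

```python
import math
import re
from collections import Counter


def split_with_overlap(text, chunk_size=60, overlap=20):
    """Cut text into fixed-size character chunks that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed(text):
    """Toy embedding: word counts. A real embedding model returns dense vectors."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(question, chunks, k=1):
    """Rank chunks by similarity to the question and keep the best k."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


pieces = split_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(pieces)

chunks = [
    "Supervised learning uses labeled data.",
    "Unsupervised learning finds patterns without labels.",
    "FAISS is a vector search library.",
]
best = top_chunks("What is FAISS?", chunks)
print(best)
```

&lt;p&gt;In the real pipeline, the chunks returned by the search are what get passed to the local LLM as context for the answer.&lt;/p&gt;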




&lt;h2&gt;
  
  
  Choose Your Local AI Model
&lt;/h2&gt;

&lt;p&gt;You can select models like &lt;strong&gt;phi&lt;/strong&gt;, &lt;strong&gt;mistral&lt;/strong&gt;, or &lt;strong&gt;llama2&lt;/strong&gt;, all running locally on your laptop through &lt;strong&gt;Ollama&lt;/strong&gt; for fast and efficient results.&lt;/p&gt;




&lt;h2&gt;
  
  
  System Design Diagram: How PDF Analyzer Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpq0cd7izfmujj2a9snj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpq0cd7izfmujj2a9snj.png" alt=" " width="542" height="339"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech Stack I Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt;: For building a user-friendly frontend with just a few lines of Python.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyMuPDF (fitz)&lt;/strong&gt;: To extract text from all pages of uploaded PDFs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangChain&lt;/strong&gt;: To handle end-to-end chaining from query to retrieval to LLM response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RecursiveCharacterTextSplitter&lt;/strong&gt;: Breaks the text into chunks with overlaps, so context is preserved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt;: Runs local LLMs (phi, mistral, llama2) directly on your machine without internet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS&lt;/strong&gt;: A super-fast vector search library to retrieve relevant chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt;: For the backend logic, caching, state management, and pre-processing.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Challenges I Faced &amp;amp; How I Solved Them
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Wrong Answers from Wrong Sections
&lt;/h3&gt;

&lt;p&gt;In the beginning, it showed answers from the wrong part of the PDF, which didn’t match the question and made things confusing.&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: I adjusted the chunk overlap size, used better metadata like page numbers and source file names, and added tagging.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Answers Coming from the Previous PDF
&lt;/h3&gt;

&lt;p&gt;Even after uploading a different PDF, it still showed answers from the old one. &lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: I added &lt;strong&gt;file hashing&lt;/strong&gt; to detect newly uploaded PDFs. If the incoming file is different from the previous one, the system discards the old data and processes the new file from scratch.&lt;/p&gt;
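
&lt;p&gt;The idea can be sketched like this. It is a simplified illustration, not the code from my repo: &lt;code&gt;PdfIndexCache&lt;/code&gt; is a hypothetical class, and the string stored in &lt;code&gt;index&lt;/code&gt; is a placeholder for the real extraction and embedding work.&lt;/p&gt;

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Fingerprint the uploaded file by its content, not its name."""
    return hashlib.sha256(data).hexdigest()


class PdfIndexCache:
    """Hypothetical holder for whatever was built from the last upload."""

    def __init__(self):
        self.current_hash = None
        self.index = None

    def update(self, data: bytes) -> bool:
        """Rebuild only when the content changed. Returns True if rebuilt."""
        new_hash = file_hash(data)
        if new_hash == self.current_hash:
            return False  # same file again: keep the existing index
        # Different file: discard the old data and process from scratch.
        self.current_hash = new_hash
        self.index = "index for " + new_hash[:8]  # placeholder for real processing
        return True


cache = PdfIndexCache()
first = cache.update(b"first pdf bytes")
repeat = cache.update(b"first pdf bytes")
second = cache.update(b"second pdf bytes")
print(first, repeat, second)
```

&lt;p&gt;Hashing the bytes means a renamed copy of the same PDF is still recognized as the same file.&lt;/p&gt;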

&lt;h3&gt;
  
  
  3. Short Queries Gave Confusing Answers
&lt;/h3&gt;

&lt;p&gt;If I typed "types?" or "examples?", the app didn’t understand what I meant. &lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: I added a step that automatically expands short questions into full ones. For example, if someone types &lt;strong&gt;"types?"&lt;/strong&gt;, it changes to &lt;strong&gt;"What are the different types mentioned in the document?"&lt;/strong&gt; so the model understands better.&lt;/p&gt;
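
&lt;p&gt;A rough sketch of that kind of expansion is below. The function name &lt;code&gt;expand_query&lt;/code&gt;, the three-word cutoff, and the single template are assumptions for illustration; the actual rules in the app may differ.&lt;/p&gt;

```python
def expand_query(query: str, min_words: int = 3) -> str:
    """Turn a very short query into a full question; leave longer ones alone."""
    cleaned = query.strip()
    words = cleaned.rstrip("?").split()
    if not words or len(words) >= min_words:
        return cleaned
    topic = " ".join(words).lower()
    return f"What are the different {topic} mentioned in the document?"


short = expand_query("types?")
full = expand_query("What is supervised learning?")
print(short)
print(full)
```

&lt;p&gt;A single template like this works for nouns such as "types" or "examples"; other kinds of short queries would need their own templates.&lt;/p&gt;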

&lt;h3&gt;
  
  
  4. No Info on Where the Answer Came From
&lt;/h3&gt;

&lt;p&gt;I wasn’t sure if the answer was right because it didn’t show where in the PDF it found the info.&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Now it shows the &lt;strong&gt;PDF name&lt;/strong&gt; and &lt;strong&gt;page number&lt;/strong&gt; where the answer came from, and you can &lt;strong&gt;click to see more details&lt;/strong&gt; if you want.&lt;/p&gt;




&lt;h2&gt;
  
  
  Techniques I Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;@st.cache_data&lt;/code&gt;: To avoid reloading the same PDF again and again.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Hashing&lt;/strong&gt;: So that the app resets only when a new PDF is uploaded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session State&lt;/strong&gt;: Used in Streamlit to store user-uploaded files and questions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regex Matching&lt;/strong&gt;: To support question formats like “How many questions are in this PDF?”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Templates&lt;/strong&gt;: Help the model understand and answer better when the user's question is short or unclear.&lt;/li&gt;
&lt;/ul&gt;
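
&lt;p&gt;As an example of the regex idea, here is a minimal sketch. Both patterns are assumptions: one detects a "how many questions" request, the other counts lines that end with a question mark. The real app may use different patterns.&lt;/p&gt;

```python
import re

# Does the user want a count instead of a normal answer?
COUNT_INTENT = re.compile(r"\bhow many questions\b", re.IGNORECASE)

# A "question" here is any single line that ends with a question mark.
QUESTION_LINE = re.compile(r"^[^\n]*\?[ \t]*$", re.MULTILINE)


def is_count_request(user_query: str) -> bool:
    return bool(COUNT_INTENT.search(user_query))


def count_questions(text: str) -> int:
    return len(QUESTION_LINE.findall(text))


pdf_text = "1. What is ML?\nML is a field of AI.\n2. What is AI?\nDone."
wants_count = is_count_request("How many questions are in this PDF?")
found = count_questions(pdf_text)
print(wants_count, found)
```

&lt;p&gt;A check like this lets the app answer counting requests directly from the extracted text instead of sending them to the LLM.&lt;/p&gt;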




&lt;h2&gt;
  
  
  Any Frontend?
&lt;/h2&gt;

&lt;p&gt;Yes! I made a clean and user-friendly interface using &lt;strong&gt;Streamlit&lt;/strong&gt; that makes it easy to upload PDFs and get answers quickly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose your preferred LLM (phi / mistral / llama2)&lt;/li&gt;
&lt;li&gt;Upload one or more PDFs&lt;/li&gt;
&lt;li&gt;Ask your question&lt;/li&gt;
&lt;li&gt;See the answer + source (page number + filename)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No delays, no registration: everything happens on your own system.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;Here’s what I want to add next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PDF Summarizer&lt;/strong&gt;: Get a quick summary of the whole PDF.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export Chat History&lt;/strong&gt;: Save your Q&amp;amp;A for later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Find All Questions&lt;/strong&gt;: List all questions found inside the PDF.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Tech Terms
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chunking&lt;/strong&gt;: Breaking a big document into small, readable parts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embeddings&lt;/strong&gt;: Turning text into numbers so that the model understands meaning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS&lt;/strong&gt;: Finds the best match for your question from the chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local LLMs&lt;/strong&gt;: Small AI models running on your laptop (no internet needed).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangChain&lt;/strong&gt;: Connects everything (PDFs, questions, answers) in one neat pipeline.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Interview Questions You Can Expect
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;How does chunk overlap affect retrieval quality?&lt;/li&gt;
&lt;li&gt;What’s the role of FAISS in a RAG pipeline?&lt;/li&gt;
&lt;li&gt;Why are prompt templates useful in real-world applications?&lt;/li&gt;
&lt;li&gt;How do you make vector indexes update-safe when files change?&lt;/li&gt;
&lt;li&gt;What are the trade-offs of using local LLMs vs cloud APIs?&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This project started as a way to &lt;strong&gt;learn AI deeply&lt;/strong&gt; by building something useful. It taught me how to use embeddings, vector search, local LLMs, and chaining tools, all while helping me with interview prep.&lt;/p&gt;

&lt;p&gt;If you want to learn by doing, start small, build something real, and break things.&lt;/p&gt;

&lt;p&gt;Let’s keep learning. Let’s keep building.&lt;br&gt;&lt;br&gt;
✍️&lt;strong&gt;Shaik Salma Aga&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://github.com/ShaikSalmaAga/offline-pdf-analyzer" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why Agentic AI Changed My Job Prep Journey</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Tue, 03 Jun 2025 05:02:15 +0000</pubDate>
      <link>https://dev.to/salma_aga/-title-why-agentic-ai-changed-my-job-prep-journey--1hnk</link>
      <guid>https://dev.to/salma_aga/-title-why-agentic-ai-changed-my-job-prep-journey--1hnk</guid>
      <description>&lt;p&gt;Hi everyone! I'm &lt;strong&gt;Salma&lt;/strong&gt;, a student and software engineer preparing for full-time roles. While applying for jobs and preparing for interviews, I realized something big:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Just knowing how to code is no longer enough."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Today’s tech world is changing fast. We see &lt;strong&gt;AI everywhere&lt;/strong&gt;, and one term you’ll hear again and again is &lt;strong&gt;Agentic AI&lt;/strong&gt;. Some people know what it is, many don’t. But if you’re a student or professional looking for a job, understanding Agentic AI gives you a huge advantage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s imagine you're building your own travel assistant app
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Traditional AI (like ChatGPT):
&lt;/h3&gt;

&lt;p&gt;You: "Book a flight to Delhi."&lt;br&gt;&lt;br&gt;
AI: "Sure. Please tell me the date, airline, timing, etc."&lt;/p&gt;

&lt;h3&gt;
  
  
  Agentic AI:
&lt;/h3&gt;

&lt;p&gt;You: "I need to be in Delhi next week for a conference."&lt;/p&gt;

&lt;p&gt;AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checks your calendar for free days&lt;/li&gt;
&lt;li&gt;Suggests flight options&lt;/li&gt;
&lt;li&gt;Books your ticket&lt;/li&gt;
&lt;li&gt;Adds it to your calendar&lt;/li&gt;
&lt;li&gt;Sends you a reminder and even books your cab&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t just AI that responds. &lt;strong&gt;It’s AI that acts on its own.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Agentic AI?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agentic AI&lt;/strong&gt; is artificial intelligence that &lt;strong&gt;sets goals&lt;/strong&gt;, &lt;strong&gt;makes decisions&lt;/strong&gt;, &lt;strong&gt;takes action&lt;/strong&gt;, and &lt;strong&gt;learns&lt;/strong&gt;, all on its own.&lt;/p&gt;

&lt;p&gt;It doesn’t wait for a new prompt at every step. It’s like &lt;strong&gt;hiring a junior employee&lt;/strong&gt; who knows what to do next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Traditional AI vs Agentic AI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Traditional AI&lt;/strong&gt; works based on prompts. You give it instructions, it gives an output.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Agentic AI&lt;/strong&gt; works based on goals. You give it a goal, and it figures out how to reach it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Difference Between Traditional AI and Agentic AI
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Traditional AI&lt;/th&gt;
&lt;th&gt;Agentic AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Needs prompts&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can act on goals&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decision-making&lt;/td&gt;
&lt;td&gt;Basic logic&lt;/td&gt;
&lt;td&gt;Complex reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example&lt;/td&gt;
&lt;td&gt;Chatbot&lt;/td&gt;
&lt;td&gt;Calendar + Travel Manager&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Lifecycle of Agentic AI
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5l10k5a1lu9pxg8xrwg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5l10k5a1lu9pxg8xrwg.png" alt=" " width="270" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Perceive&lt;/strong&gt; – Collects data (emails, APIs, sensors)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reason&lt;/strong&gt; – Understands the task and plans next steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Act&lt;/strong&gt; – Executes using APIs and tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learn&lt;/strong&gt; – Evaluates and improves its performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaborate&lt;/strong&gt; – Works with humans or other agents&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  How Agentic AI Solves Customer Support Issues
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Perceive: Reads an angry customer email&lt;/li&gt;
&lt;li&gt;Reason: Understands it’s about a delayed shipment&lt;/li&gt;
&lt;li&gt;Act: Sends an apology and discount coupon&lt;/li&gt;
&lt;li&gt;Learn: Tracks response from customer&lt;/li&gt;
&lt;li&gt;Collaborate: Notifies human agent if unresolved&lt;/li&gt;
&lt;/ul&gt;
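
&lt;p&gt;The five steps above can be sketched as a small Python class. This is only an illustration under stated assumptions: &lt;code&gt;SupportAgent&lt;/code&gt; and its method names are hypothetical, a keyword rule stands in for the LLM, and no real email API is called.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class SupportAgent:
    """Toy agent. Every name here is hypothetical; no real LLM or email API is used."""
    history: list = field(default_factory=list)

    def perceive(self, email: str) -> str:
        # Perceive: normalise the incoming message.
        return email.strip().lower()

    def reason(self, text: str) -> str:
        # Reason: a keyword rule stands in for the LLM's understanding.
        if "late" in text or "delayed" in text:
            return "delayed_shipment"
        return "unknown"

    def act(self, issue: str) -> str:
        # Act: choose a response; a real system would call an email API here.
        if issue == "delayed_shipment":
            return "apology_with_coupon"
        return "no_action"

    def learn(self, issue: str, action: str) -> None:
        # Learn: record the outcome so later runs can be evaluated.
        self.history.append((issue, action))

    def collaborate(self, action: str) -> bool:
        # Collaborate: escalate to a human when the agent could not help.
        return action == "no_action"

    def handle(self, email: str) -> dict:
        text = self.perceive(email)
        issue = self.reason(text)
        action = self.act(issue)
        self.learn(issue, action)
        return {"issue": issue, "action": action, "escalate": self.collaborate(action)}


agent = SupportAgent()
result = agent.handle("My order is DELAYED again!")
print(result)
```

&lt;p&gt;Keeping each stage in its own method makes it easier to swap the keyword rule for a real model later without touching the rest of the loop.&lt;/p&gt;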

&lt;h2&gt;
  
  
  Types of Agentic AI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Single-Agent System&lt;/strong&gt; : One agent handles everything.&lt;br&gt;&lt;br&gt;
Example: Budget manager bot that tracks, predicts, and alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Agent System&lt;/strong&gt; : Several agents with different responsibilities.&lt;br&gt;&lt;br&gt;
Example: Email agents, where one reads, another replies, and another logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal-Oriented Agent&lt;/strong&gt; : Given a goal, it plans and acts.&lt;br&gt;&lt;br&gt;
Example: “Grow Instagram to 5K followers.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reactive Agent&lt;/strong&gt; : Reacts quickly but doesn’t plan ahead.&lt;br&gt;&lt;br&gt;
Example: Auto-braking system in cars.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deliberative Agent&lt;/strong&gt; : Thinks and reasons before acting.&lt;br&gt;&lt;br&gt;
Example: Schedules meetings based on mood, urgency, and history.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Build Agentic AI
&lt;/h2&gt;

&lt;p&gt;To build an Agentic AI system, you begin with a &lt;strong&gt;frontend&lt;/strong&gt; that accepts input from users. The request is handled by a &lt;strong&gt;backend&lt;/strong&gt; which forwards the data to a &lt;strong&gt;language model (LLM)&lt;/strong&gt; such as GPT-4 or Claude. The LLM reasons about the task and initiates actions. These actions may include calling APIs or updating systems. Context or memory is stored using vector databases. Results and state changes are saved in a storage system like PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrx9w5rw8rz5o6m50dbk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrx9w5rw8rz5o6m50dbk.png" alt=" " width="570" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Agentic AI Can Automate Resume Screening
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Recruiter uploads resumes on the web interface&lt;/li&gt;
&lt;li&gt;Backend forwards data to the LLM&lt;/li&gt;
&lt;li&gt;LLM ranks the candidates based on fit&lt;/li&gt;
&lt;li&gt;Memory layer remembers past hiring preferences&lt;/li&gt;
&lt;li&gt;Action layer sends top resumes to HR&lt;/li&gt;
&lt;li&gt;PostgreSQL stores rankings and history&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Components Used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: HTML, React
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: Python (Flask, FastAPI)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt;: GPT-4, Claude, LLaMA
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: FAISS, Pinecone
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actions&lt;/strong&gt;: APIs, Zapier, CRMs
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: PostgreSQL, Redis
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How LangChain + Agentic AI Works
&lt;/h2&gt;

&lt;p&gt;This diagram shows how an &lt;strong&gt;Agentic AI system&lt;/strong&gt; works when you build it using &lt;strong&gt;LangChain&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficn2x7q8a8aj0oyasqdu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficn2x7q8a8aj0oyasqdu.png" alt=" " width="412" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Input&lt;/strong&gt; : The user gives a request.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt; “Remind me about my meeting and send a message if I’m late.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reasoning / Planning&lt;/strong&gt; : The system now goes into &lt;strong&gt;thinking mode&lt;/strong&gt;.  It uses a smart model (like GPT-4 or Claude) to figure out what to do next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt; : Based on the plan, it performs the actual work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checks your calendar&lt;/li&gt;
&lt;li&gt;Sends messages&lt;/li&gt;
&lt;li&gt;Searches the web&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Uses Tools&lt;/strong&gt; : To complete tasks, the AI uses different tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Web Search&lt;/strong&gt; to gather new information
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Calls&lt;/strong&gt; to apps like your calendar or email
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databases, Zapier, or CRMs&lt;/strong&gt; to interact with systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Memory / Storage&lt;/strong&gt; : After doing the task, it &lt;strong&gt;stores what happened&lt;/strong&gt; for future reference so it can learn and improve next time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Back to User or Move to Next Task&lt;/strong&gt; : It either&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updates the user about the result
&lt;/li&gt;
&lt;li&gt;Or starts working on the next goal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This full loop (User → Plan → Act → Tools → Back to User) is what makes Agentic AI powerful. It’s not just replying like a chatbot. It’s doing real work for you, like a smart digital assistant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advantages of Agentic AI
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Proactive and autonomous&lt;/li&gt;
&lt;li&gt;Learns and adapts over time&lt;/li&gt;
&lt;li&gt;Integrates with tools and systems&lt;/li&gt;
&lt;li&gt;Can collaborate with other agents or humans&lt;/li&gt;
&lt;li&gt;Reduces repetitive human work&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Drawbacks of Agentic AI
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Risk of incorrect actions due to bad data&lt;/li&gt;
&lt;li&gt;Hard to debug errors in multi-step logic&lt;/li&gt;
&lt;li&gt;Requires safeguards and human override&lt;/li&gt;
&lt;li&gt;Complexity in design and testing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What If Agentic AI Fails?
&lt;/h2&gt;

&lt;p&gt;Failures can occur. Here's how to make systems robust:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task Queues&lt;/strong&gt;: Split large tasks into traceable chunks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session Tokens&lt;/strong&gt;: Avoid confusion between user sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Templates&lt;/strong&gt;: Keep communication consistent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalation Paths&lt;/strong&gt;: Alert humans when automation fails&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Failure Example:
&lt;/h3&gt;

&lt;p&gt;If a meeting booking fails due to a calendar API error:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retry booking&lt;/li&gt;
&lt;li&gt;If it fails again, alert the user and log the error&lt;/li&gt;
&lt;/ul&gt;
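
&lt;p&gt;A minimal sketch of that retry-then-escalate path, assuming the booking call and the user notification are plain Python callables. &lt;code&gt;book_with_retry&lt;/code&gt;, &lt;code&gt;book&lt;/code&gt;, and &lt;code&gt;notify_user&lt;/code&gt; are hypothetical names, and the failing calendar API is faked for the demo.&lt;/p&gt;

```python
import logging

logger = logging.getLogger("agent")


def book_with_retry(book, notify_user, max_attempts=2):
    """Try `book` up to `max_attempts` times, then alert the user and log.

    `book` and `notify_user` are hypothetical callables supplied by the caller.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return book()
        except Exception as error:  # a real system would catch the API's own error type
            last_error = error
            logger.warning("booking attempt %d failed: %s", attempt, error)
    # Every attempt failed: escalate to the user and record the failure.
    notify_user(f"Could not book your meeting: {last_error}")
    logger.error("booking gave up after %d attempts", max_attempts)
    return None


# Usage with a fake calendar API that always fails.
alerts = []


def failing_calendar_api():
    raise ConnectionError("calendar API unavailable")


outcome = book_with_retry(failing_calendar_api, alerts.append)
print(outcome, alerts)
```

&lt;p&gt;Capping the number of attempts matters here: an agent that retries forever hides the failure instead of handing it to a human.&lt;/p&gt;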

&lt;h2&gt;
  
  
  Where Is Agentic AI Used Today?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Salesforce – AI customer support agents
&lt;/li&gt;
&lt;li&gt;Hippocratic AI – Medical virtual assistants
&lt;/li&gt;
&lt;li&gt;Ema AI – Business workflow automation
&lt;/li&gt;
&lt;li&gt;Juna – Factory control agents
&lt;/li&gt;
&lt;li&gt;Jasper + HubSpot – AI-powered marketing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AI Agents vs Agentic AI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AI Agent&lt;/strong&gt; : Acts only after manual user input.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Gmail Smart Reply: you click a suggestion, and it sends.&lt;/p&gt;

&lt;h3&gt;
  
  
  Difference Between AI Agents and Agentic AI
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;AI Agents&lt;/th&gt;
&lt;th&gt;Agentic AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;User initiated&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Goal planning&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-step task&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning ability&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Evolution of AI :&lt;/strong&gt; AI has progressed in 3 major stages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme7ccflglnohcgnyo2y8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme7ccflglnohcgnyo2y8.png" alt=" " width="510" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Predictive AI :&lt;/strong&gt; Forecasting the future&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Credit scoring, fraud detection&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generative AI :&lt;/strong&gt; Creating new content&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt; ChatGPT, DALL·E, MidJourney&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agentic AI :&lt;/strong&gt; Thinking, planning, acting&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt; AI assistant managing tasks and meetings&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Agentic AI is not just a buzzword. It’s a &lt;strong&gt;career game-changer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you're a student or developer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn the lifecycle of agentic systems
&lt;/li&gt;
&lt;li&gt;Build a real mini-project (e.g. with LangChain)
&lt;/li&gt;
&lt;li&gt;Write about it or share your GitHub
&lt;/li&gt;
&lt;li&gt;Talk about it in interviews
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✍️ Written by Shaik Salma Aga&lt;/p&gt;




</description>
    </item>
    <item>
      <title>How I Built My Own RAG Chatbot with Local LLMs (And the Roadblocks That Taught Me More Than the Code)</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sat, 31 May 2025 20:25:53 +0000</pubDate>
      <link>https://dev.to/salma_aga/how-i-built-my-own-rag-chatbot-with-local-llms-and-the-roadblocks-that-taught-me-more-than-the-3kmd</link>
      <guid>https://dev.to/salma_aga/how-i-built-my-own-rag-chatbot-with-local-llms-and-the-roadblocks-that-taught-me-more-than-the-3kmd</guid>
      <description>&lt;p&gt;A while back, I wrote a &lt;strong&gt;beginner-to-expert guide&lt;/strong&gt; on &lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt;. That article was all theory. How RAG works, the difference between &lt;strong&gt;sparse and dense embeddings&lt;/strong&gt;, and why it’s powerful.&lt;/p&gt;

&lt;p&gt;This time, I wanted to get my hands dirty. I wanted to build something real.&lt;/p&gt;

&lt;p&gt;So I built a working &lt;strong&gt;RAG chatbot&lt;/strong&gt;. Completely offline. Locally.&lt;/p&gt;

&lt;p&gt;Let me walk you through the full journey: what I built, how it works, and what went wrong (and how I fixed it).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why I Ran It Locally&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This wasn’t about saving money or staying private. It was about &lt;strong&gt;learning&lt;/strong&gt;: raw, hands-on, in-depth learning.&lt;/p&gt;

&lt;p&gt;I didn’t want to just connect APIs and feel like a builder. I wanted to:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;• Understand how text becomes vectors&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Debug retrieval when it breaks&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Run a model myself and see how it responds&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wanted to learn the hard way, and &lt;strong&gt;running locally was the best way&lt;/strong&gt; to make sure I did.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;My Project: PDF Q&amp;amp;A Chatbot (All Offline)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I had one clear goal: &lt;strong&gt;Ask questions about a PDF and get meaningful answers without an internet connection.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I used a document called &lt;code&gt;Evolution_of_AI.pdf&lt;/code&gt;. I asked questions like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"What are the phases in AI development?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The chatbot searched the PDF, found the right section, fed it to a local LLM, and gave me a perfect answer.&lt;/p&gt;

&lt;p&gt;All offline.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;System Design Diagram: How Offline RAG Chatbot Works&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fds1qkv6dfh3bp2y9eqix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fds1qkv6dfh3bp2y9eqix.png" alt=" " width="406" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s the process:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User sends a question to the &lt;strong&gt;RAG chatbot&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The chatbot uses &lt;strong&gt;PyPDFLoader&lt;/strong&gt; to load the PDF.&lt;/li&gt;
&lt;li&gt;It splits the text using &lt;strong&gt;RecursiveCharacterTextSplitter&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Text chunks are converted to vectors via &lt;strong&gt;HuggingFace Embeddings&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Vectors are stored and retrieved using &lt;strong&gt;FAISS&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The top relevant chunks are passed to a local &lt;strong&gt;LLM&lt;/strong&gt; via &lt;strong&gt;Ollama&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The final answer is shown to the user.&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Tech Stack I Used&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;PyPDFLoader:&lt;/strong&gt; Used for extracting raw text from the PDF so the bot can "read" it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RecursiveCharacterTextSplitter:&lt;/strong&gt; It ensures that even long paragraphs are broken into manageable, overlapping pieces that preserve meaning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HuggingFaceEmbeddings:&lt;/strong&gt; Converts those text chunks into number lists (vectors) that reflect context, not just words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FAISS:&lt;/strong&gt; A lightning-fast search tool that finds which vectors (chunks) are closest to the question vector.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ollama:&lt;/strong&gt; Runs lightweight models like &lt;code&gt;phi&lt;/code&gt; on your machine, no cloud needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangChain:&lt;/strong&gt; The backbone. It handles all connections from question to document to model and back.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;The Hidden Struggles and My Fixes&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Empty Answers or Garbage Output&lt;/strong&gt;&lt;br&gt;
My initial PDF had just one sentence, which was not enough for meaningful retrieval.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; I created a structured PDF (&lt;code&gt;Evolution_of_AI.pdf&lt;/code&gt;) with real content.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wrong Chunks Being Retrieved&lt;/strong&gt;&lt;br&gt;
I asked about AI phases but got results about NLP techniques.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; I added more chunk overlap, changed the embedding model, and tagged the chunks with extra metadata.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deprecation Warnings in LangChain&lt;/strong&gt;&lt;br&gt;
The deprecated &lt;code&gt;.run()&lt;/code&gt; method triggered warnings and stopped working for me.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; I switched to the &lt;code&gt;.invoke()&lt;/code&gt; method, per the latest LangChain docs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ollama Crashes with Heavy Models&lt;/strong&gt;&lt;br&gt;
Running models like Mistral overloaded my RAM.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Downgraded to &lt;code&gt;phi&lt;/code&gt;, a lighter model that worked well locally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Change After Updating PDF&lt;/strong&gt;&lt;br&gt;
I changed the PDF but still got answers from the old one.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; I cleared the FAISS index and re-embedded everything.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Short or Vague Queries Confused the Bot&lt;/strong&gt;&lt;br&gt;
“Phases?” returned irrelevant content.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; I used prompt templates to expand such queries into full sentences automatically.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
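
&lt;p&gt;For the stale-index problem (item 5), one way to avoid forgetting the rebuild is to fingerprint the PDF and compare it before answering. This is a sketch and not the project's actual code: the helper names and the marker file are assumptions, and the FAISS rebuild itself is left out.&lt;/p&gt;

```python
import hashlib
import tempfile
from pathlib import Path


def file_fingerprint(path):
    """Return the SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def index_is_stale(pdf_path, fingerprint_path):
    """True when the PDF changed since the fingerprint was last saved."""
    marker = Path(fingerprint_path)
    if not marker.exists():
        return True
    return marker.read_text().strip() != file_fingerprint(pdf_path)


def mark_index_fresh(pdf_path, fingerprint_path):
    """Call this right after rebuilding the index (hypothetical helper)."""
    Path(fingerprint_path).write_text(file_fingerprint(pdf_path))


# Demo with a throwaway file standing in for the PDF.
with tempfile.TemporaryDirectory() as folder:
    pdf = Path(folder) / "doc.pdf"
    marker = Path(folder) / "doc.sha256"
    pdf.write_bytes(b"version 1")
    before_build = index_is_stale(pdf, marker)   # no marker yet
    mark_index_fresh(pdf, marker)
    after_build = index_is_stale(pdf, marker)    # file unchanged
    pdf.write_bytes(b"version 2")
    after_edit = index_is_stale(pdf, marker)     # file changed

print(before_build, after_build, after_edit)
```

&lt;p&gt;Hashing the file contents rather than checking the modification time means a re-saved but identical PDF does not trigger a needless re-embedding.&lt;/p&gt;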




&lt;h3&gt;
  
  
  &lt;strong&gt;Technical Bits Explained Simply&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Chunking&lt;/strong&gt;&lt;br&gt;
Breaks large documents into overlapping sections so important parts aren’t lost during processing.&lt;/p&gt;
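
&lt;p&gt;To make the overlap idea concrete, here is a simplified character-based splitter. It is not how &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; works internally (that one prefers to split on separators such as paragraphs and sentences); &lt;code&gt;chunk_text&lt;/code&gt; is a hypothetical helper that only shows how neighbouring chunks share text.&lt;/p&gt;

```python
def chunk_text(text, chunk_size, overlap):
    """Split `text` into pieces of `chunk_size` characters that share `overlap` characters."""
    # The overlap must be at least 0 and strictly smaller than the chunk size.
    if overlap not in range(chunk_size):
        raise ValueError("overlap must be between 0 and chunk_size - 1")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

&lt;p&gt;With a chunk size of 4 and an overlap of 2, each chunk repeats the last two characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.&lt;/p&gt;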

&lt;p&gt;&lt;strong&gt;Embeddings&lt;/strong&gt;&lt;br&gt;
Turns sentences into numbers that represent meaning. That way, "vacation" and "holiday" look nearly the same to the machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cosine Similarity&lt;/strong&gt;&lt;br&gt;
A math trick to check how similar two vectors (questions and chunks) are. Smaller angle = better match.&lt;/p&gt;
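
&lt;p&gt;The calculation itself is short. In the sketch below the three-dimensional vectors are made up for illustration (real embedding models produce hundreds of dimensions), and &lt;code&gt;cosine_similarity&lt;/code&gt; is written by hand rather than taken from a library.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings": the first two point in nearly the same direction.
vacation = [0.9, 0.8, 0.1]
holiday = [0.8, 0.9, 0.2]
invoice = [0.1, 0.2, 0.9]

print(cosine_similarity(vacation, holiday))
print(cosine_similarity(vacation, invoice))
```

&lt;p&gt;A score close to 1 means a small angle between the vectors, so the first pair scores much higher than the second.&lt;/p&gt;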

&lt;p&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;br&gt;
A tool that finds which chunks are most similar to the question super quickly.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;LangChain&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;LangChain simplifies the complex plumbing. It:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;• Takes your question&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Converts it to a vector&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Finds the most relevant document chunks via FAISS&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Sends it all to the LLM&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Collects and returns the final answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All without you needing to manually stitch the logic together.&lt;/p&gt;
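
&lt;p&gt;To see what that plumbing amounts to, here is a toy version of the same flow in plain Python. It does not use LangChain, FAISS, or a real model: word counts stand in for embeddings, a sorted list stands in for the vector index, and the function names and prompt wording are assumptions for illustration.&lt;/p&gt;

```python
import math
from collections import Counter


def embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def similarity(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=1):
    # Rank every chunk against the question; FAISS does this far more efficiently.
    q = embed(question)
    ranked = sorted(chunks, key=lambda chunk: similarity(q, embed(chunk)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    # The prompt that would be sent to the local LLM.
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


chunks = [
    "AI development moved through several phases over time",
    "Tokenization is a common NLP technique",
]
question = "What are the phases in AI development"
top = retrieve(question, chunks)
prompt = build_prompt(question, top)
print(prompt)
```

&lt;p&gt;The last step, sending &lt;code&gt;prompt&lt;/code&gt; to the model and returning its reply, is omitted because it depends on the model runner you use.&lt;/p&gt;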




&lt;h3&gt;
  
  
  &lt;strong&gt;Evaluation Techniques I Used&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Manually compared answers with the PDF&lt;/li&gt;
&lt;li&gt;Asked intentionally vague or tricky questions&lt;/li&gt;
&lt;li&gt;Checked that the answers didn’t hallucinate&lt;/li&gt;
&lt;li&gt;Made sure important info wasn’t skipped (avoided the “lost in the middle” issue)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Any Frontend?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Not yet, but I’m planning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;Streamlit&lt;/strong&gt;-based UI for chatting with the bot&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;FastAPI&lt;/strong&gt; backend to make it modular&lt;/li&gt;
&lt;li&gt;A desktop wrapper so anyone can use it easily&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;What’s Next?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Multi-PDF support&lt;/li&gt;
&lt;li&gt;Chunk summaries for quick previews&lt;/li&gt;
&lt;li&gt;Using &lt;strong&gt;ragas&lt;/strong&gt; for automated evaluation&lt;/li&gt;
&lt;li&gt;Feedback-based learning loop&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Interview Questions&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;How does &lt;strong&gt;chunk overlap&lt;/strong&gt; affect retrieval quality in RAG systems?&lt;/li&gt;
&lt;li&gt;What are the benefits of &lt;strong&gt;local embeddings&lt;/strong&gt; over API-based ones?&lt;/li&gt;
&lt;li&gt;How do you &lt;strong&gt;debug wrong or missing retrievals&lt;/strong&gt; in vector search?&lt;/li&gt;
&lt;li&gt;What’s the trade-off between &lt;strong&gt;dense and sparse embeddings&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;How do you handle &lt;strong&gt;stale or outdated indexes&lt;/strong&gt; in a vector DB like FAISS?&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Building this RAG chatbot wasn’t just about code; it was about turning theory into practice. Every bug I fixed and every wrong answer I debugged helped me grow.&lt;/p&gt;

&lt;p&gt;If you’ve read about RAG and want to &lt;em&gt;really&lt;/em&gt; learn it, build something.&lt;/p&gt;

&lt;p&gt;Let’s keep learning, building, and breaking things together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shaik Salma Aga&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔗 GitHub: &lt;a href="https://github.com/ShaikSalmaAga/rag-chatbot" rel="noopener noreferrer"&gt;https://github.com/ShaikSalmaAga/rag-chatbot&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
