<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jatin Solanki</title>
    <description>The latest articles on DEV Community by Jatin Solanki (@spjatin4).</description>
    <link>https://dev.to/spjatin4</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1025610%2F60c56ca9-eb79-458c-90f6-d9545e0d28df.jpeg</url>
      <title>DEV Community: Jatin Solanki</title>
      <link>https://dev.to/spjatin4</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/spjatin4"/>
    <language>en</language>
    <item>
      <title>What is data observability?</title>
      <dc:creator>Jatin Solanki</dc:creator>
      <pubDate>Tue, 14 Feb 2023 11:11:42 +0000</pubDate>
      <link>https://dev.to/spjatin4/what-is-data-observability-2ogg</link>
      <guid>https://dev.to/spjatin4/what-is-data-observability-2ogg</guid>
      <description>&lt;p&gt;I started my career as a first-generation analyst focusing on writing SQL scripts, learning R, and publishing dashboards. As things progressed, I graduated into Data Science and Data Engineering where my focus shifted to managing the life-cycle of ML models and data pipelines. 2022 is my 16th year in the data industry and I am still learning new ways to be productive and impactful. Today, I am now the head of a data science &amp;amp; data engineering function in one of the unicorns and I would like to share my findings and where I am heading next.&lt;/p&gt;

&lt;p&gt;When I look at the big picture, I realise that the problems most companies face are quite similar. Their vision of becoming data-driven has turned into a BHAG — pronounced “bee hag” (Big Hairy Audacious Goal). We data folks like patterns, so here are my findings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In 5 out of 10 review meetings, I have witnessed people questioning the reliability of the data/report/dashboard. Additionally, HODs will try to convince others that their data is the most accurate or reliable :)&lt;/li&gt;
&lt;li&gt;Often, an HOD reports that the data is not updated, while the data team is already working to fix the report or data table.&lt;/li&gt;
&lt;li&gt;A new product launched the week before; however, we are yet to figure out its performance. The data team is working on a query change and will update the CXO team soon.&lt;/li&gt;
&lt;li&gt;Everyone has built expertise around writing complicated ML (machine learning) models; however, very few talk about or deploy inference monitoring. There is a high probability of model drift or performance drift in the coming weeks/months if models are not monitored or observed efficiently.&lt;/li&gt;
&lt;li&gt;Very few companies deploy solutions or models to detect performance anomalies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The list is long; I am sure you can relate or add more to it.&lt;/p&gt;

&lt;p&gt;In a nutshell, I found that data reliability is a BIG challenge, and there is a need for a solution that is easy to use, understand, and deploy, and that is not heavy on investment.&lt;/p&gt;

&lt;p&gt;Hello, I am Jatin Solanki, and I am on a mission to build a solution that makes your data reliable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is needed to make your data more reliable?&lt;/strong&gt;&lt;br&gt;
Complexities around data infrastructure are surging as companies gear up to gain a competitive edge and deliver out-of-the-box offerings.&lt;/p&gt;

&lt;p&gt;Every company goes through a data maturity matrix. In order to reach a level where you deploy AI models or self-service models, you need to invest in a robust foundation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In my opinion, the foundation begins with a reliable data source, or defining a source of truth. Your data models won’t be impactful if they are fed bad data. You know: garbage in, garbage out.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On a high level, here are a few checks you can implement to ensure data reliability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Volume: ensures all the rows/events are captured or ingested.&lt;/li&gt;
&lt;li&gt;Freshness: the recency of the data. If your data gets updated every xx mins, this test ensures it is updated and raises an incident if it is not.&lt;/li&gt;
&lt;li&gt;Schema Change: if there is a schema change or a new feature was launched, your data team needs to be aware so they can update the scripts.&lt;/li&gt;
&lt;li&gt;Distribution: ensures all the events are in an acceptable range. e.g. if a critical column shouldn’t contain null values, this test raises an alert for any null or missing values.&lt;/li&gt;
&lt;li&gt;Lineage: a must-have module; however, we always underplay it. Lineage gives the data team handy information about upstream and downstream dependencies.&lt;/li&gt;
&lt;li&gt;Reconciliation: recon, or finding deltas between two given datasets. This can be used to understand the difference between staging and production, or between source and destination. It can also be effective for financial recons, like matching the payment gateway to the sales table.&lt;/li&gt;
&lt;/ul&gt;
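&lt;p&gt;As an illustration, a few of these checks (volume, distribution, and reconciliation) can be sketched in plain Python. The row shapes and column names here are hypothetical; a real deployment would run such checks against your warehouse on a schedule:&lt;/p&gt;

```python
def volume_check(rows, expected_count):
    """Volume: verify that every expected row/event was ingested."""
    return len(rows) == expected_count

def null_check(rows, column):
    """Distribution: flag rows where a critical column is null/missing."""
    return [row for row in rows if row.get(column) is None]

def reconcile(source_ids, destination_ids):
    """Reconciliation: deltas between two datasets keyed by id,
    e.g. staging vs. production, or payment gateway vs. sales table."""
    missing = set(source_ids) - set(destination_ids)
    extra = set(destination_ids) - set(source_ids)
    return missing, extra

# Example: two rows ingested, one with a missing critical value
rows = [{"id": 1, "amount": 120}, {"id": 2, "amount": None}]
print(volume_check(rows, 2))            # True: both expected rows arrived
print(null_check(rows, "amount"))       # the row whose amount is null
print(reconcile([1, 2, 3], [2, 3, 4]))  # ({1}, {4})
```

&lt;p&gt;In practice, each failed check would raise an incident or alert rather than just print, but the core logic stays this simple.&lt;/p&gt;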

&lt;p&gt;&lt;strong&gt;What next? How do we implement this?&lt;/strong&gt;&lt;br&gt;
The most common question people face is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Build versus Buy&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I am a big fan of open-source tech; however, for some critical modules, I prefer buying an out-of-the-box solution because it is scalable and already tested in the market. Developing in-house might cost you around US$2k per month, including a few hours of an engineer’s time along with cloud costs.&lt;/p&gt;

&lt;p&gt;If you are inclined toward buying an out-of-the-box solution, here are a few factors that should be part of your checklist.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Should be able to connect to popular sources with minimal config.&lt;/li&gt;
&lt;li&gt;Extract information automatically without the need for additional code.&lt;/li&gt;
&lt;li&gt;No-code or CLI (I leave that to you).&lt;/li&gt;
&lt;li&gt;Lineage and catalog modules.&lt;/li&gt;
&lt;li&gt;Data reconciliation along with a scheduling feature.&lt;/li&gt;
&lt;li&gt;Anomaly detection.&lt;/li&gt;
&lt;li&gt;Of course, all the tests we discussed earlier, along with alerts that tell you where to &lt;code&gt;debug&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It should also be able to automatically detect my critical data assets and apply hygiene checks to them.&lt;/p&gt;

&lt;p&gt;Finally, the solution should help you reduce data quality incidents and make your data more reliable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So, do I need a data observability platform?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your answer to any of the below questions or scenarios is “Yes”, then you should procure or deploy a data observability solution right away.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are dashboards not getting updated on a regular basis?&lt;/li&gt;
&lt;li&gt;Are you unsure which report is accurate?&lt;/li&gt;
&lt;li&gt;Are business stakeholders the first to learn about data incidents?&lt;/li&gt;
&lt;li&gt;Do performance stats get questioned during meetings?&lt;/li&gt;
&lt;li&gt;Do you have at least 2 members in the data team?&lt;/li&gt;
&lt;li&gt;Have you deployed a business intelligence tool?&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Just as software developers have leveraged solutions like DataDog and Dynatrace to ensure web/app uptime, data leaders should invest in data observability solutions to ensure data reliability.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Interested in learning more about data observability? Reach out to &lt;a href="https://www.linkedin.com/in/jatinsolanki"&gt;me&lt;/a&gt; or visit our &lt;a href="https://decube.io"&gt;site&lt;/a&gt; for more info.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>database</category>
      <category>python</category>
    </item>
    <item>
      <title>Monitor &amp; Detect - schema changes, broken dashboards etc</title>
      <dc:creator>Jatin Solanki</dc:creator>
      <pubDate>Mon, 13 Feb 2023 00:41:28 +0000</pubDate>
      <link>https://dev.to/spjatin4/monitor-detect-schema-changes-broken-dashboards-etc-1fj3</link>
      <guid>https://dev.to/spjatin4/monitor-detect-schema-changes-broken-dashboards-etc-1fj3</guid>
      <description>&lt;p&gt;As a data engineer, one of the challenging and sometimes frustrating task is to handle messy information or weird data points passing the pipeline.&lt;/p&gt;

&lt;p&gt;Business teams are always firefighting with Data team on accuracy, dashboards not correct or no updated etc.&lt;/p&gt;

&lt;p&gt;There are multiple options whether its opensource or closed ones. &lt;/p&gt;

&lt;p&gt;Would like to intro with all about &lt;a href="https://decube.io"&gt;decube&lt;/a&gt;, not only it helps in managing the data quality or writing tests but also in managing the data catalog which is out of the box option.&lt;/p&gt;

&lt;p&gt;It has community edition too (free-forever) with no cap on tables for catalog and lineage.&lt;/p&gt;

&lt;p&gt;I strongly suggest to give it a try. &lt;/p&gt;

</description>
      <category>database</category>
      <category>python</category>
      <category>dataengineering</category>
      <category>data</category>
    </item>
  </channel>
</rss>
