<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aman Gupta</title>
    <description>The latest articles on DEV Community by Aman Gupta (@aman_gupta_7c59c96e9e167a).</description>
    <link>https://dev.to/aman_gupta_7c59c96e9e167a</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1673771%2Ffdb79ab8-e1a5-443b-8965-69db5b624f0e.jpg</url>
      <title>DEV Community: Aman Gupta</title>
      <link>https://dev.to/aman_gupta_7c59c96e9e167a</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aman_gupta_7c59c96e9e167a"/>
    <language>en</language>
    <item>
      <title>Simplifying SDMX Data Integration with Python</title>
      <dc:creator>Aman Gupta</dc:creator>
      <pubDate>Mon, 24 Jun 2024 08:15:35 +0000</pubDate>
      <link>https://dev.to/aman_gupta_7c59c96e9e167a/simplifying-sdmx-data-integration-with-python-1p01</link>
      <guid>https://dev.to/aman_gupta_7c59c96e9e167a/simplifying-sdmx-data-integration-with-python-1p01</guid>
      <description>&lt;h1&gt;
  
  
  Simplifying SDMX Data Integration with Python
&lt;/h1&gt;

&lt;p&gt;Statistical Data and Metadata eXchange (SDMX) is an international standard used extensively by global organizations, government agencies, and financial institutions to facilitate the efficient exchange, sharing, and processing of statistical data.&lt;/p&gt;

&lt;p&gt;Utilizing SDMX enables seamless integration and access to a broad spectrum of statistical datasets covering economics, finance, population demographics, health, and education, among others.&lt;/p&gt;

&lt;p&gt;These capabilities make it invaluable for creating robust, data-driven solutions that rely on accurate and comprehensive data sources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fsdmx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fsdmx.png" alt="embeddable etl"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why SDMX?
&lt;/h2&gt;

&lt;p&gt;SDMX not only standardizes data formats across disparate systems but also simplifies the access to data provided by institutions such as Eurostat, the ECB (European Central Bank), the IMF (International Monetary Fund), and many national statistics offices.&lt;/p&gt;

&lt;p&gt;This standardization allows data engineers and scientists to focus more on analyzing data rather than spending time on data cleaning and preparation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation and Basic Usage
&lt;/h3&gt;

&lt;p&gt;To start integrating SDMX data sources into your Python applications, install the sdmx library (published on PyPI as sdmx1) using pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;sdmx1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's an example of how to fetch data from multiple SDMX sources, illustrating the diversity of data flows and the ease of access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sdmx_source&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sdmx_source&lt;/span&gt;

&lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sdmx_source&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ESTAT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dataflow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PRC_PPP_IND&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;freq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;na_item&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PLI_EU28&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ppp_cat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A0101&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;geo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span 
class="s"&gt;EE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;food_price_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ESTAT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dataflow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sts_inpr_m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;M.PROD.B-D+C+D.CA.I15+I10.EE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ECB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dataflow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EXR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FREQ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CURRENCY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration retrieves data from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eurostat (ESTAT) for Purchasing Power Parity (PPP) and Price Level Indices, providing insights into economic factors across different regions.&lt;/li&gt;
&lt;li&gt;Eurostat's short-term statistics (sts_inpr_m) on industrial production, which is crucial for economic analysis.&lt;/li&gt;
&lt;li&gt;European Central Bank (ECB) for exchange rates, essential for financial and trade-related analyses.&lt;/li&gt;
&lt;/ul&gt;
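&lt;p&gt;Note that the sts_inpr_m query above passes its key as a pre-built string, while the others use dictionaries. As a rough illustration (the helper below is hypothetical, not part of any library), an SDMX REST key is simply the dimension values joined by dots, with multiple values for one dimension separated by plus signs:&lt;/p&gt;

```python
# Hypothetical helper: collapse ordered dimension values into an SDMX REST key.
# The dimension order must match the dataflow's structure definition.
def build_sdmx_key(dimensions):
    parts = []
    for value in dimensions:
        if isinstance(value, (list, tuple)):
            # Several values for one dimension become "+"-separated alternatives.
            parts.append("+".join(value))
        else:
            parts.append(value)
    return ".".join(parts)

# The monthly industrial-production key from the example above:
key = build_sdmx_key(["M", "PROD", ["B-D", "C", "D"], "CA", ["I15", "I10"], "EE"])
print(key)  # M.PROD.B-D+C+D.CA.I15+I10.EE
```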

&lt;h2&gt;
  
  
  Loading the data with dlt, leveraging best practices
&lt;/h2&gt;

&lt;p&gt;After retrieving data using the sdmx library, the next challenge is effectively integrating this data into databases.&lt;br&gt;
The dlt library excels in this area by offering a robust solution for data loading that adheres to best practices in several key ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated schema management -&amp;gt; dlt infers types and evolves schema as needed. It automatically handles nested structures too. You can customise this behavior, or turn the schema into a data contract.&lt;/li&gt;
&lt;li&gt;Declarative configuration -&amp;gt; You can easily switch between write dispositions (append/replace/merge) or destinations.&lt;/li&gt;
&lt;li&gt;Scalability -&amp;gt; dlt is designed to handle large volumes of data efficiently, making it suitable for enterprise-level applications and high-volume data streams. This scalability ensures that as your data needs grow, your data processing pipeline can grow with them without requiring significant redesign or resource allocation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Martin Salo, CTO at Yummy, a food logistics company, uses dlt to efficiently manage complex data flows from SDMX sources.&lt;br&gt;
By leveraging dlt, Martin ensures that his data pipelines are not only easy to build, robust and error-resistant but also optimized for performance and scalability.&lt;/p&gt;

&lt;p&gt;View &lt;a href="https://gist.github.com/salomartin/d4ee7170f678b0b44554af46fe8efb3f" rel="noopener noreferrer"&gt;Martin Salo's implementation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Martin Salo's implementation of the sdmx_source package effectively simplifies the retrieval of statistical data from diverse SDMX data sources using the Python sdmx library.&lt;br&gt;
The design is user-friendly, allowing both simple and complex data queries, and integrates the results directly into pandas DataFrames for immediate analysis.&lt;/p&gt;

&lt;p&gt;This implementation enhances data accessibility and prepares it for analytical applications, with built-in logging and error handling to improve reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Integrating sdmx and dlt into your data pipelines significantly enhances data management practices, ensuring operations are robust,&lt;br&gt;
scalable, and efficient. These tools provide essential capabilities for data professionals looking to seamlessly integrate&lt;br&gt;
complex statistical data into their workflows, enabling more effective data-driven decision-making.&lt;/p&gt;

&lt;p&gt;By engaging with the data engineering community and sharing strategies and insights on effective data integration,&lt;br&gt;
data engineers can continue to refine their practices and achieve better outcomes in their projects.&lt;/p&gt;

&lt;p&gt;Join the conversation and share your insights in our &lt;a href="https://dlthub.com/community" rel="noopener noreferrer"&gt;Slack community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>pipeline</category>
      <category>etl</category>
      <category>sdmx</category>
    </item>
    <item>
      <title>Replacing SaaS ETL with Python dlt: A painless experience for Yummy.eu</title>
      <dc:creator>Aman Gupta</dc:creator>
      <pubDate>Mon, 24 Jun 2024 08:12:53 +0000</pubDate>
      <link>https://dev.to/aman_gupta_7c59c96e9e167a/replacing-saas-etl-with-python-dlt-a-painless-experience-for-yummyeu-3mme</link>
      <guid>https://dev.to/aman_gupta_7c59c96e9e167a/replacing-saas-etl-with-python-dlt-a-painless-experience-for-yummyeu-3mme</guid>
      <description>&lt;p&gt;About &lt;a href="https://about.yummy.eu/" rel="noopener noreferrer"&gt;Yummy.eu&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yummy is a lean-ops meal-kit company that streamlines the entire food preparation process for customers in emerging markets by providing personalized recipes,&lt;br&gt;
nutritional guidance, and even shopping services. Their innovative approach ensures a hassle-free, nutritionally optimized meal experience,&lt;br&gt;
making daily cooking convenient and enjoyable.&lt;/p&gt;

&lt;p&gt;Yummy is a food box business. At the intersection of gastronomy and logistics, this market is very competitive.&lt;br&gt;
To make it in this market, Yummy needs to be fast and informed in their operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipelines are not yet a commodity.
&lt;/h3&gt;

&lt;p&gt;At Yummy, efficiency and timeliness are paramount. Initially, Martin, Yummy’s CTO, chose to purchase data pipelining tools for their operational and analytical&lt;br&gt;
needs, aiming to maximize time efficiency. However, the real-world performance of these purchased solutions did not meet expectations, which&lt;br&gt;
led to a reassessment of their approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s important: Velocity, Reliability, Speed, Time. Money is secondary.
&lt;/h3&gt;

&lt;p&gt;Martin was initially satisfied with the ease of setup provided by the SaaS services.&lt;/p&gt;

&lt;p&gt;The tipping point came when an update to Yummy’s database introduced a new log table, leading to unexpectedly high fees due to the vendor’s default settings that automatically replicated new tables fully on every refresh. This situation highlighted the need for greater control over data management processes and prompted a shift towards more transparent and cost-effective solutions.&lt;/p&gt;

&lt;p&gt;💡 Proactive management of data pipeline settings is essential.&lt;br&gt;
Automatic replication of new tables, while common, often leads to increased costs without adding value, especially if those tables are not immediately needed.&lt;br&gt;
Understanding and adjusting these settings can lead to significant cost savings and more efficient data use.&lt;/p&gt;

&lt;h2&gt;
  
  
  10x faster, 182x cheaper with dlt + async + modal
&lt;/h2&gt;

&lt;p&gt;Motivated to find a solution that balanced cost with performance, Martin explored using dlt, a tool known for its simplicity in building data pipelines.&lt;br&gt;
By combining dlt with asynchronous operations and using &lt;a href="https://modal.com/" rel="noopener noreferrer"&gt;Modal&lt;/a&gt;  for managed execution, the improvements were substantial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data processing speed increased tenfold.&lt;/li&gt;
&lt;li&gt;Cost reduced by 182 times compared to the traditional SaaS tool.&lt;/li&gt;
&lt;li&gt;The new system supports extracting data once and writing to multiple destinations without additional costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a peek into how Martin implemented this solution, &lt;a href="https://gist.github.com/salomartin/c0d4b0b5510feb0894da9369b5e649ff" rel="noopener noreferrer"&gt;see Martin's async Postgres source on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/salomartin/status/1755146404773658660" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fmartin_salo_tweet.png" alt="salo-martin-tweet"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Taking back control with open source has never been easier
&lt;/h2&gt;

&lt;p&gt;Taking control of your data stack is more accessible than ever with the broad array of open-source tools available. SQL copy pipelines, often seen as a basic utility in data management, do not generally differ significantly between platforms. They perform similar transformations and schema management, making them a commodity available at minimal cost.&lt;/p&gt;

&lt;p&gt;SQL to SQL copy pipelines are widespread, yet many service providers charge exorbitant fees for these simple tasks. In contrast, these pipelines can often be set up and run at a fraction of the cost—sometimes just the price of a few coffees.&lt;/p&gt;

&lt;p&gt;At dltHub, we advocate for leveraging straightforward, freely available resources to regain control over your data processes and budget effectively.&lt;/p&gt;

&lt;p&gt;Setting up a SQL pipeline can take just a few minutes with the right tools. Explore these resources to enhance your data operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dlthub.com/docs/dlt-ecosystem/verified-sources/sql_database" rel="noopener noreferrer"&gt;30+ SQL database sources&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/salomartin/c0d4b0b5510feb0894da9369b5e649ff" rel="noopener noreferrer"&gt;Martin’s async PostgreSQL source&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.notion.so/Martin-Salo-Yummy-2061c3139e8e4b7fa355255cc994bba5?pvs=21" rel="noopener noreferrer"&gt;Arrow + connectorx&lt;/a&gt; for up to 30x faster data transfers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For additional support or to connect with fellow data professionals, &lt;a href="https://dlthub.com/community" rel="noopener noreferrer"&gt;join our community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>saasetl</category>
      <category>dataengineering</category>
      <category>python</category>
    </item>
    <item>
      <title>On Orchestrators: You Are All Right, But You Are All Wrong Too</title>
      <dc:creator>Aman Gupta</dc:creator>
      <pubDate>Mon, 24 Jun 2024 08:09:56 +0000</pubDate>
      <link>https://dev.to/aman_gupta_7c59c96e9e167a/on-orchestrators-you-are-all-right-but-you-are-all-wrong-too-4dba</link>
      <guid>https://dev.to/aman_gupta_7c59c96e9e167a/on-orchestrators-you-are-all-right-but-you-are-all-wrong-too-4dba</guid>
      <description>&lt;p&gt;It's been nearly half a century since cron was first introduced, and now we have a handful orchestration tools that go way beyond just scheduling tasks. With data folks constantly debating about which tools are top-notch and which ones should leave the scene, it's like we're at a turning point in the evolution of these tools. By that I mean the term 'orchestrator' has become kind of a catch-all, and that's causing some confusion because we're using this one word to talk about a bunch of different things.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fblog-on-orchestrators-dates.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fblog-on-orchestrators-dates.png" alt="dates"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Think about the word “date.” It can mean a fruit, a romantic outing, or a day on the calendar, right? We usually figure out which one it is from the context, but what does context mean when it comes to orchestration? It might sound like a simple question, but it's pretty important to get this straight.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;And here's a funny thing: some people, after eating an odd-tasting date (the fruit, of course), are so put off that they naively swear off going on romantic dates altogether. It's an overly exaggerated figurative way of looking at it, but it shows how one bad experience can color our view of something completely different. That's kind of what's happening with orchestration tools. If someone had a bad time with one tool, they might be overly critical towards another, even though it might be a totally different experience.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So the context in terms of orchestration tools seems to be primarily defined by one thing - WHEN a specific tool was first introduced to the market (&lt;em&gt;aside from the obvious factors like the technical background of the person discussing these tools and their tendency to be a chronic complainer&lt;/em&gt; 🙄).&lt;/p&gt;




&lt;h2&gt;
  
  
  IT'S ALL ABOUT TIMING!
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fblog-on-orchestrators-evolution.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fblog-on-orchestrators-evolution.png" alt="evolution-of-data-orchestration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Illegitimate Child
&lt;/h3&gt;

&lt;p&gt;Cron was initially released in 1975 and is undoubtedly the father of all scheduling tools, including orchestrators, but I’m assuming Cron didn’t anticipate this many offspring in the field of data (or perhaps it did). As Oracle brought the first commercial relational database to market in 1979, people started to realize that data needs to be moved on schedule, and without manual effort. And it was doable, with the help of Control-M, though it was more of a general workflow automation tool that didn’t pay special attention to data workflows.&lt;/p&gt;

&lt;p&gt;Basically, since the solutions weren’t data-driven at that time, it was more “The job gets done, but without a guarantee of data quality.”&lt;/p&gt;

&lt;h3&gt;
  
  
  Finally Adopted
&lt;/h3&gt;

&lt;p&gt;Unlike Control-M, Informatica was designed with data operations in mind from the beginning. As data started to spread across entire companies, advanced OLAP systems started to emerge alongside the broad adoption of data warehousing. Now data not only needed to be moved, but integrated across many systems and users. The data orchestration solution from Informatica was inevitably influenced by the rising popularity of the contemporary drag-and-drop concept, much to the detriment of many modern data engineers, who would recommend skipping Informatica and other GUI-based ETL tools that offer ‘visual programming’.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As the creator of Airflow, Max Beauchemin, said: “There's a multitude of reasons why complex pieces of software are not developed using drag and drop tools: &lt;strong&gt;it's that ultimately code is the best abstraction there is for software.&lt;/strong&gt;”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  To Be Free, That Is, Diverse
&lt;/h3&gt;

&lt;p&gt;With traditional ETL tools, such as IBM DataStage and Talend, becoming well-established in the 1990s and early 2000s, the big data movement started gaining momentum with Hadoop as the main star. Oozie, made open-source in 2011, was tasked with workflow scheduling of Hadoop jobs, while closed-source solutions like K2View started to operate behind the curtains.&lt;/p&gt;

&lt;p&gt;Fast forward a bit, and the scene exploded, with Airflow quickly becoming the heavyweight champ, while every big data service out there began rolling out their own orchestrators. This burst brought diversity, but with diversity came a maze of complexity. All of a sudden, there’s an orchestrator for everyone — whether you’re chasing features or just trying to make your budget work 👀 — picking the perfect one for your needs has gotten even trickier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fblog-on-orchestrators-types.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fblog-on-orchestrators-types.png" alt="types"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Bottom Line
&lt;/h3&gt;

&lt;p&gt;The thing is that every tool out there has some inconvenient truths, and the real question isn't about escaping the headache — it's about choosing your type of headache. Hence, the endless sea of “versus” articles, blog posts, and guides trying to help you pick your personal battle.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A Redditor: &lt;a href="https://www.reddit.com/r/dataengineering/comments/10ttbvl/comment/j7a4685/?utm_source=share&amp;amp;utm_medium=web3x&amp;amp;utm_name=web3xcss&amp;amp;utm_term=1&amp;amp;utm_content=share_button" rel="noopener noreferrer"&gt;“Everyone has hated all orchestration tools for all time. People just hated Airflow less and it took off.“&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What I'm getting at is this: we're all a bit biased by the "law of the instrument." You know, the whole “If all you have is a hammer, everything looks like a nail” thing. Most engineers probably grabbed the latest or most hyped tool when they first dipped their toes into data orchestration and have stuck with it ever since. Sure, Airflow is the belle of the ball for the community, but there's a whole lineup of contenders vying for the spotlight.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fblog-on-orchestrators-perspectives.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fblog-on-orchestrators-perspectives.png" alt="law-of-instrument"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And there are obviously those who would relate to the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.reddit.com/r/dataengineering/comments/168p757/comment/jyx9gs7/?utm_source=share&amp;amp;utm_medium=web3x&amp;amp;utm_name=web3xcss&amp;amp;utm_term=1&amp;amp;utm_content=share_button" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fblog-on-orchestrators-reddit-screenshot.png" alt="reddit-screenshot"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  A HANDY DETOUR POUR TOI 💐
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Fundamentals
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.prefect.io/blog/brief-history-of-workflow-orchestration" rel="noopener noreferrer"&gt;A Brief History of Workflow Orchestration&lt;/a&gt; by Prefect.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@hugolu87/what-is-data-orchestration-and-why-is-it-misunderstood-844878ac8c0a" rel="noopener noreferrer"&gt;What is Data Orchestration and why is it misunderstood?&lt;/a&gt; by Hugo Lu.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://jonathanneo.substack.com/p/the-evolution-of-data-orchestration" rel="noopener noreferrer"&gt;The evolution of data orchestration: Part 1 - the past and present&lt;/a&gt; by Jonathan Neo.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://jonathanneo.substack.com/p/the-evolution-of-data-orchestration-002" rel="noopener noreferrer"&gt;The evolution of data orchestration: Part 2 - the future&lt;/a&gt; by Jonathan Neo.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.dedp.online/part-2/4-ce/bash-stored-procedure-etl-python-script.html" rel="noopener noreferrer"&gt;Bash-Script vs. Stored Procedure vs. Traditional ETL Tools vs. Python-Script&lt;/a&gt; by Simon Späti.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  About Airflow
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.ibm.com/blog/6-issues-with-airflow/" rel="noopener noreferrer"&gt;6 inconvenient truths about Apache Airflow (and what to do about them)&lt;/a&gt; by IBM.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://airflow.apache.org/blog/airflow-survey-2022/" rel="noopener noreferrer"&gt;Airflow Survey 2022&lt;/a&gt; by Airflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Miscellaneous
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://medium.com/arthur-engineering/picking-a-kubernetes-orchestrator-airflow-argo-and-prefect-83539ecc69b" rel="noopener noreferrer"&gt;Picking A Kubernetes Orchestrator: Airflow, Argo, and Prefect&lt;/a&gt; by Ian McGraw.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://towardsdatascience.com/airflow-prefect-and-dagster-an-inside-look-6074781c9b77" rel="noopener noreferrer"&gt;Airflow, Prefect, and Dagster: An Inside Look&lt;/a&gt; by Pedram Navid.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  WHAT THE FUTURE HOLDS...
&lt;/h2&gt;

&lt;p&gt;I'm no oracle or tech guru, but it's pretty obvious that at their core, most data orchestration tools are pretty similar. They're like building blocks that can be put together in different ways—some features come, some go, and users are always learning something new or dropping something old. So, what's really going to make a difference down the line is NOT just about having the coolest features. It's more about having a strong community that's all in on making the product better, a welcoming onboarding process that doesn't feel like rocket science, and finding that sweet spot between making things simple to use and letting users tweak things just the way they like.&lt;/p&gt;

&lt;p&gt;In other words, it's not just about what the tools can do, but how people feel about using them, learning them, contributing to them, and obviously how much they spend to maintain them. That's likely where the future winners in the data orchestration game will stand out. But don’t get me wrong, features are important — it's just that there are other things equally important.&lt;/p&gt;




&lt;h2&gt;
  
  
  SO WHO'S ACTUALLY TRENDING?
&lt;/h2&gt;

&lt;p&gt;I’ve been working on this article for a WHILE now, and, honestly, it's been a bit of a headache trying to gather any solid, objective info on which data orchestration tool tops the charts. The more I think about it, the more I realise it's probably because trying to measure "the best" or "most popular" is a bit like trying to catch smoke with your bare hands — pretty subjective by nature. Plus, only testing them with non-production level data probably wasn't my brightest move.&lt;/p&gt;

&lt;p&gt;However, I did create a fun little project where I analysed the sentiment of comments on articles about selected data orchestrators on Hacker News and gathered Google Trends data for the past year.&lt;/p&gt;

&lt;p&gt;Just a heads-up, though: the results are BY NO MEANS reliable and are skewed due to some fun with words. For instance, searching for “Prefect” kept leading me to articles about Japanese prefectures, “Keboola” resulted in Kool-Aid content, and “Luigi”... well, let’s just say I ran into Mario’s brother more than once 😂.&lt;/p&gt;




&lt;h2&gt;
  
  
  THE FUN LITTLE PROJECT
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Straight to the &lt;a href="https://github.com/dlt-hub/dlt_demos/tree/main/dlt-dagster-snowflake" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I used Dagster and &lt;code&gt;dlt&lt;/code&gt; to load data into Snowflake, and since both of them have integrations with Snowflake, it was easy to set things up and have them all running:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fdlt_dagster_snowflake_demo_overview.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fdlt_dagster_snowflake_demo_overview.png" alt="Pipeline overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This project is very minimal, including just what's needed to run Dagster locally with &lt;code&gt;dlt&lt;/code&gt;. Here's a quick breakdown of the repo’s structure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;.dlt&lt;/code&gt;: Utilized by the &lt;code&gt;dlt&lt;/code&gt; library for storing configuration and sensitive information. The Dagster project is set up to fetch secret values from this directory as well.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;charts&lt;/code&gt;: Used to store chart images generated by assets.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dlt_dagster_snowflake_demo&lt;/code&gt;: Your Dagster package, comprising Dagster assets, &lt;code&gt;dlt&lt;/code&gt; resources, Dagster resources, and general project configurations.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Dagster Resources Explained
&lt;/h3&gt;

&lt;p&gt;In the &lt;code&gt;resources&lt;/code&gt; folder, the following two Dagster resources are defined as classes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;DltPipeline&lt;/code&gt;: This is our &lt;code&gt;dlt&lt;/code&gt; object defined as a Dagster ConfigurableResource that creates and runs a &lt;code&gt;dlt&lt;/code&gt; pipeline with the specified data and table name. It will later be used in our Dagster assets to load data into Snowflake.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DltPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ConfigurableResource&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Initialize resource with pipeline details
&lt;/span&gt;    &lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;destination&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Creates and runs a dlt pipeline with specified data and table name.

    Args:
      resource_data: The data to be processed by the pipeline.
      table_name: The name of the table where data will be loaded.

    Returns:
      The result of the pipeline execution.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Configure the dlt pipeline with your destination details
&lt;/span&gt;    &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dlt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;destination&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;destination&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dataset_name&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Run the pipeline with your parameters
&lt;/span&gt;    &lt;span class="n"&gt;load_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;load_info&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;LocalFileStorage&lt;/code&gt;: Manages the local file storage, ensuring the storage directory exists and allowing data to be written to files within it. It will be later used in our Dagster assets to save images into the &lt;code&gt;charts&lt;/code&gt; folder.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
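&lt;p&gt;The repo's actual implementation isn't reproduced here, but based on the description above, a &lt;code&gt;LocalFileStorage&lt;/code&gt;-style helper could look like the following dependency-free sketch (the real resource subclasses Dagster's &lt;code&gt;ConfigurableResource&lt;/code&gt;; the method and attribute names here are assumptions):&lt;/p&gt;

```python
# Sketch of a LocalFileStorage-style helper, assuming only the behavior
# described above: ensure the storage directory exists and write files into it.
# The real resource in the repo subclasses Dagster's ConfigurableResource.
from pathlib import Path

class LocalFileStorage:
    def __init__(self, dir_path: str):
        self.dir_path = Path(dir_path)

    def write(self, file_name: str, data: bytes) -> Path:
        """Create the storage directory if needed and write data to a file."""
        self.dir_path.mkdir(parents=True, exist_ok=True)
        target = self.dir_path / file_name
        target.write_bytes(data)
        return target

storage = LocalFileStorage("charts")
path = storage.write("example_chart.png", b"\x89PNG...")
print(path.name)  # example_chart.png
```

&lt;p&gt;An asset can then call &lt;code&gt;storage.write(...)&lt;/code&gt; with a generated chart image, which is how the &lt;code&gt;charts&lt;/code&gt; folder gets populated.&lt;/p&gt;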

&lt;h3&gt;
  
  
  &lt;code&gt;dlt&lt;/code&gt; Explained
&lt;/h3&gt;

&lt;p&gt;In the &lt;code&gt;dlt&lt;/code&gt; folder within &lt;code&gt;dlt_dagster_snowflake_demo&lt;/code&gt;, the necessary &lt;code&gt;dlt&lt;/code&gt; resources and sources are defined. Below is a visual representation illustrating the functionality of &lt;code&gt;dlt&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fdlt_dagster_snowflake_demo_dlt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fdlt_dagster_snowflake_demo_dlt.png" alt="dlt explained"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;hacker_news&lt;/code&gt;: A &lt;code&gt;dlt&lt;/code&gt; resource that yields stories related to specified orchestration tools from Hackernews. For each tool, it retrieves the top 5 stories that have at least one comment. The stories are then appended to the existing data.&lt;/p&gt;

&lt;p&gt;Note that the &lt;code&gt;write_disposition&lt;/code&gt; can also be set to &lt;code&gt;merge&lt;/code&gt; or &lt;code&gt;replace&lt;/code&gt;:&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- The merge write disposition merges the new data from the resource with the existing data at the destination. It requires a `primary_key` to be specified for the resource. More details can be found here.
- The replace write disposition replaces the data in the destination with the data from the resource. It deletes all the classes and objects and recreates the schema before loading the data.

More details can be found [here](https://dlthub.com/docs/general-usage/resource).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;code&gt;comments&lt;/code&gt;: A &lt;code&gt;dlt&lt;/code&gt; transformer - a resource that receives data from another resource. It fetches comments for each story yielded by the &lt;code&gt;hacker_news&lt;/code&gt; function.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hacker_news_full&lt;/code&gt;: A &lt;code&gt;dlt&lt;/code&gt; source that extracts data from the source location using one or more resource components, such as &lt;code&gt;hacker_news&lt;/code&gt; and &lt;code&gt;comments&lt;/code&gt;. To illustrate, if the source is a database, a resource corresponds to a table within that database.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;google_trends&lt;/code&gt;: A &lt;code&gt;dlt&lt;/code&gt; resource that fetches Google Trends data for specified orchestration tools. It attempts to retrieve the data multiple times in case of failures or empty responses. The retrieved data is then appended to the existing data.&lt;/li&gt;
&lt;/ol&gt;
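&lt;p&gt;To make the resource-and-transformer relationship concrete, here is a dependency-free sketch of the same flow using plain Python generators (the real code uses &lt;code&gt;dlt&lt;/code&gt;'s decorators and live API calls; the sample data below is made up):&lt;/p&gt;

```python
# Plain-generator sketch of the resource -> transformer flow described above.
# In the real pipeline these are dlt-decorated functions; the data is made up.
def hacker_news():
    """Resource: yield top stories for each tool (stub data)."""
    yield {"id": 1, "title": "Dagster 1.0 released", "tool": "dagster"}
    yield {"id": 2, "title": "Orchestrating with Prefect", "tool": "prefect"}

def comments(stories):
    """Transformer: receive stories from the resource, yield their comments."""
    for story in stories:
        # A real transformer would call the Hacker News API per story id.
        yield {"story_id": story["id"], "text": f"Comment on '{story['title']}'"}

all_comments = list(comments(hacker_news()))
print(len(all_comments))  # 2
```

&lt;p&gt;The transformer never fetches stories itself; it only consumes what the upstream resource yields, which is exactly how &lt;code&gt;comments&lt;/code&gt; relates to &lt;code&gt;hacker_news&lt;/code&gt; above.&lt;/p&gt;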

&lt;p&gt;As you may have noticed, the &lt;code&gt;dlt&lt;/code&gt; library is designed to handle the unnesting of data internally. When you retrieve data from APIs like Hacker News or Google Trends, &lt;code&gt;dlt&lt;/code&gt; automatically unpacks the nested structures into relational tables, creating and linking child and parent tables. This is achieved through unique identifiers (&lt;code&gt;_dlt_id&lt;/code&gt; and &lt;code&gt;_dlt_parent_id&lt;/code&gt;) that link child tables to specific rows in the parent table. However, it's important to note that you have control over &lt;a href="https://dlthub.com/docs/general-usage/destination-tables" rel="noopener noreferrer"&gt;how this unnesting is done&lt;/a&gt;.&lt;/p&gt;
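&lt;p&gt;As a simplified illustration of that unnesting (this is not &lt;code&gt;dlt&lt;/code&gt;'s actual implementation, and the ids below are plain counters rather than &lt;code&gt;dlt&lt;/code&gt;'s opaque identifiers):&lt;/p&gt;

```python
# Simplified sketch of nested records -> parent/child tables. dlt's real ids
# are opaque hashes; counters are used here only to show the linkage.
def unnest(records):
    parents, children = [], []
    for i, record in enumerate(records):
        parent_id = f"row_{i}"
        nested = record.pop("comments", [])
        parents.append({**record, "_dlt_id": parent_id})
        for j, child in enumerate(nested):
            children.append({**child, "_dlt_id": f"row_{i}_{j}",
                             "_dlt_parent_id": parent_id})
    return parents, children

stories = [{"id": 1, "title": "A story",
            "comments": [{"text": "Nice"}, {"text": "+1"}]}]
parent_table, child_table = unnest(stories)
print(child_table[0]["_dlt_parent_id"])  # row_0
```

&lt;p&gt;Each child row carries a &lt;code&gt;_dlt_parent_id&lt;/code&gt; pointing back at its parent's &lt;code&gt;_dlt_id&lt;/code&gt;, which is the linkage the paragraph above describes.&lt;/p&gt;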

&lt;h3&gt;
  
  
  The Results
&lt;/h3&gt;

&lt;p&gt;Alright, so once you've got your Dagster assets all materialized and data loaded into Snowflake, let's take a peek at what you might see:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fblog-on-orchestrators-chart.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fblog-on-orchestrators-chart.png" alt="sentiment counts"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I understand if you're scratching your head at first glance, but let me clear things up. Remember those sneaky issues I mentioned with Keboola and Luigi earlier? Well, I've masked their charts with the respective “culprits”.&lt;/p&gt;

&lt;p&gt;Now, onto the bars. Each trio of bars shows the count of negative, neutral, and positive comments on Hacker News stories that have at least one comment and were returned by a search for the corresponding orchestration tool.&lt;/p&gt;

&lt;p&gt;What's the big reveal? It seems like Hacker News readers tend to spread more positivity than negativity, though neutral comments hold their ground.&lt;/p&gt;

&lt;p&gt;And, as is often the case with utilizing LLMs, this data should be taken with a grain of salt. It's more of a whimsical exploration than a rigorous analysis. However, if you take a peek behind Kool-Aid and Luigi, it's intriguing to note that articles related to them seem to attract a disproportionate amount of negativity. 😂&lt;/p&gt;




&lt;h2&gt;
  
  
  IF YOU'RE STILL HERE
&lt;/h2&gt;

&lt;p&gt;… and you're just dipping your toes into the world of data orchestration, don’t sweat it. It's totally normal if it doesn't immediately click for you. For beginners, it can be tricky to grasp because in small projects, there isn't always that immediate need for things to happen "automatically" -  you build your pipeline, run it once, and then bask in the satisfaction of your results - just like I did in my project. However, if you start playing around with one of these tools now, it could make it much easier to work with them later on. So, don't hesitate to dive in and experiment!&lt;/p&gt;

&lt;p&gt;… And hey, if you're a seasoned pro about to drop some knowledge bombs, feel free to go for it - because what doesn’t challenge us, doesn’t change us 🥹. &lt;em&gt;(*Cries in Gen Z*)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>etl</category>
      <category>pipeline</category>
    </item>
    <item>
      <title>What is the REST API Source toolkit?</title>
      <dc:creator>Aman Gupta</dc:creator>
      <pubDate>Mon, 24 Jun 2024 08:07:11 +0000</pubDate>
      <link>https://dev.to/aman_gupta_7c59c96e9e167a/what-is-the-rest-api-source-toolkit-1j9g</link>
      <guid>https://dev.to/aman_gupta_7c59c96e9e167a/what-is-the-rest-api-source-toolkit-1j9g</guid>
      <description>&lt;h2&gt;
  
  
  What is the REST API Source toolkit?
&lt;/h2&gt;

&lt;p&gt;tl;dr: You are probably familiar with REST APIs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our new &lt;strong&gt;REST API Source&lt;/strong&gt; is a short, declarative, configuration-driven way of creating sources.&lt;/li&gt;
&lt;li&gt;Our new &lt;strong&gt;REST API Client&lt;/strong&gt; is a collection of Python helpers used by the above source, which you can also use as a standalone, config-free, imperative high-level abstraction for building pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Want to skip to docs? Links at the bottom of the post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why REST configuration pipeline? Obviously, we need one!
&lt;/h3&gt;

&lt;p&gt;But of course! Why repeatedly write all this code for requests and loading, when we could write it once and re-use it with different APIs and different configs?&lt;/p&gt;

&lt;p&gt;Once you have built a few pipelines from REST APIs, you recognise that, instead of writing code, we could simply write configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We can call such an obvious next step in ETL tools a “&lt;a href="https://en.wikipedia.org/wiki/Focal_point_(game_theory)"&gt;focal point&lt;/a&gt;” of “&lt;a href="https://en.wikipedia.org/wiki/Convergent_evolution"&gt;convergent evolution&lt;/a&gt;”.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And if you’ve been in a few larger more mature companies, you will have seen a variety of home-grown solutions that look similar. You might also have seen such solutions as commercial products or offerings.&lt;/p&gt;

&lt;h3&gt;
  
  
  But ours will be better…
&lt;/h3&gt;

&lt;p&gt;So far we have seen many REST API configurators and products — they suffer from predictable flaws:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local homebrewed flavors are local for a reason: they aren’t suitable for a broad audience. And if you ask the users/beneficiaries of these frameworks, they will sometimes argue that they aren’t suitable for anyone at all.&lt;/li&gt;
&lt;li&gt;Commercial products are yet another data product that doesn’t plug into your stack, brings black boxes and removes autonomy, so they simply aren’t an acceptable solution in many cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So how can &lt;code&gt;dlt&lt;/code&gt; do better?&lt;/p&gt;

&lt;p&gt;Because it can keep the best of both worlds: the autonomy of a library, the quality of a commercial product.&lt;/p&gt;

&lt;p&gt;As you will see further, we created not just a standalone “configuration-based source builder” but we also expose the REST API client used enabling its use directly in code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hey community, you made us do it!
&lt;/h2&gt;

&lt;p&gt;The push for this is coming from you, the community. While we had considered the concept before, there were many things &lt;code&gt;dlt&lt;/code&gt; needed before creating a new way to build pipelines. A declarative extractor after all, would not make &lt;code&gt;dlt&lt;/code&gt; easier to adopt, because a declarative approach requires more upfront knowledge.&lt;/p&gt;

&lt;p&gt;Credits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;So, thank you Alex Butler for building a first version of this and donating it to us back in August ‘23:  &lt;a href="https://github.com/dlt-hub/dlt-init-openapi/pull/2"&gt;https://github.com/dlt-hub/dlt-init-openapi/pull/2&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;And thank you Francesco Mucio and Willi Müller for re-opening the topic, and creating video &lt;a href="https://www.youtube.com/playlist?list=PLpTgUMBCn15rs2NkB4ise780UxLKImZTh"&gt;tutorials&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;And last but not least, thank you to &lt;code&gt;dlt&lt;/code&gt; team’s Anton Burnashev (also known for &lt;a href="https://github.com/burnash/gspread"&gt;gspread&lt;/a&gt; library) for building it out!&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The outcome? Two Python-only interfaces, one declarative, one imperative.
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dlt’s REST API Source&lt;/strong&gt; is a Python-dictionary-first declarative source builder that offers enhanced flexibility, supports passing callables, validates configs natively via Python dictionaries, and composes directly in your scripts. It can generate sources dynamically at runtime, enabling straightforward manual or automated workflows for adapting sources to changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dlt’s REST API Client&lt;/strong&gt; is the low-level abstraction that powers the REST API Source. You can use it in your imperative code for more automation and brevity, if you do not wish to use the higher level declarative interface.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Useful for those who frequently build new pipelines
&lt;/h3&gt;

&lt;p&gt;If you are on a team with 2-3 pipelines that never change much, you likely won’t see much benefit from our latest tool.&lt;br&gt;
What we observe from early feedback is that a declarative extractor shines at enabling easier work at scale.&lt;br&gt;
We heard excitement about the &lt;strong&gt;REST API Source&lt;/strong&gt; from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;companies with many pipelines that frequently create new pipelines,&lt;/li&gt;
&lt;li&gt;data platform teams,&lt;/li&gt;
&lt;li&gt;freelancers and agencies,&lt;/li&gt;
&lt;li&gt;folks who want to generate pipelines with LLMs and need a simple interface.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  How to use the REST API Source?
&lt;/h2&gt;

&lt;p&gt;Since this is a declarative interface, we can’t make things up as we go along, and instead need to understand what we want to do upfront and declare that.&lt;/p&gt;

&lt;p&gt;In some cases, we might not have the information upfront, so we will show you how to get that info during your development workflow.&lt;/p&gt;

&lt;p&gt;Depending on how you learn better, you can either watch the videos that our community members made, or follow the walkthrough below.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Video walkthroughs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In these videos, you will learn at a leisurely pace how to use the new interface.&lt;br&gt;
&lt;a href="https://www.youtube.com/playlist?list=PLpTgUMBCn15rs2NkB4ise780UxLKImZTh"&gt;Playlist link.&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Workflow walkthrough: Step by step
&lt;/h2&gt;

&lt;p&gt;If you prefer to do things at your own pace, try the workflow walkthrough, which will show you the workflow of using the declarative interface.&lt;/p&gt;

&lt;p&gt;In the example below, we will show how to create an API integration with 2 endpoints. One of these is a child resource, using the data from the parent endpoint to make a new request.&lt;/p&gt;
&lt;h3&gt;
  
  
  Configuration Checklist: Before getting started
&lt;/h3&gt;

&lt;p&gt;In the following, we will use the GitHub API as an example.&lt;/p&gt;

&lt;p&gt;We will also provide links to examples from this &lt;a href="https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=SCr8ACUtyfBN&amp;amp;forceEdit=true&amp;amp;sandboxMode=true"&gt;Google Colab tutorial.&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Collect your api url and endpoints, &lt;a href="https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=bKthJGV6Mg6C"&gt;Colab example&lt;/a&gt;:

&lt;ul&gt;
&lt;li&gt;A URL is the base of the request, for example: &lt;code&gt;https://api.github.com/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;An endpoint is the path of an individual resource such as:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/repos/{OWNER}/{REPO}/issues&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;or  &lt;code&gt;/repos/{OWNER}/{REPO}/issues/{issue_number}/comments&lt;/code&gt; which would require the issue number from the above endpoint;&lt;/li&gt;
&lt;li&gt;or &lt;code&gt;/users/{username}/starred&lt;/code&gt; etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Identify the authentication methods, &lt;a href="https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=mViSDre8McI7"&gt;Colab example&lt;/a&gt;:

&lt;ul&gt;
&lt;li&gt;GitHub uses bearer tokens for auth, but we can also skip it for public endpoints &lt;a href="https://docs.github.com/en/rest/authentication/authenticating-to-the-rest-api?apiVersion=2022-11-28"&gt;https://docs.github.com/en/rest/authentication/authenticating-to-the-rest-api?apiVersion=2022-11-28&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Identify whether you have any dependent request patterns, such as first getting IDs in a list, then using each ID to request details.&lt;br&gt;
For GitHub, we might do the below or any other dependent requests. &lt;a href="https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=vw7JJ0BlpFyh"&gt;Colab example&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Get all repos of an org &lt;code&gt;https://api.github.com/orgs/{org}/repos&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Then get all contributors &lt;code&gt;https://api.github.com/repos/{owner}/{repo}/contributors&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;How does pagination work? Is there any? Do we know the exact pattern? &lt;a href="https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=rqqJhUoCB9F3"&gt;Colab example.&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On GitHub, we have consistent &lt;a href="https://docs.github.com/en/rest/using-the-rest-api/using-pagination-in-the-rest-api?apiVersion=2022-11-28"&gt;pagination&lt;/a&gt; between endpoints that looks like this &lt;code&gt;link_header = response.headers.get('Link', None)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Identify the necessary information for incremental loading, &lt;a href="https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=fsd_SPZD7nBj"&gt;Colab example&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Will any endpoints be loaded incrementally?&lt;/li&gt;
&lt;li&gt;What columns will you use for incremental extraction and loading?&lt;/li&gt;
&lt;li&gt;GitHub example: We can extract new issues by requesting issues after a particular time: &lt;code&gt;https://api.github.com/repos/{repo_owner}/{repo_name}/issues?since={since}&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
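&lt;p&gt;The &lt;code&gt;Link&lt;/code&gt; header from point 4 deserves a quick illustration. Below is a minimal, dependency-free sketch of extracting the next-page URL from a GitHub-style &lt;code&gt;Link&lt;/code&gt; header; the header value is a made-up example, and in practice the REST API Source's autodetected paginator does this for you:&lt;/p&gt;

```python
# Sketch: parse a GitHub-style Link header to find the next page URL.
# In practice the rest_api source's paginator handles this automatically.
def next_page_url(link_header):
    if not link_header:
        return None
    for part in link_header.split(","):
        url_part, _, rel_part = part.partition(";")
        if rel_part.strip() == 'rel="next"':
            return url_part.strip().strip("<>")
    return None

header = ('<https://api.github.com/repos/dlt-hub/dlt/issues?page=2>; rel="next", '
          '<https://api.github.com/repos/dlt-hub/dlt/issues?page=5>; rel="last"')
print(next_page_url(header))  # https://api.github.com/repos/dlt-hub/dlt/issues?page=2
```

&lt;p&gt;If the header is absent or has no &lt;code&gt;rel="next"&lt;/code&gt; entry, there is no further page to fetch.&lt;/p&gt;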
&lt;h3&gt;
  
  
  Configuration Checklist: Checking responses during development
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Data path:

&lt;ul&gt;
&lt;li&gt;You could print the source and see what is yielded. &lt;a href="https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=oJ9uWLb8ZYto&amp;amp;line=6&amp;amp;uniqifier=1"&gt;Colab example.&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Unless you had full documentation at point 4 (which we did), you likely still need to figure out some details of how pagination works.

&lt;ol&gt;
&lt;li&gt;To do that, we suggest using &lt;code&gt;curl&lt;/code&gt; or a second python script to do a request and inspect the response. This gives you flexibility to try anything. &lt;a href="https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=tFZ3SrZIMTKH"&gt;Colab example.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Or you could print the source as above - but if there is metadata in headers etc, you might miss it.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Applying the configuration
&lt;/h3&gt;

&lt;p&gt;Here’s what a configured example could look like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Base URL and endpoints.&lt;/li&gt;
&lt;li&gt;Authentication.&lt;/li&gt;
&lt;li&gt;Pagination.&lt;/li&gt;
&lt;li&gt;Incremental configuration.&lt;/li&gt;
&lt;li&gt;Dependent resource (child) configuration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you are using a narrow screen, scroll the snippet below to look for the numbers designating each component &lt;code&gt;(n)&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This source has 2 resources:
# - issues: Parent resource, retrieves issues incl. issue number
# - issues_comments: Child resource which needs the issue number from parent.
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rest_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RESTAPIConfig&lt;/span&gt;

&lt;span class="n"&gt;github_config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RESTAPIConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;base_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.github.com/repos/dlt-hub/dlt/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;#(1)
&lt;/span&gt;        &lt;span class="c1"&gt;# Optional auth for improving rate limits
&lt;/span&gt;        &lt;span class="c1"&gt;# "auth": {                                                   #(2)
&lt;/span&gt;        &lt;span class="c1"&gt;#     "token": os.environ.get('GITHUB_TOKEN'),
&lt;/span&gt;        &lt;span class="c1"&gt;# },
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="c1"&gt;# The paginator is autodetected, but we can pass it explicitly  #(3)
&lt;/span&gt;      &lt;span class="c1"&gt;#  "paginator": {
&lt;/span&gt;      &lt;span class="c1"&gt;#      "type": "header_link",
&lt;/span&gt;      &lt;span class="c1"&gt;#      "next_url_path": "paging.link",
&lt;/span&gt;      &lt;span class="c1"&gt;#  }
&lt;/span&gt;    &lt;span class="c1"&gt;# We can declare generic settings in one place
&lt;/span&gt;    &lt;span class="c1"&gt;# Our data is stateful so we load it incrementally by merging on id
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource_defaults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;primary_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                          &lt;span class="c1"&gt;#(4)
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_disposition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;merge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                 &lt;span class="c1"&gt;#(4)
&lt;/span&gt;        &lt;span class="c1"&gt;# these are request params specific to GitHub
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;per_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="c1"&gt;# This is the first resource - issues
&lt;/span&gt;        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issues&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issues&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                     &lt;span class="c1"&gt;#(1)
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;direction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;desc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;since&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incremental&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                    &lt;span class="c1"&gt;#(4)
&lt;/span&gt;                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cursor_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;#(4)
&lt;/span&gt;                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;initial_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-25T11:21:28Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;#(4)
&lt;/span&gt;                        &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="c1"&gt;# Configuration for fetching comments on issues              #(5)
&lt;/span&gt;        &lt;span class="c1"&gt;# This is a child resource - as in, it needs something from another
&lt;/span&gt;        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue_comments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issues/{issue_number}/comments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;#(1)
&lt;/span&gt;                &lt;span class="c1"&gt;# For child resources, you can use values from the parent resource for params.
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="c1"&gt;# Use type "resolve" to define child endpoint wich should be resolved
&lt;/span&gt;                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resolve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="c1"&gt;# Parent endpoint
&lt;/span&gt;                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issues&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="c1"&gt;# The specific field in the issues resource to use for resolution
&lt;/span&gt;                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;field&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="c1"&gt;# A list of fields, from the parent resource, which will be included in the child resource output.
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;include_from_parent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
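&lt;p&gt;To make the &lt;code&gt;resolve&lt;/code&gt; mechanism concrete, here is a plain-Python sketch of what it does: substitute a field from a parent record into the child endpoint path. The &lt;code&gt;fill_path&lt;/code&gt; helper is purely illustrative; it is not part of dlt's API, which performs this resolution internally.&lt;/p&gt;

```python
# Illustrative sketch only: how a "resolve" param fills a child endpoint path.
# dlt's rest_api source does this internally; fill_path is a hypothetical helper.

def fill_path(template: str, parent_record: dict, param: str, field: str) -> str:
    """Substitute {param} in the child path template with the parent record's field."""
    return template.replace("{" + param + "}", str(parent_record[field]))

parent_issue = {"id": 101, "number": 7, "title": "Bug report"}
path = fill_path("issues/{issue_number}/comments", parent_issue, "issue_number", "number")
print(path)  # issues/7/comments
```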



&lt;h2&gt;
  
  
  And that’s a wrap — what else should you know?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;As we mentioned, there’s also a &lt;strong&gt;REST Client&lt;/strong&gt; - an imperative way to use the same abstractions, such as the auto-paginator. Check out this runnable snippet:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dlt.sources.helpers.rest_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RESTClient&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the RESTClient with the Pokémon API base URL
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RESTClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://pokeapi.co/api/v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Using the paginate method to automatically handle pagination
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;paginate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/pokemon&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We are going to generate a bunch of sources from OpenAPI specs — stay tuned for an update in a couple of weeks!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Share back your work! Instructions: &lt;strong&gt;&lt;a href="https://www.notion.so/7a7f7ddb39334743b1ba3debbdfb8d7f?pvs=21"&gt;dltHub-Community-Sources-Snippets&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Read more about the

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api"&gt;REST API Source&lt;/a&gt;&lt;/strong&gt; and&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://dlthub.com/docs/general-usage/http/rest-client"&gt;REST API Client&lt;/a&gt;,&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;and the related &lt;strong&gt;&lt;a href="https://dlthub.com/devel/general-usage/http/overview"&gt;API helpers&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a href="https://dlthub.com/docs/general-usage/http/requests"&gt;requests&lt;/a&gt;&lt;/strong&gt; helper.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://dlthub.com/community"&gt;Join our community&lt;/a&gt;&lt;/strong&gt; and give us feedback!&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>etl</category>
      <category>datapipelines</category>
    </item>
    <item>
      <title>How I contributed my first data pipeline to open source</title>
      <dc:creator>Aman Gupta</dc:creator>
      <pubDate>Mon, 24 Jun 2024 08:00:38 +0000</pubDate>
      <link>https://dev.to/aman_gupta_7c59c96e9e167a/how-i-contributed-my-first-data-pipeline-to-the-open-source-d7n</link>
      <guid>https://dev.to/aman_gupta_7c59c96e9e167a/how-i-contributed-my-first-data-pipeline-to-the-open-source-d7n</guid>
      <description>&lt;p&gt;Hello, I'm Aman Gupta. Over the past eight years, I have navigated the structured world of civil engineering, but recently, I have found myself captivated by data engineering. Initially, I knew how to stack bricks and build structural pipelines. But this newfound interest has helped me build data pipelines, and most of all, it was sparked by a workshop hosted by &lt;strong&gt;dlt.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;dlt (data loading tool) is an open-source library that you can add to your Python scripts to load data from various and often messy data sources into well-structured, live datasets.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;dlt&lt;/code&gt; workshop took place in November 2022, co-hosted by Adrian Brudaru, my former mentor and co-founder of &lt;code&gt;dlt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;An opportunity arose when another client needed data migration from Freshdesk to BigQuery. I crafted a basic pipeline version, initially designed to support my own use case. When I presented it to the dlt team, Alena Astrakhatseva, a team member, generously offered to review it and refine it into a community-verified source.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fblog_my_first_data_pipeline.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fdlt-blog-images%2Fblog_my_first_data_pipeline.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My first iteration was straightforward—loading data in &lt;a href="https://dlthub.com/docs/general-usage/incremental-loading#the-3-write-dispositions" rel="noopener noreferrer"&gt;replace mode&lt;/a&gt;. While adequate for initial purposes, a verified source demanded features like &lt;a href="https://dlthub.com/docs/general-usage/http/overview#explicitly-specifying-pagination-parameters" rel="noopener noreferrer"&gt;pagination&lt;/a&gt; and &lt;a href="https://dlthub.com/docs/general-usage/incremental-loading" rel="noopener noreferrer"&gt;incremental loading&lt;/a&gt;. To achieve this, I developed an API client tailored for the Freshdesk API, integrating rate limit handling and pagination:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FreshdeskClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Client for making authenticated requests to the Freshdesk API,
    with built-in rate limit handling and pagination.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Contains stuff like domain, credentials and base URL.
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_request_with_rate_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Handles rate limits in HTTP requests and ensures that the client doesn't exceed the limit set by the server.
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;paginated_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;per_page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TDataItem&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# Fetches a paginated response from a specified endpoint.
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
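&lt;p&gt;For a rough idea of what the rate limit handling could look like, here is an illustrative retry helper that backs off when the server returns HTTP 429 and honors the &lt;code&gt;Retry-After&lt;/code&gt; header. This is a sketch under those assumptions, not the verified source's actual implementation; &lt;code&gt;send&lt;/code&gt; stands in for any function that performs the HTTP request and returns a response object.&lt;/p&gt;

```python
import time
from typing import Any, Callable

def request_with_rate_limit(
    send: Callable[[], Any],
    max_retries: int = 3,
    default_wait: float = 1.0,
) -> Any:
    """Call send() and retry with a wait whenever the response is HTTP 429.

    send() returns any object with status_code and headers attributes
    (e.g. a requests.Response). Illustrative sketch only, not the actual
    verified-source code.
    """
    for _attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429:
            return response
        # Honor the server's Retry-After header if present, else wait a default.
        wait = float(response.headers.get("Retry-After", default_wait))
        time.sleep(wait)
    # All retries exhausted: hand back the last (rate-limited) response.
    return response
```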



&lt;p&gt;To make the pipeline more effective, I developed dlt &lt;a href="https://dlthub.com/docs/general-usage/resource" rel="noopener noreferrer"&gt;resources&lt;/a&gt; that handle incremental data loading. This involved creating resources that use &lt;strong&gt;&lt;code&gt;dlt&lt;/code&gt;&lt;/strong&gt;'s incremental functionality to fetch only new or updated data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;incremental_resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dlt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incremental&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;initial_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2022-01-01T00:00:00Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Fetches and yields paginated data from a specified API endpoint.
    Each page of data is fetched based on the `updated_at` timestamp
    to ensure incremental loading.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Retrieve the last updated timestamp to fetch only new or updated records.
&lt;/span&gt;    &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_value&lt;/span&gt;

    &lt;span class="c1"&gt;# Use the FreshdeskClient instance to fetch paginated responses
&lt;/span&gt;    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;freshdesk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;paginated_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;per_page&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;per_page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
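&lt;p&gt;The effect of &lt;code&gt;dlt.sources.incremental&lt;/code&gt; can be illustrated in plain Python: keep a cursor value and yield only records whose timestamp is strictly newer than it. This is a simplified sketch of the idea, not dlt's internals.&lt;/p&gt;

```python
from typing import Dict, Iterable, Iterator

def incremental_filter(
    records: Iterable[Dict[str, str]],
    last_value: str,
    cursor_path: str = "updated_at",
) -> Iterator[Dict[str, str]]:
    """Yield only records whose cursor field is strictly newer than last_value.

    ISO 8601 timestamps in the same format compare correctly as strings,
    so plain string comparison suffices here. Simplified sketch of
    incremental loading.
    """
    for record in records:
        if record[cursor_path] > last_value:
            yield record

tickets = [
    {"id": "1", "updated_at": "2022-01-01T00:00:00Z"},
    {"id": "2", "updated_at": "2024-03-01T09:30:00Z"},
]
new_tickets = list(incremental_filter(tickets, last_value="2023-01-01T00:00:00Z"))
print([t["id"] for t in new_tickets])  # ['2']
```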



&lt;p&gt;With the steps defined above, I was able to load the data from Freshdesk to BigQuery and use the pipeline in production. Here’s a summary of the steps I followed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Created a Freshdesk API token with sufficient privileges.&lt;/li&gt;
&lt;li&gt;Created an API client to make requests to the Freshdesk API with rate limit handling and pagination.&lt;/li&gt;
&lt;li&gt;Made incremental requests through this client based on the “updated_at” field in the response.&lt;/li&gt;
&lt;li&gt;Ran the pipeline using the Python script.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While my journey from civil engineering to data engineering was initially intimidating, it has proved to be a profound learning experience. Writing a pipeline with &lt;strong&gt;&lt;code&gt;dlt&lt;/code&gt;&lt;/strong&gt; mirrors the simplicity of a GET request: you request data, yield it, and it flows from the source to its destination. Now, I help other clients integrate &lt;strong&gt;&lt;code&gt;dlt&lt;/code&gt;&lt;/strong&gt; to streamline their data workflows, which has been an invaluable part of my professional growth.&lt;/p&gt;

&lt;p&gt;In conclusion, diving into data engineering has expanded my technical skill set and provided a new lens through which I view challenges and solutions. A couple of years ago that lens mostly saw concrete and steel; now it has begun to notice the pipelines of the data world.&lt;/p&gt;

&lt;p&gt;Data engineering has proved challenging, satisfying, and a good career option for me so far. For those interested in the detailed workings of these pipelines, I encourage exploring dlt's &lt;a href="https://github.com/dlt-hub/verified-sources" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; or diving into the &lt;a href="https://dlthub.com/docs/dlt-ecosystem/verified-sources/freshdesk" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>etl</category>
      <category>data</category>
      <category>pipeline</category>
    </item>
  </channel>
</rss>
