<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tariq Abughofa</title>
    <description>The latest articles on DEV Community by Tariq Abughofa (@tariqabughofa).</description>
    <link>https://dev.to/tariqabughofa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F252892%2F86e1d43f-b46f-4397-840c-e3dba8c2a9ae.jpg</url>
      <title>DEV Community: Tariq Abughofa</title>
      <link>https://dev.to/tariqabughofa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tariqabughofa"/>
    <language>en</language>
    <item>
      <title>10 Books That Should Be in Every Programmer's Library</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Tue, 27 Apr 2021 15:52:14 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/10-books-that-should-be-on-each-programmer-s-library-57cn</link>
      <guid>https://dev.to/tariqabughofa/10-books-that-should-be-on-each-programmer-s-library-57cn</guid>
      <description>&lt;p&gt;I have read many books during my learning journey but in this article I present a list of the best books that helped me digest important topics. This special book list shaped my understanding of computer science, and kept me going back any time I’m in doubt.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Head First Design Patterns by Eric &amp;amp; Elisabeth Freeman
&lt;/h4&gt;

&lt;p&gt;This is the book that introduced me to programming design patterns, and what an introduction it was! Clear, comprehensive, and full of easy-to-understand real-world examples that stick in your memory. Every time I'm unsure which design pattern to use, I go back to this book and it solves my problem. A must-read for any programmer!&lt;br&gt;
&lt;a href="https://www.amazon.com/Head-First-Design-Patterns-Brain-Friendly/dp/0596007124/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzih5za4ak5l21z6t9ede.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  2. The C++ Programming Language 4th Edition by Bjarne Stroustrup
&lt;/h4&gt;

&lt;p&gt;What better book to learn a programming language from than one written by the creator of that language? Moreover, what if the language is as hard as C++? This book is a comprehensive guide to C++11 and a good guide to many general programming concepts, including object-oriented programming.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/C-Programming-Language-4th/dp/0321563840/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqca3zbygz9hc22dls73.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Pro Git, 2nd Edition by Scott Chacon and Ben Straub
&lt;/h4&gt;

&lt;p&gt;If you want to handle code conflicts like a boss and understand why your manager starts pulling their hair out when you amend a pushed commit, this is the book for you. It goes into depth on the version control tool and helped me understand the genius yet simple concepts Git is built on.&lt;br&gt;
&lt;a href="https://www.amazon.com/Pro-Git-Scott-Chacon-ebook/dp/B01ISNIKES/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld2qvtwsefow374byd6y.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Thinking Functionally with Haskell by Richard Bird
&lt;/h4&gt;

&lt;p&gt;Although functional programming is unlikely to threaten the dominance of imperative programming anytime soon, learning a functional language enriches a programmer's thinking and broadens their perspective. Many functional concepts, such as lambda functions and higher-order functions, were introduced into imperative languages over the last decade. Hence, this book is a great read for writing more elegant, less verbose code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Thinking-Functionally-Haskell-Richard-Bird/dp/1107452643/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82hu4ebwxuf4ma3r1p36.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  5. O'Reilly's Linux in a Nutshell
&lt;/h4&gt;

&lt;p&gt;Since this book is 944 pages long, I doubt the authors know what "in a nutshell" means... However, this is more of a reference book for Linux. It serves as the go-to book whenever I need information about a command or a system feature. I consider knowing how to handle Linux/Unix a core skill for any serious programmer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Linux-Nutshell-Desktop-Quick-Reference/dp/0596154488/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy79tyhbyu86a351k64kw.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  6. Thinking in Java by Bruce Eckel
&lt;/h4&gt;

&lt;p&gt;The book I learned Java from. Easy to read and comprehensive. Great for picking up object-oriented programming.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Thinking-Java-4th-Bruce-Eckel/dp/0131872486/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcq8uiec5g3spi5is8eaz.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  7. Programming Ruby 1.9 &amp;amp; 2.0 by Dave Thomas
&lt;/h4&gt;

&lt;p&gt;This book was given to me by my manager, who introduced our company to this charming programming language. Although I have had a love-hate relationship with many programming languages, this precious gem is still by far my favourite. It introduced one of the most programmer-friendly frameworks I've seen and many brilliant, pleasing-to-use libraries. I hereby feel the need to give the following disclaimer: learning Ruby is guaranteed to produce unwanted side effects, including displeasure and disgust when dealing with any of the alternatives ;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Programming-Ruby-1-9-2-0-Programmers/dp/1937785491/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdxvg3vgyiigf94tw37b.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  8. Database System Concepts 7th Edition by Avi Silberschatz, Henry Korth, S. Sudarshan
&lt;/h4&gt;

&lt;p&gt;This is a comprehensive book on SQL database management. It covers the relational database model, design, querying, and transaction management in general, without focusing on a specific system. In the last chapters, the book discusses these concepts in existing database systems: Postgres, Oracle, DB2, and MSSQL. It's a great resource for building in-depth knowledge of how relational databases work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Database-System-Concepts-Abraham-Silberschatz/dp/1260084507/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69d9nf7en4cjkoc8d7vq.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  9. The Hundred-Page Machine Learning Book by Andriy Burkov
&lt;/h4&gt;

&lt;p&gt;A fast read that gives anyone in the computer science field exposure to the modern concepts of machine learning. The point of the book is not to provide code but to explain the concepts. A heads-up: there is a lot of math in there, but then again, you're reading about machine learning. If you don't want to deal with math, you're into the wrong topic! ;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Hundred-Page-Machine-Learning-Book/dp/199957950X/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57q8suai3cf8a0v6xs9x.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  10. SQL Performance Explained by Markus Winand
&lt;/h4&gt;

&lt;p&gt;This book helped me understand how to do SQL optimizations and how indexing actually works. Every developer deals with query optimization at some point, whether for large data migrations or application performance enhancements. These are must-know principles for ETL and data engineering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Performance-Explained-Everything-Developers-about/dp/3950307826/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq211tg2cvhddcj01sv6f.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Bonus Book: Clean Code by Robert C. Martin
&lt;/h4&gt;

&lt;p&gt;I have to admit, this is not a personal recommendation. Through the years, almost everyone I know has recommended this book to me as the best book on programming, but I haven't had the chance to read it yet. Maybe I should now :D.&lt;/p&gt;

&lt;p&gt;Here's what a friend of mine had to say about the book: &lt;em&gt;"I would say Clean Code is the best book on programming. It touches on many aspects on CS while giving a voice of wisdom on what the author believes is most clean... It's very opinionated though."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I will leave it to you to judge that last criticism.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship-ebook/dp/B001GSTOAM/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flefaedfgymfo70nmfriz.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>books</category>
      <category>webdev</category>
      <category>beginners</category>
    </item>
    <item>
      <title>3 approaches to scroll through data in Elasticsearch</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Mon, 06 Jan 2020 01:36:59 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/3-approaches-to-scroll-through-data-in-elasticsearch-4ii3</link>
      <guid>https://dev.to/tariqabughofa/3-approaches-to-scroll-through-data-in-elasticsearch-4ii3</guid>
      <description>&lt;p&gt;Elasticsearch is a search engine that provides full-text search capabilities. It stores data in collections called indices in a document format. In this article I go through the supported techniques to paginate through collections or as they are called in Elasticsearch "indices"&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;from&lt;/code&gt; / &lt;code&gt;size&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Pagination of results can be done by using the &lt;code&gt;from&lt;/code&gt; and &lt;code&gt;size&lt;/code&gt; parameters. The &lt;code&gt;from&lt;/code&gt; parameter defines the number of items to skip from the start, while the &lt;code&gt;size&lt;/code&gt; parameter is the maximum number of hits to be returned.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET users/_search
{
    "from" : 0, "size" : 100,
    "query" : {
        "term" : { "user" : "john" }
    }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You can filter using this method. You can also sort by adding this JSON at the root level of the previous request body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"sort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"asc"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In Elasticsearch, you can't paginate beyond the &lt;code&gt;max_result_window&lt;/code&gt; index setting, which is 10,000 by default. This means that &lt;code&gt;from&lt;/code&gt; + &lt;code&gt;size&lt;/code&gt; must be less than that value. In practice, &lt;code&gt;max_result_window&lt;/code&gt; is not a limitation but a safeguard against deep pagination, which might crash the server since this method requires loading all the previous pages as well.&lt;/p&gt;
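
&lt;p&gt;If you genuinely need &lt;code&gt;from&lt;/code&gt; / &lt;code&gt;size&lt;/code&gt; pagination beyond that window, the setting can be raised per index with a request along these lines (shown here for the &lt;code&gt;users&lt;/code&gt; index from the examples above; keep in mind you are trading away the safeguard):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight http"&gt;&lt;code&gt;PUT users/_settings
{
    "index" : { "max_result_window" : 20000 }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;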

&lt;h3&gt;
  
  
  The &lt;code&gt;scroll&lt;/code&gt; API
&lt;/h3&gt;

&lt;p&gt;The recommended solution for efficient deep pagination, and a requirement once you reach the &lt;code&gt;max_result_window&lt;/code&gt; limit. The scroll API can be used to retrieve large numbers of results. It resembles cursors in SQL databases in that it involves the server in keeping track of how far the pagination has reached. In the same manner, it's not designed to serve user requests but rather to process large amounts of data.&lt;/p&gt;

&lt;p&gt;In order to start scrolling, an initial request has to be sent to open a search context on the server. The request also specifies how long the context should stay alive with the &lt;code&gt;scroll=TTL&lt;/code&gt; query parameter. This request keeps the context alive for 1 minute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST users/_search?scroll=1m
{
    "size": 100,
    "query" : {
        "term" : { "user" : "john" }
    }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The response of this request returns a &lt;code&gt;scroll_id&lt;/code&gt; value to be used in the subsequent fetch requests.&lt;/p&gt;

&lt;p&gt;After this request, the client can start scrolling through the data. Each subsequent page of the results (the initial search already returns the first page) is retrieved by sending the following request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST _search/scroll
{
    "scroll" : "1m",
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;As you can see, the request has to specify the &lt;code&gt;scroll_id&lt;/code&gt; (which the client got from the initial request) and the &lt;code&gt;scroll&lt;/code&gt; parameter, which tells the server to keep the context alive for another minute.&lt;/p&gt;
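
&lt;p&gt;The search context is freed automatically once the scroll time expires, but since open contexts are costly to keep, it's good practice to clear a scroll explicitly as soon as you are done with it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight http"&gt;&lt;code&gt;DELETE _search/scroll
{
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;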

&lt;h3&gt;
  
  
  The &lt;code&gt;search_after&lt;/code&gt; parameter
&lt;/h3&gt;

&lt;p&gt;The scroll API is great for deep pagination, but scroll contexts are costly to keep alive and are not recommended for real-time user requests. As a substitute for scroll contexts in these situations, the &lt;code&gt;search_after&lt;/code&gt; parameter was introduced to the search API. It lets the user provide information about the previous page that helps retrieve the current one, which means a certain order for the results is &lt;em&gt;necessary&lt;/em&gt; in the search query. Let's assume the first page was retrieved with the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET users/_search
{
    "size": 10,
    "query" : {
        "term" : { "user" : "john" }
    },
    "sort": [
        {"date": "asc"}
    ]
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For subsequent pages, we take the sort values of the last document returned by this request and pass them with the &lt;code&gt;search_after&lt;/code&gt; parameter. A later request would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET users/_search
{
    "size": 10,
    "query" : {
        "term" : { "user" : "john" }
    },
    "sort": [
        {"date": "asc"}
    ],
    "search_after": [1463538857]
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;from&lt;/code&gt; parameter can't be used when &lt;code&gt;search_after&lt;/code&gt; is passed, as the two contradict each other. This solution is very similar to the scroll API, but it relieves the server from keeping the pagination state, which also means it always returns the latest version of the data. For this reason, the sort order may change during a walk if updates or deletes happen on the index.&lt;/p&gt;

&lt;p&gt;This solution has the clear disadvantage of not being able to get a page at random: fetching page 100 requires fetching pages 0 through 99 first. Still, the solution is good for user pagination where you can only move next/previous through the pages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Random access with &lt;code&gt;search_after&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;As explained before, the &lt;code&gt;search_after&lt;/code&gt; parameter doesn't allow random-access pagination. However, there is a way to achieve random access by keeping statistical data about the indices in Elasticsearch. This approach is inspired by histograms in the Postgres database, which contain statistics about column value distribution in the form of a list of bucket boundaries. The idea is to implement that manually in Elasticsearch: have an index whose documents follow this schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bucket_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"starts_after"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;102181&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Let's call this index &lt;code&gt;pagination_index&lt;/code&gt;. Before creating and filling it, we should decide on a bucket size; let's say it's 1000 documents. The next step is to fill the index using the &lt;code&gt;search&lt;/code&gt; API with the &lt;code&gt;search_after&lt;/code&gt; parameter. Assuming the data index is called &lt;code&gt;articles&lt;/code&gt;, the operation would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Elasticsearch&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;max_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;bucket_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="ss"&gt;do:
    &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;index: &lt;/span&gt;&lt;span class="s1"&gt;'pagination_index'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;bucket_id: &lt;/span&gt;&lt;span class="n"&gt;bucket_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;starts_after: &lt;/span&gt;&lt;span class="n"&gt;max_id&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;index: &lt;/span&gt;&lt;span class="s1"&gt;'articles'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;size: &lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;search_after: &lt;/span&gt;&lt;span class="n"&gt;max_id&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;max_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;id&lt;/span&gt;
    &lt;span class="n"&gt;bucket_id&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;empty?&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now, to paginate in order (get the next page), &lt;code&gt;search_after&lt;/code&gt; is used with the values from the previous page, the same as with regular &lt;code&gt;search_after&lt;/code&gt; pagination. When there is a need to access a random page, we query the &lt;code&gt;pagination_index&lt;/code&gt; for the &lt;code&gt;starts_after&lt;/code&gt; value and use it to get the required page. It would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Elasticsearch&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;page_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="n"&gt;bucket_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="c1"&gt;# get page 0&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;index: &lt;/span&gt;&lt;span class="s1"&gt;'articles'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;size: &lt;/span&gt;&lt;span class="n"&gt;page_size&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;# do some processing or rendering of results.&lt;/span&gt;
&lt;span class="n"&gt;max_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;id&lt;/span&gt;
&lt;span class="c1"&gt;# get page 1&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;index: &lt;/span&gt;&lt;span class="s1"&gt;'articles'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;size: &lt;/span&gt;&lt;span class="n"&gt;page_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;search_after: &lt;/span&gt;&lt;span class="n"&gt;max_id&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;# do some processing or rendering of results.&lt;/span&gt;
&lt;span class="c1"&gt;# get page 200&lt;/span&gt;
&lt;span class="n"&gt;bucket_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;page_size&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;page_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;index: &lt;/span&gt;&lt;span class="s1"&gt;'pagination_index'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;bucket_id: &lt;/span&gt;&lt;span class="n"&gt;bucket_id&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;starts_after&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;page_size&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;index: &lt;/span&gt;&lt;span class="s1"&gt;'articles'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;size: &lt;/span&gt;&lt;span class="n"&gt;page_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;search_after: &lt;/span&gt;&lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
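
&lt;p&gt;The bucket arithmetic above is easy to get wrong, so here it is isolated as plain Ruby that runs without an Elasticsearch client. The &lt;code&gt;locate_page&lt;/code&gt; helper is purely illustrative, not part of any API:&lt;/p&gt;

```ruby
# Map a random page number to the statistics bucket holding its first
# document, plus that document's offset inside the bucket. This mirrors
# the "get page 200" arithmetic in the sketch above.
def locate_page(page_number, page_size: 100, bucket_size: 1000)
  doc_offset = page_number * page_size       # global index of the page's first document
  { bucket_id: doc_offset / bucket_size,     # which bucket to look up in pagination_index
    offset_in_bucket: doc_offset % bucket_size }
end

locate_page(200)  # bucket 20, offset 0 (page boundary lines up with a bucket boundary)
locate_page(207)  # bucket 20, offset 700
```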



&lt;p&gt;This approach works for any query, including whatever filtering is needed, but the statistics only hold for that specific query. The &lt;code&gt;pagination_index&lt;/code&gt; has to be maintained regularly as well; until it gets updated, the pages will be approximate. It is still a good approach for showing real-time results that require deep random-access pagination.&lt;/p&gt;

</description>
      <category>database</category>
      <category>nosql</category>
      <category>elasticsearch</category>
    </item>
    <item>
      <title>80+ Free Big Data Resources to Satisfy Your Knowledge Appetite - part 2</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Sun, 29 Dec 2019 19:12:20 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/80-free-big-data-resources-to-satisfy-your-knowledge-appetite-part-2-5ela</link>
      <guid>https://dev.to/tariqabughofa/80-free-big-data-resources-to-satisfy-your-knowledge-appetite-part-2-5ela</guid>
      <description>&lt;p&gt;This is a continuation of the resources I listed in part 1&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/tariqabughofa" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dhQgoHlZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/practicaldev/image/fetch/s--l9sa7M4---/c_fill%2Cf_auto%2Cfl_progressive%2Ch_150%2Cq_auto%2Cw_150/https://dev-to-uploads.s3.amazonaws.com/uploads/user/profile_image/252892/86e1d43f-b46f-4397-840c-e3dba8c2a9ae.jpg" alt="tariqabughofa image"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/tariqabughofa/80-free-big-data-resources-to-satisfy-your-knowledge-appetite-part-1-4361" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;80+ Free Big Data Resources to Satisfy Your Knowledge Appetite - part 1&lt;/h2&gt;
      &lt;h3&gt;Tariq Abughofa ・ Dec 22 '19 ・ 5 min read&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#database&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#datascience&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#nosql&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;p&gt;This part includes the following four categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Machine Learning &amp;amp; Algorithms in Big Data&lt;/li&gt;
&lt;li&gt;Data Processing Systems&lt;/li&gt;
&lt;li&gt;Real-time Processing&lt;/li&gt;
&lt;li&gt;Graph Processing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;a&gt;&lt;/a&gt;Machine Learning and Algorithms in Big Data
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://engineering.fb.com/core-data/recommending-items-to-more-than-a-billion-people/"&gt;Recommending items to more than a billion people&lt;/a&gt;: An article about collaborative filtering at Facebook.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://h2o-release.s3.amazonaws.com/h2o/rel-xia/3/docs-website/h2o-docs/booklets/SparklingWaterBooklet.pdf"&gt;Machine Learning with Sparkling Water&lt;/a&gt;: Using H2O the machine learning framework with Apache Spark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf"&gt;MLlib&lt;/a&gt;: Scalable Machine Learning library on Apache Spark from Stanford/Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf"&gt;TensorFlow&lt;/a&gt;: the famous large-scale machine learning library.&lt;/p&gt;

&lt;p&gt;Large-scale parallel collaborative filtering for the Netflix prize: an algorithm for large-scale recommendation of Netflix movies.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a&gt;&lt;/a&gt;Data Processing Systems
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8"&gt;Airflow&lt;/a&gt;: a workflow management system by AirBnB.&lt;/p&gt;

&lt;p&gt;Oozie: a workflow management system for Hadoop by Yahoo!.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sameeragarwal.github.io/blinkdb_eurosys13.pdf"&gt;BlinkDb&lt;/a&gt;: analytics on large scale data from Berkeley.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35650.pdf"&gt;FlumeJava&lt;/a&gt;: a library for developing parallel data pipelines from Google.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf"&gt;MapReduce&lt;/a&gt;: the google framework behind Hadoop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pdfs.semanticscholar.org/fe1c/c4e034ad4d3a54a5aa0a53a24b6f564298c4.pdf"&gt;Pig&lt;/a&gt;: an engine that supports &lt;a href="http://infolab.stanford.edu/~olston/publications/sigmod08.pdf"&gt;PigLatin&lt;/a&gt; a procedural dataflow language for Hadoop from Yahoo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://event.cwi.nl/lsde/papers/hive.pdf"&gt;Hive&lt;/a&gt; &lt;a href="http://infolab.stanford.edu/~ragho/hive-icde2010.pdf"&gt;(resource#2)&lt;/a&gt;: A data warehouse on top of Hadoop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf"&gt;The Dataflow Model&lt;/a&gt;: the model behind Google Cloud Dataflow which provides simplified stream and batch processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41378.pdf"&gt;MillWheel&lt;/a&gt;: stream processing engine from Google.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41318.pdf"&gt;Photon&lt;/a&gt;: A tool to join data streams at Google.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://d0.awsstatic.com/whitepapers/whitepaper-streaming-data-solutions-on-aws-with-amazon-kinesis.pdf"&gt;Kinesis&lt;/a&gt;: stream processing engine from Amazon.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://asterios.katsifodimos.com/assets/publications/flink-deb.pdf"&gt;Apache Flink&lt;/a&gt; &lt;a href="https://www.ververica.com/blog/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink"&gt;(resource#2)&lt;/a&gt;: stream and batch processing engine from TU Berlin.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/trill-vldb2015.pdf"&gt;Trill&lt;/a&gt;: incremental data analytics engine from Microsoft.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://notes.stephenholiday.com/Kafka.pdf"&gt;Kafka&lt;/a&gt;: the famous distributed messaging system from LinkedIn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://people.eecs.berkeley.edu/~alig/papers/spark-cacm.pdf"&gt;Apache Spark&lt;/a&gt;: the famous &lt;a href="https://people.csail.mit.edu/matei/papers/2013/sosp_spark_streaming.pdf"&gt;stream&lt;/a&gt; and &lt;a href="https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf"&gt;batch&lt;/a&gt; processing engine. It uses distributed memory abstractions: &lt;a href="https://www.kdnuggets.com/2017/08/three-apache-spark-apis-rdds-dataframes-datasets.html"&gt;RDDs, Dataframes, and Datasets&lt;/a&gt;. Since Spark 2 was released, it moved to &lt;a href="https://cs.stanford.edu/~matei/papers/2018/sigmod_structured_streaming.pdf"&gt;structured streaming&lt;/a&gt; &lt;a href="https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html"&gt;(resource#2)&lt;/a&gt; &lt;a href="https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html"&gt;(3)&lt;/a&gt; &lt;a href="https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html"&gt;(4)&lt;/a&gt; and the &lt;a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf"&gt;SparkSQL&lt;/a&gt; library was introduced to allow SQL queries over Spark Dataframes. The whole &lt;a href="https://databricks.com/blog/category/engineering"&gt;Databricks blog&lt;/a&gt; is a great resource for the project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cs.stanford.edu/~matei/papers/2016/sigmod_sparkr.pdf"&gt;SparkR&lt;/a&gt;: a Spark library to write processing application in R.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.istc-cc.cmu.edu/publications/papers/2013/grades-graphx_with_fonts.pdf"&gt;GraphX&lt;/a&gt; &lt;a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf"&gt;(resource#2)&lt;/a&gt;: distributed graph processing with Spark's RDDs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cs.stanford.edu/~matei/papers/2016/grades_graphframes.pdf"&gt;GraphFrames&lt;/a&gt;: distributed graph processing with Spark's Dataframes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.snappydata.io/snappy-industrial"&gt;SnappyData&lt;/a&gt; &lt;a href="http://cidrdb.org/cidr2017/papers/p28-mozafari-cidr17.pdf"&gt;(resource#2)&lt;/a&gt;: a transaction datastore on top of Spark.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a&gt;&lt;/a&gt;Real-time Processing
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://engineering.linkedin.com/blog/2018/11/samza-1-0--stream-processing-at-massive-scale"&gt;Samza&lt;/a&gt; &lt;a href="https://martin.kleppmann.com/papers/kafka-debull15.pdf"&gt;(resource#2)&lt;/a&gt; &lt;a href="http://www.vldb.org/pvldb/vol10/p1634-noghabi.pdf"&gt;(3)&lt;/a&gt; &lt;a href="https://martin.kleppmann.com/papers/samza-encyclopedia.pdf"&gt;(4)&lt;/a&gt;: Stream processing engine from LinkedIn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cs.brown.edu/courses/csci2270/archives/2015/papers/ss-storm.pdf"&gt;Storm&lt;/a&gt;: real-time data processing engine from Twitter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/heron_icde-1.pdf"&gt;Heron&lt;/a&gt;: the new Storm from Twitter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://research.fb.com/wp-content/uploads/2016/11/realtime_data_processing_at_facebook.pdf"&gt;Real-time data processing at facebook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tech.ebayinc.com/engineering/announcing-pulsar-real-time-analytics-at-scale/"&gt;Pulsar&lt;/a&gt;: real-time data processing engine from eBay.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a&gt;&lt;/a&gt;Graph Processing
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://stanford.edu/~rezab/papers/wtf_overview.pdf"&gt;WTF&lt;/a&gt;: the who to follow service at Twitter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.vldb.org/pvldb/vol9/p1281-sharma.pdf"&gt;GraphJet&lt;/a&gt;: real-time recommendation graph engine at Twitter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p135-malewicz.pdf"&gt;Pregel&lt;/a&gt;: large-scale graph processing engine at Google.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.vldb.org/pvldb/vol8/p1804-ching.pdf"&gt;Giraph&lt;/a&gt;: open source implementation of Pregel by Facebook.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>resources</category>
      <category>distributedsystems</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>80+ Free Big Data Resources to Satisfy Your Knowledge Appetite - part 1</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Sun, 22 Dec 2019 21:07:51 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/80-free-big-data-resources-to-satisfy-your-knowledge-appetite-part-1-4361</link>
      <guid>https://dev.to/tariqabughofa/80-free-big-data-resources-to-satisfy-your-knowledge-appetite-part-1-4361</guid>
      <description>&lt;p&gt;Data is becoming a cornerstone in software services. Whether it is the business model or it drives revenue or both, tech companies are flocking to use this "free" resource to provide better services and excel over there competitors.&lt;/p&gt;

&lt;p&gt;If you are in the "new-sexy" position in computer science or you're doing research in this field, you will find the resources in this article extremely helpful, the same way they helped me. Frontier companies in this field like Google, Facebook, LinkedIn, and Twitter, as well as big universities, have released tens of papers and articles on the subject outlining internal projects they worked on. Many of these projects were later open-sourced and became staples in the field. To save you the time and pain of getting lost in the labyrinth of endless resources on the internet (the way I did), I compiled a categorized list here. I will try to update the list frequently to keep it up-to-date.&lt;/p&gt;

&lt;p&gt;I divided the resources into 8 main categories. This first part includes the following four:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Big Data Storage &amp;amp; NoSQL Databases&lt;/li&gt;
&lt;li&gt;Interactive Data Analytics&lt;/li&gt;
&lt;li&gt;Big Data Challenges and Ecosystems&lt;/li&gt;
&lt;li&gt;Resource Management&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;a&gt;&lt;/a&gt;Big Data Storage &amp;amp; NoSQL
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf"&gt;Bigtable&lt;/a&gt;: the terrabyte NoSQL database behind google cloud storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf"&gt;Cassandra&lt;/a&gt;: the Facebook column-oriented database.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://static.usenix.org/events/fast/tech/full_papers/Sumbaly.pdf"&gt;Voldemort&lt;/a&gt;: Distributed database by LinkedIn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sites.cs.ucsb.edu/~agrawal/fall2009/dynamo.pdf"&gt;Dynamo&lt;/a&gt;: Amazon's key-value store.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.294.8459&amp;amp;rep=rep1&amp;amp;type=pdf"&gt;HBase&lt;/a&gt;: Column-oriented storage over HDFS by Facebook.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://neo4j.com/graph-databases-book/"&gt;Neo4J&lt;/a&gt;: the famous graph database.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://pages.cs.wisc.edu/~remzi/Classes/739/Spring2004/Papers/p215-dageville-snowflake.pdf"&gt;Snowflake&lt;/a&gt;: A data warehouse for the cloud.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf"&gt;The Google File System&lt;/a&gt;: the big data file system and the base behind distributed storage in Hadoop.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://storageconference.us/2010/Papers/MSST/Shvachko.pdf"&gt;HDFS&lt;/a&gt;: The Hadoop Distributed File System.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://img.bigdatabugs.com/RCFile%20A%20Fast%20and%20Space-efficient%20Data%20Placement%20Structure%20in%20MapReduce-based%20Warehouse%20Systems@www.bigDataBugs.com.pdf"&gt;RCFile&lt;/a&gt;: Data placement for data warehouses used in Apache Hive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html"&gt;Parquet&lt;/a&gt;: columnar storage format.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf"&gt;Haystack&lt;/a&gt;: an object storage system optimized for Facebook’s Photos application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/LRC12-cheng20webpage.pdf"&gt;Windows Azure Storage&lt;/a&gt;: Cloud Storage System from Microsoft.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ir.lib.uwo.ca/cgi/viewcontent.cgi?referer=https://www.google.com/&amp;amp;httpsredir=1&amp;amp;article=1069&amp;amp;context=electricalpub"&gt;Data management in cloud environments - NoSQL and NewSQL data stores&lt;/a&gt;: A paper surveying data stores beyond SQL such as Redis, HBase, ...etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a&gt;&lt;/a&gt;Interactive Data Analytics
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf"&gt;Dremel&lt;/a&gt;: analytics system by Google.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://cidrdb.org/cidr2015/Papers/CIDR15_Paper28.pdf"&gt;Impala&lt;/a&gt;: SQL engine for Hadoop by Cloudera.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cwiki.apache.org/confluence/display/incubator/DrillProposal"&gt;Drill&lt;/a&gt;: An open source implementation of Dremel.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.news.cs.nyu.edu/~jinyang/sp07/papers/dryad.pdf"&gt;Dryad&lt;/a&gt;: a framework to define dataflow graphs from Microsoft.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://web.eecs.umich.edu/~mosharaf/Readings/Tez.pdf"&gt;Tez&lt;/a&gt;: an open source implementation of Dryad from Hortonworks and Microsoft.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kudu.apache.org/kudu.pdf"&gt;Kudo&lt;/a&gt;: A storage for fast analytics on Big Data by Cloudera.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a&gt;&lt;/a&gt;Big Data Challenges and Ecosystems
&lt;/h3&gt;

&lt;p&gt;&lt;a href="http://snap.stanford.edu/class/cs224w-readings/Brin98Anatomy.pdf"&gt;Google&lt;/a&gt;: how the large-scale search engine was built.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codahale.com/you-cant-sacrifice-partition-tolerance/"&gt;The CAP theorem&lt;/a&gt; &lt;a href="https://mwhittaker.github.io/blog/an_illustrated_proof_of_the_cap_theorem/"&gt;(resource #2)&lt;/a&gt;: the theory which paved the way to NoSQL databases.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html"&gt;The Lambda Architecture&lt;/a&gt;: an architecture for data pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.oreilly.com/radar/questioning-the-lambda-architecture/"&gt;The Kappa Architecture&lt;/a&gt;: an alternative architecture for data pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.vldb.org/pvldb/vol7/p1441-boykin.pdf"&gt;Summingbird&lt;/a&gt;: a framework for integrating batch and online computations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying"&gt;The Log Problem in Big Data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cs.stanford.edu/people/chrismre/cs345/rl/eventually-consistent.pdf"&gt;Eventual Consistency&lt;/a&gt;: A look at how data consistency works in NoSQL database systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://web.cs.wpi.edu/~cs525/f13b-EAR/cs525-homepage/lectures/PAPERS/p1125-sumbaly.pdf"&gt;The Big Data Ecosystem at LinkedIn&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a&gt;&lt;/a&gt;Resource Management
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://lamport.azurewebsites.net/pubs/paxos-simple.pdf"&gt;Paxos&lt;/a&gt;: a consensus algorithm for distributed systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://files.catwell.info/misc/mirror/raft/raft.pdf"&gt;Raft&lt;/a&gt;: an alternative consensus algorithm to Paxos&lt;/p&gt;

&lt;p&gt;&lt;a href="https://marcoserafini.github.io/papers/zab.pdf"&gt;Zab&lt;/a&gt;: the consensus algorithm used in Zookeeper. &lt;a href="https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab+vs.+Paxos"&gt;Here is a comparison between Zab with Paxos&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.usenix.org/legacy/event/usenix10/tech/full_papers/Hunt.pdf"&gt;Zookeeper&lt;/a&gt;: Coordinator and distributed configuration system by Yahoo!.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cse.ust.hk/~weiwa/teaching/Fall15-COMP6611B/reading_list/YARN.pdf"&gt;YARN&lt;/a&gt;: resource manager for Hadoop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pdos.csail.mit.edu/6.824/papers/borg.pdf"&gt;Borg&lt;/a&gt;: Cluster manager by Google.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://assets.openshift.com/hubfs/pdfs/Kubernetes_OpenShift.pdf"&gt;Kubernetes&lt;/a&gt;: container-orchestration system for automating application deployment, scaling, and management by Google.&lt;/p&gt;

&lt;p&gt;The second part will focus on algorithms, ML, and data processing systems, so stay tuned ;)&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>nosql</category>
    </item>
    <item>
      <title>How to Load Data into an SQL database:
From simple inserts to bulk loads</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Sun, 15 Dec 2019 21:49:07 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/how-to-load-data-into-an-sql-database-from-simple-inserts-to-bulk-loads-4352</link>
      <guid>https://dev.to/tariqabughofa/how-to-load-data-into-an-sql-database-from-simple-inserts-to-bulk-loads-4352</guid>
      <description>&lt;p&gt;Data storage is the most integral part of a transactional database system. In this article, I will go into details about different ways to load data into a transaction SQL system. From inserting a couple of records, all the way to millions of records.&lt;/p&gt;

&lt;p&gt;The first and easiest way to insert data into the database is through an ORM (Object-Relational Mapping) tool. For the purpose of this tutorial, I will use Rails's ActiveRecord as a demonstration of ORM operations. Inserting data through the ORM ensures the execution of all the business rules and validations, so you don't have to worry about them. It's as easy as doing this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;Users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;username: &lt;/span&gt;&lt;span class="s2"&gt;"John Doe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;role: &lt;/span&gt;&lt;span class="s2"&gt;"admin"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The disadvantage of this method is that it doesn't scale. For each record, it creates a model object, executes the attached callbacks for validation and business rules, and executes a separate DML transaction.&lt;/p&gt;

&lt;p&gt;Some frameworks provide a way to do a bulk insert instead. A single &lt;code&gt;INSERT&lt;/code&gt; SQL statement is prepared and sent to the database, without instantiating the models or invoking model callbacks or validations. For example, in Rails 6 it's something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;username: &lt;/span&gt;&lt;span class="s1"&gt;'John Doe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;role: &lt;/span&gt;&lt;span class="s1"&gt;'admin'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;username: &lt;/span&gt;&lt;span class="s1"&gt;'Jane Doe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;role: &lt;/span&gt;&lt;span class="s1"&gt;'admin'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The same can be done with raw SQL, with exactly the same effect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt;
  &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;VAULES&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"John Doe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"admin"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Jane Doe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"admin"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Both solutions can run into conflicts (such as primary key uniqueness violations or unsuitable data types), and they can break business rules or high-level application validation rules. In both cases, the programmer has to deal with these problems by handling database-level errors and enforcing the higher-level rules manually. For example, if you want to ignore duplicates on the primary key column, you can add something like this to the end of the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;NOTHING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;What if the loaded data contains updates to existing rows in the database? This operation is called an upsert. If the row already exists in the table (existence is determined based on the primary key), the row is updated with the passed values. Otherwise, it is inserted as a new row. In Rails 6, it's as easy as replacing the &lt;code&gt;insert_all&lt;/code&gt; function with &lt;code&gt;upsert_all&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;username: &lt;/span&gt;&lt;span class="s1"&gt;'John Doe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;role: &lt;/span&gt;&lt;span class="s1"&gt;'admin'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;username: &lt;/span&gt;&lt;span class="s1"&gt;'Jane Doe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;role: &lt;/span&gt;&lt;span class="s1"&gt;'admin'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In raw SQL (here using MySQL's &lt;code&gt;ON DUPLICATE KEY&lt;/code&gt; syntax) it will be something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt;
  &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;VAULES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'John Doe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'admin'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Jane Doe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'admin'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;DUPLICATE&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;role&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The scalability of this method is still limited. There is a maximum query length in most servers, and even if there weren't, you wouldn't want to send a query gigabytes long over the network. A simple solution is to &lt;code&gt;UPSERT&lt;/code&gt; in batches. That would be something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;record_num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;length&lt;/span&gt;
&lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;batch_num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record_num&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;batch_num&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;lower_bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_num&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;
  &lt;span class="n"&gt;higher_bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch_num&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;
  &lt;span class="no"&gt;Users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lower_bound&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;higher_bound&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The same can be done with raw SQL instead of the &lt;code&gt;upsert_all&lt;/code&gt; function.&lt;/p&gt;
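As an illustrative plain-Ruby sketch of building such batched raw-SQL statements (the helper name is made up, values are naively quoted so it is not safe for untrusted input, and it uses Postgres-style `ON CONFLICT` syntax):

```ruby
# Build one bounded-size upsert statement per slice of records.
# Illustrative only: naive quoting, no database connection.
def batched_upsert_statements(records, batch_size: 1000)
  records.each_slice(batch_size).map do |batch|
    values = batch
      .map { |r| "(#{r[:id]}, '#{r[:username]}', '#{r[:role]}')" }
      .join(",\n  ")
    "INSERT INTO users(id, username, role)\nVALUES\n  #{values}\n" \
      "ON CONFLICT (id) DO UPDATE SET username = EXCLUDED.username, role = EXCLUDED.role;"
  end
end
```

Each returned statement stays bounded in size, so it can be sent to the server one batch at a time.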

&lt;p&gt;This solution technically scales well. However, to increase performance, most SQL databases have a copy functionality which loads data from a file into a table. To use it, the data is dumped into a file in a format that the database engine supports (a common one is CSV). After that, the table is filled from the file with an SQL command such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'path/to/my/csv/file.csv'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This statement appends all the data in the CSV file to the users table. This solution can give much better performance for files in the scale of multiple GBs of data.&lt;/p&gt;
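To make the dump step concrete, here is a small plain-Ruby sketch (the helper name and table layout are illustrative; it only writes the CSV and returns the COPY statement, it does not talk to a database):

```ruby
require "csv"

# Dump records to a CSV file and return the COPY statement that would
# load it. Purely illustrative: no database connection is made.
def dump_for_copy(records, path)
  CSV.open(path, "w") do |csv|
    records.each { |r| csv << [r[:id], r[:username], r[:role]] }
  end
  "COPY users FROM '#{path}' WITH (FORMAT csv);"
end
```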

&lt;p&gt;However, this solution doesn't handle upserts. To support them, we create a temporary table, fill it from the file, then merge it into the original table using the primary key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TEMP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;tmp_table&lt;/span&gt;
&lt;span class="p"&gt;...;&lt;/span&gt; &lt;span class="cm"&gt;/* same schema as `users` table */&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;tmp_table&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'path/to/my/csv/file.csv'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tmp_table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This solution avoids issuing multiple transactions to the database and yields higher performance. It's definitely worth looking into when loading GBs of data into the database.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>postgres</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Pagination in MongoDB and the Bucket Pattern</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Tue, 03 Dec 2019 23:22:50 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/pagination-in-mongodb-and-the-bucket-pattern-k3m</link>
      <guid>https://dev.to/tariqabughofa/pagination-in-mongodb-and-the-bucket-pattern-k3m</guid>
      <description>&lt;p&gt;MongoDB is a document-based NoSQL database based on the JSON data format. Because of its nested document data structure, tables (or collections as they are called in MongoDB), can have more records than it's SQL counterpart. Which makes paginating efficiently important to have. Pagination can be used to do batch data processing or to show data on user interfaces. In this article, I go through the approaches that MongoDB provides for this problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using the &lt;code&gt;cursor&lt;/code&gt; API's &lt;code&gt;skip&lt;/code&gt; and &lt;code&gt;limit&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The cursor API in MongoDB provides two functions that help implement pagination:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cursor.skip(n)&lt;/code&gt; which returns a cursor that begins returning results after skipping the first &lt;code&gt;n&lt;/code&gt; documents.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cursor.limit(m)&lt;/code&gt; which constrains the size of a cursor’s result set to &lt;code&gt;m&lt;/code&gt; documents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is how you paginate using the MongoDB shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 1st page&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// 2nd page&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// 3rd page&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Two things to note here: MongoDB cursors are &lt;em&gt;not&lt;/em&gt; the same as cursors in SQL databases, which keep stateful, server-side positions in a result set. Also, this approach is effectively the same as &lt;code&gt;offset&lt;/code&gt; and &lt;code&gt;limit&lt;/code&gt; in SQL.&lt;/p&gt;
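&lt;p&gt;The skip value for page &lt;code&gt;p&lt;/code&gt; with &lt;code&gt;m&lt;/code&gt; documents per page is simply &lt;code&gt;(p - 1) * m&lt;/code&gt;. A tiny helper makes the pattern explicit (the helper name is illustrative, not part of the MongoDB API):&lt;/p&gt;

```javascript
// Cursor arguments for a 1-based page number. Note that skip grows
// linearly with the page number, which is why deep pages get slow.
// Hypothetical helper, not part of the MongoDB API.
function pageArgs(page, perPage) {
  return { skip: (page - 1) * perPage, limit: perPage };
}

// Page 3 with 5 documents per page:
// const { skip, limit } = pageArgs(3, 5);   // skip = 10, limit = 5
// db.users.find().skip(skip).limit(limit);
```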

&lt;h3&gt;
  
  
  Using the &lt;code&gt;_id&lt;/code&gt; field
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;_id&lt;/code&gt; field is part of every MongoDB collection by default. It has the data type &lt;code&gt;ObjectId&lt;/code&gt;. &lt;code&gt;ObjectIds&lt;/code&gt; are 12-byte, unique, ordered, auto-generated values that act as an identifier for the document, much like primary keys in SQL databases. The important feature of the &lt;code&gt;_id&lt;/code&gt; field is that it is ordered and indexed by default, which makes it suitable for pagination when combined with the &lt;code&gt;limit&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 1st page&lt;/span&gt;
&lt;span class="kd"&gt;set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;max_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;_id&lt;/span&gt;
&lt;span class="c1"&gt;// 2nd page&lt;/span&gt;
&lt;span class="kd"&gt;set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;_id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;$gt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;max_id&lt;/span&gt;&lt;span class="p"&gt;}}).&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;max_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;_id&lt;/span&gt;
&lt;span class="c1"&gt;// 3rd page&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;_id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;$gt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;max_id&lt;/span&gt;&lt;span class="p"&gt;}}).&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
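&lt;p&gt;The loop above can be sketched end-to-end. Here it is simulated over a plain in-memory array standing in for the collection (the helper names are illustrative assumptions, not the MongoDB API):&lt;/p&gt;

```javascript
// Keyset pagination simulated over a sorted in-memory "collection".
// fetchPage stands in for db.users.find({_id: {$gt: lastId}}).limit(n).
const collection = Array.from({ length: 12 }, (_, i) => ({ _id: i + 1 }));

function fetchPage(lastId, n) {
  return collection.filter((doc) => doc._id > lastId).slice(0, n);
}

function allPages(n) {
  const pages = [];
  let lastId = 0; // smaller than every _id in the collection
  for (;;) {
    const page = fetchPage(lastId, n);
    if (page.length === 0) break;
    pages.push(page.map((doc) => doc._id));
    lastId = page[page.length - 1]._id; // resume after the last seen _id
  }
  return pages;
}
```

Unlike `skip`, each step only seeks past the last seen `_id`, so the cost of fetching a page does not grow with how deep the page is.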



&lt;h3&gt;
  
  
  Using an indexed field
&lt;/h3&gt;

&lt;p&gt;If you have an indexed field and you want to return the pages sorted on that field instead, a good solution is to combine &lt;code&gt;cursor.sort()&lt;/code&gt; and &lt;code&gt;cursor.limit(n)&lt;/code&gt; with a comparison query operator (&lt;code&gt;$gt&lt;/code&gt;, &lt;code&gt;$lt&lt;/code&gt;) to skip the previous pages. This way the query uses the index to skip the unwanted documents and then reads only the &lt;code&gt;n&lt;/code&gt; wanted documents. The query looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;created_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$gt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2018-07-21T12:01:35&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;created_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The downside is that we can't jump directly to a specific page. If that is necessary, this approach doesn't work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using the Bucket Pattern
&lt;/h3&gt;

&lt;p&gt;This is a unique storage/pagination technique that can only be used with document-based NoSQL databases. It scales well in terms of both the stored data and the index size, while still allowing navigation to any page at random. However, this method starts with the way we store the data.&lt;/p&gt;

&lt;p&gt;A good use-case for this pattern is time-series data. Let's say we're getting location updates through GPS each minute and we store each update as a document this way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="nl"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
   &lt;span class="nx"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nx"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2019-09-28T02:00:00.000Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="nx"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;8.80173&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nx"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;20.63476&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
   &lt;span class="na"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2019-09-28T02:01:00.000Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="na"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;8.80175&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="na"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;20.63478&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
   &lt;span class="na"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2019-09-28T02:02:00.000Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="na"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;8.80178&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="na"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;20.63486&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Convenient indices to have here are one on &lt;code&gt;source_id&lt;/code&gt; and another on &lt;code&gt;timestamp&lt;/code&gt;. We could then paginate the data sorted on &lt;code&gt;timestamp&lt;/code&gt; as we saw in the previous method. However, the scalability of this solution is questionable, as the &lt;code&gt;timestamp&lt;/code&gt; index and the collection get huge really fast.&lt;/p&gt;

&lt;p&gt;Here the Bucket Pattern comes to the rescue. Instead of saving each data point as a separate document, we leverage the document data model that MongoDB uses. We save the data points that arrive within a window, let's say an hour, as a list in one single document we refer to as a bucket. We also add extra attributes to the document, such as &lt;code&gt;start_timestamp&lt;/code&gt;, the date at which the bucket's data points start, and maybe some aggregated data. The bucket would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;start_timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2019-09-28T02:00:00.000Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nx"&gt;locations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;
           &lt;span class="na"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;8.80173&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="na"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;20.63476&lt;/span&gt;
       &lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;
           &lt;span class="na"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;8.80175&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="na"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;20.63478&lt;/span&gt;
       &lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="err"&gt;…&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;
           &lt;span class="nl"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;8.80378&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="nx"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;20.63786&lt;/span&gt;
       &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="nx"&gt;average_speed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;56.056&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Using the Bucket Pattern, we went down from 60 documents per hour to only one. For the index, we can now index on &lt;code&gt;start_timestamp&lt;/code&gt; instead of &lt;code&gt;timestamp&lt;/code&gt;, so it is 60 times smaller.&lt;/p&gt;

&lt;p&gt;Now you might ask "how does this help with pagination though?". The answer is that by pre-aggregating the data per hour, we implemented a built-in pagination for the collection. To get the 10th page, we just need to get the 10th document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// the date at which our measurement started&lt;/span&gt;
&lt;span class="nx"&gt;data_start_point&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2019-01-01T01:00:00.000Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// add a 10-hour period to the date which is 10*60*60*1000 in milliseconds&lt;/span&gt;
&lt;span class="nx"&gt;page_timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data_start_point&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getTime&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;start_timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;page_timestamp&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If you want to get the 10th page but you prefer each page to hold 3 hours of data instead of one, it's just a matter of math:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// the date at which our measurement started&lt;/span&gt;
&lt;span class="nx"&gt;data_start_point&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2019-01-01T01:00:00.000Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// add a 30-hour period to the date which is 3*10*60*60*1000 in milliseconds&lt;/span&gt;
&lt;span class="nx"&gt;page_start_timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data_start_point&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getTime&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// use "limit" to get 3-hour data&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;start_timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$gte&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;page_timestamp&lt;/span&gt; &lt;span class="p"&gt;}}).&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
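&lt;p&gt;Since the page arithmetic above is pure date math, it can be captured in one small standalone function (a sketch; the function name and the 1-based page convention are assumptions):&lt;/p&gt;

```javascript
// Start timestamp of the p-th page (1-based) when each bucket covers one
// hour and a page spans `hoursPerPage` buckets. Pure date arithmetic.
const HOUR_MS = 60 * 60 * 1000;

function pageStartTimestamp(dataStart, page, hoursPerPage) {
  const offsetHours = (page - 1) * hoursPerPage;
  return new Date(dataStart.getTime() + offsetHours * HOUR_MS);
}

// The query for that page would then be:
// db.users.find({ start_timestamp: { $gte: ts } }).limit(hoursPerPage);
```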



</description>
      <category>mongodb</category>
      <category>database</category>
      <category>nosql</category>
    </item>
    <item>
      <title>The case against jQuery</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Mon, 25 Nov 2019 18:11:05 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/the-case-against-jquery-57oi</link>
      <guid>https://dev.to/tariqabughofa/the-case-against-jquery-57oi</guid>
      <description>&lt;h1&gt;
  
  
  The Good
&lt;/h1&gt;

&lt;p&gt;jQuery was a great library. It made manipulating the DOM and adding listeners very easy at a time when javascript wasn't as mature as it is today. It saved programmers a lot of the trouble of making sure their code worked properly in all browsers. The syntax is user friendly and easy to learn.&lt;/p&gt;

&lt;p&gt;However, these same great features of jQuery make it the pain that it is today. When I started working in frontend development, I was introduced to jQuery at the same time I was introduced to javascript. I was hooked immediately. Why wouldn't I be? The Web API implementation varied widely between browsers, and all the good plugins out there depended on jQuery. &lt;/p&gt;

&lt;h1&gt;
  
  
  The Bad
&lt;/h1&gt;

&lt;p&gt;Since every time I needed to manipulate the DOM I imported jQuery, I didn't even bother to learn how it's done natively. Neither did my colleagues. And it is a widespread problem: many developers think of javascript as jQuery. A quick look at Stack Overflow shows how many people answer javascript questions using the jQuery API. You have to say "vanilla javascript" or "without jQuery" to get a proper answer, and even then you may get answers like "you should use jQuery" :( A simple Google search for any DOM-related javascript question shows the same problem. (The term &lt;em&gt;vanilla javascript&lt;/em&gt; is a conundrum on its own. Check this cool satire website about &lt;a href="http://vanilla-js.com/"&gt;vanilla js&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;This is a trap that many fall into. Instead of learning javascript, then the web API, then jQuery, the order happens in reverse, or maybe it never goes further than just learning jQuery. It's like learning Rails without learning Ruby. Once things get a bit complicated you'll be in hot water, stuck with the framework even when it's not needed. &lt;br&gt;
Add to that the confusion over whether a variable is a native javascript DOM element or a jQuery-wrapped one.&lt;/p&gt;

&lt;p&gt;If you use a frontend framework, you will see how polluted the code becomes if you manipulate the DOM with jQuery, since all frameworks rightfully pass around native DOM elements. Not to mention that jQuery encourages writing spaghetti code, partly because of the lack of any structural standard around it and the ability to chain DOM selectors.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Ugly
&lt;/h1&gt;

&lt;p&gt;You can say "I learned javascript properly, and when I don't want to use jQuery I can just drop it". Well, it's not that easy. Almost every javascript library is a jQuery plugin. The responsive design libraries like Bootstrap and Foundation, WordPress, Select2, Fancybox, and many other frontend libraries are &lt;em&gt;dependent&lt;/em&gt; on jQuery.&lt;/p&gt;

&lt;p&gt;That adds at least 82.54 KB to your website's initial download (for the minified version). Not to mention that jQuery sacrifices performance to be able to do its magic. The need to include it anyway lures developers into using it in their own code, and the hole keeps getting deeper.&lt;/p&gt;

&lt;h1&gt;
  
  
  Opposing Arguments
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Cross-browser support
&lt;/h3&gt;

&lt;p&gt;The Web API differences between browsers have dropped significantly since the introduction of jQuery. Browser usage is also much more concentrated in Chrome nowadays (not that I'm happy about it ¯\_(ツ)_/¯), and users tend to be on the latest version of whatever browser they use, since the update process is much easier today.&lt;/p&gt;

&lt;p&gt;You might say that your users use some ancient IE version you need to support. Luckily, this argument doesn't hold anymore, since you can use the &lt;a href="https://babeljs.io/"&gt;Babel&lt;/a&gt; project to support any list of browsers and versions you like. Plus, Babel is not a run-time dependency, so no performance overhead is added.&lt;/p&gt;

&lt;h3&gt;
  
  
  The shortcomings of javascript
&lt;/h3&gt;

&lt;p&gt;A strong argument for jQuery was that javascript used to require a lot of boilerplate. Functions like &lt;code&gt;$.inArray()&lt;/code&gt; or &lt;code&gt;$.each()&lt;/code&gt; were there to overcome a painful way of iterating arrays in javascript. However, the alternatives (&lt;code&gt;forEach&lt;/code&gt; and &lt;code&gt;Object.keys()&lt;/code&gt;) have existed for a long time and are supported in IE9+. Javascript has evolved a lot since ES5, and even for browsers with limited feature support, using a transpiler is far better than using a run-time library. &lt;/p&gt;
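&lt;p&gt;For the array and object helpers in particular, the native equivalents are one-liners. A runnable comparison, no jQuery involved:&lt;/p&gt;

```javascript
// Native stand-ins for common jQuery array/object utilities.
const fruits = ["apple", "banana", "cherry"];

// $.inArray(value, array) -> Array.prototype.indexOf
const idx = fruits.indexOf("banana"); // 1

// $.each(array, fn) -> Array.prototype.forEach
const upper = [];
fruits.forEach((f) => upper.push(f.toUpperCase()));

// $.each(object, fn) -> Object.keys plus forEach
const counts = { apple: 2, cherry: 5 };
const keys = Object.keys(counts); // ["apple", "cherry"]
```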

&lt;h3&gt;
  
  
  I don't use react (or Vue) so I use jQuery
&lt;/h3&gt;

&lt;p&gt;Does it really need to be either/or? :) &lt;/p&gt;

&lt;h3&gt;
  
  
  Animation libraries require jQuery
&lt;/h3&gt;

&lt;p&gt;There are many lightweight animation libraries that don't require jQuery, such as &lt;a href="https://shopify.github.io/draggable/"&gt;draggable&lt;/a&gt;, &lt;a href="https://github.com/cferdinandi/smooth-scroll"&gt;smooth-scroll&lt;/a&gt; and &lt;a href="https://github.com/SortableJS/Sortable"&gt;sortable&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  How can I help?
&lt;/h1&gt;

&lt;p&gt;Make sure to use native javascript DOM manipulation. Many websites can help you find the alternative syntax and show you how easy it is: &lt;a href="http://youmightnotneedjquery.com/"&gt;http://youmightnotneedjquery.com/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another way is to support and use lightweight libraries that do not depend on jQuery. &lt;a href="https://blog.bigbinary.com/2017/06/20/rails-5-1-has-dropped-dependency-on-jquery-from-the-default-stack.html"&gt;Rails removed jQuery as a dependency since 5.1&lt;/a&gt;. &lt;a href="https://github.blog/2018-09-06-removing-jquery-from-github-frontend/"&gt;Github ditched jQuery last year&lt;/a&gt;.  &lt;a href="https://news.ycombinator.com/item?id=19147466"&gt;Bootstrap 5 will not depend on jQuery&lt;/a&gt;, and I listed many animation libraries above.&lt;br&gt;
You can also share here the libraries you like that don't depend on jQuery.&lt;/p&gt;

&lt;p&gt;Do you have a reason why you personally use jQuery, or do you think it still has a place today? Please share in the comments and I will be happy to discuss it.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>webdev</category>
      <category>frontend</category>
      <category>jquery</category>
    </item>
    <item>
      <title>General and Unconventional Pagination Techniques in Postgres</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Tue, 19 Nov 2019 03:26:29 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/general-and-unconventional-pagination-techniques-postgres-24fh</link>
      <guid>https://dev.to/tariqabughofa/general-and-unconventional-pagination-techniques-postgres-24fh</guid>
      <description>&lt;p&gt;I talked in a previous article on how pagination can be done in SQL databases &lt;em&gt;(head there to read more about the general techniques. Their advantages, disadvantages, and scenarios)&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag__link"&gt;
  &lt;a href="/tariqabughofa" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dhQgoHlZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/practicaldev/image/fetch/s--l9sa7M4---/c_fill%2Cf_auto%2Cfl_progressive%2Ch_150%2Cq_auto%2Cw_150/https://dev-to-uploads.s3.amazonaws.com/uploads/user/profile_image/252892/86e1d43f-b46f-4397-840c-e3dba8c2a9ae.jpg" alt="tariqabughofa image"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/tariqabughofa/how-to-paginate-the-right-way-in-sql-hdc" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;How to Paginate (the right way) in SQL&lt;/h2&gt;
      &lt;h3&gt;Tariq Abughofa ・ Nov 13 '19 ・ 4 min read&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#sql&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#database&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#postgres&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
In this article, I show how to implement these approaches in the Postgres database engine, in addition to three unconventional pagination methods specific to Postgres.
&lt;h3&gt;
  
  
  &lt;code&gt;limit&lt;/code&gt; / &lt;code&gt;offset&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;limit&lt;/code&gt; / &lt;code&gt;offset&lt;/code&gt; approach is standard SQL and works as-is in Postgres:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;sold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
 &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;OFFSET&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Cursors
&lt;/h3&gt;

&lt;p&gt;Cursors are also pretty straightforward. You first declare the cursor with the query it will execute (the query can be bounded or unbounded); in Postgres, declaring a cursor inside a transaction also opens it. Then you fetch a page of rows with each &lt;code&gt;FETCH&lt;/code&gt; statement, and close the cursor at the end to release its resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt;
   &lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="k"&gt;CURSOR&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
       &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
       &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;production_year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="c1"&gt;-- Open the cursor&lt;/span&gt;
   &lt;span class="k"&gt;OPEN&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="n"&gt;LOOP&lt;/span&gt;
    &lt;span class="c1"&gt;-- fetch row into the film&lt;/span&gt;
      &lt;span class="k"&gt;FETCH&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;-- exit when no more row to fetch&lt;/span&gt;
      &lt;span class="n"&gt;EXIT&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;FOUND&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;LOOP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

   &lt;span class="c1"&gt;-- Close the cursor&lt;/span&gt;
   &lt;span class="k"&gt;CLOSE&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Key-based Pagination
&lt;/h3&gt;

&lt;p&gt;Pretty much SQL standard as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now let's start with the unconventional pagination approaches:&lt;/p&gt;

&lt;h3&gt;
  
  
  Paginating over &lt;code&gt;xmin&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;xmin&lt;/code&gt; is one of several system columns that Postgres adds implicitly to each table. This column in particular holds the identifier of the database transaction that inserted or last updated the corresponding row. Because of that, it serves as a good way to identify the changes that appeared on a table after a certain point, and to sort rows by the time they were last touched by a transaction.&lt;/p&gt;

&lt;p&gt;To use the xmin column for pagination, we can use the same key-based pagination approach. The following query paginates with a 1000-row limit, with the data sorted in ascending order of last update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;xmin&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;xmin&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The method has the same disadvantage as key-based pagination: you can't jump to an arbitrary page. Instead, you scroll through the pages until you reach the one you need.&lt;/p&gt;

&lt;p&gt;One thing to be aware of when using &lt;code&gt;xmin&lt;/code&gt; is that, since it represents the transaction id, rows that were inserted or updated in the same transaction have the same &lt;code&gt;xmin&lt;/code&gt; and thus no defined order among themselves.&lt;/p&gt;
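One way to make the order deterministic is to add a tie-breaker to the key. A minimal sketch, assuming the table also has an auto-incremented `id` primary key:

```sql
-- Key-based pagination over (xmin, id); the id column is an
-- assumed primary key used only to break ties within a transaction
SELECT *
FROM users
WHERE (xmin::text::int, id) > (5000, 0)
ORDER BY xmin::text::int ASC, id ASC
LIMIT 1000;
```

The next page would reuse the last `(xmin, id)` pair seen in place of `(5000, 0)`.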

&lt;h3&gt;
  
  
Paginating over &lt;code&gt;ctid&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ctid&lt;/code&gt; is another system column in Postgres. It holds the physical location of the row, so when data is sorted on this column it comes back in storage order, which is logically random. It also means rows are fetched very fast, since there is no extra disk-access cost to move from one row to the next. That's why indexes internally use &lt;code&gt;ctid&lt;/code&gt;s instead of the primary key to point to rows.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ctid&lt;/code&gt; value is a tuple &lt;code&gt;(p, n)&lt;/code&gt; where &lt;code&gt;p&lt;/code&gt; represents the physical page (block) number and &lt;code&gt;n&lt;/code&gt; represents the row number within that page.&lt;/p&gt;

&lt;p&gt;This is good for a scenario in which:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we need extremely fast deep page access.&lt;/li&gt;
&lt;li&gt;filtering is not needed.&lt;/li&gt;
&lt;li&gt;we don't care about the row orders or we want random row order.&lt;/li&gt;
&lt;li&gt;we don't require all the pages to have the same number of rows (since deleted rows leave holes in pages).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To get page number &lt;code&gt;p&lt;/code&gt;, we need to do some calculations. Each page holds "block size" bytes of data (8192 bytes, or 8 KB, by default). Rows are referenced by 32-bit (4-byte) pointers, so there are at most block size / 4 rows per page. The following query generates all the possible &lt;code&gt;ctid&lt;/code&gt; values in page &lt;code&gt;p&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'('&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;','&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;')'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;current_setting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'block_size'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To get rows in page &lt;code&gt;p&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ctid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;ANY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARRAY&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'('&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;','&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;')'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;
     &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;current_setting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'block_size'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Pagination using row statistics
&lt;/h3&gt;

&lt;p&gt;Postgres records statistics about its tables in the &lt;code&gt;pg_statistic&lt;/code&gt; catalog and provides an interface to access them through the &lt;code&gt;pg_stats&lt;/code&gt; view. One of these statistics is called &lt;code&gt;histogram_bounds&lt;/code&gt;. It holds, per column, the value distribution divided into buckets. Querying this field gives something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;histogram_bounds&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stats&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'users'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;attname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'id'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

                   &lt;span class="n"&gt;histogram_bounds&lt;/span&gt;
&lt;span class="c1"&gt;------------------------------------------------------&lt;/span&gt;
 &lt;span class="err"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;993&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1997&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3050&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4040&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5036&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5957&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7057&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8029&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;9016&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;9995&lt;/span&gt;&lt;span class="err"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We notice in the example that the values for the &lt;code&gt;id&lt;/code&gt; column go from 0 to 9995, divided into buckets of roughly 1000 values each. The first bucket goes from &lt;code&gt;id&lt;/code&gt; 0 to 993, the second from 993 to 1997, and so on. As you can see, there is an opportunity here to use these buckets to paginate over &lt;code&gt;id&lt;/code&gt;. If we assume the bucket size is &lt;code&gt;b&lt;/code&gt;, the page size is &lt;code&gt;n&lt;/code&gt;, and the page we want is &lt;code&gt;p&lt;/code&gt;, a simple calculation shows that the bucket containing our page has its lower bound at index &lt;code&gt;n * p / b + 1&lt;/code&gt; and its upper bound at index &lt;code&gt;n * p / b + 2&lt;/code&gt;. Once we have the histogram bounds for our bucket, the query is pretty easy to write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;histogram_bounds&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;[])[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lower_bound&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;histogram_bounds&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;[])[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;upper_bound&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stats&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'users'&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;attname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'id'&lt;/span&gt;
    &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;lower_bound&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;upper_bound&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
&lt;span class="k"&gt;OFFSET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Notice that we still use &lt;code&gt;OFFSET&lt;/code&gt; in the query, but it's applied only within the bucket rather than the whole table. This means that in the worst case, fetching the last page of a bucket reads all &lt;code&gt;b&lt;/code&gt; of its rows, instead of scanning the whole table.&lt;/p&gt;

&lt;p&gt;The results can be approximate, since the table statistics may lag behind the actual contents of the table. Still, the method is blazing fast for deep pagination with random access, as long as the application tolerates approximate results and doesn't require filtering.&lt;/p&gt;
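Since the histogram comes from planner statistics, one mitigation (assuming you have the privilege to run it) is to refresh the statistics before paginating:

```sql
-- Refresh planner statistics so histogram_bounds in pg_stats
-- reflects recent inserts, updates, and deletes on the table
ANALYZE users;
```

Autovacuum normally does this in the background, so a manual `ANALYZE` only matters right after heavy writes.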

&lt;p&gt;Stay tuned for more posts on this subject with NoSQL Databases :). &lt;/p&gt;

</description>
      <category>sql</category>
      <category>postgres</category>
      <category>database</category>
    </item>
    <item>
      <title>How to Paginate (the right way) in SQL</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Wed, 13 Nov 2019 23:28:30 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/how-to-paginate-the-right-way-in-sql-hdc</link>
      <guid>https://dev.to/tariqabughofa/how-to-paginate-the-right-way-in-sql-hdc</guid>
      <description>&lt;h4&gt;
  
  
  Many ways to scroll through a table. Easy to misuse them.
&lt;/h4&gt;

&lt;p&gt;Server-side pagination is a commonly-used feature in SQL databases. It helps when showing a huge set of results on user interfaces, it's required in RESTful APIs, and it comes to the rescue whenever you need to process large data in bulk that doesn't fit in memory. The problem is that, done wrong, it can be as inefficient as loading the full set, or worse.&lt;/p&gt;

&lt;p&gt;There are a number of ways to implement pagination in SQL systems. In this article I will go through these methods, with the advantages, disadvantages, and reasonable usage scenarios of each.&lt;/p&gt;

&lt;h4&gt;
  
  
  Client-side Pagination:
&lt;/h4&gt;

&lt;p&gt;The requested query is passed as it is without pagination. After the client gets the full set of data, it divides the data and shows it paginated to the user.&lt;/p&gt;

&lt;h6&gt;
  
  
  Advantages
&lt;/h6&gt;

&lt;p&gt;It reduces the number of requests to the database. The data can even be cached on the client-side to avoid future requests which makes later loads even faster.&lt;/p&gt;

&lt;h6&gt;
  
  
  Disadvantages
&lt;/h6&gt;

&lt;p&gt;Not suitable for regularly changing data. The dataset should be small enough to fit in memory and not make the initial load painfully slow.&lt;/p&gt;

&lt;h6&gt;
  
  
  Suitable Scenarios
&lt;/h6&gt;

&lt;p&gt;Small datasets which are not regularly updated but are needed frequently, such as a list of categories.&lt;/p&gt;

&lt;h4&gt;
  
  
  Using Limit/Offset:
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;limit&lt;/code&gt; and &lt;code&gt;offset&lt;/code&gt; are standard SQL keywords and the first solution that comes to mind for pagination over datasets. The query syntax would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;sold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
 &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;OFFSET&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h6&gt;
  
  
  Advantages
&lt;/h6&gt;

&lt;p&gt;Easy to implement, which is why many popular ORMs use it. It's as easy as chaining query functions like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;findAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It also lets you filter the table while paginating, using a &lt;code&gt;WHERE&lt;/code&gt; clause, and you can sort on any column, which makes the query very customizable. Only one page of data is loaded at a time, so there is no risk of running out of memory.&lt;/p&gt;

&lt;h6&gt;
  
  
  Disadvantages
&lt;/h6&gt;

&lt;p&gt;It can be HORRIBLE. The query performance goes downhill as the offset value increases. Let's say you are fetching page &lt;code&gt;n&lt;/code&gt;. How does &lt;code&gt;offset&lt;/code&gt; skip the first &lt;code&gt;n-1&lt;/code&gt; pages? It doesn't, not cheaply: the database has to scan and discard all the rows of the first &lt;code&gt;n-1&lt;/code&gt; pages before it can return page &lt;code&gt;n&lt;/code&gt;.&lt;/p&gt;
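In Postgres, for example, you can watch this happen with `EXPLAIN ANALYZE`; a sketch (the exact plan and timings depend entirely on your data and indexes):

```sql
-- The executor must produce and throw away the first 100000 rows
-- before returning the 10 requested ones
EXPLAIN ANALYZE
SELECT *
FROM products
ORDER BY sale_date DESC
LIMIT 10 OFFSET 100000;
```

Comparing the runtime against the same query with `OFFSET 0` makes the linear cost of deep offsets obvious.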

&lt;p&gt;An offset means that a certain number of records will be skipped from the start. But what if new data were inserted into page &lt;code&gt;n-1&lt;/code&gt; while page &lt;code&gt;n&lt;/code&gt; is being loaded? Rows from page &lt;code&gt;n-1&lt;/code&gt; will be pushed into page &lt;code&gt;n&lt;/code&gt;, causing inconsistency. Worse, if the paginating process is also updating the data, rows might be processed multiple times, causing data inconsistency.&lt;/p&gt;

&lt;h6&gt;
  
  
  Suitable Scenarios
&lt;/h6&gt;

&lt;p&gt;Great for user interfaces: easy to implement and very customizable to different filter/order preferences, as long as the depth of the search results is limited. Perfect for pages where users paginate down the results by scrolling. But put this functionality on a large dataset with a "last page" button on the interface, and brace yourself for a database server going out of service.&lt;/p&gt;

&lt;p&gt;For migrations and large dataset processing, my advice: Offset the &lt;code&gt;offset&lt;/code&gt; statement.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cursors
&lt;/h4&gt;

&lt;p&gt;SQL cursors make the server do the pagination for you. All you need to worry about is the query you want to paginate. It looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a cursor for the query and open it&lt;/span&gt;
&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="k"&gt;CURSOR&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;OPEN&lt;/span&gt; &lt;span class="n"&gt;curEmail&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Retrieve ten rows each time&lt;/span&gt;
&lt;span class="k"&gt;FETCH&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;FETCH&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- All done&lt;/span&gt;
&lt;span class="k"&gt;CLOSE&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h6&gt;
  
  
  Advantages
&lt;/h6&gt;

&lt;p&gt;Supports arbitrary queries. No performance drop when going further in the pages.&lt;/p&gt;

&lt;h6&gt;
  
  
  Disadvantages
&lt;/h6&gt;

&lt;p&gt;The implementation details differ from one SQL engine to another, but in general cursors hold resources on the server and can create locks. Clients can't share cursors, so each client has to open its own, which means they can't share the same pagination.&lt;/p&gt;

&lt;h6&gt;
  
  
  Suitable Scenarios
&lt;/h6&gt;

&lt;p&gt;A single client that needs to paginate over large sets of data and cares about pagination consistency. There are many cursor types to suit different applications: &lt;code&gt;READ_ONLY&lt;/code&gt;, which avoids locks on the table; &lt;code&gt;STATIC&lt;/code&gt;, which copies the result into a temporary table, good when updates don't matter; &lt;code&gt;KEYSET&lt;/code&gt;, which copies only the primary keys, so you see updates to existing rows but not new rows; and &lt;code&gt;DYNAMIC&lt;/code&gt;, which is like &lt;code&gt;KEYSET&lt;/code&gt; but keeps the key copy up to date, so it also sees newly inserted and deleted rows. The availability of these options depends on the engine.&lt;/p&gt;
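As a sketch of how these types are chosen, here is the SQL Server flavour of the declaration (the option keywords above are SQL Server's; Postgres and other engines use a different, smaller set of options):

```sql
-- SQL Server syntax: a static, read-only snapshot cursor
DECLARE cur CURSOR STATIC READ_ONLY FOR
    SELECT id, name FROM products;
OPEN cur;
FETCH NEXT FROM cur;   -- repeat to walk the snapshot
CLOSE cur;
DEALLOCATE cur;
```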

&lt;h4&gt;
  
  
  Key-based Pagination
&lt;/h4&gt;

&lt;p&gt;This technique is my favourite for data migrations. It requires an indexed column, but in exchange it offers a great optimization while staying stateless on the server side.&lt;/p&gt;

&lt;p&gt;The first page is fetched with the following statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For the second page, the query uses the maximum &lt;code&gt;id&lt;/code&gt; value fetched in the first page. Let's say it's 1000; the query is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And so on for the next pages.&lt;/p&gt;

&lt;h5&gt;
  
  
  Advantages
&lt;/h5&gt;

&lt;p&gt;Offers great scalability: it's as fast for the 1,000,000th page as for the first. Pagination consistency is also preserved. It supports filtering and ordering on any column, or on multiple columns, as long as you have indices on them.&lt;/p&gt;

&lt;h5&gt;
  
  
  Disadvantages
&lt;/h5&gt;

&lt;p&gt;In general, there is no way to jump to a specific page. However, if you have an auto-incremented identifier in the dataset, that can be done with a simple calculation: to retrieve page &lt;code&gt;n&lt;/code&gt;, the lower bound for the query would be &lt;code&gt;n * 1000&lt;/code&gt;, where 1000 is the page size (the limit).&lt;/p&gt;
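For example, a sketch of jumping straight to page 5 with a page size of 1000, assuming a dense auto-incremented `id` (no gaps from deletes or rolled-back inserts):

```sql
-- Page 5 (counting from 0) starts right after id = 5 * 1000
SELECT *
FROM products
WHERE id > 5000
ORDER BY id ASC
LIMIT 1000;
```

If ids have gaps, the page boundaries drift, so this shortcut only approximates true page numbers.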

&lt;h5&gt;
  
  
  Suitable Scenarios
&lt;/h5&gt;

&lt;p&gt;Almost any scenario. Very convenient for scalable applications where a lot of server requests are expected.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Git: Theirs vs Ours</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Fri, 18 Oct 2019 22:13:47 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/git-theirs-vs-ours-3i7h</link>
      <guid>https://dev.to/tariqabughofa/git-theirs-vs-ours-3i7h</guid>
      <description>&lt;p&gt;Conflicts in git can be a pain to deal with especially that you have to go into each code block which generated a conflict and check which version you want to keep and sometimes it can be hard to be sure. However, in some scenarios, the choice is easy. You want to keep whatever already existed on the server, or the opposite, you want to override it all with what you have. In this case, instead of going through the conflicts manually, git has extremely helpful flags which are &lt;code&gt;--ours&lt;/code&gt; and &lt;code&gt;--theirs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's assume that the conflict looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt; HEAD
 this is some content
=======
 this is a totally different content
&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; new_branch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;When you have a merge conflict and you know exactly which version you want to keep, entirely or on a file basis, you can use these flags like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --ours .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Or:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --our /path/to/your/conflict/file.rb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then you commit the chosen changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git add /path/to/your/conflict/file.rb
git commit
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;But when do you use which? Is it "ours" or "theirs"? Well, this can be a bit confusing... In the conflict above, for example, it's hard to tell. You might say "HEAD is whatever is already on the branch we're merging into, and the other part is what's on the new branch", but it's actually more complicated than that.&lt;/p&gt;

&lt;p&gt;If you look at the help for the git checkout command, you'll see the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   -2, --ours            checkout our version for unmerged files
   -3, --theirs          checkout their version for unmerged files
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;That doesn't sound very helpful. Let's go to the actual documentation. For version 2.23.0 it says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--ours
--theirs
When checking out paths from the index, check out stage #2 (ours) or #3 (theirs) for unmerged paths.

Note that during git rebase and git pull --rebase, ours and theirs may appear swapped; --ours gives the version from the branch the changes are rebased onto, while --theirs gives the version from the branch that holds your work that is being rebased.

This is because rebase is used in a workflow that treats the history at the remote as the shared canonical one, and treats the work done on the branch you are rebasing as the third-party work to be integrated, and you are temporarily assuming the role of the keeper of the canonical history during the rebase. As the keeper of the canonical history, you need to view the history from the remote as ours (i.e. "our shared canonical history"), while what you did on your side branch as theirs (i.e. "one contributor’s work on top of it").
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
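
&lt;p&gt;A quick way to see those stages for yourself is to inspect the index of a conflicted repo with &lt;code&gt;git ls-files -u&lt;/code&gt;. Here is a throwaway sketch; the repo, file, and branch names are made up to mirror the examples in this post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email you@example.com
git config user.name "You"
echo "original content" > file.rb
git add file.rb
git commit -qm "base"
git checkout -qb new_branch
echo "this is a totally different content" > file.rb
git commit -qam "edit on new_branch"
git checkout -q master
echo "this is some content" > file.rb
git commit -qam "edit on master"
git merge new_branch || true   # leaves file.rb in conflict
git ls-files -u                # one line per stage: 1 = base, 2 = ours, 3 = theirs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Each unmerged path is listed once per stage, and &lt;code&gt;--ours&lt;/code&gt; and &lt;code&gt;--theirs&lt;/code&gt; simply check out stage #2 or #3 for it.&lt;/p&gt;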



&lt;p&gt;Let's explain this further. We need to distinguish between the three git operations that integrate changes and can therefore generate conflicts: merges, rebases, and pulls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Merges
&lt;/h3&gt;

&lt;p&gt;A merge scenario can be like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout master
git merge new_branch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This one feels more intuitive as your guess is probably right:&lt;/p&gt;

&lt;p&gt;To keep the changes that are on &lt;code&gt;master&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --ours .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To keep the changes that are on &lt;code&gt;new_branch&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --theirs .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
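
&lt;p&gt;To see the whole merge flow end to end, here is a disposable sketch (hypothetical file and branch names) that recreates the conflict from the beginning of the post and resolves it in favor of master:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email you@example.com
git config user.name "You"
echo "original content" > file.rb
git add file.rb
git commit -qm "base"
git checkout -qb new_branch
echo "this is a totally different content" > file.rb
git commit -qam "edit on new_branch"
git checkout -q master
echo "this is some content" > file.rb
git commit -qam "edit on master"
git merge new_branch || true   # conflict
git checkout --ours file.rb    # keep master's version
git add file.rb
git commit -qm "resolve keeping ours"
cat file.rb                    # prints "this is some content"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;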



&lt;h3&gt;
  
  
  Rebases
&lt;/h3&gt;

&lt;p&gt;A conflict can appear within a rebase when you do something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout new_branch
git rebase master
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This one feels a bit counter-intuitive, as ours is not the branch we're on:&lt;/p&gt;

&lt;p&gt;To keep the changes that are on &lt;code&gt;master&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --ours .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To keep the changes that are on &lt;code&gt;new_branch&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --theirs .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;So exactly the same as before.&lt;/p&gt;
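
&lt;p&gt;To convince yourself of the swap, here is a similar throwaway sketch (again with made-up names) where we rebase and keep new_branch's work, which during the rebase counts as theirs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email you@example.com
git config user.name "You"
echo "original content" > file.rb
git add file.rb
git commit -qm "base"
git checkout -qb new_branch
echo "this is a totally different content" > file.rb
git commit -qam "edit on new_branch"
git checkout -q master
echo "this is some content" > file.rb
git commit -qam "edit on master"
git checkout -q new_branch
git rebase master || true       # conflict while replaying new_branch's commit
git checkout --theirs file.rb   # theirs = new_branch, the work being replayed
git add file.rb
GIT_EDITOR=true git rebase --continue
cat file.rb                     # prints "this is a totally different content"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;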

&lt;h3&gt;
  
  
  Pulls
&lt;/h3&gt;

&lt;p&gt;A pull command fetches data from a remote source and incorporates the changes into the local branch. The incorporation can be done either with a merge operation (the default mode) or with a rebase operation.&lt;/p&gt;

&lt;p&gt;The fetch + merge scenario looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout master
git pull origin new_branch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To keep the changes that are on &lt;code&gt;master&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --ours .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To keep the changes that are on &lt;code&gt;new_branch&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --theirs .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;So exactly the same as a merge operation.&lt;/p&gt;
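
&lt;p&gt;Here is a sketch of the fetch + merge case, using a local directory as a stand-in for the remote (all names hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd "$(mktemp -d)"
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/master
git config user.email you@example.com
git config user.name "You"
echo "original content" > file.rb
git add file.rb
git commit -qm "base"
git checkout -qb new_branch
echo "this is a totally different content" > file.rb
git commit -qam "edit on new_branch"
git checkout -q master
echo "this is some content" > file.rb
git commit -qam "edit on master"
cd ..
git clone -q upstream local          # "origin" now points at the upstream repo
cd local
git config pull.rebase false         # make the pull use the default merge mode
git pull origin new_branch || true   # fetch + merge, stops on the conflict
git checkout --theirs file.rb        # theirs = new_branch
cat file.rb                          # prints "this is a totally different content"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;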

&lt;p&gt;The fetch + rebase scenario happens in the following way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout new_branch
git pull origin master --rebase
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To keep the changes that are on &lt;code&gt;master&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --ours .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To keep the changes that are on &lt;code&gt;new_branch&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --theirs .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Exactly the same as a rebase operation.&lt;/p&gt;

&lt;p&gt;However, you don't have to memorize it as "merge makes sense and rebase doesn't" 😅. It is actually much simpler than that, and it can be summarized in one sentence:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"&lt;code&gt;ours&lt;/code&gt; represents the history and &lt;code&gt;theirs&lt;/code&gt; is the new applied commits"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In a merge, git takes the current branch and applies the additional commits on top of its HEAD. The current branch is the history (&lt;code&gt;ours&lt;/code&gt;) and the incoming commits are the new work (&lt;code&gt;theirs&lt;/code&gt;). In a rebase, git rewrites the history of the current branch so that it sits on top of the other branch, then applies the current branch's commits one by one. The other branch becomes the history, thus &lt;code&gt;ours&lt;/code&gt;, and the current branch's commits are the new additions, thus &lt;code&gt;theirs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now let's get back to our conflict example. Say you want to keep the work of the branch you ran the command from. If the operation was the merge above (you were on master), you use &lt;code&gt;--ours&lt;/code&gt; and the final code would be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;this is some content
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;while with the rebase (you were on new_branch, so your own work counts as theirs), you use &lt;code&gt;--theirs&lt;/code&gt; and it would be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;this is a totally different content
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



</description>
      <category>git</category>
      <category>github</category>
    </item>
  </channel>
</rss>
