<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tariq Abughofa</title>
    <description>The latest articles on DEV Community by Tariq Abughofa (@tariqabughofa).</description>
    <link>https://dev.to/tariqabughofa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F252892%2F86e1d43f-b46f-4397-840c-e3dba8c2a9ae.jpg</url>
      <title>DEV Community: Tariq Abughofa</title>
      <link>https://dev.to/tariqabughofa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tariqabughofa"/>
    <language>en</language>
    <item>
      <title>10 Books That Should Be in Every Programmer's Library</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Tue, 27 Apr 2021 15:52:14 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/10-books-that-should-be-on-each-programmer-s-library-57cn</link>
      <guid>https://dev.to/tariqabughofa/10-books-that-should-be-on-each-programmer-s-library-57cn</guid>
      <description>&lt;p&gt;I have read many books during my learning journey but in this article I present a list of the best books that helped me digest important topics. This special book list shaped my understanding of computer science, and kept me going back any time I’m in doubt.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Head First Design Patterns by Eric &amp;amp; Elisabeth Freeman
&lt;/h4&gt;

&lt;p&gt;This is the book that introduced me to programming design patterns, and what an introduction it was! Clear, comprehensive, and full of easy-to-understand real-world examples that stick in your memory. Every time I'm unsure which design pattern to use, I go back to this book and it solves my problem. A must-read for any programmer!&lt;br&gt;
&lt;a href="https://www.amazon.com/Head-First-Design-Patterns-Brain-Friendly/dp/0596007124/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzih5za4ak5l21z6t9ede.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  2. The C++ Programming Language 4th Edition by Bjarne Stroustrup
&lt;/h4&gt;

&lt;p&gt;What better book to learn a programming language from than one written by the creator of that language? Moreover, what if the language is as hard as C++? This book is a comprehensive guide to C++11 and a good guide to many general programming concepts, including object-oriented programming.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/C-Programming-Language-4th/dp/0321563840/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqca3zbygz9hc22dls73.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Pro Git, 2nd Edition by Scott Chacon and Ben Straub
&lt;/h4&gt;

&lt;p&gt;If you want to handle code conflicts like a boss and understand why your manager starts pulling their hair out when you amend a pushed commit, this is the book for you. It goes into depth on the version control tool and helped me understand the genius yet simple concepts Git is built on.&lt;br&gt;
&lt;a href="https://www.amazon.com/Pro-Git-Scott-Chacon-ebook/dp/B01ISNIKES/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld2qvtwsefow374byd6y.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Thinking Functionally with Haskell by Richard Bird
&lt;/h4&gt;

&lt;p&gt;Although functional programming is unlikely to threaten the dominance of imperative programming anytime soon, learning a functional language enriches a programmer's thinking and broadens their perspective. Many functional concepts, such as lambda functions and higher-order functions, were introduced into imperative languages over the last decade. Hence, this book is a great read for writing more elegant, less verbose code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Thinking-Functionally-Haskell-Richard-Bird/dp/1107452643/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82hu4ebwxuf4ma3r1p36.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  5. O'Reilly's Linux in a Nutshell
&lt;/h4&gt;

&lt;p&gt;Since this book is 944 pages long, I doubt the authors know what "in a nutshell" means... However, this is more of a reference book for Linux. It serves as the go-to book whenever I need information about a command or a system feature. I consider knowing how to handle Linux/Unix a core skill for any serious programmer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Linux-Nutshell-Desktop-Quick-Reference/dp/0596154488/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy79tyhbyu86a351k64kw.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  6. Thinking in Java by Bruce Eckel
&lt;/h4&gt;

&lt;p&gt;The book I learned Java from. Easy to read and comprehensive. Great for picking up object-oriented programming.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Thinking-Java-4th-Bruce-Eckel/dp/0131872486/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcq8uiec5g3spi5is8eaz.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  7. Programming Ruby 1.9 &amp;amp; 2.0 by Dave Thomas
&lt;/h4&gt;

&lt;p&gt;This book was given to me by my manager, who introduced our company to this charming programming language. Although I have had a love-hate relationship with many programming languages, this precious gem is still by far my favourite. It introduced one of the most programmer-friendly frameworks I've seen and many brilliant, pleasing-to-use libraries. I hereby feel the need to give the following disclaimer: learning Ruby is guaranteed to produce unwanted side effects, including displeasure and disgust when dealing with any of the alternatives ;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Programming-Ruby-1-9-2-0-Programmers/dp/1937785491/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdxvg3vgyiigf94tw37b.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  8. Database System Concepts 7th Edition by Avi Silberschatz, Henry Korth, S. Sudarshan
&lt;/h4&gt;

&lt;p&gt;This is a comprehensive book on SQL database management. It covers the relational database model, design, querying, and transaction management in general, without focusing on a specific system. In the last chapters, the book discusses these concepts in existing database systems: Postgres, Oracle, DB2, and MSSQL. It's a great resource for building in-depth knowledge of how relational databases work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Database-System-Concepts-Abraham-Silberschatz/dp/1260084507/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69d9nf7en4cjkoc8d7vq.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  9. The Hundred-Page Machine Learning Book by Andriy Burkov
&lt;/h4&gt;

&lt;p&gt;A fast read that gives anyone in the computer science field exposure to the modern concepts of machine learning. The point of the book is not to provide code but to explain the concepts. A heads-up: there is a lot of math in there, but then again, you're reading about machine learning. If you don't want to deal with math, you're into the wrong topic! ;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Hundred-Page-Machine-Learning-Book/dp/199957950X/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57q8suai3cf8a0v6xs9x.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  10. SQL Performance Explained by Markus Winand
&lt;/h4&gt;

&lt;p&gt;This book helped me understand how to do SQL optimizations and how indexing actually works. Every developer deals with query optimization at some point, whether for large data migrations or application performance enhancements. These are must-know principles for ETL and data engineering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Performance-Explained-Everything-Developers-about/dp/3950307826/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq211tg2cvhddcj01sv6f.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Bonus Book: Clean Code by Robert C. Martin
&lt;/h4&gt;

&lt;p&gt;I have to admit, this is not a personal recommendation. Through the years, almost everyone I know has recommended this book to me as the best book on programming, but I haven't had the chance to read it yet. Maybe I should now :D.&lt;/p&gt;

&lt;p&gt;Here's what a friend of mine had to say about the book: &lt;em&gt;"I would say Clean Code is the best book on programming. It touches on many aspects on CS while giving a voice of wisdom on what the author believes is most clean... It's very opinionated though."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I will leave it to you to judge that last criticism.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship-ebook/dp/B001GSTOAM/ref=sm_n_ma_dka_CA_pr_ran?adId=1484200772&amp;amp;creativeASIN=1484200772&amp;amp;linkId=6bd5a1626e3a8d3d4e4f515b9db70f2c&amp;amp;tag=rabbitoncode-20&amp;amp;linkCode=w58&amp;amp;slotNum=2&amp;amp;imprToken=c21889014dac02b29b08d8ce5985ac71&amp;amp;adType=smart&amp;amp;adMode=manual&amp;amp;adFormat=card&amp;amp;impressionTimestamp=1619536821955" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flefaedfgymfo70nmfriz.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>books</category>
      <category>webdev</category>
      <category>beginners</category>
    </item>
    <item>
      <title>3 approaches to scroll through data in Elasticsearch</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Mon, 06 Jan 2020 01:36:59 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/3-approaches-to-scroll-through-data-in-elasticsearch-4ii3</link>
      <guid>https://dev.to/tariqabughofa/3-approaches-to-scroll-through-data-in-elasticsearch-4ii3</guid>
      <description>&lt;p&gt;Elasticsearch is a search engine that provides full-text search capabilities. It stores data in collections called indices in a document format. In this article I go through the supported techniques to paginate through collections or as they are called in Elasticsearch "indices"&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;from&lt;/code&gt; / &lt;code&gt;size&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Pagination of results can be done by using the &lt;code&gt;from&lt;/code&gt; and &lt;code&gt;size&lt;/code&gt; parameters. The &lt;code&gt;from&lt;/code&gt; parameter defines the number of items to skip from the start, while the &lt;code&gt;size&lt;/code&gt; parameter is the maximum number of hits to be returned.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET users/_search
{
    "from" : 0, "size" : 100,
    "query" : {
        "term" : { "user" : "john" }
    }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You can filter using this method. You can also sort by adding this JSON at the root level of the previous request body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"sort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"asc"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In Elasticsearch, you can't paginate beyond the &lt;code&gt;max_result_window&lt;/code&gt; index setting, which is 10,000 by default. This means that &lt;code&gt;from&lt;/code&gt; + &lt;code&gt;size&lt;/code&gt; must be less than that value. In practice, &lt;code&gt;max_result_window&lt;/code&gt; is not a limitation but a safeguard against deep pagination, which might crash the server since this method requires loading all the previous pages as well.&lt;/p&gt;
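
&lt;p&gt;If you genuinely need &lt;code&gt;from&lt;/code&gt; / &lt;code&gt;size&lt;/code&gt; pagination beyond that window, the setting can be raised per index with a request along these lines (shown here for the &lt;code&gt;users&lt;/code&gt; index from the examples above; keep in mind you are trading away the safeguard):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight http"&gt;&lt;code&gt;PUT users/_settings
{
    "index" : { "max_result_window" : 20000 }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;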

&lt;h3&gt;
  
  
  The &lt;code&gt;scroll&lt;/code&gt; API
&lt;/h3&gt;

&lt;p&gt;The recommended solution for efficient deep pagination, and a requirement once you reach the &lt;code&gt;max_result_window&lt;/code&gt; limit. The scroll API can be used to retrieve large numbers of results. It resembles cursors in SQL databases in that it involves the server in keeping track of how far the pagination has reached. In the same manner, it's not designed to serve user requests but rather to process large amounts of data.&lt;/p&gt;

&lt;p&gt;In order to start scrolling, an initial request has to be sent to open a search context on the server. The request also specifies how long the context should stay alive with the &lt;code&gt;scroll=TTL&lt;/code&gt; query parameter. This request keeps the context alive for 1 minute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST users/_search?scroll=1m
{
    "size": 100,
    "query" : {
        "term" : { "user" : "john" }
    }
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The response of this request returns a &lt;code&gt;scroll_id&lt;/code&gt; value to be used in the subsequent fetch requests.&lt;/p&gt;

&lt;p&gt;After this request, the client can start scrolling through the data. Each subsequent page of the results (the initial search already returns the first page) is retrieved by sending the following request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST _search/scroll
{
    "scroll" : "1m",
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;As you can see, the request has to specify the &lt;code&gt;scroll_id&lt;/code&gt; (which the client got from the initial request) and the &lt;code&gt;scroll&lt;/code&gt; parameter, which tells the server to keep the context alive for another minute.&lt;/p&gt;
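
&lt;p&gt;The search context is freed automatically once the scroll time expires, but since open contexts are costly to keep, it's good practice to clear a scroll explicitly as soon as you are done with it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight http"&gt;&lt;code&gt;DELETE _search/scroll
{
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;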

&lt;h3&gt;
  
  
  The &lt;code&gt;search_after&lt;/code&gt; parameter
&lt;/h3&gt;

&lt;p&gt;The scroll API is great for deep pagination, but scroll contexts are costly to keep alive and are not recommended for real-time user requests. As a substitute for scroll contexts in these situations, the &lt;code&gt;search_after&lt;/code&gt; parameter was introduced to the search API. It lets the user provide information about the previous page that helps retrieve the current one, which means a certain order for the results is &lt;em&gt;necessary&lt;/em&gt; in the search query. Let's assume the first page was retrieved with the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET users/_search
{
    "size": 10,
    "query" : {
        "term" : { "user" : "john" }
    },
    "sort": [
        {"date": "asc"}
    ]
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For subsequent pages, we take the sort values of the last document returned by this request and pass them with the &lt;code&gt;search_after&lt;/code&gt; parameter. A later request would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET users/_search
{
    "size": 10,
    "query" : {
        "term" : { "user" : "john" }
    },
    "sort": [
        {"date": "asc"}
    ],
    "search_after": [1463538857]
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;from&lt;/code&gt; parameter can't be used when &lt;code&gt;search_after&lt;/code&gt; is passed, as the two contradict each other. This solution is very similar to the scroll API, but it relieves the server from keeping the pagination state, which also means it always returns the latest version of the data. For this reason, the sort order may change during a walk if updates or deletes happen on the index.&lt;/p&gt;

&lt;p&gt;This solution has the clear disadvantage of not being able to get a page at random: fetching page 100 requires fetching pages 0 through 99 first. Still, the solution is good for user pagination where you can only move next/previous through the pages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Random access with &lt;code&gt;search_after&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;As explained before, the &lt;code&gt;search_after&lt;/code&gt; parameter doesn't allow random-access pagination. However, there is a way to achieve random access by keeping statistical data about the indices in Elasticsearch. This approach is inspired by histograms in the Postgres database, which contain statistics about column value distribution in the form of a list of bucket boundaries. The idea is to implement that manually in Elasticsearch: have an index whose documents follow this schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bucket_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"starts_after"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;102181&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Let's call this index &lt;code&gt;pagination_index&lt;/code&gt;. Before creating and filling it, we should decide on a bucket size; let's say it's 1000 documents. The next step is to fill the index using the &lt;code&gt;search&lt;/code&gt; API with the &lt;code&gt;search_after&lt;/code&gt; parameter. Assuming the data index is called &lt;code&gt;articles&lt;/code&gt;, the operation would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Elasticsearch&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;max_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;bucket_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="ss"&gt;do:
    &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;index: &lt;/span&gt;&lt;span class="s1"&gt;'pagination_index'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;bucket_id: &lt;/span&gt;&lt;span class="n"&gt;bucket_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;starts_after: &lt;/span&gt;&lt;span class="n"&gt;max_id&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;index: &lt;/span&gt;&lt;span class="s1"&gt;'articles'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;size: &lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;search_after: &lt;/span&gt;&lt;span class="n"&gt;max_id&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;max_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;id&lt;/span&gt;
    &lt;span class="n"&gt;bucket_id&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;empty?&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now, to paginate in order (get the next page), &lt;code&gt;search_after&lt;/code&gt; is used with the values from the previous page, the same as with regular &lt;code&gt;search_after&lt;/code&gt; pagination. When there is a need to access a random page, we query the &lt;code&gt;pagination_index&lt;/code&gt; for the &lt;code&gt;starts_after&lt;/code&gt; value and use it to get the required page. It would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Elasticsearch&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;page_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="n"&gt;bucket_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="c1"&gt;# get page 0&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;index: &lt;/span&gt;&lt;span class="s1"&gt;'articles'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;size: &lt;/span&gt;&lt;span class="n"&gt;page_size&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;# do some processing or rendering of results.&lt;/span&gt;
&lt;span class="n"&gt;max_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;id&lt;/span&gt;
&lt;span class="c1"&gt;# get page 1&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;index: &lt;/span&gt;&lt;span class="s1"&gt;'articles'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;size: &lt;/span&gt;&lt;span class="n"&gt;page_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;search_after: &lt;/span&gt;&lt;span class="n"&gt;max_id&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;# do some processing or rendering of results.&lt;/span&gt;
&lt;span class="c1"&gt;# get page 200&lt;/span&gt;
&lt;span class="n"&gt;bucket_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;page_size&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;page_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;index: &lt;/span&gt;&lt;span class="s1"&gt;'pagination_index'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;bucket_id: &lt;/span&gt;&lt;span class="n"&gt;bucket_id&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;starts_after&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;page_size&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;index: &lt;/span&gt;&lt;span class="s1"&gt;'articles'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;size: &lt;/span&gt;&lt;span class="n"&gt;page_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;search_after: &lt;/span&gt;&lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
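
&lt;p&gt;The bucket arithmetic above is easy to get wrong, so here it is isolated as plain Ruby that runs without an Elasticsearch client. The &lt;code&gt;locate_page&lt;/code&gt; helper is purely illustrative, not part of any API:&lt;/p&gt;

```ruby
# Map a random page number to the statistics bucket holding its first
# document, plus that document's offset inside the bucket. This mirrors
# the "get page 200" arithmetic in the sketch above.
def locate_page(page_number, page_size: 100, bucket_size: 1000)
  doc_offset = page_number * page_size       # global index of the page's first document
  { bucket_id: doc_offset / bucket_size,     # which bucket to look up in pagination_index
    offset_in_bucket: doc_offset % bucket_size }
end

locate_page(200)  # bucket 20, offset 0 (page boundary lines up with a bucket boundary)
locate_page(207)  # bucket 20, offset 700
```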



&lt;p&gt;This approach works for any query, including whatever filtering is needed, but the statistics only hold for that specific query. The &lt;code&gt;pagination_index&lt;/code&gt; has to be maintained regularly as well; until it gets updated, the pages will be approximate. It is still a good approach for showing real-time results that require deep random-access pagination.&lt;/p&gt;

</description>
      <category>database</category>
      <category>nosql</category>
      <category>elasticsearch</category>
    </item>
    <item>
      <title>80+ Free Big Data Resources to Satisfy Your Knowledge Appetite - part 2</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Sun, 29 Dec 2019 19:12:20 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/80-free-big-data-resources-to-satisfy-your-knowledge-appetite-part-2-5ela</link>
      <guid>https://dev.to/tariqabughofa/80-free-big-data-resources-to-satisfy-your-knowledge-appetite-part-2-5ela</guid>
      <description>&lt;p&gt;This is a continuation of the resources I listed in part 1&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/tariqabughofa" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dhQgoHlZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/practicaldev/image/fetch/s--l9sa7M4---/c_fill%2Cf_auto%2Cfl_progressive%2Ch_150%2Cq_auto%2Cw_150/https://dev-to-uploads.s3.amazonaws.com/uploads/user/profile_image/252892/86e1d43f-b46f-4397-840c-e3dba8c2a9ae.jpg" alt="tariqabughofa image"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/tariqabughofa/80-free-big-data-resources-to-satisfy-your-knowledge-appetite-part-1-4361" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;80+ Free Big Data Resources to Satisfy Your Knowledge Appetite - part 1&lt;/h2&gt;
      &lt;h3&gt;Tariq Abughofa ・ Dec 22 '19 ・ 5 min read&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#database&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#datascience&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#nosql&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;p&gt;This part includes the following four categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Machine Learning &amp;amp; Algorithms in Big Data&lt;/li&gt;
&lt;li&gt;Data Processing Systems&lt;/li&gt;
&lt;li&gt;Real-time Processing&lt;/li&gt;
&lt;li&gt;Graph Processing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;a&gt;&lt;/a&gt;Machine Learning and Algorithms in Big Data
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://engineering.fb.com/core-data/recommending-items-to-more-than-a-billion-people/"&gt;Recommending items to more than a billion people&lt;/a&gt;: An article about collaborative filtering at Facebook.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://h2o-release.s3.amazonaws.com/h2o/rel-xia/3/docs-website/h2o-docs/booklets/SparklingWaterBooklet.pdf"&gt;Machine Learning with Sparkling Water&lt;/a&gt;: Using H2O the machine learning framework with Apache Spark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf"&gt;MLlib&lt;/a&gt;: Scalable Machine Learning library on Apache Spark from Stanford/Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf"&gt;TensorFlow&lt;/a&gt;: the famous large-scale machine learning library.&lt;/p&gt;

&lt;p&gt;Large-scale parallel collaborative filtering for the Netflix prize: an algorithm for large-scale recommendation of Netflix movies.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a&gt;&lt;/a&gt;Data Processing Systems
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8"&gt;Airflow&lt;/a&gt;: a workflow management system by AirBnB.&lt;/p&gt;

&lt;p&gt;Oozie: a workflow management system for Hadoop by Yahoo!.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sameeragarwal.github.io/blinkdb_eurosys13.pdf"&gt;BlinkDb&lt;/a&gt;: analytics on large scale data from Berkeley.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35650.pdf"&gt;FlumeJava&lt;/a&gt;: a library for developing parallel data pipelines from Google.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf"&gt;MapReduce&lt;/a&gt;: the google framework behind Hadoop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pdfs.semanticscholar.org/fe1c/c4e034ad4d3a54a5aa0a53a24b6f564298c4.pdf"&gt;Pig&lt;/a&gt;: an engine that supports &lt;a href="http://infolab.stanford.edu/~olston/publications/sigmod08.pdf"&gt;PigLatin&lt;/a&gt; a procedural dataflow language for Hadoop from Yahoo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://event.cwi.nl/lsde/papers/hive.pdf"&gt;Hive&lt;/a&gt; &lt;a href="http://infolab.stanford.edu/~ragho/hive-icde2010.pdf"&gt;(resource#2)&lt;/a&gt;: A data warehouse on top of Hadoop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf"&gt;The Dataflow Model&lt;/a&gt;: the model behind Google Cloud Dataflow which provides simplified stream and batch processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41378.pdf"&gt;MillWheel&lt;/a&gt;: stream processing engine from Google.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41318.pdf"&gt;Photon&lt;/a&gt;: A tool to join data streams at Google.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://d0.awsstatic.com/whitepapers/whitepaper-streaming-data-solutions-on-aws-with-amazon-kinesis.pdf"&gt;Kinesis&lt;/a&gt;: stream processing engine from Amazon.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://asterios.katsifodimos.com/assets/publications/flink-deb.pdf"&gt;Apache Flink&lt;/a&gt; &lt;a href="https://www.ververica.com/blog/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink"&gt;(resource#2)&lt;/a&gt;: stream and batch processing engine from TU Berlin.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/trill-vldb2015.pdf"&gt;Trill&lt;/a&gt;: incremental data analytics engine from Microsoft.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://notes.stephenholiday.com/Kafka.pdf"&gt;Kafka&lt;/a&gt;: the famous distributed messaging system from LinkedIn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://people.eecs.berkeley.edu/~alig/papers/spark-cacm.pdf"&gt;Apache Spark&lt;/a&gt;: the famous &lt;a href="https://people.csail.mit.edu/matei/papers/2013/sosp_spark_streaming.pdf"&gt;stream&lt;/a&gt; and &lt;a href="https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf"&gt;batch&lt;/a&gt; processing engine. It uses distributed memory abstractions: &lt;a href="https://www.kdnuggets.com/2017/08/three-apache-spark-apis-rdds-dataframes-datasets.html"&gt;RDDs, Dataframes, and Datasets&lt;/a&gt;. Since Spark 2 was released, it moved to &lt;a href="https://cs.stanford.edu/~matei/papers/2018/sigmod_structured_streaming.pdf"&gt;structured streaming&lt;/a&gt; &lt;a href="https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html"&gt;(resource#2)&lt;/a&gt; &lt;a href="https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html"&gt;(3)&lt;/a&gt; &lt;a href="https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html"&gt;(4)&lt;/a&gt; and the &lt;a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf"&gt;SparkSQL&lt;/a&gt; library was introduced to allow SQL queries over Spark Dataframes. The whole &lt;a href="https://databricks.com/blog/category/engineering"&gt;Databricks blog&lt;/a&gt; is a great resource for the project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cs.stanford.edu/~matei/papers/2016/sigmod_sparkr.pdf"&gt;SparkR&lt;/a&gt;: a Spark library to write processing application in R.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.istc-cc.cmu.edu/publications/papers/2013/grades-graphx_with_fonts.pdf"&gt;GraphX&lt;/a&gt; &lt;a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf"&gt;(resource#2)&lt;/a&gt;: distributed graph processing with Spark's RDDs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cs.stanford.edu/~matei/papers/2016/grades_graphframes.pdf"&gt;GraphFrames&lt;/a&gt;: distributed graph processing with Spark's Dataframes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.snappydata.io/snappy-industrial"&gt;SnappyData&lt;/a&gt; &lt;a href="http://cidrdb.org/cidr2017/papers/p28-mozafari-cidr17.pdf"&gt;(resource#2)&lt;/a&gt;: a transaction datastore on top of Spark.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a&gt;&lt;/a&gt;Real-time Processing
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://engineering.linkedin.com/blog/2018/11/samza-1-0--stream-processing-at-massive-scale"&gt;Samza&lt;/a&gt; &lt;a href="https://martin.kleppmann.com/papers/kafka-debull15.pdf"&gt;(resource#2)&lt;/a&gt; &lt;a href="http://www.vldb.org/pvldb/vol10/p1634-noghabi.pdf"&gt;(3)&lt;/a&gt; &lt;a href="https://martin.kleppmann.com/papers/samza-encyclopedia.pdf"&gt;(4)&lt;/a&gt;: Stream processing engine from LinkedIn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cs.brown.edu/courses/csci2270/archives/2015/papers/ss-storm.pdf"&gt;Storm&lt;/a&gt;: real-time data processing engine from Twitter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/heron_icde-1.pdf"&gt;Heron&lt;/a&gt;: the new Storm from Twitter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://research.fb.com/wp-content/uploads/2016/11/realtime_data_processing_at_facebook.pdf"&gt;Real-time data processing at facebook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tech.ebayinc.com/engineering/announcing-pulsar-real-time-analytics-at-scale/"&gt;Pulsar&lt;/a&gt;: real-time data processing engine from eBay.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a&gt;&lt;/a&gt;Graph Processing
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://stanford.edu/~rezab/papers/wtf_overview.pdf"&gt;WTF&lt;/a&gt;: the who to follow service at Twitter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.vldb.org/pvldb/vol9/p1281-sharma.pdf"&gt;GraphJet&lt;/a&gt;: real-time recommendation graph engine at Twitter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p135-malewicz.pdf"&gt;Pregel&lt;/a&gt;: large-scale graph processing engine at Google.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.vldb.org/pvldb/vol8/p1804-ching.pdf"&gt;Giraph&lt;/a&gt;: open source implementation of Pregel by Facebook.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>resources</category>
      <category>distributedsystems</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>80+ Free Big Data Resources to Satisfy Your Knowledge Appetite - part 1</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Sun, 22 Dec 2019 21:07:51 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/80-free-big-data-resources-to-satisfy-your-knowledge-appetite-part-1-4361</link>
      <guid>https://dev.to/tariqabughofa/80-free-big-data-resources-to-satisfy-your-knowledge-appetite-part-1-4361</guid>
      <description>&lt;p&gt;Data is becoming a cornerstone in software services. Whether it is the business model or it drives revenue or both, tech companies are flocking to use this "free" resource to provide better services and excel over there competitors.&lt;/p&gt;

&lt;p&gt;If you are in the "new-sexy" position in computer science or you're doing research in this field, you will find the resources in this article extremely helpful, the same way they helped me. Frontier companies in this field like Google, Facebook, LinkedIn, and Twitter, as well as big universities, have released tens of papers and articles on the subject outlining internal projects they worked on. Many of these projects were later open-sourced and became staples in the field. To save you the time and pain of getting lost in the labyrinth of endless resources on the internet (the way I did), I compiled a categorized list here. I will try to update the list frequently to keep it up-to-date.&lt;/p&gt;

&lt;p&gt;I divided the resources into 8 main categories. This first part includes the following four:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Big Data Storage &amp;amp; NoSQL Databases&lt;/li&gt;
&lt;li&gt;Interactive Data Analytics&lt;/li&gt;
&lt;li&gt;Big Data Challenges and Ecosystems&lt;/li&gt;
&lt;li&gt;Resource Management&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;a&gt;&lt;/a&gt;Big Data Storage &amp;amp; NoSQL
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf"&gt;Bigtable&lt;/a&gt;: the terrabyte NoSQL database behind google cloud storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf"&gt;Cassandra&lt;/a&gt;: the Facebook column-oriented database.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://static.usenix.org/events/fast/tech/full_papers/Sumbaly.pdf"&gt;Voldemort&lt;/a&gt;: Distributed database by LinkedIn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sites.cs.ucsb.edu/~agrawal/fall2009/dynamo.pdf"&gt;Dynamo&lt;/a&gt;: Amazon's key-value store.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.294.8459&amp;amp;rep=rep1&amp;amp;type=pdf"&gt;HBase&lt;/a&gt;: Column-oriented storage over HDFS by Facebook.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://neo4j.com/graph-databases-book/"&gt;Neo4J&lt;/a&gt;: the famous graph database.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://pages.cs.wisc.edu/~remzi/Classes/739/Spring2004/Papers/p215-dageville-snowflake.pdf"&gt;Snowflake&lt;/a&gt;: A data warehouse for the cloud.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf"&gt;The Google File System&lt;/a&gt;: the big data file system and the base behind distributed storage in Hadoop.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://storageconference.us/2010/Papers/MSST/Shvachko.pdf"&gt;HDFS&lt;/a&gt;: The Hadoop Distributed File System.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://img.bigdatabugs.com/RCFile%20A%20Fast%20and%20Space-efficient%20Data%20Placement%20Structure%20in%20MapReduce-based%20Warehouse%20Systems@www.bigDataBugs.com.pdf"&gt;RCFile&lt;/a&gt;: Data placement for data warehouses used in Apache Hive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html"&gt;Parquet&lt;/a&gt;: columnar storage format.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf"&gt;Haystack&lt;/a&gt;: an object storage system optimized for Facebook’s Photos application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/LRC12-cheng20webpage.pdf"&gt;Windows Azure Storage&lt;/a&gt;: Cloud Storage System from Microsoft.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ir.lib.uwo.ca/cgi/viewcontent.cgi?referer=https://www.google.com/&amp;amp;httpsredir=1&amp;amp;article=1069&amp;amp;context=electricalpub"&gt;Data management in cloud environments - NoSQL and NewSQL data stores&lt;/a&gt;: A paper surveying data stores beyond SQL such as Redis, HBase, ...etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a&gt;&lt;/a&gt;Interactive Data Analytics
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf"&gt;Dremel&lt;/a&gt;: analytics system by Google.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://cidrdb.org/cidr2015/Papers/CIDR15_Paper28.pdf"&gt;Impala&lt;/a&gt;: SQL engine for Hadoop by Cloudera.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cwiki.apache.org/confluence/display/incubator/DrillProposal"&gt;Drill&lt;/a&gt;: An open source implementation of Dremel.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.news.cs.nyu.edu/~jinyang/sp07/papers/dryad.pdf"&gt;Dryad&lt;/a&gt;: a framework to define dataflow graphs from Microsoft.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://web.eecs.umich.edu/~mosharaf/Readings/Tez.pdf"&gt;Tez&lt;/a&gt;: an open source implementation of Dryad from Hortonworks and Microsoft.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kudu.apache.org/kudu.pdf"&gt;Kudo&lt;/a&gt;: A storage for fast analytics on Big Data by Cloudera.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a&gt;&lt;/a&gt;Big Data Challenges and Ecosystems
&lt;/h3&gt;

&lt;p&gt;&lt;a href="http://snap.stanford.edu/class/cs224w-readings/Brin98Anatomy.pdf"&gt;Google&lt;/a&gt;: how the large-scale search engine was built.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codahale.com/you-cant-sacrifice-partition-tolerance/"&gt;The CAP theorem&lt;/a&gt; &lt;a href="https://mwhittaker.github.io/blog/an_illustrated_proof_of_the_cap_theorem/"&gt;(resource #2)&lt;/a&gt;: the theory which paved the way to NoSQL databases.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html"&gt;The Lambda Architecture&lt;/a&gt;: an architecture for data pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.oreilly.com/radar/questioning-the-lambda-architecture/"&gt;The Kappa Architecture&lt;/a&gt;: an alternative architecture for data pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.vldb.org/pvldb/vol7/p1441-boykin.pdf"&gt;Summingbird&lt;/a&gt;: a framework for integrating batch and online computations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying"&gt;The Log Problem in Big Data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cs.stanford.edu/people/chrismre/cs345/rl/eventually-consistent.pdf"&gt;Eventual Consistency&lt;/a&gt;: A look at how data consistency works in NoSQL database systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://web.cs.wpi.edu/~cs525/f13b-EAR/cs525-homepage/lectures/PAPERS/p1125-sumbaly.pdf"&gt;The Big Data Ecosystem at LinkedIn&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a&gt;&lt;/a&gt;Resource Management
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://lamport.azurewebsites.net/pubs/paxos-simple.pdf"&gt;Paxos&lt;/a&gt;: a consensus algorithm for distributed systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://files.catwell.info/misc/mirror/raft/raft.pdf"&gt;Raft&lt;/a&gt;: an alternative consensus algorithm to Paxos&lt;/p&gt;

&lt;p&gt;&lt;a href="https://marcoserafini.github.io/papers/zab.pdf"&gt;Zab&lt;/a&gt;: the consensus algorithm used in Zookeeper. &lt;a href="https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab+vs.+Paxos"&gt;Here is a comparison between Zab with Paxos&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.usenix.org/legacy/event/usenix10/tech/full_papers/Hunt.pdf"&gt;Zookeeper&lt;/a&gt;: Coordinator and distributed configuration system by Yahoo!.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cse.ust.hk/~weiwa/teaching/Fall15-COMP6611B/reading_list/YARN.pdf"&gt;YARN&lt;/a&gt;: resource manager for Hadoop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pdos.csail.mit.edu/6.824/papers/borg.pdf"&gt;Borg&lt;/a&gt;: Cluster manager by Google.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://assets.openshift.com/hubfs/pdfs/Kubernetes_OpenShift.pdf"&gt;Kubernetes&lt;/a&gt;: container-orchestration system for automating application deployment, scaling, and management by Google.&lt;/p&gt;

&lt;p&gt;The second part will focus on algorithms, ML, and data processing systems, so stay tuned ;)&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>nosql</category>
    </item>
    <item>
      <title>How to Load Data into an SQL database:
From simple inserts to bulk loads</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Sun, 15 Dec 2019 21:49:07 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/how-to-load-data-into-an-sql-database-from-simple-inserts-to-bulk-loads-4352</link>
      <guid>https://dev.to/tariqabughofa/how-to-load-data-into-an-sql-database-from-simple-inserts-to-bulk-loads-4352</guid>
      <description>&lt;p&gt;Data storage is the most integral part of a transactional database system. In this article, I will go into details about different ways to load data into a transaction SQL system. From inserting a couple of records, all the way to millions of records.&lt;/p&gt;

&lt;p&gt;The first and easiest way to insert data into the database is through an ORM (Object-Relational Mapping) tool. For the purpose of this tutorial, I will use Rails's ActiveRecord as a demonstration of ORM operations. Inserting data through the ORM ensures the execution of all the business rules and validations, so you don't have to worry about them. It's as easy as doing this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;Users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;username: &lt;/span&gt;&lt;span class="s2"&gt;"John Doe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;role: &lt;/span&gt;&lt;span class="s2"&gt;"admin"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The disadvantage of this method is that it doesn't scale. For each record, it creates a model object, executes the attached callbacks for validation and business rules, and executes a separate DML transaction.&lt;/p&gt;

&lt;p&gt;Some frameworks provide a way to do a bulk insert instead. A single &lt;code&gt;INSERT&lt;/code&gt; SQL statement is prepared and sent to the database, without instantiating the models or invoking model callbacks or validations. For example, in Rails 6 it's something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;username: &lt;/span&gt;&lt;span class="s1"&gt;'John Doe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;role: &lt;/span&gt;&lt;span class="s1"&gt;'admin'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;username: &lt;/span&gt;&lt;span class="s1"&gt;'Jane Doe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;role: &lt;/span&gt;&lt;span class="s1"&gt;'admin'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The same can be done with raw SQL, with exactly the same effect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt;
  &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;VAULES&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"John Doe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"admin"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Jane Doe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"admin"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Both solutions can run into conflicts (such as primary key uniqueness violations or unsuitable data types), and they can break business rules or high-level application validation rules. In both cases, the programmer has to deal with these problems by handling database-level errors and enforcing the higher-level rules manually. For example, if you want to ignore duplicates on the primary key column, you can add something like this to the end of the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;NOTHING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;What if the loaded data contains updates to existing rows in the database? This operation is called an upsert. If the row already exists in the table (existence is determined based on the primary key), the row is updated with the passed values. Otherwise, it is inserted as a new row. In Rails 6, it's as easy as replacing the &lt;code&gt;insert_all&lt;/code&gt; function with &lt;code&gt;upsert_all&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;username: &lt;/span&gt;&lt;span class="s1"&gt;'John Doe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;role: &lt;/span&gt;&lt;span class="s1"&gt;'admin'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;username: &lt;/span&gt;&lt;span class="s1"&gt;'Jane Doe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;role: &lt;/span&gt;&lt;span class="s1"&gt;'admin'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In raw SQL (here using MySQL's &lt;code&gt;ON DUPLICATE KEY&lt;/code&gt; syntax) it will be something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt;
  &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;VAULES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'John Doe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'admin'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Jane Doe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'admin'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;DUPLICATE&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;role&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The scalability of this method is still limited. There is a maximum query length in most servers, and even if there weren't, you wouldn't want to send a query gigabytes long over the network. A simple solution is to &lt;code&gt;UPSERT&lt;/code&gt; in batches. That would be something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;record_num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;length&lt;/span&gt;
&lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;batch_num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record_num&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;batch_num&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;lower_bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_num&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;
  &lt;span class="n"&gt;higher_bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch_num&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;
  &lt;span class="no"&gt;Users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lower_bound&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;higher_bound&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The same can be done with raw SQL instead of the &lt;code&gt;upsert_all&lt;/code&gt; function.&lt;/p&gt;
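As an illustrative plain-Ruby sketch of building such batched raw-SQL statements (the helper name is made up, values are naively quoted so it is not safe for untrusted input, and it uses Postgres-style `ON CONFLICT` syntax):

```ruby
# Build one bounded-size upsert statement per slice of records.
# Illustrative only: naive quoting, no database connection.
def batched_upsert_statements(records, batch_size: 1000)
  records.each_slice(batch_size).map do |batch|
    values = batch
      .map { |r| "(#{r[:id]}, '#{r[:username]}', '#{r[:role]}')" }
      .join(",\n  ")
    "INSERT INTO users(id, username, role)\nVALUES\n  #{values}\n" \
      "ON CONFLICT (id) DO UPDATE SET username = EXCLUDED.username, role = EXCLUDED.role;"
  end
end
```

Each returned statement stays bounded in size, so it can be sent to the server one batch at a time.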

&lt;p&gt;This solution technically scales well. However, to increase performance, most SQL databases have a copy functionality which loads data from a file into a table. To use it, the data is dumped into a file in a format that the database engine supports (a common one is CSV). After that, the table is filled from the file with an SQL command such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'path/to/my/csv/file.csv'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This statement appends all the data in the CSV file to the users table. This solution can give much better performance for files in the scale of multiple GBs of data.&lt;/p&gt;
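To make the dump step concrete, here is a small plain-Ruby sketch (the helper name and table layout are illustrative; it only writes the CSV and returns the COPY statement, it does not talk to a database):

```ruby
require "csv"

# Dump records to a CSV file and return the COPY statement that would
# load it. Purely illustrative: no database connection is made.
def dump_for_copy(records, path)
  CSV.open(path, "w") do |csv|
    records.each { |r| csv << [r[:id], r[:username], r[:role]] }
  end
  "COPY users FROM '#{path}' WITH (FORMAT csv);"
end
```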

&lt;p&gt;However, this solution doesn't handle upserts. To support them, we create a temporary table, fill it from the file, then merge it into the original table using the primary key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TEMP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;tmp_table&lt;/span&gt;
&lt;span class="p"&gt;...;&lt;/span&gt; &lt;span class="cm"&gt;/* same schema as `users` table */&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;tmp_table&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'path/to/my/csv/file.csv'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tmp_table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This solution avoids issuing multiple transactions to the database and yields higher performance. It's definitely worth looking into when loading GBs of data into the database.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>postgres</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Pagination in MongoDB and the Bucket Pattern</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Tue, 03 Dec 2019 23:22:50 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/pagination-in-mongodb-and-the-bucket-pattern-k3m</link>
      <guid>https://dev.to/tariqabughofa/pagination-in-mongodb-and-the-bucket-pattern-k3m</guid>
      <description>&lt;p&gt;MongoDB is a document-based NoSQL database based on the JSON data format. Because of its nested document data structure, tables (or collections as they are called in MongoDB), can have more records than it's SQL counterpart. Which makes paginating efficiently important to have. Pagination can be used to do batch data processing or to show data on user interfaces. In this article, I go through the approaches that MongoDB provides for this problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using the &lt;code&gt;cursor&lt;/code&gt; API's &lt;code&gt;skip&lt;/code&gt; and &lt;code&gt;limit&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The cursor API in MongoDB provides two functions that help implement pagination:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cursor.skip(n)&lt;/code&gt; which returns a cursor that begins returning results after skipping the first &lt;code&gt;n&lt;/code&gt; documents.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cursor.limit(m)&lt;/code&gt; which constrains the size of a cursor’s result set to &lt;code&gt;m&lt;/code&gt; documents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is how you paginate using the MongoDB shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 1st page&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// 2nd page&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// 3rd page&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Two things to note here: MongoDB cursors are &lt;em&gt;not&lt;/em&gt; the same as cursors in SQL databases, which keep stateful, server-side positions in a result set. Also, this approach is effectively the same as &lt;code&gt;offset&lt;/code&gt; and &lt;code&gt;limit&lt;/code&gt; in SQL.&lt;/p&gt;
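&lt;p&gt;The skip value for page &lt;code&gt;p&lt;/code&gt; with &lt;code&gt;m&lt;/code&gt; documents per page is simply &lt;code&gt;(p - 1) * m&lt;/code&gt;. A tiny helper makes the pattern explicit (the helper name is illustrative, not part of the MongoDB API):&lt;/p&gt;

```javascript
// Cursor arguments for a 1-based page number. Note that skip grows
// linearly with the page number, which is why deep pages get slow.
// Hypothetical helper, not part of the MongoDB API.
function pageArgs(page, perPage) {
  return { skip: (page - 1) * perPage, limit: perPage };
}

// Page 3 with 5 documents per page:
// const { skip, limit } = pageArgs(3, 5);   // skip = 10, limit = 5
// db.users.find().skip(skip).limit(limit);
```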

&lt;h3&gt;
  
  
  Using the &lt;code&gt;_id&lt;/code&gt; field
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;_id&lt;/code&gt; field is part of every MongoDB collection by default. It has the data type &lt;code&gt;ObjectId&lt;/code&gt;. &lt;code&gt;ObjectIds&lt;/code&gt; are 12-byte, unique, ordered, auto-generated values that act as an identifier for the document, much like primary keys in SQL databases. The important feature of the &lt;code&gt;_id&lt;/code&gt; field is that it is ordered and indexed by default, which makes it suitable for pagination when combined with the &lt;code&gt;limit&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 1st page&lt;/span&gt;
&lt;span class="kd"&gt;set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;max_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;_id&lt;/span&gt;
&lt;span class="c1"&gt;// 2nd page&lt;/span&gt;
&lt;span class="kd"&gt;set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;_id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;$gt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;max_id&lt;/span&gt;&lt;span class="p"&gt;}}).&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;max_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;_id&lt;/span&gt;
&lt;span class="c1"&gt;// 3rd page&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;_id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;$gt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;max_id&lt;/span&gt;&lt;span class="p"&gt;}}).&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
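&lt;p&gt;The loop above can be sketched end-to-end. Here it is simulated over a plain in-memory array standing in for the collection (the helper names are illustrative assumptions, not the MongoDB API):&lt;/p&gt;

```javascript
// Keyset pagination simulated over a sorted in-memory "collection".
// fetchPage stands in for db.users.find({_id: {$gt: lastId}}).limit(n).
const collection = Array.from({ length: 12 }, (_, i) => ({ _id: i + 1 }));

function fetchPage(lastId, n) {
  return collection.filter((doc) => doc._id > lastId).slice(0, n);
}

function allPages(n) {
  const pages = [];
  let lastId = 0; // smaller than every _id in the collection
  for (;;) {
    const page = fetchPage(lastId, n);
    if (page.length === 0) break;
    pages.push(page.map((doc) => doc._id));
    lastId = page[page.length - 1]._id; // resume after the last seen _id
  }
  return pages;
}
```

Unlike `skip`, each step only seeks past the last seen `_id`, so the cost of fetching a page does not grow with how deep the page is.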



&lt;h3&gt;
  
  
  Using an indexed field
&lt;/h3&gt;

&lt;p&gt;If you have an indexed field and you want to return the pages sorted on that field instead, a good solution is to combine &lt;code&gt;cursor.sort()&lt;/code&gt; and &lt;code&gt;cursor.limit(n)&lt;/code&gt; with a comparison query operator (&lt;code&gt;$gt&lt;/code&gt;, &lt;code&gt;$lt&lt;/code&gt;) to skip the previous pages. This way the query uses the index to skip the unwanted documents and then reads only the &lt;code&gt;n&lt;/code&gt; wanted documents. The query looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;created_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$gt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2018-07-21T12:01:35&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;created_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The downside is that we can't jump directly to a specific page. If that is necessary, this approach doesn't work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using the Bucket Pattern
&lt;/h3&gt;

&lt;p&gt;This is a unique storage/pagination technique that can only be used with document-based NoSQL databases. It scales well in terms of both the stored data and the index size, while still allowing navigation to any page at random. However, this method starts with the way we store the data.&lt;/p&gt;

&lt;p&gt;A good use-case for this pattern is time-series data. Let's say we're getting location updates through GPS each minute and we store each update as a document this way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="nl"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
   &lt;span class="nx"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nx"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2019-09-28T02:00:00.000Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="nx"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;8.80173&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nx"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;20.63476&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
   &lt;span class="na"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2019-09-28T02:01:00.000Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="na"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;8.80175&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="na"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;20.63478&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
   &lt;span class="na"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2019-09-28T02:02:00.000Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="na"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;8.80178&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="na"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;20.63486&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Convenient indices to have here are one on &lt;code&gt;source_id&lt;/code&gt; and another on &lt;code&gt;timestamp&lt;/code&gt;. We could then paginate the data sorted on &lt;code&gt;timestamp&lt;/code&gt; as we saw in the previous method. However, the scalability of this solution is questionable, as the &lt;code&gt;timestamp&lt;/code&gt; index and the collection get huge really fast.&lt;/p&gt;

&lt;p&gt;Here the Bucket Pattern comes to the rescue. Instead of saving each data point as a separate document, we leverage the document data model that MongoDB uses. We save the data points that arrive within a window, let's say an hour, as a list in one single document we refer to as a bucket. We also add extra attributes to the document, such as &lt;code&gt;start_timestamp&lt;/code&gt;, the date at which the bucket's data points start, and maybe some aggregated data. The bucket would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;start_timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2019-09-28T02:00:00.000Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nx"&gt;locations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;
           &lt;span class="na"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;8.80173&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="na"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;20.63476&lt;/span&gt;
       &lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;
           &lt;span class="na"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;8.80175&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="na"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;20.63478&lt;/span&gt;
       &lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="err"&gt;…&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;
           &lt;span class="nl"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;8.80378&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="nx"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;20.63786&lt;/span&gt;
       &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="nx"&gt;average_speed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;56.056&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Using the Bucket Pattern, we went down from 60 documents per hour to only one. For the index, we can now index on &lt;code&gt;start_timestamp&lt;/code&gt; instead of &lt;code&gt;timestamp&lt;/code&gt;, so it is 60 times smaller.&lt;/p&gt;

&lt;p&gt;Now you might ask "how does this help with pagination though?". The answer is that by pre-aggregating the data per hour, we implemented a built-in pagination for the collection. To get the 10th page, we just need to get the 10th document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// the date at which our measurement started&lt;/span&gt;
&lt;span class="nx"&gt;data_start_point&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2019-01-01T01:00:00.000Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// add a 10-hour period to the date which is 10*60*60*1000 in milliseconds&lt;/span&gt;
&lt;span class="nx"&gt;page_timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data_start_point&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getTime&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;start_timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;page_timestamp&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If you want to get the 10th page but you prefer each page to hold 3 hours of data instead of one, it's just a matter of math:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// the date at which our measurement started&lt;/span&gt;
&lt;span class="nx"&gt;data_start_point&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2019-01-01T01:00:00.000Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// add a 30-hour period to the date which is 3*10*60*60*1000 in milliseconds&lt;/span&gt;
&lt;span class="nx"&gt;page_start_timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data_start_point&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getTime&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// use "limit" to get 3-hour data&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;start_timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$gte&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;page_timestamp&lt;/span&gt; &lt;span class="p"&gt;}}).&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
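&lt;p&gt;Since the page arithmetic above is pure date math, it can be captured in one small standalone function (a sketch; the function name and the 1-based page convention are assumptions):&lt;/p&gt;

```javascript
// Start timestamp of the p-th page (1-based) when each bucket covers one
// hour and a page spans `hoursPerPage` buckets. Pure date arithmetic.
const HOUR_MS = 60 * 60 * 1000;

function pageStartTimestamp(dataStart, page, hoursPerPage) {
  const offsetHours = (page - 1) * hoursPerPage;
  return new Date(dataStart.getTime() + offsetHours * HOUR_MS);
}

// The query for that page would then be:
// db.users.find({ start_timestamp: { $gte: ts } }).limit(hoursPerPage);
```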



</description>
      <category>mongodb</category>
      <category>database</category>
      <category>nosql</category>
    </item>
    <item>
      <title>The case against jQuery</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Mon, 25 Nov 2019 18:11:05 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/the-case-against-jquery-57oi</link>
      <guid>https://dev.to/tariqabughofa/the-case-against-jquery-57oi</guid>
      <description>&lt;h1&gt;
  
  
  The Good
&lt;/h1&gt;

&lt;p&gt;jQuery was a great library. It made manipulating the DOM and adding listeners very easy at a time when javascript wasn't as mature as it is today. It saved programmers a lot of the trouble of making sure their code worked properly in all browsers. The syntax is user friendly and easy to learn.&lt;/p&gt;

&lt;p&gt;However, these same great features of jQuery make it the pain that it is today. When I started working in frontend development, I was introduced to jQuery at the same time I was introduced to javascript. I was hooked immediately. Why wouldn't I be? The Web API implementation varied widely between browsers, and all the good plugins out there depended on jQuery. &lt;/p&gt;

&lt;h1&gt;
  
  
  The Bad
&lt;/h1&gt;

&lt;p&gt;Since every time I needed to manipulate the DOM I imported jQuery, I didn't even bother to learn how it's done natively. Neither did my colleagues. And it is a widespread problem: many developers think of javascript as jQuery. A quick look at Stack Overflow shows how many people answer javascript questions using the jQuery API. You have to say "vanilla javascript" or "without jQuery" to get a proper answer, and even then you may get answers like "you should use jQuery" :( A simple Google search for any DOM-related javascript question shows the same problem. (The term &lt;em&gt;vanilla javascript&lt;/em&gt; is a conundrum on its own. Check this cool satire website about &lt;a href="http://vanilla-js.com/"&gt;vanilla js&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;This is a trap that many fall into. Instead of learning javascript, then the web API, then jQuery, the order happens in reverse, or maybe it never goes further than just learning jQuery. It's like learning Rails without learning Ruby. Once things get a bit complicated you'll be in hot water, stuck with the framework even when it's not needed. &lt;br&gt;
Add to that the confusion over whether a variable is a native javascript DOM element or a jQuery-wrapped one.&lt;/p&gt;

&lt;p&gt;If you use a frontend framework, you will see how polluted the code becomes if you manipulate the DOM with jQuery, since all frameworks rightfully pass around native DOM elements. Not to mention that jQuery encourages writing spaghetti code, partly because of the lack of any structural standard around it and the ability to chain DOM selectors.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Ugly
&lt;/h1&gt;

&lt;p&gt;You can say "I learned javascript properly, and when I don't want to use jQuery I can just drop it". Well, it's not that easy. Almost every javascript library is a jQuery plugin. The responsive design libraries like Bootstrap and Foundation, WordPress, Select2, Fancybox, and many other frontend libraries are &lt;em&gt;dependent&lt;/em&gt; on jQuery.&lt;/p&gt;

&lt;p&gt;That adds at least 82.54 KB to your website's initial download (for the minified version). Not to mention that jQuery sacrifices performance to be able to do its magic. The need to include it anyway lures developers into using it in their own code, and the hole keeps getting deeper.&lt;/p&gt;

&lt;h1&gt;
  
  
  Opposing Arguments
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Cross-browser support
&lt;/h3&gt;

&lt;p&gt;The Web API differences between browsers have dropped significantly since the introduction of jQuery. Browser usage is also much more concentrated in Chrome nowadays (not that I'm happy about it ¯\_(ツ)_/¯), and users tend to be on the latest version of whatever browser they use, since the update process is much easier today.&lt;/p&gt;

&lt;p&gt;You might say that your users use some ancient IE version you need to support. Luckily, this argument doesn't hold anymore, since you can use the &lt;a href="https://babeljs.io/"&gt;Babel&lt;/a&gt; project to support any list of browsers and versions you like. Plus, Babel is not a run-time dependency, so no performance overhead is added.&lt;/p&gt;

&lt;h3&gt;
  
  
  The shortcomings of javascript
&lt;/h3&gt;

&lt;p&gt;A strong argument for jQuery was that javascript used to require a lot of boilerplate. Functions like &lt;code&gt;$.inArray()&lt;/code&gt; or &lt;code&gt;$.each()&lt;/code&gt; were there to overcome a painful way of iterating arrays in javascript. However, the alternatives (&lt;code&gt;forEach&lt;/code&gt; and &lt;code&gt;Object.keys()&lt;/code&gt;) have existed for a long time and are supported in IE9+. Javascript has evolved a lot since ES5, and even for browsers with limited feature support, using a transpiler is far better than using a run-time library. &lt;/p&gt;
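&lt;p&gt;For the array and object helpers in particular, the native equivalents are one-liners. A runnable comparison, no jQuery involved:&lt;/p&gt;

```javascript
// Native stand-ins for common jQuery array/object utilities.
const fruits = ["apple", "banana", "cherry"];

// $.inArray(value, array) -> Array.prototype.indexOf
const idx = fruits.indexOf("banana"); // 1

// $.each(array, fn) -> Array.prototype.forEach
const upper = [];
fruits.forEach((f) => upper.push(f.toUpperCase()));

// $.each(object, fn) -> Object.keys plus forEach
const counts = { apple: 2, cherry: 5 };
const keys = Object.keys(counts); // ["apple", "cherry"]
```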

&lt;h3&gt;
  
  
  I don't use react (or Vue) so I use jQuery
&lt;/h3&gt;

&lt;p&gt;Does it really need to be either/or? :) &lt;/p&gt;

&lt;h3&gt;
  
  
  Animation libraries require jQuery
&lt;/h3&gt;

&lt;p&gt;There are many lightweight animation libraries that don't require jQuery, such as &lt;a href="https://shopify.github.io/draggable/"&gt;draggable&lt;/a&gt;, &lt;a href="https://github.com/cferdinandi/smooth-scroll"&gt;smooth-scroll&lt;/a&gt; and &lt;a href="https://github.com/SortableJS/Sortable"&gt;sortable&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  How can I help?
&lt;/h1&gt;

&lt;p&gt;Make sure to use native javascript DOM manipulation. Many websites can help you find the alternative syntax and show you how easy it is: &lt;a href="http://youmightnotneedjquery.com/"&gt;http://youmightnotneedjquery.com/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another way is to support and use lightweight libraries that do not depend on jQuery. &lt;a href="https://blog.bigbinary.com/2017/06/20/rails-5-1-has-dropped-dependency-on-jquery-from-the-default-stack.html"&gt;Rails removed jQuery as a dependency since 5.1&lt;/a&gt;. &lt;a href="https://github.blog/2018-09-06-removing-jquery-from-github-frontend/"&gt;Github ditched jQuery last year&lt;/a&gt;.  &lt;a href="https://news.ycombinator.com/item?id=19147466"&gt;Bootstrap 5 will not depend on jQuery&lt;/a&gt;, and I listed many animation libraries above.&lt;br&gt;
You can also share here the libraries you like that don't depend on jQuery.&lt;/p&gt;

&lt;p&gt;Do you have a reason why you personally use jQuery, or do you think it still has a place today? Please share in the comments and I will be happy to discuss it.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>webdev</category>
      <category>frontend</category>
      <category>jquery</category>
    </item>
    <item>
      <title>General and Unconventional Pagination Techniques in Postgres</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Tue, 19 Nov 2019 03:26:29 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/general-and-unconventional-pagination-techniques-postgres-24fh</link>
      <guid>https://dev.to/tariqabughofa/general-and-unconventional-pagination-techniques-postgres-24fh</guid>
      <description>&lt;p&gt;I talked in a previous article on how pagination can be done in SQL databases &lt;em&gt;(head there to read more about the general techniques. Their advantages, disadvantages, and scenarios)&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag__link"&gt;
  &lt;a href="/tariqabughofa" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dhQgoHlZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/practicaldev/image/fetch/s--l9sa7M4---/c_fill%2Cf_auto%2Cfl_progressive%2Ch_150%2Cq_auto%2Cw_150/https://dev-to-uploads.s3.amazonaws.com/uploads/user/profile_image/252892/86e1d43f-b46f-4397-840c-e3dba8c2a9ae.jpg" alt="tariqabughofa image"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/tariqabughofa/how-to-paginate-the-right-way-in-sql-hdc" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;How to Paginate (the right way) in SQL&lt;/h2&gt;
      &lt;h3&gt;Tariq Abughofa ・ Nov 13 '19 ・ 4 min read&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#sql&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#database&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#postgres&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
In this article, I show how to implement these approaches in the Postgres database engine, in addition to three unconventional pagination methods specific to Postgres.
&lt;h3&gt;
  
  
  &lt;code&gt;limit&lt;/code&gt; / &lt;code&gt;offset&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;limit&lt;/code&gt; / &lt;code&gt;offset&lt;/code&gt; approach is standard SQL and works as-is in Postgres:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;sold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
 &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;OFFSET&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Cursors
&lt;/h3&gt;

&lt;p&gt;Cursors are also pretty straightforward. You first declare the cursor with the query it will execute (the query can be bounded or unbounded); in Postgres, declaring a cursor inside a transaction also opens it. Then you fetch a page of rows with each &lt;code&gt;FETCH&lt;/code&gt; statement, and close the cursor at the end to release its resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt;
   &lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="k"&gt;CURSOR&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
       &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
       &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;production_year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="c1"&gt;-- Open the cursor&lt;/span&gt;
   &lt;span class="k"&gt;OPEN&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="n"&gt;LOOP&lt;/span&gt;
    &lt;span class="c1"&gt;-- fetch row into the film&lt;/span&gt;
      &lt;span class="k"&gt;FETCH&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;-- exit when no more row to fetch&lt;/span&gt;
      &lt;span class="n"&gt;EXIT&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;FOUND&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;LOOP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

   &lt;span class="c1"&gt;-- Close the cursor&lt;/span&gt;
   &lt;span class="k"&gt;CLOSE&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Key-based Pagination
&lt;/h3&gt;

&lt;p&gt;Pretty much SQL standard as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now let's start with the unconventional pagination approaches:&lt;/p&gt;

&lt;h3&gt;
  
  
  Paginating over &lt;code&gt;xmin&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;xmin&lt;/code&gt; is one of several system columns that Postgres adds implicitly to each table. This column in particular holds the identifier of the database transaction that inserted or last updated the corresponding row. Because of that, it serves as a good way to identify the changes that appeared on a table after a certain point, and to sort rows by the time they were last touched by a transaction.&lt;/p&gt;

&lt;p&gt;To use the xmin column for pagination, we can use the same key-based pagination approach. The following query paginates with a 1000-row limit, with the data sorted in ascending order of last update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;xmin&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;xmin&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The method has the same disadvantage as key-based pagination: you can't jump to an arbitrary page. Instead, you scroll through the pages until you reach the one you need.&lt;/p&gt;

&lt;p&gt;One thing to be aware of when using &lt;code&gt;xmin&lt;/code&gt; is that, since it represents the transaction id, rows that were inserted or updated in the same transaction have the same &lt;code&gt;xmin&lt;/code&gt; and thus no defined order among themselves.&lt;/p&gt;
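One way to make the order deterministic is to add a tie-breaker to the key. A minimal sketch, assuming the table also has an auto-incremented `id` primary key:

```sql
-- Key-based pagination over (xmin, id); the id column is an
-- assumed primary key used only to break ties within a transaction
SELECT *
FROM users
WHERE (xmin::text::int, id) > (5000, 0)
ORDER BY xmin::text::int ASC, id ASC
LIMIT 1000;
```

The next page would reuse the last `(xmin, id)` pair seen in place of `(5000, 0)`.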

&lt;h3&gt;
  
  
Paginating over &lt;code&gt;ctid&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ctid&lt;/code&gt; is another system column in Postgres. It holds the physical location of the row, so when data is sorted on this column it comes back in storage order, which is logically random. It also means rows are fetched very fast, since there is no extra disk-access cost to move from one row to the next. That's why indexes internally use &lt;code&gt;ctid&lt;/code&gt;s instead of the primary key to point to rows.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ctid&lt;/code&gt; value is a tuple &lt;code&gt;(p, n)&lt;/code&gt; where &lt;code&gt;p&lt;/code&gt; represents the physical page (block) number and &lt;code&gt;n&lt;/code&gt; represents the row number within that page.&lt;/p&gt;

&lt;p&gt;This is good for a scenario in which:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we need extremely fast deep page access.&lt;/li&gt;
&lt;li&gt;filtering is not needed.&lt;/li&gt;
&lt;li&gt;we don't care about the row orders or we want random row order.&lt;/li&gt;
&lt;li&gt;we don't require all the pages to have the same number of rows (since deleted rows leave holes in pages).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To get page number &lt;code&gt;p&lt;/code&gt;, we need to do some calculations. Each page holds "block size" bytes of data (8192 bytes, or 8 KB, by default). Rows are referenced by 32-bit (4-byte) pointers, so there are at most block size / 4 rows per page. The following query generates all the possible &lt;code&gt;ctid&lt;/code&gt; values in page &lt;code&gt;p&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'('&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;','&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;')'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;current_setting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'block_size'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To get rows in page &lt;code&gt;p&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ctid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;ANY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARRAY&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'('&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;','&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;')'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;
     &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;current_setting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'block_size'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Pagination using row statistics
&lt;/h3&gt;

&lt;p&gt;Postgres records statistics about its tables in the &lt;code&gt;pg_statistic&lt;/code&gt; catalog and provides an interface to access them through the &lt;code&gt;pg_stats&lt;/code&gt; view. One of these statistics is called &lt;code&gt;histogram_bounds&lt;/code&gt;. It holds, per column, the value distribution divided into buckets. Querying this field gives something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;histogram_bounds&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stats&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'users'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;attname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'id'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

                   &lt;span class="n"&gt;histogram_bounds&lt;/span&gt;
&lt;span class="c1"&gt;------------------------------------------------------&lt;/span&gt;
 &lt;span class="err"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;993&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1997&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3050&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4040&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5036&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5957&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7057&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8029&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;9016&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;9995&lt;/span&gt;&lt;span class="err"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We notice in the example that the values for the &lt;code&gt;id&lt;/code&gt; column go from 0 to 9995, divided into buckets of roughly 1000 values each. The first bucket goes from &lt;code&gt;id&lt;/code&gt; 0 to 993, the second from 993 to 1997, and so on. As you can see, there is an opportunity here to use these buckets to paginate over &lt;code&gt;id&lt;/code&gt;. If we assume the bucket size is &lt;code&gt;b&lt;/code&gt;, the page size is &lt;code&gt;n&lt;/code&gt;, and the page we want is &lt;code&gt;p&lt;/code&gt;, a simple calculation shows that the bucket containing our page has its lower bound at index &lt;code&gt;n * p / b + 1&lt;/code&gt; and its upper bound at index &lt;code&gt;n * p / b + 2&lt;/code&gt;. Once we have the histogram bounds for our bucket, the query is pretty easy to write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;histogram_bounds&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;[])[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lower_bound&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;histogram_bounds&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;[])[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;upper_bound&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stats&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'users'&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;attname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'id'&lt;/span&gt;
    &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;lower_bound&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;upper_bound&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
&lt;span class="k"&gt;OFFSET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Notice that we still use &lt;code&gt;OFFSET&lt;/code&gt; in the query, but it's applied only within the bucket rather than the whole table. This means that in the worst case, fetching the last page of a bucket reads all &lt;code&gt;b&lt;/code&gt; of its rows, instead of scanning the whole table.&lt;/p&gt;

&lt;p&gt;The results can be approximate, since the table statistics may lag behind the actual contents of the table. Still, the method is blazing fast for deep pagination with random access, as long as the application tolerates approximate results and doesn't require filtering.&lt;/p&gt;
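Since the histogram comes from planner statistics, one mitigation (assuming you have the privilege to run it) is to refresh the statistics before paginating:

```sql
-- Refresh planner statistics so histogram_bounds in pg_stats
-- reflects recent inserts, updates, and deletes on the table
ANALYZE users;
```

Autovacuum normally does this in the background, so a manual `ANALYZE` only matters right after heavy writes.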

&lt;p&gt;Stay tuned for more posts on this subject with NoSQL Databases :). &lt;/p&gt;

</description>
      <category>sql</category>
      <category>postgres</category>
      <category>database</category>
    </item>
    <item>
      <title>How to Paginate (the right way) in SQL</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Wed, 13 Nov 2019 23:28:30 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/how-to-paginate-the-right-way-in-sql-hdc</link>
      <guid>https://dev.to/tariqabughofa/how-to-paginate-the-right-way-in-sql-hdc</guid>
      <description>&lt;h4&gt;
  
  
  Many ways to scroll through a table. Easy to misuse them.
&lt;/h4&gt;

&lt;p&gt;Server-side pagination is a commonly-used feature in SQL databases. It helps when showing a huge set of results on user interfaces, it's required in RESTful APIs, and it comes to the rescue whenever you need to process large data in bulk that doesn't fit in memory. The problem is that, done wrong, it can be as inefficient as loading the full set, or worse.&lt;/p&gt;

&lt;p&gt;There are a number of ways to implement pagination in SQL systems. In this article I will go through these methods, with the advantages, disadvantages, and reasonable usage scenarios of each.&lt;/p&gt;

&lt;h4&gt;
  
  
  Client-side Pagination:
&lt;/h4&gt;

&lt;p&gt;The requested query is passed as it is without pagination. After the client gets the full set of data, it divides the data and shows it paginated to the user.&lt;/p&gt;

&lt;h6&gt;
  
  
  Advantages
&lt;/h6&gt;

&lt;p&gt;It reduces the number of requests to the database. The data can even be cached on the client-side to avoid future requests which makes later loads even faster.&lt;/p&gt;

&lt;h6&gt;
  
  
  Disadvantages
&lt;/h6&gt;

&lt;p&gt;Not suitable for regularly changing data. The dataset should be small enough to fit in memory and not make the initial load painfully slow.&lt;/p&gt;

&lt;h6&gt;
  
  
  Suitable Scenarios
&lt;/h6&gt;

&lt;p&gt;Small datasets which are not regularly updated but are needed frequently, such as a list of categories.&lt;/p&gt;

&lt;h4&gt;
  
  
  Using Limit/Offset:
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;limit&lt;/code&gt; and &lt;code&gt;offset&lt;/code&gt; are standard SQL keywords and the first solution that comes to mind for pagination over datasets. The query syntax would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;sold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
 &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;OFFSET&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h6&gt;
  
  
  Advantages
&lt;/h6&gt;

&lt;p&gt;Easy to implement, which is why many popular ORMs use it. It's as easy as chaining query functions like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;findAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It also lets you filter the table while paginating, using a &lt;code&gt;WHERE&lt;/code&gt; clause, and you can sort on any column, which makes the query very customizable. Only one page of data is loaded at a time, so there is no risk of running out of memory.&lt;/p&gt;

&lt;h6&gt;
  
  
  Disadvantages
&lt;/h6&gt;

&lt;p&gt;It can be HORRIBLE. The query performance goes downhill as the offset value increases. Let's say you are fetching page &lt;code&gt;n&lt;/code&gt;. How does &lt;code&gt;offset&lt;/code&gt; skip the first &lt;code&gt;n-1&lt;/code&gt; pages? It doesn't, not cheaply: the database has to scan and discard all the rows of the first &lt;code&gt;n-1&lt;/code&gt; pages before it can return page &lt;code&gt;n&lt;/code&gt;.&lt;/p&gt;
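In Postgres, for example, you can watch this happen with `EXPLAIN ANALYZE`; a sketch (the exact plan and timings depend entirely on your data and indexes):

```sql
-- The executor must produce and throw away the first 100000 rows
-- before returning the 10 requested ones
EXPLAIN ANALYZE
SELECT *
FROM products
ORDER BY sale_date DESC
LIMIT 10 OFFSET 100000;
```

Comparing the runtime against the same query with `OFFSET 0` makes the linear cost of deep offsets obvious.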

&lt;p&gt;An offset means that a certain number of records will be skipped from the start. But what if new data were inserted into page &lt;code&gt;n-1&lt;/code&gt; while page &lt;code&gt;n&lt;/code&gt; is being loaded? Rows from page &lt;code&gt;n-1&lt;/code&gt; will be pushed into page &lt;code&gt;n&lt;/code&gt;, causing inconsistency. Worse, if the paginating process is also updating the data, rows might be processed multiple times, causing data inconsistency.&lt;/p&gt;

&lt;h6&gt;
  
  
  Suitable Scenarios
&lt;/h6&gt;

&lt;p&gt;Great for user interfaces: easy to implement and very customizable to different filter/order preferences, as long as the depth of the search results is limited. Perfect for pages where users paginate down the results by scrolling. But put this functionality on a large dataset with a "last page" button on the interface, and brace yourself for a database server going out of service.&lt;/p&gt;

&lt;p&gt;For migrations and large dataset processing, my advice: Offset the &lt;code&gt;offset&lt;/code&gt; statement.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cursors
&lt;/h4&gt;

&lt;p&gt;SQL cursors make the server do the pagination for you. All you need to worry about is the query you want to paginate. It looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a cursor for the query and open it&lt;/span&gt;
&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="k"&gt;CURSOR&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;OPEN&lt;/span&gt; &lt;span class="n"&gt;curEmail&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Retrieve ten rows each time&lt;/span&gt;
&lt;span class="k"&gt;FETCH&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;FETCH&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- All done&lt;/span&gt;
&lt;span class="k"&gt;CLOSE&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h6&gt;
  
  
  Advantages
&lt;/h6&gt;

&lt;p&gt;Supports arbitrary queries. No performance drop when going further in the pages.&lt;/p&gt;

&lt;h6&gt;
  
  
  Disadvantages
&lt;/h6&gt;

&lt;p&gt;The implementation details differ from one SQL engine to another, but in general cursors hold resources on the server and can create locks. Clients can't share cursors, so each client has to open its own, which means they can't share the same pagination.&lt;/p&gt;

&lt;h6&gt;
  
  
  Suitable Scenarios
&lt;/h6&gt;

&lt;p&gt;A single client that needs to paginate over large sets of data and cares about pagination consistency. There are many cursor types to suit different applications: &lt;code&gt;READ_ONLY&lt;/code&gt;, which avoids locks on the table; &lt;code&gt;STATIC&lt;/code&gt;, which copies the result into a temporary table, good when updates don't matter; &lt;code&gt;KEYSET&lt;/code&gt;, which copies only the primary keys, so you see updates to existing rows but not new rows; and &lt;code&gt;DYNAMIC&lt;/code&gt;, which is like &lt;code&gt;KEYSET&lt;/code&gt; but keeps the key copy up to date, so it also sees newly inserted and deleted rows. The availability of these options depends on the engine.&lt;/p&gt;
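As a sketch of how these types are chosen, here is the SQL Server flavour of the declaration (the option keywords above are SQL Server's; Postgres and other engines use a different, smaller set of options):

```sql
-- SQL Server syntax: a static, read-only snapshot cursor
DECLARE cur CURSOR STATIC READ_ONLY FOR
    SELECT id, name FROM products;
OPEN cur;
FETCH NEXT FROM cur;   -- repeat to walk the snapshot
CLOSE cur;
DEALLOCATE cur;
```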

&lt;h4&gt;
  
  
  Key-based Pagination
&lt;/h4&gt;

&lt;p&gt;This technique is my favourite for data migrations. It requires an indexed column, but in exchange it offers a great optimization while staying stateless on the server side.&lt;/p&gt;

&lt;p&gt;The first page is fetched with the following statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For the second page, the query uses the maximum &lt;code&gt;id&lt;/code&gt; value fetched in the first page. Let's say it's 1000; the query is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And so on for the next pages.&lt;/p&gt;

&lt;h5&gt;
  
  
  Advantages
&lt;/h5&gt;

&lt;p&gt;Offers great scalability: it's as fast for the 1,000,000th page as for the first. Pagination consistency is also preserved. It supports filtering and ordering on any column, or on multiple columns, as long as you have indices on them.&lt;/p&gt;

&lt;h5&gt;
  
  
  Disadvantages
&lt;/h5&gt;

&lt;p&gt;In general, there is no way to jump to a specific page. However, if you have an auto-incremented identifier in the dataset, that can be done with a simple calculation: to retrieve page &lt;code&gt;n&lt;/code&gt;, the lower bound for the query would be &lt;code&gt;n * 1000&lt;/code&gt;, where 1000 is the page size (the limit).&lt;/p&gt;
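For example, a sketch of jumping straight to page 5 with a page size of 1000, assuming a dense auto-incremented `id` (no gaps from deletes or rolled-back inserts):

```sql
-- Page 5 (counting from 0) starts right after id = 5 * 1000
SELECT *
FROM products
WHERE id > 5000
ORDER BY id ASC
LIMIT 1000;
```

If ids have gaps, the page boundaries drift, so this shortcut only approximates true page numbers.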

&lt;h5&gt;
  
  
  Suitable Scenarios
&lt;/h5&gt;

&lt;p&gt;Almost any scenario. Very convenient for scalable applications where a lot of server requests are expected.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Git: Theirs vs Ours</title>
      <dc:creator>Tariq Abughofa</dc:creator>
      <pubDate>Fri, 18 Oct 2019 22:13:47 +0000</pubDate>
      <link>https://dev.to/tariqabughofa/git-theirs-vs-ours-3i7h</link>
      <guid>https://dev.to/tariqabughofa/git-theirs-vs-ours-3i7h</guid>
      <description>&lt;p&gt;Conflicts in git can be a pain to deal with especially that you have to go into each code block which generated a conflict and check which version you want to keep and sometimes it can be hard to be sure. However, in some scenarios, the choice is easy. You want to keep whatever already existed on the server, or the opposite, you want to override it all with what you have. In this case, instead of going through the conflicts manually, git has extremely helpful flags which are &lt;code&gt;--ours&lt;/code&gt; and &lt;code&gt;--theirs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's assume that the conflict looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt; HEAD
 this is some content
=======
 this is a totally different content
&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; new_branch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;When you have a merge conflict and you know exactly which version you want to keep, entirely or on a file basis, you can use these flags like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --ours .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Or:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --our /path/to/your/conflict/file.rb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then you commit the chosen changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git add /path/to/your/conflict/file.rb
git commit
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;But when do you use which? Is it "ours" or "theirs"? Well, this can be a bit confusing... In the conflict above, for example, it's hard to tell. You might say "HEAD is whatever is already on the branch we're merging into, and the other part is what's on the new branch", but it's actually more complicated than that.&lt;/p&gt;

&lt;p&gt;If you look at the help for the git checkout command, you'll see the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   -2, --ours            checkout our version for unmerged files
   -3, --theirs          checkout their version for unmerged files
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;That doesn't sound very helpful. Let's go to the actual documentation. For version 2.23.0 it says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--ours
--theirs
When checking out paths from the index, check out stage #2 (ours) or #3 (theirs) for unmerged paths.

Note that during git rebase and git pull --rebase, ours and theirs may appear swapped; --ours gives the version from the branch the changes are rebased onto, while --theirs gives the version from the branch that holds your work that is being rebased.

This is because rebase is used in a workflow that treats the history at the remote as the shared canonical one, and treats the work done on the branch you are rebasing as the third-party work to be integrated, and you are temporarily assuming the role of the keeper of the canonical history during the rebase. As the keeper of the canonical history, you need to view the history from the remote as ours (i.e. "our shared canonical history"), while what you did on your side branch as theirs (i.e. "one contributor’s work on top of it").
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
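
&lt;p&gt;A quick way to see those stages for yourself is to inspect the index of a conflicted repo with &lt;code&gt;git ls-files -u&lt;/code&gt;. Here is a throwaway sketch; the repo, file, and branch names are made up to mirror the examples in this post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email you@example.com
git config user.name "You"
echo "original content" > file.rb
git add file.rb
git commit -qm "base"
git checkout -qb new_branch
echo "this is a totally different content" > file.rb
git commit -qam "edit on new_branch"
git checkout -q master
echo "this is some content" > file.rb
git commit -qam "edit on master"
git merge new_branch || true   # leaves file.rb in conflict
git ls-files -u                # one line per stage: 1 = base, 2 = ours, 3 = theirs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Each unmerged path is listed once per stage, and &lt;code&gt;--ours&lt;/code&gt; and &lt;code&gt;--theirs&lt;/code&gt; simply check out stage #2 or #3 for it.&lt;/p&gt;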



&lt;p&gt;Let's explain this further. We need to distinguish between the three git operations that integrate changes and can therefore generate conflicts: merges, rebases, and pulls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Merges
&lt;/h3&gt;

&lt;p&gt;A merge scenario can be like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout master
git merge new_branch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This one feels more intuitive as your guess is probably right:&lt;/p&gt;

&lt;p&gt;To keep the changes that are on &lt;code&gt;master&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --ours .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To keep the changes that are on &lt;code&gt;new_branch&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --theirs .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
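
&lt;p&gt;To see the whole merge flow end to end, here is a disposable sketch (hypothetical file and branch names) that recreates the conflict from the beginning of the post and resolves it in favor of master:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email you@example.com
git config user.name "You"
echo "original content" > file.rb
git add file.rb
git commit -qm "base"
git checkout -qb new_branch
echo "this is a totally different content" > file.rb
git commit -qam "edit on new_branch"
git checkout -q master
echo "this is some content" > file.rb
git commit -qam "edit on master"
git merge new_branch || true   # conflict
git checkout --ours file.rb    # keep master's version
git add file.rb
git commit -qm "resolve keeping ours"
cat file.rb                    # prints "this is some content"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;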



&lt;h3&gt;
  
  
  Rebases
&lt;/h3&gt;

&lt;p&gt;A conflict can appear within a rebase when you do something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout new_branch
git rebase master
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This one feels a bit counter-intuitive, as ours is not the branch we're on:&lt;/p&gt;

&lt;p&gt;To keep the changes that are on &lt;code&gt;master&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --ours .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To keep the changes that are on &lt;code&gt;new_branch&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --theirs .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;So exactly the same as before.&lt;/p&gt;
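
&lt;p&gt;To convince yourself of the swap, here is a similar throwaway sketch (again with made-up names) where we rebase and keep new_branch's work, which during the rebase counts as theirs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email you@example.com
git config user.name "You"
echo "original content" > file.rb
git add file.rb
git commit -qm "base"
git checkout -qb new_branch
echo "this is a totally different content" > file.rb
git commit -qam "edit on new_branch"
git checkout -q master
echo "this is some content" > file.rb
git commit -qam "edit on master"
git checkout -q new_branch
git rebase master || true       # conflict while replaying new_branch's commit
git checkout --theirs file.rb   # theirs = new_branch, the work being replayed
git add file.rb
GIT_EDITOR=true git rebase --continue
cat file.rb                     # prints "this is a totally different content"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;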

&lt;h3&gt;
  
  
  Pulls
&lt;/h3&gt;

&lt;p&gt;A pull command fetches data from a remote source and incorporates the changes into the local branch. The incorporation can be done either with a merge operation (the default mode) or with a rebase operation.&lt;/p&gt;

&lt;p&gt;The fetch + merge scenario looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout master
git pull origin new_branch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To keep the changes that are on &lt;code&gt;master&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --ours .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To keep the changes that are on &lt;code&gt;new_branch&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --theirs .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;So exactly the same as a merge operation.&lt;/p&gt;
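
&lt;p&gt;Here is a sketch of the fetch + merge case, using a local directory as a stand-in for the remote (all names hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd "$(mktemp -d)"
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/master
git config user.email you@example.com
git config user.name "You"
echo "original content" > file.rb
git add file.rb
git commit -qm "base"
git checkout -qb new_branch
echo "this is a totally different content" > file.rb
git commit -qam "edit on new_branch"
git checkout -q master
echo "this is some content" > file.rb
git commit -qam "edit on master"
cd ..
git clone -q upstream local          # "origin" now points at the upstream repo
cd local
git config pull.rebase false         # make the pull use the default merge mode
git pull origin new_branch || true   # fetch + merge, stops on the conflict
git checkout --theirs file.rb        # theirs = new_branch
cat file.rb                          # prints "this is a totally different content"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;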

&lt;p&gt;The fetch + rebase scenario happens in the following way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout new_branch
git pull origin master --rebase
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To keep the changes that are on &lt;code&gt;master&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --ours .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To keep the changes that are on &lt;code&gt;new_branch&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --theirs .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Exactly the same as a rebase operation.&lt;/p&gt;

&lt;p&gt;However, you don't have to memorize it as "merge makes sense and rebase doesn't" 😅. It is actually much simpler than that, and it can be summarized in one sentence:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"&lt;code&gt;ours&lt;/code&gt; represents the history and &lt;code&gt;theirs&lt;/code&gt; is the new applied commits"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In a merge, git takes the current branch and applies the additional commits on top of its HEAD. The current branch is the history (&lt;code&gt;ours&lt;/code&gt;) and the incoming commits are the new work (&lt;code&gt;theirs&lt;/code&gt;). In a rebase, git rewrites the history of the current branch so that it sits on top of the other branch, then applies the current branch's commits one by one. The other branch becomes the history, thus &lt;code&gt;ours&lt;/code&gt;, and the current branch's commits are the new additions, thus &lt;code&gt;theirs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now let's get back to our conflict example. Say you want to keep the work of the branch you ran the command from. If the operation was the merge above (you were on master), you use &lt;code&gt;--ours&lt;/code&gt; and the final code would be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;this is some content
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;while with the rebase (you were on new_branch, so your own work counts as theirs), you use &lt;code&gt;--theirs&lt;/code&gt; and it would be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;this is a totally different content
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



</description>
      <category>git</category>
      <category>github</category>
    </item>
  </channel>
</rss>
