<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Veerpal</title>
    <description>The latest articles on DEV Community by Veerpal (@veerpalb).</description>
    <link>https://dev.to/veerpalb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F371338%2F9290b24d-6cef-47cf-8283-59b427159bfa.png</url>
      <title>DEV Community: Veerpal</title>
      <link>https://dev.to/veerpalb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/veerpalb"/>
    <language>en</language>
    <item>
      <title>How to Run a Program or Script Hourly on macOS</title>
      <dc:creator>Veerpal</dc:creator>
      <pubDate>Thu, 18 Jan 2024 12:44:07 +0000</pubDate>
      <link>https://dev.to/veerpalb/how-to-run-a-program-or-script-hourly-on-macos-1opm</link>
      <guid>https://dev.to/veerpalb/how-to-run-a-program-or-script-hourly-on-macos-1opm</guid>
      <description>&lt;p&gt;Do you have a program or bash script that needs to run continuously or on a specific time interval on your Mac? The solution lies in using &lt;code&gt;launchd&lt;/code&gt;, an Apple-recommended approach and an open-source service management framework. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is &lt;code&gt;launchd&lt;/code&gt;?
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;launchd&lt;/code&gt; is an open-source service management framework recommended by Apple. It enables you to "start, stop, and manage various processes, including daemons, applications, and scripts" [1]. For the purposes of this post, we'll concentrate on working with a launch agent: a process that runs on behalf of the logged-in user.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does &lt;code&gt;launchd&lt;/code&gt; Work?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Generate a Plist File&lt;/strong&gt;: Create a property list (plist) file, which stores preferences in XML format. Use any text editor to define which program or script to run and how often. I'll explain the structure of the file more below. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Save to &lt;code&gt;~/Library/LaunchAgents/&lt;/code&gt;&lt;/strong&gt;: Save the plist file to the &lt;code&gt;~/Library/LaunchAgents/&lt;/code&gt; folder. The system monitors this folder and uses the plist to run your program or script based on the specified time frequency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use &lt;code&gt;launchctl&lt;/code&gt; for Testing&lt;/strong&gt;: The &lt;code&gt;launchctl&lt;/code&gt; command-line utility helps start, stop, and load your job for testing purposes.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Creating a Plist File
&lt;/h2&gt;

&lt;p&gt;A plist file is a straightforward XML file with key-value entries. Here's an overview of the key entries and their significance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Label (Required Key): The Label key is mandatory, serving as the unique name for your job. It must be distinctive to avoid conflicts with other jobs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Program: The Program key specifies the program or script you want to run. In our example, it points to a script containing the logic you want to execute hourly. If you're using a script, ensure it is executable by your user. You can achieve this with the command &lt;code&gt;chmod +x &amp;lt;path/to/script&amp;gt;&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;StartInterval: The StartInterval key determines how frequently your job should run, specified in seconds. To run a job every hour, set it to 3600 seconds (60 seconds/minute * 60 minutes/hour).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;StandardOutPath, StandardInPath, StandardErrorPath: These keys define the paths for standard output, standard input, and standard error logs, respectively. They are useful for organizing and accessing logs related to your job.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below is an example plist file for running a script every hour:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;plist&lt;/span&gt; &lt;span class="na"&gt;version=&lt;/span&gt;&lt;span class="s"&gt;"1.0"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;dict&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;Label&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;string&amp;gt;&lt;/span&gt;local.example.script.start&lt;span class="nt"&gt;&amp;lt;/string&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;Program&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;string&amp;gt;&lt;/span&gt;/Users/me/path/to/file/script.sh&lt;span class="nt"&gt;&amp;lt;/string&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;StartInterval&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;integer&amp;gt;&lt;/span&gt;3600&lt;span class="nt"&gt;&amp;lt;/integer&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;StandardOutPath&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;string&amp;gt;&lt;/span&gt;/Users/me/path/to/logs/log.stdout&lt;span class="nt"&gt;&amp;lt;/string&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;StandardInPath&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;string&amp;gt;&lt;/span&gt;/Users/me/path/to/logs/log.stdin&lt;span class="nt"&gt;&amp;lt;/string&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;StandardErrorPath&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;string&amp;gt;&lt;/span&gt;/Users/me/path/to/logs/log.stderr&lt;span class="nt"&gt;&amp;lt;/string&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dict&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/plist&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ensure that your plist file adheres to this structure, and customize the values accordingly to suit your specific requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing Your Launch Agent
&lt;/h2&gt;

&lt;p&gt;Use the following command to load a new job: &lt;code&gt;launchctl load -w ~/Library/LaunchAgents/local.example.script.start.plist&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Once loaded, run your job and check the final status. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;launchctl start local.example.script.start&lt;/code&gt;: Start a specific job.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;launchctl list | grep "local.example"&lt;/code&gt;: Check the status of your job. A status of zero indicates a successful run. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you encounter a non-zero status, you can decipher it using &lt;code&gt;launchctl error my_err_code&lt;/code&gt;, substituting your status code for &lt;code&gt;my_err_code&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using LaunchControl
&lt;/h2&gt;

&lt;p&gt;For a more user-friendly experience and better error messages, consider using third-party software like LaunchControl. It verifies your plist file and helps identify issues. For instance, if your script is not executable, LaunchControl makes this clear in the UI and provides error messages more precise than the output of &lt;code&gt;launchctl error&lt;/code&gt;. You can download LaunchControl for free, and the trial version allows you to verify your plist files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://launchd.info/"&gt;Launchd.info&lt;/a&gt;: An excellent resource for learning more about configuring and running launch agents.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>macos</category>
      <category>launchctl</category>
    </item>
    <item>
      <title>Solving Top K Frequent Objects with Count Min Sketch</title>
      <dc:creator>Veerpal</dc:creator>
      <pubDate>Fri, 29 Sep 2023 21:55:24 +0000</pubDate>
      <link>https://dev.to/veerpalb/solving-top-k-frequent-objects-with-count-min-sketch-3io6</link>
      <guid>https://dev.to/veerpalb/solving-top-k-frequent-objects-with-count-min-sketch-3io6</guid>
      <description>&lt;p&gt;A recent system design problem I came across is how to calculate top-K items at a high scale. For instance, determining the top 100 videos on a streaming site.&lt;/p&gt;

&lt;p&gt;In the "leetcode" version of a top K problem, a hash or a heap track the count of an item. However, both hash and heap have a space complexity of &lt;code&gt;O(n)&lt;/code&gt;. For 1 billion videos, that equates to 4GB of data – 8GB if you consider the need to store both video ID and count. Additionally, a heap has a &lt;code&gt;log(n)&lt;/code&gt; insertion time, so as more videos are tracked, updates will slow down. &lt;/p&gt;

&lt;p&gt;There's an alternative for counting a large number of items: the Count-Min Sketch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enter the Count-Min Sketch
&lt;/h3&gt;

&lt;p&gt;The Count-Min Sketch (CMS) is a probabilistic data structure. It provides approximate counts for large-scale data streams using limited memory.&lt;/p&gt;

&lt;p&gt;A CMS comprises many arrays of a fixed size &lt;code&gt;n&lt;/code&gt;. This &lt;code&gt;n&lt;/code&gt; can be smaller than the total number of items you're tracking. Each array has an associated hash function.&lt;/p&gt;

&lt;p&gt;When you want to increment the count of an item, iterate over each array. For each array, compute the item's hash. The resulting value is an index. To raise the count, you increment the value at that index.&lt;/p&gt;

&lt;p&gt;To find out the final count for an item, repeat the hashing process for each row. Instead of increasing the count, get the current value at the index. The item's count is the minimum value across all rows.&lt;/p&gt;
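&lt;p&gt;The update and query steps can be sketched in Python. This is a simplified illustration with assumed details: the salted-MD5 hashing is just one way to derive per-row hash functions.&lt;/p&gt;

```python
import hashlib


class CountMinSketch:
    """A Count-Min Sketch: `depth` counter arrays, each of fixed size `width`."""

    def __init__(self, depth=3, width=4):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting the item with the row number
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def increment(self, item):
        # Bump one counter in every row
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        # The true count is at most the smallest of the item's counters
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))
```

&lt;p&gt;After a few increments, &lt;code&gt;estimate&lt;/code&gt; returns at least the true count for each item, and exactly the true count whenever at least one row is collision-free for that item.&lt;/p&gt;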

&lt;p&gt;Let's walk through an example for clarity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Walkthrough
&lt;/h3&gt;

&lt;p&gt;Suppose we have a CMS with three arrays, each of size 4.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now, let's say we want to increment the count for videoOne. After hashing the video ID for each row, assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hash_row_1(videoOne) = 0&lt;/li&gt;
&lt;li&gt;hash_row_2(videoOne) = 3&lt;/li&gt;
&lt;li&gt;hash_row_3(videoOne) = 2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our CMS would then be:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now, suppose we increment the count of another video, videoTwo, which hashes to index 1 in row 1, index 3 in row 2, and index 0 in row 3. The CMS becomes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Collisions
&lt;/h3&gt;

&lt;p&gt;Note the collision in row 2, where both videos hashed to the same index. That's why when determining the count for an item, we take the minimum value across all arrays. For example, the count for videoOne is the minimum of (1, 2, 1), which is 1. It's improbable for two videos to hash identically across all rows. Hence, even if some rows have collisions, we use the minimum across all rows to determine the count.&lt;/p&gt;

&lt;p&gt;This makes the CMS an "approximation" algorithm. It does not guarantee an accurate count. It may overestimate but will never underestimate the real count.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://redis.com/blog/count-min-sketch-the-art-and-science-of-estimating-stuff/"&gt;This blog&lt;/a&gt; mentions that with a "depth of 10 and a width of 2,000, the probability of not having an error is 99.9%" Increasing the depth of the CMS can further reduce the error rate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Why Use an Approximation Algorithm?
&lt;/h3&gt;

&lt;p&gt;Why would we use an approximation algorithm that might not return precise results? CMS is memory-efficient since it uses fixed space for estimates. Regardless of how many items we track, its size remains constant. For instance, a 10x4000 CMS uses only 160KB, considerably less than the 4GB required for a heap.&lt;/p&gt;

&lt;p&gt;Additionally, a CMS has constant-time update and lookup, compared to the &lt;code&gt;log(n)&lt;/code&gt; update in a heap. This makes the CMS a faster solution, crucial when dealing with millions or billions of items.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A min heap of size K is still used to track the final &lt;code&gt;K&lt;/code&gt; videos. For each item, update the sketch, estimate the count, and check if the estimate surpasses the heap's minimum. If so, update the heap. The computational cost of updating the min heap remains O(log(k)), and a heap of size 100 is far more manageable in memory than one of size one million.&lt;/p&gt;
&lt;/blockquote&gt;
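&lt;p&gt;As a sketch of that pairing (simplified: an exact &lt;code&gt;Counter&lt;/code&gt; stands in for the sketch's estimates, and the function name is hypothetical):&lt;/p&gt;

```python
import heapq
from collections import Counter


def top_k(stream, k):
    """Track the top-k items of a stream with a size-k min-heap of (count, item)."""
    counts = Counter()          # in the real system: a Count-Min Sketch
    heap, in_heap = [], set()
    for item in stream:
        counts[item] += 1       # cms.increment(item)
        est = counts[item]      # cms.estimate(item)
        if item in in_heap:
            # Refresh the item's stale heap entry with its new estimate
            heap = [(c, v) for c, v in heap if v != item] + [(est, item)]
            heapq.heapify(heap)
        elif len(heap) < k:
            heapq.heappush(heap, (est, item))
            in_heap.add(item)
        elif est > heap[0][0]:
            # The new estimate beats the heap's minimum, so evict it
            _, evicted = heapq.heapreplace(heap, (est, item))
            in_heap.discard(evicted)
            in_heap.add(item)
    return sorted(heap, reverse=True)
```

&lt;p&gt;Feeding the stream &lt;code&gt;["a", "a", "a", "b", "b", "c"]&lt;/code&gt; with k=2 returns the two most frequent items with their counts.&lt;/p&gt;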

&lt;p&gt;The primary trade-off with a CMS is accuracy for space and speed. In most high-scale systems, accuracy still matters, so a CMS can be paired with a more precise solution, such as MapReduce.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Count-Min Sketch with MapReduce
&lt;/h3&gt;

&lt;p&gt;The overarching strategy is to use the CMS for instant, estimated updates on the top k videos. In the background, run more time-intensive calculations with MapReduce to achieve an accurate top k. Periodically, the Count-Min estimates are refreshed with the precise calculations from MapReduce.&lt;/p&gt;

&lt;h4&gt;
  
  
  Workflow
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Real-time Updates:&lt;/strong&gt; Utilize Count-Min Sketch for immediate top k video estimates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch Processing:&lt;/strong&gt; Periodically employ MapReduce for precise counts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refinement:&lt;/strong&gt; Refresh the CMS using the exact MapReduce values.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is the best of both worlds: immediate insights with gradually improved precision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Approximation algorithms, like the CMS, are effective for managing vast amounts of data without excessive storage requirements. If accuracy matters, they can be supplemented with slower, exact calculations that provide precise counts at intervals. &lt;/p&gt;

&lt;h3&gt;
  
  
  Code
&lt;/h3&gt;

&lt;p&gt;You can view a simple implementation of a CMS in this &lt;a href="https://gist.github.com/VeerpalBrar/9fb5cb9b0963a1396f4e961f6be69922"&gt;github gist&lt;/a&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Source
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://redis.com/blog/count-min-sketch-the-art-and-science-of-estimating-stuff/"&gt;Count-Min Sketch: The Art and Science of Estimating Stuff&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/big-data-with-sketchy-structures-part-1-the-count-min-sketch-b73fb3a33e2a#:~:text=Properties%20of%20Count%2DMin%20Sketch&amp;amp;text=We%20increment%20some%20counters%2C%20but,in%20both%20time%20and%20space."&gt;Big Data with Sketchy Structures, Part 1 — the Count-Min Sketch&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>systemdesign</category>
      <category>datastructures</category>
    </item>
    <item>
      <title>Connecting applications in Minikube</title>
      <dc:creator>Veerpal</dc:creator>
      <pubDate>Fri, 09 Jun 2023 21:55:03 +0000</pubDate>
      <link>https://dev.to/veerpalb/connecting-applications-in-minikube-4h70</link>
      <guid>https://dev.to/veerpalb/connecting-applications-in-minikube-4h70</guid>
      <description>&lt;p&gt;Over the past few months, I've been learning about Kubernetes through a side project. As I work with Minikube to run a local cluster with multiple services, I find myself just scratching the surface of Kubernetes. In this blog post, I aim to document my current understanding of the various ways applications in Minikube can connect to each other, the host machine, and the outside world.&lt;/p&gt;

&lt;p&gt;First, here is the setup I'm currently working with: I am using Minikube to create a cluster on my local machine. My cluster runs different services which need to communicate with one another. Some of these services talk to a database running on my host machine, outside of the cluster. Finally, some of these services expose HTTP ports outside the cluster that a "user" can make API requests to. &lt;/p&gt;

&lt;h3&gt;
  
  
  Connecting to the host machine's database
&lt;/h3&gt;

&lt;p&gt;Within a Kubernetes cluster, services are isolated from the external environment. My database resides on my laptop, outside of my cluster, and I needed my services to connect to it. Thankfully, Minikube offers a convenient solution by adding a &lt;code&gt;host.minikube.internal&lt;/code&gt; hostname entry to the &lt;code&gt;/etc/hosts&lt;/code&gt; file. This allows services to access the host's IP address and establish a connection with the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; &amp;gt; minikube ssh
Last login: Sat Mar 4 00:43:49 2023 from 192.168.49.1
docker@minikube:~$ cat /etc/hosts
127.0.0.1   localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
192.168.49.2    minikube
192.168.65.2    host.minikube.internal
192.168.49.2    control-plane.minikube.internal
docker@minikube:~$
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Accessing Other Services:
&lt;/h3&gt;

&lt;p&gt;With Kubernetes, each pod in a cluster has a unique IP and can connect to other pods without extra network configuration. This extends to services as well which are an abstraction layer around a group of pods.&lt;/p&gt;

&lt;p&gt;To access a service within the cluster, you can use &lt;a href="https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster-services/#manually-constructing-apiserver-proxy-urls"&gt;the service name and port&lt;/a&gt;. For example, if there's an authentication service named auth running on port 5000, other services can connect to it using &lt;code&gt;http://auth:5000&lt;/code&gt;. It's important to note that this URL is not exposed outside the cluster and is limited to inter-cluster communication.&lt;/p&gt;

&lt;h3&gt;
  
  
  Utilizing Ingress for Gateway Applications:
&lt;/h3&gt;

&lt;p&gt;Ingress is a powerful tool for exposing HTTP and HTTPS routes externally in a Kubernetes cluster. By defining routing rules, Ingress allows external requests to be directed to different applications within the cluster. It's worth mentioning that Ingress supports only HTTP and HTTPS, while other protocols and ports require alternative services such as NodePort or LoadBalancer. (I only used ClusterIP services so far and so omit discussion about NodePort or LoadBalancer services from this post.)&lt;/p&gt;

&lt;p&gt;To set up Ingress, you define URLs to be exposed and specify the service within the cluster that each URL should route to. Consider the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gateway-ingress
spec:
  rules:
    - host: my-public-url.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service: 
                name: gateway
                port:
                  number: 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above configuration, requests made to &lt;code&gt;my-public-url.com&lt;/code&gt; will be automatically routed to the gateway service by Ingress.&lt;/p&gt;

&lt;p&gt;When working with Minikube, running &lt;code&gt;minikube tunnel&lt;/code&gt; is essential to enable external access. Once it is running, accessing &lt;code&gt;my-public-url.com&lt;/code&gt; in a browser will route the request to the cluster running on your computer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion:
&lt;/h3&gt;

&lt;p&gt;In this blog post, I've shared my learnings from working with Minikube and exploring connectivity in Kubernetes. While I've only scratched the surface, I hope this article provides valuable insights. As I continue to learn and delve deeper into Kubernetes, I may update this post with new insights in the future.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/narasimha1997/communication-between-microservices-in-a-kubernetes-cluster-1n41"&gt;Communication between Microservices in a Kubernetes cluster&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster-services/"&gt;Access Services Running on Clusters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/tasks/access-application-cluster/ingress-minikube/"&gt;Set up Ingress on Minikube with the NGINX Ingress Controller&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@zhaoyi0113/kubernetes-how-does-service-network-work-in-the-cluster-d235b69ff536"&gt;Kubernetes — How does service network work in the cluster&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubebyexample.com/learning-paths/application-development-kubernetes/lesson-3-networking-kubernetes/exposing"&gt;Exposing Applications for Internal Access&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>minikube</category>
      <category>kubernetes</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Using a bash function to push a docker image</title>
      <dc:creator>Veerpal</dc:creator>
      <pubDate>Fri, 09 Jun 2023 21:53:42 +0000</pubDate>
      <link>https://dev.to/veerpalb/using-a-bash-function-to-push-a-docker-image-320e</link>
      <guid>https://dev.to/veerpalb/using-a-bash-function-to-push-a-docker-image-320e</guid>
      <description>&lt;p&gt;I've been learning a bit about docker and found myself repeating the same commands over and over again to push a docker image. I decided to see if I could create an alias for multiple commands in bash. &lt;/p&gt;

&lt;p&gt;A quick google search shows that you can define functions in your &lt;code&gt;.bashrc&lt;/code&gt; to run multiple commands at once. In the end, this is what I came up with.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function docker_push {
  LINE=$(docker build . 2&amp;gt;&amp;amp;1 | grep "writing image sha256")
  IMAGE_SHA=$(echo "$LINE" | awk '{print substr($0,26,10)}')
  docker tag "$IMAGE_SHA" "$1"
  docker push "$1"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you can run &lt;code&gt;docker_push username/repo:version&lt;/code&gt; to push your docker image. &lt;/p&gt;

&lt;p&gt;If your bash knowledge is as rusty as mine, here is a quick breakdown of how &lt;code&gt;docker_push&lt;/code&gt; works. &lt;/p&gt;

&lt;h2&gt;
  
  
  Redirect docker build output to grep
&lt;/h2&gt;

&lt;p&gt;First off, I knew I wanted to grep the output of &lt;code&gt;docker build .&lt;/code&gt; for the docker image SHA256 code. I tried &lt;code&gt;docker build . | grep "writing image sha256"&lt;/code&gt; but that produced no output. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://forums.docker.com/t/capture-ouput-of-docker-build-into-a-log-file/123178/2"&gt;Then I realized that docker build outputs to stderr, not stdout.&lt;/a&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Bash automatically provides &lt;a href="https://catonmat.net/bash-one-liners-explained-part-three#:~:text=When%20bash%20starts%20it%20opens,them%20and%20read%20from%20them."&gt;3 types of file descriptors.&lt;/a&gt; There is stdout (file descriptor 1), stderr (file descriptor 2), and stdin (file descriptor 0). Commands read from stdin and then output to stdout or stderr. &lt;br&gt;
When we use &lt;code&gt;|&lt;/code&gt; in bash, we are piping the stdout of the first command as the stdin of the second command. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Therefore I used &lt;code&gt;2&amp;gt;&amp;amp;1&lt;/code&gt; to redirect the stderr of &lt;code&gt;docker build&lt;/code&gt; command to the stdout file descriptor instead. Then I could use &lt;code&gt;|&lt;/code&gt; to redirect the stdout of &lt;code&gt;docker build .&lt;/code&gt; to the stdin of the grep command. &lt;/p&gt;
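&lt;p&gt;A quick way to see the difference, using a throwaway function that writes only to stderr:&lt;/p&gt;

```shell
err_demo() { echo "oops" 1>&2; }     # writes only to stderr

err_demo | grep -c "oops"            # grep sees nothing and prints 0 (stderr bypasses the pipe)
err_demo 2>&1 | grep -c "oops"       # grep prints 1: stderr was merged into stdout first
```

&lt;p&gt;The same thing happens with &lt;code&gt;docker build&lt;/code&gt;: without &lt;code&gt;2&amp;gt;&amp;amp;1&lt;/code&gt;, its output never reaches grep.&lt;/p&gt;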

&lt;p&gt;This let me grep for the line with the SHA256 code. I save the output of grep into a variable for later use. This is done with &lt;code&gt;MY_VAR=$(COMMAND)&lt;/code&gt; syntax, where the result of &lt;code&gt;COMMAND&lt;/code&gt; is saved to &lt;code&gt;MY_VAR&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;For reference, the value of &lt;code&gt;LINE&lt;/code&gt; is something like &lt;code&gt;#11 writing image sha256:ee19794e19c05bfab071c3e3593379a20ae9b59cf0dd47ac0c39274e0333e6b2 done&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Extracting the SHA256 code from the grep output
&lt;/h2&gt;

&lt;p&gt;Next, I used &lt;code&gt;awk&lt;/code&gt; to get the substring of the &lt;code&gt;LINE&lt;/code&gt; that contains the beginning of the SHA256 code. &lt;/p&gt;

&lt;p&gt;Since I know that &lt;code&gt;LINE&lt;/code&gt; always starts with &lt;code&gt;#11 writing image sha256:&lt;/code&gt;, I decided to get the substring starting at character 26 and take the next 10 characters, which are the start of the SHA256 code. I did this with &lt;code&gt;awk '{print substr($0,26,10)}'&lt;/code&gt; (full credit: &lt;a href="https://stackoverflow.com/questions/24427009/is-there-a-cleaner-way-of-getting-the-last-n-characters-of-every-line"&gt;stackoverflow&lt;/a&gt;).&lt;/p&gt;
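&lt;p&gt;For example, running the extraction on a sample line:&lt;/p&gt;

```shell
LINE='#11 writing image sha256:ee19794e19c05bfab071c3e3593379a20ae9b59cf0dd47ac0c39274e0333e6b2 done'
# Characters 1-25 are "#11 writing image sha256:", so the hash starts at character 26
echo "$LINE" | awk '{print substr($0,26,10)}'   # prints ee19794e19
```

&lt;p&gt;Ten characters of the hash are enough for &lt;code&gt;docker tag&lt;/code&gt; to identify the image.&lt;/p&gt;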

&lt;p&gt;Again, awk reads from &lt;code&gt;stdin&lt;/code&gt;, so I used &lt;code&gt;echo $LINE&lt;/code&gt; to get the value of &lt;code&gt;LINE&lt;/code&gt; and redirected that to the stdin of &lt;code&gt;awk&lt;/code&gt;. I save the result in &lt;code&gt;$IMAGE_SHA&lt;/code&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Getting function arguments
&lt;/h2&gt;

&lt;p&gt;Now that I have the image SHA256 code, I can pass that as an argument to &lt;code&gt;docker tag&lt;/code&gt;. The &lt;code&gt;docker tag&lt;/code&gt; command needs the SHA256 code and the repo tag. Since the repo tag value changes based on which repo you are working with, I decided to pass that in as an argument to &lt;code&gt;docker_push&lt;/code&gt;. Then I can use &lt;code&gt;$1&lt;/code&gt; to reference the first argument passed to my function. &lt;/p&gt;

&lt;p&gt;So if I call &lt;code&gt;docker_push username/repo:version&lt;/code&gt; then the value of &lt;code&gt;$1&lt;/code&gt; is "username/repo:version". &lt;/p&gt;

&lt;h3&gt;
  
  
  Sources:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/7131670/make-a-bash-alias-that-takes-a-parameter"&gt;Make a bash alias that takes a function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/5955577/automatically-capture-output-of-last-command-into-a-variable-using-bash"&gt;Capture output of a command in a variable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://unix.stackexchange.com/questions/400038/send-stderr-to-stdout-for-purposes-of-grep"&gt;Redirect stderr to stdout&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://catonmat.net/bash-one-liners-explained-part-three#:~:text=When%20bash%20starts%20it%20opens,them%20and%20read%20from%20them."&gt;Bash One-Liners Explained, Part III: All about redirection&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>docker</category>
      <category>bash</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Dependency Management With Bundler</title>
      <dc:creator>Veerpal</dc:creator>
      <pubDate>Fri, 09 Jun 2023 21:51:00 +0000</pubDate>
      <link>https://dev.to/veerpalb/dependency-management-with-bundler-34hm</link>
      <guid>https://dev.to/veerpalb/dependency-management-with-bundler-34hm</guid>
      <description>&lt;p&gt;The &lt;code&gt;venv&lt;/code&gt; module in python isolates packages of one python project from another project. I remember trying to install flask and running into dependency conflicts until I learned about &lt;code&gt;venv&lt;/code&gt;. Recently, I started wondering why I don't run into the same issues when working with rails. This led me down the rabbit hole of learning about bundler and dependency isolation. &lt;/p&gt;

&lt;h3&gt;
  
  
  What is bundler?
&lt;/h3&gt;

&lt;p&gt;Bundler is a popular ruby gem used to install project dependencies instead of installing each gem individually via &lt;code&gt;gem install&lt;/code&gt;. An application defines a &lt;code&gt;Gemfile&lt;/code&gt; with all the project's gem dependencies. Then &lt;code&gt;bundle install&lt;/code&gt; installs each of the gems, resolving any dependency conflicts along the way. For example, assume the application depends on &lt;code&gt;gem_a&lt;/code&gt; and &lt;code&gt;gem_b&lt;/code&gt;, and requires &lt;code&gt;gem_a&lt;/code&gt; to be version 3 or higher. &lt;code&gt;gem_b&lt;/code&gt; also depends on &lt;code&gt;gem_a&lt;/code&gt;, but requires version 4 or higher. Bundler will install version 4, since that satisfies all dependencies. Bundler then creates a &lt;code&gt;Gemfile.lock&lt;/code&gt; file listing all the installed gems and their versions, which makes it easy for another developer to install the same dependencies on their computer. &lt;/p&gt;
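&lt;p&gt;A hypothetical &lt;code&gt;Gemfile&lt;/code&gt; for that scenario might look like this (gem names illustrative):&lt;/p&gt;

```ruby
# Gemfile
source "https://rubygems.org"

gem "gem_a", ">= 3"  # the application itself needs gem_a version 3 or higher
gem "gem_b"          # gem_b's own gemspec requires gem_a >= 4
```

&lt;p&gt;Running &lt;code&gt;bundle install&lt;/code&gt; against this file would resolve &lt;code&gt;gem_a&lt;/code&gt; to a version satisfying both constraints and record it in &lt;code&gt;Gemfile.lock&lt;/code&gt;.&lt;/p&gt;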

&lt;h3&gt;
  
  
  What is bundle exec?
&lt;/h3&gt;

&lt;p&gt;So how does that offer dependency isolation? Well, it is common to run rails with &lt;code&gt;bundle exec&lt;/code&gt; (i.e. &lt;code&gt;bundle exec rspec &amp;lt;path/to/file&amp;gt;&lt;/code&gt;). Running &lt;code&gt;bundle exec&lt;/code&gt; ensures all the gems specified in the &lt;code&gt;Gemfile&lt;/code&gt; are automatically available to the ruby application via &lt;code&gt;require&lt;/code&gt;. Moreover, it ensures that only those gems are available. So if you have many versions of a gem installed, only the version specified in the &lt;code&gt;Gemfile&lt;/code&gt; will be available to the application. &lt;/p&gt;

&lt;p&gt;For example, &lt;code&gt;require 'json'&lt;/code&gt; will always use the latest version of &lt;code&gt;json&lt;/code&gt; installed on the computer. So if another application installed a higher version of the gem, that version may be loaded instead of the version you intended. &lt;/p&gt;

&lt;p&gt;With &lt;code&gt;bundle exec&lt;/code&gt;, only the versions specified in the &lt;code&gt;Gemfile&lt;/code&gt; will be available, which ensures the correct version is imported by &lt;code&gt;require&lt;/code&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Ruby load path
&lt;/h3&gt;

&lt;p&gt;So, how exactly does &lt;code&gt;bundle exec&lt;/code&gt; ensure the correct version is used by the application? &lt;/p&gt;

&lt;p&gt;RubyGems uses a global variable called &lt;code&gt;$LOAD_PATH&lt;/code&gt;, which stores the paths to the gems installed on a computer. &lt;code&gt;require&lt;/code&gt; uses &lt;code&gt;$LOAD_PATH&lt;/code&gt; to find a gem and load it. By default, &lt;code&gt;$LOAD_PATH&lt;/code&gt; contains the path to the latest version of each gem. &lt;/p&gt;
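&lt;p&gt;You can see for yourself that &lt;code&gt;$LOAD_PATH&lt;/code&gt; is just an ordinary array of directory paths (the &lt;code&gt;/tmp&lt;/code&gt; path below is made up for illustration):&lt;/p&gt;

```ruby
# $LOAD_PATH (alias $:) is a plain Ruby array of directories.
puts $LOAD_PATH.is_a?(Array) # => true

# require walks these directories in order, so prepending a directory
# makes its files win over any later entries.
$LOAD_PATH.unshift("/tmp/my_gems/lib")
puts $LOAD_PATH.first # => /tmp/my_gems/lib
```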

&lt;p&gt;However, &lt;code&gt;bundle exec&lt;/code&gt; &lt;a href="https://github.com/rubygems/rubygems/blob/master/bundler/lib/bundler/runtime.rb#L16"&gt;overrides the &lt;code&gt;$LOAD_PATH&lt;/code&gt;&lt;/a&gt; to contain the paths to the gems in the &lt;code&gt;Gemfile&lt;/code&gt; (at the versions specified there), and only those gems. This ensures that the correct version of each gem is always used, regardless of which other versions may be installed on the computer. &lt;/p&gt;

&lt;h3&gt;
  
  
  Testing this in practice
&lt;/h3&gt;

&lt;p&gt;You can see this in action by running code that requires the JSON gem and then prints the load path. It also converts a hash to JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;require 'json'
pp $LOAD_PATH

print JSON.generate({"key"=&amp;gt;"http://www.example.com/test"}, escape_slash: true)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without Bundler, it loads the latest json gem I have installed (2.6.3). Notice that this version of the json gem escapes the slashes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
 % ruby json_with_escape.rb
["/Users/veerpalbrar/.rvm/gems/ruby-2.7.2/gems/json-2.6.3/lib",
 "/Users/veerpalbrar/.rvm/gems/ruby-2.7.2/extensions/x86_64-darwin-21/2.7.0/json-2.6.3",
 "/Users/veerpalbrar/.rvm/rubies/ruby-2.7.2/lib/ruby/site_ruby/2.7.0",
...]

{"key":"http:\/\/www.example.com\/test"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I create a Gemfile specifying version 2.3.1 and run the Ruby file with &lt;code&gt;bundle exec&lt;/code&gt;, you can see that version 2.3.1 is listed in the &lt;code&gt;$LOAD_PATH&lt;/code&gt;. You can also see that this version of the gem doesn't escape the slashes in the URL.&lt;br&gt;
&lt;/p&gt;
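&lt;p&gt;The post doesn't show the Gemfile itself; a minimal one pinning the version would look something like this (a sketch):&lt;/p&gt;

```ruby
# Gemfile (sketch)
source "https://rubygems.org"

gem "json", "2.3.1" # pin the exact version used in the experiment
```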

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; % bundle exec ruby json_with_escape.rb
["/Users/veerpalbrar/.rvm/gems/ruby-2.7.2/gems/bundler-2.3.19/lib",
 "/Users/veerpalbrar/.rvm/gems/ruby-2.7.2/gems/json-2.3.1/lib",
 "/Users/veerpalbrar/.rvm/gems/ruby-2.7.2/extensions/x86_64-darwin-21/2.7.0/json-2.3.1",
 "/Users/veerpalbrar/.rvm/rubies/ruby-2.7.2/lib/ruby/site_ruby/2.7.0",
...]
{"key":"http://www.example.com/test"}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why your application needs to specify its dependencies. If a gem is updated, you want the application to keep using the older version rather than break existing behaviour. &lt;code&gt;bundler&lt;/code&gt; is one tool you can use for this dependency management. &lt;/p&gt;
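&lt;p&gt;As an aside, you don't have to pin an exact version to get this protection: the pessimistic version operator allows patch releases while blocking breaking upgrades. A quick check with &lt;code&gt;Gem::Requirement&lt;/code&gt;:&lt;/p&gt;

```ruby
require "rubygems"

# "~> 2.3.1" allows any 2.3.x at or above 2.3.1, but not 2.4 or 3.0.
requirement = Gem::Requirement.new("~> 2.3.1")
puts requirement.satisfied_by?(Gem::Version.new("2.3.9")) # => true
puts requirement.satisfied_by?(Gem::Version.new("2.4.0")) # => false
```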

&lt;h3&gt;
  
  
  Sources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=""&gt;Bundler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.brianstorti.com/understanding-bundler-setup-process/"&gt;Understanding Bundler Setup Process&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@aayushsharda/why-to-use-load-path-in-ruby-ce971bc1d864#:~:text=%24LOAD_PATH%20is%20used%20for%20the,the%20dependencies%20in%20the%20project"&gt;Why to use $LOAD_PATH in ruby&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://webapps-for-beginners.rubymonstas.org/libraries/load_path.html"&gt;Ruby load_path&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ruby</category>
      <category>bundler</category>
      <category>rails</category>
    </item>
    <item>
      <title>Database updates using a quorum</title>
      <dc:creator>Veerpal</dc:creator>
      <pubDate>Wed, 11 May 2022 19:47:26 +0000</pubDate>
      <link>https://dev.to/veerpalb/database-updates-using-a-quorum-90a</link>
      <guid>https://dev.to/veerpalb/database-updates-using-a-quorum-90a</guid>
      <description>&lt;h3&gt;
  
  
  Problem Statement
&lt;/h3&gt;

&lt;p&gt;In a distributed system, you want many replicas of your database to ensure that data is never lost. The challenge with database replicas is keeping the data consistent across them: if you update the data in one database, all the replicas should also get updated. &lt;/p&gt;

&lt;p&gt;One approach is to update every replica on each write, but this can make your system unreliable. If even one replica is unavailable, the system cannot write to the database, and the replicas fall out of sync. The more database replicas there are, the more likely it is that some replica will be unavailable. &lt;/p&gt;

&lt;p&gt;One solution to the database consistency problem is to use a quorum. &lt;/p&gt;

&lt;h3&gt;
  
  
  What is a quorum?
&lt;/h3&gt;

&lt;p&gt;A quorum is the minimum number of nodes that must perform an operation for it to be considered a success. Usually, the quorum is a majority of the nodes. By not requiring all nodes to accept an operation, we make the system more fault tolerant: reads and writes can continue as long as most of the replicas are available. This is reliable because it's unlikely that many replicas will be unavailable at the same time. &lt;/p&gt;
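&lt;p&gt;A majority quorum is easy to compute, and it has the property (used later in this post) that any two majorities must share at least one node. A sketch:&lt;/p&gt;

```ruby
# Majority quorum size for n replicas: floor(n/2) + 1 nodes.
def quorum_size(n)
  n / 2 + 1 # integer division
end

puts quorum_size(5) # => 3
puts quorum_size(4) # => 3

# Two quorums together always exceed n, so they must overlap somewhere.
puts quorum_size(5) * 2 > 5 # => true
```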

&lt;h3&gt;
  
  
  Example execution of a write operation
&lt;/h3&gt;

&lt;p&gt;Consider the case where we want to update a row in our database. We need a majority of the replicas to agree to the update for it to be considered successful.&lt;/p&gt;

&lt;p&gt;If we have 5 replicas (&lt;code&gt;N1&lt;/code&gt;, &lt;code&gt;N2&lt;/code&gt;, &lt;code&gt;N3&lt;/code&gt;, &lt;code&gt;N4&lt;/code&gt;, &lt;code&gt;N5&lt;/code&gt;), we push the update to all of them. Three replicas need to respond and confirm the update succeeded to form a quorum. For example, if &lt;code&gt;N1&lt;/code&gt;, &lt;code&gt;N3&lt;/code&gt;, and &lt;code&gt;N4&lt;/code&gt; respond to the update request, we have formed a quorum and can tell the client the write was successful without waiting for a response from &lt;code&gt;N2&lt;/code&gt; and &lt;code&gt;N5&lt;/code&gt;. Note that &lt;code&gt;N2&lt;/code&gt; and &lt;code&gt;N5&lt;/code&gt; will still process the update if they are available. &lt;/p&gt;

&lt;p&gt;You can see a simple example of this below. In &lt;code&gt;wait_for_result&lt;/code&gt;, we wait for responses from the different "nodes". Once we have enough responses to form a quorum, we return and consider the write successful. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Aside: I use threads and the &lt;code&gt;sleep&lt;/code&gt; function to represent how nodes take varying amounts of time to respond. I also kill threads early to mimic how some replicas can be unavailable and miss the update.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Quorum
  attr_reader :nodes

  def initialize(nodes)
    @nodes = nodes
  end

  def write(key, value)
    wait_for_result(:write, key, value, Time.now)
  end

  private

  def quorum_size
    # A majority: floor(n/2) + 1 (e.g. 3 of 5, 3 of 4). Note that
    # (n / 2.0).ceil is only a majority when n is odd.
    @quorum_size ||= @nodes.length / 2 + 1
  end

  def wait_for_result(action, *args)
    responses = []
    tasklist = []

    # Set the threads going
    puts "STARTING #: #{action} #{args}"
    nodes.each do |node|
      task = Thread.new do
        sleep(rand(3)) # mimic the variable response times from the network
        result = node.send(action, *args)
        responses.push(result)
      end
      tasklist &amp;lt;&amp;lt; task
    end

    # Wait for quorum to be formed
    sleep 0.1 while responses.length &amp;lt; quorum_size

    # thread clean up
    tasklist.each { |task|
      task.kill if task.alive?
    }

    puts "FINISHED #: #{action} #{args}"
    responses
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even if some nodes are unavailable, the remaining nodes successfully process the update: a quorum is formed and the operation is considered a success. This can leave the unavailable nodes without the latest data; I'll show how we handle those conflicts later on. &lt;/p&gt;
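&lt;p&gt;The code above calls &lt;code&gt;node.send(action, *args)&lt;/code&gt; but never shows the nodes themselves. A minimal in-memory replica that is compatible with those calls might look like this (the class and method names are my assumptions, not from the original gist):&lt;/p&gt;

```ruby
# A minimal in-memory replica: one value per key, stored with its write time.
class Node
  def initialize
    @store = {}
  end

  def write(key, value, time)
    @store[key] = { value: value, time: time }
  end

  def read(key)
    @store[key] # nil if this replica never saw the key
  end
end

node = Node.new
node.write(:foo, "bar", Time.now)
puts node.read(:foo)[:value] # => bar
```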

&lt;h3&gt;
  
  
  Example execution of a read operation
&lt;/h3&gt;

&lt;p&gt;Just as we form a quorum for the write operation, we need to form a quorum for reading data. If we were to read from only one replica, we would risk returning outdated data whenever that replica is not up to date. &lt;/p&gt;

&lt;p&gt;Instead, we send the read request to all the replicas and wait for enough responses to form a quorum. If all the replicas in the quorum return the same data, we can assume the data is up to date and return it to the client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class quorum
  attr_reader :nodes

  def initialize(nodes)
    @nodes = nodes
  end

  def read(key)
    results = wait_for_result(:read, key)
    if read_conflicts?(results)
      raise "Conflicting reads"
    end

    puts "No conflicts"
    results.first[:value]
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Conflict Resolution in Reads
&lt;/h4&gt;

&lt;p&gt;Sometimes, the replicas in the quorum may not all have the same data. If one of the replicas was unavailable during a previous update, it will have outdated data. &lt;/p&gt;

&lt;p&gt;That is why we check whether all the replicas return the same result for the read operation. If the results differ, it means that some of the replicas have outdated data. &lt;/p&gt;

&lt;p&gt;In this case, we should return the result of the most recent update. If you look at the code for the write operation, you can see we save a timestamp with each write. We can use the timestamp to determine which replica has the most recent update; that is the result we return to the client. &lt;/p&gt;

&lt;p&gt;Once we resolve a read conflict, we should update all the replicas to ensure they are up to date.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class quorum
  attr_reader :nodes

  def initialize(nodes)
    @nodes = nodes
  end

  def read(key)
    results = wait_for_result(:read, key)
    if read_conflicts?(results)
      puts "Conflicting reads: #{results.map{|r| r ? r[:value] : nil}.uniq}"

      latest_value = latest_value(results)
      wait_for_result(:write, key, latest_value[:value], latest_value[:time])

      return latest_value[:value]
    end

    puts "No conflicts"
    results.first[:value]
  end

  private


  def read_conflicts?(results)
    results.map { |result| result ? result[:value] : nil }.uniq.size &amp;gt; 1
  end

  def latest_value(results)
    results.reduce(nil) do |latest, result|
      if result &amp;amp;&amp;amp; (!latest || result[:time] &amp;gt; latest[:time])
        result
      else
        latest
      end
    end
  end
end

### SAMPLE OUTPUT
STARTING #: write [:foo, "bar", 2022-05-10 15:35:52 -0400]
FINISHED #: write [:foo, "bar", 2022-05-10 15:35:52 -0400]
STARTING #: read [:foo]
FINISHED #: read [:foo]
Conflicting reads: ["bar", nil]
STARTING #: write [:foo, "bar", 2022-05-10 15:35:52 -0400]
FINISHED #: write [:foo, "bar", 2022-05-10 15:35:52 -0400]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Achieving consistency
&lt;/h3&gt;

&lt;p&gt;How can we be certain that one of the read results will be the most recent data? What if all the replicas in the quorum are out of date? Well, remember that a write quorum requires a majority of the replicas, and a read likewise needs responses from a majority. Any two majorities must overlap, so at least one replica in the read quorum was also part of the last write quorum. Thus, we can be certain that at least one replica will return the most recent result.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In conclusion, when you have many database replicas, you need a system to keep them in sync. Using a quorum is one way to provide consistent results while keeping the system reliable and fault tolerant. &lt;/p&gt;

&lt;h3&gt;
  
  
  Code
&lt;/h3&gt;

&lt;p&gt;View the code from this post on &lt;a href="https://gist.github.com/VeerpalBrar/9481931b396d89767e6b4aeca97715ec"&gt;GitHub&lt;/a&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Sources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Educative Grokking the System Design Interview course. &lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=uNxl3BFcKSA"&gt;Distributed Systems 5.2: Quorums by Martin Kleppmann&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>quorum</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Fixing N+1 queries when using validates_associated with has_many</title>
      <dc:creator>Veerpal</dc:creator>
      <pubDate>Tue, 21 Dec 2021 16:56:50 +0000</pubDate>
      <link>https://dev.to/veerpalb/fixing-n1-queries-when-using-validatesassociated-with-hasmany-4194</link>
      <guid>https://dev.to/veerpalb/fixing-n1-queries-when-using-validatesassociated-with-hasmany-4194</guid>
<description>&lt;p&gt;In ActiveRecord, when validating an object, &lt;a href="https://apidock.com/rails/ActiveRecord/Validations/ClassMethods/validates_associated"&gt;&lt;code&gt;validates_associated&lt;/code&gt;&lt;/a&gt; validates any associated objects. Assume that an author has many books: every time the author is validated, &lt;code&gt;validates_associated&lt;/code&gt; also validates the author's books.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Author&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ActiveRecord&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Base&lt;/span&gt;
 &lt;span class="n"&gt;has_many&lt;/span&gt; &lt;span class="ss"&gt;:books&lt;/span&gt;
 &lt;span class="n"&gt;validates_associated&lt;/span&gt; &lt;span class="ss"&gt;:books&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Book&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ActiveRecord&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Base&lt;/span&gt;
 &lt;span class="n"&gt;belongs_to&lt;/span&gt; &lt;span class="ss"&gt;:author&lt;/span&gt;
 &lt;span class="n"&gt;has_one&lt;/span&gt; &lt;span class="ss"&gt;:cover&lt;/span&gt;
 &lt;span class="n"&gt;validates&lt;/span&gt; &lt;span class="ss"&gt;:title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;presence: &lt;/span&gt;&lt;span class="kp"&gt;true&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Author&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last&lt;/span&gt;
&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"New Name"&lt;/span&gt;
&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the associated object validation in the logs. When saving the author, all the author's books are loaded into memory for validation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;TRANSACTION&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;begin&lt;/span&gt; &lt;span class="n"&gt;transaction&lt;/span&gt;

&lt;span class="no"&gt;Book&lt;/span&gt; &lt;span class="no"&gt;Load&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="no"&gt;SELECT&lt;/span&gt; &lt;span class="s2"&gt;"books"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;*&lt;/span&gt; &lt;span class="no"&gt;FROM&lt;/span&gt; &lt;span class="s2"&gt;"books"&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"books"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"author_id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;"author_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="no"&gt;Author&lt;/span&gt; &lt;span class="no"&gt;Update&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="no"&gt;UPDATE&lt;/span&gt; &lt;span class="s2"&gt;"authors"&lt;/span&gt; &lt;span class="no"&gt;SET&lt;/span&gt; &lt;span class="s2"&gt;"name"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"authors"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"New Name"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="no"&gt;TRANSACTION&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.2&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;commit&lt;/span&gt; &lt;span class="n"&gt;transaction&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;validates_associated&lt;/code&gt; is a quick way to ensure that ActiveRecord objects that depend on each other don't become invalid when one of them changes. It may seem like a good idea to add it to all your models and always be confident that they are valid.&lt;/p&gt;

&lt;p&gt;However, it's important not to overuse this method. Assume that a book has a cover. Every time the book updates, the cover also needs to be validated to ensure it has the correct title.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Author&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ActiveRecord&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Base&lt;/span&gt;
 &lt;span class="n"&gt;has_many&lt;/span&gt; &lt;span class="ss"&gt;:books&lt;/span&gt;
 &lt;span class="n"&gt;validates_associated&lt;/span&gt; &lt;span class="ss"&gt;:books&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Book&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ActiveRecord&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Base&lt;/span&gt;
 &lt;span class="n"&gt;belongs_to&lt;/span&gt; &lt;span class="ss"&gt;:author&lt;/span&gt;
 &lt;span class="n"&gt;has_one&lt;/span&gt; &lt;span class="ss"&gt;:cover&lt;/span&gt;
 &lt;span class="n"&gt;validates&lt;/span&gt; &lt;span class="ss"&gt;:title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;presence: &lt;/span&gt;&lt;span class="kp"&gt;true&lt;/span&gt;
 &lt;span class="n"&gt;validates_associated&lt;/span&gt; &lt;span class="ss"&gt;:cover&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Cover&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ActiveRecord&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Base&lt;/span&gt;
 &lt;span class="n"&gt;belongs_to&lt;/span&gt; &lt;span class="ss"&gt;:book&lt;/span&gt;
 &lt;span class="n"&gt;validates_presence_of&lt;/span&gt; &lt;span class="ss"&gt;:book&lt;/span&gt;

 &lt;span class="n"&gt;validate&lt;/span&gt; &lt;span class="ss"&gt;:cover_has_correct_title?&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Author&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last&lt;/span&gt;
&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"New Name"&lt;/span&gt;
&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When saving the author model, all the author's books are still loaded into memory for validation. The cover for each book is &lt;em&gt;also&lt;/em&gt; loaded into memory, one query at a time. This results in N+1 queries to fetch the books and covers from the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;TRANSACTION&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;begin&lt;/span&gt; &lt;span class="n"&gt;transaction&lt;/span&gt;
&lt;span class="no"&gt;Book&lt;/span&gt; &lt;span class="no"&gt;Load&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="no"&gt;SELECT&lt;/span&gt; &lt;span class="s2"&gt;"books"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;*&lt;/span&gt; &lt;span class="no"&gt;FROM&lt;/span&gt; &lt;span class="s2"&gt;"books"&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"books"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"author_id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;"author_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="no"&gt;Cover&lt;/span&gt; &lt;span class="no"&gt;Load&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="no"&gt;SELECT&lt;/span&gt; &lt;span class="s2"&gt;"covers"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;*&lt;/span&gt; &lt;span class="no"&gt;FROM&lt;/span&gt; &lt;span class="s2"&gt;"covers"&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"covers"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"book_id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="no"&gt;LIMIT&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;"book_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"LIMIT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="no"&gt;Cover&lt;/span&gt; &lt;span class="no"&gt;Load&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="no"&gt;SELECT&lt;/span&gt; &lt;span class="s2"&gt;"covers"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;*&lt;/span&gt; &lt;span class="no"&gt;FROM&lt;/span&gt; &lt;span class="s2"&gt;"covers"&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"covers"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"book_id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="no"&gt;LIMIT&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;"book_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"LIMIT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="no"&gt;Cover&lt;/span&gt; &lt;span class="no"&gt;Load&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="no"&gt;SELECT&lt;/span&gt; &lt;span class="s2"&gt;"covers"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;*&lt;/span&gt; &lt;span class="no"&gt;FROM&lt;/span&gt; &lt;span class="s2"&gt;"covers"&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"covers"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"book_id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="no"&gt;LIMIT&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;"book_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"LIMIT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="no"&gt;Author&lt;/span&gt; &lt;span class="no"&gt;Update&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="no"&gt;UPDATE&lt;/span&gt; &lt;span class="s2"&gt;"authors"&lt;/span&gt; &lt;span class="no"&gt;SET&lt;/span&gt; &lt;span class="s2"&gt;"name"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"authors"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"New Name"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="no"&gt;TRANSACTION&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;commit&lt;/span&gt; &lt;span class="n"&gt;transaction&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Book covers need to be validated when a book updates, not when the author's information changes. Adding &lt;code&gt;validates_associated&lt;/code&gt; to a model is a simple change with a real performance cost: multiple extra database calls are now made whenever an author's information changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution 1: Narrow the scope of validation
&lt;/h3&gt;

&lt;p&gt;The first solution to this problem is to narrow the scope of validation. Consider the scenarios where a model can become invalid, then set up the validation to trigger only in those scenarios instead of on every validation check.&lt;/p&gt;

&lt;p&gt;In the author-book-cover example, a cover can only become invalid when the title of its book changes. So the code should validate the cover only when a book's title has changed. This is possible using &lt;code&gt;validates_associated&lt;/code&gt;'s configuration options.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Book&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ActiveRecord&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Base&lt;/span&gt;
 &lt;span class="n"&gt;belongs_to&lt;/span&gt; &lt;span class="ss"&gt;:author&lt;/span&gt;
 &lt;span class="n"&gt;has_one&lt;/span&gt; &lt;span class="ss"&gt;:cover&lt;/span&gt;
 &lt;span class="n"&gt;validates&lt;/span&gt; &lt;span class="ss"&gt;:title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;presence: &lt;/span&gt;&lt;span class="kp"&gt;true&lt;/span&gt;
 &lt;span class="n"&gt;validates_associated&lt;/span&gt; &lt;span class="ss"&gt;:cover&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;if: &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;title_changed?&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this approach, &lt;code&gt;validates_associated&lt;/code&gt; first checks whether the title has changed. If it has, the associated cover is validated. Otherwise, the cover is assumed to still be valid from the last time it was validated.&lt;/p&gt;

&lt;p&gt;Now, if you look at the logs, you can see that the N+1 query does not happen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;TRANSACTION&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;begin&lt;/span&gt; &lt;span class="n"&gt;transaction&lt;/span&gt;
&lt;span class="no"&gt;Book&lt;/span&gt; &lt;span class="no"&gt;Load&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="no"&gt;SELECT&lt;/span&gt; &lt;span class="s2"&gt;"books"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;*&lt;/span&gt; &lt;span class="no"&gt;FROM&lt;/span&gt; &lt;span class="s2"&gt;"books"&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"books"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"author_id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;"author_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="no"&gt;Author&lt;/span&gt; &lt;span class="no"&gt;Update&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="no"&gt;UPDATE&lt;/span&gt; &lt;span class="s2"&gt;"authors"&lt;/span&gt; &lt;span class="no"&gt;SET&lt;/span&gt; &lt;span class="s2"&gt;"name"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"authors"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"New Name"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="no"&gt;TRANSACTION&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.7&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;commit&lt;/span&gt; &lt;span class="n"&gt;transaction&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By being more fine-grained with your validation, you can ensure you do not trigger unnecessary processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution #2
&lt;/h3&gt;

&lt;p&gt;If you need to validate books and covers every time an author is updated, then avoid using &lt;code&gt;validates_associated&lt;/code&gt;. Instead, have the author load both books and covers before running the validation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Author&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ActiveRecord&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Base&lt;/span&gt;
 &lt;span class="n"&gt;has_many&lt;/span&gt; &lt;span class="ss"&gt;:books&lt;/span&gt;
 &lt;span class="n"&gt;validate&lt;/span&gt; &lt;span class="ss"&gt;:books_are_valid&lt;/span&gt;

 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;books_are_valid&lt;/span&gt;
 &lt;span class="n"&gt;books&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;preload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:cover&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;all?&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="ss"&gt;:valid?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Book&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ActiveRecord&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Base&lt;/span&gt;
 &lt;span class="n"&gt;belongs_to&lt;/span&gt; &lt;span class="ss"&gt;:author&lt;/span&gt;
 &lt;span class="n"&gt;has_one&lt;/span&gt; &lt;span class="ss"&gt;:cover&lt;/span&gt;
 &lt;span class="n"&gt;validates&lt;/span&gt; &lt;span class="ss"&gt;:title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;presence: &lt;/span&gt;&lt;span class="kp"&gt;true&lt;/span&gt;
 &lt;span class="n"&gt;validates_associated&lt;/span&gt; &lt;span class="ss"&gt;:cover&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Author&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last&lt;/span&gt;
&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"New Name"&lt;/span&gt;
&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;preload&lt;/code&gt; loads all the covers for all the author's books in one database query. This avoids the N+1 query problem caused by loading the covers for each book one at a time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;TRANSACTION&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;begin&lt;/span&gt; &lt;span class="n"&gt;transaction&lt;/span&gt;

&lt;span class="no"&gt;Book&lt;/span&gt; &lt;span class="no"&gt;Load&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="no"&gt;SELECT&lt;/span&gt; &lt;span class="s2"&gt;"books"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;*&lt;/span&gt; &lt;span class="no"&gt;FROM&lt;/span&gt; &lt;span class="s2"&gt;"books"&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"books"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"author_id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;"author_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="no"&gt;Cover&lt;/span&gt; &lt;span class="no"&gt;Load&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="no"&gt;SELECT&lt;/span&gt; &lt;span class="s2"&gt;"covers"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;*&lt;/span&gt; &lt;span class="no"&gt;FROM&lt;/span&gt; &lt;span class="s2"&gt;"covers"&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"covers"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"book_id"&lt;/span&gt; &lt;span class="no"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;?,&lt;/span&gt; &lt;span class="p"&gt;?,&lt;/span&gt; &lt;span class="sc"&gt;?)&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;"book_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"book_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"book_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="no"&gt;Author&lt;/span&gt; &lt;span class="no"&gt;Update&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="no"&gt;UPDATE&lt;/span&gt; &lt;span class="s2"&gt;"authors"&lt;/span&gt; &lt;span class="no"&gt;SET&lt;/span&gt; &lt;span class="s2"&gt;"name"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"authors"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"New Name"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="no"&gt;TRANSACTION&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;commit&lt;/span&gt; &lt;span class="n"&gt;transaction&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The downside to this approach is that the Author model is aware of the relationship between books and covers, coupling the three models together. That may be a trade-off you are willing to make to avoid calling the database more than necessary.&lt;/p&gt;

&lt;h3&gt;
  
  
  In conclusion
&lt;/h3&gt;

&lt;p&gt;Rails has a lot of "magic" methods that make it easy to add new functionality. However, they can sometimes come with unintended consequences in practice, such as N+1 queries.&lt;/p&gt;

&lt;p&gt;Extra validation may make you feel safer, but it can slow down your code if you are not careful. Add as much validation as you need and nothing more.&lt;/p&gt;

&lt;h4&gt;
  
  
  Code
&lt;/h4&gt;

&lt;p&gt;If you want to view and run the code mentioned in this blog post, you can find the source code in this &lt;a href="https://gist.github.com/VeerpalBrar/2fc3ec1913cabadcaeaec44c96223a40"&gt;GitHub gist&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>activerecord</category>
    </item>
    <item>
      <title>Include, Extend, and Prepend in Ruby</title>
      <dc:creator>Veerpal</dc:creator>
      <pubDate>Sat, 27 Nov 2021 00:14:07 +0000</pubDate>
      <link>https://dev.to/veerpalb/include-extend-and-prepend-in-ruby-3hbm</link>
      <guid>https://dev.to/veerpalb/include-extend-and-prepend-in-ruby-3hbm</guid>
      <description>&lt;p&gt;This month, I took the time to go back to basics and try to understand how &lt;code&gt;include&lt;/code&gt;, &lt;code&gt;extend&lt;/code&gt; and &lt;code&gt;prepend&lt;/code&gt; work in ruby.&lt;/p&gt;

&lt;h3&gt;
  
  
  Modules
&lt;/h3&gt;

&lt;p&gt;Ruby uses modules to share behaviour across classes. A module contains all the logic for the desired behaviour. Any class that would like to use the same behaviour can either &lt;code&gt;include&lt;/code&gt; or &lt;code&gt;extend&lt;/code&gt; the module.&lt;/p&gt;

&lt;p&gt;What is the difference between &lt;code&gt;include&lt;/code&gt; and &lt;code&gt;extend&lt;/code&gt;? When a class &lt;code&gt;include&lt;/code&gt;s a module, it adds the module's methods as &lt;em&gt;instance methods&lt;/em&gt; on the class.&lt;/p&gt;

&lt;p&gt;When a class &lt;code&gt;extend&lt;/code&gt;s a module, it adds the module's methods as &lt;em&gt;class methods&lt;/em&gt; on the class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;A&lt;/span&gt;
 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hello&lt;/span&gt;
 &lt;span class="s2"&gt;"world"&lt;/span&gt;
 &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Foo&lt;/span&gt;
 &lt;span class="kp"&gt;include&lt;/span&gt; &lt;span class="no"&gt;A&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Bar&lt;/span&gt;
 &lt;span class="kp"&gt;extend&lt;/span&gt; &lt;span class="no"&gt;A&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="no"&gt;Foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hello&lt;/span&gt; &lt;span class="c1"&gt;#works&lt;/span&gt;
&lt;span class="no"&gt;Foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hello&lt;/span&gt; &lt;span class="c1"&gt;#error&lt;/span&gt;

&lt;span class="no"&gt;Bar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hello&lt;/span&gt; &lt;span class="c1"&gt;#error&lt;/span&gt;
&lt;span class="no"&gt;Bar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hello&lt;/span&gt; &lt;span class="c1"&gt;#works&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it makes sense for an instance of a class to implement the behaviour, then you would include the module. Then each instance has access to the module methods.&lt;/p&gt;

&lt;p&gt;If the behaviour is not tied to a particular instance, then you can extend the module. Then the methods will be available as class methods.&lt;/p&gt;

&lt;h3&gt;
  
  
  self.included
&lt;/h3&gt;

&lt;p&gt;What if you want some methods to be instance methods and others to be class methods? A common way to implement this is the &lt;code&gt;self.included&lt;/code&gt; callback. Whenever a class includes a module, Ruby runs the module's &lt;code&gt;self.included&lt;/code&gt; callback. Inside &lt;code&gt;self.included&lt;/code&gt;, we can extend the including class with another module.&lt;/p&gt;

&lt;p&gt;To do this, we create a nested module that contains the class methods. The &lt;code&gt;self.included&lt;/code&gt; callback will extend the nested module on every class that includes the main module. The class then has access to the nested module's methods as class methods.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;A&lt;/span&gt;
 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nc"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;included&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;ClassMethods&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;end&lt;/span&gt;

 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hello&lt;/span&gt;
 &lt;span class="s2"&gt;"world"&lt;/span&gt;
 &lt;span class="k"&gt;end&lt;/span&gt;

 &lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;ClassMethods&lt;/span&gt;
 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hi&lt;/span&gt;
 &lt;span class="s2"&gt;"bye"&lt;/span&gt;
 &lt;span class="k"&gt;end&lt;/span&gt;
 &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Foo&lt;/span&gt;
 &lt;span class="kp"&gt;include&lt;/span&gt; &lt;span class="no"&gt;A&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="no"&gt;Foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hello&lt;/span&gt; &lt;span class="c1"&gt;#works&lt;/span&gt;
&lt;span class="no"&gt;Foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hello&lt;/span&gt; &lt;span class="c1"&gt;#error&lt;/span&gt;

&lt;span class="no"&gt;Foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hi&lt;/span&gt; &lt;span class="c1"&gt;#error&lt;/span&gt;
&lt;span class="no"&gt;Foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hi&lt;/span&gt; &lt;span class="c1"&gt;#works&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;self.included&lt;/code&gt; lets us provide both instance and class methods when the module is included.&lt;/p&gt;

&lt;p&gt;Note that this approach only works when the module is included in a class. If we were to &lt;code&gt;extend&lt;/code&gt; the module in this example, then Foo would have &lt;code&gt;hello&lt;/code&gt; as a class method but not &lt;code&gt;hi&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;A&lt;/span&gt;
 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nc"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;included&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;ClassMethods&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;end&lt;/span&gt;

 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hello&lt;/span&gt;
 &lt;span class="s2"&gt;"world"&lt;/span&gt;
 &lt;span class="k"&gt;end&lt;/span&gt;

 &lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;ClassMethods&lt;/span&gt;
 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hi&lt;/span&gt;
 &lt;span class="s2"&gt;"bye"&lt;/span&gt;
 &lt;span class="k"&gt;end&lt;/span&gt;
 &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Foo&lt;/span&gt;
 &lt;span class="kp"&gt;extend&lt;/span&gt; &lt;span class="no"&gt;A&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="no"&gt;Foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hello&lt;/span&gt; &lt;span class="c1"&gt;#error&lt;/span&gt;
&lt;span class="no"&gt;Foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hello&lt;/span&gt; &lt;span class="c1"&gt;#works&lt;/span&gt;

&lt;span class="no"&gt;Foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hi&lt;/span&gt; &lt;span class="c1"&gt;#error&lt;/span&gt;
&lt;span class="no"&gt;Foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hi&lt;/span&gt; &lt;span class="c1"&gt;#error&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Ancestor chain
&lt;/h4&gt;

&lt;p&gt;So what's actually happening when you include or extend a module?&lt;br&gt;
When you include a module, you add it to the ancestor chain of the class.&lt;br&gt;
The ancestor chain is the lookup order Ruby follows when determining if a method is defined on an object. When you call a method, Ruby checks whether the method is defined on the first item in the ancestor chain (the class). If it is not, Ruby checks the next item in the ancestor chain, and so on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;A&lt;/span&gt;
 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hello&lt;/span&gt;
 &lt;span class="s2"&gt;"world"&lt;/span&gt;
 &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Foo&lt;/span&gt;
 &lt;span class="kp"&gt;include&lt;/span&gt; &lt;span class="no"&gt;A&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="no"&gt;Foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ancestors&lt;/span&gt; &lt;span class="c1"&gt;# [Foo, A, Object, Kernel, BasicObject]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarly, if you extend a module, you add the module to the ancestor chain of the singleton class. If you're unfamiliar with singleton classes, I mention them in my post on &lt;a href="https://dev.to/veerpalb/singleton-methods-in-ruby-29"&gt;singleton methods in ruby&lt;/a&gt;. The main idea is that every object has a hidden singleton class which stores methods &lt;strong&gt;implemented only on that object&lt;/strong&gt;. A class object also has a singleton class that stores methods implemented on that class, i.e. class methods.&lt;/p&gt;

&lt;p&gt;When calling a class method, Ruby looks at the singleton class's ancestor chain to see where the class method is defined. Since class methods are defined on the singleton class, extending a module adds it to the &lt;strong&gt;singleton class's&lt;/strong&gt; ancestor chain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;A&lt;/span&gt;
 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hello&lt;/span&gt;
 &lt;span class="s2"&gt;"world"&lt;/span&gt;
 &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Bar&lt;/span&gt;
 &lt;span class="kp"&gt;extend&lt;/span&gt; &lt;span class="no"&gt;A&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="no"&gt;Bar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ancestors&lt;/span&gt; &lt;span class="c1"&gt;# [Bar, Object, Kernel, BasicObject]&lt;/span&gt;
&lt;span class="no"&gt;Bar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;singleton_class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ancestors&lt;/span&gt; &lt;span class="c1"&gt;# [#&amp;lt;Class:Bar&amp;gt;, A, #&amp;lt;Class:Object&amp;gt;, #&amp;lt;Class:BasicObject&amp;gt;, Class, Module, Object, Kernel, BasicObject]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Prepend
&lt;/h3&gt;

&lt;p&gt;Prepend is like &lt;code&gt;include&lt;/code&gt; in its functionality. The only difference is where in the ancestor chain the module is added. With &lt;code&gt;include&lt;/code&gt;, the module is added &lt;strong&gt;after&lt;/strong&gt; the class in the ancestor chain. With &lt;code&gt;prepend&lt;/code&gt;, the module is added &lt;strong&gt;before&lt;/strong&gt; the class in the ancestor chain. This means Ruby checks the module for an instance method before checking whether the method is defined in the class.&lt;/p&gt;
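&lt;p&gt;You can confirm this with &lt;code&gt;ancestors&lt;/code&gt;. A minimal sketch (the module and class names here are just placeholders):&lt;/p&gt;

```ruby
module A
end

class Foo
  prepend A
end

# With prepend, the module sits before the class in the lookup order.
Foo.ancestors # [A, Foo, Object, Kernel, BasicObject]
```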

&lt;p&gt;This is useful if you want to wrap some logic around your methods.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;Module&lt;/span&gt; &lt;span class="no"&gt;A&lt;/span&gt;
 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hello&lt;/span&gt;
 &lt;span class="n"&gt;put&lt;/span&gt; &lt;span class="s2"&gt;"Log hello in module"&lt;/span&gt;
 &lt;span class="k"&gt;super&lt;/span&gt;
 &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Foo&lt;/span&gt;
 &lt;span class="kp"&gt;include&lt;/span&gt; &lt;span class="no"&gt;A&lt;/span&gt;

 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hello&lt;/span&gt;
 &lt;span class="s2"&gt;"World"&lt;/span&gt;
 &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="no"&gt;Foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hello&lt;/span&gt;
&lt;span class="c1"&gt;# log hello from module&lt;/span&gt;
&lt;span class="c1"&gt;# World&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/@leo_hetsch/ruby-modules-include-vs-prepend-vs-extend-f09837a5b073"&gt;Ruby modules: Include vs Prepend vs Extend&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/17552915/ruby-mixins-extend-and-include"&gt;Ruby mixins: extend and include&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.railstips.org/blog/archives/2009/05/15/include-vs-extend-in-ruby/"&gt;Include vs extend in ruby&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ruby</category>
    </item>
    <item>
      <title>Consistent Hashing (with ruby implementation)</title>
      <dc:creator>Veerpal</dc:creator>
      <pubDate>Tue, 26 Oct 2021 22:28:27 +0000</pubDate>
      <link>https://dev.to/veerpalb/consistent-hashing-with-ruby-implementation-43ii</link>
      <guid>https://dev.to/veerpalb/consistent-hashing-with-ruby-implementation-43ii</guid>
      <description>&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;p&gt;Let's assume you have a web application that's running on multiple servers. To help speed up queries, you add a cache to store data your application accesses often. Before calling the database for a piece of information, you first check if it exists in the cache. As you gain more users, one cache instance becomes too small to provide a significant performance boost. In this case, you add more cache instances so you can cache more information.&lt;/p&gt;

&lt;p&gt;But now, you have to check every cache instance to see if it contains a key. It would be easier if you knew which cache instance has the key beforehand.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hashing
&lt;/h3&gt;

&lt;p&gt;You can use hashing to determine which cache instance to save the key in. Compute &lt;code&gt;hash(key) % N&lt;/code&gt; where &lt;code&gt;hash&lt;/code&gt; is some hashing function and &lt;code&gt;N&lt;/code&gt; is the number of cache instances. This function returns a number between 0 and &lt;code&gt;N - 1&lt;/code&gt;, where each number refers to a cache instance. Thus you can map keys to cache instances. To check if a key exists in the cache, hash the key to get the cache instance and only check if that instance has the key. This strategy enables you to have multiple cache instances while keeping lookup efficient.&lt;/p&gt;

&lt;p&gt;However, what happens if a cache instance crashes? The cache instance will be unavailable, and you will lose the cached data. In future queries, you will need to recache the data in a different cache instance. The only problem is that the value of N in &lt;code&gt;(hash(key) % N)&lt;/code&gt; has changed. Most of your keys will map to a new cache instance. A key that maps to &lt;code&gt;server:A&lt;/code&gt; now maps to &lt;code&gt;server:B&lt;/code&gt; even though only &lt;code&gt;server:C&lt;/code&gt; is unavailable. This increases cache misses across all cache instances even when only one cache instance is unavailable. Ideally, we would only want to remap the keys for the unavailable server.&lt;/p&gt;
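&lt;p&gt;A quick sketch of the problem (the hash function and key names here are arbitrary choices for the demo): when &lt;code&gt;N&lt;/code&gt; drops from 3 to 2, most keys land on a different instance, not just the keys from the lost instance.&lt;/p&gt;

```ruby
require 'digest'

# Map a key to one of n cache instances via hash(key) % n.
def instance_for(key, n)
  Digest::SHA256.digest(key).sum % n
end

keys = (1..100).map { |i| "key-#{i}" }

# Count how many keys change instances when N goes from 3 to 2.
moved = keys.count { |k| instance_for(k, 3) != instance_for(k, 2) }
puts "#{moved} of 100 keys now map to a different cache instance"
```

Roughly two-thirds of keys remap, even though only one instance was lost.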

&lt;h3&gt;
  
  
  Consistent Hashing
&lt;/h3&gt;

&lt;p&gt;Consistent hashing is a strategy to map keys to cache instances but allows cache instances to be added or removed from the list of available instances.&lt;/p&gt;

&lt;p&gt;Consistent hashing works by imagining a circle. Each key and cache instance is assigned a corresponding point on this circle. To determine which cache instance to add a key to, we map the key to the closest cache on the circle going in a clockwise direction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W4quuY4e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://veerpalbrar.github.io/images/blog/consistent-hashing-circle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W4quuY4e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://veerpalbrar.github.io/images/blog/consistent-hashing-circle.png" alt="circle diagram of consistent hashing" width="567" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Programmatically, consistent hashing is simple to implement. We map each of our cache servers to some integer using a hash function. Here, the hash represents the point on the circle for the cache.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;hash_to_node&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;

    &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s2"&gt;"Nodes map to &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="vi"&gt;@hash_to_node&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hash_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="no"&gt;Digest&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;SHA256&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;360&lt;/span&gt;
 &lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code above, we keep track of the mapping of hashes to nodes in &lt;code&gt;hash_to_node&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To determine which cache instance to add a key to, we hash the key, i.e. we find its corresponding point on the circle. Then we find the cache that hashes to a number greater than or equal to the key's hash. This is effectively the cache that is closest to the key's hash going clockwise.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; hashes to &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;node_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;closest_node_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash_to_node&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node_hash&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; maps to  &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;closest_node_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="vi"&gt;@hash_to_node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bsearch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="vi"&gt;@hash_to_node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;
 &lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;closest_node_hash(key)&lt;/code&gt;, we sort the cache instance hashes. Then we do a binary search (&lt;code&gt;bsearch&lt;/code&gt;) to find the first hash with a value greater than or equal to our hashed key.&lt;/p&gt;

&lt;p&gt;If a value is not found, we return the first cache in the list. This emulates a circle since we "wrap" around to the beginning of the list.&lt;/p&gt;

&lt;p&gt;Once we have the hash that is greater than or equal to the key's hash, we get the corresponding cache instance. This is the cache we should add the key to.&lt;/p&gt;

&lt;p&gt;We now have a consistent way to map our keys to cache instances.&lt;/p&gt;
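&lt;p&gt;Putting the pieces together, here is a condensed, standalone version of the snippets above (same method names and the same 360-point circle; the logging is dropped and the constructor is my addition):&lt;/p&gt;

```ruby
require "digest"

# Condensed version of the consistent hashing snippets above.
# Hashes land on a circle of 360 points, as in the earlier code.
class ConsistentHash
  def initialize(nodes)
    @hash_to_node = {}
    nodes.each { |node| @hash_to_node[hash_value(node)] = node }
  end

  # Find the cache responsible for a key: the first node hash at or
  # after the key's point on the circle, wrapping around if needed.
  def find_cache(key)
    @hash_to_node[closest_node_hash(hash_value(key))]
  end

  private

  def hash_value(name)
    Digest::SHA256.digest(name).sum % 360
  end

  def closest_node_hash(key)
    @hash_to_node.keys.sort.bsearch { |server| server >= key } ||
      @hash_to_node.keys.sort.first
  end
end
```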

&lt;h3&gt;
  
  
  Adding and Removing Nodes
&lt;/h3&gt;

&lt;p&gt;Now let's test what happens when you add or remove a cache instance. Let's run this code on a set of keys to see what the mapping looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nodes map to {213=&amp;gt;"server:A", 154=&amp;gt;"server:B", 331=&amp;gt;"server:C"}

a hashes to 319
a maps to  server:C

b hashes to 65
b maps to  server:B

z hashes to 284
z maps to  server:C

hello hashes to 165
hello maps to  server:A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the keys are distributed among the three cache instances.&lt;/p&gt;

&lt;p&gt;Now, let's add a node to our list and run it again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nodes map to {213=&amp;gt;"server:A", 154=&amp;gt;"server:B", 331=&amp;gt;"server:C", 301=&amp;gt;"server:B1"}

a hashes to 319
a maps to  server:C

b hashes to 65
b maps to  server:B

z hashes to 284
z maps to  server:B1

hello hashes to 165
hello maps to  server:A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we add a server, only a small subset of keys get remapped to the new instance. Thus, only a small subset of keys will experience a cache miss as they get moved to a new cache. This is because the mapping depends on which node is "closest" to the key. When you add a new server, the closest server does not change for most keys. Thus the mapping for most of the keys remains consistent.&lt;/p&gt;

&lt;p&gt;Now, let's remove &lt;code&gt;server:B&lt;/code&gt; from the list and see what happens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nodes map to {213=&amp;gt;"server:A", 331=&amp;gt;"server:C", 301=&amp;gt;"server:B1"}

a hashes to 319
a maps to  server:C

b hashes to 65
b maps to  server:A

z hashes to 284
z maps to  server:B1

hello hashes to 165
hello maps to  server:A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only keys that mapped to &lt;code&gt;server:B&lt;/code&gt; need to be remapped. All the other keys remain the same as their "closest" server has not changed.&lt;/p&gt;

&lt;p&gt;As you can see, consistent hashing makes scaling our cache instances easier. Cache instances can be added and removed without having to remap all the keys.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As nodes are added and removed, the distribution of the keys can be uneven between the servers. In this case, we can add "fake" nodes which map to an existing server. For example, we can add another node for server A in the list. This will cause some keys to get remapped to server A and even out the distribution of keys.&lt;/p&gt;
&lt;/blockquote&gt;
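&lt;p&gt;A minimal sketch of those "fake" (virtual) nodes: each physical server is hashed onto the circle several times, and every point maps back to the real server. The replica count and the &lt;code&gt;"server-i"&lt;/code&gt; naming below are illustrative choices, not part of the code above:&lt;/p&gt;

```ruby
require "digest"

# Virtual-node sketch: REPLICAS points on the circle per physical
# server. The replica count and naming scheme are illustrative.
REPLICAS = 3

def hash_value(name)
  Digest::SHA256.digest(name).sum % 360
end

def build_ring(servers)
  ring = {}
  servers.each do |server|
    REPLICAS.times { |i| ring[hash_value("#{server}-#{i}")] = server }
  end
  ring
end

def find_server(ring, key)
  hash = hash_value(key)
  node_hash = ring.keys.sort.bsearch { |h| h >= hash } || ring.keys.sort.first
  ring[node_hash] # every virtual point resolves to a real server
end
```

More points per server means the arcs between neighbouring nodes are smaller and more even, so keys spread out more uniformly.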

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;I used caches in this blog post for a practical application of this hashing strategy. However, consistent hashing can be applied anytime you want to divide a set of keys across multiple nodes, for example in peer-to-peer networks or a load balancer. My favorite part of learning about consistent hashing was seeing how a hash table can be modified to work in a more distributed way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code
&lt;/h3&gt;

&lt;p&gt;You can find the complete implementation of the &lt;a href="https://gist.github.com/VeerpalBrar/10293df1299d7a897f5305c3c9ecfbef"&gt;consistent hashing code on Github&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Resources
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/hashing-in-distributed-systems/"&gt;Hashing in distributed systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.toptal.com/big-data/consistent-hashing"&gt;A Guide to Consistent Hashing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Distributed_hash_table"&gt;Wikipedia: Distributed Hash Table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developpaper.com/hash-algorithm-in-distributed-system/"&gt;Hash algorithm in distributed system&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Scaling Applications With Message Queues</title>
      <dc:creator>Veerpal</dc:creator>
      <pubDate>Wed, 29 Sep 2021 22:20:44 +0000</pubDate>
      <link>https://dev.to/veerpalb/scaling-applications-with-message-queues-41dg</link>
      <guid>https://dev.to/veerpalb/scaling-applications-with-message-queues-41dg</guid>
      <description>&lt;p&gt;This month I started looking into system design patterns for scaling and application. I started off by learning about message queues: what are they and why are the useful?&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;In a typical web application, a client sends a request to a server that processes it and returns a response. For example, the client may request a list of products. The server would query the database for the list of products and return the list. As the number of requests increases, one server cannot handle all the requests. Some clients will be unable to connect because the server is unavailable. In this case, you can horizontally scale the application: you buy more servers so that you can handle the increased load.&lt;/p&gt;

&lt;p&gt;Now imagine, some requests are computationally expensive. For example, they need to generate a large report that uses a lot of CPU and takes many seconds to run. While generating reports, the server is unable to process other requests from clients.&lt;/p&gt;

&lt;p&gt;One solution could be to buy even more servers to run your application. This can be expensive and wasteful. Say the report generation requests are more likely at the end of a month. Then for most of the month, you will have extra servers you don't need. The additional servers are only required when there is an increased load from generating reports.&lt;/p&gt;

&lt;h3&gt;
  
  
  Synchronous Vs Asynchronous
&lt;/h3&gt;

&lt;p&gt;We can solve this problem by changing how we think about processing requests. Currently, the client sends a request and then waits for a response from the server. The client is stuck waiting for many seconds while the server generates the report. The client needs a response from the server, but that response does not have to be the final report. Instead, the server can send a response that acknowledges the request for the report without returning the report. Then, it can process the request for the report asynchronously in the background. Once the report is generated, the server can send an email to the client to let them know the report is complete.&lt;/p&gt;

&lt;p&gt;By moving to asynchronous computation, we reduce the response time for the client. Instead of waiting for a response from the server, the client can complete other tasks. From a user's perspective, they clicked a button and got a message that the report is being generated. The user can now do other things on the site while the report is generating.&lt;/p&gt;

&lt;p&gt;By making the report generation asynchronous, the server can respond to more requests. Yet, what happens if the server gets a lot of requests to generate a report? It will try to generate all the reports in the background. The server will be doing too much background work and will slow down or run out of memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Queue
&lt;/h3&gt;

&lt;p&gt;A server should only process one or two reports in the background at a time. If more requests for a report come in, they can be added to a report queue. Once a report is generated, the server can start generating the next report in the queue. This way, all the reports will eventually be generated without overwhelming the server.&lt;/p&gt;

&lt;p&gt;This approach is better, but where is this queue stored? One solution is to store it on the server. However, this could lead to an unequal distribution of report generation requests. A server with a larger queue will take longer to generate reports compared to servers with smaller queues.&lt;/p&gt;

&lt;p&gt;A better solution is to have a shared queue for all the servers. A set of servers can respond to requests and add tasks to the queue. The tasks could be any work we want to offload from the request servers, for example, sending emails or uploading a file to the cloud.&lt;br&gt;
Another set of servers can process background jobs currently in the queue. In this case, the queue would be a persistent data store (database, Redis cache, etc) that all servers can access.&lt;/p&gt;

&lt;p&gt;This idea is known as a task queue (sometimes called a message queue).&lt;/p&gt;
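&lt;p&gt;The idea can be sketched with Ruby's built-in thread-safe &lt;code&gt;Queue&lt;/code&gt; standing in for the shared store (in production this would be Redis, RabbitMQ, etc.; the job names here are made up):&lt;/p&gt;

```ruby
# Task-queue sketch: Ruby's thread-safe Queue stands in for the
# shared persistent store. Job names are made up for illustration.
QUEUE = Queue.new
RESULTS = Queue.new

# Consumer: a worker thread that processes one job at a time.
worker = Thread.new do
  while (job = QUEUE.pop)
    RESULTS.push("processed #{job}")
  end
end

# Producer: the request-handling side just enqueues and moves on.
["report-1", "report-2", "report-3"].each { |job| QUEUE.push(job) }

QUEUE.push(nil) # sentinel telling the worker to stop
worker.join
```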

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.cloudamqp.com%2Fimg%2Fblog%2Fthumb-mq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.cloudamqp.com%2Fimg%2Fblog%2Fthumb-mq.jpg" alt="Producer sends tasks to a queue which are consumed by a consumer"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Task Queues
&lt;/h3&gt;

&lt;p&gt;Task queues enable multiple systems to communicate with each other. One system acts as a producer and adds tasks to the queue. Another system is a consumer: it processes the tasks in the queue and acts on them. In this case, the server handling requests is the producer which adds tasks to the queue. The servers which process the tasks are the consumers.&lt;/p&gt;

&lt;p&gt;A task queue has many benefits. First, a producer and consumer never have to communicate with each other directly. The producers do not make an API call to the consumer to let it know of an event. Producers only need access to the queue. A producer can add a task to the queue even if none of the consumers are online. Once the consumers are back online, they will start processing the tasks in the queue.&lt;/p&gt;

&lt;p&gt;Furthermore, the producers and consumers can scale independently. As the number of tasks increases, you can add more consumers without increasing the number of producers.&lt;/p&gt;

&lt;p&gt;However, one downside to task queues (and asynchronous processes) is that the order of execution is no longer linear. You can't guarantee the order the tasks run in. If some tasks depend on others completing first, the task queue logic becomes more complex.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;As your application grows, offloading certain tasks to a message queue is a great way to scale your application. This blog post only touches the surface of task queues. Message queue software, such as RabbitMQ, has a lot of built-in functionality for managing message queues. It also lets you implement other patterns with your message queue, such as the publisher-subscriber pattern.&lt;/p&gt;

&lt;h4&gt;
  
  
  Resources
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=W4_aGb_MOls" rel="noopener noreferrer"&gt;What is a Message Queue and When should you use Messaging Queue Systems Like RabbitMQ and Kafka&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://shopify.engineering/high-availability-background-jobs" rel="noopener noreferrer"&gt;High Availability by Offloading Work Into the Background&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://highscalability.com/blog/2008/10/8/strategy-flickr-do-the-essential-work-up-front-and-queue-the.html" rel="noopener noreferrer"&gt;Strategy: Flickr - Do The Essential Work Up-Front And Queue The Rest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cloudamqp.com/blog/what-is-message-queuing.html" rel="noopener noreferrer"&gt;What is Message Queueing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>systemdesign</category>
      <category>architecture</category>
      <category>messagequeues</category>
    </item>
    <item>
      <title>Understanding Rspec Best Practices</title>
      <dc:creator>Veerpal</dc:creator>
      <pubDate>Mon, 30 Aug 2021 22:26:07 +0000</pubDate>
      <link>https://dev.to/veerpalb/understanding-rspec-best-practices-2edm</link>
      <guid>https://dev.to/veerpalb/understanding-rspec-best-practices-2edm</guid>
      <description>&lt;p&gt;This past month, I looked at "best practices" for writing RSpec tests. Sites like &lt;a href="https://www.betterspecs.org/"&gt;betterspecs&lt;/a&gt; and the &lt;a href="https://rspec.rubystyle.guide"&gt;RSpec style guide&lt;/a&gt; offer simple rules to follow. Yet, they do not elaborate on why they suggest the practices they do. Therefore, I decided to spend some time better understanding their recommendations.&lt;/p&gt;

&lt;h3&gt;
  
  
  DRY vs DAMP
&lt;/h3&gt;

&lt;p&gt;Both sites mention &lt;code&gt;DRY&lt;/code&gt; (Don't Repeat Yourself) at some point. DRY is a programming principle that aims to reduce duplication in code. Since you are testing one class in many scenarios, you can expect some duplication in the setup and execution of your tests. If you follow DRY, you would move this duplication into &lt;code&gt;before&lt;/code&gt; and &lt;code&gt;let&lt;/code&gt; blocks.&lt;/p&gt;

&lt;p&gt;However, it can be harder to figure out what is being tested because all of the logic is outside of the actual test. This makes it harder to read the code and understand how a class is expected to work. You should aim to make tests readable and easy to understand, even if you duplicate some bits of code. This is sometimes known as DAMP (Descriptive and Meaningful phrases).&lt;/p&gt;

&lt;p&gt;That said, lots of duplication in tests makes them harder to modify. The RSpec style guide suggests "doing everything directly in your &lt;code&gt;it&lt;/code&gt; blocks even if it is duplication and then refactor your tests after you have them working to be a little more DRY".&lt;/p&gt;

&lt;p&gt;The aim is to strike a balance between &lt;code&gt;DAMP&lt;/code&gt; and &lt;code&gt;DRY&lt;/code&gt; and be okay with some duplication to help increase readability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using &lt;code&gt;let&lt;/code&gt; vs &lt;code&gt;before&lt;/code&gt; blocks
&lt;/h3&gt;

&lt;p&gt;Both sites suggest instantiating variables using &lt;code&gt;let&lt;/code&gt; statements instead of inside &lt;code&gt;before&lt;/code&gt; blocks. Code within each &lt;code&gt;before(:each)&lt;/code&gt; block runs before every example block. A variable defined in a &lt;code&gt;before&lt;/code&gt; block is created for each example, even if the test does not reference the variable. Creating a lot of database objects in a &lt;code&gt;before(:each)&lt;/code&gt; block will slow down tests. In comparison, &lt;code&gt;let&lt;/code&gt; is lazy-loaded. A &lt;code&gt;let&lt;/code&gt; object is only created after it is referenced in a test. Each test will only create the objects referenced in the test itself. Thus, you avoid creating unnecessary objects in your tests.&lt;/p&gt;
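&lt;p&gt;The lazy behaviour is easy to demonstrate in plain Ruby. The &lt;code&gt;let&lt;/code&gt; below is a toy illustration of the semantics, not RSpec's actual implementation:&lt;/p&gt;

```ruby
# Toy illustration of let's semantics (NOT RSpec's implementation):
# the block runs only when first referenced, then the value is memoized.
def let(&block)
  evaluated = false
  value = nil
  lambda do
    unless evaluated
      value = block.call
      evaluated = true
    end
    value
  end
end

CREATED = []
user = let { CREATED.push(:user); "a user record" }

# Unlike before(:each), nothing has been created yet.
LAZY_BEFORE_REFERENCE = CREATED.empty?

user.call # first reference runs the block
user.call # memoized: the block does not run again
```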

&lt;p&gt;Avoid using &lt;code&gt;before(:all)&lt;/code&gt; to instantiate data that is used across many tests. It can cause data to leak between tests, leading to flaky or false-positive tests. Each example in RSpec runs in a transaction, and all database changes are rolled back at the end of the test so that each example starts with a clean database. Changes made in a &lt;code&gt;before(:all)&lt;/code&gt; block are not part of that transaction. You can clean them up in an &lt;code&gt;after(:all)&lt;/code&gt; block, but if you forget, the data will persist across all tests and could cause other tests to fail. Database changes made in &lt;code&gt;let&lt;/code&gt; blocks or &lt;code&gt;before(:each)&lt;/code&gt; blocks get rolled back at the end of the example by the database transaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Factories
&lt;/h3&gt;

&lt;p&gt;Both sites advocate for factories over fixtures (&lt;a href="https://github.com/betterspecs/betterspecs/issues/11"&gt;though there is no clear consensus&lt;/a&gt;). With fixtures, test objects are all defined in fixture files with predefined data. Fixtures can be used across tests, but modifying an existing fixture can break tests that depend on it. As a codebase grows, managing fixtures for all the various states of your objects can be difficult. In comparison, factories let you build and configure new objects per test.&lt;/p&gt;

&lt;p&gt;Working with factories can also be overwhelming, especially when you are new to them. I have found a couple of helpful tips that can make working with factories easier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When defining factory defaults, only provide the attributes required to pass validation. All other functionality should be added via traits. Avoid creating associations that are not required by default. That way you don't create database objects that are not required for each test.&lt;/li&gt;
&lt;li&gt;When using factories in a test, provide only the traits required for the test to pass. It clarifies the properties of the object that are required to make the test pass.&lt;/li&gt;
&lt;li&gt;If your test references a default value of a factory, set the default value during object creation. For example, even if the default name for a user is "Bob", you should create your user with &lt;code&gt;build(:user, name: "Bob")&lt;/code&gt;. This indicates that the name is important for the test and makes it explicit where the value of &lt;code&gt;"Bob"&lt;/code&gt; is coming from.&lt;/li&gt;
&lt;li&gt;If you use FactoryBot, try to build your factory objects instead of creating them. When you use &lt;code&gt;create&lt;/code&gt;, it calls the database to instantiate the object and all its associations. &lt;code&gt;build&lt;/code&gt; will set up the attributes but not save them to the database, though it will still call &lt;code&gt;create&lt;/code&gt; on the associations and run validations on those. Finally, if you use &lt;code&gt;build_stubbed&lt;/code&gt;, the object's associations are stubbed out so the database is not called. So, try to build test objects to avoid hitting the DB and help speed up tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Mocking
&lt;/h3&gt;

&lt;p&gt;The RSpec style guide has some guidelines related to mocking objects.&lt;/p&gt;

&lt;p&gt;First, it suggests not stubbing the object you are trying to test. For example, avoid doing &lt;code&gt;allow(object_under_test).to receive(:foo).and_return("bar")&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Tests ensure that your code does what you expect it to. When you stub out parts of the object you are testing, you risk false positive tests. The stubbed code never runs, so even if the test passes, you can't be confident that your code works.&lt;/p&gt;

&lt;p&gt;Sometimes, we want to see what a method returns based on the state of the test object. Thus, we're tempted to stub some of its methods to match the expected state. Instead of stubbing the state of the object, build the object with the desired state using a factory. Likewise, you might want to stub out a method that makes a complicated library call that's hard to test. In that case, either stub out the library call or extract the complicated logic into another class. Then stub out the class in your tests. When you extract the logic into another class, you are now stubbing the collaborator, instead of the object under test.&lt;/p&gt;

&lt;p&gt;Mocking collaborators of the object under test is acceptable. The collaborator has been tested in its own unit tests. You can test that the collaborator is called with the correct arguments but stub the response for faster tests. This way, you rely on the collaborator's interface rather than its implementation.&lt;/p&gt;
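&lt;p&gt;In plain Ruby, the idea looks like this (the class names are made up; in RSpec you would typically use a verifying double and &lt;code&gt;allow&lt;/code&gt; instead of a hand-rolled fake). The object under test receives its collaborator, so a test can swap in a fake without stubbing the object under test itself:&lt;/p&gt;

```ruby
# Hand-rolled sketch of stubbing a collaborator. ReportGenerator and
# FakeMailer are made-up names; RSpec would normally use doubles here.
class ReportGenerator
  def initialize(mailer)
    @mailer = mailer # collaborator is injected, so tests can fake it
  end

  def generate(user)
    report = "report for #{user}"
    @mailer.deliver(report) # the real mailer has its own unit tests
    report
  end
end

# A fake collaborator that records calls instead of sending email.
class FakeMailer
  attr_reader :delivered

  def initialize
    @delivered = []
  end

  def deliver(report)
    @delivered.push(report)
  end
end
```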

&lt;h3&gt;
  
  
  In conclusion
&lt;/h3&gt;

&lt;p&gt;When I started researching best practices, I wanted some tips on writing better tests. In reality, I've realized it's not that clear-cut, and there are many ways of testing an object. I realized that even "best practices" have exceptions. Instead of following rules blindly, it helps to understand the reasoning behind the rules. Then you can confidently know you are using these rules correctly.&lt;/p&gt;

&lt;h4&gt;
  
  
  Resources
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.betterspecs.org"&gt;betterspecs.org&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rspec.rubystyle.guide"&gt;RSpec style guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://books.thoughtbot.com/assets/testing-rails.pdf"&gt;Testing Rails&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://myronmars.to/n/dev-blog/2012/06/thoughts-on-mocking"&gt;Thoughts on Mocking&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rspec</category>
      <category>ruby</category>
    </item>
    <item>
      <title>Tips for debugging in ruby</title>
      <dc:creator>Veerpal</dc:creator>
      <pubDate>Wed, 28 Jul 2021 23:22:30 +0000</pubDate>
      <link>https://dev.to/veerpalb/tips-for-debugging-in-ruby-2hbl</link>
      <guid>https://dev.to/veerpalb/tips-for-debugging-in-ruby-2hbl</guid>
      <description>&lt;p&gt;This month I looked into debugging ruby code. While I usually can figure out the source of bugs, I've been thinking about how to debug code more efficiently. When I debug in ruby, I tend to rely on printing variables to the terminal. If the code is more complex, I step through the code with &lt;code&gt;byebug&lt;/code&gt; or &lt;code&gt;binding.pry&lt;/code&gt;. During this past month, I've been learning techniques that let me level up these skills.&lt;/p&gt;

&lt;h3&gt;
  
  
  Navigating code
&lt;/h3&gt;

&lt;p&gt;I learned a few techniques to navigate the codebase faster and level up both printing and &lt;code&gt;byebug&lt;/code&gt; based debugging.&lt;/p&gt;

&lt;h4&gt;
  
  
  Printing methods
&lt;/h4&gt;

&lt;p&gt;One technique that I already used but is still worth mentioning is to use &lt;code&gt;p&lt;/code&gt; instead of &lt;code&gt;puts&lt;/code&gt;. &lt;code&gt;puts&lt;/code&gt; calls &lt;code&gt;to_s&lt;/code&gt; on the object, which by default returns just the object's class and id. You can override the &lt;code&gt;to_s&lt;/code&gt; method to return detailed information about the object. The other option is to use &lt;code&gt;p&lt;/code&gt;, which calls &lt;code&gt;.inspect&lt;/code&gt; on the object. By default, inspect returns a string with the class, object_id, and instance variables. The output of &lt;code&gt;p&lt;/code&gt; can be difficult to parse if an object has many instance variables. In this case, you can use &lt;code&gt;pp&lt;/code&gt;, which stands for pretty print and makes the output easier to read. &lt;code&gt;pp&lt;/code&gt; also helps format hashes and JSON objects.&lt;/p&gt;
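&lt;p&gt;A quick demonstration of the difference (&lt;code&gt;Point&lt;/code&gt; is a made-up example class):&lt;/p&gt;

```ruby
# puts uses to_s (class and object id by default), while p uses
# inspect (which also shows instance variables). Point is made up.
class Point
  def initialize(x, y)
    @x = x
    @y = y
  end
end

point = Point.new(1, 2)

puts point # prints only the class name and object id
p point    # also shows @x=1, @y=2
pp({ name: "Veerpal", posts: 12 }) # pretty-prints hashes
```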

&lt;h4&gt;
  
  
  Raising errors
&lt;/h4&gt;

&lt;p&gt;Sometimes it can be difficult to find the print statements in the server logs. One option is to prepend print statements with strings like "!!!" and search for them on the server output. Another technique is to raise an exception immediately after the print statements. Then you can find the code faster as you know it happens right before the exception. Raising errors is useful if that section of code runs many times. You can use conditional logic to raise an error in the cases you want to investigate.&lt;/p&gt;

&lt;h4&gt;
  
  
  Freezing
&lt;/h4&gt;

&lt;p&gt;If you want to know when an object is modified, you can &lt;code&gt;freeze&lt;/code&gt; it. Then whenever the object is modified, it will raise an exception. Freezing an object is a faster way to figure out which classes are modifying it.&lt;/p&gt;
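&lt;p&gt;For example:&lt;/p&gt;

```ruby
# Once frozen, any attempt to mutate the object raises FrozenError,
# pointing straight at the code doing the modification.
CONFIG = { retries: 3 }.freeze

begin
  CONFIG[:retries] = 5
rescue FrozenError
  MUTATION_CAUGHT = true
end
```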

&lt;h3&gt;
  
  
  Leveraging Ruby
&lt;/h3&gt;

&lt;p&gt;Methods such as &lt;code&gt;inspect&lt;/code&gt; and &lt;code&gt;pp&lt;/code&gt; are useful but don't always appear in beginner ruby tutorials. I've found that learning more about ruby has given me new tools for debugging. Ruby has many built-in methods whose whole purpose is to make it easier for developers to work with the language.&lt;/p&gt;

&lt;h4&gt;
  
  
  Objects
&lt;/h4&gt;

&lt;p&gt;For example, say you have a method that takes an input (&lt;code&gt;input_obj&lt;/code&gt;) but it's not clear what type of input it is. Normally, I would search the codebase for all the locations where this method is invoked and figure out from each calling method what is passed in. A faster way is to run the code and do &lt;code&gt;p input_obj.class.name&lt;/code&gt;. That way, you know the exact class of the input. Everything in ruby is an object and inherits from the &lt;a href="https://ruby-doc.org/core-3.0.2/Object.html"&gt;&lt;code&gt;Object&lt;/code&gt; class&lt;/a&gt;, which has methods such as &lt;code&gt;methods&lt;/code&gt;, &lt;code&gt;instance_variables&lt;/code&gt;, and &lt;code&gt;respond_to?&lt;/code&gt; that you can use to learn more about method inputs. Granted, you can figure out a lot of this information with &lt;code&gt;inspect&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The Object class also mixes in the &lt;a href="https://ruby-doc.org/core-3.0.2/Kernel.html#method-i-caller"&gt;Kernel module, which has a &lt;code&gt;caller&lt;/code&gt;&lt;/a&gt; method. You can use &lt;code&gt;caller&lt;/code&gt; to get the current call stack. &lt;code&gt;caller&lt;/code&gt; is a faster way to figure out who is calling a method than searching through the entire codebase.&lt;/p&gt;
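&lt;p&gt;For example (&lt;code&gt;Invoice&lt;/code&gt; and &lt;code&gt;mystery_method&lt;/code&gt; are made-up names):&lt;/p&gt;

```ruby
# Interrogating an unknown input at runtime instead of grepping the
# codebase. Invoice and mystery_method are made-up names.
class Invoice
  def initialize(total)
    @total = total
  end
end

def mystery_method(input_obj)
  [input_obj.class.name,           # exact class of the input
   input_obj.instance_variables,   # what state it carries
   input_obj.respond_to?(:upcase), # does it act like a String?
   caller.first]                   # who called us
end

INFO = mystery_method(Invoice.new(100))
```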

&lt;h4&gt;
  
  
  Method
&lt;/h4&gt;

&lt;p&gt;In ruby, even &lt;a href="https://ruby-doc.org/core-3.0.2/Method.html#method-i-source_location"&gt;methods&lt;/a&gt; are objects! You can determine where a method is implemented by calling &lt;code&gt;source_location&lt;/code&gt; on the method:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ClassName.instance_method(:method_name).source_location&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;source_location&lt;/code&gt; is especially useful when the method name is common and is harder to search for in the code. If a method calls &lt;code&gt;super&lt;/code&gt;, you can use &lt;code&gt;super_method&lt;/code&gt; to get the &lt;code&gt;Method&lt;/code&gt; object for the super method:&lt;br&gt;
&lt;code&gt;ClassName.instance_method(:method_name).super_method&lt;/code&gt;&lt;/p&gt;
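&lt;p&gt;For example (&lt;code&gt;Parent&lt;/code&gt; and &lt;code&gt;Child&lt;/code&gt; are made-up classes):&lt;/p&gt;

```ruby
# source_location returns the [file, line] where a Ruby-defined
# method lives; super_method returns the overridden method.
class Parent
  def greet
    "hello from Parent"
  end
end

# Class.new(Parent) builds a subclass of Parent.
Child = Class.new(Parent) do
  def greet
    super + " via Child"
  end
end

LOCATION = Child.instance_method(:greet).source_location       # [file, line]
SUPER_OWNER = Child.instance_method(:greet).super_method.owner # Parent
```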
&lt;h4&gt;
  
  
  Inheritance Hierarchy
&lt;/h4&gt;

&lt;p&gt;Sometimes, the source of bugs is due to objects extending many modules that change their behavior in unexpected ways. You can track when a module is added to an object with &lt;a href="https://ruby-doc.org/core-2.5.0/Module.html#method-i-included"&gt;&lt;code&gt;included&lt;/code&gt;&lt;/a&gt;. You can overwrite &lt;code&gt;included&lt;/code&gt; to print information when a module is included on an object. Use &lt;a href="https://ruby-doc.org/core-2.5.0/Module.html#method-i-method_added"&gt;&lt;code&gt;method_added&lt;/code&gt;&lt;/a&gt; to track when an instance method is added to a module. These methods help track down bugs related to metaprogramming.&lt;/p&gt;
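&lt;p&gt;For example (&lt;code&gt;Auditable&lt;/code&gt; and &lt;code&gt;Account&lt;/code&gt; are made-up names; real debugging code might &lt;code&gt;puts&lt;/code&gt; instead of recording events):&lt;/p&gt;

```ruby
# included fires when the module is mixed into a class;
# method_added fires as each instance method is defined.
EVENTS = []

module Auditable
  def self.included(base)
    EVENTS.push("Auditable included in #{base}")
  end

  def self.method_added(name)
    EVENTS.push("method added: #{name}")
  end

  def audit
    "audited"
  end
end

class Account
  include Auditable
end
```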
&lt;h4&gt;
  
  
  Tracepoint
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://ruby-doc.org/core-2.5.0/TracePoint.html"&gt;&lt;code&gt;Tracepoint&lt;/code&gt;&lt;/a&gt; allows you to trace the call stack for a piece of code. To see all the methods called while a code block run, you could trace the call stack with Tracepoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trace = TracePoint.new(:call) do |tp|
 p [tp.path, tp.lineno, tp.defined_class, tp.method_id]
end

trace.enable
User.some_method
trace.disable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After you create a new TracePoint, you must enable it. When enabled, the TracePoint object will log all method calls until the trace is disabled. When you initialize a new TracePoint, it takes a block that executes for each method call. The example above prints the file the method is located in (&lt;code&gt;tp.path&lt;/code&gt;), the line number (&lt;code&gt;tp.lineno&lt;/code&gt;), the class (&lt;code&gt;tp.defined_class&lt;/code&gt;), and the method (&lt;code&gt;tp.method_id&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The logging from TracePoint is quite verbose, as it will also output the method calls for code in gems. Thus, TracePoint is more useful for getting the general execution path of the code.&lt;/p&gt;

&lt;p&gt;To reduce the output, you can use conditionals to only print in certain cases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TracePoint.trace(:call) do |tp|
 next unless tp.self.is_a?(User) # only print method calls for Users
 # tracing logic
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That way, you can follow the execution path for a particular object and see how it is used.&lt;/p&gt;

&lt;p&gt;TracePoint code is also less intuitive to write. Rather than memorizing it, I save it in a snippet and copy it whenever I want to use it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reading gem source code
&lt;/h3&gt;

&lt;p&gt;Sometimes, the code I'm interested in exists in a gem instead of the application code. Understanding gem code usually requires reading the gem documentation to figure out how the code works. If you cannot find the information in the docs, you have to read the source code. I could read the code on GitHub, but this can be tedious to navigate and search. Instead, you can do &lt;code&gt;bundle open &amp;lt;gem_name&amp;gt;&lt;/code&gt; to open the code for the gem in a text editor. It will open the version specified in the nearest Gemfile. That way, you can use your IDE to search and navigate the gem code. In your application code, you can use &lt;code&gt;source_location&lt;/code&gt; to find the location of a method defined in a gem! You can also use print statements and &lt;code&gt;byebug&lt;/code&gt; to debug the gem source code if needed. When you finish debugging, use &lt;code&gt;gem pristine &amp;lt;gem_name&amp;gt;&lt;/code&gt; to clean up any changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Debugging ruby goes beyond the use of print statements to trace code execution. There is a lot of built-in ruby functionality which can help you more effectively debug your code. As I dig deeper into ruby, I now consider how I can leverage what I learn to debug code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://maximomussini.com/posts/debugging-ruby-libraries/"&gt;Debugging Libraries: Ruby Edition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tenderlovemaking.com/2016/02/05/i-am-a-puts-debuggerer.html"&gt;I am a puts debugger&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.schneems.com/2016/01/25/ruby-debugging-magic-cheat-sheet.html"&gt;Ruby debugging magic cheat sheet&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.appsignal.com/2020/04/01/changing-the-approach-to-debugging-in-ruby-with-tracepoint.html"&gt;Changing the Approach to Debugging in Ruby with TracePoint&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.rubyguides.com/2017/01/spy-on-your-ruby-methods/"&gt;How To Spy on Your Ruby Methods&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ruby</category>
      <category>debugging</category>
      <category>tracepoint</category>
    </item>
  </channel>
</rss>
