<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Amree Zaid</title>
    <description>The latest articles on DEV Community by Amree Zaid (@amree).</description>
    <link>https://dev.to/amree</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F47406%2F11ad0fe8-d2e7-4ca4-ab36-d6e5a55fa112.jpg</url>
      <title>DEV Community: Amree Zaid</title>
      <link>https://dev.to/amree</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amree"/>
    <language>en</language>
    <item>
      <title>Scaling PostgreSQL Performance with Table Partitioning</title>
      <dc:creator>Amree Zaid</dc:creator>
      <pubDate>Fri, 13 Jun 2025 01:25:06 +0000</pubDate>
      <link>https://dev.to/coingecko/scaling-postgresql-performance-with-table-partitioning-136o</link>
      <guid>https://dev.to/coingecko/scaling-postgresql-performance-with-table-partitioning-136o</guid>
      <description>&lt;p&gt;Table of contents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The background&lt;/li&gt;
&lt;li&gt;The investigation&lt;/li&gt;
&lt;li&gt;The execution&lt;/li&gt;
&lt;li&gt;The result&lt;/li&gt;
&lt;li&gt;What we would do differently&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The background&lt;/h2&gt;

&lt;p&gt;At CoinGecko, we use multiple tables to store crypto prices for various purposes. However, after more than 8 years of data, one of the tables, which stores hourly data, grew past 1TB, to the point where queries took over 30 seconds on average.&lt;/p&gt;

&lt;p&gt;We started to see higher IOPS usage whenever more requests hit the price endpoints. Request queues started to grow and our Apdex score started to drop. As a short-term fix, we increased the IOPS up to 24K. However, the IOPS limit kept being breached, causing alerts every day.&lt;/p&gt;

&lt;p&gt;To ensure this situation didn’t affect our SLO, and eventually our SLA, we started looking into what we could do to improve it.&lt;/p&gt;

&lt;p&gt;FYI, we are using PostgreSQL on Amazon RDS as our main database.&lt;/p&gt;

&lt;h2&gt;The investigation&lt;/h2&gt;

&lt;p&gt;Adding indexes was initially considered the quickest solution, but this approach was unsuccessful. The query utilizes a JSONB column with keys based on supported currencies, presenting an additional challenge. Indexing different keys for various applications was deemed excessive, as an added index might only benefit a single application.&lt;/p&gt;

&lt;p&gt;Ultimately, table partitioning was chosen as the solution most likely to yield the greatest returns, despite its complexity.&lt;/p&gt;

&lt;h3&gt;What is table partitioning?&lt;/h3&gt;

&lt;p&gt;Table partitioning involves dividing a large table into smaller, more manageable pieces called partitions. These partitions share the same logical structure as the original table but are physically stored as separate tables.&lt;/p&gt;

&lt;p&gt;This allows queries to operate on only the relevant partitions, improving performance by reducing the amount of data scanned.&lt;/p&gt;

&lt;p&gt;There are three partitioning methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Range:&lt;/strong&gt; Partitions data based on a range of values (e.g., dates, numerical ranges).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;List:&lt;/strong&gt; Partitions data based on specific list values (e.g., countries, categories).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hash:&lt;/strong&gt; Partitions data by applying a hash function to a column's value, distributing data evenly across partitions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The question is, which partitioning method should we use? As usual, the answer is: it depends. We need to analyze our query patterns to determine which method will provide the greatest benefit. The key is to select the method that minimizes the amount of data our queries need to read each time they run.&lt;/p&gt;

&lt;p&gt;In our case, range partitioning is the optimal choice, because almost all of our queries on this table include a timestamp range in the &lt;code&gt;WHERE&lt;/code&gt; clause. Moreover, we know we generally only need data for a few months at a time, with a maximum of four. As a result, partitioning the table by month guarantees that our queries only access up to four partitions (most of the time).&lt;/p&gt;

&lt;p&gt;If, for some reason, we were not limiting reads based on the timestamp, we might need to use the hash method instead, which limits reads based on a key such as a foreign key. Again, it depends on the use case.&lt;/p&gt;

&lt;p&gt;What would the code look like?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;RANGE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Create partitions for different amount ranges&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders_small&lt;/span&gt;
   &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
   &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders_medium&lt;/span&gt;
   &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
   &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders_large&lt;/span&gt;
   &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
   &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders_extra_large&lt;/span&gt;
   &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
   &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MAXVALUE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Insert sample data&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;    &lt;span class="c1"&gt;-- Goes to orders_small&lt;/span&gt;
   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="c1"&gt;-- Goes to orders_medium&lt;/span&gt;
   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="c1"&gt;-- Goes to orders_large&lt;/span&gt;
   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;-- Goes to orders_extra_large&lt;/span&gt;
   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;    &lt;span class="c1"&gt;-- Goes to orders_small&lt;/span&gt;
   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;450&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="c1"&gt;-- Goes to orders_medium&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In my case, I name the partitions using the format &lt;code&gt;table_YYYYMM&lt;/code&gt;.&lt;/p&gt;
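
&lt;p&gt;As a sketch, monthly range partitions under that naming scheme would look like this (the table and columns here are illustrative, not our actual schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Parent table partitioned by month on the timestamp column
CREATE TABLE prices (
   id BIGSERIAL,
   coin_id INTEGER,
   data JSONB,
   created_at TIMESTAMPTZ NOT NULL,
   -- The partition key must be part of the primary key
   PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- One partition per month, named table_YYYYMM
CREATE TABLE prices_202401 PARTITION OF prices
   FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE prices_202402 PARTITION OF prices
   FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;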

&lt;h3&gt;What did I learn during the investigation?&lt;/h3&gt;

&lt;p&gt;If someone had to do this again, this is the information I would pass on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We can’t change an unpartitioned table into a partitioned table in place. We need to create a new table and copy the data into it before making the switch.&lt;/li&gt;
&lt;li&gt;The partitioned table needs to have the partition key as part of its primary key. If &lt;code&gt;id&lt;/code&gt; was the original primary key, we need a composite primary key on the new table.&lt;/li&gt;
&lt;li&gt;To use partitioned tables in Ruby on Rails, we need to switch the schema format from &lt;code&gt;schema.rb&lt;/code&gt; to SQL (&lt;code&gt;structure.sql&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;We need to figure out how to keep both tables running at the same time if we can’t afford downtime.&lt;/li&gt;
&lt;li&gt;Since we will be creating a new table, we have to be careful about caching. The new table starts with no cache at all, so its performance will initially be very poor. We have to figure out how to “warm up” the new table; here, I am referring to PostgreSQL’s own buffer cache.&lt;/li&gt;
&lt;li&gt;To warm up a table, look into the &lt;code&gt;pg_prewarm&lt;/code&gt; extension.&lt;/li&gt;
&lt;li&gt;Copying 1.2TB of data requires substantial resources, especially IOPS. We need to take that into account.&lt;/li&gt;
&lt;/ol&gt;
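
&lt;p&gt;For reference, the prewarming itself is only a couple of statements with &lt;code&gt;pg_prewarm&lt;/code&gt; (the partition and index names here are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Enable the extension (once per database)
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Load a table or an individual partition into the buffer cache;
-- the function returns the number of blocks read
SELECT pg_prewarm('prices_202401');

-- Indexes can be prewarmed the same way
SELECT pg_prewarm('prices_202401_pkey');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;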

&lt;p&gt;With all the info that we had, we created a Release Plan document outlining when and what was going to happen. We used that document as our main reference point for everyone to see. The document contains the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Date:&lt;/strong&gt; When it is going to happen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prerequisites:&lt;/strong&gt; Tasks that must be completed before the todos can start, listed here to ensure we don’t begin without them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risks:&lt;/strong&gt; For every risk, what could happen and what the mitigation plan is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Todos:&lt;/strong&gt; What needs to be done; once a task is done, we tick it off the list.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The execution&lt;/h2&gt;

&lt;h3&gt;Dry Run&lt;/h3&gt;

&lt;p&gt;Our customers are very particular about the uptime of our services, so we proactively conduct dry runs to safeguard our SLO/SLA commitments. Before starting a dry run, we list what we want to do and which statistics we want to collect.&lt;/p&gt;

&lt;p&gt;For this project, we spun up another database identical to our production one. Then, we ran all the commands and scripts that we would later run on the production instance. In our case, we were looking for this data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Before and after query performance.&lt;/li&gt;
&lt;li&gt;How long will it actually take to copy the data?&lt;/li&gt;
&lt;li&gt;How long does it take to warm up the table partitions?&lt;/li&gt;
&lt;li&gt;What does the CPU and IOPS look like for every action that we did?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Avoid operating in the dark. Be prepared so that we won’t miss our objectives. Remember, using a database similar to our production one is going to cost a lot of money.&lt;/p&gt;

&lt;p&gt;What we found out during the dry run:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The original table was very slow at first. This was expected: without any cache, we couldn’t work with it effectively.&lt;/li&gt;
&lt;li&gt;It took us 10 hours just to warm up the original table before we could start trying out our commands.&lt;/li&gt;
&lt;li&gt;It took us more than 3 days to finish copying the data.&lt;/li&gt;
&lt;li&gt;Total IOPS can spike up to 6,000 during this operation, even when running in isolation without any other database workload. To put this in perspective, 6,000 read IOPS is virtually identical to what our production database handles under normal operating conditions.&lt;/li&gt;
&lt;li&gt;The same queries ran 6-8x faster once we switched to the partitioned table.&lt;/li&gt;
&lt;li&gt;Prewarming the partitioned table took only 3 hours, compared to 10 hours for the original table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once we had the right statistics, we made the necessary arrangements, such as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Announcing the day and time when we would do this on the production database.&lt;/li&gt;
&lt;li&gt;Increasing the storage capacity to ensure our database could fit the new table and still have extra space left until we dropped the original table. We also needed to account for the storage used by everyday tasks.&lt;/li&gt;
&lt;li&gt;Increasing the IOPS so that the primary and the replicas could handle the load from the data copy process.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Go-Live Day&lt;/h3&gt;

&lt;p&gt;The work started at the beginning of the week to minimize weekend work and ensure our engineers had a peaceful weekend. We also had a backup engineer and SRE support.&lt;/p&gt;

&lt;p&gt;It’s quite normal for things not to go as planned, but the first challenge was something I didn’t expect at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge #1: Our original table was so slow that we couldn't even finish copying one day's worth of data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Based on the dry run, we should have been able to copy one day of data within 2-3 minutes. However, production turned out to be much worse than I expected. Technically, the production data should have had a sufficiently warm cache, as it is actively being used. I didn’t spend too much time looking into it, but I knew I couldn’t warm up the table, as we wouldn’t have enough IOPS for it.&lt;/p&gt;

&lt;p&gt;We had 8 years of data to copy. So, imagine waiting 15 minutes for one day of data. Actually, I don’t know how long it would have taken, as I killed the query after 15 minutes.&lt;/p&gt;

&lt;p&gt;What was the solution? We knew for a fact from our dry run that if we warmed up the table, it would take only 2-3 minutes per day of data. But we couldn’t warm up the production table. So, what could we do?&lt;/p&gt;

&lt;p&gt;I remembered a Postgres feature called &lt;a href="https://www.postgresql.org/docs/current/ddl-foreign-data.html" rel="noopener noreferrer"&gt;foreign data wrappers&lt;/a&gt;. Basically, we would read from another host and write to the partitioned table on the production host. This way, we wouldn’t have to warm up the table in production, and we wouldn’t use too much IOPS either. This seemed like a win to us.&lt;/p&gt;

&lt;p&gt;Based on that idea, we adjusted our plan a little:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Provision another production-grade database.&lt;/li&gt;
&lt;li&gt;Prewarm the table.&lt;/li&gt;
&lt;li&gt;Set up the foreign data wrapper.&lt;/li&gt;
&lt;li&gt;Update our copy script to read from the new host and write to the current production database.&lt;/li&gt;
&lt;/ol&gt;
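
&lt;p&gt;A minimal sketch of the foreign data wrapper setup, with placeholder host, credentials, and column names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- On the production host: point postgres_fdw at the warmed-up copy
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER warm_copy FOREIGN DATA WRAPPER postgres_fdw
   OPTIONS (host 'warm-copy.internal', port '5432', dbname 'app');

CREATE USER MAPPING FOR CURRENT_USER SERVER warm_copy
   OPTIONS (user 'copier', password '...');

-- Expose the remote (prewarmed) table locally
CREATE FOREIGN TABLE prices_remote (
   id BIGINT,
   coin_id INTEGER,
   data JSONB,
   created_at TIMESTAMPTZ
) SERVER warm_copy OPTIONS (schema_name 'public', table_name 'prices');

-- Copy one day at a time into the local partitioned table
INSERT INTO prices_partitioned
SELECT * FROM prices_remote
WHERE created_at &amp;gt;= '2024-01-15' AND created_at &amp;lt; '2024-01-16';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;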

&lt;p&gt;This whole process set us back by two to three days, but it was something we could not avoid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge #2: Warming up all databases, including replicas, was necessary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We didn't realize we needed to warm up our replicas until go-live day began; we had only been focusing on the primary database. This oversight added extra work to the process, but at least we no longer had to worry about the replicas' partitions not being warm enough.&lt;/p&gt;

&lt;h4&gt;The final query&lt;/h4&gt;

&lt;p&gt;Once we had gone through all of the tasks, we just flipped the switch by renaming the table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Remove existing trigger&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;...;&lt;/span&gt;

&lt;span class="c1"&gt;-- THE IMPORTANT BITS&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;prices_old&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;prices_partitioned&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create the trigger function&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;sync_prices_changes_v2&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
  &lt;span class="c1"&gt;-- ...&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Attach the trigger on the new table so that prices_old will get the changes&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;sync_to_partitioned_table&lt;/span&gt;
&lt;span class="k"&gt;AFTER&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;EACH&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;
&lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;sync_prices_changes_v2&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, we use a trigger to copy changes from the new table back to the old one. We still need the old table: remember, things could go wrong, and we need our Plan B, Plan C, and so on.&lt;/p&gt;
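
&lt;p&gt;The trigger function body above is elided because it is specific to our schema, but a generic version of such a sync trigger might look like this (assuming &lt;code&gt;id&lt;/code&gt; identifies the same row in both tables):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE OR REPLACE FUNCTION sync_prices_changes_v2()
RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    INSERT INTO prices_old VALUES (NEW.*);
  ELSIF TG_OP = 'UPDATE' THEN
    DELETE FROM prices_old WHERE id = OLD.id;
    INSERT INTO prices_old VALUES (NEW.*);
  ELSIF TG_OP = 'DELETE' THEN
    DELETE FROM prices_old WHERE id = OLD.id;
  END IF;
  RETURN NULL; -- AFTER triggers ignore the return value
END;
$$ LANGUAGE plpgsql;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;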

&lt;h3&gt;What We Did Right&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Table warm-ups:&lt;/strong&gt; Based on past experience from the database upgrade, we made the right call to warm up the partitions. This ensured that query time didn't increase when we switched from the unpartitioned table to the partitioned table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scripted the tasks:&lt;/strong&gt; We prepared scripts for every task and developed a small Go app to manage data copying. The app included essential features like timestamps, the ability to specify the year for data copying, and the time taken to copy data. We also created an app for warming up the table, allowing us to carefully manage CPU and IOPS usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Used CloudWatch Dashboard:&lt;/strong&gt; We decided to fully utilize CloudWatch Dashboard for this project, and it proved invaluable for monitoring IOPS, CPU, Replica Lag, and other metrics across multiple replicas. Learning to set up the vertical line feature was particularly helpful for visualizing before-and-after comparisons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team Backup:&lt;/strong&gt; Having backup from another person or team was beneficial. They helped identify things we might have missed and provided a sounding board for ideas during planning and execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The result&lt;/h2&gt;

&lt;p&gt;Now let’s get to the fun part. However, there was also a regression after we switched to the partitioned table. Details are in the “What we would do differently” section.&lt;/p&gt;

&lt;p&gt;When talking about the results, we should go back to why we did this in the first place. At the micro level, we wanted to reduce the IOPS for certain queries. At the macro level, we wanted our endpoints to be faster and more resilient to request spikes. Severely high IOPS can cause replica lag as well.&lt;/p&gt;

&lt;h3&gt;IOPS&lt;/h3&gt;

&lt;p&gt;IOPS dropped by 20% right after this exercise. Since this table is used extensively across all of our applications, we can lower the maximum provisioned IOPS, allowing us to cut costs further. To be clear, we run multiple replicas, so the cost savings are multiplied by the number of replicas we have.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F293xfzj2gueosw2lhnzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F293xfzj2gueosw2lhnzb.png" alt="IOPS" width="800" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Response Time&lt;/h3&gt;

&lt;p&gt;This was quite significant: we managed to reduce the p99 from 4.13s to 578ms, which is about an 86% reduction in response time. You can see how flat the chart is right after we made the switch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmz7z3wez7v5upmnyatle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmz7z3wez7v5upmnyatle.png" alt="Response Time" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Replica Lag&lt;/h3&gt;

&lt;p&gt;Right before we made the table switch, increased usage of the affected endpoints caused higher IOPS, which in turn caused replica lag. But it went away the moment we flipped to the partitioned table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fandcu2ufh770yo4qx3gd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fandcu2ufh770yo4qx3gd.png" alt="Replica Lag" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What we would do differently&lt;/h2&gt;

&lt;h3&gt;Block the deployments&lt;/h3&gt;

&lt;p&gt;The biggest mistake in our planning was not blocking deployments on the day of the switch. While the switch itself wouldn't disrupt others' work, we overlooked the fact that two deployments were related to the table we were optimizing. This caused confusion about the impact of the table partitioning.&lt;/p&gt;

&lt;p&gt;The higher CPU and IOPS utilization observed after the switch put the exercise at risk of rollback. We eventually identified that an earlier deployment caused the problem. However, pinpointing the cause required rolling back the changes and extensive discussion. This situation could have been avoided by blocking deployments for a day to clearly assess the impact of our changes.&lt;/p&gt;

&lt;h3&gt;Too focused on the replicas&lt;/h3&gt;

&lt;p&gt;Our focus on resolving the replica issue led us to overlook the impact on the primary database. We failed to identify which queries would be affected, and one query, in particular, performed worse after the switch. This query triggered scans across all of the table partitions, increasing IOPS and CPU usage. By modifying the query, we managed to resolve it.&lt;/p&gt;

&lt;p&gt;This experience highlighted that, with the wrong query, table partitioning can be detrimental. In this instance, the query lacked a lower bound on the date range, so all partitions were scanned unnecessarily. Interestingly, the same query performed well on the unpartitioned table.&lt;/p&gt;
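
&lt;p&gt;As a simplified illustration (not our actual query), &lt;code&gt;EXPLAIN&lt;/code&gt; makes the difference visible:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- No lower bound: every historical partition may hold matching rows,
-- so the planner has to scan all of them
EXPLAIN SELECT * FROM prices WHERE created_at &amp;lt; '2024-04-01';

-- Bounded on both sides: the plan only touches the three relevant
-- monthly partitions
EXPLAIN SELECT * FROM prices
WHERE created_at &amp;gt;= '2024-01-01' AND created_at &amp;lt; '2024-04-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;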

&lt;h3&gt;Extra partitions&lt;/h3&gt;

&lt;p&gt;This was another instance where a regression on one endpoint helped us realize a mistake.&lt;/p&gt;

&lt;p&gt;To reduce the future workload of creating partitions, we initially created all the partitions for 2025. We then discovered that one query was structured like &lt;code&gt;created_at &amp;gt; ?&lt;/code&gt; without an upper limit, which caused it to scan future partitions that were still empty. Removing those partitions fixed the issue.&lt;/p&gt;

&lt;p&gt;Going forward, we need to determine a better strategy for when to create future partitions.&lt;/p&gt;
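
&lt;p&gt;As a small helper for that, the catalog can be queried to audit which partitions exist, for example to spot empty future ones (partition names follow the &lt;code&gt;table_YYYYMM&lt;/code&gt; format mentioned earlier):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- List all partitions attached to the prices table
SELECT inhrelid::regclass AS partition
FROM pg_inherits
WHERE inhparent = 'prices'::regclass
ORDER BY 1;

-- Detach and drop an empty future partition
ALTER TABLE prices DETACH PARTITION prices_202512;
DROP TABLE prices_202512;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;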

&lt;h3&gt;Incremental release&lt;/h3&gt;

&lt;p&gt;We shouldn’t have waited for the entire table to be migrated before starting to use it. Since most users only access data from the last four partitions (or months), we could have implemented a feature toggle to direct queries for data after a specific date to the already migrated partitioned tables.&lt;/p&gt;

&lt;p&gt;This approach would have allowed us to start using the partitioned tables sooner and reduced the risks and potential negative impact of any issues that might arise. As it stands, rolling back our changes would be costly in terms of IOPS, as we would need to prewarm the old table again to avoid production downtime due to cold cache.&lt;/p&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;This is just the beginning. With this experience under our belt, we can start exploring the possibility of implementing table partitions on other tables as well. Of course, table partitioning isn't a one-size-fits-all solution. We need to diagnose the issue before proceeding. Sometimes, something as simple as adding an index can resolve the problem.&lt;/p&gt;

&lt;p&gt;In conclusion, while partitioning the 1TB+ "prices" table presented some challenges, especially during the go-live phase, the overall outcome was substantial performance improvements. This initiative aligns with the API Team's ongoing goal: providing the best possible experience to better serve our customers. Our API is now more stable and resilient against sudden spikes in requests during peak hours.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>rails</category>
    </item>
    <item>
      <title>Caching Basics</title>
      <dc:creator>Amree Zaid</dc:creator>
      <pubDate>Mon, 01 Jan 2024 03:10:10 +0000</pubDate>
      <link>https://dev.to/amree/caching-basics-1g6i</link>
      <guid>https://dev.to/amree/caching-basics-1g6i</guid>
      <description>&lt;p&gt;Caching is something we use to reduce the access to the original source. To do that, we store the data temporarily on another medium. We can use various tech to achieve that, but whatever we choose should be faster or lighter than the original source. This will help our users to retrieve our data faster than before. It will also help lower the initial resource utilization. Depending on what we choose, we can save some costs.&lt;/p&gt;

&lt;p&gt;What kind of data can we cache? I am unsure if there's a limitation, but anything retrieved can be cached. The only question is, where do you store it? That highly depends on what kind of data we have.&lt;/p&gt;

&lt;p&gt;Let's start with something basic. When we retrieve data from our database, we might not realize there's already some caching happening there, too. But is it good enough? Depending on your use case, it may not be, since that cache can easily get busted by changes in the data. For a proper cache, you should retrieve the data and cache it at the application layer. But where would you store it? This is where we choose something faster than the database itself, because if we don't, there's no point in caching, right?&lt;/p&gt;

&lt;p&gt;In short, I suggest storing it in an in-memory database such as Redis. That way, the web server doesn't have to contact the database whenever someone requests the endpoint that triggers the query. The response time will be faster since the data lives in memory instead of on disk, and the endpoint doesn't have to compute or do additional work to get the data. We are supposed to store precomputed data close to its final form anyway. As you can see, we have freed the database to work on something else. It is also a technique we can use to avoid running expensive computations on the database.&lt;/p&gt;
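
&lt;p&gt;Here's a small cache-aside sketch of that idea. A plain Hash stands in for Redis so the example runs anywhere; with a real client you'd call the equivalent get/set methods, and the lookup function below is a made-up stand-in for the real database query:&lt;/p&gt;

```ruby
# Cache-aside: check the cache first, only hit the slow source on a
# miss, then store the precomputed result for the next request.
CACHE = {}

def fetch_price(coin_id)
  key = "price:#{coin_id}"
  cached = CACHE[key]
  return cached unless cached.nil?

  # Cache miss: run the expensive lookup, store it, return it.
  value = expensive_database_lookup(coin_id)
  CACHE[key] = value
  value
end

def expensive_database_lookup(coin_id)
  # Stand-in for the real query against the database.
  { id: coin_id, usd: 123.45 }
end
```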

&lt;p&gt;With the previous implementation, we have reduced the requests that go to the database, but you might have noticed that the requests will still reach our servers. It's good to stop here for most use cases. However, that is different with high-traffic websites. The key to handling many requests from your websites is shifting the responsibility to something else. It's quite similar to how we avoid access to the database by asking Redis to store the data. What kind of options do we have?&lt;/p&gt;

&lt;p&gt;There are various ways to do this, such as HAProxy or Nginx. However, I would like to discuss how we can use a Content Delivery Network (CDN) such as CloudFlare (CF) to do this.&lt;/p&gt;

&lt;p&gt;"A CDN, or Content Delivery Network, is a system of distributed servers that deliver web content and other web services to users based on their geographic location, the origin of the web page, and a content delivery server. The primary purpose of a CDN is to improve access speed and efficiency by reducing the distance between the user and the content." - ChatGPT.&lt;/p&gt;

&lt;p&gt;We can do so many things with a CDN, but in this context, we would like to temporarily store the cache on the CDN itself. When we put content on the CDN, depending on how we configure the cache, the CDN will serve the requests instead of the "origin" servers (our servers). As you might have guessed, we just freed our servers from serving those requests, which reduced server utilization. Now the CDN servers, which are strategically located around the world, will serve the data. Normally, this is done by setting the proper "Cache-Control" header.&lt;/p&gt;

&lt;p&gt;Everything sounds so good, but remember, our decisions always have trade-offs, and these are some of them:&lt;/p&gt;

&lt;p&gt;"There are only two hard things in Computer Science: cache invalidation and naming things" - Phil Karlton.&lt;/p&gt;

&lt;p&gt;When we temporarily store data, we must determine how long it will be considered fresh. Once we have the right requirements, we must figure out how to invalidate the data. The methods depend highly on the stack you are using. Some frameworks make it easy by letting you specify a Time to Live (TTL). We can also use the Observer pattern to invalidate the data when an event happens, but then we need to know the key under which the cache was stored. We can even configure eviction at the server level; for example, Redis has eviction policies such as Least Recently Used (LRU), Least Frequently Used (LFU), and others. What about the caches on the CDN? The basics are still the same: we either let entries expire based on the TTL or force it by triggering an API call to the CDN.&lt;/p&gt;
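
&lt;p&gt;To make the TTL idea concrete, here's a toy cache that stamps each entry with an expiry time and lazily invalidates it on read. The clock is passed in explicitly purely for testability; a real implementation would use the current time:&lt;/p&gt;

```ruby
# Toy TTL cache: entries carry an expiry timestamp and are treated as
# stale (and deleted) once the clock passes it.
class TtlCache
  def initialize
    @store = {}
  end

  def write(key, value, ttl:, now:)
    @store[key] = { value: value, expires_at: now + ttl }
  end

  def read(key, now:)
    entry = @store[key]
    return nil if entry.nil?
    # The max trick avoids comparison operators: expired when now is
    # on or after expires_at.
    expired = [now, entry[:expires_at]].max == now
    if expired
      @store.delete(key)   # lazy invalidation on read
      nil
    else
      entry[:value]
    end
  end
end
```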

&lt;p&gt;One big disadvantage I want to point out about caching on the CDN is that we need JavaScript to handle the dynamic portions of a cached page. Caching means the CDN saves a snapshot of the page on its servers for a certain period, so everything rendered at that time won't change anymore.&lt;/p&gt;

&lt;p&gt;For example, suppose we render the current time on a page. After the cache expires, the next request reaches our server, and the CDN caches the page, time included, exactly as rendered for that particular request. Since that time came from our server side, it won't change anymore, as subsequent requests hit the CDN instead of our servers. Remember, the CDN serves what has been saved; it won't re-render the page from scratch, as that's our server's job. So, what do we do? We had to move some code to JavaScript (JS), since JS runs on the user's machine. Depending on the requirements, we can have JS asynchronously load parts of the page while the CDN handles the initial page load. That way, we can still cache the page on the CDN and update some of the content anytime we want.&lt;/p&gt;

&lt;p&gt;Alright, that's it for now. There is still more to discuss, but I will stop here first. There are also more ways to approach the cache invalidation. Maybe next time, it will be about the Dogpile Effect. Till then, happy New Year, everyone!&lt;/p&gt;

</description>
      <category>webdev</category>
    </item>
    <item>
      <title>Tiny Guide to Webscaling</title>
      <dc:creator>Amree Zaid</dc:creator>
      <pubDate>Sun, 26 Nov 2023 10:38:49 +0000</pubDate>
      <link>https://dev.to/amree/tiny-guide-to-webscaling-5eem</link>
      <guid>https://dev.to/amree/tiny-guide-to-webscaling-5eem</guid>
      <description>&lt;p&gt;Someone on Twitter asked what if Khairul Aming wanted to set up his own website for his sambal? For those who may not know, his product has gained fame and typically sells out quickly once he opens orders. At present, he utilizes Shopee.&lt;/p&gt;

&lt;p&gt;From a business standpoint, it's advisable for him to stay with Shopee. My post is primarily for educational purposes.&lt;/p&gt;

&lt;p&gt;Disclaimer: I am not an SRE/DevOps professional, but rather someone eager to share insights that might broaden understanding of web scalability, drawn from my limited experiences. Therefore, there may be inaccuracies, though I hope none too significant.&lt;/p&gt;

&lt;p&gt;Expect some odd structuring in the paragraphs, as the content was initially made for Twitter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Firstly, traffic won’t spike up on day one. Even if the founder is a marketing genius, he might prefer an established app over starting from scratch. However, if he's starting anew, what’s the accepted baseline for stress tests?&lt;/p&gt;

&lt;p&gt;If he already has good sales and no problems, what’s the business decision on moving to a completely new platform?&lt;/p&gt;

&lt;p&gt;Then, it circles back to the possibility of scaling what the owner currently has, which is most likely his own custom platform. That means legacy baggage.&lt;/p&gt;

&lt;p&gt;A small tip for those interviewing: don't simply throw tons of cloud jargon at your interviewer. First, ask about any constraints. Your system design answer will be more relatable.&lt;/p&gt;

&lt;p&gt;The solution should be about improving the existing problem. Yes, it's not cool and easy to improve old stuff. But we are not in school anymore. We rarely start with something 100% new.&lt;/p&gt;

&lt;p&gt;Ask and investigate the existing stack. Then, we can start working on it.&lt;/p&gt;

&lt;p&gt;This article explained it best: &lt;a href="https://mensurdurakovic.com/hard-to-swallow-truths-they-wont-tell-you-about-software-engineer-job/"&gt;https://mensurdurakovic.com/hard-to-swallow-truths-they-wont-tell-you-about-software-engineer-job/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Understanding the process from the user typing the address to getting the response back is important. I'd say we can simplify it to:&lt;/p&gt;

&lt;p&gt;User -&amp;gt; Server -&amp;gt; Data Storage (DS)&lt;br&gt;
User -&amp;gt; DNS -&amp;gt; Load Balancer (LB) -&amp;gt; Servers -&amp;gt; DS&lt;br&gt;
User -&amp;gt; DNS -&amp;gt; CDN -&amp;gt; LB -&amp;gt; Servers -&amp;gt; DS&lt;/p&gt;

&lt;p&gt;It is important because we want to know which part of the chain to look at and fix when there is a problem.&lt;/p&gt;

&lt;p&gt;Developers may not care as there are DevOps or someone else to take care of it, but, in my humble opinion, it's good to understand the stack even at the surface level to help the team.&lt;/p&gt;

&lt;h2&gt;
  
  
  DNS
&lt;/h2&gt;

&lt;p&gt;Let’s start with the DNS. Every part of the architecture is important, but DNS is the first layer of the request that will be touched before it goes somewhere else. This means you can do lots of interesting things here.&lt;/p&gt;

&lt;p&gt;See the traffic sequence image if you use CloudFlare (CF).&lt;/p&gt;

&lt;p&gt;From the image, you can see that CF can do lots of things such as DDoS protection, redirections, caches, firewall (WAF), and others.&lt;/p&gt;

&lt;p&gt;CF is not the only option. There's CloudFront as well. But I'm more used to CF than CloudFront, so I'm going to talk more about this service.&lt;/p&gt;

&lt;p&gt;Let's start from the top: DDoS. You can’t have a popular website without DDoS protection. If you are big enough, that means you are popular enough for others to do bad stuff, like bots, attacks, fake traffic, and so on.&lt;/p&gt;

&lt;p&gt;There’s always a limit to what you can handle.&lt;/p&gt;

&lt;p&gt;To protect from those attacks, we need to stop them before they reach your servers. If they manage to get through, then you'll have bigger problems.&lt;/p&gt;

&lt;p&gt;CF can help mitigate the attack. But not all of them. You have to get used to tuning the knobs manually when needed.&lt;/p&gt;

&lt;p&gt;So, remember, don’t trust the advertisements from the service providers. Test them out, get attacked, gain experience in looking at the traffic, see the anomalies, separate them, and control them by banning or throttling them so that your servers can handle the incoming traffic.&lt;/p&gt;

&lt;p&gt;Page Rules - You can override the behavior of certain pages without code. For example, your code might set cache-control so CF caches a page for 5 minutes; if, for whatever reason, you need the cache to stay longer, that can be done from this page.&lt;/p&gt;

&lt;p&gt;WAF - My favorite. This is where you can ban or throttle traffic based on IP/ASN/country/custom matchers/etc. This page has saved me from sleepless nights countless times.&lt;/p&gt;

&lt;p&gt;Imagine having a sale and getting attacked at the same time.&lt;/p&gt;

&lt;p&gt;Before moving on to the next layer, it's important to note that this is where we point &lt;a href="http://domain.com"&gt;http://domain.com&lt;/a&gt; to an ALB / Load Balancer address / your server. It's how CF will know how to route the request to the next step.&lt;/p&gt;

&lt;p&gt;It will also go through the traffic sequence mentioned above.&lt;/p&gt;

&lt;p&gt;I didn’t talk much about Content Delivery Networks (CDN). I think that’s the lowest-hanging fruit: ensuring your assets are served by the CDN servers located nearest to your visitors.&lt;/p&gt;

&lt;p&gt;Remember, the less traffic goes to the origin, the better.&lt;/p&gt;

&lt;p&gt;This is also why some people tweak their WordPress/Web to be served by the CDN as much as possible.&lt;/p&gt;

&lt;p&gt;What’s the catch? Once cached, how do you expire the cache? Cache is also less usable if your traffic is too random. You have to understand your traffic before reaching for it.&lt;/p&gt;

&lt;p&gt;Before moving to the next layer, learn about the Cache-Control header as much as you can. Learn to leverage it not just for your assets, but also for your normal pages. Figure out how to get the most out of it and understand the downside as well, then you'll have a very fast website.&lt;/p&gt;
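
&lt;p&gt;As a small illustration (the helper itself is made up, but the directive names are standard HTTP): max-age controls how long browsers keep a response, while s-maxage applies to shared caches such as a CDN:&lt;/p&gt;

```ruby
# Build a Cache-Control header value. "public" allows shared caches to
# store the response; max-age is for the browser, s-maxage for the CDN.
def cache_headers(cdn_seconds:, browser_seconds:)
  headers = {}
  headers["Cache-Control"] = "public, max-age=#{browser_seconds}, s-maxage=#{cdn_seconds}"
  headers
end
```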

&lt;h2&gt;
  
  
  Load Balancer
&lt;/h2&gt;

&lt;p&gt;Let's move on to the Load Balancer (LB). Its purpose is to distribute incoming traffic across various servers, typically using a round-robin algorithm. In our example, traffic comes from CF and goes to the LB. In AWS, this is known as the Application Load Balancer (ALB).&lt;/p&gt;

&lt;p&gt;Focusing on the ALB, which operates at the Layer 7 or application layer (HTTP/HTTPS), it represents the last point before traffic is distributed to the servers. This is why the ALB address is entered into CF.&lt;/p&gt;

&lt;p&gt;Securing the ALB address is critical because attackers might launch direct attacks, bypassing the protections set up in CF. Normally, traffic is only allowed from trusted sources like CF, with all other traffic denied.&lt;/p&gt;

&lt;p&gt;ALB is just one of AWS's services; you can also use other open-source software (OSS) as your LB. Even nginx can handle load balancing, but whether it can withstand the traffic is another matter. AWS ALB, in my experience, is very reliable.&lt;/p&gt;

&lt;p&gt;Discussing scalability leads us to Auto Scaling Groups (ASG) and Target Groups (TG). Some believe these services will solve all scaling problems, but it's not that straightforward. They are helpful, but endless scaling isn't the goal, right?&lt;/p&gt;

&lt;p&gt;In summary: Traffic from the ALB is forwarded to the TG. The ASG, using a Launch Template (LT), manages the addition of servers as needed—this much I'm sure of.&lt;/p&gt;

&lt;p&gt;The criteria for adding servers, such as a server's CPU usage hitting 60% for 10 minutes, can be based on various factors. When triggered, AWS launches more servers according to the LT specifications. To reduce costs, many opt for auto spot instances, which are notably cheaper. New servers are registered to the TG for the ALB's utilization.&lt;/p&gt;
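
&lt;p&gt;The trigger logic amounts to something like this toy check, reusing the 60% CPU example from above (the sampling window handling is simplified; real ASG policies are configured, not coded like this):&lt;/p&gt;

```ruby
# Decide whether to scale out: true when the average CPU over the
# sampling window is at or above the threshold.
def scale_out?(cpu_samples, threshold: 60.0)
  return false if cpu_samples.empty?
  average = cpu_samples.sum / cpu_samples.size.to_f
  # The max trick avoids comparison operators: true when the average
  # is at or above the threshold.
  [average, threshold].max == average
end
```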

&lt;p&gt;The servers will be terminated once the preset conditions are no longer met, allowing resources to fluctuate based on demand. This is the essence of 'autoscaling.'&lt;/p&gt;

&lt;p&gt;To be clear, autoscaling isn't mandatory. It's for teams who anticipate sudden spikes in traffic that require additional server capacity. The thought of managing this manually is daunting. It's wise to research the pros and cons of auto spot instances as well.&lt;/p&gt;

&lt;p&gt;But it's not a magic solution; there are always trade-offs. Adding servers impacts other resources—nothing comes without cost.&lt;/p&gt;

&lt;p&gt;Specifically, there are implications for storage, but let's discuss the servers and application first, or perhaps address them simultaneously since they're interconnected.&lt;/p&gt;

&lt;p&gt;And if anyone is still reading this, kudos to you lol 🤷&lt;/p&gt;

&lt;h2&gt;
  
  
  Servers
&lt;/h2&gt;

&lt;p&gt;After the ALB routes the data, it's up to the servers to push it to the application layer. 'Servers' could refer to either physical hardware or application servers. My expertise lies with EC2 rather than serverless, so that's where I'll concentrate.&lt;/p&gt;

&lt;p&gt;EC2 is a service from AWS where you deploy your application. AWS offers various server types optimized for compute, memory, storage, etc. The best choice depends on your workload. Understanding your server's specs is crucial.&lt;/p&gt;

&lt;p&gt;Without knowing a server's limits, we may inadvertently overuse it. Recognizing when a server is at capacity is a skill in itself.&lt;/p&gt;

&lt;p&gt;Cost is a significant factor. AWS provides regular, spot, and dedicated instances, among others. Upfront payment can also offer cost savings. Tools like spot.io assist in cost optimization, but understanding your workload is fundamental.&lt;/p&gt;

&lt;p&gt;Spot instances are economical but can be terminated unexpectedly, so it might be prudent to start with dedicated instances and then transition to spot instances based on usage and requirements. These concepts warrant further research for a comprehensive understanding.&lt;/p&gt;

&lt;p&gt;Cost considerations become even more critical when auto-scaling because you could potentially spin up many servers. It's essential to determine a reasonable limit. I haven't even begun to discuss the impact on other resources.&lt;/p&gt;

&lt;p&gt;Now, let's delve into the server itself. The software stack depends on your application. For Ruby on Rails, a classic setup might include nginx, puma, and then your code. Understanding this flow is vital.&lt;/p&gt;

&lt;p&gt;You should consider how your application server uses resources, as this will dictate the necessary memory and CPU. Optimizing a single server might mean you don't need to spin up ten servers for the same load.&lt;/p&gt;

&lt;p&gt;I recall a 'fixer' at a seminar who explained how he calculated the required RAM based on active processes during a nationwide application outage years ago, using just basic Linux tools, without AWS.&lt;/p&gt;

&lt;p&gt;Web scaling isn't merely about adding more servers. It's about understanding why resources are being strained. A recent incident at work made me question whether our resource usage was high simply because we couldn't pinpoint the underlying issue.&lt;/p&gt;

&lt;p&gt;I highly recommend this article from Judoscale, &lt;a href="https://bit.ly/46yg8hF"&gt;https://bit.ly/46yg8hF&lt;/a&gt;. It offers excellent visuals on data flow from the ALB to the application server. It's a valuable read, even for those not using Ruby on Rails, as the principles apply broadly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Application
&lt;/h2&gt;

&lt;p&gt;At the application layer, it’s clear that handling substantial traffic isn’t solely the server team's responsibility. Developers must ensure that the code is optimized to handle the load, necessitating close collaboration across the team to minimize or eliminate downtime.&lt;/p&gt;

&lt;p&gt;The infrastructure team can do a lot, but if developers consistently push two-second queries, the servers will inevitably struggle. We must acknowledge limits, particularly financial ones. Short-term solutions might work, but the key question is: How short is our short-term?&lt;/p&gt;

&lt;p&gt;Applications vary, but certain best practices apply universally, such as deferring non-critical tasks to background jobs. Even simple actions like voting can be processed in the background, leading to quicker responses.&lt;/p&gt;

&lt;p&gt;As previously stated, use a CDN whenever possible to serve assets and cache pages. Understanding and leveraging cache-control can significantly reduce the load on your servers.&lt;/p&gt;

&lt;p&gt;For heavy database queries, consider caching results in faster storage solutions like Redis. The goal is to minimize the workload during the main request. Fewer tasks during the request equate to quicker response times.&lt;/p&gt;

&lt;p&gt;Beyond simply caching, it's crucial to understand how caching works. For instance, what happens if a cache expires just as a thousand requests per second hit? Strategizing for such scenarios is key to reducing origin server load.&lt;/p&gt;

&lt;p&gt;Addressing the original scalability question, particularly in e-commerce, one of the biggest challenges is managing stock availability during high demand, akin to new iPhone releases or limited event tickets.&lt;/p&gt;

&lt;p&gt;Locking stock without causing database lockups is a delicate process. I’ve implemented a solution for this, but I cannot guarantee it could handle the intense traffic like @khairulaming's sales events.&lt;/p&gt;

&lt;p&gt;In a Malaysian context, holding stock for a brief period is essential to allow users to complete transactions, especially when they must navigate to their banks’ payment interfaces.&lt;/p&gt;

&lt;p&gt;When payment failures occur, it's necessary to return the unclaimed stock to the pool for others to purchase. This often involves coupon systems and requires updating user wallets concurrently, which is resource-intensive due to the need for transactions and row locking.&lt;/p&gt;
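
&lt;p&gt;A toy model of that hold-and-release flow, with a Mutex standing in for the database row lock (a real system would do this inside a transaction, with payment timeouts driving the release):&lt;/p&gt;

```ruby
# Reserve stock when checkout starts; return it to the pool when the
# payment fails so others can buy it.
class StockPool
  attr_reader :available

  def initialize(count)
    @lock = Mutex.new
    @available = count
  end

  def reserve
    @lock.synchronize do
      return false if @available.zero?   # sold out, nothing to hold
      @available -= 1
      true
    end
  end

  def release
    # Payment failed: put the unclaimed unit back into the pool.
    @lock.synchronize { @available += 1 }
  end
end
```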

&lt;p&gt;These operations are costly, and while they may handle a substantial number of requests, the scalability to the level of something like @khairulaming’s order volume or COVID-19 appointment systems remains uncertain.&lt;/p&gt;

&lt;p&gt;Rapid deployment capability is also critical, particularly for CI/CD processes. This ensures that any issues can be addressed quickly, which is vital during high-traffic events to avoid frustrating users.&lt;/p&gt;

&lt;p&gt;From my experience, database optimization is frequently the bottleneck. Developers unfamiliar with tuning and structuring their databases will encounter issues well before traffic peaks. With an adequately optimized database, excessive caching might be unnecessary.&lt;/p&gt;

&lt;p&gt;At a minimum, developers should utilize tools like EXPLAIN to diagnose slow queries. Eliminating N+1 queries and applying appropriate indexing are fundamental skills that remain vital across all database platforms.&lt;/p&gt;
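
&lt;p&gt;To see why N+1 queries hurt, here's a pure-Ruby illustration with a fake repository that counts the "queries" it runs. Fetching authors one by one issues N+1 queries, while a single batched fetch (an IN list in real SQL) issues only two:&lt;/p&gt;

```ruby
# Fake repository that counts queries, to contrast the N+1 pattern
# with a batched fetch.
class FakeRepo
  attr_reader :query_count

  def initialize(posts)
    @posts = posts
    @query_count = 0
  end

  def all_posts
    @query_count += 1
    @posts
  end

  def author_for(post)
    # One query per post: this is the N+1 trap.
    @query_count += 1
    "author-#{post}"
  end

  def authors_for(posts)
    # One batched query for all posts at once.
    @query_count += 1
    posts.map { |p| "author-#{p}" }
  end
end
```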

&lt;h2&gt;
  
  
  Database
&lt;/h2&gt;

&lt;p&gt;The database is often the most challenging aspect, not because other areas are problem-free, but due to its complexity and the impact of its performance. Let's explore some straightforward infrastructural optimizations.&lt;/p&gt;

&lt;p&gt;Implement connection pooling, whether it's HA Proxy, RDS Proxy, pgPool, pgBouncer, or similar. It's crucial to comprehend the nuances between application and server-side pooling. Connections are costly in terms of memory, so monitor usage and set appropriate quotas.&lt;/p&gt;

&lt;p&gt;Pooling allows your application to reuse open connections, which can be more memory-efficient. However, you'll need to understand how these applications work and may need to adjust them based on your specific application needs.&lt;/p&gt;
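
&lt;p&gt;The reuse idea can be sketched in a few lines: a fixed set of connections is created once and handed out from a thread-safe queue, so callers share open connections instead of paying the setup cost each time (the "connections" here are just strings for illustration):&lt;/p&gt;

```ruby
# Minimal connection pool: a fixed number of connections live in a
# thread-safe queue and are checked out and returned around each use.
class TinyPool
  def initialize(size)
    @queue = Queue.new
    size.times { |i| @queue.push("conn-#{i}") }   # stand-ins for real connections
  end

  def with_connection
    conn = @queue.pop        # blocks when the pool is exhausted
    yield conn
  ensure
    @queue.push(conn) if conn
  end
end
```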

&lt;p&gt;Employ replicas and balance the load between them. It's ideal to dedicate a server to a single application and split read operations based on usage. Adjusting the master server for writes is more complex; I've seen professionals split a primary server to support an increased number of write operations.&lt;/p&gt;

&lt;p&gt;Splitting the master server and data partitioning or sharding are advanced solutions that require careful configuration. These methods are complex and should not be your first line of approach—start with simpler solutions.&lt;/p&gt;

&lt;p&gt;Avoid default configurations. Pinpoint the issues you're facing, then focus on the relevant settings. Be mindful of potential cascading effects when changing configurations. Always monitor the changes and revert if they don't yield improvements.&lt;/p&gt;

&lt;p&gt;Understand your database's strengths and limitations. Your technology choice may not be the best, but proficiency can help resolve most issues. For instance, Uber switched to MySQL, although PostgreSQL advocates might disagree with their reasons (&lt;a href="https://bit.ly/3sG6I5S"&gt;https://bit.ly/3sG6I5S&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Utilize raw queries when necessary. ORMs are convenient but come with overhead, often requiring more memory. Don't hesitate to write direct queries to leverage your database's full capabilities.&lt;/p&gt;

&lt;p&gt;Like with EC2, don't focus solely on CPU usage. Consider other metrics like memory and IOPS, which can limit your database depending on the specifications you choose.&lt;/p&gt;

&lt;p&gt;Conduct schema and data migrations cautiously. Aim for zero downtime, even if it means multiple stages or phases.&lt;/p&gt;

&lt;p&gt;While proper database normalization is common, there are times when denormalization is necessary to enhance performance. Step outside the standard practices if needed, but ensure you understand the trade-offs.&lt;/p&gt;

&lt;p&gt;Choose the right database for your needs, not just what's trendy. If a part of your app benefits from key-value storage, consider Redis. Mixing and matching technologies is fine, but be cognizant of the time and money costs for maintenance.&lt;/p&gt;

&lt;p&gt;I won't delve deeply into Redis. The principles are similar to other technologies: use replicas, connection pooling, understand the tech and its uses. Asking why Redis is faster can lead to a deeper understanding—it's not just a key-value store; it offers much more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Let's approach the final chapter I hadn't planned on writing: Observability.&lt;/p&gt;

&lt;p&gt;I believe that's the term, though I’m not a DevOps/SRE—so that's my disclaimer for any inaccuracies, lols.&lt;/p&gt;

&lt;p&gt;Assuming we've optimized our code and servers, there's still one crucial element to consider, and it should be addressed concurrently: monitoring tools. It's essential to have a system that alerts you to problems.&lt;/p&gt;

&lt;p&gt;I'll share tools I'm familiar with, acknowledging my preference for paid services—though I wasn't always like this. Discussing their use might lead to identifying similar, free alternatives.&lt;/p&gt;

&lt;p&gt;By combining these tools, we can gain the most benefit.&lt;/p&gt;

&lt;p&gt;Application Performance Management (APM) is one category. I use @newrelic, which offers a comprehensive suite for monitoring request queues, searching logs, tracing details, historical performance, external call performance, among other metrics.&lt;/p&gt;

&lt;p&gt;For real-time server stats like CPU usage, network, and IOPS, I turn to AWS CloudWatch. AWS Performance Insights is also a go-to as it provides a real-time overview of database performance.&lt;/p&gt;

&lt;p&gt;I’ve started using dashboards to consolidate AWS information, allowing me to view everything, including ASG and ALB metrics, on one screen.&lt;/p&gt;

&lt;p&gt;@pganalyze is another invaluable tool for identifying slow queries, underused indexes, bloat stats, idle connections, and more. It offers insights into query throughput and IOPS, though there's some overlap with Performance Insights. One downside is its data refresh rate, which is every 10 minutes by default but might be adjustable.&lt;/p&gt;

&lt;p&gt;Error tracking is another critical area. While New Relic handles this, dedicated error tracking services like Bugsnag and Sentry offer specialized capabilities, though they can become costly with increased traffic.&lt;/p&gt;

&lt;p&gt;Integration with notification services is also crucial. PagerDuty or even Slack can be used to alert engineers of issues.&lt;/p&gt;

&lt;p&gt;The goal of these tools is to provide immediate insights into what's happening, enabling quick identification and response to issues for both short-term fixes and long-term solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;I think I have covered most of what I want to talk about. I would love to elaborate more on some of them, but this is Twitter. So, let's go to the summary.&lt;/p&gt;

&lt;p&gt;I can't stress enough that what works at another place will not necessarily work the same for you. Learn from others and use them as guidance, but don't expect the same result 100%. This is the main reason I keep on using "it depends". Same with this thread, it may work for you or it may not 🤷.&lt;/p&gt;

&lt;p&gt;Keep on learning. Don't simply say "auto-scale" without understanding the consequences and, of course, be humble. There are still too many things to learn. I've made mistakes by thinking my way is the only way before, and once I learned more, I realized the answer can be different.&lt;/p&gt;

&lt;p&gt;Change the mindset. Most of the time, the solution is not that clear. There is always a trade-off. But which one am I okay to go with? Win some, lose some. It doesn't have to be perfect, but it has to work, if possible, so that we can get back to sleep lol.&lt;/p&gt;

&lt;p&gt;Hopefully, this will help someone. I know it would have helped me three years ago.&lt;/p&gt;

&lt;p&gt;Thanks for reading and THE END.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>devops</category>
      <category>aws</category>
      <category>postgres</category>
    </item>
    <item>
      <title>How to do big upgrades with small changes</title>
      <dc:creator>Amree Zaid</dc:creator>
      <pubDate>Fri, 18 Nov 2022 10:06:35 +0000</pubDate>
      <link>https://dev.to/amree/how-to-do-big-upgrades-with-small-changes-55op</link>
      <guid>https://dev.to/amree/how-to-do-big-upgrades-with-small-changes-55op</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I'd like to share how I manage multiple library/gem/package updates at the same time when I'm working on big upgrades. But before we start, it's important to know why.&lt;/p&gt;

&lt;p&gt;Big upgrades require changing a lot of files. The number of files you need to change can easily get out of hand: you may have to replace code due to deprecations, or deploy in multiple stages in order not to break production.&lt;/p&gt;

&lt;p&gt;That is why I strive for smaller changes that keep the final diff small. In the end, the final Pull Request will only have a few changes, which helps your colleagues review them safely. Yeah, no one is gonna review 50+ changed files.&lt;/p&gt;

&lt;p&gt;It's very common for me to have more than 30 PRs opened (not all at the same time). In fact, it's not that weird to have 10+ PRs open, waiting for approval or deployment. So, it can be very confusing to maintain all of them.&lt;/p&gt;

&lt;p&gt;It's not just about having small changes; it's also about being able to jump into the future, to a state where everything is merged, so that you can work on the next upgrade at the same time.&lt;/p&gt;

&lt;p&gt;In my case, I was upgrading v5.2.7 -&amp;gt; v5.2.8 -&amp;gt; v6.1.7 -&amp;gt; v7.0.4 (I was also upgrading Ruby at the same time, but I won't go into that, to simplify the explanation). The good thing is that I didn't have to wait for everything to be ready before working on the next upgrade. When I was working on v5.2.8, I had already started working on v7.0.4 while waiting for code review and deployments. To me, that's a BIG ADVANTAGE.&lt;/p&gt;

&lt;h2&gt;
  
  
  How am I doing it
&lt;/h2&gt;

&lt;p&gt;Let's talk about how I'm doing it. Pretty sure someone else has done this before. It doesn't require tools (you can create a script if you want to), just need some basic git commands and some small notes. It's a little bit hard to explain this without a visualization, so, I've created one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkres49ptvd3oaduajqqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkres49ptvd3oaduajqqo.png" alt="Image description" width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Every PR must be based on master (master &amp;lt;- pkg-upgrade-1)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This means you can ensure this particular change won't break production on its own, and you can always rebase from master whenever needed. If something goes wrong after deployment, you'll immediately know the cause; with multiple upgrades in one PR, you'd have to hunt down the part that is causing the problem.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;If the change requires another change, use a different base (master &amp;lt;- pkg-upgrade-1 &amp;lt;- pkg-upgrade-2)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The world is never that simple; there's always a chance that one upgrade depends on another package before you can work on it. Obviously, you could wait for the first change to be deployed, but why should you?&lt;/p&gt;
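&lt;p&gt;Techniques 1 and 2 are just plain branching. Here's a minimal sketch using the pkg-upgrade-* branch names from above (the throwaway demo repo setup at the top is only there so the commands run as-is):&lt;/p&gt;

```shell
# Throwaway demo repo so the commands below run as-is
cd "$(mktemp -d)"
git init -q
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m 'init'

# Technique 1: every small upgrade branches off master
git checkout -q -b pkg-upgrade-1   # master &lt;- pkg-upgrade-1

# Technique 2: a dependent upgrade stacks on top of the first
git checkout -q -b pkg-upgrade-2   # master &lt;- pkg-upgrade-1 &lt;- pkg-upgrade-2
```

&lt;p&gt;The stacked branch only needs to be rebased when its base branch moves, which is what the rebase commands further down handle.&lt;/p&gt;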

&lt;ol start="3"&gt;
&lt;li&gt;Use a 'before' branch before the big upgrade (master &amp;lt;- before-big-upgrade (upgrade1, upgrade2, ..)  &amp;lt;- the-big-upgrade)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Big upgrades usually require multiple changes. We can't use the 2nd technique here, as it's only suitable for one or two changes. We handle this by creating a branch that combines every small change that hasn't been merged to master yet.&lt;/p&gt;

&lt;p&gt;That 'before' branch will be used as the base for the big upgrade PR. This is HOW WE TRAVEL INTO THE FUTURE. You can even check whether all your changes actually work together here. The final branch/PR should pass your CI.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Use the combination of those techniques to work on another major upgrade&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We can combine them and the result would look like this: master &amp;lt;- major-upgrade-1 &amp;lt;- major-upgrade-2 &amp;lt;- major-upgrade-3&lt;/p&gt;

&lt;p&gt;It may look simple, but the 'major-upgrade-1' PR is itself a combination of techniques 1 to 3. It can be a bit hard to wrap your mind around on a first read, but take a look at the next image. Hopefully, it will help.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcvrvcln6dzxabjr65xz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcvrvcln6dzxabjr65xz.png" alt="Image description" width="300" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are some git commands that I use to handle everything I mentioned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Sometimes, I need to update a branch that depends on another branch&lt;/span&gt;
&lt;span class="c"&gt;# This has to be done before we work on others&lt;/span&gt;
&lt;span class="c"&gt;# Usually, I squashed and keep one commit&lt;/span&gt;
git checkout upgrade-pkg-1&lt;span class="p"&gt;;&lt;/span&gt; git rebase &lt;span class="nt"&gt;--onto&lt;/span&gt; master HEAD~1
git checkout upgrade-pkg-2&lt;span class="p"&gt;;&lt;/span&gt; git rebase &lt;span class="nt"&gt;--onto&lt;/span&gt; upgrade-pkg-1 HEAD~1

&lt;span class="c"&gt;# Reset to master&lt;/span&gt;
&lt;span class="c"&gt;# 'before' branch should be refreshed periodically to ensure it won't break production&lt;/span&gt;
&lt;span class="c"&gt;# It's ok to reset since we are not creating a PR for it&lt;/span&gt;
git checkout before-rails7_0_4 &lt;span class="p"&gt;;&lt;/span&gt; git reset &lt;span class="nt"&gt;--hard&lt;/span&gt; master


&lt;span class="c"&gt;# Merge those branch that hasn't been approved / wip&lt;/span&gt;
&lt;span class="c"&gt;# Normally, I have close to 10 package that needs to merged&lt;/span&gt;
git merge upgrade-pkg-2 &lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="c"&gt;# I'm skipping pkg-1 because this pkg-2 has the changes&lt;/span&gt;
upgrade-pkg-3 &lt;span class="se"&gt;\&lt;/span&gt;
upgrade-pkg-4

&lt;span class="c"&gt;# Sometimes, not all packages can just be merged&lt;/span&gt;
&lt;span class="c"&gt;# This requires manual merge due to the conflicts&lt;/span&gt;
git merge upgrade-pkg-5 &lt;span class="c"&gt;# resolve conflicts&lt;/span&gt;

&lt;span class="c"&gt;# The straightforward verion&lt;/span&gt;
&lt;span class="c"&gt;# Rebase final branch to the 'before' branch&lt;/span&gt;
git checkout am-upgrade-rails7_0_4&lt;span class="p"&gt;;&lt;/span&gt; git rebase &lt;span class="nt"&gt;--onto&lt;/span&gt; am-before-rails7_0_4 HEAD~1

&lt;span class="c"&gt;# This can happen:&lt;/span&gt;
&lt;span class="c"&gt;# I had to do something like this when the upgrade has major conflict&lt;/span&gt;
git checkout am-upgrade-rails7_0_4 &lt;span class="p"&gt;;&lt;/span&gt; git reset &lt;span class="nt"&gt;--soft&lt;/span&gt; HEAD~1
git reset
git checkout Gemfile.lock
git add .&lt;span class="p"&gt;;&lt;/span&gt; git stash&lt;span class="p"&gt;;&lt;/span&gt; git reset &lt;span class="nt"&gt;--hard&lt;/span&gt; am-before-rails7_0_4
git stash pop &lt;span class="c"&gt;# fix conflicts&lt;/span&gt;
bundle update rails rails-i18n
git add &lt;span class="nb"&gt;.&lt;/span&gt;
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s1"&gt;'Upgrade Rails from v6.1.7 to v7.0.4'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Is this the best way to do this? I'm not sure, but it certainly helped me keep working without being blocked by pending PRs or deployments. Do remember, I don't deploy everything immediately; I usually leave gaps between deployments so that I can monitor for regressions.&lt;/p&gt;

&lt;p&gt;I highly doubt people will read this far lol, but if you did, thanks! I'm planning to do a presentation on this in the future, and writing this will certainly help me explain the method better.&lt;/p&gt;

</description>
      <category>github</category>
    </item>
    <item>
      <title>Rails Connection Pool vs PgBouncer</title>
      <dc:creator>Amree Zaid</dc:creator>
      <pubDate>Wed, 31 Aug 2022 04:56:17 +0000</pubDate>
      <link>https://dev.to/amree/rails-connection-pool-vs-pgbouncer-2map</link>
      <guid>https://dev.to/amree/rails-connection-pool-vs-pgbouncer-2map</guid>
      <description>&lt;p&gt;Rails by default comes with connection pooler on the application side but I always wonder what is the difference if we use another connection pooler such as PgBouncer. So, here is some notes on trying to understand &lt;em&gt;some&lt;/em&gt; of it.&lt;/p&gt;

&lt;p&gt;To try this out, I'm using Docker so that I don't have to install extra applications:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;conn-poc&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;conn-poc
&lt;span class="nv"&gt;$ &lt;/span&gt;rails new blog &lt;span class="nt"&gt;--api&lt;/span&gt; &lt;span class="nt"&gt;-T&lt;/span&gt; &lt;span class="nt"&gt;--database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;bouncer
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;db

&lt;span class="c"&gt;# create docker network so that PgBouncer and PostgreSQL can communicate with eacher other&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;docker network create conn-poc-1-net

&lt;span class="c"&gt;# start postgresql&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;db
&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--net&lt;/span&gt; conn-poc-1-net &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; conn-poc-1-pg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgres &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgres &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;blog_development &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:/var/lib/postgresql/data  &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 6432:5432 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-it&lt;/span&gt; postgres:13.8-alpine

&lt;span class="c"&gt;# start pgbouncer&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;bouncer
&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--net&lt;/span&gt; conn-poc-1-net &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; conn-poc-1-bouncer &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"postgres://postgres:postgres@conn-poc-1-pg/blog_development"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:/etc/pgbouncer &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 7432:5432 &lt;span class="se"&gt;\&lt;/span&gt;
  edoburu/pgbouncer


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;PgBouncer config:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;

&lt;span class="nn"&gt;[databases]&lt;/span&gt;
&lt;span class="py"&gt;blog_development&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;host=conn-poc-1-pg port=5432 user=postgres&lt;/span&gt;

&lt;span class="nn"&gt;[pgbouncer]&lt;/span&gt;
&lt;span class="py"&gt;listen_addr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="py"&gt;listen_port&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;5432&lt;/span&gt;
&lt;span class="py"&gt;user&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
&lt;span class="py"&gt;auth_file&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;/etc/pgbouncer/userlist.txt&lt;/span&gt;
&lt;span class="py"&gt;auth_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;md5&lt;/span&gt;
&lt;span class="py"&gt;ignore_startup_parameters&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;extra_float_digits&lt;/span&gt;
&lt;span class="py"&gt;pool_mode&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;transaction&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Update some of the code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# config/database.yml&lt;/span&gt;
&lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nl"&gt;&amp;amp;default&lt;/span&gt;
  &lt;span class="na"&gt;adapter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql&lt;/span&gt;
  &lt;span class="na"&gt;encoding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unicode&lt;/span&gt;
  &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres"&lt;/span&gt;
  &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres"&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;%= ENV.fetch("DB_PORT") %&amp;gt;&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;localhost&lt;/span&gt;
  &lt;span class="na"&gt;pool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;%= ENV.fetch("RAILS_MAX_THREAD") %&amp;gt;&lt;/span&gt;
  &lt;span class="na"&gt;checkout_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;idle_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;prepared_statements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# config/puma.rb&lt;/span&gt;
&lt;span class="n"&gt;workers&lt;/span&gt; &lt;span class="no"&gt;ENV&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"WEB_CONCURRENCY"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Set up the DB and create a table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;rails g model User name
&lt;span class="nv"&gt;$ &lt;/span&gt;rails db:create db:migrate


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# config/routes.rb&lt;/span&gt;
&lt;span class="no"&gt;Rails&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;application&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;routes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;draw&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;root&lt;/span&gt; &lt;span class="s2"&gt;"home#index"&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c1"&gt;# app/controllers/home_controller.rb&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HomeController&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationController&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;index&lt;/span&gt;
    &lt;span class="n"&gt;render&lt;/span&gt; &lt;span class="ss"&gt;json: &lt;/span&gt;&lt;span class="no"&gt;User&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c1"&gt;# config/environments/development.rb&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hosts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clear&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can connect to the PostgreSQL with:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="c"&gt;# 6432 = Connect to PostgreSQL directly&lt;/span&gt;
&lt;span class="c"&gt;# 7432 = Connect throught PgBouncer&lt;/span&gt;
psql &lt;span class="nt"&gt;-U&lt;/span&gt; postgres &lt;span class="nt"&gt;-h&lt;/span&gt; localhost &lt;span class="nt"&gt;-d&lt;/span&gt; blog_development &lt;span class="nt"&gt;-p&lt;/span&gt; 6432


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can use this SQL to check the number of connections:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;datname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'blog_development'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To run the server, I use this command (the values change depending on what I want to try):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;DB_PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;7432 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;RAILS_MAX_THREAD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;WEB_CONCURRENCY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3 &lt;span class="se"&gt;\&lt;/span&gt;
  rails s &lt;span class="nt"&gt;-b&lt;/span&gt; 0.0.0.0


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To send requests, we can use this command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="c"&gt;# Apache benchmark&lt;/span&gt;
docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; jordi/ab &lt;span class="nt"&gt;-c&lt;/span&gt; 500 &lt;span class="nt"&gt;-n&lt;/span&gt; 500 http://host.docker.internal:3000/


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fua2aua1h956nadtaj9v9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fua2aua1h956nadtaj9v9.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Calculating the number of database connections needed
&lt;/h2&gt;

&lt;p&gt;The first part is to figure out the max connections per process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The pool size in &lt;code&gt;database.yml&lt;/code&gt; depends on how many threads/how much concurrency you set&lt;/li&gt;
&lt;li&gt;If you set &lt;code&gt;RAILS_MAX_THREAD&lt;/code&gt; to 10, then that's the pool size you need&lt;/li&gt;
&lt;li&gt;But you might need a different value when you have background jobs, e.g. Sidekiq&lt;/li&gt;
&lt;li&gt;Sidekiq might use a different concurrency, so if its concurrency is set to 20, you need to increase the pool size to accommodate that. This means the web process might end up with a bigger pool than it needs, unless you can specify different configs for the worker and the web server (assuming they live on different servers)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once we have figured that out, we need to think about the number of processes we will have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A process here means each Puma/Sidekiq process you run&lt;/li&gt;
&lt;li&gt;In this exercise, I'm using &lt;code&gt;WEB_CONCURRENCY&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Puma: WEB_CONCURRENCY=2&lt;/li&gt;
&lt;li&gt;Puma: RAILS_MAX_THREAD=5&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means, we need to have at least 2x5 = 10 connections. We also need to set the DB pool size to 5.&lt;/p&gt;

&lt;p&gt;Let's add Sidekiq to the mix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bundle exec sidekiq -C config/sidekiq/payment.yml&lt;/li&gt;
&lt;li&gt;bundle exec sidekiq -C config/sidekiq/data.yml&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Assuming &lt;code&gt;concurrency&lt;/code&gt; is set to 20 in each of those configs, we need 2 x 20 = 40 connections. The DB pool should be set to 20 in this case.&lt;/p&gt;

&lt;p&gt;So, we need a total of 10 + 40 = 50 connections.&lt;/p&gt;

&lt;p&gt;Again, we need to ensure proper DB pool size is set.&lt;/p&gt;
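&lt;p&gt;The arithmetic above is simple enough to sanity-check in a few lines of shell, using the numbers from the example (2 Puma workers with 5 threads each, plus 2 Sidekiq processes with concurrency 20):&lt;/p&gt;

```shell
# Connections needed = processes x threads, summed across every service
puma_workers=2          # WEB_CONCURRENCY
puma_threads=5          # RAILS_MAX_THREAD, also the web pool size
sidekiq_processes=2     # payment + data
sidekiq_concurrency=20  # per-process concurrency, also its pool size

web=$(( puma_workers * puma_threads ))
jobs=$(( sidekiq_processes * sidekiq_concurrency ))
total=$(( web + jobs ))

echo "web=$web sidekiq=$jobs total=$total"  # web=10 sidekiq=40 total=50
```

&lt;p&gt;If the two Sidekiq configs used different concurrency values, you would sum each process's concurrency individually instead of multiplying.&lt;/p&gt;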

&lt;h2&gt;
  
  
  Notes
&lt;/h2&gt;

&lt;p&gt;These are some of the notes based on my observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Opening rails console won't immediately open a connection&lt;/li&gt;
&lt;li&gt;Without PgBouncer, Rails will immediately open all possible connections&lt;/li&gt;
&lt;li&gt;PgBouncer increased the connections after I ran the benchmark a couple of times, but it never reached the max&lt;/li&gt;
&lt;li&gt;Both Rails and PgBouncer have options to disconnect idle connections&lt;/li&gt;
&lt;li&gt;Without the right pool/thread size, Rails will throw an &lt;code&gt;ActiveRecord::ConnectionTimeoutError&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;I had to use transaction mode with the &lt;code&gt;prepared_statements&lt;/code&gt; option disabled. Need to read more about this&lt;/li&gt;
&lt;li&gt;I'm not splitting the read and write&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I assumed PgBouncer would always use fewer connections, which is kind of true, but I'm not sure whether the number of connections keeps increasing to the max as more requests come in&lt;/li&gt;
&lt;li&gt;We can rely on the timeout to remove the idle connections for both Rails and PgBouncer&lt;/li&gt;
&lt;li&gt;PgBouncer is definitely a must if we are connecting to the DB from various applications and one of them might not have their own pool manager&lt;/li&gt;
&lt;li&gt;Based on some searching, it's not possible to disable Rails' connection pool and rely solely on PgBouncer&lt;/li&gt;
&lt;li&gt;I think PgBouncer can serve multiple clients' queries over a single server connection thanks to its multiplexing, but I can't confirm this, yet&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://devcenter.heroku.com/articles/best-practices-pgbouncer-configuration#pgbouncer-s-connection-pooling-modes" rel="noopener noreferrer"&gt;https://devcenter.heroku.com/articles/best-practices-pgbouncer-configuration#pgbouncer-s-connection-pooling-modes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.pgbouncer.org/config.html" rel="noopener noreferrer"&gt;https://www.pgbouncer.org/config.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://makandracards.com/makandra/45360-using-activerecord-with-threads-might-use-more-database-connections-than-you-think" rel="noopener noreferrer"&gt;https://makandracards.com/makandra/45360-using-activerecord-with-threads-might-use-more-database-connections-than-you-think&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://maxencemalbois.medium.com/the-ruby-on-rails-database-connections-pool-4ce1099a9e9f" rel="noopener noreferrer"&gt;https://maxencemalbois.medium.com/the-ruby-on-rails-database-connections-pool-4ce1099a9e9f&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/hopsoft/optimizing-rails-connections-4gkd"&gt;https://dev.to/hopsoft/optimizing-rails-connections-4gkd&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rails</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Ruby on Rails and Docker for Testing</title>
      <dc:creator>Amree Zaid</dc:creator>
      <pubDate>Thu, 13 Jan 2022 02:46:10 +0000</pubDate>
      <link>https://dev.to/amree/ruby-on-rails-and-docker-for-testing-4n8a</link>
      <guid>https://dev.to/amree/ruby-on-rails-and-docker-for-testing-4n8a</guid>
      <description>&lt;p&gt;In the previous &lt;a href="https://dev.to/amree/ruby-on-rails-development-using-docker-o1d"&gt;post&lt;/a&gt;, we managed to figure out a way to make our Docker setup work for Development. It’s time to figure out how we can run our tests with it. In the end, we should be able to run single and multiple tests. This also includes Capybara tests using headless Chrome.&lt;/p&gt;

&lt;p&gt;We will also look into how to use multiple docker-compose files to override what we have based on the environment. But we’ll start with something simple first.&lt;/p&gt;

&lt;p&gt;First, let's install RSpec. I'll skip the details and refer you to this &lt;a href="https://relishapp.com/rspec/rspec-rails/docs/gettingstarted"&gt;guide&lt;/a&gt;. However, we need to update &lt;code&gt;.rspec&lt;/code&gt; to be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--require rails_helper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without that, we will get an uninitialized constant error.&lt;/p&gt;

&lt;p&gt;The first thing we need to do is to prepare the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose run &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"RAILS_ENV=test"&lt;/span&gt; web rails db:create db:migrate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is quite straightforward: we just override &lt;code&gt;RAILS_ENV&lt;/code&gt; to &lt;code&gt;test&lt;/code&gt; and prepare the database for that environment. However, I had problems because I was using &lt;code&gt;DATABASE_URL&lt;/code&gt; from &lt;code&gt;docker-compose&lt;/code&gt;, and it overrides the database name.&lt;/p&gt;

&lt;p&gt;Maybe there are better ways to do this, but this is how I fixed it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# config/database.yml&lt;/span&gt;
&lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nl"&gt;&amp;amp;default&lt;/span&gt;
  &lt;span class="na"&gt;adapter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql&lt;/span&gt;
  &lt;span class="na"&gt;encoding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unicode&lt;/span&gt;
  &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;%= ENV['DATABASE_USERNAME'] %&amp;gt;&lt;/span&gt;
  &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;%= ENV['DATABASE_PASSWORD'] %&amp;gt;&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;%= ENV['DATABASE_HOST'] %&amp;gt;&lt;/span&gt;
  &lt;span class="na"&gt;pool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %&amp;gt;&lt;/span&gt;

&lt;span class="na"&gt;development&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;*default&lt;/span&gt;
  &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;blog_development&lt;/span&gt;

&lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;*default&lt;/span&gt;
  &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;blog_test&lt;/span&gt;

&lt;span class="na"&gt;production&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;*default&lt;/span&gt;
  &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app_production&lt;/span&gt;
  &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
  &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;%= ENV['APP_DATABASE_PASSWORD'] %&amp;gt;&lt;/span&gt;

&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DATABASE_USERNAME=postgres&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DATABASE_PASSWORD=password&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DATABASE_HOST=db&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, we only provide the username, password and host. We will let the database name change based on the environment it is being run.&lt;/p&gt;

&lt;p&gt;Add a simple spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# spec/models/post_spec.rb&lt;/span&gt;
&lt;span class="n"&gt;describe&lt;/span&gt; &lt;span class="no"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;type: :model&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="s2"&gt;"can be created successfully"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;

    &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;

    &lt;span class="n"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;persisted?&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;eql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kp"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can then run the spec by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"RAILS_ENV=test"&lt;/span&gt; web bundle &lt;span class="nb"&gt;exec &lt;/span&gt;rspec spec/models/post_spec.rb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multiple Compose Files
&lt;/h3&gt;

&lt;p&gt;As you can see from the previous guide, we had to override the environment variable using &lt;code&gt;-e&lt;/code&gt;. This is OK for one or two variables, but it doesn't scale beyond that. docker-compose has an &lt;a href="https://docs.docker.com/compose/extends/"&gt;extend feature&lt;/a&gt; that allows us to use a base compose file and override it with another one.&lt;/p&gt;

&lt;p&gt;Using the same example from the above, we can create a new docker-compose file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.test.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;web&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;RAILS_ENV=test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we can run the spec with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose.yml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose.test.yml &lt;span class="se"&gt;\&lt;/span&gt;
  run &lt;span class="nt"&gt;--rm&lt;/span&gt; web bundle &lt;span class="nb"&gt;exec &lt;/span&gt;rspec spec/models/post_spec.rb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means we can separate different configs depending on the environment. We just need a base config and override it with the files we specify.&lt;/p&gt;

&lt;p&gt;Do remember not to commit the &lt;code&gt;docker-compose.*.yml&lt;/code&gt; files as they might contain sensitive information. Create a template file such as &lt;code&gt;docker-compose.test.template.yml&lt;/code&gt; for others to copy and adjust accordingly.&lt;/p&gt;
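&lt;p&gt;One way to do that is to ignore the environment-specific files while keeping the template in version control. A minimal sketch; the file names here are the ones used in this guide:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# .gitignore
docker-compose.test.yml
docker-compose.override.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;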

&lt;p&gt;Another important note: if we have a &lt;code&gt;docker-compose.override.yml&lt;/code&gt;, we don’t have to specify &lt;code&gt;-f&lt;/code&gt; to override the compose config, as Docker will apply it automatically. &lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;docker-compose config&lt;/code&gt; to see the final merged config, passing the same &lt;code&gt;-f&lt;/code&gt; files if needed. This is helpful when debugging complex configurations.&lt;/p&gt;
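&lt;p&gt;For example, to print the merged config that the test run above would use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose \
  -f docker-compose.yml \
  -f docker-compose.test.yml \
  config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;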

&lt;h3&gt;
  
  
  Browser Testing
&lt;/h3&gt;

&lt;p&gt;Add a simple spec for feature testing first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# spec/features/user_creates_post_spec.rb&lt;/span&gt;
&lt;span class="no"&gt;RSpec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt; &lt;span class="s2"&gt;"User creates post"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;type: :system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;js: &lt;/span&gt;&lt;span class="kp"&gt;true&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;scenario&lt;/span&gt; &lt;span class="s2"&gt;"successfully"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;visit&lt;/span&gt; &lt;span class="n"&gt;posts_path&lt;/span&gt;

    &lt;span class="n"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;have_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"New Post"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Headless&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Add these files first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# spec/support/capybara.rb&lt;/span&gt;
&lt;span class="no"&gt;RSpec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;before&lt;/span&gt; &lt;span class="ss"&gt;:each&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;type: :system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;js: &lt;/span&gt;&lt;span class="kp"&gt;true&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"http://&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="no"&gt;ENV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'SELENIUM_REMOTE_HOST'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:4444/wd/hub"&lt;/span&gt;

    &lt;span class="n"&gt;driven_by&lt;/span&gt; &lt;span class="ss"&gt;:selenium&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;using: :chrome&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;options: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="ss"&gt;browser: :remote&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;url: &lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;desired_capabilities: :chrome&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="no"&gt;Capybara&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;server_host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sb"&gt;`/sbin/ip route|awk '/scope/ { print $9 }'`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;
    &lt;span class="no"&gt;Capybara&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;server_port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"43447"&lt;/span&gt;
    &lt;span class="n"&gt;session_server&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Capybara&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;server&lt;/span&gt;
    &lt;span class="no"&gt;Capybara&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;app_host&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"http://&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;session_server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;host&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;session_server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;port&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.test.yml&lt;/span&gt;
&lt;span class="ss"&gt;services:
  web:
    environment:
      &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="no"&gt;RAILS_ENV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt;
      &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="no"&gt;SELENIUM_REMOTE_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;selenium&lt;/span&gt;
    &lt;span class="ss"&gt;depends_on:
      &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;selenium&lt;/span&gt;

  &lt;span class="ss"&gt;selenium:
    image: &lt;/span&gt;&lt;span class="n"&gt;selenium&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;standalone&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;chrome&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, we are going to use a selenium container for the test to run. The &lt;code&gt;web&lt;/code&gt; container will access it at port &lt;code&gt;4444&lt;/code&gt;, and we do not need to publish that port since the containers communicate with each other over Docker’s internal network. We will also serve Capybara on port &lt;code&gt;43447&lt;/code&gt;.&lt;/p&gt;
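&lt;p&gt;To make the backtick line in &lt;code&gt;capybara.rb&lt;/code&gt; less magical, here is a small Ruby sketch of what the &lt;code&gt;awk&lt;/code&gt; filter extracts. The &lt;code&gt;ip route&lt;/code&gt; output below is hypothetical; the real values depend on your Docker network:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;# Hypothetical output of `/sbin/ip route` inside the web container
route_output = "default via 172.18.0.1 dev eth0\n" \
               "172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.5"

# awk '/scope/ { print $9 }' prints the 9th field of the "scope" line,
# i.e. the container's own IP that Selenium must use to reach Capybara
host_ip = route_output[/scope link src (\S+)/, 1]
puts host_ip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;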

&lt;p&gt;&lt;strong&gt;Non-headless&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TBD - This is something that I haven’t been able to figure out. I will definitely update it once I managed to solve it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parallel Testing
&lt;/h3&gt;

&lt;p&gt;TBD - More on this once I’ve figured the production part.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.cloudbees.com/blog/testing-rails-application-docker"&gt;https://www.cloudbees.com/blog/testing-rails-application-docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/compose/extends/"&gt;https://docs.docker.com/compose/extends/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rails</category>
      <category>docker</category>
    </item>
    <item>
      <title>Ruby on Rails Development using Docker</title>
      <dc:creator>Amree Zaid</dc:creator>
      <pubDate>Fri, 31 Dec 2021 16:30:25 +0000</pubDate>
      <link>https://dev.to/amree/ruby-on-rails-development-using-docker-o1d</link>
      <guid>https://dev.to/amree/ruby-on-rails-development-using-docker-o1d</guid>
      <description>&lt;p&gt;Check out my previous &lt;a href="https://dev.to/amree/introduction-to-ruby-on-rails-and-dockerfile-5a29"&gt;post&lt;/a&gt; if you want to start learning from just a Dockerfile. As you may have realized, it's not scalable if we keep on using the long command to manage our containers. We haven't even started talking about different services such as PostgreSQL and Redis, yet.&lt;/p&gt;

&lt;p&gt;In this post, we are aiming to use &lt;code&gt;docker-compose&lt;/code&gt; to make our development experience easier.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prepare our application
&lt;/h3&gt;

&lt;p&gt;I'm assuming we are starting fresh, without any Ruby on Rails application ready yet. I'll add a note later for those starting with an existing application.&lt;/p&gt;

&lt;p&gt;The files that we need in order to dockerize our application are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemfile - List of gems that are going to be installed&lt;/li&gt;
&lt;li&gt;Gemfile.lock - Locked version of &lt;code&gt;Gemfile&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;package.json - List of npm packages that are going to be installed&lt;/li&gt;
&lt;li&gt;yarn.lock - Locked version of &lt;code&gt;package.json&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why these files? They are the ones that determine which packages need to be installed for us to have a working Rails application.&lt;/p&gt;

&lt;p&gt;How do we get them without installing Rails on our system? Similar to what we did before, we need to create a temporary application and copy them from it.&lt;/p&gt;

&lt;p&gt;Create a directory and put this &lt;code&gt;Dockerfile&lt;/code&gt; inside:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; ruby:2.6.3&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nt"&gt;--no-install-recommends&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;  curl build-essential libpq-dev &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;  curl &lt;span class="nt"&gt;-sL&lt;/span&gt; https://deb.nodesource.com/setup_16.x | bash - &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;  curl &lt;span class="nt"&gt;-sS&lt;/span&gt; https://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add - &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"deb https://dl.yarnpkg.com/debian/ stable main"&lt;/span&gt; | &lt;span class="nb"&gt;tee&lt;/span&gt; /etc/apt/sources.list.d/yarn.list &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;  apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nodejs
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; yarn
&lt;span class="k"&gt;RUN &lt;/span&gt;gem &lt;span class="nb"&gt;install &lt;/span&gt;rails:6.1.4.4 bundler:2.3.4

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["rails", "server", "-b", "0.0.0.0"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, run these commands to generate a new rails application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; blog

docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:/app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; bundle-2.6.3:/bundle &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; node_modules-rails-6.1.4.1:/app/node_modules &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;BUNDLE_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/bundle &lt;span class="se"&gt;\&lt;/span&gt;
  blog &lt;span class="se"&gt;\&lt;/span&gt;
  bash

&lt;span class="c"&gt;# run this in the container&lt;/span&gt;
rails new &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once that is done, we will have all of the necessary files to really start our new application. &lt;/p&gt;

&lt;p&gt;Our next target is to run the basic application successfully. Replace the existing &lt;code&gt;Dockerfile&lt;/code&gt; with this content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; ruby:2.6.3&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nt"&gt;--no-install-recommends&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    curl &lt;span class="se"&gt;\
&lt;/span&gt;    build-essential &lt;span class="se"&gt;\
&lt;/span&gt;    libpq-dev &lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; curl &lt;span class="nt"&gt;-sL&lt;/span&gt; https://deb.nodesource.com/setup_16.x | bash - &lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; curl &lt;span class="nt"&gt;-sS&lt;/span&gt; https://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add - &lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"deb https://dl.yarnpkg.com/debian/ stable main"&lt;/span&gt; | &lt;span class="nb"&gt;tee&lt;/span&gt; /etc/apt/sources.list.d/yarn.list &lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nodejs &lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;gem &lt;span class="nb"&gt;install &lt;/span&gt;rails:6.1.4.4 bundler:2.3.4
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; Gemfile* /app&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;bundle &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--jobs&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; yarn
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; package.json /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; yarn.lock /app&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;yarn &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="k"&gt;ADD&lt;/span&gt;&lt;span class="s"&gt; . /app&lt;/span&gt;

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 3000&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["rails", "server", "-b", "0.0.0.0"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create another file &lt;code&gt;docker-compose.yml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;db&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:14.1-alpine&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;db:/var/lib/postgresql/data&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;POSTGRES_PASSWORD=password&lt;/span&gt;

  &lt;span class="na"&gt;webpacker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash -c "bin/webpack-dev-server"&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3035:3035"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.:/app&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;bundle:/bundle&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_modules:/app/node_modules&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BUNDLE_PATH=/bundle&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;WEBPACKER_DEV_SERVER_HOST=0.0.0.0&lt;/span&gt;

  &lt;span class="na"&gt;web&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash -c "rm -f /app/tmp/pids/server.pid &amp;amp;&amp;amp; rails s -b 0.0.0.0"&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:3000"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.:/app&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;bundle:/bundle&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_modules:/app/node_modules&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BUNDLE_PATH=/bundle&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL=postgres://postgres:password@db/blog_development&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;WEBPACKER_DEV_SERVER_HOST=webpacker&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;db&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;webpacker&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;db&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="na"&gt;bundle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="na"&gt;node_modules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run these commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose build
docker-compose run &lt;span class="nt"&gt;--rm&lt;/span&gt; web rails db:create db:migrate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;docker-compose up&lt;/code&gt; to see if it works. You can verify all containers are running by issuing &lt;code&gt;docker container ls&lt;/code&gt;. There should be three containers: &lt;code&gt;web&lt;/code&gt;, &lt;code&gt;webpacker&lt;/code&gt; and &lt;code&gt;db&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Open &lt;a href="http://localhost:3000/"&gt;http://localhost:3000/&lt;/a&gt; to ensure everything is good.&lt;/p&gt;
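&lt;p&gt;From the host, a quick smoke test with &lt;code&gt;curl&lt;/code&gt; works too, assuming the default &lt;code&gt;3000:3000&lt;/code&gt; port mapping from the compose file above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# prints only the HTTP status code; 200 means the app is up
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;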

&lt;p&gt;Generate some resources so that we have something to try out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose run web rails g scaffold Post title body:text
docker-compose run web rails db:migrate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Things to try out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a new Post and see if you can persist it&lt;/li&gt;
&lt;li&gt;Change &lt;code&gt;application.js&lt;/code&gt; and see if webpacker compiles the changes&lt;/li&gt;
&lt;li&gt;See if hot reload works when your JS file changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cool, now we have a running local rails application using Docker. We will start tackling some normal workflows one by one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Base Image
&lt;/h3&gt;

&lt;p&gt;This is actually something I realized later, but I’m adding it to the top to help us save some time when trying out other stuff mentioned below. Normally, I would use this config whenever I wanted to add a new service in our &lt;code&gt;docker-compose&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;console&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash -c "rails console"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.:/app&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;bundle:/bundle&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_modules:/app/node_modules&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BUNDLE_PATH=/bundle&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL=postgres://postgres:password@db/blog_development&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;WEBPACKER_DEV_SERVER_HOST=webpacker&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;db&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That will actually rebuild the image again, which means reinstalling all the apt packages, gems, npm packages, etc. You can try &lt;code&gt;docker-compose up&lt;/code&gt; and see what happens. I’ll wait.&lt;/p&gt;

&lt;p&gt;To prevent this, we can build an image once and use the same image for all of our services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;base_web&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;base_web&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/bin/true&lt;/span&gt;

  &lt;span class="na"&gt;webpacker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# build: . # remove&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;base_web&lt;/span&gt;

  &lt;span class="na"&gt;web&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# build: . # remove&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;base_web&lt;/span&gt;

  &lt;span class="na"&gt;worker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# build: . # remove&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;base_web&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Adding a new gem
&lt;/h3&gt;

&lt;p&gt;Actually, we can just add the gem to the &lt;code&gt;Gemfile&lt;/code&gt; as we normally do and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose run &lt;span class="nt"&gt;--rm&lt;/span&gt; web bundle &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, I do notice that running &lt;code&gt;docker-compose build&lt;/code&gt; will reinstall everything again. I’m not sure yet how to handle this, and it doesn’t make sense to keep reinstalling all gems every time we change something in the &lt;code&gt;Gemfile&lt;/code&gt;. I’ll defer this to later.&lt;/p&gt;
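&lt;p&gt;One option I have not fully verified is BuildKit’s cache mounts, which keep downloaded gems between builds so only changed gems are fetched again. A sketch for the &lt;code&gt;bundle install&lt;/code&gt; line in the &lt;code&gt;Dockerfile&lt;/code&gt;; it requires building with &lt;code&gt;DOCKER_BUILDKIT=1&lt;/code&gt;, and the cache path depends on where Bundler stores its download cache in your setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;# syntax=docker/dockerfile:1
# Reuse the gem download cache across builds (sketch)
RUN --mount=type=cache,target=/usr/local/bundle/cache \
    bundle install --jobs "$(nproc)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;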

&lt;h3&gt;
  
  
  Accessing pry console
&lt;/h3&gt;

&lt;p&gt;The first thing that I noticed is that &lt;code&gt;docker-compose up&lt;/code&gt; launches all services together, so you won’t be able to access the prompt when you load the page. It will just go past the console as if nothing happened:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;web_1        | Processing by PostsController#index as HTML
web_1        |
web_1        | From: /app/controllers/posts_controller.rb:8 PostsController#index:
web_1        |
web_1        |      5: def index
web_1        |      6:   @posts &lt;span class="o"&gt;=&lt;/span&gt; Post.all
web_1        |      7:
web_1        |  &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;  8:   require &lt;span class="s1"&gt;'pry'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; binding.pry
web_1        |      9:
web_1        |     10:   puts &lt;span class="s2"&gt;"a"&lt;/span&gt;
web_1        |     11: end
web_1        |
&lt;span class="o"&gt;[&lt;/span&gt;1] pry&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="c"&gt;#&amp;lt;PostsController&amp;gt;)&amp;gt;&lt;/span&gt;
web_1        |   Rendering layout layouts/application.html.erb
web_1        |   Rendering layout layouts/application.html.erb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In order to catch/access the prompt, we need to update the &lt;code&gt;docker-compose.yml&lt;/code&gt; for the web part:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;web&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;stdin_open&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;docker-compose up&lt;/code&gt; again and load the page containing the &lt;code&gt;pry&lt;/code&gt; breakpoint. Once it stops, open a new terminal and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker attach container_name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can get the right container name by running &lt;code&gt;docker container ls&lt;/code&gt;. Make sure we choose the web container’s name, as that is where the prompt is.&lt;/p&gt;

&lt;p&gt;Once we run that command, we will notice that nothing happens. We are actually already at the prompt. Just press a key (e.g. Enter) and the prompt will appear. After exiting the prompt, we will still be attached to the container. To detach, press the &lt;code&gt;CTRL-p CTRL-q&lt;/code&gt; key sequence.&lt;/p&gt;
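&lt;p&gt;If that sequence conflicts with your terminal (&lt;code&gt;CTRL-p&lt;/code&gt; is also readline’s previous-history binding), &lt;code&gt;docker attach&lt;/code&gt; lets us override it. The container name below is an example; get the real one from &lt;code&gt;docker container ls&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker attach --detach-keys="ctrl-x,ctrl-d" blog_web_1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;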

&lt;p&gt;By the way, we can tail the log from another window by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose logs &lt;span class="nt"&gt;--follow&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sidekiq
&lt;/h3&gt;

&lt;p&gt;It is pretty common to use background job processing in our app, and one of the most popular libraries out there is Sidekiq. Add the &lt;code&gt;sidekiq&lt;/code&gt; gem to our &lt;code&gt;Gemfile&lt;/code&gt; and install it using the steps mentioned above.&lt;/p&gt;

&lt;p&gt;We need to make some adjustments to existing configurations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
  &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis:6.2-alpine&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis-server&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redis:/data&lt;/span&gt;

  &lt;span class="na"&gt;web&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# ...&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# ...&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;REDIS_URL=redis://redis:6379&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# ...&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redis&lt;/span&gt;

  &lt;span class="na"&gt;worker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash -c "bundle exec sidekiq"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.:/app&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;bundle:/bundle&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_modules:/app/node_modules&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BUNDLE_PATH=/bundle&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL=postgres://postgres:password@db/blog_development&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;REDIS_URL=redis://redis:6379&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;db&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redis&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
  &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a simple worker to test it out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app/workers/test_worker.rb&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestWorker&lt;/span&gt;
  &lt;span class="kp"&gt;include&lt;/span&gt; &lt;span class="no"&gt;Sidekiq&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Worker&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;perform&lt;/span&gt;
    &lt;span class="no"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;title: &lt;/span&gt;&lt;span class="s2"&gt;"Blogging at &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="no"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do a &lt;code&gt;docker-compose down&lt;/code&gt; and then &lt;code&gt;docker-compose up&lt;/code&gt; to restart everything. Run console to manually execute the worker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose run &lt;span class="nt"&gt;--rm&lt;/span&gt; web rails console

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; TestWorker.perform_async
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will see in the server log that the job was processed. It will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;worker_1  | 2021-12-31T03:49:18.807Z &lt;span class="nv"&gt;pid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;tid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpmykgmlp &lt;span class="nv"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;TestWorker &lt;span class="nv"&gt;jid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1f261ce8ab3bdb0b7c11d2e2 INFO: start
worker_1  | 2021-12-31T03:49:19.137Z &lt;span class="nv"&gt;pid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;tid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpmykgmlp &lt;span class="nv"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;TestWorker &lt;span class="nv"&gt;jid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1f261ce8ab3bdb0b7c11d2e2 &lt;span class="nv"&gt;elapsed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.329 INFO: &lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do note that the worker starts automatically whenever we run &lt;code&gt;docker-compose up&lt;/code&gt;. We can also run it manually using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose run &lt;span class="nt"&gt;--rm&lt;/span&gt; web bundle &lt;span class="nb"&gt;exec &lt;/span&gt;sidekiq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reuses the existing image; just make sure the right Redis connection is already set for Sidekiq to work.&lt;/p&gt;
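&lt;p&gt;As a sketch of what that connection setup might look like (Sidekiq already honours &lt;code&gt;REDIS_URL&lt;/code&gt; on its own, so this initializer only makes the setting explicit; the fallback URL here is an assumption matching the &lt;code&gt;docker-compose.yml&lt;/code&gt; above):&lt;/p&gt;

```ruby
# config/initializers/sidekiq.rb -- a hedged sketch, not required code.
# Sidekiq reads REDIS_URL automatically; this just makes it explicit.
redis_config = { url: ENV.fetch("REDIS_URL", "redis://redis:6379") }

Sidekiq.configure_server do |config|
  config.redis = redis_config
end

Sidekiq.configure_client do |config|
  config.redis = redis_config
end
```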

&lt;h3&gt;
  
  
  Development ENV(s)
&lt;/h3&gt;

&lt;p&gt;It is pretty common to have environment variables loaded with different values based on the environment. The values could also differ because they are specific to each developer. So, how do we handle this? Let us assume we want to load &lt;code&gt;FOO=bar&lt;/code&gt; in every service that we created.&lt;/p&gt;

&lt;p&gt;Just add this config in our services, e.g:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .env.dev&lt;/span&gt;
&lt;span class="s"&gt;FOO=bar&lt;/span&gt;

&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ...&lt;/span&gt;
    &lt;span class="na"&gt;web&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# ...&lt;/span&gt;
        &lt;span class="na"&gt;env_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.env.dev&lt;/span&gt;

    &lt;span class="c1"&gt;# do the same for worker, webpacker, etc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can test it out with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose run &lt;span class="nt"&gt;--rm&lt;/span&gt; web bash

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$FOO&lt;/span&gt;
bar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You might be asking, wouldn’t this mean we are stuck with the same development variable for all environments? Yes, that would be a problem, but we will solve it in the next chapter. Right now, we just need to worry about our development environment to simplify learning.&lt;/p&gt;
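&lt;p&gt;To make the &lt;code&gt;env_file&lt;/code&gt; behaviour concrete, here is a small Ruby illustration (not part of the app, just a sketch) of how a &lt;code&gt;.env&lt;/code&gt;-style file maps to environment variables: blank lines and &lt;code&gt;#&lt;/code&gt; comments are ignored, and each remaining line is split on the first &lt;code&gt;=&lt;/code&gt;:&lt;/p&gt;

```ruby
# Minimal sketch of .env parsing, mirroring what docker-compose's env_file
# option does: skip blanks and comments, split each line on the first "=".
def parse_env_file(contents)
  contents.each_line.with_object({}) do |line, env|
    line = line.strip
    next if line.empty? || line.start_with?("#")
    key, value = line.split("=", 2)
    env[key] = value
  end
end

p parse_env_file("# .env.dev\nFOO=bar\n")
```

Note that a value may itself contain `=`, which is why the split is limited to two parts.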

&lt;p&gt;With the addition of &lt;code&gt;.env*&lt;/code&gt;, we need to ensure it won’t be persisted into the Docker image, for security reasons. I know we can do this with &lt;code&gt;.dockerignore&lt;/code&gt;, but I’m not sure how to verify it. I’ll keep that in mind and return to it later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Private or Commercial gems
&lt;/h3&gt;

&lt;p&gt;This took me the longest to understand, and I don’t think I have the best solution yet, but it works with some caveats. Let’s get to it.&lt;/p&gt;

&lt;p&gt;I am using the paid version of Sidekiq in one of my projects, but the same problem applies to private gems hosted on GitLab as well. Let us tackle Sidekiq’s gem first.&lt;/p&gt;

&lt;p&gt;Sidekiq requires us to supply a username and password, so we need to figure out how to make them available during both build and runtime. We will talk about the disadvantages of this approach later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Gemfile
gem 'sidekiq-ent', source: "https://enterprise.contribsys.com/"

## .dockerignore
/.env.development
/build_credentials
/docker-compose.override.yml

## .gitignore
/.env.development
/build_credentials
/docker-compose.override.yml

## Dockerfile
# ..
RUN gem install rails:6.1.4.4 bundler:2.3.4
COPY Gemfile* /app
COPY bundle_install.sh .
RUN --mount=type=secret,id=bundle_credentials ./bundle_install.sh
# ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# bundle_install.sh&lt;/span&gt;
&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="c"&gt;# Pre-installation&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /run/secrets/bundle_credentials &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s1"&gt;'^#'&lt;/span&gt; /run/secrets/bundle_credentials | xargs&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Run installation&lt;/span&gt;
bundle &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--jobs&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Cleanup&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# bundle_credentials&lt;/span&gt;
&lt;span class="nv"&gt;BUNDLE_ENTERPRISE__CONTRIBSYS__COM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;username:password
&lt;span class="nv"&gt;BUNDLE_GITLAB__COM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;amree:personal_token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;db&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:14.1-alpine&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;db:/var/lib/postgresql/data&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;POSTGRES_PASSWORD=password&lt;/span&gt;

  &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis:6.2-alpine&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis-server&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redis:/data&lt;/span&gt;

  &lt;span class="na"&gt;webpacker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;blog_base&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash -c "bin/webpack-dev-server"&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3035:3035"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.:/app&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;bundle:/bundle&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_modules:/app/node_modules&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BUNDLE_PATH=/bundle&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;WEBPACKER_DEV_SERVER_HOST=0.0.0.0&lt;/span&gt;

  &lt;span class="na"&gt;web&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;blog_base&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash -c "rm -f /app/tmp/pids/server.pid &amp;amp;&amp;amp; rails s -b 0.0.0.0"&lt;/span&gt;
    &lt;span class="na"&gt;tty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;stdin_open&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:3000"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.:/app&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;bundle:/bundle&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_modules:/app/node_modules&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BUNDLE_PATH=/bundle&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL=postgres://postgres:password@db/blog_development&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;WEBPACKER_DEV_SERVER_HOST=webpacker&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;REDIS_URL=redis://redis:6379&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;db&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;webpacker&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redis&lt;/span&gt;

  &lt;span class="na"&gt;worker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;blog_base&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash -c "bundle exec sidekiq"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.:/app&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;bundle:/bundle&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_modules:/app/node_modules&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BUNDLE_PATH=/bundle&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL=postgres://postgres:password@db/blog_development&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;REDIS_URL=redis://redis:6379&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;db&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redis&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;db&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="na"&gt;bundle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="na"&gt;node_modules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.override.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;webpacker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;env_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.env.development&lt;/span&gt;

  &lt;span class="na"&gt;web&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;env_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.env.development&lt;/span&gt;

  &lt;span class="na"&gt;worker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;env_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.env.development&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is quite a lot of changes 😅  But they are needed for security purposes: we don’t want our image to contain sensitive information. It would be OK if we were never going to push the image to production, but as a best practice we will try this out first even though it’s a little bit complicated. Things may change once I start looking into deployment, but let’s focus on what we have first.&lt;/p&gt;

&lt;p&gt;We are using a couple of features from Docker here, mainly secret and override. Credit to this blog &lt;a href="https://pythonspeed.com/articles/build-secrets-docker-compose/"&gt;post&lt;/a&gt; that solves most of the problems. We just need to adapt what was written to our problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;docker secret&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the secure way to build our image without persisting the secret in the image itself. From what I understand, the secret won’t be saved in the image as it’s only mounted temporarily, which is why we run &lt;code&gt;bundle install&lt;/code&gt; inside the script instead of directly in the &lt;code&gt;Dockerfile&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We can’t export the value out of the build, and we also do not want to save it anywhere someone could access it if they managed to get our image. The script is the one place where the required credentials are available without being exposed in the image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;docker override&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This feature would allow us to specify different values for the variables based on different environments. A small tip: Use &lt;code&gt;docker-compose config&lt;/code&gt; to check your final output.&lt;/p&gt;

&lt;p&gt;Do note we need these two features for different reasons: docker secret is used when we build the image, while docker override is used when we run the services from &lt;code&gt;docker-compose&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Since we can’t use the secret feature from &lt;code&gt;docker-compose&lt;/code&gt;, we will just build the image using &lt;code&gt;docker&lt;/code&gt; from now on. The image generated will be used in &lt;code&gt;docker-compose.yml&lt;/code&gt; by specifying the tag name.&lt;/p&gt;

&lt;p&gt;To build the image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; blog_base &lt;span class="nt"&gt;--progress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;plain &lt;span class="nt"&gt;--secret&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bundle_credentials,src&lt;span class="o"&gt;=&lt;/span&gt;.env.development &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, we tag it as &lt;code&gt;blog_base&lt;/code&gt; and use the same name in &lt;code&gt;docker-compose.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If we want to verify whether the credentials leaked into the image, we can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;history &lt;/span&gt;blog_base &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s2"&gt;"table{{.ID}}, {{.CreatedBy}}"&lt;/span&gt; &lt;span class="nt"&gt;--no-trunc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Random Notes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Volumes were created with a namespace, most likely based on the directory name. I had to reinstall the gems because of this (&lt;a href="https://stackoverflow.com/a/41222926"&gt;ref&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The default username for &lt;code&gt;postgres&lt;/code&gt; image is &lt;code&gt;postgres&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/docker/compose/issues/4560"&gt;https://github.com/docker/compose/issues/4560&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/briankung/ebfb567d149209d2d308576a6a34e5d8"&gt;https://gist.github.com/briankung/ebfb567d149209d2d308576a6a34e5d8&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/compose/environment-variables/"&gt;https://docs.docker.com/compose/environment-variables/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/engine/reference/builder/#dockerignore-file"&gt;https://docs.docker.com/engine/reference/builder/#dockerignore-file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pythonspeed.com/articles/build-secrets-docker-compose/"&gt;https://pythonspeed.com/articles/build-secrets-docker-compose/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/develop/develop-images/build_enhancements/#new-docker-build-secret-information"&gt;https://docs.docker.com/develop/develop-images/build_enhancements/#new-docker-build-secret-information&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/nickjj/docker-rails-example"&gt;https://github.com/nickjj/docker-rails-example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/19331497/set-environment-variables-from-file-of-key-value-pairs"&gt;https://stackoverflow.com/questions/19331497/set-environment-variables-from-file-of-key-value-pairs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rails</category>
      <category>docker</category>
    </item>
    <item>
      <title>Introduction to Ruby on Rails and Dockerfile</title>
      <dc:creator>Amree Zaid</dc:creator>
      <pubDate>Sat, 04 Dec 2021 09:51:12 +0000</pubDate>
      <link>https://dev.to/amree/introduction-to-ruby-on-rails-and-dockerfile-5a29</link>
      <guid>https://dev.to/amree/introduction-to-ruby-on-rails-and-dockerfile-5a29</guid>
<description>&lt;p&gt;I think everyone knows by now how good Docker is at making sure we have &lt;em&gt;almost&lt;/em&gt; the same setup everywhere. However, it is not as easy as everyone thinks for people new to it. There were just so many questions that they discouraged me from using it as my daily driver.&lt;/p&gt;

&lt;p&gt;I decided to spend some time to look into this and document this journey through this blog post. I hope this will help me and you in learning this awesome service. There will be more after this, but we will start with this one first.&lt;/p&gt;

&lt;p&gt;By the end of this blog post, you will be able to run a Ruby on Rails application with working asset compilation using just a &lt;code&gt;Dockerfile&lt;/code&gt;. We will use SQLite for now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Dockerfile
&lt;/h2&gt;

&lt;p&gt;Let us start with a simple &lt;code&gt;Dockerfile&lt;/code&gt;. You can create a directory and place this file there. Right now, we just want to create a new Rails application without installing the rails gem on our local machine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Dockerfile&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; ruby:2.6.3&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nt"&gt;--no-install-recommends&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;  curl build-essential libpq-dev &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;  curl &lt;span class="nt"&gt;-sL&lt;/span&gt; https://deb.nodesource.com/setup_10.x | bash - &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;  curl &lt;span class="nt"&gt;-sS&lt;/span&gt; https://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add - &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"deb https://dl.yarnpkg.com/debian/ stable main"&lt;/span&gt; | &lt;span class="nb"&gt;tee&lt;/span&gt; /etc/apt/sources.list.d/yarn.list &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;  apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;  apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nodejs yar

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;gem &lt;span class="nb"&gt;install &lt;/span&gt;rails bundler

&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["/bin/bash"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will go through this file line by line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; ruby:2.6.3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will pull an image from &lt;a href="https://hub.docker.com/_/ruby"&gt;DockerHub&lt;/a&gt; and use it as the base. This particular line will pull the image tagged as &lt;code&gt;2.6.3&lt;/code&gt;, which is not listed on the main page. However, you can always look it up from the Tags page.&lt;/p&gt;

&lt;p&gt;There are still many variants that you can choose from. That will determine the size of your local image and the libraries that are loaded with it.&lt;/p&gt;

&lt;p&gt;I also learnt that once you build an image using that base, you can only clear the build cache using &lt;code&gt;docker builder prune&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This line will install all of our required libraries in order to make sure we can run what we want. Try not to install unnecessary applications/services to reduce the image size.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will set &lt;code&gt;/app&lt;/code&gt; as our default directory. Every subsequent command will be run from that directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["/bin/bash"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the command that will be executed when we run the container. In this case, we need access to &lt;code&gt;bash&lt;/code&gt; first so that we can create a new rails application and install all the required gems.&lt;/p&gt;

&lt;p&gt;Let's build the image for the first time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; blog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Whenever we build the image, we need to tag it using &lt;code&gt;-t&lt;/code&gt;. We also need to supply the build context; we can just pass &lt;code&gt;.&lt;/code&gt; and Docker will find the &lt;code&gt;Dockerfile&lt;/code&gt; on its own.&lt;/p&gt;

&lt;p&gt;We will see the base image being downloaded and cached. It may take a while the first time, but subsequent builds will be faster once everything is cached.&lt;/p&gt;

&lt;p&gt;Since we need to create a new Rails application, we will interact with the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; blog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--rm&lt;/code&gt; will ensure there is no leftover container once we exit from it. We can verify this using &lt;code&gt;docker container ls&lt;/code&gt;. &lt;code&gt;-it&lt;/code&gt; is actually &lt;code&gt;--interactive + --tty&lt;/code&gt;, and it lets us interact with the container. The explanation is a bit long, but we can read it from these pages (&lt;a href="https://stackoverflow.com/a/59965320/113573"&gt;1&lt;/a&gt;, &lt;a href="https://stackoverflow.com/a/22287905/113573"&gt;2&lt;/a&gt;). The &lt;code&gt;blog&lt;/code&gt; param is just the image name that was created using &lt;code&gt;docker build&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Once we are in, we realize we don't have anything yet, so we need to create the Rails application with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rails new &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once everything is installed and you exit the container, you will realize the Rails application you generated is not available in your local copy. It is also not available in your container if you use &lt;code&gt;docker run&lt;/code&gt; again. Obviously, we don't want to keep creating a new Rails application every time.&lt;/p&gt;

&lt;p&gt;Create another directory in the same level as the &lt;code&gt;Dockerfile&lt;/code&gt; called &lt;code&gt;blog&lt;/code&gt;. This directory will be mapped to &lt;code&gt;/app&lt;/code&gt; in our container.&lt;/p&gt;

&lt;p&gt;We can use volume for this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; /local/path/to/blog:/app blog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time, we can create a Rails application with &lt;code&gt;rails new .&lt;/code&gt; from within the container itself and it will be persisted in our local copy as well. Any changes made from our machine (the host) and the container will be reflected on both sides.&lt;/p&gt;

&lt;p&gt;Once that is done, try exiting and entering your container again. You will notice &lt;code&gt;rails -v&lt;/code&gt; throws an error about lots of missing gems. This happens because our gem installation wasn't persisted. To fix this, we can use the volume feature again.&lt;/p&gt;

&lt;p&gt;This time, we won't mount a local directory like we did with our Rails application. Instead, a named volume will be created first and then supplied as one of the arguments.&lt;/p&gt;

&lt;p&gt;We just need to specify a new volume name and it will be created automatically; in this case, we are using &lt;code&gt;bundle&lt;/code&gt; as the volume name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /local/path/to/blog:/app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; bundle:/bundle &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;BUNDLE_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/bundle &lt;span class="se"&gt;\&lt;/span&gt;
  blog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;BUNDLE_PATH&lt;/code&gt; is an environment variable used by Bundler to decide where to install the bundled gems. Just run &lt;code&gt;bundle install&lt;/code&gt; and exit once it's done. Run the container again with the same command and &lt;code&gt;rails -v&lt;/code&gt; won't give us any error.&lt;/p&gt;
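&lt;p&gt;If typing &lt;code&gt;-e BUNDLE_PATH=/bundle&lt;/code&gt; on every run gets tedious, the variable can also be baked into the image. A minimal sketch, assuming the &lt;code&gt;Dockerfile&lt;/code&gt; from earlier in this post:&lt;/p&gt;

```dockerfile
# Hypothetical Dockerfile addition: set the gem path once in the image,
# so -e BUNDLE_PATH=/bundle is no longer needed on every docker run
ENV BUNDLE_PATH=/bundle
```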

&lt;p&gt;Run this command to finish setting up our Rails application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rails db:create db:migrate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Normally we could just use &lt;code&gt;rails s&lt;/code&gt; to access our welcome page, but we can't yet, as there is no way to reach it from the host. To solve this, we need to publish a port when we run the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /local/path/to/blog:/app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; bundle:/bundle &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;BUNDLE_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/bundle &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 3000:3000
  blog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also need to start our server using &lt;code&gt;rails server -b 0.0.0.0&lt;/code&gt; so that it is reachable from the host. You should now see the welcome page when you open &lt;a href="http://localhost:3000/"&gt;http://localhost:3000/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To get &lt;code&gt;docker run&lt;/code&gt; to start the Rails server automatically without dropping us into &lt;code&gt;bash&lt;/code&gt; first, we replace the &lt;code&gt;ENTRYPOINT&lt;/code&gt; line in the &lt;code&gt;Dockerfile&lt;/code&gt; with a &lt;code&gt;CMD&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ENTRYPONT &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/bin/bash"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="c"&gt;# vs&lt;/span&gt;
CMD &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"rails"&lt;/span&gt;, &lt;span class="s2"&gt;"server"&lt;/span&gt;, &lt;span class="s2"&gt;"-b"&lt;/span&gt;, &lt;span class="s2"&gt;"0.0.0.0"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I think this &lt;a href="https://stackoverflow.com/a/34245657/113573"&gt;answer&lt;/a&gt; explains the difference between &lt;code&gt;ENTRYPOINT&lt;/code&gt; and &lt;code&gt;CMD&lt;/code&gt; pretty well.&lt;/p&gt;
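&lt;p&gt;The short version: &lt;code&gt;ENTRYPOINT&lt;/code&gt; and &lt;code&gt;CMD&lt;/code&gt; combine, with &lt;code&gt;CMD&lt;/code&gt; providing default arguments that &lt;code&gt;docker run&lt;/code&gt; can override. A minimal sketch (not the exact &lt;code&gt;Dockerfile&lt;/code&gt; from this post):&lt;/p&gt;

```dockerfile
# Hypothetical sketch: ENTRYPOINT stays fixed, CMD supplies default arguments
ENTRYPOINT ["bundle", "exec"]
CMD ["rails", "server", "-b", "0.0.0.0"]

# docker run blog          -> bundle exec rails server -b 0.0.0.0
# docker run blog rails c  -> bundle exec rails c
```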

&lt;h3&gt;
  
  
  webpacker
&lt;/h3&gt;

&lt;p&gt;I didn't manage to get this working on the first try due to my lack of knowledge of how Docker and webpacker work. But you are lucky, because here is a summary of how to get it to work.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;./bin/webpack-dev-server&lt;/code&gt;, which compiles our assets, serves the files from memory, not from the file system. However, it writes &lt;code&gt;public/packs/manifest.json&lt;/code&gt; if it doesn't exist, and possibly during updates as well.&lt;/p&gt;

&lt;p&gt;Without a working &lt;code&gt;webpacker&lt;/code&gt;, you might run into a situation where the &lt;code&gt;rails server&lt;/code&gt; process is the one compiling the assets, which is slow and defeats the purpose of the dev server.&lt;/p&gt;
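&lt;p&gt;One way to guard against this (an assumption on my part, check your own config) is the &lt;code&gt;compile&lt;/code&gt; flag in &lt;code&gt;config/webpacker.yml&lt;/code&gt;; setting it to &lt;code&gt;false&lt;/code&gt; in development stops Rails from compiling on demand, so a broken dev server fails loudly instead of silently slowing you down:&lt;/p&gt;

```yaml
# config/webpacker.yml excerpt (assumed layout; other keys omitted)
development:
  compile: false  # rely on bin/webpack-dev-server, no on-demand compilation
```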

&lt;p&gt;Normally, both of them would run on the same host, but with Docker, they run in different containers. So, we need to ensure the webpacker and web containers can talk to each other.&lt;/p&gt;

&lt;p&gt;If we don't specify any network when we run our container, it will be connected to the default &lt;code&gt;bridge&lt;/code&gt; network. On that network, you can reach another container by its IP but not by its container name. To fix this, you need to create a user-defined network:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker network create blog-net &lt;span class="c"&gt;# bridge driver by default&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once that is done, we need to update our &lt;code&gt;docker run&lt;/code&gt; commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# for webpacker&lt;/span&gt;
docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /local/path/to/blog:/app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; bundle:/bundle &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;BUNDLE_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/bundle &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;WEBPACKER_DEV_SERVER_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0.0.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 3035:3035 &lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="c"&gt;# needed for auto reload page&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt; blog-net &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; webpacker &lt;span class="se"&gt;\&lt;/span&gt;
  blog &lt;span class="se"&gt;\&lt;/span&gt;
  bin/webpack-dev-server

&lt;span class="c"&gt;# for rails&lt;/span&gt;
docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /local/path/to/blog:/app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; bundle:/bundle &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;BUNDLE_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/bundle &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;WEBPACKER_DEV_SERVER_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;webpacker &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 3000:3000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt; blog-net &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; web &lt;span class="se"&gt;\&lt;/span&gt;
  blog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main reason for these changes is to ensure our &lt;code&gt;rails&lt;/code&gt; container can access the &lt;code&gt;webpacker&lt;/code&gt; container. Confused about the network part? I was, and this &lt;a href="https://docs.docker.com/network/network-tutorial-standalone/"&gt;tutorial&lt;/a&gt; helped me a lot.&lt;/p&gt;

&lt;p&gt;Basically, the Rails container needs to know where it can find the assets, which is the &lt;code&gt;webpacker&lt;/code&gt; host. The webpacker container itself needs to ensure the assets can be accessed by anyone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tips
&lt;/h3&gt;

&lt;p&gt;Some other tips that I discovered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;rails&lt;/code&gt; is not in &lt;code&gt;/bundle&lt;/code&gt; but in the original directory, &lt;code&gt;/usr/local/bundle&lt;/code&gt;. I think this is because of the base image. This raises another question: how do we upgrade Rails itself?&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We can override the default command (the &lt;code&gt;CMD&lt;/code&gt;) with:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; blog command-to-override
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This also means we can use it to run other Rails commands such as &lt;code&gt;rails db:migrate&lt;/code&gt;, &lt;code&gt;rails g scaffold&lt;/code&gt; and so on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you don't want to kill the container when exiting it, you can use &lt;code&gt;Ctrl + P, Ctrl + Q&lt;/code&gt;. After that, you can use &lt;code&gt;docker attach container-name&lt;/code&gt; to get in again.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  docker-compose
&lt;/h2&gt;

&lt;p&gt;As we have noticed, the &lt;code&gt;docker run&lt;/code&gt; command keeps getting longer. It doesn't make sense to keep passing these commands around to everyone. And what about the database? Webpack? As the title says, we can use &lt;code&gt;docker-compose&lt;/code&gt; to improve our Docker experience.&lt;/p&gt;

&lt;p&gt;I will talk about it in the next post.&lt;/p&gt;
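&lt;p&gt;As a small teaser, the two long &lt;code&gt;docker run&lt;/code&gt; commands above could roughly be collapsed into something like this (an untested sketch; service and volume names are my own):&lt;/p&gt;

```yaml
# docker-compose.yml sketch (untested; names are assumptions)
services:
  web:
    build: .
    volumes:
      - ./blog:/app
      - bundle:/bundle
    environment:
      BUNDLE_PATH: /bundle
      WEBPACKER_DEV_SERVER_HOST: webpacker
    ports:
      - "3000:3000"
  webpacker:
    build: .
    command: bin/webpack-dev-server
    volumes:
      - ./blog:/app
      - bundle:/bundle
    environment:
      BUNDLE_PATH: /bundle
      WEBPACKER_DEV_SERVER_HOST: 0.0.0.0
    ports:
      - "3035:3035"
volumes:
  bundle:
```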

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/tsftech/how-to-fully-utilise-docker-compose-during-development-4b723caed798"&gt;https://medium.com/tsftech/how-to-fully-utilise-docker-compose-during-development-4b723caed798&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/swlh/docker-caching-introduction-to-docker-layers-84f20c48060a"&gt;https://medium.com/swlh/docker-caching-introduction-to-docker-layers-84f20c48060a&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.freecodecamp.org/news/painless-rails-development-environment-setup-with-docker/"&gt;https://www.freecodecamp.org/news/painless-rails-development-environment-setup-with-docker/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.plymouthsoftware.com/articles/dockerising-webpacker"&gt;https://www.plymouthsoftware.com/articles/dockerising-webpacker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rails/webpacker/issues/1019#issuecomment-351066969"&gt;https://github.com/rails/webpacker/issues/1019#issuecomment-351066969&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/network/network-tutorial-standalone/"&gt;https://docs.docker.com/network/network-tutorial-standalone/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rails/webpacker/issues/863#issuecomment-346081995"&gt;https://github.com/rails/webpacker/issues/863#issuecomment-346081995&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>docker</category>
      <category>rails</category>
    </item>
    <item>
      <title>Speed Up your Ruby on Rails Development Environment</title>
      <dc:creator>Amree Zaid</dc:creator>
      <pubDate>Sat, 08 May 2021 23:55:30 +0000</pubDate>
      <link>https://dev.to/amree/speed-up-your-ruby-on-rails-development-environment-1k8m</link>
      <guid>https://dev.to/amree/speed-up-your-ruby-on-rails-development-environment-1k8m</guid>
<description>&lt;p&gt;There will be a time when your Rails development environment becomes very slow for multiple reasons, mostly because your codebase is very big and the monolith architecture is just too sweet for you to pass on.&lt;/p&gt;

&lt;p&gt;For example, you have to wait 10 seconds (or worse) after changing something in your Rails code and hitting refresh. Starting the Rails console/server is usually affected as well.&lt;/p&gt;

&lt;p&gt;What works for me might not work for you, but it should help you get started. For context, I'm optimizing a 7-year-old codebase. So here are some tips for speed improvements.&lt;/p&gt;

&lt;p&gt;Get a baseline of how slow it is first, so that you know you are actually improving things instead of just going by feel:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

$ time bundle exec rake environment


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This should show you the load time without spring. Any changes we make should improve these numbers.&lt;br&gt;
Now we need to identify what's causing the slowness. We can use the &lt;a href="https://github.com/nevir/Bumbler" rel="noopener noreferrer"&gt;Bumbler&lt;/a&gt; gem for this. Just run this command first:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

$ bumbler


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It will show you which gems are taking the longest to load/require.&lt;/p&gt;

&lt;p&gt;But what can you do with that output? Start from the bottom and check whether each gem is in the right group. You don't need your server-monitoring gem in the development environment, right?&lt;/p&gt;

&lt;p&gt;You can also set &lt;code&gt;require: false&lt;/code&gt; in your Gemfile for gems that are used rarely, but then you need to &lt;code&gt;require&lt;/code&gt; them manually when you want to use them.&lt;/p&gt;

&lt;p&gt;Use your judgment wisely. Check out how &lt;a href="https://github.com/discourse/discourse/blob/master/Gemfile" rel="noopener noreferrer"&gt;Discourse's Gemfile&lt;/a&gt; does it.&lt;/p&gt;
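&lt;p&gt;To illustrate both ideas (the gem names here are just examples, not recommendations):&lt;/p&gt;

```ruby
# Gemfile sketch (gem names are examples only)
group :production do
  gem "newrelic_rpm"           # monitoring: no need to load it in development
end

gem "rubyzip", require: false  # rarely used: require it manually where needed
```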

&lt;p&gt;Next, run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

$ bumbler --initializers


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will show you the load time of initializers.&lt;/p&gt;

&lt;p&gt;In my case, it was the routes (we have thousands of them due to our support for multiple languages). Reducing the number of routes would help, but we can't simply remove them: multi-language support is one of our best features.&lt;/p&gt;

&lt;p&gt;Since you won't always be using ALL of them in development, you can do this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8a9y6oxsomlfp8yndaz7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8a9y6oxsomlfp8yndaz7.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What does it do? By default, it will use all of the languages, except when the &lt;code&gt;MIN_LOCALE&lt;/code&gt; environment variable is set, which you would only do in your local development.&lt;/p&gt;
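&lt;p&gt;The idea in the screenshot can be sketched like this (the constant and method names are my own, not the actual code):&lt;/p&gt;

```ruby
# Sketch of the routes.rb trick: define routes for every supported locale by
# default, but only a minimal set when MIN_LOCALE is set (local development).
ALL_LOCALES = %i[en ms ja de fr].freeze

def locales_for_routes(env = ENV)
  env["MIN_LOCALE"] ? ALL_LOCALES.first(1) : ALL_LOCALES
end

# routes.rb would then loop over locales_for_routes instead of ALL_LOCALES.
puts locales_for_routes({ "MIN_LOCALE" => "1" }).length  # 1
puts locales_for_routes({}).length                       # 5
```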

&lt;p&gt;Another place to check would be your admin routes. Again, you are not accessing them all the time, so you can use the same trick in your &lt;code&gt;routes.rb&lt;/code&gt;. Keep optimizing your codebase based on what Bumbler tells you.&lt;/p&gt;

&lt;p&gt;One last tip: clean up your gems. Remove the ones that are no longer used; sometimes we simply forget we don't need them anymore.&lt;/p&gt;

&lt;p&gt;That's it, folks. Here is my result from this exercise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced slow requires by 41%&lt;/li&gt;
&lt;li&gt;Reduced initializer load time by 72%&lt;/li&gt;
&lt;li&gt;Reduced the number of routes by 91%&lt;/li&gt;
&lt;li&gt;Reduced rake load time by 55%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

</description>
      <category>rails</category>
    </item>
    <item>
      <title>How to Integrate TradingView's HTML5 Charting Library with Ruby on Rails v6</title>
      <dc:creator>Amree Zaid</dc:creator>
      <pubDate>Mon, 23 Nov 2020 06:59:37 +0000</pubDate>
      <link>https://dev.to/amree/how-to-integrate-tradingview-s-html5-charting-library-with-ruby-on-rails-v6-13be</link>
      <guid>https://dev.to/amree/how-to-integrate-tradingview-s-html5-charting-library-with-ruby-on-rails-v6-13be</guid>
      <description>&lt;p&gt;As you probably know, the charting library is not accessible publicly. You need to request access from them. So, I can't really give a complete repo as an example. I did however open a PR at &lt;a href="https://github.com/tradingview/charting-library-examples/pull/197"&gt;https://github.com/tradingview/charting-library-examples/pull/197&lt;/a&gt;, but I'm not sure if it's going to be accepted.&lt;/p&gt;

&lt;p&gt;Anyway, the given example uses the asset pipeline, while a modern Ruby on Rails application uses webpacker. So after a few attempts, I figured out a working way to load the sample chart.&lt;/p&gt;

&lt;p&gt;It wasn't straightforward for me, so maybe this will help someone else in the future. I'm not sure if it's the best way to load it, but it works 😁&lt;/p&gt;

&lt;p&gt;I'm going to assume you're using Ruby on Rails v6.0.3.4. Once you've cloned the library into the &lt;code&gt;charting_library&lt;/code&gt; directory, follow these steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copy &lt;code&gt;charting_library/charting_library.js&lt;/code&gt; into &lt;code&gt;app/javascript/packs/charting_library/charting_library.js&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Copy &lt;code&gt;datafeeds/udf/dist/*.js&lt;/code&gt; into &lt;code&gt;app/javascript/packs/datafeeds/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Copy &lt;code&gt;charting_library/*.html&lt;/code&gt; into &lt;code&gt;public/charting_library/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Copy &lt;code&gt;charting_library/bundles&lt;/code&gt; into &lt;code&gt;public/charting_library/bundles&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't worry about serving outdated files just because they are in the public directory: the charting library puts a new hash on the files every time there's an update.&lt;/p&gt;

&lt;p&gt;Once we got the files in the correct places, we can use this code to load the sample:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// app/javascript/packs/application.js&lt;/span&gt;
&lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@rails/ujs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;turbolinks&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;channels&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;packs/datafeeds/polyfills&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;Datafeeds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;packs/datafeeds/bundle&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;TradingView&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;packs/charting_library/charting_library&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;getLanguageFromURL&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;regex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;RegExp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;?&amp;amp;]lang=([^&amp;amp;#]*)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;search&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;decodeURIComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\+&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt; &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;initOnReady&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;widget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tvWidget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;TradingView&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;widget&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AAPL&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;datafeed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Datafeeds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;UDFCompatibleDatafeed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://demo_feed.tradingview.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;D&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;container_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tv_chart_container&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;library_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/charting_library/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="na"&gt;locale&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;getLanguageFromURL&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;en&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;disabled_features&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;use_localstorage_for_settings&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;enabled_features&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;study_templates&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;charts_storage_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://saveload.tradingview.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;charts_storage_api_version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;1.1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tradingview.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;public_user_id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;fullscreen&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;autosize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;studies_overrides&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;widget&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onChartReady&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;widget&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headerReady&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;button&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;widget&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;createButton&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

      &lt;span class="nx"&gt;button&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;title&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Click to show a notification popup&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;button&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;classList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;apply-common-tooltip&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="nx"&gt;button&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;click&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;widget&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;showNoticeDialog&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Notification&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;TradingView Charting Library API works correctly&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Noticed!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;}));&lt;/span&gt;

      &lt;span class="nx"&gt;button&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerHTML&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Check API&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;DOMContentLoaded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;initOnReady&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code actually comes from the sample, with slight modifications.&lt;/p&gt;

&lt;p&gt;Create a view and put this HTML in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"page-tv-chart-container"&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"tv_chart_container"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TradingView will use that ID to load the chart.&lt;/p&gt;

&lt;p&gt;Just start your server and that's it. You should have a working TradingView chart by now 👍&lt;/p&gt;

</description>
      <category>rails</category>
      <category>tradingview</category>
    </item>
    <item>
      <title>Exporting data from RDS to S3 using AWS Glue</title>
      <dc:creator>Amree Zaid</dc:creator>
      <pubDate>Mon, 05 Oct 2020 09:37:43 +0000</pubDate>
      <link>https://dev.to/amree/exporting-data-from-rds-to-s3-using-aws-glue-mai</link>
      <guid>https://dev.to/amree/exporting-data-from-rds-to-s3-using-aws-glue-mai</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Why would we even want to do this? Imagine you have data that is infrequently accessed. Keeping it in your RDS might cost you more than it should. Besides, a big table is going to give you a bigger headache with database maintenance (indexing, auto vacuum, etc.).&lt;/p&gt;

&lt;p&gt;What if we could store that data somewhere cheaper, and at an even smaller size? You can read more about the cost-saving part &lt;a href="https://dev.to/cloudforecast/using-parquet-on-athena-to-save-money-on-aws-3fac"&gt;here&lt;/a&gt;. I'll be focusing on the how and not the why in this post.&lt;/p&gt;

&lt;p&gt;Exporting data from RDS to S3 through AWS Glue and viewing it through AWS Athena requires a lot of steps, but it’s important to understand the process from a higher level first.&lt;/p&gt;

&lt;p&gt;IMHO, we can visualize the whole process as two parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: This is the process where we’ll get the data from RDS into S3 using AWS Glue&lt;/li&gt;
&lt;li&gt;Output: This is where we’ll use AWS Athena to view the data in S3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s important to note that both processes require similar steps: we need to specify a database and a table for each of them.&lt;/p&gt;

&lt;p&gt;Database and table don’t carry exactly the same meaning as in normal PostgreSQL. A database in this context is more like a container for tables and doesn’t really have any extra configuration.&lt;/p&gt;

&lt;p&gt;A table is a little bit different, as it has a schema attached to it. A table in AWS Glue is just a metadata definition that represents your data; it doesn’t hold the data itself. The data lives somewhere else, be it in RDS, S3, or other places.&lt;/p&gt;

&lt;p&gt;How do we create a table? We can either create it manually or use Crawlers in AWS Glue for that. We can also create a table from AWS Athena itself.&lt;/p&gt;

&lt;p&gt;The database and tables that you see in AWS Glue will also be available in AWS Athena.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;In order not to confuse ourselves, it’d be better to use different database names for the input and the output. We need to differentiate between the input and the output for easier reference when we set up the AWS Glue Job.&lt;/li&gt;
&lt;li&gt;More will be added&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisite
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Security
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;These security configurations are required to prevent errors when we run AWS Glue&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Amazon VPC Endpoints for Amazon S3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to VPC &amp;gt; Endpoints&lt;/li&gt;
&lt;li&gt;Create Endpoint&lt;/li&gt;
&lt;li&gt;Search by Services: S3 (com.amazonaws.ap-southeast-1.s3)&lt;/li&gt;
&lt;li&gt;Select your VPC&lt;/li&gt;
&lt;li&gt;Tick a Route Table ID&lt;/li&gt;
&lt;li&gt;Choose Full Access&lt;/li&gt;
&lt;li&gt;Create Endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result will be shown in the “Route Tables &amp;gt; Routes” page. There’ll be a new route added with the S3 service’s prefix list as the destination and the new VPC endpoint as the target.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/vpc-endpoints-s3.html"&gt;https://docs.aws.amazon.com/glue/latest/dg/vpc-endpoints-s3.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RDS Security Group:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select the security group of the database that you want to use&lt;/li&gt;
&lt;li&gt;Edit inbound rules&lt;/li&gt;
&lt;li&gt;Add rule&lt;/li&gt;
&lt;li&gt;Type: All TCP&lt;/li&gt;
&lt;li&gt;Source: Custom and search for the security group name itself&lt;/li&gt;
&lt;li&gt;Save rules&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Roles
&lt;/h4&gt;

&lt;p&gt;This will allow Glue to call AWS services on our behalf.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to IAM &amp;gt; Roles &amp;gt; Create role&lt;/li&gt;
&lt;li&gt;Type of trusted identity: AWS Service&lt;/li&gt;
&lt;li&gt;Service: Glue&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Search and select AWSGlueServiceRole&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;We can skip adding tags&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Role name: AWSGlueServiceRoleDefault (can be anything)&lt;/li&gt;
&lt;li&gt;Create Role&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Add Database Connections (for Input)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Go to AWS Glue &amp;gt; Databases &amp;gt; Connections&lt;/li&gt;
&lt;li&gt;Click “Add Connection”&lt;/li&gt;
&lt;li&gt;Connection type: Amazon RDS&lt;/li&gt;
&lt;li&gt;Database Engine: PostgreSQL&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Instance: Choose an RDS&lt;/li&gt;
&lt;li&gt;Put the database details: name, username, and password&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Review and click Finish&lt;/li&gt;
&lt;li&gt;Use “Test Connection” in the “Connections” page to test it out (this might take a while)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Setup S3 access (for Output)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Go to AWS Glue &amp;gt; Databases &amp;gt; Connections&lt;/li&gt;
&lt;li&gt;Click “Add Connection”&lt;/li&gt;
&lt;li&gt;Connection type: Network&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;VPC: Select the same one as the RDS*&lt;/li&gt;
&lt;li&gt;Subnet: Select the same one as the RDS*&lt;/li&gt;
&lt;li&gt;Security Group: default*&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Review and click Finish&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(*) Other options might work too, but I didn’t try them out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Add Databases
&lt;/h3&gt;

&lt;p&gt;This will be the parent/container for the tables. Tables can come from either the input or the output side.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to AWS Glue &amp;gt; Databases &amp;gt; Add database&lt;/li&gt;
&lt;li&gt;Database name: anything will do&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I created one for each side. E.g: &lt;code&gt;myapp_input&lt;/code&gt; and &lt;code&gt;myapp_output&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Add Crawlers
&lt;/h3&gt;

&lt;p&gt;Before we create a job to import the data, we need to set up our input table’s schema. This schema will be used for the data input in the Job later.&lt;/p&gt;

&lt;p&gt;Naming is hard. I decided to go with this format: &lt;code&gt;rds_db_name_env_table_name_crawler&lt;/code&gt;. It’s easier if we can grasp what the crawler does from its name, even though we could use a shorter name and put the details in the description.&lt;/p&gt;
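
&lt;p&gt;As a sketch, the convention can be captured in a tiny helper (hypothetical code, just to make the format concrete):&lt;/p&gt;

```python
def crawler_name(db_name, env, table_name, source="rds"):
    # Builds e.g. "rds_myapp_production_orders_crawler", following the
    # source_db_name_env_table_name_crawler convention described above.
    return "_".join([source, db_name, env, table_name, "crawler"])
```

&lt;p&gt;The same helper would work for the S3 crawler later by passing &lt;code&gt;source="s3"&lt;/code&gt;.&lt;/p&gt;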

&lt;ul&gt;
&lt;li&gt;Go to AWS Glue &amp;gt; Tables &amp;gt; Add tables &amp;gt; Add tables using a crawler&lt;/li&gt;
&lt;li&gt;Crawler name: Anything&lt;/li&gt;
&lt;li&gt;Crawler source type: Data stores&lt;/li&gt;
&lt;li&gt;Next &lt;/li&gt;
&lt;li&gt;Choose a data store: JDBC&lt;/li&gt;
&lt;li&gt;Connection: Choose the one we created above&lt;/li&gt;
&lt;li&gt;Include path: &lt;code&gt;db_name/public/table_name&lt;/code&gt; (assuming we want to take data from the table &lt;code&gt;table_name&lt;/code&gt;; we can also use % as a wildcard)&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Add another data store: No&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;IAM role: Choose the one we created above (AWSGlueServiceRoleDefault)&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Frequency: Run on demand&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Configure the crawler’s output&lt;/li&gt;
&lt;li&gt;Database: Choose database for the crawler output (this will be the source for our Job later)&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Review and click Finish&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The crawler that we’ve just defined exists just to create a table with a schema based on the RDS table we specified.&lt;/p&gt;

&lt;p&gt;Let’s run it to see the output. Just go to the Crawlers page and select “Run crawler”. It’ll take a moment before it starts and there’s no log when it’s running (or at least I can’t find it yet). However, there’ll be a log once it’s done. The only thing you can do to monitor its progress is to keep clicking the Refresh icon on the Crawlers page.&lt;/p&gt;

&lt;p&gt;Once it’s done, you’ll see the table created automatically in the Tables section. You can filter out the list of the tables by going through Databases first. You should see a table with defined schema similar to what you have in RDS.&lt;/p&gt;

&lt;p&gt;We need to add another crawler to define the schema of our output, but this time it’ll be for our S3 (which is the output).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to AWS Glue &amp;gt; Tables &amp;gt; Add tables &amp;gt; Add tables using a crawler&lt;/li&gt;
&lt;li&gt;Crawler name: Anything (I chose &lt;code&gt;s3_db_name_env_table_name_crawler&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Crawler source type: Data stores&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Choose a data store: S3&lt;/li&gt;
&lt;li&gt;Connection: Use connection declared before for S3 access&lt;/li&gt;
&lt;li&gt;Crawl data in: Specified path in my account&lt;/li&gt;
&lt;li&gt;Include path: &lt;code&gt;s3://your-data-path/&lt;/code&gt;. This will be the path where you’ll store the output from the Job that you’ll create later. The output here means the Apache Parquet files. I chose: &lt;code&gt;s3://glue-dir/env/database_name/table_name/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Add another data store: No&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Choose IAM role&lt;/li&gt;
&lt;li&gt;Choose an existing IAM role&lt;/li&gt;
&lt;li&gt;IAM role: AWSGlueServiceRoleDefault&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Create a schedule for this crawler:&lt;/li&gt;
&lt;li&gt;Frequency: Run on demand&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Configure the crawler’s output:&lt;/li&gt;
&lt;li&gt;Database: A database where you’ll store the output from S3&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Review and click Finish&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re not going to run this crawler yet as the S3 directory is empty. We’ll run it once we’ve exported RDS data to S3.&lt;/p&gt;

&lt;h3&gt;
  
  
  Add a job
&lt;/h3&gt;

&lt;p&gt;For some unknown reason, I couldn’t get this to work without using AWS Glue Studio. Maybe I’ll figure it out once I have more time. But I’ll just use AWS Glue Studio for now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open AWS Glue Studio in ETL section&lt;/li&gt;
&lt;li&gt;Choose "Create and manage jobs"&lt;/li&gt;
&lt;li&gt;Source: RDS&lt;/li&gt;
&lt;li&gt;Target: S3&lt;/li&gt;
&lt;li&gt;Click Create&lt;/li&gt;
&lt;li&gt;Click on the “Data source - JDBC” node&lt;/li&gt;
&lt;li&gt;Database: Use the database that we defined earlier for the input&lt;/li&gt;
&lt;li&gt;Table: Choose the input table (should be coming from the same database)&lt;/li&gt;
&lt;li&gt;You’ll notice that the node will now have a green check&lt;/li&gt;
&lt;li&gt;Click on the “Data target - S3 bucket” node&lt;/li&gt;
&lt;li&gt;Format: Glue Parquet&lt;/li&gt;
&lt;li&gt;Compression type: Snappy&lt;/li&gt;
&lt;li&gt;S3 Target location: This will be the place where the Parquet files will be generated. This path should also be the same as what we defined in our crawler for the output before. Remember, this is for the data, not for the schema; the crawler is responsible for the schema. I chose to use &lt;code&gt;s3://glue-dir/env/database_name/table_name&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;You’ll notice that the node will now have a green check&lt;/li&gt;
&lt;li&gt;Now go to “Job details” tab&lt;/li&gt;
&lt;li&gt;Name: Can be anything, I chose &lt;code&gt;rds_to_s3_db_name_env_table_name&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;IAM Role: Choose the role that we created before - AWSGlueServiceRoleDefault&lt;/li&gt;
&lt;li&gt;Expand “Advanced Properties”&lt;/li&gt;
&lt;li&gt;We’re going to specify some paths so that it won’t litter our top-level s3&lt;/li&gt;
&lt;li&gt;Script path: &lt;code&gt;s3://glue-dir/scripts/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Spark UI logs path: &lt;code&gt;s3://glue-dir/sparkHistoryLogs/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Temporary path: &lt;code&gt;s3://glue-dir/temporary/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Click Save&lt;/li&gt;
&lt;li&gt;Click “Run” to run the script. You can see the log in the “Run details” tab. If everything is working as expected, you should see files generated in &lt;code&gt;s3://glue-dir/env/database_name/table_name&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Viewing the record using AWS Athena
&lt;/h3&gt;

&lt;p&gt;Before we can view the output, we need to create a table/schema for those Parquet files. That’s the job of the second crawler we defined (you can also create the table manually if you want to). Just run it and you’ll get a new table created if it’s new. Refer to the previous steps on how we ran the first crawler.&lt;/p&gt;

&lt;p&gt;To confirm the table has been created, just go to the database for the output and then click on the “Tables ..” link. You should see it there. There’s also an alert at the top of the Crawler index page once it has finished the job.&lt;/p&gt;

&lt;p&gt;To view the record:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to AWS Athena&lt;/li&gt;
&lt;li&gt;Select your database and table.&lt;/li&gt;
&lt;li&gt;Click on the three dots on the right side of the table name and choose Preview Table&lt;/li&gt;
&lt;li&gt;You’ll see some data in the Results&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/aws-samples/aws-glue-samples/blob/master/FAQ_and_How_to.md"&gt;https://github.com/aws-samples/aws-glue-samples/blob/master/FAQ_and_How_to.md&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://spark.apache.org/docs/2.1.0/sql-programming-guide.html"&gt;https://spark.apache.org/docs/2.1.0/sql-programming-guide.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/"&gt;https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/34948296/using-pyspark-to-connect-to-postgresql"&gt;https://stackoverflow.com/questions/34948296/using-pyspark-to-connect-to-postgresql&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/cloudforecast/watch-out-for-unexpected-s3-cost-when-using-athena-5hdm"&gt;https://dev.to/cloudforecast/watch-out-for-unexpected-s3-cost-when-using-athena-5hdm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>rds</category>
      <category>s3</category>
    </item>
    <item>
      <title>Order Update with Two-Step Payment</title>
      <dc:creator>Amree Zaid</dc:creator>
      <pubDate>Sat, 09 May 2020 00:36:26 +0000</pubDate>
      <link>https://dev.to/amree/order-update-with-two-step-payment-2n64</link>
      <guid>https://dev.to/amree/order-update-with-two-step-payment-2n64</guid>
<description>&lt;p&gt;I was asked this in my latest job application. I didn't realize it was going to be this long, so I thought I should share it publicly; it'll be easier to show it to the next interviewer. The exact question was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Describe a recent technical solution or achievement that you are proud of. Anything goes, from a tiny one hour ticket to a large system, we are just interested in how you think.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer:&lt;/p&gt;

&lt;p&gt;It's just order updating, how complex can it be? I realized I was wrong about the complexity when I started my investigation. The complexity comes from the two-step payment system that we were going to implement to make sure the whole order-editing flow works smoothly. It was actually the first time I'd heard the term two-step payment.&lt;/p&gt;

&lt;p&gt;In case you didn't know: a two-step payment system is one where you place a hold for a certain amount on someone's credit card and, depending on the requirements, charge the card later. We're using Stripe for our payment system.&lt;/p&gt;
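
&lt;p&gt;A toy model of that flow, just to make the two steps concrete (this is not the Stripe API; with Stripe, the equivalent is a PaymentIntent created with a manual capture method):&lt;/p&gt;

```python
class TwoStepPayment:
    """Toy authorize-then-capture payment (illustration only, not Stripe)."""

    def __init__(self, amount):
        self.amount = amount
        self.state = "authorized"  # step 1: a hold is placed on the card

    def capture(self):
        # Step 2: actually charge the held amount; only valid while
        # the hold is still active.
        assert self.state == "authorized"
        self.state = "captured"

    def cancel(self):
        # Release the hold without charging anything.
        assert self.state == "authorized"
        self.state = "canceled"
```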

&lt;p&gt;A little bit of background: we don't want to charge the credit card until the cut-off date of the food delivery, which allows the customer to change their order online without contacting us. So, a customer can keep changing the menu (which will affect the price) as much as they want without us having to deal with refunding and charging the credit card manually.&lt;/p&gt;

&lt;p&gt;The simplest workflow would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The customer checks out from our website with a delivery date far in the future&lt;/li&gt;
&lt;li&gt;We'll queue the card authorization 7 days before the cut-off date&lt;/li&gt;
&lt;li&gt;The customer doesn't make any changes&lt;/li&gt;
&lt;li&gt;When the time comes, we authorize the amount and queue another job for the card capture process&lt;/li&gt;
&lt;li&gt;On the cut-off date, we capture the amount automatically&lt;/li&gt;
&lt;li&gt;No order updates from the web are allowed at this point; customers need to contact our customer support&lt;/li&gt;
&lt;/ul&gt;
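
&lt;p&gt;The scheduling part of the workflow above is plain date arithmetic; a minimal sketch (the function name is made up):&lt;/p&gt;

```python
from datetime import date, timedelta

def authorization_date(cutoff):
    # Queue the card authorization 7 days before the cut-off date,
    # as in the workflow above.
    return cutoff - timedelta(days=7)
```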

&lt;p&gt;The complexity keeps on compounding as you need to think about these scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;During checkout we need to know whether we should authorize, capture, or process the card immediately (one-step payment).&lt;/li&gt;
&lt;li&gt;The biggest one would be the editing part. We need to think about the current state of the order and the action being made. Is the order in the authorized state? The captured state? When is the cut-off date? Do we need to refund everything? Do we need to do a partial refund? Do we need to refund and charge a new amount?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, I was given this task alone (we didn't have many devs back then). There were too many things that needed to be done, so I had to break it up into phases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Update the existing checkout to support the two-step payment system (this is when the order is created)&lt;/li&gt;
&lt;li&gt;Update/add the code to handle cancelling, refunding, authorizing, and capturing the card. Each action has its own complexity, but that's the high-level overview.&lt;/li&gt;
&lt;li&gt;Alter the database to support the new payment states&lt;/li&gt;
&lt;li&gt;Figure out the best time to capture the payment (e.g: cut off date - weekend). I also needed to give some buffer for customer support to handle things if there's an error during any of the processes above.&lt;/li&gt;
&lt;/ul&gt;
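
&lt;p&gt;For the capture-timing point, one reading is to pull the capture back off weekends so support has a working-day buffer; a sketch of that idea (an assumption for illustration, not the exact production rule):&lt;/p&gt;

```python
from datetime import date, timedelta

def capture_date(cutoff):
    # One possible policy: capture on the cut-off date, but shift back
    # to Friday when the cut-off lands on a weekend, so support is
    # around if the capture fails.
    d = cutoff
    while d.weekday() in (5, 6):  # 5 = Saturday, 6 = Sunday
        d = d - timedelta(days=1)
    return d
```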

&lt;p&gt;At the end of the project, I got help from my awesome team for things like mailers and other kinds of updates, so I could focus on the core parts. I don't think I could have met the deadline without that help lol.&lt;/p&gt;

&lt;p&gt;I'd love to tell you more about the whole process, but it would get pretty long. You can see the branch conditions that I've drawn here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fAvhRilK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgur.com/uouBdhG.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fAvhRilK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgur.com/uouBdhG.jpg" alt="diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P = Pending Authorization&lt;br&gt;
A = Authorized&lt;br&gt;
C = Capture&lt;br&gt;
R = Refund&lt;br&gt;
H = Amount higher&lt;br&gt;
L = Amount lower&lt;/p&gt;
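
&lt;p&gt;A simplified slice of those branches as code, covering only the easiest case where the card is still in the Authorized (A) state (the names are hypothetical; cancelling the hold and re-authorizing the new amount is one common way to cover both H and L):&lt;/p&gt;

```python
def edit_actions(state, old_amount, new_amount):
    # Only the Authorized (A) slice of the diagram; the real flow also
    # handles Pending (P) and Captured (C) with their own branches.
    if state != "authorized":
        return ["contact_support"]  # e.g. already captured: no web edits
    if new_amount == old_amount:
        return []                   # price unchanged, nothing to do
    # Whether the new amount is higher (H) or lower (L), release the
    # old hold and place a fresh one for the updated amount.
    return ["cancel_authorization", "authorize_new_amount"]
```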

&lt;p&gt;But the biggest thing I learned from this project is that visualization helps a lot. It doesn't have to be in a standard format like the ones you studied at university. Just draw however you want, as long as it helps you see the problem and possible blockers.&lt;/p&gt;

&lt;p&gt;In terms of the code itself, I had to dive into React and Redux to implement the whole update (we have complex menu selections). Of course, testing is very important. With lots of new and updated code, I needed to make sure nothing was broken every time I added or updated something. At first, I mocked a lot of API requests, but it didn't feel safe, so I used the VCR library to record the interactions, and the result feels more accurate and safer. For the front-end part, I used Capybara/Chrome for the feature tests.&lt;/p&gt;

&lt;p&gt;With a feature flag in place, I could safely deploy the changes every day without having to do one big rollout. In terms of the backend code, I used a lot of service objects to keep the classes small. It's also easier to read and find things, e.g: ChargeProcessor, AuthorizeProcessor, etc. Everything was also namespaced to ensure I didn't pollute the main service directory.&lt;/p&gt;

&lt;p&gt;With this feature implemented, we improved further with other features as well, such as the ability to save and delete credit cards. Checkout is also easier, as the customer can just select a previously used card. Support couldn't be happier either, as they no longer have to handle manual order updates.&lt;/p&gt;

&lt;p&gt;I think I'd better stop here lol&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>react</category>
      <category>redux</category>
      <category>stripe</category>
    </item>
  </channel>
</rss>
