<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sudheer Tripathi</title>
    <description>The latest articles on DEV Community by Sudheer Tripathi (@sudheer121).</description>
    <link>https://dev.to/sudheer121</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F387129%2F698b216f-9f2e-4975-8ea2-550239ad86c5.jpg</url>
      <title>DEV Community: Sudheer Tripathi</title>
      <link>https://dev.to/sudheer121</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sudheer121"/>
    <language>en</language>
    <item>
      <title>Reading 20k excel files per minute with Node</title>
      <dc:creator>Sudheer Tripathi</dc:creator>
      <pubDate>Sun, 16 Mar 2025 11:37:55 +0000</pubDate>
      <link>https://dev.to/sudheer121/reading-20k-excel-files-per-minute-with-node-df0</link>
      <guid>https://dev.to/sudheer121/reading-20k-excel-files-per-minute-with-node-df0</guid>
      <description>&lt;p&gt;Recently I was working on an adhoc activity which involved writing a script that would extract some cell data from around 75k excel files, assume these files are readily available on your disk. Because these are user provided files, some of these are corrupt and can cause a range of issues when parsed. In an attempt to optimize the data extraction script, I ended up at a point where the final script ran 30X faster and handled errors gracefully. Lets look at the approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Excel files are like bombs
&lt;/h2&gt;

&lt;p&gt;Parsing Excel files is like running user-provided code on your system: when these files are read using Excel libraries, they can cause heap overflows, block the main thread, throw unforeseen exceptions, and more. The most popular Excel libraries in the JS ecosystem are &lt;a href="https://sheetjs.com/" rel="noopener noreferrer"&gt;SheetJS&lt;/a&gt; and &lt;a href="https://github.com/exceljs/exceljs" rel="noopener noreferrer"&gt;ExcelJS&lt;/a&gt;, and both suffer from issues like &lt;a href="https://github.com/SheetJS/sheetjs/issues/1363" rel="noopener noreferrer"&gt;blocking of the main thread&lt;/a&gt; &amp;amp; &lt;a href="https://github.com/exceljs/exceljs/issues/1898" rel="noopener noreferrer"&gt;heap overflow&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do we read them, then?
&lt;/h2&gt;

&lt;p&gt;Because these files can block the main thread (infinite loops), cause heap overflows, and throw unknown exceptions during parsing, we need to run them in a container-like environment, and Node.js worker threads can serve as that container. Node.js allows us to specify the memory allocated to a worker, beyond which the worker is terminated. If the worker goes into an infinite loop, we can add a timer in the main thread that terminates the worker if it doesn't respond within a few seconds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfb29pku5n3s4v63vhl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfb29pku5n3s4v63vhl1.png" width="800" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach 1: Spawn worker thread when reading file
&lt;/h2&gt;

&lt;p&gt;We can limit a worker thread's memory by specifying it in the &lt;code&gt;resourceLimits&lt;/code&gt; property (see &lt;a href="https://github.com/dumbbellcode/blazing-fast-excel-reads/blob/main/src/1-sequential-spawn-worker/index.ts#L72" rel="noopener noreferrer"&gt;this&lt;/a&gt;) when creating the worker. In case the worker goes into an infinite loop, we specify a &lt;a href="https://github.com/dumbbellcode/blazing-fast-excel-reads/blob/main/src/1-sequential-spawn-worker/index.ts#L42" rel="noopener noreferrer"&gt;timeout&lt;/a&gt; after which we terminate it. In Node.js, a worker thread is automatically terminated once the script inside it has finished running.&lt;/p&gt;

&lt;p&gt;You can find the code for this approach &lt;a href="https://github.com/dumbbellcode/blazing-fast-excel-reads/tree/main/src/1-sequential-spawn-worker" rel="noopener noreferrer"&gt;here&lt;/a&gt;. When run on 20k Excel files (with sizes ranging from 200 to 300 KB), with 40 corrupt files, it completes in 40 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach 2: Re-use the same thread
&lt;/h2&gt;

&lt;p&gt;In the previous approach we spawned a new thread for each Excel file, because the worker exited after running its script. To make the worker run indefinitely, we can add a really long &lt;a href="https://github.com/dumbbellcode/blazing-fast-excel-reads/blob/main/src/2-sequential-worker-reuse/worker.ts#L6" rel="noopener noreferrer"&gt;interval timer in the worker&lt;/a&gt;. Now we want to push the next file into the worker as soon as it has finished processing the previous one or thrown an exception.&lt;/p&gt;

&lt;p&gt;How do we do it?&lt;/p&gt;

&lt;p&gt;We can use an &lt;code&gt;EventEmitter&lt;/code&gt; to notify the part of the code that pushes new files to the worker. With this we get a completely event-driven system that processes files and listens to events to decide when to push a file to the worker, when to terminate the worker, and so on.&lt;/p&gt;

&lt;p&gt;You can find the code for this approach &lt;a href="https://github.com/dumbbellcode/blazing-fast-excel-reads/tree/main/src/2-sequential-worker-reuse" rel="noopener noreferrer"&gt;here&lt;/a&gt;. When run on 20k Excel files (with sizes ranging from 200 to 300 KB), with 40 corrupt files, it completes in 5 minutes 🔥. Wow, this is much better than our previous script's performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach 3: Use all cores !
&lt;/h2&gt;

&lt;p&gt;We are still not using all available resources: we can process files in parallel across multiple threads. On an octa-core system, we can spawn 7 parallel workers and leave the last core for the main thread.&lt;/p&gt;

&lt;p&gt;This also adds complexity, because we now have to manage a pool of 7 workers, manage their timeouts, and route each file to whichever worker becomes available. Writing this kind of code with an event-driven methodology is really easy: whenever a worker exits or sends any kind of message, we use its threadId to take the appropriate action.&lt;/p&gt;
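
&lt;p&gt;A simplified pool sketch (worker creation and timeout handling are elided; the message shape is illustrative):&lt;/p&gt;

```javascript
// Sketch: register workers by threadId and refill whichever one reports back.
function runPool(workers, files) {
  const queue = files.slice();
  const byId = new Map();
  for (const w of workers) byId.set(w.threadId, w);

  const feed = (threadId) => {
    const next = queue.shift();
    if (next !== undefined) byId.get(threadId).postMessage(next);
  };

  for (const w of workers) {
    // On every message (result or error report), refill the sender.
    w.on('message', () => feed(w.threadId));
    feed(w.threadId); // prime each worker with its first file
  }
}
```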

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftyotiyv9pick51gkkdlw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftyotiyv9pick51gkkdlw.png" width="800" height="589"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can find the code for this approach &lt;a href="https://github.com/dumbbellcode/blazing-fast-excel-reads/tree/main/src/2-sequential-worker-reuse" rel="noopener noreferrer"&gt;here&lt;/a&gt;. When run on 20k Excel files (with sizes ranging from 200 to 300 KB), with 40 corrupt files, it completes in around 1 minute 💀.&lt;/p&gt;

&lt;p&gt;The scripts were run on a MacBook Air M3 and peaked at around 600 MB of RAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The approach is very similar to how message queues work in a distributed system: queue consumers pull the next message as soon as they have finished processing the previous one, and if a consumer/worker goes down, it can be restarted. Multiple consumers can pull from the message queue without any delay. It's also a good example of how strong Node's I/O is: used well, it can bring I/O-bound tasks down from hours to minutes.&lt;/p&gt;

&lt;p&gt;The source code for the examples can be found in &lt;a href="https://github.com/dumbbellcode/blazing-fast-excel-reads" rel="noopener noreferrer"&gt;this repository&lt;/a&gt;, give it a star if you learned something interesting.&lt;/p&gt;


</description>
      <category>node</category>
      <category>workerthread</category>
      <category>concurrency</category>
      <category>excel</category>
    </item>
    <item>
      <title>Observability and Security with AWS</title>
      <dc:creator>Sudheer Tripathi</dc:creator>
      <pubDate>Fri, 29 Mar 2024 14:45:09 +0000</pubDate>
      <link>https://dev.to/sudheer121/observability-and-security-with-aws-36lf</link>
      <guid>https://dev.to/sudheer121/observability-and-security-with-aws-36lf</guid>
      <description>&lt;p&gt;Compliance is boring, as developers we don't actively think about security loopholes in our infrastructure, just a few standard practices and that's it. However, AWS provides us with the tools that can help improve the observability of our infra and protect us from security issues down the line.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS S3 Security
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Observe who is accessing the S3 buckets&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can enable &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/ServerLogs.html"&gt;server access logs&lt;/a&gt; for your S3 buckets and view detailed records of the requests made to each bucket. The logs can be loaded into Athena to identify malicious access to buckets. Ideally you should have one server access logs bucket per region, with logs separated by bucket-name prefix paths.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observe S3 configuration changes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You should be able to log all configuration changes to your S3 buckets; this is useful for detecting malicious modifications to bucket configurations, and it's easy with AWS CloudTrail, which logs all API access to your buckets in CloudWatch. You can add a &lt;a href="https://docs.fugue.co/FG_R00083.html"&gt;metric filter&lt;/a&gt; with an alarm in CloudWatch to notify you if any malicious S3 configuration change activity is detected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Block public access to buckets if possible&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are serving public data via CloudFront, you can &lt;a href="https://aws.amazon.com/s3/features/block-public-access/"&gt;block all public access&lt;/a&gt; to S3 buckets and add a bucket policy that only allows CloudFront to access them. This is an additional security measure, although not strictly necessary if you are serving non-sensitive data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enforce HTTPS access on buckets&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's good practice to disable HTTP access to buckets. To do this you just need to &lt;a href="https://repost.aws/knowledge-center/s3-bucket-policy-for-config-rule"&gt;add a policy&lt;/a&gt; (aws:SecureTransport) to your bucket that denies insecure access.&lt;/p&gt;
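
&lt;p&gt;For reference, a minimal sketch of such a policy (the bucket name is a placeholder):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyInsecureTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ],
      "Condition": { "Bool": { "aws:SecureTransport": "false" } }
    }
  ]
}
```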

&lt;h3&gt;
  
  
  Cloudfront
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Enable WAF on CloudFront distributions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;WAF protects your distribution from bots and common web exploits by blocking such requests. This feature is just &lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/enable-waf-security-tab.html"&gt;one click away&lt;/a&gt; on your CloudFront dashboard. Be careful, as it can block genuine requests too; you can enable it in monitor mode to start with and add custom rules later on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CloudFront Standard Logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is similar to S3 server access logs: detailed information about requests to your distribution is &lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html"&gt;logged in a separate S3 bucket&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  IAM
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cloudwatch alarm for sign in / authorization failures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you have enabled management events in CloudTrail, authorization failures will also be logged in the associated CloudWatch log group. All you need to do is add a metric filter, with an alarm, that catches &lt;a href="https://docs.fugue.co/FG_R00065.html"&gt;console sign-in failures&lt;/a&gt; and authorization failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rotate IAM user passwords&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS recommends regularly rotating the passwords of all IAM users. This can be done by adding a &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_passwords_account-policy.html#PasswordPolicy_CLI"&gt;password policy&lt;/a&gt; from the AWS console.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At least one user with support role&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS provides the ability to manage incidents with AWS Support through a managed policy called arn:aws:iam::aws:policy/AWSSupportAccess. This allows an IAM user to interact with the AWS Support Center, where users can talk to AWS agents to resolve issues. It is recommended that at least one IAM user is assigned this policy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enable IAM access analyzer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;IAM Access Analyzer is an &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/what-is-access-analyzer.html"&gt;AWS security feature&lt;/a&gt; that regularly scans your IAM policies for misconfigurations, excessive privileges, and other possible security risks. It is recommended to enable it in all active regions.&lt;/p&gt;

&lt;h3&gt;
  
  
  VPC
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;VPC flow logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html"&gt;Flow logs&lt;/a&gt; capture information about incoming / outgoing / restricted traffic without any impact on network performance. They are helpful in diagnosing overly restrictive security group rules. It is recommended to enabled flog logs on all subnets in your VPC&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observe VPC configuration changes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;VPC configuration changes should be logged to CloudTrail, and a log metric filter with an associated alarm should be configured in CloudWatch so that you are notified of all changes to VPC configurations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lambda
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code Signing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-codesigning.html"&gt;This feature&lt;/a&gt; ensures only code from trusted sources runs in your lambda functions. For this you'll have to first create an AWS signer profile and AWS signing config using another service called is AWS signer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CloudWatch Lambda Insights&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lambda functions' runtime performance metrics (e.g. CPU, memory, disk usage) can be monitored by enabling &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Lambda-Insights.html"&gt;enhanced monitoring&lt;/a&gt; from the Lambda function console. This will allow you to spot unusual behaviour in your Lambda functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  EBS
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Volumes and Snapshots should be encrypted&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is recommended that all EBS snapshots and volumes in use are encrypted using an AWS KMS customer managed key. For existing volumes, take a snapshot, encrypt it, and restore the volume from it. The console also has an option to enable encryption by default for newly created volumes.&lt;/p&gt;

&lt;h3&gt;
  
  
  KMS
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Customer managed keys must be rotated regularly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rotation of AWS managed keys is handled by AWS; customer managed keys can be configured from the AWS console to rotate at a fixed interval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor KMS configuration changes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Just like the services mentioned previously, metric filters can be added to the CloudTrail logs in CloudWatch to catch API calls like &lt;code&gt;ScheduleKeyDeletion&lt;/code&gt; and &lt;code&gt;DisableKey&lt;/code&gt;, with an alarm associated with them.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;EC2&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Ensure IMDSv1 is not in use&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The instance metadata service on EC2 instances provides an API to access meta information about the instance; however, it is also an exploitation target for attackers. You need to ensure that your EC2 instances use IMDSv2 and not IMDSv1; read more &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Many of the security features mentioned above come at a cost, so be mindful of their impact on your AWS cloud bill.&lt;/p&gt;

&lt;p&gt;Hopefully, you learned something new!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>developer</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Designing SQL databases for rock solid data quality</title>
      <dc:creator>Sudheer Tripathi</dc:creator>
      <pubDate>Sun, 02 Oct 2022 18:07:27 +0000</pubDate>
      <link>https://dev.to/sudheer121/designing-sql-databases-for-rock-solid-data-quality-3ech</link>
      <guid>https://dev.to/sudheer121/designing-sql-databases-for-rock-solid-data-quality-3ech</guid>
      <description>&lt;p&gt;The blog focuses on 3 things,&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prevention from data redundancy.&lt;/li&gt;
&lt;li&gt;Good design choices.&lt;/li&gt;
&lt;li&gt;Bad design choices.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;u&gt;Prerequisites&lt;/u&gt;: Basic knowledge of databases and practices like indexing, normalization, etc. &lt;/p&gt;




&lt;h2&gt;
  
  
  1. Unique partial index to avoid data duplication
&lt;/h2&gt;

&lt;p&gt;While indexes are generally used to improve database performance, they can also be used to prevent redundancy in data. You can add &lt;a href="https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-UNIQUE-CONSTRAINTS"&gt;unique constraints&lt;/a&gt; on your columns in DDL, but for conditional unique constraints you will have to define a unique &lt;a href="https://www.postgresql.org/docs/current/indexes-partial.html"&gt;partial index&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Consider an example of a shopping-cart application, where along with orders you would also like to uniquely identify every checkout.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpzhxtvf77whatmn3ock.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpzhxtvf77whatmn3ock.png" alt="user_orders table" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above example, under cart_id &lt;code&gt;abc-123&lt;/code&gt;, the product_id &lt;code&gt;xyz-456&lt;/code&gt; has a duplicate entry. Such cases can be detected by adding a unique index.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;unique&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;unique_product_cart&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;user_orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cart_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;deleted_at&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Soft delete cascades
&lt;/h2&gt;

&lt;p&gt;Keeping &lt;a href="https://en.wiktionary.org/wiki/soft_deletion#:~:text=(databases)%20An%20operation%20in%20which,data%20itself%20from%20the%20database."&gt;soft deleted&lt;/a&gt; data in your database is awesome, but it makes ensuring data quality harder: every time you add a constraint, you'll have to account for the &lt;code&gt;deleted_at&lt;/code&gt; column (the first point is a good example).&lt;/p&gt;

&lt;p&gt;Managing soft deletes has to be done at the application level, since in SQL databases cascading on update only happens for foreign keys. If you are lucky, your ORM supports soft-delete cascades; otherwise you'll have to implement them yourself, for instance with model observers or by extending your ORM's query builder. They can be implemented using triggers too.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Transactions at application level
&lt;/h2&gt;

&lt;p&gt;A single operation at the application level might be a combination of multiple writes at the database level. While you might enforce all kinds of constraints on the database, your data will still get ruined if you don't implement transactions at the application level.&lt;/p&gt;

&lt;p&gt;Consider a movie ticket booking application: reserving a seat for a movie can be a combination of multiple operations at the database level, e.g. adding rows to the &lt;code&gt;seat_reservation&lt;/code&gt;, &lt;code&gt;user_activity&lt;/code&gt; and &lt;code&gt;snacks_orders&lt;/code&gt; tables. All of these operations would usually sit behind a function like &lt;code&gt;reserveTicket()&lt;/code&gt;. Such a function has to be wrapped in a transaction in application code.&lt;/p&gt;
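
&lt;p&gt;As a sketch, the writes behind &lt;code&gt;reserveTicket()&lt;/code&gt; wrapped in a single transaction (the column names are illustrative):&lt;/p&gt;

```sql
begin;
insert into seat_reservation (user_id, seat_id) values (7, 'A12');
insert into user_activity (user_id, action) values (7, 'reserved_seat');
insert into snacks_orders (user_id, item) values (7, 'popcorn');
commit; -- either all three rows land together, or (on rollback) none do
```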

&lt;h2&gt;
  
  
  4. Keep historic changes and quick lookup columns
&lt;/h2&gt;

&lt;p&gt;Sometimes saving historic data in a separate table becomes important. For example consider a credit card company, when you apply for a credit card it goes through multiple intermediate stages before getting approved. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrs5gblkmg351jefri6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrs5gblkmg351jefri6e.png" alt="Image description" width="800" height="138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above example, the &lt;code&gt;applications&lt;/code&gt; table keeps track of users' applications, while the &lt;code&gt;application_status&lt;/code&gt; table keeps track of the status of each application.&lt;br&gt;
The reason for keeping historic changes is to track who changed the status (creator_id) and why it was changed (comments).&lt;br&gt;
Notice that the applications table also has a &lt;code&gt;latest_status_id&lt;/code&gt; column for quick lookup of the latest status; it's good practice to have such columns.&lt;/p&gt;

&lt;p&gt;Credit card companies don't actually work this way, but I hope you get my point 🙃.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Save last data point in a JSON meta column
&lt;/h2&gt;

&lt;p&gt;Sometimes you might not need to track how a row's data has changed historically, but only keep the last data point for rolling back. Consider the same example from point 4: this time we save the status in the same table, but we also save the old status in a JSON column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5m2kyh4o3hjhb5gwgoe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5m2kyh4o3hjhb5gwgoe.png" alt="applications_jsonb" width="800" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This example doesn't do justice to how useful saving old critical data points in a JSON meta column can be 😶; used in the right place, it has the potential to save you a lot of time during data analysis or data rollback.&lt;/p&gt;
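
&lt;p&gt;A minimal Postgres sketch of the idea, assuming a &lt;code&gt;jsonb&lt;/code&gt; meta column on the applications table:&lt;/p&gt;

```sql
-- Stash the previous status in the meta column before overwriting it.
update applications
set meta = jsonb_set(coalesce(meta, '{}'), '{old_status}', to_jsonb(status)),
    status = 'rejected'
where id = 42;
```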

&lt;h2&gt;
  
  
  6. Log data changes
&lt;/h2&gt;

&lt;p&gt;While schema changes are tracked by migration files, it's also important to keep track of all data changes in the database. Usually this is implemented using a polymorphic table that records changes to all models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fon77waujqv6w6xihykz3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fon77waujqv6w6xihykz3.png" alt="changelogs" width="800" height="166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How to do it?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Aspect Oriented Programming (you are lucky if your framework has good support for it)&lt;/li&gt;
&lt;li&gt;Database &lt;a href="https://www.postgresql.org/docs/current/plpgsql-trigger.html#PLPGSQL-DML-TRIGGER"&gt;Triggers&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Overriding your query builder's CRUD methods/events or using external libraries that do that.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It's also important to regularly clean up these logs and archive them elsewhere, as the table grows heavy quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Use Polymorphic tables
&lt;/h2&gt;

&lt;p&gt;The table used in point 6 is a polymorphic table: the column &lt;code&gt;model_id&lt;/code&gt; is a foreign key that can point to any table in the database.&lt;/p&gt;

&lt;p&gt;Let's take another example of an online forum. The forum can have comment sections in multiple places and from multiple sources, e.g. comments from users, admins, automated comments from the system, etc. Instead of storing comments in multiple tables, we can have one polymorphic table for storing comments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focktig7pi976wf0tts9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focktig7pi976wf0tts9z.png" alt="Comments polymorphic" width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Avoid saving UUIDs as string
&lt;/h2&gt;

&lt;p&gt;While keeping UUIDs as primary keys is a &lt;a href="https://dzone.com/articles/uuid-as-primary-keys-how-to-do-it-right"&gt;good choice&lt;/a&gt;, saving them as binary is much better than saving them as strings.&lt;/p&gt;

&lt;p&gt;UUIDs consist of 32 hexadecimal digits and 4 hyphens. Saved as a string, a UUID can take up to 36 bytes (288 bits). Most SQL databases support saving UUIDs as 128-bit binary values, and the index size of binary UUIDs is less than half that of string UUIDs.&lt;/p&gt;
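
&lt;p&gt;In MySQL 8, for example, built-in conversion helpers make binary storage straightforward:&lt;/p&gt;

```sql
-- 16-byte binary primary key instead of a 36-character string.
create table users (
  id binary(16) primary key
);
insert into users (id) values (uuid_to_bin(uuid()));
select bin_to_uuid(id) as id from users;
```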

&lt;h2&gt;
  
  
  9. Avoid unnecessary indexes
&lt;/h2&gt;

&lt;p&gt;Indexes are cool, but it's important to keep in mind that they occupy memory and need to be updated on every write operation (in Postgres). Too many indexes means too many indexes to update on every insert/delete, which will slow down your database writes.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Use views when possible
&lt;/h2&gt;

&lt;p&gt;There are multiple cases where views are a good choice; I'll describe one of them. Consider a case where stakeholders require reports over all the data in your critical tables. For this, you would aggregate data from multiple tables using joins, which overall makes for a complex query. There can be multiple places where you want to reuse such aggregate data, i.e. reuse the query. In such cases, writing a view is a good choice.&lt;/p&gt;

&lt;p&gt;Creating a view is the same as writing the same query at the application level, except that with a view the query is saved, and reused, at the database level.&lt;/p&gt;




&lt;p&gt;There are many more things that make for good choices when it comes to designing databases and maintaining data quality. If you have things to share, please do so in the comments.&lt;/p&gt;

&lt;p&gt;-- &lt;a href="https://github.com/Sudheer121"&gt;sudheer121&lt;/a&gt; &lt;/p&gt;

</description>
      <category>sql</category>
      <category>postgres</category>
      <category>database</category>
      <category>mysql</category>
    </item>
    <item>
      <title>Nodejs developer gets blown away by Laravel</title>
      <dc:creator>Sudheer Tripathi</dc:creator>
      <pubDate>Fri, 04 Feb 2022 07:54:40 +0000</pubDate>
      <link>https://dev.to/sudheer121/nodejs-developer-gets-blown-away-by-laravel-5dpi</link>
      <guid>https://dev.to/sudheer121/nodejs-developer-gets-blown-away-by-laravel-5dpi</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;"They've got docker setup upfront, that's awesome !!" - on the first day of learning laravel&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I don't know what you think about PHP, but the developer experience with Laravel has been really good, and it motivated me to write this blog.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foe5blnc2tjrnca8w9041.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foe5blnc2tjrnca8w9041.jpeg" alt="Laravel deveolper" width="307" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  My background with other Frameworks / Libs
&lt;/h3&gt;

&lt;p&gt;Most of my projects have been full-stack JavaScript / TypeScript, which makes me look at Laravel through a different lens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Here are the top 5 things I liked about Laravel.
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Automated dockerizing with &lt;code&gt;Laravel Sail&lt;/code&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Managing different database types/versions and switching between them for different projects gets messy.&lt;/li&gt;
&lt;li&gt;With Laravel Sail, you can get your Laravel application and the database of your choice running inside a Docker container in no time.&lt;/li&gt;
&lt;li&gt;The best part: the &lt;code&gt;sail&lt;/code&gt; CLI connects you to your dockerized Laravel application from outside Docker.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Server Side Rendering with &lt;code&gt;Blade&lt;/code&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If your frontend is complex, separating it out is the better option, and &lt;code&gt;NextJs / NuxtJs / etc&lt;/code&gt; would be the way to go. &lt;/li&gt;
&lt;li&gt;But when it comes to server-side rendering inside your backend application, Laravel really shines. SSR code written with Blade is far cleaner than the &lt;code&gt;Express + EJS&lt;/code&gt; duo.&lt;/li&gt;
&lt;li&gt;You get more powerful directives, multiple ways to nest components, custom directives, and more. If a component involves heavy logic, you can make it a class-based component. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consider an example where you have to show some posts to a logged-in user, skipping the first post. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhxqrk0xv4ly9t7tm5uz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhxqrk0xv4ly9t7tm5uz.png" alt="With EJS" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrac4766f8wosg95smza.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrac4766f8wosg95smza.png" alt="With Blade" width="800" height="583"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Laravel Blade creates an "aha" moment. &lt;/p&gt;
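&lt;p&gt;As a rough sketch of the Blade side of that example (the &lt;code&gt;$posts&lt;/code&gt; variable and the markup are assumed, not taken from the screenshot):&lt;/p&gt;

```blade
{{-- Hypothetical sketch: show posts to a logged-in user, skip the first --}}
@auth
  &lt;ul&gt;
    @foreach ($posts as $post)
      @continue($loop->first) {{-- skip the first post --}}
      &lt;li&gt;{{ $post->title }}&lt;/li&gt;
    @endforeach
  &lt;/ul&gt;
@endauth
```

&lt;p&gt;The &lt;code&gt;$loop&lt;/code&gt; variable is available inside every &lt;code&gt;@foreach&lt;/code&gt;, which replaces the index bookkeeping you would do by hand in EJS.&lt;/p&gt;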




&lt;h3&gt;
  
  
  Application bootstrapping and Dependency Injection
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The Laravel application instance is a &lt;code&gt;service container&lt;/code&gt;. You can bind a class into the &lt;code&gt;service container&lt;/code&gt; as a singleton and reuse the same instance wherever you need it. &lt;/li&gt;
&lt;li&gt;Laravel automatically resolves &lt;a href="https://freecontent.manning.com/understanding-constructor-injection/"&gt;constructor injection&lt;/a&gt; and &lt;a href="https://freecontent.manning.com/understanding-method-injection/"&gt;method injection&lt;/a&gt; (&lt;code&gt;kinda like NestJs&lt;/code&gt;). &lt;/li&gt;
&lt;li&gt;With &lt;code&gt;AppServiceProvider&lt;/code&gt; you can easily swap out which implementation gets injected by default. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5a71fxj1pe0p9ov5mz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5a71fxj1pe0p9ov5mz2.png" alt="Bind to app" width="800" height="642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6l7xeu2clcszkuha90z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6l7xeu2clcszkuha90z.png" alt="Auto Inject" width="800" height="676"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Migrations, Factories and Query Builder
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Writing migrations and seeding the database is easy. &lt;/li&gt;
&lt;li&gt;Personally, I have used &lt;code&gt;Sequelize&lt;/code&gt; heavily, and its migrations come with an initial learning curve.&lt;/li&gt;
&lt;li&gt;Laravel migrations felt cleaner to me. Laravel also ships with a developer-friendly query builder out of the box, similar to &lt;code&gt;TypeORM&lt;/code&gt; but more flexible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consider the example below, where we eager-load the Author, Category, and Comments relations of a Post and filter the results by post body and category name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foup7zazm71dh56x22f3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foup7zazm71dh56x22f3u.png" alt="Laravel Query Builder Example" width="800" height="698"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Top-notch support for &lt;code&gt;miscellaneous requirements&lt;/code&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;With NodeJs frameworks, adding (and updating) npm packages is a frequent chore; you usually install a package for every miscellaneous requirement. &lt;/li&gt;
&lt;li&gt;Laravel has built-in or first-party support for authentication, request validation, cron jobs, mailing, event handling, HTTP requests, notifications, caching, file storage, OAuth, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reviews above are based solely on developer experience. Laravel is slower than many other backend frameworks in raw performance, but it does fine for most requirements.&lt;/p&gt;

&lt;p&gt;There's something unique to learn from every framework (I learned some new design patterns in Laravel) and it's important to not get attached to the one you are using.&lt;/p&gt;




&lt;h3&gt;
  
  
  Hiring Alert
&lt;/h3&gt;

&lt;p&gt;I work as a Software Engineer Intern at ClearGlass, a Cost Transparency company based in London, UK.&lt;/p&gt;

&lt;p&gt;ClearGlass is looking for Senior Software Engineers for the Engineering team. Most of our tech stack is NodeJS / PHP. Learn more about us and apply &lt;a href="https://clearglass.com/careers/"&gt;here&lt;/a&gt;, and check out our tech stack &lt;a href="https://stackshare.io/clearglass/clearglass"&gt;here&lt;/a&gt;. &lt;br&gt;
See you there 👋. &lt;/p&gt;

</description>
      <category>laravel</category>
      <category>node</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
