<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tarun Batra</title>
    <description>The latest articles on DEV Community by Tarun Batra (@tbking).</description>
    <link>https://dev.to/tbking</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F165333%2Fc473e514-8ef5-43f4-8ddd-f2f29e0a2f21.png</url>
      <title>DEV Community: Tarun Batra</title>
      <link>https://dev.to/tbking</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tbking"/>
    <language>en</language>
    <item>
      <title>macbook.init(): a dev machine set-up guide</title>
      <dc:creator>Tarun Batra</dc:creator>
      <pubDate>Mon, 03 Oct 2022 05:11:48 +0000</pubDate>
      <link>https://dev.to/tbking/macbookinit-a-dev-machine-set-up-guide-4fmg</link>
      <guid>https://dev.to/tbking/macbookinit-a-dev-machine-set-up-guide-4fmg</guid>
<description>&lt;p&gt;I started a new job, and I am excited. I get a new MacBook delivered, and after keeping the delicate wrapping paper intact during the intricate unboxing, it dawns upon me that it is not &lt;em&gt;my&lt;/em&gt; MacBook. It is a brand-new system devoid of all the custom tools that make my (work) life easier. On top of that, I do not even remember what those tools were to begin with. Was it &lt;code&gt;oh-my-zsh&lt;/code&gt;, or was I using &lt;code&gt;fish&lt;/code&gt;? What did I do last time to get the top bar to show Bluetooth? I end up setting up small things like these, slowly, every day, until it all becomes familiar and I forget about them.&lt;/p&gt;

&lt;p&gt;Not this time, I said to myself. As I started working at Okta, I decided to log the steps to set up my system, mostly for future me, but also for anyone else stumbling on the same StackOverflow articles over and over again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Update OS
&lt;/h2&gt;

&lt;p&gt;Yeah, just update the OS now, while you have the patience. It is probably outdated, and the next step &lt;em&gt;may&lt;/em&gt; require it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install Xcode
&lt;/h2&gt;

&lt;p&gt;Do not wait for some CLI tool to fail on you later; install Xcode now. You might get lucky with just &lt;code&gt;xcode-select --install&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install Firefox/Brave
&lt;/h2&gt;

&lt;p&gt;Use Safari to install Firefox and Brave and log into them. Also set up extensions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Password managers&lt;/li&gt;
&lt;li&gt;Ad blockers&lt;/li&gt;
&lt;li&gt;Web3 wallets&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Set up terminal environment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install iTerm2
&lt;/h3&gt;

&lt;p&gt;Download the latest build &lt;a href="https://iterm2.com/downloads/stable/latest"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Install homebrew
&lt;/h3&gt;

&lt;p&gt;Refer &lt;a href="https://brew.sh/#install"&gt;https://brew.sh/#install&lt;/a&gt;, or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/bin/bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Install oh my zsh
&lt;/h3&gt;

&lt;p&gt;Refer &lt;a href="https://ohmyz.sh/#install"&gt;https://ohmyz.sh/#install&lt;/a&gt;, or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Disable history sharing across zsh sessions
&lt;/h3&gt;

&lt;p&gt;echo "unsetopt share_history\nsetopt no_share_history" &amp;gt;&amp;gt; ~/.zshrc&lt;/p&gt;

&lt;h2&gt;
  
  
  Set up keys 🔑
&lt;/h2&gt;

&lt;p&gt;SSH and GPG keys need to be generated/imported and updated &lt;strong&gt;everywhere&lt;/strong&gt;, especially GitHub.&lt;/p&gt;

&lt;h3&gt;
  
  
  SSH
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ssh-keygen -t ecdsa -b 521&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  PGP
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;gpg --import &amp;lt;private-key-file&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Install nvm
&lt;/h3&gt;

&lt;p&gt;Refer &lt;a href="https://github.com/nvm-sh/nvm#installing-and-updating"&gt;https://github.com/nvm-sh/nvm#installing-and-updating&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-o-&lt;/span&gt; https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Install tmuxinator
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew insatll tmuxinator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Customize vim
&lt;/h3&gt;

&lt;p&gt;Restore dotfiles and customize vim and other things.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;At some point, I should figure out that something is wrong with the trackpad and check that box that says &lt;strong&gt;Tap to click&lt;/strong&gt; in settings. This is a living document, and I will keep adding things to it as I find them. Now I should get back to that new system I just configured and get some work done!&lt;/p&gt;

</description>
      <category>devjournal</category>
      <category>tooling</category>
    </item>
    <item>
      <title>PyPi packages decoded for npm developers</title>
      <dc:creator>Tarun Batra</dc:creator>
      <pubDate>Mon, 31 Aug 2020 20:59:27 +0000</pubDate>
      <link>https://dev.to/tbking/pypi-packages-decoded-for-npm-developers-51j4</link>
      <guid>https://dev.to/tbking/pypi-packages-decoded-for-npm-developers-51j4</guid>
<description>&lt;p&gt;If you are a developer in the JavaScript / Node.js ecosystem, you have used &lt;code&gt;npm&lt;/code&gt;. It's the world's largest software registry, and the ecosystem around it is advanced and mature. I realized this when I tried to port &lt;code&gt;password-validator&lt;/code&gt;, an open-source npm package I maintain, to Python. I spent weeks figuring out things that were very intuitive to me in the JavaScript world. This blog lists some of the key differences I found between the two ecosystems, for anybody else who treads the same path. By the end of this post, we should be able to install our package using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &amp;lt;my-awesome-package&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What is a package?
&lt;/h2&gt;

&lt;p&gt;In &lt;code&gt;npm&lt;/code&gt;, any folder which has a &lt;code&gt;package.json&lt;/code&gt; file is a package. In the Python world, we need a file called &lt;code&gt;setup.py&lt;/code&gt; in the root folder to make our project a package for &lt;a href="https://pypi.org/"&gt;PyPi&lt;/a&gt;, a common repository for Python packages. This file declares metadata like name, version, author, and even dependencies (with a catch). Look at this &lt;a href="https://github.com/pypa/sampleproject/blob/master/setup.py"&gt;sample setup.py file&lt;/a&gt; to get an idea.&lt;/p&gt;
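&lt;p&gt;As a rough sketch, a minimal &lt;code&gt;setup.py&lt;/code&gt; might look like this. The name, version, and author here are placeholders, not values from any real project:&lt;/p&gt;

```python
# setup.py -- a minimal sketch; every value here is a placeholder
from setuptools import setup, find_packages

setup(
    name="my_awesome_package",     # the name users will `pip install`
    version="0.1.0",
    author="Your Name",
    description="A short, one-line summary of the package",
    packages=find_packages(),      # finds any directory containing __init__.py
    install_requires=[],           # runtime dependencies, e.g. ["requests"]
)
```

&lt;p&gt;This plays roughly the role that the top-level fields of &lt;code&gt;package.json&lt;/code&gt; play in npm.&lt;/p&gt;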

&lt;p&gt;Normally, the actual package code goes inside a directory with the package name, and that directory is required to have an &lt;code&gt;__init__.py&lt;/code&gt; file in it to indicate that it is a Python package. Note that the directory name becomes the import name, so it cannot contain hyphens. The directory structure would look more or less like:&lt;/p&gt;

&lt;pre&gt;
package-project
├── setup.py
└── my_package
    ├── __init__.py
    └── lib.py
&lt;/pre&gt;

&lt;p&gt;In this structure your actual package code will go in &lt;code&gt;__init__.py&lt;/code&gt; and/or any other python file in this directory like &lt;code&gt;lib.py&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dependency management
&lt;/h2&gt;

&lt;p&gt;Dependency management is one area where I found the Python ecosystem lacking and primordial. &lt;code&gt;npm&lt;/code&gt; has concepts like &lt;code&gt;dependencies&lt;/code&gt;, &lt;code&gt;devDependencies&lt;/code&gt;, and &lt;code&gt;package-lock&lt;/code&gt; which make dependency management easier. The manifest file &lt;code&gt;setup.py&lt;/code&gt; has a section called &lt;code&gt;install_requires&lt;/code&gt; which lists the dependencies required to run the project. These dependencies are downloaded while installing the package using &lt;code&gt;pip&lt;/code&gt;. However, it is not widely used to pin exact versions or transitive dependencies (the dependencies of dependencies).&lt;/p&gt;

&lt;p&gt;Instead, a combination of a virtual environment and a requirements file is used.&lt;/p&gt;

&lt;h3&gt;
  
  
  Virtual environment
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;npm&lt;/code&gt; allows installing packages globally or locally to a project. With &lt;code&gt;pip&lt;/code&gt;, packages are installed globally by default. To separate local dependencies from system-wide packages, a virtual environment is created, which contains all the dependencies local to the project and even a local Python distribution. Code running in a virtual environment doesn't depend on system-wide packages, so one can be reasonably assured that it will run on any system where the virtual environment is set up correctly.&lt;/p&gt;

&lt;p&gt;There are multiple ways of setting up a virtual environment, like &lt;code&gt;venv&lt;/code&gt; and &lt;code&gt;virtualenv&lt;/code&gt;. I found the former to be the easiest to use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv &amp;lt;&lt;span class="nb"&gt;dir&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above command creates a virtual environment in the directory &lt;code&gt;&amp;lt;dir&amp;gt;&lt;/code&gt;, which means this directory will house everything required to run the program, including Python and pip distributions. This directory need not be committed to version control. To use the virtual environment, it has to be "activated".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; &amp;lt;&lt;span class="nb"&gt;dir&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Requirements file
&lt;/h3&gt;

&lt;p&gt;Every virtual environment has its own pip and Python distribution, so it's a clean slate when you start. As we install dependencies, we can make a log of them using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip freeze &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This logs all the packages and their versions installed by pip into the file &lt;code&gt;requirements.txt&lt;/code&gt;. The file name can be anything, but it's a convention in the Python community to use this name. There is no concept of devDependencies in Python, but it's not uncommon for projects to have a &lt;code&gt;requirements-dev.txt&lt;/code&gt; to log the dev dependencies of the project. Here's a &lt;a href="https://github.com/tarunbatra/password-validator-python/blob/master/requirements-dev.txt"&gt;sample requirements file&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A project's dependencies can now be installed with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Documentation
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;JSDoc&lt;/code&gt; is the most common way to document code in JavaScript, and the list of documentation tools almost stops there. Documentation in Python, on the other hand, is taken very seriously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sphinx
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.sphinx-doc.org/en/master/"&gt;&lt;code&gt;Sphinx&lt;/code&gt;&lt;/a&gt; is a documentation generator for Python which can be used in various ways thanks to a variety of extensions available. To get started with sphinx, we need a &lt;code&gt;conf.py&lt;/code&gt; file which will have configuration for sphinx. The file can be generated using &lt;code&gt;sphinx-quickstart&lt;/code&gt; CLI utility. Here's a &lt;a href="https://github.com/tarunbatra/password-validator-python/blob/master/docs/source/conf.py"&gt;sample conf.py file&lt;/a&gt; to get started.&lt;/p&gt;

&lt;p&gt;Sphinx and the Python ecosystem prefer &lt;code&gt;reStructuredText&lt;/code&gt; (reST) for documentation instead of Markdown, which is common in the JavaScript world. Markdown can be used with Sphinx too, but it's a bit of work, and reST is very similar to Markdown anyway. Here's a sample &lt;a href="https://raw.githubusercontent.com/tarunbatra/password-validator-python/master/README.rst"&gt;README in reST&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Docstrings
&lt;/h2&gt;

&lt;p&gt;Sphinx can look through special documentation strings in Python code and include them in the generated documentation. These strings, called &lt;code&gt;Docstrings&lt;/code&gt;, describe functions, classes, etc., and are very similar to JSDoc in JavaScript and Javadoc in Java. Here's a method documented with a Docstring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;''' Performs addition of 2 numbers

  Example:
    &amp;gt;&amp;gt;&amp;gt; add(1, 2)
    3
  Returns:
    int: Result of addition
  '''&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a number of formats for writing Docstrings, which can be a topic of discussion in itself. I found myself going back to &lt;a href="https://realpython.com/documenting-python-code/#documenting-your-python-code-base-using-docstrings"&gt;this guide by RealPython&lt;/a&gt;, and I recommend giving it a read for a deeper understanding of Docstrings. Using the combination of reST files and Docstrings, we can generate documentation with Sphinx in formats like PDF and HTML. Here's a &lt;a href="https://tarunbatra.com/password-validator-python/api_reference.html"&gt;sample HTML documentation&lt;/a&gt; generated using Sphinx.&lt;/p&gt;

&lt;p&gt;There are less popular alternatives to Sphinx too. &lt;a href="https://readthedocs.org/"&gt;Read the Docs&lt;/a&gt;, often mentioned in the same breath, is not an alternative but a popular service for hosting documentation built with Sphinx.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests
&lt;/h2&gt;

&lt;p&gt;I found writing unit tests in Python easier than in JavaScript due to the availability of mature and built-in tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  unittest
&lt;/h3&gt;

&lt;p&gt;Python has a built-in module &lt;a href="https://docs.python.org/3/library/unittest.html"&gt;&lt;code&gt;unittest&lt;/code&gt;&lt;/a&gt; which can be used to run the tests. The syntax is very similar to Jest or Mocha. Every test suite is a class, and every method of that class is a test case. Methods like &lt;code&gt;setUp()&lt;/code&gt; and &lt;code&gt;tearDown()&lt;/code&gt; have special meanings and are equivalent to the &lt;code&gt;before&lt;/code&gt; and &lt;code&gt;after&lt;/code&gt; hooks seen in JavaScript testing frameworks. Here's a &lt;a href="https://github.com/tarunbatra/password-validator-python/blob/master/tests/test_password_validator.py"&gt;sample test file&lt;/a&gt;. Tests are run with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; unittest discover &lt;span class="nt"&gt;-s&lt;/span&gt; &amp;lt;test_directory&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
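&lt;p&gt;To make the structure concrete, here is a minimal sketch of a test case; the function under test and all names are illustrative, not taken from the linked sample:&lt;/p&gt;

```python
import unittest

# The function under test, defined inline so the example is self-contained.
def add(a, b):
    return a + b

class TestAdd(unittest.TestCase):
    def setUp(self):
        # Runs before every test method, like `beforeEach` in Jest/Mocha.
        self.a, self.b = 1, 2

    def test_returns_sum(self):
        self.assertEqual(add(self.a, self.b), 3)

    def tearDown(self):
        # Runs after every test method, like `afterEach`.
        pass
```

&lt;p&gt;Saved as, say, &lt;code&gt;tests/test_add.py&lt;/code&gt;, it would be picked up by the &lt;code&gt;discover&lt;/code&gt; command above.&lt;/p&gt;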



&lt;p&gt;&lt;a href="https://docs.pytest.org/en/stable/"&gt;Pytest&lt;/a&gt; and &lt;a href="https://docs.nose2.io/en/latest/"&gt;nose2&lt;/a&gt; are some of the alternatives but I found &lt;code&gt;unittest&lt;/code&gt; adequate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Doctests
&lt;/h3&gt;

&lt;p&gt;Earlier, in the Documentation section, we discussed Docstrings. The Example section in a Docstring gives the user an understanding of how to use the function. What's great is that we can also run these examples as automated tests, using &lt;a href="https://docs.python.org/3/library/doctest.html"&gt;doctest&lt;/a&gt; (another built-in module). This keeps the documentation and the implementation in sync.&lt;/p&gt;

&lt;h2&gt;
  
  
  Publishing
&lt;/h2&gt;

&lt;p&gt;Packages can be published to PyPi after building. Builds used to be made in the Egg format, which is now being replaced by the Wheel format. To create a Wheel build, we need to install &lt;code&gt;setuptools&lt;/code&gt; and &lt;code&gt;wheel&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; setuptools wheel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the build can be created by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 setup.py sdist bdist_wheel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates two files in the &lt;code&gt;dist&lt;/code&gt; folder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One with &lt;code&gt;.tar.gz&lt;/code&gt; extension which is a source archive.&lt;/li&gt;
&lt;li&gt;Another with &lt;code&gt;.whl&lt;/code&gt; extension which is the build file ready to be installed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now we need to upload the package build to PyPi. We need &lt;code&gt;twine&lt;/code&gt; for that.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; twine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final step is uploading the build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 -m twine upload --repository testpypi dist/*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--repository&lt;/code&gt; here lets you select a specific package repository (in the command above, TestPyPI, the test instance of PyPi), much like custom or private npm registries. It can be configured with a &lt;a href="https://packaging.python.org/specifications/pypirc/"&gt;&lt;code&gt;.pypirc&lt;/code&gt;&lt;/a&gt; file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Additional resources
&lt;/h2&gt;

&lt;p&gt;Two resources which helped me a lot in figuring out my issues were:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The official &lt;a href="https://packaging.python.org/"&gt;Python Packaging User Guide&lt;/a&gt;. I recommend going through it once.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://realpython.com/"&gt;The Real Python&lt;/a&gt;. I recommend clicking on any link from this website which comes up in search results.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It was an interesting experience to tackle all the challenges I faced on my journey to publish my first package on PyPi. The mature testing framework was a breath of fresh air, while dependency management and documentation generation were daunting tasks due to all the different tools to manage. In the end I published &lt;a href="https://github.com/tarunbatra/password-validator-python/"&gt;password_validator&lt;/a&gt; and I am happy with the result.&lt;/p&gt;

&lt;p&gt;Do reach out in comments or on my &lt;a href="https://twitter.com/tarunbatra/"&gt;Twitter&lt;/a&gt; if you have any issues with the steps I have mentioned here. Happy publishing!&lt;/p&gt;

</description>
      <category>python</category>
      <category>pip</category>
      <category>npm</category>
      <category>pypi</category>
    </item>
    <item>
      <title>Deploy your website on IPFS: Why and How</title>
      <dc:creator>Tarun Batra</dc:creator>
      <pubDate>Mon, 20 Apr 2020 18:48:20 +0000</pubDate>
      <link>https://dev.to/tbking/deploy-your-website-on-ipfs-why-and-how-mai</link>
      <guid>https://dev.to/tbking/deploy-your-website-on-ipfs-why-and-how-mai</guid>
      <description>&lt;p&gt;I wanted to learn about IPFS and Web 3.0 so I started exploring it and tried to upload my site there. This is an account of the concepts I learnt along the way and how I finally got my site hosted on &lt;a href="http://tbking.eth.link"&gt;tbking.eth&lt;/a&gt;. I'll give a brief intro about IPFS and why hosting static content there makes sense. If you are already familiar with IPFS, you can skip to the hosting part.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is IPFS
&lt;/h2&gt;

&lt;p&gt;The Inter-Planetary File System is a decentralized network of shared content. It has a very simple yet interesting design philosophy:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Build a network that works across planets, and it will be a better network for communication across Earth too.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;IPFS is a decentralized network where peers are connected and share files, in many ways like BitTorrent. The fundamental difference from the traditional web, where files are served based on their location, is that in IPFS files are served based on their content. Consider this comparison:&lt;/p&gt;

&lt;p&gt;Google's privacy policy is identified as a file hosted on Google's servers at the address &lt;a href="https://policies.google.com/privacy"&gt;https://policies.google.com/privacy&lt;/a&gt;. The content of the policy doesn't matter to the address. This is called &lt;strong&gt;&lt;code&gt;location-addressing&lt;/code&gt;&lt;/strong&gt;.&lt;br&gt;
On the other hand, IPFS identifies files by their content, using the hash of the file. Let's say you want to read "&lt;em&gt;XKCD #327 - Exploits of a Mom&lt;/em&gt;". Its IPFS address is &lt;a href="https://ipfs.io/ipfs/QmZVjV5jFV7Jo4Hfj6WPyRnHCxf8kbadkqtQBco2gef64x/"&gt;https://ipfs.io/ipfs/QmZVjV5jFV7Jo4Hfj6WPyRnHCxf8kbadkqtQBco2gef64x/&lt;/a&gt;. This address doesn't depend on the location of any server, only on the hash of the content. This is called &lt;strong&gt;&lt;code&gt;content-addressing&lt;/code&gt;&lt;/strong&gt;. Whoever cares about XKCD can host it. This makes &lt;a href="https://en.wikipedia.org/wiki/Link_rot"&gt;broken links&lt;/a&gt; unlikely, which is also one of the goals of IPFS.&lt;/p&gt;
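&lt;p&gt;The idea that an address is derived purely from content can be illustrated with an ordinary hash. This is only an analogy: real IPFS CIDs use multihash and base encodings, not a bare SHA-256 hex digest:&lt;/p&gt;

```python
import hashlib

content = b"Exploits of a Mom"
digest = hashlib.sha256(content).hexdigest()

# Any node holding these exact bytes derives the same identifier,
# so the address says *what* the file is, not *where* it lives.
# Change one byte of the content and you get a different address.
print(digest)
```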

&lt;p&gt;&lt;a href="https://docs-beta.ipfs.io/concepts/"&gt;IPFS docs&lt;/a&gt; explain these concepts well. I recommend them to anyone who wants to dig deeper.&lt;/p&gt;
&lt;h2&gt;
  
  
  Hosting on IPFS
&lt;/h2&gt;

&lt;p&gt;Deploying static content such as personal websites on IPFS is easy. The steps I'm listing below can be used for any file such as a plain HTML file, a website generated by static site generators like Jekyll, Hugo, Hexo, and Gatsby or even a media file. So let's dive in.&lt;/p&gt;
&lt;h3&gt;
  
  
  Adding files
&lt;/h3&gt;
&lt;h4&gt;
  
  
  IPFS Desktop
&lt;/h4&gt;

&lt;p&gt;If you have &lt;a href="https://github.com/ipfs-shipyard/ipfs-desktop"&gt;IPFS Desktop&lt;/a&gt; installed and running, you can add your files using a normal file selector. Just import the directory which has the contents of your static website.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cO11Mz_7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/a8k2247ofkfndqlc2pdr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cO11Mz_7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/a8k2247ofkfndqlc2pdr.png" alt="Adding file in IPFS Desktop" width="880" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Right-click on the newly added content and select &lt;code&gt;Copy CID&lt;/code&gt;. Open &lt;code&gt;https://ipfs.io/ipfs/&amp;lt;cid&amp;gt;&lt;/code&gt; to see our files on the web! 🚀&lt;/p&gt;

&lt;p&gt;Easy right? 😀&lt;/p&gt;

&lt;h4&gt;
  
  
  IPFS CLI
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://dist.ipfs.io/#go-ipfs"&gt;IPFS CLI&lt;/a&gt; allows adding files and directories using the &lt;code&gt;add&lt;/code&gt; subcommand.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ipfs add -r ./sample_website
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;added QmaYZE5QQVR3TqQSL5WDcwuxUwBqczbMkmbHeoeNEAumbe sample_website/index.css
added Qme5TzVW4en8HWP49N9nLHwUkAMhCn6zvWteoknmFDhwJg sample_website/index.html
added QmThzr7LgYmUz6tfTArJK5tbrtidXSA12ZCL4ESy6HB8P2 sample_website/index.js
added QmSYDncaMte6GPx5jC7kqeXWnrKsCp4nkLfQXPxiAKtECU sample_website/something.html
added QmeUG2oZvyx4NzfpP9rruKbmV5UNDmTQ8MoxuhTJGVZVTW sample_website
 1.01 KiB / 1.01 KiB [=================================================] 100.00%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hash printed in the last line is the CID for the whole directory and thus our website. We can see the sample website hosted on &lt;a href="https://ipfs.io/ipfs/QmeUG2oZvyx4NzfpP9rruKbmV5UNDmTQ8MoxuhTJGVZVTW/"&gt;https://ipfs.io/ipfs/QmeUG2oZvyx4NzfpP9rruKbmV5UNDmTQ8MoxuhTJGVZVTW/&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;TIP: It is important to use relative links in your website since IPFS gateways have a URL which looks like &lt;code&gt;&amp;lt;gateway&amp;gt;/ipfs/&amp;lt;cid&amp;gt;/file.ext&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Pinning
&lt;/h3&gt;

&lt;p&gt;In the last section, we added files to &lt;strong&gt;our&lt;/strong&gt; IPFS node for the network to find. This is why the IPFS gateway was able to resolve them and show them in the browser. But the site will most likely become unreachable once you shut down your IPFS daemon. A node that requests some content does temporarily host it too, but that copy is garbage collected after about 12 hours. So how do you serve your site round the clock on a decentralized web, without a server?&lt;/p&gt;

&lt;p&gt;Welcome, &lt;a href="https://docs.ipfs.io/guides/concepts/pinning/"&gt;Pinning&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;A node that pins some content on IPFS hosts it forever (until it unpins it). There are pinning services like &lt;a href="https://pinata.cloud/"&gt;Pinata&lt;/a&gt; which pin files on their IPFS nodes. This way, the website is always available.&lt;br&gt;
In Pinata, you can either upload a file or just provide its hash if the content has already been uploaded to IPFS. Here's how I pinned the sample website we uploaded above.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E2JG1Lcc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/0uiaxnvdb33d27dowx9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E2JG1Lcc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/0uiaxnvdb33d27dowx9j.png" alt="Pinata pinning example" width="880" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;TIP: It is good to pin your site using multiple pinning services for redundancy.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Automated deployments
&lt;/h3&gt;

&lt;p&gt;As you might have noticed already, using IPFS is very easy. I would go as far as to say it's easier than dealing with the traditional web that we use. However, this process has to be repeated every time you want to change your files, which is not ideal. There are tools like &lt;a href="https://fleek.co/"&gt;Fleek&lt;/a&gt; which help automate all the steps listed above.&lt;/p&gt;

&lt;p&gt;Fleek is like Travis or CircleCI for IPFS deployments. You can link your GitHub account with it, and using GitHub hooks, Fleek will trigger a deployment on every push to the GitHub repository. It also pins everything it deploys.&lt;/p&gt;

&lt;p&gt;I use Hexo to generate this blog, and I was able to add a build step in the Fleek dashboard itself so that I don't need to generate the HTML and push it to my repository. Here's the build command that I use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git submodule update &lt;span class="nt"&gt;--recursive&lt;/span&gt; &lt;span class="nt"&gt;--init&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm i &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm run build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, I needed to initialize the submodules myself; Fleek doesn't do that by default (yet). Check it out, it's super easy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Link to a domain
&lt;/h3&gt;

&lt;p&gt;So now we have our site up and running, but content on IPFS is not as easy to look up as on the traditional web. My site can be found at &lt;a href="https://tarunbatra.com"&gt;https://tarunbatra.com&lt;/a&gt;. But on IPFS, the &lt;em&gt;current version&lt;/em&gt; can be accessed at &lt;a href="https://ipfs.io/ipfs/QmTPTa1ddoSkuakaW56SaL9dicbC71BbwfjRbVjasshCXs/"&gt;https://ipfs.io/ipfs/QmTPTa1ddoSkuakaW56SaL9dicbC71BbwfjRbVjasshCXs/&lt;/a&gt;.&lt;br&gt;
You see the problem: this address is not human-readable, and the CID changes after every update. There are two solutions to this:&lt;/p&gt;

&lt;h4&gt;
  
  
  DNSLink
&lt;/h4&gt;

&lt;p&gt;With &lt;a href="https://docs.ipfs.io/guides/concepts/dnslink/"&gt;DNSLink&lt;/a&gt;, you can point a normal domain to an IPFS content. It can be set up on Fleek very easily. I've pointed &lt;a href="http://ipfs.tarunbatra.com"&gt;ipfs.tarunbatra.com&lt;/a&gt; to the IPFS version using Fleek and you'll be able to open this site.&lt;/p&gt;

&lt;p&gt;IPNS (the Inter-Planetary Name System) also exists and is similar to DNSLink, but it's much slower right now.&lt;/p&gt;

&lt;h4&gt;
  
  
  Ethereum Name Service
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://ens.domains/"&gt;ENS&lt;/a&gt; is a naming service that lives on Ethereum blockchain and uses smart contracts to buy domains and set resolver records to them. Since Ethereum is involved, you'd need &lt;a href="https://metamask.io/"&gt;MetaMask&lt;/a&gt; to use it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0xWHcFod--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/irfqn7z1hxxzqrzqrpqq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0xWHcFod--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/irfqn7z1hxxzqrzqrpqq.png" alt="ENS records for tbking.eth" width="880" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I bought the domain &lt;code&gt;tbking.eth&lt;/code&gt; and pointed it to the IPFS content of this site. It still needs to be updated every time I change the contents of my website, and I'm not sure if there's a way to automate this yet. However, Fleek is working on an ENS domains integration; I'll update this post once that's out. If you know another way, let me know in the comments.&lt;/p&gt;

&lt;p&gt;That's it! IPFS is a great fit as the storage layer of &lt;code&gt;Web 3.0&lt;/code&gt;. Try it for yourself, and if you liked this post, share it and maybe pin it! :)&lt;/p&gt;

</description>
      <category>decentralization</category>
      <category>web3</category>
      <category>ipfs</category>
      <category>ens</category>
    </item>
    <item>
      <title>How to choose between Kafka and RabbitMQ</title>
      <dc:creator>Tarun Batra</dc:creator>
      <pubDate>Fri, 11 Oct 2019 22:30:00 +0000</pubDate>
      <link>https://dev.to/tbking/how-to-choose-between-kafka-and-rabbitmq-4jol</link>
      <guid>https://dev.to/tbking/how-to-choose-between-kafka-and-rabbitmq-4jol</guid>
      <description>&lt;p&gt;Kafka and RabbitMQ – These two terms are frequently thrown around in tech meetings discussing distributed architecture. I have been part of a series of such meetings where we discussed their pros and cons, and whether they fit our needs or not. Here’s me documenting my findings for others and my future self.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Spoiler: We ended up using both for different use-cases.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Message Routing
&lt;/h2&gt;

&lt;p&gt;With respect to message routing capabilities, Kafka is very light. Producers produce messages to topics. Kafka logs the messages in a very simple data structure which resembles… a log! It can scale as much as the disk can. Topics can further have partitions (like sharding), and consumers connect to these partitions to read the messages. Kafka uses a pull-based approach, so the onus of fetching messages and tracking the offsets of read messages lies on the consumers.&lt;/p&gt;
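&lt;p&gt;To make the pull model concrete, here’s a minimal sketch in plain Python (a toy model, not the Kafka API): a partition modeled as an append-only log, with each consumer responsible for tracking its own offset. Note how the broker-side state is just the log; the read position lives with the consumer.&lt;/p&gt;

```python
class Partition:
    """A topic partition as an append-only log."""
    def __init__(self):
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1          # offset of the new message

    def fetch(self, offset, max_messages=10):
        # Pull-based: the consumer asks for messages starting at its offset
        return self.log[offset:offset + max_messages]


class Consumer:
    def __init__(self, partition):
        self.partition = partition
        self.offset = 0                   # the consumer tracks its own position

    def poll(self):
        messages = self.partition.fetch(self.offset)
        self.offset += len(messages)      # "commit" the new offset
        return messages


p = Partition()
for m in ["m1", "m2", "m3"]:
    p.append(m)

c = Consumer(p)
print(c.poll())   # ['m1', 'm2', 'm3']
print(c.poll())   # [] -- nothing new until producers append more
```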

&lt;p&gt;&lt;strong&gt;RabbitMQ has very strong routing capabilities&lt;/strong&gt;. It can route messages through a complex system of exchanges and queues. Producers send messages to exchanges, which act according to their configuration. For example, they can broadcast the message to every queue connected to them, deliver the message to some selected queues, or even expire messages that are not read within a stipulated time. Exchanges can also pass messages to other exchanges, making a wide variety of permutations possible. Consumers can listen to messages in a queue or a pattern of queues. Unlike Kafka, RabbitMQ pushes the messages to the consumers, so the consumers don’t need to keep track of what they have read.&lt;/p&gt;
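&lt;p&gt;A toy model of exchange-based routing (again plain Python, not the AMQP API) illustrates the idea. Queues are plain lists here; a &lt;code&gt;direct&lt;/code&gt; exchange delivers only to queues bound with a matching routing key, while a &lt;code&gt;fanout&lt;/code&gt; exchange broadcasts to every bound queue:&lt;/p&gt;

```python
from collections import defaultdict

class Exchange:
    """Toy exchange: routes published messages to bound queues."""
    def __init__(self, kind):
        self.kind = kind                       # "fanout" or "direct"
        self.bindings = defaultdict(list)      # routing key -> bound queues

    def bind(self, queue, routing_key=""):
        self.bindings[routing_key].append(queue)

    def publish(self, message, routing_key=""):
        if self.kind == "fanout":
            # broadcast to every bound queue, ignoring the routing key
            for queues in self.bindings.values():
                for q in queues:
                    q.append(message)
        elif self.kind == "direct":
            # deliver only to queues bound with a matching key
            for q in self.bindings[routing_key]:
                q.append(message)


logs, errors = [], []
ex = Exchange("direct")
ex.bind(logs, "info")
ex.bind(errors, "error")
ex.publish("disk full", routing_key="error")   # lands in 'errors' only
```

&lt;p&gt;Real RabbitMQ adds topic exchanges (pattern matching on routing keys), headers exchanges, TTLs, and dead-lettering on top of this basic shape.&lt;/p&gt;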

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x7RcXIRg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://tarunbatra.com/data/images/How-to-choose-between-Kafka-and-RabbitMQ/rabbitmq-system.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x7RcXIRg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://tarunbatra.com/data/images/How-to-choose-between-Kafka-and-RabbitMQ/rabbitmq-system.gif" alt="'RabbitMQ routing example'"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Delivery guarantee
&lt;/h2&gt;

&lt;p&gt;Distributed systems can have 3 delivery semantics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;at-most-once delivery&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In case of failure in message delivery, no retry is done, which means data loss can happen but data duplication cannot. This isn’t the most used semantic, for obvious reasons.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;at-least-once delivery&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In case of failure in message delivery, retries are done until the delivery is successfully acknowledged. This ensures no data is lost, but it can result in duplicate deliveries.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;exactly-once delivery&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Messages are guaranteed to be delivered exactly once. This is the most desirable delivery semantic and almost &lt;a href="https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/"&gt;impossible to achieve&lt;/a&gt; in a distributed environment.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Both Kafka and RabbitMQ offer at-most-once and at-least-once delivery guarantees.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Kafka provides exactly-once delivery from the producer to the broker using idempotent producers (&lt;code&gt;enable.idempotence=true&lt;/code&gt;). Exactly-once message delivery to the consumers is more complex. It is achieved on the consumers’ end by using the transactions API and reading only messages belonging to committed transactions (&lt;code&gt;isolation.level=read_committed&lt;/code&gt;). To truly achieve this, consumers would need to avoid non-idempotent processing of messages in case a transaction has to be aborted, which is not always possible. &lt;strong&gt;So, Kafka transactions are not very useful in my opinion.&lt;/strong&gt;&lt;/p&gt;
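&lt;p&gt;For reference, the two settings mentioned above live in the producer and consumer configuration respectively:&lt;/p&gt;

```
# producer configuration
enable.idempotence=true

# consumer configuration
isolation.level=read_committed
```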

&lt;p&gt;In RabbitMQ, exactly-once delivery is not supported due to the combination of complex routing and push-based delivery. Generally, it’s recommended to use at-least-once delivery with idempotent consumers.&lt;/p&gt;
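&lt;p&gt;The idempotent-consumer pattern is straightforward to sketch. Since at-least-once delivery can redeliver a message after a lost acknowledgement, the consumer deduplicates on a message ID so the side effect happens only once. A minimal sketch (hypothetical message shape, not tied to any client library):&lt;/p&gt;

```python
# Seen message IDs; in production this would live in a durable store
# (e.g. a database table updated in the same transaction as the side effect).
processed_ids = set()
account_balance = {"alice": 0}

def handle(message):
    # Idempotent: a redelivered message is detected by its ID and skipped,
    # so the side effect (crediting the account) is applied exactly once.
    if message["id"] in processed_ids:
        return
    processed_ids.add(message["id"])
    account_balance["alice"] += message["amount"]

credit = {"id": "msg-42", "amount": 100}
handle(credit)   # first delivery
handle(credit)   # broker redelivers after a lost ack; no double credit
```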

&lt;p&gt;&lt;em&gt;&lt;br&gt;NOTE: Kafka Streams is an example of a truly idempotent system, which it achieves by eliminating non-idempotent operations in a transaction. It is, however, out of the scope of this article. I recommend reading &lt;a href="https://www.confluent.io/blog/enabling-exactly-once-kafka-streams/" rel="noopener"&gt;“Enabling Exactly-Once in Kafka Streams” by Confluent&lt;/a&gt; if you want to dig into it further.&lt;br&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Throughput
&lt;/h2&gt;

&lt;p&gt;Throughput of message queues depends on many variables like message size, number of nodes, replication configuration, delivery guarantees, etc. I will be focusing on the speed of messages produced versus consumed. Two cases arise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queue is empty due to messages being consumed as and when they are produced.&lt;/li&gt;
&lt;li&gt;Queue is backed up due to consumers being offline or producers being faster than consumers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RabbitMQ stores the messages in DRAM for consumption. When consumers are not far behind, the messages are served quickly from DRAM. Performance takes a hit when a lot of messages are unread and the queue is backed up: the messages then have to be read from the disk, which is slower. So, RabbitMQ works faster with empty queues.&lt;/p&gt;

&lt;p&gt;Kafka uses sequential disk I/O to read chunks of the log in an orderly fashion. Performance improves further when fresh messages are being consumed, as they are served from the OS page cache without any I/O reads. However, it should be noted that using transactions as discussed in the last section will have a &lt;em&gt;negative effect&lt;/em&gt; on throughput.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Overall, &lt;strong&gt;Kafka can process millions of messages in a second and is faster than RabbitMQ&lt;/strong&gt;, whereas RabbitMQ can process upwards of 20k messages per second.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;RabbitMQ enables complex routing use-cases which cannot be realized with Kafka’s simpler architecture. However, Kafka provides higher throughput in almost all cases. Apart from these differences, both of them provide similar capabilities like fault tolerance, high availability, and scalability. Keeping this in mind, we at &lt;a href="https://smallcase.com"&gt;smallcase&lt;/a&gt; used RabbitMQ for &lt;a href="https://medium.com/making-smalltalk/polling-reliably-at-scale-using-dlqs-841512659c8f"&gt;consistent polling in our transactions system&lt;/a&gt; and Kafka for making our notifications system quick and snappy.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://content.pivotal.io/blog/understanding-when-to-use-rabbitmq-or-apache-kafka"&gt;Understanding When to Use RabbitMQ or Apache Kafka&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.mavenhive.in/which-one-to-use-and-when-rabbitmq-vs-apache-kafka-7d5423301b58"&gt;A comparison between RabbitMQ and Apache Kafka&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.confluent.io/blog/transactions-apache-kafka/"&gt;Transactions in Apache Kafka&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1709.00333"&gt;Kafka versus RabbitMQ&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>comparison</category>
      <category>rabbitmq</category>
      <category>kafka</category>
      <category>messagequeue</category>
    </item>
    <item>
      <title>Error messages in login process: Privacy and Security</title>
      <dc:creator>Tarun Batra</dc:creator>
      <pubDate>Sat, 28 Apr 2018 19:14:03 +0000</pubDate>
      <link>https://dev.to/tbking/error-messages-in-login-process-privacy-and-security-4fin</link>
      <guid>https://dev.to/tbking/error-messages-in-login-process-privacy-and-security-4fin</guid>
      <description>&lt;p&gt;Most of us, while developing sites swear to the rule of making error messages shown to the end-user as specific as possible, which goes a long way in creating a friendly UX. But the same rule if used in the login flow, can have significant privacy and security implications. On entering wrong login credentials, most sites show variations of &lt;code&gt;Invalid username/password&lt;/code&gt; error at the login screen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftarunbatra.com%2Fdata%2Fimages%2FError-messages-in-login-process-Privacy-and-Security%2Ftwitter-github-spotify.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftarunbatra.com%2Fdata%2Fimages%2FError-messages-in-login-process-Privacy-and-Security%2Ftwitter-github-spotify.png" alt="'Twitter, Github and Spotify's login error messages'"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It can be safely assumed that no users with these emails exist, yet the error messages shown by these prominent sites don’t indicate that. Do you see why?&lt;/p&gt;

&lt;h2&gt;
  
  
  Protecting user base information
&lt;/h2&gt;

&lt;p&gt;When fed random emails, a simple login or password-reset flow may reveal that the user doesn’t exist and that therefore no login or password reset is possible. A hostile client may build a list of valid users from this information. Not every site cares about this, though. Facebook doesn’t.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftarunbatra.com%2Fdata%2Fimages%2FError-messages-in-login-process-Privacy-and-Security%2Ffacebook.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftarunbatra.com%2Fdata%2Fimages%2FError-messages-in-login-process-Privacy-and-Security%2Ffacebook.png" alt="'Facebook login error message'"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It doesn’t hurt for a site like Facebook to leak its user base because everyone’s on it, but when the site is something like Ashley Madison, it can have serious implications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Closing the loop
&lt;/h3&gt;

&lt;p&gt;People often argue that the user base is leaked anyway during the registration process, which can’t continue if the user already exists. This leak is avoidable too, by continuing the registration over email: if the user already exists, the email may say so; otherwise, it may direct the receiver to continue the registration process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deterrence to brute-force
&lt;/h2&gt;

&lt;p&gt;In a brute-force attack, a large number of combinations of usernames and passwords are tried to get into a user’s account. A dictionary attack is faster and more effective, as it takes advantage of the human tendency to use common passwords and re-use old ones. Specific messages like &lt;code&gt;Invalid username&lt;/code&gt; make these attacks faster by eliminating a large number of combinations in a few tries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Demo
&lt;/h3&gt;

&lt;p&gt;To quantify the difference detailed error messages can make to brute-force or dictionary attacks, I created a script, &lt;a href="https://github.com/tarunbatra/login-attack-demo" rel="noopener noreferrer"&gt;login-attack-demo&lt;/a&gt;, which includes:&lt;/p&gt;

&lt;h4&gt;
  
  
  Dictionary
&lt;/h4&gt;

&lt;p&gt;A collection of 1575 most commonly used passwords based on &lt;a href="https://github.com/danielmiessler/SecLists/blob/master/Passwords/probable-v2-top1575.txt" rel="noopener noreferrer"&gt;danielmiessler/SecLists&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Naive server
&lt;/h4&gt;

&lt;p&gt;A server which gives out specific error messages based on the failing stage of the login process. If the username is invalid, it’d say &lt;code&gt;Invalid username&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Smart server
&lt;/h4&gt;

&lt;p&gt;A server which gives out vague error messages on failure of the login process. If the password is invalid but the username is valid, it’d still say &lt;code&gt;Invalid username/password&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Attacker
&lt;/h4&gt;

&lt;p&gt;A hostile client which uses the dictionary to generate username and password pairs and uses them to try to break into the system. It also analyzes the error messages according to the following rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An error message saying the password was invalid indicates the username was a valid one.&lt;/li&gt;
&lt;li&gt;An error message saying the username was invalid indicates no password will result in a successful login for it.&lt;/li&gt;
&lt;li&gt;An inconclusive error message is ignored and the next combination is tried.&lt;/li&gt;
&lt;/ul&gt;
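&lt;p&gt;The attacker’s core idea can be sketched in a few lines of Python (hypothetical toy data, not the actual script): against the naive server, an &lt;code&gt;Invalid username&lt;/code&gt; response lets the attacker prune every remaining password for that username, while against the smart server every combination has to be tried.&lt;/p&gt;

```python
from itertools import product

USERS = {"carol": "dragon"}                     # the server's credential store
usernames = ["alice", "bob", "carol"]
passwords = ["123456", "qwerty", "dragon"]

def naive_login(user, pw):
    if user not in USERS:
        return "Invalid username"               # leaks that the user doesn't exist
    return "OK" if USERS[user] == pw else "Invalid password"

def smart_login(user, pw):
    return "OK" if USERS.get(user) == pw else "Invalid username/password"

def attack(login):
    """Return the number of tries needed to break in, pruning where possible."""
    tries = 0
    dead_users = set()
    for user, pw in product(usernames, passwords):
        if user in dead_users:
            continue                            # pruned: no password can work
        tries += 1
        result = login(user, pw)
        if result == "OK":
            return tries
        if result == "Invalid username":
            dead_users.add(user)                # specific error prunes this user
    return None

print(attack(naive_login))   # fewer tries: invalid users pruned after one attempt
print(attack(smart_login))   # more tries: every combination must be attempted
```

&lt;p&gt;Even in this tiny example, the naive server falls in roughly half the tries; with a 1575-word dictionary, the gap grows much larger.&lt;/p&gt;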

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftarunbatra.com%2Fdata%2Fimages%2FError-messages-in-login-process-Privacy-and-Security%2Fchart.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftarunbatra.com%2Fdata%2Fimages%2FError-messages-in-login-process-Privacy-and-Security%2Fchart.png" alt="'Analysis chart'"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The blue points indicate the number of tries it took the attacker to break into the user’s account with vague error messages, while the red ones are for when the error messages were specific. The yellow line is the maximum number of combinations to try.&lt;/p&gt;

&lt;p&gt;It is evident that, in theory, vague error messages make things quite tough for the attacker, sometimes as much as 1000 times tougher. However, plugging all the holes may make the UX far from ideal. In the end, it all comes down to a trade-off between UX and security. Depending on the space you’re operating in, such methods can be used to safeguard your users.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally published &lt;a href="https://tarunbatra.com/blog/security/Error-messages-in-login-process-Privacy-and-Security" rel="noopener noreferrer"&gt;here at tarunbatra.com&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>security</category>
      <category>architecture</category>
      <category>web</category>
      <category>privacy</category>
    </item>
  </channel>
</rss>
