DEV Community: Samyak Jain

Scaling Is All You Need: Understanding sqrt(dₖ) in Self-Attention

Samyak Jain — Tue, 11 Nov 2025 09:13:12 +0000

I've been trying to understand the scaling in the attention formula, specifically sqrt(d_k). It confused me a bit: why do we need to divide at all?

I was confused because we already subtract the max value from each value inside softmax (so exp doesn't explode our numbers), so why do we need to scale before this step as well?

Turns out the difference lies between numerical stability and statistical calibration.

Division vs. Subtraction

When we divide by sqrt(d_k), we reduce the magnitude of each value proportionally, which shrinks the differences between them. For example, with sqrt(d_k) = 10, [100, 102, 103] becomes [10.0, 10.2, 10.3], and the 2-unit and 1-unit gaps become 0.2 and 0.1. This brings the values closer together before they reach softmax.

In contrast, when we subtract (like subtracting the max in softmax), we shift where the values sit on the number line without changing the differences between them at all: [100, 102, 103] becomes [-3, -1, 0], but the gaps remain 2 and 1.

At first, I thought: if we're just reducing magnitudes for softmax, why not simply subtract the max value like we do inside softmax for stability? But then it occurred to me that subtraction doesn't actually bring the values closer together; it only shifts them.

Preserving Proportions for Softmax

The point is that I need to preserve the ordering and the proportional relationships between the numbers (ratios like 102/100 = 1.02 stay the same after division), because they encode how much better one key matches the query than another. What softmax itself reacts to is the gaps between the values, so I don't want to lose which value is bigger, only to control how big those gaps are.

However, I also can't keep the absolute magnitude of these differences too large, because softmax's exponential exaggerates them further, turning a 3-unit spread into a distribution like [0.04, 0.26, 0.71] where one value dominates.

So division is the right tool: it keeps the proportionality the same (10.2/10.0 = 1.02, just like before) while shrinking the absolute differences (from 2 units to 0.2 units), so the values are not too far apart when softmax amplifies them, and the result is a more balanced distribution like [0.28, 0.34, 0.38].
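To make the difference between shifting and scaling concrete, here is a minimal NumPy sketch that runs the example scores through softmax three ways. The `softmax` helper and the choice of `d_k = 100` (so that sqrt(d_k) = 10) are my own assumptions for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: subtract the max before exponentiating."""
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

scores = np.array([100.0, 102.0, 103.0])
d_k = 100  # hypothetical value, chosen so that sqrt(d_k) = 10

raw = softmax(scores)                      # gaps of 2 and 1
shifted = softmax(scores - scores.max())   # same gaps, different location
scaled = softmax(scores / np.sqrt(d_k))    # gaps of 0.2 and 0.1

print(np.round(raw, 2))      # [0.04 0.26 0.71]
print(np.round(shifted, 2))  # [0.04 0.26 0.71]
print(np.round(scaled, 2))   # [0.28 0.34 0.38]
```

Subtracting the max leaves the output untouched, while dividing flattens it, which is exactly the distinction between stability and calibration.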

Another major reason is future-safe design. As d_k increases (say from 64 to 512), the dot product naturally grows larger since we're summing more terms, but this growth doesn't represent meaningful differences in attention; it's just an artifact of dimensionality. By dividing by sqrt(d_k), we compensate for this growth and keep the scale consistent: whether d_k is small or large, the scores entering softmax stay in a comparable range.

Dividing by d_k directly would shrink values too aggressively. Using sqrt(d_k) is the right balance because the variance of the dot product grows linearly with d_k (assuming the query and key components are independent with zero mean and unit variance), so dividing by sqrt(d_k) keeps the standard deviation roughly constant. This ensures that the scale of the values entering softmax remains consistent, no matter the dimensionality.
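The variance argument is easy to check with a quick simulation. This is only a sketch under the assumption that query and key components are independent draws from a standard normal distribution; real trained projections will not match this exactly. The seed and sample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n_samples = 5_000                # random (query, key) pairs per setting

results = {}
for d_k in (64, 512):
    # Assumption: components are i.i.d. with mean 0 and variance 1.
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=1)             # one dot product per pair
    raw_std = float(dots.std())
    scaled_std = float((dots / np.sqrt(d_k)).std())
    results[d_k] = (raw_std, scaled_std)
    print(d_k, round(raw_std, 2), round(scaled_std, 2))
```

The raw standard deviation should land near sqrt(64) = 8 and sqrt(512) ≈ 22.6, while the scaled one stays near 1 for both settings.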

Key Insight

The key insight is that values need to be close together before entering softmax, so we can let softmax do the exaggeration through its exponential function in a controlled way, while subtraction just ensures numerical stability without affecting the relative distances that softmax actually cares about.

Basically there are 3 important things happening here:

Division changes how far apart the values are (brings them closer)
Subtraction changes where they sit on the number line (doesn't change separation)
We want values close together BEFORE softmax, so softmax's exponential amplification produces a reasonable distribution, not an extreme one

Let the softmax do the exaggeration.

[Boost]

Samyak Jain — Mon, 29 Sep 2025 15:05:34 +0000


Positional Encoding - Sense of direction for Transformers

Samyak Jain — Sun, 28 Sep 2025 11:29:08 +0000

I have been trying to understand how transformers work lately, and whenever we read or hear about transformers, one word comes up more than any other: ‘Attention’. Attention did not first appear with transformers, but it has become the centerpiece of the whole architecture.

That said, attention by itself doesn’t get us very far. There’s another idea that doesn’t always get the same spotlight, but without it, self-attention would completely fall apart. That idea is positional encoding, the thing that lets transformers keep track of word order.

But why is self-attention in transformers useless without positional encoding?

The issue is that self-attention has no built-in sense of order. Each token looks at every other token simultaneously, like in the diagram from 3Blue1Brown.

Because each token (approximately a word) looks at every other token at the same time, there is no information about how far one token is from another or what order the tokens are in.

So if my sentence is “A cat ran behind a mouse,” the transformer without positional encoding would only see it as a bag of words: {A, cat, ran, behind, a, mouse}. Now flip it to “A mouse ran behind a cat”: the bag looks identical to a transformer without positional encoding. The meaning, however, is completely different. That’s the blind spot positional encoding is designed to fix.

This wasn’t really a problem for RNNs, because the architecture itself processed tokens step by step. Each word was fed in after the previous one, and the hidden state carried along a memory of everything seen so far. In other words, the order of the sequence was baked into the way RNNs worked.

Transformers flipped that idea. Instead of moving sequentially, they look at all the tokens in parallel. That parallelism is one of the reasons they are so powerful and efficient, but it also means they lose the natural sense of order that RNNs had for free.

That’s where the positional encoding steps in: it reintroduces the concept of order without giving up the transformer’s parallelism.

So how do we encode this "sense" of position in each token?

Well, if you look at the original "Attention Is All You Need" paper, they solved this problem using these two formulas:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)$$

Haha, this didn't really make a lot of sense to me at first glance, so I'll try to explain what I have understood about it so far.

I had questions like: how is this formula derived? Why is there a pair of sine and cosine? What's with the 10,000? So let's try to reach this formula step by step.

Idea 1: Just Count (Integer Indexing)

The simplest idea to have positional information is to assign each token its index in the sequence.

The Idea: The first word is 1, the second is 2, the third is 3, and so on. [1, 2, 3, 4, ...]
The Flaw: The numbers can become very large for long sentences. Neural networks work best with small, normalized values, and large numbers can make training unstable. Also, a model might see very different ranges of numbers during training vs. testing, making it hard to generalize.

OK, so we can't use raw integers because they are not neural-network friendly: large, unbounded inputs can contribute to vanishing or exploding gradients. Neural networks usually like their values to be balanced between positive and negative.

Idea 2: Normalize the Count

To fix the scaling issue, the next logical step is to normalize the indices to a fixed range, like [0, 1].

The Idea: Divide each index by the length of the sentence. For a 4-word sentence, the encoding would be: [1/4, 2/4, 3/4, 4/4] or [0.25, 0.5, 0.75, 1.0]
The Flaw: This makes the meaning of a position dependent on the sentence length. For example, the value 0.5 means the 2nd position in a 4-word sentence, but it would mean the 10th position in a 20-word sentence. The model has no consistent way to interpret what a position value means, which is a deal-breaker.

OK, so we need a fixed representation for each position; it can't depend on the length of the sentence or any other variable factor.

Idea 3: Use a Binary Vector

To handle arbitrary lengths consistently while keeping values small, the next idea is to represent the position index in binary and turn it into a vector.

The Idea: Represent each number with its binary equivalent, creating a vector for each position. Here we solved the previous two problems which were keeping the values between 0 and 1 and keeping the information static regardless of the length of sentence (still all the numbers are positive though).
- Position 2 -> [0, 0, 1, 0]
- Position 3 -> [0, 0, 1, 1]
- Position 4 -> [0, 1, 0, 0]
The Flaw: This is discrete and "jagged." A small change in position (e.g., from 7 (0111) to 8 (1000)) can cause every single value in the vector to flip. This gives the model no smooth, continuous sense of distance or proximity between nearby positions.
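The jump between neighbouring positions can be counted directly. A small sketch, with hypothetical helper names `binary_position` and `flipped_bits`, assuming 4-bit vectors as in the examples above:

```python
def binary_position(pos, n_bits=4):
    """Return the position index as a list of bits, most significant bit first."""
    return [(pos >> shift) & 1 for shift in reversed(range(n_bits))]

def flipped_bits(a, b):
    """Count how many entries differ between two equal-length bit vectors."""
    return sum(x != y for x, y in zip(a, b))

for pos in range(2, 9):
    print(pos, binary_position(pos))

# Neighbouring positions can differ by one bit or by all of them.
print(flipped_bits(binary_position(6), binary_position(7)))  # 1
print(flipped_bits(binary_position(7), binary_position(8)))  # 4
```

Positions 6 and 7 differ in one entry while 7 and 8 differ in all four, even though both pairs are exactly one step apart.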

Below is an example of what "jagged" means here.

Check out the source here

OK, so we want the model to see a smooth curve, not an abrupt change between two adjacent positions.

Idea 4: Using Sine Waves

Alright, so we need something smooth, something that rises and falls gradually and periodically. The sine function does this perfectly, and its range is [-1, 1], which also solves our problem of balancing positive and negative values.

Like this

Check out the source here

Alright, let's try to use a sine function to calculate positions for our tokens.

Does this sine wave approach fix all the problems that we found in the other approaches?

something which gives out a smooth curve
would be static irrespective of the length of the sequence
not integers so wont cause problems for the neural network

This is a very brief version of the derivation; it is explained very well here and here.

Alright, does that mean we’re all set now? We would just use one sine function per position and call it a day?

Well... not exactly

1. Periodicity problem

Sine is periodic, which means it repeats its values over and over.
For example, if we move along the sine wave by 1 radian per position:
- sin(1), sin(2), …
- The period is 2π ≈ 6.28, so after roughly 6 positions the wave starts over and values begin to recur: sin(1) ≈ 0.84 and sin(45) ≈ 0.85, for instance.
This creates a problem: the model sees almost the same number at different positions, even though they are far apart in the sequence. It can’t reliably tell positions apart.

2. Phase problem

Even without worrying about repetition, phase ambiguity is an issue.
Consider sin(30°) and sin(150°). Both return 0.5.
For the model, if you only provide the sine value, it sees 0.5 and has no idea whether it came from position 30 or 150.
This is why sine alone doesn’t uniquely identify positions: the same value can appear at many different points in the sequence.
You can't locate a point in 2D space with just the y coordinate (sine).

3. Linear Transformation

Another major problem concerns linear transformations. You might expect that moving from pos = 6 to pos = 9 should be equivalent to moving from pos = 18 to pos = 21, because the offset k = 3 is the same. But that doesn't happen with the current approach.

Before we look at why, let's understand why we want this property. This is very well explained in a blog that I am quoting here:

Why is this a very desirable property to have? Imagine we have a network that is trying to translate the sentence "I am going to eat."

The combination "is/am/are" +"going"+"to" +"verb" is a very common grammatical structure, with a fixed positional structure. "going" always ends up at index 1, "to" at index 2, etc.

In this case, when translating "verb", we may want the network to learn to pay attention to the noun "I" that occurs before "am going to". "I" is located 4 position units to the left of "verb".

Since our attention layer uses linear transformations to form the keys, queries, and values, it would be nice if we have positional encodings such that the linear transformation can translate the position vector located 4 units to the left, so that it lines up with the position vector of "verb".

The query and key would then match up perfectly.

Check out the source here

However, when you use only sine values in positional encoding, this isn’t true. Here’s why:

Sine depends on absolute position:

- Sine is a wave that oscillates between -1 and 1.
- The amount the sine value changes when you move by `k` depends on **where you start** on the wave.
- For example, moving 3 steps near a peak of the wave might produce a tiny change, while moving 3 steps near zero might produce a large change.

Shift is not consistent:

- The same offset `k` produces **different differences in sine values** depending on the starting position.
- That means the “distance” between `pos` and `pos+k` is **variable**, unlike what you might expect from a true linear transformation.

Consequence:

- Moving by the same offset doesn’t give the same change in information.
- The sine-only encoding is **position-dependent**, so relative shifts aren’t uniform across the sequence.
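A tiny numeric check makes the inconsistency visible. This sketch uses a single sine with one radian per position and the same offset k = 3 from two different starting points; the helper name `sine_change` is hypothetical.

```python
import numpy as np

def sine_change(start, k):
    """Change in a sine-only encoding when moving from start to start + k."""
    return float(np.sin(start + k) - np.sin(start))

k = 3  # same offset for both moves
for start in (6, 18):
    print(start, "->", start + k, round(sine_change(start, k), 3))
```

The two moves produce changes of roughly 0.69 and 1.59, so the same offset does not correspond to the same change in the encoding.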

For an illustration, check out this video

Enter Sin + Cos based embeddings

We are close to the formula from the paper. We saw that sine-based embeddings, even though they give us a smooth curve, have their own problems.

Let's see if we can solve those problems with sin + cos.

Each position is now represented as a pair [sin(f*pos), cos(f*pos)] for each frequency.

Unique positions:

The combination of sine and cosine gives a unique point on the circle for every position.

Earlier, using sine alone, we ran into the phase problem:

sin(30°) ≈ 0.5
sin(150°) ≈ 0.5

The model would see 0.5 in both cases and wouldn’t know whether it came from position 30 or 150 — clearly a problem!

Now, with sin + cos:

Represent position as a 2D point: [sin(pos), cos(pos)]
Example:

Position	sin(pos)	cos(pos)	Point (sin, cos)
30°	0.5	0.866	(0.5, 0.866)
150°	0.5	-0.866	(0.5, -0.866)

Even though the sine values are the same, the cosine values are different.

The points (0.5, 0.866) and (0.5, -0.866) are distinct on the circle.
The model can now tell these positions apart.

So, combining sine and cosine removes the ambiguity and makes each position uniquely identifiable on the circle.

Periodicity Problem

The sine + cosine pairs don't directly solve this problem. I think the creators just pushed the problem far out by using multiple pairs of sin and cos (d_model/2 pairs, filling all d_model dimensions), each with a different frequency. This also explains why they used the scaling constant of 10,000:

Using a large number like 10,000 spreads the frequencies logarithmically across dimensions.
Low dimensions oscillate fast, high dimensions oscillate slowly.
Because of this spread, the combined vector of [sin, cos] across all dimensions is very unlikely to repeat for nearby positions, even though each individual sine/cosine is periodic.
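Putting the pieces together, the paper's two formulas can be sketched in a few lines of NumPy. The function name, the `base` parameter, and the small sizes (50 positions, d_model = 16) are my own choices for illustration, and the sketch assumes d_model is even.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model, base=10000.0):
    """Build a (max_len, d_model) matrix of sinusoidal positional encodings.

    Assumes d_model is even so every sine has a matching cosine.
    """
    positions = np.arange(max_len)[:, np.newaxis]        # shape (max_len, 1)
    pair_index = np.arange(d_model // 2)[np.newaxis, :]  # i = 0 .. d_model/2 - 1
    angles = positions / base ** (2 * pair_index / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)  # odd dimensions 2i + 1
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16)

# Wavelength of each sin/cos pair, from 2*pi upward in a geometric progression.
wavelengths = 2 * np.pi * 10000.0 ** (2 * np.arange(8) / 16)
print(np.round(wavelengths, 1))
```

The first pair repeats about every 6 positions, while the last pair needs thousands of positions to complete one cycle, which is why the full vector does not repeat within any realistic sequence length.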

Linear Transformation

Now, each position is represented as two numbers: [sin(f * pos), cos(f * pos)]

Think of this as a point on a circle, where the angle = f * pos.

- `sin(f * pos)` = vertical coordinate        
- `cos(f * pos)` = horizontal coordinate

How shifting positions works

Moving by k positions = increasing the angle by f * k
In 2D, this is equivalent to rotating the point around the circle.
A rotation has a fixed 2×2 matrix:

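$$PE_{pos} = \begin{bmatrix} \sin(f \cdot pos) \\ \cos(f \cdot pos) \end{bmatrix}, \qquad PE_{pos+k} = \begin{bmatrix} \cos(f \cdot k) & \sin(f \cdot k) \\ -\sin(f \cdot k) & \cos(f \cdot k) \end{bmatrix} PE_{pos}$$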

Key point: this matrix depends only on k, not on the starting position.
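The claim that one matrix works for every starting position can be checked numerically. A small sketch, with a hypothetical frequency f = 0.1 and helper names of my own; the matrix is the standard rotation built from cos(f·k) and sin(f·k), arranged for the [sin, cos] ordering used here:

```python
import numpy as np

def pe_pair(pos, f):
    """Single-frequency encoding [sin(f*pos), cos(f*pos)]."""
    return np.array([np.sin(f * pos), np.cos(f * pos)])

def shift_matrix(k, f):
    """2x2 rotation that maps pe_pair(pos, f) to pe_pair(pos + k, f)."""
    c, s = np.cos(f * k), np.sin(f * k)
    return np.array([[c, s], [-s, c]])

f, k = 0.1, 4  # hypothetical frequency and offset
M = shift_matrix(k, f)  # built once, without knowing any starting position
for pos in (5, 13):
    print(pos, np.allclose(M @ pe_pair(pos, f), pe_pair(pos + k, f)))  # True
```

The matrix is computed from k and f alone, yet it moves both position 5 and position 13 forward by four steps.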

Why this solves linear transformation

Uniform distance changes
- The Euclidean distance between a point and its rotated version is constant for a given k
- No matter where you start, moving forward by k produces the same shift in the vector
Predictable relative shifts
- Because rotation is linear, the model can learn a fixed transformation for any offset k
- This is exactly what the Transformer exploits to reason about relative positions
Intuition
1. Sine-only = wavy line → inconsistent movement
2. Sin+cos = circle → rotation → consistent movement

This concept of linear transformation in PE is very well explained here and here

Experiment

To see the difference in practice, I created positional embeddings in two ways:

Sine-only embeddings
Sine + Cosine pairs

I then computed the cosine similarity between two pairs of positions:

Pair 1: pos = 5 and pos = 9
Pair 2: pos = 13 and pos = 17

Both pairs have the same offset (Δpos = 4).

Results:

Sine-only embeddings:
- The cosine similarity for Pair 1 was different from Pair 2.
- This confirms that sine-only embeddings produce variable shifts depending on the starting position.
Sine + Cos embeddings:
- The cosine similarity for Pair 1 exactly matched Pair 2.
- Using [sin, cos] ensures that shifts are consistent across the sequence, independent of the starting position.
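For readers who want to reproduce the idea without opening the notebook, here is a minimal sketch of the same comparison. It is not the notebook's code: the helper names, the d_model of 16, and the base of 10,000 are assumptions I picked for illustration.

```python
import numpy as np

def frequencies(d_model, base=10000.0):
    """One frequency per sin/cos pair, as in the paper's formula."""
    return 1.0 / base ** (2 * np.arange(d_model // 2) / d_model)

def sine_only(pos, d_model=16):
    return np.sin(pos * frequencies(d_model))

def sin_cos(pos, d_model=16):
    angles = pos * frequencies(d_model)
    return np.concatenate([np.sin(angles), np.cos(angles)])

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

results = {}
for name, embed in (("sine-only", sine_only), ("sin + cos", sin_cos)):
    pair1 = cosine_similarity(embed(5), embed(9))    # offset 4
    pair2 = cosine_similarity(embed(13), embed(17))  # offset 4
    results[name] = (pair1, pair2)
    print(name, round(pair1, 4), round(pair2, 4))
```

With sin + cos the dot product reduces to a sum of cos(f·Δpos) terms and every vector has the same norm, so the similarity depends only on the offset. The sine-only version has neither property.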

Here is the google Colab to check this experiment out

Conclusion

So yeah, positional encoding might not get the same spotlight as attention, but it’s what keeps transformers from being just a fancy bag-of-words model. We went from simple counting to sine waves and then to sine + cosine pairs, and along the way, we saw why each step mattered. In the end, sin + cos gives each position a unique spot on a circle, solves phase issues, and makes relative shifts predictable—letting the model actually make sense of word order while still enjoying the parallelism that makes transformers so powerful.

Positional encoding is still an active area of research. People are exploring relative positional encodings instead of static ones, like we discussed—check it out here.

There are still models, such as some autoregressive models, that run fine without any positional encoding at all. That makes the question of when we need positional encodings and when we don't even more interesting, but I still need to read a lot before I can try to explain it in my own words.

Since this topic is really vast, I’d love for readers to critique this blog and point out any mistakes I might have made or assumptions that could be wrong.

References

https://towardsdatascience.com/master-positional-encoding-part-i-63c05d90a0c3 - this one derives the positional embeddings starting from just indexes
https://blog.timodenk.com/linear-relationships-in-the-transformers-positional-encoding/ - helped me understand the linear transformation reasoning
https://huggingface.co/blog/designing-positional-encoding - Another great blog which derives the Positional embeddings starting from just indexes
https://naokishibuya.github.io/blog/2021-10-31-transformers-positional-encoding
https://kazemnejad.com/blog/transformer_architecture_positional_encoding/

Understanding SVD's Intuition (Singular Value Decomposition)

Samyak Jain — Mon, 12 May 2025 05:36:51 +0000

What is SVD

Singular Value Decomposition (SVD) is a fundamental matrix factorisation technique in linear algebra that decomposes a matrix into three simpler component matrices. It's incredibly versatile and powerful, serving as the backbone for numerous applications across various fields.

Why Was SVD Developed?

SVD was developed to solve the problem of finding the best approximation of a matrix by a lower-rank matrix. Mathematicians needed a way to:

Factorise the original matrix into three smaller matrices that capture hidden relationships, i.e. find latent features that explain the patterns in a given matrix.
Understand the fundamental structure of linear transformations.
Analyse the underlying properties of matrices regardless of their dimensions.

The core purpose was to find a way to decompose any matrix into simpler, more manageable components that reveal its essential properties - specifically its rank, range (column space), and null space.

What does it mean to find the best approximation of a matrix by a lower-rank matrix?

When we have a large, complex matrix A with rank r, we often want to simplify it to save computational resources while preserving as much of the original information as possible.

Finding the "best approximation" means:

Creating a new matrix Â with lower rank k (where k < r)
Ensuring this approximation minimises the error (typically measured as the Frobenius norm ||A - Â||)
Capturing the most important patterns/structures in the original data

This is valuable because lower-rank matrices:

Require less storage space
Allow faster computations
Often filter out noise while preserving signal

What is a Latent Feature?

A latent feature is:

A property or concept that isn't explicitly observed in the data, but is inferred from patterns across the matrix.

In simple terms:

It's a hidden factor that explains why people behave the way they do.
You don’t know what it is, but you see its effects.

Analogy: Music Taste

Imagine a user-song matrix: rows = users, columns = songs, entries = ratings.

You don’t label features like "likes punk rock" or "prefers instrumental", but...

If two users rate the same songs highly,
And those songs share some vibe,

Then you can infer there's some latent preference shared.
That hidden dimension — like “preference for energetic music” — is a latent feature.

Definition

Given any real matrix A ∈ ℝ^m×n, SVD says:

$A = U \Sigma V^T$

Where:

Component	Shape	Role
U	ℝ^m×k	Matrix with orthonormal columns: maps original rows into a k-dimensional latent space. Each row is a latent vector for a row of A.
Σ	ℝ^k×k	Diagonal matrix with non-negative real numbers called singular values, sorted from largest to smallest. These represent the importance (energy) of each dimension in the latent space.
Vᵀ	ℝ^k×n	Matrix with orthonormal rows: maps original columns into the same k-dimensional latent space. Each column is a latent vector for a column of A.

The value k = rank(A) in exact decomposition, or you can choose a smaller k for approximation (low-rank SVD).
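The shapes and properties in the table can be verified with NumPy's `np.linalg.svd`. The 2×3 matrix below is a hypothetical example chosen only to keep the output small; with `full_matrices=False` NumPy returns the compact factors, with k equal to min(m, n).

```python
import numpy as np

# Hypothetical 2x3 matrix, used only to inspect the factors.
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(U.shape, s.shape, Vt.shape)            # (2, 2) (2,) (2, 3)
print(s)                                     # singular values, largest first
print(np.allclose(U @ np.diag(s) @ Vt, A))   # True: the product rebuilds A
print(np.allclose(U.T @ U, np.eye(2)))       # True: columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(2)))     # True: rows of Vt are orthonormal
```

Note that NumPy returns the singular values as a 1-D array, so `np.diag(s)` is needed to form Σ.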

📊 What Each Matrix Encodes (in abstract terms)

Component	Encodes
U	Each row in A becomes a vector in a latent space — capturing how each row projects onto abstract "directions" in the data. This matrix shows how each user aligns with the latent features discovered by SVD. It does not label them explicitly as “loves punk rock” or “loves romantic music,” but it captures underlying preferences. In essence, U describes what kind of latent features (or preferences) each user has, without explicitly telling us the names of those features. It just tells us the degree to which each user aligns with each latent feature.
Σ	Singular values show how much of A’s structure lies in each direction (feature). The higher the value, the more "energy" or "information" it captures. The singular values in Σ represent how important each latent feature is
Vᵀ	Each column in A becomes a vector in the same latent space — capturing how each column contributes along those same directions. This matrix describes how each song correlates with the latent features. For example, Song A (Punk) might have a high score for the latent feature "energetic music," while Song B (Romantic) might also score highly on that feature, but could also have a moderate score for a latent feature related to "emotional depth" (which both romantic and punk music might share). Song C (Jazz) and Song D (Classical) would likely score higher on a latent feature related to "calm or soothing music." In essence, Vᵀ describes what kind of latent features (or qualities) each song has, again without explicitly telling us what the features are. It just tells us the degree to which each song aligns with each latent feature.

Example Scenario

Let’s start with a user-song rating matrix where users rate songs from different genres. Each user has their own tastes, and some might enjoy music from different genres.

	Song A (Punk)	Song B (Romantic)	Song C (Jazz)	Song D (Classical)
Alice	5	4	?	2
Bob	4	5	?	1
Carol	1	2	5	5

Alice: Likes Song A (Punk) and Song B (Romantic).
Bob: Likes Song B (Romantic) and Song A (Punk) as well, but with slightly different ratings.
Carol: Prefers Song C (Jazz) and Song D (Classical).

The Problem: How Do We Predict Missing Ratings?

We want to predict ratings for the missing entries (denoted by ?). For example, we want to predict how Alice might rate Song C (Jazz) or Song D (Classical). Similarly, we want to predict how Bob might rate Song C.

We don't explicitly know which genres each user likes. But SVD can discover latent features or hidden preferences based on their ratings.

Breakdown of U, Σ, Vᵀ

U (User-to-Latent Features) – Describes the users in terms of the latent features.
Σ (Singular Values) – Indicates the importance of each latent feature.
Vᵀ (Song-to-Latent Features) – Describes the songs in terms of the latent features.

U: User-to-Latent Feature Matrix

Rows (users): Each row in U corresponds to one user.
Columns (latent features): Each column in U corresponds to one latent feature that was discovered by SVD.
Values: Each entry in U shows how strongly a user aligns with each latent feature. A high value in a column indicates that the user has a strong preference for that particular latent feature. A low value means they don’t have much of a preference for that feature.
Let’s say after applying SVD, we get a matrix U for our music example that looks like this:

	Feature 1 (Energetic Music)	Feature 2 (Calm Music)
Alice	0.8	0.2
Bob	0.9	0.1
Carol	-0.3	0.9

Alice has a strong preference for Feature 1 (Energetic Music) with a value of 0.8, and a weak preference for Feature 2 (Calm Music) with a value of 0.2.
Bob also has a strong preference for Feature 1 (Energetic Music) (0.9), but he has a much weaker preference for Feature 2 (Calm Music) (0.1).
Carol, on the other hand, has a strong preference for Feature 2 (Calm Music) (0.9) and a weak preference for Feature 1 (Energetic Music) (-0.3).
What this Means:
- Alice and Bob both like energetic music, which is why they have high values for Feature 1.
- Carol prefers calm music, as indicated by her high value for Feature 2.

Σ: Singular Value Matrix

A larger singular value indicates that the latent feature explains more of the variance in the ratings. For instance, the first latent feature might explain a significant portion of the ratings data (because both Alice and Bob like energetic music), while the second latent feature (which explains Carol’s preference for calm music) might explain less.

Vᵀ: Song-to-Latent Feature Matrix
Rows (songs): Each row of the table below corresponds to a particular song. Strictly speaking this layout is V; Vᵀ is its transpose, with one column per song.
Columns (latent features): Each column corresponds to one latent feature discovered by SVD.
Values: Each entry tells us how much the song aligns with a particular latent feature.
Here’s how the matrix might look for our music example (shown as V, with songs in rows):

	Feature 1 (Energetic Music)	Feature 2 (Calm Music)
Song A (Punk)	0.9	-0.1
Song B (Romantic)	0.8	0.2
Song C (Jazz)	-0.2	0.9
Song D (Classical)	-0.5	0.8

Song A (Punk) has a high value for Feature 1 (Energetic Music) (0.9), meaning that this song aligns with energetic or lively music, but it has a low value for Feature 2 (Calm Music) (-0.1), meaning it doesn't align with calm or soothing music.
Song B (Romantic) also has a high value for Feature 1 (Energetic Music) (0.8) but also has a moderate value for Feature 2 (Calm Music) (0.2), showing it may combine both energetic and calming elements.
Song C (Jazz) has a low value for Feature 1 (Energetic Music) (-0.2) and a high value for Feature 2 (Calm Music) (0.9), meaning it's a calm, soothing song.
Song D (Classical) also has a low value for Feature 1 (Energetic Music) (-0.5) and a high value for Feature 2 (Calm Music) (0.8), making it more of a calm or soothing piece of music.
What this Means:
- Vᵀ shows us how each song relates to the latent features.
- Song A (Punk) is closely related to Feature 1 (Energetic Music), while Song C (Jazz) is closely related to Feature 2 (Calm Music).

Key Insight: What SVD Actually Reveals

Here’s where SVD’s magic comes in:

Alice and Bob like both Song A (Punk) and Song B (Romantic).
- SVD captures that they share a preference for energetic, lively music, even though one song is punk and the other is romantic.
- This shared preference shows up as a latent feature in U for Alice and Bob: both have high scores for this feature.
Carol, on the other hand, likes Song C (Jazz) and Song D (Classical), which are more calm and soothing.
- SVD identifies this preference as another latent feature, and Carol scores highly on this second latent feature in U.
SVD doesn’t explicitly label these features as genres like “punk” or “romantic.”
- It simply sees that Alice and Bob share some ratings for energetic music, and Carol shares ratings for calming music.
- This means SVD will identify latent features that explain why Alice and Bob like Songs A and B, and why Carol prefers Songs C and D.

Predicting Missing Ratings with SVD

Now that we understand what each matrix represents, let's see how SVD helps us predict missing ratings:

We've decomposed our original ratings matrix into U, Σ, and Vᵀ
To predict Alice's rating for Song C (Jazz), we multiply her latent feature values by the importance of each feature, then by Song C's latent feature values:

Alice's predicted rating for Song C = (U_Alice × Σ × Vᵀ_SongC)

Using our example values:

Alice's latent feature values: [0.8, 0.2]
Singular values (assuming Σ = [[3, 0], [0, 2]]): 3 for Feature 1, 2 for Feature 2
Song C's latent feature values: [-0.2, 0.9]

Calculation:

Alice's preference for Feature 1 × Importance of Feature 1 × Song C's alignment with Feature 1:
0.8 × 3 × (-0.2) = -0.48
Alice's preference for Feature 2 × Importance of Feature 2 × Song C's alignment with Feature 2:
0.2 × 2 × 0.9 = 0.36
Sum these values: -0.48 + 0.36 = -0.12
Since ratings are typically positive, we might scale this to a rating range (e.g., 1-5):
Adjusted rating ≈ 2.5 (neutral/slightly negative)

This makes intuitive sense: Alice strongly prefers energetic music (Feature 1), but Song C (Jazz) is negatively associated with that feature and strongly associated with calm music (Feature 2), which Alice only weakly prefers. Therefore, SVD predicts Alice would give Song C a relatively low rating.
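The same arithmetic can be written as one matrix product. This sketch just plugs in the illustrative numbers from the tables above (they are example values, not the output of an actual decomposition), and the variable names are my own.

```python
import numpy as np

users = ["Alice", "Bob", "Carol"]
songs = ["Song A", "Song B", "Song C", "Song D"]

# Illustrative values copied from the example tables.
U = np.array([[0.8, 0.2],
              [0.9, 0.1],
              [-0.3, 0.9]])          # users x latent features
Sigma = np.diag([3.0, 2.0])          # importance of each latent feature
V = np.array([[0.9, -0.1],
              [0.8, 0.2],
              [-0.2, 0.9],
              [-0.5, 0.8]])          # songs x latent features

scores = U @ Sigma @ V.T             # users x songs, unscaled scores

alice_song_c = scores[users.index("Alice"), songs.index("Song C")]
print(round(float(alice_song_c), 2))  # -0.12
print(np.round(scores, 2))
```

Each entry of `scores` is the same sum of "preference × importance × alignment" terms, computed for every user and song at once.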

Application in Dimensionality Reduction

In real-world data, there’s often a lot of redundancy — users rate songs similarly, or products have overlapping qualities. SVD helps by:

Capturing the most important features (patterns) in the data.
Removing noise and redundancy by ignoring less significant singular values.
Compressing the data: Instead of storing the full original matrix, we store a low-rank approximation.

How the Reduction Works

To reduce dimensionality:

Choose a smaller rank k, where k < full rank.
Keep only the top k singular values and their corresponding vectors in U and Vᵀ.
Reconstruct the matrix approximately:

$A \approx A_k = U_k \Sigma_k V_k^T$

This reduced version of the matrix retains most of the meaningful information but uses far fewer numbers.

👥 Example: Alice, Bob, and Carol

Imagine we have a matrix of 3 users (Alice, Bob, Carol) rating 4 songs:

	Punk	Rock	Love	Ballad
Alice	5	4	1	1
Bob	4	5	1	0
Carol	1	1	5	4

This is a 3×4 matrix.

After applying SVD and reducing it to k = 2, we get:

U_k: 3×2 matrix — Each user represented by just 2 latent features.
Σ_k: 2×2 diagonal matrix — Strength of these 2 features.
V_kᵀ: 2×4 matrix — Each song represented using just 2 features.

This reduced version:

Reveals that Alice and Bob prefer a “Punk/Rock” dimension, while Carol prefers a “Love/Ballad” dimension.
Allows us to reconstruct an approximation of the original matrix using just the most important patterns.
Reduces noise and dimensionality without losing much of the core structure.
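Here is a minimal sketch of that reduction on the 3×4 ratings matrix above, using NumPy. It keeps the top k = 2 singular values and measures what is lost with the Frobenius norm; the variable names are my own.

```python
import numpy as np

# Ratings matrix from the table: rows = Alice, Bob, Carol;
# columns = Punk, Rock, Love, Ballad.
A = np.array([[5.0, 4.0, 1.0, 1.0],
              [4.0, 5.0, 1.0, 0.0],
              [1.0, 1.0, 5.0, 4.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]   # keep the top-k components
A_k = U_k @ np.diag(s_k) @ Vt_k               # rank-k approximation

error = np.linalg.norm(A - A_k)               # Frobenius norm by default
print(U_k.shape, s_k.shape, Vt_k.shape)       # (3, 2) (2,) (2, 4)
print(np.round(A_k, 2))
print(error, s[2])                            # error equals the dropped singular value
```

Because only one singular value is dropped here, the Frobenius error of the approximation is exactly that singular value, which is a handy way to judge how much structure a given k throws away.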

Why do this?

1. Captures Core Patterns, Not Raw Values

The original matrix told you explicitly:

“Alice rated Punk 5, Love 1…”

But those are surface-level observations.

The reduced version asks:

“What underlying factors might explain why Alice likes Punk and Rock more?”

For example:

Maybe Factor 1 represents a preference for high-energy music.
Maybe Factor 2 represents a preference for emotional or romantic themes.

In the reduced matrix, Alice might be represented as:

Alice → [2.1, 0.1] → Strong on Factor 1, Weak on Factor 2
Carol → [0.1, 2.3] → Weak on Factor 1, Strong on Factor 2

This tells us more than raw numbers — it shows why people like what they like.

2. Good Generalization — Not Just Memorization

The full matrix memorizes exact numbers.

The reduced matrix generalizes, letting us:

Predict missing values more robustly.
Spot similar users or items even if their ratings don’t match exactly.
Cluster users or songs by deeper, shared preferences.

This is crucial in recommendation systems like Netflix or Spotify — we often have incomplete data, and we want smart predictions, not perfect reconstructions.

3. Removes Noise

In real-world data, some ratings are noisy:

A person misclicked a 1 instead of 4.
Someone rated a song randomly.

SVD smooths over such inconsistencies by focusing on consistent trends, not one-off values.

Conclusion: Why SVD Matters

SVD's power lies in its ability to:

Discover hidden patterns (latent features) in data
Reduce dimensionality while preserving important information
Enable accurate predictions for missing values
Filter noise from data

Whether you're building recommendation systems, processing images, or analyzing text, SVD provides a mathematical foundation for understanding and working with complex data relationships.

Error-Correcting Codes: Hamming Code

Samyak Jain — Mon, 03 Mar 2025 12:11:09 +0000

Recently I started reading about error detection: how it started and how it is implemented.

But wait, what kind of errors are we dealing with? Let's take the example of ECC RAM (Error-Correcting Code RAM). Unlike the RAM in your laptop, which can afford the occasional unnoticed error, ECC RAM actively fixes mistakes in real time, because in places like banks and space missions even a single corrupted bit could mean disaster.

Let's take an example

Let’s say a server is storing financial transactions in RAM. You might think errors happen due to faulty hardware or software bugs and should be fixable in the usual way, so what's the big deal?
But sometimes errors don’t come from faulty code or bad hardware. They come from something you can't control, like outer space!! A random bit flip can occur due to cosmic radiation.

Cosmic rays are high-energy particles from space (mainly protons) that constantly hit Earth. When they strike electronic components, they can knock electrons loose, causing bit flips in memory. This is called a Single Event Upset (SEU).

So due to cosmic radiation the stored amount $1000 (binary 1111101000) accidentally becomes $1004 (binary 1111101100). Just a single bit got flipped, and it changed the amount saved in the system. That’s a serious problem: imagine this happening thousands of times across a banking system.

A bit of history..

In the 1940s, the American mathematician and engineer Richard Hamming was working at Bell Labs, where he had access to a complex punch card computer. The programs he fed through it kept failing because a random error would flip a bit and crash the whole program.

Frustrated by unintentional errors caused by factors beyond human control (like cosmic radiation), he developed the first error correction code.

Parity Checks

Before talking about Hamming's algorithm, we need some context on parity checks, because they are used heavily in it. The idea of a parity check is pretty simple but cool. To understand it, let's take an example.

Let’s say we need to transmit a 7-bit message whose binary representation is 1101010. When the receiver gets this message, how do they know whether it was transmitted correctly or an error occurred?

To help with this, a parity bit is added to the message. Instead of sending just 7 bits, we send 8 bits, where one extra bit is included purely for error detection.

Even Parity Check

One common method is the even parity check, where:

We count the number of 1s in the message.
If the number of 1s is odd, we add a 1 as the parity bit to make the total count even.
If the number of 1s is already even, we add a 0 as the parity bit to keep it even.

For example, in our message 1101010, the number of 1s is four (which is already even). So, the parity bit added will be 0:

Original message:  1101010  
Parity bit:        0  
Transmitted data:  11010100

If instead the message was 1101011 (which has five 1s), we would add a 1 to make it even:

Original message:  1101011  
Parity bit:        1  
Transmitted data:  11010111
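The two examples above can be reproduced with a few lines of code. This is only a minimal sketch; the function names evenParityBit and addEvenParity are my own, and messages are represented as strings of '0' and '1' for readability.

```javascript
// Returns the even-parity bit (0 or 1) for a string of '0'/'1' characters.
function evenParityBit(bits) {
  let ones = 0;
  for (const ch of bits) {
    if (ch === "1") ones += 1;
  }
  return ones % 2; // odd count -> 1, even count -> 0
}

// Appends the parity bit so the transmitted word has an even number of 1s.
function addEvenParity(bits) {
  return bits + String(evenParityBit(bits));
}

console.log(addEvenParity("1101010")); // 11010100
console.log(addEvenParity("1101011")); // 11010111
```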

Error Detection with Parity Checks

After transmission, the receiver recounts the number of 1s.

If the total count is even, the message is considered correct.
If the total count is odd, it means an error occurred during transmission.

Parity checks are a simple and efficient way to detect errors.

Problems with Parity Checks

1. Undetected Errors (False Negatives)

A parity bit works by ensuring that the total number of 1s in a message (including the parity bit) is always even (or odd, depending on the scheme).

If a single bit flips due to an error, the parity check will detect it because the total count of 1s will become incorrect.
But if two bits flip, the total count of 1s changes by +2, 0, or -2, which still maintains the even (or odd) parity, making the error undetectable.

Example of Two-Bit Error Failure

Suppose we use even parity and transmit this 8-bit data with a parity bit:

Data	Parity Bit	Total 1s (should be even)
10110010	0	4 (even)

If a single-bit error occurs (e.g., 10110010 → 10110000), the number of 1s changes from 4 → 3, which is odd, and the error is detected.

But if two bits flip (e.g., 10110010 → 11110000), the number of 1s changes from 4 → 4 (still even), and the error is not detected.
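Here is the same failure as a quick sketch of the receiver's side. The function name passesEvenParity is my own, and each word below is the 8 data bits from the example followed by the parity bit 0.

```javascript
// Receiver-side check: a word passes even parity when its count of 1s is even.
function passesEvenParity(word) {
  let ones = 0;
  for (const ch of word) {
    if (ch === "1") ones += 1;
  }
  return ones % 2 === 0;
}

console.log(passesEvenParity("101100100")); // true  (as sent)
console.log(passesEvenParity("101100000")); // false (one bit flipped, detected)
console.log(passesEvenParity("111100000")); // true  (two bits flipped, missed)
```

The last line is the whole problem: the corrupted word looks perfectly fine to the receiver.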

2. No way to pinpoint the error

Although a parity check detects that an error occurred, it cannot tell us which bit is wrong. The only option is to retransmit the entire message, which is costly and inefficient.

Hamming Codes

So when Richard Hamming was facing this issue with his punch card computer, he created an algorithm that is known today as Hamming codes.

Now Hamming's Algorithm also uses parity checks but instead of using just one parity bit for the entire message, he introduced multiple parity bits. Their positions are strategically calculated based on binary representation, allowing errors to be not only detected but also corrected.

Before getting into details, let's abstract everything: we will use multiple parity bits to help us identify and fix errors. Although their positions follow a specific calculation, for now I’ll place them manually and explain the logic behind their placement later.

For this example, we will use an 11-bit message with 4 parity bits along with 1 extra bit (which we’ll ignore for now). That gives 16 bits in total, laid out as a 4×4 grid with indices 0 to 15, read left to right and top to bottom.

The 4 parity bits sit at indices 1, 2, 4, and 8. We’ll discuss how their positions were determined later, but for now let's focus on how we use them to detect errors.

So How do we use these 4 parity bits now? The process is similar to a simple parity check, but instead of a single parity bit verifying the entire message, each parity bit is responsible for checking a specific subset of bits.
(How are these subsets assigned? We’ll cover that soon.)

Let's take this set first

1. Parity at index 1
Here, the parity bit is at index 1, and its corresponding set of message bits is highlighted.

Running a standard parity check, we see that the number of 1s is even, meaning this subset is correct—no error detected.

2. Parity at index 2

Next, we move to the parity bit at index 2:

Checking parity here, we find an odd number of 1s, which indicates an error in one of these bits—but we don’t know which one yet.

3. Parity at index 4

Now, let’s check the index 4 parity bit, covering the 2nd and 4th rows:

The parity check here is even, meaning no errors were found in this subset.

4. Parity at index 8

The last parity bit, at index 8, covers the 3rd and 4th rows:

Running a parity check, we find an odd number of 1s, confirming an error in this subset as well.

Pinpointing the Error

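Parity bit at index 1 passed, so columns 2 and 4 are correct and the error is in column 1 or column 3.
Parity bit at index 2 failed, so the error is in column 3 or column 4. Combined with the previous check, it must be in column 3.
Parity bit at index 4 passed, so rows 2 and 4 are correct and the error is in row 1 or row 3.
Parity bit at index 8 failed, so the error is in row 3 or row 4. Combined with the previous check, it must be in row 3, and row 3 of column 3 is index 10.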

By combining these results, we precisely locate the error bit and can correct it.

Haha, sounds like just luck, right? I mean, what if the error was one bit higher or lower? These errors happen randomly, so how can we be sure this ‘intersection’ always works? Are we just hoping for the best? So many questions!!!

Actually, the example above was just to give a high-level idea of how the algorithm works, so that when I go into the intuition behind each step I can refer back to it.

Parity Bit Placement Logic

OK, so the parity bits' locations are not selected randomly but are based on a property of positional notation (a numerical representation system where the value of a digit depends on its position in the number).

In any positional notation system, a number is either:

A single power of the base (if it corresponds to a pure power).
A sum of multiple distinct powers of the base (if it is not a pure power).

For Example

Binary (Base 2)

8 → 1000 → ( 2^3 ) (a single power of 2)
10 → 1010 → ( 2^3 + 2^1 ) (sum of distinct powers of 2)

Decimal (Base 10)

1000 → ( 10^3 ) (a single power of 10)
1,234 → ( 10^3 + 2 × 10^2 + 3 × 10^1 + 4 × 10^0 ) (sum of multiples of distinct powers of 10)

But how did Hamming use this property?

The whole logic doesn’t rely on the values of the bits but on their positions. Hamming placed the parity bits at positions that are pure powers of 2 (i.e., 1, 2, 4, 8, …). These positions are chosen because their binary representation contains only a single 1, meaning they are not formed by summing multiple powers of 2.

But why choose only pure powers? Because every other number is just a combination of two or more pure powers of 2. For example, 9 is formed by 2³ + 2⁰, which in binary is 1001.

By reserving these pure power positions for parity bits, we can systematically group all other bits under the relevant parity checks. Any bit whose binary representation has 1 in the least significant position (i.e., xxxx1) falls under parity bit at index 1 (0001), meaning it contributes to the 2⁰ position.

However, bits do not belong to just one parity check. Since numbers are made up of multiple powers of 2, a bit at position 9 (1001) falls under both the parity bit at index 8 (2³) and the parity bit at index 1 (2⁰).

So now, if you look at the grid above, the example images I added—and the specific set of bits assigned to each parity bit—should start to make sense. Each parity bit is responsible for checking all positions that include its corresponding power of 2 in their binary representation. This systematic assignment is what actually allows us to detect and correct errors efficiently.
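The assignment rule is short enough to write down as code. This is a sketch for the 16-bit block used in this post (positions 1 to 15); the function name coveredPositions is my own.

```javascript
// Positions (1..maxPosition) checked by the parity bit at index `parityIndex`,
// where parityIndex is a power of 2. A position is covered when its binary
// representation has that power of 2 set.
function coveredPositions(parityIndex, maxPosition) {
  const covered = [];
  for (let pos = 1; pos <= maxPosition; pos += 1) {
    if ((pos & parityIndex) !== 0) covered.push(pos);
  }
  return covered;
}

for (const p of [1, 2, 4, 8]) {
  console.log(`${p} -> ${coveredPositions(p, 15).join(" ")}`);
}
// 1 -> 1 3 5 7 9 11 13 15
// 2 -> 2 3 6 7 10 11 14 15
// 4 -> 4 5 6 7 12 13 14 15
// 8 -> 8 9 10 11 12 13 14 15
```

Notice that position 10 appears only in the lists for 2 and 8, which is exactly why those two checks failed in the walkthrough.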

At first I was a bit unsure whether each position is covered uniquely. Couldn't the same combination of parity bits point at two different message bits?

That is not mathematically possible: every number has exactly one binary representation, so a given set of distinct powers of 2 adds up to exactly one number.

When performing parity checks on all blocks, I can identify that errors exist in specific parity blocks (e.g., X and Y). When I XOR the positions of these failing parity bits, I pinpoint the exact erroneous bit.

Why? Because each parity bit sits at a distinct power of 2 and checks exactly the positions whose binary representation contains that power. A flipped bit fails precisely the checks that appear in its own position number.

Mathematically, distinct powers of 2 never share a 1 bit, so XORing them is the same as adding them, and the result is the one position whose binary representation contains exactly those powers. For example, 2 ⊕ 8 = 10, which is not a power of 2 itself but the position checked by both parity bits 2 and 8 and by no others.

But just for reference here is a table.

Position	Binary	P1 (1)	P2 (2)	P3 (4)
1 (P1)	001	✅	❌	❌
2 (P2)	010	❌	✅	❌
3 (D1)	011	✅	✅	❌
4 (P3)	100	❌	❌	✅
5 (D2)	101	✅	❌	✅
6 (D3)	110	❌	✅	✅
7 (D4)	111	✅	✅	✅

OK, so remember that our index 2 and index 8 parity checks failed? How do we pinpoint the exact error bit in a more robust, mathematical way, instead of that lucky-looking intersection?

One way is to XOR the positions of the failed parity bits, which in our case are P2 and P8.

Now, XOR their positions:

2 (₁₀) = 0010₂
8 (₁₀) = 1000₂
XOR: 0010₂ ⊕ 1000₂ = 1010₂ = 10 (₁₀)

This tells us that the bit at index 10 was flipped, so we flip it back.

This way, using Hamming code, we not only found that there was an error bit but also fixed it.
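Here is a minimal sketch of that procedure for a 16-bit block stored as an array (index 0 is the extra bit, parity bits at 1, 2, 4, 8). The block contents below are a made-up valid block that I worked out by hand, not the one from the images, and the function names are my own.

```javascript
// Returns the indices of the parity groups whose count of 1s is odd.
function failingChecks(block) {
  const failing = [];
  for (const p of [1, 2, 4, 8]) {
    let ones = 0;
    for (let pos = 1; pos < block.length; pos += 1) {
      if ((pos & p) !== 0 && block[pos] === 1) ones += 1;
    }
    if (ones % 2 === 1) failing.push(p);
  }
  return failing;
}

// XOR of the failing indices is the position of the flipped bit (0 = nothing found).
function errorPosition(block) {
  return failingChecks(block).reduce((acc, p) => acc ^ p, 0);
}

// Hypothetical valid block: every parity group has an even number of 1s.
const sent = [1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0];

// Simulate a single bit flip at index 10.
const received = sent.slice();
received[10] ^= 1;

console.log(failingChecks(received)); // [ 2, 8 ]
console.log(errorPosition(received)); // 10

// Flip it back to repair the block.
received[errorPosition(received)] ^= 1;
console.log(errorPosition(received)); // 0
```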

Calculating the Parity Bit

OK, we discussed the importance of and the maths behind the placement of the parity bits, but what about their values? It's the same as the normal parity bit calculation: if the number of 1s among the other bits in a parity bit's set is even, the parity bit is 0, otherwise it is 1.
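As a sketch, here is how a sender could build the 16-bit block from 11 data bits using that rule. The function name encode and the sample data bits are my own choices for illustration; index 0 is left at 0 here because the extra bit is covered in the next section.

```javascript
// Builds a 16-bit block (array of 0/1) from 11 data bits.
function encode(dataBits) {
  const block = new Array(16).fill(0);
  let next = 0;
  // Data goes into every position from 3 to 15 that is not a power of 2.
  for (let pos = 3; pos < 16; pos += 1) {
    if ((pos & (pos - 1)) !== 0) {
      block[pos] = dataBits[next];
      next += 1;
    }
  }
  // Each parity bit makes the number of 1s in its group even.
  for (const p of [1, 2, 4, 8]) {
    let ones = 0;
    for (let pos = 1; pos < 16; pos += 1) {
      if (pos !== p && (pos & p) !== 0 && block[pos] === 1) ones += 1;
    }
    block[p] = ones % 2;
  }
  return block;
}

// Hypothetical 11-bit message.
const data = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0];
console.log(encode(data).join("")); // 0111001110100110
```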

Two Error Bit Detection (Extended Hamming Codes)

That was all about detecting and correcting a single bit, but with Hamming's algorithm we can also detect (though not fix) a second error.

Remember the bit at index 0 that we ignored earlier? It is a parity bit over the whole block, set so that the total number of 1s is even. After correcting the first detected error, we check whether the total number of 1s in the block is even. If it's still odd, that means there's another error somewhere in the message.
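A sketch of that check follows, again with my own function names and a made-up valid block. Two notes on the assumptions: the syndrome is computed by XORing the indices of all 1 bits, which gives the same result as XORing the indices of the failing parity checks, and I added one extra case the description above does not mention, namely when the only flipped bit is index 0 itself.

```javascript
// XOR of the indices of all 1 bits (index 0 contributes nothing).
// This equals the XOR of the failing parity-check indices.
function syndrome(block) {
  let s = 0;
  for (let pos = 1; pos < block.length; pos += 1) {
    if (block[pos] === 1) s ^= pos;
  }
  return s;
}

function overallParityIsEven(block) {
  return block.reduce((sum, bit) => sum + bit, 0) % 2 === 0;
}

function decode(received) {
  const block = received.slice();
  const pos = syndrome(block);
  if (pos !== 0) block[pos] ^= 1; // attempt the single-bit correction
  if (overallParityIsEven(block)) {
    return pos === 0 ? "no error" : `corrected bit ${pos}`;
  }
  return pos === 0 ? "bit 0 itself flipped" : "two errors detected";
}

// Returns a copy of the block with the given positions flipped.
const flip = (block, ...positions) =>
  block.map((bit, i) => (positions.includes(i) ? bit ^ 1 : bit));

// Hypothetical valid block (even overall parity, all checks pass).
const valid = [1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0];

console.log(decode(valid));              // no error
console.log(decode(flip(valid, 10)));    // corrected bit 10
console.log(decode(flip(valid, 5, 10))); // two errors detected
```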

Increase in Parity bits with increase in Message bits.

You might worry about how many parity bits we need to add as the message grows. It's really negligible: to give an idea of scale, a 256-bit message needs only 9 parity bits, and I think that's reasonable.

The number of parity bits ( P ) required for a message of ( D ) data bits in Hamming Code follows the formula:

$2^P \geq D + P + 1$

where:

( P ) = number of parity bits
( D ) = number of data bits
The extra ( +1 ) accounts for the "no error" case.

Rationale behind this..

To detect and correct single-bit errors, each possible error position must produce a unique pattern of parity check results.

We have D data bits and P parity bits, so the total number of bits is D + P.
We need to identify errors in any of these D + P bits plus one extra case (no error).
This means we need at least D + P + 1 unique error states.

Since each parity bit combination forms a binary number, the total number of unique error states that can be represented by P parity bits is:

$2^P$

Thus, to ensure that we can represent all possible error positions (including the "no error" case), we need:

$2^P \geq D + P + 1$
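The inequality is easy to check by brute force. This small sketch (the function name parityBitsNeeded is my own) finds the smallest number of parity bits that satisfies it for a given number of data bits:

```javascript
// Smallest P such that 2^P >= D + P + 1.
function parityBitsNeeded(dataBits) {
  let p = 0;
  while (2 ** p < dataBits + p + 1) p += 1;
  return p;
}

console.log(parityBitsNeeded(4));   // 3
console.log(parityBitsNeeded(11));  // 4
console.log(parityBitsNeeded(256)); // 9
```

The 11-bit case matches the 4 parity bits used in the example grid, and the 256-bit case matches the figure of 9 mentioned above.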

This blog is heavily inspired by 3Blue1Brown’s excellent video. I highly recommend checking it out!

If you spot any mistakes or feel that something could be explained better, let me know—I’ll make sure to correct it.

Ref in React : Remote access to DOM elements

Samyak Jain — Tue, 09 Apr 2024 11:02:55 +0000

Ever struggled with making two unrelated components in React interact smoothly without cluttering your code? You're not alone. Let's explore a clean, efficient solution.

Let's say you created a file input element, and clicking on it does the job, but you also have a button that should trigger the same functionality when clicked. You don't want to wrap the button in the input tag because it's already defined elsewhere. Or you may need to focus on a node, scroll to it, or measure its size and position.

Many of you will be familiar with these use cases from vanilla JS, which provides functions like getElementById for this, and you can certainly use the same in React as well.

BUT

It is not recommended to mix vanilla JavaScript DOM manipulation methods like getElementById with React applications. This is because React manages the DOM differently through its virtual DOM and reconciliation process.

React's declarative approach encourages managing the DOM through state and props and updating the UI in response to changes in state.

Directly manipulating the DOM using methods like getElementById can lead to inconsistencies between the virtual DOM managed by React and the actual DOM, potentially causing unexpected behavior and bugs in your application.

That's where refs come into the picture in React. I think of refs as a way to remotely access an element from another element, even when the two are not related in any way.

If I had to give an analogy, refs are like turning your fan ON through your mobile phone when traditionally you can do that only through the switch wired to the fan.

Refs give us a way to interact with elements and store a reference to them. I can add a ref attribute on a div, store its reference with the useRef hook, and then use it in the onClick listener of another div that is not related to the first one in any way.

Here is a sandbox link where I used a useRef to trigger a file Input field through a button.

Whether you select a file with the input button or the custom button, you’ll see the message ‘I got called’, because the custom button is just accessing the original input field through its reference.

But why should I use only useRef for this purpose? Why shouldn't I use useState? I mean, we are just storing the reference of an element, right?

Well, the main reason is simply that useState is not designed for this use case. State (useState) is used for data that, when changed, should re-render the component. State is reactive; changes to state variables trigger component re-renders.

State is intended for data that, when changed, should update the UI. However, a DOM element reference (like an input field) doesn't inherently require UI updates when set.

If the reference is kept in state, each time the ref changes (which can be often, due to the nature of inline function refs) it causes a state update and thus a re-render of the component. This is inefficient and unnecessary for simply referencing a DOM element.

That's why useRef is the better option here: changing a useRef-based variable does not cause a re-render.
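To see the difference without any React at all, here is a toy model in plain JavaScript. It is definitely not how React is implemented, and createRefBox, createStateBox, and render are names I made up; it only mimics the two behaviours described above: writing to a ref notifies nobody, while calling a state setter triggers a render.

```javascript
// Toy model only: NOT React's implementation.
let renderCount = 0;

function render() {
  renderCount += 1;
}

// A ref is a plain mutable box; writing to .current notifies nobody.
function createRefBox(initial) {
  return { current: initial };
}

// A state setter stores the value and triggers a render
// (done immediately here to keep the model simple).
function createStateBox(initial) {
  const box = { value: initial };
  const set = (next) => {
    box.value = next;
    render();
  };
  return [box, set];
}

const fakeInput = { tag: "input" }; // stand-in for a DOM node

const ref = createRefBox(null);
ref.current = fakeInput;
console.log(renderCount); // 0

const [state, setState] = createStateBox(null);
setState(fakeInput);
console.log(renderCount); // 1
```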

OK, so we get that it's against React’s design principles to use useState for this purpose. But if I ignore this reasoning and still decide to use useState for storing the reference, what harm can it cause to the basic working of my feature?

Delay in Availability
State updates in React are asynchronous. When you update a state with a new DOM reference, there's a brief period before the state is actually updated. This can lead to timing issues where you try to access the ref before the state has updated, potentially leading to errors or undefined values.

Here is a sandbox link which demonstrates using ref with useState and useRef and explains why you should use useRef for storing references instead of useState.

If you checked the working example, you can see that the inputRef might not be immediately available after being set, due to React's asynchronous state updates. This could lead to timing issues or undefined values when trying to access or interact with the input element right after it's mounted.
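The timing issue can also be imitated with a toy model in plain JavaScript. Again, this is not React's implementation and every name here is made up; it only mimics the idea that a render works with a snapshot of state, so a value set during one render becomes visible in the next one, while a ref is readable right away.

```javascript
// Toy model only: NOT React's implementation.
let committed = null; // the value the next "render" will receive
const ref = { current: null };

function renderOnce(node) {
  const snapshot = committed; // what useState would hand to this render
  const setNode = (next) => {
    committed = next; // only visible to the next render
  };

  setNode(node);
  ref.current = node;

  return { seenViaState: snapshot, seenViaRef: ref.current };
}

const fakeInput = { tag: "input" }; // stand-in for a DOM node

const first = renderOnce(fakeInput);
console.log(first.seenViaState);             // null
console.log(first.seenViaRef === fakeInput); // true

const second = renderOnce(fakeInput);
console.log(second.seenViaState === fakeInput); // true
```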

The ref attribute looks cool, right? But overusing it for everything is again a bad idea. Just as useState didn’t fit ref’s use case, ref doesn't fit everywhere else either.

So when to use Ref generally?

When you need direct access to a DOM element to perform actions that cannot be done declaratively through React's state system, ref is the tool to use. Common use cases include:

Focusing an Input - Programmatically setting focus on an input field when a component mounts or in response to specific user interactions.
Measuring Elements - Obtaining measurements (e.g., width, height, or position) of an element that are only available via the DOM API.
Managing Focus, Text Selection, or Media Control - ref allows you to manage focus, select text, or control media playback (play, pause, seek) in a way that is not possible through declarative state updates alone.

These are just the examples that were top of mind; there can be a lot of creative ways to use the ref attribute.

Now that we have understood refs, their use cases, and the other considerations, let's see how else we can use them. The most popular use of refs in React is through ForwardRefs.

ForwardRefs

Forwarding refs in React is a technique that allows you to pass a ref down to a child component. This is particularly useful when you need direct access to a DOM element or a class component instance which is available in a child component.

Sounds kinda similar to passing props to children right?

But ForwardRefs holds a different purpose, it allows you to pass refs through components to a child, usually to access a DOM element or a class component instance directly for imperative operations. This is not about passing down data for rendering like with props but about giving a parent component direct access to a DOM node managed by a child.

Normally (in React 18 and earlier), refs are not "passable" through function components because they are not part of the component's props. However, React provides the React.forwardRef API to solve this problem.

Here is a code snippet which shows how you pass a ref from a parent component to a child component

import React, { useEffect, useRef } from "react";

// forwardRef hands the parent's ref to the child as a second argument.
const ChildComponent = React.forwardRef((props, ref) => (
  <div ref={ref}>I'm a child</div>
));

function ParentComponent() {
  const childRef = useRef();

  useEffect(() => {
    console.log(childRef.current); // Directly access the child's div
  }, []);

  return <ChildComponent ref={childRef} />;
}

This way my parent component gets a reference to whichever element of the child component the ref is attached to.

Now I was curious: why do I need to wrap my functional component in ‘React.forwardRef’? I mean, why can't I just pass my ref like any other prop?

Turns out if I pass a ref like this, <ChildComponent ref={childRef} />, and the child component is not wrapped in React.forwardRef, it won’t work the way it was intended.

This happens because, in the child component, we would be trying to destructure ref from props.

This is problematic because ref is a reserved prop name in React. It's not passed into your component as a regular prop. React treats ref (and key) specially, and they are not part of the props object passed to your component. Thus, attempting to destructure ref from props like this won't work because ref will not exist on the props object.

Now it may sound like bypassing React’s conventions, but I tried to find a workaround where I can pass the ref from the parent to the child component without using React.forwardRef. Here is the snippet for that -

import { useEffect, useRef } from "react";

// Tref is an ordinary prop, so React passes it through untouched.
function ChildComp({ Tref }) {
  return (
    <div ref={Tref}>App</div>
  );
}

function App() {
  const childRef = useRef();

  useEffect(() => {
    console.log(childRef.current); // Directly access the child's div
  }, []);

  return <ChildComp Tref={childRef} />;
}

Because ref is a reserved prop name, I just added a ‘T’ at the start, and now it works like any other prop: I can pass it to the child component, and the parent gets back a reference to the child component’s element.

BUT

I tried this only because I was curious about what I could do. In a production environment, experiments like these should be avoided, and there are many reasons for that.

DX (Developer Experience) - When other developers read your code and see React.forwardRef, they’ll know at first glance that you are passing refs from a parent component to a child component, which improves code maintainability.
Future React Compatibility - React's development team continually improves the framework. While your workaround might work now, there's no guarantee it will be compatible with future versions of React. Following the recommended patterns ensures better forward compatibility.
Static Typing - If you're using TypeScript or PropTypes for type checking, React.forwardRef integrates smoothly, allowing you to specify types for both props and refs. Custom patterns might require additional effort to type check correctly.
DevTools Inspection - Components wrapped in React.forwardRef are better supported by React DevTools. The forwarding of refs is a recognized pattern, and such components can be inspected more intuitively in the DevTools, improving debuggability.

Although none of these points indicates a breaking problem where the code won't run as intended, it is still recommended to use React.forwardRef the way it is meant to be used.

Well that was all regarding refs from my side, here is a takeaway from this blog

Understanding Refs - Refs provide a way to access DOM nodes directly within React components, bridging the gap between React's virtual DOM and the actual DOM.
useRef vs. useState - We discussed why useRef is the preferred method for referencing DOM nodes without triggering unnecessary re-renders, in contrast to useState, which is designed for data that impacts the UI and requires re-rendering.
Practical Applications - Through examples, we’ve seen how refs can be used for focusing elements, measuring them, and more complex tasks like forwarding refs to access DOM elements in child components.
Best Practices - While refs are powerful, we emphasized their proper use and the importance of not overusing them, adhering to React’s design principles for clean and maintainable code.

Remember, while ref is a great tool in React, it comes with the responsibility of using it judiciously to enhance, rather than complicate, your components’ interactions with the DOM.

Well, that’s all the doubts and insights I had in mind regarding refs.
Have you encountered situations where refs were the hero of your project? Or perhaps a challenge that seemed tailor-made for a ref solution but ended up being solved differently? Share your experiences, tips, or questions in the comments below.

Additionally, if you’ve found creative uses for refs or have insights on pitfalls to avoid, let’s hear about them! Engaging with each other’s stories and strategies can lead to a deeper understanding of React’s capabilities.

Thanks for reading this far 😁

Listening on the Network: What does it mean?

Samyak Jain — Wed, 27 Mar 2024 07:23:27 +0000

What does “listening” mean in terms of networking?

I am trying to make an HTTP server from scratch and had to understand TCP listeners. Everything I read said that TCP listeners do this or that and that you bind an IP address and port number to them, but the one thing that kept confusing me was: what does “listening” actually mean?

I mean, how does it listen? Is there a fixed interval at which we check again whether a request came? How is my Rust program receiving this data? Where is this data coming from?

So after doing some research, I got to know that -

When we say that a TcpListener instance in Rust is "listening" on a specific IP address and port, we're using a networking metaphor to describe the way the operating system monitors network traffic for connection attempts that match certain criteria (in this case, the IP address and port number).

The Concept of Binding

Binding to an IP Address and Port

IP Address: Every device connected to a network has an IP address, which is used to identify it on the network. When a TCP listener is bound to an IP address, it tells the operating system that this particular application is interested in network traffic sent to that address.

Port: A port is a numerical identifier in networking used to specify a specific application or service on a device. Since a single device can run multiple networked applications simultaneously, the port number helps differentiate which application should handle incoming data.

What exactly is Binding?

When you bind a server (like a Rust TcpListener) to a specific IP address, you're telling the server to listen for incoming connections on that IP address.

I was confused about what Binding means. Like does it mean from which address I’ll receive data? Or what?

So I tried to understand this with an analogy -

Imagine you have several mailboxes in front of your house, each with a different label or number. When someone sends you a letter, they choose one of those mailboxes based on the label or number you've given them.

Binding is like deciding which mailbox you’re going to check for mail. If you decide to only check the mailbox labeled "Bills," it means you’re telling anyone who wants to send you something, "Hey, if you want me to see it, put it in the 'Bills' mailbox."

Binding does not mean choosing which senders you'll accept mail (or requests) from. Instead, it's about choosing which mailbox (among those at your house) you're going to use to receive any mail (or requests) sent to your house.

Now in this analogy, you might wonder how my house has multiple mailboxes in front of it. Does it mean that my laptop has multiple IP addresses?

Well, normally the NIC in a laptop has only one IP address configured. On cloud platforms, though, multiple websites are often hosted on the same machine, and giving the NIC multiple IP addresses lets each address receive its own requests without interfering with the others.
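My server is in Rust, but the same idea can be sketched in a few lines with Node's built-in net module. Treat this as an illustration rather than the code from my project: it binds to the loopback address 127.0.0.1, asks the OS for any free port by passing 0, and then connects to itself once so there is something to accept. Note that there is no loop anywhere asking "did a request come yet?". We only register a callback and the runtime calls it when the OS reports a connection.

```javascript
const net = require("net");

const server = net.createServer((socket) => {
  // Runs only when the OS hands us an accepted connection.
  console.log("connection from", socket.remoteAddress, socket.remotePort);
  socket.end("hello\n");
  server.close(); // stop listening after the first connection so the demo exits
});

server.on("error", (err) => console.error("listen failed:", err.message));

// Bind: port 0 lets the OS choose a free port; 127.0.0.1 is the loopback address.
server.listen(0, "127.0.0.1", () => {
  const { address, port } = server.address();
  console.log(`listening on ${address}:${port}`);

  // Connect to ourselves to trigger the callback above.
  const client = net.connect(port, address);
  client.on("data", (chunk) => process.stdout.write(chunk));
  client.on("error", (err) => console.error("connect failed:", err.message));
});
```

The client here connects from whatever source port the OS gives it, which matches the mailbox analogy: binding picked the mailbox we check, not who is allowed to send.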

Now back to how listening works theoretically

Operating System's Role: The operating system keeps track of all the IP addresses and ports that applications are listening on. When network packets arrive at the device, the operating system checks the destination IP address and port number against this list.

Here again you can see that the operating system checks both the destination IP address and the port number. If the machine has multiple IP addresses configured, this is how the OS decides which listener should get the request.

Network Traffic Filtering: If the destination IP and port of an incoming packet match an application that's listening (i.e., our TcpListener), the operating system forwards this packet to that application. If not, the packet is ignored or rejected based on the system's network rules.
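The lookup described above can be pictured with a toy model. A real kernel's socket tables are far more involved, and the addresses, ports, and app names below are made up for illustration (192.0.2.x is a range reserved for documentation). The one real-world detail I included is that a listener bound to 0.0.0.0 accepts traffic for any of the machine's addresses.

```javascript
// Toy model of the OS matching an incoming packet to a listening application.
const listeners = [
  { ip: "192.0.2.10", port: 80, app: "web-server" },
  { ip: "0.0.0.0", port: 7878, app: "rust-http-server" }, // any local address
];

function route(packet) {
  const match = listeners.find(
    (l) =>
      l.port === packet.dstPort &&
      (l.ip === packet.dstIp || l.ip === "0.0.0.0")
  );
  return match ? match.app : "rejected";
}

console.log(route({ dstIp: "192.0.2.10", dstPort: 80 }));   // web-server
console.log(route({ dstIp: "192.0.2.11", dstPort: 80 }));   // rejected
console.log(route({ dstIp: "192.0.2.11", dstPort: 7878 })); // rust-http-server
```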

How does the OS "listen" to the request?

The process by which an operating system listens for and handles incoming network requests to a bound IP address and port doesn't typically rely on a timer that checks periodically (like every 1 second). Instead, the mechanism is more efficient and event-driven, utilizing the network stack and hardware capabilities to immediately notify the operating system of incoming packets. Here’s a simplified overview of how this works:

Network Interface Controller (NIC)

Each network packet arriving at a device first reaches the Network Interface Controller (NIC), the hardware component that connects a computer to a network. The NIC operates at a low level, handling the physical transmission and reception of data packets over the network medium.

So, physically speaking, it's neither the TCP listener nor the OS that first detects incoming data, but the NIC.

The initial detection of incoming network packets is indeed a hardware-level function, primarily managed by the Network Interface Controller (NIC). The NIC is the first point of contact for all incoming data from the network. It operates at a low level, directly interfacing with the physical network medium (like Ethernet, Wi-Fi, etc.), and is responsible for the physical transmission and reception of data packets.

So what happens when concurrent requests are received by a NIC? I mean, can it detect only one at a time, or an unlimited number?

The Network Interface Controller (NIC) plays a crucial role in handling incoming network traffic, including dealing with concurrent requests. While the NIC is extremely fast and can process a vast number of packets per second, it indeed deals with one packet at a time due to the serial nature of network communication.

However, the combination of high-speed operation, buffering, and efficient handling by the operating system and application software makes it capable of managing what appears as concurrent requests. Here's how it works:

Serial Processing: Physically, the NIC receives packets one at a time because a network cable or wireless connection can only carry one packet's worth of data at any instant. However, because packets are small and the NIC operates at a very high speed, it can process many packets in a very short amount of time, giving the impression of parallelism.

Now again the question that bugged me here was -

How can something send a request to another thing without the other thing waiting for it?

Like in our current discussion, the NIC is sending some information to the CPU, but the CPU is working on something else, so how exactly does the CPU receive it?

The process which is being described here is called asynchronous communication, where one component can send a signal or message to another component without the recipient actively waiting for it at that moment. Let's break down how this works in the context of the NIC sending an interrupt request (IRQ) to the CPU:

Interrupt Controller: Modern computer systems include an interrupt controller, a hardware component responsible for managing interrupts from various devices, including the NIC. The interrupt controller is constantly monitoring for incoming interrupts from different sources.

The main purpose of an interrupt controller is to arbitrate between multiple devices that may need the CPU's attention simultaneously. It ensures that each device gets a fair chance to interrupt the CPU when necessary.

When a hardware device, such as a network interface card (NIC) or a keyboard, needs to communicate with the CPU, it sends an interrupt signal to the interrupt controller.

Yes, even keystrokes from a keyboard generate interrupts in a computer system. When you press a key on your keyboard, it sends an electrical signal to the keyboard controller, which in turn generates an interrupt to the CPU via the interrupt controller. This interrupt informs the CPU that there is new input from the keyboard that needs to be processed.

So this explains how the data reaches the OS from the NIC (using interrupt handling). Now how does it reach the TCP listener?

That's a topic for our next discussion 😁. Thanks for reading this far.

Strings in Rust

Samyak Jain — Tue, 12 Mar 2024 04:28:44 +0000

Today we are going to understand strings in Rust, which means learning about String and &str. Let's look at each of them, and I'll try to clear up some of the doubts that I had while reading about them.

&str -

&str is an immutable, UTF-8 encoded string slice. Since &str is immutable, you cannot modify its content. It is a borrowed reference to a portion of an existing string, and it does not own the data.

String -

String is a growable, heap-allocated string. It is owned and mutable, allowing dynamic modifications to its content.
It is a type provided by the Rust standard library and is not a primitive type like str. It is a heap-allocated, UTF-8 encoded string that can be dynamically resized.

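This is how a mutable String is created -
fn main() {
    // `mut` is what lets us modify the String after creating it
    let mut test = "hello".to_string();
    test.push_str(" world");
    println!("{test}"); // prints "hello world"
}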

Doubts

Now this is just a very brief overview of both, and I had some doubts about how they work:

In &str, how is it by default a borrowed reference to an existing string? If I just created a variable like this let test = "hello"; , how is this a borrowed reference to an existing string? I mean, I just created this, right?
In a mutable string, why do I need to use .to_string()? Isn't that already a string?

To answer and understand the reasoning behind these questions we have to first understand string literals in Rust.

String Literals

String literals are sequences of characters defined directly in your source code, like "hello". They are immutable and embedded in the program's binary, specifically in a read-only section.

So when you write let greeting = "hello world"; and compile your program, the string literal "hello world" is saved in the binary at compile time. This means that as soon as your program is compiled, "hello world" is already placed in the read-only section of the compiled binary. This happens independently of any variables that might refer to it.

Now when the compiled code actually runs (with the "hello world" literal already stored in it), greeting is initialized at runtime as a borrowed reference (&'static str) to that string literal. The reference greeting points to the memory location within the read-only section of the binary where "hello world" is stored.

This explains how greeting becomes a borrowed reference to a pre-existing string even though we just created the variable: the string was already stored at compile time, and because a string literal is immutable there is no point in making a copy of it, so greeting refers directly to the data in the binary.

This also explains why we need .to_string() to create a mutable string: the string literal lives in the read-only part of the binary, so to get a string that can be modified we call .to_string(), which copies the literal into memory allocated on the heap. That copy can change size (grow or shrink) as needed during runtime.

Based on this explanation I try to think of string literals in Rust as the base form of string data, which depending on how you use it can either remain an immutable str or be 'converted' into a mutable String.

Do share some scenarios in the comments where &str would be more beneficial than a String. I know that if I use &str it just refers to existing data instead of creating a copy, but other than that, is there any other benefit to using &str?

Now let's discuss how this transformation of a string literal to either &str or String happens.

Transformation

Immutable &str

When you directly use a string literal in your Rust code, such as let greeting = "Hello, world!";, greeting is an immutable &'static str.

This means it's a borrowed reference to a string slice with a 'static lifetime, pointing to data embedded in the read-only section of the binary.

This form is efficient for read-only operations, passing around string data without taking ownership, and for use cases where the string data does not need to change.

Mutable String

If you need a mutable, growable version of the string data, you can convert the string literal into a String by using methods like .to_string() or String::from().

For example, let mut greeting = "Hello, world!".to_string(); creates a String from the string literal.
The String type is heap-allocated, growable, and mutable. It allows you to modify the string data, such as appending text or changing characters.

Converting a string literal to a String involves copying the data from the binary's read-only section into dynamically allocated memory on the heap. This operation gives you full ownership and control over the copied data, including the ability to modify it.

Working under the hood

Memory Allocation: The original string literal remains in the read-only segment of your program's binary, untouched. The String object involves a separate allocation of memory on the heap, where the contents of the string literal are copied.

Data Duplication: This means you now have two copies of the "Hello, world!" string data: one embedded in the read-only memory as part of the program's binary (the original string literal), and another stored in the heap memory as a String object (the result of .to_string()).

Mutability and Ownership: The key difference between these two is that the &'static str reference to the string literal is immutable and has a static lifetime, while the String object is mutable, growable, and owned by the greeting variable. This ownership comes with the Rust guarantees of memory safety, ensuring that the heap memory will be properly deallocated when greeting goes out of scope or is no longer needed.

So now that we understand that the mutable "Hello, world!" is owned by the greeting variable, who owns the string literal?

String literals do not have an "owner" in the traditional sense used for heap-allocated memory in Rust. Instead, they are baked into the program's binary and loaded into memory as part of the program's execution context. They are immutable and globally accessible anywhere in the program.

These were some points that piqued my interest while understanding strings in Rust. Do let me know if I missed something related to strings or if I miscommunicated some concept.

Thanks for reading this far 😁.

What is Transpilation?

Samyak Jain — Tue, 30 Jan 2024 19:17:04 +0000

Introduction

So there was this ruckus in my head about words like Babel, Webpack, and transpilation, words that you hear a lot if you are a web developer.

The kind of words that make you pause and wonder, "Am I supposed to know what these mean?" I mean, yeah, you can skip them and it won't really hurt your development, but it's good to know what's happening underneath.

Today we'll try to understand these words. No fancy jargon, just plain talk about what these things actually are.

Transpilation

As programmers, we like to use the most recent features. I mean, who doesn't like using the spread operator, a feature introduced in ES6? But on the other hand, how do we make our modern code work on older engines that don't understand ES6, JSX, or TypeScript?

Browsers' JS engines are not built to understand all of this. They only understand JavaScript, and yes, most browsers these days do support ES6, but that still does not solve the problem of them not understanding JSX or TypeScript. That's where transpilation comes in.

Transpilation, short for "source-to-source compilation," is the process of converting source code written in one programming language to equivalent code in another language or another version of the same language.

Source-to-source compilation? We are talking about transpilation, right? Where did this compilation come from all of a sudden?

Well, transpilation is a subset of compilation, and there is a lot of debate about whether there should even be a separate word for this process.

After reading a lot of that debate, my opinion is that "transpilation" is a word of choice. You like to have some distinction? Call the conversion of code from one language to another language, or to a different version of the same language, transpilation. You think it's too similar to deserve a different name? Call it a kind of compilation.

Even Babel, the most widely used transpilation tool, describes itself as a compiler.

Now that we have made that clear, let's continue talking about transpilation.

So by now we understand what a transpiler does: it takes my source code and converts it into JavaScript that the JS engine can run. But how does it do it?

If you know how a JS engine works this will be a little easier for you. For everyone else, do not worry.

The process typically involves the following key steps:

Lexical Analysis
The transpiler starts by breaking down the source code into individual tokens, such as keywords, operators, and identifiers.
This phase is known as lexical analysis and involves creating a stream of tokens from the source code.

Here is an example -
let x = 10 + 5;

Its tokenized form will look like - "let", "x", "=", "10", "+", "5", ";"

Abstract Syntax Tree (AST) Creation
The stream of tokens is then used to build an Abstract Syntax Tree (AST). The AST represents the hierarchical structure of the code, making it easier to analyze and manipulate.

Transformation
The transpiler applies transformations to the AST. This step involves converting code written in the source language (e.g., ES6) into an equivalent representation in the target language (e.g., ES5).

Code Generation
The transformed AST is used to generate the final transpiled code. This code is designed to be compatible with the target environment or runtime.
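To make the last three steps a bit more concrete, here is a minimal sketch. It starts from a hand-written AST for let x = 10 + 5; (the node shapes loosely follow the ESTree convention that many JS tools use), swaps let for var, and prints the code back out. The transform and generate functions are names I made up for this illustration, and a real transpiler like Babel does far more than this (scoping, for one).

```javascript
// Hand-written AST for: let x = 10 + 5;
const ast = {
  type: "VariableDeclaration",
  kind: "let",
  declarations: [
    {
      type: "VariableDeclarator",
      id: { type: "Identifier", name: "x" },
      init: {
        type: "BinaryExpression",
        operator: "+",
        left: { type: "Literal", value: 10 },
        right: { type: "Literal", value: 5 },
      },
    },
  ],
};

// Transformation: rewrite ES6 declaration kinds (let/const) to var.
// This naive version ignores block scoping on purpose.
function transform(node) {
  if (node.type === "VariableDeclaration" && node.kind !== "var") {
    return { ...node, kind: "var" };
  }
  return node;
}

// Code generation: walk the tree and build the output source text.
function generate(node) {
  switch (node.type) {
    case "VariableDeclaration":
      return node.kind + " " + node.declarations.map((d) => generate(d)).join(", ") + ";";
    case "VariableDeclarator":
      return generate(node.id) + (node.init ? " = " + generate(node.init) : "");
    case "BinaryExpression":
      return generate(node.left) + " " + node.operator + " " + generate(node.right);
    case "Identifier":
      return node.name;
    case "Literal":
      return String(node.value);
    default:
      throw new Error("Unsupported node type: " + node.type);
  }
}

const output = generate(transform(ast));
console.log(output); // var x = 10 + 5;
```

Notice that the transformation only touches the tree, never the text. That is the whole reason for building an AST first.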

Now, this is a very high-level view of how transpilation works; real transpilers do a lot more work inside each of these steps.

Transpilation of `let` and `const`

Ok, we know how transpilation works, but while reading about this a question struck me: what about let and const? These keywords were introduced in ES6 and were not there in ES5, so I checked how this was being handled and I got to know that the transpiler converts let and const to var.

What?? Converting let and const to var? Then what about all the concepts of hoisting and block scope? Don't worry, I was really confused too. After searching a bit more, I tried a website which lets you write ES6 code and see the ES5 equivalent of it.

So I went there and I tried this snippet -

if(true){
  let a = 1
  console.log(a)
}
console.log(a)

well, the output was interesting -

if (true) {
  var _a = 1;
  console.log(_a);
}
console.log(a);

It recreated the behavior of let even with var: it smartly renamed the a inside the if block to _a, so the a referenced outside the if block does not pick up the inner variable.
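Here is a small sketch of why the rename matters. The three functions below are hypothetical names for this illustration: the first keeps let, the second swaps let for var without renaming, and the third renames the variable the way the converter's output did. I use typeof so that the missing variable gives "undefined" instead of throwing a ReferenceError.

```javascript
// Original ES6: `a` only exists inside the if block.
function originalWithLet() {
  if (true) {
    let a = 1;
  }
  return typeof a;
}

// Naive conversion: var is function-scoped, so `a` leaks out of the block.
function naiveConversion() {
  if (true) {
    var a = 1;
  }
  return typeof a;
}

// Conversion with a rename: nothing called `a` exists outside the block.
function renamedConversion() {
  if (true) {
    var _a = 1;
  }
  return typeof a;
}

console.log(originalWithLet());   // undefined
console.log(naiveConversion());   // number
console.log(renamedConversion()); // undefined
```

So the underscore is not cosmetic: without it, the converted code would behave differently from the original.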

Play around with this, try different things, and see what happens, let me know in the comments what you tried.

PolyFill

Ok, so now we know how transpilation works, cool cool cool. But transpilation basically translates the code, right? What about features that don't even exist in the previous version? Take the includes method as an example: it was added to strings in ES6 and to arrays in ES2016, and it didn't exist at all in ES5. So how do we translate something that does not even exist there?

You can try it in that ES6 to ES5 converter, and it will give you back the same code that you typed in.

That's where polyfilling comes in. The official definition of a polyfill according to MDN is -

A polyfill is a piece of code (usually JavaScript on the Web) used to provide modern functionality on older browsers that do not natively support it.

Now how do you do polyfilling? Is it a piece of software? Well, after reading [Remy Sharp's blog](https://remysharp.com/2010/10/08/what-is-a-polyfill) (he is the person who introduced the term) and some polyfill implementations, I understood that polyfilling is not a tool. It's more of a technique that you apply in your code to increase the compatibility of your application.
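To show what that technique looks like in practice, here is a simplified sketch of a polyfill for the array includes method. The name includesFallback is mine, and this is not the exact code any real polyfill library ships (those follow the spec more strictly), but the pattern is the usual one: check whether the feature exists, and only add your own version if it doesn't. It is written in ES5-style syntax on purpose, since it has to run on the old engine.

```javascript
// Simplified fallback for Array.prototype.includes, in ES5-style syntax.
function includesFallback(searchElement, fromIndex) {
  var list = Object(this);
  var length = list.length >>> 0;
  var start = fromIndex | 0; // undefined becomes 0
  if (start < 0) {
    start = Math.max(length + start, 0); // a negative index counts from the end
  }
  for (var i = start; i < length; i++) {
    var item = list[i];
    // The second check treats NaN as equal to NaN, like the native method does.
    if (item === searchElement || (item !== item && searchElement !== searchElement)) {
      return true;
    }
  }
  return false;
}

// The polyfill pattern: install the fallback only when the feature is missing.
if (!Array.prototype.includes) {
  Object.defineProperty(Array.prototype, "includes", {
    value: includesFallback,
    writable: true,
    configurable: true
  });
}

console.log(includesFallback.call(["a", "b"], "b")); // true
console.log(includesFallback.call(["a", "b"], "z")); // false
```

On a modern engine the if block is skipped and the native method is used, which is exactly what you want from a polyfill.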

Here is an analogy to help you understand the relationship between traditional transpilation and polyfilling -

Imagine that you got your hands on a time machine and you time travelled to ancient times, excited to talk to the people there, you forgot that the language is different and they won't understand what you are saying, so what will you do?

You will transpile your language to their language so that they can understand it. Now you want to talk more with these people and want to allow even communication from a distance, you thought of using the phone, but wait that did not exist at that time, how will you even translate a thing that did not even exist at that time?

Well that's where poly filling can help you to bridge the gap, you collected two cups and joined them using a long string, and now you have kind of a phone for long-distance communication.

Now I am still a little unsure whether this is the right way to explain it. I mean, let and const also don't exist in ES5, and they were translated, right?

What I think is that transpilers handle new syntax by finding equivalent constructs, while polyfills are used when entirely new functionality is introduced. The concept of a variable was already there in ES5, but a built-in method that tells you whether an element exists in an array never existed there.

Let me know in the comments if you have a better explanation to address this.

How does Transpilation happen in our Projects?

Well, before reading about this topic I didn't even know that this was happening, so there is no way I could have configured Babel myself in my React project or in any other framework or library.

It turns out that in popular frameworks and libraries like React, Angular, etc., this comes pre-configured. Taking React as an example, devs normally create a React environment in one of two ways -

CRA (Create react app)
Using Vite

Now CRA uses Webpack under the hood, which is a bundler. And the other way I mentioned, Vite, is a bundler as well. We won't go into much depth about bundlers in this blog, but I'll give you an overview since it is necessary to understand: bundlers like Webpack or Vite are tools that merge two or more JS files into a single bundled file.

But this is not all that these bundlers do. They also take care of transpiling for us, so when you run npm run build in your terminal, it creates optimized, bundled, and transpiled code for you to deploy. So you never get to know what happened.

The need for transpilation is greater in these frameworks and libraries than in vanilla JS projects. A vanilla JS project only has the problem of 'what if someone runs this in an ES5-only browser', and that's rarely the case anymore; it's really rare to find a browser that does not support ES6.

But code written for a library like React uses JSX, and it needs to be transpiled every time because the browser has no clue about JSX.

Now there was one last thing that I was confused about, My confusion was that

'ok I ran npm run build and my code got transpiled and all good now browser will have no problem understanding my JSX.'
But what about during development?

I had questions like -

Is it getting re-transpiled on the go?
If I remove one semicolon from my code the whole code gets retranspiled?

Well, if removing one semicolon triggered a full re-transpilation, developing a web app wouldn't be easy. If you are maintaining a huge codebase where you are continuously making changes and you had to wait 30 seconds to a minute after each change, your developer experience and your productivity would both be very bad.

That's where bundlers like Webpack and Vite show their magic. These bundlers have something called HMR (Hot Module Replacement).

It smartly re-transpiles only the modules that actually changed and not the entire application. Once the changed modules are re-transpiled, the new code is injected into the running application on the fly.

How does it 'inject' code into my application on the fly? Well, when our frontend dev server starts, a WebSocket connection is created between the dev server and the browser, and when something changes, that information is sent to the browser over the WebSocket.

Yeah!! Sounds really interesting, right? And unbelievable, right? I mean, I thought I at least knew my front end. I thought I knew everything happening in my front end and how it was happening. But here we are, a WebSocket connection is running behind my front end and I didn't even know.

I wanted to test this, so I tried turning the HMR feature off through the webpack config. To compare, I selected all of the text on the page, changed one line, saved it, and this was the result.

HMR ON -

You see how the text changed and neither the page reloaded nor the text selection went away.

HMR OFF

Here you see the page reloaded, so there is no HMR and the whole transpilation is being done again from scratch.

How can you try this?

To test this, create an app with Create React App and then run npm run eject. After this you will be able to see the config files, which contain the webpack configuration. In the webpack.config.js file, change the variable shouldUseReactRefresh to false, restart the server, and try changing something.

Let me know in the comments what happened.

We will talk about the working of HMR and the frontend server in another blog. It's a big topic and needs its own blog.

So there it is. We understood the fancy and technical words like Webpack, Babel, and transpilation, and learned some more things on the way. We talked about how transpilation works, and how a newbie developer like me or someone else doesn't even need to know this stuff to run a project. And much more.

Thanks for reading this far😁.

Are Random numbers really random?

Samyak Jain — Mon, 22 Jan 2024 09:19:01 +0000

So recently a thought struck me that we use random functions in our software all the time right? Either in games or in machine learning models, cryptographic software, etc.

But are these "randomly" generated numbers actually random? I mean, there must be some algorithm that calculates this so-called random number, right? And if it's calculated, it's not really "random" now, is it?

So today I'll try to explain how random a random number really is in programming languages.

So the random functions built into programming languages, like Math.random() in Javascript, are not "true" random number generators. They are called PRNGs (Pseudo Random Number Generators).

The word pseudo is self-explanatory: these generators are not truly random. So the question that arises is - okay, random number generators are not truly random, but then to what extent are these numbers random?

This is the question that measures the quality of a PRNG.

There are many algorithms that are used to generate random numbers, but what all of them have in common is a Period and a Seed.

Seed

A seed is an initial value that is passed to the random function to start generating random numbers.
To understand how a seed works in a PRNG, let's take a look at an algorithm called LCG (Linear Congruential Generator).

LCG

For those familiar with random number generator algorithms, do you have a favorite? Whether it's LCG, Xorshift, or another algorithm, do share your preferences and experiences with different generators.

Back to the example of LCG

So the formula of an LCG is Xnext = (a * Xcurrent + c) mod m
where,
Xnext is the next number in the sequence of pseudo-random numbers
m, (m > 0) the modulus
a, (0 < a < m) the multiplier
c, (0 <= c < m) the increment
Xcurrent, (0 <= Xcurrent < m) the current seed

So once the initial seed is selected, it is passed into this equation as Xcurrent. Xcurrent is the only thing that varies; everything else is a constant. Once the answer is generated, that answer is used as the seed the next time, and so on.

Here is an example of how it looks

Let's consider an LCG with a specific set of constants:

a = 7,
c = 5,
m = 32,
and initial seed as 3

so
1. X = (7*3 + 5) mod 32 = 26 => X = 26
2. X = (7*26 + 5) mod 32 = 27 => X = 27
3. X = (7*27 + 5) mod 32 = 2 => X = 2
4. X = (7*2 + 5) mod 32 = 19 => X = 19
5. X = (7*19 + 5) mod 32 = 10 => X = 10
6. X = (7*10 + 5) mod 32 = 11 => X = 11
7. X = (7*11 + 5) mod 32 = 18 => X = 18
8. X = (7*18 + 5) mod 32 = 3 => X = 3
9. X = (7*3 + 5) mod 32 = 26 => X = 26
10. X = (7*26 + 5) mod 32 = 27 => X = 27

This is how, once a seed is decided, each generated value is used as the seed to create the next random number. Now I am sure you must have noticed something strange in this sequence, right? After point 8 the sequence started to repeat: first 26, then 27, and so on.
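If you want to run this yourself instead of doing the arithmetic by hand, here is a minimal sketch of the same generator. The function name createLcg is just something I picked for this example; the constants are the ones used in this section (a = 7, c = 5, m = 32, seed 3).

```javascript
// Returns a function that produces the next value of the sequence on every call.
function createLcg(a, c, m, seed) {
  let current = seed;
  return function next() {
    // The previous output becomes the input for the next step.
    current = (a * current + c) % m;
    return current;
  };
}

const next = createLcg(7, 5, 32, 3);
const values = [];
for (let i = 0; i < 10; i++) {
  values.push(next());
}
console.log(values.join(", "));
```

Try changing the seed or the constants and watch how quickly (or slowly) the values start repeating.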

Period

So there are multiple algorithms to generate a PRN, like LCG or Xorshift128+ (used by the V8 engine), but what all of them have in common is that there is a limit to how many random numbers they can produce before they start to repeat the sequence. The length of the sequence before it repeats is what is known as the Period. Every PRNG repeats after some point, and how long the sequence runs before the numbers start to repeat is one measure of the quality of the PRNG.

Why is there a Period?

But why does this happen? Why do numbers start to repeat?

It might sound like a silly question, but it was bugging me, so I tried to understand this.

What I understood is that this happens because computers have a limit on the numbers they can represent. Sure, there are infinitely many numbers, but a computer can only work with a fixed range; for example, a 64-bit signed integer can only hold values from −9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, so the generator needs to keep its numbers within that limit.

So algorithms try to keep this number in range. In an LCG, after the multiplication and addition on the seed, a modulus is applied so that the number does not keep growing to the point that the computer can't represent it anymore.

And if there are only finitely many numbers that the algorithm can produce, then after a while it's inevitable that a number which has already appeared will be generated again, and since every number is calculated only from the previous one, the whole sequence repeats from that point on.
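This argument can be turned into a few lines of code. The sketch below (findPeriod is a name I made up) keeps stepping an LCG and remembers every value it has seen; the moment a value comes back, the distance to its first appearance is the period. The constants in the usage line are hypothetical ones picked only for illustration.

```javascript
// Steps an LCG until a value repeats and returns the length of the cycle.
// The loop always ends because there are only m possible values.
function findPeriod(a, c, m, seed) {
  const firstSeenAt = new Map(); // value -> step at which it first appeared
  let current = seed;
  let step = 0;
  while (!firstSeenAt.has(current)) {
    firstSeenAt.set(current, step);
    current = (a * current + c) % m;
    step += 1;
  }
  return step - firstSeenAt.get(current);
}

console.log(findPeriod(5, 3, 16, 0)); // 16
```

With m = 16 there are only 16 possible values, so 16 is the longest period this generator could ever have. A bigger modulus leaves room for a longer period, but only if the other constants are chosen well.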

So yeah, if you know the seed (and the algorithm) you can predict the whole sequence.
Here is a replit link for you to try: I have set a seed and run a loop for 10 values, and no matter how many times you run the loop it will give the same result.
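The same experiment can be sketched directly in JavaScript. Math.random() does not let you set a seed, so this example uses a small LCG of its own; createSeededRandom and firstTen are hypothetical names, and the constants are a commonly cited LCG parameter set (a = 1664525, c = 1013904223, m = 2^32), not what any particular engine uses.

```javascript
// A tiny seedable generator that returns floats in [0, 1), like Math.random().
function createSeededRandom(seed) {
  const a = 1664525;
  const c = 1013904223;
  const m = 2 ** 32;
  let state = seed >>> 0; // force the seed into the 32-bit unsigned range
  return function random() {
    state = (a * state + c) % m;
    return state / m;
  };
}

// Collects the first 10 values produced for a given seed.
function firstTen(seed) {
  const random = createSeededRandom(seed);
  return Array.from({ length: 10 }, () => random());
}

const runA = firstTen(42);
const runB = firstTen(42);
const runC = firstTen(43);

console.log(runA.join() === runB.join()); // true, same seed gives the same sequence
console.log(runA.join() === runC.join()); // false, a different seed gives a different sequence
```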

Now if you are a Minecraft player you must have heard "seed" before as well. When you enter a seed that is shared by someone else you also get spawned in the same area as them. Why? Because Minecraft uses a random generator to create an area, and if it's random then it also has some seed, and if you have that same seed then yes you will get the same output.

Have you ever played a game that allowed you to input a 'seed' for generating the game world? Share your experience and any interesting outcomes!

While talking about PRNGs, it is important to note that general-purpose PRNGs like Math.random() are not cryptographically secure, as also mentioned in the V8 docs.

They are not recommended for hashing, signature generation, and encryption/decryption.

For those purposes you can use window.crypto.getRandomValues, "but" at the cost of performance.

So why do we use PRNGs if they have so many flaws?

I mean, you know the seed and voilà, you know the sequence. If it's not cryptographically secure, then where is it used?

Applications of PRNGs

Well, PRNGs are very efficient and fast, which makes them very useful in games. Say you want to spawn the player at a random place and spawn some buildings randomly: you don't need that to be truly random, you need it to be efficient, because games have to render everything fast.

If we go back to the example of Minecraft, Minecraft does not create the whole environment as soon as you enter in the game, because if that had been the case there would have been a great toll on the game and a huge storage space used by the game. Instead, it creates environments on the fly as players start to move in a direction.

Minecraft features randomly generated structures such as villages, temples, and dungeons. The placement of these structures adds surprises to the landscape, and their designs vary based on procedural algorithms.

Advantages of Procedural Generation:

Scalability: Minecraft's procedural generation allows the game to scale infinitely without the need for massive storage space for pre-designed maps.

Exploration: As players traverse the world, new chunks of terrain are generated dynamically, providing a sense of discovery and unpredictability.

Replayability: Different seed values and the randomness in terrain generation enhance the replayability of the game, as players can experience entirely new worlds with each playthrough.

Impact on Performance:

Generating terrain in real-time as players explore reduces the initial loading time and minimizes the demand on system resources compared to loading an entire pre-designed world.

PRNGs are also used for debugging, where engineers can pre-define the seed and can run iterations on a fixed sequence to run some tests.

Truly Random Number Generators are great but they come at the cost of performance so it's up to us to decide the tradeoff. Overall it was a great topic to read and learn about.

As a developer, how often do you use random number generators in your projects? Do you have any favorite algorithms or tips for ensuring randomness in your applications?

In conclusion, while pseudorandom number generators (PRNGs) play a crucial role in various applications, from gaming to simulations, understanding their limitations is equally important. The predictable nature of PRNGs, marked by periods and seed-dependent sequences, raises questions about their suitability for cryptographic purposes. As developers, we navigate the trade-off between efficiency and true randomness, opting for PRNGs in scenarios where rapid, reproducible results are paramount. Reflect on the challenges posed by the inherent predictability of PRNGs and consider: How do we balance the convenience of PRNGs with the need for cryptographic security in our applications?

If there is anything that I missed about PRNGs that could have made the blog better or you have any questions related to it please let me know in the comments.

Thanks for reading this far😁.

What is a Javascript Engine?

Samyak Jain — Wed, 17 Jan 2024 07:55:04 +0000

Ever wondered what happens when you write Javascript code and run it, either in your browser or in a runtime like Node? I mean, when you run it with HTML and CSS in the browser and console log something, you see your logs in the browser console, but when you console log in a runtime like Node, you see them in the terminal?

Why are there some APIs, like localStorage or Canvas, that you can use while writing frontend code but not while writing backend code in Nodejs? I mean, both of them are using Javascript, right?

Spoiler - It's because of the Javascript engine and the environment it is embedded in.

So, today in this blog we are going to understand

what is a Javascript engine?
What is its purpose?
How does it work?
How Javascript runs on both client-side and server-side

What is a Javascript Engine?

So a Javascript engine is basically a program that converts Javascript code into machine-understandable code and executes it.
Different browsers use different Javascript engines, for example -
Chrome - V8 Engine
Firefox - SpiderMonkey
Safari - JavaScriptCore
... and many more

Let's understand the workings of the Javascript engine by taking the V8 engine as the base.

So we know that computers only understand binary i.e. 0's and 1's right? So we have to follow some steps before our computer can understand our code written in JS just like any other language.

So there are multiple steps before the source code written by a Developer is ready to be machine understandable. Let's go to each step and understand what's happening there.

Tokenization

When you write JavaScript code, the first step in the journey from source code to executable instructions is tokenization.

Tokenization, also known as lexical analysis, involves breaking down the source code into smaller units called tokens.
Tokens are the building blocks of a programming language and represent individual elements such as keywords, identifiers, operators, and literals.

Here is an example -
let x = 10 + 5;

Its tokenized form will look like - "let", "x", "=", "10", "+", "5", ";"
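To give a feel for this step, here is a toy tokenizer. It is only a sketch: the tokenize function and the short keyword list are my own simplifications, and a real engine's scanner handles strings, comments, multi-character operators, and much more.

```javascript
// A deliberately tiny keyword list, just enough for the example.
const KEYWORDS = new Set(["let", "const", "var", "if", "else", "function", "return"]);

function tokenize(source) {
  // Words, whole numbers, or any single non-space symbol.
  const pattern = /[A-Za-z_$][\w$]*|\d+|[^\s\w]/g;
  return (source.match(pattern) || []).map((value) => {
    let type = "punctuator";
    if (KEYWORDS.has(value)) type = "keyword";
    else if (/^[A-Za-z_$]/.test(value)) type = "identifier";
    else if (/^\d/.test(value)) type = "number";
    return { type, value };
  });
}

const tokens = tokenize("let x = 10 + 5;");
console.log(tokens.map((t) => t.value)); // [ 'let', 'x', '=', '10', '+', '5', ';' ]
console.log(tokens[0]); // { type: 'keyword', value: 'let' }
```

Each token carries a type along with its text, and that type is what the parser relies on in the next step.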

Once the given code is tokenized then the parser converts that code into an AST (Abstract Syntax Tree)

The AST (Abstract Syntax Tree) is like a map of your code's structure.

Imagine your code as a story. The AST breaks down this story into smaller pieces, like sentences and paragraphs, called nodes. Each node represents different parts of your code, like statements, expressions, or functions.

Semantic Analysis is like checking the grammar and meaning of your story.

Once we have the map (AST), we want to make sure the story makes sense. Semantic analysis is like checking the grammar and meaning of the words and sentences. It helps us catch mistakes and makes sure the story follows the rules of the language it's written in.

So, in simpler terms, the AST helps us understand the structure of the code, and semantic analysis makes sure the code is well-written and follows the rules. It's like having a guide to understand the layout of your code and a proofreader to check if your code "speaks" correctly.
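Here is roughly what such a tree could look like for `let x = 10 + 5;`. The node names loosely follow the ESTree convention that many JavaScript tools use, but the tree is written by hand and simplified, and the helper names `collectNodeTypes` and `evaluate` are my own. Treat it as a sketch of the shape, not the engine's internal format.

```javascript
// Hand-written, simplified tree for `let x = 10 + 5;` (ESTree-style naming).
const ast = {
  type: "VariableDeclaration",
  kind: "let",
  declarations: [
    {
      type: "VariableDeclarator",
      id: { type: "Identifier", name: "x" },
      init: {
        type: "BinaryExpression",
        operator: "+",
        left: { type: "Literal", value: 10 },
        right: { type: "Literal", value: 5 },
      },
    },
  ],
};

// Walk the tree depth first and collect the type of every node.
function collectNodeTypes(node, found = []) {
  if (node === null || typeof node !== "object") return found;
  if (Array.isArray(node)) {
    node.forEach((child) => collectNodeTypes(child, found));
    return found;
  }
  if (typeof node.type === "string") found.push(node.type);
  for (const value of Object.values(node)) collectNodeTypes(value, found);
  return found;
}

// Evaluate the small expression subset used in this example.
function evaluate(node) {
  if (node.type === "Literal") return node.value;
  if (node.type === "BinaryExpression" && node.operator === "+") {
    return evaluate(node.left) + evaluate(node.right);
  }
  throw new Error(`Unsupported node type: ${node.type}`);
}

console.log(collectNodeTypes(ast));
console.log(evaluate(ast.declarations[0].init)); // 15
```

Notice how the nesting of the objects mirrors the structure of the statement: the declaration contains a declarator, which contains the `+` expression, which contains the two literals.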

So once the AST is ready the next step that comes into play is generation of bytecode.

Bytecode

So what is this bytecode? Bytecode is an abstraction of machine code: a compact, intermediate set of instructions that is not tied to any particular CPU. It is part of what lets the same JavaScript run on all platforms like Windows, Android, macOS, etc. The bytecode itself does not depend on the system architecture; the engine installed on each machine takes care of turning it into instructions for that architecture. What does this mean?

So whenever we download an application on our PC, there are usually 2 versions available for that software: one for 64-bit architecture and the other for 32-bit architecture, right? But why is this done? Because languages like C++ are compiled languages, which means the whole code is compiled first and an executable file is created that runs your code. The executable that comes out of compilation is specific to one machine architecture, meaning it only runs on (and is optimized for) that architecture. That's why a lot of software ships more than one version of the same program.

But we never had this problem when opening a website on any system, right? We just type in the name of the site and voilà, it opens up.

This is because Javascript is platform independent, and what makes it platform independent? Bytecode

Bytecode is generated in the JIT (Just in Time) compilation.

So if I have to give a short analogy

Bytecode is like a translator who knows all the languages (meaning different architectures) and it helps the JIT (tourist) understand the place where it is (Different platforms), but the translator is always the tourist's same friend, just that it knows a lot of languages.

(If you don't understand JIT yet don't worry I'll explain it as well)

Bytecode as a Translator:
Bytecode acts as an intermediate representation or a translator that is knowledgeable about different architectures or platforms. It is designed to be platform-independent, like a translator who understands various languages.

JIT Compiler as a Tourist:
The JIT compiler is a tourist who wants to explore a new place (execute code on a specific device or architecture). The tourist, or JIT compiler, relies on the translator (bytecode) to guide and communicate with the locals (native machine code) effectively.

Portability of Bytecode:
The translator (bytecode) is versatile and can assist the tourist (JIT compiler) in any location, making the code portable across different architectures. The tourist doesn't need to learn every local language; instead, they communicate through the translator.
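If the analogy feels abstract, here is a tiny made-up bytecode for `let x = 10 + 5;` together with a loop that executes it. The instruction names (`PUSH`, `ADD`, `STORE`) and the stack-based design are invented for illustration; V8's real bytecode is register-based and looks quite different. The point is only that the same instruction list can run anywhere this small `run` function exists.

```javascript
// Hypothetical instruction set, invented for this example.
const program = [
  ["PUSH", 10],   // put 10 on the stack
  ["PUSH", 5],    // put 5 on the stack
  ["ADD"],        // pop two values, push their sum
  ["STORE", "x"], // pop the result into variable x
];

// A minimal interpreter loop for the instructions above.
function run(instructions) {
  const stack = [];
  const variables = {};
  for (const [opcode, operand] of instructions) {
    switch (opcode) {
      case "PUSH":
        stack.push(operand);
        break;
      case "ADD": {
        const right = stack.pop();
        const left = stack.pop();
        stack.push(left + right);
        break;
      }
      case "STORE":
        variables[operand] = stack.pop();
        break;
      default:
        throw new Error(`Unknown opcode: ${opcode}`);
    }
  }
  return variables;
}

console.log(run(program)); // { x: 15 }
```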

JIT compilation

Compilation? Compiler? So JavaScript is a compiled language? No, it's not. So is it an interpreted language? Haha, no.

Javascript is a "Just-In-Time (JIT) compiled" language, what does that even mean?

So we know how compiled languages and interpreted languages work right?

Compiled Languages
A compiled language will take the whole source and convert it to machine code at once and will return an executable file. During this phase the user can do nothing but wait.

Slower Start-Up Time - The first run is slow because the whole source code has to be compiled before anything can execute, but later executions are fast because the machine code already exists in the executable file and we can just run it.

Platform Dependency: The compiled code is often platform-specific, requiring different binaries for different architectures.

Interpreted Language
An Interpreted language will read the code line by line and execute it line by line.

Slower Execution: Interpreted code tends to run slower than compiled code since the interpreter must translate each line of code on the fly.
No Static Optimization: Interpreters cannot perform extensive static optimizations before execution.

JIT takes the best of both worlds. How?
JIT is comprised of both a compiler and an interpreter. When JS code first comes in contact with the JS engine, it starts with the interpreter. Why? So that execution can get started right away and the user does not need to wait.

While it is running the code, the engine also keeps track of which functions or parts of the code are being called again and again and marks them as hot paths. Once enough data is collected to call a function a hot path, the compiler kicks in, takes the bytecode, compiles it into optimized machine code, and caches it.

So the next time the same function is called, the interpreter does not read it again; the cached, pre-compiled code is executed directly.

There is also a fallback system in the compiler (known as deoptimization): if the assumptions the cached code was compiled under no longer hold, that cache is not used and the interpreter executes the code again.

What are these changes that are detected?
So you know how sometimes we have a function that takes in a value. Now let's say that until the hot path was created, an integer was always being passed to that function, but suddenly a string is passed in. The optimized code was compiled on the assumption that the argument is an integer, so once the data type changes, that optimized code can no longer be used. This is also why it is advised to keep the types in your code consistent and not change them again and again, because otherwise the JIT won't be able to keep an optimized version around.
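The whole flow (interpret first, count calls, switch to a fast version, fall back when the types change) can be imitated in plain JavaScript. This is only a toy model: the threshold of 3 calls, the `createTieredFunction` helper, and the `guard`/`fast` pair are all made up for illustration, and a real engine makes these decisions in a far more sophisticated way.

```javascript
// Toy model of tiered execution; every name and number here is hypothetical.
const HOT_THRESHOLD = 3;

function createTieredFunction(genericImplementation, specialize) {
  let callCount = 0;
  let optimized = null; // will hold { guard, fast } once the function is hot
  const log = [];

  function call(...args) {
    if (optimized !== null) {
      if (optimized.guard(...args)) {
        log.push("optimized");
        return optimized.fast(...args);
      }
      // The assumption no longer holds: discard the fast version and fall back.
      optimized = null;
      callCount = 0;
      log.push("deoptimized");
      return genericImplementation(...args);
    }

    callCount += 1;
    log.push("interpreted");
    const result = genericImplementation(...args);
    if (callCount >= HOT_THRESHOLD) {
      optimized = specialize(); // the function is now a "hot path"
    }
    return result;
  }

  return { call, log };
}

const add = createTieredFunction(
  (a, b) => a + b,
  () => ({
    guard: (a, b) => Number.isInteger(a) && Number.isInteger(b),
    fast: (a, b) => a + b, // stands in for integer-only machine code
  })
);

add.call(1, 2);
add.call(3, 4);
add.call(5, 6);   // third call: the function becomes hot
add.call(7, 8);   // served by the "optimized" version
add.call("7", 8); // a string shows up, so the guard fails
console.log(add.log);
// [ 'interpreted', 'interpreted', 'interpreted', 'optimized', 'deoptimized' ]
```

The last call is the situation described above: the fast version was built for integers, so a string argument forces a fall back to the slower generic path.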

In the V8 engine, the interpreter is known as Ignition, the compiler is known as TurboFan, and the system that tracks the hot paths is called the profiler.

Here is a diagram which explains the above process

Now that we understand the purpose of the JS engine and how it works, there is just one thing left: how does JS run on servers? Like Node.js?

So Node.js acts as a wrapper around V8, providing an additional layer of functionality and modules on top of the engine to enable server-side development.

Node.js includes its own event loop, which is responsible for managing asynchronous operations. This event loop allows Node.js to efficiently handle multiple concurrent connections without blocking the execution of code.

The wrapper provides abstractions for working with non-blocking I/O operations, making it easier for developers to write asynchronous code.
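A small experiment shows what "without blocking" means in practice. In this sketch a timer and a promise stand in for real I/O (an assumption made to keep the example short), and the `order` array is just my way of recording what ran when.

```javascript
const order = [];

order.push("sync start");

// A timer callback is queued and runs on a later turn of the event loop.
setTimeout(() => {
  order.push("timer callback");
  console.log(order);
  // [ 'sync start', 'sync end', 'promise callback', 'timer callback' ]
}, 0);

// A promise callback runs as a microtask, right after the current
// synchronous code has finished and before any timer callback.
Promise.resolve().then(() => order.push("promise callback"));

order.push("sync end");
```

Even though the timer and the promise were registered before "sync end", the synchronous code was not held up waiting for them. That is the same idea that lets one Node.js process keep serving other requests while an I/O operation is still pending.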

Node.js includes a set of built-in modules and APIs for common server-side tasks, such as file system operations, networking, HTTP handling, and more.
These built-in modules, like fs for file system operations and http for creating HTTP servers, are part of the wrapper that Node.js provides.

Here is a summary of what we understood today

JavaScript Engine Fundamentals:
A JavaScript engine, such as V8, serves as the powerhouse behind code execution, translating JavaScript code into machine-readable instructions.

Browser Environment:
In the browser, JavaScript interacts with APIs provided by the browser environment, including the DOM API, Web APIs, and more. This enables dynamic and interactive web pages.

Node.js on the Server:
Node.js acts as a wrapper around the V8 engine on the server side, extending JavaScript capabilities for server-side development. It introduces features like event-driven architecture, a CommonJS module system, and built-in APIs for server tasks.

Compilation Process:
The journey from source code to execution involves tokenization, AST creation, and semantic analysis. The Abstract Syntax Tree acts as a map, guiding the understanding of code structure.

Bytecode and JIT Compilation:
Bytecode serves as an intermediate representation, making JavaScript platform-independent. JIT compilation optimizes hot paths dynamically, combining the benefits of both compiled and interpreted languages.

Thanks for reading this far😁.

What is the Internet?

Samyak Jain — Fri, 29 Dec 2023 07:12:46 +0000

The Internet is something that everyone knows about these days, from any adult to any 2nd grader. It's such a common thing that we use every day, and yet we know so little about it. At least I was unaware of a lot of things regarding the internet:

What are its origins?
Where does it come from?
Is it Tangible?
How does my phone send data wirelessly using the Internet?
How is it traveling through thin air?
How does the Internet reach me?
How does my ISP help me in that?

These are some questions out of a long list of questions I had regarding the internet. So, through this blog, I will try to answer those questions and hope to clear at least some common doubts people have regarding it.

This is going to be a long blog because I am going to try to explain every aspect of it in detail.

Here is a list of topics covered in this blog:

Origins of Internet
NCP
Transition to TCP/IP architecture
IP Addressing
Classful addressing scheme
Local Host
Subnetting

Origins of Internet

I think the origin of the internet is the best topic to start with when understanding what the internet is.
So, the origins of the Internet can be traced back to the 1960s when the United States Department of Defense initiated a research project called ARPANET. So yeah, the first name of the Internet as we know it today was ARPANET.

So what was used before ARPANET was introduced? What problem did ARPANET solve?

Before ARPANET, communication networks were often centralized, meaning there was a single point of control or a small number of central hubs.
In a centralized system, if a central point failed or was targeted, the entire network could be disrupted.

ARPANET laid the foundation for the development of a decentralized and distributed network architecture.

ARPANET adopted a decentralized architecture where communication was distributed across multiple nodes (computers) rather than relying on a single central authority. Each node had a level of autonomy, and data could take multiple paths to reach its destination. This decentralized approach made the network more resilient and adaptable to failures.

ARPANET implemented packet-switching, a method in which data is broken down into smaller packets before transmission. Each packet could take different routes to reach its destination, where it would be reassembled.
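Packet switching is easy to imitate in a few lines. The packet shape used here (`sequence`, `total`, `payload`) and the 4-character payload size are assumptions for illustration, not the historical ARPANET packet format.

```javascript
// Split a message into numbered packets (hypothetical packet shape).
function toPackets(message, payloadSize) {
  const total = Math.ceil(message.length / payloadSize);
  const packets = [];
  for (let sequence = 0; sequence < total; sequence += 1) {
    const start = sequence * payloadSize;
    packets.push({
      sequence,
      total,
      payload: message.slice(start, start + payloadSize),
    });
  }
  return packets;
}

// Put the packets back in order and join their payloads.
function reassemble(receivedPackets) {
  return [...receivedPackets]
    .sort((a, b) => a.sequence - b.sequence)
    .map((packet) => packet.payload)
    .join("");
}

const packets = toPackets("HELLO ARPANET", 4);

// Different routes mean packets can arrive out of order.
const arrived = [packets[2], packets[0], packets[3], packets[1]];

console.log(reassemble(arrived)); // HELLO ARPANET
```

The sequence number is what makes the different routes harmless: the receiver does not care in which order the packets show up, only that it can sort them again.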

During the time of ARPANET, the protocol which was used for transferring data was 'NCP'

If I have to explain how NCP and ARPANET used to work together, I'd say: imagine ARPANET as a large postal system designed to connect different regions and facilitate the exchange of letters between individuals and organizations.

Now, imagine that NCP is like a standardized letter format and a set of protocols that ensure proper formatting, addressing, and delivery of letters within the postal system.

NCP provides the rules and conventions for how letters (data) should be packaged, addressed, and transmitted over the ARPANET. It specifies how to break down a long letter into manageable pieces (packets) and how to reassemble them at the destination.

NCP (Network control protocol)

The primary purpose of NCP was to facilitate communication between computers on the ARPANET. It provided a basic mechanism for sending and receiving messages between different hosts connected to the network.
Sounds something like another protocol that we use these days right? TCP/IP?

So if we already had a protocol that was used for communication then why did we transition to the TCP/IP architecture?

Transition to TCP/IP architecture

Although NCP was working at that time, there were many flaws in it.
I will explain one problem at a time and then the solution the TCP/IP architecture introduced for that problem.

1. Scalability Issues in Uniquely Identifying Devices
So NCP was also uniquely identifying devices but it used a flat addressing structure. Let's try to understand this with an analogy.

Imagine a city without a well-organized addressing system. In this city, each building has a unique address, but there's no logical or hierarchical structure to how the addresses are assigned. For instance:

Building 1 might be on Elm Street.
Building 2 could be on Maple Avenue.
Building 3 might be on Oak Lane.

Now, think of these addresses as the host addresses in the context of NCP. Each host on the ARPANET had a unique identifier, but these identifiers didn't follow a structured or hierarchical pattern.

As the city grows, new buildings and streets are added. Allocating addresses becomes a challenge because there's no efficient way to organize them. With no hierarchy, the city's post office has to maintain an exhaustive list of every address, and it becomes increasingly difficult to manage as the city expands.

IP Addressing

IP (Internet Protocol) suggested a Hierarchical structure instead of Flat Structure

Let's see an Analogy of that too -

Imagine that the city adopts a new addressing system inspired by ZIP codes. In this system:

ZIP Code 10000 represents a large district or neighborhood.
ZIP Code 10010 might represent a smaller area within that district.
ZIP Code 10011 could pinpoint an even more specific location.

In the world of IP, think of the ZIP codes as the network portion of the IP address. The hierarchy allows for efficient allocation and routing.

Currently, there are two versions of IP, i.e. IPv4 and IPv6. We will talk about IPv4 for now and come back to IPv6 later.

The idea, as we discussed, was to replace the flat structure with a hierarchical structure so the addresses are well organized. But that was not achieved instantly; IP addressing also evolved.

So when IP was first introduced, it was a 32-bit addressing scheme.
In this scheme, the 32-bit IP address was divided into two parts:
the Network Portion and the Host Portion. In this division the Network Portion had 1 octet (8 bits) and the rest were for hosts.

Now because only 8 bits were reserved for the Network Portion, only 256 networks were possible (2 raised to the power 8).

But it was soon realized that 256 networks alone wouldn't be enough: many small networks would want to join the ARPANET, and it would be a waste to give each of just 256 networks such a huge number of hosts.
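To see the 8-bit / 24-bit split in action, here is a small sketch that turns a dotted address into its 32-bit number and cuts it into the two portions. The helper names and the sample address 10.1.2.3 are my own choices for illustration.

```javascript
// Convert a dotted IPv4 address into an unsigned 32-bit number.
function ipv4ToInt(address) {
  const parts = address.split(".");
  const valid =
    parts.length === 4 &&
    parts.every((part) => /^\d{1,3}$/.test(part) && Number(part) <= 255);
  if (!valid) throw new RangeError(`Not a valid IPv4 address: ${address}`);
  const [a, b, c, d] = parts.map(Number);
  // ">>> 0" keeps the result unsigned after the bitwise operations.
  return ((a << 24) | (b << 16) | (c << 8) | d) >>> 0;
}

// Original scheme: first 8 bits are the network, remaining 24 bits the host.
function splitOriginalScheme(address) {
  const value = ipv4ToInt(address);
  return {
    network: value >>> 24,
    host: value & 0x00ffffff,
  };
}

console.log(splitOriginalScheme("10.1.2.3")); // { network: 10, host: 66051 }
console.log(2 ** 8);  // 256 possible networks
console.log(2 ** 24); // 16777216 possible host values per network
```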

Classful addressing scheme

To overcome this issue, the Classful addressing scheme was introduced.
In this scheme, IP addresses were divided into classes to allocate address space based on the size of the network.

Basically, all we have in IPv4 is a 32-bit IP address, which makes approximately 4.3 billion addresses. Now it's up to us how we use them efficiently.

That's where Classful addressing came in. Classful addressing suggested that instead of just making 256 networks as in the previous architecture, let's divide the IP address space according to use cases. This way, we can allocate an appropriate amount of addresses to people according to their needs.

There were three primary classes:

Class A (1.0.0.0 to 126.0.0.0):

In Class A, the first octet (the first 8 bits) was allocated for the network and the rest were allocated for hosts. Out of the first octet, the first bit was fixed as 0, which was the identifier for Class A addresses.
The remaining 7 bits were used to create networks, so there were 128 networks available and each network had approximately 16 million hosts.

Two of those 128 networks are not available for assignment: network 0 is reserved (the address 0.0.0.0 means "this network" rather than a real destination), and network 127 is reserved for loopback. That is why the usable range starts at 1 and ends at 126.

The IP address 127.0.0.1 is set aside as a loopback address for testing software. (Our good old LocalHost)

Let's talk a bit about this special use case of LocalHost

Local Host 127.0.0.1

So the loopback address 127.0.0.1 (localhost) is a reserved IP address used for testing and development.

So when a request is sent using the localhost address, it is sent to your own device as if it came from somewhere else.

The operating system's network stack recognizes the loopback address as a special case and ensures that the data never leaves the device.

While 127.0.0.1 is the most commonly used loopback address, the entire range 127.0.0.0 to 127.255.255.255 is reserved for loopback purposes. Other addresses within this range (e.g., 127.0.0.2, 127.0.0.3) can also be used for testing multiple loopback interfaces.

So when you write localhost:3000 you are actually hitting 127.0.0.1:3000, the term localhost actually works as a domain name to map 127.0.0.1, we will talk more about this mapping when we talk about DNS (Domain Name System).

While talking about localhost, there is one more component that gets added to it, which is a port number like 3000. Why is this port number added?

A port number is added to an IP address to tell apart the multiple services running on the one machine that the IP address identifies.

If you have worked with both the backend and frontend, you must have seen that your frontend may run on localhost:5173 and your backend may run on localhost:3000. The port tells your machine exactly which program the information should go to.
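You can pull the host and the port apart yourself with the built-in `URL` class. The `isLoopback` helper below is my own simplification (it only looks at the first octet and does not validate the rest), and the `/api` path is just a placeholder.

```javascript
// The whole 127.x.x.x range is loopback, so checking the first octet is enough.
// Simplified sketch: the other three octets are not validated.
function isLoopback(address) {
  const parts = address.split(".");
  return parts.length === 4 && parts[0] === "127";
}

const frontend = new URL("http://localhost:5173/");
const backend = new URL("http://127.0.0.1:3000/api");

console.log(frontend.hostname, frontend.port); // localhost 5173
console.log(backend.hostname, backend.port);   // 127.0.0.1 3000

console.log(isLoopback(backend.hostname)); // true
console.log(isLoopback("127.0.0.2"));      // true
console.log(isLoopback("192.168.1.10"));   // false
```

Same machine, two different ports: that is how the frontend dev server and the backend can both live on your laptop without stepping on each other.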

Now back to the discussion of classful addressing

Class B (128.0.0.0 to 191.255.0.0):
The first two octets identify the network, and the remaining two octets represent hosts.

If we talk about the first two octets, the first two bits are reserved to identify Class B, which is 10. The remaining bits are used for networks. So in Class B, there are 16,384 networks,
where each network can accommodate 65,534 hosts.

If you are not clear how I got the number 65,534, then here is the math.
We used the first two bits to reserve the identifier of Class B and we were left with 14 bits, which makes up 16,384 networks (2 raised to the power 14).

Now we are left with 2 more octets for host allotment, so 2^16 - 2 = 65,534.

Here the 2 subtracted addresses are reserved: the all-zeros host address identifies the network itself and the all-ones host address is used for broadcast.
Suited for medium-sized networks.

Class C (192.0.0.0 to 223.255.255.0):
The first three octets identify the network, and the last octet represents the hosts.
In this, the first 3 bits represent Class C, which is 110, and the remaining 21 bits of the first three octets represent the networks, so there are 2,097,152 networks (2 raised to the power 21), each with 254 hosts (2^8 - 2).
Designed for smaller networks.
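The class of an address can be read straight off the leading bits of its first octet, and the network and host counts follow from the number of bits left over. Here is a minimal sketch of both; `classOf` and `classSizes` are names I made up, and anything outside classes A, B, and C is simply reported as "other".

```javascript
// Decide the class from the leading bits of the first octet.
function classOf(address) {
  const firstOctet = Number(address.split(".")[0]);
  if (!Number.isInteger(firstOctet) || firstOctet < 0 || firstOctet > 255) {
    throw new RangeError(`Not a valid first octet in: ${address}`);
  }
  if ((firstOctet & 0b10000000) === 0b00000000) return "A"; // 0xxxxxxx
  if ((firstOctet & 0b11000000) === 0b10000000) return "B"; // 10xxxxxx
  if ((firstOctet & 0b11100000) === 0b11000000) return "C"; // 110xxxxx
  return "other"; // remaining classes are not covered here
}

// Network and host counts, derived from the number of bits in each portion.
const classSizes = {
  A: { networks: 2 ** 7, hostsPerNetwork: 2 ** 24 - 2 },
  B: { networks: 2 ** 14, hostsPerNetwork: 2 ** 16 - 2 },
  C: { networks: 2 ** 21, hostsPerNetwork: 2 ** 8 - 2 },
};

console.log(classOf("10.0.0.1"));    // A
console.log(classOf("172.16.0.1"));  // B
console.log(classOf("192.168.1.1")); // C
console.log(classSizes.B); // { networks: 16384, hostsPerNetwork: 65534 }
```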

Seems interesting right? But many problems arose over time with this scheme.

Inefficient Use of Address Space - If you are a company, and I am assuming a very large company, and you got assigned a Class A network, will you actually need 16 million hosts? This resulted in a lot of wastage of this exhaustible resource of IP addresses.
Entire Class B address blocks were allocated to medium-sized organizations even if they didn't need that many addresses, which used up the Class B space quickly. Handing such organizations several Class C blocks instead led to a significant increase in the number of entries in the global routing tables of the routers.
Management - Another problem was that it was not easy to administer such a big network. Even a Class B network still has around 65 thousand hosts.

To solve the management problem, people thought of dividing those networks into sub-networks so that they are easier to maintain. This is what we call subnetting.

Subnetting

Now before we go and understand subnet let's first again understand how it actually helps in maintaining IP addresses.

Customized Configurations - Let's say your company has multiple departments like finance, development, sales, etc. Now, consider a scenario where the finance team should have access to a private database containing financial information about the company, which is hosted in the company's private network. However, the company does not want every department to have access to that database. This is where subnetting comes into play. Using subnetting, you divide your company's IP address space into multiple sub-networks or subnets. In the configuration of your private database, you set up access controls to allow only the specific subnet associated with the finance department to access that database.
Decentralized System - With subnetting the network is somewhat decentralized, so if there is some problem in one subnet, it won't affect the whole system, and hence helps you to easily pinpoint where the error or bug has originated from.
Easy Monitoring - Subnets make the monitoring more organized and easy to track, if I want to know how much traffic is there in one department (subnet) or how much data is being shared in one subnet I can easily filter.

Now that we know the why let's talk about the What and how.

What is a Subnet
Let's extend our city analogy to incorporate subnetting:

Original City Blocks (Networks):
Downtown (Network Portion): Addresses starting with "100" represent the Downtown area.

Uptown (Network Portion): Addresses starting with "200" represent the Uptown area.

Suburbs (Network Portion): Addresses starting with "300" represent the suburbs.

Subnetting within Neighborhoods:
Now, imagine that each neighborhood decides to further organize itself into smaller blocks or zones for specific purposes.

Downtown (Network Portion 100):

Block A (Subnet): Addresses within the range "100.1" to "100.50" are designated for residential use.
Block B (Subnet): Addresses within the range "100.51" to "100.100" are allocated for businesses.

Uptown (Network Portion 200):

Block X (Subnet): Addresses within the range "200.1" to "200.30" are for schools.
Block Y (Subnet): Addresses within the range "200.31" to "200.60" are for parks.

So what does a subnet look like? Well, every subnet comes with a subnet mask, a 32-bit number that sits alongside an IP address. It is used to identify the devices that are present on the same subnet (I know, a little confusing, but bear with me). So how is a subnet identified?

A subnet mask divides an address into the network portion and the host portion. The network portion is fixed for everyone on that subnet, while the size of the host portion tells us how many hosts are available in that subnet.

So again, what does a subnet mask look like? Here is an example - 255.255.255.0 (the default subnet mask of a Class C network).

Now why the number 255? Why not any other number?

So in a subnet mask, continuous 1 bits represent the network portion and the continuous 0 bits represent the host portion, so if you convert the 255.255.255.0 to a binary representation, it will look something like this 11111111.11111111.11111111.00000000

So if you look at the first octet (the first 8 bits) you will see it's 11111111, and if you convert it to decimal it will be 255. So that's it, that's why 255 represents an octet that belongs fully to the network portion. I like to think of it like this: because the octet contains all 1s, there are no more possible combinations left, and that's the best way to show that it's uneditable.

Now, a 255 octet alone doesn't make something a subnet mask. For example, 0.255.255.0 can't be called a subnet mask. Why? Because a subnet mask needs to be a contiguous run of 1s followed by a contiguous run of 0s; you can't start with 0s, then suddenly have 1s, and then go back to 0s.

let me show this in binary representation -

Given Subnet Mask - 255.255.255.0 Binary - 11111111.11111111.11111111.00000000

This is a valid subnet mask because first there are continuous 1s and then 0s. And yes, the 1s should always be on the left side and the 0s on the right side.

Given Subnet Mask - 255.207.255.0 Binary - 11111111.11001111.11111111.00000000

This can't be called a subnet mask. Why? Because it's not contiguous: we did start with continuous 1s, but then in the middle it started to have 0s and then again 1s.
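The contiguity rule is simple enough to check in code. The two helpers below are a sketch of my own (their names are not from any library): one prints the binary form used in the examples above, the other checks that the bits are a run of 1s followed by a run of 0s.

```javascript
// Show a dotted mask in binary, one 8-bit group per octet.
function maskToBinary(mask) {
  return mask
    .split(".")
    .map((octet) => Number(octet).toString(2).padStart(8, "0"))
    .join(".");
}

// A valid mask is 32 bits: some 1s on the left, then only 0s on the right.
function isValidSubnetMask(mask) {
  const bits = maskToBinary(mask).replace(/\./g, "");
  return bits.length === 32 && /^1*0*$/.test(bits);
}

console.log(maskToBinary("255.207.255.0"));
// 11111111.11001111.11111111.00000000

console.log(isValidSubnetMask("255.255.255.0")); // true
console.log(isValidSubnetMask("255.207.255.0")); // false
console.log(isValidSubnetMask("0.255.255.0"));   // false
```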

Now that we have understood the What let's understand the how.

How does subnetting work in a network of subnets?

Let's say I have a Class C network, and because in Classful addressing you get the whole network, I'll assume my network address is 192.168.1.0, so the range will be 192.168.1.0 to 192.168.1.255.

Now in classful addressing you have to keep the same size for all of your subnets. So if your finance team has more people than the sales team, the sales team still gets a subnet with the same number of hosts as the finance team's.

So let's say I want to make 4 subnets of my network.

Now I am using a Class C network whose default subnet mask is 255.255.255.0, so if I want to have 4 subnets in my network then my subnet mask will look like this - 255.255.255.192.
Why?

Let's look at the binary representation of 255.255.255.0
Binary - 11111111.11111111.11111111.00000000

Here I can see that I have 8 bits in the host portion (the 0s). So if I want to make space for 4 subnets, I'll have to reserve 2 of those 8 bits, which gives us 4 subnets (2 raised to the power of 2), and hence our subnet mask will look like 11111111.11111111.11111111.11000000 - 255.255.255.192

This leaves us with 6 host bits, so 2 raised to the power 6 gives us 64 addresses per subnet. Just like before, 2 of them are reserved in each subnet (the network address and the broadcast address), which leaves 62 usable hosts.

So this gives us the following range of addresses taking 192.168.1.0 as the base address-

Subnet 1: 192.168.1.0 to 192.168.1.63
Subnet 2: 192.168.1.64 to 192.168.1.127
Subnet 3: 192.168.1.128 to 192.168.1.191
Subnet 4: 192.168.1.192 to 192.168.1.255
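The ranges above can be computed instead of worked out by hand. In this sketch, `splitNetwork` takes the base address, the number of host bits the network started with, and how many of them are borrowed for subnets; the function and parameter names are my own, and the input is not validated. The first and last address of each subnet are conventionally reserved (network and broadcast), so `usableHosts` is the subnet size minus 2.

```javascript
function toInt(address) {
  return address
    .split(".")
    .reduce((value, octet) => value * 256 + Number(octet), 0);
}

function toDotted(value) {
  return [24, 16, 8, 0]
    .map((shift) => Math.floor(value / 2 ** shift) % 256)
    .join(".");
}

// Split a network into equally sized subnets by borrowing host bits.
function splitNetwork(baseAddress, originalHostBits, borrowedBits) {
  const subnetSize = 2 ** (originalHostBits - borrowedBits);
  const base = toInt(baseAddress);
  const subnets = [];
  for (let index = 0; index < 2 ** borrowedBits; index += 1) {
    const first = base + index * subnetSize;
    subnets.push({
      first: toDotted(first),
      last: toDotted(first + subnetSize - 1),
      mask: toDotted(2 ** 32 - subnetSize),
      usableHosts: subnetSize - 2,
    });
  }
  return subnets;
}

for (const subnet of splitNetwork("192.168.1.0", 8, 2)) {
  console.log(`${subnet.first} to ${subnet.last} (mask ${subnet.mask})`);
}
// 192.168.1.0 to 192.168.1.63 (mask 255.255.255.192)
// 192.168.1.64 to 192.168.1.127 (mask 255.255.255.192)
// 192.168.1.128 to 192.168.1.191 (mask 255.255.255.192)
// 192.168.1.192 to 192.168.1.255 (mask 255.255.255.192)
```

Changing the last argument from 2 to 3 would give 8 subnets of 32 addresses each, which is a quick way to play with the trade-off between the number of subnets and their size.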

Now how does communication work between these subnets?
So a router works as a gateway between subnets. Every device is configured with its own IP address, the subnet mask, and the IP address of the gateway (the router). Let's say device A wants to send a request to device B.

Device A first checks whether the IP address of device B is in its own subnet. How does it do that? It already has the subnet mask, so it applies the mask to both addresses and compares the network portions. If they match, B is on the same subnet and the data is delivered to it directly, without needing the router. Otherwise device A hands the data to the router, which looks at the destination and either delivers it to the right subnet or passes it on to another router. Now how does the router find which router it needs to send it to? That is another big topic and should be kept separately, I suppose.
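The "is it in the same subnet?" check boils down to a bitwise AND of each address with the subnet mask, followed by a comparison of the results. A minimal sketch, with helper names and sample addresses of my own choosing and no input validation:

```javascript
function toInt(address) {
  return address
    .split(".")
    .reduce((value, octet) => value * 256 + Number(octet), 0);
}

// Two addresses share a subnet when their network portions are equal.
function sameSubnet(addressA, addressB, mask) {
  const maskValue = toInt(mask);
  // ">>> 0" keeps the AND result as an unsigned 32-bit number.
  const networkA = (toInt(addressA) & maskValue) >>> 0;
  const networkB = (toInt(addressB) & maskValue) >>> 0;
  return networkA === networkB;
}

const mask = "255.255.255.192";

// .10 and .50 both fall in 192.168.1.0 to 192.168.1.63
console.log(sameSubnet("192.168.1.10", "192.168.1.50", mask)); // true

// .70 falls in 192.168.1.64 to 192.168.1.127, a different subnet
console.log(sameSubnet("192.168.1.10", "192.168.1.70", mask)); // false
```

When the answer is false, the traffic has to go through the gateway to reach the other subnet.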

Now, we have only covered Classful addressing in this blog; in the next blog we will talk about Classless Inter-Domain Routing. Remember, this is only 1 reason why we shifted from NCP to the TCP/IP architecture. Once this is completed we will go to the next reason.

Thanks for reading this far 😁
