Shrijith Venkatramana

Posted on Jul 2

The English Parsing Problem That Led to Modern LLM Transformers

#ai #webdev #productivity #programming

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.

When people explain Large Language Models, they usually start with ChatGPT, attention, or transformers.

A much better place to start is a problem linguists had been struggling with for decades:

How do you teach a computer to understand the grammatical structure of a sentence?

This wasn't just an academic curiosity.

Search engines, machine translation, question answering, speech recognition, document understanding, and even programming languages all depend on extracting structure from sequences of symbols.

One of the benchmark problems became English constituency parsing: given a sentence, recover the tree representing its grammatical structure.

It turns out that this seemingly narrow problem became one of the clearest early demonstrations that the Transformer architecture worked well beyond machine translation.

Let's see why.

Before Deep Learning: Parsing Was Mostly Hand-Crafted Rules

Imagine the sentence:

The little girl saw the dog with the telescope.

Humans immediately recognize that this sentence is ambiguous.

Did the girl use the telescope?

Or did the dog have the telescope?

Those are two completely different parse trees.

Computers, however, simply receive a sequence of words.

The | little | girl | saw | the | dog | with | the | telescope

The challenge is recovering hidden grammatical structure.

A simplified parse tree might look like:

Sentence
├── Noun Phrase
│   ├── Determiner
│   ├── Adjective
│   └── Noun
└── Verb Phrase
    ├── Verb
    ├── Noun Phrase
    └── Prepositional Phrase

This problem became known as constituency parsing.
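To make the ambiguity concrete, here is a minimal sketch that writes both readings of the telescope sentence as nested tuples. The tuple encoding, the variable names, and the helper functions are my own choices for illustration, not a standard format; the labels (S, NP, VP, PP) are abbreviations of Sentence, Noun Phrase, Verb Phrase, and Prepositional Phrase.

```python
# Each tree node is (label, child, child, ...); a leaf is a plain word string.
# The trees are hand-written for illustration.

NP_GIRL = ("NP", "The", "little", "girl")
NP_DOG = ("NP", "the", "dog")
PP = ("PP", "with", ("NP", "the", "telescope"))

# Reading 1: the PP attaches to the verb phrase (the girl used the telescope).
tree_instrument = ("S", NP_GIRL, ("VP", "saw", NP_DOG, PP))

# Reading 2: the PP attaches to the object noun phrase (the dog has the telescope).
tree_possession = ("S", NP_GIRL, ("VP", "saw", ("NP", NP_DOG, PP)))


def words(tree):
    """Return the leaf words of a tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    result = []
    for child in tree[1:]:
        result.extend(words(child))
    return result


def bracketed(tree):
    """Render a tree as a one-line bracketed string."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join([tree[0]] + [bracketed(c) for c in tree[1:]]) + ")"


print(bracketed(tree_instrument))
print(bracketed(tree_possession))
```

Both trees cover exactly the same words in the same order. Only the attachment point of the prepositional phrase differs, and that is the part a parser has to decide.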

Through the 1970s and 1980s, researchers built large collections of hand-written grammar rules inspired by linguistic theories such as those developed by Noam Chomsky.

These systems worked...

...until real language got messy.

Every exception required another rule.

Every new language required another grammar.

Maintenance costs exploded.

Then Statistics Entered the Picture

During the 1990s, NLP underwent what many call the "statistical revolution."

Instead of manually writing thousands of grammar rules, researchers asked:

What if we simply learn grammar from data?

The creation of the Penn Treebank was a turning point.

Tens of thousands of English newspaper sentences were manually annotated with parse trees.

For example:

(S
   (NP The cat)
   (VP sat
       (PP on
           (NP the mat))))
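Bracketed trees like this are easy to load into a program. The following is a rough sketch of a reader for the simplified notation shown above; `tokenize`, `parse`, and `read_tree` are names I made up for this example, and real treebank files carry extra detail (such as part-of-speech tags on every word) that this sketch does not handle specially.

```python
def tokenize(text):
    """Split a bracketed tree string into '(', ')' and word tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse one '(LABEL child ...)' group starting at tokens[pos].

    Returns (tree, next_pos), where tree is (label, child, ...) and
    leaves are plain word strings.
    """
    if tokens[pos] != "(":
        raise ValueError("expected '(' at token %d" % pos)
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)   # nested phrase
        else:
            child, pos = tokens[pos], pos + 1  # plain word
        children.append(child)
    return (label, *children), pos + 1


def read_tree(text):
    """Read a single bracketed tree from a string."""
    tree, _ = parse(tokenize(text))
    return tree


example = """
(S
   (NP The cat)
   (VP sat
       (PP on
           (NP the mat))))
"""
tree = read_tree(example)
print(tree)
```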

Instead of writing rules, researchers could now estimate probabilities.

Rather than simply declaring that the rule

NP -> Determiner Noun

is valid,

they estimated its probability

P(NP -> Determiner Noun)

by counting how often it appears in the annotated trees.
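Here is a small sketch of that estimate, assuming the simplest possible approach: divide the number of times a rule appears by the number of times its left-hand side appears. The three-tree "treebank" is invented for illustration, and I use the tag names DT (determiner) and NN (noun) in place of the spelled-out labels above.

```python
from collections import Counter

# A toy "treebank": (label, child, ...) tuples with a part-of-speech tag above each word.
# These three trees are made up; a real treebank is far larger.
TREEBANK = [
    ("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", ("VBD", "sat"))),
    ("S", ("NP", ("DT", "a"), ("NN", "dog")), ("VP", ("VBD", "barked"))),
    ("S", ("NP", ("NNS", "birds")), ("VP", ("VBD", "sang"))),
]


def rules(tree):
    """Yield (parent, child_labels) for every phrase-level node in a tree."""
    label, children = tree[0], tree[1:]
    if all(isinstance(c, str) for c in children):
        return  # a tag over a word, e.g. (DT the); lexical rules are skipped here
    yield label, tuple(c[0] for c in children)
    for c in children:
        yield from rules(c)


rule_counts = Counter(r for t in TREEBANK for r in rules(t))
parent_counts = Counter()
for (parent, _), n in rule_counts.items():
    parent_counts[parent] += n


def rule_probability(parent, children):
    """Relative-frequency estimate: count(parent -> children) / count(parent)."""
    return rule_counts[(parent, tuple(children))] / parent_counts[parent]


print(rule_probability("NP", ["DT", "NN"]))
```

In this toy data, two of the three noun phrases are a determiner followed by a noun, so the estimate comes out to two thirds.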

Suddenly parsing became a machine learning problem.

This dramatically improved accuracy.

But another limitation remained.

The models still relied heavily on manually designed features.

Neural Networks Changed the Game

Around 2013-2016, neural networks began replacing handcrafted features across NLP.

Instead of engineers inventing hundreds of linguistic features, models learned useful representations directly from text.

One breakthrough came from recurrent neural networks (RNNs), and in particular LSTMs, a recurrent design dating back to 1997.

These models could process words sequentially.

The -> little -> girl -> saw -> ...

Each word updated an internal hidden state.

This worked surprisingly well.

But there was a problem.

Suppose the sentence contains 40 words.

The subject might appear near the beginning.

The verb might appear much later.

Information had to travel through dozens of recurrent steps.

Even LSTMs struggled with long-range dependencies.

Training also became difficult because computations were inherently sequential.

GPUs thrive on parallel work.

RNNs offer very little of it across the words of a sentence.
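The sequential bottleneck is easy to see in code. Below is a bare-bones sketch of a plain recurrent update in pure Python. The hidden size, the random weights, and the random word vectors are all placeholders I chose for the example; a trained model would learn them, and a real LSTM step is more involved than this.

```python
import math
import random

random.seed(0)
HIDDEN = 4  # hidden state size, chosen arbitrarily for the example

words = "The little girl saw the dog".split()

# Hypothetical random embeddings and weights; a trained model would learn these.
embedding = {w: [random.uniform(-1, 1) for _ in range(HIDDEN)] for w in sorted(set(words))}
W_in = [[random.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
W_rec = [[random.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(HIDDEN)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(hidden, x):
    """One recurrent update: h_t = tanh(W_in x_t + W_rec h_{t-1})."""
    a = matvec(W_in, x)
    b = matvec(W_rec, hidden)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]


hidden = [0.0] * HIDDEN
states = []
for word in words:
    # Step t needs the hidden state from step t-1, so the words
    # cannot be processed in parallel.
    hidden = rnn_step(hidden, embedding[word])
    states.append(hidden)

print(len(states), "sequential steps for", len(words), "words")
```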

The Transformer Arrived

In 2017, researchers at Google published the landmark paper:

"Attention Is All You Need."

Instead of processing words one after another, the Transformer lets every word directly inspect every other word.

Imagine each word asking:

Which other words matter for understanding me?

For example:

The programmer who fixed the parser yesterday deployed it.

The word

deployed

can directly connect to

programmer

instead of waiting for information to propagate through every intermediate word.

This is called self-attention.
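One rough way to quantify the difference is to count how many steps information must travel between two words. The two functions below are my own simplification: they count hidden-state updates in a left-to-right recurrent model and attention hops in a single self-attention layer.

```python
sentence = "The programmer who fixed the parser yesterday deployed it".split()


def recurrent_path_length(i, j):
    """Hidden-state updates between positions i and j in a left-to-right RNN."""
    return abs(i - j)


def attention_path_length(i, j):
    """With self-attention, any position can read any other within one layer."""
    return 0 if i == j else 1


subject = sentence.index("programmer")
verb = sentence.index("deployed")

print(recurrent_path_length(subject, verb))  # 6
print(attention_path_length(subject, verb))  # 1
```

The recurrent path grows with the distance between the words. The attention path stays constant no matter how many words sit in between.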

The result was two enormous advantages.

First, long-distance grammatical relationships became much easier to model.

Second, during training every word in a sentence could be processed simultaneously.

Modern GPUs love this.

Training speed increased dramatically.

English Constituency Parsing Became an Early Proof

Many people associate Transformers with machine translation.

Less widely remembered is that the original paper also tested the model on English constituency parsing, and how quickly self-attention reached the top of that benchmark.

One particularly influential paper was:

"Constituency Parsing with a Self-Attentive Encoder"

published by Nikita Kitaev and Dan Klein in 2018.

Instead of encoding the sentence with a recurrent network, they built the parser's encoder around self-attention and kept a chart-based decoder to assemble the tree.

The results were striking.

On the Penn Treebank benchmark, the model achieved state-of-the-art accuracy while being conceptually simpler than many previous systems.

Even more interesting was why it worked.

The authors found that separating content information from position information inside attention significantly improved parsing performance.

In other words, knowing what a word is and where it occurs are distinct signals, and the model benefits from treating them differently.
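Here is a simplified sketch of what "treating them differently" can mean. It assumes each word has a separate content vector and position vector, each with its own query and key projections, so the attention score splits into a content term plus a position term with no mixing between the two. All vectors and matrices are made-up numbers, and this is an illustration of the idea rather than the paper's actual implementation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def block_diag(a, b):
    """Place two square matrices along the diagonal, zeros elsewhere."""
    n, m = len(a), len(b)
    return [row + [0.0] * m for row in a] + [[0.0] * n + row for row in b]


# Hypothetical 2-d content and position vectors for two words.
content = {"book": [1.0, 0.5], "belongs": [0.8, -0.2]}
position = {"book": [0.1, 0.9], "belongs": [-0.7, 0.3]}

# Hypothetical projection matrices; a trained parser would learn these.
Wq_c = [[0.5, 0.1], [0.0, 1.0]]
Wk_c = [[1.0, 0.0], [0.2, 0.7]]
Wq_p = [[0.3, -0.4], [0.6, 0.2]]
Wk_p = [[0.9, 0.1], [-0.5, 0.4]]


def factored_score(query_word, key_word):
    """Return the (content term, position term) of the attention score."""
    content_term = dot(matvec(Wq_c, content[query_word]), matvec(Wk_c, content[key_word]))
    position_term = dot(matvec(Wq_p, position[query_word]), matvec(Wk_p, position[key_word]))
    return content_term, position_term


def concatenated_score(query_word, key_word):
    """Same score, computed on [content; position] with block-diagonal projections."""
    x_q = content[query_word] + position[query_word]  # list concatenation
    x_k = content[key_word] + position[key_word]
    q = matvec(block_diag(Wq_c, Wq_p), x_q)
    k = matvec(block_diag(Wk_c, Wk_p), x_k)
    return dot(q, k)


c_term, p_term = factored_score("belongs", "book")
print(c_term, p_term, concatenated_score("belongs", "book"))
```

Moving a word to a different position changes only the position term. If content and position were instead added together before a single projection, the dot product would also contain cross terms that mix the two signals.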

Later Transformer research, well beyond parsing, has kept returning to the question of how to represent position separately from content.

Why Self-Attention Fits Parsing So Well

Parsing is fundamentally about relationships.

Take the sentence:

The book on the table near the window belongs to Alice.

The subject

book

must eventually connect with

belongs

even though several phrases intervene.

A recurrent model passes information through many intermediate states.

A Transformer simply creates a direct interaction.

Mathematically, each word produces three vectors:

Query
Key
Value

The attention score between two words is proportional to

Query · Key

The dot product measures compatibility.

If the vectors point in similar directions, the score becomes large.

A softmax converts these scores into weights that are positive and sum to one.

Finally, each word's new representation becomes a weighted average of the Value vectors of every word in the sentence, itself included.

Intuitively:

Queries ask questions.
Keys advertise available information.
Values carry the information itself.

Every word gets to decide whom it should listen to.

This is exactly the kind of computation grammatical analysis requires.
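The query, key, and value steps described in this section fit in a few lines of pure Python. This is a minimal single-head sketch: the vectors are hand-picked numbers rather than learned projections, and the division by the square root of the key size is the scaling used in the original Transformer paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V, one query at a time."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors, dimension by dimension.
        out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-d vectors for three words; a real model derives them from learned projections.
words = ["book", "table", "belongs"]
Q = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
K = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = self_attention(Q, K, V)
for word, row in zip(words, weights):
    print(word, [round(w, 2) for w in row])
```

With these numbers, the query for "belongs" lines up best with the key for "book", so "book" receives the largest share of its attention.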

The Economics of Parallelism

Why was this architecture such a big deal operationally?

Suppose we have a sentence of 100 words.

An RNN performs roughly 100 sequential computation steps.

Even if each step is small, the next cannot begin until the previous finishes.

A Transformer computes attention between all pairs of words.

That is roughly:

100 × 100 = 10,000

pairwise interactions.

That sounds much more expensive.

And as sentences get longer, it is.

Self-attention's cost grows quadratically with sentence length, while a recurrent layer's grows only linearly.

So why did it win?

Because GPUs can perform thousands of multiply-add operations simultaneously.

Instead of performing 100 serialized steps,

they compute all the pairwise scores in a handful of large matrix multiplications.

Hardware utilization skyrockets.

The result is a classic engineering tradeoff:

More arithmetic
Far less waiting

Modern accelerators strongly reward that trade.

This shift, from minimizing floating-point operations to maximizing hardware throughput, is one reason Transformers displaced RNNs so quickly.
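The tradeoff in this section can be put into numbers with a back-of-the-envelope sketch. The function below only counts steps and pairs for a single layer. It deliberately ignores the per-step and per-pair costs, which depend on the hidden size, so treat it as an illustration of how the two quantities grow rather than a performance model.

```python
def compare(sentence_length):
    """Count work vs. waiting for one layer, ignoring hidden-size factors."""
    return {
        # An RNN must finish word t-1 before starting word t.
        "rnn_sequential_steps": sentence_length,
        # Self-attention scores every word against every word...
        "attention_pairwise_scores": sentence_length ** 2,
        # ...but all of those scores can be computed at once.
        "attention_sequential_steps": 1,
    }


for n in (10, 100, 1000):
    print(n, compare(n))
```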

Looking Back

English constituency parsing might sound like a niche benchmark today.

In reality, it helped demonstrate something profound.

Language understanding isn't primarily about processing words one after another.

It's about modeling relationships between them.

The Transformer architecture embraced that idea directly.

The same self-attention mechanism that learned grammatical trees now powers systems capable of writing software, translating dozens of languages, summarizing books, answering scientific questions, and helping developers every day.

Sometimes the technologies that change the world first prove themselves on problems most people have never heard of.

English constituency parsing was one of those problems.

What surprised you most about the history of Transformers?

Was it that they first proved themselves on tasks like machine translation and parsing, or did you expect conversational AI to be the original breakthrough? I'd love to hear your thoughts.

*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*

Feedback and contributions are welcome! It's online, source-available, and ready for anyone to use.
