From CAP to GAP?

Humble beginnings

Testing a piece of existing code is easy. That is, at least, the common wisdom: you know what the code is supposed to do, you craft inputs to trigger certain, deterministic, behaviors in the code, and assert the output for each is as expected. Like I said - easy.

Well, sorta...

Even if we discount hardware issues, e.g., network loss in the middle of a multi-GB file download, or the hard drive on the logging server becoming full (and whose idea was it to forgo monitoring on that server?!), one needs to account for edge cases such as invalid user input, which the application should handle gracefully, or worse, seemingly valid input that might cause some really weird behavior if left unchecked - I'm looking at you, "divide by zero", you sneaky bastard.

So, smart folks, when faced with this, eh, uneasiness, came up with a new, complementary testing paradigm - property-based testing: you identify the invariants of the piece of code, the underlying behavior, the "what is this code all about, like, deep down", and you let the computer generate inputs, both valid-but-fishy and outright invalid, to stress that understanding of what the function is really all about.

Easy.

So easy, in fact, that even the simple addition function, you know, the good ole' + operator, turns out to be... let's say, complex.
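To make that concrete, here's a minimal property-based sketch in Python, using the hypothesis library as one possible tool (my choice for illustration; any property-based framework would do):

```python
# A minimal property-based test of plain addition, using the hypothesis library.
from hypothesis import given, strategies as st

@given(st.integers(), st.integers(), st.integers())
def test_int_addition_properties(a, b, c):
    assert a + b == b + a              # commutativity
    assert (a + b) + c == a + (b + c)  # associativity
    assert a + 0 == a                  # identity
```

For integers this passes quietly. Swap st.integers() for st.floats() and the properties start falling over: floating-point addition isn't associative ((0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)), and NaN doesn't even equal itself - which is exactly the "complex" part.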

And do I even dare consider load and stress testing, requiring an intimate understanding of the domain and the available resources, and knowing how to analyze the results to make the call whether our system has hit a peak, hit a fail-point, or, ideally, can take even more abuse?

And don't even get me started on "pentesting", i.e., "Penetration testing", the act of actively looking for security breaches in the application and exploiting them to prove they exist!

So, to reiterate, testing a given piece of code is easy, it's what mathematicians call a "solved problem".

After all, we all do it all the time with impeccable implementation and results - it's what we call "unit testing".

Artificial intelligence is no match for human stupidity

Enter, stage left: LLMs.

I mean... how do we even QA, test, and verify this new Golem we unleashed upon ourselves?!

But that's a loaded question, isn't it?

If by testing the LLM we mean verifying that the code it outputs does what we expect (what a strange goal to have, I know), without non-existent APIs (an affliction I still experience every now and then, especially in less mainstream technologies where training code was, and still is, scarce, and the LLM is forced to extrapolate from the rest of its training), and that it's idiomatic for the technology we're using, performant, and secure - well, that's back to testing 101, our solved problem.

Easy.

I'll even see my previous claim and raise - I further claim I can, to a great degree of confidence, verify an LLM's response, in natural language, to questions whose expected response should follow some pattern and structure within a certain context.

In other words, questions whose answer conforms to some formal specification - but said answer may be expressed in natural language, not necessarily in code.

For example: say I'm a network engineer trying to partition my current intranet into subnets, each allowing only certain organizational users and only certain protocols, and I prompt the LLM for help, identifying the issue ("need to partition the intranet into subnets allowing only given users and protocols") and the inputs (i.e., the list of users, protocols, and their mapping).

I now expect the response to include the word "subnets", though I'd also be looking for "groups" or "sets", and in the context of subnets, I expect to, eventually, encounter all the inputs I listed: all the usernames and protocols.

Further, I expect to find some code to actually carry out the network partitioning.

All this should be given in some flow, some context, that is continuous and moves from idea to implementation naturally.

I now claim that if I find all of the above expected text in the LLM's response, I can say with high confidence that the response is correct, while missing even one detail from that list immediately disqualifies the response.

A logical Bloom filter.
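As a rough sketch of that filter for the subnet example above (the function name and signature are mine, purely hypothetical):

```python
# A hypothetical sketch of the "logical Bloom filter": every expected marker must
# appear, and missing even one disqualifies the response outright.
def passes_marker_check(response: str, users: list[str], protocols: list[str]) -> bool:
    text = response.lower()

    # 1. The core concept must be named (a few synonyms are acceptable).
    if not any(term in text for term in ("subnet", "group", "set")):
        return False

    # 2. Every input I listed must eventually appear: all usernames, all protocols.
    if not all(user.lower() in text for user in users):
        return False
    if not all(proto.lower() in text for proto in protocols):
        return False

    # 3. There must be some code to actually carry out the partitioning.
    fence = "`" * 3  # a markdown code fence is a cheap proxy for "contains code"
    if fence not in response:
        return False

    # Passing means "plausibly correct, with high confidence"; failing is definitive.
    return True
```

Like a Bloom filter: a pass is probabilistic, a miss is conclusive.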

Yes, obviously, the verifier is an LLM itself, but I posit it needs to be a very simple, nerfed, model, not some beast requiring a cluster of GPUs to run, having been trained on Petabytes of domain-specific data.

The verifier's only job would be to sniff out the expected keywords and validate they are in a semantic structure, context, that makes sense.

Again, not straightforward, but definitely doable using today's technology and current models, and it can be run against all domains, generally.
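In sketch form, and with call_small_model standing in for whatever small, locally-run model you pick (it's a placeholder, not a real API), that verifier's single job could look like this:

```python
# A rough sketch of the verifier's one job: sniff out markers in context, answer PASS or FAIL.
from typing import Callable

VERIFIER_PROMPT = """You are a verifier, not an answerer. Reply with PASS or FAIL only.
PASS means: every one of the EXPECTED MARKERS appears in the RESPONSE, each in a
context that genuinely discusses it, and the RESPONSE flows from problem to implementation.

EXPECTED MARKERS: {markers}

RESPONSE:
{response}
"""

def verify_structure(response: str, markers: list[str],
                     call_small_model: Callable[[str], str]) -> bool:
    # The small model only judges keywords and their surrounding context, nothing more.
    verdict = call_small_model(VERIFIER_PROMPT.format(
        markers=", ".join(markers), response=response))
    return verdict.strip().upper().startswith("PASS")
```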

Which brings us to ask: how can we verify an LLM's response to open-ended questions, questions whose response is, by definition, not governed by any spec?
How do you verify a response that is deliberately fashioned to mimic human thinking and intelligence?

I encourage you to think about this question for a couple of seconds: how do we verify an answer by our human colleagues?

That is, after all, what we're trying to achieve here: an application to verify a human-like response, to any question, in any domain.

Can we do it? Probably not, but why not?

Fundamentally speaking

The reason we can't, and there's no "probably" about it, no single person can verify all open-ended responses across any and all domains, is that in order to validate an answer given by someone else, the verifier has to be two things: as intelligent as the one giving the answer, so as to distinguish a valid answer from ear-pleasing mumbo-jumbo, and as much of a domain expert as the person providing the answer.

When thinking about a so-called "universal LLM verifier" we'd like to impose two more constraints: the system needs to be generalized enough to be able to validate responses in any domain, and it needs to be automated with no human intervention or interaction.

Tasty, low calories, cheap - pick only two

But that, alas, is not possible!

We've hit the "GAP theorem of LLMs".

Yes, yes, I can hear your "the what now?!" gasps all the way here.

As we all know today, with the advent of distributed systems engineering, a fundamental limitation of such systems arose: the CAP theorem of distributed systems, stating that such a system's design can only ever guarantee two attributes out of three possible: Consistency (every component of the system sees the same, up-to-date data at the same time), Availability (every request receives a response), and Partition tolerance (the system keeps operating even when the network splits its nodes apart).

Consistency, Availability, Partition tolerance - choose any two.

Well, in the emergent realm of LLMs we face a similar problem: GAP - Generality (a verifier's ability to verify responses from any domain), Automation (no human intervention/interaction required), and Precision (the verifier's decision is correct, to a very high degree of confidence) (yes, I know... "Accuracy". Precision lends the "P" to make the cool acronym. Humor me, why don'tcha?!)

And, much like the CAP theorem, the GAP theorem can't be solved by doing the thing we humans love doing when in dire straits: throwing more force at the matter.

Generality isn't solved by adding more nerfed, equivalent LLM servers. It may cut down the time to reach a consensus on whether the text being verified is valid or not, but using more of the same can't, by definition, upgrade the outcome.

Automation can't be solved by putting more servers in play, period. If a domain is deemed so risky that any and all LLM responses must be verified by a human expert in the domain, then they must be.

Precision, like generality, also can't be bought by adding more servers of the same ability: they all have the same training, none is an expert in the domain in question, and none can give an authoritative answer.
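A toy simulation makes the "more of the same" point tangible - the numbers below are made up, and the only point is that identically trained verifiers share identical blind spots, so majority voting among them buys nothing:

```python
# Toy illustration: majority voting only helps when errors are independent.
# Replicas of the same model fail on the same cases, so consensus changes nothing.
import random

random.seed(0)
HARD_CASES = set(random.sample(range(1000), 150))  # 15% of cases every replica gets wrong

def same_model_vote(case: int, n_verifiers: int) -> bool:
    # Every replica was trained identically, so every vote is identical.
    votes = [case not in HARD_CASES for _ in range(n_verifiers)]
    return sum(votes) > n_verifiers // 2

for n in (1, 3, 101):
    correct = sum(same_model_vote(case, n) for case in range(1000))
    print(f"{n:>3} identical verifiers -> {correct / 10:.1f}% precision")
# Prints ~85.0% for every n: a consensus of clones is still just one opinion.
```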

Much like the CAP theorem is a fundamental limitation on distributed systems, stemming from the current technologies used to implement said systems, so is the GAP theorem a fundamental limitation on a "universal LLM verifying" machine... at least given current LLM technologies.

I'm willing to concede that if, say, you have an Alibaba-like model running 405B parameters, the GAP theorem probably does not apply to your model, it being general and precise by sheer power of computation... so, yeah, I guess with enough additional power we can overcome the GAP theorem.

Then again, considering the costs of running even a straightforward question on that model... I doubt even Alibaba is using their monster model for anything but the most pressing, urgent, profit-driving issues.

So, not as general as it may seem at first.

GAP theorem stands tall.

The future's calling

Much like the CAP theorem doesn't prevent us from building distributed systems, it merely forces us to make compromises in their design and be honest about them, so will the GAP theorem not prevent us from building LLM verifiers... it does mean there can never be a "universal verifier", though, and it forces anyone building such a verifier to make a compromise, and to be honest about it, to themselves and to potential clients!

Some of these future tools will choose to forgo generality - those will be tools used in high-risk domains: highly regulated, high-stakes, lives-on-the-line domains.

Those tools will be trained on domain-specific data and supervised by human domain experts until the model shows extreme confidence in the domain.

Some will choose to compromise the verifier's precision - being "close enough" will be good enough.

I'm guessing those will be the bulk of these tools. We need to make sure our original LLM's answer is plausible-sounding; we're not in the business of splitting hairs. We need time-to-market, not the rigor of a mathematical proof.

I don't believe any of these tools will forgo automation, though, that would be counter-productive and purpose-defeating.

Well, perhaps some of the high-risk domains mentioned above will be made to give up automation, as well as generality, by regulation. Maybe not.

Whichever way each of these tools sways, it will not be able to disguise that choice - and that's a win for the entire industry.

Since a universal verifier is unattainable, each of these tools will have to concede which of the three "legs" of GAP it gave up, allowing its users to better evaluate whether that tool is the right tool for their use-case.

Are we there yet?

The realm of LLMs, and it is a realm, not "just a new technology" by even the slightest chance, is in its infancy, and already showing just how vast and grand it will be. And it will.

As with anything new, uncharted, unknown, nothing is really worked-out yet. Nothing is a "solved problem".

Heck, we don't even know what we don't know about this... thing.

It's good, therefore, to at least know its limits. To know they actually exist. That, vast as it is, even this realm does have a line in the sand that can't be crossed.

Maybe it's just me, but I find it reassuring. As a father to a toddler, I know all too well how much limits are something every infant needs to have... even one made up of bits. 😃
