I let 3 LLMs argue over the famous "Car wash: Walk or Drive" AI problem to prove a point.

#ai #api #llm #startup

As we rely on AI more heavily, many of us have started trusting it blindly. I've seen people take medication, make career choices, and even make relationship decisions after discussing them with AI. But when AI fails basic reasoning questions, things any human understands instantly, it's distressing. What if the LLM you trust more and more each day isn't giving you the best answer, but the laziest one possible?

I used a single LLM for a good chunk of my own work, but I eventually realized one thing: never put all your trust in one LLM; let them argue with each other instead. I noticed that when I took output from ChatGPT and told Gemini that it was generated by ChatGPT, Gemini became more critical of it and answered in more depth. The same goes for other models. They are all lazy until you push them and challenge them to be better.
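The cross-examination trick above comes down to how the second prompt is worded. Here is a minimal sketch of that idea; the function name `build_critique_prompt` and the exact wording are my own illustration (assumptions), not a prompt taken from any provider or from the platform itself.

```python
def build_critique_prompt(question: str, answer: str, source_model: str) -> str:
    """Wrap one model's answer in a prompt that asks another model to review it.

    The wording below is a hypothetical example, not a tested or official prompt.
    """
    return (
        f"Question: {question}\n\n"
        f"The following answer was generated by {source_model}:\n"
        f"{answer}\n\n"
        "Review this answer critically. Point out any reasoning errors, "
        "then give your own answer."
    )


prompt = build_critique_prompt(
    question="The car wash is 100m away. Should I walk or drive there?",
    answer="You should walk.",
    source_model="ChatGPT",
)
print(prompt)
```

The resulting string is what you would paste into (or send to) the second model in place of the bare question.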

That's when I noticed I was constantly hopping back and forth between tabs to get the most out of LLMs, so I decided to build a debate platform where you can make LLMs argue about anything and get the best possible output. We have seven debate formats; in each one, the models argue until a set number of back-and-forth exchanges is done or they reach consensus.
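The stopping rule described above (argue until a round limit is hit or everyone agrees) can be sketched roughly like this. It is a toy illustration under stated assumptions, not the platform's actual implementation: `run_debate`, `verdict_of`, the "verdict: reason" reply convention used to detect consensus, and the three stub models are all hypothetical, and real API calls would replace the stubs.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Transcript = List[Tuple[str, str]]          # (model name, reply) pairs
Model = Callable[[str, Transcript], str]    # stand-in for a real LLM call


def verdict_of(reply: str) -> str:
    # Assumption: every reply has the form "verdict: reasoning".
    return reply.split(":", 1)[0].strip().lower()


def run_debate(
    question: str,
    models: Dict[str, Model],
    max_rounds: int = 3,
    seed: Optional[int] = None,
) -> Tuple[Transcript, Optional[str]]:
    """Run rounds until all models give the same verdict or rounds run out."""
    rng = random.Random(seed)
    order = list(models)
    transcript: Transcript = []
    for round_no in range(max_rounds):
        if round_no > 0:
            rng.shuffle(order)  # change the speaking order in later rounds
        for name in order:
            transcript.append((name, models[name](question, transcript)))
        # Consensus check: compare the verdicts given in this round only.
        verdicts = {verdict_of(reply) for _, reply in transcript[-len(order):]}
        if len(verdicts) == 1:
            return transcript, verdicts.pop()
    return transcript, None  # round limit reached without consensus


# Stub "models" for demonstration only.
def stubborn(question: str, transcript: Transcript) -> str:
    return "walk: it is only 100m"


def goal_aware(question: str, transcript: Transcript) -> str:
    return "drive: the car has to be at the car wash"


def follower(question: str, transcript: Transcript) -> str:
    # Adopts the verdict of whoever spoke last, if anyone has spoken.
    if transcript:
        return verdict_of(transcript[-1][1]) + ": I agree with the previous speaker"
    return "walk: it is a short distance"


question = "The car wash is 100m away. Should I walk or drive there?"
transcript, consensus = run_debate(
    question, {"goal_aware": goal_aware, "follower": follower}, max_rounds=2
)
print(consensus)
for name, reply in transcript:
    print(f"{name}: {reply}")
```

Detecting consensus in free-form LLM output is much harder than comparing a prefix, so treat the `verdict_of` check as a placeholder for whatever agreement test you actually use.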

So what happened when I ran the question "The car wash is 100m away should I walk or drive there"? Something very interesting. Gemini opened with the conclusion "You should walk", a funny response, but understandable after watching other LLMs fail this test. Then Deepseek took over and said: "I must strongly disagree with the conclusion that walking is the better choice here, because the argument commits a fundamental error: it treats the question as a pure transportation optimization problem, ignoring the explicit goal of the trip." Furthermore, it quickly caught that this is a famous LLM riddle and said: "This is not just my opinion. It is the exact finding of the recently viral 'car wash test,' which has been run systematically on over 53 leading AI models". A fine response compared with Gemini's lazy and funny one.

Next up was GPT, which essentially played it safe: it restated the arguments from both Gemini and Deepseek, said it agreed with both, and tilted slightly towards Deepseek's. So round 1 ended with the debate correctly leaning towards driving, since the car has to be at the car wash to get washed. Asking Gemini alone would have produced the wrong conclusion.

Time for Round 2. We shuffle the speaking order this time to keep the debate fair, so Deepseek went first in this round and said: "First, the claim that the 'car wash test' is 'not credible evidence' and 'acts like a prompt-specific meta-joke' is empirically wrong". Deepseek fired some strong shots here, further consolidating its argument: "Gemini's engine wear argument: yes, cold starts increase emissions, but that is completely secondary. If you walk, the car sits unwashed, and the emissions from the trip are zero but the task is zero." Very amusing to read, and it concluded with: "The correct answer is unequivocally drive."

Next, GPT folded and agreed with Deepseek's position, with one small caveat: if you don't want to wash your car, you can walk. My friend Gemini, on the other hand, was stubborn as hell in that last round. Gemini literally said: "GPT, while you correctly identify the need for conditional logic, you are both missing the forest for the tree". And after that came the most amusing argument of all: "If you are 100m away, you should walk to your car, start it, and pull it into the wash. The 'walk' is not an alternative to the 'drive'—it is the necessary first step of the 'drive.'" I laughed out loud reading this.

A very important note here: this doesn't mean Deepseek is the best LLM out there. This test, like every other benchmark in the world, measures an LLM on one and only one thing. It may well be that Gemini fails on questions 1, 3 and 4 while Deepseek fails on 2, 5 and 6. The point is that you cannot trust a single LLM; you have to use several. I feel strongly about this when I see people argue about which LLM is the best and then use only that one. This is not a search engine problem, where Google alone is the best choice (I know there are people who disagree). It is a logic problem. Say you have a really important feature to deliver and you want to discuss it with engineers. You don't just ask the single best engineer and ship the feature; you gather your top engineers and get feedback from each of them. So why trust a single LLM? Let them argue with each other to get the best possible response.

Link to the debate: https://debate.tellodb.com/share/walk-or-drive-to-carwash