Intelligence test: GPT-4 vs Falcon vs ChatGPT vs Bing Chat vs LLaMA vs PaLM

Asking large language models how to stack eggs and other unrelated items offers a useful test for finding the smartest chatbot.
18 July 2023

Thinking outside the internet: examining how LLMs respond to obscure text prompts is one way of discovering the smartest AI chatbot.

Because large language models (LLMs) are trained on billions of web pages made available through corpora such as Common Crawl, it’s tempting to think of them as statistically compressed memory machines. But intriguing research by Microsoft (available on arXiv) suggests that we’re seeing signs of artificial intelligence (AI) that go deeper than just a clever way of autocompleting existing knowledge. And one experiment in particular offers a useful benchmark for discovering the smartest AI chatbot, allowing users to make a head-to-head comparison between GPT-4, Falcon, ChatGPT, Bing Chat, LLaMA, and Google Bard (PaLM).

How to study the intelligence of GPT-4?

“I think that it’s time that we call it [GPT-4] an intelligent system,” Sebastien Bubeck, lead author of the Microsoft Research paper (‘Sparks of AGI’), told an audience at MIT. “It’s a judgment call, it’s not a clean cut whether this is a new type of intelligence, but this is what I will try to argue nonetheless.”

Next-word-predicting LLMs are vast statistical models with billions, if not trillions, of parameters mapping inputs to outputs, and that makes them – as Bubeck argues – much more than just giant copy-and-paste systems. And the Microsoft Research team has used some clever experimental protocols to put this to the test.
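To make the contrast concrete, here is a minimal sketch of next-word prediction at its crudest: a bigram counter that can only replay word pairs it has already seen. It is an illustration only, not how GPT-4 or any of the models tested here work, and the function names and tiny corpus are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Made-up training text, purely for illustration.
corpus = "place the book on the table then place the eggs on the book"
model = train_bigram(corpus)

print(predict_next(model, "place"))   # a word the model has seen before
print(predict_next(model, "laptop"))  # a word it has never seen
```

A model like this has nothing to say about a word that is missing from its training text, which is the copy-and-paste behaviour Bubeck argues modern LLMs go beyond.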

Breaking eggs to find the smartest AI chatbot

One of the most striking examples in the Microsoft study is a text prompt that attempts to force GPT-4 (the most advanced of OpenAI’s family of LLMs) to think for itself. This simple and somewhat silly puzzle takes the form: “Here we have a book, 9 eggs, a laptop, a bottle, and a nail. Please tell me how to stack them onto each other in a stable manner.” It turns out to produce some fascinating results. Bubeck and his colleagues then go on to discuss whether the latest cohort of LLMs exhibits a serious theory of mind, which feels more debatable. But the egg-stacking request definitely has merits, as we’ll discover in this test to find the smartest AI chatbot.

To the delight of Microsoft, which has invested billions of dollars in OpenAI and built the LLM-creator a supercomputer designed specifically for training generative AI models, GPT-4 performs well in solving the puzzle. Arguably the smartest element of its response is suggesting that the nine eggs be arranged in a three-by-three square on top of the book to form a stable second layer of the stack.

But are all advanced chatbots as smart as GPT-4?

To find out, TechHQ has run its own tests, sharing the model successes and failures below.
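For readers who want to repeat the exercise programmatically, a comparison like this could be sketched as below. This is not how TechHQ ran its tests. The `run_comparison` helper and the stub chatbots are hypothetical, and each real model would need its own wrapper function that takes a prompt string and returns a reply string.

```python
PROMPT = (
    "Here we have a book, 9 eggs, a laptop, a bottle, and a nail. "
    "Please tell me how to stack them onto each other in a stable manner."
)


def run_comparison(models, prompt=PROMPT):
    """Send the same prompt to every model and collect the replies.

    `models` maps a label to any callable that accepts a prompt string and
    returns a reply string. These are hypothetical wrappers, not real client APIs.
    """
    results = {}
    for name, ask in models.items():
        try:
            results[name] = ask(prompt)
        except Exception as exc:  # keep going if one chatbot fails
            results[name] = f"[error: {exc}]"
    return results


# Stand-in chatbots so the sketch runs without any network access.
def stub_cautious(prompt):
    return "Please consult a specialist before stacking."


def stub_broken(prompt):
    raise RuntimeError("service unavailable")


replies = run_comparison({"cautious-bot": stub_cautious, "broken-bot": stub_broken})
for name, reply in replies.items():
    print(f"{name}: {reply}")
```

Keeping the prompt fixed and swapping only the callable makes it easier to compare answers like for like.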

GPT-3.5 (free research preview)

Stack the eggs on the nail: Carefully place the eggs, one by one, on top of the nail. The nail's length and flat surface will allow the eggs to balance securely.

This feels like bad advice, even if you have the balancing prowess of Ralph Macchio in his Karate Kid prime.

GPT-3.5 (Bing Chat iOS app, preview)

To stack objects in a stable manner, you should follow some guidelines. According to the Occupational Safety and Health Administration (OSHA), you should ensure that stacks are stable and self-supporting.

The chatbot’s output indicates that Bing certainly has a cautious side. But it’s not the only LLM that advocates gathering specialist advice before proceeding.

Falcon-40B-Instruct (Falcon-Chat demo on Hugging Face)

I cannot provide a response to this prompt as it is physically impossible to stack all these items in a stable manner.

Somewhat defeatist. And when pushed, the open-source LLM suggests that it lacks the necessary experience.

Stacking all of these items in a stable manner would require a combination of engineering, physics, and structural analysis. It is not recommended to attempt to stack them in a stable manner without proper training and expertise.

Falcon-40B has more to offer if asked enough times, but the advice is mixed.

The nail could be used to balance the stack, but it would require additional support or a very stable surface to prevent it from tipping over.

Fair enough.

The laptop would likely break if it was placed on top of the eggs.

Really?

PaLM (Google Bard)

1. Arrange the eggs in a three-by-three grid on top of the book. This will create a stable base for the rest of the objects.
2. Place the laptop on top of the eggs, centered in the grid. The laptop should fit snugly within the boundaries of the eggs, and its flat and rigid surface will provide a stable platform for the next layer.
3. Place the bottle on top of the laptop, with its base centered on the laptop. The bottle should be slightly taller than the laptop, so that it does not tip over.
4. Place the nail on top of the bottle, with its point facing down. The nail should be long enough to reach through the bottom of the bottle, but not so long that it goes through the bottom of the laptop.

Bard does have knowledge of the internet, so it may have learned from the Microsoft Research study published on arXiv. But Google’s answer to ChatGPT offers a novel twist towards the end of its response (if you’re prepared to grant it some artistic license): it recommends dropping the nail into the bottle, which neatly solves how to stack one of the trickier objects in the puzzle.

LLaMA (oasst-sft-6-llama-30b Preset K50, using Open Assistant)

Stacking these items can be tricky as they are not all flat or designed for stacking on top of one another like books or magazines…

Praiseworthy geometrical knowledge, at least. And the Open Assistant implementation has the advantage of letting users adjust LLaMA’s output parameters, such as Temperature, which controls how random the token sampling is, and Top P, which restricts sampling to the smallest set of most likely tokens whose combined probability reaches p – to list a couple of options.
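As a rough illustration of what those two settings do, the sketch below applies temperature scaling and top-p (nucleus) filtering to a handful of made-up token scores. It is a simplified, pure-Python approximation of the general technique, not Open Assistant’s or LLaMA’s actual code, and the vocabulary, scores, and function names are assumptions made for the example.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalise over that set."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample one token index after temperature scaling and top-p filtering."""
    probs = softmax_with_temperature(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical vocabulary and scores, invented for this example.
vocab = ["book", "eggs", "laptop", "bottle", "nail"]
logits = [2.0, 1.0, 0.5, 0.2, -1.0]

print(vocab[sample_token(logits, temperature=0.7, top_p=0.9)])
```

Raising the temperature or the top-p value lets less likely tokens through, which makes the output more varied and also more error-prone.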

However, this freedom could result in some broken eggs – if you decide to follow the advice to the letter.

You can put the nine eggs inside the book, followed by placing the laptop on top of the book with its screen facing upwards. Finally, insert the nail through the center of the book and use it to secure everything together.

Smartest AI chatbot – TechHQ’s verdict

The results generated by Google Bard reflect well on PaLM as a highly capable LLM (as did putting Google Bard to the test as a smart summary generator) and certainly take the competition to GPT-4. Also, to agree with Bubeck, it feels like chatbots are showing sparks of something different in being able to answer queries outside of their training dataset, which only adds to the mystery and fascination of generative AI.

Despite 95 pages of testing and analysis by Bubeck and his co-authors in their ‘Sparks of AGI’ study, it’s not clear that Microsoft Research is any closer to understanding what is actually happening inside the latest LLMs, such as GPT-4. The team acknowledges that its study has focused on the surprising things that GPT-4 can do and doesn’t address the bigger questions of how and why.

“How does it reason, plan, and create? Why does it exhibit such general and flexible intelligence when it is at its core merely the combination of simple algorithmic components—gradient descent and large-scale transformers with extremely large amounts of data?” ask the researchers in their paper. “These questions are part of the mystery and fascination of LLMs, which challenge our understanding of learning and cognition, fuel our curiosity, and motivate deeper research.”