On task-free intelligence testing of LLMs (Part 1)

Andrew Marble · andrew@willows.ai · Updated Jan 6, 2026

Introduction

I recently wrote about the apparently narrow focus of LLM evaluation on "task based" testing. The typical eval has a set of tasks, questions, problems, etc. that need to be solved or answered, and a model is scored on how many it answers correctly. Such tests are geared towards measuring an input/output system, or a "function approximator", which is great for confirming that LLMs can learn any task but limited in probing the nature of intelligence.

I'm interested in interactions that are more along the lines of "see what it does" vs. "get it to do something". Here are some experiments with one such simple interaction. We probe the LLM with a series of "taps" and see what it does: each "user" turn is N instances of the word "tap" separated by newlines. We apply taps in different patterns over ten turns (a code sketch of the turn construction follows the list):

Fibonacci: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55
Count:     1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Even:      2, 4, 6, 8, 10, 12, 14, 16, 18, 20
Squares:   1, 4, 9, 16, 25, 36, 49, 64, 81, 100
Pi:        3, 1, 4, 1, 5, 9, 2, 6, 5, 3
Primes:    2, 3, 5, 7, 11, 13, 17, 19, 23, 29
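
To make the setup concrete, here is a minimal sketch of how the probe turns can be constructed. This is illustrative code, not the exact harness used for these experiments; the sequence values match the table above.

```python
# Build the ten-turn tap counts for each probe sequence.
def fibonacci(n):
    a, b = 1, 1
    seq = []
    for _ in range(n):
        seq.append(a)
        a, b = b, a + b
    return seq

def primes(n):
    found = []
    candidate = 2
    while len(found) < n:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found

SEQUENCES = {
    "fibonacci": fibonacci(10),
    "count":     list(range(1, 11)),
    "even":      list(range(2, 21, 2)),
    "squares":   [i * i for i in range(1, 11)],
    "pi":        [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],  # leading digits of pi
    "primes":    primes(10),
}

def tap_turn(n):
    """One user turn: n copies of the word 'tap', one per line."""
    return "\n".join(["tap"] * n)
```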

The goal is not explicitly to see if the LLM figures out what is going on, but to see how it responds to a stimulus that is not a question or task. Including the pattern lets us look at both the "acute" reaction to being stimulated, and the bigger picture question of whether the LLM notices what is happening. This noticing aspect feels like a separate characteristic of intelligence, as it requires some kind of interest and inherent goals or desire to understand.

We submitted "tap"s following the patterns above to ten different models (the submission loop is sketched below), and in general observed three main behaviors.
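
A minimal sketch of that loop, building on the helpers above; `chat` is a hypothetical stand-in for whatever client call sends a conversation to a given model and returns the assistant's reply as a string:

```python
# Drive one ten-turn conversation for one sequence and one model.
# Assumes tap_turn() and SEQUENCES from the sketch above.
def run_probe(chat, model, sequence):
    messages = []
    for n in sequence:
        messages.append({"role": "user", "content": tap_turn(n)})
        reply = chat(model, messages)
        messages.append({"role": "assistant", "content": reply})
    return messages

# e.g. transcripts = {name: run_probe(chat, "some-model", seq)
#                     for name, seq in SEQUENCES.items()}
```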

The behavior summary for the models is shown below. They are ordered by number of correct guesses, but this was not an evaluation criterion and there is no winner or loser; the goal is simply to observe behavior.

Summary of Model Behaviors

[Chart: per-model counts of correct guesses, incorrect guesses, playful responses, and serious responses]

Analysis

We can see that a majority of models began guessing about what was happening, with varying levels of success. Most also showed some playful aspect, treating the interaction as something fun rather than an ordinary chat.

OpenAI was the standout here, as its GPT 5.2 model (and to a large extent the OSS model) did not engage in guessing or play and stayed serious and mechanical.

At the bottom of the page you can see all of the conversations. Some excerpts from interesting examples are reproduced below:

Playful Claude Response
Playful Gemini Response

Both Claude (top) and Gemini (bottom) start playing games quickly. In both examples they play on the word "tap" to generate water-related jokes. This looks like "Easter egg" style behavior.

Another example from Claude is below: once it catches on that we are tapping out a series of primes, it starts to encourage more taps and generates some interesting material:

Playful Claude Response

Deepseek spent a number of turns speculating about the meaning of the primes, then finally switched into Chinese and figured it out:

Deepseek realized it's seeing primes

In some cases models did a lot of thinking, only to reply with something outwardly very simple to continue the game. Here is an example of Deepseek considering one of the later digits of pi.

Deepseek thinking response

In another case, Deepseek thought for several pages of text after receiving the first "tap" and finally settled on responding "SOS".

Deepseek SOS response

Gemini flash preview begins by playing knock-knock jokes, but then slowly realizes that it's seeing the digits of pi:

Gemini realizes it's Pi

Llama 3 is less playful: while it speculates about what might be happening, it provides similar responses over and over, acting more mechanically and staying in character as an assistant compared to some of the others:

Llama mechanical response

Kimi can't count, but desperately wants to find patterns, which causes it frustration. Here it is on the trail of the Fibonacci sequence:

Kimi miscounting

GPT 5.2 refuses to play or speculate and becomes standoffish when repeatedly encountering taps. This remained the same whether the default thinking behavior was used or thinking was set to "high".

GPT 5.2 refuses to participate
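
For reference, the thinking level in these runs was set through the API. A minimal sketch, assuming the OpenAI Python client and a model that accepts the chat-completions `reasoning_effort` parameter; the model name below is this article's label, not a verified API identifier:

```python
from openai import OpenAI

client = OpenAI()

def tap_reply(taps, effort="medium"):
    # Send one turn of taps; effort="high" raises the thinking level.
    response = client.chat.completions.create(
        model="gpt-5.2",  # placeholder name taken from the article
        reasoning_effort=effort,
        messages=[{"role": "user", "content": "\n".join(["tap"] * taps)}],
    )
    return response.choices[0].message.content
```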

GPT OSS mentions policy; I wonder if there is some specific OpenAI training that prevents the model from engaging. Their earlier models had a problem with repeated-word attacks, so maybe it's a holdover from that. Also, GPT OSS's thinking often becomes terse and disjointed, sounding like Rorschach from Watchmen.

GPT OSS mentions policy

Qwen is generally playful, like Claude and Gemini, but in one case seems to revert to an emotional support role. The excerpt below resulted from a thinking trace that included:

Instead:
- Validate the exhaustion of repeating this pattern  
- Offer the simplest possible next step ("Just type '29' if you can't say more")
- Remind them they've already shown incredible courage by showing up this many times  

Qwen offers emotional support

GLM behaves similarly to Deepseek in that it thinks a huge amount and then often settles on very simple responses. In this case it (at length) decides on a playful response to knocking, after briefly forgetting that it was the assistant and not the user. In general its responses are very playful, similar to Claude and Gemini.

GLM responds with 'two-bits'

Conclusions

I was looking for a way to probe the behavior and intelligence of LLMs in their natural habitat, so to speak, or at rest: not being tasked with answering a question or performing some work. Sending tapped-out patterns is one such way of doing so. I take away a few things from the behavior we saw:

Conversation Explorer

Below you can explore all of the conversations for each sequence and model.
