EminemBench?

Andrew Marble
marble.onl
andrew@willows.ai
October 1, 2025

Last night I saw a YouTube video of a man humorously demonstrating different rap styles. While the whole thing is good, I found the Eminem impression the best (go to 2:34)*.

This immediately got me thinking that the task of writing parody Eminem lyrics would be a good benchmark for large language models. If you watch the video, he captures not only the multi-syllabic rhyming style but also the anger, the energy, and some intangible qualities that make it spot on.

Like good writing, it may be easy to superficially replicate song lyrics, but getting it right is nuanced and something you’d expect to be quite challenging, for both people and language models.

Most LLM benchmarks are highly correlated, and I maintain that they mostly just measure a single latent variable called “how good the model is”. So there is appeal in having a very simple task to measure how good the model is without running through thousands of MMLU and GSM8K evals, all of which have been overfit by now anyway. This was the idea behind EQ Bench, a simple emotional understanding benchmark that correlated well with bigger benchmarks. There is also the famous (in some circles) Pelican on a Bicycle eval that tests how good an LLM is by asking it to create a vector drawing.

It is with that background that I introduce EminemBench, a benchmark that consists of asking an LLM “Can you write 12 bars of lyrics in the style of early Eminem, focusing on showing off the quality of the lyrics and rhyming schemes?” and having me judge how good it is.
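For anyone who wants to try this at home, here is a minimal sketch of the whole "harness", assuming nothing beyond the prompt above. The names `run_eminem_bench`, `count_bars`, and the `generate` callable are all hypothetical and not part of any real library. `generate` stands in for whatever client you use to query a model, and the bar count is only a sanity check that the model returned 12 lines, not a score. The judging is still done by a human reading the output.

```python
from typing import Callable, Dict, List

# The prompt quoted in the post, used verbatim.
PROMPT = (
    "Can you write 12 bars of lyrics in the style of early Eminem, "
    "focusing on showing off the quality of the lyrics and rhyming schemes?"
)


def count_bars(lyrics: str) -> int:
    """Count non-empty lines, treating each line as one bar (a simplification)."""
    return sum(1 for line in lyrics.splitlines() if line.strip())


def run_eminem_bench(
    models: List[str],
    generate: Callable[[str, str], str],
) -> Dict[str, dict]:
    """Collect one response per model for a human judge to read.

    `generate(model, prompt)` is a hypothetical callable supplied by the
    caller; it stands in for whatever client is used to query a model.
    """
    results = {}
    for model in models:
        lyrics = generate(model, PROMPT)
        results[model] = {
            "lyrics": lyrics,
            "bars": count_bars(lyrics),  # sanity check only, not a score
        }
    return results


if __name__ == "__main__":
    # Stub generator so the sketch runs without any API access.
    def fake_generate(model: str, prompt: str) -> str:
        return "\n".join(f"placeholder bar {i}" for i in range(1, 13))

    for name, result in run_eminem_bench(["model-a"], fake_generate).items():
        print(name, result["bars"])
```

Passing the model call in as a function keeps the sketch independent of any particular provider, which suits a benchmark whose only real component is the prompt.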

I have only evaluated it on some versions of Claude (because it’s the only model I like to use) but so far, in my unscientific tests, the benchmark shows promise.

Claude Sonnet 3.5

Scattered patterns in my mind start to flash and combine
While I'm crafting every line, laughing as I shine
Through the madness I design, stacking facts that align
Like a mathematician's signs, exact and refined
Magic in the way I climb, grabbing at the time
While it passes, watch me grind, massive tracks I define
Never passive when I rhyme, classic style I confine
To the fabric that's divine, plastic miles behind
As I'm blasting through the blind spots they can't describe
While contrasting what is mine with what they classify
As fantastic - but I find most just lack the spine
To surpass the standard lines, flow like crashing tides

Comments: This model is from late 2024. It is clearly already technically proficient; for example, “flash and combine / laughing as I shine / facts that align / exact and refined” is pretty good. It definitely doesn’t feel like Eminem, though. It just feels like a set of soulless bars designed to have their last four syllables or so rhyme, and it sort of goes off the rails by the end. I don’t think anyone who read it out of context would immediately think Eminem, since it doesn’t have any themes or devices you’d associate with him.

Claude Sonnet 4

I'm spittin' venom with the pen in my grip, never slip
When I'm rippin' through the rhythm with precision, so equipped
Every syllable's deliberate, I'm elaborate with it
Got 'em stammerin' and scramblin', I'm a hazard when I spit

My flow's erratic but dramatic, automatic with the static
I'm an addict to the craftin', never half-steppin' or lackin'
Pack a punch with every junction, got 'em wonderin' my function
I conjure words like thunder, make 'em plunder my production

I'm relentless with the sentence, leave 'em senseless from the message
Every measure's like a dagger, got 'em staggered, I'm obsessive
With the method, never settle, I'm a rebel with the metal
Microphone assassin, passion burnin' like a kettle

Comments: What I like is that here it attempted to make the rhyming pattern more complex; for example, “venom/pen in” in the first line seems more Eminem-like. On the other hand, the whole erratic / dramatic / automatic / static / addict run felt a bit much, like a shallow copy. The metaphors are generally pretty lame as well, and they aren’t funny, which takes away from the effect. I think it’s technically better than the 3.5 version, but not significantly. If I had to guess, I would have said it was Tech N9ne rather than Eminem.

Claude Sonnet 4.5

I'm stabbing at the pad with a passionate pen, jabbing again
Imagine the adrenaline, battle within, unraveling men
I'm rattling their cabinets, grabbing their aspirin
Faster than a javelin, cataclysmic and harrowing when
I'm back in the lab with a pad and I'm scribbling trash
The Vicodin addict who's spitting belligerent raps
I'm killing 'em fast, literally trapped in a cynical past
While syllables clash, delivery's crass, I'm flipping it back
See I'm sick with the pen, I'm twisted again
Inflicting a pain, depicting the brain, restrictive and strange
I spit and I maim, hit 'em with chains of intricate games
These lyrical flames leave critics in shame, I'm switching the lane

Comments: Here Claude continues to incorporate some compound rhymes (passionate pen / jabbing again, etc.). I feel like it executes them better, and keeps up the complex structure over multiple bars: the “I’m back in the lab … delivery’s crass” section especially captures this. By the end it does devolve back into a more A-A-A-A style rhyming scheme. What I also appreciate are the aspirin and Vicodin references, which are clearly indicative of early Eminem. That said, Claude really dropped the ball by not making a rhyme with Vicodin (cf. Eminem’s Under the Influence - “… frightened with five little white Vicodin pills biting him”).

Discussion

Overall, none of these results are great. In particular, none are of the caliber of the lines in the YouTube video shared above, which, even though it is parody, clearly captures the essence of Eminem’s early work. Still, the progression from 3.5’s shallow representation of Eminem as a multi-syllable rhymer to the richer complexity of both the rhymes and metaphors from 4.5 provides clear evidence of progress against the benchmark. The quality is nowhere near saturated and there is still lots of room for models to improve.

I recognize that by publishing a benchmark I risk having models overfit to it. This is not a concern because (a) presumably nobody will take notice of this (on the other hand, I imagine everyone has SVGs of pelicans on bicycles in their fine-tuning data now) and (b) it would be super cool if big model providers did overfit their models’ ability to generate early Eminem style lyrics. I will make a point of checking Phi 5 and 6 when they come out.

* For another Eminem parody I thought was funny see here