Accountable large language models work by storing labeling effort. Prompting is not a substitute.

Andrew Marble
Jan 26, 2024

With respect to AI language models, there has been a lot of focus on “prompting” and how it can make or break model performance. There are some cool examples of what can be achieved with suitable prompts, from changing behavior to circumventing constraints. And it’s a rewarding way for a nonspecialist to play with a language model. But prompting is often a form of overfitting, leading to brittle results that are not reliable if we want to assure certain behavior from an AI model. If you are going to use prompting, it needs to be done with the same rigor as any other model parameter variation. This is the province of ML engineers and not “prompt hackers”. Moreover, leading AI models do not rely on operator prompting to constrain their behavior, and instead undergo rigorous supervised training that is closer to the way traditional deep learning models are developed. Following the examples of how models like Meta’s Llama2 are trained can provide insights into building accountable models in a way that playing with prompts cannot.

There’s a fable called “stone soup” about a town where nobody has anything to eat, and a drifter proposes they make stone soup by boiling stones. Once they get the stones going, he says "almost done, it would be even better with a few carrots in it” and someone produces a few carrots. Repeat with all the constituents of soup and the village made and ate a regular soup under the premise that it was just water and stones.

I have a feeling that current AI language models have something in common with the soup. We talk about the power of self-supervised training on massive internet-scale datasets – Llama, GPT et al. are able to scale their training data because they predict next words (which are automatically available in text data scraped off the internet) instead of needing manual labels. But when it comes to getting the performance and behavior exhibited in the top ranked conversational AI, the models perform much better if we just sprinkle in some labeled data. This is particularly true with respect to accountable behavior – making sure the model performs the way we want such as not making stuff up, not leaking inappropriate information, not violating copyright, not saying something offensive, etc. depending on the model use. A few labeled examples of chats, RLHF (a training method) using a few human preference scores, and before you know it, we’ve got a “self-supervised” language model that’s world beating1. Here I’m focusing on building accountable models suitable for high-value commercial tasks that need to have particular behavioral constraints and assurances. There are lots of interesting models that have less training but they’re loose cannons that would be hard to use responsibly for commercial tasks.

The most well behaved models are very well trained and contain untold amounts of labeling effort. OpenAI hired an army of data labelling contractors for ChatGPT and its recognizable style (“Let’s break it down; it’s crucial that; keep in mind that; by following these steps…”) is all attributable to supervised training. While OpenAI does not disclose how they train their models, Meta has published Llama2-Chat’s training protocol2 which includes over 27,500 chat examples (prompt and answer pairs), plus supervised human preference modeling, plus “context distillation” that trains the model to answer as if it has been prompted a certain way. There is also a raw self-supervised Llama2 model but I’m using Llama2 to refer to the chat model that has undergone specific fine tuning (what Meta hyperbolically calls “safety” as well as helpfulness).

Both ChatGPT/GPT3 and Llama2 are heavily constrained in what they can say and do, and are widely recognized as “milquetoastian4” and censored, often to ridiculous extremes. In a frequently shared example, Llama refuses to help with the Linux “kill” command which ends a program which is not responding5. Even the paper from Meta describing the model training uses the absurd example of refusing to write something about the moon landing being a hoax because it’s “misinformation” (although presumably this example was chosen to avoid showing the more political examples they used in training). While there is disagreement on whether the level and nature of the model censorship is appropriate, Llama2 is an underrated achievement in controlling language model behavior through fine-tuning.

Now contrast the engineering that went into Llama2 with “prompt engineering”. This is the idea that the user input to the model be preceded (or followed) with some extra text that steers the model towards a certain behavior. For example, a 2023 paper showed that a chat model produced more truthful answers if “According to Wikipedia:” is added to the end of input prompt6. There are countless examples of more complex prompts to achieve various ends. And there have been (largely BS) articles written about how prompting is the next hot skill7.

Prompting seems like a great quick fix – take an existing model, tell it up front to “always be truthful and respectful, if you don’t know something just say so and don’t make it up”. But in terms of robustness, meaning real-world performance across a range of inputs, prompting as commonly described is the polar opposite of supervised fine-tuning. In the base case, prompting is a means by which laypeople can commit the classic machine learning sin of overfitting. That is, finding something that works on the examples you have in front of you and assuming it will carry over to new examples.

In proper model training and evaluation, data is split into separate sets for experimentation and final confirmation of performance, and hyperparameters (degrees of freedom which would include the prompt) are systematically varied. And an important element of evaluation is how the output changes due to minor or irrelevant changes in the input. We often add noise or perform transformations on training data that doesn’t affect the labels in order to shore up the model against these kind of changes and make it less fragile. Prompting is kind of the opposite, trying to add specific wording to the input in hopes it will result in “better” output.

The waters are muddied considerably by the fact that prompting does work. Especially in models that haven’t been fine-tuned, it sets the context for the model, which is simply generating likely text completions. So if you want a question answering agent, it’s logical to prompt it with a few question/answer pairs so that the text it generates after an input question is likely a response. This is called zero-shot or in-context learning8, but it’s really just priming autocomplete so the likely completion is what you want it to be, and it hinges on proper prompting. Additionally, many models are trained with specific prompts as part of their supervised fine-tuning. And so including those prompts is likely to elicit the desired behavior because that’s how the model was trained. The Orca models9 are trained with “system messages” designed to control the kind of output it generated (by pairing with suitable responses during supervised fine-tuning). Llama2 also has a system prompt it is trained with; anecdotally I’ve found the model gives more concise answers and lectures you less if you just don’t use that prompt. (Note during proofreading: my anecdote is the opposite of a robust finding that you’d expect to hold in production and the kind of “engineering” I’m warning against.)

There is also lots of evidence that prompts about truthfulness or asking the model to stay on topic, prompting it to be “in character” etc. all can have a positive influence on the output. From the perspective of control or applying guardrails to ensure specific behavior though, unsupervised prompting is an empirical hack, and cannot be considered a methodical way of constraining a LLM. Prompting is attractive because it appears easy, the use of language makes it apparently interpretable (“I told the model not to make things up”) even if the interpretation is wrong, and it gives some feeling that we’re converging to an optimum as we try more things an build some intuition about the influence of the prompt. But it’s really just trying stuff randomly to see if it works, and usually the “see if it works” part isn’t all that rigorous. This happens in classical deep learning as well. We might try different batch sizes, learning rates, data augmentations, etc. and somehow get a feeling for what combination seems to work best. It’s not the end of the world, but it’s not really rigorous, and classically we’d at least have built proper training and test splits to see if the results hold for held-out data.

Prompting also does nothing to handle adversarial cases, edge cases, never before encountered inputs, uncertainty, etc. It almost certainly can make models more fragile with respect to these things if it’s been done haphazardly. The nature of overfitting is that it makes results better on seen examples and worse on new ones.

So what to do instead? First, if you’re going to use a prompt, recognize it for what it is, and apply all validation testing post-prompting, including holding out testing examples for when everything is finished, to make sure you don’t overfit to your validation data. This would be obvious to any ML engineer, so the point I’m really making is that any prompt optimization needs to be treated as part of the model engineering and handled by someone with the appropriate skillset, not a “prompt hacker”, “AI whisperer” or any of the other silly names media have come up with who’s main ML experience focuses on prompts.

Second, recognize that building an accountable LLM is really a supervised learning task10. Building appropriate training sets, and performing adversarial training and evaluation, just like with any other high-value model, is where the engineering value lies. This is going to cost money, but cheaping out by thinking a prompt can substitute for behavior control needs to be weighed against the cost of a failure. Much more than many would like to admit, AI model performance is really stored labelling effort. If there is not labeled data that’s similar to what the model is being asked, all bets are off. It’s crucial that well trained models include the span of expected inputs in their training examples, whatever anyone says about emergent behavior.

And finally (every time I write a list I’m now conscious of sounding like and LLM; writing “In conclusion” is even worse), never use a raw LLM for anything high value, even if it’s been well trained. Constrain the input and output using rules or possibly another trained model11 (more supervision).

In conclusion, generative AI (most of what I said here works with text-to-image as well) is much closer to classical supervised deep learning than we usually admit. It starts off with the “stone soup” pretext of training on vast amounts of unlabelled data, but by the time we reign it in to accountably do what we want, we’re back to supervised models and all the strengths and weaknesses they have. I’ve said something similar in the past12, generative AI feels new because it's more accessible since the inputs and outputs are pictures or text. But it’s still deep learning and has all the challenges that it did in 2021. In building commercially useful AI models, we have to understand and embrace these challenges. Looking for quick fixes like relying on prompting for responsible AI is going back to the same pattern of getting a few flashy demonstration results that don’t carry to real situations.

  1. The self-supervision is not wasted. Just like any pretraining, it’s an important step in adjusting the model to have a good understanding of the input data and be amenable to fine-tuning from limited examples. The pretrain->fine-tune paradigm is a long standing setup in machine learning training.↩︎


  3. There isn’t public information about how recent OpenAI GPT models were trained so I focus on Llama2↩︎







  10. I’m skipping over retrieval augmentation (RAG) here where the prompt is part of the input data, there are some other considerations that would take us on a tangent but fundamentally the same reasoning applies.↩︎