Foundational AI models do not violate copyright

Andrew Marble
July 18, 2023

There’s a long list of “bad” (and often nebulous) things that could plausibly be done with AI – generate misinformation, be biased or perpetuate stereotypes, say something offensive, perpetrate scams at scale, etc. These potential “harms” have been used as an excuse by some to suggest enforcing limitations on the technology itself, either in its capabilities or its ownership (usually to the benefit of those advocating the limitations). Recently, a “potential” concern that has been getting more attention is copyright1. I want to focus on one contention, among many listed in the lawsuits mentioned in the footnote article: that AI models violate copyright because it’s possible to get them to generate copyrighted content2.

While it’s possible to build an AI model that violates copyright, and it’s possible to use broadly capable foundation models like ChatGPT to generate copyrighted content, neither means that the training or existence of the models themselves violates copyright. Rather, it comes down to the same idea as with the other potential misuses: just because a model can do something doesn’t mean that’s what people are doing with it, and it’s inappropriate to penalize or limit technology just because it’s possible someone could do something bad with it.

Someone with a broadly trained conversational AI could decide to violate copyright by using it as a tool to generate some offending content – just as I could with a photocopier, a VCR, or my own memory. That doesn’t mean we need to limit what the model can do, any more than we need to limit what a VCR can copy. If I start a video bootleg operation, I’m liable for that. Not so for the VCR itself or its manufacturer3. This is the same tired argument we’ve seen with things like BitTorrent, encryption, or Bitcoin, where someone argues that the technology could be used for some nefarious purpose, so we need to limit its capabilities. In the context of modern AI, limiting broadly capable models in the name of copyright is no different than restricting Word or Photoshop to prevent you from typing or drawing something that’s copyrighted.

To be clear, it would be possible to build, say, a GAN (a kind of image-generating AI model) that only generates Mickey Mouse pictures. I think there’s a fair argument that such a model would violate copyright, because its only use is generating such content. Foundational models, by virtue of their broad understanding of the world, are able to generate copyrighted content, as well as inappropriate, sensitive, biased, and other undesirable outputs. In a sense, this capability is a necessary consequence of their broad understanding of the world. Trying to build in some kind of copyright protection mechanism would be as pointless and user-hostile as the measures the movie and record companies tried in the wake of Napster.

A more sensible approach, whether for copyright or for other potential misuses of AI, is to focus on regulating the use, where appropriate, rather than the capabilities of the technology itself. Parties who use AI to generate copyrighted content could be liable for infringement, same as those who use it to perpetrate some scam or to mistreat people algorithmically. In most cases, such misuses are already covered under various regulations as well as social norms. However, there may be value in better defining the distinction between foundational or broadly capable AI that has many non-infringing uses (like ChatGPT), and a narrowly trained model whose only purpose is some malfeasance4.

The aim here is not to dismiss potential misuse of AI. The point is that trying to legislate limitations at the capability level is the wrong way of going about it, and that we should focus instead on how people actually use the technology.

Note added July 25 - here is another interesting related paper: P. Henderson, X. Li, D. Jurafsky, T. Hashimoto, M. A. Lemley, and P. Liang, “Foundation Models and Fair Use.” arXiv, Mar. 27, 2023. doi: 10.48550/arXiv.2303.15715.


  2. A distinct issue is the copyright of the AI model itself; the discussion here is about whether the model violates copyright on the training data.


  4. As an aside, I think the foundational / narrow divide is a generally important concept when discussing IP in relation to AI. As another example, a foundational model derived from another – say, Vicuna from LLaMA – is clearly a derivative work. But a specialized narrow model, for example a binary image classifier fine-tuned from a model pretrained on ImageNet, might be considered differently.