Fair use is defined in Section 107 of the Copyright Act of 1976, which I’ll quote verbatim below:
Notwithstanding the provisions of sections 106 and 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—
1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
2. the nature of the copyrighted work;
3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
4. the effect of the use upon the potential market for or value of the copyrighted work.
The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors.
Fair use is a balancing test which requires weighing all four factors. In practice, factors (4) and (1) tend to be the most important, so I’ll discuss those first. Factor (2) tends to be the least important, and I’ll briefly discuss it afterwards. Factor (3) is somewhat technical to answer in full generality, so I’ll discuss it last.
None of the four factors seem to weigh in favor of ChatGPT being a fair use of its training data. That being said, none of the arguments here are fundamentally specific to ChatGPT either, and similar arguments could be made for many generative AI products in a wide variety of domains.
Suchir Balaji
Interesting analysis by a former OpenAI researcher who left the company and publicly spoke out against its business practices, going as far as giving an interview to The New York Times – a publication which last year sued OpenAI (and Microsoft) for copyright infringement, so naturally it would want to distribute Balaji's views. Moreover, in November he became a potential witness in that lawsuit after the Times' attorneys named him in court filings as having material helpful to their case, along with at least twelve others, including past or present OpenAI employees.

The story then takes a darker turn: Balaji was found dead on November 26, with his death initially ruled a suicide. Needless to say, the suicide of a second whistleblower this year, after Boeing whistleblower John Barnett, sounds at least a little suspect. Balaji's family doubts that his death was a suicide and has hired an independent investigator, so there may still be more to uncover here.
As for the AI copyright question, I am obviously not an expert, but generative AI does seem to stretch fair use arguments far more than, say, Google Search did before it. With search engines, there is still plenty of traffic directed back to the original publication, allowing copyright owners to maintain their brand, acquire new customers, and monetize their work. Generative AI, on the other hand, often delivers a single, fully formed answer, so there is little reason for most people to check out any links the chatbot might add.
OpenAI in particular seems to run afoul of the first factor in the list above: it started as a nonprofit, gathering training data ostensibly for research, but the products of that research have been heavily commercialized in recent years, and the company is now considering dropping its nonprofit status in favor of a more conventional business structure. It would seem obvious to me that obtaining data under the pretense of nonprofit activities, then turning around and selling the research results – and at the astronomical valuations OpenAI is aiming for – should not be allowed, but I highly doubt that the US justice system will punish this sleight of hand in any meaningful way. There is simply too much money at stake, after all…