Congress Takes on AI and Copyright
As litigation involving the use of copyrighted material to train generative artificial intelligence (GenAI) tools like ChatGPT winds its way through the courts, the US Congress is taking on the issue.
On July 21, Senators Josh Hawley (R-MO) and Richard Blumenthal (D-CT) introduced the AI Accountability and Personal Data Protection Act, which would prohibit companies from using copyrighted works to train GenAI tools without authors’ permission.
The Act also deals with the use of personal data in GenAI tools.
“Covered data” under the Act includes both personal data and data that
is generated by an individual and is protected by copyright, regardless of whether the copyright has been registered with the United States Copyright Office or any other registration authority…
The Act defines ‘‘generative artificial intelligence system’’ as “an artificial intelligence system that is capable of generating novel text, video, images, audio, and other media based on prompts or other forms of data provided by an individual.”
Under the Act,
Any person who, in or affecting interstate or foreign commerce, appropriates, uses, collects, processes, sells, or otherwise exploits the covered data of an individual, without the express, prior consent of the individual, shall be liable to the individual in accordance with this section.
The remedies under the Act are as follows:
An individual prevailing in a civil action brought under paragraph (1) may recover—
(A) compensatory damages in an amount equal to the greater of—
(i) actual damages;
(ii) treble any profits from the appropriation, use, collection, processing, sale, or other exploitation of the covered data of the individual as described in subsection (a); or
(iii) $1,000;
(B) punitive damages;
(C) injunctive relief; and
(D) attorney’s fees and costs.
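To see how the "greater of" compensatory-damages clause operates in practice, here is a minimal sketch. The function name and the dollar figures are hypothetical illustrations, not part of the bill's text:

```python
def compensatory_damages(actual_damages: float, profits: float) -> float:
    """Illustrative only: the greater of actual damages,
    treble profits from the exploitation, or the $1,000 floor."""
    return max(actual_damages, 3 * profits, 1_000)

print(compensatory_damages(500, 0))      # floor applies: 1000
print(compensatory_damages(2_000, 400))  # actual damages control: 2000
print(compensatory_damages(2_000, 900))  # treble profits control: 2700
```

Note that this covers only subsection (A); punitive damages, injunctive relief, and fees are available on top of the compensatory amount.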
“Consent” is an affirmative defense under the Act. However, consent isn’t deemed valid if it was obtained
as a condition of using a product or service through which the appropriation, use, collection, processing, sale, or other exploitation of the covered data exceeds what is reasonably necessary to provide that product or service.
The following language in the Act is presumably directed at website and app terms of service that typically limit such claims:
Any agreement purporting to waive, limit, or preclude the right of an individual to bring an action in a court of law or to participate in a joint, class, collective, or representative action concerning any claim arising under this Act shall be deemed contrary to public policy and shall be null, void, and unenforceable.
Also relevant to website/app terms of use, consent terms
(A) shall be presented distinctly and separately from any privacy policy, terms of service, or other general conditions or agreements; and
(B) shall not be satisfied by the mere inclusion of a hyperlink or general reference to a privacy policy, user agreement, or other similar document.
The bill was introduced a week after Hawley held a hearing of the Senate Judiciary Committee’s Subcommittee on Crime and Counterterrorism at which he called GenAI companies’ use of copyrighted works to train chatbots and other large language models (LLMs) “the largest IP theft in American history.”
Said Hawley,
For all of the talk about artificial intelligence and innovation and the future that comes out of Silicon Valley, here’s the truth that nobody wants to admit: AI companies are training their models on stolen material.
According to Publishers Weekly, the hearing
gave some hope to publishers and authors that at least some members of Congress seem willing to step up the fight against Big Tech companies, who knowingly violate copyright laws to train their large language models.
As Publishers Weekly reported,
The hearing featured five witnesses, four of whom argued that the AI companies’ training methods are a clear violation of fair use, while the fifth, Edward Lee, a professor at Santa Clara University School of Law, made the case that their methods could be protected by fair use and he cautioned that before Congress takes any action it should let the issues play out in court.
Lee cited two recent rulings in lawsuits brought against Anthropic and Meta, where both judges found that the companies’ copying was protected as fair use under copyright law. However, the judges did not find that the companies were entirely in the clear.
For example, in Anthropic, the court ruled that while using legally acquired copyrighted books to train AI large language models constitutes fair use, downloading pirated copies of those books for permanent storage violates copyright law.
One witness at the hearing noted that the use of pirate book sites by GenAI companies is especially concerning.
He noted that documents showed that Meta employees knew using pirated sites was illegal, but that Meta chair Mark Zuckerberg made the decision to proceed anyway. The witness concluded that “There is no carve out in the Copyright Act for AI companies to engage in mass piracy.”
Another witness was bestselling author David Baldacci, who has published 60 novels.
In his published testimony, Baldacci noted that he had worked away for decades, getting rejected over and over. However, he kept going, honing his craft, remaining disciplined, taking the rejections head-on, and using them as motivation, until finally he was successful.
Then his son asked ChatGPT to write a plot that read like a David Baldacci novel.
“In about five seconds,” he said, “three pages came up that had elements of pretty much every book I’d ever written, including plot lines, character names, narrative, the works.”
Said the author,
That’s when I found out the AI community had taken most of my novels without permission and fed them into their machine learning system.
I truly felt like someone had backed up a truck to my imagination and stolen everything I’d ever created.
He was also offended that GenAI companies pirated his books:
They complain that it would be far too difficult to license the works from individual creators. So, apparently, it was more efficient to steal it. Trillion-dollar companies with battalions of lawyers did not have the resources to do things lawfully? I was once a trial lawyer. If I had made that argument in court, I would either have been laughed out of the courtroom or held in contempt by the judge. And rightly so.
He noted that
copyrighted books don’t simply end up in AI training datasets as part of some indiscriminate sweep of the internet. Complete books are not posted online by their copyright owners, like website content, blogs, news articles, or other text. Books are unique in that they are sold as digital files that include technical protection measures against copying and downloading through online retailers like Amazon, Barnes and Noble, Kobo, and others, to be read on digital devices. So the only way for AI companies to access free books online was through pirate websites, virtually all of them based abroad—in Russia, Ukraine, and other countries outside the reach of U.S. law enforcement. And it was not an isolated instance of one bad actor—every major large language model in commercial use today was trained on pirated books, apparently with the full knowledge and authorization of the companies’ highest decision-makers.
