Reddit and Google Enter into AI Content Licensing Agreement

Reuters has reported that Reddit has entered into an agreement with Google to make content posted by Reddit users available to train Google’s artificial intelligence (AI) models.

According to a source, this licensing deal is worth about $60 million annually.

Ars Technica reported that

in a recent Securities and Exchange Commission filing, the popular online forum has revealed that it will bring in $203 million from that and other unspecified AI data licensing contracts over the next three years.

Reportedly,

Google and other AI companies that license Reddit's data will receive "continuous access to [Reddit's] data API as well as quarterly transfers of Reddit data over the term of the arrangement," according to the filing. That constant, real-time access is particularly valuable, the site writes in the filing, because "Reddit data constantly grows and regenerates as users come and interact with their communities and each other."

However, many companies have already used Reddit’s data to train their large language models (LLMs) without a license from Reddit.

Ars Technica notes that “Reddit seems well aware that AI models may continue to hoover up its posts and comments for free, even as it tries to sell that data to others.”

In 2023, Reddit updated its terms of use to call out machine learning training as an unauthorized use of its content without an express license from Reddit or express permission from each individual user.

Reddit recognized in its SEC filing that:

Some companies may decline to license Reddit data and use such data without a license, given its open nature, even if in violation of the legal terms governing our services… While we plan to vigorously enforce against such entities, such enforcement activities could take years to resolve, result in substantial expense, and divert management’s attention and other resources, and we may not ultimately be successful.

Some Reddit users are unhappy about their posts being used to train AI.

According to TechRadar,

Some users have privacy worries, some voiced concerns about the quality of output from an AI trained on Reddit content (which, let’s be honest, can get pretty toxic), and others simply don’t want their posts ‘stolen’ to train an AI.

Reddit’s terms of use give it a broad right to use user content as it pleases, within reason, undercutting the “stolen” argument.

However, Sarah Gilbert, a research associate at Cornell University and research director of the Citizens and Technology Lab who is an expert on content moderation and data ethics, has said:

A misalignment between the expectations of users and how Reddit allows their data to be used could be catastrophic for Reddit. It could impact willingness to contribute to the site or even prompt users to engage in vandalism as a form of protest.

Licensing is a great plan for Google

Many companies, such as the New York Times, are suing AI companies for unauthorized data scraping, which the AI companies defend as “fair use” under copyright law.

Generative AI (GAI) tools not only “learn” from existing content, they also often reproduce it almost verbatim. For example, in its lawsuit against OpenAI, the New York Times presented 100 examples of GPT-4 generating near-verbatim excerpts from Times articles.

OpenAI responded that “‘regurgitation’ is a rare bug that we are working to drive to zero.”

As Ars Technica discussed,

Examples of verbatim copying undermine the argument that generative models only ever learn unprotectable facts from their training data. They demonstrate that—at least some of the time—these models learn to reproduce creative expression protected by copyright. The danger for AI defendants is that these examples could color the judges’ thinking about what’s going on during the training process.

As the Stanford Library explains, judges use four factors to resolve fair use disputes under copyright law:

  • the purpose and character of your use
  • the nature of the copyrighted work
  • the amount and substantiality of the portion taken, and
  • the effect of the use upon the potential market.

The Library gives the following example of the “effect on the market” factor:

in one case an artist used a copyrighted photograph without permission as the basis for wood sculptures, copying all elements of the photo. The artist earned several hundred thousand dollars selling the sculptures. When the photographer sued, the artist claimed his sculptures were a fair use because the photographer would never have considered making sculptures. The court disagreed, stating that it did not matter whether the photographer had considered making sculptures; what mattered was that a potential market for sculptures of the photograph existed. (Rogers v. Koons, 960 F.2d 301 (2d Cir. 1992).)

Similarly, the New York Times and other companies, organizations, and individuals fighting to prohibit their content from being used to train AI systems can argue that even though THEY would never have considered using their content to train AI, a potential market to use copyrighted content to train AI clearly exists.

As Ars Technica noted, the establishment of a licensing market for content used to train AI tools can affect whether courts will consider the unlicensed use of such content to be "fair use" under copyright law:

The more deals like this are signed in the coming months, the easier it will be for the plaintiffs to argue that the “effect on the market” prong of fair use analysis should take this licensing market into account.

Reddit, which has about 138,000 active discussion groups, called subreddits, on a wide variety of topics, is reportedly the sixth most-visited internet site in the US and the 11th most-visited site in the world. It is said to have 1.5 billion registered users, of whom 430 million are active monthly and 52 million are active daily.

Reddit, founded in 2005, recently filed for an initial public offering. It was valued at about $10 billion in a funding round in 2021 and is looking to sell about 10% of its shares to the public.

As Reuters notes, this would be the first IPO of a social media company since Pinterest went public in 2019.