AI Training Models and Website Terms of Use
September 19th, 2023
We’ve been writing about artificial intelligence (AI) a lot recently, and for good reason. AI technology is improving rapidly, especially in generative AI (GAI), and influencing everything from venture capital investments to Hollywood strikes to Congressional hearings.
Large language models (LLMs) are a type of AI technology that uses large amounts of language (e.g., text) to train the AI to generate “new” language in response to prompts.
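To make that concrete, here is a minimal sketch of prompting a small, publicly available language model using the open-source Hugging Face transformers library. The model choice and prompt are illustrative only; commercial LLMs such as ChatGPT apply the same basic technique at a vastly larger scale.

```python
# Minimal sketch: prompt a small pretrained language model and let it
# continue the text. Requires `pip install transformers torch`.
from transformers import pipeline

# GPT-2 is a small, freely available model used here purely for illustration.
generator = pipeline("text-generation", model="gpt2")

prompt = "Website terms of use are important because"
outputs = generator(prompt, max_new_tokens=40, num_return_sequences=1)

# The model "writes" a continuation based on patterns learned from its training text.
print(outputs[0]["generated_text"])
```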
Where do those large amounts of text come from? Sources of training text include Common Crawl, The Pile, MassiveText, Wikipedia, and GitHub.
If you’ve ever posted anything online, it might be used to train an LLM.
Many people aren’t happy about their content being used to train an AI, especially if that AI someday produces content that competes with human-generated writing.
As previously discussed, some artists sued the GAI companies Stability AI, Midjourney, and DeviantArt, accusing them of mass copyright infringement for using the artists’ work in their generative AI systems.
As The Verge reports, comedian and author Sarah Silverman and authors Christopher Golden and Richard Kadrey are suing OpenAI and Meta in federal court for alleged copyright infringement:
The suits allege, among other things, that OpenAI’s ChatGPT and Meta’s LLaMA were trained on illegally acquired datasets containing their works, which they say were acquired from “shadow library” websites like Bibliotik, Library Genesis, Z-Library, and others, noting the books are “available in bulk via torrent systems.”
The plaintiffs have shown that ChatGPT can summarize their books and allege that the GAI tool must have copied the text (violating the authors’ copyrights) in order to do so.
Many AI-related lawsuits are based on claims that an AI company used the authors’ content without permission. But what if the AI company DID have that permission?
Many people “accept” the terms that govern website use – terms of service (ToS) or terms of use (ToU) – without seeing them… let alone reading or understanding them.
Terms of Use can be considered binding and enforceable contracts, but not all ToUs are created equal.
“Browsewrap” agreements purport to bind the user simply because the user visits the website. Browsewrap terms commonly appear as small-text links at the bottom of web pages. They are prevalent but are the type of ToU least likely to be enforced by courts.
“Clickwrap” agreements require users to click or check an “I agree” button or box to use (or continue to use) a site, make a purchase, etc. Courts are much more likely to enforce clickwrap terms than browsewrap terms because it’s clear that the user indicated assent, whether or not the user actually read the terms.
“Scrollwrap” terms are even more potent than clickwrap and require users to scroll to the end of the terms before clicking “I accept.”
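In practice, the differences among these agreement types come down to what the site requires from the user before proceeding and what evidence of assent it records. Below is a minimal sketch of a clickwrap-style check on a hypothetical signup endpoint, written with the Flask web framework; the route, field names, and storage step are illustrative assumptions, not any particular site’s implementation.

```python
# Minimal clickwrap sketch: refuse to create an account unless the user
# affirmatively checked the "I agree" box, and record the acceptance.
# Requires `pip install flask`. The route and field names are illustrative.
from datetime import datetime, timezone
from flask import Flask, abort, request

app = Flask(__name__)
TERMS_VERSION = "2023-09-19"  # hypothetical identifier for the current ToU version

@app.route("/signup", methods=["POST"])
def signup():
    # Clickwrap: the user must take an affirmative step (checking the box).
    if request.form.get("agree_to_terms") != "on":
        abort(400, "You must accept the Terms of Use to create an account.")

    # Recording what was accepted and when helps show assent if the terms
    # are ever challenged in court.
    acceptance = {
        "email": request.form.get("email"),
        "terms_version": TERMS_VERSION,
        "accepted_at": datetime.now(timezone.utc).isoformat(),
    }
    # ... persist `acceptance` alongside the new account (storage omitted here)
    return "Account created", 201
```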
Terms of Service routinely give a website operator broad rights (a license) to use any content uploaded or transmitted via the website. For example, Facebook’s terms say:
…to provide our services, we need you to give us some legal permissions (a "license") to use this content. …
Specifically, when you share, post, or upload content that is covered by intellectual property rights on or in connection with our Products, you grant us a non-exclusive, transferable, sub-licensable, royalty-free, and worldwide license to host, use, distribute, modify, run, copy, publicly perform or display, translate, and create derivative works of your content (consistent with your privacy and application settings). This means, for example, that if you share a photo on Facebook, you give us permission to store, copy, and share it with others (again, consistent with your settings), such as Meta Products or service providers that support those products and services. This license will end when your content is deleted from our systems.
Such a broad license may already include the right for Facebook to use user content to train an AI. However, some companies are adding specific AI-training rights to their ToUs.
Also, users should be aware that the prompts they enter when using a GAI tool (“write a 5-minute standup routine about rabbits in the style of Sarah Silverman,” for example) aren’t considered confidential and can be used for training an AI system.
Terms of Use for GAI tools may say that users “own” the output of their prompts, but it’s doubtful whether anyone owns such creations. As discussed in this blog, the US Copyright Office denied registration for comic book art generated using the Midjourney AI tool.
Just as website operators want the right to use content that users upload and transmit, they may also want to prevent others from using the content posted on their sites to train AI models.
Website terms commonly prohibit automated “scraping” of their content. “Captcha” puzzles are intended to help prevent this by allowing only humans to access the content.
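For context, “scraping” simply means a program fetching pages and extracting their content automatically rather than a person reading them in a browser. Here is a minimal sketch of what that looks like using the Python requests and BeautifulSoup libraries; the URL is a placeholder, and running anything like this against a real site may violate its Terms of Use.

```python
# Minimal scraping sketch: fetch one page and pull out its paragraph text.
# Requires `pip install requests beautifulsoup4`. The URL is a placeholder;
# check a site's Terms of Use before automating requests against it.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/articles", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

# Text gathered this way, at scale across many sites, is the kind of
# material that ends up in training corpora such as Common Crawl.
print("\n".join(paragraphs))
```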
In 2020, the 9th Circuit Court of Appeals ruled that the Computer Fraud and Abuse Act (CFAA) isn’t violated when a company scrapes public websites. However, scraping can still violate a website’s ToU and thus lead to a civil lawsuit for breach of contract.
For example, Ryanair’s ToU says:
You are not permitted to use this website (including the mobile app and any webpage and/or data that passes through the web domain at ryanair.com), its underlying computer programs (including application programming interfaces ("APIs")), domain names, Uniform Resource Locators ("URLs"), databases, functions or its content other than for private, non-commercial purposes. Use of any automated system or software, whether operated by a third party or otherwise, to extract any data from this website for commercial purposes ("screen scraping") is strictly prohibited.
Some website operators are going even further to prohibit using the website content to train AI models. For example, one website includes the following terms:
The license fee for this site’s material is $1,000 per unique URL per year or part of the year the model is in use. Use of any page of this site in creating an AI or Machine Learning model will be construed to be acceptance of these terms….
Should you claim that it is unreasonable to expect an automated web crawler to read these terms, I shall empathize with you tremendously as soon as you have paid the license fees. Meanwhile, reading license terms is your problem, not mine. You should not be stealing my life’s work because it isn’t convenient for you to figure out how not to steal.
Categories: Copyright