Navigating the Legal Landscape of AI: Lawsuits, Training Data, and the Future of Generative Models

In the land of AI foundation models, training data reigns supreme. It's the secret sauce that teaches AI algorithms to perform mind-boggling tasks like recognizing images or generating text. The challenge now is to use that training data without stepping on the toes of intellectual property holders, though the cat may already be out of the bag.

The AI Lawsuit Chronicles

As the AI industry grows, so does the list of lawsuits. It's like a never-ending season of a legal drama series, with new episodes airing every few months. Here's a quick recap of the ongoing saga:

GitHub, Microsoft, and OpenAI: A class-action suit involving GitHub's Copilot tool, which allegedly copies and republishes code without proper attribution.
Stability AI, Midjourney, and DeviantArt: A complaint against AI image generator providers for allegedly infringing on copyrights and creating unauthorized derivative works.
Stability AI and Getty Images: Getty Images filed complaints against Stability AI for allegedly copying and processing millions of images and associated metadata owned by Getty.
OpenAI: Authors Paul Tremblay and Mona Awad are suing OpenAI for allegedly infringing on authors' copyrights.
Meta and OpenAI: Sarah Silverman's lawsuit against Meta and OpenAI alleges copyright infringement and claims that ChatGPT and Llama were trained on illegally acquired data sets containing her work.
Google: A class-action lawsuit against Google for alleged misuse of personal information and copyright infringement related to the training of Bard.

Plot Twist

In a plot twist, some internet platforms like Twitter have started "locking down" their information, like a digital Fort Knox, to prevent unauthorized access and use. The move aims to protect content creators from AI systems that might be tempted to "borrow" their work without asking.

It also raises questions about the role of search engines in the digital landscape. Search engines have long been the go-to source for information, but AI systems are now flexing their muscles, performing tasks that make search engines look like they're stuck in the Stone Age. An AI system can use the data it pulls to give a complete answer without the user ever visiting the original website. That means no eyeballs, no advertising clicks, no affiliate purchases, and no value in SEO.

At ThinkChain.ai, we're committed to fighting the good fight by respecting the intellectual property rights of content creators. We cite the original creator of any content found through our platform. It's our way of contributing to the responsible development and use of AI technologies, one ethical step at a time.

Self-Training?

So how will future AI get trained if not on "public" data? Well, ChatGPT now gets a lot of content from its users. As an API customer, we can (and do) opt out, but web users are creating free training content. The problem is that users don't exactly supply facts and content; that's supposed to be the job of the AI. But there is one more possible route: AI-generated content. Can an AI generate content to train itself? On the one hand, it makes sense. AI is trained on text, and AI produces text. But on the other hand, it's a circle. If the data comes from the AI, how can it improve the AI? Surely AI can't self-improve?

Well, that's kind of how evolution works, isn't it?

Written with the help of ThinkChain.ai
