Apparently, stealing other people’s work to create product for money is now “fair use” as according to OpenAI because they are “innovating” (stealing). Yeah. Move fast and break things, huh?
“Because copyright today covers virtually every sort of human expression—including blogposts, photographs, forum posts, scraps of software code, and government documents—it would be impossible to train today’s leading AI models without using copyrighted materials,” wrote OpenAI in the House of Lords submission.
OpenAI claimed that the authors in that lawsuit “misconceive[d] the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence.”
I don’t see why it wouldn’t be able to. That’s a Big Data problem, but we’ve gotten very very good at searches. Bing, for instance, conducts a web search on each prompt in order to give you a citation for what it says, which is pretty close to what I’m suggesting.
As far as comparing to see if the text is too similar, I’m not suggesting a simple comparison or even an Expert Machine; I believe that’s something that can be trained. GANs already have a discriminator that’s essentially measuring how close to generated content is to “truth.” This is extremely similar to that.
I completely agree that categorizing input training data by whether or not it is copyrighted is not easy, but it is possible, and I think something that could be legislated. The AI you would have as a result would inherently not be as good as it is in the current unregulated form, but that’s not necessarily a worse situation given the controversies.
On top of that, one of the common defenses for AI is that it is learning from material just as humans do, but humans also can differentiate between copyrighted and public works. For the defense to be properly analogous, it would make sense to me that it would need some notion of that as well.