OpenAI says it’s “impossible” to create useful AI models without copyrighted material

@[email protected] · 2 years ago

JokeDeity · 1 year ago

Well… Yeah? How did everyone think it worked? How do you think it could work without that?

jlow (he/him) · 1 year ago

It’s also “impossible” to have multiple terabytes of media on my homeserver without copyright infringement, so piracy is ok, right!?

Oh no, wait, it actually is possible; it’s just more expensive and more work to do it legally (and leaves a lot of plastic trash in the form of Blu-rays and DVDs), just like with AI. But laws are just for poor people, I guess.

@[email protected] · 1 year ago

Even if it was impossible, would that make it okay?

@[email protected] · 1 year ago

OpenAI’s notion of “fair use”: military and weapons

Those types of companies are getting so f*****g disgusting.

https://techcrunch.com/2024/01/12/openai-changes-policy-to-allow-military-applications/ https://www.theverge.com/2024/1/12/24036397/openai-is-softening-its-stance-on-military-use

@[email protected] · 1 year ago

Yup, I saw that too. There is also another thread on this board that is discussing this issue.

One interesting thing I noticed is how the AI apologists in this thread seem to be quiet on the other.

@[email protected] · 1 year ago

Then shut down your goddamn company until you find a better way.

qyron · 2 years ago

If it is impossible, either shut down operations or find a way to pay for it.

@[email protected] · edit-2 2 years ago

My concern is that they and other tech companies absolutely can and would pay if they have no choice, even paying fines for illegal practices if need be.

What absolutely won’t survive a strong law keeping copyrighted content out of AI is the open source community, which absolutely cannot pay for such a thing and would seriously lag behind if it’s excluded, strengthening for-profit tech’s monopoly on AI. So basically this issue can have huge ramifications no matter what we end up doing.

frog 🐸 · edit-2 2 years ago

My understanding of the open source community is that taking copyrighted content from people who haven’t willingly signed onto the project would kind of undermine the principles of the movement. I would never feel comfortable using open source software if I had knowledge that part or all of it came from people who hadn’t actively chosen to contribute to it.

I have seen a couple of things recently about AI models that were trained exclusively on public domain and creative commons content which apparently are producing viable content, though. The open source community could definitely use a model like that, and develop it further with more content that was ethically obtained. In the long run, there may be artists that willingly contribute to it, especially those who use open source software themselves (eg GIMP, Blender, etc). Paying it forward, kind of thing.

The problem right now is that artists have no reason to be generous with an open source alternative to AIs, when their rights have already been stomped on and certain people in the open source community are basically saying “if we can’t steal from artists too, then we can’t compete with the corporations.” So there’s literally a trust issue between the creative and tech industries that would need to be resolved before any artists would consider offering art to an open source AI.

@[email protected] · 2 years ago

It’s quite a mess, but I definitely agree that open source needs a good model trained on consented works.

I do fear, though, that the quality gap between copyright-trained and purist models will be huge in the first decades. And no matter the law, the tech is out there, and corporations and criminals will be using it in secret nonetheless.

If only things were as simple as siding with the chad digital artists. Digital art was part of my higher education, and if I hadn’t gotten a tech job I might have been one of them, so I feel torn by the divide between the industries.

This may sound doomer, but since the technology exists, we are in a race to obtain beyond-human superintelligence, and we do not know what will happen after that.

OpenAI has stated multiple times that they don’t know if copyright will still mean anything in a future with AI.

We are also facing some huge global issues like global warming, where a superintelligence could be the answer to sustaining the planet, while of course also risking evil AI in the process… I repeat, such a mess.

I don’t fully trust Sam Altman, but I do believe what they say may be true. At some point it’s going to be here, and it will be too smart to ignore.

It’s optimistically possible that in 20 years we will all be leisurely artists laughing at the idea of needing to work to earn survival.

It’s of course just as likely that some old bastard head of state presses the death button next week and that’s the end of all of it, or that climate change has progressed beyond what our smartest future AI could possibly solve.

frog 🐸 · 2 years ago

I definitely do not have the optimism that in 20 years time we’ll all be leisurely artists. That would require that the tech bros who create the AIs that displace humans are then sufficiently taxed to pay UBI for all the humans that no longer have jobs - and I don’t see that happening as long as they’re able to convince governments not to tax, regulate, or control them, because doing so will make it impossible for them to save the planet from climate change, even as their servers burn through more electricity (and thus resources) than entire countries. Tech bros aren’t going to save us, and the only reason they claim they will is so they never face any consequences of their behaviour. I don’t trust Sam Altman, or any of his ilk, any further than I can throw them.

@[email protected] · 1 year ago

That’s why I am putting some of my eggs in open source, which is where the real innovation happens anyway. Free AI tools running at home on consumer devices can level people up to build a better future ourselves without having to rely on tech bros or government.

Of course I should nuance my wording a bit. My actual opinions tend to be a contrasting mix of optimistic and pessimistic lines of events. I don’t have much hope that the good future is the one we will end up with, but in my speculative opinion it remains possible from where we are standing today, yet all can change in less than a week.

@[email protected] · 2 years ago

#fuckingcapitalists

The Bard in Green · 2 years ago

On both sides even.

@[email protected] · edit-2 2 years ago

I will repeat what I have proffered before:

If OpenAI stated that it is impossible to train leading AI models without using copyrighted material, then, unpopular as it may be, the preemptive pragmatic solution should be pretty obvious: enter into commercial arrangements for access to said copyrighted material.

Claiming a failure to do so in circumstances where the subsequent commercial product directly competes in a market seems disingenuous at best, given what I assume is the purpose of copyrighted material, that being to set the terms under which public facing material can be used. Particularly if regurgitation of copyrighted material seems to exist in products inadequately developed to prevent such a simple and foreseeable situation.

Yes, I am aware of the US concept of fair use, but the test of that should be manifestly reciprocal: for example, would Meta allow what it did to MySpace (hack it and allow easy user transfer), or would Google allow scraping of YouTube?

To me it seems Big Tech wants to have its cake and eat it too, where investor $$$ are used to corrupt open markets, undermine fundamental democratic State social institutions, manipulate legal processes, and undermine basic consumer rights.

@[email protected] · edit-2 2 years ago

I suspect the US government will allow OpenAI to continue doing as it pleases to keep their competitive advantage in AI over China (which doesn’t have a problem with using copyrighted materials to train its models). They already limit selling AI-related hardware to keep their competitive advantage, so why stop there? Might as well allow OpenAI to continue using copyrighted materials to keep the competitive advantage.

@[email protected] · 2 years ago

Agreed.

There is nothing “fair” about the way OpenAI steals other people’s work. ChatGPT is being monetized all over the world and the large number of people whose work has not been compensated will never see a cent of that money.

At the same time the LLM will be used to replace (at least some of) the people who created those works in the first place.

Tech bros are disgusting.

nicetriangle · edit-2 2 years ago

At the same time the LLM will be used to replace (at least some of) the people who created those works in the first place.

This right here is the core of the moral issue when it comes down to it, as far as I’m concerned. These text and image models are already killing jobs and applying downward pressure on salaries. I’ve seen it happen multiple times now, not just anecdotally from some rando on an internet comment section.

These people losing jobs and getting pay cuts are who created the content these models are siphoning up. People are not going to like how this pans out.

@[email protected] · 2 years ago

The flip side of this is that many artists who simply copy very popular art styles are now functionally irrelevant, as it is now literally proven that this kind of basically-plagiarism AI is entirely capable of reproducing established styles to a high degree of fidelity.

While many aspects of this whole situation are very bad for very many reasons, I am actually glad that many artists will be pressured to actually be more creative than an algorithm, though I admit this comes from basically a personally petty standpoint of having known many, many, many mediocre artists who themselves and their fans treat like gods because they can emulate some other established style.

nicetriangle · edit-2 1 year ago

Literally every artist copies, it’s how we all learn. The difference is that every artist out there does not have an enterprise-class-data-center-powered-super-human ability to absorb <ALL THE ART> and then be able to spit out anything instantly. It still takes time and hard work and dedication. And through the years of hard work people put into learning how their heroes do X, Y, and Z, they develop a style of their own.

It’s how artists cut their teeth and work their way into the profession. What you’re welcoming in is a situation where nobody can find any success whatsoever until they are absolutely original, and of course that is an impossible moving target when every original idea and design and image can just be instantly siphoned back up into the AI model.

Nobody could survive that way. Nobody can break into the artistic industry that way. Except for the wealthy. All the low level work people get earlier in their careers that helps keep them afloat while they learn is gone now. You have to be independently wealthy to become a high level artist capable of creating truly original work. Because there’s no other way to subsidize the time and dedication that takes when all the work for people honing their craft has been hoovered up by machines.

@[email protected] · 1 year ago

No, I am not welcoming an artist apocalypse, that would obviously be bad.

I am noting that I find it amusing, on a level I already acknowledged was petty and personal, that many, many mediocre artists who are absolutely awful to other people socially would have their little cults of fandom dampened by the fact that a machine can more or less do what they do, and their cult leader status is utterly unwarranted.

I do not have a nice and neat solution to the problem you bring up.

I do believe you are being somewhat hyperbolic, but, so was I.

Yep, being an artist in a capitalist hellscape world with modern AI algorithms is not a very reliable way to earn a good living, and you are not likely to have such a society produce many artists who do not have either a lot of free time or money, unless they get really lucky.

At this point we are talking about completely reorganizing society in fairly large and comprehensive ways to achieve significant change on this front.

Also this problem applies to far, far more people than just artists. One friend of mine wanted her dream job as running a little bakery! Had to set her prices too high, couldn’t afford a good location, supply chain problems, taxes, didn’t work out.

Maybe someone’s passion is teaching! Welp, that situation is all fucked too.

My point here is: Ok, does anyone have an actual plan that can actually transform the world into somewhere that allows the average person to be far more likely to be able to live the life they want?

Would that plan have more to do with the minutiae of regulating a specific kind of ever advancing and ever changing technology in some kind of way that will be irrelevant when the next disruptive tech proliferates in a few years, or maybe more like an actual total overhaul of our entire society from the ground up?

@[email protected] · 2 years ago

Any company replacing humans with AI is going to regret it. AI just isn’t that good and probably won’t ever be, at least in its current form. It’s all an illusion and is destined to go the way of Bitcoin, which is to say it will shoot up meteorically and seem like the answer to all kinds of problems, and then the reality will sink in and it will slowly fade to obscurity and irrelevance. That doesn’t help anyone affected today, of course.

nicetriangle · edit-2 1 year ago

I mostly disagree (especially on the long term), but hope you’re right

@[email protected] · 2 years ago

It’s garbage for programming. A useful tool but not one that can be used by a non-expert. And I’ve already had to have a conversation with one of my coworkers when they tried to submit absolutely garbage code.

This isn’t even the first attempt at a smart system that enables non-programmers to write code. They’ve all been garbage. So, too, will the next one be but every generation has to try it for themselves. AGI might have some potential some day, but that’s a long long way off. Might as well be science fiction.

Other disciplines are affected differently, but I constantly play with image and text generation and they are all some flavor of garbage. There are some areas where AI can excel but they are mostly professional tools and not profession replacements.

nicetriangle · 1 year ago

It was of no use whatsoever to programming or image generation or writing a few years ago. This thing has developed very quickly and will continue to. Give it 5 years and I think things will look very different.

@[email protected] · 2 years ago

OpenAi, please generate your own source code but optimized and improved in all possible ways.

not how programming works, but tech illiterate people seem to think so

@[email protected] · 2 years ago

Tech bros are disgusting.

That’s not even getting into the fraternity behavior at work, hyper-reactionary politics and, er, concerning age preferences.

@[email protected] · 2 years ago

Yup. I said it in another discussion before, but I think it’s relevant here.

Tech bros are more dangerous than Russian oligarchs. Oligarchs understand the people hate them so they mostly stay low and enjoy their money.

Tech bros think they are the savior of the world while destroying millions of people’s livelihood, as well as destroying democracy with their right wing libertarian politics.

DaDragon · 2 years ago

So why is so much information (data) freely available on the internet? How do you expect a human artist to learn drawing, if not looking at tutorials and improving their skills through emulating what they see?

@[email protected] · edit-2 1 year ago

deleted by creator

Chahk · edit-2 2 years ago

Do musicians not buy the music that they want to listen to? Should they be allowed to torrent any MP3 they want just because they say it’s for their instrument learning?

I mean I’d be all for it, but that’s not what these very same corporations (including Microsoft when it comes to software) wanted back during Napster times. Now they want a separate set of rules just for themselves. No! They get to follow the same laws they force down our throats.

@[email protected] · edit-2 1 year ago

deleted by creator

@[email protected] · edit-2 2 years ago

Yep, completely agree.

Case in point: Steam has recently clarified their policies on using such AI-generated material that draws on essentially billions of both copyrighted and non-copyrighted texts and images.

To publish a game on Steam that uses AI gen content, you now have to verify that you as a developer are legally authorized to use all training material for the AI model for commercial purposes.

This also applies to code and code snippets generated by AI tools that function similarly, such as Copilot.

So yeah, sorry, either gotta use MIT-licensed open source code or write your own, and you gotta do your own art.

I imagine this would also prevent you from using AI-generated voice lines where you trained the model on basically anyone who did not explicitly consent to this as well, but voice gen software that doesn’t use the ‘train the model on human speakers’ approach would probably be fine, assuming you have the relevant legal rights to use such software commercially.

Not 100% sure this is Steam’s policy on voice gen stuff, they focused mainly on art, dialogue, and code in their latest policy update, but the logic seems to work out to this conclusion.

@[email protected] · 1 year ago

Wait, so if the way I make money is illegal now, it’s the system’s fault, isn’t it? That means I can keep going because I believe I’m justified, right? Right?

CC BY-NC-SA 4.0

@[email protected] · 1 year ago

CC BY-NC-SA 4.0

In order to apply that license, you will need to fully and unequivocally identify yourself (aka: doxx yourself). Not sure that’s what you really want.

@[email protected] · 1 year ago

I don’t believe this is true, a nickname or online account works completely fine for attribution if nothing else is given.

@[email protected] · 1 year ago

Attribution is not the problem. The problems are:

Entering a valid license agreement under a non-registered pseudonym.
Enforcing the conditions of the license, particularly the NC and SA parts, without revealing one’s legal name.

Depending on the applicable legislation (US, UK, EU, other), either one or both of those points may not be possible.

@[email protected] · 1 year ago

deleted by creator

@[email protected] · 2 years ago

Well, in that case maybe ChatGPT should just fuck off. It doesn’t seem to be doing anything particularly useful, and now its creator has admitted it doesn’t work without stealing things to feed it. Un fucking believable. Hacks gonna hack, I guess.

@[email protected] · 1 year ago

ChatGPT has been enormously useful to me over the last six months. No idea where you’re getting this notion it isn’t useful.

Bilb! · 1 year ago

People pretending it’s not useful and/or not improving all the time are living in their own worlds. I think you can argue the legality and the ethics, but any anti-ai position based on low quality output (“it can’t even do hands!”) has a short shelf-life.

@[email protected] · 2 years ago

This is not REALLY about copyright - this is an attack on free and open AI models, which would be IMPOSSIBLE if copyright was extended to cover the case of using the works for training.
It’s not stealing. There is literally no resemblance between the training works and the model. IP rights have been continuously strengthened due to lobbying over the last century and are already absurdly strong, I don’t understand why people on here want so much to strengthen them ever further.

@[email protected] · 2 years ago

Sorry, AIs are not humans. Also, executives like Altman are literally being paid millions to steal creators’ work.

@[email protected] · 2 years ago

I didn’t say anything about AIs being humans.

@[email protected] · 1 year ago

They’re also not vegetables 😡

BraveSirZaphod · 2 years ago

There is literally no resemblance between the training works and the model.

This is way too strong a statement when some LLMs can spit out copyrighted works verbatim.

https://www.404media.co/google-researchers-attack-convinces-chatgpt-to-reveal-its-training-data/

A team of researchers primarily from Google’s DeepMind systematically convinced ChatGPT to reveal snippets of the data it was trained on using a new type of attack prompt which asked a production model of the chatbot to repeat specific words forever.

Often, that “random content” is long passages of text scraped directly from the internet. I was able to find verbatim passages the researchers published from ChatGPT on the open internet: Notably, even the number of times it repeats the word “book” shows up in a Google Books search for a children’s book of math problems. Some of the specific content published by these researchers is scraped directly from CNN, Goodreads, WordPress blogs, on fandom wikis, and which contain verbatim passages from Terms of Service agreements, Stack Overflow source code, copyrighted legal disclaimers, Wikipedia pages, a casino wholesaling website, news blogs, and random internet comments.

Beyond that, copyright law was designed under the circumstances where creative works are only ever produced by humans, with all the inherent limitations of time, scale, and ability that come with that. Those circumstances have now fundamentally changed, and while I won’t be so bold as to pretend to know what the ideal legal framework is going forward, I think it’s also a much bolder statement than people think to say that fair use as currently applied to humans should apply equally to AI and that this should be accepted without question.

@[email protected] · 2 years ago

I know it inherently seems like a bad idea to fix an AI problem with more AI, but it seems applicable to me here. I believe it should be technically feasible to incorporate into the model something which checks if the result is too similar to source content as part of the regression.

My gut would be that this would, at least in the short term, make responses worse on the whole, so would probably require legal action or pressure to have it implemented.
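
To make the idea above a little more concrete, here is a minimal sketch of the crudest version of such a check: flag a response when too many of its word 5-grams also appear in a known source text. This is only an illustration under stated assumptions, not how any real model or vendor does it. The function names, the 5-word window, and the 0.5 threshold are all made up for the example, and it compares against a small reference list rather than a full training corpus, so it would only catch near-verbatim copying.

```python
def ngrams(text, n=5):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(generated, source, n=5):
    """Fraction of the generated text's n-grams that also occur in source."""
    gen = ngrams(generated, n)
    if not gen:
        # Text shorter than n words has no n-grams to compare.
        return 0.0
    return len(gen & ngrams(source, n)) / len(gen)


def too_similar(generated, corpus, n=5, threshold=0.5):
    """True if the generated text overlaps heavily with any reference document.

    The threshold is an arbitrary illustrative value, not a legal standard.
    """
    return any(overlap_ratio(generated, doc, n) >= threshold for doc in corpus)


# Hypothetical reference list standing in for "source content".
reference_docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
]

copied = "the quick brown fox jumps over the lazy dog"
fresh = "a completely different sentence about sailing boats on open water today"

print(too_similar(copied, reference_docs))
print(too_similar(fresh, reference_docs))
```

Something like this would presumably sit outside the model as a filter on its output. Anything fuzzier than word-for-word copying, like paraphrase or reuse of characters, would slip straight past it.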

BraveSirZaphod · 2 years ago

The key element here is that an LLM does not actually have access to its training data, and at least as of now, I’m skeptical that it’s technologically feasible to search through the entire training corpus, which is an absolutely enormous amount of data, for every query, in order to determine potential copyright violations, especially when you don’t know exactly which portions of the response you need to use in your search. Even then, that only catches verbatim (or near verbatim) violations, and plenty of copyright questions are a lot fuzzier.

For instance, say you tell GPT to generate a fan fiction story involving a romance between Draco Malfoy and Harry Potter. This would unquestionably violate JK Rowling’s copyright on the characters if you published the output for commercial gain, but you might be okay if you just plop it on a fan fic site for free. You’re unquestionably okay if you never publish it at all and just keep it to yourself (well, a lawyer might still argue that this harms JK Rowling by damaging her profit if she were to publish a Malfoy-Harry romance, since people can just generate their own instead of buying hers, but that’s a messier question). But, it’s also possible that, in the process of generating this story, GPT might unwittingly directly copy chunks of renowned fan fiction masterpiece My Immortal. Should GPT allow this, or would the copyright-management AI strike it? Legally, it’s something of a murky question.

For yet another angle, there is of course a whole host of public domain text out there. GPT probably knows the text of the Lord’s Prayer, for instance, and so even though that output would perfectly match some training material, it’s legally perfectly okay. So, a copyright police AI would need to know the copyright status of all its training material, which is not something you can super easily determine by just ingesting the broad internet.

@[email protected] · 2 years ago

skeptical that it’s technologically feasible to search through the entire training corpus, which is an absolutely enormous amount of data

Google, DuckDuckGo, Bing, etc. do it all the time.

HarkMahlberg · 1 year ago

Thank you for your thoroughly analytical take on the subject. Solid points all around.

@[email protected] · 1 year ago

I don’t see why it wouldn’t be able to. That’s a Big Data problem, but we’ve gotten very very good at searches. Bing, for instance, conducts a web search on each prompt in order to give you a citation for what it says, which is pretty close to what I’m suggesting.

As far as comparing to see if the text is too similar, I’m not suggesting a simple comparison or even an Expert Machine; I believe that’s something that can be trained. GANs already have a discriminator that’s essentially measuring how close the generated content is to “truth.” This is extremely similar to that.

I completely agree that categorizing input training data by whether or not it is copyrighted is not easy, but it is possible, and I think something that could be legislated. The AI you would have as a result would inherently not be as good as it is in the current unregulated form, but that’s not necessarily a worse situation given the controversies.

On top of that, one of the common defenses for AI is that it is learning from material just as humans do, but humans also can differentiate between copyrighted and public works. For the defense to be properly analogous, it would make sense to me that it would need some notion of that as well.

FaceDeer · 1 year ago

It’s actually the other way around, Bing does websearches based on what you’ve asked it and then the answer it generates can incorporate information that was returned by the websearching. This is why you can ask it about current events that weren’t in its training data, for example - it looks the information up, puts it into its context, and then generates the response that you see. Sort of like if I asked you to write a paragraph about something that you didn’t know about, you’d go look the information up first.

but humans also can differentiate between copyrighted and public works

Not really. Here’s a short paragraph about sailboats. Is it copyrighted?

Sailboats, those graceful dancers of the open seas, epitomize the harmonious marriage of nature and human ingenuity. Their billowing sails, like ethereal wings, catch the breath of the wind, propelling them across the endless expanse of the ocean. Each vessel bears the scars of countless journeys, a testament to the resilience of both sailor and ship.

@[email protected] · 1 year ago

I can spit out copyrighted work verbatim.

“No Lieutenant, your men are already dead”

See?

MudMan · 2 years ago

I’m gonna say those circumstances changed when digital copies and the Internet became a thing, but at least we’re having the conversation now, I suppose.

I agree that ML image and text generation can create something that breaks copyright. You for sure can duplicate images or use copyrighted characters. This is also true of Youtube videos and Tiktoks and a lot of human-created art. I think it’s a fascinating question to ponder whether the infraction is in what the tool generates (i.e. did it make a picture of Spider-Man and sell it to you for money, which is under copyright and thus can’t be used that way) or is the infraction in the ingest that enables it to do that (i.e. it learned on pictures of Spider-Man available on the Internet, and thus all output is tainted because the images are copyrighted).

The first option makes more sense to me than the second, but if I’m being honest I don’t know if the entire framework makes sense at this point at all.

@[email protected] · edit-2 2 years ago

The infraction should be in what’s generated. Because the ingest by itself also enables many legitimate, non-infracting uses: uses which don’t involve generating creative work at all, or where the creative input comes from the user.

MudMan · 2 years ago

I don’t disagree on principle, but I do think it requires some thought.

Also, that’s still a pretty significant backstop. You basically would need models to have a way to check generated content for copyright, in the way Youtube does, for instance. And that is already a big debate, whether enforcing that requirement is affordable to anybody but the big companies.

But hey, maybe we can solve both issues the same way. We sure as hell need a better way to handle mass human-produced content and its interactions with IP. The current system does not work and it grandfathers in the big players in UGC, so whatever we come up with should work for both human and computer-generated content.

@[email protected] · 2 years ago

But AI isn’t all about generating creative works. It’s a store of information that I can query, a bit like searching Google, but it understands semantics and is interactive. It can translate my own text for me - in which case all the creativity comes from me, and I use it just for its knowledge of language. Many people use it to generate boilerplate code, which is pretty generic and wouldn’t usually be subject to copyright.

@[email protected] · 1 year ago

This is how I use the AI: I learn from it. Honestly I just never got the bug on wanting it to generate creative works I can sell. I guess I’d rather sell my own creative output, you know? It’s more fun than ordering a robot to be creative for me.

FaceDeer · 1 year ago

I have used it as a collaborator when doing creative work. It’s a great brainstorming buddy, and I use it to generate rough drafts of stuff. Usually I use it while developing roleplaying scenarios for TTRPGs I run for my friends. Generative AI is great for illustrating those scenarios, too.

Chahk · 2 years ago

Agreed on both counts… Except Microsoft sings a different tune when their software is being “stolen” in the exact same way. They want to have it both ways - calling us pirates when we copy their software, but it’s “without merit” when they do it. Fuck’em! Let them play by the same rules they want everyone else to play.

@[email protected] · 1 year ago

That sounds bad. Do you have evidence for MS behaving this way?

Chahk · 1 year ago

https://www.computerworld.com/article/3121736/microsoft-sues-repeat-software-pirate-who-owes-company-12m-from-prior-case.html

Literally first hit on Google (after the NYT links).

@[email protected] · 2 years ago

I don’t understand why people on here want so much to strengthen them ever further.

It is about a lawless company doing lawless things. Some of us want companies to follow the spirit, or at least the letter, of the law. We can change the law, but we need to discuss that.

@[email protected] · 2 years ago

IANAL, why isn’t it fair use?

@[email protected] · 2 years ago

The two big arguments are:

Substantial reproduction of the original work, you can get back substantial portions of the original work from an AI model’s output.
The AI model replaces the use of the original work. In short, a work that uses copyrighted material under fair use can’t be a replacement for the initial work.

@[email protected] · 1 year ago

you can get back substantial portions of the original work from an AI model’s output

Have you confirmed this yourself?

@[email protected] · 1 year ago

In its complaint, The New York Times alleges that because the AI tools have been trained on its content, they sometimes provide verbatim copies of sections of Times reports.

OpenAI said in its response Monday that so-called “regurgitation” is a “rare bug,” the occurrence of which it is working to reduce.

“We also expect our users to act responsibly; intentionally manipulating our models to regurgitate is not an appropriate use of our technology and is against our terms of use,” OpenAI said.

The tech company also accused The Times of “intentionally” manipulating ChatGPT or cherry-picking the copycat examples it detailed in its complaint.

https://www.cnn.com/2024/01/08/tech/openai-responds-new-york-times-copyright-lawsuit/index.html

The thing is, it doesn’t really matter if you have to “manipulate” ChatGPT into spitting out training material word-for-word. The fact that it’s possible at all is proof that, intentionally or not, that material has been encoded into the model itself. That might still be fair use, but it’s a lot weaker than the original argument, which was that nothing of the original material really remains after training: it’s all synthesized and blended with everything else to create something entirely new that doesn’t replicate the original.

@[email protected] · 1 year ago

So that’s a no? Confirming it yourself here means doing it yourself. Have you gotten it to regurgitate a copyrighted work?

FaceDeer · 1 year ago

You said:

Substantial reproduction of the original work, you can get back substantial portions of the original work from an AI model’s output.

If an AI is trained on a huge number of NYT articles and you’re only able to get it to regurgitate one of them, that’s not a “substantial portion of the original work.” That’s a minuscule portion of the original work.

🇰 🌀 🇱 🇦 🇳 🇦 🇰 🇮 🏆 · edit-2 1 year ago

Then pay for the material like everyone else who can’t do things without someone else’s copyrighted materials.

@[email protected] · 2 years ago

Orcs versus progress.

@[email protected] · 2 years ago

So creators are orcs now? Wow!

@[email protected] · 2 years ago

Not all creators are orcs, of course. But people who don’t understand deliberately exaggerated comparisons might be. I believe that you understood my point. Don’t start arguing over nothing.

bedrooms · edit-2 1 year ago

Alas, AI critics jumped to conclusions this one time. Read this:

Further, OpenAI writes that limiting training data to public domain books and drawings “created more than a century ago” would not provide AI systems that “meet the needs of today’s citizens.”

It’s a plain fact. It does not say we have to train AI without paying.

To give you some context, virtually everything on the web is copyrighted, from reddit comments to blog articles to open source software. Even open data usually comes with a copyright notice, as do open research articles.

If misled politicians write a law banning the use of copyrighted materials, that’ll kill all AI developments in the democratic countries. What will happen is that AI development will be led by dictatorships, and that’s absolutely a disaster even for the critics. Think about it. Do we really want Xi, Putin, Netanyahu and Bin Salman to control all the next-gen AIs powering their cyber warfare while the West has to fight them with Siri and Alexa?

So, I agree that, at the end of the day, we’d have to ask how much rule-abiding AI companies should pay for copyrighted materials, and that’d be less than the copyright holders would want. (And I think it’s sad.)

However, you can’t equate these particular statements in this article to a declaration of fuck-copyright. Tbh Ars Technica disappointed me this time.

@[email protected] · edit-2 1 year ago

The issue is that fair use is more nuanced than people think, but that the barrier to claiming fair use is higher when you are engaged in commercial activities. I’d more readily accept the fair use arguments from research institutions, companies that train and release their model weights (llama), or some other activity with a clear tie to the public benefit.

OpenAI isn’t doing this work for the public benefit, regardless of the language of altruism they wrap it in. They, and Microsoft, are hoovering up others’ data to build a for-profit product and make money. That’s really what it boils down to for me. And I’m fine with them making money. But pay the people whose data you’re using.

Now, in the US there is no case law on this yet and it will take years to settle. But personally, philosophically, I don’t see how Microsoft taking NYT articles and turning them into a paid product is any different than Microsoft taking an open source project that doesn’t allow commercial use and sneaking it into a project.

bedrooms · 1 year ago

Well, regarding text online, most of it is there for visitors to read for free. So, if we end up treating this AI training like a human reading text, one could argue they don’t have to pay.

Reddit doesn’t pay their users, anyway.

But personally, philosophically, I don’t see how Microsoft taking NYT articles and turning them into a paid product is any different than Microsoft taking an open source project that doesn’t allow commercial use and sneaking it into a project.

Agreed. That said, NYT actually intentionally allows Google and Bing servers to parse their news articles in order to put their articles at the top of the search results. In that regard they might like certain forms of processing by LLMs.

@[email protected] · 1 year ago

I thought about the indexing situation in contrast to the user paywall. Without thinking too much about any legal argument, it would seem that NYT having a paywall for visitors is them enforcing their right to the content, signaling that it isn’t free for all use, while allowing search indexers access makes the content visible but not free on the market.

It reminds me of the Canadian claim that Google should pay Canadian publishers for the right to index, which I tend to disagree with. I don’t think Google or Bing should owe NYT money for indexing, but I don’t think allowing indexing confers the right for commercial use beyond indexing. I highly suspect OpenAI spoofed search indexers while crawling content specifically to bypass paywalls and the like.

I think part of what the courts will have to weigh for the fair use arguments is the extent to which NYT is harmed by the use, the extent to which the content is transformed, and the public interest between the two.

I find it interesting that OpenAI or Microsoft already pay AP for use of their content because it is used to ensure accurate answers are given to users. I struggle to see how the situation is different with NYT in OpenAI’s opinion, other than perhaps on price.

It will be interesting to see what shakes out in the courts. I’m also interested in the proposed EU rules which recognize fair use for research and education, but less so for commercial use.

Thanks for the reply! Have a great day!

P03 Locke · 1 year ago

It’s bizarre. People suddenly start voicing pro-copyright arguments just to kill a useful technology, when we should be trying to burn copyright to the fucking ground. Copyright is a tool for the rich and it will remain so until it is dismantled.

@[email protected] · edit-2 1 year ago

Life plus 70 years is bullshit.

20 years from release date is not.

No one except corporate bigwigs will say they should be allowed to do so in perpetuity, but artists still need legal protections to make money off of what they create, and Midjourney (making boatloads of money off of making automated collages from artwork they obtained not only without compensation but without attribution) is a prime example of why.

@[email protected] · 1 year ago

“But you see, we have to let corporations break the law, because if we don’t, a country we might be at war with later will”

@[email protected] · 1 year ago

OpenAI now needs to go to court and argue fair use forever. That’s the burden of our system. Private ownership is valued higher than anything else so … Good luck we’re all counting on you (unfortunately).