‘Impossible’ to create AI tools like ChatGPT without copyrighted material, OpenAI says

L4sBot · 2 years ago

‘Impossible’ to create AI tools like ChatGPT without copyrighted material, OpenAI says

@[email protected] · 2 years ago

Too bad

Why do they have free rein to store and use copyrighted material as training data? AIs don’t learn as a human would, and comparisons can’t be made between the learning processes.

@[email protected] · 2 years ago

They can be made. Imagine trying to hold any conversations without being able to reference popular culture.

@[email protected] · edit-2 2 years ago

Why do you have free rein to do the same?

AIs don’t learn as a human would, and comparisons can’t be made between the learning processes.

I think you’re going to have a hard time proving a financial distinction between them

@[email protected] · 2 years ago

You don’t need to prove a financial difference. They are fundamentally different systems that function in different ways. They cannot be compared 1:1 and laws cannot be applied as a 1:1. New regulations need to be added around AI use of copyrighted material.

@[email protected] · 2 years ago

I agree. For instance, it should be secured in law that you can train AI on anything, to avoid frivolous discussions like this.

Output is what should be moderated by law.

@[email protected] · 2 years ago

No

Why are you entitled to use everyone else’s work? It should be secured in law that licensing applies to training data to avoid frivolous discussions like this. Then it’s an entirely opt-in solution, which works to the benefit of everyone except the people stealing data.

Output doesn’t matter since it’s pretty well settled it’s not derivative work (as much as I disagree with that statement).

@[email protected] · 2 years ago

the people stealing data

No one is doing this

Output doesn’t matter since it’s pretty well settled it’s not derivative work

Cool, discussion over.

@[email protected] · 2 years ago

It is stealing data. In order to train on it they have to store the data. That’s a copyright violation. There’s no way to interpret it as not stealing data.

@[email protected] · 2 years ago

It is not stealing. The data is still there. It is, at worst, copyright violation.

@[email protected] · 2 years ago

If I steal something from you I have it and you don’t. When I copy an idea from you, you still have the idea. As a whole the two person system has more knowledge. While actual theft is zero sum. Downloading a car and stealing a car are not the same thing.

And don’t even try the rewarding artists and inventors argument. Companies that fund R&D get tax breaks for it, so they already get money. And artists are rarely compensated appropriately.

@[email protected] · 2 years ago

Copied cars. Copying is not theft or stealing.

@[email protected] · edit-2 2 years ago

My hot take is that it’s not like most of those independent artists are getting compensated fairly, if at all, by the companies that own them anyway. Stealing AI training content is just stealing from corporations. Corporations who are probably politically fighting to keep things worse for the average person in your country.

Theft is “a crime” but I never saw anyone complaining about how unfair it was all those times I myself got fucked over by Google bullshitting their way out of giving me my ad revenue. If normal people can’t profit from stuff like this, we shouldn’t be doing anything to protect the profits of evil corporations.

bravesirrbn ☑️ · 2 years ago

Then don’t

@[email protected] · edit-2 2 years ago

If it ends up being OK for a company like OpenAI to commit copyright infringement to train their AI models it should be OK for John/Jane Doe to pirate software for private use.

But that would never happen. Almost like the whole of copyright has been perverted into a scam.

@[email protected] · 2 years ago

You wouldn’t steal a car, would you?

@[email protected] · 2 years ago

It is funny how Hollywood was droning that sentence into our head, and now they are downloading actors themselves. Oh the irony.

@[email protected] · 2 years ago

Using copyrighted material is not the same thing as copyright infringement. You need to (re)publish it for it to become an infringement, and OpenAI is not publishing the material made with their tool; the users of it are. There may be some grey areas for the law to clarify, but as yet, they have not clearly infringed anything, any more than a human reading copyrighted material and making a derivative work.

@[email protected] · 2 years ago

Insane how this comment is downvoted, when, as far as I’m aware, it’s literally just the legal reality at this point in time.

@[email protected] · 2 years ago

any more than a human reading copyrighted material and making a derivative work.

It seems obvious to me that it’s not doing anything different than a human does when we absorb information and make our own works. I don’t understand why practically nobody understands this

I’m surprised to have even found one person that agrees with me

@[email protected] · 2 years ago

Because it’s objectively not true. Humans and ML models fundamentally process information differently and cannot be compared. A model doesn’t “read a book” or “absorb information”

@[email protected] · edit-2 2 years ago

I didn’t say they processed information the same, I said generative AI isn’t doing anything that humans don’t already do. If I make a drawing of Gordon Freeman or Courage the Cowardly Dog, or even a drawing of Gordon Freeman in the style of Courage the Cowardly Dog, I’m not infringing on the copyright of Valve or John Dilworth. (Unless I monetize it, but even then there’s fair-use…)

Or if I read a statistic or some kind of piece of information in an article and spoke about it online, I’m not infringing the copyright of the author. Or if I listen to hundreds of hours of a podcast and then do a really good impression of one of the hosts online, I’m not infringing on that person’s copyright or stealing their voice.

Neither me making that drawing, nor relaying that information, nor doing that impression are copyright infringement. Me uploading a copy of Courage or Half-Life to the internet would be, or copying that article, or uploading the hypothetical podcast on my own account somewhere. Generative AI doesn’t publish anything, and even if it did I think there would be a strong case for fair-use for the same reasons humans would have a strong case for fair-use for publishing their derivative works.

@[email protected] · 2 years ago

It comes from OpenAI and is given to OpenAI’s users, so they are publishing it.

@[email protected] · 2 years ago

It’s being mishmashed with a billion other documents just like it to make a derivative work. It’s not like OpenAI is giving you a copy of Hitchhiker’s Guide to the Galaxy.

@[email protected] · 2 years ago

New York Times was able to have it return a complete NYT article, verbatim. That’s not derivative.

@[email protected] · 2 years ago

I thought the same thing until I read another perspective on it from Mike Masnick. From what he writes, it seems pretty clear they manipulated ChatGPT with some very specific prompts that someone who doesn’t already pay NYT for access would not be able to use, for example feeding it 3 verbatim paragraphs from an article and asking it to generate the rest. If you understand how these LLMs work, it’s really not surprising that you can indeed force it to do things like that, but it’s an extreme case, and I’m with Masnick and the user you’re responding to on this one.

I also watched most of today’s subcommittee hearing on AI and journalism. A lot of the arguments are that this will destroy local journalism. Look, strong local journalism is some of the most important work that is dying right now. But the grave was dug by these large media companies and hedge funds that bought up and gutted those local news orgs, and not many people outside of the industry batted an eye while that was happening. This is a bit of a tangent, but I don’t exactly trust the giant hedge funds who gutted these local news orgs over the past decade to all of a sudden care at all about how important they are.

Sorry for the tangent, but here’s the article I mentioned that’s more on topic - http://mediagazer.com/231228/p11#a231228p11

@[email protected] · 2 years ago

So they gave it the 3 paragraphs that are available publicly, said continue, and it spat out the rest of the article that’s behind a paywall. That sure sounds like copyright infringement.

@[email protected] · 2 years ago

And that’s not the intent of the service, it’s a bug and they’ll fix it.

kingthrillgore · 2 years ago

It’s almost like we had a thing where copyrighted things used to end up, but they extended the dates because money.

@[email protected] · 2 years ago

This is where they have the leverage to push for actual copyright reform, but they won’t. Far more profitable to keep the system broken for everyone but have an exemption for AI megacorps.

rivermonster · 2 years ago

I was literally about to come in here and say it would be an interesting tangential conversation to talk about how FUCKED copyright laws are, and how relevant to the discussion it would be.

More upvote for you!

@[email protected] · 2 years ago

Copyright protection only exists in the context of generating profit from someone else’s work. If you were to figure out cold fusion and I’d look at your research and say “That’s cool, but I am going to go do some woodworking,” I am not infringing any copyrights. It’s only ever an issue if the financial incentive to trace the profits back to their copyrighted source outweighs the cost of doing so. That’s why China has had free rein to steal any western technology; fighting them in their courts is not worth it. But with AI it’s way easier to trace the output back to its source (especially for art), so the incentive is there.

The main issue is the extraction of value from the original data. If I were to steal some bricks from your infinite brick pile and build a house out of them, do you have a right to my house? Technically I never stole a house from you.

@[email protected] · 2 years ago

You stole bricks. How rich I am does not impact what you did. Copying is not theft. You can keep stretching any shady analogy you want but you can’t change the fundamentals.

HelloThere · edit-2 2 years ago

You’re conflating copyright and patents.

@[email protected] · 2 years ago

Shit, you’re right, I am.

@[email protected] · 2 years ago

Also conflating theft vs copying

Ook the Librarian · 2 years ago

It’s not “impossible”. It’s expensive and will take years to produce material under an encompassing license in the quantity needed to make the model “large”. Their argument is basically “but we can have it quickly if you allow legal shortcuts.”

@[email protected] · 2 years ago

Whenever a company says something is impossible, they usually mean it’s just unprofitable.

@[email protected] · 2 years ago

The law is shit

@[email protected] · 2 years ago

That argument has unfortunately worked for many other Tech Bros

@[email protected] · edit-2 2 years ago

Then LLMs should be FOSS

rivermonster · edit-2 2 years ago

All AI should be FOSS and public domain, owned by the people, and all gains from its use taxed at 100%. It’s only because of the public that AI exists, through the schools, universities, NSF, grants, etc and all the other places that taxes have been poured into that created the advances upon which AI stands, and the AI critical research as well.

@[email protected] · 2 years ago

That does nothing to solve the problem of data being used without consent to train the models. It doesn’t matter if the model is FOSS if it stole all the data it trained on.

@[email protected] · 2 years ago

The only way I can steal data from you is if I break into your office and walk off with your hard drive. Do you have access to something? It hasn’t been stolen.

@[email protected] · 2 years ago

Copying is not theft or stealing.

@[email protected] · edit-2 2 years ago

Copying copyright protected data is theft AND stealing

Edit: this also applies to my stance on piracy, which I don’t engage in for the same reason. It’s theft

db0 · 2 years ago

By definition you’re wrong

@[email protected] · 2 years ago

It’s theft.

You can steal all you want, but it’s still theft. Piracy is theft, stealing data to be used as training data is theft.

Not everyone wants their creations to be infinitely shared beyond their control. If someone creates something, they’re entitled to absolute control over it.

@[email protected] · 2 years ago

You are only hurting yourself by adopting a rule like that.

@[email protected] · 2 years ago

But our current copyright model is so robust and fair! They will only have to wait 95y after the author died, which is a completely normal period.

If you want to control your creations, you are completely free to NOT publish it. Nowhere is it stated that to be valuable or beautiful, it has to be shared on the world podium.

We’ll have a very restrictive Copyright for non globally transmitted/published works, and one for where the owner of the copyright DID choose to broadcast those works globally. They have a couple years to cash in, and then after I dunno, 5 years, we can all use the work as we see fit. If you use mass media to broadcast creative works but then become mad when the public transforms or remixes your work, you are part of the problem.

Current copyright is just a tool for folks with power to control that power. It’s what a boomer would make driving their tractor / SUV while chanting to themselves: I have earned this.

@[email protected] · 2 years ago

IMHO being able to “control your creations” isn’t what copyright was created for; it’s just an idea people came up with by analogy with physical property without really thinking through what purpose it is supposed to serve. I believe creators of intellectual “property” have no moral right to control what happens with their creations, and they only have a limited legal right to do so as a side-effect of their legal right to profit from their creations.

@[email protected] · 2 years ago

Copyright can be a double edged sword, but…

If you want to control your creations, you are completely free to NOT publish it.

… You’ve identified the chilling effect it’s designed to prevent: namely, telling people that they don’t matter in the scope of creation.

There’s a great video about how plagiarism dehumanizes, and if you’ve got a couple minutes I’d recommend it.

@[email protected] · edit-2 2 years ago

First:

I truly believe that they don’t matter as an individual when looking at their creation as a whole. It matters among their loved ones, and for that person themselves. Why do you need more… importance? From whom? Why do you need to matter in the scope of creation? Is it a creation for you? Then why publish it? Is it a creation for others? Then why does your identity matter? It just seems like egotism with extra steps. Using copyright to combat this seems like a red herring argument made by people who have portfolios against people who don’t…

You are not only your own person, you carry human culture remnants distilled out of 12000 years of humanity! You plagiarised almost the whole of humanity while creating your ‘unique’ addition to culture. But, because your remixed work is newer and not directly traceable to its direct origins, we’re gonna pretend that you wrote it as a hermit living without humanity on a rock and establish the rules from there on out. If it was fair for all the players in this game, it would already be impossible to not plagiarise.

@[email protected] · 2 years ago

Them: “Oh yeah I have 10 minutes until my dentist appointment, I’ll check that out.”

@[email protected] · 2 years ago

I think it’s pretty amazing when people just run with the dogma that empowers billionaires.

Every creator hopes they’ll be the next Taylor Swift and that they’ll retain control of their art for those life + 70 years and make enough to create their own little dynasty.

The reality is that long duration copyright is almost exclusively a tool of the already wealthy, not a tool for the not-yet-wealthy. As technology improves it will be easier and easier for wealth to control the system and deny the little guy’s copyright on grounds that you used something from their vast portfolio of copyright/patent/trademark/ipmonopolyrulelegalbullshit. Already civil legal disputes are largely a function of who has the most money.

I don’t have the solution that helps artists earn a living, but it doesn’t seem like copyright is doing them many favors as-is unless they are retired rockstars who have already earned in excess of the typical middle class lifetime earnings by the time they hit 35, or way earlier.

@[email protected] · 2 years ago

I don’t have the solution that helps artists earn a living, but it doesn’t seem like copyright is doing them many favors as-is unless they are retired rockstars who have already earned in excess of the typical middle class lifetime earnings by the time they hit 35, or way earlier.

Just because copyright helps them less doesn’t mean it doesn’t help them at all. And at the end of the day, I’d prefer to support the retired rockstars over the stealing billionaires.

@[email protected] · 2 years ago

I am against the dogma that empowers billionaires. Sam Altman is one such billionaire who abuses data that we should not ignore.
I don’t know why you are treating copyright as a binary that doesn’t have any nuance. Current copyright law is imperfect, and if your concern is genuine we can talk about it at a future time.
If you don’t have the solution, perhaps you should not attack one of the remaining defenses against rampant abuses of peoples’ livelihood.

@[email protected] · 2 years ago

Current copyright law is imperfect,

Yeah and Joseph Stalin was a bit naughty. As long as we are seeing how understated we can be.

If you don’t have the solution, perhaps you should not attack one of the remaining defenses against rampant abuses of peoples’ livelihood.

The creator of Superman wasn’t paid royalties and was laid off. Many years later he worked as a restaurant delivery guy and ended up dropping off food at DC Comics. The artist that built that company, doing a sandwich run.

@[email protected] · edit-2 2 years ago

Oh, is this the future time? I was thinking that you could air your concerns in a different thread entirely, perhaps in a subreddit devoted to it. There has been a suspicious number of people suddenly concerned about copyright and other things but only when AI is discussed.

@[email protected] · 2 years ago

If you’ve got an accusation, go ahead and make it. I will be here downloading a fucking car.

@[email protected] · 2 years ago

Relax, I would never be so grimy as to accuse you of something. I wish you well with your legitimate interests, and I hope you can find threads where they actually are on topic!

@[email protected] · 2 years ago

Funny thing is, human artists work quite similarly to AI, in that they take the whole of human art creation, build on it and create something new (sometimes quite derivative). No art comes out of a vacuum; it builds on previous works. I would not really say AI plagiarizes anything, unless it reproduced pretty much the exact work of someone.

ugjka · 2 years ago

TBH I only use LLMs when traditional search fails and even then I’m not sure if I’m getting something useful or hallucination. I need better search engines not fancy AI bullshitters

@[email protected] · 2 years ago

I guess the lesson here is pirate everything under the sun and as long as you establish a company and train a bot everything is a-ok. I wish we knew this when everyone was getting dinged for torrenting The Hurt Locker back when.

Remember when the RIAA got caught with pirated mp3s and nothing happened?

What a stupid timeline.

@[email protected] · 2 years ago

So if I look at a painting, study it, and then emulate the original painter’s art style, then I’m in breach of their copyright?

Or if I read a lot of fantasy like GRRM or JK Rowling and I also write a fantasy book and say that they were my inspiration, I’m breaching their copyright??

That’s not how it works, and if it is, it shouldn’t be!

Sure, if I start reproducing work, i.e. plagiarizing the work of others, then I’m doing something wrong.

And to spin this further: If I raise a child on children’s books by a specific author, am I breaching copyright, when my child enters the workforce and starts to earn money??? Stupid, yes! But so are the copyright claims against LLMs, in my opinion.

@[email protected] · edit-2 2 years ago

You’re comparing something humans often do subconsciously to a machine that was programmed to do that. Unless you’re arguing that intent doesn’t matter ~~(pretty much every judge in America will tell you it does)~~ then we’re talking about 2 completely different things.

Edit: Disregard the struck out portion of my comment. Apparently I don’t know shit about law. My point is that comparing a quirk of human psychology to the strict programming of a machine is a false equivalency.

@[email protected] · 2 years ago

Intent does not matter for copyright infringement, it’s a strict liability.

@[email protected] · 2 years ago

I looked it up and you’re right. I must have been thinking of a different crime. That’ll teach me to go spouting off about stuff.

My point that AI is programmed to recycle and humans aren’t is still something I stand by, so I edited my comment.

@[email protected] · 2 years ago

I don’t think it’s accurate to call the work of AI the same as the human brain, but most importantly, the difference is that humans and tools have and should have different rights. Someone can’t simply point a camera at a picture and say “I can look at it with my eye and keep it in my memory, so why can’t the camera?”

Because we ensure the right of learning for people. That doesn’t mean it’s a free pass to technologically process works however one sees fit.

Never mind that the more people prodded AIs, the more they found that the reproductions are much closer to the originals than vague replications of style. People have managed to get whole sentences from books and obvious copies of real artwork, copyrighted characters and celebrities by prompting AI in specific ways.

@[email protected] · 2 years ago

To be fair, I think your analogy falls apart a bit because you can in fact take a picture of pretty much any art you want to, legally speaking.

You can’t go sell it or anything, but you are definitely not in breach of copyright just by taking the picture.

@[email protected] · 2 years ago

That’s a rebuttal on the level of “if a tree falls in the forest and nobody is there to hear it”. Legally, theoretically, you should need permission just as much, but nobody is going to sue you over something nobody else sees.

Copyright addresses reproduction and distribution, paid or not, including derivative works. There are exemptions for journalism and education. AI advanced a lot by using copyrighted materials under the reasoning that it was technological research, but as it spun off into commercial use, its reliance on copyrighted materials for training has become much more questionable.

@[email protected] · 2 years ago

Copyright law only works because most violations are not feasible to prosecute. A world where copyright laws are fully enforced would be an authoritarian dystopia where all art and science is owned by wealthy corporations.

Copyright law is inherently authoritarian. The conversation we should have been having for the last 100 years isn’t about how much we’ll tolerate technical violations of copyright law; it’s how much we’ll tolerate the chilling effect of copyright law on sharing for the sake of promoting new creative works.

@[email protected] · 2 years ago

Absolutely and I’m with you on that. I think Copyright is excessively long and overly restrictive.

But that is another conversation.

The conversation we are having now is how to protect and compensate human creators that need their livelihoods to keep creating in our society as it is, when these new AI tools, trained on their works, are used to deliberately replace them.

There are many issues with copyright as it is right now, but it is literally the only resort that artists have left in this situation. It’s not a given that opposing copyright hinders corporations. In this particular case there are many corporations salivating at the opportunity to replace human creators with AI, to get faster work, cheaper, to appropriate distinctive styles without needing to hire the people who developed them.

There is a chilling effect of its own happening here. There are writers and artists today who are seeing their jobs handed to AI and deciding that creative work is not a feasible career to have anymore. Not only is this tragic by virtue of human interest alone, but since AI relies on human creators to be trained, it’s very possible that these models will spiral into recursive derivativeness and become increasingly stale, devoid of fresh ideas and styles.

@[email protected] · 2 years ago

the right of learning

That’s not a thing. There is a right to an education, but that is not about copyright (though it may imply the necessity of fair use exceptions in certain contexts).

Also, you are confused about AI output. It’s possible to make the AI spit out training data, but it takes, indeed, prodding. It’s unlikely to matter by US law.

Melllvar · 2 years ago

Sounds like a fatal problem. That’s a shame.

@[email protected] · 2 years ago

Well, strictly speaking you could have an AI that only knows about the world up to 1928 and talks like it’s 1928.