Stack Overflow bans users en masse for rebelling against OpenAI partnership — users banned for deleting answers to prevent them being used to train ChatGPT

lemmyreader · 1 year ago

verassol · 1 year ago

StackOverflow: *grabs money on monetizing massive amounts of user-contributed content without consulting or compensating the users in any way*

Users: *try to delete it all to prevent it*

StackOverflow: *your contributions belong to the community, you can’t do that*

Pretty fucked-up laws. A lot of lawsuits going on right now against AI companies for similar issues. In this case, StackOverflow is entitled to be compensated for its partnership, and because the answers are all CC BY-SA 3.0, no one can complain. Now, that SA? Whatever.

@[email protected] · 1 year ago

That SA part needs to be tested in court against the AI models themselves

A lot of this shittiness would probably go away if there was a risk that ingesting certain content would mean you need to release the actual model to the public.

verassol · edit-2 1 year ago

Yeah, their assumption though is you don’t? Neither attribution nor sharealike, not even full-on all-rights-reserved copyright is being respected. Anything public goes and if questions are asked it’s “fair use”. If the user retains CC BY-SA over their content, why is giving a bunch of money to StackOverflow entitling OpenAI to use it all under whatever terms they settled on? Boggles me.

Now, say, Reddit Terms of Service state clearly that by submitting content you are giving them the right to “a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness (…) in all media formats and channels now known or later developed anywhere in the world.” Speaks volumes on why alternatives (like Lemmy) to these platforms matter.

Skull giver · 1 year ago

deleted by creator

verassol · 1 year ago

That’s interesting. I was looking up “Lemmy Terms of Service” for comparison after getting that quote from the Reddit ToS and could not find anything for Lemmy.ml. Now after you mentioned it, looking on my Mastodon instance, nothing either, just a privacy policy. That is indeed kinda weird. Some instances do have their own ToS though. At least something stating a sublicense for distribution should be there for protection of people running instances in locations where it’s relevant.

Skull giver · 1 year ago

deleted by creator

@[email protected] · 1 year ago

The funny thing about Lemmy is that the entire Fediverse is basically running a massive copyright violation ring with current copyright law.

Is it, though?

When someone posts a comment to Lemmy, they do so willingly, with the intent for it to be posted and federated. If they change their mind, they can delete it. If they delete it and it remains up somewhere, they can submit a DMCA request; likewise if someone else posts their copyrighted content.

Copyright infringement is the use of works protected by copyright without permission for their use. When you submit a post or a comment, your permission to display it and for it to be federated is implied, because that is how Lemmy works. A license also conveys permission, but that’s not the only way permission can be conveyed.

Skull giver · 1 year ago

deleted by creator

@[email protected] · 1 year ago

The idea that someone does this willingly implies that the user knows the implications of their choice, which most of the Fediverse doesn’t seem to do

The terms of service for lemmy.world, which you must agree to upon sign-up, make reference to federating. If you don’t know what that means, it’s your responsibility to look it up and understand it. I assume other instances have similar sign-up processes. The source code to Lemmy is also available, meaning that a full understanding is available to anyone willing to take the time to read through the code, unlike with most social media companies.

What sorts of implications of the choice to post to Lemmy do you think that people don’t understand, that people who post to Facebook do understand?

If the implied license was enough, Facebook and all the other companies wouldn’t put these disclaimers in their terms of service.

It’s not an implied license. It’s implied permission. And if you post content to a website that’s hosting and displaying such content, it’s obvious what’s about to happen with it. Please try telling a judge that you didn’t understand what you were doing, sued without first trying to delete or file a DMCA notice, and see if that judge sides with you.

Many companies have lengthy terms of service with a ton of CYA legalese that does nothing. Even so, an explicit license to your content in the terms of service does do something - but that doesn’t mean that you’re infringing copyright without it. If my artist friend asks me to take her art piece to a copy shop and to get a hundred prints made for her, I’m not infringing copyright then, either, nor is the copy shop. If I did that without permission, on the other hand, I would be. If her lawyer got wind of this and filed a suit against me without checking with her and I showed the judge the text saying “Hey hedgehog, could you do me a favor and…,” what do you think he’d say?

Besides, Facebook does things that Lemmy instances don’t do. Facebook’s codebase isn’t open, and they’d like to reserve the ability to do different things with the content you submit. Facebook wants to be able to do non-obvious things with your content. Facebook is incorporated in California and has a value in the hundreds of billions, but Lemmy instances are located all over the world and I doubt any have a value even in the millions.

Skull giver · 1 year ago

deleted by creator

verassol · 1 year ago

the claimants were set back because they’ve been asked to prove the connection between AI output and their specific inputs

I mean, how do you do that for a closed-source model with secretive training data? As far as I know, OpenAI has admitted to using large amounts of copyrighted content, numberless books, newspaper material, all on the basis of fair use claims. Guess it would take a government entity actively going after them at this point.

Skull giver · 1 year ago

deleted by creator

verassol · 1 year ago

Thank you for sharing. Your perspective broadens mine, but I feel a lot more negative about the whole “must benefit business” side of things. It is fruitless to hold any entity whatsoever accountable when a whole worldwide economy is in a free-for-all nuke-waving doom-embracing realpolitik vibe.

Frankly, not sure what would be worse, economic collapse and the consequences to the people, or economic prosperity and… the consequences to the people. Long term, and from a country that is not exactly thriving in the scheme side of things, I guess I’d take the former.

Skull giver · 1 year ago

deleted by creator

@[email protected] · 1 year ago

Yep. Can’t wait to overfit an LLM on a lot of copyrighted work and release it into the public domain. Let’s see if OpenAI gets pushback from copyright owners down the road.

@[email protected] · edit-2 1 year ago

This is a violation of GDPR, no?

EDIT: user-created content is not directly protected under GDPR; only personally identifiable data is protected under GDPR.

lemmyreader · 1 year ago

Dunno. GDPR is a Europe-only thing, and isn’t it only related to how your private data (like name, IP address, phone number) is handled?

Captain Beyond · 1 year ago

I would certainly hope so. Stack Overflow content is Creative Commons licensed, so the argument is basically that the GDPR would take precedence over the CC license grant. It’d be scary if GDPR could be weaponized against forks of free software projects in this manner.

@[email protected] · 1 year ago

Right, I think it only covers personal information: companies can only collect what they need to run their service, users can request to see their data etc. I don’t think it applies to comments and posts.

@[email protected] · 1 year ago

How so?

@[email protected] · 1 year ago

Users should have the right to delete their data stored by the company.

@[email protected] · 1 year ago

That only applies to personal data.

@[email protected] · 1 year ago

Would that kind of provision allow me to have my code removed from a git repository history, if that git repository is hosted by a company?

@[email protected] · 1 year ago

As long as you didn’t give those rights by signing a CLA or a copyleft license. Never sign a CLA unless you’re fully compensated.

@[email protected] · edit-2 1 year ago

I am not a lawyer, but I believe in general, yes.

Git is not even that convoluted, as all the history is stored in the .git folder within the repo. Unless there is some convoluted structure built on top, they would only need to move the repo folder to a trash disk, waiting to be formatted.

That being said, GDPR is somewhat poorly enforced at the moment, unfortunately. I don’t know if you can sue the company and expect some result within a couple of years.

@[email protected] · 1 year ago

No because user generated content is not protected.

@[email protected] · 1 year ago

Doesn’t that just mean the data would have to be anonymized?

@[email protected] · 1 year ago

I am not an expert or a lawyer, but I believe users actually hold the right to completely erase personal data:

The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay

https://gdpr.eu/right-to-be-forgotten/

Note the word “erasure” as opposed to “anonymize”

@[email protected] · 1 year ago

I don’t think that addresses my point. Is my opinion on the new Star Wars movies that I post online or some lines of code I suggest “personal data”? I thought personal data had a specific definition under GDPR

Spaenny · 1 year ago

Technically, they could retain posts from users if they are irreversibly anonymized. However, ensuring with 100% certainty that none of your posts ever contained any personal data that could lead to the identification of you as an individual is challenging. The safest option is therefore to also delete your posts.

@[email protected] · 1 year ago

I think you are right, user-generated content doesn’t seem to be protected. This is surprising to me, as users should hold the right to their content, which in my mind should enjoy stronger protection than personal data.

@[email protected] · 1 year ago

You’re totally right, the content of your posts is not considered personal data (because it isn’t). It’s more about profiling data that can be connected back to your actual person.

@[email protected] · 1 year ago

How does GDPR get away with not defining what a website is when referring to them directly in the law? Like what counts, only html? http? ftp? gopher?

@[email protected] · 1 year ago

This shit scares me. It will become so easy to rewrite history from here. Just delete anything you don’t like and have an AI rewrite it into whatever you want. Entire threads rewritten; a company can go back and change your entire post history in ways that might be legally compromising.

HexesofVexes · 1 year ago

I mean, here is a thought: if an AI tool uses Creative Commons data, then its derivatives fall under Creative Commons. I.e. stop charging for AI tools and people will stop complaining.

@[email protected] · 1 year ago

So what is the Stack Overflow replacement?

@[email protected] · 1 year ago

Maybe https://www.codidact.com/

katy ✨ · 1 year ago

that would be great if they federated and implemented activitypub/atproto!

katy ✨ · 1 year ago

let’s all go back to experts exchange

@[email protected] · 1 year ago

Expert sex change?

Captain Beyond · edit-2 1 year ago

There is, I believe, a fundamental misunderstanding as to what exactly a site like Stack Overflow is. It’s not a forum; there’s no such thing as “your posts.” It’s more like Wikipedia, as in a collaborative question-and-answer site, or a knowledgebase. Each question and answer can be edited like a mini wiki page. They aren’t “yours” any more than the Wikipedia page you created ten years ago is; you contributed it to the commons, so (at least in theory) you don’t have the right to take it back.

Whether whatever "Open"AI is doing is right is another question, of course. But I don’t think destroying or poisoning the commons to strike back at it is helpful either; it feels like “destroying it to save it.”

@[email protected] · 1 year ago

Fine, but when coding projects undergo licensing changes that the contributors are against, the code author has to remove those contributions and replace them.

@[email protected] · 1 year ago

Data Rule Numero Uno:

Garbage in, garbage out.

Have fun training your LLM on a big steaming pile of hot garbage. That’s 80% of Stack Overflow’s content.

LostXOR · edit-2 1 year ago

The other 20% is mostly high quality however, and I’m sure they’d filter out the heavily downvoted crud.

@[email protected] · 1 year ago

You say that as if the garbage gets downvoted

@[email protected] · 1 year ago

Mostly “this has been answered in another thread” and “why don’t you Google it” comments in my experience.

ddh · 1 year ago

Can’t wait until the top answer to every Google search is “just google it”

@[email protected] · 1 year ago

One time I went on there to figure out an issue in Arduino. The answer one guy gave was “I don’t know how to do this in Arduino, here’s how you do this in Java”. Not only did the mods prevent any other answers from being posted, but when I tried the guy’s suggestion in Java, it didn’t even work.

Scott · 1 year ago

Based users

@[email protected] · 1 year ago

It’s just a matter of time until all your messages on Discord, Twitter etc. are scraped, fed into a model and sold back to you

As if it didn’t happen already

@[email protected] · 1 year ago

Stack Overflow just earned a place under Reddit in the hosts block list.

@[email protected] · 1 year ago

This isn’t really comparable to reddit, since users can just send a request to SO for all the content. Reddit locking down the API meant we lost access to our content.

FenrirIII · 1 year ago

If you get something for free, you are the product

Matt The Horwood · 1 year ago

Why delete the answer? Why not edit it so that a human can see the answer but for AI it’s a load of nonsense?

@[email protected] · 1 year ago

People did that. Stack Overflow reverted the changes.

@[email protected] · 1 year ago

Editing any content to reduce its quality is considered vandalism and gets reverted on SO.

Matt The Horwood · 1 year ago

So we need to upvote wrong answers only?

@[email protected] · 1 year ago

There’s no way that would work either, they can just store the full edit history and auto-curate as needed.
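A rough sketch of why edits alone don’t help if the site keeps every revision. The `Post` class and the cutoff-based lookup are made up for illustration (an assumption, not how SO actually stores or curates edits):

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    """Hypothetical post that never overwrites text: each edit is appended."""
    revisions: list = field(default_factory=list)  # (timestamp, text), oldest first

    def edit(self, timestamp, text):
        # Store the new text alongside all earlier versions.
        self.revisions.append((timestamp, text))

    def current(self):
        # What visitors see: the most recent revision.
        return self.revisions[-1][1]

    def as_of(self, cutoff):
        """Latest text written at or before `cutoff`, or None if there is none."""
        earlier = [text for ts, text in self.revisions if ts <= cutoff]
        return earlier[-1] if earlier else None


post = Post()
post.edit(1, "Use a context manager to close the file.")
post.edit(5, "asdf qwer zxcv")  # protest edit
```

Here `post.current()` is the garbage, but `post.as_of(4)` still returns the original answer, so whoever holds the history can pick whichever version they want.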

Skull giver · 1 year ago

deleted by creator

@[email protected] · edit-2 1 year ago

This is similar to what I did when I heard Reddit was doing the API lockdown: I wrote an automation bot over the weekend that self-destructed my subreddit and the entire post history. The bot also automatically downloaded and archived all of the content on my local machine.

It was annoying because at first I couldn’t get access to older posts since at the time reddit had changed their API to only show the first X posts (100 or 1,000 or whatever). So I told my bot to delete the posts as it archived them so as I deleted content, reddit had no choice but to populate the page with the older posts.

And that’s how I archived my subreddit. Reddit banned me two days later for automation, lol. I did not break any of the reddit or reddit api ToS during this process but I guess I upset someone.
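Roughly, the archive-then-delete loop looked like the sketch below. This is a simulation, not the actual bot: `list_newest` is a hypothetical stand-in for a listing call that only returns the newest `cap` posts, and the posts live in a plain dict instead of behind a real API.

```python
def list_newest(posts, cap):
    """Hypothetical listing call: returns only the newest `cap` posts."""
    return sorted(posts.values(), key=lambda p: p["created"], reverse=True)[:cap]


def archive_and_delete(posts, cap):
    """Save each visible post, then delete it so older posts surface."""
    archive = []
    while True:
        page = list_newest(posts, cap)
        if not page:
            break  # nothing left to list, so everything has been archived
        for post in page:
            archive.append(post)    # keep a local copy first
            del posts[post["id"]]   # deleting frees a slot in the capped listing
    return archive


# Ten fake posts, but the listing only ever shows three at a time.
posts = {i: {"id": i, "created": i, "body": f"post {i}"} for i in range(10)}
archive = archive_and_delete(posts, cap=3)
```

Every pass sees a fresh page of up to three posts, so all ten end up in `archive` even though the listing never shows more than three.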

ubergeek77 · 1 year ago

I don’t think I’ve been banned, but I did a similar thing. I requested all my data from Reddit, then used that list of comment/post IDs to mass-edit them. I think I’m in the clear because I used the official third party API, with an official “app.” If you used the private API or instrumented this via the browser, that may be why you were banned.

Anyway, if you or someone else wants their full history, Reddit will give it to you via a data export request.

@[email protected] · 1 year ago

Unfortunately they still have everything. It’s good for the “human” visibility (lack of) but they have the data still

@[email protected] · edit-2 1 year ago

Oh I know, I just wanted a copy too.

Deleting posts from the user PoV was the only way I could come up with to force the API to show them to me.

@[email protected] · 1 year ago

We can’t even communicate without being leeched upon. Fuck this is grim

davel [he/him] · 1 year ago

Good luck with the deleting. It often just means UPDATE comments SET is_deleted = 1 WHERE ID = 666;.
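A minimal sketch of that soft-delete pattern, using Python’s built-in sqlite3. The `comments` table and its columns are assumptions based on the one-liner above, not any real site’s schema:

```python
import sqlite3

# In-memory database with a hypothetical comments table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "id INTEGER PRIMARY KEY, body TEXT, is_deleted INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO comments (id, body) VALUES (666, 'my answer')")


def soft_delete(conn, comment_id):
    """Flag the row as deleted; the body stays in the table."""
    conn.execute("UPDATE comments SET is_deleted = 1 WHERE id = ?", (comment_id,))


def visible_comments(conn):
    """What the public site would show: rows not flagged as deleted."""
    return conn.execute(
        "SELECT id, body FROM comments WHERE is_deleted = 0"
    ).fetchall()


def stored_body(conn, comment_id):
    """What the backend can still read, regardless of the flag."""
    row = conn.execute(
        "SELECT body FROM comments WHERE id = ?", (comment_id,)
    ).fetchone()
    return row[0] if row else None


soft_delete(conn, 666)
```

After the call, `visible_comments(conn)` no longer lists the comment, but `stored_body(conn, 666)` still returns the text.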

chiisana · 1 year ago

There were similar things done on Reddit during the big exit. I doubt it achieved what people expected it to achieve. Even if they’re not visible externally, I’m sure they can easily access (thereby make deals to license) the data out of their backend / backup; just a matter of how hard they want to try (hint: it’s really not very hard).

@[email protected] · 1 year ago

Yeah during the reddit exodus, people were recommending to overwrite your comment with garbage before deleting it. This (probably) forces them to restore your comment from backup. But realistically they were always going to harvest the comments stored in backup anyway, so I don’t think it caused them any more work.

If anything, this probably just makes reddit’s/SO’s partnership more valuable because your comments are now exclusive to reddit’s/SO’s backend, and other companies can’t scrape it.

Lemongrab · 1 year ago

It was to make the data inaccessible to general people, therefore removing the reason people visit reddit. Even if reddit could still get the data, regular people would be inconvenienced (in theory) and look somewhere else.

plz1 · 1 year ago

They are not deleting, they are editing. So the platform would have to undo those edits rather than just flipping the visibility flag.

paraphrand · 1 year ago

And they are. 😞

Sibbo · 1 year ago

Does GDPR apply to stackoverflow? Since my data there probably does not identify me as a person?

@[email protected] · 1 year ago

You can delete your data, but I don’t think it magically makes derivative works disappear. It’s licensed SA. This is good.