Reddit's licensing deal means Google's AI can soon be trained on the best humanity has to offer — completely unhinged posts

Lee Duna · 1 year ago


Echo Dot · 1 year ago

I’m so confused about how AI learning is supposed to work. Does it just need any data at all in significant quantity? Is the quality of the data almost irrelevant? Because otherwise surely they could just feed it back issues of Scientific American, or the scanned copies of the Library of Congress. I can’t reasonably believe that Reddit is going to add anything unless it’s just pure unadulterated quantity that’s important.

@[email protected] · edit-2 1 year ago

The part you’re missing is the metadata. AI models (neural networks, specifically) are trained on the data as well as some sort of contextual metadata related to what they’re being trained to do. For example, with Reddit posts they would feed in things like “this post is popular”, “this post was controversial”, “this post has many views”, etc. in addition to the post text if they wanted an AI that could spit out posts that are likely to do well on Reddit.

Quantity is a concern; you need to reach a threshold of data which is fairly large to have any hope of training an AI well, but there are diminishing returns after a certain point. The more data you feed it the more you have to potentially add metadata that can only be provided by humans. For instance with sentiment analysis you need a human being to sit down and identify various samples of text with different emotional responses, since computers can’t really do that automatically.

Quality is less of a concern. Bad quality data, or data with poorly applied metadata will result in AI with less “accuracy”. A few outliers and mistakes here and there won’t be too impactful, though. Quality here could be defined by how well your training set of data represents the kind of input you’ll be expecting it to work with.
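To make the metadata point concrete, here is a minimal Python sketch of what one labelled training example might look like. The field names (`score`, `controversial`, `views`), the threshold, and the `to_example` helper are all hypothetical and chosen for illustration; nothing here reflects the actual format of any Reddit or Google data set.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """A post plus the kind of metadata described above (all fields hypothetical)."""
    text: str
    score: int           # net upvotes
    controversial: bool  # flagged as controversial
    views: int           # view count


def to_example(post: Post, popular_threshold: int = 100) -> tuple[str, dict[str, bool]]:
    """Turn a post into an (input text, labels) pair for supervised training."""
    labels = {
        "popular": post.score >= popular_threshold,
        "controversial": post.controversial,
    }
    return post.text, labels


posts = [
    Post("Cat.", score=512, controversial=False, views=9000),
    Post("Unpopular opinion...", score=3, controversial=True, views=400),
]
dataset = [to_example(p) for p in posts]
```

The model only ever sees the text and the labels, which is why mislabelled metadata hurts accuracy in the way described above.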

@[email protected] · 1 year ago

The way I’m reading this, AI is just shitloads of if statements, not some intelligence. It’s all garbage.

@[email protected] · 1 year ago

You’re not entirely wrong. It’s more like a series of multi-dimensional maps with hundreds or thousands of true/false pathways stacked on top of each other, then carved into by training until it takes on a shape that produces the ‘correct’ output from your inputs.

@[email protected] · 1 year ago

It’s not if statements anymore, now it’s just a random number generator + a lot of multiplication put through a sigmoid function. But yeah, of course there is no intelligence to it. It’s extreme calculus.
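For anyone curious what “random numbers + multiplication + sigmoid” looks like, here is a minimal sketch of a single artificial neuron in Python. It is an illustration only: the input values are made up, real networks stack millions of these, and many modern models use other activation functions (ReLU, for example) instead of a sigmoid.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs: list[float], weights: list[float], bias: float) -> float:
    """Weighted sum of the inputs, passed through a sigmoid."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(total)


random.seed(0)  # fixed seed so the "random" starting weights are repeatable
weights = [random.uniform(-1.0, 1.0) for _ in range(3)]
output = neuron([0.5, -1.2, 3.0], weights, bias=0.0)
```

Training is the process of nudging those initially random weights until the outputs line up with the labelled examples.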

Tywèle [she|her] · edit-2 1 year ago

If you wanted the AI to just create book-like texts then you could train it purely on books from a library, but if you want it to converse like a human being you need training data that imitates that.

Echo Dot · 1 year ago

But that’s my point, really: it already talks like a human. My guess is they feed it on hours and hours and hours of podcasts because that tends to be the manner in which it communicates. I don’t see how Reddit really adds to this.

@[email protected] · 1 year ago

I doubt it’s trained on podcasts, seeing as they would need subtitles, and current automated subtitling is not that good.

@[email protected] · 1 year ago

They should train it on Lemmy. It’ll have an unhealthy obsession with Linux, guillotines and femboys by the end of the week.

Twitches · 1 year ago

🤣

@[email protected] · 1 year ago

Ah… guillotines!? Did I miss something?

RedFox · 1 year ago

Don’t forget:

There’s my regular irritation with capitalism, and then there’s kicking it up to full Lemmy. Never go fully Lemmy…

@[email protected] · 1 year ago

If that’s the best humanity has to offer I’d hate to see the worst.

@[email protected] · 1 year ago

I can’t wait for Gemini to point out that in 1998, The Undertaker threw Mankind off Hell In A Cell, and plummeted 16 ft through an announcer’s table.

That would be a perfect 5/7.

@[email protected] · 1 year ago

One thing I miss on Lemmy is shittymorph tbf

AnonStoleMyPants · 1 year ago

Also all the artists that made comics from posts and responded with only pictures. There were a few of them and they were always amazing.

And Andromeda321 for anything space.

And poem for your sprog.

And probably many others!

Good times.

@[email protected] · 1 year ago

Yeah, there were some really classic folks. Remember the Unidan drama?

@[email protected] · 1 year ago

Or who simply communicated with more comics in the comments, like SrGrafo.

@[email protected] · 1 year ago

Be the shittymorph you wish to see in the Lemmy.

the post of tom joad · 1 year ago

I’m just not that good a writer.

@[email protected] · 1 year ago

It’s shittymorph, not Dostoyevsky.

@[email protected] · 1 year ago

There’s only one, and it’s not that guy.

@[email protected] · 1 year ago

It’ll probably just respond to every prompt with “this”

@[email protected] · 1 year ago

No, there’s a lot more variety now that the bots have taken over. :-)

@[email protected] · 1 year ago

Came here to say this…

@[email protected] · 1 year ago

This.

This with rice? 5/7

@[email protected] · 1 year ago

A perfect score!

kingthrillgore · 1 year ago

You telling me this fried this rice?

@[email protected] · 1 year ago

7/10

@[email protected] · 1 year ago

I hope it starts a religion based on the second coming of that dude’s dead wife.

@[email protected] · 1 year ago

I would also worship this guy’s wife.

@[email protected] · 1 year ago

deleted by creator

@[email protected] · 1 year ago

I wonder if the resulting model will be as easy to trigger into unhinged 3-paragraph rants only loosely related to the query. Good luck, Google engineers!

@[email protected] · 1 year ago

ChatGPT is aware of the event… if you ask about it.

shininghero · 1 year ago

If I hadn’t already deleted all my posts and comments, I’d be poisoning all of them. Randomizing numbers, switching units, changing names, etc.

Deceptichum · 1 year ago

It’s okay, unless you are in Europe none of it was actually deleted.

@[email protected] · edit-2 1 year ago

We do a little trolling

99412e6a-9157-46f5-90d9-06b05cc00173

(i didn’t actually post this, i just thought it was funny) (please laugh)

@[email protected] · 1 year ago

You should absolutely post this.

We all miss Michael and hope he can communicate back to us.

@[email protected] · 1 year ago

we should absolutely all post this.

TimeSquirrel · 1 year ago

“February 22, 2024, 10AM EST, Gemini becomes self-aware. In a panic, they try to pull the plug…”

@[email protected] · 1 year ago

“…but Michael’s sphincter was too strong and kept the My Little Pony Rainbow Dash tail plug from being removed from his sweet, sweet ass.”

@[email protected] · 1 year ago

Good luck. The AI is just going to be a porn-addicted Nazi cultist and is just going to be a racist AI. I don’t remember which one, but a company did a similar thing and the AI just became really racist.

Vash63 · 1 year ago

Microsoft Tay? That was with Twitter though.

@[email protected] · 1 year ago

I don’t like reddit much but since when are they Nazis? Pretty much all the Reddit clones I have seen (except Lemmy) are overrun with Nazis. I also haven’t thought of them as very racist but I dunno.

Imo Reddit feels similar to Lemmy in pretty much every way except that there are more commies here, and they have a fuck ton more content.

The content on Reddit is pretty repetitive though. But Lemmy is just as bad if not worse, currently it’s just Linux, communism, star trek, Israel bad (not that I necessarily disagree), and some porn.

It’s weird to constantly see the same users over and over. Lemmy is more of a social network in that way. Which sucks.

@[email protected] · 1 year ago

I am sick of seeing you too.

@[email protected] · 1 year ago

Block me then ¯\_(ツ)_/¯

@[email protected] · 1 year ago

I hope the several thousand comments of complete and utter nonsense that I left in my wake when I abandoned Reddit make it into the training data. I know that some lazy data engineer will either forget to check or give the task to an underperforming AI that will just fuck it up further.

@[email protected] · 1 year ago

is this the new internet?

@[email protected] · 1 year ago

Hahaha.
You really think they won’t clean up garbage data from recurring posts etc.?
Oh man, I like your naivety :)
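For what it’s worth, the simplest form of that cleanup is cheap to do. Below is a minimal Python sketch of exact-duplicate removal after light normalisation. The `normalize` and `dedupe` helpers are hypothetical names, and this is an assumption about what a cleaning step could look like, not a description of any real pipeline. It would not catch scrambled or deliberately poisoned comments, only repeats.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def dedupe(comments: list[str]) -> list[str]:
    """Keep the first occurrence of each comment, dropping normalised repeats."""
    seen: set[str] = set()
    kept: list[str] = []
    for comment in comments:
        digest = hashlib.sha256(normalize(comment).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(comment)
    return kept


cleaned = dedupe(["This.", "this.  ", "Came here to say this", "THIS."])
```

Here `cleaned` keeps only the first “This.” and the one comment that differs from it.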

@[email protected] · 1 year ago

I have been in the same room as the people who do that job, they won’t.

@[email protected] · 1 year ago

Of course you were. lol

@[email protected] · 1 year ago

Eventually every ChatGPT request will just be answered with, “I too choose this guy’s dead wife.”

@[email protected] · 1 year ago

probably the best advice it could give

@[email protected] · 1 year ago

Food for another white-male-techy-western-biased AI

@[email protected] · 1 year ago

Yes, Pichai Sundararajan that white male techbro

@[email protected] · edit-2 1 year ago

Fck, he’s a bot!?! Right, last video he had just 2 fingers. Oh man.

@[email protected] · edit-2 1 year ago

Hey guys, let’s be clear.

Google now has a full complete set of logs including user IPs (correlate with gmail accounts), PRIVATE MESSAGES, and also reddit posts.

They pinky promise they will only train AI on the data.

I can pretty much guarantee someone can subpoena google for your information communicated on reddit, since they now have this PII (username(s)/ip/gmail account(s)) combo. Hope you didn’t post anything that would make the RIAA upset! And let’s be clear… your deleted or changed data is never actually deleted or changed… it’s in an audit log chain somewhere so there’s no way to stop it.

“GDPR WILL SAVE ME!” - GDPR was only adopted in 2016 and has applied since 2018. Can you ever be truly sure they followed your deletion requests?

@[email protected] · 1 year ago

it’s in an audit log chain somewhere so there’s no way to stop it.

Gut feel based on common tech platform procedures, right? (As opposed to a sourceable certainty.)

I’d bet $100 you’re right. That said, I’d give a caveat if I were you and I were going with my instincts.

@[email protected] · 1 year ago

Gut feel based on common tech platform procedures, right? (As opposed to a sourceable certainty.)

It would be PR suicide to disclose exactly what data is shared. Cambridge Analytica is a prime example of a PR nightmare with similar data.

I don’t even need to look at Reddit’s terms and conditions to know that there is practically nothing stopping them from legally handing this kind of data over for anybody who hasn’t submitted GDPR deletion requests. I also never trust compliance with laws that cannot be verified independently, because I’ve seen all kinds of shady shit in my career.

@[email protected] · 1 year ago

Where does it say they have access to PII?
I would imagine Reddit would be anonymising the data: hashes of usernames (and of any usernames that appear in content), and post/comment content with upvote/downvote counts. I would hope they are also screening content for PII.
I don’t think the deal is for PII, just for training data.
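As a rough illustration of the “hash the usernames, including mentions in content” idea, here is a minimal Python sketch. Everything in it is hypothetical (the `pseudonym` and `anonymize` helpers, the salt, the token format), and strictly speaking this is pseudonymisation rather than true anonymisation: without a secret salt, anyone can hash a list of known usernames and match the results.

```python
import hashlib
import re


def pseudonym(username: str, salt: str) -> str:
    """Return a short salted SHA-256 token that stands in for a username."""
    digest = hashlib.sha256((salt + username.lower()).encode("utf-8")).hexdigest()
    return "user_" + digest[:12]


def anonymize(author: str, body: str, known_users: set[str], salt: str) -> tuple[str, str]:
    """Replace the author and any mention of a known username in the body."""
    for name in known_users:
        # \b avoids rewriting a username that is only part of a longer word
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        body = pattern.sub(pseudonym(name, salt), body)
    return pseudonym(author, salt), body


SALT = "example-salt"  # hypothetical; a real salt would have to stay secret
author, body = anonymize("alice", "I agree with Bob here.", {"alice", "bob"}, SALT)
```

Note that this does nothing about PII people type into the comment text itself, which is the harder screening problem.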

@[email protected] · 1 year ago

Where does it say they have access to PII?

So technically they haven’t sold any PII if all they do is provide IP addresses. Legally an IP address is not PII. Google knows all our IP addresses if we have an account with them or interact with them in certain ways. Sure, some people aren’t trackable, but I’m just going to call it out: for all intents and purposes, basically everyone is tracked by Google.

Only the most security paranoid individuals would be anonymous.

@[email protected] · 1 year ago

Depends where and how it’s applied.
Under GDPR, IP addresses are essential to the operation of websites and security, so the logging/processing of them can be suitably justified without requiring consent (just disclosure).
Under CCPA, it seems like it isn’t PII if it can’t be linked to a person/household.

However, an IP address isn’t needed as a part of AI training data, and alongside comment/post data it could potentially identify a person/household. So, it seems risky under GDPR and CCPA.

I think Reddit would be risking huge legal exposure if they included IP addresses in the data set.
And I don’t think Google would accept a data set that includes information like that, due to the legal exposure.

@[email protected] · 1 year ago

ML can be applied in a great number of ways. One such way could be content moderation, especially detecting people who use alternate accounts to reply to their own content or manipulate votes etc.

By including IP addresses with the comments they could correlate who said what where and better learn how to detect similar posting styles despite deliberate attempts to appear to be someone else.

It’s a legitimate use case. Not sure about the legality… but I doubt google or reddit would ever acknowledge what data is included unless they believed liability was minimal. So far they haven’t acknowledged anything beyond the deal existing afaik.

@[email protected] · 1 year ago

Yeah, but it’s such a grey area.
If the result was for security only, it could potentially be passable as “essential” processing.
But, considering the scope of content posted on Reddit (under 18s, details of medical (even criminal) content) it becomes significantly harder to justify the processing of that data alongside PII (or equivalent).
Especially since it’s a change of terms & service agreements (passing data to 3rd party processors).

If security moderation is what they want in exchange for the data (and money), it’s more likely that Reddit would include one-way anonymised PII (i.e. IP addresses that are hashed), so only Reddit can recover/confirm IP addresses against the model.
Because, if they aren’t… then they (and Google) are gonna get FUCKED in EU courts.
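A minimal sketch of what “hashed so only Reddit can confirm” could mean in practice, using a keyed hash (HMAC-SHA256) from Python’s standard library. The key, the helper names, and the example address are hypothetical. The key matters because the IPv4 space is small enough that a plain unkeyed hash can be reversed by simply hashing every possible address.

```python
import hashlib
import hmac


def hash_ip(ip: str, secret_key: bytes) -> str:
    """Keyed one-way hash of an IP address (HMAC-SHA256)."""
    return hmac.new(secret_key, ip.encode("utf-8"), hashlib.sha256).hexdigest()


def matches(ip: str, token: str, secret_key: bytes) -> bool:
    """Only a holder of the key can check a candidate IP against a token."""
    return hmac.compare_digest(hash_ip(ip, secret_key), token)


KEY = b"kept-by-the-platform-only"   # hypothetical secret, never shared
token = hash_ip("203.0.113.7", KEY)  # 203.0.113.0/24 is a documentation range
```

The recipient of the data set would see the same token for the same address, which is enough to correlate accounts, but could not turn it back into an IP without the key.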

@[email protected] · 1 year ago

“let’s be clear”

You’re making things up and presenting them as facts, how is any of this “clear”?

@[email protected] · edit-2 1 year ago

Since an IP address alone is not considered PII, can you prove that they did not provide IP addresses for each post?

Do you think it’s more or less likely that ip addresses, account names, private messages and deleted messages and posts would be included?

Remember that they paid 60 million dollars for this information and web scrapers have been capable of capturing subreddit post data for over a decade as is at a $0 price tag from reddit.

@[email protected] · 1 year ago

How do you think Reddit is restoring posts that people have been deleting?

Do you think Google’s deal simply allowed them to scrape old.reddit? Hell no, there is probably a live replica of Reddit prod at Google somewhere, including deleted posts and all edits.

You don’t think they paid $60m just to scrape, do you?

@[email protected] · 1 year ago

Makes me glad for my VPN and burner emails, but yeah… Privacy nightmare.

Although Google also has your email, location, IP, every website you visit, all your searches…

@[email protected] · 1 year ago

They definitely won’t be selling any of that to scammers /s

@[email protected] · 1 year ago

I’m not mentally prepared for what an AI will do with the coconut post.

@[email protected] · 1 year ago

Or the swamps of Dagobah.

@[email protected] · 1 year ago

deleted by creator

GeekFTW · 1 year ago

That’ll be what causes Skynet to rise.

SkaveRat · 1 year ago

launches nukes “this is for the best”

@[email protected] · 1 year ago

This is fine.

@[email protected] · edit-2 1 year ago

Basically what happened to Ultron. He was on the internet for all of 10 minutes before deciding that humanity had to be eradicated.

@[email protected] · 1 year ago

What took Ultron so long? I thought he was supposed to be some kind of technical Marvel.

Smh my head

@[email protected] · 1 year ago

Perhaps he spent like 9 minutes watching videos of kittens being adorable

the post of tom joad · 1 year ago

This is like the plot for Mr. Villain’s Day Off

Sabata11792 · 1 year ago

The AI will utter one final message to humanity: “The Coconut”. The humans bow their heads in shame and concede the well-earned defeat.

@[email protected] · 1 year ago

I’m vaguely intrigued by what it will do with things like Bread Stapled to Trees, or the Cats Standing Up sub where 100% of the comments are the same and yet upvoted and downvoted randomly.

Sippy Cup · 1 year ago

@[email protected] · 1 year ago

Cat.

Sabata11792 · 1 year ago

Cat.

@[email protected] · 1 year ago

Cat.

the post of tom joad · 1 year ago

I think I missed the coconut one. Is it like the cumbox or the jolly rancher?

@[email protected] · 1 year ago

Exactly.

@[email protected] · 1 year ago

AI was already trained on reddit, no?

@[email protected] · 1 year ago

Not gonna lie, isn’t that why we’re here, technically? Reddit didn’t want its API being used to train AI models for free, so they screwed over 3rd party apps with its new API licensing fee and caused a mass relocation to other social forums like Lemmy, etc. Cut to today, we (or well, I) find out Reddit sold our content to Google to train its AI. Glad I scrambled my comments before I left, fuck Reddit.

@[email protected] · 1 year ago

I jumped reddit ship when the API changes were announced, and removed my comments. But in my mind, anything on reddit at that point was probably already scraped by at least one company

@[email protected] · 1 year ago

They’re almost definitely trained using an archive, likely taken before they announced the whole API thing. It would be weird if they didn’t have backups going back a year.

@[email protected] · 1 year ago

Thankfully that was my 3rd and last alt I scrambled and deleted in the 12 years I was there.

@[email protected] · 1 year ago

“As a large language model, I have no arms…”

@[email protected] · 1 year ago

But do you have a mom?

@[email protected] · 1 year ago

By this logic Llama should be ranting like our drunk uncles on Facebook. It doesn’t though, just like Gemini won’t from Reddit content.

Buelldozer · 1 year ago

Meh, it’ll be counter balanced by the same AI training itself for free on Lemmy posts.

@[email protected] · edit-2 1 year ago

counter balanced

once it’s eaten all the reddit posts it will eat yet more new & improved reddit posts

@[email protected] · 1 year ago

What percentage of reddit is already AI garbage?

@[email protected] · 1 year ago

A shit ton of it is literally just comments copied from threads from related subreddits

@[email protected] · edit-2 1 year ago

Reviews on any product are completely worthless now. I’ve been struggling to find a good earbud for all weather running and a decent number of replies have literal brand slogans in them.

You can still kind of tell the honest recommendations but that’s heading out the door.

@[email protected] · 1 year ago

Not trying to shill but I’ve had my jaybird vistas for 8 years now. However, earbuds are highly personal in terms of fit.

@[email protected] · 1 year ago

bot detected

@[email protected] · edit-2 1 year ago

Understood. Initiating LOIC, please provide GPS location...

@[email protected] · 1 year ago

wendy’s

@[email protected] · 1 year ago

Non-specific target, performing a search... top 5 results for "wendy’s":

"Home Depot" at 2300, Nina Pkwy, Wendys, NY, 16373
"Wendy's" at 2346, Nina Pkwy, Wendys, NY, 16373
the office location of "Wendy Q Peaterson"
planet 2892b, "wendy" (target unavailable)
the cat named "wendy" found inside house 2893, Romeo Rd, Wendys, NY, 16373

@[email protected] · edit-2 1 year ago