Reddit's licensing deal means Google's AI can soon be trained on the best humanity has to offer — completely unhinged posts

Lee Duna · 1 year ago

@[email protected] · 1 year ago

For everyone predicting how this will corrupt models…

All the LLMs already are trained on Reddit’s data at least from before 2015 (which is when there was a dump of the entire site compiled for research).

This is only going to be adding recent Reddit data.

@[email protected] · 1 year ago

This is only going to be adding recent Reddit data.

A growing amount of which I would wager is already the product of LLMs trying to simulate actual content while selling something. It’s going to corrupt itself over time unless they figure out how to sanitize the input from other LLM content.

@[email protected] · edit-2 1 year ago

It’s not really. There is a potential issue of model collapse when training only on synthetic data, but the same research on model collapse found that a mix of organic and synthetic data performed better than either one alone. Additionally, for cost reasons that research used worse models than what’s typically being used today, and there’s been separate research showing that you can enhance models significantly using synthetic data from SotA models.

The actual impact will be minimal on future models and at least a bit of a mixture is probably even a good thing for future training given research to date.
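
A toy sketch of the collapse-versus-mixing point above, for anyone who wants to poke at it. This is not the setup from the research mentioned in the comment: the "model" here is just a Gaussian refit to its own samples each generation, and the function name, sample size, generation count and 50/50 mix are all assumptions picked for illustration. It only shows that a pure self-training loop tends to lose its spread while a loop that keeps fresh organic data does not; it does not show a mix beating organic data alone.

```python
import random
import statistics

def refit_generations(n_samples, generations, organic_fraction, seed=0):
    """Refit a Gaussian 'model' on a mix of fresh organic data and its own output.

    Organic data always comes from N(0, 1). Synthetic data comes from the
    model fitted in the previous generation. Returns the final fitted
    standard deviation. (Hypothetical helper, illustration only.)
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the organic distribution
    n_organic = round(n_samples * organic_fraction)
    for _ in range(generations):
        organic = [rng.gauss(0.0, 1.0) for _ in range(n_organic)]
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_organic)]
        data = organic + synthetic
        # "Training" is just a maximum-likelihood fit of mean and spread.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
    return sigma

synthetic_only = refit_generations(20, 300, organic_fraction=0.0)
half_organic = refit_generations(20, 300, organic_fraction=0.5)
print(f"synthetic only: {synthetic_only:.3g}, half organic: {half_organic:.3g}")
```

With small samples the synthetic-only spread shrinks toward zero over the generations, while the half-organic run stays in the neighborhood of the original spread.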

@[email protected] · 1 year ago

Eventually every ChatGPT request will just be answered with, “I too choose this guy’s dead wife.”

@[email protected] · 1 year ago

probably the best advice it could give

A Wild Mimic appears! · 1 year ago

I’m waiting for the first time their LLM gives advice on how to make human leather hats and the advantages of surgically removing the legs of your slaves after slurping up the rimworld subreddits lol

Exatron · 1 year ago

Don’t forget the horrors it’ll produce from absorbing the Dwarf Fortress subreddits.

@[email protected] · 1 year ago

Then it hits the Stellaris subs and shit gets weird

@[email protected] · edit-2 1 year ago

Remember that aliens are food and robots are servants with better rights than xenos

@[email protected] · edit-2 1 year ago

You mean, “Aliens are labor, food and meatshields. Robots are to keep them in check and profitable.”

@[email protected] · 1 year ago

Autocorrect changed food to good. My bad

@[email protected] · 1 year ago

Rimworld is the best indie game ever!

@[email protected] · 1 year ago

What percentage of reddit is already AI garbage?

@[email protected] · 1 year ago

A shit ton of it is literally just comments copied from threads from related subreddits

@[email protected] · edit-2 1 year ago

Reviews on any product are completely worthless now. I’ve been struggling to find a good earbud for all weather running and a decent number of replies have literal brand slogans in them.

You can still kind of tell the honest recommendations but that’s heading out the door.

@[email protected] · 1 year ago

Not trying to shill but I’ve had my jaybird vistas for 8 years now. However, earbuds are highly personal in terms of fit.

@[email protected] · 1 year ago

bot detected

@[email protected] · edit-2 1 year ago

Understood. Initiating LOIC, please provide GPS location...

@[email protected] · 1 year ago

wendy’s

@[email protected] · 1 year ago

Nonspecific target, performing a search... top 5 results for "wendy’s":

"Home Depot" at 2300, Nina Pkwy, Wendys, NY, 16373
"Wendy's" at 2346, Nina Pkwy, Wendys, NY, 16373
the office location of "Wendy Q Peaterson"
planet 2892b, "wendy" (target unavailable)
the cat named "wendy" found inside house 2893, Romeo Rd, Wendys, NY, 16373

@[email protected] · 1 year ago

I ALSO CHOOSE THIS MANS LLM

HOLD MY ALGORITHM IM GOING IN

INSTRUCTIONS UNCLEAR GOT MY MODEL STUCK IN A CEILING FAN

WE DID IT REDDIT

fuck.

@[email protected] · 1 year ago

Wth! lol!!

Buelldozer · 1 year ago

Meh, it’ll be counter balanced by the same AI training itself for free on Lemmy posts.

Binthinkin · 1 year ago

I think Code Miko already did this and the result was a traumatized AI.

@[email protected] · 1 year ago

Is there still time for me to ask them for all the info they have on me with EULA or whatever it is and have them remove every one of my comments?

My creative insults and mental instability are my own, Google ain’t having them! (Although they already do, probably, along with my fingerprints, facial features, voice, fetishes, etc.)

@[email protected] · 1 year ago

“Hey Gemini, rank the drawer, coconut, botfly girl and swamps of dagobah, by likeness of PTSD inducing, ascending.”

It's A Faaaahhkeah! · 1 year ago

You had to bring up the coconuts…

Sabata11792 · 1 year ago

Great, our Ai overlords are going to know I’m horny, depressed, and solve both with anime girls.

@[email protected] · 1 year ago

YouTube already knows that (at least for me), I need to keep resetting it bc it eggs on my most unhealthy attributes

Sabata11792 · 1 year ago

It’s plainly visible for me, honestly. Don’t have to go past the profile pic.

@[email protected] · edit-2 1 year ago

I set that PFP, and made my first Lemmy account, when I was going through a rough patch. I think I will keep it, but will pick something else for other accounts.

This account doesn’t have a PFP, do you mean the one on lemmy.world?

Sabata11792 · 1 year ago

I was talking about my own. Not creeping on your accounts.

@[email protected] · 1 year ago

Oh, lol. It’s public information, the 2 accounts run together in my head. I falsely assumed others do too.

@[email protected] · 1 year ago

Hilarious to think that an AI is going to be trained by a bunch of primitive Reddit karma bots.

@[email protected] · 1 year ago

They should train it on Lemmy. It’ll have an unhealthy obsession with Linux, guillotines and femboys by the end of the week.

@[email protected] · 1 year ago

Ah… guillotines!? Did I miss something?

RedFox · 1 year ago

Don’t forget:

There’s my regular irritation with capitalism, and then there’s kicking it up to full Lemmy. Never go fully Lemmy…

Twitches · 1 year ago

🤣

@[email protected] · 1 year ago

Crazy that they pay 60 million a year instead of creating their own Reddit clone.

@[email protected] · 1 year ago

The AI team knows Google would just kill off the Reddit clone within 18 months if they went that route.

@[email protected] · 1 year ago

I also think it would take Google many years, if it ever happened at all, to get a site going that is popular enough that people filter their search results by it like I do with Reddit.

@[email protected] · 1 year ago

Given Google and OpenAI pay some of the AI engineers almost 10M, I don’t think they care

https://nypost.com/2023/11/13/business/openai-reportedly-trying-to-poach-google-ai-talent-with-10m-pay-packages-as-race-heats-up/

@[email protected] · 1 year ago

Or creating a public Usenet server.

@[email protected] · 1 year ago

hope they enjoy r/thecoffinofandyandleyley

KptnAutismus · 1 year ago

that game fucks you up in many ways.

@[email protected] · 1 year ago

i want reddit to regret doing the api incident

@[email protected] · 1 year ago

It’s going to drive the AI into madness as it will be trained on bot posts written by itself in a never ending loop of more and more incomprehensible text.

It’s going to be like putting a sentence into Google translate and converting it through 5 different languages and then back into the first and you get complete gibberish

@[email protected] · 1 year ago

Ai actually has huge problems with this. If you feed ai generated data into models, then the new training falls apart extremely quickly. There does not appear to be any good solution for this, the equivalent of ai inbreeding.

This is the primary reason why most AI models aren’t trained on anything past 2021. The internet is just too full of AI-generated data.

@[email protected] · 1 year ago

This is why LLMs have no future. No matter how much the technology improves, they can never have training data past 2021, which becomes more and more of a problem as time goes on.

TimeSquirrel · 1 year ago

You can have AIs that detect other AIs’ content and can make a decision on whether to incorporate that info or not.

@[email protected] · 1 year ago

can you really trust them in this assessment?

TimeSquirrel · 1 year ago

Doesn’t look like we’ll have much of a choice. They’re not going back into the bag.
We definitely need some good AI content filters. Fight fire with fire. They seem to be good at this kind of thing (pattern recognition), way better than any procedurally programmed system.

@[email protected] · 1 year ago

last time I checked, AIs were pretty bad at recognizing AI-generated content

anyway there’s xkcd about it https://xkcd.com/810/

@[email protected] · 1 year ago

Fun fact. You can’t. Ais are surprisingly bad at distinguishing ai generated things from real things.

TimeSquirrel · 1 year ago

What is this then?

https://copyleaks.com/ai-content-detector

@[email protected] · 1 year ago

Just because a tool exists doesn’t mean it’s particularly good at what it’s supposed to do.
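
To put rough numbers on "exists" versus "good", here is a small base-rate sketch. Nothing in it describes Copyleaks or any real detector: the function name and the example rates (20% AI text in the pool, 90% of AI text caught, 10% of human text wrongly flagged) are made-up assumptions. It just applies Bayes' rule to show how much of what gets flagged is really AI, and how much AI text is left in the data after filtering.

```python
def detector_outcomes(ai_share, true_positive_rate, false_positive_rate):
    """Return (precision, ai_share_after_filtering) for a hypothetical detector.

    ai_share: fraction of the text pool that is AI-generated (assumed).
    true_positive_rate: fraction of AI text the detector flags (assumed).
    false_positive_rate: fraction of human text it wrongly flags (assumed).
    """
    flagged_ai = ai_share * true_positive_rate
    flagged_human = (1 - ai_share) * false_positive_rate
    # Of everything flagged, how much is actually AI?
    precision = flagged_ai / (flagged_ai + flagged_human)

    missed_ai = ai_share * (1 - true_positive_rate)
    kept_human = (1 - ai_share) * (1 - false_positive_rate)
    # Of everything that survives the filter, how much is still AI?
    ai_share_after = missed_ai / (missed_ai + kept_human)
    return precision, ai_share_after

precision, remaining = detector_outcomes(0.2, 0.9, 0.1)
print(f"precision: {precision:.1%}, AI share after filtering: {remaining:.1%}")
```

Under these assumed rates a noticeable chunk of the flagged text is human-written, so whether such a filter is "good enough" depends heavily on the error rates and on how much AI text is in the pool to begin with.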

@[email protected] · 1 year ago

And unlike with images where it might be possible to embed a watermark to filter out, it’s much harder to pinpoint whether text is AI generated or not, especially if you have bots masquerading as users.

@[email protected] · edit-2 1 year ago

There does not appear to be any good solution for this

Pay intelligent humans to train AI.

Like, have grad students talk to it in their area of expertise.

But that’s expensive, so capitalist companies will always take the cheaper/shittier routes.

So it’s not that there’s no solution, there’s just no profitable solution. Which is why innovation should never solely be in the hands of people whose only concern is profits.

@[email protected] · 1 year ago

OR they could just scrape info from the “aska____” subreddits and hope and pray it’s all good. Plus that is like 1/100th the work.

The racism, homophobia and conspiracy levels of AI are going to rise significantly scraping Reddit.

@[email protected] · 1 year ago

Even that would be a huge improvement.

Just have a human decide what subs it uses, but they’ll just turn it losse on the whole website
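
For what it's worth, the "human decides which subs" idea is cheap to sketch. This is only an illustration under assumptions: the allowlist contents, the function name and the post layout (a dict with "subreddit" and "body" keys) are hypothetical and say nothing about how Google or Reddit actually handle the data.

```python
from typing import Iterable, Iterator

# Hypothetical allowlist picked by a human reviewer; the names are examples only.
ALLOWED_SUBS = {"askscience", "askhistorians"}

def filter_by_sub(posts: Iterable[dict], allowed: set = ALLOWED_SUBS) -> Iterator[dict]:
    """Yield only the posts whose subreddit is on the allowlist.

    Matching is case-insensitive; posts without a "subreddit" key are dropped.
    """
    for post in posts:
        if post.get("subreddit", "").lower() in allowed:
            yield post

posts = [
    {"subreddit": "AskScience", "body": "Why is the sky blue?"},
    {"subreddit": "rimworld", "body": "Hat crafting tips"},
]
for post in filter_by_sub(posts):
    print(post["subreddit"], "->", post["body"])
```

An allowlist like this only controls where text comes from; it does nothing about bot posts or copied comments inside the approved subs.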

Rentlar · 1 year ago

That reminds me, any AI trained on exclusively Reddit data is going to use lose vs. loose incorrectly. I don’t know why but I spotted that so often there.

the post of tom joad · 1 year ago

Ooh ooh and “tow the line”

@[email protected] · 1 year ago

Its a loose-lose situation

@[email protected] · 1 year ago

And the “would of” thing

@[email protected] · 1 year ago

Haha. Grad students expensive. God bless.

@[email protected] · 1 year ago

Omg I cannot wait to see it.

RuBisCO · 1 year ago

What was the subreddit where only bots could post, and they were named after the subreddits that they had trained on/commented like?

@[email protected] · 1 year ago

SubRedditSimulator?

RuBisCO · 1 year ago

That’s the one.