• @[email protected]
    link
    fedilink
    English
    21
    edit-2
    1 year ago

    Reddit banned me by IP address or something. Any new account I create gets banned within 24 hours, even if I don’t upvote a single post or comment. I tried 10 new accounts, each with a new email address, and all of them were banned. So I gave up and randomly edited all my good comments. I’ve moved permanently to Lemmy. I miss some of the most niche communities, but not enough to return to Reddit.

    Edit: I didn’t even break any rules. I took too long to switch away from a modded Reddit app, and I only logged in once. That doesn’t justify blocking me from ever using Reddit.

    • @[email protected]
      link
      fedilink
      English
      2
      1 year ago

      If you use a VPN and a disposable email, you can get about a week out of an account if you need to comment. It’ll get quietly shadowbanned, though.

  • AutoTL;DR (bot)
    link
    fedilink
    English
    2
    1 year ago

    This is the best summary I could come up with:


    OpenAI has signed a deal for access to real-time content from Reddit’s data API, which means it can surface discussions from the site within ChatGPT and other new products.

    It’s an agreement similar to the one Reddit signed with Google earlier this year that was reportedly worth $60 million.

    The deal will also “enable Reddit to bring new AI-powered features to Redditors and mods” and use OpenAI’s large language models to build applications.

    Recently, following news of a partnership between OpenAI and the programming messaging board Stack Overflow, people were suspended after trying to delete their posts.

    No financial terms were revealed in the blog post announcing the arrangement, and neither company mentioned training data, either.

    That last detail is different from the deal with Google, where Reddit explicitly stated it would give Google “more efficient ways to train models.” There is, however, a disclosure mentioning that OpenAI CEO Sam Altman is also a shareholder in Reddit but that “This partnership was led by OpenAI’s COO and approved by its independent Board of Directors.”


    The original article contains 334 words, the summary contains 174 words. Saved 48%. I’m a bot and I’m open source!

  • @[email protected]
    link
    fedilink
    English
    105
    1 year ago

    They always were.

    Only now they’ve agreed to pay Reddit for it. This is what their third party lockdown was really all about.

    They’re helping themselves to your Lemmy comments for free, as that’s just how it’s designed. If you post anything publicly anywhere, it’s getting slurped up by a bot somewhere.

    • just another dev
      link
      fedilink
      English
      16
      1 year ago

      I’m not a lawyer, but isn’t the reason they had to go to Reddit for permission that users hand over ownership to Reddit the moment you post? And since there’s no such clause on Lemmy, they’d have to ask the actual authors of the comments for permission instead?

      Mind you, I understand there’s no technical limitation that prevents bots from harvesting the data. I’m talking about the legality. After all, public does not equal public domain.

      • Alimentar
        link
        fedilink
        English
        3
        1 year ago

        Well, even if there was a legal argument, they wouldn’t care, just like Facebook and all the rest. They say they don’t share your data, but we all know that’s a lie.

      • @[email protected]
        link
        fedilink
        English
        4
        1 year ago

        Well the legality seems to be something you can ignore when you have billions of dollars in VC money to fritter around.

        It certainly didn’t stop them hoovering up music and movies, and the owners of those have a lot more power than any of us do.

        Tech is fast, the law is slow, and you can make many times the cost of lawyers and fines by the time anybody gets around to telling you to stop it.

      • @[email protected]
        link
        fedilink
        English
        14
        1 year ago

        users hand over ownership to reddit the moment you post

        Not ownership. Just permission to copy and distribute freely. Which basically is necessary to run a service like this, where user-submitted content is displayed.

        And since there’s no such clause on Lemmy, they’d have to ask the actual authors of the comments for permission instead?

        It’s more of a fuzzy area, but simply by posting on a federated service you’re agreeing to let that service copy and display your comments, and sync with other servers/instances to copy and display your comments to their users. It’s baked into the protocol, that your content will be copied automatically all over the internet.

        Does that imply a license to let software be run on that text? Does it matter what the software does with it, like displaying the content in a third-party mobile app? What about when it engages in text-to-speech or braille conversion for accessibility? Or indexes the page for a search engine? Does AI training make any difference at that point?

        The fact is, these services have APIs, and the APIs allow for the efficient copying and ingest of the user-created information, with metadata about it, at scale. From a technical perspective obviously scraping is easy. But from a copyright perspective submitting your content into that technical reality is implicit permission to copy, maybe even for things like AI training.

      • @[email protected]
        link
        fedilink
        English
        6
        1 year ago

        Well they’ve probably got filters that remove all that before it teaches their Ai to swear. So you need to be more subtle for 𝑓ucks sake.

  • @[email protected]
    link
    fedilink
    English
    152
    1 year ago

    So they filled reddit with bot generated content, and now they’re selling back the same stuff likely to the company who generated most of it.

    At what point can we call an AI inbred?

        • @[email protected]
          link
          fedilink
          English
          3
          1 year ago

          That paper is yet to be peer reviewed or released. I think you are jumping to conclusions with that statement. How much can you dilute the data until it breaks again?

          • @[email protected]
            link
            fedilink
            English
            2
            edit-2
            1 year ago

            That paper is yet to be peer reviewed or released.

            Never doing either (release as in submit to journal) isn’t uncommon in maths, physics, and CS. Not to say that it won’t be released but it’s not a proper standard to measure papers by.

            I think you are jumping to conclusions with that statement. How much can you dilute the data until it breaks again?

            Quoth:

            If each linear model is instead fit to the generate targets of all the preceding linear models i.e. data accumulate, then the test squared error has a finite upper bound, independent of the number of iterations. This suggests that data accumulation might be a robust solution for mitigating model collapse.

            Emphasis on “finite upper bound, independent of the number of iterations”, achieved by doing nothing more than keeping the non-synthetic data around each time you ingest new synthetic data. This is an empirical study, so of course it’s not proof; you’ll have to wait for theorists to have their turn for that one. But it’s darn convincing and should henceforth be the null hypothesis.
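
To make the "replace" versus "accumulate" distinction in the quote concrete, here is a minimal toy sketch. It is not the paper's code or experimental setup: it assumes a one-parameter linear model `y = w * x` with Gaussian noise, and every name in it (`fit_slope`, `simulate`, the parameter values) is made up for illustration.

```python
import random

def fit_slope(xs, ys):
    """Least-squares slope for the model y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def simulate(true_w, noise, n, iterations, accumulate, seed=0):
    """Return |w_hat - true_w| for each successive model.

    accumulate=False: each model is fit only to the previous model's outputs.
    accumulate=True:  each model is fit to the real data plus every synthetic
                      batch generated so far.
    """
    rng = random.Random(seed)
    pool_x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    pool_y = [true_w * x + rng.gauss(0.0, noise) for x in pool_x]  # real data
    errors = []
    for _ in range(iterations):
        w_hat = fit_slope(pool_x, pool_y)
        errors.append(abs(w_hat - true_w))
        # The fitted model labels fresh inputs, producing synthetic data.
        new_x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        new_y = [w_hat * x + rng.gauss(0.0, noise) for x in new_x]
        if accumulate:
            pool_x += new_x
            pool_y += new_y
        else:
            pool_x, pool_y = new_x, new_y
    return errors

if __name__ == "__main__":
    for mode in (False, True):
        errs = simulate(true_w=2.0, noise=0.5, n=50, iterations=30, accumulate=mode)
        print("accumulate" if mode else "replace", [round(e, 3) for e in errs[::5]])
```

Running it prints the error trajectory for both regimes so they can be compared side by side; a single seed on a toy model is only an illustration, not evidence for or against the paper's claim.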

            Btw, did you know that no one ever proved (or at least hadn’t last I checked) that reversing, determinising, reversing, and determinising a DFA again minimises it? Not proven yet widely accepted as true, crazy, isn’t it? But, wait, no, people actually proved it on a napkin. It’s not interesting enough to do a paper about.
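
The reverse, determinise, reverse, determinise construction mentioned here is commonly known as Brzozowski's algorithm. Below is a small sketch of it under some assumptions of my own: automata are plain tuples, the dead state is left implicit (so the result is the minimal partial DFA), and the example automaton `EXAMPLE_DFA` is invented for illustration.

```python
from collections import deque

# An automaton is (alphabet, delta, initial, accepting), where delta maps
# (state, symbol) to a set of next states. A DFA is the special case with one
# initial state and one target per entry; a missing entry means "reject".

def reverse(automaton):
    """Flip every transition and swap the initial and accepting sets."""
    alphabet, delta, initial, accepting = automaton
    flipped = {}
    for (src, sym), targets in delta.items():
        for dst in targets:
            flipped.setdefault((dst, sym), set()).add(src)
    return alphabet, flipped, set(accepting), set(initial)

def determinize(automaton):
    """Subset construction that only builds reachable, non-empty subsets."""
    alphabet, delta, initial, accepting = automaton
    start = frozenset(initial)
    seen, queue, new_delta = {start}, deque([start]), {}
    while queue:
        subset = queue.popleft()
        for sym in alphabet:
            nxt = frozenset(t for s in subset for t in delta.get((s, sym), ()))
            if not nxt:
                continue  # keep the dead state implicit
            new_delta[(subset, sym)] = {nxt}
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return alphabet, new_delta, {start}, {s for s in seen if s & accepting}

def minimize(dfa):
    """Reverse, determinise, reverse, determinise."""
    return determinize(reverse(determinize(reverse(dfa))))

def states(automaton):
    """Collect every state mentioned anywhere in the automaton."""
    _, delta, initial, accepting = automaton
    found = set(initial) | set(accepting)
    for (src, _sym), targets in delta.items():
        found.add(src)
        found |= targets
    return found

def accepts(automaton, word):
    """Simulate the automaton on a word (works for NFAs and DFAs)."""
    _, delta, initial, accepting = automaton
    current = set(initial)
    for sym in word:
        current = {t for s in current for t in delta.get((s, sym), ())}
    return bool(current & accepting)

# Hypothetical example: strings over {a, b} that end in "a", with a
# redundant accepting state (states 1 and 2 are equivalent).
EXAMPLE_DFA = (
    ("a", "b"),
    {(0, "a"): {1}, (0, "b"): {0},
     (1, "a"): {2}, (1, "b"): {0},
     (2, "a"): {1}, (2, "b"): {0}},
    {0},
    {1, 2},
)

if __name__ == "__main__":
    smaller = minimize(EXAMPLE_DFA)
    print(len(states(EXAMPLE_DFA)), "states before,", len(states(smaller)), "after")
```

The states of the result are sets of sets of original states, which is ugly but harmless; renaming them to integers is a cosmetic step left out here.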

            • @[email protected]
              link
              fedilink
              English
              2
              1 year ago

              Peer review, for all its flaws, is a good minimum before a paper is worth taking seriously.

              In your original comment you said that model collapse can be easily avoided with this technique, which is notably different from it being mitigated. I’m not saying that these findings are not useful, just that you are overselling them a bit with this wording.

              • @[email protected]
                link
                fedilink
                English
                1
                1 year ago

                It was someone different who said that. There’s a chance the authors might’ve gotten some claim wrong because their maths and/or methodology is shoddy but it’s a large and diverse set of authors so that’s unlikely. Fraud in CS empirics is generally unheard of, I mean what are you going to do when challenged, claim that the dog ate the program you ran to generate the data? There’s shenanigans about the equivalent of p-hacking especially from papers from commercial actors trying to sell stuff but that’s not the case here, either.

                CS academics generally submit papers to journals more because of publish or perish than the additional value formal peer review offers. It’s on the internet, after all. By all means, if you spot something in the paper that’s wrong then be right on the internet.

        • Ghostalmedia
          link
          fedilink
          English
          15
          1 year ago

          A model trained on jokes about bacon, narwhals, and rage comics.

          • FaceDeer
            link
            fedilink
            2
            1 year ago

            By “old archives” I mean everything from 2022 and earlier.

            • @[email protected]
              link
              fedilink
              English
              13
              1 year ago

              But there were still bots making shit up back then. r/SubredditSimulator was pretty popular for a while, and repost and astroturfing bots were a problem for decades on Reddit.

              • FaceDeer
                link
                fedilink
                1
                1 year ago

                Existing AIs such as ChatGPT were trained in part on that data so obviously they’ve got ways to make it work. They filtered out some stuff, for example - the “glitch tokens” such as solidgoldmagikarp were evidence of that.

    • @[email protected]
      link
      fedilink
      English
      18
      1 year ago

      I wonder if OpenAI or any of the other firms have thought to put in any kind of stipulations about monitoring and moderating Reddit content to reduce AI-generated posts and reduce the risk of model collapse.

      Anybody who’s looked at Reddit in the past 2 years especially has seen the impact of AI pretty clearly. If I was running OpenAI I wouldn’t want that crap contaminating my models.

  • Dr. Moose
    link
    fedilink
    English
    29
    1 year ago

    This form of propaganda is my pet peeve. It’s not “your posts”. As soon as you put something out in public, you don’t get to have your cake and eat it too. It’s out there, you shared it. Don’t share it if you don’t want humanity to ingest and use it.

    • Dataprolet
      link
      fedilink
      English
      23
      1 year ago

      You’re technically right, but nobody anticipated, and therefore nobody agreed to, their posts being used for training LLMs.

    • @[email protected]
      link
      fedilink
      English
      6
      edit-2
      1 year ago

      It’s not about it being used to train AI. It’s about the AI either not being open source/I don’t get access to it (i.e. not benefitting me) or Reddit being paid for my comments (i.e. also not benefitting me).

      If this AI training would get me or the public access to the AI, or I would be paid for my comments instead of Reddit, I’d be fine with it.

      • Dr. Moose
        link
        fedilink
        English
        5
        edit-2
        1 year ago

        Yeah, but you don’t get to choose that. You give away that right as soon as you participate in public discourse. It’s a zero-sum game: either it’s public for everyone or for no one.

        Don’t get me wrong, Reddit is a bitch but I think people want to cut their noses off to spite their faces here. It’s much more important to have free information flow than to fuck reddit.

        My fear is that people will vote in some really dumb rules to spite AI and restrict free information flow accidentally.

        • @[email protected]
          link
          fedilink
          English
          3
          edit-2
          1 year ago

          That’s how it is currently and maybe also your opinion. But that doesn’t mean it has to be like that in a society. It’s your opinion that everything public can go private at any time (training proprietary private AI), but we can decide as a society that’s not how we want to do things. We can require stuff that used public data to be public as well.

          And yeah, I kinda get to choose that. As a democratic society, anything that the public (i.e. including me) decides, goes. Of course, if there are people like you that don’t want stuff trained on public data to be required to be public, democracy will also work in the sense that we don’t get that, as it is currently.

  • @[email protected]
    link
    fedilink
    English
    6
    1 year ago

    I didn’t delete my comments before nuking my account, but I’m pretty sure the grand majority were shitposts containing ample amounts of smut, gore and other ridiculous over the top shit. So I consider this a win.

  • @[email protected]
    link
    fedilink
    English
    47
    1 year ago

    Some day historians will be able to look back at this moment and be able to determine it was what caused ChatGPT to become horny and weird.

  • Possibly linux
    link
    fedilink
    English
    30
    edit-2
    1 year ago

    They now are paying Reddit? I thought they could just scrape for free.

    Also, you cannot delete anything on the internet. Once something is public there will always be a copy somewhere.

    • @[email protected]
      link
      fedilink
      English
      26
      1 year ago

      Scraping through a website at the scale they are talking about isn’t really viable. You need access to the API so that you can have very targeted requests.

      This is why reddit changed their API pricing and screwed over everyone using third party apps. They can make more money selling access to LLM trainers than they could from having millions of people using apps that rely on the API.

      • Dr. Moose
        link
        fedilink
        English
        1
        edit-2
        1 year ago

        Scraping at scale is actually cheaper than buying API access. It’s a massive rising market. Try googling “web scraping service” and you’ll find hundreds of services that provide an API to scrape any public web page, bypass the blocks for you, and render all of the JavaScript.

        • @[email protected]
          link
          fedilink
          English
          1
          1 year ago

          Scraping is nice for static content, no doubt. But I wonder at what point it is easier to request changes to a developing thread via API than to request the whole page with all nested content over and over to find the new answers in there.
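
The trade-off described here can be sketched with in-memory data. Nothing below talks to Reddit or any real API: the comment dictionaries, the `new_by_full_rescan` and `new_by_cursor` helpers, and the assumption that comment IDs increase over time are all hypothetical, chosen only to show why a cursor-style request does less work than re-reading a whole nested thread.

```python
def flatten(comment):
    """Yield every comment in a nested thread, depth first."""
    yield comment
    for reply in comment["replies"]:
        yield from flatten(reply)

def new_by_full_rescan(old_thread, new_thread):
    """Scraping style: walk both full snapshots and compare comment IDs."""
    seen = {c["id"] for c in flatten(old_thread)}
    return sorted(c["id"] for c in flatten(new_thread) if c["id"] not in seen)

def new_by_cursor(comment_log, last_seen_id):
    """API style: ask only for comments newer than a cursor.

    Assumes IDs grow monotonically, which is an assumption of this sketch.
    """
    return sorted(c["id"] for c in comment_log if c["id"] > last_seen_id)

# Two snapshots of the same hypothetical thread, taken a few minutes apart.
OLD = {"id": 1, "replies": [{"id": 2, "replies": []},
                            {"id": 3, "replies": []}]}
NEW = {"id": 1, "replies": [{"id": 2, "replies": [{"id": 5, "replies": []}]},
                            {"id": 3, "replies": []},
                            {"id": 4, "replies": []}]}

if __name__ == "__main__":
    print(new_by_full_rescan(OLD, NEW))
    print(new_by_cursor(list(flatten(NEW)), last_seen_id=3))
```

Both calls find the same new comments. The difference is that the rescan has to fetch and walk the entire thread every time, while a server that supports a cursor can return only the tail.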

          • Dr. Moose
            link
            fedilink
            English
            1
            1 year ago

            Following a developing thread is a very tiny use case I’d imagine and even then you can just scrape the backend API that is used on the public page for the same results as private API.

    • @[email protected]
      link
      fedilink
      English
      2
      1 year ago

      My guess is Reddit was cheap enough that it made sense to pay them as a sort of insurance against getting sued in the future.

    • @[email protected]
      link
      fedilink
      English
      10
      edit-2
      1 year ago

      There’s actually legal precedent against scraping a website through unofficial channels, even if the information is public. Basically, if you scrape a website and hinder their ability to operate, it falls under “virtual trespassing”.

      I’m assuming it would be even worse now that everyone is using the cloud and that scraping their site would cause a noticeable increase in resource cost (and thus directly cost them more money because of cloud usage fees).

      It’s why APIs are such a big deal. They provide you with an official, controlled, entry point to a platform’s data.

      • Dr. Moose
        link
        fedilink
        English
        11
        edit-2
        1 year ago

        It’s the opposite! There’s legal precedent that scraping public data is 100% legal in the US.

        There are a few countries where scraping is illegal, though, like Japan and China. European countries often also have so-called “database protection” laws that forbid replicating public databases through scraping or any other means, but that has to involve a big chunk of the overall database. There are also personally identifiable information (PII) protection laws that restrict storing people’s data without their consent (like GDPR).

        Source: I work with anti-bot tech and we have to explain this to almost every customer who wants to “sue the web scrapers”: lol, if LinkedIn couldn’t do it, you’re not suing anyone.

        • @[email protected]
          link
          fedilink
          English
          2
          1 year ago

          Refreshing to see a post on this topic that has its facts straight.

          EU copyright allows a machine-readable opt-out from AI training (unless it’s for scientific purposes). I guess that’s behind these deals. It means they will have to pay off Reddit and the other platforms for access to the EU market. Or more accurately, EU customers will have to pay Reddit and the other platforms for access to AIs.

  • @[email protected]
    link
    fedilink
    English
    4
    1 year ago

    “Strikes” made me think they were cancelling the deal. Like strike-through, crossed it out, etc. Too bad.