we appear to be the first to write up the outrage coherently too. much thanks to the illustrious @self

  • @[email protected]
    English
    14
    1 year ago

    Mistral’s Mixtral-8x7B-Instruct-v0.1 produced copyrighted content on 22% of the prompts.

    did you know that a lesser-known side effect of the infinite monkeys approach is that they will produce whole sections of copyright content abso-dupo-lutely by accident? wild, I know! totes coinkeedink!

    I’d be glad to provide it once you’ve come to your senses and want to discuss things like an adult

    jesus fucking christ you must be a fucking terrible person to work with

    I’ve seen toddlers throw more mature tantrums

    • Steve
      English
      9
      1 year ago

      she wrote harry potter with an llm, didn’t she?

    • @[email protected]
      English
      3
      1 year ago

      I’m too old to discuss against bad faith arguments.

      Especially with people who won’t read the information I provide them showing their initial information was wrong.

      One is a company that has something to sell, the other an article with citations showing why it’s not easy to determine what percentage of a data set is infringing on copyright, or whether exact reproduction via “fishing expedition” prompting is a useful metric to determine if unauthorized copyright was used in training.

      The dumbest take though is attacking Mistral of all LLMs, even though it’s on an Apache 2.0 license.

      • @[email protected]
        English
        11
        edit-2
        1 year ago

        I’ve read the article you’ve posted: it does not refute the fucking datapoint provided, it literally DOES NOT EVEN MENTION MISTRAL AT ALL.

        so all I can tell you is to take your pearlclutching tantrum bullshit and please fuck off already

        • @[email protected]
          English
          2
          1 year ago

          Yes, clearly I’m the one throwing a tantrum 🙄

          Btw, you can just fact check my claim about what Mistral is licenced under. The article talks about copyright and AI detection in general, which to anyone with basic critical thinking skills could then understand would apply to other LLMs like Mistral.

          You might want to look up what pearl clutching means as well. You’re using it wrong:

          https://dictionary.cambridge.org/us/dictionary/english/pearl-clutching

          Considering I’ve done the opposite of a shocked reaction. While at it, maybe also look up “projection”

          https://www.psychologytoday.com/us/basics/projection

          Anyhow, have a good day.

        • @[email protected]
          English
          11
          1 year ago

          god these weird little fuckers’ ability to fill a thread with garbage is fucking notable isn’t it? something about loving LLMs makes you act like an LLM. how depressing for them.

          • @[email protected]
            English
            9
            1 year ago

            To think that when sneer club/techtakes migrated to lemmy, I was pretty sure we would not be getting a lot of incidental traffic to the communities. Just about as wrong as you can be.

          • Steve
            English
            9
            1 year ago

            I bet they go to react conferences

          • @[email protected]
            English
            11
            1 year ago
            • high willingness to accept painfully inexact responses
            • high tendency to side with authority when given no information
            • low ability to distinguish “how it is” from “how it seems like it should be”

            Meta:

            • default expectation that others are the same way
            • indignant consent-ignoring gesture if they’re not
        • @[email protected]
          English
          2
          1 year ago

          Well since you want to use computers to continue the discussion, here’s also ChatGPT:

          Determining the exact percentage of copyrighted data used to train a large language model (LLM) is challenging for several reasons:

          1. Scale and Variety of Data Sources: LLMs are typically trained on vast and diverse datasets collected from the internet, including books, articles, websites, and social media. This data encompasses both copyrighted and non-copyrighted content. The datasets are often so large and varied that it is difficult to precisely categorize each piece of data.

          2. Data Collection and Processing: During the data collection process, the primary focus is on acquiring large volumes of text rather than cataloging the copyright status of each individual piece. While some datasets, like Common Crawl, include metadata about the sources, they do not typically include detailed copyright status information.

          3. Transformation and Use: The data used for training is transformed into numerical representations and used to learn patterns, making it even harder to trace back and identify the copyright status of specific training examples.

          4. Legal and Ethical Considerations: The legal landscape regarding the use of copyrighted materials for AI training is still evolving. Many AI developers rely on fair use arguments, which complicates the assessment of what constitutes a copyright violation.

          Efforts are being made within the industry to better understand and address these issues. For example, some organizations are working on creating more transparent and ethically sourced datasets. Projects like RedPajama aim to provide open datasets that include details about data sources, helping to identify and manage the use of copyrighted content more effectively.

          Overall, while it is theoretically possible to estimate the proportion of copyrighted content in a training dataset, in practice, it is a complex and resource-intensive task that is rarely undertaken with precision.

          • @[email protected]
            English
            11
            1 year ago

            you should speak to a physicist, they might be able to find a way your density can contribute to science

          • Steve
            English
            9
            1 year ago

            “exact percentage”

            just fuck right off. wasting my fucken time.

              • @[email protected]
                English
                11
                1 year ago

                no, you utter fucking clown. they’re literally posting to take the piss out of you, the only person in the room who isn’t getting that everyone is laughing at them, not with them

              • Steve
                English
                9
                1 year ago

                re-read your chatgpt response and think about whether the percentages in my original link could be too high or too low.

                • Steve
                  English
                  9
                  1 year ago

                  but, like, really think this time. at this point i’m not arguing with you, i’m trying to help you.

                • @[email protected]
                  English
                  9
                  1 year ago

                  too high or too low

                  trick question everyone knows this late on a friday you want a body high for that nice mellow low feeling