  • I think more low-tier output would be a disaster.

    Even pre-AI, I had to deal with a project where they shoved testing and compliance onto juniors for a long time. What a fucking mess it was. I had to go through every commit mentioning Coverity, because they had a junior fixing Coverity-flagged “issues”. I spent at least 2 days debugging a memory corruption crash caused by one such “fix”, and then I had to spend who knows how long reviewing every other such “fix”.
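
    To be concrete about the failure mode, here’s a hypothetical reconstruction (not the actual code from that project) of the classic pattern: the analyzer reports a leak, misses an ownership transfer, and the “fix” frees memory that is still in use.

    ```cpp
    #include <cstdlib>
    #include <cstring>

    struct Packet {
        char* payload;  // Packet owns payload after construction
    };

    Packet* make_packet(const char* src) {
        char* buf = static_cast<char*>(std::malloc(std::strlen(src) + 1));
        if (!buf) return nullptr;
        std::strcpy(buf, src);
        Packet* pkt = new Packet{buf};  // ownership of buf moves into pkt
        // The "fix" for a RESOURCE_LEAK report that missed the ownership
        // transfer: it silences the tool, makes pkt->payload dangle, and
        // the crash shows up much later, somewhere else entirely.
        // std::free(buf);
        return pkt;
    }
    ```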

    And don’t get me started on tests. 200+ tests, and not one of them caught several regressions in the handling of parameters that are shown early in the frigging how-to. Not some obscure corner case: the stuff you immediately run into if you just follow the documentation.
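
    The shape of such a “test” tends to look something like this (hypothetical names, GoogleTest syntax): it exercises the code path, asserts nothing about the behavior it claims to cover, and pads the count.

    ```cpp
    #include <gtest/gtest.h>
    #include <string>
    #include <vector>

    // Hypothetical stand-ins for the real project's config and parser.
    struct Config { int foo = 0; };
    Config parse_args(const std::vector<std::string>& args) {
        Config cfg;
        for (const auto& a : args)
            if (a.rfind("--foo=", 0) == 0) cfg.foo = std::stoi(a.substr(6));
        return cfg;
    }

    // Runs the parser, never checks the parsed value. A regression in
    // --foo handling passes this "test" with flying colors.
    TEST(ArgParsing, HandlesFooFlag) {
        Config cfg = parse_args({"tool", "--foo=42"});
        (void)cfg;  // no EXPECT_EQ(cfg.foo, 42) anywhere
    }
    ```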

    With AI, all of those numbers would be much larger - more commits “fixing Coverity issues” (and, worse yet, fixing “issues” that the LLM merely sees in the code), more so-called “tests” that don’t actually flag any real regressions, etc.





  • When they tested on bugs not in SWE-Bench, the success rate dropped to 57‑71% on random items, and 50‑68% on fresh issues created after the benchmark snapshot. I’m surprised they did that well.

    And “created after the benchmark snapshot” could still mean before the LLM training-data cutoff, or available to the model via RAG.

    edit: For a fair test you have to use GitHub issues that had not yet been resolved by a human.

    This is how these fuckers talk, all of the time. Also see Sam Altman’s not-quite-denials of training on Scarlett Johansson’s voice: they just asserted that they had hired a voice actor, but never denied training on Scarlett Johansson’s actual voice. edit: because anyone with half a brain knows that not only did they train on her actual voice, they probably gave it (and their other pirated movie soundtracks) massively higher weighting, just as they did for books and NYT articles.

    Anyhow, I fully expect that by now they use everything they can to cheat benchmarks, up to and including RAG from solutions past the training-dataset cut-off date. With two of the paper’s authors being from Microsoft itself, expect that their “fresh issues” are gamed too.





  • I was writing some math code, and, not being an idiot, I used an open-source math library for something called “QR decomposition”. It’s efficient, it supports sparse matrices (matrices where most of the numbers are 0), and so on.
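
    For reference, the sane version is a handful of lines against a real library. A minimal sketch, assuming Eigen (which library you pick isn’t the point):

    ```cpp
    #include <Eigen/Sparse>

    // Solve the least-squares problem min ||A*x - b|| with a sparse A,
    // using a battle-tested rank-revealing sparse QR factorization
    // instead of a hand-rolled one.
    Eigen::VectorXd solve_least_squares(const Eigen::SparseMatrix<double>& A_in,
                                        const Eigen::VectorXd& b) {
        Eigen::SparseMatrix<double> A = A_in;
        A.makeCompressed();  // SparseQR requires compressed storage
        Eigen::SparseQR<Eigen::SparseMatrix<double>,
                        Eigen::COLAMDOrdering<int>> qr;
        qr.compute(A);
        return qr.solve(b);  // applies the Q and R factors internally
    }
    ```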

    Just out of curiosity, I checked where some idiot vibecoder would end up. The AI simply plagiarizes shit sample snippets that exist purely to teach people what QR decomposition is. The result is actually unusable, because it’s numerically unstable.
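
    Those teaching snippets are typically classical Gram–Schmidt, roughly like the following (my own reconstruction of the pattern, not a verbatim model output). Rounding errors pile up and the computed Q drifts away from orthogonality on ill-conditioned inputs, which is exactly why production libraries use Householder reflections or Givens rotations instead.

    ```cpp
    #include <cmath>
    #include <vector>

    using Vec = std::vector<double>;
    using Mat = std::vector<Vec>;  // column-major: A[j] is column j

    static double dot(const Vec& a, const Vec& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    // Classical Gram-Schmidt, textbook style: each column is reduced by
    // projections of the ORIGINAL column, so in floating point the
    // subtractions cancel catastrophically and Q loses orthogonality on
    // ill-conditioned matrices. No guard against a near-zero norm either.
    Mat classical_gram_schmidt(const Mat& A) {
        Mat Q(A.size());
        for (size_t j = 0; j < A.size(); ++j) {
            Vec v = A[j];
            for (size_t i = 0; i < j; ++i) {
                const double r = dot(Q[i], A[j]);  // projects the original column
                for (size_t k = 0; k < v.size(); ++k) v[k] -= r * Q[i][k];
            }
            const double norm = std::sqrt(dot(v, v));
            for (double& x : v) x /= norm;  // blows up if columns are dependent
            Q[j] = v;
        }
        return Q;
    }
    ```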

    Who in the fuck even needs this shit to be plagiarized, anyway?

    It can’t plagiarize a production-quality implementation, because you can count those on the fingers of one hand, they’re complex as fuck, and you can’t just blend a few of them together to pretend you didn’t plagiarize.

    The answer is: the people who are peddling the AI. They are the ones who ordered plagiarism with extra plagiarism on top. These are not coding tools; these are demos meant to convince investors to buy the actual product, which is the company’s stock. There’s a little bit of tool functionality in there (you can ask them to refactor code), but that’s just you misusing a demo to try to get some value out of it.

    And to that end, the demos take every opportunity to plagiarize something, and to talk about how the “AI” wrote the code from scratch based on its supposed understanding of fairly advanced math.

    And in coding, it is counterproductive to plagiarize. Many open-source libraries can be used in commercial projects. You get upstream fixes for free. You don’t end up with bugs, or worse yet security exploits, that were fixed upstream after the training cut-off date.

    No one in their fucking right mind would willingly want their product to contain copy-pasted snippets from stale open-source libraries, passed through some sort of variable-renaming copyright-laundering machine.

    Except, of course, the business idiots who are in charge of software at major companies and don’t understand software. Who just failed upwards.

    They look at plagiarized lines and count them as improved productivity.



  • If it were a basement dweller with a chatbot that could be mistaken for a criminal co-conspirator, he would’ve gotten arrested and his computer seized as evidence, and then it would be a crapshoot whether he could even convince a jury that it was an accident. Especially if he was getting paid for his chatbot. Now, I’m not saying that this is right, just stating how it is for normal human beings.

    It may not be explicitly illegal for a computer to do something, but you are liable for what your shit does. You can’t just build a robot lawnmower and let it run over a neighbor’s kid. And if you are steering your lawnmower with random numbers… yeah.

    But because it’s OpenAI, with its 300 billion dollar “valuation”, absolutely nothing can happen whatsoever.




  • If I pirated a book and wrote a review of it, would that make the review copyright infringement?

    That would still leave pirating the book as copyright infringement. But if you used an AI (trained on the pirated book) to write the review, then pirating the book wouldn’t be copyright infringement, at least according to this judge.

    Re Anthropic: the ruling there was that the downloading itself was copyright infringement, regardless of whether Anthropic distributed works while torrenting them.

    In Meta’s case, the plaintiffs will likely find it impossible to demonstrate that the data Meta uploaded contained their books specifically. Maybe all the other users downloading that torrent were downloading other parts of it.


  • I appreciate the sentiment but I also hate the whole “AI is a power loom for coding”.

    The power loom for coding is called “git clone”.

    What “AI” (LLM) tools provide is just English as a programming language, with the plagiarized sum total of all open source as the standard library. English is a shit programming language. LLMs are shit at compiling it. Open source is awesome. Plagiarized open source is “meh”: you cannot apply upstream patches.


  • It’s not about ceding victory; it’s about whether or not we accept the judge shit-talking the plaintiffs’ lawyers as an adequate substitute for even a slap on the wrist. Clearly the judge wants to appear impartial.

    The plaintiffs made a perfectly good argument that Meta downloaded the books illegally, and that this downloading wasn’t necessary to enable a (fair or not) use. A human critic does not get a blanket license to pirate any work he might want to criticize, even though critique is fair use.


  • Having thought about it some more, the most charitable reading I can manage is that Meta’s judge thinks someone else could win this case if they could point to a specific book of theirs that was torrented, then point at the general situation with AI slop in bookstores and argue that AI harms book sales.

    I cannot imagine that working. At all. So the AI produces slop of infinitesimally higher quality because it was trained on a pirated copy of your book in particular? Clearly the extra harm to your business specifically, from the piracy of your books specifically, would be rather small, as this very judge would immediately point out. In fact, the AI slop is so shit that people only buy it by mistake, so its quality doesn’t really matter.

    Maybe news companies could sometimes win lawsuits like this, but book authors, no way.

    I think it is just pure copium to see this ruling in any kind of positive light. Alsup (misanthropic’s judge) was at least willing to ding an AI company for pirating books (although probably only because it wouldn’t be fatal to them the way it would be to Meta). This guy wouldn’t even do that bare minimum.

    And the whole approach is insane. You can’t make a movie without a movie-rights contract with the author, and a movie adaptation of a book is far more transformative than anything AI does. Especially the “training”, which is just fucking gradient descent: you nudge a bunch of numbers towards replicating the works, over and over, in a purely mechanical process.
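
    If “gradient descent” sounds mystical, here is a toy sketch of the mechanism, with a three-number “model” and squared error standing in for billions of weights and next-token loss. Each step mechanically nudges the weights toward the training data; run it long enough and the data is replicated.

    ```cpp
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> w = {0.0, 0.0, 0.0};        // model "weights"
        const std::vector<double> y = {0.7, -1.2, 3.4}; // the "works" to replicate
        const double lr = 0.1;                          // learning rate

        // Gradient descent on the squared error (w - y)^2: every step
        // nudges w a little closer to reproducing y, purely mechanically.
        for (int step = 0; step < 1000; ++step)
            for (size_t i = 0; i < w.size(); ++i)
                w[i] -= lr * 2.0 * (w[i] - y[i]);       // d/dw (w - y)^2 = 2(w - y)

        std::printf("%.6f %.6f %.6f\n", w[0], w[1], w[2]);  // prints ~= y
        return 0;
    }
    ```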

    Nobody ever had to successfully argue that movie sales harm book sales just to treat movie adaptations as derivative work.