• @[email protected]
    English
    210 days ago

    The latter test stops working if they write a specific bit of code to put out the ‘LLMs fail the river crossing’ fire, btw. Still a good test.

    • @[email protected]
      English
      68 days ago

      It would have to be more than just river crossings, yeah.

      Although I’m also dubious that their LLM is good enough for universal river crossing puzzle solving using a tool. It’s not that simple: the constraints have to be translated into a format that the tool understands, and the answer translated back. I was told that o3 solves my river crossing variant, but the chat log they gave showed incorrect code being run and then a correct answer magically appearing, so I don’t think it was anything quite as general as that.
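
To make the "translation" step concrete, here is a minimal sketch of what a generic solver tool might look like. It is an illustration only, not what o3 or any vendor actually runs. The function name `solve_river_crossing` and its parameters (`items`, `forbidden_pairs`, `boat_capacity`) are hypothetical, and it only handles the classic "these two can't be left alone together" kind of constraint. Anything a variant adds beyond that would need a different encoding, which is exactly the part that has to be done by whoever calls the tool.

```python
from collections import deque
from itertools import combinations


def solve_river_crossing(items, forbidden_pairs, boat_capacity=1):
    """Breadth-first search over (items on the left bank, boat side) states.

    items: names of the things to ferry across (hypothetical input format).
    forbidden_pairs: pairs that may not share a bank without the farmer.
    boat_capacity: how many items the farmer can carry per trip.
    Returns a list of tuples (the cargo of each trip), or None if unsolvable.
    """
    everything = frozenset(items)
    forbidden = [frozenset(pair) for pair in forbidden_pairs]

    def safe(bank):
        # A bank without the farmer must not contain any forbidden pair.
        return not any(pair <= bank for pair in forbidden)

    start = (everything, "left")
    goal = (frozenset(), "right")
    queue = deque([(start, [])])
    seen = {start}

    while queue:
        (left, side), path = queue.popleft()
        if (left, side) == goal:
            return path
        here = left if side == "left" else everything - left
        # Try every cargo of size 0..boat_capacity from the farmer's bank.
        for k in range(boat_capacity + 1):
            for cargo in combinations(sorted(here), k):
                if side == "left":
                    new_left, new_side = left - set(cargo), "right"
                else:
                    new_left, new_side = left | set(cargo), "left"
                # The bank the farmer just left is the unattended one.
                unattended = new_left if new_side == "right" else everything - new_left
                state = (new_left, new_side)
                if safe(unattended) and state not in seen:
                    seen.add(state)
                    queue.append((state, path + [cargo]))
    return None


# The classic puzzle, already translated into the tool's input format.
plan = solve_river_crossing(
    ["wolf", "goat", "cabbage"],
    [("wolf", "goat"), ("goat", "cabbage")],
)
for trip_number, cargo in enumerate(plan, start=1):
    print(trip_number, cargo or "(nothing)")
```

The search itself is the easy part. The hard part for an LLM is producing the `items` and `forbidden_pairs` arguments correctly from a puzzle stated in prose, noticing when a variant does not fit this format at all, and then turning the returned list of trips back into an answer.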