diz

joined 2 years ago
[–] diz@awful.systems 7 points 2 months ago

It would have to be more than just river crossings, yeah.

Although I'm also dubious that their LLM is good enough for universal river-crossing puzzle solving using a tool. It's not that simple: the constraints have to be translated into a format the tool understands, and the answer translated back. I got told that o3 solves my river-crossing variant, but the chat log they gave showed incorrect code being run and then a correct answer magically appearing, so I think it wasn't anything quite as general as that.
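
To make "the format that the tool understands" concrete, here's a minimal sketch (mine, nothing to do with whatever o3 actually ran) of the classic wolf/goat/cabbage puzzle reduced to what a solver actually needs: explicit states, moves, and a safety predicate. Translating a variant's constraints into this shape, and the answer back out, is exactly the step I doubt happens reliably.

```python
from collections import deque

ITEMS = frozenset({"wolf", "goat", "cabbage"})

def unattended_ok(bank):
    # Safety constraint: the goat can't be left alone with the wolf or the cabbage.
    return not ("goat" in bank and ("wolf" in bank or "cabbage" in bank))

def solve():
    # A state is (items on the left bank, farmer's bank); BFS finds the
    # shortest sequence of crossings.
    start = (ITEMS, "left")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, farmer), path = queue.popleft()
        if not left and farmer == "right":
            return path  # everything is across
        here = left if farmer == "left" else ITEMS - left
        for cargo in [frozenset()] + [frozenset({x}) for x in here]:
            if not unattended_ok(here - cargo):
                continue  # this trip would strand the goat with trouble
            new_left = left - cargo if farmer == "left" else left | cargo
            state = (new_left, "right" if farmer == "left" else "left")
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [sorted(cargo)]))

print(solve())  # 7 crossings, e.g. [['goat'], [], ['wolf'], ['goat'], ...]
```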

[–] diz@awful.systems 8 points 2 months ago

I'd just write the list, then assign randomly. Or perhaps pseudorandomly, like sort by hash and then split in two.
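
Something like this minimal sketch, using sha256 as the hash (the puzzle texts here are just placeholders):

```python
import hashlib

# "Sort by hash, split in two": deterministic, reproducible, and I can't
# (even unconsciously) cherry-pick which puzzles end up held back.
puzzles = ["puzzle 01 ...", "puzzle 02 ...", "puzzle 03 ...", "puzzle 04 ..."]

def h(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

ordered = sorted(puzzles, key=h)   # pseudorandom but reproducible order
half = len(ordered) // 2
public_set = ordered[:half]        # gets tested and posted
held_back_set = ordered[half:]     # never shown to any online chatbot
```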

One problem is that it is hard to come up with 20 or more completely unrelated puzzles.

Although I don't think we need a large number of puzzles for statistical significance here, not if it's something like 8/10 solved in the cheating set and 2/10 in the held-back set.
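
For what it's worth, with 10 puzzles per set that hypothetical outcome already clears the usual bar; a quick sketch with scipy (the counts are made up, per the above):

```python
from scipy.stats import fisher_exact

# 2x2 table of solved/failed counts for the public vs held-back sets.
table = [[8, 2],   # public ("cheating") set: 8 solved, 2 failed
         [2, 8]]   # held-back set: 2 solved, 8 failed
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"p = {p_value:.3f}")  # ~0.023, under the conventional 0.05
```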

[–] diz@awful.systems 13 points 2 months ago* (last edited 2 months ago)

Yeah, any time it's regurgitating an IMO problem, it's proof that it's almost superhuman; but any time it actually faces a puzzle with an unknown answer, suddenly "this is not what it's for".

[–] diz@awful.systems 19 points 2 months ago* (last edited 2 months ago) (4 children)

Further support for the memorization claim: I posted examples on this forum of novel river-crossing puzzles where LLMs completely fail.

Note that Apple's actors/agents river crossing is a well-known "jealous husbands" variant, which you can ask a chatbot to explain to you. It gladly explains, even as it can't follow its own explanation (since of course it isn't its own explanation but a plagiarized one, even if it changes the words).

edit: https://awful.systems/post/4027490 and earlier https://awful.systems/post/1769506

I think what I need to do is write up a bunch of puzzles, assign them randomly to two sets, and test & post one set while holding back the second (not even testing it on any online chatbots). Then in a year or two, see how much the public set improves versus the one that was held back.

[–] diz@awful.systems 13 points 2 months ago

> making LLMs not say racist shit

That is so 2024. The new big thing is making LLMs say racist shit.

[–] diz@awful.systems 3 points 2 months ago

Can't be assed to read the bs, but sometimes the use-after-free only happens in some rarely executed code path, or only when one branch is executed and then, later, another branch. So you may still need fuzzing to trigger the use-after-free before Valgrind can detect it.
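
A toy sketch of the rare-path point (in Python, so a stand-in use-after-close rather than a real use-after-free; the command names are made up):

```python
import io
import random

def process(commands, stream):
    for cmd in commands:
        if cmd == "reset":           # rarely taken branch: releases the resource
            stream.close()
        elif cmd == "log":           # later branch: uses the possibly-stale handle
            stream.write("entry\n")  # ValueError if the stream was closed

# The bug only fires on inputs where "reset" runs before a later "log",
# so a checker that watches executions only sees it if some input
# actually exercises that ordering. Hence the fuzzing loop:
for trial in range(1000):
    cmds = random.choices(["read", "log", "reset", "noop"], k=8)
    try:
        process(cmds, io.StringIO())
    except ValueError:
        print(f"trial {trial}: use-after-close triggered by {cmds}")
        break
```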

[–] diz@awful.systems 9 points 2 months ago* (last edited 2 months ago) (1 children)

I swear I'm gonna plug an LLM into a rather traditional solver I'm writing. I may tuck a point deep into the paper about how it's quite slow to use an LLM to mutate solutions in a genetic algorithm or a swarm solver. And in any case the non-LLM mutation would be the default.
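
Something like this sketch (all names illustrative, and the LLM mutator is a stub): the solver takes any mutate callable, so the LLM can be bolted on, but the cheap random mutation stays the default.

```python
import random

def random_mutate(solution):
    # Default mutation: flip one bit. Microseconds per candidate.
    i = random.randrange(len(solution))
    return solution[:i] + [1 - solution[i]] + solution[i + 1:]

def llm_mutate(solution):
    # Hypothetical LLM-backed mutation: one network round trip per
    # candidate, orders of magnitude slower than random_mutate.
    raise NotImplementedError("call an LLM API of your choice here")

def evolve(fitness, pop_size=50, length=20, gens=100, mutate=random_mutate):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return max(pop, key=fitness)

# Toy objective (maximize the number of ones), default mutator:
best = evolve(fitness=sum)
print(sum(best), best)
```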

Normally I wouldn’t sink that low but I got mouths to feed, and frankly, fuck it, they can persist in this madness for much longer than I can stay solvent.

This is as if there were a mass delusion that a pseudorandom number generator can serve as an oracle, predicting the future. Doing any kind of Monte Carlo simulation of something like weather in that world would of course confirm all the dumb shit.

[–] diz@awful.systems 3 points 2 months ago (4 children)

Yeah plenty of opportunities to just work it into the story.

I dunno what kind of local models you could use, though. If it's a 3D game then it's fine to require a GPU, but you wouldn't want to raise the minimum requirements too high. And you wouldn't want to spend 12 gigs of VRAM on a gimmick, either.

[–] diz@awful.systems 11 points 2 months ago* (last edited 2 months ago) (6 children)

I think it could work as a minor gimmick, like the terminal-hacking minigame in Fallout. You have to convince the LLM to tell you the password, or you get to talk to a demented robot whose brain was fried by radiation exposure, or the like. Relatively inconsequential stuff, like being able to talk your way through or just shoot your way through.

Unfortunately this shit is too slow and too huge to embed a local copy of into a game. You'd need broad hardware compatibility. And running it in the cloud would cost too much.

[–] diz@awful.systems 15 points 2 months ago* (last edited 2 months ago)

I was trying out the free GitHub Copilot to see what the buzz is all about.

It doesn't even know its own settings. This one little useful thing that isn't plagiarism, providing a natural-language interface to its own bloody settings, and it couldn't even do that.

[–] diz@awful.systems 17 points 2 months ago* (last edited 2 months ago)

All joking aside, there is something thoroughly fucked up about this.

What's fucked up is that we let these rich fucks threaten us with extinction to boost their stock prices.

Imagine if some cold-fusion scammer were permitted to gleefully boast that his experimental cold-fusion plant in the middle of a major city could blow it up. Setting off little hydrogen explosions, setting up a neutron source just to make it spicier, etc.

[–] diz@awful.systems 12 points 2 months ago* (last edited 2 months ago) (1 children)

> When confronted with a problem like "your search engine imagined a case and cited it", the next step is to wonder what else it might be making up, not to just quickly slap a bit of tape over the obvious immediate problem and declare everything to be great.

Exactly. Even if you ensure the cited cases or articles are real, it will misrepresent what said articles say.

Fundamentally, it is just blah-blah-blah-ing until the point comes where a citation would be likely to appear, then it blah-blah-blahs the citation based on the preceding text that it just made up. It plain should not be producing real citations. That it can produce real citations is deeply at odds with it being able to pretend at reasoning, for example.

Ensuring the citation is real, RAG-ing the articles in there, having AI rewrite drafts, none of these hacks do anything to address any of the underlying problems.
