lmao: they have fixed this issue, it seems to always run Python now. Got to love how they just put this shit in production as "stable" Gemini 2.5 Pro with that idiotic multiplication thing that everyone knows about, and expect what? To Eliza Effect people into marrying Gemini 2.5 Pro?
TechTakes
Big brain tech dude got yet another clueless take over at HackerNews etc? Here's the place to vent. Orange site, VC foolishness, all welcome.
This is not debate club. Unless it’s amusing debate.
For actually-good tech, you want our NotAwfulTech community
Oh, and also for the benefit of our AI fanboys who can't understand why we would expect something as mundane as doing math from this upcoming super-intelligence, here's why:
Have they fixed it as in it genuinely uses Python completely reliably, or "fixed" it, like they tweaked the prompt and now it uses Python 95% of the time instead of 50/50? I'm betting on the latter.
Yeah, I'd also bet on the latter. They also added a fold-out button that shows you the code it wrote (folded by default), but you have to unfold it, or notice that it is absent.
Non-deterministic LLMs will always have randomness in their output. The best they can hope for is layers of sanity checks slowing things down and costing more.
If you wire the LLM directly into a proof-checker (like with AlphaGeometry) or evaluation function (like with AlphaEvolve) and the raw LLM outputs aren't allowed to do anything on their own, you can get reliability. So you can hope for better, it just requires a narrow domain and a much more thorough approach than slapping some extra firm instructions in an unholy blend of markup languages in the prompt.
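A minimal sketch of that pattern, assuming a generic verifier (the function names here are hypothetical stand-ins, not AlphaGeometry's or AlphaEvolve's actual interfaces): the model only proposes, and nothing leaves the loop until an external checker signs off.

```python
# Hypothetical sketch of the "LLM behind a verifier" pattern: raw model output
# is never trusted on its own; only externally checked results get returned.
def solve_with_verifier(problem, generate_candidate, proof_checker, max_attempts=10):
    for _ in range(max_attempts):
        candidate = generate_candidate(problem)           # untrusted LLM proposal
        ok, verified = proof_checker(problem, candidate)  # deterministic external check
        if ok:
            return verified                               # only checked results escape
    return None                                           # no verified answer found
```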
In this case, solving math problems is actually something Google Search could previously do (before dumping AI into it) and Wolfram Alpha can do, so it really seems like Google should be able to offer a product that gets math problems right. Of course, that solution would probably involve bypassing the LLM altogether through preprocessing and postprocessing.
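For simple cases, the "bypass" could be as dumb as the sketch below (a rough illustration, not anything Google actually ships): recognize plain arithmetic during preprocessing and answer it with ordinary code, so the LLM never gets a chance to guess.

```python
import re

def try_exact_math(query):
    """If the query is a plain integer expression like "123 * 456", compute it
    exactly; otherwise return None and hand the query off to whatever else."""
    m = re.fullmatch(r"\s*(\d+)\s*([+\-*])\s*(\d+)\s*", query)
    if not m:
        return None
    a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
    return {"+": a + b, "-": a - b, "*": a * b}[op]

print(try_exact_math("123456789 * 987654321"))  # exact, no LLM involved
```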
Also, btw, LLMs can be (technically speaking) deterministic if the temperature is set all the way down; it's just that this doesn't actually improve their performance at math or anything else. And they would still be "random" in the sense that minor variations in the prompt or previous context can induce seemingly arbitrary changes in output.
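A toy illustration of that point (made-up scores, nothing to do with any real model's internals): at temperature 0, decoding is just an argmax and gives the same token every time, while any nonzero temperature brings the sampling randomness back.

```python
import math
import random

logits = [2.0, 1.0, 0.5]  # toy stand-in for a model's next-token scores

def pick_token(logits, temperature):
    if temperature == 0:
        # Greedy decoding: take the highest score, identical on every run.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Otherwise sample from the softmax distribution; randomness returns.
    weights = [math.exp(score / temperature) for score in logits]
    return random.choices(range(len(logits)), weights=weights)[0]

print([pick_token(logits, 0) for _ in range(5)])    # always [0, 0, 0, 0, 0]
print([pick_token(logits, 1.0) for _ in range(5)])  # varies from run to run
```

Determinism only removes run-to-run variance, though; change one character of the prompt and the scores change, which is the "still random in practice" part.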
if you’re considering pasting the output of an LLM into this thread in order to fail to make a point: reconsider
One of the big AI companies (Anthropic with Claude? Yep!) wrote a long paper that details some common LLM issues, and they get into why these models do math wrong and lie about it in "reasoning" mode.
It's actually pretty interesting, because you can't say they "don't know how to do math" exactly. The stochastic mechanisms that allow it to fool people with written prose also allow it to do approximate math. That's why some digits are correct, or it gets the order of magnitude right but still does the math wrong. It's actually layering together several levels of approximation.
The "reasoning" is just entirely made up. We barely understsnd how LLMs actually work, so none of them have been trained on research about that, which means LLMs don't understand their own functioning (not that they "understand" anything strictly speaking).
Thing is, it has tool integration. Half of the time it uses Python to calculate it. If it uses a tool, that means it writes a string that isn't shown to the user, which runs the tool, and the tool results are appended to the stream.
What is curious is that instead of a request for precision causing it to use the tool (or just any request to do math), and the presence of the tool tokens then causing it to claim that a tool was used, the request for precision causes it to claim that a tool was used, directly.
Also, all of it is highly unnatural text, so it is either coming from fine-tuning or from training data contamination.
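For anyone wondering what that tool plumbing looks like in the abstract, here's a generic sketch (not Gemini's actual protocol; `llm_step` and the message format are assumptions): the tool-call string never reaches the user, and its result is simply appended to the context before the model keeps generating.

```python
import json

def run_with_tools(user_message, llm_step, tools):
    # `llm_step` is a stand-in for one model call; `tools` maps names to plain functions.
    context = [{"role": "user", "content": user_message}]
    while True:
        out = llm_step(context)                 # raw model output, untrusted
        call = out.get("tool_call")
        if call:                                # the string the user never sees
            result = tools[call["name"]](**call["args"])
            context.append({"role": "tool", "content": json.dumps(result)})
            continue                            # model sees the result on its next pass
        return out["content"]                   # the visible answer
```

Which is also why "it claims a tool was used" and "a tool was actually used" can come apart: the claim is just more generated text, while the tool call is a separate mechanical step.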
Also, if the LLM had reasoning capabilities that even remotely resembled those of an actual human, let alone someone able to replace office workers, wouldn't it use the best tool it had available for every task (especially in a case as clear-cut as this)? After all, almost all humans (even children) would automatically reach for their pocket calculators here, I assume.
A tool uses an LLM, the LLM uses a tool. What a beautiful ouroboros.
We barely understand how LLMs actually work
I would be careful how you say this. Eliezer likes to go on about giant inscrutable matrices to fearmonger, and the promptfarmers use the (supposed) mysteriousness as another avenue for crithype.
It's true that reverse engineering any specific output or task takes a lot of effort, requires access to the model's internal weights, and hasn't been done for most tasks, but the techniques for doing so exist. And in general there is a good high-level conceptual understanding of what makes LLMs work.
which means LLMs don’t understand their own functioning (not that they “understand” anything strictly speaking).
This part is absolutely true. If you catch them in a mistake, most of their data about how to respond comes from how humans respond, or at best from fine-tuning on other LLM output. They don't have any way of checking their own internals, so the words they say in response to mistakes are just more bs unrelated to anything.
So the "show thinking" button is essentially just for when you want to read even more untrue text?
Depending on the task it can significantly improve the quality of the output, but it doesn't help with everything. It's more useful for stuff that has to be reasoned about in multiple iterations, not something that's a direct answer.
Except not really, because even if stuff that has to be reasoned about in multiple iterations were a distinct category of problems, reasoning models by all accounts hallucinate a whole bunch more.
It’s just more llm output, in the style of “imagine you can reason about the question you’ve just been asked. Explain how you might have come about your answer.” It has no resemblance to how a neural network functions, nor to the output filters the service providers use.
It’s how the ai doomers get themselves into a flap over “deceptive” models… “omg it lied about its train of thought!” because of course it didn’t lie, it just emitted a stream of tokens that were statistically similar to something classified as reasoning during training.
I was hoping, until seeing this post, that the reasoning text was actually related to how the answer is generated. Especially regarding features such as using external tools, generating and executing code and so on.
I get how LLMs work (roughly, didn't take too many courses in ML at Uni, and GANs were still all the rage then), that's why I specifically didn't call it lies. But the part I'm always unsure about is how much external structure is imposed on the LLM-based chat bots through traditional programming filling the gaps between rounds of token generation.
Apparently I was too optimistic :-)
It is related, inasmuch as it’s all generated from the same prompt and the “answer” will be statistically likely to follow from the “reasoning” text. But it is only likely to follow, which is why you can sometimes see a lot of unrelated or incorrect guff in “reasoning” steps that’s misinterpreted as deliberate lying by ai doomers.
I will confess that I don’t know what shapes the multiple “let me just check” or correction steps you sometimes see. It might just be a response stream that is shaped like self-checking. It is also possible that the response stream is fed through a separate llm session which then pushes its own responses into the context window before the response is finished and sent back to the questioner, but that would boil down to “neural networks pattern matching on each other’s outputs and generating plausible response token streams” rather than any sort of meaningful introspection.
I would expect the actual systems used by the likes of openai to be far more full of hacks and bodges and work-arounds and let’s-pretend prompts than either you or I could imagine.
misinterpreted as deliberate lying by ai doomers.
I actually disagree. I think they correctly interpret it as deliberate lying, but they misattribute the intent to the LLM rather than to the company making it (and its employees).
edit: it's like when you're watching TV and ads come on, and you say that a very, very flat demon who lives in the TV is lying, because the bargain with the demon is that you get to watch entertaining content in exchange for having to listen to its lies. It's fundamentally correct about the lying, just not about the very flat demon.
New version of Descartes: imagine that an LLM, no less hallucination-prone than unaligned, is feeding its output directly into your perceptions...
Non cogitat, ergo non est ("it does not think, therefore it is not")
Note that the train of thought thing originated from users as a prompt "hack": you'd ask the bot to "go through the task step by step, checking your work and explaining what you are doing along the way" to supposedly get better results. There's no more to it than pure LLM vomit.
(I believe it does have the potential to help somewhat, in that it's more or less equivalent to running the query several times and averaging the results, so you get an answer that's more in line with the normal distribution. Certainly nothing to do with thought.)
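For the avoidance of doubt about how shallow the "hack" is, here is roughly all it amounts to (a generic sketch; the exact wording is whatever the user or vendor felt like): text prepended to the question, with the "reasoning" being nothing more than the model's reply to that padded prompt.

```python
STEP_BY_STEP = (
    "Go through the task step by step, checking your work and "
    "explaining what you are doing along the way.\n\n"
)

def chain_of_thought_prompt(question):
    # The "thinking" the user later reads is just the model's answer to this
    # padded prompt, not a trace of anything happening internally.
    return STEP_BY_STEP + question

print(chain_of_thought_prompt("What is 12345 * 67890?"))
```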
Always_has_been.jpeg
As usual with chatbots, I'm not sure whether it is the wrongness of the answer itself that bothers me most or the self-confidence with which said answer is presented. I think it is the latter, because I suspect that is why so many people don't question wrong answers (especially when they're harder to check than a simple calculation).
233,324,900,064.
Off by 474,720.
I find it a bit interesting that it isn't more wrong. Has it ingested large tables and got a statistical relationship between certain large factors and certain answers? Or is there something else going on?
I posted a top-level comment about this also, but Anthropic has done some research on this. The section on reasoning models discusses math, I believe. The short version is that it has a bunch of math in its corpus, so it can approximate math (kind of, seemingly similar to how you'd do a back-of-the-envelope calculation in your head to get the orders of magnitude right), but it can't actually do calculations, which is why they often get the specifics wrong.
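A crude illustration of the difference being described (arbitrary numbers, nothing to do with the product in this thread): a back-of-the-envelope estimate lands in the right ballpark, while the exact calculation is a different thing entirely.

```python
a, b = 48_304, 83_921                    # arbitrary factors, purely for illustration

estimate = round(a, -4) * round(b, -4)   # back-of-the-envelope: right order of magnitude
exact = a * b                            # an actual calculation
print(f"estimate {estimate:,} vs exact {exact:,} (off by {abs(exact - estimate):,})")
```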
reasoning models
that’s a shot, everyone drink up