#benchmarks


You know how sometimes a little hobby side-project can get a bit out of hand? An unexpected performance regression on speed.python.org that only showed up on GCC 5 (and 7) led me to set up more rigorous tracking of Python performance when using different compilers. I'm still backfilling data but I think it's pretty awesome to see how much, and how consistently, free-threaded Python performance has improved since 3.13:

github.com/Yhg1s/python-benchm

GitHub · Yhg1s/python-benchmarking-public: Curated results from personal bench_runner benchmarks
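For anyone curious what this kind of tracking involves, here is a minimal sketch of building free-threaded CPython with two different compilers and comparing pyperformance results. This is not the author's actual bench_runner setup; the compiler binaries, paths, and output file names are assumptions for illustration.

```python
"""Rough sketch: build CPython with two compilers and compare pyperformance
results. NOT the author's bench_runner configuration; compiler names, paths,
and file names are illustrative assumptions."""
import os
import subprocess
from pathlib import Path

CPYTHON_SRC = Path("cpython")                        # assumed checkout of python/cpython
COMPILERS = {"gcc": "gcc-12", "clang": "clang-18"}   # assumed compiler binaries


def build(cc_name: str, cc_bin: str) -> Path:
    """Configure and build a free-threaded CPython with the given compiler."""
    build_dir = Path(f"build-{cc_name}")
    build_dir.mkdir(exist_ok=True)
    env = {**os.environ, "CC": cc_bin}
    subprocess.run(
        [str(CPYTHON_SRC.resolve() / "configure"),
         "--disable-gil",              # free-threaded build (3.13+)
         "--enable-optimizations"],
        cwd=build_dir, env=env, check=True)
    subprocess.run(["make", "-j8"], cwd=build_dir, env=env, check=True)
    return build_dir / "python"


def bench(python: Path, out: str) -> None:
    """Run the pyperformance suite against a specific interpreter."""
    subprocess.run(["pyperformance", "run", "--python", str(python), "-o", out],
                   check=True)


if __name__ == "__main__":
    results = []
    for name, cc in COMPILERS.items():
        interpreter = build(name, cc)
        out_file = f"{name}.json"
        bench(interpreter, out_file)
        results.append(out_file)
    # Compare the two runs; pyperf prints a per-benchmark table of deltas.
    subprocess.run(["python", "-m", "pyperf", "compare_to", *results, "--table"],
                   check=True)
```

Repeating a run like this per compiler and per CPython revision is roughly what the backfilled history in the linked repository tracks.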

"Bluntly, the Y-axis simply doesn’t make much sense. And needless to say, if the Y-axis doesn’t make sense, you can’t meaningfully use the graph to make predictions. Computers can answer some questions reliably now, for example, and some not, and the graph tells us nothing about which is which or when any specific question will be solved. Or consider songwriting; Dylan wrote some in an afternoon; Leonard Cohen took half a decade on and off to write Hallelujah. Should we average the two figures? Should we sample Dylan songs more heavily because he wrote more of them? Where should songwriting go on the figure? The whole thing strikes us as absurd.

Finally, the only thing METR looked at was “software tasks”. Software might be very different from other domains, in which case the graph (even if it did make sense) might not apply. In the technical paper, the authors actually get this right: they carefully discuss the possibility that the tasks used for testing might not be representative of real-world software engineering tasks. They certainly don't claim that the findings of the paper apply to tasks in general. But the social media posts make that unwarranted leap.

That giant leap seems especially unwarranted given that there has likely been a lot of recent data augmentation directed towards software benchmarks in particular (where this is feasible). In other domains where direct, verifiable augmentation is less feasible, results might be quite different. (Witness the failed letter ‘r’ labeling task depicted above.) Unfortunately, literally none of the tweets we saw even considered the possibility that a problematic graph specific to software tasks might not generalize to literally all other aspects of cognition.

We can only shake our heads."

garymarcus.substack.com/p/the-

Marcus on AI · The latest AI scaling graph - and why it hardly makes sense · By Gary Marcus

The Redmi 14C is here — and it's bringing serious value to the budget segment.
We've broken down the full specs and benchmark results to see how it stacks up against the competition.

Unisoc T610 processor

6.71" HD+ display

Massive 5000mAh battery

Geekbench + AnTuTu results inside

Check out the full breakdown, performance insights, and more:
radargit.com/2025/04/12/xiaomi

Radargit · Xiaomi Redmi 14C: specifications and benchmarks. Explore the Xiaomi Redmi 14C: budget-friendly smartphone with a 6.88-inch 120Hz display, 50MP AI camera, and 5160mAh battery.

Meta is facing flak for using an experimental Llama 4 Maverick model to inflate benchmark scores. This prompted an apology and a policy shift, now favoring the original version, which lags behind OpenAI's GPT-4o, Anthropic's Claude 3.5, and Google's Gemini 1.5. Meta explained that the experimental version was optimized for dialogue and excelled in LM Arena, though the reliability of that benchmark is debated. Meta clarifies that it tests various AI models and is releasing the open-source version of Llama 4. #AI #Meta #Llama4 #Benchmarks

Core Ultra 9 285: Performance Tests in Benchmarks and Full Specs

Intel's latest Core Ultra 9 285 is here — and we've put it through its paces. From raw benchmark scores to full hardware specs, find out how it stacks up against the competition.

Read the full breakdown:
radargit.com/2025/04/11/core-u

Is this the new king of high-performance CPUs? Let’s talk.

Radargit · Core Ultra 9 285: performance tests in benchmarks and full specs. Intel® Core™ Ultra 9 Processor 285: we tested this 24-core powerhouse in games, AI tasks, and 4K rendering. Full specs, benchmarks, …

Hello clever Fediverse, I'm currently diving into a completely absurd #Rabbithole, and thanks to #LlmStudio and #Ollama my M1 MacBook has rediscovered its fan… Locally, #LLM tops out at 8-12B parameters for me right now (32 GB RAM). Are there any #Benchmarks out there that would please talk me out of believing this gets drastically better with an M4 and >48 GB RAM? Or would something entirely different be smarter? Or a different hobby? It has to be mobile (reachable), because my lifestyle is too unsettled for a desktop. Recommendations welcome in the comments.
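One way to answer this with numbers instead of gut feeling is to measure local token throughput on the current machine and compare it with figures reported for other hardware. A minimal sketch against Ollama's local HTTP API (default port 11434); the model tag and prompt below are placeholders, not recommendations.

```python
"""Rough throughput check for a local Ollama model. Assumes the Ollama server
is running on its default port and the model named below has been pulled;
model and prompt are placeholders."""
import json
import urllib.request

MODEL = "llama3.1:8b"   # placeholder: any locally pulled model tag
PROMPT = "Explain the Collatz conjecture in three sentences."

request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({"model": MODEL, "prompt": PROMPT, "stream": False}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.load(response)

# Ollama reports eval_count (generated tokens) and eval_duration (nanoseconds).
tokens = result["eval_count"]
seconds = result["eval_duration"] / 1e9
print(f"{MODEL}: {tokens} tokens in {seconds:.1f}s ({tokens / seconds:.1f} tok/s)")
```

Running the same script on an M1/32 GB and on a borrowed M4/48 GB (or comparing against published numbers for the same model and quantization) gives a rough, directly comparable tokens-per-second figure.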

Mashable: A new AI test is outwitting OpenAI, Google models, among others. “The Arc Prize Foundation, a nonprofit that measures AGI progress, has a new benchmark that is stumping the leading AI models. The test, called ARC-AGI-2, is the second edition of the ARC-AGI benchmark, which tests models on general intelligence by challenging them to solve visual puzzles using pattern recognition, context […]

https://rbfirehose.com/2025/03/29/mashable-a-new-ai-test-is-outwitting-openai-google-models-among-others/