March 15, 2025

Google Claims Gemma 3 Reaches 98% of DeepSeek’s Accuracy Using Only One GPU – Slashdot

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
I was getting all excited when I thought the article was talking about Gemma Chan… turns out it’s just another generic AI bot.

I guess nobody here has ever seen “Humans”.

Please clap.

Enhance 224 to 176. Enhance, stop. Move in, stop. Pull out, track right, stop. Center in, pull back. Stop. Track 45 right. Stop. Center and stop. Enhance 34 to 36. Pan right and pull back. Stop. Enhance 34 to 46. Pull back. Wait a minute, go right, stop. Enhance 57 to 19. Track 45 left. Stop. Enhance 15 to 23. Give me a hard copy right there.

My binned M4 Pro/48GB gets hot enough as it is, your Max must be roasting!

> My binned M4 Pro/48GB gets hot enough as it is, your Max must be roasting!

It wants to. I find the system fan curve will let it get hot enough that it starts pulling back the GPU clocks. I’m using Temp Monitor [vimistudios.com] to set a “boost” mode, where if it detects the average GPU core temp hitting 60C or above, it cranks the fans to 100%.

Can’t proactively set it to full fans, because the Mac refuses any fan commands until it itself turns on its fans (since they’re off when temps are reasonable). This is an annoying change from my M1 Max MBP, which let me set the fans to whatever I wanted, whenever.
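For anyone curious what that “boost” mode amounts to, here is a rough sketch of the polling loop in Python. The sensor and fan helpers are hypothetical stand-ins for whatever Temp Monitor (or a similar tool) actually calls under the hood; the 60C trigger comes from the comment above, and the release threshold is an added assumption to avoid toggling the fans on every sample.

```python
import time

# Boost at 60C (per the comment above); release point is assumed for hysteresis.
BOOST_AT_C = 60.0
RELEASE_AT_C = 50.0

def read_avg_gpu_core_temp() -> float:
    """Hypothetical helper: average GPU core temperature in Celsius."""
    raise NotImplementedError("platform-specific sensor read goes here")

def set_fans_full() -> None:
    """Hypothetical helper: pin the fans at 100% duty."""
    raise NotImplementedError("platform-specific fan control goes here")

def restore_auto_fans() -> None:
    """Hypothetical helper: hand fan control back to the system curve."""
    raise NotImplementedError("platform-specific fan control goes here")

def boost_loop(poll_seconds: float = 2.0) -> None:
    boosting = False
    while True:
        temp = read_avg_gpu_core_temp()
        if not boosting and temp >= BOOST_AT_C:
            set_fans_full()          # crank to 100% once the GPU hits the trigger
            boosting = True
        elif boosting and temp < RELEASE_AT_C:
            restore_auto_fans()      # drop back to the system fan curve
            boosting = False
        time.sleep(poll_seconds)
```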
I imagine a lot of AI companies are re-evaluating their expected power consumption needs as it gets more efficient. Not good news for nuke fans. It’s also nice to see Google managing to compete with a Chinese firm for a change.

> Probably thermal throttling if the power drops off.

Nope. It will absolutely thermal throttle after a while, but spikes of 150W and then drops are simply because there’s a lot more than the GPU that can pull lots of power. The CPU and SSD, for example, are fully loaded pulling ~6GB/s into RAM while the model is loading, etc.

When it thermal throttles, the clock on the GPU starts falling off, and the 100W power draw drops to ~75W or so. This can be avoided by manually setting the fans to 100%, but the system won’t actually do this by itself; the highest it’ll set them on its own is around 55%.
> Feeding it a large image results in ~10 seconds to first token

They probably left the CLIP model in a format that can’t be GPU offloaded.

I made my own GGUFs (llama.cpp is all you need for this) from the hf safetensors to get what I needed. There’s usually a bit of time before all of the GGUF publishers get around to making all the permutations you’re likely to want.
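For reference, rolling your own GGUFs from the Hugging Face safetensors is roughly the workflow below. This is a minimal sketch assuming a recent llama.cpp checkout with the binaries built; the conversion script and quantize binary have been renamed across releases, and the model directory and output names here are just placeholders.

```python
# Sketch: convert a Hugging Face safetensors checkpoint to GGUF with llama.cpp,
# then requantize it. Run from inside a llama.cpp checkout.
import subprocess

HF_DIR = "./gemma-3-27b-it"                  # local safetensors download (placeholder)
FP16_GGUF = "gemma-3-27b-it-f16.gguf"
Q8_GGUF = "gemma-3-27b-it-q8_0.gguf"

# 1. safetensors -> FP16 GGUF
subprocess.run(
    ["python", "convert_hf_to_gguf.py", HF_DIR,
     "--outfile", FP16_GGUF, "--outtype", "f16"],
    check=True,
)

# 2. FP16 GGUF -> Q8_0 GGUF (smaller footprint, near-lossless in practice)
subprocess.run(["./llama-quantize", FP16_GGUF, Q8_GGUF, "Q8_0"], check=True)
```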

My M2 Max Studio with 64GB barely gets warm during full GPU and CPU load.

> My M2 Max Studio with 64GB barely gets warm during full GPU and CPU load.

My M4 Max MBP with 128GB doesn’t, if I force the fans to full. But for some annoying fucking reason, Apple has decided that the regular system profile will only ever push them to ~55%, and it gets quite fucking warm then.

I can run R1 32B q8 with over a 60K context window with the right settings.

> I can run R1 32B q8 with over a 60K context window with the right settings.

Kinda: that’s R1-distilled Qwen or Llama, not really R1, which is 600-something B parameters. You *can* run the real thing (using mmap()-capable llamas), but it’s really, really not pretty; I get ~0.14 t/s.
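Rough arithmetic on why the 60K context is the expensive part: the KV cache grows linearly with context length and layer count. The sketch below uses assumed architecture numbers (64 layers, 8 grouped-query KV heads of dimension 128, roughly Qwen2.5-32B-class) purely for illustration, not measured values. At q8 the 32B weights are somewhere around 34GB, so presumably “the right settings” on a 64GB machine is mostly about keeping this cache small.

```python
# Back-of-the-envelope KV-cache size for a 32B-class model with GQA.
# All architecture numbers below are assumptions for illustration.
layers = 64          # transformer blocks (assumed)
kv_heads = 8         # grouped-query KV heads (assumed)
head_dim = 128       # per-head dimension (assumed)
context = 60_000     # tokens of context
bytes_per_val = 2    # fp16/bf16 cache; ~1 if the KV cache itself is quantized to 8-bit

kv_bytes = 2 * layers * kv_heads * head_dim * context * bytes_per_val  # 2 = keys + values
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")  # ~14.6 GiB at fp16, ~7.3 GiB at 8-bit
```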
There is no MLX version of Gemma 3 yet, I think, and I test most new models in LM Studio, so the one I downloaded won’t run yet.

I pulled the weights directly from hf and converted to GGUF using llama.cpp, in 2 flavors: FP16 (our Ma…

Does this mean AI only needs a single brain cell??

I think you’re going for funny, and on that basis it deserved to be FP. However, I think the significance of the story is pretty close to null. LOTS of room for optimization, though the claim of the second-system effect is that the biggest improvement is in the second round.

AI at that level these days has generally been something on the cloud that you pay fees to access. And it presumably has the entire history of your interaction with it, which is troubling. This improvement in efficiency (assuming it’s true) makes it a lot easier for a modest-size corporation to contemplate owning the physical AI. It will result in faster proliferation of these machines. Let’s hope we survive it.

That’s really not much of a claim, is it? Who has an H100 lying around?

It does not seem to be the RAM that makes AI work. It seems to be the trillions of integer calculations the chip can do in a second. If this is correct, then it raises a question: why would this affect the accuracy of the AI results at all? Shouldn’t it just be slower producing them (not less accurate) when run on a slower machine?
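Worth noting in the context of that question: fitting a big model onto one GPU usually means quantizing the weights, and quantization changes the weights themselves, not just the speed. The toy round-trip below (NumPy, made-up numbers) sketches the kind of error that introduces; it illustrates the general idea, not any particular quantization scheme.

```python
# Toy illustration of weight quantization error (absmax int8, per-tensor).
# Made-up numbers; real schemes (per-channel, group-wise, k-quants) are finer-grained.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096,)).astype(np.float32)  # pretend weight row

scale = np.abs(w).max() / 127.0                 # absmax scaling into the int8 range
w_q = np.round(w / scale).astype(np.int8)       # quantize
w_dq = w_q.astype(np.float32) * scale           # dequantize

err = np.abs(w - w_dq)
print(f"max abs error {err.max():.2e}, mean abs error {err.mean():.2e}")
```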
Interesting point that US AI guys are no longer just comparing their work to each other. They’re now genuinely treating Chinese AI models as a benchmark. A massive break with the status quo.

I’d need to know more about the tests… but it’s not really surprising. It’s got to have been trained on a specialized dataset rather than on “everything on the internet”, and small, specialized datasets are a lot cheaper both to train and to run. But could it tell you which gnus are found in Tanzania? (Not tell correctly, but even come up with a reasonable answer.) (OTOH, the use of Elo as the metric makes me think it was specialized for chess, but I’m guessing that this is incorrect.)

The new Huawei triple fold phone is going to blow their minds. For 3.5k you get a phone, with the entire stack integrated in house, from chip to software, that out-engineers every other phone manufacturer on the planet, including Apple. The sanctions have done Huawei a lot of good.

For large-scale access to the model it’s actually more expensive to run. R1 is FP8 native, and if you have 32 GPUs anyway to speed up requests, there is no lack of memory. Only active parameter count matters. That’s the advantage of MoE: lots of memory for trivia, but at scale it’s just as fast as a small model.
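The “only active parameter count matters” point is easy to put numbers on: per-token compute scales roughly with 2 × active parameters, so a sparse MoE with a huge total count costs about the same per token as a dense model the size of its active slice. A quick sketch below; the DeepSeek figures (~671B total, ~37B activated per token) are the publicly reported ones, and the 2·N FLOPs/token rule is the usual rough approximation.

```python
# Rough per-token compute for dense vs. MoE, using the common ~2 * N_active FLOPs/token rule.
def flops_per_token(active_params: float) -> float:
    return 2.0 * active_params

dense_32b = flops_per_token(32e9)     # dense ~32B model
r1_moe = flops_per_token(37e9)        # ~671B-total MoE, ~37B active per token (reported)

print(f"dense 32B : {dense_32b:.2e} FLOPs/token")
print(f"R1 (MoE)  : {r1_moe:.2e} FLOPs/token  -> comparable, despite ~20x the total weights")
```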
My litmus test is asking an AI to write me a poem about the Cummins in the style of e e cummings. Gemma 3 has done the best job so far; it actually looked vaguely like what I wanted. But then I asked it to make some changes and it changed… the title. And asked me if that was what I was going for. Downloading the 4b model now. Once I see how much VRAM that uses, I’ll see if I can run a bigger one.

Source: https://news.slashdot.org/story/25/03/13/0010231/google-claims-gemma-3-reaches-98-of-deepseeks-accuracy-using-only-one-gpu
