this post was submitted on 15 Aug 2025
16 points (100.0% liked)

[–] domi@lemmy.secnd.me 3 points 2 weeks ago (1 children)

I ordered a Beelink GTR9 Pro which should hopefully arrive next month.

Really excited to play around with it, the 24GB in my 7900 XTX just doesn't cut it for local LLMs.

There are a lot of benchmarks for the 395 processor here: https://kyuz0.github.io/amd-strix-halo-toolboxes/

They are leaving a lot of performance (and VRAM) on the table by doing this on Windows.

[–] panda_abyss@lemmy.ca 2 points 1 day ago (1 children)

I just got the FW Desktop with the 128GB 395 yesterday

Let me know if you have specific models you want me to run/benchmark.

I found most of the reviews inadequate, just running models with a 4K context. I use 64-128k.

Qwen Coder and GPT OSS 120b both run very well on the 395, I get respectable token rates with medium/long context.

Dense models can be a little slow (e.g. Gemma no quant). Prompt processing is a little slow, but not problematic.

It’s impressive how quickly I can load 60gb models into vram.

[–] domi@lemmy.secnd.me 2 points 1 day ago (1 children)

Good to hear that people are starting to get their hands on them, still have to wait ~2 weeks for mine.

Besides the benchmarks already listed on kyuz0's GitHub, these are the things that are interesting to me:

  • GLM-4.5 (not Air) at something like IQ2_XXS or IQ2_M. Not sure if it can even fit with any reasonable context size and if it's even useful at all at that size but I have not seen anyone try yet: https://huggingface.co/unsloth/GLM-4.5-GGUF
  • Image generation with FLUX.1 (fp8 and fp16). Just got a new video with image generation on this chip but it's only with Qwen Image: https://youtu.be/7-E0a6sGWgs
  • Power usage with a large model loaded but idle
  • What's the cold start time from no model loaded to first token? Is it doable to run something like llama-swap and swap models on the fly without having to wait?
[–] panda_abyss@lemmy.ca 2 points 1 day ago (1 children)

I can't run GLM 4.5 on those quants, I've been unable to get beyond 96gb vram (I know you can get 112, but I'm still a linux noob)

GPT OSS 120b (60gb) loads into clear memory in 37-45s (tested 3 times), but I think it can take up to 60s if there are other models in memory. I'm not sure what's going on there, it should take ~10s to read the model from disk, but I do get a lock error in lmstudio and an alloc failure.
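As a rough check on those numbers, the sketch below turns the reported load times into effective throughput. The 60 GB size and the 37-45 s and ~10 s figures come from the comment above; the helper name is made up for illustration, and GB is treated as a plain decimal unit.

```python
def effective_throughput_gb_s(model_size_gb: float, load_seconds: float) -> float:
    """Average rate at which the model was loaded, in GB/s."""
    if load_seconds <= 0:
        raise ValueError("load_seconds must be positive")
    return model_size_gb / load_seconds


MODEL_GB = 60.0  # GPT OSS 120b size quoted in the comment

for label, seconds in [
    ("fastest observed", 37),
    ("slowest observed", 45),
    ("expected from disk", 10),
]:
    rate = effective_throughput_gb_s(MODEL_GB, seconds)
    print(f"{label}: {seconds}s -> {rate:.2f} GB/s")
```

The observed loads work out to roughly 1.3-1.6 GB/s, against the 6 GB/s that a 10 s read would imply, so the extra time is probably not spent on the disk read alone.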

I don't know how to measure idle power with a model in memory (linux noob), but it's been on my desk all day with either GPT120b or Qwen Code and has been pretty quiet (just PSU fan running off and on). With Framework the fan seems to start at around 55C, the system idles with a model in memory at 45-50C.
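One possible way to get a rough idle power reading on Linux is the amdgpu hwmon interface in sysfs. The sketch below assumes the driver exposes `power1_average` or `power1_input` (in microwatts) under `/sys/class/drm/card*/device/hwmon/hwmon*/`; which file exists varies by hardware and kernel, and the value is what the driver reports for the GPU/APU, not power at the wall. The function name is hypothetical.

```python
from pathlib import Path
from typing import Optional

SYSFS_DRM = Path("/sys/class/drm")  # assumed standard location on Linux


def read_gpu_power_watts(drm_root: Path = SYSFS_DRM) -> Optional[float]:
    """Return the first hwmon power reading found, in watts, or None."""
    for name in ("power1_average", "power1_input"):
        pattern = f"card*/device/hwmon/hwmon*/{name}"
        for sensor in sorted(drm_root.glob(pattern)):
            try:
                microwatts = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor unreadable or not a number, try the next one
            return microwatts / 1_000_000
    return None


if __name__ == "__main__":
    watts = read_gpu_power_watts()
    print("no amdgpu power sensor found" if watts is None else f"{watts:.1f} W")
```

Sampling this a few times with a model loaded but idle would give a ballpark figure; a wall-socket meter is still the only way to see whole-system draw.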

I'll try and figure out comfyui or a nice way to run image models then get back to you. They're not really something I need/use, so I'm starting from zero on how to run them.

[–] domi@lemmy.secnd.me 2 points 1 day ago (2 children)

There's some info on how to set the kernel arguments to get full access to the 128GB as VRAM: https://github.com/kyuz0/amd-strix-halo-toolboxes?tab=readme-ov-file#6-host-configuration
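The linked guide is the authority for the exact kernel arguments. Purely to illustrate the arithmetic involved, the sketch below assumes the memory limit is raised with `amdgpu.gttsize` (in MiB) and `ttm.pages_limit` (in 4 KiB pages); treat both parameter names and the page size as assumptions to check against the guide and your kernel version.

```python
PAGE_SIZE_BYTES = 4096  # assumed page size used by ttm.pages_limit


def gtt_kernel_args(target_gib: int) -> str:
    """Build kernel arguments for a GTT pool of target_gib GiB (illustrative)."""
    if target_gib <= 0:
        raise ValueError("target_gib must be positive")
    gtt_mib = target_gib * 1024                      # GiB -> MiB
    pages = target_gib * 1024**3 // PAGE_SIZE_BYTES  # GiB -> 4 KiB pages
    return f"amdgpu.gttsize={gtt_mib} ttm.pages_limit={pages}"


print(gtt_kernel_args(128))
# amdgpu.gttsize=131072 ttm.pages_limit=33554432
```

Passing a smaller value than the full 128 would leave some headroom for the OS and desktop.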

There are also containers for llama.cpp and ComfyUI in that repo which have everything set up for you.

[–] panda_abyss@lemmy.ca 2 points 18 hours ago* (last edited 18 hours ago) (1 children)

I've got memory set up to use the full 128gb, downloading the IQ2_XXS GLM 4.5, and I'm also downloading the strix-halo Qwen image/video containers.

It's going to be a couple hours, so I'll check in tomorrow morning and update you

Edit: the qwen image toolbox is not working at all, literally zero iteration speed and none of my memory is being allocated. It seems to think I have 512MB of vram instead of the shared 128GB, this is probably a bug.

[–] domi@lemmy.secnd.me 1 points 13 hours ago (2 children)

> It seems to think I have 512MB of vram instead of the shared 128GB, this is probably a bug.

I think that is normal, it shows 512MB in his video as well.

Not sure what's going on with the zero iterations though.
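On the 512MB reading: on APUs the amdgpu driver reports the small dedicated VRAM carve-out separately from the GTT pool (shared system memory), which may be why tools show 512MB. A minimal sketch for comparing the two, assuming the driver's `mem_info_vram_total` and `mem_info_gtt_total` sysfs files are present under each card's `device` directory; the function name is hypothetical.

```python
from pathlib import Path


def read_amdgpu_memory_gib(device_dir: Path) -> dict:
    """Read dedicated VRAM and GTT totals (GiB) from an amdgpu device dir."""
    totals = {}
    for pool in ("vram", "gtt"):
        info_file = device_dir / f"mem_info_{pool}_total"
        try:
            totals[pool] = int(info_file.read_text().strip()) / 1024**3
        except (OSError, ValueError):
            totals[pool] = None  # file missing or unreadable on this system
    return totals


if __name__ == "__main__":
    for card in sorted(Path("/sys/class/drm").glob("card[0-9]")):
        print(card.name, read_amdgpu_memory_gib(card / "device"))
```

If the GTT total is close to the configured size while VRAM stays at 0.5, the kernel arguments took effect and the 512MB figure is only the dedicated portion.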

[–] panda_abyss@lemmy.ca 2 points 7 hours ago* (last edited 6 hours ago)

Was unable to get GLM 4.5 UD in that quant through LM studio, I'll try just llama.cpp instead

edit: Runs fine in llama.cpp, 5.1-5.6 tok/s on CPU, but I can't seem to fit the whole model on GPU. Still experimenting.

llama-cli -ngl 93 -c 12288 --no-mmap -t 16

12k context seems like the largest I can get (uses 118GB). You can probably push it further without a GUI, but on a desktop environment gnome daemon starts killing processes.

Prompt processing at 12.3t/s, inference at 10.7-11.1 t/s.

I would say this verges on not-usable between the speed and context window. After the thinking tokens are through you've burned a lot of your usable context.

edit: implementing Conway's Game of Life in numpy worked, used 3k/12k context, and took 7 minutes.
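A quick sanity check on those figures, as a sketch: it assumes "3k" means about 3000 tokens in total and that the whole 7 minutes was spent on that run, and compares the resulting average with the ~11 t/s generation rate reported above. The helper names are made up for illustration.

```python
def average_tokens_per_second(tokens: int, seconds: float) -> float:
    """End-to-end average rate for a whole run."""
    return tokens / seconds


def seconds_to_generate(tokens: int, tokens_per_second: float) -> float:
    """Time needed if every token arrived at the given steady rate."""
    return tokens / tokens_per_second


observed = average_tokens_per_second(3000, 7 * 60)
ideal = seconds_to_generate(3000, 11.0)

print(f"observed average: {observed:.1f} t/s")                # 7.1 t/s
print(f"ideal time at 11 t/s: {ideal / 60:.1f} min")          # 4.5 min
```

The end-to-end average of about 7 t/s sits below the quoted generation rate, which is plausible once prompt processing and startup overhead are counted in the 7 minutes.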

[–] panda_abyss@lemmy.ca 2 points 7 hours ago* (last edited 4 hours ago)

Edit: image Gen works through comfyui, but is slow. Exact same experience as the video, works as he says. Not rendering text literally halves the compute time, but Qwen Image works.

So I had neglected to download the models. Chalk that up to trying to do this after coming home from the bar.

Qwen image is not working in that toolbox container.

First attempt had bad memory access when trying to merge the model and LoRA, second attempt it kept trying to use CUDA so failed to generate, third attempt it reached 100% of denoising generation but started running into strong GPU lag on the rest of the desktop environment and never produced the image.

Fourth attempt failed to merge LoRA again due to HIP memory errors -- this is after a fresh reboot, so no resource contention.

Rebooted qwen, fifth attempt it does merge LoRA and start generation, but never actually finishes an iteration. For some reason this time it appears to be trying to run on 4 CPUs. This run again in the logs said it was trying CUDA, so... I suspect it's the same failure.

It looks to me like a torch configuration issue.
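One way to probe the torch configuration inside the container is sketched below. ROCm builds of PyTorch expose the GPU through the `torch.cuda` API and set `torch.version.hip`, so "CUDA" appearing in the logs is not by itself proof that the wrong backend is in use. The function name is hypothetical, and the sketch only reports what torch sees; it does not diagnose the failure described above.

```python
def describe_torch_backend() -> dict:
    """Summarize which GPU backend the installed PyTorch build targets."""
    try:
        import torch
    except ImportError:
        return {"installed": False}

    info = {
        "installed": True,
        "torch_version": torch.__version__,
        # Set on ROCm builds, None on CUDA or CPU-only builds.
        "hip_version": getattr(torch.version, "hip", None),
        # Set on CUDA builds, None on ROCm or CPU-only builds.
        "cuda_version": getattr(torch.version, "cuda", None),
        # On ROCm builds this reports HIP device availability.
        "gpu_available": torch.cuda.is_available(),
    }
    if info["gpu_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


print(describe_torch_backend())
```

A `hip_version` of `None` or `gpu_available` of `False` inside the toolbox would point at a CPU-only or CUDA wheel having been installed, which would fit the "running on 4 CPUs" symptom.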

[–] panda_abyss@lemmy.ca 2 points 1 day ago

I’ll give those a run tomorrow or late tonight

[–] HelloRoot@lemy.lol 2 points 2 weeks ago* (last edited 2 weeks ago) (1 children)

Seems pretty decent, but I wonder how it compares to an AI-optimized desktop build on the same $2000 budget.

[–] sith@lemmy.zip 2 points 2 weeks ago* (last edited 2 weeks ago) (2 children)

It will probably kick the ass of that desktop. $2000 won't get you far with a conventional build.

[–] humanspiral@lemmy.ca 1 points 6 days ago

An 8700G with 256GB of RAM is possible on desktop. It has half the APU performance, but for coding a bigger, less stupid model beats a fast one. No one seems to be using such a rig though.

[–] HelloRoot@lemy.lol 1 points 2 weeks ago* (last edited 2 weeks ago) (2 children)

Well, that's what I said: "AI optimized".

Even my 5-year-old $900 rig can output around 4 tps.

[–] d_arm64@lemmy.world 1 points 2 weeks ago (1 children)

With what model? GPT oss or something else?

[–] HelloRoot@lemy.lol 1 points 2 weeks ago* (last edited 2 weeks ago) (2 children)

LLama 3 8B Instruct: 25tps

DeepSeek R1 distill qwen 14b: 3.2tps

To be fair: Motherboard, cpu and ram I bought 6 years ago with an nvidia 1660. Then I bought the Radeon RX 6600 XT on release in 2021, so 4 years ago. But it's a generic gaming rig.

I would be surprised if $2000 worth of modern hardware, picked for this specific task, would be worse than that mini PC.

[–] sith@lemmy.zip 1 points 2 weeks ago* (last edited 2 weeks ago)

I promise. It's not possible. But things change quickly of course.

(Unless you're lucky/pro and get your hands on some super cheap used high end hardware..)

[–] d_arm64@lemmy.world 1 points 2 weeks ago

To be honest that is pretty good. Thanks!

[–] sith@lemmy.zip 1 points 2 weeks ago

There is nothing "optimized" that will get you better inference performance of medium/large models at $2000.

[–] rkd@sh.itjust.works 1 points 1 week ago* (last edited 1 week ago)

For some weird reason, in my country it's easier to order a Beelink or a Framework than an HP. They will sell everything else, except what you want to buy.