this post was submitted on 15 Aug 2025

HP Z2 Mini G1a Review: Running GPT-OSS 120B Without a Discrete GPU (www.storagereview.com)

submitted 2 weeks ago by sith@lemmy.zip to c/localllama@sh.itjust.works

[–] domi@lemmy.secnd.me 1 points 2 days ago (3 children)

> It seems to think I have 512MB of vram instead of the shared 128GB, this is probably a bug.

I think that is normal, it shows 512MB in his video as well.

Not sure what's going on with the zero iterations though.

[–] panda_abyss@lemmy.ca 2 points 2 days ago (1 children)

Huzzah! Qwen Image! Runtime 547s

[–] domi@lemmy.secnd.me 1 points 19 hours ago (1 children)

Seems pretty good for fp16, might be quite a bit faster with an fp8 workflow.

[–] panda_abyss@lemmy.ca 1 points 14 hours ago

Yeah, I’ve discovered that if you lower the resolution and drop the steps to ~4, you can prototype a prompt on blurry images.

Once you’ve got something with a good layout, 40 steps works better.

I’m a bit disappointed in Qwen Image’s prompt processing. It’s good, but it doesn’t have good knowledge and will swap people out if you ask for real people.

[–] panda_abyss@lemmy.ca 2 points 2 days ago* (last edited 2 days ago) (1 children)

Was unable to get GLM 4.5 UD in that quant running through LM Studio, so I'll try plain llama.cpp instead.

edit: Runs fine in llama.cpp at 5.1-5.6 tok/s on CPU, but I can't seem to fit the whole model in GPU memory. Still experimenting.

llama-cli -ngl 93 -c 12288 --no-mmap -t 16

12k context seems like the largest I can get (uses 118GB). You can probably push it further without a GUI, but in a desktop environment a GNOME daemon starts killing processes.

Prompt processing at 12.3t/s, inference at 10.7-11.1 t/s.

I would say this verges on not-usable between the speed and context window. After the thinking tokens are through you've burned a lot of your usable context.

edit: implementing Conway's Game of Life in NumPy worked, used 3k/12k context, and took 7 minutes.
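For anyone who wants to sanity-check those numbers, here is a rough sketch of the arithmetic. It uses the rates reported above (12.3 t/s prompt processing, 10.7 t/s generation at the low end) and a 12288-token context. The function names and the 200/2800 split of the ~3k tokens are assumptions made up for illustration, not measurements from the run.

```python
def estimate_seconds(prompt_tokens, output_tokens, pp_rate=12.3, tg_rate=10.7):
    """Rough lower bound on wall-clock time: prompt processing plus generation.

    pp_rate and tg_rate are tokens per second. The defaults are the figures
    reported in the comment above; everything else here is illustrative.
    """
    return prompt_tokens / pp_rate + output_tokens / tg_rate


def remaining_context(context_size, prompt_tokens, generated_tokens):
    """Tokens left in the window after the prompt and all generated tokens
    (thinking tokens count as generated tokens)."""
    return context_size - prompt_tokens - generated_tokens


# Hypothetical split of the ~3k tokens used: short prompt, long thinking + answer.
secs = estimate_seconds(prompt_tokens=200, output_tokens=2800)
print(f"estimated time: {secs / 60:.1f} min")
print(f"context left:   {remaining_context(12288, 200, 2800)} tokens")
```

The estimate only covers token throughput, so it comes out lower than the 7 minutes reported. Model loading and any slowdown as the context fills would presumably account for the rest.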

[–] domi@lemmy.secnd.me 1 points 19 hours ago (1 children)

> Prompt processing at 12.3t/s, inference at 10.7-11.1 t/s.

Is that still on CPU or did you get it working on GPU?

I have seen a few people recommending GLM 4.5 at lower quants primarily for more intricate writing, might be worth the lower speed and context size for shorter texts.

Thanks for testing!

[–] panda_abyss@lemmy.ca 1 points 14 hours ago

That was GPU; CPU was about 5 t/s.

I’ve also tested the image processing more: a 512x512 image takes about a minute, 1400x900 takes about 7-10 minutes, and image to image takes about 10 minutes.

Most of the time for image to image is spent on the encoder and decoder layers, and decoding is what scales the worst with image size.
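A quick back-of-the-envelope check of that scaling, using only the two timings given above (about 1 minute at 512x512, about 7-10 minutes at 1400x900). The helper names are made up for illustration, and two data points are far too few to fit anything seriously, so treat the exponent as a rough hint only.

```python
import math


def pixel_ratio(w1, h1, w2, h2):
    """How many times more pixels the second image has than the first."""
    return (w2 * h2) / (w1 * h1)


def scaling_exponent(pixels_ratio, time_ratio):
    """Exponent k such that time grows like pixels**k between two measurements."""
    return math.log(time_ratio) / math.log(pixels_ratio)


ratio = pixel_ratio(512, 512, 1400, 900)
print(f"pixel ratio: {ratio:.2f}x")

# Baseline is ~1 minute at 512x512; the larger image took ~7 to ~10 minutes.
for minutes in (7, 10):
    k = scaling_exponent(ratio, minutes / 1.0)
    print(f"{minutes} min -> time ~ pixels^{k:.2f}")
```

The larger image has roughly 4.8 times as many pixels but takes 7-10 times as long, so the exponent lands between about 1.2 and 1.5. That is worse than linear in pixel count, which fits the observation that some stage scales badly with image size.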

[–] panda_abyss@lemmy.ca 2 points 2 days ago* (last edited 2 days ago)

Edit: image gen works through ComfyUI, but it is slow. Exact same experience as the video, works as he says. Not rendering text literally halves the compute time, but Qwen Image works.

So I had neglected to download the models. Chalk that up to trying to do this after coming home from the bar.

Qwen Image is not working in that toolbox container.

First attempt had a bad memory access when trying to merge the model and LoRA. Second attempt kept trying to use CUDA, so it failed to generate. Third attempt reached 100% of denoising generation but started causing strong GPU lag on the rest of the desktop environment and never produced the image.

Fourth attempt failed to merge LoRA again due to HIP memory errors -- this is after a fresh reboot, so no resource contention.

Rebooted qwen, fifth attempt it does merge the LoRA and starts generation, but never actually finishes an iteration. For some reason this time it appears to be trying to run on 4 CPUs. The logs for this run again said it was trying CUDA, so... I suspect it's the same failure.

It looks to me like a torch configuration issue.
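One way to narrow down a suspected torch configuration issue is to ask PyTorch which backend it was built for. Below is a minimal sketch, assuming it is run inside the same Python environment the container uses. The function name is made up for illustration. Note that ROCm builds of PyTorch expose the GPU through the `torch.cuda` API, so "cuda" appearing in a log does not by itself mean the wrong build is installed.

```python
def describe_torch_backend():
    """Collect basic facts about the installed PyTorch build.

    Returns a dict; if torch is not importable, only {"installed": False}.
    """
    try:
        import torch
    except ImportError:
        return {"installed": False}

    info = {
        "installed": True,
        "torch_version": torch.__version__,
        # Set to a version string on ROCm builds, None otherwise.
        "hip_version": getattr(torch.version, "hip", None),
        # Set to a version string on CUDA builds, None otherwise.
        "cuda_version": getattr(torch.version, "cuda", None),
        # On ROCm builds this reports whether the AMD GPU is usable.
        "gpu_available": torch.cuda.is_available(),
    }
    if info["gpu_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


for key, value in describe_torch_backend().items():
    print(f"{key}: {value}")
```

If `hip_version` is `None` and `cuda_version` is set, the environment probably has a CUDA build of torch installed, which would match the failures described above. If `hip_version` is set but `gpu_available` is `False`, the problem is more likely in the ROCm setup than in torch itself.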