this post was submitted on 24 Feb 2025

In case anyone isn't familiar with llama.cpp and GGUF: together they let you keep part of a model in regular RAM when it doesn't all fit in VRAM, splitting the inference work between CPU and GPU. This is of course significantly slower than running the model entirely on the GPU, but depending on your use case it might be acceptable if you want to run larger models locally.

However, since you can no longer use the "pick the largest quantization that fits in memory" logic, there are more decisions to make when choosing which file to download. For example, I have 24 GB of VRAM, so to run a 70B model I could either use a Q4_K_S quant and fit perhaps 40 of the 80 layers in VRAM, or a Q3_K_S quant and fit maybe 60 layers instead. But how would each choice affect speed and text quality? Then there are of course the IQ quants, which are supposedly higher quality than a Q quant of similar size, but possibly a little slower.
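As a rough way to frame the layer-count question, here is a minimal sketch that estimates how many layers fit in VRAM. It assumes every layer takes an equal share of the file size, and that a fixed amount of VRAM is reserved for the KV cache and compute buffers. The file sizes (40 GB and 30 GB) and the 2 GB reserve are illustrative guesses, not measured values, and `layers_that_fit` is a hypothetical helper, not part of llama.cpp.

```python
def layers_that_fit(model_size_gb, n_layers, vram_gb, reserved_gb=2.0):
    """Estimate how many layers fit in VRAM.

    Assumes all layers are the same size (model_size_gb / n_layers) and that
    reserved_gb of VRAM is kept free for the KV cache and compute buffers.
    Both are simplifications; real layer sizes and overheads vary.
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Illustrative (assumed) file sizes for an 80-layer 70B model on a 24 GB card.
for name, size_gb in [("Q4_K_S (assumed 40 GB)", 40.0), ("Q3_K_S (assumed 30 GB)", 30.0)]:
    print(name, "->", layers_that_fit(size_gb, 80, 24.0), "layers")
```

The estimate only gives a starting point for the split; the actual number still has to be confirmed by loading the model and watching memory usage.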

In addition to the quantization choice, there are flags that affect memory usage. For example, I can choose not to offload the KQV cache to the GPU, which slows down inference, but perhaps it's a net gain if the freed VRAM lets me offload more model layers instead? I can also save some RAM/VRAM by using a quantized cache, probably with some quality loss, but I could spend the savings on a larger quant, and perhaps that would offset the loss.
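To get a feel for how much memory the cache takes, here is a back-of-the-envelope sketch. It assumes a Llama-2-70B-style architecture (80 layers, 8 KV heads, head dimension 128) and an 8192-token context, and it uses the per-element storage cost of the ggml block formats (f16 at 2 bytes, q8_0 at 34 bytes per 32 values, q4_0 at 18 bytes per 32 values). The helper name and the chosen context length are assumptions for illustration only.

```python
# Bytes per stored element for each cache type (block size 32 for the q formats).
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values plus one f16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values plus one f16 scale per block
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_element):
    """Size of the K and V caches combined, in bytes."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return int(elements * bytes_per_element)


# Assumed model shape: 80 layers, 8 KV heads, head dim 128, 8192-token context.
for cache_type, bpe in BYTES_PER_ELEMENT.items():
    size = kv_cache_bytes(80, 8, 128, 8192, bpe)
    print(f"{cache_type}: {size / 2**30:.2f} GiB")
```

Under these assumptions the f16 cache comes to 2.5 GiB, so quantizing it or keeping it in system RAM frees an amount of VRAM comparable to a handful of model layers.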

I was just wondering whether anyone has already done experiments or benchmarks in this area; I didn't find any exact comparisons through search engines. I'm planning to run some benchmarks myself, but I'm not sure when I'll have time.

[–] j4k3@lemmy.world 3 points 8 months ago

I just use Oobabooga. I wrote a script that loops and displays the memory remaining on the GPU, so I could optimize the split for each model, but that was it.
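A monitoring loop of that kind could look roughly like the sketch below. This is not the commenter's script: it assumes an NVIDIA GPU with `nvidia-smi` on the PATH, and the function names `parse_memory` and `watch` are made up for illustration.

```python
import subprocess
import time

# Ask nvidia-smi for used and total memory in MiB, one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(output):
    """Turn 'used, total' lines into (used, total, free) tuples in MiB."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total, total - used))
    return rows


def watch(interval_s=1.0):
    """Print the free memory of every GPU until interrupted with Ctrl+C."""
    while True:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        for index, (used, total, free) in enumerate(parse_memory(out)):
            print(f"GPU {index}: {free} MiB free of {total} MiB")
        time.sleep(interval_s)
```

Running `watch()` in a second terminal while loading a model with different layer splits shows how close each split gets to filling the card.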