robber

joined 2 years ago
[–] robber@lemmy.ml 9 points 6 days ago (2 children)

Given that Google generated more than 250 billion U.S. dollars in ad revenue in 2024, I'd say they must be pretty effective.

Source

[–] robber@lemmy.ml 1 points 1 week ago (1 children)
[–] robber@lemmy.ml 1 points 2 weeks ago (1 children)

I see. When I run the inference engine containerized, will the container be able to run its own version of CUDA or use the host's version?
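
I guess once it's up I could sanity-check which CUDA comes from where with something like this (just a sketch, assuming the image ships torch and nvidia-ml-py):

```python
import torch
import pynvml

# CUDA runtime that the torch build inside the container was compiled against
print("CUDA runtime in the image:", torch.version.cuda)

# Driver version reported by NVML, i.e. what the host's kernel module provides
pynvml.nvmlInit()
print("Driver version via NVML:", pynvml.nvmlSystemGetDriverVersion())
pynvml.nvmlShutdown()
```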

[–] robber@lemmy.ml 2 points 2 weeks ago

Thank you for taking the time to respond.

I've used vLLM for hosting a smaller model that could fit in two of my GPUs, and it was very performant, especially for multiple requests at the same time. The major drawback for my setup was that it only supports tensor parallelism across 2, 4, 8, etc. GPUs, and data parallelism slowed inference down considerably, at least for my cards. exllamav3 is the only engine I'm aware of that supports 3-way TP.

But I'm fully with you in that vLLM seems to be the most recommended and battle-tested solution.

I might take a look at how I can safely upgrade the driver until I can afford a fourth card and switch back to vLLM.
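
For reference, 2-way tensor parallelism with vLLM's Python API looks roughly like this (a sketch from memory; the model name is just a placeholder):

```python
from vllm import LLM, SamplingParams

# Shard the model across 2 GPUs with tensor parallelism.
# The TP size has to divide the model's attention head count evenly,
# which is why odd GPU counts like 3 usually aren't accepted.
llm = LLM(
    model="Qwen/Qwen3-32B-AWQ",   # placeholder model id
    tensor_parallel_size=2,
    quantization="awq",
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```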

[–] robber@lemmy.ml 2 points 2 weeks ago (2 children)

I use the proprietary ones from Nvidia; they're at 535 on oldstable IIRC, but there are much newer ones available.

I use 3x RTX2000e Ada. It's a rather new, quite power-efficient GPU manufactured by PNY.

As the inference engine I use exllamav3 with tabbyAPI. I like it very much because it supports 3-way tensor parallelism, making it a lot faster for me than llamacpp.

[–] robber@lemmy.ml 1 points 2 weeks ago

I use the proprietary ones from Nvidia; they're at 535 on oldstable IIRC, but there are much newer ones available.

 

Hey everyone! I was just skimming through some inference benchmarks from other people and noticed the driver version is usually mentioned. It made me wonder how relevant it is. My prod server runs Debian 12, so the packaged Nvidia drivers are rather old, but I'd prefer not to mess with the drivers if it won't bring a benefit. Do any of you have experience with this, or have you done some testing?
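
(If it helps anyone comparing runs: a tiny sketch to log the exact driver per benchmark, assuming nvidia-smi is on the PATH.)

```python
import subprocess

# Record GPU name and driver version so benchmark results stay comparable.
def gpu_versions() -> str:
    return subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,driver_version",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

if __name__ == "__main__":
    print(gpu_versions())
```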

[–] robber@lemmy.ml 6 points 2 weeks ago

That brian typo really gave me a chuckle. Hope you found the movie you were looking for.

[–] robber@lemmy.ml 2 points 3 weeks ago (2 children)

Wikipedia states the UI layer is proprietary. Is that true?

[–] robber@lemmy.ml 4 points 1 month ago

The country's official app for COVID immunity certificates or whatever they were called was available on F-Droid at the time.

[–] robber@lemmy.ml 2 points 1 month ago

Too bad they've only been dropping dense models recently. Also kind of interesting, since with Mixtral back in the day they were way ahead of their time.

[–] robber@lemmy.ml 14 points 1 month ago* (last edited 1 month ago) (1 children)

A review from earlier this year didn't sound too bad.

Edit: as pointed out, the review seems to be about the previous version of the phone.

[–] robber@lemmy.ml 5 points 1 month ago (1 children)

I'd add that memory bandwidth is still a relevant factor, so the faster the RAM the faster the inference will be. I think this model would be a perfect fit for the Strix Halo or a >= 64GB Apple Silicon machine, when aiming for CPU-only inference. But mind that llamacpp does not yet support the qwen3-next architecture.
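
As a rough illustration of why bandwidth dominates (ballpark numbers, not benchmarks):

```python
# Back-of-the-envelope decode speed for memory-bound inference:
# every generated token streams all active weights from RAM once,
# so tokens/s is roughly bandwidth divided by active-weight size.
def rough_tokens_per_sec(bandwidth_gb_s: float,
                         active_params_billions: float,
                         bytes_per_param: float = 0.5) -> float:
    """bytes_per_param: ~0.5 for a 4-bit quant, 2.0 for fp16."""
    active_gb = active_params_billions * bytes_per_param
    return bandwidth_gb_s / active_gb

# Illustrative only: ~256 GB/s of memory bandwidth (roughly what's
# reported for Strix Halo) and a 4-bit MoE with ~13B active params.
print(round(rough_tokens_per_sec(256, 13), 1), "tok/s upper bound")
```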

36
submitted 1 month ago* (last edited 1 month ago) by robber@lemmy.ml to c/localllama@sh.itjust.works
 

Title says it - it's been 10 days already but I didn't catch the release. This might be huge for those of us running on multiple GPUs. At least for Gemma3, I was able to double inference speed by using vLLM with tensor parallelism vs. ollama's homegrown parallelism. Support in ExLlamaV3 could additionally allow pairing TP with lower-bit quants. Haven't tested this yet, but I'm very much looking forward to it.

 

Tencent recently released a new MoE model with ~80b parameters, 13b of which are active at inference. Seems very promising for people with access to 64 gigs of VRAM.

 

Hey fellow llama enthusiasts! Great to see that not all of Lemmy is AI-sceptical.

I'm in the process of upgrading my server with a bunch of GPUs. I'm really excited about the new Mistral / Magistral Small 3.2 models and would love to serve them for myself and a couple of friends. My research led me to vLLM, with which I was able to double inference speed compared to ollama, at least for qwen3-32b-awq.

Now sadly, the most common quantization formats (GGUF, EXL, BNB) are either only partially (GGUF) or not at all (EXL) supported in vLLM, or don't support multi-GPU inference through tensor parallelism (BNB). And especially for new models it's hard to find pre-quantized models in other, more broadly supported formats (AWQ, GPTQ).

Do any of you face a similar problem? Do you quantize models yourself? Are there any up-to-date guides you would recommend? Or did I completely overlook another, obvious solution?
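
For reference, the AutoAWQ route seems to be roughly this, if I'm reading their docs right (model id and argument names are from memory, treat them as placeholders):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"  # placeholder id
quant_path = "./mistral-small-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 4-bit weights with group size 128 is the common default for AWQ checkpoints.
model.quantize(tokenizer, quant_config={
    "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM",
})
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```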

It feels like whatever I researched yesterday is already outdated again today, since the landscape is evolving so rapidly.

Anyways, thank you for reading and sharing your thoughts or experience if you feel like it.

 

Text: Allows you to determine whether to limit CPUID maximum value. Set this to enabled for legacy operating systems such as Linux or Unix.

Found this in the BIOS of a Gigabyte Z97X-UD3H mobo.

 

Hi fellow homelabbers! I hope your day / night is going great.

Just stumbled across this self-hosted Cloudflare Tunnel alternative called Pangolin.

  • Does anyone use it for exposing their homelab? It looks awesome, but I've never heard of it before.

  • Should I be reluctant since it's developed by a US-based company? I mean security-wise. (I'll remove this question if it's too political.)

  • Does anyone know of alternative pieces or stacks of software that achieve the same without relying on Cloudflare?

Your insights are highly appreciated!

 
 

Hey fellow self-hosting lemmoids

Disclaimer: not at all a network specialist

I'm currently setting up a new home server in a network where I'm given GUA IPv6 addresses in a /64 subnet (which means, if I understand correctly, that I can set up many devices in my network that are accessible via a fixed IP to the outside world). Everything works so far, my services are reachable.
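
(Side note in case it's useful to anyone: Python's ipaddress module is a quick way to sanity-check whether an address really is globally routable; the addresses below are made-up documentation ones.)

```python
import ipaddress

# Made-up /64 prefix and host address from the documentation range.
prefix = ipaddress.ip_network("2001:db8:abcd:12::/64")
addr = ipaddress.ip_address("2001:db8:abcd:12::443")

print(addr in prefix)   # True: the host address falls inside the /64
print(addr.is_global)   # False here, since 2001:db8::/32 is the docs range;
                        # a real GUA from your ISP would print True
```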

Now my problem is that I need to use the router provided by my ISP, and it's - big surprise here - crap. The biggest concern for me is that I don't have fine-grained control over firewall rules. I can only open ports in groups (e.g. "Web", "All other ports"), and I can only do this network-wide, not for specific IPs.

I'm thinking about getting a second router with a better IPv6 firewall and using the ISP router only as a "modem". Now I'm not sure how things would play out regarding my GUA addresses. Could a potential second router also assign addresses to devices in that globally routable space directly? Or would I need some sort of NAT? I've seen some modern routers with the capability of "pass-through" IPv6 address allocation, but I'm unsure whether the router's firewall would still work in such a configuration.

In IPv4 I used to have a similar setup, where router 1 would just forward all packets for some ports to router 2, which then would decide which device should receive them.

Do any of you have experience with a similar setup? And if so, could you even recommend a router?

Many thanks!


Edit: I was able to achieve what I wanted by using OpenWrt and its IPv6 relay mode. Now my ISP router handles all IPv6 addresses directly, but I'm still able to filter the packets using the OpenWrt firewall. For IPv4 I couldn't figure out how to keep using the ISP's DHCP server at the same time, so I just went with double NAT. Everything works like a charm. Thank you guys for pointing me in the right direction.

140
submitted 1 year ago* (last edited 1 year ago) by robber@lemmy.ml to c/lemmyshitpost@lemmy.world
 

Most relevant section translated to English:

If he (Trump) wins the election on November 5, his billionaire supporter Musk will chair the new board, which is to carry out a full financial and performance audit of the entire government and make recommendations for drastic reforms.

Source: Swiss state media article
