Selfhosted

46653 readers

230 users here now

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.

Rules:

Be civil: we're here to support and learn from one another. Insults won't be tolerated. Flame wars are frowned upon.
No spam posting.
Posts have to be centered around self-hosting. There are other communities for discussing hardware or home computing. If it's not obvious why your post topic revolves around selfhosting, please include details to make it clear.
Don't duplicate the full text of your blog or github here. Just post the link for folks to click.
Submission headline should match the article title (don’t cherry-pick information from the title to fit your agenda).
No trolling.

Resources:

selfh.st Newsletter and index of selfhosted software and apps
awesome-selfhosted software
awesome-sysadmin resources
Self-Hosted Podcast from Jupiter Broadcasting

Any issues on the community? Report it using the report flag.

Questions? DM the mods!

founded 2 years ago

MODERATORS

[email protected]

Using Mac M2 Ultra 192GB to Self-Host LLMs? (sh.itjust.works)

submitted 3 months ago* (last edited 3 months ago) by [email protected] to c/[email protected]

44 comments fedilink hide all child comments

I’m doing a lot of coding and what I would ideally like to have is a long context model (128k tokens) that I can use to throw in my whole codebase.

I’ve been experimenting e.g. with Claude and what usually works well is to attach e.g. the whole architecture of a CRUD app along with the most recent docs of the framework I’m using and it’s okay for menial tasks. But I am very uncomfortable sending any kind of data to these providers.

Unfortunately I don’t have a lot of space so I can’t build a proper desktop. My options are either renting out a VPS or going for something small like a MacStudio. I know speeds aren’t great, but I was wondering if using e.g. RAG for documentation could help me get decent speeds.

I’ve read that especially on larger contexts Macs become very slow. I’m not very convinced but I could get a new one probably at 50% off as a business expense, so the Apple tax isn’t as much an issue as the concern about speed.

Any ideas? Are there other mini pcs available that could have better architecture? Tried researching but couldn’t find a lot

Edit: I found some stats on GitHub on different models: https://github.com/ggerganov/llama.cpp/issues/10444

Based on that I also conclude that you’re gonna wait forever if you work with a large codebase.

you are viewing a single comment's thread
view the rest of the comments

[–] [email protected] 1 points 3 months ago (1 children)

The context cache doesn't take up too much memory compared to the model. The main benefit of having a lot of VRAM is that you can run larger models. I think you're better off buying a 24 GB Nvidia card from a cost and performance standpoint.

[–] [email protected] 1 points 3 months ago (1 children)

Yeah I was thinking about running something like Code Qwen 72B which apparently requires 145GB Ram to run the full model. But if it’s super slow especially with large context and I can only run small models at acceptable speed anyway it may be worth going NVIDIA alone for CUDA.

[–] [email protected] 1 points 3 months ago

I found a VRAM calculator for LLMs here: https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator

Wow it seems like for 128K context size you do need a lot of VRAM (~55 GB). Qwen 72B will take up ~39 GB so you would either need 4x 24GB Nvidia cards or the Mac Pro 192 GB RAM. Probably the cheapest option would be to deploy GPU instances on a service like Runpod. I think you would have to do a lot of processing before you get to the breakeven point of your own machine.