LocalLLaMA


Community to discuss LLaMA, the large language model created by Meta AI.

This is intended to be a replacement for r/LocalLLaMA on Reddit.

1
 
 

Trying something new, going to pin this thread as a place for beginners to ask what may or may not be stupid questions, to encourage both the asking and answering.

Depending on activity level I'll either make a new one once in a while, or I'll just leave this one up forever to be a place to learn and ask.

When asking a question, try to make it clear what your current knowledge level is and where you may have gaps; that should help people provide more useful, concise answers!

2
 
 

Generate 5 thoughts, prune 3, branch, repeat. I think that’s what o1 pro and o3 do

3
4
 
 

Changed the title because there's no need for YouTube clickbait here

5
 
 

Someone asked how LLMs can be so good at math operations. My response comment kind of turned into a five-paragraph essay, as they tend to do sometimes. Thought I would offer it here and add some references. Maybe spark some discussion?

What do language models do?

LLMs are trained to recognize, process, and construct patterns of language data into high-dimensional manifolds.

Meaning its job is to structure and compartmentalize the patterns of language into a map where each word and its particular meaning live together as a point on a geometric surface. Each point is placed near closely related points in space, connected by related concepts or properties of the word.

You can explore such a map for vision models here!

Then they use that map to statistically navigate through the sea of ways words can be associated into sentences to find coherent paths.
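As a toy sketch of what "nearby on the map" means (the three-dimensional vectors below are invented purely for illustration; real embeddings have hundreds or thousands of learned dimensions), relatedness is typically measured with cosine similarity:

```python
import numpy as np

# Toy 3-dimensional "embeddings" (made up for illustration; real models
# learn these coordinates from data, in far higher dimensions).
vectors = {
    "cat":   np.array([0.9, 0.1, 0.0]),
    "dog":   np.array([0.8, 0.2, 0.1]),
    "piano": np.array([0.1, 0.9, 0.3]),
}

def cosine_similarity(a, b):
    # Near 1.0 means pointing the same way (closely related), near 0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["cat"], vectors["dog"]))    # high: neighbours on the map
print(cosine_similarity(vectors["cat"], vectors["piano"]))  # low: far apart on the map
```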

What does language really mean?

Language data isn't just words and syntax; it's the underlying abstract concepts, the context, and how humans choose to compartmentalize or represent universal ideas given our subjective reference point.

Language data extends to everything humans can construct thoughts about, including mathematics, philosophy, science, storytelling, music theory, programming, etc.

Language is universal because it's a fundamental way we construct and organize concepts. The first important cognitive milestone for babies is the association of concepts with words and constructing sentences with them.

Even the universe speaks its own language. Physical reality and logical abstractions follow the same underlying universal patterns, hidden in formalized truths and dynamical operations. Information and matter are two sides of the same coin; their structure is intrinsically connected.

Math and conceptual vectors

Math is a symbolic representation of combinatorial logic. Logic is generally a formalized language used to represent ideas related to truth, as well as how truth can be built on through axioms.

Numbers and math are cleanly structured, formalized patterns of language data. They're rigorously described and their axioms well defined. So it's relatively easy to train a model to recognize and internalize the patterns inherent to basic arithmetic and linear algebra, and how they manipulate or process the data points representing numbers.

You can imagine the LLM's data manifold having a section for math and logic processing. The concept of 'one' lives somewhere as a data point on the manifold. Moving that point along a vector direction that represents 'addition by one' lands you at the data point representing 'two'.
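As a toy illustration of that geometric picture (the vectors below are made up; real models learn directions like this implicitly, in thousands of dimensions), moving a concept vector along an "add one" offset lands near the next number's point:

```python
import numpy as np

# Made-up toy embeddings for number concepts (illustration only).
one   = np.array([1.0, 0.2, 0.5])
two   = np.array([1.0, 1.2, 0.5])
three = np.array([1.0, 2.2, 0.5])

# The "addition by one" direction is the offset between neighbouring numbers.
add_one = two - one

# Moving "two" along that same direction lands on "three" in this toy space.
candidate = two + add_one
print(np.allclose(candidate, three))  # True
```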

Not a calculator though

However, an LLM can never be a true calculator due to the statistical nature of token sampling. It always has a chance of giving the wrong answer: out of the multitude of possible tokens, it can pick any number of wrong numbers. We can drive the statistical chance of failure down, though.

It's interesting how LLMs can still give accurate answers for arithmetic despite having no built-in calculation function. Through training alone, they learn how to apply simple arithmetic.
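To make that failure chance concrete, here's a toy sketch (the next-token distribution below is invented purely for illustration): if the model puts even a little probability mass on wrong tokens, sampling will occasionally produce them, while greedy decoding only helps if the learned distribution already favours the right answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented next-token distribution for the prompt "7 + 5 =".
tokens = ["12", "13", "11", "twelve"]
probs  = np.array([0.96, 0.02, 0.01, 0.01])

samples = rng.choice(tokens, size=10_000, p=probs)
print("error rate when sampling:", np.mean(samples != "12"))  # roughly 4%

# Greedy decoding (always take the most likely token) removes sampling error,
# but only if the model already puts the most probability on the right answer.
print("greedy answer:", tokens[int(np.argmax(probs))])
```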

Hidden structures of information

There are hidden or intrinsic patterns to most structures of information. Usually you can find the fractal hyperstructures the patterns are geometrically baked into in higher dimensions once you start plotting out their phase space / holomorphic parameter maps. We can kind of visualize these fractals with vision-model activation parameter maps. Welch Labs on YouTube has a great video about it.

Modern language models have so many parameters, and so many dimensions for the manifold to expand into, that it's impossible to visualize. So they are basically mystery black boxes that somehow understand these crazy fractal structures of complex information and navigate the topological manifolds that language data creates.

Conclusion

This is my understanding of how LLMs do their thing. I hope you enjoyed reading! Secretly I just wanted to show you the cool chart :)

6
23
submitted 6 days ago* (last edited 6 days ago) by [email protected] to c/[email protected]
 
 

I've been playing around with the DeepSeek R1 distills, Qwen 14B and 32B specifically.

So far it's very cool to see models really going after this current CoT meta by mimicking internal thinking monologues. Seeing a model go "but wait..." "Hold on, let me check again..." "Aha! So..." kind of makes it feel more natural in its eventual conclusions.

I don't like how it can get caught in looping thought processes, and I'm not sure how much all the extra tokens spent really go towards a "better" answer/solution.

What really needs to be ironed out is the reading comprehension, which seems lower than average: it misses small details in tricky questions and makes assumptions about what you're trying to ask, like wanting a recipe for coconut oil cookies but only seeing "coconut" and giving a coconut cookie recipe with regular butter.

It's exciting to see models operate in kind of a new way.

7
 
 

I was experimenting with oobabooga trying to run this model, but due to its size it wasn't going to fit in RAM, so I tried to quantize it using llama.cpp, and that worked. But due to the GGUF format it was only running on the CPU. Searching for ways to quantize the model while keeping it in safetensors returned nothing; so is there any way to do that?

I'm sorry if this is a stupid question; I still know almost nothing about this field.

8
 
 

Do I need industry-grade GPUs, or can I scrape by getting decent tokens/sec with a consumer-level GPU?

9
 
 

I am excited to see how this performs when it drops around May.

10
 
 
11
13
submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/[email protected]
 
 

Seems Meta has been doing some research lately on replacing the current tokenizers with new/different representations:

12
 
 

Absolutely humongous model. Mixture of 256 experts with 8 activated each time.

Aider leaderboard: The only model above 🐋 v3 here is ~~Open~~AI o1. DeepSeek is known to make amazing models and Aider rotates their benchmark over time, so it is unlikely that this is a train-on-benchmark situation.

Some more benchmarks: on Reddit.

13
12
submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/[email protected]
 
 

I want to fine tune an LLM to "steer" it in the right direction. I have plenty of training examples in which I stop the generation early and correct the output to go in the right direction, and then resume generation.

Basically, for my dataset doing 100 "steers" on a single task is much cheaper than having to correct 100 full generations completely, and I think each of these "steer" operations has value and could be used for training.

So maybe I'm looking for some kind of localized DPO. Does anyone know if something like this exists?
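For reference, here's a hypothetical sketch of how such "steer" data could be arranged (the class and field names are mine, not from any specific library): each intervention gives a shared prefix, the model's original continuation, and your correction, which maps roughly onto the prompt/chosen/rejected triples that DPO-style trainers such as TRL's DPOTrainer typically consume. Exact column names may need adapting to whatever trainer you end up using.

```python
from dataclasses import dataclass

@dataclass
class Steer:
    prefix: str     # everything generated before you intervened
    original: str   # what the model was about to continue with (rejected)
    corrected: str  # your hand-written continuation (chosen)

def to_preference_pairs(steers):
    # Roughly the {"prompt", "chosen", "rejected"} shape that DPO-style
    # preference trainers expect; adjust field names as needed.
    return [
        {"prompt": s.prefix, "chosen": s.corrected, "rejected": s.original}
        for s in steers
    ]

pairs = to_preference_pairs([
    Steer(prefix="Summary: The patch",
          original=" is unrelated to the bug.",
          corrected=" fixes the race condition in the scheduler."),
])
print(pairs[0])
```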

14
 
 

Howdy!

(moved this comment from the noob question thread because no replies)

I'm not a total noob when it comes to general compute and AI. I've been using online models for some time, but I've never tried to run one locally.

I'm thinking about buying a new computer for gaming and for running/testing/developing LLMs (not training, only inference and in-context learning). My understanding is that ROCm is becoming decent (and I also hate Nvidia), so I'm thinking that a Radeon RX 7900 XTX might be a good start. If I buy the right motherboard, I should be able to put another XTX in there later as well, if I use watercooling.

So first, what do you think about this? Are the 24 gigs of VRAM worth the extra bucks? Or should I just go for a mid-range GPU like the Arc B580?

I'm also curious about experimenting with a no-GPU setup, i.e. CPU + lots of RAM. What kind of models do you think I'll be able to run, with decent performance, if I have something like a Ryzen 7 9800X3D and 128/256 GB of DDR5? How does it compare to the Radeon RX 7900 XTX? Is it possible to utilize both CPU and GPU when running inference with a single model, or is it either/or?

Also, is it not better if noobs post questions in the main thread? Then questions would probably reach more people. It's not like there is that much activity.

15
11
Fixed it (sh.itjust.works)
submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/[email protected]
 
 

Seriously though, does anyone know how to use openwebui with the new version?

Edit: if you go into the ollama container using `sudo docker exec -it <container name> bash`, then you can pull models with `ollama pull llama3.1:8b`, for example, and have it available.

16
 
 

People are talking about the new Llama 3.3 70b release, which has generally better performance than Llama 3.1 (approaching 3.1's 405b performance): https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3

However, something to note:

Llama 3.3 70B is provided only as an instruction-tuned model; a pretrained version is not available.

Is this the end of open-weight pretrained models from Meta, or is Llama 3.3 70b instruct just a better-instruction-tuned version of a 3.1 pretrained model?

Comparing the model cards:

3.1: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md

3.3: https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md

The same knowledge cutoff, same amount of training data, and same training time give me hope that it's just a better finetune of maybe Llama 3.1 405b.

17
15
submitted 2 months ago* (last edited 2 months ago) by [email protected] to c/[email protected]
 
 

I've been working on keeping the OSM tool up to date for OpenWebUI's rapid development pace. And now I've added better-looking citations, with fancy styling. Just a small announcement post!

Update: when this was originally posted, the tool was on 1.3. Now it's updated to 2.1.0, with a navigation feature (beta) and more fixes for robustness.

18
 
 

I've been using Qwen 2.5 Coder (bartowski/Qwen2.5.1-Coder-7B-Instruct-GGUF) for some time now, and it has shown significant improvements compared to previous open weights models.

Notably, this is the first model that can be used with Aider. Moreover, Qwen 2.5 Coder has made notable strides in editing files without requiring frequent retries to generate in the proper format.

One area where most models struggle, including this one, is long prompts: above roughly 2,000 tokens, the model seems to become unable to remember the system prompt.

19
 
 

I'm just a hobbyist in this topic, but I would like to share my experience with using local LLMs for very specific generative tasks.

Predefined formats

With prefixes, we can essentially add the start of the LLM's response without it actually generating it. For example, when we want it to respond with bullet points, we can set the prefix to be - (a dash and a space). If we want JSON, we can use ` ` `json\n{ as a prefix, to make it think that it already started a JSON markdown code block.

If you want a specific order in which the JSON is written, you can set the prefix to something like this:

` ` `json
{
    "first_key":

Translation

Let's say you want to translate a given text. Normally you would prompt a model like this

Translate this text into German:
` ` `plaintext
[The text here]
` ` `
Respond with only the translation!

Or maybe you would instruct it to respond using JSON, which may work a bit better. But what if it gets the JSON key wrong? What if it adds a little ramble in front of or after the translation? That's where prefixes come in!

You can leave the prompt exactly as is, maybe instructing it to respond in JSON:

Respond in this JSON format:
{"translation":"Your translation here"}

Now, you can pretend that the LLM already responded with part of the message, which I will call a prefix. The prefix for this specific use case could be this:

{
    "translation":"

Now the model thinks that it already wrote these tokens, and it will continue the message from right where it thinks it left off. The LLM might generate something like this:

Es ist ein wunderbarer Tag!"
}

To get the complete message, simply combine the prefix and the generated text to result in this:

{
    "translation":"Es ist ein wunderschöner Tag!"
}

To minimize inference costs, you can add "} and "\n} as stop tokens, to stop the generation right after it finishes the JSON entry.
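For illustration, here's a minimal sketch of the reassembly step (the generate() function below is a stand-in for whichever backend you actually call; it's assumed to return only the continuation, cut off at the stop strings):

```python
import json

prefix = '{\n    "translation":"'

# Stand-in for whatever backend you call; assumed to return only the
# continuation, cut off at the '"}' / '"\n}' stop strings.
def generate(prompt, prefix, stop):
    return "Es ist ein wunderbarer Tag!"  # example continuation

completion = generate("Translate the text into German.", prefix, stop=['"}', '"\n}'])

# Re-attach the prefix and the closing quote/brace the stop strings cut off,
# then parse the result as normal JSON.
full = prefix + completion + '"\n}'
print(json.loads(full)["translation"])  # Es ist ein wunderbarer Tag!
```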

Code completion and generation

What if you have an LLM which wasn't trained on code completion tokens? We can get a similar effect to the trained tokens using an instruction and a prefix!

The prompt might be something like this

` ` `python
[the code here]
` ` `
Look at the given code and continue it in a sensible and reasonable way.
For example, if I started writing an if statement,
determine if an else statement makes sense, and add that.

And the prefix would then be the start of a code block and the given code like this

` ` `python
[the code here]

This way, the LLM thinks it already rewrote everything you did, but it will now try to complete what it has written. We can then add \n` ` ` as a stop token to make it only generate code and nothing else.

This approach to code generation may be more desirable, as we can tune its completion using the prompt, like telling it to use certain code conventions.

Simply giving the model a prefix of ` ` `python\n makes it start generating code immediately, without any preamble. Again, adding the stop keyword \n` ` ` makes sure that no postamble is generated.

Using this in ollama

Using this "technique" in ollama is very simple, but you must use the /api/chat endpoint and cannot use /api/generate. Simply append the start of a message to the conversation passed to the model like this:

"conversation":[
    {"role":"user", "content":"Why is the sky blue?"},
    {"role":"assistant", "content":"The sky is blue because of"}
]

It's that simple! Now the model will complete the message with the prefix you gave it as "content".
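Here's a minimal sketch of that request in Python (assuming ollama is running locally with llama3.1:8b pulled and the requests package installed; adjust the model name to whatever you actually use):

```python
import requests

# Assumes ollama is running locally and llama3.1:8b has been pulled.
prefix = "The sky is blue because of"

payload = {
    "model": "llama3.1:8b",
    "messages": [
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": prefix},  # the prefix, as a partial assistant message
    ],
    "stream": False,
}

response = requests.post("http://localhost:11434/api/chat", json=payload)
continuation = response.json()["message"]["content"]

# Combine the prefix with the continuation to get the full answer.
print(prefix + continuation)
```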

Be aware!

There is one pitfall I have noticed with this. You have to be aware of what the prefix gets tokenized to. Because we are manually setting the start of the message ourselves, it might not be optimally tokenized, which can confuse the LLM and make it generate one too many or too few spaces. This is mostly not an issue in practice, though.

What do you think? Have you used prefixes in your generations before?

20
 
 

Looks interesting. Love seeing more coming out from this space

21
16
submitted 4 months ago* (last edited 4 months ago) by [email protected] to c/[email protected]
 
 

https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e

Qwen 2.5 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B just came out, with some variants in some sizes just for math or coding, and base models too.

All Apache licensed, all 128K context, and the 128K seems legit (unlike Mistral).

And it's pretty sick, with a tokenizer that's more efficient than Mistral's or Cohere's, and benchmark scores even better than Llama 3.1 or Mistral in similar sizes, especially on newer metrics like MMLU-Pro and GPQA.

I am running 32B locally, and it seems super smart!

As long as the benchmarks aren't straight up lies/trained, this is massive, and just made a whole bunch of models obsolete.

Get usable quants here:

GGUF: https://huggingface.co/bartowski?search_models=qwen2.5

EXL2: https://huggingface.co/models?sort=modified&search=exl2+qwen2.5

22
 
 

Mistral Small 22B just dropped today and I am blown away by how good it is. I was already impressed with Mistral NeMo 12B's abilities, so I didn't know how much better a 22B could be. It passes really tough obscure trivia that NeMo couldn't, and its reasoning abilities are even more refined.

With Mistral Small I have finally reached the plateau of what my hardware can handle for my personal use case. I need my AI to be able to generate at least around my base reading speed. The lowest I can tolerate is ~1.5 T/s; lower than that is unacceptable. I really doubted that a 22B could even run on my measly Nvidia GTX 1070 8GB VRAM card and 16GB DDR4 RAM. NeMo ran at about 5.5 T/s on this system, so how would Small do?

Mistral Small Q4_K_M runs at 2.5 T/s with 28 layers offloaded onto VRAM. As context increases, that number goes down to 1.7 T/s. It is absolutely usable for real-time conversation needs. Sure, I would like the token speed to be faster, and I have considered going with the lowest recommended Q4 to help balance the speed a little. However, I am very happy just to have it running and actually usable in real time. It's crazy to me that such a seemingly advanced model fits on my modest hardware.

I'm a little sad now, though, since this is as far as I think I can go on the AI self-hosting frontier without investing in a beefier card. Do I need a bigger, smarter model than Mistral Small 22B? No. Hell, NeMo was serving me just fine. But now I want to know just how smart the biggest models get. I've caught AI Acquisition Syndrome!

23
 
 

I just found https://www.arliai.com/, who offer LLM inference for quite cheap, with no rate limits and unlimited token generation. They have a no-logging policy and an OpenAI-compatible API.

I've been using runpod.io previously, but that's a whole different service: they sell compute, and customers have to build their own Docker images and run them in their cloud, billed by the hour/second.

Should I switch to ArliAI? Does anyone have experience with them? Or can you recommend another nice inference service? I still refuse to pay $1,000 for a GPU and then also pay for electricity, when I could use some $5/month cloud service and it'd last me 16 years before I reach the price of buying a decent GPU...

Edit: Saw their $5 tier only includes models up to 12B parameters, so I'm not sure anymore. For larger models I'd need to pay close to what other inference services cost.

Edit 2: I discarded the idea. 7B-parameter models and one 12B one are a bit small to pay for. I can do that at home thanks to llama.cpp.

24
 
 

I'm currently using SuperNormal to take meeting minutes for all of my Teams, Google Meet, and Zoom conference calls. Is there a workflow for doing this locally with Whisper and some other tools? I haven't found one yet.
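Not a full workflow, but as a starting point, here's a minimal sketch of the transcription step (assuming the openai-whisper package and a recording exported as meeting.mp3, both hypothetical choices); the transcript could then be fed to a local LLM to draft the actual minutes.

```python
import whisper  # pip install openai-whisper

# Transcribe a locally saved recording (hypothetical filename).
model = whisper.load_model("base")   # larger models are more accurate but slower
result = model.transcribe("meeting.mp3")

# Save the raw transcript; a local LLM could then summarize it into minutes.
with open("meeting_transcript.txt", "w") as f:
    f.write(result["text"])
```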

25
 
 

Only recently did I discover the text-to-music AI companies (udio.com, suno.com), and I was surprised by how good the results are. Both are under lawsuit from the RIAA.

I am curious if there are any local ones I can experiment with or train myself. I know there is facebook/musicgen-large on HuggingFace. That model is over a year old and there might be others by now. Also, based on the model card, I get the feeling that model is not going to be good at handling specific song lyrics (maybe lyrics were just absent from the training data?). I am most interested in trying my hand at writing songs and fine-tuning a model on specific types of music to get the sounds I am looking for.
