brucethemoose

joined 10 months ago
[–] [email protected] 4 points 25 minutes ago* (last edited 22 minutes ago)

And once again, Trump controlled the conversation, and all the actual evidence is out the window for most people.

It would be awesome if there was an unspoken “don’t feed the troll” understanding among journalists, influencers, mods, forum commenters, everyone. When Trump says something outrageously stupid, just… briefly acknowledge it, and then ignore it and proceed as usual. Like he doesn’t exist.

That’s a narcissist’s worst fear.

Sure, he'd bounce around in the conservative echo chamber, but at least he wouldn’t pull more people in.

[–] [email protected] 1 points 47 minutes ago* (last edited 43 minutes ago)

Yes! Try this model: https://huggingface.co/arcee-ai/Virtuoso-Small-v2

Or the 14B thinking model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B

But for speed and coherence, instead of Ollama, I'd recommend running it through Aphrodite or TabbyAPI as a backend, depending on whether you prioritize speed or long inputs. They both act as generic OpenAI endpoints.

I'll even step you through it and upload a quantization for your card, if you want, as it looks like there's not a good-sized exl2 on huggingface.
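
To sanity-check the endpoint once it's running, something like this should work (a sketch; the port and model name are assumptions, check your server's config):

```python
# Minimal sketch: query a local TabbyAPI/Aphrodite server via its
# OpenAI-compatible endpoint. Port 5000 and the model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="Virtuoso-Small-v2",  # whatever model the server has loaded
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```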

[–] [email protected] 8 points 18 hours ago* (last edited 18 hours ago) (3 children)

I mean, if you have a huge GPU, sure. Or at least 12GB of free VRAM, or a big Mac.

Local LLMs for coding are kinda niche, because most people don’t have a 3090 or 7900 lying around, and you really need 12GB+ of free VRAM for the models to start being "smart" and even worth using over free LLM APIs, much less cheap paid ones.

But if you do have the hardware and the time to set a server up, the Deepseek R1 models or the FuseAI merges are great for "slow" answers where the model thinks things out before replying. Qwen 2.5 32B coder is great for quick answers on 24GB VRAM. Arcee 14B is great for 12GB VRAM.

Sometimes running a small model on a "fast" but less VRAM-efficient backend is better for stuff like Cursor code completion.

[–] [email protected] -4 points 22 hours ago (17 children)

https://en.wikipedia.org/wiki/Gaza_war_protest_vote_movements#Withdrawal_of_Joe_Biden

On the other hand, Abandon Harris endorsed Green Party candidate Jill Stein, who said she would end all military support to Israel if elected, and the group said that it was "confronting two destructive forces: one currently overseeing a genocide and another equally committed to continuing it"

Following the loss of Harris, many in the movement felt vindication. Significant portions of the electorate in Dearborn, Michigan, an Arab American majority city, did not vote for Harris.[77] Muslims who voted for Trump, and were thus pivotal in helping him win the three key states of the Rust Belt (Michigan, Pennsylvania, and Wisconsin being Harris's clearer path for a narrow win in the Electoral College), were subsequently upset that Trump nominated pro-Israel cabinet picks...

[–] [email protected] 5 points 23 hours ago* (last edited 23 hours ago)

Oof... Thanks. I appreciate the history lesson, as they did not teach that little detail in my schools.

[–] [email protected] 55 points 1 day ago* (last edited 1 day ago) (11 children)

This really is reminiscent of early Nazi Germany, with an obsession over trans people (like Jews then), the idea that they're the root of so much evil, and the constant implication that things would be better if they just went away...

[–] [email protected] 9 points 1 day ago

That’s the whole point. Deflect real controversy with stupid sound bites.

[–] [email protected] 45 points 1 day ago* (last edited 1 day ago) (5 children)

My friend, the Chinese were releasing amazing models all last year; they just didn’t make headlines.

Tencent's Hunyuan Video is incredible. Alibaba's Qwen is still a go-to local model. I've used InternLM pretty regularly… Heck, Yi 32B was awesome in 2023, as the first decent long-context local model.

…The Janus models are actually kind of meh, unless you're captioning images, and FLUX/Hunyuan Video is still king in the diffusion world.

40
submitted 2 weeks ago* (last edited 2 weeks ago) by [email protected] to c/[email protected]
 

Here's the Meta formula:

  • Put a Trump friend on your board (Ultimate Fighting Championship CEO Dana White).
  • Promote a prominent Republican as your chief global affairs officer (Joel Kaplan, succeeding liberal-friendly Nick Clegg, president of global affairs).
  • Align your philosophy with Trump's on a big-ticket public issue (free speech over fact-checking).
  • Announce your philosophical change on Fox News, hoping Trump is watching. In this case, he was. "Meta, Facebook, I think they've come a long way," Trump said at a Mar-a-Lago news conference, adding of Kaplan's appearance on the "Fox and Friends" curvy couch: "The man was very impressive."
  • Take a big public stand on a favorite issue for Trump and MAGA (rolling back DEI programs).
  • Amplify that stand in an interview with Fox News Digital. (Kaplan again!)
  • Go on Joe Rogan's podcast and blast President Biden for censorship.
16
submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/[email protected]
 

Taboola's data, shared exclusively with Axios, shows Musk has outpaced his closest peers — Jeff Bezos and Mark Zuckerberg — for years, but the gap widened dramatically in 2024.

The spam is already exponential. :(

372
submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/[email protected]
 

Reality check: Trump pledged to end the program in 2016.

Called it. When push comes to shove, Trump is always going to side with the ultra-rich.

 

Trump, who has remained silent thus far on the schism, faces a quickly deepening conflict between his richest and most powerful advisors on one hand, and the people who swept him to office on the other.

All this is stupid. But I know one thing:

Trump is a billionaire.

And I predict his followers are going to learn who he’ll side with when push comes to shove.

Also, Bannon’s take is interesting:

Bannon tells Axios he helped kick off the debate with a now-viral Gettr post earlier this month calling out a lack of support for the Black and Hispanic communities in Big Tech.

 

I think the title explains it all… Even right-wing influencers can have their faces eaten. And Twitter views are literally their livelihood.

Trump's conspiracy-minded ally Laura Loomer, New York Young Republican Club president Gavin Wax and InfoWars host Owen Shroyer all said their verification badges disappeared after they criticized Musk's support for H1B visas, railed against Indian culture and attacked Ramaswamy, Musk's DOGE co-chair.

 

I have no idea if anyone on Lemmy is into Avatar lore/fanfiction, but in the spirit of posting content here instead of Reddit... here goes.


A new Avatar series featuring 'twin' Avatars has been leaked, in case you missed it:

https://knightedgemedia.com/2024/12/avatar-seven-havens-twin-earth-avatar-series-will-initially-be-26-episodes-long/

https://lemmy.world/post/23427458

In a nutshell, it's allegedly set in a cataclysmic world overrun by spirit vines, and two twins are the 'Avatars,' with diametric personalities. Not much is known beyond that, but I've been brainstorming some post-LoK ideas forever.

And now I kinda feel like writing them out. Here's my thought dump for a story:


In the (largely undepicted) three years of canon Korra Alone, Korra (traveling anonymously) makes a stop on Kyoshi Island hoping to reconnect with her spirit. Instead, she meets a humble blacksmith with a bird spirit on his shoulder, and connects with him. They both wrestle with the demons haunting them, and they discover secrets on the island from Kyoshi's era.

Korra dies in 190 AG (at 37), already weakened from her metal poisoning, saving the world from a cataclysm that leaves much of the world overgrown.

Initially, the story jumps between this period in Korra Alone (174 AG) and 206 AG, where Asami Sato struggles to steer Future Industries in a world dominated by megacorps in the safe 'havens' dotted through the world. While Kyoshi Island has barely changed at all, the 'future' thread has a more cyberpunk feel. Chi-based cybernetics are commonplace, but the more augmented someone is, the more their bending is compromised, and bender vs. nonbender tensions flare up once more. The world outside the safe havens is a dangerous wasteland. Tech derived from studying spirits has led to the proliferation of holograms, BCIs, and even primitive assistants and virtual environments, and advances in power storage/generation already seen in LoK mean everything is largely electric. Yet the world is still "analog," with tube radios and TVs, no digital electronics, and 'dumb' virtual assistants that are error-prone and incapable of math, giving it a retro feel. There are no guns, of course, but personal weapons like arc casters, flamethrowers, cryo blasters and such all mimic bending.

The Sei'naka clan has risen to power in the Fire Nation, taking advantage of the aftermath of the 100 Years War, the Red Lotus Insurrection, Future Industries' relative benevolence, and even the recent calamity. Now a ruthless corporation bigger than Future Industries, they dominate business and politics wherever they expand.

The White Lotus's search for the Avatar has failed. Asami rather infamously misidentified the Avatar... until one day, she finds them.


Once this background is established, the story jumps back to our inseparable twin Avatars, Priya and Nikki, born deep in the Foggy Swamp. Thanks to their predecessor, they live a harmonious, largely isolated life as members of the Foggy Swamp Tribe on the back of a water Lion Turtle. To Korra's utter shock, they manage to manifest her at nine, with Past Life Korra appearing as a nine-year-old. Dumbfounded, and not even sure who the 'real' Avatar is, the girls assume she is just another spirit in the swamp. So Korra makes the decision to go along with this, and let them have the childhood she never had under the White Lotus, as she figures out just what's going on with the twin Avatars.

Ultimately, the real world comes crashing into the new Avatars' isolated life, and they react poorly to Korra telling them the truth at 16. Through some more disasters and tragedies, they end up on the streets of Republic City, separated for a time, before meeting friends. The rest of the story revolves around corporate and personal greed (very much like the real world), conflict (and synergy) between the environment/spirits and technology, rivalries, family, friends across lifetimes, the nature of consciousness, reincarnation and the soul, and a conspiracy going all the way back to Kuruk threading through everything.


Some character profiles I'm working on:

• Priya: Independent, kind, and resourceful. Priya is a reluctant hero who avoids altercations or fighting, but still believes in helping people using her creativity and wits. She's a talented musician and loves to make up songs with her taanbur (a guitar-like instrument). Priya lost most of her leg in an accident with a fallen tree that killed her parents in the Foggy Swamp, but she bends roots and muddy water as if they were her own limb. It's eventually revealed that she carries Raava. Once she finds out, Priya in particular is reluctant to accept her role as the Avatar, until a tragedy forces her hand.

• Nikki is more snarky. She loves her powers and the attention they afford her, but her biggest fear is being forgotten or not accepted by others. Nikki is awkward, but puts on a face of superior confidence to hide the fact that she feels like a fish out of water. Despite her cocky attitude, Nikki is also an innovator, and her wild side is useful at times. Like her sister, she's highly attuned to the swamp, able to connect to and even manifest the collective memories of lost loved ones by touching spirit vines in the swamp. Both are apparently waterbenders with a proclivity for mud. It's eventually revealed that she carries Vaatu inside her. Nikki is missing part of her arm, but bends a replacement, much like Ming-Hua.

• Korra: Largely as she is in LoK. Hot-blooded, quick to fight, passionate, empathetic, and not very spiritual. In a reversal of roles, Priya and Nikki keep her manifested constantly, and Korra becomes their best friend, learning about their life in the Foggy Swamp. Later in the story, Korra's almost like a Johnny Silverhand to the new Avatars: manifested at will, a voice constantly in their heads offering commentary, occasionally butting heads with them in a complex but close and encouraging relationship.

• Asami Sato: Largely as she was in LoK: driven, collected, strong, smart, loyal. Now she's fifty, with a cybernetic leg from an accident. Asami is still altruistic, and has retained control of Future Industries through the years, but she struggles with pushback from a corporate world driven by expansion and greed, and ultimately has to grapple with some of what her own company has done under her nose.

• Mako: Largely as he was: brooding, cool, a noir-like detective. He's recently retired as police chief, and has been secretly piecing together the conspiracy running through the plot.

• Ren: The blacksmith Korra meets on Kyoshi Island. Soft-spoken, painfully shy, airheaded and ADD, stocky and green-eyed, Ren nonetheless has a dry wit. He's self-deprecating to a fault, but has a soft heart. To Korra's utter shock, Ren is a metalbender and a lavabender, using the combination to effortlessly sculpt armor and weapons, and tinker with delicate electronics. He's terrified of lightning, with a massive scar covering his back that flares up in storms or when he's anxious. Almost as broken as Korra is at the start, Ren reveals that his father's ancestors were lavabending miners and blacksmiths in the Hundred Years War. His past is initially shrouded in mystery, but it's slowly revealed that his mother was the scientist who originally conceived of spirit vine technology, and that Varrick only replicated some of her work. Ren's mom has an 'Oppenheimer moment' and defects from Kuvira's proto Earth Empire. Ren ends up as the only survivor, deeply scarred by a spirit vine "detonation" similar to the one in the LoK finale, which fused his soul to his body, and he's hiding from warlords hunting him for what he knows. Through the story, he grows particularly close to Asami and Korra, and grapples with some of the technology he pioneers.

• Kaida: CTO of Future Industries, Kaida is the biological daughter of Korra and Ren, who both died when she was 11. Utterly tenacious, hot-blooded, fearless, and a fierce fighter like her mom, Kaida barges into the story literally melting the metal floor in front of reporters harassing her 'mom,' Asami. Fiercely intelligent and impulsive, but with some of her dad's airheadedness, introversion, and love of tinkering with technology, Kaida is almost constantly clad in meteor-metal alloy plate armor she wears as a second skin. She favors a jian, like the one Korra learned to use on Kyoshi Island. Kaida is a talented engineer, but struggles with the tremendous legacy she's been thrust into.

• Yuri Sei'naka: One of many vying for supremacy in the Sei'naka family, Yuri resembles Azula: a charismatic leader with a ruthless streak, an obsession with perfection, and fantastic talent as a firebender, with the same sharp yellow eyes and features. Like her twin brother, Yoru, Yuri chose the 'hard' path of bending over the advanced cybernetics the wealthy have access to. Nevertheless, she has a good moral compass, and is unconditionally loyal to her brother. The siblings have an intense rivalry with Kaida, just as their company rivals Future Industries.

• Yoru Sei'naka: A firebending and lightning-bending prodigy and a cunning strategist, Yoru is mute, having lost his ability to speak in a sparring accident as a kid. Yoru and Yuri are practically inseparable, with Yuri serving as his voice. Tasked with tracking down the unknown Avatar by the matriarch of the clan, and always beholden to his intense sense of honor, Yoru suffers through a tragic 'Zuko' arc over the course of the story.

spoiler

• Father Glowworm: The ancient spirit survived the death of Yun, and is an ever-present invisible hand through the story, albeit with a newfound distaste for humans. The swamp, taboo spirit vine technology, and just how he tunnels between worlds will all tie into crises Priya and Nikki must navigate.

• I'm still working on other antagonists, but there will be a warlord who tries to capture Ren on Kyoshi Island, a ruthless corporate matriarch of the Sei'naka dynasty (Natsu?), a charismatic rebel somewhere between Amon and Zaheer, and more. I'm also thinking about a blind airbending thief who rejected his rich family, a loud, warm Sun Warrior whose people have resettled in Republic City, and an introverted netrunner-like hacker as companions for the Avatars.


  • Other thoughts:

• I don't like some 'leaked' aspects of the upcoming show, like the twin Avatars being nine and the White Lotus being so involved and 'problematic.' I'd much rather have the twin Avatars be lost, ignorant of their own nature in the Foggy Swamp, because they appear to be waterbenders with a proclivity for mud.

    • On that note, stealing the idea from here, maybe Priya can only bend air and water, while Nikki can only bend earth and fire, reflecting the split of their spirits and personalities.

    • Remnants of the Northern and Southern Water Tribes have drifted to political extremes.

    • The 'wasteland' is populated by spirits, and human opportunists looking to brave it.

    • The Avatars' monkey cat companion is a spirit they befriended in the forest.

    • Spirit Vine technology is taboo and effectively 'lost' after the calamity.

• The Avatars' Tribe lives atop a Lion Turtle the swamp hid for millennia.

• The 'nature' of the Foggy Swamp is expanded. For instance, in one chapter, Priya and Nikki manifest and talk to representations of their parents, built from the collective memory of everyone who ever knew them, all connected through vines. It brings up existential questions in Korra's head, and parallels with some of the spirit-based technology the rest of the world has developed.


    ...So, those are my scattered thoughts so far.

    Does that sound like a sane, plausible base for a post-LoK story? Do you think any of it would fit into canon? I particularly like the idea of a 'metal lavabending' canon companion, and maybe some more futuristic elements in the havens that do exist.

    37
    submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/[email protected]
     

    Most details are in the article ^

    Reddit source of images: https://www.reddit.com/r/TheLastAirbender/comments/1hi2tte/more_confirmation_on_the_leaks_this_was_using_the/

    I find this interesting! Post apocalyptic is a good way to "reset" the world, and the idea of twin Avatars has been batted around the fandom for some time.

     

    Maybe even 32GB if they use newer ICs.

    More explanation (and my source of the tip): https://www.pcgamer.com/hardware/graphics-cards/shipping-document-suggests-that-a-24-gb-version-of-intels-arc-b580-graphics-card-could-be-heading-to-market-though-not-for-gaming/

    Would be awesome if true, and if it's affordable. Screw Nvidia (and, inexplicably, AMD) for their VRAM gouging.

    326
    submitted 3 months ago* (last edited 3 months ago) by [email protected] to c/[email protected]
     

    I see a lot of talk of Ollama here, which I personally don't like because:

    • The quantizations they use tend to be suboptimal

    • It abstracts away llama.cpp in a way that, frankly, leaves a lot of performance and quality on the table.

    • It abstracts away things that you should really know for hosting LLMs.

    • I don't like some things about the devs. I won't rant, but I especially don't like the hint they're cooking up something commercial.

    So, here's a quick guide to get away from Ollama.

• First step is to pick your OS. Windows is fine, but if you're setting up something new, Linux is best. I favor CachyOS in particular, for its great Python performance. If you use Windows, be sure to enable hardware-accelerated scheduling and disable shared memory.

    • Ensure the latest version of CUDA (or ROCm, if using AMD) is installed. Linux is great for this, as many distros package them for you.

• Install Python 3.11.x, 3.12.x, or at least whatever your distro supports, and git. If on Linux, also install your distro's "build tools" package.

Now for actually installing the runtime. There are a great number of inference engines supporting different quantizations; forgive the Reddit link, but see: https://old.reddit.com/r/LocalLLaMA/comments/1fg3jgr/a_large_table_of_inference_engines_and_supported/

    As far as I am concerned, 3 matter to "home" hosters on consumer GPUs:

• Exllama (and by extension TabbyAPI): a very fast, very memory-efficient "GPU only" runtime that supports AMD via ROCm and Nvidia via CUDA: https://github.com/theroyallab/tabbyAPI

• Aphrodite Engine. While not strictly as VRAM-efficient, it's much faster with parallel API calls, reasonably efficient at very short context, and supports just about every quantization under the sun, plus more exotic models than exllama. AMD/Nvidia only: https://github.com/PygmalionAI/Aphrodite-engine

• This fork of kobold.cpp, which supports more fine-grained KV cache quantization (we will get to that). It supports CPU offloading, and I think Apple Metal: https://github.com/Nexesenex/croco.cpp

Now, there are also reasons I don't like llama.cpp, but one of the big ones is that sometimes its model implementations have... quality-degrading issues, or odd bugs. Hence I would generally recommend TabbyAPI if you have enough VRAM to avoid offloading to CPU, and can figure out how to set it up. So:

Install TabbyAPI itself per its README/wiki. This can go wrong; if anyone gets stuck, I can help with that.

    • Next, figure out how much VRAM you have.
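
On Nvidia you can eyeball this with nvidia-smi, or query it from Python (a sketch, using the nvidia-ml-py package):

```python
# Print free/total VRAM for GPU 0 (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"free: {mem.free / 1e9:.1f} GB / total: {mem.total / 1e9:.1f} GB")
pynvml.nvmlShutdown()
```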

• Figure out how much "context" you want, aka how much text the LLM can ingest. If a model has a context length of, say, "8K," that means it can support 8K tokens as input, or less than 8K words. Not all tokenizers are the same: some, like Qwen 2.5's, can fit nearly a word per token, while others are more in the ballpark of half a word per token or less.
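
If you'd rather measure than guess, you can count tokens with a model's actual tokenizer via the transformers library (a sketch; any Hugging Face repo ID works):

```python
# Count how many tokens a given model's tokenizer needs for some text.
# pip install transformers
from transformers import AutoTokenizer

text = "Paste a representative chunk of your input here. " * 50
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
print(f"{len(tok.encode(text))} tokens for {len(text.split())} words")
```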

    • Keep in mind that the actual context length of many models is an outright lie, see: https://github.com/hsiehjackson/RULER

• Exllama has a feature called "KV cache quantization" that can dramatically shrink the VRAM the "context" of an LLM takes up. Unlike llama.cpp's, its Q4 cache is basically lossless, and on a model like Command-R, an 80K+ context can take up less than 4GB! It's essential to enable Q4 or Q6 cache to squeeze as much LLM as you can into your GPU.
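
For intuition, here's the back-of-the-envelope cache math (a sketch; the layer/head numbers below are made up for illustration, pull real ones from a model's config.json):

```python
# KV cache ~= 2 (K and V) * layers * kv_heads * head_dim * context * bytes/elem.
# These dims are hypothetical, roughly GQA-model-shaped -- not any specific model.
def kv_cache_gb(layers, kv_heads, head_dim, ctx_len, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

ctx = 80_000
print(f"FP16: {kv_cache_gb(40, 8, 128, ctx, 2.0):.1f} GB")  # ~13.1 GB
print(f"Q4:   {kv_cache_gb(40, 8, 128, ctx, 0.5):.1f} GB")  # ~3.3 GB
```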

    • With that in mind, you can search huggingface for your desired model. Since we are using tabbyAPI, we want to search for "exl2" quantizations: https://huggingface.co/models?sort=modified&search=exl2

• There are all sorts of finetunes... and a lot of straight-up garbage. But I will post some general recommendations based on total VRAM:

    • 4GB: A very small quantization of Qwen 2.5 7B. Or maybe Llama 3B.

    • 6GB: IMO llama 3.1 8B is best here. There are many finetunes of this depending on what you want (horny chat, tool usage, math, whatever). For coding, I would recommend Qwen 7B coder instead: https://huggingface.co/models?sort=trending&search=qwen+7b+exl2

• 8GB-12GB: Qwen 2.5 14B is king! Unlike its 7B counterpart, I find the 14B version of the model incredible for its size, and it will squeeze into this VRAM pool (albeit with very short context/tight quantization for the 8GB cards). I would recommend trying Arcee's new distillation in particular: https://huggingface.co/bartowski/SuperNova-Medius-exl2

• 16GB: Mistral 22B, Mistral Coder 22B, and very tight quantizations of Qwen 2.5 32B are possible. Honorable mention goes to InternLM 2.5 20B, which is alright even at 128K context.

• 20GB-24GB: Command-R 2024 35B is excellent for "in context" work, like asking questions about long documents, continuing long stories, anything involving working "with" the text you feed to an LLM rather than pulling from its internal knowledge pool. It's also quite good at longer contexts, out to 64K-80K more or less, all of which fits in 24GB. Otherwise, stick to Qwen 2.5 32B, which still has a very respectable 32K native context, and a rather mediocre 64K "extended" context via YaRN: https://huggingface.co/DrNicefellow/Qwen2.5-32B-Instruct-4.25bpw-exl2

• 32GB: same as 24GB, just with a higher-bpw quantization. But this is also the threshold where lower-bpw quantizations of Qwen 2.5 72B (at short context) start to make sense.

    • 48GB: Llama 3.1 70B (for longer context) or Qwen 2.5 72B (for 32K context or less)

Again, browse huggingface and pick an exl2 quantization that will cleanly fill your VRAM pool plus the amount of context you want to specify in TabbyAPI. Many quantizers, such as bartowski, will list how much space they take up, but you can also just look at the available file size.
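
A rough way to estimate whether a given exl2 quant fits (a sketch; real files carry some extra overhead, so leave a GB or two of headroom):

```python
# Weights take roughly params * bits-per-weight / 8 bytes.
params = 32e9  # e.g. a 32B model
bpw = 4.25     # the exl2 quantization's bits per weight
weights_gb = params * bpw / 8 / 1e9
print(f"~{weights_gb:.1f} GB of weights")  # ~17 GB, leaving room for the
                                           # context cache on a 24GB card
```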

    • Now... you have to download the model. Bartowski has instructions here, but I prefer to use this nifty standalone tool instead: https://github.com/bodaay/HuggingFaceModelDownloader
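
If you'd rather stay in Python, the official huggingface_hub library does the same job (a sketch, using the Qwen quant linked above as an example):

```python
# Download a full model repo into a local folder (pip install huggingface_hub).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="DrNicefellow/Qwen2.5-32B-Instruct-4.25bpw-exl2",
    local_dir="models/Qwen2.5-32B-Instruct-4.25bpw-exl2",
)
```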

    • Put it in your TabbyAPI models folder, and follow the documentation on the wiki.

• There are a lot of options. Some to keep in mind are chunk_size (higher than 2048 will process long contexts faster but take up lots of VRAM; lower will save a little VRAM), cache_mode (use Q4 for long context, Q6/Q8 for short context if you have room), max_seq_len (this is your context length), tensor_parallel (for faster inference with 2 identical GPUs), and max_batch_size (parallel processing if you have multiple users hitting the tabbyAPI server, at the cost of more VRAM).
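
These knobs live in TabbyAPI's YAML config file; annotated here as a Python dict purely for illustration (the values are examples for a single 24GB card, not recommendations):

```python
# Mirror of the config options discussed above -- TabbyAPI itself reads
# these from its config.yml, not from Python.
model_settings = {
    "max_seq_len": 32768,      # your context length; the biggest VRAM lever
    "cache_mode": "Q4",        # Q4 for long context, Q6/Q8 for short
    "chunk_size": 2048,        # higher = faster long-prompt ingestion, more VRAM
    "tensor_parallel": False,  # True only with 2 identical GPUs
    "max_batch_size": 1,       # raise only if multiple users hit the server
}
```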

    • Now... pick your frontend. The tabbyAPI wiki has a good compliation of community projects, but Open Web UI is very popular right now: https://github.com/open-webui/open-webui I personally use exui: https://github.com/turboderp/exui

• And be careful with your sampling settings when using LLMs. Different models behave differently, but one of the most common mistakes people make is using "old" sampling parameters for new models. In general, keep temperature very low (<0.1, or even zero) and rep penalty low (1.01?) unless you need long, creative responses. If available in your UI, enable DRY sampling to tamp down repetition without "dumbing down" the model with too much temperature or repetition penalty. Always use a MinP of 0.05 or higher and disable other samplers. This is especially important for Chinese models like Qwen, as MinP cuts out "wrong language" answers from the response.
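
As a concrete example, here's roughly what those settings look like in an API call (a sketch: min_p and repetition_penalty are backend extensions, not standard OpenAI fields, and the port/model name are assumptions):

```python
# Conservative "new model" sampling, sent to a local OpenAI-compatible server.
import requests

payload = {
    "model": "your-loaded-model",  # placeholder
    "messages": [{"role": "user", "content": "Summarize this document..."}],
    "temperature": 0.1,          # very low for modern models
    "min_p": 0.05,               # trims garbage/wrong-language tokens
    "repetition_penalty": 1.01,  # barely on
}
r = requests.post("http://localhost:5000/v1/chat/completions", json=payload)
print(r.json()["choices"][0]["message"]["content"])
```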

• Now, once this is all set up and running, I'd recommend throttling your GPU, as it simply doesn't need its full core speed to maximize inference speed while generating. For my 3090, I use something like sudo nvidia-smi -pl 290, which throttles it down from 420W to 290W.
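
The same power cap can be set from Python via NVML if you want it in a startup script (a sketch; needs root and the nvidia-ml-py package; NVML takes milliwatts):

```python
# Equivalent of `sudo nvidia-smi -pl 290` for GPU 0.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
pynvml.nvmlDeviceSetPowerManagementLimit(handle, 290_000)  # 290 W in mW
pynvml.nvmlShutdown()
```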

    Sorry for the wall of text! I can keep going, discussing kobold.cpp/llama.cpp, Aphrodite, exotic quantization and other niches like that if anyone is interested.

     


    16
    submitted 4 months ago* (last edited 4 months ago) by [email protected] to c/[email protected]
     

    https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e

    Qwen 2.5 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B just came out, with some variants in some sizes just for math or coding, and base models too.

    All Apache licensed, all 128K context, and the 128K seems legit (unlike Mistral).

And it's pretty sick, with a tokenizer that's more efficient than Mistral's or Cohere's, and benchmark scores even better than llama 3.1's or mistral's in similar sizes, especially on newer metrics like MMLU-Pro and GPQA.

I am running the 32B locally, and it seems super smart!

    As long as the benchmarks aren't straight up lies/trained, this is massive, and just made a whole bunch of models obsolete.

    Get usable quants here:

    GGUF: https://huggingface.co/bartowski?search_models=qwen2.5

    EXL2: https://huggingface.co/models?sort=modified&search=exl2+qwen2.5
