• Yikes what's the bar for dead simple these days? Even my totally non-technical gamer friends are messing around with ollama because I just have to give them one command to get any of the popular LLMs up and running.

    Now of course "non technical" here is still a pc gamer that's had to fix drivers once or twice and messaged me to ask "hey how do i into LLM, Mr. AI knower", but I don't think twice these days about showing any pc owner how to use ollama because I know I probably won't be on the hook for much technical support. My sysadmin friends are easily writing clever scripts against ollama's JSON output to do log analysis and other stuff.

    • By "too hard" I do not mean getting started with them to run inference on a prompt. Ollama especially makes that quite easy. But as an application developer, I feel these platforms are too hard to build around. The main issues are finding a task-specific model that is small enough, and how long it takes for the end user to download these models.
      • I guess it depends on expectations. If your expectation is a CRUD app that opens in 5 seconds, then sure, it's definitely tedious. People do install things though; the companion app for DJI action cameras is 700mb (which is an abomination, but still). Modern games are > 100gb on the high side, downloading 8-16gb of tensors one time is nbd. You mentioned that there are 663 different models of dsr1-7b on huggingface, sure, but if you want that model on ollama it's just `ollama run deepseek-r1`

        As a developer the amount of effort I'm likely to spend on the infra side of getting the model onto the user's computer and getting it running is now FAR FAR below the amount of time I'll spend developing the app itself or getting together a dataset to tune the model I want, etc. Inference is solved enough. "Getting the correct small enough model" is something I would spend a day or two thinking about/testing when building something regardless. It's not hard to check how much VRAM someone has and pick the right model; the decision tree for that will have like 4 branches. It's just so little effort compared to everything else you're going to have to do to deliver something of value to someone. Especially in the set of users that have a good reason to run locally.
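
        FWIW, a minimal sketch of what that decision tree might look like (the model tags and VRAM cutoffs here are just illustrative, not recommendations):

        ```python
        def pick_model(vram_gb: float) -> str:
            """Map available VRAM to a model choice -- hypothetical tags and cutoffs."""
            if vram_gb >= 24:
                return "qwen2.5:32b"   # room for a ~20GB quant plus context
            if vram_gb >= 12:
                return "qwen2.5:14b"
            if vram_gb >= 8:
                return "llama3.1:8b"
            return "llama3.2:3b"       # low-VRAM / CPU fallback
        ```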

  • TFA seems to miss a lot of things.

    Macs' unified memory makes them (price-)compelling over x86 with GPU(s) for large models, say anything over 24-32 GB. But a 32GB Mac doesn't take advantage of that architecture.

    (IIRC by default you can use 66% of RAM in < 32GB Metal boxes, and something higher in > 32GB -- though you can override that value via sysctl.)

    Macs can also run mlx in addition to gguf, which on these smaller models would be faster. No mention of mlx, or indeed gguf.

    The only model tested seems to be a distil of Deepseek R1 with Qwen - which I'd have classified as 'good, but not great'.

    Author bemoans that all quants of it are > 5GB, which isn't true. Though with 20GB of effective VRAM to play with here, you wouldn't want to be using the Q4 (at 4.6GB).

    Author seems to conflate the one-off download cost (time) from hf with on-going performance cost of using the tool.

    No actual client-side tooling in play, either, by the looks of it, which seems odd given the claim that local inference is 'not ready as a developer platform'.

    The usual starting point for most devs using local LLMs is vscode + continue.dev , where the 'developer experience' is a bit more interesting than just copy-pasting to a terminal.

    Criterion (singular) for LLM model expertise appears to be 'text to SQL', which is fair enough if you were writing about applicability of "Local LLM Inference For Text to SQL". I'd have expected the more coding-specific (qwen2.5 coder 14B, codestral, gemma?) models would be more interesting than just one 6GB distil of R1 & Qwen.

    Huggingface has some functional search, though https://llm.extractum.io/list/ is a bit better in my experience, as you can tune & sort by size, vintage, licence, max context length, popularity, etc.

    I concur that freely available can-run-in-16GB-of-RAM models are not as good as Claude, but disagree that the user experience is as bad as painted here.

  • Download the model in the background. Serve the client with an LLM vendor API just for the first requests, or even with that same local LLM installed on your own servers (likely cheaper). In the long run the inference cost is near zero, which allows you to use LLMs in otherwise impossible business models (like freemium).
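
    A rough sketch of that routing (everything here is a placeholder, not a real SDK; the point is just "cloud until the local weights land, then switch"):

    ```python
    import threading
    import time

    local_ready = threading.Event()

    def download_local_model() -> None:
        # Stand-in for pulling weights in the background, e.g. `ollama pull <model>`.
        time.sleep(60)                  # pretend the download takes a while
        local_ready.set()

    def local_inference(prompt: str) -> str:
        return f"[local answer to: {prompt}]"    # placeholder

    def vendor_api_inference(prompt: str) -> str:
        return f"[cloud answer to: {prompt}]"    # placeholder, paid per token

    def complete(prompt: str) -> str:
        # Serve from the vendor API only until the local model is ready.
        if local_ready.is_set():
            return local_inference(prompt)       # near-zero marginal cost
        return vendor_api_inference(prompt)      # only for the first requests

    threading.Thread(target=download_local_model, daemon=True).start()
    ```
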
    • Personally, I only use locally run models when I absolutely can’t have the prompt/context uploaded to a cloud. For anything else, I just use one of the commercial cloud hosted models. The ones I’m using are way faster and better in _every_ way except privacy. E.g. if you are ok to spend more, you can get blazing fast DeepSeek v3 or R1 via OpenRouter. Or, rather cheap Claude Sonnet via Copilot (pre-release also has Gemini 2.5 Pro btw).

      I’ve gotten carried away - I meant to express that using cloud as a fallback for local models is something I absolutely don’t want or need, because privacy is the whole and only point to local models.

    • Exactly. Why does this not exist yet?
      • it's an if statement on whether the model has downloaded or not
        • A better solution would train/finetune the smaller model from the responses of the larger model, and only push the inference to the edge if the smaller model is performant and the hardware specs can handle the workload?
          • yeah, that'd be nice, some kind of self-bootstrapping system where you start with a strong cloud model, then fine-tune a smaller local one over time until it’s good enough to take over. tricky part is managing quality drift and deciding when it's 'good enough' without tanking UX. edge hardware's catching up though, so feels more feasible by the day.
  • There are two general categories of local inference:

    - You're running a personal hosted instance. Good for experimentation and personal use, though there's a tradeoff versus renting a cloud server.

    - You want to run LLM inference on client machines (i.e., you aren't directly supervising it while it is running).

    I'd say that the article is mostly talking about the second one. Doing the first one will get you familiar enough with the ecosystem to handle some of the issues he ran into when attempting the second (e.g., exactly which model to use). But the second has a bunch of unique constraints--you want things to just work for your users, after all.

    I've done in-browser neural network stuff in the past (back when using TensorFlow.js was a reasonable default choice) and based on the way LLM trends are going I'd guess that edge device LLM will be relatively reasonable soon; I'm not quite sure that I'd deploy it in production this month but ask me again in a few.

    Relatively tightly constrained applications are going to benefit more than general-purpose chatbots; pick a small model that's relatively good at your task and train it on enough of your data and you can get a 1B or 3B model that has acceptable performance, let alone the 7B ones being discussed here. It absolutely won't replace ChatGPT (though we're getting closer to replacing ChatGPT 3.5 with small models). But if you've got a specific use case that will hold still enough to deploy a model it can definitely give you the edge versus relying on the APIs.

    I expect games to be one of the first to try this: per-player-action API costs murder per-user revenue, most of the gaming devices have some form of GPU already, and most games are shipped as apps so bundling a few more GB in there is, if not reasonable, at least not unprecedented.

    • Very interesting. I had not thought about gaming at all but that makes a lot of sense.

      I also agree the goal should not be to replace ChatGPT. I think ChatGPT is way overkill for a lot of the workloads it is handling. A good solution should probably use the cloud LLM outputs to train a smaller model to deploy in the background.

    • Curious what are some examples of "per-player-action API costs" for games?
      • Inference using an API costs money. Not a lot of money per million tokens, but it adds up if you have a lot of tokens... and some of the obvious game uses really chew through the tokens, like chatting with a character, or having an NPC make decisions via a reasoning model. That can easily make the tokens add up.

        Games, on the other hand, are mostly funded via up-front purchase (so you get the money once and then have to keep the servers running) or free to play, which very carefully tracks user acquisition costs versus revenue. Most F2P games make a tiny amount per player; they make up the difference via volume (and whales). So even a handful of queries per day per player can bankrupt you if you have a million players and no way to recoup the inference cost.

        Now, you can obviously add a subscription or ongoing charge to offset it, but that's not how the industry is mostly set up at the moment. I expect that the funding model will change, but meanwhile having a model on the edge device is the only currently realistic way to afford adding an LLM to a big single player RPG, for example.
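
        Back-of-envelope, with made-up but plausible numbers (the price and usage figures below are assumptions, not anyone's actual rates):

        ```python
        # Hypothetical: $2 per million output tokens, 300-token NPC replies,
        # 20 LLM-backed actions per player per day, 1M daily players.
        price_per_million_tokens = 2.00
        tokens_per_action = 300
        actions_per_player_per_day = 20
        daily_players = 1_000_000

        daily_tokens = tokens_per_action * actions_per_player_per_day * daily_players
        daily_cost = daily_tokens / 1_000_000 * price_per_million_tokens
        print(f"${daily_cost:,.0f} per day")   # $12,000/day, i.e. ~$4.4M/year
        ```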

        • You release the game with a variable in-game experience. If the player has two 4090s chugging away, they run everything locally. If they've got an RX 480, they get a barebones 1B model, or a subscription for the nicer AI NPCs, which can open up AI-driven side quests (adding minor content without having to write it). Include a “free” month when you register the game.
      • What if I charge "whales" in games to talk to an anime girl? Maybe I'll only let you talk to her once a day unless you pay me like a kissing booth for every convo. There's going to be some predatory stuff out there, I can see what the GP is talking about with games.
      • For a while basically any mobile or browser freemium game you tried would have progress timers for building things or upgrading things and they'd charge you Actual Money to skip the wait. That's kind of out of fashion now though some games still do it.
  • The only bar to using local is having the hardware and downloading the model. I find it nominally easier to use than the OpenAI API since the local API isn’t picky about some of the fields (by default). Agentic flows can use local 90 percent of the time and reach out to god when they need divine insight, saving 90 percent of token budgets and somewhat reducing external exposure, though I prefer to keep everything local if possible. It’s not hard to run a 70B model locally, but the queue can get backed up with multiple users unless you have very strong hardware. Still, you can shift overflow to the cloud if you want.
  • Weird to give MacBook Pro specs and omit RAM. Or did I miss it somehow? That's one of the most important factors.
    • Using a 7B model on a M2 Max also isn’t quite the most impressive way to locally run an LLM. Why not use QwQ-32 and let it give some commercial non-reasoning models a run for their money?
      • Exactly. You want to come close to maxing out your RAM for model+context. I've run Gemma on a 64GB M1 and it was pretty okay, although that was before the Quantization-Aware Training version was released last week, so it might be even better now.
    • Thanks for calling that out. It was 32GB. I updated the post as well.
  • LM Studio seems pretty good at making local models easier to use
    • A less known feature of LM Studio I really like is speculative decoding: https://lmstudio.ai/blog/lmstudio-v0.3.10

      Basically you let a very small model speculate on the next few tokens, and the large model then blesses/rejects those predictions. Depending on how well the small model performs, you get massive speedups that way.

      The small model has to be as close to the big model as possible - I tried this with models from different vendors and it slowed generation down by 3x or so. So you need to use a small Qwen 2.5 with a big Qwen 2.5, etc.

      • How exactly does this give a speedup? If you have to wait for the large model to confirm the small model's predictions, wouldn't it always be slower than just running the large model?
        • As far as I understand it works like this:

          1. Small model generates k tokens (probably k>=4 or even higher, there's a tradeoff to be made here, depending on the model sizes)

          2. Big model processes all k tokens' logits (probabilities) in parallel.

          3. Ideally, all tokens pass the probability threshold. That might be the case for standard phrases that the model likes to use, like "Alright, the user wants me to". If not all tokens pass the probability threshold, then the first unsuitable token and all after are discarded.

          4. Return to 1., maybe with an adjusted k.
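
          A toy version of that loop in Python, using greedy acceptance for simplicity (real engines use a probabilistic accept/reject rule and do step 2 as a single parallel forward pass; `greedy_next` is an assumed interface here, not a real API):

          ```python
          def speculative_step(draft_model, target_model, ctx, k=4):
              """One round of simplified (greedy) speculative decoding."""
              # 1. Draft model proposes k tokens autoregressively (cheap).
              proposal = []
              for _ in range(k):
                  proposal.append(draft_model.greedy_next(ctx + proposal))

              # 2./3. Target model checks the proposals -- written sequentially
              # here for clarity, but done in one batched pass in practice.
              accepted = []
              for tok in proposal:
                  expected = target_model.greedy_next(ctx + accepted)
                  if tok != expected:
                      accepted.append(expected)  # first mismatch: keep the target's
                      break                      # token, drop the rest of the draft
                  accepted.append(tok)

              # 4. Caller loops from the extended context, maybe adjusting k.
              return ctx + accepted
          ```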

        • Apparently for the bigger model, checking a token is faster than generating a fresh one. So if the tiny model gets it right you get a tiny speed bump. Can’t say I fully understand why it’s faster to check, either.

          Needs a pretty large difference in size to result in a speedup. 0.5B vs 27B is the only pairing where I’ve seen a speed bump.

    • I'm genuinely afraid it's going to do telemetry one day.

      I'm sure someone is watching their internet traffic, but I don't.

      I take the risk now, but I ask questions about myself, relationships, conversations, etc... Stuff I don't exactly want Microsoft/ChatGPT to have.

      • Local inferencing is synonymous with privacy for me. Until laws get put into effect, there is no universe where your online LLM usage is private as it stands now. I suspect most of these companies are going to be putting in a Microsoft Clippy style assistant soon that will act as a recommendation/ad engine, and this of course requires parsing every convo you've ever had. The paid tier may remove Clippy, but boy oh boy the free tier (which most people will use) won't.

        Clippy is coming back guys, and we have to be ready for it.

      • I've configured Little Snitch to only allow it access to huggingface. I think for updates I need to reset LS to "ask for each connection" or something like that.
      • If you want privacy, use local models and an open-source chat interface such as Open WebUI or Jan (avoid proprietary systems such as Msty or LM Studio).

        https://github.com/janhq/jan

        https://github.com/open-webui/open-webui

    • Here is another:

      https://msty.app/

        • Can anyone vouch for this? I (personally) don't mind that it's closed source, but I've never heard of it and can't find much about it. The website makes it look fantastic though, so I'm intrigued. But I'm hesitant to give it all of my API keys...
        • Reddit is probably your best bet. I think we all take some risk, even with something like LMStudio which is closed source, since all these apps are basically a new genre.
    • they made it so easy to do specdec, that alone sold it for me

      Some models even have a 0.5B draft model. The speed increase is incredible.

    • They look awesome. Will try it out.
  • I thought llamafile was supposed to be the solution to "too hard to work with"?

    https://github.com/Mozilla-Ocho/llamafile

    • Llamafile is great and I love it. I run all my models with it and it's super portable; I have tested it on Windows and Linux, on a powerful PC and an SBC. It worked great without too many issues.

      It takes about a month for features from llama.cpp to trickle in. Also, figuring out the best mix of context length, VRAM size, and desired speed takes a while before it gets intuitive.

    • I thought it's "docker model" (and OCI artifacts).
      • llamafile is a multiplatform executable that wraps the model and a slightly modified version of llama.cpp. IIRC funded by Moz.
  • Having done my master's on the topic of grammar-assisted text2sql, let me add some additional context here:

    - first of all, local inference can never beat cloud inference, for the very simple reason that costs go down with batching. it took me two years to actually understand what batching is: the tensors flowing through the transformer layers have a dimension designed specifically for processing data in parallel, so no matter if you process 1 sequence or 128 sequences the cost is roughly the same (a toy sketch at the end of this comment). i've seen very few articles state this clearly, so bear in mind - this is the primary blocker for local inference competing with cloud inference.

    - second, and this is not a light one to take - LLM-assisted text2sql is not trivial, not at all. you may think it is, you may expect cutting-edge models to do it right, but there are ...plenty of reasons models fail so badly at this seemingly trivial task. you may start with an arbitrary article such as https://arxiv.org/pdf/2408.14717 and dig through the references; sooner or later you will stumble on one of dozens of overview papers by mostly Chinese researchers (such as https://arxiv.org/abs/2407.10956) where the approaches are summarized. Caution: you may feel inspired that AI will not take over your job, or you may feel miserable at how much effort is spent on this task and how badly everything fails in real-world scenarios.

    - finally, something I agreed on with a professor advising a doctorate candidate whose thesis, surprisingly, was on the same topic: LLMs lean much better on GraphQL and other structured formats such as JSON than on the complex grammar of SQL, which is not a regular grammar but a context-free one, so it takes more complex machines to parse and very often requires recursion.

    - which brings us to the most important question - why do commercial GPTs fare so much better on it than local models? well, presumably the top players not only use MoEs but also employ beam search, perhaps speculative inference, and all sorts of optimizations at the hardware level. while all of this is not beyond comprehension for a casual researcher at a casual university (like myself), you don't get to easily run it all locally. I have not written an inference engine myself, but I imagine MoE plus beam search is super complex, as beam search basically means you fork the whole LLM execution state and go back and forth. Not sure how this even works together with batching.

    So basically - this is too expensive. Besides, atm (to my knowledge) only vllm (the engine) has some sort of reasonably working local beam search. I would've loved to see llama.cpp's beam search get a rewrite, but it stalled. Trying to get beam search working with current python libs is nearly impossible on commodity hardware, even if you have 48 gigs of VRAM, which already means a very powerful GPU.
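
    To make the batching point concrete, here's a toy benchmark sketch (assumes PyTorch and a CUDA GPU; sizes are arbitrary). A matmul against a fixed weight matrix is dominated by streaming the weights from memory, so 1 sequence and 128 sequences cost nearly the same wall-clock time:

    ```python
    import time
    import torch

    # Stand-in for one transformer weight matrix (d_model x d_model).
    W = torch.randn(4096, 4096, device="cuda")

    def timed_matmul(batch_size: int) -> float:
        x = torch.randn(batch_size, 4096, device="cuda")   # (batch, d_model)
        _ = x @ W                                          # warm-up
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        _ = x @ W                                          # same weights, bigger batch
        torch.cuda.synchronize()
        return time.perf_counter() - t0

    for b in (1, 128):
        print(b, timed_matmul(b))
    # On most GPUs the two timings come out close: the dominant cost is reading W,
    # and that cost is shared across the whole batch -- which is why cloud batching
    # drives the per-request price down so far.
    ```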

    • Sounds like an interesting masters thesis. Is your masters thesis available online somewhere?
      • Well, not sure about the final doc that went to the university, but this is the almost final draft.

        https://docs.google.com/document/d/e/2PACX-1vSyWbtX700kYJgqe...

        Since it's in Cyrillic you should perhaps use a translation service. There are some screens showing results, though as I was really on a tight deadline, and it's not a PhD but a master's thesis, I decided not to go into an in-depth evaluation of the proposed methodology against SPIDER (https://yale-lily.github.io/spider). You can find the simplified GBNF grammar, though, and also some of the outputs. Interestingly, the grammar benefits from/exploits a bug in llama.cpp which allows some sort of recursively chained rules. The bibliography is in English, but really - there is so much written on the topic, it's by no means comprehensive.

        Sadly no open inference engine (at the time of writing) was good enough at both beam search and grammars, so this whole thing perhaps needs to be redone in pytorch.

        If I find myself in a position to do this for commercial goals, I'd also explore the possibility of having human-curated SQL queries against the particular schema, in order to guide the model better. And then do RAG on the DB for more context. Note: I'm already doing E/R model reduction to the minimal connected graph which includes all entities of particular interest to the present query.

        And finally, since you got that far - the real real problem with restricting LLM output with grammars is the tokenization. Parsers read one character at a time, while tokens are very often several characters, so the parser in a way needs to be able to look ahead, which it normally does not. I believe OpenAI wrote that they realized this too, but I can't find the article atm.

        • Thanks. Took a quick look and definitely needed to use Google Translate but seems to have worked to get the gist of it.
    • There are local applications of parallel processing; your average chatbot wouldn't use it, but a research bot with multiple simultaneous queries will, for example.

      Better local beamsearch would be really nice to have, though.

    • I do wonder if recursion is particularly hard for LLMs, given that they have a hard limit on how much they can loop for a given token. (Absent beam search, reasoning models, and other trickery.)
      • Given that a Prolog (not ProbLog, but the non-stochastic one) source is a parametric grammar, we can perhaps* argue that inference at the programming-logic level can be unfolded with pen and paper. Think L-systems: they are self-similar and recursively defined. The catch is that the whole sequence gets rewritten on each step. If you can get the LLM to do this as it progresses with generation - you get recursion. The question is whether you can get the LLM to rewrite the context window, and my bet would be someone is already working on it.

        * I say perhaps, because PROLOG engines normally don't rewrite strings like crazy while doing inference, so my statement may be somewhat off.

  • A local LLM would be a great idea for Mozilla to try in its Orbit [1] extension to summarize articles. But sadly, they are only going with the cloud option (for now).

    [1] https://orbitbymozilla.com/

  • I installed Ollama on my 64GB M1 Max and ran gemma3:27b. Well, it works, but it's a bit laggy. I use LLMs quite frequently, and compared to running them locally, I still prefer using the API; it's more efficient and accurate.
  • Don’t think the cost point is correct. Last I saw the calcs, the API was cheaper than the juice needed to run a local GPU, never mind the equipment.

    Plus there are a mountain of free tokens out there like Gemini free

  • Seems fine to me, I use it like a local google for software eng questions, or rig it into aider to write tests while I'm doing something else. Keeps me focused and out of my web browser frankly.

    EDIT: oh! It's also fantastic if you're on a plane!

  • the advantage of local LLMs is that you can literally find many models that have no cloud equivalent. someone may have made a fine tune that meets your needs. if you can't find a generic model that does, you can pick an appropriately sized model you can run, build or get a dataset, train in the cloud, then use the model locally.
  • I spent a couple of weeks trying out local inference solutions for a project. Wrote up my thoughts with some performance benchmarks in a blog.

    TLDR -- What these frameworks can do on off-the-shelf laptops is astounding. However, it is very difficult to find and deploy a task-specific model, and the models themselves (even with quantization) are so large that the download would kill UX for most applications.

    • There are ways to improve the performance of local LLMs with inference-time techniques. You can try optillm (https://github.com/codelion/optillm); it is possible to match the performance of larger models on narrow tasks by doing more at inference time.
  • Why is HN so full of people who don't know good LLM tooling?

    SillyTavern and vllm are right there, ready to give you a class-leading experience - but you all ignore them and use stuff like LM Studio (missing tons of features that SillyTavern or even oobabooga have, like advanced samplers such as min_p or top-nsigma), or worse, you use even slower solutions like ollama or llama.cpp.

    The real reason that folks don't like to run models on their own is that the tools have thus far been built by obvious coomers (we all know what most people use SillyTavern or ComfyUI for). Just embrace the vibe set by these products instead of resisting it by forcing yourself to use shit tools.

    This is yet ANOTHER post I have to make about this: https://news.ycombinator.com/item?id=43743337#43743658

    I don't care how many downvotes I get for pointing this out yet again. I'm at ICLR about to present an Oral and the vast majority of the people who'd downvote me for calling out poor tooling choices haven't done anything of note in AI before...

    • It's the tone, not the content.

      a) Slang signifies group membership, and can be irritating to a reader who does not identify with the group

      b) In tech, there are too many knowledge domains for anyone to keep up. I personally find it frustrating when people expect everyone else to know about their area of expertise.

      c) People have an urge to smack down someone who too loudly toots their own horn.

      The final paragraph makes the tone problem worse; if we're being objective, it comes off as quite pissy, no? The same comment likely would have been upvoted if the tone had been gentler.

      • Strong opinions weakly held is a rallying cry here. Being spicy is how one drives engagement.

        HN loves to act like they’re experts in LLMs but they’re mostly just not - and they don’t listen to their elders here.

        Calling out poor tech choices with a “pissy” tone is simply continuing the tech culture of stack overflow, irc, etc. don’t act like it wasn’t awesome and you don’t like it.

        If the person writing is too “good” at the tone side, it’s a sign that they spent countless time getting better at social skills which objectively trades off with time spent getting better at your field. Reality is a zero-sum game, time spent getting one skill is time spent not working on another. Often the best devs are extremely hard to work with and are spicy. Terry A Davis or Yoshua Bengio come to mind as examples here.

        So, you should celebrate my tone. Not cry that your feelings were hurt by it. It means I did my homework so that you all don’t have to.

        • The original comment seemed knowledgeable and informative.

          It ended with a complaint about downvotes...

          ...but it began by insulting the audience.

          It wasn't clear to me if you were too close to the comment to realize why people downvoted it.

          I thought a critique might be helpful. Apparently, it wasn't welcome, but my intentions were good.