AI

'Forget ChatGPT: Why Researchers Now Run Small AIs On Their Laptops' (nature.com) 48

Nature published an introduction to running an LLM locally, starting with the example of a bioinformatician who's using AI to generate readable summaries for his database of immune-system protein structures. "But he doesn't use ChatGPT, or any other web-based LLM." He just runs the AI on his Mac... Two more recent trends have blossomed. First, organizations are making 'open weights' versions of LLMs, in which the weights and biases used to train a model are publicly available, so that users can download and run them locally, if they have the computing power. Second, technology firms are making scaled-down versions that can be run on consumer hardware — and that rival the performance of older, larger models. Researchers might use such tools to save money, protect the confidentiality of patients or corporations, or ensure reproducibility... As computers get faster and models become more efficient, people will increasingly have AIs running on their laptops or mobile devices for all but the most intensive needs. Scientists will finally have AI assistants at their fingertips — but the actual algorithms, not just remote access to them.
The article's list of small open-weights models includes Meta's Llama, Google DeepMind's Gemma, Alibaba's Qwen, Apple's DCLM, Mistral's NeMo, and OLMo from the Allen Institute for AI. And then there's Microsoft: Although the California tech firm OpenAI hasn't open-weighted its current GPT models, its partner Microsoft in Redmond, Washington, has been on a spree, releasing the small language models Phi-1, Phi-1.5 and Phi-2 in 2023, then four versions of Phi-3 and three versions of Phi-3.5 this year. The Phi-3 and Phi-3.5 models have between 3.8 billion and 14 billion active parameters, and two models (Phi-3-vision and Phi-3.5-vision) handle images. By some benchmarks, even the smallest Phi model outperforms OpenAI's GPT-3.5 Turbo from 2023, rumoured to have 20 billion parameters... Microsoft used LLMs to write millions of short stories and textbooks in which one thing builds on another. The result of training on this text, says Sébastien Bubeck, Microsoft's vice-president for generative AI, is a model that fits on a mobile phone but has the power of the initial 2022 version of ChatGPT. "If you are able to craft a data set that is very rich in those reasoning tokens, then the signal will be much richer," he says...

Sharon Machlis, a former editor at the website InfoWorld, who lives in Framingham, Massachusetts, wrote a guide to using LLMs locally, covering a dozen options.
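
As a rough illustration of what running an LLM locally can look like in practice, here is a minimal sketch assuming the Hugging Face transformers library and an illustrative small open-weights model; the tooling and model choice are assumptions, not taken from the guide or the Nature article.

# Minimal local-inference sketch; library and model choice are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative small open-weights model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # runs on CPU by default

messages = [{"role": "user",
             "content": "Summarize this protein record in plain English: ..."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))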

The bioinformatician shares another benefit: you don't have to worry about the company updating their models (leading to different outputs). "In most of science, you want things that are reproducible. And it's always a worry if you're not in control of the reproducibility of what you're generating."

And finally, the article reminds readers that "Researchers can build on these tools to create custom applications..." Whichever approach you choose, local LLMs should soon be good enough for most applications, says Stephen Hood, who heads open-source AI at the tech firm Mozilla in San Francisco. "The rate of progress on those over the past year has been astounding," he says. As for what those applications might be, that's for users to decide. "Don't be afraid to get your hands dirty," Zakka says. "You might be pleasantly surprised by the results."

'Forget ChatGPT: Why Researchers Now Run Small AIs On Their Laptops'

  • Duh (Score:5, Insightful)

    by locater16 ( 2326718 ) on Monday September 23, 2024 @03:49AM (#64808957)
    50 years of "This is complex, we could run it remotely!" and then "this is expensive, we could run it locally!"
    • This concept needs a catchy name. Lap-cloud? Moist-lap?
    • It may be the end of the bubble.
    • Re:Duh (Score:5, Interesting)

      by Rei ( 128717 ) on Monday September 23, 2024 @06:41AM (#64809113) Homepage

      To be fair, local models are not a replacement for cloud models in all cases. For example, if you're coding, how much is your time worth that you'd choose a small local model over a bigger, better cloud model that will do a better job, just because the cloud model costs $20 a month? The same applies to other fields.

      Running locally is also much less capital and power efficient in most circumstances than running on the cloud. Especially if running on CPU, but GPU as well. It's not just that server GPUs are more power efficient (though they are); it's that AI models benefit greatly from batching [githubusercontent.com], and the available VRAM and high-speed interconnects of servers make that possible. Though to be fair, speculative decoding helps reduce that difference somewhat. But CPU remains terribly inefficient for AI. Running on even a high-end cloud server will give you more throughput, and generally if you're querying an API, they have lots of servers that can respond to your requests, so you can accomplish a given multi-prompt task far faster. And maintenance is annoying; it can be nice to let others do it for you, to handle the temperamental boxes full of noisy fans blowing out heat.
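
      To make the batching point concrete, here is a minimal sketch, assuming the Hugging Face transformers library and an illustrative small model (not what cloud providers actually run): several prompts are padded into one tensor, so each decoding step is a single forward pass.

      # Hedged batching sketch; the library and model choice are assumptions.
      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_id = "Qwen/Qwen2.5-0.5B-Instruct"        # illustrative small model
      tok = AutoTokenizer.from_pretrained(model_id, padding_side="left")
      tok.pad_token = tok.eos_token                  # many causal LMs lack a pad token
      model = AutoModelForCausalLM.from_pretrained(model_id)

      prompts = ["Summarize protein record A in one line.",
                 "Summarize protein record B in one line.",
                 "Summarize protein record C in one line."]
      batch = tok(prompts, return_tensors="pt", padding=True)   # one padded batch
      out = model.generate(**batch, max_new_tokens=64, pad_token_id=tok.eos_token_id)
      for text in tok.batch_decode(out, skip_special_tokens=True):
          print(text)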

      There are of course advantages to running locally. You never transmit your data outside your computer (you can even effectively "google search" without leaving a search history). You have full control over the server (though you have to maintain it, and stay informed of the best models / best ways to run models, as they constantly improve). You can choose from a much wider range of models, including uncensored ones, or your own finetunes (though some cloud providers offer finetuning services). You might use it so rarely that you don't care about power consumption, even on CPU. Your task might be so simple that there's basically no difference between local or remote in terms of quality. Indeed, it might be something so simple that you can do it with a finetuned model that's only 1-2B params, so you could run it with significant batching even on consumer hardware. So yeah, I don't mean to be dismissive here. I do plenty of AI tasks myself locally on a server with two NVLinked 3090s.

      Also, a note about small models: they've gotten much better at "instruction following" tasks, and indeed can outperform models an order of magnitude their size from just a year prior. Their trivia knowledge, however, hasn't improved as fast relative to model size. Which is probably unsurprising; there's only so much knowledge that can be represented in a given parameter space (it's honestly very impressive how much knowledge they do hold; they dramatically outperform our brains when comparing numbers of neurons, even though they learn more slowly than our brains).

      • Apple seems to have struck a good balance with their "Apple Intelligence" (my god, seriously, trying to steal AI to be a proprietary name) thing.

        Most of the time, models run locally on your device. They have models that detect when the output is shit, and then suggest using online models if you want to.

        • by Rei ( 128717 )

          Yeah, it's a really good strategy, having a lightweight AI model decide whether to answer directly or outsource the task. Should make for a very good user experience.
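
          A toy sketch of that routing idea, purely illustrative: this is not Apple's implementation, and the mean token log-probability check, the threshold, and the model name are all assumptions.

          # Toy "answer locally or hand off" router; everything here is an assumption.
          from transformers import AutoModelForCausalLM, AutoTokenizer

          model_id = "Qwen/Qwen2.5-0.5B-Instruct"   # illustrative small local model
          tok = AutoTokenizer.from_pretrained(model_id)
          model = AutoModelForCausalLM.from_pretrained(model_id)

          def answer_or_outsource(question, threshold=-1.5):
              inputs = tok(question, return_tensors="pt")
              out = model.generate(**inputs, max_new_tokens=64,
                                   output_scores=True, return_dict_in_generate=True,
                                   pad_token_id=tok.eos_token_id)
              # Mean per-token log-probability as a crude confidence proxy.
              scores = model.compute_transition_scores(out.sequences, out.scores,
                                                       normalize_logits=True)
              if scores.mean().item() < threshold:
                  return "LOW CONFIDENCE: forward this query to a cloud model instead"
              new_tokens = out.sequences[0][inputs["input_ids"].shape[-1]:]
              return tok.decode(new_tokens, skip_special_tokens=True)

          print(answer_or_outsource("In what year was the transistor invented?"))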

    • Yup, it's a cycle. Very roughly: 70's mainframes, 80's PCs, 90's thin clients, 2000s better PCs, 2010s cloud.

      Now in the 2020s it's a mess, what with smartphones and apps, IoT everywhere, operating systems actively hiding what's local and what's in the cloud, etc.

    • Duh you can do this. Duh it will be less smart and underpowered. Laptops don't have entire server racks of GPUs and cooling. (Yes, even for just inference.)

  • Open weights (Score:4, Interesting)

    by silentbozo ( 542534 ) on Monday September 23, 2024 @05:25AM (#64809029) Journal

    I think the biggest benefit is being able to fine-tune open-weight models. You can do additional training and apply the changes as a LoRA (or, more likely on consumer hardware, a QLoRA); a rough sketch follows at the end of this comment. Instead of relying on sleight of hand (are you actually talking to GPT-4o mini, and if so, have they tacked on additional layers of guardrails since you last sent a query?) you can use checkpointed models, and then tune them specifically to the tasks you want to accomplish.

    I think from a performance perspective, it is cheaper to use VC-funded GPUs in a remote data center. You can certainly run open-weight models on provisioned GPU instances in a remote datacenter - if you want to do training, it's a lot faster to rent time on a farm of A100 80GB units than to try and bodge something together at home.

    But from a customization perspective (and data safety perspective) running inference on prem, on hardware you control, is probably less of a deal breaker than running proprietary data through someone else's model, on someone else's hardware.
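
    A rough sketch of the LoRA step mentioned above, assuming the Hugging Face peft library and an illustrative base model (neither is named in this comment); QLoRA would additionally load the base model in 4-bit before attaching the adapters.

    # Hedged LoRA sketch; toolchain and base model are assumptions.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
    config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                        target_modules=["q_proj", "v_proj"],   # attention projections
                        task_type="CAUSAL_LM")
    model = get_peft_model(base, config)
    model.print_trainable_parameters()   # only the small adapter matrices are trained
    # ...run your usual training loop / Trainer on domain data here, then:
    model.save_pretrained("widget-lora-adapter")   # saves just the adapter, a few MB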

    • What do you tune? And to what? I think you tune the tone of the language to either mimic you better or tickle your brain better. Either way it is a rather selfish purpose. If you tune to something else - it might be evil or very commercial :-)
      • Re: Open weights (Score:5, Interesting)

        by Rei ( 128717 ) on Monday September 23, 2024 @06:46AM (#64809119) Homepage

        You tune to whatever task you're running.

        Let's pick an example. Say you're a company that makes widgets, and you have a bazillion boring documents about your widget system, and you want to make it faster and easier for your employees or customers to get information. You finetune the foundation model on those widget documents for several epochs, then finetune on a question-response dataset so that it behaves as a chat model. Voilà, you now have a chatbot that's an expert on your widgets.
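
        A hedged sketch of what that second-stage question-response dataset could look like, in the chat-style JSONL layout many finetuning tools accept; the widget content is invented for illustration.

        # Invented examples; in practice these would come from real support Q&A.
        import json

        examples = [
            {"messages": [
                {"role": "user", "content": "How do I reset a Model-X widget?"},
                {"role": "assistant",
                 "content": "Hold the recessed button for 10 seconds until the LED blinks twice."}]},
            {"messages": [
                {"role": "user", "content": "What voltage does the widget controller take?"},
                {"role": "assistant",
                 "content": "12 V DC, centre-positive, 2 A minimum."}]},
        ]

        with open("widget_chat_finetune.jsonl", "w") as f:
            for ex in examples:
                f.write(json.dumps(ex) + "\n")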

        • You just trained / tuned your LLM to speak the average lingo of technical writers at your widget vendor. Will you trust the model to answer about facts in the product sheets? Or about the (non-)existence of a type of widget? Trust it more than Google? More than a database?
          • Re: (Score:3, Interesting)

            by Ed Tice ( 3732157 )
            I would trust this to handle customer support inquiries. If the training set is produced by our higher-level support folks, the model will provide at least reasonable responses. And, if the responses are wrong, the chat history is very informative and would likely mean that I can solve the customer problem without having to go back and ask them more questions. I've advocated for us having an LLM AI support feature but we don't have one partially because of cost. If this means that I can finally get a LL
            • by Rei ( 128717 )

              BTW, if you want to get it done on the cheap, go to the Discord server or forum of any LLM finetuning software, like Axolotl or whatnot, and ask if anyone would like some work finetuning an AI model. You'll almost certainly get a far lower price for it than paying some corporation to do so.

    • Re:Open weights (Score:4, Insightful)

      by Registered Coward v2 ( 447531 ) on Monday September 23, 2024 @05:39AM (#64809053)

      But from a customization perspective (and data safety perspective) running inference on prem, on hardware you control, is probably less of a deal breaker than running proprietary data through someone else's model, on someone else's hardware.

      That is clearly a benefit, in that you can use proprietary data without having it used by someone else to enhance their model, which a competitor could then use as well.

    • Faster? "simpler"? Cheaper?

      Well, for the developer probably. But the bean counters caught on after the pandemic waned.
      Those things the devs love carried a HUGE cost, far more than the business could sustain at the profitability level demanded by equity funds/investors. Individuals can't afford to do these things that way and business, more and more, has financial models that call BS on the devs.

      Something I was taught a long time ago: good engineering isn't about gold-plated, whiz-bang stuff, but about co

  • They have always been quite readable, but I never could convince myself they did not leave out important facts. Is a bigger model better?
  • by bradley13 ( 1118935 ) on Monday September 23, 2024 @06:02AM (#64809075) Homepage

    I do this with Mixtral, on an older, ordinary laptop (with a so-so graphics card). It's a bit slow, but it doesn't send your data to the Borg. The model I am using is pretty old by now, so it's not as good as GPT-4o. However, models are improving steadily, and I expect newer open models are a lot better. When I replace the laptop, speed will no longer be an issue either.

    • by Rei ( 128717 )

      I recently upgraded from Mixtral to Mistral NeMo - you may want to consider it. I find the performance similar to Mixtral (maybe even better overall), and throughput is similar, but at a far smaller memory footprint, so you can use that extra VRAM for either batching or far longer context (up to 128k tokens).

      Don't forget to use speculative decoding. It's huge extra performance for free :) Best is to use a sheared model.
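
      A hedged sketch of speculative (assisted) decoding, assuming the Hugging Face transformers API and an illustrative target/draft pair that share a tokenizer (a sheared or otherwise pruned variant of the target is a typical draft choice):

      # Hedged sketch; model names are illustrative, and draft and target
      # should share a vocabulary/tokenizer.
      from transformers import AutoModelForCausalLM, AutoTokenizer

      target_id = "Qwen/Qwen2.5-7B-Instruct"
      draft_id = "Qwen/Qwen2.5-0.5B-Instruct"

      tok = AutoTokenizer.from_pretrained(target_id)
      target = AutoModelForCausalLM.from_pretrained(target_id)
      draft = AutoModelForCausalLM.from_pretrained(draft_id)

      inputs = tok("Explain speculative decoding in one paragraph.", return_tensors="pt")
      # The draft model proposes several tokens; the target verifies them in one pass.
      out = target.generate(**inputs, max_new_tokens=128, assistant_model=draft)
      print(tok.decode(out[0], skip_special_tokens=True))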

    • by LKM ( 227954 )
      I'm using Llama3 locally, it's extremely fast on my M3 MacBook Pro and good enough to answer relatively obscure questions (like "how do I add index numbers after titles in my HTML page using only CSS").
      • by Rei ( 128717 )

        I've always been unhappy with LLaMA's license (though others like Qwen are even worse, and Meta has softened the terms somewhat recently, so good on them).

        Mistral's models have always been very competitive performers, relatively uncensored, and with very open licenses (usually Apache). They also seem less prone to inserting "personality", and instead just focus on completing the task. Quite multilingual as well (I really don't get why people still train monolingual or bilingual models; different languages a

  • by williamyf ( 227051 ) on Monday September 23, 2024 @06:42AM (#64809115)

    People bitch about the inclusion of an NPU in their CPUs, not noticing that, while a discrete GPU will do the deed in terms of raw compute, it will not cut the mustard in terms of having enough memory to run the inference models.

    Yes, main memory may be slow, but at least it is upgradeable (via DIMM, SO-DIMM or CAMM2), can be shifted between the CPU and the NPU, and can be swapped to SSD if push comes to shove.

    Graphics memory, meanwhile, is fixed at the moment of purchase, and hyper-expensive to boot....

    Long live the NPU inside the CPU!

    • AI-specific "CPUs" are definitely coming, be they NPUs, Tensorflow chips, or something else.

      Meanwhile, GPUs work well for a lot of use cases. Hats off to Apple for almost forcing us Mac owners to take a (pretty decent) GPU, even though some of us (myself included) initially used it once in a blue moon. Apple probably made the decision to include GPUs a year or two before they actually sold the first one, so it was a good bit of foresight (or a lucky co-incidence) on their part. Given the very fixed-at-purch

    • by Targon ( 17348 )

      In the same way that dedicated memory channels on a CPU for the integrated graphics WOULD improve performance, setting up dedicated and expandable memory on the CPU for the NPU might help as well.

      • Nah, that will not happen; one of the costliest things in semiconductor manufacturing by far is the number of pads connecting the chip to the external world. That's why NVIDIA and AMD moved from 256-bit memory buses to 192-bit and below.

        It's more efficient to have 6x64-bit channels shared by the CPU+iGPU+iNPU than a fixed allocation of 2x64 for each, and cheaper still to have 4x64 for all three and pray that people do not need the full 6x.

        worst case scenario, they will put a bunch of eDRAM in a tile as cache and call it a day

        • I think the wider bus stopped being so important as gaming workloads started to become complex multi-stage rendering problems, in addition to latency becoming more and more important in game rendering as dynamic and indirect memory accesses start to dominate that multi-stage rendering problem.

          AMD's top integrated GPU is currently ~1080 class, all on the regular CPU memory controller - 6 years ago the 1080 was the fastest GPU you could buy.

          A lot of that has to do with how rendering workloads have changed.
          • I think the wider bus stopped being so important as gaming workloads started to become complex multi-stage rendering problems, in addition to latency becoming more and more important in game rendering as dynamic and indirect memory accesses start to dominate that multi-stage rendering problem.

            AMD's top integrated GPU is currently ~1080 class, all on the regular CPU memory controller - 6 years ago the 1080 was the fastest GPU you could buy.

            A lot of that has to do with how rendering workloads have changed.

            You are 100% right.

            But the semiconductor manufacturing landscape has also changed in the last 10 years: matching advanced chips (CPUs, GPUs, NPUs) with denser memories is harder now than it was 10 years ago, and, to boot, the "price per bump" - already one of the costlier things in semiconductor manufacturing - has increased while feature sizes have decreased and packaging has become more complex...

            That's one of the reasons why AMD has favoured dual process nodes for their chiplets, on both the CPU and dGPU sides.

  • by ElizabethGreene ( 1185405 ) on Monday September 23, 2024 @10:21AM (#64809531)

    I want to run my LLM locally because my squishy human brain will anthropomorphize and "make friends with" the LLM. Having my friend controlled by big tech that can deprecate or upgrade it at their whim is not cool.

    • No. You want to run your LLM locally so fucking Big Data doesn't abuse all your fucking data.

      • I use facebook and gmail daily, get my entertainment from Netflix and YouTube, keep my files on dropbox, run Windows, host stuff in Azure+AWS, and meet cool people through Slashdot, SoylentNews, and X.

        About the only thing Big Data doesn't have of mine is feet pics, and that's because I'm too lazy to paint my toenails and take them.

  • You probably shouldn't rely on a remote model as a scientist.
    I really don't want OpenAI to get hacked and leak all my conversations with an AI assistant.
    I may want to pass it a bunch of research notes for smart search functionality that probably shouldn't leave my machine. I may want to pass it a bunch of experimental logs which would be best to keep local.
    I may want to pass my shell context and history to the LLM to get smarter autocomplete. That probably shouldn't leave my machine either.
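
    A minimal sketch of that last idea, assuming a local OpenAI-compatible endpoint (llama.cpp's server and Ollama can both expose one); the port, the model name, and the history length are assumptions.

    # Recent shell history is read locally and only sent to a local endpoint.
    import json
    import os
    import urllib.request

    with open(os.path.expanduser("~/.bash_history")) as f:
        history = "".join(f.readlines()[-20:])   # last 20 commands

    payload = {
        "model": "local-model",   # placeholder for whatever model is loaded locally
        "messages": [
            {"role": "system", "content": "Given recent shell history, suggest the next command."},
            {"role": "user", "content": history},
        ],
    }
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",   # assumed local server address
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    print(reply["choices"][0]["message"]["content"])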

  • Those OSI standards were not meant to scale to accommodate every user in every domain. Just wrapping yet another abstraction based on yet another framework cannot be sustained.

  • How are we ever going to raise zillions of dollars to build huge, climate-change-producing data centers for AI if people do *this*?!?!?!

  • Who? Who would have thought?

    Oh, yeah. Everybody who's not a CEO or parasitic MBA.

  • Some people don't want to go to the trouble to run an LLM locally, so they have options to use cloud LLMs. Others have reasons to run locally, and so they can. This is all good.
