Following the release of Dimensity 9300 and S8G3 (Snapdragon 8 Gen 3) phones, I expect LLMs running on mobile phones to grow in popularity, as quantized 3B or 7B models can already run on high-end phones released in the last five years. But even though it is possible, there are a few concerns, including power consumption and storage size. I’ve seen posts about successfully running LLMs on mobile devices, but I seldom see people discussing future trends. What are your thoughts?
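
For a rough sense of the storage concern, here is a back-of-the-envelope sketch of how large the weights of a 3B or 7B model are at different quantization levels. It counts weights only and assumes a uniform bit width; real quantized files carry extra metadata, and the runtime needs additional memory for the KV cache and activations. The function name is made up for this example.

```python
def weights_size_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Print a small table for the model sizes mentioned in the post.
for params in (3, 7):
    for bits in (16, 8, 4, 2):
        size = weights_size_gib(params, bits)
        print(f"{params}B @ {bits}-bit: {size:.2f} GiB")
```

Under these assumptions a 7B model at 4 bits needs a bit over 3 GiB for weights alone, which is why such models are limited to phones with plenty of RAM.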

  • oe-g@alien.topB
    1 year ago

    My personal take: what are the use cases for user-friendly local LLMs on mobile compared to higher-performance closed models?

    Privacy is the only serious benefit I can think of.

    • NDBellisario@alien.topB
      1 year ago

      Latency is one thing with the internet.

      Any model that can run locally doesn’t need a round trip to a datacenter. This can of course vary depending on compute power.

      • Maykey@alien.topB
        1 year ago

        At current capabilities it’s faster to query a server on the opposite hemisphere than to generate locally.

      • CocksuckerDynamo@alien.topB
        1 year ago

        round trip latency of an http request (or grpc or whatever pick your poison) is utterly insignificant compared to the time it takes to run the inference process, even for the smallest models with the fastest inference
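
To put rough numbers on the latency argument in this sub-thread, here is a minimal sketch that models response time as a fixed overhead plus generation time. Every figure in it (round-trip time, tokens per second, reply length) is a hypothetical placeholder, not a measurement.

```python
def response_time_s(overhead_s: float, tokens: int, tokens_per_s: float) -> float:
    """Fixed overhead (e.g. a network round trip) plus token generation time."""
    return overhead_s + tokens / tokens_per_s


# Hypothetical placeholder numbers, chosen only to illustrate the shape
# of the trade-off.
REPLY_TOKENS = 200
cloud = response_time_s(overhead_s=0.3, tokens=REPLY_TOKENS, tokens_per_s=50.0)
local = response_time_s(overhead_s=0.0, tokens=REPLY_TOKENS, tokens_per_s=5.0)
print(f"cloud: {cloud:.1f} s, local: {local:.1f} s")
```

With these assumed values the round trip is a small fraction of the total, and the local device only wins if its generation speed comes close to the server's.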

    • GraceRaccoon@alien.topB
      1 year ago

      Privacy I don’t care about, too late for that lol. If it becomes as normal to use AI as it is to google something, my worry would be it intentionally using language to fuck with my head, or skew my perspective on something I’m trying to get info on. Social engineering is a spooky thing. Algorithms on social media are already causing damage lol.

    • Combinatorilliance@alien.topB
      1 year ago

      It’s not going to be just chat. The LLMs are going to be integrated into everything in the OS.

      Suggesting emails, finding appointments in e-mail (I believe this already exists somewhat for Apple? In any case it will be private, local and more reliable), improved search, way improved personal assistant, APIs to access the model from any app. Lots of stuff…

  • vikarti_anatra@alien.topB
    1 year ago

    Just my thoughts on this:

    Would be great.

    Would be rather limited but possible (thanks to https://llm.mlc.ai/ and increasing memory).

    A lot of CHEAP Chinese devices will say they can actually do it. They will. At 2-bit quantization and <1 t/s, and it would be 7B models or even less. They will be unusable.

    Google says it’s not necessary because you can use their Firebase services for AI, and you can use NNAPI anyway. You must also censor your LLM-using apps in the Play Store to adhere to their rules.

    Apple says it’s not necessary; later they will advertise it as a very good thing and provide optimized libraries and some pretrained models, but you need to buy the latest iPhone (last year’s won’t work, because Apple). You must also censor your apps AND mark them as 18+.

    Areas of usage?

    - Language translation (including voice-to-voice). Basically a much improved Google Translate.

    - AI assistant (basically a MUCH improved Siri, used not only as a command interface).

  • jamesstarjohnson@alien.topB
    1 year ago

    If an LLM is packaged in an app and installed on a phone with all the permissions granted and good function-calling ability to use OS APIs, it will be able to do lots of things with voice commands, even without being fully integrated like Siri.
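
As a minimal sketch of what function calling could look like inside such an app: the model emits a JSON object naming a tool and its arguments, and the app dispatches it to a handler. The JSON shape, the tool names, and the handlers below are all hypothetical stand-ins; a real app would call the platform's own SDK in their place.

```python
import json
from typing import Callable, Dict


# Hypothetical stand-ins for OS APIs.
def set_alarm(time: str) -> str:
    return f"alarm set for {time}"


def send_message(to: str, body: str) -> str:
    return f"message to {to}: {body}"


# Registry of tools the model is allowed to call.
TOOLS: Dict[str, Callable[..., str]] = {
    "set_alarm": set_alarm,
    "send_message": send_message,
}


def dispatch(model_output: str) -> str:
    """Parse a JSON function call emitted by the model and run the matching tool."""
    try:
        call = json.loads(model_output)
        name = call["name"]
        args = call.get("arguments", {})
        tool = TOOLS.get(name)
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        return "error: output was not a valid function call"
    if tool is None:
        return f"error: unknown tool {name!r}"
    try:
        return tool(**args)
    except TypeError:
        # Wrong or missing argument names end up here.
        return f"error: bad arguments for {name!r}"


print(dispatch('{"name": "set_alarm", "arguments": {"time": "07:00"}}'))
```

Keeping an explicit registry means the model can only trigger actions the app has deliberately exposed, and malformed output degrades into an error string instead of a crash.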

  • AreYouOKAni@alien.topB
    1 year ago

    Theoretically doable, practically unlikely. Battery life will take a significant hit, and the 3B/7B models don’t provide THAT much benefit to just take that hit.

    It is something to consider in the future, though. Like, 5 years from now we will probably have SoCs that are efficient enough to do it live.
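
The battery concern can be framed as simple arithmetic: energy is power times time, compared against the battery's capacity. The sketch below uses hypothetical placeholder values for power draw, generation time, and battery capacity; none of them are measurements of a real device.

```python
def battery_percent_used(power_w: float, seconds: float, battery_wh: float) -> float:
    """Share of the battery consumed by running at power_w for the given time."""
    energy_wh = power_w * seconds / 3600  # watt-seconds to watt-hours
    return 100 * energy_wh / battery_wh


# Hypothetical placeholders: 5 W sustained draw, 40 s per reply, 15 Wh battery.
per_reply = battery_percent_used(power_w=5.0, seconds=40.0, battery_wh=15.0)
print(f"one reply: {per_reply:.2f}% of battery")
print(f"100 replies: {100 * per_reply:.0f}% of battery")
```

Under these assumptions a single reply is cheap, but the cost adds up quickly if the model runs continuously in the background.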

  • ab2377@alien.topB
    1 year ago

    I am running TinyLlama and DeepSeek 1.3B on an almost 3-year-old cheap Poco X3 (Snapdragon 732G) and it’s great. Will post the video soon. So on the new phones, and the high-end ones, well, I am sure some people can run Mistral. But I also wish phone prices would come down; high-end phones are becoming more expensive than most laptops, and I can’t afford them.

  • Maykey@alien.topB
    1 year ago

    My hot take is that local models will become truly feasible on phones (and in general) only once we move past transformers towards something more FLOP- and memory-efficient (RetNet, S5).

  • sshan@alien.topB
    1 year ago

    Hard to make a broad use case here until power consumption drops. The best approach is still to push to the cloud.

    Edge cases like robotics / cars / high availability likely exist though, and could be a big niche.

  • gabbalis@alien.topB
    1 year ago

    I know the question here is about running LLMs on mobile, but that’s building in too many assumptions I think.

    The future of LLM technology is as follows:

    1. Large models learn to do a new task.
    2. Specific tasks get broken down into foundational subtasks.
    3. Foundational subtasks are distilled into memoized code, hardcoded transformers and traditional code.
    4. You no longer use a large model for that subtask; instead you use a highly specialized module that fits on a toaster.

    This loop is going to get faster and faster, and once it’s generally accessible, you’re no longer concerned with what LLMs run on your phone; you’re instead concerned with which specific subtasks can be designed to run on your phone and how to assemble them for your application’s specific needs. At the end of the day, you are not going to need to ask an AGI to fill out API calls.
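
One way to picture the end state of that loop is a router that tries small specialized modules first and falls back to a large model only when none of them applies. The sketch below is an illustration under that assumption: the specialist, the request format, and the `large_model` placeholder are all invented for the example.

```python
import re
from functools import lru_cache
from typing import Callable, List, Optional


def large_model(request: str) -> str:
    # Hypothetical placeholder for an expensive call to a large model.
    return f"[large model handles: {request}]"


@lru_cache(maxsize=1024)
def km_to_miles(request: str) -> Optional[str]:
    """A tiny memoized specialist written as traditional code."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?) km in miles", request)
    if match is None:
        return None  # Not this module's subtask.
    return f"{float(match.group(1)) * 0.621371:.2f} miles"


# Specialists are tried in order; each returns None when it cannot help.
SPECIALISTS: List[Callable[[str], Optional[str]]] = [km_to_miles]


def handle(request: str) -> str:
    for specialist in SPECIALISTS:
        result = specialist(request)
        if result is not None:
            return result
    return large_model(request)


print(handle("10 km in miles"))
print(handle("write a poem about toasters"))
```

Each new specialist added to the list removes one more subtask from the large model's workload, which is the direction the comment describes.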