Enabling AI to navigate and interact with smartphone UIs is hard, requiring a model that goes beyond mere text processing to handle intricate visual and interactive tasks. A new paper proposes MM-Navigator, an agent based on GPT-4V that can use an iPhone and make purchases on the Amazon app.
The agent can “understand” and interact with smartphone interfaces in a much more human-like manner than previous attempts.
The key innovation lies in GPT-4V’s ability to process both text and image inputs. The agent takes a user’s text instruction and the current screen image, then outputs a description of the next action, including precise screen locations. The researchers improved interaction accuracy by adding numeric tags to the interactive elements on each screen (a Set-of-Mark-style annotation), which GPT-4V references to indicate exactly which element to act on.
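To make the idea concrete, here is a rough sketch of what one step of that tag-then-prompt loop could look like. This is not the authors’ code: it assumes the OpenAI Python SDK, a GPT-4V-class model, and a hypothetical upstream UI parser that supplies the numbered element descriptions; the prompt wording and element names are illustrative.

```python
# Minimal sketch of Set-of-Mark-style prompting with a GPT-4V-class model.
# NOT the paper's implementation: element list, model name, and prompt
# wording are assumptions for illustration only.
import base64
from openai import OpenAI

client = OpenAI()

def encode_screenshot(path: str) -> str:
    """Base64-encode a screenshot so it can be passed as an image_url."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Interactive elements detected by some upstream UI parser (hypothetical values);
# each gets a numeric tag, which would also be drawn onto the screenshot.
elements = {
    1: "Search bar",
    2: "'milk frother' result card",
    3: "Add to Cart button",
}

def next_action(instruction: str, screenshot_path: str) -> str:
    """Ask the model which numbered element to act on next."""
    tags = "\n".join(f"[{i}] {desc}" for i, desc in elements.items())
    image_b64 = encode_screenshot(screenshot_path)
    response = client.chat.completions.create(
        model="gpt-4o",  # any multimodal GPT-4V-class model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    f"Task: {instruction}\n"
                    f"The screenshot has numeric tags on interactive elements:\n{tags}\n"
                    "Reply with the next action, e.g. 'Tap [3]' or 'Scroll down'."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example: one step of the Amazon purchase task described in the paper.
# print(next_action("Buy a milk frother under $30", "screen.png"))
```

A full agent would wrap this in a loop: execute the returned action on the device (via an automation bridge of some kind), grab a fresh screenshot, re-tag the elements, and query the model again until the task is done.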
Testing on iOS and Android datasets showed promising results. GPT-4V’s actions were correct 75% of the time for iOS screens, a notable achievement in visual grounding. A standout example was its successful navigation through various apps to purchase a milk frother on Amazon within a set budget.
There are limitations:
- False negatives often arose from dataset or annotation issues: in some cases GPT-4V’s prediction was actually correct but was marked wrong, either because the Set-of-Mark annotations were parsed inaccurately or because the dataset’s ground-truth labels were imperfect.
- True negatives highlighted the limits of the zero-shot setup: without in-context examples of typical user behavior, the model tended to prefer clicking over scrolling, producing actions that don’t match what a human would usually do.
If these limitations can be reduced, I could see this being useful for automating QA testing or assisting individuals with disabilities. This research underscores the complexities of developing AI for such sophisticated tasks and emphasizes the importance of accurate data and adaptable testing methods.
TLDR: MM-Navigator is an agent that can navigate a smartphone, combining text and image processing to interact with GUIs. Promising but still has plenty of flaws.
Full summary here. Paper is here.
Hey all, I am the first author of this paper, feel free to connect with us if you’ve got some ideas for pushing it into a cool product.
Hey this is really cool. Would love to hear if you felt my writeup was good or if there’s anything I can improve/change :)
Very interesting! Thanks for sharing this. Do you know if any studies have been done regarding the feasibility and/or helpfulness of opening up accessibility bridges, such as screen-reader interfaces, for blind and visually impaired users?
I’ve been thinking about how one would give AIs access to OS-level interactions, and screen-reader accessibility bridges seem like a potentially viable fit. But based on this article, if LLMs can navigate UIs directly, maybe the additional bridging layer isn’t needed. Wdyt?
I guess the potential isn’t only in helping visually impaired users (definitely important here) but also in freeing everyone from daily tasks on our phones. A better Siri.