Enabling AI to navigate and interact with smartphone UIs is hard, requiring a model that goes beyond mere text processing to handle intricate visual and interactive tasks. A new paper proposes MM-Navigator, an agent based on GPT-4V that can use an iPhone and make purchases on the Amazon app.
The agent can “understand” and interact with smartphone interfaces in a much more human-like manner than previous attempts.
The key innovation lies in GPT-4V’s ability to process both text and image inputs. The agent takes a user’s text instructions and the current screen image, then outputs a description of the next action, including precise screen locations. The researchers improved interaction accuracy by adding numeric tags to interactive elements on the screen, which GPT-4V references to indicate specific actions.
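To make the idea concrete, here is a minimal sketch of what one step of that loop might look like. This is not the authors' code: the helper names, the prompt wording, and the model choice are my own assumptions, and it presumes a separate (not shown) step has already overlaid numeric Set-of-Mark tags on the screenshot.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_screenshot(path: str) -> str:
    """Base64-encode a screenshot so it can be sent as an image input."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def next_action(instruction: str, screenshot_path: str) -> str:
    """Ask a vision-capable GPT model for the next UI action.

    Assumes the screenshot already has numeric tags drawn over each
    interactive element, so the model can answer with a tag index
    instead of raw pixel coordinates.
    """
    image_b64 = encode_screenshot(screenshot_path)
    prompt = (
        f"Task: {instruction}\n"
        "The screenshot shows numeric tags on interactive elements. "
        "Reply with the single next action, e.g. 'Tap element [7]' or 'Scroll down'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; the paper used GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


# One step of the loop; a real agent would execute the returned action,
# capture a fresh screenshot, and repeat until the task is done.
# print(next_action("Buy a milk frother under $30 on Amazon", "screen_01.png"))
```

The point of the numeric tags is that the model only has to name a tag, which is far easier to ground and to execute than asking it to emit exact pixel coordinates.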
Testing on iOS and Android datasets showed promising results. GPT-4V’s actions were correct 75% of the time for iOS screens, a notable achievement in visual grounding. A standout example was its successful navigation through various apps to purchase a milk frother on Amazon within a set budget.
There are limitations:
- False negatives often arose from dataset or annotation issues: in some cases GPT-4V's prediction is actually correct but gets marked wrong, either because the Set-of-Mark annotations were parsed inaccurately or because the dataset's ground-truth annotations are imperfect.
- True negatives highlighted limitations in the model’s zero-shot testing approach. Without examples to guide its understanding of user action patterns, the model tends to prefer clicking over scrolling, leading to decisions that don’t align with typical human actions.
If these limitations can be reduced, I could see this being useful for automating QA testing or assisting individuals with disabilities. This research underscores the complexities of developing AI for such sophisticated tasks and emphasizes the importance of accurate data and adaptable testing methods.
TLDR: MM-Navigator is an agent that can navigate a smartphone, combining text and image processing to interact with GUIs. Promising but still has plenty of flaws.
Full summary here. Paper is here.
Hey all, I am the first author of this paper, feel free to connect with us if you have ideas for pushing it into a cool product.
Hey this is really cool. Would love to hear if you felt my writeup was good or if there’s anything I can improve/change :)