Enabling AI to navigate and interact with smartphone UIs is hard, requiring a model that goes beyond mere text processing to handle intricate visual and interactive tasks. A new paper proposes MM-Navigator, an agent based on GPT-4V that can use an iPhone and make purchases on the Amazon app.
The agent can “understand” and interact with smartphone interfaces in a much more human-like manner than previous attempts.
The key innovation lies in GPT-4V’s ability to process both text and image inputs. The agent takes a user’s text instructions and the current screen image, then outputs a description of the next action, including precise screen locations. The researchers improved interaction accuracy by adding numeric tags to interactive elements on the screen, which GPT-4V references to indicate specific actions.
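To make the idea concrete, here is a minimal sketch of what one step of that loop might look like. This is not the authors' code: the helper names, the prompt wording, and the model choice are my own assumptions, and it presumes a separate (not shown) step has already overlaid numeric Set-of-Mark tags on the screenshot.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_screenshot(path: str) -> str:
    """Base64-encode a screenshot so it can be sent as an image input."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def next_action(instruction: str, screenshot_path: str) -> str:
    """Ask a vision-capable GPT model for the next UI action.

    Assumes the screenshot already has numeric tags drawn over each
    interactive element, so the model can answer with a tag index
    instead of raw pixel coordinates.
    """
    image_b64 = encode_screenshot(screenshot_path)
    prompt = (
        f"Task: {instruction}\n"
        "The screenshot shows numeric tags on interactive elements. "
        "Reply with the single next action, e.g. 'Tap element [7]' or 'Scroll down'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; the paper used GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


# One step of the loop; a real agent would execute the returned action,
# capture a fresh screenshot, and repeat until the task is done.
# print(next_action("Buy a milk frother under $30 on Amazon", "screen_01.png"))
```

The point of the numeric tags is that the model only has to name a tag, which is far easier to ground and to execute than asking it to emit exact pixel coordinates.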
Testing on iOS and Android datasets showed promising results. GPT-4V’s actions were correct 75% of the time for iOS screens, a notable achievement in visual grounding. A standout example was its successful navigation through various apps to purchase a milk frother on Amazon within a set budget.
There are limitations:
- False negatives often arose from dataset or annotation issues: in some cases GPT-4V's prediction is actually correct but gets marked wrong, either because the Set-of-Mark annotations were parsed inaccurately or because the dataset's ground-truth annotations are imperfect.
- True negatives highlighted limitations in the model’s zero-shot testing approach. Without examples to guide its understanding of user action patterns, the model tends to prefer clicking over scrolling, leading to decisions that don’t align with typical human actions.
If these limitations can be reduced, I could see this being useful for automating QA testing or assisting individuals with disabilities. This research underscores the complexities of developing AI for such sophisticated tasks and emphasizes the importance of accurate data and adaptable testing methods.
TLDR: MM-Navigator is an agent that can navigate a smartphone, combining text and image processing to interact with GUIs. Promising but still has plenty of flaws.
Full summary here. Paper is here.
Hey all, I am the first author of this paper, feel free to connect with us if you have ideas for pushing it into a cool product.
Hey this is really cool. Would love to hear if you felt my writeup was good or if there’s anything I can improve/change :)