Mobile-R1

Towards Online Reinforcement Learning for VLM-Driven Mobile Agent via Task-Level Rewards

Abstract

Vision language model-based mobile agents have gained the ability not only to understand complex instructions and mobile screenshots, but also to optimize their action outputs through thinking and reasoning, benefiting from reinforcement learning such as Group Relative Policy Optimization (GRPO). However, existing research centers on offline reinforcement learning training or on online optimization using action-level rewards, which limits the agent's dynamic interaction with the environment. This often results in agents settling into local optima, weakening their capacity for exploration and error correction. To address these challenges, we introduce Mobile-R1, an approach that employs online reinforcement learning with task-level rewards for mobile agents. Our training framework consists of three stages: initial format fine-tuning, online training via action-level rewards, and online training via task-level rewards based on multi-turn trajectories. This strategy is designed to enhance the exploration and error-correction capabilities of Mobile-R1, leading to significant performance improvements. Moreover, we have collected a dataset with 24,521 high-quality manual annotations and established a new benchmark covering 28 Chinese applications and 1,510 instructions.
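For readers unfamiliar with GRPO, the core idea is that each sampled response is scored and its advantage is computed relative to the other responses drawn for the same input, rather than from a learned value function. A minimal sketch of this group-relative normalization (the function name, epsilon, and example rewards are illustrative assumptions, not the paper's implementation):

    import numpy as np

    def group_relative_advantages(rewards, eps=1e-6):
        """Normalize per-rollout rewards within one group sampled for the same input."""
        r = np.asarray(rewards, dtype=np.float64)
        return (r - r.mean()) / (r.std() + eps)

    # Example: four rollouts for the same screenshot/instruction pair.
    print(group_relative_advantages([2.0, 0.0, 1.0, 2.0]))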

Eureka Move!

Our agent is capable of recovering from an incorrect state and returning to the correct action (we call this the eureka move)!

Training Pipeline

Our training framework consists of three stages:

Stage 1: Format Finetuning

Initial format fine-tuning using our dataset.

Stage 2: Action-level Online Training

Single-step GRPO training with action-level rewards to improve format compliance and click accuracy.
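A minimal sketch of what such an action-level reward could look like, combining a format term with a click-accuracy term (the output format, field names, and bounding-box convention below are assumptions for illustration, not the exact reward used in the paper):

    import re

    def action_level_reward(response, gt_action):
        """Score one single-step response: format compliance plus click accuracy."""
        # Format term: the response should contain a reasoning block and an action.
        has_think = bool(re.search(r"<think>.*</think>", response, re.S))
        format_reward = 1.0 if has_think and "action" in response else 0.0

        # Accuracy term: the predicted tap point must fall inside the ground-truth
        # bounding box (x1, y1, x2, y2) of the target UI element.
        accuracy_reward = 0.0
        point = re.search(r"\((\d+),\s*(\d+)\)", response)
        if point and gt_action.get("type") == "click":
            x, y = int(point.group(1)), int(point.group(2))
            x1, y1, x2, y2 = gt_action["bbox"]
            if x1 <= x <= x2 and y1 <= y <= y2:
                accuracy_reward = 1.0

        return format_reward + accuracy_reward

    # Example usage with a hypothetical annotation.
    print(action_level_reward(
        "<think>The hotel tab is at (540, 880).</think> action: click (540, 880)",
        {"type": "click", "bbox": (500, 840, 600, 920)},
    ))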

Stage 3: Task-level Online Training

Online training with task-level rewards based on multi-turn trajectories to improve generalization.
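A minimal sketch of how a task-level reward over a whole trajectory could be assembled, with a trajectory-level format term and a completion term from an external judge (the judge interface and field names here are hypothetical assumptions, not the paper's exact design):

    from typing import Callable, Dict, List

    def task_level_reward(
        trajectory: List[Dict],       # one dict per step, e.g. {"response": "..."}
        instruction: str,
        judge_completion: Callable[[str, List[Dict]], bool],  # assumed external judge (e.g. an MLLM)
    ) -> float:
        """Reward an entire multi-turn trajectory rather than a single action."""
        # Format term: every step in the trajectory must follow the required output format.
        all_formatted = all("<think>" in step["response"] for step in trajectory)
        format_reward = 1.0 if all_formatted else 0.0

        # Completion term: the judge decides whether the instruction was fulfilled,
        # so credit is assigned to the trajectory as a whole.
        completion_reward = 1.0 if judge_completion(instruction, trajectory) else 0.0

        return format_reward + completion_reward

Because the reward is attached to the whole trajectory rather than to any single action, an intermediate mistake can still be recovered from within the same episode, which is what enables the eureka move described above.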

Action-Level & Task-Level Rewards

Comparison of Action-Level and Task-Level Rewards

Figure (a) illustrates an agent trained with action-level rewards, which is encouraged to “think longer” before selecting each single-step action. In contrast, Figure (b) depicts an agent trained with task-level rewards, which explores and adjusts its trajectory over multi-turn interactions with the environment.

Main Results

Static evaluation results of Mobile-R1.

Case Study

Comparison of Mobile-R1 and Qwen2.5-VL-3B-Instruct

    Task: "Open Fliggy, enter the hotel package, enter the popular live broadcast, find Fliggy Super VIP, and follow the anchor"

  • Qwen2.5-VL-3B-Instruct: Failed at the second step.
  • Mobile-R1: Successfully completed the whole task.

Datasets & Benchmark

We open-source a high-quality subset of our training data and our evaluation sets.

Trajectory Dataset

Distribution of trajectory length.