Mobile-R1

Towards Online Reinforcement Learning for VLM-Driven Mobile Agent via Task-Level Rewards

Abstract

Vision language model-based mobile agents have gained the ability not only to understand complex instructions and mobile screenshots, but also to optimize their action outputs through thinking and reasoning, benefiting from reinforcement learning such as Group Relative Policy Optimization (GRPO). However, existing research centers on offline reinforcement learning training or on online optimization using action-level rewards, which limits the agent's dynamic interaction with the environment. This often results in agents settling into local optima, weakening their capacity for exploration and error correction. To address these challenges, we introduce Mobile-R1, an approach that employs online reinforcement learning with task-level rewards for mobile agents. Our training framework consists of three stages: initial format fine-tuning, online training via action-level rewards, and online training via task-level rewards based on multi-turn trajectories. This strategy is designed to enhance the exploration and error-correction capabilities of Mobile-R1, leading to significant performance improvements. Moreover, we have collected a dataset with 24,521 high-quality manual annotations and established a new benchmark covering 28 Chinese applications and 1,510 instructions.
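For readers unfamiliar with GRPO, the core idea is that each sampled response is scored and its advantage is computed relative to the other responses drawn for the same input, rather than from a learned value function. A minimal sketch of this group-relative normalization (the function name, epsilon, and example rewards are illustrative assumptions, not the paper's implementation):

    import numpy as np

    def group_relative_advantages(rewards, eps=1e-6):
        """Normalize per-rollout rewards within one group sampled for the same input."""
        r = np.asarray(rewards, dtype=np.float64)
        return (r - r.mean()) / (r.std() + eps)

    # Example: four rollouts for the same screenshot/instruction pair.
    print(group_relative_advantages([2.0, 0.0, 1.0, 2.0]))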

Eureka Move!

Our agent is capable of recovering from an incorrect state and returning to the correct action (we call this the eureka move)!

Training Pipeline

Our training framework consists of three stages:

Stage 1: Format Finetuning

Initial format fine-tuning using our dataset.

Stage 2: Action-level Online Training

Single-step GRPO training with action-level rewards to improve format compliance and click accuracy.
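A minimal sketch of what such an action-level reward could look like, combining a format term with a click-accuracy term (the output format, field names, and bounding-box convention below are assumptions for illustration, not the exact reward used in the paper):

    import re

    def action_level_reward(response, gt_action):
        """Score one single-step response: format compliance plus click accuracy."""
        # Format term: the response should contain a reasoning block and an action.
        has_think = bool(re.search(r"<think>.*</think>", response, re.S))
        format_reward = 1.0 if has_think and "action" in response else 0.0

        # Accuracy term: the predicted tap point must fall inside the ground-truth
        # bounding box (x1, y1, x2, y2) of the target UI element.
        accuracy_reward = 0.0
        point = re.search(r"\((\d+),\s*(\d+)\)", response)
        if point and gt_action.get("type") == "click":
            x, y = int(point.group(1)), int(point.group(2))
            x1, y1, x2, y2 = gt_action["bbox"]
            if x1 <= x <= x2 and y1 <= y <= y2:
                accuracy_reward = 1.0

        return format_reward + accuracy_reward

    # Example usage with a hypothetical annotation.
    print(action_level_reward(
        "<think>The hotel tab is at (540, 880).</think> action: click (540, 880)",
        {"type": "click", "bbox": (500, 840, 600, 920)},
    ))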

Stage 3: Task-level Online Training

Online training with task-level rewards based on multi-turn trajectories to improve generalization.
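A minimal sketch of how a task-level reward over a whole trajectory could be assembled, with a trajectory-level format term and a completion term from an external judge (the judge interface and field names here are hypothetical assumptions, not the paper's exact design):

    from typing import Callable, Dict, List

    def task_level_reward(
        trajectory: List[Dict],       # one dict per step, e.g. {"response": "..."}
        instruction: str,
        judge_completion: Callable[[str, List[Dict]], bool],  # assumed external judge (e.g. an MLLM)
    ) -> float:
        """Reward an entire multi-turn trajectory rather than a single action."""
        # Format term: every step in the trajectory must follow the required output format.
        all_formatted = all("<think>" in step["response"] for step in trajectory)
        format_reward = 1.0 if all_formatted else 0.0

        # Completion term: the judge decides whether the instruction was fulfilled,
        # so credit is assigned to the trajectory as a whole.
        completion_reward = 1.0 if judge_completion(instruction, trajectory) else 0.0

        return format_reward + completion_reward

Because the reward is attached to the whole trajectory rather than to any single action, an intermediate mistake can still be recovered from within the same episode, which is what enables the eureka move described above.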

Action-Level & Task-Level Rewards

Comparison of Action-Level and Task-Level Rewards

Figure (a) illustrates an agent trained with action-level rewards, which is encouraged to “think longer” before selecting each single-step action. In contrast, Figure (b) depicts an agent trained with task-level rewards, which explores and adjusts its trajectory over multi-turn interactions with the environment.

Main Results

Static evaluation results of Mobile-R1.

Case Study

Comparison of Mobile-R1 and Qwen2.5-VL-3B-Instruct

    Task: "Open Fliggy, enter the hotel package, enter the popular live broadcast, find Fliggy Super VIP, and follow the anchor"

  • Qwen2.5-VL-3B-Instruct: Failed at the second step.
  • Mobile-R1: Successfully completed the whole task.

Datasets & Benchmark

We open-source a high-quality subset of our training data and our evaluation sets.

Trajectory Dataset

Distribution of trajectory length.