WAA (Windows Agent Arena) Leaderboard
Overview
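The original WAA and WAA-V2 use different tasks and evaluators, so they are tracked separately.
Their top scores should not be compared by mixing them in one table. The tables below sort each result by
step budget (15/30/50/100), run type (single run, best-of-N, pass@N, selection, mean),
action/observation space, and evidence level.
Clicking a method name takes you to the original paper, project, or official source.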
- Source data as of: 2026-06-05 (KST)
- Results listed: 129 in total (114 for WAA, 15 for WAA-V2)
- This document reconstructs that snapshot as text; new results may have been added since.
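The result in which the joint lab of Sortech and the Inje University Media Lab took first place among comparable single runs is summarized in the "2026-06-27 Update" section at the end of this document; the experiment design and detailed analysis are covered in the next post.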
How to read these tables
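| Item | Meaning |
|---|---|
| WAA vs WAA-V2 | The tasks and evaluators differ. Do not mix the two in one ranking to compare top scores. |
| single vs best/pass/selection | best-of-N, pass@N, and BJudge selection either pick among multiple candidates or run multiple times, so they must be kept apart from single runs. |
| step budget | Results with different allowed action counts, such as 15/30/50/100, are hard to compare directly. |
| action space | GUI-only, code/Python/Bash, skill library, and tool/API hybrid are different execution conditions. Looking at scores alone blends strict GUI ability with tool-rich agent ability. |
| unknown-step | Some technical reports do not state a step budget. These entries stay in the full table but should be read conservatively. |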
Run types: single (one run) · best-of-N · pass@N · selection/N · avg/mean (average) · official card · other · not stated
Evidence grades: A primary paper/project · B official card/README · C cross-citation · U secondary/needs verification
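The reading rules above can be sketched as a small comparability check. This is only an illustration: the `Entry` record, the `comparable` predicate, and the `best_comparable` helper are hypothetical names introduced here, not part of any WAA tooling, and the sample rows are copied from the tables in this document. Action space is left out because the tables do not list it per row.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical record type for this sketch)."""
    benchmark: str        # "WAA" or "WAA-V2"
    score: float          # success rate in percent
    method: str
    step: Optional[int]   # None when the step budget is unknown
    run: str              # "single", "best-of-N", "pass@N", ...
    evidence: str         # "A", "B", "C", or "U"


def comparable(a: Entry, b: Entry) -> bool:
    """True only if benchmark, run type, and a known step budget all match."""
    return (
        a.benchmark == b.benchmark
        and a.run == b.run
        and a.step is not None
        and a.step == b.step
    )


def best_comparable(entries: Iterable[Entry], benchmark: str, step: int,
                    run: str = "single") -> Optional[Entry]:
    """Highest-scoring entry within one (benchmark, step, run) group, if any."""
    pool = [e for e in entries
            if e.benchmark == benchmark and e.step == step and e.run == run]
    return max(pool, key=lambda e: e.score, default=None)


# A few rows copied from the tables in this document.
ENTRIES = [
    Entry("WAA", 63.50, "OS-Symphony", 50, "single", "A"),
    Entry("WAA", 57.50, "CUA-Skill Agent", 30, "best-of-N", "A"),
    Entry("WAA", 56.48, "EvoCUA-32B", None, "other", "A"),
    Entry("WAA", 50.30, "CUA-Skill Agent", 30, "single", "A"),
    Entry("WAA", 30.50, "UFO-2", 30, "single", "A"),
    Entry("WAA-V2", 36.0, "PC Agent-E", 30, "single", "A"),
]

best = best_comparable(ENTRIES, "WAA", 30)
print(best.method, best.score)  # CUA-Skill Agent 50.3
```

Under these assumptions, the 57.50% best-of-N row is excluded from the 30-step single-run group, and entries with an unknown step budget are never treated as comparable to anything.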
Original WAA Leaderboard
| # | Score | Method | Model | Step | Run | Evidence |
|---|---|---|---|---|---|---|
| 1 | 63.50% | OS-Symphony | GPT-5 | 50 | single | A |
| 2 | 62.20% | OS-Symphony | GPT-5-Mini | 50 | single | A |
| 3 | 61.0% | GUI-Pro-Agent / VLAA-GUI | Gemini 3 Flash manager + Seed 1.8 grounder | 100 | single | A |
| 4 | 60.40% | GUI-Pro-Agent / VLAA-GUI | Gemini 3 Flash manager + Seed 1.8 grounder | 50 | single | A |
| 5 | 57.50% | CUA-Skill Agent | GPT-5 | 30 | best-of-N | A |
| 6 | 56.60% | Agent S3 + BJudge | GPT-5 | 100 | selection/N | A |
| 7 | 56.48% | EvoCUA-32B | EvoCUA-32B | unknown | other | A |
| 8 | 54.10% | Agent S3 + BJudge | GPT-5 | 50 | selection/N | A |
| 9 | 52.50% | CoAct-1 | o3 orchestrator + o4-mini programmer + OpenAI computer-use-preview GUI operator | 100 | single | A |
| 10 | 51.20% | GTA1-32B | o3 planner | 100 | official card | C |
| 11 | 50.60% | UI-TARS-2 | UI-TARS-2 | 50 | single | C |
| 12 | 50.60% | GTA1-32B | GPT-5 planner | 100 | official card | B |
| 13 | 50.30% | CUA-Skill Agent | GPT-5 | 30 | single | A |
| 14 | 50.20% | Agent S3 | GPT-5 | 100 | single | A |
| 15 | 49.20% | GTA1-7B-2507 | GPT-5 planner | 100 | official card | B |
| 16 | 49.0% | Agent S3 | GPT-5 | 50 | single | C |
| 17 | 47.90% | GTA1-7B-2507 | o3 planner | 100 | official card | B |
| 18 | 47.33% | Jedi-7B | GPT-4o planner + Jedi-7B grounder | 100 | pass@N | A |
| 19 | 46.80% | STEVE-R1-SFT | STEVE-R1-SFT 7B | 20 | pass@N | B |
| 20 | 46.67% | Jedi-3B | GPT-4o planner + Jedi-3B grounder | 100 | pass@N | A |
| 21 | 46.0% | Jedi-7B | GPT-4o planner + Jedi-7B grounder | 50 | pass@N | A |
| 22 | 45.95% | OS-Symphony | Qwen3-VL-32B-Thinking | 50 | single | A |
| 23 | 45.30% | OS-Symphony | Qwen3-VL-32B-Instruct | 50 | single | A |
| 24 | 44.76% | GUI-Owl-1.5 | 32B-Instruct | unknown | single | A |
| 25 | 44.13% | GUI-Owl-1.5 | 32B-Thinking | unknown | single | A |
| 26 | 44.0% | Jedi-3B | GPT-4o planner + Jedi-3B grounder | 50 | pass@N | A |
| 27 | 43.50% | CoAct-1 | o3 + o4-mini + OpenAI CUA | 50 | single | A |
| 28 | 42.90% | Qwen3-VL-32B-Thinking | Qwen3-VL-32B-Thinking base | unknown | other | B |
| 29 | 42.67% | Jedi-7B | GPT-4o planner + Jedi-7B grounder | 15 | pass@N | A |
| 30 | 42.10% | UI-TARS-1.5 | UI-TARS-1.5-7B | 50 | single | A |
| 31 | 41.33% | Jedi-3B | GPT-4o planner + Jedi-3B grounder | 15 | pass@N | A |
| 32 | 39.10% | RoTS-32B | Qwen2.5-VL-32B fine-tuned via RoTS | 50 | avg/mean | A |
| 33 | 37.40% | Operator | OpenAI Operator | 50 | other | U |
| 34 | 35.90% | RoTS-32B | Qwen2.5-VL-32B fine-tuned via RoTS | 15 | avg/mean | A |
| 35 | 35.07% | GUI-Owl-1.5 | 8B-Thinking | unknown | single | A |
| 36 | 34.90% | Dyna-Think + DDT | Qwen2.5-32B-Instruct-based Dyna-Think | 30 | avg/mean | A |
| 37 | 33.80% | ToolCUA-8B | ToolCUA-8B | 50 | avg/mean | A |
| 38 | 33.70% | Jedi-7B | GPT-4o planner + Jedi-7B grounder | 100 | avg/mean | B |
| 39 | 33.03% | Jedi-3B | GPT-4o planner + Jedi-3B grounder | 100 | avg/mean | A |
| 40 | 32.90% | Jedi-7B | GPT-4o planner + Jedi-7B grounder | 50 | avg/mean | A |
| 41 | 32.80% | Dyna-Think + DDT | Qwen2.5-32B-Instruct-based Dyna-Think | 30 | avg/mean | A |
| 42 | 32.23% | VistaGUI | GPT-4o | 50 | single | A |
| 43 | 32.10% | Qwen3-VL-235B-A22B | Qwen3-VL-235B-A22B | 50 | avg/mean | A |
| 44 | 31.70% | Qwen3-VL-32B-Instruct | Qwen3-VL-32B-Instruct | 50 | single | A |
| 45 | 31.66% | GUI-Owl-1.5 | 8B-Instruct | unknown | single | A |
| 46 | 31.33% | Jedi-3B | GPT-4o planner + Jedi-3B grounder | 50 | avg/mean | A |
| 47 | 31.20% | WorldGUI-Agent | likely Claude-3.5-Sonnet powered | unknown | other | A |
| 48 | 30.76% | ANCHOR | Qwen3-VL-8B | unknown | single | A |
| 49 | 30.50% | UFO-2 | o1 | 30 | single | A |
| 50 | 30.20% | Jedi-7B | GPT-4o planner + Jedi-7B grounder | 15 | avg/mean | A |
| 51 | 29.80% | Agent S2 | Claude 3.7 Sonnet | 15 | single | A |
| 52 | 29.44% | GUI-Owl-1.5 | 4B-Instruct | unknown | single | A |
| 53 | 29.06% | Jedi-3B | GPT-4o planner + Jedi-3B grounder | 15 | avg/mean | A |
| 54 | 28.40% | Dyna-Think + RFT | Qwen2.5-32B-Instruct-based Dyna-Think | 30 | avg/mean | A |
| 55 | 28.20% | RoTS-7B | Qwen2.5-VL-7B fine-tuned via RoTS | 50 | avg/mean | A |
| 56 | 27.90% | UFO-2 | GPT-4o | 30 | single | A |
| 57 | 27.90% | PC Agent-E | PC Agent-E | 30 | single | A |
| 58 | 27.80% | COLA | GPT-4o | 20 | single | A |
| 59 | 27.47% | ANCHOR ablation: Task-Driven | Qwen3-VL-8B | unknown | single | A |
| 60 | 26.90% | Dyna-Think R1 | R1 baseline | 30 | avg/mean | A |
| 61 | 26.90% | Dyna-Think DIT(R1) | DIT(R1) | 30 | avg/mean | A |
| 62 | 26.40% | Qwen3-VL-8B-Instruct | Qwen3-VL-8B-Instruct | 50 | avg/mean | A |
| 63 | 25.78% | GUI-Owl-1.5 | 2B-Instruct | unknown | single | A |
| 64 | 25.30% | UFO-2-base | o1 | 30 | single | A |
| 65 | 24.90% | RoTS-7B | Qwen2.5-VL-7B fine-tuned via RoTS | 15 | avg/mean | A |
| 66 | 24.20% | ScaleCUA-32B | ScaleCUA-32B | 15 | other | A |
| 67 | 24.20% | ScaleCUA-32B | ScaleCUA-32B | 50 | other | C |
| 68 | 24.0% | W&L SFT | UI-TARS-1.5-7B SFT with W&L IDM-labeled video data | 15 | single | A |
| 69 | 23.90% | Dyna-Think Qwen-32B baseline | Qwen-32B | 30 | avg/mean | A |
| 70 | 23.40% | UFO-2-base | GPT-4o | 30 | single | A |
| 71 | 23.07% | ANCHOR ablation: Zero-shot | Qwen3-VL-8B | unknown | single | A |
| 72 | 23.0% | STEVE | Ours-G / GPT-4o | unknown | single | A |
| 73 | 22.30% | STEVE-R1-SFT | STEVE-R1-SFT 7B | 60 | avg/mean | B |
| 74 | 21.70% | UltraCUA-7B | UltraCUA-7B | 15 | single | A |
| 75 | 21.40% | CoAct-1 | o3 + o4-mini + OpenAI CUA | 15 | single | A |
| 76 | 21.40% | ScaleCUA-32B | ScaleCUA-32B | 15 | other | C |
| 77 | 20.90% | Dyna-Think + vanilla Dyna | Qwen2.5-32B-Instruct-based Dyna | 30 | avg/mean | A |
| 78 | 20.80% | OpenAI Operator / computer-use | computer-use / Operator | 30 | single | A |
| 79 | 20.80% | OpenAI Operator / computer-use | computer-use / Operator | 50 | single | A |
| 80 | 20.70% | ScaleCUA-7B | ScaleCUA-7B | 50 | other | C |
| 81 | 20.10% | STEVE-R1-SFT | STEVE-R1-SFT 7B | 40 | avg/mean | B |
| 82 | 19.50% | UFO / OmniAgent | GPT-4o / GPT-4V variants | 30 | single | A |
| 83 | 19.50% | NAVI | GPT-4V-1106 + UIA + OmniParser | 15 | single | A |
| 84 | 19.30% | UI-TARS-7B-DPO | UI-TARS-7B-DPO | 60 | avg/mean | B |
| 85 | 18.20% | Agent S | GPT-4o | 15 | single | A |
| 86 | 18.10% | UI-TARS-1.5-7B | Qwen2.5VL-FT | 15 | single | A |
| 87 | 18.0% | ScaleCUA-7B | ScaleCUA-7B | 15 | other | C |
| 88 | 18.0% | InternVL3.5 | InternVL3.5-241B-A28B | 50 | not stated | A |
| 89 | 17.80% | STEVE-R1 UI-TARS baseline | UI-TARS-7B-DPO | 40 | avg/mean | B |
| 90 | 17.50% | STEVE-R1-SFT | STEVE-R1-SFT 7B | 20 | avg/mean | B |
| 91 | 16.30% | ANCHOR | GLM-4.1V-9B | unknown | single | A |
| 92 | 15.70% | UI-TARS1 | Qwen2VL-FT | 50 | other | U |
| 93 | 15.40% | STEVE-R1 UI-TARS baseline | UI-TARS-7B-DPO | 20 | avg/mean | B |
| 94 | 15.22% | ANCHOR | Qwen2.5-VL-7B | unknown | single | A |
| 95 | 14.50% | InternVL3.5 | InternVL3.5-38B | 50 | not stated | A |
| 96 | 14.20% | STEVE | Ours-KTO 7B | unknown | single | A |
| 97 | 14.10% | ANCHOR ablation: Task-Driven | Qwen2.5-VL-7B | unknown | single | A |
| 98 | 13.50% | OpenCUA-7B / Qwen2-VL-7B with OpenCUA data | OpenCUA-7B / Qwen2-VL-7B | 15 | single | A |
| 99 | 13.30% | NAVI | GPT-4o + UIA + proprietary OCR/grounding | 15 | single | A |
| 100 | 13.19% | ANCHOR ablation: Task-Driven | GLM-4.1V-9B | unknown | single | A |
| 101 | 12.90% | W&L SFT w/ TongUI | UI-TARS-1.5-7B SFT with TongUI-labeled video data | 15 | single | A |
| 102 | 12.50% | InternVL3.5 | InternVL3.5-14B | 50 | not stated | A |
| 103 | 11.80% | Qwen2.5-VL-72B | Qwen2.5-VL-72B | 15 | other | A |
| 104 | 11.0% | InternVL3.5 | InternVL3.5-20B-A4B | 50 | not stated | A |
| 105 | 11.0% | InternVL3.5 | InternVL3.5-30B-A3B | 50 | not stated | A |
| 106 | 10.50% | InternVL3.5 | InternVL3.5-8B | 50 | not stated | A |
| 107 | 10.40% | Kimi-VL | Kimi-VL | 15 | official card | B |
| 108 | 9.70% | Qwen2.5-VL-72B | Qwen2.5-VL-72B | 50 | other | A |
| 109 | 9.70% | InternVL3.5 | InternVL3.5-4B | 50 | not stated | A |
| 110 | 7.10% | Claude 3.7 Sonnet | Claude 3.7 Sonnet | 15 | other | A |
| 111 | 7.10% | STEVE | Ours-SFT 7B | unknown | single | A |
| 112 | 6.40% | Claude 3.7 Sonnet | Claude 3.7 Sonnet | 50 | other | A |
| 113 | 5.49% | ANCHOR ablation: Zero-shot | GLM-4.1V-9B | unknown | single | A |
| 114 | 4.39% | ANCHOR ablation: Zero-shot | Qwen2.5-VL-7B | unknown | single | A |
WAA-V2 Leaderboard
WAA-V2 has a different task set from the original WAA, so it is not compared directly with the table above.
| # | Score | Method | Model | Step | Run | Evidence |
|---|---|---|---|---|---|---|
| 1 | 36.0% | PC Agent-E | PC Agent-E | 30 | single | A |
| 2 | 35.40% | Claude 3.7 Sonnet | Claude 3.7 Sonnet + thinking | 30 | single | A |
| 3 | 32.60% | Claude 3.7 Sonnet | Claude 3.7 Sonnet | 30 | single | A |
| 4 | 31.40% | PC Agent-E | PC Agent-E | 50 | single | A |
| 5 | 26.90% | PC Agent-E | PC Agent-E | 15 | single | A |
| 6 | 26.90% | Human Data + Direct Distillation | PC Agent-E ablation | 30 | single | A |
| 7 | 26.20% | UI-TARS-72B-DPO | UI-TARS-72B-DPO | 30 | single | A |
| 8 | 26.20% | Direct Distillation | PC Agent-E ablation | 30 | single | A |
| 9 | 21.30% | UI-TARS-1.5-7B | UI-TARS-1.5-7B | 30 | single | A |
| 10 | 14.90% | Qwen2.5-VL-72B | Qwen2.5-VL-72B | 30 | single | A |
| 11 | 11.90% | Qwen2.5-VL-72B | Qwen2.5-VL-72B | 50 | single | A |
| 12 | 11.30% | Qwen2.5-VL-72B | Qwen2.5-VL-72B | 15 | single | A |
| 13 | 6.40% | PC Agent-E-7B | PC Agent-E 7B | 30 | single | A |
| 14 | 5.0% | Qwen2.5-VL-7B | Qwen2.5-VL-7B | 30 | single | A |
| 15 | 2.10% | GPT-4o | GPT-4o | 30 | single | A |
2026-06-27 Update: Sortech & Inje University joint experiment takes first place among comparable single runs
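As of June 27, 2026, the joint Sortech & Inje University experiment recorded the highest comparable single-run score on the original WAA.
Using an operational configuration that combines a Gemini 3.1 Pro backend with safe-tool orchestration and a verification/recovery pipeline,
it ran all 154 tasks in a single fresh run and reached 98.906 / 154 = 64.22%,
which exceeds the previous best comparable single-run score in the snapshot above (OS-Symphony · GPT-5, 63.5%).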
| # | Score | Method (backend/scaffold) | Step | Run |
|---|---|---|---|---|
| 1 | 64.22% | Gemini 3.1 Pro + Safe Tools + Pipeline | 15 | single |
| 2 | 63.5% | OS-Symphony (GPT-5), previous best | 50 | single |
| 3 | 62.2% | OS-Symphony (GPT-5-Mini) | 50 | single |
| 4 | 61.0% | GUI-Pro-Agent / VLAA-GUI (Gemini 3 Flash + Seed 1.8) | 100 | single |
| 5 | 60.4% | GUI-Pro-Agent / VLAA-GUI | 50 | single |
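The headline figure and its margins over the earlier single-run results can be checked with a few lines of arithmetic. This is a minimal sketch using only numbers quoted in this section; the variable names are illustrative.

```python
# Figures quoted in this update.
total_score = 98.906   # summed task scores over the run
num_tasks = 154

score_pct = round(total_score / num_tasks * 100, 2)
print(score_pct)  # 64.22

# Best earlier single-run scores by step budget, taken from the table above.
prior_best_single = {50: 63.5, 100: 61.0}

# Margin of the new result over each earlier best, in percentage points.
margins = {step: round(score_pct - best, 2)
           for step, best in prior_best_single.items()}
print(margins)  # {50: 0.72, 100: 3.22}
```

The margin over the 50-step entry is under one percentage point, so the step budget and run conditions listed in the table matter when reading this ranking.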
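To be clear about the premises: this is not a model-only comparison but a comparison of operational configurations that include the model, safe tools, pipeline, and Windows runtime. The result was obtained with the WAA runtime stabilization patch enabled, and there were zero route policy violations.
At 64.22% with a 15-step budget, the score exceeds both the best 50-step score (63.5%) and the best 100-step score (61.0%), so it also sits at the top along the step budget axis.
The experiment design, per-domain scores, the rationale for pipeline intervention, and limitations are covered in detail in the next post.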
→ Details: WAA Leaderboard #1: Gemini 3.1 Pro + Safe Tools + Pipeline (second post)