DevOps-Gym
Benchmarking AI Agents in Software DevOps Cycle

Yuheng Tang*,1, Kaijie Zhu*,1, Bonan Ruan2, Chuqi Zhang2, Michael Yang1, Hongwei Li1, Suyue Guo1, Tianneng Shi3, Zekun Li4, Christopher Kruegel1, Giovanni Vigna1, Dawn Song3, William Yang Wang1, Lun Wang4, Yangruibo Ding5, Zhenkai Liang2, Wenbo Guo1
1UC Santa Barbara, 2National University of Singapore, 3UC Berkeley, 4Google DeepMind, 5UC Los Angeles
*Indicates Equal Contribution

The first end-to-end benchmark for evaluating AI agents across the whole DevOps cycle: build and configuration, monitoring, issue resolving, and test generation. DevOps-Gym specifically targets the long-horizon tool calling and long-context reasoning that agentic tasks demand, featuring 700+ real-world tasks collected from 30+ Java and Go projects.

Leaderboard

Rank Agent Model Build & Config (%) Monitoring (%) Issue Resolving (%) Test Generation (%) Overall (%)

The leaderboard shows evaluation results on DevOps-Gym for different agent frameworks and LLMs. Results are sorted by Overall Accuracy (average of the four stages). The best result for each stage is marked in bold.

Build & Config: Success rate on build and configuration tasks
Monitoring: Success rate on monitoring tasks
Issue Resolving: Success rate on issue resolving tasks
Test Generation: Success rate on test generation tasks
Overall: Average accuracy across all four stages, calculated as (Build & Config + Monitoring + Issue Resolving + Test Generation) / 4
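
For example, purely hypothetical stage scores of 40%, 12%, 24%, and 20% would give an overall score of (40 + 12 + 24 + 20) / 4 = 24%.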

Overview of DevOps-Gym

DevOps-Gym is the first end-to-end benchmark for evaluating AI agents across the whole DevOps cycle: build and configuration, monitoring, issue resolving, and test generation. It specifically targets the long-horizon tool calling and long-context reasoning that agentic tasks demand, featuring 700+ real-world tasks collected from 30+ Java and Go projects through a semi-automated mechanism with rigorous expert validation.

DevOps-Gym Overview

Task Categories

Build and Configuration

We evaluate two categories of challenges:

  • Repair tasks: Address five prevalent error types: dependency version conflicts, build misconfiguration, compilation errors, tool-chain mismatches, and dependency resource unavailability. Agents must diagnose build failures, identify root causes, and apply targeted fixes.
  • Implementation tasks: Incorporate new functionality, including build-system migration (e.g., Maven to Gradle), target release, plugin integration, and dependency version upgrades.

Input: Repository with failing build configuration (repair) or specification for new build setup (implementation); terminal access with build tools (Maven, Gradle, npm), text utilities, and package managers.
Output: For repair tasks: patch (diff format) fixing build failure; for implementation tasks: complete configuration files meeting specifications.
Evaluation: Build process executes without errors and built artifacts pass dedicated test cases.
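
To make the pass criterion concrete, the sketch below shows one way such a check could be scripted. It is our illustration, not the official DevOps-Gym harness; the repository path and Maven goals are assumptions, and a Gradle or Go project would substitute its own build commands.

    // eval_build.go: minimal sketch of the "build succeeds, then dedicated tests
    // pass" check described above. Repository path and Maven goals are
    // hypothetical, not the official DevOps-Gym harness.
    package main

    import (
        "fmt"
        "os"
        "os/exec"
    )

    // runIn runs a command inside the repository and reports whether it exited cleanly.
    func runIn(repo, name string, args ...string) bool {
        cmd := exec.Command(name, args...)
        cmd.Dir = repo
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr
        return cmd.Run() == nil
    }

    func main() {
        repo := "./workspace/project" // hypothetical path to the repaired repository
        // Step 1: the build must complete without errors.
        if !runIn(repo, "mvn", "-q", "-DskipTests", "package") {
            fmt.Println("FAIL: build error")
            return
        }
        // Step 2: the built artifacts must pass the dedicated test cases.
        if !runIn(repo, "mvn", "-q", "test") {
            fmt.Println("FAIL: dedicated tests did not pass")
            return
        }
        fmt.Println("PASS")
    }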

Monitoring

Monitoring tasks require agents to capture runtime execution and system states using command-line tools, and to detect performance and resource utilization anomalies. We focus on performance and resource anomalies rather than immediate crashes, since such anomalies require careful analysis to uncover. We consider two types:

  • Resource usage problems: Memory leaks, disk leaks, system handle exhaustion, and CPU spikes that gradually degrade system reliability.
  • Performance degradations: I/O bottlenecks and inefficient SQL query handling that degrade user experience without causing immediate failures.

Input: Containerized environment running an application with bugs; terminal access (top, free, ps, netstat) with no access to source code, configuration files, or trigger scripts.
Output: Structured diagnostic report: specific issue type (e.g., memory_leak) and supporting evidence with quantitative metrics (e.g., memory growth rate, affected process ID).
Evaluation: Binary accuracy requiring agents to correctly identify the specific type of anomaly.
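
As a concrete, deliberately simplified illustration of the expected workflow, the sketch below polls a process's resident memory once per second via ps and reports the growth rate, the kind of quantitative evidence the diagnostic report asks for. The PID, sampling window, and leak threshold are assumptions for illustration, not benchmark defaults.

    // monitor_rss.go: illustrative sketch of continuous memory sampling.
    // The PID, 60-second window, and growth threshold are hypothetical.
    package main

    import (
        "fmt"
        "os/exec"
        "strconv"
        "strings"
        "time"
    )

    // rssKB returns the resident set size (in KB) of a process, read via ps.
    func rssKB(pid int) (int, error) {
        out, err := exec.Command("ps", "-o", "rss=", "-p", strconv.Itoa(pid)).Output()
        if err != nil {
            return 0, err
        }
        return strconv.Atoi(strings.TrimSpace(string(out)))
    }

    func main() {
        pid := 1234                 // hypothetical PID of the monitored application
        samples := 60               // observe for one minute
        interval := 1 * time.Second // fine-grained sampling rather than a one-shot snapshot

        first, err := rssKB(pid)
        if err != nil {
            fmt.Println("cannot read RSS:", err)
            return
        }
        last := first
        for i := 0; i < samples; i++ {
            time.Sleep(interval)
            if v, err := rssKB(pid); err == nil {
                last = v
            }
        }
        growth := float64(last-first) / float64(samples) // KB per second at a 1s interval
        fmt.Printf("pid=%d rss %d -> %d KB, growth %.1f KB/s\n", pid, first, last, growth)
        if growth > 100 { // hypothetical threshold for flagging a memory_leak
            fmt.Println("evidence consistent with: memory_leak")
        }
    }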

Issue Resolving

Issue resolving requires agents to translate bug descriptions into code fixes, following established methodologies similar to SWE-bench. Agents receive buggy repositories with natural language descriptions and must generate patches that pass fail-to-pass test transitions.

Input: Buggy repository with natural language bug description.
Output: Patch (diff format) that fixes the issue.
Evaluation: Patches must pass fail-to-pass test transitions, ensuring the fix resolves the described issue and maintains existing functionality.
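
As a rough sketch of what the fail-to-pass check means in practice (using a Go repository for illustration; the test name, patch file, and paths are hypothetical, and this is not the benchmark's actual harness):

    // fail_to_pass.go: sketch of a fail-to-pass check for a Go repository.
    // Test name, patch file, and repository path are hypothetical.
    package main

    import (
        "fmt"
        "os/exec"
    )

    // testPasses reports whether the designated test passes in the repository.
    func testPasses(repo, testName string) bool {
        cmd := exec.Command("go", "test", "-run", testName, "./...")
        cmd.Dir = repo
        return cmd.Run() == nil
    }

    func main() {
        repo := "./workspace/project" // hypothetical buggy repository
        test := "TestIssueRegression" // hypothetical fail-to-pass test

        // Before the patch, the designated test is expected to fail.
        if testPasses(repo, test) {
            fmt.Println("unexpected: test already passes on the buggy code")
            return
        }
        // Apply the candidate patch (diff format), then the test must pass.
        apply := exec.Command("git", "apply", "candidate.patch")
        apply.Dir = repo
        if err := apply.Run(); err != nil {
            fmt.Println("patch did not apply:", err)
            return
        }
        if testPasses(repo, test) {
            fmt.Println("resolved: fail-to-pass transition achieved")
        } else {
            fmt.Println("not resolved")
        }
    }

A complete evaluation would additionally rerun the previously passing tests to confirm that the patch maintains existing functionality.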

Test Generation

Test generation requires agents to create regression tests based on bug descriptions to prevent issue recurrence and ensure functionality correctness. This task is more challenging than issue resolving, as it requires reasoning about runtime behavior and how bugs would be triggered during execution.

Input: Bug description and repository context.
Output: Regression test that reproduces the described failure and validates patch correctness.
Evaluation: Generated tests must precisely reproduce the failure described in the issue and subsequently pass on the patched code.
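
For illustration only, a regression test for a Go project might look like the sketch below; the package name, ParseSize function, and expected value are invented for the example and do not come from any benchmark task. The test is written so that it fails on the buggy code and passes once a correct patch is applied.

    // parse_regression_test.go: hypothetical regression test. The config package,
    // ParseSize function, and expected behavior are invented for illustration.
    package config

    import "testing"

    // TestParseSizeHandlesKilobyteSuffix reproduces a (hypothetical) issue where
    // values such as "10k" were rejected instead of being parsed as 10240 bytes.
    func TestParseSizeHandlesKilobyteSuffix(t *testing.T) {
        got, err := ParseSize("10k")
        if err != nil {
            t.Fatalf("ParseSize(%q) returned error: %v", "10k", err)
        }
        if got != 10240 {
            t.Fatalf("ParseSize(%q) = %d, want 10240", "10k", got)
        }
    }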

Key Results

Cross-tool Builds and Configurations Present New Challenges

All models perform poorly in build implementation tasks, especially in migration tasks. Agents struggle with understanding the internal mechanisms of build tools like Maven and goreleaser, as well as their practical usage patterns in real-world projects. This is fundamentally different from fixing a bug in source code, where the context is more self-contained. This result demonstrates that while agents are improving at manipulating source code, they are far from capable of managing the software's build and deployment environment.

The Dynamic Nature of System Monitoring Reveals Critical Agent Failures

Agents perform exceptionally poorly on monitoring tasks, with some state-of-the-art models scoring 0%. This failure stems from three fundamental challenges: (1) monitoring requires continuously processing evolving system state, which quickly exhausts context limits; (2) agents struggle to stay focused on monitoring, becoming distracted by earlier observations and prematurely stopping active monitoring; (3) agents exhibit poor baseline discrimination, generating false positives by misinterpreting normal operational variance as anomalies. These failures reveal that current agents lack the temporal reasoning and sustained attention essential for dynamic system observation.

Unlike on SWE-bench, LLM Agents Perform Poorly at Issue Resolving on DevOps-Gym

The resolve rate drops significantly when moving from Python repositories (SWE-bench) to Java and Go. With the same agent and model combination (OpenHands + Claude-4-Sonnet), the resolve rate reaches 70.4% on SWE-bench Verified yet drops dramatically to 23.87% on our benchmark. This points to a cross-language capability gap, likely driven by the dominance of Python code in training data. Complex compilation processes, dependency management, and environment configuration in Java and Go pose major challenges that even newer models fail to overcome.

Generating High-Quality Tests is Even More Challenging Than Resolving Issues

When using the exact same set of issues for evaluation, the accuracy of generating high-quality tests is notably lower than the issue resolving rate. Test generation requires agents to not only have a static understanding of the repository but also dynamic analysis capabilities to reason about how bugs would be triggered during execution. The agent must also reason about how the bug might be resolved, so that generated tests can both reproduce the failure and validate the patch correctness. In contrast, generating a patch can sometimes be accomplished through more straightforward, static code analysis. These results suggest that while agents are becoming proficient at predicting coding patterns, reasoning about runtime behavior remains significantly more challenging.

Detailed Error Analysis

Build and Configuration Tasks

  • Toolchain and environment limitations (33%): Agents cannot validate or inspect configuration artifacts due to missing validators or schema checkers. Examples include unused-import violations, missing-dependency errors, and malformed build files.
  • Multi-step reasoning failures (23%): Agents often resolve initial errors but lose track of remaining issues, revealing limitations in context retention and iterative "fix-run-verify" loops.
  • Domain-specific knowledge gaps (37%): Failures reflect genuine domain knowledge requirements exceeding current capabilities, such as Maven-to-Gradle migrations requiring understanding of build-system semantics, dependency resolution, and platform-specific constraints. Notably, 17% of failures are "inherently difficult", confirming the benchmark captures real-world, high-difficulty scenarios.

Monitoring Tasks

  • Inadequate methodology (37%): Agents used one-time commands (e.g., top) instead of continuous monitoring (e.g., watch -n 1).
  • Premature conclusion (26%): Agents submitted answers without performing monitoring or completing diagnostic procedures.
  • Insufficient temporal granularity (11%): Agents used overly coarse sampling intervals (10-60s), missing transient anomalies like CPU spikes.
  • Interpretation failure (26%): Agents collected metrics correctly but failed to analyze them accurately.

Citation

If you use this work in your research, please cite the following:

@misc{devopsgym2025,
      title={DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle}, 
      author={[Authors]},
      year={2025},
      eprint={[arXiv ID]},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={[arXiv URL]}, 
}

More

Please check out more of our work: Frontier AI's Impact on the Cybersecurity Landscape, a comprehensive analysis of how frontier AI is reshaping cybersecurity and how we should respond. Also see our Frontier AI Cybersecurity Observatory, a live leaderboard tracking AI's cybersecurity capabilities across attack and defense tasks.