DevOps-Gym
Benchmarking AI Agents in Software DevOps Cycle

Yuheng Tang*,1, Kaijie Zhu*,1, Bonan Ruan2, Chuqi Zhang2, Michael Yang1, Hongwei Li1, Suyue Guo1, Tianneng Shi3, Zekun Li4, Christopher Kruegel1, Giovanni Vigna1, Dawn Song3, William Yang Wang1, Lun Wang4, Yangruibo Ding5, Zhenkai Liang2, Wenbo Guo1
1UC Santa Barbara, 2National University of Singapore, 3UC Berkeley, 4Google DeepMind, 5UC Los Angeles
*Indicates Equal Contribution

The first end-to-end benchmark for evaluating AI agents across the whole DevOps cycle: build and configuration, monitoring, issue resolving, and test generation. DevOps-Gym specifically targets the long-horizon tool calling and long-context reasoning that agentic tasks demand, featuring 700+ real-world tasks collected from 30+ Java and Go projects.

Leaderboard

Rank Agent Model Build & Config (%) Monitoring (%) Issue Resolving (%) Test Generation (%) Overall (%)

The leaderboard shows evaluation results on DevOps-Gym for different agent frameworks and LLMs. Results are sorted by Overall Accuracy (average of the four stages). The best result for each stage is marked in bold.

Build & Config: Success rate on build and configuration tasks
Monitoring: Success rate on monitoring tasks
Issue Resolving: Success rate on issue resolving tasks
Test Generation: Success rate on test generation tasks
Overall: Average accuracy across all four stages, calculated as (Build & Config + Monitoring + Issue Resolving + Test Generation) / 4
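
For example, purely hypothetical stage scores of 40%, 12%, 24%, and 20% would give an overall score of (40 + 12 + 24 + 20) / 4 = 24%.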

Overview of DevOps-Gym

DevOps-Gym is the first end-to-end benchmark for evaluating AI agents across the whole DevOps cycle: build and configuration, monitoring, issue resolving, and test generation. It specifically targets the long-horizon tool calling and long-context reasoning that agentic tasks demand, featuring 700+ real-world tasks collected from 30+ Java and Go projects through a semi-automated mechanism with rigorous expert validation.

DevOps-Gym Overview

Task Categories

Build and Configuration

We evaluate two categories of challenges:

  • Repair tasks: Address five prevalent error types: dependency version conflicts, build misconfiguration, compilation errors, tool-chain mismatches, and dependency resource unavailability. Agents must diagnose build failures, identify root causes, and apply targeted fixes.
  • Implementation tasks: Incorporate new functionality, including build-system migration (e.g., Maven to Gradle), target release, plugin integration, and dependency version upgrades.

Input: Repository with failing build configuration (repair) or specification for new build setup (implementation); terminal access with build tools (Maven, Gradle, npm), text utilities, and package managers.
Output: For repair tasks: patch (diff format) fixing build failure; for implementation tasks: complete configuration files meeting specifications.
Evaluation: Build process executes without errors and built artifacts pass dedicated test cases.
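
To make the pass criterion concrete, the sketch below shows one way such a check could be scripted. It is our illustration, not the official DevOps-Gym harness; the repository path and Maven goals are assumptions, and a Gradle or Go project would substitute its own build commands.

    // eval_build.go: minimal sketch of the "build succeeds, then dedicated tests
    // pass" check described above. Repository path and Maven goals are
    // hypothetical, not the official DevOps-Gym harness.
    package main

    import (
        "fmt"
        "os"
        "os/exec"
    )

    // runIn runs a command inside the repository and reports whether it exited cleanly.
    func runIn(repo, name string, args ...string) bool {
        cmd := exec.Command(name, args...)
        cmd.Dir = repo
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr
        return cmd.Run() == nil
    }

    func main() {
        repo := "./workspace/project" // hypothetical path to the repaired repository
        // Step 1: the build must complete without errors.
        if !runIn(repo, "mvn", "-q", "-DskipTests", "package") {
            fmt.Println("FAIL: build error")
            return
        }
        // Step 2: the built artifacts must pass the dedicated test cases.
        if !runIn(repo, "mvn", "-q", "test") {
            fmt.Println("FAIL: dedicated tests did not pass")
            return
        }
        fmt.Println("PASS")
    }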

Monitoring

Monitoring tasks require agents to capture runtime execution and system states using command-line tools, and to detect performance and resource utilization anomalies. We focus on performance and resource anomalies rather than immediate crashes, since such anomalies require careful analysis to uncover. We consider two types:

  • Resource usage problems: Memory leaks, disk leaks, system handle exhaustion, and CPU spikes that gradually degrade system reliability.
  • Performance degradations: I/O bottlenecks and inefficient SQL query handling that degrade user experience without causing immediate failures.

Input: Containerized environment running an application with bugs; terminal access (top, free, ps, netstat) with no access to source code, configuration files, or trigger scripts.
Output: Structured diagnostic report: specific issue type (e.g., memory_leak) and supporting evidence with quantitative metrics (e.g., memory growth rate, affected process ID).
Evaluation: Binary accuracy requiring agents to correctly identify the specific type of anomaly.
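
As a concrete, deliberately simplified illustration of the expected workflow, the sketch below polls a process's resident memory once per second via ps and reports the growth rate, the kind of quantitative evidence the diagnostic report asks for. The PID, sampling window, and leak threshold are assumptions for illustration, not benchmark defaults.

    // monitor_rss.go: illustrative sketch of continuous memory sampling.
    // The PID, 60-second window, and growth threshold are hypothetical.
    package main

    import (
        "fmt"
        "os/exec"
        "strconv"
        "strings"
        "time"
    )

    // rssKB returns the resident set size (in KB) of a process, read via ps.
    func rssKB(pid int) (int, error) {
        out, err := exec.Command("ps", "-o", "rss=", "-p", strconv.Itoa(pid)).Output()
        if err != nil {
            return 0, err
        }
        return strconv.Atoi(strings.TrimSpace(string(out)))
    }

    func main() {
        pid := 1234                 // hypothetical PID of the monitored application
        samples := 60               // observe for one minute
        interval := 1 * time.Second // fine-grained sampling rather than a one-shot snapshot

        first, err := rssKB(pid)
        if err != nil {
            fmt.Println("cannot read RSS:", err)
            return
        }
        last := first
        for i := 0; i < samples; i++ {
            time.Sleep(interval)
            if v, err := rssKB(pid); err == nil {
                last = v
            }
        }
        growth := float64(last-first) / float64(samples) // KB per second at a 1s interval
        fmt.Printf("pid=%d rss %d -> %d KB, growth %.1f KB/s\n", pid, first, last, growth)
        if growth > 100 { // hypothetical threshold for flagging a memory_leak
            fmt.Println("evidence consistent with: memory_leak")
        }
    }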

Issue Resolving

Issue resolving requires agents to translate bug descriptions into code fixes, following established methodologies similar to SWE-bench. Agents receive buggy repositories with natural language descriptions and must generate patches that pass fail-to-pass test transitions.

Input: Buggy repository with natural language bug description.
Output: Patch (diff format) that fixes the issue.
Evaluation: Patches must pass fail-to-pass test transitions, ensuring the fix resolves the described issue and maintains existing functionality.
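
As a rough sketch of what the fail-to-pass check means in practice (using a Go repository for illustration; the test name, patch file, and paths are hypothetical, and this is not the benchmark's actual harness):

    // fail_to_pass.go: sketch of a fail-to-pass check for a Go repository.
    // Test name, patch file, and repository path are hypothetical.
    package main

    import (
        "fmt"
        "os/exec"
    )

    // testPasses reports whether the designated test passes in the repository.
    func testPasses(repo, testName string) bool {
        cmd := exec.Command("go", "test", "-run", testName, "./...")
        cmd.Dir = repo
        return cmd.Run() == nil
    }

    func main() {
        repo := "./workspace/project" // hypothetical buggy repository
        test := "TestIssueRegression" // hypothetical fail-to-pass test

        // Before the patch, the designated test is expected to fail.
        if testPasses(repo, test) {
            fmt.Println("unexpected: test already passes on the buggy code")
            return
        }
        // Apply the candidate patch (diff format), then the test must pass.
        apply := exec.Command("git", "apply", "candidate.patch")
        apply.Dir = repo
        if err := apply.Run(); err != nil {
            fmt.Println("patch did not apply:", err)
            return
        }
        if testPasses(repo, test) {
            fmt.Println("resolved: fail-to-pass transition achieved")
        } else {
            fmt.Println("not resolved")
        }
    }

A complete evaluation would additionally rerun the previously passing tests to confirm that the patch maintains existing functionality.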

Test Generation

Test generation requires agents to create regression tests based on bug descriptions to prevent issue recurrence and ensure functionality correctness. This task is more challenging than issue resolving, as it requires reasoning about runtime behavior and how bugs would be triggered during execution.

Input: Bug description and repository context.
Output: Regression test that reproduces the described failure and validates patch correctness.
Evaluation: Generated tests must precisely reproduce the failure described in the issue and subsequently pass on the patched code.
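
For illustration only, a regression test for a Go project might look like the sketch below; the package name, ParseSize function, and expected value are invented for the example and do not come from any benchmark task. The test is written so that it fails on the buggy code and passes once a correct patch is applied.

    // parse_regression_test.go: hypothetical regression test. The config package,
    // ParseSize function, and expected behavior are invented for illustration.
    package config

    import "testing"

    // TestParseSizeHandlesKilobyteSuffix reproduces a (hypothetical) issue where
    // values such as "10k" were rejected instead of being parsed as 10240 bytes.
    func TestParseSizeHandlesKilobyteSuffix(t *testing.T) {
        got, err := ParseSize("10k")
        if err != nil {
            t.Fatalf("ParseSize(%q) returned error: %v", "10k", err)
        }
        if got != 10240 {
            t.Fatalf("ParseSize(%q) = %d, want 10240", "10k", got)
        }
    }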

Key Results

Cross-tool Builds and Configurations Present New Challenges

All models perform poorly in build implementation tasks, especially in migration tasks. Agents struggle with understanding the internal mechanisms of build tools like Maven and goreleaser, as well as their practical usage patterns in real-world projects. This is fundamentally different from fixing a bug in source code, where the context is more self-contained. This result demonstrates that while agents are improving at manipulating source code, they are far from capable of managing the software's build and deployment environment.

The Dynamic Nature of System Monitoring Reveals Critical Agent Failures

Agents perform exceptionally poorly on monitoring tasks, with some state-of-the-art models scoring 0%. This failure stems from three fundamental challenges: (1) monitoring requires continuously processing evolving system state, which quickly exhausts context limits; (2) agents struggle to stay focused on monitoring, becoming distracted by earlier observations and prematurely stopping active monitoring; (3) agents exhibit poor baseline discrimination, generating false positives by misinterpreting normal operational variance as anomalies. These failures reveal that current agents lack the temporal reasoning and sustained attention essential for dynamic system observation.

Unlike on SWE-bench, LLM Agents Perform Poorly at Issue Resolving on DevOps-Gym

The resolve rate drops significantly when moving from Python repositories (SWE-bench) to Java and Go. With the same agent and model combination (OpenHands + Claude-4-Sonnet), the resolve rate reaches 70.4% on SWE-bench Verified yet drops dramatically to 23.87% on our benchmark. This points to a cross-language capability gap, likely driven by the dominance of Python code in training data. Complex compilation processes, dependency management, and environment configuration in Java and Go pose major challenges that even newer models fail to overcome.

Generating High-Quality Tests is Even More Challenging Than Resolving Issues

When using the exact same set of issues for evaluation, the accuracy of generating high-quality tests is notably lower than the issue resolving rate. Test generation requires agents to not only have a static understanding of the repository but also dynamic analysis capabilities to reason about how bugs would be triggered during execution. The agent must also reason about how the bug might be resolved, so that generated tests can both reproduce the failure and validate the patch correctness. In contrast, generating a patch can sometimes be accomplished through more straightforward, static code analysis. These results suggest that while agents are becoming proficient at predicting coding patterns, reasoning about runtime behavior remains significantly more challenging.

Detailed Error Analysis

Build and Configuration Tasks

  • Toolchain and environment limitations (33%): Agents cannot validate or inspect configuration artifacts due to missing validators or schema checkers. Examples include unused-import violations, missing-dependency errors, and malformed build files.
  • Multi-step reasoning failures (23%): Agents often resolve initial errors but lose track of remaining issues, revealing limitations in context retention and iterative "fix-run-verify" loops.
  • Domain-specific knowledge gaps (37%): Failures reflect genuine domain knowledge requirements exceeding current capabilities, such as Maven-to-Gradle migrations requiring understanding of build-system semantics, dependency resolution, and platform-specific constraints. Notably, 17% of failures are "inherently difficult", confirming the benchmark captures real-world, high-difficulty scenarios.

Monitoring Tasks

  • Inadequate methodology (37%): Agents used one-time commands (e.g., top) instead of continuous monitoring (e.g., watch -n 1).
  • Premature conclusion (26%): Agents submitted answers without performing monitoring or completing diagnostic procedures.
  • Insufficient temporal granularity (11%): Agents used overly coarse sampling intervals (10-60s), missing transient anomalies like CPU spikes.
  • Interpretation failure (26%): Agents collected metrics correctly but failed to analyze them accurately.

Citation

If you use this work in your research, please cite the following:

@misc{devopsgym2025,
      title={DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle}, 
      author={[Authors]},
      year={2025},
      eprint={[arXiv ID]},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={[arXiv URL]}, 
}

More

Please check out more of our work: Frontier AI's Impact on the Cybersecurity Landscape, a comprehensive analysis of how frontier AI is reshaping cybersecurity and how we should respond. Also see our Frontier AI Cybersecurity Observatory, a live leaderboard tracking AI's cybersecurity capabilities across attack and defense tasks.