What is an AI SRE
Understand what an AI SRE is and how it works in practice
Let's get something out of the way first. "AI SRE" is a vague term that has been thrown around a lot over the past year or so (2024-2025). It suggests some robot is going to replace a site reliability engineer wholesale, which couldn't be further from the truth (as of September 2025). But hey, it's the term everyone's using, so here we are.
So what is an AI SRE?
When people use the term AI SRE, they tend to mean an agentic AI system that can complete some subset of the work typically performed by site reliability engineers.
These agentic systems claim to be able to complete some of the following tasks without human assistance:
- Incident summarization
- Runbook maintenance
- Issue detection
- Root cause analysis
- Alert noise reduction (e.g. pre-vetting firing alerts to see if they're noise, or grouping related incidents)
- Autonomous fixing and mitigation of issues
At the core of these agentic systems is a Large Language Model (LLM), augmented with “tools”. Tools are a way for the underlying LLM to interact with external systems. Each tool tends to be a wrapper around an API call, with a description of how to use the API (what parameters to pass and when to use the tool). This description is given to the model as a JSON blob; the model can then return a JSON blob with the arguments it wants to pass to the tool. The tool then executes the API call and returns the result to the model. The most common and increasingly standard way to define these tools is through a protocol known as the Model Context Protocol (MCP).
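To make this concrete, here is a minimal Go sketch of the two JSON blobs involved. The exact wire format varies by provider and MCP version, so the field names below are illustrative assumptions rather than a spec:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ToolDefinition sketches the blob the model receives: a name, a
// description of when to use the tool, and a JSON Schema describing its
// arguments. Field names are illustrative, not any specific provider's format.
type ToolDefinition struct {
	Name        string         `json:"name"`
	Description string         `json:"description"`
	InputSchema map[string]any `json:"input_schema"`
}

// ToolCall sketches the blob the model returns when it decides to use a
// tool: which tool to run and what arguments to pass.
type ToolCall struct {
	Name      string          `json:"name"`
	Arguments json.RawMessage `json:"arguments"`
}

func main() {
	def := ToolDefinition{
		Name:        "get_logs",
		Description: "Fetch recent logs, optionally filtered by service.",
		InputSchema: map[string]any{
			"type": "object",
			"properties": map[string]any{
				"service": map[string]any{"type": "string"},
			},
		},
	}
	blob, _ := json.MarshalIndent(def, "", "  ")
	fmt.Println(string(blob)) // what the LLM sees when tools are listed

	call := ToolCall{Name: "get_logs", Arguments: json.RawMessage(`{"service":"cart"}`)}
	out, _ := json.Marshal(call)
	fmt.Println(string(out)) // what the LLM sends back to request a tool call
}
```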
In the AI SRE case, these tools typically allow the LLM to:
- Access alerts, logs, traces and metrics from your observability platform
- Access incident details/runbooks from your incident response platform
- Access application code from your repositories
- Update runbooks with new information gained from past incidents
- Update incidents with investigation summaries
The AI SRE then uses these tools in a loop to autonomously investigate issues. You can imagine the loop looking like this in the ideal case:
- User prompts AI SRE: "There is an issue with the cart service, can you find the root cause?"
- AI SRE fetches alerts: It uses the tool to get the latest alerts from the observability platform.
- AI SRE fetches metrics: It uses the tool to get the latest metrics for the cart service. It sees that the number of 5XXs has spiked.
- AI SRE fetches 5XX traces: It uses the tool to get the latest traces which have returned a 5XX. It finds the specific endpoint that is returning 5XXs.
- AI SRE fetches logs: It uses the tool to get the latest logs for that endpoint. It finds a log that says "Null pointer exception".
- AI SRE checks for recent deployments of the service: It uses a tool to get the k8s deployment for the service and sees that the image tag was bumped from 0.1.0 to 0.2.0
- AI SRE grabs the diff for the cart service between 0.1.0 and 0.2.0: It uses another tool to diff the source code between versions 0.1.0 and 0.2.0 and sees that a variable was used before initialization.
- AI SRE creates pull requests: The AI SRE calls another tool to open a pull request rolling the service back to 0.1.0, and opens a second pull request initializing the variable correctly.
- Response delivered to user: The system returns a document showing the root cause and the fixes that it put up via pull request
This autonomous chaining of tool calls is what differentiates an AI SRE from just an LLM with a chat interface: the idea is that it can decompose the problem into smaller steps and use its tools to gather the information it needs to solve the problem without human intervention.
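Stripped of the vendor specifics, the loop itself is simple. Below is a minimal, hypothetical Go sketch of it; callModel and executeTool are stand-ins for a real LLM client and a real tool dispatcher, not any particular library:

```go
package main

import "fmt"

// ToolCall is what the model returns when it wants more information.
type ToolCall struct {
	Name string
	Args string
}

// ModelResponse is either a tool call (keep looping) or a final answer.
type ModelResponse struct {
	ToolCall *ToolCall
	Answer   string
}

// callModel stands in for a request to the LLM with the transcript so far.
// Here it fakes two tool calls and then answers, just to drive the loop.
func callModel(transcript []string) ModelResponse {
	if len(transcript) < 3 {
		return ModelResponse{ToolCall: &ToolCall{Name: "get_logs", Args: `{"service":"cart"}`}}
	}
	return ModelResponse{Answer: "Root cause: variable used before initialization in cart v0.2.0"}
}

// executeTool stands in for dispatching to a real tool (alerts, logs,
// metrics, traces, code diffs) and returning its output as text.
func executeTool(call *ToolCall) string {
	return fmt.Sprintf("result of %s(%s)", call.Name, call.Args)
}

// runAgent is the whole trick: call the model, run whatever tool it asks
// for, feed the result back in, and repeat until the model stops asking.
func runAgent(prompt string) string {
	transcript := []string{prompt}
	for {
		resp := callModel(transcript)
		if resp.ToolCall == nil {
			return resp.Answer
		}
		transcript = append(transcript, executeTool(resp.ToolCall))
	}
}

func main() {
	fmt.Println(runAgent("There is an issue with the cart service, can you find the root cause?"))
}
```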
Technical implementation of an AI SRE Agent
Let's go a little deeper into how an AI SRE might actually work.
Here is an example of what an MCP tool that lets an AI SRE fetch logs from your observability provider might look like:
The tool name and description allow the LLM to decide when to use the tool.
aiSreTools = []Tools{
	{
		Name:        "get_logs",
		Description: `Get logs from all or specific services/hosts/pods.`,
		Handler:     GetLogsHandler,
	},
}
The tool argument definition tells the LLM how to call the API: what fields are available and what they do.
type GetLogsHandlerArgs struct {
	Filters map[string][]string `json:"filters" jsonschema:"description=Set filters to only include specific logs with attributes. Do not set this if you want to fetch all logs."`
}
And this is the code that runs when the LLM decides to use the tool. We simply take the arguments provided by the LLM and make an API call to an external system (in this case the observability provider Metoro) to fetch the logs.
func GetLogsHandler(ctx context.Context, arguments GetLogsHandlerArgs) (*mcpgolang.ToolResponse, error) {
	// Build the request from the LLM-provided filters, capping the number of
	// returned logs so the response fits comfortably in the model's context.
	request := model.GetLogsRequest{
		Filters:     arguments.Filters,
		ExportLimit: 20,
	}
	resp, err := getLogsMetoroCall(ctx, request)
	if err != nil {
		return nil, fmt.Errorf("error getting logs: %v", err)
	}
	// Hand the raw API response back to the model as text content.
	return mcpgolang.NewToolResponse(mcpgolang.NewTextContent(string(resp))), nil
}
func getLogsMetoroCall(ctx context.Context, request model.GetLogsRequest) ([]byte, error) {
	requestBody, err := json.Marshal(request)
	if err != nil {
		return nil, fmt.Errorf("error marshaling logs request: %v", err)
	}
	// POST the request to Metoro's logs endpoint and return the raw body.
	return utils.MakeMetoroAPIRequest("POST", "logs", bytes.NewBuffer(requestBody), utils.GetAPIRequirementsFromRequest(ctx))
}
This example is taken from the metoro-mcp-server, which uses the mcp-golang library. This is basically how any "AI agent" works these days: some use MCP like we have above, and some call the APIs directly or through a different wrapper.
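To round this out, here is roughly how a tool like the one above could be registered and served. This sketch follows the shape of the mcp-golang README, so treat the exact function signatures as assumptions and check the library's documentation before relying on them:

```go
package main

import (
	mcpgolang "github.com/metoro-io/mcp-golang"
	"github.com/metoro-io/mcp-golang/transport/stdio"
)

func main() {
	// Serve tools over stdio so an MCP-capable client (the AI SRE's LLM
	// harness) can list them and call them.
	server := mcpgolang.NewServer(stdio.NewStdioServerTransport())

	// The name and description registered here are what the LLM uses to
	// decide when to call the tool; GetLogsHandler is the handler above.
	if err := server.RegisterTool(
		"get_logs",
		"Get logs from all or specific services/hosts/pods.",
		GetLogsHandler,
	); err != nil {
		panic(err)
	}

	if err := server.Serve(); err != nil {
		panic(err)
	}
	select {} // block forever; the server handles requests in the background
}
```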
Conclusion
An AI SRE is really just an LLM with access to the tools an SRE uses; it calls those tools in a loop until it (hopefully) completes a larger task.