DevOps and site reliability engineering (SRE) both aim to streamline development and operations while reducing human error. Because of their similar goals, some IT professionals consider SRE to be an extension of DevOps. However, there are distinct differences between the two approaches that need to be evaluated in order to define their working roles and responsibilities.
What is DevOps?
DevOps is a range of tools and cultural philosophies that focus on helping organizations streamline software release cycles and improve the quality, security, and scalability of their applications.
What is SRE?
SRE is a set of practices that incorporates aspects of software engineering to automate IT infrastructure tasks. According to Google, which originated SRE, the goal of SRE is to “protect, provide for, and progress the software and systems behind all of Google’s public services... with an ever-watchful eye on their availability, latency, performance, and capacity.”
SRE vs DevOps: Similarities and Differences
DevOps is highly focused on a culture of collaboration. It integrates the workflows of teams that typically do not work together easily with a process that’s often been described as “breaking down organizational silos.”
Traditionally, teams that develop software have not been involved in deploying it—let alone monitoring and maintaining its underlying infrastructure and systems. In contrast, DevOps teams are much more cohesive. They are involved throughout an app or service’s life cycle: managing and testing code, deploying and running the application or service, and fixing instabilities in the infrastructure.
SRE takes this philosophy one step further and works towards guaranteeing a reliable and resilient platform. SRE teams often start with business requirements for system and application uptime. To meet these needs—and service level baselines—SRE teams continuously maintain their deployment environments by analyzing them and validating their stability. In short, they keep the service “live”.
At their core, SRE teams are developers, but they spend more time on operational upkeep than on writing code and producing software. Instead, their development expertise is used to build automation into their day-to-day tasks.
DevOps and SRE share the goal of uniting development and operations teams and reducing the amount of time spent on building secure apps. They both value automation, monitoring, and shared responsibility.
SRE vs DevOps: Roles and Responsibilities
Exploring the differences between the functions of DevOps and SRE clarifies the expected responsibilities of each role. Although there is significant overlap between the two, when it comes to the software development life cycle, DevOps tends to be more “shift-left,” while SRE is typically more “shift-right.”
The Role of a DevOps Team
DevOps teams centralize and maintain source code in source control, and they create and manage pipelines for continuous integration (CI). These pipelines enable developers to run functional and unit tests, compile code, build containers, and produce the various artifacts needed to deploy.
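As a rough illustration of the CI flow described above, the following Python sketch models a pipeline as an ordered list of stages that halts at the first failure. The `Stage` class, the `run_pipeline` function, and the stage names are hypothetical and chosen only for this example; a real pipeline would be defined in a CI system and would invoke actual test runners and build tools.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    """One CI step; `run` returns True when the step succeeds."""
    name: str
    run: Callable[[], bool]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (succeeded, names of the stages that were executed).
    """
    executed: List[str] = []
    for stage in stages:
        executed.append(stage.name)
        if not stage.run():
            return False, executed
    return True, executed


# Hypothetical stages; the lambdas stand in for real test and build commands.
stages = [
    Stage("unit-tests", lambda: True),
    Stage("functional-tests", lambda: True),
    Stage("compile", lambda: True),
    Stage("build-container", lambda: True),
    Stage("publish-artifacts", lambda: True),
]

ok, executed = run_pipeline(stages)
print(ok, executed)
```

Stopping at the first failed stage means a broken test never produces a deployable artifact, which is the main safeguard a CI pipeline provides.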
Create and manage pipelines for continuous delivery/continuous deployment (CD)
This allows teams to handle production workloads and final delivery in both cloud and on-premises environments, including working in cloud-native or microservice-based scenarios, such as containers.
Configure and maintain the service deployment
Once the deployment is finalized, this is achieved by controlling updates, managing platform maintenance, monitoring scalability, ensuring high availability, and integrating end-to-end security best practices.
The role of a DevOps team is to ensure that application source code gets validated, compiled, and deployed, allowing an application to run smoothly. This allows teams to automate whenever possible and release new value in the form of software when needed.
The Role of an SRE Team
Once the workload is up and running, the core responsibility of the DevOps team is complete, and the SRE team focuses on guaranteeing the uptime and reliability of the workload.
Establishing a baseline for what is considered “reliable”
This is done for each application workload scenario by examining service-level objectives (SLOs) and service-level indicators (SLIs). If an SLO/SLI target is not met, often because of an incident, the SRE team is responsible for mitigating the effects and fixing the issues.
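To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch that treats availability as the SLI (the share of successful requests) and checks it against an SLO target. The function names, the request counts, and the 99.9% target are assumptions made for illustration, not values taken from any particular service.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        # Assumption for this sketch: a window with no traffic counts as available.
        return 1.0
    return good_requests / total_requests


def slo_met(sli: float, slo_target: float) -> bool:
    """The SLO is met when the measured SLI reaches the target."""
    return sli >= slo_target


# Hypothetical measurement window and target.
SLO_TARGET = 0.999
healthy_window = availability_sli(good_requests=999_200, total_requests=1_000_000)
incident_window = availability_sli(good_requests=998_500, total_requests=1_000_000)

print(slo_met(healthy_window, SLO_TARGET))
print(slo_met(incident_window, SLO_TARGET))
```

In this sketch the second window falls below the target, which is the kind of signal that would prompt the SRE team to investigate and mitigate.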
The responsibilities of a site reliability engineer involve extensive monitoring and observability tasks. SREs rely on a large set of metrics and telemetry from every part of the workload environment to provide accurate feedback on which the business can base its targets.
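One way telemetry can inform a target is by summarizing observed latency as percentiles. The Python sketch below uses the nearest-rank method on a small set of made-up latency samples; the `percentile` helper and the sample values are assumptions for this example, and production monitoring systems typically compute percentiles over far larger data sets with their own estimators.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, including two slow outliers.
latencies_ms = [120, 95, 110, 400, 105, 98, 130, 101, 99, 2500]

p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
print(p50, p90)
```

Percentiles are less distorted by a few extreme requests than an average is, so they tend to give the business a more realistic basis for a latency target.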
Performing post-mortems
In addition to handling incidents, SREs complete post-incident reviews. This includes reporting fixes, documenting best practices, and discussing which existing practices proved unsuccessful. To avoid similar outages in the future, a significant responsibility for the team is to consistently work towards finding ways to improve the overall reliability of the business’ systems.
Managing toil
Google’s engineers describe toil as any important activity that helps contribute to system optimization and application reliability, but which must be executed manually and repeatedly. SREs are specifically concerned with overseeing this “grungy work.”
Writing software and working on automation
Although writing software and building automation are less emphasized in SRE than in DevOps, they remain part of the role. In fact, the entire premise of Google’s approach in creating SRE is to apply software development best practices to tackle operational tasks.
The main task of an SRE team is to map business requirements for application availability and reliability to technical solutions, mechanisms, and practices. Just as in DevOps, there is a heavy focus on automating as much of the workload as possible. The primary incentive in SRE is to guarantee the reliability of a product and to continuously generate feedback about performance, both to minimize incidents and failure rates and to provide the business with realistic service level targets.
Conclusion
Although SRE is often described as a successor to DevOps, the two roles are more complementary to each other than they are competitive.
DevOps teams are typically developers with a focus on writing code, producing software, and managing the software life cycle. Technically, this is done using DevOps tools in a CI/CD pipeline.
SRE teams are also primarily developers, but their focus is not on writing software. Rather, SRE teams ensure the reliability of applications and services and generate knowledge in response to incidents. SREs evaluate technical implementations with a focus on how they meet business requirements, using metrics from monitoring, observability, and post-incident reviews to meet service level targets.
In an ideal IT department, the DevOps and SRE teams work together. Each team functions with a different focus, but their overlap allows them to quickly and efficiently write and release reliable applications on a reliable infrastructure. Both are able to determine the root causes of incidents, fix outages, and apply the resulting observations to keep the business’ customers happy with a stable, highly available product.