How to Achieve AWS Operational Excellence in Your Cloud Workload
Explore the Operational Excellence pillar of the AWS Well-Architected Framework and examine best practices and design principles for cloud-based security operations, including CI/CD and risk management.
Related articles in the Well-Architected series:
- Overview of All 5 Pillars
- Security Pillar
- Reliability Pillar
- Performance Efficiency Pillar
- Cost Optimisation Pillar
- Sustainability Pillar
In today’s landscape, achieving operational excellence can be difficult, but not impossible. Operations is often viewed as distinct from the rest of the business, so it is sometimes not integrated into the flow of work the way other departments are.
We have seen the industry recognise this divide with the creation of DevOps—combining development and IT operations into one process to enable more streamlined creation and implementation of software throughout the software development lifecycle (SDLC).
Amazon Web Services (AWS) continues to publish design principles for building applications that adhere to its Well-Architected Framework. The best practices for the AWS Well-Architected Framework are based on five different pillars.
These are:
- Operational Excellence
- Security
- Reliability
- Performance Efficiency
- Cost Optimisation
Focusing on the pillar of operational excellence, AWS has defined five design principles that span the areas of “organisation,” “prepare,” “operate,” and “evolve.”
The five Operational Excellence design principles:
1. Perform operations as code. The beauty of the cloud is that you can apply the same scripting skills you use to code applications to your entire environment, including operations. This means you can reduce the need for human intervention by scripting code that will automate operations and trigger appropriate responses to any events or incidents.
2. Make frequent, small, reversible changes. When multiple, large changes are made at once, it becomes exceedingly difficult to troubleshoot a problem when things don’t work in production. When designing your workloads, allow for small and frequent deployments that are easily reversible, so that identifying the source of a problem is quick and easy when something isn’t running as intended in production.
3. Refine operations procedures frequently. There is always room for improvement. Continually analysing and poking holes in your processes and procedures helps you to constantly increase the efficiency of how you serve your customer needs.
4. Anticipate failure. It is always better to expect failure, rather than assuming that what you’ve created is flawless. If you don’t anticipate errors, how can you catch them before deployment? This is effectively the process of threat modelling and risk assessment.
5. Learn from all operational failures. The point of going back and analysing a failure is to learn from it. It is important to set up structures and processes that enable the sharing of learnings across teams and the business.
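The first principle, performing operations as code, can be illustrated with a minimal sketch. It assumes a hypothetical event shape (a dictionary with "type" and "resource" keys) and placeholder handler functions; none of these names come from AWS. The point is only that the response to an incident lives in version-controlled code rather than in a manual procedure.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical event shape: {"type": "...", "resource": "..."}; not an AWS format.
Event = Dict[str, str]
Handler = Callable[[Event], str]


@dataclass
class Runbook:
    """Maps event types to scripted responses instead of manual steps."""
    handlers: Dict[str, Handler]

    def respond(self, event: Event) -> str:
        handler = self.handlers.get(event["type"])
        if handler is None:
            # Unknown events are escalated to a person rather than ignored.
            return f"escalate: no scripted response for {event['type']}"
        return handler(event)


def restart_service(event: Event) -> str:
    # Placeholder action; a real handler would call your platform's API.
    return f"restarted {event['resource']}"


def expand_disk(event: Event) -> str:
    # Placeholder action, as above.
    return f"expanded volume on {event['resource']}"


runbook = Runbook(handlers={
    "service_unhealthy": restart_service,
    "disk_almost_full": expand_disk,
})

if __name__ == "__main__":
    print(runbook.respond({"type": "service_unhealthy", "resource": "web-1"}))
    print(runbook.respond({"type": "certificate_expiring", "resource": "api"}))
```

Because the runbook is ordinary code, it can be reviewed, tested, and rolled back like any other change, which also supports the second and third principles.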
The area of “organisation” is critical to your success. It concerns the way your business organises who is responsible for what, in relation to your engineering and operations departments. You want to ask: who is responsible for the platform? Who is responsible for applications? How do we communicate between our different departments? At the end of the day, you need to be organised in a way that enables you to build, for example, software and applications that fulfil your business strategy.
In order to make any decisions about organisation, the following seven high-level organisation priorities, as delivered by AWS, must first be reviewed and determined:
I. Evaluate external customer needs. Involve key stakeholders, including business, development, and operations teams, to determine where to focus efforts on external customer needs. This will ensure that you have a thorough understanding of the operations support that is required to achieve your desired business outcomes.
II. Evaluate internal customer needs. Engage key stakeholders to identify internal customer needs and operational support required for business outcomes. Prioritise improvement areas, such as skill development, workload performance, cost reduction, automation, and monitoring enhancement, based on established priorities. Continuously update priorities to adapt to changing needs.
III. Evaluate governance requirements. Organisational governance encompasses policies, rules, and frameworks guiding business goals. These requirements internally influence technology choices and workload operations. Incorporate them into your workload to ensure compliance and demonstrate implementation of governance requirements.
IV. Evaluate compliance requirements. Compliance requirements shape organisational priorities, potentially limiting technology and geographic choices. Conduct due diligence if no external frameworks exist. Validate compliance through audits/reports. For advertised compliance, establish internal processes for continuous adherence. Standards like PCI DSS, FedRAMP, and HIPAA vary based on data types and supported regions.
V. Evaluate threat landscape. Assess business threats, maintain a risk registry, and consider their impact when prioritising efforts. The Well-Architected Framework offers a consistent approach to evaluate and scale architectures. Our Cloud Risk Self-Assessment gives you insight on how to improve your cloud risk posture in three simple steps.
VI. Evaluate tradeoffs. Evaluate tradeoffs and alternatives to make informed decisions when prioritising efforts or selecting a course of action. For instance, prioritise speed to market over cost optimisation or choose a relational database for non-relational data to simplify migration instead of using an optimised database.
VII. Manage benefits and risks. Balance benefits and risks when prioritising efforts. Consider deploying workloads with unresolved issues to provide significant new features but mitigate associated risks. Address unacceptable risks as needed. Emphasise specific priorities when necessary. Maintain a balanced approach for long-term capability development and risk management. Update priorities based on changing needs.
Determine your business's risk by looking at the possible attacks that could occur, as well as the likelihood of each one occurring. While the cloud has been around for a while, we need to pay close attention to managing the risks it can introduce, as it is still considered a new ecosystem that we are all learning to manage. How we deploy software and manage patches and updates has an impact on the business's threat landscape.
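A risk registry of the kind described above can be sketched in a few lines. This is an illustration only: the 1 to 5 scales, the likelihood-times-impact score, and the example entries are assumptions chosen for the sketch, not something AWS prescribes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int  # 1 (rare) to 5 (almost certain); assumed scale
    impact: int      # 1 (negligible) to 5 (severe); assumed scale

    @property
    def score(self) -> int:
        # A common simple scoring scheme: likelihood multiplied by impact.
        return self.likelihood * self.impact


def prioritise(register: List[Risk]) -> List[Risk]:
    """Return the register ordered from highest to lowest score."""
    return sorted(register, key=lambda risk: risk.score, reverse=True)


# Hypothetical entries for illustration.
register = [
    Risk("Misconfigured storage bucket", likelihood=2, impact=4),
    Risk("Unpatched public-facing host", likelihood=4, impact=5),
    Risk("Leaked access key", likelihood=3, impact=5),
]

if __name__ == "__main__":
    for risk in prioritise(register):
        print(f"{risk.score:>2}  {risk.name}")
```

Keeping the registry as data makes it easy to re-score entries whenever deployment or patching practices change.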
In its report, Operational Excellence Pillar, AWS looks at engineering as the process of developing and testing applications and the infrastructure. Operations is then responsible for the deployment and ongoing maintenance of the applications and infrastructure in production. But it isn’t always this straightforward, and every business has its own processes, which is why the report discusses four different operating models in the context of engineering and operations that businesses can use:
I. Fully Separated Operating Model
II. Separated Application Engineering and Operations (AEO) and Infrastructure Engineering and Operations (IEO) with Centralised Governance
III. Separated AEO and IEO with Centralised Governance and a Service Provider
IV. Separated AEO and IEO with Decentralised Governance
Note that it may be necessary to alter your business culture to conform to any one of these models.
The “prepare” area is where you start to get into work that software developers are more familiar with. However, just because it is more familiar doesn’t mean it is more important than the area of organisation. Without proper organisation in your business and processes, it would be very difficult to address the other three areas required to fulfil your business strategy. AWS has broken this area into four actions:
I. Design telemetry into your cloud workloads
Telemetry provides you with information on the current health and risk level of your applications and infrastructure, giving you the ability to better manage and respond effectively to events or incidents. This is done predominantly with logs and metrics. Our Trend Micro Knowledge Base provides steps that you can take to confirm that AWS CloudTrail is enabled or that Amazon CloudWatch Logs are encrypted, with instructions on how to remediate according to best practice. It is also good to ensure that you have metrics configured to monitor things like the functional status of your APIs.
You can audit your environment manually with 750+ industry best practice articles, or give our free trial a shot and have your entire environment audited automatically, continuously, and in real time.
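The two checks mentioned here can be sketched as pure functions. The sketch assumes the data has already been fetched, for example with boto3's `get_trail_status` (CloudTrail) and `describe_log_groups` (CloudWatch Logs), and relies only on the `IsLogging`, `logGroupName`, and `kmsKeyId` response fields. The function names and sample values are hypothetical.

```python
from typing import Dict, List


def trails_not_logging(status_by_trail: Dict[str, dict]) -> List[str]:
    """Return names of trails whose status does not report active logging.

    status_by_trail maps a trail name to a dict shaped like the
    CloudTrail GetTrailStatus response (only "IsLogging" is read).
    """
    return sorted(
        name
        for name, status in status_by_trail.items()
        if not status.get("IsLogging", False)
    )


def unencrypted_log_groups(log_groups: List[dict]) -> List[str]:
    """Return names of log groups with no KMS key associated.

    log_groups holds entries shaped like the "logGroups" list in the
    CloudWatch Logs DescribeLogGroups response.
    """
    return [g["logGroupName"] for g in log_groups if not g.get("kmsKeyId")]


if __name__ == "__main__":
    # Hypothetical sample data standing in for real API responses.
    statuses = {"org-trail": {"IsLogging": True}, "old-trail": {"IsLogging": False}}
    groups = [
        {"logGroupName": "/app/api", "kmsKeyId": "arn:aws:kms:example"},
        {"logGroupName": "/app/worker"},
    ]
    print(trails_not_logging(statuses))
    print(unencrypted_log_groups(groups))
```

Separating the check logic from the API calls keeps it testable without AWS credentials.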
II. Improve your cloud workload flow
AWS says we need to adopt approaches that “enable refactoring, fast feedback on quality, and bug fixing.” Improving the way changes flow into production is what AWS is pointing to here. So, it is essential to have version control and ensure that you test and validate any changes before they reach production.
As a result, configuration management is a crucial topic. This relates back to one of the design principles: making small, frequent, and reversible changes is critical to build into our processes. It is good to set up services such as Amazon Simple Notification Service (Amazon SNS) to receive messages from services like AWS CloudFormation. Receiving a notification when stack events such as create, update, and delete occur allows for a faster response to unauthorised actions.
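As a rough sketch of what handling such a notification could look like, the code below parses a stack event message and flags deletions and failures. It assumes the SNS message body arrives as newline-separated `Key='Value'` pairs with fields such as `ResourceType` and `ResourceStatus`; treat that format, the sample message, and the function names as assumptions to verify against your own notifications.

```python
import re
from typing import Dict

# Assumed line format: Key='Value', one pair per line.
_FIELD = re.compile(r"^(\w+)='(.*)'$", re.MULTILINE)


def parse_stack_event(message: str) -> Dict[str, str]:
    """Turn a notification body into a dict of its Key='Value' fields."""
    return dict(_FIELD.findall(message))


def needs_attention(fields: Dict[str, str]) -> bool:
    """Flag stack-level deletions and failures for a faster human response."""
    status = fields.get("ResourceStatus", "")
    is_stack = fields.get("ResourceType") == "AWS::CloudFormation::Stack"
    return is_stack and (status.startswith("DELETE_") or status.endswith("_FAILED"))


# Hypothetical sample message for illustration.
SAMPLE = "\n".join([
    "StackName='billing-api'",
    "LogicalResourceId='billing-api'",
    "ResourceType='AWS::CloudFormation::Stack'",
    "ResourceStatus='DELETE_IN_PROGRESS'",
])

if __name__ == "__main__":
    event = parse_stack_event(SAMPLE)
    print(event["StackName"], event["ResourceStatus"], needs_attention(event))
```

Filtering on the stack-level resource type avoids raising an alert for every individual resource event inside a stack operation.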
III. Deployment risk mitigation processes
There are many steps that can be taken to mitigate deployment risks. Before those, it is crucial to have the attitude that changes pushed to production don’t always work. This will help you to always be prepared. Before pushing to production, always look for what would cause a failure:
i. Test
ii. Validate
iii. Use deployment management systems
iv. Deploy small changes
v. Know how to reverse your changes before they are done
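The steps above can be sketched as a single small function: snapshot the current state so the change is reversible, apply one small change, validate it, and fall back to the snapshot if validation fails. The configuration keys and the `timeout_is_sane` check are hypothetical; a real pipeline would use a deployment management system rather than in-memory dictionaries.

```python
from typing import Callable, Dict

Config = Dict[str, str]


def deploy(current: Config, change: Config,
           validate: Callable[[Config], bool]) -> Config:
    """Apply a small change; return the previous config if validation fails."""
    previous = dict(current)            # know how to reverse before you start
    candidate = {**current, **change}   # deploy a small change
    if validate(candidate):             # test and validate before accepting it
        return candidate
    return previous                     # roll back


def timeout_is_sane(config: Config) -> bool:
    # Hypothetical validation rule: timeout must be 1 to 30 seconds.
    value = config.get("timeout_seconds", "")
    return value.isdigit() and 1 <= int(value) <= 30


if __name__ == "__main__":
    live = {"timeout_seconds": "10", "region": "eu-west-1"}
    print(deploy(live, {"timeout_seconds": "20"}, timeout_is_sane))
    print(deploy(live, {"timeout_seconds": "900"}, timeout_is_sane))
```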
IV. Understand your operational readiness
Once you understand what operational readiness is, the next step is to verify that your personnel are just as knowledgeable, so they can provide operational support. From there, you’ll want to determine whether or not you’ve automated everything you can.
The “operate” area includes three key understandings that are required to ensure you achieve your business outcomes. AWS says that it is critical to:
I. Understand workload health
II. Understand operational health
III. Respond to events
Understanding the health of your workloads or operations comes down to metrics. In order to know how to improve, it is critical to be able to show how things are functioning and how your customers are interacting with your sites. Enabling logging with Amazon CloudWatch Logs and then aggregating those logs for analysis is very important. These logs can help generate the information needed to produce the metrics you need to improve operations and can be delivered through AWS Health Events on the AWS Personal Health Dashboard. Our Trend Micro Knowledge Base also has rules to assist in the creation of logs and health events. It is possible to use these rules manually, or to use an automated tool, which is always looking for misconfigurations.
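To illustrate how aggregated logs become a health metric, here is a minimal sketch that computes a server error rate. It assumes a hypothetical log format of one JSON object per line with a numeric "status" field; your own logs will likely differ.

```python
import json
from typing import Dict, Iterable


def summarise(lines: Iterable[str]) -> Dict[str, float]:
    """Aggregate request logs into a few simple workload health metrics."""
    requests = errors = unparseable = 0
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            unparseable += 1
            continue
        if not isinstance(record, dict):
            unparseable += 1
            continue
        requests += 1
        # Treat HTTP 5xx responses as errors (an assumption of this sketch).
        if int(record.get("status", 0)) >= 500:
            errors += 1
    return {
        "requests": requests,
        "errors": errors,
        "unparseable": unparseable,
        "error_rate": errors / requests if requests else 0.0,
    }


if __name__ == "__main__":
    sample = [
        '{"path": "/orders", "status": 200}',
        '{"path": "/orders", "status": 500}',
        '{"path": "/pay", "status": 503}',
        "not json",
    ]
    print(summarise(sample))
```

Counting unparseable lines separately matters: a sudden rise in them is itself a signal that the telemetry needs attention.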
Once the logs are created, delivered, and analysed, it is possible to respond to an event. In ITIL language, an event is a change of state. Events may be planned and monitored, or unplanned and problematic. With the latter, we need to ensure that we are able to respond effectively.
AWS Systems Manager OpsCenter is a central place to manage issues. You can view, investigate, and resolve issues within this tool, while ensuring that information is kept confidential. There is a Trend rule for this: SSM Parameter Encryption. And as with all the rules, it is included in our automated tool. When beginning on the path to operational effectiveness, having an automated tool to analyse our cloud looking for missing configurations is essential.
Automating responses to detected events is the next step. You can utilise Amazon CloudWatch Events to create rules that respond to specific triggers. Otherwise, there would be alarms that might get missed. For example, our Trend Knowledge Base and the tool have alarms to alert us when costs are reaching a threshold we have defined.
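The idea of a rule that responds to specific triggers can be sketched with a simplified matcher. It implements only a small subset of event pattern behaviour (each pattern field lists the allowed values, with nested dictionaries for nested fields) and is not the service's actual matching engine. The `matches` function is hypothetical; the example event follows the general shape of an EC2 instance state-change notification.

```python
from typing import Any, Dict


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Return True if every field in the pattern is satisfied by the event.

    Simplified subset: a pattern value is either a list of allowed scalar
    values or a nested dict that is matched recursively.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Rule: react when an instance stops or is terminated.
RULE = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}

if __name__ == "__main__":
    event = {
        "source": "aws.ec2",
        "detail-type": "EC2 Instance State-change Notification",
        "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
    }
    print(matches(RULE, event))
```

Declaring the trigger as data keeps the rule reviewable and versionable, so an alert no longer depends on someone noticing an alarm.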
With the “evolve” area, AWS believes that, in the context of the cloud, to properly evolve, you must learn, share, and improve. For example, use your post-incident meetings to learn from what has occurred and make improvements for the future. There needs to be a process to manage and promote continuous improvement in an effort to change behaviours that are not working.
As more security breaches hit the news and data protection becomes a key focus, ensuring your organisation adheres to the Well-Architected Framework’s design principles is crucial. Trend can help you stay compliant with the Well-Architected Framework with its 750+ best practice rules. As mentioned above, if you are interested in knowing how well-architected you are, see your own security posture in 15 minutes or less.