The team at Firefly maintains a close partnership with our customers. This collaborative interaction has afforded a wealth of insights into the best practices that govern efficient cloud operations. Highlighting common patterns can be helpful to all cloud practitioners.
In this article, we highlight the most critical policies and their alerts used to proactively identify potential disruptions in AWS services. These notifications give insight into single points of failure, data protection, and operational issues. They may serve as a starting point for novice cloud operators or a checklist for seasoned practitioners who may be engaged in mentoring and knowledge transfer. A couple of scenarios are included to punctuate the role these policies play and the impact they can have on cloud operations.
The top five policy-violation notifications that a cloud engineer should subscribe to in order to reduce the risk of an AWS service disruption or disaster are:
1. AWS Auto Scaling groups running in only a single Availability Zone
This could potentially lead to a complete outage if the single availability zone experiences a problem. Having redundancy with multiple availability zones can help prevent service disruption.
2. AWS DB instances with a single Availability Zone
Similar to the previous point, a database instance running in a single Availability Zone is a risk: it has no standby in another zone to fail over to if that zone is disrupted.
3. AWS RDS instance without deletion protection
This is crucial to protect data and prevent unintentional deletions, which could be catastrophic for business continuity.
4. AWS DynamoDB tables without point-in-time recovery enabled
Point-in-time recovery helps in protecting your tables from accidental write or delete operations. In case of any disruption, you can restore your table to any point in time within the last 35 days.
5. AWS ELB/LB without any access logs enabled
Access logs provide detailed records about the requests that are made to your load balancer. Without access logs enabled, troubleshooting or understanding disruptions could become a challenge. It could also limit the ability to identify and analyze malicious activities.
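The five checks above can be sketched in a few lines of Python. This is a minimal illustration, not how Firefly implements its policies: each function takes a dictionary shaped like the response of the corresponding AWS describe call and reports violations. The function names and finding labels are hypothetical, and the load balancer check assumes an Application or Network Load Balancer (the `access_logs.s3.enabled` attribute) rather than a Classic ELB.

```python
# Sketch only: function names and finding labels are illustrative assumptions.
# In practice the input dictionaries would come from boto3, for example:
#   boto3.client("autoscaling").describe_auto_scaling_groups()
#   boto3.client("rds").describe_db_instances()
#   boto3.client("dynamodb").describe_continuous_backups(TableName=...)
#   boto3.client("elbv2").describe_load_balancer_attributes(LoadBalancerArn=...)


def single_az_asgs(asg_response):
    """Names of Auto Scaling groups that span fewer than two zones."""
    return [
        group["AutoScalingGroupName"]
        for group in asg_response.get("AutoScalingGroups", [])
        if len(set(group.get("AvailabilityZones", []))) < 2
    ]


def rds_findings(db_response):
    """(instance, finding) pairs for single-AZ or unprotected DB instances."""
    findings = []
    for db in db_response.get("DBInstances", []):
        name = db["DBInstanceIdentifier"]
        if not db.get("MultiAZ", False):
            findings.append((name, "single-az"))
        if not db.get("DeletionProtection", False):
            findings.append((name, "no-deletion-protection"))
    return findings


def pitr_enabled(backups_response):
    """True if point-in-time recovery is enabled for a DynamoDB table."""
    description = backups_response.get("ContinuousBackupsDescription", {})
    pitr = description.get("PointInTimeRecoveryDescription", {})
    return pitr.get("PointInTimeRecoveryStatus") == "ENABLED"


def access_logs_enabled(attributes_response):
    """True if an ALB/NLB has S3 access logging switched on."""
    attributes = {
        item["Key"]: item["Value"]
        for item in attributes_response.get("Attributes", [])
    }
    return attributes.get("access_logs.s3.enabled") == "true"


if __name__ == "__main__":
    sample = {
        "AutoScalingGroups": [
            {"AutoScalingGroupName": "checkout", "AvailabilityZones": ["us-east-1a"]}
        ]
    }
    print(single_az_asgs(sample))
```

Keeping the checks as pure functions over response dictionaries means they can be exercised with sample data, without AWS credentials, before being wired to live API calls and a notification channel.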
Together, these notifications cover single points of failure, data protection, and visibility into system operation, all of which are key to identifying issues before they cause a service disruption or disaster. The two scenarios below illustrate why multiple availability zones and access logs matter in a cloud environment.
Scenario: The Unforeseen Outage
Meet Sarah, a skilled cloud engineer responsible for managing a critical e-commerce platform hosted on AWS. One busy Monday morning, Sarah is enjoying her coffee when her phone suddenly buzzes with multiple notifications from the monitoring system. The alerts indicate a surge in error rates and latency on the platform.
Concerned, Sarah quickly logs into her dashboard, only to find that the website's checkout process has ground to a halt. Panicked customers are flooding the support lines, and the company's reputation is at stake. As she investigates further, Sarah discovers that one of the availability zones in the region has experienced an unexpected outage.
Realizing the gravity of the situation, Sarah begins the process of scaling up resources and diverting traffic to unaffected zones. However, she's alarmed to find that the auto-scaling group for the affected service was configured with only a single availability zone. Without the necessary redundancy, the outage has caused a complete service disruption, leading to lost sales and frustrated customers.
In hindsight, Sarah realizes that she could have averted this catastrophe if she had received timely notifications about the configuration vulnerability. These notifications could have alerted her to the single point of failure and prompted her to ensure multi-zone redundancy for critical services. By the time she rectified the issue, valuable time had been lost, and the financial impact was significant.
Scenario: Unraveling a Security Incident
Meet Alex, a dedicated cybersecurity specialist overseeing the IT infrastructure of a fast-growing online platform. One day, as he's reviewing the system logs, he notices unusual patterns of incoming requests to the platform's services. These patterns indicate a potential Distributed Denial of Service (DDoS) attack.
Concerned about the potential impact on the platform's availability and performance, Alex decides to investigate further. He checks the configuration of the Elastic Load Balancer (ELB), which distributes incoming traffic to the target instances, hoping to analyze the incoming requests and determine their origin. To his surprise, he discovers that access logs for the ELB were not enabled.
Without these access logs, Alex's ability to identify the attack's source, patterns, and intensity is severely limited. He's unable to determine which endpoints were specifically targeted or the nature of the malicious traffic. This lack of visibility hampers his ability to mitigate the attack effectively.
In hindsight, Alex realizes that he could have identified the root cause faster if he had received timely notifications about the improper cloud configuration. These notifications could have alerted him to a lack of ELB access logs well before the critical moment when they were needed. By enabling access logs, cloud practitioners and security pros alike gain the necessary visibility to understand traffic patterns, diagnose anomalies, and respond proactively.
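To make the value of those logs concrete, here is a minimal sketch of the kind of analysis Alex could have run had access logs been available: counting requests per client IP to surface the heaviest sources of traffic. It assumes Application Load Balancer access log lines, where the fourth space-separated field is `client:port`; Classic ELB logs use a different field order. The function name `top_clients` is hypothetical.

```python
import shlex
from collections import Counter


def top_clients(log_lines, limit=5):
    """Return the `limit` busiest client IPs as (ip, request_count) pairs.

    Assumes ALB access log format, where field 4 is "client_ip:port".
    """
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            # shlex keeps quoted fields (request, user agent) intact.
            fields = shlex.split(line)
        except ValueError:
            continue  # skip malformed lines, e.g. unbalanced quotes
        if len(fields) < 4:
            continue
        # Split on the last colon only to separate the port from the address.
        client_ip = fields[3].rsplit(":", 1)[0]
        counts[client_ip] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    # Illustrative lines, truncated after the user-agent field.
    sample = [
        'http 2018-07-02T22:23:00.186641Z app/my-lb/50dc6c49 192.168.131.39:2817 '
        '10.0.0.1:80 0.000 0.001 0.000 200 200 34 366 '
        '"GET http://www.example.com:80/ HTTP/1.1" "curl/7.46.0"',
        'http 2018-07-02T22:23:01.186641Z app/my-lb/50dc6c49 192.168.131.39:2818 '
        '10.0.0.1:80 0.000 0.001 0.000 200 200 34 366 '
        '"GET http://www.example.com:80/cart HTTP/1.1" "curl/7.46.0"',
    ]
    for ip, hits in top_clients(sample):
        print(ip, hits)
```

Load balancers deliver these logs to S3 as compressed files, so a real investigation would first download and decompress them, or query them in place with a log analytics service.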
Ensure you don’t miss a critical notification
All five of the critical policies are included, out of the box, within the policies-as-code provided by Firefly’s Cloud Asset Management platform. Alerts on their violation can be sent to your preferred notification tool, such as Slack or PagerDuty. Immediately gain the benefit of these notifications and over one hundred others for better control over your complex cloud infrastructure. This two-minute video demonstrates How to Create Custom Insights & Notifications using Firefly. Try it yourself for free at app.firefly.ai.