What is a policy?
A policy is a condition or set of conditions that, when met, will send out an alert to a specific user, group of users or service. You can create policies in Stackdriver to stay on top of issues that may arise within your environment. Policies are used specifically to monitor the health of resources and to set thresholds. You can apply them to individual resources, specific application clusters or groups.
For setup instructions, please visit our Creating and configuring alerting policies doc.
*Email alerts are sent from email@example.com—please add this address as a trusted sender to ensure you receive all alerts.
*SMS alerts are sent from +1-617-229-6940
What are metric policy conditions?
Metric policy conditions are manually configured conditionals that focus on a metric and a resource or group of resources. There are three types of metric policy conditions: Threshold, Absence and Group Aggregate.
– A threshold policy condition can be configured to alert you when any metric crosses a set line for a specific period of time (for example, alert me when the latency on my load balancer is greater than 1000 ms for 15 minutes). You can use this condition type to monitor many different metrics and fine-tune exactly when you’d like to be notified.
– A metric absence policy condition will monitor any of Stackdriver’s supported metrics and send out an alert if it detects an absence of data over a certain amount of time. Monitoring a metric’s absence helps you confirm, for example, that CloudWatch is up and running or that your agent is working properly. By watching for the absence of data from vital metrics in your environment, you can essentially monitor the “heartbeat” of your system.
– A group aggregate threshold policy condition allows you to set threshold alerts on aggregate metrics for clusters. The following aggregate functions are available for use in your configurations:
- Standard deviation
- 95th percentile
- 5th percentile
You can also keep an eye on variance by alerting on a change in standard deviation for a metric across a cluster. This will provide some warning when a cluster is not running within its normal operating boundaries (for example: my Cassandra cluster normally has a standard deviation of 5% CPU, so alert me if that rises above 7% for 1 hour).
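The three metric condition types described above can be sketched in a few lines of Python. This is only an illustration of the logic, not how Stackdriver evaluates conditions internally: the function names, the `(timestamp, value)` sample format, and the rule that a threshold fires only when every sample in the window is over the line are all assumptions made for the example. The group check uses a population standard deviation on the latest value from each member and, for brevity, does not model the "for 1 hour" duration.

```python
from statistics import pstdev

# Hypothetical sample format: a list of (timestamp_seconds, value) pairs, oldest first.

def threshold_violated(samples, threshold, duration_s, now):
    """True if data covers the whole window and every sample in it is above threshold."""
    start = now - duration_s
    # Require data reaching back to the start of the window, so that a single
    # recent spike does not count as "above the line for the whole period".
    if not samples or samples[0][0] > start:
        return False
    window = [v for t, v in samples if start <= t <= now]
    return bool(window) and all(v > threshold for v in window)

def metric_absent(samples, duration_s, now):
    """True if no sample at all arrived during the last duration_s seconds."""
    start = now - duration_s
    return not any(start <= t <= now for t, _ in samples)

def group_stddev_violated(latest_by_member, threshold):
    """True if the standard deviation across group members exceeds threshold."""
    values = list(latest_by_member.values())
    return len(values) > 1 and pstdev(values) > threshold

# Latency of 1200 ms reported once a minute for 15 minutes (900 seconds).
latency = [(t, 1200) for t in range(0, 901, 60)]
print(threshold_violated(latency, 1000, 900, now=900))   # True
print(metric_absent(latency, 300, now=1500))             # True: nothing since t=900
print(group_stddev_violated({"node-a": 40.0, "node-b": 50.0, "node-c": 60.0}, 7.0))  # True
```

Requiring every sample in the window to be over the line is one reasonable reading of "for a specific period of time"; a real evaluator might instead tolerate gaps or use an average over the window.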
What are health policy conditions?
Health policy conditions are predefined resource checks that are determined by your cloud provider, or manually in the case of uptime checks. We offer 7 types of health policy conditions, which can be applied to a single resource, a group of resources or all resources.
– Instance Status Checks monitor the software and network configuration of your individual instance. These checks detect problems that require your involvement to repair. When an instance status check fails, typically you will need to address the problem yourself (for example, by rebooting the instance or by making modifications in your operating system). Examples of problems that may cause instance status checks to fail include:
- Failed system status checks
- Misconfigured networking or startup configuration
- Exhausted memory
- Corrupted file system
- Incompatible kernel
– System Status Checks monitor the AWS systems required to use your instance to ensure they are working properly. These checks detect problems with your instance that require AWS involvement to repair. When a system status check fails, you can choose to wait for AWS to fix the issue or you can resolve it yourself (for example, by stopping and restarting or terminating and replacing an instance). Examples of problems that cause system status checks to fail include:
- Loss of network connectivity
- Loss of system power
- Software issues on the physical host
- Hardware issues on the physical host
- Reboot: A reboot can be either an instance reboot or a system reboot.
- System maintenance: An instance may be temporarily affected by network maintenance or power maintenance.
- Instance retirement: An instance that's scheduled for retirement will be stopped or terminated.
- Instance stop: An instance may need to be stopped in order to migrate it to new hardware.
If one of your instances is scheduled for any of the above events, you may be able to take actions to control the timing of the event, or to minimize downtime. For more information, check out Amazon's documentation.
– Load Balancer Service Checks routinely check the health of each registered Amazon EC2 instance based on the configurations that you specify. If Elastic Load Balancing finds an unhealthy instance, it stops sending traffic to the instance and reroutes traffic to healthy instances.
Your load balancer performs health checks on your instances using the protocol, port, URL, timeout, and interval specified when you configured your load balancer. For example, you can configure a health check for your instances as follows: the load balancer sends a request to http://node IP address:80/index.htm every 5 seconds and allows 3 seconds for the web server to respond. If the load balancer does not get any response after 2 attempts, it takes the node out of service. If the load balancer gets 2 successful responses, it puts the node back in service. Instances that are in service at the time of the health check are marked healthy, and instances that are out of service at the time of the health check are marked unhealthy.
– Load Balancer Availability Zone Checks will alert you if there is ever an AZ behind a Load Balancer that has zero InService instances behind it.
– Uptime Checks will alert you if an instance, load balancer or generic endpoint becomes unavailable. They must be configured at https://app.stackdriver.com/endpoints before you can use them in your policies. Currently, we offer checks from Virginia, Texas, Oregon, Amsterdam and Singapore. You can find more detailed configuration information here.
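To make the load balancer example above concrete, here is a minimal Python sketch of the "2 failed attempts out, 2 successful responses back in" rule, together with the Availability Zone check for zones that have zero in-service instances. It is an illustration only, not Elastic Load Balancing's implementation: the `HealthTracker` class, the `zones_without_capacity` function, and the zone names are hypothetical, and the sketch assumes the thresholds apply to consecutive results. The 5-second interval and 3-second timeout are not modelled; each call to `record` stands for one completed check.

```python
class HealthTracker:
    """Tracks one instance's in-service state from a stream of check results."""

    def __init__(self, unhealthy_threshold=2, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.in_service = True
        self._streak = 0  # consecutive results that disagree with the current state

    def record(self, success):
        """Record one check result; return True if the instance is in service."""
        if success == self.in_service:
            # The result agrees with the current state, so any streak is broken.
            self._streak = 0
            return self.in_service
        self._streak += 1
        needed = self.healthy_threshold if success else self.unhealthy_threshold
        if self._streak >= needed:
            self.in_service = success
            self._streak = 0
        return self.in_service


def zones_without_capacity(in_service_by_zone):
    """Return the zones that have zero in-service instances, sorted by name."""
    return sorted(z for z, states in in_service_by_zone.items() if not any(states))


tracker = HealthTracker()
print([tracker.record(ok) for ok in [True, False, False, True, True]])
# [True, True, False, False, True]

print(zones_without_capacity({
    "us-east-1a": [True, False],
    "us-east-1b": [False, False],
}))
# ['us-east-1b']
```

Using separate thresholds for leaving and re-entering service keeps a single slow response from removing a node, and a single lucky response from restoring one.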
How can I choose to be notified?
Once you configure the foundation of your policy, you can then move on to choose your notification options. Feel free to add as many as you like. We offer support for:
Can I configure any kind of automated actions?
Currently, we offer three types of post-alert actions:
- Reboot (Instance) will perform the standard EC2 reboot action.
- Move Host (Instance) will stop and then start an instance so that Amazon will provision it on a new hypervisor. This can be helpful if you have noisy neighbors.
- Add Capacity (RDS) will increase the size of an RDS database. This can be useful to automate when disk space is running low.