About policies and alerts

Last Updated: Apr 07, 2015 06:31PM EDT

What is a policy?

A policy is a condition or set of conditions that, when met, will send out an alert to a specific user, group of users or service. You can create policies in Stackdriver to stay on top of issues that may arise within your environment. Policies are used specifically to monitor the health of resources and to set thresholds. You can apply them to individual resources, specific application clusters or groups.

For setup instructions, please visit our Creating and configuring alerting policies doc. 

*Email alerts are sent from alerts@stackdriver.com—please add this address as a trusted sender to ensure you receive all alerts.
*SMS alerts are sent from +1-617-229-6940

What are metric policy conditions?

Metric policy conditions are manually configured conditionals that focus on a metric and a resource or group of resources. There are three types of metric policy conditions: Threshold, Absence and Group Aggregate.

– A threshold policy condition can be configured to alert you when any metric crosses a set line for a specific period of time (for example: alert me when the latency on my load balancer is greater than 1000 ms for 15 minutes). You can use this condition type to monitor many different metrics and fine-tune exactly when you’d like to be notified.
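To make the "greater than 1000 ms for 15 minutes" idea concrete, here is a minimal Python sketch of how such a condition could be evaluated. It is an illustration only, not Stackdriver's implementation: the function name `threshold_breached` and the `(timestamp, value)` sample format are assumptions made for this example.

```python
from typing import Iterable, Tuple


def threshold_breached(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    duration_s: float,
) -> bool:
    """Return True if the value stays above `threshold` for at least
    `duration_s` seconds of consecutive samples.

    `samples` is assumed to be (unix_timestamp_seconds, value) pairs
    sorted by timestamp. This is a hypothetical format for illustration.
    """
    breach_start = None  # timestamp of the first sample in the current breach
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= duration_s:
                return True
        else:
            # Any sample at or below the threshold resets the clock.
            breach_start = None
    return False


# One latency sample per minute for 15 minutes, all above 1000 ms.
latency = [(minute * 60, 1200.0) for minute in range(16)]
alert = threshold_breached(latency, threshold=1000.0, duration_s=15 * 60)
```

In this sketch a single sample that dips back under the line restarts the 15-minute window, which is why a brief recovery would suppress the alert.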

– A metric absence policy condition will monitor any of Stackdriver’s supported metrics and send out an alert if it detects an absence of data over a certain amount of time. Monitoring a metric’s absence can help you confirm, for example, that CloudWatch is up and running or that your agent is working properly. By watching for the absence of data from vital metrics in your environment, you can essentially monitor the “heartbeat” of your system.
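The "heartbeat" idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how Stackdriver evaluates absence internally: the function name `is_absent` and the plain list of sample timestamps are hypothetical.

```python
from typing import Sequence


def is_absent(timestamps: Sequence[float], now: float, window_s: float) -> bool:
    """Return True if no sample arrived in the window (now - window_s, now].

    `timestamps` is assumed to hold the arrival times (in seconds) of the
    samples received for one metric on one resource.
    """
    return not any(now - window_s < ts <= now for ts in timestamps)


# The agent last reported at t=120 s; at t=1000 s we look back 10 minutes.
silent = is_absent([0, 60, 120], now=1000, window_s=600)
```

A metric that has never reported any data is also treated as absent here, which may or may not be the behavior you want for newly created resources.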

– A group aggregate threshold policy condition allows you to set threshold alerts on aggregate metrics for clusters. The following aggregate functions are available for use in your configurations:
  • Average
  • Sum
  • Min
  • Max
  • Median
  • Standard deviation
  • 95th percentile
  • 5th percentile
You may want to set basic boundaries across clusters for metrics such as average CPU, Memory or Disk I/O (example: alert me if average memory usage across my cluster surpasses 80%).
You can also keep an eye on variance by alerting on a change in standard deviation for a metric across a cluster. This will provide some warning when a cluster is not running within its normal operating boundaries (example: my Cassandra cluster normally has a standard deviation of 5% CPU, so alert me if that rises above 7% for 1 hour).
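As a rough illustration of the aggregate functions listed above, the following Python sketch computes each one over the latest value from every member of a cluster and compares the result to a threshold. The names (`AGGREGATES`, `group_condition_met`, `percentile`) are hypothetical, and two details are assumptions rather than documented behavior: percentiles use the nearest-rank definition and standard deviation is the population form.

```python
import math
import statistics
from typing import Callable, Dict, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Keys are illustrative labels, not product identifiers.
AGGREGATES: Dict[str, Callable[[Sequence[float]], float]] = {
    "average": statistics.fmean,
    "sum": math.fsum,
    "min": min,
    "max": max,
    "median": statistics.median,
    "stddev": statistics.pstdev,  # population standard deviation (assumption)
    "p95": lambda v: percentile(v, 95),
    "p5": lambda v: percentile(v, 5),
}


def group_condition_met(values: Sequence[float], aggregate: str, threshold: float) -> bool:
    """Return True if the chosen aggregate over the group exceeds `threshold`."""
    return AGGREGATES[aggregate](values) > threshold


# Memory usage (%) reported by four members of a cluster.
memory_pct = [70.0, 85.0, 90.0, 82.0]
over_80 = group_condition_met(memory_pct, "average", 80.0)
```

The sketch checks a single point in time. A duration such as "for 1 hour" would be layered on top by requiring the condition to hold across consecutive evaluations.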

What are health policy conditions?

Health policy conditions are predefined resource checks that are determined by your cloud provider, or configured manually in the case of uptime checks. We offer 7 types of health policy conditions, which can be applied to a single resource, a group of resources or all resources.

– Instance Status Checks monitor the software and network configuration of your individual instance. These checks detect problems that require your involvement to repair. When an instance status check fails, typically you will need to address the problem yourself (for example, by rebooting the instance or by making modifications in your operating system). Examples of problems that may cause instance status checks to fail include:
  • Failed system status checks
  • Misconfigured networking or startup configuration
  • Exhausted memory
  • Corrupted file system
  • Incompatible kernel
Note: Status checks that occur during instance reboot or while a Windows instance store-backed instance is being bundled will report an instance status check failure until the instance becomes available again. Check out our documentation on Maintenance Mode.

– System Status Checks monitor the AWS systems required to use your instance to ensure they are working properly. These checks detect problems with your instance that require AWS involvement to repair. When a system status check fails, you can choose to wait for AWS to fix the issue or you can resolve it yourself (for example, by stopping and restarting or terminating and replacing an instance). Examples of problems that cause system status checks to fail include:
  • Loss of network connectivity
  • Loss of system power
  • Software issues on the physical host
  • Hardware issues on the physical host
– Instance Events describe specific events that AWS may schedule for your instances, such as a reboot or retirement. These scheduled events are not frequent. If one of your instances will be affected by a scheduled event, you'll receive an email prior to the scheduled event with details about the event, as well as a start and end date. You can also view scheduled events for your instance by using the Amazon EC2 console, API, or CLI.  There are different types of scheduled events:
  • Reboot: A reboot can be either an instance reboot or a system reboot.
  • System maintenance: An instance may be temporarily affected by network maintenance or power maintenance.
  • Instance retirement: An instance that's scheduled for retirement will be stopped or terminated.
  • Instance stop: An instance may need to be stopped in order to migrate it to new hardware.

If one of your instances is scheduled for any of the above events, you may be able to take actions to control the timing of the event, or to minimize downtime. For more information, check out Amazon's documentation.

– Load Balancer Service Checks routinely check the health of each registered Amazon EC2 instance based on the configurations that you specify. If Elastic Load Balancing finds an unhealthy instance, it stops sending traffic to the instance and reroutes traffic to healthy instances.
Your load balancer performs health checks on your instances using the protocol, port, URL, timeout, and interval specified when you configured your load balancer. For example, you can configure a health check for your instances as follows: the load balancer sends a request to http://<node IP address>:80/index.htm every 5 seconds and allows 3 seconds for the web server to respond. If the load balancer does not get any response after 2 attempts, it takes the node out of service. If the load balancer gets 2 successful responses, it puts the node back in service. Instances that are in service at the time of the health check are marked healthy, and instances that are out of service at the time of the health check are marked unhealthy.
Your registered instances can fail the health check for several reasons. The most common reasons are that EC2 instances close connections to your load balancer or that the response from the EC2 instances times out. For information on potential causes and steps you can take to resolve failed health check issues, see Troubleshooting Elastic Load Balancing: Health Check Configuration. The health status is composed of user-defined checks on instances. When Stackdriver detects that a check is no longer OK, we will alert on it.
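The in-service and out-of-service rules in the example above (2 failed attempts to remove a node, 2 successful responses to restore it) can be sketched as a small state tracker. This is an illustrative Python model, not Elastic Load Balancing's actual code; the class name `HealthTracker` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one node's state from a stream of health check results."""

    unhealthy_threshold: int = 2  # consecutive failures before removal
    healthy_threshold: int = 2    # consecutive successes before restoring
    in_service: bool = True
    _streak: int = 0              # consecutive results that disagree with the current state

    def record(self, success: bool) -> bool:
        """Record one check result and return whether the node is in service."""
        if success == self.in_service:
            # Result agrees with the current state, so any streak is broken.
            self._streak = 0
        else:
            self._streak += 1
            limit = self.healthy_threshold if success else self.unhealthy_threshold
            if self._streak >= limit:
                self.in_service = success
                self._streak = 0
        return self.in_service


tracker = HealthTracker()
results = [True, False, True, False, False, True, True]
states = [tracker.record(ok) for ok in results]
```

In this model a single failed check between successes does not remove the node, since only consecutive failures count toward the threshold.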

– Load Balancer Availability Zone Checks will alert you if there is ever an AZ behind a Load Balancer that has zero InService instances behind it.
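The underlying check amounts to grouping a load balancer's registered instances by Availability Zone and flagging any zone with no InService members. A minimal Python sketch, assuming a hypothetical list of `(zone, state)` pairs as input:

```python
from typing import Iterable, List, Tuple


def zones_without_healthy_instances(
    instances: Iterable[Tuple[str, str]],
) -> List[str]:
    """Return the zones that have registered instances but none InService.

    `instances` is assumed to be (availability_zone, state) pairs, where
    state is "InService" or "OutOfService".
    """
    pairs = list(instances)
    all_zones = {zone for zone, _ in pairs}
    healthy_zones = {zone for zone, state in pairs if state == "InService"}
    return sorted(all_zones - healthy_zones)


registered = [
    ("us-east-1a", "InService"),
    ("us-east-1b", "OutOfService"),
    ("us-east-1b", "OutOfService"),
]
empty_zones = zones_without_healthy_instances(registered)
```

Note that a zone enabled on the load balancer with no registered instances at all would not appear in this input, so detecting that case would need the list of enabled zones as well.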

– Uptime Checks will alert you if an instance, load balancer or generic endpoint becomes unavailable. They must be configured at https://app.stackdriver.com/endpoints before you can use them in your policies. Currently, we offer checks from Virginia, Texas, Oregon, Amsterdam and Singapore. You can find more detailed configuration information here.

How can I choose to be notified? 

Once you configure the foundation of your policy, you can then move on to choose your notification options. Feel free to add as many as you like. We offer support for:
  • Email
  • PagerDuty
  • SMS
  • HipChat
  • Campfire
  • Webhooks
  • SNS
Click here to see detailed configuration instructions for each one of these notification types.

Can I configure any kind of automated actions? 

Currently, we offer three types of post-alert actions:
  • Reboot (Instance) will perform the standard EC2 reboot action. 
  • Move Host (Instance) will stop and then start an instance so that Amazon will provision it on a new hypervisor. This can be helpful if you have noisy neighbors.
  • Add Capacity (RDS) will increase the size of an RDS database. This can be useful to automate when disk space is running low. 
Click here to see detailed configuration options for each of these actions. 