diff --git a/docs/AlertsSpecification.md b/docs/AlertsSpecification.md
index 2eae1cc049929da24c7a04f76308ca6ea353a37e..19fcaa971ba926e254258fa2433cfbd1e776a47e 100644
--- a/docs/AlertsSpecification.md
+++ b/docs/AlertsSpecification.md
@@ -62,8 +62,10 @@ topology_template:
     - high_latency_policy:
         type: eu.ict-flame.policies.Alert
         triggers:
-          high_latency:
+          high_latency_batch:
             description: This event triggers when the mean network latency in a given location exceeds a given threshold (in ms).
+            metadata:
+              monitoring_type: batch # the latency is monitored in batch mode, therefore aggregation method and granularity ARE required
             event_type: threshold
             metric: network.latency
             condition:
@@ -77,7 +79,21 @@ topology_template:
               implementation:
                 - flame_sfemc
                 - http://companyA.alert-handler.flame.eu/high-latency
-
+          high_latency_stream:
+            description: This event triggers when the network latency in a given location exceeds a given threshold (in ms).
+            metadata:
+              monitoring_type: stream # the latency is monitored in stream mode, therefore aggregation method and granularity ARE NOT required
+            event_type: threshold
+            metric: network.latency
+            condition:
+              threshold: 45
+              resource_type:
+                flame_location: watershed
+              comparison_operator: gt
+            action:
+              implementation:
+                - flame_sfemc
+                - http://companyA.alert-handler.flame.eu/high-latency
     - low_requests_policy:
         type: eu.ict-flame.policies.Alert
         triggers:
@@ -85,6 +101,8 @@ topology_template:
             description: |
               This event triggers when the last reported number of requests for a given service function falls behind a given threshold.
+            metadata:
+              monitoring_type: batch
             event_type: threshold
             metric: storage.requests
             condition:
@@ -183,6 +201,8 @@ topology_template:
         triggers:
           <event identifier>:
             description: <optional description for the given event trigger>
+            metadata: # semantic depends on the event type, deadman alert type doesn't require metadata section
+              <metadata key>: <metadata value>
             event_type: <threshold | relative | deadman>
             metric: <measurement>.<field>
             condition:
@@ -211,6 +231,11 @@ topology_template:
 * **event_type** - the type of TICK Script template to use to create the alert - more information will be provided about the different options here, but we assume the most common one will be **threshold**. Currently, the other supported types are **relative** and **deadman**. These are also the main Kapacitor tasks that can be created through Chronograf.
+* **metadata** - any metadata specific to the event type -
+    * for **threshold** event type, the metadata must contain a field called *monitoring_type* with a *stream* or *batch* value defining the type of monitoring to perform, see details in the relevant section below
+    * for **relative** event type, the metadata must contain a field called *percentage_evaluation* with a *true* or *false* value defining how to compute the difference (raw difference or percentage difference) between the current and the past metric value, see details in the relevant section below
+    * for **deadman** event type, metadata is not required and also not expected, passing a metadata field for this event type will fail validation
+
 * **metric** - the metric to query in InfluxDB, must include measurement name and field name in format `<measurement>`.`<field>`. The only exception is when a **deadman** event type is used - then the `<field>`is not used, but the format is still the same for consistency. Therefore, using `<measurement>.*` will be sufficient.
 * **threshold** -
@@ -219,11 +244,11 @@ topology_template:
     * for **deadman** event type, this is the critical value the number of measurement points (received in InfluxDB) is compared to.
 * **granularity** - period in seconds
-    * for **threshold** event type, this value specifies how often should Kapacitor query InfluxDB to check whether the alert condition is true.
+    * for **threshold** event type, this value specifies how often should Kapacitor query InfluxDB to check whether the alert condition is true; this is only required when monitoring type is set to batch, when using stream monitoring granularity must not be specified (every measurement point is monitored)
     * for **relative** event type, this value specifies how long back in time to compare the current metric value with
     * for **deadman** event type, this value specifies how long the span in time (in which the number of measurement points are checked) is
-* **aggregation_method** - the aggregation function to use when querying InfluxDB in batch mode, e.g. median, mean, etc. This value is only used when the event_type is set to **threshold** or **relative**.
+* **aggregation_method** - the aggregation function to use when querying InfluxDB in batch mode, e.g. median, mean, etc. This value is only used when the event type is set to **threshold** (and monitoring type is set to batch) or **relative**.
 The currently included InfluxQL functions are:
@@ -251,12 +276,14 @@ topology_template:
 ##### Event types
-* **threshold** - A threshold event type is an alert in which Kapacitor queries InfluxDB for a specific metric in a given period of time by using a query function such as *mean*, *median*, *mode*, etc. If the granularity is less than or equal to 60 seconds, then every measurement point is monitored (improving performance), thus, ignoring the aggregation function. This value is then compared against a given threshold. If the result of the comparison operation is true, an alert is triggered. For example:
+* **threshold** - A threshold event type is an alert in which Kapacitor queries InfluxDB for a specific metric in a given period of time by using a query function such as *mean*, *median*, *mode*, etc. This value is then compared against a given threshold. If the result of the comparison operation is true, an alert is triggered. For example:
     ```yaml
     high_latency:
       description: This event triggers when the mean network latency in a given location exceeds a given threshold (in ms).
       event_type: threshold
+      metadata:
+        monitoring_type: batch
       metric: network.latency
       condition:
         threshold: 45
@@ -272,6 +299,26 @@ topology_template:
     ```
     This trigger specification will create an alert task in Kapacitor, which queries the **latency** field in the **network** measurement for location **watershed** every **120** seconds and compares the mean value for the last 120 seconds with the threshold value **45**. If the mean latency exceeds 45 (**gt** operator is used, which stands for **greater than**), an alert is triggered. This alert will be sent through an HTTP POST message to the URLs listed in the **implementation** section.
+
+    An alternative of the alert above is to use *stream* monitoring which means that every measurement point is monitored rather than querying InfluxDB on a given period. Therefore, when using stream monitoring, granularity and aggregation method are not required. For example:
+
+    ```yaml
+    high_latency:
+      description: This event triggers when the network latency in a given location exceeds a given threshold (in ms).
+      event_type: threshold
+      metadata:
+        monitoring_type: stream
+      metric: network.latency
+      condition:
+        threshold: 45
+        resource_type:
+          flame_location: watershed
+        comparison_operator: gt
+      action:
+        implementation:
+          - flame_sfemc
+          - http://companyA.alert-handler.flame.eu/high-latency
+    ```
 * **relative** - A relative event type is an alert in which Kapacitor computes the difference between the current aggregated value of a metric and the aggregated value reported a given period of time ago. The difference between the current and the past value (could be raw difference, i.e. `current - past`, or percentage difference, i.e. `100 * (current - past) / past`) is then compared against a given threshold. If the result of the comparison operation is true, an alert is triggered. For example: