<!--
// © University of Southampton IT Innovation Centre, 2018
//
// Copyright in this software belongs to University of Southampton
// IT Innovation Centre of Gamma House, Enterprise Road,
// Chilworth Science Park, Southampton, SO16 7NS, UK.
//
// This software may not be used, sold, licensed, transferred, copied
// or reproduced in whole or in part in any manner or form or in or
// on any media by any person other than in accordance with the terms
// of the Licence Agreement supplied with the software, or otherwise
// without the prior written consent of the copyright owners.
//
// This software is distributed WITHOUT ANY WARRANTY, without even the
// implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
// PURPOSE, except where stated in the Licence Agreement supplied with
// the software.
//
// Created By : Nikolay Stanchev
// Created Date : 15-08-2018
// Created for Project : FLAME
-->

# **FLAME - TOSCA Alerts Specification**

#### **Authors**

|Authors|Organisation|
|:---:|:---:|
|[Nikolay Stanchev](mailto:ns17@it-innovation.soton.ac.uk)|[University of Southampton, IT Innovation Centre](http://www.it-innovation.soton.ac.uk)|
|[Michael Boniface](mailto:mjb@it-innovation.soton.ac.uk)|[University of Southampton, IT Innovation Centre](http://www.it-innovation.soton.ac.uk)|

#### Description

This document outlines the TOSCA alert specification used to configure alerts within CLMC. Alerts are configured through a YAML-based TOSCA-compliant document according to the TOSCA simple profile. This document is passed to the CLMC service, which parses and validates it. Subsequently, the CLMC service creates and activates the alerts within Kapacitor, then registers the HTTP alert handlers specified in the document. The specification is compliant with the TOSCA policy template as implemented by the OpenStack tosca-parser.
See an example below:

https://github.com/openstack/tosca-parser/blob/master/toscaparser/tests/data/policies/tosca_policy_template.yaml

#### TOSCA Alerts Specification Document

The TOSCA Alerts Specification Document consists of two main sections - **metadata** and **policies**. Each **policy** contains a number of triggers. A **trigger** is a fully qualified specification for an alert. Full definitions and clarification of the structure of the document are given in the following sections. An example of a valid alert specification document looks like this:

```yaml
tosca_definitions_version: tosca_simple_profile_for_nfv_1_0_0

description: TOSCA Alerts Configuration document

imports:
- flame_clmc_alerts_definitions.yaml

metadata:
  sfc: companyA-VR
  sfci: companyA-VR-premium

topology_template:

  policies:
    - high_latency_policy:
        type: eu.ict-flame.policies.StateChange
        triggers:
          high_latency:
            description: This event triggers when the mean network latency in a given location exceeds a given threshold (in ms).
            event_type: threshold
            metric: network.latency
            condition:
              threshold: 45
              granularity: 120
              aggregation_method: mean
              resource_type:
                flame_location: watershed
              comparison_operator: gt
            action:
              implementation:
                - flame_sfemc
                - http://companyA.alert-handler.flame.eu/high-latency

    - low_requests_policy:
        type: eu.ict-flame.policies.StateChange
        triggers:
          low_requests:
            description: |
              This event triggers when the last reported number of requests for a given service function falls below a given threshold.
            event_type: threshold
            metric: storage.requests
            condition:
              threshold: 5
              granularity: 60
              aggregation_method: last
              resource_type:
                flame_sfp: storage
                flame_sf: storage-users
                flame_location: watershed
              comparison_operator: lt
            action:
              implementation:
                - flame_sfemc
                - http://companyA.alert-handler.flame.eu/low-requests

    - requests_diff_policy:
        type: eu.ict-flame.policies.StateChange
        triggers:
          increase_in_requests:
            description: |
              This event triggers when the number of requests has increased relative to the number of requests received 120 seconds ago.
            event_type: relative
            metric: storage.requests
            condition:
              threshold: 100  # requests have increased by at least 100
              granularity: 120
              resource_type:
                flame_sfp: storage
                flame_sf: storage-users
                flame_server: watershed
                flame_location: watershed
              comparison_operator: gte
            action:
              implementation:
                - flame_sfemc
          decrease_in_requests:
            description: |
              This event triggers when the number of requests has decreased relative to the number of requests received 120 seconds ago.
            event_type: relative
            metric: storage.requests
            condition:
              threshold: -100  # requests have decreased by at least 100
              granularity: 120
              resource_type:
                flame_sfp: storage
                flame_sf: storage-users
                flame_location: watershed
              comparison_operator: lte
            action:
              implementation:
                - flame_sfemc

    - missing_measurement_policy:
        type: eu.ict-flame.policies.StateChange
        triggers:
          missing_storage_measurements:
            description: This event triggers when the number of storage measurements reported falls below the threshold value.
            event_type: deadman
            # deadman trigger instances monitor the whole measurement (storage in this case), so simply put a star for field value
            # to be compliant with the <measurement>.<field> format
            metric: storage.*
            condition:
              threshold: 0  # alert if the number of reported points is less than or equal to 0 (in other words, no measurements are reported)
              granularity: 60  # check for missing data for the last 60 seconds
              resource_type:
                flame_sfp: storage
            action:
              implementation:
                - http://companyA.alert-handler.flame.eu/missing-measurements
```

##### Metadata

The ***metadata*** section specifies the service function chain ID and the service function chain instance ID to which this alerts specification relates. The format is the following:

```yaml
metadata:
  sfc: <sfc_id>
  sfci: <sfc_i_id>
```

##### Policies

The ***policies*** section defines a list of policy nodes, each representing a fully qualified configuration for an alert within CLMC. The format is the following:

```yaml
topology_template:

  policies:
    - <policy_identifier>:
        type: eu.ict-flame.policies.StateChange
        triggers:
          <event_identifier>:
            description: <optional description for the given event trigger>
            event_type: <threshold | relative | deadman>
            metric: <measurement>.<field>
            condition:
              threshold: <critical value - semantics depend on the event type>
              granularity: <period in seconds - semantics depend on the event type>
              aggregation_method: <aggregation function supported by InfluxDB - e.g. 'mean'>
              resource_type:
                <CLMC Information Model Tag Name>: <CLMC Information Model Tag Value>
                <CLMC Information Model Tag Name>: <CLMC Information Model Tag Value>
                ...
              comparison_operator: <logical operator to use for comparison, e.g. 'gt', 'lt', 'gte', etc.>
            action:
              implementation:
                - <flame_sfemc or HTTP Alert Handler URL - receives POST messages from Kapacitor when alerts trigger>
                - <flame_sfemc or HTTP Alert Handler URL - receives POST messages from Kapacitor when alerts trigger>
                ...
          ...
```

##### Definitions

* **policy_identifier** - the policy label, which **MUST** match a StateChange policy in the TOSCA resource specification document submitted to the FLAME Orchestrator.

* **event_identifier** - the name of the event, which **MUST** match the *constraint* event name referenced in the TOSCA resource specification document submitted to the FLAME Orchestrator.

* **event_type** - the type of TICK Script template used to create the alert. The most common type is expected to be **threshold**. Currently, the other supported types are **relative** and **deadman**.

* **metric** - the metric to query in InfluxDB. It must include the measurement name and the field name in the format `<measurement>.<field>`. The only exception is the **deadman** event type, where the `<field>` is not used but the format stays the same for consistency. Therefore, using `<measurement>.*` is sufficient.

* **threshold** -
    * for the **threshold** event type, this is the critical value the queried metric is compared to.
    * for the **relative** event type, this is the critical value the difference (between the current metric value and the past metric value) is compared to.
    * for the **deadman** event type, this is the critical value the number of measurement points (received in InfluxDB) is compared to.

* **granularity** - period in seconds
    * for the **threshold** event type, this value specifies how often Kapacitor queries InfluxDB to check whether the alert condition is true.
    * for the **relative** event type, this value specifies how far back in time the past metric value is taken from when comparing it with the current metric value.
    * for the **deadman** event type, this value specifies the length of the time span in which the number of measurement points is checked.

* **aggregation_method** - the function to use when querying InfluxDB, e.g. median, mean, etc. This value is only used when the event_type is set to **threshold**.

* **resource_type** - provides context for the given event - key-value pairs for the global tags of the CLMC Information Model. This includes any of the following: `"flame_sfp", "flame_sf", "flame_sfe", "flame_server", "flame_location"`. Keep in mind that **flame_sfc** and **flame_sfci** are also part of the CLMC Information Model. However, filtering on these tags is generated automatically and added to all InfluxDB queries by using the metadata values from the alerts specification. Therefore, including **flame_sfc** and **flame_sfci** in the **resource_type** is considered INVALID. For more information on the global tags, please check the [documentation](monitoring.md).

* **comparison_operator** - the logical operator to use for comparison - lt (less than), gt (greater than), lte (less than or equal to), etc.

* **implementation** - a list of URL entries for the alert handlers to which alert data is sent when the event condition is true. If the alert is supposed to be sent to SFEMC, use **flame_sfemc** instead of typing a URL - the configurator will generate the correct SFEMC URL.

##### Event types

* **threshold** - A threshold event type is an alert in which Kapacitor queries InfluxDB for a specific metric over a given period of time by using a query function such as *mean*, *median*, *mode*, etc. The resulting value is then compared against a given threshold. If the result of the comparison operation is true, an alert is triggered. For example:

```yaml
high_latency:
  description: This event triggers when the mean network latency in a given location exceeds a given threshold (in ms).
  event_type: threshold
  metric: network.latency
  condition:
    threshold: 45
    granularity: 120
    aggregation_method: mean
    resource_type:
      flame_location: watershed
    comparison_operator: gt
  action:
    implementation:
      - flame_sfemc
      - http://companyA.alert-handler.flame.eu/high-latency
```

This trigger specification will create an alert task in Kapacitor, which queries the **latency** field in the **network** measurement on location **watershed** every **120** seconds and compares the mean value for the last 120 seconds with the threshold value **45**. If the mean latency exceeds 45 (**gt** operator is used, which stands for **greater than**), an alert is triggered. This alert will be sent through an HTTP POST message to the URLs listed in the **implementation** section.

The currently included InfluxQL functions are:

`"count", "mean", "median", "mode", "sum", "first", "last", "max", "min"`

The comparison operator mappings are as follows:

```
"lt" : "less than",
"gt" : "greater than",
"lte" : "less than or equal to",
"gte" : "greater than or equal to",
"eq" : "equal",
"neq" : "not equal"
```

* **relative** - A relative event type is an alert in which Kapacitor computes the difference between the current value of a metric and the value reported a given period of time ago. The difference between the current and the past value is then compared against a given threshold. If the result of the comparison operation is true, an alert is triggered. For example:

```yaml
decrease_in_requests:
  description: |
    This event triggers when the number of requests has decreased relative to the number of requests received 120 seconds ago.
  event_type: relative
  metric: storage.requests
  condition:
    threshold: -100
    granularity: 120
    resource_type:
      flame_sfp: storage
      flame_sf: storage-users
      flame_location: watershed
    comparison_operator: lte
  action:
    implementation:
      - flame_sfemc
```

This trigger specification will create an alert task in Kapacitor, which compares every **requests** value reported in measurement **storage** with the value received **120** seconds ago. If the difference between the current and the past value is less than or equal to (comparison operator is **lte**) **-100**, an alert is triggered. In other words, an alert is triggered if the current **requests** value has decreased by at least 100 relative to the value reported 120 seconds ago. The queried value is contextualised for service function **storage-users** (using service function package **storage**) at location **watershed**. Triggered alerts will be sent through an HTTP POST message to the URLs listed in the **implementation** section.

*Notes*:

* **aggregation_method** is not required here - the alert task compares the actual value that is being reported (stream mode)
* if **aggregation_method** is provided, it will be ignored

* **deadman** - A deadman event type is an alert in which Kapacitor computes the number of reported points in a measurement for a given period of time. This number is then compared to a given threshold value. If fewer points have been reported (in comparison with the threshold value), an alert is triggered. For example:

```yaml
missing_storage_measurements:
  description: This event triggers when the number of storage measurements reported falls below the threshold value.
  event_type: deadman
  metric: storage.*
  condition:
    threshold: 0
    granularity: 60
    resource_type:
      flame_sfp: storage
  action:
    implementation:
      - flame_sfemc
```

This trigger specification will create an alert task in Kapacitor, which monitors the number of points reported in measurement **storage** that have the tag **flame_sfp** set to **storage**. This value is computed every 60 seconds. If the number of reported points is less than or equal to **0** (no points have been reported for the last 60 seconds), an alert will be triggered. Triggered alerts will be sent through an HTTP POST message to the URLs listed in the **implementation** section.

*Notes*:

* **metric** only requires the measurement name in this event type and doesn't require a field name
    * the trigger specification still needs to be consistent with the parsing rule for **metric**: `<measurement>.<field>`
    * simply putting a `*` for the field is sufficient, e.g. `storage.*`
    * even if you put something else for the field value, it will be ignored - only the **measurement** name is used
* **aggregation_method** is not required in this event type, any values provided will be ignored
* **comparison_operator** is not required in this event type, any values provided will be ignored
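
The comparison semantics of the three event types can be summarised in a short Python sketch. This is only an illustration of the rules described in this document, not the CLMC or Kapacitor implementation: the function names (`split_metric`, `threshold_fires`, `relative_fires`, `deadman_fires`) are hypothetical, the `condition` dictionaries simply mirror the **condition** section of a trigger, and the deadman check assumes the "less than or equal to" reading used in the example comments above.

```python
import operator

# Comparison operator mappings, as listed in this specification.
COMPARISON_OPERATORS = {
    "lt": operator.lt,
    "gt": operator.gt,
    "lte": operator.le,
    "gte": operator.ge,
    "eq": operator.eq,
    "neq": operator.ne,
}


def split_metric(metric):
    """Split '<measurement>.<field>' at the first dot (hypothetical helper)."""
    measurement, sep, field = metric.partition(".")
    if not sep or not measurement or not field:
        raise ValueError("metric must have the format <measurement>.<field>")
    return measurement, field


def threshold_fires(aggregated_value, condition):
    """threshold: compare the aggregated metric value with the threshold."""
    compare = COMPARISON_OPERATORS[condition["comparison_operator"]]
    return compare(aggregated_value, condition["threshold"])


def relative_fires(current_value, past_value, condition):
    """relative: compare (current - past) with the threshold."""
    compare = COMPARISON_OPERATORS[condition["comparison_operator"]]
    return compare(current_value - past_value, condition["threshold"])


def deadman_fires(points_reported, condition):
    """deadman: no comparison operator is used; assumed to fire when the
    number of reported points is less than or equal to the threshold."""
    return points_reported <= condition["threshold"]


# Usage with the condition values from the examples in this document.
high_latency = {"threshold": 45, "granularity": 120,
                "aggregation_method": "mean", "comparison_operator": "gt"}
decrease_in_requests = {"threshold": -100, "granularity": 120,
                        "comparison_operator": "lte"}
missing_measurements = {"threshold": 0, "granularity": 60}

print(split_metric("storage.*"))                      # ('storage', '*')
print(threshold_fires(50.2, high_latency))            # True: 50.2 > 45
print(relative_fires(300, 450, decrease_in_requests)) # True: -150 <= -100
print(deadman_fires(0, missing_measurements))         # True: no points reported
```

The sample inputs (a mean latency of 50.2, request counts of 300 and 450, zero reported points) are made up for illustration; only the condition values come from the examples above.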