diff --git a/docs/AlertsConfiguration.md b/docs/AlertsConfiguration.md
index fdd91048a38eca65860527ef03cb2ed7b3009e38..c689e6bcad4130c152b8c4f32499d810b1126b62 100644
--- a/docs/AlertsConfiguration.md
+++ b/docs/AlertsConfiguration.md
@@ -33,16 +33,17 @@
 #### Description
 
 This document outlines the TOSCA alert specification used to configure alerts within CLMC. Alerts are configured through a YAML-based
-TOSCA-compliant document according to the TOSCA simple profile. This document is passed to the CLMC service, which parses and validates the document. Subsequently, the CLMC service
-creates and activates the alerts within Kapacitor, then registers the HTTP alert handlers specified in the document.
+TOSCA-compliant document according to the TOSCA simple profile. This document is passed to the CLMC service, which parses and validates the document.
+Subsequently, the CLMC service creates and activates the alerts within Kapacitor, then registers the HTTP alert handlers specified in the document.
 
 The specification is compliant with the TOSCA policy template as implemented by the Openstack tosca parser. See an example below:
 https://github.com/openstack/tosca-parser/blob/master/toscaparser/tests/data/policies/tosca_policy_template.yaml
 
 #### TOSCA Alerts Specification Document
 
-The TOSCA Alerts Specification Document consists of two main sections - **metadata** and **triggers**. Full definitions and
-clarification of the structure of the document is given in the following sections. An example of a valid alert specification
+The TOSCA Alerts Specification Document consists of two main sections - **metadata** and **policies**. Each **policy** contains a number
+of triggers. A **trigger** is a fully qualified specification for an alert. Full definitions and clarification of the structure of the document
+are given in the following sections. An example of a valid alert specification
 document will look like:
 
 ```yaml
@@ -188,22 +189,23 @@ topology_template:
             event_type: <threshold | relative | deadman>
             metric: <measurement>.<field>
             condition:
-              threshold: <critical value>
-              granularity: <period in seconds - how often to check whether the event condition is true>
+              threshold: <critical value - semantics depend on the event type>
+              granularity: <period in seconds - semantics depend on the event type>
               aggregation_method: <aggregation function supported by InfluxDB - e.g. 'mean'>
               resource_type:
                 <CLMC Information Model Tag Name>: <CLMC Information Model Tag Value>
                 <CLMC Information Model Tag Name>: <CLMC Information Model Tag Value>
                 ...
-              comparison_operator: <logical operator to use for comparison, e.g. 'gt', 'lt'
+              comparison_operator: <logical operator to use for comparison, e.g. 'gt', 'lt', 'gte', etc.>
             action:
               implementation:
-                - <HTTP Alert Handler URL>
-                - <HTTP Alert Handler URL>
+                - <HTTP Alert Handler URL - receives POST messages from Kapacitor when alerts trigger>
+                - <HTTP Alert Handler URL - receives POST messages from Kapacitor when alerts trigger>
                 ...
           ...
 ```
 
+
 ##### Definitions
 
 * **policy_identifier** - policy label which should match with a StateChange policy in the TOSCA resource specification document
@@ -213,21 +215,142 @@ submitted to the FLAME Orchestrator for consistency.
 specification document submitted to the FLAME Orchestrator.
 
 * **event_type** - the type of TICK Script template to use to create the alert - more information will be provided about
-the different options here, but we assume the most common one will be **threshold**.
+the different options here, but we assume the most common one will be **threshold**. Other supported types are **relative**
+and **deadman**.
 
 * **metric** - the metric to query in InfluxDB, must include measurement name and field name in
-format `<measurement>`.`<field>`
+format `<measurement>`.`<field>`. The only exception is when a **deadman** event type is used, where the `<field>` part is ignored.
 
-* **threshold** - when using the **threshold** event type, this is the critical value the actual metric is compared to.
+* **threshold** -
+    * for **threshold** event type, this is the critical value the queried metric is compared to.
+    * for **relative** event type, this is the critical value the difference (between the current metric value and the past metric value) is compared to.
+    * for **deadman** event type, this is the critical value the number of measurement points (received in InfluxDB) is compared to.
 
-* **granularity** - the period in seconds, which instructs Kapacitor how often to query InfluxDB and check whether the
-event condition is true.
+* **granularity** - period in seconds
+    * for **threshold** event type, this value specifies how often Kapacitor should query InfluxDB to check whether the alert condition is true.
+    * for **relative** event type, this value specifies how far back in time the past metric value, which the current value is compared with, is taken from.
+    * for **deadman** event type, this value specifies the length of the time window in which the number of measurement points is counted.
 
-* **aggregation_method** - the function to use when querying InfluxDB
+* **aggregation_method** - the function to use when querying InfluxDB, e.g. median, mean, etc. This value is only used when
+the event_type is set to **threshold**.
 
-* **resource_type** - provides context for the given event - key-value pairs for the global tags of the
-CLMC Information Model
+* **resource_type** - provides context for the given event - key-value pairs for the global tags of the CLMC Information Model.
 
-* **comparison_operator** - the logical operator to use for comparison - less than, greater than, less than or erual to, etc.
+* **comparison_operator** - the logical operator to use for comparison - lt (less than), gt (greater than), lte (less than or equal to), etc.
 
 * **implementation** - a list of the URLs of alert handlers to which alert data is sent when the event condition is true.
+
+
+##### Event types
+
+* **threshold** - A threshold event type is an alert in which Kapacitor queries InfluxDB for a specific metric over a given period of time
+by using a query function such as *mean*, *median*, *mode*, etc. The resulting value is then compared against a given threshold. If the
+result of the comparison operation is true, an alert is triggered. For example:
+
+    ```yaml
+    high_latency:
+      description: This event triggers when the mean network latency in a given location exceeds a given threshold (in ms).
+      event_type: threshold
+      metric: network.latency
+      condition:
+        threshold: 45
+        granularity: 120
+        aggregation_method: mean
+        resource_type:
+          location: watershed
+        comparison_operator: gt
+      action:
+        implementation:
+          - http://sfemc.flame.eu/notify
+          - http://companyA.alert-handler.flame.eu/high-latency
+    ```
+
+    This trigger specification will create an alert task in Kapacitor, which queries the **latency** field in the **network**
+    measurement on location **watershed** every **120** seconds and compares the mean value for the last 120 seconds with the threshold value **45**.
+    If the mean latency exceeds 45 (**gt** operator is used, which stands for **greater than**), an alert is triggered. This alert will
+    be sent through an HTTP POST message to the URLs listed in the **implementation** section.
+
+    The currently included InfluxQL functions are:
+
+    `"count", "mean", "median", "mode", "sum", "first", "last", "max", "min"`
+
+    The comparison operator mappings are as follows:
+
+    ```
+    "lt" : "less than",
+    "gt" : "greater than",
+    "lte" : "less than or equal to",
+    "gte" : "greater than or equal to",
+    "eq" : "equal",
+    "neq" : "not equal"
+    ```
+
+* **relative** - A relative event type is an alert in which Kapacitor computes the difference between the current value of a metric and the value
+reported a given period of time ago. The difference between the current and the past value is then compared against a given
+threshold. If the result of the comparison operation is true, an alert is triggered. For example:
+
+    ```yaml
+    decrease_in_requests:
+      description: |
+        This event triggers when the number of requests has decreased relative to the number of requests received
+        120 seconds ago.
+      event_type: relative
+      metric: storage.requests
+      condition:
+        threshold: -100
+        granularity: 120
+        resource_type:
+          sf_package: storage
+          sf: storage-users
+          location: watershed
+        comparison_operator: lte
+      action:
+        implementation:
+          - http://sfemc.flame.eu/notify
+    ```
+
+    This trigger specification will create an alert task in Kapacitor, which compares every **requests** value reported in
+    measurement **storage** with the value received **120** seconds ago. If the difference between the current and the past
+    value is less than or equal to (comparison operator is **lte**) **-100**, an alert is triggered. Put simply, an alert
+    is triggered if the current **requests** value has decreased by at least 100 relative to the value reported 120 seconds ago.
+    The queried value is contextualised for service function **storage-users** (using service function package **storage**)
+    at location **watershed**. Triggered alerts will be sent through an HTTP POST message to the URLs listed in the **implementation** section.
+
+    *Notes*:
+
+    * **aggregation_method** is not required here - the alert task compares the actual value that's being reported (stream mode)
+    * if **aggregation_method** is provided, it will be ignored
+
+* **deadman** - A deadman event type is an alert in which Kapacitor computes the number of reported points in a measurement
+for a given period of time. This number is then compared to a given threshold value. If the number of reported points does not
+exceed the threshold value, an alert is triggered.
+For example:
+
+    ```yaml
+    missing_storage_measurements:
+      description: This event triggers when the number of storage measurements reported does not exceed the threshold value.
+      event_type: deadman
+      metric: storage.*
+      condition:
+        threshold: 0
+        granularity: 60
+        resource_type:
+          sf_package: storage
+      action:
+        implementation:
+          - http://sfemc.flame.eu/notify
+    ```
+
+    This trigger specification will create an alert task in Kapacitor, which monitors the number of points reported in
+    measurement **storage** that have tag **sf_package** set to **storage**. This value is computed every 60 seconds.
+    If the number of reported points does not exceed **0** (no points have been reported for the last 60 seconds), an alert
+    will be triggered. Triggered alerts will be sent through an HTTP POST message to the URLs listed in the **implementation** section.
+
+    *Notes*:
+
+    * **metric** only requires the measurement name in this event type and doesn't require a field name
+    * the trigger specification still needs to be consistent with the parsing rule for **metric**: `<measurement>`.`<field>`
+    * simply putting a `*` for field is sufficient, e.g. `storage.*`
+    * even if you put something else for field value, it will be ignored - only the **measurement** name is used
+    * **aggregation_method** is not required in this event type; any value provided will be ignored
+    * **comparison_operator** is not required in this event type; any value provided will be ignored
diff --git a/src/service/.coveragerc b/src/service/.coveragerc
index 9f2b9eaf4f0eb7178a1f8be6f9e3eeeb4c4ccbf4..1166225d2ef504a94bc5810480671027ba661dcc 100644
--- a/src/service/.coveragerc
+++ b/src/service/.coveragerc
@@ -1,9 +1,2 @@
 [run]
-source = clmcservice
-omit =
-    *test*
-    *__init__*
-    clmcservice\aggregation\influx_data_interface.py
-    clmcservice\configapi\views.py
-    clmcservice\whoamiapi\views.py
-# configapi\views and whoami\views are currently omitted since there is no implementation there, yet
\ No newline at end of file
+source = clmcservice
\ No newline at end of file
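
Review note on the event-type semantics documented in `docs/AlertsConfiguration.md` above: the three alert conditions can be modelled in a few lines of Python to sanity-check the worked examples. This is a minimal sketch and not the CLMC or Kapacitor implementation. The names `COMPARATORS`, `AGGREGATIONS`, `threshold_alert`, `relative_alert` and `deadman_alert` are hypothetical. The deadman check assumes the alert fires when the point count is at or below the threshold, which is how the `threshold: 0` example (no points reported) reads.

```python
import operator
import statistics

# Comparison operator names as listed in the documentation above.
COMPARATORS = {
    "lt": operator.lt,
    "gt": operator.gt,
    "lte": operator.le,
    "gte": operator.ge,
    "eq": operator.eq,
    "neq": operator.ne,
}

# Plain-Python stand-ins for the listed InfluxQL functions (illustration only).
AGGREGATIONS = {
    "count": len,
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,
    "sum": sum,
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
    "max": max,
    "min": min,
}


def threshold_alert(values, aggregation_method, comparison_operator, threshold):
    """Aggregate the values of one granularity window and compare with the threshold."""
    aggregated = AGGREGATIONS[aggregation_method](values)
    return COMPARATORS[comparison_operator](aggregated, threshold)


def relative_alert(current, past, comparison_operator, threshold):
    """Compare the difference (current - past) with the threshold."""
    return COMPARATORS[comparison_operator](current - past, threshold)


def deadman_alert(point_count, threshold):
    """Assumed semantics: fire when the point count is at or below the threshold."""
    return point_count <= threshold


if __name__ == "__main__":
    # high_latency: mean of the window is 50, which is greater than 45.
    print(threshold_alert([40, 50, 60], "mean", "gt", 45))  # True
    # decrease_in_requests: 380 - 500 = -120, which is <= -100.
    print(relative_alert(380, 500, "lte", -100))  # True
    # missing_storage_measurements: no points in the window, threshold 0.
    print(deadman_alert(0, 0))  # True
```

The sample values (latencies of 40, 50 and 60, request counts of 380 and 500) are made up for illustration; only the thresholds and operators come from the examples in the diff.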