FLAME - TOSCA Alerts Specification
Authors
Authors | Organisation |
---|---|
Nikolay Stanchev | University of Southampton, IT Innovation Centre |
Michael Boniface | University of Southampton, IT Innovation Centre |
Description
This document outlines the TOSCA alert specification used to configure alerts within CLMC. Alerts are configured through a YAML-based, TOSCA-compliant document that follows the TOSCA simple profile. This document is passed to the CLMC service, which parses and validates it. The CLMC service then creates and activates the alerts within Kapacitor and registers the HTTP alert handlers specified in the document. The specification is compliant with the TOSCA policy template as implemented by the OpenStack TOSCA parser.
TOSCA Alerts Specification Document
The TOSCA Alerts Specification Document consists of two main sections - metadata and policies. Each policy contains a number of triggers. A trigger is a fully qualified specification for an alert. Full definitions and clarifications of the document structure are given in the following sections. An example of a valid alert specification document looks like this:
tosca_definitions_version: tosca_simple_profile_for_nfv_1_0_0
description: TOSCA Alerts Configuration document
imports:
  - flame_clmc_alerts_definitions.yaml
metadata:
  sfc: companyA-VR
  sfci: companyA-VR-premium
topology_template:
  policies:
    - high_latency_policy:
        type: eu.ict-flame.policies.StateChange
        triggers:
          high_latency:
            description: This event triggers when the mean network latency in a given location exceeds a given threshold (in ms).
            event_type: threshold
            metric: network.latency
            condition:
              threshold: 45
              granularity: 120
              aggregation_method: mean
              resource_type:
                flame_location: watershed
              comparison_operator: gt
            action:
              implementation:
                - flame_sfemc
                - http://companyA.alert-handler.flame.eu/high-latency
    - low_requests_policy:
        type: eu.ict-flame.policies.StateChange
        triggers:
          low_requests:
            description: |
              This event triggers when the last reported number of requests for a given service function
              falls below a given threshold.
            event_type: threshold
            metric: storage.requests
            condition:
              threshold: 5
              granularity: 60
              aggregation_method: last
              resource_type:
                flame_sfp: storage
                flame_sf: storage-users
                flame_location: watershed
              comparison_operator: lt
            action:
              implementation:
                - flame_sfemc
                - http://companyA.alert-handler.flame.eu/low-requests
    - requests_diff_policy:
        type: eu.ict-flame.policies.StateChange
        triggers:
          increase_in_requests:
            description: |
              This event triggers when the number of requests has increased relative to the number of requests received
              120 seconds ago.
            event_type: relative
            metric: storage.requests
            condition:
              threshold: 100 # requests have increased by at least 100
              granularity: 120
              resource_type:
                flame_sfp: storage
                flame_sf: storage-users
                flame_server: watershed
                flame_location: watershed
              comparison_operator: gte
            action:
              implementation:
                - flame_sfemc
          decrease_in_requests:
            description: |
              This event triggers when the number of requests has decreased relative to the number of requests received
              120 seconds ago.
            event_type: relative
            metric: storage.requests
            condition:
              threshold: -100 # requests have decreased by at least 100
              granularity: 120
              resource_type:
                flame_sfp: storage
                flame_sf: storage-users
                flame_location: watershed
              comparison_operator: lte
            action:
              implementation:
                - flame_sfemc
    - missing_measurement_policy:
        type: eu.ict-flame.policies.StateChange
        triggers:
          missing_storage_measurements:
            description: This event triggers when the number of storage measurements reported falls below the threshold value.
            event_type: deadman
            # deadman trigger instances monitor the whole measurement (storage in this case), so simply put a star for field value
            # to be compliant with the <measurement>.<field> format
            metric: storage.*
            condition:
              threshold: 0 # if the number of reported points is less than or equal to 0 (in other words, no measurements are reported)
              granularity: 60 # check for missing data for the last 60 seconds
              resource_type:
                flame_sfp: storage
            action:
              implementation:
                - http://companyA.alert-handler.flame.eu/missing-measurements
Metadata
The metadata section specifies the service function chain ID and the service function chain instance ID to which this alerts specification relates. The format is the following:
metadata:
  sfc: <sfc_id>
  sfci: <sfc_i_id>
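As noted in the resource_type definition, the metadata values are used to add flame_sfc and flame_sfci filtering to all InfluxDB queries. The following is a minimal sketch of how such a filter could be assembled; the function name `build_tag_filter` and the exact clause layout are assumptions made for illustration, not the query CLMC actually generates.

```python
def build_tag_filter(metadata, resource_type):
    """Combine metadata and resource_type tags into an InfluxQL-style filter (illustrative only)."""
    # flame_sfc / flame_sfci always come from the metadata section, never from resource_type
    tags = {"flame_sfc": metadata["sfc"], "flame_sfci": metadata["sfci"]}
    tags.update(resource_type)
    # InfluxQL quotes tag keys with double quotes and tag values with single quotes
    return " AND ".join("\"{0}\"='{1}'".format(name, value) for name, value in tags.items())


metadata = {"sfc": "companyA-VR", "sfci": "companyA-VR-premium"}
resource_type = {"flame_location": "watershed"}
where_clause = build_tag_filter(metadata, resource_type)
print(where_clause)
```

Because the chain and chain instance tags are derived from the metadata, a trigger only needs to list the remaining global tags in its resource_type.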
Policies
The policies section defines a list of policy nodes. Each policy contains one or more triggers, and each trigger represents a fully qualified configuration for an alert within CLMC. The format is the following:
topology_template:
  policies:
    - <policy_identifier>:
        type: eu.ict-flame.policies.StateChange
        triggers:
          <event_identifier>:
            description: <optional description for the given event trigger>
            event_type: <threshold | relative | deadman>
            metric: <measurement>.<field>
            condition:
              threshold: <critical value - semantics depend on the event type>
              granularity: <period in seconds - semantics depend on the event type>
              aggregation_method: <aggregation function supported by InfluxDB - e.g. 'mean'>
              resource_type:
                <CLMC Information Model Tag Name>: <CLMC Information Model Tag Value>
                <CLMC Information Model Tag Name>: <CLMC Information Model Tag Value>
                ...
              comparison_operator: <logical operator to use for comparison, e.g. 'gt', 'lt', 'gte', etc.>
            action:
              implementation:
                - <flame_sfemc or HTTP Alert Handler URL - receives POST messages from Kapacitor when alerts trigger>
                - <flame_sfemc or HTTP Alert Handler URL - receives POST messages from Kapacitor when alerts trigger>
                ...
          ...
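To make the nesting of the format concrete, here is a minimal sketch that walks a parsed document and lists every trigger. It assumes the YAML has already been loaded into plain Python dictionaries and lists (a hand-written dictionary stands in for the parsed file here); `iter_triggers` is a hypothetical helper, not part of CLMC.

```python
def iter_triggers(spec):
    """Yield (policy_id, event_id, trigger) for each trigger in a parsed alerts document."""
    for policy in spec["topology_template"]["policies"]:
        # Each list entry is a single-key mapping: {<policy_identifier>: {...}}
        for policy_id, body in policy.items():
            for event_id, trigger in body["triggers"].items():
                yield policy_id, event_id, trigger


# Hand-written dict standing in for the parsed YAML (trimmed to the fields used here)
spec = {
    "metadata": {"sfc": "companyA-VR", "sfci": "companyA-VR-premium"},
    "topology_template": {
        "policies": [
            {"requests_diff_policy": {
                "type": "eu.ict-flame.policies.StateChange",
                "triggers": {
                    "increase_in_requests": {"event_type": "relative", "metric": "storage.requests"},
                    "decrease_in_requests": {"event_type": "relative", "metric": "storage.requests"},
                },
            }},
        ]
    },
}

alerts = [(policy_id, event_id, trigger["event_type"])
          for policy_id, event_id, trigger in iter_triggers(spec)]
for entry in alerts:
    print(entry)
```

Each yielded trigger corresponds to one alert, which is why a single policy with two triggers produces two entries.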
Definitions
-
policy_identifier - policy label which MUST match with a StateChange policy in the TOSCA resource specification document submitted to the FLAME Orchestrator.
-
event_identifier - the name of the event that MUST match with the constraint event name referenced in the TOSCA resource specification document submitted to the FLAME Orchestrator.
-
event_type - the type of TICKscript template used to create the alert. The most common type is expected to be threshold; the other currently supported types are relative and deadman. The different options are described in the Event types section.
-
metric - the metric to query in InfluxDB, must include measurement name and field name in format <measurement>.<field>. The only exception is when a deadman event type is used - then the <field> is not used, but the format is still the same for consistency. Therefore, using <measurement>.* will be sufficient.
-
threshold -
- for threshold event type, this is the critical value the queried metric is compared to.
- for relative event type, this is the critical value the difference (between the current metric value and the past metric value) is compared to.
- for deadman event type, this is the critical value the number of measurement points (received in InfluxDB) is compared to.
-
granularity - period in seconds
- for threshold event type, this value specifies how often Kapacitor queries InfluxDB to check whether the alert condition is true.
- for relative event type, this value specifies how far back in time the past metric value, against which the current value is compared, is taken from.
- for deadman event type, this value specifies the length of the time span over which the number of measurement points is checked.
-
aggregation_method - the function to use when querying InfluxDB, e.g. median, mean, etc. This value is only used when the event_type is set to threshold.
-
resource_type - provides context for the given event - key-value pairs for the global tags of the CLMC Information Model. This includes any of the following: "flame_sfp", "flame_sf", "flame_sfe", "flame_server", "flame_location". Keep in mind that flame_sfc and flame_sfci are also part of the CLMC Information Model. However, filtering on these tags is automatically generated and added to all InfluxDB queries by using the metadata values from the alerts specification. Therefore, including flame_sfc and flame_sfci in the resource_type is considered INVALID. For more information on the global tags, please check the documentation.
-
comparison_operator - the logical operator to use for comparison - lt (less than), gt (greater than), lte (less than or equal to), etc.
-
implementation - a list of the URL entries for alert handlers to which alert data is sent when the event condition is true. If the alert is supposed to be sent to SFEMC, then instead of typing a URL, use flame_sfemc - the configurator will generate the correct SFEMC URL.
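The rules above lend themselves to a simple pre-submission check. The sketch below is an illustration only: `validate_trigger` is a hypothetical helper, it covers just a few of the definitions (event type, metric format, allowed tags), it assumes resource_type is nested under condition, and it is not the validation that the CLMC service itself performs.

```python
ALLOWED_TAGS = {"flame_sfp", "flame_sf", "flame_sfe", "flame_server", "flame_location"}
RESERVED_TAGS = {"flame_sfc", "flame_sfci"}  # filled in from the metadata section
EVENT_TYPES = {"threshold", "relative", "deadman"}


def validate_trigger(trigger):
    """Return a list of human-readable problems found in a single trigger mapping."""
    errors = []
    if trigger.get("event_type") not in EVENT_TYPES:
        errors.append("event_type must be one of threshold, relative, deadman")
    # metric must look like <measurement>.<field>; deadman triggers may use <measurement>.*
    measurement, _, field = str(trigger.get("metric", "")).partition(".")
    if not measurement or not field:
        errors.append("metric must follow the <measurement>.<field> format")
    for tag in trigger.get("condition", {}).get("resource_type", {}):
        if tag in RESERVED_TAGS:
            errors.append(tag + " is set through metadata and is invalid in resource_type")
        elif tag not in ALLOWED_TAGS:
            errors.append(tag + " is not a global tag of the CLMC Information Model")
    return errors


valid_trigger = {
    "event_type": "threshold",
    "metric": "network.latency",
    "condition": {"threshold": 45, "resource_type": {"flame_location": "watershed"}},
}
invalid_trigger = {
    "event_type": "threshold",
    "metric": "storage",  # field name is missing
    "condition": {"resource_type": {"flame_sfci": "companyA-VR-premium", "location": "watershed"}},
}
print(validate_trigger(valid_trigger))
for problem in validate_trigger(invalid_trigger):
    print(problem)
```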
Event types
-
threshold - A threshold event type is an alert in which Kapacitor queries InfluxDB for a specific metric over a given period of time by using a query function such as mean, median, mode, etc. The resulting value is then compared against a given threshold. If the result of the comparison operation is true, an alert is triggered. For example:
high_latency:
  description: This event triggers when the mean network latency in a given location exceeds a given threshold (in ms).
  event_type: threshold
  metric: network.latency
  condition:
    threshold: 45
    granularity: 120
    aggregation_method: mean
    resource_type:
      flame_location: watershed
    comparison_operator: gt
  action:
    implementation:
      - flame_sfemc
      - http://companyA.alert-handler.flame.eu/high-latency
This trigger specification will create an alert task in Kapacitor, which queries the latency field in the network measurement on location watershed every 120 seconds and compares the mean value for the last 120 seconds with the threshold value 45. If the mean latency exceeds 45 (gt operator is used, which stands for greater than), an alert is triggered. This alert will be sent through an HTTP POST message to the URLs listed in the implementation section.
The currently included InfluxQL functions are:
"count", "mean", "median", "mode", "sum", "first", "last", "max", "min"
The comparison operator mappings are as follows:
"lt" : "less than", "gt" : "greater than", "lte" : "less than or equal to", "gte" : "greater than or equal to", "eq" : "equal", "neq" : "not equal"
-
relative - A relative event type is an alert in which Kapacitor computes the difference between the current value of a metric and the value reported a given period of time ago. The difference between the current and the past value is then compared against a given threshold. If the result of the comparison operation is true, an alert is triggered. For example:
decrease_in_requests:
  description: |
    This event triggers when the number of requests has decreased relative to the number of requests received
    120 seconds ago.
  event_type: relative
  metric: storage.requests
  condition:
    threshold: -100
    granularity: 120
    resource_type:
      flame_sfp: storage
      flame_sf: storage-users
      flame_location: watershed
    comparison_operator: lte
  action:
    implementation:
      - flame_sfemc
This trigger specification will create an alert task in Kapacitor, which compares every requests value reported in measurement storage with the value received 120 seconds ago. If the difference between the current and the past value is less than or equal to (comparison operator is lte) -100, an alert is triggered. Put simply, an alert is triggered if the current requests value has decreased by at least 100 relative to the value reported 120 seconds ago. The queried value is contextualised for service function storage-users (using service function package storage) at location watershed. Triggered alerts will be sent through an HTTP POST message to the URLs listed in the implementation section.
Notes:
- aggregation_method is not required here - the alert task compares the actual value that's being reported (stream mode)
- if aggregation_method is provided, it will be ignored
-
deadman - A deadman event type is an alert in which Kapacitor computes the number of reported points in a measurement for a given period of time. This number is then compared to a given threshold value. If the number of reported points does not exceed the threshold value, an alert is triggered. For example:
missing_storage_measurements:
  description: This event triggers when the number of storage measurements reported falls below the threshold value.
  event_type: deadman
  metric: storage.*
  condition:
    threshold: 0
    granularity: 60
    resource_type:
      flame_sfp: storage
  action:
    implementation:
      - flame_sfemc
This trigger specification will create an alert task in Kapacitor, which monitors the number of points reported in measurement storage with the tag flame_sfp set to storage. This value is computed every 60 seconds. If the number of reported points is less than or equal to 0 (no points have been reported for the last 60 seconds), an alert will be triggered. Triggered alerts will be sent through an HTTP POST message to the URLs listed in the implementation section.
Notes:
- metric only requires the measurement name in this event type and doesn't require a field name
- the trigger specification still needs to be consistent with the parsing rule for metric: <measurement>.<field>
- simply putting a * for field is sufficient, e.g. storage.*
- even if you put something else for field value, it will be ignored - only the measurement name is used
- aggregation_method is not required in this event type, any values provided will be ignored
- comparison_operator is not required in this event type, any values provided will be ignored
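The three event types differ only in which value is compared against the threshold. The sketch below restates those semantics in plain Python as an illustration; it does not reproduce the Kapacitor tasks that CLMC generates. The function names are hypothetical, and the deadman check assumes a less-than-or-equal comparison, as suggested by the comment in the example document.

```python
import operator
import statistics

# Comparison operator mappings listed for the threshold event type
COMPARATORS = {"lt": operator.lt, "gt": operator.gt, "lte": operator.le,
               "gte": operator.ge, "eq": operator.eq, "neq": operator.ne}

# Python stand-ins for the listed InfluxQL functions
AGGREGATIONS = {"count": len, "mean": statistics.mean, "median": statistics.median,
                "mode": statistics.mode, "sum": sum, "first": lambda v: v[0],
                "last": lambda v: v[-1], "max": max, "min": min}


def threshold_fires(values, condition):
    """Aggregate the values seen in the last period and compare the result with the threshold."""
    aggregated = AGGREGATIONS[condition["aggregation_method"]](values)
    return COMPARATORS[condition["comparison_operator"]](aggregated, condition["threshold"])


def relative_fires(current, past, condition):
    """Compare the difference between the current and the past value with the threshold."""
    return COMPARATORS[condition["comparison_operator"]](current - past, condition["threshold"])


def deadman_fires(point_count, condition):
    """Fire when the number of points in the period does not exceed the threshold (assumed <=)."""
    return point_count <= condition["threshold"]


high_latency = {"threshold": 45, "aggregation_method": "mean", "comparison_operator": "gt"}
decrease_in_requests = {"threshold": -100, "comparison_operator": "lte"}
missing_measurements = {"threshold": 0}

print(threshold_fires([40, 50, 60], high_latency))     # mean is 50, which is greater than 45
print(relative_fires(150, 300, decrease_in_requests))  # difference is -150, which is <= -100
print(deadman_fires(0, missing_measurements))          # no points reported in the period
```

Note that only the threshold type reads aggregation_method, and the deadman type reads neither aggregation_method nor comparison_operator, which matches the notes for each event type.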