<!--
// © University of Southampton IT Innovation Centre, 2018
//
// Copyright in this software belongs to University of Southampton
// IT Innovation Centre of Gamma House, Enterprise Road, 
// Chilworth Science Park, Southampton, SO16 7NS, UK.
//
// This software may not be used, sold, licensed, transferred, copied
// or reproduced in whole or in part in any manner or form or in or
// on any media by any person other than in accordance with the terms
// of the Licence Agreement supplied with the software, or otherwise
// without the prior written consent of the copyright owners.
//
// This software is distributed WITHOUT ANY WARRANTY, without even the
// implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
// PURPOSE, except where stated in the Licence Agreement supplied with
// the software.
//
//      Created By :            Nikolay Stanchev
//      Created Date :          15-08-2018
//      Created for Project :   FLAME
-->

# **FLAME - TOSCA Alerts Specification**

#### **Authors**

|Authors|Organisation|                    
|:---:|:---:|  
|[Nikolay Stanchev](mailto:ns17@it-innovation.soton.ac.uk)|[University of Southampton, IT Innovation Centre](http://www.it-innovation.soton.ac.uk)|
|[Michael Boniface](mailto:mjb@it-innovation.soton.ac.uk)|[University of Southampton, IT Innovation Centre](http://www.it-innovation.soton.ac.uk)|

#### Description

This document outlines the TOSCA alert specification used to configure alerts within CLMC. Alerts are configured through a YAML-based
TOSCA-compliant document according to the TOSCA simple profile. This document is passed to the CLMC service, which parses and validates the document. 
Subsequently, the CLMC service creates and activates the alerts within Kapacitor, then registers the HTTP alert handlers specified in the document.
The specification is compliant with the TOSCA policy template as implemented by the OpenStack tosca-parser. See an example below:

https://github.com/openstack/tosca-parser/blob/master/toscaparser/tests/data/policies/tosca_policy_template.yaml

#### TOSCA Alerts Specification Document

The TOSCA Alerts Specification Document consists of two main sections: **metadata** and **policies**. Each **policy** contains a number
of triggers. A **trigger** is a fully qualified specification for an alert. Full definitions and clarification of the document structure
are given in the following sections. An example of a valid alert specification document looks like this:

```yaml
tosca_definitions_version: tosca_simple_profile_for_nfv_1_0_0

description: TOSCA Alerts Configuration document

imports:
- flame_clmc_alerts_definitions.yaml

metadata:
  sfc: companyA-VR
  sfci: companyA-VR-premium

topology_template:

  policies:
    - high_latency_policy:
        type: eu.ict-flame.policies.StateChange
        triggers:
          high_latency:
            description: This event triggers when the mean network latency in a given location exceeds a given threshold (in ms).
            event_type: threshold
            metric: network.latency
            condition:
              threshold: 45
              granularity: 120
              aggregation_method: mean
              resource_type:
                flame_location: watershed
              comparison_operator: gt
            action:
              implementation:
                - flame_sfemc
                - http://companyA.alert-handler.flame.eu/high-latency
                
    - low_requests_policy:
        type: eu.ict-flame.policies.StateChange
        triggers:
          low_requests:
            description: |
              This event triggers when the last reported number of requests for a given service function
              falls behind a given threshold.
            event_type: threshold
            metric: storage.requests
            condition:
              threshold: 5
              granularity: 60
              aggregation_method: last
              resource_type:
                flame_sfp: storage
                flame_sf: storage-users
                flame_location: watershed
              comparison_operator: lt
            action:
              implementation:
                - flame_sfemc
                - http://companyA.alert-handler.flame.eu/low-requests
                
    - requests_diff_policy:
        type: eu.ict-flame.policies.StateChange
        triggers:
          increase_in_requests:
            description: |
              This event triggers when the number of requests has increased relative to the number of requests received
              120 seconds ago.
            event_type: relative
            metric: storage.requests
            condition:
              threshold: 100  # requests have increased by at least 100
              granularity: 120
              resource_type:
                flame_sfp: storage
                flame_sf: storage-users
                flame_server: watershed
                flame_location: watershed
              comparison_operator: gte
            action:
              implementation:
                - flame_sfemc
          decrease_in_requests:
            description: |
              This event triggers when the number of requests has decreased relative to the number of requests received
              120 seconds ago.
            event_type: relative
            metric: storage.requests
            condition:
              threshold: -100  # requests have decreased by at least 100
              granularity: 120
              resource_type:
                flame_sfp: storage
                flame_sf: storage-users
                flame_location: watershed
              comparison_operator: lte
            action:
              implementation:
                - flame_sfemc
                
    - missing_measurement_policy:
        type: eu.ict-flame.policies.StateChange
        triggers:
          missing_storage_measurements:
            description: This event triggers when the number of storage measurements reported falls below the threshold value.
            event_type: deadman
            # deadman trigger instances monitor the whole measurement (storage in this case), so simply put a star for field value
            # to be compliant with the <measurement>.<field> format
            metric: storage.*
            condition:
              threshold: 0  # if the number of reported points is less than or equal to 0 (in other words, no measurements are reported)
              granularity: 60  # check for missing data for the last 60 seconds
              resource_type:
                flame_sfp: storage
            action:
              implementation:
                - http://companyA.alert-handler.flame.eu/missing-measurements
```


##### Metadata

The ***metadata*** section specifies the service function chain ID and the service function chain instance ID to which this
alerts specification relates. The format is the following:

```yaml
metadata:
    sfc: <sfc_id>
    sfci: <sfc_i_id>
```

##### Policies

The ***policies*** section defines a list of policy nodes. Each policy node contains one or more triggers, and each trigger
is a fully qualified configuration for an alert within CLMC. The format is the following:

```yaml
topology_template:

    policies:
        - <policy_identifier>:
            type: eu.ict-flame.policies.StateChange
            triggers:
                <event_identifier>:
                  description: <optional description for the given event trigger>
                  event_type: <threshold | relative | deadman>
                  metric: <measurement>.<field>
                  condition:
                    threshold: <critical value - semantics depend on the event type>
                    granularity: <period in seconds - semantics depend on the event type>
                    aggregation_method: <aggregation function supported by InfluxDB - e.g. 'mean'>
                    resource_type:
                      <CLMC Information Model Tag Name>: <CLMC Information Model Tag Value>
                      <CLMC Information Model Tag Name>: <CLMC Information Model Tag Value>
                      ...
                    comparison_operator: <logical operator to use for comparison, e.g. 'gt', 'lt', 'gte', etc.>
                  action:
                    implementation:
                      - <flame_sfemc or HTTP Alert Handler URL - receives POST messages from Kapacitor when alerts trigger>
                      - <flame_sfemc or HTTP Alert Handler URL - receives POST messages from Kapacitor when alerts trigger>
                      ...
        ...
```
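
The nesting above (a list of single-key policy mappings, each holding a mapping of triggers) can be walked as in the following minimal sketch. It assumes the document has already been loaded into plain Python dictionaries and lists by a YAML library. The function name `iter_triggers` is hypothetical and is not part of CLMC.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_triggers(document: Dict[str, Any]) -> Iterator[Tuple[str, str, Dict[str, Any]]]:
    """Yield (policy_identifier, event_identifier, trigger) for every trigger."""
    policies = document.get("topology_template", {}).get("policies", [])
    for policy in policies:
        # Each list entry is a single-key mapping: {<policy_identifier>: {...}}
        for policy_id, policy_body in policy.items():
            for event_id, trigger in policy_body.get("triggers", {}).items():
                yield policy_id, event_id, trigger


# A trimmed-down version of the example document, already parsed.
example = {
    "metadata": {"sfc": "companyA-VR", "sfci": "companyA-VR-premium"},
    "topology_template": {
        "policies": [
            {
                "requests_diff_policy": {
                    "type": "eu.ict-flame.policies.StateChange",
                    "triggers": {
                        "increase_in_requests": {"event_type": "relative", "metric": "storage.requests"},
                        "decrease_in_requests": {"event_type": "relative", "metric": "storage.requests"},
                    },
                }
            }
        ]
    },
}

# One alert is expected per trigger, not per policy.
alerts = [(policy, event, trigger["event_type"]) for policy, event, trigger in iter_triggers(example)]
```

Here a single policy with two triggers yields two entries, which matches the rule that each trigger describes one alert.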


##### Definitions

* **policy_identifier** - policy label that **MUST** match a StateChange policy in the TOSCA resource specification document
submitted to the FLAME Orchestrator.

* **event_identifier** - the name of the event that **MUST** match the *constraint* event name referenced in the TOSCA resource
specification document submitted to the FLAME Orchestrator.

* **event_type** - the type of TICK Script template used to create the alert. The supported types are **threshold**,
**relative** and **deadman**, each described in the Event types section. **threshold** is expected to be the most
common one.

* **metric** - the metric to query in InfluxDB. It must include the measurement name and the field name in the
format `<measurement>.<field>`. The only exception is the **deadman** event type, where the `<field>` is not used but
the format stays the same for consistency. Therefore, using `<measurement>.*` is sufficient.

* **threshold** -
    * for **threshold** event type, this is the critical value the queried metric is compared to.
    * for **relative** event type, this is the critical value the difference (between the current metric value and the past metric value) is compared to.
    * for **deadman** event type, this is the critical value the number of measurement points (received in InfluxDB) is compared to.

* **granularity** - period in seconds
    * for **threshold** event type, this value specifies how often Kapacitor queries InfluxDB to check whether the alert condition is true; the same period is used as the time window over which the metric is aggregated.
    * for **relative** event type, this value specifies how far back in time the past metric value is taken from when comparing it with the current metric value.
    * for **deadman** event type, this value specifies the length of the time window in which the number of measurement points is counted.

* **aggregation_method** - the function to use when querying InfluxDB, e.g. median, mean, etc. This value is only used when
the event_type is set to **threshold**.

* **resource_type** - provides context for the given event - key-value pairs for the global tags of the CLMC Information Model.
This includes any of the following: `"flame_sfp", "flame_sf", "flame_sfe", "flame_server", "flame_location"`. 
Keep in mind that **flame_sfc** and **flame_sfci** are also part of the CLMC Information Model. However, filtering on 
these tags is automatically generated and added to all InfluxDB queries by using the metadata values from the 
alerts specification. Therefore, including **flame_sfc** and **flame_sfci** in the **resource_type** is considered INVALID.  
For more information on the global tags, please check the [documentation](monitoring.md).  

* **comparison_operator** - the logical operator to use for comparison: lt (less than), gt (greater than), lte (less than or equal to), gte (greater than or equal to), eq (equal) or neq (not equal).

* **implementation** - a list of the URL entries for alert handlers to which alert data is sent when the event condition is true.
If the alert is supposed to be sent to SFEMC, then instead of typing a URL, use **flame_sfemc** - the configurator will generate the correct
SFEMC URL.
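
The rules for **metric** and **resource_type** can be illustrated with a small validation sketch. This is not the CLMC implementation: the function names, the regular expression and the error messages are assumptions made for illustration, and only the tag names and the `<measurement>.<field>` format come from the definitions above.

```python
import re
from typing import Dict, Tuple

# Tag names taken from the resource_type definition above.
ALLOWED_TAGS = {"flame_sfp", "flame_sf", "flame_sfe", "flame_server", "flame_location"}

# Hypothetical pattern for <measurement>.<field>; the real parsing rule may differ.
METRIC_PATTERN = re.compile(r"^(?P<measurement>[^.\s]+)\.(?P<field>\S+)$")


def parse_metric(metric: str) -> Tuple[str, str]:
    """Split a metric string into (measurement, field)."""
    match = METRIC_PATTERN.match(metric)
    if match is None:
        raise ValueError(f"metric must use the <measurement>.<field> format, got {metric!r}")
    return match.group("measurement"), match.group("field")


def build_tag_filter(metadata: Dict[str, str], resource_type: Dict[str, str]) -> Dict[str, str]:
    """Combine the metadata-derived tags with the tags listed in resource_type."""
    invalid = set(resource_type) - ALLOWED_TAGS
    if invalid:
        # flame_sfc and flame_sfci end up here too: they come from metadata only.
        raise ValueError(f"invalid resource_type tags: {sorted(invalid)}")
    tags = {"flame_sfc": metadata["sfc"], "flame_sfci": metadata["sfci"]}
    tags.update(resource_type)
    return tags


measurement, field = parse_metric("storage.*")  # deadman-style metric
tag_filter = build_tag_filter(
    {"sfc": "companyA-VR", "sfci": "companyA-VR-premium"},
    {"flame_location": "watershed"},
)
```

The sketch rejects **flame_sfc** and **flame_sfci** inside **resource_type** because those filters are derived from the metadata section.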


##### Event types

* **threshold** - A threshold event type is an alert in which Kapacitor queries InfluxDB for a specific metric over a given period of time
by using a query function such as *mean*, *median*, *mode*, etc. The resulting value is then compared against a given threshold. If the
result of the comparison operation is true, an alert is triggered. For example:

    ```yaml
    high_latency:
        description: This event triggers when the mean network latency in a given location exceeds a given threshold (in ms).
        event_type: threshold
        metric: network.latency
        condition:
          threshold: 45
          granularity: 120
          aggregation_method: mean
          resource_type:
            flame_location: watershed
          comparison_operator: gt
        action:
          implementation:
            - flame_sfemc
            - http://companyA.alert-handler.flame.eu/high-latency
    ``` 
    
    This trigger specification will create an alert task in Kapacitor, which queries the **latency** field in the **network**
    measurement on location **watershed** every **120** seconds and compares the mean value for the last 120 seconds with the threshold value **45**.
    If the mean latency exceeds 45 (**gt** operator is used, which stands for **greater than**), an alert is triggered. This alert will
    be sent through an HTTP POST message to the URLs listed in the **implementation** section.
    
    The currently included InfluxQL functions are:
    
    `"count", "mean", "median", "mode", "sum", "first", "last", "max", "min"`
    
    The comparison operator mappings are as follows:
    
    ```
    "lt" : "less than",
    "gt" : "greater than", 
    "lte" : "less than or equal to", 
    "gte" : "greater than or equal to", 
    "eq" : "equal", 
    "neq" : "not equal"
    ```

* **relative** - A relative event type is an alert in which Kapacitor computes the difference between the current value of a metric and the value
reported a given period of time ago. The difference between the current and the past value is then compared against a given
threshold. If the result of the comparison operation is true, an alert is triggered. For example:

    ```yaml
    decrease_in_requests:
        description: |
          This event triggers when the number of requests has decreased relative to the number of requests received
          120 seconds ago.
        event_type: relative
        metric: storage.requests
        condition:
          threshold: -100
          granularity: 120
          resource_type:
            flame_sfp: storage
            flame_sf: storage-users
            flame_location: watershed
          comparison_operator: lte
        action:
          implementation:
            - flame_sfemc
    ```
    
    This trigger specification will create an alert task in Kapacitor, which compares every **requests** value reported in 
    measurement **storage** with the value received **120** seconds ago. If the difference between the current and the past
    value is less than or equal to (comparison operator is **lte**) **-100**, an alert is triggered. Simply explained, an alert
    is triggered if the **requests** current value has decreased by at least 100 relative to the value reported 120 seconds ago.
    The queried value is contextualised for service function **storage-users** (using service function package **storage**) 
    at location **watershed**. Triggered alerts will be sent through an HTTP POST message to the URLs listed in the **implementation** section.
    
    *Notes*:
    
    * **aggregation_method** is not required here - the alert task compares the actual value that's being reported (stream mode)
    * if **aggregation_method** is provided, it will be ignored

* **deadman** - A deadman event type is an alert in which Kapacitor computes the number of reported points in a measurement
for a given period of time. This number is then compared to a given threshold value. If the number of reported points is
less than or equal to the threshold value, an alert is triggered.
For example:

    ```yaml
    missing_storage_measurements:
        description: This event triggers when the number of storage measurements reported falls below the threshold value.
        event_type: deadman
        metric: storage.*
        condition:
          threshold: 0
          granularity: 60
          resource_type:
            flame_sfp: storage
        action:
          implementation:
            - flame_sfemc
    ```

    This trigger specification will create an alert task in Kapacitor, which monitors the number of points reported in
    measurement **storage** and having tag **flame_sfp** set to **storage**. This value is computed every 60 seconds.
    If the number of reported points is less than or equal to **0** (no points have been reported for the last 60 seconds), an alert
    will be triggered. Triggered alerts will be sent through an HTTP POST message to the URLs listed in the **implementation** section.
    
    *Notes*:
    
    * **metric** only requires the measurement name in this event type and doesn't require a field name
    * the trigger specification still needs to be consistent with the parsing rule for **metric**: `<measurement>.<field>`
    * simply putting a `*` for field is sufficient, e.g. `storage.*`
    * even if you put something else for field value, it will be ignored - only the **measurement** name is used
    * **aggregation_method** is not required in this event type; any value provided will be ignored
    * **comparison_operator** is not required in this event type; any value provided will be ignored
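
The alert conditions of the three event types can be summarised in the following sketch. It only models the comparison step described above and does not reproduce the TICK Script templates that CLMC uses. The function `alert_triggers` and its parameters are hypothetical. The deadman check uses "less than or equal to", following the comment in the example document.

```python
import operator
from typing import Callable, Dict, Optional

# Mapping of the comparison_operator values listed above to Python operators.
COMPARISON_OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "lt": operator.lt,
    "gt": operator.gt,
    "lte": operator.le,
    "gte": operator.ge,
    "eq": operator.eq,
    "neq": operator.ne,
}


def alert_triggers(event_type: str, threshold: float, value: float,
                   comparison_operator: Optional[str] = None,
                   past_value: Optional[float] = None) -> bool:
    """Return True if the described alert condition holds.

    value is the aggregated metric (threshold), the current metric (relative)
    or the number of points counted in the granularity window (deadman).
    """
    if event_type == "deadman":
        # comparison_operator is ignored for deadman events
        return value <= threshold
    if comparison_operator is None:
        raise ValueError("comparison_operator is required for this event type")
    compare = COMPARISON_OPERATORS[comparison_operator]
    if event_type == "threshold":
        return compare(value, threshold)
    if event_type == "relative":
        if past_value is None:
            raise ValueError("relative events need the value reported `granularity` seconds ago")
        # the difference between current and past value is compared to the threshold
        return compare(value - past_value, threshold)
    raise ValueError(f"unsupported event_type: {event_type!r}")


# high_latency: mean latency of 50 ms against threshold 45 with 'gt'
high_latency = alert_triggers("threshold", 45, 50.0, "gt")
# decrease_in_requests: 300 requests now, 420 requests 120 seconds ago, threshold -100 with 'lte'
decrease = alert_triggers("relative", -100, 300, "lte", past_value=420)
# missing_storage_measurements: no points counted in the last 60 seconds
missing = alert_triggers("deadman", 0, 0)
```

With these inputs all three conditions hold: 50 is greater than 45, the difference of -120 is less than or equal to -100, and a count of 0 is less than or equal to the threshold 0.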