
FLAME - TOSCA Alerts Specification

Authors

Authors Organisation
Nikolay Stanchev University of Southampton, IT Innovation Centre
Michael Boniface University of Southampton, IT Innovation Centre

Description

This document outlines the TOSCA alert specification used to configure alerts within CLMC. Alerts are configured through a YAML-based, TOSCA-compliant document that follows the TOSCA simple profile. The document is passed to the CLMC service, which parses and validates it. The CLMC service then creates and activates the alerts within Kapacitor and registers the HTTP alert handlers specified in the document. The specification is compliant with the TOSCA policy template as implemented by the OpenStack TOSCA parser. See an example below:

https://github.com/openstack/tosca-parser/blob/master/toscaparser/tests/data/policies/tosca_policy_template.yaml

TOSCA Alerts Specification Document

The TOSCA Alerts Specification Document consists of two main sections: metadata and policies. Each policy contains a number of triggers. A trigger is a fully qualified specification for an alert. Full definitions and clarification of the structure of the document are given in the following sections. An example of a valid alert specification document:

tosca_definitions_version: tosca_simple_profile_for_nfv_1_0_0

description: TOSCA Alerts Configuration document

imports:
- flame_clmc_alerts_definitions.yaml

metadata:
  sfc: companyA-VR
  sfci: companyA-VR-premium

topology_template:

  policies:
    - high_latency_policy:
        type: eu.ict-flame.policies.StateChange
        triggers:
          high_latency:
            description: This event triggers when the mean network latency in a given location exceeds a given threshold (in ms).
            event_type: threshold
            metric: network.latency
            condition:
              threshold: 45
              granularity: 120
              aggregation_method: mean
              resource_type:
                flame_location: watershed
              comparison_operator: gt
            action:
              implementation:
                - flame_sfemc
                - http://companyA.alert-handler.flame.eu/high-latency
                
    - low_requests_policy:
        type: eu.ict-flame.policies.StateChange
        triggers:
          low_requests:
            description: |
              This event triggers when the last reported number of requests for a given service function
              falls below a given threshold.
            event_type: threshold
            metric: storage.requests
            condition:
              threshold: 5
              granularity: 60
              aggregation_method: last
              resource_type:
                flame_sfp: storage
                flame_sf: storage-users
                flame_location: watershed
              comparison_operator: lt
            action:
              implementation:
                - flame_sfemc
                - http://companyA.alert-handler.flame.eu/low-requests
                
    - requests_diff_policy:
        type: eu.ict-flame.policies.StateChange
        triggers:
          increase_in_requests:
            description: |
              This event triggers when the number of requests has increased relative to the number of requests received
              120 seconds ago.
            event_type: relative
            metric: storage.requests
            condition:
              threshold: 100  # requests have increased by at least 100
              granularity: 120
              resource_type:
                flame_sfp: storage
                flame_sf: storage-users
                flame_server: watershed
                flame_location: watershed
              comparison_operator: gte
            action:
              implementation:
                - flame_sfemc
          decrease_in_requests:
            description: |
              This event triggers when the number of requests has decreased relative to the number of requests received
              120 seconds ago.
            event_type: relative
            metric: storage.requests
            condition:
              threshold: -100  # requests have decreased by at least 100
              granularity: 120
              resource_type:
                flame_sfp: storage
                flame_sf: storage-users
                flame_location: watershed
              comparison_operator: lte
            action:
              implementation:
                - flame_sfemc
                
    - missing_measurement_policy:
        type: eu.ict-flame.policies.StateChange
        triggers:
          missing_storage_measurements:
            description: This event triggers when the number of storage measurements reported falls below the threshold value.
            event_type: deadman
            # deadman trigger instances monitor the whole measurement (storage in this case), so simply put a star for field value
            # to be compliant with the <measurement>.<field> format
            metric: storage.*
            condition:
              threshold: 0  # if requests are less than or equal to 0 (in other words, no measurements are reported)
              granularity: 60  # check for missing data for the last 60 seconds
              resource_type:
                flame_sfp: storage
            action:
              implementation:
                - http://companyA.alert-handler.flame.eu/missing-measurements
Metadata

The metadata section specifies the service function chain ID and the service function chain instance ID to which this alerts specification relates. The format is the following:

metadata:
    sfc: <sfc_id>
    sfci: <sfc_i_id>
Policies

The policies section defines a list of policy nodes, each representing a fully qualified configuration for an alert within CLMC. The format is the following:

topology_template:

    policies:
        - <policy_identifier>:
            type: eu.ict-flame.policies.StateChange
            triggers:
                <event_identifier>:
                  description: <optional description for the given event trigger>
                  event_type: <threshold | relative | deadman>
                  metric: <measurement>.<field>
                  condition:
                    threshold: <critical value - semantics depend on the event type>
                    granularity: <period in seconds - semantics depend on the event type>
                    aggregation_method: <aggregation function supported by InfluxDB - e.g. 'mean'>
                    resource_type:
                      <CLMC Information Model Tag Name>: <CLMC Information Model Tag Value>
                      <CLMC Information Model Tag Name>: <CLMC Information Model Tag Value>
                      ...
                    comparison_operator: <logical operator to use for comparison, e.g. 'gt', 'lt', 'gte', etc.>
                  action:
                    implementation:
                      - <flame_sfemc or HTTP Alert Handler URL - receives POST messages from Kapacitor when alerts trigger>
                      - <flame_sfemc or HTTP Alert Handler URL - receives POST messages from Kapacitor when alerts trigger>
                      ...
        ...
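
The nesting described above (a list of single-key policy mappings, each holding a mapping of triggers) can be walked as in the following minimal sketch. It assumes the document has already been parsed into plain Python dicts and lists, for example by a YAML library. The helper name iter_triggers is hypothetical and not part of CLMC, and the trigger bodies are shortened for brevity.

```python
def iter_triggers(spec):
    """Yield (policy_id, event_id, trigger) for every trigger in the document.

    Hypothetical helper, not part of CLMC: `spec` is assumed to be the
    alerts document already parsed into dicts and lists.
    """
    for policy in spec["topology_template"]["policies"]:
        # Each list item is a single-key mapping: {<policy_identifier>: {...}}
        for policy_id, policy_body in policy.items():
            for event_id, trigger in policy_body["triggers"].items():
                yield policy_id, event_id, trigger


# Shortened version of the example document (conditions and actions omitted).
spec = {
    "metadata": {"sfc": "companyA-VR", "sfci": "companyA-VR-premium"},
    "topology_template": {
        "policies": [
            {"high_latency_policy": {
                "type": "eu.ict-flame.policies.StateChange",
                "triggers": {
                    "high_latency": {"event_type": "threshold", "metric": "network.latency"},
                },
            }},
            {"requests_diff_policy": {
                "type": "eu.ict-flame.policies.StateChange",
                "triggers": {
                    "increase_in_requests": {"event_type": "relative", "metric": "storage.requests"},
                    "decrease_in_requests": {"event_type": "relative", "metric": "storage.requests"},
                },
            }},
        ]
    },
}

found = [(p, e, t["event_type"]) for p, e, t in iter_triggers(spec)]
for entry in found:
    print(entry)
```

Every (policy, event) pair produced this way corresponds to one fully qualified alert configuration.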
Definitions
  • policy_identifier - policy label which MUST match with a StateChange policy in the TOSCA resource specification document submitted to the FLAME Orchestrator.

  • event_identifier - the name of the event that MUST match with the constraint event name referenced in the TOSCA resource specification document submitted to the FLAME Orchestrator.

  • event_type - the type of TICK Script template used to create the alert. The supported types are threshold, relative and deadman; threshold is expected to be the most common. Each type is described in the Event types section.

  • metric - the metric to query in InfluxDB; it must include the measurement name and the field name in the format <measurement>.<field>. The only exception is the deadman event type, for which the <field> is not used, although the format stays the same for consistency. In that case, <measurement>.* is sufficient.

  • threshold -

    • for threshold event type, this is the critical value the queried metric is compared to.
    • for relative event type, this is the critical value the difference (between the current metric value and the past metric value) is compared to.
    • for deadman event type, this is the critical value the number of measurement points (received in InfluxDB) is compared to.
  • granularity - period in seconds

    • for threshold event type, this value specifies how often Kapacitor queries InfluxDB to check whether the alert condition is true, as well as the length of the time window over which the metric is aggregated.
    • for relative event type, this value specifies how far back in time the past metric value is taken from when it is compared with the current metric value.
    • for deadman event type, this value specifies the length of the time span in which the number of measurement points is checked.
  • aggregation_method - the function to use when querying InfluxDB, e.g. median, mean, etc. This value is only used when the event_type is set to threshold.

  • resource_type - provides context for the given event - key-value pairs for the global tags of the CLMC Information Model. This includes any of the following: "flame_sfp", "flame_sf", "flame_sfe", "flame_server", "flame_location". Keep in mind that flame_sfc and flame_sfci are also part of the CLMC Information Model. However, filtering on these tags is automatically generated and added to all InfluxDB queries by using the metadata values from the alerts specification. Therefore, including flame_sfc and flame_sfci in the resource_type is considered INVALID.
    For more information on the global tags, please check the documentation.

  • comparison_operator - the logical operator to use for comparison: lt (less than), gt (greater than), lte (less than or equal to), gte (greater than or equal to), eq (equal) or neq (not equal).

  • implementation - a list of the URL entries for alert handlers to which alert data is sent when the event condition is true. If the alert is supposed to be sent to SFEMC, then instead of typing a URL, use flame_sfemc - the configurator will generate the correct SFEMC URL.
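
The rules in the definitions above can be expressed as simple checks. The sketch below is illustrative only and is not the CLMC validator: the function names are hypothetical, requiring comparison_operator for threshold and relative triggers is an assumption, and the WHERE clause merely illustrates how the metadata values could be added as flame_sfc and flame_sfci filters.

```python
EVENT_TYPES = ("threshold", "relative", "deadman")
COMPARISON_OPERATORS = ("lt", "gt", "lte", "gte", "eq", "neq")
AGGREGATION_METHODS = ("count", "mean", "median", "mode", "sum",
                       "first", "last", "max", "min")
# flame_sfc and flame_sfci are deliberately absent: they come from the metadata.
RESOURCE_TAGS = ("flame_sfp", "flame_sf", "flame_sfe", "flame_server", "flame_location")


def validate_trigger(trigger):
    """Return a list of error messages (empty if the trigger looks valid)."""
    errors = []
    event_type = trigger.get("event_type")
    if event_type not in EVENT_TYPES:
        errors.append("unsupported event_type: {!r}".format(event_type))

    # metric must follow <measurement>.<field>, e.g. storage.requests or storage.*
    measurement, dot, field = str(trigger.get("metric", "")).rpartition(".")
    if not (measurement and dot and field):
        errors.append("metric must have the format <measurement>.<field>")

    condition = trigger.get("condition", {})
    for tag in condition.get("resource_type", {}):
        if tag not in RESOURCE_TAGS:
            errors.append("invalid resource_type tag: {}".format(tag))

    # aggregation_method is only used by threshold triggers.
    if event_type == "threshold" and condition.get("aggregation_method") not in AGGREGATION_METHODS:
        errors.append("threshold triggers need a supported aggregation_method")
    # Assumption: threshold and relative triggers must name a known operator.
    if event_type in ("threshold", "relative") and condition.get("comparison_operator") not in COMPARISON_OPERATORS:
        errors.append("unsupported comparison_operator")
    return errors


def build_tag_filter(metadata, resource_type):
    """Combine metadata and resource_type into an InfluxQL-style tag filter."""
    tags = {"flame_sfc": metadata["sfc"], "flame_sfci": metadata["sfci"]}
    tags.update(resource_type)
    return " AND ".join("\"{}\"='{}'".format(k, v) for k, v in tags.items())


high_latency = {
    "event_type": "threshold",
    "metric": "network.latency",
    "condition": {"threshold": 45, "granularity": 120, "aggregation_method": "mean",
                  "resource_type": {"flame_location": "watershed"},
                  "comparison_operator": "gt"},
}
metadata = {"sfc": "companyA-VR", "sfci": "companyA-VR-premium"}

print(validate_trigger(high_latency))
print(build_tag_filter(metadata, high_latency["condition"]["resource_type"]))
```

A trigger that uses flame_sfc, flame_sfci or an unknown tag such as location in resource_type is reported as invalid by this sketch, which matches the rule stated for resource_type.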

Event types
  • threshold - A threshold event type is an alert in which Kapacitor queries InfluxDB for a specific metric over a given period of time, using a query function such as mean, median or mode. The aggregated value is then compared against a given threshold. If the result of the comparison operation is true, an alert is triggered. For example:

    high_latency:
        description: This event triggers when the mean network latency in a given location exceeds a given threshold (in ms).
        event_type: threshold
        metric: network.latency
        condition:
          threshold: 45
          granularity: 120
          aggregation_method: mean
          resource_type:
            flame_location: watershed
          comparison_operator: gt
        action:
          implementation:
            - flame_sfemc
            - http://companyA.alert-handler.flame.eu/high-latency

    This trigger specification will create an alert task in Kapacitor, which queries the latency field in the network measurement on location watershed every 120 seconds and compares the mean value for the last 120 seconds with the threshold value 45. If the mean latency exceeds 45 (gt operator is used, which stands for greater than), an alert is triggered. This alert will be sent through an HTTP POST message to the URLs listed in the implementation section.

    The currently included InfluxQL functions are:

    "count", "mean", "median", "mode", "sum", "first", "last", "max", "min"

    The comparison operator mappings are as follows:

    "lt" : "less than",
    "gt" : "greater than", 
    "lte" : "less than or equal to", 
    "gte" : "greater than or equal to", 
    "eq" : "equal", 
    "neq" : "not equal"
  • relative - A relative event type is an alert in which Kapacitor computes the difference between the current value of a metric and the value reported a given period of time ago. The difference between the current and the past value is then compared against a given threshold. If the result of the comparison operation is true, an alert is triggered. For example:

    decrease_in_requests:
        description: |
          This event triggers when the number of requests has decreased relative to the number of requests received
          120 seconds ago.
        event_type: relative
        metric: storage.requests
        condition:
          threshold: -100
          granularity: 120
          resource_type:
            flame_sfp: storage
            flame_sf: storage-users
            flame_location: watershed
          comparison_operator: lte
        action:
          implementation:
            - flame_sfemc

    This trigger specification will create an alert task in Kapacitor, which compares every requests value reported in measurement storage with the value received 120 seconds ago. If the difference between the current and the past value is less than or equal to (comparison operator is lte) -100, an alert is triggered. Simply explained, an alert is triggered if the requests current value has decreased by at least 100 relative to the value reported 120 seconds ago. The queried value is contextualised for service function storage-users (using service function package storage) at location watershed. Triggered alerts will be sent through an HTTP POST message to the URLs listed in the implementation section.

    Notes:

    • aggregation_method is not required here - the alert task compares the actual value that's being reported (stream mode)
    • if aggregation_method is provided, it will be ignored
  • deadman - A deadman event type is an alert in which Kapacitor computes the number of reported points in a measurement for a given period of time. This number is then compared to a given threshold value. If the number of reported points is less than or equal to the threshold value, an alert is triggered. For example:

    missing_storage_measurements:
        description: This event triggers when the number of storage measurements reported falls below the threshold value.
        event_type: deadman
        metric: storage.*
        condition:
          threshold: 0
          granularity: 60
          resource_type:
            flame_sfp: storage
        action:
          implementation:
            - flame_sfemc

    This trigger specification will create an alert task in Kapacitor, which monitors the number of points reported in measurement storage that have the tag flame_sfp set to storage. This value is computed every 60 seconds. If the number of reported points is less than or equal to 0 (no points have been reported for the last 60 seconds), an alert will be triggered. Triggered alerts will be sent through an HTTP POST message to the URLs listed in the implementation section.

    Notes:

    • metric only requires the measurement name in this event type and doesn't require a field name
    • the trigger specification still needs to be consistent with the parsing rule for metric: <measurement>.<field>
    • simply putting a * for field is sufficient, e.g. storage.*
    • even if you put something else for field value, it will be ignored - only the measurement name is used
    • aggregation_method is not required in this event type, any values provided will be ignored
    • comparison operator is not required in this event type, any values provided will be ignored
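
The decision logic of the three event types can be summarised in a few lines of Python. This is a minimal sketch for illustration, not the TICK Script templates that CLMC generates: the function names are hypothetical, and the input values are assumed to have already been selected by measurement, field, tags and time window.

```python
import operator
import statistics

# Comparison operator names mapped to their Python equivalents.
COMPARATORS = {
    "lt": operator.lt, "gt": operator.gt, "lte": operator.le,
    "gte": operator.ge, "eq": operator.eq, "neq": operator.ne,
}

# Python stand-ins for the listed InfluxQL functions.
AGGREGATIONS = {
    "count": len, "mean": statistics.mean, "median": statistics.median,
    "mode": statistics.mode, "sum": sum, "max": max, "min": min,
    "first": lambda values: values[0], "last": lambda values: values[-1],
}


def threshold_alert(values, condition):
    """values: field values reported within the last `granularity` seconds."""
    aggregated = AGGREGATIONS[condition["aggregation_method"]](values)
    return COMPARATORS[condition["comparison_operator"]](aggregated, condition["threshold"])


def relative_alert(current, past, condition):
    """past: the value reported `granularity` seconds before `current`."""
    difference = current - past
    return COMPARATORS[condition["comparison_operator"]](difference, condition["threshold"])


def deadman_alert(point_count, condition):
    """point_count: number of points reported in the last `granularity` seconds."""
    return point_count <= condition["threshold"]


high_latency = {"threshold": 45, "granularity": 120,
                "aggregation_method": "mean", "comparison_operator": "gt"}
decrease_in_requests = {"threshold": -100, "granularity": 120, "comparison_operator": "lte"}
missing_measurements = {"threshold": 0, "granularity": 60}

print(threshold_alert([40, 50, 60], high_latency))      # mean is 50, which exceeds 45
print(relative_alert(350, 500, decrease_in_requests))   # difference is -150
print(deadman_alert(0, missing_measurements))           # no points reported
```

All three calls above evaluate to True, so each of the corresponding alerts would be triggered for these sample inputs.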