## FLAME Monitoring Specification

This document describes the low-level monitoring specification for cross-layer management and control within the FLAME platform.

### Principles

#### Measurements Model

The measurement model is based on a time-series model using the TICK stack from InfluxData.

The data model is based on the line protocol, which has the format:

`<measurement>[,<tag-key>=<tag-value>...] <field-key>=<field-value>[,<field2-key>=<field2-value>...] [unix-nano-timestamp]`

Each series has

* a name "measurement"
* 0 or more tags for metadata
* 1 or more fields for the measurement values
* a timestamp.
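
As a minimal sketch of the format above, the following Python helper renders one point as a line-protocol string. The function names and the example tag and field values (`server1`, `bristol`) are illustrative assumptions, not part of the specification.

```python
def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character of `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value) -> str:
    """Render a field value: integers get an `i` suffix, strings are quoted."""
    if isinstance(value, bool):  # bool must be tested before int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    return '"' + _escape(str(value), '\\"') + '"'


def to_line(measurement, tags, fields, timestamp_ns=None) -> str:
    """Build `<measurement>[,tags] <fields> [timestamp]` for a single point."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys give a stable series key
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(value)
        for key, value in fields.items()
    )
    line = head + " " + body
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


# Hypothetical example values
print(to_line("host_resource",
              {"server_id": "server1", "location": "bristol"},
              {"cpus": 8, "memory": 32},
              1515000000000000000))
```

Omitting the timestamp leaves it to the database to assign one on arrival.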

InfluxDB is schemaless, allowing arbitrary series to be stored. For example, the wide variety of media components can create arbitrary measurements without requiring changes to a database schema.

Tags can be structured so that series can be queried by dimension, allowing series data to be sliced and diced. Tags are indexed automatically.
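
To illustrate querying by dimension, the sketch below builds an InfluxQL statement that filters on some tags and groups by others. The helper name and the example measurement, field and tag values are assumptions chosen for illustration only.

```python
def _quote(value: str) -> str:
    """Escape a string literal for use inside single quotes."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def build_query(measurement, field, where_tags, group_by_tags) -> str:
    """Build a SELECT MEAN(...) query filtered and grouped by tags."""
    query = 'SELECT MEAN("{}") FROM "{}"'.format(field, measurement)
    if where_tags:
        conditions = [
            '"{}" = \'{}\''.format(key, _quote(value))
            for key, value in sorted(where_tags.items())
        ]
        query += " WHERE " + " AND ".join(conditions)
    if group_by_tags:
        query += " GROUP BY " + ", ".join('"{}"'.format(t) for t in group_by_tags)
    return query


# Hypothetical example: mean idle CPU for one SF type, split by location
print(build_query("node_cpu_usage", "usage_idle",
                  {"sf_id": "transcoder"}, ["location"]))
```

Because tags are indexed, both the `WHERE` filter and the `GROUP BY` split operate on tag keys rather than on field values.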

#### Temporal Measurements

#### Spatial Measurements

Discuss hierarchical tags vs GPS coordinate systems

### Logical Model

The high-level entities involved in the measurement model are defined in the figure below. The core of the model is the Surrogate SF, which is the primary measurement point because it is the physical realisation of services running on the platform. A Surrogate SF is a process running on a physical or virtual host with ports connecting to other Surrogate SFs within the network. The Surrogate SF has measurement processes running to capture different views on the SF, including the network, host resources, and SF usage/performance. The acquisition of these different views together is a key element of the cross-layer information required for management and control. The measurements about a Surrogate SF are captured by different processes running on the VM or container but are brought together by globally asserted monitoring metadata, allowing the information to be integrated, correlated and analysed.

![MeasurementModel](/docs/images/measurement-model.jpg)

Network and host measurements are general to all Surrogate SFs running within the platform. SF usage and performance measurements are specific to the SF implementation. The platform itself is realised using SFs, and therefore NAPs and the Topology Manager are also monitored using the same model. For media component SFs that form part of a Service Function Chain within a Media Service, the measurement fields are not defined and developers can decide what fields they want to use. However, global tags will be inserted for all measurements to allow SF-specific measurements to be integrated with network and host measurements.
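
The insertion of global tags into developer-defined measurements can be sketched as follows. The function name and example values are hypothetical, and the choice that global tags take precedence when a key clashes is an assumption of this sketch, not a statement about how the agent behaves.

```python
def with_global_tags(point_tags: dict, global_tags: dict) -> dict:
    """Return the tag set for a point after global tags are applied.

    Assumption: on a key clash the global (platform-asserted) value wins.
    """
    merged = dict(point_tags)
    merged.update(global_tags)
    return merged


# Hypothetical values for the global tags used throughout this document
global_tags = {
    "node_id": "node-17",
    "sf_inst_id": "transcoder-1",
    "sf_id": "transcoder",
    "sfc_inst_id": "vod-chain-1",
    "sfc_id": "vod-chain",
    "server_id": "server1",
    "location": "bristol",
}

# Developer-defined tags on an SF-specific measurement
developer_tags = {"cont_rep": "h264", "location": "tenant-supplied"}

print(with_global_tags(developer_tags, global_tags))
```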

### Architecture

The monitoring model uses an agent-based approach. The general architecture is shown in the diagram below.

![AgentArchitecture](/docs/images/agent-architecture.jpg)

An agent is deployed on each container/VM implementing an SF. The agent is deployed by the orchestrator when the SF is provisioned. The agent is configured with a set of input plugins that collect measurements from three aspects of the SF: network, host and SF usage/performance. The agent is also configured with a set of global tags that are inserted for all measurements made by the agent on the host.
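
As a rough sketch of what the orchestrator might generate at provisioning time, the Python below renders a Telegraf configuration with a `[global_tags]` section, a list of input plugins and an InfluxDB output. The function name, tag values, URL and database name are placeholders assumed for illustration.

```python
def render_agent_config(global_tags, input_plugins, influx_url, database) -> str:
    """Render a minimal Telegraf configuration as TOML text."""
    lines = ["[global_tags]"]
    for key, value in global_tags.items():
        lines.append('  {} = "{}"'.format(key, value))
    lines.append("")
    for plugin in input_plugins:
        lines.append("[[inputs.{}]]".format(plugin))
        lines.append("")
    lines.append("[[outputs.influxdb]]")
    lines.append('  urls = ["{}"]'.format(influx_url))
    lines.append('  database = "{}"'.format(database))
    return "\n".join(lines) + "\n"


# Hypothetical values supplied by the orchestrator
config = render_agent_config(
    {"node_id": "node-17", "sf_id": "transcoder", "location": "bristol"},
    ["cpu", "mem", "net_response"],
    "http://clmc.example:8086",
    "flame",
)
print(config)
```

Generating the file centrally keeps the global tags under platform control rather than leaving them to each SF developer.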

Agent-based monitoring

* Telegraf AMQP: https://github.com/influxdata/telegraf/tree/release-1.5/plugins/inputs/amqp_consumer
* Telegraf http json: https://github.com/influxdata/telegraf/tree/release-1.5/plugins/inputs/httpjson
* Telegraf http listener: https://github.com/influxdata/telegraf/tree/release-1.5/plugins/inputs/http_listener 
* Telegraf Bespoke Plugin: https://www.influxdata.com/blog/how-to-write-telegraf-plugin-beginners/
* Telegraf Existing Plugins for common services, relevant plugins include
  * Network Response https://github.com/influxdata/telegraf/tree/release-1.5/plugins/inputs/net_response : could be used to perform basic network monitoring
  * nstat https://github.com/influxdata/telegraf/tree/release-1.5/plugins/inputs/nstat : could be used to monitor the network
  * webhooks https://github.com/influxdata/telegraf/tree/release-1.5/plugins/inputs/webhooks : could be used to monitor end devices
  * procstat https://github.com/influxdata/telegraf/tree/release-1.5/plugins/inputs/procstat : could be used to monitor containers
  * SNMP https://github.com/influxdata/telegraf/tree/release-1.5/plugins/inputs/snmp : could be used to monitor flows
  * sysstat https://github.com/influxdata/telegraf/tree/release-1.5/plugins/inputs/sysstat : could be used to monitor hosts

Direct InfluxDB ingest (for testing measurements and queries)

* Java Client : https://github.com/influxdata/influxdb-java
* HTTP API : `/write?db=<database>&u=<user>&p=<pass>`
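
For testing, a direct write over HTTP can be prepared with the Python standard library alone. This sketch assumes the InfluxDB 1.x `/write` endpoint, which accepts line protocol in the request body. The base URL, database name and sample point are placeholders, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_write_request(base_url, database, lines, user=None, password=None):
    """Prepare a POST request carrying line-protocol points.

    Assumes the InfluxDB 1.x `/write` endpoint with nanosecond precision.
    """
    params = {"db": database, "precision": "ns"}
    if user is not None:
        params["u"] = user
        params["p"] = password or ""
    url = base_url.rstrip("/") + "/write?" + urlencode(params)
    body = "\n".join(lines).encode("utf-8")
    return Request(url, data=body, method="POST")


# Hypothetical endpoint and point
request = build_write_request(
    "http://localhost:8086",
    "flame",
    ["host_resource,server_id=server1 cpus=8i 1515000000000000000"],
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it to a running InfluxDB instance.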

Agents:

* deployed at monitoring points (e.g. surrogates and other network elements)
* insert contextual metadata as tags into measurements
* How does this relate to the Mona agents?

Hierarchical monitoring and scalability considerations

* AMQP can be used to buffer monitoring info
* InfluxDB can be used to provide aggregation points when used with Telegraf input and output plugins
* How does this relate to the pub/sub and MySQL aggregator in FLIPS?

ISSUES

**Adapting the Mona/MOOSE agent?**

MOOSE is the monitoring system provided by POINT and FLIPS. The monitoring specification has been analysed to refactor the measurements into series. The full monitoring specification is available here:

https://drive.google.com/file/d/0B0ig-Rw0sniLMDN2bmhkaGIydzA/view

A couple of comments

* CPU_UTILISATION_M: will be replaced by other metrics provided directly by Telegraf plugins
* END_TO_END_LATENCY_M (not clear who the endpoints are)

**Trust in measurements**

If the agent is deployed in a VM/container to which a tenant has root access, the tenant could change the configuration to fake measurements associated with network and host in an attempt to gain benefit. This is a security risk. Some ideas include

* Deploy additional agents on the hosts, rather than relying on the agents inside the VMs/containers, to measure network and VM performance. It could be hard to differentiate between the different SFs deployed on a host
* Generate a hash from the agent configuration file that is checked within the monitoring message. Probably too costly and not part of the Telegraf protocol
* Use Unix permissions (e.g. surrogates are deployed without tenants having root access to them)
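
The configuration-hash idea listed under "Trust in measurements" could look roughly like the sketch below. It only shows computing and comparing a digest of the agent configuration. How the digest would travel with monitoring messages is left open, and the function names and sample configuration are hypothetical.

```python
import hashlib
import hmac


def config_digest(config_text: str) -> str:
    """Return the SHA-256 hex digest of an agent configuration."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def is_config_trusted(config_text: str, expected_digest: str) -> bool:
    """Compare the current configuration against the digest recorded at deployment."""
    return hmac.compare_digest(config_digest(config_text), expected_digest)


# Digest recorded by the platform when the agent was deployed (hypothetical config)
deployed = '[global_tags]\n  sf_id = "transcoder"\n'
recorded = config_digest(deployed)

# A tenant later edits the configuration
tampered = deployed.replace("transcoder", "something-else")

print(is_config_trusted(deployed, recorded))
print(is_config_trusted(tampered, recorded))
```

A digest alone only detects changes. It does not stop a tenant with root access from also replacing the component that reports it, which is why the other options remain relevant.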


### Measurements

#### Capacity Measurements

Capacity measurements measure the size of the virtual infrastructure slice available to the platform that can be allocated on demand to tenants.

**host_resource**

The *host_resource* measurement measures the wholesale host resources available to the platform that can be allocated to media services.

Fields

* cpus (integer)
* memory (integer)
* storage (integer)

Tags:

* server_id
* location

**network_resource**

network_resource measures the overall capacity of the network available to the platform for allocation to tenants. There are currently no metrics defined for this in the FLIPS monitoring specification, although we can envisage usage metrics such as bandwidth being part of this measurement.

#### SF Network Measurements

SF Network Measurements measure aspects of network performance in relation to SFs deployed within the network. There are currently too many names for a node within the network; the following can be considered synonyms: SF, network element and node.

**node_network_perf**

node_network_perf provides the network measurement view for network elements. Network elements can be in the role of gateway, forwarding node, network attachment point, rendezvous, service, topology manager or user equipment as defined by the FLIPS monitoring specification. The measurements are made by the Mona monitoring agent. 

Fields:
* BUFFER_SIZES_M
* FILE_DESCRIPTORS_TYPE_M 
* HTTP_REQUESTS_FQDN_M
* MATCHES_NAMESPACE_M
* PATH_CALCULATIONS_NAMESPACE_M 
* PACKET_JITTER_CID_M 
* PUBLISHERS_NAMESPACE_M 
* RX_BYTES_CID_M 
* RX_BYTES_PORT_M 
* RX_PACKETS_M 
* RX_PACKETS_HTTP_M 
* RX_PACKETS_IP_M
* RX_PACKETS_IP_MULTICAST_M 
* SUBSCRIBERS_NAMESPACE_M 
* TX_BYTES_PORT_M 
* TX_BYTES_CID_M 
* TX_BYTES_HTTP_M 
* TX_BYTES_IP_M 
* TX_BYTES_IP_MULTICAST_M 
* TX_PACKETS_PORT_M 
* TX_PACKETS_HTTP_M 
* TX_PACKETS_IP_M 
* TX_PACKETS_IP_MULTICAST_M 

Global Tags

* node_id: the network element id allocated to this surrogate
* sf_inst_id : the service function instance that this node represents in the case of surrogates
* sf_id : the service function type
* sfc_inst_id : the service function chain instance that this node is part of
* sfc_id : the service function chain type that this node is part of
* server_id : the server where the node is provisioned
* location : the location of the server

Specific Tags:

* node_role
* name
* state

**node_port_perf**

The node_port_perf series provides network measurements on host ports as defined by the FLIPS monitoring specification. The measurements are made by the Mona monitoring agent.

Fields

* PACKET_DROP_RATE_M
* PACKET_ERROR_RATE_M

Tags

* node_id
* port_id
* port_name

**link**

The link series provides measurements about network links. Currently the FLIPS monitoring specification defines only topological configuration information and does not provide any measurements related to links. All performance information is included as part of the nodes. Further investigation is needed to understand if derived measurements related to links are needed or whether this is just useful for monitoring the temporal evolution of the topology. 

Fields

* ??
* ??

Tags

* link_name
* link_id
* source_node_id
* destination_node_id
* link_type

#### SF Host Resource Measurements

SF Host Resource Measurements measure the host resources allocated to a service function deployed by the platform. All measurements have the following global tags to allow the data to be sliced and diced by dimension.

Global Tags

* node_id : the unique id of the network element
* sf_inst_id : the service function instance that this node represents in the case of surrogates
* sf_id : the service function type
* sfc_inst_id : the service function chain instance that this node is part of
* sfc_id : the service function chain type that this node is part of
* server_id : the server where the node is provisioned
* location : the location of the server

**node_host_resource**

*node_host_resource* measures host resources allocated to a node.

Fields

* cpus (integer)
* memory (integer)
* storage (integer)

**node_cpu_usage**

`[[inputs.cpu]]`

**node_disk_usage**

`[[inputs.disk]]`

**node_disk_IO**

`[[inputs.diskio]]`

**node_kernel_stats**

`[[inputs.kernel]]`

**node_memory_usage**

`[[inputs.mem]]`

**node_process_status**

`[[inputs.processes]]`

**node_swap_memory_usage**

`[[inputs.swap]]`

**node_system_load_uptime**

`[[inputs.system]]`
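
One way the series names above could be tied to the listed Telegraf input plugins is the per-plugin `name_override` option, which replaces the plugin's default measurement name. Whether the platform intends to rename measurements this way is an assumption. The sketch below simply renders the corresponding configuration sections from the mapping given in this section.

```python
# Series name in this specification -> Telegraf input plugin listed for it
MEASUREMENT_TO_PLUGIN = {
    "node_cpu_usage": "cpu",
    "node_disk_usage": "disk",
    "node_disk_IO": "diskio",
    "node_kernel_stats": "kernel",
    "node_memory_usage": "mem",
    "node_process_status": "processes",
    "node_swap_memory_usage": "swap",
    "node_system_load_uptime": "system",
}


def render_inputs(mapping: dict) -> str:
    """Render one [[inputs.<plugin>]] section per series, renamed via name_override."""
    blocks = []
    for measurement, plugin in mapping.items():
        blocks.append(
            '[[inputs.{}]]\n  name_override = "{}"\n'.format(plugin, measurement)
        )
    return "\n".join(blocks)


print(render_inputs(MEASUREMENT_TO_PLUGIN))
```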

#### SF Usage and Perf Measurements

**topology_manager**

Fields

* ???

Global Tags

* node_id: the network element id allocated to this surrogate
* sf_inst_id : the service function instance that this node represents in the case of surrogates
* sf_id : the service function type
* sfc_inst_id : the service function chain instance that this node is part of
* sfc_id : the service function chain type that this node is part of
* server_id : the server where the node is provisioned
* location : the location of the server

Tags

* node_id: the network element id allocated to the topology manager

**nap**

nap measurements are the platform's view on IP endpoints such as user equipment and services. A NAP is therefore the boundary of the platform. NAP measurements may need to be extended to provide more information on the relationship between clients and FQDN requests.

Fields

* CHANNEL_AQUISITION_TIME_M 
* CMC_GROUP_SIZE_M
* NETWORK_LATENCY_FQDN_M
* RX_BYTES_HTTP_M
* RX_BYTES_IP_M

Global Tags

* node_id: the network element id allocated to this surrogate
* sf_inst_id : the service function instance that this node represents in the case of surrogates
* sf_id : the service function type
* sfc_inst_id : the service function chain instance that this node is part of
* sfc_id : the service function chain type that this node is part of
* server_id : the server where the node is provisioned
* location : the location of the server

Specific Tags
* coverage (tbc indicating the reach of the NAP)

**orchestrator**

Fields

* ???

Tags

* node_id: the network element id allocated to the orchestrator

**clmc**

Fields

* ???

Tags

* node_id: the network element id allocated to the clmc

**media_component**

Each SF developed by tenants will offer service-specific usage and performance measurements. The fields in the measurements will be specific to the SF, but the tags must include a predefined set of tags to allow series joins with SF Network and SF Host Resource measurements.

The actual measurements will be made by agents running on surrogate services, which provide authoritative copies of SF instances deployed as part of an overall media service. Therefore the measurement series are named surrogate.

Fields
* [developer defined]

Global Tags

* node_id: the network element id allocated to this surrogate
* sf_inst_id : the service function instance that this node represents in the case of surrogates
* sf_id : the service function type
* sfc_inst_id : the service function chain instance that this node is part of
* sfc_id : the service function chain type that this node is part of
* server_id : the server where the node is provisioned
* location : the location of the server

Specific Tags

* cont_nav: the content interaction id 
* cont_rep: the content representation type
* user_id: the pseudonym of the user
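
The series joins that the shared global tags are meant to enable can be illustrated in memory. The sketch below joins SF-specific points with host points on `sf_inst_id`. The point layout (dicts with `tags` and `fields`), the function name and the example field names are assumptions for illustration, and timestamps are ignored for brevity.

```python
def join_on_tags(left, right, keys=("sf_inst_id",)):
    """Inner-join two lists of points on the given tag keys, merging their fields."""
    index = {}
    for point in right:
        key = tuple(point["tags"].get(k) for k in keys)
        index.setdefault(key, []).append(point)

    joined = []
    for point in left:
        key = tuple(point["tags"].get(k) for k in keys)
        for match in index.get(key, []):
            joined.append({
                "tags": dict(zip(keys, key)),
                "fields": {**point["fields"], **match["fields"]},
            })
    return joined


# Hypothetical SF-specific and host points sharing the sf_inst_id global tag
sf_points = [
    {"tags": {"sf_inst_id": "sf-1", "cont_rep": "h264"}, "fields": {"requests": 10}},
]
host_points = [
    {"tags": {"sf_inst_id": "sf-1"}, "fields": {"usage_idle": 80.0}},
    {"tags": {"sf_inst_id": "sf-2"}, "fields": {"usage_idle": 50.0}},
]

print(join_on_tags(sf_points, host_points))
```

A real correlation would also need to align points in time, for example by bucketing timestamps before joining.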

#### Measurements that still need some thinking 

**sf_instance**

Fields
* ??

Tags
* ??

**sf**

Fields
* ??

Tags
* ??

**sfc_inst**

Fields
* ??

Tags
* ??

**template**

Fields
* ??

Tags
* template_id
* owner