update to docs

Commit 4fc1fa3e authored Jan 2, 2018 by Michael Boniface
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -16,6 +16,7 @@ Briefly describe
 Configuration includes:
 * Capacity (servers and networks)
+* Media Service (sfc and sf)
 * Topology (nodes and links)
 * Allocation (Media Service Instance, Service Function Instance, Surrogate Instance)
 * Basic State (up, down, etc.)
@@ -29,14 +30,13 @@ Briefly descirbe:
 * the lifecycle of monitoring data within the platform and how it is used
 * the type of monitoring data
-Monitoring includes usage metrics
+Usage metrics
 * network resource usage
 * host resource usage
 * service usage
+Performance metrics:
-Monitoring includes performance metrics:
 * cpu/sec
 * throughput
@@ -140,7 +140,7 @@ All of the measurements on a specific VM/Container instance share a common conte
 * server – a physical or virtual server for hosting VM instances
 * location – the location of the server
-By including this context with service, network and host measurements it is possible to support a wide range of queries associated with SFC’s whether they are Media Services or the Platform components themselves. By adopting the same convention for identifiers it is possible to combine measurements across service, network and host to create new series that allows exploration of diffeent aspects of the VM instance.
+By including this context with service, network and host measurements it is possible to support a wide range of temporal queries associated with SFCs, whether they are Media Services or the Platform components. By adopting the same convention for identifiers it is possible to combine measurements across service, network and host to create new series that allow exploration of different aspects of the VM instance.
 Give a worked example across service and network measurements
@@ -154,13 +154,13 @@ Discuss specific tags
 ### Architecture
-The monitoring model using an agent based approach. The general architecture is shown in the diagram below.
+The monitoring model uses an agent-based approach, with hierarchical aggregation used as required. The general architecture is shown in the diagram below.
 ![AgentArchitecture](/docs/image/agent-architecture.jpg)
-An agent is deployed on each of the container/VM implementing a SF. The agent is deployed by the orchestrator when the SF is provisioned. The agent is configured with a set of input plugins that collect measurements from three aspects of the SF including network, host and SF usage/perf. The agent is configured with a set of global tags that are inserted for all measurements made by the agent on the host.
+To monitor a service function, an agent is deployed on each container/VM implementing an SF. The agent is deployed by the orchestrator when the SF is provisioned. The agent is configured with a set of input plugins that collect measurements from three aspects of the SF: network, host and SF usage/perf. The agent is also configured with a set of global tags that are inserted into all measurements made by the agent on the host.
-Telegraf agent-based monitoring
+Telegraf agent-based monitoring with the following plugins potentially relevant for integration with FLAME
 * Telegraf AMQP: https://github.com/influxdata/telegraf/tree/release-1.5/plugins/inputs/amqp_consumer
 * Telegraf http json: https://github.com/influxdata/telegraf/tree/release-1.5/plugins/inputs/httpjson
@@ -178,13 +178,13 @@ Agents:
 * deployed at monitoring points (e.g surrogates and other network elements)
 * insert contextual metadata as tags into measurements
-* How does this relate to the Mona agents
+* But how does this relate to the Mona agents?
 Hierarchical monitoring and scalability considerations
 * AMQP can be used to buffer monitoring info
 * InfluxDB can be used to provide aggregation points when used with Telegraf input and output plugin
-* How does this relate to the pub/sub and mySQL aggregator in FLIPS?
+* But how does this relate to the pub/sub and mySQL aggregator in FLIPS?
 Using FLIPS monitoring
@@ -197,42 +197,48 @@ https://drive.google.com/file/d/0B0ig-Rw0sniLMDN2bmhkaGIydzA/view
 **Trust in measurements**
-If the agent is deployed in a VM/container that a tenant has root access then a tenant could change the configuration to fake measuremnents associated with network and host in an attempt gain benefit. This is a security risk. Some ideas include
+If the agent is deployed in a VM/container to which a tenant has root access, then the tenant could change the configuration to fake measurements associated with network and host in an attempt to gain benefit. This is a security risk. Some ideas include:
 * Deploy additional agents on hosts rather than agents to measure network and VM performance. Could be hard to differentiate between the different SFs deployed on a host
 * Generate a hash from the agent configuration file that's checked within the monitoring message. Probably too costly and not part of the telegraf protocol
 * Use unix permissions (e.g. surrogates are deployed without tenant root access to them)
-## Configuration Measurements
+## Configuration Measurement Summary
 |Context|Measurement|Description|
 |---|---|---|
 |Capacity|host_resource|the compute infrastructure allocation to the platform|
 |Capacity|network_resource|the network infrastructure allocation to the platform|
-|Platform|topology_manager|tbd|
+|Platform|topology_manager|specific metrics tbd|
-|Media Service|sfc_config|tbd|
+|Media Service|sfc_config|specific metrics tbd|
-|Media Service|sf_config|tbd|
+|Media Service|sf_config|specific metrics tbd|
 |Media Service|vm_host_config|compute resources allocated to a VM|
 |Media Service|net_port_config|networking constraints on port on a VM|
-## Monitoring Measurements
+*Need to refer to TOSCA here*
+## Usage and Performance Measurement Summary
+|Context|Measurement|Description|
+|---|---|---|
 |Platform|nap_data_io|nap data io at byte, ip and http levels|
 |Platform|nap_fqdn_perf|fqdn request rate and latency|
-|Platform|orchestrator|tbd|
+|Platform|orchestrator|specific metrics tbd|
-|Platform|clmc|tbd|
+|Platform|clmc|specific metrics tbd|
-|Media Service|cpu_usage|vm desc|
+|Media Service|cpu_usage|vm metrics|
-|Media Service|disk_usage|vm desc|
+|Media Service|disk_usage|vm metrics|
-|Media Service|disk_IO|vm desc|
+|Media Service|disk_IO|vm metrics|
-|Media Service|kernel_stats|vm desc|
+|Media Service|kernel_stats|vm metrics|
-|Media Service|memory_usage|vm desc|
+|Media Service|memory_usage|vm metrics|
-|Media Service|process_status|vm desc|
+|Media Service|process_status|vm metrics|
-|Media Service|swap_memory_usage|vm desc|
+|Media Service|swap_memory_usage|vm metrics|
-|Media Service|system_load_uptime|vm desc|
+|Media Service|system_load_uptime|vm metrics|
 |Media Service|net_port_io|vm port network io and error at L2|
-|Media Service|service|vm service perf|
+|Media Service|service|vm service perf metrics|
-#### Infrastructure Capacity Measurements
+## Capacity 
 Capacity measurements measure the size of the infrastructure slice available to the platform that can be allocated on demand to tenants.
@@ -252,9 +258,9 @@ network_resource measures the overall capacity of the network available to the p
 `network_resource,slice_id="",network_id="", bandwidth=(integer),X=(integer),Y=(integer),Z=(integer) timestamp`
-#### Platform Measurements 
+## Platform
-Platform measurements measure the usage and performance of platform components.
+Platform measurements measure the configuration, usage and performance of platform components.
 **topology_manager**
@@ -301,11 +307,19 @@ Fields
 **clmc**
-#### Media Service Measurements
+## Media Service 
-**media_service**
+Media service measurements measure the configuration, usage and performance of media service instances deployed by the platform.
-Aggregate measurement derived from VM/container measurements, most likely calculated using a continuous query of a specific time interval
+### Service Function Chain
+**sfc_config**
+tbd
+**sf_config**
+tbd
 **sfc**
@@ -323,7 +337,9 @@ Aggregate measurement derived from VM/container measurements, most likely calcul
 Aggregate measurement derived from VM/container measurements, most likely calculated using a continuous query of a specific time interval
-#### VM/Container Measurements
+### VM/Container Measurements
+VM/Container Measurements measure the configuration, usage and performance of VM/Container instances deployed by the platform within the context of a media service.
 Common tags
@@ -335,7 +351,7 @@ Common tags
 * server – a physical or virtual server for hosting VM instances
 * location – the location of the server
-##### Network Measurements
+#### Network Measurements
 **net_port_config**
@@ -369,7 +385,7 @@ Fields
 Note that RX_PACKETS_M seems to have inconsistent naming convention. 
-##### VM Host Measurements
+#### VM Host Measurements
 SF Host Resource Measurements measure the host resources allocated to a service function deployed by the platform. All measurements have the following global tags to allow the data to be sliced and diced according to dimensions.
@@ -415,7 +431,7 @@ Specific tags
 [[inputs.system]]
-##### Service Measurements
+#### Service Measurements
 **<prefix>_service_config**
@@ -445,17 +461,17 @@ Specific Tags
 * cont_rep: the content representation requested
 * user: the pseudonym of an individual user or a user classification
-##### MISC Measurements and Questions
+# MISC Measurements and Questions
 The following data points require further analysis
-* CPU_UTILISATION_M: will be replaced by other metrics provided directly by Telegraf plugins
+* CPU_UTILISATION_M: likely to be replaced by other metrics provided directly by Telegraf plugins
-* END_TO_END_LATENCY_M (not clear what this measurement means)
+* END_TO_END_LATENCY_M: not clear what this measurement means, so needs clarification
 * BUFFER_SIZES_M: needs clarification 
 * RX_PACKETS_IP_M: is this just NAP or all Nodes
 * TX_PACKETS_IP_M: is this just NAP or all Nodes
-The following fields need further analysis as they seem to relate to core ICN
+The following fields need further analysis as they seem to relate to core ICN, most likely fields/measurements related to platform components
 * FILE_DESCRIPTORS_TYPE_M 
 * MATCHES_NAMESPACE_M
@@ -465,18 +481,18 @@ The following fields need further analysis as they seem to relate to core ICN
 The following fields relate to CID which I don't understand but jitter is an important metric so we need to find out.
-* Can a single value of jitter (e.g. avg jitter) be calculated from the set of measurements in PACKET_JITTER_CID_M message? What is the time period for the list of jitter measurements?
-* What does CID  mean? consecutive identical digits
 * PACKET_JITTER_CID_M
 * RX_BYTES_CID_M 
 * TX_BYTES_CID_M 
-What about links? What about links between different media service nodes
+Some questions
+* Can a single value of jitter (e.g. avg jitter) be calculated from the set of measurements in PACKET_JITTER_CID_M message? What is the time period for the list of jitter measurements?
+* What does CID mean? Consecutive identical digits?
 #### Link Measurements
-links are established between VM/container instances, need to discuss what measurements make sense. Also the context for links could be between media services, therefore a link measurement should be within the platform context and NOT the media service context. Need a couple of scenarios to work this one out.
+Links are established between VM/container instances; we need to discuss which measurements make sense. The context for a link could also span media services, so a link measurement should sit within the platform context and NOT the media service context. A couple of scenarios are needed to work this out.
 **link_config**
@@ -490,3 +506,5 @@ Link Tags
 * link_state
 **link_perf**
+link_perf is measured at the nodes and is related to end_to_end_latency. Needs further work.
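
The diff describes agents inserting global tags into every measurement and gives an InfluxDB line protocol template for `network_resource`. The following is a minimal Python sketch of how such a line might be assembled, not the project's implementation. The helper names (`to_line`, `_escape`), the global tag values and the field value are hypothetical; only the measurement name and the `slice_id`, `network_id`, `bandwidth`, `server` and `location` keys come from the document. Tag values are written unquoted, since line protocol treats quotes in tag values as literal characters, and all fields are assumed to be integers as in the template.

```python
from typing import Dict


def _escape(text: str, chars: str = ",= ") -> str:
    """Backslash-escape characters that are special in line protocol."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line(measurement: str, tags: Dict[str, str],
            fields: Dict[str, int], timestamp_ns: int) -> str:
    """Format one point as an InfluxDB line protocol string.

    Assumes every field is an integer, hence the trailing 'i'.
    """
    # Measurement names only need commas and spaces escaped.
    name = _escape(measurement, ", ")
    # Tags are sorted by key so the output is deterministic.
    tag_part = "".join(
        f",{_escape(k)}={_escape(str(v))}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(f"{_escape(k)}={int(v)}i" for k, v in fields.items())
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical global tags an agent might add to every measurement.
GLOBAL_TAGS = {"server": "host-01", "location": "DC1"}

line = to_line(
    "network_resource",
    {**GLOBAL_TAGS, "slice_id": "slice1", "network_id": "net1"},
    {"bandwidth": 1000},
    1514851200000000000,
)
print(line)
```

Merging the global tags into each point before formatting mirrors the shared-context idea in the document: service, network and host series written this way carry the same identifiers and can later be combined in queries.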