diff --git a/docs/monitoring.md b/docs/monitoring.md
index ff32cf3c54c9db424be54f3c3b3db48b9d9c7747..ffdf4cbda05b9adeeadcccb77a85511f091fae59 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -179,25 +179,12 @@ Hierarchical monitoring and scalability considerations
 Using FLIPS monitoring
-* FLIPS offers a hightly scalable
+FLIPS offers a highly scalable pub/sub system. We'll most likely need to use this in place of RabbitMQ for the infrastructure monitoring. The
+monitoring specification is here:
-
-
-ISSUES
-
-**Testing**
-
-Direct InfluxDB ingest (for testing measurements and queries)
-
-* Http API : /db/<database>/series?u=<user>&p=<pass>
-* Java Client : https://github.com/influxdata/influxdb-java
-
-Will go with Http API rather than Java Client as the POJO abstraction is not needed for test cases as it hides
-the detail of the line protocol.
-
-**Adapting the Mona/MOOSE agent?**
+https://drive.google.com/file/d/0B0ig-Rw0sniLMDN2bmhkaGIydzA/view
-MOOSE is the monitoring system provided by POINT and FLIPS. The monitoring specification has been analysed to refactor the measurements into series. The full monitoring specification is available here:
+
 **Trust in measurements**
@@ -208,71 +195,73 @@ If the agent is deployed in a VM/container that a tenant has root access then a
 * Use unix permissions (e.g. surrogates are deployed within root access to them)
-https://drive.google.com/file/d/0B0ig-Rw0sniLMDN2bmhkaGIydzA/view
-
 A couple of comments
 * CPU_UTILISATION_M: will be replaced by other metrics provided directly by Telegraf plugins
-* END_TO_END_LATENCY_M (not clear who the endpoints are)
+* END_TO_END_LATENCY_M (not clear what this measurement means)
 ### Measurements
-#### Capacity Measurements
+#### Infrastructure Slice Capacity Measurements
-Capacity measurements measure the size of the virtual infrastructure slice available to the platform that can be allocated on demand to tenants.
+Capacity measurements measure the size of the infrastructure slice available to the platform that can be allocated on demand to tenants.
+
+Common tags
+
+* slice_id – an identification id for the infrastructure slice
 **host_resource**
 The *host_resource* measurement measures the wholesale host resources available to the platform that can be allocated to media services.
-`host_resource,server_id="",location="" cpus=(integer),memory=(integer),storage=(integer) timestamp`
+`host_resource,slice_id="",server_id="",location="" cpus=(integer),memory=(integer),storage=(integer) timestamp`
 **network_resource**
 network_resource measures the overall capacity of the network available to the platform for allocation to tenants. There are currently no metrics defined for this in the FLIPS monitoring specification, although we can envisage usage metrics such as bandwidth being part of this measurement.
-#### SF Network Measurements
+`network_resource,slice_id="",network_id="", bandwidth=(integer),X=(integer),Y=(integer),Z=(integer) timestamp`
-SF Network Measurements measure aspects of network performance in relation to SFs deployed within the network. There are currently too many names for a node within the network and the following can be considered synonyms (SF, network element, node)
+#### Media Service Measurements
-**node_network_perf**
+Common tags
-node_network_perf provides the network measurement view for network elements. Network elements can be in the role of gateway, forwarding node, network attachment point, rendezvous, service, topology manager or user equipment as defined by the FLIPS monitoring specification. The measurements are made by the Mona monitoring agent.
+* sfc – an orchestration template
+* sfc_instance – an instance of the orchestration template
+* sf_package – an SF type
+* sf_instance – an instance of the SF type
+* vm_instance – an authoritative copy of the SF instance
+* server – a physical or virtual server for hosting VM instances
+* location – the location of the server
+
+##### Network Measurements
+
+Network Measurements measure aspects of network performance in relation to VMs/containers.
+
+The following fields need further analysis as they seem to relate to core ICN and buffering. These do not seem that relevant.
-Fields:
-* BUFFER_SIZES_M
+* BUFFER_SIZES_M
 * FILE_DESCRIPTORS_TYPE_M
-* HTTP_REQUESTS_FQDN_M
 * MATCHES_NAMESPACE_M
-* PATH_CALCULATIONS_NAMESPACE_M
-* PACKET_JITTER_CID_M
 * PUBLISHERS_NAMESPACE_M
-* RX_BYTES_CID_M
-* RX_BYTES_PORT_M
-* RX_PACKETS_M
-* RX_PACKETS_HTTP_M
-* RXPACKETS_IP_M
-* RX_PACKETS_IP_MULTICAST_M
 * SUBSCRIBERS_NAMESPACE_M
-* TX_BYTES_PORT_M
-* TX_BYTES_CID_M
-* TX_BYTES_HTTP_M
-* TX_BYTES_IP_M
-* TX_BYTES_IP_MULTICAST_M
-* TX_PACKETS_PORT_M
-* TX_PACKETS_HTTP_M
-* TX_PACKETS_IP_M
-* TX_PACKETS_IP_MULTICAST_M
-Global Tags
+**node_network_perf**
+
+node_network_perf provides the network measurement view for network elements. Network elements can be in the role of gateway, forwarding node, network attachment point, rendezvous, service, topology manager or user equipment as defined by the FLIPS monitoring specification. The measurements are made by the Mona monitoring agent.
+
+Questions
+
+* Can a single value of jitter (e.g. avg jitter) be calculated from the set of measurements in PACKET_JITTER_CID_M message? What is the time period for the list of jitter measurements?
+* What does CID actually mean?
-* node_id: the network element id allocated to this surrogate
-* sf_inst_id : the service function instance that this node represents in the case of surrogates
-* sf_id : the service function type
-* sfc_inst_id : the service function chain instance that this node is part of
-* sfc_id : the service function chain type that this node is part of
-* server_id : the server where the node is provisioned
-* location : the location of the server
+`node_network_perf,<global_tags>,node_role="",node_name="" timestamp`
+
+* PACKET_JITTER_CID_M
+* RX_BYTES_CID_M
+* TX_BYTES_CID_M
+* RX_PACKETS_IP_M (ipversion)
+* TX_PACKETS_IP_M (ipversion)
 Specific Tags:
@@ -282,18 +271,18 @@ Specific Tags:
 **node_port_perf**
-The netnode_port series provides network measurements on host ports as defined by the FLIPS monitoring specification. The measurements are made by the Mona monitoring agent.
+The netnode_port series provides network measurements on host ports as defined by the FLIPS monitoring specification.
+
+`port_network_perf,<global_tags>,node_id="",port_id="",port_name="" timestamp`
 Fields
 * PACKET_DROP_RATE_M
 * PACKET_ERROR_RATE_M
-
-Tags
-
-* node_id
-* port_id
-* port_name
+* RX_PACKETS_M
+* RX_BYTES_PORT_M
+* TX_BYTES_PORT_M
+* TX_PACKETS_PORT_M
 **link**
@@ -316,25 +305,11 @@ Tags
 SF Host Resource Measurements measures the host resources allocated to a service function deployed by the platform. All measurements have the following global tags to allow the data to be sliced and diced according to dimensions.
-Global Tags
-
-* node_id : the unique id of the network element
-* sf_inst_id : the service function instance that this node represents in the case of surrogates
-* sf_id : the service function type
-* sfc_inst_id : the service function chain instance that this node is part of
-* sfc_id : the service function chain type that this node is part of
-* server_id : the server where the node is provisioned
-* location : the location of the server
-
 **node_host_resource**
-*node_host_resource* measures host resources allocated to a node.
-
-Fields
+The resources allocated to a VM/container.
-* cpus (integer)
-* memory(integer)
-* storage(integer)
+`node_host_resource,<global_tags> cpu,memory,storage timestamp`
 **node_cpu_usage**
@@ -372,125 +347,61 @@ Fields
 **topology_manager**
-Fields
-
-* ???
-
-Global Tags
-
-* node_id: the network element id allocated to this surrogate
-* sf_inst_id : the service function instance that this node represents in the case of surrogates
-* sf_id : the service function type
-* sfc_inst_id : the service function chain instance that this node is part of
-* sfc_id : the service function chain type that this node is part of
-* server_id : the server where the node is provisioned
-* location : the location of the server
-
-Tags
-
-* node_id: the network element id allocated to the topology manager
+PATH_CALCULATIONS_NAMESPACE_M
 **nap**
-nap measurements are the platforms view on IP endpoints such as user equipment and services. A NAP is therefore the boundary of the platform. NAP measurements may need to be extended to provide more information on the relationship between clients and FQDN requests.
-
-Fields
-
-* CHANNEL_AQUISITION_TIME_M
-* CMC_GROUP_SIZE_M
-* NETWORK_LATENCY_FQDN_M
-* RX _BYTES_HTTP_M
-* RX _BYTES_IP_M
-
-Global Tags
-
-* node_id: the network element id allocated to this surrogate
-* sf_inst_id : the service function instance that this node represents in the case of surrogates
-* sf_id : the service function type
-* sfc_inst_id : the service function chain instance that this node is part of
-* sfc_id : the service function chain type that this node is part of
-* server_id : the server where the node is provisioned
-* location : the location of the server
-
-Specific Tags
-* coverage (tbc indicating the reach of the NAP)
-
-**orchestrator**
-
-Fields
-
-* ???
-
-Tags
+nap measurements are the platform's view on IP endpoints such as user equipment and services. A NAP is therefore the boundary of the platform. NAP also measures aspects of co-incidental multicast performance.
-* node_id: the network element id allocated to the orchestrator
+Questions
-**clmc**
+* What is the group id for CHANNEL_AQUISITION_TIME_M and how can this be related to FQDN of the content?
+* What is the predefined time interval for CMC_GROUP_SIZE_M?
+* How does NETWORK_LATENCY_FQDN_M relate to END_TO_END_LATENCY?
+* How are multicast groups identified? i.e. "a request for FQDN within a time period", what's the content granularity here?
+* HTTP_REQUESTS_FQDN_M says from an endpoint yet the measurement does not have a node id, it could be just the total number of requests for a FQDN, in which case it is very much like service request stats of a media service
-Fields
-
-* ???
+* RX_BYTES_IP_MULTICAST_M
+* TX_BYTES_IP_MULTICAST_M
+* RX_PACKETS_HTTP_M
+* TX_PACKETS_HTTP_M
+* TX_BYTES_HTTP_M
+* RX_PACKETS_IP_MULTICAST_M
+* TX_PACKETS_IP_MULTICAST_M
+* RX_BYTES_IP_M
+* TX_BYTES_IP_M
-Tags
+`nap_node,<global_tags>,nodeId="", CHANNEL_AQUISITION_TIME_M, timestamp`
-* node_id: the network element id allocated to the clmc
+* Can we ignore the specific nodes and look at aggregate measurements associated with a multicast group?
+* Here the assumption is that nodes are grouped around requests to access content identified by hashes of FQDN.
-**media_component**
+`nap_multicast,<global_tags>,groupId="",fqdn="" CHANNEL_AQUISITION_TIME_M, CMC_GROUP_SIZE_M, NETWORK_FQDN_LATENCY timestamp`
-Each SF developed by tenants will offer service specific usage and performance measurements. The fields in the measurements will be specific but the tags must include a predefined set of tags to allow series joins with SF Network and SF Host Resource measurements.
+* CHANNEL_AQUISITION_TIME_M: avg time for all nodes in this group over sample period
+* CMC_GROUP_SIZE_M: avg multicast group size over sample period
+* NETWORK_FQDN_LATENCY: avg network latency over sample period
-The actual measurements will be made by agents running on surrogate services which provide authoritative copies of SF instances deployed as part of an overall media service. Therefore the measurement series are named surrogate
+**service**
-Fields
-* [developer defined]
+Each SF developed will offer service specific usage and performance measurements.
-Global Tags
+`service_request,<global_tags>,cont_nav="",cont_rep="",user_id="" <request-params> timestamp`
-* node_id: the network element id allocated to this surrogate
-* sf_inst_id : the service function instance that this node represents in the case of surrogates
-* sf_id : the service function type
-* sfc_inst_id : the service function chain instance that this node is part of
-* sfc_id : the service function chain type that this node is part of
-* server_id : the server where the node is provisioned
-* location : the location of the server
+`service_response,<global_tags>,cont_nav="",cont_rep="",user_id="" response_time timestamp`
 Specific Tags
-* cont_nav: the content interaction id
-* cont_rep: the content representation type
+* cont_nav: the content requested
+* cont_rep: the content representation requested
 * user_id: the pseudonym of the user
 #### Measurements that still need some thinking
-**sf_instance**
-
-Fields
-* ??
-
-Tags
-* ??
+**orch_media_service**
-**sf**
+**orch_sfc_instance**
-Fields
-* ??
+**orch_sf_instance**
-Tags
-* ??
-
-**sfc_inst**
-
-Fields
-* ??
-
-Tags
-* ??
-
-**template**
-
-Fields
-* ??
-
-Tags
-* template_id
-* owner
+**clmc**
\ No newline at end of file
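
Note on the measurement templates in this diff: the backticked lines such as `host_resource,slice_id="",server_id="",location="" cpus=(integer),...` read as templates with placeholder values. The following is a minimal sketch, assuming the target is standard InfluxDB line protocol (unquoted tag values, an `i` suffix on integer fields, a nanosecond timestamp), of how one `host_resource` point might be rendered. The helper names `_escape` and `to_line` and all sample values are hypothetical and do not come from docs/monitoring.md.

```python
def _escape(text: str, chars: str) -> str:
    """Backslash-escape every character listed in `chars` (hypothetical helper)."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Render one point as a line protocol string; only integer fields are supported."""
    for name, value in fields.items():
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"field {name!r} must be an integer")
    # Tags are sorted by key; commas, spaces and equals signs are escaped.
    tag_part = ",".join(
        f"{_escape(key, ', =')}={_escape(str(value), ', =')}"
        for key, value in sorted(tags.items())
    )
    # Integer field values carry a trailing "i".
    field_part = ",".join(
        f"{_escape(key, ', =')}={value}i" for key, value in fields.items()
    )
    head = _escape(measurement, ", ")
    if tag_part:
        head = head + "," + tag_part
    return f"{head} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    # Sample values are invented for illustration only.
    print(to_line(
        "host_resource",
        {"slice_id": "slice1", "server_id": "server-a", "location": "DC 1"},
        {"cpus": 8, "memory": 32, "storage": 500},
        1515151515000000000,
    ))
```

Sorting the tag keys is a design choice rather than a requirement; it keeps the rendered lines stable, which makes test measurements easier to compare.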