diff --git a/docs/graph-monitoring-user-guide.md b/docs/graph-monitoring-user-guide.md
new file mode 100644
index 0000000000000000000000000000000000000000..6c4dc8601e9d636aeff6f1f1f8933459df63b9f1
--- /dev/null
+++ b/docs/graph-monitoring-user-guide.md
@@ -0,0 +1,150 @@
+<!--
+// © University of Southampton IT Innovation Centre, 2018
+//
+// Copyright in this software belongs to University of Southampton
+// IT Innovation Centre of Gamma House, Enterprise Road,
+// Chilworth Science Park, Southampton, SO16 7NS, UK.
+//
+// This software may not be used, sold, licensed, transferred, copied
+// or reproduced in whole or in part in any manner or form or in or
+// on any media by any person other than in accordance with the terms
+// of the Licence Agreement supplied with the software, or otherwise
+// without the prior written consent of the copyright owners.
+//
+// This software is distributed WITHOUT ANY WARRANTY, without even the
+// implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
+// PURPOSE, except where stated in the Licence Agreement supplied with
+// the software.
+//
+// Created By : Nikolay Stanchev
+// Created Date : 17-05-2019
+// Created for Project : FLAME
+-->
+
+
+## CLMC - Graph-based measurements of service end-to-end delay
+
+
+### Input requirements
+
+CLMC offers API endpoints to build and query a layer-based graph data structure starting from the infrastructure network layer
+up to the logical abstraction layer of a media service. This graph can then be further used to measure an aggregation of the end-to-end delay
+from a particular user equipment to a given service function endpoint without putting additional load on the deployed services. For a detailed analysis
+of the calculations performed by CLMC to derive this metric, see the [documentation](https://gitlab.it-innovation.soton.ac.uk/FLAME/consortium/3rdparties/flame-clmc/blob/master/docs/total-service-request-delay.md),
+particularly the [conclusions](https://gitlab.it-innovation.soton.ac.uk/FLAME/consortium/3rdparties/flame-clmc/blob/master/docs/total-service-request-delay.md#conclusion) section.
+In order to use the API, three metrics must first be measured for each service function:
+
+* **response_time** – how much time it takes for a service to process a request (seconds)
+
+* **request_size** – the size of incoming requests for this service (bytes)
+
+* **response_size** – the size of outgoing responses from this service (bytes)
+
+An example is a Tomcat-based service, which uses the Tomcat telegraf input plugin for monitoring – the plugin measures the following
+fields: **bytes_sent**, **bytes_received** and **processing_time**. The measurement name is **tomcat_connector**.
+
+* **processing_time** is the total time spent processing incoming requests, measured since the server has started; therefore,
+this is a constantly increasing value.
+
+* **bytes_sent** and **bytes_received** are measured using the same approach - as continuously increasing totals.
+
+The graph monitoring process runs every X seconds, where X is configurable (e.g. 30 seconds). The media service provider
+must define how to get the aggregated value of the three fields defined above for this X-seconds window.
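+
+As a reference for what one such window contains, the raw field values can be inspected directly in InfluxDB before any
+aggregation is defined. The snippet below is a minimal sketch, not part of CLMC: it assumes the Python `influxdb` client is
+installed and that the raw monitoring data lives in a database named after the SFC (here `fms-sfc`, matching the examples
+further below); adjust the connection details to your deployment.
+
+```python
+from influxdb import InfluxDBClient
+
+# Hypothetical connection details - the host name and the database name are assumptions.
+client = InfluxDBClient(host="clmc-host", port=8086, database="fms-sfc")
+
+# Fetch the raw Tomcat fields reported during the last 30 seconds (one monitoring window).
+result = client.query(
+    'SELECT "processing_time", "bytes_received", "bytes_sent" '
+    'FROM "tomcat_connector" WHERE time > now() - 30s'
+)
+
+for point in result.get_points():
+    print(point)
+```
+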
+If, for example, the media service provider decides to use **mean** values over this window, the following definitions can be
+used for a Tomcat-based service:
+
+* **response_time** - `(max(processing_time) - min(processing_time)) / ((count(processing_time) -1)*1000)`
+
+* **request_size** - `(max(bytes_received) - min(bytes_received)) / (count(bytes_received) - 1)`
+
+* **response_size** - `(max(bytes_sent) - min(bytes_sent)) / (count(bytes_sent) - 1)`
+
+Simply explained, since the Tomcat plugin reports these values as continuously increasing counters, we take the difference between
+the maximum and the minimum value received in the time window and divide it by the number of sampling intervals in the window,
+i.e. `count - 1`, which gives us the average (the response time is also divided by 1000 to convert milliseconds to seconds).
+
+To demonstrate this, let's say that the measurements received in the time window for **processing_time** are 21439394, 21439399 and 21439406 milliseconds.
+Therefore, the average processing time would be (21439406 - 21439394) / ((3 - 1) * 1000) = 0.006 seconds. The same procedure is followed
+for the request size and response size fields.
+
+
+### Running a graph monitoring process
+
+There is a dedicated endpoint which starts an automated graph monitoring script, running in the background on CLMC and
+constantly executing the full processing pipeline - build the temporal graph, query it for end-to-end delay, write the results back to InfluxDB, delete
+the temporal graph. The pipeline uses the defined configuration to periodically build the temporal graph and query for the end-to-end delay
+from all possible UEs to every deployed service function endpoint, and writes the result back into a dedicated measurement in the time-series database (InfluxDB).
+For more information on the graph monitoring pipeline, see the [graph RTT slides](https://owncloud.it-innovation.soton.ac.uk/remote.php/webdav/Shared/FLAME/Project%20Reviews/2nd%20EC%20Review%20(technical)/drafts/WP4_FLAME_Graph_RTT.pptx).
+
+* `POST http://<clmc-host>/clmc/clmc-service/graph/monitor`
+
+* Expected JSON body serving as the configuration of the graph monitoring script:
+
+```json
+{
+  "query_period": "<how often is the graph pipeline executed - defines the length of the time window mentioned above>",
+  "results_measurement_name": "<where to write the end-to-end delay measurements>",
+  "service_function_chain": "<SFC identifier>",
+  "service_function_chain_instance": "<SFC identifier>_1",
+  "service_functions": {
+    "<service function package>": {
+      "response_time_field": "<field measuring the service delay of a service function - as described above>",
+      "request_size_field": "<field measuring the request size of a service function - as described above>",
+      "response_size_field": "<field measuring the response size of a service function - as described above>",
+      "measurement_name": "<the name of the measurement which contains the fields above>"
+    },
+    ...
+  }
+}
+```
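+
+The three `*_field` entries are the aggregation expressions described in the previous section, and CLMC evaluates them over
+the raw data of each `query_period` window. As a purely illustrative sketch of the arithmetic such an expression performs
+(the `processing_time` samples are the ones from the worked example above, the two byte counters are hypothetical), consider:
+
+```python
+# One query_period window of raw, cumulative Tomcat samples.
+processing_time = [21439394, 21439399, 21439406]  # milliseconds since server start
+bytes_received = [1024, 1536, 2048]               # hypothetical cumulative byte counts
+bytes_sent = [20480, 24576, 28672]                # hypothetical cumulative byte counts
+
+
+def mean_increase(samples, divisor=1):
+    """(max - min) / (count - 1), optionally scaled - mirrors the field expressions above."""
+    return (max(samples) - min(samples)) / ((len(samples) - 1) * divisor)
+
+
+response_time = mean_increase(processing_time, divisor=1000)  # 0.006 seconds
+request_size = mean_increase(bytes_received)                  # 512.0 bytes
+response_size = mean_increase(bytes_sent)                     # 4096.0 bytes
+print(response_time, request_size, response_size)
+```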
+
+* Example request with curl:
+
+`curl -X POST -d <JSON body> http://<clmc-host>/clmc/clmc-service/graph/monitor`
+
+* Example JSON body for the tomcat-based service described above:
+
+```json
+{
+  "query_period": 30,
+  "results_measurement_name": "graph_measurements",
+  "service_function_chain": "fms-sfc",
+  "service_function_chain_instance": "fms-sfc_1",
+  "service_functions": {
+    "fms-storage": {
+      "response_time_field": "(max(processing_time) - min(processing_time)) / ((count(processing_time) -1)*1000)",
+      "request_size_field": "(max(bytes_received) - min(bytes_received)) / (count(bytes_received) - 1)",
+      "response_size_field": "(max(bytes_sent) - min(bytes_sent)) / (count(bytes_sent) - 1)",
+      "measurement_name": "tomcat_connector"
+    }
+  }
+}
+```
+
+An example response will look like this:
+
+```json
+{
+  "uuid": "75df6f8d-3829-4fd8-a3e6-b3e917010141",
+  "database": "fms-sfc"
+}
+```
+
+The configuration described above will start a graph monitoring process executing every 30 seconds and writing the end-to-end delay results
+into the measurement named **graph_measurements** in database **fms-sfc**. To stop the graph monitoring process, use the UUID received in
+the response of the previous request:
+
+`curl -X DELETE http://<clmc-host>/clmc/clmc-service/graph/monitor/75df6f8d-3829-4fd8-a3e6-b3e917010141`
+
+To view the status of the graph monitoring process, send the same request but with the GET method rather than DELETE:
+
+`curl -X GET http://<clmc-host>/clmc/clmc-service/graph/monitor/75df6f8d-3829-4fd8-a3e6-b3e917010141`
+
+Keep in mind that since this process executes only once per query period, it is expected to see status **sleeping** in the response.
+Example response:
+
+```json
+{
+  "status": "sleeping",
+  "msg": "Successfully fetched status of graph pipeline process."
+}
+```
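+
+To tie the three endpoints together, the sketch below drives the whole lifecycle - start, status check, stop - from Python.
+It is a minimal illustration rather than part of CLMC: the `requests` library, the configuration file name and the placeholder
+host are assumptions, and the JSON body is sent as raw data just like the `-d` option in the curl example above.
+
+```python
+import json
+
+import requests  # assumed to be available; any HTTP client would do
+
+CLMC_BASE = "http://<clmc-host>/clmc/clmc-service"  # placeholder host, as used in the examples above
+
+# Load the monitoring configuration shown above from a (hypothetical) local file.
+with open("graph-monitor-config.json") as f:
+    config = json.load(f)
+
+# Start the graph monitoring process and remember the UUID from the response.
+start = requests.post(f"{CLMC_BASE}/graph/monitor", data=json.dumps(config))
+request_uuid = start.json()["uuid"]
+
+# Check the status of the process - "sleeping" is expected between executions.
+status = requests.get(f"{CLMC_BASE}/graph/monitor/{request_uuid}")
+print(status.json())
+
+# Stop the process once it is no longer needed.
+requests.delete(f"{CLMC_BASE}/graph/monitor/{request_uuid}")
+```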