From a040d1c42d9d3a5081aeea91f3ca2e89a04968f0 Mon Sep 17 00:00:00 2001
From: Nikolay Stanchev <ns17@it-innovation.soton.ac.uk>
Date: Fri, 27 Apr 2018 14:17:15 +0100
Subject: [PATCH] Issue #67 - added documentation for aggregation

---
 docs/aggregation.md | 144 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 144 insertions(+)
 create mode 100644 docs/aggregation.md

diff --git a/docs/aggregation.md b/docs/aggregation.md
new file mode 100644
index 0000000..785c9d8
--- /dev/null
+++ b/docs/aggregation.md
@@ -0,0 +1,144 @@
+<!--
+// © University of Southampton IT Innovation Centre, 2017
+//
+// Copyright in this software belongs to University of Southampton
+// IT Innovation Centre of Gamma House, Enterprise Road,
+// Chilworth Science Park, Southampton, SO16 7NS, UK.
+//
+// This software may not be used, sold, licensed, transferred, copied
+// or reproduced in whole or in part in any manner or form or in or
+// on any media by any person other than in accordance with the terms
+// of the Licence Agreement supplied with the software, or otherwise
+// without the prior written consent of the copyright owners.
+//
+// This software is distributed WITHOUT ANY WARRANTY, without even the
+// implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
+// PURPOSE, except where stated in the Licence Agreement supplied with
+// the software.
+//
+// Created By : Nikolay Stanchev
+// Created Date : 27-04-2018
+// Created for Project : FLAME
+-->
+
+## **Flame CLMC - Network and Media Service measurements aggregation**
+
+### **Idea**
+
+The idea is to aggregate platform measurement points with media service measurement points and obtain a third measurement from which we can easily
+understand both end-to-end and round-trip performance of a media service. This is achieved by having a Python script running in the background and aggregating
+the data from both measurements over a given sample period, e.g. every 10 seconds. The script then posts the aggregated data back to Influx in a new measurement.
+
+
+### **Assumptions**
+
+* Network measurement - the assumption is that we have a measurement for the network link delays, called **network_delays**, providing the following information:
+
+| path (tag) | delay | time |
+| --- | --- | --- |
+| path identifier | e2e delay for the given path (ms) | time of measurement |
+
+Here, the **path** tag value is the identifier of the path between two nodes in the network topology obtained from FLIPS. The assumption is that those identifiers
+will be structured in such a way that we can obtain the source and target endpoint IDs from the path identifier itself. For example:
+ **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu**
+We can easily split the string on **'---'** and, thus, find that the source endpoint is **endpoint1.ms-A.ict-flame.eu**, while the target endpoint is
+**endpoint2.ms-A.ict-flame.eu**.
+The delay field value is the network end-to-end delay in milliseconds for the path identified in the tag value.
+
+* Media service measurement - the assumption is that we have a measurement for media services' response time, called **service_delays**, providing the following information:
+
+| FQDN (tag) | sf_instance (tag) | endpoint (tag) | response_time | time |
+| --- | --- | --- | --- | --- |
+| media service FQDN | ID of the service function instance | endpoint identifier | response time for the media service (s) | time of measurement |
+
+Here, the **FQDN**, **sf_instance** and **endpoint** tag values identify a unique response time measurement. The response time field value is the
+response time (measured in seconds) for the media service only, and it does not take into account any of the network measurements.
+
+
+### **Goal**
+
+The ultimate goal is to populate a new measurement, called **e2e_delays**, which will provide the following information:
+
+| pathID_F (tag) | pathID_R (tag) | FQDN (tag) | sf_instance (tag) | D_path_F | D_path_R | D_service | time |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+
+* *pathID_F* - tag used to identify the path in forward direction, e.g. **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu**
+* *pathID_R* - tag used to identify the path in reverse direction, e.g. **endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu**
+* *FQDN* - tag used to identify the media service
+* *sf_instance* - tag used to identify the service function instance
+* *D_path_F* - network delay for path in forward direction, as reported in **network_delays** (milliseconds)
+* *D_path_R* - network delay for path in reverse direction, as reported in **network_delays** (milliseconds)
+* *D_service* - media service response time, as reported in **service_delays** (seconds)
+
+Then we can easily query on this measurement to obtain different performance indicators, such as end-to-end overall delays,
+round-trip response time or any of the contributing parts in those performance indicators.
+
+
+### **Aggregation script**
+
+What the aggregation script does is very similar to the functionality of a continuous query. Given a sample report period, e.g. 10s,
+the script executes every 10 seconds, querying the averaged data for the last 10 seconds. The executed queries are:
+
+* Network delays query - to obtain the network delay values and group them by their **path** identifier:
+```
+SELECT mean(delay) as "Dnet" FROM "E2EMetrics"."autogen".network_delays WHERE time >= now() - 10s and time < now() GROUP BY path
+```
+
+* Media service response time query - to obtain the response time values of the media service instances and group them by **FQDN**, **sf_instance** and **endpoint** identifiers:
+```
+SELECT mean(response_time) as "Dresponse" FROM "E2EMetrics"."autogen".service_delays WHERE time >= now() - 10s and time < now() GROUP BY FQDN, sf_instance, endpoint
+```
+
+The results of the queries are then matched against each other on endpoint ID: on every match of the **endpoint** tag of the **service_delays** measurement with
+the target endpoint ID of the **network_delays** measurement, the rows are combined to obtain an **e2e_delays** measurement row, which is posted back to Influx.
+
+Example:
+
+* Result from first query:
+```
+name: network_delays
+tags: path=endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu
+time                Dnet
+----                ----
+1524833145975682287 9.2
+
+name: network_delays
+tags: path=endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu
+time                Dnet
+----                ----
+1524833145975682287 10.3
+```
+
+* Result from second query:
+```
+name: service_delays
+tags: FQDN=ms-A.ict-flame.eu, endpoint=endpoint2.ms-A.ict-flame.eu, sf_instance=test-sf-clmc-agent-build_INSTANCE
+time                Dresponse
+----                ---------
+1524833145975682287 11
+```
+
+The script will parse the path identifier **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu** and find that the target endpoint is
+**endpoint2.ms-A.ict-flame.eu**. Then the script checks if there is a service delay measurement row matching this endpoint. Since there is one,
+those values will be merged, so the result will be a row like this:
+
+| pathID_F (tag) | pathID_R (tag) | FQDN (tag) | sf_instance (tag) | D_path_F | D_path_R | D_service | time |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+| endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu | endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu | ms-A.ict-flame.eu | test-sf-clmc-agent-build_INSTANCE | 9.2 | 10.3 | 11 | 1524833145975682287 |
+
+Here, another assumption is that we can reverse the path identifier of a network delay row and that the reverse path delay is also
+reported in the **network_delays** measurement.
+
+The resulting row would then be posted back to Influx in the **e2e_delays** measurement.
+
+
+### **Reasons why we cannot simply use a continuous query to do the job of the script**
+
+* Influx offers very limited functionality for merging measurements. When doing a **select into** from multiple measurements, e.g.
+*SELECT * INTO measurement0 FROM measurement1, measurement2*,
+Influx will try to merge the data on matching time stamps and tag values (if there are any tags). If the two measurements
+differ in tags, then we get rows with missing data.
+* When doing a continuous query, we cannot perform any kind of manipulation on the data, which prevents us from choosing which
+rows to merge together.
+* Continuous queries were not meant to be used for merging measurements. The main use case the developers provide is for
+downsampling the data in one measurement.
\ No newline at end of file
-- 
GitLab
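
The matching step described in the **Aggregation script** section of the patch can be sketched in Python, the language the document names for the script. This is only an illustration under stated assumptions: the function names (`split_path`, `reverse_path`, `aggregate`), the dictionary layout of the input and output rows, and the decision to skip a path whose reverse delay is missing are hypothetical and are not taken from the actual script.

```python
from typing import Any, Dict, List, Tuple

# Separator taken from the documented path identifier format.
PATH_SEPARATOR = "---"


def split_path(path_id: str) -> Tuple[str, str]:
    """Return (source_endpoint, target_endpoint) for a path identifier."""
    source, target = path_id.split(PATH_SEPARATOR, 1)
    return source, target


def reverse_path(path_id: str) -> str:
    """Build the identifier of the same path in the reverse direction."""
    source, target = split_path(path_id)
    return target + PATH_SEPARATOR + source


def aggregate(network_rows: List[Dict[str, Any]],
              service_rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Combine averaged network and service rows into e2e_delays rows.

    Assumed (hypothetical) input layout:
      network_rows: {"path": str, "Dnet": float, "time": int}
      service_rows: {"FQDN": str, "sf_instance": str, "endpoint": str,
                     "Dresponse": float, "time": int}
    """
    delay_by_path = {row["path"]: row["Dnet"] for row in network_rows}

    # Index service rows by endpoint so each path target is a dict lookup.
    services_by_endpoint: Dict[str, List[Dict[str, Any]]] = {}
    for row in service_rows:
        services_by_endpoint.setdefault(row["endpoint"], []).append(row)

    e2e_rows = []
    for net in network_rows:
        _, target = split_path(net["path"])
        reverse_id = reverse_path(net["path"])
        if reverse_id not in delay_by_path:
            # Assumption: without a reported reverse delay, no row is built.
            continue
        for svc in services_by_endpoint.get(target, []):
            e2e_rows.append({
                "measurement": "e2e_delays",
                "tags": {
                    "pathID_F": net["path"],
                    "pathID_R": reverse_id,
                    "FQDN": svc["FQDN"],
                    "sf_instance": svc["sf_instance"],
                },
                "fields": {
                    "D_path_F": net["Dnet"],
                    "D_path_R": delay_by_path[reverse_id],
                    "D_service": svc["Dresponse"],
                },
                "time": net["time"],
            })
    return e2e_rows
```

Given the three example rows from the document (two **network_delays** rows and one **service_delays** row for **endpoint2.ms-A.ict-flame.eu**), this sketch should produce a single **e2e_delays** row, because only the forward path has a target endpoint with a matching service row. The values are passed through unchanged, so the row still mixes milliseconds (network delays) and seconds (service response time).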