Issue #67 - added documentation for aggregation

a040d1c4 · Nikolay Stanchev · 3bf3e831 · a040d1c4
Commit a040d1c4 authored 7 years ago by Nikolay Stanchev
--- a/docs/aggregation.md
+++ b/docs/aggregation.md
+<!--
+// © University of Southampton IT Innovation Centre, 2017
+//
+// Copyright in this software belongs to University of Southampton
+// IT Innovation Centre of Gamma House, Enterprise Road, 
+// Chilworth Science Park, Southampton, SO16 7NS, UK.
+//
+// This software may not be used, sold, licensed, transferred, copied
+// or reproduced in whole or in part in any manner or form or in or
+// on any media by any person other than in accordance with the terms
+// of the Licence Agreement supplied with the software, or otherwise
+// without the prior written consent of the copyright owners.
+//
+// This software is distributed WITHOUT ANY WARRANTY, without even the
+// implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
+// PURPOSE, except where stated in the Licence Agreement supplied with
+// the software.
+//
+//      Created By :            Nikolay Stanchev
+//      Created Date :          27-04-2018
+//      Created for Project :   FLAME
+-->
+## **Flame CLMC - Network and Media Service measurements aggregation**
+### **Idea** 
+The idea is to aggregate platform measurement points with media service measurement points and obtain a third measurement from which we can easily
+understand both end-to-end and round-trip performance of a media service. This is achieved by having a Python script running in the background and aggregating
+the data from both measurements over a given sample period, e.g. every 10 seconds. The script then posts the aggregated data back to InfluxDB in a new measurement.
+### **Assumptions**
+* Network measurement - the assumption is that we have a measurement for the network link delays, called **network_delays**, providing the following information:  
+| path (tag) | delay | time |
+| --- | --- | --- |
+| path identifier | e2e delay for the given path | time of measurement |
+Here, the **path** tag value is the identifier of the path between two nodes in the network topology obtained from FLIPS. The assumption is that those identifiers
+will be structured in such a way that we can obtain the source and target endpoint IDs from the path identifier itself. For example:  
+ **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu**  
+We can easily split the string on **'---'** and thus find that the source endpoint is **endpoint1.ms-A.ict-flame.eu**, while the target endpoint is 
+**endpoint2.ms-A.ict-flame.eu**.  
+The delay field value is the network end-to-end delay in milliseconds for the path identified in the tag value.
+* Media service measurement - the assumption is that we have a measurement for the media services' response time, called **service_delays**, providing the following information:
+| FQDN (tag) | sf_instance (tag) | endpoint (tag) | response_time | time |
+| --- | --- | --- | --- | --- |
+| media service FQDN | ID of the service function instance | endpoint identifier | response time for the media service (s) | time of measurement |
+Here, the **FQDN**, **sf_instance** and **endpoint** tag values identify a unique response time measurement. The response time field value is the 
+response time (measured in seconds) for the media service only, and it does not take into account any of the network measurements.
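The path identifier convention above can be illustrated with a minimal Python sketch. The helper names `split_path` and `reverse_path` are hypothetical and are not taken from the actual aggregation script; the sketch assumes the **'---'** separator never appears inside an endpoint ID.

```python
from typing import NamedTuple

# Separator assumed by the path identifier convention described above.
PATH_SEPARATOR = "---"


class PathEndpoints(NamedTuple):
    source: str
    target: str


def split_path(path_id: str) -> PathEndpoints:
    """Split a path identifier into its source and target endpoint IDs."""
    source, sep, target = path_id.partition(PATH_SEPARATOR)
    if not sep or not source or not target:
        raise ValueError(f"not a valid path identifier: {path_id!r}")
    return PathEndpoints(source, target)


def reverse_path(path_id: str) -> str:
    """Build the identifier of the same path in the reverse direction."""
    source, target = split_path(path_id)
    return f"{target}{PATH_SEPARATOR}{source}"


if __name__ == "__main__":
    forward = "endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu"
    print(split_path(forward))
    print(reverse_path(forward))
```

Rejecting identifiers without a separator keeps malformed **path** tag values from being silently matched against an empty endpoint ID.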
+### **Goal**
+The ultimate goal is to populate a new measurement, called **e2e_delays**, which will provide the following information:
+| pathID_F (tag) | pathID_R (tag) | FQDN (tag) | sf_instance (tag) | D_path_F | D_path_R | D_service | time |
+| --- | --- | --- | --- | --- | --- | --- | --- | 
+* *pathID_F* - tag used to identify the path in forward direction, e.g. **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu**
+* *pathID_R* - tag used to identify the path in reverse direction, e.g. **endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu**
+* *FQDN* - tag used to identify the media service
+* *sf_instance* - tag used to identify the service function instance of the media service
+* *D_path_F* - network delay for path in forward direction
+* *D_path_R* - network delay for path in reverse direction
+* *D_service* - media service response time
+Then we can easily query on this measurement to obtain different performance indicators, such as end-to-end overall delays, 
+round-trip response time or any of the contributing parts in those performance indicators. 
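As a rough illustration of such indicators, the sketch below derives a round-trip time and the share contributed by each part from one **e2e_delays** row. It assumes the units stated in the Assumptions section are kept (network delays in milliseconds, service response time in seconds) and that round-trip time is the sum of the forward delay, the service response time and the reverse delay; both points, as well as the function names, are assumptions made for this example rather than something the document defines.

```python
MS_PER_SECOND = 1000.0


def round_trip_ms(d_path_f_ms: float, d_path_r_ms: float, d_service_s: float) -> float:
    """Round-trip time in milliseconds (assumed: forward + service + reverse)."""
    return d_path_f_ms + d_service_s * MS_PER_SECOND + d_path_r_ms


def contributions(d_path_f_ms: float, d_path_r_ms: float, d_service_s: float) -> dict:
    """Fraction of the round-trip time contributed by each part."""
    total = round_trip_ms(d_path_f_ms, d_path_r_ms, d_service_s)
    if total <= 0:
        raise ValueError("round-trip time must be positive")
    return {
        "D_path_F": d_path_f_ms / total,
        "D_path_R": d_path_r_ms / total,
        "D_service": d_service_s * MS_PER_SECOND / total,
    }


if __name__ == "__main__":
    # Hypothetical values: 9.2 ms forward, 10.3 ms reverse, 0.011 s service time.
    print(round_trip_ms(9.2, 10.3, 0.011))
    print(contributions(9.2, 10.3, 0.011))
```

If the script converts all fields to one unit before writing to **e2e_delays**, the conversion factor should be dropped.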
+### **Aggregation script**
+What the aggregation script does is very similar to the functionality of a continuous query. Given a sample report period, e.g. 10s,
+the script runs once every 10 seconds and queries the averaged data for the last 10 seconds. The executed queries are:  
+* Network delays query - to obtain the network delay values and group them by their **path** identifier:
+```
+SELECT mean(delay) as "Dnet" FROM "E2EMetrics"."autogen".network_delays WHERE time >= now() - 10s and time < now() GROUP BY path
+``` 
+* Media service response time query - to obtain the response time values of the media service instances and group them by **FQDN**, **sf_instance** and **endpoint** identifiers: 
+```
+SELECT mean(response_time) as "Dresponse" FROM "E2EMetrics"."autogen".service_delays WHERE time >= now() - 10s and time < now() GROUP BY FQDN, sf_instance, endpoint
+```
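Since the report period is configurable, the two queries could be generated from it. The following is a minimal sketch, assuming the period is given in whole seconds; the function name `build_queries` is hypothetical, while the database name, retention policy and query text are taken from the queries above.

```python
def build_queries(period_s: int, database: str = "E2EMetrics",
                  retention_policy: str = "autogen") -> dict:
    """Return the two InfluxQL queries for the last `period_s` seconds."""
    if period_s <= 0:
        raise ValueError("the report period must be a positive number of seconds")

    source = f'"{database}"."{retention_policy}"'
    window = f"time >= now() - {period_s}s and time < now()"

    return {
        "network_delays": (
            f'SELECT mean(delay) as "Dnet" FROM {source}.network_delays '
            f"WHERE {window} GROUP BY path"
        ),
        "service_delays": (
            f'SELECT mean(response_time) as "Dresponse" FROM {source}.service_delays '
            f"WHERE {window} GROUP BY FQDN, sf_instance, endpoint"
        ),
    }


if __name__ == "__main__":
    for name, query in build_queries(10).items():
        print(name, "->", query)
```

Using the same time window string for both queries keeps the two result sets aligned on the same sample period.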
+The results of the queries are then matched against each other on endpoint ID: on every match of the **endpoint** tag of the **service_delays** measurement with
+the target endpoint ID of the **network_delays** measurement, the rows are combined to obtain an **e2e_delays** measurement row, which is posted back to InfluxDB.
+Example:
+* Result from first query:
+```
+name: network_delays
+tags: path=endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu
+time                Dnet
+----                ----
+1524833145975682287 9.2
+name: network_delays
+tags: path=endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu
+time                Dnet
+----                ----
+1524833145975682287 10.3
+```
+* Result from second query:
+```
+name: service_delays
+tags: FQDN=ms-A.ict-flame.eu, endpoint=endpoint2.ms-A.ict-flame.eu, sf_instance=test-sf-clmc-agent-build_INSTANCE
+time                Dresponse
+----                ---------
+1524833145975682287 11
+```
+The script will parse the path identifier **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu** and find that the target endpoint is
+**endpoint2.ms-A.ict-flame.eu**. Then the script checks whether there is a service delay measurement row matching this endpoint. Since there is one,
+those values will be merged, so the result will be a row like this:
+| pathID_F (tag) | pathID_R (tag) | FQDN (tag) | sf_instance (tag) | D_path_F | D_path_R | D_service | time |
+| --- | --- | --- | --- | --- | --- | --- | --- | 
+| endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu | endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu | ms-A.ict-flame.eu | test-sf-clmc-agent-build_INSTANCE | 9.2 | 10.3 | 11 | 1524833145975682287 | 
+Here, another assumption is made that we can reverse the path identifier of a network delay row and that the reverse path delay would also 
+be reported in the **network_delays** measurement. 
+The resulting row would then be posted back to InfluxDB in the **e2e_delays** measurement.
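The matching step in this example can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual script: the function name `aggregate` and the input and output shapes (a dictionary of mean delays keyed by path, a list of service rows, and rows split into tags and fields) are hypothetical, and skipping a path whose reverse direction was not reported is a choice made for this sketch.

```python
PATH_SEPARATOR = "---"


def aggregate(network_delays: dict, service_delays: list, timestamp: int) -> list:
    """Combine network and service rows into e2e_delays rows.

    network_delays: {path identifier: mean delay for the sample period}
    service_delays: dicts with "FQDN", "sf_instance", "endpoint" and "Dresponse"
    """
    # Index the service rows by endpoint so each path needs a single lookup.
    by_endpoint = {}
    for row in service_delays:
        by_endpoint.setdefault(row["endpoint"], []).append(row)

    rows = []
    for path_f, d_path_f in network_delays.items():
        source, sep, target = path_f.partition(PATH_SEPARATOR)
        if not sep:
            continue  # not a valid path identifier
        path_r = f"{target}{PATH_SEPARATOR}{source}"
        if path_r not in network_delays:
            continue  # assumption: skip when the reverse delay was not reported
        for service in by_endpoint.get(target, []):
            rows.append({
                "tags": {
                    "pathID_F": path_f,
                    "pathID_R": path_r,
                    "FQDN": service["FQDN"],
                    "sf_instance": service["sf_instance"],
                },
                "fields": {
                    "D_path_F": d_path_f,
                    "D_path_R": network_delays[path_r],
                    "D_service": service["Dresponse"],
                },
                "time": timestamp,
            })
    return rows


if __name__ == "__main__":
    network = {
        "endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu": 9.2,
        "endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu": 10.3,
    }
    services = [{
        "FQDN": "ms-A.ict-flame.eu",
        "endpoint": "endpoint2.ms-A.ict-flame.eu",
        "sf_instance": "test-sf-clmc-agent-build_INSTANCE",
        "Dresponse": 11,
    }]
    for row in aggregate(network, services, 1524833145975682287):
        print(row)
```

With the example data, only the path whose target is **endpoint2.ms-A.ict-flame.eu** produces a row, because no service delay was reported for **endpoint1.ms-A.ict-flame.eu**.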
+### **Reasons why we cannot simply use a continuous query to do the job of the script**
+* InfluxDB offers very limited functionality for merging measurements. When doing a **select into** from multiple measurements, e.g.   
+`SELECT * INTO measurement0 FROM measurement1, measurement2`  
+InfluxDB will try to merge the data on matching time stamps and tag values (if there are any tags). If the two measurements
+differ in tags, then we get rows with missing data.
+* When doing a continuous query, we cannot perform any kind of manipulation on the data, which prevents us from choosing which
+rows to merge together.
+* Continuous queries were not meant to be used for merging measurements. The main use case the developers provide is for
+downsampling the data in one measurement.
\ No newline at end of file