Commit a040d1c4 authored by Nikolay Stanchev

Issue #67 - added documentation for aggregation

parent 3bf3e831
<!--
// © University of Southampton IT Innovation Centre, 2017
//
// Copyright in this software belongs to University of Southampton
// IT Innovation Centre of Gamma House, Enterprise Road,
// Chilworth Science Park, Southampton, SO16 7NS, UK.
//
// This software may not be used, sold, licensed, transferred, copied
// or reproduced in whole or in part in any manner or form or in or
// on any media by any person other than in accordance with the terms
// of the Licence Agreement supplied with the software, or otherwise
// without the prior written consent of the copyright owners.
//
// This software is distributed WITHOUT ANY WARRANTY, without even the
// implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
// PURPOSE, except where stated in the Licence Agreement supplied with
// the software.
//
// Created By : Nikolay Stanchev
// Created Date : 27-04-2018
// Created for Project : FLAME
-->
## **FLAME CLMC - Network and Media Service measurements aggregation**
### **Idea**
The idea is to aggregate platform measurement points with media service measurement points and obtain a third measurement from which we can easily
understand both the end-to-end and the round-trip performance of a media service. This is achieved by a Python script that runs in the background and aggregates
the data from both measurements over a given sample period, e.g. every 10 seconds. The script then posts the aggregated data back to Influx in a new measurement.
### **Assumptions**
* Network measurement - we assume there is a measurement for the network link delays, called **network_delays**, providing the following information:
| path (tag) | delay | time |
| --- | --- | --- |
| path identifier | e2e delay for the given path (ms) | time of measurement |
Here, the **path** tag value is the identifier of the path between two nodes in the network topology obtained from FLIPS. The assumption is that those identifiers
will be structured in such a way that we can obtain the source and target endpoint IDs from the path identifier itself. For example:
**endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu**
We can easily split the string on **'---'** and, thus, find the source endpoint is **endpoint1.ms-A.ict-flame.eu**, while the target endpoint is
**endpoint2.ms-A.ict-flame.eu**.
The delay field value is the network end-to-end delay in milliseconds for the path identified in the tag value.
* Media service measurement - we assume there is a measurement for the media services' response time, called **service_delays**, providing the following information:
| FQDN (tag) | sf_instance (tag) | endpoint (tag) | response_time | time |
| --- | --- | --- | --- | --- |
| media service FQDN | ID of the service function instance | endpoint identifier | response time for the media service (s) | time of measurement |
Here, the **FQDN**, **sf_instance** and **endpoint** tag values identify a unique response time measurement. The response time field value is the
response time (measured in seconds) for the media service only, and it does not take into account any of the network measurements.
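The path identifier convention assumed for **network_delays** can be sketched in Python as follows. This is only an illustration: the helper names `split_path_id` and `reverse_path_id` are hypothetical, and the `---` separator is taken from the example identifier above rather than from any confirmed FLIPS format.

```python
# Separator between source and target endpoint IDs, as in the example identifier.
# This is an assumption about how FLIPS path identifiers will be structured.
PATH_SEPARATOR = "---"


def split_path_id(path_id):
    """Return the (source, target) endpoint IDs encoded in a path identifier."""
    source, separator, target = path_id.partition(PATH_SEPARATOR)
    if not separator or not source or not target:
        raise ValueError("unexpected path identifier: {0!r}".format(path_id))
    return source, target


def reverse_path_id(path_id):
    """Return the identifier of the same path in the opposite direction."""
    source, target = split_path_id(path_id)
    return PATH_SEPARATOR.join((target, source))
```

For instance, `split_path_id("endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu")` yields the source and target endpoint IDs as a tuple.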
### **Goal**
The ultimate goal is to populate a new measurement, called **e2e_delays**, which will hold the following information:
| pathID_F (tag) | pathID_R (tag) | FQDN (tag) | sf_instance (tag) | D_path_F | D_path_R | D_service | time |
| --- | --- | --- | --- | --- | --- | --- | --- |
* *pathID_F* - tag used to identify the path in forward direction, e.g. **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu**
* *pathID_R* - tag used to identify the path in reverse direction, e.g. **endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu**
* *FQDN* - tag used to identify the media service
* *sf_instance* - tag used to identify the service function instance
* *D_path_F* - network delay for path in forward direction
* *D_path_R* - network delay for path in reverse direction
* *D_service* - media service response time
Then we can easily query on this measurement to obtain different performance indicators, such as end-to-end overall delays,
round-trip response time or any of the contributing parts in those performance indicators.
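As a rough illustration of one such indicator, the sketch below derives a round-trip figure from a single **e2e_delays** row. The document does not define the formula, so the sum of forward delay, reverse delay and service response time is an assumption, as is the function name `round_trip_ms`. The unit conversion follows the assumptions above (network delays in milliseconds, service response time in seconds).

```python
def round_trip_ms(d_path_f_ms, d_path_r_ms, d_service_s):
    """Estimate the round-trip response time in milliseconds.

    Assumption: round trip = forward network delay + reverse network delay
    + media service response time. The network delays are given in
    milliseconds and the service response time in seconds, so the latter
    is converted before summing.
    """
    return d_path_f_ms + d_path_r_ms + d_service_s * 1000.0
```

If the service response time were stored in milliseconds instead, the conversion factor would simply be dropped.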
### **Aggregation script**
The aggregation script works much like a continuous query. Given a sample report period, e.g. 10s,
the script runs once every 10 seconds and queries the averaged data for the last 10 seconds. The executed queries are:
* Network delays query - to obtain the network delay values and group them by their **path** identifier:
```
SELECT mean(delay) as "Dnet" FROM "E2EMetrics"."autogen".network_delays WHERE time >= now() - 10s and time < now() GROUP BY path
```
* Media service response time query - to obtain the response time values of the media service instances and group them by **FQDN**, **sf_instance** and **endpoint** identifiers:
```
SELECT mean(response_time) as "Dresponse" FROM "E2EMetrics"."autogen".service_delays WHERE time >= now() - 10s and time < now() GROUP BY FQDN, sf_instance, endpoint
```
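Since the sample period is configurable, the two queries above could be generated from it. The following is a minimal sketch, not the actual script: the function name `build_queries` and its parameters are hypothetical, while the database and retention policy defaults are copied from the queries above.

```python
def build_queries(period_seconds=10, database="E2EMetrics", retention_policy="autogen"):
    """Build the network and service delay queries for one sample period."""
    # Fully qualified prefix, e.g. "E2EMetrics"."autogen"
    source = '"{0}"."{1}"'.format(database, retention_policy)
    # Restrict both queries to the last sample period
    time_filter = "time >= now() - {0}s and time < now()".format(period_seconds)

    network_query = (
        'SELECT mean(delay) as "Dnet" FROM {0}.network_delays '
        "WHERE {1} GROUP BY path"
    ).format(source, time_filter)

    service_query = (
        'SELECT mean(response_time) as "Dresponse" FROM {0}.service_delays '
        "WHERE {1} GROUP BY FQDN, sf_instance, endpoint"
    ).format(source, time_filter)

    return network_query, service_query
```

Calling `build_queries(10)` reproduces the two query strings shown above; the strings can then be passed to whichever InfluxDB client the script uses.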
The results of the two queries are then matched on endpoint ID: whenever the **endpoint** tag of a **service_delays** row matches
the target endpoint ID parsed from the **path** tag of a **network_delays** row, the rows are combined into an **e2e_delays** measurement row, which is posted back to Influx.
Example:
* Result from first query:
```
name: network_delays
tags: path=endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu
time Dnet
---- ----
1524833145975682287 9.2
name: network_delays
tags: path=endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu
time Dnet
---- ----
1524833145975682287 10.3
```
* Result from second query:
```
name: service_delays
tags: FQDN=ms-A.ict-flame.eu, endpoint=endpoint2.ms-A.ict-flame.eu, sf_instance=test-sf-clmc-agent-build_INSTANCE
time Dresponse
---- ---------
1524833145975682287 11
```
The script parses the path identifier **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu** and finds that the target endpoint is
**endpoint2.ms-A.ict-flame.eu**. It then checks whether there is a service delay measurement row matching this endpoint. Since there is one,
the values are merged, giving a row like this:
| pathID_F (tag) | pathID_R (tag) | FQDN (tag) | sf_instance (tag) | D_path_F | D_path_R | D_service | time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu | endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu | ms-A.ict-flame.eu | test-sf-clmc-agent-build_INSTANCE | 9.2 | 10.3 | 11 | 1524833145975682287 |
Here, a further assumption is that we can reverse the path identifier of a network delay row and that the reverse path delay is also
reported in the **network_delays** measurement.
The resulting row is then posted back to Influx in the **e2e_delays** measurement.
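The matching step described in this section could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual script: the function name `merge_delays` and the input shapes (a dictionary of mean delays keyed by path, and a list of service rows as dictionaries) are hypothetical. The output dictionaries use `measurement`, `tags`, `fields` and `time` keys, the point format accepted by the `influxdb` Python client; using that client is also an assumption.

```python
def merge_delays(network_delays, service_rows, timestamp):
    """Combine network and service delay results into e2e_delays points.

    network_delays: dict mapping a path identifier to its mean delay.
    service_rows: list of dicts with keys FQDN, sf_instance, endpoint, Dresponse.
    timestamp: time value to attach to every generated point.
    """
    # Index the service rows by endpoint so each path can be matched quickly
    services_by_endpoint = {}
    for row in service_rows:
        services_by_endpoint.setdefault(row["endpoint"], []).append(row)

    points = []
    for path_f, delay_f in network_delays.items():
        source, separator, target = path_f.partition("---")
        if not separator:
            continue  # identifier does not follow the assumed structure
        path_r = target + "---" + source
        if path_r not in network_delays:
            continue  # reverse path delay was not reported in this period
        # One point per service function instance running on the target endpoint
        for service in services_by_endpoint.get(target, []):
            points.append({
                "measurement": "e2e_delays",
                "tags": {
                    "pathID_F": path_f,
                    "pathID_R": path_r,
                    "FQDN": service["FQDN"],
                    "sf_instance": service["sf_instance"],
                },
                "fields": {
                    "D_path_F": delay_f,
                    "D_path_R": network_delays[path_r],
                    "D_service": service["Dresponse"],
                },
                "time": timestamp,
            })
    return points
```

With the example results above as input, this produces a single point whose fields are `D_path_F = 9.2`, `D_path_R = 10.3` and `D_service = 11`. The reverse path row yields nothing on its own, because no service row reports **endpoint1.ms-A.ict-flame.eu** as its endpoint.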
### **Reasons why we cannot simply use a continuous query to do the job of the script**
* Influx offers very limited functionality for merging measurements. When doing a **select into** from multiple measurements, e.g.
`SELECT * INTO measurement0 FROM measurement1, measurement2`
Influx will try to merge the data on matching timestamps and tag values (if there are any tags). If the two measurements
differ in tags, then we get rows with missing data.
* In a continuous query, we cannot manipulate the data in any way, which prevents us from choosing which
rows to merge together.
* Continuous queries were not meant to be used for merging measurements. The main use case the developers provide is for
downsampling the data in one measurement.