Updated docs

7f10e358 · Nikolay Stanchev · c68a16cd · 7f10e358
Commit 7f10e358 authored May 2, 2018 by Nikolay Stanchev
--- a/docs/Measuring-E2E-MS-Performance.md
+++ b/docs/Measuring-E2E-MS-Performance.md
+<!--
+// © University of Southampton IT Innovation Centre, 2017
+//
+// Copyright in this software belongs to University of Southampton
+// IT Innovation Centre of Gamma House, Enterprise Road, 
+// Chilworth Science Park, Southampton, SO16 7NS, UK.
+//
+// This software may not be used, sold, licensed, transferred, copied
+// or reproduced in whole or in part in any manner or form or in or
+// on any media by any person other than in accordance with the terms
+// of the Licence Agreement supplied with the software, or otherwise
+// without the prior written consent of the copyright owners.
+//
+// This software is distributed WITHOUT ANY WARRANTY, without even the
+// implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
+// PURPOSE, except where stated in the Licence Agreement supplied with
+// the software.
+//
+//      Created By :            Nikolay Stanchev, Simon Crowle
+//      Created Date :          02-05-2018
+//      Created for Project :   FLAME
+-->
+
+# **Flame CLMC - Measuring E2E Media Service Performance**
+
+#### **Authors**
+
+|Authors|Organisation|                    
+|---|---|  
+|[Simon Crowle](mailto:sgc@it-innovation.soton.ac.uk)|[University of Southampton, IT Innovation Centre](http://www.it-innovation.soton.ac.uk)|  
+|[Nikolay Stanchev](mailto:ns17@it-innovation.soton.ac.uk)|[University of Southampton, IT Innovation Centre](http://www.it-innovation.soton.ac.uk)|
+
+
+## E2E Model
+
+Readers of this document are assumed to have at least read the [CLMC information model](clmc-information-model.md). Here we explore the requirements which inform the definition of metrics that determine *'end-to-end'* media service performance. Before continuing, some terms are defined:
+
+| term | definition |
+| --- | --- |
+| *client* | an end-user of a FLAME media service - typically somebody accessing the service via a mobile computing device connected to an _EP router_ |
+| *endpoint* | an endpoint (EP) is a virtual machine (VM) connected to the FLAME network |
+| *service router* | an EP that allows other EPs to communicate with one another using fully qualified domain names (FQDN), rather than IP addresses |
+| *network node* | an _EP_, _service router_ or other hardware that receives and sends network traffic along network connections attached to it |
+| *media component* | a media component (MC) is a process that in part or wholly realizes the functionality of a media service |
+| *E2E path* | the directed, acyclic traversal of FLAME network nodes, beginning with a source _EP_ and moving to a target _EP_ via network nodes in the FLAME network |
+| *round trip time* | the total time taken for a service request to i) traverse an _E2E path_, ii) be processed at the _MC_, iii) be returned as a response via an _E2E path_ |
+
+In the sections that follow we set out some basic properties of a potential media service and then explore these in more detail with a concrete example. Following on from this analysis we provide a test-based approach to the specification of E2E media service performance measures.
+
+### E2E SFC
+
+Let us begin by identifying some simple, generic interactions within a media service function chain (SFC):
+
+```
+// simple chain
+Client --> data storage MC
+
+// sequential chain
+Client --> data processor MC --> data storage MC
+
+// complex chain
+Client --> data processor MC_A --> data processor MC_B
+                               |-> data storage MC <-|
+```
+
+The first example above imagines a client simply requesting that some data be stored in (or retrieved from) a database managed by the MC responsible for persistence. In the second case, the client requests some processing of data held in the data store, the results of which are also stored. Finally, the third case outlines a more complex scenario in which the client requests some processing of data, which in turn generates further requests for additional data processing in other MCs that may also depend on storage I/O functionality. Here, additional data processing by related MCs could include job scheduling or task decomposition and distribution to worker nodes. An advanced media service, such as a modern computer game, is a useful example of such a service, in which graphics rendering, game state modelling, artificial intelligence and network communications are handled in parallel using varying problem decomposition methods.
+
+### E2E simple chain
+
+Next we will define a very simple network into which we will place a data processing EP and a data storage EP - we assert that clients could connect to any of the _service routers_ that link these MCs together.
+
+![Simple E2E network](image/e2e-simple-chain-network.png)
+
+Our simple network consists of three _service routers_ that connect clients with MC data processing and storage functionality; each demand from _client 1_ for the storage function could be routed in one network hop from router 'A' to router 'C', or in two hops via routers 'A' -> 'B' -> 'C'. A demand for the storage function from _client 2_ would include zero network hops.
+
+### E2E simple chain metrics
+
+A principal metric we use to understand E2E performance is mean end-to-end _delay_: the average time taken between a request or response being transmitted and received _within the FLAME network_. Scoping the E2E delay to within the FLAME network is an important qualification since it is only within this network that all necessary measurements can reliably be taken.
+
+An out-going simple E2E request chain looks like this:
+
+![Simple E2E request steps](image/e2e-simple-chain-request-steps.png)
+
+the delay associated with the processing of the service request is isolated to within the storage MC:
+
+![Simple E2E MC processing](image/e2e-simple-chain-mc-processing.png)
+
+whilst for the response E2E delay, we see this:
+
+![Simple E2E response steps](image/e2e-simple-chain-response-steps.png)
+
+Above we denote the time required for a service router to handle (or pass on) an in-coming message as _handle request_ or _handle response_. When a message is first encountered by a service router, an optimized path through the FLAME network must also be determined; this is labelled above as _route specification_. The _round trip time_ is the sum of the request, service processing and response delays.
+
+> __Side note:__
+> To understand _delay_ more robustly, we may also consider the rate at which requests or responses arrive (_arrival rate_) at each node in the network since message management (queuing, for example) will have an effect at scale. Similarly, the _payload size_ of the messages being handled could also be observed since the quantity of data traversing the SFC will also impact delay in similar, large scale scenarios.
+>
+
+## E2E Measurement 
+
+### **Idea** 
+
+The idea is to aggregate platform measurement points with media service measurement points to obtain a third measurement from which we can easily
+understand both end-to-end and round-trip performance of a media service. This is achieved by a Python script that runs in the background and aggregates
+the data from both measurements over a given sample period, e.g. every 10 seconds. The script then posts the aggregated data back to InfluxDB as a new measurement.
+
+
+### **Assumptions**
+
+* Network measurement - assumption is that we have a measurement for the network link delays, called **network_delays**, providing the following information:  
+
+| path (tag) | delay | time |
+| --- | --- | --- |
+| path identifier | e2e delay for the given path | time of measurement |
+
+Here, the **path** tag value is the identifier of the path between two nodes in the network topology obtained from FLIPS. The assumption is that those identifiers
+will be structured in such a way that we can obtain the source and target endpoint IDs from the path identifier itself. For example:  
+ **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu**  
+We can easily split the string on **'---'** and, thus, find the source endpoint is **endpoint1.ms-A.ict-flame.eu**, while the target endpoint is 
+**endpoint2.ms-A.ict-flame.eu**.  
+The delay field value is the network end-to-end delay in milliseconds for the path identified in the tag value.
+
+* A response will traverse the same network path as the request, but in reverse direction.
+
+* Media service measurement - assumption is that we have a measurement for media services' response time, called **service_delays**, providing the following information:
+
+| FQDN (tag) | sf_instance (tag) | endpoint (tag) | response_time | time |
+| --- | --- | --- | --- | --- |
+| media service FQDN | ID of the service function instance | endpoint identifier | response time for the media service (s) | time of measurement |
+
+Here, the **FQDN**, **sf_instance** and **endpoint** tag values identify a unique response time measurement. The response time field value is the 
+response time (measured in seconds) for the media service only, and it does not take into account any of the network measurements.
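
The path identifier convention assumed above can be sketched as a small parsing helper. This is only an illustration: the `---` separator is taken from the example identifier, and the names `PathEndpoints` and `parse_path_id` are hypothetical, not part of the CLMC code base.

```python
from typing import NamedTuple

# Separator assumed from the example identifier
# "endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu".
PATH_SEPARATOR = "---"


class PathEndpoints(NamedTuple):
    """Source and target endpoint IDs extracted from a path identifier."""
    source: str
    target: str


def parse_path_id(path_id: str) -> PathEndpoints:
    """Split a '<source>---<target>' path identifier into its two endpoint IDs."""
    parts = path_id.split(PATH_SEPARATOR)
    # Reject identifiers that do not contain exactly one non-empty source and target.
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"unexpected path identifier: {path_id!r}")
    return PathEndpoints(source=parts[0], target=parts[1])


endpoints = parse_path_id("endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu")
print(endpoints.source, endpoints.target)
```

Rejecting malformed identifiers early keeps a badly structured path from silently producing a wrong endpoint match later in the aggregation.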
+
+
+### **Goal**
+
+The ultimate goal is to populate a new measurement, called **e2e_delays**, which will be provided with the following information:
+
+| pathID_F (tag) | pathID_R (tag) | FQDN (tag) | sf_instance (tag) | D_path_F | D_path_R | D_service | time |
+| --- | --- | --- | --- | --- | --- | --- | --- | 
+
+* *pathID_F* - tag used to identify the path in forward direction, e.g. **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu**
+* *pathID_R* - tag used to identify the path in reverse direction, e.g. **endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu**
+* *FQDN* - tag used to identify the media service
+* *sf_instance* - tag used to identify the service function instance
+* *D_path_F* - network delay for path in forward direction
+* *D_path_R* - network delay for path in reverse direction
+* *D_service* - media service response time
+
+Then we can easily query on this measurement to obtain different performance indicators, such as end-to-end overall delays, 
+round-trip response time or any of the contributing parts in those performance indicators. 
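
As a rough sketch of how such indicators might be derived from a single **e2e_delays** row, the functions below sum the three contributing parts. They assume that the network delays are stored in milliseconds and the service response time in seconds, as stated in the assumptions; if the script converts units before writing the row, the conversion here should be dropped. The function names are hypothetical.

```python
def round_trip_ms(d_path_f_ms: float, d_path_r_ms: float, d_service_s: float) -> float:
    """Round-trip time in milliseconds for one e2e_delays row.

    Assumption: D_path_F and D_path_R are in milliseconds, D_service in seconds.
    """
    # Convert the service response time to milliseconds before summing.
    return d_path_f_ms + d_service_s * 1000.0 + d_path_r_ms


def network_share(d_path_f_ms: float, d_path_r_ms: float, d_service_s: float) -> float:
    """Fraction of the round-trip time spent on the network (0.0 to 1.0)."""
    total = round_trip_ms(d_path_f_ms, d_path_r_ms, d_service_s)
    return (d_path_f_ms + d_path_r_ms) / total if total else 0.0


# Values in the shape of an e2e_delays row: D_path_F, D_path_R, D_service.
print(round_trip_ms(9.2, 10.3, 11))
print(network_share(9.2, 10.3, 11))
```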
+
+
+### **Aggregation script**
+
+The aggregation script behaves much like a continuous query. Given a sample report period, e.g. 10s,
+the script runs once every 10 seconds and queries the averaged data for the last 10 seconds. The executed queries are:  
+
+* Network delays query - to obtain the network delay values and group them by their **path** identifier:
+```
+SELECT mean(delay) as "Dnet" FROM "E2EMetrics"."autogen".network_delays WHERE time >= now() - 10s and time < now() GROUP BY path
+``` 
+
+* Media service response time query - to obtain the response time values of the media service instances and group them by **FQDN**, **sf_instance** and **endpoint** identifiers: 
+```
+SELECT mean(response_time) as "Dresponse" FROM "E2EMetrics"."autogen".service_delays WHERE time >= now() - 10s and time < now() GROUP BY FQDN, sf_instance, endpoint
+```
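
The two queries differ only in the sample period, so they could be generated from it. The following is a minimal sketch, assuming the database and retention policy names shown in the queries above; `build_queries` is a hypothetical helper, not the script's actual interface.

```python
from typing import Tuple


def build_queries(period_s: int,
                  database: str = "E2EMetrics",
                  retention_policy: str = "autogen") -> Tuple[str, str]:
    """Return the (network, service) InfluxQL queries for a sample period in seconds."""
    if not isinstance(period_s, int) or period_s <= 0:
        raise ValueError("period_s must be a positive integer number of seconds")

    # Both queries average over the same trailing window.
    window = f"time >= now() - {period_s}s and time < now()"
    source = f'"{database}"."{retention_policy}"'

    network_query = (
        f'SELECT mean(delay) as "Dnet" FROM {source}.network_delays '
        f"WHERE {window} GROUP BY path"
    )
    service_query = (
        f'SELECT mean(response_time) as "Dresponse" FROM {source}.service_delays '
        f"WHERE {window} GROUP BY FQDN, sf_instance, endpoint"
    )
    return network_query, service_query


network_query, service_query = build_queries(10)
print(network_query)
print(service_query)
```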
+
+The results of the queries are then matched against each other on endpoint ID: on every match of the **endpoint** tag of the **service_delays** measurement with
+the target endpoint ID of the **network_delays** measurement, the rows are combined to obtain an **e2e_delays** measurement row, which is posted back to InfluxDB.
+
+Example:
+
+* Result from first query:
+
+```
+name: network_delays
+tags: path=endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu
+time                Dnet
+----                ----
+1524833145975682287 9.2
+
+name: network_delays
+tags: path=endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu
+time                Dnet
+----                ----
+1524833145975682287 10.3
+```
+  
+* Result from second query:
+
+```
+name: service_delays
+tags: FQDN=ms-A.ict-flame.eu, endpoint=endpoint2.ms-A.ict-flame.eu, sf_instance=test-sf-clmc-agent-build_INSTANCE
+time                Dresponse
+----                ---------
+1524833145975682287 11
+```
+
+
+The script will parse the path identifier **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu** and find that the target endpoint is
+**endpoint2.ms-A.ict-flame.eu**. Then the script checks if there is a service delay measurement row matching this endpoint. Since there is one,
+those values will be merged, so the result will be a row like this:
+
+| pathID_F (tag) | pathID_R (tag) | FQDN (tag) | sf_instance (tag) | D_path_F | D_path_R | D_service | time |
+| --- | --- | --- | --- | --- | --- | --- | --- | 
+| endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu | endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu | ms-A.ict-flame.eu | test-sf-clmc-agent-build_INSTANCE | 9.2 | 10.3 | 11 | 1524833145975682287 | 
+  
+Here, another assumption is made that we can reverse the path identifier of a network delay row and that the reverse path delay would also 
+be reported in the **network_delays** measurement. 
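
The matching step described above might look roughly like the sketch below. It is not the actual aggregation script: the input shapes (a dict of mean delays keyed by path, and a list of dicts for the service rows) and the names `reverse_path` and `merge_delays` are assumptions made for illustration. Values are copied as reported, without unit conversion.

```python
from typing import Dict, List

PATH_SEPARATOR = "---"  # assumed separator, taken from the example path IDs


def reverse_path(path_id: str) -> str:
    """Swap source and target in a '<source>---<target>' identifier."""
    source, target = path_id.split(PATH_SEPARATOR)
    return f"{target}{PATH_SEPARATOR}{source}"


def merge_delays(network: Dict[str, float], services: List[dict], time_ns: int) -> List[dict]:
    """Combine network and service delay rows into e2e_delays rows."""
    rows = []
    for path_f, d_path_f in network.items():
        target = path_f.split(PATH_SEPARATOR)[1]
        path_r = reverse_path(path_f)
        if path_r not in network:
            continue  # reverse delay not reported, so no complete row can be built
        for service in services:
            if service["endpoint"] != target:
                continue  # service is not hosted on this path's target endpoint
            rows.append({
                "pathID_F": path_f,
                "pathID_R": path_r,
                "FQDN": service["FQDN"],
                "sf_instance": service["sf_instance"],
                "D_path_F": d_path_f,
                "D_path_R": network[path_r],
                "D_service": service["Dresponse"],
                "time": time_ns,
            })
    return rows


network = {
    "endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu": 9.2,
    "endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu": 10.3,
}
services = [{
    "FQDN": "ms-A.ict-flame.eu",
    "endpoint": "endpoint2.ms-A.ict-flame.eu",
    "sf_instance": "test-sf-clmc-agent-build_INSTANCE",
    "Dresponse": 11,
}]
for row in merge_delays(network, services, 1524833145975682287):
    print(row)
```

Skipping a path whose reverse delay is missing is one possible policy; writing a partial row instead would be equally defensible, depending on how the measurement is consumed.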
+
+The resulting row would then be posted back to InfluxDB in the **e2e_delays** measurement.
\ No newline at end of file