Merge branch 'docs' of gitlab.it-innovation.soton.ac.uk:FLAME/flame-clmc into integration

Commit fb24e8d8 authored 7 years ago by MJB
--- a/docs/Measuring-E2E-MS-Performance.md
+++ b/docs/Measuring-E2E-MS-Performance.md
+<!--
+// © University of Southampton IT Innovation Centre, 2018
+//
+// Copyright in this software belongs to University of Southampton
+// IT Innovation Centre of Gamma House, Enterprise Road, 
+// Chilworth Science Park, Southampton, SO16 7NS, UK.
+//
+// This software may not be used, sold, licensed, transferred, copied
+// or reproduced in whole or in part in any manner or form or in or
+// on any media by any person other than in accordance with the terms
+// of the Licence Agreement supplied with the software, or otherwise
+// without the prior written consent of the copyright owners.
+//
+// This software is distributed WITHOUT ANY WARRANTY, without even the
+// implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
+// PURPOSE, except where stated in the Licence Agreement supplied with
+// the software.
+//
+//      Created By :            Nikolay Stanchev, Simon Crowle
+//      Created Date :          02-05-2018
+//      Created for Project :   FLAME
+-->
+
+# **Flame CLMC - Measuring E2E Media Service Performance**
+
+#### **Authors**
+
+|Authors|Organisation|                    
+|---|---|  
+|[Simon Crowle](mailto:sgc@it-innovation.soton.ac.uk)|[University of Southampton, IT Innovation Centre](http://www.it-innovation.soton.ac.uk)|  
+|[Nikolay Stanchev](mailto:ns17@it-innovation.soton.ac.uk)|[University of Southampton, IT Innovation Centre](http://www.it-innovation.soton.ac.uk)|
+
+### Definitions
+
+Readers of this document are assumed to have at least read the [CLMC information model](clmc-information-model.md). Here we explore the requirements which inform the definition of metrics that determine *'end-to-end'* media service performance. Before continuing, some terms are defined:
+
+| term | definition |
+| --- | --- |
+| *client* | an end-user of a FLAME media service - typically somebody accessing the service via a mobile computing device connected to an _EP router_ |
+| *network node* | a _service function router_ or other hardware that receives and sends network traffic along network connections attached to it |
+| *service function router* | a _service function router_ (SFR) is a VM that allows _clients_ or _endpoints_ to communicate with one another using fully qualified domain names (FQDN), rather than IP addresses |
+| *service function instance* | a _service function instance_ (SFI) is a process that in part or wholly realizes the functionality of a media service |
+| *endpoint* | an endpoint (EP) is a virtual machine (VM) that implements an SFI and is connected to the FLAME network by a _service function router_ |
+| *E2E path* | the directed, acyclic traversal of FLAME network nodes, beginning with a source _EP_ and moving to a target _EP_ via network nodes in the FLAME network |
+| *round trip time* | the total time taken for a service request to i) traverse an _E2E path_, ii) be processed by the media service, iii) be returned as a response via an _E2E path_ |
+
+### **Assumptions**
+
+Here, we list the assumptions we make for measuring and understanding the E2E performance of an EP that implements an SFI:
+
+* Network measurement - the assumption is that we have a measurement for the network path delays between service function routers, called **network_delays**, providing the following information:  
+
+| path_ID (tag) | source_SFR (tag) | target_SFR (tag) | delay | time |
+| --- | --- | --- | --- | --- | 
+| path identifier | source SFR | target SFR | e2e delay for the given path (ms) | timestamp of measurement |
+
+Here, the **path_ID** tag value is the identifier of the path between two service function routers in the network topology obtained from FLIPS. The **source_SFR** tag value is the source service router for the identified path, while the **target_SFR** tag value is the target service router. The delay field value is the network end-to-end delay in milliseconds that a packet would experience when traversing the path between the two SFRs identified in the tag values.
+
+An example row would be:
+
+| path_ID (tag) | source_SFR (tag) | target_SFR (tag) | delay | time |
+| --- | --- | --- | --- | --- |
+| SFR-A---S1---S2---S3---SFR-B | SFR-A | SFR-B | 10 | 1525334761282000 |
+
+This row means that a packet traversing the path from SFR-A through S1, S2, S3 (switches) to SFR-B will experience an average delay of 10 ms.
+
+* Request/Response path - the assumption is that a response will traverse the same network path as the request, but in reverse direction.
+
+* Media service measurement - the assumption is that we have a measurement for media service response time, containing at least the following information:
+
+| sf_instance (tag) | sfr (tag) | endpoint (tag) | response_time | time |
+| --- | --- | --- | --- | --- |
+| media SF instance ID (FQDN) | SFR that connects the SFI endpoint to the FLAME network | SFI EP identifier | response time for the media service (ms) | timestamp of measurement |
+
+Note that all FLAME service function EPs are expected to contain this and other decision context related data in their global tags, see the [CLMC monitoring documentation](monitoring.md) for further information. Above, the **sf_instance**, **sfr** and **endpoint** tag values identify a unique response time measurement. The response time field value is the time elapsed (measured in milliseconds) for a specific SFI/EP implementation only, and it does not take into account any of the network measurements. An example row would be:
+
+| sf_instance (tag) | sfr (tag) | endpoint (tag) | response_time | time |
+| --- | --- | --- | --- | --- |
+| media-service.ict-flame.eu | SFR-B | server1 | 27 | 1525334761282000 |
+
+This row means that the SFI identified as _media-service.ict-flame.eu_, implemented by endpoint _server1_ and connected to the FLAME network through service function router *SFR-B*, has an average response time of 27 ms.
+
+## E2E Model
+
+In the sections that follow we set out some basic properties of a potential media service and then explore these in more detail with a concrete example. Following on from this analysis we provide a test-based approach to the specification of E2E media service performance measures.
+
+### E2E SFC
+
+Let us begin by identifying some simple, generic interactions within a media service function chain (SFC):
+
+```
+// simple chain
+Client --> data storage SFI/EP1
+
+// sequential chain
+Client --> data processor SFI/EP1 --> data storage SFI/EP1
+
+// complex chain
+Client --> data processor SFI_A/EP1 --> data processor SFI_B/EP1
+                               |-> data storage SFI/EP1 <-|
+```
+
+The first example above imagines a client simply requesting that some data be stored in (or retrieved from) a database managed by the SFI responsible for persistence. In the second case, the client requests processing of data held in the data store, the results of which are also stored. Finally, the third case outlines a more complex scenario in which the client requests processing of data, which in turn generates further requests for additional data processing in other SFIs that may also depend on storage I/O functionality. Here, additional data processing by related SFIs could include job scheduling or task decomposition and distribution to worker nodes. An advanced media service, such as a modern computer game, is a useful example of such a service, in which graphics rendering, game state modelling, artificial intelligence and network communications are handled in parallel using varying problem decomposition methods.
+
+### E2E simple chain
+
+Next we will define a very simple network into which we will place a data processing EP and a data storage EP - we assert that clients could connect to any of the _service function routers_ that link these SFI implementations together.
+
+![Simple chain E2E network](image/e2e-simple-chain-network.png)
+
+Our simple network consists of three _service function routers_ (SFRs) that connect clients with SFI data processing and storage functionality; a demand from _client 1_ for the storage function could be routed in one network hop from router 'A' to router 'C', or in two hops via routers 'A' -> 'B' -> 'C'. A demand for the storage function from _client 2_ would include zero network hops.
+
+> __Side note: FLAME network scope__
+> 
+> Readers are reminded that low-level network traffic metrics gathered by the FLAME platform are restricted to observations of network performance between SFRs. The first and last steps (typically between a client and SFI) are not captured at the time of writing - these links are denoted by a dotted line (`--->`) in our diagrams.
+>
+
+### E2E simple chain metrics
+
+A principal metric we use to understand E2E performance is the average end-to-end _delay_: the _mean_ time taken between a request or response being transmitted and received _within the FLAME network_. Scoping the E2E delay to within the FLAME network is an important qualification since it is only within this network that all necessary measurements can reliably be taken.
+
+An out-going simple E2E request chain looks like this:
+
+![Simple E2E request steps](image/e2e-simple-chain-request-steps.png)
+
+the delay associated with the processing of the service request is isolated to within the storage SFI:
+
+![Simple E2E SFI processing](image/e2e-simple-chain-mc-processing.png)
+
+whilst for the response E2E delay, we see this:
+
+![Simple E2E response steps](image/e2e-simple-chain-response-steps.png)
+
+Above we denote the time required for a service function router to handle (or pass on) an in-coming message as _handle request_ or _handle response_. When a message is _handled_ by a service function router there are a number of processes that incur (small amounts of) delay:
+
+* _Processing delay_: the HTTP packet is checked for errors and an optimized route through the network is determined for it
+* _Queuing delay_: the time an HTTP packet waits in a queue whilst other packets ahead of it are transmitted
+* _Transmission delay_: the time taken for the packet bits to be copied out into the transmission medium of the network
+
+The _round trip time_ is the sum of the request, service processing and response delays.
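
As a minimal sketch of this sum (the function name is an illustrative assumption, and the sample values simply reuse the 10 ms path delay and 27 ms response time from the example rows in the Assumptions section, with the response assumed to take as long as the request), the round trip time could be computed as follows:

```python
def round_trip_time(request_delay_ms, service_delay_ms, response_delay_ms):
    """Return the round trip time in ms as the sum of its three parts.

    request_delay_ms  - delay of the request along the E2E path
    service_delay_ms  - processing delay inside the SFI
    response_delay_ms - delay of the response along the E2E path
    """
    return request_delay_ms + service_delay_ms + response_delay_ms


# Illustrative values only: 10 ms each way on the network, 27 ms in the SFI.
print(round_trip_time(10, 27, 10))  # 47
```

Each of the two network terms would itself include the per-router handling delays listed above.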
+
+> __Side note:__
+> To understand _delay_ more robustly, we may also consider the rate at which requests or responses arrive (_arrival rate_) at each node in the network since message management (queuing, for example) will have an effect at scale. Similarly, the _payload size_ of the messages being handled could also be observed since the quantity of data traversing the SFC will also impact delay in similar, large scale scenarios.
+>
+
+### E2E extended chain
+
+Up until this point we have considered an elementary SFC in which there is only one class of SFI. In a more realistic scenario, we would expect a media service function to be composed of multiple SFIs that are distributed and connected to multiple nodes in the FLAME network. Below we have extended the simple chain to include a greater level of complexity with respect to service function chains, whilst holding the network topology constant (adding network SFRs simply introduces additional hops to the problem space at this stage). In addition to indicating extra clients and SFIs, weights have been added to the network arcs to indicate relative network latency between SFRs. __In this case study we also make the assumption that chained service function calls, such as those between one SFI and another, are synchronized and blocking__.
+
+![Extended chain E2E network](image/e2e-extended-chain-network.png)
+
+Imagine a media service that both stores and processes high volumes of complex media streams. Consider as well a distributed population of clients making demands on this service. Successfully handling high demand for this service could mean deploying several EPs that implement its SFIs (storage and processing) across multiple VMs that interoperate and share the demand load. Since clients and EPs are distributed, service function requests (made by both) will likely give rise to propagating waves of activity, load (and delay) from multiple nodes across the FLAME platform. For simplicity, let us assume our multimedia service implements a request by processing some media data from the client and then storing it (returning some result to client). Here is client 1's request as it passes through the FLAME network and its SFIs:
+
+![Extended client 1 path](image/e2e-extended-client1-path.png)
+
+In the figure above the green arcs indicate service request travel whilst the blue denotes the response path. The shortest route directs the request to SFR 'B' and the consequent storage request travels on to SFR 'C'. __Responses return along the path used by the request__. Indicative service response times are provided by numeric values in the active SFI/EP boxes. Let's see the same request from client 2, who has just joined the network:
+
+![Extended client 2 path](image/e2e-extended-client2-path.png)
+
+For this client, the _locality of reference_ for processing and then storing her data is high: both of the associated SFIs are located on VMs attached to the same SFR. We could expect client 2's response time to be low for this reason.
+
+_Now for the sake of example only, let us assume that the hardware running SFI Processor A can only effectively handle one request at any time and that any more than this will result in a substantial degradation in processing performance_.
+
+Client 3 joins the network:
+
+![Extended client 3 path](image/e2e-extended-client3-path.png)
+
+In calculating a service function route that optimizes for the complete _round trip_ delay, we need to take into account the likely delays that are incurred from both network related latencies and also _all_ SFI response times. The orange route illustrated above shows how the gains made by selecting a fast route through the network are offset by penalties in using an EP processor for the SFI that is overloaded; conversely, a slower route that selects an SFI with computational resources to spare resolves to an overall faster round-trip response time.
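
The trade-off can be sketched in a few lines of Python. Everything here is hypothetical: the class, the route names and all the delay values are invented for illustration and are not read from the figures. The sketch also simplifies the chain by folding any network delay between chained SFIs into the route's network totals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateRoute:
    """A hypothetical candidate route with its expected delays in ms."""
    name: str
    network_forward_ms: float
    network_reverse_ms: float
    sfi_response_times_ms: tuple  # one entry per SFI visited on the route

    def round_trip_ms(self):
        # Round trip = network delay out + all SFI response times + network delay back.
        return (self.network_forward_ms
                + sum(self.sfi_response_times_ms)
                + self.network_reverse_ms)


def best_route(routes):
    """Pick the route with the lowest expected round trip delay."""
    return min(routes, key=lambda route: route.round_trip_ms())


# Invented numbers: the first route is quick on the network but uses an
# overloaded processor; the second is slower on the network but its processor is idle.
routes = [
    CandidateRoute("fast network, overloaded processor", 5, 5, (80, 10)),
    CandidateRoute("slower network, idle processor", 12, 12, (20, 10)),
]

for route in routes:
    print(route.name, route.round_trip_ms())
print("selected:", best_route(routes).name)
```

With these made-up values the slower network route wins, because the saving in SFI response time outweighs the extra network delay.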
+
+## E2E Measurement
+
+Our aim is to aggregate network measurement points with media service measurement points to obtain a third measurement from which we can easily understand both end-to-end and round-trip performance of a media service. This is achieved by using a CLMC E2E monitoring process that aggregates data from network and media service measurements within a given sample period, e.g. every 10 seconds. This process then posts the aggregated data back to Influx in a new measurement.
+
+### **Goal**
+
+The ultimate goal is to populate a new measurement, called **e2e_delays**, which will be provided with the following information:
+
+| path_ID (tag) | source_SFR (tag) | target_SFR (tag) | sf_instance (tag) | endpoint (tag) | delay_forward | delay_reverse | delay_service | time |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+
+* *path_ID* - tag ID used to identify the network path (bidirectional path identifier)
+* *source_SFR* - tag used to identify the source service function router (the start of the network path)
+* *target_SFR* - tag used to identify the target service function router (the end of the network path)
+* *sf_instance* - tag used to identify the SFI
+* *endpoint* - tag used to identify the EP implementing the SFI
+* *delay_forward* - network delay for the path in forward direction
+* *delay_reverse* - network delay for path in reverse direction
+* *delay_service* - media service component response time
+
+Then we can easily query on this measurement to obtain different performance indicators, such as end-to-end overall delays, round-trip response time or any of the contributing parts in those performance indicators.
+
+### Monitoring network delays
+
+Here, we describe the process of obtaining network delays between two service function routers in the network topology. CLMC retrieves network path delays between any two SFRs, see below (**SFR** denotes a service function router, **S** denotes a switch):
+
+![network_graph](image/network_graph.png)
+
+SFR monitoring provides us with FIDs at each service function router, which are bidirectional path IDs. From those, we derive the desired SFR-SFR network latencies. For instance, if we take the network graph example and analyse service function router **SFR3**, we would get 2 FIDs for this router - one for the path to reach SFR2 and one for the path to reach SFR1.
+
+We assume that the FID for reaching *SFR1* from *SFR3* tells us the path goes through nodes *S3* and *S6*.
+
+![network-SFR3-SFR1](image/network-SFR3-SFR1.png)
+
+Hence, we accumulate the individual link delays to derive the full SFR-SFR delay for both forward and reverse direction.
+
+delay_forward = SFR3-S3 + S3-S6 + S6-SFR1 = 12 + 3 + 3 = 18
+delay_reverse = SFR1-S6 + S6-S3 + S3-SFR3 = 1 + 5 + 10 = 16
+
+Now, we assume that the FID for reaching *SFR2* from *SFR3* tells us the path goes through nodes *S4* and *S2*.
+
+![network-SFR3-SFR2](image/network-SFR3-SFR2.png)
+
+Hence, we accumulate the individual link delays to derive the full SFR-SFR delay for both forward and reverse direction.
+
+delay_forward = SFR3-S4 + S4-S2 + S2-SFR2 = 12 + 4 + 5 = 21
+delay_reverse = SFR2-S2 + S2-S4 + S4-SFR3 = 8 + 2 + 11 = 21
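
The accumulation used in both worked examples can be sketched as follows. The link delay values are the ones quoted above; the dictionary layout and the function names are assumptions made for this sketch and do not describe how the CLMC stores link data.

```python
# Directed link delays in ms, copied from the two worked examples above.
LINK_DELAYS_MS = {
    ("SFR3", "S3"): 12, ("S3", "S6"): 3, ("S6", "SFR1"): 3,
    ("SFR1", "S6"): 1, ("S6", "S3"): 5, ("S3", "SFR3"): 10,
    ("SFR3", "S4"): 12, ("S4", "S2"): 4, ("S2", "SFR2"): 5,
    ("SFR2", "S2"): 8, ("S2", "S4"): 2, ("S4", "SFR3"): 11,
}


def path_delay(nodes, link_delays=LINK_DELAYS_MS):
    """Sum the directed link delays along consecutive pairs of nodes."""
    return sum(link_delays[(a, b)] for a, b in zip(nodes, nodes[1:]))


def sfr_to_sfr_delays(nodes):
    """Return (delay_forward, delay_reverse) for a path given as a node list."""
    return path_delay(nodes), path_delay(nodes[::-1])


print(sfr_to_sfr_delays(["SFR3", "S3", "S6", "SFR1"]))  # (18, 16)
print(sfr_to_sfr_delays(["SFR3", "S4", "S2", "SFR2"]))  # (21, 21)
```

Keeping the delays keyed by directed link makes the asymmetry between the forward and reverse directions explicit.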
+
+Overall, from this analysis, the following data will be reported to Influx in the **network_delays** measurement:
+
+| path_ID (tag) | source_SFR (tag) | target_SFR (tag) | delay | time |
+| --- | --- | --- | --- | --- |  
+| SFR3-SFR1 | SFR3 | SFR1 | 18 | 1525334761282000 |
+| SFR3-SFR1 | SFR1 | SFR3 | 16 | 1525334761282000 |
+| SFR3-SFR2 | SFR3 | SFR2 | 21 | 1525334761282000 |
+| SFR3-SFR2 | SFR2 | SFR3 | 21 | 1525334761282000 |
+
+### Monitoring SFI/EP response times
+
+Readers of the [CLMC information model](clmc-information-model.md) will already be aware of the approach to identifying and reporting SFI performance metrics in the FLAME project. The global measurement tags that help in a decision context are used in this case to provide the mapping between network measurements and a specific service response time. Specifically, we use the SFR tag encapsulated in the media service global tags to cross-reference against target SFR tags (described above).
+
+In its simplest case, a media SFI's response time could be defined as a single value that derives from the (average) time spent processing requests in local memory and/or on disk. Indeed, a number of the FLAME foundation media service metrics sent to the CLMC could be described as such. In more advanced cases (such as for clients 1 and 3 in our example above) the full service function chain is implemented across more than one endpoint. Here we have at least two options:
+
+1. Let the first SFI in an SFC be representative of the entire service function delay (making opaque the sub-calls to other SFIs required to fulfil the client's request)
+
+2. Construct a more complex view of the service function response time as an aggregate of internal SFI processing delays and their related, dependent network delays
+
+#### TO BE DISCUSSED FURTHER
+
+> __NOTE: ABOVE TO BE VALIDATED__
+> __***************************__
+>
+
+### E2E Aggregation process
+
+The aggregation process provides similar functionality to that of an Influx continuous query. During each sample period the process collects and averages the network and service delay data for that period, for example the last 10 seconds. The executed queries are:
+
+* Network delays query - to obtain the network delay values and group them by their **path_ID**, **source_SFR** and **target_SFR** identifiers:
+
+```
+SELECT mean(delay) as "net_delay" FROM "E2EMetrics"."autogen"."network_delays" WHERE time >= now() - 10s and time < now() GROUP BY path_ID, source_SFR, target_SFR
+``` 
+
+* Media service response time query - to obtain the response time values of the media service instances and group them by **endpoint**, **sf_instance** and **sfr** identifiers: 
+```
+SELECT mean(response_time) as "response_time" FROM "E2EMetrics"."autogen"."service_delays" WHERE time >= now() - 10s and time < now() GROUP BY endpoint, sf_instance, sfr
+```
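
Since the 10 second window is only an example, the two query strings could be generated for any sample period. The sketch below is an assumption about how that might be done: the function names and the parameter are hypothetical, and only the query text itself is taken from above.

```python
def _check_period(sample_period_s):
    """Reject sample periods that are not positive whole seconds."""
    if not isinstance(sample_period_s, int) or sample_period_s <= 0:
        raise ValueError("sample period must be a positive number of seconds")


def network_delays_query(sample_period_s=10):
    """Build the network delays query for the given sample period."""
    _check_period(sample_period_s)
    return (
        'SELECT mean(delay) as "net_delay" '
        'FROM "E2EMetrics"."autogen"."network_delays" '
        f"WHERE time >= now() - {sample_period_s}s and time < now() "
        "GROUP BY path_ID, source_SFR, target_SFR"
    )


def service_delays_query(sample_period_s=10):
    """Build the media service response time query for the given sample period."""
    _check_period(sample_period_s)
    return (
        'SELECT mean(response_time) as "response_time" '
        'FROM "E2EMetrics"."autogen"."service_delays" '
        f"WHERE time >= now() - {sample_period_s}s and time < now() "
        "GROUP BY endpoint, sf_instance, sfr"
    )


print(network_delays_query())
print(service_delays_query(30))
```

Called with the default period, both functions reproduce the queries shown above.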
+
+The results of the queries are then matched against each other on the **target_SFR** and **sfr** tag values (for *network_delays* and *service_delays* respectively): 
+on every match of the **sfr** tag of the **service_delays** measurement with the **target_SFR** tag of the **network_delays** measurement, the rows are combined 
+to obtain an **e2e_delays** measurement row, which is posted back to Influx. The reverse delay is taken from the **network_delays** row that has the same **path_ID** with the source and target SFRs swapped, as the example below illustrates.
+
+Example:
+
+Let's assume we have these results from the two queries:
+
+* Result from first query
+
+```
+name: network_delays
+tags: path_ID=SFR-A---SFR-B, source_SFR=SFR-A, target_SFR=SFR-B
+time                net_delay
+----                ---------
+1524833145975682287 9.2
+
+name: network_delays
+tags: path_ID=SFR-A---SFR-B, source_SFR=SFR-B, target_SFR=SFR-A
+time                net_delay
+----                ---------
+1524833145975682287 10.3
+```
+  
+* Result from second query
+
+```
+name: service_delays
+tags: endpoint=server1, sfr=SFR-B, sf_instance=ms-A.ict-flame.eu
+time                response_time
+----                -------------
+1524833145975682287 11
+```
+
+The E2E aggregation process will merge those rows, because there is a match on network delay target SFR and service delay SFR - namely **SFR-B**.
+
+| path_ID (tag) | source_SFR (tag) | target_SFR (tag) | endpoint (tag) | sf_instance (tag) | delay_forward | delay_reverse | delay_service | time |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| SFR-A---SFR-B | SFR-A | SFR-B | server1 | ms-A.ict-flame.eu | 9.2 | 10.3 | 11 | 1524833145975682287 |
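
A minimal sketch of this matching step is given below, using the example query results as input. It is not the CLMC implementation: the function name and the dictionary-based row layout are assumptions. It also assumes that the reverse delay comes from the row with the same path_ID and swapped SFRs, and that the merged row reuses the timestamp of the forward network row.

```python
def aggregate_e2e(network_rows, service_rows):
    """Combine network and service delay rows into e2e_delays rows."""
    # Index the network rows by (path_ID, source_SFR, target_SFR).
    net = {(r["path_ID"], r["source_SFR"], r["target_SFR"]): r for r in network_rows}
    merged = []
    for svc in service_rows:
        for (path_id, source, target), forward in net.items():
            if target != svc["sfr"]:
                continue  # this path does not end at the SFR serving the endpoint
            # Assumption: the reverse delay is the same path with SFRs swapped.
            reverse = net.get((path_id, target, source))
            if reverse is None:
                continue  # no reverse measurement in this sample period
            merged.append({
                "path_ID": path_id,
                "source_SFR": source,
                "target_SFR": target,
                "endpoint": svc["endpoint"],
                "sf_instance": svc["sf_instance"],
                "delay_forward": forward["net_delay"],
                "delay_reverse": reverse["net_delay"],
                "delay_service": svc["response_time"],
                "time": forward["time"],  # assumption: reuse the forward row's timestamp
            })
    return merged


network_rows = [
    {"path_ID": "SFR-A---SFR-B", "source_SFR": "SFR-A", "target_SFR": "SFR-B",
     "net_delay": 9.2, "time": 1524833145975682287},
    {"path_ID": "SFR-A---SFR-B", "source_SFR": "SFR-B", "target_SFR": "SFR-A",
     "net_delay": 10.3, "time": 1524833145975682287},
]
service_rows = [
    {"endpoint": "server1", "sfr": "SFR-B", "sf_instance": "ms-A.ict-flame.eu",
     "response_time": 11, "time": 1524833145975682287},
]

e2e_rows = aggregate_e2e(network_rows, service_rows)
for row in e2e_rows:
    print(row)
```

For the example input this yields a single row with the same values as the table above.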
+
+The resulting row would then be posted back to Influx in the **e2e_delays** measurement.
\ No newline at end of file
--- a/docs/figures/e2eFigures.graphml
+++ b/docs/figures/e2eFigures.graphml
--- a/docs/image/e2e-extended-chain-network.png
+++ b/docs/image/e2e-extended-chain-network.png
--- a/docs/image/e2e-extended-client1-path.png
+++ b/docs/image/e2e-extended-client1-path.png
--- a/docs/image/e2e-extended-client2-path.png
+++ b/docs/image/e2e-extended-client2-path.png
--- a/docs/image/e2e-extended-client3-path.png
+++ b/docs/image/e2e-extended-client3-path.png
--- a/docs/image/e2e-simple-chain-mc-processing.png
+++ b/docs/image/e2e-simple-chain-mc-processing.png
--- a/docs/image/e2e-simple-chain-network.png
+++ b/docs/image/e2e-simple-chain-network.png
--- a/docs/image/e2e-simple-chain-request-steps.png
+++ b/docs/image/e2e-simple-chain-request-steps.png
--- a/docs/image/e2e-simple-chain-response-steps.png
+++ b/docs/image/e2e-simple-chain-response-steps.png
--- a/docs/image/network-SFR3-SFR1.png
+++ b/docs/image/network-SFR3-SFR1.png
--- a/docs/image/network-SFR3-SFR2.png
+++ b/docs/image/network-SFR3-SFR2.png
--- a/docs/image/network_graph.png
+++ b/docs/image/network_graph.png