Refactors section order; streamlines terminology

Adds discussion points on SFI response times

Commit 0a18e2d1 authored 7 years ago by Simon Crowle
--- a/docs/Measuring-E2E-MS-Performance.md
+++ b/docs/Measuring-E2E-MS-Performance.md
@@ -40,13 +40,13 @@ Readers of this document are assumed to have at least read the [CLMC information
 | *endpoint* | an endpoint (EP) is a virtual machine (VM) connected to the FLAME network by a _service function router_ |
 | *service function router* | a SFR is a VM that allows EPs to communicate with one another using fully qualified domain names (FQDN), rather than IP addresses |
 | *network node* | a _service function router_ or other hardware that receives and sends network traffic along network connections attached to it |
-| *media component* | a media component (MC) is a process that in part or wholly realizes the functionality of a media service |
+| *service function instance* | a _service function instance_ (SFI) is a process that in part or wholly realizes the functionality of a media service |
 | *E2E path* | the directed, acyclic traversal of FLAME network nodes, beginning with a source _EP_ and moving to a target _EP_ via network nodes in the FLAME network |
-| *round trip time* | the total time taken for a service request to i) traverse an _E2E path_, ii) be processed at the _MC_, iii) be returned as a response via an _E2E path_
+| *round trip time* | the total time taken for a service request to i) traverse an _E2E path_, ii) be processed by the media service, iii) be returned as a response via an _E2E path_ |

 ### **Assumptions**

-Here, we list the assumptions we make for measuring and understanding E2E performance of media components:
+Here, we list the assumptions we make for measuring and understanding E2E performance of SFIs:

 * Network measurement - the assumption is that we have a measurement for the network path delays between service function routers, called **network_delays**, providing the following information:  

@@ -66,23 +66,19 @@ The semantics of the row is that a packet traversing the path from SFR-A (servic

 * Request/Response path - the assumption is that a response will traverse the same network path as the request, but in reverse direction.

-* Media service measurement - assumption is that we have a measurement for media service components' response time, called **service_delays**, providing the following information:
+* Media service measurement - the assumption is that we have a measurement for media service response time, called **service_delays**, providing the following information:

 | FQDN (tag) | sf_instance (tag) | sfr (tag) | response_time | time |
 | --- | --- | --- | --- | --- |
-| media service FQDN | ID of the service function instance | SFR that connects the MC endpoint to the Flame network | response time for the media service (ms) | timestamp of measurement |
+| media service FQDN | ID of the service function instance | SFR that connects the SFI endpoint to the FLAME network | response time for the media service (ms) | timestamp of measurement |

-Here, the **FQDN**, **sf_instance** and **sfr** tag values identify a unique response time measurement.  
-The response time field value is the response time (measured in milliseconds) for the media service component only, and it does not take into account any of the network measurements.
-
-An example row would be:
+Here, the **FQDN**, **sf_instance** and **sfr** tag values identify a unique response time measurement. The response time field value is the time elapsed (measured in milliseconds) for a media service instance only, and it does not take into account any of the network measurements. An example row would be:

 | FQDN (tag) | sf_instance (tag) | sfr (tag) | response_time | time |
 | --- | --- | --- | --- | --- |
 | ms-A.ict-flame.eu | ms-A-sf_INSTANCE | SFR-B | 27 | 1525334761282000 |

-The semantics of the row is that the response time for a service function instance with ID *ms-A-sf_INSTANCE* serving media service
-*ms-A.ict-flame.eu* and connected to the FLAME network through service function router *SFR-B* will have an averaged response time of 27 ms.
+The semantics of the row is that the service function instance with ID *ms-A-sf_INSTANCE*, serving media service *ms-A.ict-flame.eu* and connected to the FLAME network through service function router *SFR-B*, has an average response time of 27 ms.

 ## E2E Model

@@ -94,29 +90,29 @@ Let us begin by identifying some simple, generic interactions within a media ser

 ```
 // simple chain
-Client --> data storage MC
+Client --> data storage SFI

 // sequential chain
-Client --> data processor MC --> data storage MC
+Client --> data processor SFI --> data storage SFI

 // complex chain
-Client --> data processor MC_A --> data processor MC_B
-                               |-> data storage MC <-|
+Client --> data processor SFI_A --> data processor SFI_B
+                               |-> data storage SFI <-|
 ```

-The first example above imagines a client simply requesting some data be stored in (or retrieved from) a database managed by the MC responsible for persistence. In the second case, the client requests some processing of some data held in the data store, the results of which are also stored. Finally, the third case outlines a more complex scenario in which the client requests some processing of data which in turn generates further requests for additional data processing in other MCs which also may depend on storage I/O functionality. Here additional data processing by related MCs could include job scheduling or task decomposition and distribution to worker nodes. An advanced media service, such as a modern computer game, is a useful example of such a service in which graphics rendering; game state modelling; artificial intelligence and network communications are handled in parallel using varying problem decomposition methods.
+The first example above imagines a client simply requesting some data be stored in (or retrieved from) a database managed by the SFI responsible for persistence. In the second case, the client requests some processing of some data held in the data store, the results of which are also stored. Finally, the third case outlines a more complex scenario in which the client requests some processing of data which in turn generates further requests for additional data processing in other SFIs which also may depend on storage I/O functionality. Here additional data processing by related SFIs could include job scheduling or task decomposition and distribution to worker nodes. An advanced media service, such as a modern computer game, is a useful example of such a service in which graphics rendering, game state modelling, artificial intelligence and network communications are handled in parallel using varying problem decomposition methods.

 ### E2E simple chain

-Next we will define a very simple network into which we will place a data processing EP and a data storage EP - we assert the clients could connect to any of _service function routers_ that link these MC together.
+Next we will define a very simple network into which we will place a data processing EP and a data storage EP - we assert the clients could connect to any of the _service function routers_ that link these SFIs together.

 ![Simple chain E2E network](image/e2e-simple-chain-network.png)

-Our simple network consists of three _service function routers_ (SFRs) that connect clients with MC data and storage functionality; a demand from client 1 for the storage function could be routed in one network hop from router 'A' to router 'C' or in two from routers 'A' -> 'B' -> 'C'. A demand for storage function from _client 2_ would include zero network hops.
+Our simple network consists of three _service function routers_ (SFRs) that connect clients with SFI data and storage functionality; a demand from client 1 for the storage function could be routed in one network hop from router 'A' to router 'C' or in two from routers 'A' -> 'B' -> 'C'. A demand for storage function from _client 2_ would include zero network hops.

 > __Side note: FLAME network scope__
 > 
-> Readers are reminded that low-level network traffic metrics gathered by the FLAME platform are restricted to observations of network performance between SFRs. The first and last steps (typically between a client and media component) are not captured at the time of writing - these links are denoted by a dotted line (`--->`) in our diagrams.
+> Readers are reminded that low-level network traffic metrics gathered by the FLAME platform are restricted to observations of network performance between SFRs. The first and last steps (typically between a client and SFI) are not captured at the time of writing - these links are denoted by a dotted line (`--->`) in our diagrams.
 >

 ### E2E simple chain metrics
@@ -127,9 +123,9 @@ An out-going simple E2E request chain looks like this:

 ![Simple E2E request steps](image/e2e-simple-chain-request-steps.png)

-the delay associated with the processing of the service request is isolated to within the storage MC:
+the delay associated with the processing of the service request is isolated to within the storage SFI:

-![Simple E2E MC processing](image/e2e-simple-chain-mc-processing.png)
+![Simple E2E SFI processing](image/e2e-simple-chain-mc-processing.png)

 whilst for the response E2E delay, we see this:

@@ -149,27 +145,27 @@ The _round trip time_ is the sum of the request, service processing and response

 ### E2E extended chain

-Up until this point we have considered an elementary SFC in which there is only one class of media component. In a more realistic scenario, we would expect a media service function to be composed of multiple MCs that are distributed and connected to multiple nodes in the FLAME network. Below we have extended the simple chain to include a greater level of complexity with respect to service function chains, whilst holding the network topology constant (adding network SFRs simply introduces additional hops to the problem space at this stage). In addition to indicating extra clients and MCs, weights have been added to the network arcs to indicate relative network latency between SFRs.
+Up until this point we have considered an elementary SFC in which there is only one class of SFI. In a more realistic scenario, we would expect a media service function to be composed of multiple SFIs that are distributed and connected to multiple nodes in the FLAME network. Below we have extended the simple chain to include a greater level of complexity with respect to service function chains, whilst holding the network topology constant (adding network SFRs simply introduces additional hops to the problem space at this stage). In addition to indicating extra clients and SFIs, weights have been added to the network arcs to indicate relative network latency between SFRs. __In this case study we also make the assumption that chained service function calls, such as those between one SFI and another, are synchronous and blocking__.

 ![Extended chain E2E network](image/e2e-extended-chain-network.png)

-Imagine a media service that both stores and processes high volumes of complex media streams. Consider as well a distributed population of clients making demands on this service. Successfully handling high demand for this service could mean deploying several instances of its media components (storage and processing) across multiple VMs which interoperate and share the demand load. Since clients and MCs are distributed, service function requests (made by both) will likely give rise to propagating waves of activity, load (and delay) from multiple nodes across the FLAME platform. For simplicity, let us assume our multi-media component service implements a request by processing some media data from the client and then storing it (returning some result to client). Here is client 1's request as it passes through the FLAME network and its MCs:
+Imagine a media service that both stores and processes high volumes of complex media streams. Consider as well a distributed population of clients making demands on this service. Successfully handling high demand for this service could mean deploying several instances of its SFIs (storage and processing) across multiple VMs which interoperate and share the demand load. Since clients and SFIs are distributed, service function requests (made by both) will likely give rise to propagating waves of activity, load (and delay) from multiple nodes across the FLAME platform. For simplicity, let us assume our multimedia service implements a request by processing some media data from the client and then storing it (returning some result to client). Here is client 1's request as it passes through the FLAME network and its SFIs:

 ![Extended client 1 path](image/e2e-extended-client1-path.png)

-In the figure above the green arcs indicate service request travel whilst the blue denotes the response path. The shortest route directs the request to SFR 'B' and the consequent storage request travels on to SFR 'C'. __Responses return along the path used by the request__. Indicative service response times are provided by numeric values in the active MC boxes. Let's see the same request from client 2, who has just joined the network:
+In the figure above the green arcs indicate service request travel whilst the blue denotes the response path. The shortest route directs the request to SFR 'B' and the consequent storage request travels on to SFR 'C'. __Responses return along the path used by the request__. Indicative service response times are provided by numeric values in the active SFI boxes. Let's see the same request from client 2, who has just joined the network:

 ![Extended client 2 path](image/e2e-extended-client2-path.png)

-For this client, the _locality of reference_ for processing and then storing her data is high: both of the associated MCs are located on VMs attached to the same SFR. We could expect client 2's response time to be low for this reason.
+For this client, the _locality of reference_ for processing and then storing her data is high: both of the associated SFIs are located on VMs attached to the same SFR. We could expect client 2's response time to be low for this reason.

-_Now for the sake of example only, let us assume that the hardware running MC Processor A can only effectively handle one request at any time and that any more than this will result in a substantial degradation in processing performance_.
+_Now for the sake of example only, let us assume that the hardware running SFI Processor A can only effectively handle one request at any time and that any more than this will result in a substantial degradation in processing performance_.

 Client 3 joins the network:

 ![Extended client 3 path](image/e2e-extended-client3-path.png)

-In calculating a service function route that optimizes for the complete _round trip_ delay, we need to take into account the likely delays that are incurred from both network related latencies and also service response times. The orange route illustrated above shows how the gains made by selecting a fast route through the network are offset by penalities in using a processor MC that is overloaded; conversely a slower route that selects a MC with computational resources to spare resolves to an over-all faster round-trip response time.
+In calculating a service function route that optimizes for the complete _round trip_ delay, we need to take into account the likely delays that are incurred from both network related latencies and also _all_ SFI response times. The orange route illustrated above shows how the gains made by selecting a fast route through the network are offset by penalties in using a processor SFI that is overloaded; conversely a slower route that selects an SFI with computational resources to spare resolves to an overall faster round-trip response time.
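
The trade-off described above can be sketched in a few lines of Python. This is an illustrative sketch only: the class name, the route labels and all delay values are hypothetical (they are not taken from the figures), and `sfi_response_ms` is assumed to already hold the sum of all SFI response times on the chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateRoute:
    """One possible way of serving a request (all values hypothetical)."""
    name: str
    network_forward_ms: float   # request path delay
    network_reverse_ms: float   # response path delay
    sfi_response_ms: float      # sum of all SFI response times on the chain

    def round_trip_ms(self) -> float:
        # Round trip = request delay + service processing + response delay.
        return (self.network_forward_ms
                + self.sfi_response_ms
                + self.network_reverse_ms)


def fastest_route(routes):
    """Pick the candidate with the smallest round-trip time."""
    return min(routes, key=lambda route: route.round_trip_ms())


routes = [
    CandidateRoute("short path, overloaded processor", 2.0, 2.0, 40.0),
    CandidateRoute("long path, idle processor", 8.0, 8.0, 10.0),
]
best = fastest_route(routes)
print(best.name, best.round_trip_ms())
```

With these made-up numbers the longer network path wins, because the extra network delay is smaller than the processing penalty of the overloaded SFI.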

 ## E2E Measurement

@@ -186,13 +182,64 @@ The ultimate goal is to populate a new measurement, called **e2e_delays**, which
 * *source_SFR* - tag used to identify the source service function router (the start of the network path)
 * *target_SFR* - tag used to identify the target service function router (the end of the network path)
 * *FQDN*- tag used to identify the media service
-* *sf_instance* - tag used to identify the media component instance ID
+* *sf_instance* - tag used to identify the SFI
 * *delay_forward* - network delay for the path in forward direction
 * *delay_reverse* - network delay for path in reverse direction
 * *delay_service* - media service component response time

 Then we can easily query on this measurement to obtain different performance indicators, such as end-to-end overall delays, round-trip response time or any of the contributing parts in those performance indicators.

+### Monitoring network delays
+
+Here, we describe the process of obtaining network delays between two service function routers in the network topology. CLMC retrieves network path delays between any two SFRs, see below (**SFR** denotes a service function router, **S** denotes a switch):
+
+![network_graph](image/network_graph.png)
+
+SFR monitoring provides us with FIDs at each service function router, which are bidirectional path IDs. From those, we derive the desired SFR-SFR network latencies. For instance, if we take the network graph example and analyse service function router **SFR3**, we would get 2 FIDs for this router: one for the path to reach SFR2 and one for the path to reach SFR1.
+
+We assume that the FID for reaching *SFR1* from *SFR3* tells us the path goes through nodes *S3* and *S6*.
+
+![network-SFR3-SFR1](image/network-SFR3-SFR1.png)
+
+Hence, we accumulate the individual link delays to derive the full SFR-SFR delay for both forward and reverse direction.
+
+delay_forward = SFR3-S3 + S3-S6 + S6-SFR1 = 12 + 3 + 3 = 18
+delay_reverse = SFR1-S6 + S6-S3 + S3-SFR3 = 1 + 5 + 10 = 16
+
+Now, we assume that the FID for reaching *SFR2* from *SFR3* tells us the path goes through nodes *S4* and *S2*.
+
+![network-SFR3-SFR2](image/network-SFR3-SFR2.png)
+
+Hence, we accumulate the individual link delays to derive the full SFR-SFR delay for both forward and reverse direction.
+
+delay_forward = SFR3-S4 + S4-S2 + S2-SFR2 = 12 + 4 + 5 = 21
+delay_reverse = SFR2-S2 + S2-S4 + S4-SFR3 = 8 + 2 + 11 = 21
+
+Overall, from this analysis, the following data will be reported to Influx in the **network_delays** measurement:
+
+| path_ID (tag) | source_SFR (tag) | target_SFR (tag) | delay | time |
+| --- | --- | --- | --- | --- |  
+| SFR3-SFR1 | SFR3 | SFR1 | 18 | 1525334761282000 |
+| SFR3-SFR1 | SFR1 | SFR3 | 16 | 1525334761282000 |
+| SFR3-SFR2 | SFR3 | SFR2 | 21 | 1525334761282000 |
+| SFR3-SFR2 | SFR2 | SFR3 | 21 | 1525334761282000 |
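
The accumulation of link delays above can be reproduced with a short Python sketch. The per-link values are the ones used in the worked sums; the dictionary layout and the function names are assumptions made for illustration and are not part of the CLMC code base.

```python
# Directed link delays taken from the worked example above.
LINK_DELAYS = {
    ("SFR3", "S3"): 12, ("S3", "S6"): 3, ("S6", "SFR1"): 3,
    ("SFR1", "S6"): 1, ("S6", "S3"): 5, ("S3", "SFR3"): 10,
    ("SFR3", "S4"): 12, ("S4", "S2"): 4, ("S2", "SFR2"): 5,
    ("SFR2", "S2"): 8, ("S2", "S4"): 2, ("S4", "SFR3"): 11,
}


def path_delay(nodes, link_delays):
    """Sum the delays of consecutive links along an ordered node list."""
    return sum(link_delays[(a, b)] for a, b in zip(nodes, nodes[1:]))


def bidirectional_delays(nodes, link_delays):
    """Return (delay_forward, delay_reverse) for a path resolved from a FID."""
    forward = path_delay(nodes, link_delays)
    reverse = path_delay(list(reversed(nodes)), link_delays)
    return forward, reverse


sfr3_to_sfr1 = bidirectional_delays(["SFR3", "S3", "S6", "SFR1"], LINK_DELAYS)
sfr3_to_sfr2 = bidirectional_delays(["SFR3", "S4", "S2", "SFR2"], LINK_DELAYS)
print(sfr3_to_sfr1, sfr3_to_sfr2)
```

Each tuple corresponds to the two rows reported for one path ID in the **network_delays** table.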
+
+### Monitoring SFI response times
+
+Readers of the [CLMC information model](clmc-information-model.md) will already be aware of the approach to identifying and reporting SFI performance metrics in the FLAME project. The global measurement tags that help in a decision context are used in this case to provide the mapping between network measurements and a specific service response time. Specifically, we use the SFR tag encapsulated in the media service global tags to cross-reference against target SFR tags (described above).
+
+In its simplest case, a media service function's response time could be defined as a single value that derives from the (average) time spent processing requests in local memory and/or on disk. Indeed, a number of the FLAME foundation media service metrics sent to the CLMC could be described as such. In more advanced cases (such as for clients 1 and 3 in our example above) the full service function chain is implemented across more than one endpoint. Here we have at least two options:
+
+1. Let the first SFI in an SFC be representative of the entire service function delay (making opaque the sub-calls to other SFIs required to fulfil the client's request)
+
+2. Construct a more complex view of the service function response time as an aggregate of internal SFI processing delays and their related, dependent network delays
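
As a discussion aid, option 2 could be sketched as follows. This is only one possible reading: it assumes chained calls are blocking (as in the extended chain case study), that the call graph is acyclic, and that the network delay of each sub-call is known. All names and values are hypothetical.

```python
def aggregate_response_time(sfi, processing_ms, sub_calls):
    """Response time of `sfi` including its blocking sub-calls.

    processing_ms: local processing time for each SFI.
    sub_calls: for each SFI, a list of (callee, forward_ms, reverse_ms)
               tuples describing the calls it makes and their network delays.
    """
    total = processing_ms[sfi]
    for callee, forward_ms, reverse_ms in sub_calls.get(sfi, []):
        # A blocking call adds the network round trip plus the callee's
        # own (recursively aggregated) response time.
        total += forward_ms
        total += aggregate_response_time(callee, processing_ms, sub_calls)
        total += reverse_ms
    return total


# Hypothetical chain: a processor SFI that calls a storage SFI.
processing_ms = {"processor": 15, "storage": 10}
sub_calls = {"processor": [("storage", 3, 4)]}

chain_ms = aggregate_response_time("processor", processing_ms, sub_calls)
print(chain_ms)
```

Under option 1 only the figure reported by the first SFI would be visible, whereas here the contribution of each SFI and network hop remains separable.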
+
+#### TO BE DISCUSSED FURTHER
+
+> __NOTE: ABOVE TO BE VALIDATED__
+> __***************************__
+>
+
 ### E2E Aggregation process

 The aggregation process provides similar functionality to that of an INFLUX continuous query. During each sample period the process collects and averages network and service delay data for the last 10 seconds (for example). The executed queries are:
@@ -249,42 +296,3 @@ The E2E aggregation process will merge those rows, because there is a match on n
 | SFR-A---SFR-B | SFR-A | SFR-B | ms-A.ict-flame.eu | test-sf-clmc-agent-build_INSTANCE | 9.2 | 10.3 | 11 | 1524833145975682287 |

 The resulting row would then be posted back to influx in the **e2e_delays** measurement.
\ No newline at end of file
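
A minimal Python sketch of such a merge is given below. It assumes rows are joined when the target SFR of a network path equals the `sfr` tag of a service measurement, following the cross-referencing described under "Monitoring SFI response times". The network rows reuse the SFR3/SFR1 values from the monitoring example; the service row and the function name are hypothetical.

```python
def merge_e2e(network_rows, service_rows):
    """Join forward/reverse network delays with service response times."""
    # Index network rows by direction so the reverse delay can be looked up.
    by_direction = {(r["source_SFR"], r["target_SFR"]): r for r in network_rows}
    merged = []
    for (source, target), forward in by_direction.items():
        reverse = by_direction.get((target, source))
        if reverse is None:
            continue  # no measurement for the reverse direction
        for service in service_rows:
            if service["sfr"] != target:
                continue  # this SFI is not attached to the target SFR
            merged.append({
                "path_ID": forward["path_ID"],
                "source_SFR": source,
                "target_SFR": target,
                "FQDN": service["FQDN"],
                "sf_instance": service["sf_instance"],
                "delay_forward": forward["delay"],
                "delay_reverse": reverse["delay"],
                "delay_service": service["response_time"],
            })
    return merged


network_rows = [
    {"path_ID": "SFR3-SFR1", "source_SFR": "SFR3", "target_SFR": "SFR1", "delay": 18},
    {"path_ID": "SFR3-SFR1", "source_SFR": "SFR1", "target_SFR": "SFR3", "delay": 16},
]
# Hypothetical service measurement for an SFI attached to SFR1.
service_rows = [
    {"FQDN": "ms-A.ict-flame.eu", "sf_instance": "ms-A-sf_INSTANCE",
     "sfr": "SFR1", "response_time": 27},
]

e2e_rows = merge_e2e(network_rows, service_rows)
print(e2e_rows)
```

Only the SFR3 to SFR1 direction produces a row here, because no service measurement is attached to SFR3.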
-
-## Monitoring
-
-### Monitoring network delays
-
-Here, we describe the process of obtaining network delays between two service function routers in the network topology. CLMC retrieves network path delays between any two SFRs, see below (**SFR** denotes a service function router, **S** denotes a switch):
-
-![network_graph](image/network_graph.png)
-
-SFR monitoring provides us with FIDs at each service function router, which are bidirectional path IDs. From those, we derive the desired SFR-SFR network latencies. For instance, if we take the network graph example and analyse service function router **SFR3**. We would get 2 FIDs for this router - one for the path to reach SFR2 and one for the path to reach SFR1.
-
-We assume that the FID for reaching *SFR1* from *SFR3* tells us the path goes through nodes *S3* and *S6*.
-
-![network-SFR3-SFR1](image/network-SFR3-SFR1.png)
-
-Hence, we accumulate the individual link delays to derive the full SFR-SFR delay for both forward and reverse direction.
-
-delay_forward = SFR3-S3 + S3-S6 + S6-SFR1 = 12 + 3 + 3 = 18
-delay_reverse = SFR1-S6 + S6-S3 + S3-SFR3 = 1 + 5 + 10 = 16
-
-Now, we assume that the FID for reaching *SFR2* from *SFR3* tells us the path goes through nodes *S4* and *S2*.
-
-![network-SFR3-SFR1](image/network-SFR3-SFR2.png)
-
-Hence, we accumulate the individual link delays to derive the full SFR-SFR delay for both forward and reverse direction.
-
-delay_forward = SFR3-S4 + S4-S2 + S2-SFR2 = 12 + 4 + 5 = 21
-delay_reverse = SFR2-S2 + S2-S4 + S4-SFR3 = 8 + 2 + 11 = 21
-
-Overall, from this analysis, the following data will be reported to Influx in the **network_delays** measurement:
-
-| path_ID (tag) | source_SFR (tag) | target_SFR (tag) | delay | time |
-| --- | --- | --- | --- | --- |  
-| SFR3-SFR1 | SFR3 | SFR1 | 18 | 1525334761282000 |
-| SFR3-SFR1 | SFR1 | SFR3 | 16 | 1525334761282000 |
-| SFR3-SFR2 | SFR3 | SFR2 | 21 | 1525334761282000 |
-| SFR3-SFR2 | SFR2 | SFR3 | 21 | 1525334761282000 |
-
-### Monitoring media service response times
\ No newline at end of file
--- a/docs/figures/e2eFigures.graphml
+++ b/docs/figures/e2eFigures.graphml
--- a/docs/image/e2e-extended-chain-network.png
+++ b/docs/image/e2e-extended-chain-network.png
--- a/docs/image/e2e-extended-client1-path.png
+++ b/docs/image/e2e-extended-client1-path.png
--- a/docs/image/e2e-extended-client2-path.png
+++ b/docs/image/e2e-extended-client2-path.png
--- a/docs/image/e2e-extended-client3-path.png
+++ b/docs/image/e2e-extended-client3-path.png
--- a/docs/image/e2e-simple-chain-mc-processing.png
+++ b/docs/image/e2e-simple-chain-mc-processing.png
--- a/docs/image/e2e-simple-chain-network.png
+++ b/docs/image/e2e-simple-chain-network.png
--- a/docs/image/e2e-simple-chain-request-steps.png
+++ b/docs/image/e2e-simple-chain-request-steps.png
--- a/docs/image/e2e-simple-chain-response-steps.png
+++ b/docs/image/e2e-simple-chain-response-steps.png