diff --git a/docs/Measuring-E2E-MS-Performance.md b/docs/Measuring-E2E-MS-Performance.md index 6b69b4033ceb8ded4776839793da6db7534a1b29..9b9acc0728e6f3f1caf287a0f89b1e639992bf3b 100644 --- a/docs/Measuring-E2E-MS-Performance.md +++ b/docs/Measuring-E2E-MS-Performance.md @@ -1,5 +1,5 @@ <!-- -// © University of Southampton IT Innovation Centre, 2017 +// © University of Southampton IT Innovation Centre, 2018 // // Copyright in this software belongs to University of Southampton // IT Innovation Centre of Gamma House, Enterprise Road, @@ -54,14 +54,12 @@ Here, we list the assumptions we make for measuring and understanding E2E perfor | --- | --- | --- | --- | --- | | path identifier | source SFR | target SFR | e2e delay for the given path (ms) | timestamp of measurement | -Here, the **path** tag value is the identifier of the path between two nodes (service routers) in the network topology obtained from FLIPS. -The **source** tag value is the source service router for the identified path, while the **target** tag value is the target service router. -The delay field value is the network end-to-end delay in milliseconds that a packet would experience when traversing the path between the two SFRs identified in the tag values. +Here, the **path** tag value is the identifier of the path between two nodes (service routers) in the network topology obtained from FLIPS. The **source** tag value is the source service router for the identified path, while the **target** tag value is the target service router. The delay field value is the network end-to-end delay in milliseconds that a packet would experience when traversing the path between the two SFRs identified in the tag values. 
An example row would be: -| path (tag) | source (tag) | target (tag) | delay | time | -| --- | --- | --- | --- | --- | +| path (tag) | source (tag) | target (tag) | delay | time | +| --- | --- | --- | --- | --- | | SFR-A---S1---S2---S3---SFR-B | SFR-A | SFR-B | 10 | 1525334761282000 | The semantics of the row is that a packet traversing the path from SFR-A (service router) through S1, S2, S3 (switches) to SFR-B (service router) will experience an averaged delay of 10ms. @@ -167,13 +165,9 @@ For this client, the _locality of reference_ for processing and then storing her __TO DO__ -## E2E Measurement - -### **Idea** +## E2E Measurement -The idea is to aggregate network measurement points with media service measurement points and obtain a third measurement from which we can easily -understand both end-to-end and round-trip performance of a media service. This is achieved by having a python script running on the background and aggregating -the data from the two measurements on a given sample period, e.g. every 10 seconds. The script then posts the aggregated data back to Influx in a new measurement. +Our aim is to aggregate network measurement points with media service measurement points to obtain a third measurement from which we can easily understand both end-to-end and round-trip performance of a media service. This is achieved by using a CLMC E2E monitoring process that aggregates data from network and media service measurements within a given sample period, e.g. every 10 seconds. This process then posts the aggregated data back to Influx in a new measurement. 
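
The aggregation step can be pictured as a join between the averaged network delays and the averaged service response times. The following is only a minimal sketch under stated assumptions, not the CLMC implementation: the in-memory dictionaries stand in for the Influx query results, the service delay is assumed to carry the SFR it is attached to, and the reverse direction of a path is assumed to be reported under the same path identifier with source and target swapped. The function name `merge_delays` and the type aliases are hypothetical; the output keys follow the tag and field names of the **e2e_delays** measurement.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory shapes for the averaged query results; the real
# process would read these values from Influx.
# (path, source, target) -> mean network delay in ms
NetworkDelays = Dict[Tuple[str, str, str], float]
# (FQDN, sf_instance, SFR) -> mean service response time in ms
ServiceDelays = Dict[Tuple[str, str, str], float]


def merge_delays(network: NetworkDelays, service: ServiceDelays) -> List[dict]:
    """Join network and service delays where the path's target SFR matches the service's SFR."""
    rows = []
    for (path, source, target), forward in network.items():
        # Assumption: the reverse direction is stored under the same path
        # identifier with source and target swapped.
        reverse = network.get((path, target, source))
        if reverse is None:
            continue  # no reverse measurement in this sample period
        for (fqdn, sf_instance, sfr), response_time in service.items():
            if sfr != target:
                continue  # service is not attached to this path's target SFR
            rows.append({
                "path_ID": path,
                "source_SFR": source,
                "target_SFR": target,
                "FQDN": fqdn,
                "sf_instance": sf_instance,
                "delay_forward": forward,
                "delay_reverse": reverse,
                "delay_service": response_time,
            })
    return rows
```

Each returned dictionary corresponds to one candidate row for the new measurement. A path whose target SFR has no service delay in the sample period produces no row, which is a design choice of this sketch rather than documented behaviour.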
### **Goal** @@ -191,16 +185,14 @@ The ultimate goal is to populate a new measurement, called **e2e_delays**, which * *delay_reverse* - network delay for path in reverse direction * *delay_service* - media service component response time -Then we can easily query on this measurement to obtain different performance indicators, such as end-to-end overall delays, -round-trip response time or any of the contributing parts in those performance indicators. +Then we can easily query on this measurement to obtain different performance indicators, such as end-to-end overall delays, round-trip response time or any of the contributing parts in those performance indicators. +### E2E Aggregation process -### **Aggregation script** - -What the aggregation script does is very similar to the functionality of a continuous query. Given a sample report period, e.g. 10s, -the script executes at every 10-second-period querying the averaged data for the last 10 seconds. The executed queries are: +The aggregation process provides similar functionality to that of an INFLUX continuous query. During each sample period the process collects and averages network and service delay data for the last 10 seconds (for example). The executed queries are: * Network delays query - to obtain the network delay values and group them by their **path**, **source** and **target** identifiers: + ``` SELECT mean(delay) as "net_delay" FROM "E2EMetrics"."autogen"."network_delays" WHERE time >= now() - 10s and time < now() GROUP BY path, source, target ``` @@ -244,12 +236,11 @@ time response_time 1524833145975682287 11 ``` - -The script will merge those rows, because there is a match on network delay target SFR and service delay SFR - namely **SFR-B**. +The E2E aggregation process will merge those rows, because there is a match on network delay target SFR and service delay SFR - namely **SFR-B**. 
| path_ID (tag) | source_SFR (tag) | target_SFR (tag) | FQDN (tag) | sf_instance (tag) | delay_forward | delay_reverse | delay_service | time | | --- | --- | --- | --- | --- | --- | --- | --- | --- | -| SFR-A---SFR-B | SFR-A | SFR-B | ms-A.ict-flame.eu | test-sf-clmc-agent-build_INSTANCE | 9.2 | 10.3 | 11 | 1524833145975682287 | +| SFR-A---SFR-B | SFR-A | SFR-B | ms-A.ict-flame.eu | test-sf-clmc-agent-build_INSTANCE | 9.2 | 10.3 | 11 | 1524833145975682287 | The resulting row would then be posted back to influx in the **e2e_delays** measurement. @@ -257,15 +248,11 @@ The resulting row would then be posted back to influx in the **e2e_delays** meas ### Monitoring network delays -Here, we describe the process of obtaining network delays between two service function routers in the network topology. -CLMC retrieves the network topology graph from the monitoring framework and the link delays between any two network nodes. -Example (**SR** denotes a service router, **S** denotes a switch): +Here, we describe the process of obtaining network delays between two service function routers in the network topology. CLMC retrieves network path delays between any two SFRs, see below (**SR** denotes a service router, **S** denotes a switch):  -SFR monitoring provides us with FIDs at each service router, which are bidirectional path IDs. From those, we derive the desired SR-SR network latencies. -For instance, if we take the network graph example and analyse service router **SR3**. We would get 2 FIDs for this router - one for the path to reach SR2 -and one for the path to reach SR1. +SFR monitoring provides us with FIDs at each service router, which are bidirectional path IDs. From those, we derive the desired SR-SR network latencies. For instance, if we take the network graph example and analyse service router **SR3**. We would get 2 FIDs for this router - one for the path to reach SR2 and one for the path to reach SR1. 
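
Deriving an SR-SR latency from a FID amounts to summing the per-link delays along the path the FID describes, once in each direction. The sketch below is illustrative only: it assumes the FID has already been decoded into an ordered list of nodes and that a delay is known for each directed link. The names `LINK_DELAYS`, `path_delay` and `forward_and_reverse` are hypothetical, and the link values are the SR3 to SR1 figures used in this document's worked example.

```python
from typing import Dict, List, Tuple

# Directed link delays in ms, keyed by (from_node, to_node).
# Values are taken from the SR3 <-> SR1 example in this document.
LINK_DELAYS: Dict[Tuple[str, str], float] = {
    ("SR3", "S3"): 12, ("S3", "S6"): 3, ("S6", "SR1"): 3,   # forward links
    ("SR1", "S6"): 1, ("S6", "S3"): 5, ("S3", "SR3"): 10,   # reverse links
}


def path_delay(nodes: List[str], link_delays: Dict[Tuple[str, str], float]) -> float:
    """Sum the delays of consecutive directed links along the node list."""
    return sum(link_delays[(a, b)] for a, b in zip(nodes, nodes[1:]))


def forward_and_reverse(
    nodes: List[str], link_delays: Dict[Tuple[str, str], float]
) -> Tuple[float, float]:
    """Return (forward, reverse) delay for a path given in forward order."""
    return path_delay(nodes, link_delays), path_delay(nodes[::-1], link_delays)


# Path from SR3 to SR1 through S3 and S6.
forward, reverse = forward_and_reverse(["SR3", "S3", "S6", "SR1"], LINK_DELAYS)
print(forward, reverse)  # 18 16
```

Keeping the delays keyed by directed link reflects that forward and reverse delays on the same link can differ, as they do in the example values.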
We assume that the FID for reaching *SR1* from *SR3* tells us the path goes through nodes *S3* and *S6*. @@ -273,21 +260,21 @@ We assume that the FID for reaching *SR1* from *SR3* tells us the path goes thro Hence, we accumulate the individual link delays to derive the full SR-SR delay for both forward and reverse direction. -delay_forward = SR3-S3 + S3-S6 + S6-SR1 = 12 + 3 + 3 = 18 +delay_forward = SR3-S3 + S3-S6 + S6-SR1 = 12 + 3 + 3 = 18 delay_reverse = SR1-S6 + S6-S3 + S3-SR3 = 1 + 5 + 10 = 16 -Now, we assume that the FID for reaching *SR2* from *SR3* tells us the path goes through nodes *S4* and *S2*. +Now, we assume that the FID for reaching *SR2* from *SR3* tells us the path goes through nodes *S4* and *S2*.  Hence, we accumulate the individual link delays to derive the full SR-SR delay for both forward and reverse direction. -delay_forward = SR3-S4 + S4-S2 + S2-SR2 = 12 + 4 + 5 = 21 +delay_forward = SR3-S4 + S4-S2 + S2-SR2 = 12 + 4 + 5 = 21 delay_reverse = SR2-S2 + S2-S4 + S4-SR3 = 8 + 2 + 11 = 21 Overall, from this analysis, the following data will be reported to Influx in the **network_delays** measurement: -| path (tag) | source (tag) | target (tag) | delay | time | +| path (tag) | source (tag) | target (tag) | delay | time | | --- | --- | --- | --- | --- | | SR3-SR1 | SR3 | SR1 | 18 | 1525334761282000 | | SR3-SR1 | SR1 | SR3 | 16 | 1525334761282000 |