Issue #68 - updated documentation's E2E measurement related sections

ae027d2a · Nikolay Stanchev · 51e7973c · ae027d2a
Commit ae027d2a authored May 3, 2018 by Nikolay Stanchev
--- a/docs/Measuring-E2E-MS-Performance.md
+++ b/docs/Measuring-E2E-MS-Performance.md
@@ -46,29 +46,45 @@ Readers of this document are assumed to have at least read the [CLMC information
 ### **Assumptions**
-* Network measurement - assumption is that we have a measurement for the network link delays, called **network_delays**, providing the following information:  
+Here, we list the assumptions we make for measuring and understanding E2E performance of media components:
-| path (tag) | delay | time |
+* Network measurement - the assumption is that we have a measurement for the network path delays between service function routers, called **network_delays**, providing the following information:  
-| --- | --- | --- |
-| path identifier | e2e delay for the given path | time of measurement |
-Here, the **path** tag value is the identifier of the path between two nodes in the network topology obtained from FLIPS. The assumption is that those identifiers
+| path (tag) | source (tag) | target (tag) | delay | time |
-will be structured in such a way that we can obtain the source and target endpoint IDs from the path identifier itself. For example:  
+| --- | --- | --- | --- | --- | 
- **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu**  
+| path identifier | source SFR | target SFR | e2e delay for the given path (ms) | timestamp of measurement |
-We can easily split the string on **'---'** and, thus, find the source endpoint is **endpoint1.ms-A.ict-flame.eu**, while the target endpoint is 
-**endpoint2.ms-A.ict-flame.eu**.  
-The delay field value is the network end-to-end delay in milliseconds for the path identified in the tag value.
-* A response will traverse the same network path as the request, but in reverse direction.
+Here, the **path** tag value is the identifier of the path between two nodes (service routers) in the network topology obtained from FLIPS.   
+The **source** tag value is the source service router for the identified path, while the **target** tag value is the target service router.   
+The delay field value is the network end-to-end delay in milliseconds that a packet would experience when traversing the path between the two SFRs identified in the tag values.  
-* Media service measurement - assumption is that we have a measurement for media services' response time, called **service_delays**, providing the following information:
+An example row would be:
-| FQDN (tag) | sf_instance (tag) | endpoint (tag) | response_time | time |
+| path (tag) | source (tag) | target (tag) | delay | time |  
 | --- | --- | --- | --- | --- |  
-| media service FQDN | ID of the service function instance | endpoint identifier | response time for the media service (s) | time of measurement |
+| SFR-A---S1---S2---S3---SFR-B | SFR-A | SFR-B | 10 | 1525334761282000 |
+The semantics of the row is that a packet traversing the path from SFR-A (service router) through S1, S2, S3 (switches) to SFR-B (service router) will experience an average delay of 10 ms.
+* Request/Response path - the assumption is that a response will traverse the same network path as the request, but in reverse direction.
-Here, the **FQDN**, **sf_instance** and **endpoint** tag values identify a unique response time measurement. The response time field value is the 
+* Media service measurement - the assumption is that we have a measurement for media service components' response time, called **service_delays**, providing the following information:
-response time (measured in seconds) for the media service only, and it does not take into account any of the network measurements.
+| FQDN (tag) | sf_instance (tag) | sfr (tag) | response_time | time |
+| --- | --- | --- | --- | --- |
+| media service FQDN | ID of the service function instance | SFR that connects the MC endpoint to the FLAME network | response time for the media service (ms) | timestamp of measurement |
+Here, the **FQDN**, **sf_instance** and **sfr** tag values identify a unique response time measurement.  
+The response time field value is the response time (measured in milliseconds) for the media service component only, and it does not take into account any of the network measurements.
+An example row would be:
+| FQDN (tag) | sf_instance (tag) | sfr (tag) | response_time | time |
+| --- | --- | --- | --- | --- |
+| ms-A.ict-flame.eu | ms-A-sf_INSTANCE | SFR-B | 27 | 1525334761282000 |
+The semantics of the row is that a service function instance with ID *ms-A-sf_INSTANCE*, serving media service
+*ms-A.ict-flame.eu* and connected to the FLAME network through service router *SFR-B*, has an average response time of 27 ms.
 ## E2E Model
@@ -155,26 +171,25 @@ __TO DO__
 ### **Idea** 
-The idea is to aggregate platform measurement points with media service measurement points and obtain a third measurement from which we can easily
+The idea is to aggregate network measurement points with media service measurement points and obtain a third measurement from which we can easily
 understand both end-to-end and round-trip performance of a media service. This is achieved by having a python script running in the background and aggregating
-the data from both measurements on a given sample period, e.g. every 10 seconds. The script then posts the aggregated data back to Influx in a new measurement. 
+the data from the two measurements on a given sample period, e.g. every 10 seconds. The script then posts the aggregated data back to Influx in a new measurement. 
 ### **Goal**
 The ultimate goal is to populate a new measurement, called **e2e_delays**, which will be provided with the following information:
-| pathID_F (tag) | pathID_R (tag) | FQDN (tag) | sf_instance (tag) | D_path_F | D_path_R | D_service | time |
+| path_ID (tag) | source_SFR (tag) | target_SFR (tag) | FQDN (tag) | sf_instance (tag) | delay_forward | delay_reverse | delay_service | time |
-| --- | --- | --- | --- | --- | --- | --- | --- | 
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
-* *pathID_F* - tag used to identify the path in forward direction, e.g. **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu**
+* *path_ID* - tag used to identify the network path (bidirectional path identifier)
-* *pathID_R* - tag used to identify the path in reverse direction, e.g. **endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu**
+* *source_SFR* - tag used to identify the source service function router (the start of the network path)
+* *target_SFR* - tag used to identify the target service function router (the end of the network path)
 * *FQDN*- tag used to identify the media service
-* *sf_instance* - tag used to identify the media service
+* *sf_instance* - tag used to identify the media component instance ID
-* *D_path_F* - network delay for path in forward direction
+* *delay_forward* - network delay for the path in forward direction
-* *D_path_R* - network delay for path in reverse direction
+* *delay_reverse* - network delay for the path in reverse direction
-* *D_service* - media service response time
+* *delay_service* - media service component response time
 Then we can easily query on this measurement to obtain different performance indicators, such as end-to-end overall delays, 
 round-trip response time or any of the contributing parts in those performance indicators. 
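As a minimal sketch of how such indicators could be derived from a single **e2e_delays** row: the document does not define the formulas, so this assumes that the one-way end-to-end delay is the forward network delay plus the service response time, and that the round-trip response time additionally includes the reverse network delay. The function names are hypothetical, the row uses sample values, and all values are taken to be in milliseconds.

```python
def e2e_delay(row):
    """One-way delay: forward network delay plus service response time (assumed definition)."""
    return row["delay_forward"] + row["delay_service"]


def round_trip_time(row):
    """Round trip: forward delay, service response time and reverse delay (assumed definition)."""
    return row["delay_forward"] + row["delay_service"] + row["delay_reverse"]


# Sample field values for one e2e_delays row (milliseconds).
row = {"delay_forward": 9.2, "delay_reverse": 10.3, "delay_service": 11}

print(round(e2e_delay(row), 3), round(round_trip_time(row), 3))
```

Keeping the three contributing delays as separate fields means either indicator can be recomputed later without querying the source measurements again.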
@@ -182,37 +197,40 @@ round-trip response time or any of the contributing parts in those performance i
 ### **Aggregation script**
-What the aggregation script does is very similat to the functionality of a continuous query. Given a sample report period, e.g. 10s,
+What the aggregation script does is very similar to the functionality of a continuous query. Given a sample report period, e.g. 10s,
 the script executes at every 10-second-period querying the averaged data for the last 10 seconds. The executed queries are:  
-* Network delays query - to obtain the network delay values and group them by their **path** identifier:
+* Network delays query - to obtain the network delay values and group them by their **path**, **source** and **target** identifiers:
 ```
-SELECT mean(delay) as "Dnet" FROM "E2EMetrics"."autogen".network_delays WHERE time >= now() - 10s and time < now() GROUP BY path
+SELECT mean(delay) as "net_delay" FROM "E2EMetrics"."autogen"."network_delays" WHERE time >= now() - 10s and time < now() GROUP BY path, source, target
 ``` 
-* Media service response time query - to obtain the response time values of the media service instances and group them by **FQDN**, **sf_instance** and **endpoint** identifiers: 
+* Media service response time query - to obtain the response time values of the media service instances and group them by **FQDN**, **sf_instance** and **sfr** identifiers: 
 ```
-SELECT mean(response_time) as "Dresponse" FROM "E2EMetrics"."autogen".service_delays WHERE time >= now() - 10s and time < now() GROUP BY FQDN, sf_instance, endpoint
+SELECT mean(response_time) as "response_time" FROM "E2EMetrics"."autogen"."service_delays" WHERE time >= now() - 10s and time < now() GROUP BY FQDN, sf_instance, sfr
 ```
-The results of the queries are then matched against each other on endpoint ID: on every match of the **endpoint** tag of the **service_delays** measurement with
+The results of the queries are then matched against each other on the **target** and **sfr** tag values (for *network_delays* and *service_delays* respectively): 
-the target endpoint ID of the **network_delays** measurement, the rows are combined to obtain an **e2e_delay** measurement row, which is posted back to influx.
+on every match of the **sfr** tag of the **service_delays** measurement with the **target** service router of the **network_delays** measurement, the rows are combined 
+to obtain an **e2e_delays** measurement row, which is posted back to Influx.
 Example:
-* Result from first query:
+Let's assume we have these results from the two queries:
+* Result from first query
 ```
 name: network_delays
-tags: path=endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu
+tags: path=SFR-A---SFR-B, source=SFR-A, target=SFR-B
-time                Dnet
+time                net_delay
----                ----
+----                ---------
 1524833145975682287 9.2
 name: network_delays
-tags: path=endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu
+tags: path=SFR-A---SFR-B, source=SFR-B, target=SFR-A
-time                Dnet
+time                net_delay
----                ----
+----                ---------
 1524833145975682287 10.3
 ```
@@ -220,22 +238,17 @@ time                Dnet
 ```
 name: service_delays
-tags: FQDN=ms-A.ict-flame.eu, endpoint=endpoint2.ms-A.ict-flame.eu, sf_instance=test-sf-clmc-agent-build_INSTANCE
+tags: FQDN=ms-A.ict-flame.eu, sfr=SFR-B, sf_instance=test-sf-clmc-agent-build_INSTANCE
-time                Dresponse
+time                response_time
----                ---------
+----                -------------
 1524833145975682287 11
 ```
-The script will parse the path identifier **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu** and find the target endpoint being
+The script will merge those rows, because there is a match between the network delay target SFR and the service delay SFR - namely **SFR-B**.
-**endpoint2.ms-A.ict-flame.eu**. Then the script checks if there is service delay measurement row matching this endpoint. Since there is one,
-those values will be merged, so the result will be a row like this:
-| pathID_F (tag) | pathID_R (tag) | FQDN (tag) | sf_instance (tag) | D_path_F | D_path_R | D_service | time |
-| --- | --- | --- | --- | --- | --- | --- | --- | 
-| endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu | endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu | ms-A.ict-flame.eu | test-sf-clmc-agent-build_INSTANCE | 9.2 | 10.3 | 11 | 1524833145975682287 | 
-Here, another assumption is made that we can reverse the path identifier of a network delay row and that the reverse path delay would also 
+| path_ID (tag) | source_SFR (tag) | target_SFR (tag) | FQDN (tag) | sf_instance (tag) | delay_forward | delay_reverse | delay_service | time |
-be reported in the **network_delays** measurement. 
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| SFR-A---SFR-B | SFR-A | SFR-B | ms-A.ict-flame.eu | test-sf-clmc-agent-build_INSTANCE | 9.2 | 10.3 | 11 | 1524833145975682287 | 
 The resulting row would then be posted back to influx in the **e2e_delays** measurement.
\ No newline at end of file
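The matching step described in the aggregation script section can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the actual script: the function name `merge_delays` and the dictionary layout of the rows are hypothetical, the reverse delay is assumed to be reported under the same path identifier with source and target swapped, and the timestamp of the merged row is assumed to be taken from the forward network row.

```python
def merge_delays(network_rows, service_rows):
    """Combine network delay rows with service delay rows into e2e_delays rows."""
    # Index network rows by (path, source, target) so the reverse direction can be looked up.
    net_index = {(n["path"], n["source"], n["target"]): n for n in network_rows}
    merged = []
    for net in network_rows:
        # Reverse direction: same path identifier, source and target swapped (assumption).
        reverse = net_index.get((net["path"], net["target"], net["source"]))
        if reverse is None:
            continue  # assumption: skip paths that have no reverse measurement
        for svc in service_rows:
            # Match the service router of the service row with the target of the path.
            if svc["sfr"] != net["target"]:
                continue
            merged.append({
                "path_ID": net["path"],
                "source_SFR": net["source"],
                "target_SFR": net["target"],
                "FQDN": svc["FQDN"],
                "sf_instance": svc["sf_instance"],
                "delay_forward": net["net_delay"],
                "delay_reverse": reverse["net_delay"],
                "delay_service": svc["response_time"],
                "time": net["time"],  # assumption: reuse the forward row's timestamp
            })
    return merged


# Rows mirroring the example query results above.
network_rows = [
    {"path": "SFR-A---SFR-B", "source": "SFR-A", "target": "SFR-B",
     "net_delay": 9.2, "time": 1524833145975682287},
    {"path": "SFR-A---SFR-B", "source": "SFR-B", "target": "SFR-A",
     "net_delay": 10.3, "time": 1524833145975682287},
]
service_rows = [
    {"FQDN": "ms-A.ict-flame.eu", "sfr": "SFR-B",
     "sf_instance": "test-sf-clmc-agent-build_INSTANCE",
     "response_time": 11, "time": 1524833145975682287},
]

rows = merge_delays(network_rows, service_rows)
for r in rows:
    print(r)
```

With the example data, only the SFR-A to SFR-B direction produces a merged row, because no service row reports **SFR-A** as its service router.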