diff --git a/docs/Measuring-E2E-MS-Performance.md b/docs/Measuring-E2E-MS-Performance.md
index 401308a26f93ebbaf9baed33ac8872dbab9f968b..efc1eb96497eeb45262b669694e1b461ea643f54 100644
--- a/docs/Measuring-E2E-MS-Performance.md
+++ b/docs/Measuring-E2E-MS-Performance.md
@@ -46,29 +46,45 @@ Readers of this document are assumed to have at least read the [CLMC information
 
 ### **Assumptions**
 
-* Network measurement - assumption is that we have a measurement for the network link delays, called **network_delays**, providing the following information:
+Here, we list the assumptions we make for measuring and understanding E2E performance of media components:
 
-| path (tag) | delay | time |
-| --- | --- | --- |
-| path identifier | e2e delay for the given path | time of measurement |
+* Network measurement - the assumption is that we have a measurement for the network path delays between service function routers, called **network_delays**, providing the following information:
 
-Here, the **path** tag value is the identifier of the path between two nodes in the network topology obtained from FLIPS. The assumption is that those identifiers
-will be structured in such a way that we can obtain the source and target endpoint IDs from the path identifier itself. For example:
- **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu**
-We can easily split the string on **'---'** and, thus, find the source endpoint is **endpoint1.ms-A.ict-flame.eu**, while the target endpoint is
-**endpoint2.ms-A.ict-flame.eu**.
-The delay field value is the network end-to-end delay in milliseconds for the path identified in the tag value.
+| path (tag) | source (tag) | target (tag) | delay | time |
+| --- | --- | --- | --- | --- |
+| path identifier | source SFR | target SFR | e2e delay for the given path (ms) | timestamp of measurement |
 
-* A response will traverse the same network path as the request, but in reverse direction.
+Here, the **path** tag value is the identifier of the path between two nodes (service routers) in the network topology obtained from FLIPS.
+The **source** tag value is the source service router for the identified path, while the **target** tag value is the target service router.
+The delay field value is the network end-to-end delay in milliseconds that a packet would experience when traversing the path between the two SFRs identified in the tag values.
 
-* Media service measurement - assumption is that we have a measurement for media services' response time, called **service_delays**, providing the following information:
+An example row would be:
 
-| FQDN (tag) | sf_instance (tag) | endpoint (tag) | response_time | time |
+| path (tag) | source (tag) | target (tag) | delay | time |
+| --- | --- | --- | --- | --- |
+| SFR-A---S1---S2---S3---SFR-B | SFR-A | SFR-B | 10 | 1525334761282000 |
+
+The semantics of the row is that a packet traversing the path from SFR-A (service router) through S1, S2, S3 (switches) to SFR-B (service router) will experience an averaged delay of 10 ms.
+
+* Request/Response path - the assumption is that a response will traverse the same network path as the request, but in reverse direction.
+
+* Media service measurement - the assumption is that we have a measurement for media service components' response time, called **service_delays**, providing the following information:
+
+| FQDN (tag) | sf_instance (tag) | sfr (tag) | response_time | time |
+| --- | --- | --- | --- | --- |
+| media service FQDN | ID of the service function instance | SFR that connects the MC endpoint to the FLAME network | response time for the media service (ms) | timestamp of measurement |
+
+Here, the **FQDN**, **sf_instance** and **sfr** tag values identify a unique response time measurement.
+The response time field value is the response time (measured in milliseconds) for the media service component only, and it does not take into account any of the network measurements.
+
+An example row would be:
+
+| FQDN (tag) | sf_instance (tag) | sfr (tag) | response_time | time |
 | --- | --- | --- | --- | --- |
-| media service FQDN | ID of the service function instance | endpoint identifier | response time for the media service (s) | time of measurement |
+| ms-A.ict-flame.eu | ms-A-sf_INSTANCE | SFR-B | 27 | 1525334761282000 |
 
-Here, the **FQDN**, **sf_instance** and **endpoint** tag values identify a unique response time measurement. The response time field value is the
-response time (measured in seconds) for the media service only, and it does not take into account any of the network measurements.
+The semantics of the row is that a service function instance with ID *ms-A-sf_INSTANCE*, serving media service
+*ms-A.ict-flame.eu* and connected to the FLAME network through service router *SFR-B*, will have an averaged response time of 27 ms.
 
 ## E2E Model
 
@@ -155,26 +171,25 @@ __TO DO__
 
 ### **Idea**
 
-The idea is to aggregate platform measurement points with media service measurement points and obtain a third measurement from which we can easily
+The idea is to aggregate network measurement points with media service measurement points and obtain a third measurement from which we can easily
 understand both end-to-end and round-trip performance of a media service. This is achieved by having a python script running on the background and aggregating
-the data from both measurements on a given sample period, e.g. every 10 seconds. The script then posts the aggregated data back to Influx in a new measurement.
-
-
+the data from the two measurements on a given sample period, e.g. every 10 seconds. The script then posts the aggregated data back to Influx in a new measurement.
 
 ### **Goal**
 
 The ultimate goal is to populate a new measurement, called **e2e_delays**, which will be provided with the following information:
 
-| pathID_F (tag) | pathID_R (tag) | FQDN (tag) | sf_instance (tag) | D_path_F | D_path_R | D_service | time |
-| --- | --- | --- | --- | --- | --- | --- | --- |
+| path_ID (tag) | source_SFR (tag) | target_SFR (tag) | FQDN (tag) | sf_instance (tag) | delay_forward | delay_reverse | delay_service | time |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
 
-* *pathID_F* - tag used to identify the path in forward direction, e.g. **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu**
-* *pathID_R* - tag used to identify the path in reverse direction, e.g. **endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu**
-* *FQDN* - tag used to identify the media service
-* *sf_instance* - tag used to identify the media service
-* *D_path_F* - network delay for path in forward direction
-* *D_path_R* - network delay for path in reverse direction
-* *D_service* - media service response time
+* *path_ID* - tag ID used to identify the network path (bidirectional path identifier)
+* *source_SFR* - tag used to identify the source service function router (the start of the network path)
+* *target_SFR* - tag used to identify the target service function router (the end of the network path)
+* *FQDN* - tag used to identify the media service
+* *sf_instance* - tag used to identify the media component instance ID
+* *delay_forward* - network delay for the path in forward direction
+* *delay_reverse* - network delay for the path in reverse direction
+* *delay_service* - media service component response time
 
 Then we can easily query on this measurement to obtain different performance indicators, such as end-to-end overall delays,
 round-trip response time or any of the contributing parts in those performance indicators.
@@ -182,37 +197,40 @@ round-trip response time or any of the contributing parts in those performance i
 
 ### **Aggregation script**
 
-What the aggregation script does is very similat to the functionality of a continuous query. Given a sample report period, e.g. 10s,
+What the aggregation script does is very similar to the functionality of a continuous query. Given a sample report period, e.g. 10s,
 the script executes at every 10-second-period querying the averaged data for the last 10 seconds. The executed queries are:
 
-* Network delays query - to obtain the network delay values and group them by their **path** identifier:
+* Network delays query - to obtain the network delay values and group them by their **path**, **source** and **target** identifiers:
 ```
-SELECT mean(delay) as "Dnet" FROM "E2EMetrics"."autogen".network_delays WHERE time >= now() - 10s and time < now() GROUP BY path
+SELECT mean(delay) as "net_delay" FROM "E2EMetrics"."autogen"."network_delays" WHERE time >= now() - 10s and time < now() GROUP BY path, source, target
 ```
 
-* Media service response time query - to obtain the response time values of the media service instances and group them by **FQDN**, **sf_instance** and **endpoint** identifiers:
+* Media service response time query - to obtain the response time values of the media service instances and group them by **FQDN**, **sf_instance** and **sfr** identifiers:
 ```
-SELECT mean(response_time) as "Dresponse" FROM "E2EMetrics"."autogen".service_delays WHERE time >= now() - 10s and time < now() GROUP BY FQDN, sf_instance, endpoint
+SELECT mean(response_time) as "response_time" FROM "E2EMetrics"."autogen"."service_delays" WHERE time >= now() - 10s and time < now() GROUP BY FQDN, sf_instance, sfr
 ```
 
-The results of the queries are then matched against each other on endpoint ID: on every match of the **endpoint** tag of the **service_delays** measurement with
-the target endpoint ID of the **network_delays** measurement, the rows are combined to obtain an **e2e_delay** measurement row, which is posted back to influx.
+The results of the queries are then matched against each other on the **target** and **sfr** tag values (for *network_delays* and *service_delays* respectively):
+on every match of the **sfr** tag of the **service_delays** measurement with the **target** service router of the **network_delays** measurement, the rows are combined
+to obtain an **e2e_delays** measurement row, which is posted back to influx.
 
 Example:
 
-* Result from first query:
+Let's assume we have these results from the two queries:
+
+* Result from first query
 
 ```
 name: network_delays
-tags: path=endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu
-time Dnet
----- ----
+tags: path=SFR-A---SFR-B, source=SFR-A, target=SFR-B
+time net_delay
+---- ---------
 1524833145975682287 9.2
 
 name: network_delays
-tags: path=endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu
-time Dnet
----- ----
+tags: path=SFR-A---SFR-B, source=SFR-B, target=SFR-A
+time net_delay
+---- ---------
 1524833145975682287 10.3
 ```
 
@@ -220,22 +238,17 @@ time Dnet
 
 ```
 name: service_delays
-tags: FQDN=ms-A.ict-flame.eu, endpoint=endpoint2.ms-A.ict-flame.eu, sf_instance=test-sf-clmc-agent-build_INSTANCE
-time Dresponse
----- ---------
+tags: FQDN=ms-A.ict-flame.eu, sfr=SFR-B, sf_instance=test-sf-clmc-agent-build_INSTANCE
+time response_time
+---- -------------
 1524833145975682287 11
 ```
 
-The script will parse the path identifier **endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu** and find the target endpoint being
-**endpoint2.ms-A.ict-flame.eu**. Then the script checks if there is service delay measurement row matching this endpoint. Since there is one,
-those values will be merged, so the result will be a row like this:
+The script will merge those rows, because there is a match on network delay target SFR and service delay SFR - namely **SFR-B**.
 
-| pathID_F (tag) | pathID_R (tag) | FQDN (tag) | sf_instance (tag) | D_path_F | D_path_R | D_service | time |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| endpoint1.ms-A.ict-flame.eu---endpoint2.ms-A.ict-flame.eu | endpoint2.ms-A.ict-flame.eu---endpoint1.ms-A.ict-flame.eu | ms-A.ict-flame.eu | test-sf-clmc-agent-build_INSTANCE | 9.2 | 10.3 | 11 | 1524833145975682287 |
-
-Here, another assumption is made that we can reverse the path identifier of a network delay row and that the reverse path delay would also
-be reported in the **network_delays** measurement.
+| path_ID (tag) | source_SFR (tag) | target_SFR (tag) | FQDN (tag) | sf_instance (tag) | delay_forward | delay_reverse | delay_service | time |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| SFR-A---SFR-B | SFR-A | SFR-B | ms-A.ict-flame.eu | test-sf-clmc-agent-build_INSTANCE | 9.2 | 10.3 | 11 | 1524833145975682287 |
 
 The resulting row would then be posted back to influx in the **e2e_delays** measurement.
\ No newline at end of file
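
The matching step described in the updated document could be sketched in Python roughly as follows. This is only an illustration, not the actual aggregation script: the function name `merge_delays` and the flat dict layout of the rows (tags and fields in one dict) are hypothetical, and two behaviours are assumptions, namely that the reverse delay is looked up by swapping **source** and **target** on the same path ID, and that a row is skipped when no reverse measurement exists in the sample window.

```python
from typing import Dict, List


def merge_delays(network_rows: List[Dict], service_rows: List[Dict]) -> List[Dict]:
    """Join network_delays rows with service_delays rows where target == sfr."""
    # Index network delays by (path, source, target) so that the reverse
    # direction can be found by swapping source and target (assumption).
    delay_by_key = {
        (row["path"], row["source"], row["target"]): row["net_delay"]
        for row in network_rows
    }

    merged = []
    for net in network_rows:
        reverse = delay_by_key.get((net["path"], net["target"], net["source"]))
        if reverse is None:
            # Assumption: without a reverse measurement no e2e row is produced.
            continue
        for svc in service_rows:
            if svc["sfr"] != net["target"]:
                continue  # service is not attached to this path's target SFR
            merged.append({
                "path_ID": net["path"],
                "source_SFR": net["source"],
                "target_SFR": net["target"],
                "FQDN": svc["FQDN"],
                "sf_instance": svc["sf_instance"],
                "delay_forward": net["net_delay"],
                "delay_reverse": reverse,
                "delay_service": svc["response_time"],
                "time": net["time"],  # assumption: reuse the network row timestamp
            })
    return merged


# Sample rows taken from the example in the document.
network_rows = [
    {"path": "SFR-A---SFR-B", "source": "SFR-A", "target": "SFR-B",
     "net_delay": 9.2, "time": 1524833145975682287},
    {"path": "SFR-A---SFR-B", "source": "SFR-B", "target": "SFR-A",
     "net_delay": 10.3, "time": 1524833145975682287},
]
service_rows = [
    {"FQDN": "ms-A.ict-flame.eu", "sfr": "SFR-B",
     "sf_instance": "test-sf-clmc-agent-build_INSTANCE",
     "response_time": 11, "time": 1524833145975682287},
]

for row in merge_delays(network_rows, service_rows):
    print(row)
```

With the sample rows above, only the SFR-A to SFR-B direction has a service attached to its target, so a single merged row is produced, corresponding to the **e2e_delays** row shown in the document. Writing that row back to Influx is left out of the sketch.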