Commit 281aaefd authored by Stephen C Phillips

Documents the measurement of service request RTT (WIP)
# Round Trip Time of a Service Request
The Round Trip Time (RTT) of a network is the time taken from sending a packet to receiving its acknowledgement. We are also interested in factoring in the size of the data being sent over the network and the delay caused by the service processing the request:
```
total_delay = forward_network_delay + service_delay + reverse_network_delay
```
## Network delay
The network delay is the time to send the complete payload over the network:
```
network_delay = time from first byte leaving source to final byte arriving at destination
```
If we ignore the OSI L7 protocol (e.g. HTTP, FTP, Tsunami) then we are modelling a chunk of data moving along a wire. The network delay is then:
```
network_delay = latency + (time difference from start of the data to the end of the data)
= latency + data_delay
```
### Latency
The latency (or propagation delay) of the network path is the time taken for a particular bit of data to get from one end to the other. If we are just modelling one wire (with no switches) then this can be modelled using:
```
latency = distance / speed
```
For optical fibre (or even an electric wire), the speed naively would be the speed of light. In fact, the speed is slower than this (in optical fibre this is because of the internal refraction that occurs, which is different for different wavelengths). According to [m2.optics.com](http://www.m2optics.com/blog/bid/70587/Calculating-Optical-Fiber-Latency) the delay (1/speed) is approximately 5 microseconds / km
```
if
distance is in m
delay is in s/m
latency is in s
then
latency = distance * 5 / 1E9
```
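A minimal sketch of the latency formula above, assuming only the quoted figure of ~5 microseconds of propagation delay per km of fibre (i.e. 5e-9 s/m):

```python
# Assumption from the text: ~5 microseconds of delay per km of fibre.
FIBRE_DELAY_S_PER_M = 5e-9

def latency(distance_m):
    """Propagation delay in seconds for a fibre path of the given length in metres."""
    return distance_m * FIBRE_DELAY_S_PER_M

print(latency(100_000))  # 100 km of fibre: 0.0005 s, i.e. 0.5 ms
```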
(this matches MJB's "propagation_delay" formula)
In practice we would just measure the latency of a link: most real-life connections comprise many network links and many switches, each of which introduces some latency.
### Data delay
The time difference from start of the data to the end of the data (or "data delay" for want of a better term) is dependent on the bandwidth of the network and the amount of data.
```
if
data_size is in Bytes
bandwidth is in Mb/s
data_delay is in s
then
data_delay = data_size * 8 / (bandwidth * 1E6)
```
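The unit conversions above (bytes to bits, Mb/s to b/s) can be sketched as:

```python
def data_delay(data_size_bytes, bandwidth_mbps):
    # bytes -> bits (* 8); Mb/s -> b/s (* 1e6); note the parentheses
    # around the bandwidth term.
    return data_size_bytes * 8 / (bandwidth_mbps * 1e6)

print(data_delay(1_000_000, 100))  # 1 MB over a 100 Mb/s link: 0.08 s
```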
Naively, the data_size is the size of the data you want to send over the network (call this the "file_size"). However, the data is split into packets and each packet has a header on it, so the amount of data going over the network is actually more than the amount sent. The header includes contributions from (at least) the L7 protocol (e.g. HTTP), L4 (e.g. TCP) and L3 (e.g. IP) layers.
```
let
packet_size = packet_header_size + packet_payload_size
then
data_size = (packet_size / packet_payload_size) * file_size
or equivalently
data_size = (packet_size / (packet_size - packet_header_size)) * file_size
```
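A sketch of the packet-overhead correction. The 1500-byte packet size (a typical Ethernet MTU) and 40-byte header (TCP + IPv4, no options) are assumptions for illustration, not values from the text:

```python
def data_size(file_size_bytes, packet_size=1500, packet_header_size=40):
    # Assumed defaults: 1500-byte packets (typical Ethernet MTU),
    # 40-byte headers (20 bytes TCP + 20 bytes IPv4, no options).
    packet_payload_size = packet_size - packet_header_size
    return (packet_size / packet_payload_size) * file_size_bytes

print(data_size(1_000_000))  # ~1.027 MB actually crosses the wire for a 1 MB file
```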
### Measuring and Predicting
Bringing the above parts together we have:
```
network_delay = latency + data_delay
              = (distance * 5 / 1E9) + ((packet_size / (packet_size - packet_header_size)) * file_size * 8 / (bandwidth * 1E6))
```
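Putting the pieces together, under the same assumptions as the sketches above (5e-9 s/m of fibre delay; 1500-byte packets with 40-byte headers):

```python
def network_delay(distance_m, file_size_bytes, bandwidth_mbps,
                  packet_size=1500, packet_header_size=40):
    # latency: propagation delay of the path
    latency = distance_m * 5 / 1e9
    # data_delay: time for the padded payload to cross the link
    on_wire_bytes = (packet_size / (packet_size - packet_header_size)) * file_size_bytes
    data_delay = on_wire_bytes * 8 / (bandwidth_mbps * 1e6)
    return latency + data_delay

# 1 MB file over a 100 km path on a 100 Mb/s link:
print(network_delay(100_000, 1_000_000, 100))
```

For a path of this length the data_delay term dominates; latency only matters once the link is fast or the file is small.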
We want to be able to measure the `network_delay` and also want to be able to predict what the delay is likely to be for a given deployment.
Parameter | Known / measured
----------|--------
latency | measured by network probes
distance | sometimes known
packet_size | known (a property of the network)
packet_header_size | known (at least for L3 and L4)
file_size | measured at the service function
bandwidth | known (a property of the network), can also be measured
Measuring the actual `latency` can be done in software. For a given `file_size`, the `network_delay` could then be predicted.
*We are ignoring network congestion and the effect of the protocol (see below).*
### Effect of protocol
The analysis above ignores the network protocol. However, the choice of protocol has a large effect in networks with a high bandwidth-delay product.
In data communications, bandwidth-delay product is the product of a data link's capacity (in bits per second) and its round-trip delay time (in seconds). The result, an amount of data measured in bits (or bytes), is equivalent to the maximum amount of data on the network circuit at any given time, i.e., data that has been transmitted but not yet acknowledged.
TCP, for instance, expects acknowledgement of every packet sent; if the sender has not received an acknowledgement within a specified time period then the packet is retransmitted. Furthermore, TCP uses a flow-control method whereby the receiver specifies how much data it is willing to buffer; once that amount has been sent, the sending host must pause and wait for acknowledgement.
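The bandwidth-delay product can be computed directly; it shows how much data the sender's window must cover to keep the pipe full (the link figures here are illustrative):

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    # Capacity (bits/s) * RTT (s) gives bits in flight; divide by 8 for bytes.
    return bandwidth_bps * rtt_s / 8

# Illustrative link: 100 Mb/s with a 50 ms round-trip time.
print(bdp_bytes(100e6, 0.05))  # 625000.0 bytes in flight
```

If TCP's receive window is smaller than this, the sender stalls waiting for acknowledgements and never uses the full bandwidth.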
### Effect of congestion
The analysis above considers the best case, where the whole bandwidth of the link is available for the data transfer. In practice, competing traffic reduces the effective bandwidth, so measured delays will often exceed this prediction.
## Service Delay
A particular service function may have several operations (API calls) on it. A model of service function performance needs to consider the resource the service function is deployed upon (and its variability and reliability), the availability of the resource (i.e. whether the service function has the resource to itself), the workload (a statistical distribution of API calls and request sizes) and the speed at which the resource can compute the basic computations invoked by the requests.
We must simplify sufficiently to make the problem tractable but not too much so that the result is of no practical use.
To simplify we can:
* assume that the resource is invariable, 100% available and 100% reliable;
* assume that the distribution of API calls is constant and that the workload can be represented sufficiently by the average request size.
To be concrete, if a service function has two API calls: `transcode_video(video_data)` and `get_status()` then we would like to model the average response time over "normal" usage which might be 10% of calls to `transcode_video` and 90% of calls to `get_status` and a variety of `video_data` sizes with a defined average size.
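The averaging described above can be sketched as a weighted sum. Only the 10%/90% call mix comes from the text; the per-call times are invented purely for illustration:

```python
# Call mix from the text: 10% transcode_video, 90% get_status.
call_mix = {"transcode_video": 0.10, "get_status": 0.90}
# Mean per-call response times in seconds -- assumed values for illustration.
mean_call_time_s = {"transcode_video": 2.0, "get_status": 0.01}

avg_response = sum(call_mix[c] * mean_call_time_s[c] for c in call_mix)
print(avg_response)  # ~0.209 s
```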
### Measuring
As an example, the `minio` service reports the average response time over all API calls already so, for that service at least, measuring the `service_delay` is easy. We expect to also be able to measure the average `file_size` which will do as a measure of workload.
### Predicting
As noted above, a simple model must consider:
* the resource the service function is deployed upon (e.g. CPU, memory, disk);
* the workload (an average request size)
* the speed at which the resource can compute the basic computations invoked by the requests (dependent on the service function).
We can therefore write that:
```
service_delay = f(resource, workload, service function characteristics)
```
For our simplified workload we could assume that this can be written as:
```
service_delay = workload * f(resource, service function characteristics)
```
The resource could be described in terms of the number of CPUs, amount of RAM and amount of disk. Even if the resource was a physical machine more detail would be required such as the CPU clock speed, CPU cache sizes, RAM speed, disk speed, etc. In a virtualised environment it is even more complicated as elements of the physical CPU may or may not be exposed to the virtual CPU (which may in fact be emulated).
Benchmarks are often used to help measure the performance of a resource so that one resource may be compared to another without going into all the detail of the precise architecture. Application benchmarks (those executing realistic workloads such as matrix operations or fast Fourier transforms) can be more useful than general benchmark scores (such as SPECint or SPECfp). For more information on this, see [Snow White Clouds and the Seven Dwarfs](https://eprints.soton.ac.uk/273157/1/23157.pdf).
The best benchmark for a service function is the service function itself combined with a representative workload. That is, to predict the performance of a service function on a given resource, it is best to just run it and find out. In the absence of that, the next best would be to execute Dwarf benchmarks on each resource type and correlate them with the service functions, but that is beyond what we can do now.
We might execute a single benchmark such as the [Livermore Loops](http://www.netlib.org/benchmark/livermorec) benchmark, which stresses a variety of CPU operations and provides a single benchmark figure in Megaflops/sec.
Our service_delay equation would then just reduce to:
```
service_delay = workload * f(benchmark, service function characteristics)
= workload * service_function_scaling_factor / benchmark
```
The `service_function_scaling_factor` essentially converts the `workload` number into a number of Megaflops, so for a `workload` in bytes the `service_function_scaling_factor` represents Megaflops/byte.
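A sketch of this reduced equation; the scaling factor and benchmark score below are illustrative numbers, not measurements:

```python
def service_delay(workload_bytes, benchmark_mflops, scaling_mflops_per_byte):
    # workload * scaling converts bytes of request into Megaflops of work;
    # dividing by the benchmark score (Megaflops/sec) gives seconds.
    return workload_bytes * scaling_mflops_per_byte / benchmark_mflops

# Illustrative: 10 MB request, 0.001 Mflops/byte, resource benchmarked at 2000 Mflops/sec.
print(service_delay(10_000_000, 2000, 0.001))  # 5.0 s
```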
If we don't have a benchmark then the best we can do is approximate the benchmark by the number of CPUs:
```
service_delay = workload * f(cpus, service function characteristics)
              = workload * service_function_scaling_factor / cpus
```
Is this a simplification too far? It ignores the size of RAM, for instance, which cannot normally be included as a linear factor (twice as much RAM does not give twice the performance): too little RAM causes disk swapping or outright failure, while once there is enough for the workload, adding more makes no difference.