Hydrology2.rmd 12.6 KB
Newer Older
Ben Anderson's avatar
Ben Anderson committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
---
title: "FEEG6025 Week 8 Practical - Hydrological forecasting"
output:
  html_document:
    fig_caption: yes
    number_sections: yes
    toc: yes
    toc_depth: 3
---
#Introduction
We are going to use a dataset from a river catchment area in Wales, included in the Hydrological Model Assessment and Development (**hydromad** http://hydromad.catchment.org) package. This upland catchment is the source of the River Wye, which is a significant river that occasionally floods in towns further downstream.

##Aim
Create a forecast model to predict streamflow in the catchment for input rainfall. This model could be used for early warning of flooding downstream, or for designing hydraulic structures in the region.  

#Getting started
##R and Rstudio versions
This practical works with R version 3.3.1 and RStudio version 1.0.44. The University 'additional software' is out of date and, if used, this markdown code fails to run.  If you are using your laptop, this is not an issue, so skip to next section.  If you are using a University workstation, then you need to download these two files:

* https://cran.r-project.org/bin/windows/base/R-3.3.2-win.exe
* https://download1.rstudio.org/RStudio-1.0.44.zip

Install R3.3.2 by running the .exe file, accepting the defaults.  RStudio-1.0.44.zip can simply be moved to your 'my documents' folder and te
hn unzipped.   
To run RStudio, navigate in Windows Explorer to mydocuments\RStudio-1.0.44\bin and click on rstudio.exe.

##Install packages
```{r echo=TRUE, cache=TRUE}

# This is a function to install any packages that cannot be loaded
# (Might be useful for coursework etc.)

my_required_packages <- function(x,y){
  for( i in x ){
    #  require returns TRUE if it was able to load package
    if( ! require( i , character.only = TRUE ) ){
      #  If package was not able to be loaded then re-install
      install.packages( i , repos=y , 
                        #type="win.binary" , comment out so runs on OS X etc
                        quiet=TRUE , dependencies = TRUE , verbose = FALSE )
      #  Load package after installing
      require( i , character.only = TRUE, quietly = TRUE )
    }
  }
}
```

Now load libraries:
```{r echo=TRUE, warning=FALSE, message=FALSE}
# call the function my_required_packages
my_required_packages(c("chron", "data.table", "zoo" , "latticeExtra", "polynom", "car", "Hmisc","reshape","DEoptim", "astsa", "pracma", "fractal", "tseries"),"http://cran.rstudio.com/")
```

Now install **hydromad** package:
```{r echo=TRUE, warning=FALSE}
# call the function my_required_packages 
my_required_packages(c("hydromad"), "http://hydromad.catchment.org")
```

##Load data
The data consists of two time series: rainfall (*P*) and streamflow data (*Q*).  Both are in units of mm/hour.  For stream flow, the flow rate has been divided by the catchment area; the rainfall has been averaged over the whole catchment area which is **10.6 km^2^**.

```{r cache=TRUE}
data(Wye)

# Create a 'zoo' time series object
Wye<-zoo(coredata(Wye),seq(start(Wye),end(Wye),by="hours"))
summary(Wye)
head(Wye)
tail(Wye)

#Extract the core data from 'zoo' time series object 
Rain<-coredata(Wye$P)
Flow<-coredata(Wye$Q)
t_start<-start(Wye)
t_end<-end(Wye)
t<-seq(t_start,t_end,by = "hour")
```

#Inspecting data
##Basic plots
Create a 'run series plot' (plot variables against time on the x-axis) and histograms of the variables 

```{r, echo=TRUE}
#plot.zoo(Wye, main = "River Wye Rainfall (P) and Streamflow (Q) in mm/h", xlab="Date")
xyplot(Wye)
hist(Rain)
#hist(Wye$Q)
hist(Flow)
```

###Q. What are the average monthly rainfall figures?

```{r}
P_mon=aggregate(Wye$P, as.yearmon, FUN = sum)
aggregate(P_mon,function(x) cycle(as.yearmon(x)),FUN=mean)
```

###Q. What is the "run-off ratio" (sum of streamflow)/(sum of rainfall)?  Does this make sense? 

```{r}
sum(Flow)/sum(Rain)
```

###Task: For the streamflow data, log-transform the variable and plot it against time

```{r, echo=TRUE}
plot.ts(as.Date(as.POSIXct(t)),log(Flow), main = "River Wye Streamflow Log(Q in mm/h)", xlab="Date")
hist(log(Flow))
```

###Q. What is the effect of log-transforming the variable?

##Zoom in on some specific events
###Task: use the **hydromad** function **eventseq()** to find events where rain > 8.5 mm/h and streamflow > 6 mm/h, remaining above 1 mm/h for a duration of 12 hours

```{r}
HighWye<-eventseq(Wye, thresh=c(8.5, 6.0), inthresh=c(6.0, 1.0), mingap=24, mindur=12)
table(coredata(HighWye))
```

You can zoom in on a specific event, using the window() function:

```{r, echo=TRUE}
Wye88<-window(Wye, start = as.POSIXct("1988-09-20"),
                     end = as.POSIXct("1988-10-03"))
xyplot(Wye88,main = "River Wye Rainfall (P) and Streamflow (Q) in mm/h\nSeptember 1988 Storm", xlab="Date")
```

There is another event of interest in 1987, which has a higher peak rainfall, but a lower peak streamflow.   

###Task: Use window and plot functions to zoom in on this event

```{r, echo=TRUE}
Wye87<-window(Wye, start = as.POSIXct("1987-10-10"),
                     end = as.POSIXct("1987-10-25"))
xyplot(Wye87,main = "River Wye Rainfall (P) and Streamflow (Q) in mm/h\nOctober 1987 storm", xlab="Date")
```

The streamflow is being forced by an external input---rainfall.

###Task: find a period when there is zero rainfall, and the streamflow is freely decreasing.


#Check regularity
Are there any missing data?

```{r}
any(is.na(Flow))
any(is.na(Rain))
```
Are there any uneven time steps?

```{r}
tol<-0.0000000001 # small number, avoids rounding errors 
any(diff(diff(t))>tol)
```

The **zoo** package makes it easy to check for regularity, but first you have to create a 'zoo' object from the raw data:

```{r}
Wye2<-zoo(cbind(Rain,Flow),order.by=t)

# 'zoo' function for checking regularity:
is.regular(Wye2, strict = TRUE)
```
What if we lost some rows of data?

```{r}
set.seed(42) # makes random sampling repeatable

# randomly sample 80% of the existing points:   
idx<-sort(sample( seq(1:length(Rain)), floor(0.8*length(Rain)) ))
Rain3<-Rain[idx]
Flow3<-Flow[idx]
t3<-t[idx]

# create new time series from sampled points
Wye3<-zoo(cbind(Rain3,Flow3),order.by=t3)
is.regular(Wye3, strict = TRUE)
```

###TASK: show that the new time series is irregular

##Interpolation of irregular time series
The **zoo** package has some useful functions for: 

(a) filling in gaps in irregular time series with 'missing data' (NA)
(b) interpolating across the gaps

```{r}
# create an empty regular time series with the same start and end times as the original time series
reg<-zoo( ,order.by=seq( start(Wye), end(Wye), "hour" ) )

# merge the regular time series with the irregular time series
Wye4<-merge(Wye3,reg)

is.regular(Wye4, strict = TRUE)
any(is.na(Wye4))
```

###Q. Is the new time series regular? Are there missing data values (NA)?

```{r}
Wye5<-na.approx(Wye4, rule=2) # LINEAR interpolation
#na.locf(Wye4) # NEAREST neighbours interpolation
#na.spline(Wye4) # SPLINE interpolation
rmserr(coredata(Wye5$Rain3), Rain, summary=TRUE)
rmserr(coredata(Wye5$Flow3), Flow, summary=TRUE)
```

###Q. which interpolation method gives the lowest error compared to the original time series?

###TASK: plot one of the storm events with the original and interpolated data points

#Check stationarity
###Task: use three stationarity tests to check whether the streamflow time series is stationary. 
Note the **stationarity()** function (PSR test) mentioned in the lecture is very strict whereas ADF and KPSS tests are more widely used and are less strict. 

```{r}
stationarity(Flow) # T>0.05 indicates stationarity (very strict test)
adf.test(Flow) # null hypothesis: non-stationary (not very strict)
kpss.test(Flow) # null hypothesis: stationary
```

###Q. Do the tests indicate stationarity?  How do the results change if you (a) log transform the variable and (b) take differences of the variable?
Log'ed data:
```{r}
stationarity(log(Flow+0.0001)) # make sure to avoid log(0)
adf.test(log(Flow+0.0001))
kpss.test(log(Flow+0.0001))
```
Diff'ed data:
```{r}
stationarity(diff(Flow))
adf.test(diff(Flow))
kpss.test(diff(Flow))
```

#Examine autocorrelation structure
How does the streamflow relate to its value at past time steps?

###Task: create lag plots of streamflow
```{r}
lag1.plot(Flow,max.lag = 9)
```

###Q. What do the lag plots indicate about the autocorrelation structure?

###Task: plot the autocorrelation function for streamflow for (a) the whole time series (b) the storm events and (c) the freely decreasing period

```{r, echo=TRUE}
acf(Flow, lag.max=24, main="Autocorrelation function for streamflow Q - all data")
```

```{r, echo=TRUE}
pacf(coredata(Wye88$Q), lag.max=24, main="Partial autocorrelation function for streamflow Q - storm event")
```

##Q. What do the ACF and PACF suggest about the type of time-series model?

#Examine cross-correlation between rainfall and streamflow
###Task: Plot the cross-correlation function for (a) all the time series (b) storm events.


```{r, echo=TRUE}
ccf(Flow,Rain, lag.max=24, main="Cross-correlation between streamflow Q(t+k) and rainfall P(t) - all data")
```

```{r, echo=TRUE}
ccf(coredata(Wye88$Q),coredata(Wye88$P), lag.max=24, main="Cross-correlation between streamflow Q(t+k)\n and rainfall P(t) - storm event 1988")
```

```{r, echo=TRUE}
ccf(coredata(Wye87$Q),coredata(Wye87$P), lag.max=24, main="Cross-correlation between streamflow Q(t+k)\n and rainfall P(t) - storm event 1987")
```

###Q. What are the similarities and differences between the CCF plots?

###Q: What is your estimate of the typical delay between rainfall and streamflow?

```{r}

my_delay<-3 # in time steps

```

**Hydromad** includes a useful function to produce a rolling cross-correlation - it shows how the cross-correlation for different lags varies with time throughout the period of records, using a moving 'window' of specified **width** of time steps, which moves along **by** a certain number of time steps. 

###Task: plot the rolling CCF
```{r}
WyeRoll <- rollccf(Wye, width=list(100), by=100)
xyplot(WyeRoll)
```

###Q: Is the cross-correlation consistent throughout the time series? Is there a period that looks like something changes systematically?


#Time series modelling of streamflow
In the lecture, "autoregressive moving-average" models (**ARIMA**) of one variable were covered - these assume that the future value of the variable only depends on the past values of itself (**AR**), and on the random errors at past time steps (**MA**). The time series may be differenced to achieve stationarity, in which case the model needs to be integrated (**I**) to recover the original variable.   
In our case, we also have an external input - rainfall.  This is called an 'e**X**ogenous' variable, hence the X in **ARMAX**.  
**Hydromad** includes the ability to fit an **ARMAX** model to the dataset.

###Task: fit an **ARMAX** model to the dataset.
Try different order combinations of **AR** and **MA** order (here referred to as *n* and *m*)

```{r}

# fit an ARMAX model, in this case first-order Auto-Regressive model AR(1)
Wyefit <- hydromad(DATA=Wye, sma=NULL, routing = "armax", rfit = list("ls", delay=my_delay, order = c(n = 1, m = 0)))

# print the AR and MA coefficients 
coef(Wyefit)

# print a summary of the model fit
summary(Wyefit)

# plot the model fit for all data
xyplot(Wyefit, with.P = TRUE) 

# plot the model fit for one of the storms
xyplot(Wyefit, with.P = TRUE, xlim = c(as.POSIXct("1988-09-20"),as.POSIXct("1988-10-03")))

# Q-Q plot
qqmath(Wyefit, scales = list(y = list(log = TRUE)), distribution = qnorm, type = c("g", "l"))

# ACF of the residuals
Wyeres=coredata(residuals(Wyefit))
acf(Wyeres, na.action = na.pass)

# Some other combinations to try:
# ...order = c(n=0, m=1) => Moving Average, order 1
# ...order = c(n=1, m=1) => Unusual, p=q
# ...order = c(n=2, m=0) => Second-order Auto-Regressive model AR(2)
# ...order = c(n=1, m=2) => ARIMA(1,0,2)
# ...order = c(n=2, m=1) => ARIMA(2,0,1)

```


###Q. Which combination of auto-regressive (**AR**) and moving average (**MA**) terms give the best overall fit? Which combination fits the *peak* values the best?

###Q. Does the **ACF** plot of the residuals suggest that most of the autocorrelation has been removed by the model?

#Using the model to forecast the streamflow for a given rainfall

###Q. A storm with a duration of six hours and rainfall 10 mm/h occurs just after the end of the dataset.  What is your best prediction of the resulting hydrograph? 

```{r}
# create a new rainfall time series
new_storm<-rep(0.0,240)
new_storm[12:18]<-10.0 # hours 12 to 18 have rainfall of 10 mm/h

# turn it into a 'zoo' time series object
new_storm<-zoo(new_storm,seq(from=end(Wye), length.out=length(new_storm), by="hour") )

# use the model fit to predict streamflow for a given rainfall (known as 'routing')
Wyepred<-predict(Wyefit, newdata=new_storm, which="routing")
xyplot(Wyepred)
```