
This is a toy project to predict the flow of River Test using gauged data from Environment Agency and National River Archive.


JZhou3083/SouthernWater_Riverflow_Forcasting


Table of Contents
  1. About The Project
  2. Data Preparation
  3. Roadmap
  4. License
  5. Acknowledgments

About The Project

Hands off Flow (HOF) is a river-flow threshold. When the flow falls to this level, it warns that a water supply company may be breaching its licence conditions for abstracting water from the river. Those conditions exist to preserve water resources for plants and wildlife.

In the past eight months, Southampton has experienced its driest months in 131 years due to an extreme shortage of rainfall. The graph below, from the Southern Water website, shows recent flow data for the River Test, one of the main water sources for Southampton:

In July, Southern Water introduced a 'Temporary Use Ban' (TUB) for all its customers in Hampshire and the Isle of Wight, restricting non-essential water usage such as watering a garden with a hosepipe.

This project therefore aims to conduct time series analysis on the flow readings from the gauging stations on the River Test, and to construct a predictive model. The main objectives of the project are:

  1. Analyse the flow of the River Test using gauged data from Southern Water
  2. Construct a model that predicts river flow from climatic data, using external data sources if needed
  3. Validate the model and automate data collection and prediction using APIs
  4. Build an app for a better UI

(back to top)

Methodology

The goal of this work is to provide a reusable pipeline for water availability forecasting. It also provides a comparative analysis of different forecasting strategies and models. Every dataset is different, so each one is treated independently while following a general pipeline. I am aiming for mid- and long-term forecasting and do not intend to feed inferred outputs back in as inputs for future predictions, which rules out recursive forecasting.

The general methodology falls into three categories:

  • data preparation: Sourcing, imputing, cleansing and feature engineering for model feeding.
  • evaluation strategy: Expanding window cross validation
  • modeling strategy: Ensemble learning
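As a rough illustration of the expanding window evaluation strategy, the sketch below generates train/test index splits in which the training window always starts at the first sample and grows with each fold. The function name `expanding_window_splits` and its parameters are hypothetical and are not taken from the repository; scikit-learn's `TimeSeriesSplit` offers a similar scheme.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs with a growing training window.

    Hypothetical helper for illustration: each fold trains on everything
    before its test block, so no future data leaks into training.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Ten samples, three folds, two test samples per fold.
splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx)
```

The training set grows from 4 to 6 to 8 samples while the test block moves forward in time, which mirrors how the model would be retrained as new gauge readings arrive.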

Data collection

Data collection is difficult because EA climatic data often suffers from significant discontinuities. The target is to collect at least 20 years of data for modelling, so I drew extensively on external data sources wherever correlation analysis indicated a good match with the EA data.

Daily Mean Flow (m3/s)

The schematic of the hydrology of the River Test downstream of Romsey, adapted from the Environment Agency (EA), 2011:

According to the drought permit application by Southern Water (section 2.3.3), Testwood Bridge GS does not exist. Hence, the actual HOF data is obtained by summing the readings of the following gauging stations:

  1. River Great Test at Testwood
  2. River Blackwater at Ower
  3. Broadlands Fish Carrier at M27 TV1
  4. River Little Test at Conagar Bridge
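The summation above can be sketched as follows, assuming each station's readings are held as a mapping from date string to daily mean flow in m3/s. The station keys and all flow values are made up for illustration, and `total_flow` is a hypothetical helper rather than code from the repository. Only dates present at every station are summed, which is where gaps at a single gauge become a problem.

```python
def total_flow(station_series):
    """Sum daily flows across stations, keeping only dates present in every series."""
    common_dates = set.intersection(*(set(s) for s in station_series.values()))
    return {
        day: sum(series[day] for series in station_series.values())
        for day in sorted(common_dates)
    }


# Illustrative values only, not real gauge readings.
readings = {
    "great_test_testwood": {"2020-01-01": 5.0, "2020-01-02": 5.5},
    "blackwater_ower": {"2020-01-01": 1.0, "2020-01-02": 1.5},
    "broadlands_fish_carrier": {"2020-01-01": 0.5, "2020-01-02": 0.5},
    "little_test_conagar": {"2020-01-01": 2.0},  # second day missing
}

hof_flow = total_flow(readings)
print(hof_flow)
```

Because one station has no reading for the second day, that day drops out of the total entirely.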

The interface to the Environment Agency database is the class ImportFromEA in EnvironAgency.py. However, the EA's Testwood GS has a severe missing-data problem: it contains data from Apr 2018 to Aug 2021 only, and of low quality (unchecked estimates). Filling the missing values is essential because discarding data would significantly reduce the dataset size and the reliability of the forecasts.

To fill the gap, I imputed it with the flow readings from Broadlands Gauging Station (GS), obtained from the National River Flow Archive (NRFA). Broadlands GS is located slightly upstream of the Conagar Bridge GS and Test Back GS stations (see the hydrology map for a clearer idea). Given the proximity, the Broadlands readings should be able to approximate the missing flow (a system identification task). To validate this idea, I extracted the data from all the stations:

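I then computed the Scatter Index (SI) and the coefficient of determination (R2-score) between the two series (the code is in EDA.py) and found that, for the existing data, the SI and R2-score are around 0.1 and 0.91 respectively. This is an unexpectedly good approximation. The SI is commonly defined as the root-mean-square error normalised by the mean of the observations:

$$\mathrm{SI} = \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}(e_i - o_i)^2}}{\bar{o}}$$

and the R2-score is:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

where $o_i$ and $e_i$ are the observed and estimated flows, $\bar{o}$ is the mean of the observations, $N$ is the number of samples, RSS is the sum of squares of residuals, and TSS is the total sum of squares. To summarise, the closer the R2-score is to 1 and the SI is to 0, the better the estimation. Because there may be a delay between the two series, I also ran a correlation check on them: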
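For readers who want to reproduce the two metrics, here is a minimal sketch. It assumes the common definition of SI as RMSE divided by the mean of the observed series, which may differ from the implementation in EDA.py. The function names and sample values are illustrative only.

```python
import math


def scatter_index(observed, estimated):
    """RMSE between the two series, normalised by the mean of the observations."""
    n = len(observed)
    rmse = math.sqrt(sum((e - o) ** 2 for o, e in zip(observed, estimated)) / n)
    return rmse / (sum(observed) / n)


def r2_score(observed, estimated):
    """Coefficient of determination: 1 - RSS / TSS."""
    mean_obs = sum(observed) / len(observed)
    rss = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    tss = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - rss / tss


# Illustrative values only, not real gauge readings.
observed = [2.0, 4.0, 6.0, 8.0]
estimated = [2.5, 3.5, 6.5, 7.5]

si = scatter_index(observed, estimated)
r2 = r2_score(observed, estimated)
print(round(si, 3), round(r2, 3))
```

Here every estimate is off by 0.5 against a mean observed flow of 5, so the SI comes out at about 0.1 and the R2-score at about 0.95.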

The greatest correlation is at a lag of 0 days, which means the delay between the readings at Broadlands GS and the sum of the other three stations is negligible. I also built a transfer-function model (data/tfModel.mat) using the System Identification Toolbox of MATLAB to achieve a closer approximation, but the model overfits due to the shortage of training data. Finally, the outputs of the model are:
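The delay check described above can be sketched as a lagged Pearson correlation: shift one series against the other by 0 to `max_lag` days and keep the lag with the highest correlation. The helpers `pearson` and `best_lag` and the sample series are hypothetical, and the repository may perform this check differently.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def best_lag(reference, candidate, max_lag):
    """Return (lag, correlation) where candidate[t + lag] best matches reference[t]."""
    scores = {}
    for lag in range(max_lag + 1):
        x = reference[: len(reference) - lag]  # drop the tail that has no partner
        y = candidate[lag:]                    # shift the candidate back by `lag`
        scores[lag] = pearson(x, y)
    lag = max(scores, key=scores.get)
    return lag, scores[lag]


# Illustrative series: `delayed` is `reference` arriving two days later.
reference = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 3.0, 7.0, 5.0, 8.0]
delayed = [0.0, 0.0] + reference[:-2]

print(best_lag(reference, reference, max_lag=3)[0])
print(best_lag(reference, delayed, max_lag=3)[0])
```

A best lag of 0, as found for Broadlands GS, indicates that no time shift is needed before using one series to estimate the other.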

  1. Daily flow mean gauged at Broadlands GS from EA
  2. Daily flow mean gauged at Ower GS from NRFA

(back to top)

Precipitation & Temperature

Rainfall and temperature data collection and imputation used the same methodology as flow, with two extra tasks:

  • Locations. Rainfall and temperature at nearby coordinates are assumed to be highly correlated (around 90% for points less than 10 km apart), so I use locations at least 10 km from each other. The data collection points run from Ashe, where the River Test rises, to the HOF measuring point at Southampton. A list of the locations is saved in /data/stations.csv:

  • Extraction from netCDF files. The dataset I used was HadUK-Grid at 5 km grid resolution, in netCDF format. The code that merges the netCDF files and extracts data at the coordinates of interest is merge_nc_files.py. The dataset I produced has the format:
    Date          rainfall1  rainfall2  ...  flow_BL  flow_Ower
    1980/01/01    ***        ***        ***  ***      ***
    Year/mth/day  ***        ***        ***  ***      ***
    2021/12/31    ***        ***        ***  ***      ***
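The 10 km spacing rule for the collection points can be illustrated with a great-circle distance filter. This is only a sketch: the point names and coordinates below are placeholders, not entries from /data/stations.csv, and `thin_points` is a hypothetical helper that greedily keeps a point only if it lies at least `min_km` from every point already kept.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def thin_points(points, min_km=10.0):
    """Greedily keep points that are at least `min_km` from all previously kept ones."""
    kept = []
    for name, lat, lon in points:
        if all(haversine_km(lat, lon, k_lat, k_lon) >= min_km for _, k_lat, k_lon in kept):
            kept.append((name, lat, lon))
    return kept


# Placeholder coordinates for illustration only.
candidates = [
    ("A", 51.20, -1.48),
    ("B", 51.21, -1.48),  # roughly 1 km north of A, so it is dropped
    ("C", 50.91, -1.40),  # roughly 30 km south of A, so it is kept
]

kept_names = [name for name, _, _ in thin_points(candidates)]
print(kept_names)
```

The greedy pass depends on the order of the candidates, so listing them from source to mouth keeps the selection aligned with the course of the river.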

As I assumed linear relations between river flow and both temperature and rainfall, I checked the Pearson correlation between the features and the target variables:

I found that the temperatures at all locations are highly correlated, so I kept only the maximum and minimum temperature at Andover. I also noticed that temperature has a more significant influence than regional rainfall on the river flow at Broadlands, while at Ower it is the opposite. I then applied an exponentially weighted moving average with alpha = 0.1 to the features to smooth and denoise them, which later proved essential for model performance:
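A minimal sketch of the smoothing step, assuming the recursive form s_t = alpha * x_t + (1 - alpha) * s_(t-1) seeded with the first observation. The repository may use a different variant. The function name and input values are illustrative.

```python
def exponential_smooth(values, alpha=0.1):
    """Exponentially weighted moving average, seeded with the first value."""
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(x)  # no history yet: start from the first observation
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# A step from 0 to 10: with alpha = 0.1 the smoothed series rises slowly.
step_response = exponential_smooth([0.0, 10.0, 10.0, 10.0], alpha=0.1)
print([round(v, 2) for v in step_response])
```

With alpha = 0.1 each new reading contributes only 10% to the smoothed value, so short rainfall spikes are damped while slow trends are preserved.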

Data preparation

Now that I had all the data I could get, the next step was to prepare it for supervised learning. I made a few adjustments:

  • Finding the relevant time lag effect of weather variables on river flow
  • Cyclical encoding to add time related features, i.e. month of the year, week of the month and day of the week
  • Adding the past values of the target variables as new features
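Two of the adjustments above, cyclical encoding and past-value features, can be sketched as follows. The helper names are hypothetical, and the choice of periods (12 for months, 7 for weekdays) is an assumption. Week of the month is omitted for brevity but would follow the same sine/cosine pattern.

```python
import math
from datetime import date


def cyclical(value, period):
    """Map a periodic value onto the unit circle so the end of a cycle sits next to its start."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def time_features(day):
    """Sine/cosine encodings for month of the year and day of the week."""
    month_sin, month_cos = cyclical(day.month - 1, 12)
    dow_sin, dow_cos = cyclical(day.weekday(), 7)
    return {
        "month_sin": month_sin,
        "month_cos": month_cos,
        "dow_sin": dow_sin,
        "dow_cos": dow_cos,
    }


def add_lags(series, lags):
    """Build (features, target) rows where features are the values `lag` steps back."""
    start = max(lags)  # the first rows lack enough history and are skipped
    return [
        ([series[t - lag] for lag in lags], series[t])
        for t in range(start, len(series))
    ]


features = time_features(date(2021, 1, 4))  # a Monday in January
rows = add_lags([1.0, 2.0, 3.0, 4.0, 5.0], lags=[1, 2])
print(features)
print(rows)
```

With the sine/cosine pair, December and January end up as neighbours, which a plain month number from 1 to 12 would not capture. For direct (non-recursive) forecasting, the smallest lag should be at least as large as the forecast horizon, so that every feature is available at prediction time.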

lagged features

Update: check Kaggle for more updates.

Roadmap

  • Data preparation
  • Feature engineering
  • Modelling
  • Validation
  • APP

See the open issues for a full list of proposed features (and known issues).

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Acknowledgments

(back to top)
