Water-Quality-DW-on-SQL-Server

This is a data warehouse and ETL implementation on Microsoft SQL Server (MSSQL), built on a specially formatted water quality dataset from DEFRA, UK.

Introduction:

A data warehouse is a central repository of information that can be analyzed to make more informed decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence (https://aws.amazon.com/what-is/data-warehouse).

This repository documents a data warehouse project built with an ETL (extract, transform, and load) process on a specially formatted WaterQuality dataset from the Department for Environment, Food & Rural Affairs (DEFRA), UK. The dataset is provided as an MS Access (.accdb) file containing 17 tables, each of which has to be exported to its own CSV file.

The data warehouse consists of a staging table, nine (9) dimension tables, and one fact table. One of the dimension tables is an extended Time table that supports time-based BI analysis. The warehouse was created in a Microsoft SQL Server 2019 database. The source dataset was exported to CSV files, which were then imported into corresponding database tables using the SQL Server Management Studio (SSMS) Import wizard. The main ETL process was done in a Jupyter Notebook (Python environment) connected to the data warehouse through a pyodbc connection and cursor.
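
To give a rough idea of what an extended Time dimension can hold, the sketch below derives calendar attributes for each day in a date range. It is an illustration only: the column names (`date_key`, `week_of_year`, and so on) are assumptions, not the columns of the Time table used in this project.

```python
from datetime import date, timedelta


def build_time_dimension(start, end):
    """Return one row (dict) per calendar day from start to end, inclusive.

    Column names are hypothetical; adapt them to the actual Time table.
    """
    rows = []
    current = start
    while current <= end:
        iso_year, iso_week, iso_weekday = current.isocalendar()
        rows.append({
            "date_key": int(current.strftime("%Y%m%d")),  # e.g. 20200131
            "full_date": current.isoformat(),
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "week_of_year": iso_week,    # ISO 8601 week number
            "day_of_week": iso_weekday,  # 1 = Monday ... 7 = Sunday
        })
        current += timedelta(days=1)
    return rows


time_rows = build_time_dimension(date(2020, 1, 1), date(2020, 1, 31))
```

Precomputing attributes such as week and quarter in the dimension lets "by week" or "by month" questions be answered with a plain join and GROUP BY rather than date arithmetic in every query.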

Finally, SQL queries based on the project questions were run against the data warehouse star schema to gain insights into the data.

Objectives of the project:

These are the objectives of the project:

To design a data warehouse in a Microsoft SQL Server database environment for the WaterQuality dataset to enable analysis.
To implement an ETL process and demonstrate its use cases, especially in the transform and load phases.
To demonstrate the use of a Python environment to interact with the data warehouse.
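
As a minimal sketch of what the transform and load phases could look like from Python (this is not the notebook's code), the example below cleans a few staging rows and loads the distinct values into a dimension table. The table and column names (`dim_sensor`, `sensor_type`) are hypothetical. The load function accepts any DB-API connection that uses `?` parameter markers; pyodbc uses that style, so a connection returned by `pyodbc.connect(...)` could be passed in. The demo uses the standard-library `sqlite3` module as a stand-in so that it runs without a SQL Server instance.

```python
import sqlite3


def transform(staging_rows):
    """Trim whitespace, drop blank values, and de-duplicate sensor types."""
    cleaned = {
        row["sensor_type"].strip()
        for row in staging_rows
        if row.get("sensor_type") and row["sensor_type"].strip()
    }
    return sorted(cleaned)


def load_dimension(conn, sensor_types):
    """Insert cleaned values into a (hypothetical) dim_sensor table."""
    cursor = conn.cursor()
    cursor.executemany(
        "INSERT INTO dim_sensor (sensor_type) VALUES (?)",
        [(value,) for value in sensor_types],
    )
    conn.commit()
    return len(sensor_types)


# Stand-in database; with SQL Server this would be a pyodbc connection.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_sensor ("
    "sensor_id INTEGER PRIMARY KEY, sensor_type TEXT UNIQUE)"
)

# Toy staging data with the kinds of defects a transform step removes.
staging = [
    {"sensor_type": " pH "},
    {"sensor_type": "Nitrate"},
    {"sensor_type": "pH"},
    {"sensor_type": ""},
]
loaded = load_dimension(conn, transform(staging))
```

Keeping `transform` free of database calls makes it easy to check the cleaning rules on their own before anything is written to the warehouse.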

The project seeks to extract the following information from the dataset:

The list of water sensors measured by type of sensor by month
The number of sensor measurements collected by type of sensor by week
The number of measurements made by location by month
The average number of measurements covered for pH by year
The average value of Nitrate measurements by locations by year
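
To make the shape of such queries concrete, here is a minimal sketch of one of them (the number of measurements by location by month) against a toy star schema. The table names, column names, and sample rows are all assumptions made for illustration, and `sqlite3` stands in for SQL Server so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_time (
        time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER
    );
    CREATE TABLE fact_measurement (
        location_id INTEGER REFERENCES dim_location(location_id),
        time_id INTEGER REFERENCES dim_time(time_id),
        result REAL
    );
    INSERT INTO dim_location VALUES (1, 'Site A'), (2, 'Site B');
    INSERT INTO dim_time VALUES (10, 2020, 1), (11, 2020, 2);
    INSERT INTO fact_measurement VALUES
        (1, 10, 7.1), (1, 10, 7.3), (1, 11, 7.0), (2, 10, 6.8);
""")

# Count fact rows per location and month by joining to the dimensions.
query = """
    SELECT l.name, t.year, t.month, COUNT(*) AS measurement_count
    FROM fact_measurement AS f
    JOIN dim_location AS l ON l.location_id = f.location_id
    JOIN dim_time AS t ON t.time_id = f.time_id
    GROUP BY l.name, t.year, t.month
    ORDER BY l.name, t.year, t.month
"""
results = conn.execute(query).fetchall()
```

The other questions follow the same pattern: join the fact table to the relevant dimensions, then group by the dimension attributes named in the question and apply `COUNT` or `AVG`.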

Deliverables on the project:

The Jupyter Notebook (Python environment) used for data cleaning and ETL: Python Environment To Demonstrate DW & ETL on MSSQL.ipynb
For reference, all the T-SQL scripts and code used throughout the project are in the T-SQL scripts folder.

PS

For code comparison, the Oracle SQL equivalent of the Jupyter Notebook mentioned above is in Corresponding Code To Illustrate ETL on Oracle DW.ipynb.
If you need to see exactly how I implemented it in Oracle DW, see this repository.

Enjoy!

Repository: vaxdata22/Water-Quality-DW-on-SQL-Server
