Skip to content

SQL based data profiling & data quality checks, which will help you to perform data profiling & data quality checks on SQL database at table & database level.

Notifications You must be signed in to change notification settings

martandsingh/SQL-DQC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SQLT - Data Profiling & Quality Checks using T-SQL commands

Author: Martand Singh

What is SQLT?

SQLT is a collection of script to perform data profiling & quality in your SQL database using T-SQL scripts. All the script are written in Microsoft SQL Server.

What is database profiling?

According to the Wikipedia, Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data.[1] The purpose of these statistics may be to:

  1. Find out whether existing data can be easily used for other purposes
  2. Improve the ability to search data by tagging it with keywords, descriptions, or assigning it to a category
  3. Assess data quality, including whether the data conforms to particular standards or patterns[2]
  4. Assess the risk involved in integrating data in new applications, including the challenges of joins
  5. Discover metadata of the source database, including value patterns and distributions, key candidates, foreign-key candidates, and functional dependencies
  6. Assess whether known metadata accurately describes the actual values in the source database
  7. Understanding data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns.
  8. Have an enterprise view of all data, for uses such as master data management, where key data is needed, or data governance for improving data quality.

How data profile is conducted?

Data profiling utilizes methods of descriptive statistics such as minimum, maximum, mean, mode, percentile, standard deviation, frequency, variation, aggregates such as count and sum, and additional metadata information obtained during data profiling such as data type, length, discrete values, uniqueness, occurrence of null values, typical string patterns, and abstract type recognition. The metadata can then be used to discover problems such as illegal values, misspellings, missing values, varying value representation, and duplicates.

SQLT details

SQLT is a collection of script where each script is participating to help you profiling your data. As this is the initial release, we are launching the basic data profiling methods.

  1. sqlt.DQC_DB_LEVEL - Execute all the DQC or profiling method for the given database (all the tables).
  2. sqlt.DQC_DISTINCT_COLUMN_COUNT - Distinct values for all the columns for the given table.
  3. sqlt.DQC_MAX_MIN_LENGTH - Minimun and Maximum length of string type column values.
  4. sqlt.DQC_MAX_MIN_VALUE - Minimum & Maximum (Range) of all the numeric fields.
  5. sqlt.DQC_MISSING_VALUES_COUNT - Count of missing values for all the fields.
  6. sqlt.DQC_SPECIAL_CHARACTER - Count of columns containing special character(s) with customized REGEX string.
  7. sqlt.DQC_TOTAL_COUNT - total row count comparison for the table.

Setup

Step 1 - Open SSMS query window. Go to Query -> SQLCMD Mode

STEP1-SQLCMD

Step 2 - Paste the content of INITIAL_SETUP/OBJECT_CREATIONS.sql to the query window (SQLMODE)

STEP2-COMMAND

Step 3 - Press F5 or run

STEP3-COMMANDRUN

How to use it?

All the DQC output will be saved in a table named sqlt.assertlog. You can select this table to check the output.

image

Run DQC for the whole database

Go to procedures & execute sqlt.DQC_DB_LEVEL. e.g. sqlt.DQC_DB_LEVEL @db_name = 'YOUR DB'

Run DQC for individual table

You can run DQC procedure individually. e.g. sqlt.DQC_DISTINCT_COLUMN_COUNT @table_name = 'Table-Name'

Examples

  1. EXEC sqlt.DQC_TOTAL_COUNT @table_name = 'EmployeeDest', @predicted_value=2
  2. EXEC sqlt.DQC_TOTAL_COUNT @table_name = 'EmployeeDest', @predicted_value=4
  3. EXEC sqlt.DQC_MISSING_VALUES_COUNT @table_name = 'SALES'
  4. EXEC sqlt.DQC_SPECIAL_CHARACTER @table_name = 'SALES'
  5. EXEC sqlt.DQC_MAX_MIN_LENGTH @table_name = 'EMPLOYEEDEST'
  6. EXEC sqlt.DQC_MAX_MIN_VALUE @table_name = 'SALES'
  7. EXEC sqlt.DQC_DISTINCT_COLUMN_COUNT @table_name = 'EMPLOYEE'
  8. EXEC sqlt.DQC_DB_LEVEL @db_name ='AdventureWorksDW2019'

As this is the first version of the scripts, we have included only basic DQC. In future commit we are planning to include statistical profiling, meta profiling (Number of partitions, partition size, type of partition).

Keep Supporting...

Releases

No releases published

Packages

No packages published

Languages