
Hypothesis Testing

Description

SQL data extraction and A/B testing to inform business decision-making for a global retailer.


Main Files:

student.ipynb - Notebook including data, methodologies, and modeling around hypothesis testing

Mod_3_preso.pdf - presentation summarizing findings for a non-technical audience

Additional Files: Blog Post - Categorical Data


SQL Database Table Schematic Layout:

# !pip install -U fsds_100719
# from fsds_100719.imports import *

import pandas as pd
import functions as fn

## Uncomment the line below to see the source code for the imported functions
# fs.ihelp(Cohen_d,False),fs.ihelp(find_outliers_IQR,False), fs.ihelp(find_outliers_Z,False)
import sqlite3
connect = sqlite3.connect('Northwind_small.sqlite')
cur = connect.cursor()

List of Tables:

cur.execute("""SELECT name FROM sqlite_master WHERE type='table';""")
df_tables = pd.DataFrame(cur.fetchall(), columns=['Table'])
df_tables
Table
0 Employee
1 Category
2 Customer
3 Shipper
4 Supplier
5 Order
6 Product
7 OrderDetail
8 CustomerCustomerDemo
9 CustomerDemographic
10 Region
11 Territory
12 EmployeeTerritory
cur.execute("""SELECT * FROM `Order`;""")
df = pd.DataFrame(cur.fetchall(), columns=[x[0] for x in cur.description])
df.head()
Id CustomerId EmployeeId OrderDate RequiredDate ShippedDate ShipVia Freight ShipName ShipAddress ShipCity ShipRegion ShipPostalCode ShipCountry
0 10248 VINET 5 2012-07-04 2012-08-01 2012-07-16 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye Reims Western Europe 51100 France
1 10249 TOMSP 6 2012-07-05 2012-08-16 2012-07-10 1 11.61 Toms Spezialitäten Luisenstr. 48 Münster Western Europe 44087 Germany
2 10250 HANAR 4 2012-07-08 2012-08-05 2012-07-12 2 65.83 Hanari Carnes Rua do Paço, 67 Rio de Janeiro South America 05454-876 Brazil
3 10251 VICTE 3 2012-07-08 2012-08-05 2012-07-15 1 41.34 Victuailles en stock 2, rue du Commerce Lyon Western Europe 69004 France
4 10252 SUPRD 4 2012-07-09 2012-08-06 2012-07-11 2 51.30 Suprêmes délices Boulevard Tirou, 255 Charleroi Western Europe B-6000 Belgium
## looking at dates to get an idea of timeframe
df['OrderDate']=pd.to_datetime(df['OrderDate'])
df['OrderDate'].describe()
count                     830
unique                    480
top       2014-02-26 00:00:00
freq                        6
first     2012-07-04 00:00:00
last      2014-05-06 00:00:00
Name: OrderDate, dtype: object

Timespan is ~2 years: July 4, 2012 - May 6, 2014

HYPOTHESIS 1

Does discount amount have a statistically significant effect on the quantity of a product in an order? If so, at what level(s) of discount?

  • $H_0$: There is no statistically significant effect on the quantity of a product in an order in relation to a discount amount.
  • $H_1$: Discounts have a statistically significant effect on the quantity of a product in an order.
  • $H_{1a}$: Certain discount values have a greater effect than others.

Importing and inspecting data from OrderDetail table:

This table from the Northwind database includes order information on:

1) Quantity
2) Discount 

Once imported, the data will be grouped by discount level, and the mean quantity sold for each group will be compared against the others to evaluate whether they are statistically significantly different. This will determine whether the null hypothesis can be rejected at a 5% risk of reporting a false positive.

cur.execute("""SELECT * FROM OrderDetail;""")
df = pd.DataFrame(cur.fetchall(), columns=[x[0] for x in cur.description])
df.head()
Id OrderId ProductId UnitPrice Quantity Discount
0 10248/11 10248 11 14.0 12 0.0
1 10248/42 10248 42 9.8 10 0.0
2 10248/72 10248 72 34.8 5 0.0
3 10249/14 10249 14 18.6 9 0.0
4 10249/51 10249 51 42.4 40 0.0
# exploring dataset
specs = df.describe()
specs
OrderId ProductId UnitPrice Quantity Discount
count 2155.000000 2155.000000 2155.000000 2155.000000 2155.000000
mean 10659.375870 40.793039 26.218520 23.812993 0.056167
std 241.378032 22.159019 29.827418 19.022047 0.083450
min 10248.000000 1.000000 2.000000 1.000000 0.000000
25% 10451.000000 22.000000 12.000000 10.000000 0.000000
50% 10657.000000 41.000000 18.400000 20.000000 0.000000
75% 10862.500000 60.000000 32.000000 30.000000 0.100000
max 11077.000000 77.000000 263.500000 130.000000 0.250000

General Observation: Pricing ranges from $2 to $263.50 with an average price of $26.22 and an average of ~24 items ordered per line item; there are no null values.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2155 entries, 0 to 2154
Data columns (total 6 columns):
Id           2155 non-null object
OrderId      2155 non-null int64
ProductId    2155 non-null int64
UnitPrice    2155 non-null float64
Quantity     2155 non-null int64
Discount     2155 non-null float64
dtypes: float64(2), int64(3), object(1)
memory usage: 101.1+ KB
len(df['OrderId'].unique())
830

Initial Visual Inspection:

From this dataset of 2155 line items that span 830 orders, the average quantity ordered is 24 regardless of discount; the minimum ordered is 1 and the maximum is 130, although the IQR is between 10 and 30.

qty = df['Quantity']
qty_specs = qty.describe()
qty_specs
count    2155.000000
mean       23.812993
std        19.022047
min         1.000000
25%        10.000000
50%        20.000000
75%        30.000000
max       130.000000
Name: Quantity, dtype: float64
qty_mu = round(qty_specs['mean'],0)
n = len(df)
print(f'The average quantity ordered from this sample is : {qty_mu}')
print(f'There are {n} line items in this sample.')
The average quantity ordered from this sample is : 24.0
There are 2155 line items in this sample.
d =list(df['Discount'].unique())
d
# Discounts are as follows:
[0.0, 0.15, 0.05, 0.2, 0.25, 0.1, 0.02, 0.03, 0.04, 0.06, 0.01]
import matplotlib.pyplot as plt
import seaborn as sns
sns.distplot(qty)
plt.axvline(qty_mu, label='Qty Mean', color='purple')
plt.show()
<Figure size 640x480 with 1 Axes>
import seaborn as sns
from ipywidgets import interact
@interact
def plt_discounts(d=d):
    sns.distplot(df.groupby('Discount').get_group(d)['Quantity'])
    plt.axvline(qty_mu, color='purple')
    
dfa = df.groupby('Discount').count()['Quantity']
display(dfa)
Discount
0.00    1317
0.01       1
0.02       2
0.03       3
0.04       1
0.05     185
0.06       1
0.10     173
0.15     157
0.20     161
0.25     154
Name: Quantity, dtype: int64

The sample sizes associated with discounts .01, .02, .03, .04, and .06 are nominal (fewer than 4 each) and will be dropped, since they provide no normal or comparable dataset with which to evaluate their impact against the other groups.

discs = {}
for disc in df['Discount'].unique():
    discs[disc] = df.groupby('Discount').get_group(disc)['Quantity']
for k,v in discs.items():
    print(k)
0.0
0.15
0.05
0.2
0.25
0.1
0.02
0.03
0.04
0.06
0.01
l=[.01,.02,.03,.04,.06]
oneper = discs.pop(.01)
twoper = discs.pop(.02)
threeper = discs.pop(.03)
fourper = discs.pop(.04)
sixper = discs.pop(.06)
#visualizing distributions
fig, ax = plt.subplots(figsize=(10,5))
for k,v in discs.items():
    sns.distplot(v,label=v)

plt.title('Quantity Distribution')
plt.legend()
print('Distributions appear roughly equal.')
Distributions appear roughly equal.

png

Initial Observations:

Datatype is numeric.

The average quantity ordered in this sample is 24. There are 2155 line items in this sample.

Discounts are as follows: 0.0, 0.15, 0.05, 0.2, 0.25, 0.1, 0.02, 0.03, 0.04, 0.06, 0.01

The majority of product purchases are made without a discount (1317/2155, or 61%); the remaining discount levels, in order of frequency, are 5%, 10%, 20%, 15%, and 25%.

For discounts 1%, 2%, 3%, 4%, and 6%, the sample sizes were too small to evaluate their impact on the whole, so this data was removed from further testing.

Overall, the distributions appear roughly similar across discount levels.

Since we are comparing multiple discount levels to inspect their impact on quantity ordered, an ANOVA or Kruskal-Wallis test will be run depending on how the assumptions are met (a compact sketch of this test-selection logic follows the list below):

Assumptions for ANOVA Testing (see the corresponding sections below):

  1. No significant outliers

    • Upon a quick visual inspection, the distribution is skewed and there appear to be some outliers.
  2. Equal variance

    • Levene's test demonstrates the groups do NOT have equal variance.
  3. Normality (if n > 15)

    • Met for discounts 5%, 10%, 15%, 20%, and 25%, since each group has n > 15.
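The decision logic above can be summarized in a small helper. This is an illustrative sketch only (the function name `compare_groups` and its structure are not part of the original notebook); it assumes `scipy.stats` and a dictionary of group Series like the `discs` dictionary built below.

```python
import scipy.stats as stats

def compare_groups(groups, alpha=0.05):
    """Choose between one-way ANOVA and Kruskal-Wallis based on Levene's test
    for equal variance, then return the chosen test's name and p-value."""
    samples = list(groups.values())
    _, p_levene = stats.levene(*samples)
    if p_levene < alpha:   # unequal variances -> use the non-parametric test
        name, result = 'Kruskal-Wallis', stats.kruskal(*samples)
    else:                  # variances roughly equal -> one-way ANOVA
        name, result = 'ANOVA', stats.f_oneway(*samples)
    return name, result.pvalue

# Example usage once the `discs` dictionary (below) is available:
# name, p = compare_groups(discs)
# print(name, 'reject H0' if p < 0.05 else 'fail to reject H0')
```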

Assumption 1: Outliers

Evaluation and removal via Z-Score testing:

for disc, disc_data in discs.items():
    idx_outs = fn.find_outliers_Z(disc_data)
    print(f'Found {idx_outs.sum()} outliers in Discount Group {disc}')
    discs[disc] = disc_data[~idx_outs]
print('\n All of these outliers were removed')
Found 20 outliers in Discount Group 0.0
Found 2 outliers in Discount Group 0.15
Found 3 outliers in Discount Group 0.05
Found 2 outliers in Discount Group 0.2
Found 3 outliers in Discount Group 0.25
Found 3 outliers in Discount Group 0.1

 All of these outliers were removed
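`fn.find_outliers_Z` is imported from the local `functions.py` module, whose source is not reproduced in this README. As a rough guide, a z-score outlier helper along these lines would behave the same way; the 3-standard-deviation threshold and the exact signature are assumptions, not the author's code.

```python
import numpy as np
import pandas as pd
from scipy import stats

def find_outliers_Z_sketch(data, thresh=3.0):
    """Return a boolean Series flagging values whose absolute z-score
    exceeds `thresh` (assumed to be 3 standard deviations)."""
    data = pd.Series(data)
    zscores = np.abs(stats.zscore(data))
    return pd.Series(zscores > thresh, index=data.index)

# Mirrors the loop above:
# idx_outs = find_outliers_Z_sketch(discs[0.05])
# cleaned = discs[0.05][~idx_outs]
```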

Assumption 2: Equal Variance

Levene's test conducted on the cleaned dataset.

#preparing data for levene's testing
datad = []
for k,v in discs.items():
    datad.append(v)
import scipy.stats as stats
stat,p = stats.levene(discs[0.0],discs[0.05],discs[0.1],discs[0.25], discs[0.15], discs[0.20])
print(f"Levene's test for equal variance p-value: {round(p,4)}")
sig = 'do NOT' if p < .05 else 'DO'

print(f'The groups {sig} have equal variance')
Levene's test for equal variance p-value: 0.0003
The groups do NOT have equal variance

Since the groups do not have equal variance, a Kruskal-Wallis test will be conducted.

Looking at sample sizes to determine whether normality needs to be tested.

Assumption 3: Normality

First, checking sample sizes, since the normality assumption depends on sample size: with 2-9 groups, each group should have n >= 15.

For Discounts of 5%, 10%, 15%, 20% and 25% n>15

n = []
for disc, disc_data in discs.items():
    print(f'There are {len(disc_data)} samples in the {disc} discount group.')    
    n.append(len(disc_data)>15)
if all(n):
    print('\nAll samples are >15: Normality Assumption Criterion is met.')
There are 1297 samples in the 0.0 discount group.
There are 155 samples in the 0.15 discount group.
There are 182 samples in the 0.05 discount group.
There are 159 samples in the 0.2 discount group.
There are 151 samples in the 0.25 discount group.
There are 170 samples in the 0.1 discount group.

All samples are >15: Normality Assumption Criterion is met.

Kruskal-Wallis Testing:

stat, p = stats.kruskal(discs[0.0],discs[0.05],discs[0.1],discs[0.25], discs[0.15], discs[0.20])
print(f"Kruskal test p value: {round(p,4)}")
if p < .05:
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')
    
Kruskal test p value: 0.0
Reject the null hypothesis

ANOVA Testing for comparison

stat, p = stats.f_oneway(*datad)
print(f"ANOVA test p value: {round(p,4)}")
if p < .05:
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')
ANOVA test p value: 0.0
Reject the null hypothesis

Post-Hoc Testing:
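`fn.prep_data_for_tukeys` also lives in `functions.py` and is not shown here. Judging from its output below, it stacks a dictionary of group Series into the long 'data'/'group' format expected by `pairwise_tukeyhsd`; a minimal sketch under that assumption:

```python
import pandas as pd

def prep_data_for_tukeys_sketch(group_dict):
    """Stack a {group_label: Series of values} dict into a long-format
    DataFrame with 'data' and 'group' columns for pairwise_tukeyhsd."""
    frames = [pd.DataFrame({'data': values, 'group': str(label)})
              for label, values in group_dict.items()]
    return pd.concat(frames)

# disc_df = prep_data_for_tukeys_sketch(discs)
```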

disc_df = fn.prep_data_for_tukeys(discs)
disc_df
data group
0 12.0 0.0
1 10.0 0.0
2 5.0 0.0
3 9.0 0.0
4 40.0 0.0
... ... ...
2095 30.0 0.1
2096 77.0 0.1
2098 25.0 0.1
2099 4.0 0.1
2135 2.0 0.1

2114 rows × 2 columns

disc_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2114 entries, 0 to 2135
Data columns (total 2 columns):
data     2114 non-null float64
group    2114 non-null object
dtypes: float64(1), object(1)
memory usage: 49.5+ KB
d =list(disc_df['group'].unique())
d
['0.0', '0.15', '0.05', '0.2', '0.25', '0.1']
import statsmodels.api as sms
tukey = sms.stats.multicomp.pairwise_tukeyhsd(disc_df['data'],disc_df['group'])
tukey.summary()
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj lower upper reject
0.0 0.05 6.0639 0.001 2.4368 9.691 True
0.0 0.1 2.9654 0.2098 -0.7723 6.7031 False
0.0 0.15 6.9176 0.001 3.0233 10.8119 True
0.0 0.2 5.6293 0.001 1.7791 9.4796 True
0.0 0.25 6.1416 0.001 2.2016 10.0817 True
0.05 0.1 -3.0985 0.4621 -7.9861 1.789 False
0.05 0.15 0.8537 0.9 -4.1547 5.862 False
0.05 0.2 -0.4346 0.9 -5.4088 4.5396 False
0.05 0.25 0.0777 0.9 -4.9663 5.1218 False
0.1 0.15 3.9522 0.2311 -1.1368 9.0412 False
0.1 0.2 2.6639 0.6409 -2.3915 7.7193 False
0.1 0.25 3.1762 0.4872 -1.9479 8.3004 False
0.15 0.2 -1.2883 0.9 -6.4605 3.884 False
0.15 0.25 -0.7759 0.9 -6.0154 4.4635 False
0.2 0.25 0.5123 0.9 -4.6945 5.7191 False

There is a statistically significant effect on quantity purchased based on discount. The discount levels below differ significantly from the no-discount (0%) group:
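`fn.tukey_df` converts the statsmodels Tukey result into a plain DataFrame so it can be filtered on the `reject` column. Its source isn't shown in this README; a common way to achieve the same thing (an assumption about the helper, not its actual implementation) is to read the rows out of the summary table:

```python
import pandas as pd

def tukey_df_sketch(tukey_result):
    """Turn TukeyHSDResults into a DataFrame by reading its summary table."""
    table = tukey_result.summary()   # statsmodels SimpleTable
    rows = table.data                # first row holds the column headers
    return pd.DataFrame(rows[1:], columns=rows[0])

# disc_tukey = tukey_df_sketch(tukey)
```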

disc_tukey = fn.tukey_df(tukey)
disc_tukey_trues = disc_tukey.loc[disc_tukey['reject']==True]
disc_tukey_trues
group1 group2 meandiff p-adj lower upper reject
0 0.0 0.05 6.0639 0.001 2.4368 9.6910 True
2 0.0 0.15 6.9176 0.001 3.0233 10.8119 True
3 0.0 0.2 5.6293 0.001 1.7791 9.4796 True
4 0.0 0.25 6.1416 0.001 2.2016 10.0817 True
disc_df['group']=disc_df['group'].astype(str)
d=disc_df['group']
@interact
def plt_discounts(d=d):
    sns.distplot(disc_df.groupby('group').get_group(d)['data'])
    plt.axvline(qty_mu, color='purple')
    
# use a distinct name so we don't shadow the scipy.stats module
data_stats = disc_df['data'].describe()
dataqty = data_stats['mean']
print(f'Revised average quantity sold across all orders {round(dataqty)}')
zeros = disc_df.groupby('group').get_group('0.0')['data']
nodiscav = zeros.mean()
print(f'Average quantity sold for orders where no discount was extended was: {round(nodiscav)}')
Revised average quantity sold across all orders 23.0
Average quantity sold for orders where no discount was extended was: 21.0
for k,v in discs.items():
    print(f'The average quantity sold for {k} discount is {round(v.mean())}')
The average quantity sold for 0.0 discount is 21.0
The average quantity sold for 0.15 discount is 27.0
The average quantity sold for 0.05 discount is 27.0
The average quantity sold for 0.2 discount is 26.0
The average quantity sold for 0.25 discount is 27.0
The average quantity sold for 0.1 discount is 24.0
data_mu = disc_df['data'].mean()
print(f'The average qty purchased regardless of discount or none offered is {round(data_mu)}')
The average qty purchased regardless of discount or none offered is 23.0

Various EDA to understand distributions and remaining potential outliers

The following hex-bin visualization illustrates the density of data in the 0% category. Visually, it appears that roughly equal quantities are purchased across discount levels, but this may be an artifact of light markings relative to the sample sizes. It is clearly not the best choice for EDA, and further exploration is required.

disc_df['group'] = disc_df['group'].astype(float)
sns.jointplot(data=disc_df, x='group', y='data', kind='hex')
plt.ylabel('Qty')
Text(336.9714285714286, 0.5, 'Qty')

png

Visual on distributions and potential remaining outliers:

# boxen plot
sns.catplot(data=disc_df, x='group', y='data', kind='boxen')
plt.gca().axhline(26.75, color='k')  # reference line drawn on the catplot's axes
plt.show()

png

This plot may not best suit a non-technical audience, since the additional information could cause confusion. However, the boxen plot above clearly illustrates the distributions as well as remaining potential outliers in the .05, .15, .2, and .25 groups. These outliers will not be removed at present, in order to preserve as much of the initial dataset as possible. Sample sizes can be referenced under 'Assumption 3: Normality'.

fig, ax = plt.subplots(figsize=(10,8))
sns.barplot(data=disc_df, x='group', y='data', ci=68)
#plt.axhline(data_mu, linestyle='--', color='darkgray')
plt.axhline(26.75, linestyle='--', color='lightblue')
plt.title('Average Quantity Purchased at Varying Discount Levels', fontsize=20)
plt.xlabel('Discount Extended')
plt.ylabel('Average Quantity Purchased')
Text(0, 0.5, 'Average Quantity Purchased')

png

Effect Sizes:

import numpy as np  # needed for np.abs below

for disc, disc_data in discs.items():
    es = fn.Cohen_d(zeros, disc_data)
    print(f'The effect size for {disc} is {round(np.abs(es),4)}')
The effect size for 0.0 is 0.0
The effect size for 0.15 is 0.453
The effect size for 0.05 is 0.3949
The effect size for 0.2 is 0.3751
The effect size for 0.25 is 0.4109
The effect size for 0.1 is 0.1982
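`fn.Cohen_d` is another `functions.py` helper whose source isn't reproduced here. The standard pooled-standard-deviation formulation of Cohen's d, which it presumably follows, looks like this (a sketch, not the author's code):

```python
import numpy as np

def cohen_d_sketch(group1, group2):
    """Cohen's d: difference in group means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / pooled_sd

# es = cohen_d_sketch(zeros, discs[0.15])
```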

Hypothesis 1 Findings and Recommendation:

Findings:

Rejecting the null hypothesis:

  • 𝐻0: There is no statistically significant effect on the quantity of a product in an order in relation to a discount amount.

Alternative hypotheses:

*  𝐻1: Discounts have a statistically significant effect on the quantity of a product in an order.
*  𝐻1𝑎: Certain discount values have a greater effect than others (see below for findings).

A 10% discount had no statistically significant effect on the quantity purchased.

Discounts extended at 5%, 15%, 20%, and 25% each have a statistically significant effect on quantity sold compared to no discount (adjusted p-value of .001, i.e., a 0.1 percent chance of the difference arising by chance), and are statistically indistinguishable from one another. Each has a somewhat different effect size compared with orders placed with no discount extended:

| Discount | Avg Qty | Effect Size | Effect |
|----------|---------|-------------|--------|
| 5%  | 27 | .3949 | Small-to-medium |
| 15% | 27 | .4530 | Small-to-medium |
| 20% | 26 | .3751 | Small-to-medium |
| 25% | 27 | .4109 | Small-to-medium |

Additional notes: For discounts 1%, 2%, 3%, 4%, and 6% that were included in the original dataset, the sample sizes were too small to evaluate their impact on the whole. This data was removed from further testing.

With additional outliers removed from each of the discount groups, the revised average quantity purchased was 23 overall.

Recommendation:

While larger discounts did demonstrate a significant effect on quantity purchased, smaller discounts held a statistically equal effect. To drive higher quantities purchased while preserving larger profit margins, offer the smaller discount.

HYPOTHESIS 2:

Do some categories generate more revenue than others? If so, which ones?

  • $H_0$: All categories generate equal revenues.
  • $H_1$: Certain categories sell at statistically higher rates of revenue than others.

Importing and inspecting data from Product and OrderDetail tables:

These tables include product information, including:

  1. Categories

  2. Pricing and discount information to generate revenues

##clean sql notation
cur.execute("""SELECT 
                ProductId, 
                ProductName, 
                od.UnitPrice,
                CategoryID, 
                CategoryName, 
                Discount, 
                Quantity
                FROM Product AS p
                JOIN OrderDetail as od
                ON p.ID = Od.ProductId
                JOIN Category as c 
                ON c.ID = p.CategoryID;""")
catavs = pd.DataFrame(cur.fetchall(), columns=[x[0] for x in cur.description])
catavs.head()
ProductId ProductName UnitPrice CategoryId CategoryName Discount Quantity
0 11 Queso Cabrales 14.0 4 Dairy Products 0.0 12
1 42 Singaporean Hokkien Fried Mee 9.8 5 Grains/Cereals 0.0 10
2 72 Mozzarella di Giovanni 34.8 4 Dairy Products 0.0 5
3 14 Tofu 18.6 7 Produce 0.0 9
4 51 Manjimup Dried Apples 42.4 7 Produce 0.0 40
# Revenue is calculated by applying any discount to the UnitPrice and multiplying by the quantity ordered.
catavs['Revenue'] = (catavs['UnitPrice'] * (1 - catavs['Discount']))*catavs['Quantity']
avcrev = catavs['Revenue'].mean()
print(round(avcrev, 2))
cg = catavs['CategoryName'].unique()

587.37

Initial Visual Inspection and Observations:

There are 8 different categories sold in this company that represent 77 products.

The average revenue generated across all categories is $587.00

Visually, it appears that three categories generate significantly higher revenues than the others; additional testing will demonstrate their significance and effect.

fig, ax = plt.subplots(figsize=(12,8))
sns.barplot(data=catavs, x='CategoryName', y='Revenue', ci=68, ax=ax)
plt.title('Initial Inspection: Average Revenue Generated Across All Categories', fontsize=18)
plt.axhline(avcrev,linestyle="--", color='k', linewidth=.8 )
plt.xlabel('Category')
plt.ylabel('Revenue')
plt.show()

png

catcount = len(catavs['CategoryId'].unique())
avrev = catavs['Revenue'].mean()
print(f'There are {catcount} different categories sold in this company')
print(f'The average revenue generated across all categories is {round(avcrev,0)}')
There are 8 different categories sold in this company
The average revenue generated across all categories is 587.0
cats = {}
for cat in catavs['CategoryName'].unique():
    cats[cat] = catavs.groupby('CategoryName').get_group(cat)['Revenue']

The products are distributed across the categories as follows:

catsprods = {}
for cat in catavs['CategoryName'].unique():
    catsprods[cat] = catavs.groupby('CategoryName').get_group(cat)['ProductId'].unique()
for k,v in catsprods.items():
    print(f'There are {len(v)} products in the {k} Category')
There are 10 products in the Dairy Products Category
There are 7 products in the Grains/Cereals Category
There are 5 products in the Produce Category
There are 12 products in the Seafood Category
There are 12 products in the Condiments Category
There are 13 products in the Confections Category
There are 12 products in the Beverages Category
There are 6 products in the Meat/Poultry Category

Assumption 1: Outliers

Outliers removed via z-score testing.

for cat, cat_data in cats.items():
    idx_outs = fn.find_outliers_Z(cat_data)
    print(f'Found {idx_outs.sum()} outliers in Category # {cat}')
    cats[cat] = cat_data[~idx_outs]
print('\n All of these outliers were removed')
Found 5 outliers in Category # Dairy Products
Found 6 outliers in Category # Grains/Cereals
Found 1 outliers in Category # Produce
Found 8 outliers in Category # Seafood
Found 4 outliers in Category # Condiments
Found 9 outliers in Category # Confections
Found 12 outliers in Category # Beverages
Found 4 outliers in Category # Meat/Poultry

 All of these outliers were removed
pids = catavs['ProductName'].unique()
pids
print(f'There are {len(pids)} products')
There are 77 products

Assumption 2: Equal Variance

Testing cleaned dataset for equal variance.

Since the groups do NOT have equal variance, a Kruskal-Wallis test will be conducted.

datac = []
for k,v in cats.items():
    datac.append(v)
import scipy.stats as stats
stat,p = stats.levene(*datac)
print(f"Levene's test for equal variance p-value: {round(p,4)}")
sig = 'do NOT' if p < .05 else 'DO'

print(f'The groups {sig} have equal variance')
Levene's test for equal variance p-value: 0.0
The groups do NOT have equal variance

Assumption 3: Normality

All groups have more than 15 samples, so the normality assumption is met.

n = []

for cat, cat_data in cats.items():
    print(f'There are {len(cat_data)} samples in the {cat} category.')
    n.append(len(cat_data)>15)
if all(n):
    print('\nAll samples are >15: Normality Assumption Criterion is met.')

There are 361 samples in the Dairy Products category.
There are 190 samples in the Grains/Cereals category.
There are 135 samples in the Produce category.
There are 322 samples in the Seafood category.
There are 212 samples in the Condiments category.
There are 325 samples in the Confections category.
There are 392 samples in the Beverages category.
There are 169 samples in the Meat/Poultry category.

All samples are >15: Normality Assumption Criterion is met.

Kruskal-Wallis Testing:

Result: reject the null hypothesis.

stat, p = stats.kruskal(*datac)
print(f"Kruskal test p value: {round(p,4)}")
if p < .05:
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')
Kruskal test p value: 0.0
Reject the null hypothesis

Post-Hoc Testing:

cat_df = fn.prep_data_for_tukeys(cats)
cat_df
data group
0 168.00 Dairy Products
2 174.00 Dairy Products
12 47.50 Dairy Products
13 1088.00 Dairy Products
14 200.00 Dairy Products
... ... ...
2096 2702.70 Meat/Poultry
2098 738.00 Meat/Poultry
2099 86.40 Meat/Poultry
2102 111.75 Meat/Poultry
2148 48.00 Meat/Poultry

2106 rows × 2 columns

Visual Inspection Post Data Cleaning:

@interact
def plt_discounts(d=cg):
    sns.distplot(cat_df.groupby('group').get_group(d)['data'])
    plt.axvline(cat_df['data'].mean(), color='purple')
    plt.title('Average Revenue Generated by Category')
    plt.ylabel('Distribution')
tukeyc = sms.stats.multicomp.pairwise_tukeyhsd(cat_df['data'],cat_df['group'])
tukeyc.summary()
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj lower upper reject
Beverages Condiments 55.7712 0.9 -78.858 190.4005 False
Beverages Confections 27.0669 0.9 -91.4026 145.5364 False
Beverages Dairy Products 194.0913 0.001 78.8965 309.286 True
Beverages Grains/Cereals 21.4034 0.9 -118.1928 160.9996 False
Beverages Meat/Poultry 411.042 0.001 265.7215 556.3625 True
Beverages Produce 296.1055 0.001 138.516 453.695 True
Beverages Seafood -46.1557 0.9 -164.9266 72.6151 False
Condiments Confections -28.7043 0.9 -168.1193 110.7106 False
Condiments Dairy Products 138.32 0.0449 1.6769 274.9631 True
Condiments Grains/Cereals -34.3678 0.9 -192.129 123.3933 False
Condiments Meat/Poultry 355.2707 0.001 192.4225 518.119 True
Condiments Produce 240.3342 0.001 66.4494 414.2191 True
Condiments Seafood -101.927 0.3439 -241.5981 37.7441 False
Confections Dairy Products 167.0244 0.001 46.2712 287.7776 True
Confections Grains/Cereals -5.6635 0.9 -149.8807 138.5537 False
Confections Meat/Poultry 383.9751 0.001 234.2101 533.7401 True
Confections Produce 269.0386 0.001 107.3415 430.7357 True
Confections Seafood -73.2226 0.6099 -197.392 50.9467 False
Dairy Products Grains/Cereals -172.6879 0.0054 -314.2272 -31.1485 True
Dairy Products Meat/Poultry 216.9507 0.001 69.7626 364.1389 True
Dairy Products Produce 102.0142 0.5181 -57.2991 261.3276 False
Dairy Products Seafood -240.247 0.001 -361.2959 -119.1982 True
Grains/Cereals Meat/Poultry 389.6386 0.001 222.6607 556.6164 True
Grains/Cereals Produce 274.7021 0.001 96.9438 452.4604 True
Grains/Cereals Seafood -67.5591 0.831 -212.024 76.9057 False
Meat/Poultry Produce -114.9365 0.5359 -297.2246 67.3516 False
Meat/Poultry Seafood -457.1977 0.001 -607.2012 -307.1942 True
Produce Seafood -342.2612 0.001 -504.1792 -180.3432 True
for cat, cat_data in cats.items():
    print(f'The average revenue for {cat} is ${round(cat_data.mean(),2)}')
The average revenue for Dairy Products is $593.86
The average revenue for Grains/Cereals is $421.17
The average revenue for Produce is $695.87
The average revenue for Seafood is $353.61
The average revenue for Condiments is $455.54
The average revenue for Confections is $426.83
The average revenue for Beverages is $399.77
The average revenue for Meat/Poultry is $810.81

Hypothesis 2: A Clean Visualization

#help(cat_df.groupby('group').mean().sort_values('data').plot)
index = list(cat_df.groupby('group').mean().sort_values('data', ascending=False).index)
index
['Meat/Poultry',
 'Produce',
 'Dairy Products',
 'Condiments',
 'Confections',
 'Grains/Cereals',
 'Beverages',
 'Seafood']
import matplotlib.ticker as ticker
plt.style.use('seaborn')
fig, ax = plt.subplots(figsize=(10,8))
sns.barplot(data=cat_df, x=round(cat_df['data']), y=cat_df['group'],order=index, ci=68, palette='Wistia_r')
formatter = ticker.FormatStrFormatter('$%1.2f')
ax.xaxis.set_major_formatter(formatter)
plt.axvline(cat_df['data'].mean(), linestyle='--', color='gray')
plt.ylabel('Category')
plt.xlabel('Average Revenue Generated')
plt.xticks(rotation=45)
plt.title('Average Revenue* Generated By Category', fontsize=20)
Text(0.5, 1.0, 'Average Revenue* Generated By Category')

png

f, ax = plt.subplots(figsize=(10,8))
#sns.despine(f, left=True, bottom=True)
sns.catplot(data=cat_df, x='group', y='data', ax=ax)
<seaborn.axisgrid.FacetGrid at 0x201332e4c88>

png

png

tukeycdf = fn.tukey_df(tukeyc)

The following group pairs are statistically similar; for them we fail to reject the null hypothesis that the categories generate equal revenue:

tukeycfalse  = tukeycdf.loc[tukeycdf['reject']==False]
tukeycfalse
group1 group2 meandiff p-adj lower upper reject
0 Beverages Condiments 55.7712 0.9000 -78.8580 190.4005 False
1 Beverages Confections 27.0669 0.9000 -91.4026 145.5364 False
3 Beverages Grains/Cereals 21.4034 0.9000 -118.1928 160.9996 False
6 Beverages Seafood -46.1557 0.9000 -164.9266 72.6151 False
7 Condiments Confections -28.7043 0.9000 -168.1193 110.7106 False
9 Condiments Grains/Cereals -34.3678 0.9000 -192.1290 123.3933 False
12 Condiments Seafood -101.9270 0.3439 -241.5981 37.7441 False
14 Confections Grains/Cereals -5.6635 0.9000 -149.8807 138.5537 False
17 Confections Seafood -73.2226 0.6099 -197.3920 50.9467 False
20 Dairy Products Produce 102.0142 0.5181 -57.2991 261.3276 False
24 Grains/Cereals Seafood -67.5591 0.8310 -212.0240 76.9057 False
25 Meat/Poultry Produce -114.9365 0.5359 -297.2246 67.3516 False

The following group pairs reject the null hypothesis:

tukeyctrues  = tukeycdf.loc[tukeycdf['reject']==True]
tukeyctrues
group1 group2 meandiff p-adj lower upper reject
2 Beverages Dairy Products 194.0913 0.0010 78.8965 309.2860 True
4 Beverages Meat/Poultry 411.0420 0.0010 265.7215 556.3625 True
5 Beverages Produce 296.1055 0.0010 138.5160 453.6950 True
8 Condiments Dairy Products 138.3200 0.0449 1.6769 274.9631 True
10 Condiments Meat/Poultry 355.2707 0.0010 192.4225 518.1190 True
11 Condiments Produce 240.3342 0.0010 66.4494 414.2191 True
13 Confections Dairy Products 167.0244 0.0010 46.2712 287.7776 True
15 Confections Meat/Poultry 383.9751 0.0010 234.2101 533.7401 True
16 Confections Produce 269.0386 0.0010 107.3415 430.7357 True
18 Dairy Products Grains/Cereals -172.6879 0.0054 -314.2272 -31.1485 True
19 Dairy Products Meat/Poultry 216.9507 0.0010 69.7626 364.1389 True
21 Dairy Products Seafood -240.2470 0.0010 -361.2959 -119.1982 True
22 Grains/Cereals Meat/Poultry 389.6386 0.0010 222.6607 556.6164 True
23 Grains/Cereals Produce 274.7021 0.0010 96.9438 452.4604 True
26 Meat/Poultry Seafood -457.1977 0.0010 -607.2012 -307.1942 True
27 Produce Seafood -342.2612 0.0010 -504.1792 -180.3432 True
def mult_Cohn_d(tukey_result_df, df_dict):
    '''Using a DataFrame of Tukey test results and a
    corresponding dictionary of group data, this function loops through
    each pairwise comparison and returns the adjusted p-value and Cohen's d.'''
    import pandas as pd
    
    res = [['g1', 'g2','padj', 'd']]
    for i, row in tukey_result_df.iterrows():
        g1 = row['group1']
        g2 = row['group2']
        padj = row['p-adj']
        d = fn.Cohen_d(df_dict[g1], df_dict[g2])

        res.append([g1, g2,padj, d])

    mdc = pd.DataFrame(res[1:], columns=res[0])
    return mdc

The table below lists the category pairs that reject the null hypothesis that all categories generate equal revenue. The padj column is the Tukey-adjusted p-value (the probability the difference is due to chance) and the d column shows the effect size (Cohen's d).

mult_Cohn_d(tukeyctrues, cats)
g1 g2 padj d
0 Beverages Dairy Products 0.0010 -0.351635
1 Beverages Meat/Poultry 0.0010 -0.596175
2 Beverages Produce 0.0010 -0.520603
3 Condiments Dairy Products 0.0449 -0.277491
4 Condiments Meat/Poultry 0.0010 -0.517399
5 Condiments Produce 0.0010 -0.490099
6 Confections Dairy Products 0.0010 -0.340630
7 Confections Meat/Poultry 0.0010 -0.600173
8 Confections Produce 0.0010 -0.560435
9 Dairy Products Grains/Cereals 0.0054 0.349940
10 Dairy Products Meat/Poultry 0.0010 -0.310545
11 Dairy Products Seafood 0.0010 0.525083
12 Grains/Cereals Meat/Poultry 0.0010 -0.563832
13 Grains/Cereals Produce 0.0010 -0.570886
14 Meat/Poultry Seafood 0.0010 0.754594
15 Produce Seafood 0.0010 0.798069

Hypothesis 2 Findings and Recommendation:

Findings:

Reject the null hypothesis:

  • 𝐻0 : All categories generate equal revenues.

Alternative hypothesis:

  • 𝐻1: Certain categories sell at statistically higher rates of revenue than others.

Top three revenue-generating categories: Meat/Poultry, Produce, and Dairy Products.

Statistically, Meat/Poultry and Produce were equivalent and ranked as the top sellers; Dairy Products and Produce were also statistically equal. (Definition of statistically equal: the Tukey test returned a reject value of False, indicating similar means, with a .05 chance of falsely being classified as such.)

| Category | Average Revenue |
|----------|-----------------|
| Meat/Poultry | $810.81 |
| Produce | $695.87 |
| Dairy Products | $593.86 |
| Condiments | $455.54 |
| Confections | $426.83 |
| Grains/Cereals | $421.17 |
| Beverages | $399.77 |
| Seafood | $353.61 |

The tables above outline how the various categories compare to each other. If the adjusted p-value in column padj is > .05, the two groups are statistically equal; conversely, if it is < .05, the two samples are statistically not equal. This is further examined by the effect size in column d: |d| = 0.2 is considered a 'small' effect size, 0.5 a 'medium' effect size, and 0.8 a 'large' effect size.
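For convenience, the rule-of-thumb thresholds cited above can be wrapped in a tiny helper. The function below is purely illustrative (it is not part of the notebook); the cutoffs are Cohen's conventional labels.

```python
def effect_size_label(d):
    """Map |d| onto Cohen's rule-of-thumb labels (0.2 small, 0.5 medium, 0.8 large)."""
    d = abs(d)
    if d < 0.2:
        return 'negligible'
    if d < 0.5:
        return 'small'
    if d < 0.8:
        return 'medium'
    return 'large'

# effect_size_label(-0.596175)  # -> 'medium'
```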

Notes: Revenue is calculated by applying any extended discount to the sale price and multiplying by the quantity sold.

Recommendation:

If there are additional products that align with the higher revenue-generating categories, those categories could be broadened to maximize revenue-generating potential.

Example: Meat/Poultry currently has 6 products; this could be expanded. Conversely, the Seafood category carries 12 products, which could be narrowed. Additional analysis could identify the best-selling seafood products, which would be preserved.

Knowing what revenue each category generates could help in setting discounts appropriately. However, since profit margins are unknown, that would need to be taken into consideration.

HYPOTHESIS 3

Do certain sales representatives sell more than others? Who are the top sellers?

$H0$: All sales representatives generate equal revenue.

$H1$: Some sales representatives generate more than others in revenue.

Importing and inspecting data from Product, OrderDetail, Order and Employee Tables

These tables include information on:

1) Product information including SalesPrice, Discount and Quantity Sold
2) Sales Representative Information
cur.execute("""SELECT 
                ProductID, 
                ProductName, 
                Discontinued,
                OrderID, 
                ProductID, 
                od.UnitPrice AS SalesPrice, 
                Quantity, 
                Discount,
                EmployeeID, 
                LastName, 
                FirstName, 
                e.Region, 
                ShippedDate
                FROM Product AS p
                JOIN OrderDetail AS od
                ON od.ProductID = p.Id 
                JOIN 'Order' AS o
                ON o.Id = od.OrderId
                JOIN Employee AS e
                ON o.EmployeeID = e.ID;""")
dfr = pd.DataFrame(cur.fetchall(), columns=[x[0] for x in cur.description])
dfr
ProductId ProductName Discontinued OrderId ProductId SalesPrice Quantity Discount EmployeeId LastName FirstName Region ShippedDate
0 11 Queso Cabrales 0 10248 11 14.00 12 0.00 5 Buchanan Steven British Isles 2012-07-16
1 42 Singaporean Hokkien Fried Mee 1 10248 42 9.80 10 0.00 5 Buchanan Steven British Isles 2012-07-16
2 72 Mozzarella di Giovanni 0 10248 72 34.80 5 0.00 5 Buchanan Steven British Isles 2012-07-16
3 14 Tofu 0 10249 14 18.60 9 0.00 6 Suyama Michael British Isles 2012-07-10
4 51 Manjimup Dried Apples 0 10249 51 42.40 40 0.00 6 Suyama Michael British Isles 2012-07-10
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2150 64 Wimmers gute Semmelknödel 0 11077 64 33.25 2 0.03 1 Davolio Nancy North America None
2151 66 Louisiana Hot Spiced Okra 0 11077 66 17.00 1 0.00 1 Davolio Nancy North America None
2152 73 Röd Kaviar 0 11077 73 15.00 2 0.01 1 Davolio Nancy North America None
2153 75 Rhönbräu Klosterbier 0 11077 75 7.75 4 0.00 1 Davolio Nancy North America None
2154 77 Original Frankfurter grüne Soße 0 11077 77 13.00 2 0.00 1 Davolio Nancy North America None

2155 rows × 13 columns

#Sales Revenue is calculated by multiplying the adjusted price (accounting for any discounts) times quantity
dfr['SaleRev'] = (dfr['SalesPrice'] * (1-dfr['Discount'])) * dfr['Quantity']
empcount = len(dfr['EmployeeId'].unique())
avrev = dfr['SaleRev'].mean()
print(f'There are {empcount} employees in this company associated with sales information')
print(f'The calculated average revenue generated by a sales representative in this dataset is ${round(avrev)}')
There are 9 employees in this company associated with sales information
The calculated average revenue generated by a sales representative in this dataset is $587.0

Hypothesis 3 Preliminary Visualizations:

fig, ax = plt.subplots(figsize=(8,5))
sns.distplot(dfr['SaleRev'], color='green')
plt.axvline(avrev, color='lightgreen' )
plt.xlabel('Sales Revenue')
plt.ylabel('Distribution')
plt.title('Average Sales Revenue Distribution', fontsize=18)
plt.show()

png

fig, ax = plt.subplots(figsize=(8,5))
sns.barplot(data=dfr, x='EmployeeId', y='SaleRev', ci=68, ax=ax)
plt.title('Average Revenue by Sales Representative', fontsize=16)
plt.axhline(avrev,linestyle="--", color='gray', linewidth=.6 )
plt.xlabel('Employee Id')
plt.ylabel('Sales Revenue')
plt.show()

png

Employees are listed in the table below. Although the employee names are unique, for the sake of data inspection we'll continue to use Employee Id as the unique identifier and reference this table to gather additional insight.

cur.execute("""SELECT ID, LastName, FirstName, Title, Region
                from Employee""")
empsdata = pd.DataFrame(cur.fetchall(), columns=[x[0] for x in cur.description])
empsdata
Id LastName FirstName Title Region
0 1 Davolio Nancy Sales Representative North America
1 2 Fuller Andrew Vice President, Sales North America
2 3 Leverling Janet Sales Representative North America
3 4 Peacock Margaret Sales Representative North America
4 5 Buchanan Steven Sales Manager British Isles
5 6 Suyama Michael Sales Representative British Isles
6 7 King Robert Sales Representative British Isles
7 8 Callahan Laura Inside Sales Coordinator North America
8 9 Dodsworth Anne Sales Representative British Isles
reps = {}
for rep in dfr['EmployeeId'].unique():
    reps[str(rep)] = dfr.groupby('EmployeeId').get_group(rep)['SaleRev']
fig, ax = plt.subplots(figsize=(10,5))
for k,v in reps.items():
    sns.distplot(v,label=v)

plt.title('Sales Revenue Distribution by Rep')
print('Distributions appear roughly equal and there appear to be outliers')
Distributions appear roughly equal and there appear to be outliers

png

Initial Observations:

Datatype is numeric in this sample of 2155 line items.

There are 9 employees in this company associated with sales information

The average revenue generated per order line by a sales representative is $587.00.

Initial visual inspection indicates roughly similar distributions of sales revenue, and more than half of the sales representatives reach the average. Additional testing will demonstrate whether any differences are significant.

Since we are comparing revenue across multiple sales representatives, an ANOVA (or a non-parametric alternative) will be run. Assumptions for ANOVA Testing:

  1. No significant outliers: upon a quick visual inspection, there appear to be some outliers that could be removed

  2. Equal variance

  3. Normality (if n > 15): not required, since all samples have more than 15 observations

Assumption 1: Outliers

for rep, rep_data in reps.items():
    idx_outs = fn.find_outliers_Z(rep_data)
    print(f'Found {idx_outs.sum()} outliers in Employee # {rep}')
    reps[rep] = rep_data[~idx_outs]
print('\n All of these outliers were removed')
Found 3 outliers in Employee # 5
Found 3 outliers in Employee # 6
Found 7 outliers in Employee # 4
Found 4 outliers in Employee # 3
Found 2 outliers in Employee # 9
Found 4 outliers in Employee # 1
Found 5 outliers in Employee # 8
Found 5 outliers in Employee # 2
Found 5 outliers in Employee # 7

 All of these outliers were removed
fig, ax = plt.subplots(figsize=(10,5))
for k,v in reps.items():
    sns.distplot(v,label=v)

plt.title('Revised: Sales Revenue Distribution by Rep')
print('Distributions appear roughly equal and the outliers are visibly removed in comparison with the previous plot')
Distributions appear roughly equal and the outliers are visibly removed in comparison with the previous plot

png

Assumption 2: Equal Variance

Results: the groups do NOT have equal variance.

#from functions import test_equal_variance
import scipy.stats as stats
datas = []
for k,v in reps.items():
    datas.append(v)
    
stat,p = stats.levene(*datas)
print(f"Levene's test for equal variance p-value: {round(p,4)}")
sig = 'do NOT' if p < .05 else 'DO'

print(f'The groups {sig} have equal variance')
      
          
Levene's test for equal variance p-value: 0.0143
The groups do NOT have equal variance

Assumption 3: Normality

All samples have n > 15, so the normality criterion is met.

n = []

for rep,samples in reps.items():
    print(f'There are {len(samples)} samples in the data set for Employee #{rep}.')
    n.append(len(samples)>15)
if all(n):
    print('\nAll samples are >15: Normality Assumption Criterion is met.')
    


There are 114 samples in the data set for Employee #5.
There are 165 samples in the data set for Employee #6.
There are 413 samples in the data set for Employee #4.
There are 317 samples in the data set for Employee #3.
There are 105 samples in the data set for Employee #9.
There are 341 samples in the data set for Employee #1.
There are 255 samples in the data set for Employee #8.
There are 236 samples in the data set for Employee #2.
There are 171 samples in the data set for Employee #7.

All samples are >15: Normality Assumption Criterion is met.
stat, p = stats.kruskal(*datas)
print(f"Kruskal test p value: {round(p,4)}")
Kruskal test p value: 0.2968
stat, p = stats.f_oneway(*datas)
print(stat,p)
2.180626766904163 0.026222185205141618
clean_data = fn.prep_data_for_tukeys(reps)
clean_data
data group
0 168.000 5
1 98.000 5
2 174.000 5
17 45.900 5
18 342.720 5
... ... ...
2077 390.000 7
2103 52.350 7
2104 386.400 7
2105 490.000 7
2123 232.085 7

2117 rows × 2 columns

Hypothesis 3: A Clean Visualization:

f, ax = plt.subplots(figsize=(8,4))
sns.set(style="whitegrid")
sns.barplot(data=clean_data, x='group', y='data', ci=68)
plt.xlabel('Employee Id')
plt.ylabel('Average Revenue Generated')
plt.axhline(clean_data['data'].mean(), linestyle="--", color='darkgray')

plt.title('Average Revenue Generated by Sales Representative', fontsize=20)
Text(0.5, 1.0, 'Average Revenue Generated by Sales Representative')

png

Observation: The plot above is misleading, since plotting only the means exaggerates differences between representatives. Further testing indicates there are no statistical differences between representatives in terms of revenue generated over this time span. The subsequent plot gives a more accurate depiction of what the testing demonstrates.

mu = clean_data['data'].mean()

sns.catplot(data=clean_data, x='group', y='data', 
            kind='swarm', height=6, aspect=1.5)
formatter = ticker.FormatStrFormatter('$%1.2f')
plt.axhline(mu, color='k', linestyle="-", lw=2)
plt.xlabel('Employee Id')
plt.ylabel('Average Total Revenue in US Dollars')
plt.title('Average Total Revenue by Sales Representative', fontsize=20)
print(f'Average revenue across all sales representatives: ${round(mu,2)}')
Average revenue across all sales representatives: $494.72

png

#datacleaning for additional testing:
clean_data['data'] = clean_data['data'].astype(float)
clean_data['group'] = clean_data['group'].astype(str)
clean_data
data group
0 168.000 5
1 98.000 5
2 174.000 5
17 45.900 5
18 342.720 5
... ... ...
2077 390.000 7
2103 52.350 7
2104 386.400 7
2105 490.000 7
2123 232.085 7

2117 rows × 2 columns

The Tukey results below show that no pairwise comparison is significant; we fail to reject the null hypothesis.

tukeys = sms.stats.multicomp.pairwise_tukeyhsd(clean_data['data'], clean_data['group'])
tukeys.summary()
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj lower upper reject
1 2 70.6795 0.7896 -67.9146 209.2736 False
1 3 62.1979 0.8328 -65.5036 189.8995 False
1 4 -2.0207 0.9 -121.7839 117.7426 False
1 5 2.7041 0.9 -174.3746 179.7827 False
1 6 -81.4678 0.7616 -236.6872 73.7517 False
1 7 68.5247 0.9 -84.8486 221.8981 False
1 8 -24.7293 0.9 -160.2376 110.779 False
1 9 101.7408 0.7013 -80.9369 284.4186 False
2 3 -8.4815 0.9 -149.2052 132.2421 False
2 4 -72.7002 0.725 -206.2617 60.8614 False
2 5 -67.9754 0.9 -254.6631 118.7123 False
2 6 -152.1472 0.1034 -318.2452 13.9507 False
2 7 -2.1547 0.9 -166.5288 162.2193 False
2 8 -95.4088 0.5349 -243.2531 52.4356 False
2 9 31.0613 0.9 -160.9455 223.0682 False
3 4 -64.2186 0.7606 -186.4399 58.0027 False
3 5 -59.4939 0.9 -238.2441 119.2564 False
3 6 -143.6657 0.1049 -300.7895 13.4581 False
3 7 6.3268 0.9 -148.9735 161.6271 False
3 8 -86.9273 0.5613 -224.6128 50.7583 False
3 9 39.5429 0.9 -144.7557 223.8415 False
4 5 4.7247 0.9 -168.4434 177.8929 False
4 6 -79.4471 0.7575 -230.19 71.2959 False
4 7 70.5454 0.8574 -78.2959 219.3867 False
4 8 -22.7086 0.9 -153.0653 107.648 False
4 9 103.7615 0.6578 -75.1282 282.6512 False
5 6 -84.1718 0.9 -283.5134 115.1697 False
5 7 65.8207 0.9 -132.0867 263.7281 False
5 8 -27.4334 0.9 -211.8418 156.975 False
5 9 99.0368 0.9 -122.3569 320.4304 False
6 7 149.9925 0.1837 -28.6232 328.6082 False
6 8 56.7384 0.9 -106.7935 220.2704 False
6 9 183.2086 0.1211 -21.1229 387.5401 False
7 8 -93.2541 0.6646 -255.0348 68.5267 False
7 9 33.2161 0.9 -169.7166 236.1487 False
8 9 126.4701 0.4952 -63.3213 316.2616 False

Effect Size Testing:

tukeyrepdf = fn.tukey_df(tukeys)
tukeyrepdf
group1 group2 meandiff p-adj lower upper reject
0 1 2 70.6795 0.7896 -67.9146 209.2736 False
1 1 3 62.1979 0.8328 -65.5036 189.8995 False
2 1 4 -2.0207 0.9000 -121.7839 117.7426 False
3 1 5 2.7041 0.9000 -174.3746 179.7827 False
4 1 6 -81.4678 0.7616 -236.6872 73.7517 False
5 1 7 68.5247 0.9000 -84.8486 221.8981 False
6 1 8 -24.7293 0.9000 -160.2376 110.7790 False
7 1 9 101.7408 0.7013 -80.9369 284.4186 False
8 2 3 -8.4815 0.9000 -149.2052 132.2421 False
9 2 4 -72.7002 0.7250 -206.2617 60.8614 False
10 2 5 -67.9754 0.9000 -254.6631 118.7123 False
11 2 6 -152.1472 0.1034 -318.2452 13.9507 False
12 2 7 -2.1547 0.9000 -166.5288 162.2193 False
13 2 8 -95.4088 0.5349 -243.2531 52.4356 False
14 2 9 31.0613 0.9000 -160.9455 223.0682 False
15 3 4 -64.2186 0.7606 -186.4399 58.0027 False
16 3 5 -59.4939 0.9000 -238.2441 119.2564 False
17 3 6 -143.6657 0.1049 -300.7895 13.4581 False
18 3 7 6.3268 0.9000 -148.9735 161.6271 False
19 3 8 -86.9273 0.5613 -224.6128 50.7583 False
20 3 9 39.5429 0.9000 -144.7557 223.8415 False
21 4 5 4.7247 0.9000 -168.4434 177.8929 False
22 4 6 -79.4471 0.7575 -230.1900 71.2959 False
23 4 7 70.5454 0.8574 -78.2959 219.3867 False
24 4 8 -22.7086 0.9000 -153.0653 107.6480 False
25 4 9 103.7615 0.6578 -75.1282 282.6512 False
26 5 6 -84.1718 0.9000 -283.5134 115.1697 False
27 5 7 65.8207 0.9000 -132.0867 263.7281 False
28 5 8 -27.4334 0.9000 -211.8418 156.9750 False
29 5 9 99.0368 0.9000 -122.3569 320.4304 False
30 6 7 149.9925 0.1837 -28.6232 328.6082 False
31 6 8 56.7384 0.9000 -106.7935 220.2704 False
32 6 9 183.2086 0.1211 -21.1229 387.5401 False
33 7 8 -93.2541 0.6646 -255.0348 68.5267 False
34 7 9 33.2161 0.9000 -169.7166 236.1487 False
35 8 9 126.4701 0.4952 -63.3213 316.2616 False

Hypothesis 3 Results Table:

fn.mult_Cohn_d(tukeyrepdf, reps)
g1 g2 padj d
0 1 2 0.7896 -0.130744
1 1 3 0.8328 -0.114624
2 1 4 0.9000 0.004144
3 1 5 0.9000 -0.005335
4 1 6 0.7616 0.170314
5 1 7 0.9000 -0.120978
6 1 8 0.9000 0.052428
7 1 9 0.7013 -0.176757
8 2 3 0.9000 0.014775
9 2 4 0.7250 0.143003
10 2 5 0.9000 0.123877
11 2 6 0.1034 0.298368
12 2 7 0.9000 0.003517
13 2 8 0.5349 0.192116
14 2 9 0.9000 -0.049137
15 3 4 0.7606 0.124941
16 3 5 0.9000 0.108117
17 3 6 0.1049 0.277284
18 3 7 0.9000 -0.010479
19 3 8 0.5613 0.171933
20 3 9 0.9000 -0.063944
21 4 5 0.9000 -0.010070
22 4 6 0.7575 0.178198
23 4 7 0.8574 -0.133375
24 4 8 0.9000 0.051141
25 4 9 0.6578 -0.194860
26 5 6 0.9000 0.193965
27 5 7 0.9000 -0.110550
28 5 8 0.9000 0.063274
29 5 9 0.9000 -0.159411
30 6 7 0.1837 -0.275056
31 6 8 0.9000 -0.140752
32 6 9 0.1211 -0.329617
33 7 8 0.6646 0.178140
34 7 9 0.9000 -0.048048
35 8 9 0.4952 -0.239676

Findings and recommendations:

Findings:

Both parametric and non-parametric tests were conducted, even though the data called for a non-parametric test because the groups did not meet the assumption of equal variance. A visual inspection after outlier removal suggested there could be meaningful differences in the means, so the parametric test was run for comparison. The parametric (ANOVA) test indicated that the null hypothesis could be rejected. However, post-hoc testing supported the non-parametric result of failing to reject the null hypothesis: the smallest adjusted p-value was well over the accepted threshold of .05.

The Hypothesis 3 Results Table above details how the representatives compare with one another, itemized by adjusted p-value (padj); d is Cohen's d, illustrating the effect size of each comparison.

Recommendation:

Even though there is no statistical difference and the effect sizes are small at best, best practices can still be shared by those with higher average revenue; for example, representatives 2, 3, and 9 still have higher-than-average sales.

Perhaps a little healthy, incentivized competition might spur increased revenues, if not for one representative then for many. Also, building on the knowledge that smaller discounts yield quantities comparable to larger ones, the sales team could increase revenue by being conservative with discount rates.

HYPOTHESIS 4:

Where are our customers from that spend the most money?

$H0$: Customers spend equal amounts regardless of region.

$H1$: Region has an effect on total amount spent.

Importing and inspecting data from OrderDetail and Order

These tables include information on:

1) Sales total information 
2) Regions where orders were shipped, indicating the location of customers
cur.execute("""SELECT 
                ShipRegion, 
                OrderID, 
                ProductID, 
                UnitPrice,
                Quantity, 
                Discount
                FROM `Order` AS o
                JOIN OrderDetail AS od
                on o.ID = od.OrderId ;""")
dfreg = pd.DataFrame(cur.fetchall(), columns=[x[0] for x in cur.description])
dfreg.head()
ShipRegion OrderId ProductId UnitPrice Quantity Discount
0 Western Europe 10248 11 14.0 12 0.0
1 Western Europe 10248 42 9.8 10 0.0
2 Western Europe 10248 72 34.8 5 0.0
3 Western Europe 10249 14 18.6 9 0.0
4 Western Europe 10249 51 42.4 40 0.0
dfreg['Amount_Spent'] = ((dfreg['UnitPrice'])*(1 - dfreg['Discount']))*dfreg['Quantity']

Hypothesis 4 Preliminary Visualizations:

fig, ax = plt.subplots()
spend_mu = dfreg['Amount_Spent'].mean()
sns.distplot(dfreg['Amount_Spent'], ax=ax)
plt.axvline(spend_mu, color='lightgreen')
plt.title('Average Total Spend')
print(f'The average total spend is ${round(spend_mu,2)}')
print(f'The distribution indicates there may be outliers.')
The average total spend is $587.37
The distribution indicates there may be outliers.

png

regs = {}
for reg in dfreg['ShipRegion'].unique():
    regs[reg] = dfreg.groupby('ShipRegion').get_group(reg)['Amount_Spent']
regions = list(dfreg['ShipRegion'].unique())
regions
print(f'There are {len(regions)} regions, they are {regions}.')
There are 9 regions, they are ['Western Europe', 'South America', 'Central America', 'North America', 'Northern Europe', 'Scandinavia', 'Southern Europe', 'British Isles', 'Eastern Europe'].
fig, ax = plt.subplots(figsize=(10,5))
for k,v in regs.items():
    sns.distplot(v,label=v)


plt.title('Sales Revenue Distribution')
plt.ylabel('Distribution')
print('Distributions appear roughly equal, although there appear to be outliers')
Distributions appear roughly equal, although there appear to be outliers

png

@interact
def plt_discounts(d=regions):
    sns.distplot(dfreg.groupby('ShipRegion').get_group(d)['Amount_Spent'])
    plt.axvline(spend_mu, color='purple')
    plt.gca()
fig, ax = plt.subplots(figsize=(8,5))
sns.barplot(data=dfreg, x='ShipRegion', y='Amount_Spent', ci=68, palette="rocket", ax=ax)
plt.title('Total Spend by Region', fontsize=16)
plt.axhline(spend_mu,linestyle="--", color='orange', linewidth=.6 )
plt.xlabel('Region')
plt.xticks(rotation=45)
plt.ylabel('Total Spend')
plt.show()

png

Hypothesis 4 Initial Observations:

Datatype is numeric in this sample of 2155 line items.

There are 9 regions reflected in this dataset.

The average total spent is $587.37.

Initial visual inspection indicates a skewed but roughly similar distribution of total spend across regions, with more than half of the regions reaching the average. Additional testing will demonstrate whether the differences are significant.

Since we are comparing total spend across multiple regions, an ANOVA (or a non-parametric alternative) will be run. Assumptions for ANOVA Testing:

  1. No significant outliers: upon a quick visual inspection, there appear to be some outliers that could be removed

  2. Equal variance

  3. Normality (if n > 15): not required, since all samples have more than 15 observations

Hypothesis 4 Assumption 1: Outliers

Outliers were identified and removed via z-score testing. Details are below:

regs = {}
for reg in dfreg['ShipRegion'].unique():
    regs[reg] = dfreg.groupby('ShipRegion').get_group(reg)['Amount_Spent']
for reg, reg_data in regs.items():
    idx_outs = fn.find_outliers_Z(reg_data)
    print(f'Found {idx_outs.sum()} outliers in {reg}')
    regs[reg] = reg_data[~idx_outs]
print('\n All of these outliers were removed')
Found 11 outliers in Western Europe
Found 2 outliers in South America
Found 1 outliers in Central America
Found 9 outliers in North America
Found 2 outliers in Northern Europe
Found 1 outliers in Scandinavia
Found 3 outliers in Southern Europe
Found 4 outliers in British Isles
Found 0 outliers in Eastern Europe

 All of these outliers were removed
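The find_outliers_Z helper is imported from the local functions module and its source is not reproduced here. A minimal sketch of a z-score outlier test along those lines, assuming the conventional |z| > 3 cutoff, might look like:

import numpy as np
import pandas as pd
from scipy import stats

def find_outliers_Z_sketch(data, thresh=3):
    # Flag values whose absolute z-score exceeds the threshold (|z| > 3 by convention)
    z = np.abs(stats.zscore(data))
    return pd.Series(z > thresh, index=data.index)

# e.g. idx_outs = find_outliers_Z_sketch(regs['Western Europe'])
#      clean = regs['Western Europe'][~idx_outs]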

Hypothesis 4 Assumption 2: Equal Variance

data = []
labels = []
for k,v in regs.items():
    data.append(v)
    labels.append(k)
stat, p = stats.levene(*data, center='median')
print(f'Levene test for equal variance p-value: {round(p,4)}')
sig = 'do NOT' if p < .05 else 'DO'

print(f'The groups {sig} have equal variance')

Levene test for equal variance p-value: 0.0
The groups do NOT have equal variance

Hypothesis 4 Assumption 3: Normality

n =[]
for reg, samples in regs.items():
    print(f'There are {len(samples)} samples in the data set for the {reg} region.')
    n.append(len(samples)>15)
if all(n):
    print('\nAll samples are >15: Normality Assumption Criterion is met.')
There are 734 samples in the data set for the Western Europe region.
There are 353 samples in the data set for the South America region.
There are 71 samples in the data set for the Central America region.
There are 418 samples in the data set for the North America region.
There are 141 samples in the data set for the Northern Europe region.
There are 69 samples in the data set for the Scandinavia region.
There are 134 samples in the data set for the Southern Europe region.
There are 186 samples in the data set for the British Isles region.
There are 16 samples in the data set for the Eastern Europe region.

All samples are >15: Normality Assumption Criterion is met.

Using the non-parametric Kruskal-Wallis test since the groups do not have equal variance:

stat, p = stats.kruskal(*data)
print(f"Kruskal test p value: {round(p,6)}")
Kruskal test p value: 0.0
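The p-value is reported as 0.0 only because it is smaller than the six-decimal rounding used here; it is well below the 0.05 threshold, so the regions' spend distributions are not all equal.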
cregs = fn.prep_data_for_tukeys(regs)  
cregs
data group
0 168.0 Western Europe
1 98.0 Western Europe
2 174.0 Western Europe
3 167.4 Western Europe
4 1696.0 Western Europe
... ... ...
1933 54.0 Eastern Europe
1934 199.5 Eastern Europe
1935 200.0 Eastern Europe
1936 232.5 Eastern Europe
2054 591.6 Eastern Europe

2122 rows × 2 columns
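prep_data_for_tukeys is another helper from the local functions module. Judging from the output above, it stacks the dictionary of per-region Series into the long 'data'/'group' format that pairwise_tukeyhsd expects; a rough sketch under that assumption:

import pandas as pd

def prep_data_for_tukeys_sketch(group_dict):
    # Stack {group_name: Series of values} into one long-format frame
    frames = [pd.DataFrame({'data': values, 'group': name})
              for name, values in group_dict.items()]
    return pd.concat(frames)

# e.g. cregs = prep_data_for_tukeys_sketch(regs)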

mu = cregs['data'].mean()
mu
500.12534142318566
cregs['data']=cregs['data'].astype(float)
cregs['group']=cregs['group'].astype(str)
tukeyr = sms.stats.multicomp.pairwise_tukeyhsd(cregs['data'],cregs['group'])
tukeyr.summary()
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj lower upper reject
British Isles Central America -185.8522 0.2271 -415.6671 43.9628 False
British Isles Eastern Europe -241.5782 0.6901 -670.7752 187.6188 False
British Isles North America 108.0077 0.3368 -37.1939 253.2093 False
British Isles Northern Europe 48.1291 0.9 -135.8232 232.0814 False
British Isles Scandinavia -137.1482 0.6379 -369.3612 95.0647 False
British Isles South America -38.1848 0.9 -187.4464 111.0767 False
British Isles Southern Europe -172.9738 0.0946 -359.6391 13.6914 False
British Isles Western Europe 124.6055 0.0989 -10.6288 259.8398 False
Central America Eastern Europe -55.726 0.9 -511.6242 400.1721 False
Central America North America 293.8599 0.001 82.3968 505.3229 True
Central America Northern Europe 233.9812 0.0623 -5.7511 473.7136 False
Central America Scandinavia 48.7039 0.9 -229.7849 327.1927 False
Central America South America 147.6673 0.4491 -66.6039 361.9385 False
Central America Southern Europe 12.8783 0.9 -228.942 254.6986 False
Central America Western Europe 310.4577 0.001 105.7104 515.205 True
Eastern Europe North America 349.5859 0.1926 -70.0708 769.2426 False
Eastern Europe Northern Europe 289.7073 0.4947 -144.8807 724.2953 False
Eastern Europe Scandinavia 104.43 0.9 -352.6817 561.5417 False
Eastern Europe South America 203.3934 0.8404 -217.6853 624.472 False
Eastern Europe Southern Europe 68.6044 0.9 -367.1389 504.3476 False
Eastern Europe Western Europe 366.1837 0.1372 -50.1293 782.4968 False
North America Northern Europe -59.8786 0.9 -220.316 100.5588 False
North America Scandinavia -245.1559 0.0115 -459.2227 -31.0892 True
North America South America -146.1925 0.0045 -265.2754 -27.1097 True
North America Southern Europe -280.9815 0.001 -444.5224 -117.4406 True
North America Western Europe 16.5978 0.9 -84.3478 117.5435 False
Northern Europe Scandinavia -185.2773 0.2974 -427.3094 56.7548 False
Northern Europe South America -86.3139 0.7597 -250.4349 77.807 False
Northern Europe Southern Europe -221.1029 0.0164 -419.8505 -22.3554 True
Northern Europe Western Europe 76.4764 0.7992 -74.9996 227.9525 False
Scandinavia South America 98.9634 0.8905 -117.8778 315.8046 False
Scandinavia Southern Europe -35.8256 0.9 -279.926 208.2748 False
Scandinavia Western Europe 261.7538 0.003 54.3185 469.189 True
South America Southern Europe -134.789 0.2306 -301.9451 32.3671 False
South America Western Europe 162.7904 0.001 56.0873 269.4934 True
Southern Europe Western Europe 297.5794 0.001 142.82 452.3387 True
tukeyr.plot_simultaneous();

(Figure: Tukey HSD simultaneous confidence intervals for mean Amount_Spent by region.)

Note that the sample size for Eastern Europe is relatively small, so its confidence interval is much wider to accommodate for it, possibly skewing the results in the tables below.

tukeyr_df = fn.tukey_df(tukeyr)
tukeyr_false = tukeyr_df[tukeyr_df['reject']==False]
tukeyr_false
group1 group2 meandiff p-adj lower upper reject
0 British Isles Central America -185.8522 0.2271 -415.6671 43.9628 False
1 British Isles Eastern Europe -241.5782 0.6901 -670.7752 187.6188 False
2 British Isles North America 108.0077 0.3368 -37.1939 253.2093 False
3 British Isles Northern Europe 48.1291 0.9000 -135.8232 232.0814 False
4 British Isles Scandinavia -137.1482 0.6379 -369.3612 95.0647 False
5 British Isles South America -38.1848 0.9000 -187.4464 111.0767 False
6 British Isles Southern Europe -172.9738 0.0946 -359.6391 13.6914 False
7 British Isles Western Europe 124.6055 0.0989 -10.6288 259.8398 False
8 Central America Eastern Europe -55.7260 0.9000 -511.6242 400.1721 False
10 Central America Northern Europe 233.9812 0.0623 -5.7511 473.7136 False
11 Central America Scandinavia 48.7039 0.9000 -229.7849 327.1927 False
12 Central America South America 147.6673 0.4491 -66.6039 361.9385 False
13 Central America Southern Europe 12.8783 0.9000 -228.9420 254.6986 False
15 Eastern Europe North America 349.5859 0.1926 -70.0708 769.2426 False
16 Eastern Europe Northern Europe 289.7073 0.4947 -144.8807 724.2953 False
17 Eastern Europe Scandinavia 104.4300 0.9000 -352.6817 561.5417 False
18 Eastern Europe South America 203.3934 0.8404 -217.6853 624.4720 False
19 Eastern Europe Southern Europe 68.6044 0.9000 -367.1389 504.3476 False
20 Eastern Europe Western Europe 366.1837 0.1372 -50.1293 782.4968 False
21 North America Northern Europe -59.8786 0.9000 -220.3160 100.5588 False
25 North America Western Europe 16.5978 0.9000 -84.3478 117.5435 False
26 Northern Europe Scandinavia -185.2773 0.2974 -427.3094 56.7548 False
27 Northern Europe South America -86.3139 0.7597 -250.4349 77.8070 False
29 Northern Europe Western Europe 76.4764 0.7992 -74.9996 227.9525 False
30 Scandinavia South America 98.9634 0.8905 -117.8778 315.8046 False
31 Scandinavia Southern Europe -35.8256 0.9000 -279.9260 208.2748 False
33 South America Southern Europe -134.7890 0.2306 -301.9451 32.3671 False
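The tukey_df helper used to build the table above also lives in the local functions module; it converts the statsmodels results object into a DataFrame so it can be filtered on the reject column. A sketch of one common way to do this, assuming the real helper behaves similarly:

import pandas as pd

def tukey_df_sketch(tukey_result):
    # summary().data is a list of rows; the first row holds the column names.
    # Values come back as strings/objects; the real helper may also cast dtypes.
    rows = tukey_result.summary().data
    return pd.DataFrame(rows[1:], columns=rows[0])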
tukeyr_trues = tukeyr_df[tukeyr_df['reject']==True]
tukeyr_trues
group1 group2 meandiff p-adj lower upper reject
9 Central America North America 293.8599 0.0010 82.3968 505.3229 True
14 Central America Western Europe 310.4577 0.0010 105.7104 515.2050 True
22 North America Scandinavia -245.1559 0.0115 -459.2227 -31.0892 True
23 North America South America -146.1925 0.0045 -265.2754 -27.1097 True
24 North America Southern Europe -280.9815 0.0010 -444.5224 -117.4406 True
28 Northern Europe Southern Europe -221.1029 0.0164 -419.8505 -22.3554 True
32 Scandinavia Western Europe 261.7538 0.0030 54.3185 469.1890 True
34 South America Western Europe 162.7904 0.0010 56.0873 269.4934 True
35 Southern Europe Western Europe 297.5794 0.0010 142.8200 452.3387 True
mult_Cohn_d(tukeyr_trues, regs)
g1 g2 padj d
0 Central America North America 0.0010 -0.485386
1 Central America Western Europe 0.0010 -0.548229
2 North America Scandinavia 0.0115 0.402484
3 North America South America 0.0045 0.262021
4 North America Southern Europe 0.0010 0.485062
5 Northern Europe Southern Europe 0.0164 0.481990
6 Scandinavia Western Europe 0.0030 -0.460358
7 South America Western Europe 0.0010 -0.300535
8 Southern Europe Western Europe 0.0010 -0.539438
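mult_Cohn_d is a local helper as well; the d column above is a pooled-standard-deviation Cohen's d for each significant Tukey pair. A sketch of how it could be computed (the function and column names here are illustrative, not the module's actual API):

import numpy as np
import pandas as pd

def cohen_d(group1, group2):
    # Cohen's d using the pooled standard deviation of the two groups
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2)
    return (group1.mean() - group2.mean()) / np.sqrt(pooled_var)

def mult_cohen_d_sketch(sig_pairs, group_dict):
    # One effect size per significant Tukey pair
    rows = [{'g1': r['group1'], 'g2': r['group2'], 'padj': r['p-adj'],
             'd': cohen_d(group_dict[r['group1']], group_dict[r['group2']])}
            for _, r in sig_pairs.iterrows()]
    return pd.DataFrame(rows)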
for reg, rev in regs.items():
    print(f'The average revenue for {reg} is ${round(rev.mean(),2)}')
The average revenue for Western Europe is $586.93
The average revenue for South America is $424.14
The average revenue for Central America is $276.47
The average revenue for North America is $570.33
The average revenue for Northern Europe is $510.45
The average revenue for Scandinavia is $325.18
The average revenue for Southern Europe is $289.35
The average revenue for British Isles is $462.33
The average revenue for Eastern Europe is $220.75

Hypothesis 4 Observations and Recommendations:

Observations

The average spend across all regions in this dataset (after outlier removal) was ~$500.

Regional averages are reported in the table below:

| Region | Average Revenue |
|---|---|
| Western Europe | $586.93 |
| North America | $570.33 |
| Northern Europe | $510.45 |
| British Isles | $462.33 |
| South America | $424.14 |
| Scandinavia | $325.18 |
| Southern Europe | $289.35 |
| Central America | $276.47 |
| Eastern Europe | $220.75 |

The assumption of equal variance was not met, so the non-parametric Kruskal-Wallis test was conducted. Its p-value was significant, so the null hypothesis that all regions spend the same amounts can be rejected at the 5% significance level.

Orders shipped to Western Europe, North America, and Northern Europe generated the highest revenue; statistically, these three regions are equal:

| Region 1 | Region 2 | Mean Diff | Adj. P | Reject Null |
|---|---|---|---|---|
| North America | Northern Europe | -59.8786 | 0.9000 | False |
| North America | Western Europe | 16.5978 | 0.9000 | False |
| Northern Europe | Western Europe | 76.4764 | 0.7992 | False |

For the bottom performers that had enough data, the difference was also not statistically significant:

| Region 1 | Region 2 | Adj. P |
|---|---|---|
| South America | Southern Europe | 0.2306 |

Recommendations:

Explore best practices from the top-performing regions.
Conduct market analysis for regions that need to be developed, leveraging the knowledge gained regarding categories and discounts.

indexr = list(cregs.groupby('group').mean().sort_values('data', ascending=False).index)
indexr
['Western Europe',
 'North America',
 'Northern Europe',
 'British Isles',
 'South America',
 'Scandinavia',
 'Southern Europe',
 'Central America',
 'Eastern Europe']
avspend = cregs['data'].mean()
avspend
fig, ax = plt.subplots(figsize=(12,8))
sns.barplot(data=cregs, x='group', y='data', ci=68,order=indexr, palette="rocket", ax=ax)
plt.title('Average Revenue by Region', fontsize=20)
plt.axhline(avspend,linestyle="--", color='orange', linewidth=.6 )
plt.xlabel('Region')
plt.xticks(rotation=45)
plt.ylabel('Average Revenue')
plt.show()

(Figure: average revenue by region, outliers removed, with the overall average line.)

In Closing:

Since the data provided did not include purchase (cost) prices of merchandise, the analysis focused on ways to maximize revenue. The datasets were all multi-group comparisons, and none of the groups met all the assumptions for parametric testing, so each called for Kruskal-Wallis and post-hoc testing, as detailed in this notebook.

Hypothesis testing and data analysis identified several ways to achieve this: minimizing discounts, broadening the revenue-generating product categories, and applying best practices from top-performing regions to regions that need development.

In addition, future analysis and testing could provide insight into developing regional markets and developing the sales staff.
