Add more correlation test #19

Edouard-Legoupil · 2018-05-30T15:04:43Z

Currently the data crunching report only runs a chi-square test between pairs of select_one variables:
https://github.com/unhcr/koboloadeR/blob/gh-pages/inst/script/3-generate-report.R#L500:L625

More correlation tests could be handled...

Edouard-Legoupil · 2018-10-18T06:41:14Z

Numeric to categorical: --> ANOVA

If the variable is ordinal:
Kruskal-Wallis test > kruskal.test(
Mann-Whitney (this is the same as a two-sample Wilcoxon test) > wilcox.test(

Edouard-Legoupil · 2018-11-11T09:28:49Z

Maybe the tests should be accompanied by better graphs, such as those from https://indrajeetpatil.github.io/ggstatsplot/

maherdaoud · 2018-11-13T14:32:45Z

What I am trying to do now is cover all the cases of correlation tests between the dependent variable (target) and an independent variable.
Let's start.

1. Normal Distribution Numerical variable VS Normal Distribution Numerical variable
We need to test whether there is a monotonic, linear or non-linear relationship.

Linear relationships are monotonic, but not all monotonic relationships are linear.

In a monotonic relationship the variables move consistently in the same (or opposite) direction, but not necessarily at a constant rate.

In a linear relationship the variables move in the same (or opposite) direction at a constant rate.
First we need to define the dependent and independent variables in R with the code below

tar <- df[,target] #Dependent variable
field <- df[,col] #Independent variable
#df is a dataframe that contains your dataset

For the linear relationship test we use the Pearson correlation, with the R function below
stats::cor(tar, field, method = "pearson")
For the monotonic relationship test we use the Spearman rank correlation, with the R function below
stats::cor(tar, field, method = "spearman")
For the non-linear relationship test we compute the mutual information, normalised by the mutual information of the target with itself, with the R function below
mpmi::cminjk.pw(tar, field) / mpmi::cminjk.pw(tar, tar)
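
To make the difference between the linear and the monotonic check concrete, here is a minimal sketch on made-up data (the vectors `x` and `y` are purely illustrative and do not come from the report). An exponential curve is perfectly monotonic, so Spearman reaches 1, while Pearson stays below 1 because the relationship is not linear.

```r
# Illustrative data only: y grows exponentially with x
x <- 1:50
y <- exp(x / 5)

pearson  <- stats::cor(x, y, method = "pearson")   # strength of the linear relationship
spearman <- stats::cor(x, y, method = "spearman")  # strength of the monotonic relationship

# The ranks of x and y agree exactly, so spearman is 1; pearson is lower
print(c(pearson = pearson, spearman = spearman))
```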

2. Normal Distribution Numerical variable VS Non Normal Distribution Numerical variable
In this case we can't rely on the Pearson correlation test, because the assumption behind this test is:

For the Pearson r correlation, both variables should be normally distributed (normally distributed variables have a bell-shaped curve)

Here we have to apply the Spearman correlation test and compute the mutual information to check whether there is a monotonic or a non-linear relationship.
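
The thread does not say how to decide whether a variable is normally distributed. One possible sketch, assuming a Shapiro-Wilk test at the 5% level is acceptable for that decision (the helper name `choose_cor_method` and the threshold are assumptions, not part of the report script):

```r
# Hypothetical helper: choose the correlation method from a normality check.
# stats::shapiro.test() needs between 3 and 5000 non-missing values.
choose_cor_method <- function(x, y, alpha = 0.05) {
  is_normal <- function(v) stats::shapiro.test(v)$p.value > alpha
  if (is_normal(x) && is_normal(y)) "pearson" else "spearman"
}

# Deterministic illustrative data: evenly spaced normal and exponential quantiles
normal_like <- stats::qnorm(stats::ppoints(100))
skewed      <- stats::qexp(stats::ppoints(100))

method <- choose_cor_method(normal_like, skewed)  # falls back to Spearman
rho    <- stats::cor(normal_like, skewed, method = method)
```

With large samples the Shapiro-Wilk test flags even small departures from normality, so a visual check (histogram or Q-Q plot) may be a useful complement.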

3. Normal Distribution Numerical variable VS Binary nominal factor
In this case you can apply a two independent samples t-test to check whether there is a relationship between the two variables. Use the R code below to run this test

#tar: normally distributed numerical variable
#field: binary nominal factor
tTest <- t.test(tar ~ field)

p_value <- tTest$p.value
isPredictor <- "No" #default, overwritten below when the test is significant
if(p_value > 0.05){
  levelOfConfidence <- "very weak"
}else if(p_value <= 0.05 && p_value >= 0.01){
  levelOfConfidence <- "moderate"
  isPredictor <- "Yes"
}else if(p_value <= 0.01 && p_value >= 0.005){
  levelOfConfidence <- "strong"
  isPredictor <- "Yes"
}else{
  levelOfConfidence <- "very strong"
  isPredictor <- "Yes"
}
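
The same p-value ladder is repeated for every test in this thread, so it could be factored into one helper. A minimal sketch, assuming the thresholds above are the intended ones; the function name `classify_p_value` is hypothetical, and returning "No" for the non-significant case is an assumption of this sketch.

```r
# Hypothetical helper: map a p-value to the confidence labels used in this thread
classify_p_value <- function(p_value) {
  if (p_value > 0.05) {
    list(levelOfConfidence = "very weak", isPredictor = "No")
  } else if (p_value >= 0.01) {
    list(levelOfConfidence = "moderate", isPredictor = "Yes")
  } else if (p_value >= 0.005) {
    list(levelOfConfidence = "strong", isPredictor = "Yes")
  } else {
    list(levelOfConfidence = "very strong", isPredictor = "Yes")
  }
}

# Usage with an arbitrary p-value
result <- classify_p_value(0.03)
print(result$levelOfConfidence)
```

Each test below would then only need to compute `p_value` and call the helper.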

4. Normal Distribution Numerical variable VS Multinomial factor
For a multinomial factor we have to apply the ANOVA test, as in the R code below

#tar: normally distributed numerical variable
#field: multinomial factor
#the numeric variable is the response, the factor is the grouping variable
test <- aov(tar ~ field)

p_value <- summary(test)[[1]][["Pr(>F)"]][1]
isPredictor <- "No" #default, overwritten below when the test is significant
if(p_value > 0.05){
  levelOfConfidence <- "very weak"
}else if(p_value <= 0.05 && p_value >= 0.01){
  levelOfConfidence <- "moderate"
  isPredictor <- "Yes"
}else if(p_value <= 0.01 && p_value >= 0.005){
  levelOfConfidence <- "strong"
  isPredictor <- "Yes"
}else{
  levelOfConfidence <- "very strong"
  isPredictor <- "Yes"
}
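
As a quick check of the ANOVA call, here is a small sketch on made-up data, with the numeric variable on the left of the formula and the grouping factor on the right. The values are illustrative only.

```r
# Illustrative data: three groups with clearly different means
tar   <- c(1, 2, 3, 11, 12, 13, 21, 22, 23)
field <- factor(rep(c("a", "b", "c"), each = 3))

test <- stats::aov(tar ~ field)

# The first row of the ANOVA table is the factor, the second the residuals
anova_table <- summary(test)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]
print(p_value)
```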

5. Non Normal Distribution Numerical variable VS Binary nominal factor
For a non-normally distributed numerical variable we should use the Wilcoxon-Mann-Whitney test
Here we go

#tar: non-normally distributed numerical variable
#field: binary nominal factor
test <- stats::wilcox.test(tar ~ field)
p_value <- test$p.value
isPredictor <- "No" #default, overwritten below when the test is significant
if(p_value > 0.05){
  levelOfConfidence <- "very weak"
}else if(p_value <= 0.05 && p_value >= 0.01){
  levelOfConfidence <- "moderate"
  isPredictor <- "Yes"
}else if(p_value <= 0.01 && p_value >= 0.005){
  levelOfConfidence <- "strong"
  isPredictor <- "Yes"
}else{
  levelOfConfidence <- "very strong"
  isPredictor <- "Yes"
}

6. Non Normal Distribution Numerical variable VS Multinomial factor
To find out whether there is a relationship between those variables, you need to apply the Kruskal-Wallis test
using the R code below

#tar: non-normally distributed numerical variable
#field: multinomial factor
test <- stats::kruskal.test(tar ~ field)
p_value <- test$p.value
isPredictor <- "No" #default, overwritten below when the test is significant
if(p_value > 0.05){
  levelOfConfidence <- "very weak"
}else if(p_value <= 0.05 && p_value >= 0.01){
  levelOfConfidence <- "moderate"
  isPredictor <- "Yes"
}else if(p_value <= 0.01 && p_value >= 0.005){
  levelOfConfidence <- "strong"
  isPredictor <- "Yes"
}else{
  levelOfConfidence <- "very strong"
  isPredictor <- "Yes"
}
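
Cases 3 to 6 differ only in whether the numerical variable is normal and how many levels the factor has, so the choice of test could be made automatically. A rough sketch under stated assumptions: the function name `numeric_vs_factor_test` is hypothetical, and normality is judged with a Shapiro-Wilk test on the pooled variable, which is a simplification (checking per group or on residuals would be stricter).

```r
# Hypothetical dispatcher for "numerical variable VS nominal factor"
numeric_vs_factor_test <- function(tar, field, alpha = 0.05) {
  field     <- droplevels(as.factor(field))
  is_normal <- stats::shapiro.test(tar)$p.value > alpha  # assumption: pooled check
  is_binary <- nlevels(field) == 2

  if (is_normal && is_binary) {
    list(statisticalTests = "t-test",
         p_value = stats::t.test(tar ~ field)$p.value)
  } else if (is_normal) {
    fit <- stats::aov(tar ~ field)
    list(statisticalTests = "ANOVA",
         p_value = summary(fit)[[1]][["Pr(>F)"]][1])
  } else if (is_binary) {
    list(statisticalTests = "Wilcoxon-Mann-Whitney",
         p_value = stats::wilcox.test(tar ~ field)$p.value)
  } else {
    list(statisticalTests = "Kruskal-Wallis",
         p_value = stats::kruskal.test(tar ~ field)$p.value)
  }
}

# Deterministic illustrative data
normal_like <- stats::qnorm(stats::ppoints(40))
skewed      <- stats::qexp(stats::ppoints(40))
two_groups   <- factor(rep(c("a", "b"), times = 20))
three_groups <- factor(rep(c("a", "b", "c"), length.out = 40))

print(numeric_vs_factor_test(skewed, three_groups)$statisticalTests)
```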

To be continued ...

maherdaoud · 2018-11-15T07:32:36Z

In the previous comment, we talked about the nominal factor and which tests are suitable for checking whether it is related to a numerical variable.
Now, let's talk about the ordinal factor, which is the second type of variable measurement scale

7. Ordinal factor vs Numerical variable
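In this case, we need to apply Spearman's rank-order correlation, using the R code below

#tar: numerical variable
#field: ordinal factor (ordered levels)
#Spearman ranks both inputs itself, so only the factor needs converting to its integer codes
cor(tar, as.numeric(field), method = "spearman")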
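
A small sketch of this case on made-up data, assuming the ordinal factor is stored as an ordered factor so that its integer codes follow the intended order. `cor.test()` is added here to obtain a p-value, which the snippet above does not compute; `exact = FALSE` is used because ordinal data contain ties.

```r
# Illustrative data: an ordered factor and a numerical variable
field <- factor(rep(c("low", "medium", "high"), each = 4),
                levels = c("low", "medium", "high"), ordered = TRUE)
tar <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)

# Spearman's rho on the factor's integer codes (1 = low, 2 = medium, 3 = high)
rho <- stats::cor(tar, as.numeric(field), method = "spearman")

# Same coefficient with a significance test
test <- stats::cor.test(tar, as.numeric(field), method = "spearman", exact = FALSE)
p_value <- test$p.value
```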

8. Nominal factor VS Nominal factor
To check whether there is a relationship between two nominal variables, first build a contingency table and then run a chi-squared test. We also apply Cramer's V to measure the strength of the association between the two variables. Check the following R code

#tar: nominal factor
#field: nominal factor
contingency_table <- table(tar, field)
test <- chisq.test(contingency_table, correct = FALSE)
p_value <- test$p.value
cramerTest <- rcompanion::cramerV(contingency_table, digits = 4)
statisticalTests <- "Chi-squared Test | Cramer's V test"
isPredictor <- "No" #default, overwritten below when the test is significant
if(p_value > 0.05 || cramerTest < 0.29){
  levelOfConfidence <- "very weak"
}else if(p_value <= 0.05 && p_value >= 0.01){
  levelOfConfidence <- "moderate"
  isPredictor <- "Yes"
}else if(p_value <= 0.01 && p_value >= 0.005){
  levelOfConfidence <- "strong"
  isPredictor <- "Yes"
}else{
  levelOfConfidence <- "very strong"
  isPredictor <- "Yes"
}
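
For readers without rcompanion installed, Cramer's V can be derived from the chi-squared statistic with the usual formula V = sqrt(chi2 / (n * (min(rows, cols) - 1))). A base-R sketch; the function name `cramers_v` is hypothetical, and it is assumed (not verified here) that this matches what `rcompanion::cramerV` reports without bias correction.

```r
# Hypothetical base-R helper for Cramer's V (no bias correction)
cramers_v <- function(contingency_table) {
  chi2 <- unname(stats::chisq.test(contingency_table, correct = FALSE)$statistic)
  n <- sum(contingency_table)
  k <- min(dim(contingency_table)) - 1
  sqrt(chi2 / (n * k))
}

# Illustrative tables: perfect association and no association
perfect     <- matrix(c(10, 0, 0, 10), nrow = 2)
independent <- matrix(c(5, 5, 5, 5), nrow = 2)

v_perfect     <- cramers_v(perfect)      # V at its maximum
v_independent <- cramers_v(independent)  # V at its minimum
```

V ranges from 0 (no association) to 1 (perfect association), which is why the code above can compare it against a fixed cut-off.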

8. Ordinal factor VS Ordinal factor
For this case, we need to apply the linear-by-linear association test

The null hypothesis for the linear-by-linear test is that there is no association among the variables in the table. A significant p-value suggests that there is an association. This is similar to a chi-square test, except that the categories are ordered in nature.

You can use the following R code

#tar: ordinal factor
#field: ordinal factor
contingency_table <- table(tar, field)
test <- coin::lbl_test(contingency_table)
p_value <- coin::pvalue(test)
statisticalTests <- "Linear-by-Linear Association Test"
isPredictor <- "No" #default, overwritten below when the test is significant
if(p_value > 0.05){
  levelOfConfidence <- "very weak"
}else if(p_value <= 0.05 && p_value >= 0.01){
  levelOfConfidence <- "moderate"
  isPredictor <- "Yes"
}else if(p_value <= 0.01 && p_value >= 0.005){
  levelOfConfidence <- "strong"
  isPredictor <- "Yes"
}else{
  levelOfConfidence <- "very strong"
  isPredictor <- "Yes"
}
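
To show what the linear-by-linear test measures, here is a base-R sketch of its textbook form: assign integer scores to the ordered categories, compute the Pearson correlation r between row and column scores, and refer M2 = (n - 1) * r^2 to a chi-squared distribution with 1 degree of freedom. The function name `lbl_sketch` and the equally spaced scores are assumptions, and the result is expected to be close to, but is not verified against, the output of `coin::lbl_test`.

```r
# Hypothetical base-R sketch of the linear-by-linear association statistic
lbl_sketch <- function(contingency_table) {
  counts <- as.vector(contingency_table)
  n <- sum(counts)
  # Expand the table into one (row score, column score) pair per observation
  x <- rep(as.vector(row(contingency_table)), times = counts)
  y <- rep(as.vector(col(contingency_table)), times = counts)
  r  <- stats::cor(x, y)
  m2 <- (n - 1) * r^2
  list(statistic = m2,
       p_value = stats::pchisq(m2, df = 1, lower.tail = FALSE))
}

# Illustrative tables: a clear ordinal trend and no trend at all
trend    <- matrix(c(10, 2, 1, 2, 10, 2, 1, 2, 10), nrow = 3)
no_trend <- matrix(c(5, 5, 5, 5), nrow = 2)

res_trend    <- lbl_sketch(trend)
res_no_trend <- lbl_sketch(no_trend)
```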

9. Ordinal factor VS Nominal factor
If the nominal factor has only two categories and the ordinal factor has k categories, we need to use the Cochran–Armitage test, using the R code below

#field: nominal factor (two categories)
#tar: ordinal factor
#assumes every level of tar occurs in the data, so the scores match the table columns
contingency_table <- table(field, as.numeric(tar), dnn = c("independent", "dependent"))
test <- coin::chisq_test(contingency_table, scores = list("dependent" = seq_len(nlevels(tar))))
p_value <- coin::pvalue(test)
statisticalTests <- "Cochran–Armitage Test"
isPredictor <- "No" #default, overwritten below when the test is significant
if(p_value > 0.05){
  levelOfConfidence <- "very weak"
}else if(p_value <= 0.05 && p_value >= 0.01){
  levelOfConfidence <- "moderate"
  isPredictor <- "Yes"
}else if(p_value <= 0.01 && p_value >= 0.005){
  levelOfConfidence <- "strong"
  isPredictor <- "Yes"
}else{
  levelOfConfidence <- "very strong"
  isPredictor <- "Yes"
}
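
Base R also ships a chi-squared test for trend in proportions, `stats::prop.trend.test`, which covers the same two-category versus ordered-category situation and can serve as a cross-check without the coin package. A minimal sketch on made-up counts (the table and its labels are illustrative):

```r
# Illustrative 2 x k table: binary factor in rows, ordinal factor in columns
contingency_table <- as.table(rbind(yes = c(2, 5, 8, 12),
                                    no  = c(13, 10, 7, 3)))
colnames(contingency_table) <- c("low", "medium", "high", "very high")

events <- contingency_table["yes", ]   # "yes" counts per ordinal level
totals <- colSums(contingency_table)   # group sizes per ordinal level

# Equally spaced scores 1..k for the ordered levels
trend_test <- stats::prop.trend.test(events, totals, score = seq_along(events))
p_value <- trend_test$p.value
```

The scores encode the spacing between ordinal levels; equally spaced integers are the common default, but other scores can be passed if the levels are not evenly spaced.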

I think we now need to cover the remaining cases: Date variable VS Numerical variable, Date variable VS Ordinal factor, and Nominal factor with more than two categories VS Ordinal factor

Edouard-Legoupil mentioned this issue May 31, 2018

Dealing with ordinal questions of type of Likert... #21

Open
