This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

Add more correlation test #19

Open
Edouard-Legoupil opened this issue May 30, 2018 · 4 comments

Comments

@Edouard-Legoupil
Member

Currently the data crunching report only handles the chi-squared test between select_one variables:
https://github.com/unhcr/koboloadeR/blob/gh-pages/inst/script/3-generate-report.R#L500:L625

More correlation tests could be handled...

@Edouard-Legoupil
Member Author

numeric to categorical: ANOVA

if the variable is ordinal:
Kruskal-Wallis test > kruskal.test()
Mann-Whitney test (the same as a two-sample Wilcoxon test) > wilcox.test()

@Edouard-Legoupil
Member Author

The tests could perhaps be accompanied by better graphs from https://indrajeetpatil.github.io/ggstatsplot/

@maherdaoud

maherdaoud commented Nov 13, 2018

I am now trying to cover all cases of correlation tests between the dependent variable (target) and an independent variable.
Let's start.

1. Normal Distribution Numerical variable VS Normal Distribution Numerical variable
We need to test whether the relationship is monotonic, linear, or non-linear.

Linear relationships are monotonic, but not all monotonic relationships are linear.

  • Monotonic variables increase (or decrease) in the same direction, but not always at the same rate.
  • Linear variables increase (or decrease) in the same direction at the same rate.

First, define the dependent and independent variables in R with the code below:

# df is a data frame that contains your dataset;
# target and col hold the column names as strings
tar <- df[[target]]  # Dependent variable
field <- df[[col]]   # Independent variable

  • For the linear relationship test, use the Pearson correlation with the function below:
    stats::cor(tar, field, method = "pearson")
  • For the monotonic relationship test, use the Spearman correlation with the function below:
    stats::cor(tar, field, method = "spearman")
  • For the non-linear relationship test, compute the mutual information, here divided by the mutual information of tar with itself:
    mpmi::cminjk.pw(tar, field) / mpmi::cminjk.pw(tar, tar)
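To make the difference between the first two measures concrete, here is a minimal sketch on synthetic data. The vectors `x` and `y` are made up for illustration and do not come from the report. An exponential curve is perfectly monotonic but not linear, so Spearman's coefficient reaches 1 while Pearson's stays below it.

```r
# Synthetic, strictly increasing but non-linear relationship (illustrative data)
x <- 1:50
y <- exp(x / 10)

pearson  <- stats::cor(x, y, method = "pearson")   # strength of linear association
spearman <- stats::cor(x, y, method = "spearman")  # strength of monotonic association

print(c(pearson = pearson, spearman = spearman))
```

A large gap between the two coefficients is a hint that the relationship is monotonic but curved.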

2. Normal Distribution Numerical variable VS Non Normal Distribution Numerical variable
In this case we can't use the Pearson correlation test, because of the assumption behind it:

For the Pearson r correlation, both variables should be normally distributed (normally distributed variables have a bell-shaped curve).

Instead, apply the Spearman correlation test and compute the mutual information to check whether there is a monotonic or a non-linear relationship.
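The choice between cases 1 and 2 could be automated along the following lines. This is only a sketch: the helper name `choose_cor_method` and the 0.05 threshold are assumptions, and the Shapiro-Wilk test is one of several possible normality checks (in R it accepts between 3 and 5000 observations).

```r
# Hypothetical helper: pick a correlation method from Shapiro-Wilk normality tests.
# alpha = 0.05 is an assumed threshold, not something fixed by this thread.
choose_cor_method <- function(x, y, alpha = 0.05) {
  x_normal <- stats::shapiro.test(x)$p.value > alpha
  y_normal <- stats::shapiro.test(y)$p.value > alpha
  if (x_normal && y_normal) "pearson" else "spearman"
}

# Deterministic illustration: exact normal quantiles vs. a right-skewed transform
normal_var <- stats::qnorm(stats::ppoints(200))
skewed_var <- exp(normal_var)

method_nn <- choose_cor_method(normal_var, normal_var)  # both look normal
method_ns <- choose_cor_method(normal_var, skewed_var)  # one is clearly skewed

rho <- stats::cor(normal_var, skewed_var, method = method_ns)
print(c(method_nn, method_ns))
print(rho)
```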

3. Normal Distribution Numerical variable VS Binary nominal factor
In this case, apply the two independent samples t-test to check whether there is a relationship between the two variables. Use the R code below to run this test:

# tar:   Normal Distribution Numerical variable
# field: Binary nominal factor
tTest <- stats::t.test(tar ~ field)

p_value <- tTest$p.value
isPredictor <- "Yes"
if (p_value > 0.05) {
  levelOfConfidence <- "very weak"
  isPredictor <- "No"  # assumed default; the original left it unset here
} else if (p_value >= 0.01) {
  levelOfConfidence <- "moderate"
} else if (p_value >= 0.005) {
  levelOfConfidence <- "strong"
} else {
  levelOfConfidence <- "very strong"
}
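Since the same p-value ladder recurs for every test in this thread, it could be factored into one function. The sketch below assumes a helper named `classify_p_value` (hypothetical) and treats a non-significant result as `isPredictor = "No"`, which is an assumption rather than something stated above. The usage example runs a t-test on the built-in `mtcars` data.

```r
# Hypothetical helper wrapping the p-value thresholds used in this thread
classify_p_value <- function(p_value) {
  stopifnot(is.numeric(p_value), length(p_value) == 1, !is.na(p_value))
  if (p_value > 0.05) {
    level <- "very weak"
  } else if (p_value >= 0.01) {
    level <- "moderate"
  } else if (p_value >= 0.005) {
    level <- "strong"
  } else {
    level <- "very strong"
  }
  list(
    levelOfConfidence = level,
    # Assumption: anything above 0.05 is not treated as a predictor
    isPredictor = if (p_value > 0.05) "No" else "Yes"
  )
}

# Usage: mpg (numeric) by am (binary: automatic vs. manual) in mtcars
tTest <- stats::t.test(mpg ~ am, data = mtcars)
result <- classify_p_value(tTest$p.value)
print(result)
```

Each test would then only need to produce a p-value and pass it to the helper.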

4. Normal Distribution Numerical variable VS Multinomial factor
For a multinomial factor, apply the ANOVA test as in the R code below:

# tar:   Normal Distribution Numerical variable
# field: Multinomial factor
# The numeric response goes on the left of the formula, the factor on the right
test <- stats::aov(tar ~ field)

p_value <- summary(test)[[1]][["Pr(>F)"]][1]
isPredictor <- "Yes"
if (p_value > 0.05) {
  levelOfConfidence <- "very weak"
  isPredictor <- "No"  # assumed default; the original left it unset here
} else if (p_value >= 0.01) {
  levelOfConfidence <- "moderate"
} else if (p_value >= 0.005) {
  levelOfConfidence <- "strong"
} else {
  levelOfConfidence <- "very strong"
}
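As a quick check of the ANOVA call, here is a small sketch on the built-in `iris` dataset, where `Sepal.Length` plays the role of the numerical variable and `Species` the three-level factor. The dataset choice is only illustrative.

```r
# Sepal.Length (numeric) across the three levels of Species (factor)
test <- stats::aov(Sepal.Length ~ Species, data = iris)

# The first row of the ANOVA table holds the factor's F test
anova_table <- summary(test)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]

print(anova_table)
print(p_value)
```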

5. Non Normal Distribution Numerical variable VS Binary nominal factor
For a non-normally distributed numerical variable, use the Wilcoxon-Mann-Whitney test:

# tar:   Non Normal Distribution Numerical variable
# field: Binary nominal factor
test <- stats::wilcox.test(tar ~ field)
p_value <- test$p.value
isPredictor <- "Yes"
if (p_value > 0.05) {
  levelOfConfidence <- "very weak"
  isPredictor <- "No"  # assumed default; the original left it unset here
} else if (p_value >= 0.01) {
  levelOfConfidence <- "moderate"
} else if (p_value >= 0.005) {
  levelOfConfidence <- "strong"
} else {
  levelOfConfidence <- "very strong"
}
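A minimal runnable sketch of this test on the built-in `mtcars` data (an illustrative choice, with `mpg` as the numerical variable and `am` as the binary factor):

```r
# exact = FALSE requests the normal approximation: mpg contains ties,
# and an exact p-value cannot be computed when ties are present
test <- stats::wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)
p_value <- test$p.value

print(test$statistic)
print(p_value)
```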

6. Non Normal Distribution Numerical variable VS Multinomial factor
To check whether there is a relationship between these variables, apply the Kruskal-Wallis test using the R code below:

# tar:   Non Normal Distribution Numerical variable
# field: Multinomial factor
test <- stats::kruskal.test(tar ~ field)
p_value <- test$p.value
isPredictor <- "Yes"
if (p_value > 0.05) {
  levelOfConfidence <- "very weak"
  isPredictor <- "No"  # assumed default; the original left it unset here
} else if (p_value >= 0.01) {
  levelOfConfidence <- "moderate"
} else if (p_value >= 0.005) {
  levelOfConfidence <- "strong"
} else {
  levelOfConfidence <- "very strong"
}
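For illustration, the same call can be tried on the built-in `airquality` dataset, where `Ozone` is a skewed numerical variable and `Month` has five levels. The dataset is an assumption made for the example only.

```r
# Ozone (skewed numeric, with missing values) across the five months in airquality.
# The formula interface drops rows with missing values by default.
test <- stats::kruskal.test(Ozone ~ Month, data = airquality)
p_value <- test$p.value

print(test$parameter)  # degrees of freedom: number of groups minus one
print(p_value)
```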

To be continued ...

@maherdaoud

In the previous comment, we covered the nominal factor and the tests suited to checking whether it is related to a numerical variable.
Now, let's talk about the ordinal factor, which is the second type of variable measurement scale.

7. Ordinal factor vs Numerical variable
In this case, apply Spearman's rank-order correlation test using the R code below:

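# tar:   Numerical variable
# field: Ordinal factor (its levels must be in the intended order)
# method = "spearman" ranks both inputs itself, so no explicit rank() call is needed
cor(tar, as.numeric(field), method = "spearman")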
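The result depends on the factor's level order, because `as.numeric()` returns the level codes. The sketch below uses made-up data (`satisfaction` and `tar` are illustrative) to show how an alphabetical default ordering can distort the coefficient.

```r
# Illustrative data: the numeric variable rises with the ordinal category
satisfaction <- rep(c("low", "medium", "high"), each = 4)
tar <- 1:12

# Levels declared in their meaningful order
field <- factor(satisfaction, levels = c("low", "medium", "high"), ordered = TRUE)
rho_ordered <- stats::cor(tar, as.numeric(field), method = "spearman")

# Default factor(): levels sorted alphabetically (high, low, medium)
field_alpha <- factor(satisfaction)
rho_alpha <- stats::cor(tar, as.numeric(field_alpha), method = "spearman")

print(c(ordered = rho_ordered, alphabetical = rho_alpha))
```

With the correct level order the coefficient is strongly positive; with the alphabetical order it turns negative on the same data.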

8. Nominal factor VS Nominal factor
To check whether there is a relationship between two nominal variables, first build a contingency table and then run the chi-squared test.

We then apply Cramér's V to measure the strength of the association between the two variables. See the following R code:

# tar:   Nominal factor
# field: Nominal factor
contingency_table <- table(tar, field)
test <- stats::chisq.test(contingency_table, correct = FALSE)
p_value <- test$p.value
cramerTest <- rcompanion::cramerV(contingency_table, digits = 4)
statisticalTests <- "Chi-squared Test | Cramer's V test"
isPredictor <- "Yes"
if (p_value > 0.05 || cramerTest < 0.29) {
  levelOfConfidence <- "very weak"
  isPredictor <- "No"  # assumed default; the original left it unset here
} else if (p_value >= 0.01) {
  levelOfConfidence <- "moderate"
} else if (p_value >= 0.005) {
  levelOfConfidence <- "strong"
} else {
  levelOfConfidence <- "very strong"
}
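For readers who want to see what Cramér's V measures, here is a base-R sketch that computes it from the chi-squared statistic as V = sqrt(chi² / (n * (k - 1))), where k is the smaller table dimension. The 2x2 counts are invented for illustration, and this is the plain formula without any bias correction.

```r
# Illustrative 2x2 contingency table (made-up counts)
contingency_table <- as.table(matrix(
  c(30, 10, 10, 30), nrow = 2,
  dimnames = list(tar = c("a", "b"), field = c("x", "y"))
))

test <- stats::chisq.test(contingency_table, correct = FALSE)
n <- sum(contingency_table)       # total number of observations
k <- min(dim(contingency_table))  # smaller of the number of rows and columns

cramers_v <- sqrt(unname(test$statistic) / (n * (k - 1)))
print(cramers_v)
```

V ranges from 0 (no association) to 1 (perfect association), which is why the code above can compare it against a fixed cut-off.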

8. Ordinal factor VS Ordinal factor
For this case, apply the linear-by-linear association test.

The null hypothesis for the linear-by-linear test is that there is no association among the variables in the table. A significant p-value suggests that there is an association. This is similar to a chi-square test, except that the categories are ordered in nature.

You can use the following R code

# tar:   Ordinal factor
# field: Ordinal factor
contingency_table <- table(tar, field)
test <- coin::lbl_test(contingency_table)
p_value <- coin::pvalue(test)
statisticalTests <- "Linear-by-Linear Association Test"
isPredictor <- "Yes"
if (p_value > 0.05) {
  levelOfConfidence <- "very weak"
  isPredictor <- "No"  # assumed default; the original left it unset here
} else if (p_value >= 0.01) {
  levelOfConfidence <- "moderate"
} else if (p_value >= 0.005) {
  levelOfConfidence <- "strong"
} else {
  levelOfConfidence <- "very strong"
}
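To illustrate the idea behind the test without extra packages, the sketch below computes the classical linear-by-linear statistic M² = (n - 1) r², where r is the Pearson correlation between integer row and column scores, and compares it to a chi-squared distribution with one degree of freedom. The counts and the equally spaced scores are assumptions, and the result is not guaranteed to match the output of `coin::lbl_test` exactly.

```r
# Illustrative 3x3 table of two ordinal variables (made-up counts)
contingency_table <- matrix(c(20, 10,  5,
                              10, 15, 10,
                               5, 10, 20), nrow = 3, byrow = TRUE)

# Expand the table to one (row score, column score) pair per observation
counts <- as.vector(contingency_table)
row_scores <- rep(as.vector(row(contingency_table)), times = counts)
col_scores <- rep(as.vector(col(contingency_table)), times = counts)

n <- sum(contingency_table)
r <- stats::cor(row_scores, col_scores)  # correlation between the scores
m_squared <- (n - 1) * r^2               # linear-by-linear statistic
p_value <- stats::pchisq(m_squared, df = 1, lower.tail = FALSE)

print(c(r = r, m_squared = m_squared, p_value = p_value))
```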

9. Ordinal factor VS Nominal factor
If the nominal factor has only two categories and the ordinal factor has k categories, use the Cochran–Armitage test with the R code below:

# field: Nominal factor with two categories
# tar:   Ordinal factor
contingency_table <- table(field, as.numeric(tar), dnn = c("independent", "dependent"))
test <- coin::chisq_test(contingency_table,
                         scores = list("dependent" = seq_along(levels(tar))))
p_value <- coin::pvalue(test)
statisticalTests <- "Cochran–Armitage Test"
isPredictor <- "Yes"
if (p_value > 0.05) {
  levelOfConfidence <- "very weak"
  isPredictor <- "No"  # assumed default; the original left it unset here
} else if (p_value >= 0.01) {
  levelOfConfidence <- "moderate"
} else if (p_value >= 0.005) {
  levelOfConfidence <- "strong"
} else {
  levelOfConfidence <- "very strong"
}
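Base R also ships `stats::prop.trend.test`, a chi-squared test for trend in proportions that is closely related to the Cochran–Armitage test. The sketch below is an alternative under that assumption, not the approach used above; the counts are illustrative, with one entry per ordered category.

```r
# Illustrative counts: "successes" and group sizes in four ordered categories
smokers  <- c(83, 90, 129, 70)
patients <- c(86, 93, 136, 82)

# score gives the numeric position of each ordered category (equally spaced here)
test <- stats::prop.trend.test(smokers, patients, score = seq_along(smokers))
p_value <- test$p.value

print(test$statistic)
print(p_value)
```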

I think we now need to cover the remaining cases, such as Date variable VS Numerical variable, Date variable VS Ordinal factor, and Nominal factor with more than two categories VS Ordinal factor.
