
  • Hypothesis Testing with Python

     Datura updated 3 years, 1 month ago 1 Member · 2 Posts
  • Datura

    Member
    January 31, 2021 at 5:36 pm

    The code below gives the general method for running a t-test, ANOVA, and Chi-Square test with Python. Please feel free to ask questions. Note that `impute` is the input DataFrame.

    ################### Part IV: Statistical Analyses ##########################
    # 1) Test the association between account status and different factors with a Chi-Square test.
    import pandas as pd
    from scipy.stats import chi2_contingency  # SciPy's built-in Chi-Square test function

    impute.head()
    impute.info()

    ###### For reporting, the summary includes grand totals (margins=True). ######
    ###### For the Chi-Square test, the contingency table excludes grand totals. ######
    summary = impute.pivot_table('acctno',           ### analysis variable
                                 index=['credit'],   ### rows
                                 columns=['active'], ### columns
                                 aggfunc='nunique',
                                 margins=True)
    summary

    df = impute.pivot_table('acctno',            ### analysis variable
                            index=['credit'],    ### rows
                            columns=['active'],  ### columns
                            aggfunc='nunique',
                            margins=False)

    ######### Use the chi2_contingency() function to run the Chi-Square test. #########
    ### Degrees of freedom are calculated with the formula
    ### DOF = (r-1)(c-1), where r and c are the numbers of levels of the row and column variables.
    chi2_stat, p_value, DOF, expected = chi2_contingency(df, correction=False)
    # "correction=False" means no Yates' continuity correction is applied.

    results = pd.DataFrame({'Chi-Square': round(chi2_stat, 2),
                            'Degrees of Freedom': DOF,
                            'p_value': p_value}, index=['Chi-Square Test Output'])

    ### Since the p-value < 0.05, we reject H0 and conclude that there is an association between credit and account status.
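    Since the `impute` DataFrame is not attached to the post, here is a self-contained sketch of the same pivot_table → chi2_contingency workflow on a made-up toy DataFrame (the column names mirror the post; all values are hypothetical):

    ```python
    import pandas as pd
    from scipy.stats import chi2_contingency

    # Hypothetical toy data mimicking the post's impute DataFrame:
    # one row per account, with credit quality and active status.
    toy = pd.DataFrame({
        'acctno': range(12),
        'credit': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
        'active': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1],
    })

    # Contingency table WITHOUT grand totals (margins=False) for the test.
    ct = toy.pivot_table('acctno', index='credit', columns='active',
                         aggfunc='nunique', margins=False)

    chi2_stat, p_value, dof, expected = chi2_contingency(ct, correction=False)
    # DOF = (r-1)(c-1) = (2-1)(2-1) = 1 for this 2x2 table.
    print(f"Chi-Square={chi2_stat:.2f}, DOF={dof}, p-value={p_value:.4f}")
    ```

    With only 12 toy rows the p-value will not be significant; the point is the mechanics, not the conclusion.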
    ########  Do the same thing on DealerType/RatePlan vs. Status etc. ############
    ### 2) Test the association between account status and tenure segment.
    summary = impute.pivot_table('acctno',
                                 index=['tenure_bin'],
                                 columns=['active'],
                                 aggfunc='nunique',
                                 margins=True)
    df = impute.pivot_table('acctno',
                            index=['tenure_bin'],
                            columns=['active'],
                            aggfunc='nunique',
                            margins=False)

    chi2_stat, p_value, DOF, expected = chi2_contingency(df, correction=False)
    # "correction=False" means no Yates' continuity correction is applied.

    results = pd.DataFrame({'Chi-Square': round(chi2_stat, 2),
                            'Degrees of Freedom': DOF,
                            'p_value': p_value}, index=['Chi-Square Test Output'])

    ### Since the p-value < 0.05, we reject H0 and conclude that there is an association between tenure segment and account status.
  • Datura

    Member
    January 31, 2021 at 5:46 pm
    ### 3) Test the means of sales (or other numeric variables) with Student's t-test and ANOVA.
    import scipy.stats as stats

    ### Two-sample t-test
    summary = impute.pivot_table('sales',
                                 columns=['credit'],
                                 aggfunc='mean',
                                 margins=False)

    # Run an independent-sample t-test (Welch's version: equal_var=False).
    tStat, p_value = stats.ttest_ind(impute[impute['credit'] == 0]['sales'],
                                     impute[impute['credit'] == 1]['sales'],
                                     equal_var=False,
                                     nan_policy='omit')
    print("t-Stat: {0}, p-value: {1}".format(tStat, p_value))
    # Since the p-value > 0.05, we cannot reject H0: there is no significant difference in mean sales between good- and bad-credit customers.
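    For readers without the `impute` data, the same Welch t-test (`equal_var=False`, i.e. no equal-variance assumption) can be tried on simulated sales figures; the group means, spreads, and sizes below are made up for illustration:

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    # Hypothetical sales figures for two credit groups (same true mean,
    # different variances and sample sizes).
    sales_bad = rng.normal(loc=100, scale=20, size=200)
    sales_good = rng.normal(loc=100, scale=25, size=250)

    # Welch's t-test does not assume equal variances across groups.
    t_stat, p_value = stats.ttest_ind(sales_bad, sales_good, equal_var=False)
    print("t-Stat: {0}, p-value: {1}".format(t_stat, p_value))
    ```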
    ### Multiple-sample ANOVA test; we look at deactivated customers only. ###
    summary = impute.pivot_table('sales',
                                 columns=['reason'],
                                 aggfunc='mean',
                                 margins=False)
    ### Drop null values in the analysis variable, otherwise the results will be nan.
    impute.isnull().sum()
    sub = impute[(impute['reason'] != ' ') & (pd.notnull(impute['sales']))]
    FStat, p_value = stats.f_oneway(sub[sub['reason'] == 'COMP']['sales'],
                                    sub[sub['reason'] == 'DEBT']['sales'],
                                    sub[sub['reason'] == 'MOVE']['sales'],
                                    sub[sub['reason'] == 'NEED']['sales'],
                                    sub[sub['reason'] == 'TECH']['sales'])
    # F-test DOF: between-group = k-1 = 4, within-group = N-k = len(sub)-5.
    print("F-Stat: {0}, DOF=({1}, {2}), p-value: {3}".format(FStat, 4, len(sub) - 5, p_value))
    # Since p-value<0.05, we reject H0 and conclude that the sales means
    # significantly differ among people with different deactivation reasons.
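    The same `f_oneway` call can be reproduced on simulated data; the three reason groups and their means below are invented, with 'MOVE' deliberately shifted upward so the ANOVA comes out significant:

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Hypothetical sales by deactivation reason (illustrative values only).
    groups = {
        'COMP': rng.normal(90, 15, 60),
        'DEBT': rng.normal(95, 15, 60),
        'MOVE': rng.normal(120, 15, 60),  # shifted group
    }
    f_stat, p_value = stats.f_oneway(*groups.values())

    # Between-group DOF = k-1, within-group DOF = N-k.
    k = len(groups)
    n_total = sum(len(g) for g in groups.values())
    print("F-Stat: {0}, DOF=({1}, {2}), p-value: {3}".format(
        f_stat, k - 1, n_total - k, p_value))
    ```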
    ### Note: The ANOVA above tells us the treatment differences are statistically significant,
    ### but it does not tell us which treatments differ from each other.
    ### To identify the significantly different pairs, we perform multiple
    ### pairwise (post-hoc) comparisons using the Tukey HSD test.
    ####################  Tukey's multi-comparison method #########################
    # This method tests at p < 0.05, correcting for the fact that multiple comparisons
    # would otherwise inflate the probability of finding a spurious significant
    # difference. A result of 'reject = True' means that a significant
    # difference has been observed.
    # Load packages.
    from statsmodels.stats.multicomp import MultiComparison
    ### class statsmodels.stats.multicomp.MultiComparison(data, groups, group_order=None)
    sub.info()

    MultiComp = MultiComparison(sub['sales'],                       ### analysis variable
                                sub['reason'].replace(' ', ' NA'),  ### group variable
                                group_order=None)  ### group_order: the desired order for the group mean results to be reported in

    results = MultiComp.tukeyhsd().summary()
    results
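    As a standalone illustration of the post-hoc step, statsmodels' `pairwise_tukeyhsd` convenience function (equivalent to `MultiComparison(...).tukeyhsd()`) can be run on simulated groups; the data below are made up, with 'MOVE' shifted upward so its pairwise comparisons are flagged `reject = True`:

    ```python
    import numpy as np
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(1)
    # Hypothetical sales with one clearly separated group ('MOVE').
    values = np.concatenate([
        rng.normal(90, 15, 60),   # COMP
        rng.normal(95, 15, 60),   # DEBT
        rng.normal(120, 15, 60),  # MOVE
    ])
    labels = ['COMP'] * 60 + ['DEBT'] * 60 + ['MOVE'] * 60

    # Pairs are compared at the given family-wise alpha level.
    result = pairwise_tukeyhsd(values, labels, alpha=0.05)
    print(result.summary())
    ```

    `result.reject` is a boolean array, one entry per pair, in the same order as the summary rows.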
    ############################ End of Program. #################################
