
  • Hypothesis Testing with Python

     Datura updated 3 years, 1 month ago 1 Member · 2 Posts
  • Datura

    Member
    January 31, 2021 at 5:36 pm

    The code below gives the general method for running a t-test, ANOVA, and Chi-Square test with Python. Please feel free to ask questions. Note that `impute` is the input DataFrame.

    ################### Part IV: Statistical Analyses ##########################
    # 1) Test the association between account status and different factors with a Chi-Square test.
    import pandas as pd
    from scipy.stats import chi2_contingency  # SciPy's built-in Chi-Square test function

    impute.head()
    impute.info()

    ###### For reporting, the summary includes grand totals (margins=True). ######
    ###### For the Chi-Square test, the contingency table excludes grand totals. ######
    summary = impute.pivot_table('acctno',           ### analysis variable
                                 index=['credit'],   ### rows
                                 columns=['active'], ### columns
                                 aggfunc='nunique',
                                 margins=True)
    summary

    df = impute.pivot_table('acctno',            ### analysis variable
                            index=['credit'],    ### rows
                            columns=['active'],  ### columns
                            aggfunc='nunique',
                            margins=False)

    ######### Use the chi2_contingency() function to run the Chi-Square test. #########
    ### Degrees of freedom are calculated with the formula
    ### DOF = (r-1)(c-1), where r and c are the numbers of levels of the row and column variables.
    chi2_stat, p_value, DOF, expected = chi2_contingency(df, correction=False)
    # "correction=False" means no Yates' continuity correction is applied.

    results = pd.DataFrame({'Chi-Square': round(chi2_stat, 2),
                            'Degrees of Freedom': DOF,
                            'p_value': p_value}, index=['Chi-Square Test Output'])

    ### Since the p-value < 0.05, we reject H0 and conclude that there is an association between credit and account status.
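    Since the `impute` DataFrame is not attached to the post, here is a self-contained sketch of the same pivot_table → chi2_contingency workflow on a made-up toy DataFrame (the column names mirror the post; all values are hypothetical):

    ```python
    import pandas as pd
    from scipy.stats import chi2_contingency

    # Hypothetical toy data mimicking the post's impute DataFrame:
    # one row per account, with credit quality and active status.
    toy = pd.DataFrame({
        'acctno': range(12),
        'credit': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
        'active': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1],
    })

    # Contingency table WITHOUT grand totals (margins=False) for the test.
    ct = toy.pivot_table('acctno', index='credit', columns='active',
                         aggfunc='nunique', margins=False)

    chi2_stat, p_value, dof, expected = chi2_contingency(ct, correction=False)
    # DOF = (r-1)(c-1) = (2-1)(2-1) = 1 for this 2x2 table.
    print(f"Chi-Square={chi2_stat:.2f}, DOF={dof}, p-value={p_value:.4f}")
    ```

    With only 12 toy rows the p-value will not be significant; the point is the mechanics, not the conclusion.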
    ########  Do the same thing on DealerType/RatePlan vs. Status etc. ############
    ### 2) Test the association between account status and tenure segment.
    summary = impute.pivot_table('acctno',
                                 index=['tenure_bin'],
                                 columns=['active'],
                                 aggfunc='nunique',
                                 margins=True)
    df = impute.pivot_table('acctno',
                            index=['tenure_bin'],
                            columns=['active'],
                            aggfunc='nunique',
                            margins=False)

    chi2_stat, p_value, DOF, expected = chi2_contingency(df, correction=False)
    # "correction=False" means no Yates' continuity correction is applied.

    results = pd.DataFrame({'Chi-Square': round(chi2_stat, 2),
                            'Degrees of Freedom': DOF,
                            'p_value': p_value}, index=['Chi-Square Test Output'])

    ### Since the p-value < 0.05, we reject H0 and conclude that there is an association between tenure segment and account status.
  • Datura

    Member
    January 31, 2021 at 5:46 pm
    ### 3) Test the means of sales (or other numeric variables) with Student's t-test and ANOVA.
    import scipy.stats as stats

    ### Two-sample t-test
    summary = impute.pivot_table('sales',
                                 columns=['credit'],
                                 aggfunc='mean',
                                 margins=False)

    # Run an independent-sample t-test (Welch's version: equal_var=False).
    tStat, p_value = stats.ttest_ind(impute[impute['credit'] == 0]['sales'],
                                     impute[impute['credit'] == 1]['sales'],
                                     equal_var=False,
                                     nan_policy='omit')
    print("t-Stat: {0}, p-value: {1}".format(tStat, p_value))
    # Since the p-value > 0.05, we cannot reject H0: there is no significant difference in mean sales between good- and bad-credit customers.
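    For readers without the `impute` data, the same Welch t-test (`equal_var=False`, i.e. no equal-variance assumption) can be tried on simulated sales figures; the group means, spreads, and sizes below are made up for illustration:

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    # Hypothetical sales figures for two credit groups (same true mean,
    # different variances and sample sizes).
    sales_bad = rng.normal(loc=100, scale=20, size=200)
    sales_good = rng.normal(loc=100, scale=25, size=250)

    # Welch's t-test does not assume equal variances across groups.
    t_stat, p_value = stats.ttest_ind(sales_bad, sales_good, equal_var=False)
    print("t-Stat: {0}, p-value: {1}".format(t_stat, p_value))
    ```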
    ### Multiple-sample ANOVA test; we look at deactivated customers only. ###
    summary = impute.pivot_table('sales',
                                 columns=['reason'],
                                 aggfunc='mean',
                                 margins=False)
    ### Drop null values in the analysis variable, otherwise the results will be nan.
    impute.isnull().sum()
    sub = impute[(impute['reason'] != ' ') & (pd.notnull(impute['sales']))]
    FStat, p_value = stats.f_oneway(sub[sub['reason'] == 'COMP']['sales'],
                                    sub[sub['reason'] == 'DEBT']['sales'],
                                    sub[sub['reason'] == 'MOVE']['sales'],
                                    sub[sub['reason'] == 'NEED']['sales'],
                                    sub[sub['reason'] == 'TECH']['sales'])
    # F-test DOF: between-group = k-1 = 4, within-group = N-k = len(sub)-5.
    print("F-Stat: {0}, DOF=({1}, {2}), p-value: {3}".format(FStat, 4, len(sub) - 5, p_value))
    # Since p-value<0.05, we reject H0 and conclude that the sales means
    # significantly differ among people with different deactivation reasons.
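    The same `f_oneway` call can be reproduced on simulated data; the three reason groups and their means below are invented, with 'MOVE' deliberately shifted upward so the ANOVA comes out significant:

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Hypothetical sales by deactivation reason (illustrative values only).
    groups = {
        'COMP': rng.normal(90, 15, 60),
        'DEBT': rng.normal(95, 15, 60),
        'MOVE': rng.normal(120, 15, 60),  # shifted group
    }
    f_stat, p_value = stats.f_oneway(*groups.values())

    # Between-group DOF = k-1, within-group DOF = N-k.
    k = len(groups)
    n_total = sum(len(g) for g in groups.values())
    print("F-Stat: {0}, DOF=({1}, {2}), p-value: {3}".format(
        f_stat, k - 1, n_total - k, p_value))
    ```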
    ### Note: The ANOVA above tells us the treatment differences are statistically significant,
    ### but it does not tell us which treatments differ from each other.
    ### To identify the significantly different pairs, we perform multiple
    ### pairwise (post-hoc) comparisons using the Tukey HSD test.
    ####################  Tukey's multi-comparison method #########################
    # This method tests at p < 0.05, correcting for the fact that multiple comparisons
    # would otherwise inflate the probability of finding a spurious significant
    # difference. A result of 'reject = True' means that a significant
    # difference has been observed.
    # Load packages.
    from statsmodels.stats.multicomp import MultiComparison
    ### class statsmodels.stats.multicomp.MultiComparison(data, groups, group_order=None)
    sub.info()

    MultiComp = MultiComparison(sub['sales'],                       ### analysis variable
                                sub['reason'].replace(' ', ' NA'),  ### group variable
                                group_order=None)  ### group_order: the desired order for the group mean results to be reported in

    results = MultiComp.tukeyhsd().summary()
    results
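    As a standalone illustration of the post-hoc step, statsmodels' `pairwise_tukeyhsd` convenience function (equivalent to `MultiComparison(...).tukeyhsd()`) can be run on simulated groups; the data below are made up, with 'MOVE' shifted upward so its pairwise comparisons are flagged `reject = True`:

    ```python
    import numpy as np
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(1)
    # Hypothetical sales with one clearly separated group ('MOVE').
    values = np.concatenate([
        rng.normal(90, 15, 60),   # COMP
        rng.normal(95, 15, 60),   # DEBT
        rng.normal(120, 15, 60),  # MOVE
    ])
    labels = ['COMP'] * 60 + ['DEBT'] * 60 + ['MOVE'] * 60

    # Pairs are compared at the given family-wise alpha level.
    result = pairwise_tukeyhsd(values, labels, alpha=0.05)
    print(result.summary())
    ```

    `result.reject` is a boolean array, one entry per pair, in the same order as the summary rows.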
    ############################ End of Program. #################################
