Members - sparkdata

Datura

Member

November 15, 2020 at 5:33 pm

(3) Analysis of Variance (ANOVA) Test

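There are many types of t-test: one-sample, two-sample, paired, and so on. Their differences will be covered later. A t-test can only compare two groups of (continuous) data; with multiple variables or more than two groups, it falls short. That is when the powerful ANOVA makes its entrance.

Take the chemical reaction above. The experiment is run at 6 different reaction temperatures, giving 6 groups of reaction yields, one group per temperature. How do we decide whether these groups really differ? Just follow the same approach.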

H0: μ1 = μ2 = μ3 = ... = μ6

H1: μ1, μ2, ..., μ6 are not all equal.

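Load the experimental data into Excel, SAS, or similar software, run the ANOVA calculation, and then:

1. If p > alpha, we cannot reject H0. That is, these groups may not differ significantly; the observed differences may simply be due to random error.

2. If p < alpha, we reject H0. That is, the 6 groups differ significantly: at least one group differs from the others.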
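As a rough illustration of the procedure above, here is a minimal Python sketch that uses scipy.stats.f_oneway in place of Excel or SAS. The yield numbers are invented purely for illustration (they are not from the experiment discussed here), and alpha = 0.05 is an assumed significance level.

```python
from scipy import stats

# Hypothetical reaction yields (%) at 6 temperatures, 3 runs each.
# These numbers are invented for illustration only.
yields_by_temp = {
    50: [70.1, 71.2, 69.8],
    60: [72.5, 73.1, 72.0],
    70: [75.3, 74.8, 76.0],
    80: [78.2, 77.5, 78.9],
    90: [76.1, 75.4, 76.8],
    100: [71.0, 70.2, 71.9],
}

alpha = 0.05  # assumed significance level

# One-way ANOVA: H0 says all 6 group means are equal.
f_stat, p_value = stats.f_oneway(*yields_by_temp.values())

reject_h0 = p_value < alpha
print(f"F = {f_stat:.2f}, p = {p_value:.4g}, reject H0: {reject_h0}")
```

Rejecting H0 here only says that the group means are not all equal; it does not say which temperature is the odd one out.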

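ANOVA can be used to evaluate almost any kind of data comparison, as well as regression analysis, which makes it very powerful. Also note that ANOVA comes in both one-way and two-way forms.

In digital marketing, Student’s t-test or the chi-squared test is called A/B testing. In fact, these are just the hypothesis tests introduced here. Don't fail to recognize them just because they are wearing a different outfit!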

Datura

Member

November 12, 2020 at 6:43 pm

I think that inside the query() function, a bare name is looked up as a column of the Loan data frame. x is an external Python variable, so query() cannot find it unless you reference it as @x.

Alternatively, use another way to subset the data, such as boolean indexing.
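Both ways of subsetting with an external variable can be sketched as follows. Inside query(), the @ prefix marks a Python variable rather than a column. The Loan data frame, its loan_amount column, and the threshold x below are hypothetical stand-ins, since the original data is not shown.

```python
import pandas as pd

# Hypothetical stand-in for the Loan data frame.
Loan = pd.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "loan_amount": [5000, 12000, 8000, 20000],
})

x = 10000  # external Python variable, not a column of Loan

# Inside query(), bare names are looked up as columns; @x refers to the Python variable x.
via_query = Loan.query("loan_amount > @x")

# Equivalent subset with ordinary boolean indexing.
via_mask = Loan[Loan["loan_amount"] > x]

print(via_query)
```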

Datura

Member

November 11, 2020 at 11:08 pm

The X axis is the same, but the Y axis is different.

In a cumulative gains chart, we want to show how many more targets the predictive model captures compared with random selection. With the cumulative distributions of the Good and Bad classes, by contrast, we want to see the separation power of the model.
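The difference can be made concrete with a small sketch. The scores and outcomes below are made up, and column names such as cum_bad_pct are hypothetical. Both views share the same X axis (cumulative share of records, sorted from highest to lowest score) and differ only in what goes on Y.

```python
import pandas as pd

# Made-up scored sample: score = predicted probability of Bad, bad = observed outcome.
df = pd.DataFrame({
    "score": [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10],
    "bad":   [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

# Shared X axis: cumulative share of records, riskiest first.
df = df.sort_values("score", ascending=False).reset_index(drop=True)
n = len(df)
df["pct_population"] = [(i + 1) / n for i in range(n)]

# Cumulative share of each class captured so far.
df["cum_bad_pct"] = df["bad"].cumsum() / df["bad"].sum()
df["cum_good_pct"] = (1 - df["bad"]).cumsum() / (1 - df["bad"]).sum()

# Gains chart Y: targets captured by the model; the random baseline is pct_population.
df["gain_over_random"] = df["cum_bad_pct"] - df["pct_population"]

# Good/Bad cumulative distributions: the largest gap is commonly called the KS statistic.
df["separation"] = df["cum_bad_pct"] - df["cum_good_pct"]
ks = df["separation"].max()

print(df[["pct_population", "cum_bad_pct", "cum_good_pct"]])
print("KS =", round(ks, 4))
```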

Did I answer your questions?

Datura

Member

November 11, 2020 at 12:04 pm

Generally, it is a work process. We need to understand it in context.

Datura

Member

November 9, 2020 at 7:20 pm

Interesting. Too good to be true, very questionable!

Let me look into the data when I get time.

Datura

Member

November 9, 2020 at 7:52 am

That’s OK. In this case we may get only 8 or 9 bins rather than 10. One bin may hold a big chunk, say about 20% of the records rather than 10%. It happens in real work, and it is fine.
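A small sketch of how this can happen, assuming made-up scores in which 20% of the records are tied at the lowest value. With pandas' qcut(), duplicates="drop" merges the repeated bin edges, so asking for 10 bins yields 9 here, with the first bin holding the tied 20%.

```python
import pandas as pd

# Made-up scores: 20 records tied at 600, then 80 distinct values 601..680.
scores = pd.Series([600] * 20 + list(range(601, 681)))

# Ask for 10 quantile bins; the tied values make two bin edges identical,
# and duplicates="drop" collapses them instead of raising an error.
bins = pd.qcut(scores, q=10, duplicates="drop")

counts = bins.value_counts().sort_index()
print(counts)
print("number of bins:", bins.nunique())
```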

Datura

Member

November 8, 2020 at 9:54 pm

A credit score of zero or below is theoretically possible, because we can choose the PDO (points to double the odds) and the offset to scale a predicted probability to any desired score range.
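This kind of scaling is commonly written as score = offset + factor × ln(odds), with factor = PDO / ln 2. Below is a minimal sketch assuming that convention. The function name scale_score and all parameter values (for example 600 points at odds of 50:1 with a PDO of 20) are illustrative choices, not values from this thread.

```python
import math

def scale_score(p_bad, base_score, base_odds, pdo):
    """Map a predicted probability of Bad to a score (hypothetical helper)."""
    factor = pdo / math.log(2)                        # points per unit of log-odds
    offset = base_score - factor * math.log(base_odds)
    odds = (1 - p_bad) / p_bad                        # good:bad odds
    return offset + factor * math.log(odds)

# Illustrative scaling: 600 points at odds 50:1, 20 points to double the odds.
print(round(scale_score(1 / 51, 600, 50, 20), 1))     # odds of 50:1
print(round(scale_score(1 / 101, 600, 50, 20), 1))    # odds of 100:1, one PDO higher

# A different (equally arbitrary) scaling pushes an even-odds record below zero.
print(round(scale_score(0.5, 100, 50, 50), 1))
```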

In practice, though, it is more convenient to set the score range between 100 and 1000. In that case, zero and negative values are not real scores; they are special value codes meaning "no score" or something else. We need to exclude these rows from our analysis.

To create deciles, we divide the records into 10 approximately equal-sized bins based on the score. In Python we can use the qcut() function; in SAS, use PROC RANK.
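A minimal sketch of these two steps in pandas, assuming a hypothetical data frame with a score column in which 0 and -1 are special value codes (the data is made up):

```python
import pandas as pd

# Made-up data: 100 real scores plus three special value codes (0, -1, 0).
df = pd.DataFrame({"score": list(range(300, 900, 6)) + [0, -1, 0]})

# Step 1: drop rows whose "score" is a special code rather than a real score.
valid = df[df["score"] > 0].copy()

# Step 2: cut the remaining scores into 10 roughly equal-sized bins.
# labels=False returns bin numbers 0..9; add 1 to get deciles 1..10.
valid["decile"] = pd.qcut(valid["score"], q=10, labels=False) + 1

print(valid.groupby("decile")["score"].agg(["count", "min", "max"]))
```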

Datura

Member

November 9, 2020 at 12:26 pm

???

Datura

Member

October 26, 2020 at 10:10 pm

Wow, it works, awesome! You are a genius. Thank you! We need to use the locals()/globals() symbol tables, which are similar to the macro symbol tables in SAS.

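import pandas as pd

# raw is the list of tables obtained earlier in this thread
df1=pd.DataFrame({})
df2=pd.DataFrame({})
df3=pd.DataFrame({})
df4=pd.DataFrame({})
df5=pd.DataFrame({})
df6=pd.DataFrame({})
dflist=[df1,df2,df3, df4, df5, df6]
for i in range(len(dflist)):
    print("loop round: ", i)
    # globals() is the module-level symbol table, so this rebinds df1..df6 by name.
    # (locals() gives the same result only at module level, where it is the same dict.)
    globals()['df'+str(i+1)] = raw[i]

df6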
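One caveat: writing to locals() like this only takes effect at module (or notebook cell) level, where locals() and globals() are the same dictionary. Inside a function, CPython does not write such changes back to the real local variables. Below is a small sketch of that behavior, plus a dictionary-based alternative. The names demo_locals and tables are hypothetical, and plain strings stand in for the data frames in raw.

```python
def demo_locals():
    x = "original"
    # Inside a function, this write does not rebind the local variable x.
    locals()["x"] = "changed"
    return x

print(demo_locals())

# Alternative that works anywhere: keep the tables in a dict keyed by name.
raw = ["table A", "table B", "table C"]  # stand-ins for the real data frames
tables = {f"df{i + 1}": table for i, table in enumerate(raw)}
print(tables["df3"])
```

The dictionary keeps the "automatic" naming (df1, df2, ...) without touching the interpreter's symbol tables.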

This reply was modified 3 years, 6 months ago by Datura.

Datura

Member

October 26, 2020 at 5:35 pm

??? Is this the method to extract table data without using the read_html() function?

Datura

Member

October 26, 2020 at 5:32 pm

No automatic method? OK, that’s too bad. It is really inconvenient to use raw[1], raw[2], ...

This reply was modified 3 years, 6 months ago by Datura.

Datura

Member

October 26, 2020 at 12:33 pm

I tried this method. It ran without errors, but the data frames df1-df3 are all still empty. The loop does not overwrite the pre-defined empty ones.

df1=pd.DataFrame()
df2=pd.DataFrame()
df3=pd.DataFrame()

dflist=[df1,df2,df3]
for i in range(len(dflist)): 
    print("loop round: ", i)
    temp=raw[i]
    print(temp)
    dflist[i] = temp

df1

So, what’s wrong?
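A sketch of what is likely going on, with plain strings standing in for the data frames (raw here is a made-up list). The assignment dflist[i] = temp replaces the i-th slot of the list, but the name df1 is a separate reference that still points at the original empty object.

```python
# Stand-ins: strings instead of DataFrames, so the name binding is easy to see.
raw = ["table 0", "table 1", "table 2"]
df1, df2, df3 = "empty", "empty", "empty"

dflist = [df1, df2, df3]
for i in range(len(dflist)):
    dflist[i] = raw[i]   # rebinds the list slot only, not the name df1

df1_after_loop = df1
print(df1_after_loop)    # the name df1 was never reassigned
print(dflist)            # the list itself now holds the new tables

# Reassigning the names directly does update them.
df1, df2, df3 = raw
print(df1)
```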

This reply was modified 3 years, 6 months ago by Datura.

Datura

Member

October 26, 2020 at 12:08 pm

No. These tables are totally different, with different columns and structures. I need to separate them and then use each one differently.

Datura

Member

October 26, 2020 at 11:13 am

I checked the urlwatch Python package; its fundamental idea is the same as mine (see the introduction below). We can just use this package, since it is available and free.

———————————————-

Introduction

urlwatch monitors the output of webpages or arbitrary shell commands.

Every time you run urlwatch, it:

1. retrieves the output and processes it
2. compares it with the version retrieved the previous time (“diffing”)
3. if it finds any differences, generates a summary “report” that can be displayed or sent via one or more methods, such as email
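The same retrieve, compare, and report idea can be sketched with the Python standard library alone. This only illustrates the concept; it is not how urlwatch itself is implemented. The function names and the sample page text are hypothetical, and fetch() is defined but not called so that the example runs offline.

```python
import difflib
import urllib.request

def fetch(url):
    """Retrieve a page as text. Not called below, to keep the example offline."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")

def make_report(old_text, new_text, name="page"):
    """Diff two versions and return a report string, or "" if nothing changed."""
    diff = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=f"{name} (previous)", tofile=f"{name} (current)", lineterm="",
    )
    return "\n".join(diff)

# Made-up snapshots standing in for two runs against the same page.
previous = "Notice board\nOffice hours: 9-5\n"
current = "Notice board\nOffice hours: 9-4\n"

report = make_report(previous, current)
if report:
    print(report)  # in practice this could be emailed instead
```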

This reply was modified 3 years, 6 months ago by Datura.

Datura

Member

October 26, 2020 at 11:10 am

Got it. She wants to monitor government websites, and they definitely don’t belong to her~~ 🙂

Datura

Forum Replies Created