Members - sparkdata

Justin

Administrator

November 17, 2020 at 5:19 pm

It is true. This is one of the distinct differences between Call Symput() and Select Into when creating macro variables. As a result, the way we reference the resulting macro variables is different too. Please see the syntax below.

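data CCC;
set  BBB;
/* Use only ONE of the two WHERE statements, depending on how the macro variable was created. */
/* If both are kept, the second WHERE replaces the first. */
where  Birth_Date <  &MV_1   ;
where  Birth_Date < "&MV_2"n ;
run;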

Please note: your code will fail if you do NOT reference them correctly. Be careful: details are critically important in programming.

This reply was modified 3 years, 5 months ago by Justin.

Justin

Administrator

November 6, 2020 at 10:06 pm

How come? I think that is impossible. The counts of predicted Good and Bad depend on the cut-off value. If you change the cut-off, the TP and FP counts must change, and that is how we get a curve.

Check your code; something must have gone wrong.
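To illustrate why the counts must move with the cut-off, here is a minimal Python sketch. The scores, labels, and cut-off values are made up for illustration, and label 1 is assumed to be the class being predicted; it is not the code from this project.

```python
def confusion_counts(scores, labels, cutoff):
    """Count TP, FP, TN, FN when scores >= cutoff are predicted as class 1."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= cutoff else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_points(scores, labels, cutoffs):
    """Return one (false positive rate, true positive rate) pair per cut-off."""
    points = []
    for cutoff in cutoffs:
        tp, fp, tn, fn = confusion_counts(scores, labels, cutoff)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


# Hypothetical model scores and true labels.
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

points = roc_points(scores, labels, cutoffs=[0.2, 0.5, 0.8])
for point in points:
    print(point)
```

If every cut-off gives the same point, the cut-off is probably not being applied to the scores at all.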

Justin

Administrator

November 4, 2020 at 2:01 pm

Why not use the qcut() function to create a flag variable? It works well for this project.
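A minimal sketch of that idea, assuming a pandas data frame with a numeric column. The column names `score`, `score_bin`, and `top_flag`, the sample values, and the choice of 4 bins are all hypothetical.

```python
import pandas as pd

# Hypothetical scores; replace with the real column.
df = pd.DataFrame({"score": [1, 2, 3, 4, 5, 6, 7, 8]})

# qcut() cuts on quantiles, so each bin holds roughly the same number of rows.
# labels=False returns integer bin codes (0 = lowest scores).
df["score_bin"] = pd.qcut(df["score"], q=4, labels=False)

# Flag variable: 1 for rows in the highest bin, 0 otherwise.
df["top_flag"] = (df["score_bin"] == df["score_bin"].max()).astype(int)

print(df)
```

Which bin to flag depends on the project; the highest bin is used here only as an example.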

Justin

Administrator

November 3, 2020 at 9:32 pm

In this case, we use an F-test to check this categorical predictor. Test the reduced model (without the predictor) against the full model; if the predictor is insignificant overall, we just drop it.

If it is significant overall but a few levels are NOT significant, we can regroup it by combining the insignificant levels, which will make more sense for this variable.
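Here is a rough sketch of the reduced-versus-full comparison in plain Python. It makes a simplifying assumption that is not in the original question: the categorical predictor is the only term, so the reduced model is intercept-only and the full model fits one mean per level. The function name and the data are hypothetical.

```python
from collections import defaultdict


def partial_f_statistic(y, groups):
    """F statistic for dropping a categorical predictor.

    Simplifying assumption: the reduced model is intercept-only and the
    full model fits one mean per level of the predictor.
    """
    n = len(y)
    grand_mean = sum(y) / n
    rss_reduced = sum((value - grand_mean) ** 2 for value in y)

    by_level = defaultdict(list)
    for value, level in zip(y, groups):
        by_level[level].append(value)

    rss_full = 0.0
    for values in by_level.values():
        level_mean = sum(values) / len(values)
        rss_full += sum((value - level_mean) ** 2 for value in values)

    df_num = len(by_level) - 1   # extra parameters in the full model
    df_den = n - len(by_level)   # residual degrees of freedom of the full model
    f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    return f_stat, df_num, df_den


# Hypothetical response and a 3-level categorical predictor.
y = [10, 12, 11, 20, 22, 21, 11, 10, 12]
groups = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

f_stat, df_num, df_den = partial_f_statistic(y, groups)
print(f_stat, df_num, df_den)
```

The statistic is then compared with the F distribution on (df_num, df_den) degrees of freedom. In this made-up data, levels A and C have the same mean, which is the kind of situation where combining levels makes sense.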

Justin

Administrator

November 3, 2020 at 9:12 pm

This is a good question. I know many people are confused about it.

To be accurate, for cross validation we actually need to split the data into 3 parts:

1. Train: for model building.

2. Validate: for model validation.

3. Hold-out: used to fine tune the model if needed.

The suggested partition is Train:Validate:Hold-out = 60:20:20.

However, many people skip the Hold-out partition and only use Train:Validate = 70:30 or 60:40. Skipping it is fine if you do out-of-time validation instead.
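The 60:20:20 partition can be sketched in plain Python as below. The function name, the fixed seed, and the use of row numbers in place of real records are assumptions for illustration only.

```python
import random


def three_way_split(records, percents=(60, 20, 20), seed=42):
    """Shuffle the records and split them into train, validate and hold-out parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable

    n_train = len(shuffled) * percents[0] // 100
    n_validate = len(shuffled) * percents[1] // 100

    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    holdout = shuffled[n_train + n_validate:]   # whatever is left
    return train, validate, holdout


# 100 hypothetical row numbers standing in for the real data set.
train, validate, holdout = three_way_split(range(100))
print(len(train), len(validate), len(holdout))
```

Passing percents=(70, 30, 0) gives the two-way split with an empty hold-out part.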

Does this surprise or confuse you further?

This reply was modified 3 years, 6 months ago by Justin.

Justin

Administrator

October 26, 2020 at 11:15 pm

Thanks to all for the contributions and discussion; they are very useful and helpful.

Based on Yi’s contribution, we can go one step further. If the names of the newly created data frames don’t follow any common pattern, we can use the method below to create all of them.

### create a list of data frame names.
dflist = ['Core', 'Spouse', 'Education', 'Work', 'Certificate', 'Additional']

### loop over each name; raw is the existing list of data frames, in the same order as dflist
for i, name in enumerate(dflist):
    print("loop round: ", i)
    globals()[name] = raw[i]   # create a global variable called <name>

print(Work)

This creates the 6 data frames: Core, Spouse… Additional. Building idea upon idea, this is a more generic way to do it.
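For comparison, here is a small self-contained sketch of the same idea next to a dictionary-based alternative. The strings stand in for real data frames and the shortened name list is hypothetical; the dictionary is only one possible design, not something proposed in the discussion.

```python
# Stand-ins for the list of data frames; plain strings keep the sketch self-contained.
raw = ["core data", "spouse data", "education data"]
names = ["Core", "Spouse", "Education"]

# Approach from the post: one module-level variable per name.
for name, table in zip(names, raw):
    globals()[name] = table

# Alternative: a dictionary keyed by name, so no variables are created dynamically.
frames = dict(zip(names, raw))

print(frames["Spouse"])
```

The dictionary keeps all tables in one place and is easy to loop over; the globals() version gives shorter names to type in a notebook.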

Justin

Administrator

October 25, 2020 at 11:54 am

One way is to use the pandas package: its read_html() function reads HTML tables into a list of data frames. It is very simple and convenient. Is there any other method?

import io
import pandas as pd
import requests

url = 'https://www.welcomebc.ca/Immigrate-to-B-C/B-C-Provincial-Nominee-Program/Invitations-to-Apply'
html = requests.get(url).text              # page source as a string
tables = pd.read_html(io.StringIO(html))   # list of data frames, one per HTML table
df = tables[1]                             # second table on the page
df.to_excel('/kaggle/working/skilled.xlsx', sheet_name='skilled', index=False)

Also, I want to extract the table names from the web content:

Table 1: Skills Immigration and Express Entry BC

Table 2: Entrepreneur Immigration

How can I grab them and assign them to each list element?
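One possible approach, sketched with Python's standard html.parser module. It rests on an unverified assumption about the page: that each table is preceded by a heading tag (h1 to h6) holding its name. The sample HTML and the class name `TableTitleParser` are made up for illustration; the real page may be structured differently.

```python
from html.parser import HTMLParser


class TableTitleParser(HTMLParser):
    """Record the most recent heading text seen before each <table> tag."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.current = []         # text pieces of the heading being read
        self.last_heading = None  # text of the latest completed heading
        self.titles = []          # one entry per <table>, in page order

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.in_heading = True
            self.current = []
        elif tag == "table":
            self.titles.append(self.last_heading)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self.in_heading = False
            self.last_heading = "".join(self.current).strip()

    def handle_data(self, data):
        if self.in_heading:
            self.current.append(data)


# Hypothetical page fragment, assuming headings sit right above the tables.
sample_html = """
<h3>Table 1: Skills Immigration and Express Entry BC</h3>
<table><tr><td>1</td></tr></table>
<h3>Table 2: Entrepreneur Immigration</h3>
<table><tr><td>2</td></tr></table>
"""

parser = TableTitleParser()
parser.feed(sample_html)
titles = parser.titles
print(titles)
```

If the number of titles matches the number of data frames, something like dict(zip(titles, tables)) would pair each name with its table. Nested tables or tables without a heading would break that pairing, so check the counts first.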

Justin

Administrator

October 24, 2020 at 12:14 pm

Not actually!

In my opinion, SAS is commercial software, widely used in the banking, telecom and government sectors. So it is useful to learn if you want to pursue a career there.

R and Python are quite similar and both are open source tools. Grasping one is sufficient.

One commercial tool plus one open source tool is enough for most data analysis. What do you think?

Justin

Administrator

October 16, 2020 at 5:12 pm

In a DATA step, if we use multiple SET statements rather than one SET statement, the result is overwriting rather than appending: the observations read by the later SET overwrite those read by the previous one.

Another key point is: when does the step stop, and how many observations does it produce? Consider the example below, where A and B have exactly the same variables.

data C;
set   A;    *  5 records;
set   B;    *  10 records;
run;

data D;
set   B;   *  10 records;
set   A;   *  5  records;
run;

Both C and D have only 5 records, but the records are different. Why?

As you remember, each data set has an End of File indicator (which can be monitored with the END=EOF option). The DATA step stops as soon as any one of the data sets reaches the end of the file, no matter which one gets there first. In the above case, data A has only 5 observations, so it always reaches the end first and determines the number of observations in the output data set: 5 observations! In summary, with multiple SET statements, the number of output observations is always determined by the data set with the fewest observations.

However, although data C and D have the same number of observations, the records are different, because the later SET always overwrites the previous one: the ORDER does matter!
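The behaviour can be mimicked in a few lines of Python. This is only an analogy, not SAS: the function name `multi_set` and the single `id` variable are hypothetical, and it covers only the case described here, where both data sets have the same variables.

```python
def multi_set(first, second):
    """Mimic `set first; set second;` for data sets with the same variables."""
    output = []
    # zip() stops at the shorter input, like the DATA step stopping
    # when the first data set runs out of observations.
    for obs_first, obs_second in zip(first, second):
        # Values from the later SET overwrite same-named variables.
        output.append({**obs_first, **obs_second})
    return output


A = [{"id": f"A{i}"} for i in range(1, 6)]    # 5 records
B = [{"id": f"B{i}"} for i in range(1, 11)]   # 10 records

C = multi_set(A, B)   # set A; set B;
D = multi_set(B, A)   # set B; set A;

print([obs["id"] for obs in C])
print([obs["id"] for obs in D])
```

C ends up with the first 5 records of B, while D ends up with the 5 records of A.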

Justin

Administrator

November 23, 2020 at 8:29 am

Yes, assume we lose the entire loan amount.

Justin

Administrator

November 4, 2020 at 11:14 pm

No, it’s ok. There may be a small difference between the bins created by SAS Proc Rank and by the qcut() function. It does NOT matter; the bins don’t need to match exactly.

Also, please remember to exclude missing, zero or negative scores from analysis.

Justin

Administrator

November 4, 2020 at 12:14 pm

If you have the model built already, it is just model validation, not cross validation. You don’t need to split the data; just use the whole data set to validate the model.

Justin

Administrator

November 4, 2020 at 12:14 am

Below is something from the internet:

“k-Fold Cross-Validation

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

The general procedure is as follows:

Shuffle the dataset randomly.
Split the dataset into k groups
For each unique group:
1. Take the group as a hold out or test data set
2. Take the remaining groups as a training data set
3. Fit a model on the training set and evaluate it on the test set
4. Retain the evaluation score and discard the model
Summarize the skill of the model using the sample of model evaluation scores

Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is given the opportunity to be used in the hold out set 1 time and used to train the model k-1 times”
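The quoted procedure can be sketched in plain Python as follows. The "model" is a deliberately trivial stand-in (it predicts the mean of the training values), and the function names, the data, and k=4 are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle row indices, split them into k groups, and yield (train, test) pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # each index lands in exactly one group
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(y, k):
    """Return the mean score and the per-fold scores (mean squared error)."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        # Stand-in "model": predict the mean of the training values.
        prediction = sum(y[j] for j in train_idx) / len(train_idx)
        mse = sum((y[j] - prediction) ** 2 for j in test_idx) / len(test_idx)
        scores.append(mse)       # keep the score, discard the model
    return sum(scores) / len(scores), scores


# Hypothetical target values.
y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 9.0, 6.0, 4.0, 7.0, 5.0]

mean_score, fold_scores = cross_validate(y, k=4)
print(mean_score, fold_scores)
```

Each row is used for testing once and for training k-1 times, as the quote describes.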

Justin

Administrator

November 3, 2020 at 9:34 pm

No, this is another cross validation method called K-fold validation. Google it.

Justin

Administrator

October 26, 2020 at 8:53 pm

Yes, I think this is probably the right way; let me try... Can you talk about the built-in functions locals() and globals() in Python? You can open a new discussion. Thanks!
