Members - sparkdata

Justin

Administrator

July 7, 2022 at 9:37 am

0

### 3) Deactivation forecasting

######### Install the pmdarima library to perform time series analysis.
!pip install pmdarima
!pip install matplotlib
!pip install statsmodels

# !pip uninstall statsmodels -y
# !pip install statsmodels==0.11.0

# Import the library
from pmdarima import auto_arima
from pmdarima.arima import ADFTest
from statsmodels.tsa.seasonal import seasonal_decompose

# Ignore harmless warnings
import warnings
warnings.filterwarnings(“ignore”)

# Fit auto_arima function to deactivation dataset

churn = impute[impute[‘active’]==0].groupby([‘deact_yymm’]).agg( churn=(‘acctno’, ‘nunique’) )
churn.info()

churn[‘yymm_day’]= churn.index + ’01’
churn

# ETS Decomposition
result = seasonal_decompose(churn[‘churn’],
model =’multiplicative’, period=1)

# ETS plot
result.plot()

### Test for Stationarity: H0: unit root is present (non-stationary). H1: the series is stationary. ###########################
# In statistics and econometrics, an augmented Dickey–Fuller test (ADF) tests the null hypothesis
# of a unit root is present in a time series sample. The alternative hypothesis is different depending
# on which version of the test is used, but is usually stationarity or trend-stationarity.
# It is an augmented version of the Dickey–Fuller test for a larger and more complicated set of time series models.

### should_diff(x): Test whether the time series is stationary or needs differencing.

ADF_Test=ADFTest(alpha=0.05)
pval, should_diff= ADF_Test.should_diff(churn[‘churn’])
print(‘p-value={}, should differencing = {}’.format(pval, should_diff))

# Test results: p-value=0.9770785716280405, should differencing = True.
# It is non stationary because p-value>0.05. True means that it needs differencing.

stepwise_fit = auto_arima(churn[‘churn’],
start_p = 1, start_q = 1,
max_p = 3, max_q = 3, m = 12,
start_P = 0, seasonal = True,
d = None, D = 1, trace = True,
error_action =’ignore’, # we don’t want to know if an order does not work
suppress_warnings = True, # we don’t want convergence warnings
stepwise = True) # set to stepwise

# To print the summary
stepwise_fit.summary()

# Best model: ARIMA(2,0,1)(0,1,0)[12] intercept
# Total fit time: 6.703 seconds
# ARIMA(2,0,1)(0,1,0)[12]: AIC=197.856, Time=0.13 sec

# Fit the best model
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(churn[‘churn’],
order = (2,0,1),
seasonal_order =(0,1,0,12) )

result = model.fit()
result.summary()

##### Use the time series model to forecast for next 6 months. ###############
### predict() function: generate in-sample predictions from the fit ARIMA model.

forecast = result.predict(start = len(churn), #### start=25, len(churn)=25
end = (len(churn)-1) + 6, #### end=25-1+6= 30.
typ = ‘levels’).rename(‘Forecast’)

# Plot the forecast values
churn[‘churn’].plot(figsize = (12, 5), legend = True)
forecast.plot(legend = True)

Justin

Administrator

August 3, 2021 at 7:07 pm

proc sql;
create table CCC as
select a.*,
b.Year as Year_2,
c.Year as Year_3

from AAA a inner join AAA b
on a.Name=b.Name and a.Year=b.Year +1
inner join AAA c
on a.Name=c.Name and a.Year=c.Year +2
order by a.Name, a.Year;

create table Output as

select distinct Name

from CCC;

run;

Justin

Administrator

February 7, 2021 at 5:25 pm

Method 2: Use ODS tagsets.ExcelXP destination

Microsoft Excel uses the general format when importing data values that do not have an Excel format applied. This general format attempts to determine how the value should be formatted. For example, an account number that is not alpha numeric and has leading zeroes will lose the leading zero. The
same problem occurs when you type a value in a cell of Excel.

To get around this problem, the Excel Text format can be applied so that the value comes over the same way it was displayed in SAS. How this is done depends on the ODS destination that you use to generate the file. If the ExcelXP destination is used, the Excel text format can be applied using the
format: parameter within the TAGATTR= attribute. Using the MSOffice2k/HTML/HTML3 destinations, the Microsoft Office CSS style property mso-number-format can be used with the same text format.

Example:

 data one;
   input account name $;
 cards;
 023456 Bob
 054556 Henry
 034456 Wes
 ;
 run;

 /* ExcelXP destination */
 ods tagsets.ExcelXP file="temp.xls";
 proc print data=one;
 var name;
 var account / style(data)={tagattr="format:@"};
 format account z6.;
 run;
 Ods tagsets.ExcelXP  close;

 /* MSOffice2k destination */
 ods MSoffice2k  file="temp1.xls"; 
 proc print data=one;
 var name;
 var account / style(data)={htmlstyle="mso-number-format:\@"};
 format account z6.; 
 run;
 Ods msoffice2k close;

Note: The above options work with tagsets.ExcelXP, MSOffice2k, but does not work with ODS HTML. Please refer below for more information: http://support.sas.com/kb/32/414.html

Justin

Administrator

February 7, 2021 at 4:51 pm

Method A:

 proc sort data=test;
 by ID;
 run;

 data B;
 set test;
 by ID;
 retain Z1-Z4;
 if not missing(V1) then Z1=V1;
 if not missing(V2) then Z2=V2;
 if not missing(V3) then Z3=V3;
 if not missing(V4) then Z4=V4;
 if last.ID;
 drop V1-V4;
 rename Z1=V1 Z2=V2 Z3=V3 Z4=V4;
 run;

Method B:

proc sql;
 select ID, 
        max(V1) as V1,
        max(V2) as V2, 
        max(V3) as V3, 
        max(V4) as V4

from test 
group by ID;
quit;

Method C:

%let NM=is not missing;
 data one(keep=id v1 v2 where=(v1 &NM))
      two(keep=v3 where=(v3 &NM))
      three(keep=v4 where=(v4 &NM));
      set test;
 ;
run;

data all;
    set one;    
    set two;
    set three;
run;
%let NM=is not missing;
 data test2;
 merge test(keep=id v1 v2 where=(v1 &NM))
       test(keep=v3  where=(v3 &NM))
       test(keep=v4 where=(v4 &NM));
 ;
run;

Method D

 %macro test;
 proc sql;
 create table var_ls as 
 select distinct name 
 from dictionary.columns
 where libname='WORK' and memname='TEST' and upcase(name) ne 'ID';
 quit;

 data _null_;
 set var_ls;
 call symput(cats('var',_n_),name);
 call symput('cnt',_n_);
 run;

 proc sql;
 create table test2 as
 select id
 %do i = 1 %to &cnt;
 , max(&&var&i) as &&var&i
 %end;

 from test
 group by id;
quit;
%mend;
 %test;

This reply was modified 4 years, 4 months ago by Justin.

Justin

Administrator

February 7, 2021 at 4:31 pm

CALL MISSING Routine

Assigns missing values to the specified character
or numeric variables.

Syntax
CALL MISSING(variable-name-1 <, variable-name-2, …>);

Required Argument
variable-name
specifies the name of SAS character or numeric variables.

Details
The CALL MISSING routine assigns an ordinary numeric missing value (.) to each numeric variable in the argument list.

The CALL MISSING routine assigns a character missing value (a blank) to each character variable in the argument list. If the current length of the character variable equals the maximum length, the current length is not changed. Otherwise, the current length is set to 1.

You can mix character and numeric variables in the argument list.

Comparisons
The MISSING function checks whether the argument has a missing value but does not change the value of the argument.

data one;
   prod='shoes';
   invty=7498;
   sales=23759;
   call missing(prod, invty);
   put prod= invty= sales=;
run;

data one;
   prod='shoes';
   invty=7498;
   sales=23759;
   call missing(of  _all_ );
   put prod= invty= sales=;
run;

The above code set the specified variables to missing values, including both numeric and char variables.

This reply was modified 4 years, 4 months ago by Justin.

Justin

Administrator

February 7, 2021 at 3:53 pm

Method B:

Use SUM statement to create a counter variable CNT, then split the dataset into 2 parts based on CNT=0 or not. Then merge them back together. Drop and rename vars if needed. See below:

data W;
length CNT 3 New $20 ;
 set Alice;
 retain New ' ' ;
 if Name =' ' then CNT +1;
 else CNT=0;
 if CNT=0 then New=Name;
run;

data Old(drop=CNT);
 set W;
 where CNT=0;
run;

proc sort data=Old;  by New; run;
proc sort data=W(keep= New Name Q5) out= New;
 by New;
run;
data final;
 merge old new;
 by New;
run;

Method 3: Use By—NotSorted option.

 data B;
 length New $10 Pre $10;
 set Alice;
 by Name notsorted;
 retain Pre 'AAA';
 F=first.Name;
 L=last.Name;
 Flag=F+L;
 if flag=2 and not missing(Name) then New=Name;
 else New=Pre;
 Pre=New;
 run;

 data Old;
 set B;
 if F=1 and not missing(Name) then output Old;
 run;

Then merge the OLD with B by New.

Reach-out:

Some people may not have SAS in their company. If so, can you only use SQL to do it? It will be much more challenging in coding. But no pain, no gain!

This reply was modified 4 years, 4 months ago by Justin.

Justin

Administrator

February 7, 2021 at 3:48 pm

Method A: Retain + Merge

filename Alice 'file path\alice.xls';
proc import datafile=Alice out=alice(rename=(Name_ = Name)) dbms=Excel replace;
run;

*************** Condition: By default, the blank in Name means it has same value  as the previous obs. ************;data B(drop=pre);
 length Pre $20  New_Name $20 ;
 retain  Pre   ' ';
 set Alice;
 if Name =" " then New_Name=Pre;
 else New_Name=Name;
 Pre=New_Name;
run;
data First_Name ;
 set B;
 if New_Name = Name;
run;
proc sort data=First_Name; by name;  run;
proc sort data=B(keep=New_Name   Q5) out=Full ;
 by New_Name;
run;
data final(drop=New_Name);
 length name $20;
 merge First_Name      full(rename=(new_name=name))  ;
 by name;
run;

Justin

Administrator

February 6, 2021 at 11:27 pm

3. IFN and IFC: Conditional Processing: Equivalent to ifelse() function in R.

Syntax
IFN (logical-expression, value-returned-when-true, value-returned-when-false
, <,value-returned-when-missing>)

Details

The IFN function uses conditional logic that enables you to select among several values based on the value of a logical expression.

IFN evaluates the first argument, then logical-expression. If logical-expression is true (that is, not zero and not missing), then IFN returns the value in the second argument. If logical-expression is a missing value, and you have a fourth argument, then IFN returns the value in the fourth argument. Otherwise, if logical-expression is false, IFN returns the value in the third argument.

The IFN function, an IF/THEN/ELSE construct, or a WHERE statement can produce the same results. (See Examples.) However, the IFN function is useful in DATA step expressions when it is not convenient or possible to use an IF/THEN/ELSE construct or a WHERE statement.

data _null_;
    input TotalSales;
    commission= IFN (TotalSales > 10000,  TotalSales*.05, TotalSales*.02);
    put commission=;
    datalines;
 25000
 10000
 .
 0
 663
 .
 500
 ;
run;

IFC: Character Function

Details
Length of Returned Variable
In a DATA step, if the IFC function returns a value to a variable that has not previously been assigned a length, then that variable is given a length of 200 bytes.

The Basics
The IFC function uses conditional logic that enables you to select among several values based on the value of a logical expression.

IFC evaluates the first argument, logical-expression. If logical-expression is true (that is, not zero and not missing), then IFC returns the value in the second argument. If logical-expression is a missing value, and you have a fourth argument, then IFC returns the value in the fourth argument. Otherwise, if logical-expression is false, IFC returns the value in the third argument.

The IFC function is useful in DATA step expressions, and even more useful in WHERE clauses and other expressions where it is not convenient or possible to use an IF/THEN/ELSE construct.

data B;
length name $12 grade 3 performance $30;
    input name $ grade;
    performance = IFC(grade GE 60, 'Pass', 'Fail! Work hard...', 'NA');
    *  performance = IFC( substr (name,1,1)='K', 'First name starts with K', 'Others');
  datalines;
 Ted 60
 John 74
 Kareem 89
 .  65
 Jack 53
 Peter .
 Amanda 99
 Smith 0
 .  55
 Sara .
 Kati 35
 Maria 92
 ;
 run;

Note: In IFN and IFC, a logic expression can have 3 levels:
1 = true
0 = false
(.) = missing when specified.

You can use functions in the logic expression, such as Substr, Put, Scan etc…

Justin

Administrator

February 6, 2021 at 6:13 pm

The COALESCE function in SQL returns the first non-NULL expression among its arguments. It is the same as the following CASE WHEN statement:

 SELECT 
   CASE ("column_name") WHEN "expression 1 is not NULL" THEN "expression 1"
   WHEN "expression 2 is not NULL" THEN "expression 2"
   ...
   [ELSE "NULL"]
  END
FROM "table_name"

For examples, say we have the following table Table Contact_Info
Name Business_Phone Cell_Phone Home_Phone
Jeff 531-2531 622-7813 565-9901
Laura NULL 772-5588 312-4088
Peter NULL NULL 594-7477

and we want to find out the best way to contact each person according to the following rules:
1. If a person has a business phone, use the business phone number.
2. If a person does not have a business phone and has a cell phone, use the cell phone number.
3. If a person does not have a business phone, does not have a cell phone, and has a home phone, use the home phone number.

We can use the COALESCE function to achieve our goal:

SELECT Name, 
       COALESCE(Business_Phone, Cell_Phone, Home_Phone) as Contact_Phone
FROM Contact_Info;

Result:
Name Contact_Phone
Jeff 531-2531
Laura 772-5588
Peter 594-7477

They do the same job as the below SAS code:

If Business_phone is not null then contact_phone=business_phone;
   Else if cell_phone is not null then contact_phone=cell_phone;
   Else if home_phone is not null then contact_phone=home_phone;
 Else contact_phone=”N/A”;

Isn’t it cool? Then practice and use it!

Justin

Administrator

January 22, 2021 at 11:24 am

It is very easy to count the total number of missing values in a data frame with R and Python. Below is the code. Suppose we have a data frame named df.

R code

sum( is.na(df) )  
df[ is.na(df) ]<-  -999  ### we can replace all the missing values in this way.

Python code: we can use either isnull() or is.na() function. Please note: isna()/notna() are same as isnull()/notnull() in Python.

### Below code counts the number of missing values of each column.
df.isnull().sum() 
df.isna().sum() 

### Below code counts the number of missing values of the whole data frame.
df.isnull().sum().sum() 
df.isna().sum().sum()

For SAS programming, it is not an easy job. Some people may think of using macro to do it. SAS macro will work it out certainly, but it is too much work. My suggestion is to use macro only when you have to. Actually, we have a much easier and better solution if we use the CMISS() function in SAS. Below gives the SAS code.

/*cmiss() function: This function counts the number of missing arguments.*/
/* argument*: specifies a constant, variable, or expression. Argument can be either a character value or a numeric value.*/

data AAA;
set P1.New_Wireless  End=EOF ;
miss=CMISS(of acctno -- sales);
Total_Miss + Miss;
if EOF=1 then output;
run;
************* The correct answer is 371,468 for the above input data. ********;

The most challenging task is to use SQL to do it. It is very complicated and we may have to use the “information schema” to do it. Does any one have a better solution?