SAS for beginners: Proc Import

When we perform data analysis, the first important and essential thing is to: read in your data! If you cannot read data in, what else can you do?! This is same for using any analytic tools such as R, Python, SPSS etc. 

In SAS, we have several methods to import external files, such as Import Wizard, Proc Import and DATA step. If we want to read in external files such as delimited files, Excel or Access files into SAS, the easy and convenient way is to use Proc Import.  It is just simple and powerful. Please note, however, Proc Import cannot read in fixed-length files. We have to use Import Wizard or DATA step to read in fixed length files. I will talk about them in another article. 

Below is an example of using Proc Import to read in a tab-delimited text file. Let’s talk some details about the often used options and statements.

***** generate SAS datasets from tab delimited text files. Using defined formats for the output variables***; proc import datafile="path\orders.txt" out=P3.orders dbms=DLM replace; delimiter='09'x; getnames=yes; guessingrows=500000; datarow=2; run;
Code language: SAS (sas)

Replace option
The REPLACE option is to choose to replace the previously created data set or not. If you don’t use REPLACE, the SAS data set generated from current run will NOT overwrite the one from previous run. Instead, it will prompt you to save it with a different name. This is really annoying and not necessary, so I suggest:  always use REPLACE with Proc Import.

Getnames= Yes/No 
This statement is used to read column names and it really depends on your input data. If the first line is for column names, you need to set Getnames=Yes, SAS will read the first line as column names, and read the real data starting from the 2nd line, such as:

ID Name Sex Age
101 Clinton M 55
102 George M 39
103  Alice    F   26

If your data does NOT have the column name line, you set Getnames=No, it will read the first line as real data rather than column names. In this case, the default column names are shown as Var1, Var2, Var3…… if you use Proc Import or F1, F2, F3….. if you use Import Wizard.

Datarow 
This statement is used to tell where the real data starts from. Because some data may contain many lines of non-data at the beginning,  for example:

ID Name Sex Age
*** This is online sales data collected from Amazon ;
*** The data is confidential and for internal use only;
***  Data is created as of 2020-09-25.
101 Clinton M 55
102 George M 39
103  Alice    F   26

In this case, we need set datarow=5 to skip the non data lines.

Guessingrows
When SAS read in an external file, it need scan the external file to determine the variable attributes for each column: Numeric or Character? Variable length? … etc. This task takes place at the compile phase actually, it is called first occurrence properties.

By default, Proc Import only scans the first 50 rows to determine them. This is really risky because: if you have a huge dataset with millions of records, the character values are very likely to appear after the first 50 rows. In this case, this column will be set as numeric and the rows with character values will all get missing values. It is really bad, isn’t it?

To avoid this potential issue, you can set the Guessingrows to a big number. Say, if you set Guessingrows=500000, it will scan the first 500K records. Or you can set Guessingrows=Max, it will scan all the records to determine the column types.

Responses

    1. It really depends on your input data. If the first line is for column names, you need to set Getnames=Yes, SAS will read the first line as column names, and read the real data starts from the 2nd line, such as:

      ID Name Sex Age
      101 Clinton M 55
      102 George M 39

      If your data does NOT have the column name line, you set Getnames=No, it will read the first line as real data rather than column names.

    1. If you don’t use REPLACE, the SAS data set generated from current run will NOT overwrite the one from previous run. Instead, it will prompt you to save it with a different name. This is really annoying and unnecessary, so I suggest always use REPLACE with Proc Import.

    1. When SAS read in an external file, it need scan the external file to determine the variable type for each column: Numeric or Character?

      By default, SAS only scan the first 50 rows to determine them. This is quite risky because if you have a huge dataset with millions of records, the character values are very likely to appear after the first 50 rows. In this case, this column will be set as numeric and the rows with character values will all get missing values. It is really bad, isn’t it?

      To avoid this potential issue, you can set the Guessingrows to a big number. Say, if you set Guessingrows=1000000, it will scan the first 1M records. Or you can set Guessingrows=Max, it will scan all the records to determine the column types.

error: Content is protected !!