  • SAS: what is the function of multiple SET statements in a DATA step?

     Justin updated 3 years, 6 months ago 2 Members · 2 Posts
  • Datura

    Member
    October 16, 2020 at 10:23 am

    A former student of mine, Grace, once asked me this question. It is a good question, and I am happy to answer it and share it with others.

    She asked: “Given the SAS code below:

    data con1;
    input custom_id $ product $ 12.;
    cards;
    28901 pentium IV
    36815 pentium III
    ;
    run;

    data con2;
    input custom_id $ product $ 12.;
    cards;
    18601 pentium IV
    24683 pentium III
    851921 pentium IV
    61831 pentium IV
    ;
    run;

    data con3;
    set con1;
    set con2;
    run;

    The result is:

    custom_id product
    18601 pentium IV
    24683 pentium III

    However, if we change the code as follows:

    data con3;
    set con2;
    set con1;
    run;

    The result will be:
    custom_id product
    28901 pentium IV
    36815 pentium III

    How come? What is the mechanism by which con1 overwrites con2? And what is the function of multiple SET statements in a DATA step?”

    • This discussion was modified 3 years, 6 months ago by  Justin.
  • Justin

    Administrator
    October 16, 2020 at 5:12 pm

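    In a DATA step, using multiple SET statements instead of a single one overwrites rather than appends. On each iteration, every SET statement reads one observation from its data set into the same program data vector, so for variables with the same name, the value read by the later SET replaces the value read by the earlier one.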
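    To make the overwrite visible per variable, here is a minimal sketch. The data set names left, right and combined and the variables id, x and y are made up for illustration. Only id exists in both inputs, so only id is overwritten; x survives from the first SET.

```sas
/* Hypothetical input: two observations with variables id and x */
data left;
  input id x;
  datalines;
1 10
2 20
;
run;

/* Hypothetical input: three observations with variables id and y */
data right;
  input id y;
  datalines;
7 700
8 800
9 900
;
run;

data combined;
  set left;   /* reads id and x into the program data vector */
  set right;  /* reads id and y; id is overwritten, x is left alone */
run;

proc print data=combined;
run;
```

    Under these assumptions, combined should contain two observations (id=7, x=10, y=700 and id=8, x=20, y=800): id comes from right, while x is still the value read from left.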

    Another key point: when does the step stop, and how many observations does it produce? Consider the example below, where A and B have exactly the same variables.

    data C;
    set A; * 5 records;
    set B; * 10 records;
    run;

    data D;
    set B; * 10 records;
    set A; * 5 records;
    run;

    Both C and D have only 5 records, but the records are different. Why?

    As you may remember, each data set has an end-of-file indicator (which can be monitored with the END= option). The DATA step stops as soon as any SET statement tries to read past the end of its data set, no matter which data set that is. In the example above, A has only 5 observations, so it always reaches the end first and determines the number of observations in the output data set: 5. In summary, with multiple SET statements, the number of output observations is always determined by the data set with the fewest observations.

    Although C and D have the same number of observations, their records differ because the later SET always overwrites the earlier one: C holds the first 5 records of B, while D holds the 5 records of A. The order does matter!
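    The stopping rule can be watched in the log with a small sketch like the one below. The test data (A with ids 1 to 5, B with ids 101 to 110) and the flag name last_a are assumptions chosen for illustration.

```sas
/* Hypothetical test data: A has 5 observations, B has 10 */
data A;
  do id = 1 to 5;
    output;
  end;
run;

data B;
  do id = 101 to 110;
    output;
  end;
run;

data D;
  /* Written at the top of every iteration, before any SET runs */
  put "Starting iteration " _N_;
  set B;
  /* last_a becomes 1 when the last observation of A has been read */
  set A end=last_a;
  if last_a then put "Last observation of A read in iteration " _N_;
run;
```

    With this data, the log should show six "Starting iteration" lines but D should contain only 5 observations: the sixth iteration begins, SET B reads its sixth record, and then SET A finds nothing left to read, so the step stops before writing a sixth observation.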
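    For contrast, if appending is what you actually want, list both data sets on a single SET statement. A minimal sketch, again with made-up test data (the variable src only marks where each record came from):

```sas
/* Hypothetical test data: A has 5 observations, B has 10 */
data A;
  do id = 1 to 5;
    src = "A";
    output;
  end;
run;

data B;
  do id = 101 to 110;
    src = "B";
    output;
  end;
run;

/* One SET statement with two data sets: concatenation */
data stacked;
  set A B;   /* all of A first, then all of B */
run;

/* Two SET statements: one record from each per iteration */
data overwritten;
  set A;
  set B;     /* same variable names, so B's values win */
run;
```

    Here stacked should have 15 observations (the 5 from A followed by the 10 from B), while overwritten should have 5 observations carrying B's first five records.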
