Home Forums Main Forums Big Data How to count the number of NAs in a dataset by SAS/SQL/R/Python?

  • How to count the number of NAs in a dataset by SAS/SQL/R/Python?

     Datura updated 10 months ago 3 Members · 4 Posts
  • Datura

    Member
    January 22, 2021 at 10:55 am

    A student had a job interview with a bank yesterday. They asked her an interesting but challenging question:

    How do you count the total number of NAs (missing values) in a dataset with a lot of variables by SAS/SQL/R/Python?

    This question is not easy to answer, especially for SAS and SQL. Let’s discuss about it.

  • Justin

    Administrator
    January 22, 2021 at 11:24 am

    It is very easy to count the total number of missing values in a data frame with R and Python. Below is the code. Suppose we have a data frame named df.

    R code

    sum( is.na(df) )  
    df[ is.na(df) ]<- -999 ### we can replace all the missing values in this way.

    Python code: we can use either isnull() or is.na() function. Please note: isna()/notna() are same as isnull()/notnull() in Python.

    ### Below code counts the number of missing values of each column.
    df.isnull().sum()
    df.isna().sum()

    ### Below code counts the number of missing values of the whole data frame.
    df.isnull().sum().sum()
    df.isna().sum().sum()

    For SAS programming, it is not an easy job. Some people may think of using macro to do it. SAS macro will work it out certainly, but it is too much work. My suggestion is to use macro only when you have to. Actually, we have a much easier and better solution if we use the CMISS() function in SAS. Below gives the SAS code.

    /*cmiss() function: This function counts the number of missing arguments.*/
    /* argument*: specifies a constant, variable, or expression. Argument can be either a character value or a numeric value.*/

    data AAA;
    set P1.New_Wireless End=EOF ;
    miss=CMISS(of acctno -- sales);
    Total_Miss + Miss;
    if EOF=1 then output;
    run;
    ************* The correct answer is 371,468 for the above input data. ********;

    The most challenging task is to use SQL to do it. It is very complicated and we may have to use the “information schema” to do it. Does any one have a better solution?

    • This reply was modified 10 months, 1 week ago by  Justin.
  • Wong

    Member
    January 22, 2021 at 7:33 pm

    For SQL, we can extract all column names from information_schema. Then concatenate the column names with ‘IS NULL’ in where clause to get the number.

    SET @query = CONCAT(

    'SELECT COUNT(*) FROM sakila.staff WHERE CONCAT(',

    (SELECT GROUP_CONCAT(COLUMN_NAME)

    FROM information_schema.COLUMNS

    WHERE TABLE_SCHEMA = 'sakila' AND

    TABLE_NAME = 'staff'),

    ') IS NULL');

    PREPARE CHECK_NULL from @query;

    EXECUTE CHECK_NULL;

    • This reply was modified 10 months, 1 week ago by  Wong.
    • Datura

      Member
      January 31, 2021 at 11:17 am

      Great job! Thanks!

Log in to reply.

Original Post
0 of 0 posts June 2018
Now