MemberJanuary 22, 2021 at 10:55 am
A student had a job interview with a bank yesterday. They asked her an interesting but challenging question:
How do you count the total number of NAs (missing values) in a dataset with a lot of variables by SAS/SQL/R/Python?
This question is not easy to answer, especially for SAS and SQL. Let’s discuss about it.
AdministratorJanuary 22, 2021 at 11:24 am
It is very easy to count the total number of missing values in a data frame with R and Python. Below is the code. Suppose we have a data frame named df.
sum( is.na(df) )
df[ is.na(df) ]<- -999 ### we can replace all the missing values in this way.
Python code: we can use either isnull() or is.na() function. Please note: isna()/notna() are same as isnull()/notnull() in Python.
### Below code counts the number of missing values of each column.
### Below code counts the number of missing values of the whole data frame.
For SAS programming, it is not an easy job. Some people may think of using macro to do it. SAS macro will work it out certainly, but it is too much work. My suggestion is to use macro only when you have to. Actually, we have a much easier and better solution if we use the CMISS() function in SAS. Below gives the SAS code.
/*cmiss() function: This function counts the number of missing arguments.*/
/* argument*: specifies a constant, variable, or expression. Argument can be either a character value or a numeric value.*/
set P1.New_Wireless End=EOF ;
miss=CMISS(of acctno -- sales);
Total_Miss + Miss;
if EOF=1 then output;
************* The correct answer is 371,468 for the above input data. ********;
The most challenging task is to use SQL to do it. It is very complicated and we may have to use the “information schema” to do it. Does any one have a better solution?
- This reply was modified 10 months, 1 week ago by Justin.
MemberJanuary 22, 2021 at 7:33 pm
For SQL, we can extract all column names from information_schema. Then concatenate the column names with ‘IS NULL’ in where clause to get the number.
SET @query = CONCAT(
'SELECT COUNT(*) FROM sakila.staff WHERE CONCAT(',
WHERE TABLE_SCHEMA = 'sakila' AND
TABLE_NAME = 'staff'),
') IS NULL');
PREPARE CHECK_NULL from @query;
- This reply was modified 10 months, 1 week ago by Wong.
MemberJanuary 31, 2021 at 11:17 am
Great job! Thanks!
Log in to reply.