Task 3: How to Identify and Evaluate the Impact of Outliers in NHANES I Data

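In this task, you will check for outliers and their potential impact in three steps: check the distributions with a univariate analysis, plot the survey weight against the distribution of the variable, and compare estimates with and without the outliers.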

 

Step 1: Check distributions by running a univariate analysis

Before you analyze your data, check the distribution and normality of each continuous variable and identify any outliers.

 

Program to Plot Distribution of Continuous Variable
Statements Explanation
proc univariate data =nh1.demo2_nh1 normal plot ;

Use the proc univariate procedure to get all default descriptive statistics, such as the mean, minimum and maximum values, standard deviation, and skewness. Use the normal option to request tests for normality and the plot option to obtain distribution plots, including a normal probability plot.

where n1bm0101>=20 ;

Use the where statement to select the participants who were age 20 years and older.

id seqn;

Use the id statement to list the sequence numbers associated with extreme values in the output.

var N1LB0237 N1ME0228 N1ME0231 N1ME0718 N1ME0721 N1BM0260 N1BM0266;

run ;

Use the var statement to list the variables of interest.
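The extreme observations that the id statement labels can also be listed on their own. The following is a minimal sketch, assuming the same dataset and variable names as the program above. The nextrobs= value of 10 is an arbitrary choice for illustration (the default is 5), and limiting the output with ods select is an optional convenience, not part of the tutorial's program.

```sas
/* Sketch: print only the extreme-observations table for total serum */
/* cholesterol, showing the 10 lowest and 10 highest values with SEQN. */
ods select ExtremeObs;                 /* keep only the extreme-observations table */

proc univariate data=nh1.demo2_nh1 nextrobs=10;   /* nextrobs=10 is an assumed value */
  where n1bm0101 >= 20;                /* adults age 20 years and older */
  id seqn;                             /* label each extreme value with its SEQN */
  var n1lb0237;                        /* total serum cholesterol */
run;
```

The ods select statement applies only to the next procedure step, so later procedures print their full output again.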

Highlighted items from the univariate analysis output:

 

Step 2: Plot a graph of survey weight against the distribution of the variable

In this example, you will plot the survey weight for stands 1-65 (n1bm0176) against the distribution of the cholesterol variable to determine whether the extreme observations are influential outliers.

 

IMPORTANT NOTE

This weight was used because the question "Has doctor ever told you that you ... [high blood pressure]?" (n1ah290) was only asked of participants at stands 1-65. Please see NHANES III Clean & Recode Data, Task 1 - How to Identify and Recode Missing Data for a full explanation.

 

Plot Exam Weight Against Cholesterol
Statement Explanation
symbol1 value =dot height = .2 ;

Use the symbol1 statement with the value and height options to set the plotting symbol (a dot) and its size.

proc gplot data =demo2_nh1;
where n1bm0101>= 20 ;

plot N1BM0176*(N1LB0237 N1ME0228 N1ME0231 N1ME0718 N1ME0721

N1BM0260 N1BM0266)

/frame ;

      title 'NHANES I, adults age 20 years and older' ;

run ;

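Use the proc gplot procedure to plot the sample weight (n1bm0176) against each variable of interest, such as total serum cholesterol (n1lb0237), for every observation in the dataset. Use the where statement to select the participants who were age 20 years and older. Use the frame option to draw a border around the plot area and the title statement to label the output.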
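On SAS installations without SAS/GRAPH, a similar scatter plot can be drawn with proc sgplot. This is a sketch of an alternative, not the tutorial's own program: proc sgplot is swapped in for proc gplot, the marker settings and axis labels are illustrative choices, and only the cholesterol variable is plotted.

```sas
/* Sketch: scatter plot of sample weight against total serum cholesterol. */
/* proc sgplot replaces proc gplot here; marker and label choices are assumed. */
proc sgplot data=demo2_nh1;
  where n1bm0101 >= 20;                          /* adults age 20 years and older */
  scatter x=n1lb0237 y=n1bm0176 /
          markerattrs=(symbol=circlefilled size=3);
  xaxis label='Total serum cholesterol (n1lb0237)';
  yaxis label='Sample weight, stands 1-65 (n1bm0176)';
  title 'NHANES I, adults age 20 years and older';
run;
title;                                           /* clear the title afterwards */
```

Points that sit far from the bulk of the cholesterol distribution and also carry a large sample weight are the ones most likely to influence weighted estimates.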

 

Highlighted items from plotting the survey weight against the distribution of the cholesterol variable:

 

 

Step 3: Identify outliers and compare estimates with outliers deleted against the original estimates with outliers included

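In this step you will create a copy of the dataset with the outliers deleted, then compare the weighted means from the datasets with and without the outliers.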

 

Program to Create Dataset Without Outliers and Output Means of Both Datasets
Statements Explanation
data temp2_nh1;

set demo2_nh1;

Use the data and set statements to refer to your analytic dataset.

if seqn in ( 18969 , 14145, 16025 ) then delete ;

Use the if-then statement to delete the outliers by SEQN. The outliers were identified in the plot of survey weight against the distribution of the variable, and their SEQNs are listed under extreme observations in the proc univariate output.

proc format ;
value race
  1 = 'White'
  2 = 'Black'
  3 = 'Other Race' ;
run ;

Use the proc format procedure to give easily understood labels to your race/ethnicity variable values.

proc means data =demo2_nh1 mean stderr maxdec = 1 ;

Use the proc means procedure to determine the mean and standard error for the dataset with the outliers.

where n1bm0101>=20 ;

Use the where statement to select the participants who were age 20 years and older.

var n1lb0237;

Use the var statement to indicate the variable of interest.

class n1bm0103;

Use the class statement to group the variable of interest by race/ethnicity categories.

weight n1bm0176;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the examined sample weight for stands 1-65 (n1bm0176) is used.

format n1bm0103 race. ;

Use the format statement to label your race variable with the easily understood labels you defined in the proc format statement.

proc means data =temp2_nh1 mean stderr maxdec = 1 ;

Use the proc means procedure to determine the mean and standard error for the dataset without the outliers.

where n1bm0101>=20 ;

Use the where statement to select the participants who were age 20 years and older.

var n1lb0237;

Use the var statement to indicate the variable of interest.

class n1bm0103;

Use the class statement to group the variable of interest by race/ethnicity categories.

weight n1bm0176;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the examined sample weight is used.

format n1bm0103 race. ;

Use the format statement to label your race variable with easily understood labels you defined in the proc format statement.
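Because the statements above are split across table rows, it can help to see them assembled into one program. The following is a sketch that combines the same statements in run order. The run statements, the two title statements, and the comments are additions for readability and are not part of the original listing.

```sas
/* Labels for the race variable. */
proc format;
  value race
    1 = 'White'
    2 = 'Black'
    3 = 'Other Race';
run;

/* Copy of the analytic dataset with the three identified outliers removed. */
data temp2_nh1;
  set demo2_nh1;
  if seqn in (18969, 14145, 16025) then delete;
run;

/* Weighted mean cholesterol by race, outliers included. */
title 'Mean cholesterol, outliers included';     /* assumed title */
proc means data=demo2_nh1 mean stderr maxdec=1;
  where n1bm0101 >= 20;
  class n1bm0103;
  var n1lb0237;
  weight n1bm0176;
  format n1bm0103 race.;
run;

/* Weighted mean cholesterol by race, outliers deleted. */
title 'Mean cholesterol, outliers deleted';      /* assumed title */
proc means data=temp2_nh1 mean stderr maxdec=1;
  where n1bm0101 >= 20;
  class n1bm0103;
  var n1lb0237;
  weight n1bm0176;
  format n1bm0103 race.;
run;
title;                                           /* clear the title afterwards */
```

Running the two proc means steps back to back puts the two sets of estimates next to each other in the output, which makes it easier to see whether deleting the outliers changes the means or standard errors.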

 

Highlighted items from comparison of the results with and without outliers: