How to Identify and Describe the Impact of Influential Outliers

Before you analyze your data, examine it for outlying values.

Check for Outliers by Running a Univariate Analysis

Use PROC UNIVARIATE to get the default descriptive statistics, such as the mean, minimum and maximum values, standard deviation, and skewness. Use the VAR statement to identify the variable of interest (CVDESVO2). Use the ID statement to list the sequence numbers (SEQN) associated with the extreme values in the output.

Sample Code

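proc univariate data = cvx normal plot ;
 var CVDESVO2;
 /* same weight and age restriction (12 to 49 years) as the title and the later steps */
 where WTMEC4BC > 0 and RIDAGEYR >= 12 and RIDAGEYR <= 49 ;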
 id seqn;
 title 'Distribution of estimated VO2 max among study participants aged 12 to 49 years' ;
run ;

 

Output of Program

Download program output [PDF - 58 KB]
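
To look only at the extreme values, the output can be limited to the extreme-observations table. The following is a minimal sketch, not part of the original program. It assumes the same cvx dataset, restricts records to the 12 to 49 year age range named in the title using the WTMEC4BC weight from the later steps, and uses NEXTROBS = 10 as an arbitrary illustrative choice (the default is 5).

```sas
/* Sketch: print only the ExtremeObs table from PROC UNIVARIATE.      */
/* ODS SELECT applies to the next procedure step only.                */
ods select ExtremeObs ;

proc univariate data = cvx nextrobs = 10 ;  /* 10 lowest and 10 highest values (illustrative) */
 var CVDESVO2;
 where WTMEC4BC > 0 and RIDAGEYR >= 12 and RIDAGEYR <= 49 ;
 id seqn;                                    /* label each extreme value with its sequence number */
 title 'Lowest and highest estimated VO2 max values' ;
run ;
```

The SEQN values in this table identify the records to inspect in the following steps.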

 

Plot Sample Weight against the Distribution of the Variable

Use PROC GPLOT to plot estimated VO2 max (CVDESVO2) against the corresponding sample weight (WTMEC4BC) for each observation in the dataset. Set 20 ml/kg/min as the minimum reasonable estimated VO2 max and 90 ml/kg/min as the maximum reasonable estimated VO2 max, based on observed measures by age and sex. The HREF option draws reference lines at these two values.

Sample Code

symbol1 value = dot height = .2 ;
title ;

proc gplot data = cvx;
 plot WTMEC4BC*CVDESVO2/ href = 20 , 90 frame ;
 where WTMEC4BC > 0 and RIDAGEYR >= 12 and RIDAGEYR <= 49 ;
run ;

Output of Program

Download program output [PDF - 62 KB]
  • The observed outliers have only moderately large sample weights. Therefore, removing these observations would not have a great effect on population estimates.
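
To read the sample weights of the out-of-range records directly rather than from the plot, the records can be listed. This is a minimal sketch under the same assumptions as the plot (cvx dataset, 20 and 90 ml/kg/min cutoffs); the choice of variables to print is illustrative.

```sas
/* Sketch: list records outside the 20-90 ml/kg/min range with their sample weights. */
proc print data = cvx noobs ;
 var seqn RIAGENDR RIDAGEYR CVDESVO2 WTMEC4BC;
 /* not missing() is needed because SAS treats a missing value as less than any number */
 where WTMEC4BC > 0 and RIDAGEYR >= 12 and RIDAGEYR <= 49
       and not missing(CVDESVO2)
       and (CVDESVO2 < 20 or CVDESVO2 > 90) ;
 title 'Records with estimated VO2 max below 20 or above 90 ml/kg/min' ;
run ;
```

Comparing the listed WTMEC4BC values with the overall range of weights in the plot shows whether any flagged record carries an unusually large weight.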

Identify Outliers and Compare Estimates with Outliers Deleted Against the Original Estimates with Outliers Included

Use the IF, THEN, and DELETE statements in the DATA step to delete the identified outliers with CVDESVO2 < 20 or > 90. Use PROC MEANS to determine the mean and standard error for the dataset both with and without the outlier values.

Sample Code

data exclude_SP;
 set cvx;
 if WTMEC4BC > 0 and RIDAGEYR >= 12 and RIDAGEYR <= 49 and CVDESVO2 < 20 then delete ;
 if WTMEC4BC > 0 and RIDAGEYR >= 12 and RIDAGEYR <= 49 and CVDESVO2 > 90 then delete ;
run ;

proc format ;
 value GENDERF 1 = 'Male'
               2 = 'Female' ;
run ;

proc means data = cvx mean stderr maxdec = 1 ;
 title 'No Exclusions' ;
 var CVDESVO2;
 class RIAGENDR;
 weight WTMEC4BC;
 format RIAGENDR GENDERF. ;
run ;

proc means data = exclude_SP mean stderr maxdec = 1 ;
 title 'Outlier Exclusion' ;
 var CVDESVO2;
 class RIAGENDR;
 weight WTMEC4BC;
 format RIAGENDR GENDERF. ;
run ;

Output of Program

Download program output [PDF - 38 KB]
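
When describing the impact of the exclusion, it can help to report how many records were removed. The following is a minimal sketch that flags out-of-range values instead of deleting them and counts them by sex. The dataset name flag_SP and the variable name outlier are hypothetical and do not appear in the original program.

```sas
/* Sketch: flag out-of-range values rather than deleting them, then count by sex. */
data flag_SP;                      /* flag_SP is a hypothetical dataset name */
 set cvx;
 where WTMEC4BC > 0 and RIDAGEYR >= 12 and RIDAGEYR <= 49 ;
 /* outlier = 1 for non-missing values outside 20-90 ml/kg/min, otherwise 0 */
 outlier = ( not missing(CVDESVO2) and (CVDESVO2 < 20 or CVDESVO2 > 90) ) ;
run ;

proc freq data = flag_SP;
 tables RIAGENDR*outlier / nocol nopercent ;
 title 'Number of records flagged as outliers, by sex' ;
run ;
```

Keeping a flag rather than deleting records leaves the original data intact, so the two sets of estimates can be reproduced from one dataset.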