Task 1: Key Concepts about Calculating Variances in NHANES

Because NHANES is a sample of the U.S. population rather than a census, any number computed from NHANES data is an estimate of the corresponding population number. Each statistic therefore has some level of sampling error associated with it: if estimates were derived from many different samples, they would not all be the same. This theoretical dispersion of the estimates across samples is known as the sampling variance. Typically, the true population variance is unknown, and we can only estimate the sampling variance (or a related measure, its square root, the standard error) of a statistic. Standard errors are used to assess the precision of the statistic of interest. See the NHANES Analytic Guidelines to learn how to interpret these standard errors and determine whether your estimate is precise enough to report.
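The idea of dispersion across repeated samples can be illustrated with a small simulation. This is only a sketch under simple random sampling from an invented population (not NHANES data); the population values, sample size, and function name are assumptions chosen for illustration.

```python
import math
import random
import statistics


def sampling_distribution(population, sample_size, n_samples, seed=0):
    """Draw repeated simple random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical population of 10,000 values (mean about 100, SD about 15).
pop_rng = random.Random(1)
population = [pop_rng.gauss(100, 15) for _ in range(10_000)]

# Dispersion of the estimate across 1,000 samples of size 50.
estimates = sampling_distribution(population, sample_size=50, n_samples=1000)
empirical_se = statistics.stdev(estimates)

# In practice only one sample is available, so the standard error
# has to be estimated from that single sample.
one_sample = pop_rng.sample(population, 50)
estimated_se = statistics.stdev(one_sample) / math.sqrt(len(one_sample))

print(f"SD of estimates across samples: {empirical_se:.2f}")
print(f"SE estimated from one sample:   {estimated_se:.2f}")
```

The two numbers should be of similar size, which is the sense in which a standard error computed from a single sample estimates the spread that would be seen across many samples.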

Most statistical software packages calculate variance estimates, but only those designed to handle complex weighted samples such as NHANES should be used. These include SUDAAN and STATA, as well as the survey procedures in SAS and SPSS. These procedures require information on the first stage of the sample design (i.e., identification of strata and PSUs) for each sample person. Variance estimates computed with software that assumes simple random sampling would generally be too low for the NHANES sample (i.e., statistical significance would be overstated). They would also be biased because they would not account for the differential weighting and the correlation among sample persons within a cluster. Therefore, the procedures used to analyze NHANES data must account for the complex sample design when producing variance estimates.
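A small synthetic example can show why a simple-random-sampling formula understates the variance when observations are clustered. This is a sketch with invented data (10 equal-sized clusters, unweighted), not an NHANES analysis; the cluster-aware standard error here treats clusters as if sampled with replacement, which is an assumption made for illustration.

```python
import math
import random
import statistics

rng = random.Random(42)
N_CLUSTERS, PER_CLUSTER = 10, 50

# Each hypothetical cluster has its own shift, so people in the same
# cluster resemble one another more than people in other clusters.
clusters = []
for _ in range(N_CLUSTERS):
    shift = rng.gauss(0, 5)
    clusters.append([100 + shift + rng.gauss(0, 5) for _ in range(PER_CLUSTER)])

values = [v for cluster in clusters for v in cluster]

# Naive SE: treats all 500 observations as independent draws.
naive_se = statistics.stdev(values) / math.sqrt(len(values))

# Cluster-aware SE: with equal-sized clusters the overall mean equals the
# mean of the cluster means, so its SE comes from the spread between clusters.
cluster_means = [statistics.fmean(cluster) for cluster in clusters]
cluster_se = statistics.stdev(cluster_means) / math.sqrt(N_CLUSTERS)

print(f"naive SE (assumes simple random sampling): {naive_se:.2f}")
print(f"cluster-aware SE:                          {cluster_se:.2f}")
```

With data like these, the naive standard error is noticeably smaller than the cluster-aware one, which is the sense in which significance would be overstated.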

Accounting for the complex sampling design of NHANES is critical when calculating estimates and standard errors of means, percentages, and other statistics. As explained in the NHANES Survey Design Overview module, NHANES has a multistage probability design, where the first two stages (selection from strata and from PSUs) are of primary concern for variance estimation. Typically, individuals within a PSU are more similar to one another than to those in other PSUs. For a given total sample size, it would therefore be preferable to sample more PSUs and fewer people within each PSU. However, because of operational limitations (e.g., cost of moving the Mobile Examination Centers, geographic distances between PSUs), NHANES can sample only 30 PSUs within a 2-year survey cycle. The sample size is roughly equal across PSUs and yields about 5,000 examined persons per year. The NHANES sample design uses unequal probabilities of selection in order to oversample selected individuals and population subgroups. For example, in 1999-2006, individuals ages 65 and older, African Americans, and Mexican Americans were oversampled, as were low-income whites. All of these complex sample design factors (PSU stratification, geographic clustering, differential probabilities of selection) affect variance estimates of the NHANES data.
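The trade-off between the number of PSUs and the number of people per PSU can be sketched with Kish's approximate design effect for clustering, deff = 1 + (m - 1) * rho, where m is the average cluster size and rho is the intraclass correlation. This formula is a common textbook simplification, not something taken from the NHANES documentation: it ignores stratification and unequal weighting, and the value rho = 0.02 below is purely hypothetical.

```python
def design_effect(avg_cluster_size, icc):
    """Kish's approximate design effect due to clustering alone."""
    return 1.0 + (avg_cluster_size - 1.0) * icc


def effective_sample_size(n, n_clusters, icc):
    """Sample size of a simple random sample with roughly the same precision."""
    return n / design_effect(n / n_clusters, icc)


# About 10,000 examined persons per 2-year cycle (from the text above),
# with a hypothetical intraclass correlation of 0.02.
n_total = 10_000
icc = 0.02

for n_psus in (30, 60, 120):
    deff = design_effect(n_total / n_psus, icc)
    n_eff = effective_sample_size(n_total, n_psus, icc)
    print(f"{n_psus:>3} PSUs: design effect {deff:.2f}, effective n about {n_eff:.0f}")
```

Under these assumptions, spreading the same number of people over more PSUs lowers the design effect and raises the effective sample size, which is why more PSUs with fewer people each would be preferable if it were operationally feasible.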

Together, the strata and the PSUs define the variance units of the sampling design, which should be taken into account to properly estimate the variance due to sampling error for any statistic computed from NHANES data. However, the true stratum and PSU identifiers must be kept confidential because the release of data in 2-year cycles makes it easy to identify them. To protect the confidentiality of data obtained from sample persons, Masked Variance Units (MVUs) are constructed by aggregating secondary sampling units into groups. Therefore, for purposes of variance estimation, the public release files provide “sample design” variables that define the MVUs (SDMVSTRA for strata and SDMVPSU for PSUs) instead of the real identifiers. MVUs are equivalent to the Pseudo-PSUs used to estimate sampling errors in past NHANES surveys.

Using MVUs yields variance estimates that closely approximate those obtained using the real “unmasked” variance units and therefore are considered satisfactory and appropriate. They have been created for each 2-year cycle of NHANES in such a way that they can be used for any combination of data cycles without recoding.
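Before estimating variances on a pooled or subsetted file, it can be useful to confirm that every masked stratum still contains at least two masked PSUs, since a between-PSU variance cannot be computed from a single PSU. The sketch below assumes the design variables have already been read into (SDMVSTRA, SDMVPSU) pairs, one per sample person; the specific values are hypothetical and the function name is illustrative.

```python
from collections import defaultdict

# Hypothetical (SDMVSTRA, SDMVPSU) pairs pooled from two data cycles.
pooled_design = [
    (1, 1), (1, 2), (1, 1),
    (2, 1), (2, 2),
    (14, 1), (14, 2),
    (15, 1), (15, 2), (15, 3),
]


def psus_per_stratum(design_pairs):
    """Count the distinct masked PSUs observed in each masked stratum."""
    psus = defaultdict(set)
    for stratum, psu in design_pairs:
        psus[stratum].add(psu)
    return {stratum: len(ids) for stratum, ids in psus.items()}


counts = psus_per_stratum(pooled_design)

# Strata with fewer than two PSUs would need attention before analysis.
single_psu_strata = [stratum for stratum, n in counts.items() if n < 2]

print(counts)
print("strata with a single PSU:", single_psu_strata)
```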

For complex sample surveys, exact mathematical formulas for variance estimates are not readily available. Variance approximation procedures are required to provide reasonable estimates of the magnitude of sampling error. Two variance approximation procedures that account for the complex sample design are replication methods and Taylor series linearization methods. For the most part, the two approaches give equivalent results, although for some specific applications one method may be slightly preferred. Because replication methods tend to be more cumbersome, NCHS currently recommends the use of Taylor series linearization methods for variance estimation in all NHANES surveys. SUDAAN, SAS, STATA, and SPSS procedures can be used to obtain variances estimated by this method for a variety of statistics, such as means, geometric means, and percentages. With most statistical software packages, you need to identify the variables that hold the information about the sampling design. In other words, you need to specify the variables that define the stratum, the PSUs (also called clusters) within each stratum, and the sampling weight.
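To make the role of the stratum, PSU, and weight variables concrete, the following is a minimal sketch of Taylor series linearization for a weighted mean. It assumes PSUs are treated as sampled with replacement within strata and uses the usual between-PSU variance of linearized totals; it is an illustration of the general technique, not the exact implementation in SUDAAN or SAS. The function name and the example records are hypothetical.

```python
import math
from collections import defaultdict


def weighted_mean_se(records):
    """Weighted mean and its linearized standard error.

    records: iterable of (stratum, psu, weight, y) tuples,
    one per sample person.
    """
    records = list(records)
    total_w = sum(w for _, _, w, _ in records)
    mean = sum(w * y for _, _, w, y in records) / total_w

    # Linearized value for the ratio sum(w*y)/sum(w):
    # z_i = w_i * (y_i - mean) / total_w, summed to PSU totals per stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in records:
        psu_totals[stratum][psu] += w * (y - mean) / total_w

    variance = 0.0
    for stratum, totals in psu_totals.items():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} needs at least 2 PSUs")
        z_bar = sum(totals.values()) / n_h
        # Between-PSU variability within the stratum.
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in totals.values())

    return mean, math.sqrt(variance)


# Hypothetical records: (stratum, PSU, sampling weight, measured value).
example = [
    (1, 1, 2.0, 10.0), (1, 1, 2.0, 12.0), (1, 2, 1.0, 14.0), (1, 2, 1.0, 16.0),
    (2, 1, 3.0, 20.0), (2, 1, 3.0, 22.0), (2, 2, 1.5, 18.0), (2, 2, 1.5, 24.0),
]
mean, se = weighted_mean_se(example)
print(f"weighted mean = {mean:.3f}, linearized SE = {se:.3f}")
```

In a real analysis the three design inputs would come from the stratum variable, the PSU variable, and the appropriate sampling weight on the NHANES file, and the calculation would be left to survey software rather than hand-written code.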

The two statistical analysis software packages demonstrated in the tutorial are SUDAAN and SAS, and they each use slightly different syntax for specifying the design.  The key differences are:

 

 

 
