Task 2c: How to Use Stata Code to Perform Linear Regression

In this example, you will assess the association between high density lipoprotein (HDL) cholesterol — the outcome variable — and body mass index (bmxbmi) — the exposure variable — after controlling for selected covariates in NHANES 1999-2002. These covariates include gender (riagendr), race/ethnicity (ridreth1), age (ridageyr), smoking (smoker, derived from SMQ020 and SMQ040; smoker =1 if non-smoker, 2 if past smoker and 3 if current smoker) and education (dmdeduc).

 

WARNING

There are several things you should be aware of while analyzing NHANES data with Stata. Please see the Stata Tips page to review them before continuing.

 

Step 1: Use svyset to define survey design variables

Remember that you need to define the survey design with svyset before using the svy series of commands. The general format of this command is below:

svyset [w=weightvar], psu(psuvar) strata(stratavar) vce(linearized)

 

To define the survey design variables for your high density lipoprotein cholesterol analysis, use the weight variable for four years of MEC data (wtmec4yr), the PSU variable (sdmvpsu), and the strata variable (sdmvstra). The vce option specifies the method for calculating the variance; the default is "linearized", which is Taylor linearization. Here is the svyset command for four years of MEC data:

svyset [w= wtmec4yr], psu(sdmvpsu) strata(sdmvstra) vce(linearized)
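
Once the design is declared, it can be worth confirming that Stata stored it as intended. The short sketch below assumes the NHANES analysis dataset is already in memory and that the svyset command above has been run; it introduces no new variables.

```stata
* Display the survey design settings currently stored with the dataset
svyset

* Summarize the strata and the number of PSUs within each stratum
svydescribe
```

Typing svyset with no arguments only displays the current settings; it does not change them.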

 

Step 2: Determine how to specify variables in the model

For continuous variables, you have a choice of using the variable in its original form (continuous) or changing it into a categorical variable (e.g. based on standard cutoffs, quartiles or common practice).  The categorical variables should reflect the underlying distribution of the continuous variable and not create categories where there are only a few observations. 

It is important to examine the data both ways, since the assumption that an independent variable has a continuous relationship with the outcome may not be true. Looking at the categorical version of the variable will help you to know whether this assumption is true.

In this example, you could look at BMI as a continuous variable or convert it into a categorical variable based on standard BMI definitions of underweight, normal weight, overweight and obese.  Here is how categorical BMI variables are created:


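Table of code to generate categorical BMI variable

Code to generate categorical BMI variables          BMI category
gen bmicat=1 if bmxbmi>=0 & bmxbmi<18.5             underweight
replace bmicat=2 if bmxbmi>=18.5 & bmxbmi<25        normal weight
replace bmicat=3 if bmxbmi>=25 & bmxbmi<30          overweight
replace bmicat=4 if bmxbmi>=30 & bmxbmi<.           obese

Each category needs an upper bound because Stata treats missing values as larger than any number; the condition bmxbmi<. in the last line keeps participants with missing BMI out of the obese category.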
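
After creating bmicat, a quick check can confirm that each category covers the intended BMI range and that participants with missing BMI were not assigned to a category. This is a minimal sketch, assuming the commands above have been run; the label name bmicatlbl is an illustrative choice and is not part of the original tutorial.

```stata
* Attach descriptive labels to the four BMI categories
* (bmicatlbl is a hypothetical label name chosen for this example)
label define bmicatlbl 1 "underweight" 2 "normal weight" 3 "overweight" 4 "obese"
label values bmicat bmicatlbl

* Unweighted counts per category, including any missing values
tabulate bmicat, missing

* Minimum and maximum BMI within each category, to verify the cutoffs
tabstat bmxbmi, by(bmicat) statistics(min max n)

* Number of records with missing BMI that were nevertheless categorized
count if missing(bmxbmi) & !missing(bmicat)
```

If the last command reports anything other than 0, one of the conditions is missing an upper bound, because Stata treats missing values as larger than any number.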

 

 

Step 3: Determine the reference group for categorical variables

For all categorical variables, you need to decide which category to use as the reference group.  If you do not specify the reference group options, Stata will choose the lowest numbered group by default.  

Use the following general command to specify the reference group:

char var[omit]reference group value
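
The omit characteristic is read by the xi prefix when it expands i.var terms into indicator variables. As a sketch of how the pieces fit together, assume svyset has already been run and that the HDL cholesterol outcome is stored in a variable named lbdhdl (an assumed name; it does not appear in this section):

```stata
* Make category 3 of ridreth1 the reference (omitted) group
char ridreth1[omit] 3

* Confirm that the characteristic was stored
char list ridreth1[]

* xi: expands i.ridreth1 into indicators, leaving out the omitted category
* (lbdhdl is an assumed name for the HDL cholesterol variable)
xi: svy: regress lbdhdl i.ridreth1 bmxbmi
```

In Stata 11 and later, factor-variable notation is an alternative that needs neither char nor xi: writing ib3.ridreth1 in the model makes category 3 the base level.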

 

For these analyses, use the following commands to specify the following reference groups.

Stata command Reference group

char ridreth1[omit]3