Task 3b: How to Perform Logistic Regression Using SAS Survey Procedures

In this module, you will use simple logistic regression to analyze NHANES data to assess the association between calcium supplement use (anycalsup), the exposure or independent variable, and the likelihood of receiving treatment for osteoporosis (treatosteo), the outcome or dependent variable, among participants aged 20 years and older. You will then use multiple logistic regression to assess the relationship after controlling for selected covariates. The covariates include gender (riagendr), age (ridageyr), race/ethnicity (ridreth1), and body mass index (bmxbmi).

 

Step 1: Determine the appropriate weight for the data used

This example uses the demoadv dataset (download at Sample Code and Datasets).  This dataset already contains a variable anycalsup that has a value of 1 for those who report calcium supplement use, and a value of 2 for those who do not. A participant was considered not to have any calcium supplement use if the daily average amount of calcium supplement use was zero; otherwise, a participant was considered a supplement user (see Supplement Code under Sample Code and Module 9, Task 4 for more information).

It is always important to check all the variables in the model and to use the weight that corresponds to the smallest common denominator, that is, the most restrictive survey component that contributes a variable to the analysis. In this example, the 2-year MEC weight (wtmec2yr) is used because the osteoporosis variable comes from the MEC examination. The demoadv dataset for this example includes only those with MEC weights (wtmec2yr>0).
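Before modeling, it can help to confirm the weight and the analysis variables directly. The following is a minimal sketch, assuming the demoadv dataset is available in the WORK library under that name; the choice of statistics and the use of the missing option are illustrative additions rather than part of the sample program.

```sas
/* Sketch: confirm that every record carries a positive 2-year MEC weight. */
/* Assumes demoadv is in the WORK library.                                  */
proc means data=demoadv n nmiss min max;
  var wtmec2yr;
run;

/* Sketch: unweighted counts of the exposure and outcome, with missing     */
/* values shown as their own category.                                      */
proc freq data=demoadv;
  tables anycalsup treatosteo / missing;
run;
```

If the minimum of wtmec2yr is greater than zero and nmiss is zero, the dataset matches the description above.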

 

Step 2: Create independent categorical variables

This example also illustrates the creation of two new independent categorical variables, age and bmigrp, from the continuous age (ridageyr) and BMI (bmxbmi) variables. These new variables are used in this analysis.

Code to Generate Independent Categorical Variables

Independent variable: Age
Code to generate the categorical variable:

if 20 <= ridageyr < 40 then age = 1;
else if 40 <= ridageyr < 60 then age = 2;
else if ridageyr >= 60 then age = 3;

Independent variable: BMI category
Code to generate the categorical variable:

if 0 <= bmxbmi < 25 then bmigrp = 1;
else if 25 <= bmxbmi < 30 then bmigrp = 2;
else if bmxbmi >= 30 then bmigrp = 3;
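The statements above need to run inside a data step. A minimal sketch follows, assuming demoadv is read from and written back to the WORK library; the proc means checks are an illustrative addition for verifying the cut points.

```sas
/* Sketch: create the age and BMI group variables in a data step.          */
/* SAS treats a missing numeric value as smaller than any number, so       */
/* records with missing ridageyr or bmxbmi keep a missing group value.     */
data demoadv;
  set demoadv;

  if 20 <= ridageyr < 40 then age = 1;
  else if 40 <= ridageyr < 60 then age = 2;
  else if ridageyr >= 60 then age = 3;

  if 0 <= bmxbmi < 25 then bmigrp = 1;
  else if 25 <= bmxbmi < 30 then bmigrp = 2;
  else if bmxbmi >= 30 then bmigrp = 3;
run;

/* Check: the min and max of the source variable within each new group     */
/* should fall inside the intended range.                                   */
proc means data=demoadv n min max;
  class age;
  var ridageyr;
run;

proc means data=demoadv n min max;
  class bmigrp;
  var bmxbmi;
run;
```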

 

Step 3: Create new weight variable for Domain (Subpopulation) Analysis (prior to SAS 9.2) or add domain statement (SAS 9.2 and higher)

You should not use a where clause or by-group processing to analyze a subpopulation with the SAS Survey Procedures, because dropping observations outside the subpopulation also drops design information that is needed to estimate variances correctly. Prior to SAS 9.2, to get an approximate domain (subpopulation) analysis when using proc surveylogistic, you would assign a near-zero weight to observations that do not belong to your current domain. The weight cannot be exactly zero because the procedure excludes any observation with zero weight. In this example, the domain (subpopulation) is participants aged 20 years and older. If you specify in a data step:

     if ridageyr GE 20 then newweight=wtmec2yr; 

     else newweight=1e-6;

you could then perform the logistic regression using the newweight variable as:

     weight newweight;
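Put together, the pre-9.2 approach could look like the sketch below. It assumes demoadv is in the WORK library; the proc means check is an illustrative addition.

```sas
/* Sketch: near-zero weight for observations outside the domain            */
/* (approach for releases before SAS 9.2).                                  */
data demoadv;
  set demoadv;
  if ridageyr ge 20 then newweight = wtmec2yr;
  else newweight = 1e-6;   /* near zero, but not zero, so the record is kept */
run;

/* Check: the minimum of newweight is 1e-6 if any record falls outside     */
/* the domain.                                                              */
proc means data=demoadv n min max;
  var newweight;
run;
```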

 

IMPORTANT NOTE

The code above with the newweight variable is no longer necessary in SAS 9.2 and higher. The statement

weight newweight;

may be replaced with the statements

weight wtmec2yr;

domain sel;

where sel is defined as

if ridageyr GE 20 then sel = 1;

else sel = 2;

(Note that for this particular example, osteoporosis treatment is only collected for those ages 20 and over, so you will not notice a difference whether wtmec2yr or newweight is used. However, if a different age group or variable were used for the subpopulation, differences would be noted.)

Reference: SAS Technical Support
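As a sketch of the SAS 9.2 and higher approach described in this note, the domain indicator can be created in a data step and passed to the domain statement. The example assumes that the yesnos. format from the sample program is already defined, and it uses a single predictor only to keep the illustration short.

```sas
/* Sketch: define the domain indicator (SAS 9.2 and higher). */
data demoadv;
  set demoadv;
  if ridageyr ge 20 then sel = 1;
  else sel = 2;
run;

/* Sketch: use the original MEC weight together with a domain statement.   */
/* Assumes the yesnos. format has been defined as in the sample program.   */
proc surveylogistic data=demoadv;
  stratum sdmvstra;
  cluster sdmvpsu;
  weight wtmec2yr;
  domain sel;
  class anycalsup (ref='No supp use') / param=ref;
  model treatosteo = anycalsup;
  format anycalsup yesnos.;
run;
```

The output contains results for each level of sel; the results for sel=1 are the ones that describe participants aged 20 years and older.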

 

Step 4: Fit Multiple Logistic Regression Model in SAS

This step introduces you to the SAS procedure for logistic regression, proc surveylogistic. There is a summary table of the SAS program below.

 

IMPORTANT NOTE

These programs use variable formats listed in the sample program. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SAS Logistic Regression Procedure

Statement: proc surveylogistic data=demoadv;
Explanation: Use proc surveylogistic to perform multiple logistic regression to assess the association between the outcome and multiple risk factors, including age, gender, race/ethnicity, and body mass index.

Statement: stratum sdmvstra;
Explanation: Use the stratum statement to specify strata to account for design effects of stratification.

Statement: cluster sdmvpsu;
Explanation: Use the cluster statement to specify the primary sampling unit (PSU) to account for design effects of clustering.

Statement: weight newweight;
Explanation: Use the weight statement to account for the unequal probability of sampling and non-response. In this example, you use the new weight variable created in the data step. See Step 3.

Statement: class age (ref='40-59') riagendr (ref='Male') ridreth1 anycalsup (ref='No supp use') bmigrp (ref='25<=BMI<30') / param=ref;
Explanation: Use the class statement to specify all categorical variables in the model. Use the param and ref options to choose your reference group for the categorical variables.

Statement: model treatosteo = anycalsup riagendr age ridreth1 bmigrp;
Explanation: Use the model statement to specify the dependent variable and all independent variable(s) in your logistic regression model.

Statement: format riagendr gender. age agegrp. ridreth1 race. anycalsup yesnos. bmigrp bmifmt.;
Explanation: Use the format statement to apply the SAS formats to all formatted variables.
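Assembled into complete steps, the statements in this table might look like the sketch below. It assumes that the formats gender., agegrp., race., yesnos., and bmifmt. have already been defined as in the sample program, and that newweight, age, and bmigrp were created in the earlier data steps. The run statements, the comments, and the unadjusted model shown first are additions for illustration.

```sas
/* Sketch: simple (unadjusted) logistic regression of osteoporosis         */
/* treatment on calcium supplement use.                                     */
proc surveylogistic data=demoadv;
  stratum sdmvstra;                       /* strata                         */
  cluster sdmvpsu;                        /* primary sampling units         */
  weight newweight;                       /* weight created in Step 3       */
  class anycalsup (ref='No supp use') / param=ref;
  model treatosteo = anycalsup;
  format anycalsup yesnos.;
run;

/* Sketch: multiple logistic regression with the covariates from the table. */
proc surveylogistic data=demoadv;
  stratum sdmvstra;
  cluster sdmvpsu;
  weight newweight;
  class age (ref='40-59') riagendr (ref='Male') ridreth1
        anycalsup (ref='No supp use') bmigrp (ref='25<=BMI<30') / param=ref;
  model treatosteo = anycalsup riagendr age ridreth1 bmigrp;
  format riagendr gender. age agegrp. ridreth1 race.
         anycalsup yesnos. bmigrp bmifmt.;
run;
```

The ref= values must match the formatted values exactly, which is why the formats need to be in place before the procedure runs. No reference level is given for ridreth1, so SAS uses its default, which under param=ref is the last ordered level.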

 

IMPORTANT NOTE

The SAS Survey Procedure, proc surveylogistic, produces the Wald statistic and its p value. It does not produce the Satterthwaite χ² or the Satterthwaite F and the corresponding p values recommended for NHANES analyses. For this reason, it is recommended that you use proc rlogist in SUDAAN for logistic regression.

 

Step 5: Review SAS Multiple Logistic Regression Output

In this step, the SAS output is reviewed. The highlighted elements show that:

If you ran both the SAS Survey and SUDAAN programs (or reviewed the output provided on the Sample Code and Datasets page), you may have noticed slight differences in the output. These differences can be caused by missing data in any paired PSU or by how each software program handles degrees of freedom.
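One way to see where degrees of freedom can diverge is to count the strata and PSUs that actually contribute data. The sketch below uses the common convention that design degrees of freedom equal the number of PSUs minus the number of strata. The restriction to records with a non-missing outcome is an assumption made for illustration, and each software package may apply its own counting rule.

```sas
/* Sketch: count strata and PSUs among records that contribute to the      */
/* model, then apply the PSUs-minus-strata convention.                      */
proc sql;
  select count(distinct sdmvstra) as n_strata,
         count(distinct catx('_', sdmvstra, sdmvpsu)) as n_psu,
         calculated n_psu - calculated n_strata as design_df
  from demoadv
  where ridageyr ge 20 and treatosteo is not missing;
quit;
```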

 
