Continuous NHANES Web Tutorial: Descriptive Statistics: Frequency Distribution and Normality

Key Concepts About Checking Frequency Distribution and Normality

Frequency Distribution

A frequency distribution shows the number of individuals located in each category of a categorical variable. For continuous variables, frequencies are displayed for values that appear at least one time in the dataset. Frequency distributions provide an organized picture of the data, and allow you to see how individual scores are distributed on a specified scale of measurement. For instance, a frequency distribution shows whether the data values are generally high or low, and whether they are concentrated in one area or spread out across the entire measurement scale.

A frequency distribution not only presents an organized picture of how individual scores are distributed on a measurement scale, but also reveals extreme values and outliers. Researchers can make decisions on whether and how to recode or perform data transformation based on the distribution statistics.

Frequency distributions can be structured as tables or graphs, but either should show the original measurement scale and the frequencies associated with each category. Because NHANES data have very large sample sizes with a potentially long list of different values for continuous variables, it is recommended that you use a graphic format to check the distribution for continuous variables, and either frequency tables or graphic forms for nominal or interval variables.
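As a rough illustration of the two formats described above, the sketch below builds a frequency table for a categorical variable and a simple text histogram for a continuous one. The variable names, the data values, and the bin width of 20 are hypothetical and are not taken from NHANES.

```python
from collections import Counter

def frequency_table(values):
    """Count how many times each distinct value occurs, sorted by value."""
    return sorted(Counter(values).items())

def binned_frequencies(values, bin_width):
    """Group a continuous variable into equal-width bins and count each bin.

    Each bin is labelled by its lower bound; bins with no observations
    are simply absent from the result.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return sorted(counts.items())

# Hypothetical data for illustration only.
gender = ["F", "M", "F", "F", "M"]
cholesterol = [182, 195, 201, 188, 240, 176, 199, 310]

# Table format for the categorical variable.
for category, count in frequency_table(gender):
    print(f"{category}: {count}")

# Graphic format for the continuous variable: one asterisk per observation.
bin_width = 20
for lower, count in binned_frequencies(cholesterol, bin_width):
    print(f"{lower}-{lower + bin_width - 1}: {'*' * count}")
```

In this made-up example the value 310 sits alone in its own bin, far from the rest, which is the kind of extreme value a graphic display makes easy to spot.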

Statistics of Normality (for Continuous Variables)

Statistics of normality reveal whether a data distribution is normal and symmetrically bell-shaped or highly skewed. It is important to use these statistics to check the normality of a distribution because they help determine whether to use parametric tests (which assume a normal distribution), use non-parametric tests, or apply a transformation in your analysis.

IMPORTANT NOTE

Note: Before you analyze the data, it is important to check the distribution of the variables to identify outliers and determine whether parametric (for a normal distribution) or non-parametric tests are appropriate to use.

NHANES 1999-2002 is a large, representative sample of the U.S. population, and most continuous variables from this sample are expected to be normally distributed. If you conduct tests for normality, the results for most variables will be significant; that is, because the sample sizes are extremely large, even the slightest deviation from normality can lead to rejecting the null hypothesis of normality. Therefore, users are discouraged from depending solely on these tests. You can also request a Q-Q plot to examine normality.
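The effect of sample size on a normality test can be seen with a small worked example. The sketch below uses the Jarque-Bera statistic, chosen here only because its p-value has a simple closed form; the tutorial does not say which normality test is used, and the skewness value of 0.1 is a hypothetical, very mild departure from normality.

```python
import math

def jarque_bera_p_value(n, skewness, excess_kurtosis):
    """Approximate p-value of the Jarque-Bera normality test.

    JB = n/6 * (S^2 + K^2/4), where S is the sample skewness and K is
    the excess kurtosis. Under normality JB is approximately chi-square
    with 2 degrees of freedom, whose upper tail probability is exp(-JB/2).
    """
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    return math.exp(-jb / 2.0)

# Same mild skewness at every sample size; only n changes.
for n in (100, 1_000, 10_000):
    p = jarque_bera_p_value(n, skewness=0.1, excess_kurtosis=0.0)
    print(f"n = {n:>6}: p = {p:.4f}")
```

Because the statistic grows in proportion to n, a departure that is not significant in a small sample becomes highly significant in a sample of NHANES size, even though the shape of the distribution is unchanged.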

A Q-Q plot, or quantile-quantile plot, is a graphical technique for assessing whether data follow a particular distribution. In a normal Q-Q plot, the quantiles of the variable in question are plotted against the corresponding quantiles of a normal distribution. If the variable of interest is normally distributed, the points fall approximately along a straight line. When both axes use the same scale, that line rises at a 45 degree angle.
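The points of a normal Q-Q plot can be computed directly, as in the sketch below. It assumes the plotting positions (i - 0.5) / n, which is one common convention among several, and it uses the correlation between theoretical and observed quantiles as a rough numeric stand-in for "how straight the line is". The simulated data are hypothetical and unweighted.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    # Reference normal distribution with the sample's own mean and SD.
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n are an assumed convention.
    return [(reference.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]

def qq_straightness(values):
    """Pearson correlation of the Q-Q points; close to 1 means nearly straight."""
    points = normal_qq_points(values)
    mt = mean(t for t, _ in points)
    mo = mean(o for _, o in points)
    cov = sum((t - mt) * (o - mo) for t, o in points)
    var_t = sum((t - mt) ** 2 for t, _ in points)
    var_o = sum((o - mo) ** 2 for _, o in points)
    return cov / math.sqrt(var_t * var_o)

# Simulated, hypothetical data with a fixed seed for repeatability.
rng = random.Random(0)
roughly_normal = [rng.gauss(100, 15) for _ in range(500)]
right_skewed = [math.exp(rng.gauss(0, 1)) for _ in range(500)]

print("roughly normal:", round(qq_straightness(roughly_normal), 3))
print("right skewed:  ", round(qq_straightness(right_skewed), 3))
```

The skewed sample gives a visibly lower correlation because its upper tail bends away from the straight line.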

Standard Deviation

The standard deviation is a measure of the variability of the distribution of a random variable. To estimate the standard deviation:

1. Calculate the weighted sum of the squares of the differences between the observations in a simple random sample and the sample mean.
2. Divide the result obtained in step 1 by an estimate of the population size minus 1.
3. Take the square root of the result obtained in step 2.
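The three steps above can be sketched as follows. The sketch assumes that the sample mean is the weighted mean and that the estimate of the population size is the sum of the sample weights; the function name and the data are hypothetical.

```python
import math

def weighted_std(values, weights):
    """Weighted standard deviation following the three steps in the text."""
    population_estimate = sum(weights)  # assumed: sum of the sample weights
    weighted_mean = sum(w * x for x, w in zip(values, weights)) / population_estimate

    # Step 1: weighted sum of squared differences from the mean.
    weighted_ss = sum(w * (x - weighted_mean) ** 2 for x, w in zip(values, weights))
    # Step 2: divide by the estimated population size minus 1.
    variance = weighted_ss / (population_estimate - 1)
    # Step 3: take the square root.
    return math.sqrt(variance)

# Hypothetical values and weights for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]
weights = [1, 1, 1, 1, 1, 1, 1, 1]
print(weighted_std(values, weights))
```

With every weight equal to 1, the result is the ordinary sample standard deviation, which is a convenient check on the calculation.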

Skewness

Skewness is a measure of the departure of the distribution of a random variable from symmetry. The skewness of a normally distributed random variable is 0.
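As an illustration, the sketch below computes the moment-based skewness, that is, the third central moment divided by the second central moment raised to the power 1.5. This is an unweighted textbook formula; statistical packages may apply small-sample adjustments, so their output can differ slightly. The data are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample adjustment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]      # symmetric about 3
right_skewed = [1, 1, 1, 2, 10]  # long right tail

print(skewness(symmetric))
print(skewness(right_skewed))
```

A symmetric sample gives a skewness of 0, a long right tail gives a positive value, and a long left tail gives a negative value.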

Kurtosis

Kurtosis is a measure of the peakedness of the distribution. The kurtosis of a normally distributed random variable depends on the formula used. One formula, used by SAS, subtracts 3, which makes the value for a normal distribution equal to 0. The other formula, used by Stata, does not subtract 3, which makes the value for a normal distribution equal to 3. A kurtosis exceeding the value for a normal distribution indicates excess values close to the mean and at the tails of the distribution. A kurtosis of less than the value for a normal distribution indicates a distribution with a flatter top.

SAS Support Link: http://support.sas.com/publishing/bbu/companion_site/update/lsb_kurtosis.html
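The two kurtosis conventions described above differ only by the constant 3, as the sketch below shows. It uses the unweighted moment-based formula (fourth central moment divided by the squared second central moment); packages may apply small-sample adjustments, so their reported values can differ slightly from these. The data are hypothetical.

```python
def kurtosis(values, subtract_three=False):
    """Moment-based kurtosis: m4 / m2**2, optionally minus 3.

    subtract_three=True gives the convention in which a normal
    distribution has kurtosis 0; False gives the convention in which
    it has kurtosis 3.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    raw = m4 / m2 ** 2
    return raw - 3 if subtract_three else raw

# A two-point sample is as flat-topped as a distribution can be.
flat = [-1, 1, -1, 1]
print(kurtosis(flat))                       # convention without subtraction
print(kurtosis(flat, subtract_three=True))  # convention with subtraction
```

When comparing output across packages, check which convention is in use before interpreting a value as above or below that of a normal distribution.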

Standard Error of the Mean

The standard error of the mean based on data from a simple random sample (SRS) is estimated by dividing the estimated standard deviation by the square root of the sample size. The value of the standard error obtained from SAS proc univariate using the freq option with the sample weight (i.e., freq appropriate sample weight) is obtained by dividing the estimated standard deviation (see above) by the square root of the sum of the sample weights (i.e., an estimate of the population size). Because the sum of the weights is much larger than the sample size, this value understates the standard error. To obtain the "correct" estimate of the SRS standard error of the mean, divide the estimated standard deviation by the square root of the sample size. The SRS estimate of the standard error of the mean thus obtained serves as a benchmark against which to compare the design-based estimate of the standard error of the mean, which can be obtained from SUDAAN proc descript. (See the Variance Estimation module for more information.)
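The difference between the two denominators can be seen in a small worked example. The sketch assumes that a frequency-weighted procedure treats the sum of the weights as the number of observations, so that its standard error is the standard deviation divided by the square root of that sum. The standard deviation of 40 and the four weights are hypothetical.

```python
import math

def srs_standard_error(std_dev, sample_size):
    """SRS benchmark: standard deviation over the square root of the sample size."""
    return std_dev / math.sqrt(sample_size)

def freq_weighted_standard_error(std_dev, weights):
    """Assumed frequency-weighted behavior: the weight total is treated as n."""
    return std_dev / math.sqrt(sum(weights))

# Hypothetical numbers: 4 sampled persons representing 100,000 people.
std_dev = 40.0
weights = [25_000, 30_000, 20_000, 25_000]

print("SRS benchmark:     ", srs_standard_error(std_dev, len(weights)))
print("frequency-weighted:", freq_weighted_standard_error(std_dev, weights))
```

Neither value accounts for the complex sample design; the SRS value is only the benchmark against which the design-based estimate is compared.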
