Getting Started with SAS/Part 4

From BingWiki


Basic Statistical Methods

T-test

The ttest procedure performs t-tests for one sample, two independent samples and two paired samples.


Independent-Samples T Test (two independent-sample t test)

The independent-samples t-test compares the difference between the means of two independent samples to 0.


Example: The golf scores for seven males and seven females in a physical education class are recorded. Since the group of males and the group of females are independent of each other and the scores are normally distributed in each group, the independent-samples t-test can be applied to determine if the mean golf score for the men in the class differs significantly from the mean score for the women.


The program below runs proc ttest to obtain the independent-samples t-test.
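A minimal sketch of such a program follows. The data set name scores, the variable names Gender and Score, and the data values are assumptions rather than the original listing; the values were chosen to be consistent with the statistics quoted in the notes that follow.

```sas
/* Assumed data: golf scores for seven females (f) and seven males (m). */
data scores;
   input Gender $ Score @@;   /* @@ reads several observations per line */
   datalines;
f 75  f 76  f 80  f 77  f 80  f 77  f 73
m 82  m 80  m 85  m 85  m 78  m 87  m 82
;
run;

/* CLASS names the grouping variable; VAR names the response. */
proc ttest data=scores;
   class Gender;
   var Score;
run;
```

The CLASS variable must have exactly two levels for a two-sample comparison.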



Here are the outputs:


Note: The Variable column names the variable used in the computations, and the Class column specifies the group for which the statistics are computed. For each class, the sample size, mean, standard deviation, standard error, and minimum and maximum values are displayed. Confidence bounds for the mean and for the standard deviation of each group are also calculated.


Note: The output shows the results of the tests for equal group means and equal variances. A t statistic for the equality of means is reported both for equal variances (pooled) and for unequal variances. Before deciding which test is appropriate, you should look at the test for equality of variances; this test does not indicate a significant difference between the two variances (F = 1.53, p-value = 0.6189), so the pooled t statistic should be used. Based on the pooled statistic, the mean golf scores for the group of males and the group of females are significantly different (t=-3.83, p-value=0.0024).

Note that this test assumes that the observations in both groups are normally distributed; this assumption can be checked in PROC UNIVARIATE using the raw data.

Paired-Samples T Test (Two dependent-sample t test)

The paired-samples t-test compares the mean of the differences between two paired samples to 0.

Example: The data, from Statistics for Experimenters by Box, Hunter, and Hunter (1978), record the amount of wear on the soles of shoes worn by 10 boys. Each boy wore a special pair of shoes: the sole of one shoe was made with material A and the sole of the other with a different material B, and the two materials were assigned to the left and right shoes at random. The experiment was run in pairs because pairs of feet go around together. We want to test whether there is a difference between materials A and B.

The program below runs proc ttest to obtain the paired-samples t-test.
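A minimal sketch of such a program follows. The data set name shoes and the variable names A and B are assumptions, and the wear values are the ones commonly tabulated for this experiment; they are consistent with the mean difference of -0.41 quoted in the note that follows, but should be checked against the book.

```sas
/* Assumed data: sole wear for material A and material B, one pair per boy. */
data shoes;
   input A B @@;
   datalines;
13.2 14.0   8.2  8.8  10.9 11.2  14.3 14.2  10.7 11.8
 6.6  6.4   9.5  9.8  10.8 11.3   8.8  9.3  13.3 13.6
;
run;

/* PAIRED A*B tests whether the mean of the differences A - B is zero. */
proc ttest data=shoes;
   paired A*B;
run;
```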



Here is the output:


Note: The summary statistics of the difference (mean, standard deviation, and standard error) are displayed along with their confidence limits (CL). The minimum and maximum differences are also displayed. The mean of the differences (Material A minus Material B) is -0.41. The t test is statistically significant (t=-3.35, p-value=0.0085), indicating that the two materials differ in mean wear; because the mean difference is negative, Material B showed more wear on average than Material A.


Note that this hypothesis test assumes that the differences are normally distributed. This assumption can be investigated using PROC UNIVARIATE with the NORMAL option. If the assumption of normality is not reasonable, you should analyze the differences with a nonparametric test for paired data, such as the Wilcoxon signed-rank test, which PROC UNIVARIATE also reports. The Wilcoxon rank-sum test in PROC NPAR1WAY applies to independent samples, not to paired data.
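One way to check the normality of the differences is sketched here. The data set name shoes_diff, the variable names A, B and Diff, and the data values are assumptions made for illustration.

```sas
/* Assumed data: compute the paired difference in the same DATA step. */
data shoes_diff;
   input A B @@;
   Diff = A - B;              /* difference analyzed by the paired t-test */
   datalines;
13.2 14.0   8.2  8.8  10.9 11.2  14.3 14.2  10.7 11.8
 6.6  6.4   9.5  9.8  10.8 11.3   8.8  9.3  13.3 13.6
;
run;

/* NORMAL requests tests for normality of Diff. */
proc univariate data=shoes_diff normal;
   var Diff;
run;
```

The PROC UNIVARIATE output also includes a table of tests for location, which contains the signed-rank test of the hypothesis that the differences are centered at zero.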


One-Sample T Test

The one-sample t-test compares the mean of the sample to a given number.

Example: Using the shoes data set shown above, we want to test the null hypothesis that the mean wear for material A is 10. The program below runs proc ttest to obtain the one-sample t-test:
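A minimal sketch of such a program follows. The data set name shoes_a and the variable name A are assumptions, as are the data values, which are consistent with the t statistic quoted in the note that follows.

```sas
/* Assumed data: sole wear for material A only. */
data shoes_a;
   input A @@;
   datalines;
13.2 8.2 10.9 14.3 10.7 6.6 9.5 10.8 8.8 13.3
;
run;

/* H0= sets the hypothesized mean; the default is 0. */
proc ttest data=shoes_a h0=10;
   var A;
run;
```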



Here is the output:


Note: The summary statistics of Material A (mean, standard deviation, and standard error) are displayed along with their confidence limits (CL). The t test is not significant at the 5% level (t=0.81, p-value=0.4373), so there is no evidence that the mean of Material A differs from 10.


Note that this hypothesis test assumes that the values of Material A are normally distributed. This assumption can be investigated using PROC UNIVARIATE with the NORMAL option.

(See also proc ttest options)

ANOVA

One-Way Layout with Means Comparisons

One-way analysis of variance considers one treatment factor with two or more treatment levels. The goal of the analysis is to test for differences among the means of the levels and to quantify these differences. If there are two treatment levels, this analysis is equivalent to a t test comparing two group means.

Example: The effect of bacteria on the nitrogen content of red clover plants. The treatment factor is bacteria strain, and it has six levels. Five of the six levels consist of five different Rhizobium trifolii bacteria cultures combined with a composite of five Rhizobium meliloti strains. The sixth level is a composite of the five Rhizobium trifolii strains with the composite of the Rhizobium meliloti. Red clover plants are inoculated with the treatments, and nitrogen content is later measured in milligrams. The data are derived from an experiment by Erdman (1946) and are analyzed in Chapters 7 and 8 of Steel and Torrie (1980).

The program below runs proc anova to obtain the one-way ANOVA:
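A minimal sketch of the procedure step follows. It assumes a data set named Clover already exists with a character variable Strain (six levels) and a numeric variable Nitrogen; these names are assumptions, and the data values are not reproduced here.

```sas
/* Assumes data set Clover with variables Strain and Nitrogen. */
proc anova data=Clover;
   class Strain;               /* treatment factor */
   model Nitrogen = Strain;    /* response = effect */
   means Strain / tukey;       /* optional pairwise comparisons of strain means */
run;
quit;                          /* PROC ANOVA is interactive; QUIT ends it */
```

The MEANS statement with the TUKEY option is one of several multiple-comparison choices; it is included only because the section title mentions means comparisons.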



Here is the output:


Note: The "Class Level Information" table shown above lists the variables that appear in the CLASS statement, their levels, and the number of observations in the data set. PROC ANOVA does not allow continuous variables on the right-hand side of the model, so every variable used as an effect must appear in the CLASS statement.



Note: The degrees of freedom (DF) column should be used to check the analysis results. The model degrees of freedom for a one-way analysis of variance are the number of levels minus 1; in this case, 6-1=5. The Corrected Total degrees of freedom are always the total number of observations minus one; in this case, 30-1=29. The sum of the Model and Error degrees of freedom equals the Corrected Total degrees of freedom, so the Error degrees of freedom are 29-5=24.


The overall F test is significant (F=14.37, p-value<0.0001), indicating that the model as a whole accounts for a significant portion of the variability in the dependent variable. The F test for Strain is significant, indicating that some contrast between the means for the different strains is different from zero. Notice that the Model and Strain F tests are identical, since Strain is the only term in the model.


Before running a complex ANOVA, it is good practice to examine the groups using proc means with a BY or CLASS statement, which summarizes the values of the dependent variable within each group. (See also proc anova options)
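Such a check might look like the following sketch, which again assumes a data set named Clover with variables Strain and Nitrogen (assumed names).

```sas
/* Per-group summary of the response before fitting the ANOVA. */
proc means data=Clover n mean std min max;
   class Strain;      /* CLASS does not require the data to be sorted */
   var Nitrogen;
run;
```

With a BY statement instead of CLASS, the data set must first be sorted by the grouping variable.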

Correlations


PROC CORR computes correlation coefficients, which measure the linear association between two numeric random variables. If a random variable x is an exact linear function of another random variable y, then the correlation coefficient between x and y is 1 or -1. If there is no linear association between the two variables, then the correlation coefficient is 0. However, correlation does not imply causation: in some cases, no underlying causal relationship exists.


Example

The data come from a study of the Los Angeles Standard Metropolitan Statistical Area, in which the twelve observations are census tracts. For each tract, the total population, median school years, total employment, miscellaneous professional services, and median house value are recorded.


The program below runs proc corr to obtain the correlations among pop, school, employ, services and house.
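A minimal sketch of such a program follows. It assumes a data set named census already contains the five numeric variables; the data set name and the formats in the PROC PRINT step are assumptions chosen for illustration.

```sas
/* Assumes data set census with variables pop school employ services house. */
proc print data=census;
   format pop employ comma8. house dollar10.;   /* hypothetical output formats */
run;

/* Pearson correlations for every pair of the listed variables. */
proc corr data=census;
   var pop school employ services house;
run;
```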


Note: In the PROC PRINT step, the FORMAT statement defines the output formats.

Results of the example

Here is the output:

Note: The correlation table gives the correlation coefficient for every pair of the variables pop, school, employ, services, and house. From the results above, we can see strong linear associations between pop and employ, and between school and house.

See also

proc corr options

Regression

Regression analysis is the analysis of the relationship between one variable and another set of variables. The relationship is expressed as an equation that predicts a response variable (also called a dependent variable or criterion) from a function of regressor variables (also called independent variables, predictors, explanatory variables, factors, or carriers) and parameters.

Example

In the study, the heights, weights and ages of 19 school children are collected. Regression can be applied to fit the linear regression model

  Weight = a + b*Height + e, 

where a is the intercept, b is the slope and e is the random error.


The program below runs proc reg to obtain the relationship between weight (response variable) and height (independent variable):
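A minimal sketch of such a program follows. It assumes that the 19-child data correspond to the SASHELP.CLASS sample data set that ships with SAS, which contains the variables Height, Weight and Age; if the study data live elsewhere, substitute that data set name.

```sas
/* Assumption: SASHELP.CLASS holds the 19 observations described above. */
proc reg data=sashelp.class;
   model Weight = Height;     /* response = regressor */
run;
quit;                         /* PROC REG is interactive; QUIT ends it */
```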


Results of the example

Here is the output:



From the results above, the estimates are: a = -143.0, b = 3.9. The regression line becomes

Weight = -143.0 + 3.9*Height. 

A t-test is used to test whether each parameter is significantly different from zero. The test results (t=-4.432, p=0.0004 for Intercept and t=7.555, p<0.0001 for Height) indicate that both the intercept and the Height parameter are significantly different from zero.

The F statistic for the overall model is highly significant (F=57.076, p<0.0001), indicating that the model explains a significant portion of the variation in the data.

The degrees of freedom can be used to check the accuracy of the data and model. The model degrees of freedom are the number of parameters to be estimated minus 1. This model estimates two parameters, a and b; thus, the model degrees of freedom should be 2-1=1. The corrected total degrees of freedom are always the total number of observations in the data set minus 1, that is, 19-1=18.


Several simple statistics follow the ANOVA table. The Root MSE is an estimate of the standard deviation of the error term. The R-Square and Adj R-Square are two statistics used in assessing the fit of the model; values close to 1 indicate a better fit. The R-Square of 0.77 indicates the fit is quite good.

See also

proc reg options and other regressions: proc glm; proc logistic

Data manipulation

Selecting Data

The following statements are all used to select a subset of data from a raw data file.


if expression; If the expression is true, SAS includes the observation; otherwise, SAS excludes it.

if expression then delete; If the expression is true, SAS excludes the observation; otherwise, SAS includes it. Remark: the delete statement does the opposite of selecting data.

Example: Suppose you have saved a raw data file called shakesp.dat in your C:\SAS directory as follows:

A Midsummer Night's Dream 1595 comedy
Comedy of Errors          1590 comedy
Hamlet                    1600 tragedy
Macbeth                   1606 tragedy
Richard III               1594 history
Romeo and Juliet          1596 tragedy
Taming of the shrew       1593 comedy
Tempest                   1611 romance

Then you want to select only comedies by using a subsetting if statement:
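A minimal sketch of such a program follows. The data set name comedies, the variable names title and year, and the column positions are assumptions based on the file layout shown above; the variable name type matches the statements discussed in this section.

```sas
/* Read the raw file with column input and keep only the comedies. */
data comedies;
   infile 'C:\SAS\shakesp.dat';
   input title $ 1-25 year 27-30 type $ 32-38;
   if type = 'comedy';        /* subsetting IF: other observations are dropped */
run;

proc print data=comedies;
run;
```

Character comparisons in SAS are case sensitive, so the value in the condition must match the case used in the file.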



Note: In the program above, you could substitute the statement

 if type = 'tragedy'
   or type = 'romance'
   or type = 'history' then delete;

for the statement

 if type = 'comedy';

Generally, you use the subsetting if when it is easier to specify a condition for including observations, and use the delete statement when it is easier to specify a condition for excluding observations.

where condition;

or (where=(condition))

Only observations that satisfy the condition are included. It works much like the subsetting if statement.

In the program above, you could substitute the statement

where type = 'comedy';

for the statement

if type = 'comedy';

Note: While the other methods of subsetting work only in DATA steps, the WHERE statement works in PROC steps too. You can place (where=(condition)) just after any data= option of the PROC Statement, such as

proc print data=test (where=(type='comedy'));
run;

Graphs – Bar Charts, Scatter Plots

Chart Procedure Overview

The CHART procedure produces vertical and horizontal bar charts, block charts, pie charts, and star charts. These types of charts graphically display values of a variable or a statistic associated with those values. The charted variable can be numeric or character.

Bar Charts

The simplest type of chart you can produce is a frequency bar chart. For bar charts, the bars represent the counts of observations that have the values displayed on either the X axis or the Y axis.

Example: The data set SHIRTS contains the sizes of a particular shirt that is sold during a week at a clothing store, with one observation for each shirt sold.

The following program creates vertical and horizontal bar charts with frequency counts. The VBAR and HBAR statements produce vertical and horizontal bar charts, respectively, for the frequency counts of the Size values.
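A minimal sketch of such a program follows. The data values are hypothetical placeholders, not the store's actual sales; only the data set name SHIRTS and the variable Size come from the description above.

```sas
/* Hypothetical placeholder data: one observation per shirt sold. */
data shirts;
   input Size $ @@;
   datalines;
medium large small medium large medium small large medium medium
;
run;

proc chart data=shirts;
   vbar Size;     /* vertical bar chart of frequency counts */
   hbar Size;     /* horizontal bar chart of frequency counts */
run;
```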


Here is the output:


Note: The VBAR (vertical bar) and HBAR (horizontal bar) statements specify the variable for which you want frequency counts. In this case, SIZE is a character variable which has the values small, medium, and large.

Scatter Plots

To examine the relationship between two scores or measurements (any two interval level variables), it is common and appropriate to produce a scatter plot, which provides an easy, intuitive way to get a feel for the data.


The basic form of PROC PLOT is:
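A sketch of that basic form, using hypothetical data set and variable names:

```sas
/* mydata, yvar and xvar are hypothetical names. */
proc plot data=mydata;
   plot yvar*xvar;    /* first variable: vertical axis; second: horizontal axis */
run;
```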



Note: Even though these basic plots are typically called X-Y plots, SAS plots the first variable on the vertical axis and the second on the horizontal. (e.g. plot Height*Age, where Height is the Y axis, Age is the X axis). Be careful!

By default, SAS uses letters to mark the points on the plot: A for a single observation, B for two observations at the same point, C for three, and so on. To substitute a different character, such as “*”, specify it this way:
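For instance, assuming a data set named htwt with numeric variables Height and Age (assumed names), the plotting character can be set like this:

```sas
proc plot data=htwt;
   plot Height*Age='*';   /* every point is marked with an asterisk */
run;
```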



You can also use a third variable as the plot character, which gives a convenient label for each point. The following statement tells SAS to use the first letter of the variable Name (i.e. each student's name) to mark each point.
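A sketch of such a statement, assuming a data set named htwt that contains Height, Weight and a character variable Name (assumed names):

```sas
proc plot data=htwt;
   plot Height*Weight=Name;   /* first character of Name marks each point */
run;
```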



You can plot more than one variable on the vertical axis by using the OVERLAY option. In addition, if the data are already sorted by a grouping variable, adding a BY statement makes SAS produce a separate plot for each value of the BY variable.
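Both features are sketched here. The data set htwt and its variables Height, Weight and Age are assumed; the grouping variable Gender and the output data set htwt_sorted are hypothetical names used only for illustration.

```sas
/* OVERLAY: draw two plot requests on one set of axes. */
proc plot data=htwt;
   plot Height*Age='H' Weight*Age='W' / overlay;
run;

/* BY-group plots require data sorted by the BY variable. */
proc sort data=htwt out=htwt_sorted;
   by Gender;               /* hypothetical grouping variable */
run;

proc plot data=htwt_sorted;
   by Gender;
   plot Height*Weight;      /* one plot per value of Gender */
run;
```

Using distinct plotting characters ('H' and 'W') keeps the two overlaid series distinguishable.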


Example: To illustrate PROC PLOT, we apply it to the HTWT data introduced in the section Simple SAS Program (page 4) as follows:
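A minimal sketch of such a program, assuming the data set is named htwt and contains numeric variables Height, Weight and Age (assumed names):

```sas
proc plot data=htwt;
   plot Height*Weight         /* default letters A, B, C, ... mark the points */
        Height*Age='*';       /* asterisks mark the points */
run;
```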



Note: The statement above requests two plots: one for Height by Weight and another for Height by Age. In the first plot, SAS uses letters to mark the points on the plot; in the second one, SAS marks the points as “*”.


Here is the output:


… next is the plot Height * Age using an asterisk as substitution …


Now we use the values of the variable NAME to label the individual data points:
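A minimal sketch of such a program, assuming the htwt data set contains Height, Weight and a character variable Name (assumed names):

```sas
proc plot data=htwt;
   plot Height*Weight='*' $ Name;   /* '*' marks each point; Name labels it */
run;
```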



Note: To specify a label variable, follow the plot request with a dollar sign ($) and the name of the label variable. You may also specify a plotting symbol in quotation marks (e.g. '*' in this example).

Here is the output:

(See also PROC GPLOT to get more sophisticated plots. It is part of SAS/GRAPH software and is licensed separately from base SAS)
