SAS Work Shop - REG Statistical Programs
Handout # 3 College of Agriculture

Regression Diagnostics

Residual Analysis:       One of the most important aspects of the regression technique is the residual analysis. This involves numeric and graphical inspection of the model residuals defined as the observed values minus the predicted values. The ability of PROC REG to do such analyses is unequalled in other SAS procedures and is the main reason for developing regression models using PROC REG rather than PROC GLM. Residual analysis in PROC REG can be approached in three basic ways outlined below.

MODEL Statement Options:       As mentioned earlier, some MODEL statement options are related to diagnostics and, in particular, to residual analysis. The most obvious of these is the R option. This requests the following: observed values, predicted values, residuals, standard errors, studentized residuals, and Cook's D statistic. The P option prints only the observed value, the predicted value, and the residual. Standardized estimates can be obtained with the STB option. The first-order autocorrelation of the residuals can also be examined with the DW (Durbin-Watson) option.
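
As a minimal sketch, several of these options can be combined in one MODEL statement. The data set and variable names (PHOTO, IRRAD) are borrowed from the examples later in this handout, and the particular combination of options is only illustrative.

```sas
PROC REG DATA=PHOTO;
	/* R   : predicted values, residuals, studentized residuals, Cook's D */
	/* STB : standardized estimates                                       */
	/* DW  : Durbin-Watson statistic for the residuals                    */
	MODEL PHOTO = IRRAD / R STB DW;
RUN;
```

Note that the Durbin-Watson statistic is meaningful only when the observations are stored in a natural order, such as time order.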

PLOT and PAINT Statements:       One of the simplest and most effective methods of residual analysis is to plot the residuals against the predicted values and against the regressors. Problems such as heteroskedasticity and systematic lack of fit tend to stand out in such plots. Recent versions of PROC REG (version 6 and higher) make these plots easy to produce with a PLOT statement. The syntax for PLOT is similar to that of PROC PLOT: PLOT yvar*xvar=symbol. The PROC REG PLOT statement differs in the variable names and options available. The variables can come from the data set being analyzed or from the results of the current regression analysis. To plot the latter, say the residuals, the specific variable name to use is "residual." (note that the period at the end is part of the name). Other variables available include "rstudent." and "predicted.", again with the trailing period. In fact, any statistic available in the OUTPUT statement can be used.

Example:

		PROC REG DATA=PHOTO;
			MODEL PHOTO=IRRAD;
			PLOT PHOTO*IRRAD='O' PREDICTED.*IRRAD='P'/OVERLAY;
			PLOT RSTUDENT.*PREDICTED.='+' RSTUDENT.*IRRAD='+';
		RUN;
		
In this example the photosynthetic rate is regressed against irradiation. The first PLOT statement requests that the observed and predicted values be plotted against irradiation as O's and P's, respectively. The OVERLAY option forces both plots onto the same graph. The second PLOT statement plots the studentized residuals (RSTUDENT., computed with the current observation deleted) against the predicted values and against irradiation. Both use a + symbol and appear on separate graphs. Other options for PLOT allow for multiple plots per printed page, default symbols, and clearing of graphs.

      In some cases, it is useful to identify points on a given graph according to some criterion. For this, PROC REG uses the PAINT statement. The PAINT statement is issued before a PLOT statement and defines the condition and the plotting symbol to be used. The syntax is:

PAINT condition / options;

where the condition compares a variable with a value.

This can be very useful when trying to understand the structure of the data or locating troublesome data points.

Example:
		PROC REG DATA=PHOTO;
			MODEL PHOTO=IRRAD;
			PAINT CO2 > 600/SYMBOL='#';
			PLOT PHOTO*IRRAD='O' PREDICTED.*IRRAD='P'/OVERLAY;
			PLOT RSTUDENT.*PREDICTED.='+' RSTUDENT.*IRRAD='+';
		RUN;
		
This example produces the same plots as before, but now changes the plotting symbol to # for those observations with CO2 levels greater than 600. Any variable in the data set or from the analysis can be used for the criteria in the PAINT statement. Multiple PAINT statements are allowed and are cumulative.
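
Because PAINT statements are cumulative, several groups of points can be marked on one plot. The sketch below is an illustration only: the cutoff of 2 for the studentized residual and the * symbol are arbitrary choices, not recommendations from this handout.

```sas
PROC REG DATA=PHOTO;
	MODEL PHOTO=IRRAD;
	/* Mark observations with high CO2 using # */
	PAINT CO2 > 600 / SYMBOL='#';
	/* Additionally mark large studentized residuals using * */
	PAINT RSTUDENT. > 2 OR RSTUDENT. < -2 / SYMBOL='*';
	/* No symbol is given here, so unpainted points use the default symbol */
	PLOT RSTUDENT.*PREDICTED.;
RUN;
```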

OUTPUT Statement:       The third method for addressing residual analysis is the OUTPUT statement. This allows the results obtained from the MODEL statement options to be put into a data set for further analysis -- it may be used to test the distributional assumption of the residuals, for example. It also permits the values to be exported to software other than SAS for plotting or analysis. The syntax for OUTPUT is:

OUTPUT OUT=(data set name) option=var1 option=var2 ... option=varn.

The data set name specifies where the output will go, and each option names a requested statistic together with the variable that will hold it.

Example:
		
		PROC REG DATA=PHOTO;
			MODEL PHOTO = CO2;
			OUTPUT OUT=PRED P=YHAT RSTUDENT=RESID L95M=LOW U95M=HIGH;
		RUN;

		PROC UNIVARIATE DATA=PRED PLOT NORMAL;
			VAR RESID;
		RUN;
		
The OUTPUT statement used in this example creates a data set named PRED which contains several new variables. These are (in order requested): YHAT=predicted values, RESID=studentized residuals, LOW=lower 95% CI on mean values, and HIGH=upper 95% CI on mean values. SAS gives the 95% confidence levels for the last two by default.

These are not the only variables in PRED, however. The data set created by OUTPUT will have the new requested variables and all the original variables. Thus, there is no need to merge this new data set with the old one! The second half of the example runs a univariate summary procedure on the residuals and specifically calls the PLOT and NORMAL options, which produce a stem-and-leaf diagram and a test for normality of the residuals. This step allows the user to examine the variance, skewness, and other summary information on the residuals.
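
One way to use the PRED data set further is to list it and to plot the fitted line with its confidence limits. This is a sketch that assumes the OUTPUT statement from the example above has already been run; PROC PRINT and PROC PLOT are used here only as an illustration.

```sas
/* List the original variables alongside the new ones */
PROC PRINT DATA=PRED;
	VAR PHOTO CO2 YHAT RESID LOW HIGH;
RUN;

/* Overlay observed values, fitted values, and the 95% limits on the mean */
PROC PLOT DATA=PRED;
	PLOT PHOTO*CO2='O' YHAT*CO2='P' LOW*CO2='-' HIGH*CO2='-' / OVERLAY;
RUN;
```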

Influence and Collinearity:       Other diagnostic features of PROC REG examine the influence of data points and multicollinearity among regressors. These are invoked as MODEL statement options and produce a variety of printed output.

Influence:       The INFLUENCE option of PROC REG produces several measures of influence for each observation. These include the residual, the studentized residual, the hat matrix diagonal (leverage), and the statistics DFFITS and DFBETAS.
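
The same kinds of measures can also be saved with the OUTPUT statement and screened in a DATA step. The sketch below is only one possible approach: the data set names INFL and FLAGGED and the new variable names are hypothetical, and the cutoffs (2 for the studentized residual, 2p/n for leverage, 2*sqrt(p/n) for DFFITS) are commonly cited rules of thumb, not values taken from this handout.

```sas
PROC REG DATA=PHOTO;
	MODEL PHOTO = IRRAD CO2 RESIST;
	/* Save influence measures under hypothetical variable names */
	OUTPUT OUT=INFL RSTUDENT=RSTUD H=LEV DFFITS=DFFIT COOKD=COOKD;
RUN;

DATA FLAGGED;
	/* N = number of observations in INFL (assumes no missing values) */
	SET INFL NOBS=N;
	P = 4;   /* assumed: 3 regressors plus the intercept */
	/* Keep only observations that exceed at least one cutoff */
	IF ABS(RSTUD) > 2 OR LEV > 2*P/N OR ABS(DFFIT) > 2*SQRT(P/N);
RUN;

PROC PRINT DATA=FLAGGED;
RUN;
```

Observations flagged this way are candidates for closer inspection, not automatic deletion.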

Collinearity:        Collinearity implies a lack of independence between regressors and can lead to unstable estimates with inflated standard errors. The PROC REG options for examining collinearity are COLLIN, VIF and TOL. The main option here is COLLIN, which outputs eigenvalues, condition indices, and variance proportions associated with the regressors. The options VIF and TOL give the Variance Inflation Factor and the Tolerance, respectively, which are reciprocals of one another.

Example:
		PROC REG DATA=PHOTO;
			MODEL PHOTO = IRRAD CO2 RESIST/INFLUENCE COLLIN;
		RUN;
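
The VIF and TOL options can be requested in the same way. The sketch below reuses the model from the example above; the relationship between the two statistics is noted in the comments.

```sas
PROC REG DATA=PHOTO;
	/* TOL for a regressor = 1 - R-square from regressing it on the others */
	/* VIF = 1 / TOL                                                       */
	MODEL PHOTO = IRRAD CO2 RESIST / VIF TOL;
RUN;
```

For instance, a regressor whose R-square against the other regressors is 0.9 has a tolerance of 0.1 and a variance inflation factor of 10.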
		

