

The econometric literature on segmentation has long shown that regression-based forecasting models work best when applied to homogeneous groups. This makes sense for several reasons. The factors affecting one population may not be the same factors that affect others. In forecasting telephone lines, for example, variables describing more complex dynamics in pricing, technology, and marketing may describe the demand for business lines better than the demand for residential connections. Even within a single variable there may be groupings that behave differently from one another. For example, the propensity to purchase consumer durables may differ between high-income and low-income groups. Or perhaps customers residing in different geographic locations behave very similarly. Although the forecaster has always had the option of modeling homogeneous groups separately, not until recently has software been available to help nonstatisticians identify similar groupings or empirical breaks in the data. The following discussion focuses on one such procedure, called Classification and Regression Trees (CART).
Tree analysis, along with other analytical tools such as factor analysis, was originated by social scientists to provide additional insight into the structure of the data being studied. The use of trees in regression dates back to the AID (Automatic Interaction Detection) program developed at the Institute for Social Research, University of Michigan, by Morgan and Sonquist in the early 1960s. Largely with the help of Jerome Friedman and others in 1980, CART emerged as a practical way of interpreting data, adding another tool to the analyst's arsenal. In the past, it has been used extensively in the medical profession to determine key factors in the risk of heart attack upon the patient's admission to the hospital. Now, with the help of new window-like software packages such as SPlus, business applications of CART are seeing ever-increasing use.
CART is a computationally intensive exploratory analysis tool that attempts to describe the structure of your data in a treelike fashion. SPlus, one of the more popular CART vendors, uses a measure called deviance to determine the tree structure. Deviance, a form of the likelihood ratio statistic, measures the heterogeneity of the tree structure. The procedure is nonparametric, meaning that no assumptions are made about the population's underlying distributions. This is quite different from regression analysis, where statistical assumptions are essential for precision, accuracy, and interpretability. As in regression, however, CART relates a single dependent variable (either binary or continuous) to a set of predictors. CART's advantages over regression-based approaches center on its ability to handle missing data and to capture nonlinearities and interactions within the data. Although a regression model can be specified ahead of time to include interactions (income * price, for example), CART detects them automatically, with no need for user intervention. (Source: SPlus User's Manual)
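
To make the role of deviance concrete, here is a minimal Python sketch of how a single split might be chosen for a continuous dependent variable. It assumes that deviance in this case reduces to the within-node sum of squared deviations from the node mean, which is the usual convention but is not spelled out in the text above. The function names node_deviance and best_split and the income and spending figures are hypothetical illustrations, not part of SPlus or any other package.

```python
def node_deviance(ys):
    """Deviance of one node for a continuous response (assumed form):
    the sum of squared deviations from the node mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(xs, ys):
    """Search every midpoint between adjacent distinct x values and return
    (threshold, reduction) for the split 'x <= threshold' that lowers the
    total deviance the most. Returns (None, 0.0) if no split helps."""
    parent = node_deviance(ys)
    pairs = sorted(zip(xs, ys))
    best_threshold, best_reduction = None, 0.0
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        reduction = parent - node_deviance(left) - node_deviance(right)
        if reduction > best_reduction:
            best_threshold, best_reduction = (lo + hi) / 2, reduction
    return best_threshold, best_reduction


# Made-up data: household income (thousands) and spending on durables.
income = [20, 25, 30, 60, 70, 80]
spending = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9]

threshold, reduction = best_split(income, spending)
print(threshold, reduction)
```

On this toy data the search places the break midway between the low-income and high-income households, at an income of 45. A full tree would repeat the same search within each resulting group and across every predictor, which is how interactions can surface without being specified in advance.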


