Search User Guide
Note: This writeup contains only an outline of the options and features of Search. Before attempting to use Search, you should read about the technique in Searching for Structure (Sonquist, et al, 1974).
Search is a binary segmentation procedure used to develop a predictive model for a dependent variable. It searches among a set of predictor variables for the predictors that most increase the researcher’s ability to account for the variance or distribution of a dependent variable. The question “What dichotomous split on which single predictor variable will give us a maximum improvement in our ability to predict values of the dependent variable?” asked iteratively, is the basis for the Search algorithm.
Search divides the sample through a series of binary splits into a mutually exclusive series of subgroups. The splits are chosen so that at each step in the procedure the split into the two new subgroups accounts for more of the variance or distribution (reduces the predictive error more) than a split into any other pair of subgroups. The dependent variable may be continuous or categorical. The predictor variables may be ordinally or nominally scaled. Search is an elaboration of the Osiris III Automatic Interaction Detector (AID) program.
Research questions are often of the type “What is the effect of X on Y?” But the answer requires answering a larger question “What set of variables and their combinations seems to affect Y?” With Search a variable X that seems to have an overall effect may have its apparent influence disappear after a few splits, with the final groups, while varying greatly as to their levels of Y, showing no effect of X. The implication is that, given other things, X does not really affect Y.
Conversely, while X may seem to have no overall effect on Y, after splitting the sample into groups that take account of other powerful factors, there may be some groups in which X has a substantial effect. Think of economists’ notion of the actor at the margin. A motivating factor might affect those not constrained or compelled by other forces. Those who, other things considered, have a 40-60 percent probability of acting, might show substantial response to some motivator. Or a group with very high or very low likelihood of acting might be discouraged or encouraged by some motivator. But if X has no effect on any of the subgroups generated by Search, one has pretty good evidence that it does not matter, even in an interactive way.
The purpose of Search is to allow an evaluation of many competing and probably mis-specified models. It relies on the fact that the explanatory power of any one predictor is rapidly exhausted by a few binary splits using it, so that a sequence of binary splits allowing competing predictors at each split, can search data for structure without restrictive assumptions of linearity or additivity of effects. The approach is closer to analysis of variance components than to sequential regression.
Search makes a sequence of binary divisions of a dataset in such a way that each split maximally reduces the error variance or increases the information (chi-square or rank correlation). It finds the best split on each predictor, then takes the best of the best.
The process stops when additional splits are not likely really to improve predictions to a fresh sample or to the population, i.e., when the null probability from that split rises above some selected level (e.g., .05, .025, .01 or .005). Of course, having tried several possibilities for each of several predictors, the null probability is clearly understated. Alternative stopping rules can be used in any combination: minimum group size, maximum number of splits, minimum reduction in explained variance relative to the original total, or maximum null probability.
This enhanced Search program provides for four kinds of dependent or criterion variables: means, simple regressions of Y on X, classifications, and ranks. The split criterion uses some measure of reduction in uncertainty or error, not a level of significance. The reasons to use error reduction rather than significance are that 1) particularly at the start with large numbers of cases in the groups being split, almost any split that maximizes error reduction will be highly significant, and 2) a small potential splitoff might be very highly significant because it is very extreme or very homogenous, but splitting it off would do little to improve overall predictions back to the population.
With means the splitting criterion is reduction in unexplained error variance from using two means, rather than the single parent group mean. With regressions the splitting criterion is the reduction in error variance from using two simple regressions, rather than the single parent group regression. With classifications the splitting criterion is the likelihood-ratio chi-square, which fits with the variance components approach. With ranks the splitting criterion is Kendall’s tau-b, a rank correlation based on all possible pairs adjusted for ties.
For each predictor one can maintain its monotonic order (“monotonic”), try each class against all the others (“select”), or reorder each time according to the criterion variable (“free”). The last should be used rarely, and only with predictors with few classes, for it involves implicitly trying many things, resulting in a bias in favor of that predictor. In addition, the combinations split off are difficult to interpret and probably idiosyncratic, as the parent groups become smaller.
One might want to reassign missing information to some large class, or, better, use a multivariate assignment procedure, for example, Search itself with the chi-square option.
With monotonic predictors one tries the first class against the rest, then the first two classes against the rest, etc., making k-1 tries. With select predictors one tries each class against all the others, making k tries, but since the splitting criterion combines difference between the two new groups with both their sizes, there is an offsetting bias against the select option. The alternatives are not really independent, so the bias in favor of predictors with more classes should be small. And with at least 50 cases, adjusting the degrees of freedom would make little difference.
Predictors can also be hierarchically ranked as to when they are used. Rank 0 means compute the potential gain but do not split on that predictor. Ranks 1, 2, 3, etc., mean exhaust the rank 1 variables first, then try the rank 2 ones, then the rank 3 ones, etc. Since the program will produce recode statements to generate expected values or residuals, one can also hold aside some later-stage predictors for an analysis of the residuals.
This enhanced Search has added a significance test to the rules for stopping the splitting process. Given the prior searching and the possibility of sample design effects, the test is crude. Purists will object that the more classes in a predictor, the more alternatives tried, so a bias exists toward predictors with many classes. One can think of using up degrees of freedom, or adding alternative null probabilities. But adding probabilities in a Bronferroni-type correction vastly overcorrects, since the k-1 or k alternatives are not really independent. The only serious bias would come from freeing a predictor with 5 or more classes and reordering it at each stage.
The other three stopping rules, with their defaults, are: minimum final group size (default 25), minimum reduction in error variance relative to original total (default .05), and maximum number of splits (default 25). The error variance reduction rule can be too stringent when the first few splits greatly reduce the remaining error.
The use of weights to adjust for different sampling or response rates affects variances and tests, so the program calculates an estimate of that effect for the whole sample, based on the variance of the weights, and issues a warning. Weights should be used, because if they do not make a difference, nothing is lost, but if they do, the unweighted data are biased.
This enhanced Search has also added to the print output a structure table with entries for each group, numbered in order and indented, so that one can easily see the pedigree of each final group and its detail. We find such a table easier to use than branching diagrams (trees). With relatively little wordprocessing one has a publishable table. It is also easy to create a branching diagram from the group summary table.
Search can perform the following functions:
- Maximize differences in group means, group regression lines, distributions (maximum-likelihood chi-square criterion), or ranking (Kendall’s tau-b).
- Rank the predictors to give them preference in the partitioning.
- Sacrifice explanatory power for symmetry.
- Start after a specified partial tree structure has been generated.
Cases with missing-data in a continuous dependent variable or a covariate are deleted automatically. Cases with missing-data in a categorical dependant variable may be included or excluded through the valid codes specified or implied in the dependent variable statement. Cases with missing-data in predictor variables may be included or excluded through the valid codes specified or implied in the relevant predictor statement.
The print output will be and stored in a file called “name.lst”, where “name” is the name specified in the Search invocation. In the R, SAS, SPSS and Stata implementations it will also displayed in the output window. The major components of the print output are specified below.
Residuals and estimates output
Search can produce recode statements to determine group numbers, estimates, and/or residual values from raw data. The statements will be stored in a file called “name.res”, where “name” is the name specified in the Search invocation. If a dataout statement is supplied, Search will run the recode on the input data to add the group numbers, estimates, and/or residuals and write the result in the output dataset. Estimates (expected values) can be used to assign missing data cases or to see how well the results predict to fresh data. Residuals can be used to run a second-stage Search. This option is not available in the standalone implementation.
Input is the dataset specified by the “datain” statement. Search requires numeric, not character, variables.
- Maximum number of predictors: 200.
- Maximum predictor value: 31.
- Maximum number of categorical variable codes: 400.
- Maximum number of predefined splits: 49.
You can run Search with R, SAS, SPSS, Stata or Srcware. Here are links to instructions for each.
You’ll need to set the following parameters, however you run Search.
|Name||run name||Name of the run. The name will be used with various extensions to identify the setup file (name.set), the log file (name.log), the output file (name.lst), and the residuals file (name.res).|
|Dir||run directory||The path of the directory containing or to contain the Search setup and other files. If the run directory is omitted, the SAS the version will use the default directory, and the R, SPSS, Stata and Srcware versons will use the current working directory.|
|Specify “new” to indicate that the Search setup statements follow the Search invocation. The “setup=new” option may not be used when SAS is invoked non-interactively with a command-line setup file or when Search is invoked by the SAS Enhanced Editor. It is not used with the R, SPSS and Stata implementations. The default is “old”.|
|Specify “batch” to indicate that this is a batch run and to instruct Search not to write to a window or screen. Specify “debug” to indicate that this is a debug run and to instruct Search to preserve intermediate files.|
The following statements may be specified in a setup file “name.set”, where “name” is the name specified in the Search invocation. If “setup=new” is specified in the Search invocation, the setup statements are specified following the macro invocation and are written to a setup file “name.set”.
The statements are written in the form:
Keyword [=] [(] parameter [)] […];
The equal sign and parentheses are optional. The datain, dataout, and title statements have one parameter pair each. The libname, parameter, predictor, and predefined split statements may have more than one parameter. The first three or four letters are generally sufficient to specify a keyword or parameter.
A Search setup must have a datain statement, a title statement, a parameter statement, and at least one predictor statement, in that order. The dataout, libname, and predefined split statements are optional. The dataout and libname statements must precede the title statement, and the predefined split statements must follow the predictor statements. A statement may span more than one line, and more than one statement may be specified on a line.
|Datain||dataset||The input dataset. In the SAS implementation the dataset is specified as library.file, where library defaults to “work” if it is not supplied. In the R, SPSS, Stata and standalone implementations the dataset is specified as the full or relative path of the metadata/data files.|
|Dataout||dataset||The output dataset. This statement is used in conjunction with the group, residuals, and estimate keywords. If it is supplied, Search will add the group numbers, estimates, and/or residuals to the input data and write the result in the output dataset. In the SAS implementation the dataset is specified as library.file, where library defaults to “work” if it is not supplied. The Dataout option is not available in the standalone implementation.|
|Libname||path||The location (directory path) of the SAS library. The statement may have more than one name/location pair. The statement is ignored with a warning in the R, SPSS, Stata and standalone implementations.|
|Title||text||Text to be used as the title for the output.|
|Analysis type. Default: mean.|
|Covariate||variable||Covariate. Must be specified for regression analysis.|
|Dependent variable(s). A list of dependent variables or dependent variable codes may be supplied for “analysis=chi” or “analysis=tau”. If a list of dependent variables is supplied, the analysis is done on the distribution of the variables. If a list of dependent variable codes is supplied (e.g., “depv=V7/1,2,4-7,.”), only the codes listed are used in the analysis, with “.” indicating missing data. If a code list is not supplied, only codes 0-9 are used.|
|Variable(s) for estimates or expected values. For a categorical dependent variable or a distribution set of dependent variables, a set of variables representing the expected distribution for the case.|
|Explanatory||percent||Minimum increase in explanatory power required for a split. Default: 0.8 percent.|
|Group||variable||Variable for the final group number. Required if either estimates or residuals are requested.|
|ID||variable||Identification variable to print with each case classified as an outlier. Default: dependent variable.|
|Maximum||number||Maximum number of splits. Default: 25.|
|Minimum||number||Minimum number of cases in a group. Default: 25.|
|Null||percent||Maximum probability that there is really no gain from the split. Default: none, no significance test.|
|Outdistance||number||Number of standard deviations from the parent group mean defining an outlier. May be decimal valued. Outliers are reported but not excluded from the analysis. Outliers could be excluded in subsequent runs by creating a subset of the data that excludes cases with the outliers. This option is useful only if a trace option is specified. Default: 5.|
|Dictionary||Print the list of input variables.|
|Trace||Print the trace of best splits for each predictor for each split.|
|Full||Print the full trace of all splits for each predictor, including eligible but suboptimal splits.|
|Table||Print all the predictor summary tables.|
|First||Print the predictor summary tables for the first group.|
|Final||Print the predictor summary tables for the final groups.|
|Variable(s) for residuals. For a categorical dependent variable or a distribution set of dependent variables, a set of variables representing the deviation of the case from the expected pattern. A two-stage analysis can be performed by using the residuals from one analysis as the dependent variable(s) for a subsequent analysis.|
|Symmetry||percent||Amount of explanatory power one is willing to lose in order to have symmetry. Default: Sym=0.|
|Weight||variable||Weight variable. Optional.|
|Specify either the value of the largest acceptable code or a list of acceptable codes. These codes should range from 0 to 31 or “.” for missing data. Cases with codes outside the range are discarded. Default: Code=(0-9).|
|M||Monotonic codes which must be kept adjacent during the splitting. The default.|
|F||Free codes which may be reordered according to the split criterion.|
|S||Select codes which may individually split from the remaining codes.|
|Rank||number||Assigned rank. Rank 1 predictors are used before rank 2, rank 2 before rank 3, etc. A zero rank indicates that statistics are to be computed for the predictors, but they are not to be used in the partitioning. Default: Rank=1.|
|Predictor variable(s). Default: none, at least one predictor statement must be supplied.|
Predefined split statements
|Group||number||Number of the group to be split. Groups are specified in ascending order, where the entire original sample is group 1. Each set of parameters forms two new subgroups. Default: none, must be specified for each predefined split.|
|Variable||variable||Predictor variable used to make the split. Default: none, must be specified for each predefined split.|
|Predictor code(s) defining the first subgroup. All other codes will belong to the second subgroup. Default: none, must be specified for each predefined split.|
There can be four splitting criteria, based on the dependent variable type:
The splitting criterion in each case is the reduction in ignorance (error variance, etc.) or increase in information. Terms like classification and regression trees should be replaced by binary segmentation or unrestricted analysis of variance components, or searching for structure. With rich bodies of data, many non-linearities and non-additivities possible, and many competing theories, the usual restrictions and assumptions that one is testing a single model are not appropriate. What does remain, however, is a systematic, pre-stated searching strategy that is reproducible, not a free ransacking.
Means. For means the splitting criterion is the reduction in error variance, that is, the sum of squares around the mean, using two subgroup means instead of one parent group mean.
Regressions. For regressions (y=a+bx) the splitting criterion is the reduction in error variance from using two regressions rather than one.
Classifications. For classifications (categorical dependent variable), the splitting criterion is the likelihood-ratio chi-square for dividing the parent group into two subgroups.
Ranks. For rankings (ordered dependent variable), the splitting criterion is Kendall’s tau-b, a rank correlation measure.
There are four stopping rules, each with a default option:
- Maximum number of splits. Default: 25.
- Minimum number in any final group. Default: 25.
- Minimum reduction in error, relative to the original total. Default: 0.8 percent.
- Maximum null probability. Default: none, no significance test.
A combination of the minimum number in any final group and the minimum reduction in error is a primitive significance test, but a more formal test is possible. Assuming that the minimum in any final group is 15 or more, the degrees of freedom for any test will be over 30, large enough to assure reasonable normality, and a Z-ratio (ratio of the gain from a split relative to its standard error) would be 2.33 for a maximum probability that there is nothing there (null hypothesis) of .01. The loss from trying several splits is small if predictor order is maintained, or each class is only tried against all the others (k-1 or k).
For the tau-b option, we cannot define a minimum reduction in error, relative to the original total, so we use a minimum tau-b value for each split. Even with means a minimum error reduction can cause difficulty if the first few splits account for a large fraction of the variance, and the “significance level” however fraudulent, is perhaps a better stopping rule.
For the means and ranks criteria, the maximum null probability stopping rule is based on Z, the ratio of the gain from a split to its standard error, using the normal distribution for the null probabilities. For the regression option, we use an f-test to get the null probabilities, and for the chi option, we use the chi-squared distribution.
We do not multiply the null probabilities by the number of alternatives tried (the Bronferroni correction), since for monotonic predictors or select predictors with fewer than 10 categories, the alternatives are few enough and not really independent. We suggest not using the “free” option with more than three or four categories.
Agresti, Alan (1996), Introduction to Categorical Data Analysis, New York: John Wiley & Sons, Inc.
Dunn, Olive Jean, and Virginia A. Clark (1974), Applied Statistics: Analysis of Variance and Regression, New York: Holt, Rinehart and Winston.
Chow, G. (1960), “Test of Equality between Sets of Coefficients in Two Linear Regressions,” Econometrica, 29:591-605.
Gibbons, Jean Dickinson (1997), Nonparametric Methods for Quantitative Analysis, 3rd edition, Syracuse: American Sciences Press.
Hays, William (1988), Statistics, 4th edition, New York: Holt, Rinehart, & Winston.
Klem, Laura (1974), “Formulas and Statistical References,” in Osiris III, Volume 5, Ann Arbor: Institute for Social Research.
Sonquist, J. A., E. L. Baker and J. N. Morgan (1974), Searching for Structure, revised edition, Ann Arbor: Institute for Social Research, The University of Michigan.