Screening and Selection
Screening and selection procedures are statistical methods for assigning individuals to two or more categories on the basis of certain tests or measurements that can be made upon them. Of concern usually is some desired trait or characteristic of the individuals that cannot be measured directly. All that can be done is to obtain an estimate for each individual from the results of the available tests and then to make the assignment on the basis of these estimates. The central statistical problem is to evaluate the properties of alternative schemes for utilizing the available data to make the assignments in order to choose the scheme that best achieves whatever objectives are considered to be most relevant for the particular application.
For example, some educational selection schemes may be regarded in this light. The individuals might be high school students and the categories “admit to college” and “do not admit to college.” The desired trait is future success in college, but only tests at the high school level are available. (As will be seen, many examples of selection and screening are somewhat more complex than this, particularly in their use of more than one level of screening. In this educational context one might instead use three categories: “admit,” “put on waiting list,” and “do not admit.”)
Denote by N the number of individuals to be assigned and let c be the number of categories. For any individual, let Y denote the unknown value of the desired trait and let X_1, …, X_p denote the measurements or scores that can be obtained and used as predictors of Y. The screening or selection procedure is a scheme that specifies in terms of X_1, …, X_p how each individual is to be assigned to one of the c categories.
The screening may be done at one stage (that is, all the measurements X_1, …, X_p become available before an individual is assigned) or it may be multistage. The advantage of a multistage procedure is that it may allow some individuals to be assigned at an early stage or, at least, to be eliminated from contention for the categories of interest, thus permitting the resources available for performing the tests to be concentrated on fewer individuals in the later stages.
The terms screening and selection are largely synonymous, although in particular applications one or the other may be preferred. Sometimes, in order to avoid the possible connotation that certain categories may be more desirable than others, a neutral term such as allocation is used. The term classification has a different shade of meaning, referring to the identification of which of several distinct distributions each individual belongs to (in taxonomy, for example, classification involves the assignment of an organism to its proper species) [see Multivariate analysis, article on classification and discrimination].
Various formulations of screening and selection problems have been proposed and investigated. A bibliography containing more than five hundred references has been given by Federer (1963). Of special interest is the case c = 2, in which the individuals are separated into two categories, a selected group and the remainder; the success of the screening procedure is judged by the values of Y for the individuals in the selected group. This article will be primarily devoted to this case of two categories.
The case where N is large
This section deals with the case of separation into two categories where either the number, N, of individuals is large enough for their Y-values to be considered as forming a continuous distribution or, alternatively, the N individuals themselves are considered as a random selection from a conceptually infinite population. The object of the screening procedure is to produce a distribution of Y-values in the selected group that is, in an appropriate sense, an improvement on the original distribution. For example, one might want the selected group to have as high a median Y as possible.
In many applications, the feature of the distribution of Y that is considered most important is its mean. For example, in a plant-breeding program, Y might stand for the crop yield of the individual varieties, and the purpose of the program might be to select a set of varieties whose mean Y is as high as possible. The difference in the mean Y for the selected varieties from that of the original group is referred to as the “advance” or the “gain due to selection.”
Cochran (1951) summarized the mathematical basis for selection procedures designed to maximize the mean Y in the selected set. He showed that the optimum selection rule to use at each stage should be based on the regression of Y on the X’s that are known at that stage. In the case where the joint distribution of these variates is multivariate normal, this regression is the linear combination of the X’s that has maximum correlation with Y. [See MULTIVARIATE ANALYSIS, articles on CORRELATION.]
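As a hedged illustration of the single-stage normal case: suppose the fraction with the highest values of the regression score is selected, and that score has correlation rho with Y. The standard result for truncation of a bivariate normal distribution then gives an expected gain of rho × sigma_Y × φ(z)/(selected fraction), where z is the cutoff whose upper-tail area equals the selected fraction. The function and parameter names below are hypothetical, and the sketch is not taken from Cochran's paper.

```python
from statistics import NormalDist


def expected_gain(selected_fraction, rho, sigma_y=1.0):
    """Expected gain in mean Y from keeping the top `selected_fraction`
    of candidates ranked on a score that has correlation `rho` with Y.

    Assumes the score and Y are bivariate normal (an assumption of this
    sketch) and that the population is effectively infinite.
    """
    if not 0.0 < selected_fraction < 1.0:
        raise ValueError("selected_fraction must lie strictly between 0 and 1")
    std_normal = NormalDist()
    # Cutoff on the standardized score with upper-tail area = selected_fraction.
    cutoff = std_normal.inv_cdf(1.0 - selected_fraction)
    # Mean of the standardized score among the selected candidates.
    intensity = std_normal.pdf(cutoff) / selected_fraction
    return rho * sigma_y * intensity


if __name__ == "__main__":
    for fraction in (0.5, 0.2, 0.04):
        print(fraction, round(expected_gain(fraction, rho=0.6), 4))
```

The gain grows as the selected fraction shrinks and is proportional to the correlation, which is why a rule based on the score with maximum correlation with Y is the one to prefer.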
Plant selection. In a plant-breeding program for improving the yield of a particular crop, a large number of potential new varieties become available in any year. These are tested in successive plantings, and the better-yielding types in each planting are selected to produce the seed for the next sowing, until finally a small fraction of the original number remain as possible replacements for the standard varieties in commercial use.
Finney (1966) and Curnow (1961) have carried out an extensive theoretical and numerical investigation of a fairly general type of selection procedure that is particularly applicable to plant selection. The problem is to reduce in k stages an initial set of N candidates to a predetermined fraction, π, called the “selection intensity,” using a fixed total expenditure of resources, A. At stage r (1 ≤ r ≤ k), the candidates selected at stage r − 1 are tested, using resources A_r, to obtain for each a score, X_r, which estimates Y with a precision dependent upon A_r; the fraction, P_r (0 < P_r ≤ 1), having the highest scores are selected, and the remainder are discarded. Stage 0 consists in selecting at random a fraction, P_0, of the initial set. The problem is to choose P_r and A_r to maximize the expected mean Y-value in the selected group, subject to A_1 + ⋯ + A_k = A and P_0 P_1 ⋯ P_k = π, where A and π are given. The authors found that approximately optimum results were obtained with equal resources and equal selected fractions at every stage, A_r = A/k and P_r = (π/P_0)^(1/k) for r = 1, …, k, called the “symmetric” scheme, usually with P_0 = 1 (so that P_r = π^(1/k)), although in some circumstances a value P_0 < 1 effected further improvement. Three or four stages at most were sufficient.
In the context of plant selection, the N candidates are the new crop varieties produced in a particular year, and the resource expenditure, A, is the area of land available for testing, which must be divided into separate portions for varieties being tested for the first time, varieties selected on the basis of last year’s tests to be tested in stage 2, and so on. For example, suppose N = 200 varieties are started in a two-year program to select 8 to compare with the standard commercial types (thus, π = .04). Then at the end of each year, one-fifth of the varieties should be selected (that is, 40 at the end of year 1 and 8 at the end of year 2), with equal areas of land to be divided among the 200 varieties in year 1 and the 40 varieties in year 2.
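The arithmetic of this example can be sketched as follows, assuming no random discard at stage 0 and the same selected fraction at every stage, so that each fraction is the k-th root of the selection intensity. The function name, the dictionary keys, and the land area of 10 units are hypothetical choices made only for illustration.

```python
def symmetric_plan(n_candidates, selection_intensity, total_area, stages):
    """Lay out a selection program with equal area and equal selected
    fraction at every stage (assumes nothing is discarded at stage 0)."""
    fraction = selection_intensity ** (1.0 / stages)
    area_per_stage = total_area / stages
    plan = []
    entering = n_candidates
    for stage in range(1, stages + 1):
        selected = round(entering * fraction)
        plan.append({
            "stage": stage,
            "tested": entering,
            "selected": selected,
            "area_per_variety": area_per_stage / entering,
        })
        entering = selected
    return plan


if __name__ == "__main__":
    # 200 varieties, 8 to be retained after two years, 10 units of land.
    for row in symmetric_plan(200, 8 / 200, total_area=10.0, stages=2):
        print(row)
```

With these inputs the plan keeps 40 varieties after year 1 and 8 after year 2, and each surviving variety receives five times as much land in year 2 as in year 1, since the same area is shared among one-fifth as many varieties.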
Drug screening. In drug screening, the problem is to screen a large supply of chemical compounds by means of a biological test, usually in laboratory animals, in order to select for further testing the few that may possess the biological activity desired. Here Y stands for the unknown activity level of a compound (averaged over a conceptual population of animals), estimates being provided by the test results, X_1, X_2, …. The distribution of Y in the population of compounds available for screening will usually have a large peak at Y = 0, since most of the compounds do not possess the activity being sought unless a specific class of compounds chemically related to known active compounds is being screened. The number of compounds available for testing usually exceeds the capacity of the testing facilities; therefore, part of the problem in drug screening is to determine the optimum number, N, of compounds to screen in a given period of time.
The mean Y in the selected group does not have as much relevance in drug screening as it does in plant selection. Instead, a value, a, is usually specified such that a drug is of interest if its activity equals or exceeds level a. The screening procedure is then designed to maximize the number of compounds in the selected group having Y ≥ a, usually subject to the requirement that the total number of compounds selected over a certain period of time is fixed. Davies (1958) and King (1963) have considered in detail this approach to the statistical design of drug-screening tests.
Educational selection. The consideration of selection procedures to allocate school children to different “streams” is necessarily much more complex than in the applications considered above. For one thing, there can be no question of rejection; the object, at least in principle, is to provide the education most suitable for each child. Furthermore, it can be expected that the characteristic Y of each child will be altered by the particular stream in which he may be placed. An admirable discussion of the problems was given by Finney (1962), who described, as an illustration of the methodology, a simplified mathematical model of the educational selection process then in operation in the British school system.
Finney considered university entrance as a two-stage selection process: the first stage is the separation of students at the age of 11+ into those who will receive a grammar school education and those who will go instead to a secondary school, and the second stage is university entrance. Denoting by X_1 the composite score of all test results available at the first stage, by X_2 the composite score at university entrance, and by Y the “suitability” of a student for university study as determined by his subsequent university grades, Finney considered Y, X_1, and X_2 to have a multivariate normal distribution with correlation coefficients estimated from available data. He studied the effects that varying the proportion of students admitted to grammar school, as well as the relative proportions admitted to universities from the two types of school, had on the average value of Y in the selected group and on the proportion of university entrants having Y ≥ a. Interesting numerical results are presented, but the main feature of the paper is its demonstration of how the approach can bring about a clearer insight into the issues involved in a selection process.
The case where N is small
Rather different methods of approach have been developed for selection when the number N is small enough so that the individual values of Y, rather than their distribution, may be considered. The object of the selection procedure is expressed in terms of the Y’s; for example, the object may be to select the individual with the largest value of Y, to select the individuals with the t largest values, or to rank the N individuals in order according to their values of Y. These are special cases of the general goal of dividing the N individuals into c categories, containing respectively the n_c individuals with the highest values of Y, the n_(c−1) individuals with the next highest values, and so on, down to the n_1 individuals with the lowest values of Y. Bechhofer (1954) developed expressions for the probability of a correct assignment of the N individuals to the c categories.
Selecting the “best” of N candidates. Of most frequent interest in practical applications is the selection of the best of several candidates, “best” being interpreted to mean the one with the largest Y. Estimates of the unknown Y’s are obtained from experimentation, and the problem usually is to decide how much experimentation needs to be done.
In a single-stage selection procedure, an experiment consisting of taking n observations for each candidate is performed, and the candidate with the highest observed mean is selected. The probability of a correct selection is the probability that the candidate with the highest Y will also have the highest observed mean value; this is a function not only of the number n of observations but also of the unknown configuration of values of Y.
Bechhofer (1954) recommended that the experimenter choose n so that the probability of a correct selection would exceed a specified value, P, whenever the best value of Y exceeded the others by at least a specified amount, d. His paper contains tables of the required value of n, calculated on the assumption that the Y’s are in the “least favorable” configuration, which in this case is the configuration where all the Y’s except the best one are equal and are less than the best one by the amount d.
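A minimal numerical sketch of this calculation follows, assuming normally distributed observations with a known common standard deviation, sigma. In the least favorable configuration the probability of a correct selection is the integral of Φ(z + d√n/sigma)^(N−1) φ(z) over all z, which is evaluated below by a simple trapezoidal rule. The function names, grid settings, and search limit are hypothetical, and the code does not reproduce Bechhofer's tables.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)


def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def prob_correct_selection(n_candidates, n_obs, d, sigma, grid=4001, span=8.0):
    """Probability that the best candidate has the highest sample mean when
    it exceeds every other candidate by exactly d (least favorable case)."""
    shift = d * sqrt(n_obs) / sigma
    step = 2.0 * span / (grid - 1)
    total = 0.0
    for i in range(grid):
        z = -span + i * step
        weight = 0.5 if i in (0, grid - 1) else 1.0  # trapezoid end points
        total += weight * normal_cdf(z + shift) ** (n_candidates - 1) * normal_pdf(z)
    return total * step


def smallest_sample_size(n_candidates, d, sigma, p_required, n_max=10000):
    """Smallest n per candidate whose least-favorable probability of a
    correct selection reaches p_required; None if n_max is too small."""
    for n_obs in range(1, n_max + 1):
        if prob_correct_selection(n_candidates, n_obs, d, sigma) >= p_required:
            return n_obs
    return None


if __name__ == "__main__":
    print(smallest_sample_size(n_candidates=3, d=0.5, sigma=1.0, p_required=0.90))
```

Because any other configuration satisfying the minimum difference d gives a probability at least as large, a sample size chosen this way meets the requirement whatever the true values of Y.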
Another approach to determining the value of n is based on striking an optimum balance between the cost of taking observations, which is a function of n, and the economic loss incurred if some candidate other than the best one is selected. The loss due to an incorrect selection is assumed to be a function, usually linear, of the difference between the largest Y and the selected Y. With the probability of selecting any particular candidate taken into account, an expected loss or risk function, which is a function of n and the unknown Y’s, is determined. Somerville (1954) showed how the minimax principle can be used to determine the optimum sample size; this procedure is appropriate if no prior information is available about the unknown Y-values [see Decision theory]. In the case where prior information is available, Dunnett (1960) showed how such information can be utilized in determining the sample size; he also made numerical comparisons between alternative procedures.
Gupta (1965) considered the situation in which the experimenter is willing to select a larger group than is actually needed and is concerned with guaranteeing a specified probability, P, that the best candidate is included in the selected group. In this way, the need to specify a minimum difference, d, as in Bechhofer’s method, is avoided, but there is the drawback of not necessarily having a unique selection. Gupta investigated the effect of the sample size, n, and the configuration of the Y’s on the expected size of the selected group.
Sequential procedures have also been investigated; for example, Paulson (1964) considered a sequential method for dropping candidates from contention at each stage until only one remains, so as to achieve a specified probability, P, that the best one is selected whenever its value Y exceeds the others by at least d. [See Sequential analysis.]
A medical selection problem. An interesting method (Colton 1963) for selecting the better of two medical treatments uses some of the principles discussed above but also contains ingenious points of difference. In the problem considered, there is a fixed number of patients to be treated. A clinical trial is performed on a portion of them, with equal numbers being given each treatment. On the basis of the trial, one of the two treatments is selected to treat the remainder of the patients. The problem is to determine how many patients to include in the trial in order to maximize the expected total number receiving the better drug. Sequential procedures for making the selection are also discussed.
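The trade-off in the trial size can be illustrated with a deliberately simplified sketch, which is not Colton's actual model: responses are assumed normal with known standard deviation sigma, and the true difference delta between the treatment means is treated as known, whereas in practice it is not. With n patients on each treatment in the trial, the better treatment has the higher observed mean with probability Φ(delta√(n/2)/sigma). All names and numbers below are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def expected_on_better(total_patients, n_per_arm, delta, sigma):
    """Expected number of patients who receive the better treatment when
    n_per_arm patients are given each treatment in the trial and the
    remainder all receive the treatment with the higher observed mean."""
    remainder = total_patients - 2 * n_per_arm
    p_pick_better = normal_cdf(delta * sqrt(n_per_arm / 2.0) / sigma)
    # Half of the trial patients receive the better treatment by design.
    return n_per_arm + remainder * p_pick_better


def best_trial_size(total_patients, delta, sigma):
    """Trial size per treatment that maximizes the expected number of
    patients receiving the better treatment (simple exhaustive search)."""
    sizes = range(0, total_patients // 2 + 1)
    return max(sizes, key=lambda n: expected_on_better(total_patients, n, delta, sigma))


if __name__ == "__main__":
    n_best = best_trial_size(total_patients=1000, delta=0.5, sigma=1.0)
    print(n_best, expected_on_better(1000, n_best, 0.5, 1.0))
```

A very small trial risks choosing the worse treatment for nearly everyone, while a very large trial leaves few patients to benefit from the choice, so the maximum lies strictly between the two extremes.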
Tournaments
A tournament is a series of contests between pairs of N players (or teams) with the object either of selecting the best player or of ranking the players in order. It may be regarded as a selection procedure in which the experiments consist of paired comparisons between candidates. David (1959) studied some properties of two types of tournaments, the knockout and the round robin. Glenn (1960) compared the round robin and several variations of the knockout tournament for the case of four contestants; he found that a single knockout tournament with each contest being on a “best two out of three” basis achieved the highest probability that the best player will win, but at the “expense” of requiring a higher average number of games.
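The effect of replaying each contest can be seen in a small sketch under a simplifying assumption that is not Glenn's model: the best player wins any single game against any opponent with the same probability, independently from game to game. To win a four-player knockout the best player must then win two contests in succession. The function names are hypothetical.

```python
from math import comb


def match_win_probability(p_game, best_of=1):
    """Probability of winning a contest decided on a 'best of' basis,
    given probability p_game of winning each independent game."""
    if best_of % 2 == 0 or best_of < 1:
        raise ValueError("best_of must be a positive odd number")
    needed = (best_of + 1) // 2
    # The winner takes `needed` games; the loser takes j < needed games.
    return sum(
        comb(needed - 1 + j, j) * p_game ** needed * (1.0 - p_game) ** j
        for j in range(needed)
    )


def knockout_win_probability(p_game, best_of=1, rounds=2):
    """Probability that the best player wins a knockout tournament with
    the given number of rounds (two rounds for four players)."""
    return match_win_probability(p_game, best_of) ** rounds


if __name__ == "__main__":
    for p in (0.6, 0.7, 0.8):
        print(p, knockout_win_probability(p, 1), knockout_win_probability(p, 3))
```

Whenever the single-game probability exceeds one-half, the best-of-three format raises the chance that the best player wins, at the cost of two or three games per contest instead of one.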
Other problems
There are many other interesting topics in screening and selection. One is group screening, in which a single test is performed on several candidates as a group to determine whether any of them possess the characteristic of interest. When an affirmative answer is obtained, further tests are performed to determine which ones possess the characteristic. One application of this procedure is in blood testing for the presence of some disease; a great saving in the number of tests necessary is accomplished by physically pooling several samples and making a single test. Another application is in factor screening in industrial research (see Sobel & Groll 1959; Watson 1961).
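The saving from pooling can be sketched for the simplest two-stage scheme, in which every member of a positive pool is retested individually. The sketch assumes that samples are independent, that each has the same probability of possessing the characteristic, and that the test is error-free; the function names and the prevalence of 1 per cent are hypothetical.

```python
def expected_tests_per_sample(prevalence, group_size):
    """Expected number of tests per sample when samples are pooled in
    groups and every member of a positive pool is retested singly."""
    if group_size == 1:
        return 1.0  # no pooling: one test per sample
    p_pool_positive = 1.0 - (1.0 - prevalence) ** group_size
    # One pooled test shared by the group, plus individual retests
    # (one per member) whenever the pool is positive.
    return 1.0 / group_size + p_pool_positive


def best_group_size(prevalence, max_size=100):
    """Group size that minimizes the expected number of tests per sample."""
    return min(
        range(1, max_size + 1),
        key=lambda g: expected_tests_per_sample(prevalence, g),
    )


if __name__ == "__main__":
    g = best_group_size(0.01)
    print(g, round(expected_tests_per_sample(0.01, g), 4))
```

At a prevalence of 1 per cent the best group size under these assumptions is 11, which requires on average about one-fifth as many tests as individual testing.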
In practice, many screening problems are multivariate; that is, there is more than one trait, Y, of interest, and the traits are likely to be correlated. Sometimes the measurements on the several traits are reduced to a single variate by combination into a suitable index, perhaps with weights determined by the economic worth of each trait. Sometimes only one trait is dealt with at a time, the candidates considered for selection on the basis of trait Y_r being those who have previously been selected on the basis of Y_1, …, Y_(r−1) in turn. Much work remains to be done to determine the best procedures for use in multivariate screening. (See Rao 1965 for a treatment of some of the mathematical problems.)
C. W. Dunnett
[See also Clustering; Multivariate analysis, article on classification and discrimination; Statistical analysis, special problems of, article on outliers.]
BIBLIOGRAPHY
Bechhofer, R. E. 1954 A Single-sample Multiple Decision Procedure for Ranking Means of Normal Populations With Known Variances. Annals of Mathematical Statistics 25:16–39.
Cochran, W. G. 1951 Improvement by Means of Selection. Pages 449–470 in Berkeley Symposium on Mathematical Statistics and Probability, Second, Proceedings. Berkeley: Univ. of California Press.
Colton, Theodore 1963 A Model for Selecting One of Two Medical Treatments. Journal of the American Statistical Association 58:388–400.
Curnow, R. N. 1961 Optimal Programmes for Varietal Selection. Journal of the Royal Statistical Society Series B 23:282–318. Contains eight pages of discussion.
David, H. A. 1959 Tournaments and Paired Comparisons. Biometrika 46:139–149.
Davies, O. L. 1958 The Design of Screening Tests in the Pharmaceutical Industry. International Statistical Institute, Bulletin 36, no. 3:226–241.
Dunnett, C. W. 1960 On Selecting the Largest of k Normal Population Means. Journal of the Royal Statistical Society Series B 22:1–40. Includes 10 pages of discussion.
Federer, Walter T. 1963 Procedures and Designs Useful for Screening Material in Selection and Allocation, With a Bibliography. Biometrics 19:553–587.
Finney, D. J. 1962 The Statistical Evaluation of Educational Allocation and Selection. Journal of the Royal Statistical Society Series A 125:525–549. Includes “Discussion on Dr. Finney’s Paper,” by T. Lewis et al.
Finney, D. J. 1966 An Experimental Study of Certain Screening Processes. Journal of the Royal Statistical Society Series B 28:88–109.
Glenn, W. A. 1960 A Comparison of the Effectiveness of Tournaments. Biometrika 47:253–262.
Gupta, Shanti S. 1965 On Some Multiple Decision (Selection and Ranking) Rules. Technometrics 7:225–245.
King, E. P. 1963 A Statistical Design for Drug Screening. Biometrics 19:429–440.
Paulson, Edward 1964 A Sequential Procedure for Selecting the Population With the Largest Mean from k Normal Populations. Annals of Mathematical Statistics 35:174–180.
Rao, C. R. 1965 Problems of Selection Involving Programming Techniques. Pages 29–51 in IBM Scientific Computing Symposium on Statistics, Yorktown Heights, N.Y., 1963 Proceedings. White Plains, N.Y.: IBM Data Processing Division.
Sobel, Milton; and Groll, Phyllis A. 1959 Group Testing to Eliminate Efficiently All Defectives in a Binomial Sample. Bell System Technical Journal 38:1179–1252.
Somerville, Paul N. 1954 Some Problems of Optimum Sampling. Biometrika 41:420–429.
Watson, G. S. 1961 A Study of the Group Screening Method. Technometrics 3:371–388.