STANFORD ARTIFICIAL INTELLIGENCE PROJECT MEMO AIM-145 COMPUTER SCIENCE DEPARTMENT REPORT NO. CS-221 A HEURISTIC PROGRAMMING STUDY OF THEORY FORMATION IN SCIENCE BY BRUCE G. BUCHANAN EDWARD A, FEIGENBAUM JOSHUA LEDERBERG JUNE 1971 COMPUTER SCIENCE DEPARTMENT STANFORD UNIVERSITY wy wy # Aw yf’ STANFORD ARTIFICIAL INTELLIGENCE PROJECT JUNE 1971 MEMO ATM-145 COMPUTER SCIENCE DEPARTMENT REPORT NO. CS 221 A HEURISTIC PROGRAMMING STUDY OF THEORY FORMATION IN SCIENCE* by Bruce G. Buchanan Edward A. Feigenbaum Joshua Lederberg ABSTRACT: The Meta-DENDRAL program is a vehicle for studying problems of theory formation in science. The general strategy of Meta-DENDRAL is to reason from data to plausible general- izations and then to organize the generalizations into a unified theory. Three main subproblems are discussed: (1) explain the experimental data for each individual chemical structure, (2) generalize the results from each structure to all structures, and (3) organize the general- izations into a unified theory. The program is built upon the concepts and programmed routines already available in the Heuristic DENDRAL performance program, but goes beyond the performance program in attempting to formulate the theory which the performance program will use. % This research was supported by the Advanced Research Projects Agency of the Office of the Secretary of Defense under Contract SD-183. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the Advanced Research Projects Agency or the U.S. Government. Reproduced in the USA. Available from the Clearinghouse for Federal Scientific and Technical Information, Springfield, Virginia 22151. Price: Full size copy $3.00; microfiche copy $ .95. A Heuristic Programming Study of Theory Formation in Science by Bruce G. Buchanan Edward a. Feigenbaus Joshua Lederberg I. INTRODUCTION Theory formation in science erabodies many elements of creativity which make it both an interesting and challenging task for artificial intelligence research. One of the goals of the Heuristic DENDRAL project has long been the study of processes underlying theory formation. This paper presents the first steps we are taking to achieve that goal, in a program called Meta-DENDRAL. Because we believe there is value in reproting ideas in their forrmative stages -- in terms of feedback to us and, hopefully, stiaulation of the thinking of others -- we are presenting hers a description of work on Seta-DENDRAL even though not all of the program has been written. Just like the scientists we attempt to model, we often fail to make explicity the thinking steps we g9 through. Therefore, the designs of the unfinished peices of program are described as they will be initially programmed, ani several outstanding problems are mentioned. It is hoped that this discussion will provoke comments and criticisms, for that is also part of its purpose. The Heuristic DENDRAL project has concentrated its efforts on the inductive analysis of empirical data for the formation of explanatory hypotheses. This is the type of inference task that calis for the use of a scientific theory by a performance progran, but not for the formation of that theory. When we startei on Heuristic DENDRAL we did not have the insight, understanding, and daring to tackle ab initio the problem of theory formation. But now we feel the time is ripe for us to turn our attention to the problea of theory formation. Our understanding and dur technical tools have matured along with the Heuristic DENDRAL program to thea point where we now see clear ways to proceed, As always, the proper choice of task environment is crucial, but for us the choice was absolutely clear. Because the Heuristic DENDRAL performance programs uses the theory of a spesialized branch of chemistry, formulating statements of that theory is the task most accessible to us. The theory itself will be briefly introduced in Section II, although it is not expected that reaiers understand it to anderstand the directions of this paper. The goal of the Meta-DENDRAL prograrg is to infer the theory that the performance program (Heuristic DENDRAL) uses to analyze experimental cherajcal data from a mass spectrometer. The follcwing table atteapts to sketch some differences between the programs at the perforaance level and the meta-level. Input Output Exasple Heuristic DENDRAL The analytic data fros a molecule whose struc- ture is not known {except of course in our test cases). A molecular structure inferred from the data. Uses alpha-carbon frag- mentation theory rules in planning and in validation. Meta-DENDRAL A large number of sets of data and the associated (known) molecular structures. A set of cleavage and rearrangement rules con- stituting a subset of tha theory of mass spectrometry. Discovers (and validates) alpha-carbon fragmentation cules in a space of possible patterns of cleavage. JUses set of primitive concepts but does not invent new primitives. In our view, the continuity evident in this table reflects a continuity in the processes of inductive explanation in science. Moves toward meta-levels of scientific inference are moves toward encompassing broader data bases and constructing more gena2ral cules for describing regularities in the data. Beyond this level of Meta~DENDRAL there are still higher levels. Not all theory formation is as simple as the program described hete assumes it is. For exanple, the representation of chemical molecules and the list of basic processes are both fixed for this program, yet these are concepts which a higher level program Should be expected to discover. there is no postulation of new theoretical entities in this program. But, again, higaer levels of theory formation certainly do include this process. The task of theory forrgjation can be and has been discussei out of the context of any particular theory.<4> However, writing a computer program to perform the general task is more difficult than working within the context of one particular scientific discipline. While it is not clear how science proceeds in general, it may be possible to describe in detail how the scientists in one particular discipline perform their work. From there, it is not a large step to designing the computer progran. Thus this paper attacks the general probleas of theory formation by discussing the problems of designing a computer prograa to formulate a theory in a specific branch of science. The general strategy of Meta-DENDRAL is to reason from data to plausible generalizations and then to integrate the generalizations into a unified theory. The input to the Neta-DENDRAL system is a set of structure-data pairs. It receives essentially the same data as a chemist might choose when he attempts to elucidate the processes underlying the behavior of a class of molecules in a mass spectrometer. When chemists turn their attention to a class of chemical compounds whose mass Spectrometric processes (MS processes) are not well understood, they must collect mass spectrometry data for a number of the compounds and look for generalizations. The generalizations have to be tested against new data and against the established theory. If new data provide counterexamples, the generalizations are changed. If the generalizations are not compatible with the old theory either the old theory or the generalizations are changed. This paper is organized by the three main subprobleas around which the program is also organized. fhe tirst is to explain the experimental data of each individual molecular structure. [hat is, determine the processes (cr alternative sets of processes) which account for the experimental data. The second subproblea is to generalize the results from each structure to all structures. In other words, find the common processes and sets of processes which can explain several sets of experimental data. The last is to integrate the generalizations into the existing theory in such a way that the theory is consistent and economical. Within each of the three main sections, the subsections indicate further subprobleas which the programs must solve. II. THE PROBLEM DOMAIN Because this paper discusses theory formation in the context of a particular branch of science, mass spectrometry, the theory of this science will be explained briefly for readers wishing an understanding of the Meta-DENDRAL program at this level. The mass spectrometer is an analytic instrument which bombards molecules of a chemical sample with electrons and records the relative numbers of resulting charged fragments by mass. When molecules are bombarded, they tend to fragment at different locations and fragments tend to rearrange and break apart as determined by the environments around the critical chemical bonds and atoms. The description of these processes is called "mass Spectrometry theory". The output of the instrument, the mass spectrum or fragaent-mass table (FMT)*, is commonly representei as a graph of masses of fragments plotted against their relative abundance. By examining the FMT, an analytic chemist often can determine the molecular structure of the sample uniquely. > oe es ee ee ee ee ee *The term ‘fragment-mass table* is used here in place of the slightly misleading term ‘mass spectrurm', The latter is well entrenched in the literature, but the former is more suggestive of the fern of the data. a ee vee a oe an OD aoe Mass spectrometry theory (MS theory), as used by the DENDRAL programs and many cheamists, is a collection of statements about the fragmentation patterns of various types of molecules upon electron impact. It contains, for example, numerous statements about the likelihood that links (bonds) between chemical atoms will break apart or remain stable, in light of the local environment of the bonds within the graph structure of the molecule. The probability of a fragment splitting off thea molecule is detergined by the configurations of chemical atoms ani bonds in the fragment and in its complement. Further splitting of the fragment is determined in like manner. In addition to rules about fragmentations, the theory also contains rules relating graph features of molecules and fragaents to the probabilities that an atom or group of atoms will migrate from one part of the gctaph to another. Fortunately, mass spectrometry results are reproducible, or nearly so, which means that identical saaples will produce nearly identical FMTs {under the same operating conditions of the same type of instrument). As mentioned earlier, there are alternative levels for expressing this, as any other theory. The model in whose terms the theory is Stated is a "ball and stick" model of Chemistry, in which ‘atom and "bond" are primary terms, and not, for example, an elactron density model. Some of the primitive terms of the program's theory are listed in Appendix A. III. FIRST SUBPROBLEM: EXPLAINING EACH SPECTRUM The so-called "method of hypothesis" in science is sometimes proposed as the essence of scientific work. Pestating it, in 4 deliberately imprecise way, the method is to formulate a hypothesis to account for some of the observed data and nake successively finer adjustments to it as more observations are made. Very little is known about the details of a scientist's intellectual processes as he goes through the method. Thinking of hypotheses, for exaaple, is a aysterious task which eust be elucidated before the saethod can be programmed. That is the task we have designated as the first subproblen. The program starts with individual structure- FAT (fragment-mass table) pairs as separate from one another. It constructs alternative explanations for each FMT and then considars the FMT's all together. An explanation, for the program, as for the cherist, is a plausible account of the AS processes {or mechanisas) which produced the masses in the FMT. The explanation is scmething like a story of the molecule's adventures in the sass spectrometer: certain data pqints appear as a result of cleavage, others appear as a result of more complex processes, At this stage of development of the theory, the chermist's story does not account for every data point because of the complexities of the lnostrument and the vast amount of sissing information about MS theory. A. REPRESENTATION The well-known problea of choosing a representation for the Statesaents of a scientific theory and the objects sentioned by tha theory is ccamon to all sciences. In computer science it is recognized as a crucial problem for the efficient solution (or for any solution) to each problem. Some ways of looking at a problem turn out to be auch less helpful than others, as, for exaaple, considering the mutilated checkerboard problem<5> as simply a problem of covering rectangles (with dominoes) instead of as a parity problem. At this stage there are no computer programs which successfully choose the representation of objects ina problem domain. Therefore we, the designers of the Meta-DENDRAL system, have chosen representations with which we have some experience and for which programmed subroutines have already been written in the Heuristic DENDRAL performance systen. It was natural to use these representations since the meta-~program itself will not only interface with the Heuristic DERDRAL performance program, bat is built up from many of the LISP functions of the performance program, Specifically, for this progrags, the input data are chemical structures paired with their experimental datas structure-1 - FMT-1 structure-n - FMT-n The representation of chemical structures is just the DENDRAL representation used in the Heuristic DENDRAL system. It has b2en described in detail elsewhere : essentially it is a linear string which uniquely encodes the graph structure of the molecule. The FSTs, also, are represented in the same way as for the Heuristic DENDRAL performance system. Each FMT is a list of x-y pairs, where the x-points are aasses of fragments and the y-points are the relative abundances of fragments of those masses, The Predictor program of the Heuristic DENDRAL systea has been extensively revised so that the internal representations of molecular structures and of MS theory statements would be amenable to the kind of analysis and change suggested in this work. As mentigned, Appendix A contains exargples of the teras which are used in statements of the theory. B. SEARCH It is not clear what a scientist does when he "casts about" for a good hypothesis. Intuition, genius, insight, creativity and other faculties have been invoked to explain how a scientist arrives at the hypothesis which he later rejects or comes to believe or aodifies in light of new observations. Prom an information processing point of view it makes sense to view the hypothesis formation problea as a problem of searching a space of possible hypotheses for the most plausible ones. This presupposes a generator of the search space which, admittedly, remains undiscovered for sost scientific probleas. In the Heuristic DENDRAL performance system the "legal move generator" is the DENDRAL algorithm for cqnstructing a complete and irredundant set of molecular models from any specifiei collection of chemical atoms. Heuristic search through this space produces the molecular structures which are plausible explanations of the data. The meta-problea of finding sets of MS processes to explain each set of data is also conceived as a heuristic search problem. Writing a computer prograa which solves a scientific reasoning probles is facilitated by seeing the problem as one of heuristic search. This is as true of the meta-program which reasons from collections of data to generalizations as for the performance systerg which reasons from one set of data to an explanation. For this reason we have called the process of induction "a process of efficient selection from the domain of all possible stractures."%<3> In broad teras, the program contains (1) a generator of the search space, {2) heuristics for pruning the tree, and (3) evaluation criteria for guiding the search. Except for problemas inherent in the task, then, the probleas of such a program are reasonably well understood. These three main components of the heuristic search program are considered one at a time in the immediate discussion. 1. GENERATOR For this part of the Meta-DENDRAL systea, the generator is a procedure for systematically breaking apart chemical solecules to represent all possible MS processes. In addition to single cleavages, the generator aust be capable of producing all possible pairs of cleavages, all possible triples, and so forth. And, for each cleavage or set of cleavages it must be able to reproduce the result of atoms or groups of atoms asigrating from ona fragment to another. For example, after the single break labeled (a) in Figure 1? below, subseguent cleavage (b) may also occur. fhe result of (a) + (b) is the simple fragment CH3. CH3 - C - CH2 - CH2 - CH3 (>) (a) FIGURE 1 Or, for the same solecule, cleavage {c) may be followed by Bigration of one hydrogen atom from the gamma position (marked with an asterisk) to the oxygen, as shown in Figure 2: 0 CH3 ~- C - CH2 - CH2 - CH3 (c) * FIGURE 2 The generator of the search space will postulate these processas as possible explanations of the FMT data points at masses 15 (2H3) and 58 (C3H60) for this particular molecule. But it will also postulate the simple cleavage (b) in Figure 1 as the explanation of the peak at mass 15. And for the peak at mass 58 from the process in Figure 2 it will postulate the alternative migration of a hydrogen atom froa the beta position (adjacent to the asterisk). From the generator's point of view these processes are at least as good as the more or less accurate processes shown in Figures 1 and 2. Cherists also appeal to the localization of the pasitive charge in the charged molecule to explain why one peak appears in a set of data but another does not. Since it is known that only the charged fragments are recorded by the mass spectrometer, the generator program must also manipulate charges to account for the data. The primitive mechanisms of the generator are charge localization, cleavage, and group migration (where a group can be a positive charge, a single atom, or a set of connected atoms). Tha generator is a procedure for producing all possible charged fragments, not just all possible fragments, in other words. Putting these mechanisms together in all possible ways leads to an extremely large space of possible explanations for the peaks in the experimental FST of a molecule. The pruning heuristics discussed in the next section alleviate that problem. Briefly, let us turn to the actual design of the generator. At the first level of branching in the tree all possible single cleavages are perforaed on the original molecular structure resulting in all pessible primary fragments. At the next level, the positive charge is assigned to all possible atoms in the fragments. (Switching these two steps gives the same results and is closer to the conceptualization used by the chemist; it results in a less efficient program, however.) Starting with level 3, the procedure for generating successive levels is recursive: For each charged fragment at level n (n > 2) produce the charged fragments resulting from (i) cleavage of each bond in the fragment and (ii) migration of each group from its origin to each other atom in the fragment, where ‘group’ currently aeans 'positive charge 9dr hydrogen ator’. 2. PRUNING HEURISTICS Three simple pruning technigues are currently used by the prograg. (1) Since the result of breaking a pair of bonds (or n bonds) is independent of the order in which the bonds are broken, allow only one occurrence of each bond set; (2) Since MS processes tend to follow favorable pathways, prune any branch in the tree which is no longer favorable, as evidenced by failure of a fragment"s mass to appear in the experimental FMT; (3) Limit the number of allowable group migrations after each cleavage. The first pruning technique hardly needs explaining: duplications of nodes in the search space are unnecessary in this case and can be avoided by removing a bond from consideration after all possible results of breaking it have been explored. The seconi technique carries an element of risk, because mass spectrometry theory includes no guarantee that every fragment in a decomposition pathway will produce a peak in the experimental PMT. In fact, the pruning can only be done after a complete cycle of cleavage plus migration because these processes occur together in the mass spectrometer -- without the appearance of the intermediate fragments. The third technique also is truly heuristic since there are no theoretical reasons why group migrations might not occur in complex and exotic patterns between cleavages. The bias of mass spectroscopists toward simpl2 mechanisms, however, leads us to believe that they would place little faith in exotic mechanisms as explanations of peaks in the data, at least not without other corroborating evidence, 3. EVALUATION Evaluation of alternative paths in the search tree is necassary, either during generation or after it is completed, in order to distinguish the highly attractive explanatory mechanisms from those which are merely possible. However, without building in the biases of experts toward their current theory it is difficult to evaluate pechanisas at ali. The prograg’s evaluation routine presently contains only one a priori principle, a forre of Occar's razor. In an attempt to measure the simplicity of the statements describing sechanisas, the programs counts the number of primitive mechanisms necessary to explain a peak. Thus when there are alternative explanations of the same data paint, the program chooses the siaplest one, that is, the one with the fewest steps. Simple cleavage is preferred to cleavage plus migration plus cleavage, for exanple. The result of the generation process as described so far, with peuning and evaluation, is a set of candidate 4S processes for each structure which provides alternative explanations for data points in the associated mass table (FANT). For instance, the progras breaks the asaolecular structure shown in Figure 3 at individual bonds or pairs of bonds to give the following information (atoms in the structure are nuabered from left to Cight): MASS EXPLAINED PROCESS 103 Breakbond: C2-C1 or Breakbond: C6-C7 89 Breakbond: S3-C2 - 7 - or Breakbond: C5-C6 75 Breakbond: C4-c5 61 Breakbond: S3-C4 60 Breakbond: C4-C5 & C2-C1 57 Breakbond: C4-S3 46 Breakbond: S3-C4 & C2-C1 43 Breakbond: C5-C4 42 Breakbond: C6-C7 & C4-S3 29 Breakbond: C2-S3 28 Breakbond: C5-C6 & C4-S3 1 2 3 4 5 6 7 CH3 - CH2 - SH - CH2 - CH2 - CH2 - CH3 FIGURE 3 In this example, the programa used no migrations or charge localization information, for purposes of simplicity. The projram explcred all simple cleavages and found peaks corresponding to every resulting Fragment but two.* For each of the successful fragments, the program broke each of the remaining bonds. Fron all the secondary breaks considered, the resulting fragments corresponded to only four additional peaks in the FMT. So thase four branches of the search tree were each expandad by on2 mor? Simple cleavage. None of the tertiary fragments were found in the FMT so the program teruinated. a we a ee ee ee ey od ee nm OE oe ee ee a *The CH3 fragment was produced twice but peaks of low masses were not recorded in the FAST. a ee ee ee tee ee ee es ee ee ee The output of this phase of the prograrm is a set of molecule-process pairs. Por the one example shown in Figure 3, thirteen such pairs would be included in the output: the molecule shown there paired with each of the thirteen processes. IV. SECOND SUBPROBLEM: GENERALIZING TO ALL STRUCTURES The method of hypothesis, mentioned earlier as a vague description of scientific work, suggests that a plausible hypothesis can be successively modified in light of new experience to bring a scientist closer and closer to satisfactory explanations >of data. Apart from the problera of formulating a starting hypothesis discussed above and the problem of terminating the procedure, it is not at all clear how the adjustments are to be sade nor how to select the new experiences so as to make the procedure relatively efficient, or at least workable. These are well-known problems in the methodology of science. In other terms, the problem of successive modifications can be viewed as a problem of generalizing a hypothesis from one set of observations to a larger set. The task for the second main part of the Meta-DENDRAL system is to construct a consistent and simple set of Situation-action (S-A) rules out of the numerous instances of rules generated by the first phase of the Systema. It is necessary for this program to determine (a) when two instances {molecule-process pairs) are instances of the same general S-A rule and (b) the form of the general rule. In other terms, the program is given a set of input/output (1/0) pairs, with respect to the AS theory in a "black box", The task of the program is to constract a model of what is inside the black box. Thus it needs methods for (a) determining when two outputs (processes) are of the same class and (b) constructing an input/output transformation rule which accounts for the inputs (molecules) as well as outputs. For each molecule there will be several associated processes, as seen from the example from Section III (Figure 3). So tha samo molecule will appear in several T/O pairs. Moreover, since the molecules are chosen for the test because they are known to exhibit similar 5s behavior, there will be a number of instances of each general MS rule. If the program is successful, the resulting set of explanations will be a unified description of tha MS behavior of all the molecules in the class. In operational terms, this means, at least, that the final set of explanations will be smaller than the union of instances. The program itself has not been coapleted. It is hoped that this sketch shows enough detail that it will be instructive ani provocative. Yet we do not wish to emphasize unfinished pieces of programs. As in Section III, the issues of representation and search are discussed separately in this section. A. REPRESENTATION The general form of the cules the prograrm is to infer has been fixed as S-A rules, as mentioned above. Sut representing the instances from which to infer the rules presents other difficulties. It has been difficult to decide how to represent the instances in such a way that they can be compared and unified, without building in concepts which would beg the theory-formation question. For example, representing the chemical graphs by feature vectors is attractive because it is easy to give the programa just the right information for efficient comparisons. But this is the danger, too, for omitting "superfluous" inforeation gives the program auch too great a head start on the problem. It might discover what we believe are in the data -- the old principles-- but it would never discover anything new. The difficulty with the representation of the instances, i.e., the molecule-process pairs in the input stream, is that the numbering - 21 - of atoms in the molecules, and the corresponding numberings in the function arguments of the processes, do not allow simple comparisions. However, by comparing rules two at a time it is possible to determine mappings between the atoms, and the function arguments, so that the program can make comparisons. This is described below, as part of the scheme for generalizing rules. B. SBARCH The program has been designed to generalize on situations which exhibit the same processes. If situations 51 and M2 both exhibit process P, far example, the program atteapts to construct a rule (S --> P) where S captures the comgon features of M1 and 82. This procedure requires that the programs knows enough about the syntax of the process language that it can recognize the "same" process in different contexts. Also, this procedure requires that the program can find comzon features of situations which Satisfy sone criteria of anon-triviality. AS in any learning problema there will need to be many readjustments of the learned generalizations as new data are considered. In this case, the addition of each nev molecule-process pair brings the potential for revising any S-A rule in the emerging MS theory. Since each aolecule initially considered may be associated with a dozen or nore process2s, and - 22 - the emerging theory may contain many dozens of S-A rales, the generalization process will be lengthy. All of the molecule-process pairs, which are instances of the rules the program is supposed to find, are compared among themselves. The result of this comparison is a set of generalized descriptions which account for the input data. This resulting set is then organized hierarchically to form the program's MS theory by the process described in Section ¥. The comparison of the instances is conducted pairwise. The first molecule-process pair is postulated as a situation-astion rule, R. A new molecule-process rule, N, (the next one) is than cosparei with R in the following way. (1) The MS processes, or actions, of N and R are compared at a gross level. (2) If this comparison holds, the graph structures (situations) of N and R are compared to find common subgraphs. If there are no common subgraphs, W is compared with the next rule, or, if no more rules, N is postulated aS a new rule. (3) Otherwise, the common subgraph, S, is expanded to S*' to capture alternative allowable atoms beyond the common subgraph as indicated by the situations of N and R. (4) Pinally, the graph of & is replaced by S*. These four steps will be illustrated and briefly described below. Consider the rule 1 2 3 4 5 6 -23- (RB): CH3 ~- CH2 - NH - CH2 - CH2 - CH3 --> Breakbond(4 5) and the new solecale-process pair 1 2 3 4 5 6 (N) : CH3 - NH - CH2 - CH2 - CH2 - CH3 --> Breakbond(3 4). (1) Compare the processes of R and N (the right-hand sides of R and N), disregarding the arguments of functions. Comparison of just the names of the processes shows that both R and W follow the same syntactic rules, and thus deserve closer comparison. This is made passible by the generator of processes describei in Section Iii, which names processes and sets of processes uniquely. Hai the form of the processes been different, N would be compared with the next rule (if any). (2) Ccmpare the graph structures in N and R, ignoring hydrogen atoms (H) for the moment. Using the clue that the atoms involved in the processes of both N and R are important, the program looks for a way of matching these atoms. Then the "interesting" subgraphs in both N and R are expanded, starting with the impgrtant atoms and building the greatest subgraph, S, which is common to both N and BR. The criteria of "interesting" subgraphs and for “greatest™ common subgraph are heuristic and are specific to chemistry. Since the nitrogen atom, N, and the adjacent right-hand carbon - Py - atom, C, are both involved in the Breakbond process for both rales (N) and (R), these are recognized as important atoms. Thus the subgraph common to (EB) and (N) must contain these noles. Using the nuabering of the graph of (RB), nodes 2-6 are found to he coanop to both graphs. This is an "interesting" subgraph because, for example, it contains at least one non-carbon atom and contains more than two nodes. Moreover, it is the greatest subgraph common to the two. Without H's, this subgraph, S, is: (S): C-N-C-C- Cc (3) Expand the subgraph (S). Now, reconsider the hydrogen atoms ignored in step (2). Nodes 2 and 6 in S fail to match exactly on the number of hydrogens, but the rest do match. Both 2 and 6 are connected to at least two hydrogens, but in each case, the last connection may be to either an H or a C. This is reflected in the expanded subgraph {S*): (C,H) - CH2 - NH - CH2 - CH2 - CH2 - (C,H) The parentheses indicate alternative choices for the atom linked by the adjacent bond. The program now extends subgraphs only one atom beyond the - 25 - greatest common subgraph (in each direction), but this clearly should be a parameter which the system can set. (4) Replace the graph of R with S'. The result of comparing N with R, then, has been to change the conditions under which the process of R has been observed to apply. The old rule & is replaced by a revised rule, R', in which the situation is modified, but the action remains the same: 1 2 3 4 5 6 7 (R'): 9 (C,H) - CH2 - WH - CH2 - CH2 - CH2 - (C,H) --> Breakbond {4 5) The result of this whole process is a set of S-A rules which can account for the observed data. This part of the projram cautiously tries not to generalize beyond the observed situations. So it may piss some sweeping generalizations ("brilliant insights") which explain several of these cautious rules. But its result will net, at least, contain n "rules" to explain oa observations, unless the input data are wildly discrepant. ¥. THIRD SUBPROBLEM: ORGANIZING NEW RULES AND INTEGRATING THE INTO THE EXISTING THEORY - 26 - The scientist's probleam does not necessarily end with the Satisfactory formulation of general statements explaining all the observed data. If he is working in a discipline for which there is no existing theory, he will still want to organiza the statements. But it is rare to be out of any theoretical context. Typically, the hypotheses are forrmulated as extensions of soae existing theory. Thus, the Heta-DENDRAL program must be prepared to merge new MS rules into the theory previously constructed by the program (or by a chemist). However, as a test exercise we want to see whether the meta-program builds approximately the same MS theory aS the performance program now contains. One cf the reasons we have rewritten the DENDRAL systea's mass table predictor was to separate the MS theory from the LISP functions it drives. Making changes to the theory, then, does not require reprogramming, in the usual sense, Conseguently, writing a prograa which updates the theory no longer seeas to be an insurmountable task. The problems of organizing a set of new rules or integrating new rules into the old theory are independent of the source of those Cules. In order to study these problems we have written 4a prograa which (a) accepts new rules from human chemists and (b) updates the theory table of the program. The program for doing (a), called the dialog program, is not central to this paper, thus this - 27 - section will focus on the work to accomplish (b), organizing and updating the theory. In short, the program organizes the new rules either into a fresh theory or into an old theory (depending on the test) in the sare way. The rule table is organized hierarchically according to the Situations in the rules. Because the situations are graph structures, determining situation levels is just determining whether one graph is contained within another. For example, the gtaph -NH2 is contained in the graph -CH2-NH2 , so th2 former is a higher-level situation in the rule table. If neither Situation is a subgraph of the other and they are not identical, they are put at the same level in the rule table. A. REPRESENTATION The performance program's MS theory is represented as a table of Situation-action rules (S-A rules), patterned after Waterman's table of heuristics for good poker play.<6> Situations are predicate functions which evaluate to ‘true! or 'false' in a specific context. For simplicity, only two predicate functions are allowed as situations at this time (in addition to 'I') -- although a wide range of arguments may be supplied. Also, only one simple predicate function at a time can serve as a situation; Boolean expressions of predicates are not allowed. The first Simplifying restriction will be easy to loosen as new predicate - 28 < functions are discovered which will be useful. Limiting a Situation to a single predicate, however, is an important way of limiting the difficulties encountered in revising the projraa's M5 theory or analyzing it. Actions are sequences of primitive MS processes constitating rewrite rules for transforming one structural fragment into another. In this system, an action place can also be filled by another S-A rule, allowing nesting 9f rules in a manner quite natural to the current textbook descriptions of NS theory. The structure of the rule table in the program, which constitutes the program's MS theory, can be expressed in Backus normal fore: z= ((T ... ) 222 t= ( | ( ... ) * 2:3 (ISIT ) | ({CHECKFOR ) | T ** 2: ( ) | (PROG {) ... ) OO * The function ISIT deteraines whether the subgraph named in its - 29 - argument place is contained in the chemical graph under consideration. The function CHECKFOR checks to see whether the current value of the named variable is equal to the value specified. This predicate allows checking global context before determining answers to specific questions about subgraph matching. ** The basic actions {function names) known to the systes are listed in Appendigz A. Any action which is bailt out of several basic functions can be given its own name. In fact, the MS theory in the present version of the performance program contains many named complex actions. oe ee ee ee ee on ae ee ie ee 0 oe oe The performance program is driven by the MS theory in the rule table by the following procedure. The program picks up the S-A rule immediately following the default action and checks to see if the current context satisfies the situation by executing the named predicate function (with appropriate arguments). If it does, the program performs the associated action by executing the named (or described) function (with appropriate arguments). The very first situation, 'T*, is certain to be satisfied (since ‘T' evaluates to 'true'), so the default action will be executed if none of the other situations are satisfied. A simple illustration will make the structure of the rule table - 30 - clear. Suppose it contains rules for two distinct situations: ethers and alcohols, plus a subrule for a special class of ethers, named ethert. The table would look like (T default (alcohol-situation alcohol-action) (ether-situation ether-action (ether1-situation ethertl-action) ) If a compound satisfies the ether! situation, neither the default actiop nor the ether action will be executed. All the processes for each situation are collected in the corresponding action. This may cause duplication if some of the processes in a rule also apply to the subrules. But smodification of the rule table is made easier because of this unification. B. ORGANIZATION AND INTEGRATION The output from the generalization program discussed in Section IV is a set of S~A rules (with accompanying definitions of the situations and actions). The set of new S-A rules is organized withcut reference to any existing theory or integrated into an existing theory by exactly the same process. Each S-A rule is considered in turn. It is postulated as a new S-A rule at the top level of the rule table if its situation does not appear elsewhere in the rule table. If a new situation, S1, subsuges a situation, S2, already in the rule table (i.e., S1 is more general than 52, - 31 - or S1 is contained in $2), then the new rule is inserted in the rule table so that the old rule, with $2, is below the new one. Or, the reverse may be the case, namely, that the new situation (S1) is subsumed by a situation (S2) already defined. Then the new cule must be inserted below the old one in the hierarchy. These three cases all depend only upon the program's ability t9 determine when one graph is contained within another. They are briefly illustrated below. (1) If the situation does not appear elsewhere in th? rule table, the new S-A rule is merely added to the top level of the rule table. Por example, adding an amine rule to the sample rule table above would result in (T default (alcohol-situation alcohol-~action) {ether-situation ether-action {ether 1-situation ethert-action)) {amine-situation amine-action) ) (2 & 3) If the situation of the new S-A rule subsumes a previously defined situation, the old S-A rule becomes a sub-rule of the new rule. If the situation of the new rule can be subsumed under ain existing one, the new rule becomes a sub-rule of the old one. These two cases are both illustrated by the following example. Suppose the program adds a rule (ether2-situation ether2-action) to the rule table above, where ether2-situation is an instance of ether but more general than etheri-situation. This would result in (T default (alcohol-situation alcohol-action) (ether-sit uation ether-action (ether2-situation ether2-action (ethert-situation ethert-action))) (amine~situation armine~-action) ) After deciding where the rule must be inserted, the program adis the definitions of the new situation names and action names to the systen. AS this part of the program becomes amore sophisticated it will have to (a) check the rules to be sure there are instances which actually distinguish them, (b) look for less cautious ways of generalizing, and (c) associate a measure of confidence with each Cule so that it can resolve conflicts between rules. VI. CONCLOSION The Meta-DENDRAL program described here is a vehicle for studying problems of theory formation in science. It is built upon the - 33 - concepts and programmed routines already available in the Heuristic DENDRAL performance programs, which uses a scientific theory to explain analytical data in organic chemistry. [he MNeta-DENDRAL system goes beyond the performance program, howevar, in atteapting to formulate the theory which the performance program will use. The Meta~DENDRAL program works much like a chemist who is extending his theory of mass spectrometry by looking at collections of experimental results. The data, for both the chemist and program, are the results of mass spectrometry experiments (called FMTs here) and the associated rolecular structures. By selecting some "typical" examples, first-order general hypotheses about the whole collection of data can be proposed. Then, by subsequent adjustments, the generalizations are modified to explain all the data. The new rules are then integrated into the existing corpus of theoretical statements in ways dictated by considerations of simplicity and personal preference, The version of the meta-program which is described here suggests that the design is workable. But it accentuates the arbitrariness of our design decisions and raises the questions of what alternative designs would look like and how good they would be. It also raises a number of issues important to understanding scientific methodology in general. The design question is certainly one such issue. Others are questions concerning the criteria of acceptable generalizations, criteria of good scientific theories, and criteria for deciding on a set of primitive concepts for a theory. None of these general issues will be resolved satisfactorily in the context of this prograr, Yet none can be resolved for this program without saying something about the general solutions. - 35 - APPENDIX A. PRIMITIVE CONCEPTS OF MASS SPECTROMETRY KNOWN TO THE DENDRAL PROGRAMS This list is taken from an outline given to chemists who define hew massS spectrometry rules for the system. The functions at the front of the list are most primitive, those at the end are more complex, and in fact are built out of the simpler ones. To the chemist this list serves as a reminder of the names and associated syntax of the “building blocks" available to him for defining new rules. To the present reader it is meant to illustrate the concepts already programmed into the system. FUNCTION (Function Arguments) * DESCRIPTION ee ee OH ae A a ee OO ee ee SD Oe See cee SO ae AO ee ate ee ce ee me ee ee ee ee ee ee eo we ewe ee HOUSEKEEPING FUNCTIONS: ADDCHARGE {atm) AsSign a positive charge to atm. ADDDOT (atm) Assign a free electron to atm. IONIZE (ata) Assign a dot and a charge to atn. PAIRELECTRONS (List; nolist) Look among the atoms of LIST for adjaceat atoms with free electrons. Pair up the electrons to make an explicit bond unless the pair is named in NOLIST. REMOVECHARGE (ata) Take away the positive charge from ata. REAOVEDOT (ata) Remove the dot (if present) from ata FUNCTIONS FOR MANIPULATING STROCTURE WITHOUT HOUSEKEEPING: ADDH (ats) Put a hydrogen on ath. CHANGEBOND (atai;ata2;n) Add n (pos. of neg.) to the order of tha atmi-ata2 bond. JOINATON (Oldatm; ata; bond; atoat ype; nodenun) REMOVEBOND (ataljata2) REMOVEH {atm) STRUCTURAL MSANIPOLATION FUNCTIONS BREAKBOND (atmal;atn2) BREAKRING (ataut;ata2) ELIMINATES (atm) LOSEALPHARAD {atm) LOSENEXTRAD (ata) MAKERING (atalzata2; bond) MIGRATEH {atals;ate2) NCLEAVAGE (n, pct) NEWBOND (atal;atm2) ee ae ee ee oe oe ee * The arbitrary names given to f the appropriate kinds of argument Bring atm into the structure -- attach ata to oldata with bond ordec BOND. Give ata the atom type and node number specified. Remove the bond between atm! and ata2. Take a hydrogen off ata. WITH HOUSEKEEPING: Replace the atmi-ata2 bond with a pair of electrons, Try to pair any other free elactron With one of the new free electrons, Do the same as BREAKBOND when it is certain that the atmt-ata2 bond is in a ring. Eliminate a hydrogen from atm, leaving a free electron. Lose the largest radical alpha to ats. Lose the largest radical adjacent to ata. Join atnl & atm2 with bond to form a ring. Move a hydrogen from atal to ata2, leaving a free electron on atal (unless atai = ANYATOM, in which case the H comes from nowhere) Break the nth bonds away fron the heteroatoras in the molecule and assign intensity=pct oldint/too. If n is 0 or (quote adjacent), the adjacent bonds are broken, 1=(quote alpha), 2=(qguote beta), 3=(quote gamma). Replace adjacent free electrons on atal & atm2 with an explicit bond. unction arguments hare are meant to sugjest s for these functions. For example, ‘ata' will be replaced by the name of a specific chemical atom in th2 context of the actual progras. - 37 - REPERENCES (1) B.G. Buchanan and J.Lederberg, “The Heuristic DENDRAL Prograa for Explaining Empirical Data". In Proceedings of the IPIP-71 Congress {in press). {Also Stanford Artificial Intelligence Project Memo No. 1&1, April, 1971). (2) C. W. Churchman and B. G. Buchanan, "On the Dasign of Inductive Systems: Some Philosophical Problems". British Journal for the Philosophy of Science, 20 (1969), pp. 311-323. (3) J. Lederberg, G. L. Sutherland, 3B. G. Buchanan, and E. A. Peigenbaum, “A Heuristic Program for Solving a Scientific Inference Problem: Summary of Motivation and Implementation", [n Theoretical Approaches to Non-Numerical Problesa Solving (R. Banerji and &.D. MeSarovic, eds.) Springer-Verlag, New York (1970). (Also Stanford Artificial Intelligence Projact Memo No. 104, November 1969). (4) Peter B. Medewar, Induction and Intuition in Scientific Thought. American Philosophical Society, Philadelphia (1969). (5) A. Newell, "Limitations of the Current Stock of Ideas about Problem Solving". In Electronic Information Handling (A.Kent and - 38 - O.E. Taulbee, eds.) Spartan (1965). (6) D. A. Waterman, “Generalization Learning Techniges for Automating the Learning of Heuristics". Artificial Intelligence, 12121 (1970). ~ 39 - ary