II. PARADIGMS AND CONTROVERSIES IN MACHINE LEARNING


We begin our study with a review of the ideas that have become important paradigms, as well as debated controversies, in the various attempts to codify machine learning. To gain a clear understanding, we review the literature that has most strongly influenced the fields of Self-Organizing Systems, Artificial Intelligence, Learning Automata and, to a lesser extent, Psychology.

II.1. Preprogrammed Intelligent Behavior vs. Learning.

In the beginning of cybernetics, the idea of self-organization became an important goal to achieve; we encounter in the literature a fair number of articles devoted to this paradigm. An initial point of disagreement was that of defining the extent to which self-organization was possible. On the theoretical side of cybernetics, Ashby (4) represents most clearly the search for the basic principles of self-organization, always looking to demonstrate the possible existence of intelligent machinery. By 1952 he had built the homeostat, a device able to achieve complete self-stability.

One of his colleagues, H. von Foerster, started the BCL (Biological Computer Laboratory), dedicated to continuing this line of research. BCL was eventually closed in the mid-1970s, after having published a great deal of material on the subject. Another important author, A. Turing (61), in an unpublished paper clearly discussed two basic approaches to machine intelligence. On the one hand, "disciplined machines", as he called them, are those which are programmed to achieve specific tasks and by their mechanistic nature are very practical: this is what we shall call preprogrammed intelligence. There is no provision for internal modifications; the programmer is the one who must change the internal code in order to make any modification. Yet, as long as the device is confronted with the same environmental conditions, its behavior can be optimal.

The second approach suggested is a more difficult one: to provide the machine with a basic structure only and to let it achieve stability by means of continuous interactions with the environment, what the cyberneticians called self-organization. While such an approach, implemented on a sequential machine, precludes the achievement of fast and optimal results, it is intended to reflect and emulate the way in which human beings behave. As computers became faster and more powerful, these two basic approaches were required to pass a test of practicality, as well as the verification of some of their theoretical foundations.

In the sixties, preprogrammed intelligence obtained considerable momentum due to its successful results; on the other hand, self-organization declined because it achieved only moderately successful results, disillusioning many of its followers. Only Ashby and his associates continued with their work, developing what we call conceptual cybernetics. By then a new term had been coined for the practical study of machine intelligence: Artificial Intelligence, the field dedicated to the computer-oriented study of intelligence. In 1963, a book edited by Feigenbaum and Feldman (17) presented a rich conglomerate of articles showing research being done with both approaches. This was the last time full cooperation existed between the two, and the honor of achieving a clear balance between the preprogrammed and learning paradigms belongs to Samuel (51, 52). His checkers-playing project, begun before 1957 and finished by 1967, presents an adequate tree structure for searching the state space. It also has the ability to take advice from a teacher without any recoding. Later work showed little effort to incorporate learning in AI models.

The reason for this abandonment of learning in the AI models that followed can be explained in part by the lack of understanding of the process of learning, even in the field of psychology (11). Marvin Minsky (35), one of the leaders in AI, has been one of the strongest critics of machine learning, on the basis that operant conditioning cannot be simulated by mere modification and parametrization of some assumed factors. In 1967 he suggested that before attempting to implement learning (34), one should devote one's efforts to the aspects of fact representation. His group at MIT, as well as his followers at the Stanford Research Institute and Stanford University, have since concentrated on the paradigm of representation, using data management schemes such as list processing and graph theory, and interpretive languages such as LISP, PLANNER, CONNIVER, etc.

The controversy over machine learning was also recognized by Ernst and Newell (16), pioneers of information processing and the simulation of human thought; their point of view is not as critical as that of Minsky. In their quest for generality, they focus on three aspects: internal representation, external representation, and problem-solving techniques. They state that the reason for not considering generality through learning arises from a tactical decision, implying that it may be better first to conquer just the mechanistic paradigms, rather than to try to deal concurrently with the undefinable concept of learning.

It was also Simon (41) who initially suggested the concept of heuristic search in game playing and problem solving. A heurism can be defined as a rule of thumb, providing a guideline, based on experience, in a complicated decision analysis that does not have an achievable algebraic solution. An important argument in favor of simulating human thought is that by using heurisms, the computer does things the way humans do; yet, one must be careful not to make this the only paradigm for human reasoning. Only a fraction of human thought processes use heurisms; furthermore, in order to take the advice of these rules of thumb, humans must have structured their brains by some trial-and-error process long before, as stated by Minsky (34).

An English psychologist, D. Michie (32), began his studies on machine intelligence by developing a program that learned to play tic-tac-toe through a rudimentary trial-and-error process. Although the game is simple, the results showed the possibility of such an approach. We think that the results of the project should have been considered more deeply, since they showed the feasibility of productive self-organization. However, as the leaders of AI turned to data representation and heuristic searches, much of the practical side of adaptive learning was buried for at least a decade or so.

A related field of research, initially a part of AI, is Pattern Recognition (PR). Various approaches have been suggested to discriminate patterns. The approach followed by Fu (19) considers the use of complex statistical models to form related clusters and classify patterns accordingly. A second approach is to use algorithms aided with heurisms. An approach suggested by Rosenblatt is to learn by trial and error. Research along these lines has been strongly favored by Uhr and his associates (62). It may be fair to say that, due to the research on PR, the paradigm of learning still exists in the vocabulary of current AI scientists; but in any case, the emphasis of AI research still lies mainly in data management, computer linguistics, and game theory.

We can certainly state that preprogrammed intelligence, despite its critics (14), has achieved important goals and in many respects has helped in the advancement of user-computer interfaces; important developments in the areas of language understanding (24), problem solving and robot development owe much to MIT's AI Laboratory.

But the time has come to turn back to learning, as suggested by some rediscoveries from Winston (69), of the very same group of scientists at MIT. We purposely use the word "rediscovery" because we feel that he is trying to achieve a new balance between preprogrammed intelligence and learning. His "learning by being told" still requires an extensive prestructure that must be obtained by means of some kind of operant conditioning.

Three more recent works on learning ought to be mentioned here. The first one is by Korn (27), who simulated a learning environment where fictional characters can move around and learn to stay close or apart, depending on their emotional feelings (initially set by the program). In his program, Korn applies data management techniques to attempt this new balance between the opposite viewpoints on machine intelligence. The second is the work of Bock and his students (8), which to a certain extent reopened the line of research initiated by Michie. Since this work is based on automata models, we shall discuss it later in the chapter. Thirdly, the ideas of J. S. Albus (1) on learning through hierarchies have influenced the present author and will be discussed in a separate section of this chapter.

The final author we consider in this field is the mathematician R. Banerji (5). In his conception of AI he agrees with the use of heurisms and acknowledges the fact that some of the problems are rather unstructured and very ill-defined; he criticizes those who reject any attempt to represent the problem in mathematical terms and rely solely on programming skills. We can but agree with him, since we feel that the models ought to be expressed in the most formal way possible, although we accept the fact that the use of simulation and statistical analysis is necessary in order to obtain non-trivial results.

With the previous review in mind, the purpose of this dissertation is restated: we are interested in pursuing research on learning and adaptation. We feel that such a cause is worth the effort, since it will help in clarifying the computer implementation of operant conditioning. Even though we are aware of the fact that short-term results are modest in nature, we believe that the model to be presented can prove helpful by providing a better framework for large-scale productive learning systems. For the sake of formality, we have turned to the field of Learning Automata (LA) in order to find an adequate nomenclature for the model. We shall devote the next section to reviewing the authors in this field, and in the next chapter we shall discuss the mathematics of LA.


II.2. Learning Automata.

In psychology, almost concurrently with the development of cybernetics and self-organizing systems, W. K. Estes, influenced by Skinner, developed early models that describe operant conditioning of rats in T-mazes. A related set of stochastic models was published by Bush & Mosteller (12) in a text considered a stepping stone for future research. More recently, we find in psychology a very rigorous work on learning by Norman (44) using Markov processes. The work of Bush & Mosteller, who began to implement learning using stochastic automata, stimulated research on automata theories. The subject of stochastic automata was initially popularized by Ashby (3). More recently, Paz (46) dedicated a complete text to the subject.

Tsetlin (57) is considered the first mathematician to use the concepts of automata for learning. With his studies he opened a complete line of research in the Soviet Union, using both deterministic and stochastic automata. We shall review his most important collaborators in the next chapter. Let us for the moment mention Varshavsky & Vorontsova (65), who first proposed compensation schemes to update stochastic matrices in learning. To achieve learning in automata theory, it is necessary to provide the automaton with a state for each input and for each possible output; such a matrix is initialized with equal probabilities for all outputs for a given input. Then, for each input the system selects an output from the matrix and submits it to the environment for evaluation. The environment compares the output delivered with a true or required value, providing the automaton with an overall composite error, or performance metric (e.g., a grade). The automaton then uses this value to update the probabilities of the matrix accordingly. It is proposed that such feedback eventually causes the automaton to select the correct answer. Such a feedback loop is known in the literature by various names, the most common being reward and punishment (37). In related literature, it is called the algedonic loop; see Beer (7). The word comes from the Greek "algos": pain; "hedos": pleasure. From now on, we shall use the word "algedonic" to denote this feedback process.
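
To make this loop concrete, consider the following minimal sketch, written in Python purely for exposition. The class layout, the learning rate, and the particular reward/punishment rule are our own illustrative assumptions, not the exact scheme of Varshavsky & Vorontsova (65); the sketch is meant only to show the probability matrix, the output selection, and the algedonic compensation working together.

```python
import random

# A minimal sketch of the algedonic loop described above; the update
# rule and learning rate are illustrative assumptions, not the scheme
# of any cited author.

class StochasticAutomaton:
    def __init__(self, n_inputs, n_outputs, rate=0.1):
        # One row of output probabilities per input, initialized equal.
        self.p = [[1.0 / n_outputs] * n_outputs for _ in range(n_inputs)]
        self.rate = rate

    def respond(self, x):
        # Select an output for input x according to the current row.
        r, acc = random.random(), 0.0
        for y, prob in enumerate(self.p[x]):
            acc += prob
            if r <= acc:
                return y
        return len(self.p[x]) - 1

    def algedonic(self, x, y, rewarded):
        # Reward: shift probability mass toward output y.
        # Punishment: shift mass away from y and renormalize the row.
        row = self.p[x]
        if rewarded:
            for k in range(len(row)):
                row[k] *= 1.0 - self.rate
            row[y] += self.rate
        else:
            row[y] *= 1.0 - self.rate
            total = sum(row)
            self.p[x] = [v / total for v in row]
```

Against an environment that consistently rewards one output per input, repeated calls to respond and algedonic concentrate each row's probability mass on the rewarded output, which is precisely the convergence behavior the cited authors seek to prove.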

From what we have just described, it may be argued that the existence of both the stochastic matrix and the algedonic loop violates the concept of complete self-organization, since the designer must provide the system with such a structure. However, we are not interested in constructing something from nothing; on the contrary, we let some initial structure evolve through experience until it achieves a stable and, hopefully, optimal ultimate structure. Learning Automata research is then dedicated to the study of learning by means of automata models. Before going any further, let us give a precise definition of the learning we are interested in, as suggested by Narendra & Thathachar (37):

"Learning is defined as any relatively permanent change in behavior resulting from past experience and a learning system is characterized by its ability to improve its behavior with time, in some sense tending towards an ultimate goal".

Note that the definition does not exclude the existence of an initial structure; on the contrary, it presupposes it. Yet, it assumes that the system is able to change its pattern of behavior if the environment so requires. Preprogrammed intelligence precludes such a change in pattern. It relies solely on its initial (albeit complex) structure.

Various compensation schemes to achieve this kind of learning have been proposed. Strictly speaking, one of the goals of LA research is to prove that an algedonic scheme will converge to a given goal. Convergence theorems have been presented by Chandrasekaran (13) and Viswanathan (66) to test the adequacy of various proposed schemes. Using a standardized notation, we review some of them in the next chapter. As the models get more complex, the mathematics becomes very involved and eventually intractable, precluding the achievement of nontrivial analytical solutions. Therefore, even at the cost of its mathematical elegance, LA must rely on simulation to test the efficacy of these schemes.
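
As a hedged illustration of how such a scheme is tested by simulation, the sketch below runs a two-action automaton with a linear reward-penalty style update against a stationary random environment. The payoff probabilities, learning rate, and trial count are arbitrary values chosen for the example, not figures taken from the cited authors.

```python
import random

# Simulation test of a linear reward-penalty style scheme: action 0 is
# rewarded with probability 0.8, action 1 with probability 0.4. All
# numerical values are illustrative assumptions.

def simulate(trials=5000, rate=0.05, seed=1):
    random.seed(seed)
    p = [0.5, 0.5]              # action probabilities, initially equal
    payoff = [0.8, 0.4]         # environment rewards action i w.p. payoff[i]
    for _ in range(trials):
        y = 0 if random.random() < p[0] else 1
        if random.random() < payoff[y]:
            p = [(1.0 - rate) * q for q in p]   # reward: favor action y
            p[y] += rate
        else:
            p[y] *= 1.0 - rate                  # penalty: disfavor action y
            s = sum(p)
            p = [q / s for q in p]
    return p

print(simulate())               # p[0] should settle well above 0.5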

Following a combined approach, i.e., part analytical and part simulation, Fu (19) has tested new models, concluding that both approaches closely agree. Viswanathan & Narendra (67) use a similar dual process, also obtaining good results.

Bock (8) has devised a collective automaton that updates its matrix in a batch mode, i.e., after a set of transitions has taken place. Due to its mathematical intractability, the model is tested through simulation. This collective model is based on the assumption that humans, when confronted with a problem, cannot update every single transition; they must wait until a high-level goal is achieved to correct errors and update probabilities. The soundness of such an approach is reinforced by the recent studies of A. Routtenberg (50), who describes the reward mechanism in the brain as being performed in a sort of collective manner. In an unpublished work, Templeman and Bock (56) have implemented a collective automaton that plays a non-trivial game, named Blitz, and achieves fast learning by interacting with a human player. Once again, their approach is simulation. An interesting aspect of this game is the way the expectation of a given output is used to compensate the LA. The research in LA has been largely guided by mathematicians rather than by engineers. Yet, in the literature we find recent applications in control theory, game playing, decision making, etc. We believe that heuristics and AI research should reevaluate the use of learning, using the formal background of LA, in order to provide flexibility and adaptation to their programs.
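
The batch idea can be sketched as follows. The deferred transition log, the single composite grade, and the clamp-and-renormalize rule are our own illustrative assumptions and should not be read as Bock's actual formulation.

```python
import random

# A sketch of a collective (batch) automaton: transitions are logged
# but not compensated until a high-level goal is evaluated, at which
# point one overall grade adjusts every logged transition.

class CollectiveAutomaton:
    def __init__(self, n_inputs, n_outputs, rate=0.05):
        self.p = [[1.0 / n_outputs] * n_outputs for _ in range(n_inputs)]
        self.rate = rate
        self.log = []                      # transitions awaiting a grade

    def respond(self, x):
        r, acc, y = random.random(), 0.0, 0
        for y, prob in enumerate(self.p[x]):
            acc += prob
            if r <= acc:
                break
        self.log.append((x, y))            # defer compensation
        return y

    def compensate(self, grade):
        # grade in [-1, 1]: one composite evaluation for the whole batch.
        for x, y in self.log:
            row = self.p[x]
            row[y] = min(1.0, max(0.0, row[y] + self.rate * grade))
            s = sum(row)
            self.p[x] = [v / s for v in row]
        self.log = []
```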

The main drawback of LA suggested by AI researchers is that the problems solved through analytical solutions of learning are, to a great extent, trivial. It is the purpose of this thesis to present an LA model that can be used to solve complex and non-trivial problems, augmented with currently accepted AI techniques, such as list processing, text handling, frame representation, tree searches, etc.


II.3. Learning in Artificial Intelligence.

Although we already mentioned that there is little application of learning automata theory in AI research, there are a few AI scientists who are trying to implement some kind of learning in their programs. We devote this section to reviewing them conceptually. A. Samuel (52), as mentioned earlier, can be considered a pioneer of learning in game playing. He discussed three basic tactics for learning within the strategy of preprogrammed intelligence: rote learning, learning by generalization, and learning from books. In the first case, rote learning is achieved by keeping a record of all the moves of a game. Although this is tedious and space-consuming, through sophisticated tree manipulation techniques it can be done with a certain flexibility. Samuel comments that the computer learns to become an efficient end-game player.

Learning by generalization is implemented by the use of a polynomial of weighted factors of a given board configuration, e.g., the number of pieces on the board, the number of kings, etc. To select a move, the program computes the polynomial for each possible subsequent configuration, choosing the one with the highest value. The computation can be done efficiently, using alpha-beta pruning in a decision tree. Since the coefficients of the polynomial are reweighted after each game, learning is simulated, and one can say that the program achieves generalization, since the value obtained becomes independent of a specific configuration. A more sophisticated computation is achieved with a set of hierarchical parametric tables that perform the calculations. Samuel suggested that learning in this fashion was slow, which led to the idea of having a preprogrammed polynomial with preweighted coefficients. Book learning is implemented by accessing a fixed set of board positions, each of which is associated with a recommended move; by searching through this file, the computer benefits from the advice of expert players. Samuel's program did simulate learning and achieved productive results. As we mentioned earlier, such a program was an initial trade-off between preprogrammed intelligence and learning. Since people were more interested in productive results, subsequent research concentrated upon better approaches to tree searching and state representation, leaving learning aside.
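
A minimal sketch of this tactic follows. The feature vector and the error-driven reweighting rule are illustrative choices of ours, not Samuel's actual coefficients or his signature-table method.

```python
# Learning by generalization, sketched: a linear "polynomial" of
# weighted board features scores each candidate configuration, and the
# weights are adjusted after the game.

def evaluate(features, weights):
    # features, weights: parallel lists (e.g., piece count, king count).
    return sum(f * w for f, w in zip(features, weights))

def choose_move(candidates, weights):
    # candidates: (move, feature_vector) pairs for each legal successor.
    return max(candidates, key=lambda c: evaluate(c[1], weights))[0]

def reweight(weights, features, predicted, outcome, rate=0.01):
    # Nudge each weight to reduce the gap between the predicted score
    # and the eventual outcome of the game (e.g., +1 win, -1 loss).
    error = outcome - predicted
    return [w + rate * error * f for w, f in zip(weights, features)]
```

In a full program, evaluate would of course be applied at the leaves of an alpha-beta search rather than directly to the immediate successors.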

L. Uhr (62), in the field of symbolic pattern recognition, is one of the authors who has maintained such an equilibrium in his research. In 1973 he published a book with more than 100 programs written in SNOBOL, thus heavily substantiating that learning is achievable. The basic tools in his implementation are string templates, which are used as hypotheses for the discrimination of patterns. The templates are compared with new patterns; when a template fails to match, its strings are shortened, and the process repeats itself until a useful template is found.
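
The template-shortening idea can be conveyed in a few lines. The particular shortening rule here (dropping the last character) is our own illustrative choice, not Uhr's SNOBOL implementation.

```python
# A string template serves as a hypothesis; on a failed match it is
# shortened and retried until a useful template is found.

def refine_template(template, pattern):
    # Shorten the template until it occurs somewhere in the pattern.
    while template and template not in pattern:
        template = template[:-1]
    return template or None        # None: no useful template remains

print(refine_template("STRIPES", "PINSTRIPED"))   # shortened to "STRIPE"
```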

P. Winston (69) has shown interest in providing programs with some learning. Disregarding previous work on the subject, he sees learning as a one-dimensional continuum whose extremes are Learning by Being Programmed and Learning by Discovery. Somewhere along this continuum he places Learning by Being Told, as well as Learning by Study and Generalization of Examples. Working in a "block world" environment, similar to Winograd's (see 70), the program is able to create frames that describe specific objects. The representation is done through functional links in a graph-like structure. Learning is achieved by presenting to the program three types of examples for a given object: actual objects, near objects, and misses. In this way the program learns the functional links required to represent the object. Although most of the emphasis is given to the symbolic representation of the examples using LISP, we see some aspects worth mentioning in learning. The main drawback of his approach is the rather deterministic fashion in which learning takes place, i.e., by requiring an exact match with one of a set of existing functional links, rather than a statistical measurement of the closeness of a near match.

In a recent work, Winston (70) describes the creation of transfer frames to allow the comparison of a source object and a destination object, in order to increase the knowledge of the program about the destination object. This is done by copying the most important functional links, or slots, existing in the source object. In this case, learning is implemented in two steps: hypothesis and filtering. In the first step the source is studied and compared with similar objects to obtain a transfer frame with the slots that most clearly identify the object. In the filtering step, the transfer frame is compared against the destination in order to obtain the slots which, when added to the frame of the destination, will increase the knowledge about the object under study.
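
The two steps can be sketched as follows, with frames modeled as plain dictionaries. The objects and slot names are invented for illustration and are not Winston's examples.

```python
# Hypothesis: keep the slots that most clearly identify the source.
# Filtering: keep only those slots the destination does not yet have.

def hypothesize(source, similar_frames):
    # Keep slots that distinguish the source from similar objects.
    return {k: v for k, v in source.items()
            if not all(f.get(k) == v for f in similar_frames)}

def filter_transfer(transfer, destination):
    # Keep only slots the destination does not already have.
    return {k: v for k, v in transfer.items() if k not in destination}

robin = {"has_wings": True, "breast": "red", "sings": True}
generic_bird = {"has_wings": True, "sings": False}
mystery = {"has_wings": True}

transfer = hypothesize(robin, [generic_bird])
mystery.update(filter_transfer(transfer, mystery))
print(mystery)   # the destination gains the "breast" and "sings" slots
```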

In this, as in any other case of Learning by Teacher-Supplied Examples, the choice of examples is of capital importance. As we all know, it is always difficult to separate the signal from the noise of a bad teacher.


II.4. Hierarchies in Learning.

Since the model we are going to discuss later is hierarchical, we now review the hierarchical nature of learning. We find the concept of hierarchies in the fields of Psychology, Control Theory, Artificial Intelligence, and Learning Automata, where it is a useful means to understand, represent, and cope with complexity.

Let us first take Psychology. There exists an immense set of learning theories; almost every scientist offers a new one. Hilgard and Bower (11), as well as Mowrer (36), present a very good introductory description of most of them. We find that conditioning, both classical and operant, is the most convenient to emulate on a computer. Skinner's box (see 12) has been implemented by many, using mathematical models and simulations that agree with the results obtained in the laboratory. LA, as we commented earlier, evolved from the initial models of Estes.

Let us look at three authors who specifically describe the various stages of learning as a hierarchical process. We begin with J. Piaget (48), who describes four developmental stages in children.

1. The Sensory Period, from 0 to 2 years old, when the child can only perform motor actions. Piaget maintains that intelligence is not really operational since the child's actions are not yet completely represented internally.

2. Preoperational Thought, from 2 to 7 years of age, when the child learns symbolic functions, such as language, imagination, and deferred imitation. Due to such functions, the child is able to construct internal representations, and thus thought can take place.

3. Concrete Operations, from 7 to 11 years of age. Corresponding to this period is the ability to obtain the inverse of symbolic functions, i.e., reversibility. From this, the child is able to achieve seriation by means of transitivity functions. Recursive thought begins to appear.

4. Propositional or Formal Operations, from 12 to 15 years of age. In this period the child reaches his/her equilibrium. The main feature of the period is the ability to reason about hypotheses.

Besides considering this set of stages, Piaget & Inhelder (see Albus (1)) describe another hierarchy for the case of sensory-motor development:

A second author, Robert Gagne (20), classifies learning into eight different classes or stages. Furthermore, he considers that in order to obtain higher-level rules, it is necessary to know the lower-level ones first. The model of learning presented in Chapter IV, as well as the results of the simulations, describes the implementation of such a hierarchical rule. Gagne's stages in learning are presented below:

A bit apart from the typical learning theories, we encounter another author who describes a hierarchy in the development of youth identity: E. Erikson (15) describes a life cycle consisting of eight stages through which the child must mature before joining social institutions:

Again, in this hierarchy, Erikson states that it is necessary to overcome lower stages before going to a higher one.

We turn back to artificial intelligence to discuss two authors who have suggested the use of hierarchies in order to solve complex problems. O. Selfridge (53), the first to suggest the use of a tree of information-gathering elements to perform feature analysis, named such an arrangement Pandemonium, since each element is a demon dedicated to the recognition of a specific detail in a pattern, delivering to a higher-level demon a value regarding the correspondence of that detail in the pattern being analyzed. Using a novel and peculiar approach to problem solving, Albus (1) and Barbera (6) suggest the use of a hierarchical control mechanism as the stepping stone to planning. They state that the brains of animals are mainly concerned with controlling actions, rather than thinking and planning. These papers imply that sensory-motor interactive goal mechanisms are not something added to the skills of the animal, but rather the substrate from which intelligence develops in higher mammals. Albus proposes a hierarchical mapping of task-decomposition operators arranged in a fashion similar to that of the Pandemonium, but with the difference that instead of working bottom-up, it works top-down: each demon sets up a chain of command actions for lower-level demons in order to achieve a command received from the demon above it. The example he uses is explicit enough:

Albus assumes that problem solving can be achieved by a set of sensory-motor activities that include both actions (the motor part) and hierarchical communication (the sensory part). Furthermore, Albus suggests that the environment may influence the system by affecting it at various levels. As such, environmental conditions may force a change of strategy at higher levels, or only tactical changes at lower levels. However, if the change generated is unstable, or the environment changes constantly, the system may be overwhelmed, precluding successful performance. He goes into the discussion of a mapping function that can be used to operate on a given level in the model. The mapping suggested is based on a previous work entitled CMAC, the Cerebellar Model Articulation Controller, which has already been implemented (2).
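
The top-down decomposition can be illustrated with a toy sketch. The task table below is invented for exposition; it is not Albus's actual set of operators.

```python
# Albus-style top-down task decomposition: each level expands a command
# into commands for the level below, until primitive motor actions are
# reached. All command names here are illustrative.

DECOMPOSE = {
    "fetch tool": ["go to bench", "grasp tool", "return"],
    "go to bench": ["turn toward bench", "walk forward"],
    "grasp tool": ["lower arm", "close gripper", "raise arm"],
    "return": ["turn around", "walk forward"],
}

def execute(command, depth=0):
    # A command absent from the table is a primitive motor action.
    subcommands = DECOMPOSE.get(command)
    if subcommands is None:
        print("  " * depth + "action: " + command)
        return
    print("  " * depth + "command: " + command)
    for sub in subcommands:
        execute(sub, depth + 1)

execute("fetch tool")
```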

We feel that his work is important and will prove fundamental in certain aspects as the technology of robotics and control advances. Concerning hierarchical inter-communication, the problem is not fully addressed. In Chapter IV of this thesis we elaborate on such a topic; for the moment we mention the work of Mesarovic et al. (31), who present a formal discussion of inter-level communication, or coordination.

Concerning learning, Albus does acknowledge a teleological commitment at higher levels, as well as a prestructured hierarchy. He states that sensory-motor actions are to be learned and must start at lower levels first. He is not as definite as Gagne's stages (20): learning has to take place at lower levels first, if not completely, at least to a certain degree, before attempting the next level up. He makes reference to the work of Piaget & Inhelder (48), previously discussed. The work of Albus has notably influenced this thesis. However, we have tried to stay within the realm of LA and reinforcement schemes, a topic not discussed by him.

The final topic on hierarchies lies within the LA field itself. In his Ph.D. dissertation, Viswanathan (66) discusses a two-level automaton to cope with the complexity of a non-stationary environment (see note 1). This is perhaps the only work that addresses such an approach. There are two noteworthy aspects of the work. First, he acknowledges that hierarchical models lack an appropriate formal algebra to permit analytical solutions; again, we turn to Mesarovic's work (31) as an initial formal treatment of the subject of hierarchies. Second, he uses simulation extensively in order to test the expediency of the schemes.

Having provided a comprehensive review of learning related to the spirit of this thesis, we now turn to the specific details of the various LA models that have been suggested in the last two decades.

__________

1) We devote Chapter III to presenting the mathematics of LA models.