APPENDICES


A. PROBABILITY PRELIMINARIES

B. STOCHASTIC AUTOMATA PRELIMINARIES

C. HIERARCHICAL LEARNING: A Simple Example

D. GLOSSARY


APPENDIX A. PROBABILITY PRELIMINARIES

Because probability theory is so widely used, almost every author adopts a nomenclature suited to his own application, which is often a source of confusion. We have tried to standardize our own notation, and in this appendix we provide some basic notions in order to familiarize the reader with our usage (see Papoulis (45) and Freund (18)).

A.1. Discrete Stochastic Processes

A finite discrete stochastic process can be defined as an indexed family of random variables {Q(t), t ∈ T}, with T = {0, 1, 2, ...}, where each random variable Q(t) takes a value from a given finite set Q.

Let i, j represent the values taken by the variables Q(u) and Q(v) respectively; then the joint probability of i and j is given by:

p_ij = P[Q(u) = i, Q(v) = j]

Being stationary, we drop the t in equation (A.19), so that p_ij = ^t p_ij = ^n p_ij for all t and n (the pre-superscript denotes the time at which the transition takes place). Processes that are stationary can be described by a set of homogeneous transition matrices P; strictly speaking, the concept of homogeneity refers to another definition. In the case of learning automata, we are concerned with two other types of stationary processes. Periodically stationary processes are those in which, for some fixed period T:

^t p_ij = ^(t+T) p_ij

Processes Stationary in an Interval are those in which equation (A.22) holds true over a given interval of time. In order to solve any Markov process, it is necessary to know the initial probability distribution vector over the values in the set Q.

Chapman-Kolmogorov Equation

Since all along we have talked about finite discrete Markov processes in which the random variables take their values from a given set, we can refer to these processes as Markov chains.

A Markov chain obeys the well-known Chapman-Kolmogorov equation, stated as:

p_ij(m+n) = Σ_k p_ik(m) p_kj(n)

To prove the equality, it is only necessary to apply the chain rule, i.e., (A.9), to equation (A.16). Given an initial distribution, and provided that the chain is stationary, we may obtain the probability distribution at any given point in time by applying equation (A.24). The fact that the set of values for each random variable is the same throughout the process, i.e., the set Q, allows us to consider those values as nodes or states. This fact proves handy when we elaborate on stochastic automata. From now on, we will refer to these values as the states.
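As a minimal illustration of this point, the following Python sketch propagates an initial distribution through a stationary two-state chain; the matrix P and the vector q0 are made-up values, not taken from the text.

import numpy as np

# One-step transition matrix of a stationary two-state chain
# (illustrative values only).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Chapman-Kolmogorov: P(m+n) = P(m) P(n), so the n-step matrix
# is simply the n-th matrix power of P.
def n_step(P, n):
    return np.linalg.matrix_power(P, n)

# Initial probability distribution over the states in Q.
q0 = np.array([1.0, 0.0])

# Distribution after 5 transitions, q(5) = q(0) P^5 (cf. A.24).
print(q0 @ n_step(P, 5))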

We now present a new set of variables used in the solution of transition matrices. The notation fij(n) represents the probability of the first passage from state i to state j in n units of time, i.e., the probability that exactly n transitions are needed to reach j from i for the first time. From the definition it is evident that fij(n) ≠ pij(n); the values of f for n > 1 can be obtained recursively by:

f_ij(n) = p_ij(n) − Σ_{k=1..n−1} f_ij(k) p_jj(n−k), with f_ij(1) = p_ij

The notation pij(n) represents the probability of moving from state i to state j in exactly n transitions (units of time).

For the specific case of i = j, fii(n) refers to the probability of recurrence to the same state. The values of the various fij(n) must satisfy the following conditions:

f_ij(n) ≥ 0 and Σ_{n=1..∞} f_ij(n) ≤ 1 (A.27)

Only when the sum in equation (A.27) is exactly 1 can we state that the f's constitute a probability distribution. These values are of great help because they are used to investigate certain properties of Markov chains.

If i = j and Σn fii(n) = 1, then i is called recurrent: if the process reaches state i, it must eventually return to it. For the special case in which fii(1) = 1, i is called an absorbing state: once the process gets there, it cannot leave. On the other hand, if Σn fii(n) < 1, then i is called a transient state: there is a positive probability of never returning to i. It is important to note that the computation of the f's is not a simple task; there are better methods for determining the type of the states in a Markov chain.
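For concreteness, here is a small Python sketch that evaluates the recursion above numerically; the chain P and the state pair are illustrative assumptions.

import numpy as np

# Illustrative one-step transition matrix.
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])

def first_passage(P, i, j, N):
    """f[n] = probability that the first visit to j from i takes exactly n steps."""
    f = np.zeros(N + 1)
    f[1] = P[i, j]
    for n in range(2, N + 1):
        # f_ij(n) = p_ij(n) - sum_{k<n} f_ij(k) p_jj(n-k)
        f[n] = np.linalg.matrix_power(P, n)[i, j] - sum(
            f[k] * np.linalg.matrix_power(P, n - k)[j, j] for k in range(1, n)
        )
    return f

f = first_passage(P, 0, 1, 60)
print(f[1:5])    # first few first-passage probabilities
print(f.sum())   # tends to 1 here, so the pair is recurrent (cf. A.27)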

State j is said to be accessible from state i if pij(n) > 0 for some n > 0. Furthermore, states i and j are said to communicate if i is also accessible from j. Markov chains are generally partitioned into disjoint classes, where each class comprises states that communicate with each other. If there is only one class, the chain is said to be irreducible.

To discuss ergodicity, we first need to define the expectation of fij(n). This can be defined only if (A.27) holds with strict equality; the expected value is given by:

E{F_ij} = Σ_{n=1..∞} n f_ij(n)

Thanks to (A.29), it is possible to obtain the expected values in a simpler way than by using (A.25). When i = j, the state is said to be null recurrent if E{Fii} = ∞ and positive recurrent otherwise. In a finite-state Markov chain there are only positive recurrent states and transient states. A state i is said to have period T, for T > 1, if pii(n) = 0 whenever n is not divisible by T. In the case of T = 1, we refer to state i as aperiodic. If a state in a given class is aperiodic, then all the states in the class are also aperiodic. We refer to positive recurrent states that are also aperiodic as ergodic states. From what was just defined, an irreducible Markov chain with positive recurrent, aperiodic states is said to be an ergodic chain.

We end our review of Markov processes with an important and well-known theorem on ergodic chains. In an ergodic chain, the probabilities of the various states at time n, as n tends to infinity, approach a steady-state distribution that is independent of the initial state distribution, i.e.:

lim_{n→∞} p_ij(n) = SS_j for every initial state i (A.30)
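The theorem is easy to check numerically. The sketch below, with an invented ergodic chain, recovers the steady-state vector as the left eigenvector of P for eigenvalue 1 and verifies its independence from the starting distribution.

import numpy as np

# Illustrative ergodic two-state chain.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Steady state: left eigenvector of P for eigenvalue 1, normalized
# so that its entries sum to one (cf. A.30).
vals, vecs = np.linalg.eig(P.T)
ss = np.real(vecs[:, np.argmax(np.real(vals))])
ss /= ss.sum()
print(ss)                                  # [0.8 0.2]

# Independence from the initial distribution: both starting vectors
# converge to the same limit.
for q0 in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    print(q0 @ np.linalg.matrix_power(P, 100))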


APPENDIX B. STOCHASTIC AUTOMATA PRELIMINARIES


In order to provide the reader with a unified notation for this novel field, we present in this appendix a brief review of its concepts. The idea of a stochastic automaton was first introduced by W.R. Ashby (3) in 1956. Since then, the topic has been formally studied by authors such as Booth (10), Paz (46), and many others. In this appendix we stress the relation of a stochastic automaton to a Markov chain, presenting particular results relevant to learning automata and, more specifically, to the HCLSA.

B.1. Mealy Stochastic Automata

A Mealy Stochastic Automaton is defined by the quadruple:

A = {X, Q, Y, S}

where X is the input set, Q the state set, Y the output set, and S the set of matrices S(Y | X) that jointly represent the transition function f and the output function g.

In the case where f is defined deterministically, the corresponding S(Y | X) matrices are saturated, i.e., one element per row is equal to one and the rest are zero. In such a case, we deal with Deterministic Mealy Machines. On the other hand, the function g does not need to be stochastic, since the joint representation of f and g in S allows for complete generality.

The correspondence between a first-order Markov chain and a stochastic automaton can be obtained from (B.3) provided that we keep the inputs constant, because the choice of the next state then depends only on the current state: the value taken by the random variable Q(t+1) depends only on the value taken by Q(t). Stochastic automata whose inputs are kept constant are called autonomous. We state that the S(Y | X) are semistochastic since they satisfy the condition 0 ≤ Sij(Y | X) ≤ 1; furthermore, a double summation, first over all states and then over all possible outputs, adds up to exactly one:

Σ_{y∈Y} Σ_{j∈Q} S_ij(y | x) = 1 for every state i and input x (B.5)

In the matrix, each row adds up to one since the matrix represents equation (B.5). Let I represent an input string x1, x2, x3, ..., xn and O represent the corresponding output string y1, y2, y3, ..., yn. Assuming that the inputs are independent of each other, we define S(O | I) as:

S(O | I) = S(y1 | x1) S(y2 | x2) ... S(yn | xn)

For the null string λ we have that S(λ | λ) is equal to the identity matrix, i.e., no transition has taken place. The equation is merely an extension of the rules of conditional probability discussed in the previous appendix. In order to obtain probability measures, let us define the vector U as a unit vector containing as many elements as there are states in the automaton; then we define h(O | I) as:

h(O | I) = q(0) S(O | I) U

where q(0) is the initial state probability distribution.

Such a formulation is merely an application of Bayes' Rule, i.e., (A.8), to stochastic automata. To fix the ideas already presented, we give a simple numerical example:
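The sketch below works such an example in Python; the two S(y | x) matrices and the initial distribution are invented values for a two-state automaton under a single, constant input.

import numpy as np

# S[y][i, j]: probability of moving i -> j while emitting output y,
# for the single input symbol x (values are illustrative).
S = {
    "y0": np.array([[0.3, 0.2],
                    [0.1, 0.4]]),
    "y1": np.array([[0.1, 0.4],
                    [0.3, 0.2]]),
}

# Semistochastic check: summing over all outputs yields a stochastic
# matrix whose rows add to one (cf. B.5).
print(S["y0"] + S["y1"])

# S(O | I) for the output string O = y0 y1 is the matrix product.
S_OI = S["y0"] @ S["y1"]

# h(O | I) = q(0) S(O | I) U, with q(0) the initial distribution and
# U the all-ones (unit) vector.
q0 = np.array([0.5, 0.5])
U = np.ones(2)
print(q0 @ S_OI @ U)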

B.2. Moore Stochastic Automata

The case of a Moore machine differs from that of a Mealy machine in that the output is a function only of the current state; thus equation (B.4) becomes:

(B.13) Y(t) = g(Q(t))

Again, the function need not be stochastic. Since in the discussion of the relation between Markov chains and stochastic automata we held the inputs constant, this equation does not affect the correspondence we spoke of. However, the representation of a Moore machine differs from that of (B.1) because in this case the transition matrices in the definition of the machine are fully stochastic. A Moore Stochastic Automaton is defined as the quintuple:

A = {X, Q, Y, P, g}

We include g in the definition since the output is independent of the probabilities in P. The entries in each P(X) represent the following probability:

p_ij(x) = P[Q(t+1) = j | Q(t) = i, X(t) = x]

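As a small illustration, the following Python sketch samples one trajectory of such a machine; the matrices P(x) and the output map g are invented for a two-state, two-input example.

import random

# P[x][i][j]: probability of moving i -> j under input x (illustrative).
P = {
    0: [[0.8, 0.2], [0.3, 0.7]],
    1: [[0.1, 0.9], [0.6, 0.4]],
}
g = {0: "y0", 1: "y1"}   # deterministic output map, one symbol per state

def step(state, x):
    """Sample the next state from row `state` of P(x); output is g(next state)."""
    u, acc = random.random(), 0.0
    for j, p in enumerate(P[x][state]):
        acc += p
        if u < acc:
            return j, g[j]
    return j, g[j]

random.seed(0)
state = 0
for x in [0, 1, 1, 0]:
    state, y = step(state, x)
    print(x, state, y)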
State equivalence and the synthesis of stochastic automata are further studied by Fu (19) and Paz (46). Although the relation between a first-order Markov chain and a stochastic automaton has already been mentioned, we want to stress that this relation goes beyond the first-order case. Booth (10) has shown that it is possible to represent any rth-order Markov chain by means of a stochastic automaton through an expansion of the state set Q.

For the general case, we require (xmax)^r states for an rth-order Markov chain to be represented as a stochastic machine. The transition matrix in such a case contains more zeros than a normal stochastic matrix does, because only the transitions that follow (A.17) are allowed in a given row. Due to the relation between stochastic machines and Markov chains, all the theorems for the latter apply to the former, e.g., Z-transforms, limits as t tends to infinity, etc.
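The expansion is easy to carry out mechanically. The sketch below, for an invented second-order chain on two symbols, builds the (xmax)^r = 4 expanded states as pairs of successive values; only transitions whose histories overlap receive a nonzero entry, which is where the extra zeros come from.

import itertools
import numpy as np

symbols = [0, 1]   # xmax = 2 values
r = 2              # order of the original chain
states = list(itertools.product(symbols, repeat=r))   # (xmax)^r = 4 states

# Second-order conditional probabilities P[next | (before-last, last)]
# (illustrative numbers only).
cond = {(0, 0): [0.7, 0.3], (0, 1): [0.5, 0.5],
        (1, 0): [0.2, 0.8], (1, 1): [0.6, 0.4]}

# First-order matrix over the expanded states: (a, b) -> (b, c) with
# probability P[c | (a, b)]; non-overlapping pairs get structural zeros.
P = np.zeros((len(states), len(states)))
for i, (a, b) in enumerate(states):
    for j, (b2, c) in enumerate(states):
        if b == b2:
            P[i, j] = cond[(a, b)][c]

print(P)   # each row sums to one; half of the entries are zero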

A simple approach to the synthesis of stochastic automata is to consider a set of deterministic automata subjected to a random (Bernoulli or Markov) input. This approach is important to our discussion since the first learning application in the field was based on such a consideration.

B.3. Automata Interacting with the Environment

In this part we deal with the solution of the interactions between an automaton and its environment, as described in Chapter III of this thesis.

We suggest reading Chapter III before going any further in this appendix. First, we discuss the solution proposed by Tsetlin (57) for a deterministic automaton (DA) interacting with a binary random environment (BRE). The required definitions were already presented in Chapter III (see equations 3 and 4). Let us assume that the environment is stationary, and consider that at instant t the automaton is in state Q(t) = i with an output y given as a function of the state i.

The evaluation by the environment is based on the entry r_y of R corresponding to y. The overall probability of moving to state j at time t+1, denoted by ^(t+1)o_ij, is obtained through the elimination rule, see equation (A.7):

^(t+1)o_ij = (1 − r_y) p_ij(x = 0) + r_y p_ij(x = 1)

Collecting all the o_ij terms in a single matrix ^(t+1)O, we observe that it is also stochastic: the values o_ij lie in the range 0 ≤ o_ij ≤ 1, and the summation over j in a given row equals 1 since P(X) is deterministic. Furthermore, since the probabilities in the matrix depend on R, which has been assumed stationary, ^(t+1)O is also stationary, i.e., O = ^t O = ^n O. For the case we are interested in, the chain is ergodic, presenting a steady-state distribution following equation (A.30). Let us denote by SSi(y) the steady-state probability of state i with output y, i.e.,

Recall that the DA is said to be expedient if E{ {DA, BRE} } < E{ R }, and optimal if E{ {DA, BRE} } = rmin. To determine the performance of a given DA, we need only compute the SSi(y) values and obtain the various expectations. For the case of the T-maze experiment, we provide here the solution for Tsetlin's linear strategy. The strategy was stated for a two-state automaton by the following heuristic:

For this case, the pair I = {DA, BRE} is given by:

Tsetlin shows that if rmin < 1/2, then the expected value of the interaction approaches rmin as n goes to infinity. The model can be further generalized to the case of n outputs, e.g., the n-armed bandit problem, in which case it is only necessary to divide the state set into n classes. Note that, to standardize our nomenclature, we have presented the vector R in a slightly different manner than Tsetlin originally proposed.
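To make the interaction tangible, here is a simulation sketch of the depth-one case only: a two-state automaton that stays on reward and switches on penalty, in a stationary BRE with invented penalty probabilities r = (0.2, 0.4). With a single state per output the automaton is expedient but not optimal; its penalty rate settles near 2 r1 r2 / (r1 + r2) ≈ 0.267, below E{R} = 0.30 but above rmin = 0.20, and it is the deeper memory that drives the rate toward rmin.

import random

r = (0.2, 0.4)    # penalty probabilities of the BRE (illustrative)
state = 0         # two states, one per output/action
penalties = 0
T = 200_000

random.seed(1)
for _ in range(T):
    # The environment penalizes the current output with probability r[state].
    penalty = random.random() < r[state]
    penalties += penalty
    # Linear strategy, depth one: stay on reward, switch on penalty.
    if penalty:
        state = 1 - state

print(penalties / T)   # approximately 0.267 for these values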

Turning to stochastic automata, we present our solution to the interaction {SAVS, BRE} discussed in Chapter III, with reference to the work of Varshavsky and Varontsova (65). For this case, let the pair I = {SAVS, BRE} be given by:


APPENDIX C. HIERARCHICAL LEARNING: A SIMPLE EXAMPLE

In order to obtain a clear picture of the concept of hierarchical learning, we provide in this part a simple example that shows both hierarchical and collective learning. The example chosen is helpful for those readers wishing to have a concrete view of the concepts discussed in the present work.

Suppose that we want the computer to learn the way in which the various pieces of a chess set move on the board; specifically, let us say that we want the computer to learn how to reach a given position, starting with just three pieces: a rook, a knight, and a queen. The method to be used is continuous trial and error. There are many ways in which we could approach this problem; however, we think that a hierarchical approach is the most appropriate. Such an approach calls for a hierarchical organization of CLSA's in the manner described in Chapter IV. At the root of the hierarchy we set the goal of reaching the desired position on the chessboard. In order for this to happen, it is necessary to define the following subgoals at the various levels of the hierarchy:

Using a hierarchical arrangement to learn the previous goals allows the machine to learn lower-level goals independently of higher-level ones. In this case, whether or not the machine learns well how to checkmate the opponent's king is irrelevant to how well it learns to move the knight; this concept is a rather important one. Furthermore, it can be suggested that by organizing the learning in such a way, the computer can learn new higher-level goals based on the knowledge it acquired with previous higher-level ones, as the sketch below suggests.
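The sketch below is a minimal Python illustration of the idea, not the HCLSA itself: both levels are hypothetical stay/switch learners, the knight's move set and the feedback rules are invented, and the point is only that the lower learner converges on a legal knight move regardless of what the root learns.

import random

class Learner:
    """Toy automaton: keep the current choice on reward, try the next on penalty."""
    def __init__(self, actions):
        self.actions = actions
        self.k = 0
    def act(self):
        return self.actions[self.k]
    def update(self, penalty):
        if penalty:
            self.k = (self.k + 1) % len(self.actions)

# Lower level: judged only on whether its move is a legal knight move.
knight = Learner(["diagonal slide", "straight slide", "(1,2) jump"])
# Higher level: chooses which piece to move, judged on progress to the goal.
root = Learner(["move rook", "move knight", "move queen"])

random.seed(0)
for _ in range(1000):
    root_choice = root.act()
    move = knight.act()
    knight.update(penalty=("jump" not in move))     # local, independent feedback
    root.update(penalty=(random.random() < 0.5))    # stand-in for board feedback

print(knight.act())   # "(1,2) jump": learned regardless of the root's progress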

The example just described can be generalized to many other cases, as discussed in Chapter IV of this work. Finally, although the learning process described is highly flexible in nature, it must be understood that it is a rather time-consuming process.


APPENDIX D. GLOSSARY

The following terms have been used in this thesis; we provide here a brief definition for each, together with the page where it first appears in the text.