A 3-Brain Architecture for Brain-Like Intelligence
1 Introduction
1.1 Context
In the development of learning systems and neural networks, the issue of complexity occurs at many levels of analysis.
Most of the literature on complexity and neural networks addresses the task of supervised learning -- the fundamental task of learning a mapping from an input vector X(t) to a vector of targets Y(t), based on training data consisting of examples of the two vectors. This is an extremely important topic, where insights into alternative concepts and measures of complexity will be very important to practical results in engineering [14, 2 chapter 10]. Further research on this topic will be crucial to the development of supervised learning components which are powerful enough to permit truly brain-like capabilities in larger intelligent systems.
This paper will address an equally important, complementary aspect of the complexity issue -- the problem of how to build larger-scale intelligent control systems able to cope with truly large and complex environments. In particular, the paper will describe the design of a new class of systems to perform optimization over time. Many other papers (e.g. in [1,15-18]) have discussed why this is a reasonable approach to explaining and replicating the kind of higher-order intelligence found in the mammalian brain; however the previous designs based on that aproach have had certain limitations so far in their ability to handle truly complex environments. These designs have been more than enough for the current types of challenges being addressed at the frontiers of control engineering[2]; however, in order to handle those kinds of complexity which the mammalian brain can cope with, one must upgrade these designs, so as to exploit the phenomena of “temporal chunking” and “spatial chunking,” as will be discussed below. A large part of the challenge here is to build a truly brain-like neural learning-based design which can learn to achieve the kinds of capabilities which have been already achieved in some special applications, using mainly hard-wired algorithms, in the domain of artificial intelligence. This paper will discuss how to meet that challenge.
In many ways, the design approach given here is a beginning rather than a conclusion. The paper does lead to many possible variations in how to implement the approach, and and many questions about what will work best, under what conditions. It also makes very strong assumptions about the reader’s knowledge of prior knowledge, especially involving the papers cited below.
The text which follows is taken directly from an international patent application filed on July 4, 1997, by this author, in association with Scientific Cybernetics, Inc. That application also incorporated several of the papers cited below, and some additional information on possible applications, for example, in strategic game-playing, in hypersonics, in the control of complex heat exchanger systems or fuel processors, and in distributed large-scale network control. For the sake of authenticity and timeliness, the text is being published without alteration, except for the addition of some new references and for copy editing required to meet the standards of this publisher.
The relation between this design and the brain has also been discussed with the neuroscientist and psychologist Karl Pribram, as reported in part in [19]. The “decision blocks” which form the core of this design could be thought of either as “task modules” (as in Pribram’s earlier work) or as “episodes” as his in more recent work[20]; in fact, the proposed structure has the ability to learn to be more goal-oriented (like a task module) or to be more reactive in nature (like an episode processor), depending on what works best in specific situations. The “loops” or “layers” in the basal ganglia/cortex/thalamus circuit have been discussed not only by Brooks, but also by John Taylor, James Houk, Seng-Beng Ho and several others
whose thoughts and guidance I am most grateful for.
1.2 A Three-Brain Architecture for Artificial Intelligence
The purpose of this paper is to describe a method for building a new type of general-purpose artificial intelligence which will be called a “3-brain system” or “3-brain architecture.”
This architecture was originally motivated by an effort to understand and replicate the kinds of problem-solving capability and learning which exist in the brains of mammals [1]. The details of these efforts are described in several papers filed with the preliminary patent applications and included as part of this disclosure. An additional paper, also included [2], describes some of the underlying engineering principles and ideas for how to use these designs in practice. Some of these papers were written on government time, but the original ideas therein were developed by the author independently, usually in periods of reflection outside of the office.
The 3-brain architecture is not, strictly speaking, a single design. It is a design methodology, which can be implemented in a variety of ways, on a variety of platforms. The architecture can be embodied in chips, in opto-electronic hardware, in biomolecular hardware, or in software. As a practical matter, the software emulation will probably come first, in order to permit the careful evaluation and tradeoff studies necessary to justify the more expensive step of building dedicated hardware.
This design methodology may be thought of as an improvement to certain architectures previously developed (and partly patented) by this author. More precise-ly, it is an extension of the model-based adaptive critic (MBAC) or “brain-like intelligent control” designs described in the attached papers. Those designs, in turn, may be seen as extensions of Ron Howard’s methods of dynamic programming [3].
In general, all of these designs can be expressed as designs for learning-based maximization of utility over multiple time periods into the future. This includes “reinforcement learning” as a special case [4]. In reinforcement learning, an intelligent system is given access to a set of sensor inputs, a set of actuators which it controls (i.e. its outputs are wired up to the actuators), and to a monitor which evaluates its performance or “utility” (U). Through learning and/or exploration, it develops a strategy or “policy” of action which enables it to maximize (or minimize) total utility in the future. These reinforcement learning systems are intended to be general-purpose systems, because the same learning system can be applied to different applications, simply by having it learn to adapt to these various applications sepa-rately. The 3-brain architecture is a major, qualitative improvement over the earlier designs, insofar as it has the potential ability to learn to cope with far more difficult applications. In effect, it is a general purpose system for making intelligent decisions.
This paper will describe how to build a 3-brain system, by a series of progressive improvements, starting from incremental dynamic programming, which will be reviewed. The first part of the paper will describe new designs for “temporal chunking” with reinforcement learning, in a classical context. The second part will describe how to replace the matrices in these new designs with neural networks (or similar structures), in order to permit larger-scale applications based on the ability of neural networks to approximate complex nonlinear relations in a parsimonious way. The third part will describe new, more sophisticated neural networks (and their nonneural generalizations) which should perform better than conventional neural networks as components of these designs; in addition, it will describe some possible hardware implementation of the most critical, computationally expensive components, and address the extension of this system to incorporate aspects of “spatial chunking.”
Crudely speaking, parts 2 through 4 of this paper will describe how to construct the “upper brain” and the “middle brain” as described in [1]. Part 5 will describe briefly how to attach such a higher-level intelligent system to a lower level “brain” so as to construct a complete “3 brain” system.
The technological intention here is to first build up a modular software package, in which a full three-brain system can be obtained by linking together the appropriate modules. (Before this is fully completed, however, some of the hardware development recommended in part 4 will be started, if possible.) However, for practical purposes, the user of this software will also be able to use simpler modules, or alternative modules, to fill in the various components, if he or she chooses.
The step-by-step approach to constructing this kind of intelligent system is intended to provide additional flexibility in the overall package, flexibility which a monolithic implementation would not possess.
Notice that this paper will suggest several alternate designs, based on the same general approach or method, to perform subsystem tasks. In earlier research, in past years, the author often tried to specify one best alternative for each subsystem. But experience showed that different alternatives worked better in different applications. Thus the intention is to build a general system which provides the user a choice of subsystems, so as to allow extensive tradeoff studies.
2 Time-chunked Approximate Dynamic Programming
2.1 Notation and Classical Results
In the simplest forms of dynamic programming, we assume that the environment or the plant to be controlled can only exist in one of a finite number of possible states. These possible states may be denoted as s1 ,..., si ,..., sn , where n is the number of states. At each time t, the intelligent decision-making system observes the state s(t) (where s is an integer between 1 and n), and then outputs a vector containing decisions or control variables, u(t). Usually the intelligent system will choose u(t) based upon a “policy” p which is simply a collection of rules of how to behave (to choose u) in different states s. This may be written conceptually as:
u(t) = u(s(t), p) (1)
Normally the user of the system provides a utility function U and an interest rate r. We are asked to design an intelligent system which can learn the optimal policy, the policy which at any time t will maximize:
(2)
where the angle brackets denote expectation value. (It is a straightforward well-known extension of this to consider finite horizon problems, in which t goes to some finite maximum T. Also, it is common to build designs in which r is initially set to a high value -- even infinity -- in the first few iterations, and lowered slowly to the user-specified value, as a method for improving learning.)
Normally it is assumed that we know the transition probabilities as a function of action, which may be written:
(3)
For a particular policy p, we may define the classic J function as:
(4)
and:
(5)
The basic theorems of incremental dynamic programming [3,5] describe the properties of this function J, which must normally obey the Bellman equation:
(6)
Note that this equation also provides a basis for actually choosing u(t) at any time t.
In the case where there are only a finite number of possible states s, we may define a state probability vector p by defining its components p1 ,..., pi ,..., pn as:
pi = Pr(si) (7a)
pi(t) = Pr(s(t) = i) (7b)
In this notation, we have, for any policy p:
p(t+1) = Ppp(t) (8)
Likewise, we may define the vectors Jp and Up by defining their components:
(9a)
(9b)
In this notation, equation 4 takes the form:
Jp = Up + MpJp , (10)
where we define:
Mp = (Pp)T/(1+r) (11)
2.2 Classical Approaches
In traditional incremental dynamic programming [3,5,6], the optimal policy is found by some kind of alternation between “value updates” and “policy updates.” One starts with something like an arbitrary policy p and an arbitrary estimate of the value vector J. One learns the optimal policy simply by progressive improvements in the policy and in the estimated value vector. The value updates are generally based on equations 6 and 10, translated into:
J(n+1) = Up + MpJ(n) (12)
In other words, for the current policy p, one replaces the old estimate of J (J(n))
with this new estimate (J(n+1)). In a policy update, one simply picks u(t) at each time t so as to maximize the right-hand side of equation 6, using the current estimate of J. In particular, one may do all this for the entire set of states (as implied by equation 12) or for one state at a time. The MBAC designs mentioned above provide neural network approximations to these kinds of updates.
Broadly speaking, there is another important choice of strategies in making these kinds of updates. In a passive strategy, one simply assumes the current policy p, and carefully works out J in detail. In the active approach, one explicitly designs the value-updating system so as to permit more frequent changes in the policy p and more explicit allowance for the effects of such changes.
2.3 Temporal Chunking: Multiresolutional Designs
Traditional approaches to dynamic programming and to approximate dynamic programming (ADP) are generally based on “backups” or “value updates” from time t+1 to time t, as implicitly assumed in equation 12. But in a real-time control system, the interval between time t and time t+1 (the sampling interval) may be very short. The literature on artificial intelligence has stressed the need to jump over longer time intervals [7]; however, this kind of “temporal chunking” has yet to be implemented in effective learning-based ADP designs.
In theory, the usual ADP designs should all converge to the correct policy, anyway, if given enough time. But there is a problem here with computational cost and computational complexity.To put it another way, new designs which inject time chunking into ADP should lead to reductions in computational cost and complexity, which in turn should make it possible to handle more complex applications at acceptable cost.
To understand these cost issues, return to equation 12. For simplicity, assume a purely passive approach, in which we try to find the correct J function (Jp, in effect) for a fixed policy p. Assume that the initial estimate of J -- J(0) -- is simply set to equal U. In that case, it is easy to see that:
(13)
Thus after n complete value updates, the “critic” (the estimate of J) “sees” only n periods of time into the future, in effect. Equation 13 is just an estimate of the true value:
Jp = (I - Mp)-1 U (14)
In order to learn the true J much more quickly, one may exploit the following numerical identity (for the limiting case, assuming no singularities, as usual):
Jp = ... (I + (Mp)16)(I + (Mp)8)(I + (Mp)4)(I + (Mp)2)(I+Mp)U (15)
Using this approach, after only n steps of calculation, one “sees” 2n periods of time into the future. There are two ways to implement this approach:
1. For each number j, from j=1 to “infinity”, multiply J(j-1) on the left by Mp 2j times, and then add the result to J(j-1), in order to calculate J(j).
2. To start with, set M0=Mp. Then for each iteration j: first set Mj = Mj-1Mj-1;
then calculate:
J(j) = J(j-1) + Mj-1J(j-1) (16)
There are many possible extensions of this, such as the obvious generalizations based on the repeated application of:
Jnk = (I + (Mp)n + (Mp)2n + ... + (Mp)n(k-1)) Jn , (17)
where I now define (just for equation 17):
, (18)
and where the parameter k can be set to any integer >1, and even varied from iteration to iteration if desired.
These methods, collectively, will be henceforth called Multiresolutional ADP. In conventional artificial intelligence, they would correspond to systems based on “clock-based synchronization.”
It should be noted that the n-step methods described by Sutton [6] have some relation to these methods. However, Sutton’s designs permit a foresight extension of only a factor of 2 (or of k), rather than 2n or kn! He does not demonstrate any awareness of the crucial tricky relation in equation 15.
Both in Multiresolutional ADP and in other temporal chunking designs, it can be extremely useful (when possible and appropriate) to represent a utility function as a growth process, i.e. as:
U(s(t),u(t)) = V(s(t)) - V(s(t-1)), (19)
for some reasonable function V, in the case where r=0. This can permit a substantial reduction in the apparent complexity of the calculations.
2.4 Temporal Chunking: Two-Level Event-Based Designs
If the matrix Mp were a fully populated (fully connected) matrix, it would be very difficult, in principle, to improve upon these multiresolutional methods. However, as a practical matter, the matrix Mp will usually be extremely sparse, in large real-world applications. To reduce the costs still further, in the finite state situation, one can use a domain decomposition approach, in order to exploit this sparsity.
To begin with, let us consider a simple partition design for implementing this approach. Let us assume that the possible states of the plant have been partitioned into blocks. Thus every state s will be denoted by sA,i , where A is the block number and i is the number of the state within the block. The key to this approach is to find a partition such that Pr(B(t+1),j(t+1) | A(t),i(t)) will equal zero, except when block B happens to be one of a very small set of blocks “near to” A. More precisely, if n(A) is the set of blocks B such that a direct transition from A to B is possible, then the key is to find a partition such that n(A) is as small as possible for the “average” block A. This is a two-level design, where the upper level involves the choice of blocks A or B, and the lower level involves the choice of states i and j.
Starting from any block A, for a fixed policy p, we now have two sets of transition matrices to consider: PA, which represents transitions within block A, and PBA, which represents transitions from block A to a different block B. Mirroring equation 11, we then arrive at the matrix MA and the matrices MAB.
For any vector v defined over all of the possible states of the plant, let us write “v |A” to represent that portion of the vector v which applies to states within block A. For example, if there are 100 possible states of the system, of which 15 are in block A, then v will be a vector with 100 components, and v |A will be a vector 15 components, extracted from v. In this notation, the Bellman equation (equation 10) implies, for each block A:
(20)
By matrix algebra, this implies:
(21)
Let us define:
(22a)
(22b)
With these definitions, equation 21 reduces to the following fundamental equation for event-based chunking:
, (23)
where the asterisk indicates matrix multiplication. Equations 22a and 22b imply the following recurrence relations, similar to the Bellman equation:
(24)
(25)
The simple partition design is defined as any incremental dynamic programming design in which:
1. JA and JAB are updated by some sort of use of equations 24 and 25;
2. The global estimate of J is updated or calculated by use of equation 23, in some way;
3. Policies or actions are updated based on J, as usual.
As with ordinary incremental dynamic programming, value updates (updates of J, JA or JAB) or policy updates may be global (all states at once) or local (e.g. state-by-state), and may be scheduled in a variety of ways.
Furthermore, one can reduce computational cost (or architectural complexity) by a considerable amount, by performing only the most essential calculations at appropriate times. For example, in order to develop long-term foresight most efficiently, one can use equation 23 very often only to update J for those states in blocks A which can be reached directly, in one step from states in other blocks. Values of J in other states are needed only when updating actions; such updates typically require the calculation of only a few J values, for the block in which action decisions are currently being updated. Formally, if a state i in block A can be reached directly in one step from block B, then state i will be called a “post-exit” state of block B. Global foresight requires only that J be updated in post-exit states, because those are the states whose J values are actually used in the far right term of equation 23.
2.4.1 Variations to the Simple Partition: Focused Partition and ADAC
There is a very straightforward variation of the simple partition design which has many of the same properties. This variation may be called the focused partition design. In this variation, we define the exit states of any block A as the states from which a direct transition to other blocks is possible. One tries to find a partition which minimizes the number of exit states. In the simple partition design, we always consider those components of a vector v which apply to states within the block A. In the focused partition variant, we consider those states in A which are not exit states, plus those states of other blocks which can transition to block A in a single step. The resulting changes in structure are straightforward, but of some significance. For example, in equation 23, instead of considering J values for state in other blocks B on the far right, we consider only J values for the exit states of block A. But the matrices JA and JAB are extended, so as to provide components of the J values for exit states of some other blocks. In fact, it is most convenient simply to update additional vectors, J+A , which estimate the values of J for the exit states of each block A. It is not necessary to maintain estimates of J for other states. The major value of this variant is to reduce the size of the rightmost term in equation 23. This is especially useful in the completely active variant, where it is important to minimize the number of inputs to the various neural networks.
There is another variant here which should be mentioned, for the sake of completeness, even though it is not a preferred variant. It is possible to define:
J’A(i,u) = U(i,u) + (MA(u)JA) , (26)
in rough notation, where “I” represents a state within block A. This is the most natural way to apply the notion of “Action Dependent HDP” or “Q learning” [8,ch.13] in this context. ADHDP and Q-learning are both examples of what is called “ADAC,” the Action-Dependent Adaptive Critic, in [8].
One can adapt J’A based on transitions within block A, using essentially the same kind of update we would use for JA, just as the classic method ADHDP is essentially the same as HDP, with the same minor variations. This is not the preferred version for large-scale problems, because it entails the same sort of weaknesses (with a few advantages) of ordinary ADHDP and Q learning [8]. However, when the choice of actions is actually very limited an state variables are very complex, this variation can be useful. For low-level control, the choice of actions is usually more continuous, but for higher-level decisions the important choices sometimes are more discrete.
2.4.2 Step-by-Step
Approaches to Learning J, JA,
JAB or J’A
In a pure finite-state problem, direct matrix updates of J, JA, JAB or J’A based on these equations can be very efficient. However, the ultimate purpose of describing the finite-state case in this paper is to set the stage for the full preferred form of the 3-brain architecture, which involves neural networks and learning for large-scale problems. For the sake of simplicity (and step-by-step implementation), the neural net designs will be presented as extensions of finite-state methods.
For neural network designs, especially, it is important to consider methods for updating J, etc., on a step-by-step basis. Such methods can be studied and used in the finite-state case as well.
Even in the finite-state case, there are several choices for how to update JA and JAB, essentially the same as the usual choices in updating J in ordinary ADP [8]. One choice is simply to use the matrix equations, as written (or restricted to exit states, where applicable), for global value or policy updates. Another is to use state-by-state updates. In conventional state-by-state updates, for example, it is common [8] to change the estimate of J(s(t)) in proportion to:
J(s(t+1))/(1+r) + U(s(t),u(t)) - J(s(t)), (27)
where s(t+1) may be the state actually observed after state s(t), as we observe the actual plant, or where s(t+1) may be simulated based on the correct transition probabilities (P). In our case, we may apply exactly the same approach to updating JA. In rough notation, we may update our estimate of JA in any state s(t) as follows:
new JA(s(t)) = old JA(s(t)
+ LR*( “JA(s(t+1))”/(1+r) + U(s(t),u(t)) - old JA(s(t))) , (28)
where LR is some kind of learning rate. (LR will usually be less than zero because of the stochastic nature of this process.) In this equation, “JA(s(t+1))” usually represents our estimate of JA(s(t+1)), which we are not updating here, unless s(t+1)=s(t), in which case we use the old estimate. However, in the case where s(t+1) is not in block A, JA(s(t+1)) is not defined; in that case, we use zero instead of JA(s(t+1)) in equation 28. This is not an ad hoc fix; instead, it follows directly from analyzing equation 24.
This same approach can also be applied in updating the JAB matrices. Notice that JijAB has an important intuitive interpretation: if the system starts out in state i in block A, then JijAB represents the probability that the next future state outside of block A will be state j in block B, discounted by the interest rate r (if r¹0). The simplest appropriate update rule is:
new JjB(s(t)) = old JjB(s(t) + LR*(“JjB(s(t+1))”/(1+r) - old JjB(s(t)), (29)
where I have not bothered to write the additional indices (i and A) representing state s(t), and where “JjB(s(t+1))” represents the old estimate of JijAB except in the case where s(t+1) is no longer in block A. In that case, “JjB(s(t+1))” is defined as 1, if s(t+1) is state j in block B, but otherwise zero.
Actually, there is a further variation of this adaptation procedure for JAB which -- though more complex -- may improve convergence in some cases. One may define a new global transition probability:
(30)
One can then adapt this transition probability using an update rule essentially identical to equation 29, except that it uses a term “PiAB”, which equals 1 if and only if s(t+1) is in block B. One can adapt a conditional J value, J’jB , using the same adaptation rule as equation 29, with J replaced by this J’, except that adaptation is skipped whenever s(t+1) is not in block A or block B. In this variation, one continually updates PAB and J’AB instead of JAB, and one replaces the use of JAB by exploiting the relation:
(31)
All of these state-by-state update procedures are adaptations of the classic methods first proposed as Heuristic Dynamic Programming [8] and later elaborated under the name Temporal Difference methods. As an alternative to such methods, there are of course more classical methods (usually less efficient) for estimating transition probabilities in this kind of situation. For example, in each transit through block A, one may simply record every state visited. For every state in block A, one maintain a description of the distribution of the corresponding eventual exit state. One may then update each such description as soon as one exits A again, and then one can delete the record of this transit. It is straightforward to adapt this method to a neural network approach as well, similar in spirit to Widrow’s original adaptive critic blackjack player, briefly discussed in [2]. Although these approaches are expected to be less robust than the state-by-state update methods, they may be useful as part of a more complex hybrid approach combining both.
Finally, it is also possible to use step-by-step learning procedures to update the global estimates of J, based on equation 23. It has already been noted that we do not actually need to update estimates of J for every state. It is enough simply to update estimates of post-exit states (in the simple partition case) or of exit states (in the focused partition case). For updating action policies it is enough, in effect, to use equation 23 directly to calculate J(s(t+1)) for particular states s(t+1) which might result from u(t).
For updating the essential J values, consider the analogy between equation 23 and the Bellman equation (equation 10), for the simple partition case. Equation 23 is like the Bellman equation, except that s(t) in the Bellman equation corresponds to the first state encountered when entering block A, while s(t+1) corresponds to the first state (j) encountered when exiting the entire block A. This suggests three natural ways to update J on a state-by-state basis. First, we can remember the initial state i encountered when entering block A, and then, upon exiting block A, adapt the estimate of Ji so as to be closer to:
JiA + (Jj/ (1+r)t) , (32)
where t is the number of time periods between entry and exit, which must be remembered as well unless r=0. This is a reality-based update. A reality-based update can only be undertaken once per t time periods, roughly, because it requires that we wait from the start to the exit from the block. Second, we can store i, j and t in some kind of memory, and do a memory-based update at some later time. (This is an example of the learning strategy called “syncretism” in [8,ch.3].) Third, we can pick a possible or remembered (or just conceivable) value for i, and then simulate j (and t, if r does not equal zero). Notice that the matrices JijAB, made up of transition probabilities, can be used to perform this kind of simulation, without any need to simulate intermediate states. Simulation-based updates and memory-based updates of J can be done much more frequently than reality-based updates, because they require only one cycle time of computation. Therefore, the preferred variation for a full, efficient, parallel 3-brain design would involve frequent simulation-based updates and memory-based updates of J, especially for calculations involving large chunking intervals t, even during normal real-time operation of the system.
In actuality, for a neural-network approximation of this system, to be described in section 3, there is an easier step-by-step adaptation rule for these post-exit J estimates. We can simply set:
new Ji = old Ji + LR*(JA(i) + “JB(i, J |B)” - old Ji) , (33)
where the term in quotations refers to the output of a neural network (or other supervised learning system) which is trained to input the J estimates for the post-exit states of block A and a specification of a state i in block A, and to output the complete JAB term for that state i (i.e., to output the summation in equation 23.)
Equation
33 may be called a backwards value update, to contrast it against the
update procedure related to equation 32, which may be called a forwards
value update. (In similar language, equation 23 is used in the classical
approach to provide a backwards value update, but a matrix-based update rather
than a state-by-state
update. Later, neural network approximations provide still
another choice between “pattern learning” and “batch learning” alternatives[2].) The tradeoff between forwards value updates and
backwards value updates is a recurrent issue, even in more complex designs. In
this context, however, the backwards value updates have an advantage, because
they are exact; by contrast, equation 32 involves the usual random disturbances
associated with any statistical sampling method, without any real cost
advantage.
2.4.3 From Passive
Design to Active Design: Locality and
Decision-Making
This section will describe further variants of these designs, also motivated by the desire to provide a basis for neural network learning extensions.
The simple partition and focused partition designs, as described above, were both derived as passive methods -- as methods for efficiently calculating J for a given action policy. To develop more active designs, one can employ two general strategies which will be explained in this section: (1) increased “locality”; (2) explicit “decision-making.” In the full, preferred version of the 3-brain architecture, both of these are used.
In locality, the idea is to make sure that the things one is learning in any region of state space (here, a block) do not depend strongly on what is observed and learned in other regions of space. Locality is already widely used as a strategy in ordinary supervised learning [2]. In that context, it is well known that local designs lead to faster and more robust learning (though often with poor generalization, for reasons which do not apply here), in part because there is no need to unlearn or relearn things as one moves from one block to another.
The simple partition and focused partition designs already provide a reasonable degree of locality. The JA and JAB matrices for a block A depend only on transition probabilities from that block . Thus the crosstangled global learning problem is decomposed into smaller local parts, such that the required foresight horizon in each part is usually very limited. (In principle, one could also apply multiresolution methods within each block, so as to accelerate equations 24 and 25; this is a straightforward application of what we have discussed so far, but it is questionable whether the brain uses such a complex design.) Equation 23 allows one to update the global values by jumping over an entire block in a single step of calculation, in effect. This is a form of event-based chunking, because equation 23 provides temporal chunking, and the chunks are defined by events (exiting from a block) rather than by predetermined time intervals.
This design is actually somewhat active, in the following sense. Actions within any block A should affect only the transition probabilities -- and hence the JA and JAB -- in that block. However, the design is not completely active or local, because the J values used when selecting actions u are the J values for the relevant block, calculated by equation 23; this calculation in turn, does depend on some J values for states in blocks B. Thus any change in those global J values would change the actions within the block. This in turn implies that the action policy within the block will depend on global variables outside the block. Thus the action policy within the block, the transition probabilities within the block, and even JA and JAB themselves, are all subject to change to some degree, depending on things outside of the block.
In order to upgrade the design to make it completely local, one can replace the matrices JA and JAB and the local action policy with neural networks or the like. Section 3 will describe this kind of process in more detail. In the classical context, JA is essentially just a lookup table which, for each state in block A, yields an estimate of JA (a component of the J function of dynamic programming). However, one can replace JA with either a neural network or a lookup table full of neural networks, which inputs both the state and the specific J values for blocks B which affect block A via equation 23. Those specific J values are the J values for the “post exit states” of block A, the states which can be reached from block A directly in a single transition. Equation 24, for example, provides the target which can be used to train that neural network (or any other supervised learning system [2]) used for that purpose. Essentially the same choice of representations also applies to the network(s) which represent JAB, although, as discussed before, there are some further variants possible for JAB to improve convergence. Even in the passive case, there are several ways to represent an action policy (exactly as in ordinary incremental or approximate dynamic programming [3,5,8]); for the completely local variant of the simple partition design, the action policy itself would also be represented either as a neural network or something similar [2], or a lookup table of separate neural networks for each state. The preferred embodiment for large-scale control tasks would involve a single neural network for each of these components. Because the dependence on information outside of block A can be accounted for by these networks, they should make it possible to learn the relevant information (the three networks, normally) depending only on information within the block itself; in principle, changes outside of the block should not require any change in these networks. This kind of complete locality has many advantages.
Experts in AI may ask how this kind of structure could accommodate rapid changes in estimates of “J” within a block, which exploit the power of local search activities. The answer, in brief, is that the neural networks used to approximate JA and JAB can have fast-learning local components (i.e. local components as in supervised learning.) There is no inherent conflict between locality as described here, and the ability to exploit the power of local search.
For a full-fledged system of brain-like intelligence, one must go even further and in notion which may be called “decision-making” (or “decisiveness” or “action chunking.”)
The partitions described above are very passive in nature. They do apply to large-scale decision problems in the real world, even more than they apply to abstract finite-state problems. They reflect the fact that the state of our world (unlike movies or TV shows) does not often change in a split second from one state to something radically different. The possibilities of transition from one state to another are very constrained, regardless of what actions we take. For a strict application of the designs discussed so far, the partitions and blocks must be defined so that they allow for any possible choice of actions. (In practice, however, if we have ruled out certain kinds of actions, we need only focus on transitions which are possible for the currently-considered choice of actions.)
In larger, more realistic problems, we can achieve a tighter partition, and therefore more efficient calculation, by using a concepts of decisions or action schemata or task modules or verbs in place of these large, passive blocks. (Sutton [6] refers to “abstract actions” in an intuitive way, which does not relate to the machinery here.)
There are several ways of implementing this idea of “decision-making.” In the pure finite-state case, one would usually require that the decision options -- like the simple partitions above -- must be specified in advance, at least in terms of an initial local action policy and in terms of the entry states and exit states. The best techniques for learning the set of options (entries, exits..) involve fuzzy chunking and such, to be discussed in a later section.
In the simplest variant, we assume that the user has supplied a list of decision-blocks, rather than a set of simple blocks. But within each decision block, there is still a choice of actions, and a definite pre-specified set of exit states and post-exit states. The one new feature is that, whenever we encounter a post-exit state, we have a choice of several possible decision blocks to enter. Thus, in any post-exit state x, we have a choice of D decision blocks we can choose from (a subset of all the decision-blocks in the system). For each block number d (where 1<d<D), there should be block-specific matrices Jd and JddB, exactly analogous to the matrices JA and JAB discussed before. But then, if we use equation 23, we would have D different estimates of the value of J(x,d), depending on the choice of decision, without even considering how to handle J|B !! Of course, the proper procedure is that, upon entering x, we choose the decision d which maximizes:
(34)
Strictly speaking, we do not really need to identify blocks B as such; this is really a short-hand way of saying that the sum is to be taking over the post-exit states of decision block d. To perform this summation, we need to know the estimates of J in all the post-exit states, just as before. Note that when we decide on the decision block for state x, we can at that time update the estimate of J in that state to be closer to the estimate indicated by equation 34! Thus equation 34 is both a recipe for making decisions, and for updating global J estimates. It is a kind of higher-level Bellman equation, even more than equation 23 is!
Once we have made a decision -- i.e., entered a particular decision block -- it is appropriate to update Jd and JdB only for that decision d (not for the alternative decisions we might have made!) based on current real experience, until we have exited from that decision block.
Unfortunately for the control theorist, the simplest variant here is still not quite enough to explicitly capture the full process of decision-making by mammalian brains. Even after a decision is made, it is sometimes useful or necessary to abort the decision before the normal post-exit states are reached. There are three phenomena to be considered here: (1) failure to complete the task; (2) changing one’s mind, sometimes because of new opportunities arising, sometimes due to incipient failure, etc.; (3) modification of the goals of the task. The third of these is beyond the scope of this section. For strict locality, we can and must account for the first two simply by redefining the post-exit states to include failure states and states of changed mind. (To allow changing of minds, one applies equation 34 to all possible states where it may be reasonable to consider changing one’s mind.) But as a practical matter, it is more reasonable to build a system which tries to learn all the important exit modes, in a purely active/local manner, but which remains somewhat consistent by using equation 23 in the older passive mode (as per section 2.4.2) when unexpected exits occur. In either case, equations 23, 24, 25 and 34 remain the basis of system adaptation, both for expected post-exit states and unexpected ones.
In the limit, one might imagine using equation 23, as in section 2.4.3 and above, for all the possible decision blocks within a larger, passive block of possible states. But this simply reduces to the lower-level ADAC design (equation 26), using the larger passive block as the basis for partition! The decision-based additional locality is simply lost altogether. This analysis leads to two observations of relevance to further design work: (1) use of lower-level ADAC or mixture of expert designs to output actions can be a useful first step in suggesting initial possibilities for decision blocks, i.e. a useful part of the growing/pruning process for these systems; (2) because the ADAC approach does not fully capture the local approach, it is better -- when possible -- to try to learn the unexpected post exist states, so that they will not be unexpected in the future.
2.5 Temporal-Chunking: Multi-level Task-Based Designs
The previous discussion in section 2.4 only referred to two levels of organization -- the block level and the state level. How far does this extend our effective vision into the future -- the key problem discussed in section 2.3? If the system tends to stay in the same block for k periods of time, then we extend our vision only by a factor of k. Thus the extension of foresight is only like Sutton’s k-step-ahead approach discussed in section 2.3, not like the full kn approach. There are still some advantages over multiresolutional methods, because of the sparsity of matrices here, but the benefit to foresight is more limited.
In order to achieve a kn improvement in foresight, with an event-based architecture, we can extend the designs of the previous section in order to build a multilevel hierarchy. This section will show how to do this, in the example of a three-level hierarchy. In order to extend this result to a hierarchy of more levels, one can simply use the same adaptation rule