**A 3-Brain Architecture for Brain-Like Intelligence**

1 **Introduction**

**1.1 Context**

In the development of learning systems and neural networks, the issue of complexity occurs at many levels of analysis.

Most of the literature on complexity and neural networks addresses the task of __supervised learning__ -- the fundamental task of learning a mapping from an input vector **X**(t) to a vector of targets.

This paper will address an equally important, complementary aspect of the complexity issue -- the problem of how to build larger-scale intelligent control systems able to cope with truly large and complex environments. In particular, the paper will describe the design of a new class of systems to perform optimization over time. Many other papers (e.g. in [1,15-18]) have discussed why this is a reasonable approach to explaining and replicating the kind of higher-order intelligence found in the mammalian brain; however, the previous designs __based on__ that approach have had certain limitations so far in their ability to handle truly complex environments. These designs have been more than enough for the current types of challenges being addressed at the frontiers of control engineering [2]; however, in order to handle those kinds of complexity which the mammalian brain can cope with, one must upgrade these designs, so as to exploit the phenomena of “temporal chunking” and “spatial chunking,” as will be discussed below. A large part of the challenge here is to build a truly brain-like neural learning-based design which can __learn__ to achieve the kinds of capabilities which have already been achieved in some special applications, using mainly hard-wired algorithms, in the domain of artificial intelligence. This paper will discuss how to meet that challenge.

In many ways, the design approach given here is a __beginning__ rather than a __conclusion__. The paper leads to many possible variations in how to implement the approach, and to many questions about what will work best, under what conditions. It also makes very strong assumptions about the reader’s prior knowledge, especially of the papers cited below.

The
text which follows is taken __directly__ from an international patent application filed on July 4, 1997, by this author, in
association with Scientific Cybernetics, Inc. That application also
incorporated several of
the papers cited below, and some additional information on
possible applications, for example, in strategic game-playing, in hypersonics,
in the control of complex heat exchanger
systems or fuel processors, and in distributed large-scale network
control. For the sake of authenticity and timeliness, the text is being
published without alteration, except for
the addition of some new
references and for copy editing required
to meet the standards of this publisher.

The relation between this design and the brain has also been discussed with the neuroscientist and psychologist Karl Pribram, as reported in part in [19]. The “decision blocks” which form the core of this design could be thought of either as “task modules” (as in Pribram’s earlier work) or as “episodes,” as in his more recent work [20]; in fact, the proposed structure has the ability to __learn__ to be more goal-oriented (like a task module) or to be more reactive in nature (like an episode processor), depending on what works best in specific situations. The “loops” or “layers” in the basal ganglia/cortex/thalamus circuit have been discussed not only by Brooks, but also by John Taylor, James Houk, Seng-Beng Ho and several others, whose thoughts and guidance I am most grateful for.

**1.2 A Three-Brain Architecture for Artificial Intelligence**

The purpose of this paper is to describe a method for building a new type of general-purpose artificial intelligence which will be called a “3-brain system” or “3-brain architecture.”

This architecture was originally motivated by an effort to understand and replicate the kinds of problem-solving capability and learning which exist in the brains of mammals [1]. The details of these efforts are described in several papers filed with the preliminary patent applications and included as part of this disclosure. An additional paper, also included [2], describes some of the underlying engineering principles and ideas for how to use these designs in practice. Some of these papers were written on government time, but the original ideas therein were developed by the author independently, usually in periods of reflection outside of the office.

The 3-brain architecture is not, strictly speaking, a single design. It is a design methodology, which can be implemented in a variety of ways, on a variety of platforms. The architecture can be embodied in chips, in opto-electronic hardware, in biomolecular hardware, or in software. As a practical matter, the software emulation will probably come first, in order to permit the careful evaluation and tradeoff studies necessary to justify the more expensive step of building dedicated hardware.

This
design methodology may be thought of as an *improvement*
to certain architectures previously developed (and partly patented) by this author.
More precisely, it is an extension of the *model-based
adaptive critic* (MBAC) or “brain-like intelligent control” designs
described in the attached papers. Those designs, in turn, may be seen as
extensions of Ron Howard’s methods of dynamic programming [3].

In
general, all of these designs can be expressed as designs for learning-based
maximization of utility over multiple time periods into the future. This includes “reinforcement learning” as a
special case [4]. In reinforcement learning, an intelligent system is given
access to a set of sensor inputs, a set of actuators which it controls (i.e.
its outputs are wired up to the actuators), and to a monitor which evaluates
its performance or “utility” (U). Through learning and/or exploration, it develops
a strategy or “policy” of action which enables it to maximize (or minimize)
total utility in the future. These reinforcement learning systems are intended
to be *general-purpose* systems,
because the same learning system can be applied to different applications,
simply by having it learn to adapt to these various applications separately.
The 3-brain architecture is a major, qualitative improvement over the earlier
designs, insofar as it has the potential ability to learn to cope with far more
difficult applications. In effect, it is a general purpose system for making
intelligent decisions.

This paper will describe how to build a 3-brain system, by a series of progressive improvements, starting from incremental dynamic programming, which will be reviewed. The first part of the paper will describe new designs for “temporal chunking” with reinforcement learning, in a classical context. The second part will describe how to replace the matrices in these new designs with neural networks (or similar structures), in order to permit larger-scale applications based on the ability of neural networks to approximate complex nonlinear relations in a parsimonious way. The third part will describe new, more sophisticated neural networks (and their nonneural generalizations) which should perform better than conventional neural networks as components of these designs; in addition, it will describe some possible hardware implementation of the most critical, computationally expensive components, and address the extension of this system to incorporate aspects of “spatial chunking.”

Crudely speaking, parts 2 through 4 of this paper will describe how to construct the “upper brain” and the “middle brain” as described in [1]. Part 5 will describe briefly how to attach such a higher-level intelligent system to a lower level “brain” so as to construct a complete “3 brain” system.

The technological intention here is to first build up a modular software package, in which a full three-brain system can be obtained by linking together the appropriate modules. (Before this is fully completed, however, some of the hardware development recommended in part 4 will be started, if possible.) However, for practical purposes, the user of this software will also be able to use simpler modules, or alternative modules, to fill in the various components, if he or she chooses.

The step-by-step approach to constructing this kind of intelligent system is intended to provide additional flexibility in the overall package, flexibility which a monolithic implementation would not possess.

Notice
that this paper will suggest several alternate designs, based on the same
general approach or
method, to perform subsystem tasks. In
earlier research, in past years, the author often tried to specify one __best__
alternative for each subsystem. But experience showed that
different alternatives
worked better in different applications. Thus the intention is to
build a __general__ system which provides the user a choice of subsystems, so as
to allow extensive tradeoff studies.

**2 Time-chunked Approximate Dynamic Programming**

**2.1 Notation and Classical Results**

In the simplest forms of dynamic
programming, we assume that the environment or the plant to be controlled can
only exist in one of a finite number of possible states. These possible states
may be denoted as s_{1} ,..., s_{i}
,..., s_{n} , where n is the number of states. At each time t, the
intelligent decision-making system observes the state s(t)
(where s is an integer between 1 and n), and then outputs a vector containing
decisions or control variables, ** u**(t).
Usually the intelligent system will choose

**u**(t) = **u**^{p}(s(t)) , (1)

where **u**^{p} is its current policy.
Normally the user of the system provides a utility function U and an interest rate r. We are asked to design an intelligent system which can learn the optimal policy, the policy which at any time t will maximize:

⟨ Σ_{k=0}^{∞} U(s(t+k), **u**(t+k)) / (1+r)^{k} ⟩ (2)

where the angle brackets denote expectation value. (It is a straightforward well-known extension of this to consider finite horizon problems, in which t goes to some finite maximum T. Also, it is common to build designs in which r is initially set to a high value -- even infinity -- in the first few iterations, and lowered slowly to the user-specified value, as a method for improving learning.)

Normally it is assumed that we know the transition probabilities as a function of action, which may be written:

P_{ij}(**u**) = Pr( s(t+1) = s_{i} | s(t) = s_{j}, **u**(t) = **u** ) (3)

For a __particular__ policy p, we may define the classic J function as:

J^{p}(s(t)) = ⟨ Σ_{k=0}^{∞} U(s(t+k), **u**(t+k)) / (1+r)^{k} ⟩ (4)

and:

J^{p}(s(t)) = ⟨ U(s(t), **u**(t)) + J^{p}(s(t+1))/(1+r) ⟩ (5)

The basic theorems of incremental dynamic programming [3,5] describe the properties of this function J, which must normally obey the Bellman equation:

J(s(t)) = max_{**u**(t)} ⟨ U(s(t), **u**(t)) + J(s(t+1))/(1+r) ⟩ (6)

Note that this equation also provides a
basis for actually choosing ** u**(t) at any time t.
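As an illustration, the action choice implied by equation 6 can be sketched for the finite-state case. This is a hypothetical sketch (the function name and data layout are illustrative, not from the source); it assumes one transition matrix per action from a finite action set, with columns indexed by the current state, consistent with equation 8.

```python
import numpy as np

def greedy_action(s, J, P, U, r):
    """Choose u(t) by maximizing the right-hand side of the Bellman
    equation: <U(s,u) + J(s(t+1))/(1+r)>, over a finite action set.

    P[u][j, i]: Pr(s(t+1)=j | s(t)=i, action u)  (columns = current state)
    U[u][i]:    expected utility of taking action u in state i
    J:          current estimate of the value vector
    """
    values = [U[u][s] + (P[u][:, s] @ J) / (1.0 + r) for u in range(len(P))]
    return int(np.argmax(values))
```

Note that the expectation over s(t+1) reduces here to a dot product between a column of the transition matrix and the current critic estimate **J**.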

In
the case where there are only a finite number of possible states s, we may
define a state probability vector **p** by defining its components:

p_{i} = Pr(s_{i}) (7a)

p_{i}(t) = Pr(s(t) = i) (7b)

In this notation, we have, for any policy p:

**p**(t+1) = P^{p} **p**(t) (8)

Likewise, we may define the vectors __J__^{p} and __U__^{p} by defining their
components:

J_{i}^{p} = J^{p}(s_{i}) (9a)

U_{i}^{p} = ⟨ U(s_{i}, **u**^{p}(s_{i})) ⟩ (9b)

In this notation, equation 4 takes the form:

__J__^{p} = __U__^{p}
+ M^{p}__J__^{p} , (10)

where we define:

M^{p}
= (P^{p})^{T}/(1+r) (11)

**2.2 Classical Approaches**

In
traditional incremental dynamic programming [3,5,6],
the optimal policy is found by some kind of alternation between “value updates”
and “policy updates.” One starts with something like an arbitrary policy p and
an arbitrary estimate of the value vector ** J**.
One learns the optimal policy simply by progressive improvements in the policy
and in the estimated value vector. The value updates are generally based on
equations 6 and 10, translated into:

__J__^{(n+1)} = __U__^{p} + M^{p}__J__^{(n)} (12)

In other words, for the current policy p, one replaces the old estimate of **J** (__J__^{(n)}) with this new estimate (__J__^{(n+1)}). In a policy update, one simply picks **u**(t) at each time t so as to maximize the right-hand side of equation 6, using the current estimate of **J**.

Broadly speaking, there is
another important choice of strategies in making these kinds of updates. In a *passive* strategy, one simply *assumes* the current policy
p,
and carefully works out J in detail. In the *active*
approach, one explicitly designs the value-updating system so as to permit more
frequent changes in the policy p and more explicit allowance for the effects of such
changes.
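To make the alternation of value updates and policy updates concrete, here is a minimal sketch for the finite-state, finite-action case with known transition probabilities. All names are hypothetical; it interleaves greedy policy updates based on equation 6 with value updates based on equation 12.

```python
import numpy as np

def solve_mdp(P, U, r, n_sweeps=200):
    """Alternate policy updates (eq. 6) and value updates (eq. 12).

    P[u][j, i]: Pr(s(t+1)=j | s(t)=i, action u)  (columns = current state)
    U[u][i]:    utility of action u in state i
    """
    n = P[0].shape[0]
    J = np.zeros(n)
    for _ in range(n_sweeps):
        # policy update: maximize the RHS of the Bellman equation per state
        Q = np.stack([U[u] + (P[u].T @ J) / (1.0 + r) for u in range(len(P))])
        policy = Q.argmax(axis=0)
        # value update: J^(n+1) = U^p + M^p J^(n), with M^p = (P^p)^T/(1+r)
        J = Q[policy, np.arange(n)]
    return J, policy
```

This is the "active" extreme, in the terminology above: the policy is revised on every sweep; a passive strategy would instead hold the policy fixed while iterating the value update to convergence.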

**2.3 Temporal Chunking: Multiresolutional Designs**

Traditional approaches to dynamic programming and to approximate dynamic programming (ADP) are generally based on “backups” or “value updates” from time t+1 to time t, as implicitly assumed in equation 12. But in a real-time control system, the interval between time t and time t+1 (the sampling interval) may be very short. The literature on artificial intelligence has stressed the need to jump over longer time intervals [7]; however, this kind of “temporal chunking” has yet to be implemented in effective learning-based ADP designs.

In theory, the usual ADP designs
should all converge to the correct policy, anyway, *if given enough time*. But there is a problem here with *computational cost* and *computational complexity*. To put it
another way, new designs which inject time chunking into ADP should lead to
reductions in computational cost and complexity, which in turn should make it
possible to handle more complex applications at acceptable cost.

To understand these cost issues, return to equation 12. For simplicity, assume a purely passive approach, in which we try to find the correct J function (__J__^{p}, in effect) for a fixed policy p. Assume that the initial estimate of **J** -- __J__^{(0)} -- is zero. Then n applications of equation 12 yield:

__J__^{(n)} = (I + M^{p} + (M^{p})^{2} + ... + (M^{p})^{n-1}) __U__^{p} (13)

Thus after n complete value updates, the “critic” (the estimate of J) “sees” only n periods of time into the future, in effect. Equation 13 is just an estimate of the true value:

__J__^{p} = (I - M^{p})^{-1} __U__^{p} (14)

In order to learn the true ** J** much more quickly, one may
exploit the following numerical identity (for the limiting case, assuming no
singularities, as usual):

__J__^{p} = ...
(I + (M^{p})^{16})(I
+ (M^{p})^{8})(I
+ (M^{p})^{4})(I
+ (M^{p})^{2})(I+M^{p})** U** (15)

Using this approach, after only n steps
of calculation, one “sees” 2^{n} periods of time into the future. There
are two ways to implement this approach:

1. For each number j, from j=1 to “infinity”, multiply __J__^{(j-1)} on the left by M^{p} 2^{j-1} times, and then add the result to __J__^{(j-1)}, in order to calculate __J__^{(j)}.

2. To start with, set M_{0}=M^{p}. Then for each iteration j: first set M_{j} = M_{j-1}M_{j-1}; then calculate:

__J__^{(j)} = __J__^{(j-1)} + M_{j-1}__J__^{(j-1)} (16)

There are many possible extensions of this, such as the obvious generalizations based on the repeated application of:

__J___{nk} = (I + (M^{p})^{n} + (M^{p})^{2n} + ... + (M^{p})^{n(k-1)}) __J___{n} , (17)

where I now define (just for equation 17):

__J___{n} = (I + M^{p} + (M^{p})^{2} + ... + (M^{p})^{n-1}) __U__^{p} , (18)

and where the parameter k can be set to any integer >1, and even varied from iteration to iteration if desired.

These methods, collectively, will be henceforth called Multiresolutional ADP. In conventional artificial intelligence, they would correspond to systems based on “clock-based synchronization.”
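The second implementation above (equation 16, with repeated squaring of the matrix) can be sketched as follows. This is an illustrative sketch under the assumption that the initial estimate __J__^{(0)} = __U__^{p}; after j iterations the result matches 2^{j} ordinary value updates of equation 12 started from zero.

```python
import numpy as np

def multiresolutional_update(M, U, j_max):
    """Eq. 16: J^(j) = J^(j-1) + M_{j-1} J^(j-1), with M_j = M_{j-1} M_{j-1}.
    Starting from J^(0) = U, the critic 'sees' 2^j periods after j steps."""
    J = U.copy()
    Mj = M.copy()
    for _ in range(j_max):
        J = J + Mj @ J
        Mj = Mj @ Mj        # squaring doubles the foresight horizon
    return J
```

The cost is O(j) matrix operations for a foresight horizon of 2^{j}, versus O(2^{j}) applications of equation 12 for the same horizon.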

It
should be noted that the n-step methods described by Sutton [6] have some
relation to these methods. However, Sutton’s designs permit a foresight
extension of only a factor of 2 (or of k), rather than 2^{n} or k^{n}! He does not demonstrate any awareness of the
crucial tricky relation in equation 15.

Both
in Multiresolutional ADP and in other temporal chunking designs, it can be
extremely useful (when possible and appropriate) to represent a utility
function as a *growth process*, i.e.
as:

U(s(t),** u**(t))
= V(s(t)) - V(s(t-1)), (19)

for some reasonable function V, in the case where r=0. This can permit a substantial reduction in the apparent complexity of the calculations.
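To see why the growth-process form helps, note that with r = 0 the future utility sum telescopes (a standard observation, spelled out here for clarity):

```latex
\sum_{\tau=1}^{T} U(s(t+\tau),\mathbf{u}(t+\tau))
  = \sum_{\tau=1}^{T} \bigl[\, V(s(t+\tau)) - V(s(t+\tau-1)) \,\bigr]
  = V(s(t+T)) - V(s(t)),
```

so that total future utility depends only on the initial and final states, not on the path between them.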

**2.4 Temporal Chunking: Two-Level Event-Based Designs**

If
the matrix M^{p}
were a fully populated (fully connected) matrix, it would be very difficult, in
principle, to improve upon these multiresolutional methods. However, as a
practical matter, the matrix M^{p} will usually be extremely sparse, in large
real-world applications. To reduce the costs still further, in the finite state
situation, one can use a *domain
decomposition* approach, in order to exploit this sparsity.

To
begin with, let us consider a *simple
partition* design for implementing this approach. Let us assume that the possible
states of the plant have been partitioned into *blocks*. Thus every state s will be denoted by s_{A,i} , where A is the block number and i is the number
of the state within the block. The key to this approach is to find a partition
such that Pr(B(t+1),j(t+1) | A(t),i(t)) will equal
zero, except when block B happens to be one of a very small set of blocks “near
to” A. More precisely, if n(A) is the set of blocks B
such that a direct transition from A to B is possible, then the key is to find
a partition such that n(A) is as small as possible for the “average” block A.
This is a two-level design, where the upper level involves the choice of blocks A or B, and
the lower level involves the choice of states i and j.

Starting
from any block A, for a fixed policy p, we now have *two* sets of transition matrices to
consider: P^{A}, which represents transitions within block A, and P^{BA},
which represents transitions from block A to a different block B. Mirroring
equation 11, we then arrive at the matrix M^{A} and the matrices M^{AB}.

For any vector **v** defined over the entire set of states, let __v___{A} denote the subvector of its components for the states within block A. In this notation, equation 10 implies, for the components in block A:

__J___{A} = __U___{A} + M^{A}__J___{A} + Σ_{B} M^{AB}__J___{B} (20)

By matrix algebra, this implies:

__J___{A} = (I - M^{A})^{-1}__U___{A} + Σ_{B} (I - M^{A})^{-1}M^{AB}__J___{B} (21)

Let us define:

__J__^{A} = (I - M^{A})^{-1}__U___{A} (22a)

J^{AB} = (I - M^{A})^{-1}M^{AB} (22b)

With these definitions, equation 21 reduces to the following fundamental equation for event-based chunking:

__J___{A} = __J__^{A} + Σ_{B} J^{AB}*__J___{B} , (23)

where the asterisk indicates matrix multiplication. Equations 22a and 22b imply the following recurrence relations, similar to the Bellman equation:

__J__^{A} = __U___{A} + M^{A}__J__^{A} (24)

J^{AB} = M^{AB} + M^{A}J^{AB} (25)

The __simple partition design__ is
defined as any incremental dynamic programming design in which:

1. __J__^{A}
and J^{AB} are updated by some sort of use of equations 24 and 25;

2. The global estimate of ** J** is updated or calculated by
use of equation 23, in some way;

3. Policies or actions are updated
based on ** J**, as usual.
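The matrix form of this design can be sketched as follows, assuming the standard block decomposition of the Bellman equation sketched above. All names are hypothetical, and M^{A} and each M^{AB} are assumed to be already divided by (1+r), as in equation 11.

```python
import numpy as np

def block_terms(MA, MAB_list, UA):
    """Within-block quantities for one block A:
         J^A  = (I - M^A)^{-1} U_A     (value accrued before exiting A)
         J^AB = (I - M^A)^{-1} M^AB    (discounted exit probabilities)
    MA and each M^AB are assumed already divided by (1+r)."""
    I = np.eye(MA.shape[0])
    JA = np.linalg.solve(I - MA, UA)
    JAB = [np.linalg.solve(I - MA, MAB) for MAB in MAB_list]
    return JA, JAB

def compose_global(JA, JAB, J_blocks):
    """Event-based chunking: J (states of A) = J^A + sum_B J^AB @ J_B."""
    return JA + sum(M @ JB for M, JB in zip(JAB, J_blocks))
```

Because each block's terms depend only on transitions from that block, the expensive solves are local; the global composition is a cheap matrix-vector step that jumps over the whole block at once.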

As with ordinary incremental dynamic programming, value updates (updates of **J**, __J__^{A} and J^{AB}) may be interleaved with policy updates in many different ways.

Furthermore, one can reduce computational cost (or architectural complexity) by a considerable amount, by performing only the most essential calculations at appropriate times. For example, in order to develop long-term foresight most efficiently, one can use equation 23 very often only to update **J** for those states in blocks A which can be reached from the current block.
**2.4.1 Variations to the Simple Partition: Focused Partition and ADAC**

There
is a very straightforward variation of the simple partition design which has
many of the same properties. This variation may be called the __focused
partition__ design. In this variation, we define the __exit states__ of
any block A as the states from which a direct transition to other blocks is
possible. One tries to find a partition which minimizes the number of exit
states. In the simple partition design, we always consider those components of a vector **v** which apply to states within the block A. In the focused partition variant, we consider only those states in A which are exit states.

There is another variant here which should be mentioned, for the sake of completeness, even though it is not a preferred variant. It is possible to define:

J’^{A}(i,**u**) = U(i,**u**) + ⟨ J’^{A}(s(t+1), **u**(t+1)) ⟩ / (1+r) , (26)

in rough notation, where “i” represents a state within block A. This is the most natural way to apply the notion of “Action Dependent HDP” or “Q learning” [8,ch.13] in this context. ADHDP and Q-learning are both examples of what is called “ADAC,” the Action-Dependent Adaptive Critic, in [8].

One can adapt J’^{A}
based on transitions within block A, using essentially the same kind of update
we would use for __J__^{A}, just as the classic method ADHDP is
essentially the same as HDP, with the same minor variations. This is not the
preferred version for large-scale problems, because it entails the same sort of
weaknesses (with a few advantages) of ordinary ADHDP and Q learning [8]. However, when the choice of actions is
actually very limited and state variables are very
complex, this variation can be useful. For low-level control, the choice of
actions is usually more continuous, but for higher-level decisions the
important choices sometimes are more discrete.

**2.4.2 Step-by-Step Approaches to Learning J, J^{A}, J^{AB} or J’^{A}**

In a pure finite-state problem, direct matrix updates of **J**, __J__^{A} and J^{AB} are entirely possible.

For neural network designs, especially, it is important to consider methods for updating **J**, etc., on a state-by-state or pattern-by-pattern basis.

Even
in the finite-state case, there are several choices for how to update __J__^{A} and J^{AB}, essentially the same as
the usual choices in updating ** J**
in ordinary ADP [8]. One choice is simply to use the matrix equations, as
written (or restricted to exit states, where applicable), for global value or policy updates. Another is to use
state-by-state updates. In conventional
state-by-state updates,
for example, it is common [8] to change
the estimate of J(s(t)) in proportion
to:

J(s(t+1))/(1+r) + U(s(t),u(t)) - J(s(t)), (27)

where s(t+1)
may be the state __actually observed__ after
state s(t), as we observe the actual plant, or where s(t+1) may be __simulated__
based on the correct transition probabilities (P). In our case, we may apply
exactly the same approach to updating __J__^{A}.
In rough notation, we may update our estimate of __J__^{A}
in any state s(t) as follows:

new J^{A}(s(t)) = old J^{A}(s(t))

+ LR*( “J^{A}(s(t+1))”/(1+r) + U(s(t),**u**(t)) - old J^{A}(s(t)) ) , (28)

where LR is some kind of learning rate. (LR will usually be less than one, because of the stochastic nature of this process.) In this equation, “J^{A}(s(t+1))” usually represents our estimate of J^{A}(s(t+1)),
which we are not updating here, unless s(t+1)=s(t), in which case we use the
old estimate. However, in the case where s(t+1) is not
in block A, J^{A}(s(t+1)) is not defined; in that case, we use zero
instead of J^{A}(s(t+1)) in equation 28. This is not an ad hoc fix;
instead, it follows directly from analyzing equation 24.
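This update rule, including the zero substitution for out-of-block successors, can be sketched as follows. The function name and the dictionary representation of J^{A} are illustrative only.

```python
def update_JA(JA, s, s_next, u, r, lr, in_block):
    """State-by-state update of J^A (eq. 28). JA maps states of block A
    to value estimates; u is the observed utility U(s(t), u(t)). When
    s_next falls outside block A, its J^A value is taken to be zero
    (as implied by eq. 24)."""
    successor = JA[s_next] if in_block(s_next) else 0.0
    JA[s] += lr * (successor / (1.0 + r) + u - JA[s])
```

The same skeleton, with the utility term dropped and a 0/1 exit indicator as the target, yields the update for the J^{AB} terms.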

This
same approach can
also be applied in updating the J^{AB} matrices. Notice that J_{ij}^{AB}
has an important intuitive interpretation: if the system starts out in state i
in block A, then J_{ij}^{AB} represents the probability that
the next future state __outside__ of block A will be state j in block B,
discounted by the interest rate r (if r≠0). The simplest
appropriate update rule is:

new J_{j}^{B}(s(t)) = old J_{j}^{B}(s(t)) + LR*( “J_{j}^{B}(s(t+1))”/(1+r) - old J_{j}^{B}(s(t)) ) , (29)

where I have
not bothered to write the additional indices (i and A) representing state s(t),
and where “J_{j}^{B}(s(t+1))” represents the old estimate of J_{ij}^{AB}
__except__ in the case where s(t+1) is no longer in block A. In that case,
“J_{j}^{B}(s(t+1))” is defined as 1,
if s(t+1) is state j in block B, but otherwise zero.

Actually,
there is a further
variation of this adaptation procedure for J^{AB} which
-- though more complex -- may improve convergence in some cases. One may define
a new global transition probability:

P_{i}^{AB} = Pr( the first future state outside of block A lies in block B | s(t) = s_{i} ) (30)

One can then adapt this transition probability
using an update rule essentially identical to equation 29, except that it uses
a term “P_{i}^{AB}”,
which equals 1 if and only if s(t+1) is in block B. One can adapt a __conditional__
J value, J’_{j}^{B} , using the same
adaptation rule as equation 29, with J replaced by this J’, __except that__
adaptation is skipped whenever s(t+1) is not in block A or block B. In this
variation, one continually updates P^{AB} and J’^{AB} instead of J^{AB},
and one replaces the use of J^{AB} by exploiting the relation:

J_{ij}^{AB} = P_{i}^{AB} J’_{ij}^{AB} (31)

All of these state-by-state update procedures are adaptations of the classic methods first proposed as Heuristic Dynamic Programming [8] and later elaborated under the name Temporal Difference methods. As an alternative to such methods, there are of course more classical methods (usually less efficient) for estimating transition probabilities in this kind of situation. For example, in each transit through block A, one may simply record every state visited. For every state in block A, one maintains a description of the distribution of the corresponding eventual exit state. One may then update each such description as soon as one exits A again, and then one can delete the record of this transit. It is straightforward to adapt this method to a neural network approach as well, similar in spirit to Widrow’s original adaptive critic blackjack player, briefly discussed in [2]. Although these approaches are expected to be less robust than the state-by-state update methods, they may be useful as part of a more complex hybrid approach combining both.

Finally, it is also possible to use step-by-step learning procedures to update the global estimates of **J**, based on equation 23. It has already been noted that we do not actually need to update estimates of J for every state in every block.

For
updating the essential J values, consider the analogy between equation 23 and
the Bellman equation (equation 10), for the simple partition case. Equation 23
is like the Bellman equation, except that s(t) in the Bellman equation corresponds to the __first__
state encountered when entering block A,
while s(t+1) corresponds to the first state (j) encountered when exiting the
entire block A. This suggests three natural ways to update J on a state-by-state
basis. First, we can __remember__ the initial state i encountered when
entering block A, and then, upon exiting block A, adapt the estimate of J_{i
}so as to be closer to:

J_{i}^{A} + ( J_{j} / (1+r)^{τ} ) , (32)

where τ is the number of time periods between entry and exit, which must be remembered as well unless r=0. This is a __reality-based__ update. A reality-based update can only be undertaken once per τ time periods, roughly, because it requires that we wait from the start to the exit from the block. Second, we can store i, j and τ in
some kind of __memory__, and do a __memory-based__ update at some later
time. (This is an example of the learning strategy called “syncretism” in [8,ch.3].) Third, we
can pick a possible or remembered (or just conceivable) value for i, and then __simulate__ j (and τ, if r does not equal zero). Notice that
the matrices J_{ij}^{AB}, made up of __transition
probabilities__, can be used to perform this kind of simulation, without any
need to simulate intermediate states. __Simulation-based__ updates and __memory-based__
updates of J can be done much more frequently than reality-based updates,
because they require only one cycle time of computation. Therefore, the
preferred variation for a full, efficient, parallel 3-brain design would
involve frequent simulation-based updates and memory-based updates of **J**, especially for calculations involving large chunking intervals τ, even during normal real-time operation of the system.
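One such simulation-based update (for the r = 0 case) can be sketched as follows, using a row of J^{AB} as the probability distribution over post-exit states so that the "chunked" transition skips all intermediate states. All names are illustrative.

```python
import numpy as np

def simulated_value_update(J, JA_i, JAB_row, post_exit, i, lr, rng):
    """One simulation-based update of the global estimate J for a state i
    of block A, at r = 0. JAB_row holds the exit probabilities J^AB_{i,:};
    post_exit[k] is the global index of the k-th post-exit state. The exit
    state is sampled directly, without simulating intermediate states."""
    k = rng.choice(len(JAB_row), p=JAB_row)
    j = post_exit[k]
    J[i] += lr * (JA_i + J[j] - J[i])   # move toward J^A_i + J_j, cf. eq. 32
    return j
```

Each call costs one cycle of computation regardless of how many time periods the block transit would span in reality, which is why such updates can run far more often than reality-based ones.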

In actuality, for a neural-network approximation of this system, to be described in section 3, there is an easier step-by-step adaptation rule for these post-exit J estimates. We can simply set:

new J_{i} = old J_{i} + LR*( J^{A}(i) + “J^{AB}(i, __J__ | post-exit states)” - old J_{i} ) , (33)

where the term in quotation marks refers to the output of a neural network (or other supervised learning system) which is trained to __input__ the J estimates for the post-exit states of block A and a specification of a state i in block A, and to __output__ the complete J^{AB} term for that state i (i.e., to output the summation in equation 23).

Equation
33 may be called a __backwards value update__, to contrast it against the
update procedure related to equation 32, which may be called a __forwards
value update__. (In similar language, equation 23 is used in the classical
approach to provide a backwards value update, but a matrix-based update rather
than a state-by-state
update. Later, neural network approximations provide still
another choice between “pattern learning” and “batch learning” alternatives [2].) The tradeoff between forwards value updates and
backwards value updates is a recurrent issue, even in more complex designs. In
this context, however, the backwards value updates have an advantage, because
they are exact; by contrast, equation 32 involves the usual random disturbances
associated with any statistical sampling method, without any real cost
advantage.

**2.4.3 From Passive Design to Active Design: Locality and Decision-Making**

This section will describe further variants of these designs, also motivated by the desire to provide a basis for neural network learning extensions.

The simple partition and focused partition designs, as described above, were both derived as __passive__ methods -- as methods for efficiently calculating **J** for a fixed policy p. Moving to a fully active design involves two further notions: __locality__ and decision-making.

In
__locality__, the idea is to make sure that the things one is learning in
any region of state space (here, a block) do not depend strongly on what is
observed and learned in __other__ regions of space. Locality is already
widely used as a strategy in ordinary supervised learning [2]. In that context,
it is well known that local designs lead to faster and more robust learning
(though often with poor generalization, for reasons which do not apply here),
in part because there is no need to __unlearn__ or __relearn__ things as
one moves from one block to another.

The
simple partition and focused partition designs already provide a reasonable
degree of locality. The __J__^{A}
and J^{AB} matrices for a block A __depend only__ on transition
probabilities __from that block__. Thus the crosstangled global
learning problem is decomposed into smaller local parts, such that the required
foresight horizon in each part is usually very limited. (In principle, one
could also apply multiresolution methods within each block, so as to accelerate
equations 24 and 25; this is a straightforward application of what we have
discussed so far, but it is questionable whether the brain uses such a complex
design.) Equation 23 allows one to update the __global__ values by jumping
over an entire block in a single step of calculation, in effect. This is a form
of __event-based chunking__, because equation 23 provides temporal chunking,
and the chunks are defined by __events__ (exiting from a block) rather than
by predetermined time intervals.

This design is actually somewhat active, in the following sense. Actions within any block A should affect __only__ the transition probabilities -- and hence the __J__^{A} and J^{AB} -- __in that block__. However, the design is not __completely__ active or local, because the **J** values used when selecting actions within block A still depend, through equation 23, on the J values of states outside the block.

In
order to upgrade the design to make it completely local, one can replace the matrices __J__^{A} and J^{AB}
and the local action policy with neural networks or the like. Section 3 will
describe this kind of process in more detail. In the classical context, __J__^{A} is essentially
just a lookup table which, for each state in block A, yields an estimate of J^{A}
(a component of the J function of dynamic programming). However, one can
replace __J__^{A} with
either a neural network or a lookup table full of neural networks, which inputs
__both__ the state __and__ the specific J values for blocks B which
affect block A via equation 23. Those
specific J values are the J values for the “post exit states” of block A, the
states which can be reached from block A directly in a single
transition. Equation 24, for example, provides the target which can be used to
train that neural network (or any other supervised learning system [2]) used
for that purpose. Essentially the same choice of representations also applies
to the network(s)
which represent J^{AB}, although, as discussed before,
there are some further variants possible for J^{AB} to improve
convergence. Even in the passive case,
there are several ways to represent an action policy (exactly as in ordinary incremental or
approximate dynamic programming [3,5,8]); for the __completely local variant
of the simple partition design__, the action policy itself would also be
represented either as a neural network or something similar [2], or a lookup
table of separate neural networks for each state. The preferred embodiment for large-scale
control tasks would involve a single
neural network for each of these
components. Because the dependence on information outside of block A can be
accounted for by these networks, they
should make it possible to learn the relevant information (the three networks,
normally) depending __only__ on information
within the block itself; in principle, changes outside of the block
should not require any change in these networks. This kind of complete locality
has many advantages.
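As a sketch of this "completely local" variant (dimensions and numbers are hypothetical, and a plain linear model stands in for whatever supervised learning system [2] is chosen), the critic for block A receives both the state features and the post-exit J values as inputs:

```python
import numpy as np

# Sketch of the "completely local" variant: a learned critic for block A
# whose inputs are BOTH the state features AND the J values of A's
# post-exit states. All dimensions and numbers here are hypothetical,
# and a linear model stands in for any supervised learning system.

w = np.zeros(4)   # weights: 2 state features + 2 post-exit J inputs

def critic(state_feats, post_exit_J):
    """Estimate of J^A for this state, given current post-exit values."""
    return w @ np.concatenate([state_feats, post_exit_J])

def train_step(state_feats, post_exit_J, target, lr=0.1):
    """Supervised update: move the estimate toward an eq.-24-style target."""
    global w
    x = np.concatenate([state_feats, post_exit_J])
    w += lr * (target - w @ x) * x

s = np.array([1.0, 0.5])    # features of a state in block A (hypothetical)
pj = np.array([2.0, 1.0])   # J values at A's post-exit states (hypothetical)
for _ in range(200):
    train_step(s, pj, target=3.0)
print(round(float(critic(s, pj)), 3))
```

Because changes outside the block enter only through the post-exit J inputs, the weights themselves can be learned from information local to the block.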

Experts
in AI may ask how this kind of structure could accommodate rapid changes in
estimates of “J” within a block, which exploit the power of local __search__
activities. The answer, in brief, is that the neural networks used to
approximate __J__^{A} and
J^{AB} can have fast-learning local components (i.e. local components as in
supervised learning.) There is no inherent conflict between locality as
described here, and the ability to exploit the power of local search.

For a full-fledged system of brain-like intelligence, one must go even further, and introduce a notion which may be called “decision-making” (or “decisiveness” or “action chunking”).

The
partitions described above are very passive in nature. They __do__ apply to
large-scale decision problems in the real world, even more than they apply to
abstract finite-state
problems. They reflect the fact that the state of our world
(unlike movies or TV shows) does not often change in a split second from one
state to something radically different. The possibilities of transition from
one state to another are very constrained, __regardless of what actions we
take__. For a strict application of the designs discussed so far, the
partitions and blocks must be defined so that they allow for __any possible
choice__ of actions. (In practice, however, if we have ruled out certain
kinds of actions, we need only focus on transitions which are possible for the
currently-considered choice of actions.)

In
larger, more realistic problems, we can achieve a tighter partition, and
therefore more efficient calculation, by using the concepts of __decisions__ or __action schemata__
or __task modules__ or __verbs__ in place of these large, passive blocks.
(Sutton [6] refers to “abstract actions” in an intuitive way, which does not
relate to the machinery here.)

There
are several ways of implementing this idea of “decision-making.” In the __pure__
finite-state case, one would usually require that the decision options -- like
the simple partitions above -- must be specified in advance, at least in terms
of an __initial__ local action policy and in terms of the entry states and
exit states. The best techniques for __learning__ the set of options
(entries, exits, etc.) involve fuzzy chunking and related methods, to
be discussed in a later section.

In
the simplest variant, we assume that the user has supplied a list of
decision-blocks, rather than a set of simple blocks. But within each decision
block, there is still a choice of actions, and a definite pre-specified set of
exit states and post-exit states. The __one new feature__ is that, whenever
we encounter a post-exit state, we have a __choice__ of several possible
decision blocks to enter. Thus, in any post-exit state __x__, we have a
choice of D decision blocks we can choose from (a subset of all the decision-blocks
in the system). For each block number d (where 1 ≤ d ≤ D),
there should be block-specific matrices __J__^{d}
and J^{dB}, exactly analogous to the matrices __J__^{A} and J^{AB} discussed before. But
then, if we use equation 23, we would have D different estimates of the value
of J(**x**,d), depending on the choice of decision. The natural resolution is to choose the
decision which maximizes that estimate:

J_{x} = max_{d} { J_{x}^{d} + Σ_{B} Σ_{j∈B} J_{xj}^{dB} J_{j} } (34)

Strictly speaking, we do not really
need to identify blocks B as such; this is really a short-hand way of saying
that the sum is to be taken over the post-exit states of decision block d. To
perform this summation, we need to know the estimates of **J** in all the post-exit states, just as before. Note that
when we decide on the decision block for state **x**, the right-hand side of
equation 34 also provides an updated estimate of J(**x**).

Once
we have made a decision -- i.e., entered a particular decision block -- it is
appropriate to update __J__^{d}
and J^{dB} __only for that decision d__ (not for the alternative
decisions we might have made!) based on current real experience, until we have
exited from that decision block.
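A minimal sketch of this decision step (hypothetical numbers; an eq.-34-style maximization over candidate decision blocks d at a post-exit state):

```python
# Sketch (hypothetical numbers) of eq.-34-style decision-making: at a
# post-exit state x, each candidate decision block d offers an estimated
# value J^d_x plus interaction terms over its own post-exit states; we
# commit to the block with the highest total.

def decision_value(d, x, J_d, J_dB, J_post):
    return J_d[d][x] + sum(J_dB[d][x][j] * J_post[j] for j in J_dB[d][x])

def choose_decision(x, J_d, J_dB, J_post):
    return max(J_d, key=lambda d: decision_value(d, x, J_d, J_dB, J_post))

# Two candidate decision blocks "d1", "d2" available at state 0.
J_d = {"d1": {0: 1.0}, "d2": {0: 0.5}}
J_dB = {"d1": {0: {9: 0.9}}, "d2": {0: {9: 0.2, 8: 0.7}}}
J_post = {8: 6.0, 9: 2.0}

print(choose_decision(0, J_d, J_dB, J_post))
```

Only the critics of the chosen block would then be updated from the ensuing experience, as described above.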

Unfortunately
for the control
theorist, the simplest variant here is still not quite enough to
explicitly capture the full process of decision-making by mammalian brains.
Even after a decision is made, it is sometimes useful or necessary to abort the
decision before the normal post-exit states are reached. There are three
phenomena to be considered here: (1) __failure__ to complete the task; (2) __changing
one’s mind__, sometimes because of new opportunities arising, sometimes due
to incipient failure, etc.; (3) __modification__ of the goals of the task.
The third of these is beyond the scope of this section. For strict locality, we
can and must account for the first two simply by __redefining__ the
post-exit states to include failure states and states of changed mind. (To
allow changing of minds, one applies equation 34 to all possible states where
it may be reasonable to consider changing one’s mind.) But as a practical
matter, it is more reasonable to build a system which __tries__ to learn all
the important exit modes, in a purely active/local manner, but which remains
somewhat consistent by using equation 23 in the older passive mode (as per
section 2.4.2) when unexpected exits
occur. In either case, equations 23, 24, 25 and 34 remain the basis of system
adaptation, both for expected post-exit states and unexpected ones.

In
the limit, one might imagine using equation 23, as in section 2.4.3 and above,
for __all__ the possible decision blocks __within__ a larger, passive
block of possible states. But this simply reduces to the lower-level ADAC
design (equation 26), using the larger passive block as the basis for partition!
The decision-based additional locality is simply lost altogether. This analysis leads to two observations of
relevance to further design work: (1) use of lower-level ADAC or mixture of
expert designs to output actions can be a useful first step in __suggesting__
initial __possibilities__ for decision blocks, i.e. a useful part of the
growing/pruning process for these systems; (2) because the ADAC approach does not fully capture the local approach, it
is better -- __when possible__ -- to
try to learn the unexpected post-exit states, so that they will not be
unexpected in the future.

**2.5 Temporal Chunking: Multi-Level Task-Based Designs**

The
previous discussion in section 2.4 only referred to __two__ levels of
organization -- the block level and the state level. How far does this extend
our effective vision into the future -- the key problem discussed in section
2.3? If the system tends to stay in the same block for k periods of time, then
we extend our vision only by a factor of k. Thus the extension of foresight is
only like Sutton’s k-step-ahead approach discussed in section 2.3, not like the
full k^{n} approach. There are still some
advantages over multiresolutional methods, because of the sparsity of matrices
here, but the benefit to foresight is more limited.

In
order to achieve a k^{n} improvement in
foresight, with an event-based architecture, we can __extend__ the designs
of the previous section in order to build a multilevel hierarchy. This section
will show how to do this, in the example of a three-level hierarchy. In order
to extend this result to a hierarchy of more levels, one can simply use the __same__
adaptation rules used here for the __middle__ level, connecting it to the
levels immediately above and below, and apply those rules to __each one__ of
the middle levels of the larger hierarchy, connecting each of them to the level
immediately above and below. This is a straightforward process, but this
section will only present the three-level case, simply in order to keep the
equations from appearing truly horrendous.

This
section will present a __decision-making__ formulation (à la section 2.4.3), based on
a three-level extension of the simple partition design, with backwards value
updates. Other variations discussed in section 2.4 carry over in a straightforward
way. (For example, a purely passive block design may be obtained simply by
providing only one choice of allowed decision blocks at each postexit state.)

**2.5.1 Changes in Notation**

This
section will use slightly different notation from
section 2.4, in order to reduce the complexity of the equations. As
before, we will assume a finite number of possible states i or j of the
environment. But now, the letters A and B will refer to __decision blocks__.
Decision blocks are essentially made up of states, plus local action policies __u___{i}^{A}
specifying the actions **u** to
take in each state i of A, and a set of post-exit states, which will be denoted p(A).

Here
we will also consider __higher-order__ decision blocks, a and b. Higher-order decision blocks are made up of decision
blocks. Just as the same state, i, may appear in
multiple competing decision blocks, so too may the same decision block A appear
in multiple competing higher-level decision blocks.

The
union of p(A), across all blocks A in a,
will be denoted as p(a). The set of __all__ block-post-exit states in the
entire system, i.e. the union of p(a)
across all higher-order blocks a, may be written as p(). The post-exit states of a
itself will be written as P(a). The union of P(a) over all higher-order blocks a will be written as P().
Each higher-order decision block must contain a set of decision blocks, plus an
internal critic J_{i}^{a0} and an interaction critic J_{ij}^{aI}.
These critics are defined for all i∈a which are also in p(), and for
all j in p(a).

In general, for any state i, let d(i) be the set of decision blocks which can be selected at state i. Let D(i) be the set of higher-order decision blocks which can be selected at state i. In a simple decision scheme (as described in section 2.4.3), D(i) is only defined for i∈P(), and d(i) is defined only for i∈p(); however, the discussion below will also allow for the possibility of changes of mind and unexpected exits (as discussed in 2.4.3).

For
the sake of computational efficiency, we will need to implement an additional
“critic” for each
decision block A, which may be written as J_{j}^{A+}, defined
for j∈p(A).

Note
how this change in notation simplifies the description of decision-making
designs. For example, in this notation, if state i is a state in which a
decision is required, __in the two-level design of section 2.4__, equation
23 becomes:

J_{i} = max_{A∈d(i)} { J_{i}^{A0} + Σ_{j∈p(A)} J_{ij}^{AI} J_{j} } (35)

If i is a state in which no decision is
required, then A should be already specified for that state, and J_{i} simply
equals the term within the curly brackets in equation 35.

Finally,
for the sake of generality __and__ readability, we need to replace the
awkward notation used for learning used in equations 28, 29 and 33. In each of
those equations, an old estimate is replaced by a new estimate, based upon a
new example. The precise update rules given in those equations were really just
the simplest __example__ of a way to do that kind of learning. More
generally, in each of those situations there was an old estimate (or network),
which is updated in response to a new desired value or output, which may be
called a “target value.” Learning methods or architectures which can perform
this kind of task are called __supervised learning systems__ [2]. For the
full three-brain system, in the software version, the user will be given a __choice__
of supervised learning systems; the ultimate preferred choice for a supervised learning system [2] will be far
more complex than the simple fixed linear learning scheme shown in the previous
equations. Thus to indicate a supervised-learning update in a more general
fashion, we will use the notation:

estimate ← target (36)

For example, equation 29 is replaced by the more general update rule:

J_{s(t)}^{B} ← “J^{B}(s(t+1))”/(1+r) (37)
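As an illustration of this generic "estimate ← target" notation (with a plain lookup table standing in for the supervised learning system, and hypothetical numbers), an eq.-37-style update repeatedly moves the stored estimate toward the discounted successor value:

```python
# Sketch of the generic "estimate <- target" update of eq. 36/37: an exit
# critic J^B is moved toward the discounted value observed one step later.
# Here the supervised learner is a plain lookup table with learning rate
# `lr`; any supervised learning system could be substituted. Numbers are
# hypothetical.

def sl_update(table, key, target, lr=0.5):
    table[key] = table.get(key, 0.0) + lr * (target - table.get(key, 0.0))

r = 0.1
J_B = {1: 0.0}
successor_value = 2.2          # "J^B(s(t+1))", hypothetical
for _ in range(50):
    sl_update(J_B, 1, successor_value / (1 + r))
print(round(J_B[1], 3))        # converges toward 2.2/1.1 = 2.0
```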

**2.5.2 The Three-Level Design Itself**

The
state-by-state update rules for __ J__^{A0}
and J^{AI}, within any decision block A, are essentially the same as in
section 2.4, i.e.:

J_{i}^{A0} ← U(i, __u___{i}^{A}) + “J^{A0}(s(t+1))”/(1+r), (38)

where “J^{A0}(s(t+1))” is J_{s(t+1)}^{A0}
in the case where s(t+1)∈A, J_{s(t+1)}^{B0} in the case of an
unexpected exit to decision block B, and 0 otherwise; and:

J_{ij}^{AI} ← “J_{ij}^{AI}(s(t+1))”/(1+r) = J_{s(t+1),j}^{AI}/(1+r) (if s(t+1)∈A)

= 1/(1+r) (if s(t+1) = j) (39)

= J_{s(t+1),j}^{BI}/(1+r) (if s(t+1)∈B by unexpected exit)

= 0 (otherwise)

When a state i demands that a decision
be made, these updates are inapplicable until __after__ the appropriate decision
block has been chosen. Updates are made __only__ for the chosen decision
block, not for other blocks containing a state. For a simple decision-making scheme, as
discussed in section 2.4.3, unexpected exit modes do not occur; however, for
the general case, one must allow for their possibility.

When a state i requires that a higher-level decision be made, that decision should be made first before any lower-level decision is made in that state. The higher-level decision is made based upon the relation:

J_{i} = max_{a∈D(i)} { J_{i}^{a0} + Σ_{j∈P(a)} J_{ij}^{aI} J_{j} } (40)

After this higher-level decision is made
(based on this maximization), we update J_{i} so as to match the target
value of the right-hand side of this equation, which yields a backwards value
update. (Actually, as in equation 32, a forwards value update could be used
instead, but would not appear
to be preferable in most
cases.)

When a state i requires that a decision be made, and when state i is already in a definite higher-order decision block a, the decision is made based on:

J_{i} = max_{A∈d(i)} { J_{i}^{A0} + Σ_{j∈p(A)} J_{ij}^{AI} J_{j}^{A+} } (41)

For strict accuracy, we may first
update each of the J_{j}^{A+} values used in equation 41, for j∈a, by
using the equation:

J_{j}^{A+} ← J_{j}^{a0} + Σ_{k∈P(a)} J_{jk}^{aI} J_{k} (42)

(However, to reduce calculations, we
may maintain flags or recency
factors on each of the J^{+} values, and perform this
calculation only when the flags indicate that an update is overdue.) After the
decision has been made (i.e., after a block A has been selected for state i),
then two further
updates should be carried out

if i∈p():

J_{i}^{a0} ← J_{i}^{A0} + Σ_{j∈p(A)} J_{ij}^{AI} “J_{j}^{a0}” , (43)

where “J_{j}^{a0}”
is J_{j}^{a0} if j∈a, or 0 otherwise; and:

J_{ik}^{aI} ← Σ_{j∈p(A)} J_{ij}^{AI} “J_{jk}^{aI}” , for all k∈P(a), (44)

where “J_{jk}^{aI}”
is J_{jk}^{aI} if j∈a, or 1 if j=k, or 0 otherwise. Similar to equations 38
and 39, the terms in quotations in equations 43 and 44 are replaced by J_{j}^{b0}
or J_{jk}^{bI}, respectively, if j∈b as the result of an
unexpected exit from the higher order block a.
(Again, recall from section 2.4.3 that unexpected exits do __not__ represent
“failure,” if failure corresponds to one of the possible, anticipated but not
desired states. These special provisions apply to exits which were not even
considered on this list of __possibilities__ p(a).)

After the decision has been made in state i, we also have the option of performing a third update:

J_{i} ← J_{i}^{A0} + Σ_{j∈p(A)} J_{ij}^{AI} J_{j}^{A+} (45)

Again, these updates are generally backwards value updates, as discussed in section 2.4.2; the forwards equivalents are generally straightforward, by analogy to that section.

Finally,
in order to complete the basic design, we must specify how to pick the actions ** u** at any time i, after the relevant
decision A and higher-level decision a have been made. In this
case, we simply choose

__u___{i} = argmax_{**u**} { U(i,**u**) + (1/(1+r)) Σ_{j} P_{ij}(**u**) J*_{j} } , (46)

where:

J*_{j} = J_{j}^{A0} + Σ_{k∈p(A)} J_{jk}^{AI} J_{k}^{A+} (47)

Recall, from section 2.3, that all of these equations and the corresponding designs are simplified substantially in the special case where the utility function U is a growth process.

**2.5.3 Action Networks, Decision Networks and Other Extensions of the Design**

As previously mentioned, the various variants mentioned in section 2.4, like the focused partition design, can also be applied here in a straightforward manner.

In addition, it is straightforward to modify this system to permit the use of trees of decision blocks or task modules of varying depth. For example, one can define a “higher level” block a which contains only one decision block A. In effect, then, a becomes a kind of alias for A. In this way, we can design still-higher-level decision blocks for which, in some states i, one has a choice between initiating a truly high-level task b (a higher-level decision block) or lower-level task (A, appearing under the alias of a). The design above still goes through, without significant computational overhead involved in the “aliasing.” Using this procedure, it is then straightforward to handle this more general form of hierarchy of tasks or decision modules.

The
most important extensions are those which lay the groundwork for the use of
neural networks, to be discussed in section 3. Of these extensions, the most
important is the use of __action networks__ and/or __decision networks__
instead of optimal actions and decisions.

In
the simplest designs -- as above -- we always choose the best possible action ** u** in any state i, the best
possible decision A when there is a decision to be made, and the best possible
higher-level decision a. This always gives the
optimal result. When the number of choices is discrete and few, it is a
reasonable way to go. However, when the number of choices is large, the
computational cost of finding the true optimum in all cases becomes
unrealistic. In that case, we can build some sort of network -- deterministic
or stochastic -- to output actions or decisions as a function of the state and
of other information.

There
is a wide variety of such “action network” designs available in the adaptive
critic literature [2,8], all of which can be applied
here in a straightforward way, both to actions and to decisions. In addition,
one can build action or decision networks which simulate __several alternative__
possibilities at each time, and then: (1) select the best of the few; (2) adapt
the network, based on these results. A slight variation of this is to maintain a
buffer containing the “best action/decision possibility found so far,” and to
update the buffer only
when the stochastic action/decision network finds something better for the
current state. Section 4 will discuss some new possibilities for implementing
these subsystems, which are part of the preferred embodiment of the full neural
3-brain architecture. All of these adaptive designs also require some choice
between the many, many possibilities available [2,8]
for initializing the various networks.
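The best-so-far buffer described above can be sketched as follows (all names and the scoring function are hypothetical stand-ins for a critic evaluation):

```python
import random

# Sketch (hypothetical setup) of the buffered stochastic action network
# described above: a stochastic proposer generates a few candidate actions
# at a time, and a per-state buffer keeps the best action/decision found
# so far, replaced only when a new proposal scores higher.

random.seed(0)
best = {}   # state -> (score, action)

def score(state, action):          # stand-in for a critic evaluation
    return -(action - 3.7) ** 2    # hypothetical: optimum at action = 3.7

def propose_and_update(state, n=5):
    for _ in range(n):
        a = random.uniform(0, 10)  # stochastic proposal
        if state not in best or score(state, a) > best[state][0]:
            best[state] = (score(state, a), a)
    return best[state][1]

for _ in range(200):
    propose_and_update("s0")
print(round(best["s0"][1], 1))
```

In a full design, the adaptation of the proposer network itself would then use these buffered results, as discussed above.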

Once
we extend the design by using action networks and decision networks, we can go
one step further by
allowing __modifiers__ to decisions. Intuitively, then, we may think of the decision
choices as something like subroutines or __verbs__. The modifiers are
something like input arguments or __adverbs__. (Objects will not be
discussed until a later section.) Given a decision block A, we may specify
an associated vector, __u___{A}, which contains the adverbs.

The
adverbs may be discrete
or continuous or a combination of
both. There are two major kinds of adverbs, passive and active. For
passive adverbs, the postexit states are still a function p(A)
of A alone. For active adverbs, they depend on the adverbs; in other words, we
have p(A,__u___{A}).

The
extension to passive adverbs is fairly easy, in principle. We expect that the decision
block will be initialized to allow for some variation in its actions
depending on __u___{A}.
We set up the action network, the internal critic network, and the interconnection critic
network to receive __u___{A}
as part of their input vectors. We set up the decision network to output first
A and then __u___{A}.
(The obvious way to structure this is to add an adverb-generating network which
outputs __u___{A}
whenever A has just been decided upon.) Then we train the various networks just
as we normally would do.

For
the full version of the system, we would expect a combination both of passive
adverbs and of active adverbs. With active adverbs, it is no longer practical
to implement the various calculations above which require explicit
consideration of __all possible decisions__, because the choice of adverbs
allows many possible decisions, in effect. (Mathematically, the equations above
then remain valid, but with “decision” replaced by “combination of decision and
active adverb”.) Again, however, the use of decision networks instead of exact
optimization leads to a straightforward approximation of the equations above.

The choice of words “verb” and “adverb” is not just an analogy here. Our speculation is that these sorts of computational structures, as implemented in the mammalian brain, do provide the deep structure which underlies the way we talk about our lives in human language.

**2.5.4 Learning of the Underlying Partitions**

The previous discussion has assumed a fixed, known partition of the set of possible states. In other words, the sets of states labeled A, a, etc., are all assumed to be known.

For tasks which truly involve a finite list of possible states, this is a reasonable assumption. There is a large body of methods already in existence on sparse matrices and domain decomposition techniques; thus, if useful partitions of the matrix P are not obvious by inspection, we can simply borrow methods from that well-established literature to find them. The case where P depends on the choice of actions can be addressed as an extension of these methods.

On
the other hand, for a truly brain-like intelligent control system, one would
prefer to augment such initial techniques by adding a __learning capability__,
which can adjust the partitions based on __learning__, either real-time
learning or off-line learning [2]. This becomes more important for the neural
network extension/approximation of this design, to be described in section 3.
However, the approaches to be described in that section can be seen as extensions
(roughly) of approaches which can be used even in the case where there is a
finite list of possible states.

The most important requirement, to permit learning, is to carry out some re-evaluation of decisions in mid-stream, on some sort of random or regular or event-driven basis. (As with many of the adaptation methods in the adaptive critic literature [8], one can use -- in theory -- “almost any” sampling strategy which is mixed and diverse enough to eventually touch base with all relevant states. Measures of new opportunity or of surprise could be used, for example, to trigger higher probabilities of reconsideration, above a lower but nonzero baseline probability which always applies.)

When such reconsideration is allowed, one can keep track of the unexpected exit states for each block A, and, when certain unexpected exit states are both frequent and important, add them to the set p(A). Likewise, one can prune the set p(A) by deleting states which are rare or unimportant, as measured, again, by the statistical experience of the system. Whenever a state outside of A frequently exits into A, one can add that state to A, on a trial basis, even though this would permit an overlap with a previous decision block. Section 2.4.3 has mentioned some additional adjustment rules which would also be added, to construct a full 3-brain system.
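A sketch of this statistics-driven growing and pruning of p(A) (thresholds and state names are hypothetical):

```python
from collections import Counter

# Sketch (hypothetical thresholds and state names) of the statistical
# growing/pruning of a decision block's post-exit set p(A): unexpected
# exit states that occur often are added to p(A); listed states that are
# rarely observed are pruned.

def revise_post_exits(p_A, observed_exits, add_at=10, prune_below=2):
    counts = Counter(observed_exits)
    for state, n in counts.items():
        if state not in p_A and n >= add_at:
            p_A.add(state)                  # frequent unexpected exit
    for state in list(p_A):
        if counts[state] < prune_below:
            p_A.discard(state)              # rare or unimportant exit
    return p_A

p_A = {"goal", "stale"}
exits = ["goal"] * 30 + ["interrupted"] * 12 + ["stale"] * 1
print(sorted(revise_post_exits(p_A, exits)))
```

Importance weights (not just raw frequencies) could be substituted for the counts without changing the structure of the procedure.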

In
some previous work [1], it was speculated that a true, flexible learning of
partitions in the neural network version would require the use of __fuzzy
partitions__, in which crisp sets like A and a are replaced by fuzzy
sets. Based on the more detailed design work of the past year, this no longer
seems to be the case, for two main reasons: (1) __decision blocks__, unlike
the passive blocks of section 2.4, can be overlapping sets anyway, which opens
the door to the procedures of the previous paragraph; (2) the key to the
optimal adaptation of a decision block is to know when the intelligent system
is __committed__ to a particular decision. Because each individual decision
block can normally find the local optimum “within its own domain,” and because
the __choice__ between decision blocks is essentially a discrete choice
rather than the incremental kind of choice where compromise methods are useful,
the value of fuzzy partitions now seems questionable here. Thus the preferred
3-brain design to be given in section 3 will be an extension of the design here
in section 2.5. However, for the sake of generality, the principles necessary
to creating a fuzzy extension of this design will be described in section 2.6.

Note
that in section 3, it will not be necessary for the neural network system to __know__
the sets A or p(A) as such. It will be good enough to
know __which__ decision is in effect at any time, and to know when there is
a transition from one decision block to another.

**2.6 Fuzzy Partitions: A Possible Extension of the Design**

Equation 10 may be written in slightly different form as:

_{} (48)

Starting from this equation, we may
choose any array b_{ij}
and derive the following equation by substitution:

_{}

(49a)

_{} ,

where:

_{} (49b)

_{} (49c)

This equation looks very similar, at
first, to some equations proposed by Sutton in [6]. However, there are several differences which are
crucial and not obvious. First, instead of assuming an array b_{ij},
he assumed only a vector of values b_{i}, for any given “abstract action.” Adding a
dependence on j is crucial, for reasons to be discussed. Second, while Sutton
did discuss the idea that a given weight vector ** b** might be associated with a particular
“abstract action,” he did not describe any procedure for choosing such a weight vector.

When
Sutton’s discussion is assimilated into our context, he seems to be suggesting
that b_{i}
should be set equal to the degree of membership μ_{i} of a given
state i in a given decision block. However, his equations then show value
weights converging to zero over time, when the state of the system merely stays
put in a fixed state which has a membership value less than one! This is
radically different from the kind of behavior we have seen and sought in
sections 2.4 and 2.5! When we use a b_{ij} array, we then have two other obvious
possibilities: (1) the preferred possibility, to set b_{ij} to μ(j)/μ(i)
(in the case where this is less than 1, or 1 if not); (2) to set b_{ij}
to 1+μ(j)-μ(i) (again
limited to a maximum of 1).

When our preferred version is used on the problem in section 2.4, in the limit where μ is always 1 or 0, it reproduces the equations of 2.4 (and their extensions in 2.5), with only the minor point that it allows the calculation of internal critics for states which extend beyond the entry to the block. (These can be simply thrown out as unnecessary calculations.) In the general case, we get back essentially the same adaptation rules as in section 2.5.2, except that the discrete choice of alternative targets is replaced by weighted sums which depend on the change in the membership function from time t to t+1. This follows from simple substitution of equation 49 into the derivations. The details are not shown here, however, because this extension is not the preferred embodiment of the 3-brain approach.

**3 Temporal Chunking With Neural Networks**

**3.1 Goals**

The
purpose of this
section is to describe how to build an intelligent system using neural networks
(or other similar learning components [2]) to perform optimization over time,
using a learning-based __approximation__ of the structure in section 2.5.2,
with the associated methods already discussed in section 2.

In
a formal sense, the details in section 2 are __strictly precise__ even for
the most general range of tasks. (Even when a task involves continuous
variables, the system normally can be approximated to any desired degree of
accuracy using a large but finite number of states, in principle, if one is
careful.)

The purpose of building a neural
network system to approximate section 2.5.2 is to __reduce the number of
calculations needed__, particularly in the case where there are many
continuous variables present in the external environment or the plant to be
controlled; this reduction, in turn, makes it possible for the resulting
intelligent system to cope effectively (though approximately) with a larger
variety of more complex tasks, within the limits of what is computationally
feasible. By using __neural network__ approximation methods, instead of other
learning-based approximators, we can extend the range of what is
computationally feasible because we can use special-purpose hardware dedicated
to implementing these particular architectures.

Many
aspects of how to approximate dynamic programming with neural networks have
already been established in the literature [1,2,8] and
in a previous patent disclosure by this inventor. For example, the methods
called HDP, DHP and GDHP have all been defined in great detail as methods of
approximating the original Bellman equation; for the modified Bellman
equations, __as expressed in sections 2.4 and 2.5__, the corresponding
changes in these 3 methods follow in a straightforward way. This novel
combination is a major part of this disclosure.

On the other hand, certain aspects of the neural network extension are much more difficult. This section will describe the overall structure of the new architecture, emphasizing the critical novel details necessary to make it work.

This
section will mainly address the issue of __temporal chunking__ -- i.e., the
design of a system aimed at solving problems where partitions over time are
very useful, but there are no special assumptions made about the structure of __space__,
or even about decomposition between different __groups of variables__ at the
same time. For a full brain-like structure, such issues of __spatial chunking__
must also be exploited. This section will occasionally mention design details
aimed at exploiting spatial chunking; however, the main discussion of such
further design extensions will be postponed until section 4.

**3.2 Overall Structure of the System**

First
of all, we will assume that the intelligent system is made up of a __multilevel
hierarchy__ of __decision blocks__, as in section 2.5. As in section 2.5,
a particular decision block may have “aliases” at higher levels. The highest
level of all is not a decision block, but a kind of global critic network (or ** J** estimator), again as in
section 2.5. Thus by specifying the learning and dynamics within a single generic decision block, illustrated in Figure 1, we effectively specify the system as a whole.

**Figure 1. Structure of a Decision Block**

In effect, each level or stratum of the hierarchy
contains a discrete “library” of decision blocks. The main goal of this section
is to describe how __individual__ decision blocks are adapted over time.
This is analogous to the usual adaptation schemes used to adapt individual __weights__
in simple artificial neural networks (ANNs). As with the simple ANNs, however,
there is a huge variety of
useful but semi-arbitrary tricks which one can use to “grow” or
“prune” the overall system. For example, if one frequently uses a given block with different
arguments, one may create a “copy” of
that block at the same level of the hierarchy, which then permits
separate adaptation of the two, as a way of “growing” the repertoire of
modules. Blocks which are rarely used can be deleted. Blocks which are frequently used in
sequence can be merged (in one step or partial steps) into a larger block, even
without deleting the original two blocks. (In fact, the creation of such a new block could explain the
experiments on “insight” described by
Brooks [9].)
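The grow/prune bookkeeping described above can be sketched as a small registry of blocks. This is an illustrative sketch only: the class and method names (`DecisionLibrary`, `copy_block`, `merge`, `prune`) are assumptions, and block internals are abstracted as plain dicts rather than actual networks.

```python
import copy as _copy

class DecisionLibrary:
    """Sketch of a per-level 'library' of decision blocks."""
    def __init__(self):
        self.blocks = {}   # name -> block parameters
        self.usage = {}    # name -> invocation count

    def add(self, name, params):
        self.blocks[name] = params
        self.usage[name] = 0

    def invoke(self, name):
        self.usage[name] += 1
        return self.blocks[name]

    def copy_block(self, name, new_name):
        # "growing": a block frequently used with different arguments gets
        # an independent copy, so the two can then be adapted separately
        self.add(new_name, _copy.deepcopy(self.blocks[name]))

    def prune(self, min_usage):
        # blocks which are rarely used can be deleted
        for n in [n for n, c in self.usage.items() if c < min_usage]:
            del self.blocks[n], self.usage[n]

    def merge(self, first, second, new_name):
        # blocks frequently used in sequence become one larger block,
        # without deleting the original two (cf. the "insight" rule)
        self.add(new_name, {"sequence": [_copy.deepcopy(self.blocks[first]),
                                         _copy.deepcopy(self.blocks[second])]})
```

The key design point is that copying and merging never destroy the originals; pruning alone removes repertoire.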

Biologically, the levels of this hierarchy are assumed to correspond to the various “loops” (low-level and high-level) in the circuits through the basal ganglia described by Brooks [9] and others. It now appears that a system of only 7 or 8 such levels could replicate the kind of capabilities one observes in the mammalian brain.

In
biological systems, the execution of decision modules is not always sequential,
even at the higher levels of intelligence that we are trying to reproduce here.
At times, certain actions -- like walking and talking -- can be carried out concurrently. This may require a kind of __spatial
decomposition__ of the environment. In effect, it may require an __extension__
of the purely temporal design which will be the main (though not exclusive)
focus of this section. This kind of simultaneous commitment to different
actions is not an example of fuzzy commitment (section 2.6), because of the
issues raised in section 2.5. There is a very powerful system of __lateral inhibition__ in the striatum of the basal ganglia, which ensures that discrete
choices (of a go/no-go variety) are made between competing possible decisions,
under normal circumstances [9]. There is a kind of fuzziness or uncertainty in
the values (critic estimates) which __underlie__ the decisions made at any
time; however, in some sense, the system can really focus only on one goal at a
time. It may __waver__ between competing goals, however. It should also be
noted that mammals -- __unlike__ large-scale factories -- are extended more
in time than they are in space; thus aspects of spatial chunking which are useful
in factory control may not always be relevant to mammalian
intelligence.

In
the discussion which follows, it will generally be assumed that each decision
block is made up of __independent__ neural networks “inside” each block, __sharing
access__ to some global set of feature variables {r_{i}} available to
all modules. In practice, it is also possible -- as in the brain -- to share
many “hidden neurons” between blocks. It is also possible to use growing and
pruning schemes which tend to give preference to the use of “neighboring”
variables, and so on, as has been done or proposed before for simpler ANNs. For
simplicity, however, the discussion below will treat the various decision
blocks as if they were made up of independent networks. Also, the discussion
below will refer throughout to “neural networks;” however, as discussed in [2],
this should be understood as a shorthand term for “ANNs or other learning
structures which perform similar tasks,
such as supervised learning, etc.”

**3.3 Structure and Training of a Decision Block**

Within
each decision block, we will first need neural networks to approximate the J^{A0}
and J^{AI} terms in equation 41. For J^{A0}, the obvious
procedure is to train a neural network using supervised learning, based on
equation 38. More precisely, one would insert a neural network to receive as
inputs i, A, and __u___{A},
and train it to match the targets given in equation 38. Instead of an integer
“i,” one would use ** r**, the
current estimated state vector of
the environment, learned primarily through neuroidentification
techniques [2]. For J

However,
for a true neural network approximation, __we__ cannot assume the
availability of the full vector J_{j}

In
general, the SI vectors can be generated as the output of a __decision network__
(which, by itself, would make them really the same as the __u___{A} vectors described previously), or as a kind of
__communication__ which passes back value information explicitly from one
decision block to a previous decision block. However, if each block develops
its own compression scheme, it becomes extremely difficult to develop
translation networks __from__ each possible decision block __to__ each
other block which may precede it!

As
an alternative, for the 3-brain architecture, we will implement the SI vectors
as Quadratic SI (QSI) vectors. In fact, each QSI “vector” will actually be a
“goal object” or “fuzzy image,” consisting of two scalars -- g_{0} and
g_{1} -- and two vectors, **r*** and **w**, representing a value of the form

J(**r**) ≈ g_{0} + g_{1} Σ_{i} w_{i}(r_{i} - r_{i}*)^{2} , (50)

where r_{i}
is one of the global feature or state variables. (As a practical matter, of
course, individual decision blocks may “nominate” new features that they need
to the global array, or, more easily, may influence the development of new
features indirectly through backpropagation feedback they send back to the ** r** network -- something which
happens automatically in a full implementation of the appropriate
backpropagation learning.)
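Assuming the QSI goal object scores a state by a weighted quadratic distance from its target vector (consistent with the (r_{i}*, w_{i}) elements discussed in section 3.4), its evaluation can be sketched as follows; the function name and the exact placement of g_{0} and g_{1} are assumptions.

```python
def qsi_value(r, g0, g1, r_star, w):
    # Hedged sketch of a QSI "goal object": two scalars (g0, g1) and two
    # vectors (r*, w), scoring state r by a weighted quadratic distance
    # from the target r*. The sign/role of g1 is an assumption; a goal
    # meant to attract would use a negative g1 (or fold signs into w).
    assert len(r) == len(r_star) == len(w)
    dist = sum(wi * (ri - si) ** 2 for ri, si, wi in zip(r, r_star, w))
    return g0 + g1 * dist
```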

For
a fully __active__ design (as defined in section 2.4.3), we must account for
the fact that action policies __change__ as a result of values received from
outside. Thus the J^{A0} network, __like__ the J^{AI}
network, will be assumed to input ** r**,
A,

In order to complete this design, however, we also need to include some additional components -- some essential, and some optional but part of the preferred variation.

First
of all, in order to permit the backwards flow of goal information, g, we need
to create a new network, J^{A-}, in each decision block, with the
following characteristics. J^{A-} will contain one or more “components” (blocks of hidden
units) g_{i}^{A-}; at any given time, we allow i=1 to n^{A-},
for a fixed number n^{A-}, which may be grown or pruned. Whenever the
decision block A is chosen, and the rest of the decision made, in a state ** r**, the network J

Secondly,
in order to improve the training of the system, it is preferred that each
decision block also contain a network J^{A+}, which is trained to
estimate J in the p(A) states. (In theory, J^{AI}
could take over this function, but there are potential benefits, for learning, in
decomposing qualitatively different learning tasks.) Thus in training the J^{AI}
network, whenever s(t+1) is in p(A) (i.e. when a new
decision block is invoked -- an explicit representation of p(A) is unnecessary), we can use J^{A+},
in effect, as the target. Likewise, we can use the output of this J^{A+}
network in evaluating possible simulated results j ∈ p(A)
of choosing decision A in an initial state i. Note the importance of training J^{A+}
to match J^{B0}+J^{BI} for the resulting state j, after the
subsequent decision B is known; by training to that information, __instead__
of training directly to a fuzzy goal image, one can improve robustness
considerably.
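The use of J^{B0}+J^{BI} as the training target for J^{A+} can be sketched with a toy linear estimator. Everything here is an illustrative assumption -- the function name `train_j_plus`, the linear form, and the learning rate -- intended only to show the target-matching step, not the actual network.

```python
def train_j_plus(j_plus, state_j, j_b0, j_bi, lr=0.05):
    """One sketched supervised update of a linear J^{A+} estimator
    toward the target J^{B0}+J^{BI} observed for result state j,
    after the subsequent decision B is known."""
    w, b = j_plus                       # weights and bias of the toy estimator
    pred = sum(wi * xi for wi, xi in zip(w, state_j)) + b
    err = (j_b0 + j_bi) - pred          # target minus current estimate
    w = [wi + lr * err * xi for wi, xi in zip(w, state_j)]
    b = b + lr * err
    return (w, b), err
```

Repeated updates on the same observation drive the estimate toward the target, which is all this sketch is meant to demonstrate.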

For
completeness, J^{A+} and the previously mentioned networks should all
receive an additional set of inputs. These would represent g^{a+}
and __u___{a}
and a, where a is the __larger__ decision block in which A is
embedded, __as well__ as similar information for the blocks in which a is
embedded, and so on. This is a large amount of information. For parsimony,
however, one can “channel” this information, by defining a
compression network either for all of a or for A in particular.
This compression network would receive the three mentioned sources of input,
plus ** r**, plus the output of
the compression network of the next higher block. It would output information
then used as “hidden units” in the various networks associated with block A, trained based on feedback from all those networks. There
are other similar
methods of compression which could be used instead, but this at least shows
that it can be done without too much difficulty. This compression network is
providing information about the larger strategic context which could be
important in some cases.
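The channeling scheme can be sketched as a fold over the chain of enclosing blocks, outermost first. Both functions below are hypothetical stand-ins: `toy_compress` simply truncates a concatenation to fixed width, where a real implementation would use a trained compression network.

```python
def context_code(block_chain, r, compress):
    # Channel strategic context down the hierarchy: each block's
    # compression net sees the (g, u, identity) of its enclosing block,
    # the state r, and the code output by the next higher block.
    code = []
    for g, u, name in block_chain:          # outermost block first
        code = compress(g, u, name, r, code)
    return code

def toy_compress(g, u, name, r, higher_code, width=4):
    # Toy stand-in for a trained compression network: concatenate all
    # inputs and truncate/pad to a fixed-width code vector.
    raw = list(g) + list(u) + [float(len(name))] + list(r) + list(higher_code)
    return (raw + [0.0] * width)[:width]
```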

Thirdly, the operations described above --
including the use of forwards results sampling in order to evaluate possible
decisions and to train decision networks -- clearly require the existence of a
network to __predict or simulate__ an outcome state “j” ∈ p(A)
based on an entry state ** r**, A,

Finally,
in order to complete this arrangement, we need to have a mechanism available
which actually allows us to make decisions at a state “i” (or ** r**) before we know for sure what
the resulting state j ∈ p(A) and the follow-up decision
will be. To do this, we need to develop a

Actually,
there are good reasons to __weight__ this supervised learning, to try to
match g^{A+} to g^{B-} more energetically in some situations
than in others. Higher weights should be given in cases of __success__, such
as cases when J^{B0}+J^{BI} is higher
in the result state j than might have been expected. (A variety of similar
metrics could be used.) The exact function used to determine the weight of each
observation as a function of success is arbitrary, in principle; as with the
arbitrary choice of certain parameters in any learning system, it provides a
legitimate basis for variation from one learning system to another, leading to
variations in “cognitive style” or “personality,” all of which are workable but
which vary in effectiveness from environment to environment.
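The success-weighted matching of g^{A+} to g^{B-} might look like the following sketch. The exponential weighting rule is one arbitrary choice among the "variety of similar metrics" mentioned above, and the function names are illustrative.

```python
import math

def success_weight(j_observed, j_expected, gain=1.0):
    # Illustrative weighting rule: weigh an observation more heavily
    # when J^{B0}+J^{BI} in the result state is higher than expected.
    # The exact function is arbitrary in principle (a source of
    # variation in "cognitive style").
    return math.exp(gain * (j_observed - j_expected))

def weighted_goal_update(g_a_plus, g_b_minus, weight, lr=0.1):
    # Pull g^{A+} toward g^{B-}, more energetically on successes;
    # the step size is capped so a huge weight cannot overshoot.
    step = min(1.0, lr * weight)
    return [ga + step * (gb - ga) for ga, gb in zip(g_a_plus, g_b_minus)]
```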

For the effective operation of this system, there are two important subsystem issues which need to be considered explicitly.

First,
because the time chunks here (from entry
to exit of a block) are usually
much larger than one computing cycle, the system will work best if there is
frequent “arbitrary” simulation or exploration of possible interesting states
in the near future. This is also true (though to a lesser extent) in ordinary
adaptive critic adaptation[8], where a wide variety of
simulation or exploration approaches have been tried -- all valid in theory, but
very variable in practice. Clearly this design includes both “backwards
chaining” (using J^{A-} to find “subgoals” related to a goal g^{A+})
and “forward chaining” (future simulation, perhaps using a supplementary
decision network to suggest possible alternative goals g^{A+} instead of just
possible A and __u___{A}.)

Second,
it is critical, in practice, to be aware here of the availability of supervised
learning systems which learn very rapidly through some kind of local learning
or associative memory, and of “syncretism”
approaches [8, chapter 3] which permit
further generalization from there. In practice, if the g^{A+}
network effectively “remembers” the results of forwards and backwards chaining
in the __current__
instance of block a, under current circumstances, this type of design --
properly implemented -- should be able to fully replicate the capabilities of
mixed forward-chaining backwards-chaining reasoning systems used in planning in
traditional artificial intelligence.

**3.4 Possible Variations of This Architecture**

There are a number of possible variations of this design, of varying usefulness. In addition to the variations mentioned in previous sections, there are the following variations of increasing relevance here:

(1)
Using only __u___{A}, not
g.

(2)
Allowing multiple goals g^{A+} in decision blocks.

(3) Multiplexing the processing of goals at different levels.

(4) Symbolic communication between modules.

(5) Linear SI vectors.

(6) Spatial "convolution" of goals.

Of these six, the first four are not part of the preferred embodiment, the fifth is of marginal interest, and only the sixth is part of the 3-brain architecture proposed here. The sixth anticipates some further extensions to be discussed in section 4.

First
of all, the design of the previous section would clearly be radically
simplified if goal objects g were not used. To some extent, modifiers __u___{A} can take over some
of the same functions, of defining the goals to be pursued in a particular task
invocation. In early implementations of our approach, this variation will be
extremely useful because of its simplicity. In the long-term, however, the lack
of rapid electronic transmission of __J__^{+}
information, in effect, makes this a more passive, less modular approach than
our baseline architecture. It is also less plausible than the base architecture
as a description of mammalian thought, in which the inherent ability to __focus__
on a goal is clearly very fundamental. In section 3.3, the opposite
simplification was briefly mentioned -- the possibility of using goals g in
place of scalar critic networks J, in some situations; that simplification
would lead to excessive rigidity or robotic character in decision-making, not
allowing at all for the robustness of adjusting evaluations so as to account
for values which are not entirely translated into specific goals. This
corresponds to our natural observation, as human beings, that we can reason
very quickly about explicit goals, but that we still need to adjust the
resulting conclusions so as to account for "subconscious" feelings,
impressions and intuitions which lie beyond the explicitly stated goals.

Secondly,
it would seem extremely natural and fairly straightforward to extend this
framework to permit __multiple goals__ or outcomes. For example, if a hybrid
SEDP/SOM simulation model were trained, within each decision block, it would
lead to a very natural way of partitioning p(A), in
effect. One could simply use the general g^{+} network to fill in all
of the multiple goals of a decision block, except when exploring __particular__
outcome states, where we could modify only the g^{+} applying to that
state. However, this kind of design would lead to a great deal of additional
complexity. As an alternative, in situations where there are a small number of
desirable alternative outcomes, one could usually just split the block up into
separate blocks, one for each goal. After all, in the pure temporal chunking
situation, these different goals are not mutually consistent anyway. When there
is a mix of desirable and undesirable outcome possibilities, it is usually
enough to focus on the desired outcome (for higher-level planning purposes) and
let the decision block learn to reduce the probability of the others to zero. When
humans are confronted with multiple attractors, it seems more plausible that
they __waver__ between these attractors (or focus on a more abstract goal
definition), rather than focus on multiple competing targets. Nevertheless, the
multiple goal option is just viable enough that it should not be rejected
completely at this stage.

Thirdly,
when we implement the 3-brain architecture on computer hardware, the obvious
approach for now is to implement each stratum in parallel, so that each uses
its own computational resources as efficiently as possible in making the
decisions at its level. Presumably, then, the higher levels would
"imagine" more distant future possibilities, etc. In practice,
however, if there are 7 strata, there could be a seven-fold saving in hardware
if one could somehow __multiplex__ these calculations through a single,
unified network system. The resulting complexity is probably not worth the
cost, at the present state of the art, but it may be useful eventually. In
fact, an introspective view of human thought, imagination and dreaming suggests
that such a multiplexed, one-idea-at-a-time approach
is probably used in the actual mammalian brain.

Fourth,
the communication between decision modules could be made far more complex than
a simple fuzzy goal image, g, even without imposing the requirement of complex
translation networks. In theory, for example, a decision module A could output,
instead of g^{A-}, an __explicit, structured verbal/symbolic__
description of the current value function J^{A0}+J^{AI} for the
entry states of A. If the preceding decision module B could truly understand
symbolic representations, then it could use such a representation directly, and
analyze it in very complex detail. Many readers will immediately notice that
this sounds more like a society of intelligent, symbolic reasoning units --
like a human society -- rather than a collection of neurons. It is extremely
unlikely that simple task modules in the mammalian brain can communicate with
each other in this way, since even whole mammals (except humans) cannot do as
much! Furthermore, because of conflicts of goals, human societies certainly do
not prove that such a "multimodular" level of intelligence -- far
beyond the mammalian level -- is really possible. Perhaps it will be possible,
someday, to build a unified intelligence on this basis, well beyond the human
level of intelligence. However, that is both unproven, and beyond the scope of
what is proposed here.

Fifth,
on a more mundane level, there is a very sensible-looking alternative to the
QSI: the Linear SI vector, which consists only of a scalar v_{0} and a vector **v**, representing:

J(**r**) ≈ v_{0} + Σ_{i} v_{i}r_{i} (51)

The LSI approach is not a viable __alternative__ to
QSI, because it cannot represent the wide variety in concrete goal states
needed for a flexible planning system. However, it __is__ very plausible to
consider __adding__ an LSI object to the QSI objects, to provide additional
guidance to a decision block __based on__ future expectations. For example,
one can imagine trying to walk to a desired location (goal), while placing
extra weight on trying to avoid places where rain comes down on your head; this
weight may depend on special current information. It is not clear that the
additional complexity of allowing such an input would be worth the trouble;
however, it may be worth considering as an extension, after the baseline system
is more fully tested.

Finally,
in the design of J^{A-} networks, it is important to train the goal
subnetworks (especially per the growing and pruning aspects of learning) so as
to __preserve__ the elements (r_{i}*,w_{i})
of g^{A+} which decision A has little effect on. The practical effect
of this is to make g^{A-} into a kind of __convolution__ of g^{A+}
and of the effect of A. This, in turn, is crucial to the effectiveness of this
system in handling tasks where the various variables r_{i}
actually represent different regions in space, or even when they simply form
groups of variables that do not interact so strongly with each other.
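The preservation idea can be sketched as follows, with g^{A+} held as a list of (r_{i}*, w_{i}) elements. Here `subgoal_fn` is a hypothetical stand-in for whatever the J^{A-} network computes for the elements that decision A does affect; the point of the sketch is only that unaffected elements pass through unchanged.

```python
def convolve_goal(g_a_plus, affected, subgoal_fn):
    # g^{A-} as a "convolution" of g^{A+} with the effect of A:
    # elements (r_i*, w_i) which A has little effect on are preserved
    # verbatim; affected elements are replaced by subgoals from the
    # J^{A-} network (subgoal_fn is a hypothetical stand-in for it).
    return [subgoal_fn(i, elem) if i in affected else elem
            for i, elem in enumerate(g_a_plus)]
```

This element-wise pass-through is what makes the scheme useful when the r_{i} represent different regions of space, or weakly interacting groups of variables.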

**4 Spatial Chunking and Critical Subsystems**

**4.1 Summary: Spatial Chunking in General, and Combinatorial Optimization**

The previous section describes a generalized neural network based intelligent system, to exploit the power of temporal chunking. However, as Albus [7] and others have stressed, the world of our experience also permits a certain degree of "spatial chunking." Spatial effects do appear to be very critical in major parts of the mammalian brain, such as the hippocampus.

This
section will suggest that most of the power of exploiting spatial effects can
be obtained __within__ the design of section 3, __so long as__ the __subsystems__
used in that design are properly chosen. However, as noted in section 3, the
main goal here is to replicate the level of intelligence seen in mammals, who are not so extended in space (at any time) as they are
in time.

The
simplest form of spatial effect is the __concurrent invocation__ of parallel
decision blocks, affecting different kinds of actions. One way to achieve such
capabilities is as follows. When there is a decision block currently activated,
and a part of the goal which involves __other variables__ not affected by
the current decision block (as should be evident from the structure of the J^{A-}
network), then one can define a __new__ decision block as the combination of
the first block and another block which addresses those other goal components.
Note that this is simply one more "arbitrary" rule to be added to the
growing(/pruning) system for decision blocks, similar
in flavor to the "insight" rule mentioned in section 3. One can then
invoke that two-things-at-a-time activity -- and either fine-tune it or discard
it. No fundamental changes in the underlying design are needed to add this kind
of capability. This additional capability is part of the fullest, mammal-like
version of the 3-brain architecture.

A
second, more fundamental aspect of space is the exploitation of __spatial
symmetry__. Two of the papers included here by reference (Werbos and Pang[12], and Pang and Werbos[13]) describe how Euclidean
spatial symmetry can be incorporated into a supervised learning system, which
in turn could be used as the critic network of a larger intelligent system. In
a similar vein, all of the various neural networks described in the previous
section could be __implemented__ as neural networks embodying symmetry
principles. For Euclidean symmetry, we could simply use the kind of symmetry in
the incorporated papers. But for __nonEuclidean__
symmetry, we must use a further variation, to be described in section 4.2.

A
third important aspect of space is true spatial chunking. As an example, John
Holland has noted how a __roadmap__ serves as a spatially chunked
(simplified) representation of the more detailed pictures one might see from a
satellite. This kind of spatial chunking is really nothing more than a __compression__
of information across space. One can achieve this kind of compression by using
the compression method, SEDP [8, chapter 13], with component networks that move
across a spatial (Euclidean or nonEuclidean) "window," using the
principles of section 4.2.

Finally,
a fourth aspect of space is something which may be called the "big world
problem." For example, Albus [7] has pointed out that the usual sort of
estimated state vector one derives in engineering may really be just a small
window into the larger physical environment of a mobile intelligent system, an environment
which is mostly far outside the sight of the system. He argues that the system
must maintain both an estimated state vector, ** r**, in effect, and also a larger "world model" in
which the current

Most
of the subsystems required in section 3 already exist -- in many forms and many
variations -- in the neural network literature and elsewhere. The new papers
incorporated here (and section 4.2) provide some basis for __improving__ the
subsystems, and thereby improving the overall behavior of the system, but the
previous neural networks provide at least some basis for an early
implementation. The main exception to this concerns the training of adaptive __decision
networks__, networks to make both continuous and discrete decisions, so as to
maximize a rather complex function of these decisions. Section
4.3 will describe some new designs to fill in
that gap.

Finally,
all these various component neural networks -- whether
spatially symmetric or not -- are ultimately composed mainly of __function
approximation networks__, as discussed in [2]. For an efficient hardware
implementation of these systems, the most important aspect is to implement this
computation-intensive "inner loop" -- the function approximation
networks at the core of the system. At the present time, perhaps the greatest
computational throughput of __any__ general-purpose computing system (both
today and as expected in 10-20 years) comes from the 3DANN (3-Dimensional ANN)
hardware being developed by the Jet Propulsion Laboratory. The 3DANN system does
not implement SRN components as described in the incorporated papers.
However, for SRNs built around a core feedforward network of a single
layer (which can always represent the equivalent of an SRN built around a
multilayer network), a relatively straightforward modification
of the 3DANN hardware would suffice. With cellular SRNs, where the
connections are more limited, the implementation would be far more compact,
using the same approach to hardware design. This is the preferred embodiment of
the 3-brain architecture -- i.e. a system of coupled SRNs, coupled according to
the higher-level connections implied above, implemented in modified 3DANN-style
hardware.

**4.2 Symmetry-Based Networks,
Euclidean and nonEuclidean**

The
papers incorporated here by Werbos and Pang[12], and by Pang and Werbos[13], display
how to build a Euclidean-symmetric network which __inputs__ an array of __spatially-located__
data, and also __outputs__ an array of spatially located outputs (to match a
spatially located array of targets).

There
is a straightforward but novel generalization of that design to allow any
mixture of spatially located outputs (or targets) and global, nonspatial
outputs (or targets). One builds a network with two parts: (1) a
"lower" part which is just like the simple spatial-to-spatial network
of the incorporated papers (or some other simple spatial-to-spatial design
incorporating Euclidean symmetry); and (2) an "upper" part which is a
nonspatial (ordinary) network whose inputs may consist of some nonspatial
inputs, __and__ up to n special additional inputs, where "n" is
the number of neurons in each "cell" of the lower part. Each of these
special inputs would represent the __sum__ of the outputs of the
corresponding neurons in each of the cells, summed across the cells. (Actually,
the number of cells can also be added as an input, when it is expected to
vary.) To train this kind of two-level structure, one can still use generalized
backpropagation directly. (See [8,ch.10] or the
incorporated papers.) This structure directly reflects the principle of
Euclidean Lie group symmetry, discussed in the incorporated papers. It is
interesting that the relation between the amygdala and the hippocampus
in the mammalian brain looks somewhat similar to this kind of two-layer
arrangement, in which the lower layer is sensitive to spatial encoding.
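A minimal sketch of this two-part structure follows, assuming a one-layer shared "cell" and a linear global head (both illustrative simplifications; the real lower part would be the spatial-to-spatial SRN of the incorporated papers). The essential property shown is that the upper part sees only per-neuron sums across cells, plus the cell count.

```python
def cell_features(cell_input, core_w):
    # The SAME core weights are reused at every cell (Euclidean
    # symmetry); here a one-layer cell with a ReLU-style nonlinearity.
    return [max(0.0, sum(w * x for w, x in zip(row, cell_input)))
            for row in core_w]

def global_output(cells, core_w, head_w, nonspatial):
    # Lower part: shared cell applied across all spatial cells.
    feats = [cell_features(c, core_w) for c in cells]
    # Special inputs to the upper part: the sum of each neuron's output
    # across all cells, plus the number of cells (for varying sizes).
    pooled = [sum(f[i] for f in feats) for i in range(len(core_w))]
    top_in = nonspatial + pooled + [float(len(cells))]
    # Upper part: ordinary nonspatial network (a linear head here).
    return sum(w * x for w, x in zip(head_w, top_in))
```

Because the upper part only ever sees sums across cells, the global output is invariant to the ordering of the cells, which is the Lie-group symmetry property this structure is meant to enforce.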

In
the example of the maze, discussed in the incorporated papers, the Euclidean
design can be interpreted as the repeated use of the __same__ core
"cell" of 5 neurons over and over again, in different locations, with
input from the cell itself and from each of its four neighbors. One
disadvantage of this design is that it requires input from four neighbors even
for edge cells, which do not really __have__ four neighbors.

An
alternative spatial-to-spatial design -- the nonEuclidean approach --
would be essentially the same in this case, __except that__ the cells can be
sorted into three types -- four-neighbor
cells, three-neighbor cells, and two-neighbor cells. Instead of reusing __one__
core network in __all__ cells, we can adapt __three__ different core networks,
for use on the three different types of cells. The resulting feedforward
network is clearly well-defined -- for each cell, we simply use the relevant
core network to generate __its__ outputs, and then combine them all to
generate the entire array. The resulting SRN wrapped around this feedforward
network would simply be defined relative to the feedforward network, as in the
incorporated papers. Adaptation can again be done by use of generalized
backpropagation, as previously mentioned. Then, to add a global layer on top of
this spatial-to-spatial structure, one again transfers inputs based on adding
up outputs of corresponding cells in similar objects -- otherwise exactly
following the second paragraph of this section.
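The three-core-network idea can be sketched for a rectangular grid as follows. This is an assumption-laden sketch: linear cores stand in for the real core networks, and sorting the neighbor inputs is an illustrative shortcut for order-invariance, not something the source specifies.

```python
def node_output(node_x, neighbor_xs, cores):
    # Pick the core network by neighbor count: four-neighbor,
    # three-neighbor, and two-neighbor cells each share their OWN core
    # weights (the nonEuclidean variation). Cores are linear here.
    core = cores[len(neighbor_xs)]          # keys: 2, 3, 4
    inp = [node_x] + sorted(neighbor_xs)    # crude order-invariance
    return sum(w * x for w, x in zip(core, inp))

def grid_pass(values, cores):
    # One feedforward sweep over a grid whose edge and corner cells
    # genuinely lack some of the four neighbors.
    h, w = len(values), len(values[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            nbrs = [values[a][b]
                    for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                    if 0 <= a < h and 0 <= b < w]
            out[i][j] = node_output(values[i][j], nbrs, cores)
    return out
```

With all-ones cores, each cell simply outputs itself plus the sum of its actual neighbors, making it easy to check that corners, edges, and interior cells are routed to the 2-, 3-, and 4-neighbor cores respectively.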

In general, the nonEuclidean approach can be applied to any complex network of nodes, where "space" refers to nodes in a network rather than coordinates in a regular grid. One can again identify objects with similar relations to other objects, applying the same core model to all similar objects.

(One can use an extremely loose concept
of similarity, and let the core model itself __learn__ what distinctions it
must pay attention to.) When objects are connected to complex relations, there
are two further variations here: (1) to treat the relations themselves as a
type of object; (2) to force equal weights for multiple instances of the same
relation out of any object (thereby eliminating the need to worry about how
many instances there are). These choices are like the choices of connections
in ordinary ANN training -- choices to be optimized on the basis of incremental
error-based learning -- growing, pruning, etc.

Note,
as an example, that an SRN structure built on this kind of object-oriented
network, with a global classification network on top of it, might be an
excellent structure to input __structures or maps__ like representations of
chemical molecules, and output predictions of their global properties. It is
speculated that the mammalian brain operates mainly on the basis of nonEuclidean, object-oriented symmetry.

This concept of nonEuclidean or object-oriented symmetry may be interpreted as the neural network extension/version of what statisticians call pooled time-series cross-sectional modeling.

The
concept of object-oriented symmetry in neural nets is not 100% new. The idea
was first conceived by this inventor several years ago, when analyzing problems
of forecasting stocks. By building a general network, to be trained over
several actual stocks, analogous to pooled cross-sectional time-series
analysis, it was clear that a major improvement in accuracy could be achieved.
This combination of ideas from econometrics and neural networks was quite
novel, in part because few researchers work at the state of the art in both
fields. In 1996, in

**4.3 ANNs for Combinatorial
Optimization (Decision Networks)**

Even in conventional adaptive critic designs, there are severe limitations in the capabilities of existing "action networks" and in the methods used to train them. The decision networks discussed in section 3 perform essentially the same task as these conventional action networks, but -- because they are intended for use in demanding, complex applications -- the need for improved capability becomes essential.

In
general, the problem here is to build and train a network which outputs an
action or decision vector ** u**,
composed of both discrete and continuous components. For each observation of
inputs,

There
are two conventional approaches to this problem widely used in adaptive critic
systems. One approach -- when the choices are discrete -- is simply to consider
all possible choices, or to use a method like Barto's A_{r-p}, which is also
extremely limited in handling large numbers of choices. Neither of these is
adequate for handling very complex decisions with continuous aspects. Another
approach is to use backpropagation, which introduces the possibility of getting
caught in a local minimum. This possibility is typically not a big problem for
lower-level action choices, but for large-scale decisions it can be extremely
serious, because of the "lumpy" nature of large-scale choices. (An
example might be the decision of where to place a stone on a Go board; __each__
of the 361 legal grid points is a kind of "local optimum," superior
to placing a stone on the illegal nearby sites off of the grid points. More generally, there is a problem in
long-term decision-making of separating the forest from the trees, when seeking
the highest point.) A third common alternative is the use of genetic algorithms
at each time t, which would appear radically different from what the circuitry of the brain seems
to suggest, and also seems unnecessarily slow for a real-time system.

This section will propose an alternative approach to this problem. In general, we propose the development and use of a Stochastic Action Network which has the property that:

Pr(**u**) = exp(J(**u**)/kT) / Z , (52)

where Z is a kind of normalization
function (similar to the partition functions of physics) set to ensure that the
probabilities all add up to one, where k is a constant, and where T --
"temperature" -- is a global parameter which can be changed over time.
In effect, the SAN learns the "forest" instead of the
"trees;" it provides __options__ for actions or decisions. For a
complete action system, one can simply use the SAN to continually suggest new alternatives
(versus the previously chosen alternatives, whatever they may be), and one can
update the actual action or decision vector whenever the new option looks
better. In a way, these networks may be thought of as a kind of
"imagination network." Note that the inputs to the SAN may include
information about the previous decision, etc. Also, in the "Error Critic"
design [8,chapter 13], one might even generate the recurrent values, **R**, as an action vector in the
SAN sense; such an architecture might occasionally produce wavering images of
reality in certain situations (as is seen in real mammals).
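As a concrete illustration of this proposal-and-accept scheme, the loop below keeps suggesting noisy alternatives and adopts whichever option looks better under the critic. Everything here -- the quadratic critic J, the noise model standing in for a trained SAN, and the function names -- is a hypothetical stand-in, not the design itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def J(u):
    """Hypothetical critic: estimated long-term payoff of decision u."""
    return -np.sum((u - 1.5) ** 2)

def san_propose(prev_u, T):
    """Stand-in for a trained Stochastic Action Network: perturb the
    previous decision with noise whose scale grows with temperature T."""
    return prev_u + np.sqrt(T) * rng.standard_normal(prev_u.shape)

# Continually suggest new alternatives (versus the previously chosen one)
# and update the decision whenever the new option looks better.
u = np.zeros(2)
for step in range(200):
    candidate = san_propose(u, T=0.5)
    if J(candidate) > J(u):
        u = candidate

print(u)  # should have drifted toward the optimum near (1.5, 1.5)
```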

There are several possible approaches to adapting SAN networks. In all of them, we begin by assuming that we can generate vectors **e** made up of random variables drawn from the usual normal distribution N(0,1). The SAN may then be written as:

**u**(t) = **A**(**R**(t), **e**(t)) , (53)

where __A__ is the neural network.
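A minimal sketch of this functional form, with illustrative layer sizes and weight names (the single-layer architecture is an assumption made only for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions: state estimate R, noise input e, action u.
n_r, n_e, n_u = 4, 3, 2

# Illustrative weights for a one-layer "A" network.
W_r = 0.1 * rng.standard_normal((n_u, n_r))
W_e = 0.1 * rng.standard_normal((n_u, n_e))

def A(R, e):
    """Stochastic Action Network: a deterministic net whose extra input e
    is drawn from N(0,1), which makes the output u a random variable."""
    return np.tanh(W_r @ R + W_e @ e)

R = rng.standard_normal(n_r)   # current state estimate
e = rng.standard_normal(n_e)   # noise input, e ~ N(0,1)
u = A(R, e)
print(u.shape)  # (2,)
```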

One possible approach is to first train a "D^{T}Q" network which inputs **u** and **R**(t), and outputs an estimate F_e of the derivatives of kJ/T carried back to the noise input **e**:

F_e = D^{T}Q(**R**(t), **u**(t)) . (54)

We may try to adapt the weights W so as to minimize:

E = (F_e - **e**)^{2} . (55)

The purpose of this is to make F_e
equal e, as it would for the desired normal distribution. In effect, this is
arbitrarily choosing a particular value for kT, but by scaling up e in
proportion to T one can achieve any desired thermal distribution, especially if
T is varied during training, with the adjustments scaled accordingly. Notice
that it would be critical to adapt the D^{T}Q net at least as quickly as we adapt the A network. A more
rigorous approach would be to avoid the D^{T}Q network, and __directly__
minimize (F_e-**e**)^{2}, which requires the use
of __second-order__ backpropagation, discussed in several sources, such as
[8,chapter 10].
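To make this training idea concrete, the sketch below uses strong simplifying assumptions: a linear SAN u = We with the state input omitted, a hypothetical quadratic critic J(u) = -u.u/2 (so the target Boltzmann distribution is Gaussian), kT fixed at 1, a sign convention for F_e chosen so that F_e = e is attainable, and a finite-difference gradient standing in for the second-order backpropagation an exact implementation would require:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 3
k_over_T = 1.0   # fix kT; rescaling e rescales the effective temperature

def loss(W, e):
    """Squared error (F_e - e)^2 for a linear SAN u = W e and a
    quadratic critic J(u) = -u.u/2, so that dJ/du = -u.  F_e is the
    critic gradient carried back through the net to the noise input e."""
    u = W @ e
    F_e = k_over_T * (W.T @ u)   # = -(k/T) W^T dJ/du
    return np.sum((F_e - e) ** 2)

def grad(W, e, h=1e-5):
    """Finite-difference gradient of the loss w.r.t. W; an exact version
    would require second-order backpropagation."""
    g = np.zeros_like(W)
    for i in range(n):
        for j in range(n):
            Wp = W.copy(); Wp[i, j] += h
            Wm = W.copy(); Wm[i, j] -= h
            g[i, j] = (loss(Wp, e) - loss(Wm, e)) / (2 * h)
    return g

W = 0.3 * np.eye(n)   # initial weights
for step in range(2000):
    e = rng.standard_normal(n)
    W -= 0.01 * grad(W, e)

# At the optimum W^T W = (T/k) I, so u = W e has exactly the Boltzmann
# (Gaussian) distribution exp(kJ(u)/T)/Z for this quadratic critic.
print(np.round(W.T @ W, 2))
```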

One should be warned that these designs have so far received only a very limited amount of convergence analysis, restricted to the linearized case. Thus a variant which uses an SEDP-based design to provide the SAN component may be worth exploring.

The
global "temperature" T is theoretically arbitrary. However, numerous
authors, such as Levine and Leven[11], have discussed
how variations in "novelty seeking" (which T represents) can serve
the motivations of organisms. Although the rules for adjusting T cannot be
specified in a rigid way a priori, they -- like the measures of success used in
weighting g^{+} training, discussed in
section 3.3 -- provide an important aspect of the "cognitive style"
or "personality" of the intelligent system.
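The effect of T on novelty seeking is easy to see in the discrete case, where the SAN property reduces to a softmax over options; the critic scores below are invented for the example:

```python
import numpy as np

def boltzmann(J_values, T, k=1.0):
    """Pr(u_i) = exp(k*J_i/T) / Z over a discrete set of options."""
    z = np.exp(k * np.asarray(J_values) / T)
    return z / z.sum()

J_options = [1.0, 0.9, 0.2]   # hypothetical critic scores for three options

for T in (5.0, 1.0, 0.1):
    print(T, np.round(boltzmann(J_options, T), 3))
# High T spreads probability over the options (novelty seeking);
# low T concentrates it on the best-scoring option.
```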

In the mammalian brain, it is very interesting that the lower layers of the neocortex -- the most recent part of the brain -- provide both the state estimation (via an Error Critic, we have argued [1]) and a kind of stochastic, SAN-like output to the basal ganglia, where discrete choices are enforced. Clearly that arrangement fits the spirit of this design.

**5 Adding the Third Brain**

Strictly
speaking, the architecture described above corresponds to the "upper
brain" and "middle brain" as described in [1]. For reasons
discussed in [1], this system cannot operate at the maximal sampling rate which
the underlying hardware seems capable of. In order to perform true maximal-rate
real-time control, one can simply add a "third brain" -- a
straightforward adaptive critic system as described in [8,chapter
13] and in [1], parallel to the olive-cerebellum system of the brain. As
described in [1] (which is incorporated here by reference), one can link this
lower-level system to the upper-level system by a simple master-slave
arrangement, in which changes in the lowest-level J from the upper system are
used to generate the main component of the "U" function maximized
over time by the lower system. Also, for the sake of parsimony, it is easiest
to define the __action__ outputs of this lower brain __relative to__ the
lowest-level action "decisions" of the upper brain; in other words,
we can calculate the total actual **u**
as the sum of the lower-level output and the decisions of the upper brain.
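A minimal sketch of this master-slave arrangement, with every function and dimension hypothetical: the change in the upper system's J supplies the lower system's reward U, and the total action is the sum of the two levels' outputs:

```python
import numpy as np

rng = np.random.default_rng(3)

def J_upper(x):
    """Hypothetical J function of the slow upper system."""
    return -np.sum(x ** 2)

def lower_U(x_before, x_after):
    """Master-slave link: the change in the upper system's J supplies the
    main component of the U function maximized by the lower system."""
    return J_upper(x_after) - J_upper(x_before)

def total_action(u_upper, du_lower):
    """The lower brain's output is defined relative to the upper brain's
    decision, so the total actual u is their sum."""
    return u_upper + du_lower

u_upper = np.array([0.5, -0.2])            # slow, large-scale decision
du_lower = 0.05 * rng.standard_normal(2)   # fast corrective output
u = total_action(u_upper, du_lower)
print(u)
```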

**References**

[1] P.Werbos, “Learning in the brain:
an engineering interpretation”, in *Learning
as Self-Organization*,
K.Pribram, Ed., Erlbaum, 1996.

[2] P. Werbos, “Neurocontrollers”, in *Encyclopedia of Electronics and Electrical Engineering*, J.Webster, Ed.,
Wiley, forthcoming. (Draft version incorporated
into patent.)

[3] R. Howard, *Dynamic Programming and Markov Processes*, MIT Press, 1960.

[4] P.Werbos, “The cytoskeleton: Why it may be
crucial to human learning and to neurocontrol”,
*Nanobiology*, Vol. 1, No.1, 1992.

[5] D.P.Bertsekas and J.N.Tsitsiklis, *Neuro-Dynamic Programming*, Athena Scientific, 1996.

[6] R.Sutton, “TD models: modeling the world at a mixture of time scales”, CMPSCI Technical Report 95-114, U.Mass. Amherst, December 1995; later published in *Proc. 12th Int. Conf. Machine Learning*, 531-539, Morgan Kaufmann, 1995.

[7] J.Albus, “Outline for a theory of intelligence”, *IEEE Trans. Systems, Man and Cybernetics*, Vol.21, No.2, 1991.

[8] D.White and D. Sofge, eds, *Handbook of
Intelligent Control*, Van Nostrand, 1992.

[9] V.B.Brooks, *The Neural Basis of Motor Control*, Oxford University Press, 1986.

[10] H.Ritter, T.Martinetz, and
K.Schulten, *Neural Computation and Self- Organizing Maps*,
Addison-Wesley, 1992.

[11] D.S.Levine and S.J.Leven, *Motivation, Emotion, and Goal Direction in
Neural Networks*, Erlbaum,
1992.

[12] P.Werbos & X.Z.Pang, “Generalized maze navigation: SRN critics solve what feedforward or Hebbian nets cannot”, *Proc. Conf. Systems, Man and Cybernetics (SMC)*, 1996. (Also in *WCNN96 Proc.*, Erlbaum, 1996.)

[13] X.Z.Pang & P.Werbos, “Neural network design for J function approximation in dynamic programming”, *Math. Modelling and Scientific Computing* (a Principia Scientia journal), special issue on neural nets, winter 1996-1997.

[14] P.Werbos, “Supervised learning: can it escape its local minimum?”, *WCNN93 Proceedings*, Erlbaum, 1993. Reprinted in *Theoretical Advances in Neural Computation and Learning*, V. Roychowdhury et al., Eds., Kluwer, 1994.

[15] D.S.Levine and W.R.Elsberry, Eds., *Optimality in Biological and Artificial Networks?*, Erlbaum, 1996.

[16] P.Werbos, “Optimal neurocontrol: Practical benefits, new results and biological evidence”, *Proc. World Cong. on Neural Networks (WCNN95)*, Erlbaum, 1995.

[17] P.Werbos,
“Optimization methods for brain-like intelligent control”, *Proc. IEEE Conf. CDC*,
IEEE, 1995.

[18] P.Werbos, *The Roots of Backpropagation: From Ordered
Derivatives to Neural Networks and
Political Forecasting*, Wiley, 1994.

[19] P.Werbos, “Values, goals and
utility in an engineering-based theory of mammalian
intelligence” in *Brain and* *Values*,
K.Pribram ed., Erlbaum,
1997.

[20] K.Pribram, *Brain and Perception: Holonomy and Structure in Figural Processing*, Erlbaum, 1991.