Generalized
Maze Navigation: SRN Critics Solve What Feedforward or Hebbian Nets Cannot
By
Paul J. Werbos
Room 675, National Science Foundation*
and
Xiaozhong Pang
Dept. of Electrical Engineering,
ABSTRACT
Previous
papers have explained why model-based adaptive critic designs -- unlike other
designs used in neurocontrol -- have the potential to replicate some of the
key, basic aspects of intelligence as seen in the brain. However, these designs
are modular designs, containing “simple” supervised learning systems as
modules. The intelligence of the overall system depends on the function
approximation abilities of these modules. For the generalized maze navigation
problem, no feedforward network -- MLP, RBF, CMAC, etc. -- or network based on
Hebbian learning has good enough approximation ability. In this problem, one
learns to input a maze description, and output a policy or value
function, without having to relearn the policy when one encounters a new maze.
This paper reports how we solved a very simple but challenging instance of this
problem, using a new form of simultaneous recurrent network (SRN) based on a
cellular structure which has some interesting similarity to the hippocampus.
INTRODUCTION
Several
previous papers [1-3] have argued that model-based adaptive critics -- also
called model-based approximate dynamic programming (ADP) -- offer us the only
real hope of achieving true brain-like intelligence in artificial control
systems, or of understanding intelligence in the brain[4-5] itself.
In principle, ADP systems should be
able to approximate the solution to any problem in control or planning
which can be formulated as an optimization problem. This includes almost any
interesting problem! For example, winning a game of chess or of backgammon can
be formulated as the problem of maximizing the probability of victory over
time. Using such a formulation, Tesauro[6] has shown that simple adaptive
critic systems can in fact play an excellent, master-class game of backgammon.
As another example, the most general approach to the classic nonlinear robust
control problem[7] involves converting that problem into an optimization
problem, which can then be solved by ADP in offline learning.

_____________________________________
*This paper represents the views of the authors only, not the views of their
employers past or present. However, because it was written on
This paper will first review what a
Critic network is, and explain why we need to use more powerful Critic networks
in order to solve difficult problems in intelligent control. Next, it will
describe the SRN network used here, and compare it with other recurrent
networks. Finally, it will describe the very simple but challenging test
problem used here, and the empirical results. More complete information on all
these points -- ranging from literature review through to designs, flow charts
and empirical results -- is given in [8]. For the underlying C code, contact
pangxz@eng.umd.edu or
MOTIVATION: CRITICS FOR USE IN PLANNING, AND THE MAZE
Decades
ago, classic researchers in Artificial Intelligence (AI) such as Minsky[9] and
Newell, Shaw and Simon showed how the problem of “reinforcement learning” --
the problem of maximizing an observed reinforcement signal U(t) over future
time -- can encompass most of what we mean by “intelligence.” In 1968,
Werbos[10] showed how the problem of reinforcement learning is linked to the
problem of approximating dynamic programming, and proposed a primitive form of
backpropagation as part of an ADP design. Since then, more sophisticated
designs and explanations have been developed[1-3], but they all are logically
based on the Bellman equation, the fundamental equation of dynamic programming.
Dynamic programming gives the exact
solution to the problem of how to maximize a utility function U(R(t)) over future times t, in a
nonlinear stochastic environment, where the vector R(t) represents the state of the environment at time t. Our
ability to do well on the reinforcement learning problem depends on our ability
to approximate the exact, optimal solution well -- i.e., our ability to
approximate the dynamic programming solution. Dynamic programming converts a
difficult problem in optimization over time (maximizing <U(R(t))>,
the
expected value of U(R(t))
over all future times) into a simpler problem in function maximization.
When there is a finite time horizon
(as with the maze), the Bellman equation [11] may be written:
J(R(t)) = max_u(t) [ U(R(t),u(t)) + <J(R(t+1))> ]        (1)
where
u(t) represents the actions
you take at time t. The problem is to solve for the function J. After J is
known, you can find the optimal action u(t)
at each time t by performing the maximization shown in equation 1.
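To make the role of equation 1 concrete, the following C sketch applies classical value iteration to a small maze of the kind used later in this paper. It assumes, purely for illustration, a utility of -1 per move and J = 0 at the goal square; these choices are ours for the example and are not the cost structure used in our experiments.

/* Minimal sketch of classical value iteration (equation 1) on a small maze.
   Assumptions (ours, for illustration): U = -1 per move, J = 0 at the goal,
   so J converges to minus the number of steps remaining to the goal. */
#include <stdio.h>

#define N 5
#define NEG_INF -1e9

static int obstacle[N][N];        /* 1 = obstacle, 0 = clear */
static double J[N][N];

int main(void)
{
    int gx = 3, gy = 3;                                     /* goal (4,4), 1-based */
    obstacle[1][3] = obstacle[2][2] = obstacle[3][1] = 1;   /* (2,4),(3,3),(4,2)   */

    for (int ix = 0; ix < N; ix++)
        for (int iy = 0; iy < N; iy++)
            J[ix][iy] = NEG_INF;                            /* unknown squares      */
    J[gx][gy] = 0.0;

    /* Bellman backups: J(R) = max over u of [ U(R,u) + J(next(R,u)) ] */
    for (int k = 0; k < N * N; k++) {                       /* enough sweeps to converge */
        for (int ix = 0; ix < N; ix++) {
            for (int iy = 0; iy < N; iy++) {
                if (obstacle[ix][iy] || (ix == gx && iy == gy)) continue;
                double best = NEG_INF;
                int dx[4] = {1,-1,0,0}, dy[4] = {0,0,1,-1};
                for (int a = 0; a < 4; a++) {
                    int nx = ix + dx[a], ny = iy + dy[a];
                    if (nx < 0 || nx >= N || ny < 0 || ny >= N) continue;
                    if (obstacle[nx][ny]) continue;
                    double v = -1.0 + J[nx][ny];            /* U = -1 per step */
                    if (v > best) best = v;
                }
                J[ix][iy] = best;
            }
        }
    }

    for (int iy = N - 1; iy >= 0; iy--) {                   /* print J, -99 for walls */
        for (int ix = 0; ix < N; ix++)
            printf("%6.0f ", J[ix][iy] < NEG_INF / 2 ? -99.0 : J[ix][iy]);
        printf("\n");
    }
    return 0;
}

The Critic networks discussed below are asked to produce this kind of J surface directly from the maze description, rather than by running the backups square by square.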
In classical dynamic programming, we
solve for J exactly. In most forms of ADP (or, more precisely, in most forms of
adaptive critic design), we approximate J by some kind of function
approximation scheme, usually a neural network. Thus we approximate J(R) by some function J(R,W), where W is a set of
weights or parameters; this approximating network is called a Critic network.
For relatively “simple” optimization
problems -- such as conventional aircraft control [3] or vehicle control -- one
may expect the true function J to be relatively smooth. Barron[12] has shown
that the most common feedforward neural networks -- MLPs -- can approximate
smooth functions well, even as the number of inputs increases; he has also
shown that linear basis function approximators (like RBFs) do not scale up nearly so well as the number of inputs increases.
Unfortunately, for serious problems
in planning or scheduling, the J function will typically not be smooth. We
cannot expect ADP systems based on feedforward or Hebbian Critics to perform
very well on such problems, if the Critic networks themselves are inherently
unable to approximate that kind of J function.
The generalized path planning or
spatial navigation problem is an example of a problem which is very difficult
for feedforward or Hebbian Critics. It is well known that the brains of animals
can solve such problems effectively, though it may take them some time to learn
how to do so. But if the Critic network cannot approximate the correct J
function, then no amount of learning can lead to high quality performance.
Until we implement Critics capable of solving such problems, our “intelligent
controllers” will fall short of true brain-like capabilities in a very
important way.
The problem of navigating a simple
maze, defined over an array of squares, is a simple but challenging example of
such a problem. The reason for choosing such a simple example, to begin with,
will be explained further in the last section of this paper.
CELLULAR SRNS VERSUS OTHER RECURRENT NETWORKS
Recently
many engineers have argued that: (1) recurrent networks should be useful in
theory because we know that they are important in the brain; but (2) it is not
practical to use them yet, because we know how to use feedforward nets, but the
technology of recurrent nets is too fuzzy and underdeveloped. One goal of this
work was to create an example where the use of a recurrent network is
straightforward and the potential engineering benefits are substantial.
The literature on recurrent networks
has confused many people, because there is such a large variety of designs,
aimed at performance on so many different tasks. Most of the literature
describes classical networks -- like the early designs of Grossberg and
Hopfield -- defined by ordinary differential equations (or even asynchronous
binary updates!), trained by some form of Hebbian learning. But this paper will
focus exclusively on networks used to approximate a J function, for a moderate
to large planning problem. For reasons given above, the Hebbian networks are
not appropriate for this particular task.
In practical applications today,
computations are mainly based on discrete time cycles or sampling rates, rather
than differential equations. In the discrete time formulation, there are two
related types of recurrent network, the Time-Lagged Recurrent Network (TLRN)
and the Simultaneous Recurrent Network (SRN). One way to define a TLRN is by
the equations:
Y(t+1) = f1(X(t), R(t), W) (2)
R(t+1) = f2(X(t), R(t), W) , (3)
where
we are trying to predict a vector Y(t+1),
where X(t) is the vector of
what we observe at time t, where W is an array of weights, and where R represents the recurrent
connections. More precisely, R(t)
represents some outputs of a set of neurons at time t, which will be remembered
and used in the next time period, t+1. The TLRN is trained so as to minimize
the error in predicting Y(t+1).
The vector R(t) is treated as
an intermediate part of the network, just like the output of an ordinary hidden
layer[8,11,13]. In practice, R
serves as a kind of short-term memory, or working memory, or state estimator.
TLRNs have had substantial real-world application in several places, such as in
the work of Feldkamp’s group at Ford Motor Company.
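The toy C fragment below illustrates nothing more than the flow of information in equations 2 and 3: the recurrent vector R is carried forward from one time step to the next as a kind of working memory. The particular functions f1 and f2 are arbitrary placeholders of ours, not a trained TLRN.

/* Tiny sketch of one TLRN update cycle (equations 2 and 3). The functions
   f1 and f2 are arbitrary placeholders; the point is only that R(t) is
   carried forward as memory into the next time step. */
#include <math.h>
#include <stdio.h>

static double f1(double x, double r) { return tanh(0.5 * x + 0.8 * r); }  /* prediction head  */
static double f2(double x, double r) { return tanh(0.3 * x - 0.6 * r); }  /* recurrent state  */

int main(void)
{
    double R = 0.0;                                    /* recurrent "working memory" */
    double X[5] = {0.1, -0.4, 0.9, 0.2, -0.7};         /* observed inputs over time  */
    for (int t = 0; t < 5; t++) {
        double Y_next = f1(X[t], R);                   /* Y(t+1) = f1(X(t), R(t), W) */
        R = f2(X[t], R);                               /* R(t+1) = f2(X(t), R(t), W) */
        printf("t=%d  predicted Y(t+1)=%.3f  R(t+1)=%.3f\n", t, Y_next, R);
    }
    return 0;
}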
By contrast, the SRN is not intended
to provide better forecasting over time. It is not intended to provide memory
of past events or trends. Instead, it uses recurrence to provide a more general
function approximation capability, based on concepts like those of Turing theory
and complexity theory[11]. For more information about why SRNs are expected to
be important in theory, see [8,14] and related work by Giles et al.
Conceptually, the SRN is defined by the equations:
y(n+1)(t) = f(X(t), y(n)(t), W)        (4)

Y(t) = y(∞)(t) ,        (5)
where
Y(t) is the ultimate output
of the network at time t, where y(n)(t)
is the intermediate output at iteration number n, where X(t) is the input to the network at time t, where W is a set
of weights, and where f
represents any feedforward network which you choose to use here. In practice,
of course, we do not wait for an infinite number of iterations; we use some
sort of practical stopping rule to describe when y has “settled down.” In the work here, it was good enough
to use only 20 iterations; however, in learning, we used only one iteration in
the first 20 passes, two iterations in the next twenty, and so on, until we
reached 20. At each time t, we initialized the network with a simple vector y(0)(t) chosen on an
intuitive basis: -1 for all components of the vector except the last one, which
was set to zero.
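The C sketch below shows the mechanics of this relaxation and of the staged iteration schedule just described. The core network f here is a one-layer placeholder with made-up weights, standing in for the cellular network defined below; only the iteration structure is meant to be taken literally.

/* Sketch of the SRN relaxation of equations 4 and 5: repeatedly apply a
   feedforward core f to (X, y), here capped at 20 iterations, with the cap
   grown by one every 20 training passes as described in the text. The core
   f below is a placeholder, not the cellular network of this paper. */
#include <math.h>
#include <stdio.h>

#define NY 5   /* number of recurrent outputs y */
#define NX 2   /* number of external inputs X   */

static void core_f(const double *X, const double *y_in, double *y_out)
{
    for (int i = 0; i < NY; i++) {                      /* placeholder tanh layer */
        double net = 0.1 * (i + 1);
        for (int j = 0; j < NX; j++) net += 0.2 * X[j];
        for (int j = 0; j < NY; j++) net += 0.1 * y_in[j];
        y_out[i] = tanh(net);
    }
}

static void srn_forward(const double *X, int max_iter, double *y)
{
    double y_next[NY];
    for (int i = 0; i < NY; i++) y[i] = -1.0;           /* y(0): all -1 ...        */
    y[NY - 1] = 0.0;                                    /* ... except last set to 0 */
    for (int n = 0; n < max_iter; n++) {
        core_f(X, y, y_next);
        for (int i = 0; i < NY; i++) y[i] = y_next[i];
    }                                                   /* Y(t) = y(max_iter)(t)    */
}

int main(void)
{
    double X[NX] = {1.0, 0.0}, y[NY];
    for (int pass = 0; pass < 400; pass++) {
        int max_iter = pass / 20 + 1;                   /* 1,...,1,2,...,2, up to 20 */
        if (max_iter > 20) max_iter = 20;
        srn_forward(X, max_iter, y);
        /* ...a training step (BTT or truncation) would go here... */
    }
    srn_forward(X, 20, y);
    printf("settled y5 = %.4f\n", y[NY - 1]);
    return 0;
}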
For some applications requiring
brain-like intelligence, we would need to use hybrid networks incorporating both
kinds of recurrence, as described in [15] but explained more fully in [11]. For
example, in processing movie images, we might process 32 frames (images)
per
second, such that the time between “t” and “t+1” is 1/32 second. We might use a
fast neural chip, allowing a thousand iterations of recurrent processing per
frame, such that the time between “n” and “n+1” is 1/32000 second. This allows
us to combine short-term memory and iterative processing, so that our
network can learn to exploit both the dynamics of the image and the kinds of
recurrence needed for contour processing and segmentation and so on. (Authors
such as Grossberg, VonderMalsburg and Hartmann have shown how such recurrence
effects can work in the fixed-weight or fixed connection portion of an image
processing system.) We would theorize that the cerebral cortex uses a similar
style of computation, with a high-speed inner recurrent loop embedded within a
lower-speed sampling system, in order to perform the same task. For the maze
problem, however, we were dealing with a static function approximation task,
for which an SRN by itself is sufficient.
By definition, SRNs are trained to
minimize the gap between the output Y(t)
and its desired target values, based on the
derivatives of error with respect to the weights. For the maze problem, we used
the usual measure of square error, summed over all the open squares in the
maze. There are five different techniques which can be used to estimate or
calculate the derivatives:
(1) Backpropagation Through Time (BTT),
which was first implemented on the MIT Multics in 1973[13], and gives exact
derivatives at a cost similar to the cost of a forward pass of the SRN; though
convenient in engineering, it is not plausible as a model of biology;
(2) Truncation, which usually means a single
pass of ordinary backpropagation -- the method normally used with “Simple
Recurrent Networks” in psychology[16];
(3) Forwards Propagation, which has been
reinvented many times under many names;
it
calculates exact derivatives in forwards time, but imposes high computational
costs for large networks, and is therefore not a plausible model for anything
in the brain[8];
(4) Simultaneous backpropagation, which
gives exact derivatives for a fully converged SRN, assuming in effect that Y(t) does not depend on y(0)(t); special
cases of this were developed independently by Werbos[15], Almeida and Pineda;
see [11,14] for the most general version;
(5) The Error Critic, which approximates
BTT, with an approximation valid both for converged and nonconverged networks
[8,11].
All but simultaneous backpropagation
are also applicable to TLRNs, but the Error Critic is the only one of these
choices which could be plausible as a description of how the cerebral cortex
handles time-lagged recurrence. For an SRN, both Error Critics and Simultaneous
Backpropagation appear plausible; however, the Error Critic would appear more
compatible with Freeman’s notion of recurrence as a device to support chaotic
search of the y space [22].
For this work, we used both BTT and
Truncation. We used BTT because it had the highest probability of performing
the task. We tried truncation because it is the simplest and most popular method.
The implementation of BTT followed very closely the description in
Chapter 8 of [13], or the later part of Chapter 10 of [11]. (See [8] for
details and flow charts, or see the C code.) To implement truncation, we used
almost the same code as with BTT, but cut the backpropagation off after one
iteration. We implemented an MLP alternative, simply by limiting both the
forward and backward passes of the SRN to one iteration.
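The difference between BTT and truncation can be seen on a one-weight toy version of equation 4. In the C sketch below, BTT backpropagates through all 20 iterations and accumulates every contribution to dE/dw, while truncation keeps only the contribution from the final iteration; the scalar recurrence is purely illustrative and is not our maze network.

/* Contrast of BTT and truncation on a one-weight toy SRN:
   y(n+1) = tanh(w*y(n) + x), error E = 0.5*(y(N) - target)^2. */
#include <math.h>
#include <stdio.h>

#define N 20

int main(void)
{
    double w = 0.7, x = 0.3, target = 0.9;
    double y[N + 1];

    y[0] = -1.0;                                  /* initial y(0)  */
    for (int n = 0; n < N; n++)                   /* forward pass  */
        y[n + 1] = tanh(w * y[n] + x);

    double dE_dy = y[N] - target;                 /* dE/dy(N)      */

    /* truncation: one backward step only */
    double grad_trunc = dE_dy * (1.0 - y[N] * y[N]) * y[N - 1];

    /* BTT: back through every iteration, accumulating dE/dw */
    double grad_btt = 0.0, back = dE_dy;
    for (int n = N - 1; n >= 0; n--) {
        double dtanh = 1.0 - y[n + 1] * y[n + 1];
        grad_btt += back * dtanh * y[n];          /* direct dependence on w */
        back = back * dtanh * w;                  /* chain back to y(n)     */
    }

    printf("truncation dE/dw = %.6f\n", grad_trunc);
    printf("BTT        dE/dw = %.6f\n", grad_btt);
    return 0;
}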
Success in this work required a
special choice of feedforward core network f
(as in equation 4) and use of the Adaptive Learning Rate algorithm (ALR) [11,
ch.3; 8].
For the core network f, we chose to use a cellular
network that may be described as:
y(n+1)(ix,iy) = g(X(ix,iy), y(n)(ix,iy), y(n)(ix,iy±1), y(n)(ix±1,iy), W)        (6)

J(ix,iy) = Ws * y5(20)(ix,iy) ,        (7)
which
will now be explained in detail. Equations 6 and 7 are just one possible way of
implementing equation 4 (i.e., of choosing f.)
The time index “t” has been suppressed, because all of these quantities are
taken at the same time t.
In equations 6 and 7, we have added
two coordinate indices “ix” and “iy,” which correspond to the coordinates of a
square in the maze. Thus for a simple 5 by 5 maze, ix varies from 1 to 5 and iy
from 1 to 5. For formal mathematical reasons[8], we augment the maze by
assuming a wall of obstacles to the left of the maze and under it, to fill in
squares for which ix=0 and iy=0. When ix=5, we interpret “ix+1” to mean ix=0;
for iy=0, “iy-1” means iy=5, and so on. This augmentation does not change the
maze problem, but it ensures the exact validity of the classic Lie group
symmetry theory which justifies the cellular design.
The key points which make this
system cellular are that: (1) we are choosing y(n) to be a kind of three-dimensional array,
with five neurons located “at” or “above” each square; (2) we use the same
weights W and Ws in each cell, even though the inputs and outputs are
different at each square. (Of course, W is still a set of weights, not just one
weight.) For the function g
-- the network “at” each square -- we used a 5-neuron 5-output 11-input
generalized MLP, as described in chapter 8 of [13], except that we used tanh as
the transfer function s(net). The 11 inputs consisted of: (1) two binary inputs
X1 and X2 , indicating whether the square is covered by
an obstacle or is the goal cell; (2) the value of y1(n)
from
the
four neighboring cells; (3) y(n)
from the cell itself. In effect, the network at each cell is composed of two
parts, a connector part and a local part. This structure was used both with SRN
runs and MLP runs.
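As an illustration of this cellular, weight-shared structure, the C sketch below runs the forward pass of equations 6 and 7 on the augmented 6 by 6 grid, with the wrap-around indexing described above. For brevity, the cell network g is reduced to a single tanh layer with arbitrary placeholder weights rather than the 5-neuron generalized MLP of [13], and Ws is a single untrained weight; only the wiring of the 11 inputs and the sharing of W across cells follow the text.

/* Forward pass of the cellular SRN of equations 6 and 7. Every cell applies
   the SAME weights W to its 11 inputs: X1, X2, y1 from its four neighbors,
   and its own five y's. g is simplified to one tanh layer; weights are
   placeholders, not trained values. */
#include <math.h>
#include <stdio.h>

#define SZ 6          /* augmented grid: index 0 is the added wall row/column */
#define NH 5          /* neurons (and outputs) per cell   */
#define NIN 11        /* inputs per cell                  */
#define NITER 20      /* SRN iterations                   */

static double W[NH][NIN + 1];                       /* shared cell weights (+bias) */
static double Ws = 0.5;                             /* shared output weight        */
static double X1[SZ][SZ], X2[SZ][SZ];               /* obstacle flag, goal flag    */
static double y[SZ][SZ][NH], ynew[SZ][SZ][NH];

static int wrap(int i) { return (i + SZ) % SZ; }    /* torus indexing of augmented maze */

int main(void)
{
    for (int i = 0; i < SZ; i++) X1[i][0] = X1[0][i] = 1.0;   /* added wall          */
    X1[2][4] = X1[3][3] = X1[4][2] = 1.0;                     /* obstacles (2,4),(3,3),(4,2) */
    X2[4][4] = 1.0;                                           /* goal at (4,4)       */

    for (int k = 0; k < NH; k++)                              /* placeholder weights */
        for (int j = 0; j <= NIN; j++)
            W[k][j] = 0.05 * ((k + j) % 3 - 1);

    for (int ix = 0; ix < SZ; ix++)                           /* y(0): -1, last = 0  */
        for (int iy = 0; iy < SZ; iy++) {
            for (int k = 0; k < NH; k++) y[ix][iy][k] = -1.0;
            y[ix][iy][NH - 1] = 0.0;
        }

    for (int n = 0; n < NITER; n++) {                         /* equation 6          */
        for (int ix = 0; ix < SZ; ix++)
            for (int iy = 0; iy < SZ; iy++) {
                double in[NIN] = {
                    X1[ix][iy], X2[ix][iy],
                    y[wrap(ix + 1)][iy][0], y[wrap(ix - 1)][iy][0],
                    y[ix][wrap(iy + 1)][0], y[ix][wrap(iy - 1)][0],
                    y[ix][iy][0], y[ix][iy][1], y[ix][iy][2],
                    y[ix][iy][3], y[ix][iy][4]
                };
                for (int k = 0; k < NH; k++) {
                    double net = W[k][NIN];                   /* bias                */
                    for (int j = 0; j < NIN; j++) net += W[k][j] * in[j];
                    ynew[ix][iy][k] = tanh(net);
                }
            }
        for (int ix = 0; ix < SZ; ix++)
            for (int iy = 0; iy < SZ; iy++)
                for (int k = 0; k < NH; k++) y[ix][iy][k] = ynew[ix][iy][k];
    }

    for (int iy = SZ - 1; iy >= 1; iy--) {                    /* equation 7 over the maze */
        for (int ix = 1; ix < SZ; ix++)
            printf("%8.3f", Ws * y[ix][iy][NH - 1]);
        printf("\n");
    }
    return 0;
}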
Because the same weights are used
(“shared”) across different squares or cells, this design dramatically reduces
the number of weights, compared with conventional networks.
This
is crucial to rapid learning and good generalization[14]. If we had allowed
different cells to use different weights, the symmetry of the augmented maze
problem guarantees that the best set of weights would have used the same
weights in every cell anyway; thus we are simplifying our network at no cost at
all, relative to the generalized maze navigation problem.
Lie group symmetry or “weight
sharing” has been used for decades in image processing, by authors like Laveen
Kanal, Lee Giles, Isabelle Guyon, LeCun, etc. Chua has shown how cellular
networks in general have far better throughput than conventional networks in
VLSI implementation. Our design here does not seem very brain-like, but can we
be sure? In fact, many neuroscientists now believe that the hippocampus does
have an explicit, built-in representation of spatial location. In
conversation[4], Pribram has described the hippocampus as the recurrent hidden
layer of the highest Critic network in the brain, the limbic system. Freeman’s
work on the hippocampus has demonstrated the importance of an inner loop of
recurrence, very similar in spirit to the design we are using [22]. These
parallels are worthy of further, more careful evaluation.
To adapt this structure, we used the
ALR algorithm over three groups of weights: (1) Ws; (2) the bias or intercept
weights in W; (3) the rest of W. We also injected a cutoff to prevent gross
overshoot. See [8] for details.
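A minimal sketch of the flavor of this adaptation is given below: each weight group keeps its own learning rate, which grows when successive gradients for that group agree and shrinks when they disagree, with a crude cutoff against overshoot. The specific factors (1.2 and 0.5) and the cutoff value are placeholders of ours, not the ALR constants of [11, ch.3] or the exact rule used in our runs.

/* Sketch of per-group adaptive learning rates in the spirit of the ALR rule.
   The growth/decay factors and the cutoff below are illustrative only. */
#include <stdio.h>

#define NGROUPS 3

int main(void)
{
    double lr[NGROUPS] = {0.1, 0.1, 0.1};   /* one rate per weight group: Ws, biases, rest of W */
    /* illustrative dot products between successive gradients for group 0 */
    double dots[6] = {0.4, 0.3, 0.5, -0.2, 0.1, 0.2};
    for (int t = 0; t < 6; t++) {
        if (dots[t] > 0.0) lr[0] *= 1.2;    /* gradients agree: speed up     */
        else               lr[0] *= 0.5;    /* gradients disagree: back off  */
        if (lr[0] > 10.0) lr[0] = 10.0;     /* crude cutoff against overshoot */
        printf("step %d: learning rate for group 0 = %.4f\n", t, lr[0]);
    }
    return 0;
}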
RESULTS ON VERY SIMPLE INITIAL TESTS, AND CONCLUSIONS
In
the work so far, we have only used a very simple example of the maze navigation
problem,
strictly
because of time limitations. We used a single 5 by 5 maze, with a goal
at the square (4,4) and obstacles at (2,4), (3,3) and (4,2). If we had not imposed a cellular structure,
to reduce the number of weights, this use of a single training example would
have led to gross overtraining and zero error with all methods. However, with
the cellular constraints -- dictated by the generalized maze navigation
problem, as discussed above -- we did not have this problem.
There is a strong analogy between
this simple 5 by 5 maze problem and the simple XOR problem posed by Minsky
decades ago[17] and later solved by Rumelhart, Hinton and Williams[18]. In both
cases, the example itself was of no inherent importance. The problem could be
solved easily enough without neural networks. The example was chosen in order
to provide the simplest possible, minimal representation of a more general
problem. If neural networks could not have found a solution to the XOR problem,
through learning, that would have been a show-stopper all by itself;
there would have been no need (according to Minsky) to study them further,
until this initial challenge could be met. In a similar way, if neural networks
could not solve this very simple maze problem, it would seriously undermine all
our hopes to build intelligent controllers with neural networks. This
particular maze was chosen -- in advance -- to be especially tricky, with the
same kind of multiple choice confusion that the XOR problem exhibits.
Before we started this work, we
already knew that conventional neural networks could not solve the problem. As
discussed above, the MLP was by far the best conventional candidate to solve
this problem. However, substantial prior efforts to solve it with an MLP
were unsuccessful[23]. In our own work, the MLP
essentially
stopped learning after only 80 iterations. It froze up with a total square
error (over all clear squares) of about 500. Using an SRN trained by
truncation, we reached an error of 115 after 30,000 iterations, and little if
any sign of further progress. With an SRN trained by BTT, total error reached
0.87 after 30,000 iterations, and was still continuing to decrease.
The
key finding here is that a cellular SRN can solve this problem -- which no
other network has done before.
Furthermore, the J function as
predicted by SRN-BTT was close enough that it led to a correct choice of which
way to go on all squares of the maze. But with SRN-truncation, the
estimated J pointed in the wrong direction on 11 squares, in the right
direction on 10, and gave a tie on 1. (Here we assume that the robot traveling the
maze will stop if the square it is on has a J lower than all its neighbors.)
Again, the MLP was still worse.
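The stopping rule above amounts to reading a greedy policy off the estimated J surface: move to the neighboring clear square with the lowest J, and stop when no neighbor is lower (note that, under this rule, lower J is better, the opposite sign convention from the maximization sketch given earlier). The C sketch below walks such a surface; its placeholder J (Manhattan distance to the goal, ignoring obstacles) is deliberately crude, and in fact strands the walk at square (4,1) beside the obstacle at (4,2), which illustrates concretely why an inaccurate Critic defeats the controller no matter how actions are chosen.

/* Greedy walk over an estimated J surface, following the stopping rule in
   the text. The J-hat surface is a crude placeholder, not network output. */
#include <stdio.h>
#include <stdlib.h>

#define N 5

static double Jhat[N][N];
static int obstacle[N][N];

int main(void)
{
    obstacle[1][3] = obstacle[2][2] = obstacle[3][1] = 1;   /* (2,4),(3,3),(4,2) */
    for (int ix = 0; ix < N; ix++)
        for (int iy = 0; iy < N; iy++)
            Jhat[ix][iy] = abs(ix - 3) + abs(iy - 3);       /* placeholder "distance to (4,4)" */

    int x = 0, y = 0;                                       /* start at square (1,1) */
    for (int step = 0; step < 25; step++) {
        int bx = x, by = y;
        int dx[4] = {1,-1,0,0}, dy[4] = {0,0,1,-1};
        for (int a = 0; a < 4; a++) {
            int nx = x + dx[a], ny = y + dy[a];
            if (nx < 0 || nx >= N || ny < 0 || ny >= N || obstacle[nx][ny]) continue;
            if (Jhat[nx][ny] < Jhat[bx][by]) { bx = nx; by = ny; }
        }
        if (bx == x && by == y) break;        /* own J-hat lower than all neighbors: stop */
        x = bx; y = by;
        printf("move to (%d,%d)\n", x + 1, y + 1);
    }
    printf("stopped at (%d,%d)\n", x + 1, y + 1);
    return 0;
}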
In these tests, we also found that
the SRN-BTT was far more sensitive to initial weights -- as expected -- than
the other two alternatives. With the MLP and SRN-truncation, the results were
very similar for all the initial weights tried out. But with the SRN-BTT, we
tried out three or four distinct initial sets of weights, only one of which led
to rapid initial learning.
(These
initial weights were set by hand, very arbitrarily; unfortunately, we had
problems seeding the Unix random number generator.) This experience fits in
with a kind of Murphy’s Law mentioned by Werbos in a 1987 paper: that networks
with a relatively high accuracy and ability to reduce error substantially also
tend to be the hardest to make converge. Thus the issue of local minima is
more serious for true SRNs than for MLPs. This highlights the need for a
“syncretic” design[14] in those applications where one wants to combine
high accuracy and generalization (as in SRNs) together with rapid real-time
learning (as in RBFs or associative memory networks).
In the next phase of this work, we
intend to explore several practical applications,
as
well as the true generalized maze problem, over many mazes. Because of the
importance of the local minimum issue, we intend to rely heavily on
step-by-step learning or “shaping”, as mentioned in chapter 3 of [11], as well
as features like modified error functions to prevent premature overtraining.
The need to use step-by-step learning may seem discouraging at first, but even
the human brain depends very heavily on that kind of strategy when learning to
solve truly difficult planning and control problems. Naturally, the next step
will be to embed all this into larger, modular optimization designs.
For the sake of completeness, we may
also do further work to demonstrate empirically that MLPs cannot solve
this problem, even with serious effort. To do this, we would use a more fully
connected cellular MLP, with 72 inputs to each cell (representing the entire
maze) and 5 to 8 active neurons in each cell. This would require a larger
training set, and explicit tests of overtraining. Clearly, however, we have
shown that an SRN can solve this problem, using a far more parsimonious
structure.
For reasons beyond the scope of this
paper, we have some reason to believe that the SRN is not simply learning to
reproduce the value iteration of dynamic programming, but is doing something a
bit simpler, requiring less precision in the recurrent connections.
In general, this work does not
indicate that we need to replace MLPs in relatively smooth, medium-sized
problems like those addressed today in automotive or aerospace applications.
However, when mappings become very complex and combinatorial in nature -- as in
image segmentation or true planning problems -- SRNs may become crucial to
success in practical applications. This has serious implications for our
understanding of intelligence in the brain[4].
REFERENCES
[1]
P.Werbos, Optimal neurocontrol: Practical benefits, new results and
biological evidence, Proc. World Cong. on
Neural Networks (WCNN95), Erlbaum, 1995.
[2]
P.Werbos, Optimization methods for brain-like intelligent control, Proc. Conf. Decision and Control (CDC95),
IEEE, 1995.
[3]
D.Prokhorov, R.Santiago & D.Wunsch, Adaptive critic designs: a case
study for neurocontrol. Neural Networks, Vol.8, No.9, 1995.
[4]
P.Werbos, Learning in the brain: An engineering interpretation. In K.Pribram,
ed., Learning as Self-Organization,
Erlbaum, 1996.
[5]
P. Werbos, Optimization: A foundation for understanding consciousness. In
D.Levine & W.Elsberry, Optimality
in Biological and Artificial Networks., Erlbaum, 1996.
[6]
G.J.Tesauro, Practical issues in temporal difference learning, Machine Learning, Vol.8, p.257-277, 1992.
[7]
J.S.Baras & N.Patel, Information state for robust control of set-valued
discrete time systems, CDC95 [2].
[8]
X.Pang & P.Werbos, Neural network design for J function approximation in
dynamic programming, Journal on
Mathematical Modeling and Scientific Computing (Principia Scientia),
special issue on neural networks, planned as No. 1, 1997.
[9]
M.Minsky in E.Feigenbaum & Feldman, Computers and Thought, McGraw-Hill,1963.
[10]
P.Werbos, Elements of Intelligence, Cybernetica
(Namur), No.3, 1968
[11]
D.White & D.Sofge (eds) Handbook
of Intelligent Control, Van Nostrand, 1992.
[12]
A.R.Barron, Universal approximation bounds for superpositions of a sigmoidal
function, IEEE Trans. Info. Theory, Vol.39, No.3, p.930-945, 1993.
[13]
P.Werbos, The Roots of
Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, Wiley, 1994
[14]
P.Werbos, Supervised learning: Can it escape its local minimum?, WCNN93 [1]
[15]
P.Werbos, Generalization of backpropagation with application to a recurrent gas
market model, Neural Networks, Vol.1, p.339-365, 1988 (submitted August 1987).
[16]
L.Fausett, Fundamentals of Neural
Networks, Prentice-Hall, 1994.
[17]
M.Minsky & A.Papert, Perceptrons,
MIT Press, expanded edition, 1990.
[18]
D.Rumelhart, G.Hinton & R. Williams, in Rumelhart & McClelland eds, Parallel
Distributed Processing, Vol. 1, MIT Press, 1986.
[19]
T.Maxwell, L.Giles and Y.C.Lee, Generalization in neural networks: the
contiguity problem, Proc. 1st Int’l
Conf. Neural Networks (ICNN87), IEEE, 1987.
[20] C.VonderMalsburg & W.Schneider, Biological Cybernetics, Vol.54,
p.29-40, 1986.
[21]
S.Grossberg et al, MURI Year 1 Progress
Report (to Office of Naval Research), S.Grossberg, Boston U., 1996. This
report in turn cites numerous papers in Neural
Networks.
[22]
H.Chang & W.Freeman, Parameter optimization in models of the olfactory
neural system, Neural Networks, Vol.
9, No.1, p.1-14, 1996.
[23]
P.Houillon & A.Caron, Planar robot control in cluttered space by artificial
neural network, Math. Modelling and
Scientific Computing (Principia Scientia, St. Louis), Vol. 2, p.498-502,
1993.