Generalized Maze Navigation: SRN Critics Solve What Feedforward or Hebbian Nets Cannot
Paul J. Werbos
Room 675, National Science Foundation*
Dept. of Electrical Engineering,
Previous papers have explained why model-based adaptive critic designs -- unlike other designs used in neurocontrol -- have the potential to replicate some of the key, basic aspects of intelligence as seen in the brain. However, these designs are modular designs, containing “simple” supervised learning systems as modules. The intelligence of the overall system depends on the function approximation abilities of these modules. For the generalized maze navigation problem, no feedforward networks - MLP, RBF, CMAC, etc. - or networks based on Hebbian learning have good enough approximation abilities. In this problem, one learns to input a maze description, and output a policy or value function, without having to relearn the policy when one encounters a new maze. This paper reports how we solved a very simple but challenging instance of this problem, using a new form of simultaneous recurrent network (SRN) based on a cellular structure which has some interesting similarity to the hippocampus.
Several previous papers [1-3] have argued that model-based adaptive critics -- also called model-based approximate dynamic programming (ADP) -- offer us the only real hope of achieving true brain-like intelligence in artificial control systems, or of understanding intelligence in the brain[4-5] itself.
In principle, ADP systems should be able to approximate the solution to any problem in control or planning which can be formulated as an optimization problem. This includes almost any interesting problem! For example, winning a game of chess or of backgammon can be formulated as the problem of maximizing the
paper represents the views of the authors only, not the views of their
employers past or present. However, because it was written on
probability of victory over time. Using such a formulation, Tesauro has shown that simple adaptive critic systems can in fact play an excellent, master-class game of backgammon. As another example, the most general approach to the classic nonlinear robust control problem involves converting that problem into an optimization problem, which can then be solved by ADP in offline learning.
This paper will first review what a
Critic network is, and explain why we need to use more powerful Critic networks
in order to solve difficult problems in intelligent control. Next, it will
describe the SRN network used here, and compare it with other recurrent
networks. Finally, it will describe the very simple but challenging test
problem used here, and the empirical results. More complete information on all
these points -- ranging from literature review through to designs, flow charts
and empirical results -- is given in . For the underlying C code, contact
MOTIVATION: CRITICS FOR USE IN PLANNING, AND THE MAZE
Decades ago, classic researchers in Artificial Intelligence (AI) such as Minsky and Newell, Shaw and Simon showed how the problem of “reinforcement learning” -- the problem of maximizing an observed reinforcement signal U(t) over future time -- can encompass most of what we mean by “intelligence.” In 1968, Werbos showed how the problem of rein-forcement learning is linked to the problem of approximating dynamic programming, and proposed a primitive form of backpropagation as part of an ADP design. Since then, more sophisticated designs and explanations have been developed[1-3], but they all are logically based on the Bellman equation, the fundamental equation of dynamic programming.
Dynamic programming gives the exact solution to the problem of how to maximize a utility function U(R(t)) over future times t, in a nonlinear stochastic environment, where the vector R(t) represents the state of the environment at time t. Our ability to do well on the reinforcement learning problem depends on our ability to approximate the exact, optimal solution well -- i.e., our ability to approximate the dynamic programming solution. Dynamic programming converts a difficult problem in optimization over time (maximizing <U(R(t))>,
the expected value of U(R(t)) over all future times) into a simpler problem in function maximization.
When there is a finite time horizon (as with the maze), the Bellman equation  may be written:
J(R(t+1)) = max (U(R(t),u(t)) + <J(R(t+1))>) u(t) (1)
where u(t) represents the actions you take at time t. The problem is to solve for the function J. After J is known, you can find the optimal action u(t) at each time t by performing the maximization shown in equation 1.
In classical dynamic programming, we solve for J exactly. In most forms of ADP (or, more precisely, in most forms of adaptive critic design), we approximate J by some kind of function approximation scheme, usually a neural network. Thus we approximate J(R) by some function J(R,W), where W is a set of weights or parameters. J is called a Critic network.
For relatively “simple” optimization
problems -- such as conventional aircraft control  or vehicle control -- one
may expect the true function J to be relatively smooth. Barron has shown
that the most common feedforward neural networks -- MLPs -- can approximate
smooth functions well, even as the number of inputs increases; he has also
shown that linear basis function approximators (like
Unfortunately, for serious problems in planning or scheduling, the J function will typically not be smooth. We cannot expect ADP systems based on feedforward or Hebbian Critics to perform very well on such problems, if the Critic networks themselves are inherently unable to approximate that kind of J function.
The generalized path planning or spatial navigation problem is an example of a problem which is very difficult for feedforward or Hebbian Critics. It is well known that the brains of animals can solve such problems effectively, though it may take them some time to learn how to do so. But if the Critic network cannot approximate the correct J function, then no amount of learning can lead to high quality performance. Until we implement Critics capable of solving such problems, our “intelligent controllers” will fall short of true brain-like capabilities in a very important way.
The problem of navigating a simple maze, defined over an array of squares, is a simple but challenging example of such a problem. The reason for choosing such a simple example, to begin with, will be explained further in the last section of this paper.
CELLULAR SRNS VERSUS OTHER RECURRENT NETWORKS
Recently many engineers have argued that: (1) recurrent networks should be useful in theory because we know that they are important in the brain; but (2) it is not practical to use them yet, because we know how to use feedforward nets, but the technology of recurrent nets is too fuzzy and underdeveloped. One goal of this work was to create an example where the use of a recurrent network is straightforward and the potential engineering benefits are substantial.
The literature on recurrent networks has confused many people, because there is such a large variety of designs, aimed at performance on so many different tasks. Most of the literature describes classical networks -- like the early designs of Grossberg and Hopfield -- defined by ordinary differential equations (or even asynchronous binary updates!), trained by some form of Hebbian learning. But this paper will focus exclusively on networks used to approximate a J function, for a moderate to large planning problem. For reasons given above, the Hebbian networks are not appropriate for this particular task.
In practical applications today, computations are mainly based on discrete time cycles or sampling rates, rather than differential equations. In the discrete time formulation, there are two related types of recurrent network, the Time-Lagged Recurrent Network (TLRN) and the Simultaneous Recurrent Network (SRN). One way to define a TLRN is by the equations:
Y(t+1) = f1(X(t), R(t), W) (2)
R(t+1) = f2(X(t), R(t), W) , (3)
where we are trying to predict a vector Y(t+1), where X(t) is the vector of what we observe at time t, where W is an array of weights, and where R represents the recurrent connections. More precisely, R(t) represents some outputs of a set of neurons at time t, which will be remembered and used in the next time period, t+1. The TLRN is trained so as to minimize the error in predicting Y(t+1). The vector R(t) is treated as an intermediate part of the network, just like the output of an ordinary hidden layer[8,11,13]. In practice, R serves as a kind of short-term memory, or working memory, or state estimator. TLRNs have had substantial real-world application in several places, such as Feldkamp’s group at Ford Motor Company.
By contrast, the SRN is not intended to provide better forecasting over time. It is not intended to provide memory of past events or trends. Instead, it uses recurrence to provide a more general function approximation capability, based on concepts like that of Turing theory and complexity theory. For more information about why SRNs are expected to be important in theory, see [8,14] and related work by Giles et al. Conceptually, the SRN is defined by the equations:
y(n+1)(t) = f(X(t), y(n)(t), W) (4)
Y(t) = y(¥)(t) , (5)
where Y(t) is the ultimate output of the network at time t, where y(n)(t) is the intermediate output at iteration number n, where X(t) is the input to the network at time t, where W is a set of weights, and where f represents any feedforward network which you choose to use here. In practice, of course, we do not wait for an infinite number of iterations; we use some sort of practical stopping rule to describe when y has “settled down.” In the work here, it was good enough to use only 20 iterations; however, in learning, we used only one iteration in the first 20 passes, two iterations in the next twenty, and so on, until we reached 20. At each time t, we initialized the network with a simple vector y(0)(t) chosen on an intuitive basis: -1 for all components of the vector, except for the last one, set to zero.
For some applications requiring brain-like intelligence, we would need to use hybrid networks incorporating both kinds of recurrence, as described in  but explained more fully in . For example, in processing movie images, we might process 32 frames (images)
per second, such that the time between “t” and “t+1” is 1/32 second. We might use a fast neural chip, allowing a thousand iterations of recurrent processing per frame, such that the time between “n” and “n+1” is 1/32000 second. This allows us to combine short-term memory and iterative processing, so that our network can learn to exploit both the dynamics of the image and the kinds of recurrence needed for contour processing and segmentation and so on. (Authors such as Grossberg, VonderMalsburg and Hartmann have shown how such recurrence effects can work in the fixed-weight or fixed connection portion of an image processing system.) We would theorize that the cerebral cortex uses a similar style of computation, with a high-speed inner recurrent loop embedded within a lower-speed sampling system, in order to perform the same task. For the maze problem, however, we were dealing with a static function approximation task, for which an SRN by itself is sufficient.
By definition, SRNs are trained to minimize the gap between Y(t) and Y(t), based on the derivatives of error with respect to the weights. For the maze problem, we used the usual measure of square error, summed over all the open squares in the maze. There are five different techniques which can be used to estimate or calculate the derivatives:
(1) Backpropagation Through Time (BTT), which was first implemented on the MIT Multics in 1973, and gives exact derivatives at a cost similar to the cost of a forward pass of the SRN; though convenient in engineering, it is not plausible as a model of biology;
(2) Truncation, which usually means a single pass of ordinary backpropagation -- the method normally used with “Simple Recurrent Networks” in psychology;
(3) Forwards Propagation, which has been reinvented many times under many names;
it calculates exact derivatives in forwards time, but imposes high computational costs for large networks, and is therefore not a plausible model for anything in the brain;
(4) Simultaneous backpropagation, which gives exact derivatives for a fully converged SRN, assuming in effect that Y(t) does not depend on y(0)(t); special cases of this were developed independently by Werbos, Almeida and Pineda; see [11,14] for the most general version;
(5) The Error Critic, which approximates BTT, with an approximation valid both for converged and nonconverged networks [8,11].
All but simultaneous backpropagation are also applicable to TLRNs, but the Error Critic is the only one of these choices which could be plausible as a description of how the cerebral cortex handles time-lagged recurrence. For an SRN, both Error Critics and Simultaneous Backpropagation appear plausible; however, the Error Critic would appear more compatible with Freeman’s notion of recurrence as a device to support chaotic search of the y space .
For this work, we used both BTT and Truncation. We used BTT because it had the highest probability of performing the task. We tried truncation because it is the simplest, most popular method here. The implementation of BTT followed very closely the description of Chapter 8 of , or the later part of chapter 10 of . (See  for details and flow charts, or see the C code.) To implement truncation, we used almost the same code as with BTT, but cut the backpropagation off after one iteration. We implemented an MLP alternative, simply by limiting both the forward and backward passes of the SRN to one iteration.
Success in this work required a special choice of feedforward core network f (as in equation 4) and use of the Adaptive Learning Rate algorithm (ALR) [11, ch.3; 8].
For the core network f, we chose to use a cellular network that may be described as:
y(n+1)(ix,iy) = g(X(ix,iy), y(n)(ix,iy), y(n)(ix, iy±1), y(n)(ix± 1, iy), W) (6)
J(ix,iy) = Ws * y5(20)(ix,iy) , (7)
which will now be explained in detail. Equations 6 and 7 are just one possible way of implementing equation 4 (i.e., of choosing f.) The time index “t” has been suppressed, because all of these quantities are taken at the same time t.
In equations 6 and 7, we have added two coordinate indices “ix” and “iy,” which correspond to the coordinates of a square in the maze. Thus for a simple 5 by 5 maze, ix varies from 1 to 5 and iy from 1 to 5. For formal mathematical reasons, we augment the maze by assuming a wall of obstacles to the left of the maze and under it, to fill in squares for which ix=0 and iy=0. When ix=5, we interpret “ix+1” to mean ix=0; for iy=0, “iy-1” means iy=5, and so on. This augmentation does not change the maze problem, but it ensures the exact validity of the classic Lie group symmetry theory which justifies the cellular design.
The key points which make this system cellular are that: (1) we are choosing y(n) to be a kind of three-dimensional array, with five neurons located “at” or “above” each square; (2) we use the same weights W and Ws in each cell, even though the inputs and outputs are different at each square. (Of course, W is still a set of weights, not just one weight.) For the function g -- the network “at” each square -- we used a 5-neuron 5-output 11-input generalized MLP, as described in chapter 8 of , except that we used tanh as the transfer function s(net). The 11 inputs consisted of: (1) two binary inputs X1 and X2 , indicating whether the square is covered by an obstacle or is the goal cell; (2) the value of y1(n) from
the four neighboring cells; (3) y(n) from the cell itself. In effect, the network at each cell is composed of two parts, a connector part and a local part. This structure was used both with SRN runs and MLP runs.
Because the same weights are used (“shared”) across different squares or cells, this design dramatically reduces the number of weights, compared with conventional networks.
This is crucial to rapid learning and good generalization. If we had allowed different cells to use different weights, the symmetry of the augmented maze problem guarantees that the best set of weights would have used the same weights in every cell anyway; thus we are simplifying our network at no cost at all, relative to the generalized maze navigation problem.
Lie group symmetry or “weight sharing” has been used for decades in image processing, by authors like Laveen Kanal, Lee Giles, Ysabel Guyon, LeCun, etc. Chua has shown how cellular networks in general have far better throughput than conventional networks in VLSI implementation. Our design here does not seem very brain-like, but can we be sure? In fact, many neuroscientists now believe that the hippocampus does have an explicit, built-in representation of spatial location. In conversation, Pribram has described the hippocampus as the recurrent hidden layer of the highest Critic network in the brain, the limbic system. Freeman’s work on the hippocampus has demonstrated the importance of an inner loop of recurrence, very similar in spirit to the design we are using .These parallels are worthy of further, more careful evaluation.
To adapt this structure, we used the ALR algorithm over three groups of weights: (1) Ws; (2) the bias or intercept weights in W; (3) the rest of W. We also injected a cutoff to prevent gross overshoot. See  for details.
RESULTS ON VERY SIMPLE INITIAL TESTS, AND CONCLUSIONS
In the work so far, we have only used a very simple example of the maze navigation problem,
strictly because of time limitations. We used a single 5 by 5 maze, with a goal at the square (4,4) and obstacles at (2,4), (3,3) and (4,2). If we had not imposed a cellular structure, to reduce the number of weights, this use of a single training example would have led to gross overtraining and zero error with all methods. However, with the cellular constraints -- dictated by the generalized maze navigation problem, as discussed above -- we did not have this problem.
There is a strong analogy between this simple 5 by 5 maze problem and the simple XOR problem posed by Minsky decades ago and later solved by Rumelhart, Hinton and Williams. In both cases, the example itself was of no inherent importance. The problem could be solved easily enough without neural networks. The example was chosen in order to provide the simplest possible, minimal representation of a more general problem. If neural networks could not have found a solution to the XOR problem, through learning, that would have been a show-stopper all by itself; there would have been no need (according to Minsky) to study them further, until this initial challenge could be met. In a similar way, if neural networks could not solve this very simple maze problem, it would seriously undermine all our hopes to build intelligent controllers with neural networks. This particular maze was chosen -- in advance -- to be especially tricky, with the same kind of multiple choice confusion that the XOR problem exhibits.
Before we started this work, we already knew that conventional neural networks could not solve the problem. As discussed above, the MLP was by far the best conventional candidate to solve this problem. However, substantial prior eforts to solve his problem using an MLP were unsuccessful. In our own work, the MLP
essentially stopped learning after only 80 iterations. It froze up with a total square error (over all clear squares) of about 500. Using an SRN trained by truncation, we reached an error of 115 after 30,000 iterations, and little if any sign of further progress. With an SRN trained by BTT, total error reached 0.87 after 30,000 iterations, and was still continuing to decrease.
The key finding here is that a cellular SRN can solve this problem -- which no other network has done before.
Furthermore, the J function as predicted by SRN-BTT was close enough that it led to a correct choice of which way to go on all squares of the maze. But with SRN-truncation, the estimated J pointed in the wrong direction from 11 squares, in the right direction from 10, and a tie on 1. (Here I assume that the robot traveling the maze will stop if the square it is on has a J lower than all its neighbors.) Again, the MLP was still worse.
In these tests, we also found that the SRN-BTT was far more sensitive to initial weights -- as expected -- than the other two alternatives. With the MLP and SRN-truncation, the results were very similar for all the initial weights tried out. But with the SRN-BTT, we tried out three or four distinct initial sets of weights, only one of which led to rapid initial learning.
(These initial weights were set by hand, very very arbitrarily; unfortunately, we had problems seeding the Unix random number generator.) This experience fits in with a kind of Murphy’s Law mentioned by Werbos in a 1987 paper: that networks with a relatively high accuracy and ability to reduce error substantially also tend to be the hardest to make converge. Thus the issue of local minima are more serious for true SRNs than for MLPs. This highlights the need for a “syncretic” design in those applications where one wants to combine high accuracy and generalization (as in SRNs) together with rapid real-time learning (as in RBFs or associative memory networks).
In the next phase of this work, we intend to explore several practical applications,
as well as the true generalized maze problem, over many mazes. Because of the importance of the local minimum issue, we intend to rely heavily on step-by-step learning or “shaping”, as mentioned in chapter 3 of , as well as features like modified error functions to prevent premature overtraining. The need to use step-by-step learning may seem discouraging at first, but even the human brain depends very heavily on that kind of strategy when learning to solve truly difficult planning and control problems. Naturally, the next step will be to embed all this into larger, modular optimization designs.
For the sake of completeness, we may also do further work to demonstrate empirically that MLPs cannot solve this problem, even with serious effort. To do this, we would use a more fully connected cellular MLP, with 72 inputs to each cell (representing the entire maze) and 5 to 8 active neurons in each cell. This would require a larger training set, and explicit tests of overtraining. Clearly, however, we have shown that an SRN can solve this problem, using a far more parsimonious structure.
For reasons beyond the scope of this paper, we have some reason to believe that the SRN is not simply learning to reproduce the value iteration of dynamic programming, but is doing something a bit simpler, requiring less precision in the recurrent connections.
In general, this work does not indicate that we need to replace MLPs in relatively smooth, medium-sized problems like those addressed today in automotive or aerospace applications. However, when mappings become very complex and combinatorial in nature -- as in image segmentation or true planning problems -- SRNs may become crucial to success in practical applications. This has serious implications for our understanding of intelligence in the brain.
 P.Werbos, Optimal neurocontrol: Practical benefits, new results and biological evidence, Proc. World Cong. on Neural Networks (WCNN95), Erlbaum, 1995.
 P.Werbos, Optimization methods for brain-like intelligent control, Proc. Conf. Decision and Control (CDC95), IEEE, 1995.
 D.Prokhorov, R.Santiago & D.Wunsch, Adaptive critic designs: a case study for neurocontrol. Neural Networks, Vol.8, No.9, 1995.
 P.Werbos, Learning in the brain: An engineering interpretation. In K.Pribram, ed., Learning as Self-Organization, Erlbaum, 1996.
 P. Werbos, Optimization: A foundation for understanding consciousness. In D.Levine & W.Elsberry, Optimality in Biological and Artificial Networks., Erlbaum, 1996.
 G.J.Tesauro, Practical issues in temporal difference learning. Machine Learning, 1992, 8:p.257-277..
 J.S.Baras & N.Patel, Information state for robust control of set-valued discrete time systems, CDC95 .
 X.Pang & P.Werbos, Neural network design for J function approximation in dynamic programming, Journal on Mathematical Modeling and Scientific Computing (Principia Scientia), special issue on neural networks, planned as No. 1, 1997.
 M.Minsky in E.Feigenbaum & Feldman, Computers and Thought, McGraw-Hill,1963.
 P.Werbos, Elements of Intelligence, Cybernetica (Namur), No.3, 1968
 D.White & D.Sofge (eds) Handbook of Intelligent Control, Van Nostrand, 1992.
 A.R.Barron, Universal approximation bounds for superpositions of a sigmoidal function IEEE Trans. Info. Theory 39(3) 930-945, 1993.
 P.Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, Wiley, 1994
 P.Werbos, Supervised learning: Can it escape its local minimum?, WCNN93 
 P.Werbos, Generalization of backpropagation with application to a recurrent gas market model, Neural Networks, Vol.1, p.339-365, 1988 (submitted August 1987).
 L.Fausett, Fundamentals of Neural Networks, Prentice-Hall, 1994.
 M.Minsky & A.Papert, Perceptrons, MIT Press, expanded edition, 1990.
 D.Rumelhart, G.Hinton & R. Williams, in Rumelhart & McClelland eds, Parallel Distributed Processing, Vol. 1, MIT Press, 1986.
 T.Maxwell, L.Giles and Y.C.Lee, Generalization in neural networks: the continguity problem, Proc. 1st Int’l Conf. Neural Networks (ICNN87), IEEE, 1987.
 C.VonderMalsburg & W.Schneider, Biological Cybernetics , Vol.54, p.29-40, 1986.
 S.Grossberg et al, MURI Year 1 Progress Report (to Office of Naval Research), S.Grossberg, Boston U., 1996. This report in turn cites numerous papers in Neural Networks.
 H.Chang & W.Freeman, Parameter optimization in models of the olfactory neural system, Neural Networks, Vol. 9, No. 1, p.1-14,1996.
 P.Houillon & A.Caron, Planar robot control in cluttered space by artificial neural network, Math. Modelling and Scientific Computing (Principia Scientia, St. Louis), Vol. 2, p.498-502, 1993.