OPTIMIZATION METHODS FOR BRAIN-LIKE INTELLIGENT CONTROL
Paul J. Werbos
Room 675, National Science Foundation1
Arlington, VA 22230
The term “intelligent control” has become increasingly fuzzy, as the words “intelligent” and “smart” are used for everything from cleverly designed toasters to government reorganizations. This paper defines a more restricted class of designs, to be called “brain-like intelligent control.” The paper: (1) explains the definition and the concepts behind it; (2) describes benefits in control engineering, emphasizing stability; (3) mentions four groups who have implemented such designs, for the first time, since late 1993; (4) discusses the brain as a member of this class, one which suggests features to be sought in future research. These designs involve approximate dynamic programming -- dynamic programming approximated in generic ways so as to make it affordable on large-scale nonlinear control problems. These designs are based on learning. They permit a neural net implementation -- like the brain -- but do not require it. They include some, but not all, “reinforcement learning” or “adaptive critic” designs.
1. Definitions and Concepts
In classical control and in neural network control (neurocontrol), useful real-world systems are usually built up from designs which perform one or more of three basic tasks: (1) cloning of an existing expert or controller; (2) tracking of a setpoint or reference model, etc.; (3) optimization over time, with or without constraints. Neither of the first two is remotely plausible as a model of real intelligence -- of what human brains do as a whole system. Even though humans do learn from observing other people, we do not simply “clone” them, and we have an ability to go beyond what we learn from other people. Even when we do track some desired path, we ourselves choose our paths, and we change our paths adaptively in real time. Humans are not perfect optimizers; however, the idea of optimization over time fits with human and animal behavior so well that it has served as a kind of reference model in psychology, politics and economics for decades. For example, Herbert Simon and Howard Raiffa showed decades ago that all kinds of complex problem-solving
1 The views herein are personal, not the views of NSF.
behavior, goal-seeking behavior and economic decision-making can be produced as an application of optimization over time. Simon’s work is the foundation of much of the literature on “planning” in artificial intelligence (AI).
Do We Need Neural Nets?
To implement a general-purpose method to learn nonlinear control laws, we must first have a general-purpose method to represent or approximate nonlinear functions. Such a method could be an artificial neural network (ANN). In the ANN field, the task of learning to approximate a nonlinear function from examples is called “supervised learning.” But we could also use other methods, such as lookup tables or gain scheduling.
The designs here are mainly based on chapters 3, 10 and 13 of the Handbook of Intelligent Control , which carefully presents all designs and pseudocode in a generic, modular fashion, calling on subroutines to perform the supervised learning tasks. These subroutines, in turn, could be ANNs, elastic fuzzy logic systems[5,6], econometric models, or anything else which is manageable and differentiable. (Also see  for some typographic errors in .)
To merit being called “brain-like,” our designs must allow for the possibility that the components of the system could in fact be neural networks of some sort -- because the brain is in fact made up of neural networks, by definition. Likewise, our designs should explicitly include an option for real-time learning; however, in many engineering applications, “learning” from examples taken from a simulator or a database may actually be more useful.
Three Types of Optimization Over Time
In brief, we are looking for “brain-like” designs which address the classic problem of optimization over time -- the problem of outputting control vectors u(t), based on knowledge of a vector of observables (sensor inputs) X(t) and of the past, so as to maximize the expected value of some utility function U(X(t),u(t)), added up over all future times τ > t. (Of course, “discount rates” and constraints may also be considered.) We are looking for designs which could in principle solve this problem entirely on the basis of learning, without any specific prior assumptions about the stochastic plant or environment to be controlled.
Broadly speaking, there are three traditional ways to address such problems. First, there is the brute-force use of static optimization methods, such as simulated annealing or genetic algorithms. But random search, uninformed by derivatives, is typically very slow and inefficient, compared with search strategies informed by derivatives, when the size of a system is large and derivative information is used intelligently. Such designs do not meet the basic requirement, mentioned in the Abstract, that they should be able to scale effectively to large problems. (Still, I would not question the potential importance of stochastic methods in some secondary roles, within larger control systems.) The mismatch between these designs and the brain should be intuitively obvious.
Second, there are straightforward gradient-based methods, based on explicit forecasts of a future stream of events. Such methods include the ordinary calculus of variations, differential dynamic programming, model-predictive control using matrices, and model-predictive control accelerated by use of backpropagation (ch.10 of ; Widrow and Kawato in ;[11-14]), etc. These designs have tremendous practical applications. However, they are not truly brain-like, for three reasons. First, they require derivative calculations which (for exact or robust results) cost O(N²) calculations in each time period, or which require a kind of chaining or backpropagation backwards through time ; neither is brain-like. Second, they tend to assume the validity of a noise-free forecasting model, except in differential dynamic programming, which is still not numerically efficient in handling complex patterns of noise over time. Third, they usually impose an explicit, finite planning horizon -- usually a strict near-term boundary line between an interval which is totally planned and a more distant future which is totally ignored. These limitations are not so bad in many near-term engineering applications, but they do have some practical consequences (e.g. computational cost), and they are quite enough to rule out these designs as brain-like.
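As a concrete illustration of this second family, the sketch below plans over an explicit finite horizon by chaining cost derivatives backwards through time (backpropagation through time). The scalar plant x(t+1) = x(t) + u(t), the quadratic utility, and every constant are hypothetical choices made only for illustration, not taken from this paper.

```python
# Gradient-based planning over an explicit horizon: improve the controls
# u(0..T-1) for a known, noise-free scalar plant x(t+1) = x(t) + u(t)
# by chaining cost derivatives backwards through time. The plant, utility
# weights and horizon are all hypothetical choices for illustration.
T, target, r, lr = 10, 1.0, 0.1, 0.01
u = [0.0] * T

for _ in range(5000):
    # forward pass: roll the model out over the planning horizon
    x = [0.0]
    for t in range(T):
        x.append(x[t] + u[t])
    # backward pass: lam[t] = d(cost)/d x(t), chained back from the horizon
    lam = [0.0] * (T + 1)
    for t in range(T, 0, -1):
        lam[t] = 2.0 * (x[t] - target) + (lam[t + 1] if t < T else 0.0)
    # cost = sum of (x(t)-target)^2 plus r*u(t)^2; descend its gradient
    for t in range(T):
        u[t] -= lr * (lam[t + 1] + 2.0 * r * u[t])

final_x = sum(u)  # for this plant, x(T) is just the summed controls
```

Note how the backward pass exposes the two limitations discussed above: the derivative chaining runs strictly backwards through time, and everything outside the horizon T is simply ignored.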
Approximate Dynamic Programming (ADP)
This leaves us with only one candidate for brain-like intelligent control -- systems based on approximate dynamic programming (ADP), also known as “reinforcement learning” or “adaptive critics.” These three terms -- ADP, reinforcement learning and adaptive critics -- have become approximate synonyms in engineering in recent years. The concept of reinforcement learning -- maximizing an observed measure of utility U(t) -- is very old, both in psychology and in AI. The link between reinforcement learning and dynamic programming was first discussed in an old paper of mine, but became more widely known as a result of my later papers[16,17,10,14]. Bernie Widrow implemented the first working ANN version, and coined the term “adaptive critic,” in 1973. Despite the long history of “reinforcement learning” in biology, there is now reason to believe that the ADP formulation is actually more plausible as a model of biological intelligence.
To understand ADP, one must first review the basics of classical dynamic programming, especially the versions developed by Howard and Bertsekas. Classical dynamic programming is the only exact and efficient method to compute the optimal control policy over time, in a general nonlinear stochastic environment. The only reason to approximate it is to reduce computational cost, so as to make the method affordable (feasible) across a wide range of applications.
In dynamic programming, the user supplies a utility function, which may take the form U(R(t),u(t)) -- where the vector R is a Representation or estimate of the state of the environment (i.e. the state vector) -- and a stochastic model of the plant or environment. Then “dynamic programming” (i.e. solution of the Bellman equation) gives us back a secondary or strategic utility function J(R). The basic theorem is that maximizing <U(R(t),u(t))+J(R(t+1))> yields the optimal strategy, the policy which will maximize the expected value of U added up over all future time. Thus dynamic programming converts a difficult problem in optimizing over many time intervals into a straightforward problem in short-term maximization. In classical dynamic programming, we find the exact function J which exactly solves the Bellman equation. In ADP, we learn a kind of “model” of the function J; this “model” is called a “Critic.” (Alternatively, some methods learn a model of the derivatives of J with respect to the variables R_i; these correspond to Lagrange multipliers, λ_i, and to the “price variables” of microeconomic theory. Some methods learn a function related to J, as in the Action-Dependent Adaptive Critic (ADAC), an idea first published at this conference.)
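The Bellman recursion just described can be made concrete on a toy problem. The sketch below runs classical value iteration on a hypothetical two-state, two-action stochastic plant (all transition probabilities and utilities are invented for illustration), computing J and then the policy that maximizes <U + gamma*J>:

```python
# Classical dynamic programming (value iteration) on a hypothetical
# 2-state, 2-action Markov decision process, invented for illustration.
# P[s][a] = list of (probability, next state); U[s][a] = immediate utility.
P = {0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
     1: {0: [(1.0, 0)], 1: [(1.0, 1)]}}
U = {0: {0: 0.0, 1: -1.0},
     1: {0: 0.0, 1: 2.0}}
gamma = 0.9  # discount rate

# Repeated Bellman backups converge to the strategic utility function J
J = {0: 0.0, 1: 0.0}
for _ in range(200):
    J = {s: max(U[s][a] + gamma * sum(p * J[s2] for p, s2 in P[s][a])
                for a in P[s])
         for s in P}

# The optimal policy maximizes <U(s,a) + gamma * J(next state)>
policy = {s: max(P[s], key=lambda a: U[s][a] +
                 gamma * sum(p * J[s2] for p, s2 in P[s][a]))
          for s in P}
```

An ADP Critic approximates the table J above with a learned function, which is exactly what makes the method affordable when the state space is too large to enumerate.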
The family of ADP designs is extremely large. I have argued that it forms a kind of “ladder,” starting from the simplest methods -- which are a good starting place but limited in power -- and rising all the way up to the mammalian brain itself, and perhaps beyond . The simplest designs learn slowly when confronted with medium-sized engineering control problems, but the higher-level designs can learn much faster even on large problems, if implemented correctly.
Level zero of the ladder is the original Widrow critic. Level one is the Barto-Sutton-Anderson critic of 1983 and the Q-learning lookup-table design of Watkins from 1989, both reviewed by Barto in . Level two is the full implementation of ADAC, using derivative feedback from a Critic network to an Action network, as originally proposed by Lukes, Thompson and myself, and later extended and applied to several real-world problems by White and Sofge. (ADAC has been reinvented several times in the last year or two under the name of “modified Q-learning”.)
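As one concrete point on the lower rungs of this ladder, a level-one, Watkins-style lookup-table critic can be sketched in a few lines. The two-state plant below is a hypothetical toy (the same kind of invented example as above); the point is only that Q(s,a) is learned from sampled transitions, with no model of the plant at all:

```python
import random

# Level-one, Watkins-style lookup-table Q-learning on a hypothetical
# 2-state toy plant. No model of the plant is used: Q(s,a) is nudged
# toward sampled Bellman targets, one observed transition at a time.
random.seed(0)

def step(s, a):
    """Invented toy plant: action 1 in state 1 pays off; action 1 in
    state 0 usually reaches state 1, at a small immediate cost."""
    if s == 0 and a == 1:
        return (1 if random.random() < 0.8 else 0), -1.0
    if s == 1 and a == 1:
        return 1, 2.0
    return 0, 0.0  # action 0 always leads back to state 0, utility 0

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
gamma, alpha = 0.9, 0.1
s = 0
for _ in range(20000):
    a = random.choice((0, 1))          # pure exploration, for simplicity
    s2, u = step(s, a)
    target = u + gamma * max(Q[(s2, 0)], Q[(s2, 1)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])  # nudge toward Bellman target
    s = s2

greedy = {s: max((0, 1), key=lambda a: Q[(s, a)]) for s in (0, 1)}
```

The lookup table is what limits this level: it learns slowly and cannot scale, which is precisely why the higher rungs replace it with learned approximators and model-based feedback.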
Even these three simple designs meet three of the four basic requirements which I would use to define brain-like intelligent control:
(1) They are serious engineering-based designs, able to solve difficult problems in optimization over time, based on learning, allowing for ANN implementation. See [4,11] for some examples of applications. This requirement rules out those reinforcement learning designs derived from computational neuroscience which have no well-defined engineering functionality. Note that the brain itself -- unlike most bottom-up physiological models of learning in the brain -- does in fact have a high level of engineering functionality across a wide range of complex control tasks.
(2) They include a Critic component, which corresponds to the “emotional” or “secondary reinforcement” system which is known to be a major component of animal learning, supported by well-known structures in the brain.
(3) They include an Action component, a component which actually outputs the control vector u(t), based on some kind of learning, where the learning is based on some sort of reinforcement signals originating in the Critic.
Nevertheless, as Grossberg has stressed in many discussions, these designs have a huge, gaping limitation in addressing the kind of intelligence we see demonstrated in animal learning: they lack an “expectations” or “prediction” system. Crudely speaking, about half the experiments in animal learning demonstrate “Skinnerian” learning (reward versus punishment, and secondary reinforcement), but half demonstrate “Pavlovian” learning, which is based on the learning of expectations. Focusing on just a few very simple, limited experiments on Pavlovian learning, one can actually find ways to fit the data using some simple reinforcement learning models (as demonstrated by Klopf); however, more complex experiments do indicate the need for an explicit expectations system. There is also some compelling new work in neuroscience supporting this idea (e.g. ). From an engineering viewpoint, there are many technical and institutional reasons to prefer the use of designs which exploit a system identification component, which could either be an ANN or a first-principles system model. Thus for a brain-like intelligent system, I would add a fourth requirement:
(4) They must include a “Model” component, a component which could be implemented as a learning system adapted by system identification techniques, used to generate the primary training feedback which adapts the Action network, and used to estimate the state vector R in partially observed environments. For a “level four” or higher ADP system, I would also require that the Model generate primary training feedback to adapt the Critic as well, as in the “DHP” design . This requirement is not satisfied by systems which use Models only as simulators to generate artificial training data, as in “dreaming” or in the “Dyna” architecture in .
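This fourth requirement can be sketched at scalar scale, with an invented linear plant and invented gains (none of this is from the paper): a Model is first identified from probing data, and the derivative of that learned Model -- not of the plant itself -- then supplies the training feedback that adapts the Action law.

```python
import random

# Requirement (4) in a scalar sketch, with a hypothetical linear plant
# x(t+1) = a*x(t) + b*u(t): identify a Model from probing data, then use
# the Model's derivative as training feedback for the Action law u = -k*x.
random.seed(1)
a_true, b_true = 0.9, 0.5  # invented plant, unknown to the learner

# 1. System identification: least-squares fit of x(t+1) ~ a*x + b*u
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
sxx = sum(x * x for x, u in data)
suu = sum(u * u for x, u in data)
sxu = sum(x * u for x, u in data)
sxy = sum(x * (a_true * x + b_true * u) for x, u in data)
suy = sum(u * (a_true * x + b_true * u) for x, u in data)
det = sxx * suu - sxu * sxu
a_hat = (sxy * suu - suy * sxu) / det
b_hat = (suy * sxx - sxy * sxu) / det

# 2. Adapt the Action parameter k to minimize x(t+1)^2, using the learned
#    Model's derivative d x(t+1)/d u = b_hat as the training feedback.
k = 0.0
for _ in range(200):
    x = random.uniform(-1, 1)
    u = -k * x
    x_next = a_hat * x + b_hat * u          # the Model's prediction
    grad_k = 2.0 * x_next * b_hat * (-x)    # chain rule through the Model
    k -= 0.5 * grad_k
# k converges toward a_hat/b_hat, driving the predicted x(t+1) to zero
```

The design choice to highlight: the Model is used to generate derivative feedback for the Action component, not merely to simulate artificial training data.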
Designs which meet these four requirements were first proposed in several of my earlier papers (,), and explained more completely in [4,7]. But only in the past two years have they been brought into serious implementations. As expected, they have shown significant improvements in performance over simpler reinforcement learning designs; however, more research will be needed to better understand their properties, to make them available for a wider range of applications, and to replicate additional capabilities of the brain. Theoretical work on classical dynamic programming or on level-one ADP systems can be a useful preliminary step towards the understanding of more brain-like designs, but only if we make a conscious effort to “climb up the ladder” one step at a time as soon as we can.
2. Engineering Benefits
Both in my own work, and in the work supported by the program which I manage at NSF, the bulk of the effort has gone into understanding the practical benefits and costs of such designs in real-world engineering applications. As a result, this material is too complex and voluminous to summarize very easily. Thus I would urge the reader to go back to  and to further papers cited in  for more details.
In general, I would not recommend that anyone -- either in proving theorems or in building applications -- jump in one step from PID control to brain-like intelligent control. There is a very complex “ladder” of designs and applications, including both classical and ANN control designs. Usually there are significant benefits from going “up the ladder” just one step -- but the costs and benefits vary greatly from application to application.
Of course, stability -- actual stability more than theorems -- is a key concern in real-world applications.
The latest international conference on hypersonic flight contained a fascinating example of stability issues with standard H∞ control. Ranges of control parameters were developed which could stabilize the aircraft, assuming a center of gravity located at 12 meters. Ranges were then developed for 11.3 meters. The two regions were basically nonoverlapping. Thus for this extremely high-performance aircraft, stability can be a huge challenge. (It reminds me of the problem of walking in a canoe.) No matter how hard one works to control the center of gravity in advance, it would be somewhat dangerous -- unnecessarily dangerous -- to rely on any fixed-parameter controller. This leads directly to a need for some sort of adaptive or learning-based control, in order to maximize stability, in examples like this.
With conventional adaptive control, as with ordinary ANN adaptive control, dozens upon dozens of stability theorems now exist. But in both cases, the theorems have many, many conditions, which are usually not satisfied in complex real-world systems. As a practical matter, the conventional designs generally involve a myopic minimization of tracking error (or a closely related function) at time t+1. Because of deadtimes, and sign reversals of impulse responses, etc., myopia commonly leads to instability in real systems. (With complex nonlinear systems, one can sometimes find Liapunov functions to overcome such problems, but this is quite difficult in practice; it is analogous to solving systems of nonlinear algebraic equations by exact analytical means.) Thus in complex chemical plants, for example, adaptive control is rarely used, because of the stability issue; instead, it is more common to use model-predictive control, one of the methods for nonmyopic optimization over time discussed in section 1.
In summary, methods for optimization over time have substantial advantages in terms of actually achieving greater stability. The pros and cons of different methods in that class were mentioned briefly in section 1. Such methods allow one to define a utility function which includes concepts like energy use, cost, pollution and depreciation, in addition to tracking error; such terms are crucial in many applications.
Some ADP systems (, Appendix to ch.2), such as systems using elastic fuzzy logic systems as Critics, may give us Critic networks which are Liapunov functions for classical adaptive control; however, considerable research will be needed to create a working computer tool which verifies this after the fact for a wide range of nonlinear problems. Likewise, in some applications it may be best to use a brain-like controller simply to calculate the local value measures (Q,R) fed into a classic LQR controller, in order to combine global optimality with existing local stability theorems. ADP systems, however, allow one to explicitly minimize the probability of ever entering a catastrophic state, based on a nonlinear stochastic model of the plant (which may of course include uncertainty in plant parameters).
3. Implementations of Brain-Like Intelligent Control
The four groups referred to in the Abstract are: (1) John Jameson; (2) Rob Santiago and collaborators ; (3) Wunsch and Prokhorov of Texas Tech; (4) S.Balakrishnan of the University of Missouri-Rolla . See also  and other papers by these authors in WCNN95. H.Berenji of IIS, working with NASA Ames, has also produced adaptive fuzzy systems which are very close to meeting the definition above.
Jameson performed the first successful implementation, in 1993, of a level 3 ADP system. (See section 1 for how I define these “levels.”) He tested both a level 2 and a level 3 system on a very simple but non-Markovian (i.e., partially observed) model of a robot arm. Despite his best efforts, the level 2 system simply could not control the system, but the level 3 system could. Jameson found this discouraging, but it supports my claim that we need to “climb up the ladder” to cope with more difficult problems. One can avoid non-Markovian problems by doing prior state estimation, but this requires system identification in any case; thus one might as well use a brain-like design. (Still, there may be advantages to hybrid level 2/3 designs.)
Wunsch and Prokhorov have compared a well-tuned PID controller, a level 2 critic and a level 3 critic on the bioreactor and autolander test problems in , problems which have proven extremely difficult for conventional methods (nonminimum phase, etc.). They solved both problems cleanly with a level 2 critic, and solved the autolander with PID, even using the “noisy” version of the problem. But when they added more noise and shortened the runway by 40%, the PID and level 2 controllers crashed 100% of the time. The level 3 controller crashed 60% of the time, but came very close to landing in 2/3 of those cases. Later, in WCNN95, they reported 80% success on that problem, even using stringent landing criteria, using level 4 and 5 critics.
Balakrishnan has mainly studied problems in aircraft and missile control. Some of the best results, presented several times to the government, are still in press. For example, he has compared a number of standard well-developed designs in missile interception against a level 4 critic; he found that the latter could come closer to a true globally optimal trajectory, by at least an order of magnitude, compared with competing methods. He has done tests demonstrating robustness and closed-loop “re-adaptation.”
Finally, Berenji has implemented a system which is essentially equivalent to a level 3 critic (with adaptive fuzzy logic modules), except that the model network is replaced by a constant multiplier, in applications where the Jacobian of the model has fixed signs.
Exciting as this work is, it is, of course, only the earliest phase of a complex process.
4. Intelligent Control in the Brain
Back in 1981 and 1987, I published a kind of “cartoon model” of brain function (or, more precisely, of higher-level learning) as a level 5 ADP system. The 1987 paper was very apologetic in tone, because it left out a lot of key brain circuits -- such as the basal ganglia -- whose computational significance still remains obscure.
Since that time, however, I have come to appreciate that the apologies were excessive. So far as I know, that 1987 model is still the only model ever published which meets all four of the basic tests above -- tests which would have to be passed by any more accurate model. I would claim that this model does provide a valid first-order explanation of what is going on in the brain. It provides a first-pass starting point for an iterative process, aimed at explaining more and more detail in the future. New experiments, guided by ADP models, would be a crucial part of refining this understanding.
Why should engineers imagine that they have any hope at all of contributing to the understanding of something as complex as the brain? The methodological issues here are fairly complex (, ch.31 of ,). In essence, however, the key point is that understanding the brain -- a control system more complex than any we build today -- requires more knowledge of control mathematics than our engineering devices do; therefore, the engineering mathematics is a crucial prerequisite to a serious understanding of the functional capabilities of the brain, in learning, and of the circuitry which gives rise to these capabilities. Through the Collaborative Research Initiation (CRI) effort, and other funding initiatives in the planning stage, NSF and other agencies are now opening the door to the engineering-neuroscience collaborations needed to follow through on opportunities of this sort.
There is not enough room in this paper to discuss the current state of knowledge here in serious detail; in any case, this has been done elsewhere (ch.31 of ,). Crudely speaking, however, it seems clear that the brain is almost entirely a combination of three major pieces: (1) fixed, unlearned systems for preprocessing, postprocessing and definition of utility (U); (2) an upper-level ADP system which operates on a (clocked) sampling time on the order of 1/10-1/4 second; (3) a lower-level ADP system which operates on an effective sampling time on the order of 0.01 second. In other words, there is a kind of supervisory control arrangement here, required by the high complexity and long processing cycle of the upper-level system.
In the upper system, the “limbic system” -- known for decades as the main locus of “secondary reinforcement” or “emotion” -- acts as the Critic. The largest part of the human brain -- the cerebral cortex plus thalamus -- is adapted, primarily, to perform system identification. It builds up an “image of the world” or “working memory” based on circuitry which has a striking analogy to Kalman filtering. (See  for the neural net generalizations of Kalman filtering.) In this system, the thalamus -- the “input gate” to the cerebral cortex -- conveys the vector of (preprocessed) observables X. The cortex estimates the state vector R. A crucial aspect of Kalman filtering is the comparison between predictions of X(t+1), based on the predicted R(t+1), versus the actual observations X(t+1). In fact, reciprocal fibers going back from the cerebral cortex to the thalamus are all-pervasive. New research shows that some cells in the thalamus act as advanced predictors of other cells, and that they learn to remain good predictors even after the dynamics of the environment are changed artificially. (See  and more recent work by the same authors.)
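The predict-compare-correct cycle described here can be sketched as an ordinary scalar Kalman filter. All dynamics and noise variances below are hypothetical numbers chosen for illustration; the point is only the structure: the estimate R of the hidden state is corrected in proportion to the mismatch between the predicted and the actual observation X.

```python
import random

# Scalar Kalman-style predict-compare-correct loop. All dynamics and
# noise variances are hypothetical numbers chosen for illustration.
random.seed(2)
a, q, r = 0.95, 0.1, 0.5  # dynamics; process and sensor noise variances
state = 1.0               # true hidden state (the filter never sees it)
R, P = 0.0, 1.0           # state estimate and its error variance

sq_errors = []
for _ in range(500):
    # the hidden state evolves; only a noisy observation X is available
    state = a * state + random.gauss(0.0, q ** 0.5)
    X = state + random.gauss(0.0, r ** 0.5)
    # predict the next state (and hence the next observation)
    R_pred = a * R
    P_pred = a * a * P + q
    # compare: the innovation is the actual X minus the predicted X
    K = P_pred / (P_pred + r)
    R = R_pred + K * (X - R_pred)   # correct the estimate
    P = (1.0 - K) * P_pred
    sq_errors.append((R - state) ** 2)

mse = sum(sq_errors) / len(sq_errors)  # beats the raw sensor variance r
```

The innovation term X - R_pred plays the role of the cortico-thalamic comparison described above: prediction errors, not raw observations, drive the update of the internal representation.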
In the ANN versions of Kalman filtering, one requires a high degree of global synchronization. There is generally a forward pass, in which the network calculates all the various estimates and predictions and intermediate results. Then there is a backwards pass, devoted to the calculations (including derivative calculations ) required to adapt the network. Physicists and others who attempt to model the brain using only ordinary differential equations would consider this anathema; they generally seek “asynchronous” models. Yet Llinas and others have shown that there are substantial and precise “clocks” in this system. Recent work by Barry Richmond at NIH substantiates the existence of an alternating computing cycle in the cerebral cortex strikingly consistent with what is necessary in effective ANNs.
Generally speaking, there are several outstanding issues here: (1) How does the brain achieve a high level of robustness over time in its system identification component? Notions of underdetermined modeling discussed by Ljung and in the later parts of chapter 10 of  may give us some clues, related to the biologists’ notions of “learning dynamical invariants.” (2) How does the brain handle the “temporal chunking problem” -- closely related to the first question -- especially on medium time-scales, where AI approaches may be somewhat workable but neural net implementations are still called for? (3) How does the brain handle the interface between digital (discrete) decisions and continuous variables (including high-level variables like wealth and low-level variables like muscle force)? (4) When do components of R become so unchanging that they become stored in more permanent chemical form, even though they are not properly treated as parameters of a Critic or Model? The basal ganglia[30-32] clearly have something to do with these issues, but they -- like the cerebral cortex -- seem to operate at multiple levels of abstraction and multiple time-scales, all within a relatively uniform, modular and nonhierarchical structure.
In summary, new concepts from brain-like intelligent control can already explain some of the most important, fundamental aspects of intelligence in the brain, but they also point towards important new research needed to round out this explanation.
 A.Newell & H.Simon, Human Problem-Solving, Prentice-Hall 1972. See also E.Feigenbaum & Feldman, Computers and Thought, McGraw-Hill, 1963.
 H.Raiffa, Decision Analysis, Addison-Wesley, 1968.
 D.White & D.Sofge (eds) Handbook of Intelligent Control, Van Nostrand, 1992.
 P.Werbos, Elastic fuzzy logic: a better fit to neurocontrol and true intelligence J. of Intelligent and Fuzzy Systems , Vol.1, 365-377, 1993.
 P.Werbos, Neurocontrol and elastic fuzzy logic: Capabilities, concepts and applications. In M.Gupta and N.Sinha (eds), Intelligent Control Systems: Theory and Applications, IEEE Press, 1995.
 P.Werbos, Optimal neurocontrol: Practical benefits, new results and biological evidence, Proc. World Cong. on Neural Networks(WCNN95),Erlbaum, 1995.
 A.Bryson & Y.C.Ho, Applied Optimal Control, Ginn, 1969.
 D.Jacobson & D.Mayne, Differential Dynamic Programming, American Elsevier, 1970.
 W.Miller, R.Sutton & P.Werbos (eds), Neural Networks for Control, MIT Press, 1990, now in paperback.
 See Feldkamp, Puskorius, et al. (Ford) in WCNN95.
 T.Hrycej, Model-based training method for neural controllers. In I.Aleksander & J.Taylor (eds), Artificial Neural Networks 2.
 P.Werbos, Maximizing long-term gas industry profits in two minutes in Lotus using neural network methods, IEEE Trans. SMC, March/April 1989.
 P.Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, Wiley, 1994.
 P.Werbos, The elements of intelligence. Cybernetica (Namur), No.3, 1968.
 P.Werbos, Advanced forecasting for global crisis warning and models of intelligence, General Systems Yearbook, 1977 issue.
 P.Werbos, Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research, IEEE Trans. SMC, Jan./Feb. 1987.
 B.Widrow, N.Gupta & S.Maitra, Punish/reward: learning with a Critic in adaptive threshold systems, IEEE Trans. SMC, 1973, Vol. 5, p.455-465.
 P.Werbos, The cytoskeleton: Why it may be crucial to human learning and to neurocontrol, Nanobiology, Vol. 1, No.1, 1992.
 R.Howard, Dynamic Programming and Markov Processes, MIT Press, 1960.
 D.Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, 1995.
 P.Werbos, Neural networks for control and system identification, IEEE Proc. CDC89, IEEE, 1989.
 M.Nicolelis, C.Lin, D.Woodward & J.Chapin, Induction of immediate spatiotemporal changes in thalamic networks by peripheral block of ascending cutaneous information, Nature, Vol.361, 11 Feb. 1993, p.533-536.
 W.Schroder, M.Heller & P.Sacher, Robust control concept for a hypersonic test vehicle.
 K.Pribram (ed) Origins:Brain and Self-Organization, Erlbaum, 1994.
 Examples of continuous reinforcement learning control. In C.Dagli et al (eds), Intelligent Engineering Systems Through Artificial Neural Networks.
 D.Prokhorov, R.Santiago & D.Wunsch, Adaptive critic designs: a case study for neurocontrol. Neural Networks, in press.
 S.Balakrishnan & V.Biega, Adaptive critic based neural networks for aircraft control, AIAA-95-3297, Proc. AIAA GNC Conf., 1995.
 Control circuits in the brain: Basic principles, and critical tasks requiring engineers. In K.S.Narendra (ed), Proc. of the 8th Yale Workshop on Adaptive and Learning Systems. New Haven, CT: Prof. Narendra, Dept. of Electrical Engineering.
 W.Nauta & M.Feirtag, Fundamental Neuro-anatomy, W.H.Freeman, 1986.
 J.Houk, J.Davis & D.Beiser (eds), Models of Information Processing in the Basal Ganglia, MIT Press, 1995.
 H.Petri & M.Mishkin, Behaviorism, cognitivism and the neuropsychology of memory, American Scientist, Jan.-Feb. 1994, p.30-37.