Multiple Models for Approximate Dynamic Programming and
True Intelligent Control: Why and How
Paul J. Werbos
Room 675, National Science Foundation*
Arlington, VA, US 22230
This paper argues that true brain-like intelligent control systems would need to rely on a combination of multiple-model strategies, hierarchical control, and Approximate Dynamic Programming or reinforcement learning. The paper gives some modified Bellman equations which provide the basis for a new design
to accomplish this kind of unification.
In the past few years, Narendra has argued that multiple models are the key -- somehow -- to true intelligent
control. Since 1968, I have argued that Approximate Dynamic Programming (ADP) is the key to understanding and replicating the kind of intelligent control which we see in the mammalian brain [1,2]. Barto and his many collaborators (e.g., ) have argued that certain forms of reinforcement learning -- a subset of ADP -- can explain a wide variety of experiments in neuroscience and in animal learning. At this conference, Barto has also emphasized a classical idea from artificial intelligence (AI), the idea that complex
planning hierarchies are necessary in order to explain or replicate the highest forms of intelligent behavior.
This paper will argue that all three groups are essentially right. It will argue that multiple models (to cope with multiple regions in state space), ADP and hierarchical planning are all necessary in order to understand or replicate the kind of intelligent decision-making behavior we see in the mammalian brain. (Reinforcement learning as such is not quite powerful enough; however, the difference between reinforcement learning and ADP is not important enough to discuss in detail here.)
More important, this paper will suggest a way to combine all of these approaches into a single unified design. The ultimate goal for this kind of research is to develop a unified and generalized design for intelligent control, which imposes the minimum possible assumptions a priori and uses learning to address the varied specific needs of varied applications.
2 Why Multiple Models and Hierarchy
The links between ADP, reinforcement learning and the brain have already been discussed at enormous length elsewhere [1-7]. But why would we expect multiple models and hierarchy to be part of the underlying “hard-wiring” of the brain? Why can’t we just build powerful neural ADP systems, and expect all the rest to be learned after the fact as a kind of emergent behavior?
Multiple models have already emerged as a major theme within the core of the neural network community . They have become ever more popular as a tool for use in supervised learning -- the task of learning a static nonlinear mapping from a vector of inputs x to a vector y of “targets” or “dependent variables” or “desired responses.” Supervised learning is a much simpler task than control, but it already embodies many of the difficulties which exist in learning control. The more sophisticated designs for ADP all require subsystems which perform supervised learning. Supervised learning systems based on this multiple-model approach are called Mixture of Experts (ME) systems.
_______________________
*This paper gives the views of the author, not those of NSF; however, as government work, it is in the public domain subject to proper citation.
Within supervised learning, the multiple-model approach makes it possible for learning in one domain of experience (or state space) to proceed independently of learning in other domains. Thus there is no problem with “unlearning” the properties of one region as one begins to explore other regions. This is
also very critical for robust performance during real-time learning in control, especially when some of the
rarely-visited domains are critical to the stability of the overall system. A simple and obvious way to obtain these benefits within ADP control is to insert ME systems as the supervised learning subsystems of the existing designs. This makes it possible to obtain many of the benefits of multiple models, while still using the basic designs in the existing literature .
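The core idea can be sketched in a few lines of Python. This is only an illustrative mixture-of-experts forward pass, with linear experts and a softmax gate -- not any specific published ME architecture -- and all sizes and names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()            # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

class MixtureOfExperts:
    """Minimal ME regressor: a softmax gate weights several linear experts.

    Because the gate concentrates its weight on different experts in
    different regions of input space, learning in one region need not
    disturb the experts that dominate other regions.
    """
    def __init__(self, n_experts, n_in, n_out):
        self.W = rng.normal(scale=0.1, size=(n_experts, n_out, n_in))  # expert weights
        self.V = rng.normal(scale=0.1, size=(n_experts, n_in))         # gating weights

    def forward(self, x):
        g = softmax(self.V @ x)                  # gating coefficients, sum to 1
        y_k = np.einsum('koi,i->ko', self.W, x)  # each expert's output
        return g @ y_k, g                        # gated combination

me = MixtureOfExperts(n_experts=3, n_in=4, n_out=2)
y, g = me.forward(np.ones(4))
```

Inserting such a module wherever an ADP design calls for a supervised learner is exactly the "intermediate approach" described above.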
This kind of intermediate approach may be very useful as a steppingstone towards more brain-like control. However, it does not yet incorporate the full potential of the multiple model approach. The problem here is that ADP -- unlike supervised learning -- does not allow a complete independence between what is learned in different regions of state space. As an example, consider a simple case where state space is divided up into two regions -- one where stability is a desperate problem, because the system is in real danger,
and another where performance and stability have equal importance, in effect. What we learn in the high-risk region has a big impact on the “J” function -- the local values -- at the borderline between the
two regions; the J function at the borderline really determines what is a good strategy for maximizing stability even within the normal region. Thus it is not possible to solve the optimal decision problem in the normal region independent of what is learned in the high-risk region. In order to avoid bad forms of
unlearning, one must take a more sophisticated approach, learning independently only what can in fact be learned independently across the two regions. One must directly address the problem of how to find the optimal strategy when state space is partitioned into regions.
There are at least two additional reasons why multiple models are important to true intelligent control.
First, consider the intuition behind some of the work in AI and robotics. In practical robotics,
complex motions are typically built up as a sequence of very distinct tasks. There are tasks like reaching,
grasping, inserting, and moving, applied to distinct objects. The best practical success with neural nets  has involved training different networks to perform different tasks. This naturally ties in with the
idea of hierarchical planning as used in robots. In cognitive neuroscience, the structure of planning is not so
straightforward as in AI ; however, it clearly does include some notion of “tasks” or “action schemata” as a fundamental hard-wired aspect of the system.
Second -- and more positively -- the “chunking” of control choices into “decisions” allows for more rapid and effective learning in ADP systems . (In effect, state space is partitioned into regions which represent different “tasks” or “decisions” or “episodes.”) It also permits the development of subsystems which directly address the discrete or granular nature of higher-level decisions; the discrete nature of these decisions could lead to gross problems with local minima if addressed only implicitly by ADP systems structured on the basis of short-term dynamics between time t and time t+1. (At present, such problems are
typically addressed by the use of evolutionary computation methods, such as genetic algorithms; however,
a more brain-like approach is necessary in order to really integrate stochastic search into ADP systems
and to enable the solution of problems involving larger state spaces.) This second argument for multiple models is valid only after the fact, after one sees that there do exist methods for exploiting “chunking”
in order to improve the performance of generalized ADP designs. The remainder of this paper will summarize recent work  in that area. It should be noted that most of this material is covered by an international patent disclosure.
3 Critical Equations
Because of limitations of space and time, this paper cannot possibly discuss all the critical aspects of
chunking and intelligence discussed in  and in . It may take many years to fill in all the
critical components proposed in . However, it is possible to extract a few crucial equations.
Likewise, it is not possible here to review all the basics of ADP (although some notation can be defined.)
For simplicity, this paper will address the case of an observed state space made up of a finite number of states i=1 to N, as in . It will not discuss the neural network extension, which is discussed in ,
and will in any case require further work.
First, I will use the notation explained at greater length in section 2.2 of . Each action policy p specifies the control actions u(i), across all possible states i. Jpi represents the value of the “J” function in state i, for the action policy p. From the basic theory of dynamic programming, we know that Ji equals Jpi for the optimal policy p. Let Pp represent the matrix of transition probabilities from states i to states j, under policy p. Let Mp be defined as (Pp)T/(1+r), where r is the interest rate in the underlying utility function.
(In Sutton’s notation, g=1/(1+r).) For any given policy p, it is straightforward to deduce that:
Jp = Up + MpJp (1)
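To make the policy-evaluation relation concrete, here is a small numerical sketch in Python. The three-state chain, its utilities, and the interest rate are all hypothetical; for simplicity the sketch adopts the row-stochastic convention Pp[i, j] = probability of moving from state i to state j, so the discounted matrix entering the equation is Pp/(1+r):

```python
import numpy as np

# Hypothetical 3-state chain under a fixed policy p.
Pp = np.array([[0.9, 0.1, 0.0],
               [0.0, 0.8, 0.2],
               [0.1, 0.0, 0.9]])
Up = np.array([1.0, 0.0, 2.0])   # expected utility collected in each state
r  = 0.05                        # interest rate; discount factor g = 1/(1+r)
Mp = Pp / (1 + r)                # discounted transition matrix

# Jp = Up + Mp Jp is a linear system, solvable directly...
Jp = np.linalg.solve(np.eye(3) - Mp, Up)

# ...or by value iteration, the repeated-substitution view used below
# to derive the modified Bellman equations.
J = np.zeros(3)
for _ in range(2000):
    J = Up + Mp @ J

print(np.allclose(J, Jp))  # True: the two solutions agree
```

Since the spectral radius of Mp is at most 1/(1+r) < 1, the iteration is a contraction and both routes give the same J.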
Section (2.3) of  explains why classical forms of ADP can be very slow in learning to “see”
far into the future, and why “chunking” can help to improve their learning speed. But this paper will
skip over those arguments. Instead, it will focus on the question of how to actually implement the approach.
The first key question before us is as follows: how can we develop techniques for performing the usual ADP updates -- value updates (updating the estimated J) and policy updates (updating p) -- which would exploit the partitioning of the states i into regions or blocks A, B, C...?
As a practical matter, it turns out to be easier to start by considering the possibility of fuzzy partitions. We may work out the machinery for that general case, where each state i has a degree
of membership mi(A) in each set or region A. We may then derive the machinery for the case of
crisp, distinct partitions as a special case of that more general machinery. It seems likely that the fuzzy version and the crisp version could both be useful in some applications.
In NSF-sponsored work , Sutton proposed the use of a “generalized Bellman equation” based on weighting factors bi, in order to represent the control problem implied by the effort to perform a single
“abstract action.” But these weighting factors are related, intuitively, to something like mi(A)/mj(A);
they cannot capture the idea of a fuzzy membership function, as written. Thus in , I argue that the Sutton-Bellman equation is not powerful enough to support the type of multiple models needed to coordinate multiple tasks and intelligent control. Section 2-6 of  suggests that the problem can be solved by making a change which looks small algebraically, but is very large from a conceptual and functional point of view.
Instead of using weighting factors bi, we may consider weighting factors bij. We may then use the same kind of substitution procedure used by Sutton to derive a modified Bellman equation.
We may start from:
Jpi = Upi + S[j] MpijJpj (2),
where I use “[j]” to indicate a summation over j, strictly because of cross-platform compatibility problems involving the Microsoft Word program. For any array bij, we may derive the following equation by simple substitution:
Jpi = Upi + S[j] Mpij ( (1-bij)Jpj + bij(Upj + S[k] MpjkJpk) )
= Api + S[j] Mpij(1-bij)Jpj + S[k] CpikJpk, (3)
where I define:
Api = Upi + S[j] MpijbijUpj and Cpik = S[j] MpijbijMpjk (4),
and where the reader should be warned (with apologies) that this “A” is not the same as the block “A.”
Equation 3 is a different form of generalized Bellman equation which, to avoid confusion, might be
called the fuzzy Werbos-Bellman equation. There are several ways to choose the array bij in order to
represent the notion of fuzzy partitions, and generate a reasonable recurrence rule to support
learning. The most sensible-seeming choice (among those mentioned in ) is to set bij to mj(A)/mi(A) in the case where this is less than 1, and to 1 in other cases.
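A small numerical check of the fuzzy Werbos-Bellman equation, in Python, using the same hypothetical three-state chain and row-stochastic convention as before. Since equation 3 follows from equation 2 by substitution, any J solving the ordinary Bellman equation must also satisfy the modified one, for any choice of the array bij -- including the choice bij = min(mj(A)/mi(A), 1):

```python
import numpy as np

# Hypothetical 3-state setup: discounted transitions Mp and utilities Up.
Pp = np.array([[0.9, 0.1, 0.0],
               [0.0, 0.8, 0.2],
               [0.1, 0.0, 0.9]])
Up = np.array([1.0, 0.0, 2.0])
Mp = Pp / 1.05

# Illustrative fuzzy memberships m_i(A) of each state in one region A.
m = np.array([0.9, 0.5, 0.1])

# The suggested choice: b_ij = min(m_j(A) / m_i(A), 1).
b = np.minimum(m[None, :] / m[:, None], 1.0)

# Equation 4: A_i = U_i + sum_j M_ij b_ij U_j ;  C_ik = sum_j M_ij b_ij M_jk.
# (As the text warns, this "A" is an array, not the block "A".)
A = Up + (Mp * b) @ Up
C = (Mp * b) @ Mp

# J solving the ordinary Bellman equation (2)...
J = np.linalg.solve(np.eye(3) - Mp, Up)

# ...also satisfies the modified equation (3):
rhs = A + (Mp * (1 - b)) @ J + C @ J
print(np.allclose(J, rhs))  # True
```

The check passes identically for any bij; what the choice of bij changes is not the fixed point but the structure of the resulting update rules.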
Starting from the fuzzy Werbos-Bellman equation, one can derive simpler recurrence relationships
for the case of a crisp partition, in which every state i belongs to one and only one set A, B, etc.
The resulting procedures are derived directly (without reference to fuzzy partitions) in section (2.4) of .
To begin with, one may deduce:
Jp|A = JA + S[B∈n(A)] JAB * (Jp|B) (5),
where the left-hand side represents the vector made up of the estimates of Jpi for those states i which are in block A, where JA represents a kind of internal local J estimate (to be discussed), where “n(A)”
represents the set of those blocks B (other than A) which can be entered directly in one time step by a transition from some state i in A to a state j in B, and where the asterisk represents matrix multiplication.
The fuzzy Werbos-Bellman equation leads to two crisp Werbos-Bellman equations:
JA = Up|A + MAJA (6a)
JAB = MAB + MAJAB (6b)
These equations can easily be streamlined, computationally, to address only those states j in B which can be
reached directly from A, etc. The resulting update rules are discussed in some detail in ,
and extended to the case of hierarchical control.
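The crisp-partition machinery can likewise be checked numerically in Python. In this hypothetical four-state example there are only two blocks, so n(A) = {B} and equation 5 reduces to a single term; the transition matrix, utilities, and partition are all invented for illustration:

```python
import numpy as np

# Hypothetical 4-state example with crisp blocks A = {0, 1}, B = {2, 3}.
P = np.array([[0.6, 0.3, 0.1, 0.0],
              [0.2, 0.5, 0.2, 0.1],
              [0.0, 0.1, 0.7, 0.2],
              [0.3, 0.0, 0.1, 0.6]])
U = np.array([1.0, 0.0, 0.5, 2.0])
M = P / 1.05                  # discounted transition matrix
A, B = [0, 1], [2, 3]

MA  = M[np.ix_(A, A)]         # transitions that stay inside block A
MAB = M[np.ix_(A, B)]         # one-step exits from A into B

# Equations 6a and 6b, solved directly for the local quantities:
#   JA  = U|A + MA JA     (internal local J estimate)
#   JAB = MAB + MA JAB    (discounted exit weights from A into B)
I = np.eye(len(A))
JA  = np.linalg.solve(I - MA, U[A])
JAB = np.linalg.solve(I - MA, MAB)

# Global J from the ordinary Bellman equation (2):
J = np.linalg.solve(np.eye(4) - M, U)

# Equation 5: J inside A decomposes into the local estimate plus the
# exit weights applied to the J values in the neighboring block B.
print(np.allclose(J[A], JA + JAB @ J[B]))  # True
```

Note that JA and JAB depend only on M and U restricted to block A and its exits -- this is the locality of learning that the update rules exploit.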
4 Design Concepts and Issues
Equations 6 are essentially just a starting point for a whole series of designs of ever-growing complexity .
From the viewpoint of stability theory, the most important next task for research might be simply to prove convergence for this class of design -- with only two levels of hierarchy (block versus state) -- in the case of crisp fixed partitions A, B, C... The goal would be to prove results similar to what has already been proven for simpler finite-state reinforcement learning systems in . Likewise, such two-level systems might
be tested in various applications.
Already, in equation 6, the basic concept of chunking and local learning is present.
The update equations for JA and JAB are entirely local; they depend only on experience within block A
(or the states j immediately reachable from A). Yet we still cannot escape the inherent interdependency of J values across blocks. Nevertheless: as we learn JA and JAB, we can combine equations 5 and 6 to
calculate directly an estimate of J at the entry to block A (for states i which happen to be
the first states experienced in block A) as a linear combination of the J values for the states
j in B immediately after A is left. In other words, we have an update equation (explained further in )
which associates the start of block A directly with the start of the following block. This is like having a Bellman equation which jumps from t to t+T, where T is the time spent in block A, rather than t to t+1.
The result is an ability to project into the future further than the ordinary updates would allow, by a factor of T. There is room now for theoretical and practical studies illustrating how this can lead to faster learning for complex decision problems where this is appropriate.
In , there is further discussion of how to extend this to a multilevel hierarchy, and how to
represent higher-level decisions as choices between alternative block representations.
Likewise, the discussion of neural network approximations addresses the
issue of spatial chunking, which is also crucial to closing the gap between today’s learning control models and the capabilities we observe in the mammalian brain. Only after we understand how to replicate these kinds of capabilities in artificial, mathematical designs will we have any chance of understanding the particular
examples of such capabilities found in the actual brains of mammals.
 P.Werbos, The brain as a neurocontroller: New hypotheses and new experimental possibilities. In K. Pribram, ed., Origins: Brain and Self Organization, Erlbaum, 1994.
 P.Werbos, Applications of advances in nonlinear sensitivity analysis, in R.Drenick & F. Kozin (eds), System Modeling and Optimization: Proc. IFIP Conf. (1981),Springer 1982; reprinted in P.Werbos.
The Roots of Backpropagation, Wiley 1994.
J.Houk, J.Keifer and A.Barto, Distributed motor commands in the limb premotor network, Trends Neurosci., Vol. 16, p.27-33, 1993.
 P.Werbos, The cytoskeleton: Why it may be crucial to human learning and to neurocontrol, Nanobiology, Vol. 1, No.1, 1992.
 J.Houk, J.Davis & D.Beiser (eds), Models of Information Processing in the Basal Ganglia, MIT Press, 1995.
P.Werbos, Learning in the brain: An engineering interpretation. In K.Pribram and J.King, eds., Learning as Self-Organization, Erlbaum, 1996.
 P.Werbos, Values, Goals and Utility in an Engineering-Based Theory of Mammalian Intelligence, in Karl H.Pribram, ed. , Brain and Values, Erlbaum: Hillsdale, NJ, 1998.
 M.I.Jordan and R.A.Jacobs, Modular and hierarchical learning systems, in M.A.Arbib
Handbook of Brain Theory and Neural Networks, MIT Press, 1995, p.579-582.
D.White & D.Sofge, eds, Handbook of Intelligent Control, Van Nostrand, 1992.
 G.Hirzinger et al, in M. van der Meer, R. Schmidt and G.Wolf eds, Statusseminar des BMBF: Kunstliche Intelligenz, Neuroinformatik und Intelligente Systems, Berlin: DLR, 1996.
 Vernon Brooks, The Neural Basis of Motor Control, Oxford U. Press, 198_.
 P.Werbos, A Brain-Like Design To Learn Optimal Decision Strategies in Complex Environments, in M.Karny, K.Warwick and V.Kurkova, eds, Dealing with Complexity: A Neural Networks Approach. Springer, London, 1998. Also in S.Amari and N.Kasabov, Brain-Like Computing and Intelligent Information Systems. Springer,1998. See also international patent application #WO 97/46929, filed June 1997, published Dec.11.
D.P.Bertsekas and J.N.Tsitsiklis, Neuro-Dynamic Programming. Belmont, Mass.: Athena Scientific, 1996.
 R.Sutton, TD Models: Modeling the World at a Mixture of Time Scales. CMPSCI Technical Report 95- 114. U.Mass. Amherst, December 1995, later published in Proc. 12th Int. Conf. Machine Learning, 531-539, Morgan Kaufmann, 1995.