**Multiple Models for Approximate Dynamic Programming and True Intelligent Control: Why and How**

Paul J. Werbos

Room 675, National Science Foundation^{*}

Arlington, VA, US 22230

**Abstract**

This paper argues that true brain-like intelligent control systems would need to rely on a combination of multiple-model strategies, hierarchical control, and Approximate Dynamic Programming or reinforcement learning. The paper gives some modified Bellman equations which provide the basis for a new design to accomplish this kind of unification.

**1 Introduction**

In the past few years, Narendra has argued that __multiple models__ are the key -- somehow -- to true intelligent control. Since 1968, I have argued that Approximate Dynamic Programming (ADP) is the key to understanding and replicating the kind of intelligent control which we see in the mammalian brain [1,2]. Barto and his many collaborators (e.g., [3]) have argued that certain forms of reinforcement learning -- a subset of ADP -- can explain a wide variety of experiments in neuroscience and in animal learning. At this conference, Barto has also emphasized a classical idea from artificial intelligence (AI): the idea that complex planning hierarchies are necessary in order to explain or replicate the highest forms of intelligent behavior.

This paper will argue that all three groups are essentially right. It will argue that multiple models (to cope with multiple regions in state space), ADP and hierarchical planning are all necessary in order to understand or replicate the kind of intelligent decision-making behavior we see in the mammalian brain. (Reinforcement learning as such is not quite powerful enough [4]; however, the difference between reinforcement learning and ADP is not important enough to discuss in detail here.)

More important, this paper will suggest a way to combine all three approaches into a single unified design. The ultimate goal for this kind of research is to develop a unified and generalized design for intelligent control, which imposes the minimum possible assumptions a priori and uses __learning__ to address the varied specific needs of varied applications.

**2 Why Multiple Models and Hierarchy**

The links between ADP, reinforcement learning and the brain have already been discussed at enormous length elsewhere [1-7]. But why would we expect multiple models and hierarchy to be part of the underlying “hard-wiring” of the brain? Why can’t we just build powerful neural ADP systems, and expect all the rest to be __learned__ after the fact as a kind of __emergent behavior__?

Multiple models have already emerged as a major theme within the core of the neural network community [8]. They have become ever more popular as a tool for use in __supervised learning__ -- the task of learning a static nonlinear mapping from a vector of inputs **x** to a vector of outputs. Systems which take this multiple model approach are called Mixture of Experts (ME) systems.

*This paper gives the views of the author, not those of NSF; however, as government work, it is in the public domain subject to proper citation.

Within supervised learning, the multiple-model approach makes it possible for learning in one domain of experience (or state space) to proceed independently of learning in other domains. Thus there is no problem with “unlearning” the properties of one region as one begins to explore another region. This is also very critical for robust performance during real-time learning in control, especially when some of the rarely-visited domains are critical to the stability of the overall system. A simple and obvious way to obtain these benefits within ADP control is to insert ME systems as the supervised learning subsystems of the existing designs. This makes it possible to obtain many of the benefits of multiple models, while still using the basic designs in the existing literature [8].
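As an illustration of the ME idea (not a design from this paper), here is a minimal sketch of a Mixture-of-Experts predictor, assuming two hypothetical linear experts and a softmax gating network; the gate’s outputs play the role of degrees of membership of an input in each expert’s region:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def me_predict(x, experts, gate_w):
    """Blend expert predictions using the gate's membership weights."""
    g = softmax(gate_w @ x)                     # degree of membership per expert
    preds = np.array([W @ x for W in experts])  # each expert's prediction
    return g @ preds                            # convex combination

# Hypothetical setup: expert 0 "owns" the x[0]-dominant region,
# expert 1 the x[1]-dominant region.
experts = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])]
gate_w = np.array([[5.0, -5.0],
                   [-5.0, 5.0]])

y = me_predict(np.array([1.0, 0.0]), experts, gate_w)  # dominated by expert 0
```

Because the gate concentrates nearly all of its weight on one expert within that expert’s region, training one expert barely disturbs the others -- the independence-of-regions property discussed above.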

This kind of intermediate approach may be very useful as a steppingstone towards more brain-like control. However, it does not yet incorporate the full potential of the multiple model approach. The problem here is that ADP -- unlike supervised learning -- does not allow a __complete__ independence between what is learned in different regions of state space. As an example, consider a simple case where state space is divided up into two regions -- one where stability is a desperate problem, because the system is in real danger, and another where performance and stability have equal importance, in effect. What we learn in the high-risk region has a big impact on the “J” function -- the local values -- at the borderline between the two regions; the J function at the borderline really determines what is a good strategy for maximizing stability even within the normal region. Thus it is not possible to solve the optimal decision problem in the normal region __independent__ of what is learned in the high-risk region. In order to avoid bad forms of unlearning, one must take a more sophisticated approach: learning independently what we __can__ learn independently across the two regions. One must directly address the problem of how to find the optimal strategy when state space is partitioned into regions.
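A toy numerical sketch of this coupling (with made-up transition probabilities and utilities): changing the penalty only in the high-risk region still shifts the J values of states in the normal region.

```python
import numpy as np

# Hypothetical 4-state chain: states {0,1} are "normal", {2,3} "high-risk".
P = np.array([[0.7, 0.2, 0.1, 0.0],
              [0.3, 0.5, 0.1, 0.1],
              [0.0, 0.1, 0.6, 0.3],
              [0.1, 0.0, 0.3, 0.6]])
M = P / 1.1                                  # discount with interest rate r = 0.1

def J_for(U):
    """Exact J for fixed policy: solve J = U + M J."""
    return np.linalg.solve(np.eye(4) - M, U)

U_safe  = np.array([1.0, 1.0, 0.0, 0.0])
U_risky = np.array([1.0, 1.0, -5.0, -5.0])   # penalty changed only in {2,3}
dJ = J_for(U_safe) - J_for(U_risky)

# J shifts in the normal region too, via the borderline transitions
assert dJ[0] > 0 and dJ[1] > 0
```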

There are at least two additional reasons why multiple models are important to true intelligent control.

First, consider the intuition behind some of the work in AI and robotics. In practical robotics, complex motions are typically built up as a __sequence__ of very distinct tasks. There are tasks like reaching, grasping, inserting, and moving, applied to distinct objects. The best practical success with neural nets [10] has involved training different networks to perform different tasks. This naturally ties in with the idea of hierarchical planning as used in robots. In cognitive neuroscience, the structure of planning is not so straightforward as in AI [11]; however, it clearly does include some notion of “tasks” or “action schemata” as a fundamental hard-wired aspect of the system.

Second -- and more positively -- the “chunking” of control choices into “decisions” allows for more rapid and effective learning in ADP systems [12]. (In effect, state space is partitioned into regions which represent different “tasks” or “decisions” or “episodes.”) It also permits the development of subsystems which directly address the discrete or granular nature of higher-level decisions; the discrete nature of these decisions could lead to gross problems with local minima if addressed only implicitly by ADP systems structured on the basis of short-term dynamics between time t and time t+1. (At present, such problems are typically addressed by the use of evolutionary computation methods, such as genetic algorithms; however, a more brain-like approach is necessary in order to really integrate stochastic search into ADP systems and to enable the solution of problems involving larger state spaces.) This second argument for multiple models is valid only after the fact, after one sees that there do exist methods for exploiting “chunking” in order to improve the performance of generalized ADP designs. The remainder of this paper will summarize recent work [12] in that area. It should be noted that most of this material is covered by an international patent disclosure.

**3 Critical Equations**

Because of limitations of space and time, this paper cannot possibly discuss all the critical aspects of chunking and intelligence discussed in [7] and in [12]. It may take many years to fill in all the critical components proposed in [12]. However, it is possible to extract a few crucial equations. Likewise, it is not possible here to review all the basics of ADP (although some notation can be defined).

For simplicity, this paper will address the case of an observed state space made up of a finite number of states i=1 to N, as in [13]. It will not discuss the neural network extension, which is discussed in [12], and will in any case require further work.

First, I will use the notation explained at greater length in section 2.2 of [12]. Each action policy p specifies the control actions **u**(i), across all possible states i. J^{p} denotes the vector of values J^{p}_{i} of the states i under policy p, U^{p} the corresponding vector of utilities, and M^{p} the matrix of transition probabilities under policy p, discounted by the factor 1/(1+r) for an interest rate r. (In Sutton’s notation, g=1/(1+r).) For any given policy p, it is straightforward to deduce that:

__J__^{p} = __U__^{p} + M^{p}__J__^{p} (1)
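For a finite state space, equation (1) can be checked directly. A minimal numerical sketch, assuming a made-up 3-state chain and assuming M folds the discount factor 1/(1+r) into the transition probabilities:

```python
import numpy as np

r = 0.1                                      # interest rate (assumed)
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])              # made-up transition probabilities
M = P / (1.0 + r)                            # discounted transition matrix
U = np.array([1.0, 0.0, 2.0])                # made-up utilities per state

# The exact fixed point of (1): (I - M) J = U
J = np.linalg.solve(np.eye(3) - M, U)
assert np.allclose(J, U + M @ J)             # J = U + M J holds
```

Since the spectral radius of M is below 1, the solve always succeeds and J is the unique solution of (1) for this policy.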

Section (2.3) of [12] explains why classical forms of ADP can be very slow in learning to “see” far into the future, and why “chunking” can help to improve their learning speed. But this paper will skip over those arguments. Instead, it will focus on the question of how to actually implement the approach.

The first key question before us is as follows: how can we develop techniques for performing the usual ADP updates -- value updates (updating the estimated **J**) and policy updates (updating p) -- which would exploit the partitioning of the states i into regions or blocks A, B, C...?

As a practical matter, it turns out to be easier to start by considering the possibility of __fuzzy partitions__. We may work out the machinery for that general case, where each state i has a __degree__ of membership m_{i}(A) in each set or region A. We may then derive the machinery for the case of crisp, distinct partitions as a special case of that more general machinery. It seems likely that the fuzzy version and the crisp version could both be useful in some applications.

In NSF-sponsored work [14], Sutton proposed the use of a “generalized Bellman equation” based on weighting factors b_{i}, in order to represent the control problem implied by the effort to perform a __single__ “abstract action.” But these weighting factors are related, intuitively, to something like m_{i}(A)/m_{j}(A); they cannot capture the idea of a fuzzy membership function, as written. Thus in [12], I argue that the Sutton-Bellman equation is not powerful enough to support the type of multiple models needed to coordinate multiple tasks in intelligent control. Section 2.6 of [12] suggests that the problem can be solved by making a change which looks small algebraically, but is very large from a conceptual and functional point of view.

Instead of using weighting factors b_{i}, we may consider weighting factors b_{ij}. We may then use the same kind of substitution procedure used by Sutton to derive a modified Bellman equation. We may start from:

J^{p}_{i} = U^{p}_{i} + S[j] M^{p}_{ij}J^{p}_{j} (2),

where I use “[j]” to indicate a summation over j, strictly because of cross-platform compatibility problems involving the Microsoft Word program. For any array b_{ij}, we may derive the following equation by simple substitution:

J^{p}_{i} = U^{p}_{i} + S[j] M^{p}_{ij}( (1-b_{ij})J^{p}_{j} + b_{ij}(U^{p}_{j} + S[k] M^{p}_{jk}J^{p}_{k}) )

= A^{p}_{i} + S[j] M^{p}_{ij}(1-b_{ij})J^{p}_{j} + S[k] C^{p}_{ik}J^{p}_{k}, (3)

where I define:

A^{p}_{i} = U^{p}_{i} + S[j] M^{p}_{ij}b_{ij}U^{p}_{j} and C^{p}_{ik} = S[j] M^{p}_{ij}b_{ij}M^{p}_{jk} (4),

and where the reader should be warned (with apologies) that this “A” is not the same as the block “A.”
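The substitution behind equations (2)-(4) can be verified numerically. A minimal sketch, assuming a made-up 4-state chain and arbitrary weights b_{ij}: the exact J of equation (2) must satisfy the identity in equation (3) for any choice of b.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
P = rng.random((N, N))
P /= P.sum(axis=1, keepdims=True)            # made-up transition probabilities
M = P / 1.1                                  # discount with r = 0.1
U = rng.random(N)                            # made-up utilities
J = np.linalg.solve(np.eye(N) - M, U)        # exact solution of (2)

b = rng.random((N, N))                       # arbitrary weighting factors b_ij
A = U + (M * b) @ U                          # A_i = U_i + sum_j M_ij b_ij U_j
C = (M * b) @ M                              # C_ik = sum_j M_ij b_ij M_jk

# Equation (3): J_i = A_i + sum_j M_ij (1 - b_ij) J_j + sum_k C_ik J_k
rhs = A + (M * (1 - b)) @ J + C @ J
assert np.allclose(J, rhs)
```

(The identity holds because substituting (2) for J_j inside the weighted term simply re-expresses the same fixed point; the check above confirms the algebra term by term.)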

Equation 3 is a different form of generalized Bellman equation which, to avoid confusion, might be called the fuzzy Werbos-Bellman equation. There are several ways to choose the array b_{ij} in order to represent the notion of fuzzy partitions, and generate a reasonable recurrence rule to support learning. The most sensible-seeming choice (among those mentioned in [12]) is to set b_{ij} to m_{j}(A)/m_{i}(A) in the case where this is less than 1, and to 1 in other cases.
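One reading of that choice, as a short code sketch (the membership vector m is made up; states are assumed to have strictly positive membership in A):

```python
import numpy as np

def fuzzy_b(m):
    """b_ij = min(1, m_j / m_i) for a membership vector m (all m_i > 0)."""
    ratio = m[None, :] / m[:, None]          # ratio[i, j] = m_j / m_i
    return np.minimum(ratio, 1.0)

m = np.array([1.0, 0.5, 0.25])               # made-up memberships m_i(A)
b = fuzzy_b(m)
# Moving toward lower membership discounts the weight; moving toward
# equal or higher membership leaves b_ij saturated at 1.
```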

Starting from the fuzzy Werbos-Bellman equation, one can derive simpler recurrence relationships for the case of a crisp partition, in which every state i belongs to one and only one set A, B, etc. The resulting procedures are derived directly (without reference to fuzzy partitions) in section (2.4) of [12]. To begin with, one may deduce:

__J__^{p}|_{A} = __J__^{A} + S[B∈n(A)] J^{AB} * (__J__^{p}|_{B}), (5)

where the left-hand side represents the vector made up of the estimates of J^{p}_{i} for those states i which are in block A, where __J__^{A} represents a kind of internal local J estimate (to be discussed), where “n(A)” represents the set of those blocks B (other than A) which can be entered directly in one time step by a transition from some state i in A to a state j in B, and where the asterisk represents matrix multiplication.

The fuzzy Werbos-Bellman equation leads to two crisp Werbos-Bellman equations:

__J__^{A} = __U__^{p}|_{A} + M^{A}__J__^{A} (6a)

J^{AB} = M^{AB} + M^{A}J^{AB} (6b)

These equations can easily be streamlined, computationally, to address only those states j in B which can be reached directly from A, etc. The resulting update rules are discussed in some detail in [12], and extended to the case of hierarchical control.
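A minimal numerical sketch of equations (5)-(6), assuming just two blocks A = {0,1} and B = {2,3} and a made-up 4-state chain: the local solutions J^A and J^AB, computed only from transitions inside A (or leaving A), recover the exact J|_A once J|_B is supplied.

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((4, 4))
P /= P.sum(axis=1, keepdims=True)            # made-up transition probabilities
M = P / 1.1                                  # discount with r = 0.1
U = rng.random(4)                            # made-up utilities
J = np.linalg.solve(np.eye(4) - M, U)        # exact J, for reference only

A, B = [0, 1], [2, 3]                        # crisp two-block partition
MA  = M[np.ix_(A, A)]                        # transitions staying inside A
MAB = M[np.ix_(A, B)]                        # transitions leaving A for B
JA  = np.linalg.solve(np.eye(2) - MA, U[A])  # (6a): J^A = U|_A + M^A J^A
JAB = np.linalg.solve(np.eye(2) - MA, MAB)   # (6b): J^AB = M^AB + M^A J^AB

# (5): J|_A is recovered from the local solutions plus J on block B
assert np.allclose(JA + JAB @ J[B], J[A])
```

Note how MA and MAB are the only pieces of M the block-A updates touch -- the locality property stressed in section 4.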

**4 Design Concepts and Issues**

Equations 6 are essentially just a starting point for a whole series of designs, of ever-growing complexity [12].

From the viewpoint of stability theory, the most important next task for research might be simply to prove convergence for this class of design -- with only two levels of hierarchy (block versus state) -- in the case of crisp fixed partitions A, B, C... The goal would be to prove results similar to what has already been proven for simpler finite-state reinforcement learning systems in [13]. Likewise, such two-level systems might be tested in various applications.

Already, in equation 6, the basic concept of chunking and local learning is present.

The update equations for __J__^{A} and J^{AB} are entirely __local__; they depend only on experience __within__ block A (or the states j immediately reachable from A). Yet we still cannot escape the inherent interdependency of J values across blocks. Nevertheless: as we learn __J__^{A} and J^{AB}, we can combine equations 5 and 6 to calculate __directly__ an estimate of J at the __entry__ to block A (for states i which happen to be the first states experienced in block A) as a linear combination of the J values for the states j in B immediately after A is left. In other words, we have an update equation (explained further in [12]) which associates the start of block A __directly__ with the start of the following block. This is like having a Bellman equation which jumps from t to t+T, where T is the time spent in block A, rather than from t to t+1.

The result is an ability to project into the future further than the ordinary updates would allow, by a factor of T. There is room now for theoretical and practical studies illustrating how this can lead to faster learning for complex decision problems where this is appropriate.
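The t-to-t+T interpretation can be checked numerically under the same two-block reading: the solution of (6b) equals the sum, over all dwell times T in block A, of T-1 within-block steps followed by an exit -- one linear operator spanning the entire stay in A. The matrices below are made up:

```python
import numpy as np

MA  = np.array([[0.4, 0.2],
                [0.1, 0.5]]) / 1.1           # made-up within-block transitions
MAB = np.array([[0.2, 0.2],
                [0.1, 0.3]]) / 1.1           # made-up exits from A into B

JAB = np.linalg.solve(np.eye(2) - MA, MAB)   # closed form of (6b)

# Accumulate the geometric series sum_{T>=1} MA^(T-1) MAB term by term
series = np.zeros((2, 2))
term = MAB.copy()
for _ in range(200):
    series += term
    term = MA @ term
assert np.allclose(JAB, series)
```

Each term of the series is one possible dwell time in A, so applying JAB once is equivalent to running all of them at once -- the factor-of-T speedup described above.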

In [12], there is further discussion of how to extend this to a multilevel hierarchy, and how to represent higher-level __decisions__ as choices between alternative block representations. Likewise, in the discussion of neural network approximations, there is discussion of the issue of __spatial chunking__, which is also crucial to closing the gap between today’s learning control models and the capabilities we observe in the mammalian brain. Only after we understand how to replicate these kinds of capabilities in artificial, mathematical designs will we have any chance of understanding the particular examples of such capabilities found in the actual brains of mammals.

**References**

[1] P.Werbos, The brain as a neurocontroller: New hypotheses and new experimental possibilities. In K.Pribram, ed., *Origins: Brain and Self Organization*, Erlbaum, 1994.

[2] P.Werbos, Applications of advances in nonlinear sensitivity analysis, in R.Drenick & F.Kozin (eds), *System Modeling and Optimization: Proc. IFIP Conf. (1981)*, Springer, 1982; reprinted in P.Werbos, *The Roots of Backpropagation*, Wiley, 1994.

[3] J.Houk, J.Keifer and A.Barto, Distributed motor commands in the limb premotor network, *Trends Neurosci.*, Vol. 16, p.27-33, 1993.

[4] P.Werbos, The cytoskeleton: Why it may be crucial to human learning and to neurocontrol, *Nanobiology*, Vol. 1, No. 1, 1992.

[5] J.Houk, J.Davis & D.Beiser (eds), *Models of Information Processing in the Basal Ganglia*, MIT Press, 1995.

[6] P.Werbos, Learning in the brain: An engineering interpretation. In K.Pribram and J.King, eds., *Learning as Self-Organization*, Erlbaum, 1996.

[7] P.Werbos, Values, Goals and Utility in an Engineering-Based Theory of Mammalian Intelligence, in K.H.Pribram, ed., *Brain and Values*, Erlbaum: Hillsdale, NJ, 1998.

[8] M.I.Jordan and R.A.Jacobs, Modular and hierarchical learning systems, in M.A.Arbib, ed., *Handbook of Brain Theory and Neural Networks*, MIT Press, 1995, p.579-582.

[9] D.White & D.Sofge, eds, *Handbook of Intelligent Control*, Van Nostrand, 1992.

[10] G.Hirzinger et al, in M. van der Meer, R.Schmidt and G.Wolf, eds, *Statusseminar des BMBF: Kunstliche Intelligenz, Neuroinformatik und Intelligente Systems*, Berlin: DLR, 1996.

[11] Vernon Brooks, *The Neural Basis of Motor Control*, Oxford U. Press, 198_.

[12] P.Werbos, A Brain-Like Design To Learn Optimal Decision Strategies in Complex Environments, in M.Karny, K.Warwick and V.Kurkova, eds, *Dealing with Complexity: A Neural Networks Approach*, Springer, London, 1998. Also in S.Amari and N.Kasabov, *Brain-Like Computing and Intelligent Information Systems*, Springer, 1998. See also international patent application #WO 97/46929, filed June 1997, published Dec. 11.

[13] D.P.Bertsekas and J.N.Tsitsiklis, *Neuro-Dynamic Programming*, Belmont, Mass.: Athena Scientific, 1996.

[14] R.Sutton, TD Models: Modeling the World at a Mixture of Time Scales. CMPSCI Technical Report 95-114, U.Mass. Amherst, December 1995; later published in *Proc. 12th Int. Conf. Machine Learning*, 531-539, Morgan Kaufmann, 1995.