Values, Goals and Utility in an

Engineering-Based Theory of Mammalian Intelligence

Paul J. Werbos

National Science Foundation, Room 675*

Arlington, VA 22230

pwerbos@nsf.gov

Introduction and Strategy

Values, goals and utility play a central role both in formal psychology and in everyday life.

This paper will try to explain how our understanding of these concepts can be further unified and enriched, based on a new engineering-based theory of mammalian intelligence[1].

The new theory is essentially the latest iteration of an ongoing dialogue between neuroscientists like Karl Pribram and engineers like myself, each trying to grow beyond the usual limitations of these disciplines. A major goal of this effort is to bridge the gap between functional, realistic and intuitive descriptions of the brain and more formal mathematical, computational representations. After we develop a better understanding of how the higher-level functional abilities of the brain could be replicated, by certain types of parallel distributed computational structure, we will then be ready to do a more serious job of matching such structures to the lower-level features of neural circuitry and to suggest new experiments.

The underlying strategy of this approach has been discussed at great length in the past [2-4].

Those prior discussions have elaborated on issues such as the role of complexity, the importance of learning, the choice of mathematical tools, the strategies for experimental validation, and so on. Nevertheless, there are three strategic issues which need to be discussed yet again, because they are a source of concern to many researchers: (1) consciousness versus intelligence; (2) prior information versus learning; (3) neural networks versus computational neuroscience.

Consciousness Versus Intelligence

The new theory tries to replicate the higher-level intelligence of the brains of mammals.

This is not the same thing as explaining consciousness in the minds of humans.

Many of us are actually more interested in human consciousness, in principle, than in mammalian intelligence. For example, I have written at length about the various meanings of the word “conscious,” about the role of human language and symbolic reasoning, quantum computing, the soul, and so on[5].

However, those topics are beyond the scope of this article.

I will argue here that a deeper understanding of mammalian intelligence is actually a prerequisite to a deeper understanding of human consciousness, just as algebra is a prerequisite to calculus and calculus is a prerequisite to understanding gravity. In reality, in everyday life, we have to develop some kind of intuitive understanding of gravity (dropping things and breaking things) before we even learn algebra. In the same way, intuitive studies of language and the soul can be very important to everyday life.

But for a deeper understanding of gravity -- and of the higher levels of intelligence -- one must first master the prerequisites.

In fact, the word “intelligence” actually represents one of the meanings of the word “consciousness” in everyday conversation. Sometimes when we refer to systems as possessing “consciousness,” we mean that they are intelligent.

_________________________

*The views herein are those of the author, not those of NSF, though they are written on government time.

Some authors who write about “consciousness” or “intelligent systems” act as if consciousness were some kind of binary variable. They ask: “Is it conscious or is it not?”, as if the answer should be “yes” or “no.” In my view, the question itself is fundamentally misleading. On the other hand, some classical behaviorists have treated intelligence as if it were a continuous variable, like the Intelligence Quotient. In the extreme form of that view, snails and humans are treated as if they are qualitatively the same, and different only in their speed of learning.

Based on empirical reality, I would argue that “consciousness” or “intelligence” is actually more like a staircase -- a discrete series of levels, each level qualitatively different from the level before. (Actually, it is a tilty staircase, insofar as there are some quantitative differences within each level, but these are not as important as the qualitative differences between levels.) The most important levels, for our purposes here, correspond roughly to the classes of vertebrates -- fish versus amphibians versus reptiles versus birds versus mammals versus the symbolic-reasoning level of intelligence. These are the forms of intelligence that we actually see in nature.

At a superficial level, the differences between different mammals seem very large. For example, the neocortex of a normal adult monkey contains many more visual areas than does the neocortex of a mouse. However, lower classes of vertebrate do not even possess a neocortex at all. Furthermore, the same underlying principles of learning seem to determine what is actually learned and stored in the various areas in the neocortex. When one area is damaged but the necessary connections are present, other parts of the neocortex can learn to take over the functions of the damaged area. There is a huge literature on “mass action” and “equipotentiality” which has demonstrated this. (This literature originated in the classic work of Lashley, Pribram, Freeman and their colleagues.) For example, neurons in the temporal cortex have learned to take over as edge detectors in some experiments. The new theory to be described here [1] tries to replicate the underlying learning abilities which are essentially the same across all mammals.

Years ago, in comparative psychology, Bitterman [6] demonstrated qualitative differences in the nature of learning across different classes of vertebrates. The details are beyond the scope of this paper, but did contribute to the development of our new theory.

Many researchers in Artificial Intelligence (AI) have argued that “real intelligence” requires the use of symbolic reasoning. One such researcher begins his talks by asking how Einstein reasoned, in order to develop his theories. I would certainly agree that these are excellent examples of higher-order intelligence. However, I would question our chances of building an artificial Einstein before we can build even the crudest version of an artificial mouse. If Einstein and the mouse were unrelated phenomena, it might make sense to aim directly to build an Einstein, since he is more interesting; however, the symbolic reasoning used by Einstein and by other relatively intelligent humans is deeply rooted in the mammalian neocortex and in other brain structures which we share with other mammals. There are good humanistic reasons to try to understand symbolic reasoning better, even now, as I discussed above; however, a deep understanding of the neocortex and of mammalian intelligence in general should permit a much deeper understanding of symbolic reasoning as well in the more distant future.

Prior Information Versus Learning

The new theory to be described here is a general purpose learning design. It can be started out with essentially no concrete knowledge of its environment. Given time, it can develop a very complex understanding of its environment, and a complex plan of action, based entirely on learning from experience. The learning-based approach is now very familiar in the neural network field, where it has led to many practical, empirically-tested applications now being used in technology and in finance. The success of this approach has contributed to a resurgence of interest in learning even within hard-core AI.

On the other hand, some AI researchers still speculate that learning-based designs by themselves could never replicate higher-order intelligence such as what we see in the neocortex. They speculate that higher-order intelligence can only grow in the soil of very extensive, hard-wired prior knowledge which is very concrete and specific in nature. Certainly, in the nervous system, there are many lower-level sensory and motor circuits and reflexes which are very specialized and preprogrammed in nature. But some of these researchers go on to claim that much of the higher-order processing -- such as the entire path up from edge detectors through to object identification -- is preprogrammed in a concrete way. As evidence, they argue that some mammals -- like ungulates -- are actually born fully functional, and able to perform these kinds of cognitive tasks, before they have had any time to learn anything. They even argue that much of the supposed “learning” in human infants may be part of a preprogrammed developmental process, similar to what happens to baby ungulates in the womb, and independent of learning experience.

This paper will not try to evaluate the last few claims, one way or another. But there is a crucial logical point which needs to be considered: even if these claims should be true, even if there is substantial concrete higher-order information born into the mammalian brain, this still does not imply a lack of ability to learn this same information if the occasion should arise.

As an example, even if certain cells in the neocortex are set up as edge detectors, right from birth, this does not imply that neocortical cells lack the ability to learn edge detection, through ordinary learning from experience. In fact, the temporal lobe experiments mentioned in the previous section clearly demonstrate that these cells do possess that learning ability.

There is no real paradox here. In practical engineering applications, one often initializes the weights and connections of an artificial neural network (ANN) to the best available initial guess, based on the available prior information, even though these same weights and connections will then be adapted freely as if they had been the result of learning. There is no special reason to initialize the weights to zero. One would expect evolution to use the same general sort of opportunistic approach, combining both prior information and a generalized learning ability, to the extent that the machinery of DNA permits the transmission of specific information.
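The point about initialization can be illustrated with a toy example. The sketch below is not taken from the work cited here: the linear model, the data, the learning rate, and the names `train`, `mse`, `zero_init` and `prior_init` are all hypothetical, chosen only to show that weights started from an imperfect prior guess are adapted by exactly the same learning rule as weights started from zero.

```python
def train(weights, data, lr=0.05, epochs=3):
    """Adapt (w0, w1) of the model y = w0 + w1*x by per-sample gradient descent."""
    w0, w1 = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w0 + w1 * x) - y   # prediction error on this sample
            w0 -= lr * err            # gradient step for the bias
            w1 -= lr * err * x        # gradient step for the slope
    return w0, w1


def mse(weights, data):
    """Mean squared prediction error of the model on the data."""
    w0, w1 = weights
    return sum(((w0 + w1 * x) - y) ** 2 for x, y in data) / len(data)


# Hypothetical environment: y = 1 + 2x, sampled on [-1, 1].
data = [(k / 10.0, 1.0 + 2.0 * (k / 10.0)) for k in range(-10, 11)]

zero_init = (0.0, 0.0)    # no prior information
prior_init = (0.8, 1.6)   # imperfect prior guess (an assumption for illustration)

# Same learning rule, same experience, different starting points.
early_zero = train(zero_init, data, epochs=3)
early_prior = train(prior_init, data, epochs=3)
late_zero = train(zero_init, data, epochs=200)
late_prior = train(prior_init, data, epochs=200)
```

In this toy setting the prior-initialized weights have the lower error after a few passes through the data, while both starting points end up at essentially the same weights after long training. The prior guess shortens the path; it does not replace or restrict the learning.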

All forms of learning do tend to make some sorts of prior assumptions about the world, implicitly, of a very general theoretical sort [7, ch.10]. The new theory to be described here is no exception. However, it does not require extensive, concrete forms of prior knowledge, of the sort which is popular in the world of expert systems in AI.

Neural Networks Versus Computational Neuroscience

The new theory to be described here is quite different in flavor from the usual differential equation models which are most familiar in neuroscience. It results from a modeling approach very different from conventional computational neuroscience. This section will discuss these differences, somewhat briefly and crudely, in order to help the reader understand how these different approaches are complementary.

Figure 1. Some Major Strands of Research Addressing Mammalian Brains

Figure 1 provides a crude overview of some important strands of research which try to understand the mammalian brain. Research involving invertebrates, etc., is beyond the scope of this paper. The chapter by James McClelland in this book probably belongs in a new box on the middle right, but that too is beyond the scope of this paper.

On the left side of Figure 1 are two boxes representing two important trends in neuroscience which have had an uneasy coexistence for decades. Systems level neuroscience addresses higher-order issues such as the functional significance of major parts of the brain, the representation of emotions in the brain, the location and dynamics of working memory, the interaction between the neocortex, the limbic system and the basal ganglia, and so on. It attempts to understand fundamental issues such as the nature of human intelligence and motivation, through qualitative analysis of extensive clinical and experimental information. It tries to account for experimental information from lower levels, but the major focus is on understanding the whole system. Karl Pribram has been one of the serious contributors to this field for decades.

Local circuit and subcellular neuroscience operates at a very different level. This area has had a tremendous spurt of growth in recent decades, due to the development of genetic technology and other new technologies which tell us more and more about the molecular events involved in some forms of learning, about connections between neurons, and so on.

Many neuroscientists believe that the main challenge to neuroscience as a whole is to explain the higher-level phenomena in terms of the local circuit and subcellular phenomena. In other words, the challenge is to connect the two levels of analysis more completely.

On the other hand, many other neuroscientists believe that the key challenge to the field is to put the whole field on a more precise, more mathematical footing. Their goal is to become more like physics, in some sense. These neuroscientists have developed more and more differential equation models -- some rooted in extremely detailed physics and chemistry -- which attempt to describe local circuits very precisely. By building up from such models, at ever higher levels of abstraction, they have also built up to models which actually try to replicate meaningful mental functions, such as associative memory or early vision in the adult.

The theory to be described here fits into the upper right box, which is currently the most sparsely populated part of this figure. Rather than modeling the brain from a bottom-up point of view, it takes a top-down approach. It tries to provide a mathematical articulation or model of the kinds of phenomena which neuroscientists like Pribram have focused on, rather than articulating subcellular phenomena.

In the long term, I agree that the central challenge is to connect the two levels -- to develop detailed neural network models which match the subcellular data and the systems level information, both at the same time. There is a great need for researchers in the upper boxes and in the lower boxes to study each other’s work, in order to strengthen the effort to build a bridge between the two levels. On the other hand, there is also a great deal of important work to be done in reinforcing the foundations at either side of the bridge. Particularly in the upper right box, the ideas are so new that considerable effort will be needed to sharpen, validate and consolidate the theory to be described here.

In the upper right box, the focus is on higher-level functionality. The goal is to develop neural network models which replicate higher-order learning-based intelligence -- not just a component of intelligence, but the whole system. Thus the lower right box tends to draw most heavily on chemistry and physics, while the upper right box requires a heavy emphasis on engineering. Please bear in mind that “engineering” does not consist of a collection of clever monkeys; it is a large, long-standing intellectual endeavor, which has developed rigorous mathematical techniques addressing the issue of functionality -- of understanding what works, what does not, how to develop new designs, and how to understand a wide variety of challenging tasks. Engineering concepts and engineering testbeds provide a very crucial intellectual filter and empirical testing ground for allegedly functional designs -- including designs in the upper right box.

Most models used in engineering or in the artificial neural network (ANN) field do not fit within any of the boxes in Figure 1, because they try to achieve functionality in more limited kinds of tasks. Nevertheless, the theory to be described here was developed within the disciplines of ANNs and of engineering, because of the filters which they provide.

How important are these filters? As of 1995 [8], only one class of neural network model had ever been implemented which met four basic necessary tests for a model of brain-like intelligence: (1) demonstrated engineering functionality, for a completely learning-based system; (2) possession of an “emotional” or “values” component, enabling the system to learn to pursue goals effectively over time; (3) possession and use of an “expectations” component, as needed to address behavioral experiments in classical conditioning; (4) the ability to take physical actions (e.g. control motors or muscles).

These are necessary tests, not sufficient tests, but, when applied together, they rule out all the other models, including those which have emerged from computational neuroscience.

Models in this class have since been adapted and extended to a considerable extent, in order to cope with difficult engineering control applications [9,10]. As a result of subsequent analysis -- both engineering analysis and the study of issues raised by Pribram and others -- these models are now essentially obsolete as models of mammalian intelligence; new, more complete models have been developed. However, the original models have still played a crucial role because: (1) by showing how it is possible to meet the four tests together, they opened the door to the development of more complex models in the same general family; (2) because they embody difficult control concepts in a simple context, they are crucial to our ability to really understand and develop more complex models embodying the same principles; (3) as testable, implemented designs they continue to provide a flow of empirical information about what works and what does not, and about the interface between higher-level models and lower-level components.

The Larger Context: The Role of Values and Utility

The new theory [1] is a theory of intelligence. But the mammalian brain includes more than just intelligence. Even before I began to develop serious designs for intelligent systems, decades ago[2], I started out by defining a working picture of how the intelligence fits in as part of the larger system [2,11]. This picture is illustrated somewhat crudely in Figure 2.

Figure 2. Intuitive Picture of What the Brain Does

In Figure 2, the intelligence proper consists of one big box containing two boxes inside of it -- the higher Critic and “the Rest.”

In this picture, the intelligence proper builds up an image or representation of the present state of the world, denoted by the vector R. The Primary Motivational System (PMS) provides a utility function, U(R). The job of the intelligent system is to develop a strategy for how to maximize U, over the long-term future, by manipulating the actions which it controls. The choice of actions is represented by another vector, u. Below the level of higher-order intelligence -- which is based on truly generalized learning dynamics -- there are more or less fixed preprocessors and postprocessors, such as the retina, the motor neurons in the spine, and even the “motor pools” in the brain stem[12]. These lower-level systems often do contain some kind of adaptation abilities, responsive to higher-level actions (u), but their adaptation tends to be somewhat specialized in nature.

Of course, this picture owes a lot to the classical work of Von Neumann and Morgenstern[13], which in turn was essentially just a mathematical formulation of ideas from John Stuart Mill and Jeremy Bentham. Von Neumann’s notion of “utility function” is well known in economics, but somewhat less so in other disciplines. “Utility,” in Von Neumann’s sense, is not a measure of the usefulness of an object. Rather, it is a measure used to evaluate the overall state of the entire world, R. (But see [3,4] for discussion of marginal utilities and prices.) Utility is a global measure of “success” or “performance,” or any other “figure of merit” which an intelligent system tries to maximize. It represents the most fundamental desires or “needs” of the organism, as discussed in the chapter by Pribram in this book.

All across the social sciences, there have been very rabid debates for and against the idea that people maximize some kind of utility function. In this picture, the organism is not assumed to do a perfect job of maximizing utility; rather, it is assumed to learn to do a better and better job, over time, subject to certain constraints which limit the performance of any real-world computing system. Perhaps the most important constraint is the need for a balance between play or novelty-seeking behavior, versus the danger of becoming stuck forever in a “rut” or in a “local minimum.” See [5,14] for more discussion of these issues.
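The tension between play and getting stuck in a rut can be made concrete with a deliberately tiny example. The sketch below is an illustration only, not part of the theory in [1]: the two-action world, the fixed payoffs, the round-robin exploration schedule, and the names `run` and `explore_every` are all assumptions introduced here.

```python
def run(payoffs, steps, explore_every=None):
    """Return the total payoff collected by a simple value-tracking agent.

    payoffs       -- fixed payoff of each action (hidden from the agent)
    explore_every -- if set, every explore_every-th step is spent trying
                     actions in round-robin order instead of the current best
    """
    estimates = [0.0] * len(payoffs)   # learned value of each action
    total = 0.0
    for t in range(steps):
        if explore_every and t % explore_every == 0:
            # "Play": try an action regardless of its current estimate.
            action = (t // explore_every) % len(payoffs)
        else:
            # Exploit: take the action currently believed to be best
            # (ties go to the lowest-numbered action).
            action = max(range(len(payoffs)), key=lambda a: estimates[a])
        reward = payoffs[action]
        estimates[action] = reward     # payoffs are deterministic here
        total += reward
    return total


payoffs = [1.0, 2.0]   # hypothetical: the second action is the better one
greedy_total = run(payoffs, steps=100)
playful_total = run(payoffs, steps=100, explore_every=10)
```

The purely greedy agent tries the first action, finds it better than nothing, and never leaves it. The agent that spends one step in ten on play pays a small price for that play, but discovers the better action and comes out well ahead over 100 steps.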

The utility function U(R) is not the only measure of value depicted in Figure 2. There is also another function, J(R), which is learned by the intelligent system proper. This function J(R) provides a kind of immediate strategic assessment of the situation R. The network which outputs J is called a “Critic” network.

The need for the Critic network emerges directly from the mathematics, which I will discuss in more detail below. Intuitively, however, the J function corresponds to the learned preferences discussed by Pribram in his chapter in this book. (See [3,4] for more discussion of this intuition.)
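For orientation, the relation between U and J can be written in the form that is standard in dynamic programming. This is a sketch under stated assumptions rather than a quotation from this paper: the discount rate r, the expectation brackets, and the explicit time index t are notation introduced here for illustration.

```latex
% Long-term objective: expected discounted sum of future utility
\max_{u(t),\,u(t+1),\,\ldots}\;
  \Big\langle \sum_{k=0}^{\infty} \frac{U(R(t+k))}{(1+r)^{k}} \Big\rangle

% Recursive (Bellman) form satisfied by the strategic assessment J
J(R(t)) \;=\; \max_{u(t)} \Big[\, U(R(t)) \;+\; \frac{\langle J(R(t+1)) \rangle}{1+r} \,\Big]
```

Under this reading, a Critic network is an approximator trained so that its output satisfies the second relation as closely as possible. Choosing u(t) then requires looking only one step ahead through J, rather than searching over the entire future.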

Following this picture, there are really four fundamental issues we need to consider when we try to understand “values”:

(1) How does the Primary Motivational System work? Where do the basic “needs” come from?

(2) How does the Upper Critic work? Where do our learned hopes and fears and preferences come from?

(3) Are there other measures of value in the “Rest of System,” and, if so, how do they work?

(4) Going beyond science, how can we as human beings do a better job of formulating and implementing our values, in light of this information?

Most of the chapters in this book, like most of my own published work, focus on “values” in the sense of question 2. They focus on the effort to understand the higher levels of intelligence as such, including the learned emotional associations in the main part of the limbic system. The new theory [1] addresses questions 2 and 3.

Nevertheless, as Pribram points out, the PMS is more fundamental and more universal, in some sense. It is the foundation on which the rest is built. Better intellectual understanding can logically supersede what we have learned from experience (questions 2 and 3), but it cannot supersede the PMS.

In my own work, I have discussed the PMS somewhat, mainly on a theoretical basis [2,5].

At this conference, Allan Schore gave an impressive talk on the process of imprinting, which is crucial to the dynamics of the PMS. The PMS obviously includes some fixed systems in the hypothalamus and epithalamus, which respond to a wide variety of simple variables like blood sugar, pain, and so on.

As Pribram points out, it often includes simple measurable variables which serve as indicators, in effect, of concepts which are more important to the organism but harder to measure. The PMS also includes some fairly complex social responses, as described by Schore, based in part on something like an associative memory. This is not the kind of associative memory which one uses as part of a utility maximization scheme; rather, it is a special kind of associative memory, with laws of its own, which requires more of an empirical approach to understanding.

The remainder of this paper will describe the new theory of intelligence, which includes a response to questions 2 and 3. For the sake of completeness, I will conclude this section with some thoughts about question 4.

Question 4 really raises the larger questions of ethics and of the concrete goals we pursue in our lives and in our society. In the past [5], I have argued that a sane and sapient person would gradually learn a two-pronged approach to these issues: (1) improved self-understanding, by learning, in effect, what one’s own PMS is really asking for (U); (2) a thoroughly systematic, intelligent effort -- supported by a synergistic use of symbolic reasoning and focused intuition -- to go ahead and maximize one’s U over the long-term future, over a large playing field. Broadly speaking, this is consistent with the spirit of Pribram’s comments in his chapter as well. Both prongs of this approach encourage us to learn to “see” all inputs to our consciousness -- including the emotional inputs -- in the same focused and intelligent way that many of us reserve for visual inputs only.

Certain cultural authorities would consider this approach extremely threatening. When people think for themselves, they may become harder to predict and control. However, these fears indicate a disrespect for human intelligence and a short-sighted effort to maximize the authorities’ own power.

(Indeed, such authorities have been known to foster illiteracy and the use of drugs in order to advance their allegedly higher ethical values.) As our environment becomes more and more challenging, and more and more dependent on the honest and creative management of complex information, such static, repressive and authoritarian attitudes become less and less consistent with human survival.

Description of The New Theory

Introduction

The new theory can be seen as a hybrid between hierarchical task planning, as practiced in AI, and neural network designs based on reinforcement learning or adaptive critics. In effect, the task planning aspect provides a way to operationalize the classic theory of Miller, Galanter and Pribram[15].

The neural network aspect provides a way to link the system to higher values and emotions, and to the brain. The design as a whole provides a unification of these two different approaches or aspects of higher-level intelligence.

The new theory is a natural outcome of earlier theories and concepts discussed in [3]. The details of the new theory were first reported as part of an international patent application filed on June 4, 1997. The main design-specification part of that disclosure was included in [1]. Links to biology are mentioned in [1], but they are sandwiched between discussions of the engineering details, which are quite complicated. This paper will not reproduce all of those details; instead, it will try to explain the points of greatest importance to biology and psychology.

Even one year ago, I had no intention of devising a theory which is strictly and classically hierarchical in nature. The concept of hierarchy has been used all too often as a quick ad hoc solution in AI, and it often limits the learning capability and flexibility of the resulting design. However, the seminal work of Vernon Brooks on motor control[16] makes it clear that some hierarchical approaches can still be compatible with parts of the biology. In the end, a careful study of the mathematics [1] gave me no choice, and it also provided a basis for incremental, neural-style learning in such a system.

In my present view, the new theory fits the mammalian brain in much the same way that certain line drawings can fit an oil painting. Often, before an artist paints an oil painting, he first draws a cartoon-style, black and white sketch of the same scene. The contours are all there, and all the objects are there, but the colors and the fine texture all have yet to be filled in. In the same way, the new theory -- as an engineering design -- actually involves several choices between possible subsystems. New subsystems will be needed, in the future, in order to improve performance and in order to fit the lower-level details of the brain more exactly. Nevertheless, this theory -- unlike its predecessors -- does seem to encompass all the major higher-order structure which needs to be filled in. The work ahead of us is less like running ahead into the wilderness, and more like settling in and developing known territory.

The work leading up to the new theory has already been published in the past. However, a certain degree of condensed review is necessary, to make this paper at least partly self-contained.

Relation to Prior Theories: Concepts and History

The new theory, like its predecessors, is a general purpose learning system. Like its predecessors, it is designed to learn how to develop a strategy of action which will maximize some externally specified utility function, U, over all future times. Expressed as an engineering design, it is a system which learns how to perform optimization over time. It is intended to operate in a generalized manner, such that it could learn to perform “any task,” based on learning from experience, without concrete prior knowledge of the task. This family of designs is sometimes referred to as “reinforcement learning systems,” but it is more accurate to call them “adaptive critic systems” or “approximate dynamic programming” (ADP).

If the new theory and its predecessors both perform this same function, then what is the difference between them? There are two important differences. First, the new design contains new features which substantially expand the range of tasks which can be learned in a practical way, at a reasonable learning speed. Second, these features provide a more complete fit to global features of the brain.

Decades ago, when this work began, I was impressed by the work of Hebb[17], who seemed to suggest that higher-level intelligence could emerge from the operation of a simple, general learning rule, operating across all neurons in the brain. That simple vision simply did not work out in functional or engineering terms, and it never provided any kind of explanation for the major structural components of the brain. Later, a variety of “reinforcement learning” models or designs were developed by several authors [11,18,19]; however, none of them met the simple four tests for a brain-like system mentioned in the Introduction above and discussed at length in [5]. This section will only discuss designs which do at least meet those minimal tests.

Figure 3. The Earliest “Brain-Like” Control Design

Figure 3 illustrates the simplest design in this family. In this design, there were three major components of the overall “brain.” The highest was the emotional system or “Critic,” whose task was to estimate the function “J” discussed above. I associated the Critic with the limbic system of the brain.

The middle was the “Model” -- a neural network trained to act as a simulation model of the environment of the organism. (Following standard engineering approaches, the Model also calculated an “estimated state vector R”, i.e. a representation of the current state of the environment.) The Model served, in effect, as an expectations system. I associated it with the cerebro-thalamic system. Finally, the system also contained an Action network -- a network trained to output control signals u, which affect muscles, glands and the orientation of sensors.
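The division of labor among the three components can be sketched in a few lines. What follows is a tabular stand-in, not the neural network design of Figure 3 and not the adaptation rules of [7]: the five-position world, the discount factor `GAMMA`, the lookup tables, and the names `environment`, `utility`, `model` and `policy` are all hypothetical, and the Critic is adapted here by repeated one-step lookahead rather than by backpropagated value signals.

```python
GAMMA = 0.9                     # discount factor (an assumption)
STATES = range(5)               # hypothetical 1-D world: positions 0..4
ACTIONS = (-1, +1)              # step left or step right


def environment(state, action):
    """The true dynamics, unknown to the agent: move and clip to 0..4."""
    return min(max(state + action, 0), 4)


def utility(state):
    """U(R): supplied from outside the learner; only position 4 is rewarding."""
    return 1.0 if state == 4 else 0.0


# Model: a lookup table filled in by trying every action in every state once.
model = {(s, a): environment(s, a) for s in STATES for a in ACTIONS}

# Critic: J(s) is repeatedly adjusted toward
# U(s) + GAMMA * J(best next state predicted by the Model).
J = {s: 0.0 for s in STATES}
for _ in range(300):
    for s in STATES:
        J[s] = utility(s) + GAMMA * max(J[model[(s, a)]] for a in ACTIONS)

# Action: in each state, choose the action whose predicted outcome the
# Critic rates highest.
policy = {s: max(ACTIONS, key=lambda a: J[model[(s, a)]]) for s in STATES}
```

In this toy world the Critic's estimate rises steadily toward the rewarding position, and the Action table learns to step right everywhere. The Action component never consults the utility function directly; it relies entirely on the Critic's assessment of what the Model predicts.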

The mathematics of how to adapt this system were first developed in 1970, first published in 1977[20], and explained more completely in 1992 [7]. The first practical engineering test, in simulation, was published by Jameson in 1993. By 1995, three additional groups -- Santiago and myself, Wunsch and Prokhorov, and Balakrishnan -- had implemented, tested and published more advanced designs of the same general flavor[8]. At a 1997 conference [9], four new groups reported implementations, some with substantial practical significance -- Ford Motor Company, Accurate Automation Corporation, Lendaris and Tang. Dolmatova and I, working through Scientific Cybernetics, Inc., have developed some new implementations, which will be published elsewhere in 1997. In addition to these nine groups in the United States, one researcher in Sweden -- Landelius [21] -- has also had important results. Even the simple design in Figure 3 is substantially more complex than the underlying designs used in the past in control engineering; however, as we develop more understanding of its practical capabilities relative to earlier control designs [10], the practical use of this family should continue to grow.

In Figure 3, the broken arrows represent a flow of value information, which is crucial to the dynamics of learning in this design. The overall design given in [7, chs. 3 and 13] does not specify what kinds of networks to insert into these boxes. It deliberately gives the user or modeler some flexibility in that regard. However, it does specify how to calculate the flow of value information, for any choice of network, and how to use that information in adapting the Critic and the Action network. (A different chapter -- [7, ch.10] -- discusses the structure and dynamics of learning for Model networks.) Those value calculations turn out to match exactly the rules for the “flow of cathexis” proposed by Freud years ago [2,3]. The same underlying algorithm which I developed for that application is now being used in most practical applications of ANNs, both in technology and in finance [2].

Back in 1970, I hoped that this kind of design would be enough to replicate higher-order intelligence. If the three boxes were filled in with simple neural networks, each made up of one class of neuron, then Hebb’s dream would almost be fulfilled. Instead of one “general neuron model,” one would need three general neuron models, one for each class of neuron. One would also have a basis for understanding the division of the brain into major components [2,3,22].

Over the years, this hope had to be modified. Engineering research told us more and more about how to “fill in the boxes.” It told us that we need to fill in the boxes with fairly complex structures, in order to get maximum performance [7]. Instead of three cell types, the research pointed towards a need for boxes within boxes, ultimately leading to dozens and dozens of basic, generalized cell types. But this did not invalidate the overall model, at first. In fact, it led to an improved fit with neuroscience; for example, it began to suggest an explanation for some of the more detailed features of the cerebro-thalamic system[3].

By 1992, however, this study of substructures had led us to a major paradox [7,23]. We discovered that many higher-order learning tasks require us to fill in some of the boxes with a certain class of neural network -- the Simultaneous Recurrent Network (SRN) -- which requires many iterations in order to settle down and calculate its output [24,25]. This slow speed of calculation is inconsistent with the requirement for fast calculation (circa 100-200 cycles per second) for the smooth coordination of bodily movements.

To explain this, I then proposed a “two brains in one” [3] theory of the mammalian brain. According to that theory, the brain contains two entire adaptive critic learning systems, each with its own Critic network. In the upper brain (limbic system and cerebro-thalamic system), the relevant boxes are filled in with these slow but powerful SRN components. Thus the upper brain performs a major cycle of computation over a relatively long time period -- 0.1 to 0.25 seconds, corresponding to alpha or theta rhythms. (Within that period, it goes through many minor iterations, as required by the SRN model and verified by recent multichannel EEG recordings in Pribram’s laboratories.) The lower brain (mainly the olive and cerebellum) uses simpler, more feedforward components, which allow it to operate at a faster rate. The lower brain is trained, in effect, to try to make the upper brain happy.

All of these theories had one really large, gaping hole in them: the lack of an explanation for the role of the basal ganglia. Decades ago, this did not seem like such a big hole, because the importance of the basal ganglia was not yet so well understood. But evidence has begun to accumulate [16,26,3,4] showing that the basal ganglia do indeed play a crucial role in the brain. Fortunately, this role also seems to be related to certain functional weaknesses of the earlier theories as well. It seems to involve the organization of time or “tasks” into large blocks or “chunks,” which make it easier for the system to do long-term planning. Thus once again, by working to improve the functional capabilities of our theory, we can expect to converge more closely to neuroscience as well. For the complete form of our new theory, we also revisit the treatment of space, which immediately suggests new interpretations of some well-known information about vision in the neocortex.

Relation to Prior Theories: Basics of the Mathematics

The new theory continues to assume that there is a kind of master-slave relation between the “upper brain” and the “lower brain,” as in [3]. It replaces the old picture of the upper brain with something far more complex.

The previous picture of the upper brain was still essentially based on Figure 3 (with various elaborations discussed in the engineering literature[5]). Figure 3, in turn, was originally derived from an effort to approximate dynamic programming. Dynamic programming is the only exact and efficient general-purpose algorithm available to calculate the optimal strategy of action over time in an uncertain, noisy, nonlinear environment.

In dynamic programming, the user supplies a utility function U(X) and a stochastic model to predict X(t+1) as a function of X(t) and u(t), where X denotes the present state of the environment and u denotes the actions taken by the controller at time t. (For simplicity I will assume that the state of the world as seen by the senses, X, is the same as the actual state of the world, R; however, the difference between the two is an important part of the theory.) Our task, in control, is to find the optimal strategy of action u(X). We do this by solving the Bellman equation:

J(X(t)) = max over u(t) of [ U(X(t)) + <J(X(t+1))> / (1+r) ] (1)

where the angle brackets denote expected value, and where this “r” is a kind of interest rate which can be set to zero for now. Our task is to find a function J(X) which satisfies this equation. According to the theorems of dynamic programming, the optimal strategy u(X) can be found simply by picking the actions which maximize <J(X(t+1))>, exactly as in equation 1.
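As a minimal sketch of what solving the Bellman equation means, the Python below treats a made-up problem with three states and two actions. It assumes equation 1 has the standard form J(X(t)) = max over u(t) of [U(X(t)) + <J(X(t+1))>/(1+r)], uses a nonzero r so that the sum of future utility stays finite, and finds J by simple repeated substitution (value iteration). The transition table `P`, the utilities and all function names are illustrative assumptions, not part of the theory. Exhaustive tabulation of this kind is feasible only because the example is tiny.

```python
# Illustrative 3-state, 2-action problem (all numbers are made up).
U = [0.0, 1.0, 5.0]   # utility U(X) of each state X
r = 0.1               # the "interest rate" r of equation 1

# P[a][x][y] = probability that X(t+1) = y, given X(t) = x and action u(t) = a
P = [
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],  # action 0
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.5, 0.0, 0.5]],  # action 1
]
STATES = range(len(U))
ACTIONS = range(len(P))


def expected_J(J, x, a):
    """<J(X(t+1))> when action a is taken in state x."""
    return sum(p * j for p, j in zip(P[a][x], J))


def bellman_backup(J):
    """Right-hand side of the assumed equation 1, for every state."""
    return [U[x] + max(expected_J(J, x, a) for a in ACTIONS) / (1.0 + r)
            for x in STATES]


def solve_bellman(tol=1e-10, max_iter=100000):
    """Repeat the backup until J stops changing (value iteration)."""
    J = [0.0 for _ in STATES]
    for _ in range(max_iter):
        J_new = bellman_backup(J)
        if max(abs(a - b) for a, b in zip(J_new, J)) < tol:
            return J_new
        J = J_new
    return J


def optimal_strategy(J):
    """u(X): the action that maximizes <J(X(t+1))> in each state."""
    return [max(ACTIONS, key=lambda a: expected_J(J, x, a)) for x in STATES]


J = solve_bellman()
u = optimal_strategy(J)
```

Once J is known, the strategy `u` requires only a one-step lookahead through the model, which is the property the adaptive critic designs try to preserve.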

Pure dynamic programming is not useful even in medium-scale control applications, because it is simply too difficult to find an exact solution to equation 1, in the general case. In approximate dynamic programming (ADP), we try to learn an approximate solution. In the simplest forms of ADP, we approximate the function J by some network J(X,W), where “W” represents the set of weights or parameters (e.g., synapse strengths) in the network. The goal is to adapt the weights W, so as to make J(X,W) fit equation 1 as closely as possible, in some sense. The equations used to adapt the weights W effectively imply a theory for how the synapse strengths are adapted over time in the corresponding biological neural network.

In the simplest previous theory, the Critic would be adapted as follows. At every other time t, we would remember X(t), and then wait until time t+1 before performing adaptation. At time t+1, we would calculate J*=U(X(t))+J(X(t+1)). Then we would go back to considering X(t). We would adjust the weights W so as to make J(X(t),W) match more closely to J*. I called this approach Heuristic Dynamic Programming (HDP) when I first published it in 1977 [20]; however, it was rediscovered in 1983 under the name “Temporal Differences”[18].

There are many variations of this approach, but the ones which work all have the same general effect: they update the estimate of J(t) so as to account for information one time period ahead, from time t+1. In effect, after training, the organism “sees into the future” one time period further than it did before learning. But if one time period is a very short interval of time (such as 0.25 second!), this may not be much of an improvement in foresight. Furthermore, it would take considerable training, over many possible states X(t), before the system could look ahead one time period further across all of these possible states.
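The HDP update and its limited foresight can be illustrated with a small sketch. The Python below assumes the simplest possible Critic, a table with one weight per state so that J(X,W) = W[X], a deterministic four-state chain with a final absorbing state, r = 0, and a learning rate of 1. These choices, and the names `NEXT`, `hdp_episode` and `history`, are illustrative assumptions only.

```python
# Deterministic chain 0 -> 1 -> 2 -> 3, where state 3 is absorbing with U = 0.
U = [1.0, 1.0, 1.0, 0.0]
NEXT = [1, 2, 3, 3]


def hdp_episode(W, lr=1.0, start=0, steps=3):
    """One pass along the chain, adapting the tabular Critic W in place."""
    x = start
    for _ in range(steps):
        x_next = NEXT[x]
        target = U[x] + W[x_next]      # J* = U(X(t)) + J(X(t+1)), with r = 0
        W[x] += lr * (target - W[x])   # move J(X(t),W) toward J*
        x = x_next
    return W


W = [0.0, 0.0, 0.0, 0.0]
history = []
for episode in range(3):
    hdp_episode(W)
    history.append(list(W))
# history[0] == [1.0, 1.0, 1.0, 0.0]   one step of foresight
# history[1] == [2.0, 2.0, 1.0, 0.0]   two steps
# history[2] == [3.0, 2.0, 1.0, 0.0]   the true J for this chain
```

Each full pass extends the Critic's foresight by only one time period, even with the largest stable learning rate, which is the horizon problem described above.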

This “foresight horizon problem” is not new. I discussed it, for example, in [27,28].

There are many ways to try to improve the degree of foresight, even for this class of designs; some of these approaches seem quite promising, while others have been proven incorrect [7,28]. Nevertheless, whatever tricks or design modifications one uses, there is no substitute for the simple effort to use “t+T” in place of “t+1,” where T is a larger time interval. More precisely, by using “t+T” instead of “t+1,” in addition to other design improvements, we can still expect to achieve much greater foresight. The problem lies in how to do this, within the framework of a learning-based neural network design. The new theory provides a solution to this problem. In AI, this problem is commonly referred to as the problem of “temporal chunking.”

Another major deficiency of the previous theories concerns the treatment of space. In AI, for example, the problem of “spatial chunking” has received a lot of attention. John Holland has talked about road maps as an example of a compressed representation of space. Albus [29] has discussed the problem of coordinating the actions of diverse robots spread over a large region of space such as a factory floor.

Even in the new theory, the treatment of space is seen as a subsystem issue. In principle, it does not affect the highest-level structure of the mathematical theory. Individual mammals, unlike factory management systems, are not really spread out over a large region of space. Humans and other mammals are not 100% flexible about “doing two things at the same time.”

Nevertheless, the treatment of space really does have some pervasive implications. There are some basic principles of symmetry which have been exploited very effectively in the image processing part of the neural network community [30,24,25], but largely ignored in other parts of the community.

The new theory assumes that the brain exploits a similar but more general form of symmetry, which I call “object symmetry”[1]. In essence, object symmetry simply predicts that objects will behave the same way, as a function of their relations to other objects, regardless of where they are physically located or seen. It is extremely important to exploit this principle, when trying to adapt to a very complex environment which contains thousands of significant objects and locations in it, but traditional neural networks do not do so.

In principle, there are three ways to train neural networks to exploit object symmetry. One way is to broadcast connections and weights from a “master copy” of a small neural network; this is not biologically plausible, in my view. Another way involves the use of special tricks to encourage separate but similar regions to learn the same relationships; for example, rapid eye movements may result in many areas in visual cortex scanning the same object, and learning from it. In my view, this kind of parallel learning is an important part of what happens in neocortex; for example, it explains the ability of adults to classify all the letters in a word at the same time, in parallel, in one major cycle time (circa 0.1-0.25 second). On the other hand, such parallel training does not exploit the full theoretical potential implied by symmetry; it requires a lot of learning experience.

The third approach is multiplexing -- the use of the same network, at different times, with different inputs, in order to recognize and predict different objects, accounting for their relations to a few related objects. The literature on dynamic feature binding in visual physiology strongly suggests that the brain does exploit the multiplexing approach, particularly in the regions of temporal cortex which recognize objects and (in humans) link up to the nearby speech centers. A key part of the new theory deals with methods for defining and adapting networks which map back and forth between ordinary vectors X and structures made up of objects and relations obeying object symmetry [1]. This in turn would have substantial implications for the organization of neocortex and hippocampus.

In the brain, this approach highlights the need for the organism to maintain a kind of global “world model” [29] or inventory of objects in associative memory, even as the organism really focuses on one (or a few) objects at any time. Fukushima [9] has presented a theory of how this might be done, by patching together visual images, but his theory does not exploit the links between images and places which are crucial to optimal performance. In our theory, we can use essentially the same sort of architecture, except that visual images are replaced by a cluster of objects and relations. Because such a cluster would include links between specific objects or places, better performance could be expected.

Temporal Chunking: Reasons for a Task-Oriented Approach

Strictly speaking, temporal chunking presents an opportunity, not a problem. The conventional t/(t+1) designs do converge, in theory, to an optimal strategy -- eventually. My claim is that new designs, exploiting temporal chunking, can converge faster, when foresight over long periods of time is required.

In classical AI, there are two common forms of temporal chunking -- multiresolutional or clock-based approaches, and task-based or event-based approaches. This section will try to explain, a bit more precisely, why these approaches can improve foresight, and why the task-based approach is probably preferable.

For simplicity, consider the situation of an environment which can only exist in N possible states, where N is finite. Thus the state of the environment at any time, t, can be represented as some integer i between 1 and N. Consider the situation where the strategy of action u(i) is fixed, and we are simply trying to learn/approximate the function J(i). For more explanation of this simple case, see [30,31].

In this situation, the utility function U can be represented as a vector with N components, U₁ through U_N, where U_i is simply U(i). Likewise, the function J(i) can be represented as a vector J. The model of the environment simply consists of a set of numbers, P_ij, representing the probability that the environment will occupy state i at time t+1 if it was in state j at time t. P is just a matrix of transition probabilities, and it defines what is called a Markov chain. Let us define M=P^T, the transpose of P.

In this situation, the J function used in the Bellman equation has a simple representation:

J = U + MU + M²U + M³U + ... (2)

The first term on the right represents the utility U in the current state (time t), the second term represents the utility in the next state (time t+1), and so on. (Roughly speaking, if X(t+1)=PX(t), then U^TX(t+1) = U^TPX(t) = (MU)^T X(t).) J represents the sum of utility over all future time periods, which is what we really want to maximize. From equation 2, we may immediately deduce the Bellman equation for this special case:

J = U + MJ (3)
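The step from equation 2 to equation 3 can be spelled out by factoring M out of every term after the first, assuming that the series in equation 2 converges:

```latex
\begin{aligned}
J &= U + MU + M^{2}U + M^{3}U + \cdots \\
  &= U + M\left(U + MU + M^{2}U + \cdots\right) \\
  &= U + MJ .
\end{aligned}
```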

The usual adaptive critic methods basically approximate an old update rule discussed by Howard [30]:

J⁽ⁿ⁺¹⁾ = U + MJ⁽ⁿ⁾ (4)

This states that our n+1-th estimate of the J function is based on the n-th estimate as shown in this equation. If we initialize J⁽⁰⁾ to equal U, then it is easy to deduce that:

J⁽ⁿ⁾ = U + MU + ... + MⁿU (5)

In other words, after n such very large updates (each accounting for every possible state of the environment!), our J function “sees” only n steps into the future.
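As a small check of equations 4 and 5, the sketch below builds a made-up five-state chain in which utility is received only in state 3, and compares n applications of the update rule with the partial sum U + MU + ... + MⁿU. The chain, the utilities and the helper names are illustrative assumptions.

```python
N = 5
U = [0.0, 0.0, 0.0, 1.0, 0.0]   # utility only in state 3; state 4 is absorbing

# P[i][j] = probability of state i at time t+1, given state j at time t.
P = [[0.0] * N for _ in range(N)]
for j in range(N - 1):
    P[j + 1][j] = 1.0            # deterministic step j -> j+1
P[N - 1][N - 1] = 1.0

# M = P^T, so that (M U)[j] is the expected utility one step after state j.
M = [[P[i][j] for i in range(N)] for j in range(N)]


def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def howard_updates(n):
    """Apply equation 4 n times, starting from J = U."""
    J = list(U)
    for _ in range(n):
        J = [u + mj for u, mj in zip(U, mat_vec(M, J))]
    return J


def partial_sum(n):
    """Equation 5: U + MU + ... + M^n U."""
    total, term = list(U), list(U)
    for _ in range(n):
        term = mat_vec(M, term)
        total = [t + x for t, x in zip(total, term)]
    return total
```

In this example state 0 is three steps away from the only source of utility, so `howard_updates(2)[0]` is still 0.0 and only `howard_updates(3)[0]` reaches 1.0.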

In matrix algebra, there is a much faster way to calculate the same sort of result. We may write it as:

J⁽ⁿ⁾ = ... (I + M⁸)(I + M⁴)(I + M²)(I + M)U (6)

(As an example, notice how (I+M⁴)(I+M²)(I+M) = I+M+M²+M³+...+M⁷!)

This sort of update procedure allows us to see about 2ⁿ time periods into the future after only n updates! Foresight grows exponentially, not linearly. The price is that we need to compute the matrices M², M⁴, etc. But notice what these matrices represent! (M⁸)_ij, for example, simply represents the probability of ending up in state j at time t+8, if you start out in state i at time t. These matrices are simply predictive models of the environment, jumping ahead 2 time periods, 4 time periods, etc. Thus equation 6 simply represents the classical idea of multiresolutional analysis, based on models that predict over large but fixed time intervals.
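The doubling scheme of equation 6 can be checked numerically. The sketch below uses a made-up four-state chain (written directly as M, with the last state absorbing) and compares n doubling updates with 2ⁿ-1 ordinary updates of equation 4. All names and numbers are illustrative assumptions.

```python
# M[j][i] = probability of state i at time t+1, given state j at time t.
M = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],   # absorbing state
]
U = [1.0, 2.0, 3.0, 0.0]


def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def mat_mul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def linear_updates(n):
    """Equation 4 applied n times, starting from J = U."""
    J = list(U)
    for _ in range(n):
        J = [u + mj for u, mj in zip(U, mat_vec(M, J))]
    return J


def doubling_updates(n):
    """Equation 6 with n factors: (I + M^(2^(n-1))) ... (I + M^2)(I + M) U."""
    J = list(U)
    M_power = M                      # holds M^(2^k), starting with k = 0
    for _ in range(n):
        J = [j + mj for j, mj in zip(J, mat_vec(M_power, J))]
        M_power = mat_mul(M_power, M_power)
    return J
```

For example, `doubling_updates(4)` agrees with `linear_updates(15)` up to rounding, at the cost of four matrix squarings.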

In the real world, however, there are serious problems here. First, it is hard to imagine a biological basis for such an arrangement. Second, the original matrix M will usually be extremely sparse, but the matrix M^k for large k will tend to be fully connected. If we do not somehow exploit the sparsity of M, then we do not achieve the full potential for cost reduction and efficiency.
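The loss of sparsity is easy to see in a toy case. The sketch below assumes a ring of 16 states in which the system either stays put or moves to the next state, and measures the fraction of nonzero entries in M^k. The ring, its size and the helper names are illustrative assumptions.

```python
N = 16
# Ring of N states: stay put or step to the next state, each with probability 0.5.
M = [[0.0] * N for _ in range(N)]
for j in range(N):
    M[j][j] = 0.5
    M[j][(j + 1) % N] = 0.5


def mat_mul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def mat_power(A, k):
    """A^k for k >= 1, by repeated multiplication."""
    result = A
    for _ in range(k - 1):
        result = mat_mul(result, A)
    return result


def fill_fraction(A):
    """Fraction of entries of A that are nonzero."""
    nonzero = sum(1 for row in A for a in row if a != 0.0)
    return nonzero / (len(A) * len(A[0]))


fractions = {k: fill_fraction(mat_power(M, k)) for k in (1, 2, 4, 8, 15)}
# fractions == {1: 0.125, 2: 0.1875, 4: 0.3125, 8: 0.5625, 15: 1.0}
```

M itself has only two nonzero entries per row, but by k = 15 every state can reach every other state and M^k is completely full.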

In developing the new model, I began by assuming that the states of the environment can be divided up into blocks, A, B, etc. This led to a modified Bellman equation which can be written, crudely, as:

J^A = J^A0 + J^AI J^A+, (7)

where J^A represents the J function for states within block A; where J^A0 represents the payoffs (U) to be expected in the future within block A, before the system next exits that block; where J^A+ represents the J function for states immediately outside block A, which I call the “postexit” states of A; and where J^AI is a kind of connection matrix.

In order to learn J most effectively, one can begin by learning J^A0 and J^AI within each block; this learning is done on a purely local basis, because it does not require knowledge of anything outside the block. Then, after these “local critics” have been adapted, one can use equation 7 to propagate values backwards from the end of the block (J^A+) to the initial states at the beginning of the block, which also serve as the postexit states of the previous block. In other words, equation 7 can act like equation 4, except that an update jumps over the entire time-history of a block in one direct calculation.
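A minimal sketch of this two-stage procedure follows, for a made-up six-state chain split into block A (states 0-2), block B (states 3-4) and an absorbing final state. It assumes r = 0, a fixed strategy, and one particular reading of equation 7: J^A0 is the utility expected before the next exit from the block, and J^AI gives, for each state in the block, the probability of exiting into each postexit state. The local critics are computed here by simple iteration rather than learned by a network, and all names are illustrative assumptions.

```python
# M[j][i] = probability of state i at time t+1, given state j at time t.
M = [
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # absorbing final state
]
U = [1.0, 1.0, 1.0, 2.0, 2.0, 0.0]


def local_critics(block, postexit, sweeps=200):
    """Compute J^A0 and J^AI using only transitions that start inside the block."""
    J0 = [0.0] * len(block)
    JI = [[0.0] * len(postexit) for _ in block]
    for _ in range(sweeps):
        J0 = [U[s] + sum(M[s][s2] * J0[b] for b, s2 in enumerate(block))
              for s in block]
        JI = [[M[s][p] + sum(M[s][s2] * JI[b][k] for b, s2 in enumerate(block))
               for k, p in enumerate(postexit)]
              for s in block]
    return J0, JI


def propagate(J0, JI, J_postexit):
    """Equation 7: J^A = J^A0 + J^AI J^A+."""
    return [j0 + sum(w * jp for w, jp in zip(row, J_postexit))
            for j0, row in zip(J0, JI)]


BLOCK_A, POSTEXIT_A = [0, 1, 2], [3]
BLOCK_B, POSTEXIT_B = [3, 4], [5]

# Stage 1: purely local learning within each block.
J0_A, JI_A = local_critics(BLOCK_A, POSTEXIT_A)
J0_B, JI_B = local_critics(BLOCK_B, POSTEXIT_B)

# Stage 2: propagate values backwards, one whole block per update.
J_final = [0.0]
J_B = propagate(J0_B, JI_B, J_final)
J_A = propagate(J0_A, JI_A, [J_B[0]])   # state 3 is the postexit state of A
```

Only two applications of `propagate` are needed to carry value information from the final state back to state 0, whereas equation 4 would need many updates to cover the same stretch of time.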

To make full use of this kind of procedure, one needs to develop a hierarchy of blocks within blocks (somewhat analogous to equation 6). One also needs to consider how we have a choice between blocks or tasks, at certain times. In reality, the choices we face are often very discrete in nature. (Consider, for example, the choice of where to place the next stone, in a game of Go.) Because they are discrete choices -- each representing a kind of local optimum -- we need to use a crisp, discrete decision-making system, very different from the incremental and fuzzy value systems which work best at the highest and lowest levels of the brain. (Actually, the new theory does allow for fuzzy and incremental decision making at the middle level, within each decision block.)

The greatest challenge in developing the new theory was to find a smooth and effective approach to approximating this kind of design with neural networks, and to maintain a learning-based approach throughout.

Overview of the Structure

Figure 4. Structure of a Decision Block

According to the new theory, the “middle level” of the brain is made up, in effect, of a library of “decision blocks,” illustrated in Figure 4. In principle, each decision block is separate from every other block; however, as in any neural network design, a sharing of hidden nodes and of inputs is allowed.

Each decision block has a “level.” When decision block A is activated, it then has the job of deciding which lower-level decision block it should activate. The system is sequential. In other words, at any given time, only one block is activated at any level; however, a new decision block can be constructed to represent the attempt to activate two unrelated blocks on the same level at the same time. (Usually, however, it takes some time for the system to learn how to do two specific things at once in a smooth manner.) In the brain, the “level” would probably correspond to the level of the relevant loop in the neocortex-basal-thalamus reverberatory circuit, as described by Brooks [16]. (In theory, however, there could also be a kind of linked-list connection between a decision and its subordinate.)

The trickiest part, in any decision block, is how to modulate the actions in that block, so as to better account for what later decision blocks will want after the current task is completed. This is done by providing the secondary decision information suggested in the figure: “adverbs” u_A, a fuzzy goal image g, and some sort of condensed representation of longer-term goals. In effect, this information provides a kind of condensed representation of the vector J^A+ of equation 7. The fuzzy goal image approximates J, in effect, as:

(8)

This representation may seem a bit arbitrary, at first, but it emerges rather strongly from careful consideration of the difficulty of sharing communications between decision blocks, and of the limitations of possible alternatives [1]. This is a strong theoretical prediction -- that explicit images of goals or of desired states of the world play a central role in the effectiveness of human planning. Perhaps the word “goal” really is another one of those four-letter words representing a hard-wired invariant of human thought.

Within each decision block, as before, there is also a need to adapt local critic networks, representing the J^A0 and J^AI of equation 7. Another network J^A+ can improve the accuracy, in principle, of the interface between this block and its successors; however, it is not strictly necessary. Yet another network J^A- is needed to help decision block A to formulate the goals g^A- which it could pass back to the previous decision blocks.

Another key feature of this design is the need for a kind of decision network within each block, analogous to the action network of Figure 3. At this level, however, the concern about local optima and the need for exploration of alternatives is very great; therefore, the decision networks must be stochastic networks, which propose possible decisions, based on some notion of their expected payoffs. A key part of the new design [1] is an approach to training such stochastic networks, which tries to make them output options based on a Gibbs function of their expected value (in effect, J). This function includes a kind of global “temperature” parameter, which is compatible with discussions by Levine and Leven [32], among others, of global variations in novelty-seeking behavior in humans and other organisms. The assumption here is that the neocortex outputs these possible options, but that the discrete go/no-go choices are made in the basal ganglia (except for “decisions” output directly from the motor cortex to the lower brain). This is also consistent with Grossberg’s view that data on the basal ganglia imply that discrete, binary go/no-go decisions are made in that part of the brain. The local critics provide a basis for making these final choices, as a function of the global “temperature.”
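The role of the temperature parameter can be illustrated with a short sketch. The code below does not reproduce the training method of [1]; it only shows the kind of target distribution a Gibbs function implies, assuming that each option k is proposed with probability proportional to exp(J_k / temperature). The function names and the example values are illustrative assumptions.

```python
import math
import random


def gibbs_probabilities(expected_values, temperature):
    """Probability of proposing each option, proportional to exp(J / temperature)."""
    scaled = [v / temperature for v in expected_values]
    top = max(scaled)                        # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def propose_option(expected_values, temperature, rng=random):
    """Draw the index of one option from the Gibbs distribution."""
    probs = gibbs_probabilities(expected_values, temperature)
    return rng.choices(range(len(expected_values)), weights=probs, k=1)[0]


J_options = [1.0, 2.0, 4.0]   # hypothetical expected values of three options
cold = gibbs_probabilities(J_options, temperature=0.1)    # almost always the best option
hot = gibbs_probabilities(J_options, temperature=100.0)   # close to uniform
```

At low temperature the proposals concentrate on the option with the highest expected value, while at high temperature they approach a uniform spread, which corresponds to more novelty-seeking behavior.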

Finally, the decision blocks must each contain a kind of internal Model, similar in a way to the Model network of Figure 3. This can be useful as part of the system to adapt the other networks, as in Figure 3. But it provides another important capability: the capability to simulate possible decisions and results. This capability for simulation or imagination or “dreaming”[22] is crucial to actually exploiting the benefits of large-scale temporal chunking. If a decision block represents a task which takes T time periods to unfold, then it should be possible to consider T alternative possible decisions or tasks or scenarios during a period when only one actual task is accomplished! In the brain, the multiplexing of different decision levels may reduce this benefit somewhat, but one would still expect that more time is spent considering future possibilities than in tracking present reality.

Another crucial aspect of this design is that the Model within a higher-level decision block must be truly stochastic, like the SEDP design in [7, ch.13]. The reasons for this are discussed further in [1], along with the details of how to adapt these various component networks.

References

[1] P.Werbos, A Brain-Like Design To Learn Optimal Decision Strategies in Complex Environments, in M.Karny, K.Warwick and V.Kurkova, eds, Dealing with Complexity: A Neural Networks Approach. Springer, London, 1997.

[2] P.Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, Wiley, 1994.

[3] K.Pribram, ed., Learning as Self-Organization , Erlbaum 1996.

[4] K.Pribram, ed., Origins: Brain and Self-Organization, Erlbaum, 1994, p.46-52.

[5] P.Werbos, Optimization: A Foundation for understanding consciousness. In D.Levine & W. Elsberry (eds) Optimality in Biological and Artificial Networks?, Erlbaum, 1996.

[6] M.E.Bitterman, The evolution of intelligence, Scientific American, January 1965.

[7] D.White & D.Sofge, eds, Handbook of Intelligent Control, Van Nostrand, 1992.

[8] P.Werbos, Optimal neurocontrol: Practical benefits, new results and biological evidence, Proc. World Cong. on Neural Networks(WCNN95),Erlbaum, 1995

[9] Proc. Int’l Conf. Neural Networks (ICNN97), IEEE, 1997.

[10] P. Werbos, Neurocontrollers, in J.Webster, ed, Encyclopedia of Electronics and Electrical Engineering, Wiley, forthcoming.

[11] P.Werbos, The elements of intelligence. Cybernetica (Namur), No.3, 1968.

[12] W.Nauta & M.Feirtag, Fundamental Neuro-anatomy, W.H.Freeman, 1986.

[13] J.Von Neumann and O.Morgenstern, The Theory of Games and Economic Behavior,Princeton NJ: Princeton U. Press, 1953.

[14] P.Werbos, The cytoskeleton: Why it may be crucial to human learning and to neurocontrol, Nanobiology, Vol. 1, No.1, 1992.

[15] G.A.Miller, E.H.Galanter and K.Pribram, Plans and the Structure of Behavior, New York: Holt, Rinehart and Winston, 1960.

[16] Vernon Brooks, The Neural Basis of Motor Control, Oxford U. Press, 198_.

[17] D.O.Hebb, The Organization of Behavior, Wiley, 1949.

[18] A.Barto, R.Sutton and C.Anderson, Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Trans. SMC, Vol. 13, No.5, 1983, p.834-846.

[19] B.Widrow, N.Gupta & S.Maitra, Punish/reward: learning with a Critic in adaptive threshold systems, IEEE Trans. SMC, 1973, Vol. 5, p.455-465.

[20] P.Werbos, Advanced forecasting for global crisis warning and models of intelligence, General Systems Yearbook, 1977 issue.

[21] T.Landelius, Reinforcement Learning and Distributed Local Model Synthesis, Ph.D. thesis and Report No.469, Department of Electrical Engineering, Linkoping U., 58183, Linkoping, Sweden.

[22] P.Werbos, Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research, IEEE Trans. SMC, Jan./Feb. 1987.

[23] P.Werbos and A.Pellionisz, Neurocontrol and neurobiology, Proc. Int’l Joint Conf. Neural Networks (IJCNN92), IEEE, 1992.

[24] P.Werbos & X.Z.Pang, Generalized maze navigation: SRN critics solve what feedforward or Hebbian nets cannot.Proc. Conf. Systems, Man and Cybernetics (SMC) (Beijing), IEEE, 1996. (An earlier version appeared in WCNN96.)

[25] X.Z.Pang & P.Werbos, Neural network design for J function approximation in dynamic programming, Math. Modelling and Scientific Computing (a Principia Scientia journal), special issue on neural nets planned for circa December 1996. See also links on www.glue.umd.edu/~pangxz

[26] J.Houk, J.Davis & D.Beiser (eds), Models of Information Processing in the Basal Ganglia, MIT Press, 1995.

[27] W.T.Miller, R.Sutton & P.Werbos (eds), Neural Networks for Control, MIT Press,1990, now in paper

[28] P.Werbos, Consistency of HDP applied to a simple reinforcement learning problem, Neural Networks, March 1990

[29] J.Albus, Outline of Intelligence, IEEE Trans. Systems, Man and Cybernetics, Vol.21, No.2, 1991.

[30] R.Howard, Dynamic Programming and Markov Processes, MIT Press, 1960.

[31] D.Bertsekas, Dynamic Programming and Optimal Control, Belmont, MA: Athena Scientific, 1995.

[32] D.S.Levine and S.J.Leven, eds, Motivation, Emotion, and Goal Direction in Neural Networks, Erlbaum, 1992.