Supervised Learning:

         Can It Escape Its Local Minimum?

 

                                                                   Paul J. Werbos

                                              Room 1151, National Science Foundation*

                                                            Washington D.C. 20550

 

                                                                       Abstract

 

     Tremendous progress has been made, and new research horizons have opened up, in developing artificial neural network (ANN) systems which use supervised learning as an elementary building block in performing more brain-like, generic tasks such as control, planning, or forecasting across time[1,2]. However, many people believe that research into supervised learning itself is stuck in a kind of local minimum or plateau. Limited capabilities in supervised learning have created a serious dilemma, forcing us to choose between fast real-time learning and good generalization with many inputs, without allowing us to have both together.

     This paper will first review and evaluate the supervised learning field as a whole, highlighting open questions and research opportunities. Then it will argue that there are at least two possibilities for breakthrough research which could seriously alleviate (though not eliminate) the speed-versus-generalization dilemma. Backpropagation itself, in its original form from 1974[3], was based on a translation of ideas from Freud into well-defined engineering mathematics. The possibilities discussed here would involve a similar translation of another idea from Freud and an idea from Karl Pribram. Further insights from psychology or AI might also be of use in guiding the engineering implementation of these ideas.

 

                                             Overview of Supervised Learning Today

 

     Supervised learning is one of the best-defined, most universally understood ideas in the neural network field. We all know that "supervised learning" includes all methods to train ANNs to approximate a function Y=f(X), where X is a vector of "inputs" and Y is a vector of "targets," and where there may be noise in the training set. Nevertheless, some people use the term "unsupervised learning" to describe forms of associative memory ("heteroassociative memory") used to perform the exact same task. This is simply a semantic error, which should be discouraged.

     Years ago, the field of supervised learning was dominated by three methods: (1) basic backpropagation, used to adapt feedforward multilayer perceptrons (MLPs) by steepest descent; (2) heteroassociative memory schemes, based on Hebbian learning approaches, originating in the Grossberg school of thought; (3) correlation-based Hebbian schemes, which require the inputs to be decorrelated in order to give correct results even in the linear case. The third method has nearly disappeared, for two reasons: (1) the restriction to the linear case was a problem; (2) the requirement for decorrelated inputs tends to reduce fault tolerance, and is strongly at odds with biological studies showing redundancy or correlation in input representations (e.g., see Simmons in [2]).

     In practical engineering applications today, supervised learning tasks are mainly performed by one of two types of design: (1) feedforward MLPs adapted by gradient-based methods, usually a bit more sophisticated than raw steepest descent; (2) feedforward "local learning networks," such as radial basis functions (RBF) and CMAC, the two methods most popular now in the U.S.[1]. The latter are used instead of Hebbian-based associative memories, because there is an impression that they yield similar capabilities but are much easier to implement. (There are a few unbiased engineers, however, who claim good results with variants of Pao's functional link networks[4] or with Kanerva's sparse distributed memory[5]. In one application (see Ritter in [5]), a "neural gas" local network -- an outgrowth of the Kohonen school of thought -- appears to yield much higher accuracy than CMAC, at the price of drastically reduced real-time learning ability. Kohonen-like local networks appear to be far more popular in Europe[5] than in the U.S., unless one counts Cooper's local RCE nets as part of that class.)

________________________________

*This paper is intended only to provoke thought. It should not be construed as any kind of policy statement either of NSF or of the author. As work done on government time, it is in the public domain subject to proper citation and retention of this caveat. This paper would have never been written without some strong encouragement from Stephen Grossberg, who nevertheless bears none of the blame for the specific views expressed here.

     The conventional wisdom now is that MLPs yield better generalization and a more parsimonious representation of functions with many inputs, while the local networks are much better at learning quickly in real time when the number of inputs is limited. Andrew Barron has proven new theorems about approximation errors with MLPs versus local (or Taylor series) networks which tend to support this view. The choice between MLPs and local networks is essentially a choice between different network designs or functional forms, rather than a choice between adaptation techniques; many researchers have shown that some local networks can be adapted quickly with least squares and gradient-based techniques (i.e., backpropagation).

     A major industrial user of ANNs recently put it this way: "Dozens of people have come to me claiming to have a way to bypass all the problems of slow learning in gradient-based systems. Every single one of them turned out to be peddling some kind of lookup table." Conceptually, at least, the local networks in use today can indeed be understood as smoothed-out, sophisticated versions of a lookup table.

     The local minimum problem is another important issue in supervised learning, but it does not pose problems for these network designs as such. The details of that issue are beyond the scope of this paper.

 

                              Lessons From Batch/Offline Uses of Supervised Learning

 

     The choice between fast real-time learning and good generalization with many inputs is not a big problem in the many applications where good generalization is essential but real-time learning is not. Even overnight convergence on a dedicated machine is often far cheaper -- in dollars and in labor -- than the alternatives. Even in real-time "adaptive control," one can train an ANN to "learn offline to be adaptive online"[1,6]. In other words, one can use offline training to adapt a recurrent network. The resulting controller may not actually learn in real time, but it could "adapt" to changes in the parameters of its environment through the action of the recurrent neurons[1,6]. This kind of capability requires the use of time-lagged recurrent networks (TLRNs[1]), which are not within the scope of supervised learning as such.

     When batch or offline learning is acceptable, there is a wide range of tools available to speed up the adaptation of MLPs or of other global network designs. There are powerful accelerator boards and computers exhibited here at this conference. Also, error-minimization in batch mode is a special case of well-studied, well-understood problems in operations research. Even today, one reads papers reinventing the idea of using old conjugate-gradient methods, like Fletcher-Reeves and Polak-Ribiere (which in turn are more precise versions of "momentum"); however, years ago[7], I reported -- as one might expect -- much better results from using the Shanno conjugate gradient method, which numerical analysts regarded as the standard method for very large, truly nonlinear problems. With smaller networks, the best reported results came from using "BFGS"[8] -- where the "S" stands for Shanno.
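
     For concreteness, here is a minimal sketch of batch error-minimization using off-the-shelf numerical optimizers. The tiny MLP, the synthetic data, and the use of scipy's general-purpose routines (whose "CG" method is a Polak-Ribiere-type conjugate gradient, and whose "BFGS" is the quasi-Newton method named above) are illustrative assumptions, not a specific design from the literature:

# Batch training of a small MLP by handing the total squared error to a
# standard optimizer.  Gradients are approximated by finite differences here
# for brevity; a backpropagation gradient would be far cheaper.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))           # a batch of 200 training patterns
Y = np.sin(X[:, :1]) + 0.1 * X[:, 1:]       # arbitrary smooth target function

n_hid = 8
shapes = [(n_hid, 2), (n_hid,), (1, n_hid), (1,)]    # W1, b1, W2, b2
sizes = [int(np.prod(s)) for s in shapes]

def unpack(w):
    parts, i = [], 0
    for s, n in zip(shapes, sizes):
        parts.append(w[i:i + n].reshape(s))
        i += n
    return parts

def batch_error(w):
    W1, b1, W2, b2 = unpack(w)
    h = np.tanh(X @ W1.T + b1)              # hidden layer
    err = Y - (h @ W2.T + b2)               # linear output layer
    return float(np.sum(err ** 2))          # total error over the whole batch

w0 = 0.1 * rng.standard_normal(sum(sizes))
res = minimize(batch_error, w0, method='BFGS')   # or method='CG'
print(res.fun, res.nit)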

     Shanno himself has suggested[9] that marginal further improvements could be had by using the newer Nocedal method, but that much larger improvements may be possible with new, modified versions of Karmarkar methods. This idea has been explored in a preliminary way by myself, by DeClaris, and by Trafalis of Oklahoma State. It now seems that the Karmarkar approach maps into a kind of "attention" scheme, which mimics the tendency of biological systems to pay more attention to different patterns at different times (analogous to weighted regression in statistics). This appears to be a promising approach, but a lot more research would be needed to pin down its potential.

     Shanno has also proposed that we work harder to develop "preconditioners" for use with conjugate gradient methods. Unfortunately, I am aware of only one effort along those lines. That effort failed, in part because it focused on toy problems (i.e., few patterns, such that error could be reduced to zero). Perhaps the idea of developing temporary "targets" for hidden nodes[7,10] might be useful in developing such preconditioners.

     For offline learning problems, many people still prefer to use "pattern learning," which works its way through an offline database one observation at a time, as if one were doing real-time learning. When there is a huge database of highly redundant patterns, this approach has often led to convergence in 20 to 30 "epochs" or passes through the database[11]; this is far fewer than the number of iterations required with a classical batch method. Nevertheless, my own experience[7] suggests that this approach will not work as well when the patterns in the database are nonredundant. Even with highly redundant databases, one could select a small, less redundant subset of the training set, and use batch learning on that subset (with an appropriate error measure), in order to get the weights into the right neighborhood; this could be useful as a time-saving preliminary to working through the entire training set.
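
     The subset-then-full-database tactic might look like the following sketch, in which the redundancy filter (a simple minimum-distance test) and the linear model are placeholder assumptions:

# Batch learning on a small, less redundant core of the database, followed by
# pattern learning (one observation at a time) over the full redundant set.
import numpy as np

def select_subset(X, min_dist=0.5):
    """Greedily keep only patterns far from those already kept."""
    keep = [0]
    for i in range(1, len(X)):
        if all(np.linalg.norm(X[i] - X[j]) > min_dist for j in keep):
            keep.append(i)
    return np.array(keep)

def batch_step(w, X, Y, lr=0.1):
    g = sum((y - w @ x) * x for x, y in zip(X, Y)) / len(X)
    return w + lr * g                        # one step on the summed gradient

def pattern_epoch(w, X, Y, lr=0.01):
    for x, y in zip(X, Y):                   # one pass, one pattern at a time
        w = w + lr * (y - w @ x) * x
    return w

rng = np.random.default_rng(0)
X = np.repeat(rng.standard_normal((30, 4)), 20, axis=0)   # highly redundant
Y = X @ np.array([1.0, -1.0, 0.5, 0.0])
idx = select_subset(X)                       # small, less redundant core
w = np.zeros(4)
for _ in range(200):                         # batch learning on the subset
    w = batch_step(w, X[idx], Y[idx])        # puts w in the right neighborhood
for _ in range(5):                           # then a few epochs on everything
    w = pattern_epoch(w, X, Y)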

     When programming effort is more important than computer time, a simple steepest descent approach (in batch or pattern learning) may still be the best way to go. Back in 1987, for the official DOE/EIA model of the natural gas industry, I used a very simple adaptive-learning-rate method (given in my other WCNN93 paper, and in [1]) which provides a high degree of flexibility, and avoids some of the problems in those methods which require that you know the actual change in total error from epoch to epoch.
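
     A hedged sketch of one simple rule in this family (not necessarily the exact rule of the WCNN93 paper or [1]) appears below; it grows the learning rate while successive gradients agree in direction and shrinks it when they oppose each other, using only the gradients themselves rather than the change in total error from epoch to epoch:

# An adaptive learning rate driven by the agreement between successive
# gradients -- a hypothetical illustration, not the specific rule of [1].
import numpy as np

def adaptive_descent(grad_fn, w, lr=0.01, up=1.1, down=0.5, epochs=100):
    g_prev = np.zeros_like(w)
    for _ in range(epochs):
        g = grad_fn(w)
        # Positive dot product: the last step did not overshoot, so accelerate;
        # negative: we reversed direction, so cut the rate back sharply.
        lr = lr * up if np.dot(g, g_prev) > 0 else lr * down
        w = w - lr * g
        g_prev = g
    return w

# usage on a toy quadratic error surface E(w) = w1**2 + 10*w2**2
w_final = adaptive_descent(lambda w: 2.0 * np.array([1.0, 10.0]) * w,
                           np.array([1.0, 1.0]))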

     Studies of batch learning have also made clear the importance of adapting the connectivity of a network, and not just its weights. Dozens upon dozens of methods have been published for growing or pruning networks. For example, Fahlman has reported that his method for growing a network yields much faster learning than his earlier adaptive-learning-rate scheme by itself. Pruning techniques tend to be based on a variety of complexity-measure concepts with deep roots in statistics and epistemology (e.g., see chapter 10 of [1]). It is now well-known to all competent practitioners that generalization (and, in most cases, learning speed) improves dramatically when networks are "pruned down" to the point where there are far fewer weights in the network than there are target values in the training set. There is a great need at present for more consolidation and synthesis in this literature, which is scattered all over, from theoretical journals through to conference papers in chemical engineering.
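
     As a minimal illustration of the pruning idea (using a plain magnitude criterion, which is only one of the many published complexity measures, and not any specific method cited above):

# Zero out all but the largest-magnitude weights across a network, leaving far
# fewer free parameters than training targets; typical usage is to train to
# convergence, prune, and then retrain the surviving weights.
import numpy as np

def prune_smallest(weights, keep_fraction=0.2):
    flat = np.concatenate([w.ravel() for w in weights])
    cutoff = np.quantile(np.abs(flat), 1.0 - keep_fraction)
    return [np.where(np.abs(w) >= cutoff, w, 0.0) for w in weights]

# usage: prune a two-layer network down to 20% of its weights
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((8, 4)), rng.standard_normal((1, 8))
W1p, W2p = prune_smallest([W1, W2])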

     For very large networks, I also suggested the use of guided random search techniques to generate experimental neurons or even groups of neurons, in addition to trying out new connections between neurons[12]; however, some of the ideas in that paper will only become relevant when people begin working with larger systems.

 

     A crucial weakness in the current literature is the lack of attention to simultaneous-recurrent networks (SRNs) as a way of solving difficult supervised learning problems. For practical supervised learning applications, an SRN may be defined as a network whose output at any time t, Ŷ(t), is the final equilibrium value, y(∞), which results from iterating over the equation y(n+1)=f(X(t),y(n),W), where f usually represents a simple feedforward neural network.

     MLPs may be substantially better than local learning networks at approximating functions of many variables, but they are substantially less powerful than SRNs at this task. For example, Houillon and Caron[13] recently tried to train an MLP to learn the mapping from an image of a cluttered floor to the action required to navigate across that floor; when the MLP failed to learn the mapping (as discussed in their talk), they found an alternative nonneural design. Yet the calculations in the design which worked could easily have been learned by an SRN! Likewise, in his classic work on perceptrons, Minsky showed how SRNs could easily represent mappings (like his "connectivity" map) which MLPs cannot. These maps are not smooth.

     SRNs have probably been underutilized because of the folklore which says that Hebbian learning is the right way to adapt them, that they are very hard to understand and implement, and that the only advantage they offer is associative memory; however, the previous paragraph shows that they are not so difficult after all, and that they offer far more than just associative memory. One can also use backpropagation to adapt such networks relatively easily, by iterating over the equation F_y(n+1) = F_Ŷ + F_f_y(X, Ŷ, W, F_y(n)), where F_Ŷ is the gradient of error with respect to the network output, where F_f represents a subroutine or function which "backpropagates through" the feedforward network f, and where F_y(0) may be initialized to zero (or to some other convenient value). The derivatives of error with respect to the weights result from backpropagating through f a final time, using the same backpropagation subroutine, using F_y(∞) as an input to the function. See chapter 3 of [1] for a more complete discussion of how to do this (with a linear example to assist in debugging). I first applied this generalization of backpropagation in 1981, in a sensitivity analysis of a natural gas model used by DOE in developing recommendations to Congress for the deregulation of that industry; see [1,3] for more history and citations.
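
     To make this concrete, here is a minimal sketch of the forward relaxation and of the iterative backpropagation rule just given. The one-layer tanh core used for f, the sizes, and the squared-error measure are illustrative assumptions, not the specific example of [1]:

# A simultaneous-recurrent network y(n+1) = f(X, y(n), W) with
# f(X, y) = tanh(A X + B y + b), plus the adjoint iteration
# F_y(n+1) = F_Yhat + F_f_y(X, Yhat, W, F_y(n)) described in the text.
import numpy as np

rng = np.random.default_rng(0)
nx, ny = 3, 4                               # illustrative sizes
A = 0.3 * rng.standard_normal((ny, nx))     # input weights
B = 0.3 * rng.standard_normal((ny, ny))     # recurrent weights (kept small so
b = np.zeros(ny)                            # the forward iteration contracts)

def f(X, y):
    return np.tanh(A @ X + B @ y + b)       # one pass through the core network

def settle(X, n_iter=100):
    """Forward relaxation: iterate y(n+1) = f(X, y(n)) toward y(inf)."""
    y = np.zeros(ny)
    for _ in range(n_iter):
        y = f(X, y)
    return y

def srn_gradients(X, Y, n_iter=100):
    y_star = settle(X, n_iter)              # equilibrium output Yhat
    s = 1.0 - y_star ** 2                   # tanh'(z) at equilibrium, as y* = tanh(z)
    F_Yhat = -2.0 * (Y - y_star)            # dE/dYhat for E = (Y - Yhat)^2
    F_y = np.zeros(ny)                      # F_y(0) initialized to zero
    for _ in range(n_iter):
        F_y = F_Yhat + B.T @ (s * F_y)      # backpropagate F_y(n) through f
    g = s * F_y                             # one final backward pass through f
    return np.outer(g, X), np.outer(g, y_star), g    # dE/dA, dE/dB, dE/db

X, Y = rng.standard_normal(nx), rng.standard_normal(ny)
gA, gB, gb = srn_gradients(X, Y)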

     There are still many research issues in how best to use SRNs; for example, instead of minimizing E = (Y - Ŷ)², one might minimize E + q(y(N+1) - y(N))², where q is a "tension parameter," and where y(N) is the output of the network in the Nth iteration as it tries to settle down. (See [1] for details.) Adjusting parameters like q involves practical research challenges similar in flavor to those of adjusting learning rates in basic backpropagation.

     Better utilization of SRNs may be absolutely necessary to achieving truly brain-like capabilities in control and other applications. (See chapter 13 of [1], and see [2].)

 

                                  Real-Time Learning: The Need for New Approaches

 

     In order to replicate (or explain) true brain-like capabilities, we need to find some way to combine good generalization across many inputs with more rapid real-time learning. Nothing available today really does this. At the same time, we should not seek (or claim!) capabilities beyond what is physically possible. Even the human brain has serious limitations here. There is no human brain alive which has learned a perfectly generalized cause-and-effect understanding or model of everything it has ever experienced of the physical and social universe.

     Virtually all of the efforts to overcome this problem today are based on a paradigm which I would call weight-based learning. In weight-based learning, the network learns a new training pair X and Y by adjusting the same weights W which it uses when calculating its outputs as a function of its inputs. That is essentially all the learning that there is in this paradigm. In some designs, the network may also adjust some auxiliary quantities, like learning rates, but these quantities are not expected to represent substantive information about the relation between inputs and targets.

     In the next section, I will argue that we cannot begin to understand the brain, or to build truly efficient real-time learning systems, until we break out of the limits of the weight-based paradigm. This paper will describe a "new" paradigm, which I call syncretism [1,14], which could help us do that. In addition, it will mention a new type of neural network -- the dendritic field array -- consistent with the weight-based paradigm, which provides a Hebbian-like alternative to local and global networks.

     While encouraging these new possibilities, I do not mean to downplay the importance of further research into real-time learning using the weight-based approach. There are many applications which do not require the full power of syncretism. For example, White and Sofge have used a neurocontrol design, with a controller adapted by backpropagation, to adapt the steering of an F-15 in real time; they have reported that the system requires only two seconds to adapt to a new configuration such as a disabled wing. There has been a fair amount of practical success in efforts to develop faster real-time learning methods, inspired by knowledge of what works in the batch case, or by practical intuition and experience; however, there is room to do far better here, by seeking more consolidation and synthesis of the disparate research results, and by trying (at times) to emulate the rigor which exists in the batch case. Conventional weight-based designs might well be useful as components of more powerful designs based on syncretism; therefore, the new paradigm does not undermine their value.

 

                                    Syncretism: A New Paradigm for Neural Networks

 

     After one truly understands the idea of syncretism, the idea seems incredibly obvious. I suspect that dozens of people will claim -- falsely -- that they have already been exploiting the idea in their own unique adaptation schemes. Dozens of psychologists, biologists and AI people can claim -- quite accurately -- that they have been working with equivalent ideas for decades and decades; however, those ideas have yet to be translated into neural network designs which really capture the power of this approach.

     To explain this idea in a practical way, I would like to start out with a kind of hypothetical thought experiment.  Suppose that we were trying to build the most perfect possible real-time learning system for supervised learning, without any limits on computational cost or memory. Suppose that the system would be given a training pair, X(t) and Y(t), at each time t, which would then become totally inaccessible to the system after time t. What would such an ideal system look like?

     The answer to this question is very straightforward. Using its infinite memory, the system would simply store each training pair, X(t) and Y(t), in its memory. Using its unlimited computing power, the system would then use batch learning to learn the map from X to Y. (If you want to add some ability to forget obsolete memories, I would argue that you should still use the same general structure, but should allow for time-lagged recurrent structures or networks instead of simple static networks; however, such possibilities go beyond supervised learning as such.) In short, the system would consist of two subsystems: (1) a dumb but complete memory; (2) a global learning system, which learns from the memory subsystem, in batch mode, rather than from current experience.

     A practical real-time learning system may simply be based on approximating this ideal system. This is the idea of syncretism.

     An obvious way to implement syncretism is to build a supervised learning system composed of two subnetworks. One network would be a dumb associative memory (perhaps with simple measures of confidence, like the radii used in radial basis functions). A second would be a global network, used for generalization and prediction. The global network would be adapted using weight-based adaptation, but the training pairs would come at some times from external reality, and at other times from the associative memory. Clearly, there are dozens of ways one might implement this kind of system; it is premature to recommend one or another.
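
     One minimal sketch of such a two-subnetwork arrangement, under purely illustrative assumptions (a verbatim memory, a linear global model, and a fixed probability of replaying a stored pair), follows; it is only one of the dozens of possible implementations, not a recommendation:

# A "dumb" memory that stores every (X, Y) pair verbatim, plus a global
# network adapted by ordinary weight-based learning; training pairs come
# sometimes from external reality and sometimes from the associative memory.
import numpy as np

rng = np.random.default_rng(0)

class SyncretistLearner:
    def __init__(self, n_in, n_out, lr=0.05, replay_prob=0.5):
        self.memory = []                       # subsystem 1: complete memory
        self.W = np.zeros((n_out, n_in))       # subsystem 2: global network
        self.lr, self.replay_prob = lr, replay_prob

    def predict(self, x):
        return self.W @ x

    def observe(self, x, y):
        self.memory.append((x, y))             # store the pair verbatim
        self._train(x, y)                      # learn from current experience
        if rng.random() < self.replay_prob:    # ...and from a replayed memory
            xm, ym = self.memory[rng.integers(len(self.memory))]
            self._train(xm, ym)

    def _train(self, x, y):
        err = y - self.predict(x)              # ordinary weight-based update
        self.W += self.lr * np.outer(err, x)

learner = SyncretistLearner(n_in=3, n_out=1)
for _ in range(500):                           # a stream of real-time pairs
    x = rng.standard_normal(3)
    learner.observe(x, np.array([x @ [1.0, -2.0, 0.5]]))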

     Long ago, in [14], I proposed a somewhat more sophisticated arrangement, also based on syncretism. The idea was to make the output of the overall system equal to something like a weighted sum of the output of the associative memory and the output of the global subnetwork. The weight at any time would depend on the level of confidence expressed by the associative memory. Adaptation of the global subnetwork would depend on continually-updated evaluations of how far off the global network is for each memory prototype or cluster. The details of the design in [14] are far too specialized to be worth repeating here; however, they may have some suggestive value.
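
     The weighted-sum arrangement might look something like the following sketch, in which the confidence measure (a nearest-neighbor distance) and the blending rule are placeholder assumptions rather than the specific design of [14]:

# Blend the associative memory's recall with the global network's prediction,
# weighting by the memory's confidence near the current input.
import numpy as np

def combined_output(x, memory, global_net, radius=1.0):
    xs = np.array([xm for xm, _ in memory])
    i = int(np.argmin(np.linalg.norm(xs - x, axis=1)))
    c = np.exp(-np.linalg.norm(xs[i] - x) ** 2 / radius ** 2)  # memory confidence
    y_mem = memory[i][1]                     # raw associative recall
    y_glob = global_net(x)                   # global generalization
    return c * y_mem + (1.0 - c) * y_glob    # confidence-weighted sum

# usage with a trivial two-item memory and a constant global model
memory = [(np.zeros(2), 1.0), (np.ones(2), -1.0)]
print(combined_output(np.array([0.1, 0.0]), memory, lambda x: 0.0))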

     The design in [14] was motivated by an effort to translate certain ideas from Freud concerning the "ego" and the "id." Freud described the "id" as a primitive system, which predicts the future based on simple association. For example, when a young child is hit by a person wearing a blue shirt, he may end up fearing all people who wear blue shirts for years and years thereafter. Clearly, this example fits the kind of forecasting which "local networks" or "heteroassociative memories" perform. Freud argued that the "ego" -- which takes a long time to develop -- provides a more coherent, global model of cause-and-effect relations. He argued that people's actual expectations are determined by a combination of the ego and the id working together (as in [14]). He argued that the influence of a traumatic childhood memory could be reduced by reliving the memory, so that it could be assimilated into the ego; this corresponds to the idea of adapting the ego or global model in response to training examples presented by the id or associative memory. The design in [14] did make the influence of the id at any time depend on the degree of assimilation of the memory prototype, as in Freud's original idea.

     Designs based on the kinds of arrangements discussed above may be described as "two-level syncretism."

     Freud's most basic theories in psychology were in fact based on his effort to construct a neural-network theory of the dynamics of the mind. In translating the ideas into mathematics, I have generally relied on semi-popular accounts of his ideas (like [15]); however, Pribram's account [16] is probably more complete.

 

     Biologists, psychologists and experts in AI might well suggest other ways of implementing syncretism which are also worth considering. For example, instead of a two-level system, one might imagine something like a hierarchy or heterarchy of progressively greater generality, rising up from dumb memory to global theories. The intermediate stages of such a hierarchy might include local-domain MLPs, analogous to the Jordan-Jacobs design[17], with local confidence measures built into the design. Designs of these general sorts may be called "multi-level syncretism."

     It is not clear how to make such multi-level systems clean and brain-like; however, the complex system of columnar modules or "dendrons" in the cerebral cortex (see Eccles in [18]) might well be linked to such an arrangement. Or maybe not. Our knowledge is very limited at present.

     It is conceivable that the psychological phenomenon of generalization -- going from a specific set of memories to a local generalization -- is based on something more like batch learning than on weight-based "real-time learning." It is also conceivable that deep sleep is the time when such functions are performed most intensely. The phenomenon of "consolidation" -- well-known in biology -- clearly plays a major role here. Artificial-seeming adaptation schemes, like Touretzky's method for processing multiple observations in parallel on a WARP supercomputer, might well be closer to the brain than we would expect.

     Clearly there is an enormous amount of research to be done here. It is also clear that our community has not even begun to address these vital issues.

 

                                                            Dendritic Field Arrays

 

     Based on ideas like those above, I would argue that global networks (like MLPs and SRNs adapted by gradient-based methods) and local networks must both play a crucial role as sub-systems in the brain. (Some people believe that backpropagation could not exist in the brain in any form; however, see papers by myself and by Hameroff in [2,3,18] for alternative views.)

     Pribram has argued that image processing in the cerebral cortex is based on a radically different kind of network, which he would call a "dendritic field array." He has argued that dendritic field arrays consist of a set of neurons, linked together in their processing. The first stage of processing -- the dendritic field stage -- performs the most complex processing; it is like a straightforward (though sparse) linear Hopfield network, with a huge number of elements, one for each point on or around any of the dendrites in the system. The final or nonlinear stage, according to Pribram, is little more than a readout stage; at that stage, each axon hillock calculates a linear combination of the many field values, and then applies a simple squashing function.

     Pribram has cited substantial (if indirect) empirical evidence to support his theory. Simmons of Brown has recently performed a path-breaking analysis of sonar processing in the bat[2], in which he claims to have direct evidence, for the first time, of true dendritic field processing.

     If Pribram were right, and dendritic field processing does play at least some role in the brain, what would its significance be from a computational point of view? A wide variety of quantum-mechanical interpretations have been suggested here[18]; however, from a computational point of view, there is a much simpler possibility[18]. One could imagine that the dendritic weights (Cij) and readout weights (C'ij) are adapted by something like:

     [Equations 1 and 2, the Hebbian-style update rules for Cij and C'ij, are not reproduced legibly here; see [18].]

where a(t) is a global attention factor, and where the local Hopfield convergence has the effect of yielding the outputs:

     [Equations 3 and 4, giving the outputs after local Hopfield convergence, are not reproduced legibly here; see [18].]
Because of page charges, I cannot elaborate on the details here; see [18].

     The net effect of this would be as follows: instead of a slow-learning nonlinear network (like an MLP) or a piecewise-flat network (the usual local networks), this yields a roughly piecewise-linear network where learning approaches 100% efficiency in real time within each local domain. Equations 1 and 2 are clearly very Hebbian in appearance, and equation 3 should be very familiar to statisticians. This kind of network would be far more efficient in the brain (because of its unique hardware) than in conventional computers; however, it is possible that it could be implemented efficiently in optical or VLSI or biomolecular hardware as well.
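
     As a purely computational illustration of that behavior -- and emphatically not a reconstruction of equations 1-4 -- one can sketch a piecewise-linear network whose local domains each learn by recursive least squares, which is statistically efficient within its domain:

# Each local domain (a fixed center, an assumption for illustration) owns a
# linear model adapted by recursive least squares, so learning within a domain
# approaches full efficiency in real time; the overall map is roughly
# piecewise-linear rather than piecewise-flat.
import numpy as np

class PiecewiseLinearRLS:
    def __init__(self, centers, n_in, lam=1e-2):
        self.centers = centers                  # fixed domain centers
        k = len(centers)
        self.W = np.zeros((k, n_in + 1))        # one affine model per domain
        self.P = np.stack([np.eye(n_in + 1) / lam] * k)   # RLS inverse covariances

    def _domain(self, x):
        return int(np.argmin(np.linalg.norm(self.centers - x, axis=1)))

    def predict(self, x):
        return self.W[self._domain(x)] @ np.append(x, 1.0)

    def learn(self, x, y):
        i = self._domain(x)
        z = np.append(x, 1.0)
        Pz = self.P[i] @ z
        gain = Pz / (1.0 + z @ Pz)              # standard RLS gain
        self.W[i] += gain * (y - self.W[i] @ z) # exact one-pattern update
        self.P[i] -= np.outer(gain, Pz)

# usage: learn a smooth map in real time, one observation at a time
rng = np.random.default_rng(0)
net = PiecewiseLinearRLS(rng.standard_normal((5, 2)), n_in=2)
for _ in range(200):
    x = rng.standard_normal(2)
    net.learn(x, float(np.sin(x[0]) + x[1]))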

 

                                                                     References

 

  1. D. White & D. Sofge, eds., Handbook of Intelligent Control, Van Nostrand, 1992.

  2. P. G. Madhavan et al., eds., Neuroscience and Neural Networks: Efforts Towards an Empirical Synthesis, 1993.

  3. P. Werbos, Beyond Regression, Harvard Ph.D. thesis (1974), reprinted in P. Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks to Political Forecasting, Wiley, 1993.

  4. Y. H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley, 1989.

  5. I. Aleksander & J. Taylor, eds., Artificial Neural Networks 2, North-Holland, 1992.

  6. P. Werbos, Neurocontrol and related techniques, in A. Maren, ed., Handbook of Neural Computing Applications, Academic, 1990.

  7. P. Werbos, Backpropagation: past and future, in ICNN Proceedings, IEEE, 1988.

  8. Watrous & Shastry, in ICNN Proceedings, IEEE, 1987.

  9. W. Miller, R. Sutton & P. Werbos, eds., Neural Networks for Control, MIT Press, 1990.

  10. R. Rohwer, The moving targets training algorithm, in NIPS Proceedings, Morgan Kaufmann, 1990.

  11. I. Guyon et al., Comparing different neural network architectures for classifying handwritten digits, in IJCNN Proceedings, IEEE, 1989.

  12. P. Werbos, Learning how the world works, in Proc. Systems, Man, Cybernetics, IEEE, 1987.

  13. P. Houillon & A. Caron, Planar robot control in cluttered space by artificial neural network, in Mathematical Modelling and Scientific Computing (8th International Conference), St. Louis: Principia Scientia, 1993, pp. 498-503.

  14. P. Werbos, Advanced forecasting methods for global crisis warning and models of intelligence, General Systems Yearbook, 1977 issue.

  15. D. Yankelovich & W. Barrett, Ego and Instinct: The Psychoanalytic View of Human Nature -- Revised, Vintage, 1971.

  16. K. Pribram & M. Gill, Freud's Project Reassessed, Basic Books, 1976.

  17. R. Jacobs et al., Adaptive mixtures of local experts, Neural Computation, 3(1), 1991.

  18. K. Pribram, ed., Rethinking Neural Networks: Quantum Fields and Biological Evidence, Washington D.C.: INNS Press, 1993.