Supervised Learning: Can It Escape Its Local Minimum?

Paul J. Werbos
Room 1151, National Science Foundation*

*This paper is intended only to provoke thought. It should not be construed as any kind of policy statement either of NSF or of the author. As work done on government time, it is in the public domain subject to proper citation and retention of this caveat. This paper would have never been written without some strong encouragement from Stephen Grossberg, who nevertheless bears none of the blame for the specific views expressed here.

Abstract
Tremendous progress and new research
horizons have opened up in developing artificial neural network (ANN) systems
which use supervised learning as an elementary building block in
performing more brain-like, generic tasks such as control or planning or
forecasting across time[1,2]. However, many people believe that research into
supervised learning itself is stuck in a kind of local minimum or plateau. Limited capabilities in supervised learning
have created a serious dilemma, forcing us to choose between fast real-time learning and good generalization with many inputs, without allowing us to have both together.
This paper will first review and evaluate
the supervised learning field as a whole, highlighting open questions and
research opportunities. Then it will argue that there are at least two
possibilities for more breakthrough kinds of research which could seriously
alleviate (though not eliminate) the speed-versus-generalization dilemma.
Backpropagation itself, in its original form from 1974 [3], was based on a
translation of ideas from Freud into well-defined engineering mathematics. The
possibilities discussed here would involve a similar translation of another
idea from Freud and an idea from Karl Pribram. Further insights from psychology
or AI might also be of use in guiding the engineering implementation of these
ideas.
Overview of Supervised Learning Today
Supervised learning is one of the
best-defined, most universally understood ideas in the neural network field. We
all know that "supervised learning" includes all methods to train
ANNs to approximate a function Y=f(X), where X
is a vector of "inputs" and Y
is a vector of "targets," and where there may be noise in the
training set. Nevertheless, some people use the term "unsupervised
learning" to describe forms of associative memory ("heteroassociative
memory") used to perform the exact same task. This is simply a semantic
error, which should be discouraged.
Years ago, the field of supervised
learning was dominated by three methods: (1) basic backpropagation, used to
adapt feedforward multilayer perceptrons (MLPs) by steepest descent; (2)
heteroassociative memory schemes, based on Hebbian learning approaches, which originated in the Grossberg school of thought; (3) correlation-based Hebbian
schemes, which require the inputs to be decorrelated in order to get correct
results even in the linear case. The third method has nearly disappeared for
two reasons: (1) the limit to the linear case was a problem; (2) the
requirement for decorrelated inputs tends to reduce fault tolerance, and is
strongly at odds with biological studies showing redundancy or correlation in
input representations (e.g. see Simmons in [2]).
In practical engineering applications
today, supervised learning tasks are mainly performed by one of three types of
design: (1) feedforward MLPs adapted by gradient-based methods, usually a bit
more sophisticated than raw steepest descent; (2) feedforward "local
learning networks," such as radial basis functions (RBF) and CMAC, the two
methods most popular now in the
Kanerva's
sparse distributed memory[5]. In one application (see Ritter in [5]), a
"neural gas" local network -- an
________________________________
*This
paper is intended only to provoke thought. It should not be construed as
any kind of policy statement either of NSF or of the author. As work done on
government time, it is in the public domain subject to proper citation
and retention of this caveat. This paper would have never been written without
some strong encouragement from Stephen Grossberg, who nevertheless bears none
of the blame for the specific views expressed here.
outgrowth
of the Kohonen school of thought -- appears to yield much higher accuracy than
CMAC, at the price of drastically reduced real-time learning ability.
Kohonen-like local networks appear to be far more popular in Europe[5] than in
the
The conventional wisdom now is that MLPs
yield better generalization and a more parsimonious representation of functions
with many inputs, while the local networks are much better at learning quickly
in real time, when the number of inputs is limited. Andrew Barron has proven new theorems about approximation errors with MLPs versus local (or basis-function) networks; those results tend to support this conventional wisdom.
A major industrial user of ANNs recently
put it this way: "Dozens of people have come to me claiming to have a way
to bypass all the problems of slow learning in gradient-based systems. Every
single one of them turned out to be peddling some kind of lookup table."
Conceptually, at least, the local networks in use today can indeed be
understood as smoothed-out, sophisticated versions of a lookup table.
The local minimum problem is another
important issue in supervised learning, but it does not pose problems for these
network designs as such. The details of that issue are beyond the scope of this
paper.
Lessons From Batch/Offline Uses of Supervised Learning
The
choice between fast real-time learning and good generalization with many inputs
is not a big problem in many applications. In many applications, good
generalization is essential but real-time learning is not. Even overnight
convergence on a dedicated machine is often far cheaper -- in dollars and in
labor -- than the alternatives. Even in real-time "adaptive control,"
one can train an ANN to "learn offline to be adaptive online"[1,6].
In other words, one can use offline training to adapt a recurrent
network. The resulting controller may not actually learn in real time,
but it could "adapt" to changes in the parameters of its environment through
the action of the recurrent neurons[1,6]. This kind of capability requires
the use of time-lagged recurrent networks (TLRNs[1]), which are not
within the scope of supervised learning as such.
When batch or offline learning is
acceptable, there are a wide range of tools available to speed up the
adaptation of MLPs or of other global network designs. There are powerful
accelerator boards and computers exhibited here at this conference. Also,
error-minimization in batch mode is a special case of well-studied,
well-understood problems in operations research. Even today, one reads papers
reinventing the idea of using old conjugate-gradient methods, like
Fletcher-Reeves and Polak-Ribiere (which in turn are more precise versions of
"momentum"); however, years ago[7], I reported -- as one might expect
-- much better results from using the Shanno conjugate gradient method, which
numerical analysts regarded as the standard method for very large, truly
nonlinear problems. With smaller
networks, the best reported results came from using "BFGS"[8] --
where the "S" stands for Shanno.
Shanno himself has suggested[9] that
marginal further improvements could be had by using the newer Nocedal method,
but that much larger improvements may be possible with new, modified
versions of Karmarkar methods. This idea has been explored in a preliminary way
by myself, by DeClaris and by Trafalis of Oklahoma State. It now seems that the
Karmarkar approach maps into a kind of "attention" scheme, which
mimics the tendency of biological systems to pay more attention to different patterns
at different times (analogous to weighted regression in
statistics). This appears to be a promising approach, but a lot more research
would be needed to pin down its potential.
Shanno has also proposed that we work
harder to develop "preconditioners" for use with conjugate gradient methods. Unfortunately, I am aware of only one effort along those lines. That effort failed, in part because it focused on toy problems (i.e. few patterns,
such that error could be reduced to zero). Perhaps the idea of developing
temporary "targets" for hidden nodes[7,10] might be useful in
developing such preconditioners.
For offline learning problems, many people
still prefer to use "pattern learning," which works its way through
an offline database one observation at a time, as if one were doing
real-time learning. When there is a huge database of highly redundant patterns,
this approach has often led to convergence in 20 to 30 "epochs" or
passes through the database [11]; this is far less than the number of
iterations required with a classical batch method. Nevertheless, my own
experience[7] suggests that that approach will not work as well when the
patterns in the database are nonredundant. Even with highly redundant
databases, one could select a small, less redundant subset of the
training set, and use batch learning on that subset (with an appropriate
error measure), in order to get the weights into the right neighborhood; this
could be useful as a time-saving preliminary to working through the
entire training set.
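One simple way to pick such a subset is sketched below; the greedy farthest-point selection rule used here is my own illustration, not a procedure taken from [7].

    # Illustrative sketch: pick a small, less-redundant subset of a training
    # set by greedy farthest-point selection, to be used as a batch-learning
    # "warm start".  The selection rule is an assumption, not a cited method.
    import numpy as np

    def select_diverse_subset(X, k):
        """Return indices of k mutually dissimilar input patterns."""
        chosen = [0]                              # start from an arbitrary pattern
        d = np.linalg.norm(X - X[0], axis=1)      # distance of each pattern to the subset
        for _ in range(k - 1):
            nxt = int(np.argmax(d))               # farthest pattern from current subset
            chosen.append(nxt)
            d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
        return np.array(chosen)

    # Usage sketch: batch-train on the subset first, then refine on the full
    # database (batch_train and pattern_learn are hypothetical trainers).
    # idx = select_diverse_subset(X, 50)
    # w = batch_train(X[idx], Y[idx], w0)
    # w = pattern_learn(X, Y, w)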
When programming effort is more
important than computer time, a simple steepest descent approach (in batch or
pattern learning) may still be the best way to go. Back in 1987, for the
official DOE/EIA model of the natural gas industry, I used a very simple
adaptive-learning-rate method (given in my other WCNN93 paper, and in [1])
which provides a high degree of flexibility, and avoids some of the problems in
those methods which require that you know the actual change in total error from
epoch to epoch.
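The exact rule from [1] and from the other WCNN93 paper is not reproduced here; the sketch below conveys only the general flavor of such schemes, which need the gradient but not the change in total error: grow the step when successive gradients agree in direction, and back off when they conflict.

    # Illustrative adaptive-learning-rate steepest descent (generic flavor only;
    # not the specific update rule of [1]).
    import numpy as np

    def adaptive_descent(grad_fn, w, lr=0.01, epochs=100, grow=1.2, shrink=0.5):
        g_prev = grad_fn(w)
        for _ in range(epochs):
            w = w - lr * g_prev
            g = grad_fn(w)
            # If successive gradients agree in direction, take bigger steps;
            # if they oppose each other, the last step was too large, so back off.
            agreement = np.dot(g, g_prev) / (np.dot(g_prev, g_prev) + 1e-12)
            lr *= grow if agreement > 0 else shrink
            g_prev = g
        return w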
Studies of batch learning have also made
clear the importance of adapting the connectivity of a network, and not
just its weights. Dozens upon dozens of methods have been published for growing
or pruning networks. For example, Fahlman has reported that his method for
growing a network yields much faster learning than his earlier
adaptive-learning-rate scheme by itself. Pruning techniques tend to be based on
a variety of complexity-measure concepts with deep roots in statistics and
epistemology (e.g., see chapter 10 of [1]). It is now well-known to all
competent practitioners that generalization (and, in most cases, learning
speed) improves dramatically when networks are "pruned down," to the
point where there are far fewer weights in the network than there are target
numbers in the training set. There is a great need at present for more
consolidation and synthesis in this literature, which is scattered all over,
from theoretical journals through to conference papers in chemical engineering.
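As a crude illustration of what "pruning down" means operationally, the sketch below repeatedly deletes the smallest-magnitude weights and retrains; real pruning methods rest on far more principled complexity measures (see chapter 10 of [1]), so this is only a placeholder for that family of techniques.

    # Crude magnitude-based pruning sketch; not a substitute for the
    # complexity-measure-based methods discussed in the text.
    import numpy as np

    def prune_smallest(w, mask, fraction=0.1):
        """Zero out the smallest surviving weights and mark them as frozen."""
        alive = np.where(mask)[0]
        k = max(1, int(fraction * alive.size))
        drop = alive[np.argsort(np.abs(w[alive]))[:k]]
        mask[drop] = False
        w[drop] = 0.0
        return w, mask

    # Usage sketch (retrain_fn is a hypothetical trainer that respects the mask):
    # mask = np.ones_like(w, dtype=bool)
    # for _ in range(rounds):
    #     w, mask = prune_smallest(w, mask)
    #     w = retrain_fn(w, mask)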
For very large networks, I also
suggested the use of guided random search techniques to generate experimental
neurons or even groups of neurons, in addition to trying out new connections
between neurons[12]; however, some of the ideas in that paper will only become
relevant when people begin working with larger systems.
A crucial weakness in the current
literature is the lack of attention to simultaneous-recurrent networks
(SRNs) as a way of solving difficult supervised learning problems. For
practical supervised learning applications, an SRN may be defined as a network
whose output at any time, ŷ(t), is the final equilibrium value, y(∞), which results from iterating over the equation y(n+1) = f(X(t), y(n), W), where f usually represents a simple feedforward neural network.
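As a minimal illustration of this definition, the sketch below iterates y(n+1) = f(X, y(n), W) until the output settles; the particular one-layer tanh form chosen for f is an assumption made purely for illustration.

    # Minimal sketch of an SRN forward pass: iterate y(n+1) = f(X, y(n), W)
    # until the output settles.  The form of f is an illustrative assumption.
    import numpy as np

    def srn_forward(X, W, max_iters=50, tol=1e-6):
        A, B, b = W                      # input weights, recurrent weights, bias
        y = np.zeros(B.shape[0])         # y(0)
        for _ in range(max_iters):
            y_next = np.tanh(A @ X + B @ y + b)
            if np.max(np.abs(y_next - y)) < tol:
                return y_next            # y(infinity), the network's output
            y = y_next
        return y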
MLPs may be substantially better than
local learning networks in approximating functions of many variables, but they
are substantially less powerful than SRNs in performing this function.
For example, Houillon and Caron[13] recently tried to train an MLP to learn the
mapping from an image of a cluttered floor to the action required to navigate
through this floor; when the MLP failed to learn the mapping (as discussed in
their talk), they found an alternative nonneural design. Yet the calculations
in the design which worked could have easily been learned by an SRN!
Likewise, in his classic work on perceptrons, Minsky showed how SRNs could
easily represent mappings (like his "connectivity" map) which MLPs
cannot. These maps are not smooth.
SRNs have probably been underutilized
because of the folklore which says that Hebbian learning is the right way to
adapt them, that they are very hard to understand and implement, and that the
only advantage they offer is associative memory; however, the previous
paragraph shows that they are not so difficult after all, and that they offer
far more than just associative memory. One can also use backpropagation to
adapt such networks relatively easily, by iterating over the equation F_y(n+1) = F_ŷ + F_f_y(X, ŷ, W, F_y(n)), where F_ŷ is the gradient of error with respect to the network output, where F_f represents a subroutine or function which "backpropagates through" the feedforward network f, and where F_y(0) may be initialized to zero (or to some other convenient value). The derivatives of error with respect to the weights result from backpropagating through f a final time, using the same backpropagation subroutine, using F_y(∞)
as an input to the function. See chapter 3 of [1] for a more complete
discussion of how to do this (with a linear example to assist in debugging). I
first applied this generalization of backpropagation in 1981, in a sensitivity
analysis of a natural gas model used by DOE in developing recommendations to
Congress for the deregulation of that industry; see [1,3] for more history and
citations.
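A matching sketch of the backward pass, following the recurrence above for the same toy form of f and a quadratic error measure (both illustrative assumptions of mine), might look like this; y_star is the equilibrium output returned by the earlier forward sketch.

    # Sketch of backpropagation through a settled SRN, following the recurrence
    # F_y(n+1) = F_yhat + F_f_y(X, y(inf), W, F_y(n)).  The quadratic error and
    # tanh form of f are illustrative assumptions.
    import numpy as np

    def srn_backward(X, Y_target, W, y_star, max_iters=50, tol=1e-6):
        A, B, b = W
        F_yhat = y_star - Y_target            # gradient of E = 0.5*(Y - yhat)^2
        s = 1.0 - y_star ** 2                 # tanh'(.) evaluated at equilibrium
        F_y = np.zeros_like(y_star)           # F_y(0)
        for _ in range(max_iters):
            F_y_next = F_yhat + B.T @ (F_y * s)   # backpropagate through f w.r.t. y
            if np.max(np.abs(F_y_next - F_y)) < tol:
                F_y = F_y_next
                break
            F_y = F_y_next
        # One final backpropagation through f gives the weight derivatives.
        delta = F_y * s
        grad_A = np.outer(delta, X)
        grad_B = np.outer(delta, y_star)
        grad_b = delta
        return grad_A, grad_B, grad_b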
There are still many research issues in
how best to use SRNs; for example, instead of minimizing E = (Y - ŷ)², one might minimize E + q(y(N+1) - y(N))²,
where q is a "tension parameter," and
where y(N) is the output of the network in the Nth iteration as it
tries to settle down. (See [1] for details.) Adjusting parameters like q
involves practical research challenges similar in flavor to those of adjusting
learning rates in basic backpropagation.
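In code, the modified objective is just the ordinary error plus a penalty on how much the output is still moving between iterations N and N+1; choosing q (and N) is exactly the kind of tuning challenge mentioned above.

    # Illustrative "tension" objective for an SRN that is still settling.
    # Y_target, y_N and y_N1 are numpy arrays (targets, Nth and (N+1)th iterates).
    def tension_error(Y_target, y_N, y_N1, q):
        E = float(((Y_target - y_N1) ** 2).sum())        # ordinary error
        tension = float(((y_N1 - y_N) ** 2).sum())       # settling penalty
        return E + q * tension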
Better utilization of SRNs may be absolutely
necessary to achieve truly brain-like capabilities in control and other applications. (See chapter 13 of [1], and see [2].)
Real-Time Learning: The Need for New Approaches
In order to replicate (or explain) true
brain-like capabilities, we need to find some way to combine good
generalization with many inputs together with more rapid real-time
learning. Nothing available today really does this. At the same time, we should
not seek (or claim!) capabilities beyond what is physically possible. Even the
human brain has serious limitations here. There is no human brain alive which
has learned a perfectly generalized cause-and-effect understanding or model of
everything it has ever experienced of the physical and social universe.
Virtually all of the efforts to overcome
this problem today are based on a paradigm which I would call weight-based
learning. In weight-based learning, the network learns a new training pair X and Y by adjusting the same weights W which it uses when
calculating its outputs as a function of its inputs. That is essentially all
the learning that there is in this paradigm. In some designs, the network may
also adjust some auxiliary quantities, like learning rates, but these
quantities are not expected to represent substantive information about the
relation between inputs and targets.
In the next section, I will argue that we
cannot begin to understand the brain, or to build truly efficient real-time
learning systems, until we break out of the limits of the weight-based
paradigm. This paper will describe a "new" paradigm, which I call
syncretism [1,14], which could help us do that. In addition, it will mention a
new type of neural network -- the dendritic field array -- consistent with the
weight-based paradigm, which provides a Hebbian-like alternative to local and
global networks.
While encouraging these new possibilities,
I do not mean to downplay the importance of further research into real-time
learning using the weight-based approach. There are many applications which do
not require the full power of syncretism. For example, White and Sofge have
used a neurocontrol design, using a controller adapted by backpropagation, to
adapt the steering of an F-15 in real time; they have reported that the system
requires only two seconds to adapt to a new configuration such as a disabled
wing. There has been a fair amount of practical success in efforts to develop
faster real-time learning methods, inspired by knowledge of what works in the
batch case, or by practical intuition and experience; however, there is room to
do far better here, by seeking more consolidation and synthesis of the
disparate research results, and by trying (at times) to emulate the rigor which
exists in the batch case. Conventional weight-based designs might well be
useful as components of more powerful designs based on syncretism;
therefore, the new paradigm does not undermine their value.
Syncretism: A New Paradigm for Neural Networks
After one truly understands the idea of
syncretism, the idea seems incredibly obvious. I suspect that dozens of people
will claim -- falsely -- that
they have already been exploiting the idea in their own unique adaptation
schemes. Dozens of psychologists, biologists and AI people can claim -- quite
accurately -- that they have been
working with equivalent ideas for decades and decades; however, those ideas
have yet to be translated into neural network designs which really
capture the power of this approach.
To explain this idea in a practical way, I
would like to start out with a kind of hypothetical thought experiment. Suppose that we were trying to build the most
perfect possible real-time learning system for supervised learning, without
any limits on computational cost or memory. Suppose that the system would
be given a training pair, X(t)
and Y(t), at each time t,
which would then become totally inaccessible to the system after time t. What
would such an ideal system look like?
The answer to this question is very
straightforward. Using its infinite memory, the system would simply store each
training pair, X(t) and Y(t), into its memory. Using its
unlimited computing power, the system would then use batch learning to
learn the map from X to Y. (If you want to add some
ability to forget obsolete memories, I would argue that you still should use
the same general structure, but should allow for time-lagged recurrent
structures or networks instead of simple static networks; however, such possibilities
go beyond supervised learning as such.)
In short, the system would consist of two subsystems: (1) a dumb but
complete memory; (2) a global learning system, which learns from the memory
subsystem, in batch mode, rather than from current experience.
The ideal practical real-time
learning system may simply be based on approximating this ideal
real-time learning system. This is the idea of syncretism.
An obvious way to implement syncretism is
to build a supervised learning system composed of two subnetworks. One network would
be a dumb associative memory (perhaps with simple measures of confidence, like
the radii used in radial basis functions). A second would be a global network,
used for generalization and prediction. The global network would be adapted
using weight-based adaptation, but the training pairs would come at some
times from external reality, and at other times from the associative memory.
Clearly, there are dozens of ways one might implement this kind of system; it
is premature to recommend one or another.
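One of the many conceivable implementations is sketched below: a dumb memory stores every pair it sees, and the global network takes learning steps both on fresh experience and on pairs replayed from that memory. The replay schedule, the linear stand-in for the global network, and the uniform sampling rule are all assumptions of mine for illustration only.

    # Sketch of one possible two-network syncretism design (illustrative only).
    import numpy as np

    class DumbMemory:
        """A dumb but complete associative memory: store every training pair."""
        def __init__(self):
            self.X, self.Y = [], []
        def store(self, x, y):
            self.X.append(np.asarray(x, dtype=float))
            self.Y.append(np.asarray(y, dtype=float))
        def sample(self, rng):
            i = rng.integers(len(self.X))
            return self.X[i], self.Y[i]

    class LinearGlobalNet:
        """Stand-in for the global network; one gradient step per training pair."""
        def __init__(self, n_in, n_out, lr=0.01):
            self.W = np.zeros((n_out, n_in))
            self.lr = lr
        def predict(self, x):
            return self.W @ x
        def update(self, x, y):
            self.W += self.lr * np.outer(y - self.predict(x), x)

    def real_time_step(x, y, memory, global_net, rng, replays=4):
        memory.store(x, y)                    # (1) dumb but complete memory
        global_net.update(x, y)               # learn from current experience
        for _ in range(replays):              # (2) also learn from the memory,
            xr, yr = memory.sample(rng)       #     approximating batch learning
            global_net.update(xr, yr)

    # Usage sketch: rng = np.random.default_rng(0); mem = DumbMemory();
    # net = LinearGlobalNet(3, 1); call real_time_step(x, y, mem, net, rng)
    # as each new training pair arrives.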
Long
ago, in [14], I proposed a somewhat more sophisticated arrangement, also based
on syncretism. The idea was to make the output of the overall system equal to
something like a weighted sum of the output of the associative memory
and the output of the global subnetwork. The weight at any time would depend
the level of confidence expressed by the associative memory. Adaptation
of the global subnetwork would depend on continually-updated evaluations of how
far off the global network is for each memory prototype or cluster. The
details of the design in [14] are far too specialized to be worth repeating
here; however, they may have some suggestive value.
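The sketch below, reusing the memory and global network from the previous sketch, shows the weighted-sum arrangement in its simplest form; the particular confidence measure, based on distance to the nearest stored prototype, is a placeholder of mine and not the specific rule of [14].

    # Confidence-weighted combination of associative memory and global network.
    import numpy as np

    def combined_output(x, memory, global_net, radius=1.0):
        # Confidence of the associative memory: high near a stored prototype.
        dists = np.linalg.norm(np.stack(memory.X) - np.asarray(x, dtype=float), axis=1)
        i = int(np.argmin(dists))
        confidence = float(np.exp(-(dists[i] / radius) ** 2))
        # Output = weighted sum of the memory's recall and the global prediction.
        return confidence * memory.Y[i] + (1.0 - confidence) * global_net.predict(x)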
The design in [14] was motivated by an
effort to translate certain ideas from Freud concerning the "ego" and
the "id." Freud described the "id" as a primitive system,
which predicts the future based on simple association. For example, when a
young child is hit by a person wearing a blue shirt, he may end up fearing all
people who wear blue shirts for years and years thereafter. Clearly, this
example fits the kind of forecasting which "local networks" or
"heteroassociative memories" perform. Freud argued that the
"ego" -- which takes a long time to develop -- provides a more
coherent, global model of cause-and-effect relations. He argued that people's
actual expectations are determined by a combination of the ego and the
id working together (as in [14]). He argued that the influence of a
traumatic childhood memory could be reduced by reliving the memory, so
that it could be assimilated into the ego; this corresponds to the idea
of adapting the ego or global model in response to training examples
presented by the id or associative memory. The design in [14] did make the
influence of the id at any time depend on the degree of assimilation of
the memory prototype, as in Freud's original idea.
Designs based on the kinds of arrangements
discussed above may be described as "two-level syncretism."
Freud's most basic theories in psychology
were in fact based on his effort to construct a neural-network theory of the
dynamics of the mind. In translating the ideas into mathematics, I have
generally relied on semi-popular accounts of his ideas (like [15]); however,
Pribram's account [16] is probably more complete.
Biologists, psychologists and experts in
AI might well suggest other ways of implementing syncretism which are also
worth considering. For example, instead of a two-level system, one might
imagine something like a hierarchy or heterarchy of progressively greater
generality, rising up from dumb memory to global theories. The intermediate
stages of such a hierarchy might include local-domain MLPs, analogous to
the Jordan-Jacobs design[17], with local confidence measures built into the
design. Designs of these general sorts may be called "multi-level
syncretism."
It is not clear how to make such
multi-level systems clean and brain-like; however, the complex system of
columnar modules or "dendrons" in the cerebral cortex (see Eccles in
[18]) might well be linked to such an arrangement. Or maybe not. Our knowledge
is very limited at present.
It is conceivable that the psychological
phenomenon of generalization -- going from a specific set of memories to a
local generalization -- is based on something more like batch learning than on
weight-based "real-time learning." It is also conceivable that deep
sleep is the time when such functions are performed most intensely. The
phenomenon of "consolidation" -- well-known in biology -- clearly
plays a major role here. Artificial-seeming adaptation schemes, like
Touretzky's method for processing multiple observations in parallel on a WARP
supercomputer, might well be closer to the brain than we would expect.
Clearly there is an enormous amount of
research to be done here. It is also clear that our community has not even
begun to address these vital issues.
Dendritic Field Arrays
Based on ideas like those above, I would
argue that global networks (like MLPs and SRNs adapted by gradient-based
methods) and local networks must both play a crucial role as sub-systems in the
brain. (Some people believe that backpropagation could not exist in the brain
in any form; however, see papers by myself and by Hammeroff in [2,3,18] for
alternative views.)
Pribram has argued that image processing
in the cerebral cortex is based on a radically different kind of network, which
he would call a "dendritic field array." He has argued that
dendritic field arrays consist of a set of neurons, linked together in
their processing. The first stage of processing -- the dendritic field stage --
performs the most complex processing; it is like a straightforward (though
sparse) linear Hopfield network, with a huge number of elements, one for
each point on or around any of the dendrites in the system. The final or
nonlinear stage, according to Pribram, is little more than a readout stage; at
that stage, each axon hillock calculates a linear combination of the many field
values, and then applies a simple squashing function.
Pribram has cited substantial (if
indirect) empirical evidence to support his theory. Simmons of Brown has
recently performed a path-breaking analysis of sonar processing in the bat[2],
in which he claims to have direct evidence, for the first time, of true
dendritic field processing.
If Pribram were right, and if dendritic
field processing should play at least some role in the brain, what would its
significance be, from a computational point of view? A wide variety of
quantum-mechanical interpretations have been suggested here[18]; however, from
a computational point of view, there is a much simpler possibility [18]. One
could imagine that the dendritic weights (C_ij) and readout weights (C'_ij) are adapted by something like:

[Equations 1 and 2 -- Hebbian-style update rules for C_ij and C'_ij]

where a(t) is a global attention factor, and where the local Hopfield convergence has the effect of yielding outputs:

[Equations 3 and 4 -- the resulting readout]
Because
of page charges, I cannot elaborate on the details here; see [18].
The net effect of this would be as
follows: instead of a slow-learning nonlinear network (like an MLP) or a
piecewise-flat network (the usual local networks), this yields a roughly
piecewise-linear network where learning approaches 100% efficiency in
real time within each local domain. Equations 1 and 2 are clearly very Hebbian
in appearance, and equation 3 should be very familiar to statisticians. This
kind of network would be far more efficient in the brain (because of its unique
hardware) than in conventional computers; however, it is possible that it could
be implemented efficiently in optical or VLSI or biomolecular hardware as well.
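Since equations 1 through 4 are given only in [18], the sketch below is no more than my own loose computational caricature of the net effect described above: Hebbian-looking accumulation of input-input and target-input correlations, scaled by a global attention factor, followed by a regression-style linear readout and a squashing function.

    # Loose caricature (NOT the equations of [18]): Hebbian accumulation of
    # correlations within a local domain, then a regression-style readout.
    import numpy as np

    class LocalLinearReadout:
        def __init__(self, n_in, n_out, ridge=1e-3):
            self.Cxx = ridge * np.eye(n_in)       # input-input correlations
            self.Cyx = np.zeros((n_out, n_in))    # target-input correlations
        def learn(self, x, y, attention=1.0):
            # Hebbian-looking updates, scaled by a global attention factor a(t).
            self.Cxx += attention * np.outer(x, x)
            self.Cyx += attention * np.outer(y, x)
        def predict(self, x):
            # Regression-style readout (familiar to statisticians), then squash.
            W = self.Cyx @ np.linalg.inv(self.Cxx)
            return np.tanh(W @ x)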
References
1. D.White & D.Sofge,eds,Handbook of
Intelligent Control. Van Nostrand, 1992.
2. P.G.Madhavan et al, eds,Neuroscience
and Neural Networks: Efforts Towards an Empirical Synthesis, 1993.
3. P.Werbos,Beyond Regression, Harvard
Ph.D. thesis (1974), reprinted in P.Werbos, The Roots of Backpropagation:
From Ordered Derivatives to Neural Networks to Political Forecasting,
Wiley, 1993.
4. Y.H.Pao,Adaptive Pattern Recognition
and Neural Networks,Addison-Wesley,1989.
5. I.Aleksander & J.Taylor,eds,Artificial
Neural Networks 2,
6. P.Werbos, Neurocontrol and related
techniques, in A.Maren,ed,Handbook of Neural Computing Applications,
Academic, 1990.
7. P.Werbos, Backpropagation: past and
future. In ICNN Proceedings, IEEE, 1988.
8. Watrous and Shastry. In ICNN
Proceedings, IEEE, 1987.
9. W.Miller, Sutton and Werbos, Neural
Networks for Control, MIT Press, 1990.
10. R.Rohwer, The moving targets training
algorithm,NIPS Proceedings,Morgan-Kauffman,1990.
11. I.Guyon et al,Comparing different neural
network architectures for classifying handwritten digits,IJCNN Proceedings,IEEE,1989.
12. P.Werbos, Learning how the world works, Proc.
Systems, Man, Cyb., IEEE, 1987.
13. P.Houillon & A.Caron,Planar robot
control in cluttered space by artificial neural network. In Mathematical
Modelling and Scientific Computing (8th International Conference).
14. P.Werbos, Advanced forecasting methods
for crisis warning and models of intelligence, General Systems Yearbook,
1977 issue.
15. D.Yankelovich & W.Barett,Ego and
Instint: The Psychoanalytic View of Human Nature -- Revised, Vintage,1971.
16. K.Pribram & M.Gill, Freud's
Project Reassessed, Basic Books, 1976.
17. R.Jacobs et al,Adaptive mixtures of local
experts,Neural Computation,3(1),1991.
18. K.Pribram,ed., Rethinking Neural
Networks: Quantum Fields and Biological Evidence,