Learning or Adaptation by Big Jumps and Small Jumps
Creativity, Backpropagation and EC Made Simple
General Principles
The challenge of big jumps versus little jumps occurs again and again for intelligent and adaptive systems at all levels of organization, from the genes of bacteria, to learning in neural networks, to human lives, to economic systems, and to the global future of humanity and the highest spiritual levels. It is truly one of those situations where certain fundamental mathematical principles apply to systems in a general way – “general systems theory.” Here I will try to proceed from very simple principles familiar to everyone up to some very important practical details. I will talk about systems at all these levels, after I review a few basic facts.
One of the universal facts of life is that little jumps are easier than big jumps. Incremental, bit-by-bit progress is an essential tool of any system that survives or thrives in the real world. Yet sometimes incremental progress is not enough. Therefore, in the best engineered systems, and in the natural systems which survive a tough process of natural selection, we will see a kind of overall organization which includes systems that support incremental learning, systems which support big jumps, and a smooth way of trying to unite these two capabilities. As a practical matter, I have seen again and again how a better understanding of this universal fact of life could have led to better performance in all kinds of human activities, from energy-economic policy to computer design to the development of human potential.
For incremental learning, the key idea is that small changes in the things we can control can result in fairly predictable small changes in what we care about. Since we can predict the effects, we can tune the small changes to have good effects. In order to implement this capability, we must somehow estimate “derivatives.” For those who did not take calculus – derivatives are just the numbers which tell us how much some outcome will change, in proportion to how much we change one of the inputs which affects that outcome, when we only make a small change.
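For readers who like to see the idea in concrete form, here is a minimal sketch (my own illustration, not part of the original discussion) of estimating a derivative numerically and using it to make small, predictable improvements; the particular "outcome" function is purely hypothetical.

```python
# A minimal sketch: estimate a derivative from a small change, then take
# small steps in the direction that improves the outcome.
# The outcome function below is a made-up example.

def outcome(x):
    """Hypothetical measure of how well things are going, given a setting x."""
    return -(x - 3.0) ** 2 + 5.0

def estimate_derivative(f, x, h=1e-4):
    """How much f changes, per unit change in x, estimated from a small change h."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 1.0
for _ in range(50):
    slope = estimate_derivative(outcome, x)
    x += 0.1 * slope        # small step in the direction that improves the outcome

print(round(x, 3), round(outcome(x), 3))   # x moves toward 3.0, the best nearby setting
```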
However, there are many kinds of derivative in mathematics. The ordinary (and partial) kinds of derivative discovered by Newton are what you get when you can write out a mathematical formula that tells you what the outcome will be as a function of everything which affects it. But when we ask how the outcome of a complicated system changes when we change one of the many things which influence it, we need to use a more general concept of derivative, which I have called an “ordered derivative.” This general new piece of mathematics, developed in the 1970s, was the real reason why the IEEE Neural Network Society awarded me its Pioneer Award back in 1992, when there were very few of us around. More important, I developed a new body of mathematical tools to compute and to use ordered derivatives, not only for artificial neural networks, but for essentially any kind of complex dynamical system. These are the tools which make the most powerful form of incremental progress possible in the general case. People in the neural network field later attached the word “backpropagation” to the core tool here, the tool which directly calculates ordered derivatives. Backpropagation is basically just an organized flow of feedback signals, designed to send derivative signals through all parts of a complex system through a kind of backwards local flow of information.
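As a rough illustration of what an ordered derivative is – and of how backpropagation computes it by sending derivative signals backwards – here is a tiny sketch with an arbitrary three-step system of my own invention; it shows the flavor of the idea, not the general formulation.

```python
# Illustrative sketch of an "ordered derivative": run an ordered sequence of
# simple calculations forward, then send derivative signals backwards through
# the same structure (the core idea behind backpropagation).
# The particular functions are arbitrary examples.
import math

x = 0.5

# Forward pass: an ordered system of calculations.
z1 = math.tanh(x)          # z1 depends on x
z2 = z1 * z1 + 3.0 * x     # z2 depends on z1 and (directly) on x
y  = math.sin(z2)          # the final outcome

# Backward pass: ordered derivative of y with respect to each earlier quantity,
# accumulated by flowing feedback backwards through every local connection.
dy_dz2 = math.cos(z2)
dy_dz1 = dy_dz2 * (2.0 * z1)
dy_dx  = dy_dz2 * 3.0 + dy_dz1 * (1.0 - z1 * z1)   # both paths from x to y

# Check against a brute-force small change in x.
def forward(x):
    z1 = math.tanh(x)
    z2 = z1 * z1 + 3.0 * x
    return math.sin(z2)

h = 1e-6
print(dy_dx, (forward(x + h) - forward(x - h)) / (2.0 * h))   # the two should agree
```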
Before I say more about backpropagation, I need to say more about the big picture here. Very often in life, there are times when direct small steps simply cannot get us to the best possible situation, or, more important, to where we need to go. Engineers and mathematicians like to talk about simple examples of these kinds of problems; if we can’t understand a simple case very well, how can we imagine that we have any useful real insight into complicated systems like governments or people?
Here is the favorite example of many engineers in talking about this problem. Try to imagine you are walking across the countryside, trying to get to the highest point in a very large park. If you just walk uphill (assuming no trees or fences that totally block you), you will indeed come out at the top – the top of whatever little hill you happened to start on. That is called a “local maximum.” If there is only one hill in the entire park, we call it “convex;” in a convex situation, the local maximum is also the highest point in the entire park. In a convex situation, incremental progress is always enough to get to the best situation.
But what if there are many hills? We call that a nonconvex problem. In that case, the only hope of finding the highest point is to jump somehow from one hill to the next, perhaps by understanding the global shape of the land, or perhaps by making a lot of big jumps at random so as to explore the space. Nonconvex optimization is an extremely important, practical part of real-world engineering. There are many brute force methods which work very well on some small, smooth problems, but for large problems all we have are “stochastic search methods.” The best available methods for practical engineers today are called “evolutionary computing” (EC). In fact, the largest conferences on EC in the world today are organized by the new IEEE Computational Intelligence Society (CIS). The CIS is essentially a union of researchers studying neural networks, backpropagation, EC, and other compatible technologies, such as symbolic reasoning systems (“fuzzy logic”) which assume that truth is often a matter of degree and not a black-or-white affair. A major challenge in CIS is to find ways to unite these technologies and the insights which they give us.
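To make the park metaphor concrete, here is a minimal sketch (my own toy example, a crude stochastic search rather than a real evolutionary algorithm) contrasting pure incremental hill-climbing, which ends up on whatever hill it starts on, with a search that also makes big random jumps.

```python
# A toy landscape with many hills; the global peak is near x = 0.
# Landscape and parameters are illustrative assumptions.
import math, random

def height(x):
    return math.cos(3.0 * x) - 0.1 * x * x

def hill_climb(x, step=0.01, iters=2000):
    """Pure incremental progress: always move to the higher neighboring point."""
    for _ in range(iters):
        x = max((x, x + step, x - step), key=height)
    return x                                  # ends on whatever hill it started on

def random_restarts(n=50):
    """Crude stochastic search: big random jumps, each followed by local climbing."""
    best = None
    for _ in range(n):
        candidate = hill_climb(random.uniform(-10.0, 10.0))
        if best is None or height(candidate) > height(best):
            best = candidate
    return best

random.seed(0)
print("start near x = 5:", round(hill_climb(5.0), 2))     # stuck on a nearby local peak
print("with big jumps:  ", round(random_restarts(), 2))   # close to the global peak near 0
```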
Closely cooperating with CIS is another important society, the International Neural Network Society (INNS), which strives especially hard to create a home for neuroscientists and psychologists interested in working together with engineers and mathematicians, in order to use new mathematics as a way to understand the brain. The three original two-year Presidents of INNS were Stephen Grossberg, Bernard Widrow and myself – all three still very active in these issues. The Neural Information Processing Systems Foundation, headed by Sejnowski, has also led some important work in the core of this field.
Small Jumps Versus Big Jumps in Economics and Energy
On the backpropagation side – did you know that market-based price signals are actually just a kind of derivative signal? For example, the value of a peanut to you is the amount of satisfaction (“utility”) you would get from having one extra peanut. In Economics 1, they call it “the marginal utility” (extra utility) of a peanut. In mathematical economics classes, they explain that this “marginal utility” is simply the derivative of your personal satisfaction level with respect to the amount of your peanut consumption. Microeconomic theory is full of very important theorems which explain how market economies can find a local maximum so long as “competition is perfect” – which basically means that the price signals do accurately represent marginal cost and marginal utility at the same time. In effect, perfect markets implement a kind of backpropagation under certain limited conditions which people need to understand very well in order to make good decisions. The flow of price signals is just a flow of derivative signals – a special case of backpropagation.
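In symbols (my own notation, not taken from the original text), the point is simply this:

```latex
% Illustrative notation:
%   U(x) = the consumer's total satisfaction ("utility") from consuming x peanuts
%   p    = the market price of a peanut, measured in the same units as utility
\text{marginal utility of a peanut} \;=\; \frac{\partial U}{\partial x},
\qquad\text{and under perfect competition:}\quad
\frac{\partial U}{\partial x} \;=\; p \;=\; \text{marginal cost}.
```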
But – there are times when local optima are not good enough. For example, economists sometimes talk about “economies of scale.” If there is no corporation in the world big enough to afford the development of a new type of reusable launch vehicle (see my space page), then incremental progress in the marketplace may simply not be enough to get us to the new “high mountain” of what space could contribute to the world economy. (I would be very happy to discover that it can – but in any case, it would not be a matter of incremental small steps!) Under certain simplifying assumptions, we can prove that economies really are convex systems, so we don’t have to worry. The problem is that the assumptions are not always true, and we really need to know what they are, in order to avoid getting into trouble. The incremental progress offered by the market system is essential – but we need to know its limits.
Another important example of nonconvexity in the economy is “reflexivity” as described by George Soros, the famous and successful billionaire investor. Soros has written books explaining very clearly how he made so much money, simply by seeing how reflexivity applies in real economic systems – and how to use it. The basic problem is that there are many possible economic paths to the future which meet all the conditions of market theory, and are stable, consistent “solutions” assuming that everyone sees that same future. But which of these paths actually occur? It ends up being a matter of psychology and expectations. In the large-scale energy model (LEAP) then used to predict the future of the
Small Jumps Versus Big Jumps and Creativity in Brains and Minds
In my tutorials on neural networks, I often tell everyone: “Everyone in this room is stuck in a local minimum. That doesn’t mean we’re all about to die. Think positively. It means that every one of us has some opportunity to do something better than we are doing now. But if you think you are in a rut, to some degree, you should look more carefully at the chimpanzees in the forest. Now, those folks are in a real rut! Humans have a lot of special new ways to enhance our creativity, above and beyond what we get from having almost the same kind of brain as the chimps have. But still our creativity is not perfect, and we all could do better. It is impossible, with a finite computing capacity living in a complex world, to build a system which has truly perfect creativity – the ability to find the global optimum in all situations.”
This doesn’t mean that incremental learning is a bad thing, or a small part of our mind. If we didn’t have it, we wouldn’t be here at all. I remember one famous neural network person who said loudly: “Local minima are not such a bad thing. Don’t tell her, but my wife is a local minimum. If we do incremental learning, we always know that we end up improving whatever our best starting point was in any case.” I remember the questioner who asked: “Oh, yeah? I have a learning system which is much better than yours. I can absolutely guarantee that mine will converge to the global optimum. What do you say to that?” I got up and replied: “Can you tell us how your system would find the global optimum in the example he just gave you? It would bounce all over the world again and again until it visited every woman on earth a dozen times, in depth. If a real organism behaved like that, it wouldn’t last very long – certainly not long enough to find the global optimum.” And indeed, for a real brain, it is important that search – even if stochastic – be more than purely random, and intelligent in its own way; somehow, we learn from experience – incrementally – how to do better and better in searching through the various realms of life. As any teenager can tell you, “imagination” plays a crucial role in guiding that kind of search; “imagination” is basically that inborn faculty which constructs new possibilities or scenarios in our mind. There is a specific layer of the cerebral cortex of the brain which, in my view, outputs these scenarios; it is present in all mammals, even the mouse.
Even though humans have not yet fully mastered symbolic reasoning through their genetic inheritance, the bit of creativity they have acquired through their limited degree of symbolic reasoning is the main foundation for the enormous power of human technology, compared to that of other species on the earth today.
As for incremental learning in brains – though I did discover backpropagation (as defined here), I would certainly also admit that backpropagation is only one part of incremental learning, and that the core idea in backpropagation really came from someone else – Sigmund Freud. Before I developed the general mathematical form of backpropagation, I first developed the special case for neural networks, and I did so by translating one of the key insights of Freud from words and images into mathematics. (The act of translating ideas into mathematics and back is crucial to higher-level creativity in science, and even in scientific policy-making!)
Freud called this insight of his a new model of “psychodynamics.” The brain, he said, is made up of neurons. Neurons represent events or “objects.” If we often experience event B after event A, then, under the right conditions, the brain learns that event A causes event B to happen. The brain learns a forwards causal association from A to B. It does this by developing a forwards connection from the neuron which represents A to the neuron which represents B. But that is not the main thing, said Freud. Learning cause and effect relations is not the real purpose of the brain. The human brain as a whole is driven by goals, drives and emotions. Emotions are basically just evaluations or “affect” or value measures attached to the various objects. So if A causes B, and we put a heavy value on B, we should propagate a value signal backwards from B to A. Freud proposed that neuron B would emit a chemical signal, representing the value of B to the organism, and send it backwards along the same connection pathway to A; the backwards flow would be multiplied by the same connection strength used in the forwards direction, representing the strength of the causal relation. He sometimes called the chemical value signals “cathexis,” and he sometimes called the system of flowing information “psychic energy.” (At least, I remember reading those words somewhere in my old psychology books…). The picture is a vast flow of all kinds of “energy” through the system of neurons in the brain – but actually, a flow of chemicals.
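Here is a tiny sketch, in my own toy notation, of that Freudian backwards flow: each cause inherits value (“cathexis”) from its effects, multiplied by the strength of the forwards causal connection. The events and numbers are purely illustrative.

```python
# Hypothetical forwards causal strengths: W[a][b] means "a tends to cause b".
W = {
    "smoke": {"fire": 0.9},
    "fire":  {"burn": 0.8},
}

# Values ("cathexis") attached directly to outcomes by the organism's drives.
value = {"burn": -10.0, "fire": 0.0, "smoke": 0.0}

# Backpropagate value: each cause inherits value from its effects,
# weighted by the strength of the same forwards connection.
for cause in ("fire", "smoke"):               # handle effects before their causes
    for effect, strength in W[cause].items():
        value[cause] += strength * value[effect]

print(value)   # {'burn': -10.0, 'fire': -8.0, 'smoke': -7.2}
```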
And so, according to Freud, when we say that a certain topic or object receives a lot of “negative energy” from someone, we really mean that the corresponding neuron in his/her brain is acquiring a strong negative cathexis on that topic. There is a flow of negative energy – flowing back in his/her brain, by backpropagation, from somewhere else, or coming directly from a memory of something bad about that topic. The coupling between memory and backpropagation is an important technical point, discussed in many of my past papers posted here; I hope to post a “made simple” on that topic as well, sooner or later.
Some Important Technicalities
First of all, I have told you that derivatives tell you how some outcome variable (like satisfaction) depends on other variables – like physical actions – that we may control or influence. But in a functioning brain, there is not just one outcome variable to be considered. For example, the organism must work for success in staying alive and healthy and in achieving other goals in the world around it, but it also must work to understand its world better. It is possible to design “intelligent systems” which do not learn anything about cause and effect relations in their environment, but they typically can’t cope with even medium-sized “toy problems” in engineering, problems far smaller than what human brains can handle. Thus there must be at least two flows of dynamic feedback, or backpropagation, throughout any reasonable-sized brain – one flow representing the drive to make reality better, and another representing the drive to understand it better. Someday I may post a color diagram I drew in 1978, to show how multiple flows of backpropagation (in different colors) can be woven together in a larger intelligent system. I speculated that these different flows might be implemented by flows of different chemicals, in the Freudian view of the brain. They represent different “flavors” of psychic energy, and elicit different types of emotions and even hormonal response. The biological connections are discussed in many of the papers posted here. Recent experiments suggest that the main backwards flows are simply carried out by ordinary electrical flows in cell membranes.
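As a very rough sketch of what “two flows” means in practice (my own illustration, with made-up functions and numbers), one backward flow carries derivatives of prediction error – the drive to understand the world better – while a separate flow carries derivatives of utility – the drive to make outcomes better:

```python
# Two distinct backward derivative flows through one simple adaptive system.
# All functions and numbers here are illustrative assumptions.

def predict(w, state):
    """The system's simple model of what happens next."""
    return w * state

def utility(next_state):
    """How much the organism likes a given outcome (peaks at 1.0 here)."""
    return -(next_state - 1.0) ** 2

w, state, observed_next = 0.5, 3.0, 1.2

# Flow 1: derivative of squared prediction error with respect to the model weight w.
prediction = predict(w, state)
d_error_d_w = 2.0 * (prediction - observed_next) * state       # drive to understand better

# Flow 2: derivative of predicted utility with respect to the controllable state.
d_utility_d_state = -2.0 * (prediction - 1.0) * w              # drive to make things better

print(d_error_d_w, d_utility_d_state)   # two separate feedback signals: 1.8 and -0.5
```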
By the way, if the two backwards flows are naively merged together along the same pathway, one can end up causing “conflict of interest” effects which screw up the functioning of the system. This has been a major practical difficulty in human organizations. Internal conflicts of interest have made almost all human organizations much less “intelligent” than one would expect from the sheer mass of neurons that they bring together. As the market economy shows, we can sometimes create distributed organizations which are very effective for organizing actions in certain areas. But the faculties of coherent prediction or foresight, and of collective imagination, still depend more on the intelligence and enlightenment expressed through the individual human. In the world today, the Millennium Project of the American Council for the United Nations University is perhaps the closest thing we have to a physical organization for collective foresight; for all of its limitations, it addresses an essential kind of hole in the world system.
There are also some important mathematical relations between “psychic energy” and ordinary physical energy. This gets to be extremely technical, and I hope that the experts will forgive me if I oversimplify at first, in this paragraph. Psychic energy in the brain, in my view, is essentially part of the machinery for implementing a kind of optimization capability. This capability is based on an approximation of something called the Bellman equation, a universal equation for optimization in any universe subject to random disturbances or uncertainties. But the universe itself appears to be governed by something called the Hamilton-Jacobi equation, another optimization equation so similar to the Bellman equation that many engineers (wrongly) think they are the same. But the universe does a perfect job of “maximizing” its “utility function” (Lagrangian); as a result, the physical energy of the universe (Hamiltonian) is never increased or decreased. It can be moved around, but it can never be created or destroyed. But, because human brains are always on the path to perfection yet never quite there, our psychic energy can be increased substantially as we learn more – or decreased when we have psychological problems. Freud, of course, gradually became a great expert on these phenomena, and learned more and more as he grew older. Psychoanalysts often talk about people with “flat emotions,” “drained of cathexis,” who need help. Both highly creative people and highly deluded people usually have the opposite – intense psychic energy, bright eyes, passion and so on, often varying a lot with time. (It is tragic these days to see many schools using drugs to change the latter towards the former, without a real understanding of what they are doing.)
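For the curious, here is the Bellman equation in one common discounted form (my notation; the essay itself does not write it out):

```latex
% The Bellman equation, as an illustration of the "universal optimization
% equation" described above:
%   J(x)    = the long-term value of being in state x
%   u       = the action chosen
%   U(x,u)  = the immediate utility of taking action u in state x
%   \gamma  = a discount factor, 0 < \gamma \le 1
%   E{...}  = expectation over the random next state x'
J(x) \;=\; \max_{u}\,\Bigl[\, U(x,u) \;+\; \gamma\, E\bigl\{\, J(x') \mid x, u \,\bigr\} \Bigr]
```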
But – the universe does not really maximize or minimize its “Lagrangian” across space and time. Long ago, Lagrange thought so… but

Many economists would be very quick to add that nineteenth century economics worked very hard to overcome the earlier idea of collective utility, which came from earlier work in