Back in the 1960s, I developed the idea of building faster, better reinforcement
learning designs (now called RLADP, Reinforcement Learning and Approximate
Dynamic Programming) by approximating dynamic programming. Here I post links to
most of the main papers I wrote and published between the initial idea and
1988, and then say more about the history.
I am amazed (even distraught) that I cannot find the scan I did of my 1968
Cybernetica (Namur) paper, where I first published the DP approach to
reinforcement learning, linked it to Freud, and showed how backwards recursion
makes it work! I clearly remember reviewing it and scanning it... but I am
worried, since my last two months at NSF involved SO much scanning and tossing
that I fear I may have lost my copies! (I will modify this later if I find it
when I get around to reorganizing my huge mass of computer files.)
Here
are links (click!) to papers I do have
at hand on this topic:
(1) A scanned copy of my 1972 thesis proposal, which got
deeper into the approach. I actually offset-printed 50 to 100 copies of that,
and sent it to and discussed it with many people, including Minsky, Grossberg,
Dreyfus and others. Harvard insisted that I not address all this in my thesis.
The main part of that history is discussed in my paper in Martin Bücker, George
Corliss, Paul Hovland, Uwe Naumann & Boyana Norris (eds), Automatic
Differentiation: Applications, Theory and Implementations, Springer (LNCS),
New York, 2005. A few other aspects are discussed in the book Talking Nets,
by Anderson and Rosenfeld.
(2)
Appendix B of my 1977 paper in
General Systems Yearbook defines Heuristic Dynamic Programming (HDP) (more or less
identical to TD) and the general idea of Dual Heuristic Programming (DHP) for
the first time. I attach that scan.
(3)
In 1979, I published a more complete, workable version of Dual Heuristic
Programming, also scanned
and attached.
Because I was an assistant professor in a policy program at UMCP, maybe I added
too much extra material on policy implications, but the math of the method is
there.
(4,5)
In 1980/81, I wrote a very detailed
DOE evaluation paper (“published” as an internal DOE report), and presented a
conference paper at IFIP, published by Springer: P. Werbos, Applications of
advances in nonlinear sensitivity analysis, in R. Drenick & F. Kozin (eds),
System Modeling and Optimization: Proc. IFIP Conf. (1981), Springer, 1982;
reprinted in P. Werbos, The Roots of Backpropagation, Wiley, 1994. That paper
gave the mathematics and a flow diagram for Globalized DHP. It was the paper
which gave me the feeling that “now it is REALLY published and I don’t have to
hold back and worry about plagiarism.” I sent the conference paper to four
places after publication, two of them being places I was funding at a high
level from DOE; all four announced within a year that they had invented a great
new algorithm, without citation and under different names!
(6)
In 1987, I published a paper in IEEE Trans. SMC, elaborating
on how the GDHP variant of ADP can fit as a model of the human brain. (I now
know the limits of that, and ways it should be upgraded, but even so I view it
as closer to the functional biological reality of intelligence than any other
whole-brain model I have seen out there, by a wide margin.) That was the
paper which came to Richard Sutton’s attention, and led to him inviting me to
visit his company, GTE, soon after. He did cite it in his chapter in Miller,
Sutton and Werbos, which was mainly written in 1988 but held until 1990. At
GTE, as we walked on the grass next to the building, I stressed the concept of
“dreaming as simulation,” a key concept in my 1987 paper which he needed some
convincing on; his 1990 chapter on Dyna did implement that kind of simulation.
(7)
In my 1988 ICNN paper
(presented in a session attended by about 2,000 people, chaired by Bernie
Widrow), I reviewed and cited that work. Widrow urged me to tell the whole
story of ADP and BP, as it might inspire other graduate students... but would
it? The actual talk was recorded, and I merged the text with the slides in the
attached file.
=====================
I have
never put together a complete review of my work from the 1960s to 1990 on what
is now called “reinforcement learning and approximate dynamic programming”
(RLADP), for many reasons. My chapter – chapter 1 – in the recent Handbook of
RLADP, edited by Lewis and Liu, gives a more modern overview of the field.
Until 1988, almost every step of the way in this field was followed by terrible
attacks, even by most of the people I tried to help. Kuhn’s classic text on the
difficulties of change in science gives some consolation, but I have mainly
preferred not to dwell on the past, and to redirect my efforts to where they
are more appreciated (or at least less likely to lead to gross misuse of
technology). However... at this point, what’s to lose? Here is a brief review
of that early past.
Even
before my undergraduate days (which started in 1964 at Harvard, if you don’t
count the courses I took at Penn and at Princeton before that), I was deeply
influenced by four sources, all of which point towards RLADP in different ways:
(1) John von Neumann’s Theory of Games and Economic Behavior, which laid out a
vision of rational behavior and utility functions; (2) Freud’s view of
psychodynamics, which one of my old classmates talked about a lot (George Davis
Gammon, son of a prominent professor at Penn); (3) Hebb’s Organization
of Behavior, raising the question of how this could be done in neural networks;
(4) Minsky’s chapter in Computers and Thought, which proposed reinforcement
learning as a path to true artificial intelligence. My goal here was to
understand how intelligence or mind is possible, following the idea that
intelligence is basically a system for learning how to maximize some kind of
(inborn) utility function over time, and linking that to our understanding of
human minds.
I
remember the exact moment when I began on the ADP path as such. As an
undergraduate, I decided to go to a mixer at Wellesley, and hitched a ride with
a couple going there in an old Volkswagen Beetle. On the way, I mentioned my
view of intelligence, and my interest in the puzzle of how one could build that
kind of reinforcement learning system so that it would really work. The guy
driving said that dynamic programming addresses exactly that optimization
challenge – but that it could not be a good model, because of the curse of
dimensionality: exact DP must compute and store a value for every possible
state, and the number of states explodes as the system grows. Knowing about
statistics and neural networks, I immediately concluded that we could
APPROXIMATE that perfect method, and that this would be the right path.
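To make that idea concrete in modern terms, here is a minimal sketch of the basic move: instead of the exact DP table, learn a parameterized approximation of the value function from experience. The toy random-walk problem, the linear features, and the learning constants below are all illustrative assumptions of mine, not any of the specific HDP/DHP designs discussed above.

```python
import random

# Toy problem: a 5-state random walk. Episodes start in the middle, step
# left or right with equal probability, and end with reward 1 off the right
# end, reward 0 off the left end. Exact DP would store one value per state;
# here we instead APPROXIMATE the value function with a linear model, so the
# number of parameters need not grow with the number of states.

N = 5                      # non-terminal states 0..4; terminals at -1 and 5
ALPHA, GAMMA = 0.05, 1.0   # learning rate, discount factor

def features(s):
    """Simple linear features: a bias term plus the normalized state index."""
    return [1.0, s / (N - 1)]

w = [0.0, 0.0]             # weights of the approximate value function

def value(s):
    return sum(wi * xi for wi, xi in zip(w, features(s)))

random.seed(0)
for _ in range(5000):
    s = N // 2
    while True:
        s_next = s + random.choice([-1, 1])
        if s_next < 0:
            reward, v_next, done = 0.0, 0.0, True
        elif s_next >= N:
            reward, v_next, done = 1.0, 0.0, True
        else:
            reward, v_next, done = 0.0, value(s_next), False
        # Temporal-difference-style update of the weights: move value(s)
        # toward the one-step lookahead target reward + GAMMA * v_next.
        delta = reward + GAMMA * v_next - value(s)
        for i, xi in enumerate(features(s)):
            w[i] += ALPHA * delta * xi
        if done:
            break
        s = s_next

print([round(value(s), 2) for s in range(N)])
```

For this walk the true values happen to be linear in the state index, so even two weights recover them closely; the point of the sketch is only that the learned approximator replaces the exponentially large DP table.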
In
1966, I took an independent study with Minsky, and I do have a couple of the
papers I wrote for him in my scan files. They were the starting point for my
Cybernetica paper, which went much further, but they did discuss the
approximation of dynamic programming as a path to reinforcement learning.
Minsky gave me a copy of a tech paper he did jointly with Selfridge and said:
“I really believed in reinforcement learning, and had a design which HAD to
work. But it didn’t. Yes, it would work on very small problems, but real
intelligent systems like brains must work on larger problems, and it would not
scale, so I chose to go no further with this.” In fact, for various reasons,
many later workers reinvented things which would not scale, but I focused my
efforts on new designs which could scale much better to larger tasks. That
effort, in a way, is still continuing, but this post responds to a question
about what I did through 1988 on this topic, and will go no further.
Actually,
in that same period I wrote a paper for a very different kind of audience on
what this model tells us about the development of human potential, and a few
others on links to social science and policy (more realistic than the 1979
paper)... but again, that’s beyond the scope of this brief summary.
In
1988, when I accepted the job to run the new program in Neuroengineering at
NSF, one of my first actions was to contact Tom Miller and Richard Sutton, and
to lead the organization of the workshop that year in New Hampshire, which in
turn led to our joint book from MIT Press. I still remember how we shared that
hotel in New Hampshire with a major team of George Bush senior, who was also
active that year.