Back in the 1960s, I developed the idea of building faster, better reinforcement
learning designs (now called RLADP, Reinforcement Learning and Approximate
Dynamic Programming) by approximating dynamic programming. Here I post links to
most of the main papers I wrote and published between the initial idea and
1988, and then say more about the history.
I am amazed (even distraught) that I cannot find the scan I did of my 1968
Cybernetica (Namur) paper, where I first published the DP approach to
reinforcement learning, linked it to Freud, and showed how backwards recursion
makes it work! I clearly remember reviewing it and scanning it... but I am
worried, since my last two months at NSF involved SO much scanning and tossing
that I fear I may have lost my copies! (I will modify this later if I find it
when I get around to reorganizing my huge mass of computer files.)
Here
are links (click!) to papers I do have
at hand on this topic:
(1) A scanned copy of my 1972 thesis proposal, which got
deeper into the approach. I actually offset-printed 50 to 100 copies of that,
and sent it to and discussed it with many people, including Minsky, Grossberg,
Dreyfus and others. Harvard insisted that I not address all this in my thesis.
The main part of that history is discussed in my paper in Martin Bücker, George
Corliss, Paul Hovland, Uwe Naumann & Boyana Norris (eds), Automatic
Differentiation: Applications, Theory and Implementations, Springer (LNCS),
New York, 2005. A few other aspects are discussed in the book Talking Nets,
by Anderson and Rosenfeld.
(2)
Appendix B of my 1977 paper in
General Systems Yearbook defines Heuristic Dynamic Programming (HDP) (more or less
identical to TD) and the general idea of Dual Heuristic Programming (DHP) for
the first time. I attach that scan.
(3)
In 1979, I published a more complete, workable version of Dual Heuristic
Programming, also scanned
and attached.
Because I was an assistant professor in a policy program at UMCP, maybe I added
too much extra material on policy implications, but the math of the method is
there.
(4,5)
In 1980/81, I wrote a very detailed
DOE evaluation paper (“published” as an internal DOE report), and presented a
conference paper at IFIP, published by Springer: P. Werbos, Applications of
advances in nonlinear sensitivity analysis, in R. Drenick & F. Kozin (eds),
System Modeling and Optimization: Proc. IFIP Conf. (1981), Springer, 1982;
reprinted in P. Werbos, The Roots of Backpropagation, Wiley, 1994. That paper
gave the mathematics and a flow diagram for Globalized DHP. It was the paper
which gave me the feeling that “now it is REALLY published and I don’t have to
hold back and worry about plagiarism.” I sent the conference paper to four
places after publication, two of them being places I was funding at a high
level from DOE; all four announced within a year that they had invented a great
new algorithm, without citation and under different names!
(6)
In 1987, I published a paper in IEEE Trans. SMC, elaborating
on how the GDHP variant of ADP can fit as a model of the human brain. (I now
know the limits of that, and ways it should be upgraded, but even so I view it
as closer to the functional biological reality of intelligence than any other
whole-brain model I have seen out there, by a wide margin.) That was the
paper which came to Richard Sutton’s attention, and led to him inviting me to
visit his company, GTE, soon after. He did cite it in his chapter in Miller,
Sutton and Werbos, which was mainly written in 1988 but held until 1990. At
GTE, as we walked on the grass next to the building, I stressed the concept of
“dreaming as simulation,” a key concept in my 1987 paper which he needed some
convincing on; his 1990 chapter on Dyna did implement that kind of simulation.
(7)
In my 1988 ICNN paper
(presented in a session attended by about 2,000 people, chaired by Bernie
Widrow), I reviewed and cited that work. Widrow urged me to tell the whole
story of ADP and BP, as it might inspire other graduate students... but would
it? The actual talk was recorded, and I merged the text with the slides in the
attached file.
=====================
I have
never put together a complete review of my work from the 1960s to 1990 on what
is now called “reinforcement learning and approximate dynamic programming”
(RLADP), for many reasons. My chapter – chapter 1 – in the recent Handbook of
RLADP, edited by Lewis and Liu, gives a more modern overview of the field.
Until 1988, almost every step of the way in this field was followed by terrible
attacks, even by most of the people I tried to help. Kuhn’s classic text on the
difficulties of change in science gives some consolation, but I have mainly
preferred not to dwell on the past, and to redirect my efforts to where they
are more appreciated (or at least less likely to lead to gross misuse of
technology). However... at this point, what’s to lose? Here is a brief review
of that early past.
Even
before my undergraduate days (which started in 1964 at Harvard, if you don’t
count the courses I took at Penn and at Princeton before that), I was deeply
influenced by four sources, all of which point towards RLADP in different ways:
(1) John von Neumann’s Theory of Games and Economic Behavior, which laid out a
vision of rational behavior and utility functions; (2) Freud’s view of
psychodynamics, which one of my old classmates talked about a lot (George Davis
Gammon, son of a prominent professor at Penn); (3) Hebb’s Organization
of Behavior, raising the question of how this could be done in neural networks;
(4) Minsky’s chapter in Computers and Thought, which proposed reinforcement
learning as a path to true artificial intelligence. My goal here was to
understand how intelligence or mind is possible, following the idea that
intelligence is basically a system for learning how to maximize some kind of
(inborn) utility function over time, and linking that to our understanding of
human minds.
I
remember the exact moment when I began on the ADP path as such. As an
undergraduate, I decided to go to a mixer at Wellesley, and hitched a ride with
a couple going there in an old Volkswagen Beetle. On the way, I mentioned my
view of intelligence, and my interest in the puzzle of how one could build that
kind of reinforcement learning system so that it would really work. The guy
driving said that dynamic programming addresses exactly that optimization
challenge – but that it could not be a good model, because of the curse of
dimensionality: exact DP must compute and store a value for every possible
state, and the number of states explodes as the system grows. Knowing about
statistics and neural networks, I immediately concluded that we could
APPROXIMATE that perfect method, and that this would be the right path.
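To make that idea concrete in modern terms, here is a minimal sketch of the basic move: instead of the exact DP table, learn a parameterized approximation of the value function from experience. The toy random-walk problem, the linear features, and the learning constants below are all illustrative assumptions of mine, not any of the specific HDP/DHP designs discussed above.

```python
import random

# Toy problem: a 5-state random walk. Episodes start in the middle, step
# left or right with equal probability, and end with reward 1 off the right
# end, reward 0 off the left end. Exact DP would store one value per state;
# here we instead APPROXIMATE the value function with a linear model, so the
# number of parameters need not grow with the number of states.

N = 5                      # non-terminal states 0..4; terminals at -1 and 5
ALPHA, GAMMA = 0.05, 1.0   # learning rate, discount factor

def features(s):
    """Simple linear features: a bias term plus the normalized state index."""
    return [1.0, s / (N - 1)]

w = [0.0, 0.0]             # weights of the approximate value function

def value(s):
    return sum(wi * xi for wi, xi in zip(w, features(s)))

random.seed(0)
for _ in range(5000):
    s = N // 2
    while True:
        s_next = s + random.choice([-1, 1])
        if s_next < 0:
            reward, v_next, done = 0.0, 0.0, True
        elif s_next >= N:
            reward, v_next, done = 1.0, 0.0, True
        else:
            reward, v_next, done = 0.0, value(s_next), False
        # Temporal-difference-style update of the weights: move value(s)
        # toward the one-step lookahead target reward + GAMMA * v_next.
        delta = reward + GAMMA * v_next - value(s)
        for i, xi in enumerate(features(s)):
            w[i] += ALPHA * delta * xi
        if done:
            break
        s = s_next

print([round(value(s), 2) for s in range(N)])
```

For this walk the true values happen to be linear in the state index, so even two weights recover them closely; the point of the sketch is only that the learned approximator replaces the exponentially large DP table.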
In
1966, I took an independent study with Minsky, and I do have a couple of the
papers I wrote for him in my scan files. They were the starting point for my
Cybernetica paper, which went much further, but they did discuss the
approximation of dynamic programming as a path to reinforcement learning.
Minsky gave me a copy of a tech paper he did jointly with Selfridge and said:
“I really believed in reinforcement learning, and had a design which HAD to
work. But it didn’t. Yes, it would work on very small problems, but real
intelligent systems like brains must work on larger problems, and it would not
scale, so I chose to go no further with this.” In fact, for various reasons,
many later workers reinvented things which would not scale, but I focused my
efforts on new designs which could scale much better to larger tasks. That
effort, in a way, is still continuing, but this post responds to a question
about what I did through 1988 on this topic, and will go no further.
Actually,
in that same period I wrote a paper for a very different kind of audience on
what this model tells us about the development of human potential, and a few
others on links to social science and policy (more realistic than the 1979
paper)... but again, that’s beyond the scope of this brief summary.
In
1988, when I accepted the job to run the new program in Neuroengineering at
NSF, one of my first actions was to contact Tom Miller and Richard Sutton, and
to lead the organization of the workshop that year in New Hampshire, which in
turn led to our joint book from MIT Press. I still remember how we shared that
hotel in New Hampshire with a major team of George Bush senior, who was also
active that year.