Reconfigurable Flight Control via Neurodynamic Programming and Universally Stable Adaptive Control

Paul J. Werbos

National Science Foundation*, Room 675, Arlington, VA 22230

The first breakthrough success with reconfigurable flight control (RFC) was based on a form of neurodynamic programming more advanced than the simpler forms which have been widely popularized. RFC tries to minimize the probability of a crash after an aircraft has experienced unforeseen, unpredictable damage so severe that no controller can absolutely guarantee stability (i.e., the maximum probability of survival is less than one). Some RFC simulations depend on highly unrealistic assumptions, like the implicit assumption that an airplane will not change its angle of attack by even one degree after being hit by a missile, or that there is only a small set of well-specified possible damage configurations. Hybrid systems based on linear-quadratic optimal control or neural adaptive control have demonstrated useful performance in RFC, but they are reduced forms of more general neurodynamic programming designs, which should yield higher survival probabilities and “universal” stable adaptive control (stability for a much broader class of plants than those allowed in the past theorems of Narendra and Annaswamy [1] and Narendra and Mukhopadhyay [2]).

__1. Introduction and Background__

This paper discusses how to achieve best possible performance (probability of survival) in reconfigurable flight control (RFC). It also reviews recent progress in approximate dynamic programming (ADP), which is sometimes called neurodynamic programming [3], reinforcement learning [4,5], or adaptive critics [6,7].

RFC really took off as a major research investment after an initial $4 million contract from NASA Ames, managed by Charles Jorgensen, to McDonnell-Douglas.

This paper argues that a proper use of advanced ADP offers the best hope of optimal performance in RFC. Section 2 reviews key concepts from dynamic programming and ADP, and compares them with alternative approaches for RFC. Section 3 discusses RFC more directly. We do not have enough space here to display the flow charts and equations from the references, but the basic idea is fairly simple, in the end.

Our goal in RFC is to find the dynamic control rule which *maximizes a probability*. What we really want to do, then, is to solve the dynamic programming problem which expresses that goal exactly, accounting for the stochastic (uncertain) aspects and the nonlinearity. The key challenge is to find the best control in the first few seconds after damage, when the aircraft is far away from any nominal a priori setpoints, and when the approximation error with a linear model would be substantial. No one can solve the dynamic programming problem exactly, because of computational complexity, but ADP offers a general set of designs to approximate the dynamic programming solution as accurately as possible.

This is the best that can be done for this very difficult problem. Any company which claims that it can absolutely guarantee stability or survival after such unforeseen, unrestricted damage is simply engaged in false advertising and marketing hype. As a practical matter, reducing the probability of a crash by a factor of two or three would already be a substantial achievement, with major implications both for civil aviation and for the balance of power in wartime.

*The views expressed here are those of the author and do not represent NSF in any way. However, as work written on government time, this may be freely copied subject to proper acknowledgement and retention of this caveat.

__2. Dynamic Programming, Alternative Approaches and ADP__

__2.1. Dynamic Programming As Such__

Dynamic programming is the only exact and efficient method possible, in the general case, for solving *stochastic dynamic optimization problems*. There are many, many variants of dynamic programming, for continuous time versus discrete time, finite-horizon versus infinite-time problems, and so on. For a fully observed system **X**(t) in discrete time t, we usually try to find the control law **u**(**X**) which solves the Bellman equation

J(**X**(t)) = Max over **u**(t) of ⟨ U(**X**(t), **u**(t)) + J(**X**(t+1))/(1+r) ⟩,   (1)

where U(**X**, **u**) is the utility function to be maximized, r is a discount rate, and the angle brackets denote the expectation value. U might be chosen, for example, as minus the square of the tracking error at time t; in that case, dynamic programming gives us a recipe for how to minimize tracking error *over all future time*. (This allows much more stability than adaptive schemes based on minimizing tracking error at time t+1. Instability may even be *defined* as the possibility of tracking errors growing larger and larger in future times, after t+1; dynamic programming can be used to minimize that possibility, in effect.) Dynamic programming also applies to optimization problems where there is some kind of “final time payoff function.”
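To make the recursion in equation 1 concrete, here is a minimal sketch (not from the original paper; the toy problem sizes and random numbers are hypothetical) of solving it exactly by value iteration when the state and action spaces are small and discrete:

```python
import numpy as np

# Toy illustration of equation 1: value iteration on a small discrete
# problem. All sizes and numbers here are hypothetical, chosen only to
# show the structure of the Bellman recursion.
n_states, n_actions = 5, 2
r = 0.1                      # discount rate, as in the J(X(t+1))/(1+r) term
rng = np.random.default_rng(0)

# P[u, x, x'] = probability of moving from state x to x' under action u
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
# U[x, u] = utility of action u in state x (e.g. minus squared tracking error)
U = -rng.random((n_states, n_actions))

J = np.zeros(n_states)
for _ in range(1000):
    # Q[x, u] = U(x, u) + <J(X(t+1))> / (1 + r)
    Q = U + (P @ J).T / (1.0 + r)
    J_new = Q.max(axis=1)
    if np.max(np.abs(J_new - J)) < 1e-10:
        break                # J has converged to the fixed point of eq. 1
    J = J_new

policy = Q.argmax(axis=1)    # the resulting optimal control law u(X)
```

For continuous aircraft states, this table-based approach is exactly what becomes intractable; ADP replaces the table for J with a trained approximator.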

__2.2. Alternatives for RFC: Other Optimal Control, Robust Control, Adaptive Control__

Dynamic programming gives us the best possible control law **u**(**X**(t)). Both in LQG/LQR and in full dynamic programming, we can control partially observed systems by designing *observer systems* which estimate the complete state vector, in effect. Kalman filtering is crucial to practical work with LQR/LQG [9,10]. The nonlinear version of it is often crucial to performance with nonlinear ADP [14,15].

Robust control and adaptive control seem to offer an alternative to optimal control in stabilizing a damaged aircraft.

Robust control tools like mu-synthesis are very highly developed for the linear case. They are closely related to linear-quadratic optimal control [12,13]. However, when a plant is very far away from the nominal trajectory, linear approximations to the dynamics are highly inaccurate. Using linear robust methods, we could represent the nonlinearities as totally unknown disturbances; however, this reduces the possible stability margins, compared with actually using the information at hand about the nature of the nonlinearities and uncertainties. Leading researchers in nonlinear robust control like Baras [16] have shown that the best, full-up nonlinear robust control rule results from solving the “Hamilton-Jacobi-Bellman” equation (essentially just equation 1) with appropriate choices for U and other inputs. This brings us back to the task of approximating the solution to equation 1 as accurately as possible, for medium-scale problems like aircraft control.

Adaptive control seems to offer a totally different alternative. There are good intuitive reasons to expect that adaptive control should be able to stabilize a wider class of plants than any fixed, feedforward robust controller. (See my review slides at www.iamcm.org, or [14].) But practical experience suggests that robust control is far more reliable than the forms of adaptive control used most widely today.

Narendra and others have proven important total-system stability theorems for adaptive control – but only for a very limited class of plants, plants which obey very stringent assumptions. Narendra and Annaswamy prove such theorems for linear adaptive control [1], and Narendra and Mukhopadhyay [2] summarize very similar theorems for the general nonlinear case (where the linear approximator is replaced by a general nonlinear approximator, an artificial neural network (ANN)). Narendra and Annaswamy [1] summarize many years of research which had limited success at best in relaxing the restrictive assumptions, in search of “universal” stable adaptive control (the ability to stabilize any stabilizable plant), even in the linear case.

Almost all of the applications of neural adaptive control or general nonlinear adaptive control in recent years have been based on the designs discussed in [2], with or without minor modifications – with all the limitations that this implies. In 1998 [13], I described how certain forms of ADP can be viewed as a relatively simple extension of adaptive control. I showed that they overcome the key difficulties in obtaining universal stable adaptive control. I have not yet proved the corresponding total-system stability theorems, but these model-based ADP systems have already been used successfully in a number of difficult problems where older methods have not worked as well [7,14].

In recent years, Narendra has argued that the stability of adaptive control can be improved by using multiple-model or switching methods. In actuality, multiple-model approaches are most useful in cases where there are reasons to partition the state space, either because the behavior is drastically different in different regions or because there is a need to use multiple time scales in developing a controller; in such cases, the best performance should come from *combining* multiple-model approaches *together* with those forms of ADP which provide optimal performance and stability within any region of state space [14]. This is an extremely important area for future research, which may indeed be relevant to RFC. As a preliminary step, however, the full exploration of ADP on RFC will also be important.

__2.3. Approximate Dynamic Programming (ADP)__

Early methods to solve equation 1 numerically were usually expensive for medium-scale problems, because they allowed for *any possible function* J. In a series of journal articles and other papers from 1968 to 1987, I proposed that we could overcome this difficulty, and develop general-purpose reinforcement learning machines, by using a general-purpose nonlinear approximation scheme (like a neural network) to represent J. I developed a series of methods to adapt or estimate the parameters of J, starting from Heuristic Dynamic Programming (HDP, first published in 1977, and later rediscovered as “generalized TD” [17]) and proceeding to more advanced methods. (See [13] for the history.)

Modern ADP is a very complex family of designs, ranging from relatively simple popularized methods which are very robust on small problems involving discrete decisions (and good at explaining many experiments in animal behavior) through to more complex designs which offer a serious hope of replicating true brain-like intelligence. The first working ADP system, implemented by Widrow [6], may be seen as “level zero” on the ladder of basic designs. The popular lookup-table TD and “Q learning” designs [3,8] may be viewed as level one. The ADP system used by White and Urnes was a level two system, based on what I would call “Action-Dependent HDP” [8]. This system used a stream of derivative feedback to train the Action network or controller, enabling much faster training than with level 1 designs, which White also tried out in various tests.
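As a rough sketch of what a “level two” (Action-Dependent HDP) loop looks like, here is a hypothetical scalar example; the plant, the quadratic critic form, and the learning rates are all invented for illustration, and the actual White and Urnes system was far more elaborate:

```python
import numpy as np

# Hypothetical Action-Dependent HDP sketch on a scalar linear plant.
# The Critic learns Q(x, u); its derivative feedback dQ/du is what
# trains the Action rule u = -k*x. All parameters are illustrative.
a, b, gamma = 0.9, 0.5, 0.95          # plant x(t+1) = a*x + b*u; discount
w = np.zeros(3)                        # critic Q(x,u) = w . [x^2, x*u, u^2]
k = 0.0                                # action rule u = -k*x
lr_c, lr_a = 0.05, 0.02
rng = np.random.default_rng(1)

def Q(x, u):
    return w @ np.array([x * x, x * u, u * u])

for _ in range(20000):
    x = rng.uniform(-1.0, 1.0)
    u = -k * x + 0.1 * rng.normal()    # action plus exploration noise
    x1 = a * x + b * u                 # observe the next state
    utility = -(x * x + u * u)
    # Critic update: move Q(x, u) toward U + gamma * Q(x', u')
    err = utility + gamma * Q(x1, -k * x1) - Q(x, u)
    w += lr_c * err * np.array([x * x, x * u, u * u])
    # Action update: derivative feedback dQ/du, chained through du/dk = -x
    dQ_du = w[1] * x + 2.0 * w[2] * u
    k += lr_a * dQ_du * (-x)           # ascend Q
```

The stream of derivative feedback dQ/du is what distinguishes this from level-one methods, which pass back only a scalar reinforcement signal.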

Levels 3 through 5, in my classification, are all *model-based* designs. They all require the use of some sort of model of the plant, which may be anything from a differentiable first-principles model through to a neural network trained in real time to emulate the plant [14,15]. They all train the Action network or controller based on derivatives calculated by my method of “backpropagating through the model,” described in [3,4,8,13,18]. Backpropagation through a model yields the same results as the brute-force methods usually used to calculate the same derivatives in Indirect Adaptive Control (IAC), but it allows faster calculation and real-time implementation, particularly for larger-scale problems. The level 3 design is essentially the same as IAC, a la Narendra, except that the block which calculates tracking error in IAC is replaced by a neural network or quadratic system which approximates J. Since J is a Lyapunov function of the system, this converges to stable control in the general case, for plants which are stabilizable.
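The chain-rule calculation behind “backpropagating through the model” can be sketched in a few lines. This is a hypothetical scalar stand-in (the plant, critic, and action rule are all invented for illustration), showing that the backpropagated derivative matches the brute-force perturbation a la IAC:

```python
# Sketch of "backpropagating through the model": the derivative used to
# train the Action network is obtained by the chain rule through a
# differentiable plant model and a Critic. Scalar stand-ins only.
a, b = 0.9, 0.5

def model(x, u):             # differentiable plant model x(t+1) = f(x, u)
    return a * x + b * u

def critic(x):               # approximate J; a fixed quadratic stand-in
    return -2.0 * x * x

def action(x, k):            # action "network" u = -k*x with weight k
    return -k * x

def grad_through_model(x, k):
    """dJ(f(x, u(x,k)))/dk by the chain rule (backpropagation)."""
    u = action(x, k)
    x1 = model(x, u)
    dJ_dx1 = -4.0 * x1       # derivative of the quadratic critic
    df_du = b                # model Jacobian w.r.t. the control
    du_dk = -x               # action-rule Jacobian w.r.t. its weight
    return dJ_dx1 * df_du * du_dk

def grad_brute_force(x, k, eps=1e-6):
    """The same derivative by direct perturbation of the weight k."""
    f = lambda kk: critic(model(x, action(x, kk)))
    return (f(k + eps) - f(k - eps)) / (2 * eps)

assert abs(grad_through_model(1.3, 0.4) - grad_brute_force(1.3, 0.4)) < 1e-5
```

For vector states, the chain rule is what backpropagation computes in a single backward pass, versus one full model evaluation per weight for the brute-force version.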

In fairness to Barto and Watkins, I should note that this scheme of “levels” is a bit oversimplified. Truly complex problems in higher-level intelligent control will involve a combination of discrete and continuous variables, requiring a complex integration of “level one” methods and model-based methods. Beyond these basic designs, there are already many higher-level ADP designs – such as the multiple-model ADP mentioned above – which are described further in the references, along with equations, flowcharts, and examples of applications.

__3. Previous Experience and Practical Issues With RFC__

For reasons of space, this section will not repeat material on RFC and on control designs already discussed in sections 1 and 2.

The seminal work by White and Urnes [8] was a simulation study. They started from McDonnell-Douglas’ very sophisticated internal model of the F-15. They generated a wide range of random damage conditions, not limited to the smaller damage sets considered by some later researchers in this field. They imposed a requirement that the standby controller system must learn to stabilize the craft within *two seconds* of the damage; otherwise, they would declare a crash condition. With the level two ADP controller described in [8], they were able to reduce the probability of crash in simulation from close to 100% down to about 50%. Again, this was the precursor to the much larger projects managed through NASA Ames, and then elsewhere.

Circa 1993, Charles Jorgensen of NASA Ames proposed a two-step plan. Step one would use *offline* learning, starting from relatively conservative hybrid neural-classical designs, to achieve significant reductions in the crash rate – not as large as the reductions possible in step two, but faster to get started, and useful in paving the way for step two. Step two would essentially follow through on the original work of White and Urnes to get better results. Step one would begin by using a neural net system identification system to rapidly estimate the parameters of a linear model, which would be coupled to a linear-quadratic optimization scheme. NASA Dryden stressed that neural networks (like human pilots!) are certainly *not* excluded from their V&V process; the challenge was to systematically develop procedures to provide quality control in a pragmatic way, aimed at reducing the actual probability of crashes in the real world (which will unfortunately never be exactly zero).

In actuality, there are substantial opportunities to enhance the performance of that class of design, as discussed in section 2.

The best possible performance in RFC should be expected from a *combined* off-line, on-line approach. Prior to any new test flights, one would want to do the best one can with offline learning.

This would start out with the creation of a “metamodel,” as discussed in the highly successful work of Ford Research in the clean car area. (See [14] for a discussion of that work, involving “multistreaming.” See [8,14,15] for system identification methods important to best performance in that task.) This metamodel should allow one to simulate the effect of both “normal” and “unanticipated” types of faults. Model-based ADP systems combined with neural network observers (TLRN system identifiers) trained offline based on such a metamodel should perform far better than the LQG style of “step one” RFC system, *even without any use of real-time learning or adaptation*. They should be able to deal with the V&V process in much the same way that Jorgensen envisioned for the earlier step one approach. It should be possible to “beat this system to death” in simulations, in the same way that the crucial neural net estimator could have been beaten to death in simulation in Jorgensen’s original step one plan.
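The offline “multistreaming” idea can be sketched as follows (entirely hypothetical numbers; a real metamodel would be a full vehicle simulation): a single control rule is tuned against an ensemble of randomly drawn damage conditions, rather than against one nominal plant:

```python
import numpy as np

# Toy sketch of offline "multistreaming": one control gain is tuned
# against an ensemble of randomly drawn "damage" conditions from a
# metamodel. The scalar plant family and cost are hypothetical.
rng = np.random.default_rng(2)

def rollout_cost(k, a, b, x0=1.0, T=30):
    """Accumulated quadratic cost of u = -k*x on the plant x' = a*x + b*u."""
    x, cost = x0, 0.0
    for _ in range(T):
        u = -k * x
        cost += x * x + u * u
        x = a * x + b * u
    return cost

# Damage ensemble: random perturbations of the nominal plant parameters
n_streams = 200
a_dmg = 0.9 + 0.3 * rng.standard_normal(n_streams)
b_dmg = 0.5 + 0.1 * rng.standard_normal(n_streams)

def avg_cost(k):
    return np.mean([rollout_cost(k, ai, bi) for ai, bi in zip(a_dmg, b_dmg)])

# Crude offline search for a single gain that works across the ensemble
ks = np.linspace(0.0, 3.0, 61)
best_k = min(ks, key=avg_cost)
```

An ADP version would replace the grid search with Critic-guided training, but the structure (one controller scored across many simulated damage streams) is the same.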

On the other hand, it may be possible to improve stability still further by *starting out* from the best RFC from offline training, and adding an online training/adaptation scheme to further adapt the parameters.

Maybe. This is not entirely obvious in the RFC application. But it *is* fairly obvious that one could improve on the neural adaptive control approach by changing it in two ways: (1) initializing it with an Action network developed by intense offline training as above; (2) replacing the simple measure of tracking error used in IAC by the Critic network developed in the offline training. In simulation studies, Balakrishnan has demonstrated enormous improvements in IAC performance in aircraft and missile control, after disturbances, when an offline-trained Critic network is used in place of the usual square-error measure to control the real-time adaptation. (In some cases, this may also require some offline training of learning rates in order to optimize performance; however, Balakrishnan reported no need for such additional tricks.)

In practice, with a well-developed metamodel, it is not so clear what additional advantages may accrue to real-time learning in this particular application. It is much more plausible, however, that a use of partitioned state spaces may improve ADP performance in some ways. The kinds of partitions used by Motter to improve wind tunnel control may also be used here, in conjunction with new methods for multi-model ADP [13,19]. This could be a significant step both in improving aviation safety and in demonstrating the value of more truly brain-like approaches to intelligent control.

__References__

1. K. Narendra & A. Annaswamy, *Stable Adaptive Systems*, Prentice-Hall, 1989

2. K. Narendra & S. Mukhopadhyay, Intelligent control using neural networks. In M. Gupta & N. Sinha, eds., *Intelligent Control Systems*, IEEE Press, 1996

3. D.P. Bertsekas and J.N. Tsitsiklis, *Neuro-Dynamic Programming*, Athena Scientific, 1996

4. W.T. Miller, R. Sutton & P. Werbos, eds., *Neural Networks for Control*, MIT Press, 1990, now in paperback, chs. 2 and 3

5. P. Werbos, The elements of intelligence, *Cybernetica* (Namur), No. 3, 1968

6. B. Widrow, N. Gupta & S. Maitra, Punish/reward: learning with a Critic in adaptive threshold systems, *IEEE Trans. SMC*, Vol. 5, p. 455-465, 1973

7. P. Werbos, New Directions in ACDs: Keys to Intelligent Control and Understanding the Brain. In *Proc. IJCNN 2000*, IEEE, 2000

8. D. White & D. Sofge, eds., *Handbook of Intelligent Control*, Van Nostrand, 1992

9. P. Werbos, Neural networks for control and system identification, *Proc. IEEE CDC 89*, IEEE, 1989

10. J. Von Neumann and O. Morgenstern, *The Theory of Games and Economic Behavior*, Princeton U. Press, 1953

11. A. Bryson & Y.C. Ho, *Applied Optimal Control*, Ginn, 1969

12. R.F. Stengel, *Optimal Control and Estimation*, Dover, 1994

13. P. Werbos, *Stable Adaptive Control Using New Critic Designs*, xxx.lanl.gov: adap-org/9810001, October 1998

14. P. Werbos, Neurocontrollers. In J. Webster, ed., *Encyclopedia of Electrical and Electronics Engineering*, Wiley, 1999

15. J. Principe, N. Euliano & W.C. Lefebvre, *Neural and Adaptive Systems: Fundamentals Through Simulations*, Wiley, 2000

16. J.S. Baras and N.S. Patel, Information state for robust control of set-valued discrete time systems, *Proc. 34th Conf. Decision and Control (CDC)*, IEEE, 1995, p. 2302

17. R.S. Sutton, Learning to predict by the methods of temporal differences, *Machine Learning*, Vol. 3, p. 9-44, 1988

18. P. Werbos, *The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting*, Wiley, 1994

19. P. Werbos, Multiple Models for Approximate Dynamic Programming and True Intelligent Control: Why and How. In K. Narendra, ed., *Proc. 10th Yale Conf. on Learning and Adaptive Systems*, New Haven: EE Dept., Yale U., 1998