Reconfigurable Flight Control via Neurodynamic Programming and Universally Stable Adaptive Control
Paul J. Werbos
National Science Foundation*, Room 675
Arlington, VA 22230
The first breakthrough success with reconfigurable flight control (RFC) was based on a form of neurodynamic programming more advanced than the simpler forms which have been widely popularized. RFC tries to minimize the probability of a crash after an aircraft has experienced unforeseen, unpredictable damage, so severe that no controller can absolutely guarantee stability (i.e., the maximum probability of survival is less than one). Some RFC simulations depend on highly unrealistic assumptions, like the implicit assumption that an airplane will not change its angle of attack by even one degree after being hit by a missile, or that there is only a small set of well-specified possible damage configurations. Hybrid systems based on linear-quadratic optimal control or neural adaptive control have demonstrated useful performance in RFC, but they are reduced forms of more general neurodynamic programming designs, which should yield higher survival probabilities and “universal” stable adaptive control (stability for a much broader class of plants than those allowed in past theorems of Narendra and Annaswamy [1] and Narendra and Mukhopadhyay [2]).
1. Introduction and Background
This paper discusses how to achieve best possible performance (probability of survival) in reconfigurable flight control (RFC). It also reviews recent progress in approximate dynamic programming (ADP), which is sometimes called neurodynamic programming [3], reinforcement learning [4,5], or adaptive critics [6,7].
RFC really took off as a major research investment after an initial $4 million contract from NASA Ames, managed by Charles Jorgensen, to McDonnell-Douglas in the early 1990s.
This paper argues that a proper use of advanced ADP offers the best hope of optimal performance in RFC. Section 2 reviews key concepts from dynamic programming and ADP, and compares them with alternative approaches for RFC. Section 3 discusses RFC more directly. We do not have enough space here to display the flow charts and equations from the references, but the basic idea is fairly simple, in the end.
Our goal in RFC is to find the dynamic control rule which maximizes a probability. What we really want to do, then, is to solve the dynamic programming problem which expresses that goal exactly, accounting for the stochastic (uncertain) aspects and nonlinearity. The key challenge is to find the best control in the first few seconds after damage, when the aircraft is far away from any nominal a priori setpoints, and when approximation error with a linear model would be substantial. No one can solve the dynamic programming problem exactly, because of computational complexity, but ADP offers a general set of designs to approximate the dynamic programming solution as accurately as possible.
This is the best that can be done for this very difficult problem. Any company which claims that it can absolutely guarantee stability or survival, after such unforeseen, unrestricted damage, is simply engaged in false advertising and marketing hype. As a practical matter, reducing the probability of a crash by a factor of two or three would already be a substantial achievement, with major implications both for civil aviation and for the balance of power in wartime.
*The views expressed here are those of the author and do not represent NSF in any way. However,
as work written on government time, this may be freely copied subject to proper acknowledgement and retention of this caveat.
2. Dynamic Programming, Alternative Approaches and ADP
2.1. Dynamic Programming As Such
Dynamic programming is the only exact and efficient method possible, in the general case, for solving stochastic dynamic optimization problems. There are many, many variants of dynamic programming, for continuous time versus discrete time, finite horizon versus infinite-time problems, and so on. For a fully observed system X(t) in discrete time t, we usually try to find the optimal control law u(X) which satisfies the Bellman equation:
J(X(t)) = Max_u [ U(X(t), u) + <J(X(t+1))>/(1+r) ]        (1)
where U(X, u) is a user-supplied utility function [10], where r is a user-supplied interest rate (usually zero), where “<>” denotes expectation value, where u(X(t)) should be the value of u which maximizes the right-hand side of (1), and where J is an unknown function to be solved for. Many practical applications actually call for minimization over time; in such cases, “Max” is replaced by “Min,” “U” is actually a user-supplied cost or disutility function (whatever you want to minimize!), and J is called “cost to go.” For example, in conventional tracking control problems, we as users may choose “U” to be (X(t) - Xref(t))², the tracking error at time t; in that case, dynamic programming gives us a recipe for how to minimize tracking error over all future time. (This allows much more stability than adaptive schemes based on minimizing tracking error at time t+1. Instability may even be defined as the possibility of tracking errors growing larger and larger in future times, after t+1; dynamic programming can be used to minimize that possibility, in effect.) Dynamic programming also applies to optimization problems where there is some kind of “final time payoff function.”
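To make equation (1) and the tracking-error example concrete, the following is a minimal value-iteration sketch in the minimization form of (1), with U = (X - Xref)². The one-dimensional plant, the grids, the noise distribution, and the small positive interest rate (used so that the infinite-horizon cost stays finite under persistent noise) are illustrative assumptions only.

# Minimal value-iteration sketch for equation (1), minimization form,
# with U(X) = (X - Xref)^2.  The scalar plant, the grids, and the noise
# model are illustrative assumptions, not an aircraft model.
import numpy as np

x_grid = np.linspace(-2.0, 2.0, 81)       # discretized state X
u_grid = np.linspace(-1.0, 1.0, 21)       # discretized control u
noise  = np.array([-0.05, 0.0, 0.05])     # equally likely disturbances
x_ref, r = 0.0, 0.05                      # setpoint; small r > 0 keeps J finite

def step(x, u, w):
    """Assumed plant: damped drift plus control plus additive noise."""
    return np.clip(0.95 * x + 0.1 * u + w, x_grid[0], x_grid[-1])

U = (x_grid - x_ref) ** 2                 # cost ("disutility") at each state
J = np.zeros_like(x_grid)                 # cost-to-go, to be solved for

for sweep in range(200):
    J_new = np.empty_like(J)
    for i, x in enumerate(x_grid):
        best = np.inf
        for u in u_grid:
            xn = step(x, u, noise)        # next states under each disturbance
            idx = np.abs(x_grid[None, :] - xn[:, None]).argmin(axis=1)
            q = U[i] + J[idx].mean() / (1.0 + r)   # <J(X(t+1))>/(1+r)
            best = min(best, q)
        J_new[i] = best
    if np.max(np.abs(J_new - J)) < 1e-6:  # stop when (1) is (approximately) satisfied
        J = J_new
        break
    J = J_new

# The control law u(X) is then the minimizing u at each grid point.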
2.2. Alternatives for RFC: Other Optimal Control, Robust Control, Adaptive Control
Dynamic programming gives us the best possible control law u(X), in principle, for all problems which can properly be expressed as dynamic optimization problems. But the exact solution of equation 1 is essentially impossible in the general case. Earlier numerical methods for approximating equation 1 were extremely expensive, because of the curse of dimensionality. Thus in practical control theory, people have been forced to use extreme brute-force approximations or special cases. When U is quadratic and the plant is stochastic but linear, equation 1 reduces to the well-known LQG or LQR optimal control schemes [11,12,13]. Linear-quadratic optimal control can be near-perfect, in practice, when we represent the plant by linearizing about some nominal trajectory, so long as the plant will stay close enough to that trajectory that the linear approximation is accurate. (This is not the case for RFC!) Alternatively, with a nonlinear model but no uncertainty or noise, we can use traditional calculus of variations methods [9,10]; furthermore, when we use such methods to minimize tracking error over time, we end up with nonlinear model-predictive control [8, ch.10] or receding-horizon control. (But the stochastic aspects – the maximization of a probability – are the very core of the RFC problem!)
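For the linear-quadratic special case just mentioned, equation (1) has the closed form J(x) = x'Px, and P can be found by iterating the discrete-time Riccati recursion; the sketch below does exactly that for an assumed two-state linear plant (the matrices are placeholders, not an aircraft model).

# LQR as a special case of equation (1): with U = x'Qx + u'Ru and a linear
# plant, J(x) = x'Px, and P is found by the discrete-time Riccati recursion.
# A, B, Q, R are illustrative placeholders, not an aircraft model.
import numpy as np

A = np.array([[1.0, 0.1],
              [0.0, 0.98]])               # assumed plant: x(t+1) = A x + B u
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)                             # state cost weight
R = np.array([[0.1]])                     # control cost weight

P = Q.copy()
for _ in range(1000):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal gain
    P_next = Q + A.T @ P @ (A - B @ K)
    if np.max(np.abs(P_next - P)) < 1e-10:
        P = P_next
        break
    P = P_next

# The resulting control law is u(x) = -K x, with cost-to-go J(x) = x' P x.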
Both in LQG/LQR and in full dynamic programming, we can control partially observed systems by designing observer systems which estimate the complete state vector, in effect. Kalman filtering is crucial to practical work with LQR/LQG [9,10]. The nonlinear version of it is often crucial to performance with nonlinear ADP [14,15].
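As a reminder of what such an observer does, here is a minimal linear Kalman filter predict/update cycle; the measurement matrix and noise covariances, like the plant matrices above, are assumptions made purely for illustration.

# Minimal Kalman filter sketch: the observer that feeds state estimates to
# LQG control (or, in nonlinear/neural form, to an ADP controller).
# A, B, C, Qw, Rv are illustrative assumptions.
import numpy as np

A  = np.array([[1.0, 0.1], [0.0, 0.98]])
B  = np.array([[0.0], [0.1]])
C  = np.array([[1.0, 0.0]])       # only the first state is measured
Qw = 1e-3 * np.eye(2)             # process-noise covariance
Rv = np.array([[1e-2]])           # measurement-noise covariance

def kalman_step(x_hat, P, u, y):
    """One predict/update cycle; returns the new estimate and covariance."""
    x_pred = A @ x_hat + B @ u                    # predict
    P_pred = A @ P @ A.T + Qw
    S = C @ P_pred @ C.T + Rv                     # update with measurement y
    K = P_pred @ C.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y - C @ x_pred)
    P_new = (np.eye(2) - K @ C) @ P_pred
    return x_new, P_new

# Example call: x_hat, P = kalman_step(np.zeros(2), np.eye(2),
#                                      u=np.array([0.2]), y=np.array([0.9]))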
Robust control and adaptive control seem to offer an alternative to optimal control in stabilizing a damaged aircraft.
Robust control tools like mu synthesis are very highly developed for the linear case. They are closely related to linear-quadratic optimal control [12,13]. However, when a plant is very far away from the nominal trajectory, linear approximations to the dynamics are highly inaccurate. Using linear robust methods, we could represent the nonlinearities as totally unknown disturbances; however, this reduces the possible stability margins, compared with actually using the information at hand about the nature of the nonlinearities and uncertainties. Leading researchers in nonlinear robust control like Baras [16] have shown that the best, full-up nonlinear robust control rule results from solving the “Hamilton-Jacobi-Bellman” equation (essentially just equation 1) with appropriate choices for U and other inputs. This brings us back to the task of approximating the solution to equation 1 as accurately as possible, for medium-scale problems like aircraft control.
Adaptive control seems to offer a totally different alternative. There are good intuitive reasons to expect that adaptive control should be able to stabilize a wider class of plants than any fixed, feedforward robust controller. (See my review slides up at www.iamcm.org, or [14].) But practical experience suggests that robust control is far more reliable than the forms of adaptive control used most widely today. Narendra and others have proven important total-system stability theorems for adaptive control – but only for a very limited class of plants, plants which obey very stringent assumptions. Narendra and Annaswamy prove such theorems for linear adaptive control [1], and Narendra and Mukhopadhyay [2] summarize very similar theorems for the general nonlinear case (where the linear approximator is replaced by a general nonlinear approximator, an artificial neural network (ANN)). Narendra and Annaswamy [1] summarize many years of research which had limited success at best in relaxing the restrictive assumptions, in search of “universal” stable adaptive control (the ability to stabilize any stabilizable plant), even in the linear case.
Almost all of the applications of neural adaptive control or general nonlinear adaptive control in recent years have been based on the designs discussed in [2], with or without minor modifications – with all
the limitations that this implies. In 1998 [13], I described how certain forms of ADP can be viewed as a relatively simple extension of adaptive control. I showed that they overcome the key difficulties in obtaining universal stable adaptive control. I have not yet proved the corresponding total system stability theorems, but these model-based ADP systems have already been used successfully in a number of difficult
problems where older methods have not worked as well [7,14].
In recent years, Narendra has argued that the stability of adaptive control can be improved by using multiple model or switching methods. In actuality, multiple model approaches are most useful in
cases where there are reasons to partition the state space, either because the behavior is drastically different in different regions or because there is a need to use multiple time scales in developing a controller; in such cases, the best performance should come from combining multiple model approaches together with those forms of ADP which provide optimal performance and stability within any region of state space [14]. This is an extremely important area for future research, which may indeed be relevant to RFC. As a preliminary step, however, the full exploration of ADP on RFC will also be important.
2.3. Approximate Dynamic Programming (ADP)
Early methods to solve equation 1 numerically were usually expensive for medium-scale problems, because they allowed for any possible function J. In a series of journal articles and other papers from 1968 to 1987, I proposed that we could overcome this difficulty, and develop general-purpose reinforcement learning machines, by using a general-purpose nonlinear approximation scheme (like a neural network) to represent J. I developed a series of methods to adapt or estimate the parameters of J, starting from Heuristic Dynamic Programming (HDP, first published in 1977, and later rediscovered as “generalized TD”[17]) and proceeding to more advanced methods. (See [13] for the history.)
Modern ADP is a very complex family of designs, ranging from relatively simple popularized methods which are very robust on small problems involving discrete decisions (and good at explaining many experiments in animal behavior) through to more complex designs which offer a serious hope of replicating true brain-like intelligence. The first working ADP system, implemented by Widrow [6], may be seen as “level zero” on the ladder of basic designs. The popular lookup-table TD and “Q learning” designs [3,8] may be viewed as level one. The ADP system used by White and Urnes was a level two system, based on what I would call “Action-Dependent HDP” [8]. This system used a stream of derivative feedback to train the Action network or controller, enabling much faster training than with level 1 designs, which White also tried out in various tests.
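A rough sketch of the level-two idea follows: an action-dependent critic Q(X,u) is adapted from one-step temporal differences, and the Action network is nudged along the derivative of the critic with respect to the control, with no plant model required. The scalar plant, the quadratic critic, and the linear actor are stand-ins for illustration, not the actual White and Urnes implementation.

# Action-dependent HDP sketch: the critic Q(x,u) is trained on a one-step
# Bellman target, and the actor is trained from dQ/du (derivative feedback),
# without any plant model.  Plant, critic form, and actor form are assumed.
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, 0.0, 1.0])      # critic: Q = w0*x^2 + w1*x*u + w2*u^2
k = 0.0                            # actor:  u = -k * x
lr_c, lr_a, gamma = 0.05, 0.01, 0.95

def plant(x, u):                   # assumed scalar plant with a little noise
    return 0.95 * x + 0.2 * u + 0.01 * rng.standard_normal()

def Q(x, u, w):
    return w[0]*x*x + w[1]*x*u + w[2]*u*u

x = 1.0
for t in range(5000):
    u = -k * x
    x_next = plant(x, u)
    u_next = -k * x_next
    cost = x * x                                  # U: squared deviation from 0
    # Critic update: move Q(x,u) toward the one-step target
    target = cost + gamma * Q(x_next, u_next, w)
    grad_w = np.array([x*x, x*u, u*u])            # dQ/dw at (x,u)
    w += lr_c * (target - Q(x, u, w)) * grad_w
    # Actor update: descend dQ/du, chained through du/dk = -x
    dQ_du = w[1]*x + 2.0*w[2]*u
    k -= lr_a * dQ_du * (-x)
    x = x_next if abs(x_next) < 5.0 else rng.uniform(-1.0, 1.0)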
Levels 3 through 5, in my classification, are all model-based designs. They all require the use of some sort of model of the plant, which may be anything from a differentiable first-principles model through to a neural network trained in real time to emulate the plant [14,15]. They all train the Action network or controller based on derivatives calculated by my method of “backpropagating through the model,” described in [3,4,8,13,18]. Backpropagation through a model yields the same results as the brute-force methods usually used to calculate the same derivatives in Indirect Adaptive Control (IAC), but it allows faster calculation and real-time implementation, particularly for larger-scale problems. The level 3 design is essentially the same as IAC, a la Narendra, except that the block which calculates tracking error in IAC is replaced by a neural network or quadratic system which approximates J. Since J is a Lyapunov function of the system, this converges to stable control in the general case, for plants which are stabilizable.
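The core of the level-3 update can be sketched in a few lines: the derivative of the immediate cost plus the next-step cost-to-go is backpropagated through a differentiable plant model to the control, and from there into the Action network's weights. The linear model, the fixed quadratic critic, and the linear actor below are assumed placeholders, not a real aircraft model or a trained Critic.

# Backpropagating through the model (level 3): the gradient of
# U(x,u) + J(x_next) is chained through the plant model into the actor.
# Model, critic matrix P, and actor are illustrative assumptions.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 0.98]])   # assumed differentiable plant model
B = np.array([0.0, 0.1])
P = np.array([[5.0, 0.5], [0.5, 2.0]])    # critic J(x) = x' P x (held fixed here)
rho = 0.1                                 # control penalty in U = x'x + rho*u^2
w = np.zeros(2)                           # Action network: u = w @ x
lr = 0.01

x = np.array([1.0, -0.5])
for t in range(2000):
    u = w @ x
    x_next = A @ x + B * u                # forward pass through the model
    dJ_dxn = 2.0 * P @ x_next             # backward pass: dJ/dx_next ...
    d_total_du = 2.0 * rho * u + B @ dJ_dxn   # ... chained to u via dx_next/du = B
    w -= lr * d_total_du * x              # ... and to the weights via du/dw = x
    x = x_next if np.linalg.norm(x_next) < 10.0 else np.array([1.0, -0.5])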
In fairness to Barto and Watkins, I should note that this scheme of “levels” is a bit oversimplified.
Truly complex problems in higher-level intelligent control will involve a combination of discrete and continuous variables, requiring a complex integration of “level one” methods and model-based methods. Beyond these basic designs, there are already many higher-level ADP designs – such as the multiple-model ADP mentioned above – which are described further in the references, along with equations, flowcharts, and examples of applications.
3. Previous Experience and Practical Issues With RFC
For reasons of space, this section will not repeat material on RFC and on control designs already discussed
in sections 1 and 2.
The seminal work by White and Urnes [8] was a simulation study. They started from McDonnell-Douglas’ very sophisticated internal model of the F-15. They generated a wide range of random damage
conditions, not limited to the smaller damage sets considered by some later researchers in this field.
They imposed a requirement that the standby controller system must learn to stabilize the craft within two seconds of the damage; otherwise, they would declare a crash condition. With the level two ADP controller described in [8], they were able to reduce the probability of crash in simulation from close to 100% down to about 50%. Again, this was the precursor to the much larger projects managed through NASA Ames, and then elsewhere.
Circa 1993, Charles Jorgensen of NASA Ames proposed a “step one” approach to RFC, combining a neural network estimator with a linear-quadratic optimization scheme. NASA Dryden stressed that neural networks (like human pilots!) are certainly not excluded from their V&V process; the challenge was to systematically develop procedures to provide quality control in a pragmatic way, aimed at reducing the actual probability of crashes in the real world (which will unfortunately never be exactly zero).
In actuality, there are substantial opportunities to enhance the performance of that class of design, as discussed in section 2.
The best possible performance in RFC should be expected from a combined off-line, on-line approach. Prior to any new test flights, one would want to do the best one can with offline learning.
This would start out with the creation of a “metamodel,” as discussed in the highly successful work of Ford Research in the clean car area. (See [14] for a discussion of that work, involving “multistreaming.” See [8,14,15] for system identification methods important to best performance in that task.) This metamodel should allow one to simulate the effect of both “normal” and “unanticipated” types of faults. Model-based ADP systems combined with neural network observers (TLRN system identifiers) trained offline based on such a metamodel should perform far better than the LQG style of “step one” RFC system, even without any use of real-time learning or adaptation. They should be able to deal with the V&V process in much the same way that Jorgensen envisioned for the earlier step one approach. It should be possible to “beat this system to death” in simulations, in the same way that the crucial neural net estimator could have been beaten to death in simulation in Jorgensen’s original step one plan.
On the other hand, it may be possible to improve stability still further by starting out from the best RFC controller obtained from offline training, and adding an online training/adaptation scheme to further adapt the parameters. Maybe. This is not entirely obvious in the RFC application. But it is fairly obvious that one could improve
on the neural adaptive control approach by changing it in two ways: (1) initializing it with an Action network developed by intense offline training as above; (2) replacing the simple measure of tracking error used in IAC by the Critic network developed in the offline training. In simulation studies, Balakrishnan has demonstrated enormous improvements in IAC performance in aircraft and missile control, after disturbances, when an offline-trained Critic network is used in place of the usual square error measure to
control the real-time adaptation. (In some cases, this may also require some offline training of learning rates in order to optimize performance; however, Balakrishnan reported no need for such additional tricks.)
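In pseudocode terms, the two changes amount to this: start the online loop from the Action weights found offline, and drive each online weight update with the derivative of the offline-trained Critic rather than with the squared one-step tracking error. The names w_offline, critic_grad, and plant_jacobian_u below are hypothetical stand-ins for whatever the offline stage would actually produce.

# One online adaptation step with the offline-trained Critic substituted for
# the usual squared tracking error.  critic_grad(x) and plant_jacobian_u(x,u)
# are hypothetical stand-ins for outputs of the offline training stage.
import numpy as np

def adapt_step(w, x, u, x_next, critic_grad, plant_jacobian_u, lr=1e-3):
    """Online update of linear Action weights w (u = w @ x).

    Standard IAC would use (x_next - x_ref) here; instead the derivative of
    the offline-trained cost-to-go J is backpropagated through the plant
    Jacobian to the control and then to the weights.
    """
    dJ_dxn = critic_grad(x_next)               # dJ/dx at the observed next state
    dJ_du = plant_jacobian_u(x, u) @ dJ_dxn    # chain through d(x_next)/du
    return w - lr * dJ_du * x                  # du/dw = x for a linear actor

# Change (1) from above: initialize w with the offline result, e.g.
#     w = w_offline.copy()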
In practice, with a well-developed metamodel, it is not so clear what additional advantages may accrue to real-time learning in this particular application. It is much more plausible, however, that a use
of partitioned state spaces may improve ADP performance in some ways. The kinds of partitions used by Motter to improve wind tunnel control may also be used here, in conjunction with new methods for
multi-model ADP [13,19]. This could be a significant step both in improving aviation safety and in
demonstrating the value of more truly brain-like approaches to intelligent control.
References
1. K.Narendra & A.Annaswamy, Stable Adaptive Systems, Prentice-Hall, 1989.
2. K.Narendra & S.Mukhopadhyay, Intelligent control using neural networks. In M.Gupta & N.Sinha, eds., Intelligent Control Systems, IEEE Press, 1996.
3. D.P.Bertsekas & J.N.Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.
4. W.T.Miller, R.Sutton & P.Werbos, eds., Neural Networks for Control, MIT Press, 1990 (now in paperback), chs. 2 and 3.
5. P.Werbos, The elements of intelligence. Cybernetica (Namur), No. 3, 1968.
6. B.Widrow, N.Gupta & S.Maitra, Punish/reward: learning with a Critic in adaptive threshold systems, IEEE Trans. SMC, 1973, Vol. 5, p.455-465.
7. P.Werbos, New Directions in ACDs: Keys to Intelligent Control and Understanding the Brain. In Proc. IJCNN 2000, IEEE, 2000.
8. D.White & D.Sofge, eds., Handbook of Intelligent Control, Van Nostrand, 1992.
9. P.Werbos, Neural networks for control and system identification, Proc. IEEE Conf. Decision and Control (CDC89), IEEE, 1989.
10. J.Von Neumann & O.Morgenstern, The Theory of Games and Economic Behavior, Princeton U. Press, 1953.
11. A.Bryson & Y.C.Ho, Applied Optimal Control, Ginn, 1969.
12. R.F.Stengel, Optimal Control and Estimation, Dover edition, 1994.
13. P.Werbos, Stable Adaptive Control Using New Critic Designs. xxx.lanl.gov: adap-org/9810001 (October 1998).
14. P.Werbos, Neurocontrollers. In J.Webster, ed., Encyclopedia of Electrical and Electronics Engineering, Wiley, 1999.
15. J.Principe, N.Euliano & W.C.Lefebvre, Neural and Adaptive Systems: Fundamentals Through Simulations, Wiley, 2000.
16. J.S.Baras & N.S.Patel, Information state for robust control of set-valued discrete time systems, Proc. 34th Conf. Decision and Control (CDC), IEEE, 1995, p.2302.
17. R.S.Sutton, Learning to predict by the methods of temporal differences, Machine Learning, Vol. 3, p.9-44, 1988.
18. P.Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, Wiley, 1994.
19. P.Werbos, Multiple Models for Approximate Dynamic Programming and True Intelligent Control: Why and How. In K.Narendra, ed., Proc. 10th Yale Conf. on Learning and Adaptive Systems. New Haven: K.Narendra, EE Dept., Yale U., 1998.