xspace for space after text in newcommands
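A minimal sketch of the usual pattern (the macro name \eg is illustrative, not one defined in this preamble):

#+begin_src latex
\usepackage{xspace}
% \xspace adds a trailing space unless the macro is followed by punctuation.
\newcommand{\eg}{e.g.\xspace}
#+end_src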
By default, if a figure consumes around 60% of the page it will be placed on its own float page. To change this, we adjust the value of the \floatpagefraction parameter.
See more information here.
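A sketch of the adjustment; the value 0.8 is illustrative, not necessarily the one used in this thesis:

#+begin_src latex
% A float page must now be at least 80% full before a figure gets a page to itself.
\renewcommand{\floatpagefraction}{0.8}
#+end_src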
Allow images to be cropped
Self-explanatory.
The bookmark package implements a new bookmark (outline) organisation for the hyperref package. This lets us change the “tree navigation” associated with the generated pdf and constrain the menu to heading depth 2 (H:2).
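A minimal sketch of restricting the outline depth with bookmark (assuming the usual \bookmarksetup interface):

#+begin_src latex
\usepackage{bookmark}   % conventionally loaded after hyperref
\bookmarksetup{depth=2} % only show headings down to level 2 in the PDF outline
#+end_src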
Symbols such as the diamond suit, which can be used for aesthetically separating paragraphs, could be added with the fdsymbol package. I’ll use bbding, which offers the more visually appealing \FourStar. I took this idea from the thesis of the mimosis package author.
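A sketch of such a separator; \paragraphSeparator is a hypothetical macro name, not one defined here:

#+begin_src latex
\usepackage{bbding}
% Hypothetical helper: a centred four-pointed star between paragraphs.
\newcommand{\paragraphSeparator}{\begin{center}\FourStar\end{center}}
#+end_src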
The csquotes package offers context sensitive quotation facilities, improving the typesetting of inline quotes.
Already imported by mimosis class.
To enclose quote environments with quotes from csquotes, see the following TeX SE thread.
And then use quotes as:
# The options parameter adds text after the environment. We use it to add the author.
#+ATTR_LATEX: :options {\cite{Frahm1994}}
/Current (fMRI) applications often rely on "effects" or "statistically significant differences", rather than on a proper analysis of the relationship between neuronal activity, haemodynamic consequences, and MRI physics./
Note that org-ref links won’t work here because the ATTR_LATEX value is pasted as-is into the .tex file.
The datetime package allows us to specify a “formatted” date object, which prints different formats according to the current locale and language. I use this in my title page.
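A sketch assuming the datetime2 successor package; the macro name and date shown are placeholders, not the actual title-page values:

#+begin_src latex
\usepackage[useregional]{datetime2}
% \DTMdisplaydate{year}{month}{day}{dow}; dow = -1 omits the weekday.
\newcommand{\submissionDate}{\DTMdisplaydate{2022}{1}{1}{-1}}
#+end_src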
General configuration.
Improvements provided with the Mimosis class.
Remove ISSN, DOI and URL to shorten the bibliography.
And increase the spacing between the entries, as the default is too small.
Also reduce the font size.
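A sketch of these three tweaks, assuming biblatex is the bibliography backend:

#+begin_src latex
% Drop ISSN, DOI and URL fields from every entry.
\AtEveryBibitem{%
  \clearfield{issn}%
  \clearfield{doi}%
  \clearfield{url}%
}
\setlength{\bibitemsep}{0.5\baselineskip} % more space between entries
\renewcommand*{\bibfont}{\small}          % smaller bibliography font
#+end_src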
The following commands make chapter numbers BrickRed, which looks similar to the Donders colour.
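One way to achieve this is with titlesec; this is an assumption, since the actual commands are not shown here:

#+begin_src latex
\usepackage[dvipsnames]{xcolor}
\usepackage{titlesec}
% Colour only the chapter number, leaving the title text black.
\titleformat{\chapter}[display]
  {\normalfont\huge\bfseries}
  {\color{BrickRed}\chaptertitlename\ \thechapter}
  {20pt}{\Huge}
#+end_src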
Already imported when using mimosis.
Fine tuning of spacing between paragraphs. See thread here.
Make the equation numbers follow the chapter, not the whole thesis.
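With amsmath loaded, this is a one-liner:

#+begin_src latex
\numberwithin{equation}{chapter} % e.g. (3.1), (3.2) within Chapter 3
#+end_src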
ps2ascii phd-thesis.pdf | wc -w
{{{acronym(brl,BRL,Bristol Robotics Laboratory)}}}
{{{acronym(mogpe,MoGPE,Mixtures of Gaussian Process Experts)}}}
{{{acronym(moe,MoE,Mixture of Experts)}}}
{{{acronym(mosvgpe,MoSVGPE,Mixtures of Sparse Variational Gaussian Process Experts)}}}
{{{acronym(gp,GP,Gaussian Process)}}}
{{{acronym(gps,GPs,Gaussian Processes)}}}
{{{acronym(svgp,SVGP,Sparse Variational Gaussian Process)}}}
{{{acronym(fitc,FITC,Fully Independent Training Conditional)}}}
{{{acronym(mdp,MDP,Markov Decision Process)}}}
{{{acronym(ard,ARD,Automatic Relevance Determination)}}}
{{{acronym(ode,ODE,Ordinary Differential Equation)}}}
{{{acronym(sde,SDE,Stochastic Differential Equation)}}}
{{{acronym(elbo,ELBO,Evidence Lower Bound)}}}
{{{acronym(vae,VAE,Variational Auto-Encoder)}}}
{{{acronym(rl,RL,Reinforcement Learning)}}}
{{{acronym(mbrl,MBRL,Model-Based Reinforcement Learning)}}}
{{{acronym(mfrl,MFRL,Model-Free Reinforcement Learning)}}}
{{{acronym(hmm,HMM,Hidden Markov Model)}}}
{{{acronym(svi,SVI,Stochastic Variational Inference)}}}
{{{acronym(soc,SOC,Stochastic Optimal Control)}}}
{{{acronym(mpc,MPC,Model Predictive Control)}}}
{{{acronym(lqr,LQR,Linear Quadratic Regulator)}}}
{{{acronym(ddp,DDP,Differential Dynamic Programming)}}}
{{{acronym(ilqr,iLQR,iterative Linear Quadratic Regulator)}}}
{{{acronym(ilqg,iLQG,iterative Linear Quadratic Gaussian)}}}
{{{acronym(pid,PID,Proportional Integral Derivative)}}}
{{{acronym(mcmc,MCMC,Markov Chain Monte Carlo)}}}
{{{acronym(slsqp,SLSQP,Sequential Least Squares Programming)}}}
{{{acronym(epsrc,EPSRC,Engineering and Physical Sciences Research Council)}}}
{{{acronym(farscope,FARSCOPE,Future Autonomous and Robotic Systems)}}}
{{{acronym(pets,PETS,Probabilistic Ensembles with Trajectory Sampling)}}}
{{{acronym(pipps,PIPPS,Probabilistic Inference for Particle-based Policy Search)}}}
{{{acronym(pilco,PILCO,Probabilistic Inference for Learning COntrol)}}}
{{{acronym(gpssm,GPSSM,Gaussian Process State Space Model)}}}
{{{acronym(bald,BALD,Bayesian Active Learning by Disagreement)}}}
{{{acronym(ig,IG,Indirect Optimal Control via Latent Geodesics)}}}
{{{acronym(dre,DRE,Direct Optimal Control via Riemannian Energy)}}}
{{{acronym(cai,MRCaI,Mode Remaining Control as Inference)}}}
{{{acronym(cai_unimodal,CaI,Control as Inference)}}}
{{{acronym(modeopt,ModeOpt,Mode Optimisation)}}}
{{{acronym(nlpp,NLPP,Negative Log Predictive Probability)}}}
{{{acronym(mae,MAE,Mean Absolute Error)}}}
{{{acronym(rmse,RMSE,Root Mean Squared Error)}}}
The modern world is pervaded with dynamical systems that we seek to control to achieve a desired behaviour. Examples include autonomous vehicles, aircraft, robotic manipulators, financial markets and energy management systems. In the last decade, \acrfull{rl}, and learning-based control in general, have become popular paradigms for controlling dynamical systems citep:hewingLearningBased2020,sutton2018reinforcement. This can be attributed to significant improvements in sensing and computational capabilities, as well as recent successes in machine learning.
This growing interest in learning-based control has emphasised the importance of real-world considerations. Real-world systems are often highly nonlinear, exhibit stochasticity and multimodalities, are expensive to run (energy-intensive, subject to wear and tear) and must be controlled subject to constraints (for safety, efficiency, and so on). In contrast to simulation, the control of physical systems also has real-world consequences: components may get damaged, the system may damage its environment, or the system may catastrophically fail. As such, any learning-based control strategy deployed in the real world should handle both the uncertainty inherent to the environment and the uncertainty introduced by learning from observations.
Many dynamical systems exhibit multimodalities, where some of the dynamics modes are believed to be inoperable or undesirable. These multimodalities may be due to spatially varying model parameters, for example, process noise terms modelling aircraft turbulence, or friction coefficients modelling surface-tyre interactions for ground vehicles over different terrain. In these systems, it is desirable to avoid entering specific dynamics modes that are believed to be inoperable. Perhaps they are hard to control due to instability, or the switch between modes is difficult to control. Alternatively, they may be inefficient or low performing. Given these motivations, this thesis focuses on controlling dynamical systems from an initial state to a target state, whilst avoiding specific dynamics modes.
Model-based control comprises a powerful set of techniques for finding controls of constrained dynamical systems, given a dynamics model describing the evolution of the controlled system. It is commonly used for controlling aircraft, robotic manipulators, and walking robots citep:vonstrykDirect1992,bettsSurvey1998,gargUnified2010. One caveat is that it requires a relatively accurate mathematical model of the system. Traditionally, these mathematical models are built using first principles based on physics. However, accurately modelling the underlying transition dynamics can be challenging and lead to the introduction of model errors. These model errors may be due to incorrectly specifying model parameters or the models themselves, for example, modelling a nonlinear system as linear, or a multimodal system as unimodal. Incorrectly specifying these model parameters and their associated uncertainty can have a detrimental impact on controller performance. Handling these issues is a central goal of robust and stochastic optimal control citep:freemanRobust1996,stengelStochastic1986.
The difficulties associated with constructing mathematical representations of dynamical systems can be overcome by learning from observations citep:ljungSystem1999. Learning dynamics models has the added benefit that it alleviates the dependence on domain experts for specifying accurate models, making it easier to deploy more general techniques. However, learning dynamics models for control introduces other difficulties. For example, it is important to know where the model cannot predict confidently due to a lack of training observations. This concept is known as epistemic uncertainty and is reduced in the limit of infinite data. Correctly quantifying uncertainty is crucial for intelligent decision-making.
In a risk-averse setting, control strategies should avoid entering regions of a learned dynamics model with high epistemic uncertainty. This is because it is impossible to guarantee constraint satisfaction in a learned model, i.e. if the trajectory will avoid the undesired dynamics mode. Conversely, in an explorative setting, if the epistemic uncertainty has been quantified, it can be used to guide exploration into regions of the dynamics that have not previously been observed. This experience can then be used to update the model, in turn reducing its epistemic uncertainty.
These two settings are the main focus of this thesis. If the dynamics are not fully known a priori, an agent will not be able to confidently plan a risk-averse trajectory to the target state. How can the agent explore its environment, in turn reducing the epistemic uncertainty associated with its dynamics model? As this thesis assumes that the dynamics are not fully known a priori, a main interest is jointly inferring the underlying dynamics modes, as well as how the system switches between them, through repeated interactions with the system. Once the agent has explored enough, how can this learned dynamics model be exploited to plan risk-averse trajectories that remain in the desired dynamics mode?
The methods developed throughout this thesis are motivated by a 2D quadcopter navigation example.
See cref:fig-problem-statement for a schematic of the environment and details of the problem.
The goal is to fly the quadcopter from an initial state
- Mode 1 :: is an operable dynamics mode away from the fan,
- Mode 2 :: is an inoperable, turbulent dynamics mode in front of the fan.
The turbulent dynamics mode is subject to higher drift (in the negative
The state-space of the velocity controlled quadcopter example consists of the 2D Cartesian coordinates
The overall goal is to navigate from an initial state
More formally, this thesis considers nonlinear, stochastic, multimodal dynamical systems,
with continuous states
The state difference between time
It considers systems where the underlying dynamics modes are defined by disjoint state domains.
That is, each dynamics mode is defined by its state domain
Notice that each mode’s transition dynamics are free to leave their state space
The overall goal is to navigate from an initial state
Notice that the controller
Given this definition of a mode remaining controlled system, this work seeks to solve,
where
Given a historical data set of state transitions from the environment, a common approach is to first learn a single-step dynamics model using a Gaussian process citep:deisenrothPILCO2011,doerrOptimizing2017,vinogradskaStability2016,rohrProbabilistic2021,hewingLearningBased2020,kollerLearningBased2018. This learned dynamics model can then be leveraged for model-based control.
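Schematically, such a single-step model places a \acrshort{gp} prior on the transition function (a sketch in generic notation, not necessarily the exact symbols defined elsewhere in this thesis):
\begin{align}
\Delta\state_t = f(\state_t, \control_t) + \epsilon_t,
\qquad f \sim \mathcal{GP}\left(0, k\right),
\qquad \epsilon_t \sim \mathcal{N}\left(0, \Sigma_{\epsilon}\right),
\end{align}
where $\Delta\state_t = \state_{t+1} - \state_t$ is the state difference and $k$ is a covariance function.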
cref:fig-traj-opt-over-gating-mask-svgp-all-baseline demonstrates the shortcomings of performing trajectory
optimisation in a \acrshort{gp} dynamics model trained on state transitions sampled from both dynamics modes.
It shows that the \acrshort{gp} is not able to learn a representation of the dynamics which is true to the underlying system.
This is because it cannot model the multimodal behaviour present in the environment.
The optimised controls successfully drive the \acrshort{gp} dynamics from the start state
In contrast,
cref:fig-traj-opt-over-gating-mask-svgp-desired-baseline shows results after training on state transitions from
only the desired (operable) dynamics mode.
It is able to accurately predict state transitions in the desired dynamics mode.
However, as this approach only considers the dynamics of the desired mode,
trajectories in the environment deviate from those planned in the learned dynamics model when they pass through the
turbulent mode.
This approach is problematic because the trajectory does not reach the target state
cref:chap-dynamics strives to address these issues by learning representations of multimodal dynamical systems that correctly identify the underlying dynamics modes and how the system switches between them. cref:chap-traj-opt-geometry
Conveniently, the \acrshort{mosvgpe} method from cref:chap-dynamics can be used to learn a factorised representation of the underlying dynamics modes. This method correctly identifies the underlying dynamics modes and provides informative latent spaces that can be used to encode mode remaining behaviour into control strategies. In particular, the \acrshort{gp}-based gating network infers informative latent structure. The remainder of this chapter is concerned with encoding mode remaining behaviour into trajectory optimisation algorithms after training a \acrshort{mosvgpe} dynamics model on the historical data set of state transitions.
The main goals of the trajectory optimisation in this chapter can be summarised as follows:
- Goal 1 :: Remain in the desired dynamics mode $\desiredMode$, label:to-goal-mode
- Goal 2 :: Avoid regions of the learned dynamics with high epistemic uncertainty, i.e. regions that cannot be predicted confidently (for example, due to limited training observations), label:to-goal-unc
  - in the desired dynamics mode,
  - in how the system switches between the underlying dynamics modes.
\todo{update fig-traj-opt-baseline with trajectories which don’t use nominal mean function so that they deviate in the region with high epistemic uncertainty.}
All of the methods in this thesis relate back to this mode remaining navigation problem. For this reason, the problem is formally defined here.
This thesis explores methods for mode remaining control in multimodal dynamical systems that explicitly reason about the uncertainties arising during learning and control. The primary contributions of this thesis are as follows:
- cref:chap-dynamics: is concerned with learning representations of multimodal dynamical systems, where both the underlying dynamics modes and how the system switches between them are not fully known a priori. Motivated by learning dynamics models for model-based control, it formulates a probabilistic model rich with latent spaces for control. It then derives a variational inference scheme that handles uncertainty in a principled manner whilst providing scalability via stochastic gradient methods. The method is a \acrfull{mogpe} method with a \acrfull{gp}-based gating network.
- cref:chap-traj-opt-control: investigates model-based control techniques that leverage the probabilistic model
from cref:chap-dynamics to solve the mode remaining navigation problem.
Due to the complexity of the problem, this chapter
assumes prior access to the environment, such that a data set of state transitions
$\mathcal{D}$ has previously been collected and used to train the model. It presents three trajectory optimisation algorithms that leverage the learned dynamics model’s latent structure to solve the mode remaining navigation problem.
- cref:chap-active-learning: then considers the more realistic scenario of not having prior access to the environment. In this scenario, the agent does not have access to a historical data set for model learning. Instead, it must actively explore its environment, collect data and use it to update its dynamics model, whilst simultaneously attempting to remain in the desired dynamics mode. It presents an exploration strategy for exploring multimodal dynamical systems whilst remaining in a desired dynamics mode with high probability. It then details how this exploration strategy can be combined with the methods from cref:chap-traj-opt-control,chap-dynamics to solve the mode remaining navigation problem.
The first trajectory optimisation algorithm presented in cref:sec-traj-opt-collocation and an initial version of the approach for learning multimodal dynamical systems in cref:chap-dynamics, are published in:
Many physical systems operate under switching dynamics modes due to
changing environmental or internal conditions.
Examples include robotic grasping where objects with different
properties have to be manipulated, robotic locomotion in environments with varying surface types
and the control of aircraft in environments subject to different levels of turbulence.
When controlling these systems, it may be preferred to find trajectories that remain
in a single dynamics mode.
This paper is interested in controlling a DJI Tello quadcopter in an environment
with spatially varying turbulence induced by a fan at the side of the room, shown
in Fig. ref:fig-problem-statement.
It is hard to know the exact transition dynamics due to complex and uncertain
interactions between the quadcopter and the fan.
The system’s transition dynamics resemble a mixture of two modes: a turbulent mode in front of
the fan and a non-turbulent mode everywhere else.
When planning a trajectory from start state
Trajectory optimisation comprises a powerful set of techniques for finding open-loop controls of dynamical systems such that an objective function is minimised whilst satisfying a set of constraints. It is commonly used for controlling aircraft, robotic manipulators, and walking robots cite:VonStryk1992,Betts1998,Garg2010. One caveat to trajectory optimisation is that it requires a relatively accurate mathematical model of the system. Traditionally, these mathematical models are built using first principles based on physics. However, accurately modelling the underlying transition dynamics can be challenging and lead to the introduction of model errors. For example, both observation and process noise are inherent in many real-world systems and can be hard to model due to both spatial and temporal variations. Incorrectly accounting for this uncertainty can have a detrimental impact on controller performance and is an active area of research in the robust and stochastic optimal control communities cite:FreemanRandyA.2009,Stengel1988.
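In generic notation (a sketch; the symbols are not necessarily those used in the paper), the underlying problem is
\begin{align}
\min_{\control_{1:T-1}} \; \sum_{t=1}^{T-1} c(\state_t, \control_t) + c_T(\state_T)
\quad \text{s.t.} \quad \state_{t+1} = f(\state_t, \control_t),
\quad \state_t \in \mathcal{X}, \; \control_t \in \mathcal{U},
\end{align}
where $c$ is a stage cost, $c_T$ a terminal cost, $f$ the dynamics model, and $\mathcal{X}$, $\mathcal{U}$ the feasible state and control sets.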
The difficulties associated with constructing mathematical models can be overcome by learning from observations cite:Ljung1997. However, learning dynamics models for control introduces other difficulties. For example, it is important to know where the model cannot predict confidently due to a lack of training observations. This concept is known as epistemic uncertainty and is reduced in the limit of infinite data. Probabilistic models have been used to account for epistemic uncertainty and also provide a principled approach to modelling stochasticity i.e. aleatoric uncertainty cite:Schneider1996,Deisenroth2011. For example, cite:Cutler,Deisenroth2011,Pan2014 use Gaussian processes (GPs) to learn transition dynamics. \acrshort{gps} lend themselves to data-efficient learning through the selection of informative priors, and when used in a Bayesian setting offer well calibrated uncertainty estimates. Methods for learning probabilistic multimodal transition dynamics have also been proposed: cite:Mckinnon used a mixture of \acrshort{gp} experts method, cite:Moerland studied the use of deep generative models and cite:Kaiser2020a proposed a Bayesian model that learns independent dynamics modes whilst maintaining a probabilistic belief over which mode is responsible for predicting at a given input location.
There has also been work developing control algorithms exploiting learned multimodal transition dynamics cite:Herzallah2020. However, our work differs as it seeks to find trajectories that remain in a single dynamics mode whilst avoiding regions of the transition dynamics that cannot be predicted confidently. To the best of our knowledge, there is no previous work addressing such trajectory optimisation in transition dynamics models.
Probabilistic modelling and Bayesian inference are a promising avenue for learning dynamics models to be used for controlling real-world systems. \parmarginnote{probabilistic modelling} The Bayesian framework provides a principled approach to modelling both the epistemic uncertainty associated with the model, and the aleatoric uncertainty inherent to the system (e.g. process noise). For example, cite:deisenrothPILCO2011,cutlerEfficient2015,panProbabilistic2014 use Gaussian processes (GPs) to learn transition dynamics from observations. \acrshort{gps} lend themselves to data-efficient learning through the selection of informative priors, and when used in a Bayesian setting offer well calibrated uncertainty estimates. Methods for learning probabilistic multimodal transition dynamics have also been proposed: cite:mckinnonLearning2017 use a Mixture of \acrshort{gp} Experts method, cite:moerlandLearning2017 study the use of deep generative models and cite:kaiserBayesian2020 propose a Bayesian model that learns independent dynamics modes whilst maintaining a probabilistic belief over which mode is responsible for predicting at a given input location.
The primary goal of this thesis is to control stochastic, multimodal, nonlinear dynamical systems to a target state, whilst remaining in the desired dynamics mode. This is a \acrfull{soc} problem which can be summarised as follows:
This chapter formally defines this mode remaining navigation problem and reviews the relevant literature.
Dynamical systems describe the behaviour of a system over time
where the discrete mode indicator function
This thesis assumes that the state
Optimal control
Optimal control is a branch of mathematical optimisation that seeks to find a controller
where
Controller space
The controller space
Mode remaining
This thesis considers systems where the underlying dynamics modes are defined by disjoint state domains, i.e.
Note that the objective function in cref:eq-optimal-control-objective is not of primary interest in this work.
The novelty of this work arises from remaining in the desired dynamics mode
The discrete-time optimal control problem considered in this thesis can be modelled as a
\acrfull{mdp}, as seen in cref:fig-mdp.
The \acrshort{mdp} framework refers to the controller
In general, solutions to optimal control problems can be characterised by two different approaches, Pontryagin’s Minimum Principle citep:pontryagin1987mathematical (based on the calculus of variations) and Bellman’s Minimum Principle citep:bellmanDynamic1956. Although Pontryagin’s Minimum Principle offers computational benefits over Bellman’s principle, it does not readily generalise to the stochastic case. For this reason, we restrict our discussion to Bellman’s principle and the solution arising from it, known as dynamic programming.
Dynamic programming citep:bellmanDynamic1956 encompasses a large class of algorithms that can be
used to find optimal controllers given a model of the environment as an \acrshort{mdp}.
However, classical dynamic programming algorithms are of limited use as they rely on
accurate dynamics models and have a significant computational expense.
Nevertheless, they are still important theoretically.
The main idea of dynamic programming (and \acrshort{rl} in general) is to structure the search for good controllers using
value functions
In the finite horizon setting, directly solving the Bellman equations backwards in time is referred to as dynamic programming.
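As a sketch in generic notation, the finite-horizon recursion solved backwards from $t = T$ is
\begin{align}
V_T(\state) = c_T(\state),
\qquad
V_t(\state) = \min_{\control} \; c(\state, \control) + \E\left[ V_{t+1}(\state_{t+1}) \mid \state_t = \state, \control_t = \control \right],
\end{align}
where $V_t$ is the value function at time $t$, $c$ and $c_T$ are stage and terminal costs, and the expectation is taken over the stochastic transition dynamics.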
Approximate dynamic programming encompasses a large class of methods that, given a controller
Model-based vs model-free. These methods are out of the scope of this thesis.
There are multiple approaches to finding controllers
To apply control and planning techniques in systems with unknown dynamics, system identification emerged as a set of techniques for computing unknown parameters, e.g. mass of a component citep:ljungSystem1999. This is a two-staged approach which first learns about the environment and then uses this learned model to find the optimal controller. However, this approach learns about the environment globally and often incurs high costs during system identification.
\acrfull{rl} provides the most general framework for extending optimal control to problems with incomplete knowledge of the system dynamics. The classic text by cite:sutton2018reinforcement gives a general introduction to \acrshort{rl}. The main goal is to learn good behaviours from interactions with an environment. Typically this is in the form of a state feedback controller, known as a policy, which makes an agent’s interaction with the environment closed-loop. In contrast to the system identification approach, the goal of \acrshort{rl} is to minimise costs (maximise rewards) during the learning process. Further to this, \acrshort{rl} only needs to learn about the states relevant to solving the optimal control problem.
This thesis is interested in a subset of \acrshort{rl} known as \acrfull{mbrl}.
It solves the optimal control problem in cref:eq-optimal-control-objective
by first learning a dynamics model and then using this learned model with model-based control techniques.
As both \acrfull{mfrl} and \acrshort{mbrl} methods learn models,
the model in the name refers to the dynamics model
As more systems are becoming data-driven, learning dynamics models for model-based control has shifted to
needing task-centric methods that simultaneously learn about the environment, whilst optimising a controller
to obtain low cumulative costs.
The field of \acrshort{mbrl} seeks to solve many optimal control problems by incrementally learning a dynamics model
in this way citep:deisenrothPILCO2011,chuaDeep2018.
\acrshort{mbrl} shares similarities with the system identification and control process,
except that the dynamics learning and control are updated simultaneously.
There are two main components to a \acrshort{mbrl} algorithm: 1) a method for learning a dynamics
model
This section reviews model-based control methods that leverage learned dynamics models in the \acrshort{mbrl} setting. In the \acrshort{rl} literature, model-based control is often referred to as planning. The work in this thesis is primarily focused on model-based control techniques, in particular, trajectory optimisation.
Trajectory optimisation
Instead of approximating a value function or a policy, it is possible to directly optimise the controls
$\controlTraj = \{\control_1, \ldots, \control_{T-1} \}$ over a horizon
which finds an optimal sequence of controls
Model predictive control
Instead of directly applying the control inputs found with trajectory optimisation in an open-loop fashion,
it is possible to obtain a closed-loop controller.
This is achieved by iteratively applying the first control
This is known as \acrfull{mpc} citep:eduardof.Model2007. However, in practice, it is often too computationally expensive to obtain real-time control with \acrshort{mpc}. Many approximate solutions have been introduced in the literature, that seek to balance the computational complexity and accuracy trade-off differently citep:bettsSurvey1998.
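A sketch of the receding-horizon loop (generic notation): at each time step $t$, solve a finite-horizon problem and apply only its first control,
\begin{align}
\control^{*}_{t:t+H-1} = \operatorname*{arg\,min}_{\control_{t:t+H-1}} \; \sum_{h=0}^{H-1} c(\state_{t+h}, \control_{t+h}),
\end{align}
then apply $\control^{*}_{t}$, observe the new state, and repeat, so that feedback enters through the repeated re-planning.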
For example, \acrfull{ilqr} can generate trajectories for nonlinear systems by iteratively approximating the dynamics to be linear around a nominal trajectory and optimising for the controls. \acrshort{ilqr} works well for quadratic cost functions but can be used with any cost function by approximating the cost function with a second-order Taylor expansion. However, in this case, \acrshort{ilqr} is susceptible to converging to terrible (local) optima if the true cost function is highly non-convex. cite:boedeckerApproximate2014 present a real-time \acrshort{ilqr} controller based on sparse \acrshort{gps}. cite:rohrProbabilistic2021 propose a novel \acrshort{lqr} controller synthesis for linearised \acrshort{gp} dynamics that yields robust controllers with respect to a probabilistic stability margin.
\acrshort{mpc} has directly been used with ensembles of probabilistic neural networks citep:chuaDeep2018,nagabandiDeep2020 and with \acrshort{gps} citep:kamtheDataEfficient2018. cite:lambertLowLevel2019 control a quadcopter using online \acrshort{mpc} and a dynamics model learned using probabilistic neural networks.
Offline trajectory optimisation Instead of solving the trajectory optimisation problem in cref:eq-trajectory-optimisation online, it can be solved offline. For example, the state-control trajectory can be found offline and used as a reference trajectory for a tracking controller. Alternatively, the trajectory optimiser can be used offline to learn a state feedback controller (policy) using guided policy search citep:levineGuided2013.
This work aims to control multimodal dynamical systems subject to the mode remaining constraint in cref:eq-main-problem. However, neither the underlying dynamics modes nor how the system switches between them, are known a priori. To this end, this thesis is interested in model-based control techniques which can learn and enforce latent constraints.
It is common to require constraints on the states
The feasible regions of these constraints can be written as sets,
so the constraints can alternatively be written as,
For a parametric controller
There are multiple approaches to enforcing state constraints via invariant sets. Two common approaches are Lyapunov functions citep:lyapunovGeneral1992a and control barrier functions citep:amesControl2019. Lyapunov functions are more restrictive than control barrier functions, as they provide stability guarantees, which are not a necessary condition to render $\stateDomain_{\text{feasible}}$ forward invariant. Although these are interesting directions for future work, they are out of the scope of this thesis.
Unknown constraints This work is interested in learning constraints whilst ensuring that they are satisfied. cite:ariafarADMMBO2019,gelbartBayesian2014 introduce algorithms to minimise an unknown objective (Bayesian optimisation) subject to unknown constraints. cite:sadighSafe2016 propose an \acrshort{mpc} method that satisfies a priori unknown constraints with high probability. However, they do not deploy a strategy to actively learn about the constraints. In contrast, cite:schreiterSafe2015 consider safe exploration for active learning. They distinguish safe and unsafe regions with a binary \acrshort{gp} classifier, which is learned separately to the dynamics model. Their exploration strategy then considers the differential entropy of the dynamics \acrshort{gp} and they use the \acrshort{gp} classifier to define a set of safety constraints.
Stochastic constraints
In stochastic systems it is not possible to make deterministic statements about constraints.
This is because given a start state
A simple approach is to consider expected constraints $\E[\constraintFunc_{\state}(\state)]$. However, although expected performance is a reasonable objective, expected constraints make less sense. For example, although the expected constraints $\E[\constraintFunc_{\state}(\state)]$ may hold, a system may still violate them frequently if the constraint’s variance $\V[\constraintFunc_{\state}(\state)]$ is high. cite:ferberGames1958 proposed risk sensitivity, which uses higher-order moments as well as the expected value. Value at risk citep:duffieOverview1997a is an even stronger notion, which guarantees constraint satisfaction with high probability.
\acrshort{mpc} citep:eduardof.Model2007 is the most direct method to embed constraints. At each time step, \acrshort{mpc} ensures that the constraints hold over a given horizon. However, it is worth noting that these constraints still cannot be guaranteed in stochastic systems. For stochastic systems, cite:schwarmChanceconstrained1999 proposed to satisfy constraints with high probability. Such constraints are named chance constraints. Chance constraints are applicable in systems where the uncertainty arises from learning from observations, i.e. they are applicable with latent constraints.
This section reviews methods for learning representations of dynamical systems for control. When learning representations of dynamical systems from observations it is important to consider the different forms of uncertainty. For example, when using a learned model for control, it is important to know what we do not know. This knowledge can be used to encode risk-sensitive control (avoid regions where the model cannot predict confidently) or to guide exploration. cref:sec-unc-exploration discusses exploration strategies that leverage well-calibrated uncertainty estimates.
This section characterises the uncertainty that arises in \acrshort{rl}.
Aleatoric uncertainty
Dynamical systems give rise to temporal observations arriving as a sequence
Epistemic uncertainty
The uncertainty arising from learning the dynamics
Distinguishing these two sources of uncertainty is important in \acrshort{rl}. For example, the epistemic uncertainty is useful for guiding exploration into regions of the system that have not been observed. In turn, this data can be used to reduce the model’s epistemic uncertainty by updating the model. In contrast, driving the system into regions of the model with high aleatoric uncertainty is not desirable. Consider the case where the model is confident that the system is subject to high process noise in a particular region. Guiding the system into this region will not reduce the model’s epistemic uncertainty because the model has already been trained on data from this region. Further to this, it may be undesirable to enter regions of high aleatoric uncertainty because they may result in poor performance or even catastrophic failure.
This work considers single-step dynamics models with the delta state formulation, which regularises the predictive distribution, given by $\state_{\timeInd+1} = \state_\timeInd + \dynamicsFunc(\state_\timeInd, \control_\timeInd) + \bm{\epsilon}$. Although multi-step dynamics models are an interesting direction for learning-based control, they are out of the scope of this thesis.
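A minimal simulation sketch of this delta-state formulation, using a hypothetical linear mean function in place of a learned $\dynamicsFunc$:

```python
import numpy as np

rng = np.random.default_rng(1)

def delta_dynamics(state, control):
    """Hypothetical learned mean of the state difference f(s, u)."""
    return -0.1 * state + 0.5 * control

def step(state, control, noise_std=0.01):
    # Delta-state formulation: s_{t+1} = s_t + f(s_t, u_t) + eps.
    eps = rng.normal(0.0, noise_std, size=state.shape)
    return state + delta_dynamics(state, control) + eps

state = np.zeros(2)
for _ in range(50):
    state = step(state, control=np.ones(2))
# The rollout settles near the fixed point s* = 5 of the mean dynamics.
```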
Single-step dynamics models have been deployed in a large variety of learning-based control algorithms. Early approaches include using single-step linear models citep:schneiderExploiting1996 and single-step \acrshort{gp} models citep:deisenrothPILCO2011,vinogradskaStability2016,rohrProbabilistic2021,hewingLearningBased2020,kollerLearningBased2018 for low-dimensional control problems. More recently, single-step dynamics models have been learned using neural networks. For example, citep:chuaDeep2018,jannerWhen2019,kurutachModelEnsemble2018 use ensembles of neural networks and citep:depewegLearning2017,galImproving2016 use Bayesian neural networks with parametric uncertainty.
Probabilistic models Mathematical models are compact representations (sets of assumptions) that attempt to capture key features of the phenomenon of interest in a precise mathematical form. Probabilistic modelling provides the capability of constructing mathematical models that can represent and manipulate uncertainty in data, models, decisions and predictions. As such, linking observed data to underlying phenomena through probabilistic models is an interesting direction for modelling, analysing and controlling dynamical systems. It enables the uncertainty to be represented and manipulated; it provides a systematic way to combine observations with existing knowledge via a mathematical model. Learning representations of dynamical systems for control using probabilistic models has shown much promise. Moreover, learning single-step dynamics models that quantify uncertainty has been central to recent successes in \acrshort{mbrl} citep:chuaDeep2018,jannerWhen2019.
Modelling a Bayesian belief over the dynamics function
The mathematical machinery underpinning \acrshort{gps} is now detailed.
Multivariate Gaussian identities Inference techniques with \acrshort{gps} leverage multivariate Gaussian conditioning operations. As such, introducing the multivariate Gaussian identities is a natural place to start.
Gaussian distributions are popular in machine learning and control theory. This is not only due to their natural emergence in statistical scenarios (central limit theorem) but also their intuitiveness and mathematical properties that render their manipulation tractable and easy.
Consider a multivariate Gaussian whose random variables are partitioned into two vectors
where $\bm{\mu}_{\f}$ and $\bm{\mu}_{\u}$ represent the mean vectors, $\bm{\Sigma}_{\f\f}$ and $\bm{\Sigma}_{\u\u}$ represent the covariance matrices, and $\bm{\Sigma}_{\u\f}$ and $\bm{\Sigma}_{\f\u}$ represent the cross-covariance matrices. The marginalisation property of Gaussian distributions states that for two jointly Gaussian random variables, the marginals are also Gaussian,
Conveniently, the conditional densities are also Gaussian,
Consider the case where
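The conditioning identities above can be checked numerically; the following sketch applies the standard conditional-Gaussian formulas to a toy two-block partition:

```python
import numpy as np

# Joint Gaussian over the partition (f, u), as in the identities above.
mu_f, mu_u = np.array([0.0]), np.array([1.0])
S_ff = np.array([[2.0]])
S_uu = np.array([[1.0]])
S_fu = np.array([[0.8]])                  # cross-covariance

u_obs = np.array([2.0])                   # observed value of u

# p(f | u) = N(mu_f + S_fu S_uu^{-1} (u - mu_u), S_ff - S_fu S_uu^{-1} S_uf)
S_uu_inv = np.linalg.inv(S_uu)
cond_mean = mu_f + S_fu @ S_uu_inv @ (u_obs - mu_u)
cond_cov = S_ff - S_fu @ S_uu_inv @ S_fu.T
```

Observing $u$ shifts the mean of $f$ in proportion to the cross-covariance and always shrinks its covariance.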
Gaussian processes
Informally, \acrshort{gps} are a generalisation of the multivariate Gaussian distribution, indexed by an
input domain as opposed to an index set.
Similar to how a sample from an
More intuitively, a Gaussian process is a distribution over functions
Importantly, for a given set of training inputs from the
functions domain
Given mean and kernel functions with parameters
where the dependency on the parameters
Given the multivariate Gaussian conditionals in cref:eq-gaussian-conditional, it is easy to see how
the distribution over the test function value
It is typical in real-world modelling scenarios that observations of the true function values
where
to relate the observations to the latent function values
This predictive distribution is the \acrshort{gp} posterior.
Importantly, the predictive variance
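A compact sketch of the \acrshort{gp} posterior predictive under a squared exponential kernel and Gaussian observation noise (toy data; the thesis notation is replaced by plain variables):

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

# Noisy observations of a hypothetical 1-D latent function.
X = np.array([[-1.0], [0.0], [1.0]])
y = np.array([0.2, 0.9, 0.1])
noise_var = 0.01

Xs = np.array([[0.0], [5.0]])                 # test inputs: near and far from data

K = rbf(X, X) + noise_var * np.eye(len(X))    # K(X, X) + sigma_n^2 I
Ks = rbf(Xs, X)
Kss = rbf(Xs, Xs)

# GP posterior predictive: mean K* K^{-1} y, covariance K** - K* K^{-1} K*^T.
K_inv = np.linalg.inv(K)
post_mean = Ks @ K_inv @ y
post_var = np.diag(Kss - Ks @ K_inv @ Ks.T)

# Far from the data the mean reverts to the prior and the variance grows.
```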
The nonparametric nature of Gaussian process methods is responsible for their flexibility but also
their shortcomings, namely their memory and computational limitations.
In general, the computational complexity of Gaussian process methods scales with $\mathcal{O}(N^3)$,
due to the inversion of the
Consider a second Gaussian density
with mean $\mathbf{m}_{\u}$ and covariance $\mathbf{S}_{\u\u}$
Replacing
Inducing point methods citep:snelsonSparse2005a augment the latent variables
with inducing input-output pairs known as inducing “inputs”
where $\Kmm$ is the covariance function evaluated between all inducing inputs
where $\mathbf{Q}_{nn} = \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{nm}^{\top}$. The full joint distribution is then given by,
Computationally efficient inference is obtained by approximating the integration over
This is the standard bound shown in cite:hensmanGaussian2013 ($\mathcal{L}_1$ defined in Eq. 1). In cite:titsiasVariational2009 this bound is then substituted into the marginal likelihood to obtain a tractable lower bound,
where
cite:hensmanGaussian2013 noted that this bound has complexity $\mathcal{O}(NM^2)$ and so introduced additional variational parameters to achieve a more computationally scalable bound with complexity $\mathcal{O}(M^3)$. Instead of collapsing the inducing points in cref:eq-titsias-bound they explicitly represent them as a variational distribution,
which they use to lower bound the marginal likelihood
where
Importantly, this bound is written as a sum over input-output pairs, which provides the necessary conditions to perform stochastic gradient methods on
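The Nyström construction $\mathbf{Q}_{nn}$ that underlies these sparse bounds can be sketched directly; here the inducing inputs are a hypothetical regular grid:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(2)
X = rng.uniform(-3.0, 3.0, size=(200, 1))   # N = 200 training inputs
Z = np.linspace(-3.0, 3.0, 15)[:, None]     # M = 15 hypothetical inducing inputs

Knn = rbf(X, X)
Knm = rbf(X, Z)
Kmm = rbf(Z, Z) + 1e-6 * np.eye(len(Z))     # jitter for numerical stability

# Nystrom approximation Q_nn = K_nm K_mm^{-1} K_nm^T: rank at most M,
# so downstream algebra costs O(N M^2) instead of the O(N^3) exact inversion.
Qnn = Knm @ np.linalg.solve(Kmm, Knm.T)

approx_err = np.abs(Knn - Qnn).max()
rank = np.linalg.matrix_rank(Qnn)
```

When the inducing inputs cover the data relative to the lengthscale, the rank-$M$ approximation is close to the full $N \times N$ kernel matrix.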
In contrast to the previously presented approaches, we are interested in learning representations
of multimodal dynamical systems.
cref:fig-traj-opt-over-gating-mask-svgp-all-baseline
demonstrates the shortcomings of learning a \acrshort{gp} dynamics model for the quadcopter navigation problem in the
illustrative example from cref:illustrative_example.
It shows the results of performing trajectory optimisation in a \acrshort{gp} dynamics model trained on state
transitions sampled from both dynamics modes.
The \acrshort{gp} has not been able to learn a representation of the dynamics that is faithful to the underlying system, due to the discontinuities associated with the multimodal transition dynamics (changing lengthscales, noise variances, etc.).
The trajectory optimiser was able to find a trajectory from the start state
cref:fig-traj-opt-over-gating-mask-svgp-desired-baseline shows results after training on state transitions from
only the desired, operable dynamics mode (red).
The learned dynamics model can accurately predict state transitions in the desired dynamics mode (red).
However, as this approach only considers the dynamics of the desired mode,
trajectories in the environment (cyan) deviate from those planned in the
learned dynamics model (magenta) when they pass through the turbulent mode.
This is problematic because the trajectory passes through the turbulent dynamics mode
(which may lead to catastrophic failure) and does not reach the target state
Methods for learning probabilistic multimodal dynamics have been proposed. cite:moerlandLearning2017 use deep generative models, namely a conditional \acrfull{vae}, to learn multimodal transition dynamics for \acrshort{mbrl}. cite:kaiserBayesian2020 use their Data Association with \acrshort{gps} model; a Bayesian model that learns independent dynamics modes whilst maintaining a probabilistic belief over which mode is responsible for predicting at a given input location. cite:mckinnonLearning2017 also use an approach based on \acrshort{gps}, except that they use a \acrfull{mogpe} method to learn the switching behaviour for robot dynamics online.
Latent spaces for control It is worth noting that the introduction of latent variables into probabilistic models is a key component providing them with interesting and powerful capabilities for synergising model learning and control. For example, cite:hafnerLearning2019,rybkinModelBased2021 learn latent spaces which provide convenient spaces for control (or planning). cref:fig-traj-opt-over-gating-mask-svgp-desired-baseline highlights the need for learning informative latent variables representing how the system switches between the underlying dynamics modes. Without such information, it is not possible to encode the notion of mode remaining/avoiding behaviour. As such, this work is interested in learning latent spaces that are rich with information regarding how a system switches between its underlying dynamics modes.
\acrfull{rl} agents face a trade-off between exploration, where they seek to explore the environment and improve their models, and exploitation, where they make decisions which are optimal for the data observed so far. There are many approaches from the literature used to tackle the exploration-exploitation trade-off. In \acrshort{mbrl}, the goal is often to reduce the real-world sample complexity at the cost of increased model sample complexity.
There are two main uncertainties which are often modelled and used for exploration.
Value-based methods base their exploration on the uncertainty of the value function
Greedy exploitation
One of the most commonly used exploration strategies is to select the controller that maximises
the expected performance under the learned dynamics model $\dynamicsModel(f \mid \mathcal{D}_{0:i-1})$.
Note that
Note that the negative sign is used because the objective
Thompson sampling
An alternative and theoretically grounded strategy is Thompson sampling.
This approach samples a single model $f_i \sim \dynamicsModel(f \mid \mathcal{D}_{0:i})$ at every iteration
In general, it is intractable to sample from $\dynamicsModel(f \mid \mathcal{D}_{0:i})$. Note that after the sampling step this problem is equivalent to greedy exploitation.
Alternatively, some \acrshort{mbrl} algorithms, such as cite:sekarPlanning2020, adopt a two-phase exploration strategy. The first phase is interested in exploring the environment and summarising this past experience in the form of a model. The second phase then seeks to solve a downstream task, for which it is given a cost (reward) function. This two-stage approach does not require an objective that changes its exploration-exploitation balance as it gathers more knowledge of the environment.
Active learning Active learning is a class of exploration algorithms which fit into the two-phase exploration approach. The goal of information-theoretic active learning is to reduce the number of possible hypotheses as fast as possible, e.g. minimise the uncertainty associated with the parameters using Shannon’s entropy citep:coverElements2006,
In contrast to greedy exploitation, active learning does not seek to maximise a
black-box objective. Instead, it is only interested in exploration. There are many
approaches to active learning in the static setting, i.e. in systems where an arbitrary
state
Recent work has addressed active learning in \acrshort{gp} dynamics models. cite:schreiterSafe2015 propose a greedy entropy-based strategy that considers the entropy of the next state. cite:buisson-fenetActively2020 also propose a greedy entropy-based strategy except that they consider the entropy accumulated over a trajectory. In contrast, cite:caponeLocalized2020,yuActive2021 propose using the mutual information.
cite:caponeLocalized2020 find the most informative state as the one that minimises the mutual information between it and a set of reference states (a discretisation of the domain). They then find a set of controls to drive the system to this most informative state. Given a fixed number of time steps, their method yields a better model than the greedy entropy-based strategies. cite:yuActive2021 propose an alternative approach that leverages their \acrfull{gpssm} inference scheme to estimate the mutual information between all the variables in time $I\left[\mathbf{y}_{1:t}, \hat{\mathbf{y}}_{t+1} ; \mathbf{f}_{1:t+1}\right]$. Here $\mathbf{y}_{1:t}$ denotes the set of observed outputs and $\hat{\mathbf{y}}_{t+1}$ denotes the output predicted by the \acrshort{gpssm}. This contrasts with other approaches, which consider only the most recent mutual information $I[\hat{\mathbf{y}}_{t+1} ; \mathbf{f}_{t+1}]$.
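A minimal sketch of the greedy entropy-based strategy discussed above: since the differential entropy of a Gaussian is monotone in its variance, the greedy query is simply the candidate input with the largest posterior variance (toy data, hypothetical candidate grid):

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

# Observations cluster on the left of the domain; candidates span all of it.
X = np.array([[-2.0], [-1.5], [-1.0]])
noise_var = 0.01
candidates = np.linspace(-2.0, 2.0, 41)[:, None]

K = rbf(X, X) + noise_var * np.eye(len(X))
Ks = rbf(candidates, X)

# Gaussian entropy 0.5 * log(2*pi*e*var) is monotone in the posterior variance,
# so the greedy max-entropy query is the max-variance candidate.
post_var = 1.0 - np.einsum("ij,jk,ik->i", Ks, np.linalg.inv(K), Ks)
query = candidates[np.argmax(post_var)]
```

The query lands at the far right of the domain, the region furthest from the observed data; the horizon-based strategies in the text replace this single-point criterion with an objective accumulated over trajectories.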
Myopic active learning In \acrshort{rl} and control it is standard to consider objectives over a potentially infinite horizon. However, active learning objectives often myopically consider the information gain at the next query point only. In contrast, it is possible to consider the information gain over a potentially infinite horizon, relieving this myopia. The mutual information approaches in cite:caponeLocalized2020,yuActive2021 fall into this myopic category as they only maximise the information gain at the next time step. In contrast, the entropy-based strategy in cite:buisson-fenetActively2020 considers the entropy over a horizon.
This chapter is concerned with learning representations of multimodal dynamical systems for model-based control.
It is interested in systems where both the underlying dynamics modes and how the system switches between them are not fully known a priori.
This chapter assumes access to a data set of state transitions
Following the motivation in cref:to_motivation, this chapter seeks to identify the underlying dynamics modes correctly whilst inferring latent structure that can be exploited for control. The main goals of this chapter can be summarised as follows,
- accurately identify the true underlying dynamics modes,
- learn latent spaces for planning/control.
The probabilistic model constructed in this chapter resembles a \acrfull{mogpe} with a \acrshort{gp}-based gating network and is named \acrfull{mosvgpe}. Following other \acrshort{mogpe} methods, it is evaluated on the motorcycle data set citep:Silverman1985. It is then tested on a real-world quadcopter data set representing the illustrative example detailed in cref:illustrative_example.
This work considers learning representations of unknown or partially unknown, stochastic, multimodal, nonlinear dynamical systems. That is, it seeks to learn a representation of the dynamics from the problem statement in cref:problem-statement-main. This chapter considers single-step dynamics models with the delta state formulation that regularises the predictive distribution. The dynamics are given by,
with continuous states
This chapter assumes access to historical data comprising state transitions from
To ease notation, our modelling only considers a single output dimension.
The extension to multiple output dimensions follows from standard \acrshort{gp} methodologies and is detailed where necessary.
To further ease notation, the state-control input domain is denoted
$\inputDomain = \stateDomain \times \controlDomain \subseteq \R^{\InputDim}$
and a single state-control input is denoted
$\singleInput = (\state_\timeInd, \control_\timeInd)$.
Given this formulation, this chapter aims to learn the mapping
where both the latent dynamics functions $\{\unknownDynamicsK\}_{\modeInd=1}^{\ModeInd}$
and how the system switches between them
\acrfull{gps} are the state-of-the-art approach for Bayesian nonparametric regression and they provide a powerful mechanism for encoding expert domain knowledge. They are flexible enough to model arbitrary smooth functions with the simplicity of only requiring inference over a small number of interpretable parameters, such as lengthscales and the contributions of signal and noise variance in the data. These properties are induced by the covariance function, which models the covariance between observations. As such, \acrshort{mogpe} methods are a promising direction for modelling multimodal dynamical systems. This section recaps the \acrshort{mogpe} concepts that this chapter builds upon.
\newline
Mixture Models
Mixture models are a natural choice for modelling multimodal systems.
Given an input
where
\newline
Mixture of Experts The \acrfull{moe} model citep:jacobsAdaptive1991 is
an extension where the mixing probabilities
depend on the input variable
where
\newline Nonparametric Mixtures of Experts Modelling the experts as \acrshort{gps} gives rise to a class of powerful models known as \acrfull{mogpe}. They can model multimodal distributions as they model a mixture of distributions over the outputs, usually a Gaussian mixture in the regression setting citep:trespMixtures2000a,rasmussenInfinite2001. They can model non-stationary functions as each expert learns separate hyperparameters (lengthscales, noise variances etc). Many \acrshort{mogpe} methods have been proposed, and in general, they differ in the formulation of their gating network and their approximate inference algorithms.
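To illustrate why such mixtures capture multimodality where a single Gaussian predictive cannot, the following sketch evaluates a hypothetical two-expert predictive density at one input; both expert means carry high density while the mixture mean does not:

```python
import numpy as np

# Hypothetical two-expert predictive at a single input x: expert means,
# variances, and input-dependent mixing probabilities pi_k(x).
means = np.array([-1.0, 2.0])
variances = np.array([0.1, 0.4])
mixing = np.array([0.3, 0.7])

def mixture_pdf(y):
    # p(y | x) = sum_k pi_k(x) N(y; mu_k(x), var_k(x))
    comps = np.exp(-0.5 * (y - means) ** 2 / variances)
    comps /= np.sqrt(2.0 * np.pi * variances)
    return float((mixing * comps).sum())

# The mixture mean averages the experts, but the density is bimodal:
# both expert means carry more density than the mean itself.
mix_mean = float((mixing * means).sum())
p_left, p_mid, p_right = mixture_pdf(-1.0), mixture_pdf(mix_mean), mixture_pdf(2.0)
```

This is exactly the failure mode of unimodal predictions in multimodal systems: the mean prediction falls between the modes, in a region the system rarely visits.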
\newline
cite:rasmussenInfinite2001 highlighted that the traditional \acrshort{moe} marginal likelihood does not apply when the experts are nonparametric. This is because the model assumes that the observations are i.i.d. given the model parameters, which is contrary to \acrshort{gp} models, which model the dependencies in the joint distribution given the hyperparameters. cite:rasmussenInfinite2001 point out that there is a joint distribution corresponding to every possible combination of assignments (of observations to experts). The marginal likelihood is then a sum over exponentially many ($\ModeInd^{\NumData}$) sets of assignments,
where
where each expert follows the standard Gaussian likelihood model,
with
where
Many approaches to \acrshort{mogpe} have been proposed, and in general the key differences between them are the formulation of the gating network and the inference algorithms (and associated approximations) cite:Yuksel2012.
The original \acrshort{mogpe} work by cite:Tresp proposed a gating network as a softmax of
Although this gating function divides up the input space, cite:Rasmussen2002 argue that data not assigned to a \acrshort{gp} expert will lead to bias near the boundaries.
Instead cite:Rasmussen2002 formulate the gating network in terms of conditional
distributions on the expert indicator variable, on which they place an input-dependent
Dirichlet process prior.
cite:Meeds2005 proposed an alternative infinite \acrshort{mogpe} similar to cite:Rasmussen2002 except
that they specify a full generative model over the input and output space
cite:Yuan proposed a \acrshort{mogpe} with a variational inference scheme which is much
faster than using \acrshort{mcmc}.
They assume that the expert indicators
We note that mixture models are inherently unidentifiable, as different combinations of mixture components and mixing coefficients can generate the same predictive distributions. The gating network can be considered a handle for encoding prior knowledge that can be used to constrain the set of admissible functions. This can improve identifiability and lead to learned representations that better reflect our understanding of the system. The simplest example is reordering the experts. Figure ref:gating_network_comparison provides a visual comparison of different gating network formulations, providing intuition for how they can differ in restricting the set of admissible functions.
Modelling the gating network with \acrshort{gps} enables us to encode informative prior knowledge through the choice of mean and covariance functions. For example, adopting a squared exponential covariance function would encode the prior belief that the mixing probabilities should vary smoothly across the input space. If modelling a dynamical system with oscillatory behaviour (e.g. an engine), a periodic kernel could be adopted. Prior knowledge of the mixing probability values can be encoded through the choice of mean function. We may also be interested in exploiting techniques from differential geometry, e.g. finding probabilistic geodesics (cite:Tosi2014) on the gating network. Again, our choice of covariance function can be used to encode how differentiable the gating network should be. Learning better representations will also improve the model's ability to extrapolate, as it will be more true to the underlying system.
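The effect of the gating-function prior can be sketched by sampling hypothetical gating functions with different lengthscales and squashing them through a sigmoid; the lengthscale directly controls how quickly the mixing probability can switch between modes:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-3.0, 3.0, 100)

def rbf(a, b, lengthscale):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def gating_sample(lengthscale):
    # Draw a gating-function sample h ~ GP(0, k) and squash it through a
    # sigmoid, giving a mixing probability pi(x) = sigma(h(x)) in (0, 1).
    K = rbf(x, x, lengthscale) + 1e-8 * np.eye(len(x))
    h = rng.multivariate_normal(np.zeros(len(x)), K)
    return 1.0 / (1.0 + np.exp(-h))

pi_smooth = gating_sample(lengthscale=2.0)   # long lengthscale: slow switching
pi_rough = gating_sample(lengthscale=0.1)    # short lengthscale: rapid switching

# Total variation as a crude measure of how rapidly the mixing probability moves.
tv_smooth = np.abs(np.diff(pi_smooth)).sum()
tv_rough = np.abs(np.diff(pi_rough)).sum()
```

A long lengthscale encodes the prior belief that modes occupy contiguous regions of the input space, which is the behaviour exploited for mode-remaining control later in the thesis.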
We can also consider the implications of the mentioned gating networks from an inference perspective. It is known that developing inference algorithms for propagating uncertain (Gaussian) inputs through \acrshort{gps} is itself a challenging problem (see cite:Ustyuzhaninov2019). Therefore, developing a scalable inference algorithm to propagate inputs that are distributed according to a Gaussian mixture will likely lead to the loss of valuable information. Theoretically, the approaches by cite:Rasmussen2002,Meeds2005 are able to achieve accurate results, but their inference relies on \acrshort{mcmc} sampling methods, which can be slow to converge.
We consider our approach as trading the computational benefits that can be obtained through the formulation of the gating network for the ability to improve identifiability with informative \acrshort{gp} priors. The main contributions of this chapter are twofold. Firstly, we re-formulate a gating network based on \acrshort{gps} to improve identifiability. Secondly, we derive an evidence lower bound which improves on the complexity issues associated with inference when adopting such a gating network. Motivated by learning representations of dynamical systems with two regimes, we instantiate the model with two experts, as it is a special case where the gating network can be calculated in closed form. This is because we seek to learn an operable mode with one expert and explain away the inoperable mode with the other. This results in the gating network indicating which regions of the input space are operable, providing a convenient space for planning.
Motivated by synergising model learning and control, the probabilistic model constructed in this chapter resembles a \acrfull{mogpe} with a \acrshort{gp}-based gating network. The \acrshort{gp}-based gating network infers latent geometric structure that is exploited by the control algorithm in Section ref:sec-traj-opt-geometric.
The marginal likelihood in cref:eq-np-moe-marginal-likelihood-assign can be expanded to show each of the experts’ latent variables,
where each expert follows the standard Gaussian likelihood model,
with
where
For ease of notation and understanding, only a single output dimension has been considered,
although in most scenarios the output dimension will be greater than
Motivated by improving identifiability and learning latent spaces for control, this work adopts a \acrshort{gp}-based gating network resembling a \acrshort{gp} classification model, similar to that used in the original \acrshort{mogpe} model citep:trespMixtures2000a. The \acrshort{gp}-based gating network can be used to constrain the set of admissible functions through the placement of informative \acrshort{gp} priors on the gating functions. Further to this, in cref:chap-traj-opt-control, the geometry of the \acrshort{gp}-based gating network is used to encode mode remaining behaviour into control strategies. cref:chap-active-learning then leverages the power of the \acrshort{gp}-based gating network to construct an explorative trajectory optimisation algorithm that can consider the information gain over trajectories, as opposed to just at the next state.
The marginal likelihood is given by,
where the gating network resembles a \acrshort{gp}
classification model, with a factorised classification likelihood
where
Softmax ($\ModeInd>2$) In the general case, when there are more than two experts, the gating network’s likelihood is defined as the Softmax function,
Each gating function
where
Each mode’s mixing probability
Bernoulli ($\ModeInd=2$) Instantiating the model with two experts,
If this sigmoid function satisfies the point symmetry condition then
the following holds,
where
With this formulation, the marginal likelihood of this model can be written to clearly show the expectations over the latent variables,
where
This model makes single-step probabilistic predictions,
where the predictive distribution over the output
For ease of notation and understanding, only a single output dimension has been considered,
although in most scenarios the state dimension will be greater than
The gating network governs how the dynamics switch between modes.
This work is interested in spatially varying modes so
formulates an input dependent Categorical distribution over
where
The probabilities of this Categorical distribution
This work is interested in finding trajectories that can avoid areas of the transition dynamics model that cannot be predicted confidently. Placing independent sparse \acrshort{gp} priors on each gating function provides a principled approach to modelling the epistemic uncertainty associated with each gating function. The gating function’s posterior covariance is a quantitative value that can be exploited by the trajectory optimisation.
Each gating function's inducing inputs are denoted
$\bm{\xi}_h^{(k)}$ and outputs as $\hat{\mathbf{h}}^{(k)}$,
where $p(\mathbf{h}_t \mid \hat{\mathbf{x}}_{t-1}, \hat{\mathbf{h}}) = \prod_{k=1}^{K} p\left(h^{(k)}_t \mid \hat{\mathbf{x}}_{t-1}, \hat{\mathbf{h}}^{(k)}\right)$
is the
Instantiating the model with two experts is a special case where only a single gating function is needed.
The output of a function can be mapped through a sigmoid
function
The distribution over
We note that $p(h^{(k)}_n \mid \mathbf{h}_{\neg n}, \mathbf{X})$
is a \acrshort{gp} conditional and denote its mean
Given this gating network, our marginal likelihood is now an analytic mixture of two Gaussians that we can analytically marginalise.
Performing Bayesian inference involves finding the posterior over the latent variables,
where the denominator is the marginal likelihood from cref:eq-marginal-likelihood-assign.
Exact inference in our model is intractable due to the marginalisation over the set of expert indicator variables.
For this reason, we resort to a variational approximation.
The rich structure of our model makes it hard to construct an \acrshort{elbo} that can
be evaluated in closed form whilst accurately modelling the complex dependencies.
Further to this, the marginal likelihood is extremely expensive to evaluate,
as there are $\ModeInd^{\NumData}$ sets of assignments
\acrfull{svi} citep:hoffmanStochastic2013 relies upon having a set of local variables
factorised across observations and a set of global variables.
The marginalisation over the set of expert indicator variables
Augmented experts
We sidestep the hard assignment of observations to experts by augmenting each expert with a set
of separate independent inducing points
where the set of all inducing inputs associated with the experts has been denoted
Augmented gating network Following a similar approach for the gating network, each gating function is augmented with a
set of
where
Marginal likelihood These inducing points are used to approximate the marginal likelihood with a factorisation over observations that is favourable for constructing a \acrshort{gp}-based gating network. Our approximate marginal likelihood is given by,
where the conditional distributions
where
$\expertKernelMM = k_{\modeInd}(\expertInducingInput, \expertInducingInput)$
represents the $\modeInd^{\text{th}}$ expert's kernel evaluated between its inducing inputs,
Our work follows from sparse \acrshort{gp} methodologies that assume, given the inducing variables, the latent function values factorise over observations. Our approximation assumes that given the inducing points, the marginalisation over every possible assignment of data points to experts can be factorised over data. In a similar spirit to the \acrfull{fitc} approximation citep:naish-guzmanGeneralized2008,quinonero-candelaUnifying2005, this can be viewed as a likelihood approximation,
Importantly, the factorisation over observations has been moved
outside of the marginalisation over the expert indicator variable, i.e.
the expert indicator variable can be marginalised for each data point separately.
This approximation assumes that the inducing variables,
$\{\expertInducingOutput\}_{\modeInd=1}^{\ModeInd}$, are
a sufficient statistic for their associated latent function values,
$\{\mode{\latentFunc}(\allInputK)\}_{\modeInd=1}^{\ModeInd}$
and the set of assignments
Our approximate marginal likelihood captures
the joint distribution over the data and assignments through the inducing variables
Annoyingly, the order of the marginalisation over the expert indicator variable
and the product over observations
Augmented Probability Space Following the approach by cite:titsiasVariational2009 and cite:hensmanGaussian2013,
the probability space is first augmented with a set of
where
Our approximate marginal likelihood is then given by,
where the joint distribution over the data is captured by the inducing variables
The augmented marginal likelihood is then given by,
where the conditional distributions
where
$\expertKernelMM = k\modeInd (\expertInducingInput,\expertInducingInput)$
represents the $\modeInd\text{th}$ experts kernel evaluated between its inducing inputs,
A central assumption of our work follows from sparse \acrshort{gp} methodologies, which assume that, given the inducing variables, the latent function values factorise over observations. In a similar spirit to the \acrshort{fitc} approximation cite:naish-guzmanGeneralized2008,quinonero-candelaUnifying2005, we propose the following likelihood approximation,
Importantly, this approximation moves the factorisation over observations
outside of the marginalisation over the expert indicator variable.
It captures a rich approximation of each expert’s covariance whilst marginalising the expert
indicator variable.
This approximation assumes that the inducing variables,
$\{\gatingInducingOutput, \expertInducingOutput\}_{\modeInd=1}^{\ModeInd}$, are
a sufficient statistic for their associated latent function values,
$\{\mode{\latentFunc}(\allInput), \mode{\gatingFunc}(\allInput)\}_{\modeInd=1}^{\ModeInd}$.
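The computational benefit of this interchange can be checked numerically: once the per-point terms factorise, marginalising the indicator separately per data point (linear in the number of data points) agrees exactly with summing over all $\ModeInd^{\NumData}$ joint assignments. A minimal sketch with hypothetical mixing probabilities and per-point likelihood values (placeholder numbers, not from the thesis):

```python
import itertools
import math

# Hypothetical per-point likelihoods p(y_n | expert k) and mixing probs pi_k
# (placeholder numbers; K = 2 experts, N = 3 data points).
K, N = 2, 3
pi = [0.3, 0.7]
lik = [[0.5, 1.2], [0.8, 0.1], [2.0, 0.9]]  # lik[n][k]

# Brute force: marginalise over all K^N joint assignments (exponential in N).
brute = sum(
    math.prod(pi[a[n]] * lik[n][a[n]] for n in range(N))
    for a in itertools.product(range(K), repeat=N)
)

# Factorised: marginalise the indicator per data point (linear in N).
factorised = math.prod(sum(pi[k] * lik[n][k] for k in range(K)) for n in range(N))

print(brute, factorised)  # identical up to floating point
```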
When all of the expert’s inducing inputs
where the joint distribution over the data is captured by the inducing variables
Instead of collapsing the inducing variables as seen in cite:titsiasVariational2009,
they can be explicitly represented as variational distributions,
Tight lower bound Following a similar approach to cite:hensmanGaussian2013,hensmanScalable2015, a lower bound on cref:eq-augmented-marginal-likelihood can be obtained,
where we parameterise the variational posteriors to be independent Gaussians,
The bound in cref:eq-lower-bound-tight meets the necessary conditions to perform stochastic gradient methods on
Further lower bound Following cite:hensmanScalable2015, these issues can be overcome
by further bounding
where
Moving the marginalisation over the latent gating functions
Further^2 lower bound
Nevertheless, we proceed and further bound the experts for comparison.
Jensen’s inequality is applied to the conditional probability
where
Intuitively, this bound can be seen as modifying the likelihood approximation in cref:eq-likelihood-approximation. Instead of mixing the \acrshort{gps} associated with each expert, this approximation simply mixes their associated noise models.
As each \acrshort{gp}’s inducing variables are normally distributed, the functional form of the variational posteriors is given by,
where
$\mode{\mathbf{A}} = \expertKernelnM \expertKernelMM^{-1}$ and
$\mode{\hat{\mathbf{A}}} = \gatingKernelnM \gatingKernelMM^{-1}$.
The tight lower bound
The bounds in cref:eq-lower-bound-further,eq-lower-bound-tight,eq-lower-bound-further-2
meet the necessary conditions to perform stochastic
gradient methods on
Stochastic optimisation
At each iteration
where $\expertInducingOutput^{(\expertSampleInd)} \sim \expertInducingVariational$
and
$\mathbf{\gatingFunc}(\x_{\batchSampleInd})^{(\gatingSampleInd)} \sim \gatingsVariationalSample$
denote samples from the variational posteriors.
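Drawing such samples is typically done with the reparameterisation trick, so gradients can flow through the sampling step during stochastic optimisation. A minimal sketch with hypothetical variational parameters (the names `m`, `L`, and `sample_variational` are illustrative, not the thesis’s implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_variational(m, L, n_samples, rng):
    """Draw u ~ N(m, L L^T) via the reparameterisation u = m + L eps,
    eps ~ N(0, I), keeping samples differentiable w.r.t. (m, L)."""
    eps = rng.standard_normal((n_samples, m.shape[0]))
    return m + eps @ L.T

# Hypothetical variational parameters for M = 4 inducing outputs.
m = np.array([0.1, -0.5, 0.3, 0.0])
L = np.diag([0.2, 0.1, 0.3, 0.05])  # Cholesky factor of the covariance

samples = sample_variational(m, L, n_samples=10_000, rng=rng)
print(samples.mean(axis=0))  # close to m for a large number of samples
```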
Computational complexity Assuming that each expert has the same number of inducing points
For a given set of test inputs $\testInput \in \R^{\NumTest \times \InputDim}$
where dependence on the test inputs
Experts The experts make predictions at new test locations by integrating over their latent function posteriors,
However, the experts’ true posteriors
where $\mode{\mathbf{A}} = \expertKernelsM \expertKernelMM^{-1}$
Gating network The mixing probabilities associated with the gating network are obtained by integrating the gating network’s posterior through the gating likelihood,
Again, the gating network’s true posterior
where $\mode{\hat{\mathbf{A}}} = \gatingKernelsM \gatingKernelMM^{-1}$
so cref:eq-gating-prediction is approximated with Monte Carlo quadrature.
In the two expert case there is only a single gating function,
In this case, the gating likelihood is the Gaussian cdf,
so cref:eq-gating-prediction can be calculated in closed-form with,
where $\mu_{h_*}$ and $\sigma^2_{h_*}$ are the mean and variance of the variational posterior
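The closed form referred to here is the standard probit-Gaussian identity $\int \Phi(h)\, \mathcal{N}(h \mid \mu, \sigma^2)\, \mathrm{d}h = \Phi\big(\mu / \sqrt{1 + \sigma^2}\big)$. A sketch comparing it against Monte Carlo quadrature, with hypothetical posterior moments:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical posterior moments of the gating function at a test input.
mu, var = 0.4, 0.9

# Closed form: E_{h ~ N(mu, var)}[Phi(h)] = Phi(mu / sqrt(1 + var)).
closed_form = norm.cdf(mu / np.sqrt(1.0 + var))

# Monte Carlo quadrature of the same expectation, for comparison.
h = rng.normal(mu, np.sqrt(var), size=200_000)
mc = norm.cdf(h).mean()

print(closed_form, mc)  # the two estimates agree closely
```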
\todo[inline]{Need to extend bound for the gating network}
As a \acrfull{moe} method, our model aims to improve on standard \acrshort{gp} regression with the ability to model non-stationary functions and multimodal distributions over the output variable. With this in mind, the model and approximate inference scheme are evaluated on two data sets. Following other \acrshort{mogpe} work, they are first tested on the motorcycle data set citep:Silverman1985. Although this data set does not represent state transitions from a dynamical system, it does contain non-stationary points and heterogeneous noise, making it interesting to study from the \acrshort{mogpe} perspective. Secondly, they are tested on the illustrative example from cref:illustrative_example. That is, a data set collected onboard a DJI Tello quadcopter flying in an environment subject to two dynamics modes.
All data sets were split into test and training sets with
where
The Motorcycle data set (discussed in cite:Silverman1985) contains 133 data points ($\allInput \in \R^{133 \times 1}$ and $\allOutput \in \R^{133 \times 1}$) and input dependent noise. The data set represents motorcycle impact data – time (ms) vs acceleration (g). The data set is represented by the black crosses in cref:fig-y-mcycle-two-experts.
To test the performance of \acrshort{mosvgpe}, the model is
instantiated with
cref:tab-mcycle-metrics summarises the results for the three \acrshort{elbo}s
(
#+Caption[\acrshort{mosvgpe} results on motorcycle data set]: Results on the Motorcycle data set citep:Silverman1985 with different instantiations of our model (\acrshort{mosvgpe}).
| Model              | \acrshort{rmse} | \acrshort{nlpp} | \acrshort{mae} |
|--------------------+-----------------+-----------------+----------------|
| \acrshort{gp}      |                 |                 |                |
| \acrshort{svgp}    |                 |                 |                |
| \acrshort{svgp}    |                 |                 |                |
| \acrshort{mosvgpe} |                 |                 |                |
| \acrshort{mosvgpe} |                 |                 |                |
| \acrshort{mosvgpe} |                 |                 |                |
| \acrshort{mosvgpe} |                 |                 |                |
| \acrshort{mosvgpe} |                 |                 |                |
| \acrshort{mosvgpe} |                 |                 |                |
The \acrshort{nlpp} measures the negative log probability of the test data under the predictive posterior, conditioned on the
parameters which are not marginalised, e.g. hyperparameters and inducing inputs.
In line with Bayesian model selection, lower values indicate better-performing models, i.e.
predictive posteriors that more accurately match the distribution of the data.
The predictive posterior is most accurate when \acrshort{mosvgpe} is instantiated with three experts
Regarding the accuracy of the predictive means, the standard \acrshort{gp} regression model achieved the best \acrshort{rmse}, followed by the \acrshort{svgp} models and then the \acrshort{mosvgpe} models. It is worth noting that all of the \acrshort{rmse} and \acrshort{mae} scores are very similar. Although adding more experts to the \acrshort{mosvgpe} model appears to yield more accurate predictive posteriors, the predictive means deteriorate slightly (indicated by higher \acrshort{rmse}/\acrshort{mae} values). This is most likely due to bias at the boundaries between the experts, resulting from the mixing behaviour arising from our \acrshort{gp}-based gating network. If the gating functions do not have short lengthscales, they cannot switch sharply from one expert to another. It is worth noting that this drop in performance is negligible.
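For concreteness, the three metrics can be sketched as follows, with the \acrshort{nlpp} evaluated under a per-point Gaussian mixture predictive, matching the form of the \acrshort{mosvgpe} posterior. All numbers below are hypothetical placeholders, not results from the table:

```python
import numpy as np
from scipy.stats import norm

def rmse(y, mean):
    return np.sqrt(np.mean((y - mean) ** 2))

def mae(y, mean):
    return np.mean(np.abs(y - mean))

def nlpp_mixture(y, means, variances, mixing_probs):
    """Mean negative log predictive probability under a per-point Gaussian
    mixture; means/variances/mixing_probs all have shape (N, K)."""
    densities = norm.pdf(y[:, None], loc=means, scale=np.sqrt(variances))
    return -np.mean(np.log(np.sum(mixing_probs * densities, axis=1)))

# Hypothetical predictions for N = 3 test points and K = 2 experts.
y = np.array([0.1, -0.2, 0.4])
means = np.array([[0.0, 0.3], [-0.1, 0.2], [0.5, 0.4]])
variances = np.full((3, 2), 0.04)
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.5, 0.5]])

point_mean = np.sum(probs * means, axis=1)  # moment-matched predictive mean
print(rmse(y, point_mean), mae(y, point_mean),
      nlpp_mixture(y, means, variances, probs))
```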
The two further lower bounds (
cref:fig-y-means-mcycle-two-experts-tight,fig-y-means-mcycle-two-experts-further
compare the posterior means (black solid line) to the \acrshort{svgp}’s posterior mean (red dashed line) and
cref:fig-y-samples-mcycle-two-experts-tight,fig-y-samples-mcycle-two-experts-further
compare the posterior densities to the \acrshort{svgp}.
The red lines show plus or minus two standard deviations of the \acrshort{svgp}’s posterior variance.
As the \acrshort{mosvgpe} posterior is a Gaussian mixture, it is visualised by drawing samples
from its posterior, i.e. sample a mode indicator variable
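The sampling procedure described above, first draw the mode indicator and then draw from the selected expert’s Gaussian, can be sketched as follows (hypothetical posterior moments):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture_posterior(mixing_probs, means, variances, n_samples, rng):
    """Sample the mixture by first drawing the mode indicator from the mixing
    probabilities, then drawing from the selected expert's Gaussian."""
    k = rng.choice(len(mixing_probs), size=n_samples, p=mixing_probs)
    means, variances = np.asarray(means), np.asarray(variances)
    return rng.normal(means[k], np.sqrt(variances[k]))

# Hypothetical two-expert posterior at a single test input.
samples = sample_mixture_posterior([0.3, 0.7], [-1.0, 2.0], [0.1, 0.5], 100_000, rng)
print(samples.mean())  # close to 0.3 * (-1.0) + 0.7 * 2.0 = 1.1
```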
Predictive posteriors Both \acrshort{mosvgpe} results are capable of modelling the non-stationarity
at
Latent variables More insight into this behaviour can be obtained by considering the latent variables.
Figure ref:fig-latent-mcycle-two-experts shows the posteriors over the latent variables where
cref:fig-expert-gps-mcycle-two-experts-tight,fig-expert-gps-mcycle-two-experts-further
show the \acrshort{gp} posteriors over each expert’s latent function
The lengthscale of the gating network kernel governs how fast the model can shift responsibility from
expert one (cyan) to expert two (magenta).
For both lower bounds,
the distribution over the expert indicator variable tends to a uniform distribution (maximum entropy)
at
The model was then instantiated with three experts
From cref:tab-mcycle-metrics, it is clear that the predictive posterior associated with
In cref:fig-expert-gps-mcycle-three-experts-tight the third expert’s posterior returns to the prior at
The tight lower bound
Batch size
One of the main benefits of the variational inference scheme presented in this chapter is that the bound can be
calculated with minibatches of a data set
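The standard way to make such a minibatch estimate of the bound’s data term unbiased, as in \acrshort{svi}, is to rescale the batch sum by $\NumData / |\mathcal{B}|$. A sketch with hypothetical per-point values (not tied to the thesis’s implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-data-point terms of the bound's expected log-likelihood sum.
N, B = 1000, 50
per_point_terms = rng.normal(-1.0, 0.3, size=N)
full_sum = per_point_terms.sum()

# Unbiased minibatch estimator: rescale the batch sum by N / B.
estimates = []
for _ in range(2000):
    batch = rng.choice(N, size=B, replace=False)
    estimates.append(N / B * per_point_terms[batch].sum())

print(full_sum, np.mean(estimates))  # the estimator matches in expectation
```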
Number of inducing points
Our variational inference scheme models the joint distribution over the data and assignments via the inducing variables
(
\todo{add results for num inducing points leading to worse performance e.g. M=8}
Evidence lower bounds
The tight lower bound
\newpage
As this work is motivated by learning representations of real-world dynamical systems, it was tested on a real-world quadcopter data set following the illustrative example detailed in cref:illustrative_example. The data set was collected at the Bristol Robotics Laboratory using a velocity controlled DJI Tello quadcopter and a Vicon tracking system. A high turbulence dynamics mode was induced by placing a desktop fan at the right side of a room. cref:fig-quadcopter-environment shows a diagram of the environment. The data set represents samples from a dynamical system with constant controls, i.e. $\Delta\state_{\timeInd+1} = \latentFunc(\state_{\timeInd}; \control_{\timeInd}=\control_*)$.
Environment The environment is modelled with two dimensions
(the
Data collection The Vicon system provided access to the true position of the quadcopter at all times, which
enabled pre-planned trajectories to be flown, using a simple \acrshort{pid} controller on
feedback from the Vicon system.
To simplify data collection,
nine trajectories from
Data processing The Vicon stream recorded data at 100 Hz, which was then downsampled to
give a time step of
The model was instantiated with two experts, with the goal of each expert learning a separate dynamics mode and the gating network learning a representation of how the underlying dynamics modes vary over the state space. The model was trained using the model and training parameters in cref:tab-params-quadcopter.
#+caption[\acrshort{mosvgpe}’s moment matched predictive posterior with
#+caption[\acrshort{mosvgpe}’s experts’ posteriors with
At a new input location
Gating network
Figure ref:fig-gating-network-quadcopter-subset shows the gating network after training on the
data set.
Figure ref:fig-gating-mixing-probs-quadcopter-subset (right) indicates that the model has assigned responsibility
to Expert 2 in front of the fan, as its mixing probability
Identifiability
These results show that the \acrshort{gp}-based gating network is capable of turning a single expert
on in multiple regions of the input space.
This is a desirable behaviour as it has enabled only two underlying dynamics modes to be identified.
In contrast, other \acrshort{mogpe} methods may have assigned an extra expert to one of the regions modelled by Expert 1.
In particular, the regions at
Experts
Figure ref:fig-experts-f-quadcopter-subset shows the predictive posteriors $q(\latentFunckd(\singleTestInput))$
associated with each dimension
Both experts were initialised with independent inducing inputs,
The bottom right two plots in Figure ref:fig-experts-f-quadcopter-subset show the posterior variance
associated with the
The quadcopter frames and Euler angles (roll
The 2D nonlinear quadcopter dynamics are based on the
state vector is given by
where,
where
where $g = 9.81\,\text{m}\,\text{s}^{-2}$ is the acceleration due to gravity,
\newpage
cite:williamsAdvancing cite:watsonStochastic2020 cite:watsonAdvancing2021
cite:levineVariational2013
cite:bhardwajDifferentiable2020
cite:mukadamContinuoustime2018
The dynamics of a point mass in 2D can be represented with the
state vector
The point mass can apply a force along its
where
the dynamics can be calculated as,
The unknown dynamics can be modelled by placing \acrshort{gp} priors on the accelerations (
\newpage
Implicit data assignment
It is worth noting that in contrast to other \acrshort{mogpe} methods, this model does not directly assign observations to experts.
However, after augmenting each expert with separate inducing points,
the model has the flexibility to loosely partition the data set.
Just as sparse \acrshort{gp} methods can be viewed as methods that parameterise the full nonparametric \acrshort{gp},
our approach can be viewed as parameterising the nonparametric \acrshort{mogpe}.
Conveniently, our parameterisation, in particular the likelihood approximation in cref:eq-likelihood-approximation,
deals with the issue of marginalising exponentially many sets of assignments of observations to experts.
As evident from the results in this chapter, this likelihood approximation appears to retain important information
regarding the assignment of observations to experts, whilst efficiently marginalising the expert indicator variable.
It is also worth noting that the number of inducing points
Bayesian treatment of inducing inputs Common practice in sparse \acrshort{gp} methods is to jointly optimise the hyperparameters and the inducing inputs. Optimising only some of the parameters, instead of marginalising all of them, is known as Type-II maximum likelihood. In Bayesian model selection, it is well-known that Type-II maximum likelihood can lead to overfitting if the number of parameters being optimised is large. In the case of inducing inputs, there can often be hundreds or thousands that need to be optimised. Further to this, cite:rossiSparse2021 show that optimising the inducing inputs relies on being able to optimise both the prior and the posterior, therefore contradicting Bayesian inference. Our variational inference scheme follows common practice and optimises the inducing inputs jointly with the hyperparameters. In some instances, we observe that optimising the inducing inputs leads to them taking values far away from the training data. Often this can be avoided by simply sampling the inducing inputs from the training inputs and fixing them, i.e. not optimising them. This often leads to better \acrshort{nlpp} scores as well. This observation highlights that a Bayesian treatment of the inducing inputs is an interesting direction for future work. However, specifying priors and performing efficient posterior inference over the inducing inputs is a challenging problem.
Latent spaces for control
The gating network consists of two spaces which are rich with information regarding how the
system switches between its underlying dynamics modes,
namely, the pmf over the expert indicator variable and the \acrshort{gp} posteriors over the gating functions.
It is worth noting that all \acrshort{mogpe} methods have a pmf over the expert indicator variable.
However, this space suffers from interpretability issues.
This is because in conventional \acrshort{mogpe} methods, the epistemic uncertainty associated with the
gating network is not decoupled from the pmf over the expert indicator variable.
Consider the meaning of the mixing probabilities tending to a uniform
distribution. This can arise for two distinct reasons:
- the gating network has **high** epistemic uncertainty, so it cannot confidently predict which expert is responsible, or
- it has **low** epistemic uncertainty and confidently mixes the experts’ predictions; this happens at the boundaries between experts.
This interpretability issue is overcome by our \acrshort{gp}-based gating network, as these two cases are modelled differently. Either the gating function(s) are all equal and their posterior variance(s) are low, implying that the gating network has **low** epistemic uncertainty and is likely at a boundary between experts. Alternatively, the gating functions’ posterior variance(s) could be high, implying it has **high** epistemic uncertainty.
Importantly, the \acrshort{gp} posteriors associated with our gating network not only infer information regarding the mode switching but also model the gating network’s epistemic uncertainty. Further to this, formulating the gating network \acrshort{gps} with differentiable mean and covariance functions enables techniques from Riemannian geometry to be deployed on the gating functions citep:carmoRiemannian1992. The power of the \acrshort{gp}-based gating network will become apparent when its latent geometry is leveraged for control in cref:chap-traj-opt-geometry and when its \acrshort{gps} are used to develop an information-based exploration strategy in cref:chap-active-learning.
This chapter has presented a method for learning representations of multimodal dynamical systems using a \acrshort{mogpe} method. Motivated by correctly identifying the underlying dynamics modes and inferring latent structure that can be exploited for control, this work formulated a gating network based on input-dependent gating functions. This aids the inherent identifiability issues associated with mixture models as it can be used to constrain the set of admissible functions through the placement of informative \acrshort{gp} priors on the gating functions. Further to this, the \acrshort{gp} posteriors over the gating functions provide convenient latent spaces for control. This is because they are rich with information regarding the separation of the underlying dynamics modes and also model the epistemic uncertainty associated with the gating network. In later chapters this uncertainty will be used to construct risk-averse control strategies and to guide exploration for \acrshort{mbrl}.
The variational inference scheme presented in this chapter addresses the issue of marginalising
every possible set of assignments of observations to experts
– of which there are $\ModeInd^{\NumData}$ possibilities
– in the \acrshort{mogpe} marginal likelihood.
It overcomes the issue of assigning observations to experts by augmenting each expert \acrshort{gp}
with a set of inducing points.
These inducing points are assumed to be a sufficient statistic for the joint distribution
over every possible set of assignments to experts.
This induces a factorisation over data which
is used to derive three \acrshort{elbo}s that provide a coupling between the
optimisation of the experts and the gating network, by efficiently marginalising the expert indicator variable for single
data points.
The \acrshort{elbo}s are compared on the Motorcycle data set citep:Silverman1985.
The
This chapter is concerned with controlling unknown or partially unknown, multimodal dynamical systems,
given a single-step predictive dynamics model learned using the \acrshort{mosvgpe} method from cref:chap-dynamics.
In particular, it is concerned with mode remaining trajectory optimisation, which
is formally defined in cref:def-mode-remaining-main.
Informally, mode remaining trajectory optimisation attempts to find trajectories
from an initial state
The \acrshort{mosvgpe} method from cref:chap-dynamics was intentionally formulated with latent variables – to represent the mode switching behaviour and its associated uncertainty – so that they could be leveraged to encode mode remaining behaviour into control strategies. This chapter unleashes the power of these latent variables by making decisions under their uncertainty.
The remainder of this chapter is organised as follows. cref:sec-problem-statement formally states the problem. cref:chap-traj-opt-geometry details two methods that leverage the geometry of the \acrshort{mosvgpe} gating network. The first method in cref:sec-traj-opt-collocation resembles an indirect optimal control method as it solves the necessary conditions which indirectly represent the original optimal control problem. In contrast, the second method in cref:sec-traj-opt-energy takes a more standard approach and directly solves the optimal control problem. cref:chap-traj-opt-inference then introduces an alternative approach to mode remaining trajectory optimisation, which does not leverage the geometry of the gating network. Instead, it extends the \acrfull{cai_unimodal} framework citep:toussaintRobot2009 and encodes mode remaining behaviour via conditioning on the mode indicator variable. cref:chap-traj-opt-results evaluates and compares all three methods using the illustrative example from cref:illustrative_example. An initial version of the \acrfull{ig} method presented in cref:sec-traj-opt-collocation is published in cite:scannellTrajectory2021.
The goal of this chapter is to solve the mode remaining navigation problem in cref:problem-statement-main. Due to the novelty of this problem, the work in this chapter considers trajectory optimisation algorithms rather than state feedback (closed-loop) controllers. This mode remaining trajectory optimisation problem is given by,
where the dynamics are from cref:eq-dynamics-main and the resulting open-loop controller is given by,
$\pi(\timeInd) = \control_{\timeInd} \quad \forall\, \timeInd \in \{0, \ldots, \TimeInd-1\}$.
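Executing such an open-loop controller amounts to rolling a fixed control sequence through the learned dynamics. A minimal sketch, where the linear `dynamics_mean` function is a hypothetical stand-in for the learned \acrshort{gp} mean:

```python
import numpy as np

def rollout(x0, controls, dynamics_mean):
    """Apply an open-loop control sequence: x_{t+1} = x_t + f(x_t, u_t)."""
    states = [np.asarray(x0, dtype=float)]
    for u in controls:
        states.append(states[-1] + dynamics_mean(states[-1], u))
    return np.stack(states)

# Hypothetical linear dynamics mean standing in for the learned GP mean.
def dynamics_mean(x, u):
    return 0.1 * u - 0.01 * x

states = rollout(x0=[0.0, 0.0], controls=np.ones((20, 2)), dynamics_mean=dynamics_mean)
print(states.shape)  # (21, 2): the initial state plus T = 20 transitions
```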
Given the desired dynamics mode
Given that neither the underlying dynamics modes nor how the system switches between them, are known a priori,
it is not possible to solve cref:eq-mode-soc-problem with the mode remaining guarantee in cref:def-mode-remaining-main.
However, well-calibrated uncertainty estimates associated with a learned dynamics model make it possible to
find mode remaining trajectories with high probability.
Therefore, this work relaxes the requirement and instead seeks mode remaining trajectories with high probability.
Let us formally define a
Trajectories satisfying this
This chapter assumes prior access to the environment, such that a data set of state transitions has previously been collected and used to learn a single-step dynamics model.
Given this learned dynamics model, it is assumed that a desired dynamics mode
Given a learned dynamics model and a desired dynamics mode
- Goal 1
- Navigate to the target state $\targetState$,
- Goal 2
- Remain in the operable, desired dynamics mode $\desiredMode$,
- Goal 3
- Avoid regions of the learned dynamics with high epistemic uncertainty,
- Goal 3.1
- in the desired dynamics mode $\latentFunc\desiredMode$, i.e. where the underlying dynamics are not known,
- Goal 3.2
- in the gating network $\modeVar$, i.e. where it is not known which mode governs the dynamics.
Goal 3 arises due to learning the dynamics model from observations. The learned model may not be able to confidently predict which mode governs the dynamics in a given region. This is due to a lack of training observations and is known as epistemic uncertainty. It is desirable to avoid entering these regions as it may result in the system leaving the desired dynamics mode.
This section introduces two different approaches to performing mode remaining trajectory optimisation. They both exploit concepts from Riemannian geometry – extended to probabilistic manifolds – to encode mode remaining behaviour. The first approach in cref:sec-traj-opt-collocation resembles an indirect optimal control method citep:kirkOptimal2004 as it projects the trajectory optimisation problem onto an \acrfull{ode} that implicitly encodes the mode remaining behaviour. As such, we name this approach \acrfull{ig}. The second approach in cref:sec-traj-opt-energy is a direct optimal control method that resembles standard Gaussian process control methods with the mode remaining behaviour encoded via a geometric objective function. We name this approach \acrfull{dre}.
The \acrshort{mosvgpe} model correctly identifies the underlying dynamics modes and infers informative
latent spaces that can be used to encode mode remaining behaviour.
cref:fig-traj-opt-gating-network-gp shows the gating network posterior after training
\acrshort{mosvgpe} on the historical data set of state transitions from the illustrative quadcopter
example in cref:illustrative_example.
The work in this chapter is based on the observation that
Goals 1 and 2 can be encoded as finding length minimising trajectories on the manifold parameterised by the
desired mode’s gating function, shown in the left-hand plot of cref:fig-traj-opt-gating-network-gp.
Intuitively, the length of a trajectory from
\newline
Lengths in Euclidean spaces
The
where Newton’s notation has been used to denote differentiation with respect to time
\newline
\newline
Lengths on Riemannian manifolds
The length of a trajectory
Applying the chain-rule allows cref:eq-manifold-length to be expressed in terms of the Jacobian and the velocity,
This implies that the length of a trajectory on the manifold
where $\metricTensor_{\mathbf{x}_t} = \jacobian^{T} \jacobian$
is a symmetric positive definite matrix
(akin to a local Mahalanobis distance measure), known as the natural Riemannian metric.
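In discrete time, the length integral can be approximated by summing $\sqrt{\Delta\mathbf{x}_t^{T} \metricTensor \Delta\mathbf{x}_t}$ along the trajectory. A sketch with a hypothetical immersion Jacobian (an illustrative placeholder, not the thesis’s gating-function Jacobian) defining the natural metric $\metricTensor = \jacobian^{T} \jacobian$:

```python
import numpy as np

def jacobian(x):
    """Hypothetical Jacobian of a manifold immersion at a 2D state x."""
    return np.array([[1.0 + x[0] ** 2, 0.0],
                     [0.0, 1.0],
                     [x[1], x[0]]])

def curve_length(states):
    """Discretised Riemannian length: sum_t sqrt(dx_t^T G(x_t) dx_t),
    with the natural metric G(x) = J(x)^T J(x)."""
    length = 0.0
    for t in range(len(states) - 1):
        dx = states[t + 1] - states[t]
        J = jacobian(states[t])
        length += np.sqrt(dx @ (J.T @ J) @ dx)
    return length

# A straight line in coordinates is not generally shortest on the manifold;
# here its Riemannian length exceeds its Euclidean length sqrt(2).
line = np.linspace([0.0, 0.0], [1.0, 1.0], 50)
print(curve_length(line))
```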
The length of a trajectory on a manifold
cref:fig-traj-opt-gating-network-gp shows the \acrshort{gp} posterior over the desired mode’s gating function,
These trajectories will attempt to remain in the desired dynamics mode
\newline
Following cite:tosiMetrics2014 we formulate a metric tensor that captures the variance in the manifold via a probability distribution. First note that as the differential operator is linear, the derivative of a \acrshort{gp} is also a \acrshort{gp}, assuming that the mean and covariance functions are differentiable.
Therefore, the metric tensor
where
Importantly, this expected metric tensor includes a covariance term
Setting
The model in Chapter ref:chap-dynamics is built upon sparse \acrshort{gp} approximations, so the Jacobian in cref:eq-predictive-jacobian-dist must be extended for such approximations.
This section presents a trajectory optimisation algorithm that exploits the fact that
length minimising trajectories on the manifold endowed with the expected metric from cref:eq-expected-metric
encode all of the goals.
As shortest lengths on a manifold are known as geodesics, we refer to them as geodesic trajectories.
The algorithm presented in this section
exploits a classic result of Riemannian geometry, that geodesic trajectories are solutions
to a 2nd order \acrshort{ode}, known as the geodesic \acrshort{ode}
Geodesic \acrshort{ode} An important observation from cite:carmoRiemannian1992, is that geodesics satisfy a continuous-time $2\text{nd}$ order \acrshort{ode}, given by,
where
Solving the 2nd order \acrshort{ode} in cref:eq-2ode with the expected metric from cref:eq-expected-metric-weighting, is equivalent to solving our trajectory optimisation problem subject to the same boundary conditions. This resembles an indirect optimal control method as it is based on an observation that the necessary conditions for optimality are encoded via the geodesic \acrshort{ode}. However, it is worth noting that solutions to the geodesic \acrshort{ode} are not guaranteed to satisfy the dynamics constraints.
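For reference, the geodesic \acrshort{ode} of cite:carmoRiemannian1992 is standardly written componentwise in terms of Christoffel symbols derived from the metric; in our setting the expected metric tensor plays the role of $\metricTensor = [g_{ij}]$ (the component notation below is the standard one, not the thesis’s):

```latex
\ddot{x}^{k} + \sum_{i,j} \Gamma^{k}_{ij}\, \dot{x}^{i}\, \dot{x}^{j} = 0,
\qquad
\Gamma^{k}_{ij} = \tfrac{1}{2} \sum_{l} g^{kl}
  \left( \partial_{i} g_{jl} + \partial_{j} g_{il} - \partial_{l} g_{ij} \right),
```

where $g^{kl}$ denotes the components of the inverse metric.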
Collocation
Since neither
where $\ddot{\stateCol}_{i+\frac{1}{2}}, \dot{\stateCol}_{i+\frac{1}{2}}, \stateCol_{i+\frac{1}{2}}$
are obtained by interpolating between
Notice that no integrals need to be computed as all of the functions are algebraic operations.
In practice, a quadratic cost function is used to regularise the state derivative
where
Latent variable controls
This nonlinear program returns a collocation state trajectory
where the variational posterior is given by
where each time step of the latent control trajectory is assumed to be normally distributed,
and its variational posterior is given by,
The posterior over the latent control trajectory
Although this method provides an elegant solution to finding trajectories that satisfy Goals 1, 2 and 3, it is not without its limitations. First of all, this approach does not necessarily find trajectories that satisfy the dynamics constraints, as it projects the problem onto the geodesic \acrshort{ode}.
Secondly, it does not consider the full distribution over state-control trajectories. Without the inclusion of the full probabilistic dynamics model, it is impossible to consider the full distribution over state-control trajectories. Although propagating uncertainty through a single dynamics \acrshort{gp} is straightforward, handling the collocation constraints is not. This is because the geodesic \acrshort{ode} will become a \acrfull{sde}.
The goal of this chapter is to find trajectories that are
where $q(\GatingFunc(\state\timeInd) \mid \state\timeInd)$ is the approximate posterior over the
gating functions from cref:eq-predictive-gating.
Note the dependence on the state input $\state\timeInd$ is reintroduced here,
as it becomes a random variable when making multi-step predictions.
This probability will decrease as the uncertainty in the state increases.
For example, when a trajectory passes through regions of the desired dynamics \acrshort{gp} with high uncertainty.
This is implied by the marginalisation over
This section details a direct optimal control approach which embeds the mode remaining behaviour directly into the \acrfull{soc} problem, via a geometric objective function. In contrast to the previous approach, this method:
- enforces the dynamics constraints,
- principally handles the uncertainty associated with the dynamics.
This approach is a shooting method that enforces the dynamics constraints through simulation, i.e. the state trajectory is enforced to match the integral of the dynamics with respect to time.
Similar to the previous approach, this method builds on the observation that
length minimising trajectories on the Riemannian manifold
with an objective function that minimises the Riemannian energy,
where
where $\stateDiff_{\timeInd} = \state_{\timeInd} - \state_{\timeInd-1}$ is the state difference. The mode remaining behaviour and the terminal state boundary condition are encoded via the objective function.
This may seem like an easy optimisation problem;
however, calculating the expected cost in cref:eq-mode-soc-problem-geometry-cost is not straightforward.
Given a starting state
This work adopts a two-stage approximation to obtain a closed-form expression for the expected cost.
First, multi-step dynamics predictions are approximated to obtain normally distributed states at
each time step.
Given normally distributed states, calculating the expected terminal and control cost terms in
cref:eq-mode-soc-problem-geometry-cost is straightforward.
However, the expected Riemannian energy in cref:eq-mode-soc-problem-geometry-cost has no closed-form expression,
due to the dependence of the metric
Multi-step predictions in the \acrshort{mosvgpe} dynamics model have no closed-form solution because the state difference after the first time step is a Gaussian mixture, and propagating Gaussian mixtures through Gaussian processes has no closed-form solution. Further to this, constructing approximate closed-form solutions is difficult, due to the exponential growth in the number of Gaussian components.
This work sidesteps this issue and obtains closed-form multi-step predictions by enforcing that the controlled system remains in the desired dynamics mode. Multi-step predictions can then be calculated in closed-form by cascading single-step predictions using the desired dynamics \acrshort{gp}, whose transition density is given by,
where $\mathbf{A}_{\desiredMode} = \desiredExpertKernel(\singleInput, \desiredExpertInducingInput)\desiredExpertKernel(\desiredExpertInducingInput, \desiredExpertInducingInput)^{-1}$ and $\singleInput=(\state_{\timeInd}, \control_{\timeInd})$. Cascading single-step predictions requires recursively mapping uncertain state-control inputs through the desired mode’s dynamics \acrshort{gp}, i.e. recursively calculating the following integral,
with
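For intuition, cascading single-step predictions can be sketched as follows. The \acrshort{gp} posterior here is a hypothetical toy stand-in, and the additive variance update is a crude simplification of the moment-matching approximation used in this work.

```python
import numpy as np

# Toy stand-in for the desired mode's single-step GP posterior: it returns
# the mean and variance of the state difference at a state-control input.
# (Hypothetical functional form; the thesis uses the learned sparse GP.)
def gp_state_diff(state_mean, control):
    mean = 0.1 * np.tanh(state_mean) + 0.05 * control
    var = 0.01 * (1.0 + state_mean ** 2)  # input-dependent uncertainty
    return mean, var

def cascade_predictions(state0, controls):
    """Cascade single-step predictions: evaluate the GP at the current
    state mean and accumulate the predictive variance additively (a crude
    simplification of moment matching)."""
    means, variances = [state0], [0.0]
    for u in controls:
        diff_mean, diff_var = gp_state_diff(means[-1], u)
        means.append(means[-1] + diff_mean)         # s_{t+1} = s_t + Δs_t
        variances.append(variances[-1] + diff_var)  # uncertainty grows
    return np.array(means), np.array(variances)
```

The key property illustrated is that the predictive uncertainty accumulates monotonically along the horizon, which is why long-horizon predictions become increasingly uncertain.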
$\delta$-mode remaining chance constraints
Enforcing the controlled system to remain in the desired dynamics mode simplifies
calculating multi-step predictions and the expected cost in cref:eq-mode-soc-problem-geometry-cost.
As the dynamics model is learned from observations, this work relaxes the requirement to ensuring that trajectories
are
These constraints enforce that the system remains in the desired dynamics mode with satisfaction probability
where $q(\GatingFunc(\state\timeInd) \mid \state\timeInd)$ is the approximate posterior over the gating functions from cref:eq-predictive-gating.
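A minimal sketch of evaluating such chance constraints, under the simplifying illustrative assumption of a two-mode gating network in which the desired mode corresponds to the gating function being positive:

```python
from math import erf, sqrt

def mode_probability(gating_mean, gating_var):
    # Pr(h(s_t) >= 0) for h(s_t) ~ N(gating_mean, gating_var):
    # the Gaussian CDF evaluated at gating_mean / sqrt(gating_var).
    return 0.5 * (1.0 + erf(gating_mean / sqrt(2.0 * gating_var)))

def is_mode_remaining(gating_means, gating_vars, delta):
    """Check the delta-mode-remaining chance constraint at every time step
    of a trajectory, given the gating function's posterior moments."""
    return all(mode_probability(m, v) >= delta
               for m, v in zip(gating_means, gating_vars))
```

Note how the probability depends on both the gating mean and its variance: a trajectory near the mode boundary, or in a region of high gating uncertainty, fails the check.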
Given this approach for simulating the \acrshort{mosvgpe} dynamics model, the state at each time step is normally distributed. Unlike the terminal and control cost terms in cref:eq-mode-soc-problem-geometry-cost, the expected Riemannian energy,
has no closed-form expression under normally distributed states.
This is because the metric tensor
The distribution over the Jacobian when the input location is normally distributed, $\state_{\timeInd} \sim \mathcal{N}(\stateMean, \stateCov)$, can be calculated in closed form when using the Squared Exponential kernel. However, this work simplifies the problem and calculates the Jacobian at the state mean of each time step along a trajectory, i.e. $\Jac_{\state_{\timeInd}} \approx \frac{\partial \manifoldFunction(\stateMean)}{\partial \stateMean}$. The distribution over the Jacobian given deterministic inputs can be calculated using cref:eq-predictive-jacobian-dist.
Approximating the Jacobian to be independent of the state enables the expected metric tensor to be calculated in closed-form with cref:eq-expected-metric-weighting. Given this approximation, the Riemannian energy retains a quadratic form, so the expectation with respect to $\stateDiff_{\timeInd} \sim \mathcal{N}(\stateDiffMean, \stateDiffCov)$ can be calculated with,
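Under this approximation the expected energy reduces to the standard Gaussian quadratic-form identity, $\mathbb{E}[\stateDiff^{\top}\mathbf{G}\stateDiff] = \operatorname{tr}(\mathbf{G}\boldsymbol{\Sigma}) + \boldsymbol{\mu}^{\top}\mathbf{G}\boldsymbol{\mu}$, which can be checked numerically:

```python
import numpy as np

def expected_quadratic_form(G, mean, cov):
    # E[x^T G x] = tr(G Σ) + μ^T G μ  for x ~ N(μ, Σ) and fixed metric G
    return np.trace(G @ cov) + mean @ G @ mean

G = np.eye(2)                 # toy expected metric tensor
mean = np.array([1.0, 0.0])   # state-difference mean
cov = 0.5 * np.eye(2)         # state-difference covariance
energy = expected_quadratic_form(G, mean, cov)  # tr(GΣ)=1.0, μᵀGμ=1.0
```

The trace term is what distinguishes the expected energy from the energy of the mean trajectory: uncertain state differences are penalised even when their mean is small.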
Given this approximation for the expected Riemannian energy, the expected cost in cref:eq-mode-soc-problem-geometry can be calculated in closed-form with,
This work then approximately solves the problem in cref:eq-mode-soc-problem by solving,
using \acrshort{slsqp} in SciPy citep:2020SciPy-NMeth.
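A minimal sketch of such an optimisation with SciPy's \acrshort{slsqp} solver; the objective here is a toy surrogate for the expected cost, with made-up "dynamics", not the actual cost from this work:

```python
import numpy as np
from scipy.optimize import minimize

target = np.array([1.0, -1.0])

def expected_cost(u_flat):
    # Toy surrogate: the "terminal state" is the sum of the control steps,
    # penalised by distance to the target plus a small control-effort term.
    controls = u_flat.reshape(-1, 2)
    terminal = controls.sum(axis=0)
    return np.sum((terminal - target) ** 2) + 1e-3 * np.sum(u_flat ** 2)

u0 = np.zeros(3 * 2)  # horizon of 3 steps, two control dimensions
result = minimize(expected_cost, u0, method="SLSQP")
```

In the actual method the decision variables are the control trajectory and the objective is the closed-form expected cost; constraints (e.g. control bounds) can be passed to `minimize` via its `bounds` and `constraints` arguments.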
This method obtains closed-form expressions for the expected cost in cref:eq-mode-soc-problem-geometry
by constraining the system to be
An alternative approach to obtain mode remaining behaviour is to optimise subject to the chance constraints in cref:eq-mode-chance-constraint alone, i.e. without the Riemannian energy cost term. However, this constrained optimisation is often not able to converge in practice. Experiments and intuition indicate that the geometry of the gating functions provides a much better optimisation landscape. This is because the gating functions vary gradually over the state domain, whilst the mixing probability changes abruptly at the boundaries between dynamics modes.
Therefore, in practice, the optimisation in cref:eq-mode-soc-problem-geometry-approx is performed unconstrained, i.e. without enforcing the chance constraints at every iteration. Instead, the chance constraints are used to validate trajectories found by the unconstrained optimiser, before deploying them in the environment. In most experiments, this strategy was far superior to constraining the optimisation at every iteration.
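The optimise-then-validate strategy can be sketched as the following loop (all function names are hypothetical placeholders):

```python
def plan_trajectory(optimise, validate, initial_guess, max_restarts=5):
    """Run the unconstrained optimiser, then accept a trajectory only if it
    passes the delta-mode-remaining chance-constraint check; otherwise
    re-seed the optimiser and try again."""
    guess = initial_guess
    for _ in range(max_restarts):
        trajectory = optimise(guess)  # unconstrained trajectory optimisation
        if validate(trajectory):      # post-hoc chance-constraint validation
            return trajectory         # safe to deploy in the environment
        guess = trajectory            # warm-start the next attempt
    return None                       # no validated trajectory found
```

This keeps the well-behaved unconstrained optimisation landscape while still refusing to deploy trajectories that violate the chance constraints.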
Cost Functions
This work primarily focuses on quadratic costs, as they are ubiquitous in control and lead
to closed-form expectations under normally distributed states
${\state_{\timeInd} \sim \mathcal{N}(\stateMean, \stateCov)}$
and controls
${\control_{\timeInd} \sim \mathcal{N}(\controlMean, \controlCov)}$.
This work seeks to find trajectories between a start state
where
This section presents an alternative approach to finding mode remaining trajectories, named \acrfull{cai}. In contrast to the previous section, which encoded mode remaining behaviour via the latent geometry of the \acrshort{mosvgpe}’s gating network, this section leverages the probability mass function over the expert indicator variable. As all \acrshort{mogpe} methods define a probability mass function over the expert indicator variable, the method presented here is applicable to a wider range of \acrshort{mogpe} dynamics models. cref:sec-inference-background recaps the necessary background and related work, and cref:sec-traj-opt-inference then details the trajectory optimisation algorithm.
This section first recaps the \acrfull{cai_unimodal} framework.
To formulate optimal control as probabilistic inference,
it is first embedded into a graphical model (see Figure ref:fig-basic-control-graphical-model).
The joint probability model (over a trajectory) is augmented with an
additional variable to encode the notion of cost (or reward)
over the trajectory (see Figure ref:fig-augmented-control-graphical-model).
The new variable is a Bernoulli random variable
A common (and convenient) approach is to formulate
the likelihood using an exponential transform of
the cost. This results in a Boltzmann distribution where the
inverse temperature,
The resulting negative log-likelihood, for a single state-control trajectory \stateTraj, \controlTraj, is an affine transformation of the cost,
which preserves convexity.
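Concretely, because the exponential transform makes the negative log-likelihood affine in the cost, minimising the cost and maximising the optimality likelihood select the same trajectory; a small sanity check:

```python
import math

def optimality_likelihood(cost, beta=1.0):
    # Bernoulli "optimality" likelihood via the exponential transform:
    # p(O=1 | τ) ∝ exp(-β c(τ)); beta is the inverse temperature of the
    # resulting Boltzmann distribution.
    return math.exp(-beta * cost)

costs = [3.0, 1.0, 2.5]  # toy costs for three candidate trajectories
best_by_cost = min(range(len(costs)), key=lambda i: costs[i])
best_by_likelihood = max(range(len(costs)),
                         key=lambda i: optimality_likelihood(costs[i]))
```

Since $-\log p(\optimalVar=1 \mid \cdot) = \beta\, c(\cdot)$ up to a constant, the two orderings always agree.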
The set of optimal Bernoulli variables over a trajectory is denoted
$\optimalVarTraj = \{\optimalVar_{\timeInd}=1\}_{\timeInd=0}^{\TimeInd}$
This work primarily focuses on quadratic costs,
Terminal Control Problems
To find a trajectory between a start state,
It is worth noting that if a cost function is not quadratic then a quadratic approximation can be obtained with a Taylor expansion.
\newline
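One way to obtain such a quadratic approximation numerically is central finite differences, shown here as an illustrative stand-in for the analytic Taylor expansion:

```python
import numpy as np

def quadraticise(cost, x0, eps=1e-4):
    """Second-order Taylor expansion of a scalar cost about x0:
    returns f(x0), the gradient and the Hessian via central differences."""
    n = x0.size
    grad = np.zeros(n)
    hess = np.zeros((n, n))
    for i in range(n):
        e_i = np.zeros(n); e_i[i] = eps
        grad[i] = (cost(x0 + e_i) - cost(x0 - e_i)) / (2 * eps)
        for j in range(n):
            e_j = np.zeros(n); e_j[j] = eps
            hess[i, j] = (cost(x0 + e_i + e_j) - cost(x0 + e_i - e_j)
                          - cost(x0 - e_i + e_j) + cost(x0 - e_i - e_j)
                          ) / (4 * eps ** 2)
    return cost(x0), grad, hess
```

For a cost that is already quadratic the expansion is exact (up to floating-point error), which makes this easy to verify.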
The joint probability for an optimal trajectory (i.e.
for
where
where
They then distinguish between a prior policy
Intuitively
is equivalent to the \acrshort{soc} problem, with a modified integral cost,
The problem in cref:eq-kl-control finds trajectories that balance
minimising expected costs
$\E\controlledPolicyDist(\stateTraj, \controlTraj) \left[ \costFunc(\stateTraj, \controlTraj) \right]$
and selecting a policy
Maximum entropy regularisation Formulating trajectory optimisation in this way encodes the maximum causal entropy principle, which is often used to achieve robustness, in particular for inverse optimal control citep:ziebartModeling2010.
Other approaches There are multiple approaches to performing inference in this graphical model. Trading off accuracy against computational cost is often required for real-time control. In this case, one approach is to approximate the dynamics with linear or quadratic approximations, as is done in \acrshort{ilqr}/\acrshort{ilqg} and \acrshort{gp} respectively. Given linear dynamics, the full graphical model in cref:eq-traj-opt-joint-dist can be computed using approximate Gaussian message passing, for which effective methods exist citep:loeligerFactor2007. The inference problem can then be solved using the expectation-maximisation algorithm for dynamical system estimation citep:shumwayAPPROACH1982,ghahramaniLearning1999,schonSystem2011, with input estimation citep:watsonStochastic2021.
This section details how the \acrfull{cai_unimodal} framework can be extended to multimodal dynamical systems and used to encode mode remaining behaviour. cref:eq-traj-opt-gating-network-prob-post-inf shows the probability mass function over the expert indicator variable after training \acrshort{mosvgpe} on the historical data set of state transitions from the quadcopter navigation problem in cref:illustrative_example. Intuitively, the goal is to find trajectories that remain in regions of the dynamics with a high probability of remaining in the desired dynamics mode.
\newline
In order to find trajectories that remain in the desired
dynamics mode, this work further augments
the graphical model in cref:fig-control-graphical-model with the mode indicator variable
where $\modeVarTraj = \{\modeVar_{\timeInd}=\desiredMode\}_{\timeInd=0}^{\TimeInd}$
denotes every time step of a trajectory belonging to the desired dynamics mode
This work draws on the connection between KL-divergence control citep:rawlikStochastic2013 and structured variational inference. Whilst the derivation shown here differs from cite:rawlikStochastic2013, the underlying framework and objective are the same.
In variational inference, the goal is to approximate a distribution
with the variational distribution
As this method uses a learned representation of the transition dynamics, it suffices to assume that the dynamics are given by the desired mode’s learned dynamics. Constructing approximate closed-form solutions based on the model in cref:chap-dynamics is difficult, due to the exponential growth in the number of Gaussian components. Similar to the approach in cref:sec-dynamics-predictions, this method obtains multi-step predictions by cascading single-step predictions through the desired mode’s dynamics \acrshort{gp}. However, this approach extends the predictions to handle normally distributed controls $\control_{\timeInd} \sim \mathcal{N}(\controlMean, \controlCov)$. Multi-step predictions are then obtained by recursively calculating the following integral,
using the moment matching approximation from citep:kussGaussian2006.
Note that $\modeVarTraj_{0:\timeInd}$ denotes the first
with
Variational inference seeks to optimise
where the inequality is obtained via Jensen’s inequality.
Note the cancellation in the second line where
we assume $\transitionDistK = q(\state_{\timeInd+1} \mid \state_0, \modeVarTraj_{0:\timeInd})$.
Assuming a uniform prior policy
The \acrshort{elbo} in cref:eq-traj-opt-elbo resembles the KL control objective in cref:eq-kl-control-uniform
but with an extra term encoding the mode remaining behaviour.
In practice, maximum entropy control is achieved by parameterising the control at each time step to
be normally distributed ${\controlVarDist = \mathcal{N}(\control_{\timeInd} \mid \controlMean, \controlCov)}$.
The control dimensions are assumed independent so
The control problem is then given by,
which encodes mode remaining behaviour alongside maximum entropy control.
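Since the control dimensions are independent, the entropy term decomposes over dimensions; a minimal sketch of this standard diagonal-Gaussian closed form:

```python
import math

def diag_gaussian_entropy(variances):
    # H[N(μ, diag(σ²))] = Σ_d 0.5 * log(2πe σ_d²); with independent
    # dimensions the entropy is a sum of per-dimension terms.
    return sum(0.5 * math.log(2.0 * math.pi * math.e * v)
               for v in variances)
```

Maximising this term favours wider control distributions, which is the source of the mode-boundary clearance observed later in the experiments.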
The variational parameters
where
Approximate Inference for Dynamics Predictions
Fortunately, this work is interested in remaining in a single dynamics mode
where $\mode{\mathbf{A}} = \expertKernel(\singleInput, \expertInducingInput)\left(\expertKernel(\expertInducingInput, \expertInducingInput)\right)^{-1}$. To obtain long-term predictions, we cascade one-step predictions. This requires mapping uncertain state-control inputs through the desired mode’s \acrshort{gp} dynamics model. This prediction problem corresponds to recursively calculating the following integral,
with the uncertain_conditional function from GPflow citep:GPflow2017.
For more details on the approximation see Section 7.2.1 of citep:kussGaussian2006.
Given this method for making long term predictions, the distribution over state-control trajectories is given by,
The variational lower bound is then given by sums over each time step,
The aforementioned trajectory optimisation technique relies on the ability to simulate the controlled system using the learned dynamics model. Recursively estimating the state of a nonlinear dynamical system is a common problem. Exact Bayesian solutions in closed-form can only be obtained for a few special cases. For example, the Kalman filter for linear Gaussian systems citep:kalmanNew1960 is exact. In the nonlinear case, approximate methods are required to obtain efficient closed-form solutions. \todo{cite Extended Kalman filter (EKF), Unscented Kalman filter (UKF) etc?}
Constructing approximate closed-form solutions based on the model in Chapter ref:chap-dynamics
is especially difficult, due to the exponential growth in the number of Gaussian components.
Consider approximating the dynamics modes to be independent over a trajectory of length,
Luckily, this work is interested in remaining in a single dynamics mode
The model in Chapter ref:chap-dynamics represents each mode’s dynamics function as a sparse variational GP, giving the transition density as,
where the functional form of \expertVariational is given by,
Remember that the variational posterior
that captures the joint distribution in the data through the inducing variables
Consider the problem of predicting the next state $\state_{\timeInd+1}$
given multivariate normally distributed states,
${\state_{\timeInd} \sim \mathcal{N}(\stateMean, \stateCov)}$,
and controls,
${\control_{\timeInd} \sim \mathcal{N}(\controlMean, \controlCov)}$,
where
Approximate closed-form solutions exist for propagating uncertain inputs through \acrshort{gp} models
cite:girardGaussian2003,kussGaussian2006,quinonero-candelaPropagation2003.
\todo{correct girard citation}
This work exploits the moment-matching approximation
implemented in the uncertain_conditional function
from GPflow cite:GPflow2017.
For more details on the approximation see Section 7.2.1 of cite:kussGaussian2006.
\todo{Add moment matching figure?}
This work propagates model uncertainty forwards by cascading such one-step predictions.
This chapter has presented three mode remaining trajectory optimisation algorithms.
The first two have shown how the geometry of the \acrshort{mosvgpe} gating network
provides valuable information regarding how a multimodal dynamical system switches between its underlying dynamics modes.
Moreover, they have shown how this latent geometry can be leveraged to encode mode remaining behaviour into
two different control strategies.
Both of these control strategies introduce a user tunable parameter
In cref:chap-traj-opt-results the methods presented in this chapter are evaluated and compared using the quadcopter navigation problem from the illustrative example in cref:illustrative_example. It turns out that in practice, the \acrfull{dre} method from cref:sec-traj-opt-energy and the \acrfull{cai} method from cref:chap-traj-opt-inference, perform significantly better than the \acrfull{ig} method.
This chapter evaluates the three mode remaining trajectory optimisation algorithms:
- \acrfull{ig} from cref:sec-traj-opt-collocation,
- \acrfull{dre} from cref:sec-traj-opt-energy,
- \acrfull{cai} from cref:sec-traj-opt-inference,
in a simulated version of the illustrative example from cref:illustrative_example.
That is, flying a velocity controlled quadcopter from an initial state
The collocation solver from the \acrfull{ig} method in cref:chap-traj-opt-geometry is tested on a real-world quadcopter state transition data set. This data set was collected with constant controls, so controls could not be recovered from the full \acrshort{mosvgpe} dynamics model, and neither the \acrfull{dre} method from cref:chap-traj-opt-geometry nor the \acrfull{cai} method from cref:chap-traj-opt-inference could be tested. Nevertheless, the results are a small step towards validating the method’s applicability to real-world systems. All three control methods are tested in the two simulated environments so that they can be compared. To aid comparison with the real-world experiments, the layout of the first simulated environment (Environment 1) was kept consistent with the real-world experiments.
The \acrfull{ig} method presented in cref:sec-traj-opt-collocation was evaluated using data from the real-world quadcopter navigation problem detailed in cref:sec-brl-experiment. However, a different subset of the environment was left unobserved, and the model was trained using an older variational inference scheme, not the one presented in cref:chap-dynamics. cref:fig-environment-1 shows the environment and details the quadcopter navigation problem. The controls were kept constant during data collection, reducing the dynamics to $\Delta \state_{\timeInd+1} = \dynamicsFunc(\state_{\timeInd}; \control_{\timeInd}=\fixedControl)$. See cref:sec-brl-experiment for more details on data collection and processing.
The model from cref:chap-dynamics was instantiated with
cref:fig-geometric-traj-opt shows the gating network posterior where the model has learned
two dynamics modes, characterised by the drift and process noise induced by the fan.
Mode 1 (red) represents the operable dynamics mode whilst Mode 2 (blue)
represents the inoperable turbulent dynamics mode.
This is illustrated in cref:fig-geometric-traj-opt-over-prob which shows the probability that
the desired mode (
The initial (cyan) trajectory in cref:fig-geometric-traj-opt was initialised as a straight line with
#+CAPTION[Comparison of \acrfull{ig} experiments on real-world quadcopter]: \acrfull{ig} Comparison of performance with different settings of
| Trajectory | Mixing Probability | Epistemic Uncertainty |
|            | $\sum_{i=1}^{I} \Pr(\modeVar_i=1 \mid \state_i)$ | $\sum_{i=1}^{I} \mathbb{V}[\gatingFunc_1(\state_i)]$ |
|------------+--------------------------------------------------+------------------------------------------------------|
| Initial    |                                                  |                                                      |
| Optimised  |                                                  |                                                      |
| Optimised  |                                                  |                                                      |
\newline
The higher probability of remaining in the desired dynamics mode indicates that
the lower setting of
In contrast, the trajectory found with
These results indicate the nonlinear program in cref:eq-collocation-problem,
i.e. solving the geodesic \acrshort{ode} in cref:eq-2ode via collocation, is capable of finding state
trajectories from
It is worth noting here that the mode remaining behaviour is sensitive to the tolerance on the collocation constraints. Setting the tolerance too small often resulted in the solver failing to converge, whilst setting the tolerance too large eliminated any mode remaining behaviour.
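For intuition, the following toy boundary value problem mirrors the collocation setup: under a flat metric the geodesic \acrshort{ode} reduces to zero acceleration, so a generic solver (here SciPy's solve_bvp, standing in for the collocation solver used in this work) should recover a straight line between the boundary states.

```python
import numpy as np
from scipy.integrate import solve_bvp

def geodesic_ode(t, y):
    # y[0] = state, y[1] = velocity; flat metric => zero acceleration
    return np.vstack([y[1], np.zeros_like(y[0])])

def boundary(ya, yb):
    # boundary conditions: s(0) = 0 (start state), s(1) = 1 (target state)
    return np.array([ya[0], yb[0] - 1.0])

t = np.linspace(0.0, 1.0, 11)
y_init = np.zeros((2, t.size))  # trivial initial trajectory guess
solution = solve_bvp(geodesic_ode, boundary, t, y_init)
```

Under a learned, state-dependent metric the right-hand side of the \acrshort{ode} gains curvature terms, and the solver's tolerance then controls the trade-off discussed above.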
All of the control methods are now tested in two simulated environments so that they can be compared. This section first details the two simulation environments.
The simulated environments have two dynamics modes,
The state domain is constrained to
Data set
We sampled
Following the experiments in cref:sec-traj-opt-results-brl,
the model from cref:chap-dynamics was instantiated with
Before evaluating the three trajectory optimisation algorithms in the simulated environments, let us restate the goals from cref:chap-traj-opt-control:
- Goal 1
- Navigate to the target state $\targetState$,
- Goal 2
- Remain in the operable, desired dynamics mode $\desiredMode$,
- Goal 3
- Avoid regions of the learned dynamics with high epistemic uncertainty,
- Goal 3.1
- in the desired dynamics mode $\latentFunc\desiredMode$, i.e. where the underlying dynamics are not known,
- Goal 3.2
- in the gating network $\modeVar$, i.e. where it is not known which mode governs the dynamics.
The performance of the trajectory optimisation algorithms is evaluated using four performance indicators:
- Goal 2 is measured using the probability of remaining in the desired mode without the model’s uncertainty marginalised,
This probability will only decrease when a trajectory leaves the desired mode.
- Goal 3.1 is measured using the state variance accumulated from cascading single-step predictions,
This will increase when a trajectory passes through regions of the desired dynamics mode with high epistemic uncertainty.
- Goal 3.2 is measured using the gating function variance accumulated from cascading single-step predictions,
This will increase when a trajectory passes through regions of the gating network with high epistemic uncertainty.
- Probability of remaining in the desired mode with the model’s uncertainty marginalised,
calculated using cref:eq-mode-chance-constraint-integral,
This probability will decrease when a trajectory leaves the desired mode and when it passes through regions of the learned dynamics with high uncertainty.
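Given per-step quantities from the cascaded predictions, these indicators can be accumulated as in the sketch below; the per-step values are made up, and the product/sum aggregations are plausible illustrative choices rather than the exact definitions used in the experiments.

```python
import numpy as np

# Hypothetical per-step quantities along a predicted trajectory.
mode_probs  = np.array([0.99, 0.98, 0.97, 0.99])  # Pr(mode = k* | s_t)
state_vars  = np.array([0.01, 0.02, 0.03, 0.02])  # desired-mode GP variance
gating_vars = np.array([0.10, 0.30, 0.20, 0.10])  # gating function variance

mode_prob_indicator = np.prod(mode_probs)   # Goal 2: stays near 1 in-mode
state_var_indicator = np.sum(state_vars)    # Goal 3.1: accumulated state var.
gating_var_indicator = np.sum(gating_vars)  # Goal 3.2: accumulated gating var.
```

A single step with low mode probability or high variance dominates the corresponding indicator, so trajectories are penalised for even brief excursions.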
The methods’ ability to remain in the desired dynamics mode
Three settings of the tunable
Initialisation All of the \acrfull{dre} and \acrfull{cai} experiments used
a horizon of
Visualisation
cref:fig-all-traj-opt-prob,fig-all-traj-opt-mean,fig-all-traj-opt-variance
show the trajectory optimisation results for both environments overlaid on their associated gating network posteriors.
In each figure, the left-hand column shows results for Environment 1
whilst the right-hand column shows the results for Environment 2.
The figures show the state trajectories obtained from cascading single-step predictions through
the desired mode’s \acrshort{gp} dynamics
using the controls found by the \acrshort{ig} method (top row), the
\acrshort{dre} method (middle row) and the \acrshort{cai} method (bottom row).
cref:fig-all-traj-opt-prob shows the optimised trajectories overlaid on the desired
mode’s (
The methods are first evaluated on their ability to navigate to the target state, that is, to achieve Goal 1.
\acrfull{ig}
The \acrshort{ig} method’s collocation solver was able to
find state trajectories to the target state in all experiments.
This is indicated by the collocation state trajectories
In Environment 1, the inference strategy successfully recovered controls that
drive the system along the collocation solver’s state trajectory.
This is indicated by the dynamics trajectory
One possible explanation is that the \acrshort{elbo} in cref:eq-control-elbo-svgp could not
recover the controls due to a local optimum.
The collocation state trajectories (left) in the experiments with
\acrfull{dre}
All of the \acrshort{dre} experiments, (middle row of
cref:fig-all-traj-opt-prob,fig-all-traj-opt-mean,fig-all-traj-opt-variance)
were able to find trajectories
that navigate to the target state
\acrfull{cai}
All of the \acrshort{cai} experiments successfully navigated to the target state under the dynamics.
This is indicated by the trajectories in
the bottom row of cref:fig-all-traj-opt-prob,fig-all-traj-opt-mean,fig-all-traj-opt-variance
successfully navigating to the target state
The experiments are now evaluated on their ability to remain in the desired dynamics mode, that is, to achieve Goal 2.
\newline
\acrfull{ig}
None of the \acrshort{ig} experiments were able to remain in the desired dynamics mode successfully.
In all but one of the experiments, this was due to the collocation solver
$\text{IG}_{\text{collocation}}$ finding trajectories that pass over the mode boundary into
the turbulent dynamics mode.
This is indicated by the collocation state trajectories in the left-hand plots of cref:fig-collocation-traj-opt
passing directly through the turbulent dynamics mode.
This motivated a further experiment, to see if the collocation solver was getting stuck in a local optimum
induced by the initial trajectory.
The $\text{IG}_{\text{collocation}}$
In Environment 1, although none of the \acrshort{ig}
experiments successfully remained in the desired dynamics mode, they did exhibit some mode remaining behaviour.
That is, the trajectories with
\newline
\acrfull{dre}
All but one of the \acrshort{dre} experiments were able to remain in the desired dynamics mode successfully.
This is indicated by the dynamics trajectories in the middle row of
cref:fig-all-traj-opt-prob,fig-all-traj-opt-mean,fig-all-traj-opt-variance
not passing over the mode boundary into the turbulent dynamics mode.
From visual inspection of the middle rows in
cref:fig-all-traj-opt-prob,fig-all-traj-opt-mean,fig-all-traj-opt-variance
the trajectories found with \acrshort{dre}
Although the trajectories found with \acrshort{dre}
The trajectory found with \acrshort{dre}
It is worth noting that in practice the mode chance constraints would inform us that the trajectory
is not
\newline
\acrfull{cai} It is clear from the bottom rows of cref:fig-all-traj-opt-prob,fig-all-traj-opt-mean,fig-all-traj-opt-variance that all \acrshort{cai} experiments found trajectories that remained in the desired dynamics mode. Further to this, the maximum entropy control term resulted in more clearance from the mode boundary. This is shown by the experiments with Gaussian controls \acrshort{cai} (gauss), indicated by black circles, having more clearance than the experiments with deterministic controls \acrshort{cai} (Dirac), indicated by the pink circles. This is a desirable behaviour that is expected from the maximum entropy control term. This observation is confirmed in cref:tab-results-sim-envs, where in both environments the experiments with maximum entropy control obtained the highest probabilities of remaining in the desired dynamics mode when considering the model’s epistemic uncertainty.
In Environment 2, both of the \acrshort{cai} experiments found discrete-time trajectories that remained in the desired dynamics mode. This is indicated by none of the trajectories’ time steps being in the turbulent dynamics mode. However, when interpolating the discrete-time trajectory for the Environment 2 experiment without maximum entropy control (pink circles), shown in the bottom right plots in cref:fig-all-traj-opt-prob,fig-all-traj-opt-mean,fig-all-traj-opt-variance, the trajectory crosses the mode boundary. This is an undesirable behaviour because in non-simulated environments this would correspond to the trajectory entering the turbulent dynamics mode. In contrast, the clearance resulting from the maximum entropy control term alleviates this issue.
\newline
Finally, the methods are evaluated on their ability to avoid regions of the dynamics with high epistemic uncertainty. cref:tab-results-sim-envs shows the state variance and the gating function variance accumulated over each trajectory. In all Environment 1 experiments, the state variances accumulated over the trajectories (from cascading single-step predictions via moment matching) were fairly similar. This is due to the dynamics of the simulator being simple enough that the desired mode’s \acrshort{gp} can confidently interpolate into both the turbulent dynamics mode and the region which has not been observed. In the latter case, it is up to the gating network to model the epistemic uncertainty arising from limited training data.
In contrast to the experiments in Environment 1, the state variance accumulated over each trajectory does vary between experiments in Environment 2. However, the results in cref:tab-results-sim-envs suggest that the state variance does not significantly impact the mode remaining behaviour. This is indicated by no correlation between the mode probability and the state variance, which is likely due to extremely low state variance.
\newline
\acrfull{ig}
In the \acrshort{ig} experiments in Environment 1 with
\newline
\acrfull{dre}
The \acrshort{dre} experiments exhibited the most uncertainty avoiding behaviour.
This is indicated in cref:tab-results-sim-envs by the Environment 1 experiment with \acrshort{dre}
In Environment 2 the trajectory found with \acrshort{dre}
These results indicate that the covariance term in the expected Riemannian metric plays an important role
in keeping trajectories in the desired dynamics mode.
However, they also indicate that
\newline
\acrfull{cai} The \acrshort{cai} experiments in Environment 1 avoided the region of high epistemic uncertainty in the gating network. This is shown by the \acrshort{cai} (gauss) (black circles) and the \acrshort{cai} (Dirac) (pink circles) in the bottom left plot of cref:fig-all-traj-opt-variance, navigating far to the left around the region of high gating function variance. This uncertainty avoiding behaviour is confirmed in cref:tab-results-sim-envs, where the \acrshort{cai} experiments accumulated the least amount of gating function variance out of all the trajectories which remained in the desired dynamics mode. However, in Environment 2, the \acrshort{cai} experiments did not exhibit as much uncertainty avoiding behaviour in the gating network. In fact, they obtained higher accumulations of gating function variance by quite a margin. As mentioned earlier, this result suggests that avoiding regions of high epistemic uncertainty is not the most important factor for obtaining the highest probability of remaining in the desired dynamics mode. This change in behaviour is due to the relevance of the uncertainty avoiding behaviour being automatically handled by the marginalisation of the gating function in cref:eq-traj-opt-elbo.
Although marginalisation is a principled way of handling uncertainty, in this scenario,
it can be conjectured otherwise.
First, note that the quantitative performance measures do not tell the full story.
This is because in the region with no observations, it is perfectly plausible that there
is another region belonging to the turbulent dynamics mode, or a different dynamics mode altogether.
In this case, the trajectory found with \acrshort{dre}
In both environments, the \acrshort{cai} (gauss) experiments with maximum entropy control obtained lower accumulations of gating function variance than the \acrshort{cai} (Dirac) experiments without maximum entropy control. This further confirms that the maximum entropy control term improves the performance of the \acrshort{cai} method.
\newline
Epistemic uncertainty Each experiment’s ability to avoid regions of high epistemic uncertainty is now evaluated.
In contrast to the experiments in Environment 1, the state variance accumulated over each trajectory does vary
between experiments.
However, the results in cref:tab-results-sim-envs suggest that the state variance does not have a large impact
on the mode remaining behaviour.
This is indicated by no correlation between the mode probability and the state variance,
which is likely due to the state variance being extremely low.
The trajectory found with
The control-as-inference experiment with maximum entropy control, shown in cref:fig-mode-conditioning-max-entropy-over-svgp-traj-opt-5, obtained a lower accumulation of gating function variance than the experiment without maximum entropy control. Relative to the geometry-based methods, the trajectories found with the control-as-inference method, shown in cref:fig-mode-conditioning-max-entropy-traj-opt-5,fig-mode-conditioning-no-max-entropy-traj-opt-5, do not exhibit as much uncertainty avoiding behaviour in the gating network. In fact, they obtained higher accumulations of gating function variance by quite a margin. In contrast, in Environment 1, the trajectories found with the control-as-inference method obtained some of the most uncertainty avoiding behaviour, indicated by some of the lowest accumulated gating function variance. This change in behaviour is due to the relevance of the uncertainty avoiding behaviour being automatically handled by the marginalisation of the gating function in cref:eq-traj-opt-elbo.
Although marginalisation is a principled way of handling uncertainty, it is arguably not the desired behaviour
in this scenario.
First note that the quantitative performance measures do not tell the full story.
This is because in the region with no observations, it is perfectly plausible that there
is another region belonging to the turbulent dynamics mode, or a different dynamics mode altogether.
In this case, the trajectory found with
The results for
\newline
The methods were then tested in a second simulated environment (Environment 2). In this environment, the turbulent dynamics mode and the unobserved regions are in different locations to Environment 1. cref:fig-dataset-scenario-5 shows the state transition data set that was sampled from Environment 2 and used to train the \acrshort{mosvgpe} dynamics model. cref:fig-traj-opt-gating-network-5 shows the gating network posterior after training on this data set. Expert 1 represents the turbulent dynamics mode and Expert 2 represents the operable, desired dynamics mode. This results from Expert 1 having much higher drift and process noise than Expert 2.
All of the direct optimal control and control-as-inference experiments used a horizon of
Four different settings of the
This section details the main findings from the experiments, compares the methods and discusses directions for future work.
Geometry-based methods The \acrshort{ig} method’s collocation solver is susceptible to local optima around its initial state trajectory. Although this can be overcome by engineering the initial solution, it makes it difficult to get the method working in practice. On top of this, the variational inference strategy used to recover the controls does not always recover the correct controls. This may be due to the collocation solver’s state trajectory not satisfying the dynamics constraints or the variational inference getting stuck in local optima. Regardless, this is a significant limitation that caused the \acrshort{ig} method to fail in some environments.
Overall, the \acrshort{dre} method performed significantly better than the \acrshort{ig} method in the experiments. Firstly, two of the \acrshort{dre} experiments in Environment 1 remained in the desired dynamics mode, whilst the \acrshort{ig} experiments could not. Further to this, the \acrshort{dre} approach worked in Environment 2, where it was not possible to get the \acrshort{ig} method working without engineering the initial solution. Not only did the \acrshort{dre} approach perform better in the experiments, but it is also significantly easier to configure. That is, setting the optimisation parameters is straightforward. In contrast, setting the parameters for the \acrshort{ig} method is not. In particular, setting the upper and lower bounds for the collocation constraints is pernickety. In conclusion, the \acrshort{dre} method is the preferred geometry-based method for finding mode remaining trajectories.
How to set $λ$?
Although increasing
The quantitative results indicate that good performance is generally achieved with
\acrfull{cai} The \acrshort{cai} method performed well in all experiments. The experiments showed that the maximum entropy control behaviour obtained from using normally distributed controls improves the mode remaining behaviour and works well in practice.
Overall
The results in cref:tab-results-sim-envs indicate that the \acrshort{cai} (Gauss) experiments were the highest performing
in both environments.
They obtained the highest probability of remaining in the desired dynamics mode when considering the model’s epistemic
uncertainty.
Visual inspection combined with the results in cref:tab-results-sim-envs, further
indicate that the \acrshort{dre}
This section compares the three control methods presented in cref:sec-traj-opt-collocation,sec-traj-opt-energy,chap-traj-opt-inference and suggests future work to address their limitations. cref:tab-geometry-control-comparison offers a succinct comparison of the three approaches.
Dynamics constraints
Both the \acrshort{cai} method presented in cref:chap-traj-opt-inference
and the \acrshort{dre} approach presented in
cref:sec-traj-opt-energy find trajectories that satisfy the dynamics constraints, i.e. the learned dynamics.
They achieve this by enforcing the distribution over state trajectories
to match the distribution from cascading single-step predictions through the desired mode’s dynamics \acrshort{gp}.
This can be seen as a method for approximating the integration of the controlled dynamics with respect to time, whilst
considering the epistemic uncertainty associated with learning from observations.
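The idea of cascading single-step predictions while tracking epistemic uncertainty can be sketched in one dimension. This is a minimal linearisation-based stand-in, not the thesis' full \acrshort{gp} moment matching; `step_mean`, `step_jac` and `step_noise_var` are hypothetical placeholders for the learned desired-mode dynamics:

```python
import numpy as np

def propagate_moment_matching(mu0, var0, step_mean, step_jac, step_noise_var, horizon):
    """Cascade single-step predictions by matching Gaussian moments (1-D sketch).

    step_mean(x):      predictive mean of the single-step model at x
    step_jac(x):       derivative of the predictive mean at x
    step_noise_var(x): predictive variance of the single-step model at x
    """
    mu, var = mu0, var0
    means, variances = [mu], [var]
    for _ in range(horizon):
        a = step_jac(mu)                         # local slope of the mean function
        var = a ** 2 * var + step_noise_var(mu)  # propagated + predictive variance
        mu = step_mean(mu)                       # propagate the mean
        means.append(mu)
        variances.append(var)
    return np.array(means), np.array(variances)
```

Note how the state variance accumulates at every step: the multi-step prediction never "forgets" the epistemic uncertainty of earlier transitions.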
The chance constraints ensure the controlled system is
Decision-making under uncertainty The combination of the approximate dynamics integration and the chance constraints in the \acrshort{dre} method leads to a closed-form expression for the expected cost. This expression considers the epistemic uncertainty associated with the learned dynamics model, both in the desired dynamics mode and in the gating network. Similarly, for the \acrshort{cai} method, the \acrshort{elbo} principally handles the uncertainty in both the desired dynamics mode and gating network. In contrast, the \acrshort{ig} method ignores the uncertainty accumulated by cascading single-step predictions through the learned dynamics model. That is, it considers the uncertainty associated with the gating network and ignores any uncertainty in the desired dynamics mode. Although this did not have a massive impact in the simulated quadcopter experiments, it may have a larger impact in real-world systems with more complex dynamics modes.
Uncertainty propagation This work only considered the moment matching approximation for propagating uncertainty through the probabilistic dynamics model. cite:chuaDeep2018 test different uncertainty propagation schemes in Bayesian neural networks. It would be interesting to test sampling-based approaches for the \acrshort{dre} approach. For example, to see the impact of calculating the expected Riemannian energy over a trajectory without approximations.
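A sampling-based alternative, in the spirit of the trajectory sampling schemes tested by cite:chuaDeep2018, can be sketched as follows. `step_sample` is a hypothetical sampler of $x_{t+1}$ given $x_t$; the empirical moments impose no Gaussian assumption, unlike moment matching:

```python
import numpy as np

def propagate_particles(mu0, var0, step_sample, horizon, n_particles=10_000, seed=0):
    """Monte Carlo uncertainty propagation (sketch).

    step_sample(x, rng): draws one next-state sample per particle in x.
    Returns the empirical mean and variance at the final time step.
    """
    rng = np.random.default_rng(seed)
    x = mu0 + np.sqrt(var0) * rng.standard_normal(n_particles)
    for _ in range(horizon):
        x = step_sample(x, rng)  # propagate every particle through the model
    return x.mean(), x.var()
```

For linear-Gaussian dynamics the particle estimates converge to the moment-matched solution; for multimodal or heavy-tailed predictive distributions they can differ substantially, which is precisely why testing both is of interest.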
Decouple goals
Without decoupling the uncertainty avoiding behaviour from the mode remaining behaviour,
it is not possible to find trajectories which avoid entering the region of high epistemic uncertainty in the
gating network.
The geometry-based approaches provide a mechanism to set the relevance of the covariance term, i.e. decouple the goals.
This was achieved by augmenting the expected Riemannian metric tensor with the user-tunable weight
parameter
Multiple desired modes Although not tested, the approaches are theoretically sound and should be applicable in systems with more than two dynamics modes. However, it is interesting to consider if this is even necessary given the goals. For example, in the quadcopter experiment, the dynamics model was intentionally instantiated with two dynamics modes, even though there could be more in reality. The desired dynamics mode was engineered to have a noise variance that was deemed operable. The other dynamics mode was then used to explain away all of the inoperable modes. In most scenarios, a similar approach could be followed. Nevertheless, it is interesting to consider systems with more than two modes. This is because the main goal is to avoid entering the inoperable dynamics mode, not just remain in the desired dynamics mode. Although not tested here, the \acrshort{cai} method should be able to remain in more than one desired dynamics mode. This can be achieved by conditioning on a set of modes. In contrast, the geometry-based methods in cref:chap-traj-opt-geometry are only capable of remaining in a single dynamics mode.
Continuous-time trajectories The \acrshort{dre} and \acrshort{cai} methods required control regularisation to prevent state transitions that “jump” over the undesired mode. Although the discrete-time steps of the trajectory appear to satisfy the constraints and minimise the cost, in reality, the continuous-time trajectory may pass through the undesired mode. This is a general problem that arises when solving continuous-time problems in discrete time. In contrast, the \acrshort{ig} method uses a collocation solver where the cost can be evaluated at arbitrary points along the continuous-time trajectory. An interesting direction for future work is to extend the \acrshort{dre} and \acrshort{cai} methods to find trajectories in continuous time. For example, cite:mukadamContinuoustime2018,dongMotion2016 use \acrshort{gps} for continuous-time trajectories when solving motion planning via probabilistic inference.
State-control dependent modes
This chapter assumed that the dynamics modes were separated according to their disjoint
state domains, i.e.
$\stateDomain_i ∩ \stateDomain_j = ∅$ for distinct
Dynamics models
The \acrshort{cai} method can be deployed in a wider range of dynamics models than the geometry-based
methods.
First of all, it is not limited to differentiable mean and covariance functions for the gating function \acrshort{gps}.
Secondly, it can be deployed in dynamics models learned with any \acrshort{mogpe} method.
This is because all \acrshort{mogpe} methods consist of a probability mass function over the expert indicator
variable.
In contrast, the geometry-based methods are limited to the \acrshort{mosvgpe} method because they depend on
its \acrshort{gp}-based gating network.
Further to this, the \acrshort{cai} method is applicable in systems where the dynamics modes are not
necessarily separated according to their state domains
Real-time control Whilst real-time control requires efficient algorithms, “offline” trajectory optimisation can trade computational cost for greater accuracy. This work is primarily interested in finding trajectories that attempt to remain in the desired dynamics mode. For simplicity, it has considered the “offline” trajectory optimisation setting. The increased computational time may hinder its suitability for obtaining a closed-loop controller via \acrshort{mpc} citep:eduardof.Model2007. However, it can be used “offline” to generate reference trajectories for a tracking controller or for guided policy search in a \acrshort{mbrl} setting citep:levineGuided2013. Alternatively, future work could investigate approximate inference methods for efficient state estimation to aid with real-time control, e.g. \acrshort{ilqg}/\acrshort{gp}.
Infeasible trajectories
It is worth noting that it might not be possible to find a trajectory
between a given start state
This chapter has evaluated and compared the \acrfull{ig} method from cref:sec-traj-opt-collocation, the \acrfull{dre} method from cref:sec-traj-opt-energy and the \acrfull{cai} method from cref:chap-traj-opt-inference. The methods’ abilities to navigate to a target state, whilst remaining in the desired dynamics mode, were evaluated on two velocity controlled quadcopter navigation problems. The results in this chapter have verified that the latent geometry of the \acrshort{mosvgpe} gating network can be used to encode mode remaining behaviour into control strategies. However, the results indicate that the \acrshort{ig} method is not only the lowest performing method but is also the hardest method to configure. In contrast, the \acrshort{dre} and \acrshort{cai} methods work well in practice, whilst being much easier to configure.
Although the geometry-based methods have a tunable parameter
From the experiments in this chapter, it can be concluded that both the
\acrshort{dre} and \acrshort{cai} methods are competitive approaches for finding mode remaining trajectories.
They find trajectories that navigate to the target state, satisfy the dynamics constraints and attempt to remain in the
desired dynamics mode.
Further to this, it is easy to verify that they are
Similarly to cref:chap-traj-opt-control,
this chapter is concerned with controlling unknown or partially unknown, multimodal dynamical systems,
from an initial state
The main contribution of this chapter is an explorative trajectory optimisation algorithm that can explore multimodal environments with a high probability of remaining in the desired dynamics mode. This chapter further proposes an approach to solve the mode remaining navigation problem from cref:problem-statement-main by consolidating all of the work in this thesis.
This chapter is organised as follows. cref:problem-statement-explore formally states the problem and the assumptions that are made in this chapter. cref:sec-exploration then details the exploration strategy proposed in this work and cref:sec-modeopt presents \acrfull{modeopt}, a \acrshort{mbrl} algorithm for solving the mode remaining navigation problem in cref:problem-statement-main. Finally, cref:sec-preliminary-results presents preliminary results in a simulated quadcopter environment and cref:sec-future-work-explore discusses \acrshort{modeopt} and details directions for future work.
Similarly to cref:chap-traj-opt-control, the goal of this chapter is to solve the mode
remaining navigation problem in cref:problem-statement-main.
That is, to navigate from an initial state
Given that this work leverages a learned dynamics model, it is not
possible to find trajectories that are mode remaining according to cref:def-mode-remaining-main.
Following cref:chap-traj-opt-control, this work relaxes the mode remaining constraint
to be
where the dynamics are given by cref:eq-dynamics-main, the objective is given by cref:eq-optimal-control-objective,
and the mode remaining constraint in cref:eq-chance-constraint-delta-mode is from the
It is worth noting that the agent would not be able to explore the environment without relaxing the
mode remaining constraint from cref:def-mode-remaining-main.
Intuitively, the more the
Initial mode remaining controller
In robotics applications, an initial set of poor performing controllers can normally be obtained via
simulation or domain knowledge.
This work assumes access to an initial data set of state transitions
$\dataset_0 = \{(\state_n, \control_n), Δ \state_n\}_{n=1}^{N_0}$ collected from around the start state
This data set is used to learn a predictive dynamics model
This section details our method for solving the mode remaining navigation problem by consolidating all of
the work in this thesis. The method is named \acrfull{modeopt}.
At its core, \acrshort{modeopt} learns a single-step dynamics model
\newline
$δ$-mode remaining exploration In order to implement such an algorithm, we require a method for exploring the environment with mode remaining guarantees. That is, a controller that will explore the environment whilst ensuring the controlled system remains in the desired dynamics mode with high probability, i.e. satisfies cref:eq-mode-remaining-def-explore.
The performance of \acrshort{modeopt} depends on its ability to explore the environment. This section recaps some relevant exploration strategies that fit into the general \acrfull{mbrl} procedure shown in cref:alg-mbrl.
Greedy exploitation
One of the most commonly used exploration strategies is to select the controller that maximises
the expected performance under the learned dynamics model $\dynamicsModel(f \mid \mathcal{D}_{0:i-1})$.
Note that
This approach is used in \acrshort{pilco} citep:deisenrothPILCO2011 and
\acrshort{gp}-\acrshort{mpc} citep:kamtheDataEfficient2018
where the dynamics are represented using \acrshort{gps} and the moment matching approximation is used to cascade
single-step predictions.
cite:parmasPIPPS2018 propose \acrshort{pipps},
a similar approach to \acrshort{pilco}, except that they use Monte Carlo methods
to propagate uncertainty forward in time, instead of using the moment matching approximation.
Similarly, \acrshort{pets} citep:chuaDeep2018 uses this exploration strategy but represents the dynamics
using ensembles of probabilistic neural networks.
This strategy initially favours exploring regions of the environment where the learned dynamics
model is not confident, i.e. has high epistemic uncertainty.
Once it has gathered knowledge of the environment and the model’s
epistemic uncertainty has been reduced, it favours maximising the objective function
In contrast to standard \acrshort{mbrl}, the method presented in this chapter side-steps the
challenging exploration-exploitation trade-off. That is, \acrshort{modeopt} does not require
an objective that changes its exploration-exploitation balance as it gathers more
knowledge of the environment. This is because it uses the
\newline
Active learning Active learning is another class of exploration algorithms. The goal of information-theoretic active learning is to reduce the number of possible hypotheses as fast as possible, e.g. minimise the uncertainty associated with the parameters using Shannon’s entropy citep:coverElements2006,
In contrast to greedy exploitation, active learning does not seek to maximise a
black-box objective. Instead, it is only interested in exploration. There are many
approaches to active learning in the static setting, i.e. in systems where an arbitrary
state
Recent work has addressed active learning in \acrshort{gp} dynamics models. cite:schreiterSafe2015 propose a greedy entropy-based strategy that considers the entropy of the next state. cite:buisson-fenetActively2020 also propose a greedy entropy-based strategy except that they consider the entropy accumulated over a trajectory. In contrast, cite:caponeLocalized2020,yuActive2021 propose to use the mutual information.
cite:caponeLocalized2020 find the most informative state as the one that minimises the mutual information between it and a set of reference states (a discretisation of the domain). They then find a set of controls to drive the system to this most informative state. Given a fixed number of time steps, their method yields a better model than the greedy entropy-based strategies. cite:yuActive2021 propose an alternative approach that leverages their \acrfull{gpssm} inference scheme to estimate the mutual information between all the variables in time $I\left[\mathbf{y}_{1:t}, \hat{\mathbf{y}}_{t+1} ; \mathbf{f}_{1:t+1} \right]$. Here $\mathbf{y}_{1:t}$ denotes the set of observed outputs and $\hat{\mathbf{y}}_{t+1}$ denotes the output predicted by the \acrshort{gpssm}. This contrasts with other approaches, which tend to study the latest mutual information $I[\hat{\mathbf{y}}_{t+1} ; \mathbf{f}_{t+1}]$.
Myopic active learning In \acrshort{rl} and control it is standard to consider objectives over a (potentially infinite) horizon. Should active learning in dynamical systems be any different? This thesis hypothesises that it is important to consider the information gain over a (potentially infinite) horizon, as opposed to myopically selecting the next input, i.e. only considering the next query point. The mutual information approaches in cite:caponeLocalized2020,yuActive2021 fall into this myopic category as they only maximise the information gain at the next time step. In contrast, the greedy entropy-based strategy in cite:buisson-fenetActively2020 considers the entropy over a horizon. The exploration strategy presented in this chapter follows a similar approach and relieves the myopia of active learning by leveraging a similar greedy entropy-based strategy over a finite horizon.
The performance of \acrshort{modeopt} depends on its ability to explore the environment. The exploration strategy is primarily interested in exploring a single desired dynamics mode whilst avoiding entering any of the other modes. This is a challenging problem because the agent must observe regions outside of the desired dynamics mode in order to learn that a particular region does not belong to the desired mode. To the best of our knowledge, there is no previous work addressing the exploration of a single dynamics mode in multimodal dynamical systems.
$δ$-mode remaining exploration
The exploration strategy presented here is primarily interested in exploring
regions of the gating network that are uncertain.
It relieves the myopia of active learning by considering trajectory optimisation with
an entropy-based strategy over a finite horizon. Further to this,
it principally ensures solutions are
where $\dynamicsModel(\state_{\timeInd+1} \mid \state_0, \control_{0:\timeInd}, \modeVarTraj_{0:\timeInd}=\desiredMode)$
is the multi-step dynamics prediction calculated by recursively propagating uncertainty through the desired
mode’s dynamics \acrshort{gp} using the moment matching approximation.
See cref:eq-state-unc-prop for more details.
Chance constraints
To ensure that the controlled system remains in the desired mode with high probability, i.e. trajectories are
\newline
Gating network entropy
To calculate the joint gating entropy term
$\mathcal{H}\left[\desiredGatingFunction(\stateTraj) \mid \stateTraj, \dataset_{0:i-1} \right]$
the state trajectory
where
where
where $q(\desiredGatingFunction(\gatingInducingInput)) = \mathcal{N}\left( \desiredGatingFunction(\gatingInducingInput) \mid \hat{\mathbf{m}}_{\desiredMode}, \hat{\mathbf{S}}_{\desiredMode} \right)$.
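For reference, the joint entropy of a Gaussian over the gating function values along a trajectory has the closed form $\mathcal{H} = \frac{1}{2}\log\det(2\pi e \Sigma)$. A small sketch, where `cov` stands in for the posterior covariance produced by the sparse \acrshort{gp} above (hypothetical variable names, not the thesis' code):

```python
import numpy as np

def gaussian_joint_entropy(cov):
    """Differential entropy of N(m, cov): 0.5 * logdet(2*pi*e*cov).

    `cov` would be the posterior covariance of the gating function
    evaluated at every state along the trajectory.
    """
    dim = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)  # stable log-determinant
    if sign <= 0:
        raise ValueError("covariance must be positive definite")
    return 0.5 * (dim * np.log(2.0 * np.pi * np.e) + logdet)
```

Using `slogdet` rather than `log(det(...))` avoids underflow when the trajectory is long and the gating covariance is nearly singular.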
\newline
Gating network entropy
The conditional entropy of the gating network over a trajectory
$\mathcal{H}\left[\desiredGatingFunction(\state_{\timeInd}) \mid \stateTraj, \desiredGatingFunction(\stateTraj_{¬\timeInd}), \dataset \right]$
can be calculated by assuming that the unobserved gating function values
where the mean and variance are given by,
where
where
This section details how this exploration strategy is embedded into a \acrshort{mbrl} loop
to solve the mode remaining navigation problem in cref:eq-mode-explore-problem.
The algorithm is named \acrshort{modeopt} and is detailed in cref:alg-mode-opt.
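At a high level, cref:alg-mode-opt follows the generic \acrshort{mbrl} template of cref:alg-mbrl. The following schematic sketch uses hypothetical stand-ins (`train_model`, `plan` and `env.rollout` are placeholders, not the thesis' API):

```python
def mbrl_loop(env, train_model, plan, dataset, n_iterations):
    """Generic model-based RL template: fit model, plan, act, augment data.

    train_model(dataset) -> model    would fit the MoSVGPE dynamics,
    plan(model) -> controls          would run the explorative trajectory optimiser,
    env.rollout(controls)            would execute the controls and return transitions.
    """
    for _ in range(n_iterations):
        model = train_model(dataset)                # train on D_{0:i}
        controls = plan(model)                      # constrained explorative planning
        dataset = dataset + env.rollout(controls)   # collect and append D_{0:i+1}
    return train_model(dataset), dataset
```

Each pass through the loop corresponds to one row in the figures of cref:sec-preliminary-results: the model is retrained on the augmented data set, which in turn expands the mode chance constraints for the next planning step.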
The algorithm is initialised with a start state
\newline
Dynamics model $\dynamicsModel$
\acrshort{modeopt} learns a factorised representation of the underlying
dynamics modes using the \acrshort{mosvgpe} model from cref:chap-dynamics.
Importantly, it learns a single-step dynamics model
which also infers valuable information regarding how the system switches between the modes.
Mode remaining controller $\modeController$
The mode remaining control methods from cref:chap-traj-opt-control
are used to construct a mode remaining controller
It uses the learned dynamics model
Explorative controller $\explorativeController$
When
The goal of this controller is to explore the environment whilst remaining in the desired dynamics mode. It is worth noting that this is an open-loop trajectory optimisation algorithm. Extending the method in cref:sec-exploration to feedback controllers is left for future work.
This section presents initial results solving the illustrative quadcopter navigation problem in cref:illustrative_example using the exploration strategy from cref:sec-exploration. In particular, it shows results using the simulated Environment 1 from cref:sec-traj-opt-results-simulated. See cref:sec-traj-opt-results-simulated for more details on the environment and the simulator set up.
cref:fig-explorative-traj-opt-7-initial shows the \acrshort{mosvgpe}’s gating network posterior after training
on the initial data set
This section details how the experiments were configured.
Initial data set $\dataset_0$
The initial data set was collected by simulating
Model learning Following the experiments in cref:chap-traj-opt-results,
this section instantiated the \acrshort{mosvgpe} dynamics model with
Explorative controller
The explorative controller
This section evaluates the different terms in the explorative controller’s objective from cref:eq-explorative-traj-opt. In particular, it motivates why the entropy-based term was combined with the state difference term. It then shows why the joint gating entropy over a trajectory was used, instead of summing the marginal entropy at each time step, as is done in cite:buisson-fenetActively2020.
\newline
State difference term A simple exploration strategy is to favour trajectories whose centre of mass is closer to the target state. This corresponds to solving the constrained optimisation in cref:eq-explorative-traj-opt with only the state difference term. However, using only the state difference term with the mode chance constraints leads to the optimisation getting stuck in a local optimum and never exploring to the target state. cref:fig-explorative-state-diff-traj-opt-over-prob-7 shows the final iteration of the optimisation, where the trajectory is stuck at the mode boundary. If the strategy searched more to the left it would be able to expand the mode chance constraints and explore around the mode boundary. However, the mode chance constraints have induced a local optimum that prevented \acrshort{modeopt} from exploring in this direction. A strategy which favours searching regions of the state space which have not previously been observed could likely avoid the local optimum in cref:fig-explorative-state-diff-traj-opt-over-prob-7. This intuition motivated adding the gating entropy term in cref:eq-explorative-traj-opt, which favours exploration in regions of high epistemic uncertainty.
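A sketch of such a combined objective, under the intuition above. The weighting and exact functional form are assumptions for illustration, not the thesis' implementation; `gating_cov` stands in for the posterior covariance of the desired mode's gating function along the trajectory:

```python
import numpy as np

def explorative_objective(states, target, gating_cov, entropy_weight=1.0):
    """Joint gating entropy bonus minus a distance-to-target penalty (sketch).

    All names are hypothetical: `states` is the planned state trajectory,
    `target` the goal state, `gating_cov` the gating posterior covariance.
    """
    _, logdet = np.linalg.slogdet(gating_cov)
    entropy = 0.5 * (gating_cov.shape[0] * np.log(2.0 * np.pi * np.e) + logdet)
    state_diff = np.sum((np.asarray(states) - np.asarray(target)) ** 2)
    return entropy_weight * entropy - state_diff
```

The entropy term rewards visiting states where the gating posterior is uncertain, while the state difference term keeps the optimisation progressing towards the target rather than wandering.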
\newline
Joint vs factorised entropy Before combining the state difference and entropy terms into a single objective, the impact of different gating entropy objectives is evaluated. In particular, the joint gating entropy term in cref:eq-explorative-traj-opt, given by,
is compared with the sum of marginal entropies over the trajectory, which is referred to as the factorised entropy and given by,
cref:fig-entropy-comparison-traj-opt-7 shows the trajectories found at the first \acrshort{modeopt} iteration
when using these two entropy objectives.
The factorised entropy objective, shown in
cref:fig-entropy-mvn-diag-traj-opt-over-prob-7, has collapsed the entire trajectory onto a single state
of high entropy.
This is an undesirable behaviour because it does not maximise the information gain along the entire trajectory.
In contrast, the result in cref:fig-entropy-mvn-full-cov-traj-opt-over-prob-7 considers the joint entropy over
the entire trajectory. The trajectory has spread along the
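The collapse behaviour of the factorised objective can be illustrated numerically with a toy covariance over two highly correlated time steps (a hypothetical sketch, not the thesis' code):

```python
import numpy as np

def joint_entropy(cov):
    # 0.5 * logdet(2*pi*e*cov): accounts for correlations between time steps
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (cov.shape[0] * np.log(2.0 * np.pi * np.e) + logdet)

def factorised_entropy(cov):
    # sum of marginal entropies: ignores correlations entirely
    return float(np.sum(0.5 * np.log(2.0 * np.pi * np.e * np.diag(cov))))

# Two nearby states whose gating values are almost perfectly correlated,
# i.e. querying the second state adds almost no new information.
correlated = np.array([[1.0, 0.95],
                       [0.95, 1.0]])
```

The factorised objective scores the redundant second query as highly as an independent one, so maximising it can collapse all time steps onto a single high-entropy state; the joint objective discounts correlated queries and therefore spreads the trajectory out.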
This section presents preliminary results of \acrshort{modeopt}’s exploration strategy.
The results show the strategy successfully exploring the simulated Environment 1 from cref:sec-traj-opt-results-simulated.
The results presented here show how the desired mode’s mixing probability and the gating function’s variance
change during each iteration.
The exploration strategy is visualised using a grid of figures where each row corresponds to
the data collection process at a particular iteration:
- The first (left) column shows the data set $\dataset_{0:i}$ collected at the previous iterations, overlaid on the gating network posterior after training on it.
- The second (middle) column overlays the trajectory found by the explorative controller $\explorativeController$ on the gating network posterior before training on the data collected by the explorative controller.
- The third (right) column shows the updated data set $\dataset_{0:i+1}$ after collecting data with the explorative controller, overlaid on the gating network before training on it.
The row below then shows the gating network posterior after training on the updated data set $\dataset_{0:i+1}$ from the previous iteration. Together cref:fig-explorative-traj-opt-over-prob-7-0-4,fig-explorative-traj-opt-over-variance-7-0-4 show how the model’s epistemic uncertainty reduces after collecting and training on new data, and how this leads to the mode chance constraints expanding at each iteration.
\newline
Initial model
The top left figures in
cref:fig-explorative-traj-opt-over-prob-7-0-4,fig-explorative-traj-opt-over-variance-7-0-4
show the initial data set
Iteration $\mathbf{i=0}$
The top row of
cref:fig-explorative-traj-opt-over-prob-7-0-4,fig-explorative-traj-opt-over-variance-7-0-4
visualises the initial iteration
\newline
Iteration $\mathbf{i=2}$
The second row of plots in
cref:fig-explorative-traj-opt-over-prob-7-0-4 shows the first iteration
where the
Iteration $\mathbf{i=3}$ The third row shows another iteration where the dynamics model did not learn the mode boundary and as a result, found an explorative trajectory which left the desired dynamics mode. Although not desirable, collecting this data was necessary for inferring the mode boundary.
Iteration $\mathbf{i=4}$
The fourth row shows the next iteration, in which, given these new observations,
the \acrshort{mosvgpe}’s gating network is able to infer that the
state transitions belong to another dynamics mode.
As a result, the explorative trajectory did not leave the desired dynamics mode.
That is, the mode chance constraints successfully restricted the domain such that the trajectory did not leave
the desired dynamics mode.
Further to this, with
Leaving the desired dynamics mode The interplay between the explorative objective and the mode chance constraints was effective at preventing the explorative controller from leaving the desired dynamics mode during training. Only two iterations of \acrshort{modeopt} left the desired dynamics mode during training and arguably this is necessary in order to learn the location of the mode boundary.
\newline
Final iteration
cref:fig-explorative-traj-opt-over-prob-7-6-14,fig-explorative-traj-opt-over-variance-7-6-14
show the later iterations of the exploration phase.
They show \acrshort{modeopt} gradually exploring the domain by expanding the mode chance constraints.
The bottom row shows the final iteration of the exploration phase where the
mode chance constraints have expanded such that the target state
\newline
Exploration
The results in
cref:fig-explorative-traj-opt-over-prob-7-0-4,fig-explorative-traj-opt-over-variance-7-0-4,fig-explorative-traj-opt-over-prob-7-6-14,fig-explorative-traj-opt-over-variance-7-6-14
suggest that balancing the state difference and the joint gating entropy terms in
cref:eq-explorative-traj-opt results in exploration to the target state
This section discusses \acrshort{modeopt} and proposes some directions for future work.
\newline
Further experiments First of all, it should be noted that the work presented in this chapter is a first step at consolidating the work in this thesis to solve the mode remaining navigation problem in cref:eq-mode-explore-problem. As such, more experiments are required to fully test \acrshort{modeopt}. For example, further experiments are required to validate the exploration strategy in environments with more than two dynamics modes and on real-world systems.
Exploration guarantees
\acrshort{modeopt} side-steps the exploration-exploitation trade-off which is common in \acrshort{mbrl}.
It does so by separating the exploration into the explorative controller
Myopic vs non-myopic active learning
In dynamical systems, myopic learning corresponds to selecting the control input by only considering the information gain
at the next state.
In contrast, non-myopic learning selects the control input by considering the
information gain over the next
\newline Information criterion The exploration strategy presented in this chapter has shown that the \acrshort{gp}-based gating network is useful for active learning when considering trajectories over a horizon. This is because the \acrshort{gps} have been able to model the joint distribution over the gating function values along a trajectory. This enabled the exploration strategy to find trajectories where each time step is aware of the information gain of other time steps. This is useful for finding non-myopic trajectories as it increases the information gain over the entire trajectory. However, this has only been verified by visual inspection and further analysis is left for future work. It is also worth noting that the exploration strategy is only possible because the gating network is based on \acrshort{gps}. As such, this method is not widely applicable in all \acrshort{mogpe} dynamics models.
An alternative approach is to use the entropy of the mode indicator variable
A popular approach to active learning for binary classification is \acrfull{bald} citep:houlsbyBayesian2011. \acrshort{bald} alleviates this issue by considering the following objective (in the myopic setting),
This objective is visualised in the right column of cref:fig-explorative-traj-opt-entropy-comparison-7-2-7. Intuitively, it seeks the state $\state\timeInd$ for which the parameters under the posterior disagree about the outcome the most, hence the name. The objective is low in regions of the mode boundary which have been observed. This is a desirable behaviour which makes extending this objective to dynamical systems an interesting direction for future work.
It is worth noting that the joint distribution over the mode indicator variable in \acrshort{mogpe} models is factorised over time. As such, there is no way to condition the entropy at a given time step on all other time steps. However, it may be possible to use the joint distribution over the gating function values $p(\gatingFunc(\stateTraj) \mid \stateTraj, \dataset_{0:i})$ to achieve this behaviour.
\newline
More than 2 modes Although not tested here, \acrshort{modeopt} is theoretically sound and should be applicable in environments with more than two dynamics modes. However, the exploration strategy uses the joint entropy of the desired mode’s gating function over a trajectory. Further experiments are required to see if this strategy works when the \acrshort{mosvgpe} dynamics model is instantiated with more than two experts. This is because, with more than two experts, the \acrshort{mosvgpe} model uses the Softmax likelihood with a gating function for each expert, whereas the two-expert case uses the Bernoulli likelihood with a single gating function.
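The relationship between the two likelihoods can be illustrated numerically: with two experts, a softmax over the single gating function and a zero logit recovers the Bernoulli (sigmoid) parameterisation, whereas $K > 2$ experts require one gating function per expert. A minimal numpy sketch, not the thesis implementation:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Two experts: softmax over [h, 0] equals the sigmoid of the single
# gating function, so one gating function suffices.
h = np.array([-2.0, 0.0, 1.5])
two_expert = softmax(np.stack([h, np.zeros_like(h)], axis=-1), axis=-1)
assert np.allclose(two_expert[:, 0], sigmoid(h))

# K > 2 experts: one gating-function value per expert at each state.
hk = np.array([[0.2, -1.0, 0.7]])  # K = 3 gating values at one state
probs = softmax(hk)
assert np.isclose(probs.sum(), 1.0)
```

This identity is why the exploration strategy, which relies on the joint distribution over a single gating function, transfers directly only to the two-expert case and needs further validation for $K > 2$.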
Inducing points The method presented in this chapter initialised each of the sparse \acrshort{gps} in the \acrshort{mosvgpe} dynamics model with a fixed number of inducing points. Although this approach worked well in the experiments, it is unlikely that this will always be the case. For example, consider exploring much larger environments, i.e. environments with larger state domains. In these environments, the \acrshort{mosvgpe}’s ability to accurately model an ever-increasing data set with a fixed number of inducing points will decrease. This is because the quality of the sparse approximation to the true nonparametric model deteriorates as the number of data points grows relative to the number of inducing points. See cite:burtRates2019 for details on rates of convergence for sparse \acrshort{gps}. As such, an interesting direction for future work is to study methods for dynamically adding new inducing points to each \acrshort{gp}.
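One simple heuristic for dynamically adding inducing points is to greedily append the newly gathered states with the highest prior-conditional variance given the current inducing set. The following numpy sketch is a hypothetical illustration of this heuristic, not the method used in the thesis; the kernel, its hyperparameters and the data are all invented:

```python
import numpy as np

def se_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel between two sets of points."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def add_inducing_points(X, Z, n_new, jitter=1e-6):
    """Greedily append the n_new inputs from X with the highest
    prior-conditional variance given the current inducing set Z."""
    Z = list(Z)
    for _ in range(n_new):
        Zc = np.array(Z)
        Kzz = se_kernel(Zc, Zc) + jitter * np.eye(len(Z))
        Kxz = se_kernel(X, Zc)
        # Var[f(x) | f(Z)] under the GP prior (Nystrom residual);
        # points close to existing inducing inputs score near zero.
        var = se_kernel(X, X).diagonal() - np.einsum(
            "ij,ij->i", Kxz @ np.linalg.inv(Kzz), Kxz)
        Z.append(X[np.argmax(var)])
    return np.array(Z)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 2))  # newly gathered states
Z0 = X[:3]                            # current inducing inputs
Z1 = add_inducing_points(X, Z0, n_new=2)
assert Z1.shape == (5, 2)
```

More principled alternatives exist (e.g. the convergence-rate analysis in cite:burtRates2019 motivates growing the inducing set with the data), but this sketch conveys the basic idea of spending new inducing points where the current approximation is most uncertain.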
Fixing model parameters during training During its initial iterations,
\acrshort{modeopt} only explores the desired dynamics mode and does not observe any state
transitions belonging to another mode. As a result, the lengthscale of the gating network \acrshort{gp} increases.
This often results in the gating network becoming overconfident,
almost as if it believes that only a single dynamics mode exists over
the entire domain. When this happens, the $\delta$-mode remaining chance constraints become uninformative. Fixing the gating network's kernel parameters during these initial iterations, and only optimising them once transitions from multiple modes have been observed, may alleviate this issue.
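For intuition, the effect of an overconfident gating posterior on a mode chance constraint can be sketched numerically. Assuming a probit-Bernoulli gating likelihood (a hedged assumption; the thesis may use a different link function), the Gaussian integral of the probit is available in closed form:

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mode_probability(mu, var):
    """P(alpha = desired mode | state) for a probit-Bernoulli gating
    likelihood with Gaussian gating posterior N(mu, var):
    integral of Phi(h) N(h; mu, var) dh = Phi(mu / sqrt(1 + var))."""
    return std_normal_cdf(mu / math.sqrt(1.0 + var))

def satisfies_chance_constraint(mu, var, delta=0.1):
    """delta-mode-remaining check: P(desired mode) >= 1 - delta."""
    return mode_probability(mu, var) >= 1.0 - delta

# Confident gating posterior deep inside the desired mode: satisfied.
assert satisfies_chance_constraint(mu=3.0, var=0.1, delta=0.1)
# Same gating mean, inflated variance: the mode probability is pulled
# towards 0.5 and the constraint is violated.
assert not satisfies_chance_constraint(mu=3.0, var=30.0, delta=0.1)
```

The sketch shows why a miscalibrated gating posterior matters: the chance constraint is only as informative as the posterior variance it is evaluated under, so an overconfident (too-small) variance would wrongly certify states near an unobserved mode boundary as safe.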
Exploring a single dynamics mode in multimodal dynamical systems,
where neither the underlying dynamics modes, nor how the system switches between them, are fully known a priori,
is a hard problem.
This is because the agent must observe regions outside of the desired dynamics mode
in order to know that a particular region does not belong to the desired mode.
Related work To the best of our knowledge, there is no previous work addressing the exploration of a single dynamics mode in multimodal dynamical systems. cite:schreiterSafe2015 use a \acrshort{gp} classifier to identify safe and unsafe regions when learning \acrshort{gp} dynamics models in an active learning setting. However, they assume that they can directly observe whether a particular data point from the environment belongs to either the safe or unsafe regions. In contrast, we considered scenarios where the mode cannot be directly observed from the environment, but instead, is inferred by the probabilistic dynamics model.
This chapter has presented a novel strategy for exploring multimodal dynamical systems whilst remaining
in the desired dynamics mode with high probability.
Moreover, it has proposed how this exploration strategy can be combined with the dynamics model from cref:chap-dynamics,
the \acrshort{mosvgpe} model.
The algorithm, named \acrshort{modeopt}, was tested in a simulated version of the illustrative quadcopter example, verifying that the algorithm can work in practice. However, it has not been fully tested in environments with more than two modes, nor has it been tested on a real-world system. Further testing and analysis of \acrshort{modeopt} is left for future work.
The main objective of this thesis was to solve the mode remaining navigation problem in cref:eq-main-problem. That is, to control a multimodal dynamical system – where neither the underlying dynamics modes, nor how the system switches between them, are known a priori – to a target state, whilst remaining in the desired dynamics mode. Based on well-established methods from Bayesian statistics and machine learning, this thesis proposed \acrshort{modeopt}, a general framework for approximately solving the mode remaining navigation problem.
At the core of \acrshort{modeopt} is a Bayesian approach to learning multimodal dynamical systems, named \acrfull{mosvgpe}, that accurately identifies the underlying dynamics modes, as well as how the system switches between them. Further to this, it learns informative latent structure that \acrshort{modeopt} leverages to encode mode remaining behaviour into control strategies. The method’s ability to learn factorised representations of multimodal data sets whilst retaining well-calibrated uncertainty estimates was validated on a real-world quadcopter data set, as well as on the motorcycle data set. Its applicability to learning dynamics models for model-based control was further validated in two simulated environments.
As this thesis has focused on model-based techniques that leverage a learned dynamics model, it had to relax the requirement of remaining in the desired dynamics mode, to remaining in the desired dynamics mode with high probability. Initially, when not much of the environment has been observed, it is not possible to find trajectories to the target state that remain in the desired mode with high probability. This is due to the learned dynamics model having high epistemic uncertainty. In this scenario, \acrshort{modeopt} reduces the model’s epistemic uncertainty by exploring the environment and updating the dynamics model with new data. \acrshort{modeopt} side-steps the exploration-exploitation trade-off which is common in \acrshort{mbrl} algorithms by introducing a set of chance constraints. That is, \acrshort{modeopt} does not need an objective function that changes its exploration-exploitation balance as it gathers more data. The mode chance constraints are a powerful tool that allow \acrshort{modeopt} to deploy separate controllers during the explorative and exploitative phases.
An explorative trajectory optimisation algorithm that leverages the \acrshort{gp}-based gating network is proposed. Experiments confirm that the explorative controller is effective at maximising the information gain over the entire trajectory and targeting exploration towards the target state. However, further analysis of the exploration strategy is left for future work.
Three exploitative trajectory optimisation algorithms that find trajectories to the target state, whilst attempting to remain in the desired dynamics mode have been presented. Two of the methods show how the latent geometry of the \acrshort{gp}-based gating network can be leveraged to encode mode remaining behaviour, whilst the third approach shows how this can be achieved by extending the \acrshort{cai_unimodal} framework to multimodal dynamical systems. Their ability to remain in the desired dynamics mode was tested in two simulated environments. Both the \acrfull{dre} method in cref:sec-traj-opt-energy and the \acrfull{cai} method in cref:chap-traj-opt-inference performed well in the experiments.
This thesis provided experimental evidence that \acrshort{modeopt} can work in practice, however, this was only in simulation. Evaluating its performance on real-world problems is left for future work.
- A probabilistic framework for learning multimodal dynamical systems (\acrshort{mosvgpe})
There are many promising directions for future work. Some are extensions of the work in this thesis, whilst others are alternative approaches to solving the mode remaining navigation problem in cref:eq-main-problem. Many of the extensions to the work in this thesis are discussed in the relevant chapters so are not re-discussed here.
Higher-dimensional problems Firstly, \acrshort{modeopt} is mostly restricted to lower-dimensional problems due to the difficulties of defining \acrshort{gp} priors in high dimensions. In \acrshort{mbrl} there has been interesting progress in learning better statistical models and scaling them up to higher-dimensional problems, for example, using Bayesian neural networks. This is an interesting direction for future work as it may lead to more practical algorithms.
External sensing \acrshort{modeopt} uses internal sensing to obtain information on the robot, such as where it is and how fast it is travelling. It then uses this information to infer the separation between the underlying dynamics modes from state transitions. However, in some applications, it may be possible to infer the separation between the underlying dynamics modes using external sensors. For example, in autonomous driving, the friction coefficients associated with different road surfaces may define a set of dynamics modes. In this setting, cameras could likely be used to detect changes in the road surface and thus the separation between the underlying dynamics modes. Although not applicable in all settings, this is a promising direction for future work, as the agent would never have to enter the undesired dynamics mode.
Real-time feedback control Although the trajectories found by the controllers performed well in simulation, deploying \acrshort{modeopt} on real-world systems will require real-time feedback control, for example, by re-optimising trajectories online in a receding-horizon fashion. Evaluating whether the trajectory optimisation is fast enough for such closed-loop deployment is left for future work.
Model-free approaches Finally, this thesis has solely focused on model-based approaches for solving the mode remaining navigation problem in cref:eq-main-problem. However, it may be possible to solve it with model-free methods. For example, \acrfull{mfrl} methods may be able to learn reactive policies which automatically turn back when they encounter hard to control dynamics.
Table ref:tab-params-quadcopter contains the optimisation settings and initial values for the optimisable parameters that were used to train the model on the velocity controlled quadcopter data set in Section ref:sec-brl-experiment.
Experts Both experts’ \acrshort{gp} priors were initialised with constant mean functions (with
a learnable parameter $c\modeInd$).
Gating Network The gating network \acrshort{gp} was initialised with a zero mean function, and a squared exponential kernel with \acrshort{ard}.
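For reference, the squared exponential kernel with \acrshort{ard} assigns one lengthscale per input dimension, so the gating network can learn which state dimensions it is sensitive to. A minimal numpy sketch with invented hyperparameter values:

```python
import numpy as np

def se_ard(X1, X2, lengthscales, variance=1.0):
    """Squared exponential kernel with ARD: one lengthscale per input
    dimension, k(x, x') = s^2 exp(-0.5 * sum_d (x_d - x'_d)^2 / l_d^2)."""
    d = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return variance * np.exp(-0.5 * (d ** 2).sum(-1))

X = np.array([[0.0, 0.0], [1.0, 10.0]])
# A long lengthscale in dimension 2 makes the kernel (nearly)
# insensitive to that dimension, despite the large separation there.
K = se_ard(X, X, lengthscales=np.array([1.0, 100.0]))
assert np.allclose(K.diagonal(), 1.0)
assert K[0, 1] > 0.5  # dominated by the dimension-1 distance
```

During training, the per-dimension lengthscales are optimised along with the other parameters listed in Table ref:tab-params-quadcopter.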
\newline Model Training Table ref:tab-params-quadcopter contains the initial values for all of the trainable parameters in the model.