Bayesian Learning for Control in Multimodal Dynamical Systems

Config

Run this src block with C-c C-c

Org Mode Export Options

Macros

LaTeX Export Headers and Options

Packages

xspace for space after text in newcommands

Font Awesome icons

Footnotes

Tensor indexing (pre subscripts)

Epigraph (chapter quotes)

Grey box for block quotes

Equation Definitions

Create a Definition theorem

Floating images configuration

By default, if a figure consumes 60% of the page it will get its own float-page. To change that we have to adjust the value of the floatpagefraction parameter.

See more information here.

Allow images to be cropped

Hyperref

Self-explanatory.

Cleveref

Bookmarks

The bookmark package implements a new bookmark (outline) organisation for package hyperref. This lets us change the “tree-navigation” associated with the generated pdf and constrain the menu only to H:2.

BBding

Symbols such as diamond suit, which can be used for aesthetically separating paragraphs, could be added with the package fdsymbol. I’ll use bbding which offers the more visually appealing \FourStar. I took this idea from seeing the thesis of the mimosis package author.

CS Quotes

The csquotes package offers context sensitive quotation facilities, improving the typesetting of inline quotes.

Already imported by mimosis class.

To enclose quote environments with quotes from csquotes, see the following TeX SE thread.

And then use quotes as:

# The options argument adds text after the environment. We use it to add the author.
#+ATTR_LATEX: :options {\cite{Frahm1994}}
/Current (fMRI) applications often rely on "effects" or "statistically significant differences", rather than on a proper analysis of the relationship between neuronal activity, haemodynamic consequences, and MRI physics./

Note that org-ref links won’t work here because the attr latex will be pasted as-is in the .tex file.

Date Time

The date time package allows us to specify a “formatted” date object, which will print different formats according to the current locale & language. I use this in my title page.

Bibliography

General configuration.

Improvements provided with the Mimosis class.

Remove ISSN, DOI and URL to shorten the bibliography.

And increase the spacing between the entries, as the default is too small.

Also reduce the font-size

Improve chapter font colors and font size

The following commands make chapter numbers BrickRed, which looks like the Donders colour.

Setspace for controlling line spacing

Already imported when using mimosis.

Parskip

Fine tuning of spacing between paragraphs. See thread here.

Table of Contents improvements

Possible Equation improvements

Make the equation numbers follow the chapter, not the whole thesis.

TikZ and bayesnet for graphical models

Captions

Maths diag

Algorithms

Maths

argmin/argmax

Maths cancel

UoB guidelines

PDF metadata:

Text Variables

Math Variables

Num Data / Mode / State Dimension / Control Dimension (k, d, t/n)

Data set

Noise Vars

Mode Indicator Variable

Tensor Indexing

Gating Network New

Experts New

Params

Sparse GPs

Experts

Variables
Inputs

Gating

Variables
Inputs

Misc

Old

Continuous

Prob Dists New

Prob Dists

Gating network

Kernels

Desired Mode

Jacobian

Old

Gating Network Old

Desired Mode Old

Experts Old

Tools

word count

ps2ascii phd-thesis.pdf | wc -w

Acronyms

{{{acronym(brl,BRL,Bristol Robotics Laboratory)}}} {{{acronym(mogpe,MoGPE,Mixtures of Gaussian Process Experts)}}} {{{acronym(moe,MoE,Mixture of Experts)}}} {{{acronym(mosvgpe,MoSVGPE,Mixtures of Sparse Variational Gaussian Process Experts)}}} {{{acronym(gp,GP,Gaussian Process)}}} {{{acronym(gps,GPs,Gaussian Processes)}}} {{{acronym(svgp,SVGP,Sparse Variational Gaussian Process)}}} {{{acronym(fitc,FITC,Fully Independent Training Conditional)}}} {{{acronym(mdp,MDP,Markov Decision Process)}}} {{{acronym(ard,ARD,Automatic Relevance Determination)}}} {{{acronym(ode,ODE,Ordinary Differential Equation)}}} {{{acronym(sde,SDE,Stochastic Differential Equation)}}} {{{acronym(elbo,ELBO,Evidence Lower Bound)}}} {{{acronym(vae,VAE,Variational Auto-Encoder)}}} {{{acronym(rl,RL,Reinforcement Learning)}}} {{{acronym(mbrl,MBRL,Model-Based Reinforcement Learning)}}} {{{acronym(mfrl,MFRL,Model-Free Reinforcement Learning)}}} {{{acronym(hmm,HMM,Hidden Markov Model)}}} {{{acronym(svi,SVI,Stochastic Variational Inference)}}} {{{acronym(soc,SOC,Stochastic Optimal Control)}}} {{{acronym(mpc,MPC,Model Predictive Control)}}} {{{acronym(lqr,LQR,Linear Quadratic Regulator)}}}

{{{acronym(ddp,DDP,Differential Dynamic Programming)}}} {{{acronym(ilqr,iLQR,iterative Linear Quadratic Regulator)}}} {{{acronym(ilqg,iLQG,iterative Linear Quadratic Gaussian)}}} {{{acronym(pid,PID,Proportional Integral Derivative)}}} {{{acronym(mcmc,MCMC,Markov Chain Monte Carlo)}}} {{{acronym(slsqp,SLSQP,Sequential Least Squares Programming)}}}

{{{acronym(epsrc,EPSRC,Engineering and Physical Sciences Research Council)}}} {{{acronym(farscope,FARSCOPE,Future Autonomous and Robotic Systems)}}}

{{{acronym(pets,PETS,Probabilistic Ensembles with Trajectory Sampling)}}} {{{acronym(pipps,PIPPS,Probabilistic Inference for Particle-based Policy Search)}}} {{{acronym(pilco,PILCO,Probabilistic Inference for Learning COntrol)}}}

{{{acronym(gpssm,GPSSM,Gaussian Process State Space Model)}}} {{{acronym(bald,BALD,Bayesian Active Learning by Disagreement)}}}

{{{acronym(ig,IG,Indirect Optimal Control via Latent Geodesics)}}} {{{acronym(dre,DRE,Direct Optimal Control via Riemannian Energy)}}} {{{acronym(cai,MRCaI,Mode Remaining Control as Inference)}}} {{{acronym(cai_unimodal,CaI,Control as Inference)}}}

{{{acronym(modeopt,ModeOpt,Mode Optimisation)}}}

{{{acronym(nlpp,NLPP,Negative Log Predictive Probability)}}} {{{acronym(mae,MAE,Mean Absolute Error)}}} {{{acronym(rmse,RMSE,Root Mean Squared Error)}}}

Frontmatter

Title Page

Abstract

Abstract

Abstract

Covid-19 Statement

Declaration

Acknowledgements

TOC and Mainmatter

Maths

lit review maths

Domains

Bounds

maths

kernels

inference

Maths Notation

Maths Symbols

Domains

Bounds

maths

kernels

inference

Tables

Introduction

maths

Domains

Maths

Intro

The modern world is pervaded with dynamical systems that we seek to control to achieve a desired behaviour. Examples include autonomous vehicles, aircraft, robotic manipulators, financial markets and energy management systems. In the last decade, \acrfull{rl}, and learning-based control in general, have become popular paradigms for controlling dynamical systems citep:hewingLearningBased2020,sutton2018reinforcement. This can be attributed to significant improvements in sensing and computational capabilities, as well as recent successes in machine learning.

This growing interest in learning-based control has emphasised the importance of real-world considerations. Real-world systems are often highly nonlinear, exhibit stochasticity and multimodalities, are expensive to run (energy-intensive, subject to wear and tear) and must be controlled subject to constraints (for safety, efficiency, and so on). In contrast to simulation, the control of physical systems also has real-world consequences: components may get damaged, the system may damage its environment, or the system may catastrophically fail. As such, any learning-based control strategy deployed in the real world should handle both the uncertainty inherent to the environment and the uncertainty introduced by learning from observations.

Many dynamical systems exhibit multimodalities, where some of the dynamics modes are believed to be inoperable or undesirable. These multimodalities may be due to spatially varying model parameters, for example, process noise terms modelling aircraft turbulence, or friction coefficients modelling surface-tyre interactions for ground vehicles over different terrain. In these systems, it is desirable to avoid entering specific dynamics modes that are believed to be inoperable. Perhaps they are hard to control due to stability issues, or the switch between modes is hard to control. Alternatively, they may be inefficient or low performing. Given these motivations, this thesis focuses on controlling dynamical systems from an initial state to a target state, whilst avoiding specific dynamics modes.

Model-based control comprises a powerful set of techniques for finding controls of constrained dynamical systems, given a dynamics model describing the evolution of the controlled system. It is commonly used for controlling aircraft, robotic manipulators, and walking robots citep:vonstrykDirect1992,bettsSurvey1998,gargUnified2010. One caveat is that it requires a relatively accurate mathematical model of the system. Traditionally, these mathematical models are built using first principles based on physics. However, accurately modelling the underlying transition dynamics can be challenging and lead to the introduction of model errors. These model errors may be due to incorrectly specifying model parameters or the model structure itself, for example, modelling a nonlinear system as linear or a multimodal system as unimodal. Incorrectly specifying these model parameters and their associated uncertainty can have a detrimental impact on controller performance. Handling these issues is a central goal of robust and stochastic optimal control citep:freemanRobust1996,stengelStochastic1986.

The difficulties associated with constructing mathematical representations of dynamical systems can be overcome by learning from observations citep:ljungSystem1999. Learning dynamics models has the added benefit that it alleviates the dependence on domain experts for specifying accurate models, making it easier to deploy more general techniques. However, learning dynamics models for control introduces other difficulties. For example, it is important to know where the model cannot predict confidently due to a lack of training observations. This concept is known as epistemic uncertainty and is reduced in the limit of infinite data. Correctly quantifying uncertainty is crucial for intelligent decision-making.

In a risk-averse setting, control strategies should avoid entering regions of a learned dynamics model with high epistemic uncertainty. This is because it is impossible to guarantee constraint satisfaction in a learned model, i.e. whether a trajectory will avoid the undesired dynamics mode. Conversely, in an explorative setting, if the epistemic uncertainty has been quantified, it can be used to guide exploration into regions of the dynamics that have not previously been observed. This experience can then be used to update the model, in turn reducing its epistemic uncertainty.

These two settings are the main focus of this thesis. If the dynamics are not fully known a priori, an agent will not be able to plan a risk-averse trajectory to the target state confidently. How can the agent explore its environment, in turn reducing the epistemic uncertainty associated with its dynamics model? As this thesis assumes that the dynamics are not fully known a priori, a main interest is jointly inferring the underlying dynamics modes, as well as how the system switches between them, through repeated interactions with the system. Once the agent has explored enough, how can this learned dynamics model be exploited to plan risk-averse trajectories that remain in the desired dynamics mode?

Illustrative Example label:illustrative_example

intro

The methods developed throughout this thesis are motivated by a 2D quadcopter navigation example. See cref:fig-problem-statement for a schematic of the environment and details of the problem. The goal is to fly the quadcopter from an initial state $\state_0$, to a target state $\statef$. However, it considers a quadcopter operating in an environment subject to spatially varying wind – induced by a fan – where the system can be represented by two dynamics modes,

Mode 1
a non-turbulent dynamics mode away from the fan,
Mode 2
an inoperable, turbulent dynamics mode in front of the fan.

The turbulent dynamics mode is subject to higher drift (in the negative $x$ direction) and to higher diffusion (process noise). It is hard to know the exact turbulent dynamics due to complex and uncertain interactions between the quadcopter and the wind field. Further to this, it is believed that controlling the system in the turbulent dynamics mode may be infeasible. This is because the unpredictability of the turbulence may cause catastrophic failure. Therefore, when flying the quadcopter from an initial state $\state_0$, to a target state $\statef$, it is desirable to find trajectories that avoid entering this turbulent dynamics mode. However, the underlying dynamics modes and how the system switches between them, are \textit{not fully known a priori}. To this end, this thesis is interested in synergising model learning and model-based control (or planning), to solve this quadcopter navigation problem.

intro

The methods developed throughout this thesis are motivated by a 2D quadcopter navigation example. See cref:fig-problem-statement for a schematic of the environment and details of the problem. The goal is to fly the quadcopter from an initial state $\state_0$, to a target state $\statef$. However, it considers a quadcopter operating in an environment subject to spatially varying wind – induced by a fan – where two dynamics modes can represent the system,

Mode 1
is an operable dynamics mode away from the fan,
Mode 2
is an inoperable, turbulent dynamics mode in front of the fan.

The turbulent dynamics mode is subject to higher drift (in the negative $x$ direction) and to higher diffusion (process noise). It is hard to know the exact turbulent dynamics due to complex and uncertain interactions between the quadcopter and the wind field. Further to this, controlling the system in the turbulent dynamics mode may be infeasible. This is because the unpredictability of the turbulence may cause catastrophic failure. Therefore, when flying the quadcopter from an initial state $\state_0$, to a target state $\statef$, it is desirable to find trajectories that avoid entering this turbulent dynamics mode.

The state space of the velocity-controlled quadcopter example consists of the 2D Cartesian coordinates $\state = (x, y)$ and the controls consist of the velocity in each direction, given by $\control = (\velocity_x, \velocity_y)$.
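To make this setting concrete, the following sketch simulates a two-mode, velocity-controlled system of this form. The fan region, drift and process noise values are illustrative assumptions and are not the values used in the experiments.

#+begin_src python
# Minimal sketch (assumed values, not the thesis' experimental setup): a velocity-controlled
# 2D quadcopter whose transition dynamics switch between an operable mode and a turbulent
# mode inside a hypothetical rectangular region in front of the fan.
import numpy as np

rng = np.random.default_rng(0)
DT = 0.25  # assumed discretisation time step

def mode_indicator(state):
    """alpha(x): returns 2 inside the assumed turbulent region, 1 elsewhere."""
    x, y = state
    return 2 if (-1.0 < x < 0.0) and (-1.0 < y < 1.0) else 1

def step(state, control):
    """Single-step delta-state dynamics: x_{t+1} = x_t + f_alpha(x_t, u_t) + eps_alpha."""
    if mode_indicator(state) == 1:
        drift = np.zeros(2)                 # operable mode away from the fan
        noise_std = np.array([0.01, 0.01])  # low process noise
    else:
        drift = np.array([-0.2, 0.0])       # higher drift in the negative x direction
        noise_std = np.array([0.1, 0.1])    # higher diffusion (process noise)
    delta = DT * np.asarray(control) + drift + noise_std * rng.standard_normal(2)
    return state + delta

state = np.array([-2.0, -2.0])              # initial state x_0
control = np.array([0.5, 0.5])              # velocity control u = (v_x, v_y)
for t in range(10):
    state = step(state, control)
#+end_src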

Problem Statement

The overall goal is to navigate from an initial state $\state_0$ – in the desired dynamics mode $\desiredMode$ – to the target state $\targetState$, whilst guaranteeing that the controlled system remains in the desired dynamics mode.

More formally, this thesis considers nonlinear, stochastic, multimodal dynamical systems,

with continuous states $\state_t ∈ \stateDomain \subseteq \R^\StateDim$, continuous controls $\control_t ∈ \controlDomain \subseteq \R^\ControlDim$ and a discrete-time index $t$. The discrete mode indicator function $α : \stateDomain → \modeDomain = \{1,\ldots,\ModeInd\}$, with $α_\timeInd = α(\state_\timeInd) ∈ \modeDomain$, indicates which of the $\ModeInd$ underlying dynamics modes $\{\mode{\latentFunc} : \mode{\stateDomain} × \controlDomain → \stateDomain \}_{\modeInd=1}^{\ModeInd}$, and associated noise models $\mode{\bm\epsilon} ∼ \mathcal{N}\left(\mathbf{0}, \bm\Sigma_{\mode{\bm\epsilon}} \right)$, governs the system at a given time step $\timeInd$.

The state difference between time $\timeInd$ and $\timeInd+1$ is denoted $Δ\state_{\timeInd+1} = \state_{\timeInd+1} - \state_\timeInd$. At any given time step $t$, one of the $\ModeInd$ dynamics modes $\{\unknownDynamicsK : \stateDomain × \controlDomain → \stateDomain \}_{\modeInd=1}^{\ModeInd}$, and associated noise models $\mode{\bm\epsilon} ∼ \mathcal{N}\left(\mathbf{0}, \bm\Sigma_{\mode{\bm\epsilon}} \right)$, where $\bm\Sigma_{\mode{\bm\epsilon}} = \text{diag}\left[ σ_{\modeInd,1}^2, \ldots, σ_{\modeInd,\StateDim}^2 \right]$, are selected by a discrete mode indicator variable, $α_t ∈ \modeDomain = \mathbb{Z} ∩ [1,\ModeInd]$. This work considers systems that are stochastic (subject to process noise), but are not subject to observation noise. Thus, the $\mode{\bm\epsilon}$ term solely represents the process noise.

It considers systems where the underlying dynamics modes are defined by disjoint state domains. That is, each dynamics mode is defined by its state domain $\mode{\stateDomain} = \{ \state ∈ \stateDomain \mid \modeVar(\state) = \modeInd \}$, with $\stateDomaini ∩ \stateDomainj = ∅$ for $i ≠ j$. Each mode’s dynamics are then given by,

Notice that each mode’s transition dynamics are free to leave their state space $\mode{\stateDomain}$ and enter another mode. Ideally, this work seeks to enforce that the controlled system remains in a given mode.

The overall goal is to navigate from an initial state $\state_0$ – in the desired dynamics mode $\desiredMode$ – to the target state $\targetState$, whilst guaranteeing that the controlled system remains in the desired dynamics mode. Formally, a mode remaining controlled system is defined as follows.

Notice that the controller $\policy : \stateDomain × \mathbb{Z} → \controlDomain$ is defined generally. That is, it may take the form of a deterministic feedback policy or an open-loop set of controls without state feedback. It is assumed that $Π$ is the set of deterministic policies.

Given this definition of a mode remaining controlled system, this work seeks to solve,

where $J : Π → \R$ denotes the objective function. Note that the objective function is not of primary importance in this work. The novelty of this problem arises from remaining in the desired dynamics modes $\desiredMode$, when the underlying dynamics modes and how the system switches between them, are \textit{not fully known a priori}. To this end, this thesis is interested in synergising model learning and model-based control (or planning), to solve this problem.

Motivation label:to_motivation

Given a historical data set of state transitions from the environment, a common approach is to first learn a single-step dynamics model using a Gaussian process citep:deisenrothPILCO2011,doerrOptimizing2017,vinogradskaStability2016,rohrProbabilistic2021,hewingLearningBased2020,kollerLearningBased2018. This learned dynamics model can then be leveraged for model-based control.

cref:fig-traj-opt-over-gating-mask-svgp-all-baseline demonstrates the shortcomings of performing trajectory optimisation in a \acrshort{gp} dynamics model trained on state transitions sampled from both dynamics modes. It shows that the \acrshort{gp} is not able to learn a representation of the dynamics which is true to the underlying system. This is because it cannot model the multimodal behaviour present in the environment. The optimised controls successfully drive the \acrshort{gp} dynamics from the start state $\state_0$, to the target state $\targetState$. However, as the \acrshort{gp} dynamics model does not correspond to the true underlying dynamics, the optimised controls do not successfully navigate to the target state $\targetState$ in the environment. This example demonstrates the issues associated with learning a dynamics model that cannot model the discontinuities associated with multimodal transition dynamics. Further to this, it demonstrates the more general shortcomings of model-based control when leveraging learned dynamics models that cannot represent the true underlying dynamics.

In contrast, cref:fig-traj-opt-over-gating-mask-svgp-desired-baseline shows results after training on state transitions from only the desired (operable) dynamics mode. The resulting model is able to accurately predict state transitions in the desired dynamics mode. However, as this approach only considers the dynamics of the desired mode, trajectories in the environment deviate from those planned in the learned dynamics model when they pass through the turbulent mode. This approach is problematic because the trajectory does not reach the target state $\targetState$, nor does it avoid the turbulent dynamics mode, which may lead to catastrophic failure. It is also worth noting here that the underlying dynamics modes and how the system switches between them, are not fully known a priori. As such, partitioning the data set and learning the dynamics model in cref:fig-traj-opt-over-gating-mask-svgp-desired-baseline is not possible in realistic scenarios.

cref:chap-dynamics strives to address these issues by learning representations of multimodal dynamical systems that correctly identify the underlying dynamics modes and how the system switches between them.

Conveniently, the \acrshort{mosvgpe} method from cref:chap-dynamics can be used to learn a factorised representation of the underlying dynamics modes. This method correctly identifies the underlying dynamics modes and provides informative latent spaces that can be used to encode mode remaining behaviour into control strategies. In particular, the \acrshort{gp}-based gating network infers informative latent structure. The remainder of this chapter is concerned with encoding mode remaining behaviour into trajectory optimisation algorithms after training a \acrshort{mosvgpe} dynamics model on the historical data set of state transitions.

The main goals of the trajectory optimisation in this chapter can be summarised as follows,

Goal 1
Remain in the desired dynamics mode $\desiredMode$, label:to-goal-mode
Goal 2
Avoid regions of the learned dynamics with high epistemic uncertainty, i.e. regions that cannot be predicted confidently (for example, due to limited training observations). label:to-goal-unc
  • in the desired dynamics mode,
  • in how the system switches between the underlying dynamics modes.

\todo{update fig-traj-opt-baseline with trajectories which don’t use nominal mean function so that they deviate in the region with high epistemic uncertainty.}

Problem Statement label:problem-statement-main

All of the methods in this thesis relate back to this mode remaining navigation problem. For this reason, the problem is formally defined here.

Contributions

This thesis explores methods for mode remaining control in multimodal dynamical systems that explicitly reason about the uncertainties arising during learning and control. The primary contributions of this thesis are as follows:

  • cref:chap-dynamics: is concerned with learning representations of multimodal dynamical systems, where both the underlying dynamics modes and how the system switches between them, are not fully known a priori. Motivated by learning dynamics models for model-based control, it formulates a probabilistic model rich with latent spaces for control. It then derives a variational inference scheme that handles uncertainty in a principled manner whilst providing scalability via stochastic gradient methods. The method is a \acrfull{mogpe} method with a \acrfull{gp}-based gating network.
  • cref:chap-traj-opt-control: investigates model-based control techniques that leverage the probabilistic model from cref:chap-dynamics to solve the mode remaining navigation problem. Due to the complexity of the problem, this chapter assumes prior access to the environment, such that a data set of state transitions $\mathcal{D}$ has previously been collected and used to train the model. It presents three trajectory optimisation algorithms that leverage the learned dynamics model’s latent structure to solve the mode remaining navigation problem.
  • cref:chap-active-learning: then considers the more realistic scenario of not having prior access to the environment. In this scenario, the agent does not have access to a historical data set for model learning. Instead, it must actively explore its environment, collect data and use it to update its dynamics model, whilst simultaneously attempting to remain in the desired dynamics mode. It presents an exploration strategy for exploring multimodal dynamical systems whilst remaining in a desired dynamics mode with high probability. It then details how this exploration strategy can be combined with the methods from cref:chap-traj-opt-control,chap-dynamics to solve the mode remaining navigation problem.

Associated Publications

The first trajectory optimisation algorithm presented in cref:sec-traj-opt-collocation and an initial version of the approach for learning multimodal dynamical systems in cref:chap-dynamics, are published in:

Introduction traj opt paper

Many physical systems operate under switching dynamics modes due to changing environmental or internal conditions. Examples include: robotic grasping where objects with different properties have to be manipulated, robotic locomotion in environments with varying surface types and the control of aircraft in environments subject to different levels of turbulence. When controlling these systems, it may be preferred to find trajectories that remain in a single dynamics mode. This paper is interested in controlling a DJI Tello quadcopter in an environment with spatially varying turbulence induced by a fan at the side of the room, shown in Fig. ref:fig-problem-statement. It is hard to know the exact transition dynamics due to complex and uncertain interactions between the quadcopter and the fan. The system’s transition dynamics resemble a mixture of two modes: a turbulent mode in front of the fan and a non-turbulent mode everywhere else. When planning a trajectory from start state $\mathbf{x}_0$ to desired state $\mathbf{x}_f$ it is preferred to avoid entering the turbulent mode, as it results in poor performance and sometimes even system failure.

Trajectory optimisation comprises a powerful set of techniques for finding open-loop controls of dynamical systems such that an objective function is minimised whilst satisfying a set of constraints. It is commonly used for controlling aircraft, robotic manipulators, and walking robots cite:VonStryk1992,Betts1998,Garg2010. One caveat to trajectory optimisation is that it requires a relatively accurate mathematical model of the system. Traditionally, these mathematical models are built using first principles based on physics. However, accurately modelling the underlying transition dynamics can be challenging and lead to the introduction of model errors. For example, both observation and process noise are inherent in many real-world systems and can be hard to model due to both spatial and temporal variations. Incorrectly accounting for this uncertainty can have a detrimental impact on controller performance and is an active area of research in the robust and stochastic optimal control communities cite:FreemanRandyA.2009,Stengel1988.

The difficulties associated with constructing mathematical models can be overcome by learning from observations cite:Ljung1997. However, learning dynamics models for control introduces other difficulties. For example, it is important to know where the model cannot predict confidently due to a lack of training observations. This concept is known as epistemic uncertainty and is reduced in the limit of infinite data. Probabilistic models have been used to account for epistemic uncertainty and also provide a principled approach to modelling stochasticity i.e. aleatoric uncertainty cite:Schneider1996,Deisenroth2011. For example, cite:Cutler,Deisenroth2011,Pan2014 use Gaussian processes (GPs) to learn transition dynamics. \acrshort{gps} lend themselves to data-efficient learning through the selection of informative priors, and when used in a Bayesian setting offer well calibrated uncertainty estimates. Methods for learning probabilistic multimodal transition dynamics have also been proposed: cite:Mckinnon used a mixture of \acrshort{gp} experts method, cite:Moerland studied the use of deep generative models and cite:Kaiser2020a proposed a Bayesian model that learns independent dynamics modes whilst maintaining a probabilistic belief over which mode is responsible for predicting at a given input location.

There has also been work developing control algorithms exploiting learned multimodal transition dynamics cite:Herzallah2020. However, our work differs as it seeks to find trajectories that remain in a single dynamics mode whilst avoiding regions of the transition dynamics that cannot be predicted confidently. To the best of our knowledge, there is no previous work addressing such trajectory optimisation in transition dynamics models.

Probabilistic modelling and Bayesian inference are a promising avenue for learning dynamics models to be used for controlling real-world systems. \parmarginnote{probabilistic modelling} The Bayesian framework provides a principled approach to modelling both the epistemic uncertainty associated with the model, and the aleatoric uncertainty inherent to the system (e.g. process noise). For example, cite:deisenrothPILCO2011,cutlerEfficient2015,panProbabilistic2014 use Gaussian processes (GPs) to learn transition dynamics from observations. \acrshort{gps} lend themselves to data-efficient learning through the selection of informative priors, and when used in a Bayesian setting offer well calibrated uncertainty estimates. Methods for learning probabilistic multimodal transition dynamics have also been proposed: cite:mckinnonLearning2017 use a Mixture of \acrshort{gp} Experts method, cite:moerlandLearning2017 study the use of deep generative models and cite:kaiserBayesian2020 propose a Bayesian model that learns independent dynamics modes whilst maintaining a probabilistic belief over which mode is responsible for predicting at a given input location.

Background and Related Work

maths

intro

The primary goal of this thesis is to control stochastic, multimodal, nonlinear dynamical systems to a target state, whilst remaining in the desired dynamics mode. This is a \acrfull{soc} problem which can be summarised as follows:

This chapter formally defines this mode remaining navigation problem and reviews the relevant literature.

Problem Statement label:problem-statement-main

Dynamical systems describe the behaviour of a system over time $t$ and are a key component of both control theory and \acrshort{rl}. At any given time $t$, a dynamical system has a state, represented as a vector of real numbers $\state_t ∈ \stateDomain \subseteq \R^\StateDim$. The system can be controlled by applying control actions $\mathbf{u}_t ∈ \controlDomain \subseteq \R^\ControlDim$ at any given time. This thesis considers stochastic, multimodal, nonlinear dynamical systems, given by,

where the discrete mode indicator function $α : \stateDomain → \modeDomain$ indicates which of the $\ModeInd$ underlying dynamics modes $\{\mode{\latentFunc} : \mode{\stateDomain} × \controlDomain → \stateDomain \}_{\modeInd=1}^{\ModeInd}$, and associated noise models $\mode{\bm\epsilon} ∼ \mathcal{N}\left(\mathbf{0}, \bm\Sigma_{\mode{\bm\epsilon}} \right)$, governs the system at a given time step $\timeInd$. The output of the mode indicator function is referred to as the mode indicator variable and is given by $α_\timeInd = α(\state_\timeInd) ∈ \modeDomain = \{1,\ldots,\ModeInd\} = \mathbb{Z} ∩ [1, \ModeInd]$.

This thesis assumes that the state $\state$ is observed directly and is not subject to observation noise. This is a standard assumption in the \acrfull{mdp} framework, which is commonly adopted in the \acrshort{rl} literature. In this case, the $\mode{\bm\epsilon}$ term solely represents the process noise, which accounts for unwanted and, in general, unknown system disturbances. For example, it is hard to model aerodynamic effects on aircraft, so these could be accounted for in the process noise term.

Optimal control Optimal control is a branch of mathematical optimisation that seeks to find a controller $π$ that optimises an objective function $J^π(\state)$. The objective function might be formulated to solve a particular task or to make the dynamical system behave in a certain way. Typically the goal is to minimise a cost function $\costFunc : \stateDomain × \controlDomain → \R$. This thesis considers the finite horizon problem, given by,

where $\TimeInd$ is known as the horizon. The objective $J^π(\state)$ quantifies the cost of deploying the controller $π$ from an initial state $\state$. The cost function typically consists of a terminal cost and a term which is integrated over the horizon (integral cost). In a navigation task, the terminal cost may consist of the distance to the target, whilst the integral cost may encode the notion of minimum effort control, e.g. energy consumption citep:kirkOptimal2004.
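For reference, one common way of writing such a finite-horizon objective with a terminal and an integral cost is the following; the terminal cost symbol $\costFunc_\TimeInd$ is illustrative notation and the exact form used later may differ,

\begin{equation*}
J^{π}(\state) = \E\left[ \costFunc_{\TimeInd}(\state_{\TimeInd}) + \sum_{\timeInd=0}^{\TimeInd-1} \costFunc(\state_\timeInd, \control_\timeInd) \,\middle|\, \state_0 = \state,\; \control_\timeInd = \policy(\state_\timeInd, \timeInd) \right].
\end{equation*}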

Controller space The controller space $Π$ defines the set of controllers over which optimisation is performed. Controllers can have state feedback such that they are given by $\control_t = \policy(\state_t)$. These are closed-loop controllers and in the \acrshort{rl} literature are referred to as policies. Alternatively, controllers can depend on time $\control_t = \policy(t)$, in which case they are open-loop controllers. This thesis considers the general case, given by $\control_t = \policy(\state_t, t)$, which encompasses both open-loop and closed-loop controllers.

Mode remaining This thesis considers systems where the underlying dynamics modes are defined by disjoint state domains, i.e. $\mode{\stateDomain} = \{ \state ∈ \stateDomain \mid \modeVar(\state) = \modeInd \}$, with $\stateDomaini ∩ \stateDomainj = ∅$ for distinct $i, j ∈ \{1, \ldots, \ModeInd\}$. Notice that each mode’s dynamics can leave their state space $\mode{\stateDomain}$ and enter another mode. Ideally, this work seeks to enforce that the controlled system remains in a given mode. Formally, a mode remaining controlled system is defined as follows.

Note that the objective function in cref:eq-optimal-control-objective is not of primary interest in this work. The novelty of this work arises from remaining in the desired dynamics mode $\desiredMode$.

Optimal Control label:sec-optimal-control

MDP fig

intro

The discrete-time optimal control problem considered in this thesis can be modelled as a \acrfull{mdp}, as seen in cref:fig-mdp. The \acrshort{mdp} framework refers to the controller $π$ as the agent, the control $\control$ as the action, and the dynamical system $f$ as the environment. Further to this, the goal is to maximise the cumulative reward instead of minimising the cumulative cost. However, slight manipulation of the objective enables cost and reward functions to be interchanged. The key idea in \acrshort{mdp}s is that all past information is represented by the current state $\state_t$, which is known as the Markov property. In this thesis, most work is formulated through the lens of optimal control. However, the controller is sometimes referred to as the agent and the system is often referred to as the environment.

In general, solutions to optimal control problems can be characterised by two different approaches, Pontryagin’s Minimum Principle citep:pontryagin1987mathematical (based on the calculus of variations) and Bellman’s Minimum Principle citep:bellmanDynamic1956. Although Pontryagin’s Minimum Principle offers computational benefits over Bellman’s principle, it does not readily generalise to the stochastic case. For this reason, we restrict our discussion to Bellman’s principle and the solution arising from it, known as dynamic programming.

Dynamic Programming

Dynamic programming citep:bellmanDynamic1956 encompasses a large class of algorithms that can be used to find optimal controllers given a model of the environment as an \acrshort{mdp}. However, classical dynamic programming algorithms are of limited use as they rely on accurate dynamics models and have a significant computational expense. Nevertheless, they are still important theoretically. The main idea of dynamic programming (and \acrshort{rl} in general) is to structure the search for good controllers using value functions $V$. Optimal controllers can easily be found from optimal value functions $V_*$, which satisfy the Bellman optimality equations,

In the finite horizon setting, directly solving the Bellman equations backwards in time is referred to as dynamic programming.
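To illustrate this backwards recursion, the following sketch runs finite-horizon dynamic programming on a small, randomly generated discretised \acrshort{mdp}; the transition probabilities and costs are arbitrary stand-ins for a real problem.

#+begin_src python
# Finite-horizon dynamic programming (backwards Bellman recursion) on a toy discretised MDP.
# All quantities are illustrative assumptions; only the recursion itself matters here.
import numpy as np

S, U, T = 5, 3, 10                                                # states, controls, horizon
P = np.random.default_rng(0).dirichlet(np.ones(S), size=(S, U))   # P[s, u, s'] transition probabilities
cost = np.random.default_rng(1).uniform(size=(S, U))              # stage cost c(s, u)
terminal_cost = np.linspace(0.0, 1.0, S)                          # terminal cost c_T(s)

V = terminal_cost.copy()                  # V_T(s) = c_T(s)
policy = np.zeros((T, S), dtype=int)
for t in reversed(range(T)):              # solve the Bellman equations backwards in time
    Q = cost + P @ V                      # Q_t(s, u) = c(s, u) + E[V_{t+1}(s')]
    policy[t] = Q.argmin(axis=1)          # greedy controller from the Bellman optimality equation
    V = Q.min(axis=1)                     # V_t(s) = min_u Q_t(s, u)
#+end_src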

Approximate dynamic programming encompasses a large class of methods that, given a controller $π$, approximate the value function $V^π(\state)$ for each state $\state$ with a parameterised function. The approximate value function can then be used for policy improvement, where a controller with superior performance is computed. In general, approximate dynamic programming is a large collection of algorithms that encompasses methods from \acrshort{rl} as well. However, these methods are out of the scope of this thesis.

Approximate Dynamic Programming

Approximate dynamic programming encompasses a large class of methods that, given a controller $π$, approximate the value function $J^π(\state)$ for each state $\state$ with a parameterised function (policy evaluation). The approximate value function can then be used for policy improvement. In general, approximate dynamic programming is a large collection of algorithms that encompasses methods from \acrshort{rl} as well. There are two main methods for approximation in dynamic programming. The first involves approximation in the value space, where the value function is approximated as $\hat{V}(f(\state_t, π(\state_t, t)))$. The second is approximation in policy space by optimising over a suitable class of policies.

Model-based vs model-free These methods are out of the scope of this thesis.

Reinforcement Learning

intro

There are multiple approaches to finding controllers $π$ that minimise the expected cost in cref:eq-optimal-control-objective subject to the stochastic dynamics in cref:eq-dynamics-main. However, a central assumption of many methods is that both the system dynamics and cost function are known a priori. In contrast, we consider problems where the underlying dynamics modes $\{\mode{\dynamicsFunc}\}_{\modeInd=1}^{\ModeInd}$ and how the system switches between them $\modeVar$, are \textit{not fully known a priori}. Further to this, the problem statement in cref:eq-main-problem contains a mode remaining constraint, which requires explicit knowledge of the desired dynamics mode and its state domain.

To apply control and planning techniques in systems with unknown dynamics, system identification emerged as a set of techniques for computing unknown parameters, e.g. mass of a component citep:ljungSystem1999. This is a two-staged approach which first learns about the environment and then uses this learned model to find the optimal controller. However, this approach learns about the environment globally and often incurs high costs during system identification.

\acrfull{rl} provides the most general framework for extending optimal control to problems with incomplete knowledge of the system dynamics. The classic text by cite:sutton2018reinforcement gives a general introduction to \acrshort{rl}. The main goal is to learn good behaviours from interactions with an environment. Typically this is in the form of a state feedback controller, known as a policy, which makes an agent’s interaction with the environment closed-loop. In contrast to the system identification approach, the goal of \acrshort{rl} is to minimise costs (maximise rewards) during the learning process. Further to this, \acrshort{rl} only needs to learn about the states relevant to solving the optimal control problem.

MBRL Alg

Model-based Reinforcement Learning

This thesis is interested in a subset of \acrshort{rl} known as \acrfull{mbrl}. It solves the optimal control problem in cref:eq-optimal-control-objective by first learning a dynamics model and then using this learned model with model-based control techniques. Although both \acrfull{mfrl} and \acrshort{mbrl} methods learn models of some form, the model in the name refers to the dynamics model $f$, which \acrshort{mfrl} approaches do not learn. cref:alg-mbrl-background illustrates a common \acrshort{mbrl} procedure where an agent incrementally explores its environment, collecting data $\mathcal{D}_i$ at each iteration $i$ and updating its dynamics model $\dynamicsModel$.

As more systems are becoming data-driven, learning dynamics models for model-based control has shifted to needing task-centric methods that simultaneously learn about the environment, whilst optimising a controller to obtain low cumulative costs. The field of \acrshort{mbrl} seeks to solve many optimal control problems by incrementally learning a dynamics model in this way citep:deisenrothPILCO2011,chuaDeep2018. \acrshort{mbrl} shares similarities with the system identification and control process, except that the dynamics learning and control are updated simultaneously. There are two main components to a \acrshort{mbrl} algorithm: 1) a method for learning a dynamics model $\dynamicsModel$ and 2) a controller (or policy) $π$ that leverages the learned dynamics model $\dynamicsModel$. The controller may be a parameterised state feedback controller (policy) $π$, which is trained using \acrshort{mfrl} algorithms and the learned dynamics model. Alternatively, the controller may take the form of a model-based control algorithm such as \acrfull{mpc}.
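A minimal sketch of this generic loop is given below; the environment, dynamics model and controller-improvement routine are passed in as assumed interfaces rather than concrete implementations from this thesis.

#+begin_src python
# Generic MBRL loop: collect data with the current controller, update the learned dynamics
# model, and re-optimise the controller against that model. The env/model/improve_controller
# objects are assumed interfaces (placeholders), not an API defined in this thesis.
def mbrl_loop(env, model, controller, improve_controller, num_iterations=10, episode_length=50):
    dataset = []
    for i in range(num_iterations):
        # 1) Roll out the current controller and collect state transitions D_i.
        state = env.reset()
        for t in range(episode_length):
            control = controller(state, t)
            next_state = env.step(state, control)
            dataset.append((state, control, next_state - state))  # delta-state targets
            state = next_state
        # 2) Update the dynamics model with all data collected so far.
        model.fit(dataset)
        # 3) Improve the controller using the learned model (e.g. trajectory optimisation or MPC).
        controller = improve_controller(model)
    return model, controller
#+end_src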

Model-based Control and Planning

This section reviews model-based control methods that leverage learned dynamics models in the \acrshort{mbrl} setting. In the \acrshort{rl} literature, model-based control is often referred to as planning. The work in this thesis is primarily focused on model-based control techniques, in particular, trajectory optimisation.

Trajectory optimisation Instead of approximating a value function or a policy, it is possible to directly optimise the controls $\controlTraj = \{\control_1, \ldots, \control_{T-1} \}$ over a horizon $T$. This is known as trajectory optimisation and is given by,

which finds an optimal sequence of controls $\controlTraj$ for a given start state $\state_0$. Many recent works have used trajectory optimisation with learned dynamics models. For example, cite:nakkaChanceConstrained2021 developed a chance-constrained trajectory optimisation algorithm that leverages a dynamics model learned using a robust regression model. cite:rybkinModelBased2021 utilise a latent space dynamics model – learned with a convolutional neural network for the encoder and a recurrent neural network for the dynamics – which scales to high-dimensional inputs such as images. In \acrshort{mbrl}, trajectory optimisers often exploit inaccuracies of the learned dynamics model. cite:boneyRegularizing2019 propose a trajectory optimisation approach that relieves this exploitation by learning the dynamics using a denoising autoencoder. The resulting trajectory optimisation algorithm avoids regions of the learned dynamics model where no data has been observed.
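The single-shooting form of this problem can be sketched as follows, optimising an open-loop control sequence through a stand-in mean dynamics model with the \acrshort{slsqp} optimiser from SciPy; the dynamics, cost and horizon are illustrative assumptions.

#+begin_src python
# Direct (single-shooting) trajectory optimisation over an open-loop control sequence.
# mean_dynamics is a stand-in for a learned model's mean single-step prediction.
import numpy as np
from scipy.optimize import minimize

T, control_dim = 15, 2
target = np.array([2.0, 2.0])

def mean_dynamics(state, control):
    # Assumed delta-state mean prediction; replace with a learned model in practice.
    return state + 0.25 * control

def objective(flat_controls, start_state):
    controls = flat_controls.reshape(T, control_dim)
    state, cost = start_state, 0.0
    for u in controls:
        state = mean_dynamics(state, u)
        cost += 1e-2 * float(u @ u)                       # integral (control effort) cost
    cost += float((state - target) @ (state - target))    # terminal cost: distance to target
    return cost

start = np.zeros(2)
result = minimize(objective, x0=np.zeros(T * control_dim), args=(start,), method="SLSQP")
optimised_controls = result.x.reshape(T, control_dim)
#+end_src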

Model predictive control Instead of directly applying the control inputs found with trajectory optimisation in an open-loop fashion, it is possible to obtain a closed-loop controller. This is achieved by iteratively applying the first control $\control_0$ and resolving the trajectory optimisation problem in cref:eq-trajectory-optimisation,

This is known as \acrfull{mpc} citep:eduardof.Model2007. However, in practice, it is often too computationally expensive to obtain real-time control with \acrshort{mpc}. Many approximate solutions have been introduced in the literature that seek to balance the computational complexity and accuracy trade-off differently citep:bettsSurvey1998.

For example, \acrfull{ilqr} can generate trajectories for nonlinear systems by iteratively approximating the dynamics to be linear around a nominal trajectory and optimising for the controls. \acrshort{ilqr} works well for quadratic cost functions but can be used with any cost function by approximating the cost function with a second-order Taylor expansion. However, in this case, \acrshort{ilqr} is susceptible to converging to terrible (local) optima if the true cost function is highly non-convex. cite:boedeckerApproximate2014 present a real-time \acrshort{ilqr} controller based on sparse \acrshort{gps}. cite:rohrProbabilistic2021 propose a novel \acrshort{lqr} controller synthesis for linearised \acrshort{gp} dynamics that yields robust controllers with respect to a probabilistic stability margin.

\acrshort{mpc} has directly been used with ensembles of probabilistic neural networks citep:chuaDeep2018,nagabandiDeep2020 and with \acrshort{gps} citep:kamtheDataEfficient2018. cite:lambertLowLevel2019 control a quadcopter using online \acrshort{mpc} and a dynamics model learned using probabilistic neural networks.
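Abstracting away the particular model and optimiser, the receding-horizon loop shared by these approaches can be sketched as follows; optimise_controls and env_step are assumed helpers (for example, the shooting sketch above and the true environment), not code from this thesis.

#+begin_src python
# Receding-horizon MPC: re-solve the open-loop trajectory optimisation from the current state
# at every time step and apply only the first control before re-planning.
def mpc(initial_state, optimise_controls, env_step, num_steps=30):
    state = initial_state
    visited = [state]
    for t in range(num_steps):
        controls = optimise_controls(state)    # solve the open-loop problem from x_t
        state = env_step(state, controls[0])   # apply only the first control u_0
        visited.append(state)                  # then re-plan from the new state
    return visited
#+end_src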

Offline trajectory optimisation Instead of solving the trajectory optimisation problem in cref:eq-trajectory-optimisation online, it can be solved offline. For example, the state-control trajectory can be found offline and used as a reference trajectory for a tracking controller. Alternatively, the trajectory optimiser can be used offline to learn a state feedback controller (policy) using guided policy search citep:levineGuided2013.

Constrained Control

This work aims to control multimodal dynamical systems subject to the mode remaining constraint in cref:eq-main-problem. However, neither the underlying dynamics modes nor how the system switches between them, are known a priori. To this end, this thesis is interested in model-based control techniques which can learn and enforce latent constraints.

It is common to require constraints on the states $\state$ and controls $\control$ of a controlled system. For example, an autonomous system may wish to remain in a subset of its state space where it knows its dynamics model is valid. The system may also be subject to constraints on the controls due to physical limitations, e.g. how quickly a quadcopter can accelerate and turn. Constraints of this type can be encoded via inequality constraints on the states $\state$ and controls $\control$,

The feasible regions of these constraints can be written as sets,

so the constraints can alternatively be written as,

For a parametric controller $\policy$, the control constraints can be encoded directly into the controller by parameterising it so that its range is restricted to $\controlDomain_\text{feasible}$, i.e. $\policy(\state_\timeInd) ∈ \controlDomain_\text{feasible}$, for all $\state ∈ \stateDomain_\text{feasible}$. The state constraints can be enforced by ensuring that the set $\stateDomain_\text{feasible}$ is forward invariant.
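For reference, one common way of writing these feasible sets is the following, where the control constraint function $\constraintFunc_\control$ is illustrative notation introduced here,

\begin{equation*}
\stateDomain_\text{feasible} = \{\state \in \stateDomain \mid \constraintFunc_\state(\state) \leq \mathbf{0}\},
\qquad
\controlDomain_\text{feasible} = \{\control \in \controlDomain \mid \constraintFunc_\control(\control) \leq \mathbf{0}\}.
\end{equation*}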

There are multiple approaches to enforcing state constraints via invariant sets. Two common approaches are Lyapunov functions citep:lyapunovGeneral1992a and control barrier functions citep:amesControl2019. Lyapunov functions are more restrictive than control barrier functions as they provide stability guarantees, which are not a necessary condition to render $\stateDomain_\text{feasible}$ forward invariant. Although these are interesting directions for future work, they are out of the scope of this thesis.

Unknown constraints This work is interested in learning constraints whilst ensuring that they are satisfied. cite:ariafarADMMBO2019,gelbartBayesian2014 introduce algorithms to minimise an unknown objective (Bayesian optimisation) subject to unknown constraints. cite:sadighSafe2016 propose an \acrshort{mpc} method that satisfies a priori unknown constraints with high probability. However, they do not deploy a strategy to actively learn about the constraints. In contrast, cite:schreiterSafe2015 consider safe exploration for active learning. They distinguish safe and unsafe regions with a binary \acrshort{gp} classifier, which is learned separately from the dynamics model. Their exploration strategy then considers the differential entropy of the dynamics \acrshort{gp} and they use the \acrshort{gp} classifier to define a set of safety constraints.

Stochastic constraints In stochastic systems it is not possible to make deterministic statements about constraints. This is because, given a start state $\state_0$, the resulting trajectories are random variables. Therefore, the constraints $\constraintFunc_\state(\state)$ are also random variables. To reason about constraints in stochastic systems, we need a method to measure the uncertainty in these random variables.

A simple approach is to consider expected constraints $\E[\constraintFunc_\state(\state)]$. However, although expected performance is a reasonable objective, expected constraints make less sense. For example, although the expected constraints $\E[\constraintFunc_\state(\state)]$ hold, a system may still violate them frequently if the variance of the constraints $\V[\constraintFunc_\state(\state)]$ is high. cite:ferberGames1958 proposed risk sensitivity which uses higher-order moments as well as the expected value. Value at risk citep:duffieOverview1997a is an even stronger notion, which guarantees constraint satisfaction with high probability.

\acrshort{mpc} citep:eduardof.Model2007 is the most direct method to embed constraints. At each time step, \acrshort{mpc} ensures that the constraints hold over a given horizon. However, it is worth noting that these constraints still cannot be guaranteed in stochastic systems. For stochastic systems, cite:schwarmChanceconstrained1999 proposed to satisfy constraints with high probability. Such constraints are named chance constraints. Chance constraints are applicable in systems where the uncertainty arises from learning from observations, i.e. they are applicable with latent constraints.
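A chance constraint of this kind can be written as follows, where $\delta$ denotes a generic admissible probability of violation,

\begin{equation*}
\Pr\left(\constraintFunc_\state(\state_\timeInd) \leq \mathbf{0}\right) \geq 1 - \delta, \qquad \timeInd = 0, \ldots, \TimeInd.
\end{equation*}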

Learning Dynamical Systems for Control

intro

This section reviews methods for learning representations of dynamical systems for control. When learning representations of dynamical systems from observations it is important to consider the different forms of uncertainty. For example, when using a learned model for control, it is important to know what we do not know. This knowledge can be used to encode risk-sensitive control (avoid regions where the model cannot predict confidently) or to guide exploration. cref:sec-unc-exploration discusses exploration strategies that leverage well-calibrated uncertainty estimates.

Sources of Uncertainty

This section characterises the uncertainty that arises in \acrshort{rl}.

Aleatoric uncertainty Dynamical systems give rise to temporal observations arriving as a sequence $\stateTraj = \{\state_1, \ldots, \state_\TimeInd\}$. These measurements are often corrupted by (observation) noise due to imperfections in the measurement process. Even when it is known that there is uncertainty in the measurement process, there still remains uncertainty about its form. Given our current understanding of the real world, many dynamical systems also appear to be inherently stochastic. This is due to our inability to model certain phenomena accurately (e.g. turbulence). Stochasticity arising from state transitions in this way is known as process noise. Observation and process noise are the constituent sources of aleatoric uncertainty; uncertainties that are inherent in a system and cannot be reduced.

Epistemic uncertainty The uncertainty arising from learning the dynamics $f$ from observations is known as epistemic uncertainty. For example, we may be uncertain about the structure of the model or the value of specific model parameters $θ$. This type of uncertainty accounts for things that could, in principle, be known but are not known a priori. As such, epistemic uncertainty can be reduced by exploring, collecting data and updating the model with this new experience.

Distinguishing these two sources of uncertainty is important in \acrshort{rl}. For example, the epistemic uncertainty is useful for guiding exploration into regions of the system that have not been observed. In turn, this data can be used to reduce the model’s epistemic uncertainty by updating the model. In contrast, driving the system into regions of the model with high aleatoric uncertainty is not desirable. Consider the case where the model is confident that the system is subject to high process noise in a particular region. Guiding the system into this region will not reduce the model’s epistemic uncertainty because the model has already been trained on data from this region. Further to this, it may be undesirable to enter regions of high aleatoric uncertainty because they may result in poor performance or even catastrophic failure.

Learning Single-Step Dynamics Models

This work considers single-step dynamics models with the delta state formulation, which regularises the predictive distribution, given by $\state_{\timeInd+1} = \state_\timeInd + \dynamicsFunc(\state_\timeInd, \control_\timeInd) + \bm\epsilon$. Although multi-step dynamics models are an interesting direction for learning-based control, they are out of the scope of this thesis.
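As a concrete illustration of the delta state formulation, the following sketch builds single-step training pairs from an observed state-control trajectory; the array layout is an assumption, not the data pipeline used later in the thesis.

#+begin_src python
# Construct a training set for a single-step, delta-state dynamics model:
# inputs are (x_t, u_t) and targets are Delta x_{t+1} = x_{t+1} - x_t.
import numpy as np

def make_delta_state_dataset(states, controls):
    """states: (T+1, state_dim) array, controls: (T, control_dim) array."""
    inputs = np.concatenate([states[:-1], controls], axis=-1)  # (T, state_dim + control_dim)
    targets = states[1:] - states[:-1]                         # (T, state_dim) delta states
    return inputs, targets
#+end_src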

Single-step dynamics models have been deployed in a large variety of learning-based control algorithms. Early approaches include using single-step linear models citep:schneiderExploiting1996 and single-step \acrshort{gp} models citep:deisenrothPILCO2011,vinogradskaStability2016,rohrProbabilistic2021,hewingLearningBased2020,kollerLearningBased2018 for low-dimensional control problems. More recently, single-step dynamics models have been learned using neural networks. For example, citep:chuaDeep2018,jannerWhen2019,kurutachModelEnsemble2018 use ensembles of neural networks and citep:depewegLearning2017,galImproving2016 use Bayesian neural networks with parametric uncertainty.

Probabilistic models Mathematical models are compact representations (sets of assumptions) that attempt to capture key features of the phenomenon of interest in a precise mathematical form. Probabilistic modelling provides the capability of constructing mathematical models that can represent and manipulate uncertainty in data, models, decisions and predictions. As such, linking observed data to underlying phenomena through probabilistic models is an interesting direction for modelling, analysing and controlling dynamical systems. It enables the uncertainty to be represented and manipulated; it provides a systematic way to combine observations with existing knowledge via a mathematical model. Learning representations of dynamical systems for control using probabilistic models has shown much promise. Moreover, learning single-step dynamics models that quantify uncertainty has been central to recent successes in \acrshort{mbrl} citep:chuaDeep2018,jannerWhen2019.

Modelling a Bayesian belief over the dynamics function $f$ provides a principled method for modelling epistemic uncertainty arising from learning from data, through the posterior distribution. \acrshort{gps} are the state-of-the-art approach for Bayesian nonparametric regression and are a building block for the methods presented in this thesis. \acrshort{gps} have become a popular choice for learning representations of dynamical systems citep:nguyen-tuongModel2009,buisson-fenetActively2020,wangSafe2018 and have been used for both \acrshort{rl} citep:deisenrothPILCO2011,doerrOptimizing2017,vinogradskaNumerical2020,polymenakosSafe2019 and for \acrshort{mpc} citep:kollerLearningBased2018,hewingCautious2020,hewingLearningBased2020,kollerLearningBased2018,kamtheDataEfficient2018.

Gaussian Processes

intro

The mathematical machinery underpinning \acrshort{gps} is now detailed.

Multivariate Gaussian identities Inference techniques with \acrshort{gps} leverage multivariate Gaussian conditioning operations. As such, introducing the multivariate Gaussian identities is a natural place to start.

Gaussian distributions are popular in machine learning and control theory. This is not only due to their natural emergence in statistical scenarios (central limit theorem) but also their intuitiveness and mathematical properties that render their manipulation tractable and easy.

Consider a multivariate Gaussian whose random variables are partitioned into two vectors $\f$ and $\u$. The joint distribution takes the following form,

where $\bm\mu_\f$ and $\bm\mu_\u$ represent the mean vectors, $\bm\Sigma_{\f\f}$ and $\bm\Sigma_{\u\u}$ represent the covariance matrices, and $\bm\Sigma_{\u\f}$ and $\bm\Sigma_{\f\u}$ represent the cross-covariance matrices. The marginalisation property of Gaussian distributions states that for two jointly Gaussian random variables, the marginals are also Gaussian,

Conveniently, the conditional densities are also Gaussian,

Consider the case where $\u$ represents some observations and $\f$ represents a new test location. cref:eq-gaussian-conditional can be used to make inferences in the location $\f$ given the observations $\u$, i.e. make sophisticated interpolations on the measurements based on their closeness. In real-world scenarios, it is desirable to consider the entire input domain, instead of simply pre-selecting a discrete set of locations. Gaussian processes provide this mathematical machinery.
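The conditioning operation described above can also be checked numerically; the joint distribution in the following sketch is an arbitrary illustrative example.

#+begin_src python
# Numerical check of multivariate Gaussian conditioning: given a joint over (f, u),
# condition on an observed u to obtain the mean and covariance of p(f | u).
import numpy as np

mu_f, mu_u = np.array([0.0]), np.array([1.0, -1.0])
Sigma_ff = np.array([[1.0]])
Sigma_fu = np.array([[0.6, 0.2]])
Sigma_uu = np.array([[1.0, 0.3], [0.3, 1.0]])

u_observed = np.array([1.5, -0.5])
K = np.linalg.solve(Sigma_uu, Sigma_fu.T).T   # Sigma_fu Sigma_uu^{-1}
cond_mean = mu_f + K @ (u_observed - mu_u)    # mu_f + Sigma_fu Sigma_uu^{-1} (u - mu_u)
cond_cov = Sigma_ff - K @ Sigma_fu.T          # Sigma_ff - Sigma_fu Sigma_uu^{-1} Sigma_uf
#+end_src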

Gaussian processes Informally, \acrshort{gps} are a generalisation of the multivariate Gaussian distribution, indexed by an input domain as opposed to an index set. Similar to how a sample from an $N$-dimensional multivariate Gaussian is an $N$-dimensional vector, a sample from a \acrshort{gp} is a random function over its domain. Formally, a \acrshort{gp} is defined as follows,

More intuitively, a Gaussian process is a distribution over functions $f(⋅): \gpDomain → \R$ defined over an input domain $\gpDomain \subseteq \R^{D_f}$. Whilst Gaussian distributions represent distributions over finite-dimensional vectors, \acrshort{gps} represent distributions over infinite-dimensional functions. A \acrshort{gp} is fully defined by a mean function $μ(⋅) : \gpDomain → \R$ and a covariance function $k(⋅, ⋅) : \gpDomain × \gpDomain → \R$ (also known as a kernel),

Importantly, for a given set of training inputs from the function's domain ${\mathbf{X} = \{ \mathbf{x}_1, \ldots, \mathbf{x}_N \}}$, the associated function values $\mathbf{f} = \{f(\mathbf{x}_1), \ldots, f(\mathbf{x}_N) \}$ are jointly Gaussian. This results in an $N$-dimensional multivariate Gaussian random variable $\mathbf{f}$, given by,

Given mean and kernel functions with parameters $\bm\theta$, the marginal distribution is given by,

where the dependency on the parameters $\bm\theta$ has been dropped, i.e. $p(\f \mid \mathbf{X}) = p(\f \mid \mathbf{X}, \bm\theta)$. This simplification will be used throughout this thesis for notational conciseness. By definition, these observations $\mathbf{f}$ are jointly Gaussian with any un-observed function value $f_* = f(\mathbf{x}_*)$ at a new test input,

Given the multivariate Gaussian conditionals in cref:eq-gaussian-conditional, it is easy to see how the distribution over the test function value $f_*$, can be obtained by conditioning on the observations,

It is typical in real-world modelling scenarios that observations of the true function values $\f$ are not directly accessible. Instead, observations are usually corrupted by noise,

where $σ^2_n$ is the noise variance. In this scenario, the function values $\f$ become latent variables and a Gaussian likelihood is introduced,

to relate the observations to the latent function values $\f$. The predictive distribution for a test input $\mathbf{x}_*$ follows from cref:eq-gaussian-conditional,

This predictive distribution is the \acrshort{gp} posterior. Importantly, the predictive variance $σ^2$ in cref:eq-gp-prediction-noisy quantifies the epistemic uncertainty associated with making a prediction at the test input $\mathbf{x}_*$, whilst the value of the noise variance $σ_n^2$ quantifies the aleatoric uncertainty. This structured handling of uncertainty makes \acrshort{gps} extremely powerful for learning representations of dynamical systems.
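As an illustrative aside, the predictive equations above can be written down directly. The following NumPy sketch implements GP regression with a (hypothetical) squared exponential kernel and unit signal variance, returning the predictive mean and variance at test inputs; it is a didactic sketch rather than the implementation used in later chapters.

#+begin_src python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    # Squared exponential covariance between two sets of inputs (rows are points).
    d = A[:, None, :] - B[None, :, :]
    return variance * np.exp(-0.5 * np.sum(d ** 2, axis=-1) / lengthscale ** 2)

def gp_predict(X, y, X_star, noise_var=0.1):
    """GP posterior mean and variance at X_star given noisy observations (X, y)."""
    Knn = rbf(X, X) + noise_var * np.eye(len(X))
    Ksn = rbf(X_star, X)
    Kss = rbf(X_star, X_star)
    L = np.linalg.cholesky(Knn)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ksn @ alpha
    V = np.linalg.solve(L, Ksn.T)
    # Epistemic variance plus the aleatoric noise variance.
    var = np.diag(Kss) - np.sum(V ** 2, axis=0) + noise_var
    return mean, var

X = np.random.rand(20, 1)
y = np.sin(6 * X[:, 0]) + 0.1 * np.random.randn(20)
mean, var = gp_predict(X, y, np.linspace(0, 1, 50)[:, None])
#+end_src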

Sparse Gaussian Processes for Regression label:sec-sparse-gps

The nonparametric nature of Gaussian process methods is responsible for their flexibility but also their shortcomings, namely their memory and computational limitations. In general, the computational complexity of Gaussian process methods scales as $\mathcal{O}(N^3)$, due to the inversion of the $N × N$ covariance matrix $k(\mathbf{X},\mathbf{X})$. This makes their application to data sets with more than a few thousand data points prohibitive. The main direction in the literature attempting to overcome this limitation is sparse approximations citep:titsiasVariational2009,snelsonSparse2005a,quinonero-candelaUnifying2005,leibfriedTutorial2021. Many approximation schemes have been proposed to reduce the computational complexity, see cite:quinonero-candelaUnifying2005 for a review. This section provides a short overview of sparse approximations.

Consider a second Gaussian density $q(\u)$ over the vector $\u$, as well as $p(\u)$, given by,

with mean $\mathbf{m}_{\u}$ and covariance $\mathbf{S}_{\u\u}$. The marginal distribution over $\f$ can be obtained by conditioning on $\u$ and marginalising w.r.t. this new distribution, $q(\f) = ∫ p(\f \mid \u) q(\u) \text{d} \u$. This marginal distribution is also Gaussian,

Replacing $q(\u)$ with $p(\u)$ would recover the original marginal distribution $p(\f)$ with mean $\bm\mu_{\f}$ and covariance $\bm\Sigma_{\f\f}$, as expected from cref:eq-gaussian-marginal.

Inducing point methods citep:snelsonSparse2005a augment the latent variables with inducing input-output pairs known as inducing “inputs” $\Z ∈ \inputDomain$ and inducing “variables” $\u = f(\Z)$. Introducing a set of $M$ inducing points from the same \acrshort{gp} prior, $p(\u \mid \Z) = \mathcal{N}(\u \mid μ(\Z), k(\Z, \Z))$, can reduce the computational cost when $M \ll N$. Note that these inducing variables $\u$ are jointly Gaussian with the latent function values $\mathbf{f}$,

where $\Kmm$ is the covariance function evaluated between all inducing inputs $\Z$, and $\Knm$ is the covariance function evaluated between the data inputs $\mathbf{X}$ and the inducing inputs $\Z$. This joint distribution can be re-written using the properties of multivariate normal distributions as,

where $\mathbf{Q}_{nn} = \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{nm}^{T}$. The full joint distribution is then given by,

Computationally efficient inference is obtained by approximating the integration over $\mathbf{f}$. The popular \acrfull{fitc} method (in the case of a Gaussian likelihood) is obtained via the factorisation $p(\mathbf{y} \mid \u) ≈ ∏_{n=1}^{N} p(y_n \mid \u)$. To obtain a variational approximation, Jensen’s inequality is first used to bound this conditional,

This is the standard bound shown in cite:hensmanGaussian2013 ($\mathcal{L}_1$ defined in Eq. 1). In cite:titsiasVariational2009 this bound is then substituted into the marginal likelihood to obtain a tractable lower bound,

where $σ^2$ is the noise variance associated with the Gaussian likelihood. The bound in cref:eq-titsias-bound becomes exact (i.e. recovers the true log marginal likelihood) when the inducing inputs $\Z$ equal the data inputs $\mathbf{X}$, as then $\Knm=\Kmm=\Knn$ and so $\mathbf{Q}_{nn}=\Knn$.
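For concreteness, the collapsed bound can be computed directly from kernel matrices. The sketch below assumes a zero mean function and a Gaussian likelihood; `kernel` is any function returning a covariance matrix (for example the squared exponential sketched earlier), and the jitter value is an arbitrary choice for numerical stability.

#+begin_src python
import numpy as np
from scipy.stats import multivariate_normal

def titsias_bound(y, X, Z, kernel, noise_var):
    """Collapsed bound: log N(y | 0, Qnn + s2 I) - tr(Knn - Qnn) / (2 s2)."""
    Knn = kernel(X, X)
    Kmm = kernel(Z, Z) + 1e-6 * np.eye(len(Z))   # jitter for numerical stability
    Knm = kernel(X, Z)
    Qnn = Knm @ np.linalg.solve(Kmm, Knm.T)      # Knm Kmm^{-1} Knm^T
    cov = Qnn + noise_var * np.eye(len(X))
    log_marginal = multivariate_normal.logpdf(y, mean=np.zeros(len(X)), cov=cov)
    trace_term = np.trace(Knn - Qnn) / (2.0 * noise_var)
    return log_marginal - trace_term
#+end_src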

cite:hensmanGaussian2013 noted that this bound has complexity $\mathcal{O}(NM^2)$, so introduced additional variational parameters to achieve a more computationally scalable bound with complexity $\mathcal{O}(M^3)$. Instead of collapsing the inducing points in cref:eq-titsias-bound they explicitly represent them as a variational distribution,

which they use to lower bound the marginal likelihood $p(\mathbf{y} \mid \mathbf{X})$ by further bounding $\mathcal{L}_\text{titsias}$,

where $q(\mathbf{f}) = ∫ p(\mathbf{f} \mid \u) q(\u) \text{d} \u$. This is equivalent to $\mathcal{L}_3$ defined in Eq. 4 of cite:hensmanGaussian2013. If the likelihood factorises across data, $p(\mathbf{y} \mid \mathbf{f}) = ∏_{n=1}^{N} p(y_n \mid f_n)$, only the marginals of $q(\mathbf{f})$ are needed to calculate cref:eq-hensman-bound,

Importantly, this bound is written as a sum over input-output pairs, which induces the necessary conditions to perform stochastic gradient methods on $q(\u)$.
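To illustrate why this enables stochastic optimisation, note that the expected log-likelihood term has a closed form for a Gaussian likelihood, and that a minibatch sum can be rescaled to give an unbiased estimate of the full bound. The sketch below assumes the marginals of $q(\mathbf{f})$ (means and variances) and the KL term have already been computed; it is a schematic of the bound's structure, not of any particular implementation.

#+begin_src python
import numpy as np

def expected_log_lik(y, q_mean, q_var, noise_var):
    """Per-point E_{q(f_n)}[log N(y_n | f_n, s2)] for a Gaussian likelihood."""
    log_norm = -0.5 * np.log(2 * np.pi * noise_var) \
               - 0.5 * (y - q_mean) ** 2 / noise_var
    return log_norm - 0.5 * q_var / noise_var      # one term per data point

def minibatch_elbo_estimate(y_batch, q_mean_batch, q_var_batch,
                            noise_var, N, kl_qu_pu):
    # Rescale the minibatch sum to an unbiased estimate of the full-data sum.
    scale = N / len(y_batch)
    return scale * np.sum(expected_log_lik(y_batch, q_mean_batch,
                                           q_var_batch, noise_var)) - kl_qu_pu
#+end_src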

Learning Multimodal Dynamical Systems label:to_motivation

figs

intro

In contrast to the previously presented approaches, we are interested in learning representations of multimodal dynamical systems. cref:fig-traj-opt-over-gating-mask-svgp-all-baseline demonstrates the shortcomings of learning a \acrshort{gp} dynamics model for the quadcopter navigation problem in the illustrative example from cref:illustrative_example. It shows the results of performing trajectory optimisation in a \acrshort{gp} dynamics model trained on state transitions sampled from both dynamics modes. The \acrshort{gp} has not been able to learn a representation of the dynamics which is true to the underlying system, due to the discontinuities associated with the multimodal transition dynamics (changing lengthscales/noise variances etc). The trajectory optimiser was able to find a trajectory from the start state $\state_0$, to the target state $\targetState$ in the \acrshort{gp} dynamics (magenta). However, as the \acrshort{gp} dynamics do not accurately represent the true underlying dynamics, the state trajectory resulting from rolling out the optimised controls in the environment (cyan) does not match the \acrshort{gp} dynamics trajectory (magenta). This example motivates the need to correctly identify the underlying modes when learning representations of multimodal dynamical systems for control.

cref:fig-traj-opt-over-gating-mask-svgp-desired-baseline shows results after training on state transitions from only the desired, operable dynamics mode (red). The learned dynamics model can accurately predict state transitions in the desired dynamics mode (red). However, as this approach only considers the dynamics of the desired mode, trajectories in the environment (cyan) deviate from those planned in the learned dynamics model (magenta) when they pass through the turbulent mode. This is problematic because the trajectory passes through the turbulent dynamics mode (which may lead to catastrophic failure) and does not reach the target state $\targetState$. Without inferring information regarding how the system switches between its underlying dynamics modes, it is not possible to encode mode remaining behaviour into control algorithms.

Methods for learning probabilistic multimodal dynamics have been proposed. cite:moerlandLearning2017 use deep generative models, namely a conditional \acrfull{vae}, to learn multimodal transition dynamics for \acrshort{mbrl}. cite:kaiserBayesian2020 use the data association with \acrshort{gps} model, a Bayesian model that learns independent dynamics modes whilst maintaining a probabilistic belief over which mode is responsible for predicting at a given input location. cite:mckinnonLearning2017 also use an approach based on \acrshort{gps}, except that they use a \acrfull{mogpe} method to learn the switching behaviour for robot dynamics online.

Latent spaces for control It is worth noting that the introduction of latent variables into probabilistic models is a key component providing them with interesting and powerful capabilities for synergising model learning and control. For example, cite:hafnerLearning2019,rybkinModelBased2021 learn latent spaces which provide convenient spaces for control (or planning). cref:fig-traj-opt-over-gating-mask-svgp-desired-baseline highlights the need for learning informative latent variables representing how the system switches between the underlying dynamics modes. Without such information, it is not possible to encode the notion of mode remaining/avoiding behaviour. As such, this work is interested in learning latent spaces that are rich with information regarding how a system switches between its underlying dynamics modes.

Alternative Approaches

intro

Uncertainty-based Exploration Strategies label:sec-unc-exploration

\acrfull{rl} agents face a trade-off between exploration, where they seek to explore the environment and improve their models, and exploitation, where they make decisions which are optimal for the data observed so far. There are many approaches from the literature used to tackle the exploration-exploitation trade-off. In \acrshort{mbrl}, the goal is often to reduce the real-world sample complexity at the cost of increased model sample complexity.

There are two main uncertainties which are often modelled and used for exploration. Value-based methods base their exploration on the uncertainty of the value function $V$ at the current state $\state_t$. Actions with higher value estimates get higher probabilities of being selected based on uncertainty estimates around the values citep:moerlandEfficient2017,auerUsing2002. An alternative approach for \acrshort{mbrl} is to use state-based exploration. This approach is state-specific, reward independent and usually seeks states with high epistemic uncertainty. This section recaps some relevant exploration strategies that fit into the general \acrshort{mbrl} procedure shown in cref:alg-mbrl-background.

Greedy exploitation One of the most commonly used exploration strategies is to select the controller that maximises the expected performance under the learned dynamics model $\dynamicsModel(f \mid \mathcal{D}_{0:i-1})$. Note that $i$ denotes the iteration/episode number of the \acrshort{mbrl} loop. This greedy strategy is given by,

Note that the negative sign is used because the objective $J(f, π)$ is based on a cost function and not a reward function. This approach is used in \acrshort{pilco} citep:deisenrothPILCO2011 and \acrshort{gp}-\acrshort{mpc} citep:kamtheDataEfficient2018 where the dynamics are represented using \acrshort{gps} and the moment matching approximation is used to cascade single-step predictions. cite:parmasPIPPS2018 propose \acrshort{pipps}, a similar approach to \acrshort{pilco}, except that they use Monte Carlo methods to propagate uncertainty forward in time, instead of using the moment matching approximation. Similarly, \acrshort{pets} citep:chuaDeep2018 uses this exploration strategy but represents the dynamics using ensembles of probabilistic neural networks. This strategy initially favours exploring regions of the environment where the learned dynamics model is not confident, i.e. has high epistemic uncertainty. Once it has gathered knowledge of the environment and the model’s epistemic uncertainty has been reduced, it favours maximising the objective function $J(f, π)$.

Thompson sampling An alternative and theoretically grounded strategy is Thompson sampling. This approach samples a single model $f_i ∼ \dynamicsModel(f \mid \mathcal{D}_{0:i})$ at every iteration $i$ and uses the sampled model to optimise the controller. This is given by,

In general, it is intractable to sample from $\dynamicsModel(f \mid \mathcal{D}_{0:i})$. Note that after the sampling step this problem is equivalent to greedy exploitation.

Alternatively, some \acrshort{mbrl} algorithms, such as cite:sekarPlanning2020, adopt a two-phase exploration strategy. The first phase is interested in exploring the environment and summarising this past experience in the form of a model. The second phase then seeks to solve a downstream task, for which it is given a cost (reward) function. This two-stage approach does not require an objective that changes its exploration-exploitation balance as it gathers more knowledge of the environment.

Active learning Active learning is a class of exploration algorithms which fit into the two-phase exploration approach. The goal of information-theoretic active learning is to reduce the number of possible hypotheses as fast as possible, e.g. minimise the uncertainty associated with the parameters using Shannon’s entropy citep:coverElements2006,

In contrast to greedy exploitation, active learning does not seek to maximise a black-box objective. Instead, it is only interested in exploration. There are many approaches to active learning in the static setting, i.e. in systems where an arbitrary state $\state$ can be sampled. In contrast, dynamical systems must be steered to $\state$ through the unknown dynamics $f$ via a sequence of controls $\controlTraj$. Thus, information gain along the trajectory must also be considered. As highlighted by cite:buisson-fenetActively2020, information gain in dynamical systems is fundamentally different to the static problem addressed by cite:krauseNearOptimal2008 and cite:houlsbyBayesian2011. The goal is to pick the most informative control trajectory $\controlTraj$ whilst observing $\stateTraj$.

Recent work has addressed active learning in \acrshort{gp} dynamics models. cite:schreiterSafe2015 propose a greedy entropy-based strategy that considers the entropy of the next state. cite:buisson-fenetActively2020 also propose a greedy entropy-based strategy except that they consider the entropy accumulated over a trajectory. In contrast, cite:caponeLocalized2020,yuActive2021 propose using the mutual information.
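As a schematic of a greedy, single-step entropy criterion (and only that; it is not a faithful reimplementation of any of the cited methods), the sketch below scores a hypothetical set of candidate controls by the differential entropy of the GP dynamics model's prediction of the next state and picks the most uncertain one. Here `gp_predict_fn` is an assumed interface returning a predictive mean and per-dimension variance.

#+begin_src python
import numpy as np

def gaussian_entropy(var):
    # Differential entropy of a univariate Gaussian: 0.5 * log(2 pi e var).
    return 0.5 * np.log(2.0 * np.pi * np.e * var)

def greedy_entropy_control(state, candidate_controls, gp_predict_fn):
    """Pick the candidate control whose predicted next state is most uncertain."""
    entropies = []
    for control in candidate_controls:
        _, var = gp_predict_fn(state, control)   # per-dimension predictive variance
        entropies.append(np.sum(gaussian_entropy(np.asarray(var))))
    return candidate_controls[int(np.argmax(entropies))]
#+end_src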

cite:caponeLocalized2020 find the most informative state as the one that minimises the mutual information between it and a set of reference states (a discretisation of the domain). They then find a set of controls to drive the system to this most informative state. Given a fixed number of time steps, their method yields a better model than the greedy entropy-based strategies. cite:yuActive2021 propose an alternative approach that leverages their \acrfull{gpssm} inference scheme to estimate the mutual information between all the variables in time, $I \left[\mathbf{y}_{1:t}, \hat{\mathbf{y}}_{t+1} ; \mathbf{f}_{1:t+1} \right]$. Here $\mathbf{y}_{1:t}$ denotes the set of observed outputs and $\hat{\mathbf{y}}_{t+1}$ denotes the output predicted by the \acrshort{gpssm}. This contrasts with other approaches, which study the latest mutual information $I[\hat{\mathbf{y}}_{t+1} ; \mathbf{f}_{t+1}]$.

Myopic active learning In \acrshort{rl} and control it is standard to consider objectives over a potentially infinite horizon. However, active learning objectives often myopically consider the information gain at the next query point only. In contrast, it is possible to consider the information gain over a potentially infinite horizon, relieving this myopia. The mutual information approaches in cite:caponeLocalized2020,yuActive2021 fall into this myopic category as they only maximise the information gain at the next time step. In contrast, the entropy-based strategy in cite:buisson-fenetActively2020 considers the entropy over a horizon.

Probabilistic Inference for Learning Multimodal Dynamical Systems label:chap-dynamics

Maths Symbols

Domains

Bounds

maths

kernels

inference

Intro

This chapter is concerned with learning representations of multimodal dynamical systems for model-based control. It is interested in systems where both the underlying dynamics modes, and how the system switches between them, are not fully known a priori. This chapter assumes access to a data set of state transitions $\dataset$, previously sampled from the system at a constant frequency, i.e. with a fixed time step.

Following the motivation in cref:to_motivation, this chapter seeks to identify the underlying dynamics modes correctly whilst inferring latent structure that can be exploited for control. The main goals of this chapter can be summarised as follows,

  1. accurately identify the true underlying dynamics modes,
  2. learn latent spaces for planning/control.

The probabilistic model constructed in this chapter resembles a \acrfull{mogpe} with a \acrshort{gp}-based gating network and is named \acrfull{mosvgpe}. Following other \acrshort{mogpe} methods, it is evaluated on the motorcycle data set citep:Silverman1985. It is then tested on a real-world quadcopter data set representing the illustrative example detailed in cref:illustrative_example.

Problem Statement

This work considers learning representations of unknown or partially unknown, stochastic, multimodal, nonlinear dynamical systems. That is, it seeks to learn a representation of the dynamics from the problem statement in cref:problem-statement-main. This chapter considers single-step dynamics models with the delta state formulation that regularises the predictive distribution. The dynamics are given by,

with continuous states $\state_t ∈ \stateDomain \subseteq \R^\StateDim$, continuous controls $\control_t ∈ \controlDomain \subseteq \R^\ControlDim$ and time index $t$. The state difference between time $\timeInd$ and $\timeInd+1$ is denoted $Δ \state_{\timeInd+1} = \state_{\timeInd+1} - \state_{\timeInd}$. At any given time step $t$, one of the $\ModeInd$ dynamics modes $\{\unknownDynamicsK : \mode{\stateDomain} × \controlDomain → \stateDomain \}_{\modeInd=1}^{\ModeInd}$ and its associated noise model $\mode{\bm\epsilon} ∼ \mathcal{N}\left(\mathbf{0}, \bm\Sigma_{\mode{\bm\epsilon}} \right)$, is selected by a discrete mode indicator variable $α_{\timeInd} ∈ \modeDomain = \{1,\ldots,\ModeInd\}$. This work assumes that the discrete mode indicator variable depends on the state $\state$ such that it is governed by a discrete mode indicator function $α : \stateDomain → \modeDomain$.

This chapter assumes access to historical data comprising state transitions from $\NumEpisodes$ trajectories of length $T$, sampled with a fixed time step $Δ t = t_*$. The data set has $N=\NumEpisodes T$ elements and we abuse notation by appending the independent trajectories along time to get the data set $\mathcal{D}=\{(\state_{\timeInd},\control_{\timeInd}),Δ \state_{\timeInd+1}\}_{\timeInd=0}^{\NumData-1}$.

To ease notation, our modelling only considers a single output dimension. The extension to multiple output dimensions follows from standard \acrshort{gp} methodologies and is detailed where necessary. To further ease notation, the state-control input domain is denoted $\inputDomain = \stateDomain × \controlDomain \subseteq \R^\InputDim$ and a single state-control input is denoted $\singleInput = (\state_{\timeInd},\control_{\timeInd})$. Given this formulation, this chapter aims to learn the mapping $\unknownDynamics$, which switches between $\ModeInd$ different functions $\mode{\latentFunc}$. This can be cast as a regression problem,

where both the latent dynamics functions $\{\unknownDynamicsK\}_{\modeInd=1}^{\ModeInd}$ and how the system switches between them, $\modeVar$, must be inferred from observations. A single observation is denoted as $(\singleInput, \singleOutput) = ((\state_t,\control_t), Δ \state_{t+1})$. The set of all inputs is denoted as $\allInput ∈ \R^{\NumData × \InputDim}$ and the set of all outputs as $\allOutput ∈ \R^{\NumData × 1}$. With this notation, the regression problem can be written as,
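As a purely hypothetical illustration of this regression problem, the snippet below generates one-dimensional data from two latent modes with different smoothness and noise levels, switched by a state-dependent mode indicator; neither the functions nor the switching rule correspond to the quadcopter example, they simply mimic its structure.

#+begin_src python
import numpy as np

def f1(x):                       # smooth, low-noise mode
    return np.sin(3 * x)

def f2(x):                       # rougher, noisier mode
    return 0.5 * np.sin(20 * x)

def alpha(x):                    # mode indicator: mode 2 inside (0.4, 0.7)
    return np.where((x > 0.4) & (x < 0.7), 2, 1)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=200)
modes = alpha(X)
noise_std = np.where(modes == 1, 0.02, 0.2)
y = np.where(modes == 1, f1(X), f2(X)) + noise_std * rng.normal(size=X.shape)
# (X, y) is a data set where both the modes and the switching must be inferred.
#+end_src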

Preliminaries

\acrfull{gps} are the state-of-the-art approach for Bayesian nonparametric regression and they provide a powerful mechanism for encoding expert domain knowledge. They are flexible enough to model arbitrary smooth functions with the simplicity of only requiring inference over a small number of interpretable parameters, such as lengthscales and the contributions of signal and noise variance in the data. These properties are induced by the covariance function, which models the covariance between observations. As such, \acrshort{mogpe} methods are a promising direction for modelling multimodal dynamical systems. This section recaps the \acrshort{mogpe} concepts that this chapter builds upon.

mixture models

\newline

Mixture Models Mixture models are a natural choice for modelling multimodal systems. Given an input $\singleInput$ and an output $\singleOutput$, mixture models model a mixture of distributions over the output,

where $\gatingParams$ and $\expertParams$ represent model parameters and $\singleModeVar$ is a discrete indicator variable assigning observations to components. The predictive distribution $\singleMoeEvidence$ consists of $K$ mixture components ${p(\singleOutput \mid \modeVarK, \singleInput, \expertParamsK)}$ that are weighted according to their mixing probabilities $Pr(\modeVarK \mid \gatingParams)$.

mixture of experts

\newline

Mixture of Experts The \acrfull{moe} model citep:jacobsAdaptive1991 is an extension where the mixing probabilities depend on the input variable, $\moeGatingPosterior$, and are collectively referred to as the gating network. The individual component densities $\moeExpertPosterior$ are then referred to as experts, as, in different regions of the input space, different components are responsible for prediction. Given $\ModeInd$ experts $\moeExpertPosterior$, each with parameters $\expertParamsK$, the \acrshort{moe} marginal likelihood is given by,

where $\singleModeVar$ is the expert indicator variable assigning observations to experts. The probability mass function over the expert indicator variable is referred to as the gating network and indicates which expert governs the model at a given input location. See cite:yukselTwenty2012 for a survey of \acrshort{moe} methods. Note the correspondence between the expert indicator variable $\singleModeVar$ and the mode indicator variable from cref:eq-multimodal-dynamics-disc.
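To make the structure of this likelihood concrete, the sketch below evaluates the \acrshort{moe} density for a single observation using hypothetical parametric experts and a softmax of linear gating functions; the \acrshort{gp}-based model developed later in this chapter replaces these parametric choices, so this is purely illustrative.

#+begin_src python
import numpy as np
from scipy.stats import norm
from scipy.special import softmax

def moe_log_lik(x, y, expert_means, expert_stds, gating_weights, gating_biases):
    """log p(y | x) = log sum_k Pr(alpha=k | x) N(y | mu_k(x), sigma_k^2)."""
    # Input-dependent mixing probabilities from linear gating functions.
    mixing_probs = softmax(gating_weights * x + gating_biases)
    # Expert densities p(y | alpha=k, x).
    expert_densities = norm.pdf(y, loc=[m(x) for m in expert_means],
                                scale=expert_stds)
    return np.log(np.sum(mixing_probs * expert_densities))

log_lik = moe_log_lik(
    x=0.3, y=0.8,
    expert_means=[np.sin, np.cos],              # two hypothetical experts
    expert_stds=np.array([0.1, 0.3]),
    gating_weights=np.array([4.0, -4.0]),
    gating_biases=np.array([0.0, 0.0]),
)
#+end_src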

Nonparametric Mixtures of Experts

\newline Nonparametric Mixtures of Experts Modelling the experts as \acrshort{gps} gives rise to a class of powerful models known as \acrfull{mogpe}. They can model multimodal distributions as they model a mixture of distributions over the outputs, usually a Gaussian mixture in the regression setting citep:trespMixtures2000a,rasmussenInfinite2001. They can model non-stationary functions as each expert learns separate hyperparameters (lengthscales, noise variances etc). Many \acrshort{mogpe} methods have been proposed, and in general, they differ in the formulation of their gating network and their approximate inference algorithms.

npmoe graphical model

after graphical model

\newline

cite:rasmussenInfinite2001 highlighted that the traditional \acrshort{moe} marginal likelihood does not apply when the experts are nonparametric. This is because the model assumes that the observations are i.i.d. given the model parameters, which is contrary to \acrshort{gp} models, which model the dependencies in the joint distribution, given the hyperparameters. cite:rasmussenInfinite2001 point out that there is a joint distribution corresponding to every possible combination of assignments (of observations to experts). The marginal likelihood is then a sum over exponentially many ($\ModeInd^\NumData$) sets of assignments,

where $\allModeVar = \{\modeVar_1, \ldots, \modeVar_\NumData \}$ represents a set of assignments for all observations. cref:fig-graphical-model-npmoe shows its graphical model representation. Note that $\allInputK = \{\singleInput : \singleModeVarK \}_{\numData=1}^{\NumData_{\modeInd}}$ and $\allOutputK = \{\singleOutput : \singleModeVarK \}_{\numData=1}^{\NumData_{\modeInd}}$ represent the $\NumData_{\modeInd}$ inputs and outputs assigned to the $\modeInd\text{th}$ expert respectively. This distribution factors into the product over experts, where each expert models the joint Gaussian distribution over the observations assigned to it. Assuming a mixture of Gaussian process regression models, the marginal likelihood in cref:eq-np-moe-marginal-likelihood-assign can be expanded to show each of the experts’ latent variables,

where each expert follows the standard Gaussian likelihood model,

with $\mode{f}$ and $\mode{\noiseVar}^2$ representing the latent function and the noise variance associated with the $\modeInd\text{th}$ expert. Note that for notational conciseness the dependency on $\modeVarK$ is dropped from $\singleExpertLikelihood$ as it is implied by the mode indexing $\mode{\latentFunc}$. The dependence on $\gatingParams$ and $\expertParams$ is also dropped from here on in. Independent \acrshort{gp} priors are placed on each of the expert’s latent functions,

where $\expertMeanFunc(⋅)$ and $\expertCovFunc(⋅, ⋅)$ represent the mean and covariance functions associated with the $\modeInd\text{th}$ expert respectively. This leads to each expert resembling a standard \acrshort{gp} regression model with a Gaussian likelihood. For notational conciseness the dependence on the inputs $\allInputK$ and hyperparameters $\expertParamsK$ is dropped for each \acrshort{gp} prior, i.e. $\expertPrior = p\left( \mode{f}(\allInputK) \mid \allInputK, \expertParamsK \right)$.

Related Work

Modelling the experts as \acrshort{gps} gives rise to a class of powerful models known as Mixture of Gaussian Process Experts (MoGPE). These models address both of the limitations that we discussed. They can model multimodal predictive distributions as they model a mixture of distributions over the outputs, usually a Gaussian mixture in the regression setting (cite:Tresp,Rasmussen2002). They are able to model non-stationary functions as each expert can learn different lengthscales, noise variances etc. and the gating network can turn each one “on” and “off” in different regions of the input space. Many approaches to \acrshort{mogpe} have been proposed and in general the key differences between them are the formulation of the gating network and their inference algorithms (and associated approximations) cite:Yuksel2012. The original \acrshort{mogpe} work by cite:Tresp proposed a gating network as a softmax of $K$ \acrshort{gps} (bottom left plot in Figure ref:gating_network_comparison). An EM inference scheme is proposed which, assuming there are $N$ observations, requires $\mathcal{O}(3KN^3)$ computations per iteration.

Although this gating function divides up the input space, cite:Rasmussen2002 argue that data not assigned to a \acrshort{gp} expert will lead to bias near the boundaries. Instead, cite:Rasmussen2002 formulate the gating network in terms of conditional distributions on the expert indicator variable, on which they place an input-dependent Dirichlet process prior. cite:Meeds2005 proposed an alternative infinite \acrshort{mogpe} similar to cite:Rasmussen2002 except that they specify a full generative model over the input and output space $p(\mathbf{y}, \mathbf{x})$ as opposed to just a conditional model $p(\mathbf{y} | \mathbf{x})$. Each expert’s inputs are now Gaussian distributed (middle top plot of Figure ref:gating_network_comparison), giving rise to the gating network in the middle bottom plot of Figure ref:gating_network_comparison. This gating network is not capable of turning an expert “on” in multiple regions of the input space. It will instead generate a new cluster and assign it a new expert. These approaches reduce the computational burden by associating each expert with a subset of the observations but rely on Markov Chain Monte Carlo (MCMC) inference schemes.

cite:Yuan proposed a \acrshort{mogpe} with a variational inference scheme which is much faster than using \acrshort{mcmc}. They assume that the expert indicators $α$ are generated by a Dirichlet distribution and that, given the expert indicator, the inputs follow a Gaussian mixture model (GMM), i.e. they have introduced a cluster indicator variable $z$. This contrasts with earlier approaches, which let the expert indicator act as the input cluster indicator. This is illustrated in the right-hand plots of Figure ref:gating_network_comparison, which show two Gaussian mixtures over the inputs (one for each expert) and the resulting expert mixing probabilities. Introducing the separate cluster indicator variable has given rise to a gating network that can turn a single expert “on” in multiple regions of the input space. However, it does not provide a handle for encoding prior information like the \acrshort{gp}-based gating network.

We note that mixture models are inherently unidentifiable as different combinations of mixture components and mixing coefficients can generate the same predictive distributions. The gating network can be considered a handle for encoding prior knowledge that can be used to constrain the set of admissible functions. This can improve identifiability and lead to learned representations that better reflect our understanding of the system, the simplest case being reordering the experts. Figure ref:gating_network_comparison provides a visual comparison of different gating network formulations, providing intuition for how they can differ in restricting the set of admissible functions.

Modelling the gating network with \acrshort{gps} enables us to encode informative prior knowledge through the choice of mean and covariance functions. For example, adopting a squared exponential covariance function would encode the prior belief that the mixing probabilities should vary smoothly across the input space. If modelling a dynamical system with oscillatory behaviour (e.g. an engine), a periodic kernel could be adopted. Prior knowledge of the mixing probability values can be encoded through the choice of mean function. We may also be interested in exploiting techniques from differential geometry, e.g. finding probabilistic geodesics (cite:Tosi2014) on the gating network. Again, our choice of covariance function can be used to encode how differentiable the gating network should be. Learning better representations will also improve the model’s ability to extrapolate as it will be more true to the underlying system.

We can also consider the implications of the mentioned gating networks from an inference perspective. It is known that developing inference algorithms for propagating uncertain (Gaussian) inputs through \acrshort{gps} is itself a challenging problem (see cite:Ustyuzhaninov2019). Therefore, developing a scalable inference algorithm to propagate inputs that are distributed according to a Gaussian mixture will likely lead to the loss of valuable information. Theoretically, the approaches by cite:Rasmussen2002,Meeds2005 are able to achieve accurate results, but their inference relies on \acrshort{mcmc} sampling methods, which can be slow to converge.

Our approach can be viewed as trading the computational benefits that can be obtained through the formulation of the gating network for the ability to improve identifiability with informative \acrshort{gp} priors. The main contributions of this paper are twofold. Firstly, we re-formulate a gating network based on \acrshort{gps} to improve identifiability. Secondly, we derive an evidence lower bound which improves on the complexity issues associated with inference when adopting such a gating network. Motivated by learning representations of dynamical systems with two regimes, we instantiate the model with two experts as it is a special case where the gating network can be calculated in closed form. This is because we seek to learn an operable mode with one expert and explain away the inoperable mode with the other. This results in the gating network indicating which regions of the input space are operable, providing a convenient space for planning.

Identifiable Mixtures of Gaussian Process Experts

intro

intro

Motivated by synergising model learning and control, the probabilistic model constructed in this chapter resembles a \acrfull{mogpe} with a \acrshort{gp}-based gating network. The \acrshort{gp}-based gating network infers latent geometric structure that is exploited by the control algorithm in Section ref:sec-traj-opt-geometric.

The marginal likelihood in cref:eq-np-moe-marginal-likelihood-assign can be expanded to show each of the experts’ latent variables,

where each expert follows the standard Gaussian likelihood model,

with $\mode{f}$ and $\mode{\noiseVar}^2$ representing the latent function and the noise variance associated with the $\modeInd\text{th}$ expert. Note that the dependency on $\modeVarK$ is dropped from $\singleExpertLikelihood$ for notational conciseness. Independent \acrshort{gp} priors are placed on each of the expert’s latent functions,

where $\expertMeanFunc(⋅)$ and $\expertCovFunc(⋅, ⋅)$ represent the mean and covariance functions associated with the $\modeInd\text{th}$ expert respectively. This leads to each expert resembling a standard \acrshort{gp} regression model with a Gaussian likelihood. For notational conciseness the dependence on the inputs $\allInputK$ and hyperparameters $\expertParamsK$ is dropped for each \acrshort{gp} prior, i.e. $\expertPrior = p\left( \mode{f}(\allInputK) \mid \allInputK, \expertParamsK \right)$. \todo{add reference to section number} Note that the expectation over each expert’s latent function in cref:eq-np-moe-marginal-likelihood is inside the marginalisation of the expert indicator variable $\modeVar$.

For ease of notation and understanding, only a single output dimension has been considered, although in most scenarios the output dimension will be greater than $1$. This work can be extended to multiple output dimensions following standard multioutput \acrshort{gp} methodologies cite:vanderwilkFramework2020.

Identifiable Mixtures of Sparse Variational Gaussian Process Experts

Motivated by improving identifiability and learning latent spaces for control, this work adopts a \acrshort{gp}-based gating network resembling a \acrshort{gp} classification model, similar to that used in the original \acrshort{mogpe} model citep:trespMixtures2000a. The \acrshort{gp}-based gating network can be used to constrain the set of admissible functions through the placement of informative \acrshort{gp} priors on the gating functions. Further to this, in cref:chap-traj-opt-control, the geometry of the \acrshort{gp}-based gating network is used to encode mode remaining behaviour into control strategies. cref:chap-active-learning then leverages the power of the \acrshort{gp}-based gating network to construct an explorative trajectory optimisation algorithm that can consider the information gain over trajectories, as opposed to just at the next state.

The marginal likelihood is given by,

where the gating network resembles a \acrshort{gp} classification model, with a factorised classification likelihood $\gatingLikelihood$ dependent on input-dependent functions $\GatingFunc = \{\mode{\gatingFunc} : \inputDomain → \R \}_{\modeInd=1}^{\ModeInd}$, known as gating functions. The probability mass function over the expert indicator variable is given by,

where $[\singleModeVarK]$ denotes the Iverson bracket. The probabilities of this Categorical distribution $\singleGatingLikelihood$ are governed by a classification likelihood (Bernoulli/Softmax). Fig. ref:fig-graphical-model-gp-gating-network shows the graphical model where the $\ModeInd$ latent gating functions $\GatingFunc$ are evaluated and normalised to obtain the mixing probabilities $\singleGatingLikelihood$. The mode indicator variable $\singleModeVar$ is then sampled from the Categorical distribution governed by these probabilities. The indicated expert’s latent function $\mode{f}$ and noise model $\mathcal{N}\left(\bm0,\noiseVarK^2 \right)$ are then evaluated to generate the output $\singleOutput$.

Softmax ($\ModeInd>2$) In the general case, when there are more than two experts, the gating network’s likelihood is defined as the Softmax function,

Each gating function $\mode{\gatingFunc}$ describes how its corresponding mode’s mixing probability varies over the input space. Modelling the gating network with input-dependent functions enables informative prior knowledge to be encoded through the placement of \acrshort{gp} priors on each gating function. Further to this, if the modes are believed to only vary over a subset of the state-control input space, then the gating functions can depend only on this subset. Independent \acrshort{gp} priors are placed on each gating function, giving the gating network prior,

where $\gatingMeanFunc(⋅)$ and $\gatingCovFunc(⋅,⋅)$ are the mean and covariance functions associated with the $\modeInd^\text{th}$ gating function. Similarly to the experts, dependence on the inputs and hyperparameters is dropped from the gating network’s \acrshort{gp} prior, i.e. $\gatingPrior = p(\GatingFunc(\allInput) \mid \allInput, \gatingParams)$. In contrast to the experts, partitioning the data set is not desirable for the gating network \acrshort{gps}, as each gating function should depend on all of the training observations.

Each mode’s mixing probability $\singleGatingPosterior$ is then obtained by marginalising all of the gating functions. In the general case where $\singleGatingLikelihood$ uses the softmax function (cref:eq-softmax) this integral is intractable, so it is approximated with Monte Carlo quadrature.
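A minimal sketch of this Monte Carlo estimate is given below: the gating functions are assumed to have (approximate) Gaussian marginals at the input of interest, samples are drawn, and the softmax of each sample is averaged. The means, variances and sample count are illustrative.

#+begin_src python
import numpy as np
from scipy.special import softmax

def mc_mixing_probs(gating_means, gating_vars, num_samples=1000, rng=None):
    """Monte Carlo estimate of Pr(alpha = k | x) from K Gaussian gating marginals."""
    rng = np.random.default_rng() if rng is None else rng
    h_samples = rng.normal(loc=gating_means,
                           scale=np.sqrt(gating_vars),
                           size=(num_samples, len(gating_means)))
    return softmax(h_samples, axis=-1).mean(axis=0)

probs = mc_mixing_probs(gating_means=np.array([1.0, -0.5, 0.2]),
                        gating_vars=np.array([0.3, 0.1, 0.5]))
#+end_src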

Bernoulli ($\ModeInd=2$) Instantiating the model with two experts, $\singleModeVar ∈ \{1, 2\}$, is a special case where only a single gating function is needed. This is because the output of a function $\gatingFunc(\singleInput)$ can be mapped through a sigmoid function $\text{sig} : \R → [0, 1]$ and interpreted as a probability,

If this sigmoid function satisfies the point symmetry condition then the following holds, $Pr(\singleModeVar=2 \mid \modei{\gatingFunc}{1}(\singleInput)) = 1 - Pr(\singleModeVar=1 \mid \modei{\gatingFunc}{1}(\singleInput))$. This only requires a single gating function and no normalisation term needs to be calculated. If the sigmoid function in cref:eq-sigmoid is selected to be the Gaussian cumulative distribution function $Φ(\gatingFunc(⋅)) = ∫_{-∞}^{\gatingFunc(⋅)} \mathcal{N}(τ \mid 0, 1) \text{d} τ$, then the mixing probability can be calculated in closed-form,

where $μ_h$ and $σ^2_h$ represent the mean and variance of the gating \acrshort{gp} at $\singleInput$ respectively.
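This closed-form case is the standard Gaussian-probit identity $\mathbb{E}_{h ∼ \mathcal{N}(μ_h, σ^2_h)}[Φ(h)] = Φ(μ_h / \sqrt{1 + σ^2_h})$, sketched below with arbitrary example values for the gating \acrshort{gp} marginal.

#+begin_src python
from math import sqrt
from scipy.stats import norm

def bernoulli_mixing_prob(mu_h, var_h):
    """Closed-form two-expert mixing probabilities for a probit link."""
    prob_mode_1 = norm.cdf(mu_h / sqrt(1.0 + var_h))
    return prob_mode_1, 1.0 - prob_mode_1          # Pr(alpha=1), Pr(alpha=2)

p1, p2 = bernoulli_mixing_prob(mu_h=0.8, var_h=0.25)
#+end_src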

graphical model

graphical model non sparse experts

Generative Model

With this formulation, the marginal likelihood of this model can be written to clearly show the expectations over the latent variables,

where $\expertPrior$ is the $\modeInd\text{th}$ expert’s prior and $\gatingPrior$ is the gating network’s prior. Observe that the \acrshort{gp} priors have removed the factorisation over data which is present in the \acrshort{moe} marginal likelihood cref:eq-mixture-marginal-likelihood. Fig. ref:fig-graphical-model shows the graphical model where the $\ModeInd$ latent gating functions $\GatingFunc$ are evaluated and normalised to obtain the mixing probabilities $\singleGatingLikelihood$. The mode indicator variable $\singleModeVar$ is then sampled from the Categorical distribution governed by these probabilities. The indicated mode’s latent function $\mode{f}$ and process noise $\noiseVarK$ are then evaluated to generate the state difference $\singleOutput$.

This model makes single-step probabilistic predictions, where the predictive distribution over the output $\singleOutput$ is given by a mixture of Gaussians. This provides the model flexibility to model multimodal transition dynamics $f$ as mixtures of $\modeInd$ experts. In the two expert case the marginal likelihood is a mixture of two Gaussians and can be calculated in closed-form. \todo{can it be calculated in closed form?} It has complexity $\mathcal{O}(\ModeInd\NumData^4)$ to evaluate and the general case (with more than two experts) has complexity $\mathcal{O}(\ModeInd^2\NumData^4)$.

For ease of notation and understanding, only a single output dimension has been considered, although in most scenarios the state dimension will be greater than $1$. This work considers independent output dimensions which follow trivially from multioutput \acrshort{gp} methodologies.

\todo{Can I just say it’s left out because trivial?}

Gating Network

The gating network governs how the dynamics switch between modes. This work is interested in spatially varying modes, so it formulates an input-dependent Categorical distribution over $α_t$,

where $[α_t=k]$ denotes the Iverson bracket. It should be noted that other gating networks have been proposed that provide computational benefits for inference. However, these gating networks often lack a handle for encoding domain knowledge that can be used to restrict the set of admissible functions.

\todo[inline]{insert gating network references} The probabilities of this Categorical distribution $Pr(α_t=k \mid \mathbf{h}_t)$ are obtained by evaluating $K$ latent gating functions $\{h^{(k)}\}_{k=1}^{K}$ and normalising their output. Each gating function evaluated at $\hat{\mathbf{x}}_{t-1}$ is denoted as $h^{(k)}_t = h^{(k)}(\hat{\mathbf{x}}_{t-1})$ and at all observations as ${h}^{(k)}_{1:T}$. The set of all gating functions evaluated at $\hat{\mathbf{x}}_{t-1}$ is denoted as $\mathbf{h}_t$ and at all observations as $\mathbf{h}_{1:T}$. Each gating function $h^{(k)}$ describes how its corresponding mode’s mixing probability varies over the input space.

This work is interested in finding trajectories that can avoid areas of the transition dynamics model that cannot be predicted confidently. Placing independent sparse \acrshort{gp} priors on each gating function provides a principled approach to modelling the epistemic uncertainty associated with each gating function. The gating function’s posterior covariance is a quantitative value that can be exploited by the trajectory optimisation.

Each gating function’s inducing inputs are denoted $\bm\xi_h^{(k)}$ and outputs as $\hat{\mathbf{h}}^{(k)}$. For all gating functions, they are collected as $\bm\xi_h$ and $\hat{\mathbf{h}}$ respectively. The probability that the $t\text{th}$ observation is generated by mode $k$ given the inducing inputs is obtained by marginalising $\mathbf{h}_t$,

where $p(\mathbf{h}_t \mid \hat{\mathbf{x}}_{t-1}, \hat{\mathbf{h}}) = ∏_{k=1}^{K} p\left({h}^{(k)}_t \mid \hat{\mathbf{x}}_{t-1}, \hat{\mathbf{h}}^{(k)}\right)$ is the product of the $K$ independent sparse \acrshort{gp} priors on the gating functions.

Two Experts

Instantiating the model with two experts is a special case where only a single gating function is needed. The output of a function can be mapped through a sigmoid function $\text{sig} : \R → [0, 1]$ and interpreted as a probability $Pr(α=1 \mid \gatingFunc(⋅))$. If this sigmoid function satisfies the point symmetry condition then we know that $Pr(α=2 \mid \gatingFunc(⋅)) = 1 - Pr(α =1 \mid \gatingFunc(⋅))$.

The distribution over $\gatingFunc(⋅)$ follows from its \acrshort{gp} conditional.

We note that $p({h}^{(k)}_n \mid \mathbf{h}_{¬ n}, \mathbf{X})$ is a \acrshort{gp} conditional and denote its mean $μ_h$ and variance $σ^2_h$. Choosing the sigmoid as the Gaussian cdf $Φ(h_n) = ∫_{-∞}^{h_n} \mathcal{N}(τ \mid 0, 1) \text{d} τ$ leads to,

Given this gating network, our marginal likelihood is now an analytic mixture of two Gaussians, as the single gating function can be marginalised analytically.

Approximate Inference label:sec-inference

intro

Performing Bayesian inference involves finding the posterior over the latent variables,

where the denominator is the marginal likelihood from cref:eq-marginal-likelihood-assign. Exact inference in our model is intractable due to the marginalisation over the set of expert indicator variables. For this reason, we resort to a variational approximation. The rich structure of our model makes it hard to construct an \acrshort{elbo} that can be evaluated in closed form whilst accurately modelling the complex dependencies. Further to this, the marginal likelihood is extremely expensive to evaluate, as there are $\ModeInd^\NumData$ sets of assignments $\allModeVar$ that need to be marginalised. For each set of assignments, $\ModeInd$ \acrshort{gp} experts need to be evaluated, each with complexity $\mathcal{O}(\NumData^3)$. For these reasons, this work derives a variational approximation based on inducing variables that provides scalability by utilising stochastic gradient-based optimisation.

\acrfull{svi} citep:hoffmanStochastic2013 relies upon having a set of local variables factorised across observations and a set of global variables. The marginalisation over the set of expert indicator variables $\allModeVar$ in cref:eq-np-moe-marginal-likelihood is prohibitive to \acrshort{svi}. This is because \acrshort{svi} requires a set of local variables factorised over data but the marginalisation considers the sets of assignments for the entire data set. Following the approach by cite:titsiasVariational2009, we augment the probability space with a set of $\NumInducing$ inducing variables for each \acrshort{gp}. However, instead of collapsing these inducing variables, we explicitly represent them as variational distributions and use them to lower bound (and then further bound) the marginal likelihood, similar to cite:hensmanGaussian2013,hensmanScalable2015. cref:fig-graphical-model-sparse shows the graphical model of the augmented joint probability space.

sparse graphical model

Augmented Probability Space

Augmented experts We sidestep the hard assignment of observations to experts by augmenting each expert with a set of separate independent inducing points $(\expertInducingInput, \expertInducingOutput)$. Each expert’s inducing points are assumed to be from its \acrshort{gp} prior,

where the set of all inducing inputs associated with the experts has been denoted $\expertsInducingInput$ and the set of all inducing variables as $\expertsInducingOutput$. Note that the dependence on the inducing inputs has been dropped for notational conciseness. Introducing separate inducing points from each expert’s \acrshort{gp} can loosely be seen as “partitioning” the observations between experts. However, as the assignment of observations to experts is not known a priori, the inducing inputs $\expertInducingInput$ and variables $\expertInducingOutput$, must be inferred from observations.

Augmented gating network Following a similar approach for the gating network, each gating function is augmented with a set of $\NumInducing$ inducing points from its corresponding \acrshort{gp} prior,

where $\gatingInducingOutput$ are the inducing variables associated with the $\modeInd\text{th}$ gating function. Again, the dependence on the inducing inputs has been dropped for notational conciseness. The set of all inducing variables associated with the gating network has been denoted $\gatingsInducingOutput$. Partitioning of the data set is not desirable for the gating function \acrshort{gps} as each gating function should depend on all of the training observations. For this reason, the gating functions share the same inducing inputs $\gatingInducingInput$.

Marginal likelihood These inducing points are used to approximate the marginal likelihood with a factorisation over observations that is favourable for constructing a \acrshort{gp}-based gating network. Our approximate marginal likelihood is given by,

where the conditional distributions $\singleExpertGivenInducing$ and $\singleGatingGivenInducing$ follow from standard sparse \acrshort{gp} methodologies,

where $\expertKernelMM = k_{\modeInd} (\expertInducingInput,\expertInducingInput)$ represents the $\modeInd\text{th}$ expert’s kernel evaluated between its inducing inputs, $\expertKernelnn = k_{\modeInd} (\singleInput, \singleInput)$ represents it evaluated at the $\numData\text{th}$ training input and $\expertKernelnM = k_{\modeInd} (\singleInput, \expertInducingInput)$ between the $\numData\text{th}$ training input and its inducing inputs. The same definitions apply for the gating network. Figure ref:fig-graphical-model-sparse shows the graphical model of this augmented joint probability model.

Our work follows from sparse \acrshort{gp} methodologies that assume, given the inducing variables, the latent function values factorise over observations. Our approximation assumes that given the inducing points, the marginalisation over every possible assignment of data points to experts can be factorised over data. In a similar spirit to the \acrfull{fitc} approximation citep:naish-guzmanGeneralized2008,quinonero-candelaUnifying2005, this can be viewed as a likelihood approximation,

Importantly, the factorisation over observations has been moved outside of the marginalisation over the expert indicator variable, i.e. the expert indicator variable can be marginalised for each data point separately. This approximation assumes that the inducing variables, $\{\expertInducingOutput\}_{\modeInd=1}^{\ModeInd}$, are a sufficient statistic for their associated latent function values, $\{\mode{\latentFunc}(\allInputK) \}_{\modeInd=1}^{\ModeInd}$, and the set of assignments $\allModeVar$. This approximation becomes exact in the limit $\ModeInd\NumInducing=\NumData$, if each expert’s inducing points represent the true data partition $\{\expertInducingInput, \expertInducingOutput\}_{\modeInd=1}^{\ModeInd} = \{\allInputK, \mode{\latentFunc}(\allInputK)\}_{\modeInd=1}^{\ModeInd}$. It is also worth noting that cref:eq-augmented-marginal-likelihood captures a rich approximation of each expert’s covariance but as $\ModeInd\NumInducing \ll \NumData$ the computational complexity is much lower. This approximation efficiently couples the gating network and the experts by marginalising the expert indicator variable for each data point separately.
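A minimal sketch of the per-data-point term implied by this likelihood approximation is shown below: given (sampled or point) values of each expert's latent function and of the gating functions at a single input, the expert indicator is marginalised by a weighted sum. The numbers are illustrative, and the softmax stands in for the gating likelihood in the general ($\ModeInd > 2$) case.

#+begin_src python
import numpy as np
from scipy.stats import norm
from scipy.special import softmax

def per_point_likelihood(y_n, expert_f_n, expert_noise_vars, gating_h_n):
    """p(y_n | F_n, H_n) = sum_k Pr(alpha_n = k | H_n) N(y_n | f_{k,n}, sigma_k^2)."""
    mixing_probs = softmax(gating_h_n)                       # Pr(alpha_n = k)
    expert_densities = norm.pdf(y_n, loc=expert_f_n,
                                scale=np.sqrt(expert_noise_vars))
    return np.sum(mixing_probs * expert_densities)

lik = per_point_likelihood(y_n=0.4,
                           expert_f_n=np.array([0.35, -0.2]),
                           expert_noise_vars=np.array([0.01, 0.1]),
                           gating_h_n=np.array([1.2, -0.6]))
#+end_src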

Our approximate marginal likelihood captures the joint distribution over the data and assignments through the inducing variables $\expertsInducingOutput$. As such, information regarding the assignment of observations to experts must pass through the bottleneck of the inducing variables. This approximation induces a local factorisation over observations and a set of global variables – the necessary conditions for \acrshort{svi}. This approach can loosely be viewed as parameterising the full nonparametric model in cref:eq-np-moe-marginal-likelihood-assign to obtain a desirable factorisation for 1) constructing a \acrshort{gp}-based gating network and 2) deriving an \acrshort{elbo} that can be optimised with stochastic gradient methods, whilst still capturing the complex dependencies between the gating network and experts.

Augmented probability space

\acrfull{svi} citep:hoffmanStochastic2013 relies upon having a set of local variables factorised across observations and a set of global variables. Unfortunately, the ordering of the marginalisation over the expert indicator variable and the product over the $\NumData$ observations in cref:eq-marginal-likelihood is prohibitive to \acrshort{svi}.

Augmented Probability Space Following the approach by cite:titsiasVariational2009 and cite:hensmanGaussian2013, the probability space is first augmented with a set of $\NumInducing$ inducing variables from each \acrshort{gp} prior,

where $\gatingInducingOutput = \mode{\gatingFunc}(\gatingInducingInput)$ are the inducing points associated with the $\modeInd\text{th}$ gating function. Partitioning of the data set is not desirable for the gating function \acrshort{gps} as each gating function should depend on all of the training observations. For this reason, the gating functions share the same inducing inputs $\gatingInducingInput$. The set of all inducing variables associated with the gating network has been denoted $\gatingsInducingOutput$.

Our approximate marginal likelihood is then given by,

where the joint distribution over the data is captured by the inducing variables $(\gatingsInducingOutput, \expertsInducingOutput)$. As such, information regarding the assignment of observations to experts must pass through the bottleneck of the inducing variables. This approximation induces a local factorisation over observations and a set of global variables – the inducing variables – the necessary conditions for \acrshort{svi}. Figure ref:fig-graphical-model-sparse shows the graphical model of this augmented joint probability model.

The augmented marginal likelihood is then given by,

where the conditional distributions $\allExpertGivenInducing$ and $\allGatingGivenInducing$ follow from standard sparse \acrshort{gp} methodologies,

where $\expertKernelMM = k_{\modeInd} (\expertInducingInput,\expertInducingInput)$ represents the $\modeInd\text{th}$ expert’s kernel evaluated between its inducing inputs, $\expertKernelNN = k_{\modeInd} (\allInput, \allInput)$ represents it evaluated between the training inputs and $\expertKernelNM = k_{\modeInd} (\allInput, \expertInducingInput)$ between the training inputs and its inducing inputs. The same definitions apply for the gating network.

A central assumption of our work follows from sparse \acrshort{gp} methodologies, that assume, given the inducing variables, the latent function values factorise over observations. In a similar spirit to the \acrshort{fitc} approximation cite:naish-guzmanGeneralized2008,quinonero-candelaUnifying2005, we propose the following likelihood approximation,

Importantly, this approximation moves the factorisation over observations outside of the marginalisation over the expert indicator variable. It captures a rich approximation of each expert’s covariance whilst marginalising the expert indicator variable. This approximation assumes that the inducing variables, $\{\gatingInducingOutput, \expertInducingOutput\}_{\modeInd=1}^{\ModeInd}$, are a sufficient statistic for their associated latent function values, $\{\mode{\latentFunc}(\allInput), \mode{\gatingFunc}(\allInput) \}_{\modeInd=1}^{\ModeInd}$. The approximation becomes exact in the limit $\NumInducing=\NumData$, when all of the experts’ inducing inputs $\expertInducingInput$ and the gating network’s inducing inputs $\gatingInducingInput$ are equal to the training inputs, $\expertInducingInput = \gatingInducingInput=\allInput$. Our approximate marginal likelihood is then given by,

where the joint distribution over the data is captured by the inducing variables $(\gatingsInducingOutput, \expertsInducingOutput)$. As such, information regarding the assignment of observations to experts must pass through the bottleneck of the inducing variables. This approximation induces a local factorisation over observations and a set of global variables – the inducing variables – the necessary conditions for \acrshort{svi}. Figure ref:fig-graphical-model-sparse shows the graphical model of this augmented joint probability model.

Evidence Lower Bounds

Instead of collapsing the inducing variables as seen in cite:titsiasVariational2009, they can be explicitly represented as variational distributions, $(\expertsInducingVariational, \gatingsInducingVariational)$, and used to obtain a variational lower bound, also known as the \acrfull{elbo}. This section derives three \acrshort{elbo}s. The first bound is the tightest but requires approximating $\NumInducing$ dimensional integrals for each expert and gating function. Two further lower bounds, which replace some (or all) of the $\NumInducing$ dimensional integrals with one-dimensional integrals, are then derived. These further bounds offer improved computational properties at the cost of loosening the bound. All three of these bounds are evaluated in cref:sec-mcycle-results.

Tight lower bound Following a similar approach to cite:hensmanGaussian2013,hensmanScalable2015, a lower bound on cref:eq-augmented-marginal-likelihood can be obtained,

where we parameterise the variational posteriors to be independent Gaussians,

The bound in cref:eq-lower-bound-tight meets the necessary conditions to perform stochastic gradient methods on $\expertsInducingVariational$ and $\gatingsInducingVariational$, as the expected log likelihood (first term) is written as a sum over input-output pairs. However, this expectation cannot be calculated in closed form and must be approximated. The joint distributions over the inducing variables for each expert \acrshort{gp} $\expertInducingVariational$ and each gating function \acrshort{gp} $\gatingInducingVariational$, are $\NumInducing$ dimensional multivariate normal distributions. Therefore, each expectation requires an $\NumInducing$ dimensional integral to be approximated.

Further lower bound Following cite:hensmanScalable2015, these issues can be overcome by further bounding $\tightBound$ from cref:eq-lower-bound-tight. This removes the $\NumInducing$ dimensional integrals associated with each of the gating functions. Jensen’s inequality can be applied to the conditional probability $\singleLatentGatingGivenInducing$, obtaining the further bound,

where $\gatingsVariational$ represents the variational posterior given by,

Moving the marginalisation over the latent gating functions $\GatingFunc(\singleInput)$ outside of the marginalisation over the expert indicator variable is possible because the mixing probabilities are dependent on all of the gating functions and not just their associated gating function. In contrast, moving the marginalisation over each expert’s latent function $\mode{\latentFunc}(\singleInput)$ outside of the marginalisation over the expert indicator variable corresponds to changing the underlying model, in particular, the likelihood approximation in cref:eq-likelihood-approximation.

Further^2 lower bound Nevertheless, we proceed and further bound the experts for comparison. Jensen’s inequality is applied to the conditional probability $\singleLatentExpertGivenInducing$, obtaining the further^2 bound,

where $\expertsVariational$ represents the variational posterior given by,

Intuitively, this bound can be seen as modifying the likelihood approximation in cref:eq-likelihood-approximation. Instead of mixing the \acrshort{gps} associated with each expert, this approximation simply mixes their associated noise models.

As each \acrshort{gp}’s inducing variables are normally distributed, the functional form of the variational posteriors is given by,

where $\mode{\mathbf{A}} = \expertKernelnM \expertKernelMM^{-1}$ and $\mode{\hat{\mathbf{A}}} = \gatingKernelnM \gatingKernelMM^{-1}$. Importantly, these variational posteriors marginalise the inducing variables in closed form, with Gaussian convolutions. $\furtherBound$ removes $\ModeInd$ of the undesirable approximate $\NumInducing$ dimensional integrals from cref:eq-lower-bound-tight and $\furtherBoundTwo$ removes $\ModeInd^2$. The variational expectation in cref:eq-lower-bound-further still requires approximation; however, the integrals are now only one-dimensional. These integrals are approximated with Gibbs sampling and, in practice, only single samples are used because the added stochasticity helps the optimisation.
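To make the functional form above concrete, the sketch below evaluates the marginals of a single \acrshort{gp}'s variational posterior at a set of inputs, given its inducing inputs and variational distribution. It is a minimal numpy illustration, not the implementation used in this thesis; the function name and the jitter term are assumptions.

#+begin_src python
import numpy as np

def svgp_marginals(k_xx_diag, K_xm, K_mm, m, S, jitter=1e-6):
    """Marginals of q(f(x)) = N(A m, k(x, x) - A (K_mm - S) A^T) with
    A = K_xm K_mm^{-1}.  Shapes: k_xx_diag (N,), K_xm (N, M), K_mm (M, M),
    m (M,), S (M, M)."""
    L = np.linalg.cholesky(K_mm + jitter * np.eye(K_mm.shape[0]))
    A = np.linalg.solve(L.T, np.linalg.solve(L, K_xm.T)).T  # A = K_xm K_mm^{-1}
    mean = A @ m
    var = k_xx_diag - np.einsum("nm,mk,nk->n", A, K_mm - S, A)
    return mean, var
#+end_src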

The tight lower bound $\tightBound$ is the most accurate lower bound, but it is also the most computationally expensive. Both the further lower bound $\furtherBound$ and the further^2 lower bound $\furtherBoundTwo$ have lower computational complexity at the cost of being looser bounds. The performance of these bounds is evaluated in cref:sec-mcycle-results.

Optimisation

The bounds in cref:eq-lower-bound-further,eq-lower-bound-tight,eq-lower-bound-further-2 meet the necessary conditions to perform stochastic gradient methods on $\expertsInducingVariational$ and $\gatingsInducingVariational$. Firstly, they contain a sum of $\NumData$ terms corresponding to input-output pairs, enabling optimisation with mini-batches. Secondly, the expectations over the log-likelihood are calculated using Monte Carlo samples.

Stochastic optimisation At each iteration $j$, a random subset of $\NumData_b$ data points is sampled from the data set $\mathcal{D}$ to get a minibatch $\mathcal{D}_j = \{\x_i, \y_i\}_{i=1}^{\NumData_b}$. The further lower bound $\furtherBound$ is then approximated by,

where $\expertInducingOutput^{(\expertSampleInd)} ∼ \expertInducingVariational$ and $\mathbf{\gatingFunc}(\x_{\batchSampleInd})^{(\gatingSampleInd)} ∼ \gatingsVariationalSample$ denote samples from the variational posteriors. $\hat{S}$ samples are drawn from the gating network’s variational posterior and $S$ samples are drawn from the experts’ variational posterior. The variational distributions over the inducing variables are represented using the mean vector $\mode{\mathbf{m}}$ and the lower-triangular Cholesky factor $\mode{\mathbf{L}}$ of the covariance matrix $\mode{\mathbf{S}} = \mode{\mathbf{L}} \mode{\mathbf{L}}^T$. A downside to this formulation is that $(\ModeInd^2)(M-1)M/2 + \ModeInd^2 M$ extra parameters need to be optimised. In the two expert case, this reduces to $(\ModeInd+1)(M-1)M/2 + (\ModeInd+1) M$ extra parameters. Optimising the inducing inputs ($\gatingInducingInput$ and $\expertsInducingInput$) introduces a further $\ModeInd^2\NumInducing\InputDim$ optimisation parameters. The inducing inputs $\gatingInducingInput,\{\expertInducingInput\}_{\modeInd=1}^{\ModeInd}$, kernel hyperparameters and noise variances are treated as variational hyperparameters and optimised alongside the variational parameters, using stochastic gradient descent, e.g. Adam citep:kingmaAdam2017.
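As a concrete illustration of this stochastic estimator, the sketch below evaluates a Monte Carlo approximation of the further lower bound on a minibatch, rescaling the expected log-likelihood by $\NumData / \NumData_b$ and subtracting the KL terms over the inducing variables. The model interface (=sample_gating_probs=, =sample_expert_f=, =expert_noise_vars=, =prior_kl=) is assumed purely for illustration and is not a real API.

#+begin_src python
import numpy as np

def further_bound_estimate(X, Y, model, num_data, num_samples=1):
    """Monte Carlo estimate of the further lower bound on a minibatch (X, Y).
    The expected log-likelihood is rescaled by N / N_b so its expectation
    matches the full-data term; the KL divergences over the inducing
    variables are subtracted once."""
    batch_size = X.shape[0]
    expected_log_lik = 0.0
    for _ in range(num_samples):
        mix_probs = model.sample_gating_probs(X)      # (N_b, K) mixing probs
        mixture_pdf = np.zeros(batch_size)
        for k in range(mix_probs.shape[1]):
            f_k = model.sample_expert_f(X, expert=k)  # (N_b, D) expert sample
            var_k = model.expert_noise_vars[k]        # (D,) noise variances
            log_gauss = -0.5 * np.sum(
                np.log(2 * np.pi * var_k) + (Y - f_k) ** 2 / var_k, axis=-1)
            mixture_pdf += mix_probs[:, k] * np.exp(log_gauss)
        expected_log_lik += np.sum(np.log(mixture_pdf)) / num_samples
    return (num_data / batch_size) * expected_log_lik - model.prior_kl()
#+end_src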

Computational complexity Assuming that each expert has the same number of inducing points $\NumInducing$, the cost of computing the KL divergences and their derivatives is $\mathcal{O}\left( \ModeInd \NumInducing^3 \right)$. The cost of computing the expected likelihood term depends on the batch size $\NumData_b$. For each data point in the minibatch, each of the $\ModeInd$ gating function variational posteriors has complexity $\mathcal{O}\left( \NumInducing^2 \right)$ to evaluate. For each data point, only a single sample is drawn from each of these distributions. Sampling each expert’s inducing variable distribution $\expertInducingVariational$ has complexity $\mathcal{O}(\NumInducing^2)$ because the covariance is represented by its lower-triangular Cholesky factor. In addition to this sampling, calculating each expert’s conditional $\singleExpertGivenInducing$ given these samples has complexity $\mathcal{O}(\NumInducing^2)$.

Predictions

For a given set of test inputs $\testInput ∈ \R^{\NumTest × \InputDim}$, this model makes probabilistic predictions following a mixture of $\ModeInd$ Gaussians. Making predictions with this model involves calculating a density over the output for each expert and combining them using the probabilities obtained from the gating network, i.e. marginalising the expert indicator variable. This is given by,

where dependence on the test inputs $\testInput$ and the training inputs $\allInput$ have been dropped for notational conciseness.

Experts The experts make predictions at new test locations by integrating over their latent function posteriors,

However, the experts’ true posteriors $\expertPosterior$ are not known and have been approximated. Each expert’s approximate posterior is given by $q(\mode{\latentFunc}(\allInputK), \expertInducingOutput) = p(\mode{\latentFunc}(\allInputK) \mid \expertInducingOutput) q(\expertInducingOutput)$. To make a prediction at a set of test locations $\testInput$, we substitute our approximate posterior into the standard probabilistic rule,

where $\mode{\mathbf{A}} = \expertKernelsM \expertKernelMM^{-1}$. This integral has complexity $\mathcal{O}(\NumInducing^2)$.

Gating network The mixing probabilities associated with the gating network are obtained by integrating the gating network’s posterior through the gating likelihood,

Again, the gating network’s true posterior $\gatingPosterior$ has been approximated,

where $\mode{\hat{\mathbf{A}}} = \gatingKernelsM \gatingKernelMM^{-1}$. In the general case, when $\ModeInd > 2$, the gating likelihood is the softmax,

so cref:eq-gating-prediction is approximated with Monte Carlo quadrature. In the two expert case there is only a single gating function, $\gatingFunc_1$, as,

In this case, the gating likelihood is the Gaussian cdf,

so cref:eq-gating-prediction can be calculated in closed-form with,

where $\mu_{h_*}$ and $\sigma^2_{h_*}$ are the mean and variance of the variational posterior $\gatingVariationalPosteriorBernoulli$ at $\singleTestInput$.
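This closed form follows from the standard probit-Gaussian identity $\int \Phi(h) \, \mathcal{N}(h \mid \mu, \sigma^2) \, \mathrm{d}h = \Phi\left(\mu / \sqrt{1 + \sigma^2}\right)$, sketched below; the function name is illustrative.

#+begin_src python
from scipy.stats import norm

def mixing_prob_two_experts(mu_h, var_h):
    """Closed-form mixing probability for the two-expert case with a
    Gaussian-cdf gating likelihood:
    Pr(alpha_* = 1 | x_*) = Phi(mu_h / sqrt(1 + var_h))."""
    return norm.cdf(mu_h / (1.0 + var_h) ** 0.5)
#+end_src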


Evaluation of Model and Approximate Inference

Intro

As a \acrfull{moe} method, our model aims to improve on standard \acrshort{gp} regression with the ability to model non-stationary functions and multimodal distributions over the output variable. With this in mind, the model and approximate inference scheme are evaluated on two data sets. Following other \acrshort{mogpe} work, they are first tested on the motorcycle data set citep:Silverman1985. Although this data set does not represent state transitions from a dynamical system, it does contain non-stationary points and heterogeneous noise, making it interesting to study from the \acrshort{mogpe} perspective. Secondly, they are tested on the illustrative example from cref:illustrative_example. That is, a data set collected onboard a DJI Tello quadcopter flying in an environment subject to two dynamics modes.

Experiments

All data sets were split into test and training sets with $70\%$ for training and $30\%$ for testing. In order to evaluate and compare the full predictive posteriors, the \acrfull{nlpp} is computed on the test set. The models are also compared using the \acrfull{rmse} and the \acrfull{mae}. Given a test data set $(\testInput, \testOutput) = \{(\singleTestInput, \singleTestOutput)\}_{\numTest=1}^{\NumTest}$, they are calculated as follows,

where $\predictedSingleOutput$ is the model’s prediction at $\singleTestInput$. Note that all figures in this section show models that were trained on the full data set, i.e. no test/train split.
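For clarity, the three metrics can be computed as below, where the per-point log predictive densities are evaluated under the full (mixture) predictive posterior. Averaging (rather than summing) over test points is an assumption about the reporting convention.

#+begin_src python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between test targets and predictive means."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error between test targets and predictive means."""
    return np.mean(np.abs(y_true - y_pred))

def nlpp(log_pred_densities):
    """Negative log predictive probability: the average of -log p(y_* | x_*)
    over the test set."""
    return -np.mean(log_pred_densities)
#+end_src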

Evaluation on Motorcycle Data Set label:sec-mcycle-results

intro

The Motorcycle data set (discussed in cite:Silverman1985) contains 133 data points ($\allInput ∈ \R^{133 × 1}$ and $\allOutput ∈ \R^{133 × 1}$) and input dependent noise. The data set represents motorcycle impact data, time (ms) vs acceleration (g), and is shown by the black crosses in cref:fig-y-mcycle-two-experts.

To test the performance of \acrshort{mosvgpe}, the model is instantiated with $\ModeInd=2$ and $\ModeInd=3$ experts. All experiments on the Motorcycle data set use $\NumInducing=32$ inducing points for all \acrshort{gps} and are trained for $25,000$ iterations with Adam citep:kingmaAdam2017, with a learning rate of $0.01$ and a batch size of $\NumData_b=16$. The results are compared against a \acrshort{gp} and a \acrfull{svgp}, which use Squared Exponential kernels with \acrfull{ard} and a Gaussian likelihood.

cref:tab-mcycle-metrics summarises the results for the three \acrshort{elbo}s ($\tightBound$, $\furtherBound$, $\furtherBoundTwo$) and compares them to a standard \acrshort{gp} regression model and a \acrshort{svgp} method instantiated with $\NumInducing=16$ and $\NumInducing=32$ inducing points. Both methods use Gaussian likelihoods and optimise their hyperparameters, noise variances (and inducing inputs in the \acrshort{svgp} case) using their well-known objectives – the marginal likelihood and \acrshort{elbo}.

Results table

#+Caption[\acrshort{mosvgpe} results on motorcycle data set]: Results on the Motorcycle data set citep:Silverman1985 with different instantiations of our model (\acrshort{mosvgpe}).

|                                               | \acrshort{rmse}   | \acrshort{nlpp}   | \acrshort{mae}    |
|-----------------------------------------------+-------------------+-------------------+-------------------|
| \acrshort{gp}                                 | $\mathbf{0.4357}$ | $0.9886$          | $0.3242$          |
| \acrshort{svgp} $(M=16)$                      | $0.4427$          | $0.9762$          | $0.3257$          |
| \acrshort{svgp} $(M=32)$                      | $0.4437$          | $0.9832$          | $0.3271$          |
| \acrshort{mosvgpe} $(k=2, \tightBound)$       | $0.4442$          | $0.4863$          | $0.3260$          |
| \acrshort{mosvgpe} $(k=2, \furtherBound)$     | $0.4590$          | $0.5073$          | $0.3355$          |
| \acrshort{mosvgpe} $(k=2, \furtherBoundTwo)$  | $0.4472$          | $0.5271$          | $\mathbf{0.3218}$ |
| \acrshort{mosvgpe} $(k=3, \tightBound)$       | $0.4569$          | $\mathbf{0.2634}$ | $0.3301$          |
| \acrshort{mosvgpe} $(k=3, \furtherBound)$     | $0.4866$          | $0.2695$          | $0.3449$          |
| \acrshort{mosvgpe} $(k=3, \furtherBoundTwo)$  | $0.4575$          | $0.5467$          | $0.3270$          |

After results table

The \acrshort{nlpp} measures the negative log probability of the test data under the predictive posterior, conditioned on the parameters which are not marginalised, e.g. hyperparameters and inducing inputs. Lower values indicate better performing models, i.e. predictive posteriors that more accurately match the distribution of the data. The predictive posterior is most accurate when \acrshort{mosvgpe} is instantiated with three experts $\ModeInd=3$ and trained using the tight lower bound $\tightBound$. In both the two and three expert experiments, the tight lower bound $\tightBound$ achieved better \acrshort{nlpp} than both of the further/further^2 lower bounds, $\furtherBound$ and $\furtherBoundTwo$. This is expected as it is a tighter bound. As both of the further/further^2 lower bounds offer improved computational properties, it is interesting to compare their performance. The \acrshort{nlpp} scores for the further lower bound $\furtherBound$ are almost equal to those of the tight lower bound $\tightBound$. In contrast, the \acrshort{nlpp} score in the three expert experiment for the further^2 lower bound $\furtherBoundTwo$ is significantly worse. This indicates that valuable information is lost in this bound. This was expected, as this bound corresponds to a further likelihood approximation, which mixes the experts’ noise models as opposed to their full \acrshort{svgp} models.

Regarding the accuracy of the predictive means, the standard \acrshort{gp} regression model achieved the best \acrshort{rmse}, followed by the \acrshort{svgp} models and then the \acrshort{mosvgpe} models. That said, all of the \acrshort{rmse} and \acrshort{mae} scores are very similar. Although adding more experts to the \acrshort{mosvgpe} model leads to more accurate predictive posteriors, the predictive means deteriorate slightly (indicated by higher \acrshort{rmse}/\acrshort{mae} values). This is most likely due to bias at the boundaries between the experts, resulting from the mixing behaviour of our \acrshort{gp}-based gating network: if the gating functions do not have short lengthscales, they cannot switch immediately from one expert to another. It is worth noting that this drop in performance is negligible.

Parameter settings two experts

two expert y fig

Two Experts

The two further lower bounds ($\furtherBound$ and $\furtherBoundTwo$), derived in cref:sec-inference, are compared by instantiating the model with two experts $\ModeInd=2$ and training each instantiation with the same model and training parameters. cref:tab-params-motorcycle-two contains the initial values for all of the trainable parameters in the model. The results are shown in cref:fig-y-mcycle-two-experts,fig-latent-mcycle-two-experts, where cref:fig-y-mcycle-two-experts visualises the predictive posteriors and cref:fig-latent-mcycle-two-experts visualises the posteriors over the latent variables. The left column shows results for $\furtherBound$ and the right column shows results for $\furtherBoundTwo$. This layout is used in cref:fig-y-mcycle-two-experts,fig-latent-mcycle-two-experts,fig-y-mcycle-three-experts,fig-latent-mcycle-three-experts.

cref:fig-y-means-mcycle-two-experts-tight,fig-y-means-mcycle-two-experts-further compare the posterior means (black solid line) to the \acrshort{svgp}’s posterior mean (red dashed line) and cref:fig-y-samples-mcycle-two-experts-tight,fig-y-samples-mcycle-two-experts-further compare the posterior densities to the \acrshort{svgp}. The red lines show the \acrshort{svgp}’s posterior mean plus or minus two standard deviations. As the \acrshort{mosvgpe} posterior is a Gaussian mixture, it is visualised by drawing samples from the posterior, i.e. sampling a mode indicator variable $\modeVar_*$ and then drawing a sample from the corresponding expert.

Predictive posteriors Both \acrshort{mosvgpe} results are capable of modelling the non-stationarity at $x ≈ -0.7$ better than the \acrshort{svgp}. At this non-stationary point, there are two modes in the \acrshort{mosvgpe} predictive distributions, indicated by the overlap in samples from each expert in cref:fig-y-samples-mcycle-two-experts-tight,fig-y-samples-mcycle-two-experts-further. The \acrshort{svgp} has explained the observations by increasing its single noise variance term. In contrast, both of the \acrshort{mosvgpe} results have been able to learn two noise variances and these reflect the noise in the observations much better. This is indicated by expert one learning a low noise variance and expert two a high noise variance (similar to the \acrshort{svgp}’s noise variance).

Latent variables More insight into this behaviour can be obtained by considering the latent variables. Figure ref:fig-latent-mcycle-two-experts shows the posteriors over the latent variables where cref:fig-expert-gps-mcycle-two-experts-tight,fig-expert-gps-mcycle-two-experts-further show the \acrshort{gp} posteriors over each expert’s latent function $q(\mode{\latentFunc}(\x_*))$. cref:fig-gating-gps-mcycle-two-experts-tight,fig-gating-gps-mcycle-two-experts-further show the \acrshort{gp} posteriors over the latent gating functions $q(\mode{\gatingFunc}(\x_*))$ and cref:fig-mixing-probs-mcycle-two-experts-tight,fig-mixing-probs-mcycle-two-experts-further show the mixing probabilities associated with the probability mass function over the expert indicator variable $\modeVar$.

The lengthscale of the gating network kernel governs how fast the model can shift responsibility from expert one (cyan) to expert two (magenta). For both lower bounds, the distribution over the expert indicator variable tends to a uniform distribution (maximum entropy) at $x \geq 1.5$. This can be seen by the cyan/magenta lines in cref:fig-mixing-probs-mcycle-two-experts-tight,fig-mixing-probs-mcycle-two-experts-further tending to $0.5$. Optimising with both bounds resulted in expert one (cyan) learning a long lengthscale to fit the horizontal line from $-2<x<-1$ and expert two (magenta) learning a shorter lengthscale function to fit the wiggly section from $-0.5<x<1.2$. The noise variance inferred by expert one is larger for $\furtherBoundTwo$ than for $\furtherBound$. The uncertainty in the experts’ latent functions is also higher for $\furtherBoundTwo$. This is shown by the $95\%$ confidence intervals (shaded cyan/magenta) being wider in cref:fig-expert-gps-mcycle-two-experts-further than in cref:fig-expert-gps-mcycle-two-experts-tight. This is because $\furtherBoundTwo$ is attempting to fit both experts to the entire data set and only mixes their noise models. In contrast, $\furtherBound$ fits each expert only in the regions where the gating network has assigned it responsibility.

two expert latent fig

Parameter settings three experts

three expert y fig

Three Experts

The model was then instantiated with three experts $\ModeInd=3$ and trained following the same procedure as the two experts’ experiments. cref:tab-params-motorcycle-three shows the initial values for all of the trainable parameters in the model. The results are shown in cref:fig-y-mcycle-three-experts, where the top row visualises the predictive mean and the bottom row the predictive density, for $\furtherBound$ (left column) and $\furtherBoundTwo$ (right column). cref:fig-latent-mcycle-three-experts then visualises the posteriors over the latent variables associated with each model/bound combination.

From cref:tab-mcycle-metrics, it is clear that the predictive posterior associated with $\furtherBound$ is the most accurate as it obtained the best \acrshort{nlpp} score. As expected, the two lower bounds explain the data completely differently. Instantiating the model with three experts $\ModeInd=3$ and training with $\furtherBound$, leads to the extra expert (magenta) fitting to the data at $x \geq 1.5$ and the gating network assigning responsibility to it in this region. In contrast, instantiating the model with three experts $\ModeInd=3$ and training with $\furtherBoundTwo$, results in the gating network never using the extra expert (cyan). This is indicated by the extra expert’s (cyan) probability remaining low over the region with training data in cref:fig-mixing-probs-mcycle-three-experts-further. Similar to the two expert case, the distribution over the expert indicator variable at $x \geq 1.5$ tends to a uniform distribution (maximum entropy), over the experts that are turned “on”.

In cref:fig-expert-gps-mcycle-three-experts-tight the third expert’s posterior returns to the prior at $x \geq 1.5$. This is indicated by the $95\%$ confidence intervals associated with the third expert’s \acrshort{gp} (shaded yellow) being wide in this region. This demonstrates that not only is the gating network turning the experts “on” and “off” in different regions but the model is also exhibiting data assignment behaviour. That is, each expert appears to only be fitting to the observations in the regions where the gating network has assigned it responsibility. In our case, this behaviour is achieved via the inducing variables capturing the joint distribution over the experts and the set of assignments, i.e. implicitly assigning data points to experts.

three expert latent fig

Summary

The tight lower bound $\tightBound$ and further lower bound $\furtherBound$ recovered similar results in all experiments. This indicates that $\furtherBound$ does not loosen the bound to a point where it loses valuable information. In contrast, $\furtherBoundTwo$ is not able to recover the same results. This was expected as $\furtherBoundTwo$ corresponds to a further likelihood approximation, where the experts’ noise models are mixed instead of their full \acrshort{svgp}s. $\furtherBound$ offers a rich \acrshort{elbo} for optimising \acrshort{mosvgpe} that achieves similar results to $\tightBound$, whilst having lower computational complexity per evaluation. For this reason, the remainder of this thesis uses $\furtherBound$ for all experiments.

Batch Size vs Number of Inducing Points

Batch size One of the main benefits of the variational inference scheme presented in this chapter is that the bound can be calculated with minibatches of a data set $\dataset$ and used as the objective for stochastic gradient descent. Decreasing the batch size increases the stochasticity in the bound and leads to convergence in fewer evaluations of the \acrshort{elbo}. This is evident in cref:fig-mcycle-training-curve, which compares the negative \acrshort{elbo} for different numbers of inducing points and batch sizes. Further to this, computing the \acrshort{elbo} is less computationally demanding for smaller batch sizes. However, if the batch size is made too small, the optimisation can become unstable and prevent the optimiser from finding a good solution. The (blue) learning curve for batch size $N_b=32$ in cref:fig-mcycle-training-curve-133 shows an example of this behaviour. In this case, the learning rate had to be made smaller, leading to slower convergence. This is shown by the orange learning curve, which has not been able to reach the same negative \acrshort{elbo} as the other batch sizes, most likely due to the lower learning rate. This interplay between the batch size and learning rate is well-known in machine learning.

Number of inducing points Our variational inference scheme models the joint distribution over the data and assignments via the inducing variables ($\expertsInducingOutput$ and $\gatingsInducingOutput$). In practice, the number of inducing points should be much smaller than the number of data points $(\NumInducing \ll \NumData)$ to obtain improved computational performance. The learning curves in cref:fig-mcycle-training-curve visualise the learning process for different numbers of inducing points $\NumInducing$. Best performance is obtained when the model is instantiated with $\NumInducing=133$ inducing inputs, i.e. a one-to-one correspondence between inducing inputs and data inputs, $\expertInducingInput=\gatingInducingInput=\allInput$. As the number of inducing points decreases, the model is still able to recover the same negative \acrshort{elbo}.

\todo{add results for num inducing points leading to worse performance e.g. M=8}


training loss fig

End

\newpage

Evaluation on Velocity Controlled Quadcopter label:sec-brl-experiment

intro

As this work is motivated by learning representations of real-world dynamical systems, it was tested on a real-world quadcopter data set following the illustrative example detailed in cref:illustrative_example. The data set was collected at the Bristol Robotics Laboratory using a velocity controlled DJI Tello quadcopter and a Vicon tracking system. A high turbulence dynamics mode was induced by placing a desktop fan at the right side of a room. cref:fig-quadcopter-environment shows a diagram of the environment. The data set represents samples from a dynamical system with constant controls, i.e. $Δ\state_{\timeInd+1} = \latentFunc(\state_{\timeInd};\control_{\timeInd}=\control_*)$.

Environment The environment is modelled with two dimensions (the $x$ and $y$ coordinates), which is a realistic assumption, as altitude control can be achieved with a separate controller. The state space is then the 2D coordinates $\state = [x, y]$ and the control is simply the velocity $\control = [v_x, v_y]$.

Data collection The Vicon system provided access to the true position of the quadcopter at all times, which enabled pre-planned trajectories to be flown, using a simple \acrshort{pid} controller on feedback from the Vicon system. To simplify data collection, nine trajectories from $y=2$ to $y=-3$, with different initial $x$ locations, were used as target trajectories to be tracked by the \acrshort{pid} controller. Each trajectory was repeated 7 times to capture the variability (process noise) in the dynamics.

Data processing The Vicon stream recorded data at 100Hz, which was then downsampled to give a time step of $Δ t = 0.1s$. This reduced the size of the data set and resulted in state transitions over length scales that are reasonable for \acrshort{gp} modelling. The data set consists of $\NumData=1816$ state transitions. cref:fig-quiver visualises the state transition data set as a quiver plot.
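The construction of the state-transition data set from the downsampled stream can be sketched as follows. The stride-based downsampling and the first-difference targets are assumptions about the preprocessing, not a description of the exact pipeline used here.

#+begin_src python
import numpy as np

def build_transition_dataset(positions, raw_hz=100, dt=0.1):
    """Form {(x_t, Delta x_{t+1})} pairs from a position stream recorded at
    raw_hz, downsampled to a time step of dt seconds.
    positions: (T, 2) array of [x, y] samples."""
    stride = int(round(raw_hz * dt))
    states = positions[::stride]        # downsampled states x_t
    inputs = states[:-1]                # x_t
    targets = np.diff(states, axis=0)   # Delta x_{t+1} = x_{t+1} - x_t
    return inputs, targets
#+end_src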

parameter settings

Results

The model was instantiated with two experts, with the goal of each expert learning a separate dynamics mode and the gating network learning a representation of how the underlying dynamics modes vary over the state space. The model was trained using the model and training parameters in cref:tab-params-quadcopter.

#+caption[\acrshort{mosvgpe}’s moment matched predictive posterior with $\ModeInd=2$ after training on the real-world quadcopter data set with $\furtherBound$]: Moment matched predictive posterior $p(Δ \state_* \mid \x_*)$ after training on the quadcopter data set. Each row corresponds to an output dimension $d$ where the left plot shows the moment matched mean $\E[Δ \state_d]$ and the right plot shows the moment matched variance $\V[Δ \state_d]$. ./images/model/quadcopter/subset-10/y_moment_matched.pdf

#+caption[\acrshort{mosvgpe}’s experts’ posteriors with $\ModeInd=2$ after training on the real-world quadcopter data set with $\furtherBound$]: Visualisation of the experts’ predictive posteriors $\predictiveExpertsPrior$ after training on the quadcopter data set. Each row corresponds to a single \acrshort{gp} posterior, $q(\latentFunckd(\singleTestInput))$, corresponding to dimension $d$ of expert $\modeInd$. The mean $\E[\latentFunckd(\singleTestInput)]$ is on the left and the variance $\V[\latentFunckd(\x_*)]$ is on the right. The noise variances learned by Expert 1 and Expert 2 were $Σ_1 = \diag\left([0.0063, 0.0259]\right)$ and $Σ_2 = \diag\left([0.0874, 0.0432]\right)$ respectively. ./images/model/quadcopter/subset-10/experts_f.pdf

At a new input location $\singleTestInput$ the density over the output, $p(\singleTestOutput \mid \singleTestInput)$, follows a mixture of $\ModeInd$ Gaussians. Visualising a mixture of two Gaussians with a two-dimensional input space and a two-dimensional output space requires the components and mixing probabilities to be visualised separately. To aid with visualisation, Figure ref:fig-y-mm-quadcopter-subset shows the predictive density approximated as a unimodal Gaussian density (via moment matching), where each row corresponds to an output dimension. The predictive mean is fairly constant over the domain, except for the region in front of the fan, where it is higher. This result makes sense as the data set was assumed to be collected with constant controls. The region with high predictive mean in front of the fan is modelling the drift arising from the fan blowing the quadcopter in the negative $x$ direction. The right-hand plots of Figure ref:fig-y-mm-quadcopter-subset show the predictive variance. It is high in the bottom left where there are no training observations, indicating that the method has successfully represented the model’s epistemic uncertainty. It is also high in the region in front of the fan, showing that the model has successfully inferred the high process noise, associated with the turbulence induced by the fan. The individual experts and the gating network are now visualised separately.
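The moment-matched approximation shown in the figure follows the standard formulae for matching the first two moments of a mixture of Gaussians, sketched below for diagonal (per output dimension) components; the function name is illustrative.

#+begin_src python
import numpy as np

def moment_match(mix_probs, means, variances):
    """Moment-match a mixture of K Gaussians to a single Gaussian, per output
    dimension.  mix_probs: (K,), means: (K, D), variances: (K, D).
    Returns the matched mean (D,) and variance (D,)."""
    mean = np.einsum("k,kd->d", mix_probs, means)
    second_moment = np.einsum("k,kd->d", mix_probs, variances + means ** 2)
    return mean, second_moment - mean ** 2
#+end_src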

Gating network Figure ref:fig-gating-network-quadcopter-subset shows the gating network after training on the data set. Figure ref:fig-gating-mixing-probs-quadcopter-subset (right) indicates that the model has assigned responsibility to Expert 2 in front of the fan, as its mixing probability $Pr(\singleTestModeVar=1 \mid \singleTestInput)$ is high (red) in this region. This implies that Expert 2 represents the turbulent dynamics mode in front of the fan. Figure ref:fig-gating-gps-quadcopter-subset shows the \acrshort{gp} posteriors associated with the gating functions. The mean of the gating function associated with Expert 1 $\E[h_1(\singleTestInput)]$ is high (red) in the low-turbulence regions and low (white) in the high-turbulence region in front of the fan. The posterior variance associated with the gating function \acrshort{gps} is high in the region with no training observations. This is a desirable behaviour because it is modelling the epistemic uncertainty. These results demonstrate that the gating network infers important information regarding how the system switches between dynamics modes over the input space.

Identifiability These results show that the \acrshort{gp}-based gating network is capable of turning a single expert on in multiple regions of the input space. This is a desirable behaviour as it has enabled only two underlying dynamics modes to be identified. In contrast, other \acrshort{mogpe} methods may have assigned an extra expert to one of the regions modelled by Expert 1. In particular, the regions at $y>0$ and $y<-1$ may have been assigned to separate experts.

Experts Figure ref:fig-experts-f-quadcopter-subset shows the predictive posteriors $q(\latentFunckd(\singleTestInput))$ associated with each dimension $d$ of each expert $\modeInd$. The method has successfully learned a factorised representation of the underlying dynamics, where Expert 1 has learned a dynamics mode with low process noise $Σ_1 = \diag\left([0.0063, 0.0259]\right)$ and Expert 2 a mode with high process noise $Σ_2 = \diag\left([0.0874, 0.0432]\right)$. Expert 2 has also clearly learned the drift induced by the fan, indicated by the dark red region at $y=0$ in the two bottom left plots of Figure ref:fig-experts-f-quadcopter-subset. It has also learned the control response of the \acrshort{pid} controller correcting for the deviation from the reference trajectory, indicated by the white region below $y=0$. The control response is an artefact of the data collection process. Expert 2 has therefore learned both the drift and process noise terms associated with the turbulent dynamics mode.

Both experts were initialised with independent inducing inputs, $\expertInducingInput$, providing the model flexibility to “soft” partition the data set. That is, each expert has the freedom to set its inducing inputs, $\expertInducingInput$, to support only a subset of the data set. The posterior (co)variance associated with each expert represents their epistemic uncertainty. The top right plot in Figure ref:fig-experts-f-quadcopter-subset shows the posterior variance associated with the $x$ output dimension of Expert 1. The posterior variance increases in front of the fan because the gating network has assigned responsibility to the other expert in this region. However, the posterior variance associated with the $y$ output dimension of Expert 1, is not high in this region. This is due to the lengthscale of the $y$ output dimension allowing Expert 1 to confidently extrapolate.

The bottom right two plots in Figure ref:fig-experts-f-quadcopter-subset show the posterior variance associated with the $x$ and $y$ output dimensions of Expert 2. The posterior variance is high everywhere except for the region in front of the fan. Again, this is due to the gating network assigning responsibility to the other expert outside of the region in front of the fan. These results demonstrate that the likelihood approximation in cref:eq-likelihood-approximation, combined with our gating network and variational inference scheme, is capable of modelling the assignment of observations to experts via the inducing points.

Evaluation on Simulated Quadcopter Data Set

The quadcopter frames and Euler angles (roll $\roll$, pitch $\pitch$ and yaw $\yaw$) are shown in Figure ref:fig-quadcopter-frames. The subscripts $\world$, $\body$ and $\intermediate$ are used to denote the world $\worldFrame$, body $\bodyFrame$ and intermediate $\intermediateFrame$ (after yaw angle rotation) frames respectively. The 3D nonlinear quadcopter dynamics are described with the quadcopter model from cite:wangSafe2018,zhouVector2014.

The 2D nonlinear quadcopter dynamics are based on the state vector $\state = [x, y, \velocityx, \velocityy, \yaw]$, where $\positions = [x, y]$ are the Euclidean coordinates relative to the world frame $\worldFrame$, $\Velocity = [\velocityx, \velocityy]$ is the velocity in the world frame $\worldFrame$ and $\yaw$ is the yaw angle, i.e. the angle around the $z$ axis.

where,

where $\Thrust = [\thrust, 0]^T$ is the total thrust force on the quadcopter from the rotors and $\torque$ is the torque on the quadcopter around the $z$ axis of the world frame $\worldFrame$. The thrust and torque are realistic controls for a 2D quadcopter system and give the control vector $\control = [\thrust, \torque]$.

where $g=9.81 \, \text{m} \, \text{s}^{-2}$ is the acceleration due to gravity, $m$ is the mass, $\zw=[0,0,1]^T$, and $\positions=[x, y, z]^T$ and $\Velocity=[\velocity_x, \velocity_y, \velocity_z]^T$ are the position and velocity of the quadcopter in the world frame $\worldFrame$, respectively. The angular velocity of the quadcopter in the body frame $\bodyFrame$ is denoted $\angularVelocity=[p, q, r]^T$, with skew-symmetric matrix form $\angularVelocityMatrix=\begin{bmatrix} 0 & -r & q \\ r & 0 & -p \\ -q & p & 0 \end{bmatrix}$.

$\state=[x, y, z, \velocity_x, \velocity_y, \velocity_z, \roll, \pitch, \yaw, \dot{\roll}, \dot{\pitch}, \dot{\yaw}]$

$\state=[x, y, z, \velocity_x, \velocity_y, \velocity_z, \roll, \pitch, \yaw]^T$ $\state=[\ddot{x}, \ddot{y}, \ddot{z}, \dot{\roll}, \dot{\pitch}, \dot{\yaw}]^T$

\newpage

Point Mass 2D Dynamics

cite:williamsAdvancing cite:watsonStochastic2020 cite:watsonAdvancing2021

cite:levineVariational2013

cite:bhardwajDifferentiable2020

cite:mukadamContinuoustime2018

The dynamics of a point mass in 2D can be represented with the state vector $\state = [x, y, \velocityx, \velocityy, \yaw]$, where $\positions = [x, y]$ denotes the 2D Euclidean coordinates relative to the world frame $\worldFrame$, $\Velocity = [\velocityx, \velocityy]$ is the velocity in the world frame $\worldFrame$ and $\yaw$ is the yaw angle, i.e. the rotation about the $z$ axis.

The point mass can apply a force along its $x$ axis (known as the thrust vector), which is denoted $\Thrust = [\thrust, 0]^T$. It can also rotate itself by applying a torque $\torque$ around its $z$ axis. The resulting control vector for the 2D point mass is given by $\control = [\thrust, \torque]$. The nonlinear dynamics of the system are given by,

where $m$ is the mass and $\inertiaZ$ is the moment of inertia around the vertical $z$ axis. Defining the rotation matrix from the body frame $\bodyFrame$ to the world frame $\worldFrame$ as,

the dynamics can be calculated as,

The unknown dynamics can be modelled by placing \acrshort{gp} priors on the accelerations ($\ddot{x}, \ddot{y}$) and the angular (yaw) velocity $\dot{\yaw}$,
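A minimal sketch of one plausible discretisation of these 2D point-mass dynamics is given below. The first-order yaw model, unit parameters and Euler integration are illustrative assumptions, not the exact model used in this thesis.

#+begin_src python
import numpy as np

def point_mass_step(state, control, mass=1.0, inertia_z=1.0, dt=0.1):
    """One Euler step of a simple 2D point-mass model with state
    [x, y, v_x, v_y, yaw] and control [thrust, torque].  The thrust acts
    along the body x-axis and is rotated into the world frame; the yaw rate
    is taken proportional to the applied torque (a simplifying assumption)."""
    x, y, vx, vy, yaw = state
    thrust, torque = control
    ax = np.cos(yaw) * thrust / mass     # world-frame acceleration, x
    ay = np.sin(yaw) * thrust / mass     # world-frame acceleration, y
    yaw_rate = torque / inertia_z
    return np.array([x + vx * dt,
                     y + vy * dt,
                     vx + ax * dt,
                     vy + ay * dt,
                     yaw + yaw_rate * dt])
#+end_src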

\newpage

Discussion and Future Work

Implicit data assignment It is worth noting that in contrast to other \acrshort{mogpe} methods, this model does not directly assign observations to experts. However, after augmenting each expert with separate inducing points, the model has the flexibility to loosely partition the data set. Just as sparse \acrshort{gp} methods can be viewed as methods that parameterise the full nonparametric \acrshort{gp}, our approach can be viewed as parameterising the nonparametric \acrshort{mogpe}. Conveniently, our parameterisation, in particular the likelihood approximation in cref:eq-likelihood-approximation, deals with the issue of marginalising exponentially many sets of assignments of observations to experts. As evident from the results in this chapter, this likelihood approximation appears to retain important information regarding the assignment of observations to experts, whilst efficiently marginalising the expert indicator variable. It is also worth noting that the number of inducing points $\NumInducing$ associated with each expert, could be set by considering the number of data points believed to belong to a particular expert. Currently, each expert’s inducing inputs are initialised by randomly sampling a subset of the data inputs. Future work could explore different techniques for initialising each expert’s inducing inputs.

Bayesian treatment of inducing inputs Common practice in sparse \acrshort{gp} methods is to jointly optimise the hyperparameters and the inducing inputs. Optimising only some of the parameters, instead of marginalising all of them, is known as Type-II maximum likelihood. In Bayesian model selection, it is well-known that Type-II maximum likelihood can lead to overfitting if the number of parameters being optimised is large. In the case of inducing inputs, there can often be hundreds or thousands that need to be optimised. Further to this, cite:rossiSparse2021 show that optimising the inducing inputs relies on being able to optimise both the prior and the posterior, thereby contradicting Bayesian inference. Our variational inference scheme follows common practice and optimises the inducing inputs jointly with the hyperparameters. In some instances, we observe that optimising the inducing inputs leads to them taking values far away from the training data. Often this can be avoided by simply sampling the inducing inputs from the training inputs and fixing them, i.e. not optimising them. This often leads to better \acrshort{nlpp} scores as well. This observation highlights that a Bayesian treatment of the inducing inputs is an interesting direction for future work. However, specifying priors and performing efficient posterior inference over the inducing inputs is a challenging problem.

Latent spaces for control The gating network consists of two spaces which are rich with information regarding how the system switches between its underlying dynamics modes, namely, the pmf over the expert indicator variable and the \acrshort{gp} posteriors over the gating functions. It is worth noting that all \acrshort{mogpe} methods have a pmf over the expert indicator variable. However, this space suffers from interpretability issues. This is because, in conventional \acrshort{mogpe} methods, the epistemic uncertainty associated with the gating network is not decoupled from the pmf over the expert indicator variable. Consider the meaning of the mixing probabilities tending to a uniform distribution ${Pr(\singleTestModeVarK \mid \singleTestInput)=0.5}$. This corresponds to maximum entropy for a categorical distribution and could mean one of two things:

  1. It has **high** epistemic uncertainty, so cannot confidently predict which expert is responsible,
  2. It has **low** epistemic uncertainty and confidently mixes the experts’ predictions,
    • This happens at the boundaries between experts.

This interpretability issue is overcome by our \acrshort{gp}-based gating network, as these two cases are modelled differently. Either the gating function(s) are all equal and their posterior variance(s) are low, implying that the gating network has **low** epistemic uncertainty and is likely at a boundary between experts. Alternatively, the gating functions’ posterior variance(s) could be high, implying it has **high** epistemic uncertainty.

Importantly, the \acrshort{gp} posteriors associated with our gating network, not only infer information regarding the mode switching but also model the gating network’s epistemic uncertainty. Further to this, formulating the gating network \acrshort{gps} with differentiable mean and covariance functions, enables techniques from Riemannian geometry to be deployed on the gating functions citep:carmoRiemannian1992. The power of the \acrshort{gp}-based gating network will become apparent when its latent geometry is leveraged for control in cref:chap-traj-opt-geometry and when its \acrshort{gps} are used to develop an information-based exploration strategy in cref:chap-active-learning.

Conclusion

This chapter has presented a method for learning representations of multimodal dynamical systems using a \acrshort{mogpe} method. Motivated by correctly identifying the underlying dynamics modes and inferring latent structure that can be exploited for control, this work formulated a gating network based on input-dependent gating functions. This alleviates the inherent identifiability issues associated with mixture models, as it can be used to constrain the set of admissible functions through the placement of informative \acrshort{gp} priors on the gating functions. Further to this, the \acrshort{gp} posteriors over the gating functions provide convenient latent spaces for control. This is because they are rich with information regarding the separation of the underlying dynamics modes and also model the epistemic uncertainty associated with the gating network. In later chapters this uncertainty will be used to construct risk-averse control strategies and to guide exploration for \acrshort{mbrl}.

The variational inference scheme presented in this chapter addresses the issue of marginalising every possible set of assignments of observations to experts – of which there are $\ModeInd^{\NumData}$ – in the \acrshort{mogpe} marginal likelihood. It overcomes the issue of assigning observations to experts by augmenting each expert \acrshort{gp} with a set of inducing points. These inducing points are assumed to be a sufficient statistic for the joint distribution over every possible set of assignments to experts. This induces a factorisation over the data, which is used to derive three \acrshort{elbo}s that couple the optimisation of the experts and the gating network by efficiently marginalising the expert indicator variable for single data points. The \acrshort{elbo}s are compared on the Motorcycle data set citep:Silverman1985. The $\furtherBound$ bound provides the best performance as it balances the accuracy offered by the tight bound $\tightBound$ with the computational improvements offered by further bounding the \acrshort{gps}. The results demonstrate that the variational inference scheme handles uncertainty in a principled manner whilst providing scalability via stochastic variational inference. The method is further evaluated on a real-world quadcopter example, demonstrating that it can successfully learn a factorised representation of a real-world, multimodal, robotic system.

Mode Remaining Trajectory Optimisation label:chap-traj-opt-control

Maths

Intro

This chapter is concerned with controlling unknown or partially unknown, multimodal dynamical systems, given a single-step predictive dynamics model learned using the \acrshort{mosvgpe} method from cref:chap-dynamics. In particular, it is concerned with mode remaining trajectory optimisation, which is formally defined in cref:def-mode-remaining-main. Informally, mode remaining trajectory optimisation attempts to find trajectories from an initial state $\state_0$ – in the desired dynamics mode – to a target state $\state_f$, whilst remaining in the desired dynamics mode.

The \acrshort{mosvgpe} method from cref:chap-dynamics was intentionally formulated with latent variables – to represent the mode switching behaviour and its associated uncertainty – so that they could be leveraged to encode mode remaining behaviour into control strategies. This chapter exploits these latent variables by making decisions under their uncertainty.

The remainder of this chapter is organised as follows. cref:sec-problem-statement formally states the problem. cref:chap-traj-opt-geometry details two methods that leverage the geometry of the \acrshort{mosvgpe} gating network. The first method in cref:sec-traj-opt-collocation resembles an indirect optimal control method as it solves the necessary conditions which indirectly represent the original optimal control problem. In contrast, the second method in cref:sec-traj-opt-energy takes a more standard approach and directly solves the optimal control problem. cref:chap-traj-opt-inference then introduces an alternative approach to mode remaining trajectory optimisation, which does not leverage the geometry of the gating network. Instead, it extends the \acrfull{cai_unimodal} framework citep:toussaintRobot2009 and encodes mode remaining behaviour via conditioning on the mode indicator variable. cref:chap-traj-opt-results evaluates and compares all three methods using the illustrative example from cref:illustrative_example. An initial version of the \acrfull{ig} method presented in cref:sec-traj-opt-collocation is published in cite:scannellTrajectory2021.

Problem Statement label:sec-problem-statement

The goal of this chapter is to solve the mode remaining navigation problem in cref:problem-statement-main. Due to the novelty of this problem, the work in this chapter considers trajectory optimisation algorithms rather than state feedback (closed-loop) controllers. This mode remaining trajectory optimisation problem is given by,

where the dynamics are from cref:eq-dynamics-main and the resulting open-loop controller is given by $π(\timeInd) = \control_{\timeInd} \quad ∀ \timeInd ∈ \{0, \ldots, \TimeInd-1\}$. Given the desired dynamics mode $\desiredMode$, cref:eq-mode-soc-problem seeks to find a control trajectory $\controlTraj = \control_{0:\TimeInd-1}$ to navigate from an initial state $\state_0 ∈ \desiredStateDomain$, to a target state $\targetState ∈ \desiredStateDomain$, over a horizon $\TimeInd$, whilst minimising a cost function $\costFunc: \stateDomain × \controlDomain → \R$ and keeping the system in the desired dynamics mode $\desiredMode$. To simplify notation, the state and control trajectories are denoted $\stateTraj = \state_{1:\TimeInd}$ and $\controlTraj = \control_{0:\TimeInd-1}$ respectively.

Given that neither the underlying dynamics modes nor how the system switches between them are known a priori, it is not possible to solve cref:eq-mode-soc-problem with the mode remaining guarantee in cref:def-mode-remaining-main. However, well-calibrated uncertainty estimates associated with a learned dynamics model make it possible to find mode remaining trajectories with high probability. Therefore, this work relaxes the requirement to that of finding mode remaining trajectories with high probability. Let us formally define a $δ-\text{mode remaining}$ controller $π$.

Trajectories satisfying this $δ-\text{mode remaining}$ definition are guaranteed to remain in the desired dynamics mode with probability at least $1-δ$. Therefore, smaller $δ$ values correspond to a higher confidence of remaining in the desired dynamics mode.

This chapter assumes prior access to the environment, such that a data set of state transitions has previously been collected and used to learn a single-step dynamics model.

Given this learned dynamics model, it is assumed that a desired dynamics mode $\desiredMode$ is either known or can easily be identified. This is a realistic assumption as the parameters associated with each dynamics \acrshort{gp} can be used to identify different behaviours. For example, the noise variance associated with each mode’s dynamics \acrshort{gp} models its process noise. Therefore, it is easy to identify undesirable dynamics modes with high process noise.

Given a learned dynamics model and a desired dynamics mode $\desiredMode$, the goals of the trajectory optimisation in this chapter can be summarised as follows,

Goal 1
Navigate to the target state $\targetState$,
Goal 2
Remain in the operable, desired dynamics mode $\desiredMode$,
Goal 3
Avoid regions of the learned dynamics with high epistemic uncertainty,
Goal 3.1
in the desired dynamics mode $\latentFunc\desiredMode$, i.e. where the underlying dynamics are not known,
Goal 3.2
in the gating network $\modeVar$, i.e. where it is not known which mode governs the dynamics.

Goal 3 arises due to learning the dynamics model from observations. The learned model may not be able to confidently predict which mode governs the dynamics in a given region. This is due to a lack of training observations and is known as epistemic uncertainty. It is desirable to avoid entering these regions as it may result in the system leaving the desired dynamics mode.

Mode Remaining Control via Latent Geometry label:chap-traj-opt-geometry

intro

This section introduces two different approaches to performing mode remaining trajectory optimisation. They both exploit concepts from Riemannian geometry – extended to probabilistic manifolds – to encode mode remaining behaviour. The first approach in cref:sec-traj-opt-collocation resembles an indirect optimal control method citep:kirkOptimal2004 as it projects the trajectory optimisation problem onto an \acrfull{ode} that implicitly encodes the mode remaining behaviour. As such, we name this approach \acrfull{ig}. The second approach in cref:sec-traj-opt-energy is a direct optimal control method that resembles standard Gaussian process control methods, with the mode remaining behaviour encoded via a geometric objective function. We name this approach \acrfull{dre}.

Concepts from Riemannian Geometry label:sec-geometry-recap

intro

The \acrshort{mosvgpe} model correctly identifies the underlying dynamics modes and infers informative latent spaces that can be used to encode mode remaining behaviour. cref:fig-traj-opt-gating-network-gp shows the gating network posterior after training \acrshort{mosvgpe} on the historical data set of state transitions from the illustrative quadcopter example in cref:illustrative_example. The work in this chapter is based on the observation that Goals 1 and 2 can be encoded as finding length minimising trajectories on the manifold parameterised by the desired mode's gating function, shown in the left-hand plot of cref:eq-traj-opt-gating-network-gp-post. Intuitively, the length of a trajectory from $\mathbf{x}_0$ to $\mathbf{x}_f$ on the manifold given by the desired mode's gating function increases when it passes over the contours, analogous to climbing a hill. Given appropriate scaling of the gating function, the shortest trajectories between two locations are those that attempt to follow the contours and, as a result, remain in a single mode by not climbing up or down any hills. This section reviews the relevant concepts from Riemannian geometry and shows how they can be used to encode Goals 1 and 2. cref:sec-prob-geo then extends these concepts to probabilistic geometries to encode Goal 3.

Lengths in Euclidean Spaces

\newline Lengths in Euclidean spaces The $l^2$ norm (Euclidean norm) provides an intuitive notion for the length of a vector $\manifoldInput ∈ \manifoldDomain \subseteq \R\manifoldDomainDim$ in a Euclidean space. A continuous-time trajectory is denoted $\trajectory: [t_0, t_f] → \manifoldDomain$. Note that this has overloaded the discrete-time trajectory notation. Under the $l^2$ norm, the length of a trajectory $\trajectory$ is given by,

where Newton’s notation has been used to denote differentiation with respect to time $t$. As a norm can be expressed for any space endowed with an inner product, it is possible to calculate lengths of trajectories on manifolds endowed with an inner product.

Riemannian manifolds

\newline

Lengths on Riemannian Manifolds

\newline

Lengths on Riemannian manifolds The length of a trajectory $\trajectory$ on a manifold $\manifold$, can be calculated by mapping it through the function $\manifoldFunction$ and using cref:eq-euclidean-length,

Applying the chain-rule allows cref:eq-manifold-length to be expressed in terms of the Jacobian and the velocity,

This implies that the length of a trajectory on the manifold $\manifold$, can be calculated in the input space $\manifoldDomain$, using a locally defined norm,

where $\metricTensor(\manifoldInput(t)) = \jacobian^T \jacobian$ is a symmetric positive definite matrix (akin to a local Mahalanobis distance measure), known as the natural Riemannian metric. The length of a trajectory on a manifold $\manifold$, endowed with the metric $\metricTensor$, can then be calculated with,

cref:fig-traj-opt-gating-network-gp shows the \acrshort{gp} posterior over the desired mode’s gating function, $\desiredGatingFunction : \stateDomain → \gatingCodomain$. Consider finding length minimising trajectories on the manifold $\desiredManifold=\desiredGatingFunction(\gatingDomain)$ associated with the desired mode’s gating function, where the metric is given by,

These trajectories will attempt to remain in the desired dynamics mode $\desiredMode$, encoding Goals 1 and 2. However, length minimising trajectories subject to this metric do not encode Goal 3. That is, they will not avoid regions of the learned dynamics, which cannot be predicted confidently due to high epistemic uncertainty. Goal 3 can be encoded by observing that the metric tensor is actually a random variable and extending the concepts of length minimising trajectories to probabilistic manifolds.
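To make this objective concrete, a discretised trajectory length under a given metric can be evaluated as below; the uniform discretisation and the =metric_fn= interface are illustrative assumptions. Setting the metric to the identity recovers the Euclidean length of cref:eq-euclidean-length.

#+begin_src python
import numpy as np

def riemannian_length(trajectory, metric_fn):
    """Discretised length of a trajectory under a Riemannian metric:
    sum_t sqrt(dx_t^T G(x_t) dx_t).
    trajectory: (T, D) array of points; metric_fn maps a point (D,) to a
    positive-definite matrix G (D, D)."""
    length = 0.0
    for t in range(len(trajectory) - 1):
        dx = trajectory[t + 1] - trajectory[t]
        G = metric_fn(trajectory[t])
        length += np.sqrt(dx @ G @ dx)
    return length
#+end_src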

Probabilistic Geometries label:sec-prob-geo

Extension to Probabilistic Geometries

\newline

Following cite:tosiMetrics2014 we formulate a metric tensor that captures the variance in the manifold via a probability distribution. First note that as the differential operator is linear, the derivative of a \acrshort{gp} is also a \acrshort{gp}, assuming that the mean and covariance functions are differentiable.

Therefore, the metric tensor $\desiredMetricTensor$ in cref:eq-desired-metric-tensor is the outer product of a normally distributed random variable (the Jacobian) with itself. As such, the metric tensor $\desiredMetricTensor$ is also a random variable, following a non-central Wishart distribution citep:andersonNonCentral1946,

where $P$ is the number of degrees of freedom (always one in our case) and $\E\left[\desiredJacobian\right]$ and $\covJac$ are the mean and covariance matrices associated with the \acrshort{gp} over the Jacobian. The expected value of the metric tensor in cref:eq-metric-dist is given by,

Importantly, this expected metric tensor includes a covariance term $\covJac$, which implies that lengths on the manifold calculated under this expected metric will increase in areas of high covariance. This is a desirable behaviour because it encourages length minimising trajectories to avoid regions of the learned dynamics with high epistemic uncertainty, encoding Goal 3. To aid with user control, the metric tensor in cref:eq-expected-metric is modified with a weighting parameter $λ$ that enables the relevance of the covariance term to be adjusted,

Setting $λ$ to be small should find trajectories that prioritise staying in the desired mode, whereas selecting a large $λ$ should find trajectories that prioritise avoiding regions of the dynamics with high epistemic uncertainty.
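
To make this concrete, the following numpy sketch assembles the $λ$-weighted expected metric from the Jacobian moments of a single gating function. The function name, argument layout and the exact placement of $λ$ (scaling only the covariance term) are illustrative assumptions; the thesis equation cref:eq-expected-metric-weighting is the authoritative form.

#+begin_src python
import numpy as np

def expected_metric_tensor(jac_mean, jac_cov, lam=1.0):
    """Expected Riemannian metric for a single (scalar) gating function.

    jac_mean : (D,) posterior mean of the gating function's Jacobian at a state.
    jac_cov  : (D, D) posterior covariance of that Jacobian.
    lam      : weighting of the epistemic-uncertainty (covariance) term.
    """
    jac_mean = np.asarray(jac_mean).reshape(1, -1)   # treat as a 1 x D Jacobian
    return jac_mean.T @ jac_mean + lam * np.asarray(jac_cov)

# toy usage with made-up Jacobian moments
G = expected_metric_tensor([0.3, -1.2], 0.1 * np.eye(2), lam=0.5)
#+end_src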

Extension to Sparse Variational Gaussian Processes

The model in Chapter ref:chap-dynamics is built upon sparse \acrshort{gp} approximations, so the Jacobian in cref:eq-predictive-jacobian-dist must be extended for such approximations.

\acrfull{ig} label:sec-traj-opt-collocation

intro

This section presents a trajectory optimisation algorithm that exploits the fact that length minimising trajectories on the manifold endowed with the expected metric from cref:eq-expected-metric encode all of the goals. As length minimising curves on a manifold are known as geodesics, we refer to such trajectories as geodesic trajectories. The algorithm presented in this section exploits a classic result of Riemannian geometry, namely that geodesic trajectories are solutions to a 2nd order \acrshort{ode}, known as the geodesic \acrshort{ode} $f_G$. As solutions to this \acrshort{ode} encode the necessary conditions for finding length minimising trajectories, the method presented in this section resembles an indirect optimal control method.

Geodesics

Geodesic \acrshort{ode} An important observation from cite:carmoRiemannian1992 is that geodesics satisfy a continuous-time $2\text{nd}$ order \acrshort{ode}, given by,

where $\operatorname{vec}[\metricTensor(\manifoldInput(t))]$ stacks the columns of $\metricTensor(\manifoldInput(t))$ and $⊗$ denotes the Kronecker product. The implication of cref:eq-geodesic,eq-2ode is that trajectories that are solutions to the $2\text{nd}$ order \acrshort{ode} in cref:eq-2ode implicitly minimise their length on the manifold, i.e. the objective in cref:eq-manifold-length-G. Given this observation, computing geodesics involves finding a solution to cref:eq-2ode with $\manifoldInput(t_0) = \manifoldInput_0$ and $\manifoldInput(t_f) = \manifoldInput_f$. This is a boundary value problem (BVP) with a smooth solution, so it can be solved using any BVP solver, e.g. (multiple) shooting or collocation methods.
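
As an illustration of this observation, the sketch below solves the geodesic \acrshort{ode} as a boundary value problem with SciPy's collocation-based BVP solver. The metric (an identity plus a bump), the endpoints and the finite-difference derivative of $\operatorname{vec}[\metricTensor]$ are all hypothetical stand-ins; the thesis's actual metric comes from the gating network posterior.

#+begin_src python
import numpy as np
from scipy.integrate import solve_bvp

D = 2

def metric(x):
    # hypothetical metric: identity plus a bump that geodesics should avoid
    return np.eye(D) * (1.0 + 5.0 * np.exp(-np.sum((x - 1.0) ** 2)))

def dvecG_dx(x, eps=1e-5):
    # finite-difference derivative of vec(G(x)) w.r.t. x, shape (D*D, D)
    cols = []
    for d in range(D):
        dx = np.zeros(D)
        dx[d] = eps
        cols.append((metric(x + dx) - metric(x - dx)).reshape(-1, order="F") / (2 * eps))
    return np.stack(cols, axis=1)

def geodesic_rhs(t, z):
    # z stacks [x, xdot] column-wise over the mesh, shape (2*D, m)
    out = np.zeros_like(z)
    for i in range(z.shape[1]):
        x, v = z[:D, i], z[D:, i]
        G, dG = metric(x), dvecG_dx(x)
        term = 2.0 * np.kron(np.eye(D), v[None, :]) @ (dG @ v) - dG.T @ np.kron(v, v)
        out[:D, i], out[D:, i] = v, -0.5 * np.linalg.solve(G, term)
    return out

def boundary(za, zb, x0=np.zeros(D), xf=np.full(D, 2.0)):
    return np.concatenate([za[:D] - x0, zb[:D] - xf])

t_mesh = np.linspace(0.0, 1.0, 20)
z_init = np.zeros((2 * D, t_mesh.size))
z_init[:D] = np.linspace(0.0, 2.0, t_mesh.size)   # straight-line initial guess
sol = solve_bvp(geodesic_rhs, boundary, t_mesh, z_init)
#+end_src

The returned ~sol.sol~ attribute is a smooth interpolant of the geodesic, mirroring the polynomial parameterisation used by the collocation transcription described next.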

Implicit Trajectory Optimisation

Solving the 2nd order \acrshort{ode} in cref:eq-2ode with the expected metric from cref:eq-expected-metric-weighting is equivalent to solving our trajectory optimisation problem subject to the same boundary conditions. This resembles an indirect optimal control method as it is based on an observation that the necessary conditions for optimality are encoded via the geodesic \acrshort{ode}. However, it is worth noting that solutions to the geodesic \acrshort{ode} are not guaranteed to satisfy the dynamics constraints.

Collocation Since neither $\dot{\mathbf{x}}(t_0)$ nor $\dot{\mathbf{x}}(t_f)$ are known, cref:eq-2ode cannot be solved with simple forward or backward integration. Instead, the problem is transcribed using collocation. Collocation methods are used to transcribe continuous-time trajectory optimisation problems into nonlinear programs, i.e. constrained parameter optimisations citep:kellyIntroduction2017,fahrooDirect2000. The expected metric in cref:eq-expected-metric is substituted into cref:eq-2ode and solved via collocation. This work implements a Hermite-Simpson collocation method. It parameterises the state trajectory using cubic polynomials and the dynamics equations (the geodesic \acrshort{ode} in this case) are imposed as constraints at a set of collocation points. The time interval $[t_0, t_f]$ is discretised into $I$ intervals, where the collocation points are the mid points of the discretisation intervals. The collocation states $\bar{\stateCol} = \{\stateCol_{i+\frac{1}{2}}\}_{i=0}^{I-1}$ are obtained by interpolating the polynomials. The derivatives of the collocation states w.r.t. time $\{\dot{\stateCol}_{i+\frac{1}{2}}, \ddot{\stateCol}_{i+\frac{1}{2}}\}_{i=0}^{I-1}$ are obtained algebraically via the polynomials. The collocation constraints then enforce the second derivative of the collocation states interpolated by the polynomials $\{\ddot{\stateCol}_{i+\frac{1}{2}}\}_{i=0}^{I-1}$ to equal the geodesic \acrshort{ode} $f_G$ at the collocation points. This is achieved through the collocation defects,

where $\ddot{\stateCol}_{i+\frac{1}{2}}, \dot{\stateCol}_{i+\frac{1}{2}}, \stateCol_{i+\frac{1}{2}}$ are obtained by interpolating between $i$ and $i+1$. cref:eq-defect defines a set of constraints that ensure trajectories are solutions to the geodesic \acrshort{ode} $f_G$. The nonlinear program that this method solves is given by,

Notice that no integrals need to be computed as all of the functions are algebraic operations. In practice, a quadratic cost function is used to regularise the state derivative $\dot{\stateCol}$,

where $\controlCostMatrix$ is a user-defined, real symmetric positive definite weight matrix. It is solved using \acrfull{slsqp} in SciPy citep:2020SciPy-NMeth.
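
The sketch below illustrates the structure of such a transcription with SciPy's \acrshort{slsqp}: knot states and velocities are the decision variables, the geodesic \acrshort{ode} enters through defect equality constraints, and a quadratic regulariser on the state derivative forms the objective. For brevity it uses trapezoidal defects as a simplified stand-in for the Hermite-Simpson scheme and reuses ~metric~ and ~dvecG_dx~ from the previous sketch; all other names and values are illustrative.

#+begin_src python
import numpy as np
from scipy.optimize import minimize

I_KNOTS, D = 20, 2
h = 1.0 / (I_KNOTS - 1)
x0, xf = np.zeros(D), np.full(D, 2.0)

def geodesic_acc(x, v):
    # geodesic ODE acceleration, reusing metric() and dvecG_dx() from above
    G, dG = metric(x), dvecG_dx(x)
    term = 2.0 * np.kron(np.eye(D), v[None, :]) @ (dG @ v) - dG.T @ np.kron(v, v)
    return -0.5 * np.linalg.solve(G, term)

def unpack(w):
    x, v = np.split(w, 2)
    return x.reshape(I_KNOTS, D), v.reshape(I_KNOTS, D)

def defects(w):
    # trapezoidal defects (a simplified stand-in for Hermite-Simpson) plus
    # the boundary conditions, all enforced as equality constraints
    x, v = unpack(w)
    acc = np.array([geodesic_acc(x[i], v[i]) for i in range(I_KNOTS)])
    dx = x[1:] - x[:-1] - 0.5 * h * (v[1:] + v[:-1])
    dv = v[1:] - v[:-1] - 0.5 * h * (acc[1:] + acc[:-1])
    return np.concatenate([dx.ravel(), dv.ravel(), x[0] - x0, x[-1] - xf])

def objective(w):
    _, v = unpack(w)
    return 0.5 * np.sum(v ** 2)   # quadratic regulariser on the state derivative

w_init = np.concatenate([np.linspace(x0, xf, I_KNOTS).ravel(),
                         np.zeros(I_KNOTS * D)])
res = minimize(objective, w_init, method="SLSQP",
               constraints={"type": "eq", "fun": defects},
               options={"maxiter": 500, "ftol": 1e-6})
#+end_src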

Latent variable controls This nonlinear program returns a collocation state trajectory $\bar{\stateCol}$ which parameterises a continuous-time state trajectory (via the polynomials). However, it does not return the control trajectory. The control trajectory is recovered from the state trajectory by performing inference in the probabilistic dynamics model. In order to do this, the state trajectory is first discretised. In practice, using the collocation states as the discretised state trajectory worked well, i.e. $\stateTraj = \{\state_{\timeInd}\}_{\timeInd=0}^{I} = \{\stateCol_{i}\}_{i=0}^{I}$. The state difference outputs $\stateDiffTraj = \{\stateDiff_{\timeInd}\}_{\timeInd=1}^{\TimeInd}$ are calculated from the state trajectory $\stateTraj$. The control trajectory $\controlTraj$ is then inferred from the state trajectory by extending the \acrshort{elbo} for the desired mode’s \acrshort{svgp} expert with latent variable inputs. Following cite:hensmanGaussian2013, the \acrshort{elbo} for a single \acrshort{svgp} expert is given by,

where the variational posterior is given by $q(\mode{\latentFunc}(\stateTraj, \controlTraj)) = ∫ p(\mode{\latentFunc}(\stateTraj, \controlTraj) \mid \mode{\latentFunc}(\expertInducingInput)) q(\mode{\latentFunc}(\expertInducingInput)) \text{d} \mode{\latentFunc}(\expertInducingInput)$. The control inputs $\controlTraj$ are recovered by treating them as latent variables and extending the lower bound to,

where each time step of the latent control trajectory is assumed to be normally distributed,

and its variational posterior is given by,

The posterior over the latent control trajectory $q(\controlTraj) ≈ p(\controlTraj \mid \stateTraj, \stateDiffTraj)$ is obtained by finding the variational parameters $\{\mathbf{m}_{\timeInd}, \mathbf{S}_{\timeInd}\}_{\timeInd=0}^{\TimeInd-1}$ that maximise the \acrshort{elbo} in cref:eq-control-elbo.

Remarks

Although this method provides an elegant solution to finding trajectories that satisfy Goals 1, 2 and 3, it is not without its limitations. First of all, this approach does not necessarily find trajectories that satisfy the dynamics constraints, as it projects the problem onto the geodesic \acrshort{ode}.

Secondly, it does not consider the full distribution over state-control trajectories. Without the inclusion of the full probabilistic dynamics model, it is impossible to consider the full distribution over state-control trajectories. Although propagating uncertainty through a single dynamics \acrshort{gp} is straightforward, handling the collocation constraints is not. This is because the geodesic \acrshort{ode} will become a \acrshort{sde}.

The goal of this chapter is to find trajectories that are $δ-\text{mode remaining}$ (cref:def-delta-mode-remaining). However, this method is not able to provide such guarantees as it ignores the state uncertainty accumulated over a trajectory. Consider calculating the probability that a single time step remains in the desired mode,

where $q(\GatingFunc(\state\timeInd) \mid \state\timeInd)$ is the approximate posterior over the gating functions from cref:eq-predictive-gating. Note the dependence on the state input $\state\timeInd$ is reintroduced here, as it becomes a random variable when making multi-step predictions. This probability will decrease as the uncertainty in the state increases. For example, when a trajectory passes through regions of the desired dynamics \acrshort{gp} with high uncertainty. This is implied by the marginalisation over $\state_\timeInd$. It will also decrease when the trajectory passes through regions of the state space where the gating functions are uncertain, indicated by the marginalisation over $\GatingFunc(\state\timeInd)$. The method presented in this section does not quantify the state uncertainty over a trajectory and at best can calculate the following probability, $Pr(\modeVar_{\timeInd} = \desiredMode \mid \state_{0:\timeInd}, \control_{0:\timeInd}, \bm{\modeVar}_{0:\timeInd-1}=\desiredMode)$. As such, this method cannot validate that trajectories are $δ-\text{mode remaining}$.

\acrfull{dre} label:sec-traj-opt-energy

intro

This section details a direct optimal control approach which embeds the mode remaining behaviour directly into the \acrfull{soc} problem, via a geometric objective function. In contrast to the previous approach, this method:

  1. enforces the dynamics constraints,
  2. handles the uncertainty associated with the dynamics in a principled manner.

This approach is a shooting method that enforces the dynamics constraints through simulation, i.e. the state trajectory is enforced to match the integral of the dynamics with respect to time.

Similar to the previous approach, this method builds on the observation that length minimising trajectories on the Riemannian manifold $\manifold$, associated with the desired mode’s gating function $\desiredGatingFunction$, encode the goals. Further to this, this method exploits the fact that length minimising trajectories on a Riemannian manifold $\manifold$ are also energy minimising trajectories citep:carmoRiemannian1992. As such, mode remaining behaviour can be encoded by solving,

with an objective function that minimises the Riemannian energy,

where $\terminalStateCostMatrix$ and $\controlCostMatrix$ are user-defined, real, symmetric, positive semi-definite and positive definite matrices respectively. The energy of a trajectory on a Riemannian manifold, endowed with the metric $\metricTensor$, is given by,

where $\stateDiff_{\timeInd} = \state_{\timeInd} - \state_{\timeInd-1}$ is the state difference. The mode remaining behaviour and the terminal state boundary condition are encoded via the objective function.

This may seem like an easy optimisation problem; however, calculating the expected cost in cref:eq-mode-soc-problem-geometry-cost is not straightforward. Given a starting state $\state_0$ and a control trajectory $\controlTraj$, the expectation in cref:eq-mode-soc-problem-geometry-cost is taken with respect to the joint state-metric distribution over a trajectory, $p(\stateTraj, \metricTensorTraj \mid \state_0, \controlTraj)$. Calculating this expectation is difficult as multi-step predictions in the \acrshort{mosvgpe} dynamics model cannot be calculated in closed form.

This work adopts a two-stage approximation to obtain a closed-form expression for the expected cost. First, multi-step dynamics predictions are approximated to obtain normally distributed states at each time step. Given normally distributed states, calculating the expected terminal and control cost terms in cref:eq-mode-soc-problem-geometry-cost is straightforward. However, the expected Riemannian energy in cref:eq-mode-soc-problem-geometry-cost has no closed-form expression, due to the dependence of the metric $\metricTensor$ on the state. The second stage approximates the calculation of the expected Riemannian energy under normally distributed states.

Approximate Inference for Dynamics Predictions label:sec-dynamics-predictions

Multi-step predictions in the \acrshort{mosvgpe} dynamics model have no closed-form solution because the state difference after the first time step is a Gaussian mixture, and propagating Gaussian mixtures through Gaussian processes has no closed-form solution. Further to this, constructing approximate closed-form solutions is difficult, due to the exponential growth in the number of Gaussian components.

This work sidesteps this issue and obtains closed-form multi-step predictions by enforcing that the controlled system remains in the desired dynamics mode. Multi-step predictions can then be calculated in closed-form by cascading single-step predictions using the desired dynamics \acrshort{gp}, whose transition density is given by,

where $\mathbf{A}_{\desiredMode} = \desiredExpertKernel(\singleInput, \desiredExpertInducingInput)\desiredExpertKernel(\desiredExpertInducingInput, \desiredExpertInducingInput)^{-1}$ and $\singleInput=(\state\timeInd, \control\timeInd)$. Cascading single-step predictions requires recursively mapping uncertain state-control inputs through the desired mode’s dynamics \acrshort{gp}, i.e. recursively calculating the following integral,

with $p(\state_0) = δ(\state_0)$. Approximate closed-form solutions exist for propagating normally distributed states and controls through \acrshort{gp} models citep:girardApproximate2004,kussGaussian2006,quinonero-candelaPropagation2003. This work exploits the moment-matching approximation from Section 7.2.1 of cite:kussGaussian2006.

$δ-\text{mode remaining}$ chance constraints Enforcing the controlled system to remain in the desired dynamics mode simplifies calculating multi-step predictions and the expected cost in cref:eq-mode-soc-problem-geometry-cost. As the dynamics model is learned from observations, this work relaxes the requirement of remaining in the desired mode with certainty to ensuring that trajectories are $δ-\text{mode remaining}$ (cref:def-delta-mode-remaining). The conditions to be $δ-\text{mode remaining}$ can be enforced with chance constraints,

These constraints enforce the system to remain in the desired dynamics mode with satisfaction probability $\satisfactionProb=1-δ$, at each time step. As the \acrshort{mosvgpe} model assumes that the mode indicator variable $\modeVar$ depends on the state via the gating function, this probability is calculated as follows,

where $q(\GatingFunc(\state\timeInd) \mid \state\timeInd)$ is the approximate posterior over the gating functions from cref:eq-predictive-gating.
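
For the two-mode case used in the experiments, where the gating network has a single gating function with a Bernoulli likelihood, this per-step probability can be approximated with Gauss-Hermite quadrature over the gating \acrshort{gp} posterior. The sketch below evaluates it at a fixed state and assumes a logistic (sigmoid) link for illustration; marginalising a Gaussian state as in cref:eq-mode-chance-constraint-integral would add an outer expectation, for example via sampling. Function and argument names are illustrative.

#+begin_src python
import numpy as np
from scipy.special import expit, roots_hermite

def mode_probability(gating_mean, gating_var, n_quad=20):
    """Pr(mode = desired) for a single sigmoid-linked gating function,
    marginalising h ~ N(gating_mean, gating_var) with Gauss-Hermite quadrature."""
    xi, w = roots_hermite(n_quad)
    h = gating_mean + np.sqrt(2.0 * gating_var) * xi
    return float(np.sum(w * expit(h)) / np.sqrt(np.pi))

# a confident point deep inside the desired mode vs. an uncertain boundary point
print(mode_probability(3.0, 0.1), mode_probability(0.5, 4.0))
#+end_src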

Approximate Riemannian Energy

Given this approach for simulating the \acrshort{mosvgpe} dynamics model, the state at each time step is normally distributed. Unlike the terminal and control cost terms in cref:eq-mode-soc-problem-geometry-cost, the expected Riemannian energy,

has no closed-form expression under normally distributed states. This is because the metric tensor $\metricTensor$ depends on the Jacobian, which depends on the state. However, it is possible to approximate the expected energy to obtain a closed-form expression.

The distribution over the Jacobian when the input location is normally distributed, $\state\timeInd ∼ \mathcal{N}(\stateMean, \stateCov)$, can be calculated in closed form when using the Squared Exponential kernel. However, this work simplifies the problem and calculates the Jacobian at the state mean of each time step along a trajectory, i.e. $\Jac_{\state_{\timeInd}} ≈ \frac{∂ \manifoldFunction(\stateMean)}{∂ \stateMean}$. The distribution over the Jacobian given deterministic inputs can be calculated using cref:eq-predictive-jacobian-dist.

Approximating the Jacobian to be independent of the state enables the expected metric tensor to be calculated in closed-form with cref:eq-expected-metric-weighting. Given this approximation, the Riemannian energy retains a quadratic form, so the expectation with respect to $\stateDiff\timeInd ∼ \mathcal{N}(\stateDiffMean, \stateDiffCov)$ can be calculated with,

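For reference, once the Jacobian is evaluated at the state mean and the expected metric (written $\bar{\metricTensor}$ here) is treated as deterministic, this expectation reduces to the standard Gaussian quadratic-form identity,

\begin{equation*}
\E_{\stateDiff \sim \mathcal{N}(\stateDiffMean, \stateDiffCov)} \left[ \stateDiff^{T} \bar{\metricTensor} \, \stateDiff \right] = \stateDiffMean^{T} \bar{\metricTensor} \, \stateDiffMean + \operatorname{tr}\left( \bar{\metricTensor} \, \stateDiffCov \right),
\end{equation*}

a restatement of the per-step closed form referred to above; constant factors and the summation over time steps are omitted here.
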
Given this approximation for the expected Riemannian energy, the expected cost in cref:eq-mode-soc-problem-geometry can be calculated in closed-form with,

This work then approximately solves the problem in cref:eq-mode-soc-problem by solving,

using \acrshort{slsqp} in SciPy citep:2020SciPy-NMeth. This method obtains closed-form expressions for the expected cost in cref:eq-mode-soc-problem-geometry by constraining the system to be $δ-\text{mode remaining}$ (cref:def-delta-mode-remaining).

Practical Implementation

An alternative approach to obtain mode remaining behaviour is to optimise subject to the chance constraints in cref:eq-mode-chance-constraint alone, i.e. without the Riemannian energy cost term. However, this constrained optimisation is often not able to converge in practice. Experiments and intuition indicate that the geometry of the gating functions provides a much better optimisation landscape. This is because the gating functions vary gradually over the state domain, whilst the mixing probability changes abruptly at the boundaries between dynamics modes.

Therefore in practice, the optimisation in cref:eq-mode-soc-problem-geometry-approx is performed unconstrained, i.e. without enforcing the chance constraints at every iteration. Instead, the chance constraints are used to validate trajectories found by the unconstrained optimiser, before deploying them in the environment. In most experiments, this strategy was far superior to constraining the optimisation at every iteration.

Cost Functions

Cost Functions This work primarily focuses on quadratic costs, as they are ubiquitous in control and lead to closed-form expectations under normally distributed states ${\state\timeInd ∼ \mathcal{N}(\stateMean, \stateCov)}$ and controls ${\control\timeInd ∼ \mathcal{N}(\controlMean, \controlCov)}$. This work seeks to find trajectories between a start state $\state_0$ and a target state $\targetState$, at time $\TimeInd$. It is common to minimise the deviation of the final state $\state_\TimeInd$ from the desired target state $\targetState$. It is also common to find trajectories that minimise the expenditure of control effort. As such, this work adopts the following cost function,

where $\terminalStateCostMatrix$ and $\controlCostMatrix$ are user-defined, real, symmetric, positive semi-definite and positive definite matrices respectively. Importantly, the expected value of this cost under normally distributed states and controls can be calculated in closed form with,

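A minimal numpy sketch of this closed-form expectation is given below, using the identity $\E[(\mathbf{y}-\mathbf{a})^{T}\mathbf{M}(\mathbf{y}-\mathbf{a})] = (\bm{\mu}-\mathbf{a})^{T}\mathbf{M}(\bm{\mu}-\mathbf{a}) + \operatorname{tr}(\mathbf{M}\bm{\Sigma})$ for $\mathbf{y} ∼ \mathcal{N}(\bm{\mu}, \bm{\Sigma})$; the function name and argument layout are illustrative.

#+begin_src python
import numpy as np

def expected_quadratic_cost(state_means, state_covs, control_means, control_covs,
                            x_target, Q_terminal, R):
    """Closed-form expected terminal-state + control-effort cost under
    Gaussian state and control marginals at each time step."""
    err = state_means[-1] - x_target
    terminal = err @ Q_terminal @ err + np.trace(Q_terminal @ state_covs[-1])
    control = sum(m @ R @ m + np.trace(R @ S)
                  for m, S in zip(control_means, control_covs))
    return terminal + control
#+end_src
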
Mode Remaining Control as Probabilistic Inference label:chap-traj-opt-inference

Maths

intro

This section presents an alternative approach to finding mode remaining trajectories, named \acrfull{cai}. In contrast to the previous section, which encoded mode remaining behaviour via the latent geometry of the \acrshort{mosvgpe}’s gating network, this section unleashes the power of the probability mass function over the expert indicator variable. As all \acrshort{mogpe} methods have a probability mass function over the expert indicator variable, the method presented in this chapter is applicable to a wider range of \acrshort{mogpe} dynamics models. cref:sec-inference-background recaps the necessary background and related work and cref:sec-traj-opt-inference then details the trajectory optimisation algorithm.

Background and Related Work label:sec-inference-background

Cost Functions as Likelihoods

This section first recaps the \acrfull{cai_unimodal} framework. To formulate optimal control as probabilistic inference it is first embedded into a graphical model (see Figure ref:fig-basic-control-graphical-model). The joint probability model (over a trajectory) is augmented with an additional variable to encode the notion of cost (or reward) over the trajectory (see Figure ref:fig-augmented-control-graphical-model). The new variable is a Bernoulli random variable $\optimalVar_\timeInd ∈ \{0, 1\}$, that indicates if time step $\timeInd$ is optimal $\optimalVar_\timeInd=1$, or not optimal $\optimalVar_\timeInd=0$. The likelihood distribution can be formulated by mapping the negative cost through a monotonically increasing function $\monotonicFunc$, giving the likelihood,

A common (and convenient) approach is to formulate the likelihood using an exponential transform of the cost. This results in a Boltzmann distribution where the inverse temperature, $\temperature$, is used to scale the cost,

The resulting negative log-likelihood, for a single state-control trajectory $\stateTraj, \controlTraj$, is an affine transformation of the cost,

which preserves convexity. The set of optimal Bernoulli variables over a trajectory is denoted $\optimalVarTraj = \{\optimalVar_{\timeInd}=1\}_{\timeInd=0}^{\TimeInd}$. When the inverse temperature parameter is set to $\temperature=1$, the maximum likelihood trajectory coincides with classical optimal control citep:toussaintRobot2009.

Quadratic Cost Functions

This work primarily focuses on quadratic costs, $\costFunc$, which are ubiquitous in control.

Terminal Control Problems To find a trajectory between a start state, $\state_0$, and a target state, $\targetState$, at time, $\TimeInd$, it is common to minimise the deviation of the final state, $\state_\TimeInd$, from the desired target state $\targetState$. To allow greater generality, a user-defined, real symmetric positive definite matrix, $\terminalStateCostMatrix$, can be introduced,

It is worth noting that if a cost function is not quadratic then a quadratic approximation can be obtained with a Taylor expansion.

graphical model

graphical model

Inference of Sequential Latent Variables

\newline

The joint probability for an optimal trajectory (i.e. for $\optimalVar_\timeInd=1$ for all $\timeInd ∈ \{0,\ldots,\TimeInd\}$) can be factorised using its Markovian structure,

where $\controlDist$ denotes the controller or policy. cite:toussaintRobot2009 highlights that although the maximum likelihood trajectory coincides with the classical optimal trajectory, taking expectations over trajectories, i.e. calculating $\log p(\optimalVarTraj \mid \state_0)$, is not equivalent to expected cost minimisation. cite:rawlikStochastic2013 extend the concepts from cite:toussaintRobot2009 to show the general relation to classical \acrshort{soc}. For a given policy $\policy$, they introduce the posterior distribution over state-control trajectories as,

where $Z=p(\optimalVarTraj \mid \state_0)$. This distribution $\policyDist(\stateTraj, \controlTraj \mid \optimalVarTraj, \state_0)$ is conditioned on the optimality variable but generated by a potentially uniform policy $\policy$.

They then distinguish between a prior policy $\priorPolicy$ and an unknown control policy $\policy$. The prior distribution over state-control trajectories under the control policy $\policy$, is given by,

Intuitively, $\controlledPolicyDist(\stateTraj, \controlTraj)$ is thought of as the controlled process, which is not conditioned on optimality, and $\priorPolicyDist(\stateTraj, \controlTraj \mid \optimalVarTraj, \state_0)$ as the posterior process, conditioned on optimality but generated by a potentially uniform policy, $\priorPolicy$. The dual problem is then to find the control policy, $\policy$, where the controlled process, $\controlledPolicyDist(\stateTraj, \controlTraj)$, matches the posterior process, $\priorPolicyDist(\stateTraj, \controlTraj \mid \optimalVarTraj, \state_0)$. Given $\priorPolicy$ is an arbitrary stochastic policy and $\mathbb{D}$ is the set of deterministic policies, the problem,

is equivalent to the \acrshort{soc} problem, with a modified integral cost,

The problem in cref:eq-kl-control finds trajectories that balance minimising expected costs $\E_{\controlledPolicyDist(\stateTraj, \controlTraj)} \left[ \costFunc(\stateTraj, \controlTraj) \right]$ and selecting a policy $\policy$ that is similar to the prior policy $\priorPolicy$. If the prior policy $\priorPolicy$ is assumed to be uniform, then $\E_{\controlledPolicyDist(\stateTraj, \controlTraj)} \left[ \log \priorPolicy(\controlTraj \mid \stateTraj) \right]$ becomes constant and the optimised policy, $\policy_*$, is a balance of minimising expected costs and maximising the policy’s entropy,

Maximum entropy regularisation Formulating trajectory optimisation in this way encodes the maximum causal entropy principle, which is often used to achieve robustness, in particular for inverse optimal control citep:ziebartModeling2010.

Other approaches There are multiple approaches to performing inference in this graphical model. Trading accuracy for computational complexity is often required for real-time control. In this case, one approach is to approximate the dynamics with linear or quadratic approximations, as is done in \acrshort{ilqr}/\acrshort{ilqg} and \acrshort{gp} respectively. Given linear dynamics, the full graphical model in cref:eq-traj-opt-joint-dist can be computed using approximate Gaussian message passing, for which effective methods exist citep:loeligerFactor2007. The inference problem can then be solved using the expectation maximisation algorithm for dynamical system estimation citep:shumwayAPPROACH1982,ghahramaniLearning1999,schonSystem2011, with input estimation citep:watsonStochastic2021.

Mode Remaining Control as Inference label:sec-traj-opt-inference

maths

Fig

intro

This section details how the \acrfull{cai_unimodal} framework can be extended to multimodal dynamical systems and used to encode mode remaining behaviour. cref:eq-traj-opt-gating-network-prob-post-inf shows the probability mass function over the expert indicator variable after training \acrshort{mosvgpe} on the historical data set of state transitions from the quadcopter navigation problem in cref:illustrative_example. Intuitively, the goal is to find trajectories that remain in regions of the dynamics with a high probability of remaining in the desired dynamics mode.

graphical model

after graphical model

\newline

In order to find trajectories that remain in the desired dynamics mode, this work further augments the graphical model in cref:fig-control-graphical-model with the mode indicator variable $\modeVar ∈ \modeDomain$ from cref:eq-multimodal-dynamics-disc. The resulting graphical model is shown in Figure ref:fig-mode-augmented-control-graphical-model, where the evidence is that $\optimalVar\timeInd=1$ and $\modeVar\timeInd=\desiredMode$ for all $\timeInd ∈ \{0,\ldots,\TimeInd\}$. The joint probability model is then given by,

where $\modeVarTraj = \{\modeVar_{\timeInd}=\desiredMode \}_{\timeInd=0}^{\TimeInd}$ denotes every time step of a trajectory belonging to the desired dynamics mode $\desiredMode$. cref:eq-traj-opt-joint-dist-mode states that the probability of observing a trajectory is given by taking the product of its probability of occurring according to the dynamics, with the exponential of the negative cost and the probability of remaining in the desired dynamics mode. Given deterministic dynamics, the trajectory with the highest probability will be that with the lowest cost and highest probability of remaining in the desired dynamics mode.

Variational Inference

This work draws on the connection between KL-divergence control citep:rawlikStochastic2013 and structured variational inference. Whilst the derivation shown here differs from cite:rawlikStochastic2013, the underlying framework and objective are the same.

In variational inference, the goal is to approximate a distribution $p(\mathbf{y})$ with another, simpler distribution $q(\mathbf{y})$. Typically this distribution $q(\mathbf{y})$ is selected to be a product of conditional distributions connected in a chain or tree, which lends itself to tractable inference. In this work, the goal is to approximate the intractable distribution over optimal trajectories,

with the variational distribution $\trajectoryVarDist$. Note that $Z=p(\optimalVarTraj, \modeVarTraj \mid \state_0)$. In this work the variational distribution over the controller is assumed independent of the state. That is, instead of parameterising the controller with state feedback, it is parameterised as a set of open-loop controls $\controlTraj = \{\control_0, \ldots, \control_{\TimeInd-1}\}$ that are treated as random variables with density $q(\controlTraj)$. Calculating $\trajectoryVarDist$ for a set of controls $q(\controlTraj) = ∏_{\timeInd=0}^{\TimeInd-1} \controlVarDist$ requires simulating the trajectory in the learned, single-step dynamics model, i.e. making long-term predictions.

Approximate Inference for Dynamics Predictions

As this method is using a learned representation of the transition dynamics, it suffices to assume that the dynamics are given by the desired mode’s learned dynamics. Constructing approximate closed-form solutions based on the model in cref:chap-dynamics is difficult, due to the exponential growth in the number of Gaussian components. Similar to the approach in cref:sec-dynamics-predictions, this method obtains multi-step predictions by cascading single-step predictions through the desired mode’s dynamics \acrshort{gp}. However, this approach extends the predictions to handle normally distributed controls $\control\timeInd ∼ \mathcal{N}(\controlMean, \controlCov)$. Multi-step predictions are then obtained by recursively calculating the following integral,

using the moment matching approximation from citep:kussGaussian2006. Note that $\modeVarTraj_{0:\timeInd}$ denotes the first $\timeInd$ elements of $\modeVarTraj$, i.e. $\modeVarTraj_{0:\timeInd} = \{ \modeVar_i=\desiredMode\}_{i=0}^{\timeInd}$. Given this method for making multi-step predictions, the variational distribution over state-control trajectories is given by,

with $q(\state_0) = δ(\state_0)$. Note that the state and control at each time step are normally distributed.
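
The sketch below shows the structure of such a cascade: at every step the Gaussian state and control marginals are stacked into a joint input, pushed through the desired mode's dynamics \acrshort{gp}, and the resulting state-difference moments are accumulated. The GP is represented only by hypothetical callables ~gp_mean~, ~gp_var~ and ~gp_jac~, and a first-order (linearisation) propagation stands in for the exact moment matching of cite:kussGaussian2006.

#+begin_src python
import numpy as np

def rollout_moments(x0, control_means, control_covs, gp_mean, gp_var, gp_jac):
    """Cascade one-step predictions through a single mode's dynamics GP,
    keeping Gaussian marginals N(mean, cov) over the state at every step.

    gp_mean(z), gp_var(z): posterior mean / marginal variance of the state
        difference at the joint input z = [state, control].
    gp_jac(z): Jacobian of gp_mean at z (first-order uncertainty propagation)."""
    D = np.asarray(x0).size
    means, covs = [np.asarray(x0, dtype=float)], [np.zeros((D, D))]
    for m_u, S_u in zip(control_means, control_covs):
        U = np.asarray(m_u).size
        mu_z = np.concatenate([means[-1], m_u])
        S_z = np.block([[covs[-1], np.zeros((D, U))],
                        [np.zeros((U, D)), S_u]])
        A = gp_jac(mu_z)                              # d(Delta x)/dz, shape (D, D+U)
        delta_mean = gp_mean(mu_z)
        delta_cov = A @ S_z @ A.T + np.diag(gp_var(mu_z))
        A_x = A[:, :D]                                # block w.r.t. the state
        # x_{t+1} = x_t + Delta x_t, including the state/Delta cross-covariance
        means.append(means[-1] + delta_mean)
        covs.append(covs[-1] + delta_cov + A_x @ covs[-1] + covs[-1] @ A_x.T)
    return means, covs
#+end_src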

Variational Inference for Sequential Latent Variables

Variational inference seeks to optimise $q(\controlTraj)$ w.r.t. the \acrfull{elbo}. In this setup, the evidence is that $\optimalVar\timeInd=1$ and $\modeVar\timeInd=\desiredMode$ for all $\timeInd ∈ \{0,\ldots,\TimeInd\}$. Given this, the \acrshort{elbo} is given by,

where the inequality is obtained via Jensen’s inequality. Note the cancellation in the second line where we assume $\transitionDistK=q(\state_{\timeInd+1} \mid \state_0, \modeVarTraj_{0:\timeInd})$. Assuming a uniform prior policy $\policy(\control_\timeInd \mid \state_\timeInd)$ leads to the $\text{KL}$ term reducing to an entropy term and a constant,

The \acrshort{elbo} in cref:eq-traj-opt-elbo resembles the KL control objective in cref:eq-kl-control-uniform but with an extra term encoding the mode remaining behaviour. In practice, maximum entropy control is achieved by parameterising the control at each time step to be normally distributed ${\controlVarDist = \mathcal{N}(\control\timeInd \mid \controlMean, \controlCov)}$. The control dimensions are assumed independent so $\controlCov$ becomes diagonal. The maximum entropy behaviour can be omitted by using deterministic controls, i.e. parameterising them to follow a Dirac delta distribution $\controlVarDist = δ(\control_\timeInd)$.

Control problem

The control problem is then given by,

which encodes mode remaining behaviour alongside maximum entropy control. The variational parameters $\controlMean$ (and $\controlCov$) are found by maximising the \acrshort{elbo} using gradient-based optimisation. At each iteration, the \acrshort{elbo} is calculated by rolling out the control distribution in the desired mode’s \acrshort{gp} dynamics model using cref:eq-state-control-unc-prop-inf. The resulting state-control trajectory is then used to calculate the \acrshort{elbo}. The cost function is given by,

where $\terminalStateCostMatrix$ and $\controlCostMatrix$ are user-defined, real, symmetric, positive semi-definite and positive definite matrices respectively. This cost function encodes the terminal state boundary condition in cref:eq-mode-soc-problem via a quadratic cost that minimises the deviation of the final state $\state_\TimeInd$ from the target state $\targetState$. Importantly, the expected value of this cost under normally distributed states and controls can be calculated in closed-form with,

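Putting the pieces together, a minimal sketch of the resulting optimisation is shown below. It reuses ~rollout_moments~, ~expected_quadratic_cost~ and ~mode_probability~ from the earlier sketches, evaluates the mode term at the state means rather than marginalising them, and uses deterministic controls so that the entropy term vanishes; all names, the inverse temperature ~alpha~ and the gating callables are illustrative stand-ins for the quantities defined in the text.

#+begin_src python
import numpy as np
from scipy.optimize import minimize

def negative_objective(flat_u, x0, T, D_u, gp_mean, gp_var, gp_jac,
                       gating_mean, gating_var, x_target, Q_terminal, R, alpha=1.0):
    """Negative of an approximate objective: scaled expected quadratic cost minus
    the accumulated log-probability of remaining in the desired mode."""
    u_means = flat_u.reshape(T, D_u)
    u_covs = [np.zeros((D_u, D_u))] * T               # Dirac (deterministic) controls
    means, covs = rollout_moments(x0, u_means, u_covs, gp_mean, gp_var, gp_jac)
    cost = expected_quadratic_cost(means, covs, u_means, u_covs, x_target, Q_terminal, R)
    log_mode = sum(np.log(mode_probability(gating_mean(m), gating_var(m)) + 1e-12)
                   for m in means[1:])
    return alpha * cost - log_mode

# e.g. optimise the flattened control means (numerical gradients):
# res = minimize(negative_objective, u_init.ravel(), args=(x0, T, D_u, ...),
#                method="SLSQP")
#+end_src
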
Approximate Inference for Dynamics Predictions

Approximate Inference for Dynamics Predictions Fortunately, this work is interested in remaining in a single dynamics mode $\desiredMode$, which simplifies making long-term predictions. The model in Chapter ref:chap-dynamics represents each mode’s dynamics function as a sparse variational GP, giving the transition density as,

where $\mode{\mathbf{A}} = \expertKernel(\singleInput, \expertInducingInput)\left(\expertKernel(\expertInducingInput, \expertInducingInput)\right)^{-1}$. To obtain long-term predictions, we cascade one-step predictions. This requires mapping uncertain state-control inputs through the desired mode’s \acrshort{gp} dynamics model. This prediction problem corresponds to recursively calculating the following integral,

with $q(\state_0) = δ(\state_0)$. This work considers normally distributed controls ${\control\timeInd ∼ \mathcal{N}(\controlMean, \controlCov)}$ and deterministic controls. Approximate closed-form solutions exist for propagating normally distributed states and controls through \acrshort{gp} models citep:girardGaussian2003,kussGaussian2006,quinonero-candelaPropagation2003. \todo{correct girard citation} This work exploits the moment-matching approximation implemented in the ~uncertain_conditional~ function from GPflow citep:GPflow2017. For more details on the approximation see Section 7.2.1 of citep:kussGaussian2006.

Given this method for making long term predictions, the distribution over state-control trajectories is given by,

The variational lower bound is then given by sums over each time step,

Approximate Inference for Dynamics Predictions

The aforementioned trajectory optimisation technique relies on the ability to simulate the controlled system using the learned dynamics model. Recursively estimating the state of a nonlinear dynamical system is a common problem. Exact Bayesian solutions in closed-form can only be obtained for a few special cases. For example, the Kalman filter for linear Gaussian systems citep:kalmanNew1960 is exact. In the nonlinear case, approximate methods are required to obtain efficient closed-form solutions. \todo{cite Extended Kalman filter (EKF), Unscented Kalman filter (UKF) etc?}

Constructing approximate closed-form solutions based on the model in Chapter ref:chap-dynamics is especially difficult, due to the exponential growth in the number of Gaussian components. Consider approximating the dynamics modes to be independent over a trajectory of length $N$, and recursively propagating each component through both modes. This approximation would lead to the distribution over the final state consisting of $\ModeInd^N$ Gaussian components.

Luckily, this work is interested in remaining in a single dynamics mode $\desiredMode$, which simplifies the problem. Assuming that the controlled system remains in this desired dynamics mode, the state trajectory can be simulated using only a single dynamics function,

The model in Chapter ref:chap-dynamics represents each mode’s dynamics function as a sparse variational GP, giving the transition density as,

where the functional form of $\expertVariational$ is given by,

Remember that the variational posterior $\expertVariational$ is an approximation,

that captures the joint distribution in the data through the inducing variables $\expertInducingOutput$. Given a deterministic state-control input, $\singleInput$, cref:eq-sparse-gp-dynamics can be used to calculate the density over the next state $\state_{\timeInd+1}$. However, as both the state and control at a given time step could also be Gaussian distributed, ${\state\timeInd ∼ \mathcal{N}(\stateMean, \stateCov)}$, and ${\control\timeInd ∼ \mathcal{N}(\controlMean, \controlCov)}$, the joint distribution of the state and control, $p(\singleInput)$, is also Gaussian distributed,

Predictions with Uncertain Inputs

Consider the problem of predicting the next state $\state_{\timeInd+1}$ given multivariate normally distributed states, ${\state\timeInd ∼ \mathcal{N}(\stateMean, \stateCov)}$, and controls, ${\control\timeInd ∼ \mathcal{N}(\controlMean, \controlCov)}$, where $\desiredDynamicsFunc ∼ \mathcal{GP}$. This prediction problem corresponds to calculating the following integral,

Approximate closed-form solutions exist for propagating uncertain inputs through \acrshort{gp} models cite:girardGaussian2003,kussGaussian2006,quinonero-candelaPropagation2003. \todo{correct girard citation} This work exploits the moment-matching approximation implemented in the ~uncertain_conditional~ function from GPflow cite:GPflow2017. For more details on the approximation see Section 7.2.1 of cite:kussGaussian2006. \todo{Add moment matching figure?} This work propagates model uncertainty forwards by cascading such one-step predictions.

Conclusion

This chapter has presented three mode remaining trajectory optimisation algorithms. The first two have shown how the geometry of the \acrshort{mosvgpe} gating network encodes valuable information regarding how a multimodal dynamical system switches between its underlying dynamics modes. Moreover, they have shown how this latent geometry can be leveraged to encode mode remaining behaviour into two different control strategies. Both of these control strategies introduce a user tunable parameter $λ$ that can be tuned to either prioritise remaining in the desired dynamics mode or avoiding regions of high epistemic uncertainty. However, cref:chap-traj-opt-results will show that setting the $λ$ parameter is not straightforward in practice. The third method presented in this chapter has shown how the probability mass function over the \acrshort{mogpe}’s expert indicator variable can be used to encode mode remaining behaviour for trajectory optimisation. In particular, it has shown how the \acrshort{cai} framework citep:toussaintRobot2009,toussaintProbabilistic2006,kappenOptimal2013 can be extended to multimodal dynamical systems and how mode remaining behaviour can be encoded by conditioning on the mode indicator variable.

In cref:chap-traj-opt-results the methods presented in this chapter are evaluated and compared using the quadcopter navigation problem from the illustrative example in cref:illustrative_example. It turns out that in practice, the \acrfull{dre} method from cref:sec-traj-opt-energy and the \acrfull{cai} method from cref:chap-traj-opt-inference, perform significantly better than the \acrfull{ig} method.

Quadcopter Experiments - Mode Remaining Trajectory Optimisation label:chap-traj-opt-results

acronyms

maths

intro

This chapter evaluates the three mode remaining trajectory optimisation algorithms:

  1. \acrfull{ig} from cref:sec-traj-opt-collocation,
  2. \acrfull{dre} from cref:sec-traj-opt-energy,
  3. \acrfull{cai} from cref:sec-traj-opt-inference,

in a simulated version of the illustrative example from cref:illustrative_example. That is, flying a velocity controlled quadcopter from an initial state $\state_0$, to a target state $\statef$, whilst avoiding the turbulent dynamics mode. The methods are further evaluated on a second simulated velocity controlled quadcopter navigation problem. See cref:fig-environments for schematics of the two environments. The turbulent dynamics modes (green) are subject to higher drift due to the wind field created by the fan. They are also subject to higher diffusion (process noise) resulting from the turbulence induced by the fan. Although the exact turbulent dynamics are unknown, they are believed to be difficult to control. This is due to the high process noise which may lead to catastrophic failure. Therefore, it is desirable to find trajectories that avoid entering this turbulent dynamics mode.

The collocation solver from the \acrfull{ig} method in cref:chap-traj-opt-geometry is tested on a real-world quadcopter state transition data set. This data set was collected with constant controls, so the full \acrshort{mosvgpe} dynamics model could not be learned and the controls could not be recovered; as a result, neither the \acrfull{dre} method from cref:chap-traj-opt-geometry nor the \acrfull{cai} method from cref:chap-traj-opt-inference could be tested. Nevertheless, the results are a small step towards validating the method’s applicability to real-world systems. All three control methods are tested in the two simulated environments so that they can be compared. To aid comparison with the real-world experiments, the layout of the first simulated environment (Environment 1) was kept consistent with the real-world experiments.

Real-World Quadcopter Experiments label:sec-traj-opt-results-brl

intro

The \acrfull{ig} method presented in cref:sec-traj-opt-collocation was evaluated using data from the real-world quadcopter navigation problem detailed in cref:sec-brl-experiment. However, a different subset of the environment was left unobserved, and the model was trained using an older variational inference scheme rather than the one presented in cref:chap-dynamics. cref:fig-environment-1 shows the environment and details the quadcopter navigation problem. The controls were kept constant during data collection, reducing the dynamics to $Δ \state_{\timeInd+1} = \dynamicsFunc(\state_{\timeInd}; \control_{\timeInd}=\fixedControl)$. See cref:sec-brl-experiment for more details on data collection and processing.

results figs

Model Learning

The model from cref:chap-dynamics was instantiated with $\ModeInd=2$ experts and trained on the data collected from the velocity controlled quadcopter experiment. Each mode’s dynamics \acrshort{gp} used a Squared Exponential kernel with \acrfull{ard} and a constant mean function. The gating network used a single gating function with a Bernoulli likelihood. Its \acrshort{gp} prior utilised a Squared Exponential kernel with \acrshort{ard} and a zero mean function.

cref:fig-geometric-traj-opt shows the gating network posterior where the model has learned two dynamics modes, characterised by the drift and process noise induced by the fan. Mode 1 (red) represents the operable dynamics mode whilst Mode 2 (blue) represents the inoperable turbulent dynamics mode. This is illustrated in cref:fig-geometric-traj-opt-over-prob which shows the probability that the desired mode ($\modeVar = 1$) governs the dynamics over the domain. cref:fig-geometric-traj-opt-over-svgp shows the \acrshort{gp} posterior mean (left) and variance (right) of the gating function $\gatingFunc_1$ associated with the desired dynamics mode. The mean is high where the model believes the desired mode is responsible for predicting, low where it believes another mode is responsible and zero where it is uncertain. The variance (right) has also clearly captured information regarding the epistemic uncertainty, i.e. where the model is uncertain which mode governs the dynamics.

Trajectory Optimisation using Indirect Optimal Control via Latent Geodesics

The initial (cyan) trajectory in cref:fig-geometric-traj-opt was initialised as a straight line with $10$ collocation points, indicated by the crosses. The collocation solver guarantees that trajectories end at the target state. However, trajectories are not guaranteed to remain in the desired dynamics mode, nor are they guaranteed to satisfy the system dynamics. cref:tab-results compares the initial trajectory to the results obtained with two settings of the $λ$ parameter from cref:eq-expected-metric-weighting, which determines the relevance of the covariance term in the expected Riemannian metric.

Table

#+CAPTION[Comparison of \acrfull{ig} experiments on real-world quadcopter]: \acrfull{ig} Comparison of performance with different settings of $λ$.

| Trajectory         | Mixing Probability                           | Epistemic Uncertainty                             |
|                    | $∑_{i=1}^{I} Pr(\modeVar_i=1 \mid \state_i)$ | $∑_{i=1}^{I} \mathbb{V}[\gatingFunc_1(\state_i)]$ |
|--------------------+----------------------------------------------+---------------------------------------------------|
| Initial            | $7.480$                                      | $1.345$                                           |
| Optimised $λ=20.0$ | $6.091$                                      | $\mathbf{1.274}$                                  |
| Optimised $λ=0.5$  | $\mathbf{8.118}$                             | $1.437$                                           |

Figs

After Table

\newline

The higher probability of remaining in the desired dynamics mode indicates that the lower setting of $λ=0.5$ exhibits more mode remaining behaviour. This can be seen visually in cref:fig-geometric-traj-opt-over-prob, where the trajectory with $λ=0.5$ remains in regions of the model with high probability. The right-hand plot in cref:fig-geometric-traj-opt-over-svgp shows that this trajectory favours remaining in the desired mode at the cost of entering the region of the gating network with high epistemic uncertainty. This is quantified in cref:tab-results, which shows it accumulates more gating function variance than both the initial trajectory and the trajectory found with $λ=20.0$.

In contrast, the trajectory found with $λ=20.0$ initially remains in the desired mode but then enters the turbulent mode in favour of avoiding the area of high epistemic uncertainty. This is confirmed in cref:tab-results as the trajectory found with $λ=20.0$ accumulates the least gating function variance over the trajectory. This is further visualised in cref:fig-epistemic_var_vs_time,fig-mixing_prob_vs_time. These results align with the intended behaviour of the user tunable $λ$ parameter. However, in this experiment, increasing the relevance of the covariance term in the expected Riemannian metric has no benefit. This is because it pushes the trajectory into the turbulent dynamics mode.

These results indicate that the nonlinear program in cref:eq-collocation-problem, i.e. solving the geodesic \acrshort{ode} in cref:eq-2ode via collocation, is capable of finding state trajectories from $\state_0$ to $\targetState$ that exhibit mode remaining behaviour. However, these experiments have not validated the method’s ability to recover the controls from the state trajectory using cref:eq-control-elbo. This is because a full transition dynamics model has not been learned, so the control trajectory cannot be recovered. In cref:sec-traj-opt-results-simulated the method is evaluated in a simulated environment where its ability to recover the controls is tested.

It is worth noting here that the mode remaining behaviour is sensitive to the tolerance on the collocation constraints. Setting the tolerance too small often resulted in the solver failing to converge, whilst setting the tolerance too large omitted any mode remaining behaviour.

Simulated Quadcopter Experiments label:sec-traj-opt-results-simulated

intro

All of the control methods are now tested in two simulated environments so that they can be compared. This section first details the two simulation environments.

Simulator Setup

The simulated environments have two dynamics modes, $\modeVar ∈ \{1, 2\}$, whose transition dynamics are given by a single integrator discretised using the forward Euler method (velocity times time). The modes are induced by different wind fields which are characterised by their drift $\windDrift{\modeInd}$ and process noise $\windTurbulence{\modeInd}$ terms. Each mode’s dynamics are given by,

The state domain is constrained to $x, y ∈ [-3, 3]$ and min/max controls are implemented by constraining the control domain to $\velocity_x, \velocity_y ∈ [-5, 5]$.
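
A minimal simulator along these lines is sketched below. The drift and noise magnitudes, the rectangle standing in for the fan's turbulent region and the sampling loop are all illustrative placeholders; only the single-integrator forward-Euler structure, the time step and the state/control bounds follow the description above.

#+begin_src python
import numpy as np

DT = 0.25                                            # time step used for sampling
DRIFT = [np.zeros(2), np.array([0.8, 0.0])]          # calm (0) vs turbulent (1) drift
NOISE = [0.01, 0.2]                                  # process noise per mode

def in_turbulent_region(state):
    # placeholder geometry for the fan's wake
    return (-1.0 < state[0] < 1.0) and (state[1] > 0.0)

def quadcopter_step(state, control, rng):
    """Forward-Euler single-integrator step with mode-dependent drift and noise."""
    k = 1 if in_turbulent_region(state) else 0
    velocity = np.clip(control, -5.0, 5.0)           # min/max controls
    next_state = (state + DT * (velocity + DRIFT[k])
                  + np.sqrt(DT) * NOISE[k] * rng.standard_normal(2))
    return np.clip(next_state, -3.0, 3.0)            # state domain constraint

# collect a toy data set of (state, control, state difference) transitions
rng = np.random.default_rng(0)
state, transitions = np.array([-2.0, -2.0]), []
for _ in range(100):
    control = rng.uniform(-5.0, 5.0, size=2)
    next_state = quadcopter_step(state, control, rng)
    transitions.append((state, control, next_state - state))
    state = next_state
#+end_src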

Data set We sampled $4000$ state transitions from each environment with $Δ \timeInd = 0.25s$. We then removed a subset of the state transitions to induce a region of high epistemic uncertainty in the learned dynamics model. This enabled the methods’ ability to avoid regions of high epistemic uncertainty to be tested. cref:fig-dataset-scenario-both shows the two environments as well as the state transition data sets that were sampled from them. In Environment 2, the turbulent dynamics mode and the unobserved regions are in different locations to Environment 1. Further to this, the region with no observations does not overlap the mode boundary like in Environment 1.

Simulator

Model Learning

Following the experiments in cref:sec-traj-opt-results-brl, the model from cref:chap-dynamics was instantiated with $\ModeInd=2$ experts, one to represent the desired dynamics mode and one to represent the turbulent dynamics mode. The experiments used the same \acrshort{gp} priors and hyperparameters as the real-world experiments in cref:sec-traj-opt-results-brl. However, in contrast to the real-world experiments, the simulated experiments presented here learn the full transition dynamics model, i.e. they do not assume constant controls. cref:fig-traj-opt-gating-network-7,fig-traj-opt-gating-network-5 show the gating network posteriors after training on the data sets from Environment 1 and Environment 2 respectively. In both environments, Expert 1 represents the turbulent dynamics mode and Expert 2 represents the operable, desired dynamics mode. This results from Expert 1 learning much higher drift and process noise than Expert 2.

Learned models

Performance Indicators

Before evaluating the three trajectory optimisation algorithms in the simulated environments, let us restate the goals from cref:chap-traj-opt-control:

Goal 1
Navigate to the target state $\targetState$,
Goal 2
Remain in the operable, desired dynamics mode $\desiredMode$,
Goal 3
Avoid regions of the learned dynamics with high epistemic uncertainty,
Goal 3.1
in the desired dynamics mode $\latentFunc\desiredMode$, i.e. where the underlying dynamics are not known,
Goal 3.2
in the gating network $\modeVar$, i.e. where it is not known which mode governs the dynamics.

The performance of the trajectory optimisation algorithms are evaluated using four performance indicators:

  1. Goal 2 is measured using the probability of remaining in the desired mode without the model’s uncertainty marginalised,

    This probability will only decrease when a trajectory leaves the desired mode.

  2. Goal 3.1 is measured using the state variance accumulated from cascading single-step predictions,

    This will increase when a trajectory passes through regions of the desired dynamics mode with high epistemic uncertainty.

  3. Goal 3.2 is measured using the gating function variance accumulated from cascading single-step predictions,

    This will increase when a trajectory passes through regions of the gating network with high epistemic uncertainty.

  4. Probability of remaining in the desired mode with the model’s uncertainty marginalised, calculated using cref:eq-mode-chance-constraint-integral,

    This probability will decrease when a trajectory leaves the desired mode and when it passes through regions of the learned dynamics with high uncertainty.

The methods’ ability to remain in the desired dynamics mode $\desiredMode$, that is, to accomplish Goal 2, is measured using the probability of remaining in the desired dynamics mode (cref:eq-mode-probability-no-unc). The secondary goal of avoiding regions of the learned dynamics model with high epistemic uncertainty in the desired dynamics mode (Goal 3.1) is measured by summing the state variance over each trajectory (cref:eq-metric-state-var). Similarly, each method’s ability to avoid regions of the gating network with high epistemic uncertainty (Goal 3.2) is measured by summing the gating function variance over each trajectory (cref:eq-metric-gating-var). Intuitively, the goal is to maximise the probability of being in the desired mode, whilst minimising the variance accumulated over a trajectory. This corresponds to maximising the probability of remaining in the desired dynamics mode with all of the model’s epistemic uncertainty marginalised, i.e. maximising cref:eq-mode-probability-all-unc.

Table

Results

intro

Three settings of the tunable $λ$ parameter were tested for both of the geometry-based methods in each environment. The \acrfull{cai} method from cref:sec-traj-opt-inference was tested with Gaussian controls (gauss) and with deterministic (Dirac) controls. This enabled the maximum entropy control behaviour (resulting from Gaussian controls) to be compared to a baseline without maximum entropy control. cref:tab-results-sim-envs summarises all of the results from both environments.

Initialisation All of the \acrfull{dre} and \acrfull{cai} experiments used a horizon of $\TimeInd=20$ time steps and all of the \acrfull{ig} experiments were initialised with $I=20$ collocation points. Further to this, all of the \acrshort{ig} experiments were initialised with straight line trajectories between $\state_0$ and $\targetState$, except for one experiment in Environment 2, named \acrshort{ig} λ=1.0 (mid point), which was initialised with two straight lines, between $\state_0$, $[-2.0, -2.0]$ and $\targetState$. This is because the \acrshort{ig} experiments struggled to find solutions in Environment 2 – due to local optima – without adjusting the initial solution.

Visualisation cref:fig-all-traj-opt-prob,fig-all-traj-opt-mean,fig-all-traj-opt-variance show the trajectory optimisation results for both environments overlayed on their associated gating network posteriors. In each figure, the left-hand column shows results for Environment 1 whilst the right-hand column shows the results for Environment 2. The figures show the state trajectories obtained from cascading single-step predictions through the desired mode’s \acrshort{gp} dynamics using the controls found by the \acrshort{ig} method (top row), the \acrshort{dre} method (middle row) and the \acrshort{cai} method (bottom row). cref:fig-all-traj-opt-prob shows the optimised trajectories overlayed on the desired mode’s ($\modeVar=2$) mixing probability. This is useful for seeing how well trajectories remain in regions of the learned dynamics model with high probability of being in the desired mode. cref:fig-all-traj-opt-mean shows the trajectories overlayed on the gating function’s posterior mean. As the geometry-based methods in this chapter are based on finding shortest trajectories on this manifold, the posterior mean plot is useful for observing contour following behaviour. As the methods should find trajectories that avoid regions of the gating network with high epistemic uncertainty, cref:fig-all-traj-opt-variance shows them overlayed on the gating function’s posterior variance. Because the \acrshort{ig} method is a two-stage process, which first finds a state trajectory using the collocation solver and then recovers the controls using the inference strategy in cref:eq-control-elbo-svgp, its collocation state trajectory is compared to its dynamics trajectory in cref:fig-collocation-traj-opt. The three methods are now compared by analysing how well they achieve the three goals.

Collocation comparison

Goal 1 - Navigate to the Target State

The methods are first evaluated at their ability to navigate to the target state, that is, achieve Goal 1.

\acrfull{ig} The \acrshort{ig} method’s collocation solver was able to find state trajectories to the target state in all experiments. This is indicated by the collocation state trajectories $\bar{\mathbf{z}}$ (left-hand plots in cref:fig-collocation-traj-opt) reaching the target state $\targetState$ in both environments. This is because the collocation solver $\text{IG}_{\text{collocation}}$ is guaranteed to satisfy the boundary conditions. However, the inference strategy cannot always recover controls that drive the system along the collocation solver’s state trajectory and thus to the target state $\targetState$.

In Environment 1, the inference strategy successfully recovered controls that drive the system along the collocation solver’s state trajectory. This is indicated by the dynamics trajectory $\bar{\mathbf{x}}$ (right in cref:fig-collocation-traj-opt-7) following the state trajectory $\bar{\mathbf{z}}$ found by the collocation solver $\text{IG}_{\text{collocation}}$ (left in cref:fig-collocation-traj-opt-7). However, in Environment 2, the inference strategy could not recover controls that drive the system along the collocation solver’s state trajectory. This can be seen by the dynamics trajectories (right in cref:fig-collocation-traj-opt-5) differing from the collocation trajectories (left in cref:fig-collocation-traj-opt-5).

One possible explanation is that the \acrshort{elbo} in cref:eq-control-elbo-svgp could not recover the controls due to a local optimum. The collocation state trajectories (left) in the experiments with $λ=0.5, 1.0, 5.0$, which were initialised with straight line trajectories, passed through the turbulent dynamics mode. As this region belongs to the turbulent dynamics mode, the desired mode’s dynamics \acrshort{gp} has high epistemic uncertainty in this region. This high uncertainty may have resulted in a low \acrshort{elbo} when trajectories pass through this region and, as a result, the gradient-based optimiser may have become stuck in a local optimum before this region. Further to this, the experiment that was initialised with a mid point at $[-2.0, -2.0]$ (purple) was not able to recover the correct controls, even though the collocation state trajectory (purple squares) does not pass through the turbulent dynamics mode. Monitoring the optimisation showed that the control trajectory hit a local optimum. Therefore, it is possible that this experiment also became stuck in this local optimum. Although not tested, it may be possible to overcome this issue with random restarts, i.e. different initial control trajectories.

\acrfull{dre} All of the \acrshort{dre} experiments (middle row of cref:fig-all-traj-opt-prob,fig-all-traj-opt-mean,fig-all-traj-opt-variance) were able to find trajectories that navigate to the target state $\targetState$ under the desired mode’s \acrshort{gp} dynamics. It is worth noting that this method does not guarantee that trajectories will satisfy the boundary conditions. However, setting the terminal state cost matrix $\terminalStateCostMatrix$ to be very high appears to work well.

\acrfull{cai} All of the \acrshort{cai} experiments successfully navigated to the target state under the dynamics. This is indicated by the trajectories in the bottom row of cref:fig-all-traj-opt-prob,fig-all-traj-opt-mean,fig-all-traj-opt-variance successfully navigating to the target state $\targetState$. Similarly to the \acrshort{dre} method, this approach is not guaranteed to find trajectories that will end at the target state. However, setting a high terminal state cost matrix $\terminalStateCostMatrix$ worked well.

All prob

All mean

Goal 2 - Remain in the Desired Mode

The experiments are now evaluated at their ability to remain in the desired dynamics mode, that is, their ability to achieve Goal 2.

Indirect optimal control

\newline

\acrfull{ig} None of the \acrshort{ig} experiments were able to remain in the desired dynamics mode successfully. In all but one of the experiments, this was due to the collocation solver $\text{IG}_{\text{collocation}}$ finding trajectories that pass over the mode boundary into the turbulent dynamics mode. This is indicated by the collocation state trajectories in the left-hand plots of cref:fig-collocation-traj-opt passing directly through the turbulent dynamics mode. This motivated a further experiment, to see if the collocation solver was getting stuck in a local optimum induced by the initial trajectory. The $\text{IG}_{\text{collocation}}$ $λ=1.0$ (mid point) experiment (purple squares in cref:fig-collocation-traj-opt-5) was initialised as two straight lines with a mid point at $[-2.0, -2.0]$ to see if setting the initial trajectory not to pass over the mode boundary could enable it to find a mode remaining trajectory. cref:fig-collocation-traj-opt-5 indicates that with this initialisation, the collocation solver was able to find a mode remaining trajectory. This result indicates that the collocation solver is susceptible to local optima induced by the initial trajectory. Note that the need to initialise the collocation solver in this way limits the method’s applicability. For example, how should the initial solution be set in environments where the state-space cannot be visualised as easily?

In Environment 1, although none of the \acrshort{ig} experiments successfully remained in the desired dynamics mode, they did exhibit some mode remaining behaviour. That is, the trajectories with $λ=0.5$ (blue) and with $λ=1.0$ (green) in cref:fig-collocation-traj-opt-7 navigate to the left and almost reach the mode boundary. However, they could not find solutions that remained in the desired mode for the entire trajectory.

Direct optimal control

\newline \acrfull{dre} All but one of the \acrshort{dre} experiments were able to remain in the desired dynamics mode successfully. This is indicated by the dynamics trajectories in the middle row of cref:fig-all-traj-opt-prob,fig-all-traj-opt-mean,fig-all-traj-opt-variance not passing over the mode boundary into the turbulent dynamics mode. From visual inspection of the middle rows in cref:fig-all-traj-opt-prob,fig-all-traj-opt-mean,fig-all-traj-opt-variance, the trajectories found with \acrshort{dre} $λ=1.0$ (green stars) have the most clearance from the turbulent dynamics in both environments, compared to the other \acrshort{dre} $λ$ experiments. The trajectories in the middle row of cref:fig-all-traj-opt-mean show the contour following that achieves the mode remaining behaviour; this is particularly evident in the second half of the Environment 2 trajectories in the right-hand plot. cref:tab-results-sim-envs confirms that in both environments, the trajectory found with \acrshort{dre} $λ=1.0$ obtained the highest probability of remaining in the desired dynamics mode when not considering the model’s epistemic uncertainty.

Although the trajectories found with \acrshort{dre} $λ=0.5, 1.0$ in Environment 1 obtained the highest probability when not considering the epistemic uncertainty, they did not obtain the highest probabilities when considering it. This is because they both accumulate high gating function variance when they pass over the region of high epistemic uncertainty in the gating network. This indicates that it is important to consider the model’s epistemic uncertainty when finding mode remaining trajectories.

The trajectory found with \acrshort{dre} $λ=20.0$ in Environment 1 obtained the lowest probability of remaining in the desired dynamics mode. Note that increasing the relevance of the covariance term (increasing $λ$) is equivalent to decreasing the relevance of the mode remaining term. As a result, the optimisation landscape has favoured trajectories that avoid the region of high posterior variance in the gating network more than following a true geodesic path on the mean of the gating function. This can be seen in cref:fig-all-traj-opt-variance where the trajectory found with \acrshort{dre} $λ=20.0$ (orange stars) did not successfully remain in the desired dynamics mode. This indicates that setting the relevance of the covariance term too high can have a negative impact on performance. These results suggest that care should be taken when adjusting the value of $λ$.

It is worth noting that in practice the mode chance constraints would inform us that the trajectory is not $δ-\text{mode remaining}$, so the trajectory would not be executed in the environment.
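As a rough illustration of such a check (the exact constraint is defined by cref:eq-mode-chance-constraint; the sketch below assumes, hypothetically, a per-time-step bound), a candidate trajectory could be screened before execution as follows.

#+begin_src python
def is_delta_mode_remaining(mode_probs, delta):
    """Screen a candidate trajectory before executing it in the environment.

    mode_probs: per-time-step probability of being in the desired mode under
    the learned model. Assumes (for illustration) that the chance constraint
    requires every time step to satisfy Pr(desired mode) >= 1 - delta.
    """
    return all(p >= 1.0 - delta for p in mode_probs)
#+end_src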

Control-as-inference

\newline

\acrfull{cai} It is clear from the bottom rows of cref:fig-all-traj-opt-prob,fig-all-traj-opt-mean,fig-all-traj-opt-variance that all \acrshort{cai} experiments found trajectories that remained in the desired dynamics mode. Further to this, the maximum entropy control term resulted in more clearance from the mode boundary. This is shown by the experiments with Gaussian controls \acrshort{cai} (gauss), indicated by black circles, having more clearance than the experiments with deterministic controls \acrshort{cai} (Dirac), indicated by the pink circles. This is a desirable behaviour that is expected from the maximum entropy control term. This observation is confirmed in cref:tab-results-sim-envs, where in both environments the experiments with maximum entropy control obtained the highest probabilities of remaining in the desired dynamics mode when considering the model’s epistemic uncertainty.

In Environment 2, both of the \acrshort{cai} experiments found discrete-time trajectories that remained in the desired dynamics mode. This is indicated by none of the trajectories’ time steps being in the turbulent dynamics mode. However, when interpolating the discrete-time trajectory for the Environment 2 experiment without maximum entropy control (pink circles), shown in the bottom right plots in cref:fig-all-traj-opt-prob,fig-all-traj-opt-mean,fig-all-traj-opt-variance, the trajectory crosses the mode boundary. This is an undesirable behaviour because in non-simulated environments this would correspond to the trajectory entering the turbulent dynamics mode. In contrast, the clearance resulting from the maximum entropy control term alleviates this issue.

mode remaining

\newline

It is clear from cref:fig-mode-conditioning-no-max-entropy-traj-opt-5,fig-mode-conditioning-max-entropy-traj-opt-5 that both of the \acrshort{cai} experiments found discrete-time trajectories that remained in the desired dynamics mode. This is indicated by none of the trajectories’ time steps being in the turbulent dynamics mode. However, when interpolating the discrete-time trajectory for the experiment without maximum entropy control, shown in cref:fig-mode-conditioning-no-max-entropy-traj-opt-5, the trajectory crosses the mode boundary. This is an undesirable behaviour, as in a non-simulated environment this would correspond to the trajectory entering the turbulent dynamics mode. Visual inspection of cref:fig-mode-conditioning-no-max-entropy-traj-opt-5,fig-mode-conditioning-max-entropy-traj-opt-5 shows that the maximum entropy control term resulted in a trajectory with more clearance from the mode boundary. This is a desirable behaviour that is expected from the maximum entropy control term. This observation is confirmed in cref:tab-results-sim-envs-inf, where the experiment with maximum entropy control obtained a higher probability of remaining in the desired dynamics mode when considering the model’s epistemic uncertainty.

Goal 3 - Avoid Regions of High Epistemic Uncertainty

intro

Finally, the methods are evaluated at their ability to avoid regions of the dynamics with high epistemic uncertainty. cref:tab-results-sim-envs shows the state variance and the gating function variance accumulated over each trajectory. In all Environment 1 experiments, the state variance accumulated over the trajectory (from cascading single-step predictions via moment matching) was fairly similar. This is due to the dynamics of the simulator being simple enough that the desired mode’s \acrshort{gp} can confidently interpolate into both the turbulent dynamics mode and the region which has not been observed. In the latter case, it is up to the gating network to model the epistemic uncertainty arising from limited training data.
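The following sketch illustrates how this accumulated state variance could be computed by cascading single-step predictions; ~gp_predict~ is a hypothetical helper returning the moment-matched mean and variance of the state difference, and the input–output cross-covariance terms of the full moment matching update are omitted for brevity.

#+begin_src python
import numpy as np

def accumulate_state_variance(x0_mean, x0_var, controls, gp_predict):
    """Cascade single-step GP predictions and sum the state variance."""
    mean = np.asarray(x0_mean, dtype=float)
    var = np.asarray(x0_var, dtype=float)
    total_var = 0.0
    for u in controls:
        # Moment-matched one-step prediction of the state difference.
        delta_mean, delta_var = gp_predict(mean, var, u)
        mean = mean + delta_mean   # x_{t+1} = x_t + delta_x_t
        var = var + delta_var      # cross-covariance terms omitted for brevity
        total_var += float(np.sum(var))
    return total_var
#+end_src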

In contrast to the experiments in Environment 1, the state variance accumulated over each trajectory does vary between experiments in Environment 2. However, the results in cref:tab-results-sim-envs suggest that the state variance does not significantly impact the mode remaining behaviour. This is indicated by no correlation between the mode probability and the state variance, which is likely due to extremely low state variance.

All variance
Indirect optimal control

\newline

\acrfull{ig} In the \acrshort{ig} experiments in Environment 1 with $λ=0.5, 1.0$, the trajectories navigate directly over the region of high posterior variance associated with the gating function. This can be seen in the top left plot in cref:fig-all-traj-opt-variance, by the blue and green diamond trajectories. In contrast, the \acrshort{ig} experiment with $λ=20.0$ (orange diamonds) avoided this region. However, this uncertainty avoiding behaviour came at the cost of passing straight through the turbulent dynamics mode. In Environment 2 the \acrshort{ig} experiments did not remain in the desired dynamics mode, nor did they exhibit any uncertainty avoiding behaviour. As mentioned previously, this was likely due to the collocation solver getting stuck in local optima as well as the inference strategy struggling to recover the correct controls.

Direct optimal control

\newline

\acrfull{dre} The \acrshort{dre} experiments exhibited the most uncertainty avoiding behaviour. This is indicated in cref:tab-results-sim-envs by the Environment 1 experiment with \acrshort{dre} $λ=20.0$ obtaining the lowest accumulation of gating function variance out of all the experiments. The middle left plot in cref:fig-all-traj-opt-variance shows that the trajectory found with \acrshort{dre} $λ=20.0$ (orange stars) avoids the region of high gating function variance. However, similarly to the \acrshort{ig} experiments, this came at the cost of leaving the desired dynamics mode. These results confirm that the covariance term in the expected Riemannian metric can encode the notion of avoiding regions of high epistemic uncertainty in the gating network. However, it also suggests that adjusting $λ$ to avoid regions of high epistemic uncertainty in the gating network is not always favourable and may result in the optimisation landscape favouring trajectories that leave the desired dynamics mode; essentially defeating the goal of introducing the $λ$ parameter in the first place.

In Environment 2 the trajectory found with \acrshort{dre} $λ=1.0$ accumulated the least state and gating function variance. Interestingly, cref:tab-results-sim-envs indicates that the experiment with \acrshort{dre} $λ=5.0$ resulted in less uncertainty avoiding behaviour. Further to this, the \acrshort{dre} $λ=1.0$ experiment, which accumulated the least amount of state and gating function variance, obtained the second highest probability of remaining in the desired dynamics mode when considering the model’s epistemic uncertainty. This result suggests that avoiding regions of high epistemic uncertainty is favourable for remaining in the desired dynamics mode. However, when combined with the result in Environment 1 it is not possible to draw this conclusion. Not only did increasing the value of $λ$ not always lead to an increase in the uncertainty avoiding behaviour, but increasing the uncertainty avoiding behaviour was also not always beneficial, as indicated by the Environment 1 results with \acrshort{dre} $λ=20.0$.

These results indicate that the covariance term – in the expected Riemannian metric – plays an important role at keeping trajectories in the desired dynamics mode. However, they also indicate that $λ$ alters both the mode remaining term and the covariance term, in a complex manner. As such, it is not always clear how the value of $λ$ should be set. This is especially the case in Environment 1, where the inoperable dynamics mode intersects the region of high epistemic uncertainty. In this case, the effect of adjusting the tunable $λ$ parameter is complex enough that it may be best to just leave it at $λ=1.0$.

Control-as-inference

\newline

\acrfull{cai} The \acrshort{cai} experiments in Environment 1 avoided the region of high epistemic uncertainty in the gating network. This is shown by the \acrshort{cai} (gauss) (black circles) and the \acrshort{cai} (Dirac) (pink circles) in the bottom left plot of cref:fig-all-traj-opt-variance, navigating far to the left around the region of high gating function variance. This uncertainty avoiding behaviour is confirmed in cref:tab-results-sim-envs, where the \acrshort{cai} experiments accumulated the least amount of gating function variance out of all the trajectories which remained in the desired dynamics mode. However, in Environment 2, the \acrshort{cai} experiments did not exhibit as much uncertainty avoiding behaviour in the gating network. In fact, they obtained higher accumulations of gating function variance by quite a margin. As mentioned earlier, this result suggests that avoiding regions of high epistemic uncertainty is not the most important factor for obtaining the highest probability of remaining in the desired dynamics mode. This change in behaviour is due to the relevance of the uncertainty avoiding behaviour being automatically handled by the marginalisation of the gating function in cref:eq-traj-opt-elbo.

Although marginalisation is a principled way of handling uncertainty, in this scenario, it can be conjectured otherwise. First, note that the quantitative performance measures do not tell the full story. This is because in the region with no observations, it is perfectly plausible that there is another region belonging to the turbulent dynamics mode, or a different dynamics mode altogether. In this case, the trajectory found with \acrshort{dre} $λ=5.0$ (red stars) is most likely to avoid entering the turbulent dynamics mode, because it avoids the region of high epistemic uncertainty associated with no observations. The probability of remaining in the desired dynamics mode does not capture this notion. There are two reasons for this. Firstly, the gating network is overconfident in this region due to learning a long lengthscale. This could be overcome by fixing the lengthscale a priori. However, there is a second issue due to the fundamental modelling approximation made by \acrshort{gps}. The \acrshort{gp}-based gating network is not aware that the region with no observations is so large that it could fit another turbulent dynamics mode inside it. This is due to the Squared Exponential kernel function relying on point-wise calculations to build the covariance matrix. Future work could explore methods that capture higher-order interactions, for example, the size and shape of modes.

In both environments, the \acrshort{cai} (gauss) experiments with maximum entropy control obtained lower accumulations of gating function variance than the \acrshort{cai} (Dirac) experiments without maximum entropy control. This further confirms that the maximum entropy control behaviour improves the performance of the \acrshort{cai} method.

epistemic uncertainty

\newline

Epistemic uncertainty Each experiment’s ability to avoid regions of high epistemic uncertainty is now evaluated.

In contrast to the experiments in Environment 1, the state variance accumulated over each trajectory does vary between experiments. However, the results in cref:tab-results-sim-envs suggest that the state variance does not have a large impact on the mode remaining behaviour. This is indicated by no correlation between the mode probability and the state variance, which is likely due to the state variance being extremely low. The trajectory found with $λ=1.0$ accumulated the least state and gating function variance. In contrast, the result for $λ=0.01$ accumulated the highest state and gating variance, whilst also obtaining the highest probability of remaining in the desired dynamics mode. As mentioned earlier, this result suggests that avoiding regions of high epistemic uncertainty is not the most important factor for obtaining the highest probability of remaining in the desired dynamics mode.

The control-as-inference experiment with maximum entropy control, shown in cref:fig-mode-conditioning-max-entropy-over-svgp-traj-opt-5, obtained a lower accumulation of gating function variance than the experiment without maximum entropy control. Relative to the geometry-based methods, the trajectories found with the control-as-inference method, shown in cref:fig-mode-conditioning-max-entropy-traj-opt-5,fig-mode-conditioning-no-max-entropy-traj-opt-5, do not exhibit as much uncertainty avoiding behaviour in the gating network. In fact, they obtained higher accumulations of gating function variance by quite a margin. In contrast, in Environment 1, the trajectories found with the control-as-inference method obtained some of the most uncertainty avoiding behaviour, indicated by some of the lowest accumulated gating function variance. This change in behaviour is due to the relevance of the uncertainty avoiding behaviour being automatically handled by the marginalisation of the gating function in cref:eq-traj-opt-elbo.

Although marginalisation is a principled way of handling uncertainty, in this scenario, it can be conjectured otherwise. First note that the quantitative performance measures do not tell the full story. This is because in the region with no observations, it is perfectly plausible that there is another region belonging to the turbulent dynamics mode, or a different dynamics mode altogether. In this case, the trajectory found with $λ=5.0$ is most likely to avoid entering the turbulent dynamics mode. This is because it avoids the region of high epistemic uncertainty associated with no observations. The probability of remaining in the desired dynamics mode does not capture this notion. There are two reasons for this. Firstly, the gating network is overconfident in this region due to learning a long lengthscale. This could be overcome by fixing the lengthscale a priori. However, there is a second issue due to the fundamental modelling approximation made by \acrshort{gps}. The \acrshort{gp}-based gating network is unaware that the region with no observations is so large that it could fit another turbulent dynamics mode inside it. This is due to the Squared Exponential kernel function relying on point-wise calculations to build the covariance matrix. Future work could explore methods that capture higher-order interactions, for example, the size and shape of modes.

The results for $λ=0.01$, $λ=0.5$ and $λ=1.0$ in cref:tab-results-sim-envs follow the results for Environment 1, suggesting that increasing $λ$ leads to more uncertainty avoiding behaviour in the gating network. The result for $λ=5.0$ in cref:tab-results-sim-envs does not follow this trend. However, visual inspection of the right-hand plots of cref:fig-riemannian-energy-traj-opt-over-svgp-5,fig-riemannian-energy-high-traj-opt-over-svgp-5 shows that the trajectories for $λ=1.0$ and $λ=5.0$ are initially similar whilst they avoid the region with high epistemic uncertainty in the gating network. The increased relevance of the covariance term with $λ=5.0$ then appears to have had a negative impact on the trajectories’ ability to remain in the desired dynamics mode. This can be seen in cref:fig-riemannian-energy-high-traj-opt-over-svgp-5, where the trajectory passes very close to the mode boundary. Again, this further emphasises the difficulty in setting $λ$ due to complex interactions between the mode remaining behaviour and the uncertainty avoiding behaviour.

Env 1 All mean /variance

Env 2 All mean /variance

Environment 1

Geodesic collocation 0.5

Geodesic collocation 1.0

Geodesic collocation 20

Riemannian energy 0.5

Riemannian energy 1

Riemannian energy 20

Mode Conditioning max entropy

Mode Conditioning NO max entropy

Environment 2

blah

\newline

The methods were then tested in a second simulated environment (Environment 2). In this environment, the turbulent dynamics mode and the unobserved regions are in different locations to Environment 1. cref:fig-dataset-scenario-5 shows the state transition data set that was sampled from Environment 2 and used to train the \acrshort{mosvgpe} dynamics model. cref:fig-traj-opt-gating-network-5 shows the gating network posterior after training on this data set. Expert 1 represents the turbulent dynamics mode and Expert 2 represents the operable, desired dynamics mode. This results from Expert 1 having much higher drift and process noise than Expert 2.

All of the direct optimal control and control-as-inference experiments used a horizon of $\TimeInd=20$ time steps and all of the indirect optimal control experiments were initialised with $I=20$ collocation points. Further to this, all of the indirect optimal control experiments were initialised with straight line trajectories between $\state_0$ and $\targetState$, except for the experiment in cref:fig-geodesic-mid-point-traj-opt-5, which was initialised with two straight lines, between $\state_0$, $[-2.0, -2.0]$ and $\targetState$. This is because the indirect geodesic control experiments struggled to find solutions in this environment – due to local optima – without adjusting the initial solution.

Four different settings of the $λ$ parameter were tested for the direct control method and three settings were tested for the indirect optimal control method. This was due to the indirect optimal control method’s collocation solver failing with $λ=0.01$. cref:tab-results-sim-envs summarises the direct optimal control and the control-as-inference results in Environment 2. The experiments for the indirect optimal control method are not included in cref:tab-results-sim-envs because none of them found trajectories to the target state.

Geodesic collocation 0.5

Geodesic collocation 1.0

Geodesic collocation 1.0 mid point

Geodesic collocation 5

Riemannian energy 0.01

Riemannian energy 0.5

Riemannian energy 1

Riemannian energy 5

Mode Conditioning max entropy

Mode Conditioning NO max entropy

Conclusion

intro

This section details the main findings from the experiments, compares the methods and discusses directions for future work.

Geometry-based methods The \acrshort{ig} method’s collocation solver is susceptible to local optima around its initial state trajectory. Although this can be overcome by engineering the initial solution, it makes it difficult to get the method working in practice. On top of this, the variational inference strategy used to recover the controls does not always recover the correct controls. This may be due to the collocation solver’s state trajectory not satisfying the dynamics constraints or the variational inference getting stuck in local optima. Regardless, this is a big limitation that made the \acrshort{ig} method fail in some environments.

Overall, the \acrshort{dre} method performed significantly better than the \acrshort{ig} method in the experiments. Firstly, two of the \acrshort{dre} experiments in Environment 1 remained in the desired dynamics mode, whilst the \acrshort{ig} experiments could not. Further to this, the \acrshort{dre} approach worked in Environment 2, where it was not possible to get the \acrshort{ig} method working, without engineering the initial solution. Not only did the \acrshort{dre} approach perform better in the experiments, but it is also significantly easier to configure. That is, setting the optimisation parameters is straightforward. In contrast, setting the parameters for the \acrshort{ig} method is not. In particular, setting the upper and lower bounds for the collocation constraints is pernickety. In conclusion, the \acrshort{dre} method is the preferred geometry-based method for finding mode remaining trajectories.

How to set $λ$? Although increasing $λ$ generally leads to more uncertainty avoiding behaviour in the gating network, this is not necessarily the case. Further to this, avoiding regions of high epistemic uncertainty in the gating network does not always lead to higher probabilities of remaining in the desired dynamics mode under cref:eq-mode-probability-all-unc. As a result, care should be taken if/when adjusting $λ$.

The quantitative results indicate that good performance is generally achieved with $λ=1.0$. That is, not modifying the expected Riemannian metric tensor. Although some benefits can be obtained via setting $λ$, for example, encoding a notion of risk-sensitivity for avoiding the region with no observations, it is not always clear how it should be set. Further to this, in many realistic scenarios, the state-space will not be easy to visualise like in the 2D quadcopter experiments. For these reasons, I conclude that it is best to err on the side of caution when setting $λ$. That is, $λ$ should remain at $1.0$ unless it is immediately clear how it should be set.

\acrfull{cai} The \acrshort{cai} method performed well in all experiments. The experiments showed that the maximum entropy control behaviour obtained from using normally distributed controls improves the mode remaining behaviour and works well in practice.

Overall The results in cref:tab-results-sim-envs indicate that the \acrshort{cai} (gauss) experiments were the highest performing in both environments. They obtained the highest probability of remaining in the desired dynamics mode when considering the model’s epistemic uncertainty. Visual inspection combined with the results in cref:tab-results-sim-envs further indicates that the \acrshort{dre} $λ=1.0$ experiments performed well in both environments. Not only did the \acrshort{dre} experiments avoid leaving the desired dynamics mode, but they also avoided the region of the gating network with high epistemic uncertainty more so than the \acrshort{cai} experiments. Both the \acrfull{dre} method and the \acrfull{cai} method with maximum entropy control (\acrshort{cai} (gauss)) are competitive approaches for finding mode remaining trajectories.

Discussion & Future Work

This section compares the three control methods presented in cref:sec-traj-opt-collocation,sec-traj-opt-energy,chap-traj-opt-inference and suggests future work to address their limitations. cref:tab-geometry-control-comparison offers a succinct comparison of the three approaches.

Dynamics constraints Both the \acrshort{cai} method presented in cref:chap-traj-opt-inference and the \acrshort{dre} approach presented in cref:sec-traj-opt-energy find trajectories that satisfy the dynamics constraints, i.e. the learned dynamics. They achieve this by enforcing the distribution over state trajectories to match the distribution from cascading single-step predictions through the desired mode’s dynamics \acrshort{gp}. This can be seen as a method for approximating the integration of the controlled dynamics with respect to time, whilst considering the epistemic uncertainty associated with learning from observations. The chance constraints ensure the controlled system is $δ-\text{mode remaining}$, which makes the approximate dynamics integration valid. In contrast, the \acrshort{ig} method from cref:sec-traj-opt-collocation does not find trajectories that are guaranteed to satisfy the dynamics constraints. This is a limiting factor that makes the \acrshort{ig} method less appealing. Therefore, methods for incorporating the dynamics constraints into \acrshort{ig} are an interesting direction for future work.

Decision-making under uncertainty The combination of the approximate dynamics integration and the chance constraints in the \acrshort{dre} method leads to a closed-form expression for the expected cost. This expression considers the epistemic uncertainty associated with the learned dynamics model, both in the desired dynamics mode and in the gating network. Similarly, for the \acrshort{cai} method, the \acrshort{elbo} handles the uncertainty in both the desired dynamics mode and the gating network in a principled manner. In contrast, the \acrshort{ig} method ignores the uncertainty accumulated by cascading single-step predictions through the learned dynamics model. That is, it considers the uncertainty associated with the gating network and ignores any uncertainty in the desired dynamics mode. Although this did not have a massive impact in the simulated quadcopter experiments, it may have a larger impact in real-world systems with more complex dynamics modes.

Uncertainty propagation This work only considered the moment matching approximation for propagating uncertainty through the probabilistic dynamics model. cite:chuaDeep2018 test different uncertainty propagation schemes in Bayesian neural networks. It would be interesting to test sampling-based approaches for the \acrshort{dre} approach. For example, to see the impact of calculating the expected Riemannian energy over a trajectory without approximations.

Decouple goals Without decoupling the uncertainty avoiding behaviour from the mode remaining behaviour, it is not possible to find trajectories which avoid entering the region of high epistemic uncertainty in the gating network. The geometry-based approaches provide a mechanism to set the relevance of the covariance term, i.e. decouple the goals. This was achieved by augmenting the expected Riemannian metric tensor with the user-tunable weight parameter $λ$, which can be adjusted to determine the relevance of the covariance term. The experiments show that although adjusting $λ$ can be beneficial in some scenarios, it is not necessarily straightforward to set. In particular, when regions of high epistemic uncertainty intersect with mode boundaries. Therefore, care should be taken when setting $λ$. The \acrshort{cai} method does not provide a mechanism for adjusting the relevance of each behaviour. Instead, it relies on the marginalisation of the latent variables in the \acrshort{elbo} (cref:eq-traj-opt-inference) to automatically handle it, without requiring human input. This works well in practice but may be a limitation if it is desirable to decouple the goals and balance the mode remaining and uncertainty avoiding behaviour.

Multiple desired modes Although not tested, the approaches are theoretically sound and should be applicable in systems with more than two dynamics modes. However, it is interesting to consider if this is even necessary given the goals. For example, in the quadcopter experiment, the dynamics model was intentionally instantiated with two dynamics modes, even though there could be more in reality. The desired dynamics mode was engineered to have a noise variance that was deemed operable. The other dynamics mode was then used to explain away all of the inoperable modes. In most scenarios, a similar approach could be followed. Nevertheless, it is interesting to consider systems with more than two modes. This is because the main goal is to avoid entering the inoperable dynamics mode, not just remain in the desired dynamics mode. Although not tested here, the \acrshort{cai} method should be able to remain in more than one desired dynamics mode. This can be achieved by conditioning on a set of modes. In contrast, the geometry-based methods in cref:chap-traj-opt-geometry are only capable of remaining in a single dynamics mode.

Continuous-time trajectories The \acrshort{dre} and \acrshort{cai} methods required control regularisation to prevent state transitions that “jump” over the undesired mode. Although the discrete-time steps of the trajectory appear to satisfy the constraints and minimise the cost, in reality, the continuous-time trajectory may pass through the undesired mode. This is a general problem that arises when solving continuous-time problems in discrete time. In contrast, the \acrshort{ig} method uses a collocation solver where the cost can be evaluated at arbitrary points along the continuous-time trajectory. An interesting direction for future work is to extend the \acrshort{dre} and \acrshort{cai} methods to find trajectories in continuous time. For example, cite:mukadamContinuoustime2018,dongMotion2016 use \acrshort{gps} for continuous-time trajectories when solving motion planning via probabilistic inference.

State-control dependent modes This chapter assumed that the dynamics modes were separated according to their disjoint state domains, i.e. $\stateDomaini ∩ \stateDomainj = ∅$ for distinct $i, j ∈ \{1, \ldots, \ModeInd\}$. It would be interesting to extend this work to systems where the modes are governed by both state and control. For example, flying at high speed through a wind field may be deemed an operable dynamics mode, whilst flying at low speed may not.

Dynamics models The \acrshort{cai} method can be deployed in a wider range of dynamics models than the geometry-based methods. First of all, it is not limited to differentiable mean and covariance functions for the gating function \acrshort{gps}. Secondly, it can be deployed in dynamics models learned with any \acrshort{mogpe} method. This is because all \acrshort{mogpe} methods consist of a probability mass function over the expert indicator variable. In contrast, the geometry-based methods are limited to the \acrshort{mosvgpe} method because they depend on its \acrshort{gp}-based gating network. Further to this, the \acrshort{cai} method is applicable in systems where the dynamics modes are not necessarily separated according to their state domains $\mode{\stateDomain}$. This is because it does not rely on the state-dependent gating functions.

Real-time control Whilst real-time control requires efficient algorithms, “offline” trajectory optimisation can trade computational cost for greater accuracy. This work is primarily interested in finding trajectories that attempt to remain in the desired dynamics mode. For simplicity, it has considered the “offline” trajectory optimisation setting. The increased computational time may hinder its suitability for obtaining a closed-loop controller via \acrshort{mpc} citep:eduardof.Model2007. However, it can be used “offline” to generate reference trajectories for a tracking controller or for guided policy search in a \acrshort{mbrl} setting citep:levineGuided2013. Alternatively, future work could investigate approximate inference methods for efficient state estimation to aid with real-time control, e.g. \acrshort{ilqg}/\acrshort{gp}.

Infeasible trajectories It is worth noting that it might not be possible to find a trajectory between a given start state $\state_0$ and a target state $\targetState$, that satisfies the chance constraints. This may be due to either the desired dynamics mode being uncertain, or the gating network being uncertain. In this scenario, it is desirable to explore the environment and update the learned dynamics model with this new experience. This will reduce the epistemic uncertainty in the model, increasing the likelihood of being able to find trajectories that satisfy the chance constraints. This motivates the work in cref:chap-active-learning, which addresses exploration in multimodal dynamical systems, whilst attempting to remain in the desired dynamics mode.

Summary

This chapter has evaluated and compared the \acrfull{ig} method from cref:sec-traj-opt-collocation, the \acrfull{dre} method from cref:sec-traj-opt-energy and the \acrfull{cai} method from cref:chap-traj-opt-inference. The methods’ abilities to navigate to a target state, whilst remaining in the desired dynamics mode, were evaluated on two velocity controlled quadcopter navigation problems. The results in this chapter have verified that the latent geometry of the \acrshort{mosvgpe} gating network can be used to encode mode remaining behaviour into control strategies. However, the results indicate that the \acrshort{ig} method is not only the lowest performing method but is also the hardest method to configure. In contrast, the \acrshort{dre} and \acrshort{cai} methods work well in practice, whilst being much easier to configure.

Although the geometry-based methods have a tunable parameter $λ$ for balancing the mode remaining and the uncertainty avoiding behaviour, it does not work well in practice. In contrast, the \acrshort{cai} method automatically balances the behaviours by marginalising the uncertainty in its \acrshort{elbo}.

From the experiments in this chapter, it can be concluded that both the \acrshort{dre} and \acrshort{cai} methods are competitive approaches for finding mode remaining trajectories. They find trajectories that navigate to the target state, satisfy the dynamics constraints and attempt to remain in the desired dynamics mode. Further to this, it is easy to verify that they are $δ-\text{mode remaining}$ by evaluating the mode chance constraints.

Mode Remaining Exploration for Model-Based Reinforcement Learning label:chap-active-learning

maths

Intro

Similarly to cref:chap-traj-opt-control, this chapter is concerned with controlling unknown, or partially unknown, multimodal dynamical systems, from an initial state $\state_0$ – in the desired dynamics mode $\desiredMode$ – to a target state $\targetState$, whilst guaranteeing that the controlled system remains in the desired dynamics mode. However, in contrast to previous chapters, it does not assume prior access to the environment. That is, it considers the more realistic scenario, where the agent must iteratively explore its environment, collect data and update its dynamics model – whilst remaining in the desired dynamics mode – until it can confidently navigate to the target state $\targetState$. Following previous chapters, this chapter also considers model-based approaches where the dynamics model is learned from observations.

The main contribution of this chapter is an explorative trajectory optimisation algorithm that can explore multimodal environments with a high probability of remaining in the desired dynamics mode. This chapter further proposes an approach to solve the mode remaining navigation problem from cref:problem-statement-main by consolidating all of the work in this thesis.

This chapter is organised as follows. cref:problem-statement-explore formally states the problem and the assumptions that are made in this chapter. cref:sec-exploration then details the exploration strategy proposed in this work and cref:sec-modeopt presents \acrfull{modeopt}, a \acrshort{mbrl} algorithm for solving the mode remaining navigation problem in cref:problem-statement-main. Finally, cref:sec-preliminary-results presents preliminary results in a simulated quadcopter environment and cref:sec-future-work-explore discusses \acrshort{modeopt} and details directions for future work.

Problem Statement label:problem-statement-explore

intro

Similarly to cref:chap-traj-opt-control, the goal of this chapter is to solve the mode remaining navigation problem in cref:problem-statement-main. That is, to navigate from an initial state $\state_0$ – in the desired dynamics mode $\desiredMode$ – to the target state $\targetState$, whilst guaranteeing that the controlled system remains in the desired dynamics mode. However, this chapter does not assume a priori access to the environment, such that a data set of state transitions can be collected and used to learn a dynamics model. Instead, the agent must incrementally explore the environment without violating the mode remaining constraints in cref:def-mode-remaining-main.

Given that this work leverages a learned dynamics model, it is not possible to find trajectories that are mode remaining according to cref:def-mode-remaining-main. Following cref:chap-traj-opt-control, this work relaxes the mode remaining constraint to be $δ-\text{mode remaining}$, i.e. mode remaining with high probability. Formally this problem is given by,

where the dynamics are given by cref:eq-dynamics-main, the objective is given by cref:eq-optimal-control-objective, and the mode remaining constraint in cref:eq-chance-constraint-delta-mode is from the $δ-\text{mode remaining}$ definition in cref:def-delta-mode-remaining.

It is worth noting that the agent would not be able to explore the environment without relaxing the mode remaining constraint from cref:def-mode-remaining-main. Intuitively, the more the $δ-\text{mode remaining}$ constraint is relaxed, the more the controller $π$ can explore. However, this will also increase the controller’s chance of leaving the desired dynamics mode. Thus, the algorithm should expand the region that is known to belong to the desired dynamics mode towards the target state $\targetState$, without violating the $δ-\text{mode remaining}$ constraint.

Initial Model

Initial Mode remaining controller

Initial mode remaining controller In robotics applications, an initial set of poorly performing controllers can normally be obtained via simulation or domain knowledge. This work assumes access to an initial data set of state transitions $\dataset_0 = \{(\state_n, \control_n), Δ \state_n\}_{n=1}^{N_0}$ collected from around the start state $\state_0$.

This data set is used to learn a predictive dynamics model $\dynamicsModel$ which is locally accurate around the start state $\state_0$. cref:fig-explorative-traj-opt-7-initial shows the \acrshort{mosvgpe}’s gating network posterior after training on the initial data set $\dataset_0$ for the quadcopter navigation problem in the illustrative example from cref:illustrative_example. cref:fig-explorative-data-over-prob-7-initial illustrates that the desired dynamics mode is $\desiredMode=2$ and cref:fig-explorative-data-over-svgp-7-initial shows that the \acrshort{gp} posterior over the gating function is uncertain away from the initial data set $\dataset_0$, indicated by high posterior variance (red). This is further demonstrated by the probability mass function over the mode indicator variable in cref:fig-explorative-data-over-prob-7-initial, tending to a uniform distribution, i.e. maximum entropy. Although this model can be used to learn an initial controller, it may not work outside of the initial state domain $\initialStateDomain$ and may not be able to find $δ-\text{mode remaining}$ trajectories to the target state $\targetState$, due to the model having high epistemic uncertainty.

Mode Optimisation label:sec-mode-optimisation

intro

This section details our method for solving the mode remaining navigation problem by consolidating all of the work in this thesis. The method is named \acrfull{modeopt}. At its core, \acrshort{modeopt} learns a single-step dynamics model $\dynamicsModel$ using the \acrshort{mosvgpe} model from cref:chap-dynamics. Given this model, the mode remaining control methods in cref:chap-traj-opt-control can be used to find trajectories to the target state $\targetState$. The mode chance constraints from cref:eq-mode-chance-constraint can then be used to check if these trajectories are $δ-\text{mode remaining}$ under the learned dynamics model $\dynamicsModel$. Initially, it will not be possible to find $δ-\text{mode remaining}$ trajectories under the learned dynamics model, due to high epistemic uncertainty. In this case, the agent must explore the environment, collect data and update its dynamics model. Eventually, the agent will reduce the model’s epistemic uncertainty such that it can find $δ-\text{mode remaining}$ trajectories to the target state $\targetState$. Ideally, the exploration strategy will have some guarantee of remaining in the desired dynamics mode. cref:fig-mode-opt-loop illustrates this process in a flowchart.
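The flowchart can be summarised by the following high-level sketch; all helper functions are placeholders passed in as callables rather than the actual implementation, and the precise procedure is given in cref:alg-mode-opt.

#+begin_src python
def mode_opt(train_model, plan_to_target, check_constraints, explore, rollout,
             dataset, max_episodes):
    """High-level sketch of the ModeOpt loop (placeholder callables).

    plan_to_target(model) returns a candidate control trajectory to the target,
    check_constraints(model, controls) evaluates the mode chance constraints,
    explore(model) performs mode remaining exploration and rollout(controls)
    executes controls in the environment and returns new transitions.
    """
    for episode in range(max_episodes):
        model = train_model(dataset)            # update the learned dynamics model
        controls = plan_to_target(model)        # mode remaining trajectory optimisation
        if check_constraints(model, controls):  # delta-mode remaining under the model?
            return controls
        explore_controls = explore(model)       # reduce epistemic uncertainty safely
        dataset = dataset + rollout(explore_controls)
    return None
#+end_src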

Flowchart

After flowchart

\newline

$δ-\text{mode remaining}$ exploration In order to implement such an algorithm, we require a method for exploring the environment with mode remaining guarantees. That is, a controller that will explore the environment whilst ensuring the controlled system remains in the desired dynamics mode with high probability, i.e. satisfy cref:eq-mode-remaining-def-explore.

Exploration Strategies label:sec-exploration-strategies

The performance of \acrshort{modeopt} depends on its ability to explore the environment. This section recaps some relevant exploration strategies that fit into the general \acrfull{mbrl} procedure shown in cref:alg-mbrl.

Greedy exploitation One of the most commonly used exploration strategies is to select the controller that maximises the expected performance under the learned dynamics model $\dynamicsModel(f \mid \mathcal{D}_{0:i-1})$. Note that $i$ denotes the iteration/episode number of the \acrshort{mbrl} loop. This greedy strategy is given by,

This approach is used in \acrshort{pilco} citep:deisenrothPILCO2011 and \acrshort{gp}-\acrshort{mpc} citep:kamtheDataEfficient2018, where the dynamics are represented using \acrshort{gps} and the moment matching approximation is used to cascade single-step predictions. cite:parmasPIPPS2018 propose \acrshort{pipps}, a similar approach to \acrshort{pilco}, except that they use Monte Carlo methods to propagate uncertainty forward in time, instead of using the moment matching approximation. Similarly, \acrshort{pets} citep:chuaDeep2018 uses this exploration strategy but represents the dynamics using ensembles of probabilistic neural networks. This strategy initially favours exploring regions of the environment where the learned dynamics model is not confident, i.e. has high epistemic uncertainty. Once it has gathered knowledge of the environment and the model’s epistemic uncertainty has been reduced, it favours maximising the objective function $J(f, π)$. See cite:curiEfficient2020 for a recap of some commonly used \acrshort{mbrl} exploration strategies.
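To make the setting concrete, a generic sketch of the loop that these exploration strategies plug into is shown below; the callables are placeholders (not the thesis code), and ~optimise_controller~ would implement, for example, the greedy exploitation objective above.

#+begin_src python
def mbrl_loop(train_model, optimise_controller, rollout, dataset, num_episodes):
    """Generic model-based RL loop (placeholder callables).

    train_model(dataset) fits the dynamics model p(f | D_{0:i-1}),
    optimise_controller(model) returns a controller, e.g. via greedy exploitation,
    rollout(controller) executes it in the environment and returns transitions.
    """
    controller = None
    for episode in range(num_episodes):
        model = train_model(dataset)
        controller = optimise_controller(model)
        dataset = dataset + rollout(controller)  # append the new transitions
    return controller
#+end_src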

In contrast to standard \acrshort{mbrl}, the method presented in this chapter side-steps the challenging exploration-exploitation trade-off. That is, \acrshort{modeopt} does not require an objective that changes its exploration-exploitation balance as it gathers more knowledge of the environment. This is because it uses the $δ-\text{mode remaining}$ chance constraints to separate the exploitative mode remaining controller $\modeController$ from an explorative controller $\explorativeController$. However, the explorative controller $\explorativeController$ should seek to reduce the epistemic uncertainty in the learned dynamics model $\dynamicsModel$ such that $δ-\text{mode remaining}$ trajectories to the target state $\targetState$ can eventually be found. That is, the exploration should be targeted towards the target state $\targetState$.

MBRL Alg

Active learning

\newline

Active learning Active learning is another class of exploration algorithms. The goal of information-theoretic active learning is to reduce the number of possible hypotheses as fast as possible, e.g. minimise the uncertainty associated with the parameters using Shannon’s entropy citep:coverElements2006,
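As a hedged illustration only (a standard formulation of information-theoretic active learning, not necessarily the exact expression used in this thesis), the next query could be chosen to maximise the expected reduction in the entropy of the model parameters $θ$,

\begin{equation}
\state^{*} = \operatorname*{arg\,max}_{\state} \; \mathcal{H}\left[ θ \mid \dataset \right] - \mathbb{E}_{y \sim p(y \mid \state, \dataset)} \left[ \mathcal{H}\left[ θ \mid \dataset \cup \{(\state, y)\} \right] \right].
\end{equation}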

In contrast to greedy exploitation, active learning does not seek to maximise a black-box objective. Instead, it is only interested in exploration. There are many approaches to active learning in the static setting, i.e. in systems where an arbitrary state $\state$ can be sampled. In contrast, a dynamical system must be steered to $\state$ through the unknown dynamics $f$ via a sequence of controls $\controlTraj$. Thus, there is information gain along the trajectory which must also be considered. As highlighted by cite:buisson-fenetActively2020, information gain in dynamical systems is therefore fundamentally different to the static problem addressed by cite:krauseNearOptimal2008 and cite:houlsbyBayesian2011. The goal here is to pick the most informative control trajectory $\controlTraj$ whilst observing $\stateTraj$.

Recent work has addressed active learning in \acrshort{gp} dynamics models. cite:schreiterSafe2015 propose a greedy entropy-based strategy that considers the entropy of the next state. cite:buisson-fenetActively2020 also propose a greedy entropy-based strategy except that they consider the entropy accumulated over a trajectory. In contrast, cite:caponeLocalized2020,yuActive2021 propose to use the mutual information.

cite:caponeLocalized2020 find the most informative state as the one that minimises the mutual information between it and a set of reference states (a discretisation of the domain). They then find a set of controls to drive the system to this most informative state. Given a fixed number of time steps, their method yields a better model than the greedy entropy-based strategies. cite:yuActive2021 propose an alternative approach that leverages their \acrfull{gpssm} inference scheme to estimate the mutual information between all the variables in time $I \left[\mathbf{y}_{1:t}, \hat{\mathbf{y}}_{t+1} ; \mathbf{f}_{1:t+1} \right]$. Here $\mathbf{y}_{1:t}$ denotes the set of observed outputs and $\hat{\mathbf{y}}_{t+1}$ denotes the output predicted by the \acrshort{gpssm}. This contrasts other approaches, which tend to study the latest mutual information $I[\hat{\mathbf{y}}_{t+1} ; \mathbf{f}_{t+1}]$.

Myopic active learning In \acrshort{rl} and control it is standard to consider objectives over a (potentially infinite) horizon. Should active learning in dynamical systems be any different? This thesis hypothesises that it must be important to consider the information gain over a (potentially infinite) horizon, as opposed to myopically selecting the next input, i.e. only considering the next query point. The mutual information approaches in cite:caponeLocalized2020,yuActive2021 fall into this myopic category as they only maximise the information gain at the next time step. In contrast, the greedy entropy-based strategy in cite:buisson-fenetActively2020 considers the entropy over a horizon. The exploration strategy presented in this chapter follows a similar approach and relieves the myopia of active learning by leveraging a similar greedy entropy-based strategy over a finite horizon.

Mode Remaining Exploration label:sec-exploration

The performance of \acrshort{modeopt} depends on its ability to explore the environment. The exploration strategy is primarily interested in exploring a single desired dynamics mode whilst avoiding entering any of the other modes. This is a challenging problem because the agent must observe regions outside of the desired dynamics mode in order to learn that a particular region does not belong to the desired mode. To the best of our knowledge, there is no previous work addressing the exploration of a single dynamics mode in multimodal dynamical systems.

exploration

$δ-\text{mode remaining}$ exploration The exploration strategy presented here is primarily interested in exploring regions of the gating network that are uncertain. It relieves the myopia of active learning by considering trajectory optimisation with an entropy-based strategy over a finite horizon. Further to this, it ensures, in a principled manner, that solutions are $δ-\text{mode remaining}$ via the mode chance constraints in cref:eq-mode-chance-constraint. The objective combines entropy-based exploration of the gating network with the main goal of navigating to the target state $\targetState$. The exploration strategy is given by,

where $\dynamicsModel(\state_{\timeInd+1} \mid \state_0, \control_{0:\timeInd}, \modeVarTraj_{0:\timeInd}=\desiredMode)$ is the multi-step dynamics prediction calculated by recursively propagating uncertainty through the desired mode’s dynamics \acrshort{gp} using the moment matching approximation. See cref:eq-state-unc-prop for more details. $\stateCostMatrix$ and $\controlCostMatrix$ are user-defined, real, symmetric, positive semi-definite and positive definite matrices respectively. Intuitively, the first (joint gating entropy) term seeks to find trajectories that explore regions of the gating network with high epistemic uncertainty and the second (state difference) term targets exploration towards the target state $\targetState$. The joint gating entropy prevents trajectories from collapsing onto a single location of high entropy as it favours trajectories spreading over the state space in order to maximise the entropy of the entire trajectory. The state difference term favours trajectories whose centre of mass is closer to the target state $\targetState$. The control cost term regularises the control trajectory.
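As an illustrative sketch of this objective (the exact objective is given in cref:eq-explorative-traj-opt; the helper names here are hypothetical), the cost to be minimised could be assembled as follows, with ~rollout_desired_mode~ cascading moment-matched predictions through the desired mode’s dynamics \acrshort{gp} and ~gating_joint_entropy~ evaluating the joint gating entropy.

#+begin_src python
import numpy as np

def explorative_objective(controls, rollout_desired_mode, gating_joint_entropy,
                          target_state, Q, R):
    """Sketch of the explorative trajectory optimisation objective (minimised)."""
    state_means, state_vars = rollout_desired_mode(controls)
    entropy = gating_joint_entropy(state_means)  # exploration term (to be maximised)
    # State difference term: favours trajectories whose centre of mass is
    # closer to the target state.
    diff = np.mean(state_means, axis=0) - np.asarray(target_state)
    state_cost = float(diff @ Q @ diff)
    # Control cost term: regularises the control trajectory.
    control_cost = sum(float(np.asarray(u) @ R @ np.asarray(u)) for u in controls)
    return -entropy + state_cost + control_cost
#+end_src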

Chance constraints To ensure that the controlled system remains in the desired mode with high probability, i.e. trajectories are $δ-\text{mode remaining}$, this method deploys the chance constraints from cref:eq-mode-chance-constraint. In practice, the constrained optimisation in cref:eq-explorative-traj-opt fails if the initial trajectory does not satisfy the mode chance constraints. As such, this work deploys a simple strategy to ensure the initial trajectory satisfies the constraints. We sample a “fake” target state from the initial data set $\dataset_0$ and optimise the initial trajectory using the cost function in cref:eq-quadratic-cost-control. This finds a straight line trajectory between the start state $\state_0$ and the “fake” target state which is in the initial state domain. Experiments show that this procedure finds trajectories where all of the time steps are in the desired mode with high probability.

gating network entropy

\newline Gating network entropy To calculate the joint gating entropy term $\mathcal{H}\left[\desiredGatingFunction(\stateTraj) \mid \stateTraj, \dataset_{0:i-1} \right]$ the state trajectory $\stateTraj$ is first obtained by cascading single-step predictions through the desired mode’s dynamics \acrshort{gp} using the moment matching approximation from cite:kussGaussian2006. The key idea is then to assume that the gating function values along the trajectory $\desiredGatingFunction(\stateTraj)$ are jointly Gaussian according to the \acrshort{gp} over the desired mode’s gating function. The joint distribution over the gating function values $\desiredGatingFunction(\stateTraj)$ is then given by,

where $\desiredGatingMeanFunc(\cdot)$ and $\desiredGatingCovFunc(\cdot, \cdot)$ are sparse \acrshort{gp} mean and covariance functions, given by,

where $\desiredGatingKernel$ and $\gatingInducingInput$ are the kernel and inducing inputs associated with the desired mode’s gating function respectively. This sparse approximation arises because the \acrshort{mosvgpe} model uses sparse \acrshort{gps} and approximates the posterior with,

where $q(\desiredGatingFunction(\gatingInducingInput)) = \mathcal{N}\left( \desiredGatingFunction(\gatingInducingInput) \mid \hat{\mathbf{m}}_{\desiredMode}, \hat{\mathbf{S}}_{\desiredMode} \right)$.
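
For reference, a numpy sketch of the standard sparse variational \acrshort{gp} predictive equations that produce this joint Gaussian over the trajectory is given below. It assumes a zero prior mean function (as used for the gating network), an unwhitened variational parameterisation, and treats the kernel and variational parameters as given; all names are illustrative rather than the thesis implementation.

#+BEGIN_SRC python
import numpy as np

def svgp_joint_posterior(kernel, X_traj, Z, q_mu, q_sqrt, jitter=1e-6):
    """Joint posterior q(h(X_traj)) = N(mu, Sigma) of a sparse variational GP
    with inducing inputs Z and q(h(Z)) = N(q_mu, S), where S = q_sqrt q_sqrt^T.
    `kernel(A, B)` is assumed to return the cross-covariance matrix."""
    M = Z.shape[0]
    Kxz = kernel(X_traj, Z)                          # (T, M)
    Kzz = kernel(Z, Z) + jitter * np.eye(M)          # (M, M)
    Kxx = kernel(X_traj, X_traj)                     # (T, T)

    A = Kxz @ np.linalg.inv(Kzz)                     # (T, M)
    S = q_sqrt @ q_sqrt.T                            # (M, M)

    mu = A @ q_mu                                    # posterior mean along the trajectory
    Sigma = Kxx - A @ (Kzz - S) @ A.T                # posterior covariance along the trajectory
    return mu, Sigma
#+END_SRC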

gating network entropy

\newline Conditional gating entropy The conditional entropy of the gating network over a trajectory $\mathcal{H}\left[\desiredGatingFunction(\state_{\timeInd}) \mid \stateTraj, \desiredGatingFunction(\stateTraj_{\neg\timeInd}), \dataset \right]$ can be calculated by assuming that the unobserved gating function values $\desiredGatingFunction(\stateTraj)$ are jointly Gaussian with the gating function values at the training inputs $\desiredGatingFunction(\allInput)$. The distribution over the desired mode’s gating function at $\state_{\timeInd}$, conditioned on the rest of the trajectory, $p(\desiredGatingFunction(\state_{\timeInd}) \mid \stateTraj_{\neg\timeInd}, \desiredGatingFunction(\stateTraj_{\neg\timeInd}), \dataset)$, can be calculated using the properties of multivariate normal distributions,

where the mean and variance are given by,
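
The following is a hedged reconstruction from the standard multivariate-normal conditioning identities, written with the sparse \acrshort{gp} mean and covariance functions used in this section; it is a sketch of the form these expressions take rather than a verbatim copy of the original equations:

\begin{align}
\mu_{\timeInd \mid \neg \timeInd} &= \desiredGatingMeanFunc(\state_{\timeInd}) + \desiredGatingCovFunc(\state_{\timeInd}, \stateTraj_{\neg\timeInd}) \desiredGatingCovFunc(\stateTraj_{\neg\timeInd}, \stateTraj_{\neg\timeInd})^{-1} \left( \desiredGatingFunction(\stateTraj_{\neg\timeInd}) - \desiredGatingMeanFunc(\stateTraj_{\neg\timeInd}) \right), \\
\sigma^{2}_{\timeInd \mid \neg \timeInd} &= \desiredGatingCovFunc(\state_{\timeInd}, \state_{\timeInd}) - \desiredGatingCovFunc(\state_{\timeInd}, \stateTraj_{\neg\timeInd}) \desiredGatingCovFunc(\stateTraj_{\neg\timeInd}, \stateTraj_{\neg\timeInd})^{-1} \desiredGatingCovFunc(\stateTraj_{\neg\timeInd}, \state_{\timeInd}),
\end{align}

with the corresponding Gaussian conditional entropy $\mathcal{H} = \tfrac{1}{2} \log \left( 2 \pi e \, \sigma^{2}_{\timeInd \mid \neg\timeInd} \right)$.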

where $\desiredGatingMeanFunc(\cdot)$ and $\desiredGatingCovFunc(\cdot, \cdot)$ are sparse \acrshort{gp} mean and covariance functions, given by,

where $\desiredGatingKernel$ and $\gatingInducingInput$ are the kernel and inducing inputs associated with the desired mode’s gating function respectively. The sparse \acrshort{gp} term arises because the \acrshort{mosvgpe} model uses sparse \acrshort{gps} and approximates the posterior $p(\desiredGatingFunction(\state_{\timeInd}) \mid \dataset_{0:i-1}) \approx q(\desiredGatingFunction(\state_{\timeInd}))$, where,

Mode Remaining Model-based Reinforcement Learning label:sec-modeopt

intro

This section details how this exploration strategy is embedded into a \acrshort{mbrl} loop to solve the mode remaining navigation problem in cref:eq-mode-explore-problem. The algorithm is named \acrshort{modeopt} and is detailed in cref:alg-mode-opt. The algorithm is initialised with a start state $\state_0$, a target state $\targetState$, a desired dynamics mode $\desiredMode$, a data set of state transitions from the desired dynamics mode $\dataset_0$ and a calibrated dynamics model $\dynamicsModel$.

Alg

after alg

\newline

Dynamics model $\dynamicsModel$ \acrshort{modeopt} learns a factorised representation of the underlying dynamics modes using the \acrshort{mosvgpe} model from cref:chap-dynamics. Importantly, it learns a single-step dynamics model $\dynamicsModel$, given by,

which also infers valuable information regarding how the system switches between the modes.

Mode remaining controller $\modeController$ The mode remaining control methods from cref:chap-traj-opt-control are used to construct a mode remaining controller $\modeController$,

It uses the learned dynamics model $\dynamicsModel$ to find trajectories to the target state $\targetState$. The mode chance constraints from cref:eq-mode-chance-constraint are then used to check if trajectories are $δ-\text{mode remaining}$.

Explorative controller $\explorativeController$ When $δ-\text{mode remaining}$ trajectories to the target state $\targetState$ cannot be found, \acrshort{modeopt} relies on an explorative controller $\explorativeController$ to explore the environment. This work uses the exploration strategy from cref:sec-exploration to construct such an explorative controller,

The goal of this controller is to explore the environment whilst remaining in the desired dynamics mode. It is worth noting that this is an open-loop trajectory optimisation algorithm. Extending the method in cref:sec-exploration to feedback controllers is left for future work.
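
The loop described above can be summarised with a Python-style sketch. Every interface here (dynamics model, controllers, environment rollout, data set) is a hypothetical placeholder used only to make the control flow explicit; it is not the thesis implementation.

#+BEGIN_SRC python
def mode_opt(env, dynamics, mode_controller, explorative_controller,
             x0, x_target, dataset, delta=0.2, max_iters=20):
    """Sketch of the ModeOpt loop: explore the desired mode until a
    delta-mode-remaining trajectory to the target can be found, then hand
    over to the exploitative mode-remaining controller."""
    for i in range(max_iters):
        # Try to solve the task with the exploitative mode-remaining controller.
        plan = mode_controller.optimise(dynamics, x0, x_target)
        if plan.satisfies_mode_chance_constraints(delta):   # hypothetical check
            return plan                                      # task solvable under current model

        # Otherwise explore: find an informative, mode-remaining trajectory.
        explore_plan = explorative_controller.optimise(dynamics, x0, x_target)

        # Execute in the environment and collect state transitions.
        transitions = env.rollout(explore_plan.controls, x0)
        dataset = dataset.append(transitions)

        # Update the learned dynamics model (and gating network) on all data.
        dynamics.train(dataset)
    return None
#+END_SRC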

Preliminary Results label:sec-preliminary-results

intro

This section presents initial results solving the illustrative quadcopter navigation problem in cref:illustrative_example using the exploration strategy from cref:sec-exploration. In particular, it shows results using the simulated Environment 1 from cref:sec-traj-opt-results-simulated. See cref:sec-traj-opt-results-simulated for more details on the environment and the simulator setup.

cref:fig-explorative-traj-opt-7-initial shows the \acrshort{mosvgpe}’s gating network posterior after training on the initial data set $\dataset_0$. The start state $\state_0$, the target state $\targetState$, the initial data set $\dataset_0$ (blue crosses) and the mode boundary are overlaid to help visualise the mode remaining navigation problem. The turbulent dynamics mode is the region within the mode boundary and the desired dynamics mode is everywhere else. To further aid visualisation, the mode chance constraints contour is marked on all plots (purple line), i.e. $Pr(\modeVar=\desiredMode \mid \state, \dataset_{0:i}) = 1-δ$. Intuitively, \acrshort{modeopt} seeks to expand the $1-δ$ contour until the target state $\targetState$ lies within the contour.

Experiment Configuration

This section details how the experiments were configured.

Initial data set $\dataset_0$ The initial data set was collected by simulating $15$ random trajectories from the start state $\state_0$ and terminating them when they left the initial state domain $\initialStateDomain$. This resulted in an initial data set of $118$ state transitions, which is visualised as the blue crosses in cref:fig-explorative-traj-opt-7-initial.
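
A sketch of this data collection procedure is shown below. The environment interface (random control sampling, one-step simulation, domain membership test) is hypothetical and used only to illustrate the termination rule.

#+BEGIN_SRC python
import numpy as np

def collect_initial_dataset(env, x0, initial_state_domain,
                            num_trajectories=15, max_steps=50):
    """Collect D_0 by rolling out random controls from the start state and
    terminating a rollout as soon as it leaves the initial state domain
    (illustrative sketch with a hypothetical env API)."""
    transitions = []
    for _ in range(num_trajectories):
        state = np.asarray(x0, dtype=float)
        for _ in range(max_steps):
            control = env.sample_random_control()      # hypothetical helper
            next_state = env.step(state, control)      # hypothetical one-step simulator
            if not initial_state_domain.contains(next_state):
                break                                  # left the initial state domain
            transitions.append((state.copy(), control, next_state.copy()))
            state = next_state
    return transitions
#+END_SRC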

Model learning Following the experiments in cref:chap-traj-opt-results this section instantiated the \acrshort{mosvgpe} dynamics model with $\ModeInd=2$ experts, one to represent each of the dynamics modes. Each mode’s dynamics \acrshort{gp} used a Squared Exponential kernel with \acrshort{ard} and a constant mean function. The gating network used a single gating function with a Squared Exponential kernel with \acrshort{ard} and a zero mean function. At each \acrshort{modeopt} iteration, an early stopping callback was used to terminate the dynamics model training. The early stopping callback used a min delta of $0$ and a patience of $1200$. This meant that training terminated after 1200 epochs of no improvement.
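
The early-stopping configuration above corresponds to a standard Keras callback. Assuming the dynamics model is trained with a Keras-style fit loop (an assumption about the implementation, not stated in the text), it could be set up as follows.

#+BEGIN_SRC python
import tensorflow as tf

# Terminate dynamics-model training after 1200 epochs without improvement
# in the monitored loss; min_delta=0 means any decrease counts as improvement.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="loss",   # assumed monitored quantity (e.g. the negative ELBO)
    min_delta=0.0,
    patience=1200,
)
#+END_SRC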

Explorative controller The explorative controller $\explorativeController$ was initialised with a horizon of $\TimeInd=15$. At each \acrshort{modeopt} iteration, the previous solution was used as the initial solution for the trajectory optimiser. In all experiments, \acrshort{modeopt} was configured with $δ=0.2$, which corresponds to mode chance constraints with a satisfaction probability of 0.8. That is, the mode chance constraints were given by,
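
A hedged restatement of these constraints, written with the contour notation used above (the exact indexing in cref:eq-mode-chance-constraint may differ slightly), is:

\begin{align}
Pr\left(\modeVar = \desiredMode \mid \state_{\timeInd}, \dataset_{0:i}\right) \geq 1 - \delta = 0.8, \qquad \timeInd = 1, \dots, \TimeInd .
\end{align}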

Comparison of Exploration Terms

This section evaluates the different terms in the explorative controller’s objective from cref:eq-explorative-traj-opt. In particular, it motivates why the entropy-based term was combined with the state difference term. It then shows why the joint gating entropy over a trajectory was used, instead of summing the marginal entropy at each time step, as is done in cite:buisson-fenetActively2020.

state diff vs entropy vs

state diff

joint vs factorised entropy

state diff vs entropy vs

state diff text

\newline

State difference term A simple exploration strategy is to favour trajectories whose centre of mass is closer to the target state. This corresponds to solving the constrained optimisation in cref:eq-explorative-traj-opt with only the state difference term. However, using only the state difference term with the mode chance constraints leads to the optimisation getting stuck in a local optimum and never exploring to the target state. This is shown in cref:fig-explorative-state-diff-traj-opt-over-prob-7, which shows the final iteration of the optimisation where the trajectory is stuck at the mode boundary. If the strategy searched more to the left, it would be able to expand the mode chance constraints and explore around the mode boundary. However, the mode chance constraints have induced a local optimum that prevented \acrshort{modeopt} from exploring in this direction. A strategy which favours searching regions of the state space which have not previously been observed could likely avoid the local optimum in cref:fig-explorative-state-diff-traj-opt-over-prob-7. This intuition motivated adding the gating entropy term in cref:eq-explorative-traj-opt, which favours exploration in regions of high epistemic uncertainty.

entropy comparison text

\newline

Joint vs factorised entropy Before combining the state difference and entropy terms into a single objective, the impact of different gating entropy objectives is evaluated. In particular, the joint gating entropy term in cref:eq-explorative-traj-opt, given by,

is compared with the sum of marginal entropies over the trajectory, which is referred to as the factorised entropy and given by,

cref:fig-entropy-comparison-traj-opt-7 shows the trajectories found at the first \acrshort{modeopt} iteration when using these two entropy objectives. The factorised entropy objective, shown in cref:fig-entropy-mvn-diag-traj-opt-over-prob-7, has collapsed the entire trajectory onto a single state of high entropy. This is an undesirable behaviour because it does not maximise the information gain along the entire trajectory. In contrast, the result in cref:fig-entropy-mvn-full-cov-traj-opt-over-prob-7 considers the joint entropy over the entire trajectory. The trajectory has spread along the $1-δ$ contour, which corresponds to the region with the highest gating entropy. These results confirm that it is important to consider the information gain over the entire trajectory.
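
To make the distinction concrete, a small numpy example of the two objectives is given below (variable names are illustrative, not from the thesis implementation).

#+BEGIN_SRC python
import numpy as np

def joint_entropy(cov, jitter=1e-8):
    """Entropy of a multivariate Gaussian: 0.5 * log det(2*pi*e * cov)."""
    T = cov.shape[0]
    L = np.linalg.cholesky(cov + jitter * np.eye(T))
    return 0.5 * (T * np.log(2 * np.pi * np.e) + 2 * np.sum(np.log(np.diag(L))))

def factorised_entropy(cov):
    """Sum of marginal Gaussian entropies: 0.5 * sum_t log(2*pi*e * var_t)."""
    return 0.5 * np.sum(np.log(2 * np.pi * np.e * np.diag(cov)))

# Two highly correlated gating-function values: the joint entropy discounts
# the redundant information, the factorised entropy does not.
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
print(joint_entropy(cov), factorised_entropy(cov))   # joint <= factorised
#+END_SRC

Because the joint entropy accounts for correlations between time steps, it is never larger than the factorised entropy; piling every time step onto one highly uncertain state maximises the factorised objective but is penalised by the joint objective, which instead favours spreading the trajectory over distinct uncertain regions.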

Exploration in Environment 1

This section presents preliminary results of \acrshort{modeopt}’s exploration strategy. The results show the strategy successfully exploring the simulated Environment 1 from cref:sec-traj-opt-results-simulated. The results presented here show how the desired mode’s mixing probability and the gating function’s variance change during each iteration. The exploration strategy is visualised using a grid of figures where each row corresponds to the data collection process at a particular iteration $i$. For example, cref:fig-explorative-traj-opt-over-prob-7-0-4 shows how the desired mode’s mixing probability changes at the start of the exploration phase and cref:fig-explorative-traj-opt-over-variance-7-0-4 shows how the gating function’s variance changes.

  1. The first (left) column shows the data set $\dataset_{0:i}$ collected at the previous iterations, overlaid on the gating network posterior after training on it.
  2. The second (middle) column overlays the trajectory found by the explorative controller $\explorativeController$ on the gating network posterior before training on the data it collected.
  3. The third (right) column shows the updated data set $\dataset_{0:i+1}$ after collecting data with the explorative controller, overlaid on the gating network before training on it.

The row below then shows the gating network posterior after training on the updated data set $\dataset_{0:i+1}$ from the previous iteration. Together cref:fig-explorative-traj-opt-over-prob-7-0-4,fig-explorative-traj-opt-over-variance-7-0-4 show how the model’s epistemic uncertainty reduces after collecting and training on new data, and how this leads to the mode chance constraints expanding at each iteration.

Step 0:4 Prob Figs

Step 0:4 Variance Figs

Initial model text

\newline

Initial model The top left figures in cref:fig-explorative-traj-opt-over-prob-7-0-4,fig-explorative-traj-opt-over-variance-7-0-4 show the initial data set $\dataset_0$ (blue crosses) overlaid on the gating network after training on it. cref:fig-explorative-traj-opt-over-prob-7-0-4 shows that the probability of being in the desired dynamics mode $\desiredMode=2$ is high (red) around the initial data set. It tends to $0.5$ away from the data, which corresponds to maximum entropy for a Categorical distribution. cref:fig-explorative-traj-opt-over-variance-7-0-4 shows that the gating function’s posterior variance is low around the initial data set and high everywhere else. Intuitively, \acrshort{modeopt} seeks to expand the explorable region so that $δ-\text{mode remaining}$ trajectories to the target state can be found. As such, it seeks to reduce the model’s epistemic uncertainty by decreasing the gating function’s variance. In turn, this will increase the region that the model believes belongs to the desired dynamics mode. This is visualised by the chance constraint contour $Pr(\modeVar=\desiredMode \mid \state, \dataset_{0:i})=0.8$ expanding as \acrshort{modeopt} collects more data and trains on it.

Iteration $\mathbf{i=0}$ The top row of cref:fig-explorative-traj-opt-over-prob-7-0-4,fig-explorative-traj-opt-over-variance-7-0-4 visualises the initial iteration $i=0$ of \acrshort{modeopt}. Given the dynamics model after training on the initial data set $\dataset_0$, the explorative controller $\explorativeController$ found the trajectory in the top middle plot. The trajectory explores towards the target state $\targetState$, which is the goal of the state difference term in cref:eq-explorative-traj-opt. The top middle plot in cref:fig-explorative-traj-opt-over-variance-7-0-4 shows that the trajectory has also spread over regions of the desired mode’s gating function with high posterior variance. This is the goal of the joint gating entropy term in cref:eq-explorative-traj-opt. The contour plots in the second row show how the gating network changes after training on $\dataset_{0:1}$. The gating function’s posterior variance has decreased around the new observations. The purple lines representing the $Pr(\modeVar=\desiredMode \mid \state, \dataset_{0}) = 0.8$ contour in the top row, and the $Pr(\modeVar=\desiredMode \mid \state, \dataset_{0:1}) = 0.8$ contour in the second row, show how the mode chance constraints expand after training on $\dataset_{0:1}$. The region between the contours represents the newly explorable region. This shows that reducing the model’s epistemic uncertainty leads to the explorable region expanding under the mode chance constraints. Further to this, it indicates that in this environment, setting $δ=0.2$ enables exploration outside of the initial state domain $\initialStateDomain$.

iteration 2 text

\newline

Iteration $\mathbf{i=2}$ The second row of plots in cref:fig-explorative-traj-opt-over-prob-7-0-4 shows the first iteration where the $δ-\text{mode remaining}$ chance constraints have expanded the explorable domain such that it intersects with the undesired dynamics mode. This is indicated by the purple $0.8$ contour intersecting with the purple line labelled mode boundary. As a result, the trajectory found by the explorative controller left the desired dynamics mode and was subject to the turbulent dynamics mode. This is shown by the trajectory crossing the mode boundary and also by the environment trajectory (cyan) deviating from the dynamics trajectory (magenta) due to the high drift associated with the turbulent dynamics mode. It is worth noting that \acrshort{modeopt} can never learn where a mode boundary is without observing state transitions from outside of the desired dynamics mode. However, \acrshort{modeopt} was not able to learn the mode boundary from this single trajectory.

Iteration $\mathbf{i=3}$ The third row shows another iteration where the dynamics model did not learn the mode boundary and as a result, found an explorative trajectory which left the desired dynamics mode. Although not desirable, collecting this data was necessary for inferring the mode boundary.

Iteration $\mathbf{i=4}$ The fourth row shows the next iteration, where, given these new observations, the \acrshort{mosvgpe}’s gating network is able to infer that the state transitions belong to another dynamics mode. As a result, the explorative trajectory did not leave the desired dynamics mode. That is, the mode chance constraints successfully restricted the domain such that the trajectory did not leave the desired dynamics mode. Further to this, with $δ=0.2$ the trajectory optimiser was able to explore around the mode boundary without leaving the desired dynamics mode. This confirms that the exploration strategy in cref:sec-exploration is capable of exploring subject to the constrained domain.

Leaving the desired dynamics mode The interplay between the explorative objective and the mode chance constraints was effective at preventing the explorative controller from leaving the desired dynamics mode during training. Only two iterations of \acrshort{modeopt} left the desired dynamics mode during training and arguably this is necessary in order to learn the location of the mode boundary.

Step 6:14 Prob Figs

Step 6:14 Variance Figs

Final Iteration text

\newline

Final iteration cref:fig-explorative-traj-opt-over-prob-7-6-14,fig-explorative-traj-opt-over-variance-7-6-14 show the later iterations of the exploration phase. They show \acrshort{modeopt} gradually exploring the domain by expanding the mode chance constraints. The bottom row shows the final iteration of the exploration phase where the mode chance constraints have expanded such that the target state $\targetState$ lies within the explorable domain. This result confirms that the constrained exploration strategy is capable of exploring to the target state $\targetState$ with $δ=0.2$. However, it is worth noting that the mode chance constraints intersect the mode boundary at the final iteration. As such, $δ-\text{mode remaining}$ trajectories found with the mode remaining controller $\modeController$ may leave the desired dynamics mode. In this case, \acrshort{modeopt} requires more observations around the mode boundary to learn its true position. This issue arises because the constraints are latent and are being inferred from data.

Discussion

\newline

Exploration The results in cref:fig-explorative-traj-opt-over-prob-7-0-4,fig-explorative-traj-opt-over-variance-7-0-4,fig-explorative-traj-opt-over-prob-7-6-14,fig-explorative-traj-opt-over-variance-7-6-14 suggest that balancing the state difference and the joint gating entropy terms in cref:eq-explorative-traj-opt results in exploration to the target state $\targetState$ without getting stuck in a local optimum. Intuitively, this objective favours maximum entropy trajectories whose centre of mass is closest to the target state $\targetState$. It is worth noting that not all settings of the cost matrices, $\stateCostMatrix$ and $\controlCostMatrix$, resulted in \acrshort{modeopt} converging in the experiments. As mentioned previously, the performance of \acrshort{modeopt}’s exploration depends on the interplay between the entropy term, the state difference term and the mode chance constraints. As such, the convergence of \acrshort{modeopt} likely depends on the cost matrices, $\stateCostMatrix$ and $\controlCostMatrix$, and the mode satisfaction probability given by our choice of $δ$. Exploration can likely be guaranteed by relaxing $δ$; however, relaxing $δ$ too far corresponds to removing the mode chance constraints. Nevertheless, the results shown here are an initial step showing that \acrshort{modeopt} can work in practice. Further analysis is left for future work.

Discussion & Future Work label:sec-future-work-explore

This section discusses \acrshort{modeopt} and proposes some directions for future work.

intro

\newline

Further experiments First of all, it should be noted that the work presented in this chapter is a first step at consolidating the work in this thesis to solve the mode remaining navigation problem in cref:eq-mode-explore-problem. As such, more experiments are required to fully test \acrshort{modeopt}. For example, further experiments are required to validate the exploration strategy in environments with more than two dynamics modes and on real-world systems.

Exploration guarantees \acrshort{modeopt} side-steps the exploration-exploitation trade-off which is common in \acrshort{mbrl}. It does so by separating the exploration into the explorative controller $\explorativeController$ and the exploitation into the mode remaining controller $\modeController$. The mode chance constraints are then used to see if the exploitative mode controller $\modeController$ can find a $δ-\text{mode remaining}$ trajectory to the target state. If it cannot, it falls back on the explorative controller $\explorativeController$ to find informative trajectories, collect data and reduce the model’s epistemic uncertainty. Although this approach side-steps the exploration-exploitation trade-off in principle, it does not have any exploration guarantees. That is, given enough time, it is not guaranteed to have explored enough such that $δ-\text{mode remaining}$ trajectories to the target state can be found. Further to this, some experiments showed that the constrained optimisation can get stuck exploring away from the target state. As such, an interesting direction for future work is to study exploration guarantees, for example, through regret bounds. Such analysis may lead to an automatic method for selecting the smallest $δ$ that can guarantee exploration. A simpler (empirical) approach may quantify the rate at which the gating network’s epistemic uncertainty is reducing and use it to relax $δ$ if the agent has stopped exploring.

Myopic vs non-myopic active learning In dynamical systems, myopic learning corresponds to selecting the control input by only considering the information gain at the next state. In contrast, non-myopic learning selects the control input by considering the information gain over the next $\TimeInd$ states. The exploration strategy presented in this chapter selected the next $\TimeInd$ controls by considering the information gain over the next $\TimeInd$ states. In future work, it would be interesting to embed the exploration strategy in an \acrshort{mpc} loop, such that only the next control is executed whilst still considering the information gain over the next $\TimeInd$ states. This would enable a better comparison of myopic and non-myopic strategies.

Entropy comparison

Information criterion

\newline Information criterion The exploration strategy presented in this chapter has shown that the \acrshort{gp}-based gating network is useful for active learning when considering trajectories over a horizon. This is because the \acrshort{gps} have been able to model the joint distribution over the gating function values along a trajectory. This enabled the exploration strategy to find trajectories where each time step is aware of the information gain of other time steps. This is useful for finding non-myopic trajectories as it increases the information gain over the entire trajectory. However, this has only been verified by visual inspection and further analysis is left for future work. It is also worth noting that the exploration strategy is only possible because the gating network is based on \acrshort{gps}. As such, this method is not directly applicable to all \acrshort{mogpe} dynamics models.

An alternative approach is to use the entropy of the mode indicator variable $\modeVar$. cref:fig-explorative-traj-opt-entropy-comparison-7-2-7 compares how the gating function entropy and two alternative information-based objectives change during each \acrshort{modeopt} iteration. Note that the figure shows myopic versions of the objectives that do not consider full trajectories. At the start of \acrshort{modeopt} (top row), the shape of the entropy landscapes is quite similar. However, when the gating network starts learning the mode boundary at iteration $i=4$ (second row) the shape of the entropy over the mode indicator variable (middle column) no longer matches the entropy of the gating function (left column). This is in line with what happens in \acrshort{gp} classification and is to be expected from the gating network. As mentioned in cref:chap-dynamics, \acrshort{mogpe} gating networks tend to a uniform distribution 1) where they are not very confident (have high epistemic uncertainty) but also 2) where they are confident (have low epistemic uncertainty) but are at the boundary between modes. This can be seen in the last two rows of cref:fig-explorative-traj-opt-entropy-comparison-7-2-7. In this figure, high values (red) indicate regions where the objectives want to query next. The entropy of the mode indicator variable (middle column) is high at the mode boundary even though the model has observed data at this location. Therefore, this objective may keep exploring mode boundaries that it has already observed. In turn, this may lead to the strategy getting stuck in a local optimum at a mode boundary.

A popular approach to active learning for binary classification is \acrfull{bald} citep:houlsbyBayesian2011. \acrshort{bald} alleviates this issue by considering the following objective (in the myopic setting),
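
In the present notation, the standard \acrshort{bald} mutual-information criterion (a reconstruction, with the mode indicator $\modeVar$ as the outcome and the desired mode's gating function as the latent function) reads:

\begin{align}
\mathbb{I}\left[ \modeVar ; \desiredGatingFunction(\state_{\timeInd}) \mid \state_{\timeInd}, \dataset_{0:i} \right]
= \mathcal{H}\left[ \modeVar \mid \state_{\timeInd}, \dataset_{0:i} \right]
- \mathbb{E}_{p\left(\desiredGatingFunction(\state_{\timeInd}) \mid \dataset_{0:i}\right)} \left[ \mathcal{H}\left[ \modeVar \mid \desiredGatingFunction(\state_{\timeInd}) \right] \right].
\end{align}

It is high where the marginal prediction over $\modeVar$ is uncertain but individual gating function samples are each confident, i.e. where the samples disagree.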

This objective is visualised in the right column of cref:fig-explorative-traj-opt-entropy-comparison-7-2-7. Intuitively, it seeks the state $\state_{\timeInd}$ for which the parameters under the posterior disagree about the outcome the most, hence the name. The objective is low in regions of the mode boundary which have been observed. This is a desirable behaviour which makes extending this objective to dynamical systems an interesting direction for future work.

It is worth noting that the joint distribution over the mode indicator variable in \acrshort{mogpe} models is factorised over time. As such, there is no way to condition the entropy at a given time step on all other time steps. However, it may be possible to use the joint distribution over the gating function values $p(\gatingFunc(\stateTraj) \mid \stateTraj, \dataset_{0:i})$ to achieve this behaviour.

AFTER Information criterion

\newline

More than 2 modes Although not tested here, \acrshort{modeopt} is theoretically sound and should be applicable in environments with more than two dynamics modes. However, the exploration strategy uses the joint entropy of the desired mode’s gating function over a trajectory. Further experiments are required to see if this strategy works when the \acrshort{mosvgpe} dynamics model is instantiated with more than two experts. This is because when using more than two experts the \acrshort{mosvgpe} model uses the Softmax likelihood with a gating function for each expert. In contrast, the two expert case uses the Bernoulli likelihood with a single gating function.

Inducing points The method presented in this chapter initialised each of the sparse \acrshort{gps} in the \acrshort{mosvgpe} dynamics model with a fixed number of inducing points. Although this approach worked well in the experiments, it is unlikely that this will always be the case. For example, consider exploring much larger environments, i.e. environments with larger state domains. In these environments, the \acrshort{mosvgpe}’s ability to accurately model an ever-increasing data set with a fixed number of inducing points will decrease. This is because the sparse approximation’s fidelity to the true nonparametric model deteriorates as the number of data points increases. See cite:burtRates2019 for details on rates of convergence for sparse \acrshort{gps}. As such, an interesting direction for future work is to study methods for dynamically adding new inducing points to each \acrshort{gp}.

Fixing model parameters during training During its initial iterations, \acrshort{modeopt} only explores the desired dynamics mode and does not observe any state transitions belonging to another mode. As a result, the lengthscale of the gating network \acrshort{gp} increases. This often results in the gating network becoming overconfident, almost as if the gating network believes that only a single dynamics mode exists over the entire domain. When this happens, the $δ-\text{mode remaining}$ chance constraints expand the explorable region significantly further than the observed data. This is undesirable because it leads to trajectories leaving the desired dynamics mode significantly more. In practice, fixing the kernel’s hyperparameters (e.g. lengthscale and signal variance) after training on the initial data set $\dataset_0$, appeared to alleviate this issue. However, it is left to future work to study this in more depth.
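
As a sketch of the fix described above, and assuming a GPflow-style kernel object for the gating function (the thesis' implementation details are not shown here), the kernel hyperparameters could be frozen after the initial training phase as follows.

#+BEGIN_SRC python
import gpflow
from gpflow.utilities import set_trainable

# Hypothetical gating-function kernel; in practice this would be the kernel
# inside the MoSVGPE gating network after training on the initial data set D_0.
kernel = gpflow.kernels.SquaredExponential(lengthscales=[1.0, 1.0])

# Freeze lengthscales and signal variance so later ModeOpt iterations cannot
# inflate them while no out-of-mode data has been observed yet.
set_trainable(kernel.lengthscales, False)
set_trainable(kernel.variance, False)
#+END_SRC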

$δ-\text{mode remaining}$ Exploring a single dynamics mode in multimodal dynamical systems, where the underlying dynamics modes, and how the system switches between them, are not fully known a priori, is a hard problem. This is because the agent must observe regions outside of the desired dynamics mode in order to know that a particular region does not belong to the desired mode. The $δ-\text{mode remaining}$ exploration strategy proposed in cref:sec-exploration resulted in the agent leaving the desired dynamics mode multiple times in the experiments. In future work, it would be interesting to study this in more detail. For example, how often must the agent leave the desired dynamics mode in order to accurately learn the mode boundary? And, how often does the agent fail when leaving the desired dynamics mode in practice?

Related work To the best of our knowledge, there is no previous work addressing the exploration of a single dynamics mode in multimodal dynamical systems. cite:schreiterSafe2015 use a \acrshort{gp} classifier to identify safe and unsafe regions when learning \acrshort{gp} dynamics models in an active learning setting. However, they assume that they can directly observe whether a particular data point from the environment belongs to either the safe or unsafe regions. In contrast, we considered scenarios where the mode cannot be directly observed from the environment, but instead, is inferred by the probabilistic dynamics model.

Conclusion

This chapter has presented a novel strategy for exploring multimodal dynamical systems whilst remaining in the desired dynamics mode with high probability. Moreover, it has proposed how this exploration strategy can be combined with the dynamics model from cref:chap-dynamics, the $δ-\text{mode remaining}$ trajectory optimisation algorithms from cref:chap-traj-opt-control and the $δ-\text{mode remaining}$ chance constraints from cref:eq-mode-chance-constraint, to approximately solve the mode remaining navigation problem in cref:eq-main-problem. That is, it can control multimodal dynamical systems – where the underlying dynamics modes and how the system switches between them, are not fully known a priori – from a start state $\state_0$, to a target state $\targetState$ whilst guaranteeing that the system remains in the desired dynamics mode with high probability.

The algorithm, named \acrshort{modeopt}, was tested in a simulated version of the illustrative quadcopter example, verifying that the algorithm can work in practice. However, it has not been fully tested in environments with more than two modes, nor has it been tested on a real-world system. Further testing and analysis of \acrshort{modeopt} is left for future work.

Conclusion

intro

The main objective of this thesis was to solve the mode remaining navigation problem in cref:eq-main-problem. That is, to control a multimodal dynamical system – where neither the underlying dynamics modes, nor how the system switches between them, are known a priori – to a target state, whilst remaining in the desired dynamics mode. Based on well-established methods from Bayesian statistics and machine learning, this thesis proposed \acrshort{modeopt}, a general framework for approximately solving the mode remaining navigation problem.

At the core of \acrshort{modeopt} is a Bayesian approach to learning multimodal dynamical systems, named \acrfull{mosvgpe}, that accurately identifies the underlying dynamics modes, as well as how the system switches between them. Further to this, it learns informative latent structure that \acrshort{modeopt} leverages to encode mode remaining behaviour into control strategies. The method’s ability to learn factorised representations of multimodal data sets whilst retaining well-calibrated uncertainty estimates was validated on a real-world quadcopter data set, as well as on the motorcycle data set. Further to this, its applicability to learning dynamics models for model-based control was validated in two simulated environments.

As this thesis has focused on model-based techniques that leverage a learned dynamics model, it had to relax the requirement of remaining in the desired dynamics mode, to remaining in the desired dynamics mode with high probability. Initially, when not much of the environment has been observed, it is not possible to find trajectories to the target state that remain in the desired mode with high probability. This is due to the learned dynamics model having high epistemic uncertainty. In this scenario, \acrshort{modeopt} reduces the model’s epistemic uncertainty by exploring the environment and updating the dynamics model with new data. \acrshort{modeopt} sidesteps the exploration-exploitation trade-off which is common in \acrshort{mbrl} algorithms by introducing a set of chance constraints. That is, \acrshort{modeopt} does not need an objective function that changes its exploration-exploitation balance as it gathers more data. The mode chance constraints are a powerful tool that allow \acrshort{modeopt} to deploy separate controllers during the exploration and exploitation phases.

An explorative trajectory optimisation algorithm that leverages the \acrshort{gp}-based gating network is proposed. Experiments confirm that the explorative controller is effective at maximising the information gain over the entire trajectory and targeting exploration towards the target state. However, further analysis of the exploration strategy is left for future work.

Three exploitative trajectory optimisation algorithms that find trajectories to the target state, whilst attempting to remain in the desired dynamics mode have been presented. Two of the methods show how the latent geometry of the \acrshort{gp}-based gating network can be leveraged to encode mode remaining behaviour, whilst the third approach shows how this can be achieved by extending the \acrshort{cai_unimodal} framework to multimodal dynamical systems. Their ability to remain in the desired dynamics mode was tested in two simulated environments. Both the \acrfull{dre} method in cref:sec-traj-opt-energy and the \acrfull{cai} method in cref:chap-traj-opt-inference performed well in the experiments.

This thesis provided experimental evidence that \acrshort{modeopt} can work in practice; however, this was only in simulation. Evaluating its performance on real-world problems is left for future work.

Summary of Contributions

  1. A probabilistic framework

Future Work

There are many promising directions for future work. Some are extensions of the work in this thesis, whilst others are alternative approaches to solving the mode remaining navigation problem in cref:eq-main-problem. Many of the extensions to the work in this thesis are discussed in the relevant chapters so are not re-discussed here.

Higher-dimensional problems Firstly, \acrshort{modeopt} is mostly restricted to lower-dimensional problems due to the difficulties of defining \acrshort{gp} priors in high dimensions. In \acrshort{mbrl} there has been interesting progress in learning better statistical models and scaling them up to higher-dimensional problems, for example, using Bayesian neural networks. This is an interesting direction for future work as it may lead to more practical algorithms.

External sensing \acrshort{modeopt} uses internal sensing to obtain information on the robot, such as where it is and how fast it is travelling. It then uses this information to infer the separation between the underlying dynamics modes from state transitions. However, in some applications, it may be possible to infer the separation between the underlying dynamics modes using external sensors. For example, in autonomous driving, the friction coefficients associated with different road surfaces may define a set of dynamics modes. In this setting, cameras could likely be used to detect changes in the road surface and thus the separation between the underlying dynamics modes. Although not applicable in all settings, this is a promising direction for future work, as the agent would never have to enter the undesired dynamics mode.

Real-time feedback control Although the trajectories found by the controllers are $δ-\text{mode remaining}$, often when they are executed in the environment they are not $δ-\text{mode remaining}$. This behaviour is expected with open-loop controllers. Although this did not pose an issue in the simulated experiments, it may pose an issue in real-world environments. Future work could explore methods for obtaining closed-loop controllers. A simple approach is to embed the trajectory optimisation algorithm into an \acrshort{mpc} loop to obtain a closed-loop controller. This would require the trajectory optimisation algorithms to be made faster so that they could be used for real-time \acrshort{mpc}, for example, via locally linear dynamics approximations. Alternatively, state-feedback controllers could be learned. For example, the trajectory optimisation algorithms could be used for guided policy search citep:levineGuided2013, or they could be used for policy optimisation by reformulating them as a sum of discounted rewards with the mode chance constraints implemented via Lagrange multipliers. It should be straightforward to verify that learned feedback controllers are $δ-\text{mode remaining}$ under the dynamics using the mode chance constraints.

Model-free approaches Finally, this thesis has solely focused on model-based approaches for solving the mode remaining navigation problem in cref:eq-main-problem. However, it may be possible to solve it with model-free methods. For example, \acrfull{mfrl} methods may be able to learn reactive policies which automatically turn back when they encounter hard to control dynamics.

Appendix

Multivariate Normals

Derivatives of a Gaussian Process

Parameter Settings

Velocity Controlled Quadcopter Experiments

Table ref:tab-params-quadcopter contains the optimisation settings and initial values for the optimisable parameters that were used to train the model on the velocity controlled quadcopter data set in Section ref:sec-brl-experiment.

Experts Both experts’ \acrshort{gp} priors were initialised with constant mean functions (with a learnable parameter $c_{\modeInd}$) and separate independent squared exponential kernels (with Automatic Relevance Determination) on each output dimension. Table ref:tab-params-quadcopter shows a single set of kernel parameters $\sigma_f, l$ for each expert, as the kernel associated with each output dimension was initialised with the same initial values. Each expert’s likelihood was initialised with a diagonal covariance matrix $\Sigma_{\epsilon_{\modeInd}}$.

Gating Network The gating network \acrshort{gp} was initialised with a zero mean function, and a squared exponential kernel with \acrshort{ard}.

Motorcycle Experiments

Two Experts

\newline Model Training Table ref:tab-params-quadcopter contains the initial values for all of the trainable parameters in the model.

Three Experts

\newline Model Training Table ref:tab-params-quadcopter contains the initial values for all of the trainable parameters in the model.

Bibliography

Back Matter