# Defining POMDPs and MDPs

As described in the Concepts and Architecture section, an MDP is defined by the state space, action space, transition distributions, reward function, and discount factor, $(S,A,T,R,\gamma)$. A POMDP also includes the observation space, and observation probability distributions, for a definition of $(S,A,T,R,O,Z,\gamma)$. A problem definition in POMDPs.jl consists of an implicit or explicit definition of each of these elements.

It is possible to define a (PO)MDP with a more traditional object-oriented approach in which the user defines a new type to represent the (PO)MDP and methods of interface functions to define the tuple elements. However, the QuickPOMDPs package provides a more concise way to get started, using keyword arguments instead of new types and methods. Essentially each keyword argument defines a corresponding POMDPs api function. Since the important concepts are the same for the object oriented approach and the QuickPOMDP approach, we will use the latter for this discussion.

This guide has three parts: First, it explains a very simple example (the Tiger POMDP), then uses a more complex example to illustrate the broader capabilities of the interface. Finally, some alternative ways of defining (PO)MDPs are discussed.

This guide assumes that you are comfortable programming in Julia, especially familiar with various ways of defining *anonymous functions*. Users should consult the Julia documentation to learn more about programming in Julia.

## A Basic Example: The Tiger POMDP

In the first section of this guide, we will explain a QuickPOMDP implementation of a very simple problem: the classic Tiger POMDP[1]. In the tiger POMDP, the agent is tasked with escaping from a room. There are two doors leading out of the room. Behind one of the doors is a tiger, and behind the other is sweet, sweet freedom. If the agent opens the door and finds the tiger, it gets eaten (and receives a reward of -100). If the agent opens the other door, it escapes and receives a reward of 10. The agent can also listen. Listening gives a noisy measurement of which door the tiger is hiding behind. Listening gives the agent the correct location of the tiger 85% of the time. The agent receives a reward of -1 for listening. The complete implementation looks like this:

```
using QuickPOMDPs: QuickPOMDP
using POMDPModelTools: Deterministic, Uniform, SparseCat
m = QuickPOMDP(
states = ["left", "right"],
actions = ["left", "right", "listen"],
observations = ["left", "right"],
discount = 0.95,
transition = function (s, a)
if a == "listen"
return Deterministic(s) # tiger stays behind the same door
else # a door is opened
return Uniform(["left", "right"]) # reset
end
end,
observation = function (a, sp)
if a == "listen"
if sp == "left"
return SparseCat(["left", "right"], [0.85, 0.15]) # sparse categorical
else
return SparseCat(["right", "left"], [0.85, 0.15])
end
else
return Uniform(["left", "right"])
end
end,
reward = function (s, a)
if a == "listen"
return -1.0
elseif s == a # the tiger was found
return -100.0
else # the tiger was escaped
return 10.0
end
end,
initialstate = Uniform(["left", "right"]),
);
```

The next sections explain how each of the elements of the POMDP tuple are defined in this implementation:

#### State, action and observation spaces

In this example, each state, action, and observation is a `String`

. The state, action and observation spaces ($S$, $A$, and $O$), are defined with the `states`

, `actions`

and `observations`

keyword arguments. In this case, they are simply `Vector`

s containing all the elements in the space.

#### Transition and observation distributions

The `transition`

and `observation`

keyword arguments are used to define the transition distribution, $T$, and observation distribution, $Z$, respectively. These models are defined using functions that return *distribution objects* (more info below). The transition function takes state and action arguments and returns a distribution of the resulting next state. The observation function takes in an action and the resulting next state (`sp`

, short for "s prime") and returns the distribution of the observation emitted at this state.

#### Reward function

The `reward`

keyword argument defines $R$. It is a function that takes in a state and action and returns a number.

#### Discount and initial state distribution

The discount factor, $\gamma$, is defined with the `discount`

keyword, and is simply a number between 0 and 1. The initial state distribution, `b_0`

, is defined with the `initialstate`

argument, and is a distribution object.

The example above shows a complete implementation of a very simple discrete-space POMDP. However, POMDPs.jl is capable of concisely expressing much more complex models with continuous and hybrid spaces. The guide below introduces a more complex example to fully explain the ways that a POMDP can be defined.

## Guide to Defining POMDPs

### A more complex example: A partially-observable mountain car

Mountain car is a classic problem in reinforcement learning. A car starts in a valley between two hills, and must reach the goal at the top of the hill to the right (see wikipedia for image). The actions are left and right acceleration and neutral and the state consists of the car's position and velocity. In this partially-observable version, there is a small amount of acceleration noise and observations are normally-distributed noisy measurements of the position. This problem can be implemented as follows:

```
import QuickPOMDPs: QuickPOMDP
import POMDPModelTools: ImplicitDistribution
import Distributions: Normal
mountaincar = QuickPOMDP(
actions = [-1., 0., 1.],
obstype = Float64,
discount = 0.95,
transition = function (s, a)
ImplicitDistribution() do rng
x, v = s
vp = v + a*0.001 + cos(3*x)*-0.0025 + 0.0002*randn(rng)
vp = clamp(vp, -0.07, 0.07)
xp = x + vp
return (xp, vp)
end
end,
observation = (a, sp) -> Normal(sp[1], 0.15),
reward = function (s, a, sp)
if sp[1] > 0.5
return 100.0
else
return -1.0
end
end,
initialstate = ImplicitDistribution(rng -> (-0.2*rand(rng), 0.0)),
isterminal = s -> s[1] > 0.5
)
```

The following sections provide a detailed guide to defining the components of a POMDP using this example and the tiger pomdp further above.

### State, action, and observation spaces

In POMDPs.jl, a state, action, or observation can be represented by any Julia object, for example an integer, a floating point number, a string or `Symbol`

, or a vector. For example, in the tiger problem, the states are `String`

s, and in the mountaincar problem, the state is a `Tuple`

of two floating point numbers, and the actions and observations are floating point numbers. These types are usually inferred from the space or initial state distribution definitions.

Objects representing individual states, actions, and observations should not be altered once they are created, since they may be used as dictionary keys or stored in histories. Hence it is usually best to use immutable objects such as integers or `StaticArray`

s.

The state, action, and observation spaces are defined with the `states`

, `actions`

, and `observations`

Quick(PO)MDP keyword arguments. The simplest way to define these spaces is with a `Vector`

of states, e.g. `states = ["left", "right"]`

in the tiger problem. More complicated spaces, such as vector spaces and other continuous, uncountable, or hybrid sets can be defined with custom objects that adhere to the space interface. However it should be noted that, for many solvers, *an explicit enumeration of the state and observation spaces is not needed*. Instead, it is sufficient to specify the state or observation *type* using the `statetype`

or `obstype`

arguments, e.g. `obstype = Float64`

in the mountaincar problem.

If you are having a difficult time representing the state or observation space, it is likely that you will not be able to use a solver that requires an explicit representation. It is usually best to omit that space from the definition and try solvers to see if they work.

#### State- or belief-dependent action spaces

In some problems, the set of allowable actions depends on the state or belief. This can be implemented by providing a function of the state or belief to the `actions`

argument, e.g. if you can only take the action `1`

in state `1`

, but can take actions `2`

and `3`

, in an MDP, you might use

```
actions = function (s)
if s == 1
return [1,2,3]
else
return [2,3]
end
end
```

Similarly, in a POMDP, you may wish to only allow action `1`

if the belief `b`

assigns a nonzero probability to state `1`

. This can be accomplished with

```
actions = function (b)
if pdf(b, 1) > 0.0
return [1,2,3]
else
return [2,3]
end
end
```

### Transition and observation distributions

The transition and observation observation distributions are specified through *functions that return distributions*. A distribution object implements parts of the distribution interface, most importantly a `rand`

function that provides a way to sample the distribution and, for explicit distributions, a `pdf`

function that evaluates the probability mass or density of a given outcome. In most simple cases, you will be able to use a pre-defined distribution like the ones listed below, but occasionally you will define your own for more complex problems.

Since the `transition`

and `observation`

functions return distributions, you should not call `rand`

within these functions (unless it is within an `ImplicitDistribution`

sampling function (see below)).

The `transition`

function takes in a state `s`

and action `a`

and returns a distribution object that defines the distribution of next states given that the current state is `s`

and the action is `a`

, that is $T(s' | s, a)$. Similarly the `observation`

function takes in the action `a`

and the next state `sp`

and returns a distribution object defining $O(z | a, s')$.

It is also possible to define the `observation`

function in terms of the previous state `s`

, along with `a`

, and `sp`

. This is necessary, for example, when the observation is a measurement of change in state, e.g. `sp - s`

. However some solvers may use the `a, sp`

method (and hence cannot solve problems where the observation is conditioned on $s$ and $s'$). Since providing an `a, sp`

method *automatically* defines the `s, a, sp`

method, problem writers should usually define only the `a, sp`

method, and only define the `s, a, sp`

method if it is necessary. Except for special performance cases, problem writers should *never* need to define both methods.

#### Commonly-used distributions

In most cases, the following pre-defined distributions found in the POMDPModelTools and Distributions packages will be sufficient to define models.

`Deterministic`

The `Deterministic`

distribution should be used when there is no randomness in the state or observation given the state and action inputs. This commonly occurs when the new state is a deterministic function of the state and action or the state stays the same, for example when the action is `"listen"`

in the tiger example above, the transition function returns `Deterministic(s)`

.

`SparseCat`

In discrete POMDPs, it is common for the state or observation to have a few possible outcomes with specified probabilities. This can be represented with a sparse categorical `SparseCat`

distribution that takes a list of outcomes and a list of associated probabilities as arguments. For instance, in the tiger example above, when the action is `"listen"`

, there is an 85% chance of receiving the correct observation. Thus if the state is `"left"`

, the observation distribution is `SparseCat(["left", "right"], [0.85, 0.15])`

, and `SparseCat(["right", "left"], [0.85, 0.15])`

if the state is `"right"`

.

Another example where `SparseCat`

distributions are useful is in grid-world problems, where there is a high probability of transitioning along the direction of the action, a low probability of transitioning to other adjacent states, and zero probability of transitioning to any other states.

`Uniform`

Another common case is a uniform distribution over a space or set of outcomes. This can be represented with a `Uniform`

object that takes a set of outcomes as an argument. For example, the initial state distribution in the tiger problem is represented with `Uniform(["left", "right"])`

indicating that both states are equally likely.

##### Distributions.jl

If the states or observations have numerical or vector values, the Distributions.jl package provides a suite of suitable distributions. For example, the observation function in the partially-observable mountain car example above,

`observation = (a, sp) -> Normal(sp[1], 0.15)`

returns a `Normal`

distribution from this package with a mean that depends on the car's location (the first element of state `sp`

) and a standard deviation of 0.15.

`ImplicitDistribution`

In many cases, especially when the state or observation spaces are continuous or hybrid, it is difficult or impossible to specify the probability density explicitly. Fortunately, many solvers for these problems do not require explicit density information and instead need only samples from the distribution. In this case, an "implicit distribution" or "generative model" is sufficient. In POMDPs.jl, this can be represented using an `ImplicitDistribution`

object.

The argument to an `ImplicitDistribution`

constructor is a function that takes a random number generator as an argument and returns a sample from the distribution. To see how this works, we'll look at an example inspired by the mountaincar initial state distribution. Samples from this distribution are position-velocity tuples where the velocity is always zero, but the position is uniformly distributed between -0.2 and 0. Consider the following code:

```
using Random: MersenneTwister
using POMDPModelTools: ImplicitDistribution
rng = MersenneTwister(1)
d = ImplicitDistribution(rng -> (-0.2*rand(rng), 0.0))
rand(rng, d)
# output
(-0.04720666913240939, 0.0)
```

Here, `rng`

is the random number generator. When `rand(rng, d)`

is called, the sampling function, `rng -> (-0.2*rand(rng), 0.0)`

, is called to generate a state. The sampling function uses `rng`

to generate a random number between 0 and 1 (`rand(rng)`

), multiplies it by -0.2 to get the position, and creates a tuple with the position and a velocity of `0.0`

and returns an initial state that might be, for instance `(-0.11, 0.0)`

. Any time that a solver, belief updater, or simulator needs an initial state for the problem, it will be sampled in this way.

The random number generator is a subtype of `AbstractRNG`

. It is important to use this random number generator for all calls to `rand`

in the sample function for reproducible results. Moreover some solvers use specialized random number generators that allow them to reduce variance. See also the What if I don't use the `rng`

argument? FAQ.

It is also common to use Julia's `do`

block syntax to define more complex sampling functions. For instance the transition function in the mountaincar example returns an ImplicitDistribution with a sampling function that (1) generates a new noisy velocity through a `randn`

call, then (2) clamps the velocity, and finally (3) integrates the position with Euler's method:

```
transition = function (s, a)
ImplicitDistribution() do rng
x, v = s
vp = v + a*0.001 + cos(3*x)*-0.0025 + 0.0002*randn(rng)
vp = clamp(vp, -0.07, 0.07)
xp = x + vp
return (xp, vp)
end
end
```

Because of the nonlinear clamp operation, it would be difficult to represent this distribution explicitly.

##### Custom distributions

If none of the distributions above are suitable, for example if you need to represent an explicit distribution with hybrid support, it is not difficult to define your own distributions by implementing the functions in the distribution interface.

### Reward functions

The reward function maps a combination of state, action, and observation arguments to the reward for a step. For instance, the reward function in the mountaincar problem,

```
reward = function (s, a, sp)
if sp[1] > 0.5
return 100.0
else
return -1.0
end
end
```

takes in the previous state, `s`

, the action, `a`

, and the resulting state, `sp`

and returns a large positive reward if the resulting position, `sp[1]`

, is beyond a threshold (note the coupling of the terminal reward) and a small negative reward on all other steps. If the reward in the problem is stochastic, the `reward`

function implemented in POMDPs.jl should return the mean reward.

There are two possible reward function argument signatures that a problem-writer might consider implementing for an MDP: `(s, a)`

and `(s, a, sp)`

. For a POMDP, there is an additional version, `(s, a, sp, o)`

. The `(s, a, sp)`

version is useful when transition to a terminal state results in a reward, and the `(s, a, sp, o)`

version is useful for cases when the reward is associated with an observation, such as a negative reward for the stress caused by a medical diagnostic test that indicates the possibility of a disease. **Problem writers should implement the version with the fewest number of arguments possible**, since the versions with more arguments are automatically provided to solvers and simulators if a version with fewer arguments is implemented.

In rare cases, it may make sense to implement two or more versions of the function, for example if a solver requires `(s, a)`

, but the user wants an observation-dependent reward to show up in simulation. It is OK to implement two methods of the reward function as long as the following relationships hold: $R(s, a) = E_{s'\sim T(s'|s,a)}[R(s, a, s')]$ and $R(s, a, s') = E_{o \sim Z(o | s, a, s')}[R(s, a, s', o)]$. That is, the versions with fewer arguments *must* be expectations of versions with more arguments.

### Other Components

#### Discount factors

The `discount`

keyword argument is simply a number between 0 and 1 used to discount rewards in the future.

#### Initial state distribution

The `initialstate`

argument should be a distribution object (see above) that defines the initial state distribution (and initial belief for POMDPs).

#### Terminal states

The function supplied to the `isterminal`

object defines which which states in the POMDP are terminal. The function should take a state as an argument as an argument and return `true`

if the state is terminal and `false`

otherwise. For example, in the mountaincar example above, `isterminal = s -> s[1] > 0.5`

indicates all states where the position, `s[1]`

is greater than 0.5 are terminal.

It is assumed that the system will take no further steps once it has reached a terminal state. Since reward is assigned for taking steps, no additional award can be accumulated from a terminal state. Consequently, the most important property of terminal states is that *the value of a terminal state is always zero*. Many solvers leverage this property for efficiency. As in the mountaincar example

## Other ways to define a (PO)MDP

Besides the Quick(PO)MDP approach above, there are several alternative ways to define (PO)MDP models:

### Object-oriented

First, it is possible to create your own (PO)MDP types and implement the components of the POMDP directly as methods of POMDPs.jl interface functions. This approach can be thought of as the "low-level" way to define a POMDP, and the QuickPOMDP as merely a syntactic convenience. There are a few things that make this object-oriented approach more cumbersome than the QuickPOMDP approach, but the structure is similar. For example, the tiger QuickPOMDP shown above can be implemented as follows:

```
import POMDPs
using POMDPs: POMDP
using POMDPModelTools: Deterministic, Uniform, SparseCat
struct TigerPOMDP <: POMDP{String, String, String}
p_correct::Float64
indices::Dict{String, Int}
TigerPOMDP(p_correct=0.85) = new(p_correct, Dict("left"=>1, "right"=>2, "listen"=>3))
end
POMDPs.states(m::TigerPOMDP) = ["left", "right"]
POMDPs.actions(m::TigerPOMDP) = ["left", "right", "listen"]
POMDPs.observations(m::TigerPOMDP) = ["left", "right"]
POMDPs.discount(m::TigerPOMDP) = 0.95
POMDPs.stateindex(m::TigerPOMDP, s) = m.indices[s]
POMDPs.actionindex(m::TigerPOMDP, a) = m.indices[a]
POMDPs.obsindex(m::TigerPOMDP, o) = m.indices[o]
function POMDPs.transition(m::TigerPOMDP, s, a)
if a == "listen"
return Deterministic(s) # tiger stays behind the same door
else # a door is opened
return Uniform(["left", "right"]) # reset
end
end
function POMDPs.observation(m::TigerPOMDP, a, sp)
if a == "listen"
if sp == "left"
return SparseCat(["left", "right"], [m.p_correct, 1.0-m.p_correct])
else
return SparseCat(["right", "left"], [m.p_correct, 1.0-m.p_correct])
end
else
return Uniform(["left", "right"])
end
end
function POMDPs.reward(m::TigerPOMDP, s, a)
if a == "listen"
return -1.0
elseif s == a # the tiger was found
return -100.0
else # the tiger was escaped
return 10.0
end
end
POMDPs.initialstate(m::TigerPOMDP) = Uniform(["left", "right"])
# output
```

It is easy to see that the new methods are similar to the keyword arguments in the QuickPOMDP approach, except that every function has an initial `m`

argument that has the newly created POMDP type. There are several differences from the QuickPOMDP approach: First, the POMDP is represented by a new `struct`

that is a subtype of `POMDP{S,A,O}`

. The state, action, and observation types must be specified as the `S`

, `A`

, and `O`

parameters of the `POMDP`

abstract type. Second, this new `struct`

may contain problem-specific fields, which makes it easy for others to construct POMDPs that have the same structure but different parameters. For example, in the code above, the `struct`

has a `p_correct`

parameter that specifies the probability of receiving a correct observation when the "listen" action is taken. The final and most cumbersome difference between this object-oriented approach and using QuickPOMDPs is that the user must implement `stateindex`

, `actionindex`

, and `obsindex`

to map states, actions, and observations to appropriate indices so that data such as values can be stored and accessed efficiently in vectors.

### Using a single generative function instead of separate $T$, $Z$, and $R$

In some cases, you may wish to use a simulator that generates the next state, observation, and/or reward ($s'$, $o$, and $r$) simultaneously. This is sometimes called a "generative model".

For example if you are working on an autonomous driving POMDP, the car may travel for one or more seconds in between POMDP decision steps during which it may accumulate reward and observation measurements. In this case it might be very difficult to create a reward or observation function based on $s$, $a$, and $s'$. For such situations, `gen`

function is an alternative to `transition`

, `observation`

, and `reward`

. `gen`

should take in state, action, and random number generator arguments and return a `NamedTuple`

with keys `sp`

(for "s-prime", the next state), `o`

, and `r`

. The mountaincar example above can be implemented with `gen`

as follows:

```
using QuickPOMDPs: QuickPOMDP
using POMDPModelTools: ImplicitDistribution
mountaincar = QuickPOMDP(
actions = [-1., 0., 1.],
obstype = Float64,
discount = 0.95,
gen = function (s, a, rng)
x, v = s
vp = v + a*0.001 + cos(3*x)*-0.0025 + 0.0002*randn(rng)
vp = clamp(vp, -0.07, 0.07)
xp = x + vp
if xp > 0.5
r = 100.0
else
r = -1.0
end
o = xp + 0.15*randn(rng)
return (sp=(xp, vp), o=o, r=r)
end,
initialstate = ImplicitDistribution(rng -> (-0.2*rand(rng), 0.0)),
isterminal = s -> s[1] > 0.5
)
```

`gen`

is not tied to the QuickPOMDP approach; it can also be used in the object-oriented paradigm.

It is possible to mix and match `gen`

with `transtion`

, `observation`

, and `reward`

. For example, if the `gen`

function returns a `NamedTuple`

with `sp`

and `r`

keys, POMDPs.jl will try to use `gen`

to generate states and rewards and the `observation`

function to generate observations.

Implementing `gen`

instead of `transition`

, `observation`

, and `reward`

will limit which solvers you can use; for example, it is impossible to use a solver that requires an explicit transition distribution

### Tabular

Finally, it is sometimes convenient to define (PO)MDPs with tables that define the transition and observation probabilities and rewards. In this case, the states, actions, and observations must simply be integers.

The code below is a tabular implementation of the tiger example with the states, actions, and observations mapped to the following integers:

integer | state, action, or observation |
---|---|

1 | "left" |

2 | "right" |

3 | "listen" |

```
using POMDPModels: TabularPOMDP
T = zeros(2,3,2)
T[:,:,1] = [1. 0.5 0.5;
0. 0.5 0.5]
T[:,:,2] = [0. 0.5 0.5;
1. 0.5 0.5]
O = zeros(2,3,2)
O[:,:,1] = [0.85 0.5 0.5;
0.15 0.5 0.5]
O[:,:,2] = [0.15 0.5 0.5;
0.85 0.5 0.5]
R = [-1. -100. 10.;
-1. 10. -100.]
m = TabularPOMDP(T, R, O, 0.95)
```

Here `T`

is a $|S| \times |A| \times |S|$ array representing the transition probabilities, with `T[sp, a, s]`

$= T(s' | s, a)$. Similarly, `O`

is an $|O| \times |A| \times |S|$ encoding the observation distribution with `O[o, a, sp]`

$= Z(o | a, s')$, and `R`

is a $|S| \times |A|$ matrix that encodes the reward function. 0.95 is the discount factor.