A tutorial on the free-energy framework for modelling perception and learning

Based on K. Friston and following R. Bogacz I shortly summarise the former, and solve the programming exercises given in the paper. This motives another post, taking a more general look at active learning in ML.

Why is this important/exciting?

R. Bogacz delivers a detailed and beautifully made tutorial on a subject that can be very difficult to understand, Variational Bayes, as seen from Karl Fristons perspective; Namely invoking the Free-Energy principle to motivate Active Inference. In the last decade, Active Inference has gained traction with the wider neuroscientific community, and recently Karl was measured to be the most influential neuroscientist in the modern era.


I use the same notation as in R. Bogacz. The code implements and refers to equations in the paper, and network graphs were provided by Bogacz himself.

Introduction - Biological intuition

A simple organism is trying to infer the size of a food item. The only source of noisy information is one photoreceptor that signals the light reflected from this item, we denote this . The non-linear function that relates size to photosensory input is assumed to be . We assume that this signal is normally distributed with mean and variance .

Part I - Bayes

We can write of the the likelihood function (probability of a size given a signal ) as


is the normal distribution with mean and variance .

Through learning or evolutionary filtering, the agent has been endowed with priors on the expected size of food items, and therefore expects sizes of food items to normally distributed with mean and where the subscript stands for prior. Formally


To compute the exact distribution of sensory input we can formulate the posterior using Bayes theorem

where the denominator is

and sum the whole range of possible sizes.

The following code implements such an exact solution and plots it.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm

sns.set(style="white", palette="muted", color_codes=True)

%matplotlib inline
# non-linear transformation of size to perceived sensory input e.g. g(phi)
def sensory_transform(input):
    sensory_output = np.square(input)
    return sensory_output

The reason we explicitly define is that we might want to change it later. For now we assume a simple non-linear relation . The following snippet of code assumes values of and plots the posterior distribution.

def exact_bayes():
    # variabels 
    v = 2 # real size of item
    sigma_u = 1 # standard deviation of noisy input
    v_p = 3 # mean of prior
    sigma_p = 1 # variance of prior / sensory noise 
    s_range = np.arange(0.01,5,0.01) # range of sensory input
    s_step = 0.01 # step size
    # exact bayes (equation 4)
    numerator = (np.multiply(norm.pdf(s_range,v_p,sigma_p),# prior
                            norm.pdf(v,sensory_transform(s_range),sigma_u))) # likelihood
    normalisation = np.sum(numerator*s_step) # denominator / model evidence / p(noisy input) (equation 5)
    posterior = numerator / normalisation # posterior
    # plot exact bayes
    plt.ylabel(r' $p(v | u)$')


Inspecting the graph, we find that approximately maximises the posterior. There are two fundamental problems with this approach

  1. The posterior does not take a standard form, and is thus described by (potentially) infinitely many moments, instead of just simple sufficient statistics, such as the mean and the variance of a gaussian.

  2. The normalisation term that sits in the numerator of Bayes formula

can be complicated and numerical solutions often rely on computationally intense algorithms, such as the Expectation-Maximisation algorithm.

Part II - Iteration using Eulers method

We are interested in a more general way of finding the value that maximises the posterior . This involves maximising the numerator of Bayes equation. As this is independent of the denominator and therefore maximising will maximise the posterior. By taking the logarithm to the numerator we get

and the dynamics can be derived (see notes) to be

The next snippit of code asumes values for and implements the above dynamics to find the value of that maximises the posterior using a manual implementation of the dynamics and iterating using Eulers method.

def simple_dyn():
    # variabels 
    v = 2 # real size of item
    sigma_u = 1 # standard deviation of noisy input
    v_p = 3 # mean of prior
    sigma_p = 1 # variance of prior / sensory noise 
    s_range = np.arange(0.01,5,0.01) # range of sensory input
    s_step = 0.01 # step size
    # assume that phi maximises the posterior 
    phi = np.zeros(np.size(s_range))
    # use Eulers method to find the most likely value of phi
    for i in range(1,len(s_range)):
        phi[0] = v_p
        phi[i] = phi[i - 1] + s_step * ( ( (v_p - phi[i - 1]) / sigma_u ) +
        ( ( v - sensory_transform(phi[i - 1]) ) / sigma_u ) * (2 * phi[i - 1]) ) # equation 12
    # plot convergence
    plt.ylabel(r' $\phi$')


It is clear that the output converges rapidly to , the value that maximises the posterior.

So we ask the question: What does a minimal and biologically plausible network model that can do such calculations look like?

Part III - Biological plausibility

Firstly, we must specify what exactly biologically plausible means.

  • A neuron only performs computations on the input it is given, weighted by its synaptic weights.

  • Synaptic plasticity of one neuron, is only based on the activity of pre-synaptic and post-synaptic activity connecting to that neuron.

Consider the dynamics of a simple network that relies on just two neurons and is coherent with the above requirements of local computation

where and are the prediction errors

that arise from the assumption that the input is normally distributed (again, see Bogaez for derivations).

Below the network dynamics are sketched. Arrows denote excitatory input, circles inhibitory input, and unless specified (), weights of connections are assumed to be one.


The next snippit of code implements those dynamics and thus, the network “learns” what value of that maximises the posterior.

def learn_phi():
    # variabels 
    v = 2 # real size of item
    sigma_u = 1 # standard deviation of noisy input
    v_p = 3 # mean of prior
    sigma_p = 1 # variance of prior / sensory noise 
    s_range = np.arange(0.01,5,0.01) # range of sensory input
    s_step = 0.01 # step size
    # preallocate
    phi = np.zeros(np.size(s_range)) 
    epsilon_e = np.zeros(np.size(s_range)) 
    epsilon_s = np.zeros(np.size(s_range))
    # dynamics of prediction errors for size (epsilon_v) and sensory input (epsilon_u)
    for i in range(1,len(s_range)):
        phi[0] = v_p # initialise best guess (prior) of size
        epsilon_e[0] = 0 # initialise prediction error for size
        epsilon_s[0] = 0 # initialise prediction error for sensory input
        phi[i] = phi[i-1] + s_step*( -epsilon_e[i-1] + epsilon_s[i-1] * ( 2*(phi[i-1]) ) ) # equation 12
        epsilon_e[i] = epsilon_e[i-1] + s_step*( phi[i-1] - v_p - sigma_u * epsilon_e[i-1] ) # equation 13
        epsilon_s[i] = epsilon_s[i-1] + s_step*( v - sensory_transform(phi[i-1]) - sigma_p * epsilon_s[i-1] ) # equation 14
    # plot network dynamics


As the figure shows, the network learns but is slower in converging than when using Eulers method, as the model relies on several other nodes that are inhibits and excites each other, which causes oscillatory behaviour. Both and oscillate and converge to the values where

Which can be understood as the steady-state of the network. This means that minimising prediction errors (by learning the correct parameters) minimises the free energy potential - a tenet in Active Inference.

Part IV - Approximate estimation of

Recall that we assumed that size was communicated via. a noisy signal assumed to be normally distributed. Above, we outlined a simple sample method for finding the mean value that maximises the posterior .

By expanding this simple model, we can esimate the variance of the normal distribution as well. Considering computation in one single node computing prediction error

where is the variance of the best guess of , . Estimation of can be achieved by adding a interneuron which is connected to the prediction error node, and receives input from this via the connection with weight encoding , such as sketched in the figure below.


The dynamics are described by

which the following snippit of code implements.

def learn_sigma():
    # variabels 
    v = 2 # observed homeostatic error 
    sigma_u = 1 # standard deviation of the homeostatic error
    v_p = 3 # mean of prior homeostatic error / noisy input / simple prior
    sigma_p = 1 # variance of prior / sensory noise 
    s_range = np.arange(0.01,5,0.01) # range of sensory input
    s_step = 0.01 # step size
    # new variabels 
    maxt = 20 # maximum number of iterations
    trials = 2000 # of trials 
    epi_length = 20 # length of episode
    alpha = 0.01 # learning rate
    mean_phi = 5 # the average value that maximises the posterior
    sigma_phi = 2 # the variance of phi
    last_phi = 5 # the last observed phi
    # preallocate
    sigma = np.zeros(trials)
    error = np.zeros(trials)
    e = np.zeros(trials)
    sigma[0] = 1 # initialise sigma in 1 
    for j in range(1,trials):
        error[0] = 0 # initialise error in zero
        e[0] = 0 # initialise interneuron e in zero
        phi = np.random.normal(5, np.sqrt(2), 1) # draw a new phi every round 
        for i in range(1,2000):
            error[i] = error[i-1] + s_step*(phi-last_phi-e[i-1]) # equation 59
            e[i] = e[i-1] + s_step*(sigma[j-1]*error[i-1]-e[i-1]) # equation 60 
        sigma[j] = sigma[j-1] + alpha*(error[-1]*e[-1]-1) # synaptic weight (Sigma) update
    # plot dynamics of Sigma
    plt.ylabel(r' $\Sigma$')


Because is constantly varying phi = np.random.normal(5, np.sqrt(2), 1) never converge to just one value, but instead to approximately 2, the variance of .