Deployable probabilistic programming

David Tolpin, BGU & PUB+

http://offtopia.net/infergo-talk/


Probabilistic Programming

Programs as statistical models:

var breastCancer = flip(0.01)
var benignCyst = flip(0.2)
var positiveMammogram = (breastCancer && flip(0.8))
                      || (benignCyst && flip(0.5))
condition(positiveMammogram)
return breastCancer
                
The example computes the probability of breast cancer given a positive mammogram.
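Worked out by hand for these numbers: P(positive | cancer) = 1 − 0.2·(1 − 0.1) = 0.82 and P(positive | no cancer) = 0.2·0.5 = 0.1, so P(cancer | positive) = 0.01·0.82 / (0.01·0.82 + 0.99·0.1) ≈ 0.077.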

Probabilistic Programs

  • Compute posterior distributions.
  • May contain conditions, loops, recursions.
  • Are (mostly) deterministic!
  • Must be run 'backwards'.
  • (Approximate) inference algorithms are required.

Probabilistic programming languages

  • 45 PPLs listed on Wikipedia.
  • 18 PPLs participated in developer meetup at PROBPROG 2018.
  • 8 of the latter are not on Wikipedia (yet).

Probabilistic programming

Two polar views:

  1. Framework for statistical modelling
  2. Inferentiable programming — programs with inferrable parameters

Challenges

  • Simulation vs. inference
  • Data
  • Deployment

Simulation vs. inference

A Stan linear-regression model: the model block specifies inference only, so simulating y for a new x requires duplicating the model logic outside of it (compare with the Go linear regression later, where the same Simulate function serves both):

data {
    int N;
    vector[N] y;
    vector[N] x;
}

parameters {
    real alpha; real beta;
    real sigma;
}
model {
    alpha ~ normal(0,10);    
    beta ~ normal(0,10);
    sigma ~ cauchy(0,5);

    for (n in 1:N)
        y[n] ~ normal(
          alpha + beta * x[n],
          sigma);
}

Data structures

I mean (plain Python):

def update_beliefs(beliefs, i, j, bandwidth):
    beliefs[i][j] += 1
    evidence = sum(beliefs[i])
    if evidence > bandwidth:
        for j in range(len(beliefs[i])):
            beliefs[i][j] *= bandwidth / evidence
      
I write (with PyTorch tensors):

def update_beliefs(beliefs, i, j, bandwidth):
    update = torch.zeros(beliefs.shape)
    update[i, j] = 1
    beliefs = beliefs + update
    evidence = beliefs[i, :].sum()
    if evidence > bandwidth:
        scale = torch.ones(beliefs.shape)
        scale[i, :] = bandwidth / evidence
        beliefs = beliefs * scale
    return beliefs
            

Deployment

PyStan (70Mb): Stan, C++ compiler, Cython, OCaml?
Pyro (>600Mb):
Successfully installed contextlib2-0.5.5
decorator-4.3.0 graphviz-0.10.1 networkx-2.2
opt-einsum-2.3.2 pyro-ppl-0.3.0 six-1.12.0
torch-1.0.0 tqdm-4.29.1
Turing.jl (50Mb):
Installed NNlib ─────────── v0.4.3
Installed Colors ────────── v0.9.5
Installed Arpack ────────── v0.3.0
Installed BinaryProvider ── v0.5.3
Installed ForwardDiff ───── v0.10.1
Installed ColorTypes ────── v0.7.5
Installed ProgressMeter ─── v0.9.0
Installed ZipFile ───────── v0.8.0
Installed StaticArrays ──── v0.10.2
Installed Juno ──────────── v0.5.3
... (~40 packages)

Guidelines

  • One language
  • Common data structures
  • Inference code re-used in simulation

Probabilistic programming in Go

Go is

  • small,
  • expressive,
  • efficient,
  • popular for server-side programming.

'hello world': model


type Model struct {
    Data []float64
}

// x[0] is the mean, x[1] is the
// log stddev of the distribution
func (m *Model) Observe(x []float64) float64 {
    // Our prior is a unit normal ...
    ll := Normal.Logps(0, 1, x...)
    // ... but the posterior is based on data observations.
    ll += Normal.Logps(x[0], math.Exp(x[1]), m.Data...)
    return ll
}
            

'hello world': data

m := &Model{[]float64{
    -0.854, 1.067, -1.220, 0.818, -0.749,
    0.805, 1.443, 1.069, 1.426, 0.308}}
            

'hello world': optimization


x := []float64{0, 0}

opt := &infer.Momentum{
    Rate:  0.01,
    Decay: 0.998,
}
for iter := 0; iter != 1000; iter++ {
    opt.Step(m, x)
}
mean, logs := x[0], x[1]
            

'hello world': posterior


x := []float64{0, 0}

hmc := &infer.HMC{
    Eps: 0.1,
}
samples := make(chan []float64)
hmc.Sample(m, x, samples)
for i := 0; i != 1000; i++ {
    x = <-samples
}
hmc.Stop()
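The loop above keeps only the last draw; a minimal sketch (not from the talk) of summarizing the draws instead, here by their posterior mean:

sum := make([]float64, len(x))
for i := 0; i != 1000; i++ {
    x := <-samples
    for j := range x {
        sum[j] += x[j]
    }
}
hmc.Stop()
// posterior-mean estimates of the parameters
mean, logs := sum[0]/1000, sum[1]/1000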
            

'hello world': streaming


type Model struct {
    Data chan float64  // data is a channel
    N    int           // batch size
}

func (m *Model) Observe(x []float64) float64 {
    ll := Normal.Logps(0, 1, x...)
    // observe a batch of data from the channel
    for i := 0; i != m.N; i++ {
        ll += Normal.Logp(x[0], math.Exp(x[1]), <-m.Data)
    }
    return ll
}
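A usage sketch (assumptions: the Momentum optimizer from the optimization slide, and rand.NormFloat64 as a stand-in data source) that streams observations into the channel while optimizing:

m := &Model{Data: make(chan float64), N: 10}
go func() { // producer goroutine streams observations
    for {
        m.Data <- rand.NormFloat64()
    }
}()

x := []float64{0, 0}
opt := &infer.Momentum{Rate: 0.01, Decay: 0.998}
for iter := 0; iter != 1000; iter++ {
    opt.Step(m, x) // each step calls Observe, consuming a batch of N observations
}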
            

Why Go?

  • Comes with parser and type checker.
  • Compiles and runs fast.
  • Allows efficient parallel execution, via goroutines.

Infergo

  • Models are written in Go.
  • Relies on automatic differentiation for inference.
  • Works anywhere where Go does.
  • No external dependencies.
  • Licensed under the MIT license.

Model

Model interface:

type Model interface {
    Observe(parameters []float64) float64
}
            
A model (exponential distribution):

type Expon float64

func (m Expon) Observe(x []float64) float64 {
    // log density up to an additive constant, for fixed rate m
    return -x[0] * float64(m)
}
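A usage sketch (assuming the model source has been differentiated by Infergo; FuncGrad, shown later, exposes the model as plain functions):

m := Expon(2.0)
Func, Grad := infer.FuncGrad(m)

grad := make([]float64, 1)
ll := Func([]float64{0.5}) // ll == -1.0 (= -x[0]*m)
Grad(grad, []float64{0.5}) // grad[0] == -2.0 (= -m)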
            

Distributions

A distribution is a model:

var Normal normal

// Observe implements the Model interface.
func (dist normal) Observe(x []float64) float64 {
    mu, sigma, y := x[0], x[1], x[2:]
    return dist.Logps(mu, sigma, y...)
}
            

Distributions (cont.)


// Logp computes the log pdf of a single observation.
func (_ normal) Logp(mu, sigma float64, y float64) float64 {
    ...
}

// Logps computes the log pdf of a vector of observations.
func (_ normal) Logps(mu, sigma float64, y ...float64) float64 {
    ...
}
        

Automatic differentiation

  • Numeric differentiation: $\frac {df} {dx} \approx \frac {f (x + \Delta x) - f(x)} {\Delta x}$
  • Symbolic differentiation: $\frac {d(uv)} {dx} = u\frac{dv} {dx} + v \frac{du} {dx}$
  • Algorithmic differentiation
    • Operator overloading.
    • Source code transformation.
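
To make the contrast concrete, a minimal self-contained sketch (not from the talk) of the numeric approximation for f(x) = x², whose exact derivative is 2x:

package main

import "fmt"

func main() {
	f := func(x float64) float64 { return x * x }
	x, dx := 3.0, 1e-6
	numeric := (f(x+dx) - f(x)) / dx // finite difference, ≈ 6.000001
	exact := 2 * x                   // symbolic/algorithmic result, 6
	fmt.Println(numeric, exact)
}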

Differentiation in Infergo

  • Model methods returning float64 or nothing are differentiated.
  • Within the methods, the following is differentiated:
    • assignments to float64;
    • returns of float64;
    • standalone calls to differentiated model methods.

Differentiation in Infergo

Reverse-mode via source code transformation:

func (m *Model) Observe(x []float64) float64 {
    var ll float64
    ad.Assignment(&ll, ad.Call(func(_ []float64) {
        Normal.Logps(0, 0, x...)
    }, 2, ad.Value(0), ad.Value(1)))
    ad.Assignment(&ll,
        ad.Arithmetic(ad.OpAdd, &ll,
            ad.Call(func(_ []float64) {
                Normal.Logps(0, 0, m.Data...)
            }, 2, &x[0], ad.Elemental(math.Exp, &x[1]))))
    return ad.Return(&ll)
}
        

Differentiation in Infergo

Tape


type adTape struct {
    records    []record    // recorded instructions
    places     []*float64  // variable places
    values     []float64   // stored values
    elementals []elemental // gradients of elementals
    cstack     []counters  // counter stack (see below)
}
              

Differentiation in Infergo

Tape


func Arithmetic(op int, px ...*float64) *float64 {
    tape := tapes.get()
    // Register
    p := Value(0)
    r := record{
        typ: typArithmetic,
        op:  op,
        p:   len(tape.places),
    }
    tape.places = append(tape.places, p)
    tape.places = append(tape.places, px...)
    tape.records = append(tape.records, r)
    // Run
    switch op {
    case OpNeg:
        *p = -*px[0]
        //  ...
    }
    return p
}
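
The tape is then consumed by the gradient computation. An illustrative, self-contained sketch (not Infergo's actual code) of a reverse pass over a one-record tape, for y = -x:

package main

import "fmt"

// a toy record: negation of places[in] stored into places[out]
type record struct{ in, out int }

func main() {
	// forward pass, recording on the tape
	places := []float64{3.0, 0.0} // places[0] = x, places[1] = y
	tape := []record{{in: 0, out: 1}}
	places[1] = -places[0]

	// reverse pass: seed dL/dy = 1, walk the records backwards,
	// applying the chain rule of each recorded operation
	adjoint := make([]float64, len(places))
	adjoint[1] = 1
	for i := len(tape) - 1; i >= 0; i-- {
		r := tape[i]
		adjoint[r.in] += -1 * adjoint[r.out] // d(-x)/dx = -1
	}
	fmt.Println(adjoint[0]) // -1
}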
          

Inference

  • Optimization via gradient ascent (Momentum, Adam), works with streamed and stochastic data.
  • Full posterior inference via HMC variants. Samples are produced concurrently and passed through a channel.
  • Third-party inference algorithms.

Third-party inference

  • Straightforward because Infergo uses built-in Go float64.
  • Gonum (http://gonum.org/) library for numerical computation:
    
    func FuncGrad(m Model) (
        Func func(x []float64) float64,
        Grad func(grad []float64, x []float64))
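
A usage sketch (assumptions: Gonum's optimize and floats packages, and a model m and initial point x as in the earlier slides; the log-likelihood is negated because Gonum minimizes):

Func, Grad := infer.FuncGrad(m)
problem := optimize.Problem{
    Func: func(x []float64) float64 { return -Func(x) },
    Grad: func(grad, x []float64) {
        Grad(grad, x)
        floats.Scale(-1, grad) // flip the gradient sign, too
    },
}
result, err := optimize.Minimize(problem, x, nil, nil)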
                

Examples

  • Simple models are still simple.
  • Complex models are less complex.

8 schools

Stan:

data {
    int<lower=0> J;
    vector[J] y;
    vector<lower=0>[J] sigma;
}

parameters {
    real mu;
    real<lower=0> tau;
    vector[J] eta;
}

transformed parameters {
    vector[J] theta;
    theta = mu + tau * eta;
}

model {
    eta ~ normal(0, 1);
    y ~ normal(theta, sigma);
}

Infergo:

type Model struct {
    J     int
    Y     []float64
    Sigma []float64
}

func (m *Model) Observe(x []float64) float64 {
    mu := x[0]
    tau := math.Exp(x[1])
    eta := x[2:]

    ll := Normal.Logps(0, 1, eta...)
    for i, y := range m.Y {
        theta := mu + tau*eta[i]
        ll += Normal.Logp(theta, m.Sigma[i], y)
    }
    return ll
}
        

Linear regression

Inference reuses the simulation code (the Simulate method) in the likelihood:

type Model struct {
    Data [][]float64
}

func (m *Model) Observe(x []float64) float64 {
    ll := 0.

    alpha, beta := x[0], x[1]
    sigma := math.Exp(x[2])

    for i := range m.Data {
        ll += Normal.Logp(
            m.Simulate(m.Data[i][0], alpha, beta),
            sigma, m.Data[i][1])
    }
    return ll
}

// Simulate predicts y for x based on
// inferred parameters.
func (m *Model) Simulate(x, alpha, beta float64) float64 {
    y := alpha + beta*x
    return y
}
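
A usage sketch (xNew is a hypothetical new input; alpha and beta come from the optimization above): the same Simulate then yields predictions.

// predict the response for a new input with the inferred parameters
yNew := m.Simulate(xNew, alpha, beta)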
        

Latent Dirichlet allocation

Infergo (0.5s|8.9s) vs. Stan (54s|3.7s)

Infergo:

type LDAModel struct {
    K     int       // num topics
    V     int       // num words
    M     int       // num docs
    N     int       // total word instances
    Word  []int     // word n
    Doc   []int     // doc for word n
    Alpha []float64 // topic prior
    Beta  []float64 // word prior
}

func (m *LDAModel) Observe(x []float64) float64 {
    ll := Normal.Logps(0, 1, x...)
    theta := make([][]float64, m.M)
    D.Simplices(&x, m.K, theta)
    phi := make([][]float64, m.K)
    D.Simplices(&x, m.V, phi)

    // priors
    ll += Dirichlet{m.K}.Logps(m.Alpha, theta...)
    ll += Dirichlet{m.V}.Logps(m.Beta, phi...)

    gamma := make([]float64, m.K)
    for in := 0; in != m.N; in++ {
        for ik := 0; ik != m.K; ik++ {
            gamma[ik] = math.Log(theta[m.Doc[in]-1][ik]) +
                math.Log(phi[ik][m.Word[in]-1])
        }
        ll += D.LogSumExp(gamma)
    }
    return ll
}

        

Stan:

data {
    int K;           // num topics
    int V;           // num words
    int M;           // num docs
    int N;           // total word instances
    int w[N];        // word n
    int doc[N];      // doc for word n
    vector[K] alpha; // topic prior
    vector[V] beta;  // word prior
}

parameters {
    simplex[K] theta[M]; // topic dist for doc m
    simplex[V] phi[K];   // word dist for topic k
}

model {
    for (m in 1:M)
        theta[m] ~ dirichlet(alpha);  // prior
    for (k in 1:K)
        phi[k] ~ dirichlet(beta);     // prior

    for (n in 1:N) {
        real gamma[K];
        for (k in 1:K)
            gamma[k] <- log(theta[doc[n],k])
                + log(phi[k,w[n]]);
        increment_log_prob(log_sum_exp(gamma));
    }
}
        

Acknowledgements

  • Frank Wood introduced me to probabilistic programming.
  • Jan-Willem van de Meent discussed with me motives, ideas, and implementation choices behind Infergo.
  • PUB+ supported me in the development of Infergo.

Thank you!

[Image: Gehenna Gopher]