Gaussian mixture models in PyMC

%matplotlib inline
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sb

Generate and plot some sample data.

# (mean, std, mixing probability) for each mixture component
params = [
    (150, 10, 0.7),
    (200, 20, 0.3)
]

no_samples = 1000
dist = [np.random.normal(mu, sigma, math.floor(mixing_prob * no_samples))
        for (mu, sigma, mixing_prob) in params]
xs = np.array([item for sublist in dist for item in sublist])  # flatten into one sample
sb.distplot(xs)

(figure: histogram of the generated samples)

Use a categorical variable to model the assignment of each observation to a cluster; p represents the mixing probability.

import pymc as pm

p = pm.Uniform("p", 0, 1)  # prior on the mixing probability
assignment = pm.Categorical("assignment", [p, 1 - p], size=xs.shape[0])

Model the stds as uniform on [0, 100] and the centers as normally distributed around the peaks visible in the histogram. The precisions on the centers are chosen to be sufficiently vague (a precision of 0.01 corresponds to a prior standard deviation of 10).

taus = 1.0 / pm.Uniform("stds", 0, 100, size=2) ** 2  # precision = 1 / sigma^2
centers = pm.Normal("centers", [145, 210], [0.01, 0.02], size=2)

@pm.deterministic
def center_i(assignment=assignment, centers=centers):
    # Each observation's center, selected by its cluster assignment
    return centers[assignment]

@pm.deterministic
def tau_i(assignment=assignment, taus=taus):
    # Each observation's precision, selected by its cluster assignment
    return taus[assignment]

observations = pm.Normal("obs", center_i, tau_i, value=xs, observed=True)

model = pm.Model([p, assignment, observations, taus, centers])
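
For reference, the generative model this encodes is

$$z_i \sim \mathrm{Categorical}(p,\ 1 - p), \qquad x_i \mid z_i \sim \mathcal{N}\!\left(\mu_{z_i},\ \tau_{z_i}^{-1}\right),$$

where $\tau_k = 1/\sigma_k^2$ is the precision of component $k$ (PyMC's Normal is parameterised by precision rather than variance, hence the 1/std² above).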

Use MCMC to fit the parameters of the model.

mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000)  # 40,000 steps, discarding the first 10,000 as burn-in
mcmc.sample(10000)         # extend the chain by a further 10,000 steps
 [-----------------100%-----------------] 10000 of 10000 complete in 5.9 sec
pm.Matplot.plot(mcmc)
Plotting p
Plotting centers_0
Plotting centers_1
Plotting stds_0
Plotting stds_1

(figures: trace and posterior plots for p, centers, and stds)

# mcmc.stats() summarises the posterior, e.g. the 95% HPD interval of the stds:
# mcmc.stats()['stds']['95% HPD interval']
center_trace = mcmc.trace("centers")[:]

centers = [['Center 1', x] for x in center_trace[:, 0]]
centers += [['Center 2', x] for x in center_trace[:, 1]]

df = pd.DataFrame(data=centers)
df.columns = ['Posterior', 'Value']

g = sb.FacetGrid(df, col='Posterior')
g.map(plt.hist, 'Value');

(figure: posterior histograms of the two centers)

std_trace = mcmc.trace("stds")[:]

stds = [['Std 1', x] for x in std_trace[:, 0]]
stds += [['Std 2', x] for x in std_trace[:, 1]]

df = pd.DataFrame(data=stds)
df.columns = ['Std', 'Value']

g = sb.FacetGrid(df, col='Std')
g.map(plt.hist, 'Value');

(figure: posterior histograms of the two stds)
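
Before comparing with scikit-learn, a quick sanity check on the mixing weight: the data were generated with mixing probability 0.7, so the posterior mean of p should land near 0.7 (or 0.3, if the chain labels the clusters the other way round). A minimal sketch:

p_trace = mcmc.trace("p")[:]
p_trace.mean()  # expect roughly 0.7 (or 0.3 under label switching)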

Use scikit-learn's Gaussian mixture model to compare results. (Note: the mixture.GMM class used here has since been replaced by mixture.GaussianMixture in newer scikit-learn releases.)

from sklearn import mixture

xs1 = xs.reshape(-1, 1)  # sklearn expects shape (n_samples, n_features)
m = mixture.GMM(2).fit(xs1)

m.means_
array([[ 199.04754288],
       [ 150.20537679]])
m.covars_ ** 0.5  # convert variances back to standard deviations
array([[ 20.8382151 ],
       [  9.98902017]])

The scikit-learn estimates agree well with the PyMC posteriors.
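
To put numbers next to this, the posterior means can be read straight off the traces collected earlier (a minimal sketch reusing center_trace and std_trace from above):

center_trace.mean(axis=0)  # posterior means of the two centers
std_trace.mean(axis=0)     # posterior means of the two stds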

MCMC didn't scale very well with more observed data; this was probably due to the low acceptance rate of the categorical assignment variable. The model was also sensitive to the priors on the center variables, particularly the precision parameters.
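
One way to probe the acceptance-rate hypothesis is to inspect the sampler directly (a sketch, assuming PyMC2's Metropolis-family step methods, which track accepted and rejected proposal counts):

sm = mcmc.step_method_dict[assignment][0]  # step method handling the categorical variable
sm.accepted / (sm.accepted + sm.rejected)  # fraction of proposals accepted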