Gaussian mixture models in PyMC.
%matplotlib inline
import math
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import pandas as pd
Generate and plot some sample data.
params = [(150, 10, 0.7), (200, 20, 0.3)]
no_samples = 1000

dist = [np.random.normal(mu, sigma, math.floor(mixing_prob * no_samples))
        for (mu, sigma, mixing_prob) in params]
xs = np.array([item for sublist in dist for item in sublist])
Use a categorical variable to model the assignment of each observation to a cluster; p represents the mixing probability.
import pymc as pm

p = pm.Uniform("p", 0, 1)
assignment = pm.Categorical("assignment", [p, 1 - p], size=xs.shape)
Model the stds as uniform on [0, 100] and the centers as normally distributed at roughly the peaks visible in the histogram. Pick the precisions to be sufficiently vague.
taus = 1.0 / pm.Uniform("stds", 0, 100, size=2) ** 2
centers = pm.Normal("centers", [145, 210], [0.01, 0.02], size=2)
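PyMC2's Normal is parameterized by a precision τ = 1/σ² rather than a standard deviation, so the prior precisions of 0.01 and 0.02 above correspond to prior standard deviations of 10 and roughly 7 on the two centers. A quick illustration of the conversion (not part of the model code):

```python
import math

# PyMC2's Normal takes a precision tau = 1 / sigma**2,
# so the implied prior standard deviation is 1 / sqrt(tau).
for tau in (0.01, 0.02):
    print("tau = %.2f -> sigma = %.2f" % (tau, 1 / math.sqrt(tau)))
```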
@pm.deterministic
def center_i(assignment=assignment, centers=centers):
    return centers[assignment]

@pm.deterministic
def tau_i(assignment=assignment, taus=taus):
    return taus[assignment]

observations = pm.Normal("obs", center_i, tau_i, value=xs, observed=True)
model = pm.Model([p, assignment, observations, taus, centers])
Use MCMC to fit the parameters of the model.
mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000)
mcmc.sample(10000)
[-----------------100%-----------------] 10000 of 10000 complete in 5.9 sec
pm.Matplot.plot(mcmc)

Plotting p
Plotting centers_0
Plotting centers_1
Plotting stds_0
Plotting stds_1
center_trace = mcmc.trace("centers")[:]
centers = [['Center 1', x] for x in center_trace[:, 0]]
centers += [['Center 2', x] for x in center_trace[:, 1]]
df = pd.DataFrame(data=centers)
df.columns = ['Posterior', 'Value']
g = sb.FacetGrid(df, col='Posterior')
g.map(plt.hist, 'Value');
std_trace = mcmc.trace("stds")[:]
stds = [['Std 1', x] for x in std_trace[:, 0]]
stds += [['Std 2', x] for x in std_trace[:, 1]]
df = pd.DataFrame(data=stds)
df.columns = ['Std', 'Value']
g = sb.FacetGrid(df, col='Std')
g.map(plt.hist, 'Value');
Use scikit-learn's Gaussian mixture model to compare results.
from sklearn import mixture

xs1 = np.array([[x] for x in xs])
m = mixture.GMM(2).fit(xs1)
m.means_
array([[ 199.04754288], [ 150.20537679]])
m.covars_ ** 0.5
array([[ 20.8382151 ], [ 9.98902017]])
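Note that GMM was deprecated and later removed from scikit-learn in favour of GaussianMixture. A self-contained sketch of the same check with the newer API, regenerating the sample data with a fixed seed (the exact numbers will differ slightly from the run above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
params = [(150, 10, 0.7), (200, 20, 0.3)]
no_samples = 1000

# Regenerate the two-component sample data from the mixture parameters.
xs = np.concatenate([rng.normal(mu, sigma, int(mixing_prob * no_samples))
                     for mu, sigma, mixing_prob in params])

m = GaussianMixture(n_components=2, random_state=0).fit(xs.reshape(-1, 1))
print(m.means_.ravel())                 # component means, roughly 150 and 200
print(np.sqrt(m.covariances_.ravel()))  # component stds, roughly 10 and 20
print(m.weights_)                       # mixing weights, roughly 0.7 and 0.3
```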
The GMM model corresponds well to the PyMC model.
MCMC didn’t scale very well with more observed data; this was probably due to the low acceptance rate of the categorical assignment variable. The model was also sensitive to the priors on the center variables, particularly the precision parameters.
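One standard remedy for the poor mixing of the discrete assignment variable is to marginalize it out of the likelihood, so the sampler only has to explore continuous parameters: each observation then contributes log(p · N(x | μ₁, σ₁) + (1 − p) · N(x | μ₂, σ₂)). A NumPy sketch of that marginal log-likelihood (a hypothetical helper, not part of the model above), computed with log-sum-exp for numerical stability:

```python
import numpy as np

def mixture_loglike(xs, p, mus, sigmas):
    """Log-likelihood of xs under p*N(mu1, sigma1) + (1-p)*N(mu2, sigma2)."""
    xs = np.asarray(xs, dtype=float)[:, None]   # shape (n, 1)
    mus = np.asarray(mus, dtype=float)          # shape (2,)
    sigmas = np.asarray(sigmas, dtype=float)    # shape (2,)
    weights = np.array([p, 1.0 - p])
    # Per-point, per-component weighted log densities, shape (n, 2).
    log_comp = (np.log(weights)
                - 0.5 * np.log(2 * np.pi * sigmas ** 2)
                - (xs - mus) ** 2 / (2 * sigmas ** 2))
    # Log-sum-exp over the two components, then sum over data points.
    m = log_comp.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))).sum())
```

PyMC3 later provided this marginalized form directly as pm.NormalMixture.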