首页 > 代码库 > EM and GMM(Code)

EM and GMM(Code)

  In EM and GMM(Theory), I have introduced the theory of em algorithm for gmm. Now lets practice it in matlab!

  1. Generate 1000 pieces of  random 2-dimention data which obey 5 gaussian distribution.

function X = GenerateData
    Sigma = [1, 0; 0, 1];
    mu1 = [1, -1];
    x1 = mvnrnd(mu1, Sigma, 200);
    mu2 = [5.5, -4.5];
    x2 = mvnrnd(mu2, Sigma, 200);
    mu3 = [1, 4];
    x3 = mvnrnd(mu3, Sigma, 200);
    mu4 = [6, 4.5];
    x4 = mvnrnd(mu4, Sigma, 200);
    mu5 = [9, 0.0];
    x5 = mvnrnd(mu5, Sigma, 200);
    % obtain the 1000 data points to be clustered
    X = [x1; x2; x3; x4; x5];
end

  2. Complete em algorithm.

function [Mu, Sigma, Pi, r_nk] = EmForGmm(Data, classNum, initVal)
    % Data : Matrix(n * d), n is the quantity of the data and d is the data
    %        dimention
    % classNum : Scale
    % initVal : Cell(3 * 1), initial value for Mu, Sigma and Pi
    %           cell 1: Mu
    %           cell 2: Sigma
    %           cell 3: Pi
    [sampleNum, sampleDim] = size(Data);
    indexPoint = zeros(sampleNum, 1);
    while(1)
        for n = 1 : sampleNum
            x = Data(n, :);
            px_nk_sumk = 0;
            for k = 1 : classNum
                Sigma_k = initVal{2}(:,:,k);
                Mu_k = initVal{1}(k,:);
                Pi_k = initVal{3}(k);
                px(n,k) = (1/(2*pi^(sampleDim/2)*det(Sigma_k)^(0.5))) ...
                    * exp(-0.5 * (x - Mu_k)*inv(Sigma_k)*(x - Mu_k)‘);
                px_nk_sumk = px_nk_sumk + Pi_k * px(n, k);
            end
            for k = 1 : classNum
                Sigma_k = initVal{2}(:,:,k);
                Mu_k = initVal{1}(k,:);
                Pi_k = initVal{3}(k);
                r(n, k) = Pi_k * px(n, k) / px_nk_sumk;
            end
        end
        Nk = sum(r)‘;
        newMuK = r‘ * Data;
        Nkk = repmat(Nk,1,2);
        newMuK = newMuK ./ Nkk;
        for i = 1 : classNum
            nk = Nk(i);
            MuT = repmat(newMuK(i,:),sampleNum,1);
            xT = Data - MuT;
            rT = r(:,i);
            rT = repmat(rT,1,2);
            newSigma(:,:,i) = xT‘ * (xT .* rT) / nk;
        end
        newPiK = Nk / sampleNum;
        indexPointT = indexPoint;
        [aa,indexPoint] = max(r,[],2);
        j1 = sum(sum(abs(newMuK - initVal{1}))) < 1e-6;
        j2 = sum(sum(sum(abs(newSigma - initVal{2})))) < 1e-6;
        j3 = sum(abs(newPiK - initVal{3})) < 1e-6;
        clf;
        if (j1 && j2 && j3)
            for i = 1:sampleNum
                if (indexPoint(i)==1)
                    plot(Data(i,1), Data(i,2), ‘r.‘)
                end
                if (indexPoint(i)==2)
                    plot(Data(i,1), Data(i,2), ‘b.‘)
                end
                if (indexPoint(i)==3)
                    plot(Data(i,1), Data(i,2), ‘k.‘)
                end
                if (indexPoint(i)==4)
                    plot(Data(i,1), Data(i,2), ‘g.‘)
                end
                if (indexPoint(i)==5)
                    plot(Data(i,1), Data(i,2), ‘m.‘)
                end
                hold on;
            end
            break;
        else
            initVal{1} = newMuK;
            initVal{2} = newSigma;
            initVal{3} = newPiK;
        end
    end
    Mu = newMuK;
    Sigma = newSigma;
    Pi = newPiK;
    r_nk = r;
end

 3. Complete main function.

clear,clc,clf
Data = http://www.mamicode.com/GenerateData;>

  4. Result.

  The cluster result can be show as figure 3.

技术分享

Figure 3

  The probality distribution function can be writen as:

\[  p(\mathbf{x}) = \sum_{k=1}^{K}\pi_kp(\mathbf{x}|\mu_k\Sigma_k)  \]

  where, 

   $\mu_1  =  (1.028, -1.158) $, $\mu_2  =  (5.423, -4.538) $, $\mu_3  =  (1.036, 3.975) $,  $\mu_4  =  (5.835, 4.474) $,  $\mu_5  =  (9.074, -0.063) $

  Notice that, when generate the data:

$\mu_1  =  (1, -1) $, $\mu_2  =  (5.5, -4.5) $, $\mu_3  =  (1, 4) $, $\mu_4  =  (6, 4.5) $, $\mu_5  =  (9, 0) $)

 


 

\[
\Sigma_1 = \left(
\begin{array}{cc}
1.0873& 0.0376\\
0.0376& 0.8850
\end{array}
\right),
\Sigma_2 = \left(
\begin{array}{cc}
1.1426& 0.0509\\
0.0509& 0.9192
\end{array}
\right),
\Sigma_3 = \left(
\begin{array}{cc}
0.9752& -0.0712\\
-0.0712& 0.9871
\end{array}
\right),
\Sigma_4 = \left(
\begin{array}{cc}
1.0111& -0.0782\\
-0.0782& 1.2034
\end{array}
\right),
\Sigma_5 = \left(
\begin{array}{cc}
0.8665& -0.1527\\
-0.1527& 0.9352
\end{array}
\right)
\]

 Notice that, when generate the data: 

\[\Sigma = \left(
\begin{array}{cc}
1& 0\\
0& 1
\end{array}
\right)
\] 

 


 

 $\pi_1  =  0.1986$, $\pi_2  =  0.2004 $, $\pi_3  =  0.1992$,  $\pi_4  =  0.2015 $,  $\pi_5  =  0.2002$

 Notice that, when generate the data:  each guassian components occupy 20% of all data. (1000 data point, 200 for each guassian components)

EM and GMM(Code)