Comparative Study on Effects of Parameter Estimation of Mixture Models under Different Types of Data

Statistics and Application
Vol.06 No.04(2017), Article ID:22513,10 pages
10.12677/SA.2017.64054


Xiaoying Wang, Yinghua Li, Xuemei Yang

School of Mathematics and Physics, North China Electric Power University, Beijing

Received: Oct. 8th, 2017; accepted: Oct. 23rd, 2017; published: Oct. 30th, 2017

ABSTRACT

The normal mixture model is widely used to describe data, but it is easily influenced by outliers, and the maximum likelihood estimates of its parameters are not robust. Because of its heavy tails, the t-distribution mixture model is more robust than the Gaussian mixture model when analyzing data with heavier-than-normal tails. In this paper we study a univariate t mixture model. Based on the EM algorithm, we derive the iteration steps of the maximum likelihood estimation of the model's unknown parameters, with initial values supplied by the k-means method. We then carry out a comparative analysis on three types of simulated data. The simulation study shows that this model has an advantage in fitting data with heavier-than-normal tails.

Keywords: EM Algorithm, Mixture t-Distribution Model, k-Means Initialization

Copyright © 2017 by authors and Hans Publishers Inc.

1. Introduction

2. Finite t-Distribution Mixture Model

2.1. Univariate Student t-Distribution

$t\left(y\mid\mu,\sigma,\nu\right)=\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\left(\frac{1}{\pi\sigma^{2}\nu}\right)^{\frac{1}{2}}\left[1+\frac{\left(y-\mu\right)^{2}}{\sigma^{2}\nu}\right]^{-\frac{\nu+1}{2}}$ (1)
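As a sanity check (not part of the paper), density (1) can be coded directly from the formula and compared against SciPy's location-scale t distribution; the function and parameter names below are illustrative:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import t as student_t

def t_pdf(y, mu, sigma, nu):
    """Univariate Student t density of Eq. (1), via log-gamma for numerical stability."""
    z = (y - mu) ** 2 / (sigma ** 2 * nu)
    logc = gammaln((nu + 1) / 2) - gammaln(nu / 2) - 0.5 * np.log(np.pi * sigma ** 2 * nu)
    return np.exp(logc - (nu + 1) / 2 * np.log1p(z))

# Compare with scipy's t distribution (df = nu, loc = mu, scale = sigma)
y = np.linspace(-5, 5, 11)
ref = student_t.pdf(y, df=3, loc=0.5, scale=1.2)
ours = t_pdf(y, mu=0.5, sigma=1.2, nu=3)
assert np.allclose(ours, ref)
```

The `log1p`/`gammaln` formulation avoids overflow for large $\nu$ or extreme $y$.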

2.2. Finite Mixture of t-Distributions

the probability density function of the $k$-th component, where $\theta_{k}=\left\{\mu_{k},\sigma_{k},\nu_{k}\right\}$ are the parameters of the corresponding component density and $\Psi=\left\{\pi_{1},\cdots,\pi_{m},\theta_{1},\cdots,\theta_{m}\right\}$ is the parameter space. The finite mixture of t-distributions can therefore be written as

$t\left(y\mid\theta\right)=\sum_{k=1}^{m}\pi_{k}\,t\left(y\mid\theta_{k}\right)$ (2)

For the three-component model ($m=3$) considered below, this becomes

$t\left(y\mid\theta\right)=\pi_{1}t\left(y\mid\theta_{1}\right)+\pi_{2}t\left(y\mid\theta_{2}\right)+\left(1-\pi_{1}-\pi_{2}\right)t\left(y\mid\theta_{3}\right)$ (3)
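A minimal sketch of the three-component mixture (3); the weights and component parameters are hypothetical values chosen only for illustration:

```python
import numpy as np
from scipy.stats import t as student_t
from scipy.integrate import quad

# Hypothetical parameters: (mu_k, sigma_k, nu_k) per component, weights summing to 1
weights = [0.3, 0.5, 0.2]   # pi_1, pi_2, 1 - pi_1 - pi_2
comps = [(-2.0, 0.8, 3.0), (0.5, 1.0, 3.0), (3.0, 1.5, 3.0)]

def mix_pdf(y):
    """Three-component t-mixture density, Eq. (3)."""
    return sum(w * student_t.pdf(y, df=nu, loc=mu, scale=s)
               for w, (mu, s, nu) in zip(weights, comps))

# A valid density must integrate to 1
total, _ = quad(mix_pdf, -np.inf, np.inf)
assert abs(total - 1.0) < 1e-4
```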

3. EM Algorithm for Maximum Likelihood Estimation of the Model Parameters

Let $U=\left\{u_{1},u_{2},\cdots,u_{N}\right\}$ be a further latent variable. Given $z_{ik}=1$, the $u_{i}$ are independent and identically distributed with

$u_{i}\mid z_{ik}=1\sim G\left(\frac{\nu}{2},\frac{\nu}{2}\right)$

that is, a gamma distribution with shape and rate both equal to $\nu/2$. Let $Y=\left\{y_{1},\cdots,y_{N}\right\}$ be the observed data set; then $y_{i}\mid u_{i},z_{ik}=1\sim N\left(\mu_{k},\sigma_{k}^{2}/u_{i}\right)$. From (1) and (3),

$p\left(Y,Z,U\mid\Psi\right)=\prod_{i=1}^{N}p\left(y_{i},z_{i1},z_{i2},z_{i3},u_{i}\mid\Psi\right)=\prod_{i=1}^{N}\prod_{k=1}^{3}\left\{\frac{\sqrt{u_{i}}}{\sqrt{2\pi}\,\sigma_{k}}\mathrm{e}^{-\frac{u_{i}\left(y_{i}-\mu_{k}\right)^{2}}{2\sigma_{k}^{2}}}\left[\frac{\left(\frac{\nu}{2}\right)^{\frac{\nu}{2}}}{\Gamma\left(\frac{\nu}{2}\right)}\mathrm{e}^{-\frac{\nu}{2}u_{i}}u_{i}^{\frac{\nu}{2}-1}\right]\pi_{k}\right\}^{z_{ik}}$ (4)

$\begin{aligned}\log p\left(Y,Z,U\mid\Psi\right)&=\sum_{k=1}^{3}\sum_{i=1}^{N}z_{ik}\left\{\frac{\nu-1}{2}\log u_{i}-\frac{1}{2}\log 2\pi-\log\sigma_{k}-\frac{u_{i}\left(y_{i}-\mu_{k}\right)^{2}}{2\sigma_{k}^{2}}+\frac{\nu}{2}\log\frac{\nu}{2}-\log\Gamma\left(\frac{\nu}{2}\right)-\frac{\nu}{2}u_{i}+\log\pi_{k}\right\}\\&=\sum_{k=1}^{3}\sum_{i=1}^{N}z_{ik}\log\pi_{k}\\&\quad+\sum_{k=1}^{3}\sum_{i=1}^{N}z_{ik}\left\{-\log\Gamma\left(\frac{\nu}{2}\right)+\frac{\nu}{2}\log\frac{\nu}{2}+\frac{\nu}{2}\left(\log u_{i}-u_{i}\right)\right\}\\&\quad+\sum_{k=1}^{3}\sum_{i=1}^{N}z_{ik}\left\{-\frac{1}{2}\log 2\pi-\frac{1}{2}\log u_{i}-\log\sigma_{k}-\frac{u_{i}\left(y_{i}-\mu_{k}\right)^{2}}{2\sigma_{k}^{2}}\right\}\end{aligned}$ (5)
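As an aside (not from the paper), the scale-mixture construction above, $u_{i}\sim G(\nu/2,\nu/2)$ followed by $y_{i}\mid u_{i}\sim N(\mu,\sigma^{2}/u_{i})$, can be checked by simulation: the resulting $y_{i}$ should have the variance $\sigma^{2}\nu/(\nu-2)$ of a $t(\mu,\sigma,\nu)$ variable. The parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, nu = 0.0, 1.0, 5.0
N = 200_000

# u_i ~ Gamma(shape = nu/2, rate = nu/2), i.e. scale = 2/nu
u = rng.gamma(shape=nu / 2, scale=2 / nu, size=N)
# y_i | u_i ~ N(mu, sigma^2 / u_i)
y = rng.normal(mu, sigma / np.sqrt(u))

# A t(mu, sigma, nu) variable has variance sigma^2 * nu / (nu - 2) for nu > 2
emp, theo = y.var(), sigma ** 2 * nu / (nu - 2)
assert abs(emp - theo) / theo < 0.05
```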

The EM algorithm is an iterative approximation method with two alternating steps: the E-step takes the expectation of the complete-data log-likelihood, and the M-step maximizes that expectation to obtain updated parameter values.

E-step: $Q\left(\Psi\mid\Psi^{(j)}\right)=E_{Z,U}\left[\log p\left(Y,Z,U\mid\Psi\right)\mid Y,\Psi^{(j)}\right]$

$E_{Z,U}\left[z_{ik}\mid Y,\Psi^{(j)}\right]=\tau_{ik}^{(j)}=\frac{\pi_{k}^{(j)}\,t\left(y_{i}\mid\theta_{k}^{(j)}\right)}{\sum_{l=1}^{3}\pi_{l}^{(j)}\,t\left(y_{i}\mid\theta_{l}^{(j)}\right)}$ (6)

$E_{Z,U}\left[u_{ik}\mid y,\Psi^{(j)}\right]=u_{ik}^{(j)}=\frac{\nu^{(j)}+1}{\nu^{(j)}+\left(y_{i}-\mu_{k}^{(j)}\right)^{2}/\sigma_{k}^{2(j)}}$ (7)

$E_{Z,U}\left[\log u_{ik}\mid y,\Psi^{(j)}\right]=l_{ik}^{(j)}=\log u_{ik}^{(j)}+\psi\left(\frac{\nu^{(j)}+1}{2}\right)-\log\left(\frac{\nu^{(j)}+1}{2}\right)$ (8)

where $\psi(\cdot)$ denotes the digamma function.
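Equations (6)-(8) translate almost line-for-line into code. The sketch below uses made-up parameter values and assumes a common degree of freedom $\nu$ across components:

```python
import numpy as np
from scipy.stats import t as student_t
from scipy.special import digamma

def e_step(y, pis, mus, sigmas, nu):
    """E-step: responsibilities (6), conditional weights (7), and E[log u] (8)."""
    # tau_ik: posterior probability that y_i came from component k, Eq. (6)
    dens = np.stack([pi * student_t.pdf(y, df=nu, loc=mu, scale=s)
                     for pi, mu, s in zip(pis, mus, sigmas)], axis=1)
    tau = dens / dens.sum(axis=1, keepdims=True)
    # u_ik: E[u_i | y_i, z_ik = 1], Eq. (7)
    d2 = (y[:, None] - np.asarray(mus)) ** 2 / np.asarray(sigmas) ** 2
    u = (nu + 1) / (nu + d2)
    # l_ik: E[log u_i | y_i, z_ik = 1], Eq. (8)
    l = np.log(u) + digamma((nu + 1) / 2) - np.log((nu + 1) / 2)
    return tau, u, l

y = np.array([-1.0, 0.0, 2.5])
tau, u, l = e_step(y, pis=[0.3, 0.5, 0.2], mus=[-2.0, 0.0, 3.0],
                   sigmas=[1.0, 1.0, 1.0], nu=3.0)
assert np.allclose(tau.sum(axis=1), 1.0)   # responsibilities sum to 1 per point
```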

$\begin{aligned}Q\left(\Psi\mid\Psi^{(j)}\right)&=\sum_{k=1}^{3}\sum_{i=1}^{N}\tau_{ik}^{(j)}\log\pi_{k}\\&\quad+\sum_{k=1}^{3}\sum_{i=1}^{N}\tau_{ik}^{(j)}\left\{-\log\Gamma\left(\frac{\nu}{2}\right)+\frac{\nu}{2}\log\frac{\nu}{2}+\frac{\nu}{2}\left(\log u_{ik}^{(j)}+\psi\left(\frac{\nu^{(j)}+1}{2}\right)-\log\frac{\nu^{(j)}+1}{2}-u_{ik}^{(j)}\right)\right\}\\&\quad+\sum_{k=1}^{3}\sum_{i=1}^{N}\tau_{ik}^{(j)}\left\{-\frac{1}{2}\log 2\pi-\log\sigma_{k}-\frac{u_{ik}^{(j)}\left(y_{i}-\mu_{k}\right)^{2}}{2\sigma_{k}^{2}}-\frac{1}{2}\left[\log u_{ik}^{(j)}+\psi\left(\frac{\nu^{(j)}+1}{2}\right)-\log\frac{\nu^{(j)}+1}{2}\right]\right\}\end{aligned}$ (9)

M-step: $\Psi^{(j+1)}=\underset{\Psi}{\arg\max}\,Q\left(\Psi\mid\Psi^{(j)}\right)$

${\pi }_{k}^{\left(j+1\right)}=\sum _{i=1}^{N}{\tau }_{ik}^{\left(j\right)}/N$ (10)

${\mu }_{k}^{\left(j+1\right)}=\frac{\sum _{i=1}^{N}{\tau }_{ik}^{\left(j\right)}{u}_{ik}^{\left(j\right)}{y}_{i}}{\sum _{i=1}^{N}{\tau }_{ik}^{\left(j\right)}{u}_{ik}^{\left(j\right)}}$ (11)

$\sigma_{k}^{(j+1)}=\sqrt{\frac{\sum_{i=1}^{N}\tau_{ik}^{(j)}u_{ik}^{(j)}\left(y_{i}-\mu_{k}^{(j+1)}\right)^{2}}{\sum_{i=1}^{N}\tau_{ik}^{(j)}u_{ik}^{(j)}}}$ (12)

$\mathrm{log}\frac{\nu }{2}-\psi \left(\frac{\nu }{2}\right)+1+\frac{1}{\sum _{i=1}^{N}{\tau }_{ik}^{\left(j\right)}}\sum _{i=1}^{N}{\tau }_{ik}^{\left(j\right)}\left({l}_{ik}^{\left(j\right)}-{u}_{ik}^{\left(j\right)}\right)=0$ (13)
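Equation (13) has no closed-form solution and is typically solved numerically. Below is a sketch of the full M-step, assuming a single pooled $\nu$ (so (13) is averaged over all components, an assumption of this sketch) and a bracketing root finder; the toy inputs at the end are hypothetical:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def m_step(y, tau, u, l):
    """M-step: updates (10)-(12) plus the root of Eq. (13) for a common nu."""
    pis = tau.sum(axis=0) / len(y)                                      # Eq. (10)
    w = tau * u
    mus = (w * y[:, None]).sum(axis=0) / w.sum(axis=0)                  # Eq. (11)
    var = (w * (y[:, None] - mus) ** 2).sum(axis=0) / w.sum(axis=0)
    sigmas = np.sqrt(var)                                               # Eq. (12)
    # Eq. (13): log(nu/2) - psi(nu/2) + 1 + c = 0, with c the tau-weighted
    # average of (l_ik - u_ik); here pooled over all components.
    c = (tau * (l - u)).sum() / tau.sum()
    f = lambda nu: np.log(nu / 2) - digamma(nu / 2) + 1 + c
    nu = brentq(f, 0.1, 200.0)   # f > 0 near 0 and f -> 1 + c < 0, so a root exists
    return pis, mus, sigmas, nu

# Toy demo: uniform responsibilities, u and l built from Eqs. (7)-(8) with nu = 3
y = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
tau = np.full((5, 3), 1 / 3)
d2 = (y[:, None] - np.array([-1.0, 0.0, 1.0])) ** 2
u = (3.0 + 1) / (3.0 + d2)
l = np.log(u) + digamma(2.0) - np.log(2.0)
pis, mus, sigmas, nu = m_step(y, tau, u, l)
assert np.isclose(pis.sum(), 1.0)
```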

4. Numerical Simulation

For each parameter, the quality of estimation is measured by the mean squared error over the $n$ simulation replications; for example, for $\pi_{1}$,

$MSE\left(\pi_{1}\right)=\frac{1}{n}\sum_{i=1}^{n}\left(\pi_{1i}-\pi_{1(0)}\right)^{2}$

where $\pi_{1i}$ is the estimate from the $i$-th replication and $\pi_{1(0)}$ the true value.
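The replication-level MSE used throughout Section 4 amounts to a one-line helper (illustrative, not from the paper):

```python
import numpy as np

def mse(estimates, true_value):
    """Mean squared error of repeated-simulation estimates against the truth."""
    est = np.asarray(estimates, dtype=float)
    return np.mean((est - true_value) ** 2)

# Two replications with errors of -0.1 and +0.1 give an MSE of 0.01
assert np.isclose(mse([0.3, 0.5], 0.4), 0.01)
```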

4.1. Gaussian Mixture Data

When $\nu=3$, the mean squared errors of the Gaussian mixture parameter estimates are all smaller than those of the t-mixture estimates, most noticeably for $\upsilon_{1}$, $\upsilon_{2}$ and $\upsilon_{3}$; when $\nu=15$ or $30$, apart from $\upsilon_{1}$, $\upsilon_{2}$ and $\upsilon_{3}$, for which the Gaussian mixture is slightly better, the two methods give almost identical mean squared errors for the remaining parameters. In addition, as the degrees of freedom increase, the mean squared errors of the t-mixture estimates decrease; overall, the larger the sample size, the smaller the MSE.

4.2. Mixed t-Distribution Data

Table 1. The simulation results of ν = 3

Table 2. The simulation results of ν = 15

Table 3. The simulation results of ν = 30

Table 4. The simulation results of ν = 3

Table 5. The simulation results of ν = 15

Table 6. The simulation results of ν = 30

apart from $\upsilon_{2}$ and $\upsilon_{3}$, the mean squared errors of the t-mixture parameter estimates are smaller than those of the Gaussian mixture; when $\nu=30$, except for $\upsilon_{3}$, the t-mixture mean squared errors are all smaller than those of the Gaussian mixture. In addition, as the degrees of freedom increase, the mean squared errors of the t-mixture estimates decrease; overall, the larger the sample size, the smaller the MSE and the better the estimates.

4.3. Gaussian Mixture Data with Noise

Table 7. Mixed Gaussian data with 5% noise

Table 8. Mixed Gaussian data with 10% noise

5. Conclusion

Wang, X., Li, Y. and Yang, X. (2017) Comparative Study on Effects of Parameter Estimation of Mixture Models under Different Types of Data. Statistics and Application, 6, 482-491. http://dx.doi.org/10.12677/SA.2017.64054
