The Effects of Different Response Values in Linear Regression Model on Binary Classification

Statistics and Application
Vol.04 No.02(2015), Article ID:15509,8 pages
10.12677/SA.2015.42007


Xiaoying Wang, Yanli Yang, Changlong Chen

School of Mathematics and Physics, North China Electric Power University, Beijing

Email: yangyanlibang@163.com

Received: Jun. 5th, 2015; accepted: Jun. 20th, 2015; published: Jun. 25th, 2015

ABSTRACT

We use the multiple linear regression model to deal with the classification problem for two populations. First, we assign values to the response variables according to certain rules, and then construct a discriminant function and criterion via the least squares method. On this basis, we discuss the effects of different response values on classification for balanced and unbalanced data in the linear model. In addition, we compare this discriminant method with classical methods, including the Mahalanobis distance discriminant and the Bayes discriminant. Finally, we identify the inner relations between these methods as well as their respective advantages and disadvantages.

Keywords: Binary Classification, Response Values, Discriminant Analysis, Linear Regression Model, Least Squares


1. Introduction

(1.1)

(1.2)

2. Discriminant Analysis via Linear Regression

(2.1)

(2.2)

Theorem 2.1: In the multiple linear regression model (2.1), the least squares estimate of the parameter vector satisfies:

(2.3)

(2.4)
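The displayed formulas (2.1)-(2.4) were lost in extraction. As a minimal sketch of the standard result Theorem 2.1 refers to — the least squares estimate solving the normal equations (X'X)β = X'y — the following illustration uses invented names and data, not the paper's notation:

```python
import numpy as np

def ols_fit(X, y):
    """Least squares fit of y on X with an intercept column.

    Solves the normal equations (X'X) beta = X'y, the standard
    closed form behind estimates such as those in (2.3)-(2.4).
    """
    Xd = np.column_stack([np.ones(len(X)), X])   # design matrix with intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta  # beta[0] is the intercept, beta[1:] the slopes

# tiny check: exactly linear data is recovered
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 + 3.0 * X[:, 0]
print(ols_fit(X, y))  # close to [2., 3.]
```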

Theorem 2.2: (1) If [the matrix] is positive definite, then the classification result depends only on the signs of [the response values], not on their magnitudes. That is, as long as the signs of [the response values] agree, this method produces the same classification, and this conclusion holds whether or not [their magnitudes] are equal.

(2) When [the stated condition] is satisfied, this discriminant method and the distance discriminant method yield the same discriminant function and the same classification of a new sample.
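Since the symbols in Theorem 2.2 were stripped, the following is only a plausible numerical illustration of claim (1): code the two classes with response values of fixed signs, fit by least squares, and assign a new sample to the class whose code its fitted value is nearer to. Under this reading, two codings with the same signs but different magnitudes yield identical classifications. The data and the exact form of the rule here are assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# two Gaussian classes in the plane (illustrative data only)
n1, n2 = 60, 40
X = np.vstack([rng.normal([0, 0], 1.0, (n1, 2)),
               rng.normal([2, 2], 1.0, (n2, 2))])
Xd = np.column_stack([np.ones(n1 + n2), X])      # add intercept

def classify(c1, c2, Xnew):
    """Fit OLS with class codes c1 (class 1) and c2 (class 2), then
    assign each new point to the class whose code its fitted value
    is closer to (i.e. threshold at the midpoint of the codes)."""
    y = np.concatenate([np.full(n1, c1), np.full(n2, c2)])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    fitted = np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta
    return np.where(fitted > (c1 + c2) / 2, 1, 2)

Xnew = rng.normal([1, 1], 1.5, (200, 2))
a = classify(1.0, -1.0, Xnew)
b = classify(3.0, -0.5, Xnew)    # same signs, different magnitudes
print(np.array_equal(a, b))      # the classifications coincide
```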

3. Simulation

3.1. Balanced Data Simulation

(3.1)

3.2. Unbalanced Data Simulation

(3.2)
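The simulation design behind (3.2) is not recoverable from this copy; the sketch below only illustrates the kind of unbalanced two-class Gaussian experiment the section describes — training the regression discriminant on a 90/10 split and estimating its misclassification rate on a balanced test sample. All parameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n1, n2, mu1, mu2):
    """Draw a two-class Gaussian sample (identity covariance)."""
    X = np.vstack([rng.normal(mu1, 1.0, (n1, 2)),
                   rng.normal(mu2, 1.0, (n2, 2))])
    g = np.concatenate([np.ones(n1), 2 * np.ones(n2)])
    return X, g

def regression_rule(X, g, c1=1.0, c2=-1.0):
    """OLS on coded responses; classify by the midpoint of the codes."""
    y = np.where(g == 1, c1, c2)
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return lambda Z: np.where(
        np.column_stack([np.ones(len(Z)), Z]) @ beta > (c1 + c2) / 2, 1, 2)

# train on a 90/10 unbalanced split, test on a balanced sample
Xtr, gtr = simulate(180, 20, [0, 0], [2, 2])
Xte, gte = simulate(100, 100, [0, 0], [2, 2])
rule = regression_rule(Xtr, gtr)
err = np.mean(rule(Xte) != gte)
print(f"misclassification rate: {err:.3f}")
```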

Table 1. The misclassification rate of four discriminant methods

Table 2. Comparison of discriminant outcomes on unbalanced data

3.3. Real Data Analysis

Table 3. Comparison of discriminant results on the WDBC data

4. Conclusion

Wang, X., Yang, Y. and Chen, C. (2015) The Effects of Different Response Values in Linear Regression Model on Binary Classification. Statistics and Application, 4, 47-55. doi: 10.12677/SA.2015.42007


Proof of Theorem 2.1: By (2.2) above, [the least squares estimate] satisfies the equation, i.e. it satisfies

(A.1)

Proof of Theorem 2.2 (1): From (2.3), we have:

(A.2)

(A.3)

When [the matrix] is positive definite, [the stripped identity holds]. Substituting it into (A.3) gives:

(A.4)

(A.5)

(A.6)

Proof of Theorem 2.2 (2): The discriminant function of the distance discriminant method is:

(A.7)

(A.8)
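Although formulas (A.7)-(A.8) are lost, the claim of Theorem 2.2 (2) can be checked numerically under one natural set of assumptions: with equal class sizes and response codes +1/-1, the least squares discriminant classifies exactly as the distance (Mahalanobis) discriminant with pooled covariance. A small sketch under those assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50                                   # equal class sizes, as assumed
X1 = rng.normal([0, 0], 1.0, (n, 2))
X2 = rng.normal([2, 1], 1.0, (n, 2))
X = np.vstack([X1, X2])

# distance discriminant: assign x to class 1 when W(x) > 0, where
# W(x) = (x - (m1 + m2)/2)' S^{-1} (m1 - m2), S the pooled covariance
m1, m2 = X1.mean(0), X2.mean(0)
S = ((n - 1) * np.cov(X1.T) + (n - 1) * np.cov(X2.T)) / (2 * n - 2)
w = np.linalg.solve(S, m1 - m2)
dist_lab = np.where((X - (m1 + m2) / 2) @ w > 0, 1, 2)

# regression discriminant with codes +1 / -1, classify by sign of fit
y = np.concatenate([np.ones(n), -np.ones(n)])
Xd = np.column_stack([np.ones(2 * n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
reg_lab = np.where(Xd @ beta > 0, 1, 2)

print((dist_lab == reg_lab).mean())   # fraction of agreeing labels
```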