Research on Security Selection by Naive Bayes Classifier Based on a New Feature Selection Method

Panpan Guo, Haijun Liu, Shuangshuang Li

School of Mathematics and Statistics, Zhengzhou University, Zhengzhou Henan

Received: Dec. 18th, 2018; accepted: Jan. 2nd, 2019; published: Jan. 9th, 2019

ABSTRACT

This paper builds a naive Bayes classifier for security selection based on a new feature selection method. First, using the trading data of 50 companies listed on the Shenzhen Stock Exchange and 18 commonly used indicators, a new feature selection method that combines mutual information with principal component analysis is applied to select the value factors used for classification. Second, a naive Bayes classifier is constructed from the data of the first ten months, and its prediction accuracy is tested on the data of the last two months. The empirical analysis shows that the average accuracy of the classifier reaches 75%, which suggests that the method has practical application value.

Keywords: Feature Selection, Mutual Information, Principal Component Analysis, Naive Bayes Classifier

1. Introduction

2. Preliminaries

2.1. Feature Selection

2.2. Mutual Information

$H\left(X\right)=-\underset{x\in X}{\sum }p\left(x\right)\mathrm{log}p\left(x\right)$ (1)

$H\left(X|Y\right)=-\underset{y\in Y}{\sum }\underset{x\in X}{\sum }p\left(x,y\right)\mathrm{log}p\left(x|y\right)$ (2)

$I\left(X;Y\right)=H\left(X\right)-H\left(X|Y\right)$ (3)
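Equations (1) to (3) can be checked numerically on small samples. The following is a minimal sketch, assuming both variables are discrete (a continuous indicator would first have to be binned, and the binning rule is not specified here) and that probabilities are estimated by sample frequencies. The function names `entropy`, `conditional_entropy` and `mutual_information` are illustrative only.

```python
import math
from collections import Counter


def entropy(xs):
    """Estimate H(X) = -sum_x p(x) log p(x) from sample frequencies (Eq. 1)."""
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())


def conditional_entropy(xs, ys):
    """Estimate H(X|Y) = -sum_{x,y} p(x,y) log p(x|y) (Eq. 2)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    y_counts = Counter(ys)         # counts of each y value
    # p(x, y) = c / n and p(x | y) = c / count(y)
    return -sum((c / n) * math.log(c / y_counts[y]) for (_, y), c in joint.items())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) - H(X|Y) (Eq. 3), in nats."""
    return entropy(xs) - conditional_entropy(xs, ys)
```

When one variable fully determines the other, the estimate equals $H(X)$; when the two samples are independent, it is zero up to rounding.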

2.3. Principal Component Analysis

2.3.1. Basic Concepts

1) ${{a}^{\prime }}_{i}{a}_{i}=1$ $\left(i=1,2,\cdots ,p\right)$

2) For $i>1$, ${{a}^{\prime }}_{i}\Sigma {a}_{j}=0$ $\left(j=1,2,\cdots ,i-1\right)$

3) $Var\left({Y}_{i}\right)=\underset{{a}^{\prime }a=1,{{a}^{\prime }}_{i}\Sigma {a}_{j}=0\left(j=1,\cdots ,i-1\right)}{\mathrm{max}}Var\left({a}^{\prime }X\right)$

${Y}_{i}={{a}^{\prime }}_{i}X={a}_{1i}{X}_{1}+{a}_{2i}{X}_{2}+\cdots +{a}_{pi}{X}_{p}$ $\left(i=1,2,\cdots ,p\right)$ (4)

2.3.2. Procedure

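1) Standardize the data using the Z-score method.

2) Compute the correlation matrix of the indicator data.

3) Compute the eigenvalues and eigenvectors of the correlation matrix.

4) Compute the contribution rate and cumulative contribution rate of each principal component, and determine the principal components to retain (typically those whose eigenvalues give a cumulative contribution rate of 85%~95%).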
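The four steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the exact procedure used in this paper: the function name `pca_by_correlation`, the default threshold of 0.85 (the lower end of the range quoted in step 4), and the choice of the sample standard deviation (`ddof=1`) are all assumptions, and every column is assumed to have non-zero variance.

```python
import numpy as np


def pca_by_correlation(X, threshold=0.85):
    """PCA on the correlation matrix; rows are observations, columns are indicators."""
    X = np.asarray(X, dtype=float)

    # 1) Z-score standardization of each indicator (column).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # 2) Correlation matrix of the indicators.
    R = np.corrcoef(Z, rowvar=False)

    # 3) Eigenvalues and eigenvectors; eigh suits symmetric matrices and
    #    returns eigenvalues in ascending order, so reorder to descending.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 4) Contribution rate and cumulative contribution rate; keep the smallest
    #    number of components whose cumulative rate reaches the threshold.
    contrib = eigvals / eigvals.sum()
    cumulative = np.cumsum(contrib)
    k = min(int(np.searchsorted(cumulative, threshold)) + 1, len(eigvals))

    # Principal component scores Y_i = a_i' Z for the retained components.
    scores = Z @ eigvecs[:, :k]
    return eigvals, contrib, cumulative, k, scores
```

Because the correlation matrix has ones on its diagonal, the eigenvalues sum to the number of indicators, which gives a quick sanity check on the output.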

2.4. Naive Bayes Classifier

2.4.1. Basic Concepts

2.4.2. Naive Bayes Classifier

${v}_{MAP}=\underset{{v}_{j}\in \left\{{v}_{1},{v}_{2},\cdots ,{v}_{m}\right\}}{\mathrm{arg}\mathrm{max}}\frac{P\left({x}_{1},{x}_{2},\cdots ,{x}_{n}|{v}_{j}\right)P\left({v}_{j}\right)}{P\left({x}_{1},{x}_{2},\cdots ,{x}_{n}\right)}$ (5)

$=\underset{{v}_{j}\in \left\{{v}_{1},{v}_{2},\cdots ,{v}_{m}\right\}}{\mathrm{arg}\mathrm{max}}P\left({x}_{1},{x}_{2},\cdots ,{x}_{n}|{v}_{j}\right)P\left({v}_{j}\right)$ (6)

${v}_{NB}=\underset{{v}_{j}\in \left\{{v}_{1},{v}_{2},\cdots ,{v}_{m}\right\}}{\mathrm{arg}\mathrm{max}}P\left({v}_{j}\right)\underset{i}{\prod }P\left({x}_{i}|{v}_{j}\right)$ (7)
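Equation (7) follows from (6) once the features are assumed conditionally independent given the class, so that the joint likelihood factorizes into $\prod_i P(x_i|v_j)$. A minimal sketch of the resulting decision rule for discrete features is given below. The class name `NaiveBayes`, the add-one (Laplace) smoothing parameter `alpha`, and the use of log-probabilities to avoid numerical underflow are assumptions made for illustration, not details taken from this paper.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical naive Bayes implementing the arg max rule of Eq. (7)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (assumption, not from the paper)

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.class_counts = Counter(y)
        self.n = len(y)
        n_features = len(X[0])
        # value_counts[i][v][x]: number of class-v samples whose feature i equals x
        self.value_counts = [defaultdict(Counter) for _ in range(n_features)]
        # values[i]: set of distinct values observed for feature i
        self.values = [set() for _ in range(n_features)]
        for row, label in zip(X, y):
            for i, x in enumerate(row):
                self.value_counts[i][label][x] += 1
                self.values[i].add(x)
        return self

    def predict(self, row):
        best, best_score = None, -math.inf
        for v in self.classes:
            # log P(v_j) + sum_i log P(x_i | v_j)
            score = math.log(self.class_counts[v] / self.n)
            for i, x in enumerate(row):
                num = self.value_counts[i][v][x] + self.alpha
                den = self.class_counts[v] + self.alpha * len(self.values[i])
                score += math.log(num / den)
            if score > best_score:
                best, best_score = v, score
        return best
```

Taking logarithms does not change the arg max, since the logarithm is strictly increasing.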

3. Data, Indicators, and Factors

3.1. Data

3.2. Indicators

3.2.1. Stock Return

${R}_{i,T}=\mathrm{ln}\left(\frac{{P}_{i,T+\Delta t}+{I}_{i,T+\Delta t}}{{P}_{i,T}}\right)$ (8)
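Equation (8) is a logarithmic return over the holding period $\Delta t$. The sketch below evaluates it for a single stock, assuming that $P$ denotes the stock price and $I$ the income (for example, dividends) received by the end of the period; this reading of the symbols, and the function and parameter names, are assumptions.

```python
import math


def log_return(price_start, price_end, income=0.0):
    """R = ln((P_{T+dt} + I_{T+dt}) / P_T), as in Eq. (8)."""
    if price_start <= 0 or price_end + income <= 0:
        raise ValueError("prices must give a positive ratio")
    return math.log((price_end + income) / price_start)
```

For instance, `log_return(10.0, 10.5, 0.5)` evaluates $\ln(11/10)$, and an unchanged price with no income gives a return of zero.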

3.2.2. Daily Turnover Rate

3.3. Factor Selection

Table 1. Mutual information outcomes of the top five companies

Figure 1. Eigenvalues

Table 2. Principal component result

$\begin{array}{l}{Y}_{1}=0.298643431{Z}_{1}+\cdots +0.273257891{Z}_{11}+0.261699545{Z}_{18}\\ {Y}_{2}=0.206278919{Z}_{1}-\cdots -0.336067968{Z}_{11}-0.494679098{Z}_{18}\end{array}$

4. Constructing the Naive Bayes Classifier

Table 3. Classification Situation 1

Table 4. Classification Situation 2

Table 5. Classification Situation 3

5. Conclusion

