Analysis of Influencing Factors of Fiscal Revenue in Zhejiang Province Based on Data Mining Technology

Hans Journal of Data Mining
Vol. 07 No. 04 (2017), Article ID: 22486, 12 pages
DOI: 10.12677/HJDM.2017.74011


Liangliang Zhuang, Tong Wu

School of Mathematics and Information Science, Wenzhou University, Wenzhou, Zhejiang

Received: Sep. 30th, 2017; accepted: Oct. 19th, 2017; published: Oct. 27th, 2017

ABSTRACT

Public finance is the basis on which governments perform their functions, whose basic responsibilities are resource integration, resource reallocation, and macroeconomic regulation and control. Finance also reflects, to a great degree, the level of social and economic development. Improving the accuracy of fiscal revenue forecasts is therefore of considerable importance to the country. To this end, we analyze the factors influencing Zhejiang Province's fiscal revenue, fitting best subset selection, forward stepwise selection, backward stepwise selection, ridge regression, and Lasso regression models in R. We evaluate each model by its root-mean-square error. We find that the Lasso regression model is the best of these models; it selects four key factors affecting the fiscal revenue of Zhejiang Province: foreign exchange earnings from tourism, the average wage of urban employees, the ratio of tertiary industry to secondary industry, and the RMB deposit balance of all financial institutions.

Keywords: Government Receipts, Best Subset Selection, Forward Stepwise Selection, Backward Stepwise Selection, Ridge Regression, Lasso Regression

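1. Introduction

1.1. Background and Significance

1.2. Research and Views of Chinese Scholars and Experts on Fiscal Revenue

1.3. Analysis Methods and Process

1.4. Introduction to R Functions and Related Concepts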

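The summary() function: computes the minimum, maximum, quartiles, and mean of numeric variables, as well as frequency counts for factor and logical vectors.

The regsubsets() function (from the leaps package): selects models by exhaustive search, forward or backward stepwise selection, or sequential replacement. This paper mainly uses it to select models by forward and backward stepwise regression.

BIC (Bayesian Information Criterion): a measure for evaluating how well a model fits the data. The smaller the BIC, the better the model fits the data.

AIC (Akaike Information Criterion): can be written as $\text{AIC}=2k-2\mathrm{ln}\left(L\right)$, where $k$ is the number of estimated parameters and $L$ is the maximized value of the likelihood function. It is founded on the concept of entropy and trades off the complexity of the estimated model against how well the model fits the data. The smaller the AIC, the better the model fits the data.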
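
As a rough illustration of how the two criteria are computed, the sketch below evaluates AIC and BIC for a linear model with Gaussian errors from its residual sum of squares. The function names and the example numbers are assumptions made for illustration and do not come from the paper; the BIC formula $k\ln(n)-2\ln(L)$ is the standard definition, which the text above does not spell out.

```python
import math

def gaussian_log_likelihood(rss, n):
    """Maximized log-likelihood of a linear model with Gaussian errors,
    with the error variance estimated as rss / n."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)

def aic(log_lik, k):
    """AIC = 2k - 2 ln(L), with k the number of estimated parameters."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """BIC = k ln(n) - 2 ln(L), with n the number of observations."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical comparison: a 3-parameter model versus a 6-parameter model
# that lowers the residual sum of squares only slightly.
n = 20
small = gaussian_log_likelihood(rss=12.0, n=n)
large = gaussian_log_likelihood(rss=11.5, n=n)
print("AIC:", aic(small, 3), aic(large, 6))
print("BIC:", bic(small, 3, n), bic(large, 6, n))
```

In this made-up case both criteria prefer the smaller model, because the small gain in fit does not offset the penalty for three extra parameters.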

2. Selection of Variables and Indicators

2.1. Data Selection

2.2. Data Preprocessing and Exploratory Analysis

Table 1. Categories of the factor indicators

*All data are from the Zhejiang Provincial Statistical Information Network, Statistical Yearbook.

Table 2. Descriptive statistics of each index


Table 3. Pearson correlation coefficient matrix
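
For readers who want to reproduce a matrix of this kind, the following is a minimal sketch of the Pearson correlation coefficient and the resulting correlation matrix. The function names and the three short example columns are hypothetical and are not the paper's data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Pairwise Pearson correlations for a list of data columns."""
    return [[pearson(a, b) for b in columns] for a in columns]

# Hypothetical indicator columns, for illustration only.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.1, 9.8],
    [5.0, 3.0, 4.0, 1.0, 2.0],
]
for row in correlation_matrix(columns):
    print(["%.3f" % r for r in row])
```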

3. Model Building

3.1. Best Subset Regression Model

$y={\beta }_{0}+{\beta }_{{i}_{1}}{x}_{{i}_{1}}+\cdots +{\beta }_{{i}_{q-1}}{x}_{{i}_{q-1}}+e$

1) BIC or AIC is minimized;

2) the Cp statistic is minimized;

3) the multiple coefficient of determination is maximized.

$y=101.347-0.196{x}_{1}+0.725{x}_{8}-0.909{x}_{9}-0.301{x}_{11}+0.132{x}_{16}$
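
The exhaustive search behind best subset selection can be sketched as follows. This is a minimal illustration that scores every subset by BIC only (one of the criteria listed above), using a small hand-written least-squares solver. All function names and the toy data are assumptions; the paper itself uses R's regsubsets() on the Zhejiang data, and this sketch does not reproduce the fitted equation above.

```python
import itertools
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def rss_of_subset(xs, y, cols):
    """Fit OLS with an intercept on the chosen columns; return the RSS."""
    rows = [[1.0] + [row[j] for j in cols] for row in xs]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * t for r, t in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(c * v for c, v in zip(beta, r))) ** 2
               for r, t in zip(rows, y))

def best_subset_by_bic(xs, y):
    """Try every non-empty subset of predictors; keep the lowest BIC."""
    n, p = len(y), len(xs[0])
    best = None
    for size in range(1, p + 1):
        for cols in itertools.combinations(range(p), size):
            rss = max(rss_of_subset(xs, y, cols), 1e-12)
            # Gaussian BIC up to an additive constant; size + 1 counts the intercept.
            score = n * math.log(rss / n) + (size + 1) * math.log(n)
            if best is None or score < best[0]:
                best = (score, cols)
    return best[1]

# Toy data (hypothetical): y depends on columns 0 and 2 only, plus small noise.
x0 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
x1 = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
x2 = [2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5]
noise = [0.1, -0.1, 0.05, -0.05] * 3
xs = [list(map(float, row)) for row in zip(x0, x1, x2)]
y = [1 + 2 * a - 3 * c + e for a, c, e in zip(x0, x2, noise)]
print(best_subset_by_bic(xs, y))
```

The number of candidate models grows as $2^p$, which is why the stepwise procedures in the following sections are often used when the number of predictors is large.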

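3.2. Forward Stepwise Selection Regression Model

A regression equation for y is first built with a single variable, and further variables are then added to the model. After each addition, the regression model is re-fitted with the important variables and its coefficients are tested for significance. Proceeding in this way, variables are added to the model step by step.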
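
The greedy procedure described above can be sketched as follows. As an assumption made for this illustration, the stopping rule is "stop when BIC no longer improves" rather than a coefficient significance test, and each candidate is kept orthogonal to the variables already in the model so that its reduction in the residual sum of squares can be read off directly. The function names and toy data are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def remove_component(v, q):
    """Subtract from v its projection onto q."""
    c = dot(v, q) / dot(q, q)
    return [a - c * b for a, b in zip(v, q)]

def forward_stepwise(columns, y):
    """Greedy forward selection; returns predictor indices in the order added."""
    n = len(y)
    ones = [1.0] * n
    # Start from the intercept-only model: centre y and every predictor.
    resid = remove_component(y, ones)
    cand = {j: remove_component(col, ones) for j, col in enumerate(columns)}
    chosen = []
    rss = max(dot(resid, resid), 1e-12)
    bic = n * math.log(rss / n) + math.log(n)
    while cand:
        # Drop in RSS from adding each candidate to the current model.
        gain = {j: dot(resid, x) ** 2 / dot(x, x)
                for j, x in cand.items() if dot(x, x) > 1e-12}
        if not gain:
            break
        j = max(gain, key=gain.get)
        new_rss = max(rss - gain[j], 1e-12)
        new_bic = n * math.log(new_rss / n) + (len(chosen) + 2) * math.log(n)
        if new_bic >= bic:
            break  # no remaining variable improves BIC
        q = cand.pop(j)
        resid = remove_component(resid, q)
        cand = {k: remove_component(x, q) for k, x in cand.items()}
        chosen.append(j)
        rss, bic = new_rss, new_bic
    return chosen

# Toy data (hypothetical): y depends on columns 0 and 2 only, plus small noise.
x0 = [float(v) for v in range(1, 13)]
x1 = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0, 5.0, 8.0]
x2 = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0, 2.0, 8.0, 4.0, 5.0]
noise = [0.1, -0.1, 0.05, -0.05] * 3
y = [1 + 2 * a - 3 * c + e for a, c, e in zip(x0, x2, noise)]
print(forward_stepwise([x0, x1, x2], y))
```

Unlike best subset selection, this fits at most on the order of $p^2$ models, at the cost of possibly missing the subset an exhaustive search would find.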

Table 4. Variance inflation factor test

Table 5. Parameter estimates of the best subset model

Figure 1. BIC results for the forward stepwise regression model

Table 6. Forward stepwise regression model coefficients

$\begin{array}{l}y=583.588-0.185{x}_{1}-0.001{x}_{3}+0.014{x}_{4}-0.128{x}_{6}+0.794{x}_{8}-0.885{x}_{9}\\ \quad +0.454{x}_{10}-0.324{x}_{11}-0.020{x}_{12}+0.060{x}_{15}+0.091{x}_{16}\end{array}$

3.3. Backward Stepwise Selection Regression Model

Figure 2. BIC results for the backward stepwise selection regression model

Table 7. Backward stepwise regression model coefficients

$\begin{array}{l}y=2599.662+0.042{x}_{1}-0.287{x}_{2}+0.002{x}_{3}+0.031{x}_{4}-0.036{x}_{5}-1.255{x}_{6}\\ \quad +1.177{x}_{7}-1.069{x}_{9}+0.038{x}_{11}+0.031{x}_{13}-1.027{x}_{14}+0.140{x}_{15}\end{array}$

3.4. Ridge Regression Model

$\left({\tilde{\alpha}}^{\left(\text{ols}\right)},{\tilde{\beta}}^{\left(\text{ols}\right)}\right)=\arg\min \sum _{i=1}^{n}{\left({y}_{i}-\alpha -\sum _{j=1}^{p}{x}_{ij}{\beta }_{j}\right)}^{2}$

$\left({\tilde{\alpha}}^{\left(\text{ridge}\right)},{\tilde{\beta}}^{\left(\text{ridge}\right)}\right)=\arg\min \sum _{i=1}^{n}{\left({y}_{i}-\alpha -\sum _{j=1}^{p}{x}_{ij}{\beta }_{j}\right)}^{2}+\lambda \sum _{j=1}^{p}{\beta }_{j}^{2}$
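
For reference, the ridge criterion has a closed-form minimizer. The short derivation below is the standard textbook result rather than something stated in the paper. It assumes that the columns of the design matrix $X$ (with entries ${x}_{ij}$) have been centered, and it writes $\bar{y}$ for the mean of the response and $I$ for the $p\times p$ identity matrix; these symbols are introduced here for illustration. Setting the gradient with respect to $\beta$ to zero gives

$-2{X}^{\text{T}}\left(y-X\beta \right)+2\lambda \beta =0\;\Rightarrow \;\left({X}^{\text{T}}X+\lambda I\right)\beta ={X}^{\text{T}}y$

so that

${\tilde{\alpha}}^{\left(\text{ridge}\right)}=\bar{y},\qquad {\tilde{\beta}}^{\left(\text{ridge}\right)}={\left({X}^{\text{T}}X+\lambda I\right)}^{-1}{X}^{\text{T}}y$

With a single centered predictor this reduces to $\tilde{\beta}=\sum _{i}{x}_{i}{y}_{i}/\left(\sum _{i}{x}_{i}^{2}+\lambda \right)$: taking $\lambda =0$ recovers the least-squares estimate, and increasing $\lambda$ shrinks the coefficient toward zero without ever setting it exactly to zero.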

$\begin{array}{l}y=848.268+0.038{x}_{1}+0.067{x}_{2}+0.001{x}_{3}+0.007{x}_{4}+0.031{x}_{5}-0.649{x}_{6}\\ \quad +0.371{x}_{7}+0.016{x}_{8}+0.092{x}_{9}+11.313{x}_{10}+0.008{x}_{11}+0.010{x}_{12}\\ \quad +0.012{x}_{13}+0.281{x}_{14}+0.021{x}_{15}+0.009{x}_{16}\end{array}$

3.5. Lasso Regression Model

The Lasso parameter estimate is defined as follows:

${\hat{\beta}}^{\left(\text{lasso}\right)}=\underset{\beta }{\arg\min}\;{\left\Vert y-\sum _{j=1}^{p}{x}_{j}{\beta }_{j}\right\Vert }^{2}+\lambda \sum _{j=1}^{p}\left|{\beta }_{j}\right|$
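
One common way to minimize this criterion is cyclic coordinate descent with soft-thresholding. The sketch below is an assumption about how such a solver can be written and is not the routine used in the paper; the function names and the small orthogonal example are hypothetical. Because the criterion above has no factor of 1/2 on the squared error, the threshold in each coordinate update is $\lambda /2$.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_coordinate_descent(columns, y, lam, sweeps=200):
    """Minimize ||y - sum_j x_j b_j||^2 + lam * sum_j |b_j|.

    No intercept is fitted; centre y and the columns first if one is needed.
    """
    beta = [0.0] * len(columns)
    resid = list(y)
    for _ in range(sweeps):
        for j, x in enumerate(columns):
            xx = sum(v * v for v in x)
            # Inner product of x_j with the residual that excludes x_j's own term.
            rho = sum(v * (r + v * beta[j]) for v, r in zip(x, resid))
            new = soft_threshold(rho, lam / 2) / xx
            resid = [r - v * (new - beta[j]) for v, r in zip(x, resid)]
            beta[j] = new
    return beta

# Hypothetical example with two orthogonal predictors.
x0 = [1.0, 1.0, -1.0, -1.0]
x1 = [1.0, -1.0, 1.0, -1.0]
y = [3.0, 1.0, -1.0, -3.0]
print(lasso_coordinate_descent([x0, x1], y, lam=0.0))
print(lasso_coordinate_descent([x0, x1], y, lam=10.0))
```

With no penalty the sketch returns the least-squares coefficients; with the larger penalty the weaker predictor's coefficient is set exactly to zero. This is the property that lets the Lasso act as a variable selection method, in contrast to ridge regression.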

Table 8. Model coefficients of ridge regression

Table 9. Coefficients of Lasso model

Figure 3. Mean squared error as a function of lambda

$y=-228.350+0.008{x}_{3}+0.004{x}_{4}+4.090{x}_{10}-0.030{x}_{12}$

4. Comparison of Models

Table 10. Descriptive statistics of each index
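
The models are compared by root-mean-square error (RMSE). A minimal sketch of that metric follows; the function name and the example values are hypothetical and are not results from the paper.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between observed and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical observed values and predictions from two candidate models.
observed = [10.0, 12.0, 15.0, 19.0]
model_a = [10.5, 11.5, 15.5, 18.0]
model_b = [9.0, 13.5, 13.0, 21.0]
print(rmse(observed, model_a), rmse(observed, model_b))
```

The model with the smaller RMSE predicts the observed values more closely on the data used for the comparison.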

5. Conclusion

Analysis of Influencing Factors of Fiscal Revenue in Zhejiang Province Based on Data Mining Technology [J]. Hans Journal of Data Mining (数据挖掘), 2017, 07(04): 93-104. http://dx.doi.org/10.12677/HJDM.2017.74011
