非参数回归模型样条估计量的分布 Distribution of Spline Estimators for Nonparametric Regression Models

doi:10.12677/AAM.2022.1110788

Advances in Applied Mathematics
Vol. 11 No. 10 ( 2022 ), Article ID: 57127 , 8 pages
10.12677/AAM.2022.1110788

非参数回归模型样条估计量的分布

詹陆丽^*，武新乾

●How to Cite this Article

河南科技大学数学与统计学院，河南洛阳

收稿日期：2022年9月24日；录用日期：2022年10月17日；发布日期：2022年10月26日

摘要

为探究非参数回归模型中非参数函数估计量的分布，本文在标准正态误差情形下首先得到了均值函数样条估计量的正态分布，然后得到了方差函数基于残差的样条估计量的渐近分布，并采用单个卡方变量线性函数来近似方差函数估计量的渐近分布。通过数值模拟验证了均值函数估计量的分布和方差函数估计量的渐近分布。

关键词

非参数回归模型，样条估计，渐近分布，卡方分布线性组合

Distribution of Spline Estimators for Nonparametric Regression Models

Luli Zhan^*, Xinqian Wu

School of Mathematics and Statistics, Henan University of Science and Technology, Luoyang Henan

Received: Sep. 24^th, 2022; accepted: Oct. 17^th, 2022; published: Oct. 26^th, 2022

ABSTRACT

To explore the distribution of nonparametric function estimator in the nonparametric regression model, we first obtain the normal distribution of the spline estimator of the mean function in the standard normal error casein this paper. Then the asymptotic distribution of the spline estimator of the variance function based on the residuals is obtained. And the linear function of individual chi-square variable is used to approximate the asymptotic distribution of the variance function estimator. The distribution of the mean function estimator and the asymptotic distribution of the variance function estimator are illustrated by numerical simulations.

Keywords:Nonparametric Regression Model, Spline Estimation, Asymptotic Distribution, Linear Combination of Chi-Square Distribution

This work is licensed under the Creative Commons Attribution International License (CC BY 4.0).

http://creativecommons.org/licenses/by/4.0/

1. 引言

考虑如下非参数回归模型

$y_{t} = m (x_{t}) + σ (x_{t}) ε_{t}, t = 1, 2, \dots, n$ ， (1)

其中 $x_{t} = t / n$ 为固定设计点， $y_{t}$ 为被解释变量或响应变量， $m (\cdot)$ 与 $σ (\cdot)$ 分别为未知的均值函数和标准差函数， ${ε_{t}}$ 为独立同分布的随机误差序列，且 $E ε_{t} = 0$ ， $E ε_{t}^{2} = 1$ ， $t = 1, 2, \dots, n$ 。

近些年来，国内外学者对非参数回归模型做了大量的研究 [1] - [7]。渐近分布作为估计量大样本性质中的一个重要方面，也引起了一些学者的兴趣。秦永松(1991) [8] 基于加权核估计得到了均值函数的估计量，并研究了其导数的渐近分布；Liang与Jing (2004) [9] 采用加权核估计的方法，基于负相关序列研究了非参数回归模型中未知函数估计量的逐点一致收敛性与渐近正态性；Jin等(2014) [10] 研究了一步Newton-Raphson估计和局部轮廓似然估计，并给出了基于两种方法的估计量的分布；Alharbi与Patili (2018) [11] 提出对响应变量和残差的乘积进行平滑处理，并研究了基于此方法得到的方差函数估计量的渐近分布；Li和Lin (2020) [12] 在没有独立性假设的情形下，推导出了误差方差 $σ^{2} = var (ε)$ 的最佳半参数效率约束，并建立了基于残差的 $σ^{2}$ 有效估计量的渐近正态性。

本文对非参数回归模型中基于样条方法的均值函数估计量和方差函数估计量的分布问题进行研究，并通过数值模拟验证效果。

2. 估计量及主要结论

根据文献 [7]，将区间 $D = [0, 1]$ 进行包括两端点的 $k + 1$ 等距分割，结点序列为

$0 = t_{0} < t_{1} < \dots < t_{k + 1} = 1$ ，

构造相应的v次样条空间 $S_{k, v}$ ，其基函数记作 $B_{s} (x) (s = 1, 2, \dots, k + v)$ 。设 $K = k + v$ ，又令

$B_{s} (x) = {(B_{1} (x), B_{2} (x), \dots, B_{K} (x))}^{T}$ ，

则均值函数 $m (x)$ 的样条估计为

$\hat{m} (x) = B^{T} (x) \hat{φ} = \sum_{s = 1}^{K} {\hat{φ}}_{s} B_{s} (x)$ ， (2)

这里 $\hat{φ} = {({\hat{φ}}_{1}, \dots, {\hat{φ}}_{K})}^{T}$ 为能使

$l_{m} (φ) = \sum_{t = 1}^{n} {y_{t} - \sum_{s = 1}^{K} φ_{s} B_{s} (x_{t})}^{2}$

最小化的参数向量， $φ = {(φ_{1}, \dots, φ_{K})}^{T}$ ，它的最小二乘估计为 $\hat{φ} = {(W^{T} W)}^{- 1} W^{T} y$ ，其中

$y = {(y_{1}, y_{2}, \dots, y_{n})}^{T}$ ，

$W = (\begin{matrix} B_{k 1} (x_{1}) & B_{k 2} (x_{1}) & \dots & B_{k K} (x_{1}) \\ B_{k 1} (x_{2}) & B_{k 2} (x_{2}) & \dots & B_{k K} (x_{2}) \\ ⋮ & ⋮ & ⋱ & ⋮ \\ B_{k 1} (x_{n}) & B_{k 2} (x_{n}) & \dots & B_{k K} (x_{n}) \end{matrix})$ 。

记残差 ${\hat{ς}}_{t} = y_{t} - \hat{m} (x_{t}), t = 1, \dots, n$ ，令 $z_{t} = {\hat{ς}}_{t}^{2}$ ， $z = {(z_{1}, \dots, z_{n})}^{T}$ ，则方差函数 $σ^{2} (x)$ 的基于残差的样条估计为

${\hat{σ}}^{2} (x) = B^{T} (x) \hat{θ} = \sum_{s = 1}^{K} {\hat{θ}}_{s} B_{s} (x)$ ， (3)

其中 $\hat{θ} = {({\hat{θ}}_{1}, \dots, {\hat{θ}}_{K})}^{T}$ 为使得

$l_{σ^{2}} (θ) = {\sum_{t = 1}^{n} {z_{t} - \sum_{s = 1}^{K} θ_{s} B_{s} (x_{t})}}^{2}$

最小化的参数向量， $θ = {(θ_{1}, \dots, θ_{K})}^{T}$ ，它的最小二乘估计为 $\hat{θ} = {(W^{T} W)}^{- 1} W^{T} z$ 。

2.1. 均值函数估计量的分布

本文假定误差序列 ${ε_{t}}$ 来自标准正态总体，即 $ε_{t} \sim N (0, 1), t = 1, 2, \dots, n$ 。不妨记

$M = {(m (x_{1}), m (x_{2}), \dots, m (x_{n}))}^{T}$ ，

$\sum = (\begin{matrix} σ (x_{1}) & 0 & \dots & 0 \\ 0 & σ (x_{2}) & \dots & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & \dots & σ (x_{n}) \end{matrix})$ ，

$ε = {(ε_{1}, ε_{2}, \dots, ε_{n})}^{T}$ ，

则模型(1)可简记为

$y = M + \sum ε$ ， (4)

其中

$ε ~ N ((\begin{matrix} 0 \\ 0 \\ ⋮ \\ 0 \end{matrix}), (\begin{matrix} 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & \dots & 1 \end{matrix}))$ 。

因为 $M$ 与 $\sum$ 分别为常数向量和常数矩阵，又根据期望与方差的性质可知， $y ~ N (M, \sum^{2})$ 。

定理1均值函数 $m (x)$ 的样条估计量 $\hat{m} (x)$ 服从均值为 $B^{T} (x) A M$ ，方差为 $B^{T} (x) A \sum^{2} A^{T} B (x)$ 的正态分布，即

$\hat{m} (x) ~ N (B^{T} (x) A M, B^{T} (x) A \sum^{2} A^{T} B (x))$ ，

其中 $A = {(W^{T} W)}^{- 1} W^{T}$ 。

证明：根据(2)式知，要证 $\hat{m} (x)$ 的分布，只需证 $\hat{φ}$ 的分布，又因为 $\hat{φ} = {(W^{T} W)}^{- 1} W^{T} y$ ，那么有

$\hat{φ} ~ N ({(W^{T} W)}^{- 1} W^{T} M, {(W^{T} W)}^{- 1} W^{T} \sum^{2} W {(W^{T} W)}^{- 1})$ ，

故有

$\hat{m} (x) = B^{T} (x) \hat{φ} ~ N (B^{T} (x) A M, B^{T} (x) A \sum^{2} A^{T} B (x))$ 。

2.2. 方差函数估计量的分布

本文记 $X \leftrightarrow Y$ 表示随机变量序列X与Y渐近同分布。接下来讨论方差函数估计量 ${\hat{σ}}^{2} (x)$ 的渐近分布。

定理2方差函数 $σ^{2} (x)$ 基于残差的样条估计量 ${\hat{σ}}^{2} (x)$ 与 $\sum_{t = 1}^{n} c_{t} ε_{t}^{2}$ 渐近同分布，即

${\hat{σ}}^{2} (x) \leftrightarrow \sum_{t = 1}^{n} c_{t} ε_{t}^{2}$ ，

这里 $c = (c_{1}, c_{2}, \dots, c_{n}) = B^{T} (x) A \sum^{2}$ 。

证明：根据(3)式，要证方差函数估计量 ${\hat{σ}}^{2} (x)$ 的分布，只需证 $z = {(z_{1}, z_{2}, \dots, z_{n})}^{T}$ 的分布。

因为

${\hat{ς}}_{t} = y_{t} - \hat{m} (x_{t}) = y_{t} - m (x_{t}) + m (x_{t}) - \hat{m} (x_{t}) = σ (x_{t}) ε_{t} - [\hat{m} (x_{t}) - m (x_{t})]$ ，

则有

$z_{t} = {\hat{ς}}_{t}^{2} = {σ (x_{t}) ε_{t} - [\hat{m} (x_{t}) - m (x_{t})]}^{2}$ ，

即

$z_{t} = {[σ (x_{t}) ε_{t}]}^{2} - 2 σ (x_{t}) ε_{t} [\hat{m} (x_{t}) - m (x_{t})] + {[\hat{m} (x_{t}) - m (x_{t})]}^{2}$ 。

由 $\hat{m} (x)$ 一致收敛到 $m (x)$ (见文献 [7] 中定理1)且 $ε_{t}$ 是正态随机变量，可知

${\hat{σ}}^{2} (x) \leftrightarrow B^{T} (x) \cdot {(W^{T} W)}^{- 1} W^{T} \cdot (\begin{matrix} σ^{2} (x_{1}) ε_{1}^{2} \\ σ^{2} (x_{2}) ε_{2}^{2} \\ ⋮ \\ σ^{2} (x_{n}) ε_{n}^{2} \end{matrix}) = B^{T} (x) A \sum^{2} (\begin{matrix} ε_{1}^{2} \\ ε_{2}^{2} \\ ⋮ \\ ε_{n}^{2} \end{matrix})$ ，

即

${\hat{σ}}^{2} (x) \leftrightarrow \sum_{t = 1}^{n} c_{t} ε_{t}^{2}$ 。

因为 ${ε_{t}}$ 为相互独立且服从标准正态分布的随机误差序列，故有

$ε_{t}^{2} ~ χ^{2} (1), \forall t = 1, 2, \dots, n$ ，

因此，要求方差函数估计量 ${\hat{σ}}^{2} (x)$ 的渐近分布就是求服从卡方分布的独立随机变量的线性组合的分布。根据文献 [13] [14] 的结果，尝试用单个 $χ^{2}$ 变量的线性函数近似n个相互独立的 $χ^{2}$ 变量的线性组合。

首先，考虑用 $\tilde{Y} = a χ^{2} (d)$ 作为 $\sum_{t = 1}^{n} c_{t} ε_{t}^{2}$ 的近似分布。采用一、二阶矩拟合的原则选取 $a, d$ ，即由方程

${\begin{cases} E (\tilde{Y}) = E (\sum_{t = 1}^{n} c_{t} ε_{t}^{2}) \\ D (\tilde{Y}) = D (\sum_{t = 1}^{n} c_{t} ε_{t}^{2}) \end{cases}$

确定。从而 $a, d$ 应满足方程

${\begin{cases} a d = \sum_{t = 1}^{n} c_{t} ≜ P \\ a^{2} d = \sum_{t = 1}^{n} c_{t}^{2} ≜ Q \end{cases}$

解得

$a = Q / P, d = P^{2} / Q$ 。

考虑到 $χ^{2}$ 变量的自由度为正整数，故将d修正为

$d^{*} = [P^{2} / Q + 0.5]$ ，

这里 $[x]$ 表示不超过x的最大整数，若 $P^{2} / Q < 0.5$ ，则取 $d^{*} = 1$ 。

再用 $Y ~ a χ^{2} (d^{*}) + e_{1}$ 来近似 $\sum_{t = 1}^{n} c_{t} ε_{t}^{2}$ 的分布，采用上述方法得到

$e_{1} = P - \sqrt{Q d^{*}}, a = (P - e_{1}) / d^{*}$ ，

于是，可用 $a χ^{2} (d^{*}) + e_{1}$ 作为 $\sum_{t = 1}^{n} c_{t} ε_{t}^{2}$ 的近似。

3. 数值模拟

考虑模型(1)，其中

${\begin{cases} m (x) = 50 x^{3} {(1 - x)}^{3} \\ σ (x) = 0.2 + 0.4 \sin (π x) ， \\ ε_{t} ~ N (0, 1), t = 1, 2, \dots, n \end{cases}$ (5)

这里 $x \in [0, 1]$ 。

对模型(5)进行蒙特卡罗模拟，利用三次B-样条基函数估计未知均值函数 $m (x)$ ，基于残差估计方差函数 $σ^{2} (x)$ ，并在AIC准则下自动选取等距结点数。

3.1. 均值函数估计量的模拟

为验证理论分布效果，使用MATLAB软件进行模拟运算，选取显著性水平为 $α = 0.05$ ，具体步骤如下：

第一步，根据模型(5)，计算出 $B^{T} (x) A M$ 与 $B^{T} (x) A \sum^{2} A^{T} B (x)$ ；

第二步，生成N个服从 $N (B^{T} (x) A M, B^{T} (x) A \sum^{2} A^{T} B (x))$ 的随机数；

第三步，基于样条方法，通过蒙特卡罗模拟产生N个均值函数的估计值 $\hat{m} (x)$ ；

第四步，在给定置信水平下，检验第二步与第三步产生的随机数与均值函数估计值是否来自于同一分布；

第五步，对上述四步进行多次重复模拟，分析所得结果。

图1绘制了 $x = 0.02, 0.2, 0.4, 0.6, 0.8, 1.0$ 处在 $N = 1000$ 时的直方图和概率密度函数曲线图。

由图1可见，各点处拟合的均值函数的直方图与概率密度函数曲线呈倒U型，直观地可以认为各点处拟合的均值函数估计值来自于正态分布。进一步地，对各点处的均值函数估计量的分布与正态分布进行Two-sample t-test检验，并分别循环模拟 $N = 10, 50, 100, 500, 1000$ 次，检验的P值如表1所示。

Figure 1. Histogram of the estimates at each point of the mean function and its probability density function curve

图1. 均值函数各点处估计值的直方图及其概率密度函数曲线

Table 1. The P-values of the Two-sample t-test-test for the mean function

表1. 均值函数Two-sample t-test检验的P值

由表1可知，当 $x = 0.2, 0 .4, 0.8$ 时，P值均大于0.05；当N较大时，各点处的P值均大于0.05，说明在给定的显著性水平0.05下，应该接受原假设，即认为检验数据服从正态分布。

3.2. 方差函数估计量的模拟

检验步骤如下：

第一步：依据模型(5)给定x值，计算a， $d^{*}$ 与 $e_{1}$ ；

第二步：随机生成N个服从 $a χ^{2} (d^{*}) + e_{1}$ 的随机数；

第三步：基于残差样条方法，通过蒙特卡罗模拟生成N个方差函数 ${\hat{σ}}^{2} (x)$ 的估计值；

第四步，在给定置信水平下，检验第二步与第三步产生的随机数与方差函数估计值是否来自于同一分布；

第五步，对上述四步进行多次重复模拟，分析所得结果。

图2绘制了 $x = 0.02, 0.2, 0.4, 0.6, 0.8, 1.0$ 处在 $N = 1000$ 时的直方图和概率密度函数曲线图。

Figure 2. Histogram of the estimates at each point and its probability density function curve

图2. 方差函数各点处估计值的直方图及其概率密度函数曲线

由图2可见，各点处拟合的方差函数估计量的直方图与概率密度函数曲线呈不对称的倒U型，且整体上右偏，直观地可以认为各点处拟合的方差函数估计量近似服从卡方分布。进一步地，对各点处的方差函数估计量的渐近分布与 $a χ^{2} (d^{*}) + e_{1}$ 进行Two-sample t-test检验，并分别循环模拟 $N = 10, 50, 100, 500, 1000$ 次，结果如表2所示。

由表2可知，各点处的P值均大于给定的显著性水平0.05，说明可以接受原假设，即认为检验数据近似服从 $a χ^{2} (d^{*}) + e_{1}$ 。

Table 2. The P-values of the Two-sample t-test-test for the variance function

表2. 方差函数Two-sample t-test检验的P值

4. 结论

本文基于样条方法研究了固定设计下异方差非参数回归模型的均值函数与方差函数估计量的分布，均值函数的估计量服从正态分布，方差函数估计量的渐近分布可由单个 $χ^{2}$ 变量的线性函数来近似。模拟结果显示：在给定显著性水平0.05下，分布拟合效果较优。

本文所研究的固定设计下异方差非参数回归模型的均值函数与方差函数估计量的近似分布为生物、医学、地质、经济等领域的研究带来了便利。

基金项目

国家自然科学基金项目(11601126)；河南省重点攻关项目(182102210286)。

文章引用

詹陆丽,武新乾. 非参数回归模型样条估计量的分布
Distribution of Spline Estimators for Nonparametric Regression Models[J]. 应用数学进展, 2022, 11(10): 7422-7429. https://doi.org/10.12677/AAM.2022.1110788

参考文献

1. Chown, J. (2016) Efficient Estimation of the Error Distribution Function in Heteroskedastic Nonparametric Regression with Missing Data. Statistics & Probability Letters, 117, 31-39. https://doi.org/10.1016/j.spl.2016.04.009

2. 齐培艳, 田铮, 段西发, 袁芳. 异方差非参数回归模型均值与方差变点的小波估计与应用[J]. 系统工程理论与实践, 2013, 33(4): 988-995.

3. Burman, P. (1991) Regression Function Estimation from Dependent Observations. Journal of Multivariate Analysis, 36, 263-279. https://doi.org/10.1016/0047-259X(91)90061-6

4. Song, Q. and Yang, L. (2009) Spline Confidence Bands for Variance Functions. Journal of Nonparametric Statistics, 21, 589-609. https://doi.org/10.1080/10485250902811151

5. 武新乾, 张刚. 非参数回归模型中误差方差的样条估计[J]. 郑州大学学报(理学版), 2015, 47(3): 17-20.

6. 郑美洁, 田波平. 基于两步样条光滑法的非参数回归模型研究[J]. 统计与决策, 2020, 36(3): 14-20.

7. 马晓跃, 武新乾. 非参数回归模型基于残差的样条估计[J]. 河南科技大学学报(自然科学版), 2021, 42(4): 91-96+10.

8. 秦永松. 一类非参数回归函数导数估计的渐近分布[J]. 工程数学学报, 1991, 8(1): 67-74.

9. Liang, H. and Jing, B. (2004) Asymptotic Properties for Estimates of Nonparametric Regression Models Based on Negatively Associated Sequences. Journal of Multivariate Analysis, 95, 227-245. https://doi.org/10.1016/j.jmva.2004.06.004

10. Jin, S., Su, L. and Xiao, Z. (2014) Adaptive Nonparametric Re-gression with Conditional Heteroskedasticity. Econometric Theory, 31, 1153-1191. https://doi.org/10.1017/S0266466614000450

11. Alharbi, Y. and Patili, P. (2018) Error Variance Function Esti-mation in Nonparametric Regression Models. Communications in Statistics-Simulation and Computation, 47, 1479-1491. https://doi.org/10.1080/03610918.2017.1315774

12. Li, Z. and Lin, W. (2020) Efficient Error Variance Estimation in Non-Parametric Regression. Australian & New Zealand Journal of Statistics, 62, 467-484. https://doi.org/10.1111/anzs.12311

13. 范大茵, 冯云. 独立变量线性组合的近似分布[J]. 高校应用数学学报A辑(中文版), 1993, 8(3): 335-338.

14. Zhang, J. (2011) Approximate and Asymptotic Distributions of Chi-Squared-Type Mixtures with Applications. Journal of the American Statistical Association, 100, 273-285. https://doi.org/10.1198/016214504000000575

NOTES

^*通讯作者。

期刊菜单