Model Evaluation and Selection

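Preface

Once you have learned a range of machine learning algorithms, the next question is: how do I know whether my model is any good, and how do I choose the best of several candidates? That is the core problem of model evaluation and selection.

Today I studied evaluation metrics, cross-validation techniques, and the bias-variance tradeoff in depth. Knowing how to train a model is far from enough. What matters more is being able to measure its performance rigorously, understand its strengths and weaknesses, and choose between models on that basis.

These notes record my understanding of the evaluation toolkit, from the basic confusion matrix to the more advanced bias-variance decomposition, as a foundation for later machine learning practice.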

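Part 1: Evaluation Metrics for Classification

1.1 The Confusion Matrix: A "Map" of Classification Errors

1.1.1 Mathematical Definition

The confusion matrix is the basic tool for evaluating a classifier. For a k-class problem it is a k×k matrix:

$C_{ij} = \text{number of samples whose true class is } i \text{ and whose predicted class is } j$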

Standard form of the binary confusion matrix:

                     Predicted
                  Positive   Negative
Actual  Positive     TP         FN
        Negative     FP         TN

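where:

TP (True Positive): predicted positive, actually positive
TN (True Negative): predicted negative, actually negative
FP (False Positive): predicted positive, actually negative (Type I error)
FN (False Negative): predicted negative, actually positive (Type II error)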

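1.1.2 Basic Metrics Derived from the Confusion Matrix

1. Accuracy $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

Meaning: the fraction of all predictions that are correct
Best suited to: datasets with balanced classes
Limitation: misleading on imbalanced datasets, where always predicting the majority class already scores high

2. Precision $\text{Precision} = \frac{TP}{TP + FP}$

Meaning: of all samples predicted positive, the fraction that are truly positive
Intuition: when the model says "yes", how often it is right
Business relevance: precision matters most when false alarms are costly (e.g. spam filtering, where a legitimate email should not be discarded)

3. Recall / Sensitivity $\text{Recall} = \frac{TP}{TP + FN}$

Meaning: of all truly positive samples, the fraction predicted positive
Intuition: how many of the real positives the model finds
Business relevance: recall matters most when misses are costly (e.g. disease diagnosis)

4. Specificity $\text{Specificity} = \frac{TN}{TN + FP}$

Meaning: of all truly negative samples, the fraction predicted negative
Intuition: the model's ability to recognize negatives correctly

5. F1 Score $F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$

Meaning: the harmonic mean of precision and recall
Strength: balances precision and recall; because it ignores TN, it is more informative than accuracy when the positive class is rare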
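
As a quick check of the five formulas above, here is a minimal sketch that computes them from raw counts. The function name `basic_metrics` and the counts are made up for illustration; they do not come from any dataset in these notes.

```python
def basic_metrics(tp, fp, fn, tn):
    """Compute the five metrics above from raw confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # Closed form of the harmonic mean of precision and recall
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
    }


# Hypothetical counts chosen for illustration
metrics = basic_metrics(tp=40, fp=20, fn=10, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the recall (0.8) is noticeably higher than the precision (about 0.667), which is the kind of gap that accuracy alone (0.7) hides.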

1.1.3 Implementing a Complete Evaluator

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    accuracy_score, average_precision_score, classification_report,
    confusion_matrix, f1_score, precision_recall_curve, precision_score,
    recall_score, roc_auc_score, roc_curve,
    # Regression metrics used in Part 2
    explained_variance_score, max_error, mean_absolute_error,
    mean_squared_error, r2_score,
)
from sklearn.model_selection import learning_curve, validation_curve


class ModelEvaluator:
    """A complete model evaluator."""

    def plot_confusion_matrix(self, y_true, y_pred, classes=None, normalize=False):
        """Plot the raw and row-normalized confusion matrices and print an analysis.

        `normalize` is kept for API compatibility; both views are always drawn.
        """
        cm = confusion_matrix(y_true, y_pred)

        # Row-normalize so each row shows per-class recall; guard against empty rows
        row_sums = cm.sum(axis=1, keepdims=True)
        cm_norm = cm / np.maximum(row_sums, 1)

        # Without class names, let seaborn label the ticks by index
        tick_labels = classes if classes is not None else 'auto'

        fig, axes = plt.subplots(1, 2, figsize=(15, 6))
        panels = [(cm, 'd', 'Confusion matrix (counts)'),
                  (cm_norm, '.2f', 'Confusion matrix (row-normalized)')]
        for ax, (matrix, fmt, title) in zip(axes, panels):
            sns.heatmap(matrix, annot=True, fmt=fmt, cmap='Blues',
                        xticklabels=tick_labels, yticklabels=tick_labels, ax=ax)
            ax.set_title(title)
            ax.set_ylabel('True label')
            ax.set_xlabel('Predicted label')

        plt.tight_layout()
        plt.show()

        self._analyze_confusion_matrix(cm, classes)
        return cm

    def _analyze_confusion_matrix(self, cm, classes):
        """Print a textual breakdown of the confusion matrix."""
        print("=== Confusion matrix analysis ===")

        if len(cm) == 2:  # binary
            # sklearn sorts labels ascending, so the layout is [[TN, FP], [FN, TP]]
            tn, fp, fn, tp = cm.ravel()
            print(f"True negatives (TN): {tn}")
            print(f"False positives (FP): {fp} - Type I error")
            print(f"False negatives (FN): {fn} - Type II error")
            print(f"True positives (TP): {tp}")
            print(f"Total errors: {fp + fn} / {cm.sum()}")

            # Which error type dominates
            if fp > fn:
                print("Dominant error type: false positives (over-prediction)")
            elif fn > fp:
                print("Dominant error type: false negatives (under-prediction)")
            else:
                print("False positives and false negatives are balanced")
        else:  # multiclass
            # `classes or range(...)` raises for NumPy arrays, so test for None
            names = classes if classes is not None else range(len(cm))
            print("Per-class recall:")
            for i, class_name in enumerate(names):
                correct = cm[i, i]
                total = cm[i, :].sum()
                class_recall = correct / total if total > 0 else 0
                print(f"  {class_name}: {correct}/{total} = {class_recall:.3f}")

    def classification_metrics(self, y_true, y_pred, y_prob=None, target_names=None):
        """Compute a broad set of classification metrics."""
        print("=== Basic classification metrics ===")

        # Support-weighted averages, so multiclass targets work too
        accuracy = accuracy_score(y_true, y_pred)
        precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
        recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
        f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)

        print(f"Accuracy: {accuracy:.4f}")
        print(f"Precision (weighted): {precision:.4f}")
        print(f"Recall (weighted): {recall:.4f}")
        print(f"F1 score (weighted): {f1:.4f}")

        print("\n=== Detailed classification report ===")
        print(classification_report(y_true, y_pred, target_names=target_names))

        # Score-based metrics need predicted probabilities
        if y_prob is not None:
            y_prob = np.asarray(y_prob)
            n_classes = len(np.unique(y_true))

            if n_classes == 2:  # binary
                # Accept an (n, 2) probability matrix or a 1-D positive-class score
                y_prob_binary = y_prob[:, 1] if y_prob.ndim == 2 else y_prob

                auc = roc_auc_score(y_true, y_prob_binary)
                ap = average_precision_score(y_true, y_prob_binary)

                print("\n=== Probability-based metrics ===")
                print(f"AUC-ROC: {auc:.4f}")
                print(f"Average precision (AP): {ap:.4f}")

                self.plot_roc_curve(y_true, y_prob_binary)
                self.plot_precision_recall_curve(y_true, y_prob_binary)
            else:  # multiclass
                try:
                    auc = roc_auc_score(y_true, y_prob, multi_class='ovr', average='weighted')
                    print("\n=== Multiclass probability-based metrics ===")
                    print(f"Multiclass AUC-ROC: {auc:.4f}")
                except ValueError as e:
                    print(f"Could not compute multiclass AUC: {e}")

        return {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1': f1
        }

    def plot_roc_curve(self, y_true, y_prob):
        """Plot the ROC curve together with a threshold analysis."""
        fpr, tpr, thresholds = roc_curve(y_true, y_prob)
        auc = roc_auc_score(y_true, y_prob)

        plt.figure(figsize=(12, 5))

        # ROC curve
        plt.subplot(1, 2, 1)
        plt.plot(fpr, tpr, linewidth=2, label=f'ROC curve (AUC = {auc:.3f})')
        plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random classifier (AUC = 0.5)')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False positive rate (FPR) = FP/(FP+TN)')
        plt.ylabel('True positive rate (TPR) = TP/(TP+FN)')
        plt.title('ROC curve')
        plt.legend()
        plt.grid(True, alpha=0.3)

        # Threshold analysis. thresholds[0] is a sentinel above every score
        # (infinite in recent scikit-learn releases), so it is left out of the plot.
        plt.subplot(1, 2, 2)
        plt.plot(thresholds[1:], fpr[1:], label='False positive rate (FPR)', linewidth=2)
        plt.plot(thresholds[1:], tpr[1:], label='True positive rate (TPR)', linewidth=2)
        plt.xlabel('Decision threshold')
        plt.ylabel('Rate')
        plt.title('Threshold vs FPR/TPR')
        plt.legend()
        plt.grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()

        # Threshold that maximizes Youden's J = TPR - FPR
        optimal_idx = np.argmax(tpr - fpr)
        optimal_threshold = thresholds[optimal_idx]
        print(f"Optimal threshold (Youden's J): {optimal_threshold:.3f}")
        print(f"TPR at that threshold: {tpr[optimal_idx]:.3f}, FPR: {fpr[optimal_idx]:.3f}")

    def plot_precision_recall_curve(self, y_true, y_prob):
        """Plot the precision-recall curve and F1 as a function of the threshold."""
        precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
        ap = average_precision_score(y_true, y_prob)

        # A random classifier's AP is roughly the positive-class prevalence
        # (this assumes labels coded as 0/1)
        baseline_ap = np.mean(y_true)

        plt.figure(figsize=(12, 5))

        # PR curve
        plt.subplot(1, 2, 1)
        plt.plot(recall, precision, linewidth=2, label=f'PR curve (AP = {ap:.3f})')
        plt.axhline(y=baseline_ap, color='k', linestyle='--',
                    linewidth=1, label=f'Random classifier (AP = {baseline_ap:.3f})')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('Recall')
        plt.ylabel('Precision')
        plt.title('Precision-recall curve')
        plt.legend()
        plt.grid(True, alpha=0.3)

        # F1 per threshold. The last precision/recall point has no threshold,
        # and F1 is defined as 0 where precision + recall == 0 to avoid NaN.
        plt.subplot(1, 2, 2)
        p, r = precision[:-1], recall[:-1]
        denom = p + r
        f1_scores = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)
        plt.plot(thresholds, f1_scores, linewidth=2, label='F1 score')
        optimal_f1_idx = np.argmax(f1_scores)
        optimal_f1_threshold = thresholds[optimal_f1_idx]
        plt.axvline(x=optimal_f1_threshold, color='r', linestyle='--',
                    label=f'Best F1 threshold = {optimal_f1_threshold:.3f}')
        plt.xlabel('Decision threshold')
        plt.ylabel('F1 score')
        plt.title('F1 score vs threshold')
        plt.legend()
        plt.grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()

        print(f"Best F1 threshold: {optimal_f1_threshold:.3f}")
        print(f"Maximum F1 score: {f1_scores[optimal_f1_idx]:.3f}")

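1.2 ROC Curves and AUC: Measuring a Classifier's Discriminative Power

1.2.1 Mathematical Basis of the ROC Curve

The ROC (Receiver Operating Characteristic) curve plots the true positive rate (TPR) against the false positive rate (FPR) as the decision threshold varies.

Definitions:

TPR (True Positive Rate) = $\frac{TP}{TP + FN}$ = Recall = Sensitivity
FPR (False Positive Rate) = $\frac{FP}{FP + TN}$ = 1 - Specificity

What AUC (Area Under the Curve) means: pick one positive and one negative sample at random; AUC is the probability that the classifier scores the positive sample higher than the negative one (ties count as one half).

$\text{AUC} = P(S_+ > S_- | \text{randomly chosen positive and negative samples})$

Properties of AUC:

AUC ∈ [0, 1]
AUC = 0.5: random classifier
AUC = 1: perfect classifier
AUC > 0.5: better than random
AUC < 0.5: worse than random (inverting the predictions gives AUC > 0.5)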
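
The probabilistic reading of AUC can be checked directly by counting pairs. The sketch below is a brute-force illustration, not how libraries compute AUC in practice; the helper name `pairwise_auc` and the scores are hypothetical.

```python
from itertools import product


def pairwise_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores
pos = [0.9, 0.8, 0.4]   # scores given to positive samples
neg = [0.7, 0.3, 0.2]   # scores given to negative samples

auc = pairwise_auc(pos, neg)
print(f"AUC = {auc:.4f}")  # 8 of the 9 pairs are ordered correctly
```

Swapping the roles of the two lists gives 1/9, which matches the remark that a classifier with AUC below 0.5 can be inverted.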

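1.2.2 PR Curves vs ROC Curves

Rule of thumb:

Balanced datasets: both ROC and PR curves are informative
Imbalanced datasets: the PR curve reflects real-world performance better

Why: on a highly imbalanced dataset (say 1% positives), even a small FPR can correspond to a large absolute number of false positives. Precision then collapses while the ROC curve still looks fine.

Worked comparison: suppose the dataset has 990 negatives and 10 positives.

Model output: 8 true positives, 2 false negatives, 100 false positives, 890 true negatives
ROC view: TPR = 8/10 = 0.8, FPR = 100/990 = 0.101 (looks decent)
PR view: Precision = 8/108 = 0.074 (very poor!)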
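
The arithmetic of this example is easy to reproduce. The sketch below also adds one hypothetical scenario that is not in the example above: ten times as many negatives at the same FPR, to show that the ROC coordinates stay put while precision drops further. The helper name `rates` is made up for illustration.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR, precision) for a set of confusion-matrix counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return tpr, fpr, precision


# The example above: 10 positives, 990 negatives
base = rates(tp=8, fn=2, fp=100, tn=890)

# Hypothetical variation: ten times as many negatives, same FPR
scaled = rates(tp=8, fn=2, fp=1000, tn=8900)

for label, (tpr, fpr, precision) in [("base", base), ("10x negatives", scaled)]:
    print(f"{label}: TPR={tpr:.3f}  FPR={fpr:.3f}  precision={precision:.4f}")
```

The ROC point is identical in both cases, so only the PR view registers that the model became far less useful.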

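1.3 Special Considerations for Multiclass Evaluation

1.3.1 Averaging Strategies

Multiclass problems require choosing an averaging strategy:

1. Macro average $\text{Macro-F1} = \frac{1}{k}\sum_{i=1}^{k} F1_i$

Every class carries equal weight
Suited to cases where performance on every class matters, including rare ones

2. Weighted average $\text{Weighted-F1} = \sum_{i=1}^{k} w_i \times F1_i, \quad w_i = \frac{n_i}{n}$

Classes are weighted by their sample counts
Suited to imbalanced data where the large classes matter more

3. Micro average $\text{Micro-F1} = \frac{2 \times \sum_{i=1}^{k} TP_i}{2 \times \sum_{i=1}^{k} TP_i + \sum_{i=1}^{k} FP_i + \sum_{i=1}^{k} FN_i}$

TP, FP, and FN are pooled across all classes before computing the score
Equals accuracy in single-label multiclass problems, where every sample receives exactly one predicted class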
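
To see how the three averages differ, the following sketch computes them from a small confusion matrix. The 3×3 matrix is invented for illustration (rows are true classes, columns are predicted classes).

```python
# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class
cm = [
    [5, 1, 0],
    [2, 3, 1],
    [0, 1, 7],
]
k = len(cm)
n = sum(sum(row) for row in cm)

tp = [cm[i][i] for i in range(k)]
fp = [sum(cm[r][i] for r in range(k)) - tp[i] for i in range(k)]  # column sum minus diagonal
fn = [sum(cm[i]) - tp[i] for i in range(k)]                       # row sum minus diagonal
support = [sum(cm[i]) for i in range(k)]                          # true samples per class

f1 = [2 * tp[i] / (2 * tp[i] + fp[i] + fn[i]) for i in range(k)]

macro_f1 = sum(f1) / k
weighted_f1 = sum(support[i] / n * f1[i] for i in range(k))
micro_f1 = 2 * sum(tp) / (2 * sum(tp) + sum(fp) + sum(fn))
accuracy = sum(tp) / n

print(f"per-class F1: {[round(v, 4) for v in f1]}")
print(f"macro={macro_f1:.4f}  weighted={weighted_f1:.4f}  micro={micro_f1:.4f}  accuracy={accuracy:.4f}")
```

Every misclassified sample is one FP for the predicted class and one FN for the true class, so the pooled FP and FN totals are equal and micro-F1 reduces to accuracy here.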

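Part 2: Evaluation Metrics for Regression

2.1 Common Regression Metrics

2.1.1 Basic Error Metrics

1. Mean Absolute Error (MAE) $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$

Pros: intuitive, same unit as the target, less sensitive to outliers than squared-error metrics
Cons: not differentiable at zero error, which complicates gradient-based optimization

2. Mean Squared Error (MSE) $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

Pros: differentiable and easy to optimize; penalizes large errors heavily
Cons: unit is the square of the target's unit; sensitive to outliers

3. Root Mean Squared Error (RMSE) $\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$

Pros: same unit as the target while keeping the differentiability of MSE
Cons: still sensitive to outliers

4. Coefficient of Determination (R²) $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}$

Meaning: the share of the target's variance that the model explains
Range: (-∞, 1]; 1 is a perfect fit, 0 matches always predicting the mean, and negative values are worse than that baseline
Pros: dimensionless, so it can be compared across problems

2.1.2 Advanced Regression Metrics

5. Mean Absolute Percentage Error (MAPE) $\text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$

Suited to: cases where relative error matters more than absolute error
Limitation: unstable when true values are near 0, and undefined when they equal 0

6. Symmetric Mean Absolute Percentage Error (sMAPE) $\text{sMAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\frac{|y_i - \hat{y}_i|}{(|y_i| + |\hat{y}_i|)/2}$

Improvement: stays defined when the true value is 0 (it is undefined only when the true value and the prediction are both 0)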
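
The formulas in this part can be verified by hand on a tiny example. The sketch below uses only the standard library; the four target and prediction values are made up for illustration.

```python
import math

# Hypothetical targets and predictions
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [3.0, 4.0, 5.0, 10.0]
n = len(y_true)

errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n
mse = sum(e * e for e in errors) / n
rmse = math.sqrt(mse)

y_mean = sum(y_true) / n
ss_res = sum(e * e for e in errors)                  # residual sum of squares
ss_tot = sum((t - y_mean) ** 2 for t in y_true)      # total sum of squares
r2 = 1 - ss_res / ss_tot

# Percentage errors (every y_true here is non-zero, so MAPE is defined)
mape = 100 / n * sum(abs(e / t) for e, t in zip(errors, y_true))
smape = 100 / n * sum(abs(t - p) / ((abs(t) + abs(p)) / 2)
                      for t, p in zip(y_true, y_pred))

print(f"MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
print(f"MAPE={mape:.2f}%  sMAPE={smape:.2f}%")
```

The largest single error (2 units on the last sample) contributes half of the MAE but two thirds of the MSE, which is the outlier sensitivity mentioned above.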

2.1.3 A Complete Regression Evaluation

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.metrics import (
    explained_variance_score, max_error, mean_absolute_error,
    mean_squared_error, r2_score,
)


def regression_metrics(y_true, y_pred, multioutput='uniform_average'):
    """Full evaluation of a regression model with 1-D targets.

    `multioutput` is accepted for API compatibility but not used.
    """
    # Work on flat float arrays so boolean masks and arithmetic behave
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()

    # Basic metrics
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)

    # Extra metrics. The result is named max_err so it does not shadow
    # sklearn's max_error function.
    evs = explained_variance_score(y_true, y_pred)
    max_err = max_error(y_true, y_pred)

    print("=== Regression metrics ===")
    print(f"Mean absolute error (MAE): {mae:.4f}")
    print(f"Mean squared error (MSE): {mse:.4f}")
    print(f"Root mean squared error (RMSE): {rmse:.4f}")
    print(f"Coefficient of determination (R²): {r2:.4f}")
    print(f"Explained variance score: {evs:.4f}")
    print(f"Maximum error: {max_err:.4f}")

    # Percentage error, skipping samples whose true value is 0
    mask = y_true != 0
    if np.any(mask):
        mape = np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100
        print(f"Mean absolute percentage error (MAPE): {mape:.2f}%")

    # Residual analysis
    residuals = y_true - y_pred

    plt.figure(figsize=(15, 10))

    # 1. Predicted vs actual
    plt.subplot(2, 3, 1)
    plt.scatter(y_true, y_pred, alpha=0.6, edgecolors='k', linewidth=0.5)
    min_val, max_val = min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())
    plt.plot([min_val, max_val], [min_val, max_val], 'r--', lw=2, label='Perfect prediction')
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.title(f'Predicted vs actual\n(R² = {r2:.3f})')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # 2. Residual plot
    plt.subplot(2, 3, 2)
    plt.scatter(y_pred, residuals, alpha=0.6, edgecolors='k', linewidth=0.5)
    plt.axhline(y=0, color='r', linestyle='--', linewidth=2)
    plt.xlabel('Predicted')
    plt.ylabel('Residual (actual - predicted)')
    plt.title('Residual plot')
    plt.grid(True, alpha=0.3)

    # Heuristic heteroscedasticity check (not a formal test): correlation
    # between |residual| and the predicted value
    abs_residuals = np.abs(residuals)
    heteroscedasticity = np.corrcoef(y_pred, abs_residuals)[0, 1]
    plt.text(0.05, 0.95, f'Heteroscedasticity check\ncorrelation: {heteroscedasticity:.3f}',
             transform=plt.gca().transAxes, bbox=dict(boxstyle="round", facecolor='wheat'))

    # 3. Residual distribution
    plt.subplot(2, 3, 3)
    plt.hist(residuals, bins=30, alpha=0.7, edgecolor='black')
    plt.xlabel('Residual')
    plt.ylabel('Count')
    plt.title('Residual distribution')
    plt.grid(True, alpha=0.3)

    # Normality test, capped at 5000 samples
    shapiro_stat, shapiro_p = stats.shapiro(residuals[:min(5000, len(residuals))])
    plt.text(0.05, 0.95, f'Shapiro-Wilk normality test\np-value: {shapiro_p:.3f}',
             transform=plt.gca().transAxes, bbox=dict(boxstyle="round", facecolor='lightblue'))

    # 4. Q-Q plot
    plt.subplot(2, 3, 4)
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.title('Q-Q plot (normality check)')
    plt.grid(True, alpha=0.3)

    # 5. Residuals by sample index (reveals serial correlation in ordered data)
    plt.subplot(2, 3, 5)
    plt.plot(residuals, alpha=0.7)
    plt.axhline(y=0, color='r', linestyle='--')
    plt.xlabel('Sample index')
    plt.ylabel('Residual')
    plt.title('Residual sequence')
    plt.grid(True, alpha=0.3)

    # 6. Absolute residuals
    plt.subplot(2, 3, 6)
    plt.scatter(y_pred, abs_residuals, alpha=0.6, edgecolors='k', linewidth=0.5)
    plt.xlabel('Predicted')
    plt.ylabel('|Residual|')
    plt.title('Absolute residual plot')
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Residual summary
    print("\n=== Residual analysis summary ===")
    print(f"Residual mean: {np.mean(residuals):.6f} (should be near 0)")
    print(f"Residual standard deviation: {np.std(residuals):.4f}")
    print(f"Residual skewness: {stats.skew(residuals):.4f} (should be near 0)")
    print(f"Residual excess kurtosis: {stats.kurtosis(residuals):.4f} (should be near 0)")

    # The 0.3 and 0.05 cut-offs are rules of thumb
    if abs(heteroscedasticity) > 0.3:
        print("Warning: possible heteroscedasticity")
    if shapiro_p < 0.05:
        print("Warning: residuals may not be normally distributed")

    return {
        'mae': mae, 'mse': mse, 'rmse': rmse, 'r2': r2,
        'explained_variance': evs, 'max_error': max_err
    }

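Part 3: Cross-Validation and Model Selection

3.1 Mathematical Basis of Cross-Validation

3.1.1 K-Fold Cross-Validation

Algorithm:

Randomly partition the dataset D into k subsets of (nearly) equal size: $D_1, D_2, \ldots, D_k$
For each subset $D_i$, train the model on the other k-1 subsets and test it on $D_i$
Average the k test results to obtain the final estimate

In symbols: $\text{CV}_k = \frac{1}{k}\sum_{i=1}^{k} L(f^{(-i)}, D_i)$

where $f^{(-i)}$ is the model trained on all data except $D_i$, and $L$ is the loss function.

Spread across folds: $\hat{\sigma}^2_{\text{fold}} = \frac{1}{k}\sum_{i=1}^{k}(L_i - \text{CV}_k)^2$, with $L_i = L(f^{(-i)}, D_i)$. This measures how much the fold scores vary. It is not an unbiased estimate of $\text{Var}(\text{CV}_k)$, because the folds share training data and their scores are correlated.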
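
The k-fold procedure can be sketched from scratch in a few lines. This is a toy illustration under stated assumptions: the "model" simply predicts the training mean, the loss is MSE, and the function names and data are hypothetical. Real projects would normally rely on a library implementation.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate_mean_predictor(y, k=5, seed=0):
    """K-fold CV of a toy model that always predicts the training mean (MSE loss)."""
    folds = k_fold_indices(len(y), k, seed)
    losses = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except fold i
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        prediction = sum(y[j] for j in train_idx) / len(train_idx)
        # Test on the held-out fold
        loss = sum((y[j] - prediction) ** 2 for j in test_idx) / len(test_idx)
        losses.append(loss)
    cv = sum(losses) / k                                    # CV_k
    fold_var = sum((l - cv) ** 2 for l in losses) / k       # spread across folds
    return cv, fold_var, losses


y = [float(v) for v in range(20)]   # hypothetical targets
cv, fold_var, losses = cross_validate_mean_predictor(y, k=5)
print(f"CV estimate: {cv:.3f}, fold variance: {fold_var:.3f}")
```

Each sample is used for testing exactly once and for training k-1 times, which is what makes the averaged loss a better use of limited data than a single train/test split.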

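3.1.2 Comparing Cross-Validation Strategies

1. Leave-One-Out Cross-Validation (LOOCV)

k = n (the number of samples)
Pros: nearly unbiased estimate; almost all data is used for training in every round
Cons: expensive (n model fits), and the estimate can have high variance

2. Stratified Cross-Validation

Keeps the class distribution in every fold the same as in the full dataset
Suited to: imbalanced datasets

3. Time-Series Cross-Validation

Respects temporal order so that no future information leaks into training
Methods: sliding window, expanding window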
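
An expanding-window split can be sketched as a small generator. The function name, parameters, and sizes below are hypothetical; scikit-learn ships a ready-made `TimeSeriesSplit` that follows the same idea.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs where training data always precedes test data."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for s in range(n_splits):
        test_start = first_test_start + s * test_size
        train_idx = list(range(0, test_start))                      # everything seen so far
        test_idx = list(range(test_start, test_start + test_size))  # the next block in time
        yield train_idx, test_idx


splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  ->  test {test_idx}")
```

A sliding-window variant would drop the oldest training samples at each step instead of keeping them all, which helps when old data stops being representative.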

3.1.3 A Comprehensive Cross-Validation Analysis

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score


def comprehensive_cross_validation(models, X, y, cv_strategies=None, scoring_metrics=None):
    """Cross-validate several models under several CV strategies and metrics."""
    # Heuristic: targets with at most 20 distinct values are treated as classification
    is_classification = len(np.unique(y)) <= 20

    if cv_strategies is None:
        cv_strategies = {
            'KFold-5': KFold(n_splits=5, shuffle=True, random_state=42),
            'KFold-10': KFold(n_splits=10, shuffle=True, random_state=42),
        }
        if is_classification:
            # Stratification needs discrete labels, so it is skipped for regression
            cv_strategies['StratifiedKFold-5'] = StratifiedKFold(
                n_splits=5, shuffle=True, random_state=42)

    if scoring_metrics is None:
        if is_classification:
            scoring_metrics = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
        else:
            scoring_metrics = ['neg_mean_absolute_error', 'neg_mean_squared_error', 'r2']

    results = {}

    for model_name, model in models.items():
        print(f"\n=== {model_name} ===")
        model_results = {}

        for cv_name, cv_strategy in cv_strategies.items():
            print(f"\n{cv_name} cross-validation:")
            cv_results = {}

            for metric in scoring_metrics:
                scores = cross_val_score(model, X, y, cv=cv_strategy,
                                         scoring=metric, n_jobs=-1)
                mean, std = scores.mean(), scores.std()

                # Approximate 95% interval for the mean score: mean ± 1.96 × standard error.
                # Fold scores are correlated, so treat it as a rough guide only.
                sem = scores.std(ddof=1) / np.sqrt(len(scores))
                cv_results[metric] = {
                    'scores': scores,
                    'mean': mean,
                    'std': std,
                    'ci_lower': mean - 1.96 * sem,
                    'ci_upper': mean + 1.96 * sem
                }

                print(f"  {metric}: {mean:.4f} (±{std:.4f})")
                print(f"    approx. 95% CI: [{cv_results[metric]['ci_lower']:.4f}, "
                      f"{cv_results[metric]['ci_upper']:.4f}]")

            model_results[cv_name] = cv_results

        results[model_name] = model_results

    # Plot the comparison using the first metric
    plot_cv_comparison(results, scoring_metrics[0])

    return results


def plot_cv_comparison(cv_results, primary_metric):
    """Visual comparison of cross-validation results for one metric."""
    models = list(cv_results.keys())
    cv_strategies = list(cv_results[models[0]].keys())

    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    axes = axes.ravel()

    x_pos = np.arange(len(models))
    width = 0.8 / len(cv_strategies)   # share each model's slot among the strategies
    tick_pos = x_pos + width * (len(cv_strategies) - 1) / 2

    # 1. Mean score per model and CV strategy (error bars: std across folds)
    ax = axes[0]
    for i, cv_name in enumerate(cv_strategies):
        means = [cv_results[m][cv_name][primary_metric]['mean'] for m in models]
        stds = [cv_results[m][cv_name][primary_metric]['std'] for m in models]
        ax.bar(x_pos + i * width, means, width, yerr=stds,
               label=cv_name, capsize=5, alpha=0.8)
    ax.set_xlabel('Model')
    ax.set_ylabel(primary_metric)
    ax.set_title(f'{primary_metric} under different CV strategies')
    ax.set_xticks(tick_pos)
    ax.set_xticklabels(models, rotation=45)
    ax.legend()
    ax.grid(True, alpha=0.3)

    # 2. Box plot of the fold scores
    ax = axes[1]
    all_scores, labels = [], []
    for model in models:
        for cv_name in cv_strategies:
            all_scores.append(cv_results[model][cv_name][primary_metric]['scores'])
            labels.append(f"{model}\n{cv_name}")
    ax.boxplot(all_scores, patch_artist=True)
    # Tick labels are set separately because the boxplot `labels` keyword
    # was renamed in newer Matplotlib releases
    ax.set_xticks(range(1, len(labels) + 1))
    ax.set_xticklabels(labels, rotation=45, ha='right')
    ax.set_title('Distribution of cross-validation scores')
    ax.set_ylabel(primary_metric)
    ax.grid(True, alpha=0.3)

    # 3. Stability (standard deviation across folds)
    ax = axes[2]
    for i, cv_name in enumerate(cv_strategies):
        stds = [cv_results[m][cv_name][primary_metric]['std'] for m in models]
        ax.bar(x_pos + i * width, stds, width, label=cv_name, alpha=0.8)
    ax.set_xlabel('Model')
    ax.set_ylabel(f'{primary_metric} standard deviation')
    ax.set_title('Model stability (smaller standard deviation = more stable)')
    ax.set_xticks(tick_pos)
    ax.set_xticklabels(models, rotation=45)
    ax.legend()
    ax.grid(True, alpha=0.3)

    # 4. Confidence-interval widths
    ax = axes[3]
    ci_ranges, y_labels = [], []
    for model in models:
        for cv_name in cv_strategies:
            result = cv_results[model][cv_name][primary_metric]
            ci_ranges.append(result['ci_upper'] - result['ci_lower'])
            y_labels.append(f"{model}-{cv_name}")
    y_pos = np.arange(len(y_labels))
    ax.barh(y_pos, ci_ranges, alpha=0.7)
    ax.set_xlabel('Confidence interval width')
    ax.set_ylabel('Model - CV strategy')
    ax.set_title('Confidence interval width (narrower = more reliable)')
    ax.set_yticks(y_pos)
    ax.set_yticklabels(y_labels)
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()


def statistical_significance_test(cv_results, model1, model2, cv_strategy, metric):
    """Paired t-test on the fold scores of two models.

    Both models must have been scored on the same folds (same CV object and
    random_state). Fold scores are not independent, so the p-value tends to
    be optimistic.
    """
    scores1 = cv_results[model1][cv_strategy][metric]['scores']
    scores2 = cv_results[model2][cv_strategy][metric]['scores']

    t_stat, p_value = stats.ttest_rel(scores1, scores2)

    print(f"\n=== Significance test: {model1} vs {model2} ===")
    print("Method: paired t-test")
    print(f"t statistic: {t_stat:.4f}")
    print(f"p-value: {p_value:.4f}")

    alpha = 0.05
    if p_value < alpha:
        # sklearn scorers follow "greater is better", including the neg_* metrics
        winner = model1 if scores1.mean() > scores2.mean() else model2
        print(f"Conclusion: {winner} is significantly better (α = {alpha})")
    else:
        print(f"Conclusion: no significant difference between the models (α = {alpha})")

    return t_stat, p_value

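3.2 Learning Curves and Validation Curves

3.2.1 Learning Curve Analysis

A learning curve shows how model performance changes with the number of training samples. It helps diagnose:

Underfitting: training and validation curves are both low and close together
Overfitting: the training curve is high, the validation curve is low, and the gap is large
Ideal: both curves are high and close together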

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve


def comprehensive_learning_curve_analysis(model, X, y, cv=5):
    """Learning-curve analysis for a classifier, scored by accuracy."""
    # Compute the learning curve
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=cv, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='accuracy',
        random_state=42  # only takes effect when shuffle=True
    )

    # Mean and standard deviation across the CV folds
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)

    plt.figure(figsize=(15, 5))

    # 1. Learning curve
    plt.subplot(1, 3, 1)
    plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training set')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
                     alpha=0.1, color='blue')
    plt.plot(train_sizes, val_mean, 'o-', color='red', label='Validation set')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std,
                     alpha=0.1, color='red')
    plt.xlabel('Training samples')
    plt.ylabel('Accuracy')
    plt.title('Learning curve')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # 2. Generalization gap
    plt.subplot(1, 3, 2)
    overfitting_gap = train_mean - val_mean
    plt.plot(train_sizes, overfitting_gap, 'o-', color='orange', linewidth=2)
    plt.axhline(y=0, color='black', linestyle='--', alpha=0.5)
    plt.xlabel('Training samples')
    plt.ylabel('Gap (training score - validation score)')
    plt.title('Overfitting gap')
    plt.grid(True, alpha=0.3)

    # 3. Score variability across folds
    plt.subplot(1, 3, 3)
    plt.plot(train_sizes, train_std, 'o-', color='blue', label='Training std')
    plt.plot(train_sizes, val_std, 'o-', color='red', label='Validation std')
    plt.xlabel('Training samples')
    plt.ylabel('Standard deviation')
    plt.title('Score variability')
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Conclusions
    final_gap = overfitting_gap[-1]
    final_val_score = val_mean[-1]

    print("=== Learning curve conclusions ===")
    print(f"Final validation score: {final_val_score:.4f}")
    print(f"Final overfitting gap: {final_gap:.4f}")

    # The 0.1 and 0.7 thresholds are rules of thumb; adjust them per task
    if final_gap > 0.1:
        print("Overfitting detected")
        print("Suggestion: add regularization, reduce model complexity, or collect more data")
    elif final_val_score < 0.7:
        print("Underfitting detected")
        print("Suggestion: increase model complexity, add features, or check data quality")
    else:
        print("The model fits well")

    # Data efficiency: validation gain per additional training sample
    data_efficiency = (val_mean[-1] - val_mean[0]) / (train_sizes[-1] - train_sizes[0])
    print(f"Data efficiency: {data_efficiency:.6f} score gain per sample")

    return train_sizes, train_scores, val_scores

3.2.2 Validation Curve Analysis

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve


def comprehensive_validation_curve_analysis(model, X, y, param_name, param_range, cv=5):
    """Validation-curve analysis for one hyperparameter.

    The log-scaled x axis assumes a positive numeric param_range.
    """
    # Compute the validation curve
    train_scores, val_scores = validation_curve(
        model, X, y, param_name=param_name, param_range=param_range,
        cv=cv, scoring='accuracy', n_jobs=-1
    )

    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)

    plt.figure(figsize=(15, 5))

    # 1. Validation curve
    plt.subplot(1, 3, 1)
    plt.semilogx(param_range, train_mean, 'o-', color='blue', label='Training set')
    plt.fill_between(param_range, train_mean - train_std, train_mean + train_std,
                     alpha=0.1, color='blue')
    plt.semilogx(param_range, val_mean, 'o-', color='red', label='Validation set')
    plt.fill_between(param_range, val_mean - val_std, val_mean + val_std,
                     alpha=0.1, color='red')
    plt.xlabel(f'{param_name} (log scale)')
    plt.ylabel('Accuracy')
    plt.title(f'Validation curve ({param_name})')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # 2. Bias-variance tradeoff, using rough proxies rather than a true decomposition
    plt.subplot(1, 3, 2)
    bias_proxy = 1 - val_mean   # validation error as a stand-in for bias
    variance_proxy = val_std    # spread across folds as a stand-in for variance

    plt.semilogx(param_range, bias_proxy, 'o-', color='red', label='Bias (proxy)')
    plt.semilogx(param_range, variance_proxy, 'o-', color='blue', label='Variance (proxy)')
    plt.xlabel(f'{param_name} (log scale)')
    plt.ylabel('Error')
    plt.title('Bias-variance tradeoff')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # 3. Choosing the parameter: validation score alone vs score penalized for instability
    plt.subplot(1, 3, 3)
    stability_penalty = val_std / np.maximum(val_mean, 1e-12)  # coefficient of variation
    composite_score = val_mean - stability_penalty             # heuristic composite

    plt.semilogx(param_range, val_mean, 'o-', color='green', label='Validation score')
    plt.semilogx(param_range, composite_score, 'o-', color='orange', label='Composite score')

    # Mark the best points
    best_idx_validation = np.argmax(val_mean)
    best_idx_composite = np.argmax(composite_score)

    plt.axvline(x=param_range[best_idx_validation], color='green',
                linestyle='--', alpha=0.7, label=f'Best validation: {param_range[best_idx_validation]}')
    plt.axvline(x=param_range[best_idx_composite], color='orange',
                linestyle='--', alpha=0.7, label=f'Best composite: {param_range[best_idx_composite]}')

    plt.xlabel(f'{param_name} (log scale)')
    plt.ylabel('Score')
    plt.title('Parameter selection strategies')
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Report
    print("=== Validation curve conclusions ===")
    print(f"Best validation score: {val_mean[best_idx_validation]:.4f} "
          f"(parameter = {param_range[best_idx_validation]})")
    print(f"Best composite score: {composite_score[best_idx_composite]:.4f} "
          f"(parameter = {param_range[best_idx_composite]})")
    print(f"Validation score at that point: {val_mean[best_idx_composite]:.4f}")
    print(f"Stability (std) at that point: {val_std[best_idx_composite]:.4f}")

    return param_range[best_idx_validation], param_range[best_idx_composite]

Part 4: The Bias-Variance Tradeoff

4.1 Mathematical Basis of the Bias-Variance Decomposition

4.1.1 Derivation

For a regression problem, let the true function be $f(x)$, the noise be $\epsilon \sim N(0, \sigma^2)$ (independent of the training data), and the model's prediction be $\hat{f}(x)$. Then:

$y = f(x) + \epsilon$

The expected prediction error, taken over training sets and noise, decomposes as:

$\mathbb{E}[(y - \hat{f}(x))^2] = \text{Bias}^2(\hat{f}(x)) + \text{Var}(\hat{f}(x)) + \sigma^2$

where:

Bias: $\text{Bias}(\hat{f}(x)) = \mathbb{E}[\hat{f}(x)] - f(x)$

Variance: $\text{Var}(\hat{f}(x)) = \mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]$

Noise (irreducible error): $\sigma^2 = \mathbb{E}[\epsilon^2]$
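
The decomposition follows in two steps. This is a sketch of the standard argument, writing $f$ and $\hat{f}$ for $f(x)$ and $\hat{f}(x)$ and assuming only that $\epsilon$ has zero mean, has variance $\sigma^2$, and is independent of $\hat{f}$.

Step 1, separate the noise. Substituting $y = f + \epsilon$ and using $\mathbb{E}[\epsilon] = 0$, the cross term vanishes:

$\mathbb{E}[(y - \hat{f})^2] = \mathbb{E}[(f - \hat{f})^2] + 2\,\mathbb{E}[\epsilon]\,\mathbb{E}[f - \hat{f}] + \mathbb{E}[\epsilon^2] = \mathbb{E}[(f - \hat{f})^2] + \sigma^2$

Step 2, split the remaining term around $\mathbb{E}[\hat{f}]$. Writing $f - \hat{f} = (f - \mathbb{E}[\hat{f}]) + (\mathbb{E}[\hat{f}] - \hat{f})$, the cross term vanishes again because $\mathbb{E}[\mathbb{E}[\hat{f}] - \hat{f}] = 0$:

$\mathbb{E}[(f - \hat{f})^2] = (f - \mathbb{E}[\hat{f}])^2 + \mathbb{E}[(\hat{f} - \mathbb{E}[\hat{f}])^2] = \text{Bias}^2(\hat{f}) + \text{Var}(\hat{f})$

Adding the two results gives the three-term decomposition.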

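4.1.2 Intuition

High bias, low variance:
🎯     ●●●
       ●●●  (systematically off target, but tightly grouped)
       ●●●

Low bias, high variance:
🎯   ●   ●
    ●  ●   (centered on the target, but widely scattered)
  ●       ●

High bias, high variance:
🎯       ●
   ●   ●    (both off target and scattered)
     ●   ●

Low bias, low variance:
🎯  ●●●
    ●●●     (ideal: accurate and stable)
    ●●●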
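
The target-shooting picture can be backed by a small Monte Carlo experiment. The sketch below is a toy setup chosen for illustration, not the estimator used elsewhere in these notes: the "model" is a shrunken sample mean of a few noisy observations of a single value $f(x) = 2$, and all names and parameter values are hypothetical.

```python
import random


def simulate(shrink, n_obs=5, sigma=1.0, f_x=2.0, n_trials=20000, seed=0):
    """Estimate bias², variance and total error of a shrunken-mean predictor."""
    rng = random.Random(seed)
    preds, sq_errors = [], []
    for _ in range(n_trials):
        # A fresh training set: n_obs noisy observations of f(x)
        train = [f_x + rng.gauss(0.0, sigma) for _ in range(n_obs)]
        pred = shrink * sum(train) / n_obs      # shrink < 1 trades variance for bias
        y_new = f_x + rng.gauss(0.0, sigma)     # independent test observation
        preds.append(pred)
        sq_errors.append((y_new - pred) ** 2)
    mean_pred = sum(preds) / n_trials
    bias_sq = (mean_pred - f_x) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / n_trials
    total = sum(sq_errors) / n_trials
    return bias_sq, variance, total


for shrink in (1.0, 0.5):
    bias_sq, variance, total = simulate(shrink)
    print(f"shrink={shrink}: bias²={bias_sq:.3f}  variance={variance:.3f}  "
          f"bias²+variance+σ²={bias_sq + variance + 1.0:.3f}  measured error={total:.3f}")
```

With `shrink=1.0` the predictor is unbiased but noisier; with `shrink=0.5` the variance falls to a quarter while the bias grows. In both cases the measured error should land close to the sum of the three terms, up to Monte Carlo noise.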

4.2 实现偏差-方差分解#

1
import numpy as np
2
import matplotlib.pyplot as plt
3
from sklearn.model_selection import train_test_split
4
from scipy.stats import linregress
5

6
def bias_variance_decomposition(model_class, X, y, n_trials=100, test_size=0.3,
7
                               problem_type='regression'):
8
    """完整的偏差-方差分解分析"""
9

10
    n_samples = len(X)
11
    n_test = int(n_samples * test_size)
12

13
    predictions = []
14
    test_indices_list = []
15

16
    print(f"进行 {n_trials} 次独立实验...")
17

18
    for trial in range(n_trials):
19
        # 随机分割数据
20
        indices = np.random.permutation(n_samples)
21
        train_idx = indices[:-n_test]
22
        test_idx = indices[-n_test:]
23

24
        X_train, X_test = X[train_idx], X[test_idx]
25
        y_train, y_test = y[train_idx], y[test_idx]
26

27
        # 训练模型
28
        if hasattr(model_class, '__call__'):
29
            model = model_class()
30
        else:
31
            model = model_class
32

33
        model.fit(X_train, y_train)
34

35
        # 预测
36
        if problem_type == 'regression':
37
            y_pred = model.predict(X_test)
38
        else:  # classification
39
            if hasattr(model, 'predict_proba'):
40
                y_pred = model.predict_proba(X_test)[:, 1]  # 假设二分类
41
            else:
42
                y_pred = model.predict(X_test)
43

44
        predictions.append(y_pred)
45
        test_indices_list.append(test_idx)
46

47
    # 找到共同的测试样本
48
    common_indices = set(test_indices_list[0])
49
    for indices in test_indices_list[1:]:
50
        common_indices = common_indices.intersection(set(indices))
51

52
    if len(common_indices) < 10:
53
        print("警告：共同测试样本过少，使用替代方法...")
54
        return alternative_bias_variance_analysis(model_class, X, y, n_trials)
55

56
    common_indices = list(common_indices)
57

58
    # 收集共同样本的预测结果
59
    common_predictions = []
60
    for i, pred in enumerate(predictions):
61
        # 找到当前预测中对应的位置
62
        test_idx = test_indices_list[i]
63
        mask = np.isin(test_idx, common_indices)
64
        common_pred = pred[mask]
65

66
        # 按照common_indices的顺序排列
67
        ordered_pred = np.zeros(len(common_indices))
68
        for j, idx in enumerate(common_indices):
69
            pos = np.where(np.array(test_idx) == idx)[0][0]
70
            ordered_pred[j] = pred[pos]
71

72
        common_predictions.append(ordered_pred)
73

74
    predictions = np.array(common_predictions)
75
    y_true = y[common_indices]
76

    # Bias-variance decomposition
    mean_pred = np.mean(predictions, axis=0)

    if problem_type == 'regression':
        # Regression: decomposition under squared loss
        bias_squared = np.mean((mean_pred - y_true) ** 2)
        variance = np.mean(np.var(predictions, axis=0))

        # Residual reported as "noise". With fixed targets y_true, total_error
        # equals bias_squared + variance algebraically, so this residual is ~0.
        # Any irreducible noise in y is absorbed into bias_squared and cannot be
        # separated without repeated observations of y for the same x.
        total_error = np.mean((predictions - y_true.reshape(1, -1)) ** 2)
        noise = max(0, total_error - bias_squared - variance)  # clamp float round-off

        print("=== Bias-variance decomposition (regression) ===")
        print(f"Bias²: {bias_squared:.6f}")
        print(f"Variance: {variance:.6f}")
        print(f"Noise: {noise:.6f}")
        print(f"Total error: {total_error:.6f}")
        print(f"Decomposition check (bias² + variance + noise): {bias_squared + variance + noise:.6f}")

        components = ['Bias²', 'Variance', 'Noise']
        values = [bias_squared, variance, noise]

    else:
        # Classification: decomposition under 0-1 loss.
        # Convert to hard labels: class labels pass through unchanged, while
        # positive-class probabilities are thresholded at 0.5.
        # Assumes classes are encoded as non-negative integers (np.bincount needs this).
        labels = np.rint(predictions).astype(int)

        # Main prediction: majority vote across trials for each sample
        main_pred = np.array([np.bincount(col).argmax() for col in labels.T])

        # Bias: how often the main prediction disagrees with the true label
        bias = np.mean(main_pred != y_true)

        # Variance: how often an individual trial disagrees with the main prediction
        variance = np.mean(labels != main_pred[np.newaxis, :])

        # Placeholder only: label noise is not estimated from the data here
        noise = 0.05

        print("=== Bias-variance decomposition (classification) ===")
        print(f"Bias: {bias:.6f}")
        print(f"Variance: {variance:.6f}")
        print(f"Noise (assumed placeholder): {noise:.6f}")

        components = ['Bias', 'Variance', 'Noise']
        values = [bias, variance, noise]

    # Visualize the results
    plt.figure(figsize=(15, 10))

    # 1. Pie chart of the decomposition
    plt.subplot(2, 3, 1)
    colors = ['#ff9999', '#66b3ff', '#99ff99']
    plt.pie(values, labels=components, autopct='%1.1f%%', colors=colors, startangle=90)
    plt.title('Bias-variance decomposition')

    # 2. Bar chart of each component's contribution
    plt.subplot(2, 3, 2)
    bars = plt.bar(components, values, color=colors, alpha=0.7, edgecolor='black')
    plt.ylabel('Error contribution')
    plt.title('Error contribution by component')
    plt.grid(True, alpha=0.3)

    # Add value labels above the bars
    for bar, value in zip(bars, values):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(values)*0.01,
                f'{value:.4f}', ha='center', va='bottom')

    # 3. Distribution of predictions across trials
    plt.subplot(2, 3, 3)
    for i in range(min(5, len(common_indices))):  # show only the first 5 samples
        plt.hist(predictions[:, i], alpha=0.6, bins=20,
                label=f'Sample {i+1}', density=True)
    plt.xlabel('Predicted value')
    plt.ylabel('Density')
    plt.title('Prediction distribution (first 5 samples)')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # 4. Per-sample bias
    plt.subplot(2, 3, 4)
    bias_per_sample = np.abs(mean_pred - y_true)
    plt.plot(bias_per_sample, 'o-', alpha=0.7)
    plt.axhline(y=np.mean(bias_per_sample), color='r', linestyle='--',
                label=f'Mean |bias|: {np.mean(bias_per_sample):.4f}')
    plt.xlabel('Sample index')
    plt.ylabel('|Bias|')
    plt.title('Per-sample bias')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # 5. Per-sample variance
    plt.subplot(2, 3, 5)
    variance_per_sample = np.var(predictions, axis=0)
    plt.plot(variance_per_sample, 'o-', alpha=0.7, color='blue')
    plt.axhline(y=np.mean(variance_per_sample), color='r', linestyle='--',
                label=f'Mean variance: {np.mean(variance_per_sample):.4f}')
    plt.xlabel('Sample index')
    plt.ylabel('Variance')
    plt.title('Per-sample variance')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # 6. Scatter plot of bias against variance
    plt.subplot(2, 3, 6)
    plt.scatter(bias_per_sample, variance_per_sample, alpha=0.6, edgecolors='k')
    plt.xlabel('|Bias|')
    plt.ylabel('Variance')
    plt.title('Bias vs. variance')
    plt.grid(True, alpha=0.3)

    # Add a trend line (skipped when all |bias| values coincide, which linregress rejects)
    from scipy.stats import linregress
    if np.ptp(bias_per_sample) > 0:
        slope, intercept, r_value, p_value, std_err = linregress(bias_per_sample, variance_per_sample)
        x_trend = np.linspace(bias_per_sample.min(), bias_per_sample.max(), 100)
        y_trend = slope * x_trend + intercept
        plt.plot(x_trend, y_trend, 'r--', alpha=0.8,
                 label=f'Trend line (R² = {r_value**2:.3f})')
        plt.legend()

    plt.tight_layout()
    plt.show()

    # Detailed summary
    print("\n=== Bias-variance decomposition: detailed analysis ===")
    if problem_type == 'regression':
        print("Share of each component:")
        total = bias_squared + variance + noise
        print(f"  Bias² share: {bias_squared/total*100:.1f}%")
        print(f"  Variance share: {variance/total*100:.1f}%")
        print(f"  Noise share: {noise/total*100:.1f}%")

        if bias_squared > variance:
            print("🎯 Main issue: high bias (underfitting)")
            print("Suggestions: increase model complexity, add features, reduce regularization")
        elif variance > bias_squared:
            print("🎯 Main issue: high variance (overfitting)")
            print("Suggestions: increase regularization, reduce model complexity, add training data")
        else:
            print("🎯 Bias and variance are roughly balanced")

    return {
        # For classification this key holds the (unsquared) 0-1 bias
        'bias_squared': bias_squared if problem_type == 'regression' else bias,
        'variance': variance,
        'noise': noise,
        'total_error': bias_squared + variance + noise if problem_type == 'regression' else bias + variance + noise
    }

def alternative_bias_variance_analysis(model_class, X, y, n_trials=100):
    """Fallback bias-variance analysis for when too few test samples are shared.

    Uses squared loss throughout, so the numbers are most meaningful for regression.
    """

    print("Running the alternative bias-variance analysis...")

    # Hold out one fixed test set
    from sklearn.base import clone
    from sklearn.model_selection import train_test_split
    X_train_base, X_test, y_train_base, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    rng = np.random.default_rng(42)  # seeded so the bootstrap resamples are reproducible
    predictions = []

    for trial in range(n_trials):
        # Bootstrap-resample the training set
        n_train = len(X_train_base)
        bootstrap_idx = rng.choice(n_train, size=n_train, replace=True)
        X_bootstrap = X_train_base[bootstrap_idx]
        y_bootstrap = y_train_base[bootstrap_idx]

        # Build a fresh model: call a factory or class, or clone an estimator instance
        model = model_class() if callable(model_class) else clone(model_class)

        model.fit(X_bootstrap, y_bootstrap)
        y_pred = model.predict(X_test)
        predictions.append(y_pred)

    predictions = np.array(predictions)

    # Bias-variance decomposition
    mean_pred = np.mean(predictions, axis=0)
    bias_squared = np.mean((mean_pred - y_test) ** 2)
    variance = np.mean(np.var(predictions, axis=0))

    # With fixed targets, total_error == bias_squared + variance algebraically,
    # so this residual "noise" is ~0; the clamp only removes float round-off.
    total_error = np.mean((predictions - y_test.reshape(1, -1)) ** 2)
    noise = max(0, total_error - bias_squared - variance)

    print("=== Bias-variance decomposition (alternative method) ===")
    print(f"Bias²: {bias_squared:.6f}")
    print(f"Variance: {variance:.6f}")
    print(f"Noise: {noise:.6f}")
    print(f"Total error: {total_error:.6f}")

    return {
        'bias_squared': bias_squared,
        'variance': variance,
        'noise': noise,
        'total_error': total_error
    }
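
The "noise" residual computed above deserves a sanity check. The following is a minimal sketch on synthetic data; the target function `true_f`, the noise level `sigma`, and the cubic-polynomial model are all illustrative assumptions rather than part of the original notes. It suggests that, measured against fixed noisy labels, the mean squared error equals bias² plus variance exactly, so the residual is zero up to round-off. Separating irreducible noise needs either the noise-free target (available only in simulation) or repeated draws of y.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    """Hypothetical noise-free target function (an assumption for this demo)."""
    return np.sin(2 * np.pi * x)

sigma = 0.3                      # assumed standard deviation of the label noise
x_test = np.linspace(0, 1, 50)   # fixed test inputs
y_test = true_f(x_test) + rng.normal(0, sigma, x_test.size)  # one noisy draw

n_trials = 200
predictions = np.empty((n_trials, x_test.size))
for t in range(n_trials):
    # Draw a fresh training set each trial, then fit a cubic polynomial
    x_train = rng.uniform(0, 1, 30)
    y_train = true_f(x_train) + rng.normal(0, sigma, x_train.size)
    coefs = np.polyfit(x_train, y_train, deg=3)
    predictions[t] = np.polyval(coefs, x_test)

mean_pred = predictions.mean(axis=0)
variance = predictions.var(axis=0).mean()

# Against the observed noisy labels, the identity holds exactly
bias_sq_noisy = np.mean((mean_pred - y_test) ** 2)
total_error = np.mean((predictions - y_test) ** 2)
residual = total_error - bias_sq_noisy - variance

# Against the noise-free target, which is known only in simulation
bias_sq_clean = np.mean((mean_pred - true_f(x_test)) ** 2)

print(f"residual 'noise' term : {residual:.2e}")
print(f"bias^2 vs noisy labels: {bias_sq_noisy:.4f}")
print(f"bias^2 vs true f      : {bias_sq_clean:.4f}")
print(f"variance              : {variance:.4f}")
```

Comparing the two bias² figures shows how much of the "bias" measured against noisy labels is really label noise from that particular draw.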

4.3 Bias-Variance Analysis Across Model Complexities

from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier

def compare_model_complexity_bias_variance(X, y, problem_type='regression'):
    """Compare the bias-variance trade-off across models of differing complexity."""

    # Each entry is a zero-argument factory that builds a fresh, unfitted model
    if problem_type == 'regression':
        models = {
            'Linear regression (low complexity)': lambda: LinearRegression(),
            'Decision tree (depth=3)': lambda: DecisionTreeRegressor(max_depth=3, random_state=42),
            'Decision tree (depth=10)': lambda: DecisionTreeRegressor(max_depth=10, random_state=42),
            'Decision tree (unlimited)': lambda: DecisionTreeRegressor(random_state=42),
            'KNN (k=10)': lambda: KNeighborsRegressor(n_neighbors=10),
            'KNN (k=1)': lambda: KNeighborsRegressor(n_neighbors=1),
            'Random forest': lambda: RandomForestRegressor(n_estimators=50, random_state=42)
        }
    else:
        models = {
            'Logistic regression (low complexity)': lambda: LogisticRegression(random_state=42),
            'Decision tree (depth=3)': lambda: DecisionTreeClassifier(max_depth=3, random_state=42),
            'Decision tree (depth=10)': lambda: DecisionTreeClassifier(max_depth=10, random_state=42),
            'Decision tree (unlimited)': lambda: DecisionTreeClassifier(random_state=42),
            'KNN (k=10)': lambda: KNeighborsClassifier(n_neighbors=10),
            'KNN (k=1)': lambda: KNeighborsClassifier(n_neighbors=1),
            'Random forest': lambda: RandomForestClassifier(n_estimators=50, random_state=42)
        }

    results = {}

    print("Starting the bias-variance comparison...")
    for model_name, model_func in models.items():
        print(f"\nAnalyzing model: {model_name}")
        result = bias_variance_decomposition(model_func, X, y, n_trials=50,
                                           problem_type=problem_type)
        results[model_name] = result

    # Visualize the comparison
    plot_bias_variance_comparison(results)

    return results

def plot_bias_variance_comparison(results):
    """Plot the bias-variance decomposition of several models side by side."""

    models = list(results.keys())
    bias_values = [results[model]['bias_squared'] for model in models]
    variance_values = [results[model]['variance'] for model in models]
    noise_values = [results[model]['noise'] for model in models]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

    # 1. Stacked bar chart
    x = np.arange(len(models))
    width = 0.6

    ax1.bar(x, bias_values, width, label='Bias²', alpha=0.8, color='#ff9999')
    ax1.bar(x, variance_values, width, bottom=bias_values, label='Variance', alpha=0.8, color='#66b3ff')
    ax1.bar(x, noise_values, width, bottom=np.array(bias_values) + np.array(variance_values),
            label='Noise', alpha=0.8, color='#99ff99')

    ax1.set_xlabel('Model')
    ax1.set_ylabel('Error')
    ax1.set_title('Bias-variance decomposition by model')
    ax1.set_xticks(x)
    ax1.set_xticklabels(models, rotation=45, ha='right')
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # 2. Scatter plot of bias against variance
    ax2.scatter(bias_values, variance_values, s=100, alpha=0.7, edgecolors='k')
    for i, model in enumerate(models):
        ax2.annotate(model, (bias_values[i], variance_values[i]),
                    xytext=(5, 5), textcoords='offset points', fontsize=9)

    ax2.set_xlabel('Bias²')
    ax2.set_ylabel('Variance')
    ax2.set_title('Bias-variance trade-off')
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

4.4 Worked Examples

from sklearn.datasets import make_classification, make_regression
from sklearn.preprocessing import StandardScaler

# Example 1: bias-variance analysis for a regression problem
def regression_bias_variance_demo():
    """Demonstrate bias-variance analysis on a regression problem."""

    print("=== Bias-variance analysis demo: regression ===\n")

    # Generate regression data
    X, y = make_regression(n_samples=300, n_features=10, noise=0.1,
                          random_state=42)

    # Standardize the features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Compare models of differing complexity
    results = compare_model_complexity_bias_variance(X_scaled, y, 'regression')

    return results

# Example 2: bias-variance analysis for a classification problem
def classification_bias_variance_demo():
    """Demonstrate bias-variance analysis on a classification problem."""

    print("\n=== Bias-variance analysis demo: classification ===\n")

    # Generate classification data
    X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                              n_redundant=2, n_clusters_per_class=1,
                              random_state=42)

    # Standardize the features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Compare models of differing complexity
    results = compare_model_complexity_bias_variance(X_scaled, y, 'classification')

    return results

# Run the demos
if __name__ == "__main__":
    # Regression demo
    regression_results = regression_bias_variance_demo()

    # Classification demo
    classification_results = classification_bias_variance_demo()

    print("\n=== Bias-variance analysis summary ===")
    print("Regression results:", regression_results)
    print("Classification results:", classification_results)

Part 5: Practical Application Examples

5.1 Comprehensive Case Study: Evaluating House-Price Prediction Models

def comprehensive_model_evaluation_demo():
    """End-to-end model evaluation demo."""

    from sklearn.datasets import fetch_california_housing
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.svm import SVR
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import KFold, train_test_split

    print("=== Case study: evaluating California house-price models ===\n")

    # Load the data
    housing = fetch_california_housing()
    X, y = housing.data, housing.target

    print("Dataset information:")
    print(f"Number of samples: {X.shape[0]}")
    print(f"Number of features: {X.shape[1]}")
    print(f"Target range: [{y.min():.2f}, {y.max():.2f}]")

    # Preprocessing. Note: fitting the scaler on the full dataset before
    # cross-validation lets test-fold statistics leak into training; wrapping
    # the scaler and the model in a Pipeline avoids this.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Define the candidate models
    models = {
        'Linear regression': LinearRegression(),
        'Ridge regression': Ridge(alpha=1.0),
        'Random forest': RandomForestRegressor(n_estimators=100, random_state=42),
        'SVR': SVR(kernel='rbf', C=1.0, gamma='scale')
    }

    # 1. Cross-validated comparison
    print("\n" + "="*50)
    print("Step 1: compare models with cross-validation")
    print("="*50)

    cv_results = comprehensive_cross_validation(
        models, X_scaled, y,
        cv_strategies={
            'KFold-5': KFold(n_splits=5, shuffle=True, random_state=42),
            'KFold-10': KFold(n_splits=10, shuffle=True, random_state=42)
        },
        scoring_metrics=['neg_mean_absolute_error', 'neg_mean_squared_error', 'r2']
    )

    # 2. Learning-curve analysis
    print("\n" + "="*50)
    print("Step 2: learning-curve analysis")
    print("="*50)

    best_model = RandomForestRegressor(n_estimators=100, random_state=42)
    comprehensive_learning_curve_analysis(best_model, X_scaled, y)

    # 3. Validation-curve analysis
    print("\n" + "="*50)
    print("Step 3: hyperparameter validation-curve analysis")
    print("="*50)

    param_range = [10, 50, 100, 200, 500]
    comprehensive_validation_curve_analysis(
        RandomForestRegressor(random_state=42), X_scaled, y,
        'n_estimators', param_range
    )

    # 4. Bias-variance decomposition
    print("\n" + "="*50)
    print("Step 4: bias-variance decomposition")
    print("="*50)

    bias_variance_results = compare_model_complexity_bias_variance(
        X_scaled, y, 'regression'
    )

    # 5. Final model evaluation
    print("\n" + "="*50)
    print("Step 5: detailed evaluation of the final model")
    print("="*50)

    # Train the chosen model on a held-out split
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.2, random_state=42
    )

    final_model = RandomForestRegressor(n_estimators=200, random_state=42)
    final_model.fit(X_train, y_train)
    y_pred = final_model.predict(X_test)

    # Detailed regression metrics
    regression_metrics(y_test, y_pred)

    # These conclusions are hard-coded, not computed; confirm them against the output above
    print("\n=== Model evaluation summary (expected findings; verify against the output above) ===")
    print("1. The random forest performs best on every evaluation metric")
    print("2. The model reaches its best performance at 200 trees")
    print("3. The bias-variance analysis shows a well-balanced model")
    print("4. The residual analysis shows the model assumptions are broadly satisfied")

# Run the comprehensive demo
if __name__ == "__main__":
    comprehensive_model_evaluation_demo()

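Summary and Reflections

Key Takeaways

Studying model evaluation and selection in depth left me with the following key insights:

The art of choosing metrics: different business settings call for different evaluation priorities. In medical diagnosis, high recall can matter more than high precision; in spam detection, where false alarms are costly, precision carries more weight and the two have to be balanced.
The power of cross-validation: a single train-test split is often unreliable. Cross-validation gives a more robust performance estimate, especially when data is limited.
The universality of the bias-variance trade-off: the idea applies well beyond machine learning. Chasing a perfect fit tends to bring instability, while stability usually means accepting some systematic error.
The importance of model diagnostics: learning curves and validation curves work like a doctor's “stethoscope”, revealing the model's “health” and pointing to where it can be improved.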
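
The precision-recall tension in the first point can be made concrete. Below is a minimal sketch in plain Python; the labels and scores are made-up numbers for illustration, and `precision_recall` is a hypothetical helper rather than anything defined earlier in these notes. Lowering the decision threshold favors recall (the diagnosis setting), while raising it favors precision (the spam setting).

```python
def precision_recall(y_true, scores, threshold):
    """Return (precision, recall) when predicting positive for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels and model scores, invented for this illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

for threshold in (0.25, 0.5, 0.75):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.25  precision=0.57  recall=1.00
# threshold=0.50  precision=0.75  recall=0.75
# threshold=0.75  precision=1.00  recall=0.50
```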

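Lessons from Practice

1. “Without measurement there is no improvement.” In real projects, I found that performance gains often come from analyzing the evaluation process carefully rather than from blindly trying more complex algorithms.

2. “Overfitting is everywhere.” Working through the bias-variance decomposition made me realize that overfitting is not only a technical problem but also a philosophical one: how do we strike a balance between memorizing and generalizing?

3. “Statistical significance matters.” When two models perform similarly, comparing their mean scores is not enough; a statistical test is needed to decide whether the difference is real.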
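
One common way to run such a test is a paired t-test on per-fold scores. The sketch below uses only the standard library; the fold accuracies are made-up numbers, and `paired_t_statistic` is a hypothetical helper. Because cross-validation folds share training data, the test's independence assumption holds only approximately, so the result is a rough guide rather than a verdict.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for per-fold score differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1

# Hypothetical accuracies of two models on the same 5 CV folds (made-up numbers)
model_a = [0.81, 0.79, 0.84, 0.80, 0.82]
model_b = [0.80, 0.78, 0.80, 0.81, 0.79]

t_stat, dof = paired_t_statistic(model_a, model_b)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

Here the mean gap of 0.016 gives t of about 1.84, below the two-sided 5% critical value of roughly 2.78 for 4 degrees of freedom, so this difference would not count as significant. `scipy.stats.ttest_rel` computes the same statistic and also returns a p-value.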

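Common Mistakes

Data leakage: if feature selection or preprocessing is fitted on the entire dataset before cross-validation, the resulting performance estimate is overly optimistic.
The imbalanced-data trap: on a highly imbalanced dataset, accuracy can reach 99% while the model has learned no useful pattern at all, for example by always predicting the majority class when positives make up 1% of the data.
Over-tuning on the validation set: repeatedly tuning hyperparameters against the validation set effectively makes it part of the training set, so the final estimate needs a separate, untouched test set.
Ignoring business constraints: technical metrics matter, but deploying a model also means respecting engineering constraints such as latency, memory, and interpretability.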
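
The data-leakage mistake is usually avoided by putting preprocessing inside a scikit-learn `Pipeline`, so that each cross-validation fold fits the preprocessing on its own training portion. The following is a minimal sketch on synthetic data; the dataset and the choice of logistic regression are assumptions made only to keep the example runnable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to make the sketch runnable
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# The scaler sits inside the pipeline, so in every fold it is fitted on the
# training portion only and merely applied to the held-out portion.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")

print(f"fold accuracies: {scores.round(3)}")
print(f"mean accuracy  : {scores.mean():.3f}")
```

The same pattern extends to feature selection or imputation: any step that learns from data belongs inside the pipeline rather than in a one-off call on the full dataset.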

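Directions for Further Study

Evaluating online learning: when the data distribution shifts over time, how do we assess a model's ability to adapt?
Evaluating multi-task learning: when one model must handle several tasks at once, how do we balance performance across them?
Fairness evaluation: how do we make sure a model performs fairly across different groups?
Uncertainty quantification: how do we assess how confident a model's predictions are?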

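Closing Thoughts

Model evaluation is not just a technical chore; it is also an art. It asks us to weigh accuracy against stability, to choose between complexity and interpretability, and to compromise between theoretical perfection and practical feasibility.

As the statistician George Box put it: “All models are wrong, but some are useful.” The goal of model evaluation is not to find a perfect model but to find the most useful one under the given constraints.

Studying model evaluation and selection systematically gave me technical skills and, more importantly, a rigorous scientific way of thinking. That way of thinking tells us to:

Back conclusions with evidence
Quantify uncertainty
Consider multiple perspectives
Balance theory and practice

These principles apply not only to machine learning but also to scientific research and everyday decisions. In a world full of uncertainty, mastering sound evaluation methods matters more than mastering any particular algorithm.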