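Data Normalization and Prediction Accuracy
The dataset used is scikit-learn's breast cancer dataset (load_breast_cancer).
Related articles on data normalization and data preprocessing are also available.
No normalization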

    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.datasets import load_breast_cancer

    cancer = load_breast_cancer()
    # Hold out 30% of the samples for validation
    train_X, val_X, train_y, val_y = train_test_split(
        cancer.data, cancer.target, test_size=0.3, random_state=0)

    # Train on the raw, unscaled features
    svm = SVC(C=100)
    svm.fit(train_X, train_y)
    pred = svm.predict(val_X)
    print(accuracy_score(val_y, pred))

Standardization (mean 0, variance 1)
$$
\frac{X - \bar{X}}{\sigma}
$$
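As a quick check of the formula above, the following minimal sketch applies it column by column with NumPy and compares the result with StandardScaler. The toy array `X` and the names `manual` and `scaled` are made up for illustration and do not come from the original article.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): 4 samples, 2 features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

# Apply the formula per feature: subtract the column mean, divide by the column
# standard deviation. NumPy's default std is the population std (ddof=0),
# which is also what StandardScaler uses.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))   # do the two results agree?
print(manual.mean(axis=0))           # per-feature mean after scaling
print(manual.std(axis=0))            # per-feature std after scaling
```

After scaling, each feature should have a mean of approximately 0 and a standard deviation of approximately 1, regardless of its original scale.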

    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.datasets import load_breast_cancer
    from sklearn.preprocessing import StandardScaler

    cancer = load_breast_cancer()
    train_X, val_X, train_y, val_y = train_test_split(
        cancer.data, cancer.target, test_size=0.3, random_state=0)

    # Standardization: fit on the training data only, then apply to both sets
    scaler_std = StandardScaler()
    scaler_std.fit(train_X)
    std_train_X = scaler_std.transform(train_X)
    std_val_X = scaler_std.transform(val_X)

    svm = SVC(C=100)
    svm.fit(std_train_X, train_y)
    # Predict on the standardized validation data (the original referenced an undefined s_val_X)
    pred = svm.predict(std_val_X)
    print(accuracy_score(val_y, pred))

Min-max normalization (minimum 0, maximum 1)
$$
\frac{X - \min(X)}{\max(X) - \min(X)}
$$
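The formula above can be sketched directly in NumPy and compared with MinMaxScaler. The arrays `train` and `val` hold made-up values for illustration, not data from the article. The sketch also shows one consequence of fitting the scaler on the training data only: validation values outside the training range map outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values)
train = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [4.0, 700.0]])
val = np.array([[5.0, 50.0]])  # lies outside the training range on both features

# Per-feature minimum and maximum, taken from the training data only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# Apply the formula to both sets using the training statistics
manual_train = (train - col_min) / (col_max - col_min)
manual_val = (val - col_min) / (col_max - col_min)

scaler = MinMaxScaler().fit(train)
print(np.allclose(manual_train, scaler.transform(train)))
print(np.allclose(manual_val, scaler.transform(val)))
print(manual_val)  # first feature ends up above 1, second below 0
```

By default MinMaxScaler does not clip transformed values, so this behavior is expected rather than a bug.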

    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.datasets import load_breast_cancer
    from sklearn.preprocessing import MinMaxScaler

    cancer = load_breast_cancer()
    train_X, val_X, train_y, val_y = train_test_split(
        cancer.data, cancer.target, test_size=0.3, random_state=0)

    # Min-max scaling: fit on the training data only, then apply to both sets
    scaler_mm = MinMaxScaler()
    scaler_mm.fit(train_X)
    mm_train_X = scaler_mm.transform(train_X)
    mm_val_X = scaler_mm.transform(val_X)

    svm = SVC(C=100)
    svm.fit(mm_train_X, train_y)
    pred = svm.predict(mm_val_X)
    print(accuracy_score(val_y, pred))

Prediction accuracy
Preprocessing | Accuracy |
---|---|
No normalization | 0.632 |
Standardization (mean 0, variance 1) | 0.971 |
Min-max normalization (minimum 0, maximum 1) | 0.982 |
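The three experiments can also be run in one pass. The sketch below is one possible way to do that with a scikit-learn pipeline. The helper `evaluate` and its `gamma` parameter are hypothetical additions, not part of the original code. The exact numbers may differ from the table depending on the scikit-learn version: the default `gamma` of SVC changed from "auto" to "scale" in version 0.22, which mainly affects the unscaled case. Passing `gamma="auto"` mimics the older default, on the assumption that the table was produced with it.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC


def evaluate(scaler=None, gamma="scale"):
    """Validation accuracy of SVC(C=100) with an optional scaler (hypothetical helper)."""
    cancer = load_breast_cancer()
    train_X, val_X, train_y, val_y = train_test_split(
        cancer.data, cancer.target, test_size=0.3, random_state=0
    )
    svm = SVC(C=100, gamma=gamma)
    # The pipeline fits the scaler on the training data only, as in the article's code
    model = svm if scaler is None else make_pipeline(scaler, svm)
    model.fit(train_X, train_y)
    return accuracy_score(val_y, model.predict(val_X))


results = {
    "no normalization": evaluate(None),
    "standardization": evaluate(StandardScaler()),
    "min-max": evaluate(MinMaxScaler()),
}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Wrapping the scaler and the classifier in a pipeline keeps the fit-on-training-data-only rule in one place, which makes it harder to scale the validation set with the wrong statistics by accident.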
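Summary
Normalizing the features improved prediction accuracy substantially: from 0.632 without normalization to 0.971 with standardization and 0.982 with min-max normalization.
On this dataset, with this particular train/validation split (random_state=0), min-max normalization scored slightly higher than standardization.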
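The gap between the two scalers comes from a single train/validation split, so it may be worth checking how stable it is. The following is a minimal sketch using 5-fold cross-validation. The choice of `cv=5` and the name `cv_scores` are assumptions made for illustration, and this check is not part of the original experiment.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()

cv_scores = {}
for name, scaler in [("standardization", StandardScaler()),
                     ("min-max", MinMaxScaler())]:
    # The pipeline refits the scaler inside every fold, so no statistics leak
    # from the held-out fold into training
    model = make_pipeline(scaler, SVC(C=100))
    scores = cross_val_score(model, cancer.data, cancer.target, cv=5)
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the difference between the mean scores is small compared with the fold-to-fold spread, the ranking of the two scalers should be read with caution.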
References