PR

【LightGBM】脚質と持ちタイムから上り3F最速の馬を予測する

データサイエンス
この記事は約63分で読めます。
スポンサーリンク

9.過去成績から上り3Fの予測

前回は話では、脚質と持ちタイムの情報から上り3Fに到達するまでの走破タイムの予測可能性を話した。
今回は、残りの上り3Fの走破タイムの予測が可能かどうかを確認する

スポンサーリンク

9-1.上り3Fの予測がしたい理由

前回の話ではレース結果の上り3Fのタイムを除いた走破タイムをLightGBMの回帰モデルで予測できそうだと分かった。

前回の話↓

そのため、上り3Fの予測タイムの算出が出来れば、ラストスパートで足が残っているかどうかの判断が出来るのではという期待から、上り3Fの予測をしたいと考えた。

スポンサーリンク

9-2.データ準備

ソースの一部は有料のものを使ってます。
同じように分析したい方は、以下の記事から入手ください。

ゼロから作る競馬予想モデル・機械学習入門

まずは、前回のLightGBMのモデル作成で使用したデータの作成まで行う

In [2]:
import pathlib
import warnings
import lightgbm as lgbm
import pandas as pd
import tqdm
import datetime
import matplotlib.pyplot as plt
import japanize_matplotlib
import seaborn as sns
import numpy as np

import sys
sys.path.append(".")
sys.path.append("..")
from src.data_manager.preprocess_tools import DataPreProcessor  # noqa
from src.data_manager.data_loader import DataLoader  # noqa

warnings.filterwarnings("ignore")

root_dir = pathlib.Path(".").absolute().parent
dbpath = root_dir / "data" / "keibadata.db"
start_year = 2000  # DBが持つ最古の年を指定
split_year = 2014  # 学習対象期間の開始年を指定
target_year = 2019  # テスト対象期間の開始年を指定
end_year = 2023  # テスト対象期間の終了年を指定 (当然DBに対象年のデータがあること)

# 各種インスタンスの作成
data_loader = DataLoader(
    start_year,
    end_year,
    dbpath=dbpath  # dbpathは各種環境に合わせてパスを指定してください。絶対パス推奨
)

dataPreP = DataPreProcessor()

df = data_loader.load_racedata()
dfblood = data_loader.load_horseblood()

df = dataPreP.exec_pipeline(
    df, dfblood, ["s", "b", "bs", "bbs", "ss", "sss", "ssss", "bbbs"])
2024-10-09 17:43:05.324 | INFO     | src.data_manager.data_loader:load_racedata:23 - Get Year Range: 2000 -> 2023.
2024-10-09 17:43:05.324 | INFO     | src.data_manager.data_loader:load_racedata:24 - Loading Race Info ...
2024-10-09 17:43:06.089 | INFO     | src.data_manager.data_loader:load_racedata:26 - Loading Race Data ...
2024-10-09 17:43:21.863 | INFO     | src.data_manager.data_loader:load_racedata:28 - Merging Race Info and Race Data ...
2024-10-09 17:43:24.125 | INFO     | src.data_manager.data_loader:load_horseblood:45 - Loading Horse Blood ...
2024-10-09 17:43:50.742 | INFO     | src.data_manager.preprocess_tools:__0_check_use_save_checkpoints:100 - Start PreProcess #0 ...
2024-10-09 17:43:50.742 | INFO     | src.data_manager.preprocess_tools:__1_exec_all_sub_prep1:103 - Start PreProcess #1 ...
2024-10-09 17:43:56.792 | INFO     | src.data_manager.preprocess_tools:__2_exec_all_sub_prep2:105 - Start PreProcess #2 ...
2024-10-09 17:44:09.626 | INFO     | src.data_manager.preprocess_tools:__3_convert_type_str_to_number:107 - Start PreProcess #3 ...
2024-10-09 17:44:13.292 | INFO     | src.data_manager.preprocess_tools:__4_drop_or_fillin_none_data:109 - Start PreProcess #4 ...
2024-10-09 17:44:16.759 | INFO     | src.data_manager.preprocess_tools:__5_exec_all_sub_prep5:111 - Start PreProcess #5 ...
2024-10-09 17:44:34.692 | INFO     | src.data_manager.preprocess_tools:__6_convert_label_to_rate_info:113 - Start PreProcess #6 ...
2024-10-09 17:44:44.659 | INFO     | src.data_manager.preprocess_tools:__7_convert_distance_to_smile:115 - Start PreProcess #7 ...
2024-10-09 17:44:44.892 | INFO     | src.data_manager.preprocess_tools:__8_category_encoding:117 - Start PreProcess #8 ...
2024-10-09 17:44:49.519 | INFO     | src.data_manager.preprocess_tools:__9_convert_raceClass_to_grade:119 - Start PreProcess #9 ...
2024-10-09 17:44:56.793 | INFO     | src.data_manager.preprocess_tools:__10_add_bloods_info:123 - Start PreProcess #10 ...

持ちタイムの計算

In [3]:
targetCol = "toL3F_vel"

idf = df.copy()
idf = idf[~idf["horseId"].isin(
    idf[idf["horseId"].str[:4] < "1998"]["horseId"].unique())]
idf["mochiTime_org"] = idf.groupby(['horseId', "field", "distance", "age"])[
    targetCol].shift()
idf["mochiTime"] = idf.groupby(['horseId', "field", "distance", "age"])["mochiTime_org"].rolling(
    1000, min_periods=1).max().reset_index(level=[0, 1, 2, 3], drop=True)
idf["mochiTime_org"] = idf.groupby(['horseId', "field", "distance"])[
    targetCol].shift()
idf["mochiTime"].fillna(idf.groupby(['horseId', "field", "distance"])["mochiTime_org"].rolling(
    1000, min_periods=1).min().reset_index(level=[0, 1, 2], drop=True), inplace=True)
idf["mochiTime_org"] = idf.groupby(['horseId', "field", "dist_cat"])[
    targetCol].shift()
idf["mochiTime"].fillna(idf.groupby(['horseId', "field", "dist_cat"])["mochiTime_org"].rolling(
    1000, min_periods=1).min().reset_index(level=[0, 1, 2], drop=True), inplace=True)
idf["mochiTime_org"] = idf.groupby(['horseId', "field",])[
    targetCol].shift()
idf["mochiTime"].fillna(idf.groupby(['horseId', "field"])["mochiTime_org"].rolling(
    1000, min_periods=1).min().reset_index(level=[0, 1], drop=True), inplace=True)
In [4]:
targetCol = "last3F_vel"

idf = idf[~idf["horseId"].isin(
    idf[idf["horseId"].str[:4] < "1998"]["horseId"].unique())]
idf["mochiTime_org"] = idf.groupby(['horseId', "field", "distance", "age"])[
    targetCol].shift()
idf["mochiTime3F"] = idf.groupby(['horseId', "field", "distance", "age"])["mochiTime_org"].rolling(
    1000, min_periods=1).max().reset_index(level=[0, 1, 2, 3], drop=True)
idf["mochiTime_org"] = idf.groupby(['horseId', "field", "distance"])[
    targetCol].shift()
idf["mochiTime3F"].fillna(idf.groupby(['horseId', "field", "distance"])["mochiTime_org"].rolling(
    1000, min_periods=1).min().reset_index(level=[0, 1, 2], drop=True), inplace=True)
idf["mochiTime_org"] = idf.groupby(['horseId', "field", "dist_cat"])[
    targetCol].shift()
idf["mochiTime3F"].fillna(idf.groupby(['horseId', "field", "dist_cat"])["mochiTime_org"].rolling(
    1000, min_periods=1).min().reset_index(level=[0, 1, 2], drop=True), inplace=True)
idf["mochiTime_org"] = idf.groupby(['horseId', "field",])[
    targetCol].shift()
idf["mochiTime3F"].fillna(idf.groupby(['horseId', "field"])["mochiTime_org"].rolling(
    1000, min_periods=1).min().reset_index(level=[0, 1], drop=True), inplace=True)
In [5]:
targetCol = "last3F_vel"

idf = idf[~idf["horseId"].isin(
    idf[idf["horseId"].str[:4] < "1998"]["horseId"].unique())]
idf["mochiTime_org"] = idf.groupby(['horseId', "field", "distance", "age"])[
    targetCol].shift()
idf["mochiTimeDiff"] = idf.groupby(['horseId', "field", "distance", "age"])["mochiTime_org"].rolling(
    1000, min_periods=1).mean().reset_index(level=[0, 1, 2, 3], drop=True)
idf["mochiTime_org"] = idf.groupby(['horseId', "field", "distance"])[
    targetCol].shift()
idf["mochiTimeDiff"].fillna(idf.groupby(['horseId', "field", "distance"])["mochiTime_org"].rolling(
    1000, min_periods=1).min().reset_index(level=[0, 1, 2], drop=True), inplace=True)
idf["mochiTime_org"] = idf.groupby(['horseId', "field", "dist_cat"])[
    targetCol].shift()
idf["mochiTimeDiff"].fillna(idf.groupby(['horseId', "field", "dist_cat"])["mochiTime_org"].rolling(
    1000, min_periods=1).min().reset_index(level=[0, 1, 2], drop=True), inplace=True)
idf["mochiTime_org"] = idf.groupby(['horseId', "field",])[
    targetCol].shift()
idf["mochiTimeDiff"].fillna(idf.groupby(['horseId', "field"])["mochiTime_org"].rolling(
    1000, min_periods=1).min().reset_index(level=[0, 1], drop=True), inplace=True)

ペース情報の算出

In [6]:
idfpace = idf.drop_duplicates(
    "raceId", ignore_index=True)  # 処理しやすいように重複を削除しておく

idfpace[["raceId", "rapTime"]]
Out[6]:
raceId rapTime
0 200002010105 [12.3, 10.8, 12.0, 12.4, 12.8]
1 200002010106 [12.5, 11.1, 11.4, 11.6, 11.5]
2 200002010205 [12.4, 11.0, 11.4, 11.9, 12.3, 12.1]
3 200002010405 [12.5, 11.6, 11.6, 11.5, 12.2]
4 200002010406 [12.3, 10.3, 11.2, 12.4, 12.6, 12.3]
76137 202309050908 [12.3, 10.9, 12.0, 12.6, 12.3, 12.1, 12.9]
76138 202309050909 [12.7, 10.5, 13.3, 12.4, 12.5, 12.7, 12.6, 13….
76139 202309050910 [12.7, 11.3, 12.7, 12.3, 12.0, 11.6, 11.6, 11….
76140 202309050911 [12.8, 11.5, 12.8, 12.1, 12.5, 13.2, 12.2, 12….
76141 202309050912 [12.3, 11.1, 11.3, 11.2, 11.4, 11.5]

76142 rows × 2 columns

In [7]:
idfpace["rapTime2"] = idfpace[["rapTime", "distance"]].apply(
    lambda row: row["rapTime"] if row["distance"] % 200 == 0
    else [round(row["rapTime"][0]*200/(row["distance"] % 200), 1)] + row["rapTime"][1:], axis=1)
idfpace[idfpace["distance"].isin([1150])][["rapTime", "rapTime2"]]
Out[7]:
rapTime rapTime2
12334 [9.4, 11.0, 11.1, 12.3, 12.5, 13.1] [12.5, 11.0, 11.1, 12.3, 12.5, 13.1]
12337 [9.3, 10.7, 11.1, 12.0, 12.7, 13.0] [12.4, 10.7, 11.1, 12.0, 12.7, 13.0]
12368 [9.5, 10.7, 11.3, 12.2, 12.5, 13.7] [12.7, 10.7, 11.3, 12.2, 12.5, 13.7]
12372 [9.1, 10.8, 11.0, 11.9, 12.7, 13.5] [12.1, 10.8, 11.0, 11.9, 12.7, 13.5]
12405 [9.3, 10.8, 11.4, 12.0, 12.8, 13.2] [12.4, 10.8, 11.4, 12.0, 12.8, 13.2]
75722 [9.6, 10.7, 10.8, 11.8, 12.2, 13.0] [12.8, 10.7, 10.8, 11.8, 12.2, 13.0]
75747 [9.5, 10.8, 11.3, 11.7, 11.9, 12.4] [12.7, 10.8, 11.3, 11.7, 11.9, 12.4]
75754 [9.5, 10.2, 10.9, 11.8, 12.3, 12.7] [12.7, 10.2, 10.9, 11.8, 12.3, 12.7]
75781 [9.8, 10.8, 11.3, 12.2, 12.0, 12.7] [13.1, 10.8, 11.3, 12.2, 12.0, 12.7]
75786 [9.5, 10.5, 11.0, 11.9, 12.4, 13.0] [12.7, 10.5, 11.0, 11.9, 12.4, 13.0]

641 rows × 2 columns

In [8]:
idfpace["prePace"] = idfpace["rapTime2"].apply(
    lambda lst: np.mean(lst[:(len(lst)-3)//2]))
idfpace["pastPace"] = idfpace["rapTime2"].apply(
    lambda lst: np.mean(lst[(len(lst)-3)//2:-3]))
idfpace["prePace3F"] = idfpace["rapTime2"].apply(
    lambda lst: np.mean(lst[:-3]))
idfpace["pastPace3F"] = idfpace["rapTime2"].apply(
    lambda lst: np.mean(lst[-3:]))
idfpace[["raceId", "rapTime2", "prePace", "pastPace"]]
Out[8]:
raceId rapTime2 prePace pastPace
0 200002010105 [12.3, 10.8, 12.0, 12.4, 12.8] 12.300000 10.800000
1 200002010106 [12.5, 11.1, 11.4, 11.6, 11.5] 12.500000 11.100000
2 200002010205 [12.4, 11.0, 11.4, 11.9, 12.3, 12.1] 12.400000 11.200000
3 200002010405 [12.5, 11.6, 11.6, 11.5, 12.2] 12.500000 11.600000
4 200002010406 [12.3, 10.3, 11.2, 12.4, 12.6, 12.3] 12.300000 10.750000
76137 202309050908 [12.3, 10.9, 12.0, 12.6, 12.3, 12.1, 12.9] 11.600000 12.300000
76138 202309050909 [12.7, 10.5, 13.3, 12.4, 12.5, 12.7, 12.6, 13…. 12.166667 12.533333
76139 202309050910 [12.7, 11.3, 12.7, 12.3, 12.0, 11.6, 11.6, 11…. 12.233333 11.875000
76140 202309050911 [12.8, 11.5, 12.8, 12.1, 12.5, 13.2, 12.2, 12…. 12.366667 12.600000
76141 202309050912 [12.3, 11.1, 11.3, 11.2, 11.4, 11.5] 12.300000 11.200000

76142 rows × 4 columns

レース出走馬の持ちタイムの平均値の計算

In [9]:
idfvel = idf.groupby("raceId")[
    ["last3F_vel", "toL3F_vel", "velocity", "mochiTime", "mochiTime3F"]].mean().rename(columns=lambda x: f"{x}_mean").reset_index()
idfvel = pd.merge(idfpace[[
                  "raceId", "prePace", "prePace3F", "pastPace", "pastPace3F",]], idfvel, on="raceId")

idfvel.corr().loc[["mochiTime_mean", "mochiTime3F_mean"]]
Out[9]:
raceId prePace prePace3F pastPace pastPace3F last3F_vel_mean toL3F_vel_mean velocity_mean mochiTime_mean mochiTime3F_mean
mochiTime_mean -0.047494 -0.291776 -0.840335 -0.837238 -0.417676 0.465754 0.839774 0.769325 1.000000 0.493266
mochiTime3F_mean 0.123575 -0.315756 -0.404680 -0.364354 -0.807430 0.838541 0.446989 0.760387 0.493266 1.000000

結果から出走馬の持ちタイムの平均値であるmochiTime_meanに対してpastPace3F, pastPace, toL3F_vel_meanで0.8の相関がある。
つまり、出走馬の持ちタイム(速度指標)が速くなると後半のペース情報であるpastPaceが早くなり、出走馬の上り3Fに至るまでの速度の平均であるtoL3F_vel_meanが速くなることを意味する。

先では出走馬の平均情報とペース情報の関係をみた。
つぎは、各馬の持ちタイムと前走でのvelocity, last3F_vel, toL3F_velとレース結果のvelocity, last3F_vel, toL3F_velとの相関を確認する。

In [10]:
idf["vel_lag1"] = idf.groupby("horseId")["velocity"].shift()
idf["l3Fvel_lag1"] = idf.groupby("horseId")["last3F_vel"].shift()
idf["t3Fvel_lag1"] = idf.groupby("horseId")["toL3F_vel"].shift()
idf[["velocity", "last3F_vel", "toL3F_vel", "vel_lag1",
     "l3Fvel_lag1", "t3Fvel_lag1", "mochiTime", "mochiTime3F"]].corr().loc[
    ["vel_lag1", "l3Fvel_lag1", "t3Fvel_lag1", "mochiTime", "mochiTime3F"]][["velocity", "last3F_vel", "toL3F_vel"]]
Out[10]:
velocity last3F_vel toL3F_vel
vel_lag1 0.615464 0.482257 0.529577
l3Fvel_lag1 0.470675 0.518547 0.252701
t3Fvel_lag1 0.525013 0.268416 0.613931
mochiTime 0.596624 0.323186 0.670378
mochiTime3F 0.615503 0.660305 0.341653
In [11]:
idfp = pd.merge(
    idf[["raceId",]+list(set(idf.columns) - set(idfvel.columns))],
    idfvel[["raceId", "last3F_vel_mean", "toL3F_vel_mean",
            "mochiTime_mean", "mochiTime3F_mean"]],
    on="raceId"
)
idfp
Out[11]:
raceId label_diff last3F horseName rapTime number s3StallionId l3F_vel rapSumTime stallionName f3F boxNum vel_lag1 horseId label_3C label_4C last3F_vel_mean toL3F_vel_mean mochiTime_mean mochiTime3F_mean
0 200002010105 -6.375000 37.9 サニーサマリン [12.3, 10.8, 12.0, 12.4, 12.8] 1 000a001042 2926.829268 [12.3, 23.1, 35.1, 47.5, 60.3] ジョリーズヘイロー 12.3 1 NaN 1998102044 None None 950.671834 973.029890 NaN NaN
1 200002010105 -1.666667 38.0 タシロスプリング [12.3, 10.8, 12.0, 12.4, 12.8] 2 000a000f8c 2926.829268 [12.3, 23.1, 35.1, 47.5, 60.3] マルゼンスキー 12.3 2 NaN 1998105299 None None 950.671834 973.029890 NaN NaN
2 200002010105 -10.166667 37.8 プラントラッキー [12.3, 10.8, 12.0, 12.4, 12.8] 3 000a000e48 2926.829268 [12.3, 23.1, 35.1, 47.5, 60.3] アーミジャー 12.3 3 NaN 1998103340 None None 950.671834 973.029890 NaN NaN
3 200002010105 -3.750000 37.3 マイネルボルテクス [12.3, 10.8, 12.0, 12.4, 12.8] 4 000a000f2b 2926.829268 [12.3, 23.1, 35.1, 47.5, 60.3] サマーサスピション 12.3 4 NaN 1998100829 None None 950.671834 973.029890 NaN NaN
4 200002010105 -10.000000 38.1 グッドマイチョイス [12.3, 10.8, 12.0, 12.4, 12.8] 5 000a000f8c 2926.829268 [12.3, 23.1, 35.1, 47.5, 60.3] ダンシングブレーヴ 12.3 5 NaN 1998100475 None None 950.671834 973.029890 NaN NaN
1054022 202309050912 -13.875000 33.3 スーサンアッシャー [12.3, 11.1, 11.3, 11.2, 11.4, 11.5] 12 000a001676 1055.718475 [12.3, 23.4, 34.7, 45.9, 57.3, 68.8] Siyouni 34.7 6 1026.737968 2019103898 None None 1055.689283 1022.114517 1052.631703 1055.708732
1054023 202309050912 -7.531250 34.1 アネゴハダ [12.3, 11.1, 11.3, 11.2, 11.4, 11.5] 13 000a0012bf 1055.718475 [12.3, 23.4, 34.7, 45.9, 57.3, 68.8] キズナ 34.7 7 1048.689139 2019106102 None None 1055.689283 1022.114517 1052.631703 1055.708732
1054024 202309050912 -14.000000 34.4 テイエムイダテン [12.3, 11.1, 11.3, 11.2, 11.4, 11.5] 14 000a001607 1055.718475 [12.3, 23.4, 34.7, 45.9, 57.3, 68.8] ロードカナロア 34.7 7 1046.511628 2017102603 None None 1055.689283 1022.114517 1052.631703 1055.708732
1054025 202309050912 -10.218750 34.2 ハギノメーテル [12.3, 11.1, 11.3, 11.2, 11.4, 11.5] 15 000a0012bf 1055.718475 [12.3, 23.4, 34.7, 45.9, 57.3, 68.8] サトノアラジン 34.7 8 1033.210332 2019100653 None None 1055.689283 1022.114517 1052.631703 1055.708732
1054026 202309050912 -3.062500 34.9 クムシラコ [12.3, 11.1, 11.3, 11.2, 11.4, 11.5] 16 000a0016d4 1055.718475 [12.3, 23.4, 34.7, 45.9, 57.3, 68.8] ディスクリートキャット 34.7 8 1038.062284 2018103205 None None 1055.689283 1022.114517 1052.631703 1055.708732

1054027 rows × 93 columns

持ちタイムの平均値との差を計算

In [12]:
idfp["mochiTime_diff"] = idfp["mochiTime"] - idfp["mochiTime_mean"]
idfp["mochiTime3F_diff"] = idfp["mochiTime3F"] - idfp["mochiTime3F_mean"]
idfp["last3F_diff"] = idfp["last3F_vel"] - idfp["last3F_vel_mean"]
idfp["toL3F_diff"] = idfp["toL3F_vel"] - idfp["toL3F_vel_mean"]

idfp[["last3F_diff", "toL3F_diff", "mochiTime_diff", "mochiTime3F_diff"]].corr(
).loc[["mochiTime_diff", "mochiTime3F_diff"]][["last3F_diff", "toL3F_diff"]]
Out[12]:
last3F_diff toL3F_diff
mochiTime_diff -0.027427 0.231037
mochiTime3F_diff 0.245777 0.019769

脚質情報の追加

In [13]:
from sklearn.cluster import KMeans  # KMeans法のモジュールをインポート

for col in ["label_1C", "label_lastC"]:
    idfp[f"{col}_rate"] = (idfp[col].astype(
        int)/idfp["horseNum"]).convert_dtypes()

# クラスタ数
n_cls = 4
rate = 2.5
# クラスタリングする特徴量を選定
cluster_columns = ["label_1C_rate", "label_lastC_rate"]
cluster_columns2 = ["label_1C_rate", "label_lastC_rate2"]
kmeans = KMeans(n_clusters=n_cls)  # 脚質が4種類なので、クラス数を4とする
idfp["label_lastC_rate2"] = rate*idfp["label_lastC_rate"]

# 後でクラスタ中心を振り直すが、形式上一旦fitしておかないといけない。
kmeans.fit(idfp[cluster_columns2].iloc[:n_cls*2])

# クラスタ中心をセット
centers = [
    [0.189736, 0.393819],
    [0.432436, 0.995918],
    [0.639462, 1.612348],
    [0.836256, 2.227643]
]
kmeans.cluster_centers_ = np.array(centers)
# 脚質の分類
idfp["cluster"] = kmeans.predict(idfp[cluster_columns2])

# 名前も付けておく
clsnames = ["逃げ", "先行", "差し", "追込"]
cls_map = {i: d for i, d in enumerate(clsnames)}
idfp["clsName"] = idfp["cluster"].map(cls_map)
idfp["clsName"].value_counts().to_frame().T
Out[13]:
clsName 逃げ 先行 差し 追込
count 300844 265800 254865 232518

ペース情報の追加

In [14]:
idf = pd.merge(idfp, idfpace[["raceId", "prePace",
               "pastPace", "prePace3F", "pastPace3F"]], on="raceId")

過去の脚質情報の追加

In [15]:
for lag in range(1, 11):
    idf[f"clsName_lag{lag}"] = idf.sort_values("raceDate").groupby("horseId")[
        "clsName"].shift(lag)

過去の上り3F速度と上り3F到達時速度との差の追加

In [16]:
idf["l3F_diff"] = 200*60/idf["last3F_vel"] - 200*60/idf["toL3F_vel"]
for lag in range(1, 11):
    idf[f"l3Fdiff_lag{lag}"] = idf.sort_values("raceDate").groupby("horseId")[
        "l3F_diff"].shift(lag)

lagcolumns2 = [f"l3Fdiff_lag{lag}" for lag in range(1, 11)]


for lag in range(1, 11):
    idf[f"l3Fvel_lag{lag}"] = idf.sort_values("raceDate").groupby("horseId")[
        "last3F_vel"].shift(lag)

lagcolumns3 = [f"l3Fvel_lag{lag}" for lag in range(1, 11)]


for lag in range(1, 11):
    idf[f"toL3Fvel_lag{lag}"] = idf.sort_values("raceDate").groupby("horseId")[
        "toL3F_vel"].shift(lag)

lagcolumns4 = [f"toL3Fvel_lag{lag}" for lag in range(1, 11)]

過去成績から出走レースごとの脚質分布の計算

In [17]:
dflist = {}
lagcolumns = [f"clsName_lag{lag}" for lag in range(1, 11)]
for g, dfg in tqdm.tqdm(idf[["raceId",] + lagcolumns].groupby("raceId")):
    dflist[g] = (pd.Series(dfg[lagcolumns].values.reshape(-1)
                           ).value_counts() / dfg[lagcolumns].notna().sum().sum()).to_dict()
100%|██████████| 76142/76142 [04:08<00:00, 306.74it/s]
In [18]:
dfcls = pd.DataFrame.from_dict(dflist, orient="index")
idfp = pd.merge(idf, dfcls.reset_index(names="raceId"), on="raceId")
idfp3 = idfp.set_index(["field", "distance"])


idfp3["prePace3F_diff"] = idfp[["field", "distance", "prePace3F"]
                               ].groupby(["field", "distance"])["prePace3F"].mean()


idfp["prePace3F_diff"] = idfp["prePace3F"] - \
    idfp3["prePace3F_diff"].reset_index(drop=True)

以上でデータ準備完了

スポンサーリンク

9-3.持ちタイム×前走の脚質を使って上り3Fタイム情報の推定¶

上り3Fは全力疾走する距離なため、その日の競走馬の体調に左右されやすく計測タイムも小数点1位までしかなくかなり推定が困難だと思われる。
そのため、タイムの推定ではなく順位の推定を行うこととする

スポンサーリンク

9-4.LightGBMのモデル作成案

  • モデル:ランク学習モデル
  • 目的変数: 上り3Fタイムのランク推定
  • 説明変数: 前走情報の脚質割合とmochiTimeの平均値、そのほかレース情報のカテゴリ
  • 学習期間: 2014年~2019年
  • 検証期間: 2020年
  • テスト期間: 2021年
スポンサーリンク

9-5.特徴量作成¶

In [19]:
feature_columns = clsnames + \
    [
        "field", "place", "dist_cat", "distance",
        "condition", "raceGrade", "horseNum", "direction",
        "inoutside", 'mochiTime_mean', 'mochiTime3F_mean',
        'weather', 'mochiTime_mean_div', 'mochiTime3F_mean_div',
        "mochiTime_diff", "mochiTime", "mochiTime3F", "mochiTimeDiff",
        "mochiTime_div", "mochiTime3F_div", "horseId", "breedId",
        "bStallionId", "b2StallionId", "stallionId",
        "mochiTime_rank", "mochiTime3F_rank", "mochiTimeDiff_rank",
        "mochiTime_dev", "mochiTime3F_dev", "mochiTimeDiff_dev"
    ]+lagcolumns+lagcolumns2+lagcolumns3+lagcolumns4
label_column = "last3F_rank"  # 上り3Fの順位
In [20]:
cat_list = [
    "field", "place", "dist_cat", 'weather',
    "condition", "direction", "inoutside", "horseId",
    "breedId", "bStallionId", "b2StallionId", "stallionId"
]+lagcolumns
for cat in cat_list:
    idfp[cat] = idfp[cat].astype("category")
In [21]:
for mochi in ["mochiTime", "mochiTime3F", "mochiTimeDiff"]:
    idfp[f"{mochi}_rank"] = idfp.sort_values(
        "odds").groupby("raceId")[mochi].rank()

for mochi in ["mochiTime", "mochiTime3F", "mochiTimeDiff"]:
    idfp[f"{mochi}_dev"] = (idfp[mochi] - idfp[["raceId", mochi]].groupby(
        "raceId")[mochi].mean())/idfp[["raceId", mochi]].groupby("raceId")[mochi].std()
In [22]:
idfp["mochiTime3F_mean_div"] = 200*60/idfp["mochiTime3F_mean"]
idfp["mochiTime_mean_div"] = 200*60/idfp["mochiTime_mean"]
idfp["mochiTime_div"] = 200*60/idfp["mochiTime"]
idfp["mochiTime3F_div"] = 200*60/idfp["mochiTime3F"]

# toL3F_velが走破速度(分速)になっているので、200m単位のタイムに変換
idfp["L3F_diff"] = idfp["last3F_vel"] - idfp["toL3F_vel"]
idfp["last3F_rank"] = idfp.sort_values("time").groupby("raceId")[
    "last3F"].rank(method="first")
dffl = idfp[["raceId", "raceDate"]+feature_columns +
            ["prePace3F", "toL3F_vel_mean", "L3F_diff", "last3F_vel", "last3F", "last3F_rank"]]
dffl
Out[22]:
raceId raceDate 逃げ 先行 差し 追込 field place dist_cat distance toL3Fvel_lag7 toL3Fvel_lag8 toL3Fvel_lag9 toL3Fvel_lag10 prePace3F toL3F_vel_mean L3F_diff last3F_vel last3F last3F_rank
0 200002010405 2000-06-18 0.500000 NaN 0.500000 NaN 函館 S 1000 NaN NaN NaN NaN 12.050000 961.058363 -36.238136 947.368421 38.0 9.0
1 200002010405 2000-06-18 0.500000 NaN 0.500000 NaN 函館 S 1000 NaN NaN NaN NaN 12.050000 961.058363 3.764352 952.380952 37.8 8.0
2 200002010405 2000-06-18 0.500000 NaN 0.500000 NaN 函館 S 1000 NaN NaN NaN NaN 12.050000 961.058363 29.767442 960.000000 37.5 6.0
3 200002010405 2000-06-18 0.500000 NaN 0.500000 NaN 函館 S 1000 NaN NaN NaN NaN 12.050000 961.058363 30.991736 1022.727273 35.2 1.0
4 200002010405 2000-06-18 0.500000 NaN 0.500000 NaN 函館 S 1000 NaN NaN NaN NaN 12.050000 961.058363 -3.855422 960.000000 37.5 5.0
971261 202309050912 2023-12-28 0.415094 0.207547 0.226415 0.150943 阪神 S 1200 971.659919 1006.711409 1003.344482 1016.949153 11.566667 1022.114517 69.845126 1081.081081 33.3 1.0
971262 202309050912 2023-12-28 0.415094 0.207547 0.226415 0.150943 阪神 S 1200 1012.658228 1087.613293 1121.495327 1027.837259 11.566667 1022.114517 30.077449 1055.718475 34.1 8.0
971263 202309050912 2023-12-28 0.415094 0.207547 0.226415 0.150943 阪神 S 1200 1043.478261 1046.511628 1074.626866 1058.823529 11.566667 1022.114517 46.511628 1046.511628 34.4 13.0
971264 202309050912 2023-12-28 0.415094 0.207547 0.226415 0.150943 阪神 S 1200 966.442953 1043.478261 936.280884 979.020979 11.566667 1022.114517 32.801551 1052.631579 34.2 9.0
971265 202309050912 2023-12-28 0.415094 0.207547 0.226415 0.150943 阪神 S 1200 1077.844311 1043.478261 1019.830028 1000.000000 11.566667 1022.114517 0.000000 1031.518625 34.9 16.0

971266 rows × 83 columns

スポンサーリンク

9-6.モデルの学習¶

In [23]:
dftrain, dfvalid, dftest = dffl[dffl["raceId"].str[:4] <= "2019"], dffl[dffl["raceId"].str[:4].isin(
    ["2020"])], dffl[dffl["raceId"].str[:4].isin(["2021"])]

params = {
    'metric': 'rmse',
    "categorical_feature": cat_list,
    'boosting_type': 'gbdt',
    'seed': 777,
}
# train_data = lgbm.Dataset(
#     dftrain[feature_columns], label=dftrain[label_column])
# valid_data = lgbm.Dataset(
#     dfvalid[feature_columns], label=dfvalid[label_column])
# test_data = lgbm.Dataset(dftest[feature_columns], label=dftest[label_column])


params = {
    'objective': 'lambdarank',
    'metric': 'ndcg',
    "categorical_feature": cat_list,
    'ndcg_eval_at': [1,],
    'boosting_type': 'gbdt',
    "label_gain": ",".join([str(n**3) for n in range(1, 19)]),
    'seed': 777,
}
train_data = lgbm.Dataset(
    dftrain[feature_columns], label=dftrain["horseNum"]-dftrain[label_column],
    group=dftrain.groupby("raceId")["raceId"].count().values)
valid_data = lgbm.Dataset(
    dfvalid[feature_columns], label=dfvalid["horseNum"]-dfvalid[label_column],
    group=dfvalid.groupby("raceId")["raceId"].count().values)
test_data = lgbm.Dataset(dftest[feature_columns], label=dftest["horseNum"]-dftest[label_column],
                         group=dftest.groupby("raceId")["raceId"].count().values)


# モデル学習
model = lgbm.train(params, train_data, num_boost_round=1000, valid_sets=[
                   train_data, valid_data], callbacks=[
    lgbm.early_stopping(
        stopping_rounds=50, verbose=True,),
    lgbm.log_evaluation(50 if True else 0)
],)
[LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
[LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
[LightGBM] [Warning] categorical_feature is set=field,place,dist_cat,weather,condition,direction,inoutside,horseId,breedId,bStallionId,b2StallionId,stallionId,clsName_lag1,clsName_lag2,clsName_lag3,clsName_lag4,clsName_lag5,clsName_lag6,clsName_lag7,clsName_lag8,clsName_lag9,clsName_lag10, categorical_column=4,5,6,8,9,11,12,15,24,25,26,27,28,35,36,37,38,39,40,41,42,43,44 will be ignored. Current value: categorical_feature=field,place,dist_cat,weather,condition,direction,inoutside,horseId,breedId,bStallionId,b2StallionId,stalli
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.226433 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 59422
[LightGBM] [Info] Number of data points in the train set: 803372, number of used features: 72
Training until validation scores don't improve for 50 rounds
[50]	training's ndcg@1: 0.661963	valid_1's ndcg@1: 0.517419
[100]	training's ndcg@1: 0.725348	valid_1's ndcg@1: 0.512167
Early stopping, best iteration is:
[57]	training's ndcg@1: 0.674074	valid_1's ndcg@1: 0.518644
スポンサーリンク

9-7.推論結果の追加¶

In [24]:
dftrain["pred"] = model.predict(
    dftrain[feature_columns], num_iteration=model.best_iteration)

dfvalid["pred"] = model.predict(
    dfvalid[feature_columns], num_iteration=model.best_iteration)

dftest["pred"] = model.predict(
    dftest[feature_columns], num_iteration=model.best_iteration)
スポンサーリンク

9-8.特徴量重要度の確認¶

In [39]:
dfimp = pd.DataFrame(model.feature_importance(
    "gain"), index=feature_columns, columns=["重要度"]).round(3).sort_values("重要度")
dfimp[dfimp["重要度"] > 0]
Out[39]:
重要度
l3Fvel_lag7 80.720
mochiTime3F_mean 92.854
逃げ 97.635
toL3Fvel_lag8 99.607
mochiTime_mean 100.510
mochiTime_mean_div 102.394
toL3Fvel_lag7 109.249
mochiTime3F 154.219
mochiTimeDiff 160.370
toL3Fvel_lag2 201.592
l3Fvel_lag3 225.900
toL3Fvel_lag6 266.075
l3Fdiff_lag7 355.419
clsName_lag7 364.884
toL3Fvel_lag4 383.677
toL3Fvel_lag5 449.677
field 514.158
toL3Fvel_lag1 518.538
toL3Fvel_lag3 690.929
mochiTime_rank 791.436
clsName_lag6 794.701
clsName_lag5 1102.182
l3Fdiff_lag6 1159.637
l3Fvel_lag1 1538.998
l3Fdiff_lag5 1774.065
clsName_lag2 2069.811
clsName_lag3 2162.036
distance 2307.310
clsName_lag4 2772.045
raceGrade 2883.395
bStallionId 3668.477
l3Fdiff_lag4 4124.210
clsName_lag1 4580.906
l3Fdiff_lag3 5021.967
b2StallionId 5420.097
l3Fdiff_lag2 9546.150
stallionId 17212.072
l3Fdiff_lag1 17454.751
horseNum 34273.463
mochiTimeDiff_rank 41606.698
horseId 47169.059
mochiTime3F_rank 51415.473
breedId 61055.445

結果からレースごとの出走馬の持ちタイムの平均情報が最も重要であり、次にレース距離, 競馬場, 馬場, 母の血統, レースグレードと続いている。

In [26]:
dft = pd.merge(
    idfp,
    dftest[["raceId", "horseId", "pred"]],
    on=["raceId", "horseId"]
)
dfv = pd.merge(
    idfp,
    dfvalid[["raceId", "horseId", "pred"]],
    on=["raceId", "horseId"]
)

ランダムにレース情報を確認してみる。
実際に予測結果のペース情報が出走馬の出走タイムに沿っているか確認する。

今回の予測値のランク(pred_rank)と正解のランク(ans_rank)を見比べる

In [27]:
raceId = np.random.choice(dft[dft["raceGrade"].isin(
    [8])]["raceId"].unique())
raceId = "202105010811"
for dfvt in [dft, dfv]:
    dfvt["pred_rank"] = dfvt.sort_values("time").groupby("raceId")[
        "pred"].rank(method="first", ascending=False)
    dfvt["ans_rank"] = dfvt.sort_values("time").groupby(
        "raceId")[label_column].rank(method="first", ascending=True)


dft[
    dft["raceId"].isin([raceId])
][[
    "pred_rank", "ans_rank", "pred", "label", "raceId", "raceName", "horseId",
    "favorite", "last3F", "dist_cat", "distance", "field", "place"
]].sort_values("pred_rank", ascending=True)
Out[27]:
pred_rank ans_rank pred label raceId raceName horseId favorite last3F dist_cat distance field place
6225 1.0 1.0 0.927583 2 202105010811 第38回フェブラリーS(G1) 2013105399 9 35.2 M 1600 東京
6222 2.0 6.0 0.863496 3 202105010811 第38回フェブラリーS(G1) 2013102815 8 35.7 M 1600 東京
6224 3.0 8.0 0.739405 11 202105010811 第38回フェブラリーS(G1) 2014106474 4 35.9 M 1600 東京
6228 4.0 7.0 0.559368 8 202105010811 第38回フェブラリーS(G1) 2015100902 5 35.9 M 1600 東京
6218 5.0 4.0 0.460582 1 202105010811 第38回フェブラリーS(G1) 2017110151 1 35.6 M 1600 東京
6226 6.0 12.0 0.346935 12 202105010811 第38回フェブラリーS(G1) 2014103547 16 36.7 M 1600 東京
6216 7.0 10.0 0.271687 5 202105010811 第38回フェブラリーS(G1) 2015110086 10 36.6 M 1600 東京
6230 8.0 5.0 0.245898 7 202105010811 第38回フェブラリーS(G1) 2016100981 13 35.6 M 1600 東京
6227 9.0 9.0 0.018113 10 202105010811 第38回フェブラリーS(G1) 2014104002 14 36.4 M 1600 東京
6231 10.0 2.0 -0.120186 4 202105010811 第38回フェブラリーS(G1) 2016105188 3 35.5 M 1600 東京
6221 11.0 11.0 -0.173775 9 202105010811 第38回フェブラリーS(G1) 2015100318 2 36.6 M 1600 東京
6229 12.0 13.0 -0.319610 13 202105010811 第38回フェブラリーS(G1) 2016101455 6 37.5 M 1600 東京
6220 13.0 14.0 -0.382169 15 202105010811 第38回フェブラリーS(G1) 2014100377 15 37.5 M 1600 東京
6219 14.0 16.0 -0.727043 16 202105010811 第38回フェブラリーS(G1) 2016102370 12 38.7 M 1600 東京
6217 15.0 3.0 -0.780546 6 202105010811 第38回フェブラリーS(G1) 2014104052 7 35.5 M 1600 東京
6223 16.0 15.0 -0.858769 14 202105010811 第38回フェブラリーS(G1) 2016106260 11 38.2 M 1600 東京

実際にpred_rank1.0のもの(horseId=2013105399)に対して、正解ランクが1.0と予測されているのが分かるが、その次に6位や8位のものが来ていたりとかなりムラがある。
正直最後の全力疾走のタイム順位の予測なので、さすがにこの手の精度が落ちるのは致し方ない。
良く予測できている方だと考える

実際の正解ランクと今回の予測結果のランク分けに関連があるか確認する

In [28]:
dft[["ans_rank", "pred_rank"]].corr()
Out[28]:
ans_rank pred_rank
ans_rank 1.000000 0.397797
pred_rank 0.397797 1.000000

かなり関係がありそうである

スポンサーリンク

9-9.色々と分布の確認¶

In [29]:
# region plot
ans_col = "ans_rank"
dft["pred_q90"] = dft["raceId"].map(dft[["raceId", "pred_rank"]].groupby(
    "raceId")["pred_rank"].quantile(0.9).to_dict())
dft["pred_q90"] = dft["pred_rank"] >= dft["pred_q90"]
dft[dft["pred_q90"]][ans_col].value_counts().sort_index()

dft["pred_q10"] = dft["raceId"].map(dft[["raceId", "pred_rank"]].groupby(
    "raceId")["pred_rank"].quantile(0.1).to_dict())
dft["pred_q10"] = dft["pred_rank"] <= dft["pred_q10"]
dft[dft["pred_q10"]][ans_col].value_counts().sort_index()
idft = pd.concat(
    [
        dft[dft["pred_q10"]][ans_col].value_counts(normalize=True).sort_index(
        ).to_frame().rename(columns={"proportion": "上位10%"}),
        dft[dft["pred_q90"]][ans_col].value_counts(normalize=True).sort_index(
        ).to_frame().rename(columns={"proportion": "下位10%"})
    ],
    axis=1
)

idft.sort_index(ascending=False).cumsum().sort_index(ascending=True).rename(
    columns={"上位10%": "上位10%_累積", "下位10%": "下位10%_累積"}, index=lambda x: x-1).plot(ax=idft.plot.bar(alpha=0.5).twinx())


plt.title("pred_rankと上り3F順位の関係")
plt.legend(loc="upper right", bbox_to_anchor=(
    1., 0.85),)
plt.show()
# endregion

予測ランクと正解ランクの分布の関係を確認

In [30]:
idftlist = []
for g, idfgtg in dfv.groupby("pred_rank"):
    idftlist += [idfgtg["ans_rank"].value_counts(
        normalize=True).to_frame(name=g).T]
idft: pd.DataFrame = pd.concat(idftlist)[list(range(1, 19))]
idft
Out[30]:
ans_rank 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0 15.0 16.0 17.0 18.0
1.0 0.198745 0.147243 0.116210 0.098052 0.084846 0.073622 0.059426 0.059756 0.041598 0.031033 0.027732 0.018488 0.014196 0.012545 0.009904 0.005282 0.000990 0.000330
2.0 0.124794 0.118191 0.111588 0.099373 0.086497 0.085507 0.068339 0.067019 0.059095 0.049191 0.036316 0.035325 0.026741 0.017828 0.008584 0.004292 0.000330 0.000990
3.0 0.109937 0.098712 0.105976 0.099373 0.100693 0.074612 0.080224 0.071971 0.061406 0.047871 0.038627 0.033014 0.029052 0.022780 0.012545 0.009904 0.001651 0.001651
4.0 0.102014 0.096732 0.089468 0.099043 0.085507 0.092440 0.073622 0.075933 0.061737 0.060086 0.044239 0.035325 0.028722 0.023770 0.019478 0.009244 0.001981 0.000660
5.0 0.078904 0.085837 0.086167 0.100693 0.086827 0.081875 0.085837 0.070981 0.074942 0.063057 0.042588 0.039287 0.034665 0.024761 0.024100 0.015847 0.002311 0.001321
6.0 0.076821 0.080795 0.086093 0.080132 0.092384 0.071854 0.092715 0.085762 0.062583 0.061921 0.053311 0.051656 0.036755 0.029801 0.020199 0.012914 0.002649 0.001656
7.0 0.059429 0.067397 0.075697 0.075365 0.071713 0.090969 0.089973 0.080677 0.073705 0.073373 0.066069 0.051793 0.048473 0.035857 0.020584 0.014940 0.001660 0.002324
8.0 0.055872 0.055537 0.069923 0.067581 0.076280 0.081967 0.082971 0.083640 0.094346 0.070258 0.066243 0.056875 0.047173 0.037471 0.032787 0.016393 0.002676 0.002007
9.0 0.053682 0.061597 0.056435 0.065038 0.074329 0.076394 0.076394 0.077082 0.093255 0.073985 0.071920 0.072953 0.049209 0.036820 0.025120 0.027529 0.005162 0.003097
10.0 0.038531 0.056176 0.056176 0.059777 0.063018 0.077422 0.075621 0.080663 0.071660 0.094707 0.084984 0.065898 0.057616 0.046093 0.039971 0.021606 0.006482 0.003601
11.0 0.036725 0.050115 0.054323 0.055853 0.058148 0.059679 0.070773 0.063504 0.084162 0.087605 0.096021 0.073068 0.061209 0.054323 0.052793 0.034047 0.004208 0.003443
12.0 0.032231 0.036835 0.049812 0.045626 0.050649 0.055672 0.059858 0.069067 0.084554 0.085391 0.085810 0.088740 0.087484 0.066136 0.054835 0.034743 0.008790 0.003767
13.0 0.029921 0.032726 0.037868 0.032258 0.045816 0.055166 0.049556 0.068724 0.064049 0.075269 0.086022 0.108929 0.098177 0.092567 0.073399 0.040206 0.003740 0.005610
14.0 0.023073 0.028317 0.029890 0.046146 0.045097 0.045621 0.058207 0.065548 0.057682 0.069743 0.082853 0.078133 0.087572 0.111694 0.083377 0.067121 0.011012 0.008915
15.0 0.014154 0.031385 0.026462 0.024000 0.040000 0.041231 0.040615 0.048000 0.060923 0.064000 0.083077 0.070154 0.089231 0.110154 0.136000 0.094154 0.017846 0.008615
16.0 0.013180 0.019769 0.018122 0.028007 0.028007 0.031301 0.036244 0.037891 0.040362 0.059308 0.065898 0.086491 0.090610 0.102965 0.125206 0.179572 0.021417 0.015651
17.0 0.019608 0.027451 0.015686 0.043137 0.027451 0.035294 0.035294 0.031373 0.050980 0.039216 0.058824 0.019608 0.050980 0.113725 0.082353 0.094118 0.145098 0.109804
18.0 0.010417 NaN 0.005208 0.015625 0.031250 0.015625 0.020833 0.036458 0.041667 0.036458 0.046875 0.072917 0.104167 0.052083 0.078125 0.130208 0.135417 0.166667

テストデータに対して予測値のランクが1位のものを選択した場合の正解ランク(ans_rank)の累積分布を出してみる
一応レースグレード別に出してみる

In [31]:
pd.concat([dft[dft["pred_rank"].isin([1]) & dft["raceGrade"].isin([i])]["ans_rank"].value_counts(
    normalize=True).sort_index().cumsum().to_frame(name=f"Grade: {i}").T for i in range(9)])
Out[31]:
ans_rank 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0 15.0 16.0 17.0
Grade: 0 0.188745 0.318615 0.425974 0.513420 0.608658 0.684848 0.758442 0.805195 0.854545 0.890043 0.918615 0.948052 0.965368 0.975758 0.989610 0.999134 1.0
Grade: 1 0.201081 0.348108 0.463784 0.552432 0.637838 0.717838 0.772973 0.836757 0.875676 0.915676 0.936216 0.953514 0.974054 0.983784 0.994595 0.997838 1.0
Grade: 2 0.222222 0.371069 0.477987 0.566038 0.662474 0.735849 0.809224 0.863732 0.907757 0.932914 0.958071 0.974843 0.983229 0.991614 1.000000 NaN NaN
Grade: 3 0.210280 0.392523 0.518692 0.626168 0.682243 0.738318 0.799065 0.836449 0.887850 0.906542 0.915888 0.934579 0.953271 0.976636 1.000000 NaN NaN
Grade: 4 0.212121 0.454545 0.621212 0.696970 0.772727 0.848485 NaN 0.878788 0.924242 0.939394 0.969697 NaN 1.000000 NaN NaN NaN NaN
Grade: 5 0.246154 0.369231 0.461538 0.553846 0.676923 0.723077 0.769231 0.800000 0.861538 0.876923 0.907692 0.938462 0.969231 0.984615 NaN 1.000000 NaN
Grade: 6 0.179104 0.328358 0.402985 0.507463 0.641791 0.701493 0.776119 0.865672 0.895522 NaN 0.940299 NaN 0.970149 NaN 0.985075 1.000000 NaN
Grade: 7 0.081081 0.216216 0.405405 0.486486 0.567568 NaN 0.648649 0.675676 0.783784 0.864865 NaN 0.891892 0.972973 NaN NaN NaN 1.0
Grade: 8 0.208333 0.291667 NaN 0.458333 0.500000 0.625000 0.750000 0.791667 0.875000 0.916667 0.958333 NaN NaN 1.000000 NaN NaN NaN

全体的にみて、予測タイムが最も早いデータ(pred_rank=1)を選択するとそのうちの2割前後で上り最速を選択出来ている。
累積分布で考えてみると予測タイム最速のものを選択した場合、上位8位までに入る割合が全体の80%で、上位4位に入るのは全体の50%である。
もう少し精度を上げて全体の80%が上位3位に入るように改良をしたいが、さすがにまだ難しいかもしれない。

コメント

タイトルとURLをコピーしました