
A Horse Racing AI Built with LightGBM That Considers Both Return Rate and Hit Rate


11. Creating the Third Model

In the previous article we worked out how to make use of past performance data, so now we begin developing the third model.

Motivation for the Model

We want to build a model that takes both pedigree information and past performance into account.
Rather than trying to raise the return rate indirectly by raising the hit rate,
we want to raise the return rate directly.

Purpose of the Model

Build a model that adds each runner's pedigree information, the win rates of the ancestors' progeny,
and the best times (mochi-time) and running styles derived from past performance.

Hypothesis to Verify

Verify whether a prediction model can capture meaningful signal from pedigree
and account for running-style and best-time characteristics
under each set of race conditions.

Features

First-model features
+ horse IDs of the sire, dam, damsire, and dam's damsire
+ progeny win rates of the sire, dam, damsire, and dam's damsire
+ inferred running style + past running styles + best times
+ inferred last-3F time and time to reach the last 3F

Objective Variable

Finishing position plus the win probability implied by the odds


11-1. Preparation

Part of the source code used here is paid content.
If you want to run the same analysis, please obtain it from the article below.

ゼロから作る競馬予想モデル・機械学習入門

In [1]:
import pathlib
import warnings
import sys
sys.path.append(".")
sys.path.append("..")
from src.model_manager.lgbm_manager import LightGBMModelManager  # noqa
from src.core.meta.bet_name_meta import BetName  # noqa
from src.data_manager.preprocess_tools import DataPreProcessor  # noqa
from src.data_manager.data_loader import DataLoader  # noqa

warnings.filterwarnings("ignore")

root_dir = pathlib.Path(".").absolute().parent
dbpath = root_dir / "data" / "keibadata.db"
start_year = 2000  # the oldest year held in the DB
split_year = 2014  # start year of the training period
target_year = 2019  # start year of the test period
end_year = 2023  # end year of the test period (the DB must of course contain data for that year)

# Create the various instances
data_loader = DataLoader(
    start_year,
    end_year,
    dbpath=dbpath  # set dbpath to suit your environment; an absolute path is recommended
)

dataPreP = DataPreProcessor(
    # A cache feature was added this time. Pass True to use it. Default: True
    use_cache=True,
    cache_dir=pathlib.Path("./data")
)

df = data_loader.load_racedata()
dfblood = data_loader.load_horseblood()

df = dataPreP.exec_pipeline(
    df,
    dfblood,
    blood_set=["s", "b", "bs", "bbs", "ss", "sss", "ssss", "bbbs"],
    lagN=5
)
2024-11-01 18:39:13.495 | INFO     | src.data_manager.data_loader:load_racedata:23 - Get Year Range: 2000 -> 2023.
2024-11-01 18:39:13.495 | INFO     | src.data_manager.data_loader:load_racedata:24 - Loading Race Info ...
2024-11-01 18:39:14.201 | INFO     | src.data_manager.data_loader:load_racedata:26 - Loading Race Data ...
2024-11-01 18:39:28.737 | INFO     | src.data_manager.data_loader:load_racedata:28 - Merging Race Info and Race Data ...
2024-11-01 18:39:30.799 | INFO     | src.data_manager.data_loader:load_horseblood:45 - Loading Horse Blood ...
2024-11-01 18:39:55.219 | INFO     | src.data_manager.preprocess_tools:load_cache:760 - Loading Cache. file: data\cache_data.pkl
2024-11-01 18:40:04.687 | INFO     | src.data_manager.preprocess_tools:load_cache:771 - Check Cache version... cache ver: 14
2024-11-01 18:40:04.687 | INFO     | src.data_manager.preprocess_tools:exec_pipeline:170 - OK! Completed Loading Cache File. cache ver: 14

Starting this time, the following four steps have been added to the preprocessing:

  • Addition of best times (mochi-time)
  • Addition of running-style information
  • Addition of pace information
  • Addition of the previous five starts' running-style and best-time data: the lagN argument controls how many previous starts are fetched (a minimal sketch follows below).
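As a rough illustration of the lag features, here is a minimal sketch using pandas groupby/shift. The column names horseId and raceDate are my assumptions; the real implementation lives in the DataPreProcessor from the paid source.

import pandas as pd

def add_lag_features(df: pd.DataFrame, columns, lagN: int = 5) -> pd.DataFrame:
    # Sort by horse and race date so that shift(k) points at the k-th previous start
    df = df.sort_values(["horseId", "raceDate"])
    for col in columns:
        for k in range(1, lagN + 1):
            df[f"{col}_lag{k}"] = df.groupby("horseId")[col].shift(k)
    return df

# e.g.: df = add_lag_features(df, ["cluster0", "cluster1", "toL3F_vel"], lagN=5)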

For more on these preprocessing steps, please see the articles below.

Starting this time, a cache feature for the preprocessing has been implemented.
Once the preprocessing has been run at least once, a cache_data.pkl file is created in the data directory under the notebook that ran it.

Running the preprocessing again loads the cache file and finishes immediately.
If you do not want to use the cache, you can control it with the use_cache argument when creating the instance.

Usage example

dataPreP = DataPreProcessor(
    use_cache=True,
    cache_dir=pathlib.Path("./data")
)

df = data_loader.load_racedata()
dfblood = data_loader.load_horseblood()

df = dataPreP.exec_pipeline(
    df,
    dfblood,
    blood_set=["s", "b", "bs", "bbs", "ss", "sss", "ssss", "bbbs"],
    lagN=5
)

There is also a simple cache-check feature: when preprocessing steps are added or removed, the preprocessing is re-run from scratch.

In that case a WARNING is emitted saying the cache version differs, and the pipeline re-runs automatically.
One caveat: re-execution is triggered only when the number of preprocessing steps changes, so changing the contents of an existing step will not invalidate the cache (sketched below).
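To make the caveat concrete, here is a minimal sketch of a step-count-based version check, under the assumption that the cache version is simply the number of pipeline steps. The names PIPELINE_STEPS and load_cache_or_none are hypothetical and not from the actual source.

import pickle
import pathlib

PIPELINE_STEPS = ["mochi_time", "race_style", "pace", "lag_features"]  # hypothetical step list
CACHE_VERSION = len(PIPELINE_STEPS)  # version = step count, so edits inside a step go unnoticed

def load_cache_or_none(path: pathlib.Path):
    if not path.exists():
        return None
    with path.open("rb") as f:
        payload = pickle.load(f)
    if payload.get("version") != CACHE_VERSION:
        print("WARNING: cache version mismatch; re-running preprocessing")
        return None
    return payload["df"]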


11-2. Checking the Information Added by Preprocessing

11-2-1. Best-Time (mochi-time) Information

In [2]:
display(
    df.columns[
        df.columns.str.contains(r"^.+(?<!_lag[1-9])$", regex=True)
        & df.columns.str.contains(r"^mochiTime", regex=True)
    ]
)
Index(['mochiTime_org', 'mochiTime', 'mochiTime_mean', 'mochiTime_diff',
       'mochiTime3F', 'mochiTime3F_mean', 'mochiTime3F_diff'],
      dtype='object')

mochiTime_org is not needed, but the remaining columns are.

  • mochiTime: best speed over the section up to the last 3F
  • mochiTime3F: best speed over the last 3F
  • mochiTime_mean: per-race mean of the runners' mochiTime
  • mochiTime_diff: difference from the per-race mean of mochiTime
  • mochiTime3F_mean: per-race mean of the runners' mochiTime3F
  • mochiTime3F_diff: difference from the per-race mean of mochiTime3F (see the sketch below)
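The _mean and _diff columns are straightforward per-race aggregates. A minimal sketch, assuming df already has raceId, mochiTime, and mochiTime3F columns:

for col in ["mochiTime", "mochiTime3F"]:
    race_mean = df.groupby("raceId")[col].transform("mean")
    df[f"{col}_mean"] = race_mean            # per-race mean over the runners
    df[f"{col}_diff"] = df[col] - race_mean  # each runner's gap from that mean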

11-2-2. Running-Style Information

In [3]:
display(
    df.columns[
        df.columns.str.contains(r"^.+(?<!_lag[1-9])$", regex=True)
        & df.columns.str.contains(r"^cluster", regex=True)
    ]
)
dist_columns = df.columns[df.columns.str.contains(r"cluster0_\d+$")].tolist()
Index(['cluster0', 'cluster1', 'cluster0_0', 'cluster0_1', 'cluster0_2',
       'cluster0_3', 'cluster0_4'],
      dtype='object')

Using these columns for the current race would be data leakage, so they cannot be used directly;
we use them through the previous-start information instead.

  • cluster0: running-style classification with 4 clusters
  • cluster1: running-style classification with 16 clusters
  • cluster0_n: distribution of the 4-way running-style labels over each runner's previous five starts (a sketch follows below)
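A minimal sketch of how such a distribution could be derived from the lag columns. Treating missing starts as a fifth bucket (label 4) is my assumption, made to explain why five columns (cluster0_0 through cluster0_4) appear for a 4-cluster classification:

lag_cols = [f"cluster0_lag{k}" for k in range(1, 6)]
lags = df[lag_cols].fillna(4)  # assumption: 4 = "no previous start recorded"
for label in range(5):
    # share of the last five starts spent in each style bucket
    df[f"cluster0_{label}"] = lags.eq(label).sum(axis=1) / len(lag_cols)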

11-2-3. Previous-Start Information

In [4]:
display(df.columns[df.columns.str.contains(r"_lag", regex=True)])
Index(['cluster0_lag1', 'cluster0_lag2', 'cluster0_lag3', 'cluster0_lag4',
       'cluster0_lag5', 'cluster1_lag1', 'cluster1_lag2', 'cluster1_lag3',
       'cluster1_lag4', 'cluster1_lag5', 'toL3F_vel_lag1', 'toL3F_vel_lag2',
       'toL3F_vel_lag3', 'toL3F_vel_lag4', 'toL3F_vel_lag5',
       'toL3F_vel_diff_lag1', 'toL3F_vel_diff_lag2', 'toL3F_vel_diff_lag3',
       'toL3F_vel_diff_lag4', 'toL3F_vel_diff_lag5', 'last3F_vel_lag1',
       'last3F_vel_lag2', 'last3F_vel_lag3', 'last3F_vel_lag4',
       'last3F_vel_lag5', 'last3F_vel_diff_lag1', 'last3F_vel_diff_lag2',
       'last3F_vel_diff_lag3', 'last3F_vel_diff_lag4', 'last3F_vel_diff_lag5'],
      dtype='object')

These are the previous five starts' worth of cluster0, cluster1, toL3F_vel, toL3F_vel_diff, last3F_vel, and last3F_vel_diff.

The previous-start columns can be retrieved from the following attribute:

In [5]:
dataPreP.lag_columns
Out[5]:
['cluster0_lag1',
 'cluster0_lag2',
 'cluster0_lag3',
 'cluster0_lag4',
 'cluster0_lag5',
 'cluster1_lag1',
 'cluster1_lag2',
 'cluster1_lag3',
 'cluster1_lag4',
 'cluster1_lag5',
 'toL3F_vel_lag1',
 'toL3F_vel_lag2',
 'toL3F_vel_lag3',
 'toL3F_vel_lag4',
 'toL3F_vel_lag5',
 'toL3F_vel_diff_lag1',
 'toL3F_vel_diff_lag2',
 'toL3F_vel_diff_lag3',
 'toL3F_vel_diff_lag4',
 'toL3F_vel_diff_lag5',
 'last3F_vel_lag1',
 'last3F_vel_lag2',
 'last3F_vel_lag3',
 'last3F_vel_lag4',
 'last3F_vel_lag5',
 'last3F_vel_diff_lag1',
 'last3F_vel_diff_lag2',
 'last3F_vel_diff_lag3',
 'last3F_vel_diff_lag4',
 'last3F_vel_diff_lag5']

11-3. Creating the Model-Building Instance

In [6]:
lgbm_model_manager = LightGBMModelManager(
    # Under the models directory, specify a folder path named after the model you want to create.
    # Making the folder path absolute is safer.
    root_dir / "models" / "third_model",  # the third model's model ID
    split_year,
    target_year,
    end_year
)
2024-11-01 18:40:05.531 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=model_type, val=lightGBM
2024-11-01 18:40:05.531 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=model_id, val=third_model
2024-11-01 18:40:05.531 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=model_dir, val=e:\dev_um_ai\dev-um-ai\models\third_model
2024-11-01 18:40:05.531 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=model_analyze_dir, val=e:\dev_um_ai\dev-um-ai\models\third_model\analyze
2024-11-01 18:40:05.531 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=model_predict_dir, val=e:\dev_um_ai\dev-um-ai\models\third_model\analyze\00_predict
2024-11-01 18:40:05.531 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=confidence_column, val=pred_prob
2024-11-01 18:40:05.546 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=confidence_rank_column, val=pred_rank
2024-11-01 18:40:05.548 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:342 - Load model params and dataset info columns.
2024-11-01 18:40:05.562 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:345 - ==================  model params  ========================
2024-11-01 18:40:05.562 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:348 - boosting_type             =     gbdt
2024-11-01 18:40:05.562 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:348 - objective                 =     kl_divergence_objective
2024-11-01 18:40:05.562 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:348 - feval                     =     kl_divergence_metric
2024-11-01 18:40:05.562 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:348 - verbose                   =     1
2024-11-01 18:40:05.562 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:348 - seed                      =     77777
2024-11-01 18:40:05.562 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:348 - learning_rate             =     0.05
2024-11-01 18:40:05.562 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:348 - n_estimators              =     1000
2024-11-01 18:40:05.562 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:349 - ==========================================================
2024-11-01 18:40:05.562 | INFO     | src.data_manager.dataset_tools:set_feature_and_objective_columns:79 - Set Feature columns. ['distance', 'number', 'boxNum', 'age', 'jweight', 'weight', 'gl', 'race_span', 'raceGrade', 'mochiTime', 'mochiTime3F', 'pred_last3F', 'pred_toL3F', 'pred_cls', 'place_en', 'field_en', 'sex_en', 'condition_en', 'jockeyId_en', 'teacherId_en', 'dist_cat_en', 'horseId_en', 'cluster0_lag1', 'cluster0_lag2', 'cluster0_lag3', 'cluster0_lag4', 'cluster0_lag5', 'cluster1_lag1', 'cluster1_lag2', 'cluster1_lag3', 'cluster1_lag4', 'cluster1_lag5', 'toL3F_vel_lag1', 'toL3F_vel_lag2', 'toL3F_vel_lag3', 'toL3F_vel_lag4', 'toL3F_vel_lag5', 'toL3F_vel_diff_lag1', 'toL3F_vel_diff_lag2', 'toL3F_vel_diff_lag3', 'toL3F_vel_diff_lag4', 'toL3F_vel_diff_lag5', 'last3F_vel_lag1', 'last3F_vel_lag2', 'last3F_vel_lag3', 'last3F_vel_lag4', 'last3F_vel_lag5', 'last3F_vel_diff_lag1', 'last3F_vel_diff_lag2', 'last3F_vel_diff_lag3', 'last3F_vel_diff_lag4', 'last3F_vel_diff_lag5', 'cluster0_0', 'cluster0_1', 'cluster0_2', 'cluster0_3', 'cluster0_4', 'stallionId', 'breedId', 'bStallionId', 'b2StallionId', 'winR_stallion', 'winR_breed', 'winR_bStallion', 'winR_b2Stallion']
2024-11-01 18:40:05.578 | INFO     | src.data_manager.dataset_tools:set_feature_and_objective_columns:81 - Set Objective columns. label_in1
In [7]:
# Columns to use as features (explanatory variables)
feature_columns = [
    'distance',
    'number',
    'boxNum',
    # 'odds',
    # 'favorite',
    'age',
    'jweight',
    'weight',
    'gl',
    'race_span',
    "raceGrade",  # add grade information

    # region best-time information
    "mochiTime",
    "mochiTime3F",
    # endregion

    # region inferred running style and times
    "pred_last3F",
    "pred_toL3F",
    "pred_cls"
    # endregion
] + dataPreP.encoding_columns + \
    dataPreP.lag_columns + \
    dist_columns

# Add pedigree information
feature_columns += ["stallionId", "breedId", "bStallionId", "b2StallionId",]

# Add progeny win-rate information
feature_columns += ['winR_stallion', 'winR_breed',
                    'winR_bStallion', 'winR_b2Stallion']


# Column to use as the objective variable
objective_column = "label_in1"

# Set the features and objective into the model-building instance
lgbm_model_manager.set_feature_and_objective_columns(
    feature_columns, objective_column)

# Create the objective: flag first-place finishes as the positive class
df = lgbm_model_manager.add_objective_column_to_df(df, "label", 1)
2024-11-01 18:40:05.595 | INFO     | src.data_manager.dataset_tools:set_feature_and_objective_columns:79 - Set Feature columns. ['distance', 'number', 'boxNum', 'age', 'jweight', 'weight', 'gl', 'race_span', 'raceGrade', 'mochiTime', 'mochiTime3F', 'pred_last3F', 'pred_toL3F', 'pred_cls', 'place_en', 'field_en', 'sex_en', 'condition_en', 'jockeyId_en', 'teacherId_en', 'dist_cat_en', 'horseId_en', 'cluster0_lag1', 'cluster0_lag2', 'cluster0_lag3', 'cluster0_lag4', 'cluster0_lag5', 'cluster1_lag1', 'cluster1_lag2', 'cluster1_lag3', 'cluster1_lag4', 'cluster1_lag5', 'toL3F_vel_lag1', 'toL3F_vel_lag2', 'toL3F_vel_lag3', 'toL3F_vel_lag4', 'toL3F_vel_lag5', 'toL3F_vel_diff_lag1', 'toL3F_vel_diff_lag2', 'toL3F_vel_diff_lag3', 'toL3F_vel_diff_lag4', 'toL3F_vel_diff_lag5', 'last3F_vel_lag1', 'last3F_vel_lag2', 'last3F_vel_lag3', 'last3F_vel_lag4', 'last3F_vel_lag5', 'last3F_vel_diff_lag1', 'last3F_vel_diff_lag2', 'last3F_vel_diff_lag3', 'last3F_vel_diff_lag4', 'last3F_vel_diff_lag5', 'cluster0_0', 'cluster0_1', 'cluster0_2', 'cluster0_3', 'cluster0_4', 'stallionId', 'breedId', 'bStallionId', 'b2StallionId', 'winR_stallion', 'winR_breed', 'winR_bStallion', 'winR_b2Stallion']
2024-11-01 18:40:05.595 | INFO     | src.data_manager.dataset_tools:set_feature_and_objective_columns:81 - Set Objective columns. label_in1
2024-11-01 18:40:05.595 | INFO     | src.model_manager.lgbm_manager:add_objective_column_to_df:81 - make objective data. label_in1. topN: 1
スポンサーリンク

11-4. Creating the Datasets

In [8]:
dataset_mapping = lgbm_model_manager.make_dataset_mapping(
    df,
    target_category=[["stallionId"], ["breedId"],
                     ["bStallionId"], ["b2StallionId"]],
    target_sub_category=["field", "dist_cat"]
)

# Register the dataset mapping created above
dataset_mapping = lgbm_model_manager.setup_dataset(dataset_mapping)
2024-11-01 18:40:06.318 | INFO     | src.data_manager.dataset_tools:make_dataset_mapping:105 - Generate dataset mapping. Year Range: 2019 -> 2023
Add blood win rate in 2023second (2024/11/01 18:41:59) ...: 100%|██████████| 10/10 [02:05<00:00, 12.50s/it]
Add infer race style in 2023second (2024/11/01 18:43:27) ...: 100%|██████████| 10/10 [01:23<00:00,  8.36s/it]
Add infer pace in 2023second (2024/11/01 18:49:39) ...: 100%|██████████| 10/10 [06:44<00:00, 40.44s/it]
2024-11-01 18:50:22.402 | INFO     | src.model_manager.lgbm_manager:setup_dataset:111 - Create LightGBM Dataset.

11-5. Training with a Custom Loss Function

The reason for using a custom loss function is that plain classification seemed to have hit its limits.
Moreover, the current classification setup predicts each horse in isolation, so the output does not account for the other runners in the race.
Building a custom objective that lets the model learn with the other runners in mind therefore seemed worthwhile.

The custom objective we build optimizes KL divergence.
Concretely, it uses KL divergence to measure and minimize the gap between the odds-implied win-probability distribution of the actual 1st- through 3rd-place finishers and that of the top three predicted picks.

Training therefore only requires taking, race by race, the odds-implied win-probability distribution of the top three predictions, comparing it against that of the actual 1st through 3rd places, and computing the gradient and Hessian of the KL divergence; a minimal sketch of this idea follows.
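The author's actual implementation is in the paid source (src.model_manager.custom_objects.custom_functions, referenced below); the following is only an independent sketch of the core idea. With a per-race softmax p over the raw scores z and a target distribution q, minimizing KL(q‖p) gives dL/dz_i = p_i − q_i, and p_i(1 − p_i) is a common Hessian surrogate. The sketch simplifies the top-3 comparison to a softmax over all runners in each race.

import numpy as np

def kl_divergence_objective_sketch(preds, train_data):
    # Assumes the labels hold the odds-implied win probabilities q and that
    # the Dataset's group array marks race sizes, as set up in In [9] below.
    q = train_data.get_label()
    group_sizes = np.asarray(train_data.get_group(), dtype=int)
    grad = np.zeros_like(preds)
    hess = np.zeros_like(preds)
    start = 0
    for size in group_sizes:
        if size == 0:
            continue
        z = preds[start:start + size]
        p = np.exp(z - z.max())
        p /= p.sum()                  # per-race softmax
        qg = q[start:start + size]
        if qg.sum() > 0:
            qg = qg / qg.sum()        # renormalize targets within the race
        grad[start:start + size] = p - qg         # d KL(q||p) / dz
        hess[start:start + size] = p * (1.0 - p)  # Hessian approximation
        start += size
    return grad, hess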

For an explanation of odds-aware bet optimization via KL divergence, see the article below.

Creating the Custom Objective and Custom Metric

The construction itself is covered in the article introduced above, so please refer to it.
There should be no mistakes in the loss-function and metric calculations, so feel free to reuse them.

This source uses the versions found in src.model_manager.custom_objects.custom_functions.

Fixing Up the Datasets

The format of the objective variable has changed, so we fix the datasets up for use with the custom objective.

In [9]:
import lightgbm as lgbm

for dataset in dataset_mapping.values():
    for mode in ["train", "valid", "test"]:
        dfp = dataset.__dict__[mode]
        # Odds-implied win probability: take-out-adjusted inverse odds (0.8/odds),
        # normalized so the probabilities within each race sum to 1
        dfp["odds_rate"] = 0.8/dfp["odds"]
        dfp["odds_rate"] /= dfp["raceId"].map(
            dfp[["raceId", "odds_rate"]].groupby("raceId")["odds_rate"].sum())
        # New objective: the win flag plus the odds-implied probability
        dfp[lgbm_model_manager.objective_column] = dfp["label"] + dfp["odds_rate"]
        # Group array for LightGBM: the row with the lowest horse number in each
        # race carries the runner count, every other row carries 0,
        # so the array sums to the total row count
        dataset.__dict__[mode]["group"] = dataset.__dict__[mode].groupby("raceId")[
            "number"].rank()
        dataset.__dict__[mode]["group"] = dataset.__dict__[mode]["group"].isin([1]).astype(
            int)*dataset.__dict__[mode]["raceId"].map(dataset.__dict__[mode].groupby("raceId")["raceId"].count())

        # Rebuild the LightGBM Dataset with the new objective and group info
        dataset.__dict__[mode+"_dataset"] = lgbm.Dataset(
            dataset.__dict__[mode][lgbm_model_manager.feature_columns],
            dataset.__dict__[mode][lgbm_model_manager.objective_column],
            group=dataset.__dict__[mode]["group"].values
        )

11-6. Running Model Training

Nothing new has been added here, so we run model training as usual.

In [10]:
# Training parameters
# Specify the custom objective and custom metric as follows
params = {
    'boosting_type': 'gbdt',
    # custom KL-divergence objective (replacing the earlier binary classification)
    'objective': "kl_divergence_objective",
    'feval': "kl_divergence_metric",
    'verbose': 1,
    'seed': 77777,
    'learning_rate': 0.05,
    "n_estimators": 1000
}


lgbm_model_manager.train_all(
    params,
    dataset_mapping,
    verbose=True,
    stopping_rounds=25,  # rounds without improvement before early stopping
    val_num=25,  # interval at which to emit log output
)
for dataset_dict in dataset_mapping.values():
    lgbm_model_manager.load_model(dataset_dict.name)
    lgbm_model_manager.predict(dataset_dict)
2024-11-01 18:50:26.383 | INFO     | src.model_manager.lgbm_manager:save_root_model_info:327 - Save model params and dataset columns
2024-11-01 18:50:26.388 | INFO     | src.model_manager.lgbm_manager:train_all:308 - Training Start!
2024-11-01 18:50:26.388 | INFO     | src.model_manager.lgbm_manager:train_all:309 - ==================  train params  ========================
2024-11-01 18:50:26.388 | INFO     | src.model_manager.lgbm_manager:train_all:312 - boosting_type             =     gbdt
2024-11-01 18:50:26.388 | INFO     | src.model_manager.lgbm_manager:train_all:312 - objective                 =     kl_divergence_objective
2024-11-01 18:50:26.388 | INFO     | src.model_manager.lgbm_manager:train_all:312 - feval                     =     kl_divergence_metric
2024-11-01 18:50:26.388 | INFO     | src.model_manager.lgbm_manager:train_all:312 - verbose                   =     1
2024-11-01 18:50:26.388 | INFO     | src.model_manager.lgbm_manager:train_all:312 - seed                      =     77777
2024-11-01 18:50:26.388 | INFO     | src.model_manager.lgbm_manager:train_all:312 - learning_rate             =     0.05
2024-11-01 18:50:26.388 | INFO     | src.model_manager.lgbm_manager:train_all:312 - n_estimators              =     1000
2024-11-01 18:50:26.388 | INFO     | src.model_manager.lgbm_manager:train_all:313 - ==========================================================
2024-11-01 18:50:26.388 | INFO     | src.model_manager.lgbm_manager:train_all:317 - Start training model. model name: 2019first
2024-11-01 18:50:26.921 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
2024-11-01 18:50:26.921 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
2024-11-01 18:50:27.186 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 18:50:27.249 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.062174 seconds.
You can set `force_col_wise=true` to remove the overhead.
2024-11-01 18:50:27.265 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Total Bins 42046
2024-11-01 18:50:27.265 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Number of data points in the train set: 215804, number of used features: 65
2024-11-01 18:50:27.405 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 18:50:32.188 | INFO     | lightgbm.basic:_log_info:191 - Training until validation scores don't improve for 25 rounds
2024-11-01 18:52:30.187 | INFO     | lightgbm.basic:_log_info:191 - [25]	training's kl_divergence: 15124.8	valid_1's kl_divergence: 1888.7
2024-11-01 18:54:30.804 | INFO     | lightgbm.basic:_log_info:191 - [50]	training's kl_divergence: 14505.1	valid_1's kl_divergence: 1909.94
2024-11-01 18:55:32.754 | INFO     | lightgbm.basic:_log_info:191 - Early stopping, best iteration is:
[38]	training's kl_divergence: 14753.5	valid_1's kl_divergence: 1868.63
2024-11-01 18:55:32.954 | INFO     | src.model_manager.lgbm_manager:save_model:268 - Saving model... model path: e:\dev_um_ai\dev-um-ai\models\third_model\params\2019first\model.params
2024-11-01 18:55:33.572 | INFO     | src.model_manager.lgbm_manager:train_all:317 - Start training model. model name: 2019second
2024-11-01 18:55:34.121 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
2024-11-01 18:55:34.123 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
2024-11-01 18:55:34.388 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 18:55:34.438 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.054685 seconds.
You can set `force_col_wise=true` to remove the overhead.
2024-11-01 18:55:34.454 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Total Bins 43156
2024-11-01 18:55:34.454 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Number of data points in the train set: 238670, number of used features: 65
2024-11-01 18:55:34.588 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 18:55:39.553 | INFO     | lightgbm.basic:_log_info:191 - Training until validation scores don't improve for 25 rounds
2024-11-01 18:57:45.753 | INFO     | lightgbm.basic:_log_info:191 - [25]	training's kl_divergence: 16652.7	valid_1's kl_divergence: 1695.03
2024-11-01 18:59:57.423 | INFO     | lightgbm.basic:_log_info:191 - [50]	training's kl_divergence: 16122.2	valid_1's kl_divergence: 1696.11
2024-11-01 19:02:09.425 | INFO     | lightgbm.basic:_log_info:191 - [75]	training's kl_divergence: 15973.1	valid_1's kl_divergence: 1740.2
2024-11-01 19:02:14.625 | INFO     | lightgbm.basic:_log_info:191 - Early stopping, best iteration is:
[51]	training's kl_divergence: 16089.1	valid_1's kl_divergence: 1678.09
2024-11-01 19:02:14.825 | INFO     | src.model_manager.lgbm_manager:save_model:268 - Saving model... model path: e:\dev_um_ai\dev-um-ai\models\third_model\params\2019second\model.params
2024-11-01 19:02:15.450 | INFO     | src.model_manager.lgbm_manager:train_all:317 - Start training model. model name: 2020first
2024-11-01 19:02:15.974 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
2024-11-01 19:02:15.975 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
2024-11-01 19:02:16.258 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 19:02:16.351 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.078319 seconds.
You can set `force_col_wise=true` to remove the overhead.
2024-11-01 19:02:16.359 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Total Bins 44561
2024-11-01 19:02:16.359 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Number of data points in the train set: 262079, number of used features: 65
2024-11-01 19:02:16.508 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 19:02:21.954 | INFO     | lightgbm.basic:_log_info:191 - Training until validation scores don't improve for 25 rounds
2024-11-01 19:04:38.710 | INFO     | lightgbm.basic:_log_info:191 - [25]	training's kl_divergence: 18203	valid_1's kl_divergence: 1714.08
2024-11-01 19:06:56.011 | INFO     | lightgbm.basic:_log_info:191 - Early stopping, best iteration is:
[24]	training's kl_divergence: 18153.7	valid_1's kl_divergence: 1692.65
2024-11-01 19:06:56.178 | INFO     | src.model_manager.lgbm_manager:save_model:268 - Saving model... model path: e:\dev_um_ai\dev-um-ai\models\third_model\params\2020first\model.params
2024-11-01 19:06:56.762 | INFO     | src.model_manager.lgbm_manager:train_all:317 - Start training model. model name: 2020second
2024-11-01 19:06:57.278 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
2024-11-01 19:06:57.294 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
2024-11-01 19:06:57.611 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 19:06:57.678 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.070501 seconds.
You can set `force_col_wise=true` to remove the overhead.
2024-11-01 19:06:57.694 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Total Bins 45317
2024-11-01 19:06:57.694 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Number of data points in the train set: 284243, number of used features: 65
2024-11-01 19:06:57.843 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 19:07:03.810 | INFO     | lightgbm.basic:_log_info:191 - Training until validation scores don't improve for 25 rounds
2024-11-01 19:09:32.063 | INFO     | lightgbm.basic:_log_info:191 - [25]	training's kl_divergence: 20532.3	valid_1's kl_divergence: 1814.71
2024-11-01 19:12:07.131 | INFO     | lightgbm.basic:_log_info:191 - [50]	training's kl_divergence: 19486.3	valid_1's kl_divergence: 1793.74
2024-11-01 19:14:41.693 | INFO     | lightgbm.basic:_log_info:191 - [75]	training's kl_divergence: 18977.3	valid_1's kl_divergence: 1810.06
2024-11-01 19:14:54.032 | INFO     | lightgbm.basic:_log_info:191 - Early stopping, best iteration is:
[52]	training's kl_divergence: 19436.7	valid_1's kl_divergence: 1762.52
2024-11-01 19:14:54.232 | INFO     | src.model_manager.lgbm_manager:save_model:268 - Saving model... model path: e:\dev_um_ai\dev-um-ai\models\third_model\params\2020second\model.params
2024-11-01 19:14:54.916 | INFO     | src.model_manager.lgbm_manager:train_all:317 - Start training model. model name: 2021first
2024-11-01 19:14:55.515 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
2024-11-01 19:14:55.515 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
2024-11-01 19:14:55.865 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 19:14:55.932 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.069208 seconds.
You can set `force_col_wise=true` to remove the overhead.
2024-11-01 19:14:55.949 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Total Bins 46560
2024-11-01 19:14:55.949 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Number of data points in the train set: 308179, number of used features: 65
2024-11-01 19:14:56.099 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 19:15:02.433 | INFO     | lightgbm.basic:_log_info:191 - Training until validation scores don't improve for 25 rounds
2024-11-01 19:17:43.868 | INFO     | lightgbm.basic:_log_info:191 - [25]	training's kl_divergence: 20683.1	valid_1's kl_divergence: 1572.27
2024-11-01 19:20:30.453 | INFO     | lightgbm.basic:_log_info:191 - [50]	training's kl_divergence: 20312.8	valid_1's kl_divergence: 1600.53
2024-11-01 19:20:30.453 | INFO     | lightgbm.basic:_log_info:191 - Early stopping, best iteration is:
[25]	training's kl_divergence: 20683.1	valid_1's kl_divergence: 1572.27
2024-11-01 19:20:30.602 | INFO     | src.model_manager.lgbm_manager:save_model:268 - Saving model... model path: e:\dev_um_ai\dev-um-ai\models\third_model\params\2021first\model.params
2024-11-01 19:20:31.203 | INFO     | src.model_manager.lgbm_manager:train_all:317 - Start training model. model name: 2021second
2024-11-01 19:20:31.752 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
2024-11-01 19:20:31.768 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
2024-11-01 19:20:32.102 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 19:20:32.185 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.079138 seconds.
You can set `force_col_wise=true` to remove the overhead.
2024-11-01 19:20:32.202 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Total Bins 47152
2024-11-01 19:20:32.202 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Number of data points in the train set: 330561, number of used features: 65
2024-11-01 19:20:32.352 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 19:20:38.952 | INFO     | lightgbm.basic:_log_info:191 - Training until validation scores don't improve for 25 rounds
2024-11-01 19:23:27.604 | INFO     | lightgbm.basic:_log_info:191 - [25]	training's kl_divergence: 23844.8	valid_1's kl_divergence: 1757.95
2024-11-01 19:26:24.606 | INFO     | lightgbm.basic:_log_info:191 - [50]	training's kl_divergence: 22190	valid_1's kl_divergence: 1693.08
2024-11-01 19:28:52.657 | INFO     | lightgbm.basic:_log_info:191 - Early stopping, best iteration is:
[46]	training's kl_divergence: 22245.5	valid_1's kl_divergence: 1678.68
2024-11-01 19:28:52.850 | INFO     | src.model_manager.lgbm_manager:save_model:268 - Saving model... model path: e:\dev_um_ai\dev-um-ai\models\third_model\params\2021second\model.params
2024-11-01 19:28:53.458 | INFO     | src.model_manager.lgbm_manager:train_all:317 - Start training model. model name: 2022first
2024-11-01 19:28:53.989 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
2024-11-01 19:28:53.990 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
2024-11-01 19:28:54.373 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 19:28:54.457 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.086560 seconds.
You can set `force_col_wise=true` to remove the overhead.
2024-11-01 19:28:54.473 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Total Bins 48259
2024-11-01 19:28:54.473 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Number of data points in the train set: 354032, number of used features: 65
2024-11-01 19:28:54.623 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 19:29:01.824 | INFO     | lightgbm.basic:_log_info:191 - Training until validation scores don't improve for 25 rounds
2024-11-01 19:32:02.653 | INFO     | lightgbm.basic:_log_info:191 - [25]	training's kl_divergence: 24924.9	valid_1's kl_divergence: 1674.46
2024-11-01 19:35:12.444 | INFO     | lightgbm.basic:_log_info:191 - [50]	training's kl_divergence: 23865.1	valid_1's kl_divergence: 1641.52
2024-11-01 19:37:51.363 | INFO     | lightgbm.basic:_log_info:191 - Early stopping, best iteration is:
[46]	training's kl_divergence: 23987.3	valid_1's kl_divergence: 1631.5
2024-11-01 19:37:51.545 | INFO     | src.model_manager.lgbm_manager:save_model:268 - Saving model... model path: e:\dev_um_ai\dev-um-ai\models\third_model\params\2022first\model.params
2024-11-01 19:37:52.212 | INFO     | src.model_manager.lgbm_manager:train_all:317 - Start training model. model name: 2022second
2024-11-01 19:37:52.779 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
2024-11-01 19:37:52.779 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
2024-11-01 19:37:53.179 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 19:37:53.279 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.098616 seconds.
You can set `force_col_wise=true` to remove the overhead.
2024-11-01 19:37:53.295 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Total Bins 48746
2024-11-01 19:37:53.295 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Number of data points in the train set: 376558, number of used features: 65
2024-11-01 19:37:53.445 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 19:38:01.046 | INFO     | lightgbm.basic:_log_info:191 - Training until validation scores don't improve for 25 rounds
2024-11-01 19:41:12.464 | INFO     | lightgbm.basic:_log_info:191 - [25]	training's kl_divergence: 25961.2	valid_1's kl_divergence: 1764.43
2024-11-01 19:44:32.866 | INFO     | lightgbm.basic:_log_info:191 - [50]	training's kl_divergence: 24616.6	valid_1's kl_divergence: 1738.07
2024-11-01 19:46:00.501 | INFO     | lightgbm.basic:_log_info:191 - Early stopping, best iteration is:
[36]	training's kl_divergence: 24777.2	valid_1's kl_divergence: 1699.64
2024-11-01 19:46:00.666 | INFO     | src.model_manager.lgbm_manager:save_model:268 - Saving model... model path: e:\dev_um_ai\dev-um-ai\models\third_model\params\2022second\model.params
2024-11-01 19:46:01.284 | INFO     | src.model_manager.lgbm_manager:train_all:317 - Start training model. model name: 2023first
2024-11-01 19:46:01.884 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
2024-11-01 19:46:01.884 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
2024-11-01 19:46:02.283 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 19:46:02.383 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.096326 seconds.
You can set `force_col_wise=true` to remove the overhead.
2024-11-01 19:46:02.400 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Total Bins 49736
2024-11-01 19:46:02.400 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Number of data points in the train set: 399581, number of used features: 65
2024-11-01 19:46:02.533 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 19:46:10.633 | INFO     | lightgbm.basic:_log_info:191 - Training until validation scores don't improve for 25 rounds
2024-11-01 19:49:35.018 | INFO     | lightgbm.basic:_log_info:191 - [25]	training's kl_divergence: 27391.5	valid_1's kl_divergence: 1710.6
2024-11-01 19:53:07.254 | INFO     | lightgbm.basic:_log_info:191 - [50]	training's kl_divergence: 26879.6	valid_1's kl_divergence: 1700.85
2024-11-01 19:55:06.706 | INFO     | lightgbm.basic:_log_info:191 - Early stopping, best iteration is:
[39]	training's kl_divergence: 26878.2	valid_1's kl_divergence: 1671.2
2024-11-01 19:55:06.872 | INFO     | src.model_manager.lgbm_manager:save_model:268 - Saving model... model path: e:\dev_um_ai\dev-um-ai\models\third_model\params\2023first\model.params
2024-11-01 19:55:07.506 | INFO     | src.model_manager.lgbm_manager:train_all:317 - Start training model. model name: 2023second
2024-11-01 19:55:08.055 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
2024-11-01 19:55:08.055 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
2024-11-01 19:55:08.488 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 19:55:08.588 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.095955 seconds.
You can set `force_col_wise=true` to remove the overhead.
2024-11-01 19:55:08.605 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Total Bins 50082
2024-11-01 19:55:08.605 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Number of data points in the train set: 422064, number of used features: 65
2024-11-01 19:55:08.739 | INFO     | lightgbm.basic:_log_native:200 - [LightGBM] [Info] Using self-defined objective function
2024-11-01 19:55:17.139 | INFO     | lightgbm.basic:_log_info:191 - Training until validation scores don't improve for 25 rounds
2024-11-01 19:58:51.225 | INFO     | lightgbm.basic:_log_info:191 - [25]	training's kl_divergence: 30611.2	valid_1's kl_divergence: 1912.56
2024-11-01 20:02:35.410 | INFO     | lightgbm.basic:_log_info:191 - [50]	training's kl_divergence: 27803.3	valid_1's kl_divergence: 1770.63
2024-11-01 20:05:56.178 | INFO     | lightgbm.basic:_log_info:191 - Early stopping, best iteration is:
[47]	training's kl_divergence: 27748.2	valid_1's kl_divergence: 1749.13
2024-11-01 20:05:56.371 | INFO     | src.model_manager.lgbm_manager:save_model:268 - Saving model... model path: e:\dev_um_ai\dev-um-ai\models\third_model\params\2023second\model.params
2024-11-01 20:05:57.529 | INFO     | src.model_manager.lgbm_manager:load_model:247 - Loading model... model name: 2019first
2024-11-01 20:05:57.612 | INFO     | src.model_manager.lgbm_manager:load_model:249 - model activate! model_name: 2019first
2024-11-01 20:05:59.028 | INFO     | src.model_manager.base_manager:set_predict_dataframe:379 - Set the infered DataFrame into the dataset. model_name: 2019first
2024-11-01 20:05:59.043 | INFO     | src.model_manager.base_manager:save_predict_result:406 - Save predict result. save path: e:\dev_um_ai\dev-um-ai\models\third_model\analyze\00_predict\2019first
2024-11-01 20:06:01.379 | INFO     | src.model_manager.lgbm_manager:load_model:247 - Loading model... model name: 2019second
2024-11-01 20:06:01.479 | INFO     | src.model_manager.lgbm_manager:load_model:249 - model activate! model_name: 2019second
2024-11-01 20:06:02.912 | INFO     | src.model_manager.base_manager:set_predict_dataframe:379 - Set the infered DataFrame into the dataset. model_name: 2019second
2024-11-01 20:06:02.928 | INFO     | src.model_manager.base_manager:save_predict_result:406 - Save predict result. save path: e:\dev_um_ai\dev-um-ai\models\third_model\analyze\00_predict\2019second
2024-11-01 20:06:05.529 | INFO     | src.model_manager.lgbm_manager:load_model:247 - Loading model... model name: 2020first
2024-11-01 20:06:05.628 | INFO     | src.model_manager.lgbm_manager:load_model:249 - model activate! model_name: 2020first
2024-11-01 20:06:07.145 | INFO     | src.model_manager.base_manager:set_predict_dataframe:379 - Set the infered DataFrame into the dataset. model_name: 2020first
2024-11-01 20:06:07.162 | INFO     | src.model_manager.base_manager:save_predict_result:406 - Save predict result. save path: e:\dev_um_ai\dev-um-ai\models\third_model\analyze\00_predict\2020first
2024-11-01 20:06:10.366 | INFO     | src.model_manager.lgbm_manager:load_model:247 - Loading model... model name: 2020second
2024-11-01 20:06:10.468 | INFO     | src.model_manager.lgbm_manager:load_model:249 - model activate! model_name: 2020second
2024-11-01 20:06:12.228 | INFO     | src.model_manager.base_manager:set_predict_dataframe:379 - Set the infered DataFrame into the dataset. model_name: 2020second
2024-11-01 20:06:12.245 | INFO     | src.model_manager.base_manager:save_predict_result:406 - Save predict result. save path: e:\dev_um_ai\dev-um-ai\models\third_model\analyze\00_predict\2020second
2024-11-01 20:06:15.361 | INFO     | src.model_manager.lgbm_manager:load_model:247 - Loading model... model name: 2021first
2024-11-01 20:06:15.477 | INFO     | src.model_manager.lgbm_manager:load_model:249 - model activate! model_name: 2021first
2024-11-01 20:06:17.312 | INFO     | src.model_manager.base_manager:set_predict_dataframe:379 - Set the infered DataFrame into the dataset. model_name: 2021first
2024-11-01 20:06:17.328 | INFO     | src.model_manager.base_manager:save_predict_result:406 - Save predict result. save path: e:\dev_um_ai\dev-um-ai\models\third_model\analyze\00_predict\2021first
2024-11-01 20:06:20.346 | INFO     | src.model_manager.lgbm_manager:load_model:247 - Loading model... model name: 2021second
2024-11-01 20:06:20.445 | INFO     | src.model_manager.lgbm_manager:load_model:249 - model activate! model_name: 2021second
2024-11-01 20:06:22.529 | INFO     | src.model_manager.base_manager:set_predict_dataframe:379 - Set the infered DataFrame into the dataset. model_name: 2021second
2024-11-01 20:06:22.545 | INFO     | src.model_manager.base_manager:save_predict_result:406 - Save predict result. save path: e:\dev_um_ai\dev-um-ai\models\third_model\analyze\00_predict\2021second
2024-11-01 20:06:25.865 | INFO     | src.model_manager.lgbm_manager:load_model:247 - Loading model... model name: 2022first
2024-11-01 20:06:25.945 | INFO     | src.model_manager.lgbm_manager:load_model:249 - model activate! model_name: 2022first
2024-11-01 20:06:27.995 | INFO     | src.model_manager.base_manager:set_predict_dataframe:379 - Set the infered DataFrame into the dataset. model_name: 2022first
2024-11-01 20:06:28.011 | INFO     | src.model_manager.base_manager:save_predict_result:406 - Save predict result. save path: e:\dev_um_ai\dev-um-ai\models\third_model\analyze\00_predict\2022first
2024-11-01 20:06:31.195 | INFO     | src.model_manager.lgbm_manager:load_model:247 - Loading model... model name: 2022second
2024-11-01 20:06:31.295 | INFO     | src.model_manager.lgbm_manager:load_model:249 - model activate! model_name: 2022second
2024-11-01 20:06:33.545 | INFO     | src.model_manager.base_manager:set_predict_dataframe:379 - Set the infered DataFrame into the dataset. model_name: 2022second
2024-11-01 20:06:33.576 | INFO     | src.model_manager.base_manager:save_predict_result:406 - Save predict result. save path: e:\dev_um_ai\dev-um-ai\models\third_model\analyze\00_predict\2022second
2024-11-01 20:06:36.979 | INFO     | src.model_manager.lgbm_manager:load_model:247 - Loading model... model name: 2023first
2024-11-01 20:06:37.079 | INFO     | src.model_manager.lgbm_manager:load_model:249 - model activate! model_name: 2023first
2024-11-01 20:06:39.512 | INFO     | src.model_manager.base_manager:set_predict_dataframe:379 - Set the infered DataFrame into the dataset. model_name: 2023first
2024-11-01 20:06:39.528 | INFO     | src.model_manager.base_manager:save_predict_result:406 - Save predict result. save path: e:\dev_um_ai\dev-um-ai\models\third_model\analyze\00_predict\2023first
2024-11-01 20:06:43.674 | INFO     | src.model_manager.lgbm_manager:load_model:247 - Loading model... model name: 2023second
2024-11-01 20:06:43.779 | INFO     | src.model_manager.lgbm_manager:load_model:249 - model activate! model_name: 2023second
2024-11-01 20:06:46.312 | INFO     | src.model_manager.base_manager:set_predict_dataframe:379 - Set the infered DataFrame into the dataset. model_name: 2023second
2024-11-01 20:06:46.328 | INFO     | src.model_manager.base_manager:save_predict_result:406 - Save predict result. save path: e:\dev_um_ai\dev-um-ai\models\third_model\analyze\00_predict\2023second

11-7. Exporting the Model

Before the model can be exported, its performance figures must be computed first.

The following performance figures are computed:

  1. Profit and loss
  2. Basic statistics
  3. The odds graph

Computing Profit and Loss

The profit-and-loss code has not been well maintained and is a bit crude, but running the code below is all that is needed.

The current source can only compute profit and loss for win (tansho) bets.

In [11]:
bet_mode = BetName.tan
bet_column = lgbm_model_manager.get_bet_column(bet_mode=bet_mode)
pl_column = lgbm_model_manager.get_profit_loss_column(bet_mode=bet_mode)
for dataset_dict in dataset_mapping.values():
    lgbm_model_manager.set_bet_column(dataset_dict, bet_mode)

    # region to keep only bets whose predicted confidence exceeds the training data's median
    # q = dataset_dict.pred_train[dataset_dict.pred_train[bet_column].isin([1])]["pred_prob"].quantile(0.50)
    # dataset_dict.pred_valid[bet_column] &= (dataset_dict.pred_valid["pred_prob"] >= q)
    # dataset_dict.pred_test[bet_column] &= (dataset_dict.pred_test["pred_prob"] >= q)
    # endregion

    # region to avoid betting on the 1st through 6th favorites
    # dataset_dict.pred_valid[bet_column] &= ~dataset_dict.pred_valid["favorite"].isin([1, 2, 3, 4, 5, 6])
    # dataset_dict.pred_test[bet_column] &= ~dataset_dict.pred_test["favorite"].isin([1, 2, 3, 4, 5, 6])
    # endregion

    # region to bet only when odds are at least 7 and under 50
    # dataset_dict.pred_valid[bet_column] &= dataset_dict.pred_valid["odds"].ge(7)
    # dataset_dict.pred_test[bet_column] &= dataset_dict.pred_test["odds"].ge(7)
    # dataset_dict.pred_valid[bet_column] &= dataset_dict.pred_valid["odds"].lt(50)
    # dataset_dict.pred_test[bet_column] &= dataset_dict.pred_test["odds"].lt(50)
    # endregion

_, dfbetva, dfbette = lgbm_model_manager.merge_dataframe_data(
    dataset_mapping, mode=True)

dfbetva, dfbette = lgbm_model_manager.generate_profit_loss(
    dfbetva, dfbette, bet_mode)

dfbette[f"{pl_column}_sum"] = dfbette[pl_column].cumsum()
dfbette[["raceDate", "raceId", "label", "favorite",
         bet_column, pl_column, f"{pl_column}_sum"]]
2024-11-01 20:06:49.678 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=bet_columns_map, val={'tan': 'bet_tan'}
2024-11-01 20:06:49.678 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=pl_column_map, val={'tan': 'pl_tan'}
2024-11-01 20:06:50.447 | INFO     | src.model_manager.base_manager:__save_profit_loss:646 - Save profit loss data. save_path: e:\dev_um_ai\dev-um-ai\models\third_model\analyze\tan\profit_loss
2024-11-01 20:06:50.447 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=profit_loss_dir, val={'tan': 'e:\\dev_um_ai\\dev-um-ai\\models\\third_model\\analyze\\tan\\profit_loss'}
Out[11]:
         raceDate        raceId  label  favorite  bet_tan  pl_tan  pl_tan_sum
9      2019-01-05  201906010101      3         3        1  -100.0      -100.0
28     2019-01-05  201906010102      4         2        1  -100.0      -200.0
33     2019-01-05  201906010103      6         6        1  -100.0      -300.0
54     2019-01-05  201906010104      9         8        1  -100.0      -400.0
66     2019-01-05  201906010105      9         3        1  -100.0      -500.0
...           ...           ...    ...       ...      ...     ...         ...
22403  2023-12-28  202309050908      3        10        1  -100.0   -279690.0
22431  2023-12-28  202309050909      1         1        1   250.0   -279440.0
22440  2023-12-28  202309050910      3         3        1  -100.0   -279540.0
22447  2023-12-28  202309050911      3         3        1  -100.0   -279640.0
22466  2023-12-28  202309050912      9        11        1  -100.0   -279740.0

16630 rows × 7 columns

Computing Basic Statistics

The basic statistics aggregate the return rate, the hit rate, and the number of bets per favorite rank (a minimal sketch of the two rates follows after the cell below).

Just run the following code:

In [12]:
lgbm_model_manager.basic_analyze(dataset_mapping)
2024-11-01 20:06:50.497 | INFO     | src.model_manager.base_manager:basic_analyze:220 - Start basic analyze.
2024-11-01 20:06:50.765 | INFO     | src.model_manager.base_manager:basic_analyze:256 - Saving Return And Hit Rate Summary.
2024-11-01 20:06:50.765 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=return_hit_rate_file, val={'tan': 'e:\\dev_um_ai\\dev-um-ai\\models\\third_model\\analyze\\tan\\hit_and_return_rate.csv'}
2024-11-01 20:06:50.765 | INFO     | src.model_manager.base_manager:basic_analyze:259 - Saving Favorite Bet Num Summary.
2024-11-01 20:06:50.782 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=fav_bet_num_dir, val={'tan': 'e:\\dev_um_ai\\dev-um-ai\\models\\third_model\\analyze\\tan\\fav_bet_num'}
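For reference, here is a minimal sketch of the two headline rates, assuming a 100-yen stake per bet and the bet_tan / pl_tan columns seen in the profit-and-loss table above:

def hit_and_return_rate(df):
    bets = df[df["bet_tan"] == 1]
    n = len(bets)
    hit_rate = (bets["pl_tan"] > 0).mean()                       # share of winning bets
    return_rate = (bets["pl_tan"].sum() + 100 * n) / (100 * n)   # total payout / total stake
    return hit_rate, return_rate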

Computing the Odds Graph

In [13]:
dftrain, dfvalid, dftest = lgbm_model_manager.merge_dataframe_data(
    dataset_mapping,
    mode=True
)
summary_dict = lgbm_model_manager.gegnerate_odds_graph(
    dftrain, dfvalid, dftest, bet_mode)
print("'test'データのオッズグラフを確認")
summary_dict["test"].fillna(0)
2024-11-01 20:06:51.730 | INFO     | src.model_manager.base_manager:__save_odds_graph:514 - Save Odds Graph. save_path: e:\dev_um_ai\dev-um-ai\models\third_model\analyze\tan\odds_graph
2024-11-01 20:06:51.730 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=odds_graph_file, val={'tan': 'e:\\dev_um_ai\\dev-um-ai\\models\\third_model\\analyze\\tan\\odds_graph'}
Check the odds graph for the 'test' data
Out[13]:
            win rate (%)  support rate (%)  return>100% (%)    weight  count
odds_round
1.25           67.595819         64.000000        80.000000  0.017258    287
1.75           45.875252         45.714286        57.142857  0.059771    994
2.25           36.116700         35.555556        44.444444  0.059771    994
2.75           32.183908         29.090909        36.363636  0.062778   1044
3.25           25.079365         24.615385        30.769231  0.056825    945
...                  ...               ...              ...       ...    ...
120.00          1.639344          0.666667         0.833333  0.003668     61
130.00          0.000000          0.615385         0.769231  0.002826     47
140.00          0.000000          0.571429         0.714286  0.002405     40
150.00          0.000000          0.533333         0.666667  0.001022     17
200.00          0.000000          0.400000         0.500000  0.020686    344

90 rows × 5 columns
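The odds graph is essentially a win-rate summary bucketed by odds. A minimal sketch in that spirit, assuming an odds column and a label column where 1 means a win; the single round(1) bucketing is purely illustrative, since the table above clearly uses variable-width buckets:

df["odds_round"] = df["odds"].round(1)  # illustrative bucketing only
summary = df.groupby("odds_round").agg(
    win_rate=("label", lambda s: 100 * (s == 1).mean()),      # empirical win rate (%)
    support_rate=("odds", lambda s: 100 * (0.8 / s).mean()),  # odds-implied rate (%)
    count=("odds", "size"),
)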

Exporting the Model

Running the following exports the model's performance data:

In [14]:
lgbm_model_manager.export_model_info()
2024-11-01 20:06:51.763 | INFO     | src.model_manager.base_manager:export_model_info:848 - Export Model info json. export path: e:\dev_um_ai\dev-um-ai\models\third_model\model_info.json

11-8. Checking Performance (Launching the Web App)

Running the code below launches the web app.
Using a command prompt/terminal is recommended; in that case, remove the leading "!" before running.

In [15]:
! python ../app_keiba/manage.py makemigrations
! python ../app_keiba/manage.py migrate 
! echo server launch OK
# ! python ../app_keiba/manage.py runserver 12345
No changes detected
Operations to perform:
  Apply all migrations: admin, auth, contenttypes, model_analyzer, sessions
Running migrations:
  No migrations to apply.
server launch OK

Once "server launch OK" is displayed, click the link below to access the web app:

http://localhost:12345/index.html

11-9. Results

   Model ID                 Support-rate OGS  Return-rate OGS  AonB OGS
1  third_model                       0.51958         -3.51619   4.13367
2  second_model                      1.02706         -4.55201   3.09785
3  first_model (baseline)            0.41924         -7.64492

From these results, the third model beats the first model's return rate by 4.13367 points on a weighted-average basis.

Compared with the second model, that is an improvement of roughly 1 point.


11-10. Checking Feature Importance

11-10-1. Third-Model Feature Importance

In [16]:
import pandas as pd

lgbm_model_manager.load_model("2023second")
dfimp = pd.Series(data=lgbm_model_manager.model.feature_importance(importance_type="gain").tolist(),
                  index=feature_columns).sort_values()/1000
dfimp[dfimp > 0]
2024-11-01 20:07:05.421 | INFO     | src.model_manager.lgbm_manager:load_model:247 - Loading model... model name: 2023second
2024-11-01 20:07:05.491 | INFO     | src.model_manager.lgbm_manager:load_model:249 - model activate! model_name: 2023second
Out[16]:
race_span                 0.208408
cluster0_0                0.222090
raceGrade                 0.484789
mochiTime                 0.491345
cluster1_lag5             0.534295
pred_last3F               0.806533
mochiTime3F               1.092229
cluster1_lag1             1.160729
last3F_vel_diff_lag2      1.572129
cluster0_lag1             1.728195
winR_breed                2.830188
cluster0_4                7.616864
bStallionId               7.795521
last3F_vel_diff_lag1      9.769049
stallionId               24.330961
b2StallionId             24.685921
horseId_en               37.798580
teacherId_en            101.252105
boxNum                  130.698130
pred_cls                150.110893
jockeyId_en             152.990778
breedId                 194.550900
number                  289.937111
dtype: float64

Interestingly, the post-position feature number comes out as the most important, followed by the dam ID breedId and the jockey jockeyId_en.
It turns out the conventional wisdom that an inside draw is an advantage is not entirely wrong.

11-10-2. Comparing the Second and Third Models

Check what the second model's importances looked like and how they changed in the third model.

In [17]:
lgbm_model_manager2 = LightGBMModelManager(
    # Under the models directory, specify a folder path named after the model you want to create.
    # Making the folder path absolute is safer.
    root_dir / "models" / "second_model",  # the second model's model ID
    split_year,
    target_year,
    end_year
)
lgbm_model_manager2.load_model("2023second")
dfimp2 = pd.Series(data=lgbm_model_manager2.model.feature_importance(importance_type="gain").tolist(),
                   index=lgbm_model_manager2.feature_columns).sort_values()/1000
2024-11-01 20:07:05.519 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=model_type, val=lightGBM
2024-11-01 20:07:05.520 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=model_id, val=second_model
2024-11-01 20:07:05.521 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=model_dir, val=e:\dev_um_ai\dev-um-ai\models\second_model
2024-11-01 20:07:05.525 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=model_analyze_dir, val=e:\dev_um_ai\dev-um-ai\models\second_model\analyze
2024-11-01 20:07:05.525 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=model_predict_dir, val=e:\dev_um_ai\dev-um-ai\models\second_model\analyze\00_predict
2024-11-01 20:07:05.525 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=confidence_column, val=pred_prob
2024-11-01 20:07:05.528 | INFO     | src.model_manager.base_manager:set_keyvalue_to_export_mapping:139 - Set Export info. key=confidence_rank_column, val=pred_rank
2024-11-01 20:07:05.529 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:342 - Load model params and dataset info columns.
2024-11-01 20:07:05.530 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:345 - ==================  model params  ========================
2024-11-01 20:07:05.530 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:348 - boosting_type             =     gbdt
2024-11-01 20:07:05.530 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:348 - objective                 =     binary
2024-11-01 20:07:05.530 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:348 - metric                    =     auc
2024-11-01 20:07:05.530 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:348 - verbose                   =     0
2024-11-01 20:07:05.530 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:348 - seed                      =     77777
2024-11-01 20:07:05.530 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:348 - learning_rate             =     0.01
2024-11-01 20:07:05.530 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:348 - n_estimators              =     10000
2024-11-01 20:07:05.530 | INFO     | src.model_manager.lgbm_manager:load_root_mode_info:349 - ==========================================================
2024-11-01 20:07:05.547 | INFO     | src.data_manager.dataset_tools:set_feature_and_objective_columns:79 - Set Feature columns. ['distance', 'number', 'boxNum', 'odds', 'favorite', 'age', 'jweight', 'weight', 'gl', 'race_span', 'raceGrade', 'place_en', 'field_en', 'sex_en', 'condition_en', 'jockeyId_en', 'teacherId_en', 'dist_cat_en', 'horseId_en', 'stallionId', 'breedId', 'bStallionId', 'b2StallionId', 'winR_stallion', 'winR_breed', 'winR_bStallion', 'winR_b2Stallion']
2024-11-01 20:07:05.547 | INFO     | src.data_manager.dataset_tools:set_feature_and_objective_columns:81 - Set Objective columns. label_in1
2024-11-01 20:07:06.146 | INFO     | src.model_manager.lgbm_manager:load_model:247 - Loading model... model name: 2023second
2024-11-01 20:07:06.246 | INFO     | src.model_manager.lgbm_manager:load_model:249 - model activate! model_name: 2023second
In [18]:
pd.concat([dfimp[dfimp > 0], dfimp2[dfimp2 > 0]], axis=1).sort_values(
    0).rename(columns={0: "3rd", 1: "2nd"})
Out[18]:
                             3rd          2nd
race_span               0.208408          NaN
cluster0_0              0.222090          NaN
raceGrade               0.484789    48.136024
mochiTime               0.491345          NaN
cluster1_lag5           0.534295          NaN
pred_last3F             0.806533          NaN
mochiTime3F             1.092229          NaN
cluster1_lag1           1.160729          NaN
last3F_vel_diff_lag2    1.572129          NaN
cluster0_lag1           1.728195          NaN
winR_breed              2.830188   716.580140
cluster0_4              7.616864          NaN
bStallionId             7.795521     2.755637
last3F_vel_diff_lag1    9.769049          NaN
stallionId             24.330961     0.568410
b2StallionId           24.685921     4.066919
horseId_en             37.798580     0.406304
teacherId_en          101.252105    13.413124
boxNum                130.698130          NaN
pred_cls              150.110893          NaN
jockeyId_en           152.990778     0.179873
breedId               194.550900    17.960672
number                289.937111          NaN
age                          NaN     0.246741
odds                         NaN  1414.074588

Since the second model includes odds as a feature, it is no surprise that odds ranks highest there; but winR_breed, the dam's progeny win rate and the next-highest feature in the second model, has dropped to a far lower importance in the third model.
What's more, the dam ID breedId now ranks higher than the win rate, which suggests the model has learned that carrying a particular dam ID matters more than the aggregate properties that ID represents.


11-11. Summary

Unlike the first and second models, which optimize hit rate, this third model optimizes the odds-implied win-probability distribution of the 1st- through 3rd-place finishers.
It therefore does not optimize the bets themselves; instead of relying on rules like "the odds are under 10" or "it's the first favorite", it predicts finishing order while accounting for the odds-implied probability distribution.
In other words, because the optimization also looks at odds, the model can be regarded as considering both hit rate and return rate.

That said, this is admittedly not an entirely orthodox approach, and since the odds used here are the final odds, one could argue that training leaks information.
A model that instead minimizes the distribution gap between the final odds and odds estimated in advance would match reality better, but that is left as future work and not covered this time.
(It is something that must be tackled eventually; please allow me to postpone it for now.)
