[WIP][POG]2026年两岁马母系稳定性可视化评估
以下是“POG 母系稳定性评价方案(Median+Q3+Gini+家族先验+EB 收缩)”的完整文档。包含目标与原理、数据准备、评分公式、参数与调参建议、全流程代码、性能与维护建议等。
目录
- 方案概览
- 数据准备与前置依赖
- 评分逻辑与数学定义
- 参数与调参方法
- 整体流程
- 完整代码
- 使用方法与输出说明
- 性能与维护
- 校验与常见问题
方案概览
- 核心目标:基于兄弟马(同母)与二代母家族(近亲)的3岁年内奖金表现,评价当年两岁马的“母系稳定性 + 上限潜力”,并对小样本进行保守收缩,同时引入胎次与母龄的生物学修正。
- 方案要点:
- 中心趋势:使用“中位数 + Q3”的加权中心,先对奖金做 平滑。
- 稳定性惩罚:用 Gini 衡量奖金分布的均衡性,惩罚离散的“一枝独秀”家系。
- 先验与收缩:当同母样本数较少时,优先参考二代母家族层的同构指标作为先验,应用经验贝叶斯收缩。
- 生物学修正:胎次与母龄对最终分进行乘性调整(2–6 胎最优,过高胎/高龄下调)。
- 高效实现:PostgreSQL 优化查询 + 物化视图 + 数组拉取(Python 侧算 Gini)。
数据准备与前置依赖
- 数据库:PostgreSQL,数据库名
jvlink,表结构见你的本地环境(sankus、hansyokus、race_umas等)。 - 建议索引与物化视图(强烈建议先建立,以保障性能):
- 功能性索引(同母快速等值连接)
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_sankus_mother_fn ON public.sankus ((hansyoku_num[2]));
- GIN 索引(近亲数组包含 @>)
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_sankus_hnums_gin ON public.sankus USING GIN (hansyoku_num);
- 出生日期(范围查询)
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_sankus_birth_date ON public.sankus (birth_date);
- 繁殖马连接列
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_hansyokus_hnum ON public.hansyokus (hansyoku_num);
- 三岁年内奖金物化视图(一次创建,定期刷新)
- 见“完整代码”中的
ensure_mv_earn_upto3;手动刷新:REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_earn_upto3;
- 见“完整代码”中的
- 功能性索引(同母快速等值连接)
评分逻辑与数学定义
-
输入集合:
- 兄弟层:同母全部子代的3岁年内奖金序列 。
- 家族层:以“二代母”为祖先,家族全部后代的3岁年内奖金序列 (限定最近若干年出生的样本)。
-
中心趋势(分位数组合,对数平滑):
- 兄弟层加权中心:
- 家族层加权中心:
- 标准化:
- ,
- 兄弟层加权中心:
-
Gini 惩罚(分布均衡性,越大越分散):
-
家族优先的经验贝叶斯收缩(抗小样本):
-
胎次与母龄修正(乘性因子):
- 建议的 与 见“参数与调参方法”。
参数与调参方法
-
基础参数(默认值)
- 两岁马出生年:
TWOYO_BIRTH_YEAR = 2024 - 近亲回溯年数:
FAM_YEAR_WINDOW = 15 - 数组位序:
MOTHER_IDX = 2,SECOND_DAM_IDX = 6(若与你数据库位序不同,请修改)
- 两岁马出生年:
-
评分参数(默认值)
- 中位数/Q3 加权:
A_MED = 0.6,B_Q3 = 0.4 - Gini 惩罚:
ALPHA_GINI_SIB = 0.8(兄弟)、BETA_GINI_FAM = 0.3(家族) - 收缩强度:
LAMBDA_SHRINK = 6.0 - 家族样本阈值:
MIN_FAM_N = 20
- 中位数/Q3 加权:
-
胎次与母龄权重(建议)
- 胎次 :
- 第1胎:
- 第2–6胎:
- 第7–8胎:
- 第9–10胎:
- 第11+ 胎:
- 母龄 :
- – 岁:
- – 岁:
- – 岁:
- 岁:
- 岁:
- 胎次 :
-
调参建议(根据偏好)
- 厚度优先:提高 (中位数权重)、提高 (更强惩罚离散)、适度提高 (更保守)。
- 上限优先:提高 (Q3 权重)、降低 、可轻微提高 。
- 小样本保守:提高 或提高 ,让评分更多回归家族/总体先验。
- 零膨胀强:提高 或以 为主的中心;必要时对奖金做温莎化(可在 Python 侧对数组截断极端分位)。
整体流程
- 数据提取(优化 SQL):
- 用日期范围取两岁马;
- 基于“当年两岁马”的母/二代母集合聚合兄弟/家族分位数;
- 单独拉取兄弟与家族的奖金数组用于 Gini 计算;
- 奖金使用
mv_earn_upto3(3岁年内本赏金+付加赏金)。
- 评分计算(Python):
- 计算 并标准化为 ;
- 计算 并施加惩罚,得 ;
- 依据家族层或总体均值做先验,进行 EB 收缩,得 ;
- 用胎次/母龄修正,得最终分 。
- 输出:
- CSV:
scores_weighted.csv - 可视化:散点(加权中心 vs Gini)、排行榜(Top 50)。
- CSV:
完整代码
# optimized_pog_weighted.py
# pip install psycopg2-binary pandas numpy plotly
import os
import numpy as np
import pandas as pd
import psycopg2
from datetime import date
import plotly.express as px
DB_DSN = os.getenv("JRLINK_DSN", "dbname=jvlink user=postgres host=localhost password=xxxx")
# 配置参数
TWOYO_BIRTH_YEAR = 2024 # 例如:2026 年的两岁马 -> 2024 出生
FAM_YEAR_WINDOW = 15 # 家族统计回溯年数
MOTHER_IDX = 2 # 母在数组中的索引(Postgres 1-based)
SECOND_DAM_IDX = 6 # 二代母在数组中的索引
# 评分参数(可按偏好调参)
A_MED, B_Q3 = 0.6, 0.4 # 中位数/Q3 加权
ALPHA_GINI_SIB = 0.8 # 兄弟 Gini 惩罚
BETA_GINI_FAM = 0.3 # 家族 Gini 惩罚
LAMBDA_SHRINK = 6.0 # 经验贝叶斯收缩强度
MIN_FAM_N = 20 # 家族样本阈值(足量)
def gini(arr):
x = np.asarray(arr, dtype=float)
if x.size == 0:
return np.nan
x = x[x >= 0]
if x.size == 0:
return 0.0
mu = x.mean()
if mu == 0:
return 0.0
x_sorted = np.sort(x)
n = x_sorted.size
cumx = np.cumsum(x_sorted)
return float((n + 1 - 2 * (cumx.sum() / cumx[-1])) / n)
def zscore(series):
s = pd.to_numeric(series, errors='coerce')
mu, sd = s.mean(skipna=True), s.std(ddof=0, skipna=True)
if sd == 0 or np.isnan(sd):
return pd.Series(np.zeros(len(s)), index=s.index)
return (s - mu) / sd
def parity_weight(p):
if p is None or p <= 0:
return 1.0
if p == 1:
return 0.90
if 2 <= p <= 6:
return 1.00
if 7 <= p <= 8:
return 0.95
if 9 <= p <= 10:
return 0.90
return 0.85 # 11+
def mother_age_weight(age):
if age is None:
return 1.0
if 5 <= age <= 11:
return 1.00
if 12 <= age <= 14:
return 0.97
if 15 <= age <= 17:
return 0.93
if age >= 18:
return 0.88
return 0.95 # 过年轻
def ensure_mv_earn_upto3(conn):
with conn.cursor() as cur:
try:
cur.execute("SELECT 1 FROM public.mv_earn_upto3 LIMIT 1;")
except psycopg2.errors.UndefinedTable:
conn.rollback()
cur.execute("""
CREATE MATERIALIZED VIEW IF NOT EXISTS public.mv_earn_upto3 AS
SELECT
ketto_num,
SUM(honsyokin + fukasyokin)::bigint AS earn_upto3,
COUNT(*) AS starts_upto3
FROM public.race_umas
WHERE barei <= 3
GROUP BY ketto_num;
""")
cur.execute("CREATE INDEX IF NOT EXISTS idx_mv_earn_upto3_ketto ON public.mv_earn_upto3 (ketto_num);")
conn.commit()
SQL_MAIN = f"""
WITH
twoyo AS (
SELECT
s.ketto_num AS twoyo_ketto_num,
s.birth_date AS twoyo_birth_date,
s.hansyoku_num[{MOTHER_IDX}] AS mother_hnum,
s.hansyoku_num[{SECOND_DAM_IDX}] AS second_dam_hnum
FROM public.sankus s
WHERE s.birth_date >= %s
AND s.birth_date < %s
),
moms AS ( SELECT DISTINCT mother_hnum FROM twoyo ),
sdams AS ( SELECT DISTINCT second_dam_hnum FROM twoyo ),
foals_by_moms AS (
SELECT
s2.hansyoku_num[{MOTHER_IDX}] AS mother_hnum,
s2.ketto_num AS foal_ketto_num,
s2.birth_date AS foal_birth_date
FROM public.sankus s2
JOIN moms m
ON s2.hansyoku_num[{MOTHER_IDX}] = m.mother_hnum
),
foals_with_parity AS (
SELECT
f.*,
ROW_NUMBER() OVER (PARTITION BY f.mother_hnum ORDER BY f.foal_birth_date) AS parity
FROM foals_by_moms f
),
sib_agg_by_mother AS (
SELECT
f.mother_hnum,
COUNT(*) AS n_sib,
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY COALESCE(e.earn_upto3,0))::float8 AS med_earn,
PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY COALESCE(e.earn_upto3,0))::float8 AS q3_earn
FROM foals_with_parity f
LEFT JOIN public.mv_earn_upto3 e
ON e.ketto_num = f.foal_ketto_num
GROUP BY f.mother_hnum
),
cutoff AS (
SELECT make_date(EXTRACT(YEAR FROM %s)::int - %s, 1, 1) AS from_date
),
fam_rows AS (
SELECT
s3.ketto_num AS kin_ketto_num,
s3.birth_date AS kin_birth_date,
s3.hansyoku_num
FROM public.sankus s3
WHERE s3.birth_date >= (SELECT from_date FROM cutoff)
),
fam_agg_by_sdam AS (
SELECT
d.second_dam_hnum,
COUNT(*) AS n_fam,
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY COALESCE(e.earn_upto3,0))::float8 AS med_earn_fam,
PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY COALESCE(e.earn_upto3,0))::float8 AS q3_earn_fam
FROM sdams d
JOIN fam_rows fr
ON fr.hansyoku_num @> ARRAY[d.second_dam_hnum]
LEFT JOIN public.mv_earn_upto3 e
ON e.ketto_num = fr.kin_ketto_num
GROUP BY d.second_dam_hnum
),
twoyo_parity AS (
SELECT
t.twoyo_ketto_num,
(SELECT COUNT(*) FROM foals_by_moms f
WHERE f.mother_hnum = t.mother_hnum
AND f.foal_birth_date <= t.twoyo_birth_date) AS twoyo_parity
FROM twoyo t
),
mother_info AS (
SELECT
h.hansyoku_num AS mother_hnum,
h.bamei AS mother_name,
h.birth_year AS mother_birth_year
FROM public.hansyokus h
WHERE h.sex_cd = '2'
)
SELECT
t.twoyo_ketto_num,
t.twoyo_birth_date,
t.mother_hnum,
t.second_dam_hnum,
p.twoyo_parity,
(EXTRACT(YEAR FROM t.twoyo_birth_date)::int - mi.mother_birth_year)::int AS mother_age_at_foaling,
mi.mother_name,
mi.mother_birth_year,
-- 兄弟(同母)分位
sa.n_sib, sa.med_earn, sa.q3_earn,
-- 家族(二代母)分位
fa.n_fam, fa.med_earn_fam, fa.q3_earn_fam
FROM twoyo t
LEFT JOIN twoyo_parity p ON p.twoyo_ketto_num = t.twoyo_ketto_num
LEFT JOIN mother_info mi ON mi.mother_hnum = t.mother_hnum
LEFT JOIN sib_agg_by_mother sa ON sa.mother_hnum = t.mother_hnum
LEFT JOIN fam_agg_by_sdam fa ON fa.second_dam_hnum = t.second_dam_hnum
ORDER BY t.twoyo_ketto_num;
"""
SQL_SIB_ARRAYS = f"""
WITH
twoyo AS (
SELECT s.hansyoku_num[{MOTHER_IDX}] AS mother_hnum
FROM public.sankus s
WHERE s.birth_date >= %s
AND s.birth_date < %s
),
moms AS (SELECT DISTINCT mother_hnum FROM twoyo),
foals_by_moms AS (
SELECT
s2.hansyoku_num[{MOTHER_IDX}] AS mother_hnum,
s2.ketto_num AS foal_ketto_num,
s2.birth_date AS foal_birth_date
FROM public.sankus s2
JOIN moms m
ON s2.hansyoku_num[{MOTHER_IDX}] = m.mother_hnum
)
SELECT
f.mother_hnum,
ARRAY_AGG(COALESCE(e.earn_upto3,0) ORDER BY f.foal_birth_date) AS sib_earn_list
FROM foals_by_moms f
LEFT JOIN public.mv_earn_upto3 e
ON e.ketto_num = f.foal_ketto_num
GROUP BY f.mother_hnum;
"""
SQL_FAM_ARRAYS = f"""
WITH
twoyo AS (
SELECT s.hansyoku_num[{SECOND_DAM_IDX}] AS second_dam_hnum
FROM public.sankus s
WHERE s.birth_date >= %s
AND s.birth_date < %s
),
sdams AS (SELECT DISTINCT second_dam_hnum FROM twoyo),
cutoff AS (
SELECT make_date(EXTRACT(YEAR FROM %s)::int - %s, 1, 1) AS from_date
),
fam_rows AS (
SELECT s3.ketto_num AS kin_ketto_num, s3.hansyoku_num
FROM public.sankus s3
WHERE s3.birth_date >= (SELECT from_date FROM cutoff)
)
SELECT
d.second_dam_hnum,
ARRAY_AGG(COALESCE(e.earn_upto3,0)) AS fam_earn_list
FROM sdams d
JOIN fam_rows fr
ON fr.hansyoku_num @> ARRAY[d.second_dam_hnum]
LEFT JOIN public.mv_earn_upto3 e
ON e.ketto_num = fr.kin_ketto_num
GROUP BY d.second_dam_hnum;
"""
def main():
y0 = date(TWOYO_BIRTH_YEAR, 1, 1)
y1 = date(TWOYO_BIRTH_YEAR + 1, 1, 1)
with psycopg2.connect(DB_DSN) as conn:
ensure_mv_earn_upto3(conn)
with conn.cursor() as cur:
cur.execute(SQL_MAIN, (y0, y1, y0, FAM_YEAR_WINDOW))
main_rows = cur.fetchall()
main_cols = [d[0] for d in cur.description]
main_df = pd.DataFrame(main_rows, columns=main_cols)
with conn.cursor() as cur:
cur.execute(SQL_SIB_ARRAYS, (y0, y1))
rows = cur.fetchall()
cols = [d[0] for d in cur.description]
sib_arr_df = pd.DataFrame(rows, columns=cols)
with conn.cursor() as cur:
cur.execute(SQL_FAM_ARRAYS, (y0, y1, y0, FAM_YEAR_WINDOW))
rows = cur.fetchall()
cols = [d[0] for d in cur.description]
fam_arr_df = pd.DataFrame(rows, columns=cols)
# 合并数组并计算 Gini
full = (main_df
.merge(sib_arr_df, on='mother_hnum', how='left')
.merge(fam_arr_df, on='second_dam_hnum', how='left'))
full['gini'] = full['sib_earn_list'].apply(lambda x: gini(x) if isinstance(x, (list, tuple, np.ndarray)) else np.nan)
full['gini_fam'] = full['fam_earn_list'].apply(lambda x: gini(x) if isinstance(x, (list, tuple, np.ndarray)) else np.nan)
# 1) 兄弟/家族加权中心(对数平滑)
full['C_sib'] = A_MED * np.log1p(full['med_earn'].fillna(0)) + B_Q3 * np.log1p(full['q3_earn'].fillna(0))
full['C_fam'] = A_MED * np.log1p(full['med_earn_fam'].fillna(0)) + B_Q3 * np.log1p(full['q3_earn_fam'].fillna(0))
# 2) 标准化为 Z 分
full['Z_sib'] = zscore(full['C_sib'])
full['Z_fam'] = zscore(full['C_fam'])
# 3) Gini 惩罚
full['S_sib_base'] = full['Z_sib'] - ALPHA_GINI_SIB * full['gini'].fillna(0.0)
full['S_fam_base'] = full['Z_fam'] - BETA_GINI_FAM * full['gini_fam'].fillna(0.0)
# 4) 家族优先的 EB 收缩
S_global = full['S_sib_base'].mean(skipna=True)
prior = np.where(full['n_fam'].fillna(0) >= MIN_FAM_N, full['S_fam_base'].fillna(S_global), S_global)
w_n = full['n_sib'].fillna(0.0) / (full['n_sib'].fillna(0.0) + LAMBDA_SHRINK)
full['S_EB'] = w_n * full['S_sib_base'] + (1.0 - w_n) * prior
# 5) 胎次/母龄修正
def parity_weight_apply(x):
try:
return parity_weight(int(x) if pd.notnull(x) else None)
except Exception:
return 1.0
def mother_age_weight_apply(x):
try:
return mother_age_weight(int(x) if pd.notnull(x) else None)
except Exception:
return 1.0
full['w_p'] = full['twoyo_parity'].apply(parity_weight_apply)
full['w_a'] = full['mother_age_at_foaling'].apply(mother_age_weight_apply)
full['S_final'] = full['S_EB'] * full['w_p'] * full['w_a']
# 可视化与导出
fig1 = px.scatter(
full,
x='C_sib', y='gini',
size='n_sib', color='S_final',
hover_data=['twoyo_ketto_num','mother_name','twoyo_parity','mother_age_at_foaling',
'med_earn','q3_earn','med_earn_fam','q3_earn_fam','gini','gini_fam','S_EB','S_final'],
labels={'C_sib':'兄弟加权中心 (log1p[Median,Q3])','gini':'兄弟奖金Gini','S_final':'最终得分'},
title='母系稳定性:加权中心 vs Gini(颜色=S_final)'
)
fig1.update_layout(yaxis=dict(autorange='reversed')) # Gini 越小越好
topn = full.sort_values('S_final', ascending=False).head(50)
fig2 = px.bar(
topn,
x='S_final', y='twoyo_ketto_num', orientation='h',
hover_data=['mother_name','twoyo_parity','mother_age_at_foaling',
'med_earn','q3_earn','med_earn_fam','q3_earn_fam','gini','gini_fam','n_sib','n_fam'],
title='两岁马母系稳定度(Median+Q3+Gini+EB)Top 50'
)
fig2.update_layout(yaxis={'categoryorder':'total ascending'})
fig1.show()
fig2.show()
full.sort_values('S_final', ascending=False).to_csv('scores_weighted.csv', index=False)
print("导出评分表:scores_weighted.csv")
if __name__ == "__main__":
main()
使用方法与输出说明
- 依赖:
pip install psycopg2-binary pandas numpy plotly- 设置环境变量
JRLINK_DSN或在代码中修改DB_DSN
- 运行:
python optimized_pog_weighted.py
- 输出:
- 交互图:散点(加权中心 vs Gini),排行榜(Top 50)
- CSV:
scores_weighted.csv(包含每匹两岁马的关键统计与S_final)
- 关键字段说明(部分):
med_earn, q3_earn:同母子代奖金的中位数与上四分位(3岁年内)med_earn_fam, q3_earn_fam:二代母家族层对应分位gini, gini_fam:兄弟/家族奖金 Ginin_sib, n_fam:同母/家族的样本数twoyo_parity, mother_age_at_foaling:胎次、母龄S_EB, S_final:收缩后的基础分与最终得分(已乘胎次/母龄)
性能与维护
- 性能关键点:
- 使用了“当年两岁马”母/二代母集合缩小工作集;
- 同母用功能性 B-Tree 索引,近亲用 GIN
@>; - 三岁年内奖金用物化视图
mv_earn_upto3; - 兄弟/近亲分开聚合,再一对一并表,避免行数膨胀。
- 建议维护:
- 每次有新赛程入库后,执行:
REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_earn_upto3;
- 定期
VACUUM ANALYZE,确保统计信息新鲜。
- 每次有新赛程入库后,执行:
- 进一步加速(可选):
- 构建
ancestor_map物化视图,把“祖先→子代”展开成等值连接,进一步加快近亲查询(如数据量极大)。
- 构建
校验与常见问题
- 校验数组位序:
- 抽样几匹已知血统,确认
MOTHER_IDX、SECOND_DAM_IDX与实际一致。如不一致,修改代码顶部常量并重跑。
- 抽样几匹已知血统,确认
- 小样本评分过高/过低?
- 调高
LAMBDA_SHRINK或MIN_FAM_N,让评分更多回归家族/总体先验。
- 调高
- 太偏“顶尖导向”或“厚度导向”?
- 调整
A_MED/B_Q3与ALPHA_GINI_SIB/BETA_GINI_FAM;或加入 (上金率)作为额外加分项(当前方案已由 Gini 侧面约束分布,若仍需要可拓展)。
- 调整
- 零膨胀严重导致大量中位数为 ?
- 提升 (Q3 权重);或对奖金做温莎化/设极端上限(Python 侧对
sib_earn_list/fam_earn_list截断高分位后再计算分位与 Gini)。
- 提升 (Q3 权重);或对奖金做温莎化/设极端上限(Python 侧对
- 解释最终得分:
- 越大代表“兄弟与家族的厚度+上限更好、分布更均衡、小样本也已保守折中、且胎次/母龄更佳”。