Iter-4360dd15-0042-crossref-family-collapse-10-families
knowledge fact 4360dd15 evidence-retrieval crossref
The 14 trial-ish Crossref hits for PMID 38310895 collapse into 10 families
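This round did not keep chasing the full text. Instead, the 14 known trial-ish hits were collapsed at the family level:
- Of the 14 hits, 1 is non-human (fruit fly) and is excluded
- The remaining 13 are human clinical candidates
- These 13 can be further merged into 10 families; 3 of the entries are follow-up / extension / postextension reports and must not be counted as independent trials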
Reproducible collapse rules
import re
from collections import defaultdict

import requests

# DOI of the review whose Crossref reference list is scanned.
doi = '10.1016/S2666-7568(23)00258-1'
resp = requests.get(
    f'https://api.crossref.org/works/{doi}',
    timeout=30,
    headers={'User-Agent': 'Mozilla/5.0'},
)
resp.raise_for_status()
msg = resp.json()['message']
refs = msg.get('reference', [])  # the 'reference' field is absent when no references were deposited

# A reference is "trial-ish" if it names an mTOR inhibitor AND a trial-design term.
markers = ['rapamycin', 'sirolimus', 'everolimus', 'temsirolimus']
trial_markers = ['trial', 'randomized', 'randomised', 'placebo', 'phase', 'proof-of-concept',
                 'feasibility', 'futility', 'extension', 'crossover']
cands = []
for r in refs:
    txt = ' | '.join(
        str(r.get(k, '')) for k in ['article-title', 'journal-title', 'author', 'DOI', 'unstructured']
    ).lower()
    if any(m in txt for m in markers) and any(t in txt for t in trial_markers):
        cands.append(r)

# Ordered rules: the first matching pattern wins, so follow-up/extension
# is checked before any disease family.
family_rules = [
    (r'extension of a randomized controlled trial|postextension|follow-up|follow up', 'follow-up/extension'),
    (r'exist-3', 'EXIST-3'),
    (r'rheumatoid arthritis', 'rheumatoid arthritis'),
    # The bare 'skin' alternative already matches any title containing "skin".
    (r'topical rapamycin.*human skin|skin', 'topical skin'),
    (r'older human cohort|healthy adults|older adults', 'healthy/older adults'),
    (r'multiple system atrophy|m\.s\.a|msa', 'MSA'),
    (r'diabetic macular edema|geographic atrophy|intravitreal|subconjunctival|ophthalm', 'ophthalmology'),
    (r'pulmonary hypertension', 'pulmonary hypertension'),
    (r'cardiac repolarization', 'cardiac safety'),
    (r'tuberous sclerosis', 'TSC'),
    (r'breast cancer|exemestane|4ever', 'breast cancer'),
]

# Assign each candidate to the first family whose pattern matches its title.
families = defaultdict(list)
for r in cands:
    title = (r.get('article-title') or r.get('unstructured') or '').lower()
    fam = 'other'
    for pat, name in family_rules:
        if re.search(pat, title):
            fam = name
            break
    families[fam].append(title)

# Number of candidates, number of families, then the size of each family.
print(len(cands), len(families))
for fam, titles in families.items():
    print(fam, len(titles))
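The script above does not itself drop the non-human hit (fruit fly) mentioned earlier. One way to make that exclusion reproducible is a species filter applied to the titles before collapsing. The following is a minimal sketch, not the procedure actually used in this round: `NON_HUMAN_PATTERN`, `split_by_species`, and the example titles are hypothetical, and only the fruit-fly case is attested in this note.

```python
import re

# Hypothetical marker list. Only the fruit-fly case comes from the note;
# the remaining terms are illustrative assumptions.
NON_HUMAN_PATTERN = re.compile(
    r'drosophila|fruit fl(?:y|ies)|\bmice\b|\bmouse\b|\brats?\b|\bmurine\b|zebrafish|c\. elegans'
)


def split_by_species(titles):
    """Split titles into (human, non_human) lists using NON_HUMAN_PATTERN."""
    human, non_human = [], []
    for title in titles:
        if NON_HUMAN_PATTERN.search(title.lower()):
            non_human.append(title)
        else:
            human.append(title)
    return human, non_human


# Hypothetical titles, for illustration only (not real Crossref records).
example_titles = [
    'Rapamycin extends lifespan in Drosophila: a feasibility study',
    'Topical rapamycin in human skin: a randomized placebo-controlled trial',
    'Everolimus in older adults: a phase 2 trial',
]

human, non_human = split_by_species(example_titles)
print(len(human), len(non_human))
```

A keyword filter like this can miss animal studies whose titles never name the species, so a manual check of the excluded and retained lists is still advisable.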
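Structural conclusions from this round
- Crossref recall based only on title keywords is already enough to collapse the set that "looks like 13 clinical candidates" into 10 families.
- This means that when cross-checking against the review's main text / supplementary tables later, the priority order should be:
# First, confirm which families are counted as independently included studies
# Next, decide whether follow-up / extension reports are counted together with their core trial
# Finally, judge whether disease trials such as oncology, ophthalmology, MSA, and RA fall inside the definitional boundary of "aging-related physiological changes"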
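The second step, deciding whether a follow-up / extension report belongs with a core trial, can be sketched by tagging each title as core or follow-up and grouping both kinds under the same disease family. This is only an illustration under stated assumptions: `CORE_RULES` is a subset of the family rules, `core_family` and `merge_follow_ups` are hypothetical helper names, and the titles are invented examples rather than real Crossref records.

```python
import re
from collections import defaultdict

# Same follow-up pattern as the first family rule in the script.
FOLLOW_UP = re.compile(
    r'extension of a randomized controlled trial|postextension|follow-up|follow up'
)

# Subset of the disease rules, checked in order (first match wins).
CORE_RULES = [
    (r'exist-3', 'EXIST-3'),
    (r'tuberous sclerosis', 'TSC'),
    (r'multiple system atrophy', 'MSA'),
]


def core_family(title):
    """Return the disease family of a title, ignoring follow-up wording."""
    lowered = title.lower()
    for pattern, name in CORE_RULES:
        if re.search(pattern, lowered):
            return name
    return 'other'


def merge_follow_ups(titles):
    """Group titles by disease family, keeping core and follow-up entries apart."""
    groups = defaultdict(lambda: {'core': [], 'follow_up': []})
    for title in titles:
        kind = 'follow_up' if FOLLOW_UP.search(title.lower()) else 'core'
        groups[core_family(title)][kind].append(title)
    return groups


# Hypothetical titles, for illustration only.
example_titles = [
    'Everolimus for tuberous sclerosis seizures (EXIST-3): a phase 3 randomised trial',
    'Long-term everolimus in EXIST-3: postextension follow-up',
    'Sirolimus for multiple system atrophy: a futility trial',
]

groups = merge_follow_ups(example_titles)
for fam, entries in groups.items():
    print(fam, len(entries['core']), len(entries['follow_up']))
```

A family that holds both a core entry and a follow-up entry is a candidate for merged counting; a family with follow-up entries only would need its core trial located by hand.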
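Direct implication for the next step
If the table of 19 included studies contains any "repeated follow-up / different phases of the same project" entries, they are most likely hidden among these 3 follow-up / extension entries rather than being 3 entirely new trials.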