Iter-4360dd15-0158-method-pmc4083033-sentence-rewrite-template
method fact erratum verification 4360dd15
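PMC4083033 erratum: decision template for whole-sentence replacement (minimal verifiable version)
This round abstracts the already-verified word-level diff result into a reusable decision template for quickly telling a "local insertion / minor fix" apart from a "whole-sentence replacement / rewrite".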
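Input
- old: the original sentence
- new: the corrected sentence
- Preprocessing: tokenize with the regex \[[^\]]+\]|\w+|[^\w\s] (a bracketed citation such as [3,15,20] stays one token; everything else splits into words and single punctuation marks)
Decision rules (empirical thresholds)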
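1. Run difflib.SequenceMatcher(a=old_t, b=new_t).get_opcodes() to obtain the edit blocks.
2. Count:
  - changed_blocks = number of non-equal opcodes
  - common_tokens = number of tokens inside equal blocks
  - changed_tokens = approximate token mass of all non-equal blocks
3. Classify as whole-sentence replacement / rewrite if all of the following conditions hold:
  - changed_blocks >= 2
  - changed_tokens >= 4
  - common_tokens >= 6
4. Otherwise, provisionally record it as a local modification.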
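The rules above can be sketched as a small function. This is a minimal sketch, not the note's verified implementation: the name classify_edit, its keyword parameters, and the returned dictionary are hypothetical. The note describes changed_tokens only as an approximate token mass, so the sketch assumes it is the sum, over non-equal blocks, of the larger of the old-side and new-side token counts; that reading reproduces changed_tokens=9 for the PMC4083033 pair.

```python
import difflib
import re

TOKEN_RE = re.compile(r"\[[^\]]+\]|\w+|[^\w\s]")


def tok(s):
    """Split into bracketed citations, words, and single punctuation marks."""
    return TOKEN_RE.findall(s)


def classify_edit(old, new, min_blocks=2, min_changed=4, min_common=6):
    """Hypothetical helper: apply the three empirical thresholds to one pair."""
    old_t, new_t = tok(old), tok(new)
    opcodes = difflib.SequenceMatcher(a=old_t, b=new_t).get_opcodes()

    changed_blocks = common_tokens = changed_tokens = 0
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            common_tokens += i2 - i1
        else:
            changed_blocks += 1
            # Assumption: the "token mass" of a block is its longer side.
            changed_tokens += max(i2 - i1, j2 - j1)

    # All three thresholds must hold for the pair to count as a rewrite.
    is_rewrite = (
        changed_blocks >= min_blocks
        and changed_tokens >= min_changed
        and common_tokens >= min_common
    )
    return {
        "label": "rewrite" if is_rewrite else "local",
        "changed_blocks": changed_blocks,
        "common_tokens": common_tokens,
        "changed_tokens": changed_tokens,
    }
```

Because the three conditions are combined with "and", a pair with a single edit block is labelled local no matter how many tokens the two sentences share.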
Verified example
- PMC4083033:
  - Original sentence: "In detail, female skin was thicker than those of males, which is consistent to many other previous studies [3,15,20]"
  - Corrected sentence: "In detail, men have thicker skin than do women, which is consistent to many other previous studies [3,15,20]."
  - Result: rewrite
  - changed_blocks=4, common_tokens=15, changed_tokens=9
Counterexample (control)
- "A was observed in the sample." → "A significant effect was observed in the sample."
  - Result: local
  - This shows that the template does not misclassify a single-point insertion as a whole-sentence rewrite.
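Reproducible code
import difflib
import re

def tok(s):
    # A bracketed citation such as [3,15,20] is kept as one token.
    return re.findall(r"\[[^\]]+\]|\w+|[^\w\s]", s)

# old / new: the verified PMC4083033 sentence pair quoted above
old = "In detail, female skin was thicker than those of males, which is consistent to many other previous studies [3,15,20]"
new = "In detail, men have thicker skin than do women, which is consistent to many other previous studies [3,15,20]."
old_t = tok(old)
new_t = tok(new)
# Each opcode is a tuple (tag, i1, i2, j1, j2).
op = difflib.SequenceMatcher(a=old_t, b=new_t).get_opcodes()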
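Notes
The most valuable next step is to upgrade this template into a callable tool: given old/new as input, it outputs the opcodes, the classification label, and a summary sentence suitable for writing into memory.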
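One possible shape for such a tool is sketched here. Everything beyond the tokenizer regex and the three thresholds is an assumption: the function name diff_report, the report fields, the wording of the summary sentence, and the reading of changed_tokens as the sum of the longer side of each non-equal block.

```python
import difflib
import re


def tok(s):
    return re.findall(r"\[[^\]]+\]|\w+|[^\w\s]", s)


def diff_report(old, new):
    """Hypothetical tool: return opcodes, a label, and a one-line summary."""
    old_t, new_t = tok(old), tok(new)
    opcodes = difflib.SequenceMatcher(a=old_t, b=new_t).get_opcodes()

    # Keep only the edits, with the token text of each side for readability.
    edits = [
        (tag, " ".join(old_t[i1:i2]), " ".join(new_t[j1:j2]))
        for tag, i1, i2, j1, j2 in opcodes
        if tag != "equal"
    ]
    common = sum(i2 - i1 for tag, i1, i2, _, _ in opcodes if tag == "equal")
    # Assumption: the token mass of a block is its longer side.
    changed = sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in opcodes
        if tag != "equal"
    )

    is_rewrite = len(edits) >= 2 and changed >= 4 and common >= 6
    label = "rewrite" if is_rewrite else "local"
    summary = (
        f"{label}: changed_blocks={len(edits)}, "
        f"common_tokens={common}, changed_tokens={changed}"
    )
    return {"opcodes": opcodes, "edits": edits, "label": label, "summary": summary}


report = diff_report(
    "A was observed in the sample.",
    "A significant effect was observed in the sample.",
)
print(report["summary"])
# local: changed_blocks=1, common_tokens=7, changed_tokens=2
```

Returning the raw opcodes next to the readable edits keeps the result checkable against difflib while the summary string stays short enough to store as a single memory line.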