Iter-4360dd15-0157-fact-pmc4083033-word-diff
fact [[erratum verification ]] 4360dd15 method
PMC4083033 erratum: word-level minimal edit script verified
Original sentence:
"In detail, female skin was thicker than those of males, which is consistent to many other previous studies [3,15,20]"
Corrected sentence:
"In detail, men have thicker skin than do women, which is consistent to many other previous studies [3,15,20]."
Word-level diff result (generated by difflib.SequenceMatcher)
- equal: In detail ,
- replace: female → men have thicker
- equal: skin
- delete: was thicker
- equal: than
- replace: those of males → do women
- equal: , which is consistent to many other previous studies [3,15,20]
- insert: .
Verification conclusion
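- This is not a purely insertion-type erratum.
- Nor is it a "local patch with the original sentence largely preserved".
- The minimal script shows up as several token-level replacements, deletions, and insertions, but semantically it amounts to a rewrite of the whole sentence.
Reproducible code essentials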
import difflib
import re

# Tokens: bracketed citation groups such as [3,15,20], words, or single punctuation marks.
TOKEN_RE = r"\[[^\]]+\]|\w+|[^\w\s]"
old = 'In detail, female skin was thicker than those of males, which is consistent to many other previous studies [3,15,20]'
new = 'In detail, men have thicker skin than do women, which is consistent to many other previous studies [3,15,20].'
old_t = re.findall(TOKEN_RE, old)
new_t = re.findall(TOKEN_RE, new)
# get_opcodes() already returns a list of (tag, i1, i2, j1, j2) tuples.
print(difflib.SequenceMatcher(a=old_t, b=new_t).get_opcodes())
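The raw opcodes are index tuples, so a small rendering step is needed to get the readable list recorded in the diff result. The following is a minimal sketch of one way to do that; the helper names `tokenize` and `render_opcodes` are hypothetical and not part of the original note, and the token pattern is the one used in this note.

```python
import difflib
import re

# Same token pattern as in the note: bracketed citations, words, single punctuation marks.
TOKEN_RE = re.compile(r"\[[^\]]+\]|\w+|[^\w\s]")

OLD = ("In detail, female skin was thicker than those of males, "
       "which is consistent to many other previous studies [3,15,20]")
NEW = ("In detail, men have thicker skin than do women, "
       "which is consistent to many other previous studies [3,15,20].")


def tokenize(text):
    """Split a sentence into word-level tokens (hypothetical helper)."""
    return TOKEN_RE.findall(text)


def render_opcodes(old, new):
    """Return one readable line per difflib opcode (hypothetical helper)."""
    old_t, new_t = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(a=old_t, b=new_t, autojunk=False)
    lines = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        before = " ".join(old_t[i1:i2])
        after = " ".join(new_t[j1:j2])
        if tag == "equal":
            lines.append(f"- equal: {before}")
        elif tag == "delete":
            lines.append(f"- delete: {before}")
        elif tag == "insert":
            lines.append(f"- insert: {after}")
        else:  # tag == "replace"
            lines.append(f"- replace: {before} → {after}")
    return lines


if __name__ == "__main__":
    print("\n".join(render_opcodes(OLD, NEW)))
```

The difflib documentation states that SequenceMatcher does not guarantee a minimal edit sequence, so "minimal" is best read here as "the script difflib reports for these tokens".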
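Notes
This round also surfaced, as a by-product, a requirement for a reusable word-level diff wrapper: in future, the "erratum minimal-edit-script generator" could be built as a tool that serves erratum classification for other PMIDs/PMCs in batch.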
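One possible shape for such a wrapper is sketched here. The function names `classify_erratum` and `classify_batch`, the category labels, and the 0.2 threshold are all assumptions made for illustration and are not defined anywhere in this note; the labels simply mirror the three cases distinguished in the verification conclusion. A token-share heuristic like this cannot judge semantics, so its output would still need manual review.

```python
import difflib
import re

# Same token pattern as in the note.
TOKEN_RE = re.compile(r"\[[^\]]+\]|\w+|[^\w\s]")


def classify_erratum(old, new, patch_threshold=0.2):
    """Crude label for an erratum pair (hypothetical heuristic).

    patch_threshold is an arbitrary, assumed cut-off on the share of
    original tokens touched by delete/replace opcodes.
    """
    old_t, new_t = TOKEN_RE.findall(old), TOKEN_RE.findall(new)
    matcher = difflib.SequenceMatcher(a=old_t, b=new_t, autojunk=False)
    ops = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if not ops:
        return "unchanged"
    if all(tag == "insert" for tag, _, _, _, _ in ops):
        return "insertion-only"
    # Count original tokens that were deleted or replaced.
    touched = sum(i2 - i1 for _, i1, i2, _, _ in ops)
    if touched / max(len(old_t), 1) <= patch_threshold:
        return "local-patch"
    return "rewrite"


def classify_batch(errata):
    """Map each identifier (for example a PMC ID) to its label.

    errata: dict of identifier -> (old_sentence, new_sentence).
    """
    return {key: classify_erratum(old, new) for key, (old, new) in errata.items()}


if __name__ == "__main__":
    pairs = {
        "PMC4083033": (
            "In detail, female skin was thicker than those of males, "
            "which is consistent to many other previous studies [3,15,20]",
            "In detail, men have thicker skin than do women, "
            "which is consistent to many other previous studies [3,15,20].",
        ),
    }
    print(classify_batch(pairs))
```

Under these assumed settings the PMC4083033 pair touches 6 of 21 original tokens and falls into the "rewrite" bucket, in line with the conclusion recorded in this note.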