Iter-4360dd15-0154-fact-pmc4083033-tokenization-robustness

fact erratum verification 4360dd15

Modified: 20260424230648000

PMC4083033 erratum: the change type is robust to the tokenization scheme, but it is not an insertion

Source: PMC4083033
- Europe PMC fullTextXML: https://www.ebi.ac.uk/europepmc/webservices/rest/PMC4083033/fullTextXML
- Original article page: https://pmc.ncbi.nlm.nih.gov/articles/PMC4021299/?page=130

Original sentence

"In detail, female skin was thicker than those of males, which is consistent to many other previous studies [3,15,20]."

Corrected sentence

"In detail, men have thicker skin than do women, which is consistent to many other previous studies [3,15,20]."

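Key verification

I compared the two sentences with difflib.SequenceMatcher under three tokenization schemes:
- word_punct: words, citation groups, and punctuation are all kept as tokens
- word_only: only words and citation groups are kept
- whitespace: split on whitespace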

All three schemes yield the same core edit operations:
- replace: female → men have thicker
- delete: was thicker
- replace: those of males → do women

This means:
- The erratum is **not a pure insertion**
- It is not a single-point substitution either, but a **phrase-level rewrite / replacement**
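The classification above can be made explicit with a small helper. This is a minimal sketch, not part of the original verification: the names `classify_edit`, `PATTERNS`, and `labels`, as well as the label strings, are hypothetical and chosen only for illustration. The three regular expressions are the ones used in the reproducible code of this note.

```python
import difflib
import re

ORIG = ("In detail, female skin was thicker than those of males, which is "
        "consistent to many other previous studies [3,15,20].")
CORR = ("In detail, men have thicker skin than do women, which is "
        "consistent to many other previous studies [3,15,20].")

# Hypothetical labels for the three tokenization schemes described above.
PATTERNS = {
    "word_punct": r"\w+|\[[^\]]+\]|[^\w\s]",
    "word_only": r"\w+|\[[^\]]+\]",
    "whitespace": r"\S+",
}


def classify_edit(orig, corr, pattern):
    """Label the token-level change between orig and corr (illustrative only)."""
    a = re.findall(pattern, orig)
    b = re.findall(pattern, corr)
    # Keep only the opcodes that describe an actual change.
    ops = [op for op in difflib.SequenceMatcher(a=a, b=b).get_opcodes()
           if op[0] != "equal"]
    tags = {op[0] for op in ops}
    if not ops:
        return "identical"
    if tags == {"insert"}:
        return "insertion"
    if tags == {"delete"}:
        return "deletion"
    if len(ops) == 1:
        # A lone non-insert, non-delete opcode must be a "replace".
        _, i1, i2, j1, j2 = ops[0]
        if i2 - i1 == 1 and j2 - j1 == 1:
            return "single-token substitution"
    return "rewrite/replacement"


labels = {name: classify_edit(ORIG, CORR, pat) for name, pat in PATTERNS.items()}
print(labels)
```

Under all three schemes the sentence pair should come out as rewrite/replacement, because the non-equal opcodes mix replace and delete operations rather than consisting of insert alone.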

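Counterpoint

If one relies only on a summary-style description or a coarse alignment, it is easy to misread have thicker as newly added content. A token-by-token comparison shows that the correction also deletes was thicker from the original sentence and rewrites both the subject and the comparison structure.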

Reproducible code

import difflib
import re

orig = 'In detail, female skin was thicker than those of males, which is consistent to many other previous studies [3,15,20].'
corr = 'In detail, men have thicker skin than do women, which is consistent to many other previous studies [3,15,20].'

# Tokenization patterns, in order: word_punct, word_only, whitespace.
# The first two keep a bracketed citation group such as [3,15,20] as one token.
patterns = [r'\w+|\[[^\]]+\]|[^\w\s]', r'\w+|\[[^\]]+\]', r'\S+']

for pat in patterns:
    a = re.findall(pat, orig)
    b = re.findall(pat, corr)
    sm = difflib.SequenceMatcher(a=a, b=b)
    # Print only the non-equal opcodes as (tag, original tokens, corrected tokens).
    print([(tag, a[i1:i2], b[j1:j2]) for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != 'equal'])

Conclusion

This erratum in PMC4083033 can be consistently classified as a rewrite/replacement, not an insertion.