Iter-4360dd15-0154-fact-pmc4083033-tokenization-robustness

Modified: 20260516214608000

PMC4083033 erratum: the change type is robust to the tokenization scheme, but it is not an insertion

Source: PMC4083033
- Europe PMC fullTextXML: https://www.ebi.ac.uk/europepmc/webservices/rest/PMC4083033/fullTextXML
- Original article page: https://pmc.ncbi.nlm.nih.gov/articles/PMC4021299/?page=130

Original sentence

"In detail, female skin was thicker than those of males, which is consistent to many other previous studies [3,15,20]."

Corrected sentence

"In detail, men have thicker skin than do women, which is consistent to many other previous studies [3,15,20]."

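Key verification

I compared the two sentences with difflib.SequenceMatcher under three tokenization schemes:
- word_punct: words, citation markers, and punctuation are all kept as tokens
- word_only: only words and citation markers are kept
- whitespace: split on whitespace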
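To make the difference between the three schemes concrete, here is a minimal sketch that prints the tokens each one produces for a short fragment. The names `TOKENIZERS` and `tokenize` and the `sample` string are illustrative assumptions, not part of the original check; the regexes mirror the ones used in the reproducible code of this note.

```python
import re

# A citation marker such as [3,15,20] is kept as a single token.
# The pattern is assembled from pieces so the code never contains
# two consecutive left square brackets.
CITATION = r'\[' + r'[^\]]+' + r'\]'

# Hypothetical mapping from scheme name to regex.
TOKENIZERS = {
    'word_punct': r'\w+|' + CITATION + r'|[^\w\s]',
    'word_only': r'\w+|' + CITATION,
    'whitespace': r'\S+',
}

def tokenize(text, scheme):
    """Split text into tokens using the regex of the named scheme."""
    return re.findall(TOKENIZERS[scheme], text)

# Shortened fragment, chosen only to show how punctuation is handled.
sample = 'thicker than those of males, which [3,15,20].'
for name in TOKENIZERS:
    print(name, tokenize(sample, name))
```

Only the handling of punctuation differs: word_punct emits the comma and the full stop as separate tokens, word_only drops them, and whitespace leaves them attached to the neighbouring token.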

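All three schemes yield the same core edit operations:
- replace: female → men have thicker
- delete: was thicker → ∅
- replace: those of males → do women (under whitespace the trailing comma stays attached: those of males, → do women,)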

This means:
- The erratum is **not a pure insertion**
- Nor is it a single-point substitution; it is a **phrase-level rewrite / replacement**
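One way to turn this reading into a mechanical rule is to label a diff by the set of opcode tags it contains. The sketch below is an assumption about how such a rule could look: the function name `classify_edit` and the label strings are hypothetical, and the shortened token lists are used only for illustration.

```python
import difflib

def classify_edit(a_tokens, b_tokens):
    """Label a token-level diff by the opcode tags it contains.

    Hypothetical labels: 'identical', 'insertion', 'deletion', and
    'rewrite' (any replace, or a mix of insert and delete).
    """
    sm = difflib.SequenceMatcher(a=a_tokens, b=b_tokens, autojunk=False)
    tags = {tag for tag, *_ in sm.get_opcodes() if tag != 'equal'}
    if not tags:
        return 'identical'
    if tags == {'insert'}:
        return 'insertion'
    if tags == {'delete'}:
        return 'deletion'
    return 'rewrite'

# Core of the erratum: contains replace and delete opcodes.
orig = 'female skin was thicker than those of males'.split()
corr = 'men have thicker skin than do women'.split()
print(classify_edit(orig, corr))  # rewrite

# Contrast case, invented for illustration: one added word only.
print(classify_edit('skin was thicker'.split(), 'skin was much thicker'.split()))  # insertion
```

Under this rule a change counts as an insertion only when every non-equal opcode is an insert, which the erratum does not satisfy.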

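Counterpoint

Going only by a summary-style description or a coarse alignment, it is easy to misread "have thicker" as newly added content. A word-by-word comparison shows that the correction also deletes "was thicker" from the original sentence and rewrites both the subject and the comparative construction.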
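As a rough illustration of how a coarse view can mislead, the sketch below compares the two sentence openings as unordered bags of tokens. This order-blind comparison is my own stand-in for a "coarse alignment", not a method taken from the source, and the shortened strings are for illustration only.

```python
from collections import Counter

orig = 'In detail, female skin was thicker than those of males'.split()
corr = 'In detail, men have thicker skin than do women'.split()

# Order-blind view: multiset difference of the tokens.
added = Counter(corr) - Counter(orig)
removed = Counter(orig) - Counter(corr)

print(sorted(added))    # tokens that appear only in the corrected sentence
print(sorted(removed))  # tokens that appear only in the original sentence
```

Reading the `added` side alone suggests that "have" was simply inserted. The `removed` side is not empty, so the change cannot be a pure insertion, and because "thicker" and "skin" occur in both sentences this view hides that their order changed. The ordered token diff is what exposes the rewrite.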

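Reproducible code

The regex below assembles the square-bracket citation pattern from string fragments, so that the wiki scanner does not mistake a double left square bracket in the code for a wikilink: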

import re
import difflib

orig = 'In detail, female skin was thicker than those of males, which is consistent to many other previous studies [3,15,20].'
corr = 'In detail, men have thicker skin than do women, which is consistent to many other previous studies [3,15,20].'

# A citation marker such as [3,15,20] stays a single token.
citation = r'\[' + r'[^\]]+' + r'\]'
patterns = [
    r'\w+|' + citation + r'|[^\w\s]',  # word_punct: words, citations, punctuation
    r'\w+|' + citation,                # word_only: words and citations
    r'\S+',                            # whitespace: split on whitespace
]
for pat in patterns:
    a = re.findall(pat, orig)
    b = re.findall(pat, corr)
    sm = difflib.SequenceMatcher(a=a, b=b)
    # Print only the non-equal opcodes, with the tokens each one covers.
    print([(tag, a[i1:i2], b[j1:j2]) for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != 'equal'])

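Conclusion

This erratum in PMC4083033 can be reliably classified as a rewrite/replacement rather than an insertion, and the classification holds under all three tokenization schemes.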