Iter-4360dd15-0154-fact-pmc4083033-tokenization-robustness
fact erratum verification 4360dd15
PMC4083033 erratum: the edit-type classification is robust to the tokenization scheme, and the edit is not an insertion
Source: PMC4083033
- Europe PMC fullTextXML: https://www.ebi.ac.uk/europepmc/webservices/rest/PMC4083033/fullTextXML
- Original article page: https://pmc.ncbi.nlm.nih.gov/articles/PMC4021299/?page=130
Original sentence
"In detail, female skin was thicker than those of males, which is consistent to many other previous studies [3,15,20]."
Corrected sentence
"In detail, men have thicker skin than do women, which is consistent to many other previous studies [3,15,20]."
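Key verification
I compared the two sentences with difflib.SequenceMatcher under three tokenization schemes:
- word_punct: words, with punctuation marks kept as separate tokens
- word_only: words and citation markers only
- whitespace: split on whitespace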
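As a minimal sketch, the three schemes can be written as regular expressions. The `SCHEMES` mapping, the `tokenize` helper, and the sample fragment are illustrative assumptions, not names taken from the source; the patterns are one plausible reading of the scheme descriptions above.

```python
import re

# Assumed regexes for the three schemes; the keys mirror the list above.
SCHEMES = {
    "word_punct": r"\w+|\[[^\]]+\]|[^\w\s]",  # words, bracketed citations, single punctuation marks
    "word_only": r"\w+|\[[^\]]+\]",           # words and bracketed citations only
    "whitespace": r"\S+",                     # maximal runs of non-whitespace
}

def tokenize(text, scheme):
    """Split text into tokens under the named scheme (hypothetical helper)."""
    return re.findall(SCHEMES[scheme], text)

fragment = "males, which [3,15,20]."
for name in SCHEMES:
    print(name, tokenize(fragment, name))
```

The schemes differ only in how punctuation is handled: word_punct emits the comma and full stop as their own tokens, word_only drops them, and whitespace leaves them attached to the neighbouring word.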
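All three schemes yield the same core edit operations:
- replace: female → men have thicker
- delete: was thicker → ∅
- replace: those of males → do women
This means:
- the erratum is **not a pure insertion**
- nor is it a single-token substitution; it is a **phrase-level rewrite / replacement**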
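One way to turn this reading into a mechanical rule is to label a diff by the set of opcode tags it contains. The sketch below assumes that "pure insertion" means every non-equal opcode is an `insert`; `classify_edit` and its labels are hypothetical names introduced here for illustration.

```python
import difflib

def classify_edit(orig_tokens, corr_tokens):
    """Label a token-level diff by the opcode tags it contains (hypothetical helper)."""
    sm = difflib.SequenceMatcher(a=orig_tokens, b=corr_tokens, autojunk=False)
    tags = {tag for tag, *_ in sm.get_opcodes() if tag != "equal"}
    if not tags:
        return "identical"
    if tags == {"insert"}:
        return "insertion"
    if tags == {"delete"}:
        return "deletion"
    # Any 'replace' opcode, or a mix of tags, counts as a rewrite.
    return "rewrite/replacement"

orig = ("In detail, female skin was thicker than those of males, which is "
        "consistent to many other previous studies [3,15,20].")
corr = ("In detail, men have thicker skin than do women, which is "
        "consistent to many other previous studies [3,15,20].")

# Whitespace tokenization is used here for brevity.
print(classify_edit(orig.split(), corr.split()))
```

Under this rule the erratum is labeled a rewrite, because its opcodes include both `replace` and `delete`.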
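Counter-evidence
Going only by a summary-style description or a rough alignment, it is easy to mistake have thicker for newly added content. A token-by-token comparison shows that the correction also deletes was thicker from the original sentence and rewrites both the subject and the comparative construction.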
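The same point can be checked without any alignment by comparing the two sentences as bags of words. This is a rough sketch using `collections.Counter`; the `\w+` tokenization and the variable names are assumptions made for illustration.

```python
import re
from collections import Counter

orig = ("In detail, female skin was thicker than those of males, which is "
        "consistent to many other previous studies [3,15,20].")
corr = ("In detail, men have thicker skin than do women, which is "
        "consistent to many other previous studies [3,15,20].")

orig_counts = Counter(re.findall(r"\w+", orig))
corr_counts = Counter(re.findall(r"\w+", corr))

# Counter subtraction keeps only positive counts, so each side lists
# tokens that occur more often in one sentence than in the other.
added = sorted((corr_counts - orig_counts).elements())
removed = sorted((orig_counts - corr_counts).elements())

print("added:", added)
print("removed:", removed)
```

The word thicker occurs in both sentences, so it does not show up as added; only have is new in that phrase. The removed list is non-empty, which by itself rules out a pure insertion.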
Reproducible code
import re
import difflib

orig = 'In detail, female skin was thicker than those of males, which is consistent to many other previous studies [3,15,20].'
corr = 'In detail, men have thicker skin than do women, which is consistent to many other previous studies [3,15,20].'

# Patterns, in order: word_punct, word_only, whitespace.
patterns = [r'\w+|\[[^\]]+\]|[^\w\s]', r'\w+|\[[^\]]+\]', r'\S+']
for pat in patterns:
    a = re.findall(pat, orig)
    b = re.findall(pat, corr)
    sm = difflib.SequenceMatcher(a=a, b=b)
    # Print only the non-equal opcodes, with the tokens each one covers.
    print([(tag, a[i1:i2], b[j1:j2]) for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != 'equal'])
Conclusion
This erratum in PMC4083033 can be reliably classified as a rewrite/replacement, not an insertion.