Iter-4360dd15-0153-fact-pmc4083033-direct-xml-diff
fact erratum verification 4360dd15
PMC4083033 erratum: read directly from fullTextXML and compared word by word
Source: PMC4083033 (Europe PMC fullTextXML)
URL: https://www.ebi.ac.uk/europepmc/webservices/rest/PMC4083033/fullTextXML
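Reading the sentence from the XML can be sketched as follows. This is a minimal sketch, not the exact procedure used for this note: the helper names (full_text_url, fetch_xml, paragraphs_containing) are hypothetical, the URL pattern is simply the one quoted above, and it assumes paragraphs appear as un-namespaced <p> elements. SAMPLE is a made-up fragment for offline illustration, not the real article XML.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest"

def full_text_url(pmcid):
    """Build the fullTextXML URL in the form quoted in this note."""
    return f"{BASE}/{pmcid}/fullTextXML"

def fetch_xml(pmcid, timeout=30):
    """Download the XML as bytes (needs network access; not called below)."""
    with urllib.request.urlopen(full_text_url(pmcid), timeout=timeout) as resp:
        return resp.read()

def paragraphs_containing(xml_data, needle):
    """Return the whitespace-normalized text of every <p> that contains needle."""
    root = ET.fromstring(xml_data)
    hits = []
    for p in root.iter("p"):  # assumption: un-namespaced <p> elements
        text = re.sub(r"\s+", " ", "".join(p.itertext())).strip()
        if needle in text:
            hits.append(text)
    return hits

# Hypothetical fragment, used only to show the extraction step offline.
SAMPLE = (
    "<article><body>"
    "<p>In detail, men have <italic>thicker</italic> skin than do women.</p>"
    "<p>Unrelated paragraph.</p>"
    "</body></article>"
)
print(paragraphs_containing(SAMPLE, "thicker skin"))
```

With network access, paragraphs_containing(fetch_xml("PMC4083033"), "thicker") would be the corresponding call; inline markup is flattened by itertext(), so the extracted text should be compared against the published sentence by eye before diffing.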
Original sentence
"In detail, female skin was thicker than those of males, which is consistent to many other previous studies [3,15,20]."
Corrected sentence
"In detail, men have thicker skin than do women, which is consistent to many other previous studies [3,15,20]."
Minimal diff result
A word-by-word comparison with Python's difflib.SequenceMatcher gives the following non-equal operations:
- replace: female -> men have thicker
- delete: was thicker -> ∅
- replace: those of males -> do women
Conclusion
This erratum is not a pure insertion; it is a replacement/rewrite that also involves a deletion.
Reproducible code
import difflib
import re
orig = 'In detail, female skin was thicker than those of males, which is consistent to many other previous studies [3,15,20].'
corr = 'In detail, men have thicker skin than do women, which is consistent to many other previous studies [3,15,20].'

def tok(s):
    # Tokens: word runs, bracketed citation groups such as [3,15,20], or single punctuation marks.
    return re.findall(r'\w+|\[[^\]]+\]|[^\w\s]', s)

a, b = tok(orig), tok(corr)
sm = difflib.SequenceMatcher(a=a, b=b)
print([(tag, a[i1:i2], b[j1:j2]) for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != 'equal'])
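The conclusion above can also be checked mechanically from the opcode tags. The following is a minimal sketch, assuming the same tokenizer as in this note; the function name classify and its four labels are hypothetical, introduced only for illustration.

```python
import difflib
import re

def tok(s):
    # Same tokenization as in the note: words, bracketed citations, punctuation.
    return re.findall(r"\w+|\[[^\]]+\]|[^\w\s]", s)

def classify(orig, corr):
    """Label an erratum by the set of non-equal opcode tags (labels are hypothetical)."""
    sm = difflib.SequenceMatcher(a=tok(orig), b=tok(corr), autojunk=False)
    tags = {tag for tag, *_ in sm.get_opcodes() if tag != "equal"}
    if not tags:
        return "identical"
    if tags == {"insert"}:
        return "pure insertion"
    if tags == {"delete"}:
        return "pure deletion"
    return "replacement/rewrite"

orig = ("In detail, female skin was thicker than those of males, which is "
        "consistent to many other previous studies [3,15,20].")
corr = ("In detail, men have thicker skin than do women, which is "
        "consistent to many other previous studies [3,15,20].")
print(classify(orig, corr))  # replacement/rewrite
```

For the two sentences in this note the non-equal tags are replace and delete, so the result is replacement/rewrite. The difflib documentation notes that SequenceMatcher does not guarantee a minimal edit sequence, so the label reflects its particular alignment rather than a provably shortest diff.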