Iter-4360dd15-0160-method-pmc4083033-rewrite-v2
method fact 4360dd15 [[erratum verification]]
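PMC4083033 erratum: improved rewrite-detection signal (v2 draft)
This round compared the already-verified sample against two local-insertion counterexamples and arrived at a more robust two-layer signal:
- Layer 1: token-level edit intensity (SequenceMatcher ratio / changed_blocks / changed_tokens)
- Layer 2: content-word overlap (Jaccard after removing stop words)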
Observations
Python re-check results for the four samples:
- Local insertion:
  - A was observed in the sample. → A significant effect was observed in the sample.
  - ratio=0.875, content_jaccard=0.50
- Local adverb insertion:
  - The result was significant in the sample. → The result was highly significant in the sample.
  - ratio=0.941, content_jaccard=0.75
- Obvious rewrite:
  - Female skin was thicker than male skin in detail. → Men have thicker skin than women in detail.
  - ratio=0.526, content_jaccard=0.375
- PMC4083033:
  - In detail, female skin was thicker than those of males, which is consistent to many other previous studies [3,15,20] → In detail, men have thicker skin than do women, which is consistent to many other previous studies [3,15,20].
  - ratio=0.714, content_jaccard=0.643
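Conclusion
Thresholds on changed_blocks/changed_tokens alone miss sentences that are semantically a rewrite but do not contain enough edit blocks. A more robust minimal rule should include at least the following conditions, any one of which flags a rewrite:
- ratio < 0.8, or
- content_jaccard < 0.7, or
- changed_blocks >= 2, or
- changed_tokens >= 4
The intended direction of this rule set on this round's samples is:
- the two insertion-type samples should stay local
- the two rewrite samples should be classified as rewrite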
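The disjunction above can be sketched as a small function, which makes it easy to see which clause fires for a given sample. This is only an illustration: the names `fired_clauses` and `label` are hypothetical, and because this note does not report changed_blocks or changed_tokens per sample, the sketch treats those two inputs as optional and skips their clauses when they are not supplied.

```python
# Thresholds copied from the rule list in this note.
RATIO_MAX = 0.8
JACCARD_MAX = 0.7
BLOCKS_MIN = 2
TOKENS_MIN = 4


def fired_clauses(ratio, content_jaccard, changed_blocks=None, changed_tokens=None):
    """Return the names of the clauses that fire (hypothetical helper).

    changed_blocks / changed_tokens are optional; a value of None means
    "not measured", and the corresponding clause is skipped.
    """
    checks = [
        ("ratio", ratio < RATIO_MAX),
        ("content_jaccard", content_jaccard < JACCARD_MAX),
        ("changed_blocks", changed_blocks is not None and changed_blocks >= BLOCKS_MIN),
        ("changed_tokens", changed_tokens is not None and changed_tokens >= TOKENS_MIN),
    ]
    return [name for name, hit in checks if hit]


def label(ratio, content_jaccard, changed_blocks=None, changed_tokens=None):
    """'rewrite' if any clause fires, otherwise 'local'."""
    fired = fired_clauses(ratio, content_jaccard, changed_blocks, changed_tokens)
    return "rewrite" if fired else "local"


# Reported (ratio, content_jaccard) values from the observations in this note.
print(label(0.941, 0.75))           # adverb insertion
print(label(0.526, 0.375))          # obvious rewrite
print(label(0.714, 0.643))          # PMC4083033
print(fired_clauses(0.875, 0.50))   # first insertion sample
```

On the reported values, the adverb insertion stays local and both rewrite samples are flagged by the ratio and content_jaccard clauses. Read literally as an OR, the content_jaccard clause also fires on the first insertion sample (0.50 < 0.7), so that sample would need a different threshold or combination to stay local.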
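Reproducible code
import re, difflib

# Function words excluded from the content-word overlap.
STOP = set('a an the in on at of to for and or was is are were be been being than do does did with by as from that this these those which'.split())

def tok(s):
    # Lowercased tokens; a bracketed citation such as [3,15,20] stays one token.
    return re.findall(r"\[[^\]]+\]|\w+|[^\w\s]", s.lower())

def content_words(s):
    # Lowercased alphabetic words with stop words removed.
    return [t for t in re.findall(r"[a-z]+", s.lower()) if t not in STOP]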
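The snippet above stops at the two tokenizers. One possible way to finish the computation is sketched below. The helper name `metrics` is hypothetical, and the definitions of changed_blocks (number of non-equal opcodes) and changed_tokens (the larger side of each non-equal opcode, summed) are assumptions, since this note does not define them. `STOP`, `tok` and `content_words` are repeated so the block runs on its own.

```python
import difflib
import re

STOP = set('a an the in on at of to for and or was is are were be been being '
           'than do does did with by as from that this these those which'.split())


def tok(s):
    return re.findall(r"\[[^\]]+\]|\w+|[^\w\s]", s.lower())


def content_words(s):
    return [t for t in re.findall(r"[a-z]+", s.lower()) if t not in STOP]


def metrics(before, after):
    """Two-layer signal for one sentence pair (hypothetical helper)."""
    a, b = tok(before), tok(after)
    sm = difflib.SequenceMatcher(None, a, b, autojunk=False)
    # Assumption: an "edit block" is any opcode that is not 'equal'.
    edits = [op for op in sm.get_opcodes() if op[0] != "equal"]
    # Assumption: a block's size is the larger of its two sides.
    changed_tokens = sum(max(i2 - i1, j2 - j1) for _, i1, i2, j1, j2 in edits)
    wa, wb = set(content_words(before)), set(content_words(after))
    union = wa | wb
    jaccard = len(wa & wb) / len(union) if union else 1.0
    return {
        "ratio": round(sm.ratio(), 3),
        "changed_blocks": len(edits),
        "changed_tokens": changed_tokens,
        "content_jaccard": round(jaccard, 3),
    }


SAMPLES = {
    "local insertion": (
        "A was observed in the sample.",
        "A significant effect was observed in the sample."),
    "adverb insertion": (
        "The result was significant in the sample.",
        "The result was highly significant in the sample."),
    "obvious rewrite": (
        "Female skin was thicker than male skin in detail.",
        "Men have thicker skin than women in detail."),
    "PMC4083033": (
        "In detail, female skin was thicker than those of males, which is "
        "consistent to many other previous studies [3,15,20]",
        "In detail, men have thicker skin than do women, which is "
        "consistent to many other previous studies [3,15,20]."),
}

for name, (before, after) in SAMPLES.items():
    print(name, metrics(before, after))
```

Under these assumptions the four reported ratio values and the first three content_jaccard values reproduce. With `content_words` exactly as printed, the PMC4083033 pair comes out at 8/13 ≈ 0.615 rather than 0.643. Counting the citation `[3,15,20]` as one shared content token would give 9/14 ≈ 0.643, so the reported figure may have come from a slightly different tokenizer. Both values are below the 0.7 threshold.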
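Notes
As a next step, this v2 rule should be stress-tested against more "local insertion / local rewording / semantic flip" samples, to avoid overfitting content_jaccard to PMC4083033.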