Iter-4360dd15-0160-method-pmc4083033-rewrite-v2

method fact 4360dd15 [[erratum verification ]]

Modified: 20260424231537000

PMC4083033 erratum: improved rewrite-detection signal (v2 draft)

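This round compared the verified sample against two local-insertion counterexamples and arrived at a more robust two-layer signal:
- Layer 1: token-level edit intensity (SequenceMatcher ratio / changed_blocks / changed_tokens)
- Layer 2: content-word overlap (Jaccard after stop-word removal)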

Observations


Python re-check results for the four samples:
- Local insertion:
  - A was observed in the sample. → A significant effect was observed in the sample.
  - ratio=0.875, content_jaccard=0.50
- Local adverb insertion:
  - The result was significant in the sample. → The result was highly significant in the sample.
  - ratio=0.941, content_jaccard=0.75
- Clear rewrite:
  - Female skin was thicker than male skin in detail. → Men have thicker skin than women in detail.
  - ratio=0.526, content_jaccard=0.375
- PMC4083033:
  - In detail, female skin was thicker than those of males, which is consistent to many other previous studies [3,15,20] → In detail, men have thicker skin than do women, which is consistent to many other previous studies [3,15,20].
  - ratio=0.714, content_jaccard=0.643

Conclusion


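Thresholds on changed_blocks/changed_tokens alone miss sentences that are semantically a rewrite but contain too few edit blocks. A more robust minimal rule should include at least:
- ratio < 0.8
- content_jaccard < 0.7
- changed_blocks >= 2
- changed_tokens >= 4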

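On this round's samples, the rule set points in the following direction:
- the two insertion samples should stay local
- the two rewrite samples should be classified as rewrite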
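The rule set above can be sketched as a single predicate. This is a minimal sketch, not the note's actual implementation. It assumes the four conditions are combined with AND (the only reading under which the first insertion sample, with content_jaccard=0.50, stays local), that changed_blocks counts the non-equal SequenceMatcher opcodes, and that changed_tokens sums the larger side of each such block. The names `classify`, `SAMPLES`, and `labels` are hypothetical.

```python
import difflib
import re

STOP = set(
    "a an the in on at of to for and or was is are were be been being than "
    "do does did with by as from that this these those which".split()
)


def tok(s):
    # Bracketed citations stay whole, then word runs, then single punctuation marks.
    return re.findall(r"\[[^\]]+\]|\w+|[^\w\s]", s.lower())


def content_words(s):
    # Alphabetic tokens with stop words removed.
    return [t for t in re.findall(r"[a-z]+", s.lower()) if t not in STOP]


def classify(before, after):
    """Return 'rewrite' only if all four v2 conditions hold (assumed AND), else 'local'."""
    sm = difflib.SequenceMatcher(None, tok(before), tok(after), autojunk=False)
    ratio = sm.ratio()
    # Assumption: a "changed block" is any opcode other than 'equal'.
    changed = [op for op in sm.get_opcodes() if op[0] != "equal"]
    changed_blocks = len(changed)
    # Assumption: a block's token count is the longer of its two sides.
    changed_tokens = sum(max(i2 - i1, j2 - j1) for _, i1, i2, j1, j2 in changed)
    ca, cb = set(content_words(before)), set(content_words(after))
    union = ca | cb
    jaccard = len(ca & cb) / len(union) if union else 1.0
    is_rewrite = (
        ratio < 0.8
        and jaccard < 0.7
        and changed_blocks >= 2
        and changed_tokens >= 4
    )
    return "rewrite" if is_rewrite else "local"


SAMPLES = [
    ("A was observed in the sample.",
     "A significant effect was observed in the sample."),
    ("The result was significant in the sample.",
     "The result was highly significant in the sample."),
    ("Female skin was thicker than male skin in detail.",
     "Men have thicker skin than women in detail."),
    ("In detail, female skin was thicker than those of males, which is "
     "consistent to many other previous studies [3,15,20]",
     "In detail, men have thicker skin than do women, which is "
     "consistent to many other previous studies [3,15,20]."),
]

labels = [classify(a, b) for a, b in SAMPLES]
print(labels)
```

Under these assumptions both insertion samples fail the ratio condition and stay local, while both rewrite samples satisfy all four conditions.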

Reproducible code


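import difflib
import re

# Function words excluded before computing content-word overlap.
STOP = set('a an the in on at of to for and or was is are were be been being than do does did with by as from that this these those which'.split())

def tok(s):
    # Tokens for the edit layer: bracketed citations such as [3,15,20] stay whole,
    # then word runs, then single punctuation marks.
    return re.findall(r"\[[^\]]+\]|\w+|[^\w\s]", s.lower())

def content_words(s):
    # Alphabetic tokens with stop words removed, for the Jaccard layer.
    return [t for t in re.findall(r"[a-z]+", s.lower()) if t not in STOP]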
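The listing above stops at the tokenizers. The following is a hedged sketch of how the two reported figures might be computed from them, assuming ratio is SequenceMatcher.ratio() over the token lists and content_jaccard is the Jaccard index over the sets of content words. The names `metrics`, `PAIRS`, and `results` are hypothetical and do not come from the note.

```python
import difflib
import re

STOP = set(
    "a an the in on at of to for and or was is are were be been being than "
    "do does did with by as from that this these those which".split()
)


def tok(s):
    return re.findall(r"\[[^\]]+\]|\w+|[^\w\s]", s.lower())


def content_words(s):
    return [t for t in re.findall(r"[a-z]+", s.lower()) if t not in STOP]


def metrics(before, after):
    """Return (ratio, content_jaccard) for one sentence pair."""
    # Layer 1: similarity of the two token sequences.
    ratio = difflib.SequenceMatcher(
        None, tok(before), tok(after), autojunk=False
    ).ratio()
    # Layer 2: overlap of the content-word sets.
    ca, cb = set(content_words(before)), set(content_words(after))
    union = ca | cb
    jaccard = len(ca & cb) / len(union) if union else 1.0
    return ratio, jaccard


PAIRS = {
    "local insertion": (
        "A was observed in the sample.",
        "A significant effect was observed in the sample.",
    ),
    "local adverb insertion": (
        "The result was significant in the sample.",
        "The result was highly significant in the sample.",
    ),
    "clear rewrite": (
        "Female skin was thicker than male skin in detail.",
        "Men have thicker skin than women in detail.",
    ),
    "PMC4083033": (
        "In detail, female skin was thicker than those of males, which is "
        "consistent to many other previous studies [3,15,20]",
        "In detail, men have thicker skin than do women, which is "
        "consistent to many other previous studies [3,15,20].",
    ),
}

results = {name: metrics(a, b) for name, (a, b) in PAIRS.items()}
for name, (ratio, jaccard) in results.items():
    print(f"{name}: ratio={ratio:.3f}, content_jaccard={jaccard:.3f}")
```

Under these assumptions the sketch reproduces all four recorded ratios and the first three content_jaccard values. For the PMC4083033 pair, the stop list exactly as listed yields 8/13 ≈ 0.615 rather than the recorded 0.643, so the recorded figure may have come from a slightly different stop list. Both values fall below the 0.7 threshold.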

Notes


As a next step, the v2 rule should be stress-tested against more "local insertion / local rewrite / semantic flip" samples, to avoid overfitting content_jaccard to PMC4083033.