Iter-4360dd15-0165-method-content-subsequence-local-insertion

method fact 4360dd15 erratum verification

Modified: 20260424232430000

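Narrowest second-layer decision candidate: ordered content-word subsequence rule

Goal: separate "local insertion" from "rewrite/replacement", so that content-word insertions in short sentences stop triggering false positives.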

Rule draft


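First reduce each sentence to its content words (drop stop words, keep only alphabetic tokens), then:
- If the content words of the shorter sentence appear as an ordered subsequence of the longer sentence's content words, prefer the label local insertion.
- Otherwise, label the pair a rewrite candidate.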

On this round's samples, the rule separates the cases as follows:
- A was observed in the sample. → A significant effect was observed in the sample.
  - content words: observed sample vs significant effect observed sample
  - the shorter sentence is an ordered subsequence of the longer one ⇒ local insertion
- The result was significant in the sample. → The result was highly significant in the sample.
  - content words: result significant sample vs result highly significant sample
  - the shorter sentence is an ordered subsequence of the longer one ⇒ local insertion
- We observed the effect. → We observed a strong effect.
  - content words: we observed effect vs we observed strong effect
  - the shorter sentence is an ordered subsequence of the longer one ⇒ local insertion
- PMC4083033
  - content words: female skin thicker males consistent many other previous studies vs men have thicker skin women consistent many other previous studies
  - not an ordered subsequence: words are substituted and the sentence skeleton is reworked ⇒ rewrite

Reproducible code


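import re

# Stop words dropped before comparison
STOP = set('a an the in on at of to for and or was is are were be been being than do does did with by as from that this these those which'.split())

def content_words(s):
    # Lowercase, keep only alphabetic tokens, drop stop words
    return [t for t in re.findall(r'[a-z]+', s.lower()) if t not in STOP]

def is_subsequence(short, long):
    # True if the items of short occur in long in the same order (gaps allowed)
    it = iter(long)
    try:
        for x in short:
            while next(it) != x:
                pass
        return True
    except StopIteration:
        return False

# Decision: are the shorter sentence's content words an ordered subsequence of the longer one's?
def classify(a, b):
    ca, cb = content_words(a), content_words(b)
    # sorted() is stable, so on equal lengths the two lists stay distinct;
    # min(ca, cb, key=len) and max(ca, cb, key=len) would both return ca on a tie.
    short, long = sorted((ca, cb), key=len)
    return 'local insertion' if is_subsequence(short, long) else 'rewrite'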
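As a check, the rule can be run on the samples listed above. The following is a minimal self-contained sketch: it restates the helpers in compact form, and the names `classify`, `pairs`, `labels` and `pmc_label` are introduced here for illustration only. The PMC4083033 pair is entered as the already-extracted content words, because the note does not quote the full sentences.

```python
import re

STOP = set(
    "a an the in on at of to for and or was is are were be been being than "
    "do does did with by as from that this these those which".split()
)

def content_words(s):
    # Lowercase, keep alphabetic tokens, drop stop words
    return [t for t in re.findall(r"[a-z]+", s.lower()) if t not in STOP]

def is_subsequence(short, long):
    # "x in it" advances the iterator, so matches must occur in order
    it = iter(long)
    return all(x in it for x in short)

def label_for(ca, cb):
    # Stable sort by length: the shorter list comes first, ties keep both lists
    short, long = sorted((ca, cb), key=len)
    return "local insertion" if is_subsequence(short, long) else "rewrite"

def classify(a, b):
    return label_for(content_words(a), content_words(b))

pairs = [
    ("A was observed in the sample.",
     "A significant effect was observed in the sample."),
    ("The result was significant in the sample.",
     "The result was highly significant in the sample."),
    ("We observed the effect.",
     "We observed a strong effect."),
]
labels = [classify(a, b) for a, b in pairs]

# PMC4083033: content words as listed in the note, not the original sentences
pmc_a = "female skin thicker males consistent many other previous studies".split()
pmc_b = "men have thicker skin women consistent many other previous studies".split()
pmc_label = label_for(pmc_a, pmc_b)

for (a, b), lab in zip(pairs, labels):
    print(lab, "|", a, "->", b)
print(pmc_label, "| PMC4083033")
```

The three short pairs come out as local insertion and the PMC4083033 pair as rewrite, matching the list above.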

Conclusion


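Compared with the ratio + jaccard approach of the previous rounds, this rule is narrower: it directly tests whether the skeleton of the original sentence is preserved, and it is insensitive to one or a few content words being inserted into a short sentence.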
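To illustrate why similarity scores are sensitive to short sentences, the sketch below computes two scores on the first sample pair. This note does not define the earlier rounds' "ratio" and "jaccard", so the sketch assumes `difflib.SequenceMatcher.ratio()` over the content-word lists and set Jaccard over the same words. Both choices are assumptions, as are the variable names.

```python
import re
from difflib import SequenceMatcher

STOP = set(
    "a an the in on at of to for and or was is are were be been being than "
    "do does did with by as from that this these those which".split()
)

def content_words(s):
    # Lowercase, keep alphabetic tokens, drop stop words
    return [t for t in re.findall(r"[a-z]+", s.lower()) if t not in STOP]

ca = content_words("A was observed in the sample.")
cb = content_words("A significant effect was observed in the sample.")

# Assumed "ratio": 2 * matches / (len(ca) + len(cb)) = 2 * 2 / (2 + 4)
ratio = SequenceMatcher(None, ca, cb).ratio()

# Assumed "jaccard": |intersection| / |union| = 2 / 4
jaccard = len(set(ca) & set(cb)) / len(set(ca) | set(cb))

print(round(ratio, 3), jaccard)  # 0.667 0.5
```

Two inserted content words already move both scores well below 1 for a two-word skeleton, whereas the subsequence test looks only at whether the shorter word list survives in order.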

Note


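This is still only a candidate criterion: it pushes some "pure insertion accompanied by slight reordering" samples toward rewrite, so the next step is to look specifically for "insertion + slight reordering" counterexamples to test the boundary.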
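A sketch of what such a boundary case looks like, using a made-up sentence pair that does not come from the data. The `Counter`-based containment check is a hypothetical helper, not part of the rule: it marks pairs where every content word of the shorter sentence is still present in the longer one although the ordered test fails. These are the pairs worth inspecting by hand.

```python
import re
from collections import Counter

STOP = set(
    "a an the in on at of to for and or was is are were be been being than "
    "do does did with by as from that this these those which".split()
)

def content_words(s):
    # Lowercase, keep alphabetic tokens, drop stop words
    return [t for t in re.findall(r"[a-z]+", s.lower()) if t not in STOP]

def is_subsequence(short, long):
    # "x in it" advances the iterator, so matches must occur in order
    it = iter(long)
    return all(x in it for x in short)

# Made-up pair: one inserted word ("strong") plus a fronted phrase
a = "We observed the effect in the sample."
b = "In the sample, we observed a strong effect."

short, long = sorted((content_words(a), content_words(b)), key=len)

rule_label = "local insertion" if is_subsequence(short, long) else "rewrite"

# Order-free containment: an empty difference means no word of short is missing
bag_contained = not (Counter(short) - Counter(long))

# Candidate counterexample: words all kept, order changed
boundary_case = rule_label == "rewrite" and bag_contained

print(short, long, rule_label, bag_contained, boundary_case)
```

Here the rule answers rewrite while the containment check holds, which is the "insertion + slight reordering" pattern described above.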