wand(weakand)算法基本思路--688IT编程网

wand（weakand）算法基本思路

⼀般搜索的query⽐较短，但如果query⽐较长，如是⼀段⽂本，需要搜索相似的⽂本，这时候⼀般就需要wand算法，该算法在⼴告系统中有⽐较成熟的应该，主要是adsense场景，需要搜索⼀个页⾯内容的相似⼴告。

Wand⽅法简单来说，⼀般我们在计算⽂本相关性的时候，会通过倒排索引的⽅式进⾏查询，通过倒排索引已经要⽐全量遍历节约⼤量时间，但是有时候仍然很慢。

原因是很多时候我们其实只是想要top n个结果，⼀些结果明显较差的也进⾏了复杂的相关性计算，⽽weak-and算法通过计算每个词的贡献上限来估计⽂档的相关性上限，从⽽建⽴⼀个阈值对倒排中的结果进⾏减枝，从⽽得到提速的效果。

wand算法⾸先要估计每个词对相关性贡献的上限，最简单的相关性就是TF*IDF，⼀般query中词的TF均为1，IDF是固定的，因此就是估计⼀个词在⽂档中的词频TF上限，⼀般TF需要归⼀化，即除以⽂档所有词的个数，因此，就是要估算⼀个词在⽂档中所能占到的最⼤⽐例，这个线下计算即可。

知道了⼀个词的相关性上界值，就可以知道⼀个query和⼀个⽂档的相关性上限值，显然就是他们共同的词的相关性上限值的和。

这样对于⼀个query，获得其所有词的相关性贡献上限，然后对⼀个⽂档，看其和query中都出现的词，然后求这些词的贡献和即可，然后和⼀个预设值⽐较，如果超过预设值，则进⼊下⼀步的计算，否则则丢弃。

如果按照这样的⽅法计算n个最相似⽂档，就要取出所有的⽂档，每个⽂档作预计算，⽐较threshold，然后决定是否在top-n之列。这样计算当然可⾏，但是还是可以优化的。优化的出发点就是尽量减少预计算，wand论⽂中提到的算法如下：

其基本思路是基于倒排索引来实现WAND⽅法：

⾸先是初始化：

1. 提取出da中所有的词，以及这些词的倒排索引；

2. 初始化curDoc=0；

3. 初始化posting数组，使得posting[t]为词t倒排索引中第⼀个⽂档；

然后是寻下⼀个需要完全计算权值的问题，具体流程如下：

可以定义⼀个next函数，⽤于查下⼀个进⾏完全计算的⽂档，论⽂中对next描述如下：

Function next(θ)

repeat

/* Sort the terms in non decreasing order of DID */

sort(terms, posting)

/* Find pivot term - the first one with accumulated UB ≥ θ */

pTerm ← findPivotTerm(terms, θ)

if (pTerm = null) return (NoMoreDocs)

pivot ← posting[pTerm].DID

if (pivot = lastID) return (NoMoreDocs)

if (pivot ≤ curDoc)

/* pivot has already been considered, advance one of the preceding terms */

aterm ← pickTerm(terms[0..pTerm])

posting[aterm] ← (curDoc+1)

else /* pivot > curDoc */

if (posting[0].DID = pivot) //注，这个是sort之后的第⼀个term位置的doc id

/* Success, all terms preceding pTerm belong to the pivot */

curDoc ← pivot

return (curDoc, posting)

else

/* not enough mass yet on pivot, advance one of the preceding terms */

aterm ← pickTerm(terms[0..pTerm])

posting[aterm] ← (pivot)

end repeat

其中⽤到了⼏个函数，解释如下：

(n)

这个函数返回aterm倒排索引中的DID，这个DID要满⾜DID >= n。DID就是docID

sort(terms, posting)

sort是把terms按照posting当前指向的DID的⾮递减排序。⽐如da所有词的倒排索引为：

t0: [1, 3, 26]

t1: [1, 2, 4, 10, 100]

t2: [2, 3, 6, 34, 56]

t3: [1, 4, 5, 23, 70, 200]

t4: [5, 14, 78]

当前posting数组为：[2, 2, 0, 3, 0]

根据以上两条信息，可以得到：{t0 : 26, t1 : 4, t2 : 2, t3 : 23, t4 : 5}

则排序后的结果为[t2, t1, t4, t3, t0]

findPivotTerm(terms, θ)

按照之前得到的排序，返回第⼀个上界累加和≥θ的term。

引⼊以下数据：[UB0, UB1, UB2, UB3, UB4] = [0.5, 1, 2, 3, 4], θ = 8，UB*为词B*的最⼤可能的贡献值。

因为(2 + 1 + 4) = 7 < 8 ⽽ (2 + 1 + 4 + 3) = 10 > 8，所以此函数返回t3

pickTerm(terms[0..pTerm])

在0到pTerm(不包含pTerm)中选择⼀个term。

还是⽤之前的数据，则是在[t2, t1, t4](没有t3)中选择⼀个term返回。

关于选择策略，当然是以可以跳过最多的⽂档为指导，论⽂中选择了idf最⼤的term。

上⾯的过程可以⽤下图表⽰：

上图即为已经按照term当前指向的doc id排序的情况。

对于doc 2，其可能的最⼤得分为2<8 //2:max of t2

对于doc 4，其可能的最⼤得分为2+1=3<8 //1:max of t1

对于doc 5，其可能的最⼤得分为2+1+4=7<8 //4:max of t4

对于doc 23，其可能的最⼤得分为2+1+4+3=10>8 //3:max of t3

因此，doc 23即为我们需要寻的pivot term

上图其实也解释了为什么要寻pivot term，因为doc 2、doc 4、doc 5的得分不可能达到threshold，所以可以直接忽略，t2、t1、t4对应的posting list直接skip到doc 23(⼤于等于doc23的位置)，具体选择先跳哪个，可以根据term的idf来选择，当然也可以按照其距离pivot对于的doc id距离选择，选择⼀个跳的最多的。在这⾥，doc 23 被称为pivot，可以作为候选⽂档（candidate），进⼀步计算全局得分（evaluate）；doc 2、doc 4、doc 5被跳过。

解释了以上⼏个函数之后，理解上边的伪码应该没有问题。⾄于要理解为什么这样不会错过相似度⼤的⽂档，就需要⾃⼰动⼀翻脑⼦。可以参考论⽂中的解释，不过说起来⽐较啰嗦，这⾥就不证明了。

最后要提到的⼀点是，在sort函数中是没有必要完全排序的，因为每⼀次循环都只是改变了posting中的⼀条数据，只要把这个数据排好序就可以了。

附python代码

#!/usr/bin/env python

#wand, assume threshold is 4,the upper bound of every term is UB

#max contribute

import time

import heapq

UB = {"t0":0.5,"t1":1,"t2":2,"t3":3,"t4":4} #upper bound of term's value

MAX_RESULT_NUM = 3 #max result number

class WAND:

#initial index

def__init__(self, InvertIndex, last_docid):

self.invert_index = InvertIndex #InvertIndex: term -> docid1, docid2, docid3 ...

self.current_doc = 0

self.current_invert_index = {} #posting

self.query_terms = []

self.threshold = -1

self.sort_terms = []

self.LastID = 2000000000 #big num

self.last_docid = last_docid

#get index list according to query term

def__InitQuery(self, query_terms):

self.current_doc = -1

self.current_invert_index.clear()

self.query_terms = query_terms

self.sort_terms[:] = []

for term in query_terms:

#initial start pos from the first position of term's invert_index

self.current_invert_index[term] = [ self.invert_index[term][0], 0 ] #[ docid, index ] #sort term according its current posting doc id

def__SortTerms(self):

if len(self.sort_terms) == 0:

for term in self.query_terms:

if term in self.current_invert_index:

doc_id = self.current_invert_index[term][0]

self.sort_terms.append([ int(doc_id), term ])

self.sort_terms.sort()

#select the first term in sorted term list

def__PickTerm(self, pivot_index):

return 0

#find pivot term

def__FindPivotTerm(self):

score = 0

#print "sort term ", self.sort_terms #[docid, term]

for i in range(0, len(self.sort_terms)):

score = score + UB[self.sort_terms[i][1]]

if score >= self.threshold:

return [ self.sort_terms[i][1], i] #[term, index]

return [ None, len(self.sort_terms)]

#move to doc id >= docid

def__IteratorInvertIndex(self, change_term, docid, pos):

doc_list = self.invert_index[change_term]

i = 0

for i in range(pos, len(doc_list)):

if doc_list[i] >= docid:

pos = i

docid = doc_list[i]

break

return [ docid, pos ]

def__AdvanceTerm(self, change_index, docid ):

change_term = self.sort_terms[change_index][1]

pos = self.current_invert_index[change_term][1]

(new_doc, new_pos) = self.__IteratorInvertIndex(change_term, docid, pos)

self.current_invert_index[change_term] = [ new_doc , new_pos ]

self.sort_terms[change_index][0] = new_doc

def__Next(self):

if self.last_docid == self.current_doc:

return None

while True:

#sort terms by doc id

self.__SortTerms()

#find pivot term > threshold

(pivot_term, pivot_index) = self.__FindPivotTerm()

if pivot_term == None:

#no more candidate

return None

pivot_doc_id = self.current_invert_index[pivot_term][0]

if pivot_doc_id == self.LastID: #!!

return None

if pivot_doc_id <= self.current_doc:

change_index = self.__PickTerm(pivot_index)#always retrun 0

self.__AdvanceTerm( change_index, self.current_doc + 1 )

else:

first_docid = self.sort_terms[0][0]

if pivot_doc_id == first_docid:

self.current_doc = pivot_doc_id

return self.current_doc

else:

#pick all preceding term,advance to pivot

for i in range(0, pivot_index):

change_index = i

self.__AdvanceTerm( change_index, pivot_doc_id )

def__InsertHeap(self,doc_id,score):

if sult_list)<3:

heapq.sult_list, (score, doc_id))

else:

if score&sult_list[0][0]: #large than mini item in heap

heapq.sult_list)

heapq.sult_list, (score, doc_id))

sult_list[0][0]

#full evaluate the doucment, get its full score, to be added

def__FullEvaluate(self, docid):

return 4

def DoQuery(self, query_terms):

self.__InitQuery(query_terms)

while True:

candidate_docid = self.__Next()

if candidate_docid == None:

break

print"candidate_docid:" + str(candidate_docid)

#insert candidate_docid to heap

full_doc_score = self.__FullEvaluate(candidate_docid)

mini_item_value = self.__InsertHeap(candidate_docid, full_doc_score)

#update threshold

self.threshold = mini_item_value

print"result list ", sult_list

sult_list

if__name__ == "__main__":

testIndex = {}

testIndex["t0"] = [ 1, 3, 26, 2000000000]

testIndex["t1"] = [ 1, 2, 4, 10, 100, 2000000000 ]

testIndex["t2"] = [ 2, 3, 6, 34, 56, 2000000000 ]

testIndex["t3"] = [ 1, 4, 5, 23, 70, 200, 2000000000 ]

testIndex["t4"] = [ 5, 14, 78, 2000000000 ]

last_doc_id = 100

w = WAND(testIndex, last_doc_id)

final_result = w.DoQuery(["t0", "t1", "t2", "t3", "t4"])

print"final result "

for item in final_result:

print"doc " + str(item[1])

附录每⼀步的执⾏过程，如果原来的设置，是0结果的，因此设置threshold为4：初始位置：

t0: [1, 3, 26] max:0

sort of等于什么t1: [1, 2, 4, 10, 100] max:1

t2: [2, 3, 6, 34, 56] max:2

t3: [1, 4, 5, 23, 70, 200] max:3

t4: [5, 14, 78] max:4

posting=[1,1,2,1,5]，也就是{t0:1,t1:1,t2:2,t3:1,t4:5}

cur_doc = 0

第1步：

1. 按照posting的doc id对term升序排序，这⾥得到[t0,t1,t3,t2,t4]

2. 寻pivot，0+1+3=4>=4, 也就是到了t3⼤于等于4，因此t3就是pivot term，pivot就是其对应的doc id=1

3. pivot为1，cur doc id=0，因此pivot>cur_doc, 然后⽐较posting[0].doc_id，也就是t0对应的当前doc id=1 和pivot=1是否相等，这⾥相

等，then，返回doc1, cur_doc_id=1

当前的数据不变：

t0: [1, 3, 26] max:0

t1: [1, 2, 4, 10, 100] max:1

t2: [2, 3, 6, 34, 56] max:2

t3: [1, 4, 5, 23, 70, 200] max:3

t4: [5, 14, 78] max:4

posting=[26,1,2,1,5]，也就是{t0:26,t1:1,t2:2,t3:1,t4:5}

cur_doc = 0

第2步：

1. 按照posting的doc id对term升序排序，这⾥得到[t0,t1,t3,t2,t4]

2. 寻pivot，0+1+3=4>=4, 也就是到了t3⼤于等于4，因此t3就是pivot term，pivot就是其对应的doc id=1

3. pivot为1，cur doc id=1，因此pivot<=cur_doc, ，then，选择⼀个pterm之前的term，向前移动。这⾥选择第⼀个，也就是t0，然后将其

对应的倒排指针移动到⼤于等于doc_cur_id+1的位置，这⾥移动到3.

当前的数据：

t0: [1, 3, 26] max:0

t1: [1, 2, 4, 10, 100] max:1

t2: [2, 3, 6, 34, 56] max:2

t3: [1, 4, 5, 23, 70, 200] max:3

t4: [5, 14, 78] max:4

posting=[3,1,2,1,5]，也就是{t0:3,t1:10,t2:2,t3:1,t4:5}

后续的步骤基本差不多

第3步，排序后[t1,t3,t2,t0,t4] cur_doc = 1,pivot term=t1, pivot=1，移动t1

t0: [1, 3, 26] max:0

t1: [1, 2, 4, 10, 100] max:1

t2: [2, 3, 6, 34, 56] max:2

t3: [1, 4, 5, 23, 70, 200] max:3

t4: [5, 14, 78] max:4

第4步, 排序后[t3,t2,t1,t0,t4] cur_doc = 1,pivot term=t3, pivot=1，移动t3

t0: [1, 3, 26] max:0

t1: [1, 2, 4, 10, 100] max:1

t2: [2, 3, 6, 34, 56] max:2

t3: [1, 4, 5, 23, 70, 200] max:3

t4: [5, 14, 78] max:4

第5步, 排序后[t1,t2,t0,t3,t4] cur_doc = 1,pivot term=t3, pivot=4，移动t1

688IT编程网

wand(weakand)算法基本思路

发表评论

推荐文章

java正则表达式选择题

一种基于正则表达式的DBC文件解析及报文分析方法[发明专利]

工龄小数点提取

非零金额正则表达式

提取文本中数字的函数

热门文章

excel文字递增函数公式

数字递增公式

notepad 正则变量运算

C++regex库常用函数及实例

js正则表达式之前瞻后顾与非捕获分组

indesign正则数字和英文之间的空格

C#匹配中文字符串的4种正则表达式分享

PHP正则表达式匹配中文字符

匹配中文汉字的正则表达式介绍

Python正则表达式如何进行字符串替换

orcl中用正则表达式

sql正则表达式excel

dataframe正则表达式

postgress sql正则

el-upload accept 正则表达式

半小时正则表达式

判断科学计数法的正则

根据url判断静态资源的方法

Java正则表达式-匹配正负浮点数

替换模糊匹配正则-hive

最新文章

一种基于正则表达式的DBC文件解析及报文分析方法[发明专利]

能被5整除的十进制整数的正规表达式

大于0小于等于1的正则表达式

linux grep 26个字母

java pattern 正则表达式

掌握文本编辑器中的搜索和替换技巧

标签列表

688IT编程网

wand(weakand)算法基本思路

发表评论

推荐文章

java正则表达式 选择题

一种基于正则表达式的DBC文件解析及报文分析方法[发明专利]

工龄小数点提取

非零金额 正则表达式

提取文本中数字的函数

热门文章

excel文字递增函数公式

数字递增公式

notepad 正则变量运算

C++regex库常用函数及实例

js正则表达式之前瞻后顾与非捕获分组

indesign正则数字和英文之间的空格

C#匹配中文字符串的4种正则表达式分享

PHP正则表达式匹配中文字符

匹配中文汉字的正则表达式介绍

Python正则表达式如何进行字符串替换

orcl中用正则表达式

sql正则表达式excel

dataframe正则表达式

postgress sql正则

el-upload accept 正则表达式

半小时 正则表达式

判断科学计数法的正则

根据url判断静态资源的方法

Java正则表达式-匹配正负浮点数

替换模糊匹配正则-hive

最新文章

一种基于正则表达式的DBC文件解析及报文分析方法[发明专利]

能被5整除的十进制整数的正规表达式

大于0小于等于1的正则表达式

linux grep 26个字母

java pattern 正则表达式

掌握文本编辑器中的搜索和替换技巧

标签列表

java正则表达式选择题

非零金额正则表达式

半小时正则表达式