Faiss教程:入门--688IT编程网

Faiss教程：⼊门

Faiss处理固定维度d的数据，矩阵每⼀⾏表⽰⼀个向量，每列表⽰向量的⼀项。Faiss采⽤32-bit浮点型存储。

假设xb为数据集，维度为nb×d；xq是查询数据，维度为nq×d

import numpy as np

d = 64 # dimension

nb = 100000 # database size

nq = 10000 # nb of queries

np.random.seed(1234) # make reproducible

numpy教程 pdfxb = np.random.random((nb, d)).astype('float32')

xb[:, 0] += np.arange(nb) / 1000.

xq = np.random.random((nq, d)).astype('float32')

xq[:, 0] += np.arange(nq) / 1000.

为数据构建索引，Faiss包含⾮常多的索引类型，这⾥我们采⽤最简单的版本IndexFlatL2，基于L2距离进⾏brute-force搜索。

所有的索引的构建都需要知道它们操作数据的维度（d），其中⼤多索引需要⼀个训练过程，基于训练集来分析向量的分布。对IndexFlatL2，我们可以跳过训练。

索引创建后，add 和 search操作便可以基于索引来执⾏了。add 添加数据到索引（添加到xb）。

我们可以查看索引的属性状态，is_trained是否训练完成（有些不需要训练），ntotal被索引数据的数⽬。

有⼀些索引，需要提供向量的整数ID，如果ID没有提供，add可以采⽤数据的序号数，第⼀个数据为0，第⼆个是1，以此类推。

import faiss # make faiss available

index = faiss.IndexFlatL2(d) # build the index

print(index.is_trained)

index.add(xb) # add vectors to the index

al)

# output

True

100000

基于索引便可以进⾏k近邻查询了，结果矩阵为nq×k，第i⾏表⽰第i个查询向量，每⾏包含k个最近邻的ID，距离依次递增。同时返回相同维度的距离矩阵。

k = 4 # we want to see 4 nearest neighbors

D, I = index.search(xb[:5], k) # sanity check

print(I)

print(D)

D, I = index.search(xq, k) # actual search

print(I[:5]) # neighbors of the 5 first queries

print(I[-5:]) # neighbors of the 5 last queries

# output

[[ 0 393 363 78]

[ 1 555 277 364]

[ 2 304 101 13]

[ 3 173 18 182]

[ 4 288 370 531]]

[[ 0. 7.17517328 7.2076292 7.25116253]

[ 0. 6.32356453 6.6845808 6.79994535]

[ 0. 5.79640865 6.39173603 7.28151226]

[ 0. 7.27790546 7.52798653 7.66284657]

[ 0. 6.76380348 7.29512024 7.36881447]]

[[ 381 207 210 477]

[ 526 911 142 72]

[ 838 527 1290 425]

[ 196 184 164 359]

[ 526 377 120 425]]

[[ 9900 10500 9309 9831]

[11055 10895 10812 11321]

[11353 11103 10164 9787]

[10571 10664 10632 9638]

[ 9628 9554 10036 9582]]

受向量第⼀项的影响，查询数据中头部数据的最近邻也在数据集的头部。

加速查询，⾸先可以把数据集切分成多个，我们定义Voronoi Cells，每个数据向量只能落在⼀个cell中。查询时只需要查询query向量落在cell中的数据了，降低了距离计算次数。

通过IndexIVFFlat索引，可以实现上⾯的思想，它需要⼀个训练的阶段。IndexIVFFlat需要另⼀个索引，称为quantizer，来判断向量属于哪个cell。

search⽅法也相应引⼊了nlist（cell的数⽬）和nprobe（执⾏搜索的cell数）

nlist = 100

k = 4

quantizer = faiss.IndexFlatL2(d) # the other index

index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)

# here we specify METRIC_L2, by default it performs inner-product search

assert not index.is_trained

assert index.is_trained

index.add(xb) # add may be a bit slower as well

D, I = index.search(xq, k) # actual search

print(I[-5:]) # neighbors of the 5 last queries

index.nprobe = 10 # default nprobe is 1, try a few more

D, I = index.search(xq, k)

print(I[-5:]) # neighbors of the 5 last queries

# output

[[ 9900 10500 9831 10808]

[11055 10812 11321 10260]

[11353 10164 10719 11013]

[10571 10203 10793 10952]

[ 9582 10304 9622 9229]]

[[ 9900 10500 9309 9831]

[11055 10895 10812 11321]

[11353 11103 10164 9787]

[10571 10664 10632 9638]

[ 9628 9554 10036 9582]]

结果并不完全⼀致，因为落在Voronoi cell外的数据也可能离查询数据更近。适当增加nprobe可以得到和brute-force相同的结果，nprobe控

制了速度和精度的平衡。

IndexFlatL2 和 IndexIVFFlat都要存储所有的向量数据，这对于⼤型数据集是不现实的。Faiss基于PQ提供了变体IndexIVFPQ来压缩数据向量（⼀定的精度损耗）。

向量仍是存储在Voronoi cells中，但是它们的⼤⼩可以通过m来设置(m是d的因⼦)。

由于向量值不在准确存储，所以search计算的距离也是近似的了。

nlist = 100

m = 8 # number of bytes per vector

k = 4

quantizer = faiss.IndexFlatL2(d) # this remains the same

index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)

# 8 specifies that each sub-vector is encoded as 8 bits

index.add(xb)

D, I = index.search(xb[:5], k) # sanity check

print(I)

print(D)

index.nprobe = 10 # make comparable with experiment above

D, I = index.search(xq, k) # search

print(I[-5:])

# output

[[ 0 424 363 278]

[ 1 555 1063 24]

[ 2 304 46 346]

[ 3 773 182 1529]

[ 4 288 754 531]]

[[ 1.45568264 6.03136778 6.18729019 6.38852692]

[ 1.4934082 5.74254704 6.19941282 6.21501732]

[ 1.60279942 6.20174742 6.32792568 6.78541422]

[ 1.69804895 6.2623148 6.26956797 6.56042767]

[ 1.30235791 6.13624859 6.33899879 6.51442146]]

[[10664 10914 9922 9380]

[10260 9014 9458 10310]

[11291 9380 11103 10392]

[10856 10284 9638 11276]

[10304 9327 10152 9229]]

最近距离（到⾃⾝）不再是0了，因为数据被压缩了。整理64位 32-bits向量，被分割为8份，每份⽤8bits表⽰，所以压缩因⼦为32。

查询数据集的结果和IVFFlat对⽐，⼤多是错误的，但是它们都在10000左右。这种策略在实际数据中是更好的：

均匀分布的数据是很难索引的，它们很难聚类和降维

⾃然数据，相似数据⽐不相⼲数据的距离要显著的更⼩。

使⽤⼯⼚⽅法简化索引构建

index = faiss.index_factory(d, "IVF100,PQ8")

PQ8替换为Flat便得到了IndexFlat索引，⼯⼚⽅法是⾮常有效的，尤其是对数据采⽤预处理的时候，如参数"PCA32,IVF100,Flat"，表⽰通过PCA把向量减低到32维。

Faiss可以基本⽆缝地在GPU上运⾏，⾸先申请GPU资源，并包括⾜够的显存空间。

res = faiss.StandardGpuResources() # use a single GPU

使⽤GPU创建索引

# build a flat (CPU) index

index_flat = faiss.IndexFlatL2(d)

# make it into a gpu index

gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)

索引的使⽤和CPU上类似

gpu_index_flat.add(xb) # add vectors to the index print(gpu_al)

k = 4 # we want to see 4 nearest neighbors D, I = gpu_index_flat.search(xq, k) # actual search

print(I[:5]) # neighbors of the 5 first queries print(I[-5:]) # neighbors of the 5 last queries

使⽤多张GPU卡

ngpus = _num_gpus()

print("number of GPUs:", ngpus)

cpu_index = faiss.IndexFlatL2(d)

gpu_index = faiss.index_cpu_to_all_gpus( # build the index cpu_index

)

gpu_index.add(xb) # add vectors to the index print(al)

k = 4 # we want to see 4 nearest neighbors D, I = gpu_index.search(xq, k) # actual search

print(I[:5]) # neighbors of the 5 first queries print(I[-5:]) # neighbors of the 5 last queries Processing math: 100%

688IT编程网

Faiss教程:入门

发表评论

推荐文章

java正则表达式选择题

一种基于正则表达式的DBC文件解析及报文分析方法[发明专利]

工龄小数点提取

非零金额正则表达式

提取文本中数字的函数

热门文章

利用正则表达式实现文本数据提取与处理

正则表达式零宽断言详解

文本匹配规则

excel中使用正则

1-31正则表达式

anki之高级筛选

BUAA_OO_2021_第一单元总结

insert语句递增写法

sublime text 3在行前插入递增数字序号的方法

字符串只允许数字和英文的正则

powerbuilder 正则表达式

Shell脚本编写的高级技巧利用正则表达式进行字符串匹配

JAVA正则表达式的三种模式:贪婪,勉强和占有的讨论

go regexp匹配规则

oracle regexp_substr 实现原理

基本的元字符回溯引用和前后查匹配模式

elasticsearch query dsl正则

oracle sql正则表达式

GA-设置目标

仅匹配全角片假名的正则表达式

最新文章

java正则表达式选择题

工龄小数点提取

非零金额正则表达式

提取文本中数字的函数

vue数字相加小数点变长-概述说明以及解释

vue validate 正则验证小数长度

标签列表

688IT编程网

Faiss教程:入门

发表评论

推荐文章

java正则表达式 选择题

一种基于正则表达式的DBC文件解析及报文分析方法[发明专利]

工龄小数点提取

非零金额 正则表达式

提取文本中数字的函数

热门文章

利用正则表达式实现文本数据提取与处理

正则表达式零宽断言详解

文本匹配规则

excel中使用正则

1-31正则表达式

anki之高级筛选

BUAA_OO_2021_第一单元总结

insert语句递增写法

sublime text 3在行前插入递增数字序号的方法

字符串只允许数字和英文的正则

powerbuilder 正则表达式

Shell脚本编写的高级技巧利用正则表达式进行字符串匹配

JAVA正则表达式的三种模式:贪婪,勉强和占有的讨论

go regexp匹配规则

oracle regexp_substr 实现原理

基本的元字符 回溯引用和前后查 匹配模式

elasticsearch query dsl正则

oracle sql正则表达式

GA-设置目标

仅匹配全角片假名的正则表达式

最新文章

java正则表达式 选择题

工龄小数点提取

非零金额 正则表达式

提取文本中数字的函数

vue数字相加小数点变长-概述说明以及解释

vue validate 正则验证小数长度

标签列表

java正则表达式选择题

非零金额正则表达式

基本的元字符回溯引用和前后查匹配模式

java正则表达式选择题

非零金额正则表达式