全⽂索引的原理及定义
全⽂索引时将存储在数据库中的整本书或整篇⽂章中的任意内容信息查出来的技术。它可以根据需要获取全⽂中有关章,节,段,句,词等信息,也可以进⾏各种统计和分析。
之前的MySQL数据库中,INNODB存储引擎并不⽀持全⽂索引技术,⼤多数的⽤户转向MyISAM存储引擎,不过这可能进⾏表的拆分,并需要将进⾏全⽂索引的数据存储为MyIsam表。这样的确能够解决逻辑业务的需求,但是却丧失了INNODB存储引擎的事务性,⽽这在⽣产环境应⽤中同样是⾮常关键的。
从INNODB1.2.x版本开始,INNODB存储引擎开始⽀持全⽂索引,其⽀持myisam的全部功能,并且还⽀持其他的⼀些特性。
倒排索引
全⽂索引通常使⽤倒排索引来实现。倒排索引同B+树索引⼀样,也是⼀种索引结构。它在辅助表中存储了单词与单词⾃⾝在⼀个或多个⽂档中所在的位置之间的映射。这通常利⽤关联数组实现,其拥有两种表现形式。
inverted file index, 其表现形式为{单词,单词所在⽂档的ID}
full inverted index,其表现形式为{单词,(单词所在⽂档的ID,再具体⽂档中的位置)}
【具体来说就是inverted file index只能根据单词到对应的⽂档,⽽full inverted index不仅能到对应的⽂档,还能到单词在⽂档中的具体位置】
INNODB全⽂索引
INNODB存储从1.2版本开始⽀持全⽂索引的技术,其采⽤full inverted index的技术。在INNODB存储引擎中将(documentid, position)视为⼀个“ilist”。因此在全⽂索引的表中,有两个列,⼀个是word字段,另⼀个是ilist字段,并且在word字段上设有索引。此外,INNODB存储引擎在ilist字段上存放了position信息,故可以进⾏proximity search(邻近查),⽽myisam不⽀持这个。
倒排索引需要将word字段存放在⼀张表中,这个表称为辅助表。在INNODB存储引擎中,为了提⾼全⽂检索的并⾏性能,共有6张辅助表,⽬前每张表根据word的latin编码进⾏分区。辅助表时持久的表,存放于磁盘上。然⽽在innodb存储引擎的全⽂索引中,还有另外⼀个重要的概念FTS index cache(全⽂索引缓存),其⽤来提⾼全⽂检索的性能。FTS index cache是⼀个红⿊树结构,其根据(word, ilist)进⾏排序。这意味着插⼊的数据已经更新了对应的表,但是对全⽂索引的更新可能在分词操作后还在FTS index cache中,辅助表可能还没有更新。innodb存储引擎会批量对辅助表进⾏更新,⽽不是每次插⼊后更新⼀次辅助表。当对全⽂索引进⾏查询时,辅助表会⾸先将FTS index cache中对应的word字段合并到辅助表中,然后再进⾏查询。
innodb存储引擎允许⽤户查看指定倒排索引的辅助表中的分词信息,可以通过设置参数innodb_ft_aux_table来观察倒排索引的辅助表。
mysql>set global innodb_ft_aux_table="mytest/tb1"; #对应的表上必须有全⽂索引,不然会报错。
Query OK, 0 rows affected (0.00 sec)
mysql> use information_schema; #在information_schema库中,查看对应表的分词信息
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> desc innodb_ft_index_table;
+--------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+---------------------+------+-----+---------+-------+
| WORD | varchar(337) | NO | | | |
| FIRST_DOC_ID | bigint(21) unsigned | NO | | 0 | |
| LAST_DOC_ID | bigint(21) unsigned | NO | | 0 | |
| DOC_COUNT | bigint(21) unsigned | NO | | 0 | |
| DOC_ID | bigint(21) unsigned | NO | | 0 | |
| POSITION | bigint(21) unsigned | NO | | 0 | |
+--------------+---------------------+------+-----+---------+-------+
6 rows in set (0.00 sec)
mysql>
#对于innodb存储引擎,总是在事务提交时将分词写⼊FTS Index Cache,然后再通过批量更新写⼊到磁盘。虽然INNODB存储引擎通过⼀种延时,批量的写⼊⽅式来提⾼数据库的性能,
#但是上述操作仅在事务提交时发⽣。mysql操作官方文档
当数据库关闭时,在FTS Index Cache中的数据库会同步到磁盘上的辅助表。然⽽,如果当数据库发⽣宕机时,⼀些FTS Index Cache中的数据可能未被同步到磁盘上。那么下次重启数据库时,当⽤户对表进⾏全⽂索引时,INNODB存储引擎会⾃动读取未完成的⽂档,然后进⾏分词操作,再将分词的结果放⼊到FTS Index Cache中。
参数innodb_ft_cache_size⽤来控制FTS Index Cache的⼤⼩。当该缓存满时,会将其中的(word,ilist)分词信息同步到磁盘的辅助表中。增⼤该参数可以提⾼数据库的全⽂索引性能,但是在宕机时,未同步到磁盘中的索引信息可能需要更长时间恢复。
在innodb存储引擎中,为了⽀持全⽂检索,必须有⼀个列与word进⾏映射,在innodb中这个列被命名为FTS_DOC_ID,其类型必须为BIGINT UNSIGNED NOT NULL,并且innodb存储引擎会在该列上加上⼀个名为FTS_DOC_ID_INDEX的唯⼀索引。上述操作由innodb存储引擎⾃⼰完成,⽤户也可以在创建表时⼿动添加,主要对应的约束条件。
⽂档中分词的插⼊操作是在事务提交时完成,然⽽对于删除操作,其再事务提交时,不删除磁盘辅助表中的记录,⽽只是删除FTS cache index中的记录。对于辅助表中被删除的记录,innodb存储引擎会记录其FTS Document id,并将其保存在deleted辅助表中。在设置innodb_ft_aux_table后,⽤户同意可以⽅位
information_schema架构下的表innodb_ft_deleted来观察删除的FTS Document ID。
由于⽂档的DML操作实际并不删除索引中的数据,相反还会在对应deleted表中插⼊记录,因此随着应⽤程序的允许,索引会变得⾮常⼤,即使索引中的有些数据已经被删除,查询也不会选择这类记录。innodb提供了⼀种⽅式,允许⽤户⼿⼯地将已经删除的记录从索引中彻底删除,该命令就是OPTIMIZE TABLE.因为这个命令还会进⾏⼀些其他的操作,如cardinality的重新统计等,若⽤户希望仅对倒排索引进⾏操作,可以通过如下参数设置。
mysql>set global innodb_optimize_fulltext_only =ON;
Query OK, 0 rows affected (0.00 sec)
mysql>
若删除的⽂档⾮常多,那么OPTIMIZE TABLE操作可能需要占⽤⾮常多的时间,这样会影响应⽤程序
的并发性,并极⼤地降低⽤户的响应时间。⽤户可以通过innodb_ft_num_word_optimize来限制每次实际删除分词的数量。该参数默认数值为2000.
mysql> show variables like "innodb_ft_num_word_optimize";
+-----------------------------+-------+
| Variable_name | Value |
+-----------------------------+-------+
| innodb_ft_num_word_optimize |2000|
+-----------------------------+-------+
1 row in set (0.00 sec)
mysql>
上⾯提到过创建表的全⽂索引时,还有⼀个列叫做FTS_DOC_ID;这个列我们可以⾃⼰在创建表时创建,也可以让innodb存储引擎⾃动创建,innodb存储引擎⾃动创建的是隐藏的列。但是在官⽅⽂档中
建议⾃⼰创建。
Defining an FTS_DOC_ID column at CREATE TABLE time is less expensive than creating a full-text index on a table that is already loaded with
data. If an FTS_DOC_ID column is defined on a table prior to loading data, the table and its indexes do not have to be rebuilt to add the new
column. If you are not concerned with CREATE FULLTEXT INDEX performance, leave out the FTS_DOC_ID column to have InnoDB create it for you.
InnoDB creates a hidden FTS_DOC_ID column along with a unique index (FTS_DOC_ID_INDEX) on the FTS_DOC_ID column. If you want to create your own
FTS_DOC_ID column, the column must be defined as BIGINT UNSIGNED NOT NULL and named FTS_DOC_ID (all upper case), as in the following
官⽅⽂档:
下⾯演⽰⼀个实例:
create table tb3(
FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
body text,
FULLTEXT INDEX ft_index (body),
UNIQUE INDEX FTS_DOC_ID_INDEX (FTS_DOC_ID)
);
#上⾯提到过,除了创建FTS_DOC_ID,还要在这个字段上创建⼀个名为FTS_DOC_ID_INDEX的唯⼀索引;
在表中插⼊数据:
mysql> insert into tb3 select NULL, "pLease porridge in the pot";
Query OK, 1 row affected (0.01 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> insert into tb3 select NULL,"please say sorry";
Query OK, 1 row affected (0.01 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> insert into tb3 select NULL,"nine years old";
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> insert into tb3 select NULL,"some like it hot, some like it cold";
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> insert into tb3 select NULL,"i like coding";
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> insert into tb3 select NULL,"fuck the company";
Query OK, 1 row affected (0.01 sec)
Records: 1 Duplicates: 0 Warnings: 0
然后设置innodb_ft_aux_table的值,查看分词信息:
mysql>select*from tb3;
+------------+-------------------------------------+
| FTS_DOC_ID | body |
+------------+-------------------------------------+
|1| pLease porridge in the pot |
|2| please say sorry |
|3| nine years old |
|4|some like it hot, some like it cold |
|5| i like coding |
|6| fuck the company |
+------------+-------------------------------------+
6 rows in set (0.00 sec)
mysql>set global innodb_ft_aux_table="mytest/tb3"; #设置查看分词信息
Query OK, 0 rows affected (0.00 sec)
mysql>use information_schema;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with-A
Database changed
mysql>select*from innodb_ft_index_table; #MySQL5.7中加⼊了⼀个cache表,这时候table表时空的
Empty set (0.01 sec)
mysql>select*from innodb_ft_index_cache;
+----------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+----------+--------------+-------------+-----------+--------+----------+
| coding |5|5|1|5|7|
| cold |4|4|1|4|31|
| company |6|6|1|6|9|
| fuck |6|6|1|6|0|
| hot |4|4|1|4|13|
|like|4|5|2|4|5|
|like|4|5|2|4|18|
|like|4|5|2|5|2|
| nine |3|3|1|3|0|
| old |3|3|1|3|11|
| please |1|2|2|1|0|
| please |1|2|2|2|0|
| porridge |1|1|1|1|7|
| pot |1|1|1|1|23|
| say |2|2|1|2|7|
|some|4|4|1|4|0|
|some|4|4|1|4|18|
| sorry |2|2|1|2|11|
| years |3|3|1|3|5|
+----------+--------------+-------------+-----------+--------+----------+
19 rows in set (0.00 sec)
有关innodb_ft_index_cache的作⽤:
The INNODB_FT_INDEX_CACHE table provides token information about newly inserted rows in a FULLTEXT index. To avoid expensive index reorganization
during DML operations, the information about newly indexed words is stored separately, and combined with the main search index only when OPTIMIZE
TABLE is run, when the server is shut down, or when the cache size exceeds a limit defined by the in
nodb_ft_cache_size or innodb_ft_total_cache_size
system variable.
#INNODB_FT_INDEX_CACHE提供关于新插⼊全⽂索引⾏的信息,为了避免在DML操作期间,昂贵的索引重组,新索引字段的信息单独存储。只有当运⾏OPTIMIZE TABLE命令,或者服务器关机,或者缓存的⼤⼩超过innodb_ft_cache_size或者innodb_ft_total_cache_size的限制时,才会合并索引信息。
执⾏命令,合并全⽂索引的信息:
mysql>set global innodb_optimize_fulltext_only =ON;
Query OK, 0 rows affected (0.00 sec)
mysql> optimize table mytest.tb3;
+------------+----------+----------+----------+
|Table| Op | Msg_type | Msg_text |
+------------+----------+----------+----------+
| mytest.tb3 | optimize | status | OK |
+------------+----------+----------+----------+
1 row in set (0.0
2 sec)
mysql>select*from innodb_ft_index_cache; #执⾏之后,缓存中没有数据,table表中有对应的分词信息。
Empty set (0.00 sec)
mysql>select*from innodb_ft_index_table;
+----------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+----------+--------------+-------------+-----------+--------+----------+
| coding |5|5|1|5|7|
| cold |4|4|1|4|31|
| company |6|6|1|6|9|
| fuck |6|6|1|6|0|
| hot |4|4|1|4|13|
|like|4|5|2|4|5|
|like|4|5|2|4|18|
|like|4|5|2|5|2|
| nine |3|3|1|3|0|
| old |3|3|1|3|11|
| please |1|2|2|1|0|
| please |1|2|2|2|0|
| porridge |1|1|1|1|7|
| pot |1|1|1|1|23|
| say |2|2|1|2|7|
|some|4|4|1|4|0|
|some|4|4|1|4|18|
| sorry |2|2|1|2|11|
| years |3|3|1|3|5|
+----------+--------------+-------------+-----------+--------+----------+
19 rows in set (0.00 sec)
删除全⽂索引的⼀⾏记录:
mysql>delete from tb3 where fts_doc_id=3;
Query OK, 1 row affected (0.05 sec)
mysql>select*from information_schema.innodb_ft_deleted;
+--------+
| DOC_ID |
+--------+
|3|
+--------+
1 row in set (0.00 sec)
#The INNODB_FT_DELETED table stores rows that are deleted from the FULLTEXT index for an InnoDB table. To avoid expensive index reorganization
during DML operations for an InnoDB FULLTEXT index, the information about newly deleted words is stored separately, filtered out of search
results when you do a text search, and removed from the main search index only when you issue an OPTIMIZE TABLE statement for the InnoDB table.
#删除的时候并不会真实的删除,只是把删除的row记录在INNODB_FT_DELETED表中,查询的时候会过滤掉,如果你执⾏了OPTIMIZE TABLE语句,那么才会真正的删除。
mysql>set global innodb_optimize_fulltext_only =ON;
Query OK, 0 rows affected (0.00 sec)
mysql> optimize table mytest.tb3; #执⾏完这条语句之后,才会真正的删除。
+------------+----------+----------+----------+
|Table| Op | Msg_type | Msg_text |
+------------+----------+----------+----------+
| mytest.tb3 | optimize | status | OK |
+------------+----------+----------+----------+
1 row in set (0.03 sec)
mysql> select * from information_schema.innodb_ft_being_deleted;
+--------+
| DOC_ID |
+--------+
| 3 |
+--------+
1 row in set (0.00 sec)
#The INNODB_FT_BEING_DELETED table provides a snapshot of the INNODB_FT_DELETED table; it is used only during an OPTIMIZE TABLE maintenance
operation. When OPTIMIZE TABLE is run, the INNODB_FT_BEING_DELETED table is emptied, and DOC_ID values are removed from the INNODB_FT_DELETED
table. Because the contents of INNODB_FT_BEING_DELETED typically have a short lifetime, this table has limited utility for monitoring or
debugging.
#INNODB_FT_BEING_DELETED表是INNODB_FT_DELETED的快照,在⼿动运⾏OPTIMIZE TABLE期间使⽤,当OPTIMIZE TABLE运⾏时,INNODB_FT_BEING_DELETED表被清空,并且DOC_ID #的值从INNODB_FT_DELETED中删除,因为INNODB_FT_BEING_DELETED表中的内容⽣命周期很短,因此使⽤这个表的作⽤是有限的。
stopword:在分词操作时,有些词因为没有实际的意义,并不需要对其进⾏分次操作。例如,对于定冠词the。在innodb存储引擎中有⼀张默认的stopword表,如下!
mysql> show create table information_schema.INNODB_FT_DEFAULT_STOPWORD\G
***************************1. row ***************************
Table: INNODB_FT_DEFAULT_STOPWORD
Create Table: CREATE TEMPORARY TABLE `INNODB_FT_DEFAULT_STOPWORD` (
`value` varchar(18) NOT NULL DEFAULT''
) ENGINE=MEMORY DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
mysql>
除了使⽤innodb存储引擎默认的stopword表,⽤户还可以使⽤innodb_ft_server_stopword_table来定义stopword列表。
mysql>create table user_stopword( value varchar(30) );
Query OK, 0 rows affected (0.04 sec)
mysql>set global innodb_ft_server_stopword_table="mytest/user_stopword";
Query OK, 0 rows affected (0.00 sec)
innodb存储引擎的全⽂索引还存在以下限制:
每张表只能有⼀个全⽂索引。
由多列组合⽽成的全⽂索引必须使⽤相同的字符集与排序规则。
不⽀持没有单词界定符的语⾔,如中⽂,⽇语,韩语等。
博⽂中的内容摘⾃: MySQL技术内幕-innodb存储引擎。以及官⽅⽂档。
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系QQ:729038198,我们将在24小时内删除。
发表评论