MySQL Full-Text Search Relevance Queries: MySQL Full-Text Search (Natural Language Full-Text Search)
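Natural Language Full-Text Search
By default, or when the modifier is set to IN NATURAL LANGUAGE MODE, MATCH() performs a natural language search. For each row in the table, MATCH() returns a relevance value.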
mysql> CREATE TABLE articles (
-> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
-> title VARCHAR(200),
-> body TEXT,
-> FULLTEXT ( title , body )
-> ) ENGINE=INNODB;
Query OK, 0 rows affected (0.04 sec)
mysql> INSERT INTO articles (title,body) VALUES
-> ('MySQL Tutorial','DBMS stands for DataBase ...'),
-> ('How To Use MySQL Well','After you went through a ...'),
-> ('Optimizing MySQL','In this tutorial we will show ...'),
-> ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
-> ('MySQL vs. YourSQL','In the following database comparison ...'),
-> ('MySQL Security','When configured properly, MySQL ...');
Query OK, 6 rows affected (0.00 sec)
Records: 6 Duplicates: 0 Warnings: 0
mysql> select * from articles
-> where match(title,body)
-> against('database' in natural language mode);
+----+-------------------+------------------------------------------+
| id | title | body |
+----+-------------------+------------------------------------------+
| 1 | MySQL Tutorial | DBMS stands for DataBase ... |
| 5 | MySQL vs. YourSQL | In the following database comparison ... |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)
mysql>
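Because natural language mode is the default, the modifier can be left out. A minimal sketch, reusing the articles table created above; it is expected to behave like the query with the explicit modifier:

```sql
-- No modifier: MATCH() ... AGAINST() falls back to natural language mode.
SELECT * FROM articles
WHERE MATCH(title, body) AGAINST('database');
```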
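By default, the search is case-insensitive. To make it case-sensitive, use a binary collation for the indexed columns. For example, a column that uses the latin1 character set can be given the latin1_bin collation.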
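A sketch of the binary-collation approach. The table name articles_cs is hypothetical and not part of the original example; the idea is only that both indexed columns carry latin1_bin, so 'DataBase' and 'database' should be indexed as different words:

```sql
-- Hypothetical table: same layout as articles, but with a binary collation
-- on both indexed columns so that full-text matching is case-sensitive.
CREATE TABLE articles_cs (
  id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
  title VARCHAR(200) CHARACTER SET latin1 COLLATE latin1_bin,
  body TEXT CHARACTER SET latin1 COLLATE latin1_bin,
  FULLTEXT (title, body)
) ENGINE=InnoDB;

INSERT INTO articles_cs (title, body) VALUES
  ('MySQL Tutorial', 'DBMS stands for DataBase ...'),
  ('MySQL vs. YourSQL', 'In the following database comparison ...'),
  ('MySQL Security', 'When configured properly, MySQL ...');

-- Search with the exact capitalization used in the first row.
SELECT id, title FROM articles_cs
WHERE MATCH(title, body) AGAINST('DataBase' IN NATURAL LANGUAGE MODE);
```

All columns of one FULLTEXT index need the same character set and collation, which is why both title and body are declared the same way here.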
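When MATCH() is used in a WHERE clause, the returned rows are automatically sorted by relevance, highest first, provided the query has no explicit ORDER BY and the search is done through the full-text index.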
mysql> INSERT INTO articles (title,body) VALUES
-> ('oracle Tutorial','DBMS stands for DataBase ...DataBase');
Query OK, 1 row affected (0.00 sec)
mysql> select * from articles
-> where match(title,body)
-> against('database' in natural language mode);
+----+-------------------+------------------------------------------+
| id | title | body |
+----+-------------------+------------------------------------------+
| 7 | oracle Tutorial | DBMS stands for DataBase ...DataBase |
| 1 | MySQL Tutorial | DBMS stands for DataBase ... |
| 5 | MySQL vs. YourSQL | In the following database comparison ... |
+----+-------------------+------------------------------------------+
3 rows in set (0.00 sec)
mysql>
You can also count the number of matches:
# Using the index
mysql> SELECT
-> COUNT(*)
-> FROM
-> articles
-> WHERE
-> MATCH (title , body) AGAINST ('database' IN NATURAL LANGUAGE MODE);
+----------+
| COUNT(*) |
+----------+
| 2 |
+----------+
1 row in set (0.00 sec)
mysql>
# Using a full table scan
mysql> SELECT
-> COUNT(IF(MATCH (title , body) AGAINST ('database' IN NATURAL LANGUAGE MODE),
-> 1,
-> NULL)) AS count
-> FROM
-> articles;
+-------+
| count |
+-------+
| 3 |
+-------+
1 row in set (0.00 sec)
mysql>
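For a natural language full-text search, the columns named in MATCH() must be the same as the columns in a FULLTEXT index. In the example above, searching title or body separately would require a separate FULLTEXT index on each column.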
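A sketch of the per-column setup, assuming the articles table above. The index names ft_title and ft_body are illustrative. The indexes are created in separate statements because InnoDB builds one FULLTEXT index at a time:

```sql
-- One FULLTEXT index per column (index names are illustrative).
CREATE FULLTEXT INDEX ft_title ON articles (title);
CREATE FULLTEXT INDEX ft_body ON articles (body);

-- Each MATCH() column list now corresponds to exactly one index.
SELECT id, title FROM articles
WHERE MATCH(title) AGAINST('tutorial' IN NATURAL LANGUAGE MODE);

SELECT id, body FROM articles
WHERE MATCH(body) AGAINST('tutorial' IN NATURAL LANGUAGE MODE);
```

The original combined index on (title, body) stays in place, so MATCH(title, body) keeps working as before.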
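The examples above show the basic use of MATCH(): rows are returned in descending order of relevance.
The next example shows how to output the relevance values explicitly. The returned rows are not ordered, because the SELECT statement has neither a WHERE clause nor an ORDER BY clause.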
mysql> SELECT
-> id,
-> MATCH (title , body) AGAINST ('database' IN NATURAL LANGUAGE MODE) AS score
-> FROM
-> articles;
+----+---------------------+
| id | score |
+----+---------------------+
| 1 | 0.22764469683170319 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0.22764469683170319 |
| 6 | 0 |
+----+---------------------+
6 rows in set (0.00 sec)
mysql>
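The next example is more complex: the query returns the relevance values and also returns the rows in descending order of relevance. To do this, MATCH() is specified twice, once in the SELECT list and once in the WHERE clause. This adds no extra overhead, because the optimizer notices that the two MATCH() calls are identical and runs the full-text search only once.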
mysql> SELECT
-> id,
-> body,
-> MATCH (title , body) AGAINST ('Security implications of running MySQL as root' IN NATURAL LANGUAGE MODE) AS score
-> FROM
-> articles
-> WHERE
-> MATCH (title , body) AGAINST ('Security implications of running MySQL as root' IN NATURAL LANGUAGE MODE);
+----+------------------------------------------+----------------------------+
| id | body | score |
+----+------------------------------------------+----------------------------+
| 4 | 1. Never run mysqld as root. 2. ... | 0.6055193543434143 |
| 6 | When configured properly, MySQL ... | 0.6055193543434143 |
| 1 | DBMS stands for DataBase ... | 0.000000001885928302414186 |
| 2 | After you went through a ... | 0.000000001885928302414186 |
| 3 | In this tutorial we will show ... | 0.000000001885928302414186 |
| 5 | In the following database comparison ... | 0.000000001885928302414186 |
+----+------------------------------------------+----------------------------+
6 rows in set (0.00 sec)
mysql>
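When the order of the result matters to the application, it can be stated outright instead of relying on the automatic sort. A sketch against the same articles table; the tie-break on id and the LIMIT of 3 are illustrative choices, not part of the original example:

```sql
SELECT id, title,
       MATCH(title, body) AGAINST('database' IN NATURAL LANGUAGE MODE) AS score
FROM articles
WHERE MATCH(title, body) AGAINST('database' IN NATURAL LANGUAGE MODE)
ORDER BY score DESC, id ASC   -- id breaks ties between equal scores
LIMIT 3;
```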
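A phrase enclosed in double quotation marks matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and then searches the index for those words. Nonword characters do not have to match exactly; a match only needs to contain the same words in the same order. For example, "test phrase" matches "test, phrase".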
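A sketch of a phrase search on the articles table above. The phrase itself is only an example; the double quotes sit inside the single-quoted SQL string:

```sql
-- The outer single quotes delimit the SQL string literal;
-- the inner double quotes mark the phrase to be matched as a unit.
SELECT id, title, body FROM articles
WHERE MATCH(title, body)
      AGAINST('"database comparison"' IN NATURAL LANGUAGE MODE);
```

Phrase matching with double quotes is also an operator of boolean mode, so if a given engine or version treats the quotes differently in natural language mode, IN BOOLEAN MODE is the alternative to try.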
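The full-text parser treats any sequence of letters, digits, and underscores as a word. An apostrophe (') can also be part of a word, but no more than one in a row: aaa'bbb is treated as one word, whereas aaa''bbb is treated as two. Apostrophes at the beginning or end of a word are stripped.
The built-in parser decides where words begin and end by looking for delimiter characters such as commas, spaces, and periods. If words are not separated by delimiters, as in Chinese, the parser cannot tell where a word begins or ends.
For such text, users therefore have to preprocess it with some delimiter before it can be searched. As of MySQL 5.7.6, the ngram parser plugin adds support for Chinese, Japanese, and Korean, and the MeCab parser plugin adds support for Japanese.
You can also write your own parser plugin. Sample code is in the plugin/fulltext directory of a MySQL source distribution.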
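A sketch of the ngram parser in use, assuming MySQL 5.7.6 or later. The table name articles_cjk and the sample rows are hypothetical:

```sql
-- Hypothetical table whose FULLTEXT index is built with the ngram parser.
CREATE TABLE articles_cjk (
  id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
  title VARCHAR(200),
  body TEXT,
  FULLTEXT (title, body) WITH PARSER ngram
) ENGINE=InnoDB CHARACTER SET utf8mb4;

INSERT INTO articles_cjk (title, body) VALUES
  ('数据库管理', '在本教程中我将向你展示如何管理数据库'),
  ('数据库应用开发', '学习开发数据库应用程序'),
  ('旅行笔记', '这篇文章和技术无关');

-- The search string is itself split into n-grams before matching.
SELECT id, title FROM articles_cjk
WHERE MATCH(title, body) AGAINST('数据库' IN NATURAL LANGUAGE MODE);
```

The token length is governed by the ngram_token_size system variable, which defaults to 2, so the text is indexed as overlapping two-character tokens.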
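Some words are ignored in full-text searches:
-- Words that are too short. The default minimum length is 3 characters for InnoDB and 4 characters for MyISAM. It is controlled by innodb_ft_min_token_size for InnoDB and by ft_min_word_len for MyISAM.
-- Words in the stopword list. A stopword is a word such as "the" or "some" that is so common it is considered to carry no semantic value. There is a built-in stopword list, which can be replaced with your own.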
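The current settings can be inspected from any client session. A short sketch using the variables named above and the INFORMATION_SCHEMA table that exposes InnoDB's built-in stopword list:

```sql
-- Minimum indexed word length for InnoDB and for MyISAM.
SHOW VARIABLES LIKE 'innodb_ft_min_token_size';
SHOW VARIABLES LIKE 'ft_min_word_len';

-- Default stopword list used by InnoDB FULLTEXT indexes.
SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
```

Both length variables can only be set at server startup, and existing FULLTEXT indexes have to be rebuilt after a change for the new value to take effect.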
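Every valid word in the collection and in the query is weighted according to its significance in the collection or query. A word that appears in many rows therefore gets a lower weight, and a rare word gets a higher one. The word weights are then combined to compute the relevance of each row.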
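The MySQL reference manual describes InnoDB's ranking for a single-word query as a TF-IDF variant: IDF = LOG10(total rows / matching rows), and rank = TF * IDF * IDF, where TF is the number of times the word occurs in the row. Assuming that formula, the score for 'database' in the six-row table can be checked by hand: 6 rows in total, 2 rows contain the word, and it occurs once in each.

```sql
-- TF = 1, IDF = LOG10(6 / 2); rank = TF * IDF * IDF
SELECT 1 * POW(LOG10(6 / 2), 2) AS expected_score;
```

The result is approximately 0.2276, which agrees with the score of 0.22764469683170319 reported for rows 1 and 5 in the earlier output.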
On tables that contain only a few rows, full-text search may not return meaningful results.
