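Elasticsearch Ingest-Attachment
1. Introduction
Elasticsearch can only process text; it cannot process document files directly. Ingest-Attachment is an out-of-the-box plugin that replaces the older Mapper-Attachment plugin. It extracts the text from files in mainstream formats (PDF, DOC, EXCEL, etc.) and imports it automatically.
Elasticsearch 5.x introduced a new feature, the Ingest Node, which supports defining named processor pipelines. A pipeline can contain multiple processors that preprocess data before it is inserted into Elasticsearch. The Ingest Attachment Processor Plugin provides the key processor, attachment, which treats a specified field of each incoming document as a file and extracts its text.
Because Elasticsearch is a JSON-based document store, an attachment must be Base64-encoded before it is inserted into Elasticsearch.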
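As a rough illustration of that encoding step, the sketch below Base64-encodes a short string and wraps it in a JSON body. The field name data matches the pipeline used later in this post; the variable names are only illustrative.

```shell
#!/bin/sh
# Encode the UTF-8 bytes of the sample text.
# printf '%s' is used instead of echo so that no trailing newline is encoded.
encoded=$(printf '%s' '我爱你' | base64)

# Wrap the Base64 string in a JSON object. "data" is the field that the
# attachment processor is told to read in the pipeline defined later.
payload=$(printf '{"data":"%s"}' "$encoded")

printf '%s\n' "$payload"
```

Run under a UTF-8 locale, this prints {"data":"5oiR54ix5L2g"}, the same value used in the indexing request later in this post.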
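Open question: with xls and xlsx files, the sheets cannot be indexed separately; the whole file can only be inserted into ES as a single document. I have not yet found a good way around this.
2. Environment
3. Implementation Steps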
curl -X PUT "localhost:9200/_ingest/pipeline/attachment" -d '{
"description" : "Extract attachment information",
"processors":[
{
"attachment":{
"field":"data",
"indexed_chars" : -1,
"ignore_missing":true
}
},
{
"remove":{"field":"data"}
}]}'
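Before indexing real documents, a pipeline like this can be tried out with the ingest simulate API, which runs the processors on sample documents without storing anything. The following is a minimal sketch, assuming Elasticsearch is reachable at localhost:9200 and the pipeline is named attachment as above; the function name simulate_attachment and the ES_URL variable are hypothetical. The Content-Type header is optional on 5.x but required from Elasticsearch 6.0 onward.

```shell
#!/bin/sh
# Base URL of the cluster (assumption: a local single node, as in this post).
ES_URL="${ES_URL:-localhost:9200}"

# Request body for the simulate API: one sample document whose "data"
# field holds the Base64 form of the text "我爱你".
simulate_body='{"docs":[{"_source":{"data":"5oiR54ix5L2g"}}]}'

# Hypothetical helper: run the sample document through the "attachment"
# pipeline without indexing it. Requires a running cluster.
simulate_attachment() {
  curl -s -X POST "$ES_URL/_ingest/pipeline/attachment/_simulate?pretty" \
       -H 'Content-Type: application/json' \
       -d "$simulate_body"
}

# Only the body is printed here; call simulate_attachment against a live node.
printf '%s\n' "$simulate_body"
```

Calling simulate_attachment against a running node should return the document as the pipeline would transform it, which makes it easy to check the extracted attachment fields before any data is stored.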
curl -X PUT "localhost:9200/pdftest/pdf/1?pipeline=attachment" -d '
{
"data":"5oiR54ix5L2g"
}'
Running GET pdftest/_search in Kibana returns:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "pdftest",
"_type": "pdf",
"_id": "1",
"_score": 1,
"_source": {
"attachment": {
"content_type": "text/plain; charset=UTF-8",
"language": "lt",
"content": "我爱你",
"content_length": 4
}
}
}
]
}
}
Note: if the data field is not removed when the pipeline is created, that is, if the creation command is
curl -X PUT "localhost:9200/_ingest/pipeline/attachment" -d '{
"description" : "Extract attachment information",
"processors":[
{
"attachment":{
"field":"data",
"indexed_chars" : -1,
"ignore_missing":true
}
}]}'
then the final inserted result is:
"_source": {
"data": "5oiR54ix5L2g",
"attachment": {
"content_type": "text/plain; charset=UTF-8",
"language": "lt",
"content": "我爱你",
"content_length": 4
}
}
Method 2: encode the file while loading it, then import
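Here I put a file containing the three characters "我爱你" on the Windows desktop. In Cygwin it can be Base64-encoded with base64, with a perl one-liner escaping any newlines: base64 -w 0 /cygdrive/c/Users/jk/ | perl -pe's/\n/\\n/g'
Parameter explanation: -w, --wrap=COLS wraps lines after the given number of characters (default 76); 0 disables line wrapping.
Decoding: echo 5oiR54ix5L2g | base64 -d -i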
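The effect of -w 0 and the decode step can be checked locally without Elasticsearch. This is a small sketch, assuming GNU coreutils base64 (the -w flag used in this post is a GNU option); the variable names are illustrative.

```shell
#!/bin/sh
# 100 zero bytes encode to 136 Base64 characters, which is more than one
# 76-column line, so the default output is wrapped.
wrapped_lines=$(head -c 100 /dev/zero | base64 | wc -l)

# With -w 0 the output is a single unbroken string with no newline at all.
unwrapped_lines=$(head -c 100 /dev/zero | base64 -w 0 | wc -l)

echo "default wrapping: $wrapped_lines line(s); with -w 0: $unwrapped_lines newline(s)"

# Round trip: decoding the encoded sample gives back the original text.
decoded=$(echo 5oiR54ix5L2g | base64 -d)
echo "$decoded"
```

An unwrapped string matters here because a literal line break inside a JSON string value is not valid JSON.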
curl -X PUT "localhost:9200/pdftest/pdf/1?pipeline=attachment" -d '
{
"data":"'`base64 -w 0 /cygdrive/c/Users/jk/ | perl -pe's/\n/\\n/g'`'"
}'
The final inserted result is the same as above.
Note: if you switch to a PDF, or the txt file holds so much content that the encoded string becomes very long, the command fails with:
-bash: /cygdrive/c/Windows/system32/curl: Argument list too long
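Explanation: this error means the command's arguments are too long and exceed the maximum the system allows, which is controlled by ARG_MAX. This limit has always existed on Linux systems. It can be checked with the command "getconf ARG_MAX".
Solution: save the JSON to a file and pass that file to curl's -d option, which bypasses the limit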
echo '{"data":"'`base64 -w 0 /cygdrive/c/Users/jk/Desktop/hehe.pdf | perl -pe's/\n/\\n/g'`'"}' >
curl -X PUT "localhost:9200/pdftest/pdf/1?pipeline=attachment" -d @
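The same idea can be written so that the Base64 text is streamed into the body file and never appears as an argument of any command. The following is a sketch under stated assumptions: the temporary files and the 300000 random bytes are stand-ins for a real PDF and a real body file name, and the helper index_body_file is hypothetical.

```shell
#!/bin/sh
# Stand-in for a large PDF: 300000 random bytes in a temporary file (assumption).
src=$(mktemp)
body=$(mktemp)
head -c 300000 /dev/urandom > "$src"

# Stream the JSON body straight into a file. The Base64 text travels through
# a redirect, so it never becomes a command-line argument.
{
  printf '{"data":"'
  base64 -w 0 "$src"
  printf '"}'
} > "$body"

echo "ARG_MAX on this system: $(getconf ARG_MAX)"
echo "request body size: $(wc -c < "$body") bytes"

# Hypothetical helper: send the body file to the pipeline (needs a running cluster).
index_body_file() {
  curl -s -X PUT "localhost:9200/pdftest/pdf/1?pipeline=attachment" \
       -H 'Content-Type: application/json' \
       -d @"$body"
}
```

Because curl reads the body from the file itself, the size of the document is no longer bounded by the argument-length limit.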
The final inserted result is:
"_source": {
"attachment": {
"date": "2018-10-20T08:00:17Z",
"content_type": "application/pdf",
"author": "jk",
"language": "lt",
"content": "我爱你",
"content_length": 7
}
}
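3.3 Full-text search on a specific field (mind the name of the field being queried)
Better to stop using Cygwin, which causes all sorts of problems (mainly because I am not yet fluent with it); typing queries into Kibana is convenient and avoids other baffling errors. For example, entering this query command in Cygwin: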
curl -X GET "localhost:9200/pdftest/pdf/_search?pretty" -d '
{
"query":{
"match":{
"attachment.content":"爱"
}
}
}'
Error:
"type" : "json_parse_exception",
"reason" : "Invalid UTF-8 start byte 0xb0\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@69645e80; line: 5, column: 30]"
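Solution: in Cygwin, run export LC_ALL="zh_CN" rather than export LC_ALL="zh_CN.UTF-8", then run the locale command to check the setting. LC_ALL acts as an override: when it is set, its value overrides all LC_* settings, while the value of LANG itself is not affected. In a locale name, zh stands for Chinese, CN for mainland China, and GBK or UTF-8 denotes the character set.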
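One way to see what the shell actually sends is to dump the bytes it produces for the query term. This is a small sketch, assuming the script is saved as UTF-8 and run in a UTF-8 terminal. In UTF-8, 爱 is the three bytes e7 88 b1, whereas in GBK it is b0 ae, whose first byte matches the 0xb0 reported in the error above.

```shell
#!/bin/sh
# Dump the bytes of the query term as hex. od -An drops the offset column,
# -tx1 prints one byte per group; tr removes the spaces and the newline.
hex=$(printf '%s' '爱' | od -An -tx1 | tr -d ' \n')
echo "bytes sent for the query term: $hex"

case "$hex" in
  e788b1) echo "UTF-8: Elasticsearch can parse this" ;;
  b0ae)   echo "GBK: expect an Invalid UTF-8 start byte error" ;;
  *)      echo "unexpected encoding" ;;
esac
```

Running the same check in the terminal that produced the error shows directly whether a locale change had the intended effect.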
In Kibana, enter:
GET pdftest/_search
{
"query" : {
"match" : {
"attachment.content":"爱"
}
}
}
4. Supplement: Installing the Chinese Word-Segmentation Plugin
.\bin\elasticsearch-plugin install github/medcl/elasticsearch-analysis-ik/releases/download/v5.4.2/elasticsearch-analysis-ik-5.4.2.zip
bin\elasticsearch-plugin install file:///C:\Users\jk\Desktop\elasticsearch-analysis-ik-5.4.2.zip
But it always fails with:
ERROR: `elasticsearch` directory is missing in the plugin zip
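In the end I had to do it this way: create a folder named ik inside the plugins folder of the Elasticsearch installation directory, and extract elasticsearch-analysis-ik-5.4.2.zip into it.
In fact, the official documentation already states that versions below 5.5.1 are to be installed by extracting the archive.
ik_smart: performs the coarsest-grained segmentation.
ik_max_word: performs the finest-grained segmentation of the text.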
Test 1:
GET /_analyze
{
"analyzer": "ik_smart",
"text":"中华人民共和国"
}
Result:
{
"tokens": [
{
"token": "中华人民共和国",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 0
}
]
}
Test 2:
GET /_analyze
{
"analyzer": "ik_max_word",
"text":"中华人民共和国"
}
Result:
{
"tokens": [
{
"token": "中华人民共和国",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 0
},
{
"token": "中华人民",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "中华",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 2
},
{
"token": "华人",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 3
},
{
"token": "人民共和国",
"start_offset": 2,
"end_offset": 7,
"type": "CN_WORD",
"position": 4
},
{
"token": "人民",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 5
},
{
"token": "共和国",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 6
},
{
"token": "共和",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 7
},
{
"token": "国",
"start_offset": 6,
"end_offset": 7,
"type": "CN_CHAR",
"position": 8
}
]
}
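To apply these analyzers to the extracted attachment text, the field has to be mapped with them when an index is created. What follows is a minimal sketch, assuming Elasticsearch 5.x with the IK plugin installed; the index name pdftest_ik, the temporary file, and the helper create_ik_index are hypothetical, and indexing with ik_max_word while searching with ik_smart is one common pairing rather than the only option.

```shell
#!/bin/sh
# Write the index definition to a temporary file so it can be passed to
# curl with -d @file.
mapping_file=$(mktemp)
cat > "$mapping_file" <<'EOF'
{
  "mappings": {
    "pdf": {
      "properties": {
        "attachment": {
          "properties": {
            "content": {
              "type": "text",
              "analyzer": "ik_max_word",
              "search_analyzer": "ik_smart"
            }
          }
        }
      }
    }
  }
}
EOF

# Hypothetical helper: create the index with this mapping (needs a running cluster).
create_ik_index() {
  curl -s -X PUT "localhost:9200/pdftest_ik" \
       -H 'Content-Type: application/json' \
       -d @"$mapping_file"
}

echo "mapping written to $mapping_file"
```

Documents indexed into such an index through the attachment pipeline would then have attachment.content segmented by IK instead of the default analyzer.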