Elasticsearch Ingest-Attachment - 688IT编程网


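1. Introduction

Elasticsearch can only process text; it cannot handle document files directly. Ingest-Attachment is an out-of-the-box plugin that replaces the older Mapper-Attachment plugin. It extracts the text from mainstream file formats (PDF, DOC, Excel, and so on) and imports it automatically.

Elasticsearch 5.x introduced a new feature, the Ingest Node, which supports defining named processor pipelines. A pipeline can contain several processors that preprocess data before it is inserted into Elasticsearch. The Ingest Attachment Processor Plugin provides the key preprocessor, attachment, which treats a specified field of an incoming document as a file and extracts its text automatically.

Because Elasticsearch is a JSON-based document database, an attachment must be Base64-encoded before it is inserted into Elasticsearch.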

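Thought: when handling xls and xlsx files, the sheets cannot be indexed separately; the whole file has to be inserted into ES as a single document. I have not yet found a good way around this.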

2. Environment

3. Implementation Steps

curl -X PUT "localhost:9200/_ingest/pipeline/attachment" -H 'Content-Type: application/json' -d '{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "indexed_chars": -1,
        "ignore_missing": true
      }
    },
    {
      "remove": {"field": "data"}
    }
  ]
}'

curl -X PUT "localhost:9200/pdftest/pdf/1?pipeline=attachment" -H 'Content-Type: application/json' -d '
{
  "data": "5oiR54ix5L2g"
}'

Running GET pdftest/_search in Kibana returns:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "pdftest",
        "_type": "pdf",
        "_id": "1",
        "_score": 1,
        "_source": {
          "attachment": {
            "content_type": "text/plain; charset=UTF-8",
            "language": "lt",
            "content": "我爱你",
            "content_length": 4
          }
        }
      }
    ]
  }
}

Note: if the pipeline is created without the step that removes data, that is, with this command:

curl -X PUT "localhost:9200/_ingest/pipeline/attachment" -H 'Content-Type: application/json' -d '{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "indexed_chars": -1,
        "ignore_missing": true
      }
    }
  ]
}'

then the stored document keeps the original Base64 field as well:

"_source": {
  "data": "5oiR54ix5L2g",
  "attachment": {
    "content_type": "text/plain; charset=UTF-8",
    "language": "lt",
    "content": "我爱你",
    "content_length": 4
  }
}

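Method 2: encode the file while loading it, then import

Here I placed a text file containing the three characters "我爱你" on the Windows desktop. In Cygwin it can be encoded with base64, followed by a perl one-liner that escapes any newlines: base64 -w 0 /cygdrive/c/Users/jk/ | perl -pe's/\n/\\n/g'

Parameter explanation: -w, --wrap=COLS wraps the encoded output after COLS characters (default 76); 0 disables wrapping. With -w 0 the output is a single line, so the perl step is only a safeguard.

Decoding: echo 5oiR54ix5L2g | base64 -d -i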
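The encode and decode steps above can be checked end to end without Elasticsearch. This is a minimal sketch, assuming the GNU coreutils base64 (as shipped with Cygwin and most Linux distributions); the temporary file is a hypothetical stand-in for the text file on the desktop.

```shell
#!/bin/sh
# Hypothetical sample file standing in for the text file on the desktop.
sample=$(mktemp)

# Write the UTF-8 bytes of "我爱你" without a trailing newline.
# Octal escapes keep the script independent of the terminal encoding.
printf '\346\210\221\347\210\261\344\275\240' > "$sample"

# Encode: -w 0 disables wrapping, so the result is a single line.
encoded=$(base64 -w 0 "$sample")
echo "encoded: $encoded"

# Decode: restores the original bytes.
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "decoded: $decoded"

rm -f "$sample"
```

The encoded value printed here is 5oiR54ix5L2g, the same string used as the "data" field in the earlier request.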

curl -X PUT "localhost:9200/pdftest/pdf/1?pipeline=attachment" -H 'Content-Type: application/json' -d '
{
  "data": "'`base64 -w 0 /cygdrive/c/Users/jk/ | perl -pe's/\n/\\n/g'`'"
}'

The stored result is the same as above.

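Note: with a PDF, or with a txt file that has a lot of content, the encoded string becomes too long and the command fails with:

-bash: /cygdrive/c/Windows/system32/curl: Argument list too long

Explanation: this error means that the arguments passed to the command are too long and exceed the maximum the system allows. That maximum is controlled by ARG_MAX, a limit Linux has always had. It can be checked with the command "getconf ARG_MAX".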
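Whether an encoded file is likely to fit on the command line can be estimated before calling curl. The following is a rough sketch, assuming a POSIX shell with getconf and wc; the function name fits_in_argv is made up for this illustration. The result is optimistic: the environment and the other arguments share the same budget, and Linux additionally caps a single argument (typically at 128 KiB).

```shell
#!/bin/sh
# fits_in_argv FILE: hypothetical helper that estimates whether the
# Base64 form of FILE stays below the ARG_MAX limit.
fits_in_argv() {
    limit=$(getconf ARG_MAX)            # system limit in bytes
    raw=$(wc -c < "$1")                 # size of the raw file in bytes
    encoded=$(( (raw + 2) / 3 * 4 ))    # Base64 emits 4 chars per 3 bytes
    [ "$encoded" -lt "$limit" ]
}

# Hypothetical sample file used only for the demonstration.
sample=$(mktemp)
printf 'hello' > "$sample"

if fits_in_argv "$sample"; then
    echo "small enough to pass inline with -d"
else
    echo "too large: send it from a file instead"
fi

rm -f "$sample"
```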

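Solution: write the JSON to a file and send that file with curl's -d option, which bypasses the limit.

echo '{"data":"'`base64 -w 0 /cygdrive/c/Users/jk/Desktop/hehe.pdf | perl -pe's/\n/\\n/g'`'"}' >

curl -X PUT "localhost:9200/pdftest/pdf/1?pipeline=attachment" -H 'Content-Type: application/json' -d @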
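The request body can also be built entirely on disk, so that the encoded data never passes through a command-line argument. This is a sketch under assumptions: the input and payload files are hypothetical temporary files, base64 is the GNU coreutils version, and the curl call is left commented out because it needs a running Elasticsearch node with the attachment pipeline.

```shell
#!/bin/sh
# Hypothetical input and output files used only for this illustration.
input=$(mktemp)
payload=$(mktemp)
printf 'hello' > "$input"

# Stream the pieces of the JSON body into the payload file.
# Base64 output contains only A-Z, a-z, 0-9, +, / and =,
# so it needs no JSON escaping.
{
    printf '{"data":"'
    base64 -w 0 "$input"
    printf '"}'
} > "$payload"

cat "$payload"
echo

# With a node running, the file would be sent like this:
# curl -X PUT "localhost:9200/pdftest/pdf/1?pipeline=attachment" \
#      -H 'Content-Type: application/json' -d @"$payload"

rm -f "$input"
```

For the five-byte sample input the payload file contains {"data":"aGVsbG8="}.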

The stored result is:

"_source": {
  "attachment": {
    "date": "2018-10-20T08:00:17Z",
    "content_type": "application/pdf",
    "author": "jk",
    "language": "lt",
    "content": "我爱你",
    "content_length": 7
  }
}

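3.3 Full-text search on a specific field (mind the field name)

Avoid Cygwin for this step: it causes all kinds of problems (mainly because I am not yet fluent with it). Typing queries in Kibana is more convenient and does not produce other baffling errors. For example, entering this query in Cygwin:

curl -X GET "localhost:9200/pdftest/pdf/_search?pretty" -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": {
      "attachment.content": "爱"
    }
  }
}'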

Error:

"type" : "json_parse_exception",
"reason" : "Invalid UTF-8 start byte 0xb0\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@69645e80; line: 5, column: 30]"

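Solution: in Cygwin, run export LC_ALL="zh_CN" rather than export LC_ALL="zh_CN.UTF-8", then run the locale command to check the settings. LC_ALL is an environment variable; when it is set, it overrides every LC_* setting, while the value of LANG itself is not changed by it. In a locale name, zh stands for Chinese, CN for mainland China, and GBK or UTF-8 names the character set.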
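The 0xb0 in the error message is consistent with the query text reaching Elasticsearch in GBK rather than UTF-8: the character 爱 is the byte pair b0 ae in GBK but e7 88 b1 in UTF-8. The sketch below makes the difference visible. It assumes od and an iconv build with GBK support are available; the helper name hex_bytes is made up for this example.

```shell
#!/bin/sh
# hex_bytes: hypothetical helper that prints stdin as space-separated hex.
hex_bytes() {
    od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# UTF-8 bytes of the character 爱, written as octal escapes so the
# script does not depend on the terminal encoding.
utf8=$(printf '\347\210\261' | hex_bytes)
echo "UTF-8: $utf8"

# The same character converted to GBK. Its first byte, 0xb0, is not a
# valid UTF-8 start byte, which matches the parser error above.
gbk=$(printf '\347\210\261' | iconv -f UTF-8 -t GBK | hex_bytes)
echo "GBK:   $gbk"
```

Running the locale command before and after changing LC_ALL shows which character set the terminal session is using when it sends the request.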

In Kibana, enter:

GET pdftest/_search
{
  "query": {
    "match": {
      "attachment.content": "爱"
    }
  }
}

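4. Supplement: Installing the Chinese Word-Segmentation Plugin

.\bin\elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.4.2/elasticsearch-analysis-ik-5.4.2.zip

bin\elasticsearch-plugin install file:///C:\Users\jk\Desktop\elasticsearch-analysis-ik-5.4.2.zip

However, this kept failing with:

ERROR: `elasticsearch` directory is missing in the plugin zip

In the end only the following worked: create a folder named ik inside the plugins folder of the Elasticsearch installation directory, and unzip elasticsearch-analysis-ik-5.4.2.zip into it.

In fact, the official site already states that versions below 5.5.1 are installed by unzipping.

ik_smart: performs the coarsest-grained segmentation.

ik_max_word: performs the finest-grained segmentation of the text.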

Test 1:

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国"
}

Result:

{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    }
  ]
}

Test 2:

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国"
}

Result:

{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "中华人民",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "中华",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "华人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "人民共和国",
      "start_offset": 2,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "共和国",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "共和",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "国",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 8
    }
  ]
}
