Logstash filter plugins explained
1. Logstash filter plugins
1.1 grok regex capture
grok is an extremely powerful Logstash filter plugin. It can parse arbitrary text with regular expressions, turning unstructured log data into a structured, easily queried form, and it is currently the best way to parse unstructured log data in Logstash.
The grok syntax rule is:
%{SYNTAX:SEMANTIC}
SYNTAX is the name of the pattern to match against; for example, the NUMBER pattern matches digits, and the IP pattern matches an address such as 127.0.0.1. SEMANTIC is the name of the field that the matched text is stored in.
For example, our test data is:
172.16.213.132 [07/Feb/2018:16:24:19 +0800] "GET /HTTP/1.1" 403 5039
1) An example of extracting the IP
input {
  stdin {
  }
}
filter {
  grok {
    match => { "message" => "%{IPV4:ip}" }
  }
}
output {
  stdout {
  }
}
Now start it up:
[root@:172.31.22.29 /etc/logstash/conf.d]# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d
Sending Logstash logs to /var/log/logstash which is now configured via log4j2.properties
172.16.213.132 [07/Feb/2018:16:24:19 +0800]"GET /HTTP/1.1" 403 5039      # type this line manually
{
       "message" => "172.16.213.132 [07/Feb/2018:16:24:19 +0800]\"GET /HTTP/1.1\" 403 5039",
            "ip" => "172.16.213.132",
      "@version" => "1",
          "host" => "2.internal",
    "@timestamp" => 2019-01-22T09:48:15.354Z
}
2) An example of extracting the timestamp
The input and output sections are omitted here.
filter {
  grok {
    match => { "message" => "%{IPV4:ip}\ \[%{HTTPDATE:timestamp}\]" }
  }
}
Now run the filter:
[root@:172.31.22.29 /etc/logstash/conf.d]# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d
Sending Logstash logs to /var/log/logstash which is now configured via log4j2.properties
172.16.213.132 [07/Feb/2018:16:24:19 +0800]"GET /HTTP/1.1" 403 5039    # type this line manually
{
      "@version" => "1",
     "timestamp" => "07/Feb/2018:16:24:19 +0800",
    "@timestamp" => 2019-01-22T10:16:14.205Z,
       "message" => "172.16.213.132 [07/Feb/2018:16:24:19 +0800]\"GET /HTTP/1.1\" 403 5039",
            "ip" => "172.16.213.132",
          "host" => "2.internal"
}
You can see the filter worked; in the configuration file, grok really is just matching with regular expressions. Let's do a small experiment and add two "-" characters after the IP in the sample data, as shown below:
172.16.213.132 - - [07/Feb/2018:16:24:19 +0800] "GET /HTTP/1.1" 403 5039
The configuration file then has to be written like this:
filter {
  grok {
    match => { "message" => "%{IPV4:ip}\ -\ -\ \[%{HTTPDATE:timestamp}\]" }
  }
}
The match line now has to account for the two "-" characters; otherwise grok cannot match and the data will not be parsed.
Start it up and check the result:
[root@:172.31.22.29 /etc/logstash/conf.d]# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d
Sending Logstash logs to /var/log/logstash which is now configured via log4j2.properties
172.16.213.132 - - [07/Feb/2018:16:24:19 +0800] "GET /HTTP/1.1" 403 5039      # type this line manually and press Enter
{
    "@timestamp" => 2019-01-22T10:25:46.687Z,
            "ip" => "172.16.213.132",
       "message" => "172.16.213.132 - - [07/Feb/2018:16:24:19 +0800] \"GET /HTTP/1.1\" 403 5039",
     "timestamp" => "07/Feb/2018:16:24:19 +0800",
      "@version" => "1",
          "host" => "2.internal"
}
Now we get the data we wanted. Here I matched both the IP and the time; of course, you can also match just the time:
filter {
  grok {
    match => { "message" => "\ -\ -\ \[%{HTTPDATE:timestamp}\]" }
  }
}
This makes it even clearer that grok matches data with regular expressions.
Note: in the pattern, spaces and square brackets must be escaped with a backslash.
3) Extracting the quoted request information
First write the matching pattern (QS matches a quoted string; here that is the request line, and "referrer" is simply the field name we chose for it):
filter {
  grok {
    match => { "message" => "\ %{QS:referrer}\ " }
  }
}
Start it up and look at the result:
[root@:172.31.22.29 /etc/logstash/conf.d]# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d
Sending Logstash logs to /var/log/logstash which is now configured via log4j2.properties
172.16.213.132 - - [07/Feb/2018:16:24:19 +0800] "GET /HTTP/1.1" 403 5039
{
    "@timestamp" => 2019-01-22T10:47:37.127Z,
       "message" => "172.16.213.132 - - [07/Feb/2018:16:24:19 +0800] \"GET /HTTP/1.1\" 403 5039",
      "@version" => "1",
          "host" => "2.internal",
      "referrer" => "\"GET /HTTP/1.1\""
}
4) Extending the idea, let's try extracting the time information from /var/log/messages entries.
The sample data:
Jan 20 11:33:03 ip-172-31-22-29 systemd: Removed slice User Slice of root.
Our goal is to output only the time, i.e., the first three columns.
To see which predefined patterns are available, look at the grok-patterns file in the directory /usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/logstash-patterns-core-4.1.2/patterns. There we find SYSLOGTIMESTAMP (defined as %{MONTH} +%{MONTHDAY} %{TIME}), which fits the output above exactly.
First write the configuration file:
filter {
  grok {
    match => { "message" => "%{SYSLOGTIMESTAMP:time}" }
    remove_field => ["message"]
  }
}
Start it up and see:
[root@:172.31.22.29 /etc/logstash/conf.d]# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d
Sending Logstash logs to /var/log/logstash which is now configured via log4j2.properties
Jan 20 11:33:03 ip-172-31-22-29 systemd: Removed slice User Slice of root.      # type this line manually
{
    "@timestamp" => 2019-01-22T11:54:26.646Z,
          "host" => "2.internal",
      "@version" => "1",
          "time" => "Jan 20 11:33:03"
}
The conversion succeeded; grok is a very handy tool.
1.2 The date plugin
In one of the examples above we extracted a timestamp field, that is, the time taken from the log line. But besides the timestamp you specified, the output also shows an @timestamp line, and the two times are different: @timestamp holds the current system time. In an ELK stack the @timestamp field is what Elasticsearch uses to mark when a log was produced, so leaving the ingest time in it muddles the log timeline. To solve this we need another plugin, the date plugin, which converts the time string in a log record into a Logstash::Timestamp object and stores it in the @timestamp field.
Let's configure it:
filter {
  grok {
    match => { "message" => "\ -\ -\ \[%{HTTPDATE:timestamp}\]" }
  }
  date {
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
  }
}
Note: the timezone offset is handled by the letter Z. Also look closely at "dd/MMM/yyyy": there really are three capital M's in the middle; when I tried only two, the conversion failed. (When the date pattern cannot be matched, Logstash tags the event with _dateparsefailure and @timestamp keeps the ingest time.)
Start it up and see the effect:
[root@:172.31.22.29 /etc/logstash/conf.d]# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d
Sending Logstash logs to /var/log/logstash which is now configured via log4j2.properties
172.16.213.132 - - [07/Feb/2018:16:24:19 +0800] "GET /HTTP/1.1" 403 5039        # type this line manually
{
          "host" => "2.internal",
     "timestamp" => "07/Feb/2018:16:24:19 +0800",
    "@timestamp" => 2018-02-07T08:24:19.000Z,
       "message" => "172.16.213.132 - - [07/Feb/2018:16:24:19 +0800] \"GET /HTTP/1.1\" 403 5039",
      "@version" => "1"
}
You can see that @timestamp was converted successfully: I wrote this post on January 22, 2019, yet @timestamp now carries the log's own date. Also, did you notice the time is 8 hours behind? Keep reading.
1.3 Using remove_field
remove_field is also very commonly used. Its job is to eliminate duplication: as the earlier examples showed, whatever we extract ends up in two places, once inside message and once in the HTTPDATE or IP field. The point of filtering is to keep the useful information and drop the duplicates, so how do we do that?
1) Again taking the IP output as an example:
filter {
  grok {
    match => { "message" => "%{IP:ip_address}" }
    remove_field => ["message"]
  }
}
Start the service and check:
[root@:172.31.22.29 /etc/logstash/conf.d]# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d
Sending Logstash logs to /var/log/logstash which is now configured via log4j2.properties
172.16.213.132 - - [07/Feb/2018:16:24:19 +0800] "GET /HTTP/1.1" 403 5039      # type this line manually and press Enter
{
    "ip_address" => "172.16.213.132",
          "host" => "2.internal",
      "@version" => "1",
    "@timestamp" => 2019-01-22T12:16:58.918Z
}
Notice that the message line shown previously is gone, because remove_field removed it. The benefit is obvious: we keep only the specific information we need from the log. (remove_field is a common option supported by every filter plugin, which is why it can appear inside grok here and inside mutate in a later example.)
2) In the examples above we matched the pieces of the message one at a time; now let's extract them all in a single Logstash configuration.
First the configuration file:
filter {
  grok {
    match => { "message" => "%{IP:ip_address}\ -\ -\ \[%{HTTPDATE:timestamp}\]\ %{QS:referrer}\ %{NUMBER:status}\ %{NUMBER:bytes}" }
  }
  date {
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
  }
}
Start it and take a look:
[root@172.31.22.29 /etc/logstash/conf.d]# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d
Sending Logstash logs to /var/log/logstash which is now configured via log4j2.properties
172.16.213.132 - - [07/Feb/2018:16:24:19 +0800] "GET /HTTP/1.1" 403 5039      # type this line manually
{
        "status" => "403",
         "bytes" => "5039",
       "message" => "172.16.213.132 - - [07/Feb/2018:16:24:19 +0800] \"GET /HTTP/1.1\" 403 5039",
    "ip_address" => "172.16.213.132",
     "timestamp" => "07/Feb/2018:16:24:19 +0800",
    "@timestamp" => 2018-02-07T08:24:19.000Z,
      "referrer" => "\"GET /HTTP/1.1\"",
      "@version" => "1",
          "host" => "2.internal"
}
In this example you can see how bloated the output is; it effectively contains the content twice, so it is well worth removing the original message line.
3) Use remove_field to drop the message line.
First modify the configuration file:
filter {
  grok {
    match => { "message" => "%{IP:ip_address}\ -\ -\ \[%{HTTPDATE:timestamp}\]\ %{QS:referrer}\ %{NUMBER:status}\ %{NUMBER:bytes}" }
  }
  date {
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
  }
  mutate {
    remove_field => ["message", "timestamp"]
  }
}
Start it up:
[root@:172.31.22.29 /etc/logstash/conf.d]# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d
Sending Logstash logs to /var/log/logstash which is now configured via log4j2.properties
172.16.213.132 - - [07/Feb/2018:16:24:19 +0800] "GET /HTTP/1.1" 403 5039     # type this line manually
{
      "referrer" => "\"GET /HTTP/1.1\"",
         "bytes" => "5039",
          "host" => "2.internal",
    "@timestamp" => 2018-02-07T08:24:19.000Z,
        "status" => "403",
    "ip_address" => "172.16.213.132",
      "@version" => "1"
}
See? This is exactly the final result we wanted.
1.4 Time handling (date)
Several examples above already used date. The date plugin is especially important for sorting events and backfilling old data; it converts the time field in a log record into a Logstash::Timestamp object and stores it in the @timestamp field.
Why do we need this plugin?
  1. On the one hand, Logstash stamps every collected log line with its own timestamp (@timestamp), but that timestamp records when the input received the data, not when the log was produced (the two inevitably differ), which can make searching the data confusing.
  2. On the other hand, in the rubydebug output above, even after @timestamp takes its value from the timestamp field, it is still 8 hours behind Beijing time. That is because Elasticsearch stores all time-type fields in UTC internally, and storing logs uniformly in UTC is an accepted convention in the security and operations communities. In practice this does not matter, because the ELK stack already handles it: Kibana reads the browser's current timezone and automatically converts UTC to local time in the web UI.
To parse your own time format, you describe it with letters: a letter indicates a time component (year, month, day, hour, minute, and so on), and repeating the letter indicates the form or width of the value. The "dd/MMM/yyyy:HH:mm:ss Z" seen above uses exactly this form. We list the character meanings below:
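These letters follow the Joda-Time conventions that the date filter uses; the commonly used ones are:
y - year (yyyy: full four-digit year, e.g. 2018)
M - month (MM: two-digit number, e.g. 02; MMM: abbreviated English name, e.g. Feb; MMMM: full name)
d - day of the month (dd: two digits, e.g. 07)
H - hour of the day, 0 to 23 (HH: two digits)
m - minute (mm: two digits)
s - second (ss: two digits)
Z - timezone offset, e.g. +0800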
So how did we arrive at a form like "dd/MMM/yyyy:HH:mm:ss Z"? This point trips people up, so let's spell it out. The test data above is:
172.16.213.132 - - [07/Feb/2018:16:24:19 +0800] "GET /HTTP/1.1" 403 5039
To convert the time we write "dd/MMM/yyyy:HH:mm:ss Z". Notice the three M's in the middle: writing two does not work, because two capital M's stand for a two-digit month number, whereas in the text we are parsing, the month is an abbreviated English name, so we must use three M's. And why the capital Z at the end? Because the text contains the "+0800" timezone offset, so we have to include it; otherwise the filter cannot parse the text correctly and the timestamp conversion fails.
1.5 Modifying data with the mutate plugin
mutate is another very important Logstash plugin. It provides rich processing of basic data types, including renaming, deleting, replacing, and modifying fields in a log event. Here we cover several commonly used mutate capabilities: field type conversion (convert), regex-based field replacement (gsub), splitting a string field into an array on a separator (split), renaming fields (rename), and deleting fields (remove_field).
1) Field type conversion: convert
First modify the configuration file:
filter {
  grok {
    match => { "message" => "%{IPV4:ip}" }
    remove_field => ["message"]
  }
  mutate {
    convert => ["ip", "string"]
  }
}
You can also write it this way; the difference is purely syntactic:
filter {
  grok {
    match => { "message" => "%{IPV4:ip}" }
    remove_field => ["message"]
  }
  mutate {
    convert => {
      "ip" => "string"
    }
  }
}
Now start the service and look at the effect:
[root@:172.31.22.29 /etc/logstash/conf.d]# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d
Sending Logstash logs to /var/log/logstash which is now configured via log4j2.properties
172.16.213.132 - - [07/Feb/2018:16:24:9 +0800] "GET /HTTP/1.1" 403 5039
{
    "@timestamp" => 2019-01-23T04:13:55.261Z,
            "ip" => "172.16.213.132",
          "host" => "2.internal",
      "@version" => "1"
}
The effect on the ip field is not obvious here, because grok captures are strings to begin with, but the conversion did run and the field is indeed of string type.
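The effect is much clearer when converting to a numeric type. A minimal sketch, my addition reusing the combined pattern from section 1.3, that turns the status and bytes captures into integers:
filter {
  grok {
    match => { "message" => "%{IP:ip_address}\ -\ -\ \[%{HTTPDATE:timestamp}\]\ %{QS:referrer}\ %{NUMBER:status}\ %{NUMBER:bytes}" }
    remove_field => ["message"]
  }
  mutate {
    convert => {
      "status" => "integer"    # "403"  -> 403
      "bytes"  => "integer"    # "5039" -> 5039
    }
  }
}
In the rubydebug output the converted values then print without quotes ("status" => 403 rather than "status" => "403"), which is how you can tell they are integers instead of strings.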
2) Regex replacement of matched fields
gsub replaces the parts of a field's value that match a regular expression; it only works on string fields.
First modify the configuration file:
filter {
  grok {
    match => { "message" => "%{QS:referrer}" }
    remove_field => ["message"]
  }
  mutate {
    gsub => ["referrer", "/", "-"]
  }
}
Start it up and look at the effect:
172.16.213.132 - - [07/Feb/2018:16:24:9 +0800] "GET /HTTP/1.1" 403 5039
{
          "host" => "2.internal",
    "@timestamp" => 2019-01-23T05:51:30.786Z,
      "@version" => "1",
      "referrer" => "\"GET -HTTP-1.1\""
}
Sure enough, the "/" separators in the quoted-string part have been replaced with dashes.
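Worth knowing: the gsub array is read as flat triples of (field, pattern, replacement), so several substitutions can be listed in one call. A small sketch; the second triple, which strips the surrounding quotes, is my illustration rather than part of the original run:
mutate {
  gsub => [
    "referrer", "/", "-",    # replace every "/" in referrer with "-"
    "referrer", '"', ''      # then remove the double quotes themselves
  ]
}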
3) Splitting a string into an array on a separator
split splits a field's string value into an array on the given separator.
First the configuration file:
filter {
  mutate {
    split => ["message", "-"]
    add_field => ["A is lower case :", "%{[message][0]}"]
  }
}
This splits the message field into an array on "-".
Start it up:
a-b-c-d-e-f-g            # type this line manually and press Enter
{
    "A is lower case :" => "a",
              "message" => [
        [0] "a",
        [1] "b",
        [2] "c",
        [3] "d",
        [4] "e",
        [5] "f",
        [6] "g"
    ],
                 "host" => "2.internal",
             "@version" => "1",
           "@timestamp" => 2019-01-23T06:07:18.062Z
}
4) Renaming a field
rename renames a field.
filter {
  grok {
    match => { "message" => "%{IPV4:ip}" }
    remove_field => ["message"]
  }
  mutate {
    convert => {
      "ip" => "string"
    }
    rename => {
      "ip" => "IP"
    }
  }
}
Here the rename option uses braces {}; brackets achieve the same result:
mutate {
  convert => {
    "ip" => "string"
  }
  rename => ["ip", "IP"]
}
Start it and check:
172.16.213.132 - - [07/Feb/2018:16:24:9 +0800] "GET /HTTP/1.1" 403 5039      # type this line manually
{
      "@version" => "1",
    "@timestamp" => 2019-01-23T06:20:21.423Z,
          "host" => "2.internal",
            "IP" => "172.16.213.132"
}
5) Deleting fields: nothing more to add here; there are examples above.
6) Adding a field with add_field.
add_field is mostly used together with split, mainly to output the split pieces under a field name of your choosing.
filter {
  mutate {
    split => ["message", "|"]
    add_field => {
      "timestamp" => "%{[message][0]}"
    }
  }
}
Once added, the field is displayed in the same way as @timestamp and the other fields.
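As an illustration, with a hypothetical pipe-separated input line, a run would look roughly like this (output sketched, not captured from an actual session):
Jan 20 11:33:03|ip-172-31-22-29|systemd      # type this line manually
{
    "timestamp" => "Jan 20 11:33:03",
      "message" => [
        [0] "Jan 20 11:33:03",
        [1] "ip-172-31-22-29",
        [2] "systemd"
    ],
    ...
}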
1.6 geoip address lookup
geoip is a common, free IP geolocation library. Given an IP address, geoip returns the corresponding location information, including country, region and city, longitude and latitude, and so on. The plugin is very useful for map visualizations and per-region statistics. First let's modify the configuration file:
filter {
  grok {
    match => {
      "message" => "%{IP:ip}"
    }
    remove_field => ["message"]
  }
  geoip {
    source => "ip"
  }
}
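Two caveats worth adding, both mine rather than from the original walkthrough. First, 172.16.213.132 is a private RFC 1918 address, so a real lookup on it will fail and the event will simply be tagged _geoip_lookup_failure; test with a public IP instead. Second, on a successful lookup the event gains a geoip object whose typical fields look roughly like this:
{
    "geoip" => {
        "country_name" => "...",
           "city_name" => "...",
            "latitude" => ...,
           "longitude" => ...
    },
    ...
}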
The grok match section in the configuration above can also be written in the equivalent array form, for example:
grok {
  match => ["message", "%{IP:ip}"]
  remove_field => ["message"]
}