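An Introduction to SkyWalking for Full-Link Tracing
This article covers the following topics:
1. An introduction to SkyWalking
2. Using SkyWalking, which supports many kinds of call middleware (httpclient, springmvc, dubbo, mysql, and so on)
3. Integrating SkyWalking's traceId with logging components (log4j, logback, elk, and so on)
4. Using the SkyWalking alarm module
5. How SkyWalking works
6. Limitations of SkyWalking
1. Introduction to SkyWalking: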
Overview:
SkyWalking: an open source observability platform to collect, analyze, aggregate and visualize data from services and cloud native infrastructures.
SkyWalking provides an easy way to keep you have a clear view of your distributed system, even across Cloud.
It is more like a modern APM, specially designed for cloud native, container based and distributed system.
-------
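In other words, SkyWalking is an open-source observability platform that collects, analyzes, aggregates, and visualizes data from services and cloud-native infrastructure.
It offers a simple way to keep a clear view of a distributed system, even one that spans clouds.
It is best thought of as a modern APM (application performance management) tool, designed specifically for cloud-native, container-based, distributed systems.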
Why use SkyWalking?
SkyWalking provides solutions for observing and monitoring distributed system, in many different scenarios.
First of all, like traditional ways, SkyWalking provides auto instrument agents for service, such as Java, C# and Node.js.
At the same time, it provides manual instrument SDKs for Go(Not yet), C++(Not yet).
Also with more languages required, risks in manipulating codes at runtime, cloud native infrastructures grow more powerful,
SkyWalking could use Service Mesher infra probes to collect data for understanding the whole distributed system.
In general, it provides observability capabilities for service(s), service instance(s), endpoint(s).
----------
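That is, SkyWalking offers ways to observe and monitor distributed systems in many different scenarios.
First, like traditional approaches, it provides auto-instrumentation agents for services written in Java, C#, and Node.js.
Manual instrumentation SDKs for Go and C++ are also mentioned, but the overview quoted above marks both as "Not yet".
As more languages need support, as manipulating code at runtime carries risk, and as cloud-native infrastructure grows more powerful,
SkyWalking can also use service mesh probes to collect data and build a picture of the whole distributed system.
In general, it provides observability for services, service instances, and endpoints:
Service: a single service.
Service Instance: one running instance of a service (a single service is started on several nodes).
Endpoint: one interface exposed by a service.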
Architecture:
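2. Using SkyWalking:
Step 3: start the project. Copy the skywalking-agent directory to the desired location. The agent consists of the entire directory, so do not change the directory structure. In the agent configuration, change agent.application_code=xxl-job to your own application name.
Add the JVM startup parameter -javaagent:/path/to/skywalking-agent/skywalking-agent.jar, where the value is the absolute path to skywalking-agent.jar.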
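To confirm that the startup parameter actually reached the JVM, a process can inspect its own input arguments. The following is a minimal sketch that uses only the JDK; the class name AgentArgCheck and the helper findAgentJar are hypothetical names chosen for illustration and are not part of SkyWalking.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class AgentArgCheck {
    // Returns the jar path given to the first -javaagent: option, or null if none is present.
    static String findAgentJar(List<String> jvmArgs) {
        String prefix = "-javaagent:";
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                String value = arg.substring(prefix.length());
                // The JVM syntax is -javaagent:<jarpath>[=<options>], so cut at '='.
                int eq = value.indexOf('=');
                return eq >= 0 ? value.substring(0, eq) : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself (not the program arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        String jar = findAgentJar(jvmArgs);
        System.out.println(jar == null
            ? "No -javaagent option found; the agent is not attached."
            : "Agent jar: " + jar);
    }
}
```

Running this class with and without the -javaagent parameter shows whether the option was picked up; it does not verify that the agent connected to the backend.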
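After these steps, call the project's interfaces directly and check whether the SkyWalking UI has collected the call information.
The figure below shows the SkyWalking home page, which mainly displays global performance information.
To verify that SkyWalking can discover the system topology (the dependencies between systems), start four services whose interface paths are hello/start1, hello/start2, hello/start3, and hello/start4.
The services depend on each other as follows: start1 depends on start2, and start2 depends on start3 and start4.
After the start1 interface is called, SkyWalking displays the following project topology: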
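The dependency relation behind that topology can also be written down as a small graph. The sketch below is illustrative only and is not SkyWalking code; the class TopologySketch and its methods are hypothetical. It encodes the four test services and walks the graph from start1 to list every service a single request touches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopologySketch {
    // Dependencies as described in the text: start1 -> start2 -> {start3, start4}.
    static Map<String, List<String>> dependencies() {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("start1", Arrays.asList("start2"));
        deps.put("start2", Arrays.asList("start3", "start4"));
        deps.put("start3", Collections.<String>emptyList());
        deps.put("start4", Collections.<String>emptyList());
        return deps;
    }

    // Depth-first walk listing every service reached from the entry point, in call order.
    static List<String> reachableFrom(String entry, Map<String, List<String>> deps) {
        List<String> visited = new ArrayList<>();
        walk(entry, deps, visited);
        return visited;
    }

    private static void walk(String node, Map<String, List<String>> deps, List<String> visited) {
        if (visited.contains(node)) {
            return; // already seen; also guards against cycles
        }
        visited.add(node);
        for (String callee : deps.getOrDefault(node, Collections.<String>emptyList())) {
            walk(callee, deps, visited);
        }
    }

    public static void main(String[] args) {
        System.out.println(reachableFrom("start1", dependencies()));
    }
}
```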
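Full-link performance tracing page:
By default, SkyWalking can monitor call performance for the following types: DB(1), RPC_FRAMEWORK(2), HTTP(3), MQ(4), and CACHE(5). Custom plugins can be written to monitor components that are not supported out of the box.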
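The type codes listed above map naturally onto an enum. The following sketch only mirrors the names and numbers from the text; LayerDemo and its nested Layer enum are hypothetical and are not the class SkyWalking itself uses.

```java
class LayerDemo {
    // Names and codes copied from the list in the text.
    enum Layer {
        DB(1), RPC_FRAMEWORK(2), HTTP(3), MQ(4), CACHE(5);

        private final int code;

        Layer(int code) {
            this.code = code;
        }

        int code() {
            return code;
        }

        // Looks up a layer by its numeric code.
        static Layer fromCode(int code) {
            for (Layer layer : values()) {
                if (layer.code == code) {
                    return layer;
                }
            }
            throw new IllegalArgumentException("Unknown layer code: " + code);
        }
    }

    public static void main(String[] args) {
        for (Layer layer : Layer.values()) {
            System.out.println(layer + " -> " + layer.code());
        }
    }
}
```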
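Next, look at the result of calling Dubbo and a database (service start2 calls the database and the Dubbo service of project 4):
3. Integrating SkyWalking's traceId with logging components (log4j, logback, elk, and so on):
Taking logback as an example, adding the following to the logging configuration XML is enough for the traceId of the current context to be written automatically into every log line.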
<appender name="console" class="ch.ConsoleAppender">
<layout class="org.apache.lkit.log.TraceIdPatternLogbackLayout">
<pattern>
%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger{36} - %tid - %msg%n
</pattern>
</layout>
</appender>
The result is shown below. Every node in the call chain logs the same traceId, so after spotting a slow trace in SkyWalking you can search the logging component for that traceId and check for abnormal log entries.
Log printed by service 1:
2019-08-14 16:46:22 [http-nio-9091-exec-1] INFO  c.ller.HelloController - TID:47.34.15657723821280001 - service1 logger with traceId
Log printed by service 2:
2019-08-14 16:46:24 [http-nio-9092-exec-9] INFO  c.ller.HelloController - TID:47.34.15657723821280001 - service2 logger with traceId
Log printed by service 3:
2019-08-14 16:46:24 [http-nio-9093-exec-1] INFO  c.ller.HelloController - TID:47.34.15657723821280001 - service3 logger with traceId
Log printed by service 4:
2019-08-14 16:46:24 [http-nio-9094-exec-1] INFO  c.ller.HelloController - TID:47.34.15657723821280001 - service4 logger with traceId
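Because every service writes the same TID token, log lines from different services can be correlated with a simple text match. The sketch below assumes the log format shown above (a "TID:<id>" token delimited by whitespace); the class TraceIdGrouping is hypothetical and stands in for whatever query the log platform provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TraceIdGrouping {
    // Matches the "TID:<id>" token; the id runs up to the next whitespace.
    private static final Pattern TID = Pattern.compile("TID:(\\S+)");

    // Returns the trace id found in a log line, or null if the line has none.
    static String extractTraceId(String logLine) {
        Matcher m = TID.matcher(logLine);
        return m.find() ? m.group(1) : null;
    }

    // Groups log lines by trace id, keeping the order in which ids first appear.
    static Map<String, List<String>> groupByTraceId(List<String> lines) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            String tid = extractTraceId(line);
            if (tid != null) {
                grouped.computeIfAbsent(tid, k -> new ArrayList<>()).add(line);
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2019-08-14 16:46:22 [http-nio-9091-exec-1] INFO  c.ller.HelloController - TID:47.34.15657723821280001 - service1 logger with traceId",
            "2019-08-14 16:46:24 [http-nio-9092-exec-9] INFO  c.ller.HelloController - TID:47.34.15657723821280001 - service2 logger with traceId");
        groupByTraceId(lines).forEach((tid, group) ->
            System.out.println(tid + " -> " + group.size() + " line(s)"));
    }
}
```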
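4. Using the SkyWalking alarm module:
The figure below shows the alarm page of the UI. Alarms can be monitored along three dimensions: service, service instance, and endpoint (interface).
Alarm rules can be defined freely in the configuration file shipped in the installation package (apache-skywalking-apm-bin/l).
The default configuration monitors services and service instances but not endpoints. The sample file gives the reason: "Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm. Because the number of endpoint is much more than service and instance."
The rule configuration is listed below. SkyWalking also lets users configure alarm webhooks so that notifications, for example by email, are sent promptly; see the webhooks entry in the configuration file.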
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#    /licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_p90_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_p90
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: 90% response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  #  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
  #  Because the number of endpoint is much more than service and instance.
  #
  endpoint_avg_rule:
    metrics-name: endpoint_avg
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
#webhooks:
#  - 127.0.0.1/notify/
#  - 127.0.0.1/go-wechat/
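To make the rule fields concrete, here is a minimal sketch of how period, count, op, and threshold could interact: within the last `period` per-minute values, the rule fires when at least `count` of them satisfy the comparison. This is an interpretation based on the rule messages above (for example, "more than 1000ms in 3 minutes of last 10 minutes"), not SkyWalking's actual alarm code. The class AlarmRuleSketch is hypothetical, and silence-period is deliberately left out.

```java
import java.util.Arrays;
import java.util.List;

class AlarmRuleSketch {
    final String op;       // ">" or "<", as written in the rule file
    final long threshold;  // value the metric is compared against
    final int period;      // window length in minutes
    final int count;       // matches needed inside the window

    AlarmRuleSketch(String op, long threshold, int period, int count) {
        this.op = op;
        this.threshold = threshold;
        this.period = period;
        this.count = count;
    }

    // Applies the rule's comparison to a single metric value.
    boolean matches(long value) {
        switch (op) {
            case ">": return value > threshold;
            case "<": return value < threshold;
            default: throw new IllegalArgumentException("Unsupported op: " + op);
        }
    }

    // perMinuteValues holds one metric value per minute, oldest first.
    boolean shouldAlarm(List<Long> perMinuteValues) {
        int from = Math.max(0, perMinuteValues.size() - period);
        int matched = 0;
        for (long value : perMinuteValues.subList(from, perMinuteValues.size())) {
            if (matches(value)) {
                matched++;
            }
        }
        return matched >= count;
    }

    public static void main(String[] args) {
        // Same numbers as service_resp_time_rule above.
        AlarmRuleSketch rule = new AlarmRuleSketch(">", 1000, 10, 3);
        List<Long> responseTimes = Arrays.asList(200L, 1200L, 300L, 1500L, 1100L);
        System.out.println("alarm: " + rule.shouldAlarm(responseTimes));
    }
}
```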
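5. How SkyWalking works:
The overall SkyWalking architecture has three parts:
1.    skywalking-collector: the trace data collector. Data can be persisted to ElasticSearch; a standalone deployment can also persist to H2, which is not recommended because H2 is intended only for temporary demonstrations.
2.    skywalking-web: the web visualization platform, which displays the persisted data.
3.    skywalking-agent: the probe, which collects data and sends it to the collector.
The core of SkyWalking is the agent. The figure below shows in detail how the agents behave during one call that crosses several processes:
Taking the interception of Dubbo requests as an example, the SkyWalking Dubbo plugin is implemented as follows:
The plugin intercepts the invoke method of Dubbo's MonitorFilter class. As DubboInterceptor shows, it obtains Dubbo's RpcContext and, on the consumer side, adds SkyWalking's cross-process protocol header (sw:traceId) before the call is made; the provider side then reads the header back out.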
package org.apache.skywalking.apm.plugin.dubbo;

public class DubboInstrumentation extends ClassInstanceMethodsEnhancePluginDefine {
    private static final String ENHANCE_CLASS = "com.itor.support.MonitorFilter";
    private static final String INTERCEPT_CLASS = "org.apache.skywalking.apm.plugin.dubbo.DubboInterceptor";

    @Override
    protected ClassMatch enhanceClass() {
        return NameMatch.byName(ENHANCE_CLASS);
    }

    @Override
    public ConstructorInterceptPoint[] getConstructorsInterceptPoints() {
        return null;
    }

    @Override
    public InstanceMethodsInterceptPoint[] getInstanceMethodsInterceptPoints() {
        return new InstanceMethodsInterceptPoint[] {
            new InstanceMethodsInterceptPoint() {
                @Override
                public ElementMatcher<MethodDescription> getMethodsMatcher() {
                    return named("invoke");
                }

                @Override
                public String getMethodsInterceptor() {
                    return INTERCEPT_CLASS;
                }

                @Override
                public boolean isOverrideArgs() {
                    return false;
                }
            }
        };
    }
}
The following code is the interceptor implementation for Dubbo (DubboInterceptor):
package org.apache.skywalking.apm.plugin.dubbo;
import com.alibaba.dubbomon.URL;
import com.alibaba.dubbo.rpc.Invocation;
import com.alibaba.dubbo.rpc.Invoker;
import com.alibaba.dubbo.rpc.Result;
import com.alibaba.dubbo.rpc.RpcContext;
import flect.Method;
import org.apache.skywalking.ontext.ContextCarrier;
import org.apache.skywalking.ontext.tag.Tags;
import org.apache.skywalking.ontext.CarrierItem;
import org.apache.skywalking.ontext.ContextManager;
import org.apache.skywalking.ace.AbstractSpan;
import org.apache.skywalking.ace.SpanLayer;
import org.apache.skywalking.hance.EnhancedInstance;
import org.apache.skywalking.hance.InstanceMethodsAroundInterceptor;
import org.apache.skywalking.hance.MethodInterceptResult;
import org.apache.aceponent.ComponentsDefine;
/**
* {@link DubboInterceptor} define how to enhance class {@link com.itor.support.MonitorFilter#invoke(Invoker,
* Invocation)}. the trace context transport to the provider side by {@link RpcContext#attachments}.but all the version
* of dubbo framework below 2.8.3 don't support {@link RpcContext#attachments}, we support another way to support it.
*
* @author zhangxin
*/
public class DubboInterceptor implements InstanceMethodsAroundInterceptor {
    /**
     * <h2>Consumer:</h2> The serialized trace context data will
     * inject to the {@link RpcContext#attachments} for transport to provider side.
     * <p>
     * <h2>Provider:</h2> The serialized trace context data will extract from
     * {@link RpcContext#attachments}. current trace segment will ref if the serialize context data is not null.
     */
    @Override
    public void beforeMethod(EnhancedInstance objInst, Method method, Object[] allArguments,
        Class<?>[] argumentsTypes, MethodInterceptResult result) throws Throwable {
        Invoker invoker = (Invoker)allArguments[0];
        Invocation invocation = (Invocation)allArguments[1];
        RpcContext rpcContext = Context();
        boolean isConsumer = rpcContext.isConsumerSide();
        URL requestURL = Url();
        AbstractSpan span;
        final String host = Host();
        final int port = Port();
        if (isConsumer) {
            final ContextCarrier contextCarrier = new ContextCarrier();
            span = ateExitSpan(generateOperationName(requestURL, invocation), contextCarrier, host + ":" + port);
            //Attachments().put("contextData", contextDataStr);
            //@see github/alibaba/dubbo/blob/dubbo-2.5.3/dubbo-rpc/dubbo-rpc-api/src/main/java/com/alibaba/dubbo/rpc/RpcInvocation.java#L154-L161
            CarrierItem next = contextCarrier.items();
            while (next.hasNext()) {
                next = ();
            }
        } else {
            ContextCarrier contextCarrier = new ContextCarrier();
            CarrierItem next = contextCarrier.items();
            while (next.hasNext()) {
                next = ();
                next.HeadKey()));
            }
            span = ateEntrySpan(generateOperationName(requestURL, invocation), contextCarrier);
        }
        Tags.URL.set(span, generateRequestURL(requestURL, invocation));
        span.setComponent(ComponentsDefine.DUBBO);
        SpanLayer.asRPCFramework(span);
    }

    @Override
    public Object afterMethod(EnhancedInstance objInst, Method method, Object[] allArguments,
        Class<?>[] argumentsTypes, Object ret) throws Throwable {
        Result result = (Result)ret;
        if (result != null && Exception() != null) {
            Exception());
        }
        ContextManager.stopSpan();
        return ret;
    }

    @Override
    public void handleMethodException(EnhancedInstance objInst, Method method, Object[] allArguments,
        Class<?>[] argumentsTypes, Throwable t) {
        dealException(t);
    }

    /**
     * Log the throwable, which occurs in Dubbo RPC service.
     */
    private void dealException(Throwable throwable) {
        AbstractSpan span = ContextManager.activeSpan();
        span.log(throwable);
    }

    /**
     * Format operation name. e.g. org.apache.skywalking.st(String)
     *
     * @return operation name.
     */
    private String generateOperationName(URL requestURL, Invocation invocation) {
        StringBuilder operationName = new StringBuilder();
        operationName.Path());
        operationName.append("." + MethodName() + "(");
        for (Class<?> classes : ParameterTypes()) {
            operationName.SimpleName() + ",");
        }
        if (ParameterTypes().length > 0) {
            operationName.delete(operationName.length() - 1, operationName.length());
        }
        operationName.append(")");
        String();
    }

    /**
     * Format request url.
     * e.g. dubbo://127.0.0.1:20880/org.apache.skywalking.st(String).
     *
     * @return request url.
     */
    private String generateRequestURL(URL url, Invocation invocation) {
        StringBuilder requestURL = new StringBuilder();
        requestURL.Protocol() + "://");
        requestURL.Host());
        requestURL.append(":" + Port() + "/");
        requestURL.append(generateOperationName(url, invocation));
        String();
    }
}
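The consumer/provider hand-off described in the interceptor's Javadoc can be illustrated without Dubbo or SkyWalking on the classpath. The sketch below uses a plain Map in place of the RPC attachments; the class PropagationSketch and the key name "sketch-trace-context" are hypothetical, and the real header key and serialized format are defined by SkyWalking's cross-process protocol.

```java
import java.util.HashMap;
import java.util.Map;

class PropagationSketch {
    // Hypothetical header key, used only in this sketch.
    static final String HEADER_KEY = "sketch-trace-context";

    // Consumer side: write the trace context into the outgoing attachments.
    static void inject(String traceId, Map<String, String> attachments) {
        attachments.put(HEADER_KEY, traceId);
    }

    // Provider side: read the trace context back; null means no upstream trace.
    static String extract(Map<String, String> attachments) {
        return attachments.get(HEADER_KEY);
    }

    public static void main(String[] args) {
        // The map stands in for the attachments that travel with the RPC request.
        Map<String, String> attachments = new HashMap<>();
        inject("47.34.15657723821280001", attachments);

        String upstream = extract(attachments);
        System.out.println(upstream == null
            ? "No upstream context; start a new trace."
            : "Continue trace " + upstream);
    }
}
```

Since both sides agree on the key, the provider can attach its own span to the consumer's trace, which is why all four services in the log example print the same TID.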
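When the call finishes, the span is stopped and its details are sent to the collector. This is implemented in the stopSpan(AbstractSpan span) method of the class org.apache.skywalking.ontext.TracingContext.
The implementation of stopSpan is as follows: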
@Override
public boolean stopSpan(AbstractSpan span) {
    AbstractSpan lastSpan = peek();
    if (lastSpan == span) {
        if (lastSpan instanceof AbstractTracingSpan) {
            AbstractTracingSpan toFinishSpan = (AbstractTracingSpan)lastSpan;
            if (toFinishSpan.finish(segment)) {
                pop();
            }
        } else {
            pop();
        }
    } else {
        throw new IllegalStateException("Stopping the unexpected span = " + span);
    }
    finish();
    return activeSpanStack.isEmpty();
}
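The stack discipline in stopSpan (only the most recently created span may be stopped, and the context finishes once the stack is empty) can be reproduced in a few lines. This is a simplified sketch with hypothetical names (SpanStackSketch, isReported); spans are plain strings, and a boolean flag stands in for handing the finished segment to the sender.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class SpanStackSketch {
    private final Deque<String> activeSpans = new ArrayDeque<>();
    private final List<String> finishedSpans = new ArrayList<>();
    private boolean reported = false;

    // Creating a span pushes it on top of the active stack.
    void createSpan(String name) {
        activeSpans.push(name);
    }

    // Only the span on top of the stack may be stopped, as in stopSpan above.
    // Returns true when no active span is left.
    boolean stopSpan(String name) {
        String last = activeSpans.peek();
        if (last == null || !last.equals(name)) {
            throw new IllegalStateException("Stopping the unexpected span = " + name);
        }
        finishedSpans.add(activeSpans.pop());
        if (activeSpans.isEmpty()) {
            reported = true; // stand-in for passing the finished segment on
        }
        return activeSpans.isEmpty();
    }

    boolean isReported() {
        return reported;
    }

    List<String> finishedSpans() {
        return finishedSpans;
    }

    public static void main(String[] args) {
        SpanStackSketch context = new SpanStackSketch();
        context.createSpan("entry: hello/start1");
        context.createSpan("exit: call start2");
        context.stopSpan("exit: call start2");
        context.stopSpan("entry: hello/start1");
        System.out.println("finished order: " + context.finishedSpans());
        System.out.println("reported: " + context.isReported());
    }
}
```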
The logic that actually sends the data is in the finish method:
/**
 * Finish this context, and notify all {@link TracingContextListener}s, managed by {@link
 * TracingContext.ListenerManager}
 */
private void finish() {
    if (isRunningInAsyncMode) {
        asyncFinishLock.lock();
    }
    try {
        if (activeSpanStack.isEmpty() && running && (!isRunningInAsyncMode || () == 0)) {
            TraceSegment finishedSegment = segment.finish(isLimitMechanismWorking());
            /*
             * Recheck the segment if the segment contains only one span.
             * Because in the runtime, can't sure this segment is part of distributed trace.
             *
             * @see {@link #createSpan(String, long, boolean)}
             */
            if (!segment.hasRef() && segment.isSingleSpanSegment()) {
                if (!Sampling()) {
                    finishedSegment.setIgnore(true);
                }
            }
            /*
             * Check that the segment is created after the agent (re-)registered to backend,
             * otherwise the segment may be created when the agent is still rebooting and should
             * be ignored
             */
            if (ateTime() < RemoteDownstreamConfig.Agent.INSTANCE_REGISTERED_TIME) {
