Distributed | 如何解决令人头疼的堆外内存泄漏问题?-验证猜测
最编程
2024-05-22 22:27:53
...
故障时 dble 机器的内存图:
可以看到确实存在短时间攀升。而 dble cpu 当时的使用率也很高。
再来看 dble 中 free buffer 的监控图(这个指标是记录 dble 中 Processor 的堆外内存使用情况的):
堆外内存泄露分析
btrace 脚本
运行此脚本后,对程序的性能有 10% - 20% 的损耗,且日志量较大,由于堆外内存呈长期递减的趋势,因此只采集了2h的日志进行分析:
package com.actiontech.dble.btrace.script;
import com.sun.btrace.BTraceUtils;
import com.sun.btrace.annotations.*;
import sun.nio.ch.DirectBuffer;
import java.nio.ByteBuffer;
@BTrace(unsafe = true)
public class BTraceDirectByteBuffer {
private BTraceDirectByteBuffer() {
}
@OnMethod(
clazz = "com.actiontech.dble.buffer.DirectByteBufferPool",
method = "recycle",
location = @Location(Kind.RETURN)
)
public static void recycle(@ProbeClassName String pcn, @ProbeMethodName String pmn, ByteBuffer buf) {
String threadName = BTraceUtils.currentThread().getName();
// 排除一些线程的干扰
if (!threadName.contains("writeTo")) {
String js = BTraceUtils.jstackStr(15);
if (!js.contains("heartbeat") && !js.contains("XAAnalysisHandler")) {
BTraceUtils.println(threadName);
if (buf.isDirect()) {
BTraceUtils.println("r:" + ((DirectBuffer) buf).address());
}
BTraceUtils.println(js);
}
}
}
@OnMethod(
clazz = "com.actiontech.dble.buffer.DirectByteBufferPool",
method = "allocate",
location = @Location(Kind.RETURN)
)
public static void allocate(@Return ByteBuffer buf) {
String threadName = BTraceUtils.currentThread().getName();
// 排除一些线程的干扰
if (!threadName.contains("writeTo")) {
String js = BTraceUtils.jstackStr(15);
// 排除心跳等功能的干扰
if (!js.contains("heartbeat") && !js.contains("XAAnalysisHandler")) {
BTraceUtils.println(threadName);
if (buf.isDirect()) {
BTraceUtils.println("a:" + ((DirectBuffer) buf).address());
}
BTraceUtils.println(js);
}
}
}
}
分析采集的 btrace 日志
采集命令:
$ btrace -o 日志的路径 -u 11735 /path/to/BTraceDirectByteBuffer.java
过滤出分配的堆外内存的地址:
$ grep '^a:' /tmp/142-20-dble-btrace.log > allocat.txt
$ sed 's/..//' allocat.txt > allocat_addr.txt # 删除前两个字符
过滤出释放的堆外内存的地址:
$ grep '^r:' /tmp/142-20-dble-btrace.log > release.txt
$ sed 's/..//' release.txt > release_addr.txt # 删除前两个字符
此时取两个文件的差集:
$ sort allocat_addr.txt release_addr.txt | uniq -u > res.txt
从中任选几个堆外内存的 address ,查看堆栈。排除掉误记录的堆栈后,出现最多的堆栈如下:
complexQueryExecutor176019
a:139999811142058
com.actiontech.dble.buffer.DirectByteBufferPool.allocate(DirectByteBufferPool.java:82)
com.actiontech.dble.net.connection.AbstractConnection.allocate(AbstractConnection.java:395)
com.actiontech.dble.backend.mysql.nio.handler.query.impl.OutputHandler.<init>(OutputHandler.java:51)
com.actiontech.dble.services.factorys.FinalHandlerFactory.createFinalHandler(FinalHandlerFactory.java:28)
com.actiontech.dble.backend.mysql.nio.handler.builder.HandlerBuilder.build(HandlerBuilder.java:90)
com.actiontech.dble.server.NonBlockingSession.executeMultiResultSet(NonBlockingSession.java:608)
com.actiontech.dble.server.NonBlockingSession.lambda$executeMultiSelect$55(NonBlockingSession.java:670)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
review 代码
首先通过上面的堆栈定位到下面的代码:
// com/actiontech/dble/backend/mysql/nio/handler/builder/HandlerBuilder.java
public RouteResultsetNode build(boolean isHaveHintPlan2Inner) throws Exception {
TraceManager.TraceObject traceObject = TraceManager.serviceTrace(session.getShardingService(), "build&execute-complex-sql");
try {
final long startTime = System.nanoTime();
BaseHandlerBuilder builder = getBuilder(session, node, false);
DMLResponseHandler endHandler = builder.getEndHandler();
// 泄露点在这,dble 会创建 OutputHandler实例,OutputHandler会分配堆外内存
DMLResponseHandler fh = FinalHandlerFactory.createFinalHandler(session);
endHandler.setNextHandler(fh);
...
RouteResultsetNode routeSingleNode = getTryRouteSingleNode(builder, isHaveHintPlan2Inner);
if (routeSingleNode != null)
return routeSingleNode;
HandlerBuilder.startHandler(fh);
session.endComplexExecute();
long endTime = System.nanoTime();
LOGGER.debug("HandlerBuilder.build cost:" + (endTime - startTime));
session.setTraceBuilder(builder);
} finally {
TraceManager.finishSpan(session.getShardingService(), traceObject);
}
return null;
}
// com/actiontech/dble/backend/mysql/nio/handler/query/impl/OutputHandler.java
public OutputHandler(long id, NonBlockingSession session) {
super(id, session);
session.setOutputHandler(this);
this.lock = new ReentrantLock();
this.packetId = (byte) session.getPacketId().get();
this.isBinary = session.isPrepared();
// 分配堆外内存
this.buffer = session.getSource().allocate();
}
1. 程序bug导致复杂查询未下发,从而执行链被丢弃而没有回收 buffer
2. 程序下发了,由于未知bug导致没有释放 buffer
那如何进一步缩小范围呢?
堆内存 dump
通过现场收集到的异常 OutputHandler 中 buffer 的状态是:
正常写出的 OutputHandler 中 buffer 的状态是:
找到的异常的 OutputHandler 的 BaseSelectHandler 中状态值:
重新回到代码中,review 下发复杂查询之前和构造之后的代码:
// com/actiontech/dble/backend/mysql/nio/handler/builder/HandlerBuilder.java
public RouteResultsetNode build(boolean isHaveHintPlan2Inner) throws Exception {
TraceManager.TraceObject traceObject = TraceManager.serviceTrace(session.getShardingService(), "build&execute-complex-sql");
try {
final long startTime = System.nanoTime();
BaseHandlerBuilder builder = getBuilder(session, node, false);
DMLResponseHandler endHandler = builder.getEndHandler();
// 泄露点在这,dble 会创建 OutputHandler,OutputHandler会分配堆外内存
DMLResponseHandler fh = FinalHandlerFactory.createFinalHandler(session);
endHandler.setNextHandler(fh);
...
RouteResultsetNode routeSingleNode = getTryRouteSingleNode(builder, isHaveHintPlan2Inner);
if (routeSingleNode != null)
return routeSingleNode;
// 下发复杂查询,review 之前的代码
HandlerBuilder.startHandler(fh);
session.endComplexExecute();
long endTime = System.nanoTime();
LOGGER.debug("HandlerBuilder.build cost:" + (endTime - startTime));
session.setTraceBuilder(builder);
} finally {
TraceManager.finishSpan(session.getShardingService(), traceObject);
}
return null;
}
上一篇: python zipfile 模块的基本用法和演示
下一篇: 如何使用 Btrace