最近在使用Flume收集日誌,發現每隔幾個小時就flume進程就會掛掉,但是日誌中又沒有Error 和Exception,進程結束前的一些日誌如下:
09 Apr 2015 14:47:21,160 INFO [agent-shutdown-hook] (org.apache.flume.lifecycle.LifecycleSupervisor.stop:79) - Stopping lifecycle supervisor 11
09 Apr 2015 14:47:21,162 INFO [agent-shutdown-hook] (org.apache.flume.node.PollingPropertiesFileConfigurationProvider.stop:83) - Configuration provider stopping
09 Apr 2015 14:47:21,177 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.channel.file.Log.roll:932) - Roll start /maichuang/flume/channel/data
09 Apr 2015 14:47:21,177 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.channel.file.LogFile$Writer.<init>:214) - Opened /maichuang/flume/channel/data/log-15
09 Apr 2015 14:47:21,178 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.channel.file.Log.roll:948) - Roll end
09 Apr 2015 14:47:21,210 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.channel.file.Log.roll:932) - Roll start /maichuang/flume/channel/data
09 Apr 2015 14:47:21,211 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.channel.file.LogFile$Writer.<init>:214) - Opened /maichuang/flume/channel/data/log-16
09 Apr 2015 14:47:21,211 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.channel.file.Log.roll:948) - Roll end
09 Apr 2015 14:47:21,211 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.SinkRunner$PollingRunner.run:160) - Unable to deliver event. Exception follows.
org.apache.flume.ChannelException: Commit failed due to IO error [channel=channel1]
at org.apache.flume.channel.file.FileChannel$FileBackedTransaction.doRollback(FileChannel.java:605)
at org.apache.flume.channel.BasicTransactionSemantics.rollback(BasicTransactionSemantics.java:168)
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:466)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:380)
at org.apache.flume.channel.file.LogFileV3.writeDelimitedTo(LogFileV3.java:148)
at org.apache.flume.channel.file.LogFileV3$Writer.<init>(LogFileV3.java:209)
at org.apache.flume.channel.file.LogFileFactory.getWriter(LogFileFactory.java:77)
at org.apache.flume.channel.file.Log.roll(Log.java:935)
at org.apache.flume.channel.file.Log.roll(Log.java:904)
at org.apache.flume.channel.file.Log.rollback(Log.java:719)
at org.apache.flume.channel.file.FileChannel$FileBackedTransaction.doRollback(FileChannel.java:603)
... 5 more
09 Apr 2015 14:47:21,212 INFO [agent-shutdown-hook] (org.apache.flume.sink.hdfs.HDFSEventSink.stop:499) - Closing hdfs://master:9000/flume/2015_04_09-1446.log
09 Apr 2015 14:47:21,213 INFO [agent-shutdown-hook] (org.apache.flume.sink.hdfs.BucketWriter.close:409) - Closing hdfs://master:9000/flume/_2015_04_09-1446.log.1428562035496.tmp
09 Apr 2015 14:47:21,214 INFO [hdfs-master-call-runner-9] (org.apache.flume.sink.hdfs.BucketWriter$3.call:339) - Close tries incremented
雖然我不太懂java,但是我還是硬着頭皮去源碼裏面查看<pre name="code" class="java">org.apache.flume.lifecycle.LifecycleSupervisor
對應的源碼,感覺應該不是這裏的問題。後來,感覺應該找 agent-shutdown-hook 相關的信息,果然找到了:
http://www.codatlas.com/github.com/apache/flume/HEAD/flume-ng-node/src/main/java/org/apache/flume/node/Application.java?line=345
public static void main(String[] args) {
try {
boolean isZkConfigured = false;
Options options = new Options();
Option option = new Option("n", "name", true, "the name of this agent");
option.setRequired(true);
options.addOption(option);
option = new Option("f", "conf-file", true,
"specify a config file (required if -z missing)");
option.setRequired(false);
options.addOption(option);
option = new Option(null, "no-reload-conf", false,
"do not reload config file if changed");
options.addOption(option);
// Options for Zookeeper
option = new Option("z", "zkConnString", true,
"specify the ZooKeeper connection to use (required if -f missing)");
option.setRequired(false);
options.addOption(option);
option = new Option("p", "zkBasePath", true,
"specify the base path in ZooKeeper for agent configs");
option.setRequired(false);
options.addOption(option);
option = new Option("h", "help", false, "display help text");
options.addOption(option);
CommandLineParser parser = new GnuParser();
CommandLine commandLine = parser.parse(options, args);
if (commandLine.hasOption('h')) {
new HelpFormatter().printHelp("flume-ng agent", options, true);
return;
}
String agentName = commandLine.getOptionValue('n');
boolean reload = !commandLine.hasOption("no-reload-conf");
if (commandLine.hasOption('z') || commandLine.hasOption("zkConnString")) {
isZkConfigured = true;
}
Application application = null;
if (isZkConfigured) {
// get options
String zkConnectionStr = commandLine.getOptionValue('z');
String baseZkPath = commandLine.getOptionValue('p');
if (reload) {
EventBus eventBus = new EventBus(agentName + "-event-bus");
List<LifecycleAware> components = Lists.newArrayList();
PollingZooKeeperConfigurationProvider zookeeperConfigurationProvider =
new PollingZooKeeperConfigurationProvider(
agentName, zkConnectionStr, baseZkPath, eventBus);
components.add(zookeeperConfigurationProvider);
application = new Application(components);
eventBus.register(application);
} else {
StaticZooKeeperConfigurationProvider zookeeperConfigurationProvider =
new StaticZooKeeperConfigurationProvider(
agentName, zkConnectionStr, baseZkPath);
application = new Application();
application.handleConfigurationEvent(zookeeperConfigurationProvider
.getConfiguration());
}
} else {
File configurationFile = new File(commandLine.getOptionValue('f'));
/*
* The following is to ensure that by default the agent will fail on
* startup if the file does not exist.
*/
if (!configurationFile.exists()) {
// If command line invocation, then need to fail fast
if (System.getProperty(Constants.SYSPROP_CALLED_FROM_SERVICE) ==
null) {
String path = configurationFile.getPath();
try {
path = configurationFile.getCanonicalPath();
} catch (IOException ex) {
logger.error("Failed to read canonical path for file: " + path,
ex);
}
throw new ParseException(
"The specified configuration file does not exist: " + path);
}
}
List<LifecycleAware> components = Lists.newArrayList();
if (reload) {
EventBus eventBus = new EventBus(agentName + "-event-bus");
PollingPropertiesFileConfigurationProvider configurationProvider =
new PollingPropertiesFileConfigurationProvider(
agentName, configurationFile, eventBus, 30);
components.add(configurationProvider);
application = new Application(components);
eventBus.register(application);
} else {
PropertiesFileConfigurationProvider configurationProvider =
new PropertiesFileConfigurationProvider(
agentName, configurationFile);
application = new Application();
application.handleConfigurationEvent(configurationProvider
.getConfiguration());
}
}
application.start();
final Application appReference = application;
<strong><span style="color:#ff0000;">Runtime.getRuntime().addShutdownHook(new Thread("agent-shutdown-hook") {
@Override
public void run() {
appReference.stop();
}
});</span></strong>
} catch (Exception e) {
logger.error("A fatal error occurred while running. Exception follows.",
e);
}
}
上面紅色代碼部分應該就是結束進程的代碼所在。
於是,我又查找 Runtime.getRuntime().addShutdownHook,在
http://www.codatlas.com/github.com/openjdk-mirror/jdk7u-jdk/master/src/share/classes/java/lang/Runtime.java?keyword=getRuntime&line=209
中找到了:
public void addShutdownHook(Thread hook) {
SecurityManager sm = System.getSecurityManager();
if (sm != null) {
sm.checkPermission(new RuntimePermission("shutdownHooks"));
}
ApplicationShutdownHooks.add(hook);
}
http://www.codatlas.com/github.com/openjdk-mirror/jdk7u-jdk/HEAD/src/share/classes/java/lang/ApplicationShutdownHooks.java?line=83中也有相關的代碼:
static synchronized void add(Thread hook) {
if(hooks == null)
throw new IllegalStateException("Shutdown in progress");
if (hook.isAlive())
throw new IllegalArgumentException("Hook already running");
if (hooks.containsKey(hook))
throw new IllegalArgumentException("Hook previously registered");
hooks.put(hook, hook);
}
另外在Application.java源碼中,有這麼一段:
public synchronized void stop() {
supervisor.stop();
if(monitorServer != null) {
monitorServer.stop();
}
}
然後我又去看 supervisor.stop()對應的源碼,可從Application.java 中
public Application(List<LifecycleAware> components) {
this.components = components;
supervisor = new LifecycleSupervisor();
}
可知:supervisor = new LifecycleSupervisor()
supervisor 是 LifecycleSupervisor的一個實例,故又轉而看LifecycleSupervisor所在的源碼:
http://www.codatlas.com/github.com/apache/flume/HEAD/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java?line=51
Application.java代碼中的
application.start();
<span style="color:#ff0000;">//啓動application之後再執行如下代碼??這樣不就是一直會執行stop()了麼?這中間的邏輯我還沒搞明白,期待高手幫忙解答。</span>
final Application appReference = application;
Runtime.getRuntime().addShutdownHook(new Thread("agent-shutdown-hook") {
@Override
public void run() {
appReference.stop();
}
});
如上面的紅色字體,目前還沒搞懂,望高手留言指點迷津,或發我郵箱[email protected] ,謝謝!