ENABLING DEBUG LOGGING – EMR MASTER GUIDE

This guide collects the configurations and procedures for enabling debug logging on the various daemons of an AWS EMR cluster.
[Please contribute to this article to add additional ways to enable logging]

HBASE on S3 :

 

 

    {
        "Classification": "hbase-log4j",
        "Properties": {
            "log4j.logger.com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.request": "DEBUG",
            "log4j.logger.com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.latency": "ERROR"
        }
    }

This enables logging of the calls EMRFS makes from HBase. It is important for troubleshooting S3 consistency issues and failures on an HBase-on-S3 cluster.
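Configuration classifications like this are supplied at cluster creation. A minimal sketch of preparing the JSON file and validating it before handing it to the AWS CLI (the file path and the elided cluster options are placeholders):

```shell
# Save the classification above to a file (path is an arbitrary choice)
cat > /tmp/hbase-log4j-debug.json <<'EOF'
[
  {
    "Classification": "hbase-log4j",
    "Properties": {
      "log4j.logger.com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.request": "DEBUG",
      "log4j.logger.com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.latency": "ERROR"
    }
  }
]
EOF

# Sanity-check that the file is valid JSON before using it
python3 -m json.tool /tmp/hbase-log4j-debug.json > /dev/null && echo "valid JSON"

# Then pass it when creating the cluster, e.g.:
# aws emr create-cluster ... --configurations file:///tmp/hbase-log4j-debug.json
```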

Enabling DEBUG on the Hive Metastore daemon (and its Datastore) on EMR :

 

 

vim /etc/hive/conf/hive-log4j2.properties

status = DEBUG
name = HiveLog4j2

logger.DataNucleus.name = DataNucleus
logger.DataNucleus.level = DEBUG

logger.Datastore.name = Datastore
logger.Datastore.level = DEBUG

sudo stop hive-hcatalog-server
sudo start hive-hcatalog-server

or

 

    {
        "Classification": "hive-log4j2",
        "Properties": {
            "logger.Datastore.level": "DEBUG",
            "logger.DataNucleus.level": "INFO"
        }
    }

Logs are written to /var/log/hive/user/hive/hive.log.

HUE:

Set use_get_log_api=true in the [beeswax] section of the hue.ini configuration file.
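For reference, the setting sits under the [beeswax] section like so (surrounding keys omitted):

```ini
[beeswax]
  # Have Hue fetch query logs from HiveServer2 via the Thrift GetLog() API
  use_get_log_api=true
```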

Hadoop and MR :

 

 

[
    {
        "Classification": "hadoop-log4j",
        "Properties": {
            "log4j.logger.com.amazonaws": "DEBUG",
            "log4j.logger.com.amazonaws.http.AmazonHttpClient": "DEBUG",
            "log4j.logger.org.apache.hadoop.fs.s3a.S3AFileSystem": "DEBUG",
            "log4j.logger.emr": "DEBUG",
            "hadoop.root.logger": "DEBUG,console"
        }
    },
    {
        "Classification": "mapred-env",
        "Configurations": [
            {
                "Classification": "export",
                "Configurations": [],
                "Properties": {
                    "HADOOP_MAPRED_ROOT_LOGGER": "DEBUG,DRFA"
                }
            }
        ],
        "Properties": {}
    }
]

 

Enable verbose GC on the HiveServer2 JVM :

 

 

#!/bin/bash

# This script appends custom HADOOP_OPTS for hiveserver2 to hive-env.sh on EMR 4.x.x/5.x.x and restarts the HS2 process.
# These options enable verbose GC on HS2, which is logged to /var/log/hive/hive-server2.out.
# On OOM, just before HS2 is killed (with kill -9), the JVM also writes a heap dump to /var/log/hive.

echo '' | sudo tee --append /usr/lib/hive/conf/hive-env.sh

echo 'if [ "$SERVICE" = "hiveserver2" ]
then
export HADOOP_OPTS="$HADOOP_OPTS -server -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hive"
fi' | sudo tee --append /usr/lib/hive/conf/hive-env.sh

echo "stopping HS2"
sudo stop hive-server2
sleep 5
echo "starting HS2"
sudo start hive-server2

 

WIRE or DEBUG logging on EMR to check calls to S3 and DynamoDB from the DynamoDB connector library :

Paste the following into the log4j configuration of Hadoop / Hive / Spark etc.:

/etc/hadoop/conf/log4j.properties
/etc/hadoop/conf/container-log4j.properties
/etc/hive/conf/hive-log4j2.properties
/etc/spark/conf/..

 

 

log4j.rootCategory=DEBUG, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ABSOLUTE} %5p %t %c{2} - %m%n
log4j.logger.org.apache.hadoop.hive=ALL

 

https://github.com/awslabs/emr-dynamodb-connector/blob/master/emr-dynamodb-hive/src/test/resources/log4j.properties

Debug on S3 calls from EMR Hive :

These metrics can be obtained from hive.log when debug logging is enabled for the aws-java-sdk. To enable it, add the following line to /etc/hive/conf/hive-log4j.properties. The Configuration API can be used as well.

 

 

log4j.logger.com.amazonaws=DEBUG
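The equivalent through the Configuration API would be a classification along these lines (hive-log4j applies to releases where Hive uses log4j 1.x; on releases that ship hive-log4j2, set the property through that classification instead):

```json
{
    "Classification": "hive-log4j",
    "Properties": {
        "log4j.logger.com.amazonaws": "DEBUG"
    }
}
```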

 

 

 

To count the number of HEAD calls made, execute:

$ grep -c "Sending Request: HEAD" /var/log/hive/tmp/hadoop/hive.log
4

To count the number of GET calls made, execute:

$ grep -c "Sending Request: GET" /var/log/hive/tmp/hadoop/hive.log

 

Enable DEBUG logging for the HTTP connection pool :

Enable it from Spark by adding the following to /etc/spark/conf/log4j.properties:

 

 

log4j.logger.com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager=DEBUG

 

*Note that Tez overwrites the log-level options we have passed; use the following tez-site classification to set them explicitly.*

 

 

{
    "Classification": "tez-site",
    "Properties": {
        "tez.task.log.level": "DEBUG",
        "tez.am.log.level": "DEBUG",
        "tez.root.logger": "DEBUG,CLA",
        "tez.task-specific.launch.cmd-opts": "-Dlog4j.configuration=log4j.properties"
    }
}

 

Enabling DEBUG on the Hadoop log to log calls made by EMRFS :

/etc/hadoop/conf/log4j.properties

 

 

hadoop.root.logger=DEBUG,console

# Jets3t library
log4j.logger.org.jets3t.service.impl.rest.httpclient.RestS3Service=DEBUG

# AWS SDK & S3A FileSystem
log4j.logger.com.amazonaws.http.AmazonHttpClient=DEBUG
log4j.logger.org.apache.hadoop.fs.s3a.S3AFileSystem=WARN
log4j.logger.com.amazonaws=DEBUG
log4j.logger.com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.request=DEBUG

# Logs the AWS request id and S3 "extended request id" from each response
log4j.logger.com.amazon.ws.emr.hadoop.fs.s3.lite.handler.RequestIdLogger=DEBUG

 

You can use the same logging config for other applications like Spark/HBase, using the respective log4j config files as appropriate. You can also use EMR log4j configuration classifications like hadoop-log4j or spark-log4j to set these configs when starting the EMR cluster (see below for a sample JSON for the Configuration API).

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html

DEBUG on EMR Logpusher Logs :

Edit this file manually on the master and slave nodes and restart Logpusher.

/etc/logpusher/logpusher-log4j.properties

 

 

log4j.rootLogger=DEBUG,DRFA
log4j.threshhold=INFO
log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA.File=/emr/logpusher/log/logpusher.log
log4j.appender.DRFA.DatePattern=.yyyy-MM-dd-HH
log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %p %t: %m%n
log4j.logger.org.apache.commons.httpclient.contrib.ssl.AuthSSLX509TrustManager=INFO
log4j.logger.aws157.instancecontroller=DEBUG

 

 

 

[ec2-user@ip-10-1-2-175 ~]$ sudo service logpusher stop
Stopped process in pidfile `/emr/logpusher/run/logpusher.pid' (pid 13677).
[ec2-user@ip-10-1-2-175 ~]$ sudo service logpusher status
Running [ OK ]

 

(You might need to stop service-nanny before stopping Logpusher in order to stop/start Logpusher properly.)

DEBUG on Spark classes :

Use the following EMR config to set DEBUG level for relevant class files.

 

 

[
    {
        "Classification": "hadoop-log4j",
        "Properties": {
            "log4j.logger.org.apache.spark.network.crypto": "DEBUG"
        }
    },
    {
        "Classification": "spark-log4j",
        "Properties": {
            "log4j.logger.org.apache.spark.network.crypto": "DEBUG"
        }
    }
]

 

DEBUG using spark shell:

Execute the following commands after invoking spark-shell to enable DEBUG logging on the respective Spark classes, like MemoryStore. You can use the same approach to reduce the amount of logging from INFO (the default coming from log4j.properties in the Spark conf) to ERROR.

 

 

import org.apache.log4j.{Level, Logger}

Logger.getLogger("org.apache.spark.storage.BlockManagerMasterEndpoint").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.storage.BlockManagerInfo").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.storage.BlockManagerMaster").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.storage.MemoryStore").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.scheduler.cluster.YarnScheduler").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.scheduler.cluster.YarnSchedulerBackend").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.scheduler.DAGScheduler").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.scheduler.TaskSetManager").setLevel(Level.DEBUG) // good for progress tracking
Logger.getLogger("org.apache.spark.ContextCleaner").setLevel(Level.DEBUG)
Logger.getLogger("akka.remote.ReliableDeliverySupervisor").setLevel(Level.DEBUG)

 

EMRFS CLI commands like emrfs sync :

/etc/hadoop/conf/log4j.properties

 

 

hadoop.root.logger=DEBUG,console

 

Logs will go to the console output. You may need to redirect them to a file, or do both.
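For example, tee can capture the console output to a file while still printing it (the bucket path and log-file location are placeholders):

```shell
# Capture stdout and stderr of the sync to a file and keep the console output
emrfs sync s3://my-bucket/my-prefix 2>&1 | tee /tmp/emrfs-sync-debug.log
```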

Enable Debug on Boto3 client :

 

 

# Create the boto3 client
import boto3
import json
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()

client = boto3.client('sagemaker-runtime')

endpoint_name = "my-model"  # Your endpoint name.
payload = {"data": "this makes me ", "k": 5}  # Payload for inference.

response = client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType='application/json',
                                  Body=json.dumps(payload))
response_body = response['Body']
print(response_body.read())
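Note that logging.basicConfig(level=logging.DEBUG) above enables DEBUG for every logger in the process, which can be very noisy. A sketch of scoping DEBUG to just the AWS SDK loggers using only the standard logging module (the logger names are the ones boto3/botocore publish):

```python
import logging

# Keep the root logger at WARNING so unrelated libraries stay quiet
logging.basicConfig(level=logging.WARNING)

# Enable DEBUG only for the AWS SDK loggers; boto3 also provides
# boto3.set_stream_logger() as a convenience for the same effect.
for name in ("boto3", "botocore"):
    logging.getLogger(name).setLevel(logging.DEBUG)

print(logging.getLogger("botocore").getEffectiveLevel() == logging.DEBUG)  # True
```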
