ENABLING DEBUG LOGGING – EMR MASTER GUIDE

This guide collects the configurations and procedures for enabling debug logging on the various daemons of an AWS EMR cluster.
[Please contribute to this article to add additional ways to enable logging]

HBASE on S3 :

 

 

    {
        "Classification": "hbase-log4j",
        "Properties": {
            "log4j.logger.com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.request": "DEBUG",
            "log4j.logger.com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.latency": "ERROR"
        }
    }

This enables logging of the calls EMRFS makes from HBase. It is important for troubleshooting S3 consistency issues and failures on an HBase-on-S3 cluster.
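Configuration classifications like this are supplied at cluster creation. A minimal sketch of preparing the JSON file and validating it before handing it to the AWS CLI (the file path and the elided cluster options are placeholders):

```shell
# Save the classification above to a file (path is an arbitrary choice)
cat > /tmp/hbase-log4j-debug.json <<'EOF'
[
  {
    "Classification": "hbase-log4j",
    "Properties": {
      "log4j.logger.com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.request": "DEBUG",
      "log4j.logger.com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.latency": "ERROR"
    }
  }
]
EOF

# Sanity-check that the file is valid JSON before using it
python3 -m json.tool /tmp/hbase-log4j-debug.json > /dev/null && echo "valid JSON"

# Then pass it when creating the cluster, e.g.:
# aws emr create-cluster ... --configurations file:///tmp/hbase-log4j-debug.json
```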

Enabling DEBUG on the Hive Metastore daemon (and its Datastore) on EMR :

 

 

vim /etc/hive/conf/hive-log4j2.properties

status = DEBUG
name = HiveLog4j2

logger.DataNucleus.name = DataNucleus
logger.DataNucleus.level = DEBUG

logger.Datastore.name = Datastore
logger.Datastore.level = DEBUG

sudo stop hive-hcatalog-server
sudo start hive-hcatalog-server

or

 

    {
        "Classification": "hive-log4j2",
        "Properties": {
            "logger.Datastore.level": "DEBUG",
            "logger.DataNucleus.level": "INFO"
        }
    }

Logs are written to /var/log/hive/user/hive/hive.log.

HUE:

Set use_get_log_api=true in the [beeswax] section of the hue.ini configuration file.
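For reference, the setting sits under the [beeswax] section like so (surrounding keys omitted):

```ini
[beeswax]
  # Have Hue fetch query logs from HiveServer2 via the Thrift GetLog() API
  use_get_log_api=true
```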

Hadoop and MR :

 

 

[
    {
        "Classification": "hadoop-log4j",
        "Properties": {
            "log4j.logger.com.amazonaws": "DEBUG",
            "log4j.logger.com.amazonaws.http.AmazonHttpClient": "DEBUG",
            "log4j.logger.org.apache.hadoop.fs.s3a.S3AFileSystem": "DEBUG",
            "log4j.logger.emr": "DEBUG",
            "hadoop.root.logger": "DEBUG,console"
        }
    },
    {
        "Classification": "mapred-env",
        "Configurations": [
            {
                "Classification": "export",
                "Configurations": [],
                "Properties": {
                    "HADOOP_MAPRED_ROOT_LOGGER": "DEBUG,DRFA"
                }
            }
        ],
        "Properties": {}
    }
]

 

Enable verbose GC on the HiveServer2 JVM :

 

 

#!/bin/bash

# This script appends custom HADOOP_OPTS for hiveserver2 to hive-env.sh on EMR 4.x.x/5.x.x and restarts the HS2 process.
# These options enable verbose GC on HS2, which is logged to /var/log/hive/hive-server2.out.
# On OOM, just before HS2 is killed (with kill -9), the JVM also writes a heap dump to /var/log/hive.

echo '' | sudo tee --append /usr/lib/hive/conf/hive-env.sh

echo 'if [ "$SERVICE" = "hiveserver2" ]
then
export HADOOP_OPTS="$HADOOP_OPTS -server -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hive"
fi' | sudo tee --append /usr/lib/hive/conf/hive-env.sh

echo "stopping HS2"
sudo stop hive-server2
sleep 5
echo "starting HS2"
sudo start hive-server2

 

WIRE or DEBUG logging on EMR to check calls to S3 and DynamoDB from the DynamoDB connector library :

Paste the following into the log4j configuration of Hadoop / Hive / Spark etc.:

/etc/hadoop/conf/log4j.properties
/etc/hadoop/conf/container-log4j.properties
/etc/hive/conf/hive-log4j2.properties
/etc/spark/conf/..

 

 

log4j.rootCategory=DEBUG, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ABSOLUTE} %5p %t %c{2} - %m%n
log4j.logger.org.apache.hadoop.hive=ALL

 

https://github.com/awslabs/emr-dynamodb-connector/blob/master/emr-dynamodb-hive/src/test/resources/log4j.properties

Debug on S3 calls from EMR Hive :

These metrics can be obtained from hive.log when debug logging is enabled for the aws-java-sdk. To enable it, add the following line to /etc/hive/conf/hive-log4j.properties. The Configuration API can be used as well.

 

 

log4j.logger.com.amazonaws=DEBUG
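The equivalent through the Configuration API would be a classification along these lines (hive-log4j applies to releases where Hive uses log4j 1.x; on releases that ship hive-log4j2, set the property through that classification instead):

```json
{
    "Classification": "hive-log4j",
    "Properties": {
        "log4j.logger.com.amazonaws": "DEBUG"
    }
}
```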

 

 

 

To count the number of HEAD calls made, execute:

$ grep -c "Sending Request: HEAD" /var/log/hive/tmp/hadoop/hive.log
4

To count the number of GET calls made, execute:

$ grep -c "Sending Request: GET" /var/log/hive/tmp/hadoop/hive.log

 

Enable DEBUG logging for the HTTP connection pool :

Enable it from Spark by adding the following to /etc/spark/conf/log4j.properties:

 

 

log4j.logger.com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager=DEBUG

 

*Note that Tez overwrites the log-level options we have passed; use the following tez-site classification to set them explicitly.*

 

 

{
    "Classification": "tez-site",
    "Properties": {
        "tez.task.log.level": "DEBUG",
        "tez.am.log.level": "DEBUG",
        "tez.root.logger": "DEBUG,CLA",
        "tez.task-specific.launch.cmd-opts": "-Dlog4j.configuration=log4j.properties"
    }
}

 

Enabling DEBUG on the Hadoop log to log calls made by EMRFS :

/etc/hadoop/conf/log4j.properties

 

 

hadoop.root.logger=DEBUG,console

# Jets3t library
log4j.logger.org.jets3t.service.impl.rest.httpclient.RestS3Service=DEBUG

# AWS SDK & S3A FileSystem
log4j.logger.com.amazonaws.http.AmazonHttpClient=DEBUG
log4j.logger.org.apache.hadoop.fs.s3a.S3AFileSystem=WARN
log4j.logger.com.amazonaws=DEBUG
log4j.logger.com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.request=DEBUG

# Logs the AWS request id and S3 "extended request id" from each response
log4j.logger.com.amazon.ws.emr.hadoop.fs.s3.lite.handler.RequestIdLogger=DEBUG

 

You can use the same logging config for other applications like Spark/HBase, using the respective log4j config files as appropriate. You can also use EMR log4j configuration classifications like hadoop-log4j or spark-log4j to set these configs when starting the EMR cluster (see below for a sample JSON for the Configuration API).

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html

DEBUG on EMR Logpusher Logs :

Edit this file manually on the master and slave nodes and restart Logpusher.

/etc/logpusher/logpusher-log4j.properties

 

 

log4j.rootLogger=DEBUG,DRFA
log4j.threshhold=INFO
log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA.File=/emr/logpusher/log/logpusher.log
log4j.appender.DRFA.DatePattern=.yyyy-MM-dd-HH
log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %p %t: %m%n
log4j.logger.org.apache.commons.httpclient.contrib.ssl.AuthSSLX509TrustManager=INFO
log4j.logger.aws157.instancecontroller=DEBUG

 

 

 

[ec2-user@ip-10-1-2-175 ~]$ sudo service logpusher stop
Stopped process in pidfile `/emr/logpusher/run/logpusher.pid' (pid 13677).
[ec2-user@ip-10-1-2-175 ~]$ sudo service logpusher status
Running [ OK ]

 

(You might need to stop service-nanny before stopping Logpusher in order to stop/start Logpusher properly.)

DEBUG on Spark classes :

Use the following EMR config to set DEBUG level for relevant class files.

 

 

[
    {
        "Classification": "hadoop-log4j",
        "Properties": {
            "log4j.logger.org.apache.spark.network.crypto": "DEBUG"
        }
    },
    {
        "Classification": "spark-log4j",
        "Properties": {
            "log4j.logger.org.apache.spark.network.crypto": "DEBUG"
        }
    }
]

 

DEBUG using spark shell:

Execute the following commands after invoking spark-shell to enable DEBUG logging on the respective Spark classes, like MemoryStore. You can use the same approach to reduce the amount of logging from INFO (the default coming from log4j.properties in the Spark conf) to ERROR.

 

 

import org.apache.log4j.{Level, Logger}

Logger.getLogger("org.apache.spark.storage.BlockManagerMasterEndpoint").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.storage.BlockManagerInfo").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.storage.BlockManagerMaster").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.storage.MemoryStore").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.scheduler.cluster.YarnScheduler").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.scheduler.cluster.YarnSchedulerBackend").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.scheduler.DAGScheduler").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.scheduler.TaskSetManager").setLevel(Level.DEBUG) // good for progress tracking
Logger.getLogger("org.apache.spark.ContextCleaner").setLevel(Level.DEBUG)
Logger.getLogger("akka.remote.ReliableDeliverySupervisor").setLevel(Level.DEBUG)

 

EMRFS CLI commands like emrfs sync :

/etc/hadoop/conf/log4j.properties

 

 

hadoop.root.logger=DEBUG,console

 

Logs will go to the console output. You may need to redirect them to a file, or do both.
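For example, tee can capture the console output to a file while still printing it (the bucket path and log-file location are placeholders):

```shell
# Capture stdout and stderr of the sync to a file and keep the console output
emrfs sync s3://my-bucket/my-prefix 2>&1 | tee /tmp/emrfs-sync-debug.log
```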

Enable Debug on Boto3 client :

 

 

# Create the boto3 client
import boto3
import json
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()

client = boto3.client('sagemaker-runtime')

endpoint_name = "my-model"  # Your endpoint name.
payload = {"data": "this makes me ", "k": 5}  # Payload for inference.

response = client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType='application/json',
                                  Body=json.dumps(payload))
response_body = response['Body']
print(response_body.read())
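Note that logging.basicConfig(level=logging.DEBUG) above enables DEBUG for every logger in the process, which can be very noisy. A sketch of scoping DEBUG to just the AWS SDK loggers using only the standard logging module (the logger names are the ones boto3/botocore publish):

```python
import logging

# Keep the root logger at WARNING so unrelated libraries stay quiet
logging.basicConfig(level=logging.WARNING)

# Enable DEBUG only for the AWS SDK loggers; boto3 also provides
# boto3.set_stream_logger() as a convenience for the same effect.
for name in ("boto3", "botocore"):
    logging.getLogger(name).setLevel(logging.DEBUG)

print(logging.getLogger("botocore").getEffectiveLevel() == logging.DEBUG)  # True
```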
