Remote Debugging of Hadoop Job with Eclipse

Introduction:

When we create a MapReduce application in Java and run the generated JAR on the Hadoop platform, we may need to debug that MapReduce application remotely at runtime.

Hadoop runs in three modes:
1) Standalone
2) Pseudo distributed
3) Fully distributed

It is possible to debug a Hadoop MapReduce application in all three modes.

It is also possible to debug the Mapper and Reducer tasks themselves, which are executed in containers (launched via YarnChild.java, part of the Hadoop framework) when a job is submitted. For this, refer to "Debugging Child Processes in Hadoop" at the end of this post.

Scenario:
You have a virtual machine on which Hadoop is installed, and you want to debug a MapReduce application from Eclipse on Windows, on the same machine or on another machine.

To achieve this:

1) Modify the conf/hadoop-env.sh file in the Hadoop installation directory.

# cd /root/hadoop/hadoop-1.0.4/conf

Open hadoop-env.sh and add the following line:

export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5432 $HADOOP_OPTS"

jdwp is the Java Debug Wire Protocol.
suspend=y suspends the JVM at startup until a debugger attaches.
address=<PORT> is the port on which Hadoop will listen for a debugger connection.
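As a quick sketch, the whole edit can be done from the shell (the path below matches the Hadoop 1.0.4 installation used in this post; adjust it for your setup):

```shell
# Hypothetical path -- adjust to your Hadoop installation.
HADOOP_CONF=/root/hadoop/hadoop-1.0.4/conf

# Append the JDWP debug options to hadoop-env.sh (run once).
cat >> "$HADOOP_CONF/hadoop-env.sh" <<'EOF'
export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5432 $HADOOP_OPTS"
EOF
```

With suspend=y, every subsequent hadoop command will block at JVM startup (printing "Listening for transport dt_socket at address: 5432") until a debugger attaches.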

2) Now run the job on Hadoop:


# hadoop jar /root/hadoop/app/WordCount.jar /root/hadoop/app/input/file1 /root/hadoop/app/output/file1

3) Now go to the Windows machine/another VM where Eclipse is installed.
a) You should have the same MapReduce project in your Eclipse workspace.
b) Right-click the project -> Debug As -> Debug Configurations -> Remote Java Application

Debug Configuration

i) Browse to the project in your workspace.
ii) In the Host field, specify the IP of the VM where Hadoop is running.
iii) Set the port number equal to the port set in the HADOOP_OPTS value in hadoop-env.sh.

For example, if we had set

HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5000"

we would set port=5000 in the debug configuration.

Click Debug to start debugging.

* Debugging Hadoop core components

1) Modify the file $HADOOP_HOME/etc/hadoop/yarn-env.sh and add the following line:

YARN_OPTS="$YARN_OPTS -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=51234"

Add the following property to the file $HADOOP_HOME/etc/hadoop/mapred-site.xml, inside the <configuration> block. This enables the YARN framework, so the job will run on YARN.

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

2) Execute the following commands

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
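Before attaching Eclipse, it is worth a quick sanity check that a daemon JVM actually opened the debug port (51234 here, matching the address= value set in yarn-env.sh above):

```shell
# List listening TCP sockets and look for the JDWP port (51234).
# Falls back to `ss` on systems without netstat.
(netstat -tln 2>/dev/null || ss -tln) | grep 51234 \
  || echo "no listener on 51234 -- was yarn-env.sh picked up?"
```

If nothing is listed, the yarn-env.sh change was probably not picked up; restart the daemons and check again.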

Follow the same steps as we did for debugging a MapReduce job.

  • Debugging Child Processes in Hadoop

1. Set the following property in mapred-site.xml:

<property>
<name>mapred.child.java.opts</name>
<value>-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5432</value>
</property>

2. Follow the steps above to debug.

3. You can also debug child processes (Mapper/Reducer) on a fully distributed Hadoop cluster. However, you do not know in advance which datanode is running the current Mapper/Reducer task, so you have to find it by trying the IPs of the datanodes with the configured port (5432) in Eclipse remote debugging.
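The trial-and-error search over datanodes can be scripted: with suspend=y the child JVM sits listening on the JDWP port, so probing that port on each datanode finds the right host. A minimal sketch (the IPs below are placeholders for your own datanodes):

```shell
#!/bin/bash
# Sketch: probe each datanode for an open JDWP port to find the one
# running the suspended child JVM. Replace the IPs with your datanodes.
PORT=5432
for ip in 192.168.1.101 192.168.1.102 192.168.1.103; do
  # /dev/tcp/<host>/<port> is a bash feature; the redirect succeeds
  # only if something is listening on that port.
  if timeout 2 bash -c "echo > /dev/tcp/$ip/$PORT" 2>/dev/null; then
    echo "JDWP listener found on $ip:$PORT"
  fi
done
```

Point the Eclipse remote debug configuration at whichever IP the script reports.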


11 responses to “Remote Debugging of Hadoop Job with Eclipse”

  1. Hi: I’m running Hadoop 1.0.4 on MacOS 10.6.8. If I add this line to hadoop-env.sh: export HADOOP_OPTS=”-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5432$HADOOP_OPTS” and then start all services using bin/start-all.sh, I get the following error for each service (datanode, namenode, jobtracker, etc.):
    ERROR: Cannot load this JVM TI agent twice, check your java command line for duplicate jdwp options. Error occurred during initialization of VM

    Any suggestions?
    Thanks in advance, Lorena

  2. You don’t even need to restart the services; you can set HADOOP_OPTS and debug while the services are running. When restarting, just comment out the HADOOP_OPTS line.

  3. Hey, this does not seem to work in pseudo-distributed mode.

    Should I put the Hadoop JVM in debug mode using a command like “java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000”?

  4. We are setting this option on the Hadoop child JVMs (the JVMs in which the map/reduce tasks run). These JVMs are created at runtime, one per task, while the job is running, so we cannot pass remote-debugging options to them from the command line. Instead, set the following in mapred-site.xml:

    <property>
    <name>mapred.child.java.opts</name>
    <value>-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5432</value>
    </property>

  5. Hi, I connect to my VM using a public key,
    i.e. I ssh to the VM where Hadoop is installed as
    “ssh -i key.pem ubuntu@vm_ip”. How can I provide the public key in Eclipse debug mode?

      • For remote debugging, the VM simply listens on a port number; no public key is needed, because we are not opening an SSH connection to the VM.
        Please post your exact errors and screenshots of what is happening.

  6. What do you mean by debugging core components? I am using MapReduce 1. When my code reaches
    status = jobSubmitClient.submitJob(
    jobId, submitJobDir.toString(), jobCopy.getCredentials());
    in JobClient.java, I am not able to see what is happening inside this submitJob() method. Is there anything I missed?
