17.5.14

YARN: Some quick sanity checks


There are a lot of extension points in YARN.  Every time you modify something, you risk breaking some aspect of it: whether it's job submission, RPC, or container execution.

Here are some common modifications, and the ways they can cause submitted jobs to fail.

Environment variables

When hacking around on a YARN deployment and getting it set up, you need to make sure your environment variables are set.
env.sh: I run this before starting YARN.
export HADOOP_CONF_DIR=/etc/hadoop/conf
export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ 
export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
export HADOOP_YARN_HOME=/usr/lib/hadoop-yarn/
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce/
export YARN_HOME=/usr/lib/hadoop-yarn/
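After sourcing env.sh, it's worth a quick check that the variables actually took effect and point at real directories (this catches typos early; adjust to your layout):
# sanity check: the variables landed, and the directories exist
env | grep -E 'HADOOP|YARN|JAVA_HOME'
ls -d "$JAVA_HOME" "$HADOOP_CONF_DIR" "$HADOOP_LIBEXEC_DIR" "$HADOOP_YARN_HOME" "$HADOOP_MAPRED_HOME"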
If you are modifying things, you can also keep this handy restart script around and rerun it until you "get it right" and your YARN services are running properly:
start-yarn.sh: I keep this around to restart the YARN services
export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ 
export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
export HADOOP_COMMON_HOME=/usr/lib/hadoop/
/usr/lib/hadoop-yarn/sbin/yarn-daemon.sh stop nodemanager
/usr/lib/hadoop-yarn/sbin/yarn-daemon.sh stop resourcemanager 
/usr/lib/hadoop-yarn/sbin/yarn-daemon.sh start resourcemanager
/usr/lib/hadoop-yarn/sbin/yarn-daemon.sh start nodemanager 
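Once the script finishes, make sure the daemons actually came up before submitting anything (jps ships with the JDK; yarn node -list asks the ResourceManager which NodeManagers have registered):
# both daemons should show up here
jps | egrep 'ResourceManager|NodeManager'
# and the NodeManager(s) should be registered and RUNNING
yarn node -list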

yarn.application.classpath

If this is set incorrectly, all sorts of havoc can occur: you can get "no filesystem for scheme" errors, or many kinds of missing-class errors.  A quick way to debug is to manually add your Hadoop library paths, like this:
 <property>
    <description>Classpath for typical applications.</description>
    <name>yarn.application.classpath</name>
    <value>
        /usr/lib/hadoop-yarn/lib/*,/usr/lib/hadoop-hdfs/lib/*,
        /usr/lib/hadoop-yarn/*,/usr/lib/hadoop/*,
        /usr/lib/hadoop-mapred/*,/usr/lib/hadoop-mapreduce/*,
        /usr/lib/hadoop/lib/*,
        /etc/hadoop/conf/,/etc/hadoop/conf/*,
        /usr/lib/hadoop-hdfs/*,
        $HADOOP_CONF_DIR,
        $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
        $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
        $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
        $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
    </value>
 </property>
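You can cross-check what actually gets resolved using the stock yarn CLI; the loop at the end is just a throwaway helper to flag entries whose glob matches nothing on disk:
# print the resolved classpath, one entry per line
yarn classpath | tr ':' '\n' | sort -u
# flag entries that match nothing on this machine
yarn classpath | tr ':' '\n' | sort -u | while read -r entry; do
    ls -d ${entry} >/dev/null 2>&1 || echo "MISSING: ${entry}"
done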


The FileSystem permissions

Jobs can fail because the ResourceManager can't read the job submission directory. Also, if certain user home directories don't exist or have bad permissions, you can get errors.  For a FileSystem-neutral schema of what a base Hadoop cluster's DFS looks like, see BIGTOP-1200, which attempts to ease the pain for anyone deploying a Hadoop cluster by unifying the generally expected distributed file system layout.  Since most Hadoop distros create at least SOME of these directories, typical users will use init-hcfs.json simply to sanity check their cluster (i.e. /hbase should be readable by hbase, and so on).
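A few one-liners go a long way here. The paths below are common defaults (your staging dir is whatever yarn.app.mapreduce.am.staging-dir points at, and "alice" is a stand-in for a real submitting user), so adjust to taste:
# /tmp should be world-writable with the sticky bit (drwxrwxrwt)
hadoop fs -ls -d /tmp
# every submitting user needs a home dir they own
hadoop fs -ls /user
# create one if it's missing (run as the HDFS superuser)
sudo -u hdfs hadoop fs -mkdir -p /user/alice
sudo -u hdfs hadoop fs -chown alice:alice /user/alice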


The container implementation and the security settings


Jobs can fail because the Linux container executor binary doesn't have the right permissions or ownership.  Jobs can fail because the security mode is simple BUT the user 'nobody' doesn't exist on the system.  And finally, jobs can fail because the container-executor.cfg file doesn't have the right contents, permissions, or group restrictions.
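The usual checks look like this (the binary's location varies by distro; /usr/lib/hadoop-yarn/bin is an assumption following the layout used above):
# the container-executor binary must be root-owned, group-owned by the
# NodeManager's group, and setuid -- something like ---Sr-s--- root yarn
ls -l /usr/lib/hadoop-yarn/bin/container-executor
# the cfg must be root-owned and not group/world writable; check
# yarn.nodemanager.linux-container-executor.group, banned.users, min.user.id
ls -l /etc/hadoop/conf/container-executor.cfg
cat /etc/hadoop/conf/container-executor.cfg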

YARN Writable and OWNED directories

There are certain directories that YARN needs to own.  If it doesn't own them, you will get job hangs: the NodeManager will be running, but your nodes will sit in the UNHEALTHY state.  Make sure, for example, that the user your NodeManager runs as (typically yarn) owns the /var/log/hadoop-yarn/containers/ directory.
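For example (the log path is the one from the paragraph above; also check whatever yarn.nodemanager.local-dirs points at on your cluster):
# should be owned by the yarn user
ls -ld /var/log/hadoop-yarn/containers/
# fix it if it isn't, then restart the NodeManager
chown -R yarn:yarn /var/log/hadoop-yarn/containers/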


The amount of memory per container

Jobs can hang after submission (before 0% completion) because no YARN NodeManager is available with the requisite minimum amount of memory.  Again, this is another configuration parameter that can bite you on VMs or small cluster nodes, and it results in a nasty HANG that is hard to debug.  To fix it, set yarn.scheduler.minimum-allocation-mb to something reasonably small (a gigabyte or two) and restart your nodes:

 <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>2048</value>
 </property>
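To see what each node is actually offering (and confirm a request can ever be satisfied), ask the ResourceManager; both of these are stock interfaces, with RM_HOST and the default port 8088 as placeholders:
# registered NodeManagers and their state
yarn node -list
# cluster-wide availableMB, among other metrics (RM REST API)
curl -s http://RM_HOST:8088/ws/v1/cluster/metrics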

shuffle.port and other "close, but not quite" parameter names

When Hadoop 2.2 came out, the aux-services value mapreduce.shuffle got changed to mapreduce_shuffle.  Likewise, there are many other parameters (see http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/DeprecatedProperties.html) that continue to change over time.  Possibly https://issues.apache.org/jira/browse/MAPREDUCE-5894 will result in a fix for this, but for now, the configuration surface in Hadoop is vast, and getting the right parameters can be quite tricky.  A good resource for checking your configuration parameters against a real Hadoop cluster is the Puppet recipes in Apache Bigtop: for example, see https://github.com/apache/bigtop/tree/master/bigtop-deploy/puppet/modules/hadoop/templates/yarn-site.xml.
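A quick grep catches the most common victim of that rename (the conf path assumes /etc/hadoop/conf, as in the scripts above):
# on 2.2+ this should print <value>mapreduce_shuffle</value>, not mapreduce.shuffle
grep -A 1 'yarn.nodemanager.aux-services' /etc/hadoop/conf/yarn-site.xml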







