28.4.14

How Bigtop packages Hadoop

I've been trying to learn more about the RPM packaging of hadoop services lately... and I realized it takes a lot of time to look up all the RPM syntax, so I decided to take some notes here.

Bigtop packages the entire upstream hadoop ecosystem for us.  It does this, in general, by building the same jars that come from the hadoop distros, without patching them, and converting them into rpm/deb packages.

So... that means the hadoop tarballs you're used to get split out into a linux-friendly package structure (/usr/lib/hadoop/, /etc/hadoop/conf, and so on).

What does this mean for Java developers?  It means that, finally, after all these years, we'll have to learn something about how linux packages stuff.  A maven repo is no longer enough when it comes to your big data applications.  Remember: to run hadoop, you need a system that is well organized and uniform across the entire cluster... and for that you really need first-class packaging.

I'll update this over time as I learn more.  First we'll start off with the two "main" components that drive the creation of an RPM from java sources: do-component-build and the spec file.

Specifically, we're looking at hadoop here, but for simpler examples you can dive into the bigtop source and see how tools such as mahout are packaged.

The do-component-build file

BigTop builds jars directly from upstream project sources.  No actual patching is done.  The build artifacts of the upstream sources are then decomposed into proper linux packages.  At the heart of the packaging is the .spec file, of course, but the raw artifacts for a BigTop package (the jar files and so on) that the spec file places into linux directories are produced by the do-component-build file.  Each hadoop ecosystem component has one.
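To make that concrete, here is a minimal sketch of what a do-component-build file can look like for a Maven-based component like hadoop.  The goals and flags below are illustrative only, not copied from the real hadoop one; they vary per component and per Bigtop version.

    # do-component-build (sketch): an ordinary shell script that Bigtop runs
    # against the unpacked upstream source tree.  Flags here are illustrative.
    set -ex

    # build the upstream jars/tarball without running the test suite
    mvn clean install -DskipTests -Pdist -Dtar "$@"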


 The hadoop.spec file


As mentioned above, the actual packaging of raw apache artifacts into RPMs, which ultimately determines how hadoop components get split up into linux directories (/etc/, /usr/lib, and so on), is done by a spec file for each component.  So, let's poke around in the source for the hadoop.spec file: bigtop-packaging/src/common/rpm/SPECS/hadoop.spec.

%define (see also RPM Macros)

Before anything, RPM specifications (like any program) define a bunch of constants.

The %define directive in rpm declares macro expansions (a fancy word for variables).  So, for example, later on we will see references to "etc_yarn", which is defined near the top of the spec:

%define etc_yarn /etc/yarn
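As a hedged sketch (only etc_yarn actually appears above; the other names and values are illustrative), a couple more definitions in the same style, plus a later reference to one of them, look roughly like this:

    # more macro definitions near the top of the spec (illustrative)
    %define etc_hadoop /etc/hadoop
    %define lib_hadoop /usr/lib/hadoop

    # referenced later in the spec with %{...}, e.g. in the %files section
    %dir %{etc_yarn}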
Preamble: Defining the package metadata

The preamble defines the metadata and main components of the installation.

After the macros, the preamble begins.  Here we can see references to the %define macros above, for example "hadoop_name" (hadoop).  There are also several Source[0-n] definitions; we will see what they point to in a second.
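For orientation, a preamble looks something like the sketch below.  Only hadoop_name and install_hadoop.sh are taken from the real spec as described in this post; the remaining macro names, values, and source numbering are illustrative.

    Name: %{hadoop_name}
    Version: %{hadoop_version}
    Release: %{hadoop_release}
    Summary: Hadoop is a software platform for processing vast amounts of data
    License: ASL 2.0
    URL: http://hadoop.apache.org/
    Source0: %{name}-%{hadoop_base_version}.tar.gz
    Source1: do-component-build
    Source2: install_hadoop.sh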

Preamble: Source[0-n]


Each entry in "Source..." corresponds to a file in the packaging source code that will get installed on the target system.

The Source directives point to a variety of oddly suffixed files, e.g. "hadoop.1", "do-component-build", and so on.  For each one of these, we can see how it is applied/installed by looking up the corresponding reference to its "SOURCE*" name.  For example, a quick grep for "yarn.conf" in the rpm spec file shows how that source gets used.

The yarn.conf file ultimately is added to /etc/security/limits.d/ on the target system.
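Here is a hedged sketch of that pattern: a Source declaration in the preamble and the matching copy into the build root in %install.  The source number and install flags are illustrative; only the limits.d destination comes from the spec.

    # declared in the preamble (the number is illustrative)
    Source12: yarn.conf

    # consumed later, in the %install section
    %{__install} -d -m 0755 $RPM_BUILD_ROOT/etc/security/limits.d
    %{__install} -m 0644 %{SOURCE12} $RPM_BUILD_ROOT/etc/security/limits.d/yarn.conf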


BuildRequires and Requires:

The build and the installation of a program are completely different animals.  In hadoop's case, there are build requirements, such as gcc, but we don't need gcc to actually run hadoop on a cluster.  Conversely, we need sh-utils on a hadoop cluster, but we don't need them to compile hadoop.  Thus, we have a few different OS-specific dependency declarations, such as the ones sketched below:
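A hedged sketch of what those declarations can look like (the package names and the SUSE/RedHat split here are illustrative, not copied from the real hadoop.spec):

    # needed only to compile hadoop and its native bits
    BuildRequires: gcc, gcc-c++, make, cmake
    %if 0%{?suse_version}
    BuildRequires: libopenssl-devel
    %else
    BuildRequires: openssl-devel
    %endif

    # needed on every cluster node at run time
    Requires: coreutils, /usr/sbin/useradd, /sbin/chkconfig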


install_hadoop.sh and the RPM_BUILD_ROOT

Finally, we have the installation command.  As you know, hadoop consists of:

  • configuration files (like core-site.xml)
  • executables (like hadoop)
  • logs (created by the nodemanagers, resourcemanager, etc..). 
Thus the RPM installer for hadoop has to set those locations and put those resources in the appropriate places (/etc/hadoop/conf, /usr/lib/hadoop, /var/log/hadoop-yarn/..., respectively).

Now something interesting about RPM installers: $RPM_BUILD_ROOT.  The $RPM_BUILD_ROOT directory is used as the prefix to many of the arguments to install_hadoop.sh, which is called from the %install section.  For example, we can see that etc_hadoop is put under "$RPM_BUILD_ROOT/".  Since etc_hadoop=/etc/hadoop/, $RPM_BUILD_ROOT is thus a staging area that mirrors the final filesystem layout before rpm actually packages the files.

SOURCE2=install_hadoop.sh, the command that installs hadoop.
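Below is a minimal sketch of what that %install call might look like.  The flag names are hypothetical; only $RPM_BUILD_ROOT, SOURCE2, and the directory macros are discussed above, and the macros etc_hadoop/lib_hadoop follow the illustrative definitions from earlier.

    %install
    # stage everything under the build root; rpm later packages it from there
    bash %{SOURCE2} \
      --prefix=$RPM_BUILD_ROOT \
      --conf-dir=$RPM_BUILD_ROOT%{etc_hadoop}/conf.empty \
      --lib-dir=$RPM_BUILD_ROOT%{lib_hadoop}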

Next, we have a series of shell calls disguised as RPM macros.

See https://www.zarb.org/~jasonc/macros.php

Most of the macros used in this rpm are simply platform-independent ways of referencing standard unix tools like "sed", "ls", "chgrp", and so on.
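For example, these are standard built-in rpm macros (the macro names are real; the particular invocations are just illustrations):

    # %{__rm}, %{__install}, %{__ln_s}, ... expand to the platform's own binaries
    %{__rm} -rf $RPM_BUILD_ROOT
    %{__install} -d -m 0755 $RPM_BUILD_ROOT%{lib_hadoop}
    %{__ln_s} %{lib_hadoop}/bin/hadoop $RPM_BUILD_ROOT/usr/bin/hadoop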


The functionality of the above commands is pretty obvious: they do what their unix equivalents already do.

%pre directives

Next up, %pre directives.  These define steps that precede the installation of particular components.  In general, you can see that this is where the hadoop service user accounts are created, as sketched below.
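A hedged sketch of such a %pre scriptlet (the user/group names and options are illustrative, not lifted from the real spec):

    %pre
    # create the hadoop group and the hdfs service user if they do not exist yet
    getent group hadoop >/dev/null || groupadd -r hadoop
    getent passwd hdfs >/dev/null || \
        useradd -r -g hadoop -d /var/lib/hadoop-hdfs -s /bin/bash -c "Hadoop HDFS" hdfs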



%files

At this point, we've defined metadata, user names, and other system-specific info about our package.  So what are we missing?  Files!  The %files directive is probably the most important: it tells you exactly which files are being installed and where they should go.  It supports globs/recursive installs as well, as in the sketch below.
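To illustrate (the paths reuse the illustrative macros from earlier; only /usr/lib/hadoop, /etc/hadoop/conf, and the yarn log directory are actually named in this post):

    %files
    %defattr(-,root,root)
    # everything staged under /usr/lib/hadoop is picked up recursively
    %{lib_hadoop}
    # config files marked noreplace so local edits survive upgrades
    %config(noreplace) %{etc_hadoop}/conf.empty
    %attr(0775,yarn,hadoop) %dir /var/log/hadoop-yarn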



THAT'S ALL FOR NOW!  I'll update this post once I learn more from our good friends at ASF Bigtop.  In the meantime, just ping the mailing list (dev@bigtop) with specific questions.



