29.12.13

The anatomy of a JDBC connection in Hive and other interesting diversions

In a previous post we went through the way JDBC connections get invoked at runtime.

Now we'll look at the details of hive's services - and how SQL gets translated into an actual mapreduce job - using a basic understanding of JDBC as an "entry point" to track down how hive translates abstract SQL operations into nuts-and-bolts MapReduce jobs.

I used some of the concepts here to come up with a simple and easy way to run and debug hive mapreduce jobs inside of my IDE.  See https://github.com/jayunit100/bigpetstore/ for details.
  

Part 1: What is the actual contract that Hive provides us with?

  • Hive's contract to users is defined in the HiveInterface class.
That is, thrift is the communication channel hive uses to expose its main service: the translation of SQL commands into hadoop / mapreduce commands.  The ultimate class invoked by the JDBC layer of hive is either the HiveServer or the thrift client, both of which implement the HiveInterface.
Hive JDBC always seemed like a hairy beast to me.  It's actually not all that bad: the JDBC driver translates the connection URL into either a thrift connection or a local "HiveServerHandler", and uses that to dial up a HiveInterface.
Since the HiveServerHandler implements the HiveInterface, it's actually pretty easy to run Hive commands in pure java without even using JDBC!
  • Just creating an instance of a HiveServerHandler gives you direct access to lower level hive operations (although in an application you probably shouldn't be instantiating this lower level object).
Nevertheless, if you're an IDE junkie, you can play around with "lower level" Hive functionality in your IDE by instantiating a "HiveServerHandler".

  • This is a nice way to see the API calls that hive provides as a service to clients.
Playing with the HiveServerHandler implementation of HiveServer in your IDE shows you which underlying methods get called behind the normal Hive interface.
So, if you are curious about going deeper into the hive api and like to hack, calling the methods of this class manually (as in the sketch below) is an interesting way to familiarize yourself with hive.
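
Here's a minimal sketch of that idea, assuming the Hive 0.x-era classes discussed in this post (HiveServer.HiveServerHandler, plus the execute() / fetchAll() methods from the HiveInterface); the table name is just a placeholder:

import org.apache.hadoop.hive.service.HiveServer;

public class EmbeddedHiveExample {
  public static void main(String[] args) throws Exception {
    // Instantiating the handler gives you an in-process ("embedded") HiveInterface,
    // with no HiveServer daemon and no JDBC involved.
    HiveServer.HiveServerHandler handler = new HiveServer.HiveServerHandler();

    // Each call below is the same low-level operation that a HiveStatement
    // would eventually trigger through JDBC.
    handler.execute("CREATE TABLE IF NOT EXISTS my_table (id INT, name STRING)");
    handler.execute("SELECT count(*) FROM my_table");

    // Results come back as rows of tab-delimited strings.
    for (String row : handler.fetchAll()) {
      System.out.println(row);
    }
  }
}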

Part 2: Going deeper into the way the HiveInterface is invoked by tracing the path from JDBC-to-MapReduce.


Now, let's take another route for understanding hive's architecture: let's trace the path from JDBC to hadoop.

  • After all, we all know how JDBC works -- the input is SQL, and the output is a ResultSet, which is created by a "driver" that makes a database connection.
  • The Hive "Driver" has a runtime dependency on /bin/hadoop (in much the same way that, in MySQL, the driver depends on a running MySQL instance).
  • The Hive "Driver" allows you to create "HiveStatement" objects, which, as we know, are the backbone of any JDBC app.
So let's start tracing the HiveDriver class, which is the JDBC Driver for hive.  If you don't know the basics of JDBC, I'd suggest you read up on it before proceeding:

1) When you connect to JDBC via a URL, you explicitly define the driver:
"org.apache.hive.jdbc.HiveDriver" for hive2.
Note that older versions of hive used org.apache.hadoop.hive.jdbc.HiveDriver instead.
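
For reference, a bare-bones client that exercises this whole path might look like the following sketch (the URL, credentials, and table name are placeholders for whatever your HiveServer2 instance exposes):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // Loading the driver class triggers the static registration block shown in step 2 below.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT * FROM my_table LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString(1));
      }
    }
  }
}
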
2) This driver, in turn, registers itself when the class is first loaded:
public class HiveDriver implements Driver {
  static {
    try {
      java.sql.DriverManager.registerDriver(new HiveDriver());
    } catch (SQLException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
  }
  // ... rest of the class
}

3) Looking closer, the Driver also declares its URL_PREFIX, which is used in the acceptsURL implementation (I'm using hive2://, but for older hive, just "jdbc:hive://" was used).
  private static final String URL_PREFIX = "jdbc:hive2://";

  public boolean acceptsURL(String url) throws SQLException {
    return Pattern.matches(URL_PREFIX + ".*", url);
  }
4) Now - when we make a JDBC connection - the generic DriverManager calls "acceptsURL" on all registered drivers, and if one matches, that driver is used at runtime to run the query.  As we all know, at this point we normally create a JDBC Connection to issue a query.  The Driver which was dynamically loaded above provides "public Connection connect(..)" as part of its interface, and returns a "HiveConnection" implementation:
  public Connection connect(String url, Properties info) throws SQLException {
    return new HiveConnection(url, info);
  }

5) The HiveConnection implementation now has to figure out how to provide a hive service.  There are two scenarios: local and non-local.  For non-local ~ i.e. a real hive server ~ a thrift communication channel is opened to talk to the hive server.  For local (embedded) use, a HiveServerHandler is spun up:
// Paraphrasing the HiveConnection constructor:
if (uri.isEmpty()) {
  // embedded mode: run hive in-process
  client = new HiveServer.HiveServerHandler();
} else {
  // uri is nonempty, and thus has a host/port: talk to a running HiveServer over thrift
  transport = new TSocket(host, port);
  client = new HiveClient(new TBinaryProtocol(transport));
  transport.open();
}
Looking quickly into the HiveServerHandler class, the header describes why it is used when the URI is empty:
/**
 * Handler which implements the Hive Interface.
 * This class can be used in lieu of the HiveClient class to get an embedded server.
 */
public static class HiveServerHandler extends HiveMetaStore.HMSHandler
    implements HiveInterface {
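
From the application side, that distinction shows up purely in the connection URL.  The sketch below assumes the old org.apache.hadoop.hive.jdbc driver and its "jdbc:hive://" prefix; the host, port, and database are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;

public class ConnectionModes {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

    // Empty URI part: HiveConnection spins up a HiveServerHandler in-process (embedded mode).
    Connection embedded = DriverManager.getConnection("jdbc:hive://", "", "");

    // Host/port present: HiveConnection opens a thrift socket to a HiveServer
    // that is already running on that host.
    Connection standalone = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");

    embedded.close();
    standalone.close();
  }
}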
6) At this point, the hive connection is ready to go.  The individual implementations of Hive SQL statements can be seen in the HiveConnection class, for example (from HiveConnection.java): 

      public PreparedStatement prepareStatement(String sql, int resultSetType,
          int resultSetConcurrency) throws SQLException {
          return new HivePreparedStatement(client, sql);
      }

The tricky thing to get here, which explains it all, is that the "client" above is a "HiveInterface" implementation, which can be either the thrift HiveClient (talking to a remote HiveServer) or the embedded HiveServerHandler.

7)  One last remaining question: where does Hive end and MapReduce begin?  To understand that, we look directly into the HiveInterface implementations.  They are both in the HiveServer class (remember, in 5 above, the HiveServer class is either accessed via creation of a local handler, or via a thrift client service).

The HiveServer implementation ultimately uses the org.apache.hadoop.hive.ql.Driver class.  This can be seen in the following stacktrace:

    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:945)
    at org.apache.hadoop.hive.service.HiveServer$HiveServerHandler.execute(HiveServer.java:198)
    at org.apache.hadoop.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:192)
    at org.bigtop.bigpetstore.etl.HiveETL.main(HiveETL.java:105)

The application code creates a "HiveStatement", which triggers an operation on the HiveServer, which feeds the command to the "org.apache.hadoop.hive.ql.Driver" implementation, which calls compile(), and then execute().  The execute() method has all the "magic" of chaining jobs together and running them:
// From org.apache.hadoop.hive.ql.Driver.java
// Loop while you either have tasks running, or tasks queued up
while (running.size() != 0 || runnable.peek() != null) {
  // Launch upto maxthreads tasks
  while (runnable.peek() != null && running.size() < maxthreads) {
    Task<? extends Serializable> tsk = runnable.remove();
    launchTask(tsk, queryId, noName, running, jobname, jobs, driverCxt);
  }
  // ...
}
So I guess Hive isn't such a black box after all! :)
 
The driver's run implementation is what ultimately calls hadoop.
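
To make that concrete, here is a rough sketch of driving that compile/execute path directly, i.e. roughly what HiveServerHandler.execute() does under the hood.  It assumes the org.apache.hadoop.hive.ql.Driver class from the stack trace above, plus HiveConf and SessionState to bootstrap it (not shown in the trace); the query is a placeholder:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Driver;
import org.apache.hadoop.hive.ql.session.SessionState;

public class QlDriverExample {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();
    SessionState.start(conf);   // the ql Driver expects a hive session to be started

    Driver driver = new Driver(conf);
    // run() compiles the SQL into a plan of Tasks and then executes them,
    // launching MapReduce jobs via /bin/hadoop where needed.
    driver.run("SELECT count(*) FROM my_table");
  }
}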

- Hive's JDBC front end uses the URI to decide whether to implement the hive service on its own, or through a "client".

- The "client" terminology is kinda funny to me, because the client itself is actually a "HiveInterface", which is implemented either via a thrift service (which talks to a running HiveServer) or via an embedded hive server.

- Both of these implementations are provided by the HiveServer class.

Debugging hive is often all about tracing down "where hive ends" and "where hadoop begins".  To that extent, I hope this post is useful to someone  :)
