Arquillian has some nice Docker integration that allows you to run tests inside containers by publishing a WAR file that bundles your JAR dependencies.
This work builds on the https://github.com/jboss-openshift/ce-testsuite/ tooling around Arquillian.
This post details some things I did to come very close to running a Spark job inside an OpenShift cluster from an enterprise Arquillian unit testing framework.
In the end, this work is incomplete, because Spark declares RESTEasy classes that conflict with the RESTEasy implementation provided by Arquillian's testing framework.
Note: This post is for the niche audience of folks interested in running Java applications that use the JavaSparkContext with enterprise tooling.
1) Hadoop user names: SPARK_USER
The first thing I had to do was export SPARK_USER (thanks to +Matt F for showing me that trick). This gives Docker containers that run without a named user an identity for any Hadoop-related file operations (Spark uses Hadoop idioms and libraries for a lot of things).
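To make that concrete, here is a minimal sketch of exporting SPARK_USER into a child process from plain Java; the user name "hadoop" and the echo command are purely illustrative, not what the test harness actually runs.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class SparkUserLauncher {
    // Run a command with SPARK_USER exported into its environment,
    // returning its combined stdout/stderr.
    static String runWithSparkUser(String user, String... cmd) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.environment().put("SPARK_USER", user); // the variable Spark falls back on
        pb.redirectErrorStream(true);
        Process p = pb.start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // "hadoop" is an arbitrary example user name
        System.out.print(runWithSparkUser("hadoop", "sh", "-c", "echo $SPARK_USER"));
    }
}
```

Inside a container image with no passwd entry, the same effect comes from setting the variable in the Dockerfile or pod spec before the JVM starts.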
2) Creating a spark client war file.
The way Arquillian test wrappers work in the JBoss CE environment is that they let you build up a WAR file: for every class you reference in an Arquillian test, you need to define the corresponding dependencies in that WAR. I didn't add all the classes at once, because I found that various dependency conflicts could occur, resulting in reflection errors, "too many versions of this class" errors, and so on. That process looks something like this:
public static WebArchive getDeployment() {
    WebArchive war = ShrinkWrap.create(WebArchive.class, "run-in-pod.war");
    war.setWebXML(new StringAsset("<web-app/>"));
    war.addPackage(org.jboss.arquillian.test.impl.EventTestRunnerAdaptor.class.getPackage());
    war.addPackage(Arquillian.class.getPackage());
    war.addPackage(org.apache.spark.internal.Logging.class.getPackage());
    war.addPackage(org.apache.spark.api.java.function.Function.class.getPackage());
    war.addPackage(org.apache.spark.SparkConf.class.getPackage());
    war.addPackage(scala.Cloneable.class.getPackage());
    war.addPackage(SparkTest.class.getPackage());
    war.addPackages(true, "org.apache.spark");
    war.addPackages(true, "org.apache.commons");
    war.addPackages(true, "org.apache.hadoop");
    war.addPackages(true, "scala");
    war.addPackages(true, "org.slf4j");
    war.addPackages(true, "org.spark_project.guava");
    war.addPackages(true, "com.google.common");
    war.addPackages(true, "io.netty");
    war.addPackages(true, "com.esotericsoftware");
    war.addPackages(true, "com.twitter");
    war.addPackages(true, "com.codahale");
    war.addPackages(true, "org.json4s");
    war.addPackages(true, "org.spark_project.jetty");
    war.addPackages(true, "org.apache.spark.static");
    // no op
    // war.addPackages(true, "org.apache.spark.ui");
    // war.addAsResource(Package.getPackage("org.apache.spark"), "/");
    // war.addAsResource(org.apache.spark.SparkConf.class.getPackage(), "org/apache/spark");
    // NullPointerException
    // war.addAsResource(org.apache.spark.ui.WebUI.class.getPackage(), "static");
    war.addAsResource(org.apache.spark.ui.WebUI.class.getPackage(), "static/additional-metrics.js");
    return war;
}
In the end, I stopped at this point: the dependency on war.addPackages(true, "org.glassfish.jersey.server"), which several JavaSparkContext runtime paths rely on, caused a RESTEasy conflict that I was not able to easily resolve. In particular the conflict was:

JBAS011232: Only one JAX-RS Application Class allowed. org.glassfish.jersey.server.ResourceConfig$WrappingResourceConfig org.glassfish.jersey.server.ResourceConfig$RuntimeConfig org.glassfish.jersey.server.ResourceConfig
The lesson learned:
Applications running on microservice infrastructure should themselves be true microservices. Over time, dependency compatibility will probably become a harder problem to solve for multi-application processes, since most people would rather spend their time building microservice fences. I'm pretty sure the use of Maven <exclusions> and similar dependency manglers in other languages will become a dark art used only for maintaining legacy apps. For the rest of us, the lesson here is simple: don't be afraid to leverage containerization even in places where single tenants or one-off tasks are running. Although most folks think of microservices as infrastructure (i.e. a Redis microservice), it's important to realize that even a one-off app which never serves anything over HTTP can still be a microservice. Microservice-based application tests gain a lot of the container benefits that microservice infrastructure takes advantage of.
The root cause? [note: read the update below, as this is now solved]...
It's kind of annoying that a Spark client needs to launch a UI servlet, but I guess there's no such thing as a bare-bones Spark client. Possible JIRA?
at org.apache.spark.status.api.v1.ApiRootResource$.getServletHandler(ApiRootResource.scala:193)
at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:75)
at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:81)
at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:215)
at org.apache.spark.ui.SparkUI$.createLiveUI(SparkUI.scala:157)
[update]
Well, actually, there is!
Spark has a special option that allows you to conf.set("spark.ui.enabled", "false");
If you do that, then no attempt is made to spin up or load the webserver classes that conflict with Arquillian/JBoss et al.
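In context, that's a one-liner on the SparkConf before the context is created; a sketch, where the app name and local-mode master are just placeholders:

```java
SparkConf conf = new SparkConf()
    .setAppName("run-in-pod-test")      // hypothetical app name
    .setMaster("local[*]")              // or your cluster's master URL
    .set("spark.ui.enabled", "false");  // never touch the Jetty/Jersey UI classes
JavaSparkContext sc = new JavaSparkContext(conf);
```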
However. After getting past this, I hit an interesting, very-difficult-to-debug exception:
Caused by: java.io.IOException: Unable to invoke invoke(), status=WAITING
at org.jboss.remotingjmx.protocol.v2.ClientConnection$TheConnection.invoke(ClientConnection.java:1058)
at org.jboss.as.arquillian.container.ManagementClient$MBeanConnectionProxy.invoke(ManagementClient.java:537)

No idea what this means. My guess is it's related to some serialized functions, once containerized into the JBoss execution environment, not having enough imports available?
Anyway, that's about as far as I got. The lesson learned above applies:
- Applications running on microservice infrastructures should themselves be containerized microservices; otherwise, technical debt around combining clients for consuming cross-service functionality accumulates.
- Maybe the Spark REST API is a microservice-friendly way to build a Spark app. I'm not sure if there is a way to cross-compile Scala or Java Spark apps into REST calls, but if so, that would be cool :) http://arturmkrtchyan.com/apache-spark-hidden-rest-api
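For what it's worth, talking to that hidden REST endpoint needs nothing but the JDK, which sidesteps the whole classpath fight above. This is a sketch based on the linked post, not something I ran: the master host, the default REST port 6066, and the jar URL / main class are all assumptions.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SparkRestSubmitSketch {
    // JSON body for the (unofficial) /v1/submissions/create endpoint; the
    // field names follow the post linked above. jarUrl/mainClass are
    // hypothetical placeholders.
    static String submissionJson(String jarUrl, String mainClass) {
        return "{"
            + "\"action\":\"CreateSubmissionRequest\","
            + "\"appResource\":\"" + jarUrl + "\","
            + "\"mainClass\":\"" + mainClass + "\","
            + "\"appArgs\":[],"
            + "\"clientSparkVersion\":\"2.1.0\","
            + "\"environmentVariables\":{\"SPARK_ENV_LOADED\":\"1\"},"
            + "\"sparkProperties\":{"
            +     "\"spark.app.name\":\"rest-submit\","
            +     "\"spark.master\":\"spark://spark-master:6066\"}"
            + "}";
    }

    // POST the submission to a standalone master's REST port (6066 by default).
    static int submit(String masterRestUrl, String json) throws Exception {
        HttpURLConnection c = (HttpURLConnection)
                new URL(masterRestUrl + "/v1/submissions/create").openConnection();
        c.setRequestMethod("POST");
        c.setRequestProperty("Content-Type", "application/json;charset=UTF-8");
        c.setDoOutput(true);
        try (OutputStream os = c.getOutputStream()) {
            os.write(json.getBytes("UTF-8"));
        }
        return c.getResponseCode();
    }

    public static void main(String[] args) {
        // Just print the payload; actually calling submit() needs a live master.
        System.out.println(submissionJson("http://repo/app.jar", "com.example.SparkJob"));
    }
}
```

The client process here never loads a single Spark, Jetty, or Jersey class, which is exactly the kind of fence the lesson above argues for.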