Using the Netflix Genie Client in Java


Ok, so if you haven't been watching my activity on GitHub, you might have missed this, and I feel it deserves a full-on blog post. Recently, having joined Netflix, I started using some of their libraries, as is to be expected. One of the things I used pretty much from day one here was the Genie library. To quote from Genie's page on GitHub:

Genie is a federated job execution engine developed by Netflix. Genie provides REST-ful APIs to run a variety of big data jobs like Hadoop, Pig, Hive, Presto, Sqoop and more. It also provides APIs for managing many distributed processing cluster configurations and the commands and applications which run on them.

As you can probably figure out from the above, I'm using Genie for querying some of our Hive datastores. In doing so, I'm using the Genie client code which Netflix provides with this package, available on GitHub: https://github.com/Netflix/genie/tree/develop/genie-client

However, having looked at the sample code they provided, I realised it could actually be improved. I spoke with the folks here who look after the Genie project, and it quickly transpired that the client library was indeed in need of some lovin'. So I set off and put together a pull request (https://github.com/Netflix/genie/pull/116). This has now been merged into the main trunk; however, I've also seen the code presented in this project replicated elsewhere, and those copies could benefit from the same changes. This blog post will walk you quickly through them: if you are using pieces of the client's code from GitHub, it might be worth reviewing your code to see if my changes can be applied in your project too.

There are two "streams" of changes in that pull request:

  1. I've added support for Groovy, along with client usage sample code in Groovy, which I think is a more concise way of doing things (Java makes it at times laborious to set properties and construct object instances).
  2. I've changed the Java client usage sample code, as some of what was going on in there was really unnecessary and could be rewritten much more simply.

So let's go through them in reverse order 🙂 We'll start with the Java changes:

In genie-client/src/main/java/com/netflix/genie/client/sample there is an ExecutionServiceSampleClient.java which I ended up changing. The bit I felt needed attention was the code dealing with sending the Hive query as a file attachment (around line 115); the initial code looked like this:

final File query = File.createTempFile("hive", ".q");
try (PrintWriter pw = new PrintWriter(query, "UTF-8")) {
   pw.println("select count(*) from counters where dateint=20120430 and hour=10;");
}
final Set<FileAttachment> attachments = new HashSet<>();
final FileAttachment attachment = new FileAttachment();
attachment.setName("hive.q");
 
FileInputStream fin = null;
ByteArrayOutputStream bos = null;
try {
   fin = new FileInputStream(query);
   bos = new ByteArrayOutputStream();
   final byte[] buf = new byte[4096];
   int read;
   while ((read = fin.read(buf)) != -1) {
      bos.write(buf, 0, read);
   }
   attachment.setData(bos.toByteArray());
} finally {
   if (fin != null) {
      fin.close();
   }
   if (bos != null) {
      bos.close();
   }
}
attachments.add(attachment);

Now that is a big chunk of code, and it does a lot of not-so-nice things:

  • memory allocation (for byte buffer)
  • object creation (for streams)
  • file I/O (writing to a file then reading it back)

If you run this millions of times a day, this code will crucify the JVM 🙁 Looking closer, this big chunk of code exists solely to get the bytes of the string storing our Hive query! I'm not sure why that required writing the string to a file and reading it back, plus a bunch of streams and byte buffers, but I decided to get rid of all of it and rewrite the code in fewer (and faster!) lines:

// send the query as an attachment
final Set<FileAttachment> attachments = new HashSet<>();
final FileAttachment attachment = new FileAttachment();
attachment.setName("hive.q");
attachment.setData("select count(*) from counters where dateint=20120430 and hour=10;".getBytes("UTF-8"));
attachments.add(attachment);

No temp file, no disk I/O, no stream copying!
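As a side note, `String.getBytes("UTF-8")` throws the checked `UnsupportedEncodingException`; since Java 7 you can pass `StandardCharsets.UTF_8` instead and skip the exception handling entirely. And if your query genuinely does live on disk, `Files.readAllBytes` replaces the whole stream-copy loop in one call. Here's a minimal stdlib-only sketch (no Genie classes involved, just the byte-handling part):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class AttachmentBytes {
    public static void main(String[] args) throws IOException {
        final String query =
            "select count(*) from counters where dateint=20120430 and hour=10;";

        // In-memory: no checked UnsupportedEncodingException with StandardCharsets
        final byte[] inMemory = query.getBytes(StandardCharsets.UTF_8);

        // If the query really is a file on disk, Files.readAllBytes replaces
        // the whole FileInputStream / ByteArrayOutputStream copy loop
        final Path tmp = Files.createTempFile("hive", ".q");
        Files.write(tmp, inMemory);
        final byte[] fromDisk = Files.readAllBytes(tmp);
        Files.delete(tmp);

        System.out.println(Arrays.equals(inMemory, fromDisk)); // prints "true"
    }
}
```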

As I said, I have seen the code I replaced replicated in a few repos, so for those out there using the client in the old-fashioned way, I'd go back and change the code to eliminate the unneeded overhead the old version was introducing.

Now let's look at the other set of changes: the Groovy side of things. If you look at the Java client sample, there's a big chunk of code needed to initialize a Hive job, set tags, metadata and so on:

final Set<String> criteriaTags = new HashSet<>();
criteriaTags.add("adhoc");
final ClusterCriteria criteria = new ClusterCriteria(criteriaTags);
final List<ClusterCriteria> clusterCriterias = new ArrayList<>();
final Set<String> commandCriteria = new HashSet<>();
clusterCriterias.add(criteria);
commandCriteria.add("hive");
 
Job job = new Job(userName, jobName, "-f hive.q",
   commandCriteria, clusterCriterias, null);
 
job.setDescription("This is a test");
 
// Add some tags for metadata about the job. This really helps for reporting on
// the jobs and categorization.
Set<String> jobTags = new HashSet<>();
jobTags.add("testgenie");
jobTags.add("sample");
 
job.setTags(jobTags);
 
// send the query as an attachment
final Set<FileAttachment> attachments = new HashSet<>();
final FileAttachment attachment = new FileAttachment();
attachment.setName("hive.q");
attachment.setData("select count(*) from counters where dateint=20120430 and hour=10;".getBytes("UTF-8"));
 
attachments.add(attachment);
job.setAttachments(attachments);
job = client.submitJob(job);

With Groovy, you can shorten all of that to this small piece of code:

def clusterCriterias = [new ClusterCriteria(['adhoc'] as Set)] as List
def commandCriteria = ['hive'] as Set
 
def fa = new FileAttachment()
fa.with { (name, data)=['hive.q', 'select count(*) from counters where dateint=20120430 and hour=10;'.getBytes('UTF-8')] }
 
def job = new Job(userName, jobName, '-f hive.q', commandCriteria, clusterCriterias, null)
job.with {
   (description, tags, attachments) = ['This is a test', ['testgenie', 'sample'] as Set, [fa]]
}
 
job = client.submitJob(job)

Given that Groovy runs on the JVM, for pieces of code like the above I prefer to fall back onto Groovy, write less code and achieve the same result 🙂 I recommend all Genie client users do the same for submitting jobs, as it saves (quite) a few keystrokes.
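If you'd rather stay in pure Java, a chunk of the collection boilerplate can still be trimmed with nothing but the standard library: `Collections.singleton` and `Arrays.asList` build the one-element criteria sets without the `new HashSet<>()` / `add()` ceremony. A sketch using plain strings (the Genie `Job` and `ClusterCriteria` calls themselves would stay exactly as in the sample; note that `Collections.singleton` returns an immutable set, so if the receiving code mutates it, wrap it in `new HashSet<>(...)` instead):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class ConciseCriteria {
    public static void main(String[] args) {
        // One-element sets in a single expression, instead of new HashSet<>() + add()
        final Set<String> criteriaTags = Collections.singleton("adhoc");
        final Set<String> commandCriteria = Collections.singleton("hive");

        // A mutable multi-element set, still built in one line
        final Set<String> jobTags = new HashSet<>(Arrays.asList("testgenie", "sample"));

        System.out.println(criteriaTags.contains("adhoc")); // prints "true"
        System.out.println(jobTags.size());                 // prints "2"
    }
}
```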

As I said, all of these changes have been merged into trunk and are available in Genie's GitHub repo. But for those who didn't pay attention to the activity in that repo, I thought it worth detailing these two (small) changes, as they could bring a big win in future projects.