I have recently been working at Netflix on a project which hits one of our Cassandra clusters. (By the way, we use Cassandra a lot here; wherever possible we prefer it to an RDBMS, so we have tons of instances running Cassandra.) Part of what my code had to do was retrieve a set of records, apply a transformation to one field, then write the result to an output file. It is such a simple ETL that I didn't spend much time on it initially: I simply wrote code which ran a CQL (Cassandra Query Language) query to retrieve just the fields I needed, applied the processing and wrote the output file line by line.
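For reference, that first version looked roughly like the sketch below. This is a hedged approximation, not the actual Netflix code: the column family, column names and the transformation are invented for illustration, and it assumes an already-configured Astyanax Keyspace.

```java
// Naive first pass (sketch): one CQL query, iterate the results,
// transform one field and write the output line by line.
// Keyspace setup, error handling and the real table/column names are omitted or invented.
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.Row;
import com.netflix.astyanax.serializers.StringSerializer;

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

public class NaiveEtl {
    // Hypothetical column family; the real one differs.
    static final ColumnFamily<String, String> CF_RECORDS =
            ColumnFamily.newColumnFamily("records", StringSerializer.get(), StringSerializer.get());

    static void run(Keyspace keyspace) throws Exception {
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("output.txt"))) {
            // Only select the columns we need, not the whole row.
            for (Row<String, String> row : keyspace.prepareQuery(CF_RECORDS)
                    .withCql("SELECT key, field FROM records")
                    .execute().getResult().getRows()) {
                String value = row.getColumns().getStringValue("field", "");
                out.write(transform(value));
                out.newLine();
            }
        }
    }

    static String transform(String value) {
        return value.toUpperCase(); // stand-in for the real transformation
    }
}
```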
Of course, in doing so, I missed one important aspect: the volume of data 🙂 (ouch!) This ETL has to process about 100 million records, and even though my code makes sure I only retrieve the columns I want and not the full row (which would flood the network with a whole bunch of Cassandra columns I have no use for!), it still dragged like a snail the first time I ran it. (I did a quick calculation at the time and it would have taken something like 3-4 days to finish. Ouch!!)
To speed it up, I figured I didn't need to go through CQL at all. Bearing in mind I actually know all the keys upfront (they hit my ETL from another process which delivers them to me as an input file), I could just call getRowSlice() in Cassandra to retrieve a chunk of N rows at a time, multiplex those calls through a thread pool of X threads, measure how many records per second we get, and then try to optimize those two parameters, roughly as sketched below.
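Here is the general shape of that second attempt. Again a hedged sketch rather than the real code: the chunking helper, executor setup and column name are my own stand-ins around the standard Astyanax getRowSlice() path.

```java
// Second attempt (sketch): partition the known keys into chunks of N
// and fan the getRowSlice() calls out over a pool of X threads.
import com.google.common.collect.Lists;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.Row;
import com.netflix.astyanax.serializers.StringSerializer;

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RowSliceEtl {
    static final ColumnFamily<String, String> CF_RECORDS =
            ColumnFamily.newColumnFamily("records", StringSerializer.get(), StringSerializer.get());

    static void run(Keyspace keyspace, List<String> allKeys, int pageSizeN, int threadsX) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threadsX);
        for (final List<String> chunk : Lists.partition(allKeys, pageSizeN)) {
            pool.submit(() -> {
                // Fetch N rows in one call, restricted to the single column we care about.
                for (Row<String, String> row : keyspace.prepareQuery(CF_RECORDS)
                        .getRowSlice(chunk)
                        .withColumnSlice("field")
                        .execute().getResult()) {
                    process(row.getKey(), row.getColumns().getStringValue("field", ""));
                }
                return null;
            });
        }
        pool.shutdown();
    }

    static void process(String key, String value) { /* transform + write, as before */ }
}
```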
Having changed the code this way, I hit another problem: because the row IDs were coming in no particular order, Cassandra was thrashing the disk all the time to retrieve the rows. My speed improved a bit, but the job still needed about 2 days to finish. (Needless to say, still unacceptable!)
Ok, so what do we do to speed this up? After a bit of looking around and talking to my colleague Kenneth Kurzweil, we decided to change the code to retrieve all rows rather than going through getRowSlice(). We use the Astyanax library, which has a nice AllRowsReader.Builder I could use. The nice thing about it is that it allows you to specify the level of concurrency (read "number of reader threads") as well as the page size (read "number of records to read in one chunk"). With that in mind, I again parametrized N = page size and X = concurrency level, switched the code to use the AllRowsReader.Builder, and started playing with these parameters on my Mac.
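For the curious, a minimal sketch of what that looks like with AllRowsReader.Builder is below. The column family and column names are invented, and the exact builder options you need may differ from what I show here.

```java
// Final approach (sketch): let AllRowsReader scan the whole column family,
// with page size (N) and concurrency (X) exposed as the two tuning knobs.
import com.google.common.base.Function;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.Row;
import com.netflix.astyanax.recipes.reader.AllRowsReader;
import com.netflix.astyanax.serializers.StringSerializer;

public class AllRowsEtl {
    static final ColumnFamily<String, String> CF_RECORDS =
            ColumnFamily.newColumnFamily("records", StringSerializer.get(), StringSerializer.get());

    static void run(Keyspace keyspace, int pageSizeN, int concurrencyX) throws Exception {
        new AllRowsReader.Builder<String, String>(keyspace, CF_RECORDS)
                .withPageSize(pageSizeN)            // N: rows fetched per chunk
                .withConcurrencyLevel(concurrencyX) // X: reader threads
                .withColumnSlice("field")           // only pull the column we need
                .forEachRow(new Function<Row<String, String>, Boolean>() {
                    @Override
                    public Boolean apply(Row<String, String> row) {
                        process(row.getKey(), row.getColumns().getStringValue("field", ""));
                        return true; // keep reading
                    }
                })
                .build()
                .call();
    }

    static void process(String key, String value) { /* transform + write, as before */ }
}
```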
Now, how do you measure this script's performance, though? How do you tune the N and X above? Well, Ken suggested a tool he's been using for ages called "MenuMeters", which is pretty nifty: it installs a few graphs and visual widgets in your menu bar so you can quickly see how your I/O, memory, CPU and so on are doing. He was actually kind enough to configure it for me and save me the time, so I got all the relevant info in one go. How about that!!! 🙂
So with that in mind we started tuning our little app. Bear in mind that my Mac has 8 cores and 16 GB of RAM and sits on the internal LAN, which is 1 Gbps. However, as I'm sure you know, everything we run here is in the Amazon cloud, so I wasn't really expecting to hit 1 Gbps, and decided that 100 Mbps would be a nice target for this exercise.
We ran initially with a page size of 100 records and 10 threads. Our little widgets indicated that we hardly generated any network traffic and my cores weren't feeling a thing. So I went to the opposite extreme: page size = 100,000 and a thread pool of 500. Yeah, that was not happening: we still weren't choking the network, but my Mac's fan was going wild, suggesting high CPU activity, which was also reflected by the little widgets in my menu bar. So I went back to building it up slowly: double the thread pool until the widgets tell me I'm choking my cores. It turns out that value, for this particular case, is 128. Cool, so with the thread pool fixed at 128 threads I started tweaking the page size. Again, in my case, it turns out that a value of 5,000 nearly hits the 100 Mbps limit, so I stopped there.
With N = 5,000 and X = 128 I re-ran my script. It took me about 30 minutes to put together this article, and in those 30 minutes my script has processed about 90% of the data, judging by the counters I'm exposing from my app. Now that's an improvement over 2 days, I'm sure you'll agree! 🙂
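The counters themselves are nothing fancy. Something along the lines of the sketch below (my own illustration of the idea, not the actual code) is enough to watch records per second while you tweak N and X:

```java
// Minimal throughput counter (sketch): increment once per processed row,
// and log totals plus records/second at a fixed interval while tuning N and X.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ThroughputCounter {
    private final AtomicLong processed = new AtomicLong();
    private final ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();

    public void start(final int intervalSeconds) {
        reporter.scheduleAtFixedRate(new Runnable() {
            long last = 0;
            @Override
            public void run() {
                long now = processed.get();
                System.out.printf("processed=%d (%.0f records/sec)%n",
                        now, (now - last) / (double) intervalSeconds);
                last = now;
            }
        }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    public void increment() {
        processed.incrementAndGet();
    }
}
```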
I thought this made for an interesting case study of how to tune your code on a Mac. I'm not sure I would angle this post as "look how fast Cassandra is" (though, believe me, it is!), or "how blazing fast our code at Netflix is", but rather as "this is how to go about tuning parameters like these". I wish someone else had written this article before I started writing my ETL code, to be honest 🙂