Of Java and Assembler

The title will no doubt puzzle quite a few of you. After all, I’m putting in the same sentence a low-level, processor-specific language (for want of a better term for “assembler”; I know, I know, I know, “it’s not really a language”, right?) and a rather high-level, even platform-independent language like Java. So right away you’d all be asking yourselves “well, what can they have in common?”, or probably thinking that this is an article looking at how far apart these 2 languages are. The thing is, they’re not actually that far apart! Yup, I’m going to say that again and re-phrase it so the purpose of this post becomes clearer: they are in fact quite similar!

Now, I bet this got your attention, didn’t it ? 🙂 Rest assured, I didn’t just say it for the sake of it — I will try to explain throughout this post how that is possible. And if you decide to read it all, and happen to work within the JVM space, I would be very interested to hear your thoughts on this — either via a comment here or simply drop me a line.

To clarify first any questions about my knowledge of these 2 languages, I will say that I have worked with both. While this is probably a testimony to my age, there are hopefully a few others out there who will relate to my experiences. I know that nowadays kids grow up writing “Hello, world!” in the Google or Amazon cloud as their first program; my experience with computers took a different route. Like many others, I started by learning my first few programming tricks in simple Basic, worrying about numbering lines in 10’s or 100’s, enough to make room for more lines in between, should I need to 🙂 Growing up in Romania made access to technology slightly more difficult than it was for an average kid growing up in the West, so I was stuck with a clone of a Z80-based Sinclair Spectrum for quite a while. It got to the point where I thought I knew everything there was to be known about that dialect of Basic, so I started scratching under the surface and came across this thing called “assembler”. I must admit, it looked weird (Z80 processors, for those who don’t know, were 8-bit, and the assembler language was totally different from that of the 80x86 processors). I learned a few commands about loading registers, and even figured out a way of doing a loop (LDIR if my memory serves right: “load, increment and repeat”). However, at that age (I was about 12-13) I was concerned with how to do “cool things”, which at the time I perceived to be things like animations and making the speaker make a nice sound. I couldn’t find anywhere a command to do these sorts of things, and since no one had spoken to me at the time about “interrupts” I found the whole experience frustrating: who cares if I can program this damn thing to perform some addition really fast in assembler, when I can find no way of actually printing the results or inputting the numbers?

Luckily, at the same time some clones of M18/M118 made their way into Romania as well (they were Z80-based too), and my father’s workplace had a few of those. I made friends very quickly with the guy who was in charge of their IT department, and as such I was lucky enough to be allowed to go wild on one of those in the afternoons, evenings and weekends! It was a big learning curve for me at the time, as I came across concepts like operating systems and floppy disks (Sinclairs were mostly cassette/tape based; it was later on that floppy disks were introduced, and that was bloody expensive!). The M18 stations were running CP/M (you might know of it as the system that really inspired MS-DOS), so all of a sudden I basically came across this high-level language which was CP/M itself.

Around that time, I heard of this “thing” called dBase — as it happened, dBase IV was the de facto database engine — so I started playing with it. I found it a nice way to play with “files” and “data” (the way I perceived at 13 years of age). It was pretty similar to Basic in the sense that it was an interpreted language — even though at the time I didn’t make a clear distinction in between compiled and interpreted languages! — so you just type your program and then execute it inside dBase. There was no visibility of “under-the-cover” from my end so I absorbed that right away.

However, the next thing I discovered was Pascal (I actually did use for quite a while the very first version of Turbo Pascal – that thing, rocked!!! Borland did such a good job with it!). And I started figuring out that writing:

10 PRINT "Hello, world!"

was the same as writing

WRITELN('Hello, world!');

And ultimately I assimilated the Pascal language based on comparisons with my knowledge of Basic. I’m sure that’s how we all learn a language, having previous experience of others, “oh, this bit in this language is similar to this other bit in this other language” etc., so this shouldn’t come as a surprise. However, one of the things that I have learned with Pascal is that it requires compilation in order to execute the program. With the likes of Turbo Pascal (and nowadays all sorts of other IDE’s) you could load a program and trace it line by line, but I figured out right away that this was not the way to do it. You had to compile it, and that produced a .COM file (yup, that’s where Microsoft got the extension from in the first place!) which you then can execute right away from the command line! This minor thing really revolutionized my understanding of programming and I started digging into compilers and libraries (I believe in Pascal they were called “units”) and finally began to figure out — through the usage of Pascal — what the assembler thing was all about: ultimately all this code I wrote was “translated” to assembler somehow in order to get executed!

With that in mind I switched my attention back to assembler, learned about the system interrupts and how to print things on the console, to open files and so on. To be honest, I thought at the time that assembler was the best thing since sliced bread, and I started looking at all sorts of ways of making programs faster, just so I could show off, of course 🙂

I then went to high school to study computer science “properly”. At the time the 80286 processors were making their way to market (even in Romania 🙂), so I found myself in front of one of them around the age of 14. I had to drop assembler right away as I found out (oh, the disappointment!) that these things used a different processor. So I switched for a while to Pascal (this is what was taught in my school as well) and started learning my MS-DOS and the 80x86 assembler language.

In the process of learning about INT 21h (it’s kind of sad I still remember the DOS interrupt! 🙂 ) one of the teachers mentioned to me this language called “C”, which is apparently very close to assembler. So I dropped my 80x86 assembler books and turned my attention to “C”. That’s when I found out about pointers. This was something that you had to dig a bit in Pascal to find out about, and the concept of them was rather blurry in that language. Also, in terms of assembler, you just dealt with “memory”, so I found pointers to be cool right away because they made the link between high-level languages and the low level of assembler.

Shortly I picked up the 80x86 assembler language too and found out how to actually make those 2 work together, which at the age of about 15-16 was mind-blowing to me! I would write parts of my program in C and parts in assembler. To be honest, it was just pure show-off at the time, though I would convince all of my friends that my programs were faster than theirs, even though I don’t think we had any way of measuring that at the time 🙂

Anyway, point being is that I have figured out that simply doing a string concatenation (be it in C or Pascal or any other language) really meant in assembler code: “allocate space for the 2 strings, then copy the characters from the beginning of the second string at the end of the first one — make sure there is space ! — then add the null terminator”. Every time one of my friends would show me something cool (typically involving graphics or colors) I would think “hmmm yesss, that takes something like an interrupt XYZ with these parameters” and I would go away and bash at it and show him the same done in assembler and point out that “you know, this is so much faster” 🙂
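To make those steps concrete, here is a rough sketch of the same idea written in Java, with plain byte arrays standing in for raw memory. The class and method names (RawConcat, cString, length, concat) are made up for illustration and are not from any library; real C or assembler code would of course work on memory addresses rather than Java arrays.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: C-style, zero-terminated strings simulated with byte arrays.
final class RawConcat {

    // Lay out a Java String the way C would: its bytes followed by a 0 terminator.
    static byte[] cString(String s) {
        byte[] src = s.getBytes(StandardCharsets.US_ASCII);
        byte[] out = new byte[src.length + 1]; // the extra byte stays 0
        System.arraycopy(src, 0, out, 0, src.length);
        return out;
    }

    // Walk the bytes until the terminator, like strlen does.
    static int length(byte[] s) {
        int n = 0;
        while (s[n] != 0) {
            n++;
        }
        return n;
    }

    // Allocate space for both strings, copy them one after the other, add the terminator.
    static byte[] concat(byte[] first, byte[] second) {
        int n1 = length(first);
        int n2 = length(second);
        byte[] result = new byte[n1 + n2 + 1]; // make sure there is space!
        for (int i = 0; i < n1; i++) {
            result[i] = first[i];
        }
        for (int i = 0; i < n2; i++) {
            result[n1 + i] = second[i];
        }
        result[n1 + n2] = 0; // explicit terminator (Java already zeroes new arrays)
        return result;
    }

    public static void main(String[] args) {
        byte[] joined = concat(cString("Hello, "), cString("world!"));
        System.out.println(new String(joined, 0, length(joined), StandardCharsets.US_ASCII));
    }
}
```

Every high-level “a + b” on strings boils down to some variation of this: find the lengths, get enough memory, copy bytes.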

I knew basically at this point that “everything compiles to assembler”, as one of my friends put it. His statement was based on the fact that in Turbo C you even had an option to make the compiler generate the equivalent assembler code during compilation! The Microsoft C compiler had something similar from what I recall (though it could have been the Watcom C compiler, which I played with as well), as I remember studying the generated code at the time to the point where I could spot right away an executable compiled with the Borland compiler, based on certain patterns they used for things like loops and so on. (For the record, no, I don’t recall what those patterns are, so don’t ask 🙂 )

Having listened to his opinion at the time, I finally made the conscious distinction in between assembler code and the executable code: after all, when you open up an executable in a text editor, it doesn’t show things like:

MOV AX, BX
INT 21h

(By the way, I only got as far as playing with 16-bit 80x86 assembler; nowadays the registers are 64 bits wide and called something like RAX, I believe, with EAX being the 32-bit flavour…)

Instead, you see just some (random) bytes, obviously. And with this came the realization that machine code and assembler are not quite the same thing, but there is a strong connection between them. I realised that assembler is pretty much machine code, but it’s still an abstraction layer on top of it. And I also now had confirmation that in terms of getting close to the metal (read “hardware”) that is as deep as I could get.

I think I played with assembler even deep into my Windows years. Yup, I did have a go at Delphi (remember that?); that was an amazing RAD tool, and yet again Borland somehow managed to screw it all up and lose to Microsoft, who, at the time, had a pretty rubbish RAD tool in Visual Basic. I had a go like everyone else at Visual C++, Visual Basic and so on. I found out that due to the nature of DLLs in Windows, I could mix languages if I stuck to a certain convention, so I would write UI code in VB and some of the business logic in C/C++. Then I would even go to the point of mixing the C code with assembler. Ultimately I figured out a new way of working pretty low-level via C-written DLLs. And I thought, once again, that was so cool! I could get to the low-level matters in C, and optimize my code to be so fast (hmmm, side note: that’s what I thought at the time; I’m sure if I looked at that code now I’d find out it’s actually rubbish 🙂 ). Due to the nature of the VB runtime, I started figuring out that when dealing with large chunks of data in memory, I was better off switching to C and traversing all of it using pointers. A lot of my code at the time was perceived as faster not because my algorithms were better than my colleagues’ (in fact probably the contrary is true!), but because I would get down to the lower level of working with memory via C/C++, which meant simple pointer arithmetic, and that was much, much faster than VB’s array manipulation, for instance. Also, I got down and dirty, so to speak, with the Windows API. Man, when it came to graphics manipulation, you could mess about with VB’s objects and so on, but there was nothing faster than getting a GDI handle and using the Windows API to paint on it, using all sorts of XORs and what-not to create some nice and fast visual effects.

Bottom line, I realised that working closer to the metal allows you to be lazy: by that I mean that my algorithms didn’t have to be optimal, I could make up for all of that by switching to the low level languages or to assembler! I had occasionally really good grades for various assignments not because I would sit down like my colleagues and study algorithms, but because I would come up with a solution that was faster than most of my classmates’! These guys would go and study about whether bubble sort, or quick sort, or merge sort, or heap sort would be the best for that particular problem — it would take them hours or days of figuring it out sometimes, as some of the assignments we had were quite complex. I was lazy — I would much rather spend my time hacking away at some API than learn about why merge sort was preferred in certain cases — so I’d put together a simple bubble sort even, but it would be done in C rather than in VB and it would do it in microseconds; as a side note, my friends would implement the algorithm perfectly in VB, but when confronted with large arrays in VB, that thing would drag for minutes! (They’d still get good grades, naturally, after all they did put effort in it, by the way.)

Anyway, I was a low-level/API hacker for a long while — and loved mixing up the low and high level languages in the Windows environment. Years have gone by (I AM getting old after all!) and I went through things like multi-platform C/C++ code (God, so glad I don’t have to make my ways through a thousand #ifdef‘s!), perl, c shell and bash scripting, ruby, python, javascript and a whole bunch of others amongst which Java. For those who haven’t figured out, that is probably my main language nowadays — whether by choice or not.

The annoying bit about Java initially was that there was no low-level to get down to — you can of course write native libraries in C and use them for Java, but I felt that is sort of cheating, not to mention not portable! For a while, back in my Uni years, when Java came out, I was bugged by this thing: is there a way to get close to the metal in Java while still keeping the portability aspect of it ? (For those of you who have written multi-platform code in C or C++ you will surely understand why I think that was — and is! — such a major feature of the language!)

Then, as I got deeper and deeper into the language, I actually realised that Java is itself the low level in the JVM environment! This became even more obvious with the appearance of other JVM languages (Groovy and Scala spring to mind). The JVM ecosystem now comprises so many frameworks and languages that a lot of the internals are by now hidden away from the developer. Few have to worry about object allocation, memory churning and garbage collection in the context of, say, Groovy; similarly, few really need to look under the covers when using the JSP Expression Language and realise that a servlet is generated, compiled, loaded etc. when using the likes of Spring MVC or Struts.

I began to realise this acutely back in my days with Vibrant Media, when I was asked to interview candidates for developer positions. Because of the way the IntelliTXT platform (as it was then; I believe it’s called vxEngine or similar by now) was written, we were in (desperate) need of devs with a deep knowledge of core Java, who could work outside the likes of Spring, Struts, Hibernate, JPA and so on. What we found at the time was a lot of candidates who were absolutely spot-on in terms of talking us through the whole MVC framework and operating within the realms of that. However, when we talked to them about references to objects, memory allocation, threading and so on, they were all of a sudden like fish out of water and struggled. That was rather annoying at the time, as each time we were looking to hire we knew upfront we had a long process ahead of us, though the occasional surprises did happen!

As I said, that process confirmed to me even more that Java is the assembler of the JVM ecosystem, if you want. There are a lot of Groovy coders who would write without thinking:

if( stringVariable ==~ /\d+/ ) ....

to verify a String contains just digits, yet few actually understand the “assembler code” (read “java code”) underneath that:

Pattern p = Pattern.compile( "\\d+");
Matcher m = p.matcher( stringVariable );
if( m.matches() ) ...

It’s not a problem if you’re using this for a simple script: you want something put together quickly and code optimization is not really on your radar. However, if you find yourself executing this snippet over and over again, you realise that you have a lot of memory churning at each execution step, and also that compiling the regex at each step is a heavy process. You might go about initializing the Pattern instance in a static variable, reusing Matcher instances, and all sorts of other ways. The point, though, is that you now work at a low level, worrying about memory and about execution speed, and you address that not by changing the algorithm but by changing the lower-level execution. You will find out as a result of these changes that your program runs faster and has fewer GC interruptions, and as such is perceived as faster and more responsive by the end users.
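As a rough sketch of what that “lower level” fix might look like (the class and method names below, DigitCheck, isAllDigits and isAllDigitsNaive, are invented for this example): the Pattern is compiled once into a static field, and a single Matcher is reused via reset() instead of being allocated on every call. Note the assumption baked in here: Pattern is safe to share between threads, but Matcher is not, so a reused Matcher must stay confined to one thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing the static-Pattern / reused-Matcher approach.
final class DigitCheck {

    // Compiled once, when the class is loaded. Pattern instances are immutable
    // and can be shared freely.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // One Matcher per DigitCheck instance, re-pointed at each new input.
    // Matcher is NOT thread-safe, so do not share a DigitCheck across threads.
    private final Matcher matcher = DIGITS.matcher("");

    // Optimized check: no regex compilation and no new Matcher per call.
    boolean isAllDigits(String s) {
        return matcher.reset(s).matches();
    }

    // What the one-liner effectively does each time: compile, allocate, match.
    static boolean isAllDigitsNaive(String s) {
        return Pattern.compile("\\d+").matcher(s).matches();
    }

    public static void main(String[] args) {
        DigitCheck check = new DigitCheck();
        System.out.println(check.isAllDigits("12345"));
        System.out.println(check.isAllDigits("12a45"));
    }
}
```

Both methods give the same answers; the difference only shows up in how much garbage is created and how much work is repeated when the check runs in a hot loop.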

This is just a simple example, of course. There are numerous others (infinite, maybe?), but I feel it’s good enough to highlight my point: with a lot of the frameworks and JVM languages, all of the low-level Java details are being lost. Use Spring MVC and nowadays you will not realise right away that your views are JSPs which ultimately get compiled into servlets, nor the implications of that. JPA hides away the (low-level) implications of JDBC and SQL. This in particular is one aspect of the JVM ecosystem that really annoys me, as I can’t really see how a developer can take himself or herself seriously without a basic understanding of SQL.

Sure, all of these frameworks and languages make it so much easier for us techies to put “stuff” together and build an app — back in the day who (ok, apart from me 🙂 ) wanted to write assembler code when you have C? Why would you worry about using things like Pattern and Matcher instances when Groovy abstracts this away from you via the ==~ operator? Why bother with SQL, database connections and pooling when we got Hibernate?

It’s the same problem all over again, but this time transposed to the JVM ecosystem, where, as it stands right now, the Java language and the JDK are to the rest of the ecosystem pretty much what assembler is to the likes of C/C++. The interesting question to ask, of course, and this is based on looking at the evolution of high-level languages, is what the next language/ecosystem will bring. You can envisage perhaps a cloud-based parallel language (God, I’m good at making s&#$t up! :D) where the low level would be concerned with managing server instances, database load, web server requests/load. And then the fascinating question is: what would the high-level languages be like? What would they be looking to achieve?

That aside though, and getting back to our Java/assembler comparison: I see a lot more companies nowadays asking for “core Java”. This reminds me of the days when writing portable C code was the holy grail: to have your code compile and run out of the box on both Windows and Unix was the thing to aspire to, and if you could write “low-level” C code you had a secure job. With virtualization platforms everywhere now, the need for speed is probably even more pressing than before, and we core Java developers are the new old-school assembler and C guys who were wary of every single malloc and pointer arithmetic. The mechanisms really are that similar; it’s just the syntax and the tools that are that bit different!