Tracking Users Online — Part 2


The first release of this project on GitHub is now out. As promised in the “pilot” post of this series, this post will walk you through what went into this release and why.

The project itself, as you may recall, is available on GitHub in this repository: https://github.com/liviutudor/PixelServer.

The version for this release is pixelserver-1.0.0, and the link to the release is this: https://github.com/liviutudor/PixelServer/releases/tag/pixelserver-1.0.0 . Please note that I currently “deploy” this artifact into a private Amazon S3-based repo; once I figure out how to make that public I will include the link to the binary downloads too. For now, simply run mvn package in the project folder and you will get the war file.

Preamble

When looking to track users online, websites need “direct access” to the web visitor. In web terms, this means having the user’s browser make an HTTP call to the website’s web server. The reason behind this is simple: once the browser makes a request to the web server, the server can learn a lot about the user, things like:

  • IP address
  • browser type
  • any cookies previously set for the domain
  • various browser-set headers which can indicate things like languages the user “accepts” (read “speaks”) for instance
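
To make the list above a bit more concrete, here is a minimal sketch of pulling those pieces out of a request. It is not code from the PixelServer project: the class name RequestFacts, the method names and the “cookie.” key prefix are all made up for illustration, and I assume the headers have already been handed to us as a plain map (in a servlet you would read them from the request object instead).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only (not part of PixelServer).
public class RequestFacts {

    /** First tag listed in an Accept-Language value; q-weights are ignored in this sketch. */
    static String preferredLanguage(String acceptLanguage) {
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return null;
        }
        return acceptLanguage.split("[,;]")[0].trim();
    }

    /** Splits a Cookie request header ("a=1; b=2") into name/value pairs. */
    static Map<String, String> parseCookies(String cookieHeader) {
        Map<String, String> cookies = new LinkedHashMap<>();
        if (cookieHeader == null) {
            return cookies;
        }
        for (String pair : cookieHeader.split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return cookies;
    }

    /** Gathers the IP address, browser string, language and cookies into one flat map. */
    static Map<String, String> collect(String remoteIp, Map<String, String> headers) {
        // HTTP header names are case-insensitive, so look them up that way.
        Map<String, String> h = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        h.putAll(headers);

        Map<String, String> facts = new LinkedHashMap<>();
        facts.put("ip", remoteIp);
        facts.put("browser", h.get("User-Agent"));
        facts.put("language", preferredLanguage(h.get("Accept-Language")));
        parseCookies(h.get("Cookie")).forEach((name, value) -> facts.put("cookie." + name, value));
        return facts;
    }
}
```

Calling collect() with the caller’s IP address and the request headers gives you one small map per visit, which is about the simplest “profile” you can start from.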

The IP address seems like a small thing, but it can be used to perform a geo-IP lookup, which is really a (complicated) table lookup that converts the IP address into a geographical location. Depending on the geolocation provider used, the result can be quite detailed: the country the IP address is in, and even the city, postcode or state/region it originates from. Some of this information can have a high error margin, but in most cases one can safely rely on the country being correct. So a simple IP lookup can tell us right away which country our web visitor is in, and that allows us to make a lot of other inferences! For instance, once we know a user is in the UK, we can work out the local time in the UK at the moment the call was made and notice that it is morning, afternoon or night. We can also find out through various online services what the weather is like where the user is. The list goes on, and I’m sure you get the picture: once you get hold of the user’s IP address, there is a lot more information that can be retrieved from it, enough to start building a basic profile for the current user.
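
As a rough illustration of the “time of day” inference, the sketch below assumes the geo-IP lookup has already given us a country code. The country-to-time-zone table is a tiny hand-written stand-in (a real geo-IP provider would normally supply the zone, and large countries span several), and the hour boundaries for morning/afternoon/night are my own arbitrary choice.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of PixelServer).
public class LocalTimeOfDay {

    // Assumed mapping for the example; real data would come from the geo-IP provider.
    static final Map<String, String> COUNTRY_TO_ZONE = Map.of(
            "GB", "Europe/London",
            "JP", "Asia/Tokyo");

    /** Classifies the visitor's local time at the moment of the request. */
    static String partOfDay(String countryCode, Instant requestTime) {
        String zone = COUNTRY_TO_ZONE.get(countryCode);
        if (zone == null) {
            return "unknown";
        }
        // ZonedDateTime applies the zone's rules, including daylight saving time.
        int hour = ZonedDateTime.ofInstant(requestTime, ZoneId.of(zone)).getHour();
        if (hour >= 6 && hour < 12) {
            return "morning";
        }
        if (hour >= 12 && hour < 18) {
            return "afternoon";
        }
        return "night";
    }
}
```

The same instant on the server can therefore be “morning” for a visitor in the UK and “night” for one in Japan.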

The same goes for the browser type: most browsers send an HTTP header (User-Agent) “announcing” themselves to websites, which typically allows the website to serve back content tailored to the current browser. (If you do a search on the net you’ll find that, despite all the standards, there are still a lot of differences between browsers nowadays, so one has to be careful when sending back JavaScript, image formats, movies etc. in order to ensure these work correctly in the user’s browser.) So as you can imagine, once you get your hands on the browser type, you can again make a lot of inferences, like:

  • how “modern” the browser is: an old browser typically indicates a less tech-savvy person, while a beta/developer version of a browser can hint that the user is a techie
  • the operating system for the user — this again indicates various affinity and tech savvy levels (you are definitely a techie if you’re using Linux for instance!)
  • mobile phone type for mobile users
  • various browser plugins used can suggest various user interests — e.g. any toolbars from various websites would suggest the user is a regular visitor of those sites etc
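
Here is a deliberately naive sketch of that kind of inference, based on simple substring checks against the User-Agent value. The class and method names are hypothetical, and real User-Agent strings are messy enough that a production system would more likely rely on a maintained parsing library; treat this as an illustration of the idea only.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only (not part of PixelServer).
public class UserAgentHints {

    /** Very rough operating system guess from a User-Agent string. */
    static String operatingSystem(String userAgent) {
        if (userAgent == null) {
            return "unknown";
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        // Order matters: Android strings also contain "Linux",
        // and iPhone/iPad strings also contain "like Mac OS X".
        if (ua.contains("android")) {
            return "Android";
        }
        if (ua.contains("iphone") || ua.contains("ipad")) {
            return "iOS";
        }
        if (ua.contains("windows")) {
            return "Windows";
        }
        if (ua.contains("mac os x")) {
            return "macOS";
        }
        if (ua.contains("linux")) {
            return "Linux";
        }
        return "unknown";
    }

    /** Rough check for a mobile browser. */
    static boolean isMobile(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        return ua.contains("mobi") || ua.contains("android") || ua.contains("iphone");
    }
}
```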

Now, the easiest way to get the user to make a direct HTTP request to your webserver is to embed “something” on the web page. When the user visits the web page you’ve embedded your “stuff” onto, the browser has to connect to your webserver to request that “something”, and there’s your direct HTTP request. That “something” can of course be anything from an image to a piece of JavaScript, and this is where the fun starts. Websites are very wary of placing random JavaScript tags on their pages, for at least the following reasons:

  • It slows down their page — and Google penalizes that in their page rank nowadays! All it takes is a simple mistake in your JavaScript and their page takes forever to load and they lose SEO ranking because of this.
  • It can “break” their page. All it takes is a simple JS error in your script to provide some bad user experience and even worse, to potentially prevent other scripts running on that page and you’re in UI hell. No serious website likes that.
  • It allows potentially harmful code to get executed on their pages. Your code might be safe, but if someone brings down your website or hacks it, your code could end up running malware on their pages and they would lose users right away.

There are many other reasons why most websites avoid JS tags; as I said, I’ve only listed the ones that come up all the time, and you can find more with a bit of googling. The bottom line, however, is that websites prefer non-JS tags for tracking users. Which brings us to the other effective way of getting on a web page: embed an image! Simply giving the website an <img> tag to embed on each page ensures that whenever a user visits one of their pages, the browser has to make an HTTP request to your webserver to download the embedded image. So you can simply host a generic image on your webserver and send the website a tag that points at it.

The problem with that approach is that each website has its own layout and style, which is not an issue if you are only tracking users on a single website. However, if you are doing this for multiple websites, you will find pretty quickly that you have to adjust the size, the colour scheme, the position or the format (GIF/PNG/JPEG) for each of them. Even worse, the same website can use a different layout for, say, its blog section than for its news section. So you have to choose your image carefully such that it works with all layouts, sizes, formats and colour schemes.

The easiest way to do so is to provide a really small (unnoticeable) image so that it doesn’t break the layout. Ideally it should also be independent of the colour scheme, so transparent springs to mind! Based on the above, the standard that has emerged is to serve a GIF image of 1 x 1 pixels (so really just one pixel in the image) and to make that pixel transparent. Support for GIF goes back pretty much to the beginning of the web, so all browsers support this format and the transparency it offers. This makes the image invisible to the user, and being so small it won’t affect the layout of the site.

Of course you might think: since we are serving a transparent image, why does it have to be 1×1? It could be any size, right? A bit of CSS and we can ensure it uses fixed positioning, sits underneath the main content etc., and as such it would still not affect the layout. And you are right; however, the bigger the image, the bigger the response you send back to the browser (the number of bytes you physically send back). Also, you do have to apply a bit of CSS magic to ensure the layout doesn’t get broken, so it’s best to keep this image as small as possible. A 1×1 transparent GIF pixel is typically well under 1 KB (a few dozen bytes), so it’s ideal from the point of view of reducing your bandwidth consumption.
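
To give a feel for just how small such an image is, here is one commonly used 43-byte encoding of a 1×1 transparent GIF, written out byte by byte. This is an illustration only: I am not claiming these are the exact bytes of the 1x1.gif file shipped in the project, and the class name TransparentPixel is made up.

```java
// Illustration only: one common encoding of a 1x1 transparent GIF (43 bytes).
public class TransparentPixel {

    static final byte[] GIF_1X1 = {
        'G', 'I', 'F', '8', '9', 'a',             // signature and version
        0x01, 0x00, 0x01, 0x00,                   // logical screen width = 1, height = 1 (little-endian)
        (byte) 0x80, 0x00, 0x00,                  // global colour table present (2 entries), background, aspect
        0x00, 0x00, 0x00,                         // colour 0: black
        (byte) 0xFF, (byte) 0xFF, (byte) 0xFF,    // colour 1: white
        0x21, (byte) 0xF9, 0x04,                  // graphic control extension, 4 data bytes
        0x01, 0x00, 0x00, 0x00, 0x00,             // transparency flag on, no delay, transparent index 0, terminator
        0x2C, 0x00, 0x00, 0x00, 0x00,             // image descriptor at position (0, 0)
        0x01, 0x00, 0x01, 0x00, 0x00,             // image width = 1, height = 1, no local colour table
        0x02, 0x02, 0x44, 0x01, 0x00,             // LZW-compressed data for the single pixel (colour index 0)
        0x3B                                      // trailer
    };

    /** Reads the 16-bit little-endian width from the logical screen descriptor. */
    static int width(byte[] gif) {
        return (gif[6] & 0xFF) | ((gif[7] & 0xFF) << 8);
    }

    /** Reads the 16-bit little-endian height from the logical screen descriptor. */
    static int height(byte[] gif) {
        return (gif[8] & 0xFF) | ((gif[9] & 0xFF) << 8);
    }
}
```

Even with the HTTP response headers on top, the whole response stays tiny, which matters when the pixel is requested on every page view.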

As such, for this project, I will be concentrating on providing all the necessary code to serve a 1×1 transparent pixel and track users using this approach.

In this version

This version offers just a simple servlet which, when “called”, returns a 1×1 transparent (GIF) pixel. It uses the Spring MVC framework; however, as you can see in the code, there is no view, for a simple reason: it’s not needed! 🙂 For each request, we simply return the bytes for the transparent pixel and are done with it. (The Spring MVC framework does, however, simplify defining paths and providing context variables.)

All the code really is in the HomeController class, and the entry point is the pixel() method. This provides a mapping to the /pixel URI: when visiting http://your-server/pixel , this method gets executed on the server side. As you can see, all it does is call writePixelToResponse() and return no MVC view, as per above.

The writePixelToResponse() method in turn does exactly what the name suggests: it writes the 1×1 transparent pixel bytes to the browser and informs the browser of the content type (GIF) and response size. This ensures the browser will render the image correctly and won’t try to interpret the sequence of bytes we are sending back as something else (and cause errors on the page).
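
If you want to see the same idea without pulling in Spring, here is a minimal sketch that uses the HTTP server built into the JDK (com.sun.net.httpserver) instead of the project’s Spring MVC controller. To be clear, this is not the project’s writePixelToResponse(): the class name, the probe() helper and the embedded pixel bytes (a common 43-byte transparent GIF, Base64-encoded) are my own assumptions for the example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.Arrays;
import java.util.Base64;

// Dependency-free sketch; the real project does this inside a Spring MVC controller.
public class PixelEndpointSketch {

    // Assumed pixel: a common 43-byte 1x1 transparent GIF, not necessarily the project's file.
    static final byte[] PIXEL = Base64.getDecoder().decode(
            "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==");

    /** Starts a local server whose /pixel path answers with the GIF bytes. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/pixel", exchange -> {
            // Tell the browser what it is getting...
            exchange.getResponseHeaders().set("Content-Type", "image/gif");
            // ...and how many bytes; a fixed length here becomes the Content-Length header.
            exchange.sendResponseHeaders(200, PIXEL.length);
            try (OutputStream body = exchange.getResponseBody()) {
                body.write(PIXEL);
            }
        });
        server.start();
        return server;
    }

    /** Requests the pixel once and summarises status, content type, length and body match. */
    static String probe() {
        try {
            HttpServer server = start(0); // port 0 = pick any free port
            try {
                int port = server.getAddress().getPort();
                HttpURLConnection conn = (HttpURLConnection)
                        URI.create("http://127.0.0.1:" + port + "/pixel").toURL().openConnection();
                byte[] body;
                try (InputStream in = conn.getInputStream()) {
                    body = in.readAllBytes();
                }
                return conn.getResponseCode() + " " + conn.getContentType() + " "
                        + conn.getContentLength() + " " + Arrays.equals(body, PIXEL);
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The two header-related lines are the important part: without the content type and length, the browser has to guess what the bytes are and where they end.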

All the other pieces of code in this class are just there to support this (single) operation. The controller class is a Spring bean with a single property (PixelResource), which is set in the Spring xml file to a resource pointing to our 1x1.gif image (in the resources/img folder). When this property gets set by the Spring framework, we simply go and read the gif file into memory. We know upfront that:

  • the image is very small and can fit in a buffer of 1kb or less (see DEFAULT_BUFFER_SIZE)
  • we only load this image once, when the controller gets initialized, and then we only read this buffer, so we don’t need to worry about concurrency here
  • we can also safely assume the image is in place, so the exception handling is not the best here, but it is still enough to throw some errors (handled by the Spring framework) if something goes wrong during loading
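
The “load once, then only read” idea from the points above can be sketched roughly as follows. This is a simplified stand-in rather than the actual HomeController code: the PixelLoader class is hypothetical, the 1024 value for DEFAULT_BUFFER_SIZE is my assumption based on the “1kb or less” remark, and I take a plain InputStream where the project uses a Spring resource.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the loading logic, for illustration only.
public class PixelLoader {

    // Assumed value; the image is expected to fit comfortably in 1 KB.
    static final int DEFAULT_BUFFER_SIZE = 1024;

    // Written once in the constructor and never modified afterwards,
    // so concurrent request threads can read it without locking.
    private final byte[] pixelBytes;

    PixelLoader(InputStream resource) {
        this.pixelBytes = readFully(resource);
    }

    /** Reads the whole stream into memory, failing loudly if anything goes wrong. */
    static byte[] readFully(InputStream resource) {
        try (InputStream in = resource) {
            ByteArrayOutputStream out = new ByteArrayOutputStream(DEFAULT_BUFFER_SIZE);
            byte[] chunk = new byte[DEFAULT_BUFFER_SIZE];
            int read;
            while ((read = in.read(chunk)) != -1) {
                out.write(chunk, 0, read);
            }
            if (out.size() == 0) {
                throw new IllegalStateException("pixel resource is empty");
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException("could not load pixel resource", e);
        }
    }

    /** Number of bytes to announce as the response size. */
    int length() {
        return pixelBytes.length;
    }

    /** Returns a copy so callers cannot alter the shared buffer. */
    byte[] bytes() {
        return pixelBytes.clone();
    }
}
```

Because the buffer is filled exactly once and only read after that, every request can share it, which is why there is no synchronisation anywhere in the sketch.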

The rest of the code/xml is just supporting the main chunk of work in HomeController as I said. Once you compile this and deploy the war file into Tomcat (or any other container you might use), visit /pixel — and if you use something like Firebug or a similar tool for examining HTTP traffic in your browser, you will notice that we return the bytes for the pixel, the content length and content type. From there on, to embed this onto a web page, simply add a tag like this:

<img src="http://your-webserver/pixel" />

on any web page you are planning to track the users on and you’re done.

There are a few other versions planned for this project — look in Github and you will see proposed deadlines for these upcoming releases and the sort of things I’m planning to add in them.