About Data Comparison in Java

If you have done any coding (Java or otherwise), you have more than likely had to compare data at some point: to sort items in a list, to validate input, or for any of the many other situations that require it. In quite a few such cases I’ve seen a pattern that can cause problems in code that is called repeatedly, and these problems can show up very quickly in high-throughput systems.

In one of my previous posts I described building an Address Book-like API, where we store the first name, surname and email for each person added to the address book. In this example I’m going to assume a similar Address Book service, but we’ll change the implementation a bit: rather than using a bean with these 3 properties, we will store them in a Map whose keys are firstname, surname and email, mapped to the person’s first name, surname and email. (You could argue that the reason for this is to let classes extending this class add new properties simply by adding extra entries to this Map.) So the Person class now looks like this (or somewhere around there):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class Person {
	public static final String PROP_FIRSTNAME = "firstname";
	public static final String PROP_SURNAME = "surname";
	public static final String PROP_EMAIL = "email";

	// Typed map, so the getters can return String without a cast
	private final Map<String, String> properties = new ConcurrentHashMap<>();

	public Person() {
		this( null, null, null );
	}

	public Person(String firstName, String surname, String emailAddress) {
		setProperty( PROP_FIRSTNAME, firstName );
		setProperty( PROP_SURNAME, surname );
		setProperty( PROP_EMAIL, emailAddress );
	}

	// ConcurrentHashMap rejects null values, so a null value removes the key instead
	private void setProperty(String key, String value) {
		if( value == null )
			properties.remove( key );
		else
			properties.put( key, value );
	}

	public String getFirstName() {
		return properties.get( PROP_FIRSTNAME );
	}
	public void setFirstName(String firstName) {
		setProperty( PROP_FIRSTNAME, firstName );
	}
	public String getSurname() {
		return properties.get( PROP_SURNAME );
	}
	public void setSurname(String surname) {
		setProperty( PROP_SURNAME, surname );
	}
	public String getEmailAddress() {
		return properties.get( PROP_EMAIL );
	}
	public void setEmailAddress(String emailAddress) {
		setProperty( PROP_EMAIL, emailAddress );
	}
}
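
One detail worth keeping in mind with a Map-backed class like this: ConcurrentHashMap does not accept null keys or values, so any code that stores a possibly-null property in one has to deal with null first. Here is a minimal sketch of one way to do that. The NullSafeProperties class and its putOrRemove helper are hypothetical names used only for illustration; they are not part of the Address Book code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NullSafeProperties {
    // Hypothetical helper: stores the value, or removes the key when the value is null.
    static void putOrRemove(Map<String, String> map, String key, String value) {
        if (value == null) {
            map.remove(key);
        } else {
            map.put(key, value);
        }
    }

    // Returns true if a plain put of a null value is rejected with a NullPointerException.
    static boolean plainPutRejectsNull() {
        Map<String, String> map = new ConcurrentHashMap<>();
        try {
            map.put("email", null);
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> properties = new ConcurrentHashMap<>();
        putOrRemove(properties, "firstname", "Ann");
        putOrRemove(properties, "email", null); // no entry is stored for a null value
        System.out.println("firstname = " + properties.get("firstname"));
        System.out.println("email = " + properties.get("email"));
        System.out.println("plain put rejects null: " + plainPutRejectsNull());
    }
}
```

With this approach a missing property and a null property look the same to the getters: both return null.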

Now, since we’re talking about comparing data, it is only appropriate to supply an equals method for the class; naturally, it should treat 2 instances of Person as equal only if all 3 properties are equal (I’m using the Apache Commons Lang StringUtils class here for simplicity):

...
@Override
public boolean equals(Object obj) {
	if( this == obj )
		return true;
	if( obj == null)
		return false;
	if (getClass() != obj.getClass())
		return false;
	Person other = (Person) obj;
	String otherFirstName = other.getFirstName();
	String myFirstName = getFirstName();
	if( !StringUtils.equals(otherFirstName, myFirstName) )
		return false;
 
	String otherSurname = other.getSurname();
	String mySurname = getSurname();
	if( !StringUtils.equals(otherSurname, mySurname) )
		return false;
 
	String otherEmail = other.getEmailAddress();
	String myEmail = getEmailAddress();
	if( !StringUtils.equals(otherEmail, myEmail) )
		return false;
 
	return true;
}
...

(Please note that this is not the way I would normally write my code, but I have seen a lot of this lately — which is why I wrote the code as shown above, to underline the problem with this pattern.)
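
To get a feel for how often a method like this can run, here is a minimal sketch that counts equals calls during a single List lookup. It assumes a hypothetical, cut-down CountingPerson class with one field and a static call counter (neither is part of the Address Book code), and it uses the JDK’s java.util.Objects instead of StringUtils so that it has no external dependency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical, cut-down stand-in for Person: one field plus a call counter.
class CountingPerson {
    static int equalsCalls = 0; // demo only, not thread-safe

    private final String firstName;

    CountingPerson(String firstName) {
        this.firstName = firstName;
    }

    @Override
    public boolean equals(Object obj) {
        equalsCalls++;
        if (this == obj) return true;
        if (obj == null || getClass() != obj.getClass()) return false;
        return Objects.equals(firstName, ((CountingPerson) obj).firstName);
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(firstName);
    }
}

class EqualsCallCountDemo {
    // Builds a list of `size` people, searches it for the last one,
    // and returns how many times equals ran during that one search.
    static int countEqualsCalls(int size) {
        List<CountingPerson> people = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            people.add(new CountingPerson("name" + i));
        }
        CountingPerson.equalsCalls = 0;
        boolean found = people.contains(new CountingPerson("name" + (size - 1)));
        if (!found) throw new IllegalStateException("expected a match");
        return CountingPerson.equalsCalls;
    }

    public static void main(String[] args) {
        System.out.println("equals calls for one lookup: " + countEqualsCalls(1000));
    }
}
```

A linear search compares the probe against the elements one by one until it finds a match, so whatever equals does per call is multiplied by the length of the list.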

At first glance there is no problem with the above code: it is clear at each step what we are comparing, it is easy to read, it’s easy to add logging at each step (say, to log the properties of the 2 instances when we find them not to be equal), etc. However, there is a small (!?) problem with this: each time it is called, it declares 6 local references to 6 String objects, and these stay in scope until the function returns! That might not sound like much, but calling this equals function for each item in a long List sets up these 6 references and discards them again on every single call. (Note that this is not the same as creating 6 actual String instances: a local reference is a slot in the method’s stack frame rather than an object on the heap, so the cost per call is small.) On top of that, a reference that is still in scope can keep its String object reachable, so an object that would otherwise be eligible for GC may stay in memory until the reference goes away; how long that really lasts depends on the JVM, since a JIT compiler may treat a variable as dead after its last use. If you’re not a big fan of simply using a construct like this:

...
if( !StringUtils.equals(other.getEmailAddress(), getEmailAddress()) )
  return false;
...

and prefer to extract these values into local variables, you can still do that, but I would suggest using as few references as possible when taking this approach. In this case, 2 variables would do:

...
@Override
public boolean equals(Object obj) {
	if( this == obj )
		return true;
	if( obj == null)
		return false;
	if (getClass() != obj.getClass())
		return false;
	Person other = (Person) obj;
	String temp = other.getFirstName();
	String myTemp = getFirstName();
	if( !StringUtils.equals(temp, myTemp) )
		return false;

	temp = other.getSurname();
	myTemp = getSurname();
	if( !StringUtils.equals(temp, myTemp) )
		return false;

	temp = other.getEmailAddress();
	myTemp = getEmailAddress();
	if( !StringUtils.equals(temp, myTemp) )
		return false;
 
	return true;
}
...

So by using this simple “trick” we have eliminated 4 references — namely otherSurname, mySurname, otherEmail and myEmail. Your GC will thank you 😉
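
For comparison, here is a minimal sketch of the inline style mentioned earlier, with no temporaries for the property values at all. It assumes a hypothetical, simplified InlinePerson class with plain fields instead of the property Map, and it swaps StringUtils.equals for the JDK’s null-safe java.util.Objects.equals so that the example has no external dependency. It also overrides hashCode, which should always accompany equals. The sameInstance helper is there purely as an illustration: it shows that storing a getter’s result in a local variable copies a reference and does not create a new String.

```java
import java.util.Objects;

// Hypothetical, simplified variant of Person with plain fields.
class InlinePerson {
    private final String firstName;
    private final String surname;
    private final String emailAddress;

    InlinePerson(String firstName, String surname, String emailAddress) {
        this.firstName = firstName;
        this.surname = surname;
        this.emailAddress = emailAddress;
    }

    String getFirstName() { return firstName; }
    String getSurname() { return surname; }
    String getEmailAddress() { return emailAddress; }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (obj == null || getClass() != obj.getClass()) return false;
        InlinePerson other = (InlinePerson) obj;
        // Each comparison is done inline; && stops at the first mismatch.
        return Objects.equals(other.getFirstName(), getFirstName())
            && Objects.equals(other.getSurname(), getSurname())
            && Objects.equals(other.getEmailAddress(), getEmailAddress());
    }

    @Override
    public int hashCode() {
        return Objects.hash(firstName, surname, emailAddress);
    }

    // Illustration only: the local variable refers to the very same String
    // object that the getter returns; no copy of the String is made.
    static boolean sameInstance(InlinePerson p) {
        String local = p.getFirstName();
        return local == p.getFirstName();
    }
}

class InlinePersonDemo {
    public static void main(String[] args) {
        InlinePerson a = new InlinePerson("Ann", "Lee", "ann@example.com");
        InlinePerson b = new InlinePerson("Ann", "Lee", "ann@example.com");
        System.out.println("a equals b: " + a.equals(b));
        System.out.println("same String instance: " + InlinePerson.sameInstance(a));
    }
}
```

Whether you prefer this form or the 2-variable version is largely a matter of readability and of how easy you want it to be to add logging between the checks.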

One Response to “About Data Comparison in Java”

  1. Steve Ford

    I particularly like this piece – for a long time now developers have paid little or no attention to the amount of memory they use (or waste) or to the GC process.
    Whilst the amount of memory available to a process is much greater, and processing speeds are much faster than they were, say, 20 or 30 years ago, neither of these ‘commodities’ is unlimited and developers should try and optimise code using techniques such as described in this article wherever possible.