Tuesday, September 11, 2007

Is your VM lazy or eager?

My dad would turn in his grave if I said that sometimes it is actually better to be lazy.

Introduction
We had a strange bug appear on our SUSE and Fedora 4 builds today. Our main client product threw a NoClassDefFoundError from a part of the code that should never have needed the class it was complaining about. Basically, I used jacob.jar, a very slick little Java-COM bridging library, to do some Windows service interrogation for a health check feature we implemented on our client product. (Actually, in this case our client talks to a service, but in the context of our entire product suite, the service and the GUI are collectively "the client".) Anyway, moving on.

This feature is obviously only necessary on Windows builds, so we don't package the jacob.jar in our non-Windows builds, which was fine because we don't call the potentially offending methods on the non-Windows builds either. Even though there were import statements for these classes, we felt sure that it wouldn't be a problem because those classes are never needed, so naturally they would never be loaded. Or so I ass-u-me'd!

What Tha?!
When our QA guys reported the bug, I felt sure there was another place in our code that tried to load those classes. But after searching the code, I found nothing. Then I assumed that maybe the import statements at the top of the class were causing the problem, so I removed them and fully qualified the class names in the method that was never called. Kaboom! It still complained of a NoClassDefFoundError. (In hindsight that was never going to help: imports are purely a compile-time convenience, and the compiled class file records the fully qualified names either way.)

Test The Assumption
Okay, so in times like this, it's best to isolate the problem and see whether you can reproduce it with a simple case. So I created two classes, MainClass and SomeOtherClass, as follows.

Here's MainClass:
package com.test.classloading;

import com.acme.clientlib.SomeOtherClass;

public class MainClass {

    // Runs once, when the VM initializes MainClass.
    static {
        System.out.println("Loading MainClass");
    }

    // The only method that touches SomeOtherClass.
    public void doYourThing() {
        System.out.println("Enter doYourThing()");

        SomeOtherClass soc = new SomeOtherClass();
        soc.doStuff();

        System.out.println("Exit doYourThing()");
    }

    // Has no dependency on SomeOtherClass.
    private void doYourOtherThing() {
        System.out.println("Enter doYourOtherThing()");

        System.out.println("Doing other thing");

        System.out.println("Exit doYourOtherThing()");
    }

    public static void main(String[] args) {
        MainClass mc = new MainClass();
        mc.doYourThing();
        mc.doYourOtherThing();
    }
}

And here's SomeOtherClass:
package com.acme.clientlib;

public class SomeOtherClass {

    static {
        System.out.println("Loading SomeOtherClass");
    }

    public void doStuff() {
        System.out.println("doStuff called");
    }
}

As you can see, both classes have static initializer blocks that announce when the class is initialized (which happens only after it has been loaded and linked).

When I run this test I get the output I expect:
Loading MainClass
Enter doYourThing()
Loading SomeOtherClass
doStuff called
Exit doYourThing()
Enter doYourOtherThing()
Doing other thing
Exit doYourOtherThing()

Ok, so next I commented out the call to doYourThing(), the only code path that reaches SomeOtherClass's doStuff() method, as follows:
        MainClass mc = new MainClass();
        // mc.doYourThing();
        mc.doYourOtherThing();

When I run that on Windows I get:
Loading MainClass
Enter doYourOtherThing()
Doing other thing
Exit doYourOtherThing()

So it would appear that SomeOtherClass never gets loaded. Cool! We're all set!
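
A small caveat before celebrating: a static initializer only fires when a class is initialized, so its silence doesn't strictly prove the class was never loaded. Here's a minimal sketch of the difference, using the three-argument Class.forName to load a class without initializing it. LoadVersusInit and Probe are made-up names for illustration and are not part of the test above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original test.
class LoadVersusInit {

    // Records events in the order they happen.
    static final List<String> EVENTS = new ArrayList<>();

    // Stand-in for SomeOtherClass: its static block runs at initialization.
    static class Probe {
        static {
            EVENTS.add("Probe initialized");
        }
    }

    static List<String> run() throws ClassNotFoundException {
        EVENTS.clear();
        ClassLoader loader = LoadVersusInit.class.getClassLoader();
        String name = LoadVersusInit.class.getName() + "$Probe";

        // initialize = false: load the class but do not run its static block.
        Class<?> loaded = Class.forName(name, false, loader);
        EVENTS.add("loaded " + loaded.getSimpleName());

        // initialize = true: the static block runs now, if it has not already.
        Class.forName(name, true, loader);
        EVENTS.add("after forced initialization");

        return new ArrayList<>(EVENTS);
    }

    public static void main(String[] args) throws ClassNotFoundException {
        for (String event : run()) {
            System.out.println(event);
        }
    }
}
```

On a fresh VM, "loaded Probe" is recorded before "Probe initialized", so a class can be sitting in the VM without its static block ever having printed anything.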

Not so fast
Unfortunately, when I ran that same code on the Fedora box (in my case) with Sun JDK 1.5, I got the dreaded NoClassDefFoundError again! What gives?! How can it work on Windows, but not on Linux?! At first we thought it might be compiler inlining optimizations from some static finals we use in our code, but that didn't make sense either, as the inlining would have had to differ between the Windows and Linux builds.

Well, as I'm sure you know, once a class is loaded it must be linked before it can be initialized. To link a class it must be verified, prepared, and (optionally) resolved. It seems that in this case the Windows VM is doing a lazy job of that last step, resolving (i.e. only doing it when it absolutely has to), whereas the Linux VM is being a bit overeager.

Okay, so an easy fix then would be to simply move all of that code into a separate class. Then when the general class is loaded, the Windows-specific stuff will be elsewhere. No problem. So I changed my code and moved all the jacob.jar-dependent code into a separate class. I tested it on Fedora. Worked like a charm. I checked that into CVS, and sat back in the comfort of knowing I'd fixed the bug once and for all... or so I ass-u-me'd.
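
The refactoring described above might look roughly like the sketch below. The class and method names (HealthCheck, WindowsServiceProbe, queryServiceStatus) are hypothetical, not the ones from our product, and the jacob.jar calls are replaced by a placeholder string so the example runs anywhere.

```java
import java.util.Locale;

// Hypothetical holder for everything that would depend on jacob.jar.
class WindowsServiceProbe {
    static String queryServiceStatus() {
        // In the real code, the COM calls made through jacob.jar would live here.
        return "service status checked";
    }
}

// The general class: only reaches the Windows-specific class on Windows.
class HealthCheck {

    static boolean isWindows(String osName) {
        return osName != null
                && osName.toLowerCase(Locale.ROOT).startsWith("windows");
    }

    static String run(String osName) {
        if (isWindows(osName)) {
            return WindowsServiceProbe.queryServiceStatus();
        }
        return "service check skipped on " + osName;
    }

    public static void main(String[] args) {
        System.out.println(run(System.getProperty("os.name")));
    }
}
```

Note that HealthCheck still names WindowsServiceProbe directly, so its compiled class file still carries a symbolic reference to it.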

An annoyingly knowledgeable mate of mine ("YOU KNOW WHO YOU ARE!") pointed out a small catch as per TheOracle(tm), TheSourceOfAllKnowledge(tm)... also known as TheSpec(tm). In particular, he pointed me to the VM spec, section 2.17.1:

"The resolution step is optional at the time of initial linkage. An implementation may resolve a symbolic reference from a class or interface that is being linked very early, even to the point of resolving all symbolic references from the classes and interfaces that are further referenced, recursively."

Yikes! So my fix was really just a band-aid! There's every possibility that, even though I'd moved the code into another class, some VM implementation might still try to resolve those classes eagerly. Ok, well, to cut a long story short (er, too late ;-) the best solution for this problem was to simply call the other class I'd created via reflection. That way the class is named only in a string at runtime, so the calling class carries no symbolic reference to it for the VM to resolve early.
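
Here's a minimal sketch of what such a reflective call could look like. ReflectiveCaller and invokeIfPresent are hypothetical names, and it assumes the target class has a public no-argument constructor and a public no-argument method. It also uses multi-catch, which needs Java 7 or later, so it is not the exact code we shipped on JDK 1.5.

```java
import java.lang.reflect.Method;

// Hypothetical helper, shown only to illustrate the reflective approach.
class ReflectiveCaller {

    // Loads the named class at runtime and invokes a public no-arg method on a
    // new instance. Returns false if the class (or one of its dependencies) is
    // missing, instead of letting the error escape.
    static boolean invokeIfPresent(String className, String methodName) {
        try {
            Class<?> clazz = Class.forName(className);
            Object instance = clazz.getDeclaredConstructor().newInstance();
            Method method = clazz.getMethod(methodName);
            method.invoke(instance);
            return true;
        } catch (ClassNotFoundException | NoClassDefFoundError e) {
            // Expected on builds that do not ship the optional jar.
            return false;
        } catch (ReflectiveOperationException e) {
            // The class exists but could not be called as assumed.
            throw new IllegalStateException("Could not call " + methodName, e);
        }
    }

    public static void main(String[] args) {
        // A JDK class stands in for the Windows-only class so this runs anywhere.
        System.out.println(invokeIfPresent("java.util.ArrayList", "clear"));
        // Made-up class name, standing in for a class left out of the build.
        System.out.println(invokeIfPresent("com.acme.clientlib.NotPackagedHere", "doStuff"));
    }
}
```

The trade-off is that the compiler can no longer check the class or method name for you, so a typo only shows up at runtime.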

Conclusion
So remember that then. Never assume. Did I say that clearly enough? No? Let me say it again then... NEVER ... ASSUME! When in doubt, check the spec!

There's a reason why there's an acronym called "RTFM".