Saturday, March 29, 2014

What happened to synchronized in Java 8?

I was curious about the relative performance of the AtomicXXX classes vs. Locks or synchronized methods under high contention in a multi-threaded context, so I came up with a micro-benchmark that performs this comparison with various numbers of concurrent threads.

What I am testing are implementations of the following interface:

public interface LongCounter {
  long incrementAndGet();
  long decrementAndGet();
  long addAndGet(long value);
  long get();
  void set(long value);
}

Implementations use the following patterns:

with AtomicLong:

import java.util.concurrent.atomic.AtomicLong;

public class LongCounterAtomic implements LongCounter {
  private final AtomicLong value = new AtomicLong(0L);
  @Override public long incrementAndGet() {
    return value.incrementAndGet();
  }
  ... other methods ...
}

with ReentrantLock:

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class LongCounterLock implements LongCounter {
  private long value = 0L;
  private final Lock lock = new ReentrantLock();
  @Override public long incrementAndGet() {
    lock.lock();
    try {
      return ++value;
    } finally {
      lock.unlock();
    }
  }
  ... other methods ...
}

with synchronized methods:

public class LongCounterSynchronized implements LongCounter {
  private long value = 0L;
  @Override public synchronized long incrementAndGet() {
    return ++value;
  }
  ... other methods ...
}

The test then performs various invocations of the interface methods on a single instance of one of the implementations, which is shared by all the threads in the test. This is done for a specified number of iterations. The number of threads varies from 1 to 256, while the total number of iterations is fixed, which means the more threads, the fewer iterations each thread has to perform. The parameters for a test are thus: the LongCounter implementation instance, the total number of iterations and the number of concurrent threads.
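
In outline, the harness follows the shape sketched below. This is a simplified sketch, not the exact code I ran (see the source linked below); the class and method names and the latch-based start are just shorthand for what the real driver does:

import java.util.concurrent.CountDownLatch;

// Sketch of the driver: a fixed total number of iterations is split evenly
// across the worker threads, which all hammer the same counter instance.
public class CounterBenchmarkSketch {

  static long runTest(final LongCounter counter, final int threads, final long totalIterations)
      throws InterruptedException {
    final long perThread = totalIterations / threads;
    final CountDownLatch start = new CountDownLatch(1);
    final CountDownLatch done = new CountDownLatch(threads);

    for (int i = 0; i < threads; i++) {
      new Thread(new Runnable() {
        @Override public void run() {
          try {
            start.await();                      // all workers begin together
            for (long j = 0; j < perThread; j++) {
              counter.incrementAndGet();        // the contended operation under test
            }
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          } finally {
            done.countDown();
          }
        }
      }).start();
    }

    long t0 = System.nanoTime();
    start.countDown();                          // release the workers
    done.await();                               // wait for the last one to finish
    return System.nanoTime() - t0;              // elapsed time for this configuration
  }
}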

The benchmark measures the elapsed time for each test, as well as the average CPU load during the test. Testing with Java 7 on an Intel i7-2600 (4 cores, 8 hardware threads) gives the following results:

So far, so good. It is interesting to see the Atomic implementation lagging well behind the others. The higher CPU load is to be expected, since AtomicLong uses lock-free algorithms for thread-safe access to the value, so contending threads keep retrying instead of blocking.
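
For the curious, AtomicLong.incrementAndGet in the JDK 7 class library is essentially a compare-and-set retry loop, along the lines of the sketch below (illustrative, not the actual JDK source); under heavy contention the losing threads spin and retry rather than block:

import java.util.concurrent.atomic.AtomicLong;

public class CasIncrementSketch {
  // Read, compute, compare-and-set, retry until the CAS wins. Threads that
  // lose the race keep spinning, which keeps the CPUs busy under contention.
  static long incrementAndGet(AtomicLong value) {
    for (;;) {
      long current = value.get();
      long next = current + 1;
      if (value.compareAndSet(current, next)) {
        return next;
      }
    }
  }
}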

Now, what really threw me off were the results I obtained when I ran the exact same code with Java 8 (and recompiled it to the Java 8 class format):

Wow! What happened to the synchronized stuff? It's now 5 times slower than in Java 7. To be fair, the performance of AtomicLong has improved considerably: it is now about twice as fast. This is quite puzzling, and I'm wondering what's going on. I'd appreciate any feedback on this. Maybe I'm not testing the right way, or something has dramatically changed in how the JVM handles synchronized methods and blocks.

For reference, I have posted the source code here and the full benchmark results here (as a PDF).

I will be grateful for any input the community can provide, thanks a lot!

7 comments:

  1. Sounds like we are going to face some concurrency performance issues whenever we decide to migrate from Java 7 to Java 8.
    It would be helpful if you opened an issue in the Java bug tracker. At least we would hear something about this from the Oracle people.

    1. Hi Fernando,

      Thanks for the suggestion. I wanted to wait a little for more feedback, especially on my benchmarking approach. See Aleksey's comments below; they shed new light on what I've done.

  2. Did you use synch elision? I'm surprised that the single threaded results are so bad for synch.

    1. Hi,
      To be honest, I had never seen this term before reading your comment, so thanks for bringing something new into my life!
      I've looked it up, and I don't have the impression I did anything special with regard to thread/lock elision.
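
      (For the record, as far as I understand it, lock elision is the JIT removing synchronization on an object that escape analysis proves never leaves a method, roughly as in the sketch below; it cannot help here, since the counter instance is deliberately shared by all the threads.)

      // Sketch: the lock object never escapes the method, so the JIT may
      // elide the synchronization entirely after escape analysis.
      static long localCount(int n) {
        Object lock = new Object();   // provably thread-local
        long v = 0;
        for (int i = 0; i < n; i++) {
          synchronized (lock) {       // candidate for lock elision
            v++;
          }
        }
        return v;
      }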

  3. If you really want to see whether there is an issue, then you have to do it with proper benchmarks. A non-exhaustive list of what makes the performance claims invalid so far:

    1. The warmup is barely valid: you first call the LongCounter methods separately, and then measure them *at different* call sites. That means your measurement loop is not warmed up at all, and you compile during the measurement.
    2. You measure all three implementations in the same VM, which cross-contaminates the profiles and possibly makes the first test in the measurement loop run unfairly faster.
    3. You have unfortunate edge effects with threads starting/stopping: once you run more than one thread, there are cases where the benchmark measures performance before all the threads have started and are running.

    Read this: http://shipilev.net/#benchmarking
    Use this: http://openjdk.java.net/projects/code-tools/jmh/

    1. Thanks Aleksey for taking the time to comment on this. While wrapping my head around JMH, I made some modifications to the code to address the points you listed: 1) warmup and measurement are now done on the same LongCounter instances, 2) main() now takes an argument specifying which kind of counter to test (a string with any combination of 's', 'l' or 'a'), so each type of counter runs in a separate JVM, and 3) a cyclic barrier in the test threads ensures they are all at the same point before measurement starts.

      Indeed, I now see a huge difference in the time measured for synchronized. I really have trouble understanding why running the synchronized test by itself in a different VM makes such a difference. I guess I'll have to read your benchmarking slides again; they are very enlightening - thanks a lot for this work.

      Once I get the JMH-based implementation working, I'll publish the results and how I got them in a new article.
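
      In the meantime, this is roughly the shape I have in mind - a minimal sketch assuming a recent JMH release where the annotation is @Benchmark (the class name LongCounterJmhSketch and the thread count are placeholders; the code I publish may differ):

      import org.openjdk.jmh.annotations.Benchmark;
      import org.openjdk.jmh.annotations.Scope;
      import org.openjdk.jmh.annotations.State;
      import org.openjdk.jmh.annotations.Threads;

      // Sketch only: one shared counter, hammered by the number of threads
      // given with @Threads (or -t on the JMH command line). Returning the
      // long keeps the JIT from eliminating the call as dead code.
      @State(Scope.Benchmark)
      public class LongCounterJmhSketch {

        private final LongCounter synchronizedCounter = new LongCounterSynchronized();
        private final LongCounter atomicCounter = new LongCounterAtomic();

        @Benchmark
        @Threads(8)
        public long synchronizedIncrement() {
          return synchronizedCounter.incrementAndGet();
        }

        @Benchmark
        @Threads(8)
        public long atomicIncrement() {
          return atomicCounter.incrementAndGet();
        }
      }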

      Thanks again,
      -Laurent

    2. Could also add to the list that the single-threaded performance is worsened because the warm-up phase is too short and the monitors are created, inflated and used before biased locking has started. I ran the same experiments with -XX:BiasedLockingStartupDelay=0 and got 5 times faster single-threaded performance.
