Linear Scaling
The power of a Uniform Resource Engine

This report presents test results demonstrating that the NetKernel ROC architecture scales linearly with CPU cores. This is achieved without requiring any use or knowledge of asynchronous threaded programming techniques. Software on NetKernel is load-balanced across CPU cores in the same way that a Web application can be load-balanced across an array of servers.

The tests were conducted with the nk-perf system profiling tool, which can be installed on your NetKernel system using Apposite. The tests are designed with no caching acceleration; in real-world systems, where cacheable resources are dynamically discovered, actual results are therefore better than linear.

The performance tests were run on two 8-core based servers.

Linear Scaling

The test named "Scaling Concurrent Requests" issues an increasing number of concurrent root requests, each of which invokes several thousand synchronous sub-requests against an endpoint that performs a small computational task and returns a non-cacheable representation. The test named "Scaling Kernel Threads" issues 2000 asynchronous requests that invoke the same test endpoint, with the asynchronous load executed on an increasing number of kernel threads.
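
The shape of the "Scaling Concurrent Requests" load can be sketched in plain Java. This is only an illustration of the test pattern, not nk-perf source code; the class name, task body, and request counts are placeholders chosen for the sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrencySketch {
    // Small non-cacheable computational task, standing in for the test endpoint.
    static long subRequest(long seed) {
        long h = seed;
        for (int i = 0; i < 1000; i++) h = h * 31 + i;
        return h;
    }

    // Each root request issues many synchronous sub-requests.
    static long rootRequest(int subRequests) {
        long acc = 0;
        for (int i = 0; i < subRequests; i++) acc ^= subRequest(i);
        return acc;
    }

    public static void main(String[] args) throws Exception {
        int concurrency = 8; // number of concurrent root requests, scaled up per run
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        List<Future<Long>> results = new ArrayList<>();
        long start = System.nanoTime();
        for (int i = 0; i < concurrency; i++)
            results.add(pool.submit(() -> rootRequest(2000)));
        for (Future<Long> f : results) f.get(); // wait for all root requests
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("concurrency=" + concurrency + " elapsedMs=" + elapsedMs);
        pool.shutdown();
    }
}
```

On a linear system, raising `concurrency` up to the core count should raise throughput proportionally while the per-request time stays flat.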

A detailed discussion of the different test profiles is provided with the nk-perf tool. The tool includes full source code so that you can verify our testing methodology and adapt or extend the benchmark tests as required.

Dual Quad-Core Xeon Processor L5420 / Solaris

2 x 4-core Intel(R) Xeon(R) CPU L5420 @ 2.50GHz, Solaris 10 10/08 s10x_u6wos_07b X86, Java 1.6.0_18

Dual Quad-Core AMD Opteron(tm) Processor 2352 / Linux

2 x 4-core AMD Opteron(tm) 2352 @ 2.0GHz, Linux 2.6.28-17-server, Java 1.6.0_16

JVM Tuning

This section demonstrates the impact of tuning the Java Runtime environment.

Default

This is the out-of-the-box NKMark10 score for the 8-core AMD/Linux test machine. All JVM settings are defaults.

JVM Tuned

Below is the NKMark10 score for the same platform, this time with the additional JVM options -XX:+UseAdaptiveSizePolicy and -XX:+UseParallelGC set.

It can be seen that these options provide approximately a 25% improvement. However, they also change the scaling profile: with the options set, the asynchronous load line develops a pronounced peak corresponding to the number of available processing cores.

This suggests that using the NetKernel throttle with a concurrency of 8, together with 8 kernel threads, would lock the architecture at peak throughput and eliminate load-dependent variability in system performance.
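
The principle behind the throttle, admitting at most N requests at once so the system stays pinned at its peak-throughput operating point, can be sketched with a fair semaphore. NetKernel's throttle is a declaratively configured overlay; this standalone Java sketch only illustrates the idea:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Sketch of a throttle: at most `concurrency` requests execute at once;
// the rest queue fairly until a permit is released.
public class ThrottleSketch {
    private final Semaphore permits;

    public ThrottleSketch(int concurrency) {
        this.permits = new Semaphore(concurrency, true); // fair FIFO queueing
    }

    public <T> T admit(Callable<T> request) throws Exception {
        permits.acquire(); // waits here when all permits are in use
        try {
            return request.call(); // at most `concurrency` requests run here
        } finally {
            permits.release();
        }
    }
}
```

With `new ThrottleSketch(8)` on an 8-core machine, admitted concurrency matches the peak of the measured load line regardless of how much load arrives upstream.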

Cache Linearity

Shown below are the results of the nk-perf cache profile measurements. The two results show cache performance for random data sets with uniform and normal (Gaussian) distributions. Any non-linearity in the cache would show up as a smeared step function (uniform) or a distorted distribution (Gaussian). It can be seen that the cache is linear.
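
The two kinds of data set can be illustrated as follows. This is a sketch of how uniform and Gaussian key distributions might be generated for a cache test, not the nk-perf generator itself; the key-space size and spread are arbitrary:

```java
import java.util.Random;

public class CacheKeyDistributions {
    // Uniform: every key in [0, keySpace) is equally likely.
    static int uniformKey(Random rnd, int keySpace) {
        return rnd.nextInt(keySpace);
    }

    // Normal (Gaussian): keys cluster around the middle of the key space,
    // giving a "hot" working set; tail samples are clamped back into range.
    static int gaussianKey(Random rnd, int keySpace) {
        int k = (int) Math.round(keySpace / 2.0 + rnd.nextGaussian() * keySpace / 8.0);
        return Math.max(0, Math.min(keySpace - 1, k));
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int i = 0; i < 5; i++)
            System.out.println(uniformKey(rnd, 1000) + "\t" + gaussianKey(rnd, 1000));
    }
}
```

Replaying each key stream against a cache of varying size and plotting hit rate against cache size reveals the step function (uniform) or distribution shape (Gaussian) the report refers to.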

Discussion

This section provides a detailed discussion of the results.

Details

The 8-core Xeon / Solaris results are very close to the ideal. We see a linear increase in throughput and constant response time while concurrency increases to utilize the available CPU cores. Once concurrency fully utilizes the available cores, we see constant throughput and a linear rise in response time. These results show that the NetKernel architecture is linear.

The 8-core AMD / Linux results show how platform dependencies may affect your results. NetKernel behaves as before, and again demonstrates linear scaling with CPU cores, but the response time is not quite as ideal once concurrency exceeds the available cores, showing a slow drop-off as concurrency increases. This indicates that the low-level stack (hardware, OS, JVM) is not as well matched and linear as on the other test system.

As with the discussion of JVM tuning, it is clear that on a non-linear stack the NetKernel throttle overlay can be introduced to "pin the load line" and provide guaranteed linear response for your architecture. In fact, the throttle can be valuable even on a linear stack: it allows you to control admittance to your applications/services without tying up threads. Transient memory usage is therefore reduced (often dramatically), which benefits performance because granular "memory churn" allows the GC to operate more efficiently. The bottom line is that you achieve the same, or better, throughput with orders of magnitude fewer object references.

CPU Utilisation

The CPU utilisation values are those reported by the JVM ThreadMXBean. This number is not necessarily reliable, since the measurement itself perturbs what it measures (a Heisenberg effect). In practice we observe that the host operating system's CPU monitor shows the CPUs to be fully loaded during all test scenarios. However, it is interesting to note that the "Scaling Concurrent Requests" result on the Xeon/Solaris platform reports close to 100% utilisation, whereas the AMD/Linux platform never shows better than 80%.
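
For reference, per-thread CPU time is obtained from the standard ThreadMXBean management API. This minimal sketch shows the calls involved; the workload inside the lambda is an arbitrary placeholder:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuTimeSample {
    // Returns the CPU nanoseconds the current thread consumed while running
    // `work`, or -1 if the JVM does not support per-thread CPU timing.
    static long cpuNanos(Runnable work) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) return -1;
        long before = mx.getCurrentThreadCpuTime();
        work.run();
        return mx.getCurrentThreadCpuTime() - before;
    }

    public static void main(String[] args) {
        long nanos = cpuNanos(() -> {
            long acc = 0;
            for (int i = 0; i < 5_000_000; i++) acc += i; // burn some CPU
        });
        System.out.println("cpuNanos=" + nanos);
    }
}
```

Summing such per-thread deltas across all kernel threads and dividing by wall-clock time gives a utilisation figure of the kind reported here, subject to the sampling caveats above.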

Conclusion

These tests show that the NetKernel architecture provides a truly linear-scaling software solution for multi-core processing platforms, without the need for threading or concurrent-programming knowledge.

Furthermore, these tests constitute a worst-case scenario with no caching of resources. In real-world systems, NetKernel's intrinsic system-wide caching dynamically discovers reusable computation results, which means real-world applications perform even better than these test scenarios suggest.