Presented in this report are test results that demonstrate the NetKernel ROC architecture
scaling linearly with CPU cores.
This is achieved without requiring any use or knowledge of asynchronous threaded programming techniques.
Software on NetKernel is load-balanced across CPU cores in the same way that a Web application can be load-balanced across
an array of servers.
The tests were conducted with the nk-perf system profiling tool, which
is available for installation on your NetKernel system using Apposite.
The tests are designed with no caching acceleration. In real-world
systems, where cacheable resources are dynamically discovered,
actual performance is therefore better than the linear scaling shown here.
The performance tests were run on two 8-core servers.
2 x 4-core Intel(R) Xeon(R) CPU L5420 @ 2.50GHz, Solaris 10 10/08 s10x_u6wos_07b X86, Java 1.6.0_18
A detailed discussion of the different test profiles is provided with the nk-perf tool.
The tool includes full source code so that you can verify our testing methodology
and adapt or extend the benchmark tests as required.
Briefly, the test named "Scaling Concurrent Requests" issues an increasing number of concurrent
root requests. Each root request invokes several thousand synchronous sub-requests to
an endpoint that performs a small computational task and returns a non-cacheable representation.
The test named "Scaling Kernel Threads" issues 2000 asynchronous requests that invoke the test endpoint.
The asynchronous load is executed on increasing numbers of kernel threads.
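The shape of the "Scaling Concurrent Requests" test can be illustrated with a minimal plain-Java sketch. This is not nk-perf and does not use any NetKernel API: the class ScalingSketch, the smallTask workload, and the iteration counts are all assumptions chosen for illustration. It runs an increasing number of concurrent root tasks, each making a few thousand synchronous calls to a small computational task, and reports throughput.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a concurrency-scaling benchmark; this is NOT nk-perf.
class ScalingSketch {

    // Small CPU-bound task standing in for the test endpoint (assumed workload).
    static long smallTask(int iterations) {
        long acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc += (i * 31L) ^ (acc >>> 3);
        }
        return acc;
    }

    // Runs `concurrency` root tasks in parallel, each making `subRequests`
    // synchronous calls to smallTask, and returns sub-requests completed per second.
    static double measureThroughput(int concurrency, int subRequests) {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            List<Callable<Long>> roots = new ArrayList<>();
            for (int r = 0; r < concurrency; r++) {
                roots.add(() -> {
                    long sum = 0;
                    for (int s = 0; s < subRequests; s++) {
                        sum += smallTask(1_000);
                    }
                    return sum;
                });
            }
            long start = System.nanoTime();
            for (Future<Long> f : pool.invokeAll(roots)) {
                f.get(); // wait for completion and surface any failure
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            return (concurrency * (double) subRequests) / seconds;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("benchmark run failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Sweep concurrency from 1 to twice the core count.
        for (int c = 1; c <= 2 * cores; c++) {
            System.out.printf("concurrency=%d throughput=%.0f req/s%n",
                    c, measureThroughput(c, 2_000));
        }
    }
}
```

On a well-behaved stack the printed throughput would be expected to rise with concurrency until it reaches the core count and then flatten. A micro-benchmark this small is sensitive to JIT warm-up, so treat its numbers as indicative only.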
2 x 4-core AMD(R) Opteron(R) 2.0GHz, Linux version 2.6.28-17-server, Java 1.6.0_16
The next section demonstrates the impact of tuning the Java Runtime environment.
Standard
This is the out-of-the-box NKMark10 score for the 8-core AMD/Linux test machine.
All JVM settings are defaults.
JVM Tuned
Below is the NKMark10 score for the same platform, this time with the additional JVM options
-XX:+UseAdaptiveSizePolicy -XX:+UseParallelGC set.
These options provide an improvement of approximately 25%. However, they also
change the scaling profile. With the options set, the asynchronous load line develops a pronounced peak
corresponding to the number of available processing cores.
This suggests that using the NetKernel throttle with a concurrency of 8 and 8 kernel threads
would lock the architecture at peak throughput and eliminate load-dependent variability of the system
performance.
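When comparing tuned and untuned runs it helps to confirm which options the running JVM actually received. The following is a minimal sketch using the standard java.lang.management API. The class name JvmSettingsCheck is hypothetical, and how the options are passed to your NetKernel instance is not covered here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports the -XX options and garbage collectors of the running JVM.
class JvmSettingsCheck {

    // Returns the -XX: options passed on the JVM command line.
    static List<String> jvmFlags() {
        List<String> flags = new ArrayList<>();
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            if (arg.startsWith("-XX:")) {
                flags.add(arg);
            }
        }
        return flags;
    }

    // Returns the names of the collectors the JVM has registered.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("JVM -XX options: " + jvmFlags());
        System.out.println("Collectors:      " + collectorNames());
    }
}
```

Collector names vary by JVM vendor and version. On HotSpot the parallel collector is typically reported as "PS Scavenge" and "PS MarkSweep", but check this against your own runtime rather than relying on it.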
Shown below are the results of the nk-perf cache profile measurements. The two results show the cache
performance for random data sets of varying size drawn from uniform and normal (Gaussian) distributions. Any non-linearity in
the cache would show up as a slurred step function (uniform case) or a non-normal distribution (Gaussian case). It can be
seen that the cache is linear.
The 8-core Xeon / Solaris results are very close to the ideal.
We see a linear increase in throughput and constant response time while
concurrency increases to utilize the available CPU cores.
Once concurrency fully utilizes the available cores,
we see constant throughput and a linear rise in response time.
These results show that the NetKernel architecture is linear.
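The "ideal" curves described above can be written down as a simple model. The sketch below is an idealisation for illustration only, not part of nk-perf: it assumes every request needs the same fixed CPU time (the hypothetical serviceSeconds parameter) and ignores scheduling and GC overhead.

```java
// Idealised closed-system model of a linear stack (illustrative assumption, not measured data).
class IdealScalingModel {

    // Requests per second with `concurrency` requests in flight on `cores` cores.
    // Rises linearly until all cores are busy, then stays constant.
    static double throughput(int concurrency, int cores, double serviceSeconds) {
        return Math.min(concurrency, cores) / serviceSeconds;
    }

    // Seconds per request. Constant until all cores are busy, then rises
    // linearly because requests must share the cores.
    static double responseTime(int concurrency, int cores, double serviceSeconds) {
        return serviceSeconds * Math.max(1.0, (double) concurrency / cores);
    }

    public static void main(String[] args) {
        int cores = 8;
        double serviceSeconds = 0.5; // assumed per-request CPU time
        for (int c = 1; c <= 16; c++) {
            System.out.printf("concurrency=%2d throughput=%5.1f req/s response=%.3f s%n",
                    c, throughput(c, cores, serviceSeconds), responseTime(c, cores, serviceSeconds));
        }
    }
}
```

In this model concurrency always equals throughput multiplied by response time (Little's law), which is why constant throughput beyond 8 concurrent requests implies a linear rise in response time.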
The 8-core AMD / Linux results show how platform dependencies may affect your results.
NetKernel itself is unchanged, and again demonstrates linear scaling with CPU cores,
but the response time is not quite as ideal when concurrency exceeds the available cores,
and we see a slow drop-off as concurrency increases.
This indicates that the low-level stack (hardware, OS, JVM) is not as well matched and linear as that of the other
test system.
As with the discussion of JVM tuning, it is clear that on a non-linear stack the NetKernel throttle
overlay can be introduced to "pin the load line" and provide guaranteed linear response for your architecture.
In fact, use of the throttle can also be valuable even on a linear stack. It allows you to control admittance to your applications/services without
tying up threads. Your transient memory usage is therefore reduced (often dramatically), which in turn benefits performance:
when "memory churn" is granular the GC operates more efficiently. The bottom line is that you achieve the same, or better,
throughput with orders of magnitude fewer object references.
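The idea of admission control that does not tie up threads can be sketched in plain Java. This is not the NetKernel throttle and makes no claim about how it is implemented. The class AdmissionThrottle and its methods are hypothetical. Requests beyond the concurrency limit are parked in a queue as lightweight objects, and the submitting thread returns immediately rather than blocking.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executor;

// Hypothetical non-blocking admission throttle (illustration only, not NetKernel's throttle).
class AdmissionThrottle {
    private final int maxConcurrent;
    private final Executor workers;
    private final Queue<Runnable> waiting = new ArrayDeque<>();
    private int inFlight = 0;

    AdmissionThrottle(int maxConcurrent, Executor workers) {
        this.maxConcurrent = maxConcurrent;
        this.workers = workers;
    }

    // Never blocks the caller: the request is either dispatched or queued.
    synchronized void submit(Runnable request) {
        if (inFlight < maxConcurrent) {
            inFlight++;
            dispatch(request);
        } else {
            waiting.add(request);
        }
    }

    // Hands the request to a worker and releases the slot when it finishes.
    private void dispatch(Runnable request) {
        workers.execute(() -> {
            try {
                request.run();
            } finally {
                onComplete();
            }
        });
    }

    // Passes the freed slot to the next queued request, if there is one.
    private synchronized void onComplete() {
        Runnable next = waiting.poll();
        if (next != null) {
            dispatch(next);
        } else {
            inFlight--;
        }
    }

    synchronized int inFlight() { return inFlight; }

    synchronized int queued() { return waiting.size(); }
}
```

For the 8-core machines discussed here, a sketch like this would be constructed with a limit of 8 and an executor backed by 8 threads, for example `new AdmissionThrottle(8, Executors.newFixedThreadPool(8))`.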
CPU Utilisation
The CPU utilisation values are those reported by the JVM ThreadMXBean. This number is not necessarily reliable, since it is subject to observer ("Heisenberg") effects: taking the measurement perturbs the system being measured. In practice
we observe that the host operating system's CPU monitor shows the CPUs to be fully loaded during all test scenarios. However, it is interesting to note that the
"Scaling Concurrent Requests" result on the Xeon/Solaris platform reports close to 100% utilisation, whereas the AMD/Linux platform never
shows better than 80%.
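For readers unfamiliar with ThreadMXBean, the following minimal sketch shows one way a utilisation figure can be derived from it. This is an assumption about the general approach, not the nk-perf implementation, and the class CpuUtilisationSketch is hypothetical. It sums per-thread CPU time, samples it twice, and divides the difference by the elapsed wall-clock time multiplied by the number of cores.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical sketch of deriving CPU utilisation from ThreadMXBean (not the nk-perf code).
class CpuUtilisationSketch {

    // Total CPU time in nanoseconds consumed so far by all live threads,
    // or -1 if the JVM does not support or has disabled thread CPU timing.
    static long totalThreadCpuNanos() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isThreadCpuTimeSupported() || !bean.isThreadCpuTimeEnabled()) {
            return -1;
        }
        long total = 0;
        for (long id : bean.getAllThreadIds()) {
            long nanos = bean.getThreadCpuTime(id); // -1 if the thread has died
            if (nanos > 0) {
                total += nanos;
            }
        }
        return total;
    }

    // Fraction of total CPU capacity used over a sampling interval.
    static double utilisation(long cpuNanosDelta, long wallNanosDelta, int cores) {
        return cpuNanosDelta / ((double) wallNanosDelta * cores);
    }

    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        long cpu0 = totalThreadCpuNanos();
        long wall0 = System.nanoTime();
        Thread.sleep(1000); // sampling interval; the workload would run here
        long cpu1 = totalThreadCpuNanos();
        long wall1 = System.nanoTime();
        System.out.printf("utilisation=%.1f%%%n",
                100 * utilisation(cpu1 - cpu0, wall1 - wall0, cores));
    }
}
```

Threads that terminate between the two samples take their CPU time with them, and the sampling itself consumes CPU, which are two reasons a figure obtained this way is only approximate.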
Conclusion
These tests show that the NetKernel architecture provides a truly linear-scaling software solution for multi-core
processing platforms without the need for thread or concurrent programming knowledge.
Furthermore, these tests
constitute a worst-case scenario with no caching of resources. In real-world systems, NetKernel's intrinsic system-wide
caching dynamically discovers reusable computation results, which means real-world applications perform
even better than these test scenarios.