Delphi Parallel Programming Library & Memory Managers
In September 2014 Delphi XE7 was launched. I was excited to try the new Parallel Programming Library (PPL). I thought I’d be able to quickly and easily write code which takes advantage of the processors in my Intel i7 machine. To my dismay the performance improvements were virtually non-existent if any memory allocation was carried out in parallel. I even brought up the subject on Google+ (see here: https://goo.gl/hWc6Z6). The bottleneck seemed to be FastMM4. When it was first bundled with Delphi 2006, FastMM4 was a breakthrough in single-core memory management. But FastMM4 had not been designed for multi-threaded performance. In my tests the PPL versions of my test routines were slower than the single-core versions. I concluded it was wiser to stick with the simpler single-core routines, and hope for a solution.
A week ago I stumbled across this thread on Google+, “Why is the FastMM4 development stalled?”. I’d never heard about NexusDB’s memory manager. Apparently it scales well in a multi-threaded environment. Eivind Bakkestuen from NexusDB kindly offered to provide a test copy. So I thought I’d give it a try. I went back to my main application and fired up the parallel map rendering routines of my Sales Territory Mapping application. To my amazement I got an instant speedup of 68% when using the NexusDB memory manager. I was ecstatic! Out of pure interest I then tried the parallel rendering using FastMM4. To my surprise FastMM4 performed as well as NexusDB. What has changed since 2014? FastMM4 hadn’t been updated since May 2013. Was it something new with the PPL included with Delphi 10 Seattle? Or was it a result of Windows 10? I also noticed the memory manager included with Delphi didn’t perform quite as well as explicitly including FastMM4 as the first unit in the project’s “dpr” file.
So I set out to create a test and investigate further.
I created a test project (you can download it here). It’s nothing special. The small app creates and destroys lists of small and simple objects. I quickly established there was no speed difference between XE7 and Delphi 10 Seattle applications. I then created six Delphi 10 Seattle versions of the application, a 32 bit and a 64 bit build for each of the native memory manager, FastMM4 and NexusDB:
- Speedtest-Native-32.exe
- Speedtest-FastMM4-32.exe
- Speedtest-NexusDB-32.exe
- Speedtest-Native-64.exe
- Speedtest-FastMM4-64.exe
- Speedtest-NexusDB-64.exe
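For readers who don’t want to download the project, the kind of workload described above can be sketched roughly as follows. This is a minimal sketch, not the actual test project: the class TSmallItem, the procedure BuildAndFreeList and the constants PassCount and ItemsPerPass are all hypothetical names and values chosen for illustration. It times the same allocation-heavy work once in a plain loop and once through TParallel.For from System.Threading.

```delphi
program SpeedTestSketch;

{$APPTYPE CONSOLE}

uses
  // A replacement memory manager unit (e.g. FastMM4) would have to be
  // listed first here; without one, the native memory manager is used.
  System.SysUtils,
  System.Diagnostics,
  System.Threading,
  System.Generics.Collections;

type
  // Hypothetical small object; the real test project's classes may differ.
  TSmallItem = class
  public
    Value: Integer;
  end;

const
  PassCount = 64;         // assumed number of independent work units
  ItemsPerPass = 100000;  // assumed number of objects per list

// Creates a list of small objects and frees it again, so almost all of
// the time is spent inside the memory manager.
procedure BuildAndFreeList;
var
  List: TObjectList<TSmallItem>;
  Item: TSmallItem;
  I: Integer;
begin
  List := TObjectList<TSmallItem>.Create(True); // the list owns its items
  try
    for I := 0 to ItemsPerPass - 1 do
    begin
      Item := TSmallItem.Create;
      Item.Value := I;
      List.Add(Item);
    end;
  finally
    List.Free; // frees the list and every item it owns
  end;
end;

var
  Watch: TStopwatch;
  Pass: Integer;
begin
  // Single-core mode: every pass runs on the main thread.
  Watch := TStopwatch.StartNew;
  for Pass := 0 to PassCount - 1 do
    BuildAndFreeList;
  Writeln('Single core: ', Watch.ElapsedMilliseconds, ' ms');

  // Multi-core mode: the PPL spreads the passes over its thread pool.
  Watch := TStopwatch.StartNew;
  TParallel.For(0, PassCount - 1,
    procedure(Index: Integer)
    begin
      BuildAndFreeList;
    end);
  Writeln('Multicore: ', Watch.ElapsedMilliseconds, ' ms');
end.
```

Because each pass only allocates and frees memory, any difference between the two timings is dominated by how well the memory manager copes with several threads at once.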
Each executable can run in single-core or multi-core mode. You can download the executables here (SpeedTest.zip). I then ran them on four different laptops:
- Dell XPS 15: Windows 10 2.6 GHz i7-6700HQ
- HP from 2009: Windows 7 2.2 GHz i7-2670QM
- HP from 2014: Windows 8.1 2.4 GHz i7-3630QM
- Surface Book: Windows 10 2.4 GHz i5-6300U
Here are the 32 bit results – each value is the time in milliseconds required to execute (smaller is better):
And here are the 64 bit results – each value is the time in milliseconds required to execute (smaller is better):
And here are the key points:
- NexusDB’s memory manager was impressive in every multi-threaded test (in some cases double the speed of FastMM4)
- FastMM4 did much better than I expected. There was a measurable speed improvement in all multi-threaded tests
- The native memory manager (which I thought was FastMM4) was measurably slower than FastMM4 in most multi-threaded tests (e.g. 64 bit multi-threaded). The difference was negligible in single-threaded tests.
- I was amazed at how well the 2009 laptop performed – Moore’s Law is clearly dead for laptops.
My conclusion is that NexusDB’s memory manager is the one to use if multi-threaded performance is an issue.
All of these tests were carried out on laptops. I’d love to see the speeds when run on an eight core machine.
My results, single pass, core i7-4770 CPU @ 3.40 GHz, (4 physical cores + 4 hyperthreading), Windows 7 Pro:
FastMM4 64 – Multicore: 10899
Native 64 – Multicore: 13604
Nexus DB 64 – Multicore: 10397
FastMM4 32 – Multicore: 6399
Native 32 – Multicore: 9138
Nexus DB 32 – Multicore: 5539
FastMM4 64 – single core: 35999
Native 64 – single core: 36099
Nexus DB 64 – single core: 37237
FastMM4 32 – single core: 21088
Native 32 – single core: 21298
Nexus DB 32 – single core: 21843
Have you tried comparing the latest NexusDB MM to ScaleMM2?
Core2Duo E8500 @ 3.16 GHz, (2 cores), Windows 7 x86, best from 2 tries:
FastMM4 32 – Multicore: 29769
Native 32 – Multicore: 32581
Nexus DB 32 – Multicore: 29196
FastMM4 32 – single core: 56105
Native 32 – single core: 55363
Nexus DB 32 – single core: 57110
You should try the native manager on Win32 with NeverSleepOnMMThreadContention. Makes a big difference for your test scenario.
+1 for setting System::NeverSleepOnMMThreadContention to True, makes a huge difference for parallel code.
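For anyone wanting to try the suggestion above, here is a minimal sketch of how the flag can be set. NeverSleepOnMMThreadContention is a global Boolean declared in the System unit; the program name and the placement at the very top of the main block are just assumptions for illustration. When the flag is True, a thread that finds the native memory manager busy keeps spinning in a busy-wait loop instead of giving up its time slice, which tends to help when there are not many more active threads than cores.

```delphi
program NativeMMSketch; // hypothetical project name

{$APPTYPE CONSOLE}

begin
  // Declared in the System unit, which every Delphi program uses
  // implicitly. Set it before any worker threads are started so that
  // all contended allocations spin instead of sleeping.
  NeverSleepOnMMThreadContention := True;

  // ... run the parallel workload here ...
end.
```

The flag only affects the native memory manager; a replacement manager such as FastMM4 has its own equivalent option in its options file.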
Could you please include the open-source https://github.com/alan008/sapmm and https://github.com/andremussche/scalemm (version 2) in your tests?
And also include the private bytes of memory used during the process.
Delphi 10 Seattle on Windows7-64 i7-2600 4C-HT 3,5GHz
Native
SC-32 21,5
MC-32 9,3
SC-64 39,6
MC-64 18,1
Scalemm2
SC-32 9,4
MC-32 2,6
SC-64 40,1
MC-64 10,7
This looks like the best memory manager: https://github.com/d-mozulyov/BrainMM