(Edit 11 Oct 2012: Please vote for the x86 cpblk deficiency on Microsoft Connect)
Following my last post about an interesting use of the "cpblk" IL instruction as an unmanaged memcpy replacement, I have to admit that I didn't take the time to carefully verify that performance was actually better. Well, I was probably too optimistic... so I ran some tests, and the results are quite surprising and not what I expected.
The memcpy protocol test in C#
When dealing with 3D calculations, large texture buffers, audio synthesis, or anything else that requires a memcpy and interaction with the unmanaged world, you will most likely end up calling an unmanaged function like this one:
// size_t is pointer-sized, so count is declared as UIntPtr rather than ulong (which is 8 bytes even on x86).
[DllImport("msvcrt.dll", EntryPoint = "memcpy", CallingConvention = CallingConvention.Cdecl, SetLastError = false), SuppressUnmanagedCodeSecurity]
public static unsafe extern void* CopyMemory(void* dest, void* src, UIntPtr count);
In this test, I'm going to compare this implementation with five challengers:
- The cpblk IL instruction
- A handmade memcpy function
- Array.Copy, although it is not really comparable because it doesn't have the same scope: Array.Copy works on managed arrays only, while memcpy is used to copy chunks of data between managed and unmanaged memory as well as between unmanaged and unmanaged memory.
- Marshal.Copy, with the same caveat as Array.Copy
- Buffer.BlockCopy, which works on managed arrays but takes its offsets and size in bytes.
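To make the difference in scope between the three managed challengers concrete, here is a minimal sketch. The class and method names (`ManagedCopySketch`, `Run`) are hypothetical and are not taken from the benchmark project; only the framework calls themselves are real.

```csharp
using System;
using System.Runtime.InteropServices;

static class ManagedCopySketch
{
    // Copies the same four ints with each managed API and returns the three results.
    public static int[][] Run()
    {
        int[] source = { 1, 2, 3, 4 };

        // Array.Copy: managed array to managed array, length counted in elements.
        int[] viaArrayCopy = new int[4];
        Array.Copy(source, viaArrayCopy, source.Length);

        // Buffer.BlockCopy: managed primitive arrays, offsets and length counted in bytes.
        int[] viaBlockCopy = new int[4];
        Buffer.BlockCopy(source, 0, viaBlockCopy, 0, source.Length * sizeof(int));

        // Marshal.Copy: managed array to or from an unmanaged pointer, length in elements.
        int[] viaMarshalCopy = new int[4];
        IntPtr unmanaged = Marshal.AllocHGlobal(source.Length * sizeof(int));
        try
        {
            Marshal.Copy(source, 0, unmanaged, source.Length);
            Marshal.Copy(unmanaged, viaMarshalCopy, 0, source.Length);
        }
        finally
        {
            Marshal.FreeHGlobal(unmanaged);
        }

        return new[] { viaArrayCopy, viaBlockCopy, viaMarshalCopy };
    }
}
```

None of these three can copy between two raw unmanaged pointers, which is why they are only partial substitutes for memcpy.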
The naive handmade memcpy is nothing more than the following code (not the best implementation ever, but at least safe for any buffer size):
static unsafe void CustomCopy(void* dest, void* src, int count)
{
    // Number of whole 8-byte blocks in the buffer.
    int block = count >> 3;

    // Copy 8 bytes at a time.
    long* pDest = (long*)dest;
    long* pSrc = (long*)src;
    for (int i = 0; i < block; i++)
    {
        *pDest = *pSrc; pDest++; pSrc++;
    }
    dest = pDest;
    src = pSrc;

    // Copy the remaining 0 to 7 bytes one at a time.
    count = count - (block << 3);
    if (count > 0)
    {
        byte* pDestB = (byte*)dest;
        byte* pSrcB = (byte*)src;
        for (int i = 0; i < count; i++)
        {
            *pDestB = *pSrcB; pDestB++; pSrcB++;
        }
    }
}
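For readers who want to reproduce throughput figures of this kind, a measurement loop could look roughly like the sketch below. It is an assumption about how such a number can be obtained, not the harness from the downloadable project: `ThroughputSketch` and `Measure` are hypothetical names, it uses managed byte arrays rather than raw pointers to stay free of unsafe code, and it takes 1 MB as 1024 * 1024 bytes.

```csharp
using System;
using System.Diagnostics;

static class ThroughputSketch
{
    // Hypothetical helper: runs `copy` `iterations` times on blocks of `blockSize` bytes
    // and returns the throughput in MB/s (1 MB assumed to be 1024 * 1024 bytes).
    public static double Measure(Action<byte[], byte[], int> copy, int blockSize, int iterations)
    {
        byte[] src = new byte[blockSize];
        byte[] dst = new byte[blockSize];

        // Warm-up call so that JIT compilation is not part of the timed region.
        copy(dst, src, blockSize);

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            copy(dst, src, blockSize);
        }
        watch.Stop();

        double megabytes = (double)blockSize * iterations / (1024.0 * 1024.0);
        return megabytes / watch.Elapsed.TotalSeconds;
    }
}
```

A call such as `ThroughputSketch.Measure((d, s, n) => Buffer.BlockCopy(s, 0, d, 0, n), 4096, 100000)` would then return one cell of such a table. Note that the delegate call adds a fixed cost per iteration, so this sketch will understate throughput for very small blocks.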
Results
For the x86 architecture, results are expressed as throughput in MB/s (higher is better); block size is in bytes:
BlockSize | x86-cpblk | x86-memcpy | x86-CustomCopy | x86-Array.Copy | x86-Marshal.Copy | x86-BlockCopy |
4 | 146 | 458 | 470 | 85 | 81 | 150 |
8 | 294 | 843 | 1122 | 168 | 167 | 298 |
16 | 587 | 1628 | 1904 | 306 | 327 | 577 |
32 | 950 | 1876 | 3184 | 631 | 558 | 1079 |
64 | 1451 | 3316 | 4295 | 1205 | 1059 | 1981 |
128 | 2245 | 5161 | 4848 | 2176 | 1933 | 3386 |
256 | 4353 | 7032 | 5333 | 3699 | 3386 | 5333 |
512 | 8205 | 13617 | 5517 | 5663 | 6666 | 7441 |
1024 | 13617 | 20000 | 6666 | 7710 | 12075 | 9275 |
2048 | 18823 | 24615 | 7191 | 9142 | 16842 | 9552 |
4096 | 2922 | 7529 | 5663 | 10491 | 7032 | 11034 |
8192 | 2990 | 7804 | 5714 | 11228 | 7441 | 11636 |
16384 | 2857 | 7901 | 5614 | 9142 | 7619 | 10322 |
32768 | 2379 | 6736 | 5333 | 8101 | 6666 | 8205 |
65536 | 2379 | 6808 | 5470 | 8205 | 6808 | 8205 |
131072 | 2509 | 17777 | 5818 | 8101 | 17777 | 8101 |
262144 | 2500 | 11636 | 5423 | 7032 | 11428 | 7111 |
524288 | 2539 | 11428 | 5423 | 7111 | 11428 | 7111 |
1048576 | 2539 | 11428 | 5470 | 7032 | 11428 | 7111 |
2097152 | 2529 | 11428 | 5333 | 7032 | 11034 | 6881 |
For the x64 architecture:
BlockSize | x64-cpblk | x64-memcpy | x64-CustomCopy | x64-Array.Copy | x64-Marshal.Copy | x64-BlockCopy |
4 | 583 | 346 | 599 | 99 | 111 | 219 |
8 | 1509 | 770 | 1876 | 212 | 224 | 469 |
16 | 2689 | 1451 | 3316 | 417 | 422 | 903 |
32 | 4705 | 2666 | 5000 | 802 | 864 | 1739 |
64 | 8205 | 4812 | 7272 | 1568 | 1748 | 3350 |
128 | 13333 | 8101 | 9014 | 3004 | 3184 | 6037 |
256 | 18823 | 11428 | 10000 | 5470 | 5245 | 8648 |
512 | 22068 | 16000 | 10491 | 9014 | 9552 | 13913 |
1024 | 22857 | 19393 | 7356 | 13333 | 13617 | 16842 |
2048 | 23703 | 21333 | 7710 | 17297 | 17777 | 20645 |
4096 | 23703 | 22068 | 7804 | 19393 | 20000 | 21333 |
8192 | 23703 | 22857 | 7619 | 22068 | 22068 | 22857 |
16384 | 23703 | 22857 | 7804 | 17297 | 21333 | 18285 |
32768 | 16410 | 16410 | 7710 | 12800 | 16000 | 12800 |
65536 | 13061 | 14883 | 7710 | 13061 | 14545 | 13061 |
131072 | 14222 | 13913 | 7710 | 12800 | 13617 | 12800 |
262144 | 5000 | 5039 | 7032 | 7901 | 5000 | 7804 |
524288 | 5079 | 5000 | 7356 | 8205 | 5079 | 7804 |
1048576 | 4885 | 4885 | 7272 | 7441 | 4671 | 7529 |
2097152 | 5039 | 5079 | 7272 | 7619 | 5000 | 7710 |
Graph comparison only for cpblk, memcpy and CustomCopy:
Don't worry about the performance drop that most of the implementations show on larger blocks. It's mostly due to cache misses and to copying across different 4 KB pages.
Conclusion
Don't trust your .NET VM: check your code on both x86 and x64. It's interesting to see how differently the same task is implemented inside the CLR (see Marshal.Copy vs Array.Copy vs Buffer.BlockCopy). The most surprising result here is the poor performance of the cpblk IL instruction in x86 mode compared to x64, where the best performer is... cpblk. So to summarize:
- On x86, you are better off using a memcpy function
- On x64, you are better off using cpblk, which performs as well as or better than memcpy at almost every size and is about twice as fast on small blocks.
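The two recommendations above can be combined by picking the copy routine once, at startup, based on the process bitness. The sketch below is one possible way to do that, not the code from the benchmark project: `PlatformCopy`, `CopyDelegate`, `BuildCpblk` and `MemcpyCopy` are hypothetical names, and emitting cpblk through a `DynamicMethod` is just one way to reach the instruction from C#.

```csharp
using System;
using System.Reflection.Emit;
using System.Runtime.InteropServices;

static class PlatformCopy
{
    public delegate void CopyDelegate(IntPtr dest, IntPtr src, uint count);

    // msvcrt memcpy (Windows only); size_t is mapped to the pointer-sized UIntPtr.
    [DllImport("msvcrt.dll", EntryPoint = "memcpy", CallingConvention = CallingConvention.Cdecl)]
    static extern IntPtr memcpy(IntPtr dest, IntPtr src, UIntPtr count);

    static void MemcpyCopy(IntPtr dest, IntPtr src, uint count)
    {
        memcpy(dest, src, (UIntPtr)count);
    }

    // Emits a dynamic method whose whole body is: ldarg.0, ldarg.1, ldarg.2, cpblk, ret.
    static CopyDelegate BuildCpblk()
    {
        var method = new DynamicMethod(
            "CpblkCopy",
            typeof(void),
            new[] { typeof(IntPtr), typeof(IntPtr), typeof(uint) },
            typeof(PlatformCopy).Module);

        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0); // destination address
        il.Emit(OpCodes.Ldarg_1); // source address
        il.Emit(OpCodes.Ldarg_2); // number of bytes to copy
        il.Emit(OpCodes.Cpblk);
        il.Emit(OpCodes.Ret);

        return (CopyDelegate)method.CreateDelegate(typeof(CopyDelegate));
    }

    // IntPtr.Size is 8 in a 64-bit process and 4 in a 32-bit one.
    public static readonly CopyDelegate Copy =
        IntPtr.Size == 8 ? BuildCpblk() : new CopyDelegate(MemcpyCopy);
}
```

Callers then use `PlatformCopy.Copy(dest, src, byteCount)` without caring which path was selected. Keep in mind that going through a delegate adds a call cost of its own, so the gain on very small blocks may be smaller than the tables suggest.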
One important consequence of this: when you are developing in C++/CLI and calling memcpy from a managed function, it ends up as a cpblk copy, which is almost the worst case on x86 platforms. So be careful if you are dealing with this kind of issue. To avoid this, you have to force the compiler to use the function from MSVCRTxx.dll.
Of course, memcpy is platform dependent, so it will not be an option for everyone...
Also, I didn't run this test on a CLR 2 runtime; we could be surprised there as well. One more thing I should try is a comparison against a pure C++ memcpy using the optimized SSE2 version that ships with later versions of msvcrt.
You can download the VS2010 project from here