I will focus mainly on 3 implementations, so i will not describe an exhaustive list of technique around. I would be glad if you are aware about a smaller implementation! I found quite some interesting results, like a poor performance of the c rand() function, both in terms of speed and uniformity.
Let's recall one of the simplest C random function you can find around :
float Rand() {
return ((float)rand()/RAND_MAX) * 2.0f - 1.0f;
}
The rand() is generating an integer in the range [0, 32767]. RAND_MAX is equal to the max value 32767. The result is then scaled to the range [-1,1]. Let us first implement this in x86 ASM, just as an exercise.
Version 1: Using the c rand() function
A straight implementation of the previous C in x86 ASM could be like this (thanks to AsmHighlighter for the syntax highlighting and the instruction size calculation ;) ) :
_Rand proc
call dword ptr [_imp__rand] ;#6 Call c Rand
push eax ;#1 st(0) st(1)
fild dword ptr [esp] ;#3 rand
fidiv word ptr [RandMax] ;#6 (-1,0]
fchs ;#2 [0,1)
fadd st(0),st(0) ;#2 [0,2)
fld1 ;#2 1 [0,2)
fsubp ;#2 [-1,+1]
pop eax ;#1 clean stack
ret ;#1
_Rand endp ;#26 total bytes
call dword ptr [_imp__rand] ;#6 Call c Rand
push eax ;#1 st(0) st(1)
fild dword ptr [esp] ;#3 rand
fidiv word ptr [RandMax] ;#6 (-1,0]
fchs ;#2 [0,1)
fadd st(0),st(0) ;#2 [0,2)
fld1 ;#2 1 [0,2)
fsubp ;#2 [-1,+1]
pop eax ;#1 clean stack
ret ;#1
_Rand endp ;#26 total bytes
There is a slight difference with the previous C version. I want to generate a number in the range [-1,1), so the RandMax is set to a short (or word) 32768, meaning that the integer value is -2^15. Because the c rand function generate a number in the range 0-32767, i have to scale it to the correct [-1,1) range. Dividing by -2^15 generate a number in the range (-1,0]. I just have to negate this result, multiply by 2 and minus 1 to get the correct range [-1,1) (although this in not critical to have this range. The range (-1,1] is fine as well).
The number of bytes for this function is
26 bytes
. Although you have to add the cost of referencing the msvcrt runtime (this is not so much, but still...). The crinkler compressed size is around 20.5 bytes
. You will find attached to this article the RandAsm Visual C++ Project using MASM and crinkler.The protocol to test the compressed size is to leave only one type of random function at a time when generating the code. The other implementations are disabled. This will give you a more accurate estimation of the compressed size (but this is not a true final compressed size in an intro exe, as context modeling from crinkler heavily depends on... context, previous code compressed... and so on...). But with this approach, you have at least something close to reality... I have also measure the execution time minus the code to fill the buckets (i have run an empty random function to get the reference time).
I have implemented a basic checker of the uniformity of the random number generator. My friend ulrick advise me to use a simple Kolmogorov-Smirnov (KS) test to check the efficiency of the generator, but i found this too difficult (i didn't have time to implement such a test, with storing, sorting all random numbers...). He then suggest me to use a simpler test called Chi-Square Test (check also this nice article from Dr Dobb's about Testing Random Number Generators by Jerry Dwyer and K.B. Williams, June 1996 ). It's basically working on frequencies (and not on individual random number as for the KS test) using a limited amount of buckets to count the frequency of the random number. For this test, i'm using 100 buckets on the range [-1,1[ (bucket 0 is in the range [-1, -0.98] while the bucket number 99 is in the range [0.98, 1[), and generating 1,000,000,000 random float numbers with it. It means that by bucket, i should expect a perfect : 1,000,000,000 / 100 = 10,000,000 values. The Chi-Square Test computes a factor k that is checked against a Chi-Square distribution table.
Be aware that this test (as well as the KS test) is mainly relevant for testing the uniformity of the random numbers, but they don't ensure that the numbers are truly random. We should implement another algo to test this "random-ability" but i don't have time to do this! ;)
The results for the C-Rand ASM version are :
Rand Type | Time(ms) | Precision(bits) | Code Size(bytes) | Compressed Code Size(bytes) | k-ChiSq |
---|---|---|---|---|---|
Rand C ASM | 29890 ms | 15 | 26 | 20.5 | 2066.67 |
You may notice that the k-ChiSquare factor is very high. This factor should be below 134.642! (column 0.01, line 99 in the ChiSquare table). A first look at these results tell us that they are quite poor. We are going to see that other x86 ASM implementation are performing much better than this one.
For the curious, what's behind the c rand() function? Under Visual C++, you have the following code:
int __cdecl rand (void)
{
return( ((next = next * 214013L + 2531011L) >> 16) & 0x7fff );
}
We can see that the 0x7FFF truncation is hurting a lot the precision of the rand function. Only 15 bits are then used when converting to the float.Version 2: Using iq's method
iq wrote a couple of very great article about coding, mainly for graphics. But among them, there is an interesting article about the problem of random float numbers : float, small and random.
Read this article carefully, as it is going a little deeper in the internals of IEEE-754 floating point representation (used by the FPU).
Finally, iq suggest a new efficient algorithm:
float sfrand( int *seed )
{
float res;
seed[0] *= 16807;
*((unsigned int *) &res) = ( ((unsigned int)seed[0])>>9 ) | 0x40000000;
return( res-3.0f );
}
A straight implementation of his algo in asm could be like this one:
_Rand proc
imul eax, dword ptr [RandSeed], 16807 ;#A eax = RandSeed * 16807
mov dword ptr [RandSeed], eax ;#5 RandSeed = eax
mov al, 080h ;#2 eax: rrrrrr80h where r are bits from random seeds
ror eax, 8 ;#3 eax: 80rrrrrrh. value = s * 2^(e-127). e is set to 128, thus 040h value ( sEee eeee efff ffff ffff ffff ffff ffff)
shr eax, 1 ;#2 eax: 40rrrrrrh
push eax ;#1 push eax on stack to load it as a float. rand [2,4)
fld dword ptr [esp] ;#3 load rand st0 = rand [2,4)
push 3 ;#2 load 3
fisub dword ptr [esp] ;#3 st(0) = rand [2,4) - 3 = rand[-1, 1)
pop eax ;#1 free stack
pop eax ;#1 free stack
ret ;#1 return to caller
_Rand endp ;#34 total bytes
This is not probably the smallest x86 ASM implementation of iq's code, but from 1 to 2 bytes, this should be enough close... (hey, i don't claim to be a x86 expert anyway! ;) )
This implementation gives the following results :
Rand Type | Time(ms) | Precision(bits) | Code Size(bytes) | Compressed Code Size(bytes) | k-ChiSq |
---|---|---|---|---|---|
iq's ASM | 13042 ms | 23 | 34 | 30 | 7.36 |
They are significantly better than the previous one. Both in terms of execution time and uniform distribution. The k-ChiSqr factor is 7.36 much below the 134.642 limit. Although, the result is a function a bit larger, around 30 bytes compressed. Not a lot, but still, we need to have a smaller one.
Version 3: Using simple int min divider (Park Miller-Carta's method)
When there is a demo/intro that is using a cool hack or whatever, i'm always curious to see how they did it. Well, i should post someday an article about the fact that "demomaking... is a bit of hacking". So when i went around the code of the 4klang great synth from gopher, i found an interesting tiny random function. [Edit]After some research, i found the original authors are Payne, Rabung & Bogyo (in 1969), then Park Miller (in 1988) and Carta (in 1990), this is a nice implementation and in fact, very simple.[/Edit]
We are basically using the same random generator that iq's suggest at the beginning (other candidates are also good. 4klang is using 16007. I'm wondering if we should use a prime number?) :
seed *= 16807;
seed
is an integer. So the range value is going from [-2^31, 2^31-1]. Following the simple C rand function, we just have to load this seed directly on the FPU as an integer, keeping all the 31 bits precision. Then, we just have to divide by the largest - absolute - value : -2^31 aka 080000000h in hex. In C, this is simple as :seed *= 16807;
return ((float)seed) / (float)0x80000000;
or in x86 ASM:
_Rand proc
imul eax, dword ptr [RandSeed], 16807 ;#A eax = RandSeed * 16807
mov dword ptr [RandSeed], eax ;#5 RandSeed = eax
fild dword ptr [RandSeed] ;#6 load RandSeed as an integer
fidiv dword ptr [RandDiv] ;#6 div by max int value (absolute) = eax / (-2^31)
ret ;#1 return to caller
_Rand endp ;#28 total bytesWe can see that this implementation return a result in the range (-1,1]. This implementation gives the following results :
Rand Type | Time(ms) | Precision(bits) | Code Size(bytes) | Compressed Code Size(bytes) | k-ChiSq |
---|---|---|---|---|---|
Int-Min Divider | 6006 ms | 31 | 28 | 19 | 7.31 |
Results are of course significantly better than the C version, but they are also better in terms of execution time and size compare to iq's method. Not surprisingly, k-ChiSqr factor is 7.31 much below the 134.642 limit and comparable to iq.
With 31 bits for the precision, This method is even better, as the seed is fully loaded on the fpu stack with it's 31 bits precision : because the FPU has a total of 80 bits to store a floating point number (with extended precision), it means that we have 64 bits available for the mantissa, large enough to store the 31 bits . The compression is working well on this, mainly due to the fact that we are using a simple memory addressing that is compressing well with crinkler. Yep, sometimes, it's better to have something like :
mov eax, [address1]; imul eax, dword ptr [address2]
than having to preload a base adress like lea ebx, [address1]; mov eax, [ebx]; imul eax, [ebx+4];
. So with only 19 bytes compressed, this version is really cool and the execution time is impressive too!Faster and better floating point generator
Let's recap the results :
Rand Type | Time(ms) | Precision(bits) | Code Size(bytes) | Compressed Code Size(bytes) | k-ChiSq |
---|---|---|---|---|---|
Rand C ASM | 29890 ms | 15 | 26 | 20.5 | 2066.67 |
iq's ASM | 13042 ms | 23 | 34 | 30 | 7.36 |
Int-Min Divider | 6006 ms | 31 | 28 | 19 | 7.31 |
We can see that the "Int-Min Divider" technique is by far the most efficient, both in terms of execution time and size. It's also a very good generator, as the distribution is quite uniform.
This should be very useful when you need a fast random floating point generator!
A question of floor?
Just a few words about truncating a float to an int. I needed such think in order to increments the frequecies for the bucket. I'm usually using a poor implementation using the fpu instructions couple
fld/fistp
but the results depends on the FPU rounding mode (and the default rounding mode is "nearest" and not "toward zero" as expected). You don't see this when you are using the msvcrt library, as they have implemented a true floor() conversion in the back. But when using the /QIfist option in VC++, you can't use anymore this function from the msvcrt.However, I found a straight function that may do the trick, using following SSE function :
inline long floorSSE(double x) {
__asm cvttsd2si eax,x
}
The cvttsd2si SSE instruction is converting a double to an int (you can find about this an interesting discussion here). This is actually working well if the double is coming originally from a float (moving from the FPU to SSE requires writing to memory). But this instruction seems to work well on float. Although, I should check more extensively the reliability of this function...Update 29/10/2009 : In fact, there is the FPU instruction
fisttp
in SSE3 that do also and is less restrictive than the cvttsd2si
instruction (notice although they are more considered as "truncate" than "floor" as i mentioned).________________________________________
Here is attached the VC++ 2008 RandAsm project i used for this article.