On Wed, 7 Jan 2026 10:51:16 -0000 (UTC)
antispam@fricas.org (Waldek Hebisch) wrote:
Michael S <already5chosen@yahoo.com> wrote:
I experimented a bit more (in fact, a lot more) with the test
batteries of L'Ecuyer. It led me to the conclusion that an
occasional failure in either the middle or the big battery means
nothing. Sometimes even a crypto-quality PRNG does not pass one
test or another. Then you try to reproduce it and see that with any
other seed that you try the failure does not happen.
All in all, it makes me more suspicious of PRNGs that consistently
pass both batteries with various seeds. I start to see it as a sign
of a PRNG being rigged to pass tests.
Well, that depends on the tests and the thresholds in the tests.
Some tests, when fed with a truly random source, will produce
very small variation of the results. With a generous threshold
such a test will essentially never fail for a truly random source.
OTOH when the expected variation of the results is larger and
the threshold is tight, then a truly random source will fail the
test from time to time.
Yes, there are many stable tests. But there are also many unstable
tests in the Crush/BigCrush batteries. I can't say if the ratio is
50:50 or 75:25, but it is not 90:10.
svaria_WeightDistrib() and swalk_RandomWalk1() appear the most
unstable. You feed them a particular pseudo-random vector
(100,000,000 points) and the test fails. Then you shift the vector
by just one point, to the left or to the right, and it passes;
often it does not just pass, but produces a result that is not even
close to the margin.
And if you test long enough you should
be able to estimate the probability of failure and possibly compare
it with the theoretical result, if available.
Long enough in this case is very damn long.
To tell the truth, I do not know what failures of a crypto-quality
PRNG mean. It may mean that the test is of the tight kind that is
supposed to fail from time to time for a truly random source. Or it
may mean that the PRNG improves the crypto part at the cost of
statistics. That is, a non-uniform distribution of the output is
not a problem in crypto applications. Simply, for crypto purposes
future output should not be predictable from current and past
output alone. And a slightly non-uniform distribution can increase
the probability of test failure enough that you can observe such
failures.
BTW: You mentioned using a counter and a hardware AES128 round.
Given a cheap AES128 round I would try 128-bit state and an AES128
round as the state update. I do not know if hardware AES128 is
fast enough to make it worthwhile, but using an AES128 round as a
state update should be much better than scrambling a counter.
It depends on what one considers fast.
My expectations, trained by LCGs and esp. by optimized MT, are
rather high. [On relatively modern Intel and AMD hardware]
generators that do feedback through even a single AES round (with
another round for post-processing) do not meet my expectations.
At best, fully inlined and after significant effort spent removing
all unnecessary bottlenecks, such generators produce one output
word per 4-5 clock cycles. More naive implementations, with state
read from memory and stored back to memory on each iteration, are
much slower than that, easily taking over 10 clocks per iteration.
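To make concrete what kind of generator I am timing, here is a
minimal sketch, assuming AES-NI intrinsics; the round keys are
arbitrary illustration constants, not a real key schedule:

#include <immintrin.h>   /* AES-NI intrinsics; compile with -maes */

typedef struct { __m128i s; } aesfb_t;

static inline __m128i aesfb_next(aesfb_t *g)
{
    const __m128i k1 = _mm_set_epi64x(0x243F6A8885A308D3ll, 0x0123456789ABCDEFll);
    const __m128i k2 = _mm_set_epi64x(0x13198A2E03707344ll, 0x452821E638D01377ll);
    g->s = _mm_aesenc_si128(g->s, k1); /* feedback: next step must wait for this */
    return _mm_aesenc_si128(g->s, k2); /* one more round as post-processing      */
}

The chained aesenc in the state update is the serial dependency
that limits throughput here.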
For comparison, a counter that goes through 5 rounds of AES without
feedback runs at 3 clocks per word when inlined and at 3.5 clocks per
word via function call.
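For reference, a sketch of that counter-based arrangement, again
assuming AES-NI; the five round keys are arbitrary constants set at
seeding time, not a real AES key schedule:

#include <immintrin.h>   /* AES-NI intrinsics; compile with -maes */
#include <stdint.h>

typedef struct { __m128i ctr; __m128i rk[5]; } ctr_aes5_t;  /* rk[] filled at seeding */

static inline __m128i ctr_aes5_next(ctr_aes5_t *g)
{
    __m128i x = g->ctr;
    g->ctr = _mm_add_epi64(g->ctr, _mm_set_epi64x(0, 1)); /* bump low 64 bits */
    for (int i = 0; i < 5; i++)
        x = _mm_aesenc_si128(x, g->rk[i]); /* 5 rounds, no feedback           */
    return x;                              /* 128 bits = four 32-bit words    */
}

Because there is no feedback, several consecutive counters can be
in flight in the AES units at the same time, which is where the
throughput advantage comes from.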
It seems that, given today's hardware limitations, the only way to
get a really fast PRNG with the state updated through an AES round
is by running several chains interleaved, preferably no less than 4
chains. Which increases the size of the basic PRNG state to 512
bits plus some more for state management.
Another problem is that when all state bits go through AES then I
cannot prove that the period of the PRNG is really high. I can't
even be sure that it does not enter a very short loop at some stage
of its life. I understand that it is highly unlikely, but I don't
like the uncertainty.
Running half of the state through AES and taking the other half
from a counter solves that, but at the cost of needing better
diffusion in the post-processing (tempering) stage of the
generator. If more than 32 bits of output are produced per
iteration, I would not feel good if tempering consists of just one
round of AES. I would want two rounds at least.
Maybe a better option is to do something like this:

interleave N times {
  prnd_out = AES1R(state.w128);
  state.w128 = AES1R(state.w128) ^ state.counter64;
  state.counter64 = state.counter64 + 1;
}
I am still not 100% sure that a long period is guaranteed with such
an arrangement, but I am far more sure of it than in the case of
the original arrangement.
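In C with AES-NI intrinsics the loop above could look roughly like
this (the two round keys are arbitrary constants, and NCHAINS
interleaved chains are there to hide the aesenc latency):

#include <immintrin.h>   /* AES-NI intrinsics; compile with -maes */
#include <stdint.h>

#define NCHAINS 4
typedef struct { __m128i w128; uint64_t counter64; } aesmix_chain;

static void aesmix_next(aesmix_chain st[NCHAINS], __m128i out[NCHAINS])
{
    const __m128i k_out = _mm_set_epi64x(0x243F6A8885A308D3ll, 0x13198A2E03707344ll);
    const __m128i k_upd = _mm_set_epi64x(0x0123456789ABCDEFll, 0x452821E638D01377ll);
    for (int i = 0; i < NCHAINS; i++) {
        out[i]     = _mm_aesenc_si128(st[i].w128, k_out);          /* prnd_out */
        st[i].w128 = _mm_xor_si128(_mm_aesenc_si128(st[i].w128, k_upd),
                                   _mm_set_epi64x(0, (long long)st[i].counter64));
        st[i].counter64 += 1;
    }
}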
However, it is far from simple and the state is not very small. So
the question is "Why bother?" What's wrong with the much simpler
scheme that takes a 64-bit counter and runs it through 5 AES
rounds? Or even 6 rounds, if being paranoid?
The only downside that I can see is that this simple, robust scheme
is more dependent on high throughput of the AES units, which is not
necessarily available on somewhat older hardware.
Said above does not apply to the scomp_LinearComp() failures of
mt19937. Those failures are very consistent. I just don't consider
them significant for my own use of PRNGs or for any other uses of
PRNGs that I have personally ever encountered.
Overall, the experience strengthened my position that the general
wisdom, previously shared by O'Neill herself, got it right: in the
absence of special considerations people should select mt19937, and
especially mt19937-64, as their default PRNG of choice.
Looking closer, apart from its properties of randomness and apart
from its huge period
Huge period alone is easy. AFAICS matrix variants of LCG can
easily get quite large periods. I did not test matrix LCGs,
but on statistical tests they should be essentially as good
as multiprecision LCGs, while being cheaper to implement.
Just to be clear, I mean the equation x_{n+1} = A*x_n + b, where
x_n is a vector representing the n-th state, A is a matrix and b is
a vector. Matrix A may be somewhat sparse, that is, have a limited
number of non-zero entries, and some entries may be simple, like
1 or -1. With a proper choice of A and b the period should be
comparable with the number of available states.
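For illustration, a minimal sketch of such a matrix LCG; the rows
of A and the constants b below are arbitrary, not chosen for period
or quality:

#include <stdint.h>

/* x_{n+1} = A*x_n + b (mod 2^64); A is sparse, entries are 1 or -1 */
#define VN 8
typedef struct { uint64_t x[VN]; } mlcg_t;

void mlcg_step(mlcg_t *s)
{
    static const uint64_t b[VN] = {
        0x243F6A8885A308D3u, 0x13198A2E03707344u, 0xA4093822299F31D0u,
        0x082EFA98EC4E6C89u, 0x452821E638D01377u, 0xBE5466CF34E90C6Cu,
        0xC0AC29B7C97C50DDu, 0x3F84D5B5B5470917u };
    uint64_t y[VN];
    for (int i = 0; i < VN; i++)   /* row i of A: +x[i] +x[i+1] -x[i+3] */
        y[i] = s->x[i] + s->x[(i + 1) % VN] - s->x[(i + 3) % VN] + b[i];
    for (int i = 0; i < VN; i++)
        s->x[i] = y[i];
}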
I see no reason to go for very long periods; already 512 bits
of state allow periods which in practice should be indistinguishable
from an infinite period.
Didn't I agree with that in the very next statement?
(which does not matter for me) I started to appreciate
mt19937 for the following properties that I was not aware of before:
- it does not use a multiplier. So it is a good fit for embedded
systems that have no (or a very slow) multiplier HW.
The biggest MCU with no hardware multiplier that I have has 2kB
RAM. I do not want mt19937 on such a machine. 64-bit multiplication
on an 8-bitter with a hardware multiplier may be slow, so a 64-bit
LCG (and improvements based on it) may be slow. But multiplication
nicely spreads out bits, so it is not clear to me if there is an
equally good cheaper alternative.
64-bit multipliers spread bits nicely, indeed. But implementing a
64-bit multiplier on a core that natively has only 8x8=16 is not
easy or fast. Even implementing a 32-bit multiplier there is not
easy or fast. And I would not say that 32-bit multipliers spread
bits nicely.
But I was not thinking about that sort of core.
I had in mind minimalistic implementations of the soft-core
architectures by Xilinx and Altera.
These cores have several useful properties:
- tested and supported much better than open-source alternatives
- occupy relatively few logic and block-RAM resources
- often capable of running at a higher frequency than their
full-featured brethren
- the fact that they can be used without buying a license (free as
in beer) does not hurt either
Of course, they are *much* slower (lower IPC) than full-featured
cores, but in plenty of cases that's o.k. One thing that these
cores do not have is a multiplier. When the compiler sees a
multiplication in the source code it calls a support routine. And
when that happens, instead of being *much* slower (say, 5x) than a
full-featured core, which is commonly acceptable, these cores turn
out a LOT slower (say, 200x), which is commonly unacceptable.
Cores like that are sometimes used in applications that want a good
PRNG, like various sorts of built-in tests. In such apps spending
2K bytes for the mt19937 state is o.k. Using emulated 64-bit
multiplication for the PRNG is sometimes o.k., but more commonly
not so.
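For reference, the whole mt19937 recurrence and tempering (standard
constants from the reference implementation) need nothing beyond
shifts, ANDs and XORs:

#include <stdint.h>

#define MT_N 624
#define MT_M 397

/* one full pass of the mt19937 state update ("twist") */
void mt_twist(uint32_t mt[MT_N])
{
    for (int i = 0; i < MT_N; i++) {
        uint32_t y = (mt[i] & 0x80000000u) | (mt[(i + 1) % MT_N] & 0x7FFFFFFFu);
        mt[i] = mt[(i + MT_M) % MT_N] ^ (y >> 1) ^ ((y & 1u) ? 0x9908B0DFu : 0u);
    }
}

/* output tempering, also multiplier-free */
uint32_t mt_temper(uint32_t y)
{
    y ^= y >> 11;
    y ^= (y << 7)  & 0x9D2C5680u;
    y ^= (y << 15) & 0xEFC60000u;
    y ^= y >> 18;
    return y;
}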
If needed I would investigate
matrix LCGs; they may be slightly cheaper.
- the algorithm is very SIMD-friendly, so optimized implementations
can be very, very fast on modern x86 and ARM64 hardware.
Just the size of the state puts a limit on how fast it can be. And
the size of the state means that it will compete for cache with
user data.
BTW: AFAICS a matrix LCG can be SIMD-friendly too.
The latter property also means that a very fast FPGA implementation
is easily possible as long as the designer is willing to throw at it
a moderate amount of logic resources.
Said above does not mean that the PCG generators of O'Neill have no
place. Intuitively, they appear not bad. But the theory is
unproven, an optimized implementation is likely slower than
optimized mt19937, and the claimed "security" advantages are
nonsense, as admitted later by O'Neill herself. And, as said above,
I no longer trust her empirical methodology, based on the work of
L'Ecuyer.
So, PCG generators are a valuable addition to the toolbox, but not
good enough to change my default.
I agree that ATM it is not entirely clear if PCGs are as good as
suggested by tests. But I am surprised by your opinion about
speed. I did not analyze deeply either of them, but PCGs are way
simpler, so I would expect them to be faster.
When used in a simple way, PCG has the LCG latency (IMUL+ADD)
embedded in its core.
On modern x86-64 that tends to be 4 clocks at the very least. On
ARM64 it's often better, but I am more interested in x86-64.
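To see where the latency sits, here is the familiar minimal pcg32
step (PCG-XSH-RR 64/32); only the multiply-add on the state is
serially dependent from call to call:

#include <stdint.h>

typedef struct { uint64_t state, inc; } pcg32_t;

static uint32_t pcg32_next(pcg32_t *rng)
{
    uint64_t old = rng->state;
    rng->state = old * 6364136223846793005ULL + rng->inc;  /* IMUL + ADD chain */
    uint32_t xorshifted = (uint32_t)(((old >> 18u) ^ old) >> 27u);
    uint32_t rot = (uint32_t)(old >> 59u);
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31u));
}

The output permutation works on the previous state, so it can
overlap with the multiply, but successive states still have to wait
for each other.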
OTOH, optimized implementations of mt19937 have [almost] no
inherent latency limits. If one has wider hardware, it directly
translates into both a faster state update and a faster tempering
phase.
More strictly speaking, for the state update there exists a strict
algorithmic limit of 397 32-bit words. Plus, non-heroic
implementations have a limit of 624-397=227 words. Both limits are
far away from the capabilities of the widest commonly available
hardware (2x or 3x 512 bits, i.e. no more than 48 32-bit operations
per cycle).
In practice, the biggest bottleneck, even without 512-bit
processing, is not the state update and not tempering, but fighting
the impedance mismatch at the level of the API. That is, the
implementation prefers to generate 256 bits at once, but the API
insists on 32 bits. Which leads to the need for buffering and for
additional management overhead. More time is spent here than within
the MT algorithm itself.
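The buffering layer itself is trivial; the cost is in the extra
bookkeeping and the refill boundary. A sketch, where mt19937_block()
is a hypothetical name for the bulk (SIMD) routine that refills the
whole buffer:

#include <stdint.h>

#define BUF_WORDS 624
typedef struct { uint32_t buf[BUF_WORDS]; unsigned idx; } mt_buffered;

void mt19937_block(uint32_t out[BUF_WORDS]);  /* hypothetical bulk generator */

static inline uint32_t next32(mt_buffered *r)
{
    if (r->idx >= BUF_WORDS) {     /* buffer exhausted: refill in bulk     */
        mt19937_block(r->buf);     /* amortizes SIMD cost over 624 outputs */
        r->idx = 0;
    }
    return r->buf[r->idx++];       /* the cheap inlined fast path          */
}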
Still, even with the overhead, when the buffer-management part is
inlined (which is easy to achieve in practice and does not cause
significant code bloat), a modern but not the most modern Intel
core (Raptor Cove) that I have in my desktop at work is capable of
delivering one 32-bit word per exactly 3 clock cycles.
That's a little better than even the most basic fully inlined LCG,
so necessarily better than any PCG.
There is little doubt that LCGs (and, as a result, PCGs) can be
improved by advancing the state by several steps at a time instead
of one step at a time. But then they will suffer from the same
impedance mismatch with APIs. The solution to the mismatch would be
the same as with mt - buffering and additional management overhead.
So gone is the simplicity, gone is the tiny state size :(
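The k-step parameters are at least cheap to derive:
x_{n+k} = a_k*x_n + c_k (mod 2^64) with a_k = a^k and
c_k = c*(a^(k-1)+...+a+1), so k lanes can each take "big" steps and
together cover the original sequence. A sketch (naive derivation,
fine for small k):

#include <stdint.h>

/* Compute the k-step multiplier and increment of the LCG
   x_{n+1} = a*x_n + c (mod 2^64). */
void lcg_kstep(uint64_t a, uint64_t c, unsigned k,
               uint64_t *a_k, uint64_t *c_k)
{
    uint64_t ak = 1, ck = 0;
    while (k--) {            /* afterwards: ak = a^k, ck = c*(a^(k-1)+...+1) */
        ck = ck * a + c;
        ak = ak * a;
    }
    *a_k = ak;
    *c_k = ck;
}

Seed lane j with j ordinary steps from a common start, then advance
every lane with (a_k, c_k); lane j produces elements j, j+k, j+2k,
... of the original sequence.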
I did not try it, but it is very likely that for a given width of
output word the resulting speed of PCG would be almost exactly the
same as optimized mt19937, because the generator would spend the
majority of its time in code that is not merely almost the same
between the two, but exactly the same.
--- PyGate Linux v1.5.2
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)