• Re: Article of Melissa O'Neill

    From Michael S@3:633/10 to All on Sun Dec 28 12:35:33 2025
    On Sun, 28 Dec 2025 05:38:55 -0000 (UTC)
    antispam@fricas.org (Waldek Hebisch) wrote:

    Michael S <already5chosen@yahoo.com> wrote:
    On Wed, 24 Dec 2025 12:12:11 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Wed, 24 Dec 2025 09:00:50 -0000 (UTC)
    antispam@fricas.org (Waldek Hebisch) wrote:

    Michael S <already5chosen@yahoo.com> wrote:
    On Tue, 23 Dec 2025 17:54:05 -0000 (UTC)
    antispam@fricas.org (Waldek Hebisch) wrote:

    Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 22 Dec 2025 18:41:10 +0100
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:


    Also, the TestU01 suite is made for generators with 32-bit
    output. M. O'Neill used an ad hoc technique to make it applicable
    to generators with 64-bit output. Is this technique right? Or
    maybe it puts 64-bit PRNGs at an unfair disadvantage?

    My point of view is that a generator can be used to generate a
    long bitstream. Then you can cut the bitstream into pieces of the
    desired size. Good tests should check that such usage leads
    to reasonable properties. So the fact that one generator produces
    32-bit pieces and another produces 64-bit pieces should be
    irrelevant to the test.
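
    For illustration, a minimal sketch of this "cut the bitstream"
    view in C; next64() is a hypothetical stand-in for any 64-bit
    generator, not an API from TestU01:

        #include <stdint.h>

        uint64_t next64(void);          /* hypothetical 64-bit PRNG */

        uint32_t next32(void)
        {
            static uint64_t buf;
            static int have;            /* is the high half still buffered? */
            if (!have) {
                buf = next64();
                have = 1;
                return (uint32_t)buf;       /* low 32 bits first */
            }
            have = 0;
            return (uint32_t)(buf >> 32);   /* then the high 32 bits */
        }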


    What you say is correct in a few use cases. But there are many
    use cases (in the field of testing of numeric code, probably most
    of them) in which "less random" LS bits are acceptable.
    Not that I can see why it would be the case for MT19937-64, but
    it could apply to one or two of the other 64-bit generators
    tested by O'Neill.


    Besides, I strongly disagree with at least one assertion made
    by O'Neill: "While security-related applications should
    use a secure generator, because we cannot always know the
    future contexts in which our code will be used, it seems wise
    for all applications to avoid generators that make discovering
    their entire internal state completely trivial."
    No, I know exactly what I am doing. I know that for my
    application easy discovery of the complete state of the PRNG is
    not a defect.

    O'Neill is not a prophet; ignore what she says if you think you
    know better (which is probably the case above).

    Anyway, even if I am skeptical about her criticism of popular
    PRNGs, intuitively I agree with the constructive part of the
    article - a medium-quality PRNG that feeds a medium-quality hash
    function can potentially produce a very good, fast PRNG with a
    rather small internal state.

    She seems to care very much about having the minimal possible
    state. That may be nice on embedded systems, but in general I
    would happily accept a slightly bigger state (say 256 bits). But
    if we can get good properties with a very small state, then why
    not? After all, reading the state and updating it takes code, so
    a small state helps with having a fast generator.


    Agreed.

    Concerning the Mersenne Twister, she is not the only one to
    criticise it. My personal opinion is that given its large
    state and not-so-simple update, the Mersenne Twister would
    have to be very, very good to justify its use.

    One theoretical advantage of MT19937 is that it has a period of
    astronomical proportions, which means that one instance of the
    PRNG could be de-multiplexed into millions or billions of
    sub-streams with no detectable degradation of the quality of each
    sub-stream. However, I fail to see how de-multiplexing into more
    than ~one thousand sub-streams can be practical. And for the
    latter the period does not need to be astronomical; something
    like a period of 2**96 would be fully sufficient, with many bits
    to spare.
    So, in theory I agree with the criticism. But in practice I am
    not bothered by the size of the MT state.

    But it
    fails some tests, so it does not look _better_ than other
    generators.


    It would be interesting to find out which tests failed. I wonder
    if the test suite can run faster on a multicore computer. I don't
    want to wait 5-6 hours just to find out that the report does not
    provide the information that I am looking for.


    I reproduced the results of M. O'Neill. Luckily, on semi-modern
    hardware (Coffee Lake or EPYC3) and for the PRNGs in question
    BigCrush finishes in 2-2.5 hours. Which is a pain, but
    considerably less so than 5 hours. mt19937 fails all tests of the
    scomp_LinearComp() variety (2 in Crush and 2 in BigCrush). It
    passes all the rest of the tests.
    After re-reading O'Neill, I see that she wrote that, but the
    first time I didn't pay attention.

    I have read the description of scomp_LinearComp(). I can't say
    that I understood much, neither the theory nor the parameters; in
    particular, even after looking through the code I have no idea
    about the meaning of the parameter r.
    I am not so sure that Pierre L'Ecuyer himself fully understands
    this test, apart from the fact that many moons ago a basic
    algorithm for the calculation of linear complexity was published
    in IEEE Transactions on Information Theory. Otherwise, his
    description would not look so much like hand-waving. As to
    O'Neill, she likely understands it better than I do, but that's
    not a huge achievement :(
    My less-than-scientific feeling about this test is that one part
    of it checks whether the generator is an LFSR or of the LFSR
    family, and if the answer is yes then the test fails.
    So, effectively, mt19937 is punished for what it is rather than
    for the randomness of the results that it produces.

    Well, I peeked at the code but did not read the description of
    the test. In the code I see mention of Berlekamp-Massey. That is
    a well-known algorithm for finding regularities in data;
    basically it tries to find a linear recurrence satisfied by the
    data. One of the things that I do is look for recurrences,
    including linear ones. Up to now I have not run any serious
    simulation for such things, but I may wish to do so (and IIUC
    other people did run some simulations). When doing simulations I
    really do not want to see artifacts due to the PRNG producing a
    sequence with different features than a random sequence. So for
    me, if mt19937 produces a sequence with visible extra linear
    regularities, that is a significant failure.
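
    For reference, a minimal sketch of Berlekamp-Massey over GF(2),
    the algorithm mentioned above; it returns the linear complexity
    of a bit sequence, i.e. the length of the shortest LFSR that
    reproduces it (MAXN and the function name are mine, for
    illustration only):

        #include <string.h>

        #define MAXN 1024   /* max sequence length for this sketch */

        /* s[i] must be 0 or 1; returns linear complexity of s[0..n-1] */
        int linear_complexity(const unsigned char *s, int n)
        {
            static unsigned char C[MAXN], B[MAXN], T[MAXN];
            int L = 0, m = 1;
            memset(C, 0, sizeof C); C[0] = 1;  /* current connection poly */
            memset(B, 0, sizeof B); B[0] = 1;  /* poly at last length change */
            for (int i = 0; i < n; i++) {
                unsigned char d = s[i];        /* discrepancy bit */
                for (int j = 1; j <= L; j++)
                    d ^= C[j] & s[i - j];
                if (!d) {                      /* recurrence still holds */
                    m++;
                } else if (2 * L <= i) {       /* must grow the LFSR */
                    memcpy(T, C, sizeof C);
                    for (int j = 0; j + m < MAXN; j++)
                        C[j + m] ^= B[j];
                    L = i + 1 - L;
                    memcpy(B, T, sizeof B);
                    m = 1;
                } else {                       /* fix without growing */
                    for (int j = 0; j + m < MAXN; j++)
                        C[j + m] ^= B[j];
                    m++;
                }
            }
            return L;
        }

    For output of an n-bit LFSR the returned complexity stays capped
    near n, which is how a test built on this algorithm can sniff out
    LFSR-family generators.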


    I can't say much about the Berlekamp-Massey algorithm in general.
    But this particular test (scomp_LinearComp) is looking for
    correlations between individual bits and is doing it at unusual
    intervals. In fact, it's even worse. It is looking not for
    correlations themselves but for regularity in changes of a
    measure of correlation in a group of bits that is called linear
    complexity. And if it finds the changes regular then it declares
    that the PRNG is bad.
    I don't know, maybe the test was not intentionally designed to
    punish long LFSRs that use relatively few bits in the feedback
    step, but it behaves like one that was.
    I personally don't care even if my PRNG has long-distance
    correlations between the bits themselves, as long as the distance
    is long and not divisible by small integers. Much less so when
    it's not the correlation itself, but a derivative of some
    abstract function of the correlation.

    What I care about are the numeric values of groups of bits taken
    at regular intervals, like 64 bits (most typical), 32 bits, or
    rarely 8/16 bits. And for that MT appears to perform very well.
    For that I certainly trust it more than the non-hashed output of
    an LCG, esp. of an LCG that has a power of 2 for the modulus. And
    if said LCG happens to pass all of L'Ecuyer's batteries, that
    rather makes me more suspicious of the quality of the batteries
    than less suspicious of the quality of this sort of LCG.

    Now, if somebody cares about the non-correlation of individual
    bits at strange intervals then, maybe, an LCG (mod 2**96) with
    output consisting of the 32 MS bits of the state is what the
    doctor ordered. However, I suspect that my usage of PRNGs is not
    just different from his; it also happens to be far more common
    than his usage.

    I made a minimal modification to the mt19937 algorithm, telling
    it to skip every 19936th result word. With the modification it
    easily passes all 3 of Pierre L'Ecuyer's Crush batteries.
    Do I think that my modified mt19937 is better than the original?
    No, I don't. IMHO, the only thing it is better at is passing
    L'Ecuyer's batteries.
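
    The modification amounts to a thin wrapper; a minimal sketch,
    with mt19937_next32() as a hypothetical stand-in for the
    underlying generator:

        #include <stdint.h>

        #define SKIP 19936

        uint32_t mt19937_next32(void);  /* hypothetical underlying PRNG */

        uint32_t skipping_next32(void)
        {
            static unsigned count;
            if (++count == SKIP) {      /* on every SKIP-th output... */
                count = 0;
                (void)mt19937_next32(); /* ...discard one word */
            }
            return mt19937_next32();
        }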

    Your modification is enough to fool the tests. It is not clear
    to me if it is enough to fool better regularity finders, so
    probably the generator is still defective.


    That was my point. I managed to fool the test. Easily.
    But my conclusion is different. The test is defective; the
    generator, both before and after the modification, is OK.

    Also, note the basic principle of statistical testing: you should
    collect and process the data first, then apply a statistical test
    _once_. Repeated testing with tweaked data is likely to produce a
    false pass. If you really want to tweak the data, the whole thing
    should be treated as one composite test with its own acceptance
    criteria (which are more stringent than those of the separate
    tests). Given that you used knowledge of the failing tests to
    modify the generator, passing the tests after that is a much
    weaker claim than passing the tests for a generator without
    knowledge of the tests.


    Note that the specific skip modulo (19936) was chosen not because
    it is the only modulo that causes the test to pass. It was chosen
    as the biggest number < 19937 in order to have the least
    noticeable impact on the speed of the PRNG. Any smaller skip
    modulo will pass the tests as well.
    So, if there was a feedback loop on my part, it was not a strong
    one.

    And that, BTW, is my main criticism of the methodology of Melissa
    O'Neill in the constructive part of her article. She seems to
    apply a feedback loop by tweaking her suggested PCGs until they
    pass L'Ecuyer's batteries.


    On a related note, I think that even a simple counter fed into a
    high-quality hash function (not cryptographically high quality,
    far less than that) can produce an excellent PRNG with an even
    smaller internal state. But not a very fast one. Although the
    speed depends on the specifics of the computer used. I can
    imagine a computer that has a low-latency Rijndael128
    instruction. On such a computer, running a counter through 3-4
    rounds of Rijndael will produce a very good PRNG that is only 2-3
    times slower than, for example, an LCG 128/64.
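
    A minimal sketch of that idea, assuming x86 AES-NI as the
    "low-latency Rijndael128 instruction"; the round-key constants
    are arbitrary placeholders, where a real generator would derive
    them from the seed:

        #include <stdint.h>
        #include <wmmintrin.h>      /* AES-NI; compile with -maes */

        uint64_t aesctr_next64(void)
        {
            static uint64_t ctr;    /* the whole state: a 64-bit counter */
            const __m128i k1 = _mm_set_epi64x((int64_t)0x9E3779B97F4A7C15u,
                                              (int64_t)0xBF58476D1CE4E5B9u);
            const __m128i k2 = _mm_set_epi64x((int64_t)0x94D049BB133111EBu,
                                              (int64_t)0xD6E8FEB86659FD93u);
            const __m128i k3 = _mm_set_epi64x((int64_t)0x2545F4914F6CDD1Du,
                                              (int64_t)0x9E3779B97F4A7C15u);
            const __m128i k4 = _mm_set_epi64x((int64_t)0xBF58476D1CE4E5B9u,
                                              (int64_t)0x94D049BB133111EBu);

            __m128i x = _mm_cvtsi64_si128((int64_t)ctr++);
            x = _mm_aesenc_si128(x, k1);    /* 4 rounds of mixing;    */
            x = _mm_aesenc_si128(x, k2);    /* the tests further down */
            x = _mm_aesenc_si128(x, k3);    /* found 4 rounds to be   */
            x = _mm_aesenc_si128(x, k4);    /* not quite enough       */
            return (uint64_t)_mm_cvtsi128_si64(x);
        }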

    Maybe.


    Maybe I'll even test my hypothesis. Eventually. Except that,
    again, I am not thrilled by the idea of waiting 6 hours for each
    result.

    I tested.
    It turned out that my hypothesis was wrong. Running a counter
    through 3 rounds of Rijndael128 is not enough. Running the
    counter through 4 rounds is still not enough - it fails 1 test
    (#86) in the BigCrush battery. I didn't test 5 rounds, but even
    if that is enough, which is likely, it would almost certainly be
    slower than several other known methods.

    All that with a simple 64-bit binary counter as the state
    variable.

    With a 128-bit state and with partial chaining of 64 bits of the
    Rijndael output back into part of the state (the other half of
    the state is still a counter), passing all batteries appears very
    easy. It only takes one round for chaining and another one for
    hashing. But under O'Neill's figures of merit, using a 128-bit
    PRNG state is considered cheating ;)

    O'Neill writes about the birthday test: if you take values from
    an N-element set, with more than sqrt(N) samples you should get
    repetitions. Consider N equal to 2**64. In heavy use one could
    generate more than sqrt(N) values. In a PRNG having 64-bit state
    and producing the state as the value, all values are distinct, so
    the generator would fail such a test. One could try to fix this
    by not exposing the state, say by producing only 32 bits in each
    step.

    Using a 64-bit state and generating 32-bit output at each step is
    exactly what I did. In theory it should suffice to pass the
    birthday tests. And indeed the test that failed was not one of
    the birthday tests, but the svaria_WeightDistrib test (#62 in
    BigCrush).

    But on a 64-bit machine it looks
    more efficient to use a 128-bit state and produce 64 bits in
    each step.


    It was not about practicality, but about testing the theory that
    even the worst possible long-period PRNG (and I cannot think of
    anything worse as a PRNG than a simple counter) should be good
    enough when fed into a decent hash, as long as one is not greedy
    and (assuming a 2**64 period) does not try to use more than 32
    bits for output.
    I still think that the theory is correct, but I somewhat
    underestimated the required quality of the hash function.






    --- PyGate Linux v1.5.2
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Michael S@3:633/10 to All on Mon Jan 5 14:21:44 2026
    I experimented a bit more (in fact, more like a lot more) with
    the test batteries of L'Ecuyer. It led me to the conclusion that
    an occasional failure in either the middle or the big battery
    means nothing. Sometimes even a crypto-quality PRNG does not pass
    one test or another. Then you try to reproduce it and see that
    with any other seed that you try the failure does not happen.
    All in all, it makes me more suspicious of PRNGs that
    consistently pass both batteries with various seeds. I start to
    see that as a sign of a PRNG being rigged to pass the tests.

    The above does not apply to the scomp_LinearComp() failures of
    mt19937. Those failures are very consistent. I just don't
    consider them significant for my own use of PRNGs or for any
    other use of PRNGs that I have personally encountered.

    Overall, the experience strengthened my position that the general
    wisdom, previously shared by O'Neill herself, got it right: in
    the absence of special considerations people should select
    mt19937, and especially mt19937-64, as their default PRNG of
    choice.
    Looking closer, apart from its randomness properties and apart
    from its huge period (which does not matter to me), I started to
    appreciate mt19937 for the following properties that I was not
    aware of before:
    - it does not use a multiplier, so it is a good fit for embedded
    systems that have no (or a very slow) multiplier in hardware;
    - the algorithm is very SIMD-friendly, so optimized
    implementations can be very, very fast on modern x86 and ARM64
    hardware.
    The latter property also means that a very fast FPGA
    implementation is easily possible, as long as the designer is
    willing to throw a moderate amount of logic resources at it.

    The above does not mean that the PCG generators of O'Neill have
    no place. Intuitively, they do not appear bad. But the theory is
    unproven, an optimized implementation is likely slower than
    optimized mt19937, and the claimed "security" advantages are
    nonsense, as admitted later by O'Neill herself.
    And, as said above, I no longer trust her empirical methodology,
    based on the work of L'Ecuyer.
    So, PCG generators are a valuable addition to the toolbox, but
    not good enough to change my default.


    --- PyGate Linux v1.5.2
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Michael S@3:633/10 to All on Thu Jan 8 14:03:35 2026
    On Wed, 7 Jan 2026 10:51:16 -0000 (UTC)
    antispam@fricas.org (Waldek Hebisch) wrote:

    Michael S <already5chosen@yahoo.com> wrote:
    I experimented a bit more (in fact, more like a lot more) with
    the test batteries of L'Ecuyer. It led me to the conclusion that
    an occasional failure in either the middle or the big battery
    means nothing. Sometimes even a crypto-quality PRNG does not pass
    one test or another. Then you try to reproduce it and see that
    with any other seed that you try the failure does not happen.
    All in all, it makes me more suspicious of PRNGs that
    consistently pass both batteries with various seeds. I start to
    see that as a sign of a PRNG being rigged to pass the tests.

    Well, that depends on the tests and the thresholds in the tests.
    Some tests, when fed with a truly random source, will produce
    very small variation in the results. With a generous threshold
    such a test will essentially never fail for a truly random
    source. OTOH, when the expected variation of the results is
    larger and the threshold is tight, then a truly random source
    will fail the test from time to time.

    Yes, there are many stable tests. But there are also many
    unstable tests in the Crush/BigCrush batteries. I can't say if
    the ratio is 50:50 or 75:25, but it is not 90:10.
    svaria_WeightDistrib() and swalk_RandomWalk1() appear the most
    unstable. You feed them a particular pseudo-random vector
    (100,000,000 points) and the test fails. Then you shift the
    vector by just one point, to the left or to the right. And it
    passes; often it not only passes, but produces a result that is
    not even close to the margin.

    And if you test long enough you should
    be able to estimate the probability of failure and possibly
    compare it with the theoretical result, if available.


    Long enough in this case is very damn long.


    To tell the truth, I do not know what failures of a
    crypto-quality PRNG mean. It may mean that the test is of the
    tight kind that is supposed to fail from time to time for a truly
    random source. Or it may mean that the PRNG improves the crypto
    part at the cost of statistics. That is, a non-uniform
    distribution of the output is not a problem in crypto
    applications; for crypto purposes the future output simply needs
    to be unpredictable from the current and past output alone.
    And a slightly non-uniform distribution can increase the
    probability of test failure enough that you can observe such
    failures.

    BTW: You mentioned using a counter and a hardware AES128 round.
    Given a cheap AES128 round I would try a 128-bit state and an
    AES128 round as the state update. I do not know if hardware
    AES128 is fast enough to make it worthwhile, but using an AES128
    round as the state update should be much better than scrambling a
    counter.


    It depends on what one considers fast.
    My expectations, trained by LCGs and esp. by optimized MT, are
    rather high. [On relatively modern Intel and AMD hardware]
    generators that do feedback through even a single AES round (with
    another round for post-processing) do not meet my expectations.
    At best, fully inlined and after significant effort to remove all
    unnecessary bottlenecks, such generators produce one output word
    per 4-5 clock cycles. More naive implementations, with the state
    read from memory and stored back to memory on each iteration, are
    much slower than that, easily taking over 10 clocks per
    iteration.
    For comparison, a counter that goes through 5 rounds of AES
    without feedback runs at 3 clocks per word when inlined and at
    3.5 clocks per word via a function call.
    It seems that given today's hardware limitations the only way to
    get a really fast PRNG with the state updated through an AES
    round is by running several chains interleaved, preferably no
    fewer than 4 chains. Which increases the size of the basic PRNG
    state to 512 bits, plus some more for state management.

    Another problem is that when all state bits go through AES, I
    cannot prove that the period of the PRNG is really long. I can't
    even be sure that it does not enter a very short loop at some
    stage of its life. I understand that it is highly unlikely, but I
    don't like the uncertainty.
    Running half of the state through AES and taking the other half
    from a counter solves that, but at the cost of needing better
    diffusion in the post-processing (tempering) stage of the
    generator. If more than 32 bits of output are produced per
    iteration, I would not feel good if the tempering consists of
    just one round of AES. I would want at least two rounds.

    Maybe a better option is to do something like this:

        interleave N times {
            prnd_out = AES1R(state.w128);
            state.w128 = AES1R(state.w128) ^ state.counter64;
            state.counter64 = state.counter64 + 1;
        }

    I am still not 100% sure that the long period is guaranteed with
    such an arrangement, but I am far more sure of it than in the
    case of the original arrangement.
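
    A minimal sketch of one chain of that scheme, assuming AES-NI and
    reading AES1R as an AES round with a fixed key; two distinct
    arbitrary keys are used so that the output round and the feedback
    round do not return the same value:

        #include <stdint.h>
        #include <wmmintrin.h>      /* AES-NI; compile with -maes */

        static __m128i w128;        /* 128-bit chained half of the state */
        static uint64_t counter64;  /* counter half of the state */

        uint64_t chained_next64(void)
        {
            const __m128i k_out = _mm_set_epi64x((int64_t)0x9E3779B97F4A7C15u,
                                                 (int64_t)0xBF58476D1CE4E5B9u);
            const __m128i k_fb  = _mm_set_epi64x((int64_t)0x94D049BB133111EBu,
                                                 (int64_t)0xD6E8FEB86659FD93u);

            __m128i out = _mm_aesenc_si128(w128, k_out);   /* prnd_out */
            __m128i fb  = _mm_aesenc_si128(w128, k_fb);    /* feedback */
            w128 = _mm_xor_si128(fb, _mm_cvtsi64_si128((int64_t)counter64));
            counter64++;
            return (uint64_t)_mm_cvtsi128_si64(out);
        }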

    However, it is far from simple and the state is not very small.
    So, the question is "why bother"? What's wrong with the much
    simpler scheme that takes a 64-bit counter and runs it through 5
    AES rounds? Or even 6 rounds, if one is paranoid?
    The only downside that I can see is that this simple, robust
    scheme is more dependent on high throughput of the AES units,
    which is not necessarily available on somewhat older hardware.

    The above does not apply to the scomp_LinearComp() failures of
    mt19937. Those failures are very consistent. I just don't
    consider them significant for my own use of PRNGs or for any
    other use of PRNGs that I have personally encountered.

    Overall, the experience strengthened my position that the general
    wisdom, previously shared by O'Neill herself, got it right: in
    the absence of special considerations people should select
    mt19937, and especially mt19937-64, as their default PRNG of
    choice.
    Looking closer, apart from its randomness properties and apart
    from its huge period

    Huge period alone is easy. AFAICS matrix variants of the LCG can
    easily get quite large periods. I did not test matrix LCGs, but
    on statistical tests they should be essentially as good as a
    multiprecision LCG, while being cheaper to implement.

    Just to be clear, I mean the equation x_{n+1} = A*x_n + b, where
    x_n is a vector representing the n-th state, A is a matrix and b
    is a vector. Matrix A may be somewhat sparse, that is, have a
    limited number of non-zero entries, and some entries may be
    simple, like 1 or -1. With a proper choice of A and b the period
    should be comparable with the number of available states.
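
    A minimal sketch of one update step of such a matrix LCG, with a
    4x4 sparse A over 64-bit words; the particular A and b below are
    illustrative only, not entries known to give a long period:

        #include <stdint.h>

        typedef struct { uint64_t x[4]; } mlcg;  /* 256-bit state vector */

        void mlcg_step(mlcg *s)
        {
            uint64_t x0 = s->x[0], x1 = s->x[1], x2 = s->x[2], x3 = s->x[3];
            /* sparse A with entries in {-1, 0, 1, 2}, arithmetic mod 2^64 */
            s->x[0] = x0 + x3;
            s->x[1] = x0 - x2;
            s->x[2] = x1 + (x3 << 1);           /* the entry 2 as a shift */
            s->x[3] = x2 + 0x9E3779B97F4A7C15u; /* b = (0,0,0,odd constant) */
        }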

    I see no reason to go for very long periods; already 512 bits of
    state allow periods which in practice should be indistinguishable
    from an infinite period.


    Didn't I agree with that in the very next statement?

    (which does not matter to me) I started to appreciate mt19937
    for the following properties that I was not aware of before:
    - it does not use a multiplier, so it is a good fit for embedded
    systems that have no (or a very slow) multiplier in hardware.

    The biggest MCU with no hardware multiplier that I have has 2 kB
    of RAM. I do not want mt19937 on such a machine. 64-bit
    multiplication on an 8-bitter with a hardware multiplier may be
    slow, so a 64-bit LCG (and improvements based on it) may be slow.
    But multiplication nicely spreads out bits, so it is not clear to
    me whether there is an equally good, cheaper alternative.

    64-bit multipliers spread bits nicely, indeed. But implementing a
    64-bit multiply on a core that natively has only 8x8=16 is not
    easy or fast.
    Even implementing a 32-bit multiply there is not easy or fast.
    And I would not say that 32-bit multipliers spread bits nicely.


    But I was not thinking about that sort of core.
    I had in mind minimalistic implementations of the soft-core
    architectures from Xilinx and Altera.
    These cores have several useful properties:
    - they are tested and supported much better than open-source
    alternatives;
    - they occupy relatively few logic and block-RAM resources;
    - they are often capable of running at a higher frequency than
    their full-featured brethren;
    - the fact that they can be used without buying a license (free
    as in beer) does not hurt either.
    Of course, they are *much* slower (lower IPC) than full-featured
    cores, but in plenty of cases that's OK. One thing that these
    cores do not have is a multiplier. When the compiler sees a
    multiplication in the source code, it calls a support routine.
    And when that happens, then instead of being *much* slower (say,
    5x) than a full-featured core, which is commonly acceptable,
    these cores turn out A LOT slower (say, 200x), which is commonly
    unacceptable.

    Cores like that are sometimes used in applications that want a
    good PRNG, like various sorts of built-in tests. In such an app,
    spending 2K bytes for the mt19937 state is OK. Using emulated
    64-bit multiplication for a PRNG is sometimes OK, but more
    commonly not.

    If needed I would investigate
    matrix LCGs; they may be slightly cheaper.

    - the algorithm is very SIMD-friendly, so optimized
    implementations can be very, very fast on modern x86 and ARM64
    hardware.

    Just the size of the state puts a limit on how fast it can be.
    And the size of the state means that it will compete for cache
    with user data.
    BTW: AFAICS a matrix LCG can be SIMD-friendly too.

    The latter property also means that a very fast FPGA
    implementation is easily possible, as long as the designer is
    willing to throw a moderate amount of logic resources at it.

    The above does not mean that the PCG generators of O'Neill have
    no place. Intuitively, they do not appear bad. But the theory is
    unproven, an optimized implementation is likely slower than
    optimized mt19937, and the claimed "security" advantages are
    nonsense, as admitted later by O'Neill herself. And, as said
    above, I no longer trust her empirical methodology, based on the
    work of L'Ecuyer.
    So, PCG generators are a valuable addition to the toolbox, but
    not good enough to change my default.

    I agree that ATM it is not entirely clear whether PCGs are as
    good as the tests suggest. But I am surprised by your opinion
    about speed; I did not analyze either of them deeply, but PCGs
    are way simpler, so I would expect them to be faster.


    When used in a simple way, PCG has the LCG latency (IMUL+ADD)
    embedded in its core.
    On modern x86-64 that tends to be 4 clocks at the very least. On
    ARM64 it's often better, but I am more interested in x86-64.
    OTOH, optimized implementations of mt19937 have [almost] no
    inherent latency limits. If one has wider hardware, that directly
    translates into both a faster state update and a faster tempering
    phase.
    More strictly speaking, for the state update there exists a
    strict algorithmic limit of 397 32-bit words. Plus, non-heroic
    implementations have a limit of 624-397=227 words. Both limits
    are far beyond the capabilities of the widest commonly available
    hardware (2x or 3x 512 bits, i.e. no more than 48 32-bit
    operations per cycle).
    In practice, the biggest bottleneck, even without 512-bit
    processing, is not the state update and not the tempering, but
    fighting the impedance mismatch at the level of the API. That is,
    the implementation prefers to generate 256 bits at once, but the
    API insists on 32 bits. Which leads to a need for buffering and
    for additional management overhead. More time is spent here than
    within the MT algorithm itself.
    Still, even with the overhead, when the buffer-management part is
    inlined (which is easy to achieve in practice and does not cause
    significant code bloat), the modern but not most modern Intel
    core (Raptor Cove) that I have in my desktop at work is capable
    of delivering one 32-bit word per exactly 3 clock cycles.
    That's a little better than even the most basic fully inlined
    LCG, so necessarily better than any PCG.
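
    A minimal sketch of that buffering arrangement; refill_block() is
    a hypothetical stand-in for an optimized (e.g. SIMD) mt19937 core
    that produces a block of words at once:

        #include <stdint.h>

        #define BLOCK 64            /* 32-bit words refilled per call */

        void refill_block(uint32_t *out, unsigned n);  /* hypothetical core */

        static uint32_t buf[BLOCK];
        static unsigned pos = BLOCK;    /* empty buffer forces first refill */

        static inline uint32_t next32(void)
        {
            if (pos == BLOCK) {         /* rare slow path */
                refill_block(buf, BLOCK);
                pos = 0;
            }
            return buf[pos++];          /* common path: load + increment */
        }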

    There is little doubt that LCGs (and, as a result, PCGs) can be
    sped up by advancing the state several steps at a time instead of
    one step at a time. But then they will suffer from the same
    impedance mismatch with APIs. The solution to the mismatch would
    be the same as with MT - buffering and additional management
    overhead. So gone is the simplicity, gone is the tiny state size
    :(
    I did not try it, but it is very likely that for a given width of
    output word the resulting speed of PCG would be almost exactly
    the same as that of optimized mt19937, because the generator
    would spend the majority of its time in code that is not merely
    almost the same between the two, but exactly the same.
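
    For reference, a minimal sketch of the several-steps-at-a-time
    advance for an LCG x_{n+1} = a*x_n + b (mod 2**64): composing the
    affine map with itself square-and-multiply style yields constants
    (mul, add) with x_{n+k} = mul*x_n + add, so k interleaved lanes
    can each jump by k per batch:

        #include <stdint.h>

        /* derive (mul, add) such that x_{n+k} = mul*x_n + add (mod 2^64) */
        void lcg_jump(uint64_t a, uint64_t b, uint64_t k,
                      uint64_t *mul, uint64_t *add)
        {
            uint64_t m = 1, c = 0;      /* identity map: x -> 1*x + 0 */
            while (k) {
                if (k & 1) {            /* fold current power into result */
                    m = a * m;
                    c = a * c + b;
                }
                b = a * b + b;          /* square the map: (a,b) o (a,b) */
                a = a * a;
                k >>= 1;
            }
            *mul = m;
            *add = c;
        }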


    --- PyGate Linux v1.5.2
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Tim Rentsch@3:633/10 to All on Thu Jan 8 09:40:21 2026
    antispam@fricas.org (Waldek Hebisch) writes:

    Michael S <already5chosen@yahoo.com> wrote:

    I experimented a bit more (in fact, more like a lot more) with
    the test batteries of L'Ecuyer. It led me to the conclusion that
    an occasional failure in either the middle or the big battery
    means nothing. Sometimes even a crypto-quality PRNG does not pass
    one test or another. Then you try to reproduce it and see that
    with any other seed that you try the failure does not happen.
    All in all, it makes me more suspicious of PRNGs that
    consistently pass both batteries with various seeds. I start to
    see that as a sign of a PRNG being rigged to pass the tests.

    Well, that depends on the tests and the thresholds in the tests.
    Some tests, when fed with a truly random source, will produce
    very small variation in the results. With a generous threshold
    such a test will essentially never fail for a truly random
    source. OTOH, when the expected variation of the results is
    larger and the threshold is tight, then a truly random source
    will fail the test from time to time. And if you test long enough
    you should be able to estimate the probability of failure and
    possibly compare it with the theoretical result, if available.

    It's inherent in the nature of statistical testing that every
    so often a statistical test will "fail" even for a truly random
    input. An input source that never fails can also be indicative
    of a low quality source (and perhaps one that was tuned to the
    particular set of tests being done). It took me a while to
    learn that the results of a PRNG test suite should not be seen
    as purely binary.

    --- PyGate Linux v1.5.2
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)