• Re: a MSVC and glibc-compatible fmod()

    From Bonita Montero@3:633/280.2 to All on Wed Mar 12 00:08:04 2025
    Am 11.03.2025 um 13:55 schrieb Michael S:

    But not when library routine does not use FPU. Or uses FPU only for comparison ops.

    Then the library routine shout use fesetexcept() as you do it yourself.

    The point is, it does not sound right if SNAN is *silently* converted
    to QNAN. That type of conversion has to be loud i.e. accompanied by
    setting of FE_INVALID.

    Interestingly even conversion-operations from double to float do that.
    That's not what I epected.
    And there a some operations that should do that but actually keep the signalling bit zeroed like a sign-change since this is usually done with
    a XOR-operation.

    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: A noiseless patient Spider (3:633/280.2@fidonet)
  • From Michael S@3:633/280.2 to All on Wed Mar 12 00:46:47 2025
    On Tue, 11 Mar 2025 14:55:43 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Tue, 11 Mar 2025 13:10:31 +0100
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 11.03.2025 um 12:34 schrieb Michael S:

    Exactly. Both options are legal. MS's decision to not set
    FE_INVALID is as good as glibc decision to set it.

    If I do a SSE-/AVX-operation where either operand is a signalling
    NaN I get a FE_INVALID; since the FPU behaves this way the MSVC
    runtime should do that also.

    BTW, what is the output of MS library in that case? SNAN or QNAN?


    Results with SNaN parameters are always QNaN, that shoud be common
    with any FPU.


    But not when library routine does not use FPU. Or uses FPU only for comparison ops.
    The point is, it does not sound right if SNAN is *silently* converted
    to QNAN. That type of conversion has to be loud i.e. accompanied by
    setting of FE_INVALID.



    I tested. It appears that MSVC implementation made a mistake in
    cases of fmod(snan, qnan):

    MSVC gcc
    x y result FE_INVALID) result FE_INVALID
    snan 1 snan 0 qnan 1
    snan 0 snan 0 qnan 1
    snan inf snan 0 qnan 1
    snan qnan qnan 0 !!! qnan 1
    1 snan snan 0 qnan 1
    0 snan snan 0 qnan 1
    inf snan snan 0 qnan 1
    qnan snan snan 0 qnan 1





    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: A noiseless patient Spider (3:633/280.2@fidonet)
  • From James Kuyper@3:633/280.2 to All on Wed Mar 12 05:21:02 2025
    On 3/11/25 06:29, Michael S wrote:
    ....
    Pay attention that fmod() has no requirements w.r.t. to such exceptions
    as FE_INEXACT, FE_UNDERFLOW and non-standard FE_DENORMAL.
    Strictly speaking, even raising FE_OVERFLOW is not illegal, but doing
    so would be bad quality of implementation.
    Also spec does not say what happens to FE_INVALID when one of the
    inputs is signalling NAN.

    I've got Bonita killfiled, so the oldest message I can see on this
    thread is one posted by you that indicated you were interested in IEEE
    754 (==IOS/IEC 60559) conformance.

    The C++ standard cross-references the C standard for such issues. The C standard specifies that, for an implementation which pre#defines __STDC__IEC_60559__BFP__, floating point exception handling is very
    tightly specified for conformance with ISO/IEC 60559:

    "The double version of fmod behaves as though implemented by
    #include <math.h>
    #include <fenv.h>
    #pragma STDC FENV_ACCESS ON
    double fmod(double x, double y)
    {
    double result;
    result = remainder(fabs(x), (y = fabs(y)));
    if (signbit(result)) result += y;
    return copysign(result, x);
    }
    " (F10.7.1)

    "Operations defined in 6.5 and functions and macros defined for the
    standard libraries change floating-point status flags and control modes
    just as indicated by their specifications (including conformance to IEC
    60559). They do not change flags or modes (so as to be detectable by the
    user) in any other cases." (F8.6)

    "... signbit ... raise[s] no floating-point exceptions, even if an
    argument is a signaling NaN." (F3p6)

    "fabs(x) raises no floating-point exceptions, even if x is a signaling
    NaN." (F10.4.3)

    "— remainder(x, y) returns a NaN and raises the "invalid" floating-point exception for x infinite or y zero (and neither is a NaN)." (F.10.7.2)

    "copysign(x, y) raises no floating-point exceptions, even if x or y is a signaling NaN." (F10.8.1)

    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: A noiseless patient Spider (3:633/280.2@fidonet)
  • From Michael S@3:633/280.2 to All on Wed Mar 12 05:28:12 2025
    On Mon, 10 Mar 2025 20:38:18 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Mon, 10 Mar 2025 19:00:06 +0100
    Bonita Montero <Bonita.Montero@gmail.com> wrote:


    Your idea is really elegant

    I'd rather call it "simple" or "straightforward". "Elegant" in my book
    is something else. For example, the code above is closer to what I
    consider elegant.
    May be, later today or tomorrow, I'll show you solution that I
    consider bright. Bright, but impractical.


    Here, here!
    A bright part is in lines 18 to 29. The rest are hopefully competent technicalities.

    #include <string.h>
    #include <stdint.h>
    #include <math.h>
    #include <fenv.h>

    static uint64_t
    umulrem(uint64_t x, uint64_t y, uint64_t den) {
    #ifdef _MSC_VER
    uint64_t hi, lo = _umul128(x, y, &hi);
    uint64_t rem;
    _udiv128(hi, lo, den, &rem);
    return rem;
    #else
    return ((unsigned __int128)x * y) % den;
    #endif
    }

    // Calculate mod(2**e, y) where y < 2**53
    static uint64_t pow2_mod(uint64_t y, unsigned e) {
    if (e < 64) {
    uint64_t x = (uint64_t)1 << e;
    if (x < y)
    return x;
    return x % y;
    }
    uint64_t x1 = pow2_mod(y, e/2);
    uint64_t x2 = x1 << (e & 1);
    return umulrem(x1, x2, y);
    }

    static double u2d(uint64_t u) {
    double d;
    memcpy(&d, &u, sizeof(d));
    return d;
    }

    static uint64_t d2u(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof(u));
    return u;
    }

    // raise FE_INVALID and return nan
    static double raise_fe_invalid_ret_nan(double x)
    {
    const uint64_t SNAN_BITS = ~(1ull << 51);
    double snan = u2d(SNAN_BITS);
    #ifndef __clang__
    return snan + x;
    #else
    volatile double v_snan = snan;
    return v_snan + x;
    #endif
    }

    double my_fmod(double x, double y)
    {
    const uint64_t INF_EXP = 2047;
    const uint64_t INF2 = INF_EXP << 53;
    const uint64_t HIDDEN_BIT = (uint64_t)1 << 52;
    const uint64_t MANT_MASK = HIDDEN_BIT - 1;
    const uint64_t SIGN_BIT = (uint64_t)1 << 63;

    uint64_t ux = d2u(x);
    uint64_t uy = d2u(y);
    uint64_t ux2 = ux*2;
    uint64_t uy2 = uy*2;
    uint64_t sx = ux & SIGN_BIT;

    // process non-finite x
    if (ux2 >= INF2) { // x is inf or nan
    if (ux2 > INF2) // x is nan
    return x + y; // raises FE_INVALID when either x or y is sNAN
    // x is inf
    if (uy2 > INF2) // y is nan
    return x + y;
    // y is finite or inf
    return raise_fe_invalid_ret_nan(x);
    }
    // x is finite

    // process non-finite and zero y
    if (uy2-1 >= INF2-1) { // y is inf or nan or 0
    if (uy2 > INF2) // y is nan
    return y;
    // x is inf
    if (uy2 == INF2) // y is inf
    return x;
    // y is 0
    return raise_fe_invalid_ret_nan(x);
    }

    // y is finite non-zero
    if (ux2 < uy2)
    return x; // abs(x) < abs(y)

    // extract mantissa and exponent
    uint64_t mantX = (ux2 >> 1) & MANT_MASK;
    uint64_t mantY = (uy2 >> 1) & MANT_MASK;
    int expX = ux2 >> 53;
    if (expX == 0) { // X subnormal
    // Y is also subnormal, so we can use simple integer reduction
    return u2d((mantX % mantY) | sx);
    }

    int expY = uy2 >> 53;
    if (expY == 0) { // Y subnormal
    mantY |= HIDDEN_BIT; // removed below
    expY = 1;
    }

    mantY ^= HIDDEN_BIT;
    mantX ^= HIDDEN_BIT;
    if (mantX >= mantY) {
    mantX -= mantY;
    if (mantX >= mantY) // can happen when y is subnormal
    mantX %= mantY;
    }

    int dExp = expX - expY;
    uint64_t f = (dExp <= 63) ?
    (uint64_t)1 << dExp : // quick path
    pow2_mod(mantY, dExp); // slow path
    mantX = umulrem(mantX, f, mantY);

    // apply exponent of Y to mantX
    uint64_t ures0 = ((uint64_t)expY << 52) | sx;
    uint64_t ures = ures0 | (mantX & MANT_MASK);
    if (mantX & HIDDEN_BIT)
    ures0 = 0;
    return u2d(ures) - u2d(ures0);
    }


    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: A noiseless patient Spider (3:633/280.2@fidonet)
  • From Michael S@3:633/280.2 to All on Sun Mar 16 22:48:03 2025
    On Tue, 11 Mar 2025 20:28:12 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Mon, 10 Mar 2025 20:38:18 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Mon, 10 Mar 2025 19:00:06 +0100
    Bonita Montero <Bonita.Montero@gmail.com> wrote:


    Your idea is really elegant

    I'd rather call it "simple" or "straightforward". "Elegant" in my
    book is something else. For example, the code above is closer to
    what I consider elegant.
    May be, later today or tomorrow, I'll show you solution that I
    consider bright. Bright, but impractical.


    Here, here!
    A bright part is in lines 18 to 29. The rest are hopefully competent technicalities.


    And here is non-recursive implementation of the same algorithm that has following potentially useful properties:
    1. It does not use compiler-specific extensions, only standard C.
    2. It does not use FMA, so gives correct results on implementations
    with broken fam(), like MSVC on pre-AVX computers.


    #include <string.h>
    #include <stdint.h>
    #include <math.h>

    static double u2d(uint64_t u) {
    double d;
    memcpy(&d, &u, sizeof(d));
    return d;
    }

    static uint64_t d2u(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof(u));
    return u;
    }

    // raise FE_INVALID and return nan
    static double raise_fe_invalid_ret_nan(double x)
    {
    const uint64_t SNAN_BITS = ~(1ull << 51);
    double snan = u2d(SNAN_BITS);
    #ifndef __clang__
    return snan + x;
    #else
    volatile double v_snan = snan;
    return v_snan + x;
    #endif
    }

    double my_fmod(double x, double y)
    {
    const uint64_t INF_EXP = 2047;
    const uint64_t INF2 = INF_EXP << 53;
    const uint64_t HIDDEN_BIT = (uint64_t)1 << 52;
    const uint64_t MANT_MASK = HIDDEN_BIT - 1;
    const uint64_t SIGN_BIT = (uint64_t)1 << 63;

    uint64_t ux = d2u(x);
    uint64_t uy = d2u(y);
    uint64_t ux2 = ux*2;
    uint64_t uy2 = uy*2;
    uint64_t sx = ux & SIGN_BIT;

    // process non-finite x
    if (ux2 >= INF2) { // x is inf or nan
    if (ux2 > INF2) // x is nan
    return x + y; // raises FE_INVALID when either x or y is sNAN
    // x is inf
    if (uy2 > INF2) // y is nan
    return x + y;
    // y is finite or inf
    return raise_fe_invalid_ret_nan(x);
    }
    // x is finite

    // process non-finite and zero y
    if (uy2-1 >= INF2-1) { // y is inf or nan or 0
    if (uy2 > INF2) // y is nan
    return y;
    // x is inf
    if (uy2 == INF2) // y is inf
    return x;
    // y is 0
    return raise_fe_invalid_ret_nan(x);
    }

    // y is finite non-zero
    if (ux2 < uy2)
    return x; // abs(x) < abs(y)

    // extract mantissa and exponent
    int64_t mantX = ((ux2 >> 1) & MANT_MASK)+HIDDEN_BIT;
    int64_t mantY = ((uy2 >> 1) & MANT_MASK)+HIDDEN_BIT;
    int expX = ux2 >> 53;
    int expY = uy2 >> 53;
    unsigned dExp = expX - expY;
    double ax = fabs(x);
    double ay = fabs(y);
    if (ax*0x1p-53 <= ay) {
    // Quick path
    int64_t d = (int64_t)(ax/ay);
    if (expY == 0) { // Y subnormal
    // don't normalize
    mantY -= HIDDEN_BIT;
    expY = 1;
    if (expX == 0) { // X subnormal
    mantX -= HIDDEN_BIT;
    expX = 1;
    }
    dExp = expX - expY;
    }
    mantX = (mantX << dExp) - mantY*(uint64_t)d;
    } else {
    if (expY == 0) { // Y subnormal
    // Normalize
    uy = d2u(y*0x1p52);
    mantY = (uy & MANT_MASK) | HIDDEN_BIT;
    expY = ((int)(uy >> 52) & 2047) - 52;
    dExp = expX - expY;
    }
    if (mantY == (int64_t)HIDDEN_BIT)
    return u2d(sx); // Y is power of 2

    // Calculate rem(2**dExp, mantY)
    unsigned e0, n_steps;
    for (n_steps = 0, e0 = dExp; e0 > 105; ++n_steps)
    e0 /= 2;

    double ry = 1.0/mantY;
    uint64_t mantRy = (d2u(ry) & MANT_MASK) | HIDDEN_BIT;
    uint64_t d = mantRy >> (105-e0);
    int64_t f = (((uint64_t)1 << 52) << (e0-52)) - mantY*d;
    // f = rem(2**e0, mantY) + a*mantY where -1 <= a <= 1
    if (n_steps > 0) {
    int next_bit = n_steps-1;
    const uint64_t F_MAX = (uint64_t)3 << 54;
    do {
    double df = (double)f;
    d = (int64_t)(df*df*ry);
    f = (uint64_t)f*(uint64_t)f - mantY*d;
    f <<= (dExp >> next_bit) & 1;
    if (f+F_MAX > F_MAX*2)
    f -= (int64_t)(f*ry)*mantY;
    --next_bit;
    } while (next_bit >= 0);
    }
    if (mantX >= mantY)
    mantX -= mantY;
    d = (int64_t)((double)f*(int64_t)mantX*ry);
    mantX = (uint64_t)f*(uint64_t)mantX - mantY*(uint64_t)d;
    }
    while ((uint64_t)mantX >= (uint64_t)mantY) {
    if (mantX < 0)
    mantX += mantY;
    else
    mantX -= mantY;
    }

    if (expY <= 1) { // Y subnormal
    mantX >>= 1 - expY;
    return u2d(mantX | sx);
    }

    // Apply exponent of Y to mantX
    uint64_t ures0 = ((uint64_t)expY << 52) | sx;
    uint64_t ures = ures0 | (mantX & MANT_MASK);
    if (mantX & HIDDEN_BIT)
    ures0 = 0;
    return u2d(ures) - u2d(ures0);
    }


    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: A noiseless patient Spider (3:633/280.2@fidonet)