  1. MariaDB Server
  2. MDEV-135

failures in buildbot in 5.5 on kvm-deb-debian5-amd64

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5.20
    • Component/s: None
    • Labels:
      None

      Description

      Failing test(s): rpl.rpl_checksum_cache rpl.rpl_heartbeat_basic main.ps_3innodb main.ps main.subselect_mat_cost main.select_pkeycache main.multi_update main.union

              Activity

              knielsen Kristian Nielsen added a comment -

              Simple test case:

              CREATE TABLE t1 (i INT, INDEX (i));
              INSERT INTO t1 VALUES (1);
              SELECT AVG(i) FROM t1;
              DROP TABLE t1;

              The problem seems to be in my_decimal_div(). This dump is from
              Item_sum_avg::val_decimal():

              XXX3: SQLCOM_SELECT: SELECT AVG FROM t1
              XXX12: Item_sum_avg::val_str()
              XXX11: Item_sum_avg::val_decimal()
              XXX11: using decimal ...
              XXX11: values: 1 / 1
              XXX11: sum_dec=9.0: 1 0 0 0 0 0 0 0 0
              XXX11: count=9.0: 1 0 54436864 0 1609087657 32688 0 0 31
              XXX11: sum/count=9.9: 1 999999999 1 0 11794296 0 6144224 0 1608708904
              XXX12: decimal -> 2.0000
              XXX13 Item::send(Protocol *, ...) buffer=2.0000

              This means that Item_sum_avg::val_decimal() is computing 1/1 with
              my_decimal_div(). The result becomes 1.999999999, which is then
              rounded to 2.0000 when it is converted to a string and sent.

              Unfortunately, the bug occurrence is extremely fragile.

              I can repeat on VM vm-debian5-amd64-build by copying in source tarball and
              running debian/autobake-deb.sh. If I then add a single line fprintf() in
              do_div_mod() and `make -j2`, the problem disappears. If I remove the single
              line again and `make -j2`, the problem is still gone ...

              weird ...

              knielsen Kristian Nielsen added a comment -

              I discovered that the problem occurs when strings/decimal.c is built with DEB_BUILD_HARDENING=1.
              The problem disappears when that file is compiled with that variable not set.

              knielsen Kristian Nielsen added a comment -

              Bug is triggered when strings/decimal.c is compiled with -D_FORTIFY_SOURCE=2 (or =1).

              knielsen Kristian Nielsen added a comment - - edited

              Ok, I analysed this in detail. My conclusion is that this is a bug in the old
              GCC version on Debian 5 "lenny" (4.3.2).

              The code does this:

              if (unlikely(dcarry == 0 && *start1 < *start2))
              ...
              buf1=start1+len2;
              ...
              SUB2(*buf1, *buf1, lo, carry);
              ...
              dcarry= *start1;

              len2 can be zero (and is, when I see the failure). SUB2 assigns to *buf1.

              Checking the disassembled GCC output, what it does is cache the value of
              *start1 from the top in register %r15d:

              a92390: 44 8b 7e fc mov -0x4(%rsi),%r15d # *start1

              and it uses this variable to assign to dcarry:

              a92512: 44 89 fb mov %r15d,%ebx # dcarry=*start1

              This is wrong, as the value in %r15d is stale. *start1 has a new value from the SUB2().

              I do not see any problems with the code in terms of violation of strict
              aliasing or other issues. My conclusion is that GCC is doing the wrong thing
              here.

              I do not think there is a point in trying to report this as a GCC bug. This is
              in a very old version of the compiler, and we do not see this problem on any
              other host/gcc version. It is probably already fixed long ago.

              I will add an #ifdef so that the Debian package build can work around the
              problem on Debian 5.

              knielsen Kristian Nielsen added a comment -

              Buildbot confirms that the workaround eliminates the failure.


                People

                • Assignee:
                  knielsen Kristian Nielsen
                  Reporter:
                  knielsen Kristian Nielsen

                    Time Tracking

                    Estimated: 0m
                    Remaining: 0m
                    Logged: 3d 1h