MariaDB Server / MDEV-7998

Total lockup on two add index statements

    Details

    • Type: Bug
    • Status: Stalled
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.0.17-galera
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:
      CentOS 5.11

      Description

      We have a Galera cluster with two nodes: the main one handles all application access, and the second serves as an active backup and reporting server.
      I was running ALTER TABLE ... ADD INDEX on our main node (I had already done it on the other node) with
      SET GLOBAL wsrep_OSU_method='RSU';
      SET SESSION wsrep_on=OFF;

      I was monitoring progress in another window, and instead of "show processlist;" I unfortunately ran an older ALTER TABLE ... ADD INDEX on a different table from the history. It should have reported a duplicate-index error, but instead the entire node locked up: no connection (new or already established) was responding. I had to ask our server admin to restart it, and he had to kill -9 it; after that it started OK. Even the added index is present, so it seems to have finished in the background (there was enough time for it before the stop was issued).
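      For context, the full RSU-style sequence on a single node looks roughly like the following (a sketch reconstructed from the session detailed in the comments below; the DDL runs locally only and is not replicated):

      SET GLOBAL wsrep_OSU_method='RSU';
      SET SESSION wsrep_on=OFF;
      ALTER TABLE ticket ADD INDEX `user_zal_zrus_host` (`user_id`,`zalozen`,`zruseno`,`host_id`);
      SET SESSION wsrep_on=ON;
      SET GLOBAL wsrep_OSU_method='TOI';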

      Attached is the relevant part of mysqld.log (the few rows before the event are from the day before; I added a couple of comments prefixed with !!!!).


          Attachments

            Activity

            nirbhay_c Nirbhay Choubey added a comment -

            Hi Jiri Kavalik: I couldn't reproduce the issue using the specified scenario. Here is what I tried on a 2-node cluster:

            Node 1 - Session 1:

            MariaDB [test]> SET GLOBAL wsrep_OSU_method='RSU';
            Query OK, 0 rows affected (0.00 sec)
            
            MariaDB [test]> SET SESSION wsrep_on=OFF;
            Query OK, 0 rows affected (0.00 sec)
            
            MariaDB [test]> alter table t1 add index(i);
            Query OK, 0 rows affected (2.51 sec)
            Records: 0  Duplicates: 0  Warnings: 0
            

            Node 1 - Session 2 (Simultaneous to the above ALTER):

            MariaDB [test]> alter table t2 add index(i);
            Query OK, 0 rows affected, 1 warning (3.80 sec)
            Records: 0  Duplicates: 0  Warnings: 1
            

            Can you reproduce it on your end? What are the table structures and the indexes that you tried to create? What are your Galera configuration options?
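
            For reference, that information could be collected with statements along these lines (a sketch; the table names are taken from the reporter's later comment):

            SHOW CREATE TABLE sazky;
            SHOW CREATE TABLE ticket;
            SHOW VARIABLES LIKE 'wsrep%';
            SHOW GLOBAL STATUS LIKE 'wsrep%';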

            jkavalik Jiri Kavalik added a comment -

            Hello Nirbhay Choubey, I tried it on a testing two-node cluster, where I only created those two tables and filled them with enough data - and I too was NOT able to reproduce it. I even ran sysbench to generate some load on the server. The testing cluster was running the same server version, "10.0.17-MariaDB-wsrep-log MariaDB Server, wsrep_25.10.r4144", but on a different CentOS version (6.6 instead of 5.11) and with less memory available for the buffer pool.
            I am not going to try it on the production server again, and I have no core dump, because the priority was getting it up again and I was not in the right state of mind to remember to capture a stack trace through GDB or otherwise. I understand that a bug report like this is unfortunately not very helpful, but I will try to provide any information you might want in case you get an idea.

            I uploaded server.cnf, which is the same on both nodes, except that node2 has a 16GB InnoDB buffer pool instead of 20GB.
            The actual node1 had a dynamically enabled query cache of 200MB, max_heap_table_size set to 1GB, and is doing ~500 QPS on average.
            Only node1 is used by our application; node2 is an "active backup" used for reporting and for query-optimization checks on real data, so as not to slow the app down.

            I attached the CREATE TABLE statements for both tables - on the testing cluster I dropped the foreign key and unique constraints to be able to duplicate some existing rows and create more (300k rows in sazky, 600k in ticket; on real data they are around 3M rows each).

            My actual session was more like:

            node1:
            alter table sazky add index `udalost_status` (`udalost_id`,`status`); -- with TOI - 25secs
            
            node2:
            SET GLOBAL wsrep_OSU_method='RSU';
            SET SESSION wsrep_on=OFF;
            alter table ticket add index `user_zal_zrus_host` (`user_id`,`zalozen`,`zruseno`,`host_id`); -- 3 minutes
            SET SESSION wsrep_on=ON;
            SET GLOBAL wsrep_OSU_method='TOI';
            
            node1:
            SET GLOBAL wsrep_OSU_method='RSU';
            SET SESSION wsrep_on=OFF;
            alter table ticket add index `user_zal_zrus_host` (`user_id`,`zalozen`,`zruseno`,`host_id`); -- was running close to 5 minutes before hang
            
            node1 - parallel to previous:
            show processlist; -- multiple times through those 5 minutes
            alter table sazky add index `udalost_status` (`udalost_id`,`status`); -- wanted to run show processlist and pressed UP one time too many..
            

            I was expecting that last ALTER to return a "duplicate index" error, but everything stopped at that moment. Connections to node2 were still working OK; only its system threads for applying replication were counting time as idle, which suggests that even replication from node1 had stopped.
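
            For illustration, re-running an ADD INDEX with an already existing index name would normally be expected to fail immediately rather than hang (the output shown here is assumed, not captured from the incident):

            ALTER TABLE sazky ADD INDEX `udalost_status` (`udalost_id`,`status`);
            -- ERROR 1061 (42000): Duplicate key name 'udalost_status'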

            Our server admin issued "service mysql stop" on node2 (OK) and then on node1 (it did not finish), then ran kill -s 9 on node1, and then started node1 and node2 again. The index user_zal_zrus_host was present in the table on node1, so it either finished just before the hang or the server was still working internally and just not communicating with connections. There was enough time between the unwanted second ALTER and stopping/killing the server for the index creation to finish (my estimate for it is ~6 minutes).
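
            (For completeness, a check along these lines could confirm whether the index made it into the table definition after restart; the schema name 'test' is an assumption here:)

            SELECT INDEX_NAME, SEQ_IN_INDEX, COLUMN_NAME
              FROM information_schema.STATISTICS
             WHERE TABLE_SCHEMA = 'test'
               AND TABLE_NAME = 'ticket'
               AND INDEX_NAME = 'user_zal_zrus_host';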

            I would be glad if this could help you in some way, since you know the internals. But it seems not reproducible by repeating the steps I remember, and I understand that without any stack trace or core dump it will probably be closed.


              People

              • Assignee:
                nirbhay_c Nirbhay Choubey
                Reporter:
                jkavalik Jiri Kavalik
              • Votes:
                0
                Watchers:
                2
