MariaDB Server / MDEV-6225

Idle replication slave keeps crashing.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 5.5.37-galera
    • Fix Version/s: 5.5.39-galera, 5.5.39
    • Component/s: None
    • Labels:
    • Environment:
      Debian 7.5, kernel 3.2.54-2, 24 core Xeon E5645 @ 2.40 GHz, 48GB RAM
      Running mysqld_multi with a dozen instances, the crashing one has a 10G buffer pool.

      Description

      I'm using mysqld_multi to run many instances of mysql on the same machine. Each instance is a node in a 3-node Galera cluster, so there are three physical boxes, each running a dozen instances, which together make up a dozen clusters.

      One instance will always crash given enough time. Sometimes it's less than a day, sometimes it runs for a week, but it always eventually crashes. That instance is no longer in a cluster as I'm trying to troubleshoot it but Galera is still loaded. I've tried not loading the provider as well with the same results.

      It is just replicating and not serving any real queries or traffic. Unfortunately, I haven't been able to narrow it down to a specific table or query. Today was the first time another instance ever crashed (it had been running for several months). What does stand out is that the host that crashes is always the one acting as a replication slave. The other instances of the cluster did not crash while this one was clustered. Someone on the #maria IRC channel mentioned deadlocks with replication and SHOW SLAVE STATUS, but I can't confirm anything.

      I was encouraged to submit a report and include my core dump traces. The 3rd dump labelled mysql-13 is the new crash I was referring to. The other two are from the same instance that has been repeatedly crashing. When it does crash, I rebuild the data fresh. I have the same data on a host that is running mariadb-server-5.5 (not galera) and has never crashed. It's actually what I use to rebuild/reseed this system from when it crashes.

            Activity

            jplindst Jan Lindström added a comment -

            Hi,

            Could you also attach a full, unedited error log (at least one) from the crashing server?

            R: Jan

            sophomeric Eric Webster added a comment -

            I did not save them unfortunately, only the core dumps. I can post one as soon as it happens again though.

            sophomeric Eric Webster added a comment -

            New core backtrace attached and associated error log.

            sophomeric Eric Webster added a comment -

            Another core backtrace and associated error log.

            sophomeric Eric Webster added a comment -

            Two more crashes from the weekend. I'm now running it on two machines (identical hardware, software, and data) and they are both crashing, but at different times. If it were a bad query or something similar, you'd expect it to crash both at the same point.

            sophomeric Eric Webster added a comment -

            I'm still collecting core dumps from these slaves but I've stopped posting them since there hasn't been any reply. If they are still useful, let me know and I'll continue to post them.

            jplindst Jan Lindström added a comment -

            revno: 3506
            committer: Jan Lindström <jplindst@mariadb.org>
            branch nick: maria-5.5-galera
            timestamp: Mon 2014-06-30 14:02:54 +0300
            message:
            MDEV-6225: Idle replication slave keeps crashing.

            Analysis: Based on the crashes, the buffer pool instance identifier is
            not correct on the block to be freed. Hold the LRU list mutex in
            functions that call free, and add additional safety checks.

            jplindst Jan Lindström added a comment -

            5.5:

            revno: 4221
            committer: Jan Lindström <jplindst@mariadb.org>
            branch nick: 5.5
            timestamp: Mon 2014-06-30 14:06:28 +0300
            message:
            MDEV-6225: Idle replication slave keeps crashing.

            Analysis: Based on the crashes, the buffer pool instance identifier is
            not correct on the block to be freed. Hold the LRU list mutex in
            functions that call free, and add additional safety checks.
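
Editorial illustration (not part of the original comment and not the actual patch): the general pattern the commit message describes, taking the list mutex before freeing a block and refusing to free a block whose pool identifier does not match, might look roughly like the sketch below. All names here (`Block`, `PoolInstance`, `free_block`, `lru_size`) are hypothetical and do not come from the MariaDB or InnoDB/XtraDB source.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>

// Hypothetical stand-in for a buffer pool block; not the real InnoDB type.
struct Block {
    std::size_t pool_id;  // identifier of the pool instance that owns the block
    bool in_lru;          // whether the block is currently on an LRU list
};

// Hypothetical stand-in for one buffer pool instance.
class PoolInstance {
public:
    explicit PoolInstance(std::size_t id) : id_(id) {}

    // Put a block on this instance's LRU list and stamp it with our id.
    void add(Block* b) {
        std::lock_guard<std::mutex> guard(lru_mutex_);
        b->pool_id = id_;
        b->in_lru = true;
        lru_.push_back(b);
    }

    // Free a block while holding the LRU list mutex.
    // Returns false (and changes nothing) if a safety check fails.
    bool free_block(Block* b) {
        std::lock_guard<std::mutex> guard(lru_mutex_);
        if (b == nullptr || b->pool_id != id_ || !b->in_lru) {
            return false;  // wrong pool instance, or block not on the list
        }
        lru_.remove(b);
        b->in_lru = false;
        return true;
    }

    // Number of blocks currently on the LRU list.
    std::size_t lru_size() {
        std::lock_guard<std::mutex> guard(lru_mutex_);
        return lru_.size();
    }

private:
    std::size_t id_;
    std::mutex lru_mutex_;
    std::list<Block*> lru_;
};
```

In this sketch, a call such as `other_pool.free_block(&blk)` on a block owned by a different instance returns `false` instead of corrupting the list, which is the kind of safety check the message alludes to. How the real fix performs its checks is not stated in this report.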


              People

              • Assignee: Jan Lindström (jplindst)
              • Reporter: Eric Webster (sophomeric)
              • Votes: 0
              • Watchers: 2

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0m
                  • Logged: 6h