Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-8421

ibdata1 size growing too much but cannot be explained by usage

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.0.14-galera
    • Fix Version/s: None
    • Labels:
      None
    • Environment:
      Linux db1 3.2.0-4-amd64 #1 SMP Debian 3.2.63-2 x86_64 GNU/Linux

      Description

      We recently had an issue with one node of our 3-node galera cluster eating all hard disk space. It turned out to be that the ibdata1 file had grown up too much in size. However, this increase cannot be explained by the cluster's everyday usage.

      The cluster is hosting one database that has reached the size of 25GB over a span of 5 years. When all free space was consumed, the size of ibdata1 had grown up to ~12GB, which is about 48% of the whole database size. Using 'innochecksum' and 'innodb_space' tools, we found out that 97% of pages in the ibdata1 file is undo log pages.

      Unfortunately the cluster was loosely monitored at the time of the event, so we do not know if the increase was rapid or increamental over a longer period of time. In any case, the size of 12GB seems to be too big to have been caused by normal usage. After recovering from the situation (we rebootstrapped the cluster on one of the other two nodes and rejoined the affected node after removing its data files), we started monitoring ibdata1 on all 3 nodes. With the same everyday usage on the cluster, ibdata1 now remains the same size (~76MB) on all 3 nodes. So we are not seeing any incremental increase in size.

      Some other, possibly relevant, facts:

      • At the time of the event, we got innodb status output with a "History list length" value of 40957337 which seems pretty big.
      • We had recenlty (before the event) introduced some changes so that one of our applications was issuing multiple queries doing "INSERT INTO ... SELECT ... " in the same query, which is considered bad practice as it introduces locks that can be avoided if the SELECT is done separately (we have now fixed this).
      • We had also configured the nodes with "innodb_locks_unsafe_for_binlog=1".

      We have kept a copy of the 12GB ibdata1 file for further investigation, but we are not sure about what to investigate. Maybe, there is some repeatability of data that could make us think that an endless loop wrote 12GB of data before eating all space, or maybe there is something else.

      We will really appreciate it if we get any advice on this by innodb / xtradb experts out there.

      Thank you in advance.

        Gliffy Diagrams

          Attachments

            Activity

            Hide
            elenst Elena Stepanova added a comment -

            Jan Lindström,

            Could you please take a look at this and maybe advise on how to proceed with the investigation more efficiently.

            Show
            elenst Elena Stepanova added a comment - Jan Lindström , Could you please take a look at this and maybe advise on how to proceed with the investigation more efficiently.
            Hide
            jplindst Jan Lindström added a comment -

            See http://blog.jcole.us/2014/04/16/the-basics-of-the-innodb-undo-logging-and-history-system/.

            While InnoDB purge does freeing undo log pages and unlinking them from the global history list to make them available for re-use. They remain on ibdata file. These free pages are never currently returned to OS and naturally there are several situations where these pages can't be really re-used e.g. on situation where new undo pages are bigger than fragmented datafile pages.

            I do not know any means to really investigate after something mysterious has happened, only way is to monitor current disk usage and if some unexplained allocations are done see what operations really did cause that.

            Show
            jplindst Jan Lindström added a comment - See http://blog.jcole.us/2014/04/16/the-basics-of-the-innodb-undo-logging-and-history-system/ . While InnoDB purge does freeing undo log pages and unlinking them from the global history list to make them available for re-use. They remain on ibdata file. These free pages are never currently returned to OS and naturally there are several situations where these pages can't be really re-used e.g. on situation where new undo pages are bigger than fragmented datafile pages. I do not know any means to really investigate after something mysterious has happened, only way is to monitor current disk usage and if some unexplained allocations are done see what operations really did cause that.
            Hide
            elenst Elena Stepanova added a comment -

            Jan Lindström,

            According to the initial description, the mysteriously big ibdata file is available. Is there any way to analyze its contents to see what really takes all the space, whether it the undo pages or anything else?

            Show
            elenst Elena Stepanova added a comment - Jan Lindström , According to the initial description, the mysteriously big ibdata file is available. Is there any way to analyze its contents to see what really takes all the space, whether it the undo pages or anything else?
            Hide
            Ernest0x Petros Moisiadis added a comment -

            As we have already reported, 97% of pages consists of undo pages indeed, but the size of the data is too big to have been produced by normal usage. So, we suspect that something has stuck and caused an abnormal amount of undo logs to be written. Since an infinite loop could lead to a lot of undo logs written with the same data, we may need to analyze that data for repeating patterns and then see if there is a code path that could lead to such loop...

            It's also important to note that this happened only few days after we intorduced the "INSERT INTO ... SELECT ..." issue in our code and has not happened again since fixing this. So, it is very likely that the bug was triggered by our, at that time, unoptimized code.

            Show
            Ernest0x Petros Moisiadis added a comment - As we have already reported, 97% of pages consists of undo pages indeed, but the size of the data is too big to have been produced by normal usage. So, we suspect that something has stuck and caused an abnormal amount of undo logs to be written. Since an infinite loop could lead to a lot of undo logs written with the same data, we may need to analyze that data for repeating patterns and then see if there is a code path that could lead to such loop... It's also important to note that this happened only few days after we intorduced the "INSERT INTO ... SELECT ..." issue in our code and has not happened again since fixing this. So, it is very likely that the bug was triggered by our, at that time, unoptimized code.

              People

              • Assignee:
                elenst Elena Stepanova
                Reporter:
                Ernest0x Petros Moisiadis
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated: