Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-688

LP:882833 - Query with (a is null or a in (1)) sometimes returns only rows with a = 1

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      I've tried very hard to reproduce this on a test set, or reliably, but I cannot. I spoke with Monty in IRC yesterday, he advised I use a debug build since I could create a test case. That option failed as well - with the debug build the result is reliable.

      I have three mariadb servers all of the same build (3249) all running repl and built from snapshots of 5.5.5-m3-log data. The problem persists after a optimize table t1 on all three servers.

      The table is similar to:

      CREATE TABLE `t1` (
      `a` bigint(20) NOT NULL DEFAULT '0',
      `b` varchar(512) DEFAULT NULL,
      `c` int(10) DEFAULT NULL,
      `d` tinyint(1) DEFAULT NULL,
      `e` int(10) DEFAULT NULL,
      `f` tinyint(1) DEFAULT NULL,
      `g` bigint(20) DEFAULT NULL,
      PRIMARY KEY (`a`),
      UNIQUE KEY (`a`,`d`),
      UNIQUE KEY (`a`,`d`,`e`),
      KEY `h` (`c`),
      KEY `i` (`e`,`b`),
      KEY `j` (`b`,`d`,`e`),
      KEY `k` (`b`),
      KEY `l` (`g`),
      KEY `m` (`e`,`d`,`c`,`b`) USING BTREE,
      KEY `n` (`e`,`d`,`g`,`c`) USING BTREE,
      KEY `o` (`f`,`d`,`e`,`b`)
      ) ENGINE=InnoDB DEFAULT CHARSET=latin1

      The query is similar to:

      select a from t1 where a in ( select a from t1 where e = 5 and f = 0 ) and e = 5 and f = 0 and ( g is null or g in (1) ) order by b limit 10;

      There are 35m records in the table. There should be 150,000 records returned by this query. There are usually 0-4 of these records where g = 1 when I see the problem manifest. The rest are null.

      The problem appears about 80% of the time, and when it does it only returns the rows where g = 1, it does not honor the g is null portion.

      The problem disappears if I write

      ( g is null or g in (1, null))

      I am fairly certain order by related. The problem has never occurred with no order by.

      I am confident limit is not related. Monty asked me to test this. If I take limit off, it will only return those rows where g = 1.

      I am also confident this is not related to optimizer_switch. I shut all of them off (and alternated with in_to_exists and materialization, where appropriate) and was able to reliably reproduce regardless of any optimizer_switch setting.

      I am also confident query_cache_type is not related. The problem manifests regardless of this setting.

      I have worked around it by using:

      ( g is null or g in (1, null))

      for now. But I did want to open a bug for future reference. I am willing to assist in many ways, but I cannot share my schema or data.

        Gliffy Diagrams

          Attachments

            Activity

            Hide
            philipstoev Philip Stoev added a comment -

            Re: Query with (a is null or a in (1)) sometimes returns only rows with a = 1
            Hello,

            Thank you very much for your bug report. Can you please provide the EXPLAIN of the problematic query? If possible, please run EXPLAIN on the same query several dozen times and report all the different EXPLAINs that you see.

            Thank you.

            Show
            philipstoev Philip Stoev added a comment - Re: Query with (a is null or a in (1)) sometimes returns only rows with a = 1 Hello, Thank you very much for your bug report. Can you please provide the EXPLAIN of the problematic query? If possible, please run EXPLAIN on the same query several dozen times and report all the different EXPLAINs that you see. Thank you.
            Hide
            fimbulvetr Dan Vande More added a comment -

            Re: Query with (a is null or a in (1)) sometimes returns only rows with a = 1
            I had the wrong query on my initial report. Anonymizing the columns messed with me a bit. Here is the correct one:

            select a from t1 where a in ( select a from t1 where e = 5 and d = 0 ) and e = 5 and d = 0 and ( g is null or g in (1) ) order by b limit 11;

            (The change is that now we use d instead of f).

            I have inserted some data from my prod table into t1, and ran the query - the explain I paste below exactly matches the explain from the production, failing dataset (With the exception of # rows for the subquery - in the one pasted they both list 222k rows - in the production data set, the subquery rows is 1/3 of the primary's rows, if that matters).

            explain select a from t1 where a in ( select a from t1 where e = 5 and d = 0 ) and e = 5 and d = 0 and ( g is null or g in (1) ) order by b limit 11;

            -----------------------------------------------------------------------------------------------------------------------------+

            id select_type table type possible_keys key key_len ref rows Extra

            -----------------------------------------------------------------------------------------------------------------------------+

            1 PRIMARY t1 ref_or_null i,l,m,n n 16 const,const,const 221063 Using index condition; Using filesort
            2 SUBQUERY t1 ref PRIMARY,i,j,m,n,o,a,a_2 n 7 const,const 221062 Using index

            -----------------------------------------------------------------------------------------------------------------------------+

            Note there is something else peculiar that I did not report initially but now see as a correlation: When the query executes correctly, I get results back in 1.61 or so seconds. When it executes incorrectly, the results come in about .20 seconds.

            Whether the query works or not appears to still be random at this point, but the explain plan never changes regardless of it being immediately before/after a a correct/incorrect result. I've executed the explain immediately after & before the query countless times.

            On a cold start, this query when incorrect takes 1.4s. On a cold start this query when correct takes 1m21s (15GB Table).

            This production data table is constantly being inserted/updated (Never deleted) by way of replication. It follows a diurnal pattern, with peaks about 40 inserts a second and updates about 20 per second. The incorrect result has been spotted regardless of the #inserts/updates a second.

            Show
            fimbulvetr Dan Vande More added a comment - Re: Query with (a is null or a in (1)) sometimes returns only rows with a = 1 I had the wrong query on my initial report. Anonymizing the columns messed with me a bit. Here is the correct one: select a from t1 where a in ( select a from t1 where e = 5 and d = 0 ) and e = 5 and d = 0 and ( g is null or g in (1) ) order by b limit 11; (The change is that now we use d instead of f). I have inserted some data from my prod table into t1, and ran the query - the explain I paste below exactly matches the explain from the production, failing dataset (With the exception of # rows for the subquery - in the one pasted they both list 222k rows - in the production data set, the subquery rows is 1/3 of the primary's rows, if that matters). explain select a from t1 where a in ( select a from t1 where e = 5 and d = 0 ) and e = 5 and d = 0 and ( g is null or g in (1) ) order by b limit 11; --- ----------- ----- ----------- ----------------------- ---- ------- ----------------- ------ --------------------------------------+ id select_type table type possible_keys key key_len ref rows Extra --- ----------- ----- ----------- ----------------------- ---- ------- ----------------- ------ --------------------------------------+ 1 PRIMARY t1 ref_or_null i,l,m,n n 16 const,const,const 221063 Using index condition; Using filesort 2 SUBQUERY t1 ref PRIMARY,i,j,m,n,o,a,a_2 n 7 const,const 221062 Using index --- ----------- ----- ----------- ----------------------- ---- ------- ----------------- ------ --------------------------------------+ Note there is something else peculiar that I did not report initially but now see as a correlation: When the query executes correctly, I get results back in 1.61 or so seconds. When it executes incorrectly, the results come in about .20 seconds. Whether the query works or not appears to still be random at this point, but the explain plan never changes regardless of it being immediately before/after a a correct/incorrect result. I've executed the explain immediately after & before the query countless times. On a cold start, this query when incorrect takes 1.4s. On a cold start this query when correct takes 1m21s (15GB Table). This production data table is constantly being inserted/updated (Never deleted) by way of replication. It follows a diurnal pattern, with peaks about 40 inserts a second and updates about 20 per second. The incorrect result has been spotted regardless of the #inserts/updates a second.
            Hide
            philipstoev Philip Stoev added a comment -

            Re: Query with (a is null or a in (1)) sometimes returns only rows with a = 1
            Hello,

            Thank you very much for the additional information. Can you please try disabling the index condition pushdown using the following option:

            SET GLOBAL optimizer_switch='index_condition_pushdown=off';
            SET SESSION optimizer_switch='index_condition_pushdown=off';

            And check if the issue is still there?

            Show
            philipstoev Philip Stoev added a comment - Re: Query with (a is null or a in (1)) sometimes returns only rows with a = 1 Hello, Thank you very much for the additional information. Can you please try disabling the index condition pushdown using the following option: SET GLOBAL optimizer_switch='index_condition_pushdown=off'; SET SESSION optimizer_switch='index_condition_pushdown=off'; And check if the issue is still there?
            Hide
            philipstoev Philip Stoev added a comment -

            Re: Query with (a is null or a in (1)) sometimes returns only rows with a = 1
            I have tried to reproduce your bug using your schema and synthetic data, but I could not. I get the exact same EXPLAIN as you do, however the result is consistent for me.

            So, it seems to me that the bug is caused by something in your exact data and the distribution of its values. Since you have already obfuscated the names of the columns and tables, if you also drop the varchar column (and the bug remains repeatable), the table will contain integers only.

            If this alleviates your privacy concerns, would you be willing to provide the obfuscated data set? We can arrange a secure transfer method and we will only use it to debug the problem and will not share it with anyone else.

            Show
            philipstoev Philip Stoev added a comment - Re: Query with (a is null or a in (1)) sometimes returns only rows with a = 1 I have tried to reproduce your bug using your schema and synthetic data, but I could not. I get the exact same EXPLAIN as you do, however the result is consistent for me. So, it seems to me that the bug is caused by something in your exact data and the distribution of its values. Since you have already obfuscated the names of the columns and tables, if you also drop the varchar column (and the bug remains repeatable), the table will contain integers only. If this alleviates your privacy concerns, would you be willing to provide the obfuscated data set? We can arrange a secure transfer method and we will only use it to debug the problem and will not share it with anyone else.
            Hide
            fimbulvetr Dan Vande More added a comment -

            Re: Query with (a is null or a in (1)) sometimes returns only rows with a = 1
            Hello,

            The problem exists with or without index condition pushdown. I actually have another wrong result bug related to index_condition_pushdown=on on the same table but am in the same situation - unable to synthesize failing data.

            As far as the varchar - I cannot remove this column as it is the order by. The problem disappears when I remove the order by or order by an alternate column. (This may be very valuable information - to repeat, the problem only appears if I sort by this varchar). The problem remains whichever direction I sort by. I do not believe I will have time today or tomorrow to continue working on this but middle of this week I will attempt again to synthesize data. I hope you can help me come up with ideas on how to do this.

            Show
            fimbulvetr Dan Vande More added a comment - Re: Query with (a is null or a in (1)) sometimes returns only rows with a = 1 Hello, The problem exists with or without index condition pushdown. I actually have another wrong result bug related to index_condition_pushdown=on on the same table but am in the same situation - unable to synthesize failing data. As far as the varchar - I cannot remove this column as it is the order by. The problem disappears when I remove the order by or order by an alternate column. (This may be very valuable information - to repeat, the problem only appears if I sort by this varchar). The problem remains whichever direction I sort by. I do not believe I will have time today or tomorrow to continue working on this but middle of this week I will attempt again to synthesize data. I hope you can help me come up with ideas on how to do this.
            Hide
            elenst Elena Stepanova added a comment -

            Re: Query with (a is null or a in (1)) sometimes returns only rows with a = 1
            Hi,

            Have you observed the problem ever since, or is it gone?

            Show
            elenst Elena Stepanova added a comment - Re: Query with (a is null or a in (1)) sometimes returns only rows with a = 1 Hi, Have you observed the problem ever since, or is it gone?
            Hide
            ratzpo Rasmus Johansson added a comment -

            Launchpad bug id: 882833

            Show
            ratzpo Rasmus Johansson added a comment - Launchpad bug id: 882833

              People

              • Assignee:
                Unassigned
                Reporter:
                fimbulvetr Dan Vande More
              • Votes:
                0 Vote for this issue
                Watchers:
                0 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: