Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-3708

LP:928048 - Query containing IN subquery with OR in the where clause returns a wrong result

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      A query with IN subquery that can be converted to a semi-join may return a wrong result in maridb-5.3 if the where clause of the subquery contains OR condition.

      The following test case provides such a query.

      create table t1 (a int, b int);
      insert into t1 values (7,5), (3,3), (5,4), (9,3);

      create table t2 (a int, b int, index i_a(a));

      insert into t2 values
      (4,2), (7,9), (7,4), (3,1), (5,3), (3,1), (9,4), (8,1);

      set optimizer_switch='semijoin=on,materialization=on';
      select * from t1 where t1.a in (select a from t2 where t2.a=7 or t2.b<=1);

      The query in from the test case returns a wrong result if the optimizer switch flags 'semijoin' and 'materialization' are set to 'on', a it returns the correct answer if these flags are set to 'off'.

      MariaDB [test]> set optimizer_switch='semijoin=on,materialization=on';
      Query OK, 0 rows affected (0.00 sec)

      MariaDB [test]> select * from t1 where t1.a in (select a from t2 where t2.a=7 or t2.b<=1);
      ----------+

      a b

      ----------+

      7 5
      3 3
      5 4
      9 3

      ----------+
      4 rows in set (0.00 sec)

      MariaDB [test]> set optimizer_switch='semijoin=off,materialization=off';
      Query OK, 0 rows affected (0.00 sec)

      MariaDB [test]> select * from t1 where t1.a in (select a from t2 where t2.a=7 or t2.b<=1);
      ----------+

      a b

      ----------+

      7 5
      3 3

      ----------+
      2 rows in set (0.00 sec)

      The warning returned by EXPLAIN EXTENDED executed for the query with
      optimizer_switch set to 'semijoin=on,materialization=on'
      shows that it happens because in this case the optimizer generates an invalid execution plan:

      MariaDB [test]> set optimizer_switch='semijoin=on,materialization=on';
      Query OK, 0 rows affected (0.00 sec)

      MariaDB [test]> explain extended
      -> select * from t1 where t1.a in (select a from t2 where t2.a=7 or t2.b<=1);
      --------------------------------------------------------------------------------------

      id select_type table type possible_keys key key_len ref rows filtered Extra

      --------------------------------------------------------------------------------------

      1 PRIMARY t1 ALL NULL NULL NULL NULL 4 100.00  
      1 PRIMARY <subquery2> eq_ref distinct_key distinct_key 5 func 1 100.00  
      2 MATERIALIZED t2 ALL i_a NULL NULL NULL 8 100.00  

      --------------------------------------------------------------------------------------
      3 rows in set, 1 warning (0.00 sec)

      MariaDB [test]> show warnings;
      ------------------------------------------------------------------------------------------------------------------------------------------------------------------

      Level Code Message

      ------------------------------------------------------------------------------------------------------------------------------------------------------------------

      Note 1003 select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t1` semi join (`test`.`t2`) where (((`test`.`t1`.`a` = 7) or (`test`.`t2`.`b` <= 1)))

      ------------------------------------------------------------------------------------------------------------------------------------------------------------------
      1 row in set (0.00 sec)

        Gliffy Diagrams

          Attachments

            Activity

            Hide
            psergey Sergei Petrunia added a comment -

            Re: Query containing IN subquery with OR in the where clause returns a wrong result
            == Analysis ==

            Equality propagation converts the WHERE clause into this:

            (multiple equal(7, t1.a, t2.a) or (t2.b <= 1))
            and
            multiple equal(t1.a, t2.a)

            This is ok.

            Then, equality substitution produces this WHERE clause:

            (t1.a = 7) or (t2.b <= 1)

            we dont expect this kind of WHERE clauses to be produced when
            SJ-Materialization-Lookup strategy is used.

            With that strategy, we expect that the WHERE clause can be broken into two
            AND-parts:

            • Part#1. depend only on SJ-inner tables
            • Part#2. depend only on SJ-outer tables

            The only thing joining the two parts is the IN-equality. IN-equality is not
            put into the WHERE condition, it is generated and checked inside the
            SJ-Materialization code.

            However, make_join_select() gets this clause:

            (t1.a = 7) or (t2.b <= 1)

            which can only be checked when one has both t1.a and t2.b. This never happens,
            so make_join_select() is unable to attach this condition anywhere, and it is
            never checked.

            Show
            psergey Sergei Petrunia added a comment - Re: Query containing IN subquery with OR in the where clause returns a wrong result == Analysis == Equality propagation converts the WHERE clause into this: (multiple equal(7, t1.a, t2.a) or (t2.b <= 1)) and multiple equal(t1.a, t2.a) This is ok. Then, equality substitution produces this WHERE clause: (t1.a = 7) or (t2.b <= 1) we dont expect this kind of WHERE clauses to be produced when SJ-Materialization-Lookup strategy is used. With that strategy, we expect that the WHERE clause can be broken into two AND-parts: Part#1. depend only on SJ-inner tables Part#2. depend only on SJ-outer tables The only thing joining the two parts is the IN-equality. IN-equality is not put into the WHERE condition, it is generated and checked inside the SJ-Materialization code. However, make_join_select() gets this clause: (t1.a = 7) or (t2.b <= 1) which can only be checked when one has both t1.a and t2.b. This never happens, so make_join_select() is unable to attach this condition anywhere, and it is never checked.
            Hide
            igor Igor Babaev added a comment -

            Re: Query containing IN subquery with OR in the where clause returns a wrong result
            Some fix of this bug was pushed into mysql-5.6 (see the fix for bug #13414014).
            I did not review this fix, so I cannot say anything about about its validity.

            Show
            igor Igor Babaev added a comment - Re: Query containing IN subquery with OR in the where clause returns a wrong result Some fix of this bug was pushed into mysql-5.6 (see the fix for bug #13414014). I did not review this fix, so I cannot say anything about about its validity.
            Hide
            psergey Sergei Petrunia added a comment -

            Re: Query containing IN subquery with OR in the where clause returns a wrong result
            == Analysis continued ==

            Equality propagation assumes certain ordering of tables, and tries to
            move make all substitutions to evaluate as early as possible.

            Graphically, if we draw join order like this:

            >tb1->-tbl2->-tbl3-...

            equality propagation will try pushing conditions to the left. The problem
            with semi-joins (or bushy joins in general) is that there is no global
            ordering anymore.

            If we take a query

            select * from ot1,ot2,ot3,ot4... where col in (select ... from it1, it2)

            and its SJM query plan:

            >ot1-ot2>------------(sjm)>ot3--ot4-....

            it1>it2>+

            then one can push the condition left along the top line through tables otX, or
            go down to tables itX, and follow to the left along the lower level.

            What equality substitution currently does is to assume the ordering of
            (ot1, ot2, it1, it2, sjm, ot3, ...) and mostly ignore the fact that there are
            multiple ways to go to the left.

            This almost works, because materialization is only applied to un-
            correlated queries, which gives us warranty that there are no expressions
            that refer to both an otX.col and an itY.col.
            The only exception is the IN-equality, which usually has the form of

            otX.col=itY.col

            presense of equalities of this form changes a lot.

            Show
            psergey Sergei Petrunia added a comment - Re: Query containing IN subquery with OR in the where clause returns a wrong result == Analysis continued == Equality propagation assumes certain ordering of tables, and tries to move make all substitutions to evaluate as early as possible. Graphically, if we draw join order like this: > tb1 - > - tbl2 - > - tbl3 -... equality propagation will try pushing conditions to the left. The problem with semi-joins (or bushy joins in general) is that there is no global ordering anymore. If we take a query select * from ot1,ot2,ot3,ot4... where col in (select ... from it1, it2) and its SJM query plan: > ot1 - ot2 >------------ (sjm) > ot3 -- ot4 -.... it1 > it2 > + then one can push the condition left along the top line through tables otX, or go down to tables itX, and follow to the left along the lower level. What equality substitution currently does is to assume the ordering of (ot1, ot2, it1, it2, sjm, ot3, ...) and mostly ignore the fact that there are multiple ways to go to the left. This almost works, because materialization is only applied to un- correlated queries, which gives us warranty that there are no expressions that refer to both an otX.col and an itY.col. The only exception is the IN-equality, which usually has the form of otX.col=itY.col presense of equalities of this form changes a lot.
            Hide
            psergey Sergei Petrunia added a comment -

            Re: Query containing IN subquery with OR in the where clause returns a wrong result
            === Deficiency ===

            First, the approach "move each condition as much as possible to the left"
            doesn't guarantee maximum possible filtering anymore. When one is following left
            and he reaches the (sjm) split, he should not take one way. He should use both!

            Consider the queries:

            Q1 select * from t1 where t1.col in (select t2.col from t2 where cond(t2.col))
            Q2 select * from t1 where t1.col in (select t2.col from t2) and cond(t1.col)

            Suppose they both are executed with this query plan:

            id select_type table type
            1 PRIMARY t1 ALL
            1 PRIMARY <subquery2> eq_ref
            2 MATERIALIZED t2 ALL

            it is apparent that, for both queries

            1. cond(t1.col) should be attached to table t1
            2. cond(t2.col) should be attached to table t2

            yet, current equality propagation/substituion scheme will not do that, at least
            not when 'cond' is a condition other than equality.

            Fixing this is a non-trivial though, because the fix will need to change the
            whole concept of equality substition from the

            walk through the WHERE clause and substitute certain occurences of
            "tblX.colA" for "tblY.colB"

            to something else. (the above definition doesn't cover substitution of
            Item_equal objects)

            Show
            psergey Sergei Petrunia added a comment - Re: Query containing IN subquery with OR in the where clause returns a wrong result === Deficiency === First, the approach "move each condition as much as possible to the left" doesn't guarantee maximum possible filtering anymore. When one is following left and he reaches the (sjm) split, he should not take one way. He should use both! Consider the queries: Q1 select * from t1 where t1.col in (select t2.col from t2 where cond(t2.col)) Q2 select * from t1 where t1.col in (select t2.col from t2) and cond(t1.col) Suppose they both are executed with this query plan: id select_type table type 1 PRIMARY t1 ALL 1 PRIMARY <subquery2> eq_ref 2 MATERIALIZED t2 ALL it is apparent that, for both queries 1. cond(t1.col) should be attached to table t1 2. cond(t2.col) should be attached to table t2 yet, current equality propagation/substituion scheme will not do that, at least not when 'cond' is a condition other than equality. Fixing this is a non-trivial though, because the fix will need to change the whole concept of equality substition from the walk through the WHERE clause and substitute certain occurences of "tblX.colA" for "tblY.colB" to something else. (the above definition doesn't cover substitution of Item_equal objects)
            Hide
            psergey Sergei Petrunia added a comment -

            Re: Query containing IN subquery with OR in the where clause returns a wrong result
            === Fixing the wrong query result ===

            Let's set a more modest goal of

            • not producing wrong query results
            • not generating regressions wrt non-semijoin materialization

            The idea is that given this chart

            >ot1-ot2>------------(sjm)>ot3--ot4-....

            it1>it2>+

            we

            "must not attempt to move a condition from one line to another"

            e.g. don't move the condition from the top line into the bottom or vice
            versa. (This bug is a result of us moving one part of OR-clause from
            bottom to the top).

            The implementation of is:

            • Don't substitute SJ-inner columns for SJ-outer, or vice versa (however it
              is always okay to substitute with references to const tables)
            • When exploding multiple-equalities:

            if (equality has a const item)

            { for each equality member tblX.colY produce "tblX.colY=const_item" }

            else
            {
            for X in

            {top_level, each semi-join nest}

            { FIRST= first member in X; for each other element OTHER produce "FIRST=OTHER" }

            }

            As for not producing equalities that were "fixed on the upper level":
            don't produce them only as long as they would have been produced at this
            level.

            Show
            psergey Sergei Petrunia added a comment - Re: Query containing IN subquery with OR in the where clause returns a wrong result === Fixing the wrong query result === Let's set a more modest goal of not producing wrong query results not generating regressions wrt non-semijoin materialization The idea is that given this chart > ot1 - ot2 >------------ (sjm) > ot3 -- ot4 -.... it1 > it2 > + we "must not attempt to move a condition from one line to another" e.g. don't move the condition from the top line into the bottom or vice versa. (This bug is a result of us moving one part of OR-clause from bottom to the top). The implementation of is: Don't substitute SJ-inner columns for SJ-outer, or vice versa (however it is always okay to substitute with references to const tables) When exploding multiple-equalities: if (equality has a const item) { for each equality member tblX.colY produce "tblX.colY=const_item" } else { for X in {top_level, each semi-join nest} { FIRST= first member in X; for each other element OTHER produce "FIRST=OTHER" } } As for not producing equalities that were "fixed on the upper level": don't produce them only as long as they would have been produced at this level.
            Hide
            ratzpo Rasmus Johansson added a comment -

            Launchpad bug id: 928048

            Show
            ratzpo Rasmus Johansson added a comment - Launchpad bug id: 928048

              People

              • Assignee:
                psergey Sergei Petrunia
                Reporter:
                igor Igor Babaev
              • Votes:
                0 Vote for this issue
                Watchers:
                0 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: