Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-5470

Cost-based subquery item pushdown must take into account semi-joins

    Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Cost-based subquery item pushdown is limited by semi-joins. Pasting from an earlier email to maria-developers@:

      Hi!

      The below is an extension of the skype discussion. I wrote down the limitation
      imposed by FirstMatch, and also covered other Semi-join strategies.

      <contents>
      1. SJ-Materialization
      2. FirstMatch
      3. Duplicate Elimination
      4. Loose Scan
      </contents>

      == 1. SJ-Materialization ==

      The general rule is:

      If the condition is originally attached inside an SJM nest, one should
      not move it outside. Conditions that are originally outside should not be
      moved inisde the SJM nest.

      The reason for the rule is that field values for inner tables are only
      available inside the SJM nest. Correspondingly, values of fields of outer
      tables are not available at materialization step.

      The only possible exception is when the condition depends only on columns that
      are in the IN-equality. Then, it could be moved. I don't think this is worth
      handling for the first milestone.

      == 2. FirstMatch ==

      Consider a query

      select ...
      from ot1, nt1, nt2
      where expr(ot1) in (select ... from it1, it2, ... where ...)

      and a join order

      ot1 FirstMatch(it1, it2) nt1 nt2

      It is possible to move cond(otX) to table itX or ntX.

      There is a problem with moving condition attached to itX (will call it
      SUBQ_COND. this ocndition depends on table itX and maybe on ot1) to table ntX.

      Consider join execution execution:
      ot1.row0
      ot1.row1 + it1.row1 – it2.row1

      +- it1.row2 – it2.row2 > (F1)->- nt1.row1
      (F2)-<-

      FirstMatch check is at point (F2). At that point, we know that ot1.row1
      has had a match in the subquery, and the execution returns to table ot1.

      If SUBQ_COND is moved to table ntX, then there may be a situation where
      we will get to point (F2) and assume that the subquery had a match, while
      actually it hasn't.

      A suggestion by Igor: at point (F2) we should check the last return value
      of SUBQ_COND (actually, of all conditions that were moved like SUBQ_COND was)

      It is actually more complicated: what if there is COND(ot1, nt1) such that
      COND(ot1.ro1, nt1.row1) = FALSE? Then, we will never evaluate
      SUBQ_COND(ot1.row1, ...).
      What if never evaluate SUBQ_COND(ot1.row1, ...) but before we have evaluated
      SUBQ_COND(ot1.row0, ...)? We will have to track what we have evaluated for
      the current row of ot1.row1.

      == 3. Duplicate Elimination ==

      Duplicate Elimination removes duplicate matches using a temporary table.
      In order to work correctly, filtering needs to be performed before the weedout
      is done.

      Let's again use an example.

      select ...
      from ot1, nt1, nt2
      where expr(ot1) in (select ... from it1, it2, ... where ...)

      and a join order

      (DUPS-W-START) it1 ot1 it2 (DUPS-W-WEEDOUT) nt1 nt2

      at DUPS-W-START, all rows are deleted from the temp. table.
      at DUPS-W-WEEDOUT, we take ot1.rowid and attempt to write it into the temp
      table.

      Basic idea: If a condition SUBQ_COND(itX, ...) is moved to the right across
      the (DUPS-W-WEEDOUT) point, wrong query results will occur.

      Example: Suppose, SUBQ_COND(it2) is moved to nt1, and join execution
      proceeds as follows:

      it1.row1 — ot1.row1 — it2.row1 – DUPS-W-WEEDOUT – nt1

      • We reach DUPS-W-WEEDOUT with (it1.row1, ot1.row1, it2.row1).
      • The temp. table is empty, so rowid(ot1.row1) is written into temp.table,
        and Weedout doesn't weed it out.
      • At nt1, SUBQ_COND(it2.row1) is evaluated and found to be false.
        join execution returns back.
      • Suppose, tables it2 and ot1 have no more matching rows, so execution returns
        to it1.

      (the same diagram with a few more steps)

      it1.row1 — ot1.row1 — it2.row1 – DUPS-W-WEEDOUT – nt1

      it1.row2 — ot1.row1 — it2.row2 – DUPS-W-WEEDOUT

      • We reach DUPS-W-WEEDOUT with (it1.row2, ot1.tow1, it2.rowX).
      • The temptable already has rowid(ot1.row1). The record combination is filtered
        out. if SUBQ_COND(it2.row2)=TRUE, then the result will miss a row.

      == 4. Loose Scan ==

      Loose Scan is similar to First Match - it does short-cutting (but also makes
      forward jumps on the driving table).

      Short-cutting imposes a limitation similar to FirstMatch - conditions that
      depend on subquery's tables cannot be moved out of LooseScan's range.

        Gliffy Diagrams

          Attachments

            Issue Links

              Activity

              Hide
              psergey Sergei Petrunia added a comment -

              On Dec, 19th, I have committed a patch that limits moving of predicates as described in the above spec.

              The patch only affects the execution part (where Item* objects are wrapped and positioned in join_tab[i]->select_cond. The optimizer part (which makes the join optimizer take into account cost of attached predicates) works as if there were no limits.

              Show
              psergey Sergei Petrunia added a comment - On Dec, 19th, I have committed a patch that limits moving of predicates as described in the above spec. The patch only affects the execution part (where Item* objects are wrapped and positioned in join_tab [i] ->select_cond. The optimizer part (which makes the join optimizer take into account cost of attached predicates) works as if there were no limits.
              Hide
              psergey Sergei Petrunia added a comment -

              The second part of the task is to account for predicate moving limits in the join optimizer.

              The basic idea is: when the optimizer calculates join prefix cost, it should take into account the limits on predicate pushdown that are imposed by semi-join strategies.

              Show
              psergey Sergei Petrunia added a comment - The second part of the task is to account for predicate moving limits in the join optimizer. The basic idea is: when the optimizer calculates join prefix cost, it should take into account the limits on predicate pushdown that are imposed by semi-join strategies.
              Hide
              psergey Sergei Petrunia added a comment -

              The apparent solution would be to amend JOIN::static_pushdown_cost() so that it doesn't try to position expensive predicates where it is not allowed.

              A deficiency of this approach (and a possible problem): static_pushdown_cost() is called after the join order has been constructed, and after the choice has been made about which semi-join strategy to use.

              Suppose, there is a certain join order J_ORD1, for which there are two possible semi-join strategies, SJ1 and SJ2. Suppose, the optimizer picks SJ1 (without taking static pushdown cost into account).
              Now, when static_pushdown_cost() is called for the

              {J_ORD1, SJ1}

              it will see that this choice is expensive, but there will be no way to return to consider

              {J_ORD1, SJ2}

              .

              Show
              psergey Sergei Petrunia added a comment - The apparent solution would be to amend JOIN::static_pushdown_cost() so that it doesn't try to position expensive predicates where it is not allowed. A deficiency of this approach (and a possible problem): static_pushdown_cost() is called after the join order has been constructed, and after the choice has been made about which semi-join strategy to use. Suppose, there is a certain join order J_ORD1, for which there are two possible semi-join strategies, SJ1 and SJ2. Suppose, the optimizer picks SJ1 (without taking static pushdown cost into account). Now, when static_pushdown_cost() is called for the {J_ORD1, SJ1} it will see that this choice is expensive, but there will be no way to return to consider {J_ORD1, SJ2} .
              Hide
              psergey Sergei Petrunia added a comment -

              I am not sure if the above problem affects the primary example of MDEV-83. I suppose it might, because I remember a discussion with Timour about the example of MDEV-83, where the problem was that semi-join optimizer chose to use Materialization, and that was a particularly bad choice, because Materialization enumerates all possible matches that subquery could have, which causes more invocations of the expensive scalar-context subquery that was attached to one of the tables inside the SJ-Materialization nest.

              IIRC the problem was solved by taking into account the cost of expensive scalar-context subquery in the join optimizer.

              Show
              psergey Sergei Petrunia added a comment - I am not sure if the above problem affects the primary example of MDEV-83 . I suppose it might, because I remember a discussion with Timour about the example of MDEV-83 , where the problem was that semi-join optimizer chose to use Materialization, and that was a particularly bad choice, because Materialization enumerates all possible matches that subquery could have, which causes more invocations of the expensive scalar-context subquery that was attached to one of the tables inside the SJ-Materialization nest. IIRC the problem was solved by taking into account the cost of expensive scalar-context subquery in the join optimizer.
              Hide
              psergey Sergei Petrunia added a comment -

              Another possible angle (discussed with Timour, w/o answer): What does current MDEV-83 code do for SJ-Materialization nests? Does optimize_semijoin_nests() include the cost of attached subquery predicates?

              If it does, then SJ-Materialization is heavily penalized, because advance_sj_state() will compare SJ-Materialization costs which include expensive-predicate-cost, with e.g. FirstMatch costs which do not include expensive-predicate-cost.

              Show
              psergey Sergei Petrunia added a comment - Another possible angle (discussed with Timour, w/o answer): What does current MDEV-83 code do for SJ-Materialization nests? Does optimize_semijoin_nests() include the cost of attached subquery predicates? If it does, then SJ-Materialization is heavily penalized, because advance_sj_state() will compare SJ-Materialization costs which include expensive-predicate-cost, with e.g. FirstMatch costs which do not include expensive-predicate-cost.
              Hide
              psergey Sergei Petrunia added a comment -

              Did some debugging to answer the question in the last comment: No, optimize_semijoin_nests() does not include costs of expensive predicates into costs of SJ-Materialization.

              This way, the concern of SJ-Materialization being unfairly penalized by advance_sj_state() is invalid.

              However, the concern about semi-join strategy choice remains. Current code will have this problem:
              1. A join order J_ORD is built
              2. SJ-Materialization is chosen (and FirstMatch is not, for example).
              3. static_subquery_cost() looks at cost J_ORD+SJ_Materialization, finds it to be expensive, and J_ORD is discarded.
              the fact that J_ORD+FirstMatch might have lower cost is ignored.

              Show
              psergey Sergei Petrunia added a comment - Did some debugging to answer the question in the last comment: No, optimize_semijoin_nests() does not include costs of expensive predicates into costs of SJ-Materialization. This way, the concern of SJ-Materialization being unfairly penalized by advance_sj_state() is invalid. However, the concern about semi-join strategy choice remains. Current code will have this problem: 1. A join order J_ORD is built 2. SJ-Materialization is chosen (and FirstMatch is not, for example). 3. static_subquery_cost() looks at cost J_ORD+SJ_Materialization, finds it to be expensive, and J_ORD is discarded. the fact that J_ORD+FirstMatch might have lower cost is ignored.
              Hide
              psergey Sergei Petrunia added a comment -

              A possible solution for this is to take into account expensive-predicate-cost
              as early as possible.

              Suppose there EXPENSIVE_COND(tX, tY, tZ). Then, the optimizer should:

              • as soon as tables tX, tY and tZ are in the join prefix, add cost of
                EXPENSIVE_COND into the join prefix.
              • When a change is made to the query plan (a table is added, or a semi-join
                strategy is applied),check if EXPENSIVE_COND should/can be moved, and if yes,
                move it and adjust the cost.

              This approach will let static-condition-pushdown to affect the choice of
              semi-join strategies.

              Possible issues:

              • This approach requires many more static_subquery_cost() invocations.
              • Does this work well with optimizer plan pruning? We have the following new
                property: adding a table can allow to move an EXPENSIVE_COND further in the
                plan, where it will be invoked fewer times, and the total cost will get
                cheaper. I suspect optimizer plan pruning doesn't work well when extending
                the query plan can make it cheaper.
              • Is this all worth it? We know static pushdown doesn't do a very good job, it
                is only there to provide starting points for dynamic pushdown.
              Show
              psergey Sergei Petrunia added a comment - A possible solution for this is to take into account expensive-predicate-cost as early as possible. Suppose there EXPENSIVE_COND(tX, tY, tZ). Then, the optimizer should: as soon as tables tX, tY and tZ are in the join prefix, add cost of EXPENSIVE_COND into the join prefix. When a change is made to the query plan (a table is added, or a semi-join strategy is applied),check if EXPENSIVE_COND should/can be moved, and if yes, move it and adjust the cost. This approach will let static-condition-pushdown to affect the choice of semi-join strategies. Possible issues: This approach requires many more static_subquery_cost() invocations. Does this work well with optimizer plan pruning? We have the following new property: adding a table can allow to move an EXPENSIVE_COND further in the plan, where it will be invoked fewer times, and the total cost will get cheaper. I suspect optimizer plan pruning doesn't work well when extending the query plan can make it cheaper. Is this all worth it? We know static pushdown doesn't do a very good job, it is only there to provide starting points for dynamic pushdown.

                People

                • Assignee:
                  psergey Sergei Petrunia
                  Reporter:
                  psergey Sergei Petrunia
                • Votes:
                  0 Vote for this issue
                  Watchers:
                  1 Start watching this issue

                  Dates

                  • Created:
                    Updated: