Details

    • Type: Task
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      First, I apologize if I'm not submitting this to the right section. It's not a bug, but I don't see "Feature Request." I'll try to keep this, um, short, not epic.

      OVERVIEW –
      Whether correct or not, the following occurs. You have 2 tables in a typical 1:M relationship, but you only want 1 row back. OK, use a semi-join, right?

      Well, first, up until now the best you could do is use an index-subquery, which was a very limited approach for an exact type of query. Now we have options. But before that in the stone age, what would lots of people do?

      They'd do a LEFT or INNER JOIN and add on a 'hacky' GROUP BY at the end. In fact that's one of your execution approaches for semi-joins. But it's not always the best one. It depends.

      But ...

      The nice part about this awful hack is that you get the columns from the table you're 'semi-joining' to. Sometimes you don't care which row the extra columns are from, and you just need to ensure you get 1 row. Also, in the case that the planner knows it can use an INNER JOIN strategy and rewrite the query for you, it would be 100% valid and loss-less. In other cases, you might just know that what you want is the same across the many rows it would be encountering anyway (yes, I know that's denormalized, but real life isn't always normal). Lots of people summarizing data end up with schemas like this.

      I know this would not pass muster with the SQL normalization overlords, but it would be darn useful and very much in the spirit of MySQL.

      Thoughts?

        Gliffy Diagrams

          Attachments

            Activity

            Hide
            psergey Sergei Petrunia added a comment -

            Do I understand it correctly: you're requesting that MariaDB's SQL dialect has support for SEMI JOIN construct? So that it allows queries like

            SELECT ... FROM table1 SEMI JOIN table2 WHERE ....

            ?

            Show
            psergey Sergei Petrunia added a comment - Do I understand it correctly: you're requesting that MariaDB's SQL dialect has support for SEMI JOIN construct? So that it allows queries like SELECT ... FROM table1 SEMI JOIN table2 WHERE .... ?
            Hide
            jsirovic Jaimie Sirovich added a comment -

            Yes. The difference would be that with "SEMI JOIN," the columns from both table1 and table2 would be materialized. The result would be undefined (in the same way as a GROUP BY is) in certain cases, but sometimes you either:

            1. Don't care.
            2. Know that the particular columns you're selecting will be defined in your particular query. Purists argued for years that this is wrong, but they lost that argument, as even PostgreSQL now allows for this.

            Using a subquery will not permit to express this, and using GROUP BY to remove the duplicate rows is only 1 execution strategy for certain SEMI JOINs. But it's the only one, as it stands, that allows for materialization from both sides of the JOIN.

            Make sense?

            Show
            jsirovic Jaimie Sirovich added a comment - Yes. The difference would be that with "SEMI JOIN," the columns from both table1 and table2 would be materialized. The result would be undefined (in the same way as a GROUP BY is) in certain cases, but sometimes you either: 1. Don't care. 2. Know that the particular columns you're selecting will be defined in your particular query. Purists argued for years that this is wrong, but they lost that argument, as even PostgreSQL now allows for this. Using a subquery will not permit to express this, and using GROUP BY to remove the duplicate rows is only 1 execution strategy for certain SEMI JOINs. But it's the only one, as it stands, that allows for materialization from both sides of the JOIN. Make sense?
            Hide
            jsirovic Jaimie Sirovich added a comment -

            Sighted here: http://archive.cloudera.com/cdh/3/hive/language_manual/joins.html

            Possibly because they don't support subqueries. I'm not sure why referencing the right-hand table must be prohibited, either. It can be allowed. Yes, in some cases it's undefined, but only sometimes.

            "LEFT SEMI JOIN implements the correlated IN/EXISTS subquery semantics in an efficient way. Since Hive currently does not support IN/EXISTS subqueries, you can rewrite your queries using LEFT SEMI JOIN. The restrictions of using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses etc.

            This type of query

            SELECT a.key, a.value
            FROM a
            WHERE a.key in
            (SELECT b.key
            FROM B);
            Can be written as:

            SELECT a.key, a.val
            FROM a LEFT SEMI JOIN b on (a.key = b.key)
            "

            Show
            jsirovic Jaimie Sirovich added a comment - Sighted here: http://archive.cloudera.com/cdh/3/hive/language_manual/joins.html Possibly because they don't support subqueries. I'm not sure why referencing the right-hand table must be prohibited, either. It can be allowed. Yes, in some cases it's undefined, but only sometimes. "LEFT SEMI JOIN implements the correlated IN/EXISTS subquery semantics in an efficient way. Since Hive currently does not support IN/EXISTS subqueries, you can rewrite your queries using LEFT SEMI JOIN. The restrictions of using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses etc. This type of query SELECT a.key, a.value FROM a WHERE a.key in (SELECT b.key FROM B); Can be written as: SELECT a.key, a.val FROM a LEFT SEMI JOIN b on (a.key = b.key) "
            Hide
            psergey Sergei Petrunia added a comment -

            Interesting...

            Internally, MariaDB 5.3+ supports semi-joins, one can see it by running EXPLAIN
            EXTENDED on a semi-join subquery and checking the rewritten query in the
            warning.

            As for references to semi-join-inner tables from other parts of the query, the
            only part of the optimizer that is sensitive to their presence is semi-join
            materialization
            (http://kb.askmonty.org/en/semi-join-materialization-strategy).
            Other strategies do not care.

            There is a certain difference from LEFT JOINs. With LEFT JOIN, the WHERE condition
            is applied to the result of the LEFT JOIN ... ON operation (internally, it
            isn't, but the query result is the same as if it was).

            With semi-join (in current MariaDB, and in MySQL 5.6) the WHERE clause affects the
            semi-join operation. Consider a query

            SELECT * FROM country
            WHERE country.code IN (SELECT city.country FROM city
            WHERE city.has_airport=1)

            And, consider a semi-join, which has reference to the subquery table:

            SELECT ...
            FROM country LEFT SEMI JOIN city ON (country.code = city.country)
            WHERE
            city.have_airport=1

            Current semi-join code will evaluate this with this semantics (SEM1):

            for each country
            {
            find some city that belongs to that country AND has an airport;
            return

            {country, city} pair;
            }

            One could expect the following (SEM2) :

            for each country
            {
            find some city $C that belongs to that country;
            if ($C.has_an_airport)
            return {country, city}

            pair;

            }

            With (SEM1), one can freely move conditions between WHERE and ON expression,
            there is no difference. With (SEM2), there is a difference.
            Current optimizer code implements (SEM1), implementing (SEM2) may be
            non-trivial.

            Another possible concern is semi-join on the inner side of an outer join. Its
            subquery equivalent is a subquery in the LEFT OUTER JOIN's ON clause.
            Semi-join optimizations have no support for such queries, they will be executed
            via "old" execution methods. If we decide to support LEFT SEMI JOIN syntax,
            we will have a choice of

            • not supporting LEFT SEMI JOIN within LEFT OUTER JOIN
            • Having semi-join optimizer/executor support semi-join-inside-left-join
              (complicated)
            • Converting semi-join-inside-left-join back into a subquery (complicated and
              error-prone).
            Show
            psergey Sergei Petrunia added a comment - Interesting... Internally, MariaDB 5.3+ supports semi-joins, one can see it by running EXPLAIN EXTENDED on a semi-join subquery and checking the rewritten query in the warning. As for references to semi-join-inner tables from other parts of the query, the only part of the optimizer that is sensitive to their presence is semi-join materialization ( http://kb.askmonty.org/en/semi-join-materialization-strategy ). Other strategies do not care. There is a certain difference from LEFT JOINs. With LEFT JOIN, the WHERE condition is applied to the result of the LEFT JOIN ... ON operation (internally, it isn't, but the query result is the same as if it was). With semi-join (in current MariaDB, and in MySQL 5.6) the WHERE clause affects the semi-join operation. Consider a query SELECT * FROM country WHERE country.code IN (SELECT city.country FROM city WHERE city.has_airport=1) And, consider a semi-join, which has reference to the subquery table: SELECT ... FROM country LEFT SEMI JOIN city ON (country.code = city.country) WHERE city.have_airport=1 Current semi-join code will evaluate this with this semantics (SEM1): for each country { find some city that belongs to that country AND has an airport; return {country, city} pair; } One could expect the following (SEM2) : for each country { find some city $C that belongs to that country; if ($C.has_an_airport) return {country, city} pair; } With (SEM1), one can freely move conditions between WHERE and ON expression, there is no difference. With (SEM2), there is a difference. Current optimizer code implements (SEM1), implementing (SEM2) may be non-trivial. Another possible concern is semi-join on the inner side of an outer join. Its subquery equivalent is a subquery in the LEFT OUTER JOIN's ON clause. Semi-join optimizations have no support for such queries, they will be executed via "old" execution methods. If we decide to support LEFT SEMI JOIN syntax, we will have a choice of not supporting LEFT SEMI JOIN within LEFT OUTER JOIN Having semi-join optimizer/executor support semi-join-inside-left-join (complicated) Converting semi-join-inside-left-join back into a subquery (complicated and error-prone).
            Hide
            psergey Sergei Petrunia added a comment -

            The main question is, do you expect (SEM1), or (SEM2) ?

            Show
            psergey Sergei Petrunia added a comment - The main question is, do you expect (SEM1), or (SEM2) ?
            Hide
            jsirovic Jaimie Sirovich added a comment - - edited

            I'm pretty sure you want SEM1, so that's good news. SEM2 is only correct when the multiple rows are 100% duplicate with respect to what you care about. I think it's always wrong, then, actually. If you expect SEM2, you'd expect totally random results based on how the optimizer decides to walk over the data, no? SEM1 is consistent.

            As far as OUTERs . . .

            I'd vote for only implementing INNER JOIN for now, since it looks like that's mostly an issue of syntax. If it's adopted (i.e. anyone actually cares about it other than we), outer joins could be implemented at a later date. I can see uses for both, actually, but SQL only had inner joins at the beginning of the SQL universe and relational algebra, so it wouldn't be strange at all to have a limit that SEMI joins are INNER JOINs.

            What I find strange to begin with is that this syntax doesn't exist in pretty much any implementation, and that it must be represented as a subquery just to be then turned back into a (semi) join. Why? The "semi join" is a concept you can find as a bona fide join in academic textbooks. It has its own funny symbol and everything. Using subquery semantics seems pained because in relational algebra you're supposed to think of sets, not loops and conditions, which is what subqueries tend to 'think' about. Usually, when you see subqueries everywhere, it means "bad programmer." In the old days of MySQL, everyone said "use joins stupid!" ... except that resulted in using a join with a 'hacky group by,' which is roughly "DUPLICATE WEED OUT." That's very limiting and kind of the worst plan in most cases, though

            So I don't see any huge problems if it's only SEMI INNER JOINs. In certain cases you could still apply TABLE PULLOUT (well, it's no longer a pull out) to make it a true INNER JOIN based on schema.

            MySQL previously implemented INDEX_SUBQUERY as a non-generic semi join for very limited cases (it looks like this is FIRST_MATCH now). However, it still had to be represented as a subquery and suffers from the same painful syntax and lack of ability to select from. I guess materialization is like joining to a non-correlated subquery. Loosescan ... I'm not sure there's a way to express that at all.

            Either way, you would have the ability to query something (that albeit could be denormalized) and 'see everything.' That's a (very nice) side effect, though. Sometimes you know it's OK, so why not? If you care, then you can set something like sqlmode_only_full_group_by, but only purists would care.

            Really I think it's odd that everyone uses the pained subquery syntax to express something that IS a bona fide join.

            Hope this is clear. I don't write optimizers, so . . . I could be totally wrong on some of these things.

            Show
            jsirovic Jaimie Sirovich added a comment - - edited I'm pretty sure you want SEM1, so that's good news. SEM2 is only correct when the multiple rows are 100% duplicate with respect to what you care about. I think it's always wrong, then, actually. If you expect SEM2, you'd expect totally random results based on how the optimizer decides to walk over the data, no? SEM1 is consistent. As far as OUTERs . . . I'd vote for only implementing INNER JOIN for now, since it looks like that's mostly an issue of syntax. If it's adopted (i.e. anyone actually cares about it other than we), outer joins could be implemented at a later date. I can see uses for both, actually, but SQL only had inner joins at the beginning of the SQL universe and relational algebra, so it wouldn't be strange at all to have a limit that SEMI joins are INNER JOINs. What I find strange to begin with is that this syntax doesn't exist in pretty much any implementation, and that it must be represented as a subquery just to be then turned back into a (semi) join. Why? The "semi join" is a concept you can find as a bona fide join in academic textbooks. It has its own funny symbol and everything. Using subquery semantics seems pained because in relational algebra you're supposed to think of sets, not loops and conditions, which is what subqueries tend to 'think' about. Usually, when you see subqueries everywhere, it means "bad programmer." In the old days of MySQL, everyone said "use joins stupid!" ... except that resulted in using a join with a 'hacky group by,' which is roughly "DUPLICATE WEED OUT." That's very limiting and kind of the worst plan in most cases, though So I don't see any huge problems if it's only SEMI INNER JOINs. In certain cases you could still apply TABLE PULLOUT (well, it's no longer a pull out) to make it a true INNER JOIN based on schema. MySQL previously implemented INDEX_SUBQUERY as a non-generic semi join for very limited cases (it looks like this is FIRST_MATCH now). However, it still had to be represented as a subquery and suffers from the same painful syntax and lack of ability to select from. I guess materialization is like joining to a non-correlated subquery. Loosescan ... I'm not sure there's a way to express that at all. Either way, you would have the ability to query something (that albeit could be denormalized) and 'see everything.' That's a (very nice) side effect, though. Sometimes you know it's OK, so why not? If you care, then you can set something like sqlmode_only_full_group_by, but only purists would care. Really I think it's odd that everyone uses the pained subquery syntax to express something that IS a bona fide join. Hope this is clear. I don't write optimizers, so . . . I could be totally wrong on some of these things.
            Hide
            jsirovic Jaimie Sirovich added a comment -

            I've been idling on this. Our of curiousity — and don't get me misconstrued we're a small development firm — what would be the real cost to sponsor something like this? If I had experience with the MySQL source tree I'd think about tinkering with it myself. It doesn't sound that difficult as long as it only works for INNERs. Anything else sounds much harder.

            Show
            jsirovic Jaimie Sirovich added a comment - I've been idling on this. Our of curiousity — and don't get me misconstrued we're a small development firm — what would be the real cost to sponsor something like this? If I had experience with the MySQL source tree I'd think about tinkering with it myself. It doesn't sound that difficult as long as it only works for INNERs. Anything else sounds much harder.

              People

              • Assignee:
                psergey Sergei Petrunia
                Reporter:
                jsirovic Jaimie Sirovich
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated: