Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-6125

Connect engine - cannot read XML file with default XML namespace defined

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version/s: 10.0.10
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      It's not possible to read an XML file through the Connect engine if the XML file has the default XML namespace specified, i.e. the root node includes the attribute xmlns="[namespace URI]". Notice that namespaces with prefix works well, i.e. xmlns:[prefix]="[namespace URI]".

      I'm including an example to be able to reproduce the problem. Attached find the file 20140401_1846_Running.gpx, which has the following root node:

      <gpx version="1.1" creator="runtastic - makes sports funtastic, http://www.runtastic.com" xsi:schemaLocation="http://www.topografix.com/GPX/1/1
                                      http://www.topografix.com/GPX/1/1/gpx.xsd
                                      http://www.garmin.com/xmlschemas/GpxExtensions/v3
                                      http://www.garmin.com/xmlschemas/GpxExtensionsv3.xsd
                                      http://www.garmin.com/xmlschemas/TrackPointExtension/v1
                                      http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd" xmlns="http://www.topografix.com/GPX/1/1" xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xml\
      ns:gpxx="http://www.garmin.com/xmlschemas/GpxExtensions/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      

      In the root node the default namespace is given in the xmlns -attribute:

      xmlns="http://www.topografix.com/GPX/1/1"
      

      Let's then create a Connect engine -table of the XML:

      CREATE TABLE `GPXSource` (
        `lon` double(20,16) NOT NULL `field_format`='@',
        `lat` double(20,16) NOT NULL `field_format`='@',
        `ele` double(20,16) NOT NULL,
        `time` datetime date_format="YYYY-MM-DD 'T' hh:mm:ss '.000Z'"
      ) ENGINE=CONNECT DEFAULT CHARSET=latin1 `table_type`=XML `file_name`='/home/rasmus/nodegpxmariadb/public/uploads/20140401_1846_Running.gpx' `tabname`='trkseg';
      

      Do a SELECT over the created table:

      MariaDB [running]> SELECT * FROM GPXSource;
      | lon                | lat                | ele                | time |
      +--------------------+--------------------+--------------------+------+
      | 0.0000000000000000 | 0.0000000000000000 | 0.0000000000000000 | NULL |
      | 0.0000000000000000 | 0.0000000000000000 | 0.0000000000000000 | NULL |
      +--------------------+--------------------+--------------------+------+
      2 rows in set (0.01 sec)
      

      Instead of two rows full of zeros and nulls it should have included a lot of rows without zero and null values.

      Let's edit the XML file 20140401_1846_Running.gpx and remove the xmlns="http://www.topografix.com/GPX/1/1" -attribute and after that run the same SELECT over the table again. This time we get what we expected:

      MariaDB [running]> SELECT * FROM GPXSource LIMIT 10;
      +-----------------------+---------------------+--------------------+---------------------+
      | lon                   | lat                 | ele                | time                |
      +-----------------------+---------------------+--------------------+---------------------+
      | -121.9822235107421900 | 37.3884925842285160 | 6.6108512878417970 | 2014-04-01 14:54:05 |
      | -121.9821929931640600 | 37.3885803222656250 | 6.7878279685974120 | 2014-04-01 14:54:08 |
      | -121.9821624755859400 | 37.3886299133300800 | 6.7719874382019040 | 2014-04-01 14:54:10 |
      | -121.9821395874023400 | 37.3886795043945300 | 6.3355555534362790 | 2014-04-01 14:54:12 |
      | -121.9821166992187500 | 37.3887557983398440 | 6.5390658378601070 | 2014-04-01 14:54:15 |
      | -121.9821014404296900 | 37.3888053894043000 | 6.6053481101989750 | 2014-04-01 14:54:17 |
      | -121.9820785522461000 | 37.3888893127441400 | 5.8897328376770020 | 2014-04-01 14:54:20 |
      | -121.9820709228515600 | 37.3889427185058600 | 6.3196611404418945 | 2014-04-01 14:54:22 |
      | -121.9820785522461000 | 37.3890190124511700 | 6.0785322189331055 | 2014-04-01 14:54:25 |
      | -121.9820861816406200 | 37.3890762329101560 | 6.4796056747436520 | 2014-04-01 14:54:27 |
      +-----------------------+---------------------+--------------------+---------------------+
      10 rows in set (0.01 sec)
      

        Gliffy Diagrams

          Attachments

            Activity

            Hide
            bertrandop Olivier Bertrand added a comment -

            CONNECT does not handle the processing of XML files but delegates it to specialized libraries, MS DOMDOC or libxml2 on windows and libxml2 on Linux.

            On Windows, with the default DOMDOC library, this table is normally handled with no error. However, when specifying libxml2, the above error occurs.

            Therefore, this seems to be a libxml2 error, not a CONNECT one.

            Show
            bertrandop Olivier Bertrand added a comment - CONNECT does not handle the processing of XML files but delegates it to specialized libraries, MS DOMDOC or libxml2 on windows and libxml2 on Linux. On Windows, with the default DOMDOC library, this table is normally handled with no error. However, when specifying libxml2, the above error occurs. Therefore, this seems to be a libxml2 error, not a CONNECT one.
            Hide
            bertrandop Olivier Bertrand added a comment - - edited

            Looking more closely at what happens with this example, I found that the problem is that a general name space is defined and all XPATH are looking in this name space. When trying to locate the main table node, CONNECT construct an XPATH of '//trkseg' that fails to find the corresponding node.

            Currently, when the table node is not found, CONNECT tries to use the ROOT node instead, which in that case, produces the wrong answer. Providing a general fix seems difficult but meawhile you can bypass this issue by specifying the TABNAME option as an XPATH that will ignore the the currently defined name space, in this example:

            alter table gpxsource tabname="//*[local-name()='trkseg']";
            

            For me, this worked.
            Note that the lon and ele columns should be declared as double(21,16)

            Show
            bertrandop Olivier Bertrand added a comment - - edited Looking more closely at what happens with this example, I found that the problem is that a general name space is defined and all XPATH are looking in this name space. When trying to locate the main table node, CONNECT construct an XPATH of '//trkseg' that fails to find the corresponding node. Currently, when the table node is not found, CONNECT tries to use the ROOT node instead, which in that case, produces the wrong answer. Providing a general fix seems difficult but meawhile you can bypass this issue by specifying the TABNAME option as an XPATH that will ignore the the currently defined name space, in this example: alter table gpxsource tabname= "//*[local-name()='trkseg']" ; For me, this worked. Note that the lon and ele columns should be declared as double(21,16)
            Hide
            bertrandop Olivier Bertrand added a comment - - edited

            As a matter of facts, it did not work completely because if the row node attributes were retrieved normally, the column nodes were not found for the same reason (not in the default name space) To have a complete result this table must be created as:

             
            CREATE TABLE `GPXSource` (
              `lon` double(21,16) NOT NULL field_format='@',
              `lat` double(20,16) NOT NULL field_format='@',
              `ele` double(21,16) NOT NULL field_format="*[local-name()='ele']",
              `time` datetime field_format="*[local-name()='time']" date_format="YYYY-MM-DD 'T' hh:mm:ss '.000Z'"
            ) ENGINE=CONNECT DEFAULT CHARSET=latin1 table_type=XML file_name='20140401_1846_Running.gpx'
            tabname="//*[local-name()='trkseg']" option_list='xmlsup=libxml2';
            

            With the field format xpath specified as above, the complete result is returned.

            Show
            bertrandop Olivier Bertrand added a comment - - edited As a matter of facts, it did not work completely because if the row node attributes were retrieved normally, the column nodes were not found for the same reason (not in the default name space) To have a complete result this table must be created as: CREATE TABLE `GPXSource` ( `lon` double(21,16) NOT NULL field_format='@', `lat` double(20,16) NOT NULL field_format='@', `ele` double(21,16) NOT NULL field_format= "*[local-name()='ele']" , `time` datetime field_format= "*[local-name()='time']" date_format= "YYYY-MM-DD 'T' hh:mm:ss '.000Z'" ) ENGINE=CONNECT DEFAULT CHARSET=latin1 table_type=XML file_name='20140401_1846_Running.gpx' tabname= "//*[local-name()='trkseg']" option_list='xmlsup=libxml2'; With the field format xpath specified as above, the complete result is returned.
            Hide
            ratzpo Rasmus Johansson added a comment -

            Olivier, the xpath tweaks worked well. I published a blog post a while ago already where I used it, https://blog.mariadb.org/crunching-xml-files-with-mariadb/

            Show
            ratzpo Rasmus Johansson added a comment - Olivier, the xpath tweaks worked well. I published a blog post a while ago already where I used it, https://blog.mariadb.org/crunching-xml-files-with-mariadb/

              People

              • Assignee:
                bertrandop Olivier Bertrand
                Reporter:
                ratzpo Rasmus Johansson
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0 minutes
                  0m
                  Logged:
                  Time Spent - 6 hours
                  6h