Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.0.13
    • Fix Version/s: 10.0.16
    • Labels:
      None
    • Environment:
      CentOS release 6.5 (X86_64)

      Description

      I want to make a CONNECT table directed to a local CSV file saved in character set utf8.
      The table data is correctly displayed but the header is corrupting.
      Can I configure to fix this or is it a bug?

      /var/lib/mysql/table_utf8.csv

      [root@kc0022 ~]# cat /var/lib/mysql/table_utf8.csv
      "COUPON_NUM","COUPON_NAME","START_DATE","END_DATE"
      "A8772","CAMPAIGN1","2005-03-11","2005-04-01"
      "A6703","ANNIVERSARY","2005-08-15","2005-09-05"
      "A4313","NEW_ITEM","2005-09-13","2005-10-04"
      

      CONNECT table on MariaDB

      [root@kc0022 ~]# mysql -u root
      Welcome to the MariaDB monitor.  Commands end with ; or \g.
      Your MariaDB connection id is 351
      Server version: 10.0.13-MariaDB MariaDB Server
      
      Copyright (c) 2000, 2014, Oracle, SkySQL Ab and others.
      
      Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
      
      MariaDB [(none)]> use test;
      Reading table information for completion of table and column names
      You can turn off this feature to get a quicker startup with -A
      
      Database changed
      MariaDB [test]> CREATE TABLE table_utf8
          -> engine=CONNECT table_type=CSV file_name='../table_utf8.csv'
          -> header=1 sep_char=',' quoted=1;
      Query OK, 0 rows affected (0.05 sec)
      
      MariaDB [test]> EXPLAIN table_utf8;
      +----------------------------------------------------------------------+----------+------+-----+---------+-------+
      | Field                                                                | Type     | Null | Key | Default | Extra |
      +----------------------------------------------------------------------+----------+------+-----+---------+-------+
      | i≫?"i?£i? ̄i?μi?°i? ̄i?Ri??i?Ri?μi?-"                                  | char(5)  | NO   |     | NULL    |       |
      | i?£i? ̄i?μi?°i? ̄i?Ri??i?Ri?!i?-i?\                                    | char(33) | NO   |     | NULL    |       |
      | i?3i?´i?!i?2i?´i??i??i?!i?´i?\                                       | char(10) | NO   |     | NULL    |       |
      | i?\i?Ri??i??i??i?!i?´i?\                                             | char(10) | NO   |     | NULL    |       |
      +----------------------------------------------------------------------+----------+------+-----+---------+-------+
      4 rows in set (0.05 sec)
      
      MariaDB [test]> SELECT * FROM table_utf8;
      +----------------------------------------------------------------------+--------------------------------------------------------------------+--------------------------------------------------------------+--------------------------------------------------+
      | i≫?"i?£i? ̄i?μi?°i? ̄i?Ri??i?Ri?μi?-"                                  | i?£i? ̄i?μi?°i? ̄i?Ri??i?Ri?!i?-i?\                                  | i?3i? ´i?!i?2i?´i??i??i?!i?´i?\                               | i?\i?Ri??i??i??i?!i?´i?\                         |
      +----------------------------------------------------------------------+--------------------------------------------------------------------+--------------------------------------------------------------+--------------------------------------------------+
      | A8772                                                                | CAM PAIGN1                                                 | 2005-03-11                                                   | 2005-04-01                                       |
      | A6703                                                                | ANN IVERSARY                                             | 2005-08-15                                                   | 2005-09-05                                       |
      | A4313                                                                | NEW _ITEM                                                   | 2005-09-13                                                   | 2005-10-04                                       |
      +----------------------------------------------------------------------+--------------------------------------------------------------------+--------------------------------------------------------------+--------------------------------------------------+
      3 rows in set (0.00 sec)
      
      MariaDB [test]> show variables like "%char%";
      +--------------------------+----------------------------+
      | Variable_name            | Value                      |
      +--------------------------+----------------------------+
      | character_set_client     | utf8                       |
      | character_set_connection | utf8                       |
      | character_set_database   | utf8                       |
      | character_set_filesystem | binary                     |
      | character_set_results    | utf8                       |
      | character_set_server     | utf8                       |
      | character_set_system     | utf8                       |
      | character_sets_dir       | /usr/share/mysql/charsets/ |
      +--------------------------+----------------------------+
      8 rows in set (0.00 sec)
      

        Gliffy Diagrams

          Attachments

            Activity

            Hide
            bertrandop Olivier Bertrand added a comment - - edited

            Indeed, using UltraEdit it was correctly decoded and I could verify if is in UTF-8 (coded in three bytes).

            I did not check yet what happen with column names. They are retrieved by CONNECT from the file header during the discovery process and perhaps not decoded properly. This can be temporarily avoided by manually defining the columns in the create table statement.

            However, I traced what happen with the data part. Because it is in UTF-8 (even the ASCII columns can be regarded as UTF-8) I created the table as:

            CREATE TABLE table_utf8 engine=CONNECT table_type=CSV default charset=UTF8
            file_name='E:/Data/table_utf8.csv' header=1 sep_char=',' quoted=1 data_charset=utf8;
            

            The data_charset option indicates the coding of the table data.

            What CONNECT does when reading the table is retrieving the column data and I could verify that it get it correctly and call the matching field store function with the charset parameter set by the data_charset option as UTF-8:

              rc= fp->store(p, strlen(p), charset, CHECK_FIELD_WARN);
            

            From there, all is done by MariaDB. If the default charset of the table is Latin1, MariaDB tries to convert the three bytes characters to Latin1 and fails, then they are displayed as '?' question marks.

            If the table default charset is UTF-8, the data is not converted (as the from charset is equal to the to charset)
            The way it is displayed depends on the client program and settings.

            Therefore this seems to be a MariaDB issue, at least for the data part of the table.

            Show
            bertrandop Olivier Bertrand added a comment - - edited Indeed, using UltraEdit it was correctly decoded and I could verify if is in UTF-8 (coded in three bytes). I did not check yet what happen with column names. They are retrieved by CONNECT from the file header during the discovery process and perhaps not decoded properly. This can be temporarily avoided by manually defining the columns in the create table statement. However, I traced what happen with the data part. Because it is in UTF-8 (even the ASCII columns can be regarded as UTF-8) I created the table as: CREATE TABLE table_utf8 engine=CONNECT table_type=CSV default charset=UTF8 file_name='E:/Data/table_utf8.csv' header=1 sep_char=',' quoted=1 data_charset=utf8; The data_charset option indicates the coding of the table data. What CONNECT does when reading the table is retrieving the column data and I could verify that it get it correctly and call the matching field store function with the charset parameter set by the data_charset option as UTF-8: rc= fp->store(p, strlen(p), charset, CHECK_FIELD_WARN); From there, all is done by MariaDB. If the default charset of the table is Latin1, MariaDB tries to convert the three bytes characters to Latin1 and fails, then they are displayed as '?' question marks. If the table default charset is UTF-8, the data is not converted (as the from charset is equal to the to charset) The way it is displayed depends on the client program and settings. Therefore this seems to be a MariaDB issue, at least for the data part of the table.
            Hide
            bertrandop Olivier Bertrand added a comment -

            Redefining the table as:

            CREATE TABLE `table_utf8` (
              `COUPON_NUM` char(5) NOT NULL,
              `COUPON_NAME` char(33) NOT NULL,
              `START_DATE` char(10) NOT NULL,
              `END_DATE` char(10) NOT NULL
            ) ENGINE=CONNECT DEFAULT CHARSET=UTF8 `TABLE_TYPE`='CSV' `FILE_NAME`='E:/Data/table_utf8.csv' `SEP_CHAR`=',' `DATA_CHARSET`='utf8' `HEADER`=1 `QUOTED`=1;
            

            I get the following result when character_set_results='latin1':

            +------------+-------------+------------+------------+
            | COUPON_NUM | COUPON_NAME | START_DATE | END_DATE   |
            +------------+-------------+------------+------------+
            | A8772      | ?????????   | 2005-03-11 | 2005-04-01 |
            | A6703      | ??????????? | 2005-08-15 | 2005-09-05 |
            | A4313      | ????????    | 2005-09-13 | 2005-10-04 |
            +------------+-------------+------------+------------+
            

            This seems normal, those three bytes characters cannot be translated to latin1.
            Now if character_set_results='utf8' the display is:

            +------------+-----------------------------------+------------+------------+
            | COUPON_NUM | COUPON_NAME                       | START_DATE | END_DATE   |
            +------------+-----------------------------------+------------+------------+
            | A8772      | ´╝ú´╝í´╝¡´╝░´╝í´╝®´╝º´╝«´╝æ       | 2005-03-11 | 2005-04-01 |
            | A6703      | ´╝í´╝«´╝«´╝®´╝´╝Ñ´╝▓´╝│´╝í´╝▓´╝╣ | 2005-08-15 | 2005-09-05 |
            | A4313      | ´╝«´╝Ñ´╝À´╝┐´╝®´╝┤´╝Ñ´╝¡          | 2005-09-13 | 2005-10-04 |
            +------------+-----------------------------------+------------+------------+
            

            To me it looks as if the client program were not decoding the result even the character_set_client is set to UTF-8. So it could also be a problem coming from the client program.

            Show
            bertrandop Olivier Bertrand added a comment - Redefining the table as: CREATE TABLE `table_utf8` ( `COUPON_NUM` char (5) NOT NULL, `COUPON_NAME` char (33) NOT NULL, `START_DATE` char (10) NOT NULL, `END_DATE` char (10) NOT NULL ) ENGINE=CONNECT DEFAULT CHARSET=UTF8 `TABLE_TYPE`='CSV' `FILE_NAME`='E:/Data/table_utf8.csv' `SEP_CHAR`=',' `DATA_CHARSET`='utf8' `HEADER`=1 `QUOTED`=1; I get the following result when character_set_results='latin1': +------------+-------------+------------+------------+ | COUPON_NUM | COUPON_NAME | START_DATE | END_DATE | +------------+-------------+------------+------------+ | A8772 | ????????? | 2005-03-11 | 2005-04-01 | | A6703 | ??????????? | 2005-08-15 | 2005-09-05 | | A4313 | ???????? | 2005-09-13 | 2005-10-04 | +------------+-------------+------------+------------+ This seems normal, those three bytes characters cannot be translated to latin1. Now if character_set_results='utf8' the display is: +------------+-----------------------------------+------------+------------+ | COUPON_NUM | COUPON_NAME | START_DATE | END_DATE | +------------+-----------------------------------+------------+------------+ | A8772 | ´╝ú´╝í´╝¡´╝░´╝í´╝®´╝º´╝«´╝æ | 2005-03-11 | 2005-04-01 | | A6703 | ´╝í´╝«´╝«´╝®´╝´╝Ñ´╝▓´╝│´╝í´╝▓´╝╣ | 2005-08-15 | 2005-09-05 | | A4313 | ´╝«´╝Ñ´╝À´╝┐´╝®´╝┤´╝Ñ´╝¡ | 2005-09-13 | 2005-10-04 | +------------+-----------------------------------+------------+------------+ To me it looks as if the client program were not decoding the result even the character_set_client is set to UTF-8. So it could also be a problem coming from the client program.
            Hide
            bertrandop Olivier Bertrand added a comment -

            Concerning the column names, there was a problem when specified in UTF8 because, MariaDB internally handling column names in UTF8, CONNECT was translating them before creating the table. Now for CSV tables specifying DATA_CHARSET=UTF8 this translation is no more done, considering they already are coded in UTF8.

            For instance, starting from the file:

            Opération,Monnaie,Débit,Crédit
            Chèque,£,100,0
            Virement,€,0,5600
            

            I can create the table:

            create table tab_csv8 engine=connect table_type=csv file_name='E:/Data/csv8.csv' header=1 lrecl=64 data_charset=utf8;
            

            Describe tab_csv8 replies:

            Field Type Null Key Default Extra
            Opération char(8) NO <null> <null> <null>
            Monnaie char(3) NO <null> <null> <null>
            Débit int(3) NO <null> <null> <null>
            Crédit int(4) NO <null> <null> <null>

            and

            select * from tab_csv8;
            

            Returns:

            Opération Monnaie Débit Crédit
            Chèque £ 100 0
            Virement 0 5600

            However, if instead of using my graphic client program I use the mysql client, the table is displayed as:

            MariaDB [test]> select * from tab_csv8;
            +-----------+---------+-------+--------+
            | Opération | Monnaie | Débit | Crédit |
            +-----------+---------+-------+--------+
            | Chèque    | £       |   100 |      0 |
            | Virement  | ?       |     0 |   5600 |
            +-----------+---------+-------+--------+
            

            See; the Euro character is not recognized and printed as a question mark. This shows that the problem is a problem of character recognition by the mysql client program.

            Note that this is done with:

            MariaDB [test]> show variables like '%char%';
            +--------------------------+-----------------------------------------------------------------+
            | Variable_name            | Value                                                           |
            +--------------------------+-----------------------------------------------------------------+
            | character_set_client     | cp850                                                           |
            | character_set_connection | cp850                                                           |
            | character_set_database   | latin1                                                          |
            | character_set_filesystem | binary                                                          |
            | character_set_results    | cp850                                                           |
            | character_set_server     | latin1                                                          |
            | character_set_system     | utf8                                                            |
            | character_sets_dir       | d:\CommonSource\mariadb-10.0.5\10.0-connect\sql\share\charsets\ |
            +--------------------------+-----------------------------------------------------------------+
            

            If those character sets are set to UTF8, things are even worst and the table is displayed as:

            MariaDB [test]> select * from tab_csv8;
            +------------+---------+--------+---------+
            | Op├®ration | Monnaie | D├®bit | Cr├®dit |
            +------------+---------+--------+---------+
            | Chèque    | £      |    100 |       0 |
            | Virement   | Ôé¼     |      0 |    5600 |
            +------------+---------+--------+---------+
            

            Therefore, I think this issue is fixed for CONNECT (column names are now retrieved correctly) but of course there are pending problem for displaying some characters that are not CONNECT wise.

            Show
            bertrandop Olivier Bertrand added a comment - Concerning the column names, there was a problem when specified in UTF8 because, MariaDB internally handling column names in UTF8, CONNECT was translating them before creating the table. Now for CSV tables specifying DATA_CHARSET=UTF8 this translation is no more done, considering they already are coded in UTF8. For instance, starting from the file: Opération,Monnaie,Débit,Crédit Chèque,£,100,0 Virement,€,0,5600 I can create the table: create table tab_csv8 engine=connect table_type=csv file_name='E:/Data/csv8.csv' header=1 lrecl=64 data_charset=utf8; Describe tab_csv8 replies: Field Type Null Key Default Extra Opération char(8) NO <null> <null> <null> Monnaie char(3) NO <null> <null> <null> Débit int(3) NO <null> <null> <null> Crédit int(4) NO <null> <null> <null> and select * from tab_csv8; Returns: Opération Monnaie Débit Crédit Chèque £ 100 0 Virement € 0 5600 However, if instead of using my graphic client program I use the mysql client, the table is displayed as: MariaDB [test]> select * from tab_csv8; +-----------+---------+-------+--------+ | Opération | Monnaie | Débit | Crédit | +-----------+---------+-------+--------+ | Chèque | £ | 100 | 0 | | Virement | ? | 0 | 5600 | +-----------+---------+-------+--------+ See; the Euro character is not recognized and printed as a question mark. This shows that the problem is a problem of character recognition by the mysql client program. Note that this is done with: MariaDB [test]> show variables like '%char%'; +--------------------------+-----------------------------------------------------------------+ | Variable_name | Value | +--------------------------+-----------------------------------------------------------------+ | character_set_client | cp850 | | character_set_connection | cp850 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | cp850 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | d:\CommonSource\mariadb-10.0.5\10.0-connect\sql\share\charsets\ | +--------------------------+-----------------------------------------------------------------+ If those character sets are set to UTF8, things are even worst and the table is displayed as: MariaDB [test]> select * from tab_csv8; +------------+---------+--------+---------+ | Op├®ration | Monnaie | D├®bit | Cr├®dit | +------------+---------+--------+---------+ | Ch├¿que | ┬ú | 100 | 0 | | Virement | Ôé¼ | 0 | 5600 | +------------+---------+--------+---------+ Therefore, I think this issue is fixed for CONNECT (column names are now retrieved correctly) but of course there are pending problem for displaying some characters that are not CONNECT wise.
            Hide
            takuya Takuya Aoki added a comment -

            Indeed, setting default charset=UTF8 and configuring the client solves the problem for the data part.
            This is the same in my test I submitted in 2015-01-14.

            But for the header part, I don't understand what is fixed at the moment.
            Do you mean data is handled correctly inside, but displaying it errors?
            Is it possible to fix the client so that all characters can be shown properly?

            Show
            takuya Takuya Aoki added a comment - Indeed, setting default charset=UTF8 and configuring the client solves the problem for the data part. This is the same in my test I submitted in 2015-01-14. But for the header part, I don't understand what is fixed at the moment. Do you mean data is handled correctly inside, but displaying it errors? Is it possible to fix the client so that all characters can be shown properly?
            Hide
            bertrandop Olivier Bertrand added a comment -

            As I said, the header part was a bug that will be fixed in next version.

            Show
            bertrandop Olivier Bertrand added a comment - As I said, the header part was a bug that will be fixed in next version.

              People

              • Assignee:
                bertrandop Olivier Bertrand
                Reporter:
                takuya Takuya Aoki
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0 minutes
                  0m
                  Logged:
                  Time Spent - 1 day
                  1d