Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: None
    • Labels:
      None

      Description

      As far as I understand, there is no way to tell CONNECT that a special character in a file represents NULL.

      Please, take a look at this example. It's an open dataset from Italian government, and the same format is used for several other datasets:
      http://www.dati.salute.gov.it/imgs/C_17_dataset_7_download_itemDownload0_upFile.CSV

      The column DATAFINEVALIDITA is a DDMMYYYY date, and '-' represents a NULL value.

      If I may suggest, a new column option would be a great solution:

      col_name DATE null_value = '-',

        Gliffy Diagrams

          Attachments

            Activity

            Hide
            bertrandop Olivier Bertrand added a comment -

            This is a general issue I have been already thinking about. What CONNECT currently does,
            when a column is nullable, is to regard character columns as NULL when represented by an empty string
            and to regard numeric columns as NULL when their value is 0. The case of DATE column is unclear.
            Currently they are treated as numeric (dates are internally represented like in C by the number of
            second since 01-01-1970) Unfortunately, even invalid dates are automatically set to 01-01-1970,
            their value is not 0 because added or substracted a time shift corresponding to their GMT setting.
            This why they are not set to NULL in this case.

            I am not yet ready to implement a general solution. Should the NULL_CHAR option be a column or
            table option, the NULL representation be a single character or a string or a numeric value?
            These questions are still open.

            Meanwhile, to take care of your problem, I have updated the DATE data type for invalid dates
            to be regarded as NULL whatever GMT setting is applied. Here is the CREATE TABLE I used:

            CREATE TABLE `salute` (
              `CODICEIDENTIFICATIVOSITO` int(6) NOT NULL,
              `DENOMINAZIONESITOLOGISTICO` char(83) NOT NULL,
              `INDIRIZZO` char(97) NOT NULL,
              `PARTITAIVA` bigint(11) NOT NULL FIELD_FORMAT='Z',
              `CAP` int(5) NOT NULL,
              `CODICECOMUNEISTAT` char(6) NOT NULL,
              `DESCRIZIONECOMUNE` char(33) NOT NULL,
              `CODICEPROVINCIAISTAT` char(3) NOT NULL,
              `SIGLAPROVINCIA` char(2) NOT NULL,
              `DESCRIZIONEPROVINCIA` char(25) NOT NULL,
              `CODICEREGIONE` char(3) NOT NULL,
              `DESCRIZIONEREGIONE` char(21) NOT NULL,
              `DATAINIZIOVALIDITA` date NOT NULL DATE_FORMAT='DD/MM/YYYY',
              `DATAFINEVALIDITA` date DATE_FORMAT='DD/MM/YYYY',
              `LATITUDINE` double(17,14) NOT NULL FIELD_FORMAT='D,',
              `LONGITUDINE` double(17,14) NOT NULL FIELD_FORMAT='D,',
              `LOCALIZE` int(1) NOT NULL
            ) ENGINE=CONNECT DEFAULT CHARSET=latin1 `TABLE_TYPE`='csv' `FILE_NAME`='E:/Data/salute.csv' `SEP_CHAR`=';' `HEADER`=1 `QUOTED`=0;
            

            Note the FIELD_FORMAT='D,' to take care of the comma used as decimal separator and
            the QUOTED=0 option needed because two lines in the file have their INDIRIZZO column quoted.

            The query:

            SELECT DENOMINAZIONESITOLOGISTICO, DATAINIZIOVALIDITA, DATAFINEVALIDITA, LATITUDINE FROM salute LIMIT 10;
            

            now returns:

            DENOMINAZIONESITOLOGISTICO DATAINIZIOVALIDITA DATAFINEVALIDITA LATITUDINE
            Parafarmacia S.F.A. Fasani 2006-10-01 NULL 41.55404142482020
            eurosan 2006-10-01 NULL 38.18235051630770
            DOCPHARMA DI EPIS DOTT. LEONARDO 2006-10-01 2014-01-31 45.73083169375610
            PARAFARMACIA NATURESTE 2006-10-01 NULL 45.21971325978440
            parafarmacia sanitaria di tamagni dott.ssa roberta 2006-10-01 NULL 46.18183947061230
            FARMA+ dott.Giuseppe moscariello 2006-10-01 NULL 40.84112547106560
            pharmasan 2006-10-01 2007-02-28 40.83665880000000
            Sole e Natura di Aricò Alessandra 2006-10-01 2009-03-31 37.51717130000000
            ARTEMISIA 2006-11-01 2009-03-31 36.95185760000000
            Erboristeria D.ssa Valeria Tantardini 2006-11-01 2007-06-30 46.11608700000000
            Show
            bertrandop Olivier Bertrand added a comment - This is a general issue I have been already thinking about. What CONNECT currently does, when a column is nullable, is to regard character columns as NULL when represented by an empty string and to regard numeric columns as NULL when their value is 0. The case of DATE column is unclear. Currently they are treated as numeric (dates are internally represented like in C by the number of second since 01-01-1970) Unfortunately, even invalid dates are automatically set to 01-01-1970, their value is not 0 because added or substracted a time shift corresponding to their GMT setting. This why they are not set to NULL in this case. I am not yet ready to implement a general solution. Should the NULL_CHAR option be a column or table option, the NULL representation be a single character or a string or a numeric value? These questions are still open. Meanwhile, to take care of your problem, I have updated the DATE data type for invalid dates to be regarded as NULL whatever GMT setting is applied. Here is the CREATE TABLE I used: CREATE TABLE `salute` ( `CODICEIDENTIFICATIVOSITO` int (6) NOT NULL, `DENOMINAZIONESITOLOGISTICO` char (83) NOT NULL, `INDIRIZZO` char (97) NOT NULL, `PARTITAIVA` bigint(11) NOT NULL FIELD_FORMAT='Z', `CAP` int (5) NOT NULL, `CODICECOMUNEISTAT` char (6) NOT NULL, `DESCRIZIONECOMUNE` char (33) NOT NULL, `CODICEPROVINCIAISTAT` char (3) NOT NULL, `SIGLAPROVINCIA` char (2) NOT NULL, `DESCRIZIONEPROVINCIA` char (25) NOT NULL, `CODICEREGIONE` char (3) NOT NULL, `DESCRIZIONEREGIONE` char (21) NOT NULL, `DATAINIZIOVALIDITA` date NOT NULL DATE_FORMAT='DD/MM/YYYY', `DATAFINEVALIDITA` date DATE_FORMAT='DD/MM/YYYY', `LATITUDINE` double (17,14) NOT NULL FIELD_FORMAT='D,', `LONGITUDINE` double (17,14) NOT NULL FIELD_FORMAT='D,', `LOCALIZE` int (1) NOT NULL ) ENGINE=CONNECT DEFAULT CHARSET=latin1 `TABLE_TYPE`='csv' `FILE_NAME`='E:/Data/salute.csv' `SEP_CHAR`=';' `HEADER`=1 `QUOTED`=0; Note the FIELD_FORMAT='D,' to take care of the comma used as decimal separator and the QUOTED=0 option needed because two lines in the file have their INDIRIZZO column quoted. The query: SELECT DENOMINAZIONESITOLOGISTICO, DATAINIZIOVALIDITA, DATAFINEVALIDITA, LATITUDINE FROM salute LIMIT 10; now returns: DENOMINAZIONESITOLOGISTICO DATAINIZIOVALIDITA DATAFINEVALIDITA LATITUDINE Parafarmacia S.F.A. Fasani 2006-10-01 NULL 41.55404142482020 eurosan 2006-10-01 NULL 38.18235051630770 DOCPHARMA DI EPIS DOTT. LEONARDO 2006-10-01 2014-01-31 45.73083169375610 PARAFARMACIA NATURESTE 2006-10-01 NULL 45.21971325978440 parafarmacia sanitaria di tamagni dott.ssa roberta 2006-10-01 NULL 46.18183947061230 FARMA+ dott.Giuseppe moscariello 2006-10-01 NULL 40.84112547106560 pharmasan 2006-10-01 2007-02-28 40.83665880000000 Sole e Natura di Aricò Alessandra 2006-10-01 2009-03-31 37.51717130000000 ARTEMISIA 2006-11-01 2009-03-31 36.95185760000000 Erboristeria D.ssa Valeria Tantardini 2006-11-01 2007-06-30 46.11608700000000

              People

              • Assignee:
                bertrandop Olivier Bertrand
                Reporter:
                f_razzoli Federico Razzoli
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated: