"CSV Must Die" o.k. but are there any tips and tricks?
Posted: Thu Nov 27, 2014 9:10 pm
Hi,
I have been investigating the export of data from an Access database for editing in Excel and then re-importing it back into Access. The short answer is: forget it if the database uses "memo" fields, and yes, mine does. However, I digress. As part of my research I read Richard Gaskin's paper "CSV Must Die" (http://www.fourthworld.com/embassy/arti ... t-die.html). Richard presents a list of recommendations for how delimited text data should be exported:
For my delimited exports in Fourth World products I've adopted the convention used in FileMaker Pro and others, with escapes that are unlikely to be in the data:
records are delimited with returns
fields delimited with tabs
quotes are never added and are included only when they are part of the data
any in-data returns are escaped with ASCII 11
any in-data tabs escaped with ASCII 4
Simple to write, lightning-fast to parse.
I am interested in how the ASCII 11 and ASCII 4 characters might be used in a LiveCode application. Is it a case of replacing the ASCII 11s and ASCII 4s when the text is displayed in a field?
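If I understand the convention correctly, handling it in LiveCode would look something like this (an untested sketch; the handler names are mine):
-- Unescape a stored value before showing it in a field,
-- and re-escape an edited value before writing it back.
function unescapeForDisplay pText
   replace numToChar(11) with return in pText -- in-data returns
   replace numToChar(4) with tab in pText -- in-data tabs
   return pText
end unescapeForDisplay

function escapeForStorage pText
   replace return with numToChar(11) in pText
   replace tab with numToChar(4) in pText
   return pText
end escapeForStorage
So with the itemDelimiter set to tab, displaying a memo would be something like put unescapeForDisplay(item 3 of tRecord) into field "Memo", and the reverse on the way back out.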
Unfortunately the data I am working with contains many "memo" fields containing CRLFs, i.e. in-data returns, so I am experimenting with exporting tab-delimited files which use "~" as the text qualifier (as Microsoft calls it). When I import the resulting file into LiveCode I will have to replace the in-data CRLFs and tabs with the escape characters. This is harder than it first looks, as it is necessary to preserve the CRLFs which mark the end of each record.
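My rough plan is to walk the export character by character, tracking whether I am inside a "~" qualifier: inside one, CRLFs and tabs get turned into ASCII 11 and ASCII 4; outside, they are left alone as real delimiters. An untested sketch (it assumes a literal "~" never occurs inside a memo field):
-- Convert in-data CRLFs and tabs inside "~" qualifiers to ASCII 11 / ASCII 4,
-- drop the qualifiers, and keep the CRLFs and tabs that really delimit the data.
function normaliseExport pRaw
   local tOut, tInsideQualifier, tChar, tLen, i
   put false into tInsideQualifier
   put the number of chars of pRaw into tLen
   put 1 into i
   repeat while i <= tLen
      put char i of pRaw into tChar
      if tChar is "~" then
         put not tInsideQualifier into tInsideQualifier -- enter or leave a text field
         add 1 to i
         next repeat
      end if
      if tInsideQualifier then
         if tChar is tab then
            put numToChar(4) after tOut -- in-data tab
            add 1 to i
            next repeat
         end if
         if (tChar & char (i + 1) of pRaw) is CRLF then
            put numToChar(11) after tOut -- in-data return
            add 2 to i
            next repeat
         end if
      end if
      put tChar after tOut -- everything else, including the real delimiters
      add 1 to i
   end repeat
   return tOut
end normaliseExport
After that, each line of the result is one record and each tab-delimited item is one field, with the escapes swapped back to returns and tabs whenever a value is displayed or edited.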
The data I am parsing has text in its last field, so it is almost safe to assume that the end of a record occurs wherever a text qualifier is followed by a CRLF. I say almost because there is nothing stopping a memo field's text from starting with a CRLF, which puts the same qualifier-plus-CRLF pattern in the middle of a record. The first field in the data set happens to be a number, so I have also tried matching a CRLF followed by a number, but again this is not safe in the general case, as a memo field is likely to contain a similar character pattern if, for instance, the database user has entered a numbered list into it.
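For what it is worth, that qualifier-plus-CRLF heuristic can be written as three replaces with a sentinel character, which also makes its weakness easy to see (untested, and it does nothing about in-data tabs):
-- Treat "~" followed by CRLF as the end of a record; this breaks whenever a memo
-- puts the same pattern mid-record, e.g. memo text that starts with a CRLF.
function markRecordEnds pRaw
   local tMark
   put numToChar(30) into tMark -- sentinel assumed never to occur in the data
   replace "~" & CRLF with "~" & tMark in pRaw -- mark probable record ends
   replace CRLF with numToChar(11) in pRaw -- anything left is taken as an in-data return
   replace tMark with CRLF in pRaw -- restore the real record ends
   return pRaw
end markRecordEnds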
I believe that where the exported data has to be edited after export to remove "dirty" characters, some form of end-of-record marker is needed. In Access this can be achieved by adding a text field and entering a code such as <EOR> in every record; as long as this field is the last field in the table, it will export in the desired location. I think I prefer this idea to the alternative of using an SQL Replace in an update query to strip the tabs and CRLFs before export, but I am open to suggestions.
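With a marker like that in place, the LiveCode side becomes straightforward; a sketch, assuming every exported record ends with the literal text <EOR> (qualified or not) followed by CRLF:
-- Split the export into one record per line using the <EOR> marker,
-- escaping any CRLFs left over inside memo fields along the way.
function recordsFromExport pRaw
   local tMark
   put numToChar(30) into tMark -- sentinel assumed never to occur in the data
   replace tab & "~<EOR>~" & CRLF with tMark in pRaw -- marker exported as a qualified field
   replace tab & "<EOR>" & CRLF with tMark in pRaw -- or unqualified, just in case
   replace CRLF with numToChar(11) in pRaw -- any CRLF still present is inside a memo
   replace tMark with return in pRaw -- one record per line from here on
   return pRaw
end recordsFromExport
A plain repeat for each line tRecord in recordsFromExport(tData) then walks the records safely; in-data tabs would still need the qualifier treatment above.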
As the title suggests, I welcome any tips and tricks for parsing text records.
Simon