"CSV Must Die" o.k. but are there any tips and tricks?

Anything beyond the basics in using the LiveCode language. Share your handlers, functions and magic here.

Moderators: FourthWorld, heatherlaine, Klaus, kevinmiller, robinmiller

Post Reply
Simon Knight
Posts: 919
Joined: Wed Nov 04, 2009 11:41 am

"CSV Must Die" o.k. but are there any tips and tricks?

Post by Simon Knight » Thu Nov 27, 2014 9:10 pm

Hi,
I have been investigating exporting data from an Access database for editing in Excel and then re-importing it back into Access. The short answer is: forget it if the database uses "memo" fields, and yes, mine does. However, I digress. As part of my research I read Richard Gaskin's paper "CSV Must Die" http://www.fourthworld.com/embassy/arti ... t-die.html. Richard presents a list of recommendations on how delimited text data should be exported:
For my delimited exports in Fourth World products I've adopted the convention used in FileMaker Pro and others, with escapes that are unlikely to be in the data:

records are delimited with returns
fields delimited with tabs
quotes are never added and are included only when they are part of the data
any in-data returns are escaped with ASCII 11
any in-data tabs escaped with ASCII 4
Simple to write, lightning-fast to parse.
I am interested in how the ASCII 11 and ASCII 4 characters might be used in a LiveCode application. Is it a case of replacing the ASCII 11s and ASCII 4s when the text is displayed in a field?
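A minimal sketch of how those escapes might be handled in LiveCode (the field name, line/item numbers and variable names here are hypothetical): swap the escapes back to real characters when a value is shown in a field, and reverse the substitution before writing it back into the delimited data.

Code: Select all

-- unescape for display: ASCII 11 stands for an in-data return,
-- ASCII 4 for an in-data tab
set the itemDelimiter to tab
put item 2 of line 3 of tData into tValue
replace numToChar(11) with return in tValue
replace numToChar(4) with tab in tValue
put tValue into field "Memo"

-- re-escape before writing back into the delimited data
put field "Memo" into tValue
replace return with numToChar(11) in tValue
replace tab with numToChar(4) in tValue
put tValue into item 2 of line 3 of tData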

Unfortunately the data I am working with contains many "memo" fields containing CRLF, i.e. in-data returns, so I am experimenting with exporting tab-delimited files which use "~" as the text qualifier (as Microsoft calls it). When I import the resulting file into LiveCode I will have to replace the in-data CRLFs and tabs with the escape characters. This is harder than it first looks, as it is necessary to preserve the CRLFs that mark the end of each record.

The data I am parsing has text in its last field, so it is almost safe to assume that the end of a record occurs at a text-qualifier-plus-CRLF combination; I say almost because there is nothing stopping text data from starting with a CRLF. The first field in the data set happens to be a number, so I have also tried matching on CRLF followed by a number, but again this is not safe in the general case, as a memo field is likely to contain a similar character pattern if, for instance, the database user has entered a numbered list into the memo field.

I believe that in the situation where the exported data has to be edited post export to remove "dirty" characters then some form of end of record marker is needed. In Access this may be achieved by adding a text field and entering a code such as <EOR> in every record. As long as this field is the last field in the table then it will export in the desired location. I think I prefer this idea to the alternative of using an SQL Replace command to replace the tabs and CRLF characters in an update query but am open to suggestions.
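As a sketch of the <EOR> idea (the file path, marker text and variable names are assumptions): once every true record ends with the marker followed by CRLF, any CRLF still left in the data must be in-data and can be escaped safely.

Code: Select all

-- binfile: avoids line-ending translation, preserving the CRLFs
put URL ("binfile:" & tPath) into tRaw
-- mark the true record boundaries first
replace "<EOR>" & CRLF with numToChar(30) in tRaw
-- any CRLF still present is in-data; escape it per the article's convention
replace CRLF with numToChar(11) in tRaw
set the lineDelimiter to numToChar(30)
repeat for each line tRecord in tRaw
   -- process one clean record; in-data returns are now ASCII 11
end repeat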

As the title suggests I welcome any tips and tricks for the parsing of text records.

Simon
best wishes
Skids

Mark
Livecode Opensource Backer
Posts: 5150
Joined: Thu Feb 23, 2006 9:24 pm
Contact:

Re: "CSV Must Die" o.k. but are there any tips and tricks?

Post by Mark » Fri Nov 28, 2014 10:30 am

Hi,

Sheer nonsense. Whenever you think CSV is the best solution for your purpose, just use it. In many cases, it is an easier, faster and cheaper solution than XML. Make sure to include comments with your code to make clear what you're doing to developers who may come after you.

You can use ASCII 4 and ASCII 11 as item delimiters. Personally, I prefer to use ASCII 14, which is actually used by some formats, such as DIF. Let's assume that Access can export CSV files, replacing carriage returns inside fields with ASCII 14. After reading your file with a LiveCode script and storing the data in the variable myData, you can parse the lines like this:

Code: Select all

set the itemDel to tab
put numToChar(14) into myVertTab
repeat for each line myLine in myData
  put item 1 of myLine into fld "Record ID"
  put item 2 of myLine into myComment
  replace myVertTab with cr in myComment
  put myComment into fld "Comment"
  create card
end repeat
This script assumes that you want to put each record on its own card, but it is just an example. You will probably think of something better.

Kind regards,

Mark
The biggest LiveCode group on Facebook: https://www.facebook.com/groups/livecode.developers
The book "Programming LiveCode for the Real Beginner"! Get it here! http://tinyurl.com/book-livecode

Dixie
Livecode Opensource Backer
Posts: 1336
Joined: Sun Jul 12, 2009 10:53 am

Re: "CSV Must Die" o.k. but are there any tips and tricks?

Post by Dixie » Fri Nov 28, 2014 10:56 am

Sheer nonsense. Whenever you think CSV is the best solution for your purpose, just use it. In many cases, it is an easier, faster and cheaper solution than XML. Make sure to include comments with your code to make clear what you're doing to developers who may come after you.
Got to agree with Mark's comments... if anything should be in the firing line it should be XML; after all, it only bloats your data...

Simon Knight
Posts: 919
Joined: Wed Nov 04, 2009 11:41 am

Re: "CSV Must Die" o.k. but are there any tips and tricks?

Post by Simon Knight » Fri Nov 28, 2014 12:13 pm

Hi Mark/Dixie,

Thanks for your comments. I don't think the paper is against delimited text as such; rather, it objects to the way CSV uses common printing characters to delimit the data and to escape certain other characters.

I have concluded that it is unwise to rely on the export/import routines built into Microsoft and other products, and that I will have to create my own in Visual Basic at some point. For the moment I am using LiveCode to clean the data files produced by these built-in routines, so I am interested in how others deal with them.

There are 32 non-printing control characters defined in the ASCII character set, many of them designed for and used by mechanical teletypes and early printers. Is there any reason why fields should not be delimited with ASCII 31 (Unit Separator) and records with ASCII 30 (Record Separator)?
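For what it's worth, a sketch of reading data that uses those two separators in LiveCode (the variable names are made up):

Code: Select all

set the lineDelimiter to numToChar(30)  -- ASCII 30, Record Separator
set the itemDelimiter to numToChar(31)  -- ASCII 31, Unit Separator
repeat for each line tRecord in tData
   put item 1 of tRecord into tID
   put item 2 of tRecord into tMemo
   -- tMemo may safely contain real tabs, commas and returns
end repeat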

I have to agree about XML; it just seems like a lot of hard work to create and process.

best wishes

Simon
best wishes
Skids

Mark
Livecode Opensource Backer
Posts: 5150
Joined: Thu Feb 23, 2006 9:24 pm
Contact:

Re: "CSV Must Die" o.k. but are there any tips and tricks?

Post by Mark » Fri Nov 28, 2014 12:23 pm

Hi Simon,

If something is wrong with a CSV format, it isn't CSV that is to blame but the developer. The CSV format is quite flexible: pretty much anything comma- or tab-delimited is CSV, and it can have additional custom delimiters. So CSV itself doesn't use common printing characters, but some (or many) developers do.

I can't think of a reason why ASCII 30 and 31 should not be used. Just make sure that they aren't already part of the data.
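A quick way to check that assumption before committing to those separators (a sketch; tData is whatever you read from the file):

Code: Select all

if numToChar(30) is in tData or numToChar(31) is in tData then
   answer "The data already contains ASCII 30 or 31 -- choose other separators."
end if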

About XML: yes, it bloats your data and I tend to avoid using it, but sometimes it is a nice way to include data in HTML and it is compatible with many systems. It also allows you to define encodings for individual items. So, it can be useful.

Kind regards,

Mark
The biggest LiveCode group on Facebook: https://www.facebook.com/groups/livecode.developers
The book "Programming LiveCode for the Real Beginner"! Get it here! http://tinyurl.com/book-livecode

FourthWorld
VIP Livecode Opensource Backer
Posts: 10057
Joined: Sat Apr 08, 2006 7:05 am
Contact:

Re: "CSV Must Die" o.k. but are there any tips and tricks?

Post by FourthWorld » Fri Nov 28, 2014 4:00 pm

Simon Knight wrote:I am interested in how the ASCII 11 and ASCII 4 characters might be used in a LiveCode application. Is it a case of replacing the ASCII 11s and ASCII 4s when the text is displayed in a field?

Unfortunately the data I am working with contains many "memo" fields containing CRLF, i.e. in-data returns, so I am experimenting with exporting tab-delimited files which use "~" as the text qualifier (as Microsoft calls it). When I import the resulting file into LiveCode I will have to replace the in-data CRLFs and tabs with the escape characters.
To replace characters you can use the "replace" command, e.g.:

Code: Select all

function GetEscapedItem pData, pItemNum, pLineNum
   set the itemdel to tab
   get item pItemNum of line pLineNum of pData
   replace numToChar(4) with tab in it
   replace numToChar(11) with cr in it
   return it
end GetEscapedItem

on PutEscapedItem pValue, pItemNum, pLineNum, @pData
   replace tab with numToChar(4) in pValue
   replace cr with numToChar(11) in pValue
   put pValue into item pItemNum of line pLineNum of pData
end PutEscapedItem
But of course that's only useful for data formatted with delimiters more efficient to work with than commas, which distinguish CSV. If your data is already in CSV, just use the handler provided in the article you cited.

The article closes with Postel's Law from RFC 1122:
"Be liberal in what you accept, and conservative in what you send."

What our apps read is determined by the past, over which we have no control. It's very useful to be able to read as many formats as practical; Postel suggests limiting exports to create an ever saner future. ;)

If you were writing a tool that might be used to exchange data with a wide range of programs then Postel's guidance may be helpful. But in your case, the target application for the data is the same as its origin, so the intermediary format hardly matters.

Alex Tweedly's CSV parsing algorithm provided in the article is one of the most efficient and complete implementations you'll find in the LiveCode community, handling the range of common exceptions and most of the various escaping and escape-escaping conventions you're likely to find, such as those provided in the famous sample data also included in the article.

Just copy that handler into your stack and enjoy. It'll save you many hours other programmers have to spend every year accounting for the dizzying variations of CSV in the wild.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

FourthWorld
VIP Livecode Opensource Backer
Posts: 10057
Joined: Sat Apr 08, 2006 7:05 am
Contact:

Re: "CSV Must Die" o.k. but are there any tips and tricks?

Post by FourthWorld » Fri Nov 28, 2014 4:02 pm

Dixie wrote:if anything should be in the firing line it should be XML, after all it only bloats your data...
Agreed. XML is a very different topic, with its own set of trade-offs, which is why I didn't address it at all in that article.

XML is indeed often overused, esp. for representing tabular data where simple delimiters (provided they appear less frequently in actual data than commas) can be both more compact and more efficient to parse.

Where XML is useful is for the tasks it was originally designed for: exchanging data between programs that know little if anything about each other. For such purposes, XML has become the go-to choice because, unlike CSV, it's both well-defined and consistent.

XML can also be useful for hierarchically-structured data, which can be difficult to express with simple delimiters. But even there, in recent years JSON has become more popular as it's more compact.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

Post Reply