CSV files
Moderators: FourthWorld, heatherlaine, Klaus, kevinmiller, robinmiller
OK, this might be really simple, but I just can't figure it out.
Let's say I have this information in my CSV file for userid, name, description:
1,"Bob Hope","He was short,tall,wide and funny"
When I import it and start using comma as the itemDelimiter, my problem is the commas inside the quotes.
Re: CSV files
To further screw with me, I just noticed that some of the entries have line breaks in them when I opened the CSV in Excel.
Like, line 4, column B might have:
This is some text
then more text, and more text.
When opening it in Notepad I only saw "This is some text then more text, and more text."; it didn't show the line break.
So basically I guess I need it to ignore the line breaks and commas inside the quotes.
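Outside LiveCode, the behavior being asked for is exactly what a standards-aware CSV parser does: quoted fields keep their commas and line breaks. A quick sketch using Python's csv module (the second record is invented here to show the embedded line break):

```python
import csv
import io

# One record from the post, plus an invented one with a quoted line break.
data = ('1,"Bob Hope","He was short,tall,wide and funny"\n'
        '2,"Jane Doe","This is some text\nthen more text, and more text."\n')

# csv.reader honors quotes: commas and newlines inside "" stay in one field.
rows = list(csv.reader(io.StringIO(data)))
for row in rows:
    print(row)
```

This record-aware reading is also why splitting on comma and return alone, as plain itemDelimiter/lineDelimiter parsing does, cannot work on quoted CSV.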
- VIP Livecode Opensource Backer
- Posts: 9669
- Joined: Wed May 06, 2009 2:28 pm
- Location: New York, NY
Re: CSV files
Hi,
If your text is that poorly formed, it will be difficult to parse. Ultimately, you must find at least one character that can be relied upon.
Is it true that the third quote in each "record" is always in its place? And that only one quote always follows it? If so (though this seems dicey to me), you might write a handler that counts out that character, starting from the first line, and parses as it goes.
Dicey, though.
Craig Newman
- VIP Livecode Opensource Backer
- Posts: 9842
- Joined: Sat Apr 08, 2006 7:05 am
- Location: Los Angeles
- Contact:
Re: CSV files
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: CSV files
Read Richard's essay on why CSV must die... Yep, see the link in his post above.
Re: CSV files
It all sucks in general. Best for me to tell the client to use tab-delimited files and, for goodness' sake, strip out the line breaks (or, as I usually do, replace them with ~).
Re: CSV files
Dealing with CSV files is a huge pain. The best way to handle them is to get hold of Alex Tweedly's handler for unpacking them. It handles all the situations you are encountering and is very fast. The code is in this list post:
http://runtime-revolution.278305.n4.nab ... 16066.html
Re: CSV files
Da_Elf wrote: "....best for me to tell the client to use tab delimited files and for goodness sake strip out the linebreaks. (or as i usually do replace them with ~)"
Probably best to just use the code provided in the article I posted.
As phaworth suggested, it's a tidied version of Tweedly's algo, about as efficient as you'll find.
Extra bonus points: the article also contains the classic test data used to verify CSV parsers, and a modded version that accounts for at least one of two variances found in the real world that the original classic set doesn't cover.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: CSV files
phaworth wrote: "Dealing with CSV files is a huge pain."
Umm, I would say first: try to know the data.
Checking Da_Elf's data, it appears that the CSV file has nothing wrong and is correctly formed:
every item which contains a comma or a CR is quoted; that's it.
Here is a working script based on setting the lineDelimiter to quote.
Code: Select all
on mouseUp
   local Ql, Lnumber
   -- Our lines are separated by quotes
   set the lineDelimiter to quote
   -- Which lines are quoted: odd or even ones?
   if char 1 of field "Fsource" is quote then put 1 into Ql
   else put 0 into Ql
   put empty into field "Fdest"
   lock screen
   put 1 into Lnumber
   repeat for each line aLine in field "Fsource"
      if (Lnumber + Ql) mod 2 is 0 then
         put transformQuotedLine(aLine) after field "Fdest"
      else
         put transformUnQuotedLine(aLine) after field "Fdest"
      end if
      add 1 to Lnumber
   end repeat
end mouseUp

private function transformQuotedLine L
   replace CR with "<CR>" in L
   return L
end transformQuotedLine

private function transformUnQuotedLine L
   replace comma with tab in L
   return L
end transformUnQuotedLine
This replaces the unquoted commas with tabs and leaves the quoted commas as they are.
Now you have the same data presented in a more conventional way for livecoders.
Setting back lineDelimiter to return and itemDelimiter to tab, and here we go...
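For readers outside LiveCode, the same quote-as-delimiter trick can be sketched in Python: splitting on the quote character leaves the quoted runs at odd 0-based positions. This assumes well-formed input as in Thierry's example (every field containing a comma or line break is quoted, and there are no doubled "" escapes):

```python
def csv_to_tsv(text):
    # Chunks at odd indices fell between a pair of quotes.
    out = []
    for i, chunk in enumerate(text.split('"')):
        if i % 2 == 1:
            out.append(chunk.replace('\n', '<CR>'))   # quoted: keep commas, tag line breaks
        else:
            out.append(chunk.replace(',', '\t'))      # unquoted: commas become tabs
    return ''.join(out)

print(csv_to_tsv('1,"Bob Hope","He was short,tall,wide and funny"'))
```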
Thierry
SUNNY-TDZ.COM doesn't belong to me since 2021.
To contact me, use the Private messages. Merci.
Re: CSV files
Thierry wrote: "Checking Da_Elf's data, it appears that the csv file has nothing wrong and is correctly formed."
Indeed it is. But CSV is a very broad collection of conventions only loosely considered a specification, in which all of the following are also correct:
"Fourth World","4","Richard ""Nickname"" Gaskin" -- in-data quotes escaped by doubling
"Sons O' Thunder",5,"Ken \"Da Man\" Ray" -- in-data quotes escape with backslash
"Canella Software", "6", "Mark Talluto" -- comma+space delimiters
"Bass R Us",7,"Klaus Major" -- unquoted numeric values
RunRev,8,Kevin Miller -- all values unquoted
And these are just a few of the variations we find in the wild.
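To illustrate how differently those variants must be parsed, here is a sketch using Python's csv module; the dialect parameters shown (doublequote, escapechar, skipinitialspace) are that module's knobs, not anything from the post:

```python
import csv
import io

def parse(line, **dialect):
    # Parse a single CSV line with the given dialect options.
    return next(csv.reader(io.StringIO(line), **dialect))

# In-data quotes escaped by doubling: the default dialect handles it.
doubled = parse('"Fourth World","4","Richard ""Nickname"" Gaskin"')

# Backslash escaping needs different dialect settings.
backslashed = parse('"Sons O\' Thunder",5,"Ken \\"Da Man\\" Ray"',
                    doublequote=False, escapechar='\\')

# Comma+space delimiters need skipinitialspace.
spaced = parse('"Canella Software", "6", "Mark Talluto"', skipinitialspace=True)

print(doubled, backslashed, spaced)
```

The point is that no single parser configuration accepts all of these at once; the reader has to know, or guess, which convention the file uses.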
The example data set in the "CSV Must Die" article I linked to is commonly used for validating CSV parsers, but even that classic data set is incomplete, missing at least one or two conventions I added in a modified version there.
For a one-off used only on a single data set, many quick solutions are possible. But when attempting to accept CSV files in general, the range of variations makes the task very difficult.
The now-famous Tweedly algorithm was the result of a long series of posts on the use-livecode list a few years back, in which many different options were explored both for performance and the ability to handle the dizzying variety of formatting options commonly found in CSV files.
So far our community has not come up with an alternative that performs as well while handling as many CSV oddities as Tweedly's does. It's likely possible; just about everything can always be improved. But given the strange mish-mash of conventions collectively called "CSV", it's a non-trivial task.
It's that very complexity and inherent inefficiency of CSV that led me to write the closing paragraphs in my article.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: CSV files
Hi,
It may be a confusion of terms: CSV initially meant "Comma Separated Values". But since this would make it rather useless, a broad range of text files containing table data separated with whatever chars are meanwhile also recognized as CSV - for instance, the good old tab-return format (TRSV?)
So we might use the other meaning, "Character Separated Values". And this is an invention by the gods of computing themselves - we can store any kind of table data using it, as long as we can agree for 2 (two) chars in the charset that we don't use in the data itself, and that we'll use as delimiters for fields and records.
Are there alternatives? Anyone said XML, really? For sure, we can blow up our data ad infinitum, doubling, tripling, quadrupling the byte count - and will still have to write parsers hand tailored to the very format.
Thus I love CSV. It's the common smallest denominator for table data, it can be tailored to suit any needs, and it has the smallest possible foot print.
Your data contain tabs, semicolons and line feeds? You'll surely find chars that are not used.
Your data need to be distinguished for text data and numeric data? Use ", or ', or whatever suits you.
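One concrete way to "find chars that are not used": ASCII reserved the unit (31) and record (30) separators for exactly this job. A minimal Python sketch, assuming the data never contains those control characters:

```python
US, RS = chr(31), chr(30)  # ASCII unit and record separators

rows = [["1", "Bob Hope", "short,tall,wide and funny"],
        ["2", "Jane", "text with \"quotes\", a tab\t and a\nline break"]]

# Encode: join fields with US and records with RS -- no quoting or escaping needed.
blob = RS.join(US.join(fields) for fields in rows)

# Decode: two plain splits recover the table exactly.
parsed = [record.split(US) for record in blob.split(RS)]
assert parsed == rows
```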
For sure, it takes some work to set such up. But it pays off.
Much non-LC software offers quite a few options for field, record and text delimiters. Usually you can also use or omit column headers.
The resulting CSV, especially when zipped, is the smallest human readable form of transferring table data. And it can be set up suiting any possible case, because of its flexibility.
So, please excuse, dear Richard, I'd just say: "Organize your data!" instead of "CSV Must Die". I'd rather advocate the claim to "Bury this XML finally, even if it breathes still!", but this is another topic ;-)
For the OP, he might try if it's possible to use other delimiters. If the delimiters are present in the data, you'll have problems anyways. Should the source program not be able to write meaningful data, something like Thierry showed above might be useful as "tool for cleansing incoming stuff".
CSV needs as much knowledge and thought as any other data format. But its versatility and sleekness make it this valuable, IMHO!
Have a good time, kind regards!
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!
Re: CSV files
Hi.
I had suggested:
"If your text is that poorly formed, it will be difficult to parse. Ultimately, you must find at least one character that can be relied upon. Is it true that the third quote in each "record" is always in its place? And that only one quote always follows it? If so, ..."
Quite a kluge, and the point is that CSV is so unruly that such herculean efforts are often necessary. You say that effort is a small price to pay. I say kill, kill...
Craig
Re: CSV files
You'll find tab-delimited mentioned in a number of places as a much more robust alternative to CSV. Even though it is an improvement, it is not perfect. But it is an example of a "character separated value" format, and it is not CSV, which is unfortunately recognised as "Comma Separated Values". We can all find some values that would be unique for separating fields and records in a given data file. If the format is CSV, that option is unavailable; and if we have control to produce the data file with chosen unique separators, then it is no longer CSV.
Organising the data is king. CSV does not constitute robust organisation. It's not even a consistent standard.
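The imperfection is easy to demonstrate: a tab inside a field silently changes the column count on the way back. A minimal Python sketch:

```python
fields = ["id 7", "a note with an embedded\ttab"]

# Writing is trivial...
line = "\t".join(fields)

# ...but reading splits the embedded tab too: three fields come back, not two.
roundtrip = line.split("\t")
print(roundtrip)
```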
Re: CSV files
Hi,
I think we're talking about semantics here. For sure, CSV isn't a standard. An RFC for it exists, but it doesn't seem to have ever reached approval.
Besides this, the term "Comma Separated Values" is grossly misleading, because it would make the format unusable for large parts of the world, where the comma is used as the decimal separator (am I right that the comma is used in Anglo-American notation as the separator for thousands?). No fractional numbers possible, nor nicely formatted ones? This doesn't make sense at all.
So, even if Wikipedia claims this name to be correct, it isn't. Nobody uses commas as field delimiters! The current common use is semicolons as field delimiters; check your office software!
Thus it can be said that:
- CSV isn't a standard or a special format, but
- it is a just a description for
- a wide range of "Character Separated Values", usually
- containing database values, arrays, table-like data.
CSV depends on knowledge of the delimiters used: field, record and text delimiters. (And, for sure, of the charset used.)
As soon as this is given, it works. All over the world, daily.
What else could we use to transmit table-like data? Binary formats? Bloated XML?
Well, we'd need a parser to interact with. And we'd need the knowledge of the special file format.
Where, then, is the difference to an agreed CSV variant? Only that a zipped CSV is smaller by multitudes, human readable, and easier to handle, right? Or did I forget something?
As mentioned, IMHO we're discussing semantics here. I may have another view of CSV than you, and you may not agree with my view of it as "Character Separated Values".
It may also be the difference in what we use LC for - I don't do fancy games, crazy apps or educational stuff: My job is to keep things going for various businesses, to grab data from one database (or a CSV ), compare it with another database, and update a third database with my results. Or even more esoteric stuff ...
I'm (to be precise: the servers my LC standalones run on) sending literally thousands of CSV packets each hour, like this real-world example of the acknowledgment of an order - and it's actually REAL CSV:
Code: Select all
ORD received: 212445, 84, 53a2a591deb6d
0154120,2.3,6
9141000,3.14,12
7008510,4.15,6
0856030,8.93,6
0856010,9.64,6
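Because no field in such a packet ever contains a comma, pulling it apart needs no quoting logic at all. A sketch in Python (the meaning of each column is not specified in the post, so the code only splits):

```python
packet = ("ORD received: 212445, 84, 53a2a591deb6d\n"
          "0154120,2.3,6\n"
          "9141000,3.14,12\n"
          "7008510,4.15,6\n"
          "0856030,8.93,6\n"
          "0856010,9.64,6")

# First line is the header; the rest are comma-separated item records.
header, *body = packet.splitlines()
items = [line.split(",") for line in body]
print(header)
print(items[0])
```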
So, please abandon your evil desire to kill poor old CSV!
It hasn't done any harm to you, and it would break my heart if I were forced to replace this nice, cute, little piece of datagram with an XML equivalent, where the markup alone would be more than the actual data ...
Back to topic:
I cannot check my old HyperCard stuff anymore (due to all my "real Macs" in Apple heaven), but in the good old days I wrote some stacks for administrative accounting. I used some self-made optimized externals for file-input, searching and -output (thx to CompileIt! *1 and MacsBug), and it ran like a charm even on my SE30.
All data was organized in plain text files (CSV!), and if I remember correctly, I used numtochar(28 - 30) as delimiters. No problems, never.
Have a good time, and don't mind my arguing, plz! Can we agree to disagree?
*1: Reading this gave me heart ache. Don't forget to follow the first link ("interview")! But only suited for old time Mac fans ...
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!
Re: CSV files
AxWald wrote: "it may be a confusion of terms, CSV initially meant "Comma Separated Values". But since this would make it rather useless, a broad range of text files containing table data separated with what-ever chars are also recognized as CSV meanwhile - for instance, the good old Tab-return format (TRSV?)"
Yes, as I described in the article I linked to earlier, tab-separated values are a much saner option for those cases when we have a choice, such as writing exporters.
AxWald wrote: "So we might use the other meaning, "Character Separated Values"."
We might, but as you note above, the most common definition for "CSV" is "Comma-Separated Values". So while some have later confused the acronym by using it to describe something else, given the prevalence of the "C" in "CSV" understood as "Comma" (even searching Wikipedia for "character separated values" resolves to the "comma-separated values" article), it's often helpful to be explicit when describing data relying on other delimiters.
The OP here included a comma-delimited example in his post, so it seems safe to assume his interest was in the traditional definition of "CSV".
To help distinguish delimiters used in file names, I've seen "*.tab" sometimes used for tab-delimited files, something I use myself when writing exporters.
AxWald wrote: "So, please excuse, dear Richard, I'd just say: "Organize your data!" instead of "CSV Must Die"."
Given the most common use of "CSV" as "Comma-Separated Values" and our shared enthusiasm for any saner alternative, I think we're on the same page.
The OP here has made it clear he's talking about comma-separated values, and to help clarify for those who might read the heading of my article but completely ignore its content, I've updated its heading to be more explicit:
CSV Must Die: A Plea to Introduce Sanity to the Software Development World by Pledging to Never Write Exporters for Comma-Separated Values
All that said, CSV importers are important, because while it would be nice to be able to simply "organize our data", in practice so much of the data we're asked to work with comes from sources we have no control over. Tragically, true CSV is very commonly used for a wide range of published data sets, and is the default choice for delimited export in many programs.
The Tweedly algo listed in the article is a very good one, with most of the strange variety of silliness used in the wild well accounted for with about as much efficiency as we could hope for with such an insane choice of delimiter.
One of the many benefits of the Tweedly algo is that the output is tab-delimited, which not only works well with most programs that handle delimited data like Microsoft Excel and LibreOffice Calc, but in LiveCode can be rendered in a field with multiple columns automatically.
AxWald wrote: "I'd rather advocate the claim to "Bury this XML finally, even if it breathes still!", but this is another topic."
Agreed on both fronts:
- XML can be a good choice for exchanging hierarchically-ordered data between programs that know little about each other, but is sorely overused for a great many other uses where simpler formats would be much more efficient.
- XML, JSON, BSON, and other formats are beyond the scope of a discussion focused on delimited tabular data, which is why I didn't mention them at all in my article.
AxWald wrote: "For the OP, he might try if it's possible to use other delimiters."
Agreed, and the Tweedly algo in the article I linked to will translate that into a saner format easily.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn