CSV files
Moderators: FourthWorld, heatherlaine, Klaus, kevinmiller, robinmiller
OK, this might be really simple, but I just can't figure it out.
Let's say I have this information in my CSV file for userid, name, description:
1,"Bob Hope","He was short,tall,wide and funny"
When I import it and start using comma as the itemDelimiter, my problem is the commas inside the quotes.
Re: CSV files
To further screw with me, I just noticed that some of the entries have line breaks in them when I opened the CSV in Excel.
Like, line 4, column B might have:
This is some text
then more text, and more text.
When opening it in Notepad I only saw "This is some text then more text, and more text."; it didn't show the line break.
So basically I guess I need it to ignore the line breaks and commas inside the quotes.
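Outside LiveCode, the behavior being asked for is exactly what a standards-aware CSV parser does: quoted fields keep their commas and line breaks. A quick sketch using Python's csv module (the second record is invented here to show the embedded line break):

```python
import csv
import io

# One record from the post, plus an invented one with a quoted line break.
data = ('1,"Bob Hope","He was short,tall,wide and funny"\n'
        '2,"Jane Doe","This is some text\nthen more text, and more text."\n')

# csv.reader honors quotes: commas and newlines inside "" stay in one field.
rows = list(csv.reader(io.StringIO(data)))
for row in rows:
    print(row)
```

This record-aware reading is also why splitting on comma and return alone, as plain itemDelimiter/lineDelimiter parsing does, cannot work on quoted CSV.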
- VIP Livecode Opensource Backer
- Posts: 9669
- Joined: Wed May 06, 2009 2:28 pm
- Location: New York, NY
Re: CSV files
Hi,
If your text is that poorly formed, it will be difficult to parse. Ultimately, you must find at least one character that can be relied upon.
Is it true that the third quote in each "record" is always in its place? And that only one quote always follows it? If so (though this seems dicey to me), you might write a handler that counts out that character, starting from the first line, and parses as it goes.
Dicey, though.
Craig Newman
- VIP Livecode Opensource Backer
- Posts: 9842
- Joined: Sat Apr 08, 2006 7:05 am
- Location: Los Angeles
- Contact:
Re: CSV files
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: CSV files
Read Richard's essay on why CSV must die... Yep, see the link in his post above.
Re: CSV files
It all sucks in general. Best for me to tell the client to use tab-delimited files and, for goodness' sake, strip out the line breaks (or, as I usually do, replace them with ~).
Re: CSV files
Dealing with CSV files is a huge pain. The best way to handle them is to get hold of Alex Tweedly's handler for unpacking them. It handles all the situations you are encountering and is very fast. The code is in this list post:
http://runtime-revolution.278305.n4.nab ... 16066.html
Re: CSV files
Da_Elf wrote: "....best for me to tell the client to use tab delimited files and for goodness sake strip out the linebreaks. (or as i usually do replace them with ~)"
Probably best to just use the code provided in the article I posted.
As phaworth suggested, it's a tidied version of Tweedly's algo, about as efficient as you'll find.
Extra bonus points: the article also contains the classic test data used to verify CSV parsers, and a modded version that accounts for at least one of two variances found in the real world that the original classic set doesn't cover.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: CSV files
phaworth wrote: "Dealing with CSV files is a huge pain."
Umm, I would say first: try to know the data.
Checking Da_Elf's data, it appears that the CSV file has nothing wrong and is correctly formed:
every item which contains a comma or a CR is quoted; that's it.
Here is a working script based on setting the lineDelimiter to quote.
Code: Select all
on mouseUp
   local Ql, Lnumber
   -- Our lines are separated by quotes
   set the lineDelimiter to quote
   -- Which lines are quoted: odd or even ones?
   if char 1 of field "Fsource" is quote then put 1 into Ql
   else put 0 into Ql
   put empty into field "Fdest"
   lock screen
   put 1 into Lnumber
   repeat for each line aLine in field "Fsource"
      if (Lnumber + Ql) mod 2 is 0 then
         put transformQuotedLine(aLine) after field "Fdest"
      else
         put transformUnQuotedLine(aLine) after field "Fdest"
      end if
      add 1 to Lnumber
   end repeat
end mouseUp

private function transformQuotedLine L
   replace CR with "<CR>" in L
   return L
end transformQuotedLine

private function transformUnQuotedLine L
   replace comma with tab in L
   return L
end transformUnQuotedLine
This replaces the unquoted commas with tabs and leaves the quoted commas as they are.
Now you have the same data presented in a more conventional way for livecoders.
Setting back lineDelimiter to return and itemDelimiter to tab, and here we go...
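For readers outside LiveCode, the same quote-as-delimiter trick can be sketched in Python: splitting on the quote character leaves the quoted runs at odd 0-based positions. This assumes well-formed input as in Thierry's example (every field containing a comma or line break is quoted, and there are no doubled "" escapes):

```python
def csv_to_tsv(text):
    # Chunks at odd indices fell between a pair of quotes.
    out = []
    for i, chunk in enumerate(text.split('"')):
        if i % 2 == 1:
            out.append(chunk.replace('\n', '<CR>'))   # quoted: keep commas, tag line breaks
        else:
            out.append(chunk.replace(',', '\t'))      # unquoted: commas become tabs
    return ''.join(out)

print(csv_to_tsv('1,"Bob Hope","He was short,tall,wide and funny"'))
```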
Thierry
SUNNY-TDZ.COM doesn't belong to me since 2021.
To contact me, use the Private messages. Merci.
Re: CSV files
Thierry wrote: "Checking Da_Elf's data, it appears that the csv file has nothing wrong and is correctly formed."
Indeed it is. But CSV is a very broad collection of conventions only loosely considered a specification, in which all of the following are also correct:
"Fourth World","4","Richard ""Nickname"" Gaskin" -- in-data quotes escaped by doubling
"Sons O' Thunder",5,"Ken \"Da Man\" Ray" -- in-data quotes escape with backslash
"Canella Software", "6", "Mark Talluto" -- comma+space delimiters
"Bass R Us",7,"Klaus Major" -- unquoted numeric values
RunRev,8,Kevin Miller -- all values unquoted
And these are just a few of the variations we find in the wild.
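To illustrate how differently those variants must be parsed, here is a sketch using Python's csv module; the dialect parameters shown (doublequote, escapechar, skipinitialspace) are that module's knobs, not anything from the post:

```python
import csv
import io

def parse(line, **dialect):
    # Parse a single CSV line with the given dialect options.
    return next(csv.reader(io.StringIO(line), **dialect))

# In-data quotes escaped by doubling: the default dialect handles it.
doubled = parse('"Fourth World","4","Richard ""Nickname"" Gaskin"')

# Backslash escaping needs different dialect settings.
backslashed = parse('"Sons O\' Thunder",5,"Ken \\"Da Man\\" Ray"',
                    doublequote=False, escapechar='\\')

# Comma+space delimiters need skipinitialspace.
spaced = parse('"Canella Software", "6", "Mark Talluto"', skipinitialspace=True)

print(doubled, backslashed, spaced)
```

The point is that no single parser configuration accepts all of these at once; the reader has to know, or guess, which convention the file uses.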
The example data set in the "CSV Must Die" article I linked to is commonly used for validating CSV parsers, but even that classic data set is incomplete, missing at least one or two conventions I added in a modified version there.
For a one-off used only on a single data set, many quick solutions are possible. But when attempting to accept CSV files in general, the range of variations makes the task very difficult.
The now-famous Tweedly algorithm was the result of a long series of posts on the use-livecode list a few years back, in which many different options were explored both for performance and the ability to handle the dizzying variety of formatting options commonly found in CSV files.
So far our community has not come up with an alternative that performs as well while handling as many CSV oddities as Tweedly's does. It's likely possible; just about everything can always be improved. But given the strange mish-mash of conventions collectively called "CSV", it's a non-trivial task.
It's that very complexity and inherent inefficiency of CSV that led me to write the closing paragraphs in my article.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: CSV files
Hi,
It may be a confusion of terms: CSV initially meant "Comma Separated Values". But since this would make it rather useless, a broad range of text files containing table data separated with whatever chars are meanwhile also recognized as CSV - for instance, the good old tab-return format (TRSV?)
So we might use the other meaning, "Character Separated Values". And this is an invention by the gods of computing themselves - we can store any kind of table data using it, as long as we can agree for 2 (two) chars in the charset that we don't use in the data itself, and that we'll use as delimiters for fields and records.
Are there alternatives? Anyone said XML, really? For sure, we can blow up our data ad infinitum, doubling, tripling, quadrupling the byte count - and will still have to write parsers hand tailored to the very format.
Thus I love CSV. It's the common smallest denominator for table data, it can be tailored to suit any needs, and it has the smallest possible foot print.
Your data contain tabs, semicolons and line feeds? You'll surely find chars that are not used.
Your data need to be distinguished for text data and numeric data? Use ", or ', or whatever suits you.
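One concrete way to "find chars that are not used": ASCII reserved the unit (31) and record (30) separators for exactly this job. A minimal Python sketch, assuming the data never contains those control characters:

```python
US, RS = chr(31), chr(30)  # ASCII unit and record separators

rows = [["1", "Bob Hope", "short,tall,wide and funny"],
        ["2", "Jane", "text with \"quotes\", a tab\t and a\nline break"]]

# Encode: join fields with US and records with RS -- no quoting or escaping needed.
blob = RS.join(US.join(fields) for fields in rows)

# Decode: two plain splits recover the table exactly.
parsed = [record.split(US) for record in blob.split(RS)]
assert parsed == rows
```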
For sure, it takes some work to set such up. But it pays off.
Much non-LC software offers quite a few options for field, record and text delimiters. Usually you can also use or omit column headers.
The resulting CSV, especially when zipped, is the smallest human readable form of transferring table data. And it can be set up suiting any possible case, because of its flexibility.
So, please excuse, dear Richard, I'd just say: "Organize your data!" instead of "CSV Must Die". I'd rather advocate the claim to "Bury this XML finally, even if it breathes still!", but this is another topic ;-)
For the OP, he might try if it's possible to use other delimiters. If the delimiters are present in the data, you'll have problems anyways. Should the source program not be able to write meaningful data, something like Thierry showed above might be useful as "tool for cleansing incoming stuff".
CSV needs as much knowledge and thought as any other data format. But its versatility and sleekness make it this valuable, IMHO!
Have a good time, kind regards!
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!
Re: CSV files
Hi.
I had suggested:
"If your text is that poorly formed, it will be difficult to parse. Ultimately, you must find at least one character that can be relied upon. Is it true that the third quote in each "record" is always in its place? And that only one quote always follows it? If so, ..."
Quite a kluge, and the point is that CSV is so unruly that such herculean efforts are often necessary. You say that effort is a small price to pay. I say kill, kill...
Craig
Re: CSV files
You'll find tab-delimited mentioned in a number of places as a much more robust alternative to CSV. Even though it is an improvement, it is not perfect. But it is an example of a "character separated value" format, and it is not CSV, which is unfortunately recognised as "Comma Separated Values". We can all find some values that would be unique for separating fields and records in a given data file. If the format is CSV, that option is unavailable; and if we have control to produce the data file with chosen unique separators, then it is no longer CSV.
Organising the data is king. CSV does not constitute robust organisation. It's not even a consistent standard.
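The imperfection is easy to demonstrate: a tab inside a field silently changes the column count on the way back. A minimal Python sketch:

```python
fields = ["id 7", "a note with an embedded\ttab"]

# Writing is trivial...
line = "\t".join(fields)

# ...but reading splits the embedded tab too: three fields come back, not two.
roundtrip = line.split("\t")
print(roundtrip)
```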
Re: CSV files
Hi,
I think we're talking about semantics here. For sure, CSV isn't a standard. An RFC for it exists, but it doesn't seem to have ever reached approval.
Besides this, the term "Comma Separated Values" is grossly misleading, because it would make the format unusable for large parts of the world, where the comma is used as the decimal separator (am I right that the comma is used in Anglo-American notation as the separator for thousands?). No fractional numbers possible, nor nicely formatted ones? This doesn't make sense at all.
So, even if Wikipedia claims this name to be correct, it isn't. Nobody uses commas as field delimiters! The current common use is semicolons as field delimiters; check your office software!
Thus it can be said that:
- CSV isn't a standard or a special format, but
- it is a just a description for
- a wide range of "Character Separated Values", usually
- containing database values, arrays, table-like data.
CSV depends on knowledge of the delimiters used: field, record and text delimiters. (And, for sure, of the charset used.)
As soon as this is given, it works. All over the world, daily.
What else could we use to transmit table-like data? Binary formats? Bloated XML?
Well, we'd need a parser to interact with. And we'd need the knowledge of the special file format.
Where, then, is the difference to an agreed CSV variant? Only that a zipped CSV is smaller by multitudes, human readable, and easier to handle, right? Or did I forget something?
As mentioned, IMHO we're discussing semantics here. I may have another view of CSV than you, and you may not agree with my view of it as "Character Separated Values".
It may also be the difference in what we use LC for - I don't do fancy games, crazy apps or educational stuff: My job is to keep things going for various businesses, to grab data from one database (or a CSV ), compare it with another database, and update a third database with my results. Or even more esoteric stuff ...
I'm (to be precise: the servers my LC standalones run on) sending literally thousands of CSV packets each hour, like this real-world example of the acknowledgment of an order - and it's actually REAL CSV:
Code: Select all
ORD received: 212445, 84, 53a2a591deb6d
0154120,2.3,6
9141000,3.14,12
7008510,4.15,6
0856030,8.93,6
0856010,9.64,6
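Because no field in such a packet ever contains a comma, pulling it apart needs no quoting logic at all. A sketch in Python (the meaning of each column is not specified in the post, so the code only splits):

```python
packet = ("ORD received: 212445, 84, 53a2a591deb6d\n"
          "0154120,2.3,6\n"
          "9141000,3.14,12\n"
          "7008510,4.15,6\n"
          "0856030,8.93,6\n"
          "0856010,9.64,6")

# First line is the header; the rest are comma-separated item records.
header, *body = packet.splitlines()
items = [line.split(",") for line in body]
print(header)
print(items[0])
```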
So, please abandon your evil desire to kill poor old CSV!
It hasn't done any harm to you, and it would break my heart if I were forced to replace this nice, cute, little piece of datagram with an XML equivalent, where the markup alone would be more than the actual data ...
Back to topic:
I cannot check my old HyperCard stuff anymore (due to all my "real Macs" in Apple heaven), but in the good old days I wrote some stacks for administrative accounting. I used some self-made optimized externals for file-input, searching and -output (thx to CompileIt! *1 and MacsBug), and it ran like a charm even on my SE30.
All data was organized in plain text files (CSV!), and if I remember correctly, I used numtochar(28 - 30) as delimiters. No problems, never.
Have a good time, and don't mind my arguing, plz! Can we agree to disagree?
*1: Reading this gave me heart ache. Don't forget to follow the first link ("interview")! But only suited for old time Mac fans ...
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!
Re: CSV files
AxWald wrote: "it may be a confusion of terms, CSV initially meant "Comma Separated Values". But since this would make it rather useless, a broad range of text files containing table data separated with what-ever chars are also recognized as CSV meanwhile - for instance, the good old Tab-return format (TRSV?)"
Yes, as I described in the article I linked to earlier, tab-separated values are a much saner option for those cases when we have a choice, such as writing exporters.
AxWald wrote: "So we might use the other meaning, "Character Separated Values"."
We might, but as you note above, the most common definition for "CSV" is "Comma-Separated Values". So while some have later confused the acronym by using it to describe something else, given the prevalence of the "C" in "CSV" understood as "Comma" (even searching Wikipedia for "character separated values" resolves to the "comma-separated values" article), it's often helpful to be explicit when describing data relying on other delimiters.
The OP here included a comma-delimited example in his post, so it seems safe to assume his interest was in the traditional definition of "CSV".
To help distinguish delimiters used in file names, I've seen "*.tab" sometimes used for tab-delimited files, something I use myself when writing exporters.
AxWald wrote: "So, please excuse, dear Richard, I'd just say: "Organize your data!" instead of "CSV Must Die"."
Given the most common use of "CSV" as "Comma-Separated Values" and our shared enthusiasm for any saner alternative, I think we're on the same page.
The OP here has made it clear he's talking about comma-separated values, and to help clarify for those who might read the heading of my article but completely ignore its content, I've updated its heading to be more explicit:
CSV Must Die: A Plea to Introduce Sanity to the Software Development World by Pledging to Never Write Exporters for Comma-Separated Values
All that said, CSV importers are important, because while it would be nice to be able to simply "organize our data", in practice so much of the data we're asked to work with comes from sources we have no control over. Tragically, true CSV is very commonly used for a wide range of published data sets, and is the default choice for delimited export in many programs.
The Tweedly algo listed in the article is a very good one, with most of the strange variety of silliness used in the wild well accounted for with about as much efficiency as we could hope for with such an insane choice of delimiter.
One of the many benefits of the Tweedly algo is that the output is tab-delimited, which not only works well with most programs that handle delimited data like Microsoft Excel and LibreOffice Calc, but in LiveCode can be rendered in a field with multiple columns automatically.
AxWald wrote: "I'd rather advocate the claim to "Bury this XML finally, even if it breathes still!", but this is another topic."
Agreed on both fronts:
- XML can be a good choice for exchanging hierarchically-ordered data between programs that know little about each other, but is sorely overused for a great many other uses where simpler formats would be much more efficient.
- XML, JSON, BSON, and other formats are beyond the scope of a discussion focused on delimited tabular data, which is why I didn't mention them at all in my article.
AxWald wrote: "For the OP, he might try if it's possible to use other delimiters."
Agreed, and the Tweedly algo in the article I linked to will translate that into a saner format easily.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn