Escapes in string literals

DarScott · Post by **DarScott** » Sun Jun 02, 2013 12:07 am

At the end of long days, I type "and" for &.

The concatenation operator does have a learning cost. It is an important cost in learning LiveCode. I don't consider it optional. (Though beginners can use 'before' and 'after'.) That includes using concatenation with variables and constants. But, once learned, then only the name of the quote constant is needed to create strings with quotes. That is shared with comma, tab and whatever the new line name is. Whoops--by saying "whatever the new line name is", I have just given more weight to backslash notation.

So, I think that difference in learning cost is the difference in learning backslash notation vs learning that there are names for some strings. I think that delta cost is less for the latter.

I don't like the idea of having to do a double backslash to create strings that are used by interpreters that use backslash characters. So, if all of the backslashes are preprocessed for format() then it only has to do the value insertions. The language is maybe simpler in one aspect if that preprocessing is everywhere.

You are wearing me down, Monte, but maybe I just go with the breeze.

I still don't like the idea of breaking things. Maybe there is a way the IDE can say, "This stack contains backslashes." I don't know how it would know whether I have already reviewed those or not.

I wonder whether numToChar() would hardly ever be used in a world that had backslash escapes in LiveCode strings.

mwieder · Post by **mwieder** » Sun Jun 02, 2013 1:50 am

At the end of long days, I type "and" for &.

<sigh> I think we all do.
I'd prefer to have the catenation operator be "+", but I realize we're past that point.

I find "\t" easier to read than "numtochar(9)". I like to think it's because there's one fewer translation step, but maybe it's because I'm already used to reading escaped characters. It's hard to step back in time and look at this with fresh eyes. I find

Code:

put "he said 'they went to Billy\'s Diner'" into tDialog

pretty easy to read and understand.

I can't think of a situation where I've wanted to put a backslash in a string literal. LiveCode deals with forward slashes in file paths on all platforms. The only thing I can think of that might get me into trouble is passing a file path to an external library. Do you have any other examples that might break if escaping were allowed by default in all string literals?

DarScott · Post by **DarScott** » Sun Jun 02, 2013 3:27 am

Windows shell()?

I'm wondering if this is so basic language-wise that this needs a broader participation in discussion.

I wonder if there are some standards for including such characters that are not wordy or tied to some language or C-olatry.

We are not only well past "+" but we are deep into the emphasis on strings and "+" applies to numerals as addition. (By numeral, I mean a string representing a number.) I would like to see numbers completely disappear (virtually), but that might be hard. I think that the language would be simpler, but I don't think I can convince people. That is, the result of arithmetic is a virtual string. There are a lot of issues, but I think they are solvable. But, regardless of my daydream, numbers and numerals are blurred and that means "+" cannot distinguish the kinds of values to know whether to concatenate or add.
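To make the ambiguity concrete, here is a rough Python sketch. It assumes a toy model in which every value is a string and a "numeral" is any string that parses as a number; `is_numeral`, `overloaded_plus` and `concat` are made-up names for illustration, not anything in LiveCode.

```python
# Toy model (an assumption for illustration): all values are strings,
# and a "numeral" is a string that happens to parse as a number.

def is_numeral(s: str) -> bool:
    """Return True if the string can be read as a number."""
    try:
        float(s)
        return True
    except ValueError:
        return False

def overloaded_plus(a: str, b: str) -> str:
    """One possible rule for an overloaded '+': add if both sides look numeric."""
    if is_numeral(a) and is_numeral(b):
        total = float(a) + float(b)
        # Render whole numbers without a trailing ".0"
        return str(int(total)) if total.is_integer() else str(total)
    return a + b

def concat(a: str, b: str) -> str:
    """A dedicated concatenation operator never has to guess."""
    return a + b

print(overloaded_plus("12", "34"))  # the values decide: this adds
print(concat("12", "34"))           # the operator decides: this joins
print(overloaded_plus("12", "ab"))  # same operator, now it joins
```

With a single overloaded operator, the author of `"12" + "34"` cannot say which result was wanted; a separate operator carries that intent in the script itself.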

mwieder · Post by **mwieder** » Sun Jun 02, 2013 4:02 am

I wasn't really that serious about "+" for string catenation, although it would bring parity with JavaScript. But "&" and "&&" aren't really natural language forms either... we've just gotten used to using them (when we don't use "and" instead).

I do throw backslashes into Windows shell scripts, and I'd happily rewrite all those in exchange for ubiquitous character escaping, but that's just me.

As an alternative, we could say that characters are always escaped if the string is enclosed in single quotes, but I'd have a hard time remembering that.

jacque · Post by **jacque** » Sun Jun 02, 2013 6:50 pm

I'll jump in as a "fresh eyes" person who doesn't speak C. I have no problem with backslash-escaped characters and it seems more readable to me than anything else suggested here. The format command uses it, BBEdit uses it, and it's an easy thing for people who come from other languages. I'd say go with what most (other-language) people are used to, and what many LiveCoders are already using in the existing regex-related functions.

Back to lurking.

SparkOut · Post by **SparkOut** » Sun Jun 02, 2013 7:02 pm

What Jacque said.

DarScott · Post by **DarScott** » Sun Jun 02, 2013 9:24 pm

If escapes are allowed, I would hope for this:

Special string literal parsing for format() and matchText() goes away.

All escapes are easy to type, and perhaps consist of ASCII characters only.

No doors are blocked for Unicode in script string literals in the future.

There is a growth path so that Unicode characters can be represented as escape sequences, as consistent with upcoming changes for Unicode in LiveCode.

Arbitrary byte sequences can be readily represented. (Unless new language features make those unneeded.) The doors should be open for adding decimal byte representations.

This kind of literal is allowed everywhere, including constants. There is only one kind of string literal.

There is no need for nested escapes in normal LiveCode use. That is, unless working with some library or external system, there is normally no need for a lot of backslashes.

Colorization should make that readily readable.

It should be like the escape sequences currently in format(), I think, but I'm open to ideas. New things not known to format() should follow examples in languages such as C, Python, Haskell or Scheme, if they have them. But we should handle Unicode well even if those do it poorly, and we can add new kinds of escapes. (If octal got lost, I personally wouldn't mind, but it should be documented. I haven't been on a PDP-8 in years, I don't fool with file privileges using octal, and I'd rather the computer do UTF8 for me. Same with bell: I don't use a teletype any more. I'll look aside if that goes away.)

Performance should be consistent with the compiler making the transformation, not some runtime execution. Currently, get format() takes about three times as long as get "...". This should have the speed of the latter.

Help should be available with or in the IDE that brings this change (and maybe most changes that potentially break things). But, perhaps some mitigation of pain can be done other ways.

There should be no holes in the characters and bytes that can be represented.

monte · Post by **monte** » Sun Jun 02, 2013 9:45 pm

The & operator is where I was going when I discussed implied concatenation, but I imagine that would be quite complicated to implement... No matter what char we use, it breaks the readability and makes the statement less English-like.

DarScott · Post by **DarScott** » Sun Jun 02, 2013 10:45 pm

String concatenation operators...

Mathematica uses <>.
Matlab uses a separate function but array concatenation can be used.
Maple uses a separate function.
I think PHP uses dot. This might look too much like a C structure.
Lua uses "..".
JavaScript uses +.

I have seen juxtaposition for concatenation in algebra articles. I have also seen ||, but I don't like that because I use it for the reciprocal of the sum of reciprocals, as in the resistance of parallel resistors. I can learn to live with it. I think mathematicians have used double-plus, but I'm not sure. I think I have seen comma, maybe inside some context brackets. I think I have seen plus with a circle, but that is often associated with exclusive-or.

The operator should feel associative but not commutative. Yet, most operators above look symmetrical. Mathematically, it feels as though they shouldn't. Yet we have subtraction using the symmetrical minus sign. OK, then, it shouldn't look like an operator currently used for a commutative operation; that is, under this consideration, + is the worst.
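As a quick check of "associative but not commutative", here is a tiny Python sketch. Python happens to use + for concatenation, which is exactly the choice being questioned, but the two properties are easy to see with it; the variable names are arbitrary.

```python
a, b, c = "red", "green", "blue"

# Associative: grouping does not change the result.
left_grouped = (a + b) + c
right_grouped = a + (b + c)
print(left_grouped == right_grouped)

# Not commutative: swapping the operands does change the result.
forward = a + b
backward = b + a
print(forward == backward)
```

Numeric addition would pass both comparisons, which is why borrowing its symbol for concatenation can mislead.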

In English, we might use "followed by" but that might look like a verb modifier rather than a noun modifier. Maybe "appended to" is more likely to be read as a noun modifier. That might be shortened to appTo, or aTo or at or @. There is "Stuck to", but I don't know if $ has any advantage over @. Hmmm, if the copyright symbol was on my keyboard, I might mention "concatenated with" or the like. I don't mind if "appended to" is an alternative to @.

Because mathematicians sometimes use juxtaposition (putting things next to each other), that might get an extra point for consideration, but mathematicians use that for multiplication and function application and probably other things, so that might be goofy.

monte · Post by **monte** » Mon Jun 03, 2013 12:08 am

Hmm... I think @ would have the same issues as &. I don't mind . (dot) but it does have meaning in a sentence, so in our English-like language I'd say it should be avoided. Like I said before, I think no matter what character we use it reduces readability.

monte · Post by **monte** » Mon Jun 03, 2013 2:52 am

I changed the topic to what it should have been... blame the jet lag...

PS... not sure why we would need to escape an apostrophe if we didn't add the option to use either single or double quotes for string literals.

First of all, the writing is on the wall for the way the input string to 'format()' works. There is a huge problem with format() and matchText() - for the argument list, the way script is tokenized is changed. This means that what a token means is different depending on whether it is within the parentheses for format()/matchText() or whether it is within the parentheses of anywhere else in the language. This is because the way strings are tokenized is changed (see 'allowescapes') during parsing. This is a really really bad idea because it means that the syntactic structure of the script has a direct effect on the lexical structure of the script. (I don't think I can over-stress just how bad an idea this is in terms of language design - it means the parsing and lexical phases cannot be entirely separated, and it heavily restricts the type of parsing you can use).
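A rough way to picture the problem is the Python sketch below. It is not the engine's code: the escape table, the regex-based scan and the `allow_escapes` flag are all assumptions standing in for the 'allowescapes' switch mentioned above.

```python
import re

# Assumed subset of escapes; the real format() table may differ.
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\"}

def string_value(body: str, allow_escapes: bool) -> str:
    """Return the value of a literal's body under one of two lexical rules."""
    if not allow_escapes:
        return body  # a backslash is just an ordinary character
    # Replace known escapes; leave unknown ones untouched.
    return re.sub(r"\\(.)", lambda m: ESCAPES.get(m.group(1), m.group(0)), body)

def literal_values(script: str) -> list:
    """Collect the value of every double-quoted literal in the script.

    The check on the preceding text is a crude stand-in for parser state:
    the lexer needs syntactic context before it knows what a literal means.
    """
    values = []
    for m in re.finditer(r'"([^"]*)"', script):
        before = script[:m.start()].rstrip().lower()
        in_format_call = before.endswith(("format(", "matchtext("))
        values.append(string_value(m.group(1), allow_escapes=in_format_call))
    return values

script = r'put "a\tb" into x' + "\n" + r'put format("a\tb") into y'
plain, formatted = literal_values(script)
print(repr(plain))      # backslash and "t" kept as two characters
print(repr(formatted))  # a real tab character
```

The same six characters of source produce two different values, and the only thing that differs is where the literal sits in the syntax.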

So... just say we had escapes everywhere would that resolve the problems with escapes in format so we could use the standard \" instead of inventing \q for the purpose?

but we should handle Unicode well

Unfortunately to handle Unicode we would be mixing our encodings... Unicode characters are usually escaped as \u.. But if that was expanded to a UTF8 char then the rest of the string might be MacRoman or ISO... I guess we could assume that the author knows that the rest of the string can only be ASCII... I'm still perplexed about our Unicode future and scripts/custom property names etc. Currently they are natively encoded and translated when the file is loaded if necessary. Ideally we would move to UTF8 on all platforms but I think there's some reason @runrevmark found not to do that... so will we be able to set our scripts to use UTF16? And what will we do about all the little indians???

DarScott · Post by **DarScott** » Mon Jun 03, 2013 4:03 am

BTW, my brain was confusing @ and & a bit back there. I think the little gray cells put them in the same shoebox labelled "A".

I never got into trading baseball cards, but this might be similar. I think in Congress, it is called logrolling. I might be willing to back escapes if it means uniform string literals, and transition pain is mitigated. And there is something that indicates we are not on a path of willy-nilly changes and breaking backward compatibility, that such changes are made only after much rending of garments and pouring ashes on our heads. So to speak.

For those who never touch "advanced" things like format() and matchText(), this might complicate the notion of string literals, but only slightly. The backslash is the switch that opens the door to new complexity, and that can be avoided by just not using it. For those who have touched format(), this simplifies the language. That is, pedagogically in the new world, simple strings do not have quotes or backslashes. Advanced strings have escapes. One can open a door to learn about those. In a text or reference, it is a table or box one can jump over. The notions of format() and matchText() do not enter the picture. (Some beginners want to type Enter in a string, and I don't know if escapes help there or not, but they might: the existence of \n in the table might be a clue for those who do look at the table.)

For me (should this happen), I am even looking forward to much more flexibility in constants.

(And should it seem cleaner for the language to drop bell, backspace, VT, formfeed and octal, I'm fine with that, especially since they are less important in flattening arrays these days, and can always be done with \xnn. We can add some escapes for bytes and Unicode with the table less cluttered. If they are important, even then the ASCII name as in Haskell, or the Unicode name, can be used for all ASCII control characters.)

monte · Post by **monte** » Mon Jun 03, 2013 6:44 am

6 pages of forum posts feels like rending of garments to me

Regarding typing enter in a string do you mean something like this is valid:

Code:

put "hello
    world" into tVar

Certainly the most readable of the options, but it doesn't resolve the quote issue and I'm guessing it would be complicated to implement... also whitespace before line 2 or after line 1 would be hard to work with...

LCMark · Post by **LCMark** » Mon Jun 03, 2013 9:40 am

Focusing back on the main discussion about bringing C-style formatted strings to the language then, as I said before, this is perfectly feasible at the point the new parser kicks in and people need to translate their scripts to take advantage of it. The translator will be able to handle rewriting the string literals appropriately. An important thing to remember here is that we are just talking about the interpretation of the tokens in the script - the act of consuming the string literal at the lexical analysis phase will result in a value that has the appropriate content (i.e. escapes resolved) and as such causes no performance impact.
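As an illustration only, the two halves of that idea might look like the Python sketch below. `migrate_literal` and `decode_literal` are hypothetical names, and the escape table is an assumed subset; nothing here reflects the actual translator or parser.

```python
# Assumed subset of escapes for the sketch.
ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}

def migrate_literal(old_body: str) -> str:
    """Rewrite an old-style literal body, where a backslash is an ordinary
    character, so it keeps its meaning once a backslash starts an escape."""
    return old_body.replace("\\", "\\\\")

def decode_literal(new_body: str) -> str:
    """Resolve escapes once, as a lexer would when it consumes the token."""
    out = []
    i = 0
    while i < len(new_body):
        ch = new_body[i]
        if ch == "\\":
            if i + 1 >= len(new_body):
                raise ValueError("dangling backslash at end of literal")
            nxt = new_body[i + 1]
            if nxt not in ESCAPES:
                raise ValueError(f"unknown escape \\{nxt}")
            out.append(ESCAPES[nxt])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

old = r"C:\temp\new"             # e.g. a path passed to a Windows shell
migrated = migrate_literal(old)  # what the translated script would contain
print(migrated)
print(decode_literal(migrated) == old)
```

Because decoding happens when the token is read, the running script only ever sees the finished value, which is the sense in which there is no runtime cost.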

In terms of Unicode, the \u type escapes simply mean insert a character in the string at this point that has the specific Unicode code-point. This will work fine when the engine is endowed with Unicode capability and be transparent to the user. (By transparent I mean that what internal encoding the string is won't matter to script - script will just deal with all strings as a sequence of chars and the engine will handle any internal juggling between native or whatever-unicode-encoding-is-necessary - the only time script will need to care is when exporting and importing text).

6 pages of forum posts feels like rending of garments to me

Well, this is a substantial change to the language (everyone uses string constants!) so I think it's worth rending a few garments to make sure it is the right path.

monte · Post by **monte** » Mon Jun 03, 2013 9:53 am

Hmm... but \u<code-point> is translated to different bytes depending on the type of unicode we are talking about. Or would a string literal only support UTF-8?
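For reference, a small Python sketch of the distinction in play: the escape names a code point, and the bytes only appear once an encoding is chosen. U+20AC is just an arbitrary example character, and Python's codecs stand in for whatever the engine would do on export.

```python
code_point = 0x20AC      # EURO SIGN, as an escape like \u20AC would name it
ch = chr(code_point)     # one character, whatever the internal storage

encodings = ["utf-8", "utf-16-le", "utf-16-be"]
encoded = {enc: ch.encode(enc) for enc in encodings}

for enc, data in encoded.items():
    print(enc, data.hex())
# utf-8     e282ac
# utf-16-le ac20
# utf-16-be 20ac

# Every byte form decodes back to the same single character.
round_trips = [data.decode(enc) == ch for enc, data in encoded.items()]
print(all(round_trips))
```

So the literal itself need not commit to UTF-8 or any byte order; that choice only matters at the point where text is written out or read in.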
