Unicode Field routines

jcollett · Post by **jcollett** » Thu Sep 02, 2010 9:36 pm

Difficulties in the use of Unicode in Revolution are persuading me NOT to move my extensive Chinese study materials from HTML+Javascript into Revolution. That is disappointing. Just when I think I am beginning to make progress, another problem arises. For example:

A stack which I have been studying (http://revolution.byu.edu/unicode/unico ... utines.rev) says "And field 1 contains four "000A" characters, you can see they are RED." Yes, I see them, and they are red. I also find that, in my own exploratory work, that particular Chinese character is always a problem. Why? (Curious that no explanation was supplied in that stack!)
JC

Mark · Post by **Mark** » Fri Sep 03, 2010 12:11 am

Hi JC,

Do you still want to try to solve this problem or have you decided not to use Revolution? If you still want to continue with Revolution, why is 000A causing you problems? Is this character a return character?

Best,

Mark

jcollett · Post by **jcollett** » Fri Sep 03, 2010 12:56 am

Mark, I'm still trying to make up my mind. This character, for example, 上, referred to in the source I quoted as 000A, is always omitted when I use code to move Chinese from one field, where I have pasted it, to another. It has taken me several hours to obtain Chinese-containing text from an external file, put it into a field, and then split the 3 items of each line (Chinese characters, pinyin representation, and English) into 3 other fields. So I have just about completed (apart from the 上 problem) the first two of a set of functions I am going to need for Chinese work. I hope I'm going to get better at it. If I do, I'll stay with Revolution. But if the 上 problem remains, or others appear, I'll be very sad, but I'll have to go back to quick'n'easy javascript.
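The file-to-fields step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the file is UTF-8 with no byte order mark, each line has the form characters,pinyin,English, and the field names "chinese", "pinyin" and "english" are hypothetical. The split is done while the data is still UTF-8, because in UTF-8 the bytes for a comma or a linefeed never occur inside a multi-byte character.

```livecode
on mouseUp
   -- hypothetical: ask the user for the UTF-8 vocabulary file
   answer file "Choose a UTF-8 text file"
   if it is empty then exit mouseUp
   -- binfile: reads the bytes of the file unchanged
   put URL ("binfile:" & it) into tUTF8
   -- normalise Windows line endings so that "line" splits cleanly
   replace crlf with linefeed in tUTF8
   -- split while the data is still UTF-8
   put line 1 of tUTF8 into tLine
   -- only the Chinese item needs converting to UTF-16 for display
   set the unicodeText of fld "chinese" to uniEncode(item 1 of tLine, "UTF8")
   put item 2 of tLine into fld "pinyin"
   put item 3 of tLine into fld "english"
end mouseUp
```

The pinyin and English items are plain ASCII in the sample data, so they can go into ordinary fields without conversion.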
JC

jcollett · Post by **jcollett** » Sat Sep 04, 2010 1:07 am

Further study of (http://revolution.byu.edu/unicode/unico ... utines.rev) reveals the use of "countUnicodeLines(tdata)" and "UnicodeLineOffset(tdata,5)". These, and any others relating to Unicode, could be useful to me. But I cannot find them defined anywhere in the stack in question, nor can I find them in the Revolution Dictionary. So how can they work? And is there some other source of ready-made and potentially useful Unicode functions which I don't know about?
JC

bn · Post by bn » Sat Sep 04, 2010 10:00 am

Hi JC,
the routines are self-defined functions. They are:

Code: Select all

function countUnicodeLines @tdata
  put number of characters of tdata into tlength
  set useunicode to true
  put 1 into tlinecount
  repeat with i = 1 to tlength step 2
    if chartonum(char i to i+1 of tdata) is 10 then add 1 to tlinecount
  end repeat
  return tlinecount
end countUnicodeLines

and

Code: Select all

function UnicodeLineOffset @tdata,whichline
  put number of characters of tdata into tlength
  set useunicode to true
  put 1 into tlinecount
  put 1 into tlineoffset
  repeat with i = 1 to tlength step 2
    if tlinecount is whichline then exit repeat
    if chartonum(char i to i+1 of tdata) is 10 then
      add 1 to tlinecount
      put i+2 into tlineoffset
    end if
  end repeat
  if whichline > tlinecount then put tlength + 1 into tlineoffset
  return tlineoffset
end UnicodeLineOffset

these functions are in the script of the stack, the stack script. Have a look at it. You see from the rest of the stack how they are called and what to do with what they return.
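As a rough illustration of how the two functions might be called: the button handler, the choice of line 5 and the use of field 1 and field 2 are assumptions made for this example, not taken from the stack. It also assumes Revolution 4.x behaviour, where one char of the unicodeText is one byte.

```livecode
on mouseUp
   -- both functions take their data by reference, so it must be in a variable
   put the unicodeText of fld 1 into tData
   put countUnicodeLines(tData) into tLineCount
   -- offsets of the first byte of line 5 and of the line after it
   put UnicodeLineOffset(tData, 5) into tStart
   put UnicodeLineOffset(tData, 6) into tNext
   if tLineCount > 5 then
      -- line 6 exists: stop before the 2-byte linefeed that ends line 5
      put tNext - 3 into tEnd
   else
      -- line 5 is the last line (or does not exist): run to the end of the data
      put the number of chars of tData into tEnd
   end if
   set the unicodeText of fld 2 to char tStart to tEnd of tData
end mouseUp
```

If the field has fewer than 5 lines, tStart ends up one past the end of the data and field 2 is simply left empty.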
regards
Bernd

Mark · Post by **Mark** » Sat Sep 04, 2010 10:12 am

Hi JC,

Can you attach a simple text containing the 上 character? Just to make sure that I have a unicode file with some valid Chinese? I would like to do a little experiment with it.

As I said, 000A is a return character. It isn't a 上 character. Just isn't. Apparently, there is some incompatibility between your text and the UTF16 encoding.

Best regards,

Mark

jcollett · Post by **jcollett** » Sat Sep 04, 2010 8:36 pm

Hi Mark. This is the first time I have ever attached anything to a reply. I don't know if I have done it correctly. My stored texts are usually in UTF 8 format. I have made a second copy of the one-line sample in UTF 16 format, and am hoping to upload both.
Thanks for your interest
JC

Mark · Post by **Mark** » Sat Sep 04, 2010 8:48 pm

No attachment

jcollett · Post by **jcollett** » Sat Sep 04, 2010 9:06 pm

Hi Mark,
I'm having a bad morning. I don't know how to send attachments, and cannot find out. The 17 references to "attach" in the FAQ do not tell me. I went through the process of selecting the files to upload, and assumed that they would then be uploaded when I submitted the message. I must have omitted an important step.

I have checked in the stack where I got the idea that 上 is 000A. It says "There are 14 lines in field 1.
And field 1 contains four "000A" characters,
you can see they are RED."
The red character is 上. The text is not Chinese. I assume it is Japanese. But that should not affect a character's unicode number, should it?

The two little files I have tried to send you each contain
上,shang4,above 上,shang4,above

PS Bernd, thanks for your help on another matter relating to unicode.

bn · Post by bn » Sat Sep 04, 2010 10:08 pm

JC,
zip your stuff before uploading. Then it will work whatever is in it.
regards
Bernd

Mark · Post by **Mark** » Sat Sep 04, 2010 10:23 pm

JC,

To put an end to all the confusion, the hex equivalent of 上 is 4e0a and the hex equivalent of a return (actually a linefeed in RunRev) is 000a. The unicode stack to which you linked works correctly, but the mention of the 上 characters makes no sense to me.

Best,

Mark

jcollett · Post by **jcollett** » Sat Sep 04, 2010 10:42 pm

Thank you Mark and Bernd. Unfortunately many of the sources I have been using to understand and learn how to manipulate text containing unicode appear maddeningly incomplete (their authors assume I know more than I do) or "make no sense" as in the case of the specific character we were discussing. I have dropped the approach I was pursuing, and am trying an alternative which looks more promising.

On a different matter entirely, why are the methods of opening a card script and a stack script so different?
JC

bn · Post by bn » Sat Sep 04, 2010 10:55 pm

JC,

why are the methods of opening a card script and a stack script so different?

If you have a stack open and in front and look at the Object menu, there you have the options "Card Script" and "Stack Script". They lead you to the respective scripts.

In the property inspector, if the focus is on the stack then you can choose in the right arrow below the lock "edit script".

In the menu "Tools", if you choose the "Application Browser" and select the stack in the left pane, then do a right click, it offers to take you to the script of the stack.
So there are many ways to get at the script. The "Application Browser" tells you the number of lines of a script, so there you can see whether there is a script at all and which objects have scripts.

In Rev at first it can be confusing where all the scripts are, since virtually any object can have a script. It helps to read up on the "message hierarchy", either in Shafer's book or in the "Revolution User Guide".
It just takes a little to get used to but when you understand the message hierarchy it is quite logical.
regards
Bernd

jcollett · Post by **jcollett** » Tue Sep 07, 2010 2:08 am

For several days I made pleasing progress. I was splitting Chinese dialogues, each one in a field, into separate lines of dialogue, one line per field. New fields were created as needed. Using callbacks from a player, I could then get each line of text to be shown as it was spoken by the player. Splendid. I was on the brink of being well and truly hooked. Then I noticed that some of my lines of dialogue were being split into two lines, with two new fields being created instead of one.

As soon as I was aware of the problem, it was short work to locate the source of the problem. It's that 上 character, behaving as though it is a 'return'.

I made a couple of tests (with acknowledgements to Devin Asay of Brigham Young University) with a button and a field. I put the character 上 into a field called chinText :
on mouseUp
  set the useUnicode to true
  put charToNum(char 1 to 2 of fld "chinText") -- returns 19978
end mouseUp

Then I tried it the other way, using 19978 :

on mouseUp
  set the useUnicode to true
  set the unicodeText of fld "chinLetter" to numToChar(19978) -- the letter 上 should appear in the field, and it does.
end mouseUp

So far so good. But when I try to move this line of a field:

这是你第上一次来中国吗？
into a new field of its own, I get two new fields
这是你第
一次来中国吗？
You see the problem. The 上 has disappeared. A return has taken its place.

The code which moves the text is a line such as:

set the unicodeText of fld "Field2" to the unicodeText of line lineNumber of fld "Field1"

It always works, unless there is a 上 in it.

The character 上 is listed as being the fourteenth most common character in the Chinese language. Shall I write special character-by-character checks to watch out for it whenever it appears? No way. JC
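A short diagnostic can make the behaviour visible at the byte level. This is only a sketch, assuming Revolution 4.x semantics in which the unicodeText is UTF-16 data in the machine's byte order, and assuming byteToNum is available in the version in use. The field name "chinText" is the one from the test above.

```livecode
on mouseUp
   -- the unicodeText is raw UTF-16 data: two bytes per character here
   put the unicodeText of fld "chinText" into tData
   put the number of bytes of tData into tLength
   put empty into tReport
   repeat with i = 1 to tLength
      -- list the numeric value of every byte, separated by spaces
      put byteToNum(byte i of tData) & space after tReport
   end repeat
   -- 19978 = 78 * 256 + 10, so the pair for 上 is 78 and 10 in one order or the other
   put tReport
end mouseUp
```

Because 19978 is 78 * 256 + 10, one of the two bytes of 上 has the value 10, which is also the value of a linefeed. A chunk expression that scans the data byte by byte for line breaks, such as "line lineNumber of", will therefore find a line break in the middle of that character.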

Mark · Post by **Mark** » Tue Sep 07, 2010 2:18 pm

Hi JC,

Apparently, you are doing something wrong when splitting your fields. You need to keep in mind that unicodeText is binary. It isn't text. For example, if you use syntax such as

Code: Select all

put line 1 of fld 1 into x
put word 2 to -1 of fld 1 into x
put item 4 of fld 1 into x

things will go completely wrong. One of the reasons is that in unicodeText (UTF-16) each of these characters consists of two bytes. These bytes can have the same values as commas, tabs, returns, linefeeds, quotes, etc. In other words, all references to items, lines and words are completely useless. That's why you need to write your own routines, to find the correct items, lines and words (probably that was the purpose of the stack, which had the 上 characters coloured red). This explains why

Code: Select all

set the unicodeText of fld "Field2" to the unicodeText of line lineNumber of fld "Field1"

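doesn't work. Your syntax contains a reference to lines, while the binary 上 symbol (hex 4E0A) contains a linefeed byte (which behaves as a return in RunRev) next to the byte 4E, so the line reference breaks the text in the middle of the character. A real line break is a linefeed followed by a NULL. To find all lines in your text, you could use a repeat loop. The following example finds the first line.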

Code: Select all

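-- work on the UTF-16 data of the field, not on its plain text
put the unicodeText of fld "Field1" into tData
put the number of bytes of tData into tLength
repeat with x = 1 to (tLength - 1) step 2
  -- a real line break is the byte pair linefeed + NULL
  if byte x to (x+1) of tData is linefeed & NULL then
    -- copy everything before the line break
    set the unicodeText of fld "Field2" to byte 1 to (x-1) of tData
    exit repeat
  end if
end repeat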

Note that bytes are the same as chars in this example, but using "byte" makes it clear that we are actually not working with characters, since the actual characters as displayed in the field each consist of 2 bytes.

The example should allow you to write your own scripts to find all lines in your field.
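One way such a script could look is sketched below. The function name unicodeLine and its parameters are hypothetical, and the sketch assumes little-endian UTF-16 data (a line break stored as byte 10 followed by byte 0), as in the loop above, and that byteToNum is available.

```livecode
-- Hypothetical helper: returns line pLine of UTF-16 data, without its line break
function unicodeLine pData, pLine
   put the number of bytes of pData into tLength
   put 1 into tStart
   put 1 into tCount
   -- step through the data one 2-byte character at a time
   repeat with i = 1 to (tLength - 1) step 2
      if byteToNum(byte i of pData) is 10 and byteToNum(byte (i + 1) of pData) is 0 then
         -- found a real line break: is this the end of the wanted line?
         if tCount is pLine then return byte tStart to (i - 1) of pData
         add 1 to tCount
         put i + 2 into tStart
      end if
   end repeat
   -- the last line has no trailing line break
   if tCount is pLine then return byte tStart to tLength of pData
   return empty
end unicodeLine
```

It could then stand in for the line reference in the statement that failed:

```livecode
put the unicodeText of fld "Field1" into tData
set the unicodeText of fld "Field2" to unicodeLine(tData, lineNumber)
```

Because the loop only looks at byte pairs that start on an odd position, the byte 10 inside 上 is compared together with its partner byte 78 and is not mistaken for a line break.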

Best,

Mark

LiveCode Forums.
