XML parsing and unicode text fields


Post by massung » Wed Aug 12, 2009 3:17 am

I'm having trouble parsing XML from various websites. The XML tags parse just fine, but the inner text doesn't convert Unicode character references (for example, the ’ character written as &#x2019;). Take this tag:

Code:

<some_tag>&#x2019;</some_tag>

This returns a tag whose inner text is the literal string "&#x2019;", which is wrong. I can apply a regular expression to the text and use numToChar with useUnicode set to true, but the character still doesn't appear properly in the text field. Sample code:

Code:

set useUnicode to true

-- strip all the html tags from the description
put it into tDesc

-- fix all the unicode characters: each pass replaces one &#xNNNN; reference
repeat while matchText(tDesc, "&#x([0-9A-Fa-f]+);", tHexDigits)
  -- tHexDigits holds the hex digits; convert them to a decimal code point
  put baseConvert(tHexDigits, 16, 10) into tCodePoint
  put numToChar(tCodePoint) into tUnichar
  replace "&#x" & tHexDigits & ";" with tUnichar in tDesc
end repeat

put tDesc into field "Test"

Either numToChar doesn't work properly in this situation or I'm doing something wrong. Even if numToChar is working, the character displayed is very wrong, regardless of the font I use in the field.
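
One possible workaround, as an untested sketch: instead of building the character with numToChar, leave the reference as text and let the field decode it. This assumes the field's htmlText property decodes decimal character references such as &#8217;. Whether it also accepts the hexadecimal form is not confirmed here, so the sketch rewrites hex references to decimal first. The name hexRefsToDecimal is a hypothetical helper, not part of revXml.

```livecode
-- Hypothetical helper: rewrite hexadecimal character references
-- (e.g. &#x2019;) as decimal ones (e.g. &#8217;).
function hexRefsToDecimal pText
  local tHexDigits
  -- Each pass converts every occurrence of one matched reference;
  -- the loop ends when no "&#x...;" reference is left.
  repeat while matchText(pText, "&#x([0-9A-Fa-f]+);", tHexDigits)
    replace "&#x" & tHexDigits & ";" with "&#" & baseConvert(tHexDigits, 16, 10) & ";" in pText
  end repeat
  return pText
end hexRefsToDecimal

-- Usage, assuming htmlText decodes decimal references:
set the htmlText of field "Test" to hexRefsToDecimal(tDesc)
```

Because the text is then interpreted as HTML, any literal < or & in tDesc would need escaping first, and line breaks may need to be marked up as paragraphs.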

Of course, the ideal would be for revXml to handle the translation itself (maybe it does, and I'm doing something wrong?).

Any thoughts?

Jeff M.
