Converting Unicode codes to readable text

japino · Post by **japino** » Fri Mar 27, 2020 8:43 am

I'm getting data from a website in JSON format. As far as I understand, the text is Russian and I've put it into a field.

The text is this:
\u041f\u0440\u0438\u0432\u0435\u0442 \u043a\u0430\u043a \u0434\u0435\u043b\u0430?

Is there no simple way to convert this to readable text? I've searched the forum and I came across this:
https://forums.livecode.com/viewtopic.p ... on#p183823

Is there no easier way to convert those Unicode codes?

FourthWorld · Post by **FourthWorld** » Fri Mar 27, 2020 3:51 pm

Does the JSON file or the API documentation include a description of the encoding used?

japino · Post by **japino** » Fri Mar 27, 2020 8:45 pm

No, I checked both the returned JSON output and the API docs and encoding isn’t mentioned anywhere.

richmond62 · Post by **richmond62** » Fri Mar 27, 2020 9:41 pm

Code:

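on mouseUp
   -- Split the raw text on "\" so each item looks like "u041f".
   -- The loop only stops at an item equal to "XXX", so the text
   -- in fld "fRAW" must end with "\XXX".
   set the itemDelimiter to "\"
   put 2 into KOUNT -- skip item 1, the text before the first "\"
   repeat until item KOUNT of fld "fRAW" is "XXX"
      put item KOUNT of fld "fRAW" into BUKVA
      delete char 1 of BUKVA -- drop the leading "u"
      put ("0x" & BUKVA) into MAGIC
      put numToCodepoint(MAGIC) after fld "fOUT"
      add 1 to KOUNT
   end repeat
end mouseUp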

Oddly enough the 3 words don't have gaps between them:

Привет как дела

Hey, what's up?

japino · Post by **japino** » Sat Mar 28, 2020 6:38 pm

Thanks for this, richmond62! So it does look like I need to convert each character one by one. I was hoping that LiveCode had some function for this that I had overlooked, but I guess not. I’ll figure out a way to make sure the spaces get preserved. Thanks again.

richmond62 · Post by **richmond62** » Sat Mar 28, 2020 7:41 pm

In principle, my script is a function!

jacque · Post by **jacque** » Sat Mar 28, 2020 11:42 pm

A bit quicker, but the same idea:

Code:

function doTranslate pString
  set the itemDelimiter to "\u"
  if char 1 to 2 of pString = "\u" then delete char 1 to 2 of pString -- avoid empty first item
  repeat for each item i in pString
    put numToCodepoint("0x" & i) after tTranslation
  end repeat
  return tTranslation
end doTranslate

Thierry · Post by **Thierry** » Sun Mar 29, 2020 8:57 am

Hi,

Applying the last solution to your text sample, I found 2 errors:
-- spaces are suppressed
-- the last chunk, \u0430?, breaks the code (numToCodepoint throws an error)

So, here is my take on this:

Code:

local T = "\u041f\u0440\u0438\u0432\u0435\u0442 \u043a\u0430\u043a \u0434\u0435\u043b\u0430?"

on mouseUp
   put tdzTranslate(T)
end mouseUp

Code:

on getCodePoint V
   return numToCodepoint( "0x" & V)
end getCodePoint

function tdzTranslate T
   local R
   get sunnyReplace(T,"\\u([0-9a-f]{4})","?{ getCodePoint \1}", R)
   return R
end tdzTranslate

--> Привет как дела?

and thank you, I've learned my 1st Russian sentence today

Take care,

Thierry
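For readers who want to stay with plain LiveCode, the two problems listed above (dropped spaces and the trailing "?") can probably be sidestepped by treating only the first four characters of each item as the hex code and copying the rest of the item through unchanged. This is a minimal sketch, not a tested library routine: the handler name decodeUnicodeEscapes is made up for illustration, and it assumes every \u in the input is followed by exactly four hex digits. It also does not combine surrogate pairs, so characters outside the Basic Multilingual Plane would need extra handling.

```livecode
function decodeUnicodeEscapes pString
   -- Hypothetical helper name; assumes well-formed \uXXXX escapes only.
   local tResult, tItem, tIsFirst
   set the itemDelimiter to "\u"
   put true into tIsFirst
   repeat for each item tItem in pString
      if tIsFirst then
         -- Anything before the first \u is literal text (often empty).
         put tItem after tResult
         put false into tIsFirst
      else
         -- The first 4 chars are the hex code point...
         put numToCodepoint("0x" & char 1 to 4 of tItem) after tResult
         -- ...and whatever follows (spaces, "?", etc.) is kept as-is.
         put char 5 to -1 of tItem after tResult
      end if
   end repeat
   return tResult
end decodeUnicodeEscapes
```

Called with the sample text from the first post, this should give Привет как дела? with both the spaces and the question mark intact.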

japino · Post by **japino** » Sun Mar 29, 2020 10:25 am

Thanks Jacque and Thierry.

Thierry, for my own small project I can't really afford a paid external, but it's good to know that it's there. I've made a note of it, and maybe I will use it some time in the future.

For now I've used a repeat loop which finds each \uXXXX string and replaces it with the actual character.
A bit hesitant to paste it here because I know I'm a bad hobby coder, but anyway, here you have it:

Code:

on mouseup
   put "\u041f\u0440\u0438\u0432\u0435\u0442 \u043a\u0430\u043a \u0434\u0435\u043b\u0430?" into myTranslation
   repeat
      put "\u" into myCharsToFind
      put offset(myCharsToFind, myTranslation) into myStartChar
      if myStartChar is 0 then exit repeat
      put myStartChar + 5 into myEndChar
      put char myStartChar to myEndChar of myTranslation into codeToConvert
      replace "\u" with "0x" in codeToConvert
      put numToCodepoint(codeToConvert) into myChar
      delete char myStartChar to myEndChar in myTranslation
      put myChar after char myStartChar - 1 in myTranslation
   end repeat
   answer myTranslation
end mouseup

Thierry · Post by **Thierry** » Sun Mar 29, 2020 11:35 am

japino wrote: Thierry, for my own small project I can't really afford a paid external...

It's fine for me, I do understand.
Actually, I have a small number of regex followers who like
regex use cases; that's the main reason for my regex posts...

Oh, BTW, it's a library, not an external.

japino wrote: For now I've used a repeat loop which finds each \uXXXX string and replaces it with the actual character.
A bit hesitant to paste it here because I know I'm a bad hobby coder but anyway, here you have it:

I've quickly made a new version of your excellent code,
just in case you're curious...
But neither your code nor mine is efficient for long input text!

Code:

function tdzTranslate txt
   repeat
      put offset("\u", txt) into idxStart
      if idxStart is 0 then exit repeat
      put idxStart + 5 into idxEnd
      get numToCodepoint("0x" & char idxStart+2 to idxEnd of txt)
      put IT into char idxStart to idxEnd of txt
   end repeat
   return txt
end tdzTranslate

Take care,

Thierry
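On the efficiency remark: both loops above restart the search from char 1 on every pass and rewrite the whole string in place each time. One possible way around that, sketched here under assumptions, is to use the optional third parameter of offset() (the number of characters to skip) so the search resumes where the last escape ended, and to build the result in a separate variable. The handler name decodeEscapesOnePass is invented for this example, and the code assumes every \u is followed by exactly four hex digits.

```livecode
function decodeEscapesOnePass pText
   -- Hypothetical helper name; assumes well-formed \uXXXX escapes only.
   local tOut, tSkip, tPos
   put 0 into tSkip
   repeat
      -- With a skip count, offset() returns a position relative to
      -- the end of the skipped characters, or 0 if nothing is found.
      put offset("\u", pText, tSkip) into tPos
      if tPos is 0 then exit repeat
      -- Copy the literal text between the previous escape and this one.
      put char (tSkip + 1) to (tSkip + tPos - 1) of pText after tOut
      -- The four hex digits sit right after the two chars of "\u".
      put numToCodepoint("0x" & char (tSkip + tPos + 2) to (tSkip + tPos + 5) of pText) after tOut
      -- Move past the six characters of this escape.
      add tPos + 5 to tSkip
   end repeat
   -- Copy whatever literal text remains after the last escape.
   put char (tSkip + 1) to -1 of pText after tOut
   return tOut
end decodeEscapesOnePass
```

For sentence-length input the difference is unlikely to matter, so this is only worth considering if the text grows large.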

japino · Post by **japino** » Sun Mar 29, 2020 4:09 pm

Aw, many thanks for this Thierry, this is excellent! And I don't worry about long texts, because I should be dealing with sentences only.
