
Normalizing text

Posted: Sat Aug 17, 2019 10:32 am
by hopkins
Hello,

Is there a way to convert accented text into "normalized" text?

For example: transform "Béla Bartók" into "Bela Bartok"

I cannot seem to get the function "normalize" (https://livecode.fandom.com/wiki/NormalizeText) to work, but I may not be using it correctly.

Re: Normalizing text

Posted: Sat Aug 17, 2019 3:08 pm
by jmburnod
Hi,
I tried the normalizeText function for the first time, but I don't understand it yet.
I use this for a similar goal:

Code:

function fromAccentToNot pText
   put "àâäçéèêëîïòôùû" into tAccent
   put "aaaceeeeiioouu" into tNoAccent
   put 0 into tCount
   repeat for each char tChar in tAccent
      add 1 to tCount
      replace tChar with char tCount of tNoAccent in pText
   end repeat
   return pText
end fromAccentToNot
Best regards
Jean-Marc
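
As a quick check, here is a sketch of calling the function on the name from the original question. The handler name testAccents is hypothetical, and fromAccentToNot is assumed to be in the same script:

```livecode
on testAccents
   -- hypothetical test handler; fromAccentToNot is the function posted above
   local tResult
   put fromAccentToNot("Béla Bartók") into tResult
   -- "é" is listed in tAccent, but "ó" is not, so "ó" would have to be
   -- added to tAccent (and "o" to tNoAccent at the same position)
   -- before it gets replaced as well
   answer tResult
end testAccents
```

Any character missing from tAccent is simply left as it is, so the two strings need to be extended together for each language the text may contain.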

Re: Normalizing text

Posted: Sat Aug 17, 2019 5:29 pm
by jacque
Normalize isn't the right term. It only works with a few characters that are visually identical but have different Unicode values. It isn't meant to remove diacriticals; those are actually part of the character itself.

I think Jean-Marc's method is all you can do.
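
That said, decomposing the text with normalizeText and then dropping the combining marks may cover many common accents. A minimal, untested sketch, assuming LiveCode 7 or later (for normalizeText with the "NFD" form, the codepoint chunk, and codepointToNum); the function name stripDiacritics is hypothetical:

```livecode
function stripDiacritics pText
   local tOut, tNum
   -- NFD splits a precomposed letter such as "é" into "e" plus a combining accent
   put normalizeText(pText, "NFD") into pText
   repeat for each codepoint tCP in pText
      put codepointToNum(tCP) into tNum
      -- skip the Combining Diacritical Marks block, U+0300 to U+036F
      if tNum >= 768 and tNum <= 879 then next repeat
      put tCP after tOut
   end repeat
   return tOut
end stripDiacritics
```

Letters such as ø, đ, ß, æ and œ have no canonical decomposition, so they would pass through unchanged and would still need a replacement table like Jean-Marc's.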

Re: Normalizing text

Posted: Sat Aug 17, 2019 6:24 pm
by [-hh]
@Jean-Marc
Two considerations.
1. One should possibly set the caseSensitive property to true.
2. Sometimes 'accented' chars are replaced by two chars, for example "Ä" with "Ae" or "ß" with "sz".

Here is a mapping that I once used (=fld "Mapping" in the code below).

Code:

Á to A
Á to A
Ä to Ae
Â to A
À to A
à to A
Å to A
Č to C
Ç to C
Ć to C
Ď to D
É to E
Ě to E
Ë to E
È to E
Ê to E
Ẽ to E
Ĕ to E
Ȇ to E
Í to I
Ì to I
Î to I
Ï to I
Ň to N
Ñ to N
Ó to O
Ö to Oe
Ò to O
Ô to O
Õ to O
Ø to O
Ř to R
Ŕ to R
Š to S
Ť to T
Ú to U
Ů to U
Ü to Ue
Ù to U
Û to U
Ý to Y
Ÿ to Y
Ž to Z
á to a
ä to ae
â to a
à to a
ã to a
å to a
č to c
ç to c
ć to c
ď to d
é to e
ě to e
ë to e
è to e
ê to e
ẽ to e
ĕ to e
ȇ to e
í to i
ì to i
î to i
ï to i
ň to n
ñ to n
ó to o
ö to oe
ò to o
ô to o
õ to o
ø to o
ð to o
ř to r
ŕ to r
š to s
ť to t
ú to u
ů to u
ü to ue
ù to u
û to u
ý to y
ÿ to y
ž to z
þ to b
Þ to B
Đ to D
đ to d
ß to sz
Æ to AE
Œ to OE
æ to ae
œ to oe
Then the "Replace" button code is similar to yours:

Code:

on mouseUp
  lock screen; lock messages
  put fld "TextIn" into s
  put fld "Mapping" into m
  replace " to " with comma in m
  set the casesensitive to true
  repeat for each line L in m
    replace (item 1 of L) with (item 2 of L) in s
  end repeat
  put s into fld "TextOut"
end mouseUp

Re: Normalizing text

Posted: Sun Aug 18, 2019 8:23 am
by hopkins
Thank you for your answers. I wish there were a simpler way of doing this, but the above code will certainly do the job.

Re: Normalizing text

Posted: Sun Aug 18, 2019 10:28 am
by [-hh]
Possibly not "simpler", but for long input strings and long replacement mappings, using regular expressions is 20-30 times faster (btn "ReplaceText"):

Code:

on mouseUp
  put the millisecs into m1
  lock screen; lock messages
  put fld "TextIn" into s
  put length(s) into n
  put fld "Mapping2" into m2
  replace " to " with comma in m2
  repeat for each line L in m2
    put replaceText(s,item 1 of L,item 2 of L) into s
  end repeat
  put s into fld "TextOut2"
  put n & ": " & (the millisecs - m1) && (s is fld "TextOut") into fld "timing"
end mouseUp
The replacement mapping (fld "Mapping2"):

Code:

Á|Á|Â|À|Ã|Å to A
Þ to B
Č|Ç|Ć to C
Ď|Đ to D
É|Ě|Ë|È|Ê|Ẽ|Ĕ|Ȇ to E
Í|Ì|Î|Ï to I
Ň|Ñ to N
Ó|Ò|Ô|Õ|Ø to O
Ř|Ŕ to R
Š to S
Ť to T
Ú|Ů|Ù|Û to U
Ý|Ÿ to Y
Ž to Z
á|â|à|ã|å to a
þ to b
č|ç|ć to c
ď|đ to d
é|ě|ë|è|ê|ẽ|ĕ|ȇ to e
í|ì|î|ï to i
ň|ñ to n
ó|ò|ô|õ|ø|ð to o
ř|ŕ to r
š to s
ť to t
ú|ů|ù|û to u
ý|ÿ to y
ž to z
Æ to AE
Ä to Ae
Œ to OE
Ö to Oe
Ü to Ue
ä|æ to ae
ö|œ to oe
ß to sz
ü to ue
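
The same loop could also be wrapped in a function so that it does not depend on particular fields. This is only a sketch: the name replaceByMapping and its parameters are hypothetical, and pMapping is assumed to use the "pattern to replacement" line format shown above:

```livecode
function replaceByMapping pText, pMapping
   -- turn each "pattern to replacement" line into "pattern,replacement"
   replace " to " with comma in pMapping
   repeat for each line tLine in pMapping
      -- ignore blank lines in the mapping
      if tLine is empty then next repeat
      -- item 1 is a regular expression (alternatives separated by "|"),
      -- item 2 is the text that replaces every match
      put replaceText(pText, item 1 of tLine, item 2 of tLine) into pText
   end repeat
   return pText
end replaceByMapping
```

It could then be called from a button as, for example, put replaceByMapping(fld "TextIn", fld "Mapping2") into fld "TextOut2".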

Re: Normalizing text

Posted: Sun Aug 18, 2019 11:29 am
by hopkins
Cool, thanks :D

Re: Normalizing text

Posted: Sat Aug 24, 2019 7:40 pm
by jmburnod
Hi,
@Hermann
Two considerations...
Yes, you're right
Thank you for doing the job
Jean-Marc

Re: Normalizing text

Posted: Sat Aug 24, 2019 8:58 pm
by richmond62
"ß" with "sz"
That's odd:

Straße = Strasse