Filenames, Browsers, URLEncode and UTF-8 - Some notes

Posted: Sat Oct 28, 2017 10:46 pm
by Simon Knight
I have been attempting to use a revBrowser, and more recently a Browser Widget, to display a PDF file in my application. Both of these browser objects require the URL of the file that is to be displayed. Passing a filename straight into a browser is likely to fail if the file path contains any URL-unfriendly characters. The obvious thing to do is to pass the filename to the built-in function URLEncode, but the result is also likely to fail, and without any error message.

My solution to date has been to conduct my own simplistic conversion from filename to URL by replacing spaces and vertical bars with their hex character codes e.g.

Code:

on mouseUp
  local tFile
  answer file "Please choose the file you would like to display" with type "PDF document|pdf|PDF"
  if it is not empty then
    put it into tFile
    replace " " with "%20" in tFile  -- code based on script used with revbrowser
    replace "|" with "%7C" in tFile  -- code based on script used with revbrowser
    set the url of widget "browser" to tFile
  end if
end mouseUp

Now, while this has worked on my Mac for some while, I recognise that it is a bit of a "kludge".

Looking at the LiveCode function URLEncode, it takes a filename (on Mac) and encodes it, replacing slashes etc. with hex digits. However, for some reason it replaces spaces with "+", and these are not accepted by either of the browser objects. So my new solution is to use URLEncode to catch everything but the spaces, then to replace the plus signs with the string "%20". This seems to work, and I believe it is all that has to be done as long as my file names stick within the ASCII character set.
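
A minimal sketch of that two-step approach, wrapped in a function so it can be reused. The name encodeForBrowser is made up for illustration, and the sketch assumes that URLEncode turns a literal "+" in the input into a hex code, so that any "+" left in its output can only stand for a space.

```livecode
-- Hypothetical helper (not a built-in): URL-encode, then swap "+" for "%20"
function encodeForBrowser pText
   local tEncoded
   -- URLEncode handles everything except that it writes spaces as "+"
   put URLEncode(pText) into tEncoded
   -- Assumption: a "+" in the original text was already hex-encoded above,
   -- so every "+" remaining here represents a space
   replace "+" with "%20" in tEncoded
   return tEncoded
end encodeForBrowser
```

It would be called in place of the two replace lines in my earlier handler, e.g. set the url of widget "browser" to encodeForBrowser(tFile).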

All well and good, but my reading indicates that Mac OS uses UTF-8 while Windows uses UTF-16 for their respective filenames. I am unsure whether I need to pay attention to these encoding schemes if I am creating an application that uses files on both operating systems. UTF-8 may use up to four bytes per character, but ASCII values up to 127 decimal remain single-byte. Now if I name a file "A_Euro_€.pdf" and run it through URLEncode, the result is "A_Euro_%DB.pdf", and if this is passed into a browser the source file is not displayed. However, if the file name is first encoded as UTF-8 and then URL encoded, the file is displayed correctly, with the Euro symbol being encoded as %E2%82%AC, which is correct according to the Wikipedia entry I found. So my code now looks like this:

Code:

put empty into field "DecodedText"
put it into tFile
put textEncode(tFile, "UTF-8") into tTextEncodedFile  -- UTF-8 bytes of the file path
put URLEncode(tTextEncodedFile) into tTest  -- hex-encode those bytes
put "TextEncoded: " & tTest after field "DecodedText"
--replace "+" with "%20" in tTest  -- code based on script used with revbrowser
--replace "|" with "%7C" in tTest  -- code based on script used with revbrowser
set the url of widget "browser" to tTest

What seems odd is that spaces are still being replaced with plus signs (or at least that is what is displayed in the field), yet the URL now works when sent to a browser.
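
As a sanity check on the Euro symbol encoding discussed earlier: the UTF-8 form of the Euro sign (Unicode code point U+20AC, decimal 8364) is the three bytes E2 82 AC. The sketch below is one way to display the bytes that textEncode actually produces, written as a button handler. It assumes LiveCode 7 or later, where textEncode, numToCodepoint and byteToNum are available; the variable names are only illustrative.

```livecode
on mouseUp
   local tBytes, tHex, i
   -- 8364 is the decimal code point of the Euro sign (U+20AC)
   put textEncode(numToCodepoint(8364), "UTF-8") into tBytes
   -- Build a "%XX" string from each byte, in the style used in URLs
   repeat with i = 1 to the number of bytes in tBytes
      put "%" & format("%02X", byteToNum(byte i of tBytes)) after tHex
   end repeat
   answer tHex
end mouseUp
```

Comparing the displayed string with what URLEncode returns for the UTF-8 encoded file name should show whether the two agree.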

It appears that if I write for both OSs I will have to use an if statement to switch the TextEncode type. I wonder if this is the case?

Comments welcome
best wishes

Simon K