[SOLVED] Parsing Word files

bogs
Posts: 5435
Joined: Sat Feb 25, 2017 10:45 pm

Re: Parsing Word files

Post by bogs » Wed Jan 27, 2021 9:36 pm

Yep Jacque, I have been fooling with this off and on since my last reply. I suspect JS will do the job, but JS is one of the (relatively few) languages I never did much in.

I didn't see anything listed in the browser's properties either (in the dictionary), and of course I don't spend enough time in the newer IDEs to even hazard a guess as to what would be needed specifically.

Weird that you can copy / paste manually, but not programmatically just because it is a widget. I mean, isn't that what the copy / paste commands are supposed to be for?

And here is the real puzzle: if you're in the IDE and you select text from a PDF in the browser widget, then go to "Edit -> Copy", bring the SE or message box to the front, insert the caret and go to "Edit -> Paste", it sure does copy and paste from the widget.

I have to think there is a way to do it, if you can do it in the IDE, but I am not going to wade through one of the IDEs to find it myself :D

*Edit - well, I lied, I went and looked :P

The IDE's copy goes to the menubar stack, then to stack "revidelibrary" for the revIDECopy handler, which apparently checks to see whether what is being copied is an image, text, object, etc.

Unfortunately, I do *not* have the time to determine exactly the flow it goes through to copy from a pdf :roll:

stam
Posts: 2679
Joined: Sun Jun 04, 2006 9:39 pm
Location: London, UK

Post by stam » Thu Jan 28, 2021 2:35 am

I instinctively checked the html text as well, then realised the PDF isn’t html, but a binary representation. I don’t think it’s possible to do this natively and probably not with JS either.

I’ve searched for more info on this. As Richard suggested, the best bet seems to be to run a command line tool from shell().

I’ve been looking at the pdftotext tool, part of the open source XPDFReader (apps and command line tools available for all desktop platforms).

Theoretically, via the CLI you throw a PDF at it and it produces a text file, which would be trivial to import into LC. However, my *nix knowledge isn’t great and I haven’t yet got it to work.
I’ll keep trying and post a solution if I manage it (or a plea for help if I can’t!)

Post by bogs » Thu Jan 28, 2021 1:10 pm

stam wrote:
Thu Jan 28, 2021 2:35 am
I don’t think it’s possible to do this natively
@ stam, did you read my last post? The IDE allows you to use the 'Edit' menu items 'copy / paste' to copy and paste text out of a browser widget. Since that is the case, it certainly makes it 'possible'. Not easy to find out exactly how, but certainly 'possible'.

*Edit - Never mind the above, since it does not work in all situations which I find weird :?

Post by stam » Thu Jan 28, 2021 2:50 pm

Yeah, copy/paste works, but only if you click on an item of text in the PDF. I can see no way to automate that across hundreds of PDFs.
I was referring to programmatically extracting the text data, reliably, across a whole folder full of documents.
My hopes are with the XPDFReader's pdftotext command line tool, I just need to figure it out :)

Post by bogs » Thu Jan 28, 2021 3:00 pm

stam wrote:
Thu Jan 28, 2021 2:50 pm
Yeh copy/paste works but only if you click on an item of text in the PDF. I can see no way to automate that across hundreds of PDFs.
Hm. My thoughts tend to go along this route from the observations made (I am not suggesting in any way, shape, or form that ultimately your current way is not going to be easier) -
1.) the IDE appears to use (in some way) the selected section of the pdf.
2.) if it can tell what is selected, then you should be able to select through code, aside from using the mouse (like a field).

If 1 and 2 are true, then there ultimately should be a way to both select and copy / paste the selection programmatically from the browser widget. While I don't know exactly how to do it (at the moment), I am almost positive it is possible to do within the confines of the LC language.

The above is merely a mental exercise in this case. As I said earlier, if there is a ready-made tool that has it all figured out (and that you can figure out how to use), by all means do it that way ;)

Post by stam » Thu Jan 28, 2021 3:55 pm

OK, cracked it: the problem I was having was that the file names in the shell command included spaces. A Unix shell treats an unescaped space as an argument separator, so these need to be escaped. I have not yet tested what works on Windows/PowerShell.

For anyone else who needs this, here's a way to export the plain text from a PDF. This is just for text; the layout/graphics are not included.

1. Download the command line tools from XPDFReader > downloads > command line tools > select platform > download.
2. The tool to use is called pdftotext. Because I want to include this with my app, I copied it to the folder containing the mainstack (i.e. in specialFolderPath("resources")).
3. I avoid mucking around with environment variables and refer to the tool by its full system path.
4. Important: escape all spaces in the file paths, e.g. replaceText(filePath, space, "\ ").
5. The syntax is /path/to/pdftotext /path/to/pdf_file.pdf (optional: /path/to/text_file_output.txt).
6. If no text file is specified, this creates a text file at the same location and with the same name as the PDF file (e.g. pdf_file.txt in this case); otherwise it creates the file at the supplied path.
7. The text file can be trivially imported into LC.
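
As a possible alternative to step 4, a path can be wrapped in single quotes instead of escaping only its spaces, which on a Unix-like shell also covers characters such as parentheses and ampersands. This is a minimal sketch, not something tested in this thread: shellQuote is a hypothetical helper name, and it assumes a Unix-like shell (it is not expected to work in the Windows command prompt).

```livecode
-- Hypothetical helper (not from the post): wrap a path in single quotes for a
-- Unix-like shell so that spaces and most special characters are taken literally.
function shellQuote pPath
   -- A single quote cannot appear inside single quotes, so close the quoting,
   -- emit an escaped quote, and reopen it: ' becomes '\''
   replace "'" with "'\''" in pPath
   return "'" & pPath & "'"
end shellQuote
```

It would be used in place of the replaceText call, e.g. get shell(shellQuote(tToolPath) && shellQuote(tPdfPath)), where tToolPath and tPdfPath are illustrative variable names holding unescaped paths.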

this code worked for me:

Code:

on mouseUp pMouseButton
   local tCommand, tResourcePath, tFilePath

   -- pdftotext sits next to the mainstack; escape spaces in its path as well
   put replaceText(specialFolderPath("resources") & "/pdftotext", space, "\ ") into tResourcePath
   answer file "Select PDF" with type "PDF|pdf"
   if it is empty then exit mouseUp

   put replaceText(it, space, "\ ") into tFilePath -- escape spaces (ASCII 32) in the file path
   put tResourcePath & space & tFilePath into tCommand
   get shell(tCommand)
   -- Empty on success, in which case a text file has been created at the
   -- location of, and with the same name as, the PDF file
   put it
end mouseUp
Stam
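
The handler above converts one PDF picked by hand. For a whole folder of documents, the same call could be looped and the resulting text files read back in (step 7). The following is only a rough sketch under assumptions not confirmed in this thread: textPathForPdf and textFromPdfFolder are hypothetical names, files() is assumed to accept a folder argument (otherwise set the defaultFolder first), the shell is assumed to be Unix-like, and the text is read with LiveCode's default "file:" handling without any particular encoding being assumed.

```livecode
-- Hypothetical helper (not from the post): swap the PDF extension for "txt",
-- which is where pdftotext writes when no output path is given (step 6).
function textPathForPdf pPdfPath
   set the itemDelimiter to "."
   put "txt" into item -1 of pPdfPath
   return pPdfPath
end textPathForPdf

-- Converts every PDF in pFolder and returns an array of text keyed by file name.
function textFromPdfFolder pFolder
   local tTool, tFile, tPdfPath, tTextPath, tTextA
   put replaceText(specialFolderPath("resources") & "/pdftotext", space, "\ ") into tTool
   repeat for each line tFile in files(pFolder)
      if not (tFile ends with ".pdf") then next repeat
      put pFolder & "/" & tFile into tPdfPath
      -- Escape spaces for the shell only; file access below needs the unescaped path
      get shell(tTool && replaceText(tPdfPath, space, "\ "))
      put textPathForPdf(tPdfPath) into tTextPath
      if there is a file tTextPath then
         put URL ("file:" & tTextPath) into tTextA[tFile]
      end if
   end repeat
   return tTextA
end textFromPdfFolder
```

A call such as put textFromPdfFolder(tFolder) into tTextA would then leave the text of "report.pdf" in tTextA["report.pdf"]. Files for which no text file appears are simply skipped, so a stricter version would also want to keep the shell() output for those.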

--------------------
edit: the command line tools include a number of useful command line apps - extract images from PDF, convert to HTML etc. I've tested these with variations of the script above, all work pretty well (just need to modify to follow the syntax in the 'support' section of the website...)
Last edited by stam on Fri Jan 29, 2021 1:40 am, edited 2 times in total.

FourthWorld
VIP Livecode Opensource Backer
Posts: 9823
Joined: Sat Apr 08, 2006 7:05 am
Location: Los Angeles

Post by FourthWorld » Thu Jan 28, 2021 4:46 pm

Nicely done, Stam. Thanks for posting that solution.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems

Post by stam » Fri Jan 29, 2021 1:38 am

Thank you for pointing me in the right direction, Richard!
