[SOLVED] Parsing Word files

stam · Post by **stam** » Sun Jan 24, 2021 2:55 am

Embarrassingly, i posed the question without actually testing what simply importing the file would produce - i had just assumed this would be binary data... so i did what i should have done before posting the question and tested...

i imported a typical Word file into a variable and examined the text... About 70% of this is clearly binary data, but interestingly the file also seems to include the entire letter in plain ascii text as well !!
I suspect the binary data contains formatting/layout data for the included ascii text, but since this includes the entire text i can easily extract it since our systems automatically insert boilerplate text at start and end of letters... so no need to splash out on the plugin just yet.
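A minimal sketch of that extraction, under two assumptions: the letter is stored as 8-bit text in the file (Word can also store text as UTF-16, in which case the markers would not match), and the boilerplate strings at the start and end are known. The handler name letterBody and both marker parameters are hypothetical.

```livecode
-- Sketch only: returns the text from pStartMarker through pEndMarker,
-- or empty if either marker is missing. Names are hypothetical.
function letterBody pData, pStartMarker, pEndMarker
   local tStart, tEnd
   -- Position of the first char of the start marker (0 if not found)
   put offset(pStartMarker, pData) into tStart
   if tStart is 0 then return empty
   -- With a skip value, offset() returns a position relative to the skipped chars
   put offset(pEndMarker, pData, tStart) into tEnd
   if tEnd is 0 then return empty
   return char tStart to (tStart + tEnd + length(pEndMarker) - 1) of pData
end letterBody

-- Possible usage, with tPath holding the path to a Word file and
-- placeholder marker strings:
--   put URL ("binfile:" & tPath) into tData
--   put letterBody(tData, "Dear Dr", "Yours sincerely") into tLetter
```

Reading through binfile: keeps the bytes untouched, so the binary portions of the file are simply ignored rather than converted.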

So sorry to waste your time with this... but on this note, we also receive a small percentage of our letters in PDF format (especially when sent to us from other centres). These do contain extractable text (inasmuch as it can be copy/pasted) - but not sure if it's possible to extract this in LC?
I did a cursory search and not sure i found any kind of useful answer other than using a command line tool.

Any suggestions for extracting text from PDF?

FourthWorld · Post by **FourthWorld** » Sun Jan 24, 2021 3:48 am

PDF is a monster of a format. Parsing it is not for mere mortals.

There are many command line utilities that can attempt* to extract text from PDF, many of them free and open source. You can call command line utilities from LC with the shell function.

* I say "attempt" because the format is designed for display only, and through such complex means that it offers no assurances that any contents can be extracted back out into any other format.
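As a rough sketch of the shell approach: this assumes a pdftotext executable (shipped with Xpdf and with Poppler) can be found on the PATH, and the handler name pdfToText is hypothetical. It is not a tested recipe for any particular setup.

```livecode
-- Sketch only: assumes pdftotext is reachable on the PATH.
-- The handler name pdfToText is hypothetical.
function pdfToText pPdfPath
   local tCommand, tText
   -- Avoid a console window flashing up on Windows while shell() runs
   set the hideConsoleWindows to true
   -- A text-file argument of "-" asks pdftotext to write to stdout,
   -- which shell() then returns as its value
   put "pdftotext -layout" && quote & pPdfPath & quote && "-" into tCommand
   put shell(tCommand) into tText
   -- Assumption: after shell(), the result carries a non-zero exit status on failure
   if the result is not empty and the result is not 0 then return empty
   return tText
end pdfToText
```

Calling `put pdfToText(tPath) into field "Letter"` would then place whatever text the tool managed to extract into a field; how faithful that text is depends on the PDF.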

stam · Post by **stam** » Sun Jan 24, 2021 4:45 am

Never mind, i just saw that this is a feature of the business edition, which includes a PDF viewer that can extract the text. At 4 times the price of Indy ($2000), will have to think carefully if the expense is going to be worth it for this one feature i need, but suspect it won't be

FourthWorld · Post by **FourthWorld** » Sun Jan 24, 2021 4:56 am

The LC component you found is highly specialized. If all you need is the text there are many free tools available, many automatable from the command line.

bogs · Post by **bogs** » Sun Jan 24, 2021 12:10 pm

stam wrote: ↑
Sun Jan 24, 2021 2:55 am
... but on this note, we also receive a small percentage of our letters in PDF format (especially when sent to us from other centres). These do contain extractable text (inasmuch as it can be copy/pasted) - but not sure if it's possible to extract this in LC?
I did a cursory search and not sure i found any kind of useful answer other than using a command line tool.

Any suggestions for extracting text from PDF?

Well, certainly if you want full manipulation of the pdf, I'd lean towards Richard's suggestion, but if all you want is what I stylized up there, then you might use the browser widget to accomplish it (as long as you are not using Linux).

For example, on Windows 7, this browser widget is displaying the pdf version of the user guide for Revolution 2.7 {kinda old meets new, eh?} which is on the local desktop, although I don't imagine the source really matters much.


Displaying it didn't require much....

Code: Select all

set the URL of widget "Browser" to "file:" & specialFolderPath("desktop") & "/all.pdf"

I manually copied and pasted to notepad what you see there, as you suggest in what I placed in bold italic, but I am pretty confident anyone that actually uses the browser widget could figure out how to do it programmatically.

stam · Post by **stam** » Mon Jan 25, 2021 12:36 am

Thanks Bogs, but couldn't get that to work for PDF files.
I don't care at all about formatted text; i just want to access the text of the PDF to parse it into a database.

I tried both the URL format file:<filepath> and (what is normal for browsers) file://<filepath> but nothing renders in the browser widget...
The code below is to select a PDF file:

Code: Select all

on mouseUp
   local tFileURL
   answer file "select PDF file:"
   if it is empty then exit mouseUp
   put "file:" & it into tFileURL                 -- also tried file://
   set the url of widget "browser" to tFileURL    -- if I put a normal web URL here it loads normally
end mouseUp

Nothing renders in the browser widget. If i replace tFileURL above with a normal web URL, it renders normally.
What am i doing wrong?

bogs · Post by **bogs** » Mon Jan 25, 2021 3:43 am

stam wrote: ↑
Mon Jan 25, 2021 12:36 am
What am i doing wrong?

Heh, if you only knew how many times I ask myself that self same question haha.

As far as what is wrong, I really don't know. I tested your code unmodified first on Win 7 / Lc 9.0.1, and it worked to open a pdf no issue. I slightly modified your code to include 'with type "PDF|pdf"', so the answer file would only see pdf files.

Attachment: PDF_BrowserTest.zip (769 Bytes)

It isn't fancy, but it worked here, as you can see in this video: https://youtu.be/P75sziW7cII

It is a little longer than it really should have been, but I was distracted making it heh.

FourthWorld · Post by **FourthWorld** » Mon Jan 25, 2021 6:27 am

If what you want is the plain text, you don't need to jump through hoops embedding an entire browser application with PDF rendering extensions to do it.

I've been down this road. You will want to use a command line tool for this. Simple, memory-efficient, developer-efficient, robust, and lightning fast.

Such tools are written by people who need to do that for a living, a wheel that need not be reinvented.

I believe macOS has one preinstalled. Debian/Ubuntu/Mint have more than one in their repos. I haven't needed this on Windows but I'd imagine given its 86% market share there are several available for that too.

bogs · Post by **bogs** » Mon Jan 25, 2021 11:30 am

As far as what I did goes, as far as I can tell, I only copied / pasted plain text.

However, there is a lot of value in the point Richard made, I was just experimenting to see if it could be done without downloading anything. Of course, the method I used would have issues on 'nix (not sure on Mac) because the browser widget doesn't work well with 'nix, I only tested on Win as you didn't say which OS you are using.

stam · Post by **stam** » Tue Jan 26, 2021 1:25 am

FourthWorld wrote: ↑
Mon Jan 25, 2021 6:27 am
You will want to use a command line tool for this. Simple, memory-efficient, developer-efficient, robust, and lightning fast.

Thanks Richard,
my circumstances are a bit special inasmuch as all work related activity is done on an extremely locked down windows 10 environment that runs off a windows server 12 farm.

It’s not possible for users to install anything that might require admin access and even just copying data is ridiculously locked down - USB drives are out for example. However, in typical ludicrous MS techie style it’s absolutely fine to copy anything from MS Teams and OneDrive, which is how I’ve been copying my custom apps to our systems and it’s been fine.

I hadn’t considered a CLI because my impression is that this needs to go through a “proper” installation process which just won’t be feasible.
Do you know if it’s possible to use a CLI without “installing” it?

I have no idea how to get data back from a CLI, are there any resources you can recommend?
(Did a quick search but only came up with things like “how to create a gui for command line tools”.)

stam · Post by **stam** » Tue Jan 26, 2021 2:20 am

bogs wrote: ↑
Mon Jan 25, 2021 3:43 am
I slightly modified your code to include 'with type "PDF|pdf"', so the answer file would only see pdf files.

Hi Bogs, the document type thingie noted - thank you.

This forum is truly a treasure trove of LC info, although can take some time to find stuff. I finally came across an older post from Klaus about this exact issue.

On a Mac (my primary dev environment), it works but with some modification. The URL needs to be converted to a web safe string, so all spaces need to be replaced with %20. And the URL needs to start with "file://", so with an absolute path the URL should look like: file:///Users/...
The following now works perfectly:

Code: Select all

on mouseUp
   local tFileURL
   answer file "select pdf" with type "PDF|pdf"
   if it is empty then exit mouseUp
   put "file://" & it into tFileURL
   replace space with "%20" in tFileURL 
   set the url of widget "browser" to tFileURL
end mouseUp

Sadly as Richard mentioned, this does not make it easier to extract the text from the PDF though

Will be looking into command line (or other) tools for this - thanks for your help guys and grateful for any other suggestions...

bogs · Post by **bogs** » Tue Jan 26, 2021 2:03 pm

stam wrote: ↑
Tue Jan 26, 2021 1:25 am
I hadn’t considered a CLI because my impression is that this needs to go through a “proper” installation process which just won’t be feasible.
Do you know if it’s possible to use a CLI without “installing” it?

Depends, but I want to make sure we're talking about the same thing. I believe Richard is referring to tools that are run *from* the CLI { Command Line Interface }. All Operating Systems come with some form of the CLI itself (far as I know).

I work primarily in 'nix / bsd, and those often come with VIM or Emacs, with which automating this would likely be trivial. I don't know if Mac includes a similar CLI editor, the last Windows I used did for its .net platform (actually spent a year programming in that heh), and currently (as far as I know) uses 'power shell' for which again, automating this would be <somewhat> trivial.

I don't work with the browser widget at all, so I don't know if it has a version of 'theSelected', selectedLine / chunk , etc... If it does, it should then be just as trivial to copy the text of that to the clipboard and then paste it into a document. The only problem with that thought is that, at least on 'nix, that scenario doesn't work all the time. I'm not entirely sure why, but I have tested it enough to know that copying from an Lc app using Lc code to the clipboard will NOT paste to any place [ctrl+c] will.

Now, the track I was using was simply highlighting whatever you want to copy from the pdf in the browser and copying it using the keyboard (I believe on mac the two key sets would be [cmd + c] for copy, and [cmd + v] for paste). I'll rig up an osx session today to see if it would work through the menus but I doubt it, and I can't send those commands on the VM I use.

I did find this article though, see if this might help you out (same principle using OSX's built in pdf reader).
https://www.techjunkie.com/extract-text-pdf-mac/

*Edit - I did finally get around to testing it in OSX and it worked as expected. I did take advantage of having it up to modify the file picking code some more (after making sure the original stuff worked) to the following -

Code: Select all

on mouseUp
	answer file "...pdf?" with type "PDF|pdf"
	
	if it is not empty then
		put it into tUrl
		replace space with "%20" in tUrl
		set the url of widget "browser" to tUrl
	end if		
end mouseUp

The pdf came up no problem, inside the IDE (since I can't use [cmd] key) I -
1. selected the text in the pdf
2. went to the edit menu and chose 'copy'
3. opened a text editor
4. went to the edit menu and chose 'paste'

The text appeared no issue, so if it doesn't do that for you, then I can't even start to guess why not.

*Side note - I did all of the above through the IDE, however, when you turn this into a standalone program, you will have to add a menu including a copy entry or, alternately, either a button with copy to clipboard or instructions to use [cmd+c / v] along with the program because you won't have the IDE around to create that menu for you

stam · Post by **stam** » Tue Jan 26, 2021 3:07 pm

Thanks Bogs - poor choice of words because I was too lazy to type “command line tool” - this is what I was referring to as “CLI” but clearly this is just my laziness - sorry.

As mentioned copy/paste is not an option because we’ll be dealing with hundreds of letters - I was looking for a way to programmatically extract the text from PDF...

Actual deployment will be on Windows 10. Powershell is installed so that’s something.
But as our IT has severely locked down everything I don’t think I’ll be able to change the $PATH.

I’ve looked at the xpdf command line tool but haven’t yet got it to work...

bogs · Post by **bogs** » Tue Jan 26, 2021 3:25 pm

I edited my above post, but didn't post till after you replied heh.

stam wrote: ↑
Tue Jan 26, 2021 3:07 pm
As mentioned copy/paste is not an option because we’ll be dealing with hundreds of letters - I was looking for a way to programmatically extract the text from PDF...

As I said above, I don't work in the newer IDEs and so have zero experience with the browser widget, but I assume (we both know how bad that is, right?) there is some property in the widget that is similar to the lines in a field.

If your pdf's are structured (I would think they would have to be), the code to copy the text shouldn't be much harder than something like ~

Code: Select all

set the clipboardData["text"] to line x to y of widget "browser"

or some such (although I found out today that the browser isn't a container, so I don't know for sure the browser methods for selecting lines to copy).

Hopefully someone that does work with the widget will come along and tell you the exact code, if there is some such available.

*Edit - now I am sure it is possible, because of this post by Capellan -

capellan wrote: ↑
Fri Nov 25, 2016 7:23 am
Hi All,

How could we copy an image from a webpage
(loaded in Browser widget) into Livecode?

Actually, we could copy text (via clipboard)
from a widget webpage and paste this text
into a LiveCode field.

If text can be copied to the clipboard from a webpage, I'm pretty dang sure it can be copied from a pdf as well.

jacque · Post by **jacque** » Wed Jan 27, 2021 9:23 pm

I just tried loading a pdf into a browser widget. It's true you can manually copy from it but I don't see any way to get the content programmatically. My first impulse was to get the htmltext of the widget but it is empty, at least in the pdf I tested. Aside from the htmltext, I don't see any other properties that would allow text selection or manipulation. The "select" command doesn't work in the browser widget.

There may be a way using Javascript.
Edit: "the number of lines in the text of widget "browser"" returns 0.

LiveCode Forums
