[SOLVED] Parsing Word files
Re: Parsing Word files
Embarrassingly, I posed the question without actually testing what simply importing the file would produce - I had just assumed this would be binary data... so I did what I should have done before posting the question and tested...
I imported a typical Word file into a variable and examined the text... About 70% of this is clearly binary data, but interestingly the file also seems to include the entire letter in plain ASCII text as well!
I suspect the binary data contains formatting/layout data for the included ASCII text. Since the file includes the entire text, and our systems automatically insert boilerplate text at the start and end of letters, I can easily extract it by looking for that boilerplate... so no need to splash out on the plugin just yet.
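A minimal sketch of the extraction described above, assuming the letter text really is stored as plain single-byte text and that the opening and closing boilerplate strings are known. The function name and parameters are hypothetical, not part of LiveCode:

```livecode
-- Hypothetical helper: returns whatever sits between two known boilerplate
-- markers in a file read as raw bytes. Returns empty if either marker is missing.
function textBetweenMarkers pFilePath, pStartMarker, pEndMarker
   local tData, tStart, tEnd
   put URL ("binfile:" & pFilePath) into tData
   if the result is not empty then return empty  -- file could not be read
   put offset(pStartMarker, tData) into tStart
   if tStart is 0 then return empty
   add length(pStartMarker) to tStart  -- first char after the start marker
   -- the third parameter skips chars, so the answer is relative to tStart - 1
   put offset(pEndMarker, tData, tStart - 1) into tEnd
   if tEnd is 0 then return empty
   return char tStart to (tStart + tEnd - 2) of tData
end textBetweenMarkers
```

This would not work for documents where Word stores the text as two-byte characters, so it is worth checking a few sample files first.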
So sorry to waste your time with this... but on this note, we also receive a small percentage of our letters in PDF format (especially when sent to us from other centres). These do contain extractable text (inasmuch as it can be copy/pasted) - but I'm not sure if it's possible to extract this in LC?
I did a cursory search and I'm not sure I found any kind of useful answer other than using a command line tool.
Any suggestions for extracting text from PDF?
Re: Parsing Word files
PDF is a monster of a format. Parsing it is not for mere mortals.
There are many command line utilities that can attempt* to extract text from PDF, many of them free and open source. You can call command line utilities from LC with the shell function.
* I say "attempt" because the format is designed for display only, and through such complex means that it offers no assurances that any contents can be extracted back out into any other format.
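As a rough sketch of what calling such a utility from LC could look like: the example assumes `pdftotext` (shipped with Xpdf and Poppler) is available on the PATH, which is an assumption and not something established in this thread. The function name is hypothetical.

```livecode
-- Hypothetical wrapper around the pdftotext command line utility.
-- Assumes pdftotext can be found on the PATH.
function pdfToText pPdfPath
   local tText
   set the hideConsoleWindows to true  -- avoids a console window flashing on Windows
   -- "-" as the output file name asks pdftotext to write the text to stdout,
   -- which shell() hands back as its return value
   put shell("pdftotext -layout" && quote & pPdfPath & quote && "-") into tText
   return tText
end pdfToText
```

Usage would be something like `put pdfToText("/path/to/letter.pdf") into field "Output"`. Note that `shell()` also returns any error messages the tool prints, so the result should be sanity-checked before it is parsed.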
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: Parsing Word files
Never mind, I just saw that this is a feature of the Business edition, which includes a PDF viewer that can extract the text. At 4 times the price of Indy ($2000), I will have to think carefully about whether the expense is going to be worth it for this one feature I need, but I suspect it won't be.
Re: Parsing Word files
The LC component you found is highly specialized. If all you need is the text there are many free tools available, many automatable from the command line.
Richard Gaskin
Re: Parsing Word files
stam wrote: Sun Jan 24, 2021 2:55 am
... but on this note, we also receive a small percentage of our letters in PDF format (especially when sent to us from other centres). These do contain extractable text (inasmuch as it can be copy/pasted) - but not sure if it's possible to extract this in LC?
I did a cursory search and not sure i found any kind of useful answer other than using a command line tool.
Any suggestions for extracting text from PDF?

Well, certainly if you want full manipulation of the pdf, I'd lean towards Richard's suggestion, but if all you want is what I stylized up there, then you might use the browser widget to accomplish it (as long as you are not using Linux).
For example, on Windows 7, this browser widget is displaying the pdf version of the user guide for Revolution 2.7 {kinda old meets new, eh? } which is on the local desktop, although I don't imagine the source really matters much.
Displaying it didn't require much....
Code: Select all
set the URL of widget "Browser" to "file:" & specialFolderPath("desktop") & "/all.pdf"
Re: Parsing Word files
Thanks Bogs, but I couldn't get that to work for PDF files.
I don't care at all about formatted text; I just want to access the text of the PDF to parse it into a database.
I tried both the URL format file:<filepath> and (what is normal for browsers) file://<filepath> but nothing renders in the browser widget...
The code below is to select a PDF file:
Code: Select all
on mouseUp
   local tFileURL
   answer file "select PDF file:"
   if it is empty then exit mouseUp
   put "file:" & it into tFileURL -- also tried file://
   set the url of widget "browser" to tFileURL -- if I put a normal web URL here it loads normally
end mouseUp
Nothing renders in the browser widget. If I replace tFileURL above with a normal web URL, it renders normally.
What am I doing wrong?
Re: Parsing Word files
Heh, if you only knew how many times I ask myself that self same question haha.
As far as what is wrong, I really don't know. I tested your code unmodified first on Win 7 / LC 9.0.1, and it worked to open a pdf with no issue. I then slightly modified your code to include 'with type "PDF|pdf"', so the answer file dialog would only show pdf files.
It isn't fancy, but it worked here, as you can see in this video... https://youtu.be/P75sziW7cII
It is a little longer than it really should have been, but I was distracted making it heh.
Re: Parsing Word files
If what you want is the plain text, you don't need to jump through hoops embedding an entire browser application with PDF rendering extensions to do it.
I've been down this road. You will want to use a command line tool for this. Simple, memory-efficient, developer-efficient, robust, and lightning fast.
Such tools are written by people who need to do that for a living, a wheel that need not be reinvented.
I believe macOS has one preinstalled. Debian/Ubuntu/Mint have more than one in their repos. I haven't needed this on Windows but I'd imagine given its 86% market share there are several available for that too.
Richard Gaskin
Re: Parsing Word files
As far as what I did goes, as far as I can tell, I only copied / pasted plain text.
However, there is a lot of value in the point Richard made, I was just experimenting to see if it could be done without downloading anything. Of course, the method I used would have issues on 'nix (not sure on Mac) because the browser widget doesn't work well with 'nix, I only tested on Win as you didn't say which OS you are using.
Re: Parsing Word files
FourthWorld wrote: Mon Jan 25, 2021 6:27 am
You will want to use a command line tool for this. Simple, memory-efficient, developer-efficient, robust, and lightning fast.

Thanks Richard,
my circumstances are a bit special inasmuch as all work-related activity is done on an extremely locked-down Windows 10 environment that runs off a Windows Server 12 farm.
It’s not possible for users to install anything that might require admin access and even just copying data is ridiculously locked down - USB drives are out for example. However, in typical ludicrous MS techie style it’s absolutely fine to copy anything from MS Teams and OneDrive, which is how I’ve been copying my custom apps to our systems and it’s been fine.
I hadn’t considered a CLI because my impression is that this needs to go through a “proper” installation process which just won’t be feasible.
Do you know if it’s possible to use a CLI without “installing” it?
I have no idea how to get data back from a CLI, are there any resources you can recommend?
(Did a quick search but only came up with things like “how to create a gui for command line tools”.)
Re: Parsing Word files
Hi Bogs, the document type thingie noted - thank you.
This forum is truly a treasure trove of LC info, although can take some time to find stuff. I finally came across an older post from Klaus about this exact issue.
On a Mac (my primary dev environment), it works but with some modification. The URL needs to be converted to a web-safe string, so all spaces need to be replaced with %20. And the URL needs to start with "file://", so the URL should look like: file:///Users/...
The following now works perfectly:
Code: Select all
on mouseUp
   local tFileURL
   answer file "select pdf" with type "PDF|pdf"
   if it is empty then exit mouseUp
   put "file://" & it into tFileURL
   replace space with "%20" in tFileURL
   set the url of widget "browser" to tFileURL
end mouseUp
Will be looking into command line (or other) tools for this - thanks for your help guys and grateful for any other suggestions...
Re: Parsing Word files
stam wrote:
I hadn’t considered a CLI because my impression is that this needs to go through a “proper” installation process which just won’t be feasible.
Do you know if it’s possible to use a CLI without “installing” it?

Depends, but I want to make sure we're talking about the same thing. I believe Richard is referring to tools that are run *from* the CLI { Command Line Interface }. All Operating Systems come with some form of the CLI itself (far as I know).
I work primarily in 'nix / bsd, and those often come with VIM or Emacs, with which automating this would likely be trivial. I don't know if Mac includes a similar CLI editor, the last Windows I used did for its .net platform (actually spent a year programming in that heh), and currently (as far as I know) uses 'power shell' for which again, automating this would be <somewhat> trivial.
I don't work with the browser widget at all, so I don't know if it has a version of 'theSelected', selectedLine / chunk , etc... If it does, it should then be just as trivial to copy the text of that to the clipboard and then paste it into a document. The only problem with that thought is that, at least on 'nix, that scenario doesn't work all the time. I'm not entirely sure why, but I have tested it enough to know that copying from an Lc app using Lc code to the clipboard will NOT paste to any place [ctrl+c] will.
Now, the track I was using was simply highlighting whatever you want to copy from the pdf in the browser and copying it using the keyboard (I believe on mac the two key sets would be [cmd + c] for copy, and [cmd + v] for paste). I'll rig up an osx session today to see if it would work through the menus but I doubt it, and I can't send those commands on the VM I use.
I did find this article though, see if this might help you out (same principle using OSX's built in pdf reader).
https://www.techjunkie.com/extract-text-pdf-mac/
*Edit - I did finally get around to testing it in OSX and it worked as expected. I did take advantage of having it up to modify the file picking code some more (after making sure the original stuff worked) to the following -
Code: Select all
on mouseUp
   answer file "...pdf?" with type "PDF|pdf"
   if it is not empty then
      put it into tUrl
      replace space with "%20" in tUrl
      set the url of widget "browser" to tUrl
   end if
end mouseUp
1. selected the text in the pdf
2. went to the edit menu and chose 'copy'
3. opened a text editor
4. went to the edit menu and chose 'paste'
The text appeared no issue, so if it doesn't do that for you, then I can't even start to guess why not.
*Side note - I did all of the above through the IDE, however, when you turn this into a standalone program, you will have to add a menu including a copy entry or, alternately, either a button with copy to clipboard or instructions to use [cmd+c / v] along with the program because you won't have the IDE around to create that menu for you
Last edited by bogs on Tue Jan 26, 2021 3:14 pm, edited 1 time in total.
Re: Parsing Word files
Thanks Bogs - poor choice of words because I was too lazy to type “command line tool” - this is what I was referring to as “CLI”, but clearly this is just my laziness - sorry.
As mentioned copy/paste is not an option because we’ll be dealing with hundreds of letters - I was looking for a way to programmatically extract the text from PDF...
Actual deployment will be on Windows 10. Powershell is installed so that’s something.
But as our IT has severely locked down everything I don’t think I’ll be able to change the $PATH.
I’ve looked at the xpdf command line tool but haven’t yet got it to work...
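One possible way around both the installation and the $PATH problem, sketched under assumptions: the Xpdf `pdftotext.exe` is a standalone executable that is simply copied into a folder (no installer), and commands run through `shell()` start in the defaultFolder, so the tool can be found there by bare name. The function and parameter names are hypothetical, and `files()` with a folder argument assumes a recent LC version.

```livecode
-- Hypothetical batch helper: extracts the text of every PDF in pPdfFolder
-- using a pdftotext.exe that was copied (not installed) into pToolsFolder.
-- Returns an array keyed by file name.
function extractFolderText pToolsFolder, pPdfFolder
   local tOldFolder, tFile, tTextA
   put the defaultFolder into tOldFolder
   set the defaultFolder to pToolsFolder  -- lets the shell find the tool without touching PATH
   set the hideConsoleWindows to true
   repeat for each line tFile in files(pPdfFolder)
      if not (tFile ends with ".pdf") then next repeat
      -- "-" sends the extracted text to stdout so shell() returns it
      put shell("pdftotext.exe" && quote & pPdfFolder & "/" & tFile & quote && "-") into tTextA[tFile]
   end repeat
   set the defaultFolder to tOldFolder  -- restore the previous working folder
   return tTextA
end extractFolderText
```

Whether a locked-down environment allows an unsigned executable to run from a user folder is a separate question that only a test on the target machine can answer.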
Re: Parsing Word files
I edited my above post, but didn't post till after you replied heh.

stam wrote:
As mentioned copy/paste is not an option because we’ll be dealing with hundreds of letters - I was looking for a way to programmatically extract the text from PDF...

As I said above, I don't work in the newer IDEs and so have zero experience with the browser widget, but I assume (we both know how bad that is, right?) there is some property in the widget that is similar to the lines in a field.
If your pdf's are structured (I would think they would have to be), the code to copy the text shouldn't be much harder than something like ~
Code: Select all
set the clipboardData["text"] to line x to y of widget "browser"
or some such (although I found out the browser isn't a container today, so I don't know for sure the browser methods for selecting lines to copy).
Hopefully someone that does work with the widget will come along and tell you the exact code, if there is some such available.
*Edit - now I am sure it is possible, because of this post by Capellan -
If text can be copied to the clipboard from a webpage, I'm pretty dang sure it can be copied from a pdf as well.
Re: Parsing Word files
I just tried loading a pdf in to a browser widget. It's true you can manually copy from it but I don't see any way to get the content programmatically. My first impulse was to get the htmltext of the widget but it is empty, at least in the pdf I tested. Aside from the htmltext, I don't see any other properties that would allow text selection or manipulation. The "select" command doesn't work in the browser widget.
There may be a way using Javascript.
Edit: "the number of lines in the text of widget "browser"" returns 0.
Jacqueline Landman Gay | jacque at hyperactivesw dot com
HyperActive Software | http://www.hyperactivesw.com