[SOLVED] Parsing Word files
Hi all,
I have a large number of Word files, the text of which I'd like to import into a database.
Is there a way to do this easily with LC?
(these are the older version of .doc, not .docx)
Many thanks
Stam
-- EDIT: Long discussion, the solutions for me can be distilled to these 2 responses:
See comment about parsing Word files in this reply. This was an adequate solution for me.
See also this solution about extracting ascii text from PDF files.
Last edited by stam on Sun Jan 31, 2021 5:21 pm, edited 1 time in total.
Re: Parsing Word files
Hi Stam,
I don't know of any free libraries, but you can buy an add-on for LC here:
https://livecode.com/extensions/wordlib/2-2-0/
Best
Klaus
Re: Parsing Word files
Thanks Klaus,
I was aware of this but was wondering if there was a built-in/free option... I guess not, then.
Re: Parsing Word files
Hi.
Just wondering, since I never do anything like this: what is missing, or what unwanted baggage comes over, if you simply load the entire Word document into a field? Is it things like the fact that links lose their, er, links? Formatting seems intact, although a lot of work might still be necessary to transform the text into something more to your liking.
Craig
Re: Parsing Word files
Obviously you never tried this yourself!
I exported an RTF file to DOC and imported it into an LC field, have fun:
Re: Parsing Word files
Klaus.
What, that text is not readable?
So there is a big difference between reading a file with, say, the "read from file" command and simply copying the contents of a file and pasting them into LC.
I have used the "read from file" command for years without issue, but only to read "txt" documents that I made myself, rebuilding stuff "saved" by stacks I run. When reading "txt" files, the text comes back just fine.
Now I see what wordLib does.
Craig
Last edited by dunbarx on Fri Jan 22, 2021 3:47 pm, edited 2 times in total.
Re: Parsing Word files
Indeed . . .I exported an RTF file to DOC
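I tend to export DOC, ODT, whatever files as RTF files and then import them using
Code: Select all
-- "Import" is a hypothetical field name; tPath is assumed to hold the path to the RTF file
set the rtfText of field "Import" to URL ("file:" & tPath)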
Re: Parsing Word files
That's an interestingly anomalous RTF conversion, Klaus.
Looks like the chars and styles are preserved, but alignment is off because of tab settings.
How does it look if you adjust the field's tabstops to match the source Word doc?
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: Parsing Word files
No idea, I do not have any Office software installed.
I used the "Save as..." feature in TextEdit on my Mac to export an existing RTF file to DOC.
Funny how the actual text is reversed.
Re: Parsing Word files
The reversal seems an encoding issue. Does it improve when you use textEncode to match the source?
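A minimal sketch of that check, assuming the file is read as raw bytes and then interpreted with textDecode (the counterpart of textEncode for turning bytes into text). The field name "Import" is hypothetical, and "UTF-16LE" is only one guess at the encoding. A legacy .doc is a binary container, so this will not produce clean text; it only shows whether the readable runs improve.

```livecode
-- Sketch only: the field name "Import" is a hypothetical placeholder
on mouseUp
   local tPath, tBytes
   answer file "Choose a .doc file:"
   if it is empty then exit mouseUp
   put it into tPath
   -- binfile: returns the raw bytes with no line-ending or charset translation
   put URL ("binfile:" & tPath) into tBytes
   -- interpret the bytes as UTF-16 little-endian;
   -- other encodings such as "UTF-8" or "native" can be tried the same way
   put textDecode(tBytes, "UTF-16LE") into field "Import"
end mouseUp
```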
Richard Gaskin
Re: Parsing Word files
No idea, I only wanted to show Craig that a WORD/DOC file is not per se "human readable".
Re: Parsing Word files
Stam, what format do you want to store it in?
Richard Gaskin
Re: Parsing Word files
Thanks all for the interesting discussion, even though no free solution has been found.
To expand on the use case a bit, I want to create an app for work to help us database our patients correctly, which is something sorely lacking.
The electronic patient record system we use stores all our letters in Word .doc format, and they are moderately complex in structure (from a text parsing point of view).
Converting these to .rtf by hand is definitely out of the question - I'm talking about 10,000 letters here, possibly more.
I have already created an app that will correctly parse the text of these letters to extract demographics, patient identifiers, diagnoses, medication etc -- I use this to copy/paste the text of the Word document directly into the app, which is fine for single cases, but processing thousands of letters that way is just not feasible.
Hence, I'd like to extract the text programmatically instead.
It looks like it might be feasible with the Word plugin, but I was just wondering if there was another way -- though I'm going to guess not
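For what it's worth, the batch side of this could be sketched roughly as below. This is only an outline under assumptions: extractDocText and storeLetter are hypothetical placeholders (stubbed here so the script compiles) standing in for whichever converter ends up being used and for the existing parsing/database code.

```livecode
-- Sketch only: extractDocText and storeLetter are hypothetical placeholders
command importLetters pFolder
   local tOldFolder, tFile, tPath, tText
   put the defaultFolder into tOldFolder
   set the defaultFolder to pFolder
   repeat for each line tFile in the files
      -- skip anything that is not a legacy .doc file (this also skips .docx)
      if not (tFile ends with ".doc") then next repeat
      put pFolder & "/" & tFile into tPath
      put extractDocText(tPath) into tText
      storeLetter tPath, tText
   end repeat
   -- restore the folder that was current before the import
   set the defaultFolder to tOldFolder
end importLetters

function extractDocText pPath
   -- placeholder: call wordLib or another converter here
   return empty
end extractDocText

command storeLetter pPath, pText
   -- placeholder: hand the text to the existing parsing and database code
end storeLetter
```

Called as `importLetters "/path/to/letters"`, this walks one folder only; subfolders would need a recursive variant.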
Re: Parsing Word files
How uniformly structured is the info in the letters?
Richard Gaskin
Re: Parsing Word files
Fairly uniform - but with enough unpredictability - one has to cater for different styles between doctors, different specialities, the secretaries' views of how the letter should be structured, random spelling errors etc.
It was a real pain to create parsing algorithms that reliably handle the text when it is copy/pasted into the app, but it works 99% of the time now.
But not sure how that helps extract the text from Word files?