[SOLVED] Parsing Word files

stam
Posts: 2636
Joined: Sun Jun 04, 2006 9:39 pm
Location: London, UK

[SOLVED] Parsing Word files

Post by stam » Fri Jan 22, 2021 12:09 pm

Hi all,

I have a large number of Word files, the text of which I'd like to import into a database.

Is there a way to do this easily with LC?
(these are the older version of .doc, not .docx)

Many thanks
Stam

-- EDIT: Long discussion, the solutions for me can be distilled to these 2 responses:
See comment about parsing Word files in this reply. This was an adequate solution for me.
See also this solution about extracting ASCII text from PDF files.
Last edited by stam on Sun Jan 31, 2021 5:21 pm, edited 1 time in total.

Klaus
Posts: 13806
Joined: Sat Apr 08, 2006 8:41 am
Location: Germany

Re: Parsing Word files

Post by Klaus » Fri Jan 22, 2021 12:50 pm

Hi Stam,

I don't know of any free libraries, but you can buy an add-on for LC here:
https://livecode.com/extensions/wordlib/2-2-0/

Best

Klaus

Re: Parsing Word files

Post by stam » Fri Jan 22, 2021 12:52 pm

Thanks Klaus,
I was aware of this but was wondering if there was a built-in/free option... I guess not then :)

dunbarx
VIP Livecode Opensource Backer
Posts: 9580
Joined: Wed May 06, 2009 2:28 pm
Location: New York, NY

Re: Parsing Word files

Post by dunbarx » Fri Jan 22, 2021 2:59 pm

Hi.

Just wondering, since I never do anything like this. What is missing, or what unwanted baggage comes over, if you simply load the entire Word document into a field? Is it things like the fact that links lose their, er, links? Formatting seems intact, though a lot of work might still be necessary to transform the text into something more to your liking.

Craig

Re: Parsing Word files

Post by Klaus » Fri Jan 22, 2021 3:18 pm

Obviously you never tried this yourself! :-)
I exported an RTF file to DOC and imported it into an LC field, have fun:
[Attached screenshots: rtf.jpg, word_in_lc.jpg]

Re: Parsing Word files

Post by dunbarx » Fri Jan 22, 2021 3:35 pm

Klaus.

What, that text is not readable?

So there is a big difference between reading a file with, say, the "read from file" command and simply copying the contents of a file and pasting them into LC.

I have used the "read from file" command for years without issue, but I have only read "txt" documents that I made myself, rebuilding stuff "saved" by stacks I run. When reading "txt" files, the text comes back just fine.

Now I see what wordLib does.

Craig
Last edited by dunbarx on Fri Jan 22, 2021 3:47 pm, edited 2 times in total.
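
The "read from file" approach mentioned above can be sketched roughly as follows. This is only an illustration: the handler name readRawFile and the parameter pPath are made up for the example. It reads whatever bytes are in the file, which is why a plain "txt" file comes back readable while a binary .doc does not.

```livecode
-- Minimal sketch: return the raw contents of a file.
-- readRawFile and pPath are hypothetical names, not from the thread.
function readRawFile pPath
   local tData
   open file pPath for binary read
   -- "the result" is non-empty if the file could not be opened
   if the result is not empty then return empty
   read from file pPath until EOF
   put it into tData
   close file pPath
   return tData
end readRawFile
```

Something like `put readRawFile(tSomePath) into field "Raw"` (the field name is again a placeholder) would then show the difference between the two kinds of file directly.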

richmond62
Livecode Opensource Backer
Posts: 9287
Joined: Fri Feb 19, 2010 10:17 am
Location: Bulgaria

Re: Parsing Word files

Post by richmond62 » Fri Jan 22, 2021 3:36 pm

Klaus wrote:
I exported an RTF file to DOC
Indeed . . . 8)

I tend to export DOC, ODT, whatever files as RTF files and then import them using

Code:

-- the field name and tRTFPath are placeholders
set the rtfText of field "Import" to URL ("file:" & tRTFPath)

FourthWorld
VIP Livecode Opensource Backer
Posts: 9802
Joined: Sat Apr 08, 2006 7:05 am
Location: Los Angeles

Re: Parsing Word files

Post by FourthWorld » Fri Jan 22, 2021 5:09 pm

That's an interestingly anomalous RTF conversion, Klaus.

Looks like the chars and styles are preserved, but alignment is off because of tab settings.

How does it look if you adjust the field's tabstops to match the source Word doc?
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

Re: Parsing Word files

Post by Klaus » Fri Jan 22, 2021 5:13 pm

No idea, I do not have any Office software installed.
I used the "Save as..." feature in TextEdit on my Mac to export an existing RTF file to DOC.

Funny how the actual text is reversed. :-D

Re: Parsing Word files

Post by FourthWorld » Fri Jan 22, 2021 5:59 pm

The reversal seems an encoding issue. Does it improve when you use textEncode to match the source?
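
For anyone wanting to try this: in LiveCode, textDecode is the function that turns file bytes into text (textEncode goes the other way). A rough sketch follows, with the handler name, the field name "Preview" and the choice of encoding all being assumptions for illustration. Whether any single encoding makes a binary .doc readable is not established in this thread.

```livecode
-- Minimal sketch: show a file's bytes decoded with a chosen encoding.
-- previewDecoded, pDocPath and field "Preview" are hypothetical names.
on previewDecoded pDocPath, pEncoding
   local tRaw
   -- "binfile:" reads the bytes untouched, with no line-ending conversion
   put URL ("binfile:" & pDocPath) into tRaw
   put textDecode(tRaw, pEncoding) into field "Preview"
end previewDecoded
```

Calling `previewDecoded tPath, "UTF-16LE"` is one guess worth testing; the container data around the text would still come through as noise.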

Re: Parsing Word files

Post by Klaus » Fri Jan 22, 2021 6:06 pm

No idea, I only wanted to show Craig that a WORD/DOC file is not per se "human readable". :D

Re: Parsing Word files

Post by FourthWorld » Fri Jan 22, 2021 6:44 pm

Stam, what format do you want to store it in?

Re: Parsing Word files

Post by stam » Sat Jan 23, 2021 8:28 pm

FourthWorld wrote:
Fri Jan 22, 2021 6:44 pm
Stam, what format do you want to store it in?
Thanks all for the interesting discussion, even though no free solution has been found.

To expand on the use case a bit, I want to create an app for work to help us database our patients correctly, which is something sorely lacking.
The electronic patient record system we use stores all our letters in Word .doc format, and they are moderately complex in structure (from a text parsing point of view).

Converting these to .rtf is definitely out of the question - I'm talking about 10,000 letters here, possibly more.

I have already created an app that will correctly parse the text of these letters to extract demographics, patient identifiers, diagnoses, medication etc -- I use it by copying and pasting the text of the Word document directly into the app, which is fine for single cases, but processing thousands of letters that way is just not feasible.

Hence, I'd like to extract the text programmatically.
It looks like it might be feasible with the Word plugin, but I was just wondering if there was another way -- I'm going to guess not :)
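
One possible free route for batch extraction, offered only as a sketch under stated assumptions: on macOS, the system's textutil command-line tool can convert .doc files to plain text, and LiveCode can call it through shell(). textutil is not mentioned anywhere above, the handler and parameter names are invented for the example, and this will not work on Windows or Linux.

```livecode
-- Minimal sketch, macOS only: extract plain text from every .doc in a folder.
-- docFolderToText and pFolder are hypothetical names.
function docFolderToText pFolder
   local tTexts, tFile, tPath
   repeat for each line tFile in files(pFolder)
      if not (tFile ends with ".doc") then next repeat
      put pFolder & "/" & tFile into tPath
      -- -stdout makes textutil print the converted text instead of writing a file
      put shell("textutil -convert txt -stdout " & quote & tPath & quote) \
            into tTexts[tFile]
   end repeat
   -- array keyed by file name, each element holding that file's text
   return tTexts
end docFolderToText
```

Two caveats: paths containing double quotes or other shell-special characters would need proper escaping, and letters with non-ASCII characters may need a textDecode step on the shell output.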

Re: Parsing Word files

Post by FourthWorld » Sat Jan 23, 2021 10:08 pm

How uniformly structured is the info in the letters?

Re: Parsing Word files

Post by stam » Sun Jan 24, 2021 1:30 am

FourthWorld wrote:
Sat Jan 23, 2021 10:08 pm
How uniformly structured is the info in the letters?
Fairly uniform, but with enough unpredictability: one has to cater for different styles between doctors, different specialities, the secretaries' views of how the letter should be structured, random spelling errors etc.

It was a real pain to write parsing algorithms that reliably handle text copy/pasted into the app, but it works 99% of the time now.

But I'm not sure how that helps extract the text from Word files?
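
For readers wondering what parsing such letters can look like once the text is available, here is a deliberately simple sketch. It is not the parsing code described above, which is not shown in the thread. The handler name and the idea of a label at the start of a line are assumptions for illustration, and real letters with varying styles and spelling errors would need far more tolerant matching.

```livecode
-- Minimal sketch: return the text that follows a label at the start of a line.
-- valueAfterLabel, pText and pLabel are hypothetical names.
function valueAfterLabel pText, pLabel
   local tLine, tRest
   repeat for each line tLine in pText
      -- "begins with" ignores case unless the caseSensitive is set to true
      if tLine begins with pLabel then
         put char (length(pLabel) + 1) to -1 of tLine into tRest
         -- "word 1 to -1" trims leading and trailing whitespace
         return word 1 to -1 of tRest
      end if
   end repeat
   return empty
end valueAfterLabel
```

A call such as `valueAfterLabel(tLetterText, "Hospital No:")` uses a made-up label; each real field would need its own label or pattern.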
