OCR?


bamakojeff
Posts: 13
Joined: Tue Dec 06, 2022 6:53 pm

Re: OCR?

Post by bamakojeff » Mon Mar 18, 2024 1:41 am

Don't know if this is helpful or not, but I built a Windows application for a friend of mine in the publishing industry that creates royalty reports for individual authors from a single giant PDF containing the data for thousands of authors. The LiveCode app shells out to pdfinfo.exe, pdftk.exe, and pdftotext.exe to do all the actual PDF processing. It converts each page of the PDF to text, reads that text to determine where one author's report ends and the next begins, and then splits the massive PDF into separate PDF files, one per author, which can then be emailed to the appropriate recipient.

All these are standalone executables. I store them in a folder "below" the main app. So to get all the text off page tPageNum of the PDF sFile, I run:

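Code:

-- Quoted path to the bundled pdftotext executable
put quote & specialFolderPath("resources") & slash & "library/pdftotext.exe" & quote into sPDFtotext
-- Extract only page tPageNum, preserving layout; the trailing "-" sends the text to stdout.
-- sFile is passed as-is, so it must already be quoted if its path contains spaces.
put shell(sPDFtotext && "-f" && tPageNum && "-l" && tPageNum && "-layout" && sFile && "-") into tShellResults

Now I have the text of that page in tShellResults.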

I don't know for sure, but I expect that all these utilities also run on Mac. (They are all Linux utils originally, I believe.)
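
For anyone who wants to see how the page-by-page workflow described above could fit together, here is a rough sketch. It is not the actual app code: the handler names are made up for illustration, isReportStart() and its "Royalty Statement" marker are purely hypothetical stand-ins for whatever text marks the first page of a report, and every path parameter is assumed to be already wrapped in quotes. The only command-line pieces relied on are the "Pages:" line that pdfinfo prints, the pdftotext flags shown above, and pdftk's "cat <range> output <file>" form.

```livecode
-- Sketch only. pPDFinfo, pPDFtotext, pPDFtk, pFile and pOutFile are assumed
-- to be already-quoted paths; all handler names are hypothetical.

-- Ask pdfinfo for the page count (it prints a line like "Pages:   1234")
function pdfPageCount pPDFinfo, pFile
   local tInfo
   put shell(pPDFinfo && pFile) into tInfo
   filter tInfo with "Pages:*"
   return word 2 of line 1 of tInfo
end function

-- Text of a single page, using the same pdftotext call as above
function pageText pPDFtotext, pFile, pPageNum
   return shell(pPDFtotext && "-f" && pPageNum && "-l" && pPageNum && "-layout" && pFile && "-")
end function

-- Hypothetical boundary test: assumes each report's first page
-- contains the text "Royalty Statement"
function isReportStart pText
   return pText contains "Royalty Statement"
end function

-- Returns one line per report, in the form "firstPage-lastPage"
function reportRanges pPDFtotext, pFile, pPageCount
   local tRanges, tStart, tPage
   put 1 into tStart
   repeat with tPage = 2 to pPageCount
      if isReportStart(pageText(pPDFtotext, pFile, tPage)) then
         put tStart & "-" & (tPage - 1) & return after tRanges
         put tPage into tStart
      end if
   end repeat
   put tStart & "-" & pPageCount after tRanges
   return tRanges
end function

-- Copy one page range (e.g. "3-5") into its own PDF with pdftk
command splitReport pPDFtk, pFile, pRange, pOutFile
   get shell(pPDFtk && pFile && "cat" && pRange && "output" && pOutFile)
end splitReport
```

Calling reportRanges() once and then splitReport for each line of its result would produce one PDF per report. Shelling out once per page is slow on a very large file, so a real app would probably want some progress feedback.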

If that's helpful to you, I'm happy to share more about it. And if not, no worries. :-)

Jeff

stam
Posts: 2686
Joined: Sun Jun 04, 2006 9:39 pm
Location: London, UK

Re: OCR?

Post by stam » Mon Mar 18, 2024 9:55 am

Hi Jeff, I presume your answer is directed to me?
Thanks if that’s the case, but the PDFs I had to work with were scans rather than documents converted to PDF, which makes PDF utilities useless - there is only picture data.

Having said that, the situation has changed and this is now on hold indefinitely.


bamakojeff
Posts: 13
Joined: Tue Dec 06, 2022 6:53 pm

Re: OCR?

Post by bamakojeff » Wed Mar 20, 2024 5:57 pm

Stam, I saw this thread come across the digest and didn't bother to look at the original posting date when I replied. :-)

If the project ever comes around again, I've had good luck using the open source version of Tesseract (https://github.com/tesseract-ocr/tesseract) for OCR on images.
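
In case it is useful, a minimal sketch of calling Tesseract from LiveCode in the same shell-out style as the pdftotext example. The handler name and parameters are hypothetical, both paths are assumed to be already quoted, and it assumes a Tesseract install with the English language data available. Passing "stdout" as the output base makes Tesseract print the recognized text instead of writing a .txt file.

```livecode
-- Sketch only. pTesseract and pImageFile are assumed to be already-quoted
-- paths; ocrImage is a hypothetical handler name.
function ocrImage pTesseract, pImageFile
   -- "stdout" as the output base sends the recognized text to standard output;
   -- "-l eng" selects the English language data
   return shell(pTesseract && pImageFile && "stdout" && "-l eng")
end function
```

How good the text is depends heavily on the scan quality, so the result would still need checking before being used in anything automated.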

Jeff

stam
Posts: 2686
Joined: Sun Jun 04, 2006 9:39 pm
Location: London, UK

Re: OCR?

Post by stam » Wed Mar 20, 2024 8:16 pm

Thanks Jeff. But the results I posted at the start of this thread were from using Tesseract. Not great and certainly not usable in an automated process in a medical context sadly…
