text copied from LC generated PDF, WTF?

Klaus
Posts: 14199
Joined: Sat Apr 08, 2006 8:41 am

text copied from LC generated PDF, WTF?

Post by Klaus » Tue Feb 18, 2020 8:55 pm

Hi friends,

I know that copying text from a PDF file may give you unexpected results,
but this is really ridiculous!?

I created a PDF from LC (selected "Save as PDF" in the macOS Print dialog) and
when I copy some text and paste it into TextEdit, this is what I get, see screenshot.
Where on earth are my numbers and where is my text?
Any insights are much appreciated!


Best

Klaus

P.S.
I also posted this to the mailing-list, but for Craig's sake I also had to post it here. 8)
Attachments
text_from_lc_pdf.jpg

bogs
Posts: 5480
Joined: Sat Feb 25, 2017 10:45 pm

Re: text copied from LC generated PDF, WTF?

Post by bogs » Tue Feb 18, 2020 9:09 pm

I have no idea, but it reminds me of the code you would use in C to get the text you copied.

AxWald
Posts: 578
Joined: Thu Mar 06, 2014 2:57 pm

Re: text copied from LC generated PDF, WTF?

Post by AxWald » Tue Feb 18, 2020 10:17 pm

Hmmm.

Either your macOS PDF printing, or your PDF viewer, or your text editor is broken.
I assume you're using a newer version of LC, which makes it a 4th candidate.

I tried with a "Planmaker" (Softmaker Office 2018 pro) spreadsheet, and printed with "MS Print to PDF", then "exported as PDF". After that I "MS Printed to PDF" a similar table field from LC 6.7.10.
When opened with "Sumatra PDF" and pasted into "EditPad Pro", all 3 versions came out perfectly fine.

One of your tools is broken. Happy bug hunting ;-)

Have fun!
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!

Klaus
Posts: 14199
Joined: Sat Apr 08, 2006 8:41 am

Re: text copied from LC generated PDF, WTF?

Post by Klaus » Tue Feb 18, 2020 10:24 pm

Hm, I exclusively use the built-in means of macOS 10.14.6 with LC Indy 9.5.1:
1. System-wide printing to PDF resp. "Open in Preview"
2. Preview to open and display the PDF, I copied the text here
3. TextEdit to paste the copied text from 2

So any bug-hunting is out of the question, I'm afraid... :?

dunbarx
VIP Livecode Opensource Backer
Posts: 10333
Joined: Wed May 06, 2009 2:28 pm

Re: text copied from LC generated PDF, WTF?

Post by dunbarx » Tue Feb 18, 2020 10:50 pm

Klaus.

Greetings from the forum.

I made a PDF with the "open printing to pdf" command, copied some text from that output, and it pasted fine into TextEdit.

Then printed a card normally containing a field with text in it, but as a pdf. Same good paste.

Then used "revPrintField" to print just a field. All good.

LC 9.5.1, Mac 10.13.14

Um, er, ah, ahem. Good luck with your hunting.

Craig

Klaus
Posts: 14199
Joined: Sat Apr 08, 2006 8:41 am

Re: text copied from LC generated PDF, WTF?

Post by Klaus » Tue Feb 18, 2020 10:56 pm

I have been reporting more than 150 bugs in the last 15 years and now I'm tired of hunting LC bugs... :(

bogs
Posts: 5480
Joined: Sat Feb 25, 2017 10:45 pm

Re: text copied from LC generated PDF, WTF?

Post by bogs » Tue Feb 18, 2020 11:07 pm

Sorry to tell you Klaus, but you have to keep going till you get them all

Klaus
Posts: 14199
Joined: Sat Apr 08, 2006 8:41 am

Re: text copied from LC generated PDF, WTF?

Post by Klaus » Tue Feb 18, 2020 11:35 pm

Although this is the plan, I doubt I will live long enough for that task.

bogs
Posts: 5480
Joined: Sat Feb 25, 2017 10:45 pm

Re: text copied from LC generated PDF, WTF?

Post by bogs » Tue Feb 18, 2020 11:50 pm

I'm sorry, I can't accept merely dying as an excuse, you'll have to try harder :twisted:

LCMark
Livecode Staff Member
Posts: 1232
Joined: Thu Apr 11, 2013 11:27 am

Re: text copied from LC generated PDF, WTF?

Post by LCMark » Wed Feb 19, 2020 8:52 am

Klaus wrote:
Tue Feb 18, 2020 10:56 pm
I have been reporting more than 150 bugs in the last 15 years and now I'm tired of hunting LC bugs... :(
Around 200 actually - three quarters of which have been in non-GM releases - so thanks for using DPs and RCs and helping to make LC better :)

In regards to this issue - this isn't really a bug in LC - it is due to the difficulty in doing text extraction from PDFs.

PDFs don't actually contain the source text - all PDF viewers have to reverse engineer the text content by looking at the glyphs, the fonts and their locations, then reverse mapping those details back into a linear form of text. All PDF viewers do this differently and it isn't 100% accurate (most do it quite badly actually, for anything more than simple western-language paragraphs!). The text layout and fonts used can have a great effect on the efficacy of the process.
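
One way to picture the reverse mapping is the toy model below. It is only a sketch in Python, not a PDF parser, and it does not claim to show what Preview or LC really do. The code table is an assumption inferred from the single example in this thread (05.01.20 pasted as !".!$.%!), and the names `GLYPH_CODE`, `write_page` and `extract` are made up for illustration. The idea: the page stores glyph codes, and if the viewer has no usable reverse map it can only hand the raw codes to the clipboard.

```python
# Toy model, not a real PDF parser: a "page" stores glyph codes, not characters.
# ASSUMPTION: this code table is inferred from the one example in the thread
# (05.01.20 pasted as !".!$.%!); it is not read from the actual PDF.
GLYPH_CODE = {"0": "!", "5": '"', "1": "$", "2": "%", ".": "."}


def write_page(text):
    """Store text as a list of glyph codes, one per character."""
    return [GLYPH_CODE[ch] for ch in text]


def extract(codes, reverse_map=None):
    """Rebuild text from glyph codes.

    With a reverse map the original characters come back. Without one,
    each code is treated as if it were the character itself.
    """
    if reverse_map is None:
        return "".join(codes)
    return "".join(reverse_map[code] for code in codes)


page = write_page("05.01.20")
reverse_map = {code: ch for ch, code in GLYPH_CODE.items()}

print(extract(page, reverse_map))  # 05.01.20
print(extract(page))               # !".!$.%!
```

In this sketch the page looks fine on screen either way, because drawing only needs the glyph codes; only the copy step depends on the reverse map.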

In this case changing the font to Courier in LC in your test stack gives slightly better results, but the tabular layout foxes Preview's ability to reconstruct things well (and some columns it doesn't 'see' at all). I'd also point out that replicating that exact text, tabs and such in TextEdit with same font/size and printing to PDF (from TextEdit) doesn't do that much better in terms of extraction - it gets more of the text content - but in entirely the wrong order.
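
The ordering problem with tabular layouts can be sketched in the same hedged way. The Python below is a toy model with invented positions and cell contents ("Rent", "Food" and the second date are placeholders, not from the test stack), and `naive_order` and `line_aware_order` are hypothetical helpers, not any viewer's real algorithm. It shows how a tiny baseline difference between columns can change the order in which a viewer linearises the text.

```python
# Toy model: each run is (x, y, text) in PDF-style coordinates (y grows upward).
# ASSUMPTION: positions and cell contents are invented for illustration only.
runs = [
    (200.0, 700.5, "Rent"),
    (50.0, 700.0, "05.01.20"),
    (200.0, 686.5, "Food"),
    (50.0, 686.0, "06.01.20"),
]


def naive_order(runs):
    """Sort strictly top to bottom, then left to right."""
    return [text for x, y, text in sorted(runs, key=lambda r: (-r[1], r[0]))]


def line_aware_order(runs, tolerance=2.0):
    """Group runs with nearly equal baselines into lines, then sort each line by x."""
    lines = []  # each entry: [baseline_y, [runs on that line]]
    for run in sorted(runs, key=lambda r: -r[1]):
        if lines and abs(lines[-1][0] - run[1]) <= tolerance:
            lines[-1][1].append(run)
        else:
            lines.append([run[1], [run]])
    return [text for _, line in lines for x, y, text in sorted(line)]


print(naive_order(runs))       # ['Rent', '05.01.20', 'Food', '06.01.20']
print(line_aware_order(runs))  # ['05.01.20', 'Rent', '06.01.20', 'Food']
```

Both functions see exactly the same glyph runs; only the heuristic differs, which is one reason different viewers can paste different text from the same PDF.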

At the end of the day PDF is a format designed to encapsulate something for printing, if you want a format which is manipulatable as text, then distribute in a text form? e.g. HTML, RTF, Plain Text?

P.S. Cross-posting to the forums and mailing-list is a little irksome and not very good practice. Where is someone meant to reply if they are on both?

Klaus
Posts: 14199
Joined: Sat Apr 08, 2006 8:41 am

Re: text copied from LC generated PDF, WTF?

Post by Klaus » Wed Feb 19, 2020 9:44 am

Hi Mark,

thanks a lot for your explanation, very helpful, although not really satisfying. :D
LCMark wrote:
Wed Feb 19, 2020 8:52 am
At the end of the day PDF is a format designed to encapsulate something for printing, if you want a format which is manipulatable as text, then distribute in a text form? e.g. HTML, RTF, Plain Text?
I do not want a format which is manipulatable as text, I just discovered this inconvenience and wanted an explanation. 8)
Someone is definitely not prepared to copy a little text like the first date (first 8 characters) in my example PDF
-> 05.01.20 and finally end up with -> !".!$.%! pasted in a text editor. So please forgive my reaction.
LCMark wrote:
Wed Feb 19, 2020 8:52 am
P.S. Cross-posting to the forums and mailing-list is a little irksome and not very good practice. Where is someone meant to reply if they are on both?
Some prefer the forum, some prefer the mailing list. Recently Craig (dunbarx) asked me why I had posted to the mailing list
and not to the forum (felt a tad jealous :D ), so I also posted the PDF question to the forum.
And now this is also not right, sigh...


Best

Klaus

bogs
Posts: 5480
Joined: Sat Feb 25, 2017 10:45 pm

Re: text copied from LC generated PDF, WTF?

Post by bogs » Wed Feb 19, 2020 10:20 am

LCMark wrote:
Wed Feb 19, 2020 8:52 am
P.S. Cross-posting to the forums and mailing-list is a little irksome and not very good practice. Where is someone meant to reply if they are on both?
I have to admit, I'm not sure why this would be 'irksome' or 'not very good practice' myself. Although I sometimes skim the mailing list, for instance, I am not a regular browser of that source of information and so often miss questions I might either help with or learn from.

There are others that are the reverse, and we never (or almost never) see them here on the boards.

Then there are those who are, as you mention, on both.

If a question is on both, the either / or crowd will at least see the question, which increases knowledge in both groups. If you're on both, you can post your answer to whichever you were on when you saw it, or to both. As I said, I don't see an extra burden being placed on anyone other than the OP in that situation, but I do see some additional benefit for everyone else.
Klaus wrote:
Wed Feb 19, 2020 9:44 am
And now this is also not right, sigh...
That remains to be seen :wink:

AxWald
Posts: 578
Joined: Thu Mar 06, 2014 2:57 pm

Re: text copied from LC generated PDF, WTF?

Post by AxWald » Wed Feb 19, 2020 10:45 am

Klaus,

did you try another PDF viewer? That's the one that loads the clipboard ...

I tried my LC table field ("printed to PDF" with the basic Win printer driver) again, and whatever I displayed the PDF with (Calibre, Sumatra, Edge, Waterfox, Firefox ...), I never got scrambled results like yours.

Since it displays correctly in the PDF viewer, I'd consider it less probable that LC or the PDF printer driver is responsible.
And since displaying clipboardData["text"] shouldn't be this hard, I'd rule out the text editor, too.

That leaves converting the displayed PDF data to text and loading the clipboard: the PDF viewer. Try another one, there are many, even on Mac. Every browser can display PDFs these days, for instance ...

Have fun!

Klaus
Posts: 14199
Joined: Sat Apr 08, 2006 8:41 am

Re: text copied from LC generated PDF, WTF?

Post by Klaus » Wed Feb 19, 2020 10:56 am

I also tried SKIM, same problem: !".!$.%! instead of 05.01.20!

But that is not the point!
I do not want to convert the PDF to text as I wrote, as always I am just trying to put myself in the position of a newbie!
They will consider this a bug in the software that had generated the PDF AND they will always use what they already got with
the OS like in my case.

I know how to export to text or whatever in LC, believe me. 8)

AxWald
Posts: 578
Joined: Thu Mar 06, 2014 2:57 pm

Re: text copied from LC generated PDF, WTF?

Post by AxWald » Thu Feb 20, 2020 11:44 am

Hi Klaus,

I guess you misunderstood me, or I wrote something misunderstandable. Sorry.
My intention was not to explain something to you, but to help to find the culprit. Something is wrong here, but what?

I wrote table fields to PDF via my Win PDF printer driver using LC 6 & 9.5 (and tried 2 other ways to generate such PDFs). All worked as expected - no !".!$.%! instead of 05.01.20.
Further, the PDF you generate displays correctly (in at least 2 PDF viewers) on your Mac.
That means you can remove LC from the list of possible culprits, with a high probability.

If you upload such a PDF, I'd happily check what a copied portion of it looks like on a different machine/OS. This would narrow the range of culprits further.

Have fun!
