[Solved] md5digest of large files, how reliable is it?

FourthWorld
VIP Livecode Opensource Backer
Posts: 9802
Joined: Sat Apr 08, 2006 7:05 am
Location: Los Angeles
Contact:

Re: md5digest of large files, how reliable is it?

Post by FourthWorld » Fri Dec 14, 2018 11:15 pm

Ledigimate wrote:
Thu Dec 13, 2018 7:13 pm
Update: The problem where I get the same result on two different files occurs only when the file size exceeds 504 MB.

Code: Select all

function fileMd5Digest pFilePath
   local tCheckSum
   local tData
   put url ("binfile:" & pFilePath) into tData
   get binarydecode("H*", md5digest(tData), tCheckSum)
   return tCheckSum
end fileMd5Digest
When the file size exceeds 504 MB, tData is empty. I don't know if this is due to a limitation in LiveCode, or due to insufficient memory.
All I/O calls are best followed by error-checking (things can go wrong for many reasons):

Code: Select all

function fileMd5Digest pFilePath
   local tCheckSum
   local tData
   put url ("binfile:" & pFilePath) into tData
   if the result is not empty then
       answer "Couldn't read file "&quote& pFilePath &": "& the result &"("& sysError()&")"
       exit to top
   end if
   get binarydecode("H*", md5digest(tData), tCheckSum)
   return tCheckSum
end fileMd5Digest
That will not only let you know what LC is reporting (the result), but also the specific error code associated with the failure (sysError).

Also, check the limits of the binaryDecode function. It may be that you're reading the whole file but binaryDecode can only convert so much at a time.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

Ledigimate
Livecode Opensource Backer
Posts: 132
Joined: Mon Jan 14, 2013 3:37 pm

Re: md5digest of large files, how reliable is it?

Post by Ledigimate » Sat Dec 15, 2018 8:54 am

Thank you Richard! I wasn't aware of the sysError function. It yielded system error code 8, which translates to insufficient memory.

I can now conclude that in order for the md5Digest function to compute the md5 checksum of a file, the entire file would need to be loaded into a string variable. The md5Digest function reliably computes the md5 value of the string you give it. So this means that if I can't guarantee enough available system memory to load the entire file into a string variable, then I can't use the md5Digest function because I would like the resulting md5 value to be the same as what I would get from the command-line utility. I will now resort to running the command-line utility and parsing the output.
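
For what it's worth, a minimal sketch of that command-line route (the handler name is illustrative, and it assumes a desktop platform where the md5sum utility is on the path; macOS and Windows ship different tools):

Code: Select all

function fileMd5ViaShell pFilePath
   -- illustrative only: assumes md5sum is available on the path
   -- (Linux; on macOS "md5 -q" and on Windows CertUtil play a similar role)
   local tOutput
   put shell("md5sum " & quote & pFilePath & quote) into tOutput
   -- md5sum prints "<hex digest>  <filename>"; the digest is word 1
   return word 1 of tOutput
end fileMd5ViaShell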

Thanks to all of you for helping me find the answer, especially Richard. I appreciate your guiding me patiently to eventually understand what must have been obvious to you from the start. You make these forums totally worth it.

Kind regards

Gerrie
010100000110010101100001011000110110010100111101010011000110111101110110011001010010101101010100011100100111010101110100011010000010101101001010011101010111001101110100011010010110001101100101

[-hh]
VIP Livecode Opensource Backer
Posts: 2262
Joined: Thu Feb 28, 2013 11:52 pm
Location: Göttingen, DE

Re: [Solved] md5digest of large files, how reliable is it?

Post by [-hh] » Sat Dec 15, 2018 10:31 am

From the dictionary entry for messageDigest, introduced in LC 9:
MD5 is cryptographically broken and unsuitable for further use. Do not use for security-critical purposes, unless required for backward compatibility with existing systems.
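
For reference, a sketch of what the newer call could look like with a stronger digest, mirroring the fileMd5Digest handler above (the handler name is illustrative, the "SHA-256" digest-type string is from the LC 9 dictionary, and it would of course hit the same memory limit on very large files):

Code: Select all

function fileSha256Digest pFilePath
   -- illustrative sketch only: LC 9's messageDigest with SHA-256
   local tCheckSum, tData
   put url ("binfile:" & pFilePath) into tData
   get binarydecode("H*", messageDigest(tData, "SHA-256"), tCheckSum)
   return tCheckSum
end fileSha256Digest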
shiftLock happens

Ledigimate
Livecode Opensource Backer
Posts: 132
Joined: Mon Jan 14, 2013 3:37 pm

Re: [Solved] md5digest of large files, how reliable is it?

Post by Ledigimate » Sat Dec 15, 2018 2:04 pm

[-hh] wrote:
Sat Dec 15, 2018 10:31 am
From the dictionary entry for messageDigest, introduced in LC 9:
MD5 is cryptographically broken and unsuitable for further use. Do not use for security-critical purposes, unless required for backward compatibility with existing systems.
Thank you, Hermann. I only needed it for checksum purposes.
010100000110010101100001011000110110010100111101010011000110111101110110011001010010101101010100011100100111010101110100011010000010101101001010011101010111001101110100011010010110001101100101

FourthWorld
VIP Livecode Opensource Backer
Posts: 9802
Joined: Sat Apr 08, 2006 7:05 am
Location: Los Angeles
Contact:

Re: md5digest of large files, how reliable is it?

Post by FourthWorld » Sun Dec 16, 2018 8:06 pm

Ledigimate wrote:
Sat Dec 15, 2018 8:54 am
Thank you Richard! I wasn't aware of the sysError function. It yielded system error code 8, which translates to insufficient memory.
Error checking after I/O is such a valuable habit I'm surprised it isn't part of every tutorial on the subject.
I can now conclude that in order for the md5Digest function to compute the md5 checksum of a file, the entire file would need to be loaded into a string variable. The md5Digest function reliably computes the md5 value of the string you give it. So this means that if I can't guarantee enough available system memory to load the entire file into a string variable, then I can't use the md5Digest function because I would like the resulting md5 value to be the same as what I would get from the command-line utility. I will now resort to running the command-line utility and parsing the output.
Maybe not - did you see Trevor's post on this? By reading the file in chunks and aggregating the checksum you can get a good result without taxing RAM:
https://forums.livecode.com/viewtopic.p ... 92#p174152

The only downside to that approach would be if you later need to compare your checksums with those generated by other tools. But even there, there is hope: a low-memory condition will affect any program, so we can assume that any command-line tool uses a chunked-read method as Trevor's example illustrates. If you need to match the output of a command-line tool from your LC-native solution, you can study the algo it uses to page from disk and adjust the chunk size accordingly.
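
For illustration, a minimal LiveCode sketch of that chunked approach (not Trevor's actual handler from the linked post; the handler name and the 512 KB read size are assumptions):

Code: Select all

function fileMd5DigestChunked pFilePath
   -- sketch only: digest each 512 KB block, then digest the block digests
   constant kChunkSize = 524288
   local tDigests, tCheckSum, tEOF
   put false into tEOF
   open file pFilePath for binary read
   if the result is not empty then
      return empty
   end if
   repeat until tEOF
      read from file pFilePath for kChunkSize
      if the result is "eof" then put true into tEOF
      if it is not empty then put md5digest(it) after tDigests
   end repeat
   close file pFilePath
   -- note: a digest of per-chunk digests will NOT match md5sum's output
   get binarydecode("H*", md5digest(tDigests), tCheckSum)
   return tCheckSum
end fileMd5DigestChunked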
Thanks to all of you for helping me find the answer, especially Richard. I appreciate your guiding me patiently to eventually understand what must have been obvious to you from the start. You make these forums totally worth it.
My pleasure. You're among the few here who actually read what I take the time to write. Makes it worthwhile. Thanks.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

Ledigimate
Livecode Opensource Backer
Posts: 132
Joined: Mon Jan 14, 2013 3:37 pm

Re: [Solved] md5digest of large files, how reliable is it?

Post by Ledigimate » Sun Dec 16, 2018 9:33 pm

FourthWorld wrote:
Sun Dec 16, 2018 8:06 pm
Maybe not - did you see Trevor's post on this? By reading the file in chunks and aggregating the checksum you can get a good result without taxing RAM:
https://forums.livecode.com/viewtopic.p ... 92#p174152

The only downside to that approach would be if you later need to compare your checksums with those generated by other tools. But even there, there is hope: a low-memory condition will affect any program, so we can assume that any command-line tool uses a chunked-read method as Trevor's example illustrates. If you need to match the output of a command-line tool from your LC-native solution, you can study the algo it uses to page from disk and adjust the chunk size accordingly.
Yes, but even though the command-line utility reads the file in chunks, it computes a single MD5 digest of the whole file, not an MD5 digest of the MD5 digests of every chunk. For a native LC chunked-read solution that matches its output, it looks like I would indeed need to study the official RFC for the MD5 message-digest algorithm (RFC 1321) and write my own incremental implementation. I like a good challenge, but to save time I think I'll stick with parsing the output from the command-line utility.
010100000110010101100001011000110110010100111101010011000110111101110110011001010010101101010100011100100111010101110100011010000010101101001010011101010111001101110100011010010110001101100101

FourthWorld
VIP Livecode Opensource Backer
Posts: 9802
Joined: Sat Apr 08, 2006 7:05 am
Location: Los Angeles
Contact:

Re: [Solved] md5digest of large files, how reliable is it?

Post by FourthWorld » Mon Dec 17, 2018 2:01 am

True, with a 512k block size running a custom-scripted MD5 would be slow, in addition to being tedious to script.

But that would only be needed if the consumer of the hash needs to compare it to output from md5sum.

If your own systems are the consumer, hashing the aggregate of hashes will produce reasonably good signatures of uniqueness, and with the buffer size in Trevor's script should run acceptably fast.
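
As a usage note, reusing the hypothetical fileMd5DigestChunked sketch from earlier in the thread, an in-house comparison could be as simple as:

Code: Select all

-- illustrative only: compare two files by their aggregated MD5 signatures
-- (tPathA and tPathB are placeholder variables holding the file paths)
if fileMd5DigestChunked(tPathA) is fileMd5DigestChunked(tPathB) then
   answer "The files appear to be identical."
end if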
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
