[Solved] md5digest of large files, how reliable is it?

Ledigimate
Livecode Opensource Backer
Posts: 132
Joined: Mon Jan 14, 2013 3:37 pm

[Solved] md5digest of large files, how reliable is it?

Post by Ledigimate » Wed Dec 12, 2018 1:37 pm

Hi

I've tested LiveCode's md5digest function and it's super fast, even with files larger than the computer's RAM.

Code:

function fileMd5Digest pFilePath
   local tCheckSum
   get binarydecode("h*", md5digest(url ("binfile:" & pFilePath)), tCheckSum)
   return tCheckSum
end fileMd5Digest

I just don't know how reliable it is for verifying the checksum of large files. How is it that it can compute it so quickly? It would almost seem like it doesn't actually read the entire file, and if it doesn't, how can this be reliable?

Regards

Gerrie
Last edited by Ledigimate on Sat Dec 15, 2018 8:55 am, edited 1 time in total.
010100000110010101100001011000110110010100111101010011000110111101110110011001010010101101010100011100100111010101110100011010000010101101001010011101010111001101110100011010010110001101100101

Re: md5digest of large files, how reliable is it?

Post by Ledigimate » Wed Dec 12, 2018 3:00 pm

I just tested the above code on two large files that differ by a single bit.
The result is disappointing.
It returns the same result for both files.
So now my question becomes, am I going about this the wrong way?

dunbarx
VIP Livecode Opensource Backer
Posts: 9579
Joined: Wed May 06, 2009 2:28 pm
Location: New York, NY

Re: md5digest of large files, how reliable is it?

Post by dunbarx » Wed Dec 12, 2018 4:17 pm

Hi.

I never used this function before, but with a string of about 7000 chars in two variables, I get a different value if they differ by a single char, and the same value if they are in fact the same.

Craig Newman

Re: md5digest of large files, how reliable is it?

Post by Ledigimate » Wed Dec 12, 2018 4:45 pm

If you pass a file URL to the md5Digest function, does it always read the whole file?
If not, that might explain why I got the same value for two very large files that differ by only one bit.
I made a copy of a 3.04 GB file, changed only one bit using a raw disk editor, ran the function against both files, and it gave the same result.

FourthWorld
VIP Livecode Opensource Backer
Posts: 9802
Joined: Sat Apr 08, 2006 7:05 am
Location: Los Angeles

Re: md5digest of large files, how reliable is it?

Post by FourthWorld » Wed Dec 12, 2018 5:57 pm

Ledigimate wrote (Wed Dec 12, 2018 1:37 pm):
> I just don't know how reliable it is for verifying the checksum of large files. How is it that it can compute it so quickly? It would almost seem like it doesn't actually read the entire file, and if it doesn't, how can this be reliable?

How large is "large"?

Once generated, how is the md5 value being used?
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

Re: md5digest of large files, how reliable is it?

Post by Ledigimate » Wed Dec 12, 2018 11:17 pm

> How large is "large"?
Files that are too large to be loaded entirely into RAM, I guess.

> Once generated, how is the md5 value being used?
I would like to use the md5 value to verify the integrity of a copied file.

Re: md5digest of large files, how reliable is it?

Post by FourthWorld » Wed Dec 12, 2018 11:23 pm

Ledigimate wrote (Wed Dec 12, 2018 11:17 pm):
> > How large is "large"?
> Files that are too large to be loaded entirely into RAM, I guess.

Which OS are you using? Many provide hashing functions that can be called from the command line via LC's shell function.

> > Once generated, how is the md5 value being used?
> I would like to use the md5 value to verify the integrity of a copied file.

Will you be doing that manually? For one file, 10 files, 10,000 files? Why MD5 as opposed to more recent algos? Is this all on your local hard drive, or by "copy" do you mean "download"?

Re: md5digest of large files, how reliable is it?

Post by Ledigimate » Thu Dec 13, 2018 12:23 am

> Which OS are you using?
Any version of Windows from XP up to 10.

> Many provide hashing functions that can be called from the command line via LC's shell function.
I have a command-line utility from Microsoft that can do the job, but I just wanted to try the LC function first. If it could spare me some effort, it's worth a shot.

> Will you be doing that manually? For one file, 10 files, 10,000 files?
I want to create a utility that runs from the root directory of a removable drive, recursively calculates the checksum of each file on the drive, and presents the results in a user-friendly format. I want to distribute it along with our software so users can check for corrupted files on the installation media. It's about 15 files.
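
As a rough illustration of the utility described above, here is a minimal sketch in Python (not LiveCode; chosen only because its standard hashlib module can hash a file incrementally). The names file_md5 and checksum_tree are hypothetical, and the 1 MiB chunk size is an arbitrary assumption. It walks a folder tree and hashes each file in fixed-size chunks, so memory use stays flat no matter how large a file is.

```python
import hashlib
import os

def file_md5(path, chunk_size=1024 * 1024):
    """Return the standard MD5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty read means end of file
                break
            h.update(chunk)  # one running hash state across all chunks
    return h.hexdigest()

def checksum_tree(root):
    """Map each file's path (relative to root) to its MD5 hex digest."""
    results = {}
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            full_path = os.path.join(folder, name)
            results[os.path.relpath(full_path, root)] = file_md5(full_path)
    return results
```

The resulting dictionary could be compared against a list of known-good digests shipped with the installation media; how that list is stored and displayed is left open here.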

trevordevore
VIP Livecode Opensource Backer
Posts: 1005
Joined: Sat Apr 08, 2006 3:06 pm
Location: Overland Park, Kansas

Re: md5digest of large files, how reliable is it?

Post by trevordevore » Thu Dec 13, 2018 1:38 am

I use the following function to get the md5 digest of a file. I think Mark Waddingham (LiveCode CTO) was the original author.

Code:

function MD5DigestOfFile pFile
  -----
  local CHUNK_SIZE,theMD5
  local theError
  -----

  ## This combination gave the best results in a very rough test on
  ## OS X 10.5 on an Intel iMac.
  ## Compared with [3|1|2] * 1024 * 1024
  ## Compared 1024 * [128|32|8]
  put 1024 * 128 into CHUNK_SIZE

  open file pFile for binary read
  put the result into theError

  if theError is empty then
    repeat
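      read from file pFile for CHUNK_SIZE chars
      if the result is EOF then
        ## The last read can still return a partial chunk in "it";
        ## digest it before leaving the loop so the file's tail is covered.
        if it is not empty then put the md5Digest of it after theMD5
        exit repeat
      else
        if the result is not empty then
          put the result into theError
        end if
      end if
      put the md5Digest of it after theMD5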

      if theError is not empty then exit repeat
    end repeat
    close file pFile
  end if

  return the md5Digest of theMD5
end MD5DigestOfFile

Trevor DeVore
ScreenSteps - https://www.screensteps.com

LiveCode Repos - https://github.com/search?q=user%3Atrevordevore+topic:livecode
LiveCode Builder Repos - https://github.com/search?q=user%3Atrevordevore+topic:livecode-builder

Re: md5digest of large files, how reliable is it?

Post by FourthWorld » Thu Dec 13, 2018 1:46 am

Thanks, Trevor. Is that chunksize and aggregating method the same as used by macOS's md5 command?

Re: md5digest of large files, how reliable is it?

Post by trevordevore » Thu Dec 13, 2018 1:58 am

I don't know, Richard. I would guess not. I've only used it in situations where I ran the same function on all files.

Re: md5digest of large files, how reliable is it?

Post by FourthWorld » Thu Dec 13, 2018 2:07 am

trevordevore wrote (Thu Dec 13, 2018 1:58 am):
> I don't know, Richard. I would guess not. I've only used it in situations where I ran the same function on all files.

Thanks. That's why I was asking about usage above. Solutions are easy to come by when the producer and the consumer of information are the same party. But if the scenario were to provide a checksum to others for a file being offered for download, then depending on the method used, any checksum we post may or may not bear any relationship to the algo they use to run a confirming checksum.
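
To illustrate the point about methods having to match, here is a small Python sketch (an illustration only, not the code used by any particular tool). It assumes Python's standard hashlib module, and the function names are made up for the example. Feeding chunks into one running MD5 state yields the standard digest, whereas hashing the concatenated per-chunk digests, which is the approach of the chunked function posted earlier in this thread, yields a different value that only agrees with other runs of the same method.

```python
import hashlib

CHUNK_SIZE = 128 * 1024  # same chunk size as the chunked function in this thread

def md5_streamed(data):
    """Standard MD5: feed every chunk into one running hash state."""
    h = hashlib.md5()
    for i in range(0, len(data), CHUNK_SIZE):
        h.update(data[i:i + CHUNK_SIZE])
    return h.hexdigest()

def md5_of_chunk_digests(data):
    """Non-standard: MD5 over the concatenated per-chunk MD5 digests."""
    digests = b"".join(
        hashlib.md5(data[i:i + CHUNK_SIZE]).digest()
        for i in range(0, len(data), CHUNK_SIZE)
    )
    return hashlib.md5(digests).hexdigest()

data = bytes(range(256)) * 2048  # 512 KiB of sample data, i.e. four chunks
whole = hashlib.md5(data).hexdigest()
print(md5_streamed(data) == whole)          # True
print(md5_of_chunk_digests(data) == whole)  # False
```

So a digest-of-digests is fine when the same function checks both copies, but it will not match what a standard MD5 tool reports for the same file.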

Re: md5digest of large files, how reliable is it?

Post by Ledigimate » Thu Dec 13, 2018 9:30 am

After some more testing, I also discovered that the binarydecode function swaps the hex characters in the result when asked to decode binary data to hex data, e.g. it yields
9c0f8f59bf89ba19955ff10d92e732d6
instead of
c9f0f895fb98ab9159f51fd0297e236d
Why on earth would it do that? Or is the md5Digest function to blame?

Re: md5digest of large files, how reliable is it?

Post by FourthWorld » Thu Dec 13, 2018 10:53 am

Ledigimate wrote (Thu Dec 13, 2018 9:30 am):
> After some more testing, I also discovered that the binarydecode function swaps the hex characters in the result when asked to decode binary data to hex data, e.g. it yields
> 9c0f8f59bf89ba19955ff10d92e732d6
> instead of
> c9f0f895fb98ab9159f51fd0297e236d
> Why on earth would it do that? Or is the md5Digest function to blame?

Double check that first argument. Options are provided for most data sizes in both byte orders.

Re: md5digest of large files, how reliable is it?

Post by Ledigimate » Thu Dec 13, 2018 7:13 pm

> Double check that first argument. Options are provided for most data sizes in both byte orders.
Thanks Richard, that was it. I couldn't make sense of the relevant dictionary entry from inside LC, so I looked it up online, and even there I had to decipher the text, which wasn't properly punctuated. But I figured it out: the format needs to be "H*" (high nibble first) rather than "h*" (low nibble first).
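
For anyone puzzled by the swapped digits, here is a short Python sketch of the two nibble orders (an illustration, not LiveCode's implementation; the function names are hypothetical). It reproduces both strings quoted earlier in this thread from the same 16 digest bytes.

```python
def hex_high_nibble_first(data):
    """Conventional hex dump; comparable to binaryDecode with "H*"."""
    return data.hex()

def hex_low_nibble_first(data):
    """Each byte's two hex digits swapped; comparable to "h*"."""
    return "".join(f"{b & 0x0F:x}{b >> 4:x}" for b in data)

digest = bytes.fromhex("c9f0f895fb98ab9159f51fd0297e236d")
print(hex_high_nibble_first(digest))  # c9f0f895fb98ab9159f51fd0297e236d
print(hex_low_nibble_first(digest))   # 9c0f8f59bf89ba19955ff10d92e732d6
```

The digest bytes are identical in both cases; only the order in which each byte's two hex digits are written differs.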

Update: The problem where I get the same result for two different files occurs only when the file size exceeds 504 MB.

Code:

function fileMd5Digest pFilePath
   local tCheckSum
   local tData
   put url ("binfile:" & pFilePath) into tData
   get binarydecode("H*", md5digest(tData), tCheckSum)
   return tCheckSum
end fileMd5Digest

When the file size exceeds 504 MB, tData is empty, so md5digest is hashing empty data and returns the same value for every such file. I don't know if this is due to a limitation in LiveCode or to insufficient memory.
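
One way to catch this kind of silent failure is to compare the result against the well-known MD5 digest of empty input. The Python sketch below is only an illustration of that check; looks_like_failed_read is a hypothetical helper name, and the file size is assumed to come from a separate query rather than from the data that was read.

```python
import hashlib

# MD5 digest of zero bytes of input
EMPTY_MD5 = "d41d8cd98f00b204e9800998ecf8427e"

def md5_hex(data):
    """Standard MD5 hex digest of an in-memory byte string."""
    return hashlib.md5(data).hexdigest()

def looks_like_failed_read(digest_hex, file_size):
    """True when a non-empty file produced the digest of empty data."""
    return file_size > 0 and digest_hex == EMPTY_MD5

print(md5_hex(b"") == EMPTY_MD5)                           # True
print(looks_like_failed_read(md5_hex(b""), 3 * 1024**3))   # True
print(looks_like_failed_read(md5_hex(b"abc"), 3))          # False
```

Reading and hashing the file in chunks avoids the problem altogether, since the whole file never has to fit in a single variable.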