Randomize Lines of a Large Text File

AxWald
Posts: 578
Joined: Thu Mar 06, 2014 2:57 pm

Post by AxWald » Tue Jun 01, 2021 7:42 pm

Ooops,
hadn't read that:
xoxiwe wrote:
Sun May 30, 2021 7:17 pm
well, my program checks every line in a text file based on some given condition, and gets me the lines according to the condition given
This isn't a problem: finding matching lines is easy. There are some techniques to do this really fast.

But what does this mean?
xoxiwe wrote:
Sun May 30, 2021 7:17 pm
what I need is to randomize the lines so the chance to get the expected lines faster since the lines are categorized and I don't know from which line a new categorized record starts.
If you randomize the lines they are - well, randomized. How would this help to find something faster?

If you want to find all lines matching a certain condition: it doesn't matter if they are sorted or not - sorting huge amounts of data takes a lot of time! Nearly always it's faster just to loop through & check each line individually.

Have fun!

PS: For my previous post I used a file of ~55 MB with ~1,000,000 lines. I changed the script to do only this:

Code: Select all

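   repeat for each line L in myData
      -- the next two lines are safety checks: keep the UI responsive
      -- and allow aborting the loop by holding the Control key
      wait 0 millisec with messages
      if the controlKey is down then exit repeat
      if (L is not empty) and (" was " is in L) then
         put L & CR after myVar
      end if
   end repeat
   delete char -1 of myVar  -- remove the trailing CR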
It takes only about 3 seconds to return the matching data (below 1 second if I omit the two safety lines in the loop, the wait and the controlKey check). Actually, it takes longer to show the result in the message box than to compute it ;-)
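As an aside, for a simple "contains" condition like this one, the built-in filter command can do the same selection without an explicit loop. A minimal sketch, assuming the data is already in a variable; the three sample lines are made up for illustration, and it has not been timed against the loop above:

```livecode
on mouseUp
   -- sample data, made up for illustration; in practice myData would
   -- already hold the contents of the text file
   put "it was late" & return & "nothing here" & return & "she was right" into myData
   -- keep only the lines matching the wildcard pattern,
   -- writing the result to myVar and leaving myData untouched
   filter lines of myData with "* was *" into myVar
   put myVar
end mouseUp
```

Here myVar ends up holding the first and third sample lines. How filter treats upper and lower case may differ from "is in", so check the dictionary entry for your LC version before relying on it.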
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!

dunbarx
VIP Livecode Opensource Backer
Posts: 9647
Joined: Wed May 06, 2009 2:28 pm
Location: New York, NY

Post by dunbarx » Tue Jun 01, 2021 9:42 pm

I was playing around with ten million lines of data in the following two similar handlers, with a prize placed in the middle of the list.

The first one uses "repeat for each" in a normal variable, and the second uses an array. This is in a button script with a field hanging around. Only the actual search is timed:

Code: Select all

on mouseUp
   put "" into fld 1
   
   repeat 10000000
      put "aaa" & return after accum
   end repeat
   put "xxx" into line 5000000 of accum
   put the ticks into tStart  --TIMER STARTS HERE
   repeat for each line tLine in accum
      if tLine = "xxx" then
         put the ticks -tStart into fld 1
         exit repeat
      end if
   end repeat
end mouseUp

on mouseUp
   put "" into fld 1
   
   repeat with y = 1 to 10000000
      put "aaa" into myArray[y]
   end repeat
   put "xxx" into  myArray[5000000]
   put the ticks into tStart --TIMER STARTS HERE
   repeat for each key tKey in myArray
      if myArray[tKey] = "xxx" then
         put myArray[tKey] & return  & the ticks -tStart into fld 1
         exit repeat
      end if
   end repeat
end mouseUp
The top (ordinary variable) search took about 180 ticks in the timed section (roughly 3 seconds, at 60 ticks per second), whereas the lower (array) version took about 320 ticks (a bit over 5 seconds). Each was fairly consistent.

I had thought the array would be faster and, moreover, that the array times would differ from run to run, since there is no order to the keys and the match might come sooner or later depending on each build. Clearly I was wrong about that.

Craig

FourthWorld
VIP Livecode Opensource Backer
Posts: 9823
Joined: Sat Apr 08, 2006 7:05 am
Location: Los Angeles

Post by FourthWorld » Tue Jun 01, 2021 10:02 pm

dunbarx wrote:
Tue Jun 01, 2021 9:42 pm
i had thought the array would be faster
Not necessarily. The hashing arrays use is faster than brute-force sequential search when looking for something in a collection, but is not without some overhead.

Arrays excel at access speed for a specific thing in a collection of things, but the advantage goes away when doing aggregate operations across all members of a collection.
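To illustrate the kind of access arrays are good at, here is a minimal sketch, assuming you can afford to build an index keyed by the line content up front. The names tIndex and tLineNum and the sample data are made up for illustration:

```livecode
on mouseUp
   -- build an index array: key = line content, value = line number
   repeat with y = 1 to 1000000
      put y into tIndex["aaa" & y]
   end repeat
   put 500000 into tIndex["xxx"]
   put the ticks into tStart  --TIMER STARTS HERE
   -- one hashed lookup by key, no loop over the collection
   put tIndex["xxx"] into tLineNum
   put the ticks - tStart && tLineNum into fld 1
end mouseUp
```

Building the index has its own cost, so this is only likely to pay off when the same collection is queried by key many times.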
dunbarx wrote:
...and moreover, that the array times would be different in each run of the array, since there is no order to the keys, and the match might come sooner or later depending on each build.
There's an order, it's just not an order that lends itself to human thinking. The order is derived from the internal hash used to assign the keys to buckets. With a given set of keys, that order shouldn't change, since the hash used is the same. But good luck making any sense of it. :)
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

Post by dunbarx » Tue Jun 01, 2021 10:24 pm

FourthWorld wrote:
But good luck making any sense of it
That is what I have you for. :D

Craig

Post by FourthWorld » Tue Jun 01, 2021 10:27 pm

If you know the needle you're looking for in a line-based haystack, you can also let the lineOffset function do the work:

Code: Select all

on mouseup
   repeat 10000000
      put "aaa" & return after accum
   end repeat
   put "xxx" into line 5000000 of accum
   put the ticks into tStart  --TIMER STARTS HERE
   --
   get lineoffset("xxx", accum)
   --
   put the ticks - tStart && it into fld 1
end mouseup
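If more than one line can match, the optional third parameter of lineOffset (the number of lines to skip) lets you continue after each hit. A minimal sketch; the function name allLineOffsets and its parameter names are made up for illustration:

```livecode
function allLineOffsets pNeedle, pData
   put empty into tResult
   put 0 into tSkip
   repeat forever
      -- with a skip value, the result is relative to the skipped lines
      put lineOffset(pNeedle, pData, tSkip) into tFound
      if tFound = 0 then exit repeat
      -- convert the relative offset to an absolute line number
      add tFound to tSkip
      put tSkip & return after tResult
   end repeat
   delete char -1 of tResult
   return tResult
end allLineOffsets
```

Called as allLineOffsets("xxx", accum), it returns one line number per matching line. By default lineOffset also matches a needle that is only part of a line; set the wholeMatches to true first if the needle has to be the entire line. Each call has to step past the skipped lines again, so with very many matches a single "repeat for each line" pass may well be the faster choice.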
Richard Gaskin

rkriesel
VIP Livecode Opensource Backer
Posts: 118
Joined: Thu Apr 13, 2006 6:25 pm

Post by rkriesel » Fri Jun 04, 2021 6:51 am

xoxiwe wrote:
Sun May 30, 2021 7:17 pm
...
Also I forgot to mention that showing the outcome into a field doesnt necessary.
...
Hi, xoxiwe.

If you hadn't put your result into a field, you might have found that your first attempt was already fast enough for you.

Code: Select all

sort tLines by random(2^30)
Given 100 MB of input as 5,000,000 lines each with 20 characters, that sort takes about 18 seconds here.
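For anyone who wants to try this on a small sample first, here is a minimal sketch. The ten sample lines are made up for illustration. As far as I understand it, sort evaluates the "by" expression once per line and then orders the lines by those values, which is why a random key gives a shuffle:

```livecode
on mouseUp
   -- build a small sample list: "line 1" .. "line 10"
   repeat with i = 1 to 10
      put "line" && i & return after tLines
   end repeat
   delete char -1 of tLines
   -- give every line a random numeric sort key and sort by it
   sort lines of tLines numeric by random(2^30)
   -- show the result only at the very end
   put tLines into fld 1
end mouseUp
```

With a large key range like 2^30, two lines rarely draw the same key, so ties should have little effect on the shuffle.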

Thanks to the folks who contributed code to randomize lines, I created my own version. It's thrifty and nifty, but it takes 115 seconds.

How fast is fast enough for you? What requires the randomized file?

-- Dick

Post by dunbarx » Sat Jun 05, 2021 1:14 am

Dick.

Your point is well taken. Always work in variables, loading into a field only at the very end. That is what I did.

I had three chars in my test lines. The length of those lines does not matter, only the number of lines.

Craig
