Is There Any Disk Bound Storage?

Moderators: FourthWorld, heatherlaine, Klaus, kevinmiller, LCMark

Locked
SirWobbyTheFirst
VIP Livecode Opensource Backer
VIP Livecode Opensource Backer
Posts: 239
Joined: Tue Jun 30, 2009 11:15 pm

Is There Any Disk Bound Storage?

Post by SirWobbyTheFirst » Thu Mar 06, 2014 2:30 pm

Hey folks, I have a question, is there anything available to LiveCode that one could use to store data like it was a file system but would be read from and written to on a disk rather than being read into the memory? I ask because I have a file system concept for storing files and folders for Nix using XML and whilst XML is a fantastic resource for storing data, it is loaded into the memory as one chunk and I have the disturbing feeling that if the resulting XML file become big, it would result in a startup lag as the whole file would be read into memory by LiveCode.

So is there any alternative that LiveCode will work with from disk rather than putting a copy into memory? I've been looking at SQLite but again, I don't know the specifics of SQLite to make a proper judgement.

Note: The reason I posted this question in this section is because I've never seen runrevmark post anywhere other than here and I want to get his input too.

LCfraser
Livecode Staff Member
Livecode Staff Member
Posts: 71
Joined: Thu Nov 28, 2013 11:18 am
Contact:

Re: Is There Any Disk Bound Storage?

Post by LCfraser » Fri Mar 07, 2014 11:26 am

Because you're using XML, there is no option but to load the file as one chunk. XML is not random-access; you have to read the whole file in order to parse it. The other problem is that it is designed for textual data - binary data will need to be encoded with something like Base64, greatly inflating its size.

If you're determined to use XML, my recommendation would be to have a binary blob file or files containing the file data and use XML to just store the metadata (filenames, creation times, offsets into the blob, etc). That way, you have complete control of when the file data is read into memory - if you use the open file/read from file syntax, you can choose to only load part of the file. LiveCode only reads in the whole file if you use the url syntax.

Regards,
Fraser

malte
Posts: 1098
Joined: Thu Feb 23, 2006 8:34 pm
Location: Ostenfeld germany
Contact:

Re: Is There Any Disk Bound Storage?

Post by malte » Fri Mar 07, 2014 1:03 pm

Because you're using XML, there is no option but to load the file as one chunk.

*cough* StAx *cough*

LCfraser
Livecode Staff Member
Livecode Staff Member
Posts: 71
Joined: Thu Nov 28, 2013 11:18 am
Contact:

Re: Is There Any Disk Bound Storage?

Post by LCfraser » Fri Mar 07, 2014 1:15 pm

malte wrote: *cough* StAx *cough*
You're quite right. But to seek to an arbitrary node in the XML document, you still need to read all of the file up to that point - you can't just seek to an arbitrary offset in the file and start reading as you have no idea where in the document tree you are. If the first file you want is at the end of the XML document, you're in for a long wait ;)

malte
Posts: 1098
Joined: Thu Feb 23, 2006 8:34 pm
Location: Ostenfeld germany
Contact:

Re: Is There Any Disk Bound Storage?

Post by malte » Fri Mar 07, 2014 2:24 pm

But there would be no need to keep it in memory completly, which would allow XML-Data of any size, which indeed would be kind of cool :-D

Best,

Malte

FourthWorld
VIP Livecode Opensource Backer
VIP Livecode Opensource Backer
Posts: 9802
Joined: Sat Apr 08, 2006 7:05 am
Location: Los Angeles
Contact:

Re: Is There Any Disk Bound Storage?

Post by FourthWorld » Fri Mar 07, 2014 4:35 pm

Given the ways XML bounds data with tags, parsing file too large for practical use in RAM will be very slow. If the data is read-only you could traverse it once and create a list of pointers into the data, but then the trick is how to access the pointer you need - lineoffset works well with thousands of items, but can be slow with tens of thousands, and at that point reintroduces the RAM concern.

On the dev list a few years ago I floated the idea of LC providing something like a disk-based array, a simple key-value store that would be efficient to get and put values by a given unique key, but unlike arrays it would leave most of the data on disk.

The suggestion from nearly everyone in that thread was to use SQLite, and given SQLite's ever-growing popularity and its rich feature set (now even richer in LC since the SQLite component was updated recently), it's a very good choice indeed.

But even the collective wisdom of the community never stopped me from foolish experimentation. :) Curious about how b-trees work (the underlying mechanism of nearly every large data store, from the various SQL implementations to most of the NoSQL DBs like CouchDB and MongoDB), I set about exploring whether I could make an efficient b-tree in LiveCode.

Not surprisingly, I can't. B-trees are complex, so much so that relatively few ever write their own; most DBs are based on a handful of b-tree implementations that have been floating around for years, reused in project after project, occasionally revised but rarely attempted from scratch. And even if I could, it would be slow in LC compared to the highly-optimized C-based versions commonly in use.

But I did stumble across something else, a sort of hash table that more or less splits the difference between what LiveCode does well and what it doesn't do so well, arriving at the beginnings of a library I call libDiskArray.

The goal with this was to provide a simple key-value store optimized for CGI use, probably the most demanding environment we use LC in since everything it does runs from a cold-start with each HTTP request, and since most folks use LC Server on shared hosts both CPU time and RAM are critical considerations. Since it can work on files up to 4GB (and with a modest revision could accommodate several terabytes), it may be useful in some desktop contexts as well.

The good news is that if the only thing you need to do is store or retrieve a given value by key, it outperforms even SQLite, and it's all in just native LC in just a few hundred lines of code so it's easily embeddable with minimal overhead. In fact, aside from the value of a key itself, the largest read it does is only 8k, and touches the file only 4 to 5 times, so it's reasonably light on system resources.

But the bad news is that it only does that one thing, setting and getting a value by key. If you need to do any aggregate operations across the data set, like queries, you're left with iterating across the data set, as indexing is a level of complexity I won't be exploring with this.

Since it's extremely rare that I need only that one feature, I've set libDiskArray aside to focus more on SQLite. It really is a better choice. The performance was gratifying, but really somewhat modest (only about 20% faster than an equivalent lookup in SQLite), and with only one feature compared to SQLite's richness it was a good learning experience but not suitable for most practical uses.

Pardon the long post, but in short:

Working with large collections of data from disk is inherently complex, and most useful solutions use b-trees at their core, which are too complex to bother attempting in LC. The engine could do this, but since RunRev has already provided access to SQLite there's almost no reason to, since SQLite has efficient store-and-retrieve along with a wider range of features than anything the team could make time to come up with themselves.

If you absolutely need only a key-value store, have fewer than 1 million keys, and all of your data plus the keys totals less than 4GB, I might be motivated to consider finishing libDiskArray.

But first I would encourage you to consider SQLite. It's a great toolkit, and relatively easy to learn given the power and flexibility it provides.

I found this book enormously helpful in getting started with SQLite:

Using SQLite
Small. Fast. Reliable. Choose Any Three.
By Jay A. Kreibich
http://shop.oreilly.com/product/9780596521196.do
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

SirWobbyTheFirst
VIP Livecode Opensource Backer
VIP Livecode Opensource Backer
Posts: 239
Joined: Tue Jun 30, 2009 11:15 pm

Re: Is There Any Disk Bound Storage?

Post by SirWobbyTheFirst » Fri Mar 07, 2014 5:29 pm

Maybes I should give some more information about what my intentions would be, as it stands now I have a virtual file system of sorts using XML. The node within a node structure of XML lends itself to a directory structure rather nicely and the attributes that can be set on a node allow me to add additional information about a file or folder at ease, the current implementation is that the node name is the actual object name, so referencing a file inside the XML would be akin to revXMLNodeContents(XMLID, "Volume/System/Libraries/PAL-MacOS.dll") and with a little public API magic, I can turn this to look like a standard Windows file path and thus that path becomes HD0:\System\Libraries\PAL-MacOS.dll but internally gets translated into the above.

When referencing the object, space characters are translated into underscores and the usual invalid XML characters are slapped out of the string. I have two attributes attached to the node, Type and Metadata, Type contains a value of File or Folder and Metadata is an array that has been encoded and then the string passed through the Base 64 encoder and contains values such as cSize, cMD5, cCreationDate, cModificationDate as well as a cOwners and cSecurityLevel property that are specific to my software and can be added to, to introduce hidden files, read only files, etc.

Additionally the contents of a file are passed through the Compress function and Base 64 encoded before being stored in the nodes text and upon being read the system will compare an MD5 of the actual data to the value stored in the Metadata array to at least alert the user if a file has been corrupted due to whatever reason. Of course the cSize and cMD5 properties will not apply to a folder object and file objects can only be stored inside folder objects.

The problem however is if the user were to chuck a hefty file on there or a bunch of files, not only may it be slow to write the whole file system back to disk during shutdown but it may also be slow to load it in during startup and as Richard mentioned, there is a limit on the amount of RAM a process can take. So my question was, is there an alternative to XML I could use that could stay disk bound.

I went through a lot of potential ideas for the technology I would use for the file system and among them was a LiveCode array, an archive, a stack and of course XML and out of all of them, XML was the only one that really stood out but still had the disadvantage of potentially bloating the application or crashing it due to an out of memory exception, if the file system only ever held my apps files and not any data from the user then it wouldn't be a problem because the stack files my app uses and the config files are like a couple of KB in size.

I hope this sheds some light on the situation, I am happy to go on with XML, I just wanted to know if there was a possible alternative. I'd also be happy to explain some more info about the file system if anyone is interested.

Regards.

FourthWorld
VIP Livecode Opensource Backer
VIP Livecode Opensource Backer
Posts: 9802
Joined: Sat Apr 08, 2006 7:05 am
Location: Los Angeles
Contact:

Re: Is There Any Disk Bound Storage?

Post by FourthWorld » Fri Mar 07, 2014 7:19 pm

mickpitkin92 wrote:Maybes I should give some more information about what my intentions would be, as it stands now I have a virtual file system of sorts using XML. The node within a node structure of XML lends itself to a directory structure rather nicely and the attributes that can be set on a node allow me to add additional information about a file or folder at ease, the current implementation is that the node name is the actual object name, so referencing a file inside the XML would be akin to revXMLNodeContents(XMLID, "Volume/System/Libraries/PAL-MacOS.dll") and with a little public API magic, I can turn this to look like a standard Windows file path and thus that path becomes HD0:\System\Libraries\PAL-MacOS.dll but internally gets translated into the above.
Ah, but there's the rub: in file systems, the "path" is just an illusion created for the convenience of the user. Under the hood, the relationships between files and folders in a file system are implemented as - you guessed it - a b-tree.

XML is a great way to express hierarchical data in a somewhat human-readable form. But as with many things in computing, there are tradeoffs between human-readability and machine-processing.

JSON is another way to express hierarchical data, but it's worth noting that, like XML, it's most commonly used for exchanging data between systems, rather than for persistent storage of large collections. In a persistent store, such as MongoDB, rather than JSON they use BSON, a binary variant, which would be much harder to parse in high-level languages like JavaScript but is much more efficient to parse in a low-level language like C.

And not surprisingly, the index used for rapidly accessing the BSON data in MongoDB relies on a b-tree.

MongoDB and other NoSQL document stores are great for many needs, but like most DBs they require an always-on DB server (and in Mongo's case also enough RAM to store the index), so they're not ideal for use in CGI environments on shared hosts, or in most standalone desktop applications. SQLite is an exception, designed primarily for single-user local storage (though it can also be used well in other contexts).

If minimal memory usage is a higher priority over performance, you could write a pull-parser in LiveCode that traverses the XML, in which each read could be relatively small, parsing as it goes until it finds the desired item.

But without some form of index to point to the data for a specific document, it would mean finding a document stored at the end of the file could take a very long time.

An additional concern is handling writes, which complicate matters significantly, since updating a given document when the new size is larger than the original would require either moving the contents of all other data down to accommodate it, or maintaining pointers somewhere so you could simply append the file to the collection and note the new location in the pointer.

But then you're back to the core question of finding an efficient way to maintain a key-pointer lookup system that's both memory- and performance-efficient. And that's where most data stores rely on b-trees.

It's a tough nut to crack.

So far the best minds in the business seem to have settled on b-trees as the foundation for such systems.

SQLite's b-tree is very efficient, and a relatively simple SQL schema could support all the metadata options you're looking for, while still providing the very good performance SQLite is famous for, even with millions of records.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

SirWobbyTheFirst
VIP Livecode Opensource Backer
VIP Livecode Opensource Backer
Posts: 239
Joined: Tue Jun 30, 2009 11:15 pm

Re: Is There Any Disk Bound Storage?

Post by SirWobbyTheFirst » Fri Mar 07, 2014 8:11 pm

Very true and when you start shifting data around then you get to the problem practically all file systems are known for, fragmentation, I had a thought about using a combination of XML for metadata storage and a binary blob but got instantly hit with the fragmentation problem. I'm going to have a look into SQLite at some point and might see if I can recreate my file system again, if it works, I might integrate that into Nix instead of the XML file system.

From the sounds of it, LiveCode doesn't load the whole SQLite database into the memory which would be the best thing since sliced bread (My Durham-ness is showing again) to be honest. I'm not too heavily dependant on the XML library's functions, mostly getting the data, attributes, creating and deleting entries and writing the data, I don't use any of the fancy stuff like next sibling, previous sibling, so theoretically, it should be possible.

FourthWorld
VIP Livecode Opensource Backer
VIP Livecode Opensource Backer
Posts: 9802
Joined: Sat Apr 08, 2006 7:05 am
Location: Los Angeles
Contact:

Re: Is There Any Disk Bound Storage?

Post by FourthWorld » Fri Mar 07, 2014 8:48 pm

mickpitkin92 wrote:Very true and when you start shifting data around then you get to the problem practically all file systems are known for, fragmentation
That's one of the things I find fascinating about the EXT4 file system many Linux distros ship with: it's considered generally free of fragmentation, but the underlying mechanism that allows for that isn't something I've looked into. Would be good to read some time, since most other file systems are notorious for fragmentation impairing performance.
mickpitkin92 wrote:From the sounds of it, LiveCode doesn't load the whole SQLite database into the memory which would be the best thing since sliced bread (My Durham-ness is showing again) to be honest.
SQLite is flexible enough to provide options for working with a DB exclusively in memory, but by default it's a disk-based store and takes care of making smart paging decisions for us, so we can focus on the business logic of our apps and let the SQLite engine handle the nitty gritty for us automatically.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

SirWobbyTheFirst
VIP Livecode Opensource Backer
VIP Livecode Opensource Backer
Posts: 239
Joined: Tue Jun 30, 2009 11:15 pm

Re: Is There Any Disk Bound Storage?

Post by SirWobbyTheFirst » Fri Mar 07, 2014 9:04 pm

That's one of the things I find fascinating about the EXT4 file system many Linux distros ship with: it's considered generally free of fragmentation, but the underlying mechanism that allows for that isn't something I've looked into. Would be good to read some time, since most other file systems are notorious for fragmentation impairing performance.
Does Wikipedia not provide any information about it? If not you could always pop on one of the Linux mailing lists and ask. I've noticed in OS X that the kernel will spawn a thread in the background to intercept any write requests made to a HFS+ based volume and rearranges the data to try and limit fragmentation, I think its one of the things that has to be patched if you want to try and enable SSD Trim support. I've got Auslogics DiskDefrag set to run on a Sunday morning in Windows, it is so much more efficient than Windows built in Defragger.
SQLite is flexible enough to provide options for working with a DB exclusively in memory, but by default it's a disk-based store and takes care of making smart paging decisions for us, so we can focus on the business logic of our apps and let the SQLite engine handle the nitty gritty for us automatically.
That sounds fantastic to me, I'll have to get experimenting this weekend, provided nothing crops up, two questions left.Would you happen to know what SQLite performance is like on SSD? By which I mean is it safe to use on an SSD, I have hard disks but I'd want to be sure I wouldn't involuntarily cause a guys SSD to wear out quicker than usual. And does the SQLite library for LC support atomicity (I think that's the name of the feature) so that if two pieces of code try and write to the same entry, the library commits one write and then the other afterwards?

I'm positive atomicity is the right name for it.

UPDATE: I've been trying to get started with SQLite in order to see how I get basic database queries done using the format and I'm using the SQLite Database Lesson on the RunRev website for reference and despite what the lesson and the dictionary says typing:

Code: Select all

Put revOpenDatabase("sqlite", "X:/SQLite Stuff/Test Database.sqlite")
To get the Message Box to even create the database and return a handle to it returns a "revdberr,syntax error" message, this is regardless of whether I use 6.5.0 or 6.5.2, so I'm kind of stuck at the moment, X: is the drive where I store all of my programming stuff and is very much write capable from LiveCode considering stack files and every other file type can be saved there. I even tried sticking the file path into a variable named tDatabasePath and passing that in the function call but it yields the same result, so I'm wondering if the SQLite driver has been messed up slightly.

Locked

Return to “Engine Contributors”