5 Gig File - How to go about reading portions of it?
I have a 5 gig XML file. I don't want to load the whole thing at once: that may not even be possible, and if it is, it would take far too long to be of use. What I would like to do is open only a portion of it, search for some data, then move on to the next portion, and so on, building an index of the entries in this large file. The index would contain keywords along with the start and end points of the entries related to those keywords.
My question first is whether this is going to be possible in Rev?
Second, would using "open file" actually cause rev to load the entire file?
Third, if this is not possible, does anyone know of a Windows program (preferably free) that would convert this 5 gig XML file to an SQL database that Rev can use?
Thanks,
~Garrett
'What you do not want done to yourself, do not do to others.' - Confucius (550 b.c. to 479 b.c.)
To avoid loading the entire 5GB into a giant DOM tree, you can also use the 'SAX' method of handling events as the XML file is parsed. For this, you would put something like the following in your card script:
This approach frees you from having to hand-write an XML parser and keeps memory requirements to a minimum: the parser hands you only a chunk of data at a time and can forget that data once it has sent it to the card as part of the appropriate message.
HTH,
Jan Schenkel.
Code:
command ParseXmlFileWithMessages pFilePath
   return revCreateXmlTreeFromFile(pFilePath, false, false, true)
end ParseXmlFileWithMessages

on revXMLStartTree
   -- setup everything here to hold the data and insert into database
end revXMLStartTree

on revXMLEndTree
   -- cleanup everything here, as we went through the entire file
end revXMLEndTree

on revStartXMLNode pNodeName, pNodeAttributes
   -- this will be called as the parser enters a node
end revStartXMLNode

on revEndXMLNode pNodeName
   -- this will be called as the parser leaves a node
end revEndXMLNode

on revStartXMLData pElementData
   -- this will be called as the parser finds data between tags
end revStartXMLData
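As an illustration of how those empty handlers might be filled in, here is a minimal sketch that collects the text of every title element. The element name "title" and the variable names sInTitle, sBuffer and sTitles are assumptions made for the example, so check them against the tags actually used in your file. The data handler appends to a buffer instead of overwriting it, as a precaution in case the parser delivers the text of one element in several pieces.

```livecode
-- Script-local state shared by the handlers below (all names are illustrative)
local sInTitle, sBuffer, sTitles

on revXMLStartTree
   put false into sInTitle
   put empty into sBuffer
   put empty into sTitles
end revXMLStartTree

on revStartXMLNode pNodeName, pNodeAttributes
   -- ASSUMPTION: each entry's name sits in a <title> element
   if pNodeName is "title" then
      put true into sInTitle
      put empty into sBuffer
   end if
end revStartXMLNode

on revStartXMLData pElementData
   -- append rather than overwrite, in case the text arrives in pieces
   if sInTitle then put pElementData after sBuffer
end revStartXMLData

on revEndXMLNode pNodeName
   if pNodeName is "title" then
      -- a real importer would write sBuffer to the database here
      -- instead of keeping every title in memory
      put sBuffer & return after sTitles
      put false into sInTitle
   end if
end revEndXMLNode
```

For a file of this size you would probably not want to keep the whole list in sTitles; it is only there to keep the sketch short.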
Quartam Reports & PDF Library for LiveCode
www.quartam.com
Thanks a bunch to you too, Jan.
My initial problem was not being able to view the XML file at all, since nothing on Windows was capable of viewing a 5 gig file. I wanted first to see what XML tags were being used, then set up the code to go through the file and create the SQL database.
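For a quick look at the tags without a viewer, something like the following sketch reads only the first few thousand characters of the file. "open file" by itself just opens the file; data is only pulled in by "read from file", and only as much as you ask for. The function name PeekAtFile and its parameters are made up for this example.

```livecode
-- Returns the first pCharCount characters of the file, or empty on failure.
-- PeekAtFile is a hypothetical helper name, not a built-in.
function PeekAtFile pFilePath, pCharCount
   local tChunk
   open file pFilePath for read
   if the result is not empty then
      -- the file could not be opened
      return empty
   end if
   -- read only the requested amount from the start of the file
   read from file pFilePath for pCharCount
   put it into tChunk
   close file pFilePath
   return tChunk
end PeekAtFile
```

It could be called as put PeekAtFile("C:/data/dump.xml", 4096) into field "Preview", where both the path and the field name are placeholders.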
BTW, the 5 gig file is a data dump of Wikipedia. It just contains the entries from their site with no images. I am making an offline search/reader for it that I can put on my netbook before I start taking some classes at the local college here at the end of the month. I can't afford cell phone service to use the netbook on, so I figured I'd set up the things I might need locally on the computer, such as Wikipedia. Next I'll do the same for their dictionary and thesaurus.
There is an offline reader already for Wikipedia, but I really do not like its interface and its lack of proper handling and presentation of the data. It does have one advantage over what I'm trying to do, though: it reads the XML file from within the compressed file that you download from Wikipedia, which is only 5 gigs in size. It turns out the XML file itself, once uncompressed, is 23 gigs in size.
But Rev didn't complain on my initial test of accessing the file.
Am I correct in thinking that once I convert this to SQL that the file size will be far less than the 23 gig XML file?
Thanks again,
~Garrett

Garrett wrote: Am I correct in thinking that once I convert this to SQL that the file size will be far less than the 23 gig XML file?
Hmm, that depends largely on the verbosity of the XML tag names and on how compactly the database stores the data. If you know the maximum lengths of the database fields, use VARCHAR strings to save space.
This also sounds more like a job for PostgreSQL than SQLite, but you'll just have to import it and see how well it reacts under such a load.
Jan Schenkel.