5 Gig File - How to go about reading portions of it?


Garrett
Posts: 386
Joined: Sat Apr 08, 2006 8:15 am

5 Gig File - How to go about reading portions of it?

Post by Garrett » Mon Sep 07, 2009 2:12 pm

I have a 5 gig XML file. I don't want to load the whole thing at once, since that may not even be possible; and even if it is, it would take far too long to be of use. What I would like to do is open only a portion of it, search that for some data, then move on to the next portion, and so on, building an index of entries in this large file. The index would contain keywords plus the start and end points of the entries related to those keywords.

My first question is whether this is going to be possible in Rev at all.

Second, would using "open file" actually cause Rev to load the entire file?

Third, if this is not possible, does anyone know of a Windows program (preferably free) that would convert this 5 gig XML file to an SQL database that Rev can use?

Thanks,
~Garrett
'What you do not want done to yourself, do not do to others.' - Confucius (550 b.c. to 479 b.c.)

Klaus
Posts: 14208
Joined: Sat Apr 08, 2006 8:41 am

Post by Klaus » Mon Sep 07, 2009 2:36 pm

Hi Garrett,

1. Yes, this is possible!
Check "read from file..." in the docs (= Rev Dictionary) and use it in a repeat loop; see the sketch below.
2. No, "open file" only opens the file for access, it does not load any of its contents.
3. Sorry, no idea, I am a Mac user :-)
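
Here is a minimal, untested sketch of such a loop (the chunk size is arbitrary and the keyword scan is left as a placeholder):

Code:

command IndexLargeFile pFilePath
   constant kChunkSize = 1048576 -- read 1 MB at a time
   local tOffset, tChunk
   put 1 into tOffset
   open file pFilePath for binary read
   repeat
      read from file pFilePath at tOffset for kChunkSize chars
      put it into tChunk
      if tChunk is empty then exit repeat -- nothing left to read
      -- scan tChunk for your keywords here and record
      -- tOffset + the offset within the chunk as an entry's start point
      add the number of chars in tChunk to tOffset
      if the result is "eof" then exit repeat -- reached the end of the file
   end repeat
   close file pFilePath
end IndexLargeFile

Note that a keyword spanning two chunks would be missed by a naive scan, so you may want to overlap the reads a little.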


Best

Klaus

Garrett
Posts: 386
Joined: Sat Apr 08, 2006 8:15 am

Post by Garrett » Mon Sep 07, 2009 9:51 pm

Thanks Klaus. Then I can likely convert the XML file to SQL using Rev itself. :-)
'What you do not want done to yourself, do not do to others.' - Confucius (550 b.c. to 479 b.c.)

Klaus
Posts: 14208
Joined: Sat Apr 08, 2006 8:41 am

Post by Klaus » Tue Sep 08, 2009 8:30 am

Yo :)

Janschenkel
VIP Livecode Opensource Backer
Posts: 977
Joined: Sat Apr 08, 2006 7:47 am

Post by Janschenkel » Tue Sep 08, 2009 1:25 pm

To avoid loading the entire 5GB into a giant DOM tree, you can also use the 'SAX' method of handling events as the XML file is parsed. For this, you would put something like the following in your card script:

Code:

command ParseXmlFileWithMessages pFilePath
  -- createTree = false, so no DOM tree is kept in memory;
  -- sendMessages = true, so the revXML... handlers below get called
  return revCreateXMLTreeFromFile(pFilePath, false, false, true)
end ParseXmlFileWithMessages

on revXMLStartTree
  -- setup everything here to hold the data and insert into database
end revXMLStartTree

on revXMLEndTree
  -- cleanup everything here, as we went through the entire file
end revXMLEndTree

on revStartXMLNode pNodeName, pNodeAttributes
  -- this will be called as the parser enters a node
end revStartXMLNode

on revEndXMLNode pNodeName
  -- this will be called as the parser leaves a node
end revEndXMLNode

on revStartXMLData pElementData
  -- this will be called as the parser finds data between tags
end revStartXMLData
This approach frees you from having to hand-write an XML parser, and keeps the memory requirements to a minimum, as the parser only hands you back chunks of data at a time and can forget about each chunk once it has been sent to the card as part of the appropriate message.
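
To kick off the import, you would call the command with the path to your file and check the result; a minimal sketch (the path is a placeholder, and "xmlerr" is the usual prefix of revXML error strings):

Code:

on mouseUp
   ParseXmlFileWithMessages "C:/data/wikipedia-dump.xml"
   put the result into tOutcome
   -- revXML reports parse errors as strings starting with "xmlerr"
   if tOutcome begins with "xmlerr" then
      answer error "Parsing failed:" && tOutcome
   end if
end mouseUp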

HTH,

Jan Schenkel.
Quartam Reports & PDF Library for LiveCode
www.quartam.com

Garrett
Posts: 386
Joined: Sat Apr 08, 2006 8:15 am

Post by Garrett » Tue Sep 08, 2009 4:57 pm

Thanks a bunch to you too, Jan :-)

My initial problem was not being able to view the XML file at all, since nothing on Windows was capable of viewing a 5 gig file. I wanted to first see which XML tags were being used, then set up the code to go through the file and create the SQL database.

BTW, the 5 gig file is a data dump of Wikipedia. It just contains the entries from their site, with no images. I am making an offline search/reader for it that I can put on my netbook before I start taking some classes at the local college here at the end of the month. I can't afford cell phone service to use the netbook on, so I figured I'd set up the things I might need locally on the computer, such as Wikipedia. Next I'll do the same for their dictionary and thesaurus.

There is already an offline reader for Wikipedia, but I really do not like its interface and its lack of proper handling and presentation of the data. It does have one advantage over what I'm trying to do, though: it reads the XML file from within the compressed file that you download from Wikipedia, which is only 5 gigs in size. Turns out the XML file itself, once uncompressed, is 23 gigs in size. :shock: But Rev didn't complain on my initial test of accessing the file.

Am I correct in thinking that once I convert this to SQL, the file size will be far less than the 23 gig XML file?

Thanks again,
~Garrett
'What you do not want done to yourself, do not do to others.' - Confucius (550 b.c. to 479 b.c.)

Janschenkel
VIP Livecode Opensource Backer
Posts: 977
Joined: Sat Apr 08, 2006 7:47 am

Post by Janschenkel » Tue Sep 08, 2009 7:58 pm

Garrett wrote: Am I correct in thinking that once I convert this to SQL, the file size will be far less than the 23 gig XML file?
Hmm, that depends largely on the verbosity of the XML tag names, and on how compactly the database stores the data. If you know the maximum lengths of the database fields, use VARCHAR columns to save space.
This also sounds more like a job for PostgreSQL than SQLite, but you'll just have to import it and see how well it holds up under such a load.
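
If you do start with SQLite, the import side might look something like this minimal sketch (the table layout and handler names are placeholders, not a recommendation):

Code:

local sConnID

command OpenWikiDatabase pDbPath
   -- open (or create) the SQLite database file
   put revOpenDatabase("sqlite", pDbPath, , , , ) into sConnID
   revExecuteSQL sConnID, "CREATE TABLE IF NOT EXISTS articles" & \
         " (id INTEGER PRIMARY KEY, title VARCHAR(255), body TEXT)"
end OpenWikiDatabase

command StoreArticle pTitle, pBody
   -- :1 and :2 are bound to the named variables, so quoting is handled for us
   revExecuteSQL sConnID, \
         "INSERT INTO articles (title, body) VALUES (:1,:2)", \
         "pTitle", "pBody"
end StoreArticle

You would call OpenWikiDatabase once, then call StoreArticle from your revXML handlers as each entry is completed.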

Jan Schenkel.
Quartam Reports & PDF Library for LiveCode
www.quartam.com

Garrett
Posts: 386
Joined: Sat Apr 08, 2006 8:15 am

Post by Garrett » Tue Sep 08, 2009 9:08 pm

Alright, and again thanks for the help and info. :)
'What you do not want done to yourself, do not do to others.' - Confucius (550 b.c. to 479 b.c.)
