April has been one hectic month! Between launching WhatGrowersUse.com, pre-launch marketing Optimise and fixing PlantConnection, I’ve barely had a moment to stop and think!
Add to this the fact that I’ve been swotting up on all things dev. I love coding in PHP and today I managed to hack out a super slender piece of functionality for an awesome *top secret* project I’m working on!
I basically need to pull a massive XML feed (think >10MB) and process the contents into a database. So this is grand-scale processing of huge wads of XML data – something I’ve never really had to do before.
This may seem a bit backwards, but moving away from the XML is imperative. This project needs to be fast and agile. I don’t want to have to allocate huge chunks of memory to my whole PHP layer just to deal with a few records in various, disparate XML files.
Starting with SimpleXML
SimpleXML is a fantastic XML parser. Being able to turn XML into native PHP objects, arrays and variables is supremely handy. However, “Simple” is definitely the operative word! When it comes to huge amounts of XML data, SimpleXML just doesn’t cut it!
So in my search for an alternative, I came across something I’ve never used before: a little-known – and it seems little-used – PHP extension called XMLReader. My hero.
XMLReader saves the day
XMLReader is what is known as a pull parser… you grab what you need as the parser races through the XML at lightning speed. It streams the file into context too: reading, caching, splurging… so you can deal with the XML before the whole file is loaded into memory.
This is a huge advantage for massive files. As long as the XML is valid, you can run away processing elements and attributes at a blistering pace.
Something to remember though is that doing it this way is no good if you then try to process all of this data in memory intensive variables or objects (which is the problem with SimpleXML). You either face increasing the memory allocation to PHP (which is limited by how much RAM you have) or you find a quick way to deal with the data and move on.
It’s grab-and-release… no time for processing here. Any kind of re-assignment of data where volumes exceed 30,000 reps of a while-loop will suck your memory dry. So no massive strings, no huge arrays and definitely no objects!
Execution Time
The only issue left is execution time. Between parsing the XML file and saving it to a local MySQL db, your script could take a lot longer than your default execution time.
XMLReader is pretty darn snappy and there’s really no way to improve that with digging into some C code and rebuilding the extension. The code that actually does stuff with the data from the XML is very minimal. The biggest challenge is time writing to the database.
I’m using MySQL. The quickest method of writing to a database (save reformatting the data into some kind delimited text file and using load_data_infile) is mysqli prepared statements. This greatly reduces database load and running thousands (or hundreds of thousands) of queries can be done in mere seconds.
However, even this will be too slow if your max_execution time is 30 seconds. But the only real way to speed MySQL up is a faster processor and improved disk-write speed (think SSDs). Those are expensive options.
The simplest option? Increase script execution time. If this isn’t a user page, you can be a little more relaxed on timeouts. You should be able to use the set_time_limit() function to increase execution time. In fact this function has a handy habit of simply extending your current execution time by the limit you set.
<php
while ( $xmlReader->read() ) {
set_time_limit(2);
// ... get data, save to DB etc ...
}
?>
This will give each loop an extra two seconds to execute, which should be more than enough time to parse an extra few XML nodes and execute a prepared statement one more time.
So there it is. If you’d like to see the full code or have any questions, just ask, I’d be happy to share it. A huge thanks goes to Chad Fennell for his excellent post on XMLReader
