Using Unix tools to process text on Windows
There was a need at work recently to perform a bunch of text processing tasks on very large XML documents, tens of gigabytes in size. The documents in question would look more or less like this:
... some meta data tags here ...
<Rows>
  <TableRow>
    <Fields>
      <Field>Blah blah blah</Field>
      <Field>Stuff here</Field>
      <Field>3</Field>
      <Field>More stuff</Field>
    </Fields>
  </TableRow>
  <TableRow>
    <Fields>
      <Field>Blah blah blah and blah</Field>
      <Field>Stuff here</Field>
      <Field>2</Field>
      <Field>More stuff</Field>
    </Fields>
  </TableRow>
</Rows>
Here's the processing that needed to be done:
- Extract the text contents from the first Field tag from Fields under each TableRow.
- Filter out rows that didn't match a specific regular expression.
- Extract a specific sub-string from the text.
- Group the data on the sub-string and compute the number of times that string occurs.
- Produce the output as a comma separated value (CSV) file.
Extracting text from the Field tag
For step 1, since we're dealing with extremely large XML files, using a DOM-based parser was out of the question since that wouldn't be very memory efficient. I wrote a small utility in C++ (called get-msg) using the XmlLite parser that's been shipping in Windows since Vista days! XmlLite is a native component modeled on .NET's XmlReader and XmlWriter types. It is a forward-only, streaming pull parser, which means that it has an extremely low memory footprint and can deal with XML inputs of arbitrary size. On the flip side, the programming model isn't quite as convenient as that of a DOM parser.
The following snippet shows how you can load up an XML document using XmlLite. TableReader is a simple class I put together to make working with XmlLite easier. The variable _reader below is a member of type CComPtr<IXmlReader> and _fileStream is another member of type CComPtr<IStream>.
bool TableReader::Load(wstring file)
{
    // free up current reader and stream
    _reader.Release();
    _fileStream.Release();

    // load up file
    HRESULT hr = SHCreateStreamOnFile(
        file.c_str(),
        STGM_READ,
        &_fileStream);
    if (FAILED(hr)) {
        return false;
    }

    hr = CreateXmlReader(
        __uuidof(IXmlReader),
        (void **) &_reader,
        nullptr);
    if (FAILED(hr)) {
        return false;
    }

    hr = _reader->SetInput(_fileStream);
    if (FAILED(hr)) {
        return false;
    }

    // move to the "Rows" element
    if (MoveToElement(L"Rows") == false) {
        return false;
    }

    return true;
}
The code should be fairly self-explanatory. The MoveToElement call right at the end is a member method of the TableReader class that's intended to make the job of navigating the node tree easier. Here's what this method looks like:
bool TableReader::MoveToElement(wstring elementName)
{
    HRESULT hr;
    XmlNodeType nodeType;
    LPCWSTR wszLocalName = nullptr;

    while ((hr = _reader->Read(&nodeType)) == S_OK) {
        switch (nodeType) {
        case XmlNodeType_Element:
        {
            hr = _reader->GetLocalName(&wszLocalName, nullptr);
            if (FAILED(hr)) {
                return false;
            }

            // check if the local name is the same as
            // "elementName" and if yes, then we're
            // done
            if (elementName.compare(wszLocalName) == 0) {
                return true;
            }
            break;
        }
        }
    }

    // if we get here, we either ran out of input (Read
    // returns S_FALSE at end of stream) or hit an error;
    // either way the element was not found
    return false;
}
As you can tell, all it does is walk the nodes in the XML document till it encounters an element whose name matches elementName. With this method handy, looking for the specific Field tag in question becomes fairly straightforward. Here's the method that does the job:
bool TableReader::ReadMessage(LPCWSTR *ppwszMsg)
{
    HRESULT hr;

    // move to next "TableRow" element
    if (!MoveToElement(L"TableRow")) {
        return false;
    }

    // move to first "Field" element
    if (!MoveToElement(L"Field")) {
        return false;
    }

    // move reader to the "text" part of the element
    XmlNodeType nodeType;
    hr = _reader->Read(&nodeType);
    if (FAILED(hr) ||
        (nodeType != XmlNodeType_Text &&
         nodeType != XmlNodeType_EndElement)) {
        return false;
    }

    // retrieve the message
    *ppwszMsg = nullptr;
    hr = _reader->GetValue(ppwszMsg, nullptr);
    return SUCCEEDED(hr);
}
The final program is then basically a tight loop that keeps calling ReadMessage till it returns false. Here are the relevant bits:
wstring fileName{ argv[1] };
TableReader reader;

if (reader.Load(fileName) == false) {
    wcout << L"Attempt to load the XML file failed." << endl;
    return 1;
}

// read and print all the messages
LPCWSTR pwszMsg;
while (reader.ReadMessage(&pwszMsg)) {
    // we use wprintf instead of wcout because wcout seems to have
    // trouble dealing with embedded byte order mark byte sequences
    // for some reason
    wprintf(L"%s\n", pwszMsg);
}
Getting the tools - GnuWin
Now that we have a way of rapidly extracting the Field element that we're interested in from the source XML, the rest of the text processing work turns out to be fairly straightforward once we have the right tools handy. The first thing to do is to install the GnuWin package via Chocolatey. If you're a Windows user and don't know what Chocolatey is, you really should get to know it! Briefly, Chocolatey is a command line package manager for Windows - apt-get for Windows, if you will. GnuWin is a package that installs Win32 ports of all the key Unix/Linux tools without having to rely on a heavyweight "environment" like Cygwin. Installing GnuWin is a simple matter of running the following from a command prompt:
cinst GnuWin
That's it. It does take a while to pull in all the files and get set up though.
Processing the text
The tools we're going to use to get the job done are grep, sed, sort, and uniq. Here are the commands I used.
Filter out rows that didn't match a specific regular expression:
grep "Creating OSDisk from OSImage\:.*"
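To sanity-check the filter, here's what it does to a couple of made-up sample lines (in the real pipeline the input comes from get-msg):

```shell
# hypothetical sample lines standing in for get-msg output
printf '%s\n' \
    "Creating OSDisk from OSImage: img-A" \
    "heartbeat ping from node 42" |
    grep "Creating OSDisk from OSImage\:.*"
# only the "Creating OSDisk..." line makes it through
```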
Extract a specific sub-string from the text:
sed -n "s/.*Creating OSDisk from OSImage:\(.*\).*/\1/p"
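The \(.*\) capture group grabs everything after the colon, and the combination of -n and the trailing p prints only the lines where the substitution actually fired. On a made-up sample line (img-A is just a placeholder name):

```shell
echo "Creating OSDisk from OSImage: img-A" |
    sed -n "s/.*Creating OSDisk from OSImage:\(.*\).*/\1/p"
# prints " img-A" - note that any leading space after the colon is kept
```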
Group the data on the sub-string and compute the number of times that string occurs:
sort | uniq -c
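uniq -c only collapses adjacent duplicate lines, which is why the sort has to come first; the -c flag then prefixes each distinct line with its occurrence count. A tiny example:

```shell
printf 'b\na\nb\na\nb\n' | sort | uniq -c
# prints each distinct line prefixed with its count, roughly:
#     2 a
#     3 b
# (uniq right-aligns the count, so the exact leading spacing may vary)
```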
Produce the output as a comma separated value (CSV) file:
sed -n "s/ *\([0-9]*\) \(.*\)/\2,\1/p"
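This last sed takes the count-prefixed lines that uniq -c produces and swaps the two columns around a comma. For example (img-A is again a placeholder):

```shell
echo "      3 img-A" | sed -n "s/ *\([0-9]*\) \(.*\)/\2,\1/p"
# prints "img-A,3"
```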
What we do is to basically pipe everything together like so:
get-msg input.xml |
grep "Creating OSDisk from OSImage\:.*" |
sed -n "s/.*Creating OSDisk from OSImage:\(.*\).*/\1/p" |
sort | uniq -c |
sed -n "s/ *\([0-9]*\) \(.*\)/\2,\1/p"
And finally, redirect the output to a .csv file. That's pretty much it! Processing a 14 GB XML document through this pipeline on my quad-core Intel i7 2014 Lenovo Carbon with 8 GB of RAM (and a truly horrendous keyboard) takes about 5 minutes. Not bad, eh?
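You can see the tail of the pipeline in action without a 14 GB input by standing in for get-msg with a few hypothetical log lines:

```shell
# hypothetical stand-in for get-msg output: two image names, three matches
printf '%s\n' \
    "Creating OSDisk from OSImage: img-A" \
    "Creating OSDisk from OSImage: img-B" \
    "Creating OSDisk from OSImage: img-A" \
    "unrelated message" |
    grep "Creating OSDisk from OSImage\:.*" |
    sed -n "s/.*Creating OSDisk from OSImage:\(.*\).*/\1/p" |
    sort | uniq -c |
    sed -n "s/ *\([0-9]*\) \(.*\)/\2,\1/p"
# prints (the leading space after the colon carries through to the CSV field):
#  img-A,2
#  img-B,1
```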