Using Unix tools to process text on Windows
There was a need at work recently to perform a bunch of text processing tasks on very large XML documents, tens of gigabytes in size. The documents in question would look more or less like this:
... some meta data tags here ...
<Rows>
  <TableRow>
    <Fields>
      <Field>Blah blah blah</Field>
      <Field>Stuff here</Field>
      <Field>3</Field>
      <Field>More stuff</Field>
    </Fields>
  </TableRow>
  <TableRow>
    <Fields>
      <Field>Blah blah blah and blah</Field>
      <Field>Stuff here</Field>
      <Field>2</Field>
      <Field>More stuff</Field>
    </Fields>
  </TableRow>
</Rows>
Here's the processing that needed to be done:
- Extract the text contents from the first Field tag from Fields under each TableRow.
- Filter out rows that didn't match a specific regular expression.
- Extract a specific sub-string from the text.
- Group the data on the sub-string and compute the number of times that string occurs.
- Produce the output as a comma separated value (CSV) file.
Extracting text from the Field tag
For step 1, since we're dealing with extremely large XML files, using a DOM-based parser was out of the question since that wouldn't be very memory efficient. I wrote a small utility in C++ (called get-msg) using the XmlLite parser that's been shipping in Windows since Vista days! XmlLite is a native component modeled on .NET's XmlReader and XmlWriter types. It is a forward-only, streaming pull parser, which means that it has an extremely low memory footprint and can deal with XML inputs of arbitrary size. On the flip side, the programming model isn't quite as convenient as that of a DOM parser.
The following snippet shows how you can load up an XML document using XmlLite. TableReader is a simple class I put together to make working with XmlLite easier. The variable _reader below is a member of type CComPtr<IXmlReader> and _fileStream is another member of type CComPtr<IStream>.
bool TableReader::Load(wstring file)
{
    // free up current reader and stream
    _reader.Release();
    _fileStream.Release();

    // load up file
    HRESULT hr = SHCreateStreamOnFile(
        file.c_str(),
        STGM_READ,
        &_fileStream);
    if (FAILED(hr)) {
        return false;
    }

    hr = CreateXmlReader(
        __uuidof(IXmlReader),
        (void **) &_reader,
        nullptr);
    if (FAILED(hr)) {
        return false;
    }

    hr = _reader->SetInput(_fileStream);
    if (FAILED(hr)) {
        return false;
    }

    // move to the "Rows" element
    if (MoveToElement(L"Rows") == false) {
        return false;
    }

    return true;
}
The code should be fairly self-explanatory. The MoveToElement call right at the end is a member method of the TableReader class that's intended to make the job of navigating the node tree easier. Here's what this method looks like:
bool TableReader::MoveToElement(wstring elementName)
{
    HRESULT hr;
    XmlNodeType nodeType;
    LPCWSTR wszLocalName = nullptr;

    while ((hr = _reader->Read(&nodeType)) == S_OK) {
        switch (nodeType) {
        case XmlNodeType_Element:
        {
            hr = _reader->GetLocalName(&wszLocalName, nullptr);
            if (FAILED(hr)) {
                return false;
            }

            // check if the local name is the same as
            // "elementName" and if yes, then we're
            // done
            if (elementName.compare(wszLocalName) == 0) {
                return true;
            }
            break;
        }
        }
    }

    // if we get here, we either ran out of input (Read
    // returns S_FALSE at end of stream) or hit an error;
    // either way the element was not found
    return false;
}
As you can tell, all it does is walk the nodes in the XML document till it encounters an element whose name matches elementName. With this method handy, looking for the specific Field tag in question becomes fairly straightforward. Here's the method that does the job:
bool TableReader::ReadMessage(LPCWSTR *ppwszMsg)
{
    HRESULT hr;

    // move to next "TableRow" element
    if (!MoveToElement(L"TableRow")) {
        return false;
    }

    // move to first "Field" element
    if (!MoveToElement(L"Field")) {
        return false;
    }

    // move reader to the "text" part of the element
    XmlNodeType nodeType;
    hr = _reader->Read(&nodeType);
    if (FAILED(hr) ||
        (nodeType != XmlNodeType_Text &&
         nodeType != XmlNodeType_EndElement)) {
        return false;
    }

    // retrieve the message
    *ppwszMsg = nullptr;
    hr = _reader->GetValue(ppwszMsg, nullptr);
    return SUCCEEDED(hr);
}
The final program is then basically a tight loop that keeps calling ReadMessage till it returns false. Here are the relevant bits:
wstring fileName{ argv[1] };
TableReader reader;

if (reader.Load(fileName) == false) {
    wcout << L"Attempt to load the XML file failed." << endl;
    return 1;
}

// read and print all the messages
LPCWSTR pwszMsg;
while (reader.ReadMessage(&pwszMsg)) {
    // we use wprintf instead of wcout because wcout seems to have
    // trouble dealing with embedded byte order mark byte sequences
    // for some reason
    wprintf(L"%s\n", pwszMsg);
}
Getting the tools - GnuWin
Now that we have a way of rapidly extracting the Field element that we're interested in from the source XML, the rest of the text processing work turns out to be fairly straightforward once we have the right tools handy. The first thing to do is to install the GnuWin package via Chocolatey. If you're a Windows user and don't know what Chocolatey is, you really should get to know it! Briefly, Chocolatey is a command line package manager for Windows - apt-get for Windows, if you will. GnuWin is a package that installs Win32 ports of all the key Unix/Linux tools without having to rely on a heavyweight "environment" like Cygwin. Installing GnuWin is a simple matter of running the following from a command prompt:
cinst GnuWin
That's it. It does take a while to pull in all the files and get set up though.
Processing the text
The tools we're going to use to get the job done are grep, sed, sort, and uniq. Here are the commands I used.
Filter out rows that didn't match a specific regular expression:
grep "Creating OSDisk from OSImage\:.*"
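To sanity-check the filter, here's what it does to a couple of made-up sample lines (in the real pipeline the input comes from get-msg):

```shell
# hypothetical sample lines standing in for get-msg output
printf '%s\n' \
    "Creating OSDisk from OSImage: img-A" \
    "heartbeat ping from node 42" |
    grep "Creating OSDisk from OSImage\:.*"
# only the "Creating OSDisk..." line makes it through
```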
Extract a specific sub-string from the text:
sed -n "s/.*Creating OSDisk from OSImage:\(.*\).*/\1/p"
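The \(.*\) capture group grabs everything after the colon, and the combination of -n and the trailing p prints only the lines where the substitution actually fired. On a made-up sample line (img-A is just a placeholder name):

```shell
echo "Creating OSDisk from OSImage: img-A" |
    sed -n "s/.*Creating OSDisk from OSImage:\(.*\).*/\1/p"
# prints " img-A" - note that any leading space after the colon is kept
```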
Group the data on the sub-string and compute the number of times that string occurs:
sort | uniq -c
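uniq -c only collapses adjacent duplicate lines, which is why the sort has to come first; the -c flag then prefixes each distinct line with its occurrence count. A tiny example:

```shell
printf 'b\na\nb\na\nb\n' | sort | uniq -c
# prints each distinct line prefixed with its count, roughly:
#     2 a
#     3 b
# (uniq right-aligns the count, so the exact leading spacing may vary)
```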
Produce the output as a comma separated value (CSV) file:
sed -n "s/ *\([0-9]*\) \(.*\)/\2,\1/p"
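This last sed takes the count-prefixed lines that uniq -c produces and swaps the two columns around a comma. For example (img-A is again a placeholder):

```shell
echo "      3 img-A" | sed -n "s/ *\([0-9]*\) \(.*\)/\2,\1/p"
# prints "img-A,3"
```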
What we do is to basically pipe everything together like so:
get-msg input.xml |
grep "Creating OSDisk from OSImage\:.*" |
sed -n "s/.*Creating OSDisk from OSImage:\(.*\).*/\1/p" |
sort | uniq -c |
sed -n "s/ *\([0-9]*\) \(.*\)/\2,\1/p"
And finally, redirect the output to a .csv file. That's pretty much it! Processing a 14 GB XML document through this pipeline on my quad-core Intel i7 2014 Lenovo Carbon with 8 GB of RAM (and a truly horrendous keyboard) takes about 5 minutes. Not bad, eh?
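You can see the tail of the pipeline in action without a 14 GB input by standing in for get-msg with a few hypothetical log lines:

```shell
# hypothetical stand-in for get-msg output: two image names, three matches
printf '%s\n' \
    "Creating OSDisk from OSImage: img-A" \
    "Creating OSDisk from OSImage: img-B" \
    "Creating OSDisk from OSImage: img-A" \
    "unrelated message" |
    grep "Creating OSDisk from OSImage\:.*" |
    sed -n "s/.*Creating OSDisk from OSImage:\(.*\).*/\1/p" |
    sort | uniq -c |
    sed -n "s/ *\([0-9]*\) \(.*\)/\2,\1/p"
# prints (the leading space after the colon carries through to the CSV field):
#  img-A,2
#  img-B,1
```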