Reconciling/comparing huge data sets with C#

Here is a technique I have used to quickly reconcile large data sets, eg a million rows.

I use this to compare large tables or large files against each other, to find any data rows that are in either file that are missing from the other.

This is similar to a row by row file compare, except that the order of records/lines is unimportant, yet is still fast (an important point). This is all done very efficiently with a hash table, so that it runs pretty quickly (around 10 seconds on my average PC for two sets of a million rows, each row being 100 bytes). In the EFTPOS world, this kind of operation is useful for end-of-day settling of transactions between banks, and poor implementations can take up to several hours, as opposed to mere seconds using this technique.

Okay the first thing to note here, is that I’m not simply loading two big arrays, and scanning between them one at a time, which would cause an exponential growth in the length of time required for larger files. This technique loads both data sets into one hash table. The use of a hash table makes it quicker to look up a particular row, and as it turns out, the C# implementation of Dictionary is plenty quick enough for this, without using too much memory (in my comparison of two 100MB data sets, it used ~280MB of RAM).

Dictionary<string, int> Comparer = new Dictionary<string, int>();

The dictionary above is used to keep track of all records. The (string) key is used to store all the records, and the (int) value is used to keep track of the count of occurrences of that record.

Each time a record is loaded from file A, the integer is incremented for that given record. In reverse, for file B, the integer is decremented for that record.

Then at the end, the program simply looks for any entries in the dictionary where the value isn’t zero. A value of >= 1 means that the record was found in file A but not in B, whereas a value <= -1 was found in B but not in A. The absolute size of the value tells you how many copies of the record were found, this may or may not be useful depending what you are trying to achieve.

Code is available here: github.com/chrishulbert/csharp-dictionary-reconciler.

And here is the important part:

// This compares the contents of two files to find any records 
// that are in one file but not in the other file.
// Note that the order of the records in the files is immaterial.

// A positive value in the integer part of the dictionary
// signifies that the record is found in file A
// A negative value means file B
Dictionary<string, int> Comparer = new Dictionary<string, int>();
string line;
int records=0; // only used for progress reporting

// Load the first file into the dictionary
Console.WriteLine("Loading file A");
using (StreamReader sr = new StreamReader("BigFileA.txt"))
{
  while (sr.Peek() >= 0)
  {
    // Progress reporting
    if (++records % 10000 == 0)
      Console.Write("{0}%    \r",
        sr.BaseStream.Position * 100 / sr.BaseStream.Length);

    line = sr.ReadLine();
    if (Comparer.ContainsKey(line))
      Comparer[line]++;
    else
      Comparer[line] = 1;
  }
}
Console.WriteLine("Loaded");

// Load the second file, hopefully zeroing out the dictionary values
Console.WriteLine("Loading file B");
using (StreamReader sr = new StreamReader("BigFileB.txt"))
{
  while (sr.Peek() >= 0)
  {
    // Progress reporting
    if (++records % 10000 == 0)
      Console.Write("{0}%    \r",
        sr.BaseStream.Position * 100 / sr.BaseStream.Length);

    line = sr.ReadLine();
    if (Comparer.ContainsKey(line))
      Comparer[line]--;
    else
      Comparer[line] = -1;
  }
}
Console.WriteLine("Loaded");

// List any mismatches
int mismatches = 0;
foreach (KeyValuePair<string, int> kvp in Comparer)
{
  if (kvp.Value != 0)
  {
    mismatches++;
    string InWhich = kvp.Value > 0 ? "A" : "B";
    Console.Write("Extra value '{0}' found in file {1}",
      kvp.Key, InWhich);
    if (Math.Abs(kvp.Value) != 1)
      Console.Write(" ({0} times)", Math.Abs(kvp.Value));
    Console.WriteLine();
  }
}
if (mismatches == 0)
  Console.WriteLine("No mismatches found");

// How much ram did this use?
Console.WriteLine(
  "Used {0} MB of memory (private bytes) to compare {1} records",
  System.Diagnostics.Process.
  GetCurrentProcess().PrivateMemorySize64/1024/1024,
  records);

// Free the memory to the GC explicitly in case you use this in other code
// This isn't essential, it just returns the memory faster in my tests.
Comparer.Clear();
Comparer = null;

Thanks for reading!

Thanks for reading! And if you want to get in touch, I'd love to hear from you: chris.hulbert at gmail.

Chris Hulbert

(Comp Sci, Hons - UTS)

Software Developer (Freelancer / Contractor) in Australia.

I have worked at places such as Google, Cochlear, CommBank, Assembly Payments, News Corp, Fox Sports, NineMSN, FetchTV, Coles, Woolworths, Trust Bank, and Westpac, among others. If you're looking for help developing an iOS app, drop me a line!

Get in touch:
[email protected]
github.com/chrishulbert
linkedin

Subscribe via RSS