This post is kind of a follow-on from yesterday's fast-but-memory-intensive file reconciliation post. On StackOverflow it was suggested to me that when reconciling large files, it'd be more memory efficient to sort the files first and then reconcile them line by line, rather than storing the entirety of the files in memory. Which brings up another challenge: how do you sort a file that's bigger than RAM? For this example, I'll be using a 1GB file with random 100-character records on each line, and attempting to sort it all using less than 500MB of RAM. The technique I'm using is pretty close to the classic external merge sort. Here's the overview: split the file into smaller chunks, quicksort each chunk, then merge all the sorted chunks. Splitting into chunks is pretty straightforward - and unoptimised! I simply read the input file line by line and write it out to a 'split' file, until that file reaches about 100MB, and then I open the next 'split' file:
static void Split(string file)
{
    int split_num = 1;
    StreamWriter sw = new StreamWriter(
        string.Format("c:\\split{0:d5}.dat", split_num));
    long read_line = 0;
    using (StreamReader sr = new StreamReader(file))
    {
        while (sr.Peek() >= 0)
        {
            // Progress reporting
            if (++read_line % 5000 == 0)
                Console.Write("{0:f2}% \r",
                    100.0 * sr.BaseStream.Position / sr.BaseStream.Length);

            // Copy a line
            sw.WriteLine(sr.ReadLine());

            // If the file is big, then make a new split,
            // however if this was the last line then don't bother
            if (sw.BaseStream.Length > 100000000 && sr.Peek() >= 0)
            {
                sw.Close();
                split_num++;
                sw = new StreamWriter(
                    string.Format("c:\\split{0:d5}.dat", split_num));
            }
        }
    }
    sw.Close();
}
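As an aside, the default StreamReader/StreamWriter buffers are quite small, so one optimisation worth trying (this is just my guess, I haven't measured it) is to construct them with much larger buffers so the split becomes big sequential reads and writes. Something like the following, assuming the same file naming as above and a "using System.Text;" for Encoding:

// Hypothetical tweak, not part of the original project: larger IO buffers for the split.
StreamReader sr = new StreamReader(file, Encoding.UTF8, true, 1 << 20);  // 1MB read buffer
StreamWriter sw = new StreamWriter(
    string.Format("c:\\split{0:d5}.dat", split_num),
    false, Encoding.UTF8, 1 << 20);                                      // 1MB write buffer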
Sorting the chunks is also pretty straightforward: I just use C#'s Array.Sort, because it does the job just fine, thank you very much. C# purists won't like my use of GC.Collect here, but it was mainly just to keep things under the memory limit. In production code you probably wouldn't do that; you'd just let the runtime figure out when to free the memory itself. Anyway:
static void SortTheChunks()
{
    foreach (string path in Directory.GetFiles("C:\\", "split*.dat"))
    {
        // Read all lines into an array
        string[] contents = File.ReadAllLines(path);
        // Sort the in-memory array
        Array.Sort(contents);
        // Create the 'sorted' filename
        string newpath = path.Replace("split", "sorted");
        // Write it
        File.WriteAllLines(newpath, contents);
        // Delete the unsorted chunk
        File.Delete(path);
        // Free the in-memory sorted array
        contents = null;
        GC.Collect();
    }
}
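One subtlety worth flagging: Array.Sort with no comparer uses the current culture's string ordering, while the merge further down compares with String.CompareOrdinal. For plain random alphanumeric records it may not matter, but if your data has mixed case or punctuation you'd want the two orderings to agree, so you might sort each chunk ordinally instead. A one-line change to the method above (my suggestion, not how the original code does it):

Array.Sort(contents, StringComparer.Ordinal); // match the ordinal compare used in the merge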
The merge is the only really tricky bit of code. Basically, it opens a FIFO queue for each of the sorted chunks simultaneously, then repeatedly outputs the lowest record from across all the queues until every queue is empty. You can get a better explanation than mine on Wikipedia: merge algorithm. Anyway, here's my code, which actually runs surprisingly quickly:
static void MergeTheChunks()
{
    string[] paths = Directory.GetFiles("C:\\", "sorted*.dat");
    int chunks = paths.Length;        // Number of chunks
    int recordsize = 100;             // estimated record size
    int records = 10000000;           // estimated total # records
    int maxusage = 500000000;         // max memory usage
    int buffersize = maxusage / chunks; // bytes of each queue
    double recordoverhead = 7.5;      // The overhead of using Queue<>
    int bufferlen = (int)(buffersize / recordsize / recordoverhead); // number of records in each queue

    // Open the files
    StreamReader[] readers = new StreamReader[chunks];
    for (int i = 0; i < chunks; i++)
        readers[i] = new StreamReader(paths[i]);

    // Make the queues
    Queue<string>[] queues = new Queue<string>[chunks];
    for (int i = 0; i < chunks; i++)
        queues[i] = new Queue<string>(bufferlen);

    // Load the queues
    for (int i = 0; i < chunks; i++)
        LoadQueue(queues[i], readers[i], bufferlen);

    // Merge!
    StreamWriter sw = new StreamWriter("C:\\BigFileSorted.txt");
    bool done = false;
    int lowest_index, j, progress = 0;
    string lowest_value;
    while (!done)
    {
        // Report the progress
        if (++progress % 5000 == 0)
            Console.Write("{0:f2}% \r", 100.0 * progress / records);

        // Find the chunk with the lowest value
        lowest_index = -1;
        lowest_value = "";
        for (j = 0; j < chunks; j++)
        {
            if (queues[j] != null)
            {
                if (lowest_index < 0 ||
                    String.CompareOrdinal(queues[j].Peek(), lowest_value) < 0)
                {
                    lowest_index = j;
                    lowest_value = queues[j].Peek();
                }
            }
        }

        // Was nothing found in any queue? We must be done then.
        if (lowest_index == -1) { done = true; break; }

        // Output it
        sw.WriteLine(lowest_value);

        // Remove from queue
        queues[lowest_index].Dequeue();
        // Have we emptied the queue? Top it up
        if (queues[lowest_index].Count == 0)
        {
            LoadQueue(queues[lowest_index], readers[lowest_index], bufferlen);
            // Was there nothing left to read?
            if (queues[lowest_index].Count == 0)
            {
                queues[lowest_index] = null;
            }
        }
    }
    sw.Close();

    // Close and delete the files
    for (int i = 0; i < chunks; i++)
    {
        readers[i].Close();
        File.Delete(paths[i]);
    }
}

static void LoadQueue(Queue<string> queue, StreamReader file, int records)
{
    for (int i = 0; i < records; i++)
    {
        if (file.Peek() < 0) break;
        queue.Enqueue(file.ReadLine());
    }
}
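For completeness, here's a minimal sketch of how I'd wire the three phases together and time them. The Main method, the input path, and the Stopwatch timing are my additions rather than part of the code above (Stopwatch needs a "using System.Diagnostics;"):

static void Main()
{
    string input = "c:\\BigFileUnsorted.txt"; // hypothetical input path

    Stopwatch timer = Stopwatch.StartNew();
    Split(input);
    Console.WriteLine("Split took {0}", timer.Elapsed);

    timer.Restart();
    SortTheChunks();
    Console.WriteLine("Sort took {0}", timer.Elapsed);

    timer.Restart();
    MergeTheChunks();
    Console.WriteLine("Merge took {0}", timer.Elapsed);
}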
One thing I noticed was that the splitting operation was slower than the sort and the merge. It really should be faster than both, since it's simpler and involves sequential rather than random IO. So I believe it wouldn't be hard to optimise it to take less than 1 minute, which would bring the whole process down to about 3 1/2 minutes. Which is pretty awesome for sorting a 1GB file of 10 million records - roughly equivalent to sorting the Australian phone book! Another thing: I limited myself to 500MB mainly for the challenge of it. I checked how much faster it ran with more RAM - 10x bigger chunks and merge buffers - and was surprised to find it was slower overall. And here's the project if you want it (you'll need to generate your own 1GB sample file, I'm not uploading that!): github.com/chrishulbert/ExternalMergeSort
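If you want to generate a suitable sample file to test with, here's a rough sketch of the kind of thing I mean: 10 million lines of 100 random characters each, which works out to roughly 1GB with line endings. The helper name and character set are just illustrative, it's not part of the repo:

static void MakeTestFile(string path)
{
    // Writes ~1GB: 10 million lines of 100 random characters each
    Random rand = new Random();
    const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    char[] line = new char[100];
    using (StreamWriter sw = new StreamWriter(path))
    {
        for (int i = 0; i < 10000000; i++)
        {
            for (int j = 0; j < line.Length; j++)
                line[j] = chars[rand.Next(chars.Length)];
            sw.WriteLine(line);
        }
    }
}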
Thanks for reading! And if you want to get in touch, I'd love to hear from you: chris.hulbert at gmail.