This post is kind of a follow-on from yesterday's fast-but-memory-intensive file reconciliation post. On StackOverflow it was suggested to me that when reconciling large files, it'd be more memory efficient to sort the files first and then reconcile them line by line, rather than storing the entirety of the files in memory. Which brings up another challenge: how do you sort a file that's bigger than RAM? For this example, I'll be using a 1GB file with random 100-character records on each line, and attempting to sort it all using less than 500MB of RAM. The technique I'm using is pretty close to the classic external merge sort. Here's the overview: split the file into smaller chunks, quicksort each chunk, then merge all the sorted chunks. Splitting into chunks is pretty straightforward - and unoptimised! I simply read the input file line by line and write it out to a 'split' file, until that file reaches about 100MB, and then I open the next 'split' file:
static void Split(string file)
{
    int split_num = 1;
    StreamWriter sw = new StreamWriter(
        string.Format("c:\\split{0:d5}.dat", split_num));
    long read_line = 0;
    using (StreamReader sr = new StreamReader(file))
    {
        while (sr.Peek() >= 0)
        {
            // Progress reporting
            if (++read_line % 5000 == 0)
                Console.Write("{0:f2}% \r",
                    100.0 * sr.BaseStream.Position / sr.BaseStream.Length);

            // Copy a line
            sw.WriteLine(sr.ReadLine());

            // If the file is big, then make a new split,
            // however if this was the last line then don't bother
            if (sw.BaseStream.Length > 100000000 && sr.Peek() >= 0)
            {
                sw.Close();
                split_num++;
                sw = new StreamWriter(
                    string.Format("c:\\split{0:d5}.dat", split_num));
            }
        }
    }
    sw.Close();
}
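As an aside, the default StreamReader/StreamWriter buffers are quite small, so one optimisation worth trying (this is just my guess, I haven't measured it) is to construct them with much larger buffers so the split becomes big sequential reads and writes. Something like the following, assuming the same file naming as above and a "using System.Text;" for Encoding:

// Hypothetical tweak, not part of the original project: larger IO buffers for the split.
StreamReader sr = new StreamReader(file, Encoding.UTF8, true, 1 << 20);  // 1MB read buffer
StreamWriter sw = new StreamWriter(
    string.Format("c:\\split{0:d5}.dat", split_num),
    false, Encoding.UTF8, 1 << 20);                                      // 1MB write buffer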
Sorting the chunks is also pretty straightforward: I just use C#'s Array.Sort, because it does the job just fine, thank you very much. C# purists won't like my use of GC.Collect here, but it was mainly just to keep things under the memory limit. In production code you probably wouldn't do that; you'd just let the runtime figure out when to free the memory itself. Anyway:
static void SortTheChunks()
{
    foreach (string path in Directory.GetFiles("C:\\", "split*.dat"))
    {
        // Read all lines into an array
        string[] contents = File.ReadAllLines(path);
        // Sort the in-memory array
        Array.Sort(contents);
        // Create the 'sorted' filename
        string newpath = path.Replace("split", "sorted");
        // Write it
        File.WriteAllLines(newpath, contents);
        // Delete the unsorted chunk
        File.Delete(path);
        // Free the in-memory sorted array
        contents = null;
        GC.Collect();
    }
}
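One subtlety worth flagging: Array.Sort with no comparer uses the current culture's string ordering, while the merge further down compares with String.CompareOrdinal. For plain random alphanumeric records it may not matter, but if your data has mixed case or punctuation you'd want the two orderings to agree, so you might sort each chunk ordinally instead. A one-line change to the method above (my suggestion, not how the original code does it):

Array.Sort(contents, StringComparer.Ordinal); // match the ordinal compare used in the merge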
The merge is the only really tricky bit of code. Basically, it opens a FIFO queue for each of the sorted chunks simultaneously, then repeatedly outputs the lowest record from across all the queues until every queue is empty. You can get a better explanation than mine on Wikipedia: merge algorithm. Anyway, here's my code, which actually runs surprisingly quickly:
static void MergeTheChunks()
{
    string[] paths = Directory.GetFiles("C:\\", "sorted*.dat");
    int chunks = paths.Length;        // Number of chunks
    int recordsize = 100;             // estimated record size
    int records = 10000000;           // estimated total # records
    int maxusage = 500000000;         // max memory usage
    int buffersize = maxusage / chunks; // bytes of each queue
    double recordoverhead = 7.5;      // The overhead of using Queue<>
    int bufferlen = (int)(buffersize / recordsize / recordoverhead); // number of records in each queue

    // Open the files
    StreamReader[] readers = new StreamReader[chunks];
    for (int i = 0; i < chunks; i++)
        readers[i] = new StreamReader(paths[i]);

    // Make the queues
    Queue<string>[] queues = new Queue<string>[chunks];
    for (int i = 0; i < chunks; i++)
        queues[i] = new Queue<string>(bufferlen);

    // Load the queues
    for (int i = 0; i < chunks; i++)
        LoadQueue(queues[i], readers[i], bufferlen);

    // Merge!
    StreamWriter sw = new StreamWriter("C:\\BigFileSorted.txt");
    bool done = false;
    int lowest_index, j, progress = 0;
    string lowest_value;
    while (!done)
    {
        // Report the progress
        if (++progress % 5000 == 0)
            Console.Write("{0:f2}% \r", 100.0 * progress / records);

        // Find the chunk with the lowest value
        lowest_index = -1;
        lowest_value = "";
        for (j = 0; j < chunks; j++)
        {
            if (queues[j] != null)
            {
                if (lowest_index < 0 ||
                    String.CompareOrdinal(queues[j].Peek(), lowest_value) < 0)
                {
                    lowest_index = j;
                    lowest_value = queues[j].Peek();
                }
            }
        }

        // Was nothing found in any queue? We must be done then.
        if (lowest_index == -1) { done = true; break; }

        // Output it
        sw.WriteLine(lowest_value);

        // Remove from queue
        queues[lowest_index].Dequeue();
        // Have we emptied the queue? Top it up
        if (queues[lowest_index].Count == 0)
        {
            LoadQueue(queues[lowest_index], readers[lowest_index], bufferlen);
            // Was there nothing left to read?
            if (queues[lowest_index].Count == 0)
            {
                queues[lowest_index] = null;
            }
        }
    }
    sw.Close();

    // Close and delete the files
    for (int i = 0; i < chunks; i++)
    {
        readers[i].Close();
        File.Delete(paths[i]);
    }
}

static void LoadQueue(Queue<string> queue, StreamReader file, int records)
{
    for (int i = 0; i < records; i++)
    {
        if (file.Peek() < 0) break;
        queue.Enqueue(file.ReadLine());
    }
}
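For completeness, here's a minimal sketch of how I'd wire the three phases together and time them. The Main method, the input path, and the Stopwatch timing are my additions rather than part of the code above (Stopwatch needs a "using System.Diagnostics;"):

static void Main()
{
    string input = "c:\\BigFileUnsorted.txt"; // hypothetical input path

    Stopwatch timer = Stopwatch.StartNew();
    Split(input);
    Console.WriteLine("Split took {0}", timer.Elapsed);

    timer.Restart();
    SortTheChunks();
    Console.WriteLine("Sort took {0}", timer.Elapsed);

    timer.Restart();
    MergeTheChunks();
    Console.WriteLine("Merge took {0}", timer.Elapsed);
}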
One thing I noticed was that the splitting operation was slower than the sort and the merge. It really should be faster than both, since it's simpler and involves sequential rather than random IO. So I believe it wouldn't be hard to optimise it to take less than 1 minute, which would bring the whole process down to about 3 1/2 minutes. Which is pretty awesome for sorting a 1GB file of 10 million records - roughly equivalent to sorting the Australian phone book! Another thing: I limited myself to 500MB mainly for the challenge of it. I checked how much faster it ran with more RAM - 10x bigger chunks and merge buffers - and was surprised to find it was slower overall. And here's the project if you want it (you'll need to generate your own 1GB sample file, I'm not uploading that!): github.com/chrishulbert/ExternalMergeSort
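If you want to generate a suitable sample file to test with, here's a rough sketch of the kind of thing I mean: 10 million lines of 100 random characters each, which works out to roughly 1GB with line endings. The helper name and character set are just illustrative, it's not part of the repo:

static void MakeTestFile(string path)
{
    // Writes ~1GB: 10 million lines of 100 random characters each
    Random rand = new Random();
    const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    char[] line = new char[100];
    using (StreamWriter sw = new StreamWriter(path))
    {
        for (int i = 0; i < 10000000; i++)
        {
            for (int j = 0; j < line.Length; j++)
                line[j] = chars[rand.Next(chars.Length)];
            sw.WriteLine(line);
        }
    }
}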
Thanks for reading! And if you want to get in touch, I'd love to hear from you: chris.hulbert at gmail.