A Random Line From A File
The problem at work was the following: given a relatively large text file (with a little over 5,000,000 lines), write a procedure to return a random line out of this file. One naïve implementation that can be thought of would be the following: count the number of lines (O(n)), read off all the lines into an array (O(n)) and finally return a random line by randomly choosing one from the bounded array. That would cost not only O(2n), but O(n)-space which leads to yet extra processing time. A better approach would be the following: as you read the lines off of the file, have an index that indicates the current line that you're processing (line 1, 2, 3, and so on). Call this index the "lineIndex" counter. Then, here is the magical trick: select the current line as the one to be chosen with a probability 1/lineIndex. Suppose that your file only has one line. That line will surely be the one picked by this algorithm since the probability 1/1 means 100% chance. If the file has two lines, the second one will be picked with a probability of 1/2, or 50%. It also means that the first element won't be selected with a probability equals to 1-1/2, or 50%. As you can see, the algorithm is simply making an in-place decision about whether or not to pick the current line as the chosen one based on the right probability for that line. It reduces the time complexity by 1/2 (O(n)), with an extra bonus of only using O(1)-space. Here are the results comparing both approaches (the naïve one as well as the optimized one) - a saving of near 1 second, or ~3x more efficient than the first implementation:
Random Line: eumolpus
Execution time for ReadRandomLineNonEfficiently: 1.4310818 seconds
Random Line: dalmatic
Execution time for ReadRandomLineEfficiently: 0.5110293 seconds
And here are the two implementations:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace ReadRandomLine {
class Program {
static void Main(string[] args) {
if (args.Length != 1) {
Console.WriteLine("Usage: ReadRandomLine.exe <file name>");
return;
}
ReadRandomLineNonEfficiently(args[0]);
ReadRandomLineEfficiently(args[0]);
}
static void ReadRandomLineNonEfficiently(string fileName) {
if (String.IsNullOrEmpty(fileName) || !File.Exists(fileName)) {
return;
}
long ticksBefore = DateTime.Now.Ticks;
string retVal = "";
FileInfo fi = new FileInfo(fileName);
StreamReader sr = fi.OpenText();
int numberOfEntries = 0;
while (!sr.EndOfStream) {
sr.ReadLine();
numberOfEntries++;
}
if (sr != null) {
sr.Close();
}
sr = fi.OpenText();
int index = 0;
string[] lines = new string[numberOfEntries];
while (!sr.EndOfStream) {
lines[index++] = sr.ReadLine();
}
if (sr != null) {
sr.Close();
}
Random rd = new Random();
retVal = lines[rd.Next(0, lines.Length)];
Console.WriteLine("Random Line: {0}", retVal);
long ticksAfter = DateTime.Now.Ticks;
double latency = (ticksAfter - ticksBefore) / 10000000.0;
Console.WriteLine("Execution time for ReadRandomLineNonEfficiently: {0} seconds", latency);
Console.WriteLine();
}
static void ReadRandomLineEfficiently(string fileName) {
if (String.IsNullOrEmpty(fileName) || !File.Exists(fileName)) {
return;
}
long ticksBefore = DateTime.Now.Ticks;
string retVal = "";
FileInfo fi = new FileInfo(fileName);
StreamReader sr = fi.OpenText();
Random rd = new Random();
int index = 0;
while (!sr.EndOfStream) {
string line = sr.ReadLine();
if (rd.Next(0, ++index) == 0) {
retVal = line;
}
}
if (sr != null) {
sr.Close();
}
Console.WriteLine("Random Line: {0}", retVal);
long ticksAfter = DateTime.Now.Ticks;
double latency = (ticksAfter - ticksBefore) / 10000000.0;
Console.WriteLine("Execution time for ReadRandomLineEfficiently: {0} seconds", latency);
Console.WriteLine();
}
}
}
Random Line: eumolpus
Execution time for ReadRandomLineNonEfficiently: 1.4310818 seconds
Random Line: dalmatic
Execution time for ReadRandomLineEfficiently: 0.5110293 seconds
And here are the two implementations:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace ReadRandomLine {
class Program {
static void Main(string[] args) {
if (args.Length != 1) {
Console.WriteLine("Usage: ReadRandomLine.exe <file name>");
return;
}
ReadRandomLineNonEfficiently(args[0]);
ReadRandomLineEfficiently(args[0]);
}
static void ReadRandomLineNonEfficiently(string fileName) {
if (String.IsNullOrEmpty(fileName) || !File.Exists(fileName)) {
return;
}
long ticksBefore = DateTime.Now.Ticks;
string retVal = "";
FileInfo fi = new FileInfo(fileName);
StreamReader sr = fi.OpenText();
int numberOfEntries = 0;
while (!sr.EndOfStream) {
sr.ReadLine();
numberOfEntries++;
}
if (sr != null) {
sr.Close();
}
sr = fi.OpenText();
int index = 0;
string[] lines = new string[numberOfEntries];
while (!sr.EndOfStream) {
lines[index++] = sr.ReadLine();
}
if (sr != null) {
sr.Close();
}
Random rd = new Random();
retVal = lines[rd.Next(0, lines.Length)];
Console.WriteLine("Random Line: {0}", retVal);
long ticksAfter = DateTime.Now.Ticks;
double latency = (ticksAfter - ticksBefore) / 10000000.0;
Console.WriteLine("Execution time for ReadRandomLineNonEfficiently: {0} seconds", latency);
Console.WriteLine();
}
static void ReadRandomLineEfficiently(string fileName) {
if (String.IsNullOrEmpty(fileName) || !File.Exists(fileName)) {
return;
}
long ticksBefore = DateTime.Now.Ticks;
string retVal = "";
FileInfo fi = new FileInfo(fileName);
StreamReader sr = fi.OpenText();
Random rd = new Random();
int index = 0;
while (!sr.EndOfStream) {
string line = sr.ReadLine();
if (rd.Next(0, ++index) == 0) {
retVal = line;
}
}
if (sr != null) {
sr.Close();
}
Console.WriteLine("Random Line: {0}", retVal);
long ticksAfter = DateTime.Now.Ticks;
double latency = (ticksAfter - ticksBefore) / 10000000.0;
Console.WriteLine("Execution time for ReadRandomLineEfficiently: {0} seconds", latency);
Console.WriteLine();
}
}
}
This comment has been removed by the author.
ReplyDeleteThanks for this beautiful problem and its solution, which is an instance of the more general "Reservoir sampling" (http://en.wikipedia.org/wiki/Reservoir_sampling), which is used very frequently for analyzing big data, huge logs, etc. What's interesting is that the fact that this solution is correct is not very obvious to most of the people, but your reasoning can be generalized from 1 to k and it leaves no doubts about its correctness.
ReplyDeleteIndeed, the reservoir sampling is very neat, a great extrapolation from one sample to many. When i wrote the algorithm it "seemed" correct although when i tried to sketch a mathematical proof of its correctness i got lost in the statistical calculations.
ReplyDelete