External Sorting – A novel implementation Harish G., Pooja P., Priyanka S., Ramya Vastrad, Shivapriya B. Department of Computer Science and Engineering, R. V. College of Engineering, Bangalore, India.
[email protected], {pooja.prakashrao, spriyanka.33, ramyavastrad, shivpri.b}@gmail.com
Abstract— When an ordered data set is needed and the data far exceeds the memory available to sort it, external sorting becomes a viable solution. Here we implement the sorting process in three stages, namely: splitting, sorting and merging. Sorting is done using heap sort, and merging combines the widely known k-way and multistep merging mechanisms. Java threads enable us to achieve the required speedup and strike a middle ground between the sorting (computation) and the I/O being performed throughout. We also analyse the time taken for sorting across a range of memory capacities and file sizes.
Keywords— external sorting, heap sort, huge data sets, k-way merging

I. INTRODUCTION
Sorting is one of the oldest problems in algorithmic study. Though there have been great improvements in memory sizes and processing speeds, the data that needs to be sorted has also been growing perceptibly over the years, and handling huge amounts of data is the order of the day. Hundreds of terabytes (TB) of data are considered a trifling matter. Handling of data involves adding new records to existing data, and searching, deleting and modifying existing data. Searching for a key value, in the most trivial case, involves going through all records of the file till the key is found. Some of the most widely used and efficient searching algorithms include binary search and interpolation search. The only prerequisite for these algorithms is that the input data must be ordered. Ordering, for all practical purposes, is known as sorting. Thus, all these tasks are greatly simplified by sorting the data in consideration. There are many in-memory sorting algorithms that can be used; but when the data itself is comparable to or greater in size than the memory, loading the entire data into memory and sorting it is out of the question. Since disks with sizes of 1 terabyte (TB) or more are common these days, the data can be sorted by keeping it on disk and sorting it chunk by chunk, reading each chunk into main memory. External sorting [3] is a sorting mechanism that uses external memory, say, the hard disk. One approach is as follows:
• Split the file to be sorted into smaller chunks that can be loaded into memory.
• Sort the chunks in memory by applying any of the numerous sorting algorithms.
• Write the sorted chunks back to disk.
• Merge the sorted chunks into a single file.
At first the input file is split into a predefined number of chunk files. In the next step, the individual chunks are sorted using a feasible sorting algorithm. Finally, the sorted chunks are combined in a multistep manner by merging some k chunks at a time. This results in a sorted file.

II. PRACTICALITIES
The first and foremost required feature is that the file, before and after external sort, is the same content-wise. Neither should extra records be introduced into the file, nor should there be any loss of existing data. Specifically, the MD5 hash of the file prior to and after the sorting should be the same. In addition, the implementation must support the following features:
• Handle variable-sized data records.
• Handle data irrespective of the capacity of RAM.
• Support different values of k in k-way merging and different numbers of chunks to be created.
We narrowed our scope to ASCII-formatted data records because ASCII is the most widely used data format. The data file used was 1.49 gigabytes (GB) in size, containing ASCII records delimited by the newline character (\n). There were several contenders for the sorting algorithm to be applied to the data chunks in main memory. After analysing a few sorting algorithms, we shortlisted merge sort, quick sort and heap sort. Since merge sort and quick sort are both recursive in nature, too many function calls can cause stack overflow, which is undesirable in the given scenario [4]. On the contrary, heap sort has no such shortcomings, neither stack overflow nor recursive function calls, and hence is our choice [5].
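The in-memory chunk sort chosen above can be sketched as follows. This is a minimal heap sort over String records; the class and method names are our own illustration, not code from the paper's implementation:

```java
import java.util.Arrays;

// Minimal heap-sort sketch for one in-memory chunk of records.
// Records are compared in plain lexicographic String order.
public class ChunkHeapSort {

    public static void sort(String[] a) {
        int n = a.length;
        // Build a max-heap bottom-up.
        for (int i = n / 2 - 1; i >= 0; i--) {
            siftDown(a, i, n);
        }
        // Repeatedly move the current maximum to the end of the array.
        for (int end = n - 1; end > 0; end--) {
            swap(a, 0, end);
            siftDown(a, 0, end);
        }
    }

    private static void siftDown(String[] a, int root, int size) {
        while (2 * root + 1 < size) {
            int child = 2 * root + 1;
            if (child + 1 < size && a[child + 1].compareTo(a[child]) > 0) {
                child++;                        // pick the larger child
            }
            if (a[root].compareTo(a[child]) >= 0) {
                return;                         // heap property holds
            }
            swap(a, root, child);
            root = child;
        }
    }

    private static void swap(String[] a, int i, int j) {
        String t = a[i]; a[i] = a[j]; a[j] = t;
    }

    public static void main(String[] args) {
        String[] chunk = {"pear", "apple", "orange", "banana"};
        sort(chunk);
        System.out.println(Arrays.toString(chunk)); // [apple, banana, orange, pear]
    }
}
```

Note that the sort is iterative throughout: siftDown is a loop, not a recursive call, which is precisely the property that motivated the choice of heap sort over merge sort and quick sort.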
III. IMPLEMENTATION
The external sort has two major modules, namely:
• Split and sort [2]
• Merge
The implementation is in the Java programming language. The pseudocode for the two modules is as below:
Split and sort:
1. Read the filename, no_of_chunks, k_way
2. Calculate size_of_each_chunk = file_size / no_of_chunks
3. Spawn a "read" thread (producer) and a "sort" thread (consumer)
4. read thread:
   while (!end_of_file) do
       while (!end_of_file && length_read_so_far != size_of_each_chunk) do
           data[] ← read_a_line
           length_read_so_far ← length_read_so_far + size_of_the_read_line
       end while
       notify "sort" thread
   end while
5. sort thread:
   wait for the read thread to finish reading a chunk into memory
   on notification from the read thread, sort the chunk using heap sort

Merge:
1. merge:
   for (j ← no_of_chunks; j > 1; j ← j / k_way) do
       if (no_of_chunks == 1) return
       for i ← 0 to (no_of_chunks / k_way) do
           starting_chunk_no ← (k_way * i) + 1
           sort_and_merge(starting_chunk_no)
       end for
   end for
2. sort_and_merge(starting_chunk_no):
   while any of chunk[1], chunk[2], …, chunk[K] has items remaining do
       minItem ← index of the chunk with the minimum current item
       write item[minItem] to merged_file
       advance chunk[minItem] to its next item
   end while
   output merged_file
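The hand-off between the read and sort threads in the split-and-sort module can be sketched with Java's wait()/notifyAll(). This is a minimal single-slot hand-off under our own naming; it is an illustration of the pattern, not the paper's actual code:

```java
// Sketch of the producer-consumer hand-off between the "read" thread
// and the "sort" thread. One chunk is passed at a time; the producer
// blocks until the previous chunk has been taken by the consumer.
public class SplitAndSort {
    private String[] chunk;          // the chunk currently handed over
    private boolean ready = false;   // true when a chunk awaits sorting
    private boolean done = false;    // true when the reader has finished

    // Producer side: hand one chunk to the sorter.
    public synchronized void produce(String[] records) throws InterruptedException {
        while (ready) wait();        // wait until the previous chunk is consumed
        chunk = records;
        ready = true;
        notifyAll();                 // wake the sort thread
    }

    // Producer side: signal that the whole file has been read.
    public synchronized void finish() {
        done = true;
        notifyAll();
    }

    // Consumer side: take the next chunk, or null when reading is over.
    public synchronized String[] consume() throws InterruptedException {
        while (!ready && !done) wait();
        if (!ready) return null;     // reader finished, nothing pending
        String[] records = chunk;
        ready = false;
        notifyAll();                 // wake the read thread
        return records;
    }
}
```

A sort thread would then loop on consume(), heap-sort each returned chunk, and write it to its own chunk file; the wait loops guard against spurious wakeups as required by the Java memory model.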
The use of threads helps to overcome the I/O bottleneck due to disk reads and writes. In the next module, the sorting and merging tasks are carried out simultaneously using the multistep and k-way [1] merge mechanisms. A pictorial representation of the same is in Fig. 1.
Fig 1. Multistep k-way merging mechanism
From the disk, k files are opened concurrently, their contents are compared record by record, and an ordered list of records is produced. So at the end of the first k-way merge, we obtain one file from k files. The same process is carried on till a single sorted file results. In this way, two existing merging mechanisms are consolidated into one merge module to get an efficient merge technique.
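One k-way merge step of the kind shown in Fig. 1 can be sketched with a priority queue over the open chunk files. This is a minimal sketch under our own naming, with simplified error handling; it illustrates the mechanism rather than reproducing the paper's code:

```java
import java.io.*;
import java.util.*;

// Sketch of one k-way merge step: k sorted chunk files are opened
// together and their records merged into one sorted output file.
// The priority queue keeps the smallest current record at its head.
public class KWayMerge {

    // One open chunk together with its current (smallest unconsumed) record.
    private static final class Head {
        final BufferedReader in;
        String record;
        Head(BufferedReader in, String record) { this.in = in; this.record = record; }
    }

    public static void merge(List<String> chunkFiles, String outFile) throws IOException {
        PriorityQueue<Head> heap =
            new PriorityQueue<>(Comparator.comparing((Head h) -> h.record));
        for (String f : chunkFiles) {
            BufferedReader r = new BufferedReader(new FileReader(f));
            String first = r.readLine();
            if (first != null) heap.add(new Head(r, first));
            else r.close();                    // empty chunk
        }
        try (BufferedWriter out = new BufferedWriter(new FileWriter(outFile))) {
            while (!heap.isEmpty()) {
                Head h = heap.poll();          // chunk holding the minimum record
                out.write(h.record);
                out.newLine();
                String next = h.in.readLine(); // advance only that chunk
                if (next != null) { h.record = next; heap.add(h); }
                else h.in.close();
            }
        }
    }
}
```

With the heap, selecting the minimum among k open chunks costs O(log k) per record instead of the O(k) linear scan implied by a naive comparison of all chunk heads.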
In the layout of this implementation, we use the producer-consumer design pattern, where a producer supplies (produces) data that is utilized (consumed) by the consumer. This design comes into play when a synchronization mechanism is required between the entity producing data and the entity consuming it. The consumer waits for the data till it is produced, and the producer waits for the consumer to finish its processing before producing new data. Both the producer and the consumer remain in the wait state till the other notifies them to proceed. In the first module, the producer and consumer are threads which run parallel to the main thread. The producer reads data from the disk and, once the amount read equals the calculated chunk size, notifies the consumer thread. In turn, the consumer thread takes the fed data and heapifies it, i.e. constructs a heap structure; heap sort is then applied to obtain an ordered set of records, which are written to a new file on disk. Once this is done, the producer is awoken. The cycle continues till all the chunks of data on the disk have been read, sorted and written back to disk.

IV. RESULTS AND ANALYSIS
The sorting has been run for different configurations:
• Varying RAM size – 512 MB and 3 GB
• Varying input file size – 100 MB and 1.49 GB
• Varying number of intermediate file chunks after splitting – 16, 32, 128, 256, 512 and 1024 chunks
• Varying the k value used in the merging process – 8, 16, 32 and 128.
The results are noted based on two factors:
• the time taken while keeping k constant and varying the number of chunks, and
• the minimum time taken among all the k-way configurations.
• From the graph in Fig. 2, for a memory of 512 MB, a varying number of chunks and a file size of 100 MB, it is observed that the time taken for sorting is minimum for number of chunks = 128.
• From the graph in Fig. 2, for a memory of 512 MB, different k values and a file size of 100 MB, it is observed that the time taken for sorting is minimum for k = 16, 32 and 128.
Fig 2. Time (in min) vs. number of chunks for 512 MB RAM and 100 MB file
Fig 3. Time (in min) vs. number of chunks for 512 MB RAM and 1.49 GB file
Fig 4. Time (in min) vs. number of chunks for 3 GB RAM and 100 MB file
Fig 5. Time (in min) vs. number of chunks for 3 GB RAM and 1.49 GB file
• From the graph in Fig. 3, for a memory of 512 MB, a varying number of chunks and a file size of 1.49 GB, it is observed that the time taken for sorting is minimum for number of chunks = 512.
• From the graph in Fig. 3, for a memory of 512 MB, different k values and a file size of 1.49 GB, it is observed that the time taken for sorting is minimum for k = 32.
• From the graph in Fig. 4, for a memory of 3 GB, a varying number of chunks and a file size of 100 MB, it is observed that the time taken for sorting is minimum for number of chunks = 32.
• From the graph in Fig. 4, for a memory of 3 GB, different k values and a file size of 100 MB, it is observed that the time taken for sorting is minimum for k = 32.
• From the graph in Fig. 5, for a memory of 3 GB, a varying number of chunks and a file size of 1.49 GB, it is observed that the time taken for sorting is minimum for number of chunks = 128.
• From the graph in Fig. 5, for a memory of 3 GB, different k values and a file size of 1.49 GB, it is observed that the time taken for sorting is minimum for k = 16.
From all the above plots, we deduce that irrespective of file size and RAM capacity, the sorting performs best for intermediate values of the number of chunks; there are aberrations at either extreme of the plots. When the amount of RAM, k and the number of chunks are kept constant, the time taken for sorting is directly proportional to the file size. As the number of chunks increases, the I/O performed in the splitting phase increases, and the number of merge steps also increases, causing a steep increase in the time taken. At intermediate values of the number of chunks, a balance is struck between the I/O performed and the number of times the multistep merge is executed, resulting in optimum performance.
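The trade-off can be made concrete: merging N chunks k at a time takes roughly ceil(log_k N) passes, and each pass reads and writes the whole file once, so merge I/O grows with the pass count while split I/O grows with N. A small sketch of this arithmetic (the formula is standard; the helper below is our illustration, not taken from the paper):

```java
public class MergePasses {
    // Number of k-way merge passes needed to reduce n chunks to one:
    // each pass merges groups of up to k chunks into one chunk.
    static int passes(int n, int k) {
        int p = 0;
        while (n > 1) {
            n = (n + k - 1) / k;   // ceil(n / k) chunks survive the pass
            p++;
        }
        return p;
    }

    public static void main(String[] args) {
        // 512 chunks merged 32 at a time: 512 -> 16 -> 1, i.e. 2 passes,
        // while 1024 chunks merged 8 at a time: 1024 -> 128 -> 16 -> 2 -> 1, 4 passes.
        System.out.println(passes(512, 32));   // 2
        System.out.println(passes(1024, 8));   // 4
    }
}
```

This matches the observed behaviour: very large chunk counts add merge passes (and hence full-file I/O), while very small chunk counts make each chunk too large to sort comfortably in memory.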
V. CONCLUSION
This paper introduces a different implementation of external sorting. It applies the producer-consumer design pattern, in the form of threads, to an algorithmic system that sorts massive data sets. The two diverse mechanisms of k-way and multistep merging are combined to obtain a better speedup during the merge phase of the sorting process.
REFERENCES
[1] William A. Greene. k-way Merging and k-ary Sorts. Proceedings of the 31st Annual ACM Southeast Conference, April 14-16, 1993, Birmingham, AL, pp. 127-135.
[2] Per-Åke Larson. External Sorting: Run Formation Revisited. IEEE Transactions on Knowledge and Data Engineering, 15(4): 961-972, 2003.
[3] Michael J. Folk, Bill Zoellick. File Structures, 3rd edition. Pearson Education, December 1997.
[4] Colin M. Davidson. Quicksort Revisited. IEEE Transactions on Software Engineering, Vol. 14, No. 10, October 1988.
[5] Lutz M. Wegner and Jukka I. Teuhola. The External Heapsort. IEEE Transactions on Software Engineering, Vol. 15, No. 7, July 1989.