External Sorting – A novel implementation

Harish G., Pooja P., Priyanka S., Ramya Vastrad, Shivapriya B.
Department of Computer Science and Engineering, R. V. College of Engineering, Bangalore, India.
[email protected], {pooja.prakashrao, spriyanka.33, ramyavastrad, shivpri.b}@gmail.com

Abstract— When an ordered data set is needed and its size far exceeds the memory available for sorting, external sorting becomes a viable solution. Here we implement the sorting process in three stages: splitting, sorting and merging. Sorting is done using heap sort, and merging combines the widely known k-way and multistep merging mechanisms. Java threads enable us to achieve the required speedup and to balance the computation (sorting) against the I/O performed throughout. We also analyse the time taken for sorting across a range of memory capacities and file sizes.

Keywords— external sorting, heap sort, huge data sets, k-way merging

I. INTRODUCTION

Sorting is one of the oldest problems in algorithmic study. Though memory sizes and processing speeds have improved greatly, the data that needs to be sorted has also grown perceptibly over the years, and handling huge amounts of data is the order of the day. Data sets of hundreds of terabytes (TB) are no longer considered remarkable. Handling data involves adding new records to existing data, and searching, deleting and modifying existing data. Searching for a key value, in the most trivial case, involves going through all records of the file till the key is found. Widely used and efficient searching algorithms include binary search and interpolation search. The only prerequisite for these algorithms is that the input data be ordered. Ordering, for all practical purposes, is known as sorting. Thus, all these tasks are greatly simplified by sorting the data under consideration.

There are many in-memory sorting algorithms that can be used; but when the data is comparable to or greater in size than the memory, loading the entire data set into memory and sorting it is not feasible. Since disks with sizes of 1 TB or more are common these days, the data can be kept on disk and sorted chunk by chunk by reading it into main memory. External sorting [3] is a sorting mechanism using external memory, say, the hard disk. An approach to this is as follows:
• Split the file to be sorted into smaller chunks that can be loaded into memory.
• Sort the chunks in memory by applying any of the numerous sorting algorithms.
• Write the sorted chunks back to disk.
• Merge the sorted chunks into a single file.

At first the input file is split into a predefined number of chunk files. In the next step, the individual chunks are sorted using a feasible sorting algorithm. Finally, the sorted chunks are combined in a multistep manner, merging some k chunks at a time. This results in a sorted file.

II. PRACTICALITIES

The first and foremost requirement is that the file has the same content before and after the external sort: no extra records may be introduced, and no existing data may be lost. Specifically, the MD5 hash of the file prior to and after the sorting should be the same. In addition, the implementation must support the following features:
• Handle variable-sized data records.
• Handle data irrespective of the capacity of RAM.
• Support different values of k in k-way merging and different numbers of chunks to be created.

We narrowed our scope to ASCII-formatted data records because ASCII is a widely used data format. The data file used was 1.49 gigabytes (GB) in size and contained ASCII records delimited by the newline character (\n).

Many sorting algorithms were candidates for sorting the data chunks in main memory. After analysing a few of them, we shortlisted merge sort, quick sort and heap sort. Since merge sort and quick sort are both recursive in nature, too many nested function calls can cause a stack overflow, which is undesirable in the given scenario [4]. Heap sort, by contrast, needs no recursive function calls and so does not risk stack overflow; hence it is our choice [5].
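To illustrate how a chunk can be sorted in memory without recursion, the following is a minimal sketch of an iterative heap sort over an array of line records. It is not the implementation used in this work; the class and method names (HeapSortSketch, siftDown, heapSort) and the sample records are assumptions made for illustration, and records are compared with plain String ordering.

```java
import java.util.Arrays;

class HeapSortSketch {
    // Restore the max-heap property for the subtree rooted at `root`,
    // considering only indices below `size`. Written as a loop, so the
    // call stack does not grow with the size of the chunk.
    static void siftDown(String[] a, int root, int size) {
        while (true) {
            int child = 2 * root + 1;              // left child
            if (child >= size) return;             // root is a leaf
            if (child + 1 < size && a[child + 1].compareTo(a[child]) > 0) {
                child++;                           // right child is larger
            }
            if (a[root].compareTo(a[child]) >= 0) return; // heap property holds
            String tmp = a[root];
            a[root] = a[child];
            a[child] = tmp;
            root = child;                          // continue one level down
        }
    }

    // Sort the records in place in ascending order.
    static void heapSort(String[] a) {
        int n = a.length;
        // Heapify: build a max-heap bottom-up.
        for (int i = n / 2 - 1; i >= 0; i--) {
            siftDown(a, i, n);
        }
        // Repeatedly move the current maximum to the end of the array.
        for (int end = n - 1; end > 0; end--) {
            String tmp = a[0];
            a[0] = a[end];
            a[end] = tmp;
            siftDown(a, 0, end);
        }
    }

    public static void main(String[] args) {
        String[] records = {"pear", "apple", "fig", "apple", "kiwi"};
        heapSort(records);
        System.out.println(Arrays.toString(records)); // [apple, apple, fig, kiwi, pear]
    }
}
```

Duplicate records are kept, so the sorted chunk holds exactly the records that were read.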

III. IMPLEMENTATION

The external sort has two major modules:
• Split and sort [2]
• Merge

The implementation is in the Java programming language. The pseudocode for the two modules is given below:


Split and sort:
1. Read the filename, no_of_chunks, k_way
2. Calculate size_of_each_chunk = file_size / no_of_chunks
3. Spawn a "read" thread (producer) and a "sort" thread (consumer)
4. read thread:
     while (!end_of_file) do
       length_read_so_far ← 0
       while (!end_of_file && length_read_so_far < size_of_each_chunk) do
         data[] ← read_a_line
         length_read_so_far ← length_read_so_far + size_of_the_read_line
       end of while
       notify "sort" thread
       wait for "sort" thread to finish the chunk
     end of while
5. sort thread:
     wait for the read thread to finish reading a chunk into memory
     after obtaining a notification from the read thread, sort the chunk using Heap Sort
     write the sorted chunk to disk and notify the read thread

Merge:
1. merge:
     if (no_of_chunks == 1) return
     for (j ← no_of_chunks; j > 1; j ← ceil(j / k_way)) do
       for i ← 0 to ceil(j / k_way) - 1 do
         starting_chunk_no ← (k_way * i) + 1
         sort_and_merge(starting_chunk_no)
       end
     end
2. sort_and_merge(starting_chunk_no):
     open chunk[1], chunk[2], …, chunk[K], the K chunks beginning at starting_chunk_no
     while any chunk has items left do
       minItem ← index of the chunk whose current item is minimum
       write the current item of chunk[minItem] to merged_file
       get next item from chunk[minItem]
     end
     output merged_file
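The split-and-sort module can be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not the code used in this work: a BlockingQueue of capacity 1 replaces explicit wait/notify calls, Collections.sort stands in for heap sort, a StringReader stands in for the input file, and sorted chunks are collected in a list instead of being written to disk. The names SplitAndSortSketch, splitAndSort, handoff and END are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class SplitAndSortSketch {
    // Sentinel object telling the sort thread that no more chunks will arrive.
    private static final List<String> END = new ArrayList<>();

    static List<List<String>> splitAndSort(BufferedReader in, int chunkChars) {
        // Capacity 1: the reader blocks until the sorter has taken the previous chunk.
        BlockingQueue<List<String>> handoff = new ArrayBlockingQueue<>(1);
        List<List<String>> sortedChunks = new ArrayList<>();

        // Producer: read lines until roughly chunkChars characters are collected.
        Thread reader = new Thread(() -> {
            try {
                try {
                    List<String> chunk = new ArrayList<>();
                    int lengthRead = 0;
                    String line;
                    while ((line = in.readLine()) != null) {
                        chunk.add(line);
                        lengthRead += line.length() + 1; // +1 for the newline
                        if (lengthRead >= chunkChars) {
                            handoff.put(chunk);          // hand over a full chunk
                            chunk = new ArrayList<>();
                            lengthRead = 0;
                        }
                    }
                    if (!chunk.isEmpty()) handoff.put(chunk); // last, partial chunk
                } finally {
                    handoff.put(END);                    // always release the sorter
                }
            } catch (IOException | InterruptedException e) {
                e.printStackTrace(); // a real implementation would report this to the caller
            }
        });

        // Consumer: take each chunk, sort it, and store the result.
        Thread sorter = new Thread(() -> {
            try {
                List<String> chunk;
                while ((chunk = handoff.take()) != END) {
                    Collections.sort(chunk);             // stand-in for heap sort
                    sortedChunks.add(chunk);             // stand-in for writing to disk
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        sorter.start();
        try {
            reader.join();
            sorter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sortedChunks;
    }

    public static void main(String[] args) {
        String data = "pear\napple\nfig\nkiwi\nbanana\n";
        List<List<String>> chunks =
                splitAndSort(new BufferedReader(new StringReader(data)), 10);
        System.out.println(chunks); // [[apple, pear], [banana, fig, kiwi]]
    }
}
```

With a queue capacity of 1 the reader may load the next chunk while the sorter is still working on the current one, which is one way of overlapping I/O with computation.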

In the layout of this implementation, we use the producer-consumer design pattern, where a producer supplies (produces) data that is utilized (consumed) by the consumer. This design applies when a synchronization mechanism is required between the entity producing data and the entity consuming it. The consumer waits for the data till it is produced, and the producer waits for the consumer to finish its processing before producing new data. Each stays in the wait state till the other notifies it to proceed.

In the first module, the producer and consumer are threads which run in parallel to the main thread. The producer reads data from the disk and, once the amount read equals the calculated chunk size, notifies the consumer thread. In turn, the consumer thread reads the supplied data and heapifies it, i.e. constructs a heap structure. Heap sort is then applied to obtain an ordered set of data. These ordered data records are written to a new file on disk. Once this is done, the producer is awoken. The cycle continues till all the chunks of data on the disk have been read, sorted and written back to disk. The use of threads helps to overcome the I/O bottleneck due to disk reads and writes.

In the next module, the sorting and merging tasks are carried out simultaneously using the multistep and k-way [1] merge mechanisms. A pictorial representation of the same is in Fig. 1.

Fig 1. Multistep k-way merging mechanism

From the disk, k files are opened concurrently and their contents are compared record by record to produce an ordered list of records. So at the end of the first k-way merge, we obtain 1 file from k files. The same process is carried on till a single sorted file results. In this way, two existing merging mechanisms are consolidated in one merge module to get an efficient merge technique.

IV. RESULTS AND ANALYSIS

The sorting has been run for different configurations:
• Varying RAM size: 512 MB and 3 GB
• Varying input file size: 100 MB and 1.49 GB
• Varying number of intermediate file chunks after splitting: 16, 32, 128, 256, 512 and 1024 chunks of the input file
• Varying the k value used in the merging process: 8, 16, 32 and 128

The results are noted based on two factors:
• The time taken while keeping k constant and varying the number of chunks.
• The minimum time taken among all the k-way configurations.

The observations are as follows:
• From the graph in Fig. 2, for a memory of 512 MB, a file size of 100 MB and a varying number of chunks, the time taken for sorting is minimum for number of chunks = 128.
• From the graph in Fig. 2, for a memory of 512 MB, a file size of 100 MB and different k values, the time taken for sorting is minimum for k = 16, 32 and 128.


Fig 2. Time (in min) vs. number of chunks for 512 MB RAM and 100 MB file

Fig 3. Time (in min) vs. number of chunks for 512 MB RAM and 1.49 GB file

Fig 4. Time (in min) vs. number of chunks for 3 GB RAM and 100 MB file

Fig 5. Time (in min) vs. number of chunks for 3 GB RAM and 1.49 GB file

• From the graph in Fig. 3, for a memory of 512 MB, a file size of 1.49 GB and a varying number of chunks, the time taken for sorting is minimum for number of chunks = 512.
• From the graph in Fig. 3, for a memory of 512 MB, a file size of 1.49 GB and different k values, the time taken for sorting is minimum for k = 32.
• From the graph in Fig. 4, for a memory of 3 GB, a file size of 100 MB and a varying number of chunks, the time taken for sorting is minimum for number of chunks = 32.
• From the graph in Fig. 4, for a memory of 3 GB, a file size of 100 MB and different k values, the time taken for sorting is minimum for k = 32.
• From the graph in Fig. 5, for a memory of 3 GB, a file size of 1.49 GB and a varying number of chunks, the time taken for sorting is minimum for number of chunks = 128.
• From the graph in Fig. 5, for a memory of 3 GB, a file size of 1.49 GB and different k values, the time taken for sorting is minimum for k = 16.

From all the above plots, we deduce that, irrespective of the file size and RAM capacity, the sorting performs best for intermediate values of the number of chunks. There are aberrations at either extreme of the plots. When the amount of RAM, k and the number of chunks are kept constant, the time taken for sorting is directly proportional to the file size. As the number of chunks increases, the I/O performed in the splitting phase increases and so does the number of times merging is performed, causing a steep increase in the time taken. For intermediate values of the number of chunks, a balance is struck between the I/O performed and the number of times the multistep merge is executed, resulting in optimum performance.
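To make the link between the number of chunks, k and the number of merge steps concrete, the following is a minimal sketch of a multistep k-way merge that also counts its passes. It is an illustration under assumptions rather than the measured implementation: in-memory lists stand in for chunk files, a java.util.PriorityQueue selects the minimum record, and the names MultistepMergeSketch, Head, mergeGroup, multistepMerge and mergePasses are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class MultistepMergeSketch {
    // Current head record of one chunk, plus the iterator over the rest of it.
    static final class Head implements Comparable<Head> {
        final String record;
        final Iterator<String> rest;

        Head(String record, Iterator<String> rest) {
            this.record = record;
            this.rest = rest;
        }

        @Override
        public int compareTo(Head other) {
            return record.compareTo(other.record);
        }
    }

    // One k-way merge: combine a group of sorted chunks into one sorted chunk.
    static List<String> mergeGroup(List<List<String>> group) {
        PriorityQueue<Head> heap = new PriorityQueue<>();
        for (List<String> chunk : group) {
            Iterator<String> it = chunk.iterator();
            if (it.hasNext()) heap.add(new Head(it.next(), it));
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head min = heap.poll();
            merged.add(min.record);
            // Advance only the chunk that supplied the minimum, so duplicates survive.
            if (min.rest.hasNext()) heap.add(new Head(min.rest.next(), min.rest));
        }
        return merged;
    }

    // Multistep merge: merge groups of at most k chunks per pass until one remains.
    static List<String> multistepMerge(List<List<String>> chunks, int k) {
        if (k < 2) throw new IllegalArgumentException("k must be at least 2");
        List<List<String>> current = chunks;
        while (current.size() > 1) {
            List<List<String>> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += k) {
                int end = Math.min(i + k, current.size());
                next.add(mergeGroup(current.subList(i, end)));
            }
            current = next;
        }
        return current.isEmpty() ? new ArrayList<>() : current.get(0);
    }

    // Number of passes the loop above performs for a given chunk count and k.
    static int mergePasses(int chunks, int k) {
        if (k < 2) throw new IllegalArgumentException("k must be at least 2");
        int passes = 0;
        while (chunks > 1) {
            chunks = (chunks + k - 1) / k; // ceiling division
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        List<List<String>> chunks = Arrays.asList(
                Arrays.asList("apple", "pear"),
                Arrays.asList("banana", "fig", "kiwi"),
                Arrays.asList("apple", "cherry"));
        System.out.println(multistepMerge(chunks, 2));
        // [apple, apple, banana, cherry, fig, kiwi, pear]
        for (int k : new int[] {8, 16, 32, 128}) {
            System.out.println("1024 chunks, k=" + k + ": " + mergePasses(1024, k) + " passes");
        }
        // 4, 3, 2 and 2 passes respectively
    }
}
```

Under this model, 1024 chunks need four passes with k = 8 but only two with k = 32, and every pass reads and rewrites the whole data set, which is consistent with the steep rise in time observed for large chunk counts.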


V. CONCLUSION

This paper introduces a different implementation of external sorting. It applies the producer-consumer design pattern, in the form of threads, to an algorithmic system that sorts massive data sets. The two distinct mechanisms of k-way and multistep merging are combined to obtain a better speedup during the merge phase of the sorting process.

REFERENCES

[1] William A. Greene. k-way Merging and k-ary Sorts. Proceedings of the 31st Annual ACM Southeast Conference, April 14-16, 1993, Birmingham, AL, pp. 127-135.
[2] Per-Åke Larson. External Sorting: Run Formation Revisited. IEEE Transactions on Knowledge and Data Engineering, 15(4): 961-972, 2003.
[3] Michael J. Folk, Bill Zoellick. File Structures, 3rd edition. Pearson Education, December 1997.
[4] Colin M. Davidson. Quicksort Revisited. IEEE Transactions on Software Engineering, Vol. 14, No. 10, October 1988.
[5] Lutz M. Wegner and Jukka I. Teuhola. The External Heapsort. IEEE Transactions on Software Engineering, Vol. 15, No. 7, July 1989.

