Split Computation in Hadoop
A simple Hadoop job consists of a map and a reduce task. The map phase is parallelized by grouping sets of input files into splits. A split is a unit of parallelism. Multiple map tasks are instantiated and each of these is assigned a split. Hadoop needs to know the size of input files so that they can grouped into equal sized splits. Often times, input files are spread across many directories. For example, two years of data, organized into hourly directories, results in 17520 directories. If each directory contains 6 files, this makes a grand total of 105,120 files. The sizes of all these files need to be determined for split computation. Hadoop uses a generic filesystem API – that was designed primarily with HDFS in mind. It includes a filesystem implementation on top of S3 which allows higher layers to be agnostic of the source of the data. Map-Reduce calls the generic file listing API against each input directory to get the size of all files in the directory. In our example, this results in 17520 API calls. This is not a big deal in HDFS, but results in very bad performance in S3. Every listing call in S3 involves using a Rest API call and parsing of XML results which has very high overhead and latency. Furthermore, Amazon employs protection mechanisms against high rate of API calls. For certain workloads, split computation becomes a huge bottleneck.
Faster Split computation for S3
To solve this problem, we modified split computation to invoke listing at the level of the parent directory. This call returns all files (and their sizes) in all subdirectories in blocks of 1000. Some subdirectories and files may not be of interest to job/query e.g. partition elimination may be eliminated some of them. We take advantage of the fact that file listing is in lexicographic order and perform a modified merge join of the list of files and list of directories of interest. This allows us to efficiently identify files sizes of interesting files. The modified algorithm results in only 106 API calls (each call returns 1000 files) compared to 17520 API calls in the original implementation. We compared the two approaches using a simple Hive test. In this test, we take a partitioned table T with 15,000 files but vary the number of partitions (a partition corresponds to a directory). We compare the performance of ‘select count(*) from T’. In the extreme case, this optimization shows a speedup of 8x!
hadoop, amazon s3, hadoop distributed file system, hdfs