Hadoop In-Mapper Combiner
Abstract
Hadoop has a traditional Combiner, which aggregates each mapper's emitted key-value pairs after they have been written to the map output buffer. An In-Mapper Combiner basically moves that aggregation into the mapper itself, which cuts the running time since far fewer intermediate pairs get written. However, because state is preserved within the mapper (locally), it incurs a larger memory overhead. To put it simply: the In-Mapper Combiner takes up more space, but in return takes less time.
Most likely programmers will have to tweak other parts of the MapReduce process as well, but this example is a simple Word Length Count, so that isn't required here.
Traditional Combiner
Let…
Input be a simple text (e.g. "The brown fox jumps over the lazy dog")
Output be tuples of (word length, count) (e.g. (3,4), (4,2), (5,2); in the sentence above, "The", "fox", "the", and "dog" all have length 3, hence (3,4), and so on)
The code is quite straightforward, except that I'm passing the Text type instead of IntWritable. It's possible to use only IntWritables instead of Text and IntWritable, but since I was modifying the traditional WordCount Java program, I just kept the Text type. (You can see at "word.set(length+"")" that I'm converting an int to a String so that it can be set as a Text value.) If you want the most optimal code, I suggest passing IntWritables only.
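For reference, here is a minimal sketch of what that job can look like, reconstructed from the standard WordCount example (class names are illustrative, not necessarily the ones in my original code):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordLengthCount {

  public static class LengthMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        int length = itr.nextToken().length();
        // The int-to-Text override mentioned above; IntWritable keys would be cleaner.
        word.set(length + "");
        context.write(word, one);
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word length count");
    job.setJarByClass(WordLengthCount.class);
    job.setMapperClass(LengthMapper.class);
    job.setCombinerClass(SumReducer.class); // the traditional Combiner
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```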
Modifying Traditional Combiner to In-Mapper Combiner (Abstract)
Pseudocode might help here.
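Roughly, it follows the standard in-mapper combining pattern; this is my own sketch, adapted to the word length count:

```
class Mapper
    method setup()
        H <- new AssociativeArray

    method map(docid id, doc d)
        for all word w in doc d do
            H{length(w)} <- H{length(w)} + 1

    method cleanup()
        for all length l in H do
            emit(l, H{l})
```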
You need the cleanup method, or you will run into trouble, because 1. a partial map left over from an earlier task could aggregate the wrong way on the next one, and 2. you want to flush out the accumulated results as the map finishes. I feel like 1 and 2 are pretty much the same thing, but trust me, if you skip the cleanup method you're likely to get bugs.
Modifying Traditional Combiner to In-Mapper Combiner (Code)
As you can see, a HashMap is used to do the local aggregation. Once the local aggregation is finished, the cleanup method walks through the HashMap with an Iterator to flush out the accumulated results. As mentioned, there is the same int-to-Text conversion in a few places, for the reason I stated previously.
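For reference, a minimal sketch of such a mapper, assuming the same Text/IntWritable types as the traditional version (class and field names are my own, not necessarily those in the original code):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperLengthMapper extends Mapper<Object, Text, Text, IntWritable> {
  // Local aggregation state, preserved across map() calls.
  private HashMap<Integer, Integer> counts;

  @Override
  protected void setup(Context context) {
    counts = new HashMap<Integer, Integer>();
  }

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      int length = itr.nextToken().length();
      // Aggregate locally instead of emitting (length, 1) for every word.
      Integer count = counts.get(length);
      counts.put(length, count == null ? 1 : count + 1);
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Flush the accumulated results once the mapper has consumed its whole split.
    Text word = new Text();
    IntWritable total = new IntWritable();
    Iterator<Map.Entry<Integer, Integer>> it = counts.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<Integer, Integer> entry = it.next();
      word.set(entry.getKey() + ""); // same int-to-Text override as before
      total.set(entry.getValue());
      context.write(word, total);
    }
    counts.clear(); // reset state so a reused mapper doesn't aggregate wrongly
  }
}
```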
Final Thoughts
It’s actually possible to see the actual time reduced when you run the In-Mapper Combiner. For a 40MB plain text file in a 1 namenode + 2 datanode distributed environment, the Traditional Combiner process took 47 seconds, while the In-Mapper Combiner process took 29 seconds. Quite an improvement, eh?
However, I’d like to re-highlight that local aggregation carries more memory overhead. If you’ve coded before, you know there’s a balance between space and time: when one improves, the other usually suffers. It’s not a strict 1:1 trade, so with unlimited amounts of both you could do whatever suits your needs, but in real life you will have limits and priorities that you’ll have to weigh to get it just right ;).