MR Global Sort with Multiple Reducers

When the data volume is large, a global sort through a single reducer is clearly inefficient, so multiple reducers can be used instead.

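Bucket the records in the map phase. The bucketing scheme is up to you, but each bucket must cover a contiguous key range so that the reducers' outputs are ordered relative to one another.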
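The two-way split used in this post (shifted keys up to 10050 go to bucket 0, the rest to bucket 1) can be generalized to any number of equal-width ranges. The following is a minimal sketch, not part of the original job: the function name `partition_for` and its parameters are hypothetical, and the key range is assumed to be known in advance.

```python
def partition_for(key, low, high, num_partitions):
    """Map an integer key in [low, high] to a bucket id in [0, num_partitions).

    Buckets are contiguous, equal-width ranges, so a smaller key never
    lands in a higher-numbered bucket than a larger key.
    """
    if not low <= key <= high:
        raise ValueError("key %d outside [%d, %d]" % (key, low, high))
    span = high - low + 1
    return (key - low) * num_partitions // span


# With two buckets over 10000..10100 this reproduces the split at 10050.
for k in (10000, 10050, 10051, 10100):
    print("%d -> bucket %d" % (k, partition_for(k, 10000, 10100, 2)))
```

This only computes the bucket id. Which reducer a given bucket id is sent to is decided by the partitioner, so for reducer counts other than the one used here it is worth checking that the output files still follow bucket order.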

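map_sort.py (mapper):

#!/usr/bin/python
import sys

# Offset added to every key so that all keys have the same number of digits.
base_count = 10000

try:
    for line in sys.stdin:
        ss = line.strip().split('\t')
        key = ss[0]
        val = ss[1]
        new_key = base_count + int(key)
        # Keys up to the midpoint of the shifted range 10000..10100 go to
        # partition 0, the rest to partition 1.
        partition_id = 1
        if new_key <= (10000 + 10100) / 2:
            partition_id = 0
        # Output: partition id, shifted key, value.
        print("%s\t%s\t%s" % (partition_id, new_key, val))
except Exception:
    # Report on stderr so the message does not end up in the map output.
    sys.stderr.write("map error\n")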
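The reason for adding `base_count` is that Hadoop Streaming compares keys as text by default, so numeric keys of different lengths would not sort in numeric order. The short sketch below, with made-up sample keys, illustrates the difference; it is an illustration only and not part of the job.

```python
raw_keys = [9, 10, 100, 2]  # hypothetical sample keys

# Compared as text, "10" and "100" sort before "2" and "9".
as_text = sorted(str(k) for k in raw_keys)
print(as_text)

# After adding the offset every key has five digits, so text order
# and numeric order agree.
offset = 10000
shifted = sorted(str(offset + k) for k in raw_keys)
print(shifted)
```

Zero-padding the keys to a fixed width would have the same effect; the offset is simply the approach this post uses.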

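red_sort.py (reducer):

#!/usr/bin/python
import sys

try:
    for line in sys.stdin:
        # Records arrive already sorted by key; drop the partition id.
        partition_id, key, val = line.strip().split('\t')
        print('\t'.join([key, val]))
except Exception:
    # Report on stderr so the message does not end up in the reduce output.
    sys.stderr.write("reduce error\n")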

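Job submission script:

set -e -x

hadoop_cmd="/usr/local/src/hadoop-2.6.5/bin/hadoop"
stream_jar_path="/usr/local/src/hadoop-2.6.5/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar"

input_file_path_a="/test/mr_allsort_reducebymany/a.txt"
input_file_path_b="/test/mr_allsort_reducebymany/b.txt"
output_path="/test/mr_allsort_reducebymany/result"

# Remove any previous output; -f keeps "set -e" from aborting the script
# when the directory does not exist yet.
$hadoop_cmd fs -rm -r -f -skipTrash $output_path

# The comments are kept outside the command because a comment line between
# backslash continuations would cut the command short.
# mapred.reduce.tasks=2: run two reduce tasks.
# stream.num.map.output.key.fields=2: the key/value split falls after the
#   second field, so the first two fields form the key and the rest the value.
# num.key.fields.for.partition=1: after splitting the key on the separator,
#   the first field is used as the partition field.
$hadoop_cmd jar $stream_jar_path \
    -input $input_file_path_a,$input_file_path_b \
    -output $output_path \
    -mapper "python map_sort.py" \
    -reducer "python red_sort.py" \
    -file ./map_sort.py \
    -file ./red_sort.py \
    -jobconf mapred.reduce.tasks=2 \
    -jobconf stream.num.map.output.key.fields=2 \
    -jobconf num.key.fields.for.partition=1 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner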

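num.key.fields.for.partition sets how many leading fields of the key are used for partitioning.

org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner is the partitioner to use when only some fields of the key, rather than the whole key, should determine the partition.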
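To make the two settings concrete, the sketch below models how one map output line is divided when the key has two fields and the partition is chosen from the first of them. It is a simplified Python model of the configuration, not Hadoop's own code; the function name `split_streaming_record` and the sample line are hypothetical.

```python
def split_streaming_record(line, num_key_fields=2, num_partition_fields=1, sep="\t"):
    """Split a map output line into (key, value, partition_part).

    num_key_fields models stream.num.map.output.key.fields;
    num_partition_fields models num.key.fields.for.partition.
    """
    fields = line.rstrip("\n").split(sep)
    key = sep.join(fields[:num_key_fields])
    value = sep.join(fields[num_key_fields:])
    partition_part = sep.join(fields[:num_partition_fields])
    return key, value, partition_part


key, value, partition_part = split_streaming_record("0\t10001\thadoop")
print(repr(key), repr(value), repr(partition_part))
```

The full key (partition id plus shifted key) is what the framework sorts on, while only `partition_part` decides which reducer receives the record.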

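Result: because two reduce tasks are defined, two result files are generated. Each file is sorted internally and covers one contiguous key range, so reading the files in partition order gives the globally sorted data.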
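The whole flow can be checked locally before submitting a job. The sketch below imitates the map step, the split into two buckets, and the per-reducer sort in plain Python. It assumes that bucket i is handled by reducer i and that keys are compared as text; the helper names and the sample records are made up for illustration.

```python
def run_map(lines, base_count=10000, threshold=10050):
    """Mimic the mapper: emit (partition_id, shifted_key, value)."""
    out = []
    for line in lines:
        key, val = line.strip().split("\t")[:2]
        new_key = base_count + int(key)
        partition_id = 0 if new_key <= threshold else 1
        out.append((partition_id, str(new_key), val))
    return out


def run_job(lines, num_reducers=2):
    """Group map output by bucket, sort each bucket by key, drop the bucket id."""
    buckets = [[] for _ in range(num_reducers)]
    for partition_id, key, val in run_map(lines):
        # Assumption: bucket i is processed by reducer i.
        buckets[partition_id].append((key, val))
    return [
        ["\t".join(rec) for rec in sorted(bucket, key=lambda rec: rec[0])]
        for bucket in buckets
    ]


sample = ["73\tc", "5\ta", "51\tb", "100\td", "50\te"]  # hypothetical input
parts = run_job(sample)
merged = [line for part in parts for line in part]
for i, part in enumerate(parts):
    print("part %d: %r" % (i, part))
```

If `merged` is in sorted order, the bucketing preserves global order for that input.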
