Spark RDD: Secondary Grouping and Sorting to Get the Top K

Use Spark to find the top 3 students in each subject, for every class in every department.

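Data format: id,studentid,language,math,english,classid,departmentid, that is: record id, student ID (學號), Chinese (語文), math (數學), foreign language (外語), class (班級), department (院系).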

1,111,68,69,90,1班,經濟系
2,112,73,80,96,1班,經濟系
3,113,90,74,75,1班,經濟系
4,114,89,94,93,1班,經濟系
5,115,99,93,89,1班,經濟系
6,121,96,74,79,2班,經濟系
7,122,89,86,85,2班,經濟系
8,123,70,78,61,2班,經濟系
9,124,76,70,76,2班,經濟系
10,211,89,93,60,1班,外語系
11,212,76,83,75,1班,外語系
12,213,71,94,90,1班,外語系
13,214,94,94,66,1班,外語系
14,215,84,82,73,1班,外語系
15,216,85,74,93,1班,外語系
16,221,77,99,61,2班,外語系
17,222,80,78,96,2班,外語系
18,223,79,74,96,2班,外語系
19,224,75,80,78,2班,外語系
20,225,82,85,63,2班,外語系
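
Each line in this format could be turned into a typed record before any grouping is done. The following is a minimal sketch in plain Scala, assuming a hypothetical case class `Score` and helper `parseLine` (neither name comes from the original post); malformed lines are simply dropped as `None`.

```scala
// Hypothetical record type for illustration; field names follow the header line above.
final case class Score(
    id: Int,
    studentId: String,
    language: Int,
    math: Int,
    english: Int,
    classId: String,
    departmentId: String
)

object ParseSketch {
  // Splits one comma-separated line into its seven fields.
  // Returns None when the field count is wrong or a score is not an integer.
  def parseLine(line: String): Option[Score] =
    line.split(",").map(_.trim) match {
      case Array(id, sid, lang, math, eng, cls, dept) =>
        scala.util.Try(
          Score(id.toInt, sid, lang.toInt, math.toInt, eng.toInt, cls, dept)
        ).toOption
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    // First sample record from the data set above.
    println(parseLine("1,111,68,69,90,1班,經濟系"))
  }
}
```

Returning an `Option` keeps bad input from failing a whole job; with an RDD of lines, such a parser would typically be applied with `flatMap` so that only valid records survive.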

import org.apache.log4j.
import org.apache.spark.
/**
 * Student score top-K problem:
 * top 3 in each subject for every class in every department.
 * Line format: id,studentid,language,math,english,classid,departmentid
 */
object testgroupby )
/** Display the results */
topk.foreach(println)
/*
(外語系,2班,Map(語文前3 -> List(82分:學號225, 80分:學號222, 79分:學號223), 數學前3 -> List(99分:學號221, 85分:學號225, 80分:學號224), 外語前3 -> List(96分:學號222, 96分:學號223, 78分:學號224)))
(外語系,1班,Map(語文前3 -> List(94分:學號214, 89分:學號211, 85分:學號216), 數學前3 -> List(94分:學號213, 94分:學號214, 93分:學號211), 外語前3 -> List(93分:學號216, 90分:學號213, 75分:學號212)))
(經濟系,1班,Map(語文前3 -> List(99分:學號115, 90分:學號113, 89分:學號114), 數學前3 -> List(94分:學號114, 93分:學號115, 80分:學號112), 外語前3 -> List(96分:學號112, 93分:學號114, 90分:學號111)))
(經濟系,2班,Map(語文前3 -> List(96分:學號121, 89分:學號122, 76分:學號124), 數學前3 -> List(86分:學號122, 78分:學號123, 74分:學號121), 外語前3 -> List(85分:學號122, 79分:學號121, 76分:學號124)))
*/
}}
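
The grouping and per-subject ranking behind a result of this shape can be sketched as follows. This is a minimal sketch under stated assumptions, not the original listing: it assumes Spark core is on the classpath and a local master, and the names `TopKSketch`, `Row`, `parse`, `topKBySubject`, and `topKPerClass` are hypothetical. Records are keyed by (department, class) and grouped with `groupByKey`; within each group, every subject is sorted by descending score and the first k entries are kept. Breaking ties by student ID is an added assumption, made so the output does not depend on the order in which grouped values arrive.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TopKSketch {
  // Hypothetical record type; names are illustrative, not from the original post.
  final case class Row(studentId: String, language: Int, math: Int, english: Int,
                       classId: String, departmentId: String)

  // Parses "id,studentid,language,math,english,classid,departmentid".
  // Assumes well-formed input; the leading record id is not needed here.
  def parse(line: String): Row = {
    val f = line.split(",").map(_.trim)
    Row(f(1), f(2).toInt, f(3).toInt, f(4).toInt, f(5), f(6))
  }

  // Top k for one subject inside a single group:
  // highest score first, ties broken by student ID (an assumption).
  private def topK(rows: Iterable[Row], k: Int)(score: Row => Int): List[String] =
    rows.toList
      .sortBy(r => (-score(r), r.studentId))
      .take(k)
      .map(r => s"${score(r)}分:學號${r.studentId}")

  // Builds the per-subject top-k lists for the rows of one (department, class) group.
  def topKBySubject(rows: Iterable[Row], k: Int): Map[String, List[String]] =
    Map(
      s"語文前$k" -> topK(rows, k)(_.language),
      s"數學前$k" -> topK(rows, k)(_.math),
      s"外語前$k" -> topK(rows, k)(_.english)
    )

  // First grouping: by (department, class). Second step: rank each subject within the group.
  def topKPerClass(lines: RDD[String], k: Int): RDD[(String, String, Map[String, List[String]])] =
    lines
      .map(parse)
      .map(r => ((r.departmentId, r.classId), r))
      .groupByKey()
      .map { case ((dept, cls), rows) => (dept, cls, topKBySubject(rows, k)) }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TopKSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // A subset of the sample data above (class 2 of the economics department).
      val lines = sc.parallelize(Seq(
        "6,121,96,74,79,2班,經濟系",
        "7,122,89,86,85,2班,經濟系",
        "8,123,70,78,61,2班,經濟系",
        "9,124,76,70,76,2班,經濟系"
      ))
      topKPerClass(lines, 3).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

`groupByKey` materializes every record of a group on one executor, which is acceptable for class-sized groups like these. For very large groups, `aggregateByKey` with a bounded per-subject buffer would be a common alternative, since it keeps only k candidates per key while shuffling.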
