Spark RDD Partitions and Dependencies, by Example

2021-07-04 08:23:48 · 3758 words · 8388 reads

Here is an example:

scala> val textFileRDD = sc.textFile("/users/zhuweibin/downloads/hive_04053f79f32b414a9cf5ab0d4a3c9daf.txt")
15/08/03 07:00:08 INFO MemoryStore: ensureFreeSpace(57160) called with curMem=0, maxMem=278019440
15/08/03 07:00:08 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 55.8 KB, free 265.1 MB)
15/08/03 07:00:08 INFO MemoryStore: ensureFreeSpace(17237) called with curMem=57160, maxMem=278019440
15/08/03 07:00:08 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.8 KB, free 265.1 MB)
15/08/03 07:00:08 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:51675 (size: 16.8 KB, free: 265.1 MB)
15/08/03 07:00:08 INFO SparkContext: Created broadcast 0 from textFile at <console>:21
textFileRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

scala> println( textFileRDD.partitions.size )
15/08/03 07:00:09 INFO FileInputFormat: Total input paths to process : 1
2

scala> textFileRDD.partitions.foreach { partition =>
     |   println("index:" + partition.index + " hashCode:" + partition.hashCode)
     | }
index:0 hashCode:1681
index:1 hashCode:1682

scala> println("dependency size:" + textFileRDD.dependencies)
dependency size:List(org.apache.spark.OneToOneDependency@543669de)

scala> println( textFileRDD )

scala> textFileRDD.dependencies.foreach { dep =>
     |   println("dependency type:" + dep.getClass)
     |   println("dependency rdd:" + dep.rdd)
     |   println("dependency partitions:" + dep.rdd.partitions)
     |   println("dependency partitions size:" + dep.rdd.partitions.size)
     | }
dependency type:class org.apache.spark.OneToOneDependency
dependency rdd:/users/zhuweibin/downloads/hive_04053f79f32b414a9cf5ab0d4a3c9daf.txt HadoopRDD[0] at textFile at <console>:21
dependency partitions:[Lorg.apache.spark.Partition;@c197f46
dependency partitions size:2

scala>

scala> val flatMapRDD = textFileRDD.flatMap(_.split(" "))
flatMapRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:23

scala> println( flatMapRDD )

scala> flatMapRDD.dependencies.foreach { dep =>
     |   println("dependency type:" + dep.getClass)
     |   println("dependency partitions:" + dep.rdd.partitions)
     |   println("dependency partitions size:" + dep.rdd.partitions.size)
     | }
dependency type:class org.apache.spark.OneToOneDependency
dependency partitions:[Lorg.apache.spark.Partition;@c197f46
dependency partitions size:2

scala>

scala> val mapRDD = flatMapRDD.map(word => (word, 1))
mapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:25

scala> println( mapRDD )

scala> mapRDD.dependencies.foreach { dep =>
     |   println("dependency type:" + dep.getClass)
     |   println("dependency partitions:" + dep.rdd.partitions)
     |   println("dependency partitions size:" + dep.rdd.partitions.size)
     | }
dependency type:class org.apache.spark.OneToOneDependency
dependency partitions:[Lorg.apache.spark.Partition;@c197f46
dependency partitions size:2

scala>

scala> val counts = mapRDD.reduceByKey(_ + _)
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:27

scala> println( counts )
ShuffledRDD[4] at reduceByKey at <console>:27

scala> counts.dependencies.foreach { dep =>
     |   println("dependency type:" + dep.getClass)
     |   println("dependency partitions:" + dep.rdd.partitions)
     |   println("dependency partitions size:" + dep.rdd.partitions.size)
     | }
dependency type:class org.apache.spark.ShuffleDependency
dependency partitions:[Lorg.apache.spark.Partition;@c197f46
dependency partitions size:2

scala>

From this output we can see that for any RDD x, its dependencies field lists the RDDs it directly depends on (one or more). How do these dependencies express the relationship between RDDs? Take any dependency that is an element of dependencies: its rdd member points to the parent RDD, as the "dependency rdd:" lines above show.
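To make this concrete, here is a minimal plain-Scala sketch of the idea. It is a toy model, not Spark's real RDD and Dependency classes: ToyRDD, OneToOneDep, and ShuffleDep are invented names used only for illustration.

```scala
// Toy model of RDD lineage. Each dependency keeps a reference to its parent RDD.
sealed trait Dep { def rdd: ToyRDD }
// Narrow dependency: partition i of the child maps to partition i of the parent.
case class OneToOneDep(rdd: ToyRDD) extends Dep
// Wide dependency: a child partition may read every parent partition (a shuffle).
case class ShuffleDep(rdd: ToyRDD) extends Dep

case class ToyRDD(id: Int, name: String, dependencies: List[Dep])

// The lineage of the word-count example above: textFile -> flatMap -> map -> reduceByKey.
val textFile = ToyRDD(1, "textFile", Nil)
val flatMap  = ToyRDD(2, "flatMap", List(OneToOneDep(textFile)))
val mapped   = ToyRDD(3, "map", List(OneToOneDep(flatMap)))
val counts   = ToyRDD(4, "reduceByKey", List(ShuffleDep(mapped)))

// Walking dependencies, like the foreach calls in the REPL session:
counts.dependencies.foreach { dep =>
  println("dependency type:" + dep.getClass.getSimpleName)
  println("dependency rdd:" + dep.rdd.name + "[" + dep.rdd.id + "]")
}
```

The point of the sketch is only that dependencies forms a linked chain from each RDD back to its parents; Spark additionally distinguishes narrow (OneToOneDependency) from wide (ShuffleDependency) links, which is what the last REPL block above shows for reduceByKey.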

So if computing a partition of some RDD fails, how far back along the chain must we trace? The dependency.rdd values printed in the example above are:

MapPartitionsRDD[1] at textFile at <console>:21
MapPartitionsRDD[2] at flatMap at <console>:23
MapPartitionsRDD[3] at map at <console>:25
ShuffledRDD[4] at reduceByKey at <console>:27

As you can see, each RDD carries a sequence number. During backtracking, each step upward yields one or more parent RDDs, and at each step the system checks whether that RDD is already available (i.e., cached). If it is, backtracking stops there; if not, it keeps going upward until it reaches a cached RDD, or, failing that, the original RDD's data source.
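The backtracking rule just described can be sketched in a few lines of plain Scala. Again this is a toy model under stated assumptions: Node and backtrack are invented names, a boolean `cached` flag stands in for a persisted RDD, and real Spark performs this reasoning per partition inside its scheduler.

```scala
// Toy lineage node: `cached` marks an RDD whose data is already available.
case class Node(id: Int, name: String, parents: List[Node], cached: Boolean = false)

// Walk up the lineage, stopping at a cached RDD or at the original data source.
def backtrack(rdd: Node): List[Node] =
  if (rdd.cached || rdd.parents.isEmpty) List(rdd) // stop: available, or nothing above it
  else rdd :: rdd.parents.flatMap(backtrack)

val textFile = Node(1, "textFile", Nil)
val flatMap  = Node(2, "flatMap", List(textFile), cached = true) // suppose this RDD was persisted
val mapped   = Node(3, "map", List(flatMap))
val counts   = Node(4, "reduceByKey", List(mapped))

// Recomputing a failed partition of `counts` only needs to go back to the cached flatMap RDD:
println(backtrack(counts).map(_.name)) // List(reduceByKey, map, flatMap)
```

Because flatMap is marked as cached, the walk never reaches textFile or the file on disk; remove the `cached = true` flag and the same call would trace all the way back to the data source.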
