03 Spark RDD Transformation Operators: Double-Value Type (Two-RDD Operators)

2021-10-06 12:08:40 · Word count: 3205 · Reads: 5398

Purpose of union: computes the union of the source RDD and the argument RDD, returning a new RDD.

Example:

scala> val rdd1 = sc.makeRDD(1 to 6)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[95] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(4 to 10)

rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[96] at makeRDD at <console>:24

scala> val rdd = rdd1.union(rdd2)

rdd: org.apache.spark.rdd.RDD[Int] = UnionRDD[97] at union at <console>:28

scala> rdd.collect

res53: Array[Int] = Array(1, 2, 3, 4, 5, 6, 4, 5, 6, 7, 8, 9, 10)
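Note that, unlike a mathematical set union, RDD union keeps duplicates (4, 5, 6 appear twice above); chain a distinct() afterwards if you need set semantics. The behavior can be sketched with plain Scala collections as a local analogy (not Spark itself):

```scala
val a = (1 to 6).toVector
val b = (4 to 10).toVector

// Like rdd1.union(rdd2): simple concatenation, duplicates are kept
val u = a ++ b
assert(u == Vector(1, 2, 3, 4, 5, 6, 4, 5, 6, 7, 8, 9, 10))

// Like rdd1.union(rdd2).distinct(): duplicates removed
assert((a ++ b).distinct == Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
```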

Purpose of subtract: computes the difference, removing from the source RDD the elements it has in common with the argument RDD (otherDataset), so only elements unique to the source remain.

Example:

scala> val rdd1 = sc.makeRDD(1 to 6)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[98] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(4 to 10)

rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[99] at makeRDD at <console>:24

scala> val rdd = rdd1.subtract(rdd2)

scala> rdd.collect

res54: Array[Int] = Array(1, 2, 3)
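Keep in mind that subtract is not symmetric: swapping the two RDDs gives a different result. A local sketch with plain Scala collections (an analogy, not Spark itself):

```scala
val a = (1 to 6).toVector
val b = (4 to 10).toVector

// Like rdd1.subtract(rdd2): keep only elements of a that are not in b
val d1 = a.filterNot(b.contains)
assert(d1 == Vector(1, 2, 3))

// Not symmetric: rdd2.subtract(rdd1) would keep elements of b not in a
val d2 = b.filterNot(a.contains)
assert(d2 == Vector(7, 8, 9, 10))
```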

Purpose of intersection: computes the intersection of the source RDD and the argument RDD, returning a new RDD.

Example:

scala> val rdd1 = sc.makeRDD(1 to 6)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[104] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(4 to 10)

rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[105] at makeRDD at <console>:24

scala> val rdd = rdd1.intersection(rdd2)

scala> rdd.collect

res55: Array[Int] = Array(4, 5, 6)
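Unlike union, intersection does deduplicate its output, and in Spark it involves a shuffle. The set semantics can be sketched with plain Scala collections (a local analogy, not Spark itself):

```scala
val a = (1 to 6).toVector
val b = (4 to 10).toVector

// Like rdd1.intersection(rdd2): keep only elements present in both
val i = a.filter(b.contains)
assert(i == Vector(4, 5, 6))

// intersection is symmetric up to ordering
assert(b.filter(a.contains) == Vector(4, 5, 6))
```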

Purpose of cartesian: computes the Cartesian product of the two RDDs. Avoid it where possible, since the result size is the product of the two input sizes.

Example:

scala> val rdd1 = sc.makeRDD(1 to 6)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[112] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(4 to 10)

rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[113] at makeRDD at <console>:24

scala> val rdd = rdd1.cartesian(rdd2)

rdd: org.apache.spark.rdd.RDD[(Int, Int)] = CartesianRDD[114] at cartesian at <console>:28

scala> rdd.collect

res56: Array[(Int, Int)] = Array((1,4), (1,5), (1,6), (1,7), (1,8), (1,9), (1,10), (2,4), (3,4), (2,5), (2,6), (3,5), (3,6), (2,7), (2,8), (3,7), (3,8), (2,9), (2,10), (3,9), (3,10), (4,4), (4,5), (4,6), (4,7), (4,8), (4,9), (4,10), (5,4), (6,4), (5,5), (5,6), (6,5), (6,6), (5,7), (5,8), (6,7), (6,8), (5,9), (5,10), (6,9), (6,10))
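The reason to avoid cartesian is the multiplicative blow-up: 6 elements paired with 7 elements already yields 42 tuples. A local sketch of the same pairing with plain Scala collections (an analogy, not Spark itself):

```scala
val a = (1 to 6).toVector   // 6 elements
val b = (4 to 10).toVector  // 7 elements

// Like rdd1.cartesian(rdd2): every pairing of one element from each side
val pairs = for (x <- a; y <- b) yield (x, y)

// Result size is the product of the input sizes: 6 * 7 = 42
assert(pairs.size == a.size * b.size)
assert(pairs.contains((1, 4)) && pairs.contains((6, 10)))
```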

Purpose of zip: zips the two RDDs together element by element. The two RDDs must contain the same number of elements (in Spark, the same number of partitions and the same number of elements per partition), otherwise an exception is thrown.

Example:

scala> val rdd1 = sc.makeRDD(1 to 5)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[118] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(6 to 10)

rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[119] at makeRDD at <console>:24

scala> val rdd = rdd1.zip(rdd2)

rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ZippedPartitionsRDD2[120] at zip at <console>:28

scala> rdd.collect

res58: Array[(Int, Int)] = Array((1,6), (2,7), (3,8), (4,9), (5,10))
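The positional pairing can be sketched with plain Scala collections, with one caveat worth remembering: collection zip silently truncates to the shorter side, whereas RDD zip fails at runtime on a length mismatch. A local analogy (not Spark itself):

```scala
val a = (1 to 5).toVector
val b = (6 to 10).toVector

// Like rdd1.zip(rdd2): pair up elements by position
assert(a.zip(b) == Vector((1, 6), (2, 7), (3, 8), (4, 9), (5, 10)))

// Caution: with mismatched lengths, collection zip truncates quietly,
// while RDD zip would throw an exception instead.
assert(a.zip(6 to 11) == Vector((1, 6), (2, 7), (3, 8), (4, 9), (5, 10)))
```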
