Spark Scala Programming: A Collection of Common Tips

2021-10-03 19:03:56

(1) Get the FileSystem

import org.apache.hadoop.fs.FileSystem

// 1. Create the FileSystem
def gethdfs(path: String): FileSystem = ???  // body not shown in the source
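The body of gethdfs is not given in the original. As a minimal sketch, assuming the path is a URI (such as hdfs://... or file:///...) and that a default Hadoop Configuration is sufficient, it could look roughly like this:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Assumption: a default Configuration is enough; a real job may need
// cluster-specific settings (core-site.xml, hdfs-site.xml) on the classpath.
def gethdfs(path: String): FileSystem = {
  val conf = new Configuration()
  // Picks the FileSystem implementation from the URI scheme (hdfs, file, ...)
  FileSystem.get(URI.create(path), conf)
}
```

Resolving the FileSystem from the path's URI, rather than from the default file system alone, lets the same helper work for both local test paths and HDFS paths.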

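(2) Find the latest directory by timestamp

import org.apache.hadoop.fs.{FileSystem, Path}

def findcandidate(filesystem: FileSystem, fspath: String): Path = ???  // body not shown in the source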
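The body of findcandidate is not given in the original either. One plausible reading of "by timestamp" is "the sub-directory with the most recent modification time", and the sketch below assumes exactly that. If the directory names themselves encode the timestamp, comparing names with maxBy(_.getPath.getName) would be the analogous choice.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumption: "latest" means the largest modification time among the
// immediate sub-directories of fspath.
def findcandidate(filesystem: FileSystem, fspath: String): Path = {
  // List the direct children and keep only directories
  val dirs = filesystem.listStatus(new Path(fspath)).filter(_.isDirectory)
  require(dirs.nonEmpty, s"no sub-directories found under $fspath")
  // Pick the directory that was modified most recently
  dirs.maxBy(_.getModificationTime).getPath
}
```

The require call makes an empty parent directory fail with a clear message instead of an exception from maxBy on an empty collection.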

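(3) Read all valid data files under the latest directory
spark.read.text(finalpath) reads the matching files into a DataFrame with one row per line of text.

// Get the latest directory
val validpath = findcandidate(gethdfs(path), path)
println("validfilepath: " + validpath)
// Match only the part-* data files inside that directory
val finalpath = validpath.toString.concat("/part-*")
println("finalpath: " + finalpath)
val result = spark.read.text(finalpath)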
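To make the shape of the result concrete, here is a small self-contained sketch that runs against a temporary local directory. The local SparkSession, the directory and the file contents are made up for illustration. It shows that spark.read.text with a part-* glob yields a single string column named value, one row per input line.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session is used here only for demonstration
val spark = SparkSession.builder().master("local[1]").appName("read-text-sketch").getOrCreate()

// Hypothetical "latest directory" with one data file and one marker file
val dir = Files.createTempDirectory("latest")
val content = "{\"img_id\":\"a\"}\n{\"img_id\":\"b\"}\n"
Files.write(dir.resolve("part-00000"), content.getBytes(StandardCharsets.UTF_8))
Files.write(dir.resolve("_SUCCESS"), Array.emptyByteArray)

// The glob selects only the part-* files; each line becomes one row
val lines = spark.read.text(dir.toString + "/part-*")
lines.printSchema()
```

Because every row is just a raw line of text, any JSON parsing has to happen in a later step.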

(4) Parse the JSON stored one object per line and collect the parsed rows for a new DataFrame
val list = result.collect()
for (row <- list) {
  // Assumption: the source omits how json is created; the method names below
  // match fastjson's JSONObject (com.alibaba.fastjson.JSON)
  val json = JSON.parseObject(row.getString(0))
  val features = json.getJSONArray("feature").toArray
  val imgid = json.getString("img_id")
  val imgurl = json.getString("img_url")
  val width = json.getIntValue("width")
  val height = json.getIntValue("height")
  val date = json.getString("date")
  val isimg = json.getString("type")
  val extention = json.getString("extention")
  val path = json.getString("path")
  val source = json.getString("source")
  // adidslist is built elsewhere; its construction is not shown in the source
  datalist.add(Row(adidslist.toArray, features, imgid, imgurl, width, height, date, isimg, extention, path, source))
}
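As a small worked example of the per-line parsing, the sketch below assumes the JSON library is fastjson (com.alibaba.fastjson), which the method names getJSONArray and getIntValue suggest. The sample line is made up for illustration.

```scala
import com.alibaba.fastjson.JSON

// Hypothetical input line; real records carry more fields
val line = """{"img_id":"abc","width":640,"height":480,"feature":["0.1","0.2"]}"""

val json = JSON.parseObject(line)
val features = json.getJSONArray("feature").toArray // Array[AnyRef]
val imgid = json.getString("img_id")
val width = json.getIntValue("width")

// Missing keys do not throw: getString returns null, getIntValue returns 0
val missingText = json.getString("img_url")
val missingInt = json.getIntValue("depth")
```

Since absent keys silently become null or 0, marking the corresponding schema fields as nullable is a sensible match for this style of parsing.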

Note that datalist and the schema of its rows must be defined beforehand, as shown below:

import java.util
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("s_ad_id", ArrayType(LongType, true), true),
  StructField("feature", ArrayType(StringType, true), true),
  StructField("img_id", StringType, true),
  StructField("img_url", StringType, true),
  StructField("width", IntegerType, true),
  StructField("height", IntegerType, true),
  StructField("date", StringType, true),
  StructField("format_type", StringType, true),
  StructField("extention", StringType, true),
  StructField("path", StringType, true),
  StructField("source", StringType, true)
))
val datalist = new util.ArrayList[Row]()

(5) Create a new DataFrame from datalist
var df2 = spark.createDataFrame(datalist, schema)
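Putting the last two steps together, the following self-contained sketch builds a DataFrame from a java.util.List of Row objects and an explicit schema. It uses a shortened three-column schema, a local SparkSession and made-up values, all of which are assumptions for illustration only.

```scala
import java.util
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// Assumption: a local session is used here only for demonstration
val spark = SparkSession.builder().master("local[1]").appName("create-df-sketch").getOrCreate()

// Shortened version of the schema: one array column and two scalar columns
val schema = StructType(List(
  StructField("s_ad_id", ArrayType(LongType, true), true),
  StructField("img_id", StringType, true),
  StructField("width", IntegerType, true)
))

val datalist = new util.ArrayList[Row]()
// Values must appear in the same order and with the same types as the schema
datalist.add(Row(Seq(1L, 2L), "abc", 640))

val df2 = spark.createDataFrame(datalist, schema)
df2.show()
```

The values in each Row are matched to the schema by position, so a mismatch in order or type only surfaces when the DataFrame is evaluated.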

PS: To be continued.
