05 Testing Hadoop's Built-in Word-Count Demo


Having covered HDFS, Hadoop's storage component, let's now look at its other major component, the MapReduce computation engine: HDFS handles storage at scale, and MapReduce handles computation at scale. Like other mature open-source projects, Hadoop ships with a set of example programs, so below we'll use the built-in MapReduce demo to run a word count.

# Switch to the home directory

cd

# Enter Hadoop's bin directory

cd hadoop-2.5.2/bin
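Optionally, you can put this bin directory on your PATH so the ./ prefix used on the commands below becomes unnecessary. This is only a convenience sketch; it assumes Hadoop was unpacked under the home directory, as above:

# optional: make hdfs/yarn callable without the ./ prefix in this shell session
export PATH="$PATH:$HOME/hadoop-2.5.2/bin"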

# vim word — add the following lines, then save and quit; feel free to use any content you like. This is the file whose word frequencies we are about to count.

hello i am zhangli

hello i am xiaoli

hi i am ali

who are you

i am xiaoli
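If you'd rather not open an editor, a heredoc creates the same file non-interactively (a minimal alternative sketch; run it from the same bin directory so the relative path used in the upload step still resolves):

# create the word file without an editor
cat > word <<'EOF'
hello i am zhangli
hello i am xiaoli
hi i am ali
who are you
i am xiaoli
EOF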

# Upload the word file to HDFS

./hdfs dfs -put word /word

# Check the upload result

./hdfs dfs -cat /word
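Besides printing the file's contents, listing it is a quick sanity check on size and replication (in the listing, the second column is the replication factor):

# list the uploaded file to check its size and replication factor
./hdfs dfs -ls /word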

# Start the word count. In this command:

# ./yarn is the launcher command

# jar indicates that we are running a jar file

# /root/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar is the jar to run

# wordcount is the name of the example program to run

# /word is the HDFS path of the input file we uploaded

# /output is the path where the results will be written

./yarn jar /root/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar wordcount /word /output
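One caveat before looking at the job output: MapReduce jobs fail fast if the output path already exists, so when re-running the job you must delete /output first. A small cleanup sketch:

# remove the old output directory recursively before re-running the job
./hdfs dfs -rm -r /output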

# After a short wait, you will see output like the following

19/05/30 12:29:41 INFO client.RMProxy: Connecting to ResourceManager at hadoop1/192.168.100.192:8032
19/05/30 12:29:46 INFO input.FileInputFormat: Total input paths to process : 1
19/05/30 12:29:47 INFO mapreduce.JobSubmitter: number of splits:1
19/05/30 12:29:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1559056674360_0002
19/05/30 12:29:51 INFO mapreduce.Job: Running job: job_1559056674360_0002
19/05/30 12:30:19 INFO mapreduce.Job: Job job_1559056674360_0002 running in uber mode : false
19/05/30 12:30:19 INFO mapreduce.Job:  map 0% reduce 0%
19/05/30 12:30:36 INFO mapreduce.Job:  map 100% reduce 0%
19/05/30 12:30:46 INFO mapreduce.Job:  map 100% reduce 100%
19/05/30 12:30:49 INFO mapreduce.Job: Job job_1559056674360_0002 completed successfully
19/05/30 12:30:50 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=111
        FILE: Number of bytes written=194141
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=156
        HDFS: Number of bytes written=65
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=15451
        Total time spent by all reduces in occupied slots (ms)=7614
        Total time spent by all map tasks (ms)=15451
        Total time spent by all reduce tasks (ms)=7614
        Total vcore-seconds taken by all map tasks=15451
        Total vcore-seconds taken by all reduce tasks=7614
        Total megabyte-seconds taken by all map tasks=15821824
        Total megabyte-seconds taken by all reduce tasks=7796736
    Map-Reduce Framework
        Map input records=5
        Map output records=17
        Map output bytes=135
        Map output materialized bytes=111
        Input split bytes=89
        Combine input records=17
        Combine output records=10
        Reduce input groups=10
        Reduce shuffle bytes=111
        Reduce input records=10
        Reduce output records=10
        Spilled Records=20
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=697
        CPU time spent (ms)=8200
        Physical memory (bytes) snapshot=445980672
        Virtual memory (bytes) snapshot=4215586816
        Total committed heap usage (bytes)=322437120
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=67
    File Output Format Counters
        Bytes Written=65

# List /output. You will see two files there: _SUCCESS marks a successful run, and part-r-00000 holds the output

./hdfs dfs -ls /output

The output is as follows:

Found 2 items
-rw-r--r--   2 root supergroup          0 2019-05-30 12:30 /output/_SUCCESS
-rw-r--r--   2 root supergroup         65 2019-05-30 12:30 /output/part-r-00000

# View the word-count result

./hdfs dfs -cat /output/part-r-00000

# The output is as follows

ali 1

am 4

are 1

hello 1

hi 1

i 4

who 1

xiaoli 2

you 1

zhangli 1
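If you want the result on the local filesystem, for example to post-process it with ordinary shell tools, you can pull it out of HDFS. A sketch, with an arbitrary local destination path:

# copy the result file from HDFS to the local filesystem
./hdfs dfs -get /output/part-r-00000 /tmp/word-result.txt
# e.g. sort by frequency, highest first
sort -k2 -nr /tmp/word-result.txt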

That covers the whole process of running Hadoop's built-in word-count demo and inspecting its results.
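As a closing pointer, the same examples jar bundles many more demos than wordcount (pi, grep, terasort, and so on). Running the jar without a program name should print the list of valid program names, each of which can be invoked the same way as wordcount:

# list all example programs bundled in the jar
./yarn jar /root/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar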
