Hadoop Program Development with Python

This walkthrough uses word counting as the example.

mkdir /usr/local/hadoop-python

cd /usr/local/hadoop-python

#!/usr/bin/env python
import sys

# input comes from stdin (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to stdout (standard output);
        # what we output here will be the input for the
        # reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

After saving the file, remember to make it executable (with chmod a+x, as for reducer.py below).

vim reducer.py

#!/usr/bin/env python
import sys

current_word = None
current_count = 0

# input comes from stdin
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    try:
        # parse the tab-delimited input produced by the mapper
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
    except ValueError:
        # the line had no tab or count was not a number, so silently
        # ignore/discard this line
        continue

    # this if-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to stdout
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word:
    print('%s\t%s' % (current_word, current_count))

After saving the file, remember to change its permissions accordingly:

chmod a+x /usr/local/hadoop-python/reducer.py
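The reducer depends on Hadoop delivering its input sorted by key, which is exactly the situation itertools.groupby is designed for. The following is a minimal sketch of the same summing logic written that way; the function names parse_pairs and reduce_counts and the sample list are illustrative assumptions, not part of the original scripts.

```python
from itertools import groupby
from operator import itemgetter


def parse_pairs(lines, separator="\t"):
    """Yield (word, count) pairs from tab-delimited lines, skipping malformed ones."""
    for line in lines:
        parts = line.rstrip("\n").split(separator, 1)
        if len(parts) != 2:
            continue  # no separator on this line
        word, count = parts
        try:
            yield word, int(count)
        except ValueError:
            # count was not a number, so skip this line
            continue


def reduce_counts(lines):
    """Sum the counts per word; assumes the lines are already sorted by word."""
    pairs = parse_pairs(lines)
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


# Sorted sample input, standing in for what Hadoop would deliver on stdin.
sample = ["bar\t1", "foo\t1", "foo\t1", "foo\t1", "labs\t1", "quux\t1", "quux\t1"]
for word, total in reduce_counts(sample):
    print("%s\t%d" % (word, total))
```

In a real streaming job, sys.stdin would be passed to reduce_counts in place of the sample list.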

First, test the code above on the local machine, so that any problems are caught early:

Output:

foo	1
foo	1
quux	1
labs	1
foo	1
bar	1
quux	1
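The whole map, sort, reduce flow can also be checked in a single Python process before involving Hadoop. This is a rough sketch under assumptions: the input sentence is inferred from the mapper output shown above, sorted() stands in for Hadoop's sort-by-key step, and map_words and reduce_sorted are hypothetical helper names.

```python
def map_words(lines):
    """Mapper step: emit (word, 1) for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_sorted(pairs):
    """Reducer step: sum the counts of adjacent equal keys (input must be sorted)."""
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, count
    if current_word is not None:
        yield current_word, current_count


# Assumed test sentence, reconstructed from the mapper output above.
text = ["foo foo quux labs foo bar quux"]
mapped = list(map_words(text))
shuffled = sorted(mapped)  # stands in for Hadoop's sort-by-key between map and reduce
for word, total in reduce_sorted(shuffled):
    print("%s\t%d" % (word, total))
```

Skipping the sorted() call shows why the reducer relies on sorted input: the same word would then be reported several times with partial counts.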

Then run the following code, which also includes reducer.py:

Output:

/www.gutenberg.org/cache/epub/20417/pg20417.txt

Then upload the downloaded books to the HDFS file system:

# Create a folder on HDFS for the input files
hdfs dfs -mkdir /input

# Upload the document to the input folder on HDFS
hdfs dfs -put /usr/local/hadoop-python/input/pg20417.txt /input

cd $HADOOP_HOME

find ./ -name "*streaming*.jar"

This finds the hadoop-streaming*.jar files in the share folder:

./share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar

./share/hadoop/tools/sources/hadoop-streaming-2.8.4-test-sources.jar

./share/hadoop/tools/sources/hadoop-streaming-2.8.4-sources.jar

/usr/local/hadoop-2.8.4/share/hadoop/tools/lib
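The same search can be done from Python, which is convenient if the job is later launched from a script. A small sketch, assuming the HADOOP_HOME environment variable points at the Hadoop installation; find_streaming_jars is a hypothetical helper name.

```python
import os
from pathlib import Path


def find_streaming_jars(hadoop_home):
    """Return all files under hadoop_home whose name matches *streaming*.jar."""
    return sorted(str(p) for p in Path(hadoop_home).rglob("*streaming*.jar"))


# Only search when HADOOP_HOME is actually set (an assumption about the environment).
hadoop_home = os.environ.get("HADOOP_HOME")
if hadoop_home:
    for jar in find_streaming_jars(hadoop_home):
        print(jar)
```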

Because the path to this file is rather long, we can store it in an environment variable:

vim /etc/profile

export stream=/usr/local/hadoop-2.8.4/share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar

Because the command for running the scripts through the streaming interface is very long, create a shell script named run.sh to run it:

vim run.sh

# Assumes the mapper script above was saved as /usr/local/hadoop-python/mapper.py
hadoop jar /usr/local/hadoop-2.8.4/share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar \
-mapper /usr/local/hadoop-python/mapper.py \
-reducer /usr/local/hadoop-python/reducer.py \
-input /input/pg20417.txt \
-output /output1
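The same invocation can be assembled in Python, which keeps each argument separate and makes it easy to reuse the environment variable defined earlier. This is only a sketch: build_streaming_command is a hypothetical helper, the mapper path mapper.py is an assumed file name, and the job is printed rather than launched.

```python
import os
import shlex


def build_streaming_command(jar, mapper, reducer, input_path, output_path):
    """Return the argument list for a Hadoop Streaming job."""
    return [
        "hadoop", "jar", jar,
        "-mapper", mapper,
        "-reducer", reducer,
        "-input", input_path,
        "-output", output_path,
    ]


default_jar = "/usr/local/hadoop-2.8.4/share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar"
command = build_streaming_command(
    jar=os.environ.get("stream", default_jar),  # variable exported in /etc/profile
    mapper="/usr/local/hadoop-python/mapper.py",  # assumed mapper file name
    reducer="/usr/local/hadoop-python/reducer.py",
    input_path="/input/pg20417.txt",
    output_path="/output1",
)
print(" ".join(shlex.quote(arg) for arg in command))
# To launch the job for real: subprocess.run(command, check=True)
```

Note that Hadoop refuses to start a job whose output directory already exists, so /output1 has to be removed or renamed between runs.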
