Hadoop Study Notes 8

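The inverted index is the most commonly used data structure in document retrieval systems. It starts from a word and looks up which documents contain it and how often, rather than starting from a document and listing its words, which is why it is called an inverted index. Its structure is as follows:

In this index table, every word maps to the list of documents in which it appears, and the weight is the number of times the word appears in that document. Now suppose the input is the following list of files: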

t1: hello world hello china

t2: hello hadoop

t3: bye world bye hadoop bye bye

Given these files as input, we should end up with the following index file:

bye t3:4;

china t1:1;

hadoop t2:1;t3:1;

hello t1:2;t2:1;

world t1:1;t3:1;
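Before turning to Hadoop, the target structure can be sketched in plain Java. This is only an in-memory illustration under stated assumptions, not the MapReduce job described in these notes: the class name InvertedIndexSketch and the methods build and format are hypothetical, and the three files are passed in as strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (no Hadoop): builds the word -> (file -> count) index in memory.
// The class and method names are illustrative, not from the original notes.
class InvertedIndexSketch {

    // Maps each word to (file name -> occurrence count); TreeMap keeps words and files sorted.
    static Map<String, Map<String, Integer>> build(Map<String, String> files) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String word : file.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // an empty file yields one empty token; skip it
                }
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(file.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    // Renders one index entry in the "word file:count;file:count;" format.
    static String format(String word, Map<String, Integer> postings) {
        StringBuilder sb = new StringBuilder(word).append(' ');
        postings.forEach((file, count) -> sb.append(file).append(':').append(count).append(';'));
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("t1", "hello world hello china");
        files.put("t2", "hello hadoop");
        files.put("t3", "bye world bye hadoop bye bye");
        build(files).forEach((word, postings) -> System.out.println(format(word, postings)));
    }
}
```

The nested sorted map plays the role that sorting and grouping play in a MapReduce job: words come out in order, and each word carries its per-file counts.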

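Next, we need to find a way to use Hadoop to turn this input into that output. Following the previous chapter, this comes down to working out how to customize the steps of a Hadoop job so that it does what we want. Of all the steps, map and reduce matter most and the rest play supporting roles, so let us first analyze what the map and reduce processes will look like.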

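First, the map process. The map input is text, arriving one line record at a time. And the output? It should contain the word, the file it appears in, and the word count. The map output is a key-value pair, so which of these three pieces of information belongs in the key and which in the value? The count has to be accumulated, so it must go in the value, and the word goes in the key. What about the file? Occurrences of the same word in different files must not be added together, so the file belongs in the key as well. The key therefore holds two values, the word and the file, and the value is the default count 1, which the later stages merge.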

So the result after map should look like this:

key value

hello:t1 1

world:t1 1

hello:t1 1

china:t1 1

hello:t2 1 …
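The emission rule described above can be tried without a cluster. The following is a minimal plain-Java sketch, assuming the file name is already known for each line; MapSketch and mapLine are hypothetical names, and the record is shown as a simple string rather than a Hadoop key-value pair.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Plain-Java sketch of the map step (no Hadoop types involved).
// MapSketch and mapLine are illustrative names, not from the original notes.
class MapSketch {

    // Emits one "word:file<TAB>1" record per token, in input order.
    static List<String> mapLine(String fileName, String line) {
        List<String> records = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            records.add(tokens.nextToken() + ":" + fileName + "\t1");
        }
        return records;
    }

    public static void main(String[] args) {
        mapLine("t1", "hello world hello china").forEach(System.out::println);
        mapLine("t2", "hello hadoop").forEach(System.out::println);
    }
}
```

Note that map does no counting at all: a word that occurs twice in a line simply produces two records with the value 1.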

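Since this key is composite, the regular built-in types no longer meet our needs, so we have to define a composite key. How to write a composite key was covered in the previous chapter, so here we go straight to the code: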

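import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Method bodies are reconstructed; the source listing preserves only the signatures.
public static class MyType implements WritableComparable<MyType> {
    private String word;
    public String getWord() { return word; }
    public void setWord(String value) { word = value; }

    private String filePath;
    public String getFilePath() { return filePath; }
    public void setFilePath(String value) { filePath = value; }

    // Serialize both fields; readFields must read them back in the same order.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeUTF(filePath);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        filePath = in.readUTF();
    }

    // Sort by word first, then by file path.
    @Override
    public int compareTo(MyType arg0) {
        int byWord = word.compareTo(arg0.word);
        return byWord != 0 ? byWord : filePath.compareTo(arg0.filePath);
    }
}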
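The contract of such a composite key can be checked without Hadoop, because DataOutput and DataInput come from java.io. The sketch below is a stand-in under stated assumptions: CompositeKey and roundTrip are hypothetical names, the class does not implement Hadoop's WritableComparable, and the word-then-file ordering is an assumption about how the key compares.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for the composite key, using only java.io so it runs without Hadoop.
// CompositeKey is a hypothetical name; it mimics the write/readFields/compareTo trio.
class CompositeKey implements Comparable<CompositeKey> {
    String word;
    String filePath;

    CompositeKey() { }

    CompositeKey(String word, String filePath) {
        this.word = word;
        this.filePath = filePath;
    }

    // Fields are written and read back in the same order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeUTF(filePath);
    }

    void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        filePath = in.readUTF();
    }

    // Assumed ordering: by word first, then by file, so equal words sort next to each other.
    @Override
    public int compareTo(CompositeKey other) {
        int byWord = word.compareTo(other.word);
        return byWord != 0 ? byWord : filePath.compareTo(other.filePath);
    }

    // Serializes a key to bytes and reads it back into a fresh instance.
    static CompositeKey roundTrip(CompositeKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            CompositeKey copy = new CompositeKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CompositeKey copy = roundTrip(new CompositeKey("hello", "t1"));
        System.out.println(copy.word + ":" + copy.filePath);
        System.out.println(new CompositeKey("hello", "t1").compareTo(new CompositeKey("hello", "t2")) < 0);
    }
}
```

The round trip matters because the framework moves keys between the map and reduce sides in serialized form, so a key that cannot be read back exactly as written would break grouping.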

With this composite key defined, the map function is easy to write:

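import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// The class name and the method body are reconstructions; only the map signature survives in the source.
public static class InvertedIndexMapper extends Mapper<Object, Text, MyType, Text> {
    @Override
    public void map(Object key, Text value, Context context) throws InterruptedException, IOException {
        FileSplit split = (FileSplit) context.getInputSplit();
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            MyType keyInfo = new MyType();
            keyInfo.setWord(itr.nextToken());
            keyInfo.setFilePath(split.getPath().getName());
            context.write(keyInfo, new Text("1"));
        }
    }
}

Note: on the line that sets the file path, the input split reports the full path. To keep the output readable, we drop the directory part and keep only the file name.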

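With map in place, we can turn to reduce, and to the combine step that runs after map. The key type of the map output is MyType, so the input key type of both reduce and combine must also be MyType.

If the map output were sent straight to reduce, reduce would still have a lot of work to do regrouping the words inside the keys. So we add a combine step before reduce to do a first round of count merging.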

This combine step will output the following values:

key value

bye t3:4

china t1:1

hadoop t2:1

hadoop t3:1

hello t1:2

hello t2:1

world t1:1

world t3:1
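The count merging done by combine can be sketched in plain Java as well. This is an illustration under stated assumptions rather than Hadoop code: CombineSketch and combine are hypothetical names, and each map record is reduced to its "word:file" key because the value is always 1.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the combine step (no Hadoop types involved).
// CombineSketch and combine are illustrative names, not from the original notes.
class CombineSketch {

    // Sums the 1s emitted by map for each "word:file" key; TreeMap keeps the keys sorted.
    static Map<String, Integer> combine(List<String> mapKeys) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String key : mapKeys) {
            sums.merge(key, 1, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> mapKeys = List.of(
            "hello:t1", "world:t1", "hello:t1", "china:t1",
            "hello:t2", "hadoop:t2",
            "bye:t3", "world:t3", "bye:t3", "hadoop:t3", "bye:t3", "bye:t3");
        // Prints one line per (word, file) pair in the form "word<TAB>file:count".
        combine(mapKeys).forEach((key, sum) -> {
            String[] parts = key.split(":", 2);
            System.out.println(parts[0] + "\t" + parts[1] + ":" + sum);
        });
    }
}
```

Because the file is part of the key, counts from different files are never added together, which is exactly why the file was placed in the key.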

The code is as follows:

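import org.apache.hadoop.mapreduce.Reducer;

public static class InvertedIndexCombiner extends Reducer<MyType, Text, MyType, Text> {
    @Override
    public void reduce(MyType key, Iterable<Text> values, Context context) throws InterruptedException, IOException {
        // Loop reconstructed: add up the "1" values emitted by map for this (word, file) key.
        int sum = 0;
        for (Text value : values) {
            sum += Integer.parseInt(value.toString());
        }
        context.write(key, new Text(key.getFilePath() + ":" + sum));
    }
}

With the combine output above, reduce is easy: it only needs to merge the values: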

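public static class InvertedIndexReducer extends Reducer<MyType, Text, Text, Text> {
    private Text result = new Text();

    @Override
    public void reduce(MyType key, Iterable<Text> values, Context context) throws InterruptedException, IOException {
        // Loop reconstructed: join the "file:count" values, each followed by ";".
        String fileList = "";
        for (Text value : values) {
            fileList += value.toString() + ";";
        }
        result.set(fileList);
        context.write(new Text(key.getWord()), result);
    }
}

After this reduce step, we get the following result: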

bye t3:4;

china t1:1;

hadoop t2:1;t3:1;

hello t1:2;t2:1;

world t1:1;t3:1;
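The final merge can be sketched in plain Java too. The sketch assumes the records reach reduce sorted by word and then by file, and that all records for one word are handled together; in a real job that depends on how keys are grouped. ReduceSketch and reduce are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the reduce step (no Hadoop types involved).
// ReduceSketch and reduce are illustrative names, not from the original notes.
class ReduceSketch {

    // Each input record is {word, "file:count"}; records are assumed sorted by word, then file.
    static Map<String, String> reduce(List<String[]> combined) {
        Map<String, String> index = new LinkedHashMap<>();
        for (String[] record : combined) {
            // Append "file:count;" to whatever has been collected for this word so far.
            index.merge(record[0], record[1] + ";", String::concat);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String[]> combined = List.of(
            new String[] {"bye", "t3:4"},
            new String[] {"china", "t1:1"},
            new String[] {"hadoop", "t2:1"},
            new String[] {"hadoop", "t3:1"},
            new String[] {"hello", "t1:2"},
            new String[] {"hello", "t2:1"},
            new String[] {"world", "t1:1"},
            new String[] {"world", "t3:1"});
        reduce(combined).forEach((word, files) -> System.out.println(word + " " + files));
    }
}
```

Run on the eight combined records, the sketch collapses them into five lines, one per word, in the same format as the index file.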

Finally, once the map and reduce functions are written, they can be hooked into a job and run.

public static void main(String[] args) throws IOException,
        InterruptedException, ClassNotFoundException

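Note: to make debugging easier, the in and out paths are hard-coded here, so no run arguments need to be passed in. In addition, before each run we check whether the out folder exists and delete it if it does.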
