A Simple MapReduce Implementation of an Inverted Index

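Inverted index: the inverted index is the most commonly used data structure in document retrieval systems and is widely used in full-text search engines. It stores a mapping from a word (or phrase) to the locations where that word occurs in a document or a set of documents, which gives a way to look up documents by their content. A forward index starts from a document and lists the content it contains; this structure does the reverse, starting from the content and listing the documents, which is why it is called an inverted index.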
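To make the definition concrete, here is a minimal in-memory sketch in plain Java, with no Hadoop involved. The class name `InvertedIndexSketch` and the method `build` are made up for illustration, and the word-splitting regex is an assumption; the three sample documents are the ones used in the example that follows.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
final class InvertedIndexSketch {
    // Assumed tokenizer: a word is a maximal run of ASCII letters and digits.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z0-9]+");

    // Maps each word to (document name -> number of occurrences in that document).
    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            Matcher m = WORD.matcher(doc.getValue());
            while (m.find()) {
                index.computeIfAbsent(m.group(), w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("news1", "hello, world! hello, urey!");
        docs.put("news2", "hello, mapreduce!");
        docs.put("news3", "hello, \"hadoop\"!");
        // Look up a word to find which documents contain it, and how often.
        System.out.println(build(docs).get("hello")); // prints {news1=2, news2=1, news3=1}
    }
}
```

The lookup goes from word to documents, which is exactly the "reverse" direction described above. The MapReduce version produces the same mapping, only distributed across tasks.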

For example, input: there are three input files

news1 :

hello, world! hello, urey!

news2 :

hello, mapreduce!

news3 :

hello, "hadoop"!

output:

hadoop news3:1,

hello news3:1,news1:2,news2:1,

mapreduce news2:1,

urey news1:1,

world news1:1,

MapReduce implementation:

mapper output:

word:uri 1

hello:news1 1

combiner input:

word:uri 1

hello:news1 1

combiner output:

word uri:number

hello news1:number

reducer input:

word uri:number

hello news1:number

reducer output:

word uri1:number1,uri2:number2,...

hello news1:number1,news2:number2,...
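The three stages above can be traced end to end without a cluster. The following is a rough stand-alone simulation in plain Java, not Hadoop code: the class `PipelineSketch` and its methods are hypothetical, the tokenizing regex is assumed, and the combiner is applied once over all mapper output, whereas on a real cluster it runs per map task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone simulation; no Hadoop classes are involved.
final class PipelineSketch {
    private static final Pattern WORD = Pattern.compile("[a-zA-Z0-9]+");

    // Mapper stage: one "word:uri" key per occurrence (the value is always 1).
    static List<String> map(String uri, String line) {
        List<String> keys = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            keys.add(m.group() + ":" + uri);
        }
        return keys;
    }

    // Combiner stage: sum the 1s per "word:uri" key, then re-key to word -> "uri:number".
    static Map<String, List<String>> combine(List<String> mapOutput) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String key : mapOutput) {
            sums.merge(key, 1, Integer::sum);
        }
        Map<String, List<String>> byWord = new TreeMap<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            String[] parts = e.getKey().split(":");
            byWord.computeIfAbsent(parts[0], w -> new ArrayList<>())
                  .add(parts[1] + ":" + e.getValue());
        }
        return byWord;
    }

    // Reducer stage: join all "uri:number" postings of a word, each followed by a comma.
    static Map<String, String> reduce(Map<String, List<String>> combined) {
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : combined.entrySet()) {
            StringBuilder sb = new StringBuilder();
            for (String posting : e.getValue()) {
                sb.append(posting).append(",");
            }
            out.put(e.getKey(), sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> mapOutput = new ArrayList<>();
        mapOutput.addAll(map("news1", "hello, world! hello, urey!"));
        mapOutput.addAll(map("news2", "hello, mapreduce!"));
        mapOutput.addAll(map("news3", "hello, \"hadoop\"!"));
        reduce(combine(mapOutput)).forEach((word, postings) ->
                System.out.println(word + " " + postings));
    }
}
```

This simulation sorts the postings of each word by file name, so the order within a line may differ from the sample output above, where the order depends on how Hadoop delivers values to the reducer.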

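Source code:

mapper:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public static class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    Text outKey = new Text();
    Text outValue = new Text();
    Pattern pattern = Pattern.compile("[a-zA-Z0-9]+");
    Matcher match;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The method body is missing from this listing. What follows is an assumed
        // reconstruction that emits "word:uri" -> "1", matching the word:uri 1 format above.
        String uri = ((FileSplit) context.getInputSplit()).getPath().toString();
        match = pattern.matcher(value.toString());
        while (match.find()) {
            outKey.set(match.group() + ":" + uri);
            outValue.set("1");
            context.write(outKey, outValue);
        }
    }
}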
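The mapper's regular expression is what strips punctuation and quotes from the input lines. As a small check, assuming the pattern is `[a-zA-Z0-9]+`, the sketch below runs it over one of the sample lines; the class `TokenizeSketch` is a made-up name for this demo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the mapper's tokenizing regex.
final class TokenizeSketch {
    static List<String> tokens(String line) {
        Pattern pattern = Pattern.compile("[a-zA-Z0-9]+");
        Matcher match = pattern.matcher(line);
        List<String> words = new ArrayList<>();
        while (match.find()) {
            words.add(match.group()); // each maximal run of letters and digits
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokens("hello, \"hadoop\"!")); // prints [hello, hadoop]
    }
}
```

Commas, exclamation marks, and quotation marks never match, so they act purely as separators. Note that this pattern only recognizes ASCII letters and digits.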

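combiner:

public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {
    Text outKey = new Text();
    Text outValue = new Text();
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0; // assumed: add up the counts emitted for this word:uri key
        for (Text value : values) {
            sum += Integer.parseInt(value.toString());
        }
        String[] keys = key.toString().split(":"); // key is "word:uri"
        outKey.set(keys[0]);
        int index = keys[keys.length - 1].lastIndexOf('/'); // keep only the file name
        outValue.set(keys[keys.length - 1].substring(index + 1) + ":" + sum);
        context.write(outKey, outValue);
    }
}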
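The key handling in the combiner is easy to misread, because the URI part of a `word:uri` key can itself contain colons. The sketch below isolates that logic in plain Java; `KeyParseSketch`, `rekey`, and the HDFS URI are hypothetical and chosen only for illustration.

```java
// Hypothetical demo of the combiner's key parsing, outside of Hadoop.
final class KeyParseSketch {
    // Turns a "word:uri" key plus a count into {word, "fileName:count"}.
    static String[] rekey(String key, int sum) {
        String[] keys = key.split(":");
        String last = keys[keys.length - 1];   // the piece that holds the file name
        int index = last.lastIndexOf('/');     // -1 when there is no path separator
        return new String[] { keys[0], last.substring(index + 1) + ":" + sum };
    }

    public static void main(String[] args) {
        // Made-up URI: splitting on ':' yields "hello", "hdfs", "//localhost", "9000/input/news1".
        String[] out = rekey("hello:hdfs://localhost:9000/input/news1", 2);
        System.out.println(out[0] + " " + out[1]); // prints hello news1:2
    }
}
```

Taking the first piece as the word and the text after the last `/` of the last piece as the file name works for URIs of this shape. A file name that itself contains a colon would be cut short.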

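reducer:

public class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder string = new StringBuilder();
        for (Text value : values) {
            string.append(value.toString()).append(","); // one "uri:number," posting
        }
        context.write(key, new Text(string.toString()));
    }
}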

For background on inverted indexes, see the article "搜尋引擎-倒排索引基礎知識" (Search Engines: Inverted Index Basics).
