Learning: Counting and Sorting Large Files

This post records what I learned from Chen Shuo's algorithm and code for the problem below.

The problem is as follows:

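There are 10 files of 1 GB each. Every line of every file holds one user query (generate them randomly yourself), and queries may repeat both within a file and across files. Sort the queries by frequency.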

(The point here is that the input is large, so ten 1 GB files and one 10 GB file are handled on the same principle.)

Chen Shuo's code is here:

It is a very elegant piece of code; both the approach and the code itself are well worth reading.

【Solution】

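The basic procedure is to keep reading the input and counting queries in memory. When a memory limit is reached, the partial counts are written out, with each query assigned to one of 10 files by its hash value. This repeats until all input has been read. Then the queries in each of the 10 files are sorted by count, and a 10-way merge produces the final result.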

shuffle

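The input files are passed on the command line and read line by line into a hash map, which counts queries as they are read. When the map reaches a specified size (10*1000*1000 entries, chosen mainly with memory capacity in mind), its contents are written out: each entry goes to file number hash(query) % 10 of the 10 output files, which guarantees that identical queries always land in the same file. This continues until the input is exhausted. So if the input totals 10 GB, each output file is smaller than 1 GB (because identical queries have already been partly merged) and can be processed in memory on its own. Note that although identical queries are now in the same file, they may still be scattered across several places in it, for example: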

query1 10

query2 5

query3 3

query1 3

query4 3

query2 7
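The shuffle step described above can be sketched roughly as follows. This is a minimal illustration, not Chen Shuo's actual code: the names `flush_to_shards` and `shuffle_input` are hypothetical, the shards are passed in as generic output streams instead of being opened as files, and a tab is assumed as the separator between query and count because queries may themselves contain spaces.

```cpp
#include <cstddef>
#include <functional>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: write every (query, count) pair to the shard selected
// by hash(query) % shards.size(), then empty the in-memory map.
void flush_to_shards(std::unordered_map<std::string, long>& counts,
                     std::vector<std::ostream*>& shards) {
    std::hash<std::string> hasher;
    for (const auto& kv : counts) {
        std::size_t idx = hasher(kv.first) % shards.size();
        // Assumed record format: query, tab, count, newline.
        *shards[idx] << kv.first << '\t' << kv.second << '\n';
    }
    counts.clear();
}

// Read queries line by line, count them in memory, and flush to the shards
// whenever the map holds max_keys distinct queries.
void shuffle_input(std::istream& in, std::vector<std::ostream*>& shards,
                   std::size_t max_keys) {
    std::unordered_map<std::string, long> counts;
    std::string line;
    while (std::getline(in, line)) {
        ++counts[line];
        if (counts.size() >= max_keys) {
            flush_to_shards(counts, shards);
        }
    }
    flush_to_shards(counts, shards);  // write whatever is left at end of input
}
```

Because the shard index depends only on the query's hash, every partial count for a given query ends up in the same shard, even when it is flushed several times.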

reduce

Within each file, merge the entries for identical queries, then sort the queries by count.
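A minimal sketch of this reduce step, under the same assumptions as before (hypothetical function name `reduce_shard`, tab-separated "query, count" records). Sorting in descending order of count, with ties broken by query text, is also an assumption; the original only says "sort by count".

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

typedef std::pair<std::string, long> QueryCount;

// Read one shard, add up the partial counts of each query, and return the
// queries sorted by total count (largest first).
std::vector<QueryCount> reduce_shard(std::istream& in) {
    std::unordered_map<std::string, long> counts;
    std::string line;
    while (std::getline(in, line)) {
        std::size_t tab = line.rfind('\t');
        if (tab == std::string::npos) continue;  // skip malformed lines
        counts[line.substr(0, tab)] += std::stol(line.substr(tab + 1));
    }

    std::vector<QueryCount> sorted(counts.begin(), counts.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const QueryCount& a, const QueryCount& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;  // deterministic order for ties
              });
    return sorted;
}
```

Applied to the scattered entries in the shuffle example (10 and 3 for query1, 5 and 7 for query2), this gives query1 a total of 13 and query2 a total of 12, followed by query3 and query4 with 3 each.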

merge

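The 10 sorted files are merged to produce the final result. The merge uses a heap of 10 elements; compared with repeatedly merging files two at a time, this greatly reduces the time spent reading files.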
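The heap-based merge can be sketched like this. It is only an illustration of the technique: `Record`, `read_record`, and `merge_sorted` are hypothetical names, the inputs are assumed to be streams of tab-separated "query, count" records already sorted by descending count, and the number of inputs is whatever the caller passes (10 in the article).

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <queue>
#include <sstream>
#include <string>
#include <vector>

struct Record {
    long count;
    std::string query;
    std::size_t source;  // index of the input this record came from
};

// Order the priority queue as a max-heap on count.
struct ByCount {
    bool operator()(const Record& a, const Record& b) const {
        return a.count < b.count;
    }
};

// Read the next "query<TAB>count" record from one input; false at end of input.
bool read_record(std::istream& in, std::size_t source, Record& out) {
    std::string line;
    if (!std::getline(in, line)) return false;
    std::size_t tab = line.rfind('\t');
    if (tab == std::string::npos) return false;
    out.query = line.substr(0, tab);
    out.count = std::stol(line.substr(tab + 1));
    out.source = source;
    return true;
}

// K-way merge: the heap holds at most one record per input, so each input
// is read sequentially exactly once.
void merge_sorted(std::vector<std::istream*>& inputs, std::ostream& out) {
    std::priority_queue<Record, std::vector<Record>, ByCount> heap;
    Record rec;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        if (read_record(*inputs[i], i, rec)) heap.push(rec);
    }
    while (!heap.empty()) {
        Record top = heap.top();
        heap.pop();
        out << top.query << '\t' << top.count << '\n';
        // Refill the heap from the input that just supplied the top record.
        if (read_record(*inputs[top.source], top.source, rec)) heap.push(rec);
    }
}
```

With k inputs the heap never holds more than k records, so memory use stays small no matter how large the sorted files are.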

【Running】

The program runs only on Linux and requires Boost. On Ubuntu, install Boost first:

apt-get install libboost-dev

Then compile. The program uses C++0x features, so -std=c++0x is required:

g++ sort.cpp -o sort -std=c++0x

Before running, prepare the input data. Here it is generated randomly with Lua:

-- updated version, use a table thus no gc involved
local file = io.open("file.txt", "w")
local t = {}
for i = 1, 500000000 do
    local n = i % math.random(10000)
    local str = string.format("this is a number %d\n", n)
    table.insert(t, str)
    -- write the buffered lines out every 10000 iterations
    if i % 10000 == 0 then
        file:write(table.concat(t))
        t = {}
    end
end
file:close()

Now run it:

sort file.txt

The output is as follows:

$ time sort file.txt

processing file.txt

shuffling done

reading shard-00000-of-00010

writing count-00000-of-00010

reading shard-00001-of-00010

writing count-00001-of-00010

reading shard-00002-of-00010

writing count-00002-of-00010

reading shard-00003-of-00010

writing count-00003-of-00010

reading shard-00004-of-00010

writing count-00004-of-00010

reading shard-00005-of-00010

writing count-00005-of-00010

reading shard-00006-of-00010

writing count-00006-of-00010

reading shard-00007-of-00010

writing count-00007-of-00010

reading shard-00008-of-00010

writing count-00008-of-00010

reading shard-00009-of-00010

writing count-00009-of-00010

reducing done

merging done

real 19m18.805s

user 14m20.726s

sys 1m37.758s

On my 32-bit Ubuntu 11.10 virtual machine, with 1 GB of memory and one 2.5 GHz CPU core, processing a 15 GB file took about 19 minutes.
