Hadoop Study Notes: Compression Implementation Explained

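Hadoop is a general-purpose platform for processing massive data sets, and every job has to work through a large volume of data. Compressing that data inside Hadoop reduces disk usage and speeds up transfers to and from disk and across the network, which in turn improves overall processing efficiency. When choosing a compression format, the two main considerations are compression speed and whether the compressed file can be split. In summary, compression offers the following benefits: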

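1. It reduces the disk space the data occupies.

2. It speeds up data transfer on disk and over the network, which improves the system's processing speed.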

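Hadoop recognizes compression formats automatically. If a compressed file carries the extension that matches its format (such as .lzo, .gz, or .bz2), Hadoop picks the corresponding codec from that extension and decompresses the data. Hadoop handles this step itself; we only need to make sure the compressed input files keep their extensions.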
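The extension-based lookup described above can be sketched in plain Java. This is a simplified stand-in, not Hadoop's own implementation: the class `CodecLookup` and the method `codecFor` are hypothetical names, and the mapping only lists the codec class names commonly associated with each suffix.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that mimics choosing a codec from a file suffix.
class CodecLookup {
    // Suffix -> codec class name. The first three classes ship with Hadoop;
    // LzopCodec comes from the separate hadoop-lzo project.
    private static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODECS.put(".lzo", "com.hadoop.compression.lzo.LzopCodec");
    }

    /** Returns the codec class name for a file name, or null if the suffix is unknown. */
    static String codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODECS.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null; // no known suffix: treat the file as uncompressed
    }

    public static void main(String[] args) {
        System.out.println(codecFor("logs/part-00000.gz"));
        System.out.println(codecFor("logs/part-00000.txt"));
    }
}
```

A file without a recognized suffix yields null here, which is why input files need to keep their extensions.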

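The table below summarizes Hadoop's support for each compression format:

Table 1. Compression formats

Format    Tool    Algorithm   Extension   Multiple files   Splittable
DEFLATE   N/A     DEFLATE     .deflate    No               No
gzip      gzip    DEFLATE     .gz         No               No
ZIP       zip     DEFLATE     .zip        Yes              Yes, at file boundaries
bzip2     bzip2   bzip2       .bz2        No               Yes
LZO       lzop    LZO         .lzo        No               Yes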

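The table below compares the compression ratio, compression time, and decompression time of the algorithms under Hadoop:

Table 2. Performance comparison

Algorithm   Original size   Compressed size   Compression speed   Decompression speed
gzip        8.3 GB          1.8 GB            17.5 MB/s           58 MB/s
bzip2       8.3 GB          1.1 GB            2.4 MB/s            9.5 MB/s
LZO-best    8.3 GB          2 GB              4 MB/s              60.6 MB/s
LZO         8.3 GB          2.9 GB            49.3 MB/s           74.6 MB/s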

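From this we can conclude:

1) bzip2 clearly achieves the best compression ratio, but it compresses slowly. It is splittable.

2) gzip compresses less well than bzip2, but it compresses and decompresses much faster. It does not support splitting.

3) LZO compresses less well than both bzip2 and gzip, but it has the fastest compression and decompression speeds, and it supports splitting (once the file has been indexed).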
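To make these conclusions concrete, the figures in Table 2 can be turned into compression ratios and rough compression times. The sketch below only does arithmetic on the table's numbers; the class and method names are hypothetical, and it assumes 1 GB = 1024 MB.

```java
// Hypothetical helper: derives ratios and times from the Table 2 figures.
class Table2Ratios {
    /** Compressed size as a fraction of the original size. */
    static double ratio(double compressedGb, double originalGb) {
        return compressedGb / originalGb;
    }

    /** Seconds needed to compress the input at the given speed (assumes 1 GB = 1024 MB). */
    static double secondsToCompress(double originalGb, double mbPerSecond) {
        return originalGb * 1024.0 / mbPerSecond;
    }

    public static void main(String[] args) {
        String[] names = {"gzip", "bzip2", "LZO-best", "LZO"};
        double[] compressedGb = {1.8, 1.1, 2.0, 2.9};
        double[] speedMbPerSec = {17.5, 2.4, 4.0, 49.3};
        double originalGb = 8.3;
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-9s ratio=%.1f%%  compress time=%.0f s%n",
                    names[i],
                    100.0 * ratio(compressedGb[i], originalGb),
                    secondsToCompress(originalGb, speedMbPerSec[i]));
        }
    }
}
```

By this measure bzip2 shrinks the data to roughly 13% of its original size but needs about an hour to do so, while plain LZO leaves roughly 35% and finishes in about three minutes.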

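Whether a file can be split matters a great deal in Hadoop: it determines how many map tasks are launched when a job runs, which directly affects the job's execution efficiency.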
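A rough calculation shows why splittability matters. The sketch below assumes a 128 MB split size and ignores the slack Hadoop allows on the last split, so it is an estimate rather than the exact count Hadoop would produce; `SplitEstimate` and `estimateMapTasks` are hypothetical names.

```java
// Hypothetical estimate of map tasks for a single non-empty input file.
class SplitEstimate {
    /**
     * A non-splittable file is read by exactly one map task.
     * A splittable file gets roughly one task per split (rounded up).
     */
    static long estimateMapTasks(long fileBytes, long splitBytes, boolean splittable) {
        if (!splittable || fileBytes <= splitBytes) {
            return 1;
        }
        return (fileBytes + splitBytes - 1) / splitBytes; // ceiling division
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        long splitSize = 128L * 1024 * 1024; // assumed split size
        System.out.println("splittable:     " + estimateMapTasks(oneGb, splitSize, true));
        System.out.println("not splittable: " + estimateMapTasks(oneGb, splitSize, false));
    }
}
```

Under these assumptions a 1 GB splittable file is processed by 8 map tasks in parallel, while the same file in a non-splittable format is handled by a single task.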

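Every compression algorithm trades time against space: faster compression and decompression usually come at the cost of larger output. The format to use should be chosen according to the needs of the workload.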
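The time and space trade-off can be observed with a single algorithm by changing its compression level. The sketch below uses the JDK's `java.util.zip.Deflater` (the DEFLATE algorithm behind gzip) on synthetic log-like data; the class name and the sample data are illustrative assumptions, and the measured sizes and times will vary by machine and input.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustrative comparison of DEFLATE levels on made-up data.
class LevelTradeoff {
    /** Compresses the input at the given level and returns the compressed size in bytes. */
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf); // only the byte count matters here
        }
        deflater.end();
        return total;
    }

    /** Builds synthetic log-like text (an assumption made for this demo). */
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++) {
            sb.append("user=").append(i % 97)
              .append(" action=click page=/item/").append(i * 31 % 1000)
              .append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData();
        int[] levels = {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION};
        for (int level : levels) {
            long start = System.nanoTime();
            int size = compressedSize(data, level);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println("level " + level + ": " + data.length
                    + " -> " + size + " bytes in " + micros + " us");
        }
    }
}
```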

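Usage

MapReduce can use compression at three stages.

1. Compressed input files. If the input files are compressed, MapReduce decompresses them automatically as it reads them.

2. Compressing the intermediate map output within a MapReduce job. This is done as follows:

conf.setBoolean("mapreduce.map.output.compress", true);

conf.setClass("mapreduce.map.output.compress.codec", GzipCodec.class, CompressionCodec.class);

The last line of code specifies the codec used for the map output. (Hadoop 1.x used the property names mapred.compress.map.output and mapred.map.output.compression.codec instead.)

3. Compressing the final reduce output within a MapReduce job. This is done as follows:

FileOutputFormat.setCompressOutput(job, true);

FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);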

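The codec factory class CompressionCodecFactory is the central controller of the compression framework. Its main job is to look up the matching codec (a CompressionCodec) from a file's extension. The code below illustrates compression and decompression.

Compressing a file: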

import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.compress.*;
import org.apache.hadoop.util.ReflectionUtils;

// Assumes conf (Configuration), fs (FileSystem), pathList and pathOutput are defined elsewhere.
FSDataInputStream data_in = fs.open(pathList[0]);  // open the source file and get its input stream
byte[] buf = new byte[1000];
int len = data_in.read(buf);  // read up to 1000 bytes of the file into buf
data_in.close();
Class<?> codecClass = GzipCodec.class;
// ReflectionUtils hands conf to the codec, which a bare newInstance() would not do
CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
FSDataOutputStream zip_out = fs.create(new Path(pathOutput));  // create the compressed output file
CompressionOutputStream pre_out = codec.createOutputStream(zip_out);
pre_out.write(buf, 0, len);  // write buf through the compressor
pre_out.finish();            // finish writing the compressed data
pre_out.close();

Decompressing a file:

import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.compress.*;

// Assumes conf (Configuration), fs (FileSystem), pathInput and pathOutput are defined elsewhere.
CompressionCodecFactory factory = new CompressionCodecFactory(conf);
Path path = new Path(pathInput);
CompressionCodec codec = factory.getCodec(path);  // chosen by file extension; null if none matches
FSDataInputStream data_in = fs.open(path);
CompressionInputStream pre_in = codec.createInputStream(data_in);
byte[] buf = new byte[1000];
int len = pre_in.read(buf, 0, buf.length);  // read up to 1000 decompressed bytes
System.out.println(len);
pre_in.close();
String str = new String(buf, 0, len);
System.out.println(str);
// write the data read from the compressed file out to a text file
FSDataOutputStream data_out = fs.create(new Path(pathOutput));
data_out.write(buf, 0, len);
data_out.close();
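The snippets above move a single 1000-byte buffer, so anything beyond the first 1000 bytes is dropped. A whole file needs a read loop. The sketch below shows that loop with the JDK's own gzip streams (`java.util.zip`) standing in for Hadoop's `CompressionOutputStream` and `CompressionInputStream`, and in-memory byte arrays standing in for HDFS files; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo: stream an entire input through a compressor and back.
class StreamCopy {
    /** Copies everything from in to out, one buffer at a time. */
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) { // -1 signals end of stream
            out.write(buf, 0, n);
        }
    }

    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(sink)) {
            copy(new ByteArrayInputStream(raw), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray(); // closing the stream has written the gzip trailer
    }

    static byte[] gunzip(byte[] packed) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            copy(in, sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = new byte[5000]; // larger than one 4096-byte buffer
        Arrays.fill(raw, (byte) 'a');
        byte[] restored = gunzip(gzip(raw));
        System.out.println("round trip ok: " + Arrays.equals(raw, restored));
    }
}
```

The same loop shape applies when the streams come from a Hadoop codec, since those stream classes extend the standard java.io stream types.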

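Studying Hadoop's compression framework and comparing the efficiency of the available formats makes it possible to improve data processing efficiency on the platform by compressing data appropriately. The next time a massive data set needs processing, Hadoop's compression support can get far more done with far less effort.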
