Nutch Source Code Reading 7: Generator

2021-09-01 18:14:20

Continuing down the code, we reach the second job.

....................

....................

....................

// read the subdirectories generated in the temp
// output and turn them into segments
List<Path> generatedSegments = new ArrayList<Path>();

// read the fetchlist segments produced by the previous job
FileStatus[] status = fs.listStatus(tempDir);
try {
  ...
} catch (Exception e) {
  ...
}

if (generatedSegments.size() == 0) {
  ...
}

....................

....................

....................
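The scan over the temporary output can be illustrated with a small standalone sketch. It uses `java.nio.file` rather than the Hadoop `FileSystem` API so that it runs without a cluster, and it assumes that the previous job names its output subdirectories with a `fetchlist-` prefix; that prefix, the class name and the helper names are assumptions of this sketch and do not appear in the excerpt above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FetchlistScanSketch {
    // Assumed naming convention for the previous job's output subdirectories.
    static final String FETCHLIST_PREFIX = "fetchlist-";

    /** Returns the subdirectories of tempDir that look like fetch lists, sorted by path. */
    static List<Path> findFetchlists(Path tempDir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(tempDir)) {
            for (Path entry : entries) {
                boolean isFetchlist = Files.isDirectory(entry)
                        && entry.getFileName().toString().startsWith(FETCHLIST_PREFIX);
                if (isFetchlist) {
                    result.add(entry);
                }
            }
        }
        Collections.sort(result);
        return result;
    }

    /** Builds a throwaway temp directory and returns the names of the fetch lists found in it. */
    static List<String> demo() {
        try {
            Path tempDir = Files.createTempDirectory("generate-temp");
            Files.createDirectory(tempDir.resolve("fetchlist-1"));
            Files.createDirectory(tempDir.resolve("fetchlist-2"));
            Files.createDirectory(tempDir.resolve("_logs")); // ignored: wrong prefix
            List<String> names = new ArrayList<>();
            for (Path p : findFetchlists(tempDir)) {
                names.add(p.getFileName().toString());
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> names = demo();
        if (names.isEmpty()) {
            // Mirrors the size() == 0 check in the excerpt: nothing to turn into segments.
            System.out.println("no fetch lists found");
            return;
        }
        for (String name : names) {
            // In the real job, each of these would be handed to partitionSegment.
            System.out.println("would partition: " + name);
        }
    }
}
```

Each directory that passes the filter would become one segment, which is why an empty result is treated as "nothing selected for fetching".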

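// The partitioning here is done mainly by URLPartitioner. Which key it groups by is
// controlled by a configuration parameter, with PARTITION_MODE_DOMAIN and PARTITION_MODE_IP
// among the options; by default it partitions by the hash code of the URL's host.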

private Path partitionSegment(FileSystem fs, Path segmentsDir, Path inputDir,
    int numLists) throws IOException {
  // create a new directory, named after the current time
  Path segment = new Path(segmentsDir, generateSegmentName());
  // under that directory, create the dedicated crawl_generate directory
  Path output = new Path(segment, CrawlDatum.GENERATE_DIR_NAME);

  LOG.info("Generator: segment: " + segment);

  NutchJob job = new NutchJob(getConf());
  job.setJobName("generate: partition " + segment);
  // random seed, read by the partitioner
  job.setInt("partition.url.seed", new Random().nextInt());

  FileInputFormat.addInputPath(job, inputDir);
  job.setInputFormat(SequenceFileInputFormat.class);

  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(SelectorEntry.class);
  job.setPartitionerClass(URLPartitioner.class);
  job.setReducerClass(PartitionReducer.class);
  // one reduce task per fetch list
  job.setNumReduceTasks(numLists);

  FileOutputFormat.setOutputPath(job, output);
  job.setOutputFormat(SequenceFileOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(CrawlDatum.class);
  job.setOutputKeyComparatorClass(HashComparator.class);
  JobClient.runJob(job);
  return segment;
}
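To make the partitioning step more concrete, here is a minimal standalone sketch of how a URL could be mapped to one of `numLists` fetch lists. It is not the Nutch `URLPartitioner` itself: the class name, the `Mode` enum and the `naiveDomain` helper are hypothetical, the IP mode is left out because it needs DNS lookups, and the choice of the host as the default key (with the whole URL string as a fallback when parsing fails) is an assumption of the sketch. The `seed` parameter plays the role of the `partition.url.seed` value set in the job above.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Random;

public class PartitionSketch {
    /** Hypothetical grouping modes for this sketch (IP mode omitted). */
    enum Mode { HOST, DOMAIN }

    // Naive stand-in for a real domain extractor: keeps the last two labels of the host.
    static String naiveDomain(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    /** Maps a URL to a partition index in [0, numLists). */
    static int partitionFor(String urlString, Mode mode, int seed, int numLists) {
        int hash = urlString.hashCode(); // fallback if the URL cannot be parsed
        try {
            String host = new URI(urlString).getHost();
            if (host != null) {
                String key = (mode == Mode.DOMAIN) ? naiveDomain(host) : host;
                hash = key.hashCode();
            }
        } catch (URISyntaxException e) {
            // keep the whole-URL hash
        }
        hash ^= seed; // a different seed per run shuffles which key goes to which list
        return (hash & Integer.MAX_VALUE) % numLists; // mask the sign bit before the modulo
    }

    public static void main(String[] args) {
        int seed = new Random().nextInt();
        String[] urls = {
            "http://a.example.com/page1",
            "http://a.example.com/page2",
            "http://b.example.com/page1",
        };
        for (String url : urls) {
            System.out.println(url
                + " -> host mode: " + partitionFor(url, Mode.HOST, seed, 4)
                + ", domain mode: " + partitionFor(url, Mode.DOMAIN, seed, 4));
        }
    }
}
```

Because the key is the host or domain rather than the full URL, all URLs that share that key end up in the same fetch list, and since the job sets the number of reduce tasks to `numLists`, each partition index corresponds to one output file under `crawl_generate`.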
