A Simple Web Crawler Example

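To a student, the term "web crawler" may sound very sophisticated, but a simple implementation is something most students can follow without much trouble. A web crawler treats the whole Internet literally as a web, like a spider's web, and the application is like a bug crawling across it according to a set of rules.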

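The most widely used protocol on the Internet today is HTTP(S), and the example in this article is built on it. It is only an example and does not cover the complex algorithms a real crawler needs (which are in fact the most important part).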

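Design approach:

The program starts from one or more URLs. It fetches each URL's content over HTTP(S), processes that content to extract the information to be crawled, collects the URL links found in the content, and then repeats the same steps for those links.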
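The loop described above can be sketched on its own, without any networking. This is a minimal single-threaded sketch under stated assumptions: `CrawlLoopSketch`, `crawl`, the `linksOf` function (a stand-in for "fetch the page and extract its links") and the `example` URLs are all illustrative names, not part of the program listed below.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class CrawlLoopSketch {
    /**
     * Visits pages breadth-first starting from seed and returns them in visit order.
     * linksOf is an assumed stand-in for "fetch the page and extract its links".
     */
    static List<String> crawl(String seed, Function<String, List<String>> linksOf, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>();   // URLs waiting to be processed
        Set<String> seen = new HashSet<>();            // URLs that were queued at least once
        List<String> visited = new ArrayList<>();      // processing order, for inspection
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            visited.add(url);
            for (String link : linksOf.apply(url)) {
                if (seen.add(link)) {                  // add() returns false for duplicates
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" with hypothetical URLs, so the sketch runs offline.
        Map<String, List<String>> web = Map.of(
                "http://a.example/", List.of("http://b.example/", "http://c.example/"),
                "http://b.example/", List.of("http://a.example/", "http://c.example/"),
                "http://c.example/", List.of("http://d.example/"));
        List<String> order = crawl("http://a.example/", u -> web.getOrDefault(u, List.of()), 10);
        System.out.println(order);
    }
}
```

The `seen` set is what keeps the loop from revisiting pages that link back to each other, and `maxPages` is one simple way to make sure the loop ends.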

Without further ado, the details are in the commented code below:

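import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Summary: main program.
 * Assumption: the class name and several statements were lost from the published
 * listing; "SpiderApp" and every part marked "reconstructed" are best guesses.
 *
 * @author hwz
 */
public class SpiderApp {
    private Integer corePoolSize = 10;
    private Integer maxPoolSize = 20;
    private ThreadPoolExecutor executor;
    /** Work queue */
    private SpiderQueue workQueue;

    public void start(String url) throws Exception {
        try {
            // Reconstructed: create the thread pool and seed the work queue
            executor = new ThreadPoolExecutor(corePoolSize, maxPoolSize, 60, TimeUnit.SECONDS,
                    new LinkedBlockingQueue<Runnable>());
            workQueue = new SpiderQueue();
            workQueue.add(new SpiderUrl(url, 1));
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }
        // Submit the first task
        executor.submit(new SimpleSpider(workQueue, "thread-" + "main"));
        int i = 0;
        int idle = 0;
        while (true) {
            // Reconstructed: the original conditions were lost. Add a worker while there
            // is a backlog; stop once the queue has stayed empty and no worker is active.
            if (workQueue.size() > 20 && executor.getActiveCount() < maxPoolSize) {
                idle = 0;
                executor.submit(new SimpleSpider(workQueue, "thread-" + (i++)));
            } else if (workQueue.size() == 0) {
                if (++idle > 10 && executor.getActiveCount() == 0) {
                    break;
                }
                Thread.sleep(1000);
            } else {
                idle = 0;
                Thread.sleep(1000);
            }
        }
        System.out.println("end!,workQueue.size=" + workQueue.size()
                + ",executorQueue.activeCount=" + executor.getActiveCount()
                + ",executorQueue.completedTaskCount=" + executor.getCompletedTaskCount()
                + ",i=" + i);
        workQueue.printAll();
        executor.shutdown();
        System.exit(0);
    }

    public static void main(String[] args) throws Exception {
        // Reconstructed: the seed URL in the original listing was lost, so it is read from args[0]
        new SpiderApp().start(args[0]);
    }
}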

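import java.util.ArrayList;
import java.util.List;

/**
 * Summary: custom synchronized work queue for the crawler, backed by an ArrayList.
 * Assumption: the field declaration and most method bodies were lost from the
 * published listing; the parts marked "reconstructed" are best guesses.
 *
 * @author hwz
 */
public class SpiderQueue {
    // Reconstructed: the pending URLs
    private final List<SpiderUrl> queue = new ArrayList<SpiderUrl>();

    public synchronized void add(SpiderUrl spiderUrl) {
        queue.add(spiderUrl);
    }

    public synchronized SpiderUrl poll() {
        // Reconstructed: return null instead of throwing when the queue is empty
        if (queue.isEmpty()) {
            return null;
        }
        // Print the result to the console for easy inspection
        SpiderUrl spiderUrl = queue.remove(0);
        System.out.println("SpiderQueue,poll,SpiderUrl=" + spiderUrl.toString()
                + ",remain size=" + queue.size());
        return spiderUrl;
    }

    public synchronized SpiderUrl peek() {
        // Reconstructed: guard against an empty queue
        if (queue.isEmpty()) {
            return null;
        }
        return queue.get(0);
    }

    /** Reconstructed: true if the URL is currently waiting in the queue. */
    public synchronized boolean isExsit(SpiderUrl spiderUrl) {
        return queue.contains(spiderUrl);
    }

    public synchronized int size() {
        return queue.size();
    }

    public void printAll() {
        // Reconstructed: dump whatever is still waiting in the queue
        for (SpiderUrl spiderUrl : new ArrayList<SpiderUrl>(queue)) {
            System.out.println(spiderUrl);
        }
    }
}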
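As a point of comparison with the ArrayList-backed queue above, here is a minimal sketch of the same idea built on `java.util.concurrent`. It is not the author's implementation: `FrontierSketch` and its method names are illustrative, and it deliberately remembers every URL it has ever accepted, so a URL is not queued a second time after it has been polled.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class FrontierSketch {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Queues url unless it was offered before; returns true if it was queued. */
    boolean offer(String url) {
        return seen.add(url) && pending.offer(url);
    }

    /** Waits up to timeoutMillis for a URL; returns null if none arrives in time. */
    String poll(long timeoutMillis) {
        try {
            return pending.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt for the caller
            return null;
        }
    }

    int size() {
        return pending.size();
    }

    public static void main(String[] args) {
        FrontierSketch frontier = new FrontierSketch();
        System.out.println(frontier.offer("http://a.example/"));   // first offer is accepted
        System.out.println(frontier.offer("http://a.example/"));   // duplicate is rejected
        System.out.println(frontier.poll(10));
    }
}
```

The timed `poll` lets an idle worker wait briefly for new work instead of sleeping for a fixed interval, and the hash-based `seen` set avoids scanning a list for every duplicate check.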

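/**
 * Summary: a URL for the crawler to work on.
 * Assumption: the fields, the constructor and the getter/setter bodies were lost
 * from the published listing and are reconstructed here.
 *
 * @author hwz
 */
public class SpiderUrl {
    private String url;
    /** Link depth, counted from the seed URL */
    private int deep;

    // Reconstructed: the original constructor signature is not visible in the listing
    public SpiderUrl(String url, int deep) {
        this.url = url;
        this.deep = deep;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public int getDeep() {
        return deep;
    }

    public void setDeep(int deep) {
        this.deep = deep;
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof SpiderUrl)) {
            return false;
        }
        SpiderUrl oth = (SpiderUrl) obj;
        return this.url.equals(oth.getUrl());
    }

    @Override
    public int hashCode() {
        return url.hashCode();
    }

    @Override
    public String toString() {
        return "SpiderUrl[url=" + url + ",deep=" + deep + "]";
    }
}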

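import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Summary: crawler worker class, the main implementation class.
 * Assumption: the fields, the constructor, both regular expressions and most method
 * bodies were lost from the published listing; the parts marked "reconstructed"
 * are best guesses.
 *
 * @author hwz
 */
public class SimpleSpider implements Runnable {
    /** Reconstructed: links deeper than this are not followed */
    private static final int MAX_DEEP = 3;
    private final SpiderQueue workQueue;
    private final String threadName;

    public SimpleSpider(SpiderQueue workQueue, String threadName) {
        this.workQueue = workQueue;
        this.threadName = threadName;
    }

    @Override
    public void run() {
        System.out.println(threadName + " start run");
        // Reconstructed: stop after ten consecutive empty polls
        int idle = 0;
        while (idle < 10) {
            SpiderUrl url = workQueue.poll();
            if (url != null) {
                idle = 0;
                parseUrl(url);
            } else {
                idle++;
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        System.out.println(threadName + " end run...");
    }

    /**
     * Parse a URL: fetch it, print its title and queue the links it contains.
     * @param url the URL to process
     */
    private void parseUrl(SpiderUrl url) {
        try {
            URLConnection connection = new URL(url.getUrl()).openConnection();
            connection.setConnectTimeout(5000);
            connection.setReadTimeout(5000);
            String content = getResource(connection);
            System.out.println(threadName + ",title=" + getTitle(content) + ",url=" + url);
            if (url.getDeep() < MAX_DEEP) {
                for (String link : getUrls(content)) {
                    SpiderUrl next = new SpiderUrl(link, url.getDeep() + 1);
                    if (!workQueue.isExsit(next)) {
                        workQueue.add(next);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Read the content of an HTTP URL.
     * @param connection an opened connection
     * @return the response body (assumed here to be UTF-8), or what was read before an error
     */
    private String getResource(URLConnection connection) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return sb.toString();
    }

    /**
     * Get the title from the URL content.
     * @param content the page content
     * @return the title, or null if the page has none
     */
    private String getTitle(String content) {
        // Reconstructed: the original pattern lost its HTML tags during extraction
        Pattern pattern = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher matcher = pattern.matcher(content);
        String title = null;
        if (matcher.find()) {
            title = matcher.group(1);
        }
        return title;
    }

    /**
     * Get the URL links that appear in the URL content.
     * @param content the page content
     * @return the absolute http(s) links found in the page
     */
    private List<String> getUrls(String content) {
        // Reconstructed: the original pattern lost its HTML tags during extraction
        Pattern pattern = Pattern.compile("<a\\s[^>]*href=[\"'](https?://[^\"'#>]+)[\"']",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(content);
        String a;
        String lastChar;
        List<String> links = new ArrayList<String>();
        while (matcher.find()) {
            a = matcher.group(1);
            // Reconstructed: drop a trailing slash so "http://x/" and "http://x" compare equal
            lastChar = a.substring(a.length() - 1);
            if (lastChar.equals("/")) {
                a = a.substring(0, a.length() - 1);
            }
            links.add(a);
        }
        return links;
    }
}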
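Pages often contain relative links such as `/about` or `news/today.html`, which have to be turned into absolute URLs before they can be fetched. A minimal sketch of one way to do that with `java.net.URI.resolve`, assuming the `href` values are quoted; `LinkResolveSketch`, `absoluteLinks`, the regular expression and the `example` URLs are illustrative and not taken from the listing above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkResolveSketch {
    // Assumed pattern: captures the value of a quoted href attribute inside an <a> tag.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the http(s) links in html, resolved against the page's own URL. */
    static List<String> absoluteLinks(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                URI resolved = base.resolve(m.group(1).trim());
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // The href is not a valid URI (for example it contains spaces); skip it.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">About</a> <a class='x' href='news/today.html'>News</a>"
                + " <a href=\"mailto:me@site.example\">Mail</a>";
        for (String link : absoluteLinks("http://site.example/docs/index.html", html)) {
            System.out.println(link);
        }
    }
}
```

A regular expression is only a rough way to read HTML; it is kept here because the rest of the example uses the same approach.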

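This code sample is only meant to illustrate a simple crawler; multithreading and HTTP handling were not given much thought. If you spot any errors, please point them out.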
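On the multithreading side, one thing a fuller version might do is let the worker pool drain before the program exits, rather than calling `System.exit(0)` straight after `shutdown()`. A minimal sketch of that pattern, assuming a fixed-size pool from `java.util.concurrent`; `PoolShutdownSketch` and `runAll` are illustrative names, and the tasks are trivial counters rather than real crawl work.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PoolShutdownSketch {
    /** Runs taskCount trivial tasks on a fixed pool and returns how many completed. */
    static int runAll(int taskCount, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            pool.execute(done::incrementAndGet);   // stand-in for one crawl task
        }
        pool.shutdown();                           // no new tasks; queued tasks still run
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                pool.shutdownNow();                // give up waiting and interrupt the workers
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println("completed=" + runAll(50, 4));
    }
}
```

Because `shutdown()` only stops new submissions, the `awaitTermination` call is what actually waits for the queued tasks to finish.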
