The Making of a Web Crawler

Everything below assumes Python 2.7.

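Crawler architecture:

URL manager: keeps track of the URLs still to be crawled and the URLs already crawled, which prevents duplicate fetches and infinite loops.

Page parser: extracts the data you want and picks up new URLs, which it hands to the URL manager so crawling can continue. It also filters the data so that only the valuable part is kept for processing.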
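The division of labour described above can be sketched as follows. This is a minimal illustration, not the course's implementation: the class name UrlManager, the FAKE_PAGES dictionary (an in-memory stand-in for real downloads) and the regex-based link extraction are all assumptions made for the example.

```python
import re
from collections import deque


class UrlManager(object):
    """Tracks URLs waiting to be crawled and URLs already seen."""

    def __init__(self):
        self.pending = deque()  # URLs waiting to be crawled, in discovery order
        self.seen = set()       # every URL ever added, crawled or not

    def add(self, url):
        # Skipping known URLs is what prevents repeats and endless loops.
        if url and url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def has_next(self):
        return len(self.pending) > 0

    def next(self):
        return self.pending.popleft()


# Hypothetical in-memory "site"; a real crawler would download these pages.
FAKE_PAGES = {
    'http://example.com/a': '<a href="http://example.com/b">b</a>',
    'http://example.com/b': '<a href="http://example.com/a">a</a> '
                            '<a href="http://example.com/c">c</a>',
    'http://example.com/c': 'no links here',
}

HREF_RE = re.compile(r'href="([^"]+)"')


def crawl(start_url, pages):
    """Visit every reachable page once and return the visit order."""
    manager = UrlManager()
    manager.add(start_url)
    visited = []
    while manager.has_next():
        url = manager.next()
        html = pages.get(url, '')           # stand-in for the download step
        visited.append(url)
        for link in HREF_RE.findall(html):  # parser hands new URLs to the manager
            manager.add(link)
    return visited


order = crawl('http://example.com/a', FAKE_PAGES)
```

Pages a and b link to each other, yet each is visited only once because the manager refuses URLs it has already seen.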

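Storing the data:

A Python set prevents duplicate entries.

For long-term storage, put the data in a relational database.

When performance matters, put the data in a cache database.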
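A small sketch of the first two options, assuming SQLite (from the standard library) as a stand-in for whichever relational database you actually use. The table name pages, its columns and the helper save_page are invented for illustration; a cache database is not shown.

```python
import sqlite3

# In-memory database for illustration; pass a file path to make it persistent.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT)')


def save_page(conn, url, title):
    """Store one crawled page; return True if the row was newly inserted."""
    cur = conn.execute(
        'INSERT OR IGNORE INTO pages (url, title) VALUES (?, ?)',
        (url, title)
    )
    conn.commit()
    return cur.rowcount == 1


seen = set()  # in-process duplicate check, as described above
crawled = [
    ('http://example.com/a', 'Page A'),
    ('http://example.com/b', 'Page B'),
    ('http://example.com/a', 'Page A'),  # duplicate, skipped by the set
]
for url, title in crawled:
    if url in seen:
        continue
    seen.add(url)
    save_page(conn, url, title)

row_count = conn.execute('SELECT COUNT(*) FROM pages').fetchone()[0]
```

The set catches duplicates within a single run, while the primary key on url protects the table across runs.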

urllib2.urlopen(url)

The simplest form:

import urllib2
response = urllib2.urlopen(url)  # open the URL
response.getcode()  # get the HTTP status code

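With parameters:

import urllib
request = urllib2.Request(url)
request.add_data(urllib.urlencode({'name': 'value'}))  # form data sent as the request body
request.add_header('User-Agent', '...')  # imitate a browser's request header
response = urllib2.urlopen(request)  # send the prepared request, not the bare URL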
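To check what such a request looks like before sending it, the object can be built and inspected offline. This sketch assumes a placeholder URL and a made-up User-Agent string, and adds a fallback import so it also runs on Python 3, where urllib2 was split into urllib.request and urllib.parse.

```python
try:                                    # Python 2.7, as used in these notes
    from urllib2 import Request
    from urllib import urlencode
except ImportError:                     # Python 3 moved these names
    from urllib.request import Request
    from urllib.parse import urlencode

url = 'http://example.com/login'        # placeholder URL, nothing is sent
form = urlencode({'name': 'value'}).encode('utf-8')  # request body as bytes

request = Request(url, data=form, headers={'User-Agent': 'Mozilla/5.0'})

method = request.get_method()           # a request with a body defaults to POST
agent = request.get_header('User-agent')  # header names are stored capitalized
```

Passing the request object to urlopen would then send it with the body and header attached.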

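With cookies, **, HTTPS, redirects, and so on:

import cookielib
cj = cookielib.CookieJar()  # create the cookie container
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))  # build an opener that uses the container
urllib2.install_opener(opener)  # install the opener globally
response = urllib2.urlopen(url)  # access the URL with cookies attached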

Parsers: regular expressions, html.parser, BeautifulSoup, lxml

Beautiful Soup 4 (bs4) is the usual choice.
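For comparison, the standard-library html.parser from the list above can collect links without any third-party package. This is only a sketch: the class name LinkCollector, the base URL and the sample markup are assumptions, and the fallback import is there so it runs on both Python 2.7 and Python 3.

```python
try:                                    # Python 3
    from html.parser import HTMLParser
    from urllib.parse import urljoin
except ImportError:                     # Python 2.7
    from HTMLParser import HTMLParser
    from urlparse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href=...> it sees."""

    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


collector = LinkCollector('http://example.com/index.html')
collector.feed('<p><a href="/elsie">Elsie</a> and <a href="lacie">Lacie</a></p>')
links = collector.links
```

Beautiful Soup wraps this kind of event handling behind a search API, which is why the notes below use it instead.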

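1. Create the BeautifulSoup object

soup = BeautifulSoup(
    html_doc,               # the document string
    'html.parser',          # the HTML parser to use
    from_encoding='utf-8'   # encoding of the HTML document
)

2. Search for nodes

soup.find_all(tag_name, attrs, string=text)  # search by tag name, attributes, or text; each accepts a regular expression

Example: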

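# coding:utf8
from bs4 import BeautifulSoup
import re
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
links = soup.find_all('a')  # every <a> tag in the document
# for link in links:
#     print link.name, link['href'], link.get_text()
# print 'only lacie'
# link = soup.find('a', href='http://example.com/lacie')
# print link
# print 'regex start....'
# reg = soup.find('a', href=re.compile(r'ill'))
# print reg.get_text()
# print 'p'
# p_node = soup.find('p', class_=re.compile(r"s"))
# print p_node.get_text()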

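Uncomment each statement and run it yourself to see what it prints.

The content above is excerpted from the slides of an imooc (慕課網) instructor.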
