A simple first crawler implementation

django:

# Create a project

django-admin startproject mysite

cd mysite

# Start the project

python manage.py runserver

scrapy:

# Create a project: scrapy startproject <project name>

scrapy startproject xdb

cd xdb

# Create a spider: scrapy genspider <spider name> <spider domain>

scrapy genspider chouti chouti.com

scrapy genspider cnblogs cnblogs.com

# Run the spider (add --nolog to suppress log output)

scrapy crawl chouti

scrapy crawl chouti --nolog

"""
What the Scrapy source code does with a pipeline:
1. Check whether the XdbPipeline class defines from_crawler
   yes: obj = XdbPipeline.from_crawler(...)
   no:  obj = XdbPipeline()
2. obj.open_spider()
3. obj.process_item() | obj.process_item() | obj.process_item() | ...
4. obj.close_spider()
"""
from scrapy.exceptions import DropItem


class XdbPipeline(object):
    def __init__(self, path):
        self.f = None
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        '''
        Called at initialization time to create the pipeline object.
        :param crawler:
        :return:
        '''
        path = crawler.settings.get('HREF_FILE_PATH')
        return cls(path)

    def open_spider(self, spider):
        '''
        Called when the spider starts running.
        :param spider:
        :return:
        '''
        self.f = open(self.path, 'a+')

    def process_item(self, item, spider):
        # print(item.get("text"))
        self.f.write(item.get('href') + '\n')
        return item  # handed to process_item of the next pipeline
        # raise DropItem()  # raise this instead of returning, and process_item of later pipelines will not run

    def close_spider(self, spider):
        '''
        Called when the spider is closed.
        :param spider:
        :return:
        '''
        self.f.close()
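
The four steps listed in the docstring above can be sketched without Scrapy installed. This is only an illustration of the calling order, not Scrapy's actual implementation: FakeSettings, FakeCrawler, ListPipeline, run_pipeline, and the HREF_PREFIX setting are all hypothetical names made up for this example.

```python
class FakeSettings:
    """Hypothetical stand-in for crawler.settings (not Scrapy's class)."""

    def __init__(self, values):
        self._values = values

    def get(self, name, default=None):
        return self._values.get(name, default)


class FakeCrawler:
    """Hypothetical stand-in that exposes only a .settings attribute."""

    def __init__(self, values):
        self.settings = FakeSettings(values)


class ListPipeline:
    """Records calls in memory instead of writing hrefs to a file."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.events = []

    @classmethod
    def from_crawler(cls, crawler):
        # Read configuration from settings, then build the object.
        return cls(crawler.settings.get("HREF_PREFIX", ""))

    def open_spider(self, spider):
        self.events.append("open")

    def process_item(self, item, spider):
        self.events.append(self.prefix + item["href"])
        return item

    def close_spider(self, spider):
        self.events.append("close")


def run_pipeline(pipeline_cls, crawler, items, spider=None):
    # Step 1: prefer from_crawler when the class defines it.
    if hasattr(pipeline_cls, "from_crawler"):
        obj = pipeline_cls.from_crawler(crawler)
    else:
        obj = pipeline_cls()
    obj.open_spider(spider)  # step 2: once, when the spider starts
    for item in items:  # step 3: once per scraped item
        obj.process_item(item, spider)
    obj.close_spider(spider)  # step 4: once, when the spider closes
    return obj


crawler = FakeCrawler({"HREF_PREFIX": "https://example.com"})
pipeline = run_pipeline(ListPipeline, crawler, [{"href": "/a"}, {"href": "/b"}])
print(pipeline.events)
```

The point of from_crawler is that the pipeline can read its configuration from the settings instead of hard-coding it, which is why the file path is not passed in directly.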

Persistence: pipelines

pipelines.py

class XdbPipeline(object):
    def __init__(self, path):
        self.f = None
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        path = crawler.settings.get('HREF_FILE_PATH')
        return cls(path)

    def open_spider(self, spider):
        self.f = open(self.path, 'a+')

    def process_item(self, item, spider):
        # print(item.get("text"))
        self.f.write(item.get('href') + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()

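
When several pipelines are enabled, returning the item from process_item hands it to the next pipeline, while raising DropItem stops the chain for that item. The sketch below imitates that behavior in plain Python. The DropItem class here is a local stand-in for scrapy.exceptions.DropItem, and RequireHref, Collect, and run_chain are hypothetical names used only for illustration.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class RequireHref:
    """First pipeline: drops any item that has no href."""

    def process_item(self, item, spider):
        if not item.get("href"):
            raise DropItem("missing href")
        return item


class Collect:
    """Second pipeline: only sees items the first one returned."""

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(item["href"])
        return item


def run_chain(pipelines, item, spider=None):
    """Pass the item through each pipeline in order; stop at the first DropItem."""
    try:
        for pipeline in pipelines:
            item = pipeline.process_item(item, spider)
    except DropItem:
        return None
    return item


collector = Collect()
chain = [RequireHref(), collector]
kept = run_chain(chain, {"href": "/a"})
dropped = run_chain(chain, {"text": "no link"})
print(kept, dropped, collector.seen)
```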
settings.py

ITEM_PIPELINES =

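
The value of the pipeline setting is cut off above. As a hedged sketch, a settings.py fragment for this project could look like the following. The class path assumes the pipeline class is named XdbPipeline in xdb/pipelines.py, and both the priority 300 and the file name news.log are assumptions, not values from the original post.

```python
# settings.py (sketch; the values below are assumptions)

# Register the pipeline. The integer sets the order in which pipelines run:
# items pass through lower values first (customarily in the 0-1000 range).
ITEM_PIPELINES = {
    "xdb.pipelines.XdbPipeline": 300,
}

# Custom setting read in from_crawler via crawler.settings.get(...).
# Scrapy only loads upper-case names from the settings module.
HREF_FILE_PATH = "news.log"
```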
items.py

import scrapy


class XdbItem(scrapy.Item):
    text = scrapy.Field()
    href = scrapy.Field()

chouti.py

import scrapy
from xdb.items import XdbItem


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['']

    def parse(self, response):
        # Each link-detail block inside link-con holds one link
        content_list = response.xpath('//div[@class="link-con"]//div[@class="link-detail"]')
        for item in content_list:
            text = item.xpath('./a/text()').extract_first()
            href = item.xpath('./a/@href').extract_first()
            yield XdbItem(text=text, href=href)
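
To see what the selection in parse picks out, here is a rough standard-library sketch. It uses the limited XPath subset of xml.etree.ElementTree in place of Scrapy selectors, and the SAMPLE markup is made up for this example; it is not the real page structure of chouti.com.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup that mirrors the class names used by the spider.
SAMPLE = """
<html><body>
  <div class="link-con">
    <div class="link-detail"><a href="/link/1">First post</a></div>
    <div class="link-detail"><a href="/link/2">Second post</a></div>
  </div>
  <div class="other"><a href="/ignored">Ignored</a></div>
</body></html>
"""


def extract_links(markup):
    """Return text and href for every link-detail block under a link-con block."""
    root = ET.fromstring(markup.strip())
    rows = root.findall(".//div[@class='link-con']//div[@class='link-detail']")
    results = []
    for row in rows:
        anchor = row.find("./a")  # direct child <a>, like ./a in the spider
        if anchor is None:
            continue
        results.append({"text": anchor.text, "href": anchor.get("href")})
    return results


links = extract_links(SAMPLE)
print(links)
```

Unlike Scrapy's selectors, ElementTree needs well-formed XML, so this only works as a way to try out the path expression on a small fragment.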
