Writing a Crawler Project with the Scrapy Framework

Key points:

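Two ways to install modules.

The following two methods can install the vast majority of modules:

Network install: run pip install xx directly in the console.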
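After a network install it can be handy to confirm that the module is really importable. The snippet below is a minimal sketch using only the standard library; the helper name is_installed is made up for illustration, and the two module names are simply the ones used later in these notes.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "scrapy" and "pymysql" are the modules these notes rely on.
    for name in ("scrapy", "pymysql"):
        if is_installed(name):
            print(f"{name}: installed")
        else:
            print(f"{name}: missing, try: pip install {name}")
```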

Item 6, configuration process:

1. Copy the two .dll files from f:\程式設計\python\lib\site-packages\pywin32_system32

2. Paste them into c:\windows\system32
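The copy and paste step can also be scripted. This is only a sketch: copy_dlls is a hypothetical helper, it copies every .dll it finds in the source folder, and writing into c:\windows\system32 would normally need an administrator console.

```python
import shutil
from pathlib import Path


def copy_dlls(src_dir, dst_dir):
    """Copy every .dll file in src_dir into dst_dir; return the copied file names."""
    src, dst = Path(src_dir), Path(dst_dir)
    copied = []
    for dll in sorted(src.glob("*.dll")):
        shutil.copy2(dll, dst / dll.name)  # copy2 also keeps timestamps
        copied.append(dll.name)
    return copied
```

With the paths from the two steps above, the call would be copy_dlls(r"f:\程式設計\python\lib\site-packages\pywin32_system32", r"c:\windows\system32").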

1. Create a project: scrapy startproject xx

2. List the available spider templates: scrapy genspider -l

basic:

# -*- coding: utf-8 -*-
import scrapy


class FstSpider(scrapy.Spider):
    name = 'fst'
    allowed_domains = ['aliwx.com.cn']
    start_urls = ['http://aliwx.com.cn/']

    def parse(self, response):
        # Called with the downloaded response for each start URL.
        pass

crawl:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SecondSpider(CrawlSpider):
    name = 'second'
    allowed_domains = ['aliwx.com.cn']
    start_urls = ['http://aliwx.com.cn/']

    # Follow links whose URL matches 'items/' and hand each page to parse_item.
    rules = (
        Rule(LinkExtractor(allow=r'items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
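The allow argument in the rule is a regular expression that is searched for in each extracted link's absolute URL. The sketch below approximates only that one check with the standard library, without Scrapy (the real LinkExtractor applies further filters as well); would_follow and the sample URLs are made up for illustration.

```python
import re

ALLOW = re.compile(r"items/")  # same pattern as in the Rule above


def would_follow(url: str) -> bool:
    """Approximate the allow check: keep a URL if the pattern occurs anywhere in it."""
    return ALLOW.search(url) is not None


# Hypothetical links, for illustration only.
links = [
    "http://aliwx.com.cn/items/1.html",
    "http://aliwx.com.cn/about.html",
    "http://aliwx.com.cn/shop/items/",
]
followed = [url for url in links if would_follow(url)]
print(followed)
```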

csvfeed:

# -*- coding: utf-8 -*-
from scrapy.spiders import CSVFeedSpider


class ThirdSpider(CSVFeedSpider):
    name = 'third'
    allowed_domains = ['aliwx.com.cn']
    start_urls = ['http://aliwx.com.cn/feed.csv']
    # headers = ['id', 'name', 'description', 'image_link']
    # delimiter = '\t'

    # Do any adaptations you need here
    #def adapt_response(self, response):
    #    return response

    def parse_row(self, response, row):
        # row is a dict that maps each header to the value in the current CSV row.
        i = {}
        #i['url'] = row['url']
        #i['name'] = row['name']
        #i['description'] = row['description']
        return i
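To see what parse_row receives, the following stdlib-only sketch mimics the idea with csv.DictReader: each row arrives as a dict keyed by the header names. It is an approximation for illustration, not Scrapy's own code, and the feed content is invented.

```python
import csv
import io

# Hypothetical tab-separated feed; the column names follow the commented-out
# headers line in the template.
FEED = (
    "id\tname\tdescription\timage_link\n"
    "1\tfoo\tfirst item\tfoo.png\n"
    "2\tbar\tsecond item\tbar.png\n"
)


def parse_row(row):
    """Mirror the template's parse_row: pick the wanted fields from one row dict."""
    return {"name": row["name"], "description": row["description"]}


reader = csv.DictReader(io.StringIO(FEED), delimiter="\t")
items = [parse_row(row) for row in reader]
print(items)
```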

xmlfeed:

# -*- coding: utf-8 -*-
from scrapy.spiders import XMLFeedSpider


class FourthSpider(XMLFeedSpider):
    name = 'fourth'
    allowed_domains = ['aliwx.com.cn']
    start_urls = ['http://aliwx.com.cn/feed.xml']
    iterator = 'iternodes'  # you can change this; see the docs
    itertag = 'item'  # change it accordingly

    def parse_node(self, response, selector):
        # Called once for every node whose tag matches itertag.
        i = {}
        #i['url'] = selector.xpath('url').extract()
        #i['name'] = selector.xpath('name').extract()
        #i['description'] = selector.xpath('description').extract()
        return i
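The role of itertag can be illustrated without Scrapy: the spider visits every node with that tag and calls parse_node on it. The sketch below imitates this with xml.etree.ElementTree; it is an approximation under that assumption, and the feed content is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, for illustration only.
FEED = """<rss><channel>
<item><url>http://aliwx.com.cn/1</url><name>foo</name></item>
<item><url>http://aliwx.com.cn/2</url><name>bar</name></item>
</channel></rss>"""

ITERTAG = "item"  # same role as itertag in the template


def parse_node(node):
    """Mirror the template's parse_node: read child elements of one <item> node."""
    return {"url": node.findtext("url"), "name": node.findtext("name")}


root = ET.fromstring(FEED)
items = [parse_node(node) for node in root.iter(ITERTAG)]
print(items)
```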

3. Create a spider in the spiders directory: scrapy genspider -t <template> <spider name> <domain>, where <template> is one of basic, crawl, csvfeed or xmlfeed

4. Run a spider: scrapy crawl <spider name> (the value of the spider's name attribute, not the file name)
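The argument order of steps 3 and 4 is easy to mix up, so here is a small sketch that builds the two command lines as lists. The helper names genspider_command and crawl_command are made up; the spider name and domain are taken from the templates above.

```python
import shlex

TEMPLATES = ("basic", "crawl", "csvfeed", "xmlfeed")


def genspider_command(name, domain, template="basic"):
    """Build: scrapy genspider -t <template> <spider name> <domain>."""
    if template not in TEMPLATES:
        raise ValueError(f"unknown template: {template}")
    return ["scrapy", "genspider", "-t", template, name, domain]


def crawl_command(name):
    """Build: scrapy crawl <spider name>."""
    return ["scrapy", "crawl", name]


print(shlex.join(genspider_command("second", "aliwx.com.cn", "crawl")))
print(shlex.join(crawl_command("second")))
```

Each list can be passed to subprocess.run(cmd, check=True) from inside the project directory if you want to drive Scrapy from a script.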

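items: stores the target fields you want to scrape

spiders: holds the spider files (a project can contain several)

middlewares: middleware; I have not yet worked out what it is used for

pipelines: post-processing of items after they are scraped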
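To make the roles of these files more concrete, here is a minimal sketch of an item, an item pipeline and a downloader middleware. Assumptions: Scrapy 2.2 or later, which accepts dataclass objects as items (the generated items.py uses scrapy.Item with scrapy.Field() instead); all class names, field names and the User-Agent value are hypothetical. Pipelines and middlewares are plain classes, so the demonstration at the bottom runs without Scrapy, using a stand-in request object.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class BookItem:  # hypothetical item; the field names are illustrative
    name: str
    description: str


class StripWhitespacePipeline:
    """Item pipeline: Scrapy calls process_item once for every scraped item."""

    def process_item(self, item, spider):
        item.name = item.name.strip()
        item.description = item.description.strip()
        return item  # returning the item passes it on to the next pipeline


class FixedUserAgentMiddleware:
    """Downloader middleware: process_request is called before each download."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = "my-crawler/0.1"  # hypothetical value
        return None  # None tells Scrapy to carry on handling the request


# Stand-alone demonstration; SimpleNamespace stands in for a Scrapy request.
item = StripWhitespacePipeline().process_item(BookItem("  foo ", " bar\n"), spider=None)
fake_request = SimpleNamespace(headers={})
FixedUserAgentMiddleware().process_request(fake_request, spider=None)
print(item, fake_request.headers)
```

In a real project the two classes only take effect once they are listed in ITEM_PIPELINES and DOWNLOADER_MIDDLEWARES in settings.py.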

Using pymysql

First, in connections.py inside the pymysql folder, change charset to utf8; this prevents garbled characters.
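A possible alternative to editing the library file is to pass charset when opening the connection, since pymysql.connect accepts it as a keyword argument. The pipeline below is a sketch under stated assumptions: the connection settings, the table name items and its columns are all hypothetical, and pymysql is imported lazily so the class can be loaded even where pymysql is not installed.

```python
class MySQLPipeline:
    """Sketch of an item pipeline that stores items in MySQL through pymysql."""

    # Hypothetical connection settings; replace them with your own.
    settings = {
        "host": "localhost",
        "user": "root",
        "password": "",
        "database": "scrapy_demo",
    }

    def connection_kwargs(self):
        # Setting charset here avoids editing pymysql's connections.py.
        # "utf8mb4" would additionally cover 4-byte characters.
        return {**self.settings, "charset": "utf8"}

    def open_spider(self, spider):
        import pymysql  # lazy import: only needed when the spider actually runs

        self.conn = pymysql.connect(**self.connection_kwargs())

    def process_item(self, item, spider):
        with self.conn.cursor() as cursor:
            # Hypothetical table and columns; values are passed as parameters.
            cursor.execute(
                "INSERT INTO items (name, description) VALUES (%s, %s)",
                (item["name"], item["description"]),
            )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```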
