Python crawler: scraping itcast (傳智播客) teacher information with Scrapy

Getting the XPath of an HTML element easily

Open/close the console: Ctrl+Shift+X

Reference: an introduction to XPath Helper, a Chrome page-parsing tool for crawlers

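# Create a project
scrapy startproject myspider
# Create a spider
scrapy genspider itcast itcast.cn
# List the spiders
scrapy list
# Run a spider
scrapy crawl itcast
# Four output formats: json, jsonl, csv, xml (Unicode-escaped by default)
# Export as JSON:
scrapy crawl itcast -o data.json
# Start the interactive shell
scrapy shell url
# response.headers
# response.body
# Selectors always return the extracted data as a list
# response.xpath()  returns a list
# response.css()
# extract()  converts the selector objects into unicode strings
# re()  extracts with a regular expression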
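
The `json` and `jsonl` formats differ in layout, and the note about Unicode refers to non-ASCII characters being written as `\uXXXX` escapes. Below is a minimal standard-library sketch of both points. The records are made-up placeholders, not scraped data, and the code only imitates the shape of the output; it is not Scrapy's exporter.

```python
import json

# Hypothetical records standing in for scraped items.
items = [
    {"name": "張老師", "title": "講師"},
    {"name": "李老師", "title": "高級講師"},
]

# "json" layout: a single array holding every item.
as_json = json.dumps(items)

# "jsonl" layout: one standalone JSON object per line.
as_jsonl = "\n".join(json.dumps(item) for item in items)

# json.dumps escapes non-ASCII characters by default,
# so the result is pure ASCII containing \uXXXX sequences.
escaped = json.dumps(items[0])

# ensure_ascii=False keeps the characters readable instead.
readable = json.dumps(items[0], ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; only the on-disk representation differs. In Scrapy itself, setting `FEED_EXPORT_ENCODING = "utf-8"` should give readable JSON output from `-o data.json`.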

To illustrate the basic usage of the three classes (spider, item, and pipeline), the code below is deliberately somewhat verbose.

# itcast_spider.py
# -*- coding: utf-8 -*-
import scrapy
from myspider.items.itcast_item import ItcastItem

# Python 2 workaround for encoding problems
import sys
reload(sys)
sys.setdefaultencoding("utf-8")


class ItcastSpider(scrapy.Spider):
    # Spider name; required and must be unique
    name = "itcast"
    # Restrict the crawl to these domains (optional)
    allowed_domains = ["itcast.cn"]
    # Configure the pipeline that processes the items
    custom_settings = {
        # ...
    }
    # First batch of URLs to crawl
    start_urls = [""]

    def parse(self, response):
        # Parse the teacher list on each page
        for li_txt in response.css(".li_txt"):
            name = li_txt.xpath("./h3/text()").extract()[0]
            title = li_txt.xpath("./h4/text()").extract()[0]
            info = li_txt.xpath("./p/text()").extract()[0]
            # Put the data into an item and hand it to the pipeline
            item = ItcastItem()
            item["name"] = name
            item["title"] = title
            item["info"] = info
            yield item
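
The extraction logic in `parse()` can be tried offline. The sketch below swaps Scrapy's selectors for the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset, and runs it against a hypothetical fragment. The markup and teacher data are invented for illustration; only the `li_txt`, `h3`, `h4`, and `p` names come from the spider.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment imitating the teacher list markup.
PAGE = """
<div>
  <div class="li_txt">
    <h3>Teacher A</h3>
    <h4>Senior Lecturer</h4>
    <p>Ten years of Java experience.</p>
  </div>
  <div class="li_txt">
    <h3>Teacher B</h3>
    <h4>Lecturer</h4>
    <p>Focuses on Python and crawlers.</p>
  </div>
</div>
"""


def parse(page):
    """Yield one dict per teacher block, mirroring the spider's parse()."""
    root = ET.fromstring(page.strip())
    # Rough stand-in for response.css(".li_txt"); unlike a CSS class
    # selector, this matches only an exact class attribute value.
    for li_txt in root.findall(".//div[@class='li_txt']"):
        yield {
            "name": li_txt.find("./h3").text,
            "title": li_txt.find("./h4").text,
            "info": li_txt.find("./p").text,
        }


teachers = list(parse(PAGE))
for teacher in teachers:
    print(teacher)
```

Real pages are rarely well-formed XML, which is one reason to rely on Scrapy's own selectors in the spider; the sketch only shows how the relative paths resolve against each `li_txt` block.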

# itcast_item.py
# -*- coding: utf-8 -*-
import scrapy


class ItcastItem(scrapy.Item):
    name = scrapy.Field()   # name
    title = scrapy.Field()  # job title
    info = scrapy.Field()   # detailed information

# itcast_pipline.py
# -*- coding: utf-8 -*-
import json


class ItcastPipeline(object):
    # The class is instantiated only once
    def __init__(self):
        print "@@@@@@ spider initialized"
        self.f = open("itcast.json", "w")
        self.count = 0  # counter

    def process_item(self, item, spider):
        # This method must be implemented
        dct = json.dumps(dict(item), ensure_ascii=False)
        self.f.write(dct.encode("utf-8") + "\n")
        self.count += 1
        return item  # must be returned so that other pipelines can process it

    def open_spider(self, spider):
        print "@@@@@@ spider opened"

    def close_spider(self, spider):
        self.f.close()
        print "@@@@@@ spider closed"
        print "Items scraped: %s" % self.count

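The pipeline above is written for Python 2. For comparison, here is a sketch of the same idea under Python 3, where opening the file in text mode with an explicit encoding replaces the manual `.encode("utf-8")`. The class name `JsonLinesPipeline` and the `path` parameter are hypothetical, and the pipeline is driven by hand with a placeholder item; in a real crawl Scrapy calls these hooks itself.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Python 3 sketch of the pipeline: writes one JSON object per line."""

    def __init__(self, path="itcast.json"):
        self.path = path
        self.f = None
        self.count = 0  # number of items written

    def open_spider(self, spider):
        # Text mode with an explicit encoding; no manual encode() needed.
        self.f = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False)
        self.f.write(line + "\n")
        self.count += 1
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        self.f.close()
        print("Items scraped: %s" % self.count)


# Drive the hooks by hand with a placeholder item (not scraped data).
path = os.path.join(tempfile.mkdtemp(), "itcast.json")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "張老師", "title": "講師", "info": "..."}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

Opening the file in `open_spider` rather than `__init__` keeps the open and close symmetric. Either version has to be enabled through an `ITEM_PIPELINES` entry; the dotted path for it depends on the project layout.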