Python 3 Scrapy: Scraping Tencent Job Listings

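Installing Scrapy is not covered again here.

In a console, run scrapy startproject tencent to create a crawler project named tencent.

Then run cd tencent.

Open the tencent project in PyCharm.

Define the items file (items.py):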

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
import scrapy


class TencentItem(scrapy.Item):
    # Define the fields for your item here like:
    # name = scrapy.Field()
    # Position title
    positionname = scrapy.Field()
    # Link to the detail page
    positionlink = scrapy.Field()
    # Position category
    positiontype = scrapy.Field()
    # Number of openings
    peoplenum = scrapy.Field()
    # Work location
    worklocation = scrapy.Field()
    # Publication date
    publishtime = scrapy.Field()

Next, create tencentpostition.py in the spiders folder with the code below; the comments explain each step.

# -*- coding: utf-8 -*-
import scrapy

from tencent.items import TencentItem


class TencentpostitionSpider(scrapy.Spider):
    # Spider name, used by "scrapy crawl"
    name = 'tencent'
    # Domains the spider is allowed to crawl
    allowed_domains = ['tencent.com']
    # Base URL of the listing; the page offset is appended to it
    url = ''
    # Page offset
    offset = 0
    # Initial URL
    start_urls = [url + str(offset)]

    def parse(self, response):
        # XPath rule matching the listing rows
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            item = TencentItem()
            # Position title
            item["positionname"] = each.xpath("./td[1]/a/text()").extract()[0]
            # Link to the detail page
            item["positionlink"] = each.xpath("./td[1]/a/@href").extract()[0]
            # Position category; the cell can be empty, so fall back to a default
            try:
                item["positiontype"] = each.xpath("./td[2]/text()").extract()[0]
            except IndexError:
                item["positiontype"] = '空'  # '空' means "empty"
            # Number of openings
            item["peoplenum"] = each.xpath("./td[3]/text()").extract()[0]
            # Work location
            item["worklocation"] = each.xpath("./td[4]/text()").extract()[0]
            # Publication date
            item["publishtime"] = each.xpath("./td[5]/text()").extract()[0]
            # Hand the item to the pipeline
            yield item
        # Once every row on the page is processed, move to the next page
        if self.offset < 2620:
            self.offset += 10
            # Hand the request to the scheduler
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
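The two things parse does, pulling fields out of each table row and stepping the page offset, can be tried without a network connection. The following is a minimal sketch using only the standard library: xml.etree.ElementTree stands in for Scrapy's selectors, and SAMPLE, extract_rows, and page_offsets are hypothetical names with made-up markup, since the real page is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from the real site).
SAMPLE = """
<table>
  <tr class="even">
    <td><a href="position_detail.php?id=1">Backend Engineer</a></td>
    <td>Technology</td>
    <td>2</td>
    <td>Shenzhen</td>
    <td>2018-01-01</td>
  </tr>
  <tr class="odd">
    <td><a href="position_detail.php?id=2">Product Designer</a></td>
    <td></td>
    <td>1</td>
    <td>Beijing</td>
    <td>2018-01-02</td>
  </tr>
  <tr class="header"><td>ignored</td></tr>
</table>
"""


def extract_rows(markup):
    """Collect one dict per row whose class is 'even' or 'odd'."""
    root = ET.fromstring(markup.strip())
    rows = []
    for tr in root.iter("tr"):
        if tr.get("class") not in ("even", "odd"):
            continue
        cells = tr.findall("td")
        link = cells[0].find("a")
        rows.append({
            "positionname": link.text,
            "positionlink": link.get("href"),
            # An empty cell has no text, so use the same default as the spider.
            "positiontype": cells[1].text or "空",
            "peoplenum": cells[2].text,
            "worklocation": cells[3].text,
            "publishtime": cells[4].text,
        })
    return rows


def page_offsets(step=10, last=2620):
    """Yield the offsets the spider requests: 0, step, 2*step, ... up to last."""
    offset = 0
    yield offset
    while offset < last:
        offset += step
        yield offset


rows = extract_rows(SAMPLE)
offsets = list(page_offsets())
print(len(rows), offsets[:3], offsets[-1])
```

With the default arguments, page_offsets walks from 0 to 2620 in steps of 10, which matches the condition in the spider.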

Next, configure the pipeline file pipelines.py as follows:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See:
import json


class TencentPipeline(object):
    def __init__(self):
        # Open the output file when the pipeline is created
        self.filename = open("tencent.json", "wb")

    def process_item(self, item, spider):
        # Convert the item to a dict, then to one line of JSON
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        # Write the line to the file, encoded as UTF-8
        self.filename.write(text.encode("utf-8"))
        # Return the item so any later pipeline can process it
        return item

    def close_spider(self, spider):
        # Close the file when the spider closes
        self.filename.close()
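The ensure_ascii=False argument is what keeps Chinese text readable in the output file. Here is a small standard-library sketch of the same serialization step, outside Scrapy; the record contents and the to_json_line helper are hypothetical and only illustrate the behavior.

```python
import json
import os
import tempfile


def to_json_line(record):
    """Serialize one record as a UTF-8 encoded JSON line, as the pipeline does."""
    return (json.dumps(record, ensure_ascii=False) + "\n").encode("utf-8")


# Hypothetical record; in the project the items come from the spider.
record = {"positionname": "後台開發工程師", "worklocation": "深圳"}

escaped = json.dumps(record)  # default: non-ASCII characters become \uXXXX escapes
line = to_json_line(record)   # raw UTF-8 bytes that stay readable in an editor

# Write to a temporary file opened in binary mode, then read the line back.
path = os.path.join(tempfile.mkdtemp(), "tencent.json")
with open(path, "wb") as fh:
    fh.write(line)

with open(path, encoding="utf-8") as fh:
    restored = json.loads(fh.readline())

print(restored == record)
```

Because the file is opened in "wb" mode, the text has to be encoded to bytes before writing; opening it in text mode with encoding="utf-8" would be an alternative.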

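Next, configure the settings.py file.

Do not obey robots.txt rules:

ROBOTSTXT_OBEY = False

Wait 3 seconds between requests:

DOWNLOAD_DELAY = 3

# Set the request headers

DEFAULT_REQUEST_HEADERS =

# Which pipeline handles the data, written as package.pipeline_module.ClassName

ITEM_PIPELINES =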
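The values of DEFAULT_REQUEST_HEADERS and ITEM_PIPELINES are not shown above. As a rough sketch of what such a fragment might look like: the header values below are placeholders rather than the author's, the pipeline path assumes the project package is tencent and the class in pipelines.py is named TencentPipeline, and 300 is an arbitrary priority within Scrapy's customary 0 to 1000 range.

```python
# settings.py (fragment) - illustrative values only

# Ignore robots.txt rules
ROBOTSTXT_OBEY = False

# Seconds to wait between consecutive requests
DOWNLOAD_DELAY = 3

# Default headers sent with every request (placeholder values, assumed)
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; example-crawler)",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

# Pipeline class path mapped to its priority (assumed class name)
ITEM_PIPELINES = {
    "tencent.pipelines.TencentPipeline": 300,
}
```

If the pipeline class has a different name, the key in ITEM_PIPELINES needs to match it exactly.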

Then, in the console, run

scrapy crawl tencent

to start the crawl.
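The pipeline writes one JSON object per line, so tencent.json is not a single JSON document and should be read line by line rather than with one json.load call. A minimal sketch for loading it back, where load_items is a hypothetical helper and a small temporary file with made-up records stands in for the real output:

```python
import json
import os
import tempfile


def load_items(path):
    """Read a file holding one JSON object per non-empty line."""
    items = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                items.append(json.loads(line))
    return items


# Stand-in for the crawl output: two made-up records, one per line.
sample_path = os.path.join(tempfile.mkdtemp(), "tencent.json")
with open(sample_path, "w", encoding="utf-8") as fh:
    fh.write('{"positionname": "A", "peoplenum": "2"}\n')
    fh.write('{"positionname": "B", "peoplenum": "1"}\n')

items = load_items(sample_path)
print(len(items))
```

After a real crawl, passing the path of the generated tencent.json to load_items should give one dict per scraped position.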

