Scrapy First Battle: Scraping Zhaopin (智聯招聘)

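Scrapy is a professional-grade crawling framework with a strong reputation in the web scraping world, and arguably the most popular crawling framework today. Powerful as it is, the learning curve is fairly steep. As a beginner I have felt this first-hand, and I hope my notes help you avoid the pitfalls.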

## Install Scrapy

Open a command line and enter: `pip install scrapy`

It is that simple. Installation is done.

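### Create the project

From the command line, change into the directory where you want the project to live and enter: `scrapy startproject <project_name>`, where `<project_name>` is the name of your project.

Then, from inside the project directory, enter `scrapy genspider zl www.zhaopin.com` to generate a spider.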

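Open `items.py` and define some fields. These fields hold the scraped data temporarily, so that it can later be saved to a local file, a database, or anywhere else.

Enough talk, here is the code: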

    # -*- coding: utf-8 -*-
    # Define here the models for your scraped items.
    import scrapy


    class ZhilianItem(scrapy.Item):
        # Define the fields for your item here.
        name = scrapy.Field()   # job title
        rate = scrapy.Field()   # response rate
        compy = scrapy.Field()  # company
        money = scrapy.Field()  # monthly salary
        place = scrapy.Field()  # work location

Fields are declared with `scrapy.Field()`. Simple, right?

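The spider is the core of the crawler. Nervous? Excited? Don't worry, it is not hard at all (this is a beginner tutorial). Let's look at the code first, because an explanation without code would only leave you confused: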

    import re

    import scrapy
    from bs4 import BeautifulSoup
    from scrapy.http import Request

    # Assumes the project package is named `zhilian`; adjust to your project name.
    from zhilian.items import ZhilianItem


    class ZlSpider(scrapy.Spider):
        name = 'zl'

        # The search URL is split around the page number.
        first_url = '全國&sm=0&p='
        last_url = '&sg=d5859246414f499ba3fa6c723a9749f5'

        def start_requests(self):
            # Request result pages 1 through 90.
            for i in range(1, 91):
                url = self.first_url + str(i) + self.last_url
                yield Request(url, callback=self.parse)

        def parse(self, response):
            soup = BeautifulSoup(response.body.decode('utf-8'), 'lxml')
            # Build one item per <table class="newlist">.
            for site in soup.find_all('table', class_='newlist'):
                item = ZhilianItem()
                try:
                    item['name'] = site.find('td', class_='zwmc').get_text().strip()
                    item['rate'] = site.find('td', class_='fk_lv').get_text()
                    item['compy'] = site.find('td', class_='gsmc').get_text()
                    item['money'] = site.find('td', class_='zwyx').get_text()
                    item['place'] = site.find('td', class_='gzdd').get_text()
                except AttributeError:
                    # find() returned None because the table lacks one of the cells.
                    continue
                yield item

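The first few lines import packages and modules. `re` and `bs4` need no introduction; `Request` is a standalone request class that you need whenever you follow up on a URL. `ZhilianItem` is the `ZhilianItem` class imported from `items.py`. We want to iterate over every job listing page.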

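For these links, some people would paste every URL straight into `start_urls`, but that is clumsy (if you insist on it, think about how much you would have to write). My approach is far more elegant and convenient, right?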

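First a few variables are assigned. `start_requests` then yields one `Request` per page; Scrapy downloads each one and passes the resulting response to `parse`, which processes it.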
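To make the page loop concrete, here is a minimal sketch of the URL construction on its own, using only the standard library. The two string fragments are copied from the spider exactly as they appear in this post; the helper name `page_urls` is made up for illustration and is not part of Scrapy.

```python
from typing import Iterator

# Fragments copied from the spider as they appear in the post;
# the page number goes between them.
FIRST_URL = '全國&sm=0&p='
LAST_URL = '&sg=d5859246414f499ba3fa6c723a9749f5'


def page_urls(first: str, last: str, start: int = 1, stop: int = 91) -> Iterator[str]:
    """Yield one URL per page number in range(start, stop)."""
    for page in range(start, stop):
        yield f"{first}{page}{last}"


urls = list(page_urls(FIRST_URL, LAST_URL))
print(len(urls))  # 90
print(urls[0])
```

Because `range(1, 91)` stops before 91, the loop covers pages 1 through 90. Generating the URLs this way keeps the page count in one place instead of in a long hand-written list.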

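BeautifulSoup parses the response and extracts the elements. We instantiate the item class we imported to hold the data, then assign each value to `item[key]` (the keys are the fields defined earlier in `items.py`). (Feeling lazy, so I kept this part brief.)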

Open `pipelines.py` and define the `ZhilianPipeline` class. Without further ado, here comes another round of code:

    # -*- coding: utf-8 -*-
    # Define your item pipelines here.
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting.
    import json
    import os


    class ZhilianPipeline(object):
        def open_spider(self, spider):
            # Called when the spider starts: create the output directory and file.
            path = 'd:/資料/'
            if not os.path.exists(path):
                os.makedirs(path)
            self.file = open(path + '智聯招聘.jl', 'wt', encoding='utf-8')

        def close_spider(self, spider):
            # Called when the spider closes.
            self.file.close()

        def process_item(self, item, spider):
            # Write each item as one JSON line; ensure_ascii=False keeps Chinese text readable.
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            self.file.write(line)
            return item

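The file imports the `json` and `os` modules. `open_spider` (called when the spider starts) creates the output file, `close_spider` (called when the spider closes) closes it, and `process_item` writes each `item` to the file as one line of JSON. Next, go to `settings.py` and enter the following code: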
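The three hooks can be tried without running a crawl. The sketch below is a stand-in that mirrors the pipeline's method names but does not depend on Scrapy; the class name `JsonLinesWriter`, the helper `read_json_lines`, the temporary output path, and the two sample records are all made up for illustration. It also shows how to read a `.jl` (JSON Lines) file back.

```python
import json
import os
import tempfile


class JsonLinesWriter:
    """Stand-in with the same three hooks as the pipeline, minus Scrapy."""

    def __init__(self, path):
        self.path = path
        self.file = None

    def open_spider(self, spider=None):
        # Create the output directory if needed, then open the file for writing.
        os.makedirs(os.path.dirname(self.path), exist_ok=True)
        self.file = open(self.path, 'wt', encoding='utf-8')

    def close_spider(self, spider=None):
        self.file.close()

    def process_item(self, item, spider=None):
        # ensure_ascii=False writes Chinese text as-is instead of \uXXXX escapes.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item


def read_json_lines(path):
    """Load every non-empty line of a .jl file as a dict."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


out_path = os.path.join(tempfile.mkdtemp(), 'data', 'jobs.jl')
writer = JsonLinesWriter(out_path)
writer.open_spider()
writer.process_item({'name': '爬蟲工程師', 'place': '北京'})  # made-up sample record
writer.process_item({'name': '資料分析師', 'place': '上海'})  # made-up sample record
writer.close_spider()

rows = read_json_lines(out_path)
print(rows[0]['name'])
```

One JSON object per line means the file can be appended to and read back record by record, which suits a crawler that produces items one at a time.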

`ITEM_PIPELINES = { ... }`

Then run the spider from the command line with `scrapy crawl zl` to generate the JSON data.
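For reference, an `ITEM_PIPELINES` entry might look like the sketch below. The package name `zhilian` is an assumption, since the post does not give the project name; replace it with whatever name you passed to `scrapy startproject`. The priority `300` is just a customary value.

```python
# settings.py (fragment)
# Assumption: the project package is named `zhilian`; replace it with the
# name you passed to `scrapy startproject`.
ITEM_PIPELINES = {
    # Maps the pipeline's import path to its order; lower numbers run first,
    # and values are customarily kept in the 0-1000 range.
    'zhilian.pipelines.ZhilianPipeline': 300,
}
```

With only one pipeline the number has no practical effect; it matters once several pipelines need to run in a fixed order.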

That's it! Scrapy is hard at first, but once you get past the basics, the rest comes easily. Keep going.
