A First Attempt at Crawling BOSS Zhipin

2022-09-15 13:33:13

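1. Requirements analysis

Search by job keyword to collect the postings for that job. Each record includes the job title, the hiring company, the company's location, the required work experience, and the required education level.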

2. Technical analysis

Language: Python

Required libraries: requests, BeautifulSoup

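3. Detailed analysis

Enter the keyword java in the search box on the BOSS Zhipin site.

Figure 1: Searching for a keyword

Looking at the link the search redirects to, we can see that the request method is GET, so we need to construct a request URL.

Figure 2: Target link address

Figure 3: Request parameters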

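from urllib.parse import urlencode

def get_target_link(keyword):
    data = {}  # the parameters from Figure 3 (including keyword) are missing in the original listing
    queries = urlencode(data)
    url = base_url + queries  # base_url is defined elsewhere; the original does not show it
    return url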

Now we only need to pass in a keyword to build the link for a job search. The next step is to parse the content at that address.
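As an illustration of the URL-building step, here is a minimal sketch using urlencode. The base URL and the parameter names query and page are placeholders chosen for this example, not values taken from the site; the real ones are whatever Figures 2 and 3 show.

```python
from urllib.parse import urlencode

# Hypothetical base URL and parameter names, for illustration only.
BASE_URL = "https://example.com/job_detail/?"

def build_search_url(keyword, page=1):
    """Build a GET URL by appending an encoded query string to BASE_URL."""
    params = {"query": keyword, "page": page}
    # urlencode escapes spaces and non-ASCII characters for us.
    return BASE_URL + urlencode(params)

print(build_search_url("java"))
print(build_search_url("data analyst", page=2))
```

Letting urlencode do the escaping avoids hand-concatenating keywords that contain spaces or Chinese characters.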

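import requests
from requests.exceptions import RequestException

def get_target_content(url, keyword):
    try:
        # headers is a module-level dict that the original does not show
        response = requests.get(url=url, headers=headers)
        if response.status_code == 200:
            parser_target_content(response.text, keyword)
        else:
            print("request failed")
    except RequestException as e:
        print(e)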

This code fetches the page content and then calls the parser_target_content function to parse the target content.

from bs4 import BeautifulSoup

def parser_target_content(content, keyword):
    job_info = []
    soup = BeautifulSoup(content, "lxml")
    s = soup.find_all("div", class_="job-list")
    for i in s:
        m = i.find_all("li")
        for j in m:
            job_info.append(j)  # collect each <li> job entry
    parser_detail_jobinfo(job_info)

import re

def parser_detail_jobinfo(job_info):
    # Note: the HTML tags inside the regex patterns below were stripped from
    # the original listing; the patterns are shown as they survive.
    for i in job_info:
        s = i.find_all("div", class_="job-primary")
        for j in s:
            job_name = j.find("div", class_="job-title").get_text()
            job_wage = j.find("span").get_text()
            company_location = re.search(r"([\u4e00-\u9fa5])(\s)", str(j).replace("\n", "")).group(1)
            work_experience = re.search(r"(.*)", str(j)).group(1)
            # aca_require = re.search(r"em>(.)", str(j)).group(1)
            m = j.find_all("div", class_="company-text")
            for k in m:
                company = k.find("a").get_text()
            print(company, job_name, job_wage, company_location, work_experience)
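The regular-expression lookups above call .group(1) directly on the result of re.search, which raises AttributeError whenever a pattern does not match. One possible way to guard against that is a small helper like the sketch below. The helper name first_group, the sample fragment, and both patterns are made up for this example and do not reflect the site's real markup.

```python
import re

def first_group(pattern, text, default=""):
    """Return group 1 of the first match, or `default` when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else default

# Hypothetical HTML fragment, for illustration only.
fragment = "<p>北京 <em></em>3-5年<em></em>本科</p>"

# Matches a run of Chinese characters followed by whitespace.
location = first_group(r"<p>([\u4e00-\u9fa5]+)\s", fragment)
# No <span> in the fragment, so the default is returned instead of raising.
wage = first_group(r"<span>(.*?)</span>", fragment, default="n/a")

print(location, wage)
```

With a helper like this, one posting with unexpected markup yields a default value instead of aborting the whole crawl.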

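That gives us the content of one page. The next problem is how to open every result page for the keyword. Browsing the site shows that the pager at the bottom does not link to the last page, so we cannot loop over all the pages using the last page's address. How do we solve this?

Figure 4: Page number layout

Compare the two screenshots below: Figure 5 is from the first page and Figure 6 is from the last page.

Figure 5: Page number screenshot 1

Figure 6: Page number screenshot 2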

def get_href(url):
    response = requests.get(url=url, headers=headers).text
    soup = BeautifulSoup(response, "lxml")
    s = soup.find_all("div", class_="page")
    for i in s:
        h = i.find_all("a")
        for j in h[-1:]:
            # The string prepended here is empty in the original listing; its value was lost.
            href = "" + j.get("href")
    return href

def judge_nextpage(href):
    response = requests.get(url=href, headers=headers).text
    soup = BeautifulSoup(response, "lxml")
    s = soup.find_all("div", class_="page")
    for i in s:
        h = i.find_all("a")
        ka = re.search(r'class="(.*) href=\"', str(h[-1])).group(1).strip("\"")
        if ka == "next":
            return True
    return False
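The idea behind get_href and judge_nextpage is to look at the last link in the pager and check whether it is still an active "next" link. The sketch below shows that check on static HTML, using only the standard library's html.parser so it runs offline. The two sample snippets are assumed markup (a last link with class "next" while more pages exist, and "next disabled" on the final page); the real pager may differ.

```python
from html.parser import HTMLParser

class PageLinkCollector(HTMLParser):
    """Collect (class, href) for every <a> inside <div class="page">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # <div> nesting depth inside the pager, 0 = outside
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif "page" in (attrs.get("class") or "").split():
                self.depth = 1
        elif tag == "a" and self.depth > 0:
            self.links.append((attrs.get("class") or "", attrs.get("href")))

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

def next_page_href(html):
    """Return the href of the pager's last link if it is an active 'next' link."""
    collector = PageLinkCollector()
    collector.feed(html)
    if not collector.links:
        return None
    css_class, href = collector.links[-1]
    # Assumption: exactly class="next" means another page exists.
    return href if css_class.split() == ["next"] else None

# Assumed markup, for illustration only.
SAMPLE_FIRST = '<div class="page"><a href="/p1" class="cur">1</a><a href="/p2">2</a><a href="/p2" class="next"></a></div>'
SAMPLE_LAST = '<div class="page"><a href="/p9" class="prev"></a><a href="/p10" class="cur">10</a><a href="javascript:;" class="next disabled"></a></div>'

print(next_page_href(SAMPLE_FIRST))
print(next_page_href(SAMPLE_LAST))
```

Returning the next URL or None from a single function combines the two steps, so the caller does not need a second request just to decide whether to continue.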

def print_href(url, keyword):
    get_target_content(url, keyword)
    href = get_href(url)
    flag = judge_nextpage(href)
    if flag:
        print_href(href, keyword)
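print_href walks the pages by calling itself once per page. The same traversal can also be written as a loop, which avoids Python's recursion limit on long result lists and makes it easy to cap the number of pages. The sketch below is one possible shape, not the original author's code: crawl_all_pages, handle_page, and next_url are hypothetical names, and a dictionary stands in for the site so the example runs offline.

```python
def crawl_all_pages(start_url, handle_page, next_url, max_pages=100):
    """Visit pages one by one until next_url returns None or max_pages is hit."""
    visited = []
    url = start_url
    while url is not None and len(visited) < max_pages:
        if url in visited:      # guard against a pager that loops back
            break
        handle_page(url)        # e.g. fetch and parse the postings on this page
        visited.append(url)
        url = next_url(url)     # e.g. read the pager's "next" link
    return visited

# Offline stand-in for a site with three result pages.
fake_site = {"/p1": "/p2", "/p2": "/p3", "/p3": None}

seen = crawl_all_pages("/p1", handle_page=lambda u: None, next_url=fake_site.get)
print(seen)
```

In a real crawl, handle_page would play the role of get_target_content and next_url would return the pager's next link, or None on the last page.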

4. Crawl results
