Scraping 51jobs with Python: Data Cleaning (3)

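The earlier parts used the get_webpage method to fetch **. This part covers how to filter the information I want (hiring company, job description, salary) out of the ** information.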

Take one company's job posting ** as an example:

url = ''

We want to get the text from these 3 places on the page.

First, import the requests module and set up the request headers.

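import re

import requests
from bs4 import BeautifulSoup


def data_cleaning():
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    url = ""  # URL of the job posting page
    # headers must be passed by keyword; the second positional
    # argument of requests.get is params, not headers
    r = requests.get(url, headers=headers)
    # exclude_encodings expects a list and only applies to raw bytes,
    # so pass r.content rather than the already-decoded r.text
    soup = BeautifulSoup(r.content, 'html.parser', exclude_encodings=["utf-8"])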

Next, use regular expressions and the bs4 module to extract the target information.

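def data_cleaning():
    ...  # omitted: the code written above
    # 1. Company name
    sname = soup.find_all(class_='catn')[0]['title']
    # 2. Job description
    # find_all returns bs4 Tag objects, not strings; calling directory.replace(...)
    # directly raises TypeError: 'NoneType' object is not callable
    directory = soup.find_all(class_='bmsg job_msg inbox')[0]
    job_datas = str(directory).replace("\n", "")
    # The HTML tags inside the original pattern were lost from this post;
    # the <div> and <br/> tags below are assumptions, adjust them to the real markup
    pattern = re.compile('<div class="bmsg job_msg inbox">(.*?)</div>')
    job_data = pattern.findall(job_datas)[0].replace('<br/>', '\n')
    # 3. Monthly salary
    job_salary = soup.find_all(class_='cn')[0].strong.text
    return sname, job_data, job_salary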

A point worth noting:

job_datas = str(directory).replace("\n", "")

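First, directory = soup.find_all(class_='bmsg job_msg inbox')[0] returns an element: a bs4 Tag object, not a string and not None either. The NoneType in the error message appears because bs4 treats an unknown attribute such as directory.replace as a search for a child tag named replace, which finds nothing and returns None; calling that None raises the TypeError. So directory has to be converted to a string with str() first.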
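The Tag-versus-string behaviour can be reproduced offline. This is a minimal sketch: the HTML snippet is a made-up stand-in for the job posting page, not the real 51jobs markup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the real page.
html = '<div class="bmsg job_msg inbox"><p>Write crawlers</p>\n<p>Clean data</p></div>'
soup = BeautifulSoup(html, "html.parser")
directory = soup.find_all(class_="bmsg job_msg inbox")[0]

print(type(directory).__name__)  # Tag

# Calling a str method on the Tag fails: bs4 looks for a child tag
# named "replace", finds none, returns None, and None is not callable.
error_message = None
try:
    directory.replace("\n", "")
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Converting to str first makes the string methods available.
job_datas = str(directory).replace("\n", "")
print(job_datas)
```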

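Second, directory actually contains many newline characters "\n". They are easy to miss when you debug, but they really are there when the regex runs, and by default "." does not match a newline, so (.*?) cannot match across a line break. To get rid of them easily, use the string replace method.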
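The effect of a stray newline on the match can be seen in a small sketch. The fragment and the <p> pattern are hypothetical; re.S is shown only as an alternative to removing the newlines, not as what the original code did.

```python
import re

# Hypothetical fragment containing a newline inside the tag.
raw = "<p>Write\ncrawlers</p>"
pattern = re.compile(r"<p>(.*?)</p>")

# '.' does not match '\n' by default, so nothing is found.
no_match = pattern.findall(raw)

# Removing the newlines first lets the same pattern match.
after_replace = pattern.findall(raw.replace("\n", ""))

# Alternative: keep the newlines and let '.' match them with re.S.
with_dotall = re.findall(r"<p>(.*?)</p>", raw, re.S)

print(no_match, after_replace, with_dotall)
```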

After that, storing sname, job_data, and job_salary in a MySQL database is basically all that is left.
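The storage step might look roughly like this. It is only a sketch: it uses the standard-library sqlite3 module as a stand-in for MySQL so that it runs without a server, and the table name jobs, its columns, and the save_job helper are all hypothetical.

```python
import sqlite3


def save_job(conn, sname, job_data, job_salary):
    """Insert one cleaned record (hypothetical helper and schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs "
        "(company TEXT, description TEXT, salary TEXT)"
    )
    # Parameterized query: values are passed separately from the SQL text.
    conn.execute(
        "INSERT INTO jobs (company, description, salary) VALUES (?, ?, ?)",
        (sname, job_data, job_salary),
    )
    conn.commit()


# In-memory database and made-up values, for illustration only.
conn = sqlite3.connect(":memory:")
save_job(conn, "Example Co.", "Write crawlers\nClean data", "10-15k/month")
rows = conn.execute("SELECT company, salary FROM jobs").fetchall()
print(rows)
```

With a MySQL driver such as PyMySQL the overall shape stays the same, but the placeholders are written as %s instead of ?.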
