Web Scraping (3): Cleaning Up the Scraped Data

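Last time we scraped the world GDP figures for 2023.

Some of the scraped rows still need to be thrown away, though: empty rows, rows that contain only whitespace, ad slots, and so on. Here we filter them out.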

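from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
url = ""  # the page URL is left empty in the original post
xpath = "/html/body/div[2]/div[1]/div[5]/div[1]/div/div/div/table"
driver.get(url)
# Grab the table's inner HTML and hand it to BeautifulSoup
# (find_element(By.XPATH, ...) replaces the older find_element_by_xpath)
table_html = driver.find_element(By.XPATH, xpath).get_attribute('innerHTML')
soup = BeautifulSoup(table_html, "html.parser")
table = soup.find_all('tr')
for row in table:
    cols = [col.text for col in row.find_all('td')]
    # Skip rows with no <td> cells and rows whose first cell is not all digits
    if len(cols) == 0 or not cols[0].isdigit():
        continue
    print(cols)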

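The new part is this check:

if len(cols) == 0 or not cols[0].isdigit():
    continue

Its purpose is to drop empty rows, rows whose first element is blank, and ad slots. A row survives only if it has at least one cell and that first cell consists entirely of digits.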
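To see what the check does without opening a browser, here is a minimal sketch that applies the same condition to a few hand-written rows. The helper name keep_row and the sample rows are made up for illustration; they are not taken from the real GDP page.

```python
def keep_row(cols):
    """Return True for rows the loop would print: non-empty, first cell all digits."""
    return len(cols) > 0 and cols[0].isdigit()


# Hypothetical rows, shaped like the lists built from row.find_all('td')
sample_rows = [
    [],                           # no <td> cells at all (for example a header row)
    [" ", "", ""],                # first cell is just a space
    ["Advertisement"],            # an ad slot
    ["1", "Country A", "100"],    # a data row
    ["2", "Country B", "90"],     # another data row
]

kept = [cols for cols in sample_rows if keep_row(cols)]
for cols in kept:
    print(cols)
```

One thing to watch: str.isdigit() is strict, so a first cell with surrounding whitespace such as " 3 " is rejected too. If the page pads its cells, stripping each cell before the check would be one way around that.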

With this check in place, only the rows whose first cell is a number get printed.
