Scraping Bian (netbian.com) Wallpapers

2022-06-22 04:54:11 · 4749 words · 3629 reads

Saw someone post this on a forum, typed out the code myself with a few changes along the way. A good learning exercise.

# -*- coding: utf-8 -*-
# @Time    : 2020/6/17 18:24
# @Author  : banshaohuan
# @Site    :
# @File    : bizhi.py
# @Software: PyCharm

import os
import time
import random

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

# Base site URL; the original post left this blank. The references at the
# bottom of the post name the site as www.netbian.com.
index = "http://www.netbian.com"

interval = 0.1  # base delay between requests, in seconds

first_dir = "d:/彼岸桌面爬蟲"  # root folder for downloaded wallpapers

# Holds each category subpage's info
classification_dict = {}
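`time` and `random` are imported, presumably to pause between requests, with `interval` above as the base delay. A minimal sketch of that pattern, assuming a jittered sleep was intended (the helper names are mine, not from the post):

```python
import random
import time

def jittered_delay(base, jitter=0.1):
    """Return base plus a random extra of up to `jitter` seconds."""
    return base + random.uniform(0, jitter)

def polite_pause(base=0.1):
    """Sleep for a slightly randomized interval between requests."""
    time.sleep(jittered_delay(base))
```

Randomizing the delay makes the request timing look less mechanical than a fixed `time.sleep(interval)`.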

# Build a random User-Agent header
def get_headers():
    ua = UserAgent()
    # The dict body was blanked in the original post; a random UA string
    # is the usual fake_useragent pattern
    headers = {"User-Agent": ua.random}
    return headers

# Fetch a page and return the elements matching a CSS selector
def screen(url, select):
    headers = get_headers()
    html = requests.get(url=url, headers=headers)
    html = html.text
    soup = BeautifulSoup(html, "lxml")
    return soup.select(select)
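`screen()` boils down to requests plus BeautifulSoup's CSS selectors. A self-contained illustration of the `select` step on canned HTML, no network needed (`html.parser` is used here so the example runs without lxml installed):

```python
from bs4 import BeautifulSoup

html = """
<div id="main">
  <ul>
    <li><a href="/desk/1.htm">sunset</a></li>
    <li><a href="/desk/2.htm">forest</a></li>
  </ul>
</div>
"""

# Parse and select with a CSS selector, just as screen() does
soup = BeautifulSoup(html, "html.parser")
links = soup.select("div#main li a")
hrefs = [a.get("href") for a in links]
print(hrefs)  # → ['/desk/1.htm', '/desk/2.htm']
```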

# Store each category subpage's info in the dict
def init_classification():
    url = index
    select = "#header > div.head > ul > li:nth-child(1) > div > a"
    classifications = screen(url, select)
    global classification_dict
    for c in classifications:
        href = c.get("href")
        text = c.string
        if text == "4k桌布":  # the 4K section needs a login and cannot be scraped, so skip it
            continue
        second_dir = f"{first_dir}/{text}"  # per-category save folder (f-string reconstructed)
        url = index + href
        # The dict value was blanked in the post; it plausibly held the
        # category URL and its save path
        classification_dict[text] = {"url": url, "path": second_dir}

# Get the page count
# Locate the 1920x1080 resolution link
# (the functions these two comments belonged to, along with the
# select_classification() called below, are missing from the post)
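The two comments above are all that survives of the page-count code, and `select_classification()`, which `ui()` and `main()` call, is missing entirely. On paginated sites like this one, the page count is usually the largest number among the pagination links; a hypothetical, self-contained sketch of that parsing step (the helper name is mine, not from the post):

```python
def max_page_number(link_texts):
    """Given the text of each pagination link, return the largest page number.

    Non-numeric entries ("上一頁", "下一頁", "...") are ignored; defaults to 1
    when no numeric link is present.
    """
    pages = [int(t) for t in link_texts if t and t.strip().isdigit()]
    return max(pages) if pages else 1

# A typical pagination row: previous / 1 2 3 ... 208 / next
print(max_page_number(["上一頁", "1", "2", "3", "...", "208", "下一頁"]))  # → 208
```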

# Follow each thumbnail link and download the full-size image
def handle_images(links, path):
    for link in links:
        href = link.get("href")
        # Filter out ads (the ad URL being compared against was blanked in the post)
        if href == "":
            continue
        # First jump: open the picture's detail page. The selector and the
        # screen() call here were lost from the post; this is a reconstruction,
        # and the selector below is a placeholder, not the original.
        url = index + href
        link = screen(url, "div#main a")
        if link == []:
            print(f"{path}: image not found, skipping")
            continue
        href = link[0].get("href")
        # Second jump: the page holding the actual image
        url = index + href
        select = "div#main table a img"
        link = screen(url, select)
        if link == []:
            print(f"{path}: this image requires a login, skipping")
            continue
        # Strip all symbols from the alt text, keeping only the name
        # (the stripping and saving code that followed was lost from the post)
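The last comment in `handle_images` points at the lost final step: cleaning the `alt` text into a filename and writing the image bytes to disk. A sketch of that step under those assumptions (the function names are mine; `requests.get(...).content` is the standard way to get the raw bytes):

```python
import re

def clean_name(alt, ext=".jpg"):
    """Keep only word characters (letters, digits, CJK) from the alt text."""
    return re.sub(r"\W+", "", alt) + ext

def save_image(content, path, name):
    """Write raw image bytes to <path>/<name>."""
    with open(f"{path}/{name}", "wb") as f:
        f.write(content)

print(clean_name("海邊 日落 4k!"))  # → '海邊日落4k.jpg'
```

In Python 3, `\W` is Unicode-aware by default, so Chinese characters in the alt text survive while spaces and punctuation are dropped.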

# Text-based UI
def ui():
    print("-----------netbian----------")
    print("全部", end=" ")  # "all"
    for c in classification_dict.keys():
        print(c, end=" ")
    print()
    choice = input("請輸入分類名:")  # "enter a category name"
    if choice == "全部":  # scrape every category
        for c in classification_dict.keys():
            select_classification(c)
    elif choice not in classification_dict.keys():
        print("輸入錯誤,請重新輸入!")  # "invalid input, try again"
        print("----")
        ui()
    else:
        select_classification(choice)

def main():
    if not os.path.exists(first_dir):
        os.mkdir(first_dir)
    init_classification()
    ui()


if __name__ == "__main__":
    main()

Reference: python爬取彼岸桌面桌布 (python scraping of netbian desktop wallpapers)
