Baidu Tieba Crawler

2021-09-13 17:51:54 Words: 1532 Views: 5234

# encoding: utf-8

import urllib.request
import urllib.parse
import time
import random

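def load_page(url):
    """Fetch the content of a web page by its URL.

    :param url: URL of the page to fetch
    :return: content of the page at that URL, decoded as UTF-8
    """
    # The header values are missing from the source listing; an empty dict
    # keeps the call valid. Request headers such as User-Agent would go here.
    headers = {}
    # The class is urllib.request.Request (capital R), not request.
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read()
    return content.decode("utf-8")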

def write_page(html, filename):
    """
    :param html: page content to save
    :param filename: name of the file to save the page content to
    :return: None
    """
    print("Saving file: " + filename)
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html)
    print("Finished saving file: " + filename)

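def tieba_spider(keyword, start, end):
    """
    :param keyword: name of the Tieba forum to crawl
    :param start: first page to crawl
    :param end: last page to crawl
    """
    # The function body is missing from the source listing.


keyword = input("Enter the name of the Tieba forum to crawl: ")
tieba_spider(keyword, 1, 5)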
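
The listing declares `tieba_spider(keyword, start, end)` and imports `urllib.parse`, `time`, and `random`, but the loop that would use them is not shown. The following is a minimal sketch of what such a loop might look like; it is not the author's code. The base URL, the `kw` and `pn` query parameters, the step of 50 posts per page, the output file names, and the helper names `build_page_url` and `crawl_pages` are all assumptions.

```python
import random
import time
import urllib.parse

# Assumption: list pages live at this address and "pn" is a post offset
# that advances by 50 per page. Neither detail appears in the article.
BASE_URL = "https://tieba.baidu.com/f?"
POSTS_PER_PAGE = 50


def build_page_url(keyword, page):
    """Return the assumed list-page URL for a 1-based page number."""
    query = urllib.parse.urlencode(
        {"kw": keyword, "pn": (page - 1) * POSTS_PER_PAGE}
    )
    return BASE_URL + query


def crawl_pages(keyword, start, end, fetch, save, max_delay=2.0):
    """Fetch pages start..end (inclusive) and hand each one to save().

    fetch(url) -> str and save(html, filename) are passed in, so the
    article's load_page and write_page can be used unchanged.
    """
    saved = []
    for page in range(start, end + 1):
        # Hypothetical file naming scheme, chosen for illustration only.
        filename = "{}_page_{}.html".format(keyword, page)
        save(fetch(build_page_url(keyword, page)), filename)
        saved.append(filename)
        if page < end:
            # Random pause between requests to avoid hammering the server.
            time.sleep(random.uniform(0, max_delay))
    return saved
```

With the article's helpers, a call such as `crawl_pages(keyword, 1, 5, load_page, write_page)` would save five pages, provided the assumed URL format holds. Passing the fetch and save functions in as arguments also makes the loop easy to try out without touching the network.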
