Automatically obtaining a cookie to scrape hot comments on Sina Weibo

1. Preface

2. Code

selenium is used only to obtain the cookie. The actual scraping sends requests directly with requests, which keeps the crawl efficient.
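As a minimal sketch of that hand-off, the helper below turns a cookie list shaped like the return value of Selenium's `driver.get_cookies()` (a list of dicts with `name` and `value` keys) into a single `Cookie` header string. The function name `cookies_to_header` and the sample cookie values are illustrative assumptions, not part of the original code.

```python
def cookies_to_header(cookies):
    """Join Selenium-style cookie dicts into one Cookie header value."""
    return '; '.join('{}={}'.format(c['name'], c['value']) for c in cookies)


# Made-up sample data in the shape returned by driver.get_cookies()
sample = [
    {'name': 'name1', 'value': 'abc'},
    {'name': 'name2', 'value': 'xyz'},
]
header_value = cookies_to_header(sample)
print(header_value)
```

The resulting string can then be placed in the `cookie` entry of the headers passed to requests, so the browser is only needed once per run.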

Without further ado: the code is not complicated, so here it is, with comments at the key points.

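import json
import re
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait


class spider_weibo(object):
    def __init__(self, id):
        self.chrome_options = Options()
        # Run Chrome in headless mode (no visible window)
        self.chrome_options.add_argument('--headless')
        # Original call kept as-is; newer Selenium 4 releases expect a Service object instead of executable_path
        self.driver = webdriver.Chrome(options=self.chrome_options, executable_path='chromedriver.exe')
        self.wait = WebDriverWait(self.driver, 100)
        # The original header dict is missing from the source; supply the request headers here
        self.headers = {}
        self.weibo_id = id

    # Convert the cookie dicts into a single string
    def get_cookielist(self):
        print('Fetching cookie')
        cookie_str = ''
        # The URL template is missing from the source; it is formatted with the weibo id and a page number
        url = ''.format(self.weibo_id, 1)
        self.driver.get(url)
        time.sleep(7)
        # Let selenium drive a real browser to obtain the visitor cookie
        cookielist = self.driver.get_cookies()
        for cookie in cookielist:
            cookie_str = cookie_str + cookie['name'] + '=' + cookie['value'] + ';'
        return cookie_str

    # Use a proxy IP (to be completed)
    def get_proxy(self, order_id):
        # The proxy service URL template is missing from the source
        url = ''.format(order_id)
        response = requests.get(url)

    # Parse the returned page with bs4
    def use_bs4(self, retext):
        # Initialise the string that collects the comments
        text = ''
        retextjson = json.loads(retext)
        # Extract the HTML fragment from the JSON response
        data = retextjson.get("data").get('html')
        soup = BeautifulSoup(data, 'lxml')
        # The attrs filters here and below are missing from the source; {} is a
        # placeholder that matches every div, so restore the real filters before use
        ul_list = soup.select('.list_box')[0].select('.list_ul')[0].find_all('div', attrs={})
        for ul in ul_list:
            try:
                list_con = ul.find_all('div', attrs={})[0]
                content = list_con.find_all('div', attrs={})[0].text
                text = text + content + '\n'
            except Exception as e:
                print('error', e)
        return text

    def spider(self, page_num):
        session = requests.session()
        # Obtain the cookie
        cookie_str = self.get_cookielist()
        print("cookie:", cookie_str)
        # Set the cookie on the request headers
        self.headers['cookie'] = cookie_str
        # Open the output file with UTF-8 encoding; the with block closes it on exit
        with open('comment.txt', 'w', encoding='utf-8') as file:
            for i in range(page_num):
                try:
                    # Hot-comment request URL (template missing from the source)
                    url = ''.format(self.weibo_id, i)
                    response = session.get(url, headers=self.headers)
                    # 'unicode' is not a valid codec name; decode the body as UTF-8
                    response.encoding = 'utf-8'
                    text = self.use_bs4(response.text)
                    print(text)
                    file.write(text)
                    time.sleep(2)
                except Exception as e:
                    print(e)
        # Shut down the headless browser once scraping is finished
        self.driver.quit()


if __name__ == '__main__':
    # Enter the number of pages to scrape
    page_number = input("enter page num: ")
    # Convert the page count to int
    page_num = int(page_number)
    # Enter the weibo id
    id = input("enter weibo id: ")
    # id = '4391901606692228'
    weibo_spider = spider_weibo(id)
    weibo_spider.spider(page_num)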
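
The use_bs4 method assumes every response body is JSON carrying the comment markup under the "data" and "html" keys, so a different response raises inside the page loop. As a hedged sketch, the hypothetical helper `extract_html` below returns None instead of raising when the body is not JSON or the keys are absent, for instance if the server answers with an error page. The helper name and the sample payloads are assumptions made for illustration.

```python
import json


def extract_html(retext):
    """Return the HTML fragment stored under data.html, or None if absent."""
    try:
        payload = json.loads(retext)
    except ValueError:
        # Body was not valid JSON (json.JSONDecodeError subclasses ValueError)
        return None
    if not isinstance(payload, dict):
        return None
    data = payload.get('data')
    if not isinstance(data, dict):
        return None
    html = data.get('html')
    return html if isinstance(html, str) else None


# Invented sample payloads
print(extract_html('{"data": {"html": "<div>hi</div>"}}'))
print(extract_html('not json'))
```

A caller could skip the page when the helper returns None, rather than relying on the broad except clause to swallow the error.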

My knowledge is limited and the code is rough, so please point out any shortcomings!
