Scraping Douban Books data with Python

XPath scraping script:

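from urllib import request

from lxml import etree

base_url = '...'  # URL elided in the source

# Fetch the page and decode the response body as UTF-8
response = request.urlopen(base_url)
html = response.read().decode('utf-8')

# Parse the HTML into an element tree that can be queried with XPath
htmls = etree.HTML(html)

titles = htmls.xpath('//div[@class="threadlist_lz clearfix"]/div/a/@title')

for i in titles:
    print(i)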

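Why not select the element directly?

Because the target element has no class attribute of its own, it is hard to select directly. Instead, look at its parent and check which class it is nested in, then write the path starting from that class.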
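To illustrate going through the parent, here is a minimal sketch on a made-up HTML fragment (the markup in SAMPLE is an assumption for illustration, not Douban's actual page). The a element has no class, so the expression anchors on the enclosing div's class and then steps down to the link's title attribute.

```python
from lxml import etree

# Hypothetical fragment: the <a> has no class, but its parent <div> does.
SAMPLE = '''
<html><body>
  <div class="pl2"><a href="#" title="Book One">Book One</a></div>
  <div class="other"><a href="#" title="Not a book">x</a></div>
  <div class="pl2"><a href="#" title="Book Two">Book Two</a></div>
</body></html>
'''

tree = etree.HTML(SAMPLE)

# Selecting every <a> directly also picks up links we do not want.
all_titles = tree.xpath('//a/@title')

# Anchoring on the parent's class narrows the match to the book links.
book_titles = tree.xpath('//div[@class="pl2"]/a/@title')

print(all_titles)
print(book_titles)
```

Note that @class="pl2" is an exact string comparison, so it only matches when the attribute value is exactly pl2.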

//div[@class="pl2"]/a/@title  (title)

//p[@class="pl"]/text()  (author)

For a p tag, use text() to get its text content.

//span[@class="rating_nums"]/text()  (rating)

Short review:

//span[@class="inq"]/text()

//div[@class="movie-content"]/a/img/@src
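The expressions above can be combined into one record per book. The following is only a sketch under stated assumptions: the SAMPLE fragment and the wrapping div with class "item" are invented for illustration, and the helper names first_text and parse_books are hypothetical. Only the class names pl2, pl, rating_nums and inq come from the text.

```python
from lxml import etree

# Hypothetical fragment reusing the class names from the expressions above.
SAMPLE = '''
<div class="item">
  <div class="pl2"><a href="#" title="Example Book">Example Book</a></div>
  <p class="pl">Some Author / Some Press / 2000 / 10.00</p>
  <span class="rating_nums">9.0</span>
  <span class="inq">A one-line review.</span>
</div>
'''

def first_text(node, expr):
    """Return the first XPath result with whitespace stripped, or '' if none."""
    found = node.xpath(expr)
    return found[0].strip() if found else ''

def parse_books(tree):
    """Collect one dict per book container (the 'item' class is assumed)."""
    books = []
    for item in tree.xpath('//div[@class="item"]'):
        # A leading '.' makes each expression relative to the current item.
        books.append({
            'title': first_text(item, './/div[@class="pl2"]/a/@title'),
            'author_info': first_text(item, './/p[@class="pl"]/text()'),
            'rating': first_text(item, './/span[@class="rating_nums"]/text()'),
            'quote': first_text(item, './/span[@class="inq"]/text()'),
        })
    return books

books = parse_books(etree.HTML(SAMPLE))
print(books)
```

Querying relative to each container keeps the fields of one book together. If the four expressions were run over the whole page separately, a book without a short review would leave the lists with different lengths and the values would no longer line up.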

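# Scrape the Douban Books Top 250

from urllib import request

from lxml import etree

fp = open('./douban.txt', 'a', encoding='utf-8')

def index():
    for i in range(0, 226, 25):  # build the page offsets: 0, 25, ..., 225
        # print(i)
        base_url = '...'  # URL elided in the source
        # print(base_url)

        # Fetch the page source
        response = request.urlopen(base_url)
        html = response.read().decode('utf-8')

        # Process the source: etree parses the HTML into an element tree
        htmls = etree.HTML(html)  # now it can be queried with XPath
        clean_sto(htmls)

def clean_sto(htmls):
    titles = htmls.xpath('//div[@class="pl2"]/a/@title')
    # print(titles) shows one list of titles per page
    for i in titles:
        # print(i)
        fp.write(i + '\n')

if __name__ == '__main__':
    index()
    fp.close()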
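The paging loop and the file output can also be sketched without any network access. This is an illustration under assumptions, not the original script: URL_TEMPLATE is a hypothetical placeholder (the real URL is not shown above), and the helper names page_urls and save_titles are made up. It uses a with block so the file is closed even if writing fails.

```python
import os
import tempfile

# Hypothetical placeholder; the real URL is not shown in the text above.
URL_TEMPLATE = 'https://example.com/top250?start={offset}'

def page_urls(template=URL_TEMPLATE, total=250, per_page=25):
    """Build one URL per page; offsets run 0, 25, ..., 225 for 250 items."""
    return [template.format(offset=offset)
            for offset in range(0, total, per_page)]

def save_titles(titles, path):
    """Append titles to a UTF-8 text file, one per line."""
    with open(path, 'a', encoding='utf-8') as fp:
        for title in titles:
            fp.write(title + '\n')

urls = page_urls()
print(len(urls), urls[0], urls[-1])

# Write to a temporary directory so the demo leaves no file behind in the cwd.
demo_path = os.path.join(tempfile.mkdtemp(), 'douban.txt')
save_titles(['Book One', 'Book Two'], demo_path)
with open(demo_path, encoding='utf-8') as fp:
    saved = fp.read()
print(saved)
```

Because the file is opened in append mode, as in the script above, running the scraper twice would write every title twice; opening with 'w' instead would overwrite the previous run.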
