Python crawler example

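# -*- coding: utf-8 -*-

import re
from time import sleep

import requests
from bs4 import BeautifulSoup

# The imports above pull in the packages the script needs. The original
# Python 2 version also called reload(sys) and sys.setdefaultencoding('utf-8')
# to work around encoding bugs; Python 3 does not need that.

# Fill in your own browser's request headers. The value below is only a placeholder.
headers = {'User-Agent': 'Mozilla/5.0'}


def xs2(url):
    # Python 3 strings are already Unicode, so a path containing Chinese
    # characters no longer needs the unicode(path, 'utf-8') conversion.
    path = r'e:/desktop/img/cc.txt'
    req = requests.get(url, headers=headers).text
    # BeautifulSoup is used here because it makes matching easier.
    soup = BeautifulSoup(req, 'html.parser')
    # On Zongheng, the body text of the HTML is all written inside <p> tags.
    paragraphs = soup.find_all('p')
    # Match the book title.
    title_txtbox = soup.find_all(class_='title_txtbox')
    # Append to the output file.
    with open(path, 'a+', encoding='utf-8') as fn:
        fn.write(title_txtbox[0].get_text())
        for p in paragraphs:
            pp = p.get_text()
            fn.write(pp)
            print("Writing " + pp)
        fn.write("\n")  # add a newline after finishing one chapter
    # The original never defines nextchapter. It is assumed here to be the
    # next-chapter link element; the class name is an assumption.
    nextchapter = soup.find_all(class_='nextchapter')
    # Match the href attribute; (.*?) captures the part we want.
    ree = re.findall(r'href="(.*?)"', str(nextchapter))
    # Sleep for 2 seconds. Requesting too fast may get your IP blocked by
    # anti-crawler measures. If that happens, switching to different headers
    # lets you keep going; the author jokingly suggests using someone else's
    # headers in normal use.
    sleep(2)


if __name__ == '__main__':
    url = ''  # Zongheng chapter URL (redacted in the original)
    xs2(url)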
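
The script collects the `href` values with `re.findall` but never does anything with the result. As a rough sketch of what a next step might look like, the snippet below applies the same regular expression to a small HTML fragment and resolves each match against the page URL with `urljoin`, so that a relative link becomes something `requests.get` could fetch. The function name `extract_links`, the sample markup, and the `example.com` URL are all assumptions made for illustration; they do not come from the original post or from the Zongheng site.

```python
import re
from urllib.parse import urljoin


def extract_links(fragment, base_url):
    """Return absolute URLs for every href="..." found in an HTML fragment."""
    # Same pattern as in the script: capture whatever sits between the quotes.
    hrefs = re.findall(r'href="(.*?)"', fragment)
    # Resolve relative links against the URL of the page they came from.
    return [urljoin(base_url, h) for h in hrefs]


# Hypothetical markup and base URL, for illustration only.
fragment = '<a class="nextchapter" href="/chapter/2.html">Next</a>'
links = extract_links(fragment, 'http://example.com/chapter/1.html')
print(links)
```

A regular expression is enough for a single, predictable link like this one. For messier markup, reading the attribute from the parsed tag (for example `tag.get('href')` in BeautifulSoup) is usually more robust.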
