Python: scraping HTML content and saving it to a txt file

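# @updatetime : 2020-12-08 16:53
# @author : wz
# @file : get_webdetails
# @software: pycharm
# @used: scrape arbitrary data from arbitrary pages
import re
import urllib.request
from utils.log import logger  # project-local logging helper

logger_message = logger()

# Fetch the index page (HTML). Decode with the page's real encoding
# (utf-8 here; use 'gbk' for GBK-encoded pages).
html = urllib.request.urlopen("").read()
html = html.decode('utf-8')
# print(html)

# Extract links and chapter titles with a regular expression.
# Adapt the pattern to the target page's markup; its groups should
# capture the link first and the title last.
reg = r'(.*?) (.*?) '
urls = re.findall(reg, html)  # the matched links and titles come back unstructured
# print(urls)

for url in urls:
    chapter_titles = url[-1]        # last captured group: chapter title
    chapter_url = '' + str(url[0])  # site prefix + link from the first group
    # print(url[0])
    # logger_message.loginfo(chapter_url + '\t' + chapter_titles)
    htmls = urllib.request.urlopen(chapter_url).read()
    htmls = htmls.decode('utf-8')
    # print(htmls)

    # Adapt this pattern to the markup that wraps the chapter text.
    content_pattern = r'(.*?)\n'
    content = re.findall(content_pattern, htmls)
    # print(content)
    for paragraph in content:
        # Remove line breaks and full-width indentation spaces.
        strs = paragraph.replace("\n", "")
        stres = strs.replace("　　", "")
        nextes = chapter_titles + "\t" + stres
        # Option 1: append all text to a single file.
        with open('name.txt', 'a', encoding='utf-8') as fn:
            fn.write(chapter_titles + "\n" + nextes)
        # Option 2: save each chapter to its own txt file.
        with open(chapter_titles + '.txt', 'w', encoding='utf-8') as fn:
            fn.write(nextes)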
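
The link-extraction step of the script above can be tried offline before pointing it at a real site. The following is a minimal sketch, not the author's exact code: the sample markup, the `<a href="...">` pattern, the `extract_chapters` helper, and the `https://example.com/` base URL are all assumptions made for illustration. It also uses `urllib.parse.urljoin` instead of string concatenation to turn relative links into absolute ones.

```python
import re
from urllib.parse import urljoin

# Hypothetical sample markup; the real page's structure may differ.
SAMPLE_HTML = '''
<ul>
  <li><a href="/book/1.html">Chapter 1</a></li>
  <li><a href="/book/2.html">Chapter 2</a></li>
</ul>
'''

# Assumed pattern: capture the href value and the link text of each <a> tag.
LINK_PATTERN = re.compile(r'<a href="(.*?)">(.*?)</a>')


def extract_chapters(html, base_url):
    """Return a list of (absolute_url, title) pairs found in html."""
    return [(urljoin(base_url, href), title.strip())
            for href, title in LINK_PATTERN.findall(html)]


chapters = extract_chapters(SAMPLE_HTML, "https://example.com/")
for chapter_url, chapter_title in chapters:
    print(chapter_url, chapter_title)
```

Unpacking each match into named variables avoids indexing mistakes such as reading a group that the pattern does not define.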

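Scraping web page content with Python

The sequence diagram is shown in the figure. Given a URL to visit, fetch the HTML page and its content, then iterate over one kind of link in that page, such as the href attribute of a tags. Follow those links to the corresponding HTML pages and extract the content of a fixed set of tags from each one; if several tags are needed, concatenate their contents as strings. Finally, remove all tags with a regular expression and write the remaining content to a .txt file...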
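
The last two steps described above (removing tags with a regular expression and writing the result to a .txt file) can be sketched as follows. This is only an illustration under stated assumptions: the `strip_tags` and `save_text` helpers, the sample fragments, and the output file name are hypothetical, and a regex-based tag stripper is adequate for simple markup only, not for arbitrary HTML.

```python
import re
import tempfile
from pathlib import Path

# Matches anything that looks like an HTML tag, e.g. <p> or </div>.
TAG_PATTERN = re.compile(r'<[^>]+>')


def strip_tags(fragment):
    """Remove tag-like substrings and collapse runs of whitespace."""
    text = TAG_PATTERN.sub('', fragment)
    return re.sub(r'\s+', ' ', text).strip()


def save_text(fragments, path):
    """Strip tags from each fragment, join them, and write to path as UTF-8."""
    text = '\n'.join(strip_tags(fragment) for fragment in fragments)
    Path(path).write_text(text, encoding='utf-8')
    return text


# Hypothetical fragments, standing in for tag contents taken from a page.
fragments = ['<h1>Title</h1>', '<p>First <b>paragraph</b>.</p>']
out_path = Path(tempfile.gettempdir()) / 'page_content.txt'
saved = save_text(fragments, out_path)
print(saved)
```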

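Scraping page content with Python and counting occurrences of a given field

Overall approach:
1. Get the URL of the page to scrape.
2. Use the requests and BeautifulSoup libraries to fetch the page content, find the pattern of the field to be counted, and save the content to a local file in XML format.
3. Read the content of the saved local file.
4. Use split to obtain the number of occurrences of the given field.
#!/usr/bin/env python # coding: utf-8 i...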
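
Step 4 above, counting a field by splitting on it, can be sketched as below. The `count_field` helper and the sample XML string are assumptions for illustration; the teaser does not show the original implementation. Splitting a string on a separator yields one more piece than the number of non-overlapping occurrences, so the count is the number of pieces minus one.

```python
def count_field(text, field):
    """Count non-overlapping occurrences of field in text using split."""
    if not field:
        raise ValueError("field must be a non-empty string")
    return len(text.split(field)) - 1


# Hypothetical content, standing in for the saved local file.
SAMPLE_XML = "<items><item>a</item><item>b</item><item>c</item></items>"
item_count = count_field(SAMPLE_XML, "<item>")
print(item_count)
```

For plain substring counting, `str.count` gives the same result; the split form is shown because it is the one the steps above name.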

Scraping static web page content (Python)

Taking vulnerability scanning as an example:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pymysql as mysqldb
import re
import os
# insert data
def insertdata lis cursor...
