Python practice program: extracting the main text from HTML

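I found some very interesting projects on GitHub. As a Python beginner my programming skills are still fairly weak, so to strengthen my Python I am working through the exercises others have shared and implementing each one myself.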

Today I am practicing exercise 0008. The task is as follows:

First, here is my HTML file.

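I plan to extract every title and abstract from the HTML file and then store them in MongoDB. First, look at the HTML source: right-click the page and choose "View page source", and the source code appears on the right.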
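The storage step itself is not shown in this post. A minimal sketch of what it could look like, assuming the third-party pymongo package and a MongoDB server running on localhost. The helpers `build_documents` and `save_to_mongodb`, the database name `practice`, and the collection name `articles` are all hypothetical:

```python
def build_documents(pairs):
    """Turn (title, abstract) tuples into dicts that MongoDB can store."""
    return [{'title': title.strip(), 'abstract': abstract.strip()}
            for title, abstract in pairs]


def save_to_mongodb(docs, uri='mongodb://localhost:27017',
                    db_name='practice', coll_name='articles'):
    """Insert the documents and return how many were written."""
    # Imported here so the rest of the script works without pymongo installed.
    from pymongo import MongoClient

    if not docs:
        return 0  # insert_many raises on an empty list
    client = MongoClient(uri)
    try:
        result = client[db_name][coll_name].insert_many(docs)
        return len(result.inserted_ids)
    finally:
        client.close()


# Building the documents needs no database connection.
docs = build_documents([(' First title ', ' First abstract ')])
print(docs)
```

Calling `save_to_mongodb(docs)` would then write the documents, provided a server is reachable at the assumed URI.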

Here is where the content I want to extract sits in the source:

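I plan to extract the content with the simplest tool, regular expressions. In later practice projects I will try extracting with dedicated libraries.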
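As a small illustration of the regex approach, here is a sketch on made-up markup. The `<h3 class="title">` and `<p class="abstract">` tags are assumptions for the example, not the structure of the page used in this post, and `extract_pairs` is a hypothetical helper:

```python
import re

# Hypothetical markup; the real page in this post has a different structure.
SAMPLE_HTML = """
<div class="item">
  <h3 class="title">First title</h3>
  <p class="abstract">First abstract,
  spread over two lines.</p>
</div>
<div class="item">
  <h3 class="title">Second title</h3>
  <p class="abstract">Second abstract.</p>
</div>
"""

# One pattern captures a title and its abstract together.
# re.S lets '.' match newlines; '.*?' is non-greedy, so each group
# stops at the first closing tag instead of the last one.
ITEM_PATTERN = re.compile(
    r'<h3 class="title">(.*?)</h3>\s*<p class="abstract">(.*?)</p>',
    re.S,
)


def extract_pairs(html):
    """Return a list of (title, abstract) tuples with whitespace collapsed."""
    pairs = []
    for title, abstract in ITEM_PATTERN.findall(html):
        pairs.append((' '.join(title.split()), ' '.join(abstract.split())))
    return pairs


pairs = extract_pairs(SAMPLE_HTML)
for title, abstract in pairs:
    print('title:', title, '| abstract:', abstract)
```

Capturing both fields in one pattern keeps each title next to its own abstract, so the two lists cannot drift out of step.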

This post is mainly a record of the method. I plan to set aside time to study regular expressions properly!

The Python code is as follows:

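import re
import codecs


def read_html(path):
    """
    Read an HTML file.
    :param path: path to the HTML file
    :return: the file contents as a string
    """
    with codecs.open(path, 'r', encoding='utf-8') as f:
        html = f.read()
    return html


def parse_html(html):
    """
    Parse the HTML file with regular expressions.
    Note: one abstract is not matched by the regex, so it is ignored for now,
    and the title that belongs to it is dropped as well.
    Oh well, this is too hard for me.
    :param html: the HTML source as a string
    :return: None (the results are printed)
    """
    # The HTML tags that were inside these patterns did not survive publishing;
    # they are kept here exactly as they appear in the post.
    x_pattern = re.compile(r'(.*?)(.*?)', re.S)
    y_pattern = re.compile(r'.*?(.*?)|.*?(.*?)|(.*?)', re.S)
    x_groups = x_pattern.findall(html)
    y_groups = y_pattern.findall(html)
    # Drop the unneeded title so the titles stay aligned with the extracted abstracts
    x_groups.pop(-3)
    for title, abstract in zip(x_groups, y_groups):
        # The replace() arguments were also tags that are missing from the post.
        print('title:', ''.join(title).strip(), '\n',
              'abstract:', ''.join(abstract).strip().replace('', '').replace('', ''))


if __name__ == '__main__':
    path = ''  # the file path is missing from the post; set it to your HTML file
    html = read_html(path)
    parse_html(html)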
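Because titles and abstracts are matched by two separate patterns and then paired with zip, a single missed abstract shifts every later pair, which is why one title has to be removed by hand. A possible alternative, sketched here with the standard library's html.parser, collects each title together with the abstract that follows it. The tag and class names are assumptions for illustration, and `PairCollector` is a hypothetical class:

```python
from html.parser import HTMLParser


class PairCollector(HTMLParser):
    """Collect (title, abstract) pairs; a title without an abstract gets ''."""

    def __init__(self):
        super().__init__()
        self.pairs = []      # finished [title, abstract] entries
        self._field = None   # 'title', 'abstract', or None when outside both
        self._buffer = []    # text pieces of the field being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get('class')
        if tag == 'h3' and css_class == 'title':
            self._field, self._buffer = 'title', []
        elif tag == 'p' and css_class == 'abstract':
            self._field, self._buffer = 'abstract', []

    def handle_data(self, data):
        if self._field is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        closes = {'title': 'h3', 'abstract': 'p'}
        if self._field is None or closes[self._field] != tag:
            return  # e.g. a nested </b> inside an abstract
        text = ' '.join(''.join(self._buffer).split())
        if self._field == 'title':
            self.pairs.append([text, ''])   # start a new entry
        elif self.pairs:
            self.pairs[-1][1] = text        # attach to the latest title
        self._field = None


# Hypothetical markup; the second item deliberately has no abstract.
SAMPLE_HTML = """
<h3 class="title">First title</h3><p class="abstract">First abstract.</p>
<h3 class="title">Title without abstract</h3>
<h3 class="title">Third title</h3><p class="abstract">Third <b>abstract</b>.</p>
"""

collector = PairCollector()
collector.feed(SAMPLE_HTML)
collector.close()
pairs = [tuple(p) for p in collector.pairs]
for title, abstract in pairs:
    print('title:', title, '| abstract:', abstract)
```

With this layout a missing abstract only leaves one empty field; the remaining titles keep their own abstracts.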
