Scraping the News from the Campus News Home Page

2022-06-02 06:12:12 · Words: 2368 · Views: 6026

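1. Use the requests and BeautifulSoup libraries to scrape the title, link, body text, and show-info of each news item on the campus news home page.

2. Parse the info string to get each item's publication time, author, **, photographer, and similar details.

3. Convert the publication time from a string to a datetime object.

4. Use a regular expression to extract the news ID.

5. Build the request URL for the click count.

6. Fetch the click count.

7. Wrap steps 4, 5, and 6 in one function: def getclickcount(newsurl):

8. Wrap the code that fetches the news details in one function: def getnewsdetail(newsurl):

9. Try using regular expressions to parse the show-info string and the click-count string.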
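Steps 3 and 4 can be tried in isolation before any network request is made. The sketch below is a minimal illustration under stated assumptions: the sample time string and URL are made up, the helper names parse_publish_time and extract_news_id are hypothetical, and the regular expression anchors on the digits right before .html rather than slicing the last four characters of a broader match.

```python
import re
from datetime import datetime


def parse_publish_time(text):
    """Convert a 'YYYY-MM-DD HH:MM:SS' string into a datetime object."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def extract_news_id(news_url):
    """Return the digits right before a trailing '.html', or None if absent."""
    match = re.search(r"(\d+)\.html$", news_url)
    return match.group(1) if match else None


# Hypothetical sample values, not taken from the real site.
sample_time = "2018-04-01 11:57:00"
sample_url = "http://news.example.edu/html/2018/xiaoyuanxinwen_0404/9183.html"

published = parse_publish_time(sample_time)
news_id = extract_news_id(sample_url)
print(published, news_id)
```

Returning None for a URL without an ID lets the caller skip that item instead of failing on .group(1) of a missing match.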

import requests
from bs4 import BeautifulSoup
from datetime import datetime
import locale
import re

# String literals left empty below (URLs, regex patterns, prefixes, and
# format strings) are missing from this copy of the post.

# The 'chinese' locale name is Windows-specific.
locale.setlocale(locale.LC_CTYPE, 'chinese')

url = ""
res = requests.get(url)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')


def getclickcount(newsurl):
    # Steps 4-6: news ID -> click-count request URL -> click count.
    newsid = re.search(r"_(.*)\.html", newsurl).group(1)[-4:]
    clicktimesurl = ("").format(newsid)
    # The response ends with .html('N'); keep only N.
    clicktimes = int(requests.get(clicktimesurl).text.split(".html(")[-1].lstrip("'").rstrip("');"))
    return clicktimes


def getnewsdetail(newsurl):
    resdet = requests.get(newsurl)
    resdet.encoding = 'utf-8'
    soupdet = BeautifulSoup(resdet.text, 'html.parser')
    contentdetail = soupdet.select('#content')[0].text
    showinfo = soupdet.select('.show-info')[0].text
    # The first 19 characters after the prefix are 'YYYY-MM-DD HH:MM:SS'.
    date = showinfo.lstrip("")[:19]
    author = re.search('', showinfo).group(1)
    checker = re.search('', showinfo).group(1)
    source = re.search('', showinfo).group(1)
    # Use the function's own argument, not the global loop variable.
    clicktimes = getclickcount(newsurl)
    # Named 'published' because assigning to 'datetime' here would shadow
    # the imported class and raise UnboundLocalError.
    published = datetime.strptime(date, '%Y-%m-%d %H:%M:%S')
    print("".format(published, author, checker, source, clicktimes))
    print(contentdetail)


for news in soup.select('li'):
    # Only <li> elements that contain a news title are news items.
    if len(news.select('.news-list-title')) > 0:
        title = news.select('.news-list-title')[0].text
        description = news.select('.news-list-description')[0].text
        info = news.select('.news-list-info')[0].text
        address = news.select('a')[0]['href']
        print("".format(title, description, info, address))
        getnewsdetail(address)
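Step 9 asks for regular expressions on the show-info string and on the click-count response. One possible sketch follows, under assumptions: the field labels in FIELD_LABELS and both sample strings are hypothetical, since the real page text is not reproduced here, and the helper names parse_show_info and parse_click_count are invented for illustration. The only detail taken from the code above is that the click-count response contains .html('N'), which is what the split and strip calls in getclickcount rely on.

```python
import re

# Hypothetical labels; the real labels on the page are an assumption.
FIELD_LABELS = {"author": "作者", "checker": "審核", "source": "來源"}


def parse_show_info(show_info, labels=FIELD_LABELS):
    """Extract 'label:value' fields; fields that are not found map to None."""
    result = {}
    for key, label in labels.items():
        # Accept a full-width or ASCII colon, then capture the next
        # run of non-whitespace characters as the value.
        match = re.search(re.escape(label) + r"[::]\s*(\S+)", show_info)
        result[key] = match.group(1) if match else None
    return result


def parse_click_count(response_text):
    """Read N from a response containing .html('N'); None if absent."""
    match = re.search(r"\.html\('(\d+)'\)", response_text)
    return int(match.group(1)) if match else None


# Hypothetical sample strings, not taken from the real site.
sample_info = "發布時間:2018-04-01 11:57:00 作者:王小明 審核:李四 來源:學生處"
sample_response = "$('#hits').html('1234');"

fields = parse_show_info(sample_info)
clicks = parse_click_count(sample_response)
print(fields, clicks)
```

Compared with chained split, lstrip, and rstrip calls, a single pattern with one capture group fails in one visible place when the response format changes.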
