Data Structuring and Storage

2022-05-27 18:36:06 · 4,013 characters · 8,229 reads

The code for this assignment is a classmate's: my own version does not yet extract the news metadata, so it cannot add that information to the dictionary. I have practised the relevant pandas methods and exported the results to an Excel file. PS: I will fix my own code as soon as possible!

import re
from datetime import datetime

import pandas
import requests
from bs4 import BeautifulSoup

# Get the click count for a news article
def getnewsid(url):
    newsid = re.findall(r'\_(.*).html', url)[0][-4:]
    clickurl = ''.format(newsid)  # click-count API URL (lost from the source)
    clickres = requests.get(clickurl)
    # Pull the click count out of the response with a regular expression
    clickcount = int(re.search(r"hits'\).html\('(.*)'\);", clickres.text).group(1))
    return clickcount
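As a quick check of that regular expression, the snippet below runs it against a hypothetical fragment of the click-count response (the real API URL is missing from this post, so the jQuery-style string here is an assumption about the response's shape):

```python
import re

# Hypothetical response fragment shaped like the jQuery call the regex targets
sample = "$('#hits').html('3125');"
clickcount = int(re.search(r"hits'\).html\('(.*)'\);", sample).group(1))
print(clickcount)  # 3125
```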

# Append the news body text to a file, so earlier articles are not overwritten
def writenewscontenttofile(content):
    f = open('gzccnews.txt', 'a', encoding='utf-8')
    f.write(content)
    f.close()
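The same append step can also be written with a `with` block, which closes the file even if the write raises; this is just an alternative sketch of the step above (the `path` parameter is an addition for testability):

```python
def writenewscontenttofile(content, path='gzccnews.txt'):
    # 'a' appends, so earlier articles stay in the file;
    # the with block closes the file even on error
    with open(path, 'a', encoding='utf-8') as f:
        f.write(content)
```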

# Get the details of one news article
def getnewsdetail(newsurl):
    resd = requests.get(newsurl)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    newsdict = {}
    content = soupd.select('#content')[0].text
    writenewscontenttofile(content)
    info = soupd.select('.show-info')[0].text
    newsdict['title'] = soupd.select('.show-title')[0].text
    # Recognize the date/time string
    date = re.search(r'(\d{4}.\d{2}.\d{2}\s\d{2}.\d{2}.\d{2})', info).group(1)
    # The info line lists up to four labelled fields, in order:
    # 作者 (author), 審核 (check), 來源 (sources), 攝影 (photo)
    if info.find('作者:') > 0:
        newsdict['author'] = re.search(r'作者:(.*)\s*審', info).group(1)
    else:
        newsdict['author'] = 'none'
    if info.find('審核:') > 0:
        newsdict['check'] = re.search(r'審核:(.*)\s*來', info).group(1)
    else:
        newsdict['check'] = 'none'
    if info.find('來源:') > 0:
        newsdict['sources'] = re.search(r'來源:(.*)\s*攝', info).group(1)
    else:
        newsdict['sources'] = 'none'
    if info.find('攝影:') > 0:
        newsdict['photo'] = re.search(r'攝影:(.*)\s*點', info).group(1)
    else:
        newsdict['photo'] = 'none'
    # Use datetime.strptime to convert the time string to a datetime
    newsdict['datetime'] = datetime.strptime(date, '%Y-%m-%d %H:%M:%S')
    # Call getnewsid() to get the click count
    newsdict['click'] = getnewsid(newsurl)
    return newsdict
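The `strptime` call only works if the format directives have exactly the right case (`%Y` four-digit year, `%H` 24-hour clock, versus `%y`/`%h` which mean something else). A standalone check, using a made-up time string in the same shape as the site's:

```python
from datetime import datetime

# Example time string (the value itself is made up)
date = '2018-04-11 14:05:39'
dt = datetime.strptime(date, '%Y-%m-%d %H:%M:%S')
print(dt.year, dt.month, dt.hour)  # 2018 4 14
```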

# Crawl one page of the news list
def getlistpage(listurl):
    res = requests.get(listurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newslist = []
    for new in soup.select('li'):
        if len(new.select('.news-list-title')) > 0:
            title = new.select('.news-list-title')[0].text
            description = new.select('.news-list-description')[0].text
            newsurl = new.select('a')[0]['href']
            print(''.format(title, description, newsurl))  # print template lost from the source
            # Call getnewsdetail() to fetch the article details
            newsdict = getnewsdetail(newsurl)
            newslist.append(newsdict)
    total.extend(newslist)
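The `select()` calls in `getlistpage` are plain CSS selectors. The fragment below exercises the same class and attribute lookups on a made-up list item (the HTML is an assumption about the page's shape, not the real markup):

```python
from bs4 import BeautifulSoup

# A made-up <li> shaped like one entry of the news list page
html = '''<li><a href="news_0404_9183.html">
<div class="news-list-title">Sample title</div>
<div class="news-list-description">Sample description</div></a></li>'''
new = BeautifulSoup(html, 'html.parser').select('li')[0]
title = new.select('.news-list-title')[0].text     # class selector
newsurl = new.select('a')[0]['href']               # tag selector + attribute access
print(title, newsurl)  # Sample title news_0404_9183.html
```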

total = []
listurl = ''  # news list URL lost from the source
getlistpage(listurl)

# Work out how many list pages there are (10 articles per page)
res = requests.get(listurl)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
listcount = int(soup.select('.a1')[0].text.rstrip('條')) // 10 + 1

# Export the collected dictionaries to Excel with pandas
df = pandas.DataFrame(total)
df.to_excel('newsresult.xlsx')

# Articles with more than 3000 clicks published by 學生處
print(df[['click', 'author', 'datetime', 'sources']][(df['click'] > 3000) & (df['sources'] == '學生處')])
# Articles with more than 3000 clicks published by 學校綜合辦
print(df[(df['click'] > 3000) & (df['sources'] == '學校綜合辦')])
# First six rows of selected columns
print(df[['click', 'author', 'sources']].head(6))
# Articles whose source is in a given list
news_info = ['國際學院', '學生工作處']
print(df[df['sources'].isin(news_info)])
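The boolean-mask and `isin` filters at the end can be tried on a tiny synthetic frame; the rows below are made up stand-ins for the scraped data:

```python
import pandas

# Synthetic stand-in for the scraped result (rows are made up)
df = pandas.DataFrame([
    {'click': 3500, 'author': 'a', 'sources': '學生處'},
    {'click': 1200, 'author': 'b', 'sources': '學校綜合辦'},
    {'click': 4100, 'author': 'c', 'sources': '國際學院'},
])
hot = df[(df['click'] > 3000) & (df['sources'] == '學生處')]  # boolean mask
picked = df[df['sources'].isin(['國際學院', '學生工作處'])]     # membership filter
print(len(hot), picked['author'].tolist())  # 1 ['c']
```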
