Data Structuring and Storage

2022-01-13 04:18:24 · 3,787 characters · 2,033 views

1. Save the body text of the news article to a text file.

# res is the Response object fetched earlier with requests.get()
soup = BeautifulSoup(res.text, 'html.parser')
content = soup.select('.show-content')[0].text
f = open('news.txt', 'w', encoding='utf-8')
f.write(content)
f.close()
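An equivalent sketch using a `with` block, which closes the file automatically even if the write raises (the sample `content` string is a stand-in for the scraped text):

```python
# Stand-in for the scraped .show-content text.
content = 'news body text'

# `with` closes the file automatically when the block exits.
with open('news.txt', 'w', encoding='utf-8') as f:
    f.write(content)

# Read it back to confirm the round trip.
with open('news.txt', 'r', encoding='utf-8') as f:
    print(f.read())
```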

2. Structure the news data as a list of dictionaries:
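As a sketch of the target structure (the field names here are illustrative assumptions, not taken from the original scraper), each news item becomes one dictionary appended to a shared list:

```python
# Sketch: one dictionary per news item, collected into a list.
# Field names (title, source, clickcount) are illustrative assumptions.
newstotal = []

def make_record(title, source, clickcount):
    """Bundle one news item's fields into a dictionary."""
    return {'title': title, 'source': source, 'clickcount': clickcount}

newstotal.append(make_record('Campus news A', '學校綜合辦', 2500))
newstotal.append(make_record('Campus news B', '圖書館', 120))
print(len(newstotal), newstotal[0]['title'])
```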

3. Install pandas and build a DataFrame object df with pandas.DataFrame(newstotal).
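pandas.DataFrame accepts a list of dictionaries directly: each dict becomes a row and each key a column. A minimal sketch with hypothetical records standing in for newstotal:

```python
import pandas

# Hypothetical records standing in for the scraped newstotal list.
newstotal = [
    {'title': 'Campus news A', 'clickcount': 2500},
    {'title': 'Campus news B', 'clickcount': 120},
]
df = pandas.DataFrame(newstotal)  # one row per dict, one column per key
print(df.shape)
print(list(df.columns))
```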

4. Use df to save the extracted data to a CSV or Excel file.
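A minimal sketch of both exports: to_csv needs no extra packages, while to_excel requires openpyxl for .xlsx files (the file names are illustrative):

```python
import pandas

df = pandas.DataFrame([{'title': 'Campus news A', 'clickcount': 2500}])
df.to_csv('news.csv', index=False, encoding='utf-8')
# df.to_excel('news.xlsx', index=False)  # needs openpyxl installed
```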

5. Use the functions and methods provided by pandas for data analysis:

import requests
import re
import pandas
import openpyxl  # needed by DataFrame.to_excel for .xlsx output
from bs4 import BeautifulSoup
from datetime import datetime

homepage = ''  # the site URL was not preserved in the original post
# newsurl = ''
res = requests.get(homepage)  # returns a Response object
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
newscount = int(soup.select('.a1')[0].text.split('條')[0])
newspages = newscount // 10 + 1
allnews = []
alllistnews = []


def get_new_click_count(click_url):
    # the click count is embedded in a jQuery snippet like $('#hits').html('123');
    res = requests.get(click_url)
    res.encoding = 'utf-8'
    text = BeautifulSoup(res.text, 'html.parser').text
    return int(text.split("('#hits').html")[1].lstrip("('").rstrip("');"))


def get_all_news(newurl):
    dictionary = {}
    res = requests.get(newurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newlist = soup.select('.news-list')[0].select('li')
    for newitem in newlist:
        title = newitem.select('.news-list-title')[0].text
        describe = newitem.select('.news-list-description')[0].text
        newurl = newitem.a.attrs['href']
        newcontenturl = re.search(r'(\d+\.html)', newurl).group(1)
        newcontenturl2 = newcontenturl.rstrip('.html')
        click_url = '' + newcontenturl2 + '&modelid=80'  # base URL not preserved
        newclicktimes = get_new_click_count(click_url)

        def get_new_click_content(newurl):
            res = requests.get(newurl)
            res.encoding = 'utf-8'
            soup = BeautifulSoup(res.text, 'html.parser')
            info = soup.select('.show-info')[0].text.split()
            # the label prefixes passed to lstrip() were lost in the
            # original post, except the last one ('攝影:')
            distributetime = info[0].lstrip('')
            author = info[2].lstrip('')
            trial = info[3].lstrip('')
            orgin = info[4].lstrip('')
            photograph = info[5].lstrip('攝影:')
            return distributetime, author, trial, orgin, photograph

        # fetch the detail page once rather than once per field
        (dictionary['distributetime'], dictionary['author'],
         dictionary['trial'], dictionary['orgin'],
         dictionary['photograph']) = get_new_click_content(newurl)
        dictionary['title'] = title
        dictionary['describe'] = describe
        # the original computed newclicktimes but never stored it; the
        # filter below needs a 'clickcount' column, so store it here
        dictionary['clickcount'] = newclicktimes
        allnews.append(dict(dictionary))  # append a copy of each record
    return allnews


for i in range(2, 6):
    page = '{}.html'.format(i)  # list-page URL pattern not preserved
    alllistnews.extend(get_all_news(page))

df = pandas.DataFrame(alllistnews)
print(df)
df.to_excel('text.xlsx')
print(df.head(6))

# renamed from `super`, which shadows the Python built-in; note the record
# key above is spelled 'orgin', so a 'source' column must be added for this
# filter to match
hot_news = df[(df['clickcount'] > 2000) & (df['source'] == '學校綜合辦')]
print(hot_news)

Screenshot of the output:
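Beyond the boolean filter above, pandas also covers sorting and aggregation. A sketch with hypothetical data mirroring the scraped columns:

```python
import pandas

# Hypothetical records mirroring the scraped columns.
df = pandas.DataFrame([
    {'source': '學校綜合辦', 'clickcount': 2500},
    {'source': '學校綜合辦', 'clickcount': 1800},
    {'source': '圖書館', 'clickcount': 300},
])
hot = df[(df['clickcount'] > 2000) & (df['source'] == '學校綜合辦')]
top = df.sort_values('clickcount', ascending=False).head(2)  # two most-clicked
per_source = df.groupby('source')['clickcount'].sum()        # total clicks per source
print(len(hot), per_source['學校綜合辦'])
```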
