Crawl All Campus News

2022-05-16 20:55:56

Assignment requirements:

0. Get the click count from a news URL, and wrap it up as a function.

1. Get the news details from a news URL:

   a dictionary, anews

2. Get the news URLs from a list-page URL.

3. Generate the URLs of all list pages and fetch all the news:

   list.extend(list) → allnews; each student crawls the 10 list pages starting from the last digit of their student ID.

4. Set a reasonable crawl interval:

   import time
   import random
   time.sleep(random.random()*3)

5. Do simple data processing with pandas and save the result:

   Save to a csv or excel file:

   newsdf.to_csv(r'f:\duym\爬蟲\gzccnews.csv')

The code is as follows:

import re
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import pandas as pd
import time
import random


"""News click count"""
def newsClick(newsUrl):
    newsId = re.findall(r'(\d+)', newsUrl)[-1]
    clickUrl = ''.format(newsId)  # click-count API URL template (elided in the original)
    resClicks = requests.get(clickUrl).text
    resClick = int(re.search(r"hits'[)].html[(]'(\d*)'[)]", resClicks).groups(0)[0])
    return resClick
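The click-count endpoint of sites like this one typically returns a small jQuery snippet rather than JSON, which is why the count is pulled out with a regular expression. A minimal sketch of how that regex behaves, using a sample response string that is an assumption (the real endpoint is not shown above):

```python
import re

# Hypothetical response body from a click-count API: a jQuery snippet
# that sets several counters, including the one tagged 'hits'.
sample = "$('#todaydowns').html('12');$('#hits').html('5213');"

# The same pattern used in newsClick: match hits').html('<digits>')
hits = int(re.search(r"hits'[)].html[(]'(\d*)'[)]", sample).groups(0)[0])
print(hits)  # 5213
```

Note that `[)]` and `[(]` are just escaped literal parentheses; only the `(\d*)` group is captured.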

"""新聞發布時間

"""18

defnewsdatetime(showinfo):

19 newsdate = showinfo.split()[0].split('

:')[1]

20 newstime = showinfo.split()[1]

21 newsdatetime = newsdate + '

' +newstime

22 datetime = datetime.strptime(newsdatetime, '

%y-%m-%d %h:%m:%s

') #

型別轉換

23return

datetime

2425
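The `showinfo` string this function splits apart comes from the page's `.show-info` element. A small sketch of the parsing with a sample string (the exact label text is an assumption about the real page):

```python
from datetime import datetime

# Hypothetical .show-info text: "<label>:<date> <time> <more fields...>"
showinfo = "發佈時間:2022-05-16 20:55:56 作者:新聞中心"

newsDate = showinfo.split()[0].split(':')[1]  # '2022-05-16'
newsTime = showinfo.split()[1]                # '20:55:56'
dt = datetime.strptime(newsDate + ' ' + newsTime, '%Y-%m-%d %H:%M:%S')
print(dt)  # 2022-05-16 20:55:56
```

`split()` with no argument splits on runs of whitespace, so the date field and time field come out as the first two tokens regardless of extra spacing.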

"""新聞字典

"""26

defnewsdicts(newsurl):

27 newstext =requests.get(newsurl)

28 newstext.encoding = '

utf-8

'29 newssoup = beautifulsoup(newstext.text, '

html.parser')

30 newsdict ={}

31 newsdict['

newstitle

'] = newssoup.select('

.show-title

')[0].text

32 showinfo = newssoup.select('

.show-info

')[0].text

33 newsdict['

newsdatetime

'] =newsdatetime(showinfo)

34 newsdict['

newsclick

'] =newsclick(newsurl)

35return

newsdict

3637

"""新聞列表

"""38

defnewslist(newsurl):

39 newstext =requests.get(newsurl)

40 newstext.encoding = '

utf-8

'41 newssoup = beautifulsoup(newstext.text, '

html.parser')

42 newslist =

43for news in newssoup.select('li'

):44

if len(news.select('

.news-list-title

')) >0:

45 url = news.select('

a')[0]['

href']

46 newsdesc = news.select('

.news-list-description

')[0].text

47 newsdict =newsdicts(url)

48 newsdict['

newsurl

'] =url

49 newsdict['

description

'] =newsdesc

5051

return

newslist

5253
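The `len(news.select('.news-list-title')) > 0` check matters because list pages mix real news items with pager and decoration `<li>` elements. A self-contained sketch against a hypothetical HTML fragment (the class names match the code above; the markup itself is an assumption):

```python
from bs4 import BeautifulSoup

# Hypothetical list-page fragment: one real news item and one pager item.
html = """
<ul>
  <li><a href="http://news.gzcc.cn/html/2022/001.html">
        <div class="news-list-title">Title A</div>
        <div class="news-list-description">Desc A</div></a></li>
  <li>pager item without a title</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')
for news in soup.select('li'):
    # Only <li> elements containing a .news-list-title are news items.
    if len(news.select('.news-list-title')) > 0:
        url = news.select('a')[0]['href']
        print(url)
```

The second `<li>` is skipped because it contains no `.news-list-title` element.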

"""27-37頁新聞列表

"""54

defallnews():

55 allnews =

56for i in range(27,38):

57 newsurl = '

'.format(i)

58allnews.extend(newslist(newsurl))

59 time.sleep(random.random() * 3) #

爬取間隔

60return

allnews

6162 newsdf =pd.dataframe(allnews())

63 newsdf.to_csv('

gzccnews.csv

') #

儲存為csv檔案
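Once the dictionaries are in a DataFrame, simple processing is one-liners. A sketch with hypothetical rows shaped like the dictionaries built above (the row values are invented for illustration); `utf-8-sig` adds a BOM so Excel opens Chinese text correctly:

```python
import pandas as pd

# Hypothetical rows with the same keys the crawler produces.
rows = [
    {'newsTitle': 'A', 'newsClick': 120, 'newsUrl': 'http://news.gzcc.cn/a'},
    {'newsTitle': 'B', 'newsClick': 340, 'newsUrl': 'http://news.gzcc.cn/b'},
]
newsdf = pd.DataFrame(rows)

# Sort by click count, most-clicked first.
top = newsdf.sort_values(by='newsClick', ascending=False)

# Save without the index column; utf-8-sig keeps Excel happy with CJK text.
top.to_csv('gzccnews.csv', index=False, encoding='utf-8-sig')
```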

A screenshot of the saved gzccnews.csv file is shown below:
