A Different Perspective: Web Scraping with Python

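Over the past few years, I have developed several large projects involving scraping and data processing. They were built for completely different industries, and the approach here is the same as it was there: find the best solution for each case.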

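Feel free to pick any website you like; for this project, we are using azlyrics. Note: this tutorial is for educational purposes only, and I do not recommend scraping the site for any illegal activity. With that said, let's first look at what we are scraping.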

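azlyrics provides plain song lyrics, along with features for searching songs and listing all songs performed by a singer. It is a simple website that requires no account or verification to read the lyrics.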

Setting up the environment

I assume you already have venv installed; if you do not, look up how to install venv on your system before continuing.

python3 -m venv env

Activate the Python environment with the command source env/bin/activate.
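If you are unsure whether the activation worked, a quick check from Python itself can help. The sketch below is an illustration, not part of the original tutorial; the helper name in_virtualenv is hypothetical, and it relies on sys.prefix and sys.base_prefix differing inside an environment created by venv.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```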

You also need to install several dependencies with pip in your development environment, including requests, beautifulsoup4, pandas, and regex.

pip3 install requests regex pandas beautifulsoup4

Create a new file and save it as print_lyrics.py. You can use any IDE to write the code.

import requests
from bs4 import BeautifulSoup

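To scrape a web page, you need to know some basics of HTML. If you press F12 to open Developer Tools (or use Inspect Element), you can inspect the classes of the elements you want to extract data from.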

Here you can see that the "lyricsh" class contains the singer's name, and the "div-share" class contains the song name.
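To make the idea of looking up elements by class concrete, here is a minimal sketch that runs BeautifulSoup on a tiny inline snippet instead of the live page. The markup in SAMPLE_HTML is a simplified, hypothetical stand-in that only borrows the two class names mentioned above; the real page is larger and may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for illustration only.
SAMPLE_HTML = """
<div class="lyricsh"><h2>Example Singer Lyrics</h2></div>
<div class="div-share">"Example Song" lyrics</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
singer_tag = soup.find("div", class_="lyricsh")
share_tag = soup.find("div", class_="div-share")
missing_tag = soup.find("div", class_="no-such-class")

singer_text = singer_tag.h2.text
share_text = share_tag.text

print(singer_text)
print(share_text)
print(missing_tag)
```

Because find() returns None for a class that is not present, chaining attributes such as .h2.text onto a failed lookup raises AttributeError, which is usually the first sign that a page's layout has changed.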

So, to get the page from the server, we use the requests library to fetch the web page from the internet by its URL.

# importing libraries
import requests
# web page you want to scrape
url = ''
response = requests.get(url)
# printing the status code of the response
print(response.status_code)
# the status code will be 200 if everything is ok.

In the code above, we request the URL through the requests library and store the result in response. To run the file, use the following command:

python3 print_lyrics.py
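Printing the status code shows whether the request succeeded, but the script carries on either way. As a hedged sketch of a slightly more defensive fetch, the hypothetical helper fetch_html below adds a timeout and uses raise_for_status() so that error responses are handled in one place. The name and the 10-second default are assumptions made for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into an exception instead of checking by hand.
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        # Covers invalid URLs, connection errors, timeouts, and HTTP errors.
        print("request failed: {}".format(error))
        return None
    return response.text
```

Returning None on failure keeps the calling code simple: it can skip parsing when there is nothing to parse.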

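To extract details from the response, we use the BeautifulSoup library to parse the HTML page. Calling BeautifulSoup(response.text, 'html.parser') turns the response text into parsed HTML that can be searched. Picking out the different parts of a web page can be a little confusing at first, but with practice you can easily get the details you need.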

# importing libraries
import requests
from bs4 import BeautifulSoup
url = ''
response = requests.get(url)  # using requests to get the web page
html_soup = BeautifulSoup(response.text, 'html.parser')  # parsing the response text into searchable HTML
# these are the details we want: singer, song name, and lyrics
singer = html_soup.find('div', class_='lyricsh').h2.text
# the main column holds both the song title and the lyrics
content = html_soup.find('div', class_='col-xs-12 col-lg-8 text-center')
song_name = content.find_all('div', class_='div-share')[1].text.split('"')[1]
lyrics = content.find_all('div')[5].text
print('singer name -> {}'.format(singer))
print('song name -> {}'.format(song_name))
print('lyrics is -> {}'.format(lyrics))

Running this file prints the singer name, song name, and lyrics in the terminal.
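The song name is pulled out with .text.split('"')[1], which takes whatever sits between the first pair of double quotes. That expression raises IndexError when the text contains no quotes, so a small guard can be useful. The helper below is a hypothetical sketch, and the sample strings are made up for illustration.

```python
def text_between_quotes(text):
    """Return the text between the first pair of double quotes, or None."""
    parts = text.split('"')
    # A quoted title yields at least three parts: before, inside, after.
    if len(parts) < 3:
        return None
    return parts[1]


quoted = text_between_quotes('"Example Song" lyrics')
unquoted = text_between_quotes('no quotes here')
print(quoted)
print(unquoted)
```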

Saving the response

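To save this output to disk, I will dump the data as JSON, but you can also use CSV or plain text. Unless you plan to scrape and manage a dataset of millions of lyrics, storing the data in SQL is probably overkill. For a small number of lyrics, JSON or CSV works well. The full code now looks like this: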

import requests
from bs4 import BeautifulSoup
import json
url = ''
response = requests.get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
singer = html_soup.find('div', class_='lyricsh').h2.text
# the main column holds both the song title and the lyrics
content = html_soup.find('div', class_='col-xs-12 col-lg-8 text-center')
song_name = content.find_all('div', class_='div-share')[1].text.split('"')[1]
lyrics = content.find_all('div')[5].text
# collect the extracted values in a dictionary
data = {}
data['singer'] = singer
data['song_name'] = song_name
data['lyrics'] = lyrics
# write the dictionary to data.json in the current directory
with open('data.json', 'w') as outfile:
    json.dump(data, outfile)

This creates a file named "data.json" in the same directory, containing the singer, song name, and lyrics.
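Since CSV was mentioned as an alternative, here is a minimal sketch of both formats side by side using only the standard library. The helper names (save_json, load_json, save_csv) are hypothetical, and the field names simply mirror the keys of the data dictionary above.

```python
import csv
import json

FIELDNAMES = ["singer", "song_name", "lyrics"]


def save_json(data, path):
    # ensure_ascii=False keeps non-ASCII characters readable in the file.
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(data, outfile, ensure_ascii=False, indent=2)


def load_json(path):
    with open(path, encoding="utf-8") as infile:
        return json.load(infile)


def save_csv(rows, path):
    # newline="" lets the csv module control line endings itself,
    # which matters because lyrics contain embedded newlines.
    with open(path, "w", encoding="utf-8", newline="") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```

JSON suits a single record like the one scraped here; CSV becomes more convenient once you collect several songs as rows and want to open them in a spreadsheet or load them with pandas.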
