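This post shows how to use the BeautifulSoup package to parse web page data that has already been fetched.
It follows the previous post, "Python 3: Fetching PTT web page data with the Requests package".
The page content fetched from https://www.ptt.cc/bbs/Beauty/index.html
is shown below.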
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>看板 Beauty 文章列表 - 批踢踢實業坊</title>
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/pushstream.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-print.css" media="print">
</head>
<body>
<div id="topbar-container">
<div id="topbar" class="bbs-content">
<a id="logo" href="/bbs/">批踢踢實業坊</a>
<span>›</span>
<a class="board" href="/bbs/Beauty/index.html"><span class="board-label">看板 </span>Beauty</a>
<a class="right small" href="/about.html">關於我們</a>
<a class="right small" href="/contact.html">聯絡資訊</a>
</div>
</div>
<div id="main-container">
<div id="action-bar-container">
<div class="action-bar">
<div class="btn-group btn-group-dir">
<a class="btn selected" href="/bbs/Beauty/index.html">看板</a>
<a class="btn" href="/man/Beauty/index.html">精華區</a>
</div>
<div class="btn-group btn-group-paging">
<a class="btn wide" href="/bbs/Beauty/index1.html">最舊</a>
<a class="btn wide" href="/bbs/Beauty/index3036.html">‹ 上頁</a>
<a class="btn wide disabled">下頁 ›</a>
<a class="btn wide" href="/bbs/Beauty/index.html">最新</a>
</div>
</div>
</div>
<div class="r-list-container action-bar-margin bbs-screen">
<div class="search-bar">
<form type="get" action="search" id="search-bar">
<input class="query" type="text" name="q" value="" placeholder="搜尋文章⋯">
</form>
</div>
<div class="r-ent">
<div class="nrec"><span class="hl f2">2</span></div>
<div class="title">
<a href="/bbs/Beauty/M.1566697382.A.136.html">[正妹] 台日混血正妹</a>
</div>
<div class="meta">
<div class="author">Iwanz1018</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a
href="/bbs/Beauty/search?q=thread%3A%5B%E6%AD%A3%E5%A6%B9%5D+%E5%8F%B0%E6%97%A5%E6%B7%B7%E8%A1%80%E6%AD%A3%E5%A6%B9">搜尋同標題文章</a>
</div>
<div class="item"><a href="/bbs/Beauty/search?q=author%3AIwanz1018">搜尋看板內 Iwanz1018 的文章</a>
</div>
</div>
</div>
<div class="date"> 8/25</div>
<div class="mark"></div>
</div>
</div>
<div class="r-ent">
<div class="nrec"><span class="hl f2">8</span></div>
<div class="title">
<a href="/bbs/Beauty/M.1566698983.A.C2D.html">[正妹] 捨棄AKB徵選的少女</a>
</div>
<div class="meta">
<div class="author">teow113554</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a
href="/bbs/Beauty/search?q=thread%3A%5B%E6%AD%A3%E5%A6%B9%5D+%E6%8D%A8%E6%A3%84AKB%E5%BE%B5%E9%81%B8%E7%9A%84%E5%B0%91%E5%A5%B3">搜尋同標題文章</a>
</div>
<div class="item"><a href="/bbs/Beauty/search?q=author%3Ateow113554">搜尋看板內 teow113554
的文章</a></div>
</div>
</div>
<div class="date"> 8/25</div>
<div class="mark"></div>
</div>
</div>
<div class="r-ent">
<div class="nrec"><span class="hl f2">1</span></div>
<div class="title">
<a href="/bbs/Beauty/M.1566703064.A.84F.html">[新聞] 想被她摔!超美女警為家計苦練柔道虐翻 </a>
</div>
<div class="meta">
<div class="author">james7923</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a
href="/bbs/Beauty/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D+%E6%83%B3%E8%A2%AB%E5%A5%B9%E6%91%94%EF%BC%81%E8%B6%85%E7%BE%8E%E5%A5%B3%E8%AD%A6%E7%82%BA%E5%AE%B6%E8%A8%88%E8%8B%A6%E7%B7%B4%E6%9F%94%E9%81%93%E8%99%90%E7%BF%BB+">搜尋同標題文章</a>
</div>
<div class="item"><a href="/bbs/Beauty/search?q=author%3Ajames7923">搜尋看板內 james7923 的文章</a>
</div>
</div>
</div>
<div class="date"> 8/25</div>
<div class="mark"></div>
</div>
</div>
<div class="r-ent">
<div class="nrec"><span class="hl f2">2</span></div>
<div class="title">
<a href="/bbs/Beauty/M.1566704866.A.E33.html">[廣告] 仁藤りさ營養不在前面</a>
</div>
<div class="meta">
<div class="author">graperson</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a
href="/bbs/Beauty/search?q=thread%3A%5B%E5%BB%A3%E5%91%8A%5D+%E4%BB%81%E8%97%A4%E3%82%8A%E3%81%95%E7%87%9F%E9%A4%8A%E4%B8%8D%E5%9C%A8%E5%89%8D%E9%9D%A2">搜尋同標題文章</a>
</div>
<div class="item"><a href="/bbs/Beauty/search?q=author%3Agraperson">搜尋看板內 graperson 的文章</a>
</div>
</div>
</div>
<div class="date"> 8/25</div>
<div class="mark"></div>
</div>
</div>
<div class="r-ent">
<div class="nrec"><span class="hl f3">17</span></div>
<div class="title">
<a href="/bbs/Beauty/M.1566707650.A.E8C.html">[正妹] 圖多 不知名的正妹們</a>
</div>
<div class="meta">
<div class="author">gp99000</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a
href="/bbs/Beauty/search?q=thread%3A%5B%E6%AD%A3%E5%A6%B9%5D+%E5%9C%96%E5%A4%9A+%E4%B8%8D%E7%9F%A5%E5%90%8D%E7%9A%84%E6%AD%A3%E5%A6%B9%E5%80%91">搜尋同標題文章</a>
</div>
<div class="item"><a href="/bbs/Beauty/search?q=author%3Agp99000">搜尋看板內 gp99000 的文章</a>
</div>
</div>
</div>
<div class="date"> 8/25</div>
<div class="mark"></div>
</div>
</div>
<div class="r-ent">
<div class="nrec"></div>
<div class="title">
<a href="/bbs/Beauty/M.1566709451.A.60E.html">[神人] 平面模特兒</a>
</div>
<div class="meta">
<div class="author">john11894324</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a
href="/bbs/Beauty/search?q=thread%3A%5B%E7%A5%9E%E4%BA%BA%5D+%E5%B9%B3%E9%9D%A2%E6%A8%A1%E7%89%B9%E5%85%92">搜尋同標題文章</a>
</div>
<div class="item"><a href="/bbs/Beauty/search?q=author%3Ajohn11894324">搜尋看板內
john11894324 的文章</a></div>
</div>
</div>
<div class="date"> 8/25</div>
<div class="mark"></div>
</div>
</div>
<div class="r-ent">
<div class="nrec"></div>
<div class="title">
<a href="/bbs/Beauty/M.1566712655.A.D63.html">[正妹] 笑容可以</a>
</div>
<div class="meta">
<div class="author">JANUARZ</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a
href="/bbs/Beauty/search?q=thread%3A%5B%E6%AD%A3%E5%A6%B9%5D+%E7%AC%91%E5%AE%B9%E5%8F%AF%E4%BB%A5">搜尋同標題文章</a>
</div>
<div class="item"><a href="/bbs/Beauty/search?q=author%3AJANUARZ">搜尋看板內 JANUARZ 的文章</a>
</div>
</div>
</div>
<div class="date"> 8/25</div>
<div class="mark"></div>
</div>
</div>
<div class="r-list-sep"></div>
<div class="r-ent">
<div class="nrec"><span class="hl f0">XX</span></div>
<div class="title">
<a href="/bbs/Beauty/M.1557742996.A.657.html">[公告] 開放噓文暫停X1條款</a>
</div>
<div class="meta">
<div class="author">hateOnas</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a
href="/bbs/Beauty/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+%E9%96%8B%E6%94%BE%E5%99%93%E6%96%87%E6%9A%AB%E5%81%9CX1%E6%A2%9D%E6%AC%BE">搜尋同標題文章</a>
</div>
<div class="item"><a href="/bbs/Beauty/search?q=author%3AhateOnas">搜尋看板內 hateOnas 的文章</a>
</div>
</div>
</div>
<div class="date"> 5/13</div>
<div class="mark">M</div>
</div>
</div>
<div class="r-ent">
<div class="nrec"><span class="hl f3">43</span></div>
<div class="title">
<a href="/bbs/Beauty/M.1558415952.A.8D7.html">[公告] 不願上表特 & 優文推薦 & 檢舉建議專區</a>
</div>
<div class="meta">
<div class="author">hateOnas</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a
href="/bbs/Beauty/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+%E4%B8%8D%E9%A1%98%E4%B8%8A%E8%A1%A8%E7%89%B9+%EF%BC%86+%E5%84%AA%E6%96%87%E6%8E%A8%E8%96%A6+%EF%BC%86+%E6%AA%A2%E8%88%89%E5%BB%BA%E8%AD%B0%E5%B0%88%E5%8D%80">搜尋同標題文章</a>
</div>
<div class="item"><a href="/bbs/Beauty/search?q=author%3AhateOnas">搜尋看板內 hateOnas 的文章</a>
</div>
</div>
</div>
<div class="date"> 5/21</div>
<div class="mark">M</div>
</div>
</div>
<div class="r-ent">
<div class="nrec"></div>
<div class="title">
<a href="/bbs/Beauty/M.1563960846.A.05A.html">Fw: [公告] 請使用者多加注意我國保護兒少的法令</a>
</div>
<div class="meta">
<div class="author">hateOnas</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a
href="/bbs/Beauty/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+%E8%AB%8B%E4%BD%BF%E7%94%A8%E8%80%85%E5%A4%9A%E5%8A%A0%E6%B3%A8%E6%84%8F%E6%88%91%E5%9C%8B%E4%BF%9D%E8%AD%B7%E5%85%92%E5%B0%91%E7%9A%84%E6%B3%95%E4%BB%A4">搜尋同標題文章</a>
</div>
<div class="item"><a href="/bbs/Beauty/search?q=author%3AhateOnas">搜尋看板內 hateOnas 的文章</a>
</div>
</div>
</div>
<div class="date"> 7/24</div>
<div class="mark">!</div>
</div>
</div>
<div class="r-ent">
<div class="nrec"></div>
<div class="title">
<a href="/bbs/Beauty/M.1564114881.A.155.html">[公告] 表特板板規(2019.7.26)</a>
</div>
<div class="meta">
<div class="author">hateOnas</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a
href="/bbs/Beauty/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+%E8%A1%A8%E7%89%B9%E6%9D%BF%E6%9D%BF%E8%A6%8F%282019.7.26%29">搜尋同標題文章</a>
</div>
<div class="item"><a href="/bbs/Beauty/search?q=author%3AhateOnas">搜尋看板內 hateOnas 的文章</a>
</div>
</div>
</div>
<div class="date"> 7/26</div>
<div class="mark">!</div>
</div>
</div>
<div class="r-ent">
<div class="nrec"><span class="hl f2">1</span></div>
<div class="title">
<a href="/bbs/Beauty/M.1564117458.A.1AF.html">[公告] 201907 板主徵選延長</a>
</div>
<div class="meta">
<div class="author">hateOnas</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a
href="/bbs/Beauty/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+201907+%E6%9D%BF%E4%B8%BB%E5%BE%B5%E9%81%B8%E5%BB%B6%E9%95%B7">搜尋同標題文章</a>
</div>
<div class="item"><a href="/bbs/Beauty/search?q=author%3AhateOnas">搜尋看板內 hateOnas 的文章</a>
</div>
</div>
</div>
<div class="date"> 7/26</div>
<div class="mark"></div>
</div>
</div>
</div>
<div class="bbs-screen bbs-footer-message">本網站已依台灣網站內容分級規定處理。此區域為限制級,未滿十八歲者不得瀏覽。</div>
</div>
<script>
(function (i, s, o, g, r, a, m) {
i['GoogleAnalyticsObject'] = r;
i[r] = i[r] || function () {
(i[r].q = i[r].q || []).push(arguments)
}, i[r].l = 1 * new Date();
a = s.createElement(o),
m = s.getElementsByTagName(o)[0];
a.async = 1;
a.src = g;
m.parentNode.insertBefore(a, m)
})(window, document, 'script', 'https://www.google-analytics.com/analytics.js', 'ga');
ga('create', 'UA-32365737-1', {
cookieDomain: 'ptt.cc',
legacyCookieDomain: 'ptt.cc'
});
ga('send', 'pageview');
</script>
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="//images.ptt.cc/bbs/v2.27/bbs.js"></script>
</body>
</html>
For comparison, this is the same page as it appears in a browser. [screenshot]
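Suppose we only want the text of the article titles. From the HTML above, every article title sits inside an <a> nested under a <div class="title">.
Taking the first article title as an example: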
<div class="title">
<a href="/bbs/Beauty/M.1566697382.A.136.html">[正妹] 台日混血正妹</a>
</div>
This is where BeautifulSoup can extract the article titles for us.
To install the BeautifulSoup package, run python -m pip install beautifulsoup4
in a Windows cmd window.
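Once the package is installed, a quick offline check can confirm that it works before touching the network. The sketch below parses only the small snippet shown above, stored in a string; the variable names (`snippet`, `title_text`, `title_href`) are illustrative and not part of the original script.

```python
from bs4 import BeautifulSoup

# The snippet shown above, stored as a plain string (no network needed).
snippet = """
<div class="title">
<a href="/bbs/Beauty/M.1566697382.A.136.html">[正妹] 台日混血正妹</a>
</div>
"""

soup = BeautifulSoup(snippet, "html.parser")
div_tag = soup.find("div", class_="title")  # first <div class="title">
a_tag = div_tag.find("a")                   # the <a> nested inside it

title_text = a_tag.text     # text between <a> and </a>
title_href = a_tag["href"]  # value of the href attribute

print(title_text)
print(title_href)
```

If this prints the title and its relative link, the installation is working and the same two `find` calls should apply to the full page.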
Continuing from the previous post's code, add the BeautifulSoup code below to get the article titles.
get-html.py
import requests
from bs4 import BeautifulSoup  # import BeautifulSoup

s = requests.Session()  # create a Requests session so cookies persist

# Pass the over-18 confirmation page
s.post('https://www.ptt.cc/ask/over18',
       data={'from': '/bbs/Beauty/index.html', 'yes': 'yes'})

res = s.get('https://www.ptt.cc/bbs/Beauty/index.html')  # fetch the HTML page
# Hand the fetched HTML to BeautifulSoup and parse it with html.parser
soup = BeautifulSoup(res.text, 'html.parser')
# Find every <div class="title"> in the page
div_tags = soup.find_all('div', {'class': 'title'})
for div_tag in div_tags:
    a_tag = div_tag.find('a')  # the <a> under <div class="title">
    if a_tag is not None:  # a deleted article has no <a>, so find() returns None; skip it
        print(a_tag.text)  # print the text part
The printed output is shown below.
[正妹] 台日混血正妹
[正妹] 捨棄AKB徵選的少女
[新聞] 想被她摔!超美女警為家計苦練柔道虐翻
[廣告] 仁藤りさ營養不在前面
[正妹] 圖多 不知名的正妹們
[神人] 平面模特兒
[正妹] 雪乳青筋!看了受不了
[神人] 有人知道這位是誰嗎?
[正妹] 路人東京妹
[正妹] 有IG的一些正妹
[正妹] 高材生
[正妹] 瑞莎
[正妹] 準醫師
[公告] 開放噓文暫停X1條款
[公告] 不願上表特 & 優文推薦 & 檢舉建議專區
Fw: [公告] 請使用者多加注意我國保護兒少的法令
[公告] 表特板板規(2019.7.26)
[公告] 201907 板主徵選延長
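The same approach can be extended to collect more than the title. The sketch below is one possible way to do it, not part of the original script: it walks each <div class="r-ent"> block seen in the HTML above and gathers the title, link, author, and date. The function name `parse_entries`, the `BASE_URL` constant, and the second sample entry (a title div without an <a>, the assumed shape of a deleted article) are hypothetical and used only for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Base URL used to turn relative hrefs into absolute links.
BASE_URL = "https://www.ptt.cc/bbs/Beauty/index.html"


def parse_entries(html, base_url=BASE_URL):
    """Return one dict per listed article; entries without a link are skipped."""
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    for ent in soup.find_all("div", class_="r-ent"):
        a_tag = ent.select_one("div.title a")  # None when the title has no <a>
        if a_tag is None:
            continue
        author = ent.find("div", class_="author")
        date = ent.find("div", class_="date")
        entries.append({
            "title": a_tag.get_text(strip=True),
            "url": urljoin(base_url, a_tag["href"]),
            "author": author.get_text(strip=True) if author is not None else "",
            "date": date.get_text(strip=True) if date is not None else "",
        })
    return entries


# Trimmed sample: the first entry from the listing above, plus a hypothetical
# entry whose title div has no <a> (assumed shape of a deleted article).
sample = """
<div class="r-ent">
<div class="title">
<a href="/bbs/Beauty/M.1566697382.A.136.html">[正妹] 台日混血正妹</a>
</div>
<div class="meta">
<div class="author">Iwanz1018</div>
<div class="date"> 8/25</div>
</div>
</div>
<div class="r-ent">
<div class="title">(deleted)</div>
<div class="meta">
<div class="author">-</div>
<div class="date"> 8/25</div>
</div>
</div>
"""

for entry in parse_entries(sample):
    print(entry["date"], entry["author"], entry["title"], entry["url"])
```

To run it against the live page, `res.text` from get-html.py can be passed in place of `sample`. Restricting the selector to `div.title a` matters here because each entry also contains <a> tags in its dropdown menu, which should not be mistaken for the title link.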