To get started, run composer init in the terminal. Then, go through the dialog, choosing project as the type and adding voku/simple_html_dom as a dependency. You can configure the rest however you want.
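If you'd rather skip the dialog, a minimal composer.json along these lines should also work (the version constraint here is an assumption; check Packagist for the current release):

```json
{
    "type": "project",
    "require": {
        "voku/simple_html_dom": "^4.8"
    }
}
```

Then run composer install to pull the library down.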
Finally, create a file in the directory called simple_html_dom.php and put the following in the file:
```php
<?php

use voku\helper\HtmlDomParser;

require_once 'vendor/autoload.php';
```

```php
// Load from a string
$dom = HtmlDomParser::str_get_html('<html><body><p>Hello World!</p><p>We\'re here</p></body></html>');

// Load from a file or URL
$dom = HtmlDomParser::file_get_html('https://bbc.com');
```
The load_file() method delegates its job to PHP's file_get_contents(). If allow_url_fopen is not set to true in your php.ini file, you may not be able to open a remote file this way. You could always fall back on the cURL library to load remote pages in this case.
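As a sketch of that fallback, here is one way to fetch the raw HTML first and hand the string to the parser afterwards. fetchHtml() is a made-up helper name, not part of the library, and the cURL options shown are just a reasonable starting point:

```php
<?php
// Hypothetical helper: use file_get_contents() when allow_url_fopen
// permits it, and fall back to cURL otherwise.
function fetchHtml(string $url): string
{
    if (ini_get('allow_url_fopen')) {
        return (string) file_get_contents($url);
    }

    // cURL fallback for environments where remote fopen is disabled
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);

    return (string) $html;
}

// The result can then be parsed with HtmlDomParser::str_get_html($html).
```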
Once the DOM is loaded, you can start working with find() and creating collections. A collection is a group of objects found via a selector; the syntax is quite similar to jQuery.
```html
<html>
<body>
<p>Hello World!</p>
<p>We're Here.</p>
</body>
</html>
```
```php
// create & load the HTML
$dom = HtmlDomParser::str_get_html("<html><body><p>Hello World!</p><p>We're here</p></body></html>");

// get all paragraph elements
$element = $dom->find("p");

// modify the second
$element[1]->innertext .= " and we're here to stay.";

// print it
echo $dom->save();
```
The find method call finds all <p> tags in the HTML and returns them as an array. The first paragraph will have an index of 0, and subsequent paragraphs will be indexed accordingly.

Finally, we access the second item in our collection of paragraphs (index 1) and make an addition to its innertext attribute. innertext represents the contents between the tags, while outertext represents the contents including the tag. We could replace the tag entirely by using outertext.
We're going to add one more line and modify the class of our second paragraph tag.
```php
$element[1]->class = "class_name";
echo $dom->save();
```
```html
<html>
<body>
<p>Hello World!</p>
<p class="class_name">We're here and we're here to stay.</p>
</body>
</html>
```
```php
// get the first occurrence of id="foo"
$single = $dom->find('#foo', 0);

// get all elements with class="foo"
$collection = $dom->find('.foo');

// get all the anchor tags on a page
$collection = $dom->find('a');

// get all anchor tags that are inside H1 tags
$collection = $dom->find('h1 a');

// get all img tags with a title of 'himom'
$collection = $dom->find('img[title=himom]');
```
Because we asked for index 0 in the second parameter, $single is a single element, rather than an array of elements with one item. The rest of the examples are self-explanatory.
```php
use voku\helper\HtmlDomParser;

require_once 'vendor/autoload.php';

$articles = array();
getArticles('https://code.tutsplus.com/tutorials');
```
To kick things off, we call the getArticles function with the page we'd like to start parsing. In this case, we're starting near the end and being kind to Tuts+'s server.

We're also declaring a global array to make it simple to gather all the article information in one place. Before we begin parsing, let's take a look at how an article summary is described on Tuts+.
```html
<article>
  <header>...</header>
  <div class="posts__post-teaser">...</div>
  <footer class="posts__post-details">
    <div class="posts__post-teaser-overlay"></div>
    <div class="posts__post-publication-meta">
      ...
      <div class="posts__post-details__info">
        <address class="posts__post-author">...</address>
        <time class="posts__post-publication-date">...</time>
      </div>
    </div>
    <div class="posts__post-primary-category">
      <a class="posts__post-primary-category-link topic-code" href="">...</a>
    </div>
  </footer>
</article>
```
```php
function getArticles($page) {
    global $articles;
    $html = HtmlDomParser::file_get_html($page);
    ...
}
```
We start by declaring our use of the global $articles array and loading the page into Simple HTML DOM using file_get_html, as we have done previously. $page is the URL we passed in earlier.
```php
$items = $html->find('article');
foreach($items as $post) {
    $articles[] = array(/* get title */ $post->findOne(".posts__post-title")->firstChild()->text(),
                        /* get description */ $post->findOne(".posts__post-teaser")->text());
}
```
These lines make up the heart of the getArticles function, and it's worth taking a closer look to really understand what's happening.

First we create an array of elements: every <article> tag on the page. We now have a collection of articles stored in $items.

In the foreach block, $post now refers to a single article. If we look at the original HTML, we can see that the title of the post is contained in the first child of the element with the class posts__post-title. Therefore, to get the title, we take the text of that child.

The description is contained in the element with the class posts__post-teaser. We take the text out of that element and put it in the second position of the record. A single record in $articles now looks like this:
```php
$articles[0][0] = "My Article Name Here";
$articles[0][1] = "This is my article description";
```
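Since each record is just a two-element array of title and description, working with the collected data afterwards is plain array handling. A quick sketch with made-up records:

```php
<?php
// Sketch: $articles holds array(title, description) pairs, as built above.
$articles = array(
    array("My Article Name Here", "This is my article description"),
    array("Another Article", "Another description"),
);

// Pull out just the titles
$titles = array_map(function ($item) {
    return $item[0];
}, $articles);
// $titles now contains the two title strings
```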
```html
<a rel="next" class="pagination__button pagination__next-button" aria-label="next" href="/tutorials?page=2"><i class="fa fa-angle-right"></i></a>
```
This is the link to the next page of posts. Now that information can be put to use.
```php
if($next = $html->find('a[class=pagination__next-button]', 0)) {
    $URL = $next->href;
    $html->clear();
    unset($html);
    getArticles($URL);
}
```
We look for the anchor with the class pagination__next-button. Take special notice of the second parameter for find(). This specifies that we only want the first element (index 0) of the found collection returned, so $next will only be holding a single element, rather than a group of elements.

Next, we assign the link's href to the variable $URL. This is important because we're about to destroy the HTML object. Due to a PHP circular-reference memory leak, $html must be cleared and unset before another one is created. Failure to do so could cause you to eat up all your available memory.
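To see why the cleanup matters, here is a minimal, library-free demonstration of the underlying PHP behaviour: two objects that reference each other are not freed by reference counting alone, and sit around until the cycle collector runs.

```php
<?php
// Generic PHP sketch of a reference cycle (not library code).
class Node
{
    public $other;
}

$a = new Node();
$b = new Node();
$a->other = $b; // $a and $b now reference each other
$b->other = $a;

unset($a, $b); // refcounts never reach zero: the cycle survives

// Force the cycle collector to reclaim the orphaned pair
$freed = gc_collect_cycles();
```

Calling clear() breaks the parser's internal cycles explicitly, so the memory comes back immediately instead of waiting for a collection pass.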
Finally, we call getArticles
with the URL of the next page. This recursion ends when there are no more pages to parse.
You are done with the scraping! This is how the final code should look:
```php
<?php

use voku\helper\HtmlDomParser;

require_once 'vendor/autoload.php';

$articles = array();
getArticles('https://code.tutsplus.com/tutorials');

function getArticles($page) {
    global $articles;
    $html = HtmlDomParser::file_get_html($page);

    $items = $html->find('article');
    foreach($items as $post) {
        $articles[] = array(/* get title */ $post->findOne(".posts__post-title")->firstChild()->text(),
                            /* get description */ $post->findOne(".posts__post-teaser")->text());
    }

    if($next = $html->find('a[class=pagination__next-button]', 0)) {
        $URL = $next->href;
        $html->clear();
        unset($html);
        getArticles($URL);
    }
}
```
To style the output page, add a little CSS:

```css
#main {
    margin: 80px auto;
    width: 500px;
}

h1 {
    font: bold 40px/38px helvetica, verdana, sans-serif;
    margin: 0;
}

h1 a {
    color: #600;
    text-decoration: none;
}

p {
    background: #ECECEC;
    font: 10px/14px verdana, sans-serif;
    margin: 8px 0 15px;
    border: 1px #CCC solid;
    padding: 15px;
}

.item {
    padding: 10px;
}
```
Then loop through the collected articles and print them:

```php
<?php
foreach($articles as $item) {
    echo "<div class='item'>";
    echo $item[0];
    echo $item[1];
    echo "</div>";
}
?>
```
Place this loop below the getArticles() call.