Save the following code in a .py file and run it with Python. Note that this code assumes that you have the libraries installed; more on this later.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://en.wikipedia.org/wiki/Web_scraping")
bs = BeautifulSoup(response.text, "lxml")
print(bs.find("p").text)

This will fetch the Wikipedia page on web scraping and print the first paragraph on the terminal. This code shows the simplicity and power of Python. You will find this code in the webscraping_5lines.py file.
●The Beautiful Soup library is installed under the package name bs4 or BeautifulSoup4.
●The CSV library is part of the standard Python installation. No separate installation is required.
●Typically, a virtual environment is used to install these libraries (see the short sketch after the install command below). If you don't know about virtual environments, you can install these libraries in the user folder instead.
To install these libraries, start the terminal or command prompt of your OS and type in:
pip install requests beautifulsoup4 lxml

Depending on your OS and settings, you may need to use pip3 instead of pip, and you may also need the --user switch.
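If you prefer a virtual environment, a minimal setup looks something like the following sketch. The commands assume Python 3 on a Unix-like shell; on Windows, activation is done with venv\Scripts\activate instead:

python3 -m venv venv
source venv/bin/activate
pip install requests beautifulsoup4 lxml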
Let's start by fetching the HTML of a page with Requests:

import requests

url_to_parse = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url_to_parse)
print(response)

Save this as a Python file with the .py extension and run it from your terminal. The output should be something like this:

<Response [200]>
This means that the response has been received and the status code is 200. HTTP response code 200 means a successful response, while response codes in the 400 and 500 ranges mean errors. You can read more about response codes here.
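If you want your script to react to errors rather than just print the status, you can check the status code explicitly. Here is a minimal sketch using the same URL; status_code, ok, and raise_for_status() are all part of the Requests response object:

import requests

response = requests.get("https://en.wikipedia.org/wiki/Python_(programming_language)")

if response.status_code == 200:
    print("Request succeeded")
else:
    print("Request failed with status", response.status_code)

# Alternatively, raise an exception for any 4xx or 5xx response
response.raise_for_status()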
To get the HTML from the response object, we can simply use the .text attribute.
print(response.text)

This will print the HTML on the terminal. The first few characters will be something like this:

\n<!DOCTYPE html>\n<html class="client-nojs" lang=" ...

If we check the data type of this, it will be a string. The next step is to convert this string into something that can be queried to find the specific information. Meet BeautifulSoup!
Beautiful Soup needs an HTML parser to work with, and lxml is the most commonly used one. This will need a separate install:
pip install lxml

Once beautifulsoup4 and lxml are installed, we can create an object of BeautifulSoup:
soup = BeautifulSoup(response.text, 'lxml')

Now we have access to several methods to query the HTML elements. For example, to get the title of the page, all we need to do is access the tag name like an attribute:

print(soup.title)
# OUTPUT:
# <title>Python (programming language) - Wikipedia</title>

print(soup.title.text)
# OUTPUT:
# Python (programming language) - Wikipedia

Note that to get the text inside the element, we simply used the text attribute.
Similarly, soup.h1 will return the first h1 tag it finds:
print(soup.h1)
# OUTPUT:
# <h1 class="firstHeading" id="firstHeading">Python (programming language)</h1>
For anything more specific, BeautifulSoup provides two methods: find() and find_all(). Let's open the Wikipedia page and get the table of contents.
The signature of find looks something like this:
find(name=None, attrs={}, recursive=True, text=None, **kwargs)

As is evident, the find method can be used to find elements based on name, attributes, or text. This should cover most scenarios. For the remaining scenarios, such as finding by class, **kwargs can take additional filters.
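As a quick illustration, here is a sketch of the three basic filters applied to the soup object from above; the exact results depend on the page markup, and in newer Beautiful Soup versions the text filter is also available as string:

print(soup.find("h1"))                                # filter by tag name
print(soup.find("span", attrs={"class": "toctext"}))  # filter by attributes
print(soup.find("span", text="History"))              # filter by exact text content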
Moving on to the Wikipedia example, the first step is to look at the HTML markup of the table of contents to be extracted. Right-click the div that contains the table of contents and examine its markup. It is clear that the whole table of contents is in a div tag with the class attribute set to toc:
"><div id="toc" class="toc">If we simply run
soup.find("div"), it will return the first div it finds - similar to writing soup.div. This needs filtering as we need a specific div. We are lucky in this case as it has an idattribute. The following line of code can extract the div element:
soup.find("div",id="toc")Note that the second parameter here -
id="toc". The find method does not have a named parameter id, but still this works because of the implementation of the filter using the **kwargs.
Be careful with the CSS class, though. class is a reserved keyword in Python, so it cannot be used as a parameter name directly. There are two workarounds: the first is to simply use class_ instead of class; the second is to use a dictionary as the second argument.
This means that the following two statements are the same:
soup.find("div",class_="toc") #not the underscore soup.find("div",{"class": "toc"})The advantage of using a dictionary is that more than one attribute can be specified. For example,if you need to specify both class and id, you can use the find method in the following manner:
soup.find("div",{"class": "toc", "id":"toc"})What if we need to find multiple elements?
find_all method.
This method works the same way the find method does, except that instead of one element, it returns a list of all the elements that match the criteria. If we look at the source code, we can see that all the heading text is inside span elements with toctext as the class. We can use the find_all method to extract all of these:
soup.find_all("span",class_="toctext")This will return a list of elements: History, Design philosophy and features, Syntax and semantics, Indentation, .....] ">
[<span class="toctext">History</span>, <span class="toctext">Design philosophy and features</span>, <span class="toctext">Syntax and semantics</span>, <span class="toctext">Indentation</span>, .....]Similarly, the heading numbers can be extracted using this statement:
soup.find_all("span",class_="tocnumber")This will return a list of elements: 1, 2, 3, 3.1, ...]">
[<span class="tocnumber">1</span>, <span class="tocnumber">2</span>, <span class="tocnumber">3</span>, <span class="tocnumber">3.1</span>, ...]However, we need to have one list containing both the number and text.
Let's begin by extracting the div that holds the whole table of contents:

table_of_contents = soup.find("div", id="toc")

If we look at the markup, we can see that each heading number and text is inside an li tag.
One of the great features of BeautifulSoup is that the find and find_all methods can be used on element objects too, not just on the soup itself. In the above example, table_of_contents is such an element (a Tag), so we can find all the li tags inside it.
headings = table_of_contents.find_all("li")

Now we have a list of elements. Each of these elements contains both the heading text and the heading number. A simple for loop can be used to create a dictionary for each heading, which is then added to a list.
data = []
for heading in headings:
    heading_text = heading.find("span", class_="toctext").text
    heading_number = heading.find("span", class_="tocnumber").text
    data.append({
        'heading_number': heading_number,
        'heading_text': heading_text,
    })

If this data is printed, it is a list of dictionaries:
[{'heading_number': '1', 'heading_text': 'History'},
{'heading_number': '2', 'heading_text': 'Design philosophy and features'},
{'heading_number': '3', 'heading_text': 'Syntax and semantics'},
{'heading_number': '3.1', 'heading_text': 'Indentation'},
{'heading_number': '3.2', 'heading_text': 'Statements and control flow'},
.....]
This data can now be exported easily using the csv module. First, open a file in write mode. The newline parameter should be set to an empty string; if this is not done, you will see unwanted newline characters in your CSV file:
file= open("toc.csv", "w", newline="")After that, create an instance of DictWriter object. This needs a list of headers. In our case, these are simply going to be the dictionary keys in the data.
writer = csv.DictWriter(file, fieldnames=['heading_number', 'heading_text'])

Optionally, write the header and then call writer.writerows() to write the data. To write one row, use the writerow() method; to write all rows, use the writerows() method.
writer.writeheader()
writer.writerows(data)
file.close()

That's it! We have the data ready in a CSV. You can find the complete code in the wiki_toc.py file.
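For reference, here is a consolidated sketch of all the steps in a single script. It is equivalent to the code above, except that it uses a with block instead of an explicit close, and it assumes the page markup still contains the div with id toc as described; the wiki_toc.py file in the repository may differ slightly:

import csv

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

# collect heading numbers and texts from the table of contents
table_of_contents = soup.find("div", id="toc")
data = []
for heading in table_of_contents.find_all("li"):
    data.append({
        "heading_number": heading.find("span", class_="tocnumber").text,
        "heading_text": heading.find("span", class_="toctext").text,
    })

# write the results to a CSV file
with open("toc.csv", "w", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=["heading_number", "heading_text"])
    writer.writeheader()
    writer.writerows(data)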
Also, check this tutorial on PyPI.