Well done!
You have completed Scraping Data From the Web!
Let's look at two Beautiful Soup methods, `find()` and `find_all()`, in greater detail.
Welcome back.
0:00
We just saw how to use
the find_all method to find
0:01
all of a particular item on the page.
0:05
We can use the find method to find
the first instance of an item.
0:08
We can change this to find,
get rid of our for loop here,
0:14
And run it.
0:26
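As a sketch of the difference, here is the same idea run against some stand-in markup (the actual page from the video isn't included here, so the HTML below is invented):

```python
from bs4 import BeautifulSoup

# Invented markup standing in for the page shown in the video.
html = """
<div class="featured"><h2>Mustangs</h2><p>Wild horses.</p></div>
<div class="listing"><h2>Second</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list of every match...
all_divs = soup.find_all("div")
print(len(all_divs))  # 2

# ...while find() returns only the first match (or None if nothing matches).
first_div = soup.find("div")
print(first_div.h2.string)  # Mustangs
```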
I should probably have changed
that name to just div, ah well.
0:30
Naming things is always a challenge for
me.
0:33
We're getting all of the info back for
that particular div element.
0:36
The featured one on the page here,
in this case.
0:40
What if we just want
the header text in here?
0:43
Since it's a child element of that div,
we can chain elements together.
0:46
Let's comment this out.
0:53
Close this down.
0:57
We want featured_header = soup.find.
1:01
We want div class featured.
1:08
We just want the h2 element.
1:17
And we'll print the featured header.
1:20
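The chained lookup might look like this; only the `featured` class name comes from the video, the heading text is a placeholder:

```python
from bs4 import BeautifulSoup

# Placeholder markup; only the "featured" class name comes from the video.
html = '<div class="featured"><h2>Mustang Roundup</h2></div>'
soup = BeautifulSoup(html, "html.parser")

# find() the div, then chain straight into its child h2 element.
featured_header = soup.find("div", {"class": "featured"}).h2
print(featured_header)  # <h2>Mustang Roundup</h2>
```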
Nice.
1:26
But we still have our
tag elements in there.
1:27
From a data cleanliness standpoint,
1:30
it would be great if we could
get rid of those, right?
1:32
Well, there's a convenient method for
that called get_text.
1:35
Called get_text.
1:43
Yippee, we got some text from our site.
1:49
We scraped it out.
1:52
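Appending `get_text()` to that same lookup strips the tags away; again a minimal sketch with placeholder markup:

```python
from bs4 import BeautifulSoup

# Placeholder markup; only the "featured" class name comes from the video.
html = '<div class="featured"><h2>Mustang Roundup</h2></div>'
soup = BeautifulSoup(html, "html.parser")

# get_text() returns just the text, with the surrounding tags stripped away.
header_text = soup.find("div", {"class": "featured"}).h2.get_text()
print(header_text)  # Mustang Roundup
```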
There's a bit of a gotcha to watch out for
with this get_text method, though.
1:53
It strips away the tags from
whatever we're working with,
1:57
leaving just a block of text.
2:00
Let's take away this h2 element from
our text value to see what I mean.
2:02
While this is perhaps more readable for
2:12
us, it makes it much more challenging
to process going forward.
2:14
If we wanted to select mustang, or
2:18
the text about them at this point,
it would be more of a challenge.
2:20
The thing to remember about get_text
is to use it as the last step
2:24
in the scraping process.
2:29
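To see the gotcha concretely, here is a sketch of calling get_text() on the whole div instead of just the h2 (placeholder markup again):

```python
from bs4 import BeautifulSoup

# Placeholder markup with two child elements inside the featured div.
html = ('<div class="featured">'
        '<h2>Mustangs</h2><p>Wild horses of the plains.</p></div>')
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", {"class": "featured"})
# Flattening the whole div leaves no structure to select against,
# so save get_text() for the very last step.
flat = div.get_text()
print(flat)  # MustangsWild horses of the plains.
```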
We've seen that the find method returns
the first occurrence of an item in
2:31
a Beautiful Soup object.
2:35
It is basically find_all with
the limit of results set to one.
2:36
Let's look at the parameters
these methods take.
2:41
Name, which looks for tags with
certain names, such as title or div.
2:45
Attrs, which allows searching for
a specific CSS class.
2:50
We'll take a look at this here shortly.
2:54
Recursive, by default, find and find_all
examine all descendants of a tag.
2:57
If we set recursive to False,
3:03
it will only look at the direct
children of the tag.
3:05
String or text, which allows searching
for strings instead of tags.
3:09
Kwargs, which allows searching
on other items, such as a CSS ID.
3:15
Limit, the find_all method
also accepts a limit argument
3:20
to limit the number of results returned.
3:24
As I mentioned, find is a find_all
with a limit set to one.
3:26
We can pass in a string,
a list, a regular expression,
3:30
a value of True, or
even a function to the name, string, or
3:34
kwargs arguments to further enhance
the searching capabilities.
3:38
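Here is a quick sketch of a few of those parameters in action; the markup is invented purely for illustration:

```python
import re
from bs4 import BeautifulSoup

# Invented markup for illustration.
html = """
<div><p>top-level</p><span><p>nested</p></span></div>
<h1>Title</h1><h2>Sub</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# recursive=False looks only at the direct children of the tag.
div = soup.find("div")
direct = div.find_all("p", recursive=False)
print(len(direct))  # 1 -- the nested <p> inside <span> is skipped

# limit caps how many results find_all() returns.
first_p = soup.find_all("p", limit=1)
print(len(first_p))  # 1

# name accepts a regular expression: match any heading tag.
headings = soup.find_all(re.compile(r"^h[1-6]$"))
print([h.name for h in headings])  # ['h1', 'h2']

# string searches text instead of tags.
print(soup.find(string="Title"))  # Title
```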
Let's take a look at the attrs argument
to search for the CSS class and print out
3:43
all references to this primary button
class, which is this button down here.
3:48
Come back over here to our code,
let's comment this out.
3:54
So for button in soup.find_all.
4:00
Gonna look for a class, and
4:07
that class was button button--primary.
4:10
And we'll just print the buttons out.
4:20
And there it is.
4:27
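The attrs version of that loop might look like this; the button class names come from the video, the rest of the markup is a stand-in:

```python
from bs4 import BeautifulSoup

# Stand-in markup; only the class names come from the video.
html = ('<a class="button button--primary">Learn more</a>'
        '<a class="button">Cancel</a>')
soup = BeautifulSoup(html, "html.parser")

# attrs takes a dict of attribute names to values; a string with a
# space matches the exact value of the multi-valued class attribute.
buttons = soup.find_all(attrs={"class": "button button--primary"})
for button in buttons:
    print(button.get_text())  # Learn more
```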
Since class is a reserved word in Python
and searching for items on page based
4:29
on class is a frequent task, Beautiful
Soup provides a process for that.
4:34
We can change our code to use a special
keyword argument, class underscore.
4:39
So we can take all this out,
remove our closing curly bracket,
4:44
and we get the same result
with a bit less typing.
4:55
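The `class_` shortcut, sketched against stand-in markup (only the class names come from the video):

```python
from bs4 import BeautifulSoup

# Stand-in markup; only the class names come from the video.
html = ('<a class="button button--primary">Learn more</a>'
        '<a class="button">Cancel</a>')
soup = BeautifulSoup(html, "html.parser")

# class_ (trailing underscore) avoids clashing with Python's reserved
# word `class`, and behaves the same as the attrs dict version.
buttons = soup.find_all(class_="button button--primary")
print(len(buttons))  # 1
print(buttons[0].get_text())  # Learn more
```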
Another very common task which will
be useful when we want to move
4:58
from one page to another is to get
all of the hyperlinks on a page.
5:02
We can navigate into a specific tag and
5:07
use the get method to extract
specific information.
5:09
Minimize that.
5:13
Again, we'll comment this out,
just for clarity.
5:17
So for link in soup.find_all, so
we'll look for all the anchor elements.
5:21
And then we'll print out
all of the href attributes.
5:29
So link and we'll get the hrefs.
5:34
We can look at these patterns to
determine internal and external links.
5:42
Definitely, a handy thing to do.
5:47
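A sketch of that link loop, with invented hrefs to show the internal/external pattern:

```python
from bs4 import BeautifulSoup

# Invented links; a real page would have many more.
html = ('<a href="/inventory">Inventory</a>'
        '<a href="https://example.com/specs">Specs</a>')
soup = BeautifulSoup(html, "html.parser")

# get() reads an attribute off a tag, here each anchor's href.
hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)  # ['/inventory', 'https://example.com/specs']

# A leading slash is a quick heuristic for an internal link.
internal = [h for h in hrefs if h.startswith("/")]
print(internal)  # ['/inventory']
```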
Beautiful Soup is a very powerful tool,
and
5:50
we've just scratched
the surface of its power.
5:52
But we've seen how we can use
Python to read a webpage and
5:55
get very specific data from the HTML.
5:58
It can take a bit of work to
decipher the page structure, but
6:01
that is time well spent for
data collection.
6:04
Before we get too much further into
collecting data from websites,
6:08
we should talk about some other things to
think about to be good data wranglers.
6:12
I'll see you all back here in a bit to
have a look.
6:16