Well done!
You have completed Scraping Data From the Web!
Let's look at two Beautiful Soup methods, `find()` and `find_all()`, in greater detail.
Welcome back.
0:00
We just saw how to use
the find_all method to find
0:01
all of a particular item on the page.
0:05
We can use the find method to find
the first instance of an item.
0:08
We can change this to find,
get rid of our for loop here,
0:14
And run it.
0:26
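As a sketch of the difference, here is the same idea run against some stand-in markup (the actual page from the video isn't included here, so the HTML below is invented):

```python
from bs4 import BeautifulSoup

# Invented markup standing in for the page shown in the video.
html = """
<div class="featured"><h2>Mustangs</h2><p>Wild horses.</p></div>
<div class="listing"><h2>Second</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list of every match...
all_divs = soup.find_all("div")
print(len(all_divs))  # 2

# ...while find() returns only the first match (or None if nothing matches).
first_div = soup.find("div")
print(first_div.h2.string)  # Mustangs
```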
I should probably have changed
that name to just div, ah well.
0:30
Naming things is always a challenge for
me.
0:33
We're getting all of the info back for
that particular div element.
0:36
The featured one on the page here,
in this case.
0:40
What if we just want
the header text in here?
0:43
Since it's a child element of that div,
we can chain elements together.
0:46
Let's comment this out.
0:53
Close this down.
0:57
We want featured_header = soup.find.
1:01
We want div class featured.
1:08
We just want the h2 element.
1:17
And we'll print the featured header.
1:20
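The chained lookup might look like this; only the `featured` class name comes from the video, the heading text is a placeholder:

```python
from bs4 import BeautifulSoup

# Placeholder markup; only the "featured" class name comes from the video.
html = '<div class="featured"><h2>Mustang Roundup</h2></div>'
soup = BeautifulSoup(html, "html.parser")

# find() the div, then chain straight into its child h2 element.
featured_header = soup.find("div", {"class": "featured"}).h2
print(featured_header)  # <h2>Mustang Roundup</h2>
```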
Nice.
1:26
But we still have our
tag elements in there.
1:27
From a data cleanliness standpoint,
1:30
it would be great if we could
get rid of those, right?
1:32
Well, there's a convenient method for
that called get_text.
1:35
Called get_text.
1:43
Yippee, we got some text from our site.
1:49
We scraped it out.
1:52
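Appending `get_text()` to that same lookup strips the tags away; again a minimal sketch with placeholder markup:

```python
from bs4 import BeautifulSoup

# Placeholder markup; only the "featured" class name comes from the video.
html = '<div class="featured"><h2>Mustang Roundup</h2></div>'
soup = BeautifulSoup(html, "html.parser")

# get_text() returns just the text, with the surrounding tags stripped away.
header_text = soup.find("div", {"class": "featured"}).h2.get_text()
print(header_text)  # Mustang Roundup
```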
There's a bit of a gotcha to watch out for
with this get_text method, though.
1:53
It strips away the tags from
whatever we're working with,
1:57
leaving just a block of text.
2:00
Let's take away this h2 element from
our text value to see what I mean.
2:02
While this is perhaps more readable for
2:12
us, it makes it much more challenging
to process going forward.
2:14
If we wanted to select mustang, or
2:18
the text about them at this point,
it would be more of a challenge.
2:20
The thing to remember about get_text
is to use it as the last step
2:24
in the scraping process.
2:29
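To see the gotcha concretely, here is a sketch of calling get_text() on the whole div instead of just the h2 (placeholder markup again):

```python
from bs4 import BeautifulSoup

# Placeholder markup with two child elements inside the featured div.
html = ('<div class="featured">'
        '<h2>Mustangs</h2><p>Wild horses of the plains.</p></div>')
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", {"class": "featured"})
# Flattening the whole div leaves no structure to select against,
# so save get_text() for the very last step.
flat = div.get_text()
print(flat)  # MustangsWild horses of the plains.
```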
We've seen that the find method returns
the first occurrence of an item in
2:31
a Beautiful Soup object.
2:35
It is basically find_all with
the limit of results set to one.
2:36
Let's look at the parameters
these methods take.
2:41
Name, which looks for tags with
certain names, such as title or div.
2:45
Attrs, which allows searching for
a specific CSS class.
2:50
We'll take a look at this here shortly.
2:54
Recursive, by default, find and find_all
examine all descendants of a tag.
2:57
If we set recursive to False,
3:03
it will only look at the direct
children of the tag.
3:05
String or text, which allows searching
for strings instead of tags.
3:09
Kwargs, which allows searching
on other items, such as a CSS ID.
3:15
Limit, the find_all method
also accepts a limit argument
3:20
to limit the number of results returned.
3:24
As I mentioned, find is a find_all
with a limit set to one.
3:26
We can pass in a string,
a list, a regular expression,
3:30
a value of True, or
even a function to the name, string, or
3:34
kwargs arguments to further enhance
the searching capabilities.
3:38
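Here is a quick sketch of a few of those parameters in action; the markup is invented purely for illustration:

```python
import re
from bs4 import BeautifulSoup

# Invented markup for illustration.
html = """
<div><p>top-level</p><span><p>nested</p></span></div>
<h1>Title</h1><h2>Sub</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# recursive=False looks only at the direct children of the tag.
div = soup.find("div")
direct = div.find_all("p", recursive=False)
print(len(direct))  # 1 -- the nested <p> inside <span> is skipped

# limit caps how many results find_all() returns.
first_p = soup.find_all("p", limit=1)
print(len(first_p))  # 1

# name accepts a regular expression: match any heading tag.
headings = soup.find_all(re.compile(r"^h[1-6]$"))
print([h.name for h in headings])  # ['h1', 'h2']

# string searches text instead of tags.
print(soup.find(string="Title"))  # Title
```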
Let's take a look at the attrs argument
to search for the CSS class and print out
3:43
all references to this primary button
class, which is this button down here.
3:48
Come back over here to our code,
let's comment this out.
3:54
So for button in soup.find_all.
4:00
Gonna look for a class, and
4:07
that class was button button--primary.
4:10
And we'll just print the buttons out.
4:20
And there it is.
4:27
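The attrs version of that loop might look like this; the button class names come from the video, the rest of the markup is a stand-in:

```python
from bs4 import BeautifulSoup

# Stand-in markup; only the class names come from the video.
html = ('<a class="button button--primary">Learn more</a>'
        '<a class="button">Cancel</a>')
soup = BeautifulSoup(html, "html.parser")

# attrs takes a dict of attribute names to values; a string with a
# space matches the exact value of the multi-valued class attribute.
buttons = soup.find_all(attrs={"class": "button button--primary"})
for button in buttons:
    print(button.get_text())  # Learn more
```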
Since class is a reserved word in Python
and searching for items on page based
4:29
on class is a frequent task, Beautiful
Soup provides a process for that.
4:34
We can change our code to use a special
keyword argument, class underscore.
4:39
So we can take all this out,
remove our closing curly bracket,
4:44
and we get the same result
with a bit less typing.
4:55
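The `class_` shortcut, sketched against stand-in markup (only the class names come from the video):

```python
from bs4 import BeautifulSoup

# Stand-in markup; only the class names come from the video.
html = ('<a class="button button--primary">Learn more</a>'
        '<a class="button">Cancel</a>')
soup = BeautifulSoup(html, "html.parser")

# class_ (trailing underscore) avoids clashing with Python's reserved
# word `class`, and behaves the same as the attrs dict version.
buttons = soup.find_all(class_="button button--primary")
print(len(buttons))  # 1
print(buttons[0].get_text())  # Learn more
```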
Another very common task which will
be useful when we want to move
4:58
from one page to another is to get
all of the hyperlinks on a page.
5:02
We can navigate into a specific tag and
5:07
use the get method to extract
specific information.
5:09
Minimize that.
5:13
Again, we'll comment this out,
just for clarity.
5:17
So for link in soup.find_all, so
we'll look for all the anchor elements.
5:21
And then we'll print out
all of the href attributes.
5:29
So link and we'll get the hrefs.
5:34
We can look at these patterns to
determine internal and external links.
5:42
Definitely, a handy thing to do.
5:47
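A sketch of that link loop, with invented hrefs to show the internal/external pattern:

```python
from bs4 import BeautifulSoup

# Invented links; a real page would have many more.
html = ('<a href="/inventory">Inventory</a>'
        '<a href="https://example.com/specs">Specs</a>')
soup = BeautifulSoup(html, "html.parser")

# get() reads an attribute off a tag, here each anchor's href.
hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)  # ['/inventory', 'https://example.com/specs']

# A leading slash is a quick heuristic for an internal link.
internal = [h for h in hrefs if h.startswith("/")]
print(internal)  # ['/inventory']
```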
Beautiful Soup is a very powerful tool,
and
5:50
we've just scratched
the surface of its power.
5:52
But we've seen how we can use
Python to read a webpage and
5:55
get very specific data from the HTML.
5:58
It can take a bit of work to
decipher the page structure, but
6:01
that is time well spent for
data collection.
6:04
Before we get too much further into
collecting data from websites,
6:08
we should talk about some other things to
think about to be good data wranglers.
6:12
I'll see you all back here in a bit to
have a look.
6:16