Michael Strand
10,897 Points

Finding Internal Links Without .html

Hi, how would you go about finding internal links that have a URL structure like https://www.website.com/about-us, with no .html extension?

Michael Strand
10,897 Points

This is my starting point, but it only returns one URL.

import re
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

BASE_URL = 'https://www.website.com/'
site_links = []


def internal_links(linkURL):
    # urljoin accepts both a relative path ('/') and an absolute href
    html = urlopen(urljoin(BASE_URL, linkURL))
    soup = BeautifulSoup(html, 'html.parser')

    # Look for an <a> tag whose href starts with the site's own address
    return soup.find('a', href=re.compile(r'^https://www\.website\.com/'))


if __name__ == '__main__':
    urls = internal_links('/')
    # Stop when a page has no matching link
    while urls is not None:
        page = urls.attrs['href']

        if page not in site_links:
            site_links.append(page)

            print(page)
            print('\n===========\n')
            urls = internal_links(page)

        else:
            break

2 Answers

Steven Parker
231,271 Points

According to the documentation, "find" always returns just one result. To get a list of all matching items, use "find_all" instead.
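
To make the difference concrete, here is a minimal sketch that runs "find" and "find_all" against the same page. The sample HTML and the helper names first_internal_link and all_internal_links are made up for illustration; only the pattern from the question is reused.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample page, used here instead of a live request
SAMPLE_HTML = """
<html><body>
  <a href="https://www.website.com/">Home</a>
  <a href="https://www.website.com/about-us">About us</a>
  <a href="https://www.website.com/contact">Contact</a>
  <a href="https://www.example.org/partner">Partner site</a>
  <a>No href here</a>
</body></html>
"""

# Internal links start with the site's own address; no .html needed
INTERNAL = re.compile(r'^https://www\.website\.com/')


def first_internal_link(soup):
    """Return the href of the first internal link, or None."""
    tag = soup.find('a', href=INTERNAL)
    return tag['href'] if tag is not None else None


def all_internal_links(soup):
    """Return the hrefs of every internal link on the page."""
    return [tag['href'] for tag in soup.find_all('a', href=INTERNAL)]


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(first_internal_link(soup))
print(all_internal_links(soup))
```

The first call gives back a single address, while the second gives back a list with one entry per matching tag, so any code that consumes the result has to loop over it.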

Michael Strand
10,897 Points

I did try that, but it just returned errors. The video uses find, matching on the .html extension, and it still returns multiple results.
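
The errors themselves are not shown, so this is only a guess at the cause: "find_all" returns a list-like ResultSet rather than a single tag, so a line such as urls.attrs['href'] stops working once "find" is swapped for "find_all" and nothing else changes. A rough sketch of a loop that handles a list of links per page might look like the following. The names fetch, page_links, crawl, max_pages and FAKE_SITE are all hypothetical, and the in-memory pages stand in for the real site so the sketch runs without a network connection.

```python
import re
from collections import deque
from urllib.request import urlopen

from bs4 import BeautifulSoup

INTERNAL = re.compile(r'^https://www\.website\.com/')
HOME = 'https://www.website.com/'


def fetch(url):
    """Download a page and return its HTML (makes a real request)."""
    with urlopen(url) as response:
        return response.read()


def page_links(html):
    """Return the href of every internal link in one page's HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    # find_all() returns a list-like ResultSet, so read href tag by tag
    return [tag['href'] for tag in soup.find_all('a', href=INTERNAL)]


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Visit internal links breadth-first, fetching each URL once."""
    seen = {start_url}
    order = [start_url]
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        html = fetch_page(queue.popleft())
        fetched += 1
        for href in page_links(html):
            if href not in seen:
                seen.add(href)
                order.append(href)
                queue.append(href)
    return order


def anchor(path):
    """Build an <a> tag pointing at a path on the hypothetical site."""
    return '<a href="{}{}">link</a>'.format(HOME, path)


# Hypothetical three-page site kept in memory so the sketch runs offline
FAKE_SITE = {
    HOME: anchor('') + anchor('about-us'),
    HOME + 'about-us': anchor('contact'),
    HOME + 'contact': anchor(''),
}

for url in crawl(HOME, fetch_page=FAKE_SITE.__getitem__):
    print(url)
```

Calling crawl with only a start URL would use the real fetch function instead. The sketch treats addresses as plain strings, so two spellings of the same page (for example with and without a trailing slash) would count as different links.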

Steven Parker
231,271 Points

I'm confused. I don't have experience with it myself, but unless I'm reading the docs wrong, they seem to state explicitly that "find" only returns one item.

Michael Strand
10,897 Points

Yes, I was confused as well. I don't quite understand how that worked either.
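
One possible explanation, assuming the video's loop has the same shape as the one in the question: "find" does return a single link each time, but the loop calls it again on every page it moves to, so the output is one link per page visited rather than many links from one page. Whether that chain keeps going depends on what the first matching link on each page happens to be. The sketch below mimics the loop on two made-up sites; follow_first_links, CHAIN_SITE and LOGO_SITE are hypothetical and are not taken from the video or from any real site.

```python
import re

from bs4 import BeautifulSoup

INTERNAL = re.compile(r'^https://www\.website\.com/')
HOME = 'https://www.website.com/'


def follow_first_links(pages, start):
    """Mimic the loop from the question on in-memory pages.

    pages maps URL -> HTML. Each step uses find(), so it takes only
    the first internal link on the current page and then moves there.
    """
    visited = []
    current = start
    while current in pages:
        soup = BeautifulSoup(pages[current], 'html.parser')
        tag = soup.find('a', href=INTERNAL)
        # Stop on a page without internal links or on a repeated link
        if tag is None or tag['href'] in visited:
            break
        visited.append(tag['href'])
        current = tag['href']
    return visited


def anchor(path):
    """Build an <a> tag pointing at a path on the hypothetical site."""
    return '<a href="{}{}">link</a>'.format(HOME, path)


# Hypothetical site where the first link on each page leads somewhere new
CHAIN_SITE = {
    HOME: anchor('a') + anchor('b'),
    HOME + 'a': anchor('b') + anchor(''),
    HOME + 'b': anchor('c'),
    HOME + 'c': anchor('a'),
}

# Hypothetical site where every page starts with a link back to the home page
LOGO_SITE = {
    HOME: anchor('') + anchor('about-us'),
    HOME + 'about-us': anchor('') + anchor('contact'),
}

print(follow_first_links(CHAIN_SITE, HOME))
print(follow_first_links(LOGO_SITE, HOME))
```

On the first site the loop walks through three pages before it meets a link it has already seen. On the second it stops after a single URL, which matches the symptom in the question: if every page opens with the same internal link, following only the first match never gets any further.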