Data Scraping: Navigating Unique Web Structures

Learn how to scrape web data by identifying HTML elements and extracting text using BeautifulSoup. Understand common pitfalls and adapt your scraping strategy to handle irregular web page structures.

Key Insights

Use BeautifulSoup's find method to precisely locate HTML elements by their tags and attributes, such as retrieving text from an a tag with a specific attribute (name="1.1.13").
Apply list comprehensions in Python to efficiently extract text from multiple HTML elements, demonstrated by grabbing content from the first 10 a tags.
Stay attentive to the actual structure and quirks of web pages rather than relying on conventional standards, as pages may deviate from typical usage, requiring flexible scraping strategies.

Let's solve these challenges. Let's look at the first one here.

Here’s that text. If I want, I can inspect it and find out exactly where it is. It’s the a tag whose name attribute is 1.1.13. Okay, so what we want is to say line = soup.find('a', {'name': '1.1.13'}). And then we’ll run line.get_text().
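As a minimal sketch of that lookup, the snippet below runs the same find call against a small inline HTML string. The markup in SAMPLE_HTML and the line_text variable are hypothetical stand-ins, not the lesson's actual page, and the sketch assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; the page used in the lesson is much longer.
SAMPLE_HTML = """
<a href="/index.html">Home page</a>
<a name="1.1.12">An illustrative line of text</a><br>
<a name="1.1.13">Another illustrative line of text</a><br>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Find the first <a> tag whose name attribute is exactly "1.1.13".
line = soup.find("a", {"name": "1.1.13"})

# find returns None when nothing matches, so guard before calling get_text.
line_text = line.get_text() if line is not None else None
print(line_text)
```

The attribute goes in a dictionary because find already uses its own name parameter for the tag name, so the HTML name attribute cannot be passed as a keyword argument.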

And there it is. All right, the second one’s a little tougher. Not so tough, a little tougher.

All A tags. So, that would be, we can say, I don’t know, tags. That seems like a fine name, pretty generic, but we’re doing something pretty generic.

I want to find all A tags. That’s it. Just find them all.

Now, let’s print out the text for the first 10. Let’s make a tag_texts list. It’s a new list.

We’ll do a list comprehension. We want tag. I kind of prefer to write it like this: for tag in tags.

And now I can go back and say I want tag.get_text() for every tag in tags. Okay, and now I only want, again, the first 10 of these. And there we go.
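The steps just described can be sketched roughly as follows. This assumes find_all is the call used to collect the tags, and it again uses a small hypothetical SAMPLE_HTML string in place of the real page, so the sample only has five a tags.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup with a mix of links and named anchors.
SAMPLE_HTML = """
<a href="/index.html">Home page</a>
<a href="/play/index.html">Play title</a>
<a name="speech1"><b>SPEAKER</b></a>
<a name="1.1.1">First illustrative line</a>
<a name="1.1.2">Second illustrative line</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect every <a> tag on the page, in document order.
tags = soup.find_all("a")

# Pull the text out of at most the first 10 tags.
# Slicing past the end of the list is safe, so short pages work too.
tag_texts = [tag.get_text() for tag in tags[:10]]
print(tag_texts)
```

Slicing before the comprehension avoids calling get_text on tags that will be thrown away; slicing the finished list afterwards would give the same result.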

We have Shakespeare's homepage, 'Love's Labour's Lost,' and then some content from it. Now, these may not be what you were expecting. The tags up here are more what we think of as links.

That’s what a tags typically are. The rest are doing kind of a funky thing on this page. But it’s something to pay attention to as you’re scraping data: pages don’t have to follow the regular standards for how to make a page.

Our job as data scrapers is to figure out what this page actually does, not what it should be doing. If we pay attention only to what pages typically do, and not to what this page is doing in this case, we may get more, fewer, or just different texts than we expect.
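One way to check what a page is actually doing with its a tags is to count which attributes they carry before deciding what to extract. The sketch below is an illustration under assumptions: SAMPLE_HTML, the usage counter, and the category labels are hypothetical names, not part of the lesson.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical stand-in markup: two real links, three named anchors.
SAMPLE_HTML = """
<a href="/index.html">Home page</a>
<a href="/play/index.html">Play title</a>
<a name="speech1"><b>SPEAKER</b></a>
<a name="1.1.1">First illustrative line</a>
<a name="1.1.2">Second illustrative line</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Tally how each <a> tag is being used on this particular page.
usage = Counter()
for tag in soup.find_all("a"):
    if tag.has_attr("href"):
        usage["link (href)"] += 1
    elif tag.has_attr("name"):
        usage["anchor (name)"] += 1
    else:
        usage["other"] += 1
print(usage)

# If only conventional links are wanted, filter on the href attribute.
links = [tag.get_text() for tag in soup.find_all("a", href=True)]
print(links)
```

A quick tally like this shows whether "all a tags" means mostly links or mostly something else, which helps explain a result list that looks different from what was expected.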

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building. He particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
