Sets let us combine explicit characters and escape patterns into pieces that can be repeated multiple times. They also let us specify pieces that should be left out of any matches.
New terms
- [abc] - this is a set of the characters 'a', 'b', and 'c'. It'll match any one of those characters, in any order, one character at a time.
- [a-z], [A-Z], or [a-zA-Z] - ranges that'll match any/all letters in the English alphabet in lowercase, uppercase, or both upper and lowercases.
- [0-9] - range that'll match any digit from 0 to 9. You can change the ends to restrict the set.
Let's see how far we've come.
0:00
We have exact matches, loose matches with
escape sequences, counts and
0:01
loose counts, and position.
0:05
We can do a huge amount already.
0:07
That's awesome.
0:09
But what if I know exactly the characters
that I want to match?
0:11
Or I need to make sure a certain character
isn't there?
0:14
Python's regular expressions engine has a
concept known as sets, and
0:17
these help us achieve exactly that.
0:20
We define a set of characters with square
brackets.
0:23
Any character in the brackets will be
looked for and
0:26
you can leave out duplicates.
0:28
So if we want to find the word apple, we'd
have a set with aple.
0:29
We can also define ranges in our sets.
0:33
If we want all of the lowercase letters,
we can define a set like a-z.
0:35
This is available for uppercase letters
too, as well as numbers.
0:39
And lastly, if we start a set off with a
caret,
0:44
it says to not match those characters.
0:46
So if we want to make sure our pattern
doesn't have a two, we can say caret two.
0:49
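The three ideas so far (a plain set, a range, and a negated set) can be sketched in a few lines. The sample string and variable names here are made up for illustration and are not from the video.

```python
import re

# Hypothetical sample text, not the data file used in the video.
text = "apple pie 2024"

# [aple] matches one character at a time: any single 'a', 'p', 'l', or 'e'.
single_chars = re.findall(r"[aple]", text)
print(single_chars)

# A range such as [a-z] covers every lowercase letter; + repeats the set.
lower_words = re.findall(r"[a-z]+", text)
print(lower_words)

# A leading caret negates the set: runs of characters that are not '2'.
not_twos = re.findall(r"[^2]+", text)
print(not_twos)
```

Note that the set on its own only ever matches a single character, which is why duplicates inside the brackets add nothing.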
We also need to look at two new flags.
0:54
The ignore case flag, which lets us match
against both upper and
0:55
lower case letters at once.
0:58
And the verbose flag, which lets us write
our patterns out in a more natural way.
1:00
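As a rough preview of the two flags, here is a minimal sketch using `re.IGNORECASE` and `re.VERBOSE` from Python's standard `re` module. The sample strings and the phone-style pattern are assumptions chosen only to show the flags in action.

```python
import re

# re.IGNORECASE (short form re.I) matches regardless of letter case.
case_hits = re.findall(r"python", "Python python PYTHON", re.IGNORECASE)
print(case_hits)

# re.VERBOSE (short form re.X) ignores whitespace in the pattern and
# allows comments, so a pattern can be spread over several lines.
phone = re.compile(r"""
    \d{3}   # first three digits
    -       # separator
    \d{4}   # last four digits
""", re.VERBOSE)
phone_hits = phone.findall("call 555-1234 or 555-9876")
print(phone_hits)
```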
All right, lots to do, so let's get back
to it.
1:04
So I think, now, I wanna get the email
addresses.
1:07
The pattern that we're going to use isn't
going to work for
1:10
100% of all the email addresses you will
ever encounter out on the internet.
1:13
So don't look at this as being a
cure-all for finding email addresses.
1:18
If you really wanna see just how far this
has to go, search for
1:22
email address regex or email address
regular expression on Stack Overflow,
1:28
and just enjoy the madness.
1:32
Okay, but we're gonna do this, we're gonna
use a set.
1:37
Let's comment that out.
1:40
So sets contain all of the characters
1:41
that we're cool with finding, that we
want the regular expression to find.
1:47
So, an email address, or at least our
email addresses that we had in
1:50
our text file there, can
have word characters in them.
1:55
They can have at symbols.
1:59
But that'll be a little bit later.
2:02
They can have numbers, and they can have
underscores, which are included in the \w.
2:04
They can have hyphens, put that at the
beginning.
2:09
Just in case, since a hyphen specifies ranges,
which we'll talk about in a minute.
2:12
And they can have plus signs.
2:15
And they can have dots, they can have
periods.
2:20
Okay, and then they can have an at symbol.
2:22
They have to have an at symbol, cuz it's
an email address.
2:23
And then for the end it's pretty much the
same set of stuff except
2:26
there won't be any plus signs cuz you
can't have a plus sign in a domain name.
2:32
So, there's our pattern against data, but
2:36
the problem is we don't want just
one of these to show up.
2:41
We want multiples of any of those to show
up, and then same here.
2:46
We could have multiples of any of those,
so
2:52
we can mark our set as being available
one or more times.
2:55
So, all right, let's save that and
let's try printing that out.
3:00
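The pattern described above might look something like the following. This is a sketch reconstructed from the spoken description, and the sample `data` string is a stand-in for the text file used in the video.

```python
import re

# Hypothetical sample data; the video reads its text from a file instead.
data = (
    "Ada, ada.lovelace+news@example.com; "
    "Bob, bob_smith@mail.example.org; "
    "Cy, cy-99@example.co.uk"
)

# First set: hyphen (listed first so it is literal), word characters,
# digits, plus signs, and dots. Then a required @, then the same set
# without the plus sign. Each set is followed by + for "one or more".
emails = re.findall(r"[-\w\d+.]+@[-\w\d.]+", data)
print(emails)
```

As the narration says, this is not a cure-all. For instance, a period that ends a sentence directly after an address would be swallowed into the match.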
And hey look at it, there they all are.
3:07
That's pretty cool.
3:09
We got them all.
3:10
It's too bad they're email addresses and
not Pokemon or, we'd, anyway.
3:11
So let's do another set and
3:16
see if we can get all the instances of the
word treehouse.
3:19
So, print re.findall.
3:23
And if I do the word treehouse here, which
these are what I want to find
3:28
I don't have to repeat letters cuz an e is
an e is an e.
3:34
So we can take out these last two e's and
3:38
then we just leave everything else in
there.
3:42
So it kinda looks like treehouse.
3:44
And let's do a plus sign on this.
3:46
And now I did this with all lowercase
letters, but some of the places in here,
3:50
we have Treehouse with an uppercase T.
3:54
So let's mark this.
3:58
Let's give this a flag that says,
3:59
I don't care what the case is, of the
thing that you match.
4:02
So we will use the re.IGNORECASE flag.
4:06
All right.
Let's try that.
4:10
Oh!
Well.
4:13
I mean, that worked, we did find
stuff, and I do see Treehouse in there.
4:15
But this isn't exactly what we wanted.
4:23
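To see why the result is not quite what we wanted, here is a small sketch with a made-up sentence. The set happily matches any run of its letters, including fragments inside other words.

```python
import re

# Hypothetical sample sentence, not the data file used in the video.
text = "Use Treehouse, the best"

# The letters of "treehouse" with duplicates removed, repeated one or
# more times, ignoring case.
loose_hits = re.findall(r"[trehous]+", text, re.IGNORECASE)
print(loose_hits)
```

The fragment 'est' comes from the middle of "best", which is the kind of partial match the narration is pointing at.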
So let's actually go back, if you remember
we talked about word boundaries, so
4:25
let's add in word boundaries to this.
4:29
So we want \b on both sides.
4:31
We want Treehouse to be this, like,
standalone word.
4:35
And, you know what,
4:39
this IGNORECASE is really long, so I'm
gonna actually take that out.
4:40
There, so I'm just gonna use re.I.
4:47
Exact same thing as IGNORECASE, it's just
the shorthand version of it.
4:48
So all right.
4:54
Let's run this again and look at that.
4:55
That's a lot better.
4:58
We've got Treehouse, Treehouse.
4:59
We've got se, which isn't exactly a match.
5:01
The and us aren't exactly matches, but we
also got Treehouse and Treehouse here.
5:03
So, okay.
5:08
A lot better.
5:08
It's not perfect, but it's a lot better.
5:09
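A minimal sketch of the word-boundary version, again with a made-up sentence rather than the video's data. The `\b` markers drop fragments from inside longer words, but short whole words built from the same letters still get through.

```python
import re

# Hypothetical sample sentence, not the data file used in the video.
text = "Use Treehouse, the best"

# \b on both sides requires the match to be a standalone word.
# re.I is the shorthand for re.IGNORECASE.
bounded_hits = re.findall(r"\b[trehous]+\b", text, re.I)
print(bounded_hits)
```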
So, we know if we look at Treehouse and
5:12
we wanted to count all those letters, we
know that Treehouse is nine letters long.
5:15
So, what if, instead of this plus sign, we
said,
5:19
find any of these letters so long as
they're in a run of nine.
5:24
There's always nine of them.
5:28
So, let's try that again and check it out.
5:31
We got Treehouse, Treehouse, Treehouse,
5:34
Treehouse cuz we have four people listed
that work at Treehouse.
5:36
So, that's pretty awesome.
5:40
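Swapping the plus sign for an exact count could be sketched like this. The sample text is made up, and the second check is an extra illustration that is not in the video.

```python
import re

# Hypothetical sample sentence, not the data file used in the video.
text = "Use Treehouse, the best"

# {9} requires exactly nine characters from the set between the boundaries.
pattern = r"\b[trehous]{9}\b"
exact_hits = re.findall(pattern, text, re.I)
print(exact_hits)

# Caveat: any nine-letter word built only from these letters also matches.
other_hits = re.findall(pattern, "outhouses", re.I)
print(other_hits)
```

The count narrows things down a lot, but the set still does not care about letter order, so this is a tighter filter and not an exact-word match.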