01-02 Web Crawler

0:00 - 0:02

[Sebastian Thrun] So what's your take on how to build a search engine,
0:02 - 0:03

you've built one before, right?
0:03 - 0:06

[Sergey Brin - Co-Founder, Google] Yes. I think the most important thing
0:06 - 0:08

if you're going to build a search engine
0:08 - 0:12

is to have a really good corpus to start out with.
0:12 - 0:19

In our case we used the World Wide Web, which at the time was certainly smaller than it is today.
0:19 - 0:21

But it was also very new and exciting.
0:21 - 0:23

There were all sorts of unexpected things there.
0:23 - 0:26

[David Evans] So the goal for the first three units of the course is to build that corpus.
0:27 - 0:30

And we want to build the corpus for our search engine
0:30 - 0:32

by crawling the web and that's what a web crawler does.
0:32 - 0:36

What a web crawler is, it's a program that collects content from the web.
0:36 - 0:40

If you think of a web page that you see in your browser, you have a page like this.
0:40 - 0:43

And we'll use the Udacity site as an example web page.
0:43 - 0:47

It has lots of content, it has some images, it has some text.
0:47 - 0:51

All of this comes into your browser when you request the page.
0:51 - 0:53

The important thing that it has is links.
0:53 - 0:57

And what a link is, is something that goes to another page.
0:57 - 1:00

So we have a link to the frequently asked questions,
1:00 - 1:02

we have a link to the CS 101 page.
1:02 - 1:04

There's some other links on the page.
1:04 - 1:07

And that link may show in your browser with an underline,
1:07 - 1:09

it may not, depending on how your browser is set.
1:09 - 1:11

But the important thing that it does,
1:11 - 1:13

is it's a pointer to some other web page.
1:13 - 1:16

And those other web pages may also have links
1:16 - 1:19

so we have another link on this page.
1:19 - 1:23

Maybe it's on my name, which you can follow to my home page.
1:23 - 1:26

And all the pages that we can find with our web crawler
1:26 - 1:29

are found by following the links.
1:29 - 1:31

So it won't necessarily find every page on the web.
1:31 - 1:33

If we start with a good seed page
1:33 - 1:35

we'll find lots of pages, though.
1:35 - 1:37

And what the crawler's gonna do is start with one page,
1:37 - 1:41

find all the links on that page, follow them to find other pages
1:41 - 1:45

and then on those other pages it will follow the links on those pages
1:45 - 1:48

to find other pages and there will be lots more links on those pages.
1:48 - 1:51

And eventually we'll have a collection of lots of pages on the web.
1:51 - 1:54

So that's what we want to do to build a web crawler.
1:54 - 1:56

We want to find some way to start from one seed page,
1:56 - 1:59

extract the links on that page,
1:59 - 2:01

follow those links to other pages,
2:01 - 2:03

then collect the links on those other pages,
2:03 - 2:05

follow them, collect all that.
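
[Editor's note] The loop described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code developed in the course: the name crawl, the toy_web dictionary, and the page names in it are all hypothetical, and a dictionary lookup stands in for actually requesting a page and extracting its links.

```python
def crawl(seed, links_on_page):
    """Collect every page reachable from seed by following links.

    links_on_page is a stand-in for the real web: a dict mapping a page
    name to the list of pages it links to (a hypothetical toy structure).
    """
    to_crawl = [seed]   # pages we have found but not yet visited
    crawled = []        # pages we have already visited
    while to_crawl:
        page = to_crawl.pop()
        if page in crawled:
            continue
        crawled.append(page)
        # "Extract the links on that page" is simulated by a dict lookup.
        for link in links_on_page.get(page, []):
            if link not in crawled:
                to_crawl.append(link)
    return crawled


# A made-up five-page web. Nothing links to "unlinked".
toy_web = {
    "home": ["faq", "cs101"],
    "faq": ["home"],
    "cs101": ["instructor"],
    "instructor": [],
    "unlinked": ["home"],
}

found = crawl("home", toy_web)
print(sorted(found))
```

Starting from "home", the sketch reaches "faq", "cs101", and "instructor" but never "unlinked", because no page links to it. That is the sense in which a crawler won't necessarily find every page.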
2:05 - 2:07

So that sounds like a lot to do.
2:07 - 2:09

We're not going to do all that in this first class.
2:09 - 2:12

What we're going to do in this first unit is just extract a link.
2:12 - 2:14

So we're going to start with a bunch of text.
2:14 - 2:17

It's going to have a link in it with a URL.
2:17 - 2:19

What we want to find is that URL,
2:19 - 2:21

so we can request the next page.
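
[Editor's note] As a rough sketch of what "extract a link" could look like in Python: the function name get_first_link and the sample text are hypothetical, and the sketch assumes the link is written as an HTML anchor of the form <a href="...">, which the transcript does not specify.

```python
def get_first_link(page_text):
    """Return the URL of the first <a href="..."> link, or None if absent."""
    marker = '<a href="'              # assumed link syntax
    start_tag = page_text.find(marker)
    if start_tag == -1:
        return None                   # no link in the text
    start_quote = start_tag + len(marker)
    end_quote = page_text.find('"', start_quote)
    if end_quote == -1:
        return None                   # opening quote never closed
    return page_text[start_quote:end_quote]


sample = 'Read the <a href="http://example.com/faq">FAQ</a> first.'
print(get_first_link(sample))  # prints http://example.com/faq
```

Real pages write links in more ways than this (single quotes, extra attributes, relative URLs), so the string search is a starting point rather than a complete parser.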
2:21 - 2:23

The goal for the second unit
2:23 - 2:25

is to be able to keep going.
2:25 - 2:28

If there are many links on one page, you will want to be able to find them all.
2:28 - 2:30

So that's what we'll do in unit 2,
2:30 - 2:32

is to figure out how to keep going to extract all those links.
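
[Editor's note] One way to "keep going" is to repeat the search from just past the previous link until no more links are found. The sketch below is illustrative only: get_all_links and the sample text are hypothetical, and it assumes links are written as <a href="...">.

```python
def get_all_links(page_text):
    """Return the URLs of all <a href="..."> links, in order of appearance."""
    marker = '<a href="'   # assumed link syntax
    links = []
    position = 0           # where the next search starts
    while True:
        start_tag = page_text.find(marker, position)
        if start_tag == -1:
            break          # no more links
        start_quote = start_tag + len(marker)
        end_quote = page_text.find('"', start_quote)
        if end_quote == -1:
            break          # malformed link at the end of the text
        links.append(page_text[start_quote:end_quote])
        position = end_quote + 1   # continue after this link
    return links


sample = ('<a href="http://example.com/faq">FAQ</a> and '
          '<a href="http://example.com/cs101">CS 101</a>')
print(get_all_links(sample))
```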
2:32 - 2:36

In unit three, well, we want to go beyond just one page.
2:36 - 2:40

So by the end of unit two we can print out all the links on one page.
2:40 - 2:44

For unit 3 we want to collect all those links, so we can keep going,
2:44 - 2:47

and end up using our crawler to collect many, many pages.
2:47 - 2:50

So by the end of unit three we'll have built a web crawler.
2:50 - 2:52

We'll have a way of building our corpus.
2:52 - 2:57

Then the remaining three units will look at how to actually respond to queries.
2:57 - 3:01

So in unit four we'll figure out how to give a good response.
3:01 - 3:08

So if you search for a keyword, you want to get a response that's a list of the pages
3:08 - 3:10

where that keyword appears.
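
[Editor's note] A keyword lookup of this kind can be sketched with a dictionary that maps each word to the pages containing it. Everything here is an assumption for illustration: the names build_index and lookup, the two-page corpus, and the choice to split text on whitespace and ignore case.

```python
def build_index(pages):
    """Map each lowercase word to the list of page names where it appears.

    pages is a dict of page name -> page text (a hypothetical toy corpus).
    """
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            urls = index.setdefault(word, [])
            if url not in urls:      # list each page once per word
                urls.append(url)
    return index


def lookup(index, keyword):
    """Return the pages where keyword appears, or an empty list."""
    return index.get(keyword.lower(), [])


corpus = {
    "page_a": "web crawler basics",
    "page_b": "search engine basics",
}
index = build_index(corpus)
print(lookup(index, "basics"))
print(lookup(index, "crawler"))
```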
3:10 - 3:15

And we'll figure out in unit five a way to do that that scales when we have a large corpus.
3:15 - 3:19

And then in unit six what we want to do is, well, we don't just want to find a list,
3:19 - 3:21

we want to find the best one.
3:21 - 3:24

So we'll figure out how to rank all the pages where that keyword appears.
3:24 - 3:27

So we're getting a little ahead of ourselves now,
3:27 - 3:30

because all we're going to do for unit one,
3:30 - 3:32

is to figure out how to extract a link from the page.
3:32 - 3:35

And the search engine that we'll build at the end of this
3:35 - 3:37

will be a functional search engine.
3:37 - 3:40

It will have the main components that a search engine like Google has.
3:40 - 3:43

It certainly won't be as powerful as Google will be,
3:43 - 3:44

we want to keep things simple.
3:44 - 3:46

We want to have a small amount of code to write.
3:46 - 3:48

And we should remember that our real goal
3:48 - 3:50

is not as much to build a search engine,
3:50 - 3:52

but to use the goal of building a search engine as a vehicle
3:52 - 3:55

for learning about computer science
3:55 - 3:56

and learning about programming
3:56 - 3:58

so the things we learn by doing this
3:58 -

will allow us to solve lots and lots of other problems.

Title:: 01-02 Web Crawler
Description:: Professor David Evans gives an overview of the unit in CS 101.

Video Language:: English
Duration:: 04:03
