01-02 Web Crawler

  • 0:00 - 0:02
    [Sebastian Thrun] So what's your take on how to build a search engine,
  • 0:02 - 0:03
    you've built one before, right?
  • 0:03 - 0:06
    [Sergey Brin - Co-Founder, Google] Yes. I think the most important thing
  • 0:06 - 0:08
    if you're going to build a search engine
  • 0:08 - 0:12
    is to have a really good corpus to start out with.
  • 0:12 - 0:19
    In our case we used the World Wide Web, which at the time was certainly smaller than it is today.
  • 0:19 - 0:21
    But it was also very new and exciting.
  • 0:21 - 0:23
    There were all sorts of unexpected things there.
  • 0:23 - 0:26
    [David Evans] So the goal for the first three units of the course is to build that corpus.
  • 0:27 - 0:30
    And we want to build the corpus for our search engine
  • 0:30 - 0:32
    by crawling the web and that's what a web crawler does.
  • 0:32 - 0:36
    What a web crawler is, it's a program that collects content from the web.
  • 0:36 - 0:40
    If you think of a web page that you see in your browser, you have a page like this.
  • 0:40 - 0:43
    And we'll use the Udacity site as an example web page.
  • 0:43 - 0:47
    It has lots of content: it has some images, it has some text.
  • 0:47 - 0:51
    All of this comes into your browser when you request the page.
  • 0:51 - 0:53
    The important thing that it has is links.
  • 0:53 - 0:57
    And what a link is, is something that goes to another page.
  • 0:57 - 1:00
    So we have a link to the frequently asked questions,
  • 1:00 - 1:02
    we have a link to the CS 101 page.
  • 1:02 - 1:04
    There are some other links on the page.
  • 1:04 - 1:07
    And that link may show in your browser with an underline,
  • 1:07 - 1:09
    it may not, depending on how your browser is set.
  • 1:09 - 1:11
    But the important thing that it does,
  • 1:11 - 1:13
    is it's a pointer to some other web page.
  • 1:13 - 1:16
    And those other web pages may also have links
  • 1:16 - 1:19
    so we have another link on this page.
  • 1:19 - 1:23
    Maybe it's on my name, which you can follow to my home page.
  • 1:23 - 1:26
    And all the pages that we can find with our web crawler
  • 1:26 - 1:29
    are found by following the links.
  • 1:29 - 1:31
    So it won't necessarily find every page on the web.
  • 1:31 - 1:33
    If we start with a good seed page,
  • 1:33 - 1:35
    we'll find lots of pages, though.
  • 1:35 - 1:37
    And what the crawler's gonna do is start with one page,
  • 1:37 - 1:41
    find all the links on that page, follow them to find other pages
  • 1:41 - 1:45
    and then on those other pages it will follow the links on those pages
  • 1:45 - 1:48
    to find other pages, and there will be lots more links on those pages.
  • 1:48 - 1:51
    And eventually we'll have a collection of lots of pages on the web.
  • 1:51 - 1:54
    So that's what we want to do to build a web crawler.
  • 1:54 - 1:56
    We want to find some way to start from one seed page,
  • 1:56 - 1:59
    extract the links on that page,
  • 1:59 - 2:01
    follow those links to other pages,
  • 2:01 - 2:03
    then collect the links on those other pages,
  • 2:03 - 2:05
    follow them, collect all that.
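The crawl loop described here could be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the course's own code: `FAKE_WEB` and the page names in it are hypothetical stand-ins for real pages and their links, so nothing is fetched over the network.

```python
# Hypothetical in-memory "web": each page name maps to the links found on it.
FAKE_WEB = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": [],
}


def crawl(seed, web):
    """Start from a seed page and follow links until no new pages remain."""
    to_crawl = [seed]   # pages we still need to visit
    crawled = []        # pages we have already visited
    while to_crawl:
        page = to_crawl.pop()
        if page in crawled:
            continue    # skip pages we have already collected
        crawled.append(page)
        # "Extract" the links on this page and queue the ones we have not seen.
        for link in web.get(page, []):
            if link not in crawled:
                to_crawl.append(link)
    return crawled


print(crawl("seed", FAKE_WEB))
```

Keeping a list of pages already crawled is what stops the loop from going around forever when two pages link to each other.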
  • 2:05 - 2:07
    So that sounds like a lot to do.
  • 2:07 - 2:09
    We're not going to do all that in this first class.
  • 2:09 - 2:12
    What we're going to do in this first unit is just extract a link.
  • 2:12 - 2:14
    So we're going to start with a bunch of text.
  • 2:14 - 2:17
    It's going to have a link in it with a URL.
  • 2:17 - 2:19
    What we want to find is that URL,
  • 2:19 - 2:21
    so we can request the next page.
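As a rough illustration of that goal, here is a minimal Python sketch that pulls the first URL out of a string of text. It assumes the link is written as `<a href="...">` with double quotes; the function name `get_first_link` and the example.com address are hypothetical, chosen only for this example.

```python
def get_first_link(page):
    """Return the URL of the first <a href="..."> link in page, or None."""
    start_link = page.find('<a href=')
    if start_link == -1:
        return None                      # no link tag in the text
    start_quote = page.find('"', start_link)
    if start_quote == -1:
        return None                      # tag found, but no opening quote
    end_quote = page.find('"', start_quote + 1)
    if end_quote == -1:
        return None                      # opening quote never closed
    return page[start_quote + 1:end_quote]


# Hypothetical page text containing a single link.
page = 'Some text <a href="http://example.com/cs101">CS 101</a> and more text.'
print(get_first_link(page))  # http://example.com/cs101
```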
  • 2:21 - 2:23
    The goal for the second unit
  • 2:23 - 2:25
    is to be able to keep going.
  • 2:25 - 2:28
    If there are many links on one page, you will want to be able to find them all.
  • 2:28 - 2:30
    So that's what we'll do in unit two,
  • 2:30 - 2:32
    is to figure out how to keep going to extract all those links.
  • 2:32 - 2:36
    In unit three, well, we want to go beyond just one page.
  • 2:36 - 2:40
    So by the end of unit two we can print out all the links on one page.
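Printing every link on one page might look something like the sketch below. It is only an illustration in Python, again assuming links appear as `<a href="...">` with double quotes; `get_all_links` and the example.com addresses are hypothetical names used for this example.

```python
def get_all_links(page):
    """Return a list of every URL found in <a href="..."> tags in page."""
    marker = '<a href="'
    links = []
    position = 0
    while True:
        start_link = page.find(marker, position)
        if start_link == -1:
            break                        # no more links on the page
        start_url = start_link + len(marker)
        end_url = page.find('"', start_url)
        if end_url == -1:
            break                        # malformed tag: closing quote missing
        links.append(page[start_url:end_url])
        position = end_url + 1           # keep going from just past this link
    return links


# Hypothetical page text with two links.
page = ('See the <a href="http://example.com/faq">FAQ</a> or the '
        '<a href="http://example.com/cs101">CS 101</a> page.')
for url in get_all_links(page):
    print(url)
```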
  • 2:40 - 2:44
    For unit three we want to collect all those links, so we can keep going
  • 2:44 - 2:47
    and end up with a crawler that collects many, many pages.
  • 2:47 - 2:50
    So by the end of unit three we'll have built a web crawler.
  • 2:50 - 2:52
    We'll have a way of building our corpus.
  • 2:52 - 2:57
    Then the remaining three units will look at how to actually respond to queries.
  • 2:57 - 3:01
    So in unit four we'll figure out how to give a good response.
  • 3:01 - 3:08
    So if you search for a keyword, you want to get a response that's a list of the pages
  • 3:08 - 3:10
    where that keyword appears.
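One simple way to picture that kind of response is a table from each keyword to the pages containing it. The Python sketch below is an assumption-laden illustration rather than the course's design: the names `build_index` and `lookup` and the sample pages are hypothetical, and words are split on whitespace only.

```python
def build_index(pages):
    """Map each lowercase word to the list of page names where it appears."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            urls = index.setdefault(word, [])
            if url not in urls:          # list each page once per keyword
                urls.append(url)
    return index


def lookup(index, keyword):
    """Return the pages where keyword appears, or an empty list."""
    return index.get(keyword.lower(), [])


# Hypothetical corpus: page name -> page text.
pages = {
    "page1": "web crawler basics",
    "page2": "search the web",
}
index = build_index(pages)
print(lookup(index, "web"))      # ['page1', 'page2']
print(lookup(index, "crawler"))  # ['page1']
```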
  • 3:10 - 3:15
    And we'll figure out in unit five a way to do that that scales when we have a large corpus.
  • 3:15 - 3:19
    And then in unit six what we want to do is, well, we don't just want to find a list,
  • 3:19 - 3:21
    we want to find the best one.
  • 3:21 - 3:24
    So we'll figure out how to rank all the pages where that keyword appears.
  • 3:24 - 3:27
    So we're getting a little ahead of ourselves now,
  • 3:27 - 3:30
    because all we're going to do for unit one,
  • 3:30 - 3:32
    is to figure out how to extract a link from the page.
  • 3:32 - 3:35
    And the search engine that we'll build at the end of this
  • 3:35 - 3:37
    will be a functional search engine.
  • 3:37 - 3:40
    It will have the main components that a search engine like Google has.
  • 3:40 - 3:43
    It certainly won't be as powerful as Google is;
  • 3:43 - 3:44
    we want to keep things simple.
  • 3:44 - 3:46
    We want to have a small amount of code to write.
  • 3:46 - 3:48
    And we should remember that our real goal
  • 3:48 - 3:50
    is not as much to build a search engine,
  • 3:50 - 3:52
    but to use the goal of building a search engine as a vehicle
  • 3:52 - 3:55
    for learning about computer science
  • 3:55 - 3:56
    and learning about programming
  • 3:56 - 3:58
    so the things we learn by doing this
  • 3:58 -
    will allow us to solve lots and lots of other problems.
Title:
01-02 Web Crawler
Description:

Professor David Evans gives an overview of the unit in CS 101.
Video Language:
English
Duration:
04:03