WEBVTT

00:00:04.285 --> 00:00:09.285
Technical evolution of Wikipedia by Brion Vibber - Former CTO of Wikipedia

00:00:09.285 --> 00:00:11.540
Good evening, everyone.

00:00:11.540 --> 00:00:15.417
It's my pleasure to present to you Brion Vibber.

00:00:15.417 --> 00:00:23.475
For years, he worked at the Wikimedia Foundation as its chief technical officer,

00:00:23.475 --> 00:00:31.236
and I'm very happy that Lu [the German translator] could come. He runs Esperantoland.org.

00:00:31.236 --> 00:00:35.167
I'll give the floor to Brion.

00:00:35.167 --> 00:00:52.477
Wikipedia: is there anyone who doesn't know about Wikipedia?

00:00:52.477 --> 00:00:57.167
So, a bit of Wikipedia history,

00:00:57.167 --> 00:01:04.583
and mostly about the technical aspect of multilingual support.

00:01:04.583 --> 00:01:14.417
Originally, when Wikipedia was founded, it was only in English.

00:01:14.417 --> 00:01:22.042
Now, something which is nice and easy about English is that it doesn't have any accented characters.

00:01:22.042 --> 00:01:30.417
But of course, an important problem with that is that American programmers, like myself,

00:01:30.417 --> 00:01:38.042
don't know much about problems concerning letters and writing systems in other languages,

00:01:38.042 --> 00:01:48.042
and because of that, a lot of software and websites don't handle languages well,

00:01:48.042 --> 00:01:54.250
except those from Western Europe.

00:01:54.250 --> 00:01:59.250
Many are interested in supporting other languages,

00:01:59.250 --> 00:02:07.524
so we can take all human knowledge to everyone on earth.

00:02:07.524 --> 00:02:16.375
So, it would be good to support other languages, but that didn't work well at first.

00:02:16.375 --> 00:02:22.333
I set up many websites for Wikipedia in many languages,

00:02:22.333 --> 00:02:25.167
several dozen languages, in fact.

00:02:25.167 --> 00:02:33.417
But many of them were totally messed up, for example the Japanese Wikipedia.

00:02:33.417 --> 00:02:40.917
Now, it can be written well. It has many characters.

00:02:40.917 --> 00:02:47.000
It looks good and one can read and write. Everything works well now.

00:02:47.000 --> 00:02:51.875
But originally, it looked very similar to that.

00:02:51.875 --> 00:03:01.833
It remained an important problem for many languages: Japanese, Chinese, Russian, Hebrew, etc.

00:03:01.833 --> 00:03:05.875
Many of them didn't work at all.

00:03:05.875 --> 00:03:19.569
At one time, the Polish community even set up its own website for its wiki,

00:03:19.569 --> 00:03:28.042
which supported Eastern European letters well, but still not Japanese or Russian, etc.

00:03:28.042 --> 00:03:35.737
At that time, I started to get to know Wikipedia through Esperanto.

00:03:35.737 --> 00:03:44.769
I was a university student and I learned French in a regular course.

00:03:44.769 --> 00:03:55.042
But I became interested in other languages, and I taught myself Esperanto online and through books, etc.

00:03:55.042 --> 00:04:00.208
On my computer and online, it was very interesting,

00:04:00.208 --> 00:04:08.292
and the Esperanto Wikipedia was founded by our dear Chuck.

00:04:08.292 --> 00:04:20.947
At that time, we still had a messed-up character set, well, for Esperanto, which has accented characters.

00:04:23.716 --> 00:04:31.000
That way it has an accent on the letters.

00:04:31.000 --> 00:04:40.271
But for it to be written on the web page, you had to write using the "x system".

00:04:40.271 --> 00:04:50.482
So, "cx" changes to "ĉ", etc. It looks really ugly.

00:04:50.482 --> 00:05:07.479
To make it more beautiful and show it the way it should be, I added Unicode support.

00:05:07.479 --> 00:05:18.542
Unicode is a system to encode characters for every language in the world:

00:05:18.542 --> 00:05:28.273
from Egyptian hieroglyphs to modern Japanese and Korean, as well as many symbols,

00:05:28.273 --> 00:05:35.375
in one system, which can include all of them.

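The "x system" conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the converter that was added to Wikipedia's software: the names `X_SYSTEM` and `x_to_unicode` are hypothetical, and the sketch naively converts every matching letter followed by "x", which a real converter would probably need to refine.

```python
import re

# Esperanto base letters mapped to their accented forms.
# In the x system, the accented letter is typed as the base letter plus "x".
X_SYSTEM = {
    "c": "ĉ", "g": "ĝ", "h": "ĥ", "j": "ĵ", "s": "ŝ", "u": "ŭ",
    "C": "Ĉ", "G": "Ĝ", "H": "Ĥ", "J": "Ĵ", "S": "Ŝ", "U": "Ŭ",
}

# One of the six base letters (either case) followed by "x" or "X".
_DIGRAPH = re.compile(r"([cghjsuCGHJSU])[xX]")


def x_to_unicode(text: str) -> str:
    """Replace x-system digraphs such as "cx" with Unicode letters such as "ĉ"."""
    return _DIGRAPH.sub(lambda match: X_SYSTEM[match.group(1)], text)


print(x_to_unicode("ehxosxangxo cxiujxauxde"))  # prints: eĥoŝanĝo ĉiuĵaŭde
```

Because the result is ordinary Unicode text, the same page can hold Esperanto, Japanese, or Russian without a separate character set for each.
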
00:05:35.375 --> 00:05:44.793
That way, we don't need a separate Polish system for Eastern Europe,

00:05:44.793 --> 00:05:48.667
and a French one for Western Europe, etc.

00:05:48.667 --> 00:05:58.193
We can have just one system, one program, one website for every language.

00:05:58.193 --> 00:06:12.875
That worldwide system was started in computing already 20 years ago,

00:06:12.875 --> 00:06:23.417
but in 2001 or 2002, when we founded Wikipedia, Unicode was still "new" online,

00:06:23.417 --> 00:06:30.327
so it was difficult to use it in "American" programs.

00:06:30.327 --> 00:06:42.208
You had to kind of study how UTF-8 works to put Unicode in a web page.

00:06:42.208 --> 00:06:52.851
But I was able to study it a bit, and I added support to Wikipedia's original software.

00:06:52.851 --> 00:07:04.208
I gave it a converter from the "sx" to the correct "ŝ", etc.

00:07:04.208 --> 00:07:10.167
But I found that it's not just for Esperanto.

00:07:10.167 --> 00:07:14.125
It can work for other languages as well.

00:07:14.125 --> 00:07:21.504
For example, Russian, Japanese and Polish can work with Unicode.

00:07:21.504 --> 00:07:27.750
Unfortunately, it was a bit more complicated,

00:07:27.750 --> 00:07:40.917
because at that time we also upgraded to new Wikipedia software, which was completely different from the original.

00:07:40.917 --> 00:07:47.792
It was better, but it still didn't support Unicode.

00:07:47.792 --> 00:07:56.822
Of course, it was created by Western Europeans and Americans,

00:07:56.822 --> 00:08:04.583
and it didn't know there were languages outside Western Europe and North America

00:08:04.583 --> 00:08:08.060
which have other letters.

00:08:08.060 --> 00:08:16.875
So, that's why it was necessary to add Unicode support three times.

00:08:16.875 --> 00:08:20.808
Originally for the Esperanto Wikipedia.

00:08:20.808 --> 00:08:33.958
The second time for the new system, which was originally created for the English Wikipedia and didn't need Unicode.

00:08:33.958 --> 00:08:43.833
And again when we completely changed the software to speed it up,

00:08:43.833 --> 00:08:49.000
but with that third time, it was finished.

00:08:49.000 --> 00:08:59.333
In 2002 and 2003, we tried to start new Wikipedias in many languages.

00:08:59.333 --> 00:09:12.583
We brought Polish back in and were able to better unite it with the other languages.

00:09:12.583 --> 00:09:21.667
For example, one language can link to a page about the same thing in another language.

00:09:21.667 --> 00:09:29.458
Now, with the same system for everything, one can do that.

00:09:29.458 --> 00:09:38.833
It's better to combine the groups in their own language.

00:09:38.833 --> 00:09:45.750
Similarly, there were other problems for the languages in the program online.

00:09:45.750 --> 00:09:53.667
It was somewhat problematic that traditional American programmers,

00:09:53.667 --> 00:10:06.157
and often even Western Europeans, created their programs only in English.

00:10:06.157 --> 00:10:15.750
It was a problem when someone didn't know English or didn't know it well

00:10:15.750 --> 00:10:20.958
or just wanted to use a system in their own language.

00:10:20.958 --> 00:10:31.750
Because of that, we also had to add a system to translate messages from the websites,

00:10:31.750 --> 00:10:37.958
so everyone can understand them in their own language.

00:10:37.958 --> 00:10:52.417
For example, we can see ... Article, Discussion, History, Delete

00:10:52.417 --> 00:10:57.042
"Article", "Talk", "Edit", "History", only in English.

00:10:57.042 --> 00:11:01.458
It's not very good, though generally one understands English.

00:11:01.458 --> 00:11:23.375
So, we created a map between the messages and a short description about every message.

00:11:23.375 --> 00:11:41.382
When we have something larger, long messages, and there are sentences and paragraphs, etc.,

00:11:41.382 --> 00:11:46.141
it's a bit more complicated than simple words.

00:11:46.141 --> 00:11:54.385
That's why we give a name to every message.

00:11:54.385 --> 00:12:10.384
The program doesn't contain an English sentence; it just has a name, such as "login-message" or the like.

00:12:10.384 --> 00:12:21.552
A file with the map for each individual language holds the name and the message.

00:12:21.552 --> 00:12:25.732
The message can be translated into every language.

00:12:25.732 --> 00:12:36.000
Similar systems are used in many programs of various kinds,

00:12:36.000 --> 00:12:50.080
but what is most different about the Wikipedia system is that one can also change that message.

00:12:50.080 --> 00:13:13.565
If I want to change that sentence a bit, so that my Wikipedia can have a standard or rule

00:13:13.565 --> 00:13:24.000
for how one writes an article or chooses administrators, etc.,

00:13:24.000 --> 00:13:30.632
it can be different in the Wikipedia system.

00:13:30.632 --> 00:13:36.344
One can ... I'm not logged in, so I can't ...

00:13:36.344 --> 00:13:48.583
but the website administrators can use the wiki to change their own messages.

00:13:48.583 --> 00:13:54.375
[Unfortunately, my camera then stopped working.]
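
The named-message scheme from the talk can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Wikipedia's actual code: `MESSAGES`, `SITE_OVERRIDES` and `get_message` are hypothetical names, the sample texts are invented, and the fallback to English for a missing translation is an assumption the talk does not mention.

```python
# Hypothetical per-language maps: message name -> translated text.
MESSAGES = {
    "en": {"login-message": "Please log in.", "history": "History"},
    "eo": {"login-message": "Bonvolu ensaluti.", "history": "Historio"},
}

# Hypothetical overrides that a wiki's administrators have edited on their own site.
SITE_OVERRIDES = {
    "eo": {"login-message": "Bonvolu ensaluti por redakti."},
}


def get_message(name: str, lang: str, fallback: str = "en") -> str:
    """Look up a message by name: site override, then language map, then fallback."""
    candidates = (
        SITE_OVERRIDES.get(lang, {}),  # text changed by the site's administrators
        MESSAGES.get(lang, {}),        # standard translation for the language
        MESSAGES[fallback],            # assumed fallback when no translation exists
    )
    for table in candidates:
        if name in table:
            return table[name]
    return f"<{name}>"  # make a missing message visible instead of failing


print(get_message("login-message", "eo"))  # prints: Bonvolu ensaluti por redakti.
print(get_message("history", "eo"))        # prints: Historio
```

The program itself only ever refers to the name, here "login-message", so translators and administrators can change the text without touching the code.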