1
00:00:04,285 --> 00:00:09,285
Technical evolution of Wikipedia by Brion Vibber - Former CTO of Wikipedia

2
00:00:09,285 --> 00:00:11,540
Good evening, everyone.

3
00:00:11,540 --> 00:00:15,417
It's my pleasure to present to you Brion Vibber.

4
00:00:15,417 --> 00:00:23,475
For years, he worked at the Wikimedia Foundation as its chief technical officer,

5
00:00:23,475 --> 00:00:31,236
and I'm very happy that Lu [the German translator] could come. He runs Esperantoland.org.

6
00:00:31,236 --> 00:00:35,167
I'll give the floor to Brion.

7
00:00:35,167 --> 00:00:52,477
Wikipedia: is there anyone who doesn't know about Wikipedia?

8
00:00:52,477 --> 00:00:57,167
So, a bit of Wikipedia history

9
00:00:57,167 --> 00:01:04,583
and mostly about the technical aspect of multilingual support.

10
00:01:04,583 --> 00:01:14,417
Originally, when Wikipedia was founded, it was only in English.

11
00:01:14,417 --> 00:01:22,042
Now something which is nice and easy about English is that it doesn't have any accented characters.

12
00:01:22,042 --> 00:01:30,417
But of course, an important problem with that is that American programmers, like myself,

13
00:01:30,417 --> 00:01:38,042
don't know much about problems concerning letters and writing systems in other languages,

14
00:01:38,042 --> 00:01:48,042
and because of that, a lot of software and websites don't handle languages well,

15
00:01:48,042 --> 00:01:54,250
except those from Western Europe.

16
00:01:54,250 --> 00:01:59,250
Many are interested in supporting other languages,

17
00:01:59,250 --> 00:02:07,524
so we can take all human knowledge to everyone on earth.

18
00:02:07,524 --> 00:02:16,375
So, it would be good to support other languages, but that didn't work well at first.

19
00:02:16,375 --> 00:02:22,333
I set up many websites for Wikipedia in many languages,

20
00:02:22,333 --> 00:02:25,167
several dozen languages, in fact.

21
00:02:25,167 --> 00:02:33,417
But many of them were totally messed up, for example the Japanese Wikipedia.

22
00:02:33,417 --> 00:02:40,917
Now, it can be written well. It has many characters.

23
00:02:40,917 --> 00:02:47,000
It looks good and one can read and write. Everything works well now.

24
00:02:47,000 --> 00:02:51,875
But originally, it looked very similar to that.

25
00:02:51,875 --> 00:03:01,833
It remained an important problem for many languages: Japanese, Chinese, Russian, Hebrew, etc.

26
00:03:01,833 --> 00:03:05,875
Many of them didn't work at all.

27
00:03:05,875 --> 00:03:19,569
At one time, the Polish community even set up its own website for its wiki,

28
00:03:19,569 --> 00:03:28,042
which supported Eastern European letters well, but still not Japanese or Russian, etc.

29
00:03:28,042 --> 00:03:35,737
At that time, I started to get to know Wikipedia through Esperanto.

30
00:03:35,737 --> 00:03:44,769
I was a university student and I learned French in a regular course.

31
00:03:44,769 --> 00:03:55,042
But I became interested in other languages, and I taught myself Esperanto online and through books, etc.

32
00:03:55,042 --> 00:04:00,208
On my computer and online, it was very interesting,

33
00:04:00,208 --> 00:04:08,292
and the Esperanto Wikipedia was founded by our dear Chuck.

34
00:04:08,292 --> 00:04:20,947
At that time, we still had a messed-up character set, well, for Esperanto, which has accented characters.

35
00:04:23,716 --> 00:04:31,000
That way it has an accent on the letters.

36
00:04:31,000 --> 00:04:40,271
But to write it on the web page, you had to use the "x system".

37
00:04:40,271 --> 00:04:50,482
So, "cx" stands for "ĉ", etc. It looks really ugly.

38
00:04:50,482 --> 00:05:07,479
To make it more beautiful and show it the way it should be, I added Unicode support.

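The "x system" described above, where "cx" is typed in place of "ĉ", can be turned back into proper Unicode letters with a small lookup table. The following is a minimal sketch, not the converter that was added to Wikipedia's software: the table `X_SYSTEM` and the function `x_to_unicode` are hypothetical names chosen for illustration, and this naive version would also convert a genuine letter pair such as "ux" in a foreign name.

```python
import re

# Hypothetical table for the Esperanto "x system": each accented letter
# is typed as its base letter followed by "x".
X_SYSTEM = {
    "cx": "ĉ", "gx": "ĝ", "hx": "ĥ",
    "jx": "ĵ", "sx": "ŝ", "ux": "ŭ",
}

# Matches any of the six digraphs, in either case.
_DIGRAPH = re.compile(r"[cghjsu]x", re.IGNORECASE)


def x_to_unicode(text: str) -> str:
    """Replace x-system digraphs with the accented Unicode letters."""
    def replace(match: re.Match) -> str:
        digraph = match.group(0)
        letter = X_SYSTEM[digraph.lower()]
        # Keep the capitalisation of the base letter ("Cx" becomes "Ĉ").
        return letter.upper() if digraph[0].isupper() else letter

    return _DIGRAPH.sub(replace, text)


if __name__ == "__main__":
    print(x_to_unicode("ehxosxangxo cxiujxauxde"))  # eĥoŝanĝo ĉiuĵaŭde
    # In UTF-8, each of these accented letters takes two bytes.
    print("ĉ".encode("utf-8"))  # b'\xc4\x89'
```
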
39
00:05:07,479 --> 00:05:18,542
Unicode is a system to encode characters for every language in the world:

40
00:05:18,542 --> 00:05:28,273
from Egyptian hieroglyphics to modern Japanese and Korean as well as many symbols

41
00:05:28,273 --> 00:05:35,375
in one system, which can include all of them.

42
00:05:35,375 --> 00:05:44,793
So we don't need a separate Polish system for Eastern Europe,

43
00:05:44,793 --> 00:05:48,667
and French for Western Europe, etc.

44
00:05:48,667 --> 00:05:58,193
We can have just one system, one program, one website for every language.

45
00:05:58,193 --> 00:06:12,875
That worldwide system was already started on computers 20 years ago,

46
00:06:12,875 --> 00:06:23,417
but in 2001 or 2002, when we founded Wikipedia, Unicode was still "new" online,

47
00:06:23,417 --> 00:06:30,327
so it was difficult to use it in "American" programs.

48
00:06:30,327 --> 00:06:42,208
You had to kind of study how UTF-8 works to put Unicode in a web page.

49
00:06:42,208 --> 00:06:52,851
But I was able to study it a bit, and I added support to Wikipedia's original software.

50
00:06:52,851 --> 00:07:04,208
I gave it a converter from "sx" to the correct "ŝ", etc.

51
00:07:04,208 --> 00:07:10,167
But I found that it's not just for Esperanto.

52
00:07:10,167 --> 00:07:14,125
It can work for other languages as well.

53
00:07:14,125 --> 00:07:21,504
For example, Russian, Japanese and Polish can work with Unicode.

54
00:07:21,504 --> 00:07:27,750
Unfortunately, it was a bit more complicated,

55
00:07:27,750 --> 00:07:40,917
because at that time we also upgraded to new Wikipedia software, which was completely different from the original.

56
00:07:40,917 --> 00:07:47,792
It was better, but it still didn't support Unicode.

57
00:07:47,792 --> 00:07:56,822
Of course, it was created by Western Europeans and Americans,

58
00:07:56,822 --> 00:08:04,583
and it didn't know there were languages outside Western Europe and North America

59
00:08:04,583 --> 00:08:08,060
which have other letters.

60
00:08:08,060 --> 00:08:16,875
So, that's why it was necessary to add Unicode support three times.

61
00:08:16,875 --> 00:08:20,808
Originally for the Esperanto Wikipedia.

62
00:08:20,808 --> 00:08:33,958
The second time for the new system, which was originally created for the English Wikipedia and didn't need Unicode.

63
00:08:33,958 --> 00:08:43,833
And again when we completely changed the software to speed it up,

64
00:08:43,833 --> 00:08:49,000
but with that third time it was finally done.

65
00:08:49,000 --> 00:08:59,333
In 2002 and 2003, we tried to start new Wikipedias in many languages.

66
00:08:59,333 --> 00:09:12,583
We brought the Polish wiki back and were able to better unite it with the other languages.

67
00:09:12,583 --> 00:09:21,667
For example, a page in one language can link to a page about the same thing in another language.

68
00:09:21,667 --> 00:09:29,458
Now with the same system for everything, one can do that.

69
00:09:29,458 --> 00:09:38,833
It's better to combine the groups in their own language.

70
00:09:38,833 --> 00:09:45,750
Similarly, there were other language problems in the online software.

71
00:09:45,750 --> 00:09:53,667
It was somewhat problematic that traditional American programmers

72
00:09:53,667 --> 00:10:06,157
and often even Western Europeans created their programs only in English.

73
00:10:06,157 --> 00:10:15,750
It was a problem when someone didn't know English or didn't know it well

74
00:10:15,750 --> 00:10:20,958
or just wanted to use a system in their own language.

75
00:10:20,958 --> 00:10:31,750
Because of that, we also had to add a system to translate messages from the websites,

76
00:10:31,750 --> 00:10:37,958
so everyone can understand it in their own language.

77
00:10:37,958 --> 00:10:52,417
For example, we can see ... Article, Discussion, History, Delete

78
00:10:52,417 --> 00:10:57,042
"Article", "Talk", "Edit", "History", only in English.

79
00:10:57,042 --> 00:11:01,458
It's not very good, though generally one understands English.

80
00:11:01,458 --> 00:11:23,375
So, we created a map between the messages and a short description of every message.

81
00:11:23,375 --> 00:11:41,382
When we have something larger, long messages with sentences and paragraphs, etc.,

82
00:11:41,382 --> 00:11:46,141
it's a bit more complicated than simple words.

83
00:11:46,141 --> 00:11:54,385
That's why we give a name to every message.

84
00:11:54,385 --> 00:12:10,384
The program doesn't contain an English sentence; it just has a name such as "login-message" or the like.

85
00:12:10,384 --> 00:12:21,552
A file with the map for each individual language holds the name and the message.

86
00:12:21,552 --> 00:12:25,732
The message can be translated into every language.

87
00:12:25,732 --> 00:12:36,000
Similar systems are used in many programs of various kinds,

88
00:12:36,000 --> 00:12:50,080
but what is most different about the Wikipedia system is that one can also change that message.

89
00:12:50,080 --> 00:13:13,565
If I want to change that sentence a bit, so that my Wikipedia can have a standard or rule

90
00:13:13,565 --> 00:13:24,000
about how one writes an article or chooses administrators, etc.,

91
00:13:24,000 --> 00:13:30,632
it can be different in the Wikipedia system.

92
00:13:30,632 --> 00:13:36,344
One can ... I'm not logged in, so I can't ...

93
00:13:36,344 --> 00:13:48,583
but the website administrators can use the wiki to change their own messages.

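The message system described above, with a name for every message, one map per language, and changes made by each wiki's administrators, can be sketched as a layered lookup. This is only an illustration under stated assumptions: the tables `MESSAGES` and `LOCAL_OVERRIDES`, the function `get_message`, the sample translations, and the fallback to English for untranslated messages are all hypothetical and are not taken from Wikipedia's actual software.

```python
# Hypothetical per-language maps from message name to message text.
MESSAGES = {
    "en": {"login-message": "Please log in.", "history": "History"},
    "eo": {"login-message": "Bonvolu ensaluti.", "history": "Historio"},
}

# Hypothetical changes that a wiki's administrators saved through the wiki.
LOCAL_OVERRIDES = {
    "eo": {"login-message": "Bonvolu ensaluti por redakti."},
}


def get_message(name: str, lang: str, fallback: str = "en") -> str:
    """Look up a message by name: local change, then translation, then fallback."""
    layers = (
        LOCAL_OVERRIDES.get(lang, {}),  # text changed by the wiki's administrators
        MESSAGES.get(lang, {}),         # translation shipped with the software
        MESSAGES.get(fallback, {}),     # assumed fallback when no translation exists
    )
    for layer in layers:
        if name in layer:
            return layer[name]
    # Nothing found: show the bare name so the gap is visible.
    return f"<{name}>"


if __name__ == "__main__":
    print(get_message("history", "eo"))        # Historio
    print(get_message("login-message", "eo"))  # Bonvolu ensaluti por redakti.
    print(get_message("history", "de"))        # History
```

Because the program only ever refers to the name, changing or translating a message never requires touching the code that displays it.
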
94
00:13:48,583 --> 00:13:54,375
[Unfortunately then, my camera stopped working.]