Current MS Students/Seyed Saeid Mousavi/CS 590
Jan 5 (week 1)
Jan 12 (week 2)
Thesis
Studying Wikipedia's definition of itself:
Wikipedia as Wikipedia defines it
Researching books/documents/multimedia resources on Amazon for Wikipedia:
Brainstorming on related topics:
* Sources of information: emails and contact numbers, books and online documents
* Hardware and software infrastructure
* Administration of the organization and data
* How the information gets verified
* What was the financial/humanitarian edge of the founders
Class lecture
Preparing a powerpoint document for the lecture:
* The first chapter from the Essential Software Architecture
Essential Software Architecture
Misc
Playing around with my pages on the site.
Jan 19 (week 3)
Thesis
This week, delving into the online documents presented by Wikipedia I reviewed the following topics:
* History
* Content and internal structure
* Software and hardware
* Language editions
* Reliability and bias
* Criticism
* Cultural significance
* Related projects
Here is a gist of some of the topics:
Larry Sanger and Jimmy Wales are the founders of Wikipedia.
This online encyclopedia was formed after Nupedia, another free online encyclopedia, which was edited by a professional team.
It has grown exponentially in the number of languages in which it is offered, as well as in depth and breadth.
Exponential growth of Wikipedia
The operation of Wikipedia depends on MediaWiki, a custom-made, free and open source wiki software platform written in PHP and built upon the MySQL database. Wikipedia employed a single server until 2004, when the server setup was expanded into a distributed multitier architecture.
In January 2005, the project ran on 39 dedicated servers located in Florida. This configuration included a single master database server running MySQL, multiple slave database servers, 21 web servers running the Apache HTTP Server, and seven Squid cache servers. By September 2005, its server cluster had grown to around 100 servers: main servers in Tampa, Florida and the rest in Amsterdam and Seoul.
There are currently 253 language editions of Wikipedia; of these, 15 have over 100,000 articles and 145 have over 1,000 articles.
Wikipedia receives between 10,000 and 35,000 page requests per second, depending on time of day. Page requests are first passed to a front-end layer of Squid caching servers. Requests that cannot be served from the Squid cache are sent to load-balancing servers running the Linux Virtual Server software, which in turn pass the request to one of the Apache web servers for page rendering from the database. The web servers deliver pages as requested, performing page rendering for all the language editions of Wikipedia.
To increase speed further, rendered pages for anonymous users are cached in a distributed memory cache until invalidated, allowing page rendering to be skipped entirely for most common page accesses. Two larger clusters in the Netherlands and Korea now handle much of Wikipedia's traffic load.
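The request path described above (cache first, then a load balancer, then a web server that renders the page) can be sketched as a toy model. This is only an illustration under stated assumptions: the class `TieredFrontend`, the server names, and the round-robin choice are hypothetical stand-ins, not how Squid, LVS, or Apache are actually configured.

```python
from itertools import cycle


class TieredFrontend:
    """Toy model of the cache -> load balancer -> web server path."""

    def __init__(self, web_servers):
        # Dict stands in for the front-end cache layer (hypothetical simplification).
        self.page_cache = {}
        # Round-robin stands in for the load-balancing tier (an assumption).
        self.next_server = cycle(web_servers)
        self.renders = 0

    def render(self, server, title):
        # Stand-in for a web server rendering a page from the database.
        self.renders += 1
        return f"<html>{title} rendered by {server}</html>"

    def get(self, title):
        # Serve from cache when possible, so rendering is skipped entirely.
        if title in self.page_cache:
            return self.page_cache[title]
        server = next(self.next_server)
        page = self.render(server, title)
        self.page_cache[title] = page
        return page

    def invalidate(self, title):
        # Called when a page is edited; the next request re-renders it.
        self.page_cache.pop(title, None)


frontend = TieredFrontend(["apache1", "apache2"])
first = frontend.get("Main_Page")    # cache miss, rendered by apache1
second = frontend.get("Main_Page")   # cache hit, no rendering
frontend.invalidate("Main_Page")
third = frontend.get("Main_Page")    # miss again, rendered by apache2
```

Only two of the three requests reach a web server, which is the point of putting the cache in front.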
Jan 26 (week 4)
Essential Software Architecture
Chapter 1: Understanding Software Architecture
Essential Software Architecture Chapter 1
Chapter 2: Introducing the Case Study
To present the conceptual theories in a more tangible format, the author introduces ICDE as an example.
The Information Capture and Dissemination Environment (ICDE) is part of a suite of software systems for providing intelligent assistance to professionals such as financial analysts, scientific researchers and intelligence analysts.
ICDE automatically captures and stores data that records a range of actions performed by a user when operating a workstation. For example, when a user performs a Google search, the ICDE system will transparently store in a database:
*The search query string
*Copies of the web pages returned by Google that the user displays in their browser
ICDE 1.0 was created to get potential investors to take the concept more seriously.
The business goals of this project can be categorized as:
*Encourage third party tool developers to write applications for the ICDE system.
*Promote the ICDE concept and tools to potential customers, in order to enhance their analytical working environment.
Essential Software Architecture Chapter 2
Essential Software Architecture
Chapter 3: Software Quality Attributes
This chapter discusses software quality attributes as the criteria against which an architecture is measured.
Quality attribute requirements are part of an application’s nonfunctional requirements, which capture the many facets of how the functional requirements of an application are achieved.
Quality Attributes are roughly categorized as:
*Scalability: A scalable solution will permit additional processing capacity to be deployed to increase throughput and decrease response time.
*Security: Understanding the precise security requirements for an application, and devising mechanisms to support them.
*Performance: A performance quality requirement defines a metric that states the amount of work an application must perform in a given time, and/or deadlines that must be met for correct operation.
*Modifiability: The modifiability quality attribute is a measure of how easy it may be to change an application to cater for new functional and non-functional requirements.
*Availability: Failures in applications cause them to be unavailable. Failures impact on an application’s reliability, which is usually measured by the mean time between failures. The length of time any period of unavailability lasts is determined by the amount of time it takes to detect failure and restart the system.
*Integration: Integration is concerned with the ease with which an application can be usefully incorporated into a broader application context.
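To make the availability attribute concrete, here is a small worked example. It uses the standard steady-state formula availability = MTBF / (MTBF + MTTR), which is not spelled out in the chapter summary above; treating the repair time as detection time plus restart time follows the description in the list. The function name and the sample numbers are hypothetical.

```python
def availability(mtbf_hours, detect_hours, restart_hours):
    """Steady-state availability from mean time between failures and repair time.

    Repair time is modeled as time to detect the failure plus time to restart,
    matching the description of unavailability above (an illustrative assumption).
    """
    mttr_hours = detect_hours + restart_hours
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers: a failure every 999 hours, 30 minutes to detect
# it and 30 minutes to restart the system.
example = availability(mtbf_hours=999.0, detect_hours=0.5, restart_hours=0.5)
print(f"{example:.3%}")
```

With these numbers the system is up 999 hours out of every 1000. Shortening detection or restart time raises availability without changing how often failures occur.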
Some other quality attributes:
*Portability
*Testability
*Supportability
Essential Software Architecture Chapter 3
Feb 2 (week 5)
Who operates Wikipedia?
Wikipedians
Wikipedians are the people who write and edit articles for Wikipedia. The number of account holders has grown to over six million (6,323,733 registered user accounts at the time of writing), plus an unknown, but relatively large, number of unregistered contributors.
The Wikimedia Foundation
The Wikimedia Foundation, Inc. is a non-profit charitable organization based in St. Petersburg, Florida, USA, and organized under the laws of the state of Florida. It operates several online collaborative projects including Wikipedia, Wiktionary, Wikiquote, Wikibooks, Wikisource, Wikimedia Commons, Wikispecies, Wikinews, Wikiversity, and Meta-Wiki.
The Foundation's creation was officially announced by Wikipedia co-founder Jimmy Wales, who was running Wikipedia within his company Bomis, on June 20, 2003.
The functions of the Wikimedia Foundation were, for the first few years, executed almost entirely by volunteers. In the Spring of 2005, the Foundation only had two employees, Danny Wool and Brion Vibber. Though the number of employees has grown, the bulk of Foundation work continues to be done by volunteers, with the Foundation having very few employees.
As of October 4, 2006, the Wikimedia Foundation had five paid employees: two programmers (software manager Brion Vibber in California and server administrator Chad Perrin in Tampa); "to answer the phones", administrative assistant Barbara Brown; to handle fundraising and grants, Danny Wool; and to manage, interim executive director Brad Patrick.
Advisory Board
The Advisory Board is an international network of experts who have agreed to give the Foundation meaningful help on a regular basis in many different areas, including law, organizational development, technology, policy, and outreach.
Administrators
Administrators, commonly known as admins and also called sysops (system operators), are Wikipedia editors who have access to technical features that help with maintenance. English Wikipedia practice is to grant administrator status to anyone who has been an active and regular Wikipedia contributor for at least a few months, is familiar with and respects Wikipedia policy, and who has gained the trust of the community. They can protect and delete pages, block other editors, and undo these actions as well. These privileges are granted indefinitely, and are only removed on request or under circumstances involving high-level intervention. Administrators undertake additional responsibilities on a voluntary basis, and they are not employees of the Wikimedia Foundation.
Jimmy Wales, a founder of Wikipedia, describes the administrator's position as follows:
I just wanted to say that becoming a sysop is *not a big deal*.
I think perhaps I'll go through semi-willy-nilly and make a bunch of people who have been around for awhile sysops. I want to dispel the aura of "authority" around the position. It's merely a technical matter that the powers given to sysops are not given out to everyone.
I don't like that there's the apparent feeling here that being granted sysop status is a really special thing.
– Jimbo Wales
Feb 9 (week 6)
This week, I started by focusing on the prospectus and brainstormed my initial ideas and perspectives.
Later on, I read "How and Why Wikipedia Works", an article by Dirk Riehle, the chair of WikiSym 2006. It is an interview with three senior Wikipedia contributors: Angela Beesley, Elisabeth Bauer, and Kizu Naoko.
All three are leading Wikipedia practitioners in the English, German, and Japanese Wikipedias and related projects. The interview focuses on how Wikipedia works and why these three practitioners believe it will keep working. The interview was conducted via email in preparation for WikiSym 2006, the 2006 International Symposium on Wikis, with the goal of furthering Wikipedia research.
Here is a description of each interviewee:
Angela Beesley: In June 2004, I was elected to the board of the Wikimedia Foundation by the community. I was re-elected, for a two year term, in July 2005. I've held various positions on other Wikimedia projects. Wikipedia exists to provide a globally available, free (as in freedom, as well as money), encyclopedic (verifiable and unbiased) resource to everyone in their own language. I subscribe to this goal and I also enjoy working with people who share it with me.
Elisabeth Bauer: I was a board member of Wikimedia Deutschland e.V., the German WMF. With Arne Klempert and Delphine Menard I was a leading organizer of the first Wikimania, WMF’s main annual conference. I also helped set up OTRS, WMF’s help desk. We have multiple languages because we are such a decentralized organization. Projects share a few common norms but everything else is left to the language communities to decide. Different cultures tend to evolve different organizational structures and policies.
Kizu Naoko: I was temporarily a sysop on Wikipedia when I worked as an election officer in the summer of 2005 and helped the board election process. I also have been a member of the communications committee, officially since May 2006. Since September 2005, I’ve been a board-approved editor of the WMF website. We want to create a free (both liberty and gratia) encyclopedia, hence empowering the world intellectually. This goal motivates me personally.
Here is a gist of the interviews:
About roles in Wikipedia, there are readers, editors, administrators, recent changes patrollers (reverting vandalism), policy makers, subject area experts (WikiProjects offers a place for people who want to focus on one topic to have a focused community within the larger Wikipedia community), content maintainers, software developers, system operators and many more. There are also all sorts of informal groups within the project. For example, the welcoming committee is a self selected group of people who say they will help with welcoming new users.
Most people start out as editors or uploaders. The majority stays in that role. After that, though, many different roles are possible. Maybe the most prominent one is the administrator role.
Wikipedia is approaching the quality problem from two sides: From the bottom and from the top. From the bottom, the deletion process simply is used to weed out poor articles. From the top we encourage high quality articles by providing extra recognition for an author’s work as ‘excellent articles’. Such extra recognition by the community as well as the visibility to the general Wikipedia readership gives authors immaterial rewards for their work.
Some challenges that Wikipedians face:
* Legal threats: In particular, libelous edits and copyright infringements. In general a legal conflict can harm a project, even if in the end no real conflict before a court arises between the rights holder and the Wikimedia Foundation. The problem is that being in limbo might prevent further development of content and might be a source of human conflict on the project. But usually, it is nothing that can’t be fixed.
* Keeping integrity as a project: Some Wikipedias, like the English or German one, have many editors who are also involved with global activities like the Commons, Meta, or Foundation wikis. On other Wikipedias, far fewer volunteers like these exist, and bad communication between the local level and the global level might result. This can be a severe problem for the local projects.
* Lack of involvement: A lot of people are needed to keep a project alive! For smaller wikis, a dearth of contributors happens easily. Poor involvement of editors or even inactivity challenges the sustainability of the project. Therefore we need to go back to the first and foremost challenge: To keep the openness of the wikis that makes it easy for people to join.
* Credibility: Young Wikipedias need to build a certain level of credibility. If they fail to establish their credibility or take too long a time, the project might falter.
Wikipedia's administration hierarchy
Feb 16 (week 7)
This week I worked more on the Prospectus, filling out more gaps and trying to wrap my mind around the subject.
I also dug a bit into Wikipedia's software architecture and came up with the following:
Wikipedia started as a Perl CGI script running on a single server in 2001.
Since then the site has grown into a distributed platform containing multiple technologies, all of them open. The principle of openness forced all operations to use free and open-source software only.
With commercial alternatives out of the question, Wikipedia had the challenging task of building an efficient platform from freely available components.
High level picture of this environment contains the following components:
(core components, front to back)
* Linux - operating system (Fedora, Ubuntu)
* PowerDNS - geo-based request distribution
* LVS - used for distributing requests to cache and application servers
* Squid - content acceleration and distribution
* lighttpd - static file serving
* Apache - application HTTP server
* PHP5 - core language
* MediaWiki - main application
* Lucene, Mono - search
* Memcached - various object caching
* MySQL
Load balancing between different servers is handled partly by the hardware and database setup and, more importantly, in the PHP code.
Database queries
All database interaction is optimized around MySQL’s methods of reading the data.
Some of the requirements for every query are obvious; a few of the important ones are:
* Every query must have an appropriate index for reads.
* Paging/offsetting should be done by key positions rather than by result sets.
* No sparse data reads should be done, except for the hottest tables. This means covering indexes for huge tables, allowing reads in different range directions/scopes.
* Queries prone to hit multiversioning troubles have to be rewritten accordingly.
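The rule about paging by key position can be illustrated with a short sketch. It uses Python's built-in `sqlite3` module as a stand-in for MySQL, and the table `page`, its columns, and the helper `fetch_page_after` are hypothetical names chosen for the example. The idea is to remember the last key seen and seek past it through the index, instead of asking the database to skip rows with a growing `OFFSET`.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO page (page_id, title) VALUES (?, ?)",
    [(i, f"Article_{i}") for i in range(1, 11)],
)


def fetch_page_after(conn, last_seen_id, page_size):
    """Return the next rows after last_seen_id, seeking by key instead of OFFSET."""
    cur = conn.execute(
        "SELECT page_id, title FROM page WHERE page_id > ? ORDER BY page_id LIMIT ?",
        (last_seen_id, page_size),
    )
    return cur.fetchall()


first_batch = fetch_page_after(conn, last_seen_id=0, page_size=4)
# The caller passes the last key of the previous batch to get the next one.
second_batch = fetch_page_after(conn, last_seen_id=first_batch[-1][0], page_size=4)
```

Each call does an index seek followed by a short range read, so fetching a deep page costs about the same as fetching the first.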
Splitting (partitioning of data)
Multiple database servers allow splitting in many different patterns:
* Move read load to slaves. At moderate or low write rates this always helps, and scaling read operations becomes especially cheap. If replication positions are accounted for in the application, the inconsistencies are barely noticed.
* Partitioning by data segments
* Partitioning by tasks
* Partitioning by time
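The first pattern, moving reads to replicas while accounting for replication positions, can be sketched with a toy in-memory model. Everything here is hypothetical (the classes `Replica` and `ReplicatedStore`, the integer positions, the replica names) and it does not reflect MediaWiki's actual load-balancer code. It only shows the idea: a reader that knows the position of its last write reads only from a replica that has caught up, and falls back to the master otherwise.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.position = 0  # position of the last replicated write applied


class ReplicatedStore:
    """Toy master with replicas; positions are simple write counters."""

    def __init__(self, replica_names):
        self.master_data = {}
        self.master_position = 0
        self.log = []  # (position, key, value) entries, oldest first
        self.replicas = [Replica(name) for name in replica_names]

    def write(self, key, value):
        # All writes go to the master and are appended to the replication log.
        self.master_position += 1
        self.master_data[key] = value
        self.log.append((self.master_position, key, value))
        return self.master_position

    def replicate(self, replica, up_to):
        # Apply log entries the replica has not seen yet, up to a position.
        for position, key, value in self.log:
            if replica.position < position <= up_to:
                replica.data[key] = value
                replica.position = position

    def read(self, key, min_position=0):
        # Prefer a replica that has reached the position the caller needs.
        for replica in self.replicas:
            if replica.position >= min_position:
                return replica.name, replica.data.get(key)
        return "master", self.master_data.get(key)


store = ReplicatedStore(["replica1", "replica2"])
pos = store.write("Main_Page", "rev 1")
store.replicate(store.replicas[1], up_to=pos)  # only replica2 has caught up

fresh = store.read("Main_Page", min_position=pos)  # skips the lagging replica
stale = store.read("Main_Page")                    # ignores positions, may be stale
```

Passing `min_position` is what keeps a user from seeing a page without the edit they just saved, while ordinary readers can still be served by any replica.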
Compression
One of the easy ways to save resources is to use compression wherever data compresses well and takes more than a kilobyte. Of course, sending compressed information over the internet makes page loads faster, but even for internal communications, having various blobs compressed helps in:
* Less fragmentation (less data is split off from a row into other blocks in the database)
* Fewer bytes to send over the network
* Less data for the DB server to work on, hence better memory/CPU efficiency
* Less data for the DB server to store
All of this comes from the really fast zlib (gzip) library.
It takes a fraction of a millisecond to compress ten kilobytes of text, resulting in an object a few times smaller to keep around.
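The compression rule above can be tried directly with Python's standard `zlib` module. This is a minimal sketch: the helper `maybe_compress` and its one-kilobyte threshold are assumptions made to mirror the "more than a kilobyte" rule, and the sample text is artificially repetitive, so it compresses far better than real article text would.

```python
import zlib


def maybe_compress(blob, threshold=1024):
    """Compress blobs larger than the threshold; return (is_compressed, data)."""
    if len(blob) <= threshold:
        return False, blob
    packed = zlib.compress(blob)
    # Keep the original if compression does not actually save space.
    if len(packed) >= len(blob):
        return False, blob
    return True, packed


# Ten kilobytes of (very repetitive) sample text.
sentence = "Wikipedia is a free encyclopedia that anyone can edit. "
text = (sentence * 200).encode("utf-8")[: 10 * 1024]

is_compressed, stored = maybe_compress(text)
restored = zlib.decompress(stored) if is_compressed else stored

print(len(text), "bytes before,", len(stored), "bytes after")
```

The flag returned alongside the data is needed because small or incompressible blobs are stored as-is, so a reader has to know whether to decompress.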
Feb 23 (week 8)
Excerpts from
By: Jimmy Wales, President, Wikimedia Foundation, Wikipedia Founder
What is the Wikimedia Foundation?
* Non-profit foundation
* Aims to distribute a free encyclopedia to every single person on the planet in their own language
* Wikipedia and its sister projects
* Funded by public donations
* Applying for grants
* Wikipedia is a freely licensed encyclopedia written by thousands of volunteers in many languages
* Free license allows others to freely copy, redistribute, and modify our work commercially or non-commercially
* Founded January 15, 2001
Advantages of Freely Licensed Content
* GNU Free Documentation Licence
* Allows authors to retain attribution
* Remains non-proprietary
* Enhances the popularity of Wikipedia
* Decreases individual sense of ownership
* Increases a sense of shared ownership
Wikimedia Projects
* Wikipedia
* Wiktionary
* Wikibooks
* Wikisource
* Wikiquote
* Wikispecies
* Wikimedia Commons
* Wikinews
MediaWiki
* MediaWiki is one of many wiki engines
* Collaborative software that allows users to add or edit content
* Primarily developed for Wikipedia from 2002 onwards
* Scalable and multilingual
* Free license
MediaWiki features
* Quality control features (versioning)
* Editing features (simple markup)
* Community features (talk pages, profiles, access levels)
How can such a large community scale?
* Through software features
* Through policy (mediation, arbitration)
* Through an atmosphere of love and respect
Mar 1 (week 9)
This week I spent time revising and enhancing my prospectus.
There was also a nice link submitted by Professor Abbott on the topic of who has really made Wikipedia (and also Digg and Helium) what it is.
Is it the wisdom of the crowd, or an elite group of managers (admins) who are pulling the strings behind the scenes?
The question the author is trying to answer is whether the power of the minority of admins, who supervise other users' activities and input, changes the democratic nature of the community.
Apparently it does not. Members of this small group are selected based on their contribution to the growth of the site, they are ordinary users who work their way up the administration ladder rather than paid employees, and they perform their duties according to the policies and procedures set forth by the community. This changes the perspective from an authoritative system to a more liberal community.
Digg is a nice (almost) copycat of Wikipedia where people submit their stories and the best stories get to be on the front page.
Another nice website modeled after Wikipedia is Helium.com, a repository of articles and editorials. Its founder compares his site to a capitalist version of Wikipedia. On Helium, contributors compete to have the top-ranked article on a given subject.
Mar 8 (week 10)
This week I read an article on Wikipedia's internals.
Nice topic and well covered. Here are some highlights:
As the application tends to be the most resource-hungry part of the system, every component is built to be semi-independent from it, so that less interference happens between tiers when a request is served. The most distinct separation is media serving, which can happen without accessing any PHP/Apache code segments. Other services, like search, still have to be served by the application.
Wikipedia started as a Perl CGI script running on a single server in 2001. Since then the site has grown into a distributed platform containing multiple technologies, all of them open. The principle of openness forced all operations to use free and open-source software only. With commercial alternatives out of the question, Wikipedia had the challenging task of building an efficient platform from freely available components.
Database Servers
Though Wikipedia currently uses 16GB-class machines, it is still considered a scale-out shop.
The main ideology in operating database servers is RAIS:
Redundant Array of Inexpensive/Independent/Instable Servers
RAID0. Seems to provide additional performance/space.
Database queries
All database interaction is optimized around MySQL’s methods of reading the data.
Some of the requirements for every query are:
• Every query must have appropriate index for reads
• Every query result should be sorted by index, not by filesorts.
• No sparse data reads should be done, except for the hottest tables. This means covering indexes for huge tables, allowing reads in different range directions/scopes. Narrow tables (like m:n relation cross-indexes) usually have covering indexes in both directions.
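As a rough illustration of covering indexes in both directions on a narrow m:n table, the sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL. The table `links`, its columns, and the index names are hypothetical. SQLite's `EXPLAIN QUERY PLAN` reports a covering index when a query can be answered from the index alone and mentions a temporary B-tree when it has to sort separately. In MySQL the comparable signals would be "Using index" and "Using filesort" in the `EXPLAIN` output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A narrow m:n relation table with one index per lookup direction.
conn.execute("CREATE TABLE links (from_id INTEGER NOT NULL, to_id INTEGER NOT NULL)")
conn.execute("CREATE INDEX links_from_to ON links (from_id, to_id)")
conn.execute("CREATE INDEX links_to_from ON links (to_id, from_id)")
conn.executemany("INSERT INTO links VALUES (?, ?)", [(1, 2), (1, 3), (2, 3), (3, 1)])


def query_plan(conn, sql, params):
    """Return SQLite's query plan for a statement as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the plan text


forward_sql = "SELECT to_id FROM links WHERE from_id = ? ORDER BY to_id"
reverse_sql = "SELECT from_id FROM links WHERE to_id = ? ORDER BY from_id"

forward_plan = query_plan(conn, forward_sql, (1,))
reverse_plan = query_plan(conn, reverse_sql, (3,))
forward_rows = conn.execute(forward_sql, (1,)).fetchall()
reverse_rows = conn.execute(reverse_sql, (3,)).fetchall()

print(forward_plan)
print(reverse_plan)
```

Because each index already stores its entries in the order the query asks for, both lookups are answered from the index alone, with no table reads and no separate sort step.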
LVS: Load balancer
LVS (Linux Virtual Server) is a critical component, which allows efficient load balancing between Application and Search.
The key advantages over the competition are:
• Kernel-level
• Director gets just inbound traffic, outbound traffic is sent directly from processing node with proper source IP
• Efficient
Week 11
More than a third of American adult internet users (36%) consult the citizen-generated online encyclopedia Wikipedia, according to a new nationwide survey by the Pew Internet & American Life Project.
In addition, young adults and broadband users have been among those who are earlier adopters of Wikipedia. While 44% of those ages 18-29 use Wikipedia to look for information, just 29% of users age 50 and older consult the site.
Hitwise data suggest several reasons for the popularity of Wikipedia:
First, there is the sheer amount of material on the site, covering everything from ancient history to current events and popular culture.
Second, Wikipedia's dramatic growth is strongly correlated with Americans’ affection for search engines. Wikipedia’s article structure helps explain this. Many of the pieces in the encyclopedia are full of links to other Wikipedia articles and other material on the Web. One of the prime factors in Google's search results algorithm is the number of links connected to a given webpage.
The paper aims at being the first detailed study about user behavior on the Wikipedia, and on how users of the system create and maintain information appearing on its pages. As the authors tell us:
"This paper tries to model the behavior of users contributing to Wikipedia (hereafter called contributors) as a way of understanding its evolution over time. It presents what we believe to be the first extensive effort in that direction. This understanding will allow us, in the future, to create a model for Wikipedia evolution that will be able to show its trends and possible effects of changes in the way it is managed."
A few of the findings about user interactions, from the paper:
1. The number of articles on the Wikipedia has been growing at an exponential rate since it started, but the number of articles from each contributor has decreased over time.
2. Most users tend to revise existing articles rather than creating new ones.
3. Most users tend to focus their attentions upon a single main article.
There are also some interesting numbers coming out of the study (which uses Wikipedia data from October, 2006). Here are a few of those:
1) The number of links in Wikipedia is 58.9 million, an average of 45 links per article.
2) The number of broken links is 6.5 million (I wonder if that includes “citation needed” links that appear in some articles.)
3) The number of internal redirects is 6.8 million.
4) The number of revisions listed in the study’s data is 48.2 million.

