NICK WELLS
Paper A
This study finds that most
new articles are created after a reference to them is first entered. This type
of growth is interesting because Wikipedia is expanding its coverage in a
"breadth-first traversal." When a new article is created, Wikipedia
includes links to nonexistant articles which help spur their creation by
suggesting them to future authors.
I think that this is an
interesting study of Wikipedia. It would be interesting
to see if a similar
conclusion could be reached regarding other data-driven
websites.
Paper B
This paper observes data from Wikipedia, Essembly, Bugzilla and Digg in order to form theories about their dynamic growth. The authors examine the power law as a fit for user participation and find it to be a good fit. They examine the exponent (alpha) and relate it to the effort required for participation, finding that these alphas are good indicators of the effort required to contribute. They also examine the distribution of participation per topic (i.e., edits, diggs) and find it to be lognormal.
This is an interesting
article with substantive results. I would be interested
in seeing if the results hold
in more than just these four cases examined.
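As a concrete reference point, here is a minimal Python sketch of the kind of power-law fit described above; the data are synthetic and the cutoff k_min is an arbitrary choice, so this only illustrates the standard maximum-likelihood estimator rather than the papers' actual procedure.

```python
# Sketch: estimating a power-law exponent (alpha) from per-user
# contribution counts. Synthetic data stands in for the real logs;
# the estimator is the standard continuous MLE,
# alpha = 1 + n / sum(ln(k_i / k_min)).
import math
import random

def sample_power_law(alpha, k_min, n):
    """Draw n samples from a continuous power law via inverse-transform sampling."""
    return [k_min * (1.0 - random.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]

def estimate_alpha(counts, k_min):
    """Maximum-likelihood estimate of the power-law exponent above k_min."""
    tail = [k for k in counts if k >= k_min]
    return 1.0 + len(tail) / sum(math.log(k / k_min) for k in tail)

if __name__ == "__main__":
    # synthetic "contributions per user" drawn with a known exponent
    counts = sample_power_law(alpha=2.3, k_min=1.0, n=50_000)
    print("estimated alpha:", round(estimate_alpha(counts, k_min=1.0), 2))
```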
ANDREW BERRY
Paper A
Even though the paper states
that the growth of Wikipedia lies comfortably between the inflationary and deflationary hypotheses, I was most surprised that Wikipedia is built from the
inside out. I always thought that users made particular pages based on their
own interests. However, it does make sense that references must lead to
definitions for sustainable and connected growth. Yet, because of the loose
framework regarding contributions to Wikipedia, it is an interesting phenomenon
that the ratio between complete and incomplete articles remains constant. The
Barabási model is a well-thought-out model of Wikipedia, but how does one quantify a vertex's connectivity? Also, the model assumes that at each timestep
the maximum number of vertices that can be added to the network is equal to the
number of vertices already in the network. However, in reality it is quite
feasible for this to be more. The authors do adjust the model, which provides a better fit to the Wikipedia data, but I wonder whether some of the assumptions are flawed.
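For readers unfamiliar with the model, here is a minimal sketch of generic Barabási-style preferential attachment, in which a vertex's "connectivity" is simply its current degree; this is the textbook construction with invented parameters, not the authors' adjusted variant for Wikipedia.

```python
# Sketch of Barabasi-Albert-style preferential attachment: new vertices
# attach to existing ones with probability proportional to current degree.
# Parameters are illustrative only.
import random

def preferential_attachment(n_vertices, m_links=2):
    """Grow a graph by preferential attachment; return each vertex's degree."""
    degree = [m_links] * (m_links + 1)            # small fully connected seed
    # endpoint pool: each vertex appears once per unit of degree, so uniform
    # sampling from the pool is degree-proportional sampling
    pool = [v for v in range(m_links + 1) for _ in range(m_links)]
    for new_v in range(m_links + 1, n_vertices):
        targets = set()
        while len(targets) < m_links:
            targets.add(random.choice(pool))
        degree.append(0)
        for t in targets:
            degree[t] += 1
            degree[new_v] += 1
            pool.extend([t, new_v])
    return degree

if __name__ == "__main__":
    degs = preferential_attachment(10_000)
    print("max degree:", max(degs), " mean degree:", sum(degs) / len(degs))
```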
Paper B
The Wilkinson paper
demonstrates that the probability a person stops contributing varies inversely
with the number of entries the user has already made. This paper also shows
that a small number of popular topics account for the majority of contributions.
These conclusions are supported rather robustly as they hold for Wikipedia,
Digg, Bugzilla and Essembly which are different peer production systems for
different purposes. Wilkinson uses a power law to describe the user
participation levels in all systems where the probability of quitting is
inversely proportional to previous contributions. The conclusions of the paper
also suggest that participation is dependent mostly on this probability and the
difficulty of contributing to the system. Even though this model fits the data
for all four sites, I do not think it can be generalized to all online peer
production systems. Suppose there is a peer production system similar to
Wikipedia where you could contribute to a particular topic easily and incrementally.
In this case there would be some sweet spot where, as the number of contributors grows, a user has a lower probability of quitting. A user may have an incentive not to contribute if he has to start the entry or do much of the groundwork, but once enough others have contributed, his probability of quitting may decrease, up to a certain point.
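To make the quit-probability mechanism concrete, here is a toy simulation; the constant c and the form min(1, c/k) are my assumptions for illustration, not Wilkinson's fitted values.

```python
# Sketch: if a user's chance of quitting after their k-th contribution
# falls off like c/k, the distribution of total contributions per user
# comes out heavy-tailed. The constant c below is purely illustrative.
import random
from collections import Counter

def contributions_until_quit(c=0.6, cap=10_000):
    """Simulate one user; return how many contributions they make before quitting."""
    k = 1
    while k < cap and random.random() > min(1.0, c / k):
        k += 1
    return k

if __name__ == "__main__":
    totals = [contributions_until_quit() for _ in range(100_000)]
    tally = Counter(totals)
    for k in (1, 2, 5, 10, 50, 100):
        frac = sum(v for kk, v in tally.items() if kk >= k) / len(totals)
        print(f"fraction of users with >= {k:>3} contributions: {frac:.4f}")
```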
PETER BLAIR
Paper A
In this article, the authors endeavor to understand and model the growth of Wikipedia. The point of the study is to determine whether that growth is stable or unstable. The work was motivated by the idea of an inflationary process of growth, whereby, as more articles are contributed to Wikipedia, the number of unwritten articles referenced in completed articles would increase without bound, eventually undermining the credibility and trustworthiness of Wikipedia as a reliable source of information. The converse hypothesis, a deflationary scenario, would arise if the proportion of articles written outstripped the number of new articles that existing articles point to; because of a decreasing rate of referenced articles, the accumulation of new knowledge in Wikipedia would stagnate and eventually the site would not evolve quickly enough to contain relevant or sufficiently new information.

I wonder whether the following assumption is reasonable: that the number of pages with stubs that are useful and the number of non-useful pages that are not marked with stubs cancel identically. It would be more satisfying for the authors to motivate this result some more. In reading the article I am reminded of the attempt of facebook.com to translate its pages into other languages using just its users as translators. From all reports the effort was not successful; by comparison, it seems that Wikipedia has been successful because of its ability to compartmentalize its articles into workable and independent chunks that do not depend on each other, in the way that an integrated website like Facebook depends on all of its functions and pages having a consistent translation. In this light, Facebook may be an interesting case study for the sustainable growth of a website.
Paper B
In this paper the authors study the contributions of users to peer production systems. In particular, they focus on both the volume of contribution and the time evolution of content contribution by topic. The central results of the paper are: (i) for inactive users, the probability that a given user quit after having contributed k times is given by a power law distribution ~ 1/k; (ii) contribution to a particular topic is distributed lognormally due to what is termed a "multiplicative reinforcement mechanism," whereby contributions to a topic increase its popularity, which in turn increases the number of contributions. The authors use as case studies the peer production sites Wikipedia, Digg, Essembly and Bugzilla.
The results of this paper are derived under reasonable assumptions in a clean manner, which lends the model readability and believability. Particularly noteworthy is that the power law distribution is derived for inactive users, who it turns out make up more than 50% of the contributors for all of the sites. This modeling simplification avoids highly nonlinear and complex effects from superusers, who as a group can drive the content and direction of a site in a way that is not representative of the larger population of its users. I also appreciated the explanation of the various sites involved in the study. In fact, I learned an interesting fact about Digg.com: one can only add positive diggs. It seems that this type of voting rule should be less susceptible to manipulation, if we think of a desirable outcome as one reflected in the aggregate utility of all the voters, i.e., a person who really, really likes a given article gets rewarded for digging it in a way that is consistent with the goals of the voting rule and the expected outcomes (this is an interesting aside!).

The grouping of the plots in Figure 1 (a)-(c) is quite suggestive; in particular, processes with similar exponents, such as Essembly and Digg votes, are grouped on the same plot. Is this meant to suggest some universality in human behavior when it comes to participating in similar peer interactions, such as voting on different sites? This point is made very subtly and suggestively, and I would have appreciated more of a hypothesis as to why we see such a striking similarity; otherwise it leaves the reader thinking that something fishy is going on. The authors deliberately state that they avoid drawing sociological extrapolations from their results. I am personally dissatisfied with this reluctance to speculate in light of the highly suggestive presentation of the team's data.

While on this point, I found it bizarre that the alpha for Wikipedia edits was lower than that for Digg submissions. This empirical finding seems to be at odds with the suggestion that alpha measures how hard it is to contribute to a peer production site: clearly it is easier to submit a story for digging than to update a Wikipedia article. This is an area where the authors could have been a bit more imaginative in their conclusion, offering a reason for the inconsistency. One potential explanation is that a contributor's willingness to participate is positively correlated with the social good that his or her participation produces; contributing to a Wikipedia article arguably produces more of a social good than listing an article that one diggs, since the Wikipedia article has a longer shelf life and potentially has research implications.

The lognormal result for topics was also particularly satisfying and makes intuitive sense: initially many articles get edited; as time progresses, the number of edits per article increases but the number of articles being edited decreases, i.e., the community homes in on the important articles that are most beneficial, and hence more effort is expended to make those articles more accurate. This reasoning feeds into the social-desirability explanation I offer above for why the alpha for Wikipedia edits is lower than that for Digg submissions. A potential future area of research would be to develop some metric of social benefit for a given peer production site and ask how this factors into, or incentivizes, edits. Would the distribution again be lognormal as a function of perceived social benefit? YouTube may also be an interesting case study for the "multiplicative reinforcement mechanism," given that popular videos are highlighted on the front page of the website.
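To see concretely why multiplicative reinforcement yields a lognormal, here is a minimal simulation sketch; the drift and noise parameters (mu, sigma) are invented for illustration and are not the paper's fitted values.

```python
# Sketch of multiplicative reinforcement: each step, a topic's attention
# grows by a random proportional factor, so log(total) is a sum of random
# terms and ends up roughly normal, i.e. the totals are roughly lognormal.
# All parameters are illustrative.
import math
import random
import statistics

def simulate_topic(steps=200, mu=0.01, sigma=0.1):
    n = 1.0                                      # initial contributions to the topic
    for _ in range(steps):
        n *= math.exp(random.gauss(mu, sigma))   # proportional random growth
    return n

if __name__ == "__main__":
    totals = [simulate_topic() for _ in range(20_000)]
    logs = [math.log(t) for t in totals]
    print("mean of log(total):", round(statistics.mean(logs), 2))
    print("std  of log(total):", round(statistics.stdev(logs), 2))
    # if the mechanism holds, the log-totals look Gaussian:
    # mean ~ steps*mu = 2.0, std ~ sigma*sqrt(steps) ~ 1.41
```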
SUBHASH ARJA
The paper "The
Collaborative Organization of Knowledge" seeks to study the online
encyclopedia Wikipedia. Specifically, the study involves observing the typical time difference between when an article is first referenced and when it is actually created. The authors also seek to find the ratio of complete to incomplete articles in order to test the "inflationary hypothesis". This kind of study is important for gauging the usefulness and feasibility of a system like Wikipedia. Because it depends on the pooling of knowledge from many contributors, it matters whether creating articles in turn leads to the creation of an unbounded number of undefined articles. One result that was very interesting
was the finding that the ratio of incomplete articles to complete articles
started close to 3 and has reached an almost steady-state value of close to 1.
This is a testament to the great increase in participation in Wikipedia by the general public. Another result from the paper that is very important and supports the Wikipedia concept is that new articles are more likely to be written by someone who is not the author of the original article referencing them. This shows that no one person is an overwhelming contributor; rather, Wikipedia truly is a collaborative effort.
The second paper,
"Strong Regularities in Online Peer Production", shows that in
submission and user-edited sites, like Digg and Wikipedia, a user is more
likely to contribute if he has been contributing regularly. Also, the study
finds that the level of activity for a certain topic follows a log normal
distribution. Both results are not very surprising considering the purpose of
the type of websites being studied. The author notes that Digg submissions and Wikipedia edits have a higher barrier to entry than Digg and Essembly votes, which explains their larger values of alpha. While I agree with this finding, I think the barrier-to-entry analysis is incomplete. For instance, on Digg, many submissions are duplicated, and if one of these ends up in the top 10, it is most likely because it was submitted by a very popular user, even in cases where he was the third or fourth submitter of the same story. Thus, there is also a "popular user" barrier to entry. Also, stories from certain news or tech sites tend to be more likely to receive Diggs. This could be an interesting study to conduct in addition to, and as a complement to, the one presented in this paper.
ALICE GAO
Paper A
This paper studies the
process of Wikipedia growth.
Specifically, it asks whether Wikipedia's development is a sustainable process, and what triggers the creation of new articles on
Wikipedia. The results are that Wikipedia's development is a sustainable process and that references to non-existent articles trigger the eventual creation of a corresponding article.
I thought the idea of
sustainable growth is pretty interesting.
It characterizes growth trends at two extremes that both lead to bad
consequences. Also, I had a general thought about using Wikipedia data for research purposes. Basically, we have a lot of data available that we can analyze, and I believe that we can easily write software programs to perform these analyses. Therefore, all we really
need is to ask an interesting and valuable question. Having a good question that motivates the research would be the most crucial step in the research process, in my opinion. I don't think the topic explored in this paper is very well motivated. For example, I can think of some more specific motivations for studying Wikipedia, such as how to make the information aggregation process more effective, how to reduce vandalism of pages, and how to create incentives so that people follow appropriate guidelines for contributing articles.
Paper B
First of all, I think one important contribution of this paper is finding common regularities across different peer production systems. This means that the results are not specific to a particular peer production system, but reflect a general property of such systems. Also, the notion of momentum associated with participation is interesting. This result tells us the pattern of contributions made, and we can use it as a starting point for studying the underlying reasons for these observations.
The author claims that this
paper only serves as a starting point for the general study of peer production
systems. Indeed, I think future
studies can take on different perspectives to study behaviours of contributors
as well as their interactions in these kinds of systems. In particular, this reminds me of a
study on the interactions of users in Wikipedia in terms of participants
interacting in a social network.
MICHAEL AUBOURG
Paper A
One strength of Wikipedia, in my view, is its incentive to complete its content with new articles. When links are red, it means that Wikipedia thinks a topic deserves an article but cannot currently provide one with adequate content. For instance, on the Eiffel Tower page (http://en.wikipedia.org/wiki/Eiffel_Tower) there is a red link for the "Avenue de Suffren", and when we click on that link we reach a placeholder page (http://en.wikipedia.org/wiki/Avenue_de_Suffren) with the immediate possibility to write the article: "Start the Avenue de Suffren article".
Another remarkable point is that Wikipedia is powerful thanks to the genuine desire people have to share and to make Wikipedia progress: indeed, the subsequent definition of an article in Wikipedia appears to be a collaborative phenomenon 97% of the time.
What are the disadvantages of Wikipedia? The fact that people who write or modify articles are hidden behind Wiki nicknames is great. However, it means that everybody is allowed to modify every page, even when a certain person shouldn't be modifying a page about him- or herself.
This is the case with WikiScanner, a relatively new site that tracks the edits made on Wikipedia. The purpose of this service is to see who is behind the edits made, and how these actions often serve self-interested corporations hoping to promote and protect brand identities, which is a shame. Created by a student, WikiScanner searches the entirety of the XML-based records in Wikipedia and cross-references them with public and private IP and domain information to see who is behind the edits made on the online encyclopedia. This is an awesome idea. Personal anonymity is preserved, but now we can reveal institutions' identities. With WikiScanner, there are a few levels on which you can search for information, including organization name, exact Wikipedia URL, or IP address, among others.
This student found that a good portion of the edits to company entries are being made by the companies themselves. This isn't really surprising, and it was expected. The team behind Wikipedia is aware of it and has been working to deal with issues such as this. Wikipedia's policies have changed since its inception, and the user-generated system has been improved as a result.
For this reason, I think anonymity should be preserved at the individual scale, but we should reveal some information about the person who edits or modifies an article, such as his or her interests (politics, companies...). Let's not forget one of the five founding rules of Wikipedia: "Wikipedia has a neutral point of view, which means we strive for articles that advocate no single
point of view. Sometimes this requires representing multiple points of view,
presenting each point of view accurately, providing context for any given point
of view, and presenting no one point of view as "the truth" or
"the best view." "
VICTOR CHAN
The two papers, The Collaborative Organization of Knowledge and Strong Regularities in Online Peer Production, by Spinellis & Louridas and Wilkinson respectively, deal with the subject of online peer production systems. The first paper deals specifically with Wikipedia.org, while the second touches on Wikipedia.org, Digg.com, Bugzilla and Essembly.com. These websites all contain user-generated content and are good subjects for the analysis of user participation and content growth.
The main contribution of Spinellis & Louridas' paper is the evaluation of Wikipedia and of how new content is created in the system. The results the paper presents show that the growth of Wikipedia scales fairly well, even though the number of undefined links exceeds the number of defined pages. Wikipedia's development lies between the extremes of link inflation and deflation, which can be attributed to the fact that undefined links drive the creation of new pages. The results show that the majority of links are defined within one month of their first reference.
The second paper discusses the four websites and presents results on the participation of users and the contributions to a given topic in these peer production systems. The main results show that user participation levels follow a power law, and it is suggested that a small group of users contributes the majority of the content. The paper then elaborates on the momentum a user gains to participate in the creation of content: it is shown that the probability of quitting is inversely proportional to the number of previous contributions the user has already made. The next main point the paper touches on is that the distribution of contributions to a topic is lognormal. Based on this, the paper suggests that a small number of popular topics makes up the majority of contributions.
The ideas in the second paper seem interesting, because they suggest that a small number of users and a small number of popular topics drive the growth of these online peer systems. However, the overall system still grows fairly quickly, as shown in the first paper. I am curious how the set of these heavy users and popular topics grows as the size of the entire system grows.
Another interesting point is that in this social computing platform, we once again have a small number of "experts" who make the most difference, as we saw in prediction markets. Social computing was defined as deriving intelligence from a whole group of people rather than from an expert; however, these papers suggest that the whole group should rather be thought of as a group of "experts". Only users who have information are useful to these peer production systems or prediction markets.
XIAOLU YU
This first paper presents an
empirical study of regularities in four online peer production systems.
The paper first showed that the distribution of the number of participations per user is strongly right-skewed and well described by a power law.
The heavy right skew means that a small fraction of very active participants are
responsible for the large majority of the activity, an unfortunate reality for
recommender systems attempting to provide accuracy for all users.
Secondly, the authors showed that topic activity levels in all the systems can be described as lognormal due to a reinforcement mechanism in which more contributions lead to higher popularity. This explains the famous 80-20 rule: a few popular topics dominate the activity of the whole system, even though the various systems differ in their details.
I believe this heavy-tailed form is nearly ubiquitous in most forms of online activity, including peer production, online discussion, rating, and commenting, among others. In other words, the power law is typical and representative of online collaborative systems as well as peer production systems.
In the second paper, the authors started with two hypotheses: that as Wikipedia expands, it will become less useful as more and more of the terms in the average article are not covered; or that Wikipedia's growth will slow or stop as the number of links to uncreated articles approaches zero. In the first case, Wikipedia's coverage will decrease as its articles become drowned in an increasing number of undefined concepts. In the second case, Wikipedia's growth may stop. The paper shows that Wikipedia grows at a pace between the two extremes.
The authors examined a snapshot of the Wikipedia corpus, 485 GB of data, adding up to 1.9 million pages and 28.2 million revisions. They analyzed the relationship between references to non-existent articles and the creation of new articles. Their experiments showed that the ratio of non-existent articles to defined articles in Wikipedia is stable over time. In addition, the authors discovered that missing links are what drive Wikipedia's growth. They found that new articles are contributed by users in a collaborative fashion: users often add new entries when they find a missing link. The study also showed that the connection between missing links in existing articles and new articles is a collaborative one, and that adding missing links to existing articles actually spurs others to create new articles. It also showed that new articles were typically created within the first month after they were first referenced in another article. I believe that this mode of growth (called preferential attachment in the paper) may be able to explain some collaborative systems in other areas.
BRETT HARISSON
Both these papers offer
empirical studies of online collaboration and peer-editing systems, including
such internet giants as Wikipedia and Digg.
The power law result
(describing the distribution of contributions among users) is an intuitive yet
interesting result. It directly implies that an online-collaboration system
with a lower "alpha" will solicit more contributions and hence more
traffic (and probably, for systems such as Wikipedia, more accurate information).
Digg, for example, provides an extremely easy way for users to "Digg"
stories, i.e., they just have to click a single button. This suggests that one of the most important components of such a system is its user interface, and that much time and thought should go into designing UIs that are as intuitive and easy to use as possible.
I also am curious about the
results regarding the correlation between number of links to non-existing pages
and the creation of new pages. While there is certainly empirical evidence to
back up this claim, what will happen in the following scenario: if Wikipedia
mandates that every new article contain at least 10 outgoing links (replace 10
with any number if you wish), will that stimulate further growth? Or what if I
start making articles in which every other word is an outgoing link to a
non-existing page... will that stimulate the creation of pages? I am curious
about the direction of causality with this claim.
ANGELA YING
Paper A
This paper discussed the similarities between different online peer production services, including Digg, Wikipedia, Essembly, and Bugzilla, all of which are websites where users register and are able to share information, either by editing existing articles or by posting articles to the site. The paper had some interesting results about the probability of a user stopping contributions being inversely proportional to the number of contributions already made, and it found the corresponding constants for all four websites. It also derived a formula for the number of people who make more than some number k of contributions. In addition, an important finding was that the value of the constant alpha is related to the amount of effort required to contribute; for example, editing a Wikipedia article and submitting a Digg article have similar alphas, which are larger than the alphas for digging an article or voting on Essembly. Finally, the paper found that the lognormal parameters for contributions per topic vary linearly over time.
I thought that this was an
interesting paper because it actually looked at data from 4 fairly popular
websites that cater to different audiences and have different types of people
making contributions. An idea for future work would be to look at the beginning
of these websites and analyze how the probabilities derived from this paper
were different when the websites were not so popular. Perhaps even a year by
year analysis could work.
Paper B
This paper explored Wikipedia and the effectiveness of peer revision. In particular, it examined how the number of articles grows and how articles reference non-existent ones. Originally, the author was concerned about the possibility that the number of non-existent articles referenced would grow at a rate greater than the number of articles created, which would slow Wikipedia's growth and increase the proportion of stubs. However, the paper examined two models, looked at real data, and concluded that the ratio of non-existent articles to created articles remains about the same, yielding sustainable growth. A particularly interesting result was that many articles on Wikipedia were actually created because of non-existent references or stubs.
I thought that this was an
interesting study, although I would be curious to learn more about the models
themselves. The author briefly explained how they worked but did not really
describe their context, and whether they were formulated according to a real
example or simply derived from theory. A possible extension would be to look at
websites other than Wikipedia (perhaps smaller wikis or Wikipedia extensions
focused on certain topics) and see if they follow similar trends.
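To illustrate the kind of stable ratio described above, here is a toy Python sketch; every rate in it is invented for illustration, and it is not the paper's model.

```python
# Toy sketch of red-link-driven growth: each step one article is created,
# usually by filling an existing red link, and the new article itself
# references a few topics, some of which do not exist yet. The quantity
# of interest is whether the undefined-to-defined ratio settles rather
# than exploding. All rates below are invented for illustration.
import random

def simulate(steps=50_000, links_per_article=3,
             p_new_link=0.4, p_fill_redlink=0.9):
    defined, undefined = 10, 10
    history = []
    for t in range(1, steps + 1):
        if undefined > 0 and random.random() < p_fill_redlink:
            undefined -= 1                    # a red link gets written
        defined += 1                          # one new article either way
        for _ in range(links_per_article):    # its own outgoing references
            if random.random() < p_new_link:
                undefined += 1                # ...some of which are red
        if t % 10_000 == 0:
            history.append((t, undefined / defined))
    return history

if __name__ == "__main__":
    for t, ratio in simulate():
        print(f"step {t:>6}: undefined/defined ratio = {ratio:.2f}")
```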
RORY KULZ
Paper A
This paper is pretty
straightforward. It's basically to me one of
those papers where there's a
very intuitive result (e.g., everything
follows a power law, and the
harder it is for people to participate,
the higher the rate of
participation dropoff), but you know, someone
had to go out in the world
and prove / explain it. In other words,
this paper would have been
more interesting if something very
unexpected happened.
The most engaging part of
this paper is definitely the explanation of
multiplicative reinforcement
and seeing how the stochastic formulas
fit the data, especially Digg
with the discount factor. As for the
other results, I was a little
surprised about the dropoff for
Wikipedia edits, and I was
wondering if the data set had incorporated
unregistered users, which
wasn't clear to me (they just mention a
"user ID"). It at
least seems like "barrier to participation" might
not be the most natural
explanatory concept for Wikipedia, since the
barrier is actually the lowest (anyone visiting can click "edit," whereas not every visitor can, e.g., "digg" a story), while the desire of
most visitors to participate
might be the lowest (most people come
seeking information and may
only edit if they see a blatant error).
I'm not sure.
Paper B
I'm sorry, is this paper for
real? "We hypothesize that the addition
of new Wikipedia articles is
not a purely random process following the
whims of its contributors but
that references to nonexistent articles
trigger the eventual creation
of a corresponding article." The shock!
When there is a demand for an
article that has not been written, the
people who frequently
contribute to Wikipedia will eventually fill
that demand! (It should be
noted that this paper does not also consider the case of links to articles being removed because editors have deemed them not appropriate to write (following the policies at
http://en.wikipedia.org/wiki/Wikipedia:What_Wikipedia_is_not), which
should also be viewed as a
definitive statement on its status in the
way the addition of an entry
is.)
Okay, so there are some nice
graphs, Figures 1 and 2b being the main
ones, and it is a nice result
that the growth of Wikipedia appears
sustainable, but seriously,
regarding the growth mechanism, would it
actually be remotely
plausible for any other situation? Independent
pockets working on their own
articles and then gradually linking
together? Anyone who has
spent time on Wikipedia can see this isn't,
at least in recent times, the
case; the most interesting thing, I
think, to consider is why
this might be a function of a large, diverse
user base, comparing for
example circa-2001 Wikipedia's growth with
circa-2008 Wikipedia's growth
and seeing if distinctions can be drawn.
The authors touch on the
importance of the user base very briefly in
the conclusion --
"...the scalability of the endeavor is limited not
by the capacity of individual
contributors but by the total size of
the contributor pool" --
but come on, what's the deal with Figure 1?
They say the coverage ratio
is basically stable after 2003, but they
don't look into why it seems
substantially different from 2001-2003.
This I think is a missed
opportunity.
HAOQI ZHANG
Paper A
The main contribution of this paper is its systematic analysis of Wikipedia's growth over time and of the reasons behind that growth (and the kind of growth). In particular, the authors show empirically that links to nonexistent articles seem to contribute to the site's growth (these pages get filled in, often by others), and furthermore that the nonexistent links are growing at a rate neither much faster nor much slower than content is being contributed, showing signs of sustainable growth. What I found most interesting about the paper is the discussion of system designs that aid the site's growth. For example, I found the watchlist, which alerts users to changes in articles they are interested in, to be significant for regulation and growth. As another example, I found the style guidelines that lead to splitting overly long articles into shorter ones interesting in terms of how the growth takes shape.
I found the empirical analysis to be interesting and convincing. However, while the development has followed these patterns so far, I can imagine that this will begin to change as more content is added and the population of users who are able and willing to contribute shrinks because the basic topics are already covered. I think an interesting direction in which to extend the current work would be to introduce a model that captures the effect of various system designs, to predict how we may wish to modify various aspects of Wikipedia to facilitate long-term growth.
Paper B
The main contribution of this paper is in quantitatively identifying some commonalities among peer production systems: (1) user participation in all tested systems is described by a power law, where a few active users contribute a large share of the content, and (2) a few popular topics dominate most of the activity on a site. In describing the power laws, the authors show that different barriers to entry (e.g., the amount of effort required to participate) lead to different power law constants; that is, the barrier has a direct effect on the amount of participation. This is not surprising, but it is nevertheless significant for thinking about how few users are encouraged to do most of the contributing.
It is not clear to me where
this paper leaves us. One interesting question is how we can encourage
participation from all participants while guaranteeing the quality / level of
dedication required. For Wikipedia in particular, how do we get users to contribute high-quality content? Part of the answer depends on how we believe knowledge is distributed, that is, whether there is knowledge that is not being entered into the system that otherwise could be. What motivates the heavy contributors
to contribute?
NIKHIL SRIVASTAVA
The Spinellis and Louridas paper provides a strong argument for the sustainability of Wikipedia's growth by
examining the addition of new articles and the links that connect them to the
existing network, both with empirical evidence and in comparison to a
scale-free graph model. I found the paper to be intellectually interesting, but
I think its focus straddled two even more interesting topics. First, what can
examining the growth of Wikipedia tell us about the people who contribute to it
and their incentives for doing so? (I think we'll see more of this later in the
course). Second, how can this information tell us something interesting about
*knowledge* itself, or about the relationships between sets of information on
the internet? (I plan to do my project proposal on this idea.)
The Wilkinson paper begins to
address the first issue, but is framed still in terms of understanding the
dynamics of the peer production system instead of the users behind it. I
imagine the idea is to optimize the quality and quantity of information aggregation,
but I still think there are interesting ideas to be discussed in the other
issues I mentioned. Specifically, the Wilkinson paper shows that a wide range
of peer production systems display a power law distribution of user
contribution and a lognormal distribution of topic activity.
AVNER MAY
I found these articles rather
interesting, particularly the one about Wikipedia's growth. I thought that the model they proposed for Wikipedia's growth, in which references to non-existent articles spur the creation of new articles, is very enlightening and not immediately obvious. It was nice to see that Wikipedia's growth falls under neither the "inflationary" nor the "deflationary" hypothesis, and thus that the growth appears sustainable. Thinking of all human knowledge as a graph,
and imagining Wikipedia slowly but surely covering more and more of this graph
in a breadth-first-search manner, is an interesting model. Maybe this breadth first search is one
with multiple start nodes, or with start nodes appearing randomly at every time
step (a start node appearing randomly would be equivalent to a random
generation of a new article, as opposed to an article being created due to the
fact that it was already referenced).
This is kind of like imagining a candle being melted by multiple lit
wicks, and a new wick being lit occasionally. I would be interested in studying how quickly a graph would
be covered in this fashion (fraction covered vs. time), and how this depended
on the topology of the graph. What
is a good model for the topology of the graph of all human knowledge? Can the growth rates of Wikipedia,
together with the insights from this article, be used to study this
question? In this article, they looked at Wikipedia's growth as the creation of a graph, as opposed to a traversal of a graph. I would be interested in modeling what Wikipedia does as a traversal of a graph by a large group of contributors. With regard to the second article, I thought it was interesting how "the probability a person stops contributing varies inversely with the number of contributions he has made," and the implications this has about how much of the content is provided by the x% most active users.
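Avner's "multiple wicks" picture can be sketched directly. The following toy simulation assumes a sparse random graph and placeholder rates (it makes no claim about the real topology of human knowledge); it runs a breadth-first expansion from a few start nodes, occasionally lighting a new one, and records the fraction of the graph covered over time.

```python
# Sketch: cover a graph by breadth-first expansion from a few start nodes,
# occasionally igniting a fresh random start node, and record fraction
# covered vs. time. Topology and rates are placeholders.
import random
from collections import deque

def random_graph(n=5_000, avg_degree=6):
    adj = [[] for _ in range(n)]
    for _ in range(n * avg_degree // 2):
        u, v = random.randrange(n), random.randrange(n)
        if u != v:
            adj[u].append(v)
            adj[v].append(u)
    return adj

def bfs_coverage(adj, n_seeds=3, p_new_seed=0.001):
    n = len(adj)
    covered = set()
    frontier = deque(random.sample(range(n), n_seeds))
    curve, step = [], 0
    while frontier:
        node = frontier.popleft()
        if node not in covered:
            covered.add(node)
            frontier.extend(v for v in adj[node] if v not in covered)
        if random.random() < p_new_seed:       # a new "wick" is lit
            frontier.append(random.randrange(n))
        step += 1
        if step % 500 == 0:
            curve.append((step, len(covered) / n))
    return curve

if __name__ == "__main__":
    adj = random_graph()
    for step, frac in bfs_coverage(adj):
        print(f"step {step:>6}: fraction covered = {frac:.2f}")
```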
MALVIKA RAO
Paper B
Strong Regularities in Online
Peer Production: I did not find the findings
of this paper to be
surprising. It is reasonable to see that in these
systems a very few active
users account for most of the contributions, a
few visible popular topics
dominate the total activity, and that the
probability of quitting is
inversely proportional to the number of
previous contributions. It is good to know that statistical models such as the power law and the lognormal distribution can fairly accurately describe these phenomena.
SAGAR MEHTA
Paper A
This paper provided empirical
results on how Wikipedia grows, which the authors argue is primarily through
undefined references. I found the findings to be rather intuitive. A
contributor to Wikipedia seems less likely to start a new article from scratch (which may have a higher barrier, since it requires more work) than to edit an existing stub. Furthermore, it's more likely that he or she will stumble upon
an article stub that is linked to in another article of interest to him or her.
I think this paper could have given further insight into how Wikipedia grows if
it had focused not only on which pages were being edited/completed, but also on
who was editing them. Are there a few contributors fueling the growth of new
articles or many? Does this matter in deciding how accurate the information is
in Wikipedia? Is a newer article on average less accurate than an older one?
Theoretically, this should be true as more revisions could lead to better
results/information aggregation. However, in some cases one could argue that in
the presence of too many sources, Wikipedia becomes a tool to convey the
consensus opinion on a matter rather than the "actual" truth.
Paper B
I found it pretty interesting
that online peer production systems share several general results with regard
to how people contribute. One thing I wondered about in the multiplicative
reinforcement section was their assumption that for Wikipedia we do not need a
discount factor to describe dN(t), the number of contributions to a given topic
made between time t and t + dt, but we do need one for Digg to account for the
"decay in novelty of news stories over time". Votes to a story in
Digg surely should be discounted over time because of this, but shouldn't edits
on Wikipedia also be discounted to account for the idea that knowledge on a
particular topic is complete? So, for Wikipedia, we don't necessarily need a
discount factor to account for "decay in novelty", but I would argue
we do need one to account for the idea that information becomes more complete
as time progresses.
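Sagar's point can be made concrete with a small sketch; the exponential discount form and every parameter below are my assumptions for illustration, not the paper's model for Digg or Wikipedia.

```python
# Sketch: compare a multiplicative-growth topic with and without a
# discount that damps new contributions over time (standing in for
# "the article is becoming complete"). The exponential form and all
# parameters are assumptions for illustration only.
import math
import random

def grow(steps=300, mu=0.02, sigma=0.1, tau=None):
    n = 1.0
    for t in range(steps):
        rate = random.gauss(mu, sigma)
        if tau is not None:
            rate *= math.exp(-t / tau)           # novelty / completeness decay
        n *= math.exp(rate)
    return n

if __name__ == "__main__":
    random.seed(0)
    undamped = sum(grow() for _ in range(2_000)) / 2_000
    damped = sum(grow(tau=50.0) for _ in range(2_000)) / 2_000
    print("mean final size, no discount:   ", round(undamped, 1))
    print("mean final size, with discount: ", round(damped, 1))
```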
HAO-YUH SU
Paper A
This paper investigates the growth of Wikipedia. It claims that Wikipedia's growth is sustainable by offering two pieces of evidence: (1) the growth of unresolved references, and (2) the fact that references lead to definitions. Moreover, the paper builds a scale-free network model to describe the process of adding references and entries. Things are pretty clear when the first part of the argument is laid out. However, when it comes to the second part, the scale-free network, I am a little confused. First, I'd like to know what the authors mean by "connectivity" when deriving the network model. How is the connectivity of an entry defined in Wikipedia? Second, I am confused about the formula {k} = rP(k). Doesn't k represent connectivity? Why is it used to denote the expected number of added references? Third, in the model, when the authors say r references are added to a given entry, are they talking about incoming references, outgoing references, or both? My last question is about an argument in the conclusion. There, the authors claim they found that new articles are typically written by authors different from the ones behind the references to them, and they further conclude that the scalability of the endeavor is limited by the size of the contributor pool. This is a remarkable statement. However, I think the authors should provide quantitative data to support this important argument. It would be really exciting to see the exact probability of such a phenomenon in Wikipedia.
Paper B
This paper probes the mechanisms inside four big online peer production systems: Wikipedia, Digg, Bugzilla and Essembly. In the beginning, it investigates user participation in the four systems. Then it examines the number of contributions per story. My first question is about the data in Table 1. The table shows that Wikipedia has 1.5 M topics; however, in the paper the authors say Wikipedia has over 9 million articles. Is the difference due to different time frames or to different definitions of "topics" and "articles"? Second, when the authors interpret the data in Section 3, they say that when the required effort to contribute is higher, a larger value of alpha is expected. I agree that voting on Digg and Essembly, the activities with smaller values of alpha, can be done quickly. However, when it comes to Digg submissions and Wikipedia edits, the two with the highest alpha values, the statement, in my opinion, cannot be made so easily. The data show that Digg submissions have an alpha value of 2.4, while Wikipedia edits have 2.28. After observing these two websites, I find that a Wikipedia edit seems to require more effort than a Digg submission. Therefore, I think that besides the required effort of contribution, there must be some other factors affecting the value of alpha. My last question is about the heavy-tail property of the lognormal distribution. Although the authors have given some explanation on this point, I am still unclear about why this property would lead to difficulties in predicting popularity in peer production.
ZHENMING LIU
Both papers describe the
fundamental behavior of an online production site Wikipedia from different
aspects. In particular, Spinellis and LouridasÕ paper analyzed the rate between complete and incomplete articles
in wiki and tried to reason why wikipediaÕs remarkable growth is sustainable.
While on the other hand, Wilkinson collected data for usersÕ macro level
behavior and attempted to model these behavior mathematically.
Many of the empirical results were already well known, and the models in both papers are good (though not particularly impressive). In addition, like many recent papers that attempt to analyze users' behavior, these two papers also oversimplify that behavior. For example, in Wilkinson's paper, defining inactive users as those who fail to contribute anything for 3 or 6 months seems ridiculous to me. Furthermore, the time span of 3 or 6 months in this definition sounds arbitrary.
Both papers emphasize the "predictability" of their models. I am not sure whether that means their regression models fit the existing data well or that their models are trained and tested on two separate sets of data. Perhaps in their context these two definitions are equivalent, but "predictability" sometimes sounds misleading.
The tools (e.g., the design of the crawlers, the link analysis software, or the backend database) that allow researchers to study data at this scale are never described in detail in papers of this type. Many traditional computer science papers tend to be precise in describing the setup of their experiments, whereas papers in these emerging areas tend to avoid detailed descriptions of their methodologies, even though those methodologies are usually more complicated than the experiments in more traditional computer science research. I would like to see more discussion of the data processing, because it is the starting point of any research on these types of systems.