“Family trees” of online communities

Brian Keegan
Aug 1, 2019


tl;dr: Chenhao and I have a new NSF grant and will be looking to partner with community moderators and recruit a new co-advised post-doc. Apply here.

Where do new online communities come from? They rarely start entirely from scratch: the founders of a new community often were active in earlier communities. It can be difficult for researchers to capture the social dynamics of founders’ activities across different communities if they move to new platforms. However, platforms like Reddit and Wikipedia support a diverse ecosystem of sub-communities called subreddits and WikiProjects (respectively), and users’ identities and contributions to these communities are archived in a consistent format. The publicly available digital traces of users’ behaviors enable us to reconstruct a “family tree” of genealogical relationships between sub-communities as members migrate from earlier “parent” communities to subsequent “child” communities. These genealogical relationships can reveal overlooked sources of influence that can predict community success, explain vectors of toxic behavior, and provide new perspectives on the evolution of these influential platforms.

Genealogical relationships among a sample of subreddits.

Chenhao Tan and I are excited to announce that we have received funding from the National Science Foundation to launch a three-year project to analyze genealogical relationships in online communities on Reddit and Wikipedia. The goals of this project are three-fold:

  1. Characterizing genealogical graphs. What is a genealogical relationship in an online community? We have a preliminary quantitative method for identifying parent-child relationships based on the temporal sequences in their users’ public activity logs and propose further extensions to the method and applications to platforms like Reddit and Wikipedia.
  2. Validating genealogical graphs. Does our genealogical construct capture substantive relationships between online communities? We will employ a battery of mixed methods approaches such as trace ethnography, trace interviews, and focus groups to validate the genealogical relationship constructs. This triangulation step will elicit alternative definitions of genealogies, produce labeled data, and identify outliers that will require induction and iteration to generate more robust constructs.
  3. Evaluating community processes. How do genealogical relationships explain community processes? We analyze how processes like growth and norms are influenced by genealogical relationships. We propose to examine how genealogical graphs relate to community success through a prediction framework and study the effectiveness of features based on genealogical graphs.

Previous findings

Chenhao and I started to collaborate on this project around our shared interests in understanding the social dynamics of online communities through a sequence analysis perspective.

I published a paper in 2016 (with Ofer Arazy) arguing that researchers need to pay greater attention to behavioral sequences to understand the dynamics of social computing systems. In a large sample of Wikipedia articles, we classified users’ contributions relative to the number of edits (if any) since their previous contribution: this is the editor’s first contribution to an article, the editor made the previous contribution to the article, etc. We found that some patterns occurred significantly more often, and others significantly less often, than would be expected at random, which reflected complex behaviors like newcomer socialization, anti-vandalism, and conflict. Our paper’s analysis did not look into where these newcomers came from or where they went if they stopped contributing to an article.
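To make the idea of classifying contributions by sequence concrete, here is a minimal sketch in Python. It is an illustration only, not the coding scheme from the paper: the function name `classify_edits` and the three labels (`first`, `consecutive`, `returning`) are assumptions made up for this example, and the real analysis distinguishes more cases.

```python
def classify_edits(editors):
    """Label each edit in one article's history by the editor's prior activity.

    `editors` is a chronological list of editor names for a single article.
    The labels are hypothetical and chosen for illustration:
      first       -> the editor has never edited this article before
      consecutive -> the editor also made the immediately preceding edit
      returning   -> the editor edited before, with other edits in between
    """
    last_index = {}  # editor -> position of their most recent edit
    labels = []
    for i, editor in enumerate(editors):
        if editor not in last_index:
            labels.append("first")
        elif last_index[editor] == i - 1:
            labels.append("consecutive")
        else:
            labels.append("returning")
        last_index[editor] = i
    return labels


# Toy revision history (invented data).
history = ["A", "A", "B", "A", "C", "C"]
labels = classify_edits(history)
```

Counting how often each label (or each run of labels) appears, and comparing those counts against shuffled histories, is one way to ask whether a pattern is more or less common than chance.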

Chenhao published a paper in 2018 outlining a method for tracing the genealogy of a community from its founding members to their activity in previous communities. He used a sample of 30,000 subreddits with at least 100 active members and defined a genealogical relationship based on each community’s early members’ recent community memberships: “Specifically, we define parents of a new community j based on the posting history of its first k members in the month before they posted to community j.” His analysis found that most child communities have a stable set of strong parent communities, that the strength of the genealogical relationship predicts the growth rate of the child community, and that recruiting founders with diverse activity is crucial for child community success. He also has a very slick interactive web application that lets you explore the genealogical relationships on Reddit.
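The quoted definition can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the implementation from the paper: the function name `find_parents`, the `(user, community, timestamp)` tuple format, the 30-day approximation of “the month before”, and counting each founder at most once per candidate parent are all choices made for this example.

```python
from collections import Counter
from datetime import datetime, timedelta


def find_parents(posts, child, k=3, window_days=30):
    """Count, per candidate parent, how many of the child's first k members
    posted there in the window before their first post to the child.

    `posts` is an iterable of (user, community, timestamp) tuples
    (a hypothetical format assumed for this sketch).
    """
    posts = sorted(posts, key=lambda p: p[2])

    # Each user's first post time in the child community, in arrival order.
    first_post = {}
    for user, community, ts in posts:
        if community == child and user not in first_post:
            first_post[user] = ts

    founders = list(first_post)[:k]  # dicts keep insertion (arrival) order
    window = timedelta(days=window_days)

    parents = Counter()
    for user in founders:
        joined = first_post[user]
        recent = {
            community
            for u, community, ts in posts
            if u == user and community != child and joined - window <= ts < joined
        }
        parents.update(recent)  # one vote per founder per community
    return parents


# Toy data (invented): four users arrive in r/new; only the first three count.
toy_posts = [
    ("alice", "r/a", datetime(2019, 1, 5)),
    ("alice", "r/new", datetime(2019, 1, 20)),
    ("bob", "r/b", datetime(2018, 10, 1)),  # outside bob's 30-day window
    ("bob", "r/a", datetime(2019, 1, 10)),
    ("bob", "r/new", datetime(2019, 1, 25)),
    ("carol", "r/b", datetime(2019, 2, 1)),
    ("carol", "r/new", datetime(2019, 2, 10)),
    ("dave", "r/c", datetime(2019, 2, 5)),
    ("dave", "r/new", datetime(2019, 2, 20)),
]
parents = find_parents(toy_posts, "r/new", k=3)
```

In this toy example `r/a` gets two founder votes and `r/b` gets one, so `r/a` would be the stronger candidate parent. How the counts are normalized into a relationship “strength” is a separate design decision that this sketch leaves open.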

Ethical and privacy considerations

The proposed research necessarily involves tracking user activity across contexts, which raises important ethical and privacy concerns. First, users present different identities to different groups, but research designs can collapse these contexts together and upset users’ imagined audiences. Second, just because users’ trace data are accessible through public APIs does not automatically exempt them from ethical concerns. Third, while the policies governing ethical review boards in the United States interpret digital trace data as less risky to participants than other research designs, our colleague Casey Fiesler has done research documenting how social media users express reservations about their content being used for research.

We will address these ethical and privacy concerns through several steps. First, we will be transparent about our use of data and findings with the communities whose data we are using. We will participate in appropriate forums where research about the communities is discussed, like Reddit’s /r/TheoryOfReddit or Wikipedia’s “Village pump”, to disclose our research designs and share our results. Second, we will support community- and professional-led deliberation about our research by using community-led deliberative genres like “Ask Me Anything” engagements and blog posts (like this one) to share our research results. We will also invite feedback and co-creation of research designs, like those pioneered by Nathan Matias, and consult with other Reddit and Wikipedia researchers through workshops and panels at conferences to assess the risks and benefits of different data and research designs. Third, our analyses will employ de-identified and aggregated data from public sources and will not involve joining in other data that could lead to de-anonymization. The results and data we share will be reported at the community level rather than at the level of individual users.

Engaging community members

We plan to run interviews and focus groups with moderators, administrators, and other leaders of the sub-communities we analyze. While Wikipedia has a regular community gathering (Wikimania), there is no analogous “RedditCon”. The closest thing is Content Moderation at Scale, but if you are aware of any conferences, workshops, panels, etc. where Reddit moderators gather in person, please get in touch!

Recruiting a post-doctoral research associate

As a part of this grant, we are looking to recruit a post-doc for up to two years. The ideal candidate will be familiar with the history and culture of Reddit and/or Wikipedia, want to develop skills in computational, quantitative, and qualitative research methods, and help to shape the research agenda of human-centered data science. Interested applicants should apply here.



Brian Keegan

{Social, Data, Network, Information} Scientist. @CUInfoScience assistant professor.