The Computational Propaganda Project

Oxford Internet Institute, University of Oxford


Research Design FAQ

We frequently get questions about how we do our research. Our most important research decisions are detailed in our short data memos, longer working papers, and formally published research. But here are some straightforward answers to the most common questions we get.


What is the difference between a data memo, a working paper and a formally published piece of research?

Data memos are designed to present quick snapshots of analysis on current events in a short format. This approach allows us to conduct real-time analysis during pivotal moments of public and political life, and to make attempted incidents of computational propaganda transparent as they are happening. Often the project publishes data memos in a series, adding new details about an ongoing political event or reporting new findings. If you cannot find a detail that you are interested in, we might have already published information about it in another data memo in the series, in a longer working paper on the topic, or in one of our formally published pieces. We are always happy to respond to additional questions, but we suggest first consulting our earlier published work.


Is your research reviewed by your peers? 

Yes. The vast majority of our work has been funded by the world’s largest, most prestigious public science agencies, including the National Science Foundation and European Research Council. These agencies coordinate extensive double-blind peer review committees that review our research design and methodological approaches. In addition, our academic writing is regularly critiqued and reviewed. For several years now, team members have been presenting our research at academic conferences where we get feedback on our papers. We have published our research in book chapters, edited books, conference proceedings, and peer-reviewed articles. Each type of publication involves different kinds of review: editorial review, blind review, double-blind review, and unblinded review. These days, many of us believe that we have a responsibility to contribute to public life, so we also take advantage of scholarly networks to have our research reviewed informally before it is made public and submitted for formal review. Even our short data memos and white papers get reviewed by colleagues who are not the credited authors. For example, we received feedback on our recent data memo about the concentration of junk news in swing states from over a dozen researchers with academic appointments in communication, information science, and computer science departments. Taking feedback from our peers like this is a normal part of the process of scholarly inquiry, and allows us to improve our research and writing.


How do you select your hashtags when you study political communication on Twitter?

In order to get a large and relevant sample of social media data, we collect tweets by following particular hashtags that our research team has identified as being actively used during a political event. Our researchers rely on platform tools and their own in-country expertise to identify hashtags. Each team of authors is assembled for their language skills, knowledge of a country’s political culture, and familiarity with the platform and issues being studied. In every data collection, we test our initial set of hashtags by collecting smaller test data sets and analyzing the co-occurrence of hashtags. In this iterative process of pre-tests and sub-samples, we can identify important hashtags that are missing from our initial list and expand the list of hashtags to track. Since hashtag use changes over the natural course of events, we tend to focus on a set of core hashtags that are the most stably associated with a candidate, a political group, or an event. Occasionally we add new hashtags if they rise to prominence.
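
The co-occurrence pre-test described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: the function names, the `min_count` threshold, and the sample tweets and hashtags are all hypothetical, chosen only to show the idea of counting which hashtags appear alongside an initial seed list in a small test sample.

```python
import re
from collections import Counter

# Hypothetical pattern: a "#" followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the set of lowercase hashtags found in one tweet's text."""
    return {tag.lower() for tag in HASHTAG_RE.findall(text)}


def candidate_hashtags(tweets, seed_tags, min_count=2):
    """Count non-seed hashtags that co-occur with at least one seed hashtag.

    Returns (hashtag, count) pairs, most frequent first, keeping only
    hashtags seen at least `min_count` times (an assumed threshold).
    """
    seeds = {tag.lower().lstrip("#") for tag in seed_tags}
    counts = Counter()
    for text in tweets:
        tags = extract_hashtags(text)
        if tags & seeds:  # the tweet uses at least one tracked hashtag
            counts.update(tags - seeds)
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


# Made-up test sample, for illustration only.
sample = [
    "Polls open tomorrow #election2016 #vote",
    "Big rally tonight #election2016 #vote #rally",
    "Unrelated post #cats",
    "Get out and #vote #election2016",
]
print(candidate_hashtags(sample, ["#election2016"]))
```

A frequently co-occurring hashtag that is absent from the initial list is only a candidate. In the process described above, researchers with in-country expertise would still decide whether it belongs in the tracked set.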


Where can I find the list of hashtags you’ve studied?

All the hashtags we use when we capture data from Twitter are listed in the memo, either in the paragraphs about the data capture process or in the “Note” of the table summarizing the trends. Sometimes, if the same data is used across several short data memos, the full list of hashtags appears only in the first few memos on that topic, and for the sake of brevity later memos don’t reproduce it. But you can always find the list by tracing the citations back to the previous memo on that topic.


Which APIs do you use for collecting your Twitter data?

Most of our analyses are based on data from Twitter’s free and public Streaming API. This allows us to archive traffic around a set of hashtags associated with a political event in near real time. We use other Twitter APIs as well: for example, we use the Search API to collect more information about suspicious accounts, look up the timelines of users, and collect post and user metadata. To better answer research questions about political conversations we haven’t archived in real time, we have also bought data from GNIP, Twitter’s in-house data broker. Each of these methods has its own limitations in the form of data caps, poor sampling documentation from Twitter, or lost fidelity in conversations where content has been deleted or accounts have been suspended. We always detail the most important limitations of the data or provide citations to methodology papers by other academics who do a more extensive job of detailing the challenges of working with social media data.


Do you capture and analyze content that gets removed later by Twitter?

Yes. We capture content via the Streaming API that is later removed, either by Twitter or by the accounts themselves. Only Twitter knows how much of this content has been removed, and under what circumstances.


The sample periods for your studies differ, and sometimes include data captured after voting day. Why?

We usually sample a few extra days of social media traffic so that we can understand the full arc of how junk news and social media algorithms impact public life. Sometimes the highly automated accounts we track continue to produce content even after the close of the polls. Sometimes the most aggressive accounts declare their particular candidates the winner after polls close but before ballot counting starts and election results have been certified. We’ve even seen fresh junk news stories start on voting day but swell in traffic in the days after an election. So it makes sense to customize the sample period to capture the sensible arc of political conversation around a campaign, election, or topic.


How do you decide what is junk news?

Our typology of junk news is grounded in close examination of the content being shared in each sample. Junk news content includes various forms of propaganda and ideologically extreme, hyper-partisan, or conspiratorial political news and information. Much of this content is deliberately produced false reporting. It seeks to persuade readers about the moral virtues or failings of organizations, causes or people and presents commentary as a news product. To identify junk news, our coders evaluate the reputation of the publisher, the content of the item, and its presentation. We look for professional journalistic standards and evidence of fact-checking. We take note of unsafe generalizations and other logical fallacies, ad hominem attacks, and the overuse of attention-grabbing techniques, such as excessive capitalization, exclamation marks, emotionally charged words and distorted images. Coders are given a short training course, and we actively revisit and evaluate new content as we encounter it. Sometimes we must adapt our typologies based on new types of content, and all of our coding and recoding processes are described in the methodology section of each data memo, working paper, and published manuscript.


How do you train coders?

All of our coders are experts in the political context the analysis is focusing on. They are native speakers or proficient in the language spoken in the country context analyzed, are highly knowledgeable about the media landscape, and are deeply familiar with the political debates, figures and contexts shaping the social media conversation. Training workshops are conducted by researchers from the project’s core team and take place over several days. The grounded typology is discussed in depth, examples and previous coding decisions are discussed, and coders participate in several pre-tests whereby inter-coder reliability is assessed. The materials for our coder training sessions are also online so you can see how our team prepares to review content and make coding decisions.
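
As a rough illustration of how inter-coder reliability might be assessed in a pre-test, the sketch below computes Cohen's kappa for two coders. This FAQ does not say which reliability statistic the team uses, so the choice of Cohen's kappa is an assumption, and the labels and codings shown are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed share of
    items both coders labeled identically and p_e is the agreement
    expected by chance from each coder's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical pre-test: two coders label the same four items.
coder_a = ["junk", "junk", "news", "news"]
coder_b = ["junk", "news", "news", "news"]
print(cohens_kappa(coder_a, coder_b))
```

In this made-up pre-test the coders agree on three of four items (0.75), chance agreement is 0.5, and kappa is therefore 0.5. A value of 1 would mean perfect agreement and 0 would mean agreement no better than chance.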


What do you do to be transparent about your research process?

Since we seek more transparency from social media platforms and the political actors who abuse us all, we take extra care to be exceptionally transparent about our work. (1) Our research gets reviewed by other academics at the planning stage, as we implement the research plan, and as we write up our findings. (2) We put full descriptions of our grant awards on our project website. (3) We disclose all funders on our website and in individual data memos, working papers, and published manuscripts. (4) We only conduct research that has been reviewed and approved by our university ethics board, a process that can involve getting additional feedback from researchers outside the university. (5) All our publications explain our typologies and data collection methods, or provide the citations to previous iterations of the research where those explanations can be found. (6) We abide by the conventions of scholarly citation, acknowledgement, and peer review. (7) We adapt our methods when scholarly feedback or research from other teams shows a better way forward, and document changes using those conventions. (8) We actively maintain and protect an archive of data sets, recorded interviews, and archival materials so that we can revisit findings, reconstruct events, and preserve material for future researchers. (9) We get data sets online as soon as we can so that others may explore the samples. (10) We share findings with technology companies themselves at the same time that we share findings with journalists. (11) We respond to methodology queries from technology companies, alert them to families of automated and troll accounts that are degrading public life, and advise them on doing better detection of bots and fake users. (12) We respond to reasonable inquiries from journalists, policy makers and the interested public in a timely manner.


In fact, one of our peer reviewers suggested that an accessibly written FAQ on research design could improve the transparency of our work. So we have published this FAQ and will actively maintain it. Our ultimate goal is research excellence, and along the way we can also advance public discussion about the causes and consequences of computational propaganda.