We frequently get questions about how we do our research. Our most important research and methods decisions are detailed in our data memos, working papers, and peer reviewed published research. But here are some straightforward answers to the most common questions we receive.
What is the difference between a data memo, a working paper and a formally published piece of research?
Data memos are designed to present quick snapshots of analysis on current events in a short format. This approach allows us to conduct real-time analysis during pivotal moments of public and political life, and to shed light on attempted incidents of computational propaganda. The project often publishes data memos in a series, adding new details about an ongoing political event or reporting new findings. We build on our initial memos and rigorous analysis to inform our peer reviewed pieces.
Is your research reviewed by your peers?
Yes. The vast majority of our work has been funded by the world’s largest, most prestigious public science agencies, including the National Science Foundation and European Research Council. These agencies coordinate extensive double-blind peer review committees that review our research design and methodological approaches. In addition, our academic writing is regularly critiqued and reviewed. We have published our research in book chapters, edited books, conference proceedings, and peer reviewed articles, and we frequently present our work at academic conferences. Each type of publication involves a different kind of review: editorial review, blind review, double-blind review, or unblinded review.
Why do you publish data memos in “real time”?
These days, many of us believe that we have a responsibility to contribute to public life, and especially so during critical moments such as elections and referenda. For our real-time data memos and working papers we take advantage of scholarly networks to have our research reviewed informally before we publish it on our webpage. For our election observatory data memos, we typically receive feedback from researchers with academic appointments in communication, information science, and computer science departments. We also seek feedback from country-specific experts.
How has your research design changed over time?
The science of computational propaganda and misinformation has changed over time. From the inception of the project in 2014 to 2016, our research mainly focused on originating novel methodological approaches to study social media manipulation, theorizing new phenomena related to computational propaganda, and analyzing patterns of malicious behavior on social media during critical moments of public life in real time. In the aftermath of digital interference in the 2016 US Presidential Election, from 2016 to 2018, our research matured to study a multitude of phenomena related to junk news, manipulative cyber operations, and computational propaganda around the globe, including Asia, Europe, and the Americas. More recently, our research has examined the malicious use of social media on new or difficult-to-study platforms including Instagram, Facebook, and WhatsApp, as well as AI-generated fakes and regulatory responses to computational propaganda. As a research team we want to inform evidence-based public discourse, and we have developed guidelines and resources for civil society, parties, policymakers, and platforms.
How do you select your hashtags when you study political communication on Twitter?
To get a large and relevant sample of social media data, we collect tweets by following particular hashtags that our research team has identified as being actively used during a political event. Our researchers rely on platform tools and their own in-country expertise to identify hashtags. Each team of experts is assembled for its knowledge of a country’s political culture and language and its familiarity with the issues being studied. In every data collection, we test our initial set of hashtags by collecting test data sets and analyzing the co-occurrence of hashtags. Through this iterative process of pre-tests and sub-samples, we can identify important hashtags that are missing from our initial list and expand it. We include a full list of all of our hashtags in every publication or data supplement.
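As an illustration only, the co-occurrence check described above might be sketched as follows. This is not the project's actual pipeline: the function name `cooccurring_hashtags`, the `min_count` cutoff, and the input layout (one list of hashtags per tweet) are assumptions made for the example.

```python
from collections import Counter


def cooccurring_hashtags(tweets, seed_tags, min_count=2):
    """Return non-seed hashtags that appear in tweets containing a seed hashtag.

    tweets: iterable of hashtag lists, one list per tweet (hypothetical layout).
    seed_tags: the initial hashtag list chosen by the researchers.
    min_count: illustrative cutoff for reporting a candidate hashtag.
    """
    seeds = {tag.lower() for tag in seed_tags}
    counts = Counter()
    for tags in tweets:
        tagset = {tag.lower() for tag in tags}
        # Only tweets that matched the seed list tell us about co-occurrence.
        if tagset & seeds:
            counts.update(tagset - seeds)
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


# Toy pre-test sample; real data would come from a test collection.
sample = [
    ["#election", "#vote"],
    ["#election", "#vote", "#debate"],
    ["#weather"],
    ["#poll", "#vote"],
]
candidates = cooccurring_hashtags(sample, ["#election", "#poll"])
print(candidates)
```

Candidates surfaced this way would still need review by country experts before being added to the list; frequency alone does not establish relevance.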
Which APIs do you use for collecting your Twitter data?
Most of our analyses are based on data from Twitter’s free and public Streaming API. This allows us to archive traffic around a set of hashtags associated with a political event in near real time. We use other Twitter APIs as well: for example, we use the Search API to collect more information about suspicious accounts, look up the timelines of users, and collect post and user metadata. We have also bought data from GNIP, Twitter’s in-house data broker. Each of these methods has its own limitations in the form of data caps, poor sampling documentation from Twitter, or lost fidelity in conversations where content has been deleted or accounts have been suspended.
Do you capture and analyze content that gets removed later by Twitter?
Yes. Via the Streaming API, we capture content that is later removed either by Twitter or by the accounts themselves. Only Twitter knows how much of this content has been removed, and under what circumstances.
The sample periods for your studies differ, and sometimes include data captured after voting day. Why?
We usually sample a few extra days of social media traffic so that we can understand the full arc of how junk news and social media algorithms impact public life. Sometimes the accounts we track continue to produce content even after the close of the polls. For example, our work has documented instances where accounts disseminated messages designed to sow distrust about the integrity of an election.
How do you decide what is junk news?
Our typology of junk news is grounded in close examination of the content being shared in each sample. A detailed methods section on our grounded typology is available in our most recent peer reviewed publication here and here. Junk news content includes various forms of propaganda and ideologically extreme, hyper-partisan, or conspiratorial political news and information. To be classified as junk news content, the source must fulfil at least three of these five criteria:
- Professionalism: These outlets do not employ standards and best practices of professional journalism. They refrain from providing clear information about real authors, editors, publishers, and owners. They lack transparency and accountability and do not publish corrections of debunked information.
- Style: These sources use emotionally driven language that includes emotive expressions, hyperbole, ad hominem attacks, misleading headlines, excessive capitalization, unsafe generalizations and logical fallacies, moving images, and lots of pictures and mobilizing memes.
- Credibility: These outlets rely on false information and conspiracy theories, which they often employ strategically. They report without consulting multiple sources and do not fact-check. Sources are often untrustworthy and standards of production lack reliability.
- Bias: Reporting by these outlets is highly biased, ideologically skewed, or hyper-partisan, and news reporting frequently includes strongly opinionated commentary.
- Counterfeit: These sources mimic established news reporting. They counterfeit fonts, branding, and stylistic content strategies. This category also includes commentary disguised as news, with references to news agencies and credible sources, and headlines are written in a news tone with date, time, and location stamps.
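The "at least three of these five criteria" rule above can be expressed as a simple count. The sketch below is a minimal illustration and assumes that a trained coder has already judged each criterion for a source; the names `CRITERIA` and `is_junk_news` and the dictionary layout are hypothetical, and the judgments themselves remain a human task.

```python
CRITERIA = ("professionalism", "style", "credibility", "bias", "counterfeit")


def is_junk_news(coding, threshold=3):
    """Apply the three-of-five rule to one coded source.

    coding: dict mapping a criterion name to True if the coder judged
    that the source meets it. Missing criteria count as not met.
    """
    unknown = set(coding) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    met = sum(1 for name in CRITERIA if coding.get(name, False))
    return met >= threshold


# Hypothetical coding sheet for a single source.
example = {"style": True, "bias": True, "counterfeit": True, "credibility": False}
print(is_junk_news(example))
```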
How do you train coders?
All of our coders are experts in the political context on which the analysis focuses. They are native speakers of, or proficient in, the language spoken in the country being analyzed, are highly knowledgeable about the media landscape, and are deeply familiar with the political debates, figures and contexts shaping the social media conversation. Training workshops are conducted by researchers from the project’s core team and take place over several weeks. Coders are required to achieve an intercoder reliability score of Krippendorff’s alpha ≥ 0.8, signaling good concept formation and a strong command of our method.
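For readers unfamiliar with the statistic, the following is a minimal sketch of Krippendorff's alpha for nominal (categorical) labels, computed from a coincidence matrix. It is an assumption that the nominal variant is the relevant one here; the function name and input layout are illustrative, and an established statistics package would be preferable for real analyses.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists; each inner list holds the labels that the coders
    assigned to one item. Items with fewer than two labels are ignored.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Each ordered pair of labels from different coders adds 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item coded by two coders")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1 - observed / expected


# Two hypothetical coders labelling four sources.
ratings = [["junk", "junk"], ["other", "other"], ["junk", "other"], ["junk", "junk"]]
alpha = krippendorff_alpha_nominal(ratings)
print(round(alpha, 3), alpha >= 0.8)
```

In this toy example the single disagreement among four items is enough to pull alpha well below 0.8, which shows how demanding that threshold is on small samples.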
How do you identify amplifier accounts?
We describe amplifier accounts as accounts that deliberately seek to increase the volume of traffic or the attention being paid to particular messages. These accounts include automated, semi-automated and highly active human-curated accounts on social media. We define amplifier accounts as those that post 50 times a day or more on one of the selected hashtags. This detection methodology does not capture amplifier accounts that tweet at lower frequencies. Despite the simplicity of our metric, more complex methods using machine learning yield comparable numbers of false positives and remain contested in the field of computational social science. By contrast, we have identified very few human users who tweet more than 49.5 times per day on average. More detailed empirical analysis of amplifier accounts is available in our latest peer reviewed publications here and here.
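The frequency threshold described above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: it assumes the rate is averaged over the days in the sample period, and the names `find_amplifiers` and `days_in_sample` and the input layout (one account name per collected tweet) are hypothetical.

```python
from collections import Counter


def find_amplifiers(tweet_authors, days_in_sample, threshold=50):
    """Return accounts averaging at least `threshold` posts per day.

    tweet_authors: iterable with one account name per collected tweet
    that used one of the selected hashtags.
    days_in_sample: length of the sample period in days (assumed averaging window).
    """
    if days_in_sample <= 0:
        raise ValueError("days_in_sample must be positive")
    counts = Counter(tweet_authors)
    return sorted(
        account for account, n in counts.items() if n / days_in_sample >= threshold
    )


# Toy three-day sample: one very active account, one ordinary account.
authors = ["bot_like"] * 150 + ["casual_user"] * 12
amplifiers = find_amplifiers(authors, days_in_sample=3)
print(amplifiers)
```

As the text notes, a fixed threshold of this kind misses accounts that amplify content at lower frequencies.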
Transparency & Replicability
What do you do to be transparent about your research process?
Since we seek more transparency from social media platforms and the political actors who abuse us all, we take extra care to be exceptionally transparent about our work. (1) Our research gets reviewed by other academics at the planning stage, as we implement the research plan, and as we write up our findings. (2) We put full descriptions of our grant awards on our project website. (3) We disclose all funders on our website and in individual publications. (4) We only conduct research that has been reviewed and approved by our university ethics board. (5) All our publications explain our typologies and data collection methods. (6) We abide by the conventions of scholarly citation, acknowledgement, and peer review. (7) We adapt our methods when scholarly feedback or research from other teams shows a better way forward. (8) We actively maintain and protect an encrypted archive of data sets, recorded interviews, and archival materials so that we can revisit findings, reconstruct events, and preserve material for future research. (9) We get data sets online as soon as we can so that others may explore the samples. (10) We share findings with technology companies at the same time that we share findings with journalists. (11) We respond to methodology queries, alert technology companies to suspicious behavior on their platforms, and advise them on better detection of bots and fake users. (12) We respond to reasonable inquiries from journalists, policy makers and the interested public in a timely manner.
Do you publish replication data?
We publish open access replication data from all of our studies here: http://comprop.oii.ox.ac.uk/data/
Our ultimate goal is research excellence, and along the way we can also advance public discussion about the causes and consequences of computational propaganda. We are happy to respond to questions and look forward to your feedback.