How many Wikipedia references are available to read? We measured the proportion of open access sources across languages and topics.

Let’s say you’re planning a trip to a subtropical region and you want to learn about available vaccines for yellow fever. You look up the English Wikipedia article. You’re lucky to find a well-sourced section, with a wealth of references, many of them pointing to information from public health agencies and reputable news articles. Great!

But then you decide to learn more about the scientific research on the treatment of yellow fever. You try to follow three citations, each a review article in a scholarly journal, and find that they’re all behind a paywall and inaccessible to you.

As it turns out, you’re not alone. When following a link to the official version of a scholarly article, Wikipedia readers are twice as likely to hit a paywall as to reach a version they can freely read.

Percentage of open citations

Citations are the bridge between Wikipedia articles and a broader landscape of reliable, secondary sources. Citations not only allow readers to verify the reliability of the facts they find in Wikipedia; through citations readers can also deep-dive into any given topic by exploring the books, scholarly publications, and news stories referenced in an article.

Wikipedia’s use of open access sources, publications that can be accessed for free by anyone online,^[1] plays a critical role in supporting fact-checking and in fostering digital literacy. Initiatives such as the Wikipedia Library’s partnerships allow editors to gain free access to the paywalled literature, while automated tools such as OAbot help editors find openly accessible versions that can be added to Wikipedia article references. However, Wikipedia does not have policies that explicitly favor open access over subscription-based sources when citing information. In practice, Wikipedia still links to many paywalled sources alongside open access ones. How does this affect readers?

We recently released a dataset of all citations with identifiers in Wikipedia. This does not represent every single book, article, and website referenced on Wikipedia (more analysis of that to come), but rather the substantial subset of sources that have a code that uniquely identifies the work, such as a Digital Object Identifier (DOI) in the case of academic articles, or an International Standard Book Number (ISBN) for books. You might think of this large subset of citations as roughly representing the scholarly sources of Wikipedia. To estimate the proportion of these scholarly sources that are free to read, we are releasing a new dataset that classifies these citations by accessibility level and topic.

To create this dataset, we cross-referenced our existing data with data provided by Unpaywall, a database that gathers accessibility information for more than 19 million articles. We associated the digital object identifier (DOI) of each scholarly publication cited in Wikipedia with an accessibility label: “Open” if the publisher (source) provides a free copy, “Closed” if it is a paywalled publication, or “Available” if the official version is paywalled but an open copy exists elsewhere on the web (e.g., legally deposited by the author in a university repository). In total, we gathered accessibility values for around 450,000 scholarly publications cited across 300 Wikipedia languages.^[2]
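The three labels can be illustrated with a minimal sketch. This is not the code used to build the dataset: the record fields `free_at_publisher` and `free_elsewhere`, the function name, and the example DOIs are all hypothetical and stand in for whatever accessibility information is available for each DOI.

```python
from collections import Counter


def accessibility_label(free_at_publisher: bool, free_elsewhere: bool) -> str:
    """Map two hypothetical accessibility flags to the labels used in the post."""
    if free_at_publisher:
        # The official version is free to read at the source.
        return "Open"
    if free_elsewhere:
        # The official version is paywalled, but a free copy exists elsewhere.
        return "Available"
    # No free copy is known.
    return "Closed"


# Hypothetical records for illustration only (not real DOIs or real data).
records = [
    {"doi": "10.1000/example.1", "free_at_publisher": True, "free_elsewhere": False},
    {"doi": "10.1000/example.2", "free_at_publisher": False, "free_elsewhere": True},
    {"doi": "10.1000/example.3", "free_at_publisher": False, "free_elsewhere": False},
]

labels = {
    r["doi"]: accessibility_label(r["free_at_publisher"], r["free_elsewhere"])
    for r in records
}
label_counts = Counter(labels.values())
print(label_counts)
```

Checking the publisher flag first means a publication that is free both at the source and in a repository is counted as “Open”, which matches the definitions above.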

We find that fewer than half of the scholarly publications cited with an identifier in Wikipedia are freely available on the web: 29% are free to read at the source, while an additional 10% have a free-to-read version available elsewhere.^[3]
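As a rough check on these figures, the sketch below derives the remaining shares. It assumes that the three labels are mutually exclusive and cover every publication in the dataset, and it works with the rounded percentages quoted here, so the results are approximate.

```python
# Rounded percentages quoted in the text.
open_at_source = 29        # "Open": free to read at the publisher
available_elsewhere = 10   # "Available": paywalled at source, free copy elsewhere

# Assumption: the three labels partition the dataset.
free_somewhere = open_at_source + available_elsewhere   # 39
closed = 100 - free_somewhere                           # 61

# The official version is paywalled for both "Closed" and "Available" items.
paywalled_at_source = closed + available_elsewhere      # 71

# How much more likely a reader is to hit a paywall at the official version
# than to find it free to read.
ratio = paywalled_at_source / open_at_source
print(f"free somewhere: {free_somewhere}%, closed: {closed}%")
print(f"paywalled at source vs. open at source: about {ratio:.1f} to 1")
```

Under these assumptions, a link to the official version leads to a paywall more than twice as often as it leads to a free copy.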

Topic accessibility

The fraction of free-to-read content depends heavily on the subject you’re interested in. To explore how often readers can access content without hitting a paywall depending on their interests, we assigned each publication a topic category based on the main topic of the Wikipedia pages that cite it. To find the topic of each page, we used the drafttopic tool, which assigns a Wikipedia page a topic from the WikiProject hierarchy. Since the tool works for English Wikipedia only, we assigned topics to pages from non-English Wikipedias by finding their corresponding English Wikipedia version through Wikidata.^[4]
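The topic assignment described above can be sketched as follows. All names and lookup tables here are hypothetical, and the aggregation step (taking the most common topic among citing pages) is an assumption made for illustration, since the text does not spell out how a publication cited from pages with different topics is handled.

```python
from collections import Counter
from typing import Optional

# Hypothetical lookup tables for illustration only.
# (language, page title) -> Wikidata item
page_to_item = {
    ("fr", "Fièvre jaune"): "Q-EXAMPLE-1",
    ("it", "Febbre gialla"): "Q-EXAMPLE-1",
    ("fr", "Page sans équivalent"): "Q-EXAMPLE-2",
}
# Wikidata item -> English Wikipedia page title
item_to_enwiki = {"Q-EXAMPLE-1": "Yellow fever"}
# English Wikipedia page title -> topic (as a topic model might return)
enwiki_topic = {"Yellow fever": "Medicine", "Vaccine": "Medicine", "Mosquito": "Biology"}


def page_topic(lang: str, title: str) -> Optional[str]:
    """Return a topic for a page, going through Wikidata for non-English pages."""
    if lang == "en":
        return enwiki_topic.get(title)
    item = page_to_item.get((lang, title))
    en_title = item_to_enwiki.get(item)
    # None if the page has no English counterpart (see footnote 4).
    return enwiki_topic.get(en_title)


def publication_topic(citing_pages: list) -> Optional[str]:
    """Assumed aggregation: the most common topic among the citing pages."""
    topics = [t for t in (page_topic(lang, title) for lang, title in citing_pages) if t]
    if not topics:
        return None
    return Counter(topics).most_common(1)[0][0]


citing = [("en", "Yellow fever"), ("fr", "Fièvre jaune"), ("en", "Mosquito")]
print(publication_topic(citing))
```

A page without an English counterpart contributes no topic at all, which is the limitation noted in the footnote.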

To explore this data at a glance, we published a number of visualizations through a Python notebook.

Accessibility of publications cited in Wikipedia by article topic.

If you are passionate about space, you are far more likely to be able to freely access the sources of the Wikipedia articles you read: around 55% of publications in space-related articles are open. In contrast, if you are interested in chemistry, you will very likely have to pay to read the publications cited in your favorite Wikipedia articles or rely on an institutional subscription: 83% of publications in chemistry articles across languages are paywalled.

Accessibility also varies significantly across languages. If you read Wikipedia in Belarusian, you will be linked to a large number of free-to-read scholarly publications (around 45% are open). Conversely, if you only speak Khmer, only around 20% of the scholarly publications cited in your Wikipedia language edition can be freely accessed.

Distribution of scholarly articles cited in each Wikipedia language edition by percentage of open access publications cited (left) and percentage of freely available paywalled publications cited (right). See our notebook visualizations for more comprehensive and interactive plots.

Beyond differences across topics and languages, our data also reveals areas of Wikipedia content that can be ‘un-paywalled’, i.e. topics for which we can improve the openness of referenced publications. For example, while articles in mathematics cite relatively few sources that are free to read at the official version (only 15%), we found that a larger proportion of publications in this field (25%) have a free copy available elsewhere on the web. Similarly, Punjabi Wikipedia cites many paywalled sources (only 17% are open); however, around 24% of scholarly articles referenced in this language edition do have a non-paywalled version.
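A breakdown of this kind can be computed with a short aggregation. The sketch below assumes a hypothetical table with one row per cited publication and the columns `topic` and `label`; the actual column names in the released dataset may differ, and the rows are made-up examples rather than real data.

```python
from collections import Counter, defaultdict

LABELS = ("Open", "Available", "Closed")


def shares_by_group(rows: list, key: str) -> dict:
    """Return, for each group, the fraction of publications with each label."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[key]][row["label"]] += 1
    shares = {}
    for group, label_counts in counts.items():
        total = sum(label_counts.values())
        shares[group] = {label: label_counts[label] / total for label in LABELS}
    return shares


# Hypothetical rows for illustration only.
rows = [
    {"topic": "Mathematics", "label": "Open"},
    {"topic": "Mathematics", "label": "Available"},
    {"topic": "Mathematics", "label": "Closed"},
    {"topic": "Mathematics", "label": "Closed"},
    {"topic": "Space", "label": "Open"},
    {"topic": "Space", "label": "Closed"},
]

shares = shares_by_group(rows, key="topic")

# Topics ranked by their "Available" share: candidates for un-paywalling,
# since a free copy already exists and could be linked from the citation.
candidates = sorted(shares, key=lambda g: shares[g]["Available"], reverse=True)
print(candidates)
```

The same function works for languages by passing a different `key`, provided the rows carry that column.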

Exploring the data

Earlier this year, a team of developers from the Wikimedia Foundation, eLife and SAGE Publishing used the dataset to prototype a web-based tool called WikiCiteVis to search and visualize the world of Wikipedia citations. During the 2018 eLife Innovation Sprint, the team started working on the tool, and they have since produced a working web app that allows users to query the updated Wikipedia citation database in a browser, including links to Wikipedia articles and underlying cited articles. You can now see how a scholarly article or book is cited in Wikipedia using WikiCiteVis. Users are also welcome to provide feedback and suggest new features via GitHub, or to consider building these features themselves—find out more on eLife Labs.

You can access the full dataset of citations by topic and accessibility at https://doi.org/10.6084/m9.figshare.6819710. For our earlier data release and analysis of the most cited sources in Wikipedia, check out this blog post or read more in Wired and Nature.

We are also conducting research to understand how Wikipedia readers use citations (and why) and to identify when individual statements need additional citations. Stay tuned for further updates on these projects.

Acknowledgements

We would like to thank Heather Piwowar and Jason Priem from ImpactStory for making Unpaywall data publicly available.

We would also like to thank David Moulton, Senior Front-End Developer at eLife, his collaborators Sean Wiseman and Chris Wilkinson, along with Sam Walton from the Wikimedia Foundation and Ian Mulvany from SAGE, for working on WikiCiteVis during and after the eLife Innovation Sprint.

Footnotes

1. For the purpose of this analysis, we use the definition of “open” provided by Unpaywall, which refers to the accessibility of a scholarly article, not whether the article is published under free licenses that allow reuse. Studying the prevalence of free-licensed (or libre) open access citations is a challenge due to the lack of comprehensive data.

2. This analysis is limited to citations by DOI to assess the accessibility of a paper’s version of record. Citation templates often include additional links to accessible versions, when available.

3. In context, the practical accessibility of all citations on Wikipedia may be significantly higher. This dataset only contains citations with an identifier, which are much more common for scholarly journals (which are more likely to be paywalled). Many news websites and other types of sources do not have an identifier (and are not in this dataset) but are free to access online.

4. Although this methodology gives us interesting insights into accessibility by topic, its main limitation is that we can assign a topic to a Wikipedia page only if a corresponding page exists in English Wikipedia.

Miriam Redi, Research Scientist
Dario Taraborelli, Director, Head of Research
Jake Orlowitz, The Wikipedia Library, Community Programs
Wikimedia Foundation