Wikidata in the Classroom and the WikiCite project

The following post was presented by Wikimedian in Residence, Ewan McAndrew, at the Repository Fringe Conference 2018 held on 2nd & 3rd July 2018 at the Royal Society of Edinburgh.

 

Hi, my name’s Ewan McAndrew and I work at the University of Edinburgh as the Wikimedian in Residence.

My talk’s in two parts:

The first part is on teaching data literacy with the Survey of Scottish Witchcraft database and Wikidata.

Contention #1: since the City Region Deal, there is a pressing need to implement data literacy in the curriculum, to produce a workforce equipped with the data skills necessary to meet the needs of Scotland’s growing digital economy. This therefore presents a massive opportunity for educators, researchers, data scientists and repository managers alike.

Wikidata is the sister project of Wikipedia and the backbone of all the Wikimedia projects: a centralised hub of structured, machine-readable, multilingual linked open data. An introduction to Wikidata can be found here.

I was invited, along with 13 other ‘problem holders’, to a ‘Data Fair’ on 26 October 2017 hosted by course leaders on the Data Science for Design MSc. We were each afforded just five minutes to pitch a dataset for the 45 students on the course to work on in groups as a five-week project.

The ‘Data Fair’ held on 26 October 2017 for Data Science for Design MSc students. CC-BY-SA, own work.

Two groups of students volunteered enthusiastically to help surface the data from the Survey of Scottish Witchcraft database, a fabulous piece of work carried out at the University of Edinburgh from 2001 to 2003. It chronicles information about accused witches in Scotland in the period 1563-1736, their trials and the individuals involved in those trials (lairds, sheriffs, prosecutors etc.), but it had remained somewhat static and unloved in a Microsoft Access database since the project concluded in 2003. So students at the 2017 Data Fair were invited to consider what could be done if the data were exported into Wikidata with attribution, linked back to the source database to provide verifiable provenance, given multilingual labels and linked to other complementary datasets. Beyond this, what new insights and visualisations of the data could be achieved?

There were several areas of interest: course leaders on the Data Science for Design MSc were keen for the students to work with ‘real world’ datasets in order to give them practical experience ahead of their dissertation projects.

 “A common critique of data science classes is that examples are static and student group work is embedded in an ‘artificial’ and ‘academic’ context. We look at how we can make teaching data science classes more relevant to real-world problems. Student engagement with real problems—and not just ‘real-world data sets’—has the potential to stimulate learning, exchange, and serendipity on all sides, and on different levels: noticing unexpected things in the data, developing surprising skills, finding new ways to communicate, and, lastly, in the development of new strategies for teaching, learning and practice.”

Towards Open-World Scenarios: Teaching the Social Side of Data Science by Dave Murray Rust, Joe Corneli and Benjamin Bach.

Beyond this, there were other benefits to the exercise. Tim Berners-Lee, the inventor of the Web, has suggested a 5-star deployment scheme for Open Data (illustrated in the picture and table below). Importing data into Wikidata makes it 5 star data!

By Michael Hausenblas, James G. Kim, five-star Linked Open Data rating system developed by Tim Berners-Lee. (http://5stardata.info/en/) [CC0], via Wikimedia Commons
Number of stars | Description | Properties | Example format
★ | make your data available on the Web (whatever format) under an open license | Open license | PDF
★★ | make it available as structured data (e.g., Excel instead of image scan of a table) | Open license; Machine readable | XLS
★★★ | make it available in a non-proprietary open format (e.g., CSV instead of Excel) | Open license; Machine readable; Open format | CSV
★★★★ | use URIs to denote things, so that people can point at your stuff | Open license; Machine readable; Open format; Data has URIs | RDF
★★★★★ | link your data to other data to provide context | Open license; Machine readable; Open format; Data has URIs; Linked data | LOD
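
The cumulative nature of the scheme in the table above (each star requires everything the previous star required) can be sketched in a few lines of Python. This is only an illustration: the `Dataset` class and the `star_rating` function are hypothetical names invented here, not part of any Wikidata or 5stardata tooling.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Properties of a published dataset, one flag per column entry in the table."""
    open_license: bool = False
    machine_readable: bool = False
    open_format: bool = False
    has_uris: bool = False
    linked_data: bool = False


def star_rating(ds: Dataset) -> int:
    """Return 0-5 stars; a star only counts if all earlier criteria are met."""
    criteria = [
        ds.open_license,      # 1 star
        ds.machine_readable,  # 2 stars
        ds.open_format,       # 3 stars
        ds.has_uris,          # 4 stars
        ds.linked_data,       # 5 stars
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# A CSV file published under an open licence meets the first three criteria.
csv_export = Dataset(open_license=True, machine_readable=True, open_format=True)
print(star_rating(csv_export))
```

Under these assumptions, a dataset with an open format but no open licence still scores zero, which mirrors the way the scheme treats the open licence as the entry requirement.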

Open data producers can use Wikidata IDs as identifiers in datasets to make their data 5 star linked open data. As of June 2018, Wikidata featured in the latest Linked Open Data cloud diagram on lod-cloud.net as a dataset published in the linked data format containing over 5,100,000,000 triples.

Over a series of workshops, the Wikidata assignment also afforded the students the opportunity to develop their understanding of, and engagement with, issues such as data completeness, data ethics, digital provenance, data analysis and data processing, as well as making practical use of a raft of tools and data visualisations. It also motivated student volunteers to surface a much-loved repository of information as linked open data to enable further insights and research. It was a project that the students felt proud to take part in and found “very meaningful”. (The students even took the opportunity to consult with professors of History at the university to gain more of an understanding of the period in which these witch trials took place, such was their interest in the subject.)

Feedback from students at the conclusion of the project included:

  • “After we analysed the data, we found we learned the stories of the witches and we learned about European culture especially in the witchhunts.”
  • “We had wanted to do a happy project but finally we learned much more about these cultures so it was very meaningful for us.”
  • “In my opinion, it’s quite useful to put learning practice into the real world so that we can see the outcome and feel proud of ourselves… we learned a lot.”
  • “Thank you for inviting us and appreciating our video. It’s an unforgettable experience in my life. Thank you so much.”

As a result of the students’ efforts, we now have 3219 items of data on the accused witches in Wikidata (spanning 1563 to 1736). We also now have data on 2356 individuals involved in trying these accused witches. Finally, we have items for 3210 of the witch trials themselves. This means we can link and enrich the data further by adding location data, dates, occupations, places of residence, social class, marriages, and penalties arising from the trial.

The fact that Wikidata is also linked open data means that students can help connect to and draw on a variety of other datasets in multiple languages, helping to fuel discovery through exploring the direct and indirect relationships at play in this semantic web of knowledge.

 

Descendants of King James VI and I, king during union of English and Scottish crowns

And we can see an example of this semantic web of related entities, or historical individuals in this case, here in this visualisation of the descendants of King James I of England and James VI of Scotland (as shown in the pic above but do click on the link for a live rendering).
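
A visualisation like this is typically driven by a SPARQL query against the Wikidata Query Service. The following is a minimal sketch of how such a query might be built in Python, assuming the family tree is followed through Wikidata's "child" property (P40) with a transitive property path. The function names are hypothetical, and the item ID has to be supplied by the caller; it is not necessarily the query behind the visualisation above.

```python
import re
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def descendants_query(ancestor_qid: str, limit: int = 100) -> str:
    """Build a SPARQL query listing every descendant of the given item.

    `wdt:P40+` follows the "child" property one or more times, so it
    returns children, grandchildren and so on.
    """
    if not re.fullmatch(r"Q[1-9]\d*", ancestor_qid):
        raise ValueError(f"not a Wikidata item ID: {ancestor_qid!r}")
    return (
        "SELECT ?descendant ?descendantLabel WHERE {\n"
        f"  wd:{ancestor_qid} wdt:P40+ ?descendant .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def query_url(sparql: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})
```

To use it, look up the item ID for James VI and I on Wikidata, pass it to `descendants_query`, and fetch the URL returned by `query_url` with any HTTP client.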

We can also see the semantic web at play in the class-level overview below of gene ontologies (505,000 objects) loaded into Wikidata, linking these genes to items of data on related proteins and on related diseases, which, in turn, have related chemical compounds and pharmaceutical products used to treat these diseases. Many of these datasets have been loaded into Wikidata by, or are maintained by, the GeneWiki initiative (around a million Wikidata items of biomedical data) but, importantly, they also draw on other datasets imported from the Centers for Disease Control and Prevention (CDC), among other sources. This allows researchers to add to and explore the direct and, perhaps more importantly, the indirect relationships at play in this semantic web of knowledge to help identify areas for future research.

 

Using Wikidata as an open, community-maintained database of biomedical knowledge – CC-BY: Andrew Su, Professor at The Scripps Research Institute.

Which brings me onto…

Contention #2 – Building a bibliographical repository: the sum of all citations

Sharing your data to Wikidata, as a linking hub for the internet, is also the most cost-effective way to surface your repository’s data and make it 5 star linked open data. As a centralised hub for linked open data on the internet, it enables you to draw on many other datasets while still having your own read/write applications on top of Wikidata. (This is exactly what the GeneWiki project did to encourage domain experts to fill knowledge gaps on Wikidata, providing a user-friendly read/write interface to enable the “consumption and curation” of gene annotation data using the Wiki Genome web application.)

Within Wikidata, we have biographical data, geographical data, biomedical data, taxonomic data and, importantly, bibliographic data.

The WikiCite project are building a bibliographic repository of sources within Wikidata.

“How does the Wikimedia movement empower individuals to assess reliable sources and arm them with quality information so they can make decisions based in facts? This question is relevant not only to Wikipedia users but to consumers of media around the globe. Over the past decade, the Wikimedia movement has come together to answer that question. Efforts to design better ways to support sourcing have begun to coalesce around Wikidata – the free knowledgebase that anyone can edit. With the creation of a rich, human-curated, and machine-readable knowledgebase of sources, the WikiCite initiative is crowdsourcing the process of vetting information and its provenance.” – WikiCite Report 2017

Wikidata tools can be used to create Wikidata items on scholarly papers automatically by scraping source metadata from:

  • DOIs
  • PMIDs
  • PMCIDs
  • ORCIDs (NB: multiple items of data can be created simultaneously to represent multiple scholarly papers using one ORCID identifier input in the Orcidator tool).
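
To give a feel for what these tools do with the metadata behind an identifier, here is a minimal Python sketch that maps a Crossref-style metadata record onto Wikidata property/value pairs. It is an illustration under stated assumptions, not the implementation of any of the tools mentioned: the function name and the sample record (including its DOI) are made up for the example, while the property IDs (P31 instance of, P356 DOI, P1476 title, P2093 author name string) and the item Q13442814 (scholarly article) are the ones Wikidata uses.

```python
SCHOLARLY_ARTICLE = "Q13442814"  # Wikidata item for "scholarly article"


def statements_from_metadata(record: dict) -> list[tuple[str, str]]:
    """Turn a Crossref-style metadata record into (property, value) pairs."""
    statements = [("P31", SCHOLARLY_ARTICLE)]  # instance of: scholarly article

    doi = record.get("DOI")
    if doi:
        # Wikidata conventionally stores DOIs in upper case.
        statements.append(("P356", doi.upper()))

    titles = record.get("title") or []
    if titles:
        statements.append(("P1476", titles[0]))  # title

    for author in record.get("author", []):
        parts = (author.get("given"), author.get("family"))
        name = " ".join(p for p in parts if p)
        if name:
            statements.append(("P2093", name))  # author name string
    return statements


# Hypothetical record, shaped like the "message" part of a Crossref response.
sample = {
    "DOI": "10.1234/example.doi",
    "title": ["An example paper"],
    "author": [{"given": "Ada", "family": "Example"}, {"family": "Consortium"}],
}
for prop, value in statements_from_metadata(sample):
    print(prop, value)
```

A real tool would go further, for example by matching author name strings to existing author items and checking that no item with the same DOI already exists.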

Indeed, 1 out of 4 items of data in Wikidata represents a creative work. Wikidata currently includes 10 million entries about citable sources, such as books, scholarly papers and news articles, along with over 75 million author string statements and 84 million citation links between these authors and sources. There are 17 million items with a PubMed ID and 12.4 million items with a DOI.

Mike Bennett, our Digital Scholarship Developer at the University of Edinburgh, is working to develop a tool to translate the Edinburgh Research Archives’ thesis collection data from ALMA into a format that Wikidata can accept. There are also ready-made tools that Wikidatans have developed that will automatically create a Wikidata item for a scholarly paper by scraping the source metadata from DOIs, PubMed IDs and ORCID identifiers, allowing a bibliographic record of scholarly papers and their authors to be generated as structured, machine-readable, multilingual linked open data.

Why does this matter?

Well… the Initiative for Open Citations (I4OC) is a new collaboration between scholarly publishers, researchers, and other interested parties to promote the unrestricted availability of scholarly citation data. Over 150 publishers have now chosen to deposit and open up citation data. As a result, the fraction of publications with open references has grown from 1% to more than 50% out of 38 million articles with references deposited with Crossref.

“Citations are the links that knit together our scientific and cultural knowledge. They are primary data that provide both provenance and an explanation for how we know facts. They allow us to attribute and credit scientific contributions, and they enable the evaluation of research and its impacts. In sum, citations are the most important vehicle for the discovery, dissemination, and evaluation of all scholarly knowledge.”

Once made open, the references for individual scholarly publications may be accessed within a few days through the Crossref REST API. Open citations are also available from the OpenCitations Corpus, which is progressively and systematically harvesting citation data from Crossref and other sources. An advantage of accessing citation data from the OpenCitations Corpus is that they are available in machine-readable RDF format, which is systematically being added to Wikidata.
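
As a rough sketch of what accessing open references through the Crossref REST API can look like, the Python below builds the URL for a work's metadata and pulls the cited DOIs out of a response. The `works/{doi}` route and the `message` / `reference` / `DOI` fields follow Crossref's documented response shape; the function names and the sample response are hypothetical and exist only for illustration.

```python
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works/"


def works_url(doi: str) -> str:
    """Return the Crossref REST API URL for the metadata of one DOI."""
    return CROSSREF_WORKS + quote(doi, safe="/")


def cited_dois(response: dict) -> list[str]:
    """Extract the DOIs of cited works from a decoded Crossref JSON response.

    References that were deposited without a DOI (for example free-text
    citations of books) are skipped.
    """
    references = response.get("message", {}).get("reference", [])
    return [ref["DOI"].lower() for ref in references if "DOI" in ref]


# Hypothetical response, trimmed to the fields used above.
sample_response = {
    "message": {
        "DOI": "10.1234/citing.paper",
        "reference": [
            {"key": "ref1", "DOI": "10.1234/CITED.ONE"},
            {"key": "ref2", "unstructured": "A book cited without a DOI."},
        ],
    }
}
print(works_url("10.1234/citing.paper"))
print(cited_dois(sample_response))
```

Each pair of citing DOI and cited DOI corresponds to one potential "cites work" (P2860) link between two Wikidata items, once both papers have items of their own.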

Because this data on scholars, scholarly papers and citations is stored as linked data on Wikidata, the citation data can be linked to, and draw on, other complementary datasets, enabling the direct and indirect relationships to be explored in this semantic web of knowledge.

This means we can parse the data to answer a range of queries such as:

  • Show me all works which cite a New York Times article/Washington Post article/Daily Telegraph article etc. (delete as appropriate).
  • Show me the most popular journals cited by statements of any item that is a subclass of economics/archaeology/mathematics etc. (delete as appropriate).
  • Show me all statements citing the works of Joseph Stiglitz/Melissa Terras/James Loxley/Karen Gregory etc. (delete as appropriate).
  • Show me all statements citing journal articles by physicists at Oxford University in 1960s/1970s/1980s etc. (delete as appropriate).
  • Show me all statements citing a journal article that was retracted.
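
Questions like these translate into SPARQL queries over Wikidata's citation graph. As a hedged illustration, the Python sketch below builds a query for a simplified version of the third example: works that cite any work by a given author, using the "author" (P50) and "cites work" (P2860) properties. The function name is hypothetical, the author's item ID must be supplied by the caller, and only authors linked as items (rather than as plain name strings) would be matched.

```python
import re


def works_citing_author_query(author_qid: str, limit: int = 50) -> str:
    """Build a SPARQL query for works that cite works by the given author."""
    if not re.fullmatch(r"Q[1-9]\d*", author_qid):
        raise ValueError(f"not a Wikidata item ID: {author_qid!r}")
    return (
        "SELECT ?citing ?citingLabel ?cited ?citedLabel WHERE {\n"
        f"  ?cited wdt:P50 wd:{author_qid} .\n"  # works written by the author
        "  ?citing wdt:P2860 ?cited .\n"         # works that cite those works
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {limit}"
    )
```

The resulting text can be pasted into the Wikidata Query Service at query.wikidata.org, which also offers table, graph and timeline views of the results.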

And much more besides.

Screengrab of the Scholia profile for the developmental psychologist, Uta Frith, generated from the structured linked data in Wikidata.

 

Like the WikiGenome web application already mentioned, other third-party applications can be built with user-friendly UIs to read from and write to Wikidata. For instance, the Scholia web service creates on-the-fly scholarly profiles for researchers, organizations, journals, publishers, individual scholarly works, and research topics. Drawing on information in Wikidata, Scholia displays the total number of publications, co-authors and citation statistics in a variety of visualisations. This is another way of helping to demonstrate the impact and reach of your research.

Citation statistics for developmental psychologist Uta Frith, visualised on the Scholia web service and generated from the citation data in Wikidata.
Co-author graph for Polly Arnold, Crum Brown Chair of Chemistry in the School of Chemistry at the University of Edinburgh, visualised in the Scholia web service and generated from bibliographic data in Wikidata.

To conclude, the power of linked open data to aid the teaching of data literacy and to help share knowledge between different institutions and different repositories, between geographically and culturally separated societies, and between languages is a beautiful, empowering thing. Here’s to more of it and to entering a brave new world of linked open data. Thank you.

By way of closing, I’d like to show you the video presentations that the students on the Data Science for Design MSc came up with as the final outcome of their project to import the Survey of Scottish Witchcraft database into Wikidata.

Here are two data visualisation videos they produced:

Further reading

 3 steps to better demonstrate your institution’s commitment to Open Knowledge and Open Science.

  1. Allocate time/buy out time for academics & postdoctoral researchers to add university research (backed up with citations) to Wikipedia in existing/new pages. Establishing relevance is the most important aspect of adding university research so an understanding of the subject matter is important along with ensuring the balance of edits meets the ethos of Wikipedia so that any possible suggestion of promotion/academic boosterism is outweighed by the benefit of subject experts paying knowledge forward for the common good. At least three references are required for a new article on Wikipedia so citing the work of fellow professionals goes some way to ensuring the article has a wider notability and helps pay it forward. Train contributors prior to editing to ensure they are aware of Wikipedia’s policies & guidelines and monitor their contributions to ensure edits are not reverted.
  2. Identify the most cited works by your university’s researchers which are already on Wikipedia using Altmetric software. Once identified, systematically add in the Open Access links to any existing (paywalled) citations on Wikipedia and complete the edit by adding in the OA symbol (the orange padlock) using the {{open access}} template. Also join WikiProject Open Access.
  3. Help build up a bibliographic repository of structured machine-readable (and multilingual) linked open data on both university researchers AND research papers in Wikidata using the easy-to-use suite of tools available.