Supporting the University of Edinburgh's commitments to digital skills, information literacy, and sharing knowledge openly

Tag: Artificial Intelligence

Computer Keyboard with an AI button lit up

Wikipedia at 24: Wikipedia and Artificial Intelligence

Wikipedia at 24

“With more than 250 million views each day, Wikipedia is an invaluable educational resource”.[1]

With Wikipedia turning 24 this week (January 15th), and the Wikimedia residency at the University of Edinburgh turning 9 years old this week too, this post examines where we are with Wikipedia today in light of artificial intelligence and the ‘existential threat’ it poses to our knowledge ecosystem. Or not. We’ll see.

NB: This post is especially timely given Keir Starmer’s focus on “unleashing Artificial Intelligence across the UK” on Monday[2][3] and our Principal’s championing of the University of Edinburgh as “a global centre for artificial intelligence excellence, with an emphasis on using AI for public good” this week.

Before we begin in earnest

Wikipedia has, for some time, been given preferential placement in the top search results of Google, the number one search engine. And “search is the way we live now” (Darnton in Hillis, Petit & Jarrett, 2013, p.5)… whether that stays the same remains to be seen with the emergence of chatbots and ‘AI summary’ services. So it is incumbent on knowledge-generating academic institutions to support staff and students in developing a robust information literacy: navigating the 21st-century digital research skills necessary in the world today and understanding how knowledge is created, curated and disseminated online.

Engaging with Wikipedia in teaching & learning has helped us achieve these outcomes over the last nine years and supported thousands of learners to become discerning ‘open knowledge activists’; better able to see the gaps in our shared knowledge and motivated to address these gaps, especially when it comes to under-represented groups, topics, languages and histories. Better able, also, to discern reliable sources from unreliable ones, biased accounts from those written from a neutral point of view, and copyrighted works from open access ones. Imbued with the critical thinking, academic referencing skills and graduate competencies any academic institution and employer would hope to see attained.


Point 1: Wikipedia is already making use of machine learning

ORES

The Wikimedia Foundation has been using machine learning for years (since November 2015). ORES is a service that helps grade the quality of Wikipedia edits and evaluate changes made. Part of its function is to flag potentially problematic edits and bring them to the attention of human editors. The idea is that when you have as many edits to deal with as Wikipedia does, applying some means of filtering can make it easier to handle.

“The important thing is that ORES itself does not make edits to Wikipedia, but synthesizes information, and it is the human editors who decide how they act on that information” – Dr. Richard Nevell, Wikimedia UK
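For readers who want a concrete sense of how a tool like this is consumed, here is a minimal sketch (not official Wikimedia code) that asks the ORES service to score a single revision. The endpoint path follows the publicly documented ORES v3 REST API, but the exact response field names are my best recollection, so treat them as assumptions and check the current documentation (ORES is gradually being superseded by the Foundation’s Lift Wing service).

```python
# Hedged sketch: query ORES for the probability that a given revision is 'damaging'.
# Endpoint shape and response keys are assumptions to verify against the live API docs.
import requests

def damaging_score(revision_id: int, wiki: str = "enwiki") -> float:
    """Return the estimated probability that a revision is damaging."""
    url = f"https://ores.wikimedia.org/v3/scores/{wiki}/{revision_id}/damaging"
    response = requests.get(url, headers={"User-Agent": "ores-demo/0.1"}, timeout=30)
    response.raise_for_status()
    score = response.json()[wiki]["scores"][str(revision_id)]["damaging"]["score"]
    return score["probability"]["true"]

if __name__ == "__main__":
    print(damaging_score(123456789))  # hypothetical revision ID
```

In keeping with the point above, nothing in a script like this edits Wikipedia: it only surfaces a score for a human editor to act on.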

MinT

Rather than relying entirely on external machine translation models (Google Translate, Yandex, Apertium, LingoCloud), Wikimedia now has its own machine translation tool, MinT (Machine in Translation), launched in July 2023, which is based on multiple state-of-the-art open source neural machine translation models,[5] including (1) Meta’s NLLB-200, (2) Helsinki University’s OPUS, (3) IndicTrans2 and (4) Softcatalà.

The combined result is that MinT now supports more than 70 languages that are not supported by other services (including 27 languages for which there is no Wikipedia yet).[5]

“The translation models used by MinT support over 200 languages, including many underserved languages that are getting machine translation for the first time”.[6]

Machine translation is one application of AI (or, more accurately, of large language models) that many readers may be familiar with. It aids the translation of knowledge from one language to another, building understanding between different languages and cultures. The English Wikipedia doesn’t allow unsupervised machine translations to be added to its pages, but human editors are welcome to use these tools and add content. The key component is human supervision, with no unedited or unaltered machine translation permitted to be published on Wikipedia. We have made use of the Content Translation tool on the Translation Studies MSc for the last eight years to give our students meaningful, practical published translation experience ahead of the world of work.
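As a hedged illustration of the kind of open models MinT builds on (rather than the Content Translation tool or MinT itself), the sketch below loads Meta’s NLLB-200 distilled checkpoint from Hugging Face and translates a single sentence. The model name and the FLORES-200 language codes used here are assumptions worth verifying locally, and any output would still need human review before going anywhere near a Wikipedia article.

```python
# Minimal sketch using an open NMT model of the kind MinT draws on; not MinT's own API.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",  # distilled NLLB-200 checkpoint (assumption)
    src_lang="eng_Latn",   # English, Latin script
    tgt_lang="gla_Latn",   # Scottish Gaelic, Latin script (FLORES-200 code, worth checking)
)

result = translator("Wikipedia is a free online encyclopaedia.", max_length=100)
print(result[0]["translation_text"])  # a human editor should review this before publishing
```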

Point 2: Recent study finds artificial intelligence can aid Wikipedia’s verifiability

“It might seem ironic to use AI to help with citations, given how ChatGPT notoriously botches and hallucinates citations. But it’s important to remember that there’s a lot more to AI language models than chatbots….”[7]

SIDE – a potential use case

A study published in Nature Machine Intelligence in October 2023 demonstrated that SIDE, a neural-network-based system, could aid the verifiability of the references used in Wikipedia’s articles.[8] SIDE was trained on the references in Wikipedia’s existing ‘Featured Articles’ (the 8,000+ best-quality articles on Wikipedia) to flag citations that were unlikely to support the statement or claim being made. SIDE would then search the web for alternative citations better placed to support the claim made in the article.
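SIDE itself is a bespoke research system and its code is not reproduced here; purely as a rough sketch of the underlying idea (scoring whether a cited passage actually supports a claim), the example below uses an off-the-shelf natural language inference cross-encoder. The model name and its label ordering are assumptions to check against the model card before relying on the scores.

```python
# Toy illustration of claim-vs-source support scoring; not the SIDE system itself.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/nli-deberta-v3-base")  # assumed 3-class NLI model

claim = "The Forth Bridge is a cantilever railway bridge across the Firth of Forth."
passage = ("The Forth Bridge is a cantilever railway bridge across the Firth of Forth "
           "in the east of Scotland, opened in 1890.")

# For this model family the logits are usually ordered (contradiction, entailment,
# neutral) -- verify on the model card before trusting the mapping.
scores = model.predict([(passage, claim)])[0]
labels = ["contradiction", "entailment", "neutral"]
print(dict(zip(labels, scores.tolist())))
# A high 'entailment' score suggests the passage supports the claim; a low one
# would flag the citation for a human editor to review, much as SIDE does.
```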

The paper’s authors observed that, for the top 10% of citations tagged by SIDE as most likely to be unverifiable, human editors preferred the system’s suggested alternatives over the originally cited reference 70% of the time.[8]

What does this mean?

“Wikipedia lives and dies by its references, the links to sources that back up information in the online encyclopaedia. But sometimes, those references are flawed — pointing to broken websites, erroneous information or non-reputable sources.” [7]

This use case could, theoretically, save editors time in checking the accuracy and verifiability of citations in articles, BUT Aleksandra Urman, a computational scientist at the University of Zurich, warns that this would only be the case if the system were deployed correctly and turned out to be “what the Wikipedia community would find most useful”.[8]

Indeed, practical implementation and actual usefulness remain to be seen, BUT the potential is acknowledged by some within the Wikimedia and open education space:

“This is a powerful example of machine learning tools that can help scale the work of volunteers by efficiently recommending citations and accurate sources. Improving these processes will allow us to attract new editors to Wikipedia and provide better, more reliable information to billions of people around the world.” – Dr. Shani Evenstein Sigalov, educator and Free Knowledge advocate.

One final note is that Urman pointed out that Wikipedia users testing the SIDE system were TWICE as likely to prefer neither of the references as they were to prefer the ones suggested by SIDE. So the human editor would still have to go searching for the relevant citation online in such instances.

Point 3: ChatGPT and Wikipedia

Do people trust ChatGPT more than Google Search and Wikipedia?

No, thankfully. A focus group and interview study published in 2024 revealed that not all users trust ChatGPT-generated information as much as Google Search and Wikipedia.[9]

Has the emergence and use of ChatGPT affected engagement with Wikipedia?

In November 2022, ChatGPT was released to the public and quickly became a popular source of information, serving as an effective question-answering resource. Early indications have suggested that it may be drawing users away from traditional question answering services.

A 2024 paper examined Wikipedia page visits, visitor numbers, number of edits and editor numbers across twelve Wikipedia language editions. These metrics were compared before and after the 30th of November 2022, when ChatGPT was released. The paper’s authors also developed a panel regression model to better understand and quantify any differences. The paper concludes that while ChatGPT negatively impacted engagement with question-answering services such as StackOverflow, the same could not yet be said of Wikipedia. Indeed, there was little evidence of any impact on edits and editor numbers, and any impact seems to have been extremely limited.[10]
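To make the method a little more concrete, here is a minimal sketch of the kind of before/after panel regression the paper describes, run on synthetic data. It is not the authors’ actual model; the column names, cutoff date and log specification are my own illustrative assumptions.

```python
# Illustrative before/after panel regression on synthetic Wikipedia page-view data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
languages = ["en", "de", "fr", "es", "it", "ja"]
dates = pd.date_range("2022-06-01", "2023-05-31", freq="W")

rows = []
for lang in languages:
    base = rng.uniform(1e5, 1e6)  # each language edition gets its own baseline traffic
    for d in dates:
        rows.append({"language": lang, "date": d,
                     "page_views": base * rng.normal(1.0, 0.05)})
frame = pd.DataFrame(rows)

# Indicator equal to 1 for observations after ChatGPT's public release.
frame["post_chatgpt"] = (frame["date"] >= "2022-11-30").astype(int)

# Language fixed effects plus the post-release indicator: the post_chatgpt
# coefficient estimates the average change in log page views afterwards.
model = smf.ols("np.log(page_views) ~ post_chatgpt + C(language)", data=frame).fit()
print(model.params["post_chatgpt"], model.pvalues["post_chatgpt"])
```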

Wikimedia CEO Maryana Iskander states,

“We have not yet seen a drop in page views on the Wikipedia platform since ChatGPT launched. We’re on it, we’re paying close attention, and we’re engaging, but also not freaking out, I would say.”[11]

Do Wikipedia editors think ChatGPT or other AI generators should be used for article creation?

“[While] AI generators are useful for writing believable, human-like text, they are also prone to including erroneous information, and even citing sources and academic papers which don’t exist. This often results in text summaries which seem accurate, but on closer inspection are revealed to be completely fabricated.”[12]

Regents Professor Amy Bruckman, author of Should You Believe Wikipedia?: Online Communities and the Construction of Knowledge, states that large language models are only as good as their ability to distinguish fact from fiction; so, in her view, LLMs can be used to write content for Wikipedia BUT only ever as a first draft, one which becomes useful only once it has been edited by humans and its cited sources checked by humans too.[12]

“Unreviewed AI generated content is a form of vandalism, and we can use the same techniques that we use for vandalism fighting on Wikipedia to fight garbage coming from AI,” stated Bruckman.[12]

Wikimedia CEO Maryana Iskander agrees,

“There are ways bad actors can find their way in. People vandalize pages, but we’ve kind of cracked the code on that, and often bots can be disseminated to revert vandalism, usually within seconds. At the foundation, we’ve built a disinformation team that works with volunteers to track and monitor.”[11]

For the Wikipedia community’s part, a draft policy setting out the limits of the use of artificial intelligence in article generation has been written to help editors avoid posting copyright violations on an open-licensed Wikipedia page, or anything that might open Wikipedia volunteers up to libel suits. At the same time, the Wikimedia Foundation’s developers are creating tools to help Wikipedia editors better identify content online that has been written by AI bots. Part of this is also the greater worry that it is the digital news media, more than Wikipedia, that may be prone to AI-generated content, and it is these hitherto reliable news sources that Wikipedia editors would normally want to cite.

“I don’t think we can tell people ‘don’t use it’ because it’s just not going to happen. I mean, I would put the genie back in the bottle, if you let me. But given that that’s not possible, all we can do is to check it.”[12]

What is right, wrong or missing on Wikipedia spreads across the internet, so there need to be enough checks and balances, and enough human supervision, to stop AI-generated garbage being replicated on Wikipedia and then spreading to other news sources and other AI services. Otherwise we risk a continuous ‘garbage-in-garbage-out’ spiral to the bottom, which Wikimedia Sweden‘s John Cummings termed the Habsburg AI Effect (a degenerative ‘inbreeding’ of knowledge, with models consuming each other in a death loop and getting progressively, demonstrably worse each time) at the annual Wikimedia conference in August 2024. Despite Wikipedia and Google’s interdependence, the Wikipedia community itself is unsure it wants to enter any kind of unchecked feedback loop with ChatGPT, whereby OpenAI consumes Wikipedia’s free content to train its models, which then feed into other commercial paywalled sites, while ChatGPT’s erroneous ‘hallucinations’ might, in turn, be feeding into Wikipedia articles.

It is true to say that while Jimmy Wales has expressed his reluctance to see ChatGPT used as yet (“It has a tendency to just make stuff up out of thin air which is just really bad for Wikipedia — that’s just not OK. We’ve got to be really careful about that.”),[13] other Wikipedia editors have expressed their willingness to use it to get past the inertia and “activation energy” of the first couple of paragraphs of a new article. With human supervision (or humans as Wikipedia’s “special sauce”, if you will), this could actually help Wikipedia create greater numbers of quality articles and better reach its aim of becoming the ‘sum of all knowledge’.[14]

One final suggestion posted on the Wikipedia mailing list has been the use of the BLOOM large language model, which makes use of a Responsible AI Licence (RAIL).[15]

“Similar to some versions of the open Creative Commons license, the RAIL license enables flexible use of the AI model while also imposing some restrictions—for example, requiring that any derivative models clearly disclose that their outputs are AI-generated, and that anything built off them abide by the same rules.”[12]

A Wikimedia Foundation spokesperson stated that,

“Based on feedback from volunteers, we’re looking into how these models may be able to help close knowledge gaps and increase knowledge access and participation. However, human engagement remains the most essential building block of the Wikimedia knowledge ecosystem. AI works best as an augmentation for the work that humans do on our project.”[12]

Point 4: How Wikipedia can shape the future of AI

WikiAI?

In Alek Tarkowski’s 2023 thought piece he views the ‘existential challenge’ of AI models becoming the new gatekeepers of knowledge (and potentially replacing Wikipedia) as an opportunity for Wikipedia to think differently and develop its own WikiAI, “not just to protect the commons from exploitation. The goal also needs to be the development of approaches that support the commons in a new technological context, which changes how culture and knowledge are produced, shared, and used.”[16] However, in discussion at Wikimania in August 2024, this was felt to be outwith the realms of possibility given the vast resources and financing this would require to get off the ground if tackled unilaterally by the Foundation.

Blacklisting and Attribution?

For Chris Albon, Machine Learning Director at the Wikimedia Foundation, using AI tools has been part of the work of some volunteers since 2002.[17] What’s new is that there may be more sites online using AI-generated content. However, Wikipedia has an existing practice of blacklisting sites/sources once it has become clear they are no longer reliable. More concerning is the emerging disconnect whereby AI models can provide ‘summary’ answers to questions without linking to Wikipedia or providing attribution that the information comes from Wikipedia.

“Without clear attribution and links to the original source from which information was obtained, AI applications risk introducing an unprecedented amount of misinformation into the world. Users will not be able to easily distinguish between accurate information and hallucinations. We have been thinking a lot about this challenge and believe that the solution is attribution.”[17]

Gen-Z?

For Slate writer Stephen Harrison, while a significant number of Wikipedia contributors are already Gen Z (about 20% of Wikipedia editors are aged 18-24, according to a 2022 survey), there is a clear desire to increase this percentage within the Wikipedia community, not least to ensure the continuing relevance of Wikipedia within the knowledge ecosystem.[18] That is, if Wikipedia becomes reduced to mere ‘training data’ for AI models, then who would want to continue editing Wikipedia, and who would want to learn to edit and carry on the mantle when older editors dwindle away? Hence the push to recruit more younger editors from Generation Z, raising their awareness of how widely Wikipedia content is used across the internet and of how they can derive a sense of community and shared purpose from sharing fact-checked knowledge, plugging gaps and being part of something that feels like a world-changing endeavour.[18]

WikiProject AI Cleanup

An existing project is already clamping down on AI content on Wikipedia, according to Jiji Veronica Kim.[19] Volunteer editors on the project make use of AI-detection tools to:

  • Identify AI-generated text and images;
  • Remove any unsourced claims;
  • Remove any posts that do not comply with Wikipedia policies.

“The purpose of this project is not to restrict or ban the use of AI in articles, but to verify that its output is acceptable and constructive, and to fix or remove it otherwise… In other words, check yourself before you wreck yourself.”[19]

Point 5: Wikipedia as a knowledge destination and the internet’s conscience

Search failure – Information Retrieval in an age of Infoglut

Search failure: the challenges facing information retrieval in an age of information explosion.

 

Abstract:

This article takes, as its starting point, the news that Wikipedia were reportedly developing a ‘Knowledge Engine’ and focuses on the most dominant web search engine, Google, to examine the “consecrated status” (Hillis, Petit & Jarrett, 2013) it has achieved and its transparency, reliability & trustworthiness for everyday searchers.

A bit of light reading on information retrieval – Own work, CC-BY-SA.

“Commercial search engines dominate search-engine use of the Internet, and they’re employing proprietary technologies to consolidate channels of access to the Internet’s knowledge and information.” (Cuthbertson, 2016)

 

On 16th February 2016, Newsweek published a story entitled ‘Wikipedia Takes on Google with New ‘Transparent’ Search Engine’. The figure applied for, and granted by the Knight Foundation, was a reported $250,000 as part of the Wikimedia Foundation’s $2.5 million programme to build ‘the Internet’s first transparent search engine’.

The sum applied for was relatively insignificant when compared to Google’s reported $75 billion revenue in 2015 (Robinson, 2016). Yet it posed a significant question, and a fundamental one: just how transparent is Google?

 

Two further concerns can be identified from the letter to Wikimedia granting the application: “supporting stage one development of the Knowledge Engine by Wikipedia, a system for discovering reliable and trustworthy public information on the Internet” (Cuthbertson, 2016). This goes to the heart of the current debate on modern information retrieval: transparency, reliability and trustworthiness. How then are we faring on these three measures?

 

  1. Defining Information Retrieval

Information Retrieval is defined as “a field concerned with the structure, analysis, organisation, storage, searching, and retrieval of information” (Salton in Croft, Metzler & Strohman, 2010, p.1).

Croft et al (2010) identify three crucial concepts in information retrieval:

  • Relevance – does the returned result satisfy the user searching for it?
  • Evaluation – evaluating the ranking algorithm on its precision and recall (see the short example after this list).
  • Information Needs – what need generated the query in the first place?
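As a short worked example of the precision and recall measures mentioned under ‘Evaluation’ above, the sketch below scores one ranked results page against a set of known relevant documents; the document IDs are invented purely for illustration.

```python
# Worked example of precision@k and recall@k for a single ranked results list.

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k returned results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)

ranked_results = ["d3", "d7", "d1", "d9", "d2", "d5", "d8", "d4", "d6", "d10"]
relevant_docs = {"d1", "d2", "d3", "d4"}

print(precision_at_k(ranked_results, relevant_docs, 10))  # 0.4 -- 4 of the 10 results are relevant
print(recall_at_k(ranked_results, relevant_docs, 10))     # 1.0 -- all 4 relevant documents were returned
```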

Today, since the advent of the internet, this definition needs to be understood in terms of how pervasive ‘search’ has become. “Search is the way we now live.” (Darnton in Hillis, Petit & Jarrett, 2013, p.5). We are all now ‘searchers’ and the act of ‘searching’ (or ‘googling’) has become intrinsic to our daily lives.

By Typing_example.ogv: NotFromUtrecht derivative work: Parzi [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons

  2. Dominance of one search engine

 

“When you turn on a tap you expect clean water to come out and when you do a search you expect good information to come out” (Swift in Hillis, Petit & Jarrett, 2013)

 

With over 60 trillion pages (Fichter and Wisniewski, 2014) and terabytes of unstructured data to navigate, the need for speedy & accurate responses to millions of queries has never been more important.

 

Navigating the vast sea of information present on the web means the field of Information Retrieval necessitates wrestling with, and constantly tweaking, the design of complex computer algorithms (determining a top 10 list of ‘relevant’ page results through over 200 factors).

 

Google, powered by its PageRank algorithm, has dominated I.R. since the late 1990s, indexing the web like a “back-of-the-book” index (Chowdhury, 2010, p.5). While this oversimplifies the complexity of the task, modern information retrieval, in searching through increasingly multimedia online resources, has necessitated the addition of newer, more sophisticated models. Utilising ‘artificial intelligence’ & semantic search technology to complement the PageRank algorithm, Google now navigates through the content of pages & generates suggested ‘answers’ to queries as well as the 10 clickable links users commonly expect.

 

According to 2011 figures in Hillis, Petit & Jarrett (2013), Google processed 91% of searches internationally and 97.4% of the searches made using mobile devices. This undoubted & sustained dominance has led to accusations of abuse of power in two recent instances.

 

Nicas & Kendall (2016) report that the Federal Trade Commission along with European regulators are examining claims that Google has been abusing its position in terms of smartphone companies feeling they had to give Google Services preferential treatment because of Android’s dominance.

 

In addition, Robinson (2016) states that the Authors Guild are petitioning the Supreme Court over Google’s alleged copyright infringement, going back a decade, when over 20 million library books were digitised without compensation or author/publisher permission. The argument is that the content taken has since been utilised by Google for commercial gain to generate more traffic and more advertising money, and thus confer on them market leader status. This echoes the New Yorker article’s response to Google’s aspiration to build a digital universal library: “Such messianism cannot obscure the central truth about Google Book Search: it is a business” (Toobin in Hillis, Petit & Jarrett, 2013).

 

  3. PageRank

Google’s business is powered, like every search engine, by its ranking algorithm. For Cahill et al (2009), Google’s “PageRank is a quantitative rather than qualitative system”.  PageRank works by ranking pages in terms of how well linked a page is, how often it is clicked on and the importance of the page(s) that links to it. In this way, PageRank assigns importance to a page.
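To make the description above a little more tangible, here is a minimal power-iteration sketch of the basic PageRank idea, in which a page’s score depends on the scores of the pages linking to it. It is the textbook formulation only, with an illustrative damping factor, and not Google’s production algorithm, which (as this article notes) weighs hundreds of additional signals.

```python
# Textbook power-iteration PageRank over a tiny link graph; illustrative only.
import numpy as np

def pagerank(links, damping=0.85, iterations=100):
    """links[i] lists the pages that page i links out to."""
    n = len(links)
    rank = np.full(n, 1.0 / n)
    for _ in range(iterations):
        new_rank = np.full(n, (1.0 - damping) / n)
        for page, outlinks in enumerate(links):
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A dangling page spreads its rank evenly across all pages.
                new_rank += damping * rank[page] / n
        rank = new_rank
    return rank

# Four pages: 0 and 1 link to each other, while 2 and 3 both point at page 0,
# so page 0 ends up with the highest score.
print(pagerank([[1], [0], [0], [0]]))
```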

 

Other parameters are taken into consideration including, most notably, the anchor text, which provides a short descriptive summary of the page it links to. However, the anchor text has been shown to be vulnerable to manipulation, primarily from bloggers, by the process known as ‘Google bombing’. Google bombing is defined as “the activity of designing Internet links that will bias search engine results so as to create an inaccurate impression of the search target” (Price in Bar-Ilan, 2007). Two famous examples include when Microsoft came top of the results for the query ‘More evil than Satan’ and when President Bush ranked as the first result for ‘miserable failure’. Bar-Ilan (2007) suggests google bombs come about for a variety of reasons: ‘fun’, ‘personal promotion’, ‘commercial’, ‘justice’, ‘ideological’ and ‘political’.

 

Although Google was reluctant to alter search results, the reputational damage google bombs were causing necessitated a response. In the end, Google altered the algorithm to defuse a number of google bombs. Despite this, “spam or joke sites still float their way to the top” (Cahill et al, 2009), so there is a clear argument to be had about Google, as a private corporation, continuing to ‘tinker’ with the results delivered by its algorithm, and about how much its coders should, or should not, arbitrate access to the web in this way. After all, the algorithm will already bear the hallmarks of their own assumptions, without any transparency on how these decisions are arrived at. Further, google bombs, Byrne (2004) argues, empower those web users whom the ranking system, for whatever reason, has disenfranchised.

 

Just how reliable & trustworthy is Google?

 

“Easy, efficient, rapid and total access to Truth is the siren song of Google and the culture of search. The price of access: your monetizable information.” (Hillis, Petit & Jarrett, 2013, p.7)

For Cahill et al (2009), Google has made the process of searching too easy and searchers have become lazier as a result, accepting Google’s ranking at face value. Markland (in van Dijck, 2010) makes the point that students’ favouring of Google means they are dispensing with the services libraries provide. The implication is that, despite library information services delivering more relevant & higher quality search results, Google’s quick & easy ‘fast food’ approach is hard to compete with.

This seemingly default trust in the neutrality of Google’s ranking algorithm also has a ‘funnelling effect’ according to Beel & Gipp (2009); narrowing the sources clicked upon 90% of the time to just the first page of results with a 42% click through on the first choice alone. This then creates a cosy consensus in terms of the fortunate pages clicked upon which will improve their ranking while “smaller, less affluent, alternative sites are doubly punished by ranking algorithms and lethargic searchers.” (Pan et al. in van Dijck, 2010)

 

While Google would no doubt argue that all search engines closely guard how their ranking algorithms are calibrated to protect them from aggressive competition, click fraud and SEO marketing, the secrecy is clearly at odds with the principles of public librarianship. Further, Van Dijck (2010) argues that this worrying failure to disclose conceals how knowledge is produced through Google’s network and the commercial nature of Google’s search engine. After all, search engines’ greatest asset is the metadata each search leaves behind. This data can be aggregated and used by the search engine to create profiles of individual search behaviour, and collective profiles, which can then be passed on to other commercial companies for profit. That is not to say it always does, but there is little legislation to stop it in an area that is largely unregulated. The right to privacy does not, it seems, extend to metadata and ‘in an era in which knowledge is the only bankable commodity, search engines own the exchange floor’ (Halavais in van Dijck, 2010).

The University of Edinburgh by Mihaela Bodlovic – http://www.aliceboreasphotography.com/ (CC-BY-SA)

 

  4. Scholarly knowledge and the reliability of Google Scholar

When considering the reliability, transparency & trustworthiness of Google and Google Scholar, it is pertinent to look at their scope and differences from other similar sites. Unlike Pubmed and Web of Science, Google Scholar is not a human-curated database but an internet search engine; therefore its accuracy & content vary greatly depending on what has been submitted to it. Google Scholar does have an advantage in that it searches the full text of articles, so users may find searching easier on Scholar compared to WoS or Pubmed, which are limited to searching according to the abstract, citations or tags.

Where Google Scholar could be more transparent is in its coverage, as some notable publishers have been known, according to van Dijck (2010), to refuse to give access to their databases. Scholar has also been criticised for the lack of completeness of its citations, as well as for its coverage of social science and humanities databases; the latter an area of strength for Wikipedia, according to Park (2011). But the searcher utilising Google Scholar would be unaware of these problems of scope when they came to use it.

Further, Beel & Gipp (2009) state that the ranking system on Google Scholar leads to articles with many citations receiving higher rankings and, as a result, even more citations. Hence, while the digitization of sources on the internet opens up new avenues for scholarly exploration, ranking systems can be seen to close ranks on a select few to the exclusion of others.

As Van Dijck (2010) points out: “Popularity in the Google-universe has everything to do with quantity and very little with quality or relevance.” In effect, ranking systems determine which sources we can see but conceal how this determination has come about. This means that we are unable to truly establish the scope & relevance of our search results. In this way, search engines cannot be viewed as neutral, passive instruments but are instead active “actor networks” and “co-producers of academic knowledge.” (van Dijck, 2010).

Further, it can be argued that Google decides which sites are included in its top ten results. With so much to gain commercially from being discoverable on Google’s first page of results, the practice of Search Engine Optimisation (SEO), or manipulating the algorithm to get your site into the top ten search results, has become widespread. SEO techniques can be split into ‘white hat’ (legitimate businesses with a relevant product to sell) and ‘black hat’ (sites that just want clicks and tend not to care about the ‘spamming’ techniques they employ to get them). As a result, PageRank has to be constantly adjusted, as with google bombs, to counteract the effects of increasingly sophisticated ‘black hat’ techniques. Hence, improved vigilance & critical evaluation of the results returned by Google has become a crucial skill in modern information retrieval.

 

  5. The solution: Google’s response to modern information retrieval – Answer Engines

Google is the great innovator and is always seeking newer, better ways of keeping users on its sites and improving its search algorithm. Hence, the arrival of Google Instant in 2010 to autofill suggested keywords to assist searchers. This was followed by Google’s Knowledge Graph (and its Microsoft equivalent Bing Snapshot). These new services seek not just to provide the top ten links to a search query but also to ‘answer’ it by providing a number of the most popular suggested answers on the page results screen (usually showing an excerpt of the related Wikipedia article & images along the side panel), based on, & learning from, previous users’ searches on that topic.

Google’s Knowledge Graph is supported by sources including Wikipedia & Freebase (and the linked data they provide) along with a further innovation, RankBrain, which utilises artificial intelligence to help decipher the 15% of queries Google has not seen before. As Barr (2016) recognises: “A.I. is becoming increasingly important to extract knowledge from Google’s sea of data, particularly when it comes to classifying and recognizing patterns in videos, images, speech and writing.”

Bing Snapshot does much the same. The difference is that Bing provides links to the sources it uses as part of the ‘answers’ it provides. Google provides information but does not attribute it; without attribution, it is impossible to verify its accuracy. This seems to be one of the thorniest issues in modern information retrieval: link decay and the disappearing digital provenance of sources. This is in stark contrast to Wikimedia’s efforts in creating Wikidata: “an open-license machine-readable knowledge base” (Dewey, 2016) capable of storing digital provenance & structured bibliographic data. Therefore, while Google Knowledge Panels are a step forward, there are issues again over their transparency, reliability & trustworthiness.
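As a small, hedged illustration of the ‘machine-readable knowledge base’ point above, the sketch below queries Wikidata’s public SPARQL endpoint for statements about the University of Edinburgh together with the reference URLs that back them up. The endpoint is real, but the item ID and property paths used here are assumptions worth double-checking against the Wikidata data model.

```python
# Illustrative query for Wikidata statements and their reference URLs (digital provenance).
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

query = """
SELECT ?propertyLabel ?value ?refURL WHERE {
  wd:Q160302 ?prop ?statement .                      # Q160302: University of Edinburgh (check the item ID)
  ?statement ?valueProp ?value .
  ?statement prov:wasDerivedFrom/pr:P854 ?refURL .   # P854: reference URL
  ?property wikibase:claim ?prop ;
            wikibase:statementProperty ?valueProp .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": query, "format": "json"},
    headers={"User-Agent": "wikidata-provenance-demo/0.1"},
    timeout=60,
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["propertyLabel"]["value"], "->", row["value"]["value"],
          "(ref:", row["refURL"]["value"] + ")")
```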

Moreover, the 2014 EU Court ruling on ‘the right to be forgotten’, which Google have stated they will honour, also muddies the waters on issues of transparency & link decay/censorship:

“Accurate search results are vanishing in Europe with no public explanation, no real proof, no judicial review, and no appeals process… the result is an Internet riddled with memory holes — places where inconvenient information simply disappears.” (Fioretti, 2014)

The balance between an individual’s “right to be forgotten” and the freedom of information clearly still has to be found. At the moment, in the name of transparency, both Google and Wikimedia are posting notifications to affected pages that they have received such requests. For those wishing to be ‘forgotten’ this only highlights the matter & fuels speculation unnecessarily.

Wikipedia

 

  6. The solution: Wikipedia’s ‘transparent’ search engine: Discovery

Since the setup of the ‘Discovery’ team in April 2015 and the disclosure of the Knight Foundation grant, there have been mixed noises from Wikimedia with some claiming that there was never any plan to rival Google because a newer ‘internal’ search engine was only ever being developed in order to integrate Wikimedia projects through one search portal.

Ultimately, a lack of consultation between the board and the wider Wikimedia community reportedly undermined the project & culminated in the resignation of Lila Tretikov, Executive Director of the Wikimedia Foundation, at the end of February 2016, and the plans for Discovery were shelved.

However, Sentance (2016) reveals that, in their leaked planning documents for Discovery, the Foundation were indeed looking at the priorities of proprietary search engines, their own reliance on them for traffic and how they could recoup traffic lost to Google (through Google’s Knowledge Graph) at the same time as providing a central hub for information from across all their projects through one search portal. Wikipedia results, after all, regularly featured in the top page of Google results anyway – why not skip the middle man?

Quite how internet searchers may have taken to a completely transparent, non-commercial search engine we’ll possibly never know. However, it remains a tantalizing prospect.

 

  7. The solution: Alternative Search Engines

An awareness of the alternative search engines available for use and their different strengths and weaknesses is a key component of the information literacy needed to navigate this sea of information. Bing Snapshot, for instance, makes greater use of providing the digital provenance for its sources than Google at present.

Notess (2016) serves notice that computational searching (e.g. Wolfram Alpha) continues to flourish along with search engines geared towards data & statistics (e.g. Zanran, DataCite.org and Google Public Data Explorer).

However, knowing about the existence of these differing search engines is one thing; knowing how to successfully navigate them is quite another, as Notess (2016) himself concludes: “Finding anything beyond the most basic of statistics requires perseverance and experimenting with a variety of strategies.”

Information literacy, it seems, is key.

Information Literacy
By Ewa Rozkosz via Flickr (CC-BY-SA)

 

  8. The solution: The need for information literacy

Given that electronic library services are maintained by information professionals, “values such as quality assessment, weighed evaluation & transparency” (van Dijck, 2010) are in much greater evidence than in commercial search engines. That is not to say that there aren’t still issues in library OPAC systems: whether it be in terms of the changes in the classification system used over time or the differing levels of adherence by staff to these classification protocols; or the communication to users of best practice in utilising the system.

The use of any search engine requires literacy among the user group. The fundamental problem remains the disconnect between what a user inputs and what they can feasibly expect at the results stage. Understanding the nature of the search engine being used (proprietary or otherwise), a critical awareness of how knowledge is formed through its network, and the type of search statement that will maximise your chances of success are all vital. As van Dijck (2010) states: “Knowledge is not simply brokered (‘brought to you’) by Google or other search engines… Students and scholars need to grasp the implications of these mechanisms in order to understand thoroughly the extent of networked power”.

Educating users in this way broadens the search landscape and defuses SEO attempts to circumvent our choices. Information literacy cannot be left to academics or information professionals alone, though they can play a large part in its dissemination. As mentioned at the beginning, we are all ‘searchers’. Therefore, it is incumbent on all of us to become literate in the ways of ‘search’ and pass it on, creating our own knowledge networks. Social media offers us a means of doing this, allowing us to filter information as never before, and filtering is “transforming how the web works and how we interact with our world” (Swanson, 2012).

 

Conclusion

Google may never become any more transparent. Hence, its reliability & trustworthiness will always be hard to judge. Wikipedia’s Knowledge Engine may have offered a distinctive model more in line with these terms but it is unlikely, at least for now, to be able to compete as a global crawler search engine.

 

 

Therefore, it is incumbent on searchers not to presume neutrality or to ascribe any kind of benign munificence to any one search engine. Rather, by educating themselves as to the merits & drawbacks of Google and other search engines, users will be able to formulate their searches, and their use of search engines, with a degree of information literacy. Only then can they hope the returned results will match their individual needs with any degree of satisfaction or success.

Bibliography

  1. Arnold, A. (2007). Artificial intelligence: The dawn of a new search-engine era. Business Leader, 18(12), pp. 22.
  2. Bar-Ilan, Judit (2007). “Manipulating search engine algorithms: the case of Google”. Journal of Information, Communication and Ethics in Society 5 (2/3): 155–166. doi:10.1108/14779960710837623. ISSN 1477-996X.
  3. Barr, A. (2016). WSJ.D Technology: Google Taps A.I. Chief To Replace Departing Search-Engine Head. Wall Street Journal. ISSN 00999660.
  4. Beel, J.; Gipp, B. (2009). “Google Scholar’s ranking algorithm: The impact of citation counts (An empirical study)”. 2009 Third International Conference on Research Challenges in Information Science: 439–446. doi:10.1109/RCIS.2009.5089308.
  5. Byrne, S. (2004). Stop worrying and learn to love the Google-bomb. Fibreculture, (3).
  6. Cahill, Kay; Chalut, Renee (2009). “Optimal Results: What Libraries Need to Know About Google and Search Engine Optimization”. The Reference Librarian 50 (3): 234–247. doi:10.1080/02763870902961969. ISSN 0276-3877.
  7. Chowdhury, G.G. (2010). Introduction to modern information retrieval. Facet. ISBN 9781856046947.
  8. Croft, W. Bruce; Metzler, Donald; Strohman, Trevor (2010). Search Engines: Information Retrieval in Practice. Pearson Education. ISBN 9780131364899.
  9. Cuthbertson, A. (2016). “Wikipedia takes on Google with new ‘transparent’ search engine”. Available at: http://europe.newsweek.com/wikipedia-takes-google-new-transparent-search-engine-427028. Retrieved 2016-05-08.
  10. Dewey, Caitlin (2016). “You probably haven’t even noticed Google’s sketchy quest to control the world’s knowledge”. The Washington Post. ISSN 0190-8286. Retrieved 2016-05-13.
  11. Fichter, D. and Wisniewski, J. (2014). Being Findable: Search Engine Optimization for Library Websites. Online Searcher, 38(5), pp. 74-76.
  12. Fioretti, Julia (2014). “Wikipedia fights back against Europe’s right to be forgotten”. Reuters. Retrieved 2016-05-02.
  13. Foster, Allen; Rafferty, Pauline (2011). Innovations in Information Retrieval: Perspectives for Theory and Practice. Facet. ISBN 9781856046978.
  14. Gunter, Barrie; Rowlands, Ian; Nicholas, David (2009). The Google Generation: Are ICT Innovations Changing Information-seeking Behaviour?. Chandos Publishing. ISBN 9781843345572.
  15. Halcoussis, Dennis; Halverson, Aniko; Lowenberg, Anton D.; Lowenberg, Susan (2002). “An Empirical Analysis of Web Catalog User Experiences”. Information Technology and Libraries 21 (4). ISSN 0730-9295.
  16. Hillis, Ken; Petit, Michael; Jarrett, Kylie (2012). Google and the Culture of Search. Routledge. ISBN 9781136933066.
  17. Hoffman, A.J. (2016). Reflections: Academia’s Emerging Crisis of Relevance and the Consequent Role of the Engaged Scholar. Journal of Change Management, 16(2), pp. 77.
  18. Kendall, Susan. “LibGuides: PubMed, Web of Science, or Google Scholar? A behind-the-scenes guide for life scientists: So which is better: PubMed, Web of Science, or Google Scholar?”. libguides.lib.msu.edu. Retrieved 2016-05-02.
  19. Koehler, W.C. (1999). “Classifying Web sites and Web pages: the use of metrics and URL characteristics as markers”. Journal of Librarianship and Information Science 31 (1): 21–31. doi:10.1177/0961000994244336.
  20. LaFrance, Adrienne (2016). “The Internet’s Favorite Website”. The Atlantic. Retrieved 2016-05-12.
  21. Lecher, Colin (2016). “Google will apply the ‘right to be forgotten’ to all EU searches next week”. The Verge. Retrieved 2016-04-29.
  22. Mendez-Wilson, D (2000). ‘Humanizing The Online Experience’, Wireless Week, 6, 47, p. 30, Business Source Premier, EBSCOhost, viewed 1 May 2016.
  23. Milne, David N.; Witten, Ian H.; Nichols, David M. (2007). “A Knowledge-based Search Engine Powered by Wikipedia”. Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management. CIKM ’07 (New York, NY, USA: ACM): 445–454. doi:10.1145/1321440.1321504. ISBN 9781595938039.
  24. Moran, Wes & Tretikov, Lila (2016). “Clarity on the future of Wikimedia search – Wikimedia blog”. Retrieved 2016-05-10.
  25. Nicas, J. and Kendall, B. (2016). “U.S. Expands Google Probe”. Wall Street Journal. ISSN 00999660.
  26. Notess, G.R., (2013). Search Engine to Knowledge Engine? Online Searcher, 37(4), pp. 61-63.
  27. Notess, G.R. (2016). SEARCH ENGINE update. Online Searcher, 40(2), pp. 8-9.
  28. Notess, G.R., (2016). SEARCH ENGINE update. Online Searcher, 40(1), pp. 8-9.
  29. Notess, G.R., (2014). Computational, Numeric, and Data Searching. Online Searcher, 38(4), pp. 65-67.
  30. Park, Taemin Kim (2011). “The visibility of Wikipedia in scholarly publications”. First Monday 16 (8). doi:10.5210/fm.v16i8.3492. ISSN 1396-0466.
  31. Price, Gary (2016). “Digital Preservation Coalition Releases New Tech Watch Report on Preserving Social Media | LJ INFOdocket”. www.infodocket.com. Retrieved 2016-05-01.
  32. Ratcliff, Chris (2016). “Six of the most interesting SEM news stories of the week”. Search Engine Watch. Retrieved 2016-05-10.
  33. Robinson, R. (2016) How Google Stole the Work of Millions of Authors. Wall Street Journal. ISSN 00999660.
  34. Rowley, J. E.; Hartley, Richard J. (2008). Organizing Knowledge: An Introduction to Managing Access to Information. Ashgate Publishing, Ltd. ISBN 9780754644316.
  35. Sandhu, A. K.; Liu, T. (2014). “Wikipedia search engine: Interactive information retrieval interface design”. 2014 3rd International Conference on User Science and Engineering (i-USEr): 18–23. doi:10.1109/IUSER.2014.7002670.
  36. Sentance, R. (2016). “Everything you need to know about Wikimedia’s ‘Knowledge Engine’ so far”. Search Engine Watch. Retrieved 2016-05-02.
  37. Simonite, Tom (2013).“The Decline of Wikipedia”. MIT Technology Review. Retrieved 2016-05-09.
  38. Swanson, Troy (2012). Managing Social Media in Libraries: Finding Collaboration, Coordination, and Focus. Elsevier. ISBN 9781780633770.
  39. Van Dijck, José (2010). “Search engines and the production of academic knowledge”. International Journal of Cultural Studies 13 (6): 574–592. doi:10.1177/1367877910376582. ISSN 1367-8779.
  40. Wells, David (2007). “What is a library OPAC?”. The Electronic Library 25 (4): 386–394. doi:10.1108/02640470710779790. ISSN 0264-0473.

 


 

A little light Summer reading – Wikipedia & the PGCAP course

I was pleased we were able to host a week themed on ‘Wikimedia & Open Knowledge’ as part of the University of Edinburgh’s Postgraduate Certificate of Academic Practice.

Participants on the course were invited to think critically about the role of Wikipedia in academia.

In particular, to read, consider, contrast and discuss four articles:

  • The first, by Dr. Martin Poulter, Wikimedian in Residence at the University of Oxford, is highly recommended in terms of articulating the role of Wikipedia & its sister projects in allowing digital ‘shiver-inducing’ contact with library & archival material;
Search Failure: The Challenge of Modern Information Retrieval in an age of information explosion.


In addition – RECOMMENDED reading on Wikipedia’s role in academia.

 

  1. https://wikiedu.org/blog/2014/10/14/wikipedia-student-writing/ – HIGHLY RECOMMENDED
  2. https://outreach.wikimedia.org/wiki/Education/Reasons_to_use_Wikipedia
  3. http://www.theatlantic.com/technology/archive/2016/05/people-love-wikipedia/482268/
  4. https://medium.com/@oiioxford/wikipedia-s-ongoing-search-for-the-sum-of-all-human-knowledge-6216fb478bcf#.5gf0mu71b  RECOMMENDED
  5. https://wikiedu.org/blog/2016/01/14/wikipedia-15-and-education/
  6. https://www.refme.com/blog/2016/01/15/wikipedia-the-digital-gateway-to-academic-research

This was my response to the reading (and some additional reading).

Title:

Search failure: the challenges facing information retrieval in an age of information explosion.

 

Abstract:

This article takes, as its starting point, the news that Wikipedia were reportedly developing a ‘Knowledge Engine’ and focuses on the most dominant web search engine, Google, to examine the “consecrated status” (Hillis, Petit & Jarrett, 2013) it has achieved and its transparency, reliability & trustworthiness for everyday searchers.

 

Introduction:

The purpose of this article is to examine the pitfalls of modern information retrieval & attempts to circumvent them, with a focus on the main issues surrounding Google as the world’s most dominant search engine.

 

“Commercial search engines dominate search-engine use of the Internet, and they’re employing proprietary technologies to consolidate channels of access to the Internet’s knowledge and information.” (Cuthbertson, 2016)

 

On 16th February 2016, Newsweek published a story entitled ‘Wikipedia Takes on Google with New ‘Transparent’ Search Engine’. The figure applied for, and granted by the Knight Foundation, was a reported $250,000 dollars as part of the Wikimedia Foundation’s $2.5 million programme to build ‘the Internet’s first transparent search engine’.

The sum applied for was relatively insignificant when compared to Google’s reported $75 billion revenue in 2015 (Robinson, 2016). Yet, it posed a significant question; a fundamental one. Just how transparent is Google?

 

Two further concerns can be identified from the letter to Wikimedia granting the application: “supporting stage one development of the Knowledge Engine by Wikipedia, a system for discovering reliable and trustworthy public information on the Internet.”(Cuthbertson, 2016). This goes to the heart of the current debate on modern information retrieval: transparency, reliability and trustworthiness? How then are we faring in these three measures?

 

  1. Defining Information Retrieval

Informational Retrieval is defined as “a field concerned with the structure, analysis, organisation, storage, searching, and retrieval of information.” (Salton in Croft, Metzler & Strohman, 2010, p.1).

Croft et al (2010) identify three crucial concepts in information retrieval:

  • Relevance – Does the returned value satisfy the user searching for it.
  • Evaluation  – Evaluating the ranking algorithm on its precision and recall.
  • Information Needs  – What needs generated the query in the first place.

Today, since the advent of the internet, this definition needs to be understood in terms of how pervasive ‘search’ has become. “Search is the way we now live.” (Darnton in Hillis, Petit & Jarrett, 2013, p.5). We are all now ‘searchers’ and the act of ‘searching’ (or ‘googling’) has become intrinsic to our daily lives.

 

  1. Dominance of one search engine

 

When you turn on a tap you expect clean water to come out and when you do a search you expect good information to come out” (Swift in Hillis, Petit & Jarrett, 2013)

 

With over 60 trillion pages (Fichter and Wisniewski, 2014) and terabytes of unstructured data to navigate, the need for speedy & accurate responses to millions of queries has never been more important.

 

Navigating the vast sea of information present on the web means the field of Information Retrieval necessitates wrestling with, and constantly tweaking, the design of complex computer algorithms (determining a top 10 list of ‘relevant’ page results through over 200 factors).

 

Google, powered by its PageRank algorithm, has dominated I.R. since the early 1990s, indexing the web like a “back-of-the-book” index (Chowdhury, 2010, p.5). While this oversimplifies the complexity of the task, modern information retrieval, in searching through increasingly multimedia online resources, has necessitated the addition of newer more sophisticated models. Utilising ‘artificial intelligence’ & semantic search technology to complement the PageRank algorithm, Google now navigates through the content of pages & generates suggested ‘answers’ to queries as well as the 10 clickable links users commonly expect.

 

According to 2011 figures in Hillis, Petit & Jarrett (2013), Google processed 91% of searches internationally and 97.4% of the searches made using mobile devices. This undoubted & sustained dominance has led to accusations of abuse of power in two recent instances.

 

Nicas & Kendall (2016) report that the Federal Trade Commission along with European regulators are examining claims that Google has been abusing its position in terms of smartphone companies feeling they had to give Google Services preferential treatment because of Android’s dominance.

 

In addition, Robinson (2016) states that the Authors Guild are petitioning the Supreme Court over Google’s alleged copyright-infringement; going back a decade ago when over 20 million library books were digitised without compensation or author/publisher permission. The argument is that the content taken has since been utilised by Google for commercial gain to generate more traffic, more advertising money and thus confer on them market leader status. This echoes the New Yorker article’s response to Google’s aspiration to build a digital universal library: “Such messianism cannot obscure the central truth about Google Book Search: it is a business” (Toobin in Hillis, Petit & Jarrett, 2013).

 

  1. PageRank

Google’s business is powered, like every search engine, by its ranking algorithm. For Cahill et al (2009), Google’s “PageRank is a quantitative rather than qualitative system”.  PageRank works by ranking pages in terms of how well linked a page is, how often it is clicked on and the importance of the page(s) that links to it. In this way, PageRank assigns importance to a page.

 

Other parameters are taken into consideration including, most notably, the anchor text which provides a short descriptive summary of the page it links to. However, the anchor text has been shown to be vulnerable to manipulation, primarily from bloggers, by the process known as ‘Google bombing’. Google bombing is defined as “the activity of designing Internet links that will bias search engine results so as to create an

inaccurate impression of the search target” (Price in Bar-Ilan, 2007).  Two famous examples include when Microsoft came as top result for the query ‘More evil than Satan’ and when President Bush ranked as first result for ‘miserable failure’. Bar-Ilan (2007) suggests google bombs come about for a variety of reasons: ‘fun, ‘personal promotion’, ‘commercial’, ‘justice’, ‘ideological’ and ‘political’.

 

Although reluctant to alter search results, the reputational damage google bombs were having necessitated a response. In the end, Google altered the algorithm to defuse a number of google bombs. Despite this, “spam or joke sites still float their way to the top.”(Cahill et al, 2009) so there is a clear argument to be had about Google, as a private corporation, continuing to ‘tinker’ with the results delivered by its algorithm and how much its coders should, or should not, arbitrate access to the web in this way. After all, the algorithm will already bear hallmarks of their own assumptions without any transparency on how these decisions are arrived at. Further, Google Bombs, Byrne (2004) argues, empower those web users whom the ranking system, for whatever reason, has disenfranchised.

 

Just how reliable & trustworthy is Google?

 

Easy, efficient, rapid and total access to Truth is the siren song of Google and the culture of search. The price of access: your monetizable information.”(Hillis, Petit & Jarrett, 2013, p.7)

For Cahill et al (2009), Google has made the process of searching too easy and searchers have becoming lazier as a result; accepting Google’s ranking at face value. Markland in van Dijck (2010) makes the point that students favouring of Google means they are dispensing with the services libraries provide. The implication being that, despite library information services delivering a more relevant & higher quality search result, Google’s quick & easy ‘fast food’ approach is hard to compete with.

This seemingly default trust in the neutrality of Google’s ranking algorithm also has a ‘funnelling effect’ according to Beel & Gipp (2009); narrowing the sources clicked upon 90% of the time to just the first page of results with a 42% click through on the first choice alone. This then creates a cosy consensus in terms of the fortunate pages clicked upon which will improve their ranking while “smaller, less affluent, alternative sites are doubly punished by ranking algorithms and lethargic searchers.” (Pan et al. in van Dijck, 2010)

 

While Google would no doubt argue that all search engines closely guard how their ranking algorithms are calibrated to protect them from aggressive competition, click fraud and SEO marketing, the secrecy is clearly at odds with principles of public librarianship. Further, Van Dijck (2010) argues that this worrying failure to disclose is concealing how knowledge is produced through Google’s network and the commercial nature of Google’s search engine. After all, search engines greatest asset is the metadata each search leaves behind. This data can be aggregated and used by the search engine to create profiles of individual search behaviour and collective profiles which can then be passed on to other commercial companies for profit. That is not to say it always does but there is little legislation to stop it in an area that is largely unregulated. The right to privacy does not, it seems, extend to metadata and ‘in an era in which knowledge is the only bankable commodity, search engines own the exchange floor.’ (Halavais in van Dijck, 2010)

 

  1. Scholarly knowledge and the reliability of Google Scholar

When considering the reliability, transparency & trustworthiness of Google and Google Scholar it is pertinent to look at its scope and differences with other similar sites. Unlike Pubmed and Web of Science, Google Scholar is not a human-curated database but is instead an internet search engine therefore its accuracy & content varies greatly depending on what has been submitted to it.  Google Scholar does have an advantage is that it searches the full text of articles therefore users may find searching easier on Scholar compared to WoS or Pubmed which are limited to searching according to the abstract, citations or tags.

Where Google Scholar could be more transparent is in its coverage as some notable publishers have been known, according to van Dijck (2010), to refuse to give access to their databases. Scholar has also been criticised for the lack of completeness of its citations, as well as its covering of social science and humanities databases; the latter an area of strength for Wikipedia according to Park (2011). But the searcher utilising Google Scholar would be unaware of these problems of scope when they came to use it.

Further, Beel & Gipp (2009) state that the ranking system on Google Scholar, leads to articles with lots of citations receiving higher rankings, and as a result, receive even more citations because of this. Hence, while the digitization of sources on the internet opens up new avenues for scholarly exploration, ranking systems can be seen to close ranks on a select few to the exclusion of others.

As Van Dijck (2010) points out: “Popularity in the Google-universe has everything to do with quantity and very little with quality or relevance.” In effect, ranking systems determine which sources we can see but conceal how this determination has come about. This means that we are unable to truly establish the scope & relevance of our search results. In this way, search engines cannot be viewed as neutral, passive instruments but are instead active “actor networks” and “co-producers of academic knowledge.” (van Dijck, 2010).

Further, it can be argued that Google itself decides which sites appear in its top ten results. With so much to gain commercially from being discoverable on Google’s first page, the practice of search engine optimisation (SEO), manipulating the algorithm to get a site into the top ten results, has become widespread. SEO techniques can be split into ‘white hat’ (legitimate businesses with a relevant product to sell) and ‘black hat’ (sites that simply want clicks and tend not to care about the spamming techniques they employ to get them). As a result, PageRank has to be continually recalibrated, as with Google bombs, to counteract increasingly sophisticated ‘black hat’ techniques. Hence, vigilance and critical evaluation of the results Google returns have become crucial skills in modern information retrieval.
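
As a concrete, heavily simplified illustration of why inbound links are worth gaming, the sketch below implements the classic power-iteration form of PageRank as originally described by Brin and Page; it is not Google’s production ranking, and the tiny ‘web’ is invented. A handful of pages all linking to one target is enough to push that target’s score above an otherwise comparable page, which is precisely the mechanism link farms and Google bombs exploit.

```python
# Minimal power-iteration sketch of the original PageRank formulation
# (Brin & Page, 1998). Illustration only; not Google's production ranker.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # Dangling page: share its rank evenly across the whole web.
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# A tiny invented web: 'library' and 'shop' link to each other symmetrically,
# but a small link farm (spam1..spam3) points at 'shop', pushing its score
# above 'library' despite their organic linking being identical.
web = {
    "library": ["shop"],
    "shop": ["library"],
    "spam1": ["shop"],
    "spam2": ["shop"],
    "spam3": ["shop"],
}
print(pagerank(web))
```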


The solution: Google’s response to modern information retrieval – Answer Engines

Google is a great innovator, always seeking new ways of keeping users on its sites and improving its search algorithm. Hence the arrival of Google Instant in 2010, suggesting keywords as searchers type. This was followed by Google’s Knowledge Graph (and its Microsoft equivalent, Bing Snapshot). These services seek not just to return the top ten links for a query but to ‘answer’ it, presenting the most popular suggested answers on the results page (usually an excerpt of the related Wikipedia article and images in a side panel), based on, and learning from, previous users’ searches on that topic.

Google’s Knowledge Graph is supported by sources including Wikipedia and Freebase (and the linked data they provide), along with a further innovation, RankBrain, which uses artificial intelligence to help interpret the roughly 15% of queries Google has not seen before. As Barr (2016) recognises: “A.I. is becoming increasingly important to extract knowledge from Google’s sea of data, particularly when it comes to classifying and recognizing patterns in videos, images, speech and writing.”

Bing Snapshot does much the same, with one difference: Bing links to the sources it uses as part of the ‘answers’ it provides. Google presents information but does not attribute it, and without attribution its answers cannot be verified. This touches on one of the thorniest issues in modern information retrieval: link decay and the disappearing digital provenance of sources. It stands in stark contrast to Wikimedia’s efforts in creating Wikidata, “an open-license machine-readable knowledge base” (Dewey, 2016) capable of storing digital provenance and structured bibliographic data. So while Google’s Knowledge Panels are a step forward, there are again issues over their transparency, reliability and trustworthiness.
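
To make the ‘machine-readable provenance’ point concrete, the short sketch below queries the public Wikidata Query Service for a statement together with the source it cites (the item and property are chosen purely as an example); this is exactly the kind of attribution a Knowledge Panel answer does not expose.

```python
import requests

# Ask the Wikidata Query Service for Douglas Adams's (Q42) date of birth (P569)
# together with the 'stated in' (P248) source attached to that statement.
# Item and property chosen purely as an example.
QUERY = """
SELECT ?dateOfBirth ?sourceLabel WHERE {
  wd:Q42 p:P569 ?statement .
  ?statement ps:P569 ?dateOfBirth .
  OPTIONAL { ?statement prov:wasDerivedFrom/pr:P248 ?source . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    # Wikimedia asks API clients to identify themselves with a User-Agent.
    headers={"User-Agent": "info-literacy-example/0.1 (educational demo)"},
    timeout=30,
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    date = row["dateOfBirth"]["value"]
    source = row.get("sourceLabel", {}).get("value", "no source recorded")
    print(f"Date of birth: {date} (stated in: {source})")
```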

Moreover, the 2014 EU Court ruling on ‘the right to be forgotten’, which Google has stated it will honour, further muddies the waters on issues of transparency and link decay/censorship:

“Accurate search results are vanishing in Europe with no public explanation, no real proof, no judicial review, and no appeals process. The result is an Internet riddled with memory holes — places where inconvenient information simply disappears.” (Fioretti, 2014)

The balance between an individual’s “right to be forgotten” and freedom of information has clearly yet to be struck. At the moment, in the name of transparency, both Google and Wikimedia post notices on affected pages stating that they have received such requests. For those wishing to be ‘forgotten’, this only highlights the matter and fuels speculation unnecessarily.

The solution: Wikipedia’s ‘transparent’ search engine – Discovery

Since the Discovery team was set up in April 2015 and the Knight Foundation grant came to light, there have been mixed messages from Wikimedia, with some insisting there was never any plan to rival Google because the new ‘internal’ search engine was only ever intended to integrate the Wikimedia projects behind a single search portal.

Ultimately, a reported lack of consultation between the board and the wider Wikimedia community undermined the project. It culminated in the resignation of Lila Tretikov, Executive Director of the Wikimedia Foundation, at the end of February 2016, and the plans for Discovery were shelved.

However, Sentance (2016) reveals that, in the leaked planning documents for Discovery, the Foundation was indeed looking at the priorities of proprietary search engines, at its own reliance on them for traffic, and at how it could recoup traffic lost to Google (through Google’s Knowledge Graph) while providing a central hub for information from across all its projects through one search portal. Wikipedia results, after all, regularly featured on the first page of Google results anyway – why not cut out the middleman?

Quite how internet searchers would have taken to a completely transparent, non-commercial search engine we may never know. It remains a tantalising prospect, however.

The solution: Alternative search engines

An awareness of the alternative search engines available, and of their different strengths and weaknesses, is a key component of the information literacy needed to navigate this sea of information. Bing Snapshot, for instance, currently does more than Google to surface the digital provenance of its sources.

Notess (2016) serves notice that computational searching (e.g. Wolfram Alpha) continues to flourish along with search engines geared towards data & statistics (e.g. Zanran, DataCite.org and Google Public Data Explorer).

However, knowing that these different search engines exist is one thing; knowing how to navigate them successfully is quite another, as Notess (2016) himself concludes: “Finding anything beyond the most basic of statistics requires perseverance and experimenting with a variety of strategies.”

Information literacy, it seems, is key.


The solution: The need for information literacy

Given that electronic library services are maintained by information professionals, “values such as quality assessment, weighed evaluation & transparency” (van Dijck, 2010) are much more in evidence there than in commercial search engines. That is not to say there are no issues with library OPAC systems: classification schemes change over time, staff adhere to classification protocols to differing degrees, and best practice in using the system is not always communicated to users.

The use of any search engine requires literacy among its users. The fundamental problem remains the disconnect between what a user types in and what they can reasonably expect at the results stage. Understanding the nature of the search engine being used (proprietary or otherwise), a critical awareness of how knowledge is formed through its network, and the kind of search statement that will maximise your chances of success are all vital. As van Dijck (2010) states: “Knowledge is not simply brokered (‘brought to you’) by Google or other search engines… Students and scholars need to grasp the implications of these mechanisms in order to understand thoroughly the extent of networked power.”

Educating users in this way broadens the search landscape and blunts SEO attempts to steer our choices. Information literacy cannot be left to academics or information professionals alone, though they can play a large part in its dissemination. As mentioned at the beginning, we are all ‘searchers’. It is therefore incumbent on all of us to become literate in the ways of ‘search’ and to pass that literacy on, creating our own knowledge networks. Social media offers one means of doing this, allowing us to filter information as never before, and filtering is “transforming how the web works and how we interact with our world” (Swanson, 2012).

Conclusion

Google may never become any more transparent, so its reliability and trustworthiness will always be hard to judge. Wikipedia’s Knowledge Engine might have offered a distinctive model more in keeping with those values, but it is unlikely, at least for now, to be able to compete as a global crawler search engine.

It is therefore incumbent on searchers not to presume neutrality, or to ascribe any kind of benign munificence, to any one search engine. Rather, by educating themselves about the merits and drawbacks of Google and its alternatives, users can formulate their searches, and their choice of search engine, with a degree of information literacy. Only then can they expect the returned results to match their individual needs with any degree of satisfaction or success.

Bibliography

  1. Arnold, A. (2007). Artificial intelligence: The dawn of a new search-engine era. Business Leader, 18(12), p. 22.
  2. Bar‐Ilan, J. (2007). “Manipulating search engine algorithms: the case of Google”. Journal of Information, Communication and Ethics in Society 5(2/3): 155–166. doi:10.1108/14779960710837623. ISSN 1477-996X.
  3. Barr, A. (2016). WSJ.D Technology: Google Taps A.I. Chief To Replace Departing Search-Engine Head. Wall Street Journal. ISSN 0099-9660.
  4. Beel, J.; Gipp, B. (2009). “Google Scholar’s ranking algorithm: The impact of citation counts (An empirical study)”. 2009 Third International Conference on Research Challenges in Information Science: 439–446. doi:10.1109/RCIS.2009.5089308.
  5. Byrne, S. (2004). Stop worrying and learn to love the Google-bomb. Fibreculture, (3).
  6. Cahill, K.; Chalut, R. (2009). “Optimal Results: What Libraries Need to Know About Google and Search Engine Optimization”. The Reference Librarian 50(3): 234–247. doi:10.1080/02763870902961969. ISSN 0276-3877.
  7. Chowdhury, G. G. (2010). Introduction to Modern Information Retrieval. Facet. ISBN 9781856046947.
  8. Croft, W. B.; Metzler, D.; Strohman, T. (2010). Search Engines: Information Retrieval in Practice. Pearson Education. ISBN 9780131364899.
  9. Cuthbertson, A. (2016). “Wikipedia takes on Google with new ‘transparent’ search engine”. Available at: http://europe.newsweek.com/wikipedia-takes-google-new-transparent-search-engine-427028. Retrieved 2016-05-08.
  10. Dewey, C. (2016). “You probably haven’t even noticed Google’s sketchy quest to control the world’s knowledge”. The Washington Post. ISSN 0190-8286. Retrieved 2016-05-13.
  11. Fichter, D.; Wisniewski, J. (2014). Being Findable: Search Engine Optimization for Library Websites. Online Searcher, 38(5), pp. 74–76.
  12. Fioretti, J. (2014). “Wikipedia fights back against Europe’s right to be forgotten”. Reuters. Retrieved 2016-05-02.
  13. Foster, A.; Rafferty, P. (2011). Innovations in Information Retrieval: Perspectives for Theory and Practice. Facet. ISBN 9781856046978.
  14. Gunter, B.; Rowlands, I.; Nicholas, D. (2009). The Google Generation: Are ICT Innovations Changing Information-seeking Behaviour?. Chandos Publishing. ISBN 9781843345572.
  15. Halcoussis, D.; Halverson, A.; Lowenberg, A. D.; Lowenberg, S. (2002). “An Empirical Analysis of Web Catalog User Experiences”. Information Technology and Libraries 21(4). ISSN 0730-9295.
  16. Hillis, K.; Petit, M.; Jarrett, K. (2012). Google and the Culture of Search. Routledge. ISBN 9781136933066.
  17. Hoffman, A. J. (2016). Reflections: Academia’s Emerging Crisis of Relevance and the Consequent Role of the Engaged Scholar. Journal of Change Management, 16(2), p. 77.
  18. Kendall, S. “LibGuides: PubMed, Web of Science, or Google Scholar? A behind-the-scenes guide for life scientists”. libguides.lib.msu.edu. Retrieved 2016-05-02.
  19. Koehler, W. C. (1999). “Classifying Web sites and Web pages: the use of metrics and URL characteristics as markers”. Journal of Librarianship and Information Science 31(1): 21–31. doi:10.1177/0961000994244336.
  20. LaFrance, A. (2016). “The Internet’s Favorite Website”. The Atlantic. Retrieved 2016-05-12.
  21. Lecher, C. (2016). “Google will apply the ‘right to be forgotten’ to all EU searches next week”. The Verge. Retrieved 2016-04-29.
  22. Mendez-Wilson, D. (2000). “Humanizing The Online Experience”. Wireless Week, 6(47), p. 30. Business Source Premier, EBSCOhost. Viewed 1 May 2016.
  23. Milne, D. N.; Witten, I. H.; Nichols, D. M. (2007). “A Knowledge-based Search Engine Powered by Wikipedia”. Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM ’07). New York, NY, USA: ACM: 445–454. doi:10.1145/1321440.1321504. ISBN 9781595938039.
  24. Moran, W.; Tretikov, L. (2016). “Clarity on the future of Wikimedia search”. Wikimedia blog. Retrieved 2016-05-10.
  25. Nicas, J.; Kendall, B. (2016). “U.S. Expands Google Probe”. Wall Street Journal. ISSN 0099-9660.
  26. Notess, G. R. (2013). Search Engine to Knowledge Engine? Online Searcher, 37(4), pp. 61–63.
  27. Notess, G. R. (2016). Search Engine Update. Online Searcher, 40(2), pp. 8–9.
  28. Notess, G. R. (2016). Search Engine Update. Online Searcher, 40(1), pp. 8–9.
  29. Notess, G. R. (2014). Computational, Numeric, and Data Searching. Online Searcher, 38(4), pp. 65–67.
  30. Park, T. K. (2011). “The visibility of Wikipedia in scholarly publications”. First Monday 16(8). doi:10.5210/fm.v16i8.3492. ISSN 1396-0466.
  31. Price, G. (2016). “Digital Preservation Coalition Releases New Tech Watch Report on Preserving Social Media”. LJ INFOdocket. www.infodocket.com. Retrieved 2016-05-01.
  32. Ratcliff, C. (2016). “Six of the most interesting SEM news stories of the week”. Search Engine Watch. Retrieved 2016-05-10.
  33. Robinson, R. (2016). How Google Stole the Work of Millions of Authors. Wall Street Journal. ISSN 0099-9660.
  34. Rowley, J. E.; Hartley, R. J. (2008). Organizing Knowledge: An Introduction to Managing Access to Information. Ashgate Publishing. ISBN 9780754644316.
  35. Sandhu, A. K.; Liu, T. (2014). “Wikipedia search engine: Interactive information retrieval interface design”. 2014 3rd International Conference on User Science and Engineering (i-USEr): 18–23. doi:10.1109/IUSER.2014.7002670.
  36. Sentance, R. (2016). “Everything you need to know about Wikimedia’s ‘Knowledge Engine’ so far”. Search Engine Watch. Retrieved 2016-05-02.
  37. Simonite, T. (2013). “The Decline of Wikipedia”. MIT Technology Review. Retrieved 2016-05-09.
  38. Swanson, T. (2012). Managing Social Media in Libraries: Finding Collaboration, Coordination, and Focus. Elsevier. ISBN 9781780633770.
  39. Van Dijck, J. (2010). “Search engines and the production of academic knowledge”. International Journal of Cultural Studies 13(6): 574–592. doi:10.1177/1367877910376582. ISSN 1367-8779.
  40. Wells, D. (2007). “What is a library OPAC?”. The Electronic Library 25(4): 386–394. doi:10.1108/02640470710779790. ISSN 0264-0473.

Bibliographic databases utilised

