A little light Summer reading – Wikipedia & the PGCAP course

I was pleased we were able to host a week themed on ‘Wikimedia & Open Knowledge’ as part of the University of Edinburgh’s Postgraduate Certificate of Academic Practice.

Participants on the course were invited to think critically about the role of Wikipedia in academia.

In particular, to read, consider, contrast and discuss four articles:

The first, by Dr. Martin Poulter, Wikimedian in Residence at the University of Oxford, is highly recommended for the way it articulates the role of Wikipedia & its sister projects in allowing digital ‘shiver-inducing’ contact with library & archival material;

The second is by Caitlin Dewey at the Washington Post on ‘Google’s sketchy quest to control the world’s knowledge’ from May this year;
Third is ‘Search engines and the production of academic knowledge’ (2010) by Jose van Dijck at the University of Amsterdam;
Lastly, we have ‘Everything you need to know about Wikimedia’s ‘Knowledge Engine’ so far’ by Rebecca Sentance at Search Engine Watch from March 2016.

Search Failure: The Challenge of Modern Information Retrieval in an age of information explosion.

In addition – RECOMMENDED reading on Wikipedia’s role in academia.

This was my response to the reading (and some additional reading).

Title:

Search failure: the challenges facing information retrieval in an age of information explosion.

Abstract:

This article takes, as its starting point, the news that Wikipedia were reportedly developing a ‘Knowledge Engine’ and focuses on the most dominant web search engine, Google, to examine the “consecrated status” (Hillis, Petit & Jarrett, 2013) it has achieved and its transparency, reliability & trustworthiness for everyday searchers.

Introduction:

The purpose of this article is to examine the pitfalls of modern information retrieval & attempts to circumnavigate them, with a focus on the main issues surrounding Google as the world’s most dominant search engine.

“Commercial search engines dominate search-engine use of the Internet, and they’re employing proprietary technologies to consolidate channels of access to the Internet’s knowledge and information.” (Cuthbertson, 2016)

On 16th February 2016, Newsweek published a story entitled ‘Wikipedia Takes on Google with New ‘Transparent’ Search Engine’. The figure applied for from, and granted by, the Knight Foundation was a reported $250,000, as part of the Wikimedia Foundation’s $2.5 million programme to build ‘the Internet’s first transparent search engine’.

The sum applied for was relatively insignificant when compared to Google’s reported $75 billion revenue in 2015 (Robinson, 2016). Yet it posed a significant, even fundamental, question: just how transparent is Google?

Two further concerns can be identified from the letter to Wikimedia granting the application: “supporting stage one development of the Knowledge Engine by Wikipedia, a system for discovering reliable and trustworthy public information on the Internet.” (Cuthbertson, 2016). This goes to the heart of the current debate on modern information retrieval: transparency, reliability and trustworthiness. How, then, are we faring on these three measures?

Defining Information Retrieval

Information Retrieval is defined as “a field concerned with the structure, analysis, organisation, storage, searching, and retrieval of information.” (Salton in Croft, Metzler & Strohman, 2010, p.1).

Croft et al (2010) identify three crucial concepts in information retrieval:

Relevance – Does the returned result satisfy the user searching for it?
Evaluation – How well does the ranking algorithm perform in terms of precision and recall?
Information Needs – What need generated the query in the first place?
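
To make the Evaluation concept concrete, here is a minimal sketch of how precision and recall are conventionally calculated for a single query. The document identifiers and the function name precision_recall are hypothetical, chosen purely for illustration; they are not taken from Croft et al.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = share of the retrieved documents that are relevant
    recall    = share of the relevant documents that were retrieved
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: the engine returns four documents,
# while three documents in the collection are actually relevant.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc2", "doc4", "doc7"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four returned documents are relevant (precision of one half), and two of the three relevant documents were found (recall of two thirds). A search engine can score well on one measure and poorly on the other, which is why both are reported.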

Today, since the advent of the internet, this definition needs to be understood in terms of how pervasive ‘search’ has become. “Search is the way we now live.” (Darnton in Hillis, Petit & Jarrett, 2013, p.5). We are all now ‘searchers’ and the act of ‘searching’ (or ‘googling’) has become intrinsic to our daily lives.

Dominance of one search engine

“When you turn on a tap you expect clean water to come out and when you do a search you expect good information to come out” (Swift in Hillis, Petit & Jarrett, 2013)

With over 60 trillion pages (Fichter and Wisniewski, 2014) and terabytes of unstructured data to navigate, the need for speedy & accurate responses to millions of queries has never been more important.

Navigating the vast sea of information present on the web means the field of Information Retrieval necessitates wrestling with, and constantly tweaking, the design of complex computer algorithms (determining a top 10 list of ‘relevant’ page results through over 200 factors).

Google, powered by its PageRank algorithm, has dominated I.R. since the late 1990s, indexing the web like a “back-of-the-book” index (Chowdhury, 2010, p.5). While this oversimplifies the complexity of the task, modern information retrieval, in searching through increasingly multimedia online resources, has necessitated the addition of newer, more sophisticated models. Utilising ‘artificial intelligence’ & semantic search technology to complement the PageRank algorithm, Google now navigates through the content of pages & generates suggested ‘answers’ to queries as well as the 10 clickable links users commonly expect.

According to 2011 figures in Hillis, Petit & Jarrett (2013), Google processed 91% of searches internationally and 97.4% of the searches made using mobile devices. This undoubted & sustained dominance has led to accusations of abuse of power in two recent instances.

Nicas & Kendall (2016) report that the Federal Trade Commission along with European regulators are examining claims that Google has been abusing its position in terms of smartphone companies feeling they had to give Google Services preferential treatment because of Android’s dominance.

In addition, Robinson (2016) states that the Authors Guild are petitioning the Supreme Court over Google’s alleged copyright infringement, dating back a decade to when over 20 million library books were digitised without compensation or author/publisher permission. The argument is that the content taken has since been utilised by Google for commercial gain to generate more traffic and more advertising money, and thus to confer on it market leader status. This echoes the New Yorker article’s response to Google’s aspiration to build a digital universal library: “Such messianism cannot obscure the central truth about Google Book Search: it is a business” (Toobin in Hillis, Petit & Jarrett, 2013).

PageRank

Google’s business is powered, like that of every search engine, by its ranking algorithm. For Cahill et al (2009), Google’s “PageRank is a quantitative rather than qualitative system”. PageRank works by ranking pages in terms of how well linked a page is, how often it is clicked on and the importance of the page(s) that link to it. In this way, PageRank assigns importance to a page.
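
The link-based part of this idea can be sketched in a few lines. What follows is a minimal illustration of the commonly described PageRank power iteration, not Google’s production algorithm, which is not public and weighs many other factors. The four-page toy_web, the damping value of 0.85 and the function name pagerank are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by repeated redistribution of rank.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split equally among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to C, and C links only to A.
toy_web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

scores = pagerank(toy_web)
top_page = max(scores, key=scores.get)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy web, page C comes out on top because it has the most incoming links, and page A ranks close behind because the one page linking to it is itself important. Page D, which nothing links to, ends up at the bottom however good its content might be, which is the sense in which the system is quantitative rather than qualitative.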

Other parameters are taken into consideration including, most notably, the anchor text which provides a short descriptive summary of the page it links to. However, the anchor text has been shown to be vulnerable to manipulation, primarily from bloggers, by the process known as ‘Google bombing’. Google bombing is defined as “the activity of designing Internet links that will bias search engine results so as to create an inaccurate impression of the search target” (Price in Bar-Ilan, 2007). Two famous examples include when Microsoft came as top result for the query ‘More evil than Satan’ and when President Bush ranked as first result for ‘miserable failure’. Bar-Ilan (2007) suggests google bombs come about for a variety of reasons: ‘fun’, ‘personal promotion’, ‘commercial’, ‘justice’, ‘ideological’ and ‘political’.

Although Google was reluctant to alter search results, the reputational damage caused by google bombs necessitated a response. In the end, Google altered the algorithm to defuse a number of google bombs. Despite this, “spam or joke sites still float their way to the top” (Cahill et al, 2009), so there is a clear argument to be had about Google, as a private corporation, continuing to ‘tinker’ with the results delivered by its algorithm and how much its coders should, or should not, arbitrate access to the web in this way. After all, the algorithm will already bear hallmarks of their own assumptions without any transparency on how these decisions are arrived at. Further, Google Bombs, Byrne (2004) argues, empower those web users whom the ranking system, for whatever reason, has disenfranchised.

Just how reliable & trustworthy is Google?

“Easy, efficient, rapid and total access to Truth is the siren song of Google and the culture of search. The price of access: your monetizable information.”(Hillis, Petit & Jarrett, 2013, p.7)

For Cahill et al (2009), Google has made the process of searching too easy and searchers have become lazier as a result, accepting Google’s ranking at face value. Markland in van Dijck (2010) makes the point that students’ favouring of Google means they are dispensing with the services libraries provide. The implication is that, despite library information services delivering more relevant & higher quality search results, Google’s quick & easy ‘fast food’ approach is hard to compete with.

This seemingly default trust in the neutrality of Google’s ranking algorithm also has a ‘funnelling effect’ according to Beel & Gipp (2009); narrowing the sources clicked upon 90% of the time to just the first page of results with a 42% click through on the first choice alone. This then creates a cosy consensus in terms of the fortunate pages clicked upon which will improve their ranking while “smaller, less affluent, alternative sites are doubly punished by ranking algorithms and lethargic searchers.” (Pan et al. in van Dijck, 2010)

While Google would no doubt argue that all search engines closely guard how their ranking algorithms are calibrated to protect them from aggressive competition, click fraud and SEO marketing, the secrecy is clearly at odds with principles of public librarianship. Further, Van Dijck (2010) argues that this worrying failure to disclose conceals how knowledge is produced through Google’s network and the commercial nature of Google’s search engine. After all, a search engine’s greatest asset is the metadata each search leaves behind. This data can be aggregated and used by the search engine to create profiles of individual search behaviour and collective profiles, which can then be passed on to other commercial companies for profit. That is not to say this always happens, but there is little legislation to stop it in an area that is largely unregulated. The right to privacy does not, it seems, extend to metadata and ‘in an era in which knowledge is the only bankable commodity, search engines own the exchange floor.’ (Halavais in van Dijck, 2010)

Scholarly knowledge and the reliability of Google Scholar

When considering the reliability, transparency & trustworthiness of Google and Google Scholar, it is pertinent to look at its scope and how it differs from other similar sites. Unlike PubMed and Web of Science, Google Scholar is not a human-curated database but an internet search engine; its accuracy & content therefore vary greatly depending on what has been submitted to it. Google Scholar does have an advantage in that it searches the full text of articles, so users may find searching easier on Scholar compared to WoS or PubMed, which are limited to searching according to the abstract, citations or tags.

Where Google Scholar could be more transparent is in its coverage, as some notable publishers have been known, according to van Dijck (2010), to refuse to give access to their databases. Scholar has also been criticised for the lack of completeness of its citations, as well as for its coverage of social science and humanities databases; the latter is an area of strength for Wikipedia according to Park (2011). But the searcher utilising Google Scholar would be unaware of these problems of scope when they came to use it.

Further, Beel & Gipp (2009) state that the ranking system on Google Scholar leads to articles with lots of citations receiving higher rankings and, as a result, receiving even more citations. Hence, while the digitization of sources on the internet opens up new avenues for scholarly exploration, ranking systems can be seen to close ranks on a select few to the exclusion of others.
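
This self-reinforcing loop can be illustrated with a deliberately simplified, deterministic sketch. It is not a model of Google Scholar’s actual ranking, which is not published. The paper names, starting counts and the position_shares percentages are all hypothetical; the 42 for the top position merely borrows the first-result click-through figure quoted earlier as a plausible order of magnitude.

```python
def simulate_citation_feedback(initial_citations, new_per_round, position_shares, rounds):
    """Rank papers by citation count, then hand out new citations by rank position.

    position_shares holds the percentage of each round's new citations
    that goes to the paper at each rank position (index 0 = top result).
    """
    counts = dict(initial_citations)
    for _ in range(rounds):
        # Highest citation count first; ties are broken alphabetically.
        ranking = sorted(counts, key=lambda title: (-counts[title], title))
        for position, title in enumerate(ranking):
            share = position_shares[position] if position < len(position_shares) else 0
            counts[title] += new_per_round * share // 100
    return counts


# Hypothetical starting point: four papers with almost identical citation counts.
start = {"paper_a": 12, "paper_b": 10, "paper_c": 10, "paper_d": 9}
shares = [42, 30, 18, 10]  # assumed percentages per rank position, for illustration only

end = simulate_citation_feedback(start, new_per_round=100, position_shares=shares, rounds=10)
print(end)
```

After ten rounds a two-citation head start has become a gap of more than a hundred citations. Note also that paper_b and paper_c began level, yet the arbitrary tie-break that placed one above the other is enough to separate them permanently.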

As Van Dijck (2010) points out: “Popularity in the Google-universe has everything to do with quantity and very little with quality or relevance.” In effect, ranking systems determine which sources we can see but conceal how this determination has come about. This means that we are unable to truly establish the scope & relevance of our search results. In this way, search engines cannot be viewed as neutral, passive instruments but are instead active “actor networks” and “co-producers of academic knowledge.” (van Dijck, 2010).

Further, it can be argued that Google decides which sites are included in its top ten results. With so much to gain commercially from being discoverable on Google’s first page of results, the practice of Search Engine Optimisation (SEO), or manipulating the algorithm to get your site in the top ten search results, has become widespread. SEO techniques can be split into ‘white hat’ (legitimate businesses with a relevant product to sell) and ‘black hat’ (sites who just want clicks and tend not to care about the ‘spamming’ techniques they employ to get them). As a result, PageRank has to be constantly adjusted, as with Google bombs, to counteract the effects of increasingly sophisticated ‘black hat’ techniques. Hence, vigilance & critical evaluation of the results returned by Google have become crucial skills in modern information retrieval.

The solution: Google’s response to modern information retrieval – Answer Engines

Google is the great innovator and is always seeking newer, better ways of keeping users on its sites and improving its search algorithm. Hence the arrival of Google Instant in 2010, which autofills suggested keywords to assist searchers. This was followed by Google’s Knowledge Graph (and its Microsoft equivalent, Bing Snapshot). These new services seek not just to provide the top ten links to a search query but also to ‘answer’ it by providing a number of the most popular suggested answers on the page results screen (usually showing an excerpt of the related Wikipedia article & images along the side panel), based on, & learning from, previous users’ searches on that topic.

Google’s Knowledge Graph is supported by sources including Wikipedia & Freebase (and the linked data they provide) along with a further innovation, RankBrain, which utilises artificial intelligence to help decipher the 15% of queries Google has not seen before. As Barr (2016) recognises: “A.I. is becoming increasingly important to extract knowledge from Google’s sea of data, particularly when it comes to classifying and recognizing patterns in videos, images, speech and writing.”

Bing Snapshot does much the same. The difference being that Bing provides links to the sources it uses as part of the ‘answers’ it provides. Google provides information but does not attribute it. Without this, it is impossible to verify their accuracy. This seems to be one of the thorniest issues in modern information retrieval; link decay and the disappearing digital provenance of sources. This is in stark contrast to Wikimedia’s efforts in creating Wikidata: “an open-license machine-readable knowledge base” (Dewey 2016) capable of storing digital provenance & structured bibliographic data. Therefore, while Google Knowledge Panels are a step forward, there are issues again over its transparency, reliability & trustworthiness.

Moreover, the 2014 EU Court ruling on ‘the right to be forgotten’, which Google have stated they will honour, also muddies the waters on issues of transparency & link decay/censorship:

“Accurate search results are vanishing in Europe with no public explanation, no real proof, no judicial review, and no appeals process…the result is an Internet riddled with memory holes — places where inconvenient information simply disappears.”(Fioretti, 2014).

The balance between an individual’s “right to be forgotten” and the freedom of information clearly still has to be found. At the moment, in the name of transparency, both Google and Wikimedia are posting notifications to affected pages that they have received such requests. For those wishing to be ‘forgotten’ this only highlights the matter & fuels speculation unnecessarily.

The solution: Wikipedia’s ‘transparent’ search engine: Discovery

Since the setup of the ‘Discovery’ team in April 2015 and the disclosure of the Knight Foundation grant, there have been mixed noises from Wikimedia with some claiming that there was never any plan to rival Google because a newer ‘internal’ search engine was only ever being developed in order to integrate Wikimedia projects through one search portal.

Ultimately, a lack of consultation between the board and the wider Wikimedia community members reportedly undermined the project & culminated in the resignation of Lila Tretikov, Executive Director of the Wikimedia Foundation, at the end of February and the plans for Discovery were shelved.

However, Sentance (2016) reveals that, in their leaked planning documents for Discovery, the Foundation were indeed looking at the priorities of proprietary search engines, their own reliance on them for traffic and how they could recoup traffic lost to Google (through Google’s Knowledge Graph) at the same time as providing a central hub for information from across all their projects through one search portal. Wikipedia results, after all, regularly featured in the top page of Google results anyway – why not skip the middle man?

Quite how internet searchers may have taken to a completely transparent, non-commercial search engine we’ll possibly never know. However, it remains a tantalizing prospect.

The solution: Alternative Engines

An awareness of the alternative search engines available for use and their different strengths and weaknesses is a key component of the information literacy needed to navigate this sea of information. Bing Snapshot, for instance, makes greater use of providing the digital provenance for its sources than Google at present.

Notess (2016) serves notice that computational searching (e.g. Wolfram Alpha) continues to flourish along with search engines geared towards data & statistics (e.g. Zanran, DataCite.org and Google Public Data Explorer).

However, knowing about the existence of these differing search engines is one thing but knowing how to successfully navigate them is quite another, as Notess (2016) himself concludes: “Finding anything beyond the most basic of statistics requires perseverance and experimenting with a variety of strategies.”

Information literacy, it seems, is key.

The solution: The need for information literacy

Given that electronic library services are maintained by information professionals, “values such as quality assessment, weighed evaluation & transparency” (van Dijck, 2010) are in much greater evidence than in commercial search engines. That is not to say that there aren’t still issues in library OPAC systems: whether it be in terms of the changes in the classification system used over time or the differing levels of adherence by staff to these classification protocols; or the communication to users of best practice in utilising the system.

The use of any search engine requires literacy among the user group. The fundamental problem remains the disconnect between what a user inputs and what they can feasibly expect at the results stage. Understanding the nature of the search engine being used (proprietary or otherwise), a critical awareness of how knowledge is formed through its network, and the type of search statement that will maximise your chances of success are all vital. As van Dijck (2010) states: “Knowledge is not simply brokered (‘brought to you’) by Google or other search engines… Students and scholars need to grasp the implications of these mechanisms in order to understand thoroughly the extent of networked power”.

Educating users in this broadens the search landscape and defuses SEO attempts to circumvent our choices. Information literacy cannot be left to academics or information professionals alone, though they can play a large part in its dissemination. As mentioned at the beginning, we are all ‘searchers’. Therefore, it is incumbent on all of us to become literate in the ways of ‘search’ and pass it on, creating our own knowledge networks. Social media offers us a means of doing this, allowing us to filter information as never before, and filtering is “transforming how the web works and how we interact with our world.” (Swanson, 2012)

Conclusion

Google may never become any more transparent. Hence, its reliability & trustworthiness will always be hard to judge. Wikipedia’s Knowledge Engine may have offered a distinctive model more in line with these terms but it is unlikely, at least for now, to be able to compete as a global crawler search engine.

Therefore, it is incumbent on searchers not to presume neutrality or ascribe any kind of benign munificence to any one search engine. Rather, by educating themselves as to the merits & drawbacks of Google and other search engines, users will be able to formulate their searches, and their use of search engines, with a degree of information literacy. Only then can they hope the returned results will match their individual needs with any degree of satisfaction or success.

Bibliography

Arnold, A. (2007). Artificial intelligence: The dawn of a new search-engine era. Business Leader, 18(12), pp. 22.
Bar‐Ilan, Judit (2007). “Manipulating search engine algorithms: the case of Google”. Journal of Information, Communication and Ethics in Society 5 (2/3): 155–166. doi:10.1108/14779960710837623. ISSN 1477-996X.
Barr, A. (2016). WSJ.D Technology: Google Taps A.I. Chief To Replace Departing Search-Engine Head. Wall Street Journal. ISSN 00999660.
Beel, J.; Gipp, B. (2009). “Google Scholar’s ranking algorithm: The impact of citation counts (An empirical study)”. 2009 Third International Conference on Research Challenges in Information Science: 439–446. doi:10.1109/RCIS.2009.5089308.
Byrne, S. (2004). Stop worrying and learn to love the Google-bomb. Fibreculture, (3).
Cahill, Kay; Chalut, Renee (2009). “Optimal Results: What Libraries Need to Know About Google and Search Engine Optimization”. The Reference Librarian 50 (3): 234–247. doi:10.1080/02763870902961969. ISSN 0276-3877.
Chowdhury, G.G. (2010). Introduction to modern information retrieval. Facet. ISBN 9781856046947.
Croft, W. Bruce; Metzler, Donald; Strohman, Trevor (2010). Search Engines: Information Retrieval in Practice. Pearson Education. ISBN 9780131364899.
Cuthbertson, A. (2016). “Wikipedia takes on Google with new ‘transparent’ search engine”. Available at: http://europe.newsweek.com/wikipedia-takes-google-new-transparent-search-engine-427028. Retrieved 2016-05-08.
Dewey, Caitlin (2016). “You probably haven’t even noticed Google’s sketchy quest to control the world’s knowledge”. The Washington Post. ISSN 0190-8286. Retrieved 2016-05-13.
Fichter, D. and Wisniewski, J. (2014). Being Findable: Search Engine Optimization for Library Websites. Online Searcher, 38(5), pp. 74-76.
Fioretti, Julia (2014). “Wikipedia fights back against Europe’s right to be forgotten”. Reuters. Retrieved 2016-05-02.
Foster, Allen; Rafferty, Pauline (2011). Innovations in Information Retrieval: Perspectives for Theory and Practice. Facet. ISBN 9781856046978.
Gunter, Barrie; Rowlands, Ian; Nicholas, David (2009). The Google Generation: Are ICT Innovations Changing Information-seeking Behaviour?. Chandos Publishing. ISBN 9781843345572.
Halcoussis, Dennis; Halverson, Aniko; Lowenberg, Anton D.; Lowenberg, Susan (2002). “An Empirical Analysis of Web Catalog User Experiences”. Information Technology and Libraries 21 (4). ISSN 0730-9295.
Hillis, Ken; Petit, Michael; Jarrett, Kylie (2012). Google and the Culture of Search. Routledge. ISBN 9781136933066.
Hoffman, A.J. (2016). Reflections: Academia’s Emerging Crisis of Relevance and the Consequent Role of the Engaged Scholar. Journal of Change Management, 16(2), pp. 77.
Kendall, Susan. “LibGuides: PubMed, Web of Science, or Google Scholar? A behind-the-scenes guide for life scientists. : So which is better: PubMed, Web of Science, or Google Scholar?”. libguides.lib.msu.edu. Retrieved 2016-05-02.
Koehler, W.C. (1999). “Classifying Web sites and Web pages: the use of metrics and URL characteristics as markers”. Journal of Librarianship and Information Science 31 (1): 21–31. doi:10.1177/0961000994244336. ISSN 0000-0000.
LaFrance, Adrienne (2016). “The Internet’s Favorite Website”. The Atlantic. Retrieved 2016-05-12.
Lecher, Colin (2016). “Google will apply the ‘right to be forgotten’ to all EU searches next week”. The Verge. Retrieved 2016-04-29.
Mendez-Wilson, D (2000). ‘Humanizing The Online Experience’, Wireless Week, 6, 47, p. 30, Business Source Premier, EBSCOhost, viewed 1 May 2016.
Milne, David N.; Witten, Ian H.; Nichols, David M. (2007). “A Knowledge-based Search Engine Powered by Wikipedia”. Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management. CIKM ’07 (New York, NY, USA: ACM): 445–454. doi:10.1145/1321440.1321504. ISBN 9781595938039.
Moran, Wes & Tretikov, Lila (2016). “Clarity on the future of Wikimedia search – Wikimedia blog”. Retrieved 2016-05-10.
Nicas, J. and Kendall, B. (2016). “U.S. Expands Google Probe”. Wall Street Journal. ISSN 00999660.
Notess, G.R. (2013). Search Engine to Knowledge Engine? Online Searcher, 37(4), pp. 61-63.
Notess, G.R. (2016). SEARCH ENGINE update. Online Searcher, 40(2), pp. 8-9.
Notess, G.R. (2016). SEARCH ENGINE update. Online Searcher, 40(1), pp. 8-9.
Notess, G.R. (2014). Computational, Numeric, and Data Searching. Online Searcher, 38(4), pp. 65-67.
Park, Taemin Kim (2011). “The visibility of Wikipedia in scholarly publications”. First Monday 16 (8). doi:10.5210/fm.v16i8.3492. ISSN 1396-0466.
Price, Gary (2016). “Digital Preservation Coalition Releases New Tech Watch Report on Preserving Social Media | LJ INFOdocket”. www.infodocket.com. Retrieved 2016-05-01.
Ratcliff, Chris (2016). “Six of the most interesting SEM news stories of the week | Search Engine Watch”. Retrieved 2016-05-10.
Robinson, R. (2016) How Google Stole the Work of Millions of Authors. Wall Street Journal. ISSN 00999660.
Rowley, J. E.; Hartley, Richard J. (2008). Organizing Knowledge: An Introduction to Managing Access to Information. Ashgate Publishing, Ltd. ISBN 9780754644316.
Sandhu, A. K.; Liu, T. (2014). “Wikipedia search engine: Interactive information retrieval interface design”. 2014 3rd International Conference on User Science and Engineering (i-USEr): 18–23. doi:10.1109/IUSER.2014.7002670.
Sentance, R. (2016). “Everything you need to know about Wikimedia’s ‘Knowledge Engine’ so far | Search Engine Watch”. Retrieved 2016-05-02.
Simonite, Tom (2013). “The Decline of Wikipedia”. MIT Technology Review. Retrieved 2016-05-09.
Swanson, Troy (2012). Managing Social Media in Libraries: Finding Collaboration, Coordination, and Focus. Elsevier. ISBN 9781780633770.
Van Dijck, José (2010). “Search engines and the production of academic knowledge”. International Journal of Cultural Studies 13 (6): 574–592. doi:10.1177/1367877910376582. ISSN 1367-8779.
Wells, David (2007). “What is a library OPAC?”. The Electronic Library 25 (4): 386–394. doi:10.1108/02640470710779790. ISSN 0264-0473.

Bibliographic databases utilised

Suprimo – Library database at University of Strathclyde.
Proquest – http://www.proquest.com/
Google Scholar – http://scholar.google.co.uk/
Emerald Insight – emeraldinsight.com
NORA Power search – Library Catalogue – University of Northumbria.
ACM Digital Library – https://dl.acm.org
Capita Discovery – https://capitadiscovery.co.uk
Discover Ed – University of Edinburgh library catalogue.
IEEE Xplore – https://ieeexplore.ieee.org
Sage journals – https://online.sagepub.com
Tandfonline.com – https://tandfonline.com
Questia.com – https://questia.com
Highbeam.com – https://highbeam.com
