Metadata and Google: A Troubled Relationship

Abstract:

Metadata is a vital tool that allows information professionals to sort information objects into useful categories and allows users to find them effectively. However, Google is accessible to everyone, including those who would abuse metadata fields to acquire more page traffic, which has led Google to distrust user-generated content that is not displayed directly on the page. This conflict between the two approaches means that information professionals require new tactics when attempting to use Google as a means of exploring the digital branch of their library. Herein, the difficulties of using metadata alongside Google will be explored and potential solutions discussed. First, Google’s approach to metadata, and the reasons Google has adopted it, will be analysed. Then the information professional’s approach to metadata, and the manner in which it conflicts with Google’s, will be explored. Lastly, methods of potentially reconciling this difference in methodology will be presented, along with their shortcomings.

Introduction:

While metadata is valuable to information professionals for categorizing information resources, the Google search engine has a tendency to distrust metadata fields. This leads to something of a schism when information professionals wish to use metadata to describe information objects on the Google search engine, where they are lumped in together with everyone else. Herein, the difficulty with metadata and the Google search engine will be discussed. In order to explore this issue effectively, the value and function of metadata for the information professional will first be analysed. Following that, the manner in which Google functions as a search engine, including the way it sorts and values results and how it views metadata, will be examined. Lastly, potential solutions to the problem of metadata and Google will be explored, from Google Scholar to HTML manipulation and various semantic web innovations.

What Is Metadata?

Metadata is essentially a method of sorting information resources into certain categories or associating particular words with them. For example, a book might be sorted into ‘non-fiction’ or ‘fiction’ categories as well as categories based on author, both examples of metadata. Basically, metadata is used to sort information resources into groups based on likeness and thus someone looking for books by a specific author or on a particular subject, both metadata fields, can in theory find more useful information with ease (Wolfe, 2005). Thus, metadata functions as a way of describing information objects by placing them into varying categories.

Metadata Examples:

• Author
• Title
• Subject
• Publisher

The value of metadata to the information professional is that it allows categorization based on likeness, and hence navigation based on it, as well as more accurate descriptions supplied by professionals. This is important because information objects do not always describe themselves, and thus further description is needed to define them so that others may access them (Wolfe, 2005). Good examples are objects such as video, audio and images. Search engines simply do not have the technology to describe these objects in a way that lets users discover and subsequently access them. As a result, metadata of some form is required to discover them.

Despite all of the aforementioned benefits, there is some debate as to whether metadata is truly useful within the library context itself. As more information becomes accessible online, it is suggested that information searches will lean more and more towards full-text searching rather than using metadata (Beall, 2006, p. 46). Despite this debate, many information professionals still consider metadata a vital method of sorting library resources: metadata is still used on many library sites as part of their search systems, and some librarians focus specifically on metadata (Wolfe, 2005).

How Does Google Function?

Google’s particular way of sorting information resources has led to it becoming the most used search engine. While the precise algorithm Google uses has never been revealed, for obvious business reasons, some informed speculation can be made as to how it functions. Simply put, Google sorts web pages and other information resources based on the parts of the page that someone reading them would see. In turn, Google distrusts hidden descriptors such as metadata keywords, which it largely ignores. To sort information, Google scans the web using bots and indexes the visible, textual content of pages alongside the overall HTML. Users then run their searches against the content of those pages and, in theory and more often than not in practice, retrieve the information they desire. Certain HTML tags, and the information within them, are given priority in the following order:

1. Title tag
2. Heading tags, particularly h1 tags
3. First Paragraph
4. Words In Content

Google’s stance of indexing sites primarily on information that is visible to viewers has led to its popularity as a search engine. The reason for this is that publishers do not necessarily organize the metadata that describes their site correctly. Furthermore, publishers tend to manipulate that metadata so that their information is categorized unfairly. When the HTML of a website is ‘crawled’ by the Google bot, the bot attempts to understand what the site is about by analysing which words and phrases are repeated or feature prominently, with extra weighting given depending on where the text is located and which tags surround it. To give an example, the phrase ‘Metadata and Google’ being present multiple times on a page makes it more likely that searches for those terms will yield the page as a search result (Weldon, 2009, pp. 22 – 23). It is vital to understand, however, that simply repeating those phrases arbitrarily will not guarantee that the page is listed for those search terms, as Google has put in place limits on how many times a phrase may appear relative to the overall word count before it is considered spam. Basically, the argument is that information resources ought to describe themselves rather than be described by their authors (Wolfe, 2005). This is the methodology that has made Google the number one search engine.
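The weighting idea described above can be illustrated with a short Python sketch. It is not Google’s algorithm, which is not public: the tag weights and the density limit are invented for illustration, and the names `TAG_WEIGHTS`, `MAX_DENSITY` and `score_page` are hypothetical. The ‘first paragraph’ rule is left out to keep the sketch small.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Hypothetical weights: the values Google actually uses are not public.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "p": 1.0}
# Assumed spam limit: maximum share of all words on the page a single word may take.
MAX_DENSITY = 0.25


class WeightedTextParser(HTMLParser):
    """Adds up a weight for every word based on the nearest enclosing weighted tag."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags
        self.scores = Counter()  # word -> accumulated weight
        self.counts = Counter()  # word -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop everything up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        weight = next(
            (TAG_WEIGHTS[t] for t in reversed(self.stack) if t in TAG_WEIGHTS), 1.0
        )
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.scores[word] += weight
            self.counts[word] += 1


def score_page(html):
    """Returns word -> score, with over-repeated words scored as zero."""
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    scores = {}
    for word, score in parser.scores.items():
        density = parser.counts[word] / total
        scores[word] = 0.0 if density > MAX_DENSITY else score
    return scores
```

Under these assumed weights, a word that appears in the title, the main heading and the body outranks a word that appears in the body alone, while a word that makes up most of the page is discarded as spam.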

Google and Metadata

While metadata can be a useful method of sorting data in the library context, where only information professionals can input resources, on the internet anyone can publish. This leads to people flooding metadata fields, as well as using deceptive metadata, to attract more visitors. Ultimately, this has culminated in Google assessing information resources by their visible content rather than trusting authors to state what the resource contains. As such, Google ignores all metadata that describes an information resource in favour of a full-text search (Alimohammadi, 2005, p. 629). On general websites, the main form of metadata is keywords, which other search engines use to ascertain what a web site contains. On a library website, this field may contain the library’s name as well as details of the information resource. In the following example, each keyword is separated by a comma:

<meta name="keywords" content="washington library, Ian McEwan, Atonement, 2001, Jonathan Cape">

What this particular example suggests is that the web page in question contains information about the listed terms. In theory, this means that a search engine that used keywords would display the page if someone entered the search phrase “Atonement jonathan cape”. However, the issue is that Google completely ignores the keywords field because of its policy of sorting information resources only by content that is visible to the information seeker (Wolfe, 2005). Or, as was previously stated, Google wants information objects to describe themselves.
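The difference between the two kinds of search engine can be sketched in Python. This is a toy model only: `PageParser`, `keyword_engine_matches` and `text_engine_matches` are hypothetical names, and real search engines are far more elaborate. The sketch assumes a keyword-trusting engine matches a query against the keywords field, while a text-only engine matches it against the page’s title and body text.

```python
from html.parser import HTMLParser
import re


def words(text):
    """Lower-cases text and splits it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class PageParser(HTMLParser):
    """Separates the keywords meta tag from the text content of the page."""

    def __init__(self):
        super().__init__()
        self.keyword_words = set()  # words declared in the keywords meta tag
        self.text_words = set()     # words found in the title and body text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            self.keyword_words.update(words(attrs.get("content") or ""))

    def handle_data(self, data):
        self.text_words.update(words(data))


def parse_page(html):
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser


def keyword_engine_matches(page, query):
    """Toy engine that trusts the keywords field."""
    return all(w in page.keyword_words for w in words(query))


def text_engine_matches(page, query):
    """Toy engine that only looks at the text a reader would see."""
    return all(w in page.text_words for w in words(query))
```

With a page that carries the keywords shown earlier but only mentions the title ‘Atonement’ in its text, the keyword-trusting engine matches “Atonement jonathan cape” and the text-only engine does not.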

All of this said, it must be remembered that even if Google did take metadata into account, it would be website metadata rather than metadata pertaining to a particular information resource. For example, there are no standard HTML metadata fields for publisher or publication date, and the author tag refers to the author of the web page and not to the author of the information resource the page describes. That is to say, a page that refers to several information resources would only carry metadata for the overall web page rather than for each information resource. For this reason, attempts are under way to develop semantic web systems that enable web pages to give details of the information resources they refer to. One example is the AB Meta system, which is designed to allow publishers to describe the information resource they are referring to through a number of metadata fields (Iskold, 2010). AB Meta works by expanding the header section of a web page to include information about the information object the page refers to.

AB Meta Metadata Fields Example:

• Book: author
• Book: isbn
• Book: year

The idea with these metadata fields is that they state what the web page is referring to, without necessarily claiming to define the information resource precisely. However, once again, the problem is that while Google will not penalize the use of these fields, it does not particularly change the way it indexes a page because of them.
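As a rough illustration of what expanding the header section might look like, the following Python sketch turns a record into a set of meta elements. The exact syntax AB Meta prescribes is not reproduced in this essay, so the `book.author` style of naming and the function `render_book_meta` are assumptions made purely for illustration.

```python
from html import escape


def render_book_meta(fields):
    """Returns one <meta> line per field, intended for a page's <head> section.

    The "book.<field>" naming is an assumption, not the documented AB Meta syntax.
    """
    lines = []
    for name, value in fields.items():
        lines.append(
            '<meta name="book.{}" content="{}">'.format(
                escape(name, quote=True), escape(str(value), quote=True)
            )
        )
    return "\n".join(lines)
```

Calling `render_book_meta({"author": "Ian McEwan", "year": 2001})` would produce two meta lines, one for each field, which a publisher could place in the page header.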

Potential Solutions

There are a number of potential solutions to the problems listed herein, on the assumption that an information professional must use metadata in providing resources via Google. Each will be explored in turn by looking at its positives and negatives. The first solution uses Google Scholar. The second uses general page content as metadata, the idea being that certain HTML tags can be used to emphasize aspects of the page. The last uses various semantic web structured formats to tell Google more accurately what information resource the page is referring to.

Solution One: Google Scholar

The first potential solution to Google’s metadata issue is to use the ‘Google Scholar’ service. The idea behind Google Scholar is that it only allows the searching of academic materials, and thus the peer review process behind them in theory suggests that the metadata is relevant (Mayr & Walter, 2007, p. 816). Its advantage is that it allows academic resources to be sorted via metadata, indexing resources by title, author, resource type, subject and the resource’s abstract, as well as date and location. Thus, libraries may post their resources, including the aforementioned metadata, to the Google Scholar search engine and clients may use it to find suitable information. It can also be used as a method of looking through journals, with clients then requiring their library login in order to access the resources. For instance, Google Scholar may allow users to search journals, but in order to access the articles they would need to log in through whichever libraries they have access to (Albanese, 2004, p. 13).

Despite the aforementioned positives, using Google Scholar to solve the issue of metadata has problems of its own. Firstly, the search results may link to journal articles that are not covered by a given library’s journal memberships, leading to frustration on the user’s end. Also, there may be some online journals or resources that a library does have access to but which Google Scholar overlooks due to their specificity (Mayr & Walter, 2007, p. 828). Furthermore, while the system may seem somewhat effective despite its previously mentioned shortcomings, it is the general experience of information professionals that it does not work effectively, for the simple reason that its results are poor, for reasons that remain unknown (Mayr & Walter, 2007, p. 815). This could simply be because of a lack of initial uptake or other reasons such as a lack of user friendliness, problems which may be resolved in future.

Solution Two: HTML Manipulation

The second potential solution is simply to include the metadata as general content on the web page, particularly in a form of content that Google’s search engine might prioritise. For instance, all of the metadata fields could be placed in HTML heading tags, thus giving them more relevance on the page than other text. Also, the author and other details could be added to the title tag alongside the name of the information resource, which would make searches using that metadata more likely to lead to the resource in question. For instance, the title tag of the page listing a certain resource could read ‘Atonement, Ian McEwan, 2001’ rather than simply ‘Atonement’ or something unrelated. Thus, the author and publishing date would be included in the title and heading tags, which would be included in any Google searches (Dawson, 2004, pp. 348 – 349). Even the collection name could be included in the title, so that people looking for a local copy of an information resource could come by it.
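A minimal Python sketch of this approach follows. It assumes a catalogue record is available as a dictionary and writes its fields into the title and heading tags of a static page. The function name `render_record_page`, the field names and the choice of `h2` headings are all illustrative assumptions rather than a recommended layout.

```python
from html import escape


def render_record_page(record, collection):
    """Builds a static HTML page whose title and headings carry the record's metadata."""
    # Title tag: resource name, author, year and collection name together.
    title_text = ", ".join(
        [record["title"], record["author"], str(record["year"]), collection]
    )
    # Each metadata field becomes a visible heading with its label beside the value.
    field_headings = "\n".join(
        "<h2>{}: {}</h2>".format(escape(label), escape(str(record[key])))
        for label, key in [
            ("Author", "author"),
            ("Published", "year"),
            ("Publisher", "publisher"),
        ]
    )
    return (
        "<html>\n<head>\n<title>{}</title>\n</head>\n<body>\n"
        "<h1>{}</h1>\n{}\n</body>\n</html>"
    ).format(escape(title_text), escape(record["title"]), field_headings)
```

Placing the field label next to its value, as in ‘Author: Ian McEwan’, keeps the two words adjacent in the visible text of the page.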

Furthermore, if the metadata values link to a specific web page which is not dynamically generated, Google might treat those words as keywords related to that page, and thus some degree of association via metadata will be achieved through Google. This is because Google associates the words used in the anchor text of a link with the page being linked to. In other words, if a hyperlink reads ‘Ian McEwan’ then that phrase is associated with the page the link leads to as far as Google’s index is concerned. As a result, having an index page of all information resources relating to Ian McEwan could serve as a way of relating author metadata to information resources in the Google search engine. However, this is not a particularly reliable method, nor is it an elegant one. Furthermore, it is an extremely time-consuming task for any librarians or web developers involved (Beall, 2006, p. 46).
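The anchor text association can be sketched in Python as follows. This is a simplified model of the idea, not a description of how Google’s index is built; `AnchorTextParser` and `collect_anchor_text` are hypothetical names, and any URLs passed in are made up for the example.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Records the visible text of every link against the address it points to."""

    def __init__(self):
        super().__init__()
        self.anchor_text = defaultdict(list)  # href -> list of link texts
        self._href = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._buffer = []

    def handle_data(self, data):
        if self._href is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.anchor_text[self._href].append(text)
            self._href = None


def collect_anchor_text(pages):
    """Merges the link texts found across several HTML pages, keyed by target address."""
    index = defaultdict(list)
    for html in pages:
        parser = AnchorTextParser()
        parser.feed(html)
        parser.close()
        for href, texts in parser.anchor_text.items():
            index[href].extend(texts)
    return dict(index)
```

If several record pages each link the author’s name to the same author index page, that page accumulates the name as anchor text once per link.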

The primary problem with this solution is that it does not enable Google to differentiate between metadata fields in its search. For instance, a search for the phrase ‘Author: Ian McEwan’ might show the page detailing the information resource he authored, but it might also display results about Ian McEwan’s personal life that are completely unrelated to any information resource. As such, this method does not enable a strong distinction between McEwan as subject and McEwan as author, and hence limits the usefulness of the metadata (Dawson, 2004, p. 309). The only real hope is that the metadata field will have the word ‘author’ beside it, so that searches for the phrase ‘Author Ian McEwan’ might take into account that the words are next to each other on the page, and thus some form of search via metadata will be achieved.

There is, however, another glaring problem with all of the aforementioned methods of optimising HTML to achieve the effect of metadata within Google searches, albeit a simulated one. All of these search engine optimisation (SEO) ideas that attempt to include metadata as an HTML element assume that the pages of a library’s website are not dynamically generated from a database on each individual search. If they are, it would be near impossible to include metadata effectively in Google searches, as the pages would not be available to be indexed by Google.

Solution Three: Semantic Web

The third potential solution is to use one of the structured metadata formats being developed under the ‘semantic web’ concept, such as AB Meta. The idea behind these formats is that they offer a way for web publishers to add metadata that describes an information resource or even a product. For instance, AB Meta allows a publisher to include information about a specific information resource, such as author, publication date and title (Iskold, 2010). The idea is that it will tell Google more specifically what a particular web page is about, but Google is unlikely to take it up for the simple reason that it allows publishers to tell the search engine what the page contains.

The problem with using semantic web systems is that, once again, they let publishers declare what their page’s content is, which ultimately goes against Google’s general principle of information objects describing themselves (Wolfe, 2005). As a result, it is unlikely to be a particularly effective method of adding metadata for the Google search engine. Furthermore, even if it did work, there is another problem: the metadata fields might force the search results to be too specific. For example, the search terms required may have to be extremely precise, so a mistake with the date of publication could lead to a dramatically lower ranking in Google and hence fewer clients provided with information resources. Thus, it may simply be better not to use any of these semantic systems even if they happen to be working precisely as they should.

Conclusions

In conclusion, while Google remains the number one search engine, it seems the value of metadata in Google searches will remain fairly low. As long as Google functions by focusing on the content of an information resource rather than descriptions of that content, it will almost certainly be incompatible with metadata as the information professional knows it. While there are a few solutions that allow the use of metadata on Google, most of them are fairly inadequate. The use of Google Scholar does not seem particularly effective, despite it being a promising concept. Manipulation of HTML tags essentially just adds extra weight to words associated with metadata by using specific HTML elements, and is guaranteed to falter if a library’s web pages are dynamically generated from a database based on the searches users run. The various semantic web concepts are at this time not being heavily adopted by Google because they oppose its general approach, and furthermore they may result in fewer visitors because they force a very precise definition for the page. Overall, sorting information using metadata is a methodology that cannot be reconciled with Google’s approach, at least so long as Google continues to take the approach it has been built upon. Metadata may simply be on the way out in favour of full-text search, but it must remain to some degree while video, audio and photos do not describe themselves.

References:

Albanese, Andrew (2004). Google Launches Scholarly Search Service. Publishers Weekly: p. 13.

Alimohammadi, Dariush (2005). Meta-tags: still a matter of opinion. The Electronic Library 26(6): pp. 625 – 631.

Beall, Jeffrey (July 2006). Stop the War on Metadata. Library Journal 131(12): p. 46.

Dawson, Alan (2004). Creating metadata that work for digital libraries and Google. Library Review 53(7): pp. 347 – 350.

Dawson, Alan & Hamilton, Val (November 2004). Optimising metadata to make high value content more accessible to Google users. Journal of Documentation 62(3): pp. 307 – 327.

Google (13 November 2008). Google’s Search Engine Optimization Starter Guide. Retrieved April 13, 2010 from http://static.googleusercontent.com/external_content/untrusted_dlcp/www.google.com/en//webmasters/docs/search-engine-optimization-starter-guide.pdf

Iskold, Alex (14 February 2010). Why Google and Other Humans Don’t Read Your Book Reviews. Retrieved March 12, 2010 from http://www.readwriteweb.com/archives/why_google_and_other_humans_dont_read_your_book_reviews.php

Mayr, Philipp & Walter, Anne-Kathrin (20 February 2007). An exploratory study of Google Scholar. Online Information Review 31(6): pp. 814 – 830.

Weldon, Lorette (2009). The ‘Googlization’ Of The cLibrary Collection. Academic Research Library April 2009: pp. 21 – 25.

Wolfe, Robert (24 May 2005). The Value of Metadata in the Google Era. MIT Library Information Intersection. Retrieved April 5, 2010 from http://libraries.mit.edu/metadata/presentations/valuemetadatarev.ppt
