...design, query, and evaluate information retrieval systems
Meaning and Importance of Competency
Information retrieval systems play a central role in the work of librarians and information professionals. While the concept of an information retrieval system tends to bring to mind images of online library catalogs, licensed databases, and search engines such as Google, in fact, an information retrieval system can be any system that provides information to users, including not only online systems but also brick-and-mortar libraries, telephone help lines, and parts catalogs. Because the term is most commonly used to refer to online systems, the discussion that follows focuses on information retrieval systems in the online realm.
An online information retrieval system includes four main components: a database, which consists of a data structure made up of records and fields, rules for each field, and an indexing scheme; a search engine, which is a computer application that executes searches based on queries entered by users; a user interface, which is a search screen connecting users to the data contained in the system; and the information seeker, the person who uses the system to retrieve information. Information seekers include known item searchers, subject searchers, and browsers.
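The four components described above can be sketched in miniature. The following is an illustrative sketch only, with hypothetical record values and function names; it is not drawn from any actual system discussed in this statement.

```python
# 1. Database: records made up of named fields (values are illustrative).
database = [
    {"title": "Yellowstone National Park", "country": "United States"},
    {"title": "Historic Centre of Rome", "country": "Italy"},
]

# Indexing scheme: a keyword index mapping each word to record positions.
index = {}
for pos, record in enumerate(database):
    for value in record.values():
        for word in value.lower().split():
            index.setdefault(word, set()).add(pos)

# 2. Search engine: executes a query against the index.
def search(query):
    return sorted(index.get(query.lower(), set()))

# 3. The user interface (a search screen) and 4. the information seeker
# who enters the query complete the picture; here the "interface" is
# simply a function call.
print(search("rome"))  # -> [1]
```

Even this toy version shows why design decisions matter: whether fields are keyword-indexed word by word, as here, or indexed as whole terms determines which queries will succeed.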
Information retrieval systems are complex systems created by humans, and as such, much thought must go into their design in order to make them effective in providing information to users. Decisions must be made about the structure of the database and the indexes, the application of controlled vocabularies to the data which populates the system’s records, rules for data entry, and whether the database will admit additional records in the future (i.e., be open or closed). Effective design takes into account the needs and habits of the people who will use the system.
Querying, also known as searching, is the process by which information seekers retrieve information stored in an information retrieval system. Both known item searchers and subject searchers can use querying to retrieve needed information. Since browsers discover useful information by browsing rather than by active searching, querying is less important to this type of user. The formulation of an effective search strategy requires an understanding of the design, structure, and organization of the system to be searched and the needs of the information seeker. The amount and specificity of information needed are important factors in formulating an effective search strategy. Some types of advanced searches that a given system may or may not support include Boolean, proximity, range, and truncation searching. In formulating a search, it is also important to understand what types of indexes are included in the system (whether keyword, term, both, or neither) and how they apply to the various fields within the database.
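Two of the advanced search types named above, Boolean and truncation searching, can be sketched against a keyword index as follows. The index contents and term names here are hypothetical, chosen only to make the behavior concrete.

```python
# Hypothetical keyword index: term -> set of matching record IDs.
index = {
    "skiing": {1, 4},
    "hiking": {1, 2, 3},
    "1978": {2, 4},
    "1984": {1},
}

def boolean_and(term_a, term_b):
    """Boolean AND: only records indexed under BOTH terms."""
    return index.get(term_a, set()) & index.get(term_b, set())

def truncation(prefix):
    """Truncation search (e.g. '19*'): any indexed term starting
    with the given prefix."""
    hits = set()
    for term, ids in index.items():
        if term.startswith(prefix):
            hits |= ids
    return hits

print(boolean_and("skiing", "hiking"))  # -> {1}
print(truncation("19"))                 # -> {1, 2, 4}
```

The truncation example shows why a searcher must know how a system's indexes work: a prefix search over a term index behaves very differently from the same string submitted to a keyword search.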
Evaluation is an important component of the discipline of library and information science, and it applies no less to information retrieval systems than to any other tool, program, service, technique, or methodology employed within our field. The evaluation of an information retrieval system can be approached from several perspectives, including the user’s point of view, a technical point of view, and a management point of view. For a user, the true measure of an information retrieval system is how effective it is in providing needed information. Technically speaking, an information retrieval system may be evaluated based on many factors, including its ability to redirect from non-preferred to preferred index terms, its ability to automatically stem search terms, and its ability to perform Boolean or other advanced searches. From a management point of view, an information retrieval system may be evaluated based on its overall value given the amount of money, labor, and technology required to provide access to it.
Preparation and Evidence
To demonstrate my ability to design, query, and evaluate information retrieval systems, I present two assignments that I completed for courses in the online School of Library and Information Science (SLIS) at San Jose State University (SJSU). My first piece of evidence is a database that I designed for LIBR 202, Information Retrieval; for this assignment, I worked with a team to design a data structure and to collect and enter data, then built my database using Inmagic’s DB/TextWorks software, tested my database using several queries, and wrote a paper describing and evaluating my process and outcome. My second piece of evidence is a paper I wrote for LIBR 247, Vocabulary Design, in which I evaluate two thesaurus-enhanced search systems. Taken together, these two papers demonstrate my ability to design, query, and evaluate information retrieval systems.
First Piece of Evidence: UNESCO World Heritage Travel Database, LIBR 202
In Fall 2009, I took the course “Information Retrieval.” In one of the major assignments of the course, students were asked to create a simple database. The requirements of the assignment were to define a database purpose and user group, collect data for 50 records, build a data structure with 10-15 fields and six rules for each field, enter 30 records into DB/TextWorks, test the database, and write a paper describing the process and evaluating the final product. The assignment stipulated that the theme of the database would be UNESCO World Heritage Sites, that the database would be closed (meaning it would not grow through the addition of other records), and that it would be flat (meaning that the fields within each record would not have a specific relationship to one another). Students were instructed to be creative in defining the database purpose, user group, and data structure and were permitted to work collaboratively on all aspects of the assignment except the final paper.
For this assignment, I teamed up with three other students to define a database purpose and user group, collect data, and design a data structure. The user group that we chose, described on page 2 of my paper, consists of travel agents working for an agency specializing in travel in Europe and North America. The purpose that we chose, described on pages 2-3 of my paper, is to facilitate travel planning and the creation of marketing materials by travel agents for a promotion featuring UNESCO World Heritage Sites. Other tasks that we completed as a team were defining our database fields and collecting data.
Before the work of data collection began, we decided that we needed at least a draft data structure to guide our efforts and ensure quality and consistency in data entry. I took on the task of drafting the data structure. Since we had already identified our database fields, I focused on defining the six required rules for each field, which dealt with variables such as type of data (e.g. text, number, or date), indexing (e.g. keyword, term, or none), unique entries, required entries, repeatability, and content validation. The final data structure appears on pages 4-10 of my paper. On pages 12-13 of my paper, I describe my rationale for defining the rules as I did; considerations I mention include the nature of the data and anticipated usage patterns by the user group. Aiming for effective precision and recall, our group also decided to apply controlled vocabularies to various fields in our data structure. I discuss our decision-making process on pages 13-14 of my paper. On pages 14-15 I discuss our team’s data sources and data collection process, and on pages 15-16 I discuss quality control measures that I implemented both individually and with my team to ensure consistency during the data entry process. Pages 16-17 conclude my discussion of the process of designing and constructing my database, at which point I turn my attention to querying.
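The six per-field rules described above (data type, indexing, unique entries, required entries, repeatability, and content validation) lend themselves to a simple tabular representation. The sketch below is purely illustrative: the field names, value ranges, and the validation function are assumptions for demonstration, not the actual data structure from my assignment (the one controlled term shown, “skis and skiing,” is mentioned later in this statement).

```python
# Hypothetical field rules in the spirit of the six required rules.
field_rules = {
    "SiteName": {
        "type": "text", "indexing": "term", "unique": True,
        "required": True, "repeatable": False, "validation": None,
    },
    "YearInscribed": {
        "type": "number", "indexing": "term", "unique": False,
        "required": True, "repeatable": False,
        # Assumed range for illustration only.
        "validation": lambda v: 1978 <= int(v) <= 2009,
    },
    "Activities": {
        "type": "text", "indexing": "keyword", "unique": False,
        "required": False, "repeatable": True,
        # Controlled vocabulary check against an assumed term list.
        "validation": lambda v: v in {"hiking", "skis and skiing", "swimming"},
    },
}

def validate(field, value):
    """Apply a field's content-validation rule, if it has one."""
    rule = field_rules[field]["validation"]
    return rule is None or rule(value)

print(validate("YearInscribed", "1984"))  # -> True
```

Encoding the rules this way makes the quality-control motivation clear: validation and controlled vocabularies catch inconsistent entries at data-entry time, before they can degrade precision and recall.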
The queries I describe on pages 17-19 were aimed at testing my database. I ran a number of queries to test the database’s precision and recall, its keyword index, and its response to truncation searches. For example, I searched for all the cultural sites in Europe, all sites with skiing listed as an activity available nearby (which was dependent upon an effective keyword index, since the controlled term for skiing was “skis and skiing”), and all sites inscribed in the 1900s (which required a truncation search for the term “19*”). I found that my database performed reliably and consistently.
My paper concludes by evaluating my database and the process I went through in creating it. I evaluate the database primarily from a user’s point of view, concluding that the intended users in the travel agency should find it to be “an authoritative and accurate source of information for their work.” According to course instructor Nancy MacKay, success in information retrieval systems is determined by whether the user gets exactly the information she needs and none other, gets all the information she needs, and gets reliable and authoritative information at the level of detail and difficulty required (MacKay, 2009). In this case, I conclude that my information retrieval system is indeed successful. This paper demonstrates my ability to design, query, and evaluate information retrieval systems.
Second Piece of Evidence: Evaluation of Thesaurus-Enhanced Search Systems, LIBR 247
In Fall 2011, I took the course “Vocabulary Design” with Dr. Ali Shiri. The course covered the topics of indexing, abstracting, and thesaurus construction. The culminating assignment of the course was a paper in which students were asked to carry out a comparative evaluation of two thesaurus-enhanced search systems from both a user interface and an information retrieval perspective. Students were instructed to select one commercial and one non-commercial thesaurus-enhanced search system for comparison. For my paper, I compared PsycINFO, a product of EBSCO Industries, Inc., and the California Environmental Information Catalog, a product of the State of California. After a brief introduction, my paper includes three sections: “Evaluation of Interface,” “Evaluation of Retrieval,” and “Usability Critique and Suggestions for Improvement.” This paper demonstrates my ability to query and evaluate information retrieval systems; my evaluation takes into account many factors, including usability, design, and technical performance.
In the section of my paper entitled “Evaluation of Interface,” I address a series of questions posed by Dr. Shiri in the assignment. Topics include browsing and searching of the thesaurus, help functions, hyperlinking, query construction, interface features, displays, predictive searching, and other advanced features. I first examine the PsycINFO system. I describe how the use of varying color schemes to denote different sections of the system enhances usability, and I compare the different search options available within the thesaurus, pointing out that while both the “Term Begins With” and “Term Contains” search functions work quite well, the “Relevancy Ranked” search function does not perform quite as expected. I also explore features which aid users in performing thesaurus-based searches, including the ability to construct queries using Boolean operators, the ability to “Explode” terms, and the ability to select “Major Concepts.” After examining the PsycINFO system, I turn my attention to the California Environmental Information Catalog. Here I again describe features which aid users in performing thesaurus-based searches; in this case, these include hyperlinks to the three fundamental facets of the thesaurus, “Natural resources,” “Natural environment,” and “Human environment,” through which users can “drill down” to the thesaurus’s narrowest terms and also find related terms. I go on to discuss the layout of thesaurus screens, noting that while the display is clean and easy to read, some of the terminology used on the buttons is ambiguous. I also discuss some technical problems that arise through the construction of complex queries. After comparing the different displays and advanced features available within each system, I rate the two systems against one another. On the basis of its easy navigation, help techniques, and use of color and other visual elements, I rate the PsycINFO interface higher in usability.
In the section of my paper entitled “Evaluation of Retrieval,” I address another series of questions posed by Dr. Shiri in the assignment. Topics here include searching, results, displays, thesaurus browsing, and non-descriptor to descriptor redirection. In this section, I describe queries I constructed and performed to test the systems and evaluate how the systems performed. Again, I begin with an examination of PsycINFO. In this section, I describe two thesaurus-based searches I performed in the PsycINFO system, observing that the system performed as expected. I also describe the results of a search in which one of the terms was changed to a non-preferred term, deducing that in this case, “the system reverted to an exclusively keyword search and ignored the subject index entirely” (p. 10). I go on to describe the results of searches performed in the California Environmental Information Catalog, pointing out that based on mathematical calculations, “the results observed by searching for documents indexed with individual terms in the thesaurus are inconsistent with those observed by combining terms in the default search field accessible via the ‘Discover’ tab from the thesaurus” (p. 11). After testing the system using two Boolean queries, I conclude, again based on mathematical calculations, that the system does not permit Boolean searching and instead executes simple keyword searches.
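The kind of mathematical check described above rests on set identities that any true Boolean search must obey. The sketch below, with made-up result counts, shows two such checks; if a system's reported counts violate them, its “Boolean” operators are likely executing simple keyword searches instead.

```python
def consistent_and(count_a, count_b, count_a_and_b):
    """A true Boolean AND can never return more hits than either
    individual term: |A AND B| <= min(|A|, |B|)."""
    return count_a_and_b <= min(count_a, count_b)

def consistent_or(count_a, count_b, count_a_and_b, count_a_or_b):
    """Inclusion-exclusion: |A OR B| = |A| + |B| - |A AND B|."""
    return count_a_or_b == count_a + count_b - count_a_and_b

# Hypothetical counts: 40 hits for term A, 25 for term B,
# but 60 for the query "A AND B".
print(consistent_and(40, 25, 60))  # -> False: not a true Boolean AND
```

These identities make the evaluation objective: no knowledge of the system's internals is needed, only the result counts it reports.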
My paper concludes with the section entitled “Usability Critique and Suggestions for Improvement.” In this section, I reiterate some of the points made in my earlier evaluations and criticize the interfaces and their features from an end-user perspective. I argue that, by virtue of its simple and clean appearance, effective organization, easy navigation, and inclusion of a Help file, PsycINFO comes out as the winner from a usability perspective. The California Environmental Information Catalog, because of its lack of a Help file, ambiguity in labels, confusing organization, and seeming inability to handle complex queries, pales in comparison. I offer several suggestions for improvement for each thesaurus-enhanced system. For PsycINFO, I recommend a “Thesaurus Terms A to Z” index, improved relevancy ranking of search results, and some minor interface improvements. For the California Environmental Information Catalog, I recommend the addition of a Help file, definition of ambiguous terms used within the system, clear roadmaps, and the repair or removal of a feature which consistently generates errors. Because of its emphasis on interface design, the search process, and evaluation based on several criteria, this paper provides evidence of my ability to design, query, and evaluate information retrieval systems.
My work with information retrieval systems has been among the most enjoyable and interesting work that I have done in the SLIS program. Through my coursework and experience, I have discovered my own aptitude for database design, thesaurus construction, indexing, and organizing information. I have become a skilled user of various information retrieval systems, which allows me to thoughtfully and thoroughly evaluate them. I am interested in the evolution of information retrieval systems, including library OPACs, from both a usability perspective and from the perspective of participatory services. I expect that in the future, information retrieval systems will continue to play an important role in my work, and I look forward to drawing upon the knowledge I have gained through my studies and work experience as I design, query, and evaluate them.
MacKay, N. (2009). Evaluating information retrieval systems. Unpublished lecture notes, San Jose State University.