
Comparison of Information Retrieval Models

 

Abstract


Information retrieval is a developing field of computer science concerned with
storing data and retrieving it at the request of the user. Its most important
task is retrieving the documents relevant to a given query. Effective and
efficient retrieval models have been proposed for this task. This survey paper
reviews some of these information retrieval models, which were designed for
different datasets and purposes. A comparison among these models is also
presented.

Keywords: Information retrieval, retrieval
models.

 

Introduction

The amount of information available in electronic form is large and
continuously increasing. Without information retrieval, handling this
information would be impossible. As the size of data grew, researchers began
paying attention to how to obtain or extract relevant information. Initially,
much of information retrieval technology was based on experimentation and trial
and error. Managing the increasing amount of textual information available in
electronic form efficiently and effectively is crucial. Different retrieval
models have been developed, based on different principles, to cope with this
growth and extract the information.

 

Information is most often stored in the form of documents. The main purpose of
a retrieval system is to find the needed information. An information retrieval
system is a software program capable of storing and managing information in the
form of documents, often textual documents but possibly multimedia. The system
helps users find the information they require. A perfect retrieval system would
retrieve only the relevant documents, but in practice this is not achievable,
since relevance depends largely on the individual judgment of the user.

Basic patterns of models:

Almost every retrieval model includes the following basic steps:

Document content representation
Query representation
Query and collection comparison
Representation of results

 

Figure 1 Information Retrieval Process (Hiemstra, November 2009)

Many models represent documents in indexed form, as this is an efficient
approach. Different indexing algorithms have been developed, since the better
the data is stored, the more efficiently it can be retrieved.

The next important step is query formulation. Users mostly search for data
through keywords or phrases. To be matched against the documents, the query
must be represented in the same form as they are. Indexing can be performed in
different ways, depending on the content representation of both the documents
in the collection and the user query. (Cerulo, 2004) (Hiemstra, November 2009)

The outcome of a retrieval system depends on its matching algorithm, which
therefore determines the precision of the system. The better the comparison,
the better the results. The outcome of this comparison is a list of documents,
which may be relevant or irrelevant. The main purpose of a retrieval model is
to measure the degree of relevance of a document to the given query.
(Paik, August 13, 2015)

Relevant documents should be ranked higher than irrelevant ones and shown at
the top of the list, to minimize the time and effort the user spends searching
for documents.

The paper is divided into sections, each of which explains a different model
and its results, along with its significance and limitations.

Retrieval Models

Exact match models

 

This model labels documents as either relevant or irrelevant. It is also known
as the Boolean model, the earliest and simplest model for retrieving documents.
It uses logical functions in the query to retrieve the required data. George
Boole's mathematical logic operators are combined with query terms and their
respective document sets to form new sets of documents. There are three basic
operators: AND (logical product), OR (logical sum) and NOT (logical difference)
(Ricardo Baeza-Yates, 2009). The result of the AND operator is a set of
documents smaller than or equal to the document set of any of the individual
terms. The OR operator results in a document set that is larger than or equal
to the document set of any of the individual terms.

 


The Boolean model gives users a sense of control over the system. If the query
is accurate, relevant and irrelevant documents are clearly distinguished.
However, the model does not rank documents, since the degree of relevance is
ignored entirely. A document is either retrieved or not, which can frustrate
the end user.
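
The set behaviour of the three Boolean operators described in this section can be illustrated with a small sketch. The inverted index, the document ids and the helper name docs are hypothetical and chosen only for illustration; this is not taken from any of the cited systems.

```python
# Toy inverted index: term -> set of ids of documents containing the term.
# The terms and document ids are made-up illustration data.
index = {
    "information": {1, 2, 4},
    "retrieval": {1, 3, 4},
    "boolean": {2, 4},
}
all_docs = {1, 2, 3, 4}


def docs(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return index.get(term, set())


# AND is set intersection, OR is set union, NOT is difference from the collection.
and_result = docs("information") & docs("retrieval")   # {1, 4}
or_result = docs("information") | docs("retrieval")    # {1, 2, 3, 4}
not_result = all_docs - docs("boolean")                 # {1, 3}

print(and_result, or_result, not_result)
```

Note that each result is a plain set: a document is either in it or not, and nothing in the result orders the documents by relevance.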

 

Region models

Region models are an extension of the Boolean model that reason about arbitrary
parts of textual data, called segments, extents or regions. A region can be a
word, a phrase, a text element such as a title, or a complete document. Regions
are identified by a start position and an end position. Region systems are not
restricted to retrieving documents. Region models have not had a great impact
on the information retrieval research community, nor on the development of new
retrieval systems. The reason is that region models do not explain in any way
how to rank the search results. In fact, most region models are not concerned
with ranking at all; one might say that, like the relational model, they are
actually data models rather than information retrieval models. (Mihajlović)
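
As a rough illustration of regions identified by a start and an end position, the following sketch selects the regions that contain a given word occurrence. The containment operator, the function names and the positions are assumptions made for this example; actual region models define their own operator sets.

```python
from typing import List, Tuple

# A region is a (start position, end position) pair, both inclusive.
Region = Tuple[int, int]


def contains(outer: Region, inner: Region) -> bool:
    """True if the inner region lies entirely within the outer region."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def containing(a: List[Region], b: List[Region]) -> List[Region]:
    """Return the regions of a that contain at least one region of b."""
    return [r for r in a if any(contains(r, s) for s in b)]


# Hypothetical data: two title regions and two occurrences of a query word.
titles = [(0, 3), (20, 24)]
word_positions = [(2, 2), (10, 10)]

titles_with_word = containing(titles, word_positions)
print(titles_with_word)  # [(0, 3)]
```

The result is an unordered selection of regions, which matches the observation above that region models say nothing about ranking.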

 

Ranking Models

The Boolean model may miss important data because it does not support a
ranking mechanism. There was therefore a need to introduce ranking algorithms
into retrieval systems. The results are ranked according to the occurrence of
the query terms in the documents. Some ranking algorithms depend only on the
link structure of the documents, whereas others combine document content with
the link structure to assign a rank value to a given document. (Gupta, 2013)

Similarity measures/coefficient

The query is compared with the document set, and the documents most similar to
it are returned to the user. There are several ways to measure similarity, such
as cosine similarity, tf-idf, etc.
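
Since tf-idf is only named here, a minimal sketch of one common variant may help: the weight of a term in a document is its term frequency multiplied by log(N / df), where N is the number of documents and df the number of documents containing the term. The exact formula, the function name tf_idf and the three toy documents are assumptions for illustration; many other tf-idf variants exist.

```python
import math
from collections import Counter

# Hypothetical toy collection: document id -> list of tokens.
docs = {
    "d1": "information retrieval models".split(),
    "d2": "boolean retrieval".split(),
    "d3": "probabilistic ranking models".split(),
}


def tf_idf(collection):
    """Weight each term of each document by tf * log(N / df)."""
    n = len(collection)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in collection.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in collection.items():
        tf = Counter(tokens)
        weights[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights


weights = tf_idf(docs)
# "information" occurs in one document, "retrieval" in two,
# so "information" receives the larger weight in d1.
print(weights["d1"])
```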

Cosine similarity

Cosine similarity is the cosine of the angle between two vectors in
n-dimensional space. The cosine similarity between documents d and d' is given
by:

cos(d, d') = (d · d') / (|d| |d'|)
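
The formula above can be written out directly. This is a minimal sketch using plain Python lists as term vectors; the function name, the example vectors and the convention of returning 0.0 for a zero-length vector are assumptions made here.

```python
import math


def cosine_similarity(d, d_prime):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(x * y for x, y in zip(d, d_prime))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_d_prime = math.sqrt(sum(y * y for y in d_prime))
    if norm_d == 0 or norm_d_prime == 0:
        return 0.0  # assumed convention: an empty vector is similar to nothing
    return dot / (norm_d * norm_d_prime)


# Hypothetical vectors over a three-term vocabulary.
query = [1, 1, 0]
doc_a = [2, 2, 0]   # same direction as the query, so similarity is close to 1
doc_b = [0, 0, 3]   # shares no terms with the query, so similarity is 0

print(cosine_similarity(query, doc_a), cosine_similarity(query, doc_b))
```

Because the measure depends only on the angle, doc_a scores the same as a document with vector [1, 1, 0] even though its term counts are twice as large.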

 

The performance of the vector-based retrieval model can be improved by using
user-supplied information about which documents are relevant to the query in
question. (Kita, Oct 1, 2000)

The vector space model for information retrieval indicates to users which
documents are more similar to the query, and hence more relevant, by
calculating the angle between the query and the documents. Here documents are
represented as term vectors (Vaibhav Kant Singh, 2015).

d = (t1, t2, t3, ..., tn)

where ti, for 1 <= i <= n, is a non-negative value denoting the occurrences of
term i in the document d. In the binary case, ti takes a value in {0, 1}.

Probabilistic model

The probabilistic model is centered on the probability ranking principle.
Statistics are used to estimate the probability that a retrieved document is
relevant or non-relevant to the information need. Probabilistic models compute
this conditional probability given the occurrence of terms. According to the
probability ranking principle, the retrieval system ranks the documents by
their probability of relevance to the query, given all the available evidence.
The documents are ranked in decreasing order of probability. Index term weights
are binary.

Bayesian network Model

A Bayesian network model (BNM) is a directed acyclic graphical model, meaning
that it contains no directed cycles, whose nodes are random variables. A BNM
contains a set of random variables and the conditional probability dependencies
between them. Such models are also known as belief networks, causal nets, etc.
A BNM ranks documents by using multiple sources of evidence to compute a
conditional probability. The graphical representation of the probability
distribution makes it possible to analyse complex conditional independence
assumptions.

Inference Network Model

In the inference network retrieval model, the random variables are arranged in
four layers of nodes: a query node, a set of document nodes, representation
nodes and index word nodes. The dependencies between the random variables are
represented as edges. All nodes in this model represent binary random variables
with values in {0, 1}.

Figure 2 Simplified inference network model (Hiemstra, November 2009)

Language based models:

Language-based models are retrieval models based on an idea from speech
recognition.
Speech recognition depends on two distinct models: the acoustic model and the
language model. In retrieval, a language model is computed, based on terms, for
each document in the collection. Documents are ranked by the probability that
their model generates the query.

Alternative Algebraic model

This section discusses two further models: latent semantic indexing and the
neural network model.

Latent Semantic Indexing

Latent semantic indexing (LSI) supports accurate information retrieval in large
databases. The similarity of documents depends on the context of the words that
are present and absent. LSI combines singular value decomposition (SVD) with
the vector space model. LSI captures documents that are semantically similar,
i.e. on the same topic, even when they are not similar in the original vector
space, and represents them in a reduced vector space where their similarity is
high. To compute LSI using SVD, a matrix A is decomposed into three matrices:

A = U Σ V^T

where Σ is a diagonal matrix, U is an orthogonal matrix and V^T is the
transpose of an orthogonal matrix V.

Jin Wang et al. (Jin Wang, 9May2012) propose a model that uses the bag-of-words
model for the analysis of human motion in video frames.

Ontology-based Information Retrieval

One of the most rapidly emerging areas of information retrieval and extraction
today is ontology-based information retrieval (OBIE). OBIE is defined as the
use of ontologies to retrieve information. An ontology is a specification of a
conceptualization of terms or words. Ontologies are generally domain-specific,
so different domains have different ontologies. Because they are
domain-specific, they capture the relationships between classes and entities.
They are application dependent. An ontology tree is a hierarchical
representation of classes or entities and the relationships between them,
grouping and classifying entities on the basis of their similarities and
dissimilarities.
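
To make the idea of an ontology tree concrete, the sketch below stores a small class hierarchy and uses it to expand a query term with its subclasses before keyword matching. The hierarchy, the function names and the use of query expansion are all assumptions for illustration; the survey does not prescribe how the ontology is applied.

```python
# Hypothetical ontology tree: class -> list of direct subclasses.
ontology = {
    "vehicle": ["car", "bicycle"],
    "car": ["sedan", "suv"],
    "bicycle": [],
    "sedan": [],
    "suv": [],
}


def descendants(concept):
    """Return all classes below the concept, in depth-first order."""
    result = []
    for child in ontology.get(concept, []):
        result.append(child)
        result.extend(descendants(child))
    return result


def expand_query(terms):
    """Add every subclass of each query term, without duplicates."""
    expanded = []
    for term in terms:
        for concept in [term] + descendants(term):
            if concept not in expanded:
                expanded.append(concept)
    return expanded


print(expand_query(["car"]))      # ['car', 'sedan', 'suv']
print(expand_query(["vehicle"]))  # ['vehicle', 'car', 'sedan', 'suv', 'bicycle']
```

A term that is not in the ontology is passed through unchanged, so the expansion never removes anything from the user's query.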
Figure 3 Ontology based information extraction (Ritesh Shah, February 2014)

Conclusion:

This survey paper has discussed different information retrieval techniques
along with their advantages and disadvantages. Each model has its own criteria
for extracting the documents relevant to the user's query. We conclude that
some methods work best for some applications, while others work best for other
applications. Information retrieval systems are used in many organizations, and
new models are still being developed to obtain more relevant results.

| Model | Related work | Methods | Advantages | Limitations |
|---|---|---|---|---|
| Exact match model | David E. Losada | Based on set theory and Boolean algebra; query represented by a Boolean expression; terms combined with the operators AND, OR, NOT; proximity; stemming | Easy to implement; exact match model; computationally efficient | No term weighting used in document and query; adds too much complexity and detail; difficult for end users to form a correct Boolean query; no ranking; no partial matching |
| Vector space model | | Weighting scheme used; cosine similarity; ranks documents by similarity | Improves retrieval performance by term weighting; similarity can be used for different elements | Term independence assumption; users cannot specify relationships between terms |
| Probabilistic model | | Based on the probability ranking principle; based on relevance and non-relevance of data | Ranking of documents; does not consider index inside a document | Binary word-in-document weights; independence of terms; only partial ranking of documents; based on prior knowledge |
| Language based models | | Probability estimation of events in text; query likelihood model; speech recognition; term based for each document in the entire collection | Length normalization of term frequencies | Data sparsity |
| Bayesian network model | | Directed graphical model; relationships between random variables captured by directed edges | Deals with noisy data; describes interaction between query and document space; query specification based on Boolean expressions | Expensive computation; bad performance for small collections |
| Inference network model | | Random variables concerned with query, set of documents and index words | Provides a framework with possible ranking strategies; Boolean query formulation can be used | |
| Latent semantic indexing | | Concept-based retrieval of text; uses SVD | Retrieves documents even if they share no keyword with the query; solves the problem of ambiguities (polysemy and synonymy) | Expensive; works on small collections |
| Ontology-based information retrieval | | Hierarchical classification of entities; keyword matching based | Ontology can be reused and shared with other applications | High time consumption; difficult to create the ontology tree; adding a new concept to an existing ontology requires considerable time and effort |
| Neural network model | | Neural based; weights assigned to edges between neurons | Easy to use but requires some statistical training; deals with large collections of data; detects relationships between the query and retrieved documents | Difficult to design; expensive; complicated in nature; does not deal with small documents |

References

Cerulo, G. C. (2004). A Taxonomy of Information. Journal of Computing and Information Technology, 175–194.

Daniel Valcarce, J. P. (n.d.). A Study of Smoothing Methods for Relevance-Based Language Modelling of Recommender Systems. Information Retrieval Lab, Computer Science Department, University of A Coruña, Spain. {daniel.valcarce,javierparapar,barr iro}@udc.es.

Gupta, P. K. (2013). Survey Paper on Information Retrieval Algorithms and Personalized Information Retrieval Concept. International Journal of Computer Applications.

Hiemstra, D. (November 2009). Information Retrieval Models. Published in: Goker, A., and Davies, J. Information Retrieval: Searching in the 21st Century. John Wiley and Sons, Ltd.

Hui Yang, M. S. (2014). Dynamic Information Retrieval Modeling. SIGIR'14, July 6–11, 2014, Gold Coast, Queensland, Australia. ACM 978-1-4503-2257-7/14/07. http://dx.doi.org/10.1145/2600428.2602297.

Igor MOKRIŠ, L. S. (2006). Neural Network Model Of System For Information Retrieval From Text Documents In Slovak Language. ActaElectrotechnica et.

Kita, X. T. (Oct 1, 2000). Improvement of vector space information retrieval model based on supervised learning. IRAL '00 Proceedings of the fifth international workshop on Information retrieval with Asian languages, ACM, New York, NY, USA, 69-74.

Koltun, E. K. (July 4, 2012). A Probabilistic Model for Component-Based Shape Synthesis. ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH 2012.

Mihajlović, D. H. (n.d.). A database approach to information retrieval: The remarkable. University of Twente, Centre for Telematics and Information Technology.

Paik, J. H. (August 13, 2015). A Probabilistic Model for Information Retrieval Based on Maximum Value Distribution. University of Maryland, College Park, USA, SIGIR'15.

Ricardo Baeza-Yates, B.-N. (2009). Modern Information Retrieval. ACM Press, New York.

Ritesh Shah, S. J. (February 2014). Ontology-based Information Extraction: An Overview. International Journal of Computer Applications (0975 – 8887).

Xi-Quan Yang, D. Y.-H. (2014). Scientific literature retrieval model based on weighted term frequency. IEEE Computing Society.