
Information retrieval on content based searching using
Hidden Markov Model


Khyati Sethi


Kanika Sharma

Jayant Gulati

Parul Yadav

Department of Information Technology, Bharati Vidyapeeth College of
Engineering, New Delhi, India

Department of Information Technology, Bharati Vidyapeeth College of
Engineering, New Delhi, India

Department of Information Technology, Bharati Vidyapeeth College of
Engineering, New Delhi, India

Professor, Department of Information Technology, Bharati Vidyapeeth College
of Engineering, New Delhi, India

Email Id: [email protected]

Email Id: [email protected]

Email Id: [email protected]

Email Id: [email protected]





Abstract: Data analysis is used to model data with the aim of discovering
useful information. In this work, information is retrieved and a Hidden Markov
Model is used to identify the relevance of a document. Relevance, evaluation,
and information needs are the key issues in data analysis and information
retrieval. Relevance is the relational value of a user's query with respect to
a dataset, and it is generally calculated using a ranking algorithm. Such
algorithms define how applicable a document is to a user query through
functions that relate the query to the indexed documents. Users need a data
access mechanism that is effortless and convenient. In some systems, retrieving
a large amount of information is inconvenient; in others, failing to return all
relevant information is unacceptable. After ascertaining the relevance of the
retrieved data using the Hidden Markov Model, we use precision and recall to
evaluate and analyse the model.

Keywords: Hidden Markov Model, information retrieval, relevance, precision,
recall


Data analysis is the process of inspecting, transforming, and shaping data
with the goal of discovering useful information and drawing conclusions. It
supports decision making. Information is retrieved by various IR methods, and
the retrieved data is then analysed further.

An IR system usually evaluates relevance by comparing some representation of
each document with the query. There are various models for representing
documents and queries, and each model has its pros and cons.

Data analysis has multiple facets and approaches, encompassing diverse
techniques under a variety of names in different business, science, and social
science domains. It is closely associated with the visualization and
dissemination of data. The term data analysis is sometimes used
interchangeably with data modelling.

Information retrieval refers to the task of extracting, from a collection of
information resources, those resources that are relevant to an information
need. Usually, searches can be based on metadata or on full-text (or other
content-based) indexing.

Over the last two decades, Hidden Markov models have been applied
successfully to a wide variety of speech and language recognition problems,
including speech recognition, named entity finding, optical character
recognition, and topic identification [1]. In the present work, we describe an
application of this technology to ad hoc IR [2]. In every HMM application, the
observed data is modelled as the output produced by passing some unknown key
through a noisy channel. For ad hoc IR, the observed data is the query and the
unknown key is the desired relevant document. Thus, for each document we can
compute the probability that it was the relevant document the user had in
mind, given the query. We then rank the documents by this probability.
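
The ranking idea above can be illustrated with a minimal sketch. It assumes one
common formulation, a two-state mixture in which each query term is emitted
either by the document or by a "general language" state estimated from the
whole collection. The function name, the whitespace tokenisation, and the
weight a_doc = 0.7 are illustrative assumptions, not details given in this
paper.

```python
import math
from collections import Counter


def rank_by_query_likelihood(query, docs, a_doc=0.7):
    """Rank document names by log P(query | document).

    Assumption: each query term is generated by a two-state mixture,
    a_doc * P(term | document) + (1 - a_doc) * P(term | collection).
    The weight a_doc is a hypothetical value chosen for illustration.
    """
    doc_counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    collection = Counter()
    for counts in doc_counts.values():
        collection.update(counts)
    total = sum(collection.values())

    scores = {}
    for name, counts in doc_counts.items():
        length = sum(counts.values())
        log_p = 0.0
        for term in query.lower().split():
            p_doc = counts[term] / length if length else 0.0
            p_general = collection[term] / total if total else 0.0
            p = a_doc * p_doc + (1 - a_doc) * p_general
            if p == 0.0:  # the term occurs nowhere in the collection
                log_p = float("-inf")
                break
            log_p += math.log(p)
        scores[name] = log_p
    # Highest likelihood first
    return sorted(scores, key=scores.get, reverse=True)


example_docs = {
    "d1": "hidden markov model for retrieval",
    "d2": "cooking pasta with tomato sauce",
    "d3": "markov chains and retrieval of documents",
}
ranking = rank_by_query_likelihood("markov retrieval", example_docs)
```

The general-language state keeps a document from receiving zero probability
merely because one query term is missing from it.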


Data mining is a distinct data analysis technique that focuses on modelling
and knowledge discovery for predictive rather than purely descriptive
purposes. Business intelligence covers data analysis that relies heavily on
aggregation and focuses on business information. Customer data and IT tools
form the foundation on which a successful CRM strategy is built. The rapid
growth of the web and related technologies has also substantially increased
the number of marketing opportunities and has changed the way relationships
between companies and their clients are managed [3].

Predictive analytics applies statistical models for forecasting or
classification, while text analysis applies statistical, linguistic, and
structural techniques to extract and classify information from textual
sources.

Retrieving information from the web means handling the abstractness and
volume of the data on the internet. Word ambiguity and a large number of
typographical errors make the task even harder. Several key issues concern IR:
relevance, evaluation, and information needs.

However, this is not the complete set of issues in IR. Common information
retrieval problems also include performance, scalability, and the frequency of
page updates. Relevance is the relational value of a user's query with respect
to a dataset, and it is calculated using a ranking algorithm.

Among these, evaluation and relevance remain the larger complications in IR
and still require attention.

The documents and their queries form a corpus of terms in which every term of
a document is indexed. In the Boolean model, 1 and 0 denote the presence and
absence of a term in a text source, respectively [4, 5]. An inverted index of
every term must be maintained in order to match documents against queries.
This model has certain limitations, however. Its binary decision criterion
lacks any notion of a grading scale, so documents cannot be ranked by degree of
relevance. Another problem is the overloading of documents. Some researchers
have tried to overcome these weaknesses by improving the existing model, while
others have approached data analysis with a different, vector-based search
strategy. This is known as the Vector Space model [5].
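
As an illustration of the Boolean model described above, the following minimal
sketch builds an inverted index and answers a conjunctive (AND) query. The
function names, the whitespace tokenisation, and the toy documents are
assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


toy_docs = {1: "hidden markov model", 2: "markov chain", 3: "vector space model"}
toy_index = build_inverted_index(toy_docs)
matches = boolean_and(toy_index, "markov model")
```

A document either matches or it does not, so the result is an unordered set.
This is the missing grading scale mentioned above.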

This model represents documents and queries as vectors. Every query and
document is expressed as a vector in a |V|-dimensional space, where V, the set
of all distinct terms in the document collection, is the vocabulary [5].
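
A minimal sketch of this representation follows. It assumes raw term counts as
vector components and cosine similarity as the matching function; both are
simplifying assumptions (weighted schemes such as TF-IDF are common in
practice), and the function names are illustrative.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Return the sorted list of distinct terms V across all documents."""
    return sorted({term for text in docs for term in text.lower().split()})


def to_vector(text, vocabulary):
    """Map a text to a |V|-dimensional vector of raw term counts."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


corpus = ["hidden markov model", "vector space model", "markov chain"]
vocabulary = build_vocabulary(corpus)
query_vector = to_vector("markov model", vocabulary)
ranked = sorted(
    corpus,
    key=lambda d: cosine_similarity(query_vector, to_vector(d, vocabulary)),
    reverse=True,
)
```

Unlike the Boolean model, the similarity score is graded, so documents can be
ordered by how closely they match the query.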

Markov processes were first proposed by the Russian mathematician Andrei
Markov. In probability theory, a Markov model is a stochastic model used to
model randomly changing systems. It assumes that future states depend only on
the present state, not on the sequence of events that preceded it [1, 2, 6].
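
The Markov property can be made concrete with a small simulation. The
two-state weather chain and its transition probabilities below are hypothetical
values chosen for illustration; the point is that next_state looks only at the
current state.

```python
import random

# Hypothetical transition probabilities; each row sums to 1.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def next_state(current, transitions, rng):
    """Sample the next state using only the current state (Markov property)."""
    states = list(transitions[current])
    weights = [transitions[current][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


def simulate(start, steps, transitions, rng=None):
    """Return a path of steps + 1 states beginning at start."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], transitions, rng))
    return path


path = simulate("sunny", 10, TRANSITIONS)
```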

There are four Markov models, used in different situations depending on how
observable each sequential state is.


A hidden Markov model (HMM) is a statistical Markov model in which the system
being modelled is assumed to be a Markov process with hidden, that is,
unobserved, states. An HMM can be regarded as the simplest dynamic Bayesian
network [7].
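
Because the states are hidden, the probability of an observed sequence has to
be summed over all possible state paths. The forward algorithm does this
efficiently. The sketch below is a generic illustration with hypothetical
states ("A", "B"), observations ("x", "y"), and probabilities; it is not the
specific model used in this work.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations) under the HMM using the forward algorithm."""
    if not observations:
        return 1.0
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy parameters, chosen only for illustration.
STATES = ("A", "B")
START_P = {"A": 0.6, "B": 0.4}
TRANS_P = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
EMIT_P = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.1, "y": 0.9}}

likelihood = forward(["x", "y", "y"], STATES, START_P, TRANS_P, EMIT_P)
```

The cost grows linearly with the length of the sequence, rather than
exponentially as it would if every state path were enumerated.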

To measure the effectiveness of ad hoc information retrieval in the standard
way, we require a test collection consisting of three things:

•             A collection of documents

•             A test suite of information needs, represented as a set of
queries

•             A set of relevance judgements, standardly a binary assessment of
either relevant or irrelevant for every query-document pair

The following parameters have traditionally been used to evaluate the
performance of IR systems:

Precision: the fraction of retrieved documents that are relevant. In practice,
it indicates the accuracy of the results.

Precision = |Set of relevant documents retrieved| / |Set of all retrieved documents|


Recall: the fraction of all relevant documents that are retrieved. In
practice, it indicates the coverage of the results.




Recall = |Set of relevant documents retrieved| / |Set of all relevant documents|


In pattern recognition and in IR with binary classification, precision is
the fraction of retrieved instances that are relevant, while recall is the
fraction of relevant instances that are retrieved. Both precision and recall
therefore depend on an understanding and measure of relevance.
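
Both measures follow directly from set operations on the retrieved and relevant
documents. The sketch below is a minimal illustration; the function name and
the document ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved and relevant| / |retrieved|
    recall    = |retrieved and relevant| / |relevant|
    An empty denominator yields 0.0 by convention in this sketch.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two in common.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Here two of the four retrieved documents are relevant, so precision is 2/4, and
two of the three relevant documents were found, so recall is 2/3.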

A. Language used: Python

Python provides constructs that help in writing clear programs at both small
and large scale. Its features include a dynamic type system and automatic
memory management. It also has a large and comprehensive standard library.

Python's large standard library provides tools suited to numerous tasks. It
includes modules for creating GUIs, connecting to relational databases,
generating pseudorandom numbers, performing arbitrary-precision decimal
arithmetic, and manipulating regular expressions. It also supports unit
testing.

B. Dataset used

The OHSUMED test collection is a set of 348,566 references from MEDLINE, the
online medical information database available on the World Wide Web. The
available fields in the database are title, MeSH indexing terms, author,
abstract, and source.
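
To show how such records might be read in Python, here is a minimal sketch. It
assumes the tagged plain-text layout in which the collection is commonly
distributed, where each record starts with ".I <number>" and each field tag
(.U, .S, .M, .T, .P, .W, .A) is followed by its value on the next line. Both
the layout and the field names in FIELD_NAMES are assumptions to check against
the documentation shipped with your copy of the data.

```python
# Assumed tag-to-field mapping; verify against the collection's own README.
FIELD_NAMES = {
    ".I": "seq_id",
    ".U": "medline_id",
    ".S": "source",
    ".M": "mesh_terms",
    ".T": "title",
    ".P": "publication_type",
    ".W": "abstract",
    ".A": "author",
}


def parse_ohsumed(lines):
    """Parse tagged lines into a list of dicts, one per record."""
    records, current, field = [], None, None
    for raw in lines:
        line = raw.rstrip("\n")
        tag, _, rest = line.partition(" ")
        if tag in FIELD_NAMES:
            if tag == ".I":  # a new record begins
                current = {}
                records.append(current)
            if current is None:
                continue  # skip stray fields before the first record
            field = FIELD_NAMES[tag]
            current[field] = rest.strip()
        elif current is not None and field is not None and line.strip():
            # Continuation line: append it to the field currently being read
            current[field] = (current[field] + " " + line.strip()).strip()
    return records
```

Calling parse_ohsumed(open(path, encoding="utf-8")) on a file in this layout
would yield one dictionary per reference, with keys such as "title" and
"abstract".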

The existing OHSUMED topics describe real information needs, although the
relevance judgements do not have the same coverage as that provided by the
TREC pooling process.