UNIT-I
Boolean retrieval: An example information retrieval problem, Building an inverted index, Processing Boolean queries, The extended Boolean model versus ranked retrieval.
The term vocabulary and postings lists: Document delineation and character sequence decoding, Determining the vocabulary of terms, Faster postings list intersection via skip pointers, Positional postings and phrase queries.
Dictionaries and tolerant retrieval: Search structures for dictionaries, Wildcard queries, Spelling correction.
Index construction: Hardware basics, Blocked sort-based indexing, Single-pass in-memory indexing, Distributed indexing, Dynamic indexing, Other types of indexes.
UNIT-II
Index compression: Statistical properties of terms in information retrieval, Dictionary compression, Postings file compression.
Scoring, term weighting and the vector space model: Parametric and zone indexes, Term frequency and weighting, The vector space model for scoring, Variant tf-idf functions.
Computing scores in a complete search system: Efficient scoring and ranking, Components of an information retrieval system, Vector space scoring and query operator interaction.
Evaluation in information retrieval: Information retrieval system evaluation, Standard test collections, Evaluation of unranked retrieval sets, Evaluation of ranked retrieval results, Assessing relevance.
UNIT-III
Relevance feedback and query expansion: Relevance feedback and pseudo relevance feedback, Global methods for query reformulation.
XML retrieval: Basic XML concepts, Challenges in XML retrieval, A vector space model for XML retrieval, Evaluation of XML retrieval, Text-centric vs. data-centric XML retrieval.
Probabilistic information retrieval: Basic probability theory, The Probability Ranking Principle, The Binary Independence Model.
Language models for information retrieval: Language models, The query likelihood model.
UNIT-IV
Text classification and Naive Bayes: The text classification problem, Naive Bayes text classification, The Bernoulli model, Properties of Naive Bayes, Feature selection.
Vector space classification: Document representations and measures of relatedness in vector spaces, Rocchio classification, k-nearest neighbour, Linear versus nonlinear classifiers.
Flat clustering: Clustering in information retrieval, Problem statement, Evaluation of clustering, k-means.
Hierarchical clustering: Hierarchical agglomerative clustering, Single-link and complete-link clustering, Group-average agglomerative clustering, Centroid clustering, Divisive clustering.
UNIT-V
Matrix decompositions and latent semantic indexing: Linear algebra review, Term-document matrices and singular value decompositions, Low-rank approximations, Latent semantic indexing.
Web search basics: Background and history, Web characteristics, Advertising as the economic model, The search user experience, Index size and estimation, Near-duplicates and shingling.
Web crawling and indexes: Overview, Crawling, Distributing indexes, Connectivity servers.
Link analysis: The Web as a graph, PageRank, Hubs and Authorities.
Suggested Readings:
1. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, An Introduction to Information Retrieval, Cambridge University Press, Cambridge, England, 2008.
2. David A. Grossman, Ophir Frieder, Information Retrieval: Algorithms and Heuristics, Springer, 2nd Edition (Distributed by Universities Press), 2004.
3. Gerald J. Kowalski, Mark T. Maybury, Information Storage and Retrieval Systems, Springer, 2000.
4. Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann Publishers, 2002.