605.744: Information Retrieval, Spring 2013
Exams on 4/29 and project presentations on 5/6
- Paul McNamee
Note: the overview below is for convenience. The official syllabus for Spring 2013 can be found in the Course Description handout.
Course Times and Location
- Lecture: Mondays, 7:20pm - 10:00pm
- Location: Room L-3. (FYI the L-rooms are up the street, around the circle, through the entrance, and left and downstairs in the vicinity of the part-time office.)
- Email should be the primary means of out-of-class communication; however, I can meet with students in person by appointment.
Course Overview
- This course covers the storage and retrieval of unstructured digital
information. Topics include automatic index construction,
retrieval models, textual representations, efficiency issues,
search engines, text classification, and multilingual retrieval.
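To make the first of these topics concrete: automatic index construction usually means building an inverted index, which maps each term to the documents that contain it. A minimal sketch in Python (the toy documents and whitespace tokenizer are illustrative only, not course material):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():  # naive whitespace tokenizer
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["the quick brown fox", "the lazy dog", "quick dog"]
index = build_inverted_index(docs)
print(index["quick"])                                    # [0, 2]
print(sorted(set(index["quick"]) & set(index["dog"])))   # AND query: [2]
```

Real systems add positional information, compression, and merge-based construction for collections too large for memory, but conjunctive queries reduce to exactly this kind of postings-list intersection.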
Grading Policy
- Work for the class will include homework assignments, an independent research project, exams,
and classroom participation (e.g., quizzes, oral presentations, paper summaries).
Refer to the course outline for details. Note: Starting in 2013 I will assign grades with
plus/minus modifiers (e.g., A+, B-, etc.), since the program now permits it.
Academic Integrity
- Work for this class is expected to be the result of individual
effort; however, unless explicitly prohibited, it is perfectly
acceptable to make use of published examples and even source code
from the literature or public domain - but only if attribution is
given. Furthermore, while it is permissible to discuss the general
nature of lecture material and assignments with your peers, this
does not extend to discussing or revealing solutions to assigned
problems or sharing source code. Students are expected to uphold the
academic integrity of the university. Students who use published
material without citation, or who copy the work of another
individual (including source code), will face consequences such
as receiving a zero on the assignment and having the matter referred
to the dean. Contact me if you have any questions about this policy,
or about a particular assignment.
- 1/28/13 Chapters 1 and 2 in Manning, Raghavan, and Schütze.
- 1/28/13 Michael Lesk, The Seven Ages of Information Retrieval (1995)
- 2/4/13 Chapters 3 and 4 in Manning, Raghavan, and Schütze.
- 2/11/13 Chapters 5 - 7 in Manning, Raghavan, and Schütze.
- 2/11/13 G. Salton and C. Buckley, Term-Weighting Approaches in Automatic Text Retrieval, IPM 24(5), pp. 513-523, 1988.
- 2/18/13 Chapters 8 and 9 in Manning, Raghavan, and Schütze.
- 2/18/13 Economic Impact of TREC (2010). Executive Summary and Sections 1-3
- 2/18/13 The list of papers we'll review has been posted. Email me by 2/25 with the top-3 papers you'd like to summarize for the class.
- 2/25/13 Chapters 11 and 12 in Manning, Raghavan, and Schütze.
- 3/4/13 Chapters 13, 14, and 15 in Manning, Raghavan, and Schütze.
- 3/4/13 Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features.
- 3/4/13 Not required reading: Goodman et al., Spam and the ongoing battle for the inbox, CACM 50(2), pp. 24-33, 2007.
- 3/11/13 Chapters 19, 20, and 21 in Manning, Raghavan, and Schütze.
- 4/1/13 K. Kishida, Technical issues of cross-language information retrieval: a review, IPM 41, pp. 433-455, 2005.
- 4/15/13 M. Sanderson, Retrieving with Good Sense, Information Retrieval 2(1), pp. 49-69, 2000.
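The Salton and Buckley paper in the readings above surveys term-weighting schemes; as a minimal illustration (my own toy example, not taken from the paper), the classic tf × log(N/df) weighting can be computed as:

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term in each document by tf * log(N / df)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency: number of docs each term appears in
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = ["cat sat", "cat ran", "dog ran"]
w = tfidf(docs)
# "sat" occurs in only one document, so it outweighs the common term "cat"
print(w[0]["sat"] > w[0]["cat"])  # True
```

The paper's main point is that many variants of this scheme exist (different tf dampening, idf forms, and length normalization), and the choice measurably affects retrieval effectiveness.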
Course-related web links
- Sources for on-line papers:
ACM Digital Library
- IR Textbooks:
Information Retrieval: Implementing and Evaluating Search Engines,
Information Retrieval: Algorithms and Heuristics
Readings in Information Retrieval (Amazon),
Foundations of Statistical Natural Language Processing
- IR Evaluations: TREC,
ROMIP (a Russian language evaluation)
- Organizations that distribute corpora:
- IR Journals:
ACM Transactions on Speech and Language Processing
- IR-related conferences:
CEAS 2010 (email spam)
AIRWeb (web spam)
ISMIR (music IR)
- On-line magazines:
The Noisy Channel,
Search Engine Watch
- Peter Norvig's tutorial on spelling correction.
- Berkeley Primer: Finding Information on the Internet
- HLT Central Repository
- Discrete Mathematics Primer
- Web Protocols:
Z39.50 (Information Retrieval)
- Lucene, a popular open-source search engine library (see also Solr)
- Wumpus system (Univ. Waterloo)
- Lemur / Indri: a language modelling IR toolkit.
- Cornell's SMART system (predates the births of Sergey Brin and Larry Page)
- Martin Porter's Snowball stemming tool (includes the Porter Stemmer)
- Jacques Savoy's stoplists in various languages (and some stemmers too)
- Managing Gigabytes mg system
- Very nice list of NLP, IR, and CL resources (e.g., parsers, taggers)
- University of Michigan tool suite: Clairlib
- Trigrams'n'Tags (TnT), a probabilistic (Markov model) POS tagger written by Thorsten Brants (now of Google)
- On-line translators: Systran
- Google's on-line translation service: Google Translate
- WordNet, a lexical database for English
- Andrew McCallum's MALLET toolkit, a Java-based API for machine learning applications using Conditional Random Fields
- Perl LWP library (at CPAN).
- Machine Learning / Data Mining tool: WEKA
- Joachims' Support Vector Machine toolkit: SVMlight
- SVM-Multiclass, a multi-class version of SVMlight.
- Python-based set of tools for NLP tasks (parsing, POS tagging, etc.): NLTK
- Machine learning in Python: scikit-learn
- Parsing HTML (robustly) in Python: Beautiful Soup
- A 'meta' search engine: Dogpile
- A question-answering system: START
- An online joke recommendation system
- A faux computer-science paper generator, SCIgen, from MIT
- No IR system with 3 billion queries a day is going to be perfect. Best of Google Bloopers ;-).
- IR test collections
- Several Web search engines