605.744: Information Retrieval, Spring 2007
- Lecturer
- Paul McNamee,
<paulmac@apl.jhu.edu>
Textbooks
Course Times and Location
- Lecture: Tuesday, 7:15pm - 10:00pm, Kossiakoff Center Room K-2.
- Rather than holding fixed office hours, I will meet with
students by appointment; however, email is expected to be the primary
means of out-of-class communication.
- Course Overview
- This course covers the storage and retrieval of unstructured digital
information. Topics include automatic index construction,
retrieval models, textual representations, efficiency issues,
search engines, text classification, and multilingual retrieval.
- Grading Policy
- Grades will be assigned on a letter scale (A, B, C, etc.), per
university policy. Work for the class will include homework
assignments, an independent research project, a midterm exam,
and classroom participation (e.g., quizzes, oral presentations).
Refer to the course outline for details.
- Academic Integrity
- Work for this class is expected to be the result of individual
effort; however, unless explicitly prohibited, it is perfectly
acceptable to make use of published examples and source code
from the literature or public domain - but only if attribution
is given.
Furthermore, while it is permissible to discuss the general
nature of lecture material and assignments with your peers, this
does not extend to discussing or revealing solutions or source
code.
Students are expected to uphold the academic integrity of the
university.
Students who use published material without attribution, or who copy
the work (particularly source code) of another individual,
will face consequences such as receiving a zero on the
assignment and having the matter referred to the dean.
Contact me if you have any questions, no matter how
slight, about this policy, or if you have questions about a
particular assignment.
Assigned Readings
- 1/23/07, Michael Lesk,
The Seven Ages of Information Retrieval, 1995.
- (optional, not assigned): If you liked that paper, try
"How Much
Information Is There in the World?", also by Lesk, the
author of "Practical Digital Libraries" (published by Morgan Kaufmann).
- 1/23/07,
IIR, Chapters 1 & 2.
Key ideas: Boolean document representations and query evaluation in
chapter 1; word normalization in chapter 2.
- 1/30/07, Damashek's N-gram paper (only available in hard copy)
- 1/30/07,
IIR, Chapters 3 & 4.
Key ideas: Coping with misspellings; wildcard search; building an
inverted file (pay particular attention to section 4.1 of the
text)
- 2/6/07,
IIR, Chapters 5 & 6.
Key ideas: Dictionary and Index Compression; Term weighting
- 2/20/07, Term Weighting paper by Salton and Buckley (hardcopy)
- 2/20/07,
IIR, Chapters 7-9
Key ideas: Vector Space model, Evaluation (of accuracy), Query
Expansion
- 2/27/07,
IIR, Chapters 11, 12
Key ideas: Probabilistic model, statistical language models
- 3/6/07,
IIR, Chapters 13-15
Key ideas: Text classification, Naive Bayes (NB), feature selection,
k-nearest neighbor (kNN), support vector machines (SVMs). Ignore
section 14.3, and don't worry about the math in Chap 15 - get a
conceptual understanding.
- 3/6/07, Thorsten Joachims,
Text Categorization with Support Vector Machines, a tech
report that also appeared at ECML-98.
- 3/20/07,
IIR, Chapters 19-21
Key ideas: Characteristics of Web documents, Hyperlink-sensitive methods, Crawling.
- 3/20/07, Nature and Science articles by Lawrence and Giles
(hardcopy).
- 3/27/07, Michael Wesch,
The Machine is Us/ing Us
on YouTube.
- 4/3/07, Hull and Grefenstette, 'Querying across languages:
A dictionary-based approach to multilingual information
retrieval'. SIGIR-96 paper that also appeared in Readings in
IR. Sent as softcopy by email.
- 4/10/07, Mark Sanderson,
'Retrieving with Good Sense'.
Handouts
Assignments
Course related web-links
- Sources for on-line papers:
CiteSeer (scientific and CS articles)
ACL Anthology
TREC Publications
ACM Digital Library
- IR Textbooks:
Managing Gigabytes,
Readings in Information Retrieval (at amazon),
Foundations of Statistical Natural Language Processing text
- IR Evaluations: TREC,
CLEF,
NTCIR,
FIRE
- Organizations that distribute corpora:
LDC,
ELRA
- IR Journals: JASIST,
IP&M,
IR
- IR-related conferences:
SIGIR,
CIKM,
KDD,
ACL,
WWW-2005
- On-line magazines:
Search Engine Watch,
D-Lib Magazine
- Berkeley Primer: Finding Information on the Internet
- HLT Central Repository
- Parallel text processing: bibliography
- Discrete Mathematics Primer
- Web Protocols:
HTML 4,
Z39.50 (Information Retrieval)
Software Resources
Frivolity
Cool Demos
- A 'meta' search engine: Dogpile
- A question-answering system: START
- An online joke recommendation system that demonstrates
collaborative filtering:
JESTER
- A search engine for speech: PodZinger
- A faux computer science paper generator,
SCIgen, from MIT
IR Test collections
Several Web search engines
JHU on-line resources
Paul McNamee:
http://apl.jhu.edu/~paulmac/
(paulmac@apl.jhu.edu)