  • Offered by: Research School of Computer Science
  • ANU College: ANU College of Engineering and Computer Science
  • Course subject: Computer Science
  • Academic career: Undergraduate
  • Course convener: Dr Scott Sanner
  • Mode of delivery: In Person
  • Offered in: Second Semester 2014

This course treats the processing of semi-structured documents, such as web pages, RSS feeds and their accompanying news items, and PDF brochures, from the perspective of interpreting their content. It considers the "document" and its various genres as a fundamental object for business, government and community. The course covers four broad areas: (A) information retrieval, (B) natural language processing, (C) machine learning for documents, and (D) relevant tools for the Web. Basic tasks covered include content collection and extraction, formal and informal natural language processing, information extraction, information retrieval, classification and analysis. Fundamental probabilistic techniques for performing these tasks and some common software systems will be covered, though no area will be covered in depth.

Learning Outcomes

Upon successful completion of the course, the student will have an understanding of the role documents play in business and community, and the various digital resources available for document analysis. Moreover, the student will have the background theory and practical knowledge necessary to plan and execute a basic document analysis project. The student will be able to:

  • Understand the basic requirements that digital libraries and business processes have with respect to documents. Obtain documents from various sources and transform them into a common XML or RDF format, using a knowledge of SAX and XPath.
  • Understand the genres of documents available from the internet, such as RSS feeds, social networks, blogs, wikis and archives, and the role they play in the internet ecosystem. Understand the linguistic and semantic resources available from the internet and the so-called "web of data", such as dictionaries, repositories and ontologies.
  • Understand basic probabilistic theories of language and document structure, and the basic algorithms and software available for them, and be able to use some common libraries for natural language processing to perform basic analysis tasks.
  • Understand basic probabilistic theories of information retrieval, and be able to index a document collection for use in an information retrieval system. Understand basic theories and algorithms for large-scale named-entity matching and standardization of names within a collection.
  • Understand basic probabilistic theories of classification, clustering, and document feature "engineering", and be able to perform automated classification.

Indicative Assessment

Assignments (40%); Written final exam (60%).

The ANU uses Turnitin to enhance student citation and referencing techniques, and to assess assignment submissions as a component of the University's approach to managing Academic Integrity. While the use of Turnitin is not mandatory, the ANU highly recommends Turnitin is used by both teaching staff and students. For additional information regarding Turnitin please visit the ANU Online website.

Workload

Twenty-four one-hour lectures and ten two-hour laboratory sessions.

Requisite and Incompatibility

To enrol in this course you must have completed COMP3410 or COMP3420; and 12 units of 3000 level COMP courses or INFS courses; and COMP2600 or 6 units of MATH or STAT courses.

Prescribed Texts

The following reference books will be used.

  • Introduction to Information Retrieval, C.D. Manning, P. Raghavan and H. Schütze, Cambridge University Press, 2008.
  • Foundations of Statistical Natural Language Processing, C.D. Manning and H. Schütze, MIT Press, 1999.

Fees

Tuition fees are for the academic year indicated at the top of the page.  

If you are a domestic graduate coursework or international student, you will be required to pay tuition fees. If you are continuing in your current program of study, your tuition fees will be indexed annually from the year in which you commenced the program. Further information for domestic and international students about tuition and other fees can be found at Fees.

Student Contribution Band:
Band 2
Unit value:
6 units

If you are an undergraduate student and have been offered a Commonwealth supported place, your fees are set by the Australian Government for each course. At ANU, 1 EFTSL is 48 units (normally 8 × 6-unit courses). You can find your student contribution amount for each course at Fees. Where a unit range is displayed for this course, not all of the unit options listed below may be available.

Units EFTSL
6.00 0.12500
Domestic fee paying students
Year Fee
1994-2003 $1650
2004 $2190
2005 $2190
2006 $2190
2007 $2298
2008 $2592
2009 $2850
2010 $2916
2011 $2946
2012 $2946
2013 $2946
2014 $2952
International fee paying students
Year Fee
1994-2003 $3234
2004 $3234
2005 $3288
2006 $3426
2007 $3426
2008 $3426
2009 $3426
2010 $3750
2011 $3756
2012 $3756
2013 $3756
2014 $3762
Note: Fee information is for the current year only.

Offerings and Dates

The list of offerings for future years is indicative only.

Second Semester

Class number  Class start date  Last day to enrol  Census date  Class end date  Mode of delivery
8000          21 Jul 2014       01 Aug 2014        31 Aug 2014  30 Oct 2014     In Person
