Document Analysis

Code COMP4650
Unit Value 6 units

Offered by School of Computing
ANU College ANU College of Engineering and Computer Science
Course subject Computer Science

Academic career UGRD
Course convener
- Dr Dawei Chen
Mode of delivery In Person
Co-taught Course
- COMP6490
Offered in Second Semester 2022

In Sem 2 2022, this course is delivered on campus with adjustments for remote participation due to unavoidable COVID constraints.

Processing of semi-structured documents such as internet pages, RSS feeds and their accompanying news items, and PDF brochures is considered from the perspective of interpreting the content. This course considers the "document" and its various genres as a fundamental object for business, government and community. To this end, the course covers four broad areas: (A) information retrieval, (B) natural language processing, (C) machine learning for documents, and (D) relevant tools for the Web. The basic tasks covered include content collection and extraction, formal and informal natural language processing, information extraction, information retrieval, classification and analysis. Fundamental probabilistic techniques for performing these tasks and some common software systems will also be covered, though no area will be covered in any depth.

Learning Outcomes

Upon successful completion of the course, the student will have an understanding of the role documents play in business and community, and the various digital resources available for document analysis. Moreover, the student will have the background theory and practical knowledge necessary to plan and execute a basic document analysis project. The student will be able to:

differentiate between the basic probabilistic theories of language and document structure, information retrieval, and classification, clustering and document feature engineering.
identify the basic algorithms and software available for probabilistic theories of language and be proficient at using common libraries for natural language processing to perform basic analysis tasks.
index a document collection for use in an information retrieval system.
demonstrate advanced knowledge of basic theories and algorithms to determine large-scale named-entity matching and standardization of names within a collection.
perform automated classification using probabilistic theories.

Indicative Assessment

Assignments (40%); Written final exam (60%).

The ANU uses Turnitin to enhance student citation and referencing techniques, and to assess assignment submissions as a component of the University's approach to managing Academic Integrity. While the use of Turnitin is not mandatory, the ANU highly recommends Turnitin is used by both teaching staff and students. For additional information regarding Turnitin please visit the ANU Online website.

Workload

Twenty-four one-hour lectures and ten two-hour laboratory sessions.

Requisite and Incompatibility

To enrol in this course you must have completed 12 units of 3000 level COMP courses or INFS courses; and COMP2600 or COMP1600 or 6 units of MATH or STAT courses; and at least 6 units of programming courses including COMP1100/1130, COMP1110/1140, COMP1730 or COMP2100 or equivalent. Incompatible with COMP6490.

Prescribed Texts

The following reference books will be used.

Introduction to Information Retrieval, C.D. Manning, P. Raghavan and H. Schütze, Cambridge University Press, 2008.
Foundations of Statistical Natural Language Processing, C.D. Manning and H. Schütze, MIT Press, 1999.

Majors

Information Systems

Fees

Tuition fees are for the academic year indicated at the top of the page.

Commonwealth Support (CSP) Students
If you have been offered a Commonwealth supported place, your fees are set by the Australian Government for each course. At ANU 1 EFTSL is 48 units (normally 8 x 6-unit courses). More information about your student contribution amount for each course is available at Fees.

Student Contribution Band: 2
Unit value: 6 units

If you are a domestic graduate coursework student with a Domestic Tuition Fee (DTF) place or an international student, you will be required to pay course tuition fees (see below). Course tuition fees are indexed annually. Further information for domestic and international students about tuition and other fees can be found at Fees.

Where there is a unit range displayed for this course, not all unit options below may be available.

Units	EFTSL
6.00	0.12500

Domestic fee paying students

Year	Fee
2022	$4740

International fee paying students

Year	Fee
2022	$6000

Note: Fee information is for the current year only.

Offerings, Dates and Class Summary Links

ANU utilises MyTimetable to enable students to view the timetable for their enrolled courses, browse, then self-allocate to small teaching activities / tutorials so they can better plan their time. Find out more on the Timetable webpage.

The list of offerings for future years is indicative only.
Class summaries, if available, can be accessed by clicking on the View link for the relevant class number.

Second Semester

Class number	Class start date	Last day to enrol	Census date	Class end date	Mode Of Delivery	Class Summary
5780	25 Jul 2022	01 Aug 2022	31 Aug 2022	28 Oct 2022	In Person	View
