• Class Number 1605
  • Term Code 3320
  • Class Info
  • Unit Value 6 units
  • Mode of Delivery In Person
  • COURSE CONVENER
    • EmPr Peter Christen
  • LECTURER
    • Charini Nanayakkara
    • EmPr Peter Christen
  • Class Dates
  • Class Start Date 09/01/2023
  • Class End Date 10/03/2023
  • Census Date 20/01/2023
  • Last Date to Enrol 20/01/2023

Real-world data are commonly messy, distributed, and heterogeneous. This course introduces core concepts of data cleaning, standardisation, and integration, which aim to convert and map raw data into formats that allow more efficient and convenient use and analysis. The course also discusses data quality, management, and storage issues as relevant to data analytics.

Learning Outcomes

Upon successful completion, students will have the knowledge and skills to:

  1. Critically reflect upon different data sources, types, formats, and structures
  2. Research, justify and apply data cleaning, preprocessing, and standardisation for data analytics
  3. Apply data integration concepts and techniques to heterogeneous and distributed data
  4. Interpret, assess and discuss data quality measurements
  5. Research and justify advanced data wrangling, data integration, and database techniques as relevant to data analytics

Research-Led Teaching

This course will provide students with the opportunities to:

1. Develop knowledge in both theoretical concepts and practical skills on topics relevant to data wrangling;

2. Learn about some of the latest industry and research developments in data wrangling;

3. Apply their knowledge to real-world problems where they need to conduct practical data wrangling on real data;

4. Describe their findings and reflect upon their data wrangling experiences.

Books

  1. Data matching - Concepts and techniques for record linkage, entity resolution and duplicate detection (Peter Christen, Springer, 2012). This book is a required text for major parts of the course. There are several copies available in the ANU library.
  2. Data mining: Concepts and techniques, 3rd edition (Jiawei Han, Micheline Kamber and Jian Pei, Morgan Kaufmann, 2011). Note: This is also the textbook for the Data Mining course (COMP3425 and COMP8410).
  3. Data mining with Rattle and R is a useful book if you plan to use Rattle in this course as well as the Data Mining course (COMP3425 and COMP8410).


Software

  1. Pandas (which is included in Anaconda), based on Python.
  2. Matplotlib (also included in Anaconda), based on Python.
  3. Rattle, based on R.
  4. Code repository for the Data wrangling with Python book: https://github.com/jackiekazil/data-wrangling
  5. Code repository for data wrangling with Pandas: https://github.com/fonnesbeck/statistical-analysis-python-tutorial (see 2. Data wrangling with Pandas)

Staff Feedback

Students will be given feedback in the following forms in this course:

  • Individual written comments
  • Verbal comments
  • Feedback to the whole class, to groups, to individuals, and/or focus groups

Student Feedback

ANU is committed to the demonstration of educational excellence and regularly seeks feedback from students. Students are encouraged to offer feedback directly to their Course Convener or through their College and Course representatives (if applicable). The feedback given in these surveys is anonymous and provides the Colleges, University Education Committee and Academic Board with opportunities to recognise excellent teaching, and opportunities for improvement. The Surveys and Evaluation website provides more information on student surveys at ANU and reports on the feedback provided on ANU courses.

Class Schedule

Week 1: Introduction to data wrangling (9 to 13 January 2023)
  • Recorded lecture 1: What is data wrangling; and course overview
  • Recorded lecture 2: The data wrangling process; understanding data
  • Recorded lecture 3: Data extraction and storage, data warehousing
  • Reading material:
    1. Rahm and Do (2000): Data cleaning: Problems and current approaches.
    2. New York Times article (2014): For Big-Data scientists, ‘janitor work’ is key hurdle to insights
  • Assessment: Release of Assignment 1.

Week 2: Data quality, exploration and cleaning (16 to 20 January 2023)
  • Recorded lecture 4: Web scraping and geocoding of data
  • Recorded lecture 5: Data quality assessment, data quality dimensions, data profiling, data visualisation, real-world data is dirty
  • Recorded lecture 6: Resolving data quality issues, data cleaning overview, dealing with missing data
  • Reading material:
    1. Galaitsi, Cegan, Volk, Joyner, Trump, and Linkov (2021): The challenges of data usage for the United States’ COVID-19 response
    2. Strong, Lee and Wang (1997): Data quality in context
  • Assessment: Submission of Assignment 1, release of Assignment 2. Online quiz 1 (progress questions weeks 1 and 2).

Week 3: Data pre-processing (23 to 27 January 2023)
  • Recorded lecture 7: Data transformation, aggregation and reduction, Metadata
  • Recorded lecture 8: Data parsing and standardisation, special case of personal data
  • Recorded lecture 9: Example data cleaning using Rattle (R based) and using Python (Pandas)
  • Assessment: Submission of Assignment 2, release of Assignment 3. Online quiz 2 (progress questions weeks 2 and 3).

Week 4: Data integration (30 January to 3 February 2023)
  • Recorded lecture 10: Overview of data integration, its importance, example applications
  • Recorded lecture 11: Schema mapping and matching
  • Recorded lecture 12: Overview of record linkage (process, history, challenges)
  • Reading material:
    1. First two chapters of Christen (2012): Data matching
  • Assessment: Submission of Assignment 3. Online quiz 3 (progress questions weeks 3 and 4).

Week 5: Intensive week (6 to 10 February 2023)
  *** Attendance in person on campus is compulsory - nothing will be recorded ***
  • Monday
    • Lecture 1: Welcome, summary of topics covered in weeks 1 to 4, discussion and question answering
    • Lecture 2 (13): Record linkage overview and process (brief repeat), blocking and indexing for record linkage
    • Lecture 3 (14): More on blocking / indexing (including phonetic encoding)
    • Practical lab 1: Data exploration using Rattle and Pandas
  • Tuesday
    • Lecture 4 (15): Record linkage comparison (basics)
    • Lecture 5 (16): Record linkage comparison (string comparison functions)
    • Guest lecture by Prof Nick Biddle, ANU Centre for Social Research and Methods (TBC)
    • Practical lab 2: Data cleaning and transformation using Rattle and Pandas
  • Wednesday
    • Lecture 6 (17): Record linkage classification (basics)
    • Lecture 7 (18): Record linkage classification (advanced)
    • Practical lab 3: Record linkage blocking using Python
    • Practical lab 4: Record linkage comparison using Python
  • Thursday
    • Lecture 8 (19): Record linkage evaluation (measuring linkage quality and complexity)
    • Lecture 9 (20): Record linkage evaluation (clerical review and benchmark data)
    • Practical lab 5: Record linkage classification using Python
    • Practical lab 6: Record linkage evaluation using Python
  • Friday
    • Lecture 10: Summary of intensive week and outlook to post-intensive period, discussion of final assessment tasks
    • Lecture 11: Return marked assessments from pre-intensive week, provide feedback and discussion (TBC)
    • Practical lab 7: End-to-end record linkage using Python
  • Reading material:
    1. Relevant chapters of Christen (2012): Data matching
  • Assessment: There will be no assessments during the intensive week, but Assignments 4 and 5 will be released.

Week 6: Data fusion and advanced record linkage (13 to 17 February 2023)
  • Recorded lecture 1 (21): Data fusion, merging records after integration
  • Recorded lecture 2 (22): Group linkage, collective linkage, active learning, geocode matching, linking temporal and dynamic data, real-time linkage
  • Recorded lecture 3 (23): Privacy aspects in data wrangling, privacy-preserving record linkage
  • Reading material:
    1. Schnell, Bachteler, and Reiher (2009): Privacy-preserving record linkage using Bloom filters

Week 7: Ontologies and wrangling Big Data (20 to 24 February 2023)
  • Recorded lecture 4 (24): Ontology mapping and matching
  • Recorded lecture 5 (25): Wrangling dynamic data and data streams, as well as location (spatial) data
  • Assessment: Submission of Assignment 4.

Week 8: No new lecture material (27 February to 3 March)

Week 9: No new lecture material (6 to 10 March)
  • Assessment: Submission of Assignment 5 and final online examination.

Tutorial Registration

Tutorials will be in the intensive on-campus week (6th to 10th February 2023) and are compulsory to attend.

Assessment Summary

Assessment task | Value | Due Date | Learning Outcomes
Report on experiences in activities related to data wrangling | 5 % | 22/01/2023 | LO1: Critically reflect upon different data sources, types, formats, and structures. LO2: Research, justify and apply data cleaning, preprocessing, and standardisation for data analytics. LO4: Interpret, assess and discuss data quality measurements.
Report on practical data exploration and profiling exercise | 10 % | 29/01/2023 | LO1: Critically reflect upon different data sources, types, formats, and structures. LO2: Research, justify and apply data cleaning, preprocessing, and standardisation for data analytics. LO4: Interpret, assess and discuss data quality measurements.
Report on practical data cleaning exercise | 15 % | 05/02/2023 | LO1: Critically reflect upon different data sources, types, formats, and structures. LO2: Research, justify and apply data cleaning, preprocessing, and standardisation for data analytics. LO4: Interpret, assess and discuss data quality measurements.
Online Quizzes | 5 % | * | LO1: Critically reflect upon different data sources, types, formats, and structures. LO2: Research, justify and apply data cleaning, preprocessing, and standardisation for data analytics. LO3: Apply data integration concepts and techniques to heterogeneous and distributed data. LO4: Interpret, assess and discuss data quality measurements.
Practical record linkage exercise | 10 % | 26/02/2023 | LO1: Critically reflect upon different data sources, types, formats, and structures. LO3: Apply data integration concepts and techniques to heterogeneous and distributed data. LO5: Research and justify advanced data wrangling, data integration, and database techniques as relevant to data analytics.
Practical data wrangling project | 25 % | 10/03/2023 | LO1: Critically reflect upon different data sources, types, formats, and structures. LO2: Research, justify and apply data cleaning, preprocessing, and standardisation for data analytics. LO3: Apply data integration concepts and techniques to heterogeneous and distributed data. LO4: Interpret, assess and discuss data quality measurements. LO5: Research and justify advanced data wrangling, data integration, and database techniques as relevant to data analytics.
Final examination | 30 % | 10/03/2023 | LO1: Critically reflect upon different data sources, types, formats, and structures. LO2: Research, justify and apply data cleaning, preprocessing, and standardisation for data analytics. LO3: Apply data integration concepts and techniques to heterogeneous and distributed data. LO4: Interpret, assess and discuss data quality measurements. LO5: Research and justify advanced data wrangling, data integration, and database techniques as relevant to data analytics.

* If the Due Date and Return of Assessment date are blank, see the Assessment Tab for specific Assessment Task details

Policies

ANU has educational policies, procedures and guidelines, which are designed to ensure that staff and students are aware of the University’s academic standards, and implement them. Students are expected to have read the Academic Misconduct Rule before the commencement of their course.

Assessment Requirements

The ANU is using Turnitin to enhance student citation and referencing techniques, and to assess assignment submissions as a component of the University's approach to managing Academic Integrity. For additional information regarding Turnitin please visit the ANU Online website. Students may choose not to submit assessment items through Turnitin. In this instance you will be required to submit, alongside the assessment item itself, hard copies of all references included in the assessment item.

Moderation of Assessment

Marks that are allocated during Semester are to be considered provisional until formalised by the College examiners meeting at the end of each Semester. If appropriate, some moderation of marks might be applied prior to final results being released.

Participation

You are expected to view all online lectures and read the provided reading material in the online weeks.

You must attend all lectures and labs in the intensive on-campus week (6th to 10th February), as none of these will be recorded.

Examination(s)

The examination will be held online in Wattle, with details to be discussed in the intensive week.

Assessment Task 1

Value: 5 %
Due Date: 22/01/2023
Learning Outcomes: LO1: Critically reflect upon different data sources, types, formats, and structures. LO2: Research, justify and apply data cleaning, preprocessing, and standardisation for data analytics. LO4: Interpret, assess and discuss data quality measurements.

Report on experiences in activities related to data wrangling

This assignment is a theoretical assignment that deals with questions about what data wrangling is, why it is important, and how it fits into the broader field of data analytics. Some of the questions refer to the required readings for week 1 of the course while the remainder ask you to draw on personal experience and the lecture material from week 1.

Assessment Task 2

Value: 10 %
Due Date: 29/01/2023
Learning Outcomes: LO1: Critically reflect upon different data sources, types, formats, and structures. LO2: Research, justify and apply data cleaning, preprocessing, and standardisation for data analytics. LO4: Interpret, assess and discuss data quality measurements.

Report on practical data exploration and profiling exercise

This assignment covers the topics of data quality, data exploration, and data profiling as presented in the first few weeks of the course. We provide you with a data set, and you have to explore this data set and detail your findings relevant to data quality.
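
A first profiling pass usually reports, for each attribute, how many values are missing and how many distinct values occur. The following is a minimal sketch of that idea only, not the method required for the assignment: it uses Python's standard library rather than Pandas or Rattle, and the sample records, the column names, and the set of strings treated as missing are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; the assignment supplies its own data set.
SAMPLE_CSV = """name,age,postcode
Alice,34,2600
bob,,2601
Alice,34,2600
Carol,n/a,26O1
"""

# Assumed markers for missing values; real data may use others.
MISSING_MARKERS = {"", "n/a", "na", "null", "?"}


def profile(rows):
    """Return per-column counts of total, missing and distinct values."""
    columns = {}
    for row in rows:
        for col, value in row.items():
            columns.setdefault(col, []).append((value or "").strip())
    report = {}
    for col, values in columns.items():
        # Values that are not one of the assumed missing-value markers.
        present = [v for v in values if v.lower() not in MISSING_MARKERS]
        counts = Counter(present)
        report[col] = {
            "total": len(values),
            "missing": len(values) - len(present),
            "distinct": len(counts),
            "most_common": counts.most_common(1)[0] if counts else None,
        }
    return report


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile(rows)
for col, stats in report.items():
    print(col, stats)
```

Counts like these only point to where to look: a mostly missing column, or a postcode containing the letter "O" instead of a zero, still needs to be interpreted in terms of the data quality dimensions covered in the lectures.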

Assessment Task 3

Value: 15 %
Due Date: 05/02/2023
Learning Outcomes: LO1: Critically reflect upon different data sources, types, formats, and structures. LO2: Research, justify and apply data cleaning, preprocessing, and standardisation for data analytics. LO4: Interpret, assess and discuss data quality measurements.

Report on practical data cleaning exercise

This assignment covers the topic of data cleaning, with a focus on identifying possible data quality problems in data sets and taking the necessary steps to correct them. Similar to Assignment 2, we ask you to generate a second data set, merge it with the data set from Assignment 2, and then identify and fix data quality problems in the merged data set.

Assessment Task 4

Value: 5 %
Learning Outcomes: LO1: Critically reflect upon different data sources, types, formats, and structures. LO2: Research, justify and apply data cleaning, preprocessing, and standardisation for data analytics. LO3: Apply data integration concepts and techniques to heterogeneous and distributed data. LO4: Interpret, assess and discuss data quality measurements.

Online Quizzes

Online Quizzes will cover topics of data quality, data exploration, data profiling, data integration, data cleaning, and record linkage. Online quizzes will be in weeks 2, 3, and 4 of the course.

Assessment Task 5

Value: 10 %
Due Date: 26/02/2023
Learning Outcomes: LO1: Critically reflect upon different data sources, types, formats, and structures. LO3: Apply data integration concepts and techniques to heterogeneous and distributed data. LO5: Research and justify advanced data wrangling, data integration, and database techniques as relevant to data analytics.

Practical record linkage exercise

For this assignment you will take another look at the record linkage program that you developed in the lab sessions during the intensive week. Specifically, we provide you with two new master data sets, from which you need to generate two individual data sets to be used for this assignment. We ask you to work with the programs we have developed in the labs, and to provide answers based on your findings. As with the previous assignments, the emphasis is on your understanding, descriptions, and justification, as much as on the raw (numerical) record linkage evaluation results that you are able to achieve.
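
The record linkage steps covered in the intensive week (blocking, comparison, classification, and evaluation) can be sketched end to end as follows. This is not the lab program: the records, the postcode blocking key, the use of `difflib.SequenceMatcher` as the string comparison function, and the 0.8 threshold are all assumptions chosen to keep the example short.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical records: (record id, true entity id, first name, surname, postcode).
# The true entity id is used only for evaluation.
DATA_A = [
    ("a1", "e1", "peter", "smith", "2600"),
    ("a2", "e2", "mary", "jones", "2601"),
    ("a3", "e3", "john", "miller", "2602"),
]
DATA_B = [
    ("b1", "e1", "pete", "smith", "2600"),
    ("b2", "e2", "mary", "jonas", "2601"),
    ("b3", "e4", "joan", "miles", "2602"),
]


def block_key(rec):
    """Blocking: only records with the same postcode are compared (assumed key)."""
    return rec[4]


def similarity(rec_a, rec_b):
    """Comparison: average string similarity of first name and surname."""
    sims = [SequenceMatcher(None, rec_a[i], rec_b[i]).ratio() for i in (2, 3)]
    return sum(sims) / len(sims)


def link(data_a, data_b, threshold=0.8):
    """Classification: pairs at or above the threshold are declared matches."""
    index = {}
    for rec_b in data_b:
        index.setdefault(block_key(rec_b), []).append(rec_b)
    matches = set()
    for rec_a in data_a:
        # Only candidates in the same block are compared.
        for rec_b in index.get(block_key(rec_a), []):
            if similarity(rec_a, rec_b) >= threshold:
                matches.add((rec_a[0], rec_b[0]))
    return matches


def evaluate(matches, data_a, data_b):
    """Evaluation: precision and recall against the known true matches."""
    truth = {(a[0], b[0]) for a, b in product(data_a, data_b) if a[1] == b[1]}
    true_pos = len(matches & truth)
    precision = true_pos / len(matches) if matches else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall


matches = link(DATA_A, DATA_B)
precision, recall = evaluate(matches, DATA_A, DATA_B)
print(sorted(matches), precision, recall)
```

Changing the blocking key or the threshold changes which pairs are compared and which are classified as matches, which is the kind of trade-off the assignment asks you to describe and justify.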

Assessment Task 6

Value: 25 %
Due Date: 10/03/2023
Learning Outcomes: LO1: Critically reflect upon different data sources, types, formats, and structures. LO2: Research, justify and apply data cleaning, preprocessing, and standardisation for data analytics. LO3: Apply data integration concepts and techniques to heterogeneous and distributed data. LO4: Interpret, assess and discuss data quality measurements. LO5: Research and justify advanced data wrangling, data integration, and database techniques as relevant to data analytics.

Practical data wrangling project

This assignment is an opportunity to put everything you have learned about data wrangling into practice on a problem and data sets of your choice. You have to perform a set of tasks similar to those required for Assignments 2, 3, and possibly 4. However, the scope is much more open, and you are free to focus on aspects that you think are important or are of particular interest to you.

Assessment Task 7

Value: 30 %
Due Date: 10/03/2023
Learning Outcomes: LO1: Critically reflect upon different data sources, types, formats, and structures. LO2: Research, justify and apply data cleaning, preprocessing, and standardisation for data analytics. LO3: Apply data integration concepts and techniques to heterogeneous and distributed data. LO4: Interpret, assess and discuss data quality measurements. LO5: Research and justify advanced data wrangling, data integration, and database techniques as relevant to data analytics.

Final examination

The final examination will cover all content taught in the course.

Hurdle - Students must obtain a final exam mark of at least 45% and a total mark over 50% to pass the course.
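
As a worked illustration of how the hurdle interacts with the assessment weights, the sketch below computes a weighted total and applies both conditions. The task names are hypothetical labels, the marks are invented, and reading "over 50%" as strictly greater than 50 is an assumption; the course convener's interpretation takes precedence.

```python
# Assessment weights in percent, as listed in the Assessment Summary.
WEIGHTS = {
    "task_1": 5,   # report on data wrangling experiences
    "task_2": 10,  # data exploration and profiling report
    "task_3": 15,  # data cleaning report
    "task_4": 5,   # online quizzes
    "task_5": 10,  # record linkage exercise
    "task_6": 25,  # data wrangling project
    "task_7": 30,  # final examination
}


def total_mark(marks):
    """Weighted total in percent; `marks` maps each task to a mark out of 100."""
    weighted = sum(WEIGHTS[task] * marks[task] for task in WEIGHTS)
    return weighted / sum(WEIGHTS.values())


def passes_course(marks):
    """Hurdle: exam mark of at least 45% and a total mark over 50%.

    Assumption: "over 50%" is read literally as strictly greater than 50.
    """
    return marks["task_7"] >= 45 and total_mark(marks) > 50


# Hypothetical marks: strong coursework but an exam mark below the hurdle.
strong_coursework = {task: 80 for task in WEIGHTS}
strong_coursework["task_7"] = 40
print(total_mark(strong_coursework), passes_course(strong_coursework))
```

In this invented case the weighted total is 68%, yet the course is not passed, because the examination mark of 40% is below the 45% hurdle.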

Academic Integrity

Academic integrity is a core part of our culture as a community of scholars. At its heart, academic integrity is about behaving ethically. This means that all members of the community commit to honest and responsible scholarly practice and to upholding these values with respect and fairness. The Australian National University commits to embedding the values of academic integrity in our teaching and learning. We ensure that all members of our community understand how to engage in academic work in ways that are consistent with, and actively support, academic integrity. The ANU expects staff and students to uphold high standards of academic integrity and act ethically and honestly, to ensure the quality and value of the qualification that you will graduate with. The University has policies and procedures in place to promote academic integrity and manage academic misconduct. Visit the Academic honesty & plagiarism website for more information about academic integrity and what the ANU considers academic misconduct. The ANU offers a number of services to assist students with their assignments, examinations, and other learning activities. The Academic Skills and Learning Centre offers a number of workshops and seminars that you may find useful for your studies.

Online Submission

The ANU uses Turnitin to enhance student citation and referencing techniques, and to assess assignment submissions as a component of the University's approach to managing Academic Integrity. While the use of Turnitin is not mandatory, the ANU highly recommends Turnitin is used by both teaching staff and students. For additional information regarding Turnitin please visit the ANU Online website.

Hardcopy Submission

No hard copy submissions

Late Submission

No submission of assessment tasks without an extension after the due date will be permitted. If an assessment task is not submitted by the due date, a mark of 0 will be awarded.

Referencing Requirements

Accepted academic practice for referencing sources that you use in presentations can be found via the links on the Wattle site, under the file named “ANU and College Policies, Program Information, Student Support Services and Assessment”. Alternatively, you can seek help through the Students Learning Development website.

Extensions and Penalties

Extensions and late submission of assessment pieces are covered by the Student Assessment (Coursework) Policy and Procedure. The Course Convener may grant extensions for assessment pieces that are not examinations or take-home examinations. If you need an extension, you must request an extension in writing on or before the due date. If you have documented and appropriate medical evidence that demonstrates you were not able to request an extension on or before the due date, you may be able to request it after the due date.

Privacy Notice

The ANU has made a number of third-party online databases available for students to use. Use of each online database is conditional on student end users first agreeing to the database licensor’s terms of service and/or privacy policy. Students should read these carefully. In some cases student end users will be required to register an account with the database licensor and submit personal information, including their: first name; last name; ANU email address; and other information. In cases where student end users are asked to submit ‘content’ to a database, such as an assignment or short answers, the database licensor may only use the student’s ‘content’ in accordance with the terms of service, including any (copyright) licence the student grants to the database licensor. Any personal information or content a student submits may be stored by the licensor, potentially offshore, and will be used to process the database service in accordance with the licensor’s terms of service and/or privacy policy. If any student chooses not to agree to the database licensor’s terms of service or privacy policy, the student will not be able to access and use the database. In these circumstances students should contact their lecturer to enquire about alternative arrangements that are available.

Distribution of grades policy

The Academic Quality Assurance Committee monitors the performance of students, including attrition, further study and employment rates and grade distribution, and College reports on quality assurance processes for assessment activities, including alignment with national and international disciplinary and interdisciplinary standards, as well as qualification type learning outcomes. Since first semester 1994, ANU has used a grading scale for all courses. This grading scale is used by all academic areas of the University.

Support for students

The University offers students support through several different services. You may contact the services listed below directly or seek advice from your Course Convener, Student Administrators, or your College and Course representatives (if applicable).
EmPr Peter Christen
peter.christen@anu.edu.au

Research Interests


Record Linkage, Entity Resolution, Data Mining, Privacy-Preserving Record Linkage, Data Quality, Population Data

By Appointment
Charini Nanayakkara
charini.nanayakkara@anu.edu.au

