Opportunities and Challenges of Complex Biomedical Data: Introduction to the Science of "Big Data" (BIOSTAT 202)
Summer 2023 (3 units)
This is an introduction to the opportunities and challenges of using large datasets for biomedical research. Topics to be covered include: what makes big data different; what big data can and cannot do; the phases of data science (getting data, merging and cleaning data, storing and accessing data, visualizing or telling stories with data, and drawing conclusions from data); and an introduction to supervised and unsupervised machine learning, including detailed discussion of algorithms and model fitting.
Objectives
At the conclusion of this course, students will be able to:
- Utilize public-use (and non-public) sources of data such as NHANES and social media data.
- Utilize software to manipulate and clean big data.
- Generate effective graphical displays of data.
- Describe the advantages and disadvantages of different approaches to both supervised (classification and regression) and unsupervised modeling (clustering and data reduction).
- Describe challenges to fitting complex models on big data, particularly the risk of overfitting in the context of model generalization/transportability.
- Describe the issues that arise when trying to use "big data"-based observational studies to derive causal conclusions.
Prerequisites
None
Faculty
Course Director: Aaron Scheffler, PhD, MS, Assistant Professor of Epidemiology & Biostatistics (email: [email protected])
Format
Twice-weekly pre-recorded lectures introduce the substantive content for each module, which is subsequently reinforced in weekly applied homework problem sets. Weekly computer lab sessions give students guided problems to work through and the opportunity to learn to use the software, ask questions, and have more interaction with faculty.
Lectures: Mondays and Thursdays, 1:00 to 1:45 PM, July 20 to August 31 (first session Thursday, July 20)
Formal review of the recorded lecture, followed by application of the lecture material and a question-and-answer discussion.
Computer Laboratories: Thursdays, 2:00 PM to 3:45 PM, July 20 to August 31
Students have access to course faculty for questions on current or prior curriculum, assignments, and software implementation.
In addition, all students will be required to submit a final project in which they manipulate, clean, and analyze data emanating from a large data source. Students will be given a choice of datasets and guidelines for performing the project.
All course materials and handouts will be posted on the course's online syllabus.
Materials
The free software suite Orange will be used throughout. Orange is a comprehensive, component-based software package with strengths in data visualization, data mining and machine learning.
Grading
Grades will be based on the Computer Lab assignments and the Final Project. Lab assignments will be due by the start of lecture the following week. Homework problem sets will account for 70% of the points for the course. The final project, based on course-supplied data sets, will account for the remaining 30%.
To receive a Satisfactory (if taking the course Satisfactory/Unsatisfactory) or a B (if taking it for a letter grade), students must hand in all homework problem sets (even if late), complete a satisfactory Final Project, and receive at least 80% of the total number of points assigned during the quarter.
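As a rough illustration of the arithmetic above, the short Python sketch below combines a homework percentage and a final project percentage using the 70/30 weights and checks the three stated requirements. The function names are hypothetical, and the sketch assumes the 80% threshold applies to the weighted total; it is not an official grade calculator.

```python
HOMEWORK_WEIGHT = 70  # percent of course points from homework problem sets
PROJECT_WEIGHT = 30   # percent of course points from the final project
PASS_THRESHOLD = 80   # minimum percent of total points required


def overall_percent(homework_pct, project_pct):
    """Weighted course total (0-100) from two component percentages (0-100)."""
    return (HOMEWORK_WEIGHT * homework_pct + PROJECT_WEIGHT * project_pct) / 100


def meets_requirements(homework_pct, project_pct,
                       all_homework_submitted, project_satisfactory):
    """Check the three stated conditions (assumed interpretation, see lead-in)."""
    total = overall_percent(homework_pct, project_pct)
    return (all_homework_submitted
            and project_satisfactory
            and total >= PASS_THRESHOLD)


if __name__ == "__main__":
    # Example: 85% on homework and 75% on the final project.
    print(overall_percent(85, 75))
    print(meets_requirements(85, 75, True, True))
```

Under these assumptions, a student with 85% on homework and 75% on the project earns 82% of the total points, but would still not meet the requirements if any problem set were never handed in.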
Students not in full-year TICR Programs who satisfactorily pass all course requirements will, upon request, receive a Certificate of Course Completion.
To Enroll
ATCR and MAS students: use the Student Portal.
Students taking individual courses:
- Summer 2023 Course Fees
- How to pay (please read before applying)
- Summer 2023 Course Schedule
Apply by July 10, 2023, for the summer quarter.
Only one application needs to be completed for all courses desired during the quarter.