Mining Shakespeare

Activity: Mining Shakespeare

Submitted by Peter Henstock, STEM

Goal:

  • To introduce students to the basics of natural language processing
  • To apply clustering and other machine learning approaches in a familiar context, trying both traditional and non-traditional approaches to gain insight

Class: CSCI E-81 – Machine Learning and Data Mining

Introduction/Background:

This activity was a group homework assignment. In groups of two, students used machine learning techniques, including clustering, to investigate whether Shakespeare was the actual author of all the works attributed to him.

Procedure

Activity:

  • As homework, students used the online database of Shakespeare's works available from MIT. They worked in Python and/or R, along with some other tools.
  • Students were required to use a clustering method to explore the space. The instructor also set up the course so that students had to go beyond some subset of the assignments to earn 'exploration' points. As a result, a number of students discussed ideas in sections and in office hours, and went beyond the fairly open assignment. Some drew in works by other authors. Others created new features and explored their discriminatory power. Still others tried interesting approaches, such as leveraging old-English word stemmers.
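
As a rough illustration of the kind of pipeline described above, the following Python sketch turns each text into a small feature vector and clusters the vectors. It is not taken from any student submission: the function-word list, the helper names (features, distance, kmeans), and the choice of k-means with farthest-point seeding are all assumptions made for illustration, and the write-up does not say which clustering method or features the teams actually used.

```python
import math
import re
from collections import Counter

# Assumption: a tiny, hand-picked list of function words used as stylistic features.
FUNCTION_WORDS = ["the", "and", "of", "to", "in", "that", "not", "with"]


def features(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iterations=20):
    """Plain k-means with deterministic farthest-point seeding."""
    # Seed with the first vector, then repeatedly add the vector that is
    # farthest from every centroid chosen so far.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(distance(v, c) for c in centroids))
        centroids.append(list(farthest))

    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: distance(v, centroids[c])) for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

In practice, each input string would be the full text of a play or scene, and a much larger feature set (or an existing library implementation of clustering) would likely be more informative than this eight-word toy example.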

Comments & Follow-Up:

  • Each student team submitted a report detailing its text mining approach, including the features used, clustering methods, and visualizations
  • In a subsequent class session, about 30 minutes were set aside so that each student team could spend a few minutes discussing its ideas, approach, and findings

The instructor notes that students went far beyond the assignment by exploring research in the field, drawing on papers by relevant authors, and leveraging other analyses. Doing so complemented and enhanced the machine learning and text mining efforts, producing novel approaches that many students were quite proud of.

Assessment:

  • Students were evaluated on how well they met the original assignment, namely by appropriately applying clustering techniques and a visualization method to the text. In addition, the instructor awarded largely subjective bonus points for exploratory approaches that went beyond the original assignment.