Analyzing discussion forums within MOOCs raises many logistical issues, chiefly because the size of the dataset makes it difficult to understand student behaviors. In this presentation, the author outlines how educational data mining techniques were used to analyze the discussion forum posts from six HarvardX MOOCs.
Learning analytics focuses on extracting meaning from large amounts of data. Some of the largest datasets in education come from Massive Open Online Courses (MOOCs), which typically feature enrollments in the tens of thousands. One of the most significant sources of data is the discussion forum, a focal point for many MOOCs. Analyzing discussion forums within MOOCs raises many logistical issues, chiefly because the size of the dataset makes it difficult to understand and adequately describe student behaviors. In the presentation, the author outlines how Excel, Linguistic Inquiry and Word Count (LIWC, pronounced “Luke”; Tausczik & Pennebaker, 2010), and Stata 14 were used to analyze the substantive qualities of discussion forum posts from six HarvardX MOOCs. The data was provided through Harvard’s VPAL Research Group. Each of these MOOCs was offered in two iterations, instructor-paced and self-paced, and this presentation outlines the differences in discussion forum activity between these iterations. This study analyzes 57,650 discussion posts generated by 13,495 students across these six courses. This presentation demonstrates ways in which automatic text analysis tools can aid the analysis and the subsequent identification of evidence of cognitive presence in MOOCs.
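To make the idea of automatic text analysis concrete, the following minimal Python sketch illustrates dictionary-based word counting, the general technique behind tools such as LIWC, and averages the resulting rates by pacing mode. It is not the author's workflow (which used Excel, LIWC, and Stata 14) and does not use LIWC's dictionaries: the `CATEGORIES` word lists, the function names, and the sample posts are all invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical mini-dictionary for illustration only.
# These are NOT LIWC's categories or word lists.
CATEGORIES = {
    "cognitive": {"think", "because", "know", "consider", "understand"},
    "social": {"we", "you", "thanks", "agree", "everyone"},
}


def category_rates(text):
    """Return, per category, the percentage of words in `text` that match it."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter()
    for word in words:
        for name, vocab in CATEGORIES.items():
            if word in vocab:
                counts[name] += 1
    return {
        name: (100.0 * counts[name] / total if total else 0.0)
        for name in CATEGORIES
    }


def mean_rates_by_pacing(posts):
    """Average the per-post category rates within each pacing mode.

    `posts` is an iterable of (pacing, text) pairs.
    """
    sums = defaultdict(lambda: defaultdict(float))
    n_posts = Counter()
    for pacing, text in posts:
        n_posts[pacing] += 1
        for name, rate in category_rates(text).items():
            sums[pacing][name] += rate
    return {
        pacing: {name: sums[pacing][name] / n_posts[pacing] for name in CATEGORIES}
        for pacing in n_posts
    }


# Invented example posts, not drawn from the HarvardX data.
sample_posts = [
    ("instructor-paced", "I think we agree because the hero myth is complex"),
    ("self-paced", "Thanks everyone"),
]
summary = mean_rates_by_pacing(sample_posts)
```

Per-post percentages of this kind could then be exported to a statistics package for comparison between the instructor-paced and self-paced iterations.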
MOOC research tends to emphasize forum activity, given that it is the primary source of student engagement. Sharif and Magrill (2015) suggest that the experiential learning that can occur in a MOOC discussion forum aligns well with the connectivist approach of cMOOCs. Most of the research on the forums is done from a macro perspective (Sun, Li, & Lin, 2016), and findings suggest that the structure of the discussion significantly impacts its quality (Hewitt, 2003; Swan, Shea, Fredericksen, Pickett, & Pelz, 2000; Vonderwell, 2003; Vrasidas & McIsaac, 1999). Despite this potential, the precise impact of discussion forums on learning is not yet fully understood (Bergner et al., 2015), making further exploration of that impact critically important. Since the forums are where the main interaction between learners occurs, it is unsurprising that they have been one of the most studied areas in MOOC research. A variety of methods have been used to examine discussion forum activity, including collecting descriptive statistics on the number of posts and using social network analysis to explore the sub-communities of students that have developed and their interaction patterns. From this analysis of discussion forum activity, researchers have been able to describe learner behavior in MOOCs by focusing on attrition (e.g., Jordan, 2015; Onah et al., 2014), engagement (e.g., Wang, Yang, Wen, Koedinger, & Rosé, 2015), and communication patterns (e.g., Eynon et al., 2016; Gillani & Eynon, 2014).
This presentation shares the findings from the discussion forums of three courses offered by Harvard University on the edX platform: (1) The Ancient Greek Hero, (2) Visualizing Japan, and (3) Super-Earths and Life, which had a combined student enrollment of 153,768. Each of these courses has a self-paced and an instructor-paced version. The courses were selected based on three criteria: they had to have activity in both versions’ discussion forums, they had to be offered by HarvardX, and each had to represent a different subject area (humanities, science, and art and culture). When a student signs up for a MOOC with Harvard, they agree to be part of research efforts (see http://harvardx.harvard.edu/research-statement). The final requirement was for Harvard to provide identifiable data for each of the courses. The full dataset included all iterations of the selected courses, which totaled 59 courses and over 860,000 student and instructor records. During the presentation, I will provide details on each of the steps that I used to clean and structure the data.
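The selection criteria described above can be expressed as a simple filter. The Python sketch below is one possible reading of those criteria, not the procedure actually used in the study: the record layout (`title`, `provider`, `subject`, `pacing`, `forum_posts`), the function name, and all sample values are hypothetical.

```python
from collections import Counter


def eligible_courses(runs, provider="HarvardX"):
    """Pick one course per subject area that meets the selection criteria.

    `runs` is a list of dicts, one per course run, with the hypothetical keys
    title, provider, subject, pacing ("instructor" or "self"), forum_posts.
    Returns a dict mapping subject area -> course title.
    """
    posts_by_title = {}
    subject_by_title = {}
    for run in runs:
        if run["provider"] != provider:
            continue  # criterion: offered by the given provider
        counts = posts_by_title.setdefault(run["title"], Counter())
        counts[run["pacing"]] += run["forum_posts"]
        subject_by_title[run["title"]] = run["subject"]

    chosen = {}
    for title, counts in posts_by_title.items():
        # criterion: forum activity in both the instructor-paced and self-paced versions
        if counts["instructor"] > 0 and counts["self"] > 0:
            # criterion: each selected course represents a different subject area
            # (this sketch keeps the first eligible course seen per subject)
            chosen.setdefault(subject_by_title[title], title)
    return chosen


# Invented records for illustration; not the HarvardX data.
sample_runs = [
    {"title": "Course A", "provider": "HarvardX", "subject": "humanities",
     "pacing": "instructor", "forum_posts": 10},
    {"title": "Course A", "provider": "HarvardX", "subject": "humanities",
     "pacing": "self", "forum_posts": 5},
    {"title": "Course B", "provider": "HarvardX", "subject": "science",
     "pacing": "instructor", "forum_posts": 7},
    {"title": "Course B", "provider": "HarvardX", "subject": "science",
     "pacing": "self", "forum_posts": 0},
    {"title": "Course C", "provider": "OtherX", "subject": "science",
     "pacing": "instructor", "forum_posts": 3},
    {"title": "Course C", "provider": "OtherX", "subject": "science",
     "pacing": "self", "forum_posts": 3},
]
selected = eligible_courses(sample_runs)
```

In this invented example only Course A qualifies: Course B has no self-paced forum activity, and Course C is not offered by the chosen provider.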
This presentation will be an opportunity for me to disseminate my research methods to a diverse audience. The presentation will be an interactive discussion supported by PowerPoint slides with animations that illustrate the processes I followed. I will open with an embedded PollEverywhere slide that asks the audience to share what they are most interested in learning or hearing during my presentation. I will use this ranking to inform the direction of my presentation and will ask subsequent questions through PollEverywhere to engage with the audience. Specifically, I will use PollEverywhere to collect their thoughts on their experiences with data mining, the concerns or challenges they have, and the analysis that I completed. I will also use PollEverywhere to facilitate my question-and-answer period by letting attendees submit their questions and vote questions up or down, giving them many opportunities to engage with each other and with me during the live presentation.
I was able to receive identifiable information from Harvard University through a data use agreement between Harvard and my home institution, NC State. One of the benefits of receiving this type of data is that I am able to provide a more robust analysis of the discussion forums by identifying whether a post came from a student or an instructor. This unique perspective will make my presentation interesting for anyone who has explored or attempted to research MOOCs. It will help to disseminate the research methods that I have used, in the hope that they might inspire others to consider these approaches in their own studies of MOOCs. Working with big data is an emerging and ‘hot’ topic right now, and my research speaks not only to how to use big data but also outlines some procedures that may be applicable for those considering research in this area.