Preliminary Analysis of Large, Web-Enabled, Dataset for Computing Education

John Doorenbos


Large datasets are essential to research on how students learn and understand computing, because they allow educational methodologies to be analyzed at scale. Online textbooks are an effective means of gathering large quantities of data that track students’ progress through the learning process. This paper discusses the use of a web-enabled dataset to analyze the learning habits of introductory computer science students, and describes the process of auditing and cleansing that dataset. The short-term goal of the research project was to study how students interact with the online textbooks, How to Think Like a Computer Scientist and ( The online textbook now serves more than 9,000 users a day and has logged more than 29,800,000 events in the two and a half years it has been operational. As part of the analysis, we compared the student retention rate of the textbook with that of MOOCs and found that the online textbook's retention rate was similar, though somewhat higher. By studying short-term users, we also identified the topics for which students most often sought help online. The long-term goal of the project is to develop a dataset for future research. A major part of the work was therefore auditing the dataset and developing a workflow that cleans the data automatically. Data auditing included removing data that was anomalous or otherwise unsuitable for evaluation, and standardizing the dataset for ease of analysis. This research is not a stand-alone, completed study; rather, it will help make a large dataset available for use by many. The dataset has already been used in a study of how online textbook usage patterns differ among high-school students, college students, and other website visitors.


Interactive, Education, Data Cleansing
