A Lakehouse-Oriented Big Data Infrastructure for Educational Analytics: Integrating Administrative and Assessment Data for Early Student Risk Prediction

Authors

  • Bhairav Kaphle, Madan Bhandari University of Science and Technology (MBUST)
  • Biswajit Shrestha, Madan Bhandari University of Science and Technology (MBUST)

DOI:

https://doi.org/10.58776/ijitcsa.v4i1.248

Keywords:

Educational data integration, Learning analytics, Student risk prediction, Lakehouse architecture, Higher education, Reproducible analytics

Abstract

Educational institutions increasingly depend on heterogeneous digital systems, yet many analytics initiatives remain fragmented across student information, registration, assessment, and learning platforms. This paper proposes a lakehouse-oriented big data infrastructure for educational analytics and validates it through a reproducible early-risk prediction study using the Open University Learning Analytics Dataset (OULAD). The study integrates five public OULAD tables (student information, course registration, assessment metadata, student assessment submissions, and course presentation metadata) into temporally valid feature tables aligned to the student–module–presentation level. We define a windowed feature engineering framework that constructs actionable indicators such as submission rate, weighted completion score, average submission lag, and assessment coverage gap at 30%, 50%, 70%, and 100% of the course timeline. Two supervised classifiers, logistic regression and random forest, are evaluated under a stratified 80/20 protocol. The results show that administrative data alone provides weak discrimination (AUC of 0.673), whereas integrated mid-course assessment evidence substantially improves performance. At the 50% course window, the random-forest model achieves an AUC of 0.947, F1 of 0.879, and recall of 0.829; even at the 30% window the model already reaches an AUC of 0.904. These findings demonstrate that the value of educational prediction depends not only on model choice but also on data integration architecture. The paper contributes (i) a lakehouse-oriented reference architecture for higher-education analytics, (ii) a temporally constrained feature engineering strategy for early-warning systems, and (iii) an empirical ablation showing that multi-source integration yields large and operationally meaningful gains.
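The windowed feature engineering the abstract describes can be sketched in pandas. This is a minimal illustration, not the authors' code: the column names (`id_assessment`, `date`, `weight`, `id_student`, `date_submitted`, `score`) are modeled loosely on the OULAD schema, and the key property shown is temporal validity: only assessments due, and submissions recorded, on or before the window cutoff contribute to the features.

```python
import pandas as pd

def window_features(assessments, submissions, window_frac, course_length):
    """Per-student indicators built only from evidence visible by the window
    cutoff, so nothing after the cutoff leaks into an early-warning feature."""
    cutoff = window_frac * course_length
    # Assessments due within the window define the denominator for coverage.
    due = assessments[assessments["date"] <= cutoff]
    subs = submissions.merge(due, on="id_assessment")
    subs = subs[subs["date_submitted"] <= cutoff].copy()
    subs["lag"] = subs["date_submitted"] - subs["date"]        # days late (< 0 = early)
    subs["weighted_score"] = subs["score"] * subs["weight"] / 100.0
    n_due = due["id_assessment"].nunique()
    feats = subs.groupby("id_student").agg(
        n_submitted=("id_assessment", "nunique"),
        weighted_completion=("weighted_score", "sum"),
        avg_submission_lag=("lag", "mean"),
    )
    feats["submission_rate"] = feats["n_submitted"] / max(n_due, 1)
    feats["coverage_gap"] = 1.0 - feats["submission_rate"]
    return feats
```

Computing the same features at `window_frac` values of 0.3, 0.5, 0.7, and 1.0 yields one feature table per course window, mirroring the paper's 30%/50%/70%/100% design.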
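The evaluation protocol (a stratified 80/20 split of logistic regression and random forest, scored with AUC, F1, and recall) could look like the following scikit-learn sketch. The hyperparameters here (`n_estimators=300`, the 0.5 decision threshold, standardization for the linear model) are illustrative assumptions, not values reported in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def evaluate(X, y, seed=42):
    # stratify=y keeps the at-risk class ratio equal in train and test splits.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    models = {
        "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "rf": RandomForestClassifier(n_estimators=300, random_state=seed),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        prob = model.predict_proba(X_te)[:, 1]   # risk scores for AUC
        pred = (prob >= 0.5).astype(int)          # thresholded labels for F1/recall
        results[name] = {
            "auc": roc_auc_score(y_te, prob),
            "f1": f1_score(y_te, pred),
            "recall": recall_score(y_te, pred),
        }
    return results
```

Running `evaluate` once per course-window feature table reproduces the ablation structure: the same two models, scored identically, differ only in how much integrated evidence the features contain.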

References

[1] G. Siemens and P. Long, “Penetrating the fog: Analytics in learning and education,” EDUCAUSE Review, vol. 46, no. 5, pp. 30–40, 2011.

[2] R. Ferguson, “Learning analytics: drivers, developments and challenges,” International Journal of Technology Enhanced Learning, vol. 4, no. 5–6, pp. 304–317, 2012.

[3] W. Greller and H. Drachsler, “Translating learning into numbers: A generic framework for learning analytics,” in Proc. 2nd International Conference on Learning Analytics and Knowledge, 2012, pp. 42–57.

[4] C. Romero and S. Ventura, “Educational data mining: A review of the state of the art,” IEEE Transactions on Systems, Man, and Cybernetics, Part C, vol. 40, no. 6, pp. 601–618, 2010.

[5] C. Romero and S. Ventura, “Educational data mining and learning analytics: An updated survey,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 10, no. 3, p. e1355, 2020.

[6] E. Alyahyan and D. Düştegör, “Predicting academic success in higher education: literature review and best practices,” International Journal of Educational Technology in Higher Education, vol. 17, no. 3, 2020.

[7] M. Cantabella, R. Martínez-España, B. Ayuso, J. A. Yáñez, and A. Muñoz, “Analysis of student behavior in learning management systems through a big data framework,” Future Generation Computer Systems, vol. 90, pp. 262–272, 2019.

[8] M. Vaarma et al., “Predicting student dropouts with machine learning,” International Journal of Educational Research, 2024.

[9] A. M. Rabelo et al., “A model for predicting dropout of higher education students,” Smart Learning Environments, 2025.

[10] J. Kuzilek, M. Hlosta, and Z. Zdrahal, “Open University Learning Analytics dataset,” Scientific Data, vol. 4, p. 170171, 2017.

[11] E. Howard, “ouladFormat R package: Preparing the Open University Learning Analytics Dataset for analysis,” arXiv preprint arXiv:2501.08366, 2025.

[12] J. Samuelsen, W. Chen, and B. Wasson, “Integrating multiple data sources for learning analytics—review of literature,” International Journal of Educational Technology in Higher Education, vol. 16, no. 11, 2019.

[13] J. M. Dodero et al., “Trade-off between interoperability and data collection in learning analytics systems,” Computers & Education, vol. 106, pp. 44–57, 2017.

[14] M. Masud, X. Huang, J. Yong, et al., “Collaborative e-learning systems using semantic data interoperability and distributed metadata management,” Computers in Human Behavior, vol. 72, pp. 298–310, 2017.

[15] M. Paneque et al., “e-LION: Data integration semantic model to enhance learning analytics in multi-source e-learning ecosystems,” Expert Systems with Applications, vol. 213, p. 119245, 2023.

[16] M. Armbrust et al., “Delta Lake: High-performance ACID table storage over cloud object stores,” Proceedings of the VLDB Endowment, vol. 13, no. 12, pp. 3411–3424, 2020.

[17] J. Schneider et al., “The Lakehouse: State of the Art on Concepts and Technologies,” SN Computer Science, vol. 5, 2024.

[18] A. A. Harby et al., “Data Lakehouse: A survey and experimental study,” Information Systems, 2025.

[19] S. Slade and P. Prinsloo, “Learning analytics: Ethical issues and dilemmas,” American Behavioral Scientist, vol. 57, no. 10, pp. 1510–1529, 2013.

[20] P. Prinsloo and S. Slade, “An elephant in the learning analytics room: The obligation to act,” in Proc. Seventh International Conference on Learning Analytics & Knowledge, 2017, pp. 46–55.

[21] D. Ifenthaler and C. Schumacher, “Student perceptions of privacy principles for learning analytics,” Educational Technology Research and Development, vol. 64, no. 5, pp. 923–938, 2016.

[22] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[23] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[24] T. Saito and M. Rehmsmeier, “The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets,” PLOS ONE, vol. 10, no. 3, p. e0118432, 2015.

[25] M. Kuhn and K. Johnson, Applied Predictive Modeling. New York, NY, USA: Springer, 2013.

[26] B. Boehmke and B. Greenwell, Hands-On Machine Learning with R. Boca Raton, FL, USA: CRC Press, 2019.

[27] R. S. Baker and P. S. Inventado, “Educational data mining and learning analytics,” in Learning Analytics. New York, NY, USA: Springer, 2014, pp. 61–75.

[28] O. Viberg, M. Hatakka, O. Bälter, and A. Mavroudi, “The current landscape of learning analytics in higher education,” Computers in Human Behavior, vol. 89, pp. 98–110, 2018.

[29] E. López-Meneses et al., “Educational Data Mining and Predictive Modeling in the Era of Artificial Intelligence,” Computers, vol. 14, no. 2, p. 68, 2025.

[30] T. Nguyen et al., “Data quality management in big data: Strategies, tools, and AI-enabled directions,” 2025.

[31] M. M. Ncube and P. Ngulube, “Leveraging learning analytics to personalise academic library services for enhanced student success: A systematic review,” The Journal of Academic Librarianship, 2025.

[32] S. Boujmiraz et al., “Predicting student performance: A comprehensive review,” Machine Learning with Applications, 2026.

Published

13-04-2026

How to Cite

Kaphle, B., & Shrestha, B. (2026). A Lakehouse-Oriented Big Data Infrastructure for Educational Analytics: Integrating Administrative and Assessment Data for Early Student Risk Prediction. International Journal of Information Technology and Computer Science Applications, 4(1), 47–58. https://doi.org/10.58776/ijitcsa.v4i1.248

Section

New Submission