Predicting the occurrence of a Brain Stroke based on physiological data using machine learning

Skills & techniques: data analytics, supervised machine learning, R programming language.


Stroke is one of the leading causes of mortality around the globe, and identifying individuals at risk of stroke is crucial for prevention and management. This study aims to predict stroke occurrence using machine learning models and to evaluate different data balancing techniques in order to identify the best-performing model. The study uses retrospective data and considers risk factors such as age, gender, hypertension, heart disease, work type, residence, body mass index, and smoking status. The results indicate that oversampling outperformed the other data balancing approaches, and that random forest was the best-performing model, achieving an accuracy of 93.98%. The study highlights the importance of machine learning techniques in stroke prediction and the need for further research to improve their accuracy and clinical applicability.


A brain stroke, or cerebrovascular accident (CVA), is a medical emergency that occurs when blood flow to a region of the brain is interrupted or abnormal. It is the second leading cause of death and was responsible for approximately 11% of total deaths worldwide in 2020. There are two types of stroke: ischemic and hemorrhagic. An ischemic stroke is caused by a blockage in the blood vessels supplying the brain, whereas a hemorrhagic stroke occurs when a blood vessel in the brain ruptures. The latter is more dangerous and can lead to fatal outcomes for the patient (Mariano et al., 2022; Tan et al., 2022). Over the past three decades, stroke has remained a major public health concern, with significant economic costs and burdens incurred worldwide. Its consequences extend beyond the acute event: a systematic review of 50 studies (Ayerbe et al., 2013) concluded that the prevalence of depression among people after a stroke was 29%, and that the cumulative incidence within five years of stroke was 39%–52%.

Several risk factors can increase an individual’s likelihood of experiencing a stroke. These include high blood pressure, smoking, diabetes, high cholesterol, physical inactivity, overweight and obesity (BMI >26 kg/m²), excessive alcohol consumption, and family history (Guo et al., 2023). High blood pressure is particularly important, as uncontrolled high blood pressure can damage the blood vessels in the brain and increase the risk of a stroke. Smoking can likewise damage the blood vessels and increase stroke risk. Additionally, women have unique stroke risk factors, such as pregnancy, endogenous hormone levels, and exogenous hormone therapy, which make them more vulnerable to CVA than men (Yoon & Bushnell, 2023). Age is another relevant factor, since the risk of stroke more than doubles for each successive decade after the age of 55 years (Ovbiagele et al., 2013). Stroke mortality is higher in rural than in urban areas of the United States, apparently because rural residence is also associated with reduced access to a variety of healthcare services and lower use of acute stroke care interventions, such as brain imaging, thrombolysis, and stroke unit care (Kapral et al., 2019).

In recent years, there has been an increasing interest in leveraging machine learning techniques to improve stroke diagnosis. The primary objective of these studies is to enhance accuracy and speed of diagnosis (Sirsat et al., 2020). Machine learning, a subset of artificial intelligence (AI), involves training computer programs to recognize patterns and learn from data, and then using this learning to make predictions or take actions without explicit instructions. As such, machine learning has been increasingly utilized in medical research for various applications, including disease diagnosis, prediction, and treatment planning.

By predicting a patient’s risk of having a stroke, healthcare providers can take appropriate measures to manage risk factors and prevent a stroke or reduce its likelihood. This can ultimately improve the patient’s health outcomes and quality of life while also reducing the economic burden of stroke on society. Hence, the objective of this study is to predict whether a patient will suffer from a stroke based on gender, age, hypertension, heart disease, work type, residence, body mass index, and smoking status using different machine learning models. The main contribution of this study is to identify the best-performing model by evaluating different data balancing techniques, namely undersampling and oversampling.
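To make the two balancing techniques concrete, here is a minimal sketch of random undersampling and random oversampling. It is an illustration only: the study was carried out in R and does not state which sampling variant it used, so plain random sampling is an assumption, as are the function names, the `stroke` label key, and the toy data (95 non-stroke rows and 5 stroke rows). The sketch is written in Python with the standard library only and assumes exactly two classes.

```python
import random
from collections import Counter


def split_by_class(rows, label_key="stroke"):
    """Group rows by label and return (minority, majority) lists.

    Assumes exactly two distinct label values are present.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row)
    minority, majority = sorted(groups.values(), key=len)
    return minority, majority


def random_undersample(rows, label_key="stroke", seed=0):
    """Keep all minority rows and a random subset of majority rows of equal size."""
    rng = random.Random(seed)
    minority, majority = split_by_class(rows, label_key)
    kept_majority = rng.sample(majority, len(minority))  # without replacement
    return minority + kept_majority


def random_oversample(rows, label_key="stroke", seed=0):
    """Keep every row and add random duplicates of minority rows until classes match."""
    rng = random.Random(seed)
    minority, majority = split_by_class(rows, label_key)
    duplicates = rng.choices(minority, k=len(majority) - len(minority))  # with replacement
    return majority + minority + duplicates


def class_counts(rows, label_key="stroke"):
    """Count how many rows carry each label value."""
    return Counter(row[label_key] for row in rows)


# Hypothetical toy data, not the study's dataset.
rows = [{"age": 40 + i % 30, "stroke": 0} for i in range(95)]
rows += [{"age": 70 + i, "stroke": 1} for i in range(5)]

under = random_undersample(rows)
over = random_oversample(rows)

print("unbalanced:  ", dict(class_counts(rows)))   # 95 vs 5
print("undersampled:", dict(class_counts(under)))  # 5 vs 5
print("oversampled: ", dict(class_counts(over)))   # 95 vs 95
```

In this toy case undersampling discards 90 of the 95 majority rows, while oversampling keeps every original row and adds 90 duplicates of minority rows. In practice, balancing is usually applied to the training split only, so that evaluation still reflects the original class distribution.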



Unbalanced dataset
Undersampled dataset
Oversampled dataset


A cerebrovascular accident (CVA), commonly referred to as a brain stroke, is a critical medical emergency that arises from an interruption or abnormality in blood flow to a specific region of the brain. It is the second leading cause of mortality globally and accounted for approximately 11% of total deaths worldwide in 2020 (Guo et al., 2023). Early detection of a stroke improves the likelihood of a better outcome for the patient. Therefore, the objective of this study was to predict whether a patient will suffer from a stroke based on physiological data using different machine learning models.

The original dataset was imbalanced, so undersampling and oversampling techniques were applied to improve performance. Several machine learning models were applied to the three datasets (unbalanced, undersampled, and oversampled). Random forest yielded the best results on the oversampled dataset. Several studies have found that oversampling yields better results because no information is lost (Kaur & Gosain, 2018). Random forests tend to attain high classification accuracy because they can handle outliers and noise in the data, and they are less susceptible to overfitting than other models (Ahmad et al., 2018).


Abedi, V., Avula, V., Chaudhary, D., Shahjouei, S., Khan, A., Griessenauer, C. J., … & Zand, R. (2021). Prediction of long-term stroke recurrence using machine learning models. Journal of clinical medicine, 10(6), 1286.

Ahmad, I., Basheri, M., Iqbal, M. J., & Rahim, A. (2018). Performance comparison of support vector machine, random forest, and extreme learning machine for intrusion detection. IEEE Access, 6, 33789-33795.

Ayerbe, L., Ayis, S., Wolfe, C. D., & Rudd, A. G. (2013). Natural history, predictors and outcomes of depression after stroke: systematic review and meta-analysis. The British Journal of Psychiatry, 202(1), 14-21.

Bento, M., Souza, R., Salluzzi, M., Rittner, L., Zhang, Y., & Frayne, R. (2019). Automatic identification of atherosclerosis subjects in a heterogeneous MR brain imaging data set. Magnetic resonance imaging, 62, 18-27.

Chun, M., Clarke, R., Cairns, B. J., Clifton, D., Bennett, D., Chen, Y., … & China Kadoorie Biobank Collaborative Group. (2021). Stroke risk prediction using machine learning: a prospective cohort study of 0.5 million Chinese adults. Journal of the American Medical Informatics Association, 28(8), 1719-1727.

Dourado Jr, C. M., da Silva, S. P. P., da Nobrega, R. V. M., Barros, A. C. D. S., Reboucas Filho, P. P., & de Albuquerque, V. H. C. (2019). Deep learning IoT system for online stroke detection in skull computed tomography images. Computer Networks, 152, 25-39.

Emon, M. U., Keya, M. S., Meghla, T. I., Rahman, M. M., Al Mamun, M. S., & Kaiser, M. S. (2020, November). Performance analysis of machine learning approaches in stroke prediction. In 2020 4th International Conference on Electronics, Communication and Aerospace Technology (ICECA) (pp. 1464-1469). IEEE.

Guo, L., Guo, Y., Booth, J., Wei, M., Wang, L., Zhu, Y., … & Liu, Y. (2023). Experiences of health management among people at high risk of stroke in China: A qualitative study. Nursing Open, 10(2), 613-622.

Kapral, M. K., Austin, P. C., Jeyakumar, G., Hall, R., Chu, A., Khan, A. M., … & Tu, J. V. (2019). Rural-urban differences in stroke risk factors, incidence, and mortality in people with and without prior stroke: The CANHEART stroke study. Circulation: Cardiovascular Quality and Outcomes, 12(2), e004973.

Kaur, P., & Gosain, A. (2018). Comparing the behavior of oversampling and undersampling approach of class imbalance learning by combining class imbalance problem with noise. In ICT Based Innovations: Proceedings of CSI 2015 (pp. 23-30). Springer Singapore.

Lee, B. J., Kim, K. H., Ku, B., Jang, J. S., & Kim, J. Y. (2013). Prediction of body mass index status from voice signals based on machine learning for automated medical applications. Artificial intelligence in medicine, 58(1), 51-61.

Li, X., Bian, D., Yu, J., Li, M., & Zhao, D. (2019). Using machine learning models to improve stroke risk level classification methods of China national stroke screening. BMC medical informatics and decision making, 19, 1-7.

Liu, T., Fan, W., & Wu, C. (2019). A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical dataset. Artificial intelligence in medicine, 101, 101723.

Mahesh, K. A., Shashank, H. N., Srikanth, S., & Thejas, A. M. (2020). Prediction of Stroke Using Machine Learning.

Mariano, V., Tobon Vasquez, J. A., Casu, M. R., & Vipiana, F. (2022). Brain Stroke Classification via Machine Learning Algorithms Trained with a Linearized Scattering Operator. Diagnostics, 13(1), 23.

Min, S. N., Park, S. J., Kim, D. J., Subramaniyam, M., & Lee, K. S. (2018). Development of an algorithm for stroke prediction: a national health insurance database study in Korea. European neurology, 79(3-4), 214-220.

Ovbiagele, B., Goldstein, L. B., Higashida, R. T., Howard, V. J., Johnston, S. C., Khavjou, O. A., … & Trogdon, J. G. (2013). Forecasting the future of stroke in the United States: a policy statement from the American Heart Association and American Stroke Association. Stroke, 44(8), 2361-2375.

Sirsat, M. S., Fermé, E., & Câmara, J. (2020). Machine learning for brain stroke: a review. Journal of Stroke and Cerebrovascular Diseases, 29(10), 105162.

Tan, J., Ramazanu, S., Liaw, S. Y., & Chua, W. L. (2022). Effectiveness of public education campaigns for stroke symptom recognition and response in non-elderly adults: a systematic review and meta-analysis. Journal of Stroke and Cerebrovascular Diseases, 31(2), 106207.

Yoon, C. W., & Bushnell, C. D. (2023). Stroke in women: a review focused on epidemiology, risk factors, and outcomes. Journal of Stroke, 25(1), 2-15.
