Hybrid Sine Cosine and Stochastic Fractal Search for Hemoglobin Estimation

2022-08-24 06:58:44

Computers, Materials & Continua, 2022, Issue 8

Marwa M. Eid, Fawaz Alassery, Abdelhameed Ibrahim, Bandar Abdullah Aloyaydi, Hesham Arafat Ali1, and Shady Y. El-Mashad

1 Faculty of Artificial Intelligence, Delta University for Science and Technology, Mansoura, 35712, Egypt

2 Department of Computer Engineering, College of Computers and Information Technology, Taif University, Taif, 21944, Saudi Arabia

3 Computer Engineering and Control Systems Department, Faculty of Engineering, Mansoura University, Mansoura, 35516, Egypt

4 Mechanical Engineering Department, Qassim University, Buraidah, 51452, Saudi Arabia

5 Department of Computer Systems Engineering, Faculty of Engineering at Shoubra, Benha University, Egypt

Abstract: Hemoglobin and glucose levels can be determined by drawing a blood sample from the human body with a needle and analyzing it. Hemoglobin (HGB) is a critical component of the human body because it transports oxygen from the lungs to the body’s tissues and returns carbon dioxide from the tissues to the lungs. Measuring the HGB level is a critical step in any blood analysis. HGB levels often indicate whether a person is anemic or has polycythemia vera. Constructing ensemble models by combining two or more base machine learning (ML) models can yield an improved model. The purpose of this work is to present a weighted average ensemble model for predicting hemoglobin levels. An optimization method is used to obtain the ensemble’s optimum weights. In this work, the optimum weights are determined using a sine cosine algorithm based on stochastic fractal search (SCSFS). The proposed SCSFS ensemble is compared to Decision Tree, Multilayer Perceptron (MLP), Support Vector Regression (SVR), and Random Forest regressors as base models, and to the average ensemble model. The results indicate that the proposed SCSFS model outperforms the compared models and provides a nearly accurate hemoglobin estimate.

Keywords: Sine cosine optimization; metaheuristics optimization; hemoglobin estimation; weight average ensemble

1 Introduction

Hemoglobin is an iron-containing protein found in red blood cells. It is built from globin chains, which are interconnected protein subunits that carry oxygen. Heme is an essential component of each globin chain because it contains an iron atom, which is the site where oxygen binds. Hemoglobin aids in the transportation of oxygen and carbon dioxide throughout the body via the blood. Hemoglobin (HGB) and glucose (Gl) are two essential components of human blood, and blood is involved in many biological activities. Both a hemoglobin shortage and an overabundance of hemoglobin are associated with illness. The body generates glucose from the food it consumes to provide energy to all of its cells. However, an excessive amount of glucose in the blood may cause complications. Diabetes is one of the most prevalent illnesses globally, affecting hundreds of millions of people [1,2].

A reliable tool for managing laboratory information is very beneficial for illness diagnosis in the hematology laboratory. Hematological data analysis is a fascinating and challenging task in medical research, and we present a model that compares different classification and regression techniques using Scikit-Learn to accomplish it. This becomes feasible if a simple hemoglobin estimation method is available. To address complications caused by anemia, our team is developing a technology that will detect hemoglobin levels in the early stages of anemia with relative ease [3,4]. Consequently, it will assist in the treatment of anemic patients. Hemoglobin levels in the blood may be estimated using a variety of techniques, several of which are documented by the World Health Organization (WHO). The methods for estimating the HGB value are discussed in more detail in the next section.

Data mining is the computing process of discovering patterns in large datasets. This process involves methods based on machine learning, statistics, and database systems [5-7]. Both the amount of data generated and the amount of data collected have increased dramatically in recent decades. The main goal of the data mining process is to make sense of large amounts of primarily unlabeled data. Data mining is used in various areas such as hospitals, companies, education, fraud detection, and bioinformatics. In bioinformatics, it assists medical staff in extracting valuable information from large biological datasets, and it helps patients receive better and more affordable health care [8].

Estimating a mapping function between continuous input variables and a continuous output variable is known as regression analysis. Hemoglobin (HGB) can be predicted from input variables such as white blood cells (WBC), red blood cells (RBC), and lymphocytes (LYM). For this study, various machine learning estimators, namely the Decision Tree Regressor, Multilayer Perceptron (MLP) Regressor, Support Vector Regression (SVR), and Random Forest Regressor, were evaluated in addition to the ensemble model. Ensemble models integrate the strengths of a range of single base models to produce a nearly accurate prediction model [9-11]. This concept can be put into practice in different ways. For example, some techniques depend on resampling the training set, while others rely on different prediction methods or on changing the parameters of a specific predictive technique, among other things [12-14]. The results of the individual predictions are then pooled by the ensemble to reach a final output [15-17].

This work presents a weighted average ensemble model based on sine cosine and stochastic fractal search optimization techniques (SCSFS) for predicting hemoglobin levels. The proposed optimization algorithm is used to obtain the ensemble’s optimum weights. The proposed SCSFS ensemble is compared to Decision Tree, Multilayer Perceptron (MLP), Support Vector Regression (SVR), and Random Forest regressors as base models, in addition to the average ensemble and the ensemble based on the MLP regressor. The experiments assess whether the SCSFS ensemble outperforms the existing models and provides a nearly accurate hemoglobin estimate. ANOVA and t-test statistical methods are used to establish whether the difference between the proposed and compared techniques is significant.

2 Related Work

A variety of invasive techniques are used to measure blood components. The majority of these techniques test blood components by taking blood from the patient’s body via venipuncture. These techniques are unpleasant and carry a risk of infection since blood must be taken with a needle, and the results of the blood sample analysis take a long time to come back [18].

Several closely related studies have been published, including “Automated Diagnosis of Thalassemia Based on Data Mining Classifiers” and other publications that use data mining methods to diagnose a variety of illnesses. That study investigated thalassemia based on the complete blood count (CBC); however, its primary focus is on Mean Cellular Volume (MCV), whereas we believe that the CBC parameters as a whole should be used. The primary characteristic of thalassemia is low hemoglobin. The authors of that work investigated the algorithms in depth, focusing on their accuracy, learning time, and error rate. They observed a direct relationship between the time spent constructing the tree model and the number of records, and an indirect relationship between the time spent constructing the tree model and the number of attributes in the dataset. Their investigation concludes that Bayesian algorithms outperform all other algorithms in terms of classification precision [19].

A non-invasive approach for predicting the HGB concentration level based on photoplethysmography (PPG) signals has also been presented. The system analyzes characteristic PPG features using a variety of machine learning methods [20]. Forty distinctive features were identified from the datasets (PPG signals) obtained from 33 individuals, whose fingers were illuminated over the course of ten periods. A combination of RELIEFF feature selection (RFS) and correlation-based feature selection (CFS) was used to choose the best features before developing eight distinct regression models [21]. According to the prediction findings, the support vector-based regression model outperformed the other models in terms of overall performance.

Iron contributes to the hemoglobin (HGB) level in the blood and gives blood its red color. There is a significant relationship between the color components of a blood image and the hemoglobin concentration. In one study, the HGB value is calculated by analyzing the color components of image samples: the average of the red component of a color image is computed and used as a feature in the image analysis, based on publicly available data. Beyond commercial devices, much research is being conducted to create non-invasive hemoglobin monitoring systems and prediction algorithms for human blood. Researchers used pulse spectrophotometry at seven different wavelengths to create a non-invasive, continuous, and highly accurate device that showed promise for detecting HGB [22]. Data mining techniques have also been applied to diabetes diagnosis, where a comparative examination of different algorithms was conducted. That work is concerned with mining the relationships in diabetic data to organize the information effectively; it requires a model that can evaluate diabetes datasets.

3 Proposed Ensemble Model

The proposed ensemble model is introduced and described in more depth in this section. First, the gathered dataset is preprocessed. An exploratory analysis of the dataset is provided, and the preparation methods applied to the dataset are then described in detail.

3.1 Preprocessing

Specific columns, such as gender and age, have been removed from the dataset because they contained many blank entries. These blank entries can occur because the operator did not store the patient’s personal information with each blood test and instead used other software to handle the patient’s data at the time of the test. The CBC parameters used here were carefully chosen by a competent physician, who identified the most significant factors affecting the calculation of hemoglobin.

After selecting the relevant features, the data is normalized to convert the raw feature vectors into a form more appropriate for the various machine learning estimators. The StandardScaler transform is used for this purpose; it standardizes features by subtracting their means and scaling them to unit variance. Many machine learning estimators can perform poorly if an individual feature does not more or less resemble standard normally distributed data. Tab. 1 shows the descriptive statistics of the hematological dataset. The correlation between hemoglobin and the other parameters in the tested dataset is shown in Fig. 1. Fig. 2 presents the hemoglobin data in graphical form, showing the distribution of the dataset values for each parameter.
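The standardization described above can be illustrated with a short sketch. It uses only the Python standard library rather than Scikit-Learn, and the feature values are hypothetical placeholders, not records from the dataset; the function name `standardize` is introduced here for illustration only.

```python
import math

def standardize(columns):
    """Standardize each feature column to zero mean and unit variance."""
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        # Population variance (divide by n), the convention StandardScaler follows.
        var = sum((x - mean) ** 2 for x in col) / len(col)
        std = math.sqrt(var) or 1.0  # assumed guard: leave constant columns unscaled
        scaled.append([(x - mean) / std for x in col])
    return scaled

# Hypothetical values for two features (e.g. RBC and WBC); not taken from the dataset.
rbc = [4.2, 4.8, 5.1, 3.9]
wbc = [6.0, 7.5, 5.2, 9.1]
rbc_z, wbc_z = standardize([rbc, wbc])
```

After the transform, each column has a mean of zero and a variance of one, so features measured on different scales contribute comparably to distance-based or gradient-based estimators.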

Table 1: Descriptive statistics of the hematological dataset

Figure 1: Hemoglobin data correlation of parameters vs. HGB

Figure 2: Processing of the hemoglobin dataset parameters

3.2 Weight Average Ensemble Model

The proposed weighted average ensemble model is based on optimizing the weights of the base models and then calculating the average ensemble from the weighted outcomes. The optimum weights for the ensemble model are determined using the Sine Cosine Algorithm (SCA) [23] based on the Stochastic Fractal Search (SFS) method [24]. The SCSFS algorithm is responsible for optimizing the weights of the base models. Once the optimized weights of the base models have been calculated, the average ensemble is computed to obtain the final output.
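As a minimal sketch of the weighted averaging step, the following function combines the predictions of several base models. Normalizing the weights so that they sum to one is an assumption made here for illustration, and the prediction values are hypothetical.

```python
def weighted_average(predictions, weights):
    """Combine base-model predictions with non-negative weights.

    predictions: list of per-model prediction lists (same length each).
    weights: one weight per model; normalized here so they sum to 1 (assumption).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    norm = [w / total for w in weights]
    n_samples = len(predictions[0])
    return [
        sum(w * model[i] for w, model in zip(norm, predictions))
        for i in range(n_samples)
    ]

# Hypothetical HGB predictions (g/dL) from three base models for two samples.
preds = [[13.0, 11.0], [14.0, 12.0], [12.0, 10.0]]
equal = weighted_average(preds, [1, 1, 1])      # plain average ensemble
weighted = weighted_average(preds, [3, 1, 0])   # favors the first model, drops the third
```

With equal weights the function reduces to the plain average ensemble used as a baseline; the optimizer's task is to find weights that lower the prediction error relative to that baseline.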

A common problem in machine learning is continuous function optimization, in which the input arguments of the function are real-valued numbers, such as floating-point values. The function returns a real-valued evaluation of its arguments. The term continuous function optimization distinguishes such problems from problems involving discrete variables, which are referred to as combinatorial optimization problems. Various methods can be formulated and applied to solve continuous function optimization problems. One way of classifying optimization methods is by how much information about the objective function is available and used during the optimization process. The more that is known about the objective function, the simpler it is to optimize, since that knowledge can be applied efficiently.

3.3 SCSFS Optimizer

The Sine Cosine Algorithm (SCA) was introduced for optimization problems in [23]. The algorithm relies mainly on the sine and cosine functions to update the positions of the agents. A set of random variables determines the direction of movement, the distance of the movement, and the switch between the sine and cosine components. SCA updates the positions of the solutions using the following equation.
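The update rule itself is not reproduced in this copy. As a point of reference, the standard SCA position update from the original algorithm [23], written with the symbols defined in the text, has the form below; it should be read as a reconstruction under that assumption, not as the authors' exact typesetting.

```latex
X_i^{t+1} =
\begin{cases}
X_i^{t} + r_1 \sin(r_2)\,\left| r_3 P_i^{t} - X_i^{t} \right|, & r_4 < 0.5 \\[4pt]
X_i^{t} + r_1 \cos(r_2)\,\left| r_3 P_i^{t} - X_i^{t} \right|, & r_4 \ge 0.5
\end{cases}
\qquad
r_1 = a\left(1 - \frac{t}{t_{\max}}\right)
```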

where X_i^t is the position of the current solution in the i-th dimension at iteration t, and P_i^t represents the current position of the best solution in the i-th dimension. The parameter r_1 is calculated as r_1 = a(1 - t/t_max), where t is the current iteration, a is a constant, and t_max is the total number of iterations. The parameters r_2, r_3, and r_4 are random values in [0,1].
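A single SCA move for one agent can be sketched as follows. The sketch follows the standard SCA rule and the parameter definitions given above; the default a = 2.0 and the function name `sca_step` are assumptions introduced for illustration.

```python
import math
import random

def sca_step(position, best, t, t_max, a=2.0, rng=random):
    """Return a new position for one agent using the sine cosine update."""
    r1 = a * (1 - t / t_max)  # step scale, shrinks linearly to 0 at t = t_max
    new_position = []
    for x, p in zip(position, best):
        # r2, r3, r4 are drawn per dimension from [0, 1), as stated in the text.
        r2, r3, r4 = rng.random(), rng.random(), rng.random()
        wave = math.sin(r2) if r4 < 0.5 else math.cos(r2)
        new_position.append(x + r1 * wave * abs(r3 * p - x))
    return new_position

# Hypothetical two-dimensional agent and best solution.
random.seed(1)
moved = sca_step([0.2, 0.7], [0.5, 0.5], t=1, t_max=50)
```

Because r_1 decays to zero, early iterations take large exploratory steps while late iterations leave the agents almost where they are, which is what shifts the search from exploration to exploitation.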

The method generates a random starting position for each of the n agents in the SCA population, from which the final position is eventually determined. The objective function is then computed for each agent to determine the position of the best solution. Stochastic Fractal Search (SFS) is a meta-heuristic algorithm inspired by random fractals that improves on the original fractal method in terms of time consumption and accuracy. The basic SFS method [24] employs the following elements to find a solution for a given problem:

where X_i′* is the updated best solution obtained through the diffusion process, and η and η′ are random values in [0,1].
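The diffusion formula is not reproduced in this copy. The sketch below assumes the Gaussian walk commonly given for SFS [24], in which a new point is sampled from a Gaussian centered on the best solution, with a standard deviation of |log(g)/g · (P_i − X*)| for generation g, and then shifted by (η · X* − η′ · P_i). The function name, the per-agent draw of η and η′, and the guard for g = 1 are assumptions.

```python
import math
import random

def sfs_diffuse(best, point, generation, rng=random):
    """Gaussian walk around the best solution (returns one new candidate)."""
    g = max(generation, 2)  # assumed guard: log(1) = 0 would collapse sigma
    eta, eta_prime = rng.random(), rng.random()  # random values in [0, 1)
    candidate = []
    for b, p in zip(best, point):
        sigma = abs(math.log(g) / g * (p - b))  # spread shrinks as g grows
        candidate.append(rng.gauss(b, sigma) + (eta * b - eta_prime * p))
    return candidate

# Hypothetical best solution and agent position.
random.seed(2)
new_point = sfs_diffuse([0.5, 0.5], [0.2, 0.7], generation=3)
```

The spread of the walk narrows as the generation count increases, so the diffusion samples widely around the best solution at first and increasingly close to it later.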

The Gaussian distribution is used to generate new particles in the diffusion process of SFS. By applying the SFS diffusion process around the best solution, the proposed SCSFS can explore more candidate solutions and reach the best answer faster, as shown in Fig. 3. The proposed SCSFS algorithm is explained in detail in Algorithm 1. Steps 1 to 3 initialize the algorithm parameters. From step 4 to step 19, the algorithm calculates the predefined objective function and updates the agents’ positions. The agents’ positions are updated using the sine cosine algorithm in steps 7 to 14 and the SFS algorithm in steps 15 to 17. At step 18, the iteration counter is updated. When the process finishes, the optimal solution is obtained.
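The overall loop can be sketched as below. This is not Algorithm 1 itself, which is not reproduced in this copy; it is a simplified illustration that alternates an SCA move with an SFS-style Gaussian diffusion around the best solution and keeps any improvement. The clipping of weights to [0, 1], the greedy acceptance rule, the population size, and all data values are assumptions.

```python
import math
import random

def rmse_of_weights(weights, preds, target):
    """Objective: RMSE of the weighted-average ensemble (weights normalized)."""
    total = sum(weights) or 1e-12  # avoid division by zero for all-zero weights
    err = 0.0
    for i, y in enumerate(target):
        y_hat = sum(w * m[i] for w, m in zip(weights, preds)) / total
        err += (y_hat - y) ** 2
    return math.sqrt(err / len(target))

def clip(v):
    """Keep a weight inside [0, 1] (assumed search bounds)."""
    return min(1.0, max(0.0, v))

def scsfs(objective, dim, n_agents=10, t_max=50, a=2.0, seed=0):
    rng = random.Random(seed)
    agents = [[rng.random() for _ in range(dim)] for _ in range(n_agents)]
    best = min(agents, key=objective)[:]
    for t in range(1, t_max + 1):
        r1 = a * (1 - t / t_max)
        for k, agent in enumerate(agents):
            # SCA phase: move the agent relative to the best position.
            moved = []
            for x, p in zip(agent, best):
                r2, r3, r4 = rng.random(), rng.random(), rng.random()
                wave = math.sin(r2) if r4 < 0.5 else math.cos(r2)
                moved.append(clip(x + r1 * wave * abs(r3 * p - x)))
            agents[k] = moved
            # SFS phase: Gaussian diffusion around the best solution.
            g = t + 1
            eta, eta_p = rng.random(), rng.random()
            walk = [
                clip(rng.gauss(b, abs(math.log(g) / g * (m - b))) + eta * b - eta_p * m)
                for b, m in zip(best, moved)
            ]
            # Greedy update: keep whichever candidate improves the objective.
            for cand in (moved, walk):
                if objective(cand) < objective(best):
                    best = cand[:]
    return best, objective(best)

# Hypothetical predictions of three base models and the measured HGB values.
preds = [[13.1, 11.2, 15.0], [12.7, 10.9, 14.6], [13.6, 11.8, 15.5]]
target = [13.0, 11.0, 14.8]
weights, err = scsfs(lambda w: rmse_of_weights(w, preds, target), dim=3)
```

Fixing the seed makes the run repeatable, which is useful when the RMSE distribution over several runs is later compared statistically.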

4 Experimental Results

The dataset, performance metrics, results, and statistical analysis are explained in detail in this section. In addition to feature selection and standardization, the hematological dataset is divided into two parts: training data (80 percent of the dataset) and testing data (20 percent of the dataset). A comparison between the original and predicted hemoglobin values is presented in Fig. 4 to show the effectiveness of the presented SCSFS-based model.

Figure 3: Proposed SCSFS optimization technique using the SCA algorithm based on SFS method

Algorithm 1: Pseudo-code of the proposed SCSFS algorithm
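The 80/20 split can be sketched with the standard library as follows. Whether the records were shuffled before splitting is not stated, so the shuffle and the fixed seed are assumptions, and the records are placeholders.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and split them into training and testing lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(100))  # stand-in for 100 hypothetical blood-test rows
train_rows, test_rows = train_test_split(records)
```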

4.1 Dataset

The data represented in Tab. 1 was gathered using the Mindray BC-5300 Auto Hematology Analyzer [25], which delivers consistent and nearly accurate five-part hematology findings from as little as 20 µL of blood. The analyzer stores more than 200 hematological parameters for each blood test, which makes it a powerful tool. Some of these values, such as RBC, WBC, and PLT, are computed automatically by the analyzer. Other parameters, such as gender and age, are entered into the system manually by the operator.

Figure 4: Sample of the original value (green color) compared with the predicted value (red color) based on the SCSFS algorithm

The hematological dataset is made up of the following CBC parameters: WBC, Lymphocytes (LYM), RBC, MCV, Mean Cellular Hemoglobin (MCH), Mean Cellular Hemoglobin Concentration (MCHC), Red Blood Cell Distribution Width (RDW), Hematocrit (HCT), Platelet Count (PLT), Mean Platelet Volume (MPV), and Hemoglobin (HGB) [18,26]. These factors support data mining operations on hematological data. Example records from the dataset are shown in Tab. 2.

Table 2: Samples of the hematological records in tested dataset

4.2 Performance Metrics

The performance metrics used to evaluate the proposed algorithm on the tested dataset are Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Bias Error (MBE), Pearson correlation coefficient (r), determination coefficient (R2), Relative Root Mean Squared Error (RRMSE), Nash-Sutcliffe Efficiency (NSE), and Willmott Index (WI), as shown in Tab. 3. The parameter H_p,i indicates a predicted value, H_i represents the corresponding measured value, and n represents the total number of observations [23].
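Since the formulas of Tab. 3 are not reproduced in this copy, the sketch below uses the commonly cited definitions of these metrics; the exact forms in Tab. 3 may differ. In particular, taking R2 as the square of r, expressing RRMSE as a percentage of the measured mean, and using Willmott's index of agreement for WI are assumptions, as are the sample values.

```python
import math

def regression_metrics(measured, predicted):
    """Common regression metrics for measured values H_i and predictions H_p,i."""
    n = len(measured)
    mean_h = sum(measured) / n
    mean_p = sum(predicted) / n
    errors = [p - h for h, p in zip(measured, predicted)]
    sse = sum(e * e for e in errors)
    rmse = math.sqrt(sse / n)
    cov = sum((h - mean_h) * (p - mean_p) for h, p in zip(measured, predicted))
    var_h = sum((h - mean_h) ** 2 for h in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r = cov / math.sqrt(var_h * var_p)
    wi_den = sum(
        (abs(p - mean_h) + abs(h - mean_h)) ** 2
        for h, p in zip(measured, predicted)
    )
    return {
        "RMSE": rmse,
        "MAE": sum(abs(e) for e in errors) / n,
        "MBE": sum(errors) / n,            # positive means over-prediction
        "r": r,
        "R2": r * r,                       # assumed: squared correlation
        "RRMSE": 100 * rmse / mean_h,      # assumed: percent of measured mean
        "NSE": 1 - sse / var_h,
        "WI": 1 - sse / wi_den,
    }

# Hypothetical measured and predicted HGB values (g/dL).
scores = regression_metrics([12.0, 14.0, 16.0], [12.5, 14.0, 15.5])
```

The function assumes that neither the measured nor the predicted values are constant; otherwise the variance terms are zero and the ratios are undefined.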

Table 3: Performance metrics for regression [23]

4.3 Results

The results of the base models, the ensemble models, and the proposed model are presented in Tabs. 4 and 5. Tab. 4 shows the results of the base models, while Tab. 5 shows the results of the ensemble model using the proposed SCSFS and of the other ensemble models. The results of the Decision Tree, MLP, SVR, and Random Forest regressors as base models, together with those of the average ensemble and the ensemble model based on the MLP regressor, are compared with the weighted average ensemble model using the SCSFS algorithm to show the performance of the proposed model. The proposed model achieves an RMSE of 0.009042361, MAE of 0.000828369, MBE of -0.003330259, r of 0.999876369, R2 of 0.999718049, RRMSE of 1.830587988, NSE of 0.999296554, and WI of 0.986786848. The RMSE box plot for the presented and compared models vs. the objective function is shown in Fig. 5, which illustrates the performance of the proposed ensemble model.

Table 4: Results of the base model

Table 5: Results of the ensemble model using the proposed SCSFS and other ensemble models

Figure 5: RMSE box plot graph for the presented and compared models vs. the objective function

The SCSFS-based ensemble model exhibits superior performance in the histogram and receiver operating characteristic (ROC) curves, confirming the superiority of the proposed model for the studied problem, as shown in Fig. 6. Fig. 6a shows the ROC curve of the presented ensemble model using SCSFS vs. the ensemble model using the MLP regressor, and Fig. 6b shows the ROC curve of the presented ensemble model using SCSFS vs. the average ensemble model. Fig. 6c presents the histogram of RMSE, with bin centers of 0.008, 0.034, and 0.060, against the number of values for the presented model and the compared ensemble and single models. ROC analysis is performed on ranked or continuous diagnostic test data. The derived accuracy indices, especially the area under the curve (AUC), provide a meaningful indication of the best regression model. As shown in Fig. 6, the AUC value of the proposed algorithm is much greater than that of the compared methods and approaches one. The QQ plot also shows that the actual and predicted values of the proposed algorithm fit closely, as represented in Fig. 7. Fig. 7a shows the residual plot, Fig. 7b represents the homoscedasticity plot, and Fig. 7c shows the QQ plot. The heat map for the presented ensemble model using the SCSFS algorithm vs. the other single and ensemble models is shown in Fig. 7d.

Figure 6: (a) ROC curve of the presented ensemble model using SCSFS vs. ensemble model using MLP regressor, (b) ROC curve of the presented ensemble model using SCSFS vs. average ensemble model, and (c) Histogram of RMSE, with bin centers of 0.008, 0.034, and 0.060, against number of values for the presented model and the compared ensemble and single models

Figure 7: (a) Residual plot, (b) Homoscedasticity plot, (c) QQ plot, and (d) Heat map for the presented ensemble model using SCSFS algorithm vs. other single and ensemble models

4.4 Statistical Analysis

ANOVA and t-test statistical methods are used to compare the populations and to establish whether there is a significant difference between the proposed and compared techniques. The results of the two-way ANOVA test are shown in Tab. 6. The statistical hypotheses for ANOVA are stated as follows:

• The null hypothesis (H0) states that there is no statistically significant difference between the groups.

• The alternative hypothesis (H1) states that there is a statistically significant difference between the means of the populations.

For the one-sample t-test, as shown in Tab. 7, the statistical hypotheses may be expressed in the following manner:

• The null hypothesis (H0) states that there is no statistically significant difference between the two groups.

• The alternative hypothesis (H1) states that there is a significant difference between the two population means.
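The one-sample t statistic used to test these hypotheses can be computed as in the sketch below. The RMSE values are hypothetical placeholders rather than the results of Tab. 7, and the reference mean of zero is an assumption.

```python
import math

def one_sample_t(sample, mu0=0.0):
    """Return the t statistic and degrees of freedom for H0: mean == mu0."""
    n = len(sample)
    mean = sum(sample) / n
    # Sample variance with Bessel's correction (divide by n - 1).
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    t_stat = (mean - mu0) / math.sqrt(var / n)
    return t_stat, n - 1

# Hypothetical RMSE values from five repeated runs; not the paper's results.
rmse_runs = [0.010, 0.012, 0.009, 0.011, 0.008]
t_stat, dof = one_sample_t(rmse_runs, mu0=0.0)
```

The null hypothesis is rejected when the absolute value of the statistic exceeds the critical value of the t distribution for the chosen significance level and the returned degrees of freedom (2.776 for a two-sided test at the 0.05 level with 4 degrees of freedom).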

Table 6: Results of ANOVA test of the presented model compared to other models

Table 7: Results of one sample T-Test

5 Conclusion

In this work, the ensemble weights are optimized by the proposed SCSFS meta-heuristic optimization algorithm, which is based on the Sine Cosine Algorithm (SCA) and Stochastic Fractal Search. The proposed SCSFS ensemble, which estimates the hemoglobin value from hematological parameters, is compared to the Decision Tree, MLP, SVR, and Random Forest base models and to the average ensemble model. A comparison and a statistical study, including ROC curves and the t-test, were carried out to assess the superiority and stability of the predicted outcomes and to validate the correctness of the process.

Acknowledgement: We deeply acknowledge Taif University for supporting this study through Taif University Researchers Supporting Project Number (TURSP-2020/150), Taif University, Taif, Saudi Arabia.

Funding Statement: Funding for this study was received from Taif University Researchers Supporting Project No. (TURSP-2020/150), Taif University, Taif, Saudi Arabia.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
