PREDICTING LESION GROWTH

using Machine Learning

by Victor Lai and Bryson Li

View on GitHub

Background

The purpose of this project is to evaluate various incremental machine learning algorithms in predicting the growth of ischemic stroke lesion by using pre-processed data extracted from MRI images. In particular, magnetic resonance in conjunction with perfusion imaging is used to extract important features and parameters to classify salvageable brain tissue. Studies and research done in this field are intended to move away from the single-voxel approach and instead, use a correlated, spatial approach in making predictions. By coming up with more refined algorithms that capture more of the complexity but also eliminate the noise of the data set available, more accurate predictions can be made in whether or not a given patient should undergo treatment. Moreover, exploring techniques that are capable of learning one data instance at a time can potentially improve the efficiency of such predictions.

Before any analysis can be performed, the data must first be put together from MRI images of patients that are extracted via segmentation and registration. The process involves identifying tissue at risk and infarcted tissue and integrating different types of images like diffusion weighted and perfusion-weighted images to be able to study different parameters and features such as cerebral blood volume, time to peak, and mean transit time. Whatever model is developed using the processed data can then be compared with the ground truth of infarcted tissue segmented out of FLAIR images taken days after intervention [3].

Although there have been models of tissue fate studied using single, independent voxels, models that consider correlated, surrounding tissue have showed better performance because of their robustness to noise. Supervised regression models, for example, have been proposed that are trained on a set of FLAIR images with known outcomes and making a prediction based on FLAIR intensity. In fact, results from experiments [2] have shown that nonlinear models are more applicable than linear ones for considering regional features of a voxel. A model of deep learning, the Convolutional Neural Network [3] which involves multiple intermediate layers taking in a Tmax map as input, also showed better performance over a linear regression model that was single-voxel based.

Research has also been done to explore other factors beyond the voxels of an MRI image that can also determine the fate of brain tissue. The model of forest classifiers, for example, takes into account not only the voxels of an image, but also the level of therapy performed, in predicting tissue growth. Acknowledging that many other studies do not account for this factor of degree of intervention in predicting the fate of tissue, R. McKinley et al. trained two separate models, one making a prediction given a good response to therapy and the other making a prediction based on no intervention. To segment out the penumbra, the tissue at risk, a unique algorithm that involved making a prediction based on a collection of decision trees constructed, was utilized [3].

Although not all of the mentioned models [2], [4] use data that is adequate, uniformly distributed, and representative of all stroke patients, it should not be an issue going forward considering that there is an abundance of medical data for researchers to work with. With the availability of data and growing technology of digital and medical imaging, many other learning techniques, like incremental ones, can be explored. Computational power and data storage have both grown in the last twenty years [3] to facilitate deep learning. However, sometimes main memory may be limited, so incremental learning can be efficient and more scalable.

Code

The framework to apply various machine learning techniques on the relevant set of data is maintained on GitHub. Please follow the repository to view the progress of this project.

Final Report

Methods

Although there exists software to compute perfusion values from tissue concentration time curves and AIF, the values generated can often be noisy, so machine learning is believed to have potential in accurately predicting such values. With increasingly more data becoming available, incremental learning can efficiently make predictions and potentially perform as well as batch-based learning. For this study, the predictive models used were those from scikit-learn, a Python library for machine learning. In particular, the regression-based versions of stochastic gradient descent (SGD) and Passive-Aggressive (PA) were utilized in experiments to evaluate the potential of incremental models in predicting perfusion parameters. Accurate prediction of perfusion values like TMAX and MTT can be helpful indicators of tissue outcome, which would then be critical for clinicians to make decisions regarding patients with acute ischemic stroke. The subset of data that was used to apply incremental machine learning algorithms on was provided by Fabien Scalzo’s research team. Supplementary libraries like matplotlib and pandas were included in scripts to help tune hyperparameters, visualize data, and conveniently record results.

Evaluation

Experiments were performed as attempts to answer some important questions. Does the performance improve with each batch of data that the model incrementally trains on? Does increasing the patch radius, or window size, actually improve the performance as many studies have found? How do the incremental models perform in comparison to batch-based models? How well do the models predict each perfusion parameter? The resulting numbers from these experiments were compared with the performance numbers of batch-based models including support vector regression and linear regression, all of which were provided by a member of Scalzo’s research team. Root-mean-square error and coefficient of determination (R2 score) were the performance metrics used to gauge the accuracy of predictions. The overall problem can be regarded as a regression-based task where the output corresponds to a perfusion value.

The data instances are pixels from perfusion-weighted images of the brains of anonymous stroke patients admitted to the UCLA Medical Center. The features essentially make up a concatenated concentration time curve (CTC) with the AIF appended. A script was included with the data for convenient extraction of the data given the perfusion parameter to predict and the size of the patch to extract for each pixel. In particular, two data sets were provided for each combination of perfusion parameter and patch radius. The larger set and smaller set were meant to be used to train the model and verify, or test, the model, respectively. The number of examples in each set varied across perfusion parameter, but there were roughly 7000-8000 instances in the larger set and 2000-3000 in the smaller set. The test dataset contained examples of new patients not included in the examples of the training dataset.

By creating and using various visualization tools, we found many noteworthy things regarding the provided data even before conducting the experiments. Although we were told that the distribution of the data was uniform, we found that by generating histograms of the distribution of each dataset, it was only roughly uniform relative to predetermined bin sizes. The distributions are shown in Figures 1-10. Nevertheless, because there was not any sharp skewness observed, the non-uniformity should not have affected too much of our results. We also discovered that each row of the data matrix was sorted by the bins that each value of the ground truth belonged to, which we thought was significant considering that the way in which our incremental models are fed each batch of training data would matter in how well they perform.

Results

We observed steady improvements in accuracy with increasingly more batches trained on our incremental models, as shown in Figure 17. The training RMSE of SGD, for example, decreased by 0.00502 ± 0.0057, while the test error decreased by 0.00570 ± 0.0063 for each new batch of size 1 trained on.

Our experiment of varying patch sizes showed slight improvements in accuracy for limited patch radiuses, as depicted in Figure 12. For SGD, an impressive order of magnitude improvement was observed in predicting TMAX. A single-voxel-based approach resulted in a RMSE of 68.1 ± 0.021, which improved to 51.5 ± 0.037 with a slightly larger patch radius of 1. However, increasing the patch radius any further only reduced the accuracies. In contrast to a batch-based model in linear support vector regression, the error consistently decreased from 37.5 to 33.1 as the patch size increased from 1 x 1 to 13 x 13.

The accuracies across incremental and batch-based models are reported in Figures 13-16. Based on running the experiment multiple times with different random seeds, PA remarkably produced an RMSE of 10.7 ± 0.055 and 16.8 ± 0.0059 in predicting a patch size of 13 x 13 for RBV and MTT, respectively. In comparison, batch-based models like kernel ridge regression performed with an RMSE of 18.4 and 22.5 for RBV and MTT, respectively, on the given test datasets.

Discussion

Our experimental results showed that the performance numbers of SGD were affected when the data was not shuffled before training. With each batch trained on, we observed that the training and test errors continued to decrease for the first several batches but then increased for the last few batches. On the other hand, with shuffling, the performance continued to improve with increasingly more batches. The problem that we envisioned can be related to a situation in which a classifier trains on data instances that all belong to the same class, but then performs poorly on new examples belonging to a different class. In fact, our models seem to put less weight on examples with different outcomes that come in later batches [1], so randomization becomes important.

Even though many studies in the field have shown that the surrounding distribution of a voxel can be predictive of tissue outcome, increasing the size of the patch in our experiments showed only small improvements in accuracy. The maximum number of data examples provided were trained for each patch radius in this experiment. The small disparities in performance can be explained by the fact that there exists a convergence of the optimal patch size, and more training instances may be needed to see a more noticeable improvement. There also may have been some kind of noise involved as the numbers for PA did not show any consistent trends.

With reasonable tuning of each model’s respective parameters, the performance numbers of our incremental models showed comparable results to the batch-based ones. For each perfusion parameter and model, cross validation was performed on the training set for tuning, before the full training and test sets were used to train and test the models, respectively. Perhaps with the exception of linear regression, all of the models seemed to show rather similar numbers in RMSE on the test datasets. However, testing on more subsets of patient data will be required in order to verify any significance of performance between models.

Although the experiments conducted have answered some questions about the potential of incremental models in predicting perfusion values, more can be done to verify and possibly exploit more of such models. In particular, kernels can be utilized to map the features to another space to exploit a potentially nonlinear relationship in the data. Hopefully by putting more effort into creating another visualization method that can perform dimensionality reduction (e.g. PCA), we can potentially confirm the linearity or non-linearity of the data. We could even oversample or undersample the data ourselves to see if an equal number of perfusion values drastically affects performance. More data, if available, can also be fed into the models, especially to verify if more training instances can lead to improved performance for larger patch sizes [2]. Furthermore, an experiment can also be done to see if removing the features representing the AIF affects the results in any significant way.

Results

Here is an updated list of our major findings. Click on any of the graphs to see an enlarged, interactive view.

First, we present information regarding the provided data. The distribution of each combination of perfusion parameter and data type (training or test) is presented in Figures 1-10.


Figure 1: Distribution of RBF training set.
Figure 2: Distribution of RBF test set.

Figure 3: Distribution of TMAX training set.
Figure 4: Distribution of TMAX test set.

Figure 5: Distribution of TTP training set.
Figure 6: Distribution of TTP test set.

Figure 7: Distribution of MTT training set.
Figure 8: Distribution of MTT test set.

Figure 9: Distribution of RBV training set.
Figure 10: Distribution of RBV test set.

Although we were told that each dataset is uniformly distributed, we found that the distributions varied from generating those bar charts. The data for each perfusion parameter was sampled with a bin size of 1, except for RBF, which used a bin size of 2. Some examples were duplicated (oversampled) for the purpose of achieving a more uniform distribution across bins. The number of examples in each dataset is shown in Figure 11 below.

Figure 11: Size of datasets.

The number of features for a given instance is dependent on the patch radius specified. The patch radius refers to regional distribution of a particular voxel. Many studies have found that by taking into consideration the surrounding region of a given voxel, the accuracy of predictions improves. We define this square region as a patch, or window. The window size is the length of this region. The patch radius is the distance from the given voxel to an edge of the region. For example, if the patch radius is 4, then the window size is 9 (or 9 x 9). A single-voxel-based approach, which does not use a regional distribution, uses a patch radius of 0 (and window size of 1 x 1). Mathematically, the window size is computed by doubling the value of the patch radius and adding one.


Patch Diagram

Regardless of the window size, the last 40 columns of each data instance is representative of the AIF. The remaining columns refer to the concentration time curves (CTC) of each voxel in the patch. Each CTC accounts for 40 additional columns. Mathematically, the total number of features can be calculated by multiplying the area of the patch (i.e. the window size squared) by the size of the CTC (which is 40) and adding the size of the AIF (which is also 40). For example, if the patch size is 13, then there are 6,800 total features.

Another note about the data is that each row of the matrix is sorted by the bins that each value of the ground truth (outcome) belongs to.


Now we present the actual results!

Please note that the error bars (which are very small) generated in the graphs that follow are based on running our experiments multiple times with each trial using a different seed. Because our incremental models are non-deterministic and use a seed for a pseudo random number generator as part of their respective stochastic algorithms, the average and standard deviation are reported after the trials have been run. Normally experiments are run repetitively in a manner where each trial involves a different split of training and test data (via cross validation). Because we were only provided a subset of data and the patients were unable to be distinguished from the voxels of data, we were unable to run standard procedure with different sets of patients for a given perfusion parameter and patch size.


Does increasing the patch radius, or window size, actually improve the performance of incremental models? Figure 12 shows that increasing the region of a voxel slightly decreases the training error of our incremental models.


Figure 12: Performance of models with varying patch size.

The plot specifically depicts results from training PA and SGD with a default batch size of 100 on perfusion parameter TMAX, compared to those for a batch-based algorithm in linear SVR. Very small improvements in accuracy can be seen from increasing the patch size. A closer look at the numbers can be found here.


How do the incremental models perform in comparison to batch-based models? How well do they predict each perfusion parameter? Figures 13-16 show the training and test performance of our incremental models compared with some batch-based models.


Figure 13: Average training root mean squared error across models.
Figure 14: Average training R2 score across models.

Figure 15: Average test root mean squared error across models.
Figure 16: Average test R2 score across models.

The models were compared across perfusion parameters using a patch radius of 6. The perfusion parameters RBF and RBV are similar to CBF and CBV, respectively.

Generally speaking, our incremental models showed from training and test to be comparable in predictive power with the other batch-based models. In fact, they even showed better accuracy for perfusion parameters like MTT and RBV. Since our incremental models did not take advantage of kernels (and do not naturally and internally support them in scikit-learn), there is some reason to believe that PA and SGD can be improved if the data has a non-linear distribution. Because the data contains many features, it was difficult to visualize the shape of the data. The distribution will be verified later by performing dimensionality reduction (e.g. PCA), perhaps.

The results for the batch-based models were recorded by a member of Fabien Scalzo's research team. A closer look at those numbers can be found here. The batch-based models that were used were linear regression (linreg), support vector regression with linear kernel (linsvr), support vector regression with RBF kernel (rbfsvr), ridge regression with linear kernel (linkridge), ridge regression with RBF kernel (rbfkridge), and decision tree regressor (tree). Note that no tuning was performed on any of those models (so the the parameters for each model were left at their default values). The relatively poor performance numbers of the linear regression model might be explained by the lack of tuning. However, the model that was used (from scikit-learn) does not seem to be very flexible in varying parameters as it does not seem to have a regularization parameter, for example, so the model easily overfits.

A closer look at the numbers for the incremental models can be found here.


Does the performance of an incremental model improve with each batch of data that it trains on? The following scatter plots verifies our finding that the performance of incremental learning on perfusion values improves with increasingly more examples trained on.


Figure 17: Learning curve of stochastic gradient descent.

The plots specifically shows the results from training stochastic gradient descent with a batch size of 1 on perfusion parameter RBF and patch radius 6. Shuffling of the rows of the data matrix is necessary in order to observe this improvement. A closer look at the numbers can be viewed here. The incremental performance of PA shows a similar trend with increasingly more batches trained on.

References

  1. "6. Strategies to scale computationally: bigger data," scikit-learn. [Accessed April 21, 2017].
  2. F. Scalzo et al., "Regional Prediction of Tissue Fate in Acute Ischemic Stroke," Annals of Biomedical Engineering, vol. 40. doi:10.1007/s10439-012-0591-7. [Accessed: April 11, 2017].
  3. N. Stier et al., "Deep Learning of Tissue Fate Features in Acute Ischemic Stroke," IEEE BIBM. doi:10.1109/BIBM.2015.7359869. [Accessed: April 11, 2017].
  4. R. McKinley et al., "Fully automated stroke tissue estimation using random forest classifiers (FASTER)," JCBFM. doi: 10.1177/0271678X16674221. [Accessed: April 11, 2017].