An Application Of Deep Learning Models To Reconstruct ECG Signals

--

* Reference for image: A. J. Huber, A. H. Leggett, and A. K. Miller, “Electrocardiogram: Blog chronicles value of old, but still vital cardiac test,” Scope, 19-Dec-2017. [Online]. Available: https://scopeblog.stanford.edu/2016/09/21/electrocardiogram-blog-chronicles-value-of-old-but-still-vital-cardiac-test/. [Accessed: 24-May-2022].

By: Yan Zhu, Chunsheng Zuo, Yunhao Qian, Guanghan Wang

Abstract

An electrocardiogram (ECG) measures the electrical activity of the heart. A complete ECG consists of 12 signals (leads), yet measuring all 12 leads can be time-consuming and can result in higher misdiagnosis rates. This motivates us to design a machine learning solution that reconstructs the full 12-lead ECG from only 2–3 input leads, improving data collection efficiency and reducing the chance of lead misplacement.

Our team examined 5 different models: linear regression, CNN, and LSTM, which have been applied in prior works, and U-Net and a transformer-based model, which are more advanced architectures that have not previously been used for ECG reconstruction. The results show that the transformer beats the other benchmarks and achieves the best overall performance in terms of RMSE, Pearson correlation coefficient, and the quality of the reconstructed ECG diagrams.

Background

What Is ECG?

Heartbeats are primarily controlled by a group of autorhythmic cells in the right atrium of the heart, namely the sinoatrial (SA) node. The electrical signals generated by the SA node spread through the entire heart and result in regular cycles of muscle contraction and relaxation. An ECG measures and records the electrical potential changes of these signals. A healthy heart has a normal rate (between 60–100 cycles per minute) and a consistent pattern that contains the P wave, the QRS complex, and the T wave. These waves correspond to the contraction and relaxation of the heart's atria and ventricles. Many cardiovascular diseases (CVDs) can be identified from ECGs; for instance, Figure 2 shows the ECGs of some typical CVDs.

Figure 1: Segments in an ECG signal (L. Sherwood, Human physiology: From cells to systems. Boston, MA, USA: Cengage Learning, 2016.)
Figure 2: Normal ECG vs. ECGs for CVDs (L. Sherwood, Human physiology: From cells to systems. Boston, MA, USA: Cengage Learning, 2016.)

ECGs are an essential first-line tool in the diagnosis of many different types of heart conditions. According to the Centers for Disease Control, more than 69 million ECGs are performed annually in the United States during doctor’s office and emergency department visits.

Taking an ECG test requires a health professional to attach 10 electrodes to the patient. These electrodes are used to generate what is called a 12-lead ECG. The exact placement of the limb electrodes does not have much impact on the signal; however, the locations of chest electrodes V1 to V6 are critical. Studies have shown that even trained health professionals can misplace the chest electrodes: one study found that only 10% of participants (doctors, nurses, and cardiac technicians) correctly applied all the electrodes [3]. Misplaced chest electrodes can change the expected signal and cause certain diagnoses to be missed. This motivates us to seek a technical solution that reduces the time and complexity of ECG collection and the risk of misdiagnosis.

Figure 3: 12-lead ECG electrode placement locations. The limb electrodes can be placed anywhere along the arms or legs without much effect on the recorded ECG signal.
Figure 4: The resulting 3 electrode placement locations. These 3 electrode locations are used to measure ECG leads I and II.

Research Goal Overview

This project evaluates how well machine learning can reduce the need for ten electrodes, and in particular for the error-prone chest electrodes, thereby speeding up the ECG measuring process and eliminating the errors associated with chest electrode placement. More specifically, we aim to design a deep learning model that uses signals from only three or two electrodes and synthesizes the rest of the missing leads. We will use Lead I and Lead II as shown in Figure 3 and use the signals collected at LA, RA, and LL as the input to the neural network.

Linearity of the First Six Leads

Figure 5: Linearity between the first 6 leads

The first 6 leads of the ECG (leads I, II, III, aVL, aVR, and aVF) are linearly related: given any 2 of these 6 leads, we can compute the remaining 4 signals with high accuracy. Thus, instead of training a machine learning model to reconstruct these leads, we simply compute the remaining limb leads from Lead II and Lead III. The machine learning models discussed in the following sections are therefore dedicated to reconstructing the last six leads of the ECG, i.e., V1–V6.
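As a concrete illustration, here is a minimal NumPy sketch of these linear relations, using Einthoven's law and the Goldberger formulas for the augmented leads; the function name and signal layout are our own choices, not taken from the original pipeline:

```python
import numpy as np

def derive_limb_leads(lead_ii: np.ndarray, lead_iii: np.ndarray) -> dict:
    """Compute the remaining limb leads from leads II and III.

    Inputs are 1-D voltage arrays of equal length; the relations follow
    Einthoven's law (I + III = II) and the Goldberger augmented-lead formulas.
    """
    lead_i = lead_ii - lead_iii        # Einthoven: I = II - III
    avr = -(lead_i + lead_ii) / 2      # aVR = -(I + II) / 2
    avl = lead_i - lead_ii / 2         # aVL = I - II / 2
    avf = lead_ii - lead_i / 2         # aVF = II - I / 2
    return {"I": lead_i, "aVR": avr, "aVL": avl, "aVF": avf}
```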

Dataset

Two ECG datasets have been used to train our ECG reconstructors: PTB-XL and CODE-15%. Both datasets can be easily downloaded from the Internet.

Preprocessing

The raw signals cannot be used directly as model inputs/references because they are noisy and, more importantly, because they suffer from baseline wander. If the baseline wander is not corrected, it becomes the major source of training loss. To alleviate the problem, both the input and reference ECG signals are passed through a bandpass Butterworth filter with cutoff frequencies of 0.5 and 60 Hz. Such a filter removes the baseline wander and high-frequency noise without losing too much information from the ECG signals.
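As an illustration, a zero-phase Butterworth bandpass filter can be applied with SciPy roughly as follows; the filter order and the 500 Hz sampling rate are assumptions for this sketch, not values taken from our pipeline:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_ecg(signal: np.ndarray, fs: float = 500.0, order: int = 4) -> np.ndarray:
    """Zero-phase Butterworth bandpass filter (0.5-60 Hz) along the time axis.

    `signal` has shape (num_leads, num_samples); `fs` is the sampling rate in Hz.
    """
    sos = butter(order, [0.5, 60.0], btype="bandpass", fs=fs, output="sos")
    # sosfiltfilt applies the filter forward and backward, so no phase shift
    # is introduced into the ECG waveform.
    return sosfiltfilt(sos, signal, axis=-1)
```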

In each dataset, the ECG signals are annotated with many labels. In particular, many are boolean labels indicating whether the corresponding patient has been diagnosed with certain diseases such as infarction, hypertrophy, and ischemia. Although these labels are not used by our reconstructor models, they will become important later, e.g., when we measure the accuracy of diagnostics based on the reconstructed signals. Therefore, when we split a dataset into training, validation, and test sets, we must take the fair distribution of labels into account. To ensure this, stratified sampling is applied to the joint distribution of all the labels when a dataset is split, so that the distribution is consistent across the splits for an arbitrary combination of labels.
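A minimal sketch of such a stratified split, assuming the boolean labels are stored as a 2-D array and using scikit-learn's train_test_split; the helper name and the handling of rare label combinations are our own simplifications:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_split(labels: np.ndarray, test_size: float = 0.2, seed: int = 0):
    """Split record indices so that each combination of boolean diagnosis
    labels appears in roughly the same proportion in both splits.

    `labels` is a (num_records, num_labels) boolean array.
    """
    # Encode the joint label combination of each record as a single string key.
    keys = np.array(["".join(row.astype(int).astype(str)) for row in labels])
    # train_test_split requires at least 2 members per stratum, so lump
    # combinations that occur only once into a catch-all key.
    uniq, counts = np.unique(keys, return_counts=True)
    rare = set(uniq[counts < 2])
    keys = np.array([k if k not in rare else "rare" for k in keys])
    idx_train, idx_test = train_test_split(
        np.arange(len(keys)), test_size=test_size, stratify=keys, random_state=seed
    )
    return idx_train, idx_test
```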

Another problem that we ran into when preparing the data is that the CODE-15% dataset, which was mainly used for classification tasks, has many low-quality signals with wide zero paddings and extreme values at the signal endings (e.g., the measured voltage can approach near infinity when the electrodes are taken off the patient's body). These corner cases are not fatal for classifiers, but they caused a lot of trouble when we trained the reconstructors, especially large MSE losses. It is hard to overcome these problems at runtime, so we decided to remove the bad examples when preparing the dataset, using the following criteria:

  • Because too many signals are zero-padded, and because we do not want to drop too many training examples, we clip all the ECG signals from 4096 down to 2934 samples (roughly the 95th percentile of the unpadded lengths) after removing the paddings.
  • To remove ECG signals that are too noisy for reconstruction, we pass the signals into the R-peak detector of the neurokit2 Python package. A signal is discarded if the number of detected R-peaks is smaller than 4.
  • To remove ECG signals that are flat (i.e., do not contain any ECG signal) or have extremely large values, we measure the voltage range of the clipped signals. A signal is discarded if its voltage range is smaller than 0.6 or larger than 9.0 (roughly the 5th and 95th percentiles, respectively).

For each raw ECG signal, a sliding window is run over the signal, and the first window that satisfies all three criteria is adopted. If no such window can be found, the ECG signal is discarded.
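Below is a minimal sketch of this filtering step for a single lead; the sampling rate, window stride, and function name are our own illustrative choices, while the window length and thresholds follow the criteria above:

```python
import numpy as np
import neurokit2 as nk

WINDOW = 2934                  # ~95th percentile of the unpadded signal lengths
MIN_R_PEAKS = 4
MIN_RANGE, MAX_RANGE = 0.6, 9.0

def first_valid_window(lead: np.ndarray, fs: float = 400.0, stride: int = 256):
    """Return the first length-2934 window of a single lead that passes all
    three quality criteria, or None if no window qualifies."""
    nonzero = np.flatnonzero(lead)
    if nonzero.size == 0:                        # completely flat signal
        return None
    lead = lead[nonzero[0]: nonzero[-1] + 1]     # strip the zero padding

    for start in range(0, max(1, len(lead) - WINDOW + 1), stride):
        window = lead[start: start + WINDOW]
        if len(window) < WINDOW:
            break
        vrange = window.max() - window.min()
        if vrange < MIN_RANGE or vrange > MAX_RANGE:
            continue                             # flat or extreme-valued window
        try:
            _, info = nk.ecg_peaks(window, sampling_rate=fs)
        except Exception:
            continue                             # detector failed on a noisy window
        if len(info["ECG_R_Peaks"]) >= MIN_R_PEAKS:
            return window
    return None
```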

Format Conversion

To abstract away the differences between individual ECG datasets, and to accelerate the loading of the large ECG data (~3 GB for PTB-XL and ~30 GB for CODE-15%), the datasets are converted into a common format. We tried the following three technologies:

  • TFRecord (loaded with the tensorflow_datasets library). This format provided fair performance at runtime. However, we eventually abandoned it because:
      • The multi-threaded data loading provided by tf.data.Dataset is not fast enough, while the multi-processing provided by torch.utils.data.DataLoader causes even more trouble (TensorFlow gets initialized on every worker process).
      • Elements have to be accessed sequentially, so it is really painful to access an element by index in linear time (which means waiting for minutes) when we want to debug our models on a specific ECG signal of interest.
  • NumPy arrays opened with mmap. This format allows accessing elements by index and provides great performance on our personal computers. However, it turned out to be very slow on network drives, in particular the Google Drive folders mounted on Colab machines. Note that it is impractical to always copy the data to the local storage of Colab because the local storage is insufficient for the larger datasets that we are going to use. Therefore, we did not use it in the end.
  • HDF5 (loaded with the h5py library). This is the format that we finally adopted. It allows accessing elements by indices, works happily with the multiprocessing of PyTorch data loaders, and gives good performance on both local devices and Google Colab.
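A minimal sketch of an HDF5-backed PyTorch dataset is shown below; the file layout (a single "ecg" array of shape (num_records, num_leads, num_samples)) is an assumption for illustration, not the exact schema we used:

```python
import h5py
import torch
from torch.utils.data import Dataset

class EcgH5Dataset(Dataset):
    """HDF5-backed ECG dataset with random access by index.

    The file handle is opened lazily in __getitem__ so that the dataset
    plays nicely with multi-process PyTorch DataLoaders.
    """

    def __init__(self, path: str, input_leads, output_leads):
        self.path = path
        self.input_leads = list(input_leads)
        self.output_leads = list(output_leads)
        self._file = None
        with h5py.File(path, "r") as f:
            self._length = f["ecg"].shape[0]

    def __len__(self):
        return self._length

    def __getitem__(self, idx):
        if self._file is None:              # open once per worker process
            self._file = h5py.File(self.path, "r")
        record = self._file["ecg"][idx]     # random access by index
        x = torch.from_numpy(record[self.input_leads]).float()
        y = torch.from_numpy(record[self.output_leads]).float()
        return x, y
```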

Evaluation Metrics

Loss Function

Up to now, only the mean square error (MSE) has been used as the loss function for optimizing the reconstructor models.

Metrics

Two metrics have been used to report the performance of the reconstructor models.

Figure 6: Root Mean Square Error
Figure 7: Pearson Correlation Coefficient
  1. Root mean square error (RMSE) measures the absolute difference between the reconstructed and reference signals; lower values indicate better reconstruction. This metric is simply the square root of the MSE, but we report RMSE instead of MSE to give readers more intuition about the magnitude of the differences.
  2. The Pearson correlation coefficient (PearsonR) measures the relative closeness between the reconstructed and reference signals; values closer to 1 indicate better reconstruction. Note that PearsonR has a clean statistical interpretation only when the output and reference values are (jointly) normally distributed. However, the voltage of ECG signals does not follow a normal distribution; the voltage distribution has a spike around the zero point, as shown in the figure below. Therefore, the computed PearsonR values should not be over-interpreted statistically; they are just a rough indicator of how closely the reconstructed signals track the references.
Figure 8: Voltage distribution of ECG signals, plotted as a histogram with 200 bins.
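For reference, the standard definitions of the two metrics (presumably what Figures 6 and 7 depict), where x is the reference signal, x-hat is the reconstruction, and N is the number of samples:

```latex
\mathrm{RMSE}(\hat{x}, x) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{x}_i - x_i \right)^2}

\mathrm{PearsonR}(\hat{x}, x) =
  \frac{\sum_{i=1}^{N} (\hat{x}_i - \bar{\hat{x}})(x_i - \bar{x})}
       {\sqrt{\sum_{i=1}^{N} (\hat{x}_i - \bar{\hat{x}})^2}\,
        \sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}}
```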

Models

All our models take in ECG signals with reduced lead sets and output reconstructed ECG signals.

1. Linear Regression

We first implemented a simple linear regression model that maps the reduced-lead ECG signals to the missing-lead ECG signals. The model is implemented as a single linear layer with no activation function: the number of input channels equals the number of input leads, and the number of output channels equals the number of output leads. The model learns a linear mapping between the reduced-lead signals and the missing-lead signals.
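A PyTorch sketch of such a model is shown below; a kernel-size-1 Conv1d is one way to realize a per-time-step linear layer over the lead channels, and the exact implementation details here are our own assumptions:

```python
import torch.nn as nn

class LinearRegressionECG(nn.Module):
    """Per-time-step linear map from input leads to missing leads."""

    def __init__(self, n_input_leads: int = 3, n_output_leads: int = 6):
        super().__init__()
        # A kernel-size-1 convolution mixes the lead channels independently
        # at every time step, i.e., a plain linear layer with no activation.
        self.proj = nn.Conv1d(n_input_leads, n_output_leads, kernel_size=1)

    def forward(self, x):          # x: (batch, n_input_leads, n_samples)
        return self.proj(x)        # -> (batch, n_output_leads, n_samples)
```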

We chose to implement the linear regression model first for several reasons. The model is simple and fast to train, and it is a suitable baseline for evaluating the performance of more complex models. It is also highly interpretable: the coefficients of the linear transformation can be inspected directly to determine the relationship between the input features and the output. This explainability also helped us identify and verify, to some extent, the linear relationship among the first 6 leads.

2. Convolutional Neural Network (CNN)

Next, we implemented a Convolutional Neural Network (CNN) consisting of a stack of 1D convolutional (Conv1d) layers, each followed by a LeakyReLU activation function. The number of Conv1d layers in the stack is determined by the parameter n_layers. Each successive layer has an increasing number of output channels, except that the number of output channels in the final layer equals the number of output leads.

Each Conv1d layer has a kernel size of 3 and uses 'same' padding so that the output length matches the input length, which ensures the reconstructed signals have the same length as the input signals. Each Conv1d layer extracts local features (such as the QRS complexes) from the reduced ECG lead signals, which are then combined and refined by subsequent layers into higher-level features that represent information from multiple leads and are used to reconstruct the missing channels.

The LeakyReLU activation function is used after each Conv1d layer to introduce non-linearity into the model, thus enabling the model to learn complex, non-linear relationships between the input ECG leads and the missing ECG leads.
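The following PyTorch sketch illustrates this architecture; the channel widths and the choice to leave the final layer without an activation are our own illustrative assumptions:

```python
import torch.nn as nn

class ConvReconstructor(nn.Module):
    """Stack of Conv1d + LeakyReLU layers with 'same' padding."""

    def __init__(self, n_input_leads=3, n_output_leads=6, n_layers=4, base_channels=32):
        super().__init__()
        layers, in_ch = [], n_input_leads
        for i in range(n_layers - 1):
            out_ch = base_channels * (2 ** i)        # increasing channel width
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size=3, padding="same"),
                       nn.LeakyReLU()]
            in_ch = out_ch
        # The last layer maps back to the number of missing leads.
        layers.append(nn.Conv1d(in_ch, n_output_leads, kernel_size=3, padding="same"))
        self.net = nn.Sequential(*layers)

    def forward(self, x):          # x: (batch, n_input_leads, n_samples)
        return self.net(x)         # -> (batch, n_output_leads, n_samples)
```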

3. Long Short-Term Memory (LSTM)

Another natural fit for processing temporal sequences is the Long Short-Term Memory (LSTM) model. It is a type of Recurrent Neural Network (RNN) capable of learning long-term dependencies and patterns in temporal data, which is useful for our task of reconstructing missing ECG leads.

The model has num_layers LSTM layers with n_hidden hidden units in each layer. The input_size of the LSTM layer is the number of input leads. The output of the LSTM is passed through a LeakyReLU activation function to introduce non-linearity. Finally, the output is passed through a linear layer to produce the reconstructed ECG signal.
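A minimal PyTorch sketch of this model, with illustrative default values for n_hidden and num_layers:

```python
import torch.nn as nn

class LSTMReconstructor(nn.Module):
    """Stacked LSTM followed by LeakyReLU and a linear projection."""

    def __init__(self, n_input_leads=3, n_output_leads=6, n_hidden=128, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_input_leads, hidden_size=n_hidden,
                            num_layers=num_layers, batch_first=True)
        self.act = nn.LeakyReLU()
        self.head = nn.Linear(n_hidden, n_output_leads)

    def forward(self, x):                   # x: (batch, n_input_leads, n_samples)
        x = x.transpose(1, 2)               # LSTM expects (batch, time, features)
        out, _ = self.lstm(x)               # (batch, time, n_hidden)
        out = self.head(self.act(out))      # (batch, time, n_output_leads)
        return out.transpose(1, 2)          # back to (batch, leads, time)
```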

4. U-Net

As our CNN showed good performance in reconstructing ECG leads, we wanted to try a more complex CNN-based model to see if it could further improve the performance. We chose to implement the U-Net model, a popular CNN architecture that excels at many tasks. The U-Net model consists of an encoder and a decoder. The encoder network downsamples the input while the decoder network upsamples it back and, in our case, reconstructs the missing leads in the process. The architecture is designed to capture both local and global context, which is useful for reconstructing ECG leads. Our implementation of U-Net uses 1D convolutional layers instead of the 2D convolutional layers traditionally used for image segmentation tasks.
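Below is a small 1D U-Net sketch in PyTorch, assuming the input length is divisible by 4 (e.g., by cropping or padding); the depth, channel widths, and down/upsampling choices are illustrative assumptions, and only the overall encoder/skip/decoder structure follows the description above:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two Conv1d + LeakyReLU layers with 'same' padding."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, 3, padding="same"), nn.LeakyReLU(),
        nn.Conv1d(out_ch, out_ch, 3, padding="same"), nn.LeakyReLU(),
    )

class UNet1d(nn.Module):
    """Two-level 1D U-Net with skip connections."""

    def __init__(self, n_input_leads=3, n_output_leads=6, ch=32):
        super().__init__()
        self.enc1 = conv_block(n_input_leads, ch)
        self.enc2 = conv_block(ch, ch * 2)
        self.pool = nn.MaxPool1d(2)
        self.bottleneck = conv_block(ch * 2, ch * 4)
        self.up2 = nn.ConvTranspose1d(ch * 4, ch * 2, kernel_size=2, stride=2)
        self.dec2 = conv_block(ch * 4, ch * 2)
        self.up1 = nn.ConvTranspose1d(ch * 2, ch, kernel_size=2, stride=2)
        self.dec1 = conv_block(ch * 2, ch)
        self.head = nn.Conv1d(ch, n_output_leads, kernel_size=1)

    def forward(self, x):          # x: (batch, leads, time), time divisible by 4
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)
```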

5. UnetFormer

While a U-Net consisting of encoder and decoder networks is good at capturing local and global context, not all time instances of the ECG signal are equally important. Therefore, as a next step after U-Net, we propose a Transformer-based model to further improve the performance of ECG reconstruction, which we name UnetFormer. UnetFormer is a sequence-to-sequence model based on the attention mechanism. Similar to U-Net, it also has an encoder and a decoder network.

In our implementation, a PositionalEncoding module encodes positional information into the input ECG sequence before it is fed into the Transformer's encoder. The encoder consists of a series of convolutional layers that transform the positionally encoded input sequence into a sequence of embeddings. Mirroring the encoder, the decoder uses a series of convolutional layers to transform the Transformer output back into a sequence of ECG signals, utilizing the information gathered by the encoder to reconstruct the missing leads.

UnetFormer is a solid choice for reconstructing ECG signals since it can effectively model the complex relationships between the different ECG leads. However, it cannot capture local context as efficiently as U-Net. Therefore, we propose a hybrid model that combines the strengths of both U-Net and the Transformer: the hybrid wraps a Transformer inside a U-Net. The U-Net encodes the input, passes the result to the Transformer, and then reconstructs the output leads from the Transformer's output. The U-Net is responsible for capturing local context, while the Transformer is responsible for capturing global dependencies.
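The sketch below illustrates the hybrid idea (convolutional encoder, Transformer, convolutional decoder); all widths, depths, head counts, and the sinusoidal positional encoding details are assumptions for illustration, not the exact UnetFormer implementation:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding added to (batch, time, dim) embeddings."""

    def __init__(self, dim, max_len=4096):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        pe = torch.zeros(max_len, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                       # x: (batch, time, dim)
        return x + self.pe[: x.size(1)]

class UnetFormerSketch(nn.Module):
    """Convolutional encoder -> Transformer encoder -> convolutional decoder."""

    def __init__(self, n_input_leads=3, n_output_leads=6, dim=128, n_heads=4, n_blocks=2):
        super().__init__()
        # Strided convolutions shrink the sequence and lift it to an embedding space.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_input_leads, dim // 2, 7, stride=2, padding=3), nn.LeakyReLU(),
            nn.Conv1d(dim // 2, dim, 7, stride=2, padding=3), nn.LeakyReLU(),
        )
        self.pos = PositionalEncoding(dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_blocks)
        # Transposed convolutions map the embeddings back to the missing leads.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(dim, dim // 2, 4, stride=2, padding=1), nn.LeakyReLU(),
            nn.ConvTranspose1d(dim // 2, n_output_leads, 4, stride=2, padding=1),
        )

    def forward(self, x):                       # x: (batch, n_input_leads, n_samples)
        z = self.encoder(x)                     # (batch, dim, ~n_samples / 4)
        z = self.pos(z.transpose(1, 2))         # (batch, time, dim) for the Transformer
        z = self.transformer(z)                 # global attention over the sequence
        return self.decoder(z.transpose(1, 2))  # (batch, n_output_leads, ~n_samples)
```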

Results

Quantitative analysis

Table 1: Model Performance Comparison

The results for the models trained and tested on the PTB-XL and CODE-15% datasets are summarized in Table 1. We also recorded the number of floating-point operations in units of 10⁹ (GFLOPs) as a reflection of each model's computational complexity (the lower the better). Overall, UnetFormer performs slightly better than U-Net in terms of PearsonR and RMSE. On PTB-XL, UnetFormer outperforms U-Net on RMSE by 0.28%, while U-Net is 0.03% higher in PearsonR. On CODE-15%, UnetFormer outperforms U-Net on both PearsonR and RMSE, by 1.26% and 0.007% respectively. Though UnetFormer does not lead by a large margin, it has 38.9% lower computational complexity than U-Net, demonstrating its efficiency in temporal signal understanding and making it the more favourable choice.

In addition, looking at the overall model performance on each dataset, all models achieve better results on PTB-XL than on CODE-15%. This reflects the noisiness and difficulty of each dataset, which agrees with our observations while preparing them. The largest performance drop occurs for LR, which shows a 39.4% decline in PearsonR and a 106.4% increase in RMSE. Though the other models also have a much worse RMSE, their PearsonR values remain at a similar level, implying that the ability to capture non-local information in the signal is indispensable for reconstructing noisy signals. As U-Net and UnetFormer are both competent at capturing the ECG waveform patterns of each output lead, which serve as a guide for reconstruction, they are able to obtain a more accurate mapping from the 3 input ECG leads to the last 6 leads.

Qualitative analysis

Figure 9: Reconstruction visualization for samples from PTB-XL and CODE-15%. The first figure on the left is a reconstruction of lead V4 for a sample from PTB-XL. The second figure from the left is a reconstruction of lead V6 for a sample from PTB-XL. The blue curves are the ground truth, while the red curves are the reconstructions.

Looking at the model reconstruction results for some of the samples, it is obvious that LR often produces a much smaller amplitude in the reconstructed signal. CNN matches both the amplitude and the waveform much better than LR, but we can still clearly see its gap with the rest of the models. For LSTM, U-Net, and UnetFormer, the differences are minimal most of the time. For the same sample, one of the 3 models may handle some parts of the signal better than the others, but may also have issues in other parts of the signal that the other models do not, making it hard to conclude which model provides the best reconstruction from visualization alone.

Conclusion

In this work, we applied machine learning techniques to reconstruct 12-lead ECG signals using only 3 input leads. We developed, trained, and evaluated five different machine learning models and compared their performance using RMSE and the Pearson correlation coefficient. Among the five candidates, the transformer-based model outperforms the rest and achieves the highest Pearson correlation and the lowest RMSE. An interesting future research direction would be to reduce the number of input leads further (for instance, using only 2 or even 1 lead to reconstruct the complete 12-lead ECG). This may require other machine learning models that we have not explored in this work, such as GANs.

--


University of Toronto Machine Intelligence Team

UTMIST’s Technical Writing Team publishes articles on topics within machine learning to our official publication: https://medium.com/demistify