Review

Application of artificial intelligence in the diagnosis of hepatocellular carcinoma

Abstract

Hepatocellular carcinoma (HCC) is a major cause of cancer-related deaths worldwide. This review explores the recent progress in the application of artificial intelligence (AI) in the radiological diagnosis of HCC. The Barcelona Clinic Liver Cancer criteria guide treatment decisions based on tumour characteristics and liver function indicators, but HCC often remains undetected until intermediate or advanced stages, limiting treatment options and patient outcomes. Timely and accurate diagnostic methods are crucial for enabling curative therapies and improving patient outcomes. AI, particularly deep learning and neural network models, has shown promise in the radiological detection of HCC. AI offers several advantages in HCC diagnosis, including reducing diagnostic variability, optimising data analysis and reallocating healthcare resources. By providing objective and consistent analysis of imaging data, AI can overcome the limitations of human interpretation and enhance the accuracy of HCC diagnosis. Furthermore, AI systems can assist healthcare professionals in managing the increasing workload by serving as a reliable diagnostic tool. Integration of AI with information systems enables comprehensive analysis of patient data, facilitating more informed and reliable diagnoses. The advancements in AI-based radiological diagnosis hold significant potential to improve early detection, treatment selection and patient outcomes in HCC. Further research and clinical implementation of AI models in routine practice are necessary to harness the full potential of this technology in HCC management.

Introduction

Hepatocellular carcinoma (HCC) is one of the leading cancers globally.1 2 A census conducted by the Global Cancer Observatory in 2020 showed rising rates of HCC globally, particularly within North African and East Asian populations.1 HCC is notable for its potential for curative therapy through surgical resection, radiofrequency ablation and liver transplantation in patients with early-stage disease.3–7 Currently, the most widely accepted classification of HCC is the Barcelona Clinic Liver Cancer (BCLC) criteria, which characterise HCCs based on the number and size of nodules, supported by secondary parameters such as liver function indicators.8 The BCLC guidelines also guide physicians' treatment choices, and curative treatments are generally reserved for BCLC 0 and BCLC A patients.8 Unfortunately, symptomatic presentation of HCC generally occurs only in intermediate or advanced stage disease (BCLC B and C, respectively), where treatment options and efficacy are limited.9–11 As such, timely and accurate diagnostic methods are crucial, as they open the possibility of curative therapy and are likely to significantly improve patient outcomes.12–14 Current first-line surveillance methods include laboratory parameters such as alpha-fetoprotein (AFP) and imaging techniques such as liver ultrasound, which are preferred due to their non-invasive nature.15–17 Recent advances in artificial intelligence (AI) have opened new applications in conjunction with existing diagnostic tools. Improvements in deep learning (DL) and neural network models have yielded outcomes comparable with, or in some cases superior to, those of physicians in the diagnosis of disease.18–20 Within HCC, AI models have been incorporated in the radiological and histological detection of the disease.
This paper will discuss the recent advances of AI in the radiological diagnosis of HCC in the hope of providing a clearer landscape of the present and future of its application.

Terminology

The field of AI has constantly evolved since its inception, with new concepts and applications arising every few years. As such, an adequate understanding of frequently used terminology is indispensable in discussions on the topic. This section provides a brief introduction to commonly used nomenclature within the field of AI. First, AI refers to any science or engineering involved in making intelligent machines, especially intelligent computer programmes, for problem solving.21 AI covers a broad range of concepts that aim to allow computers to understand human intelligence but is not confined by what is biologically observable. AI models may be generally separated into two concepts: narrow AI, which is trained to support a specific task, and general AI, which would possess the self-awareness to identify, learn and solve problems with minimal intervention22 (figure 1). Currently, general AI models are purely theoretical and no true general AI models have been created.23 As such, this paper will focus on narrow AI, which has seen clinical implementation. DL and machine learning (ML) are subfields of AI and refer to systems that seek to identify meaningful relationships and patterns from observed data.24 25 Technically, DL models differ from other ML models in their method of pattern recognition. ML systems refer to any system that extracts information from one or more supplied data sets and applies it to achieve a specified outcome.
DL models are a subset of ML which use multiple layers of non-linear processing units for feature extraction and classification, creating a hierarchical system capable of incorporating large amounts of data for analysis.26 Additionally, classical ML models typically rely on supervised learning with features that human operators manually delineate as pertinent for classifying the image, while DL models can be trained using both labelled and unlabelled data sets.27 Lastly, artificial neural networks (ANNs) are an application of ML which aim to imitate human pattern recognition behaviour by modelling the way biological neurons signal one another.28 29 This involves three general layers of nodes: an input layer, one or more hidden layers and an output layer. Each node receives and weighs the data provided to it to produce a quantifiable signal. If the produced signal exceeds a specified threshold, the node is ‘lit’ and activates downstream layers, mimicking the way neurons fire in response to sufficient stimuli. Through that, ANNs can perform complex analytics and adapt to unique tasks based on the activation of different layers. Convolutional neural networks (CNNs) are an advancement of ANNs, consisting of multiple layers of ANNs which surpass the classical method in performance, pattern recognition and object recognition.30 31 It is important to note that while these terminologies describe unique concepts within the field of AI, significant overlaps exist between them. For example, while DL is considered a subset of ML, its capacity for unsupervised learning can be applied to ANNs and thus it could also be considered a subset of ANNs. As such, accurate nomenclature is crucial when discussing the topic.
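The threshold behaviour described above can be illustrated with a minimal sketch. It assumes a tiny network with two inputs, two hidden nodes and one output node, a hard step activation, and hand-picked weights; all of these values and the function names are hypothetical and chosen only for illustration. Models used in practice learn their weights from data and generally use smoother activation functions.

```python
def step(signal, threshold=0.0):
    """Return 1.0 when the weighted signal exceeds the threshold (the node is 'lit'), else 0.0."""
    return 1.0 if signal > threshold else 0.0


def layer_forward(inputs, weights, biases, activation):
    """Compute one layer: each node weighs its inputs, adds a bias and applies the activation."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        signal = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(activation(signal))
    return outputs


# Hypothetical, hand-picked parameters (not taken from any published model).
HIDDEN_WEIGHTS = [[0.5, -0.2], [0.3, 0.8]]
HIDDEN_BIASES = [0.0, -0.5]
OUTPUT_WEIGHTS = [[1.0, 1.0]]
OUTPUT_BIASES = [-1.5]


def predict(features):
    """Pass features through the hidden layer, then the output layer."""
    hidden = layer_forward(features, HIDDEN_WEIGHTS, HIDDEN_BIASES, step)
    return layer_forward(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES, step)[0]
```

With these illustrative weights, the output node is lit only when both hidden nodes are lit, which shows how downstream activation depends on the pattern of upstream signals.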

Figure 1

Overview of nomenclature associated with artificial intelligence. Definitions are based on the International Business Machines (IBM) corporation descriptions. The different components of a neural network have been colour-coded; red circles represent the input layer, yellow circles represent the hidden layers and green circles represent the output layer.

Role of AI in diagnosis of HCC

The advantages of AI in diagnosing HCC are widespread but will be summarised here in three roles: the reduction of diagnostic variability, the reallocation of healthcare resources and the optimisation of data analysis. First, reduction of diagnostic variability. The diagnosis of HCC relies on an interplay of radiological, histological and cytological parameters.32 33 Of the three, radiological identification of HCC remains the preferred method for its non-invasive nature but is the most prone to variability due to factors such as experience of the radiologist and interpreting physician, patient factors and workflow variability.34 35 In a study by Covert et al evaluating the effect of interoperator variability on MRI-based manual segmentation of 140 HCC lesions, interoperator variability was found to be relatively insignificant (reliability coefficient for response evaluation criteria in solid tumours (RECIST) diameter=0.966) but was estimated to affect the perception of tumour-control probability by up to 25%.36 Theoretically, this could indicate a difference in treatment regimen and outcome between clinicians. As such, the benefit of AI is twofold; its empirical analysis enables objective interpretation of an image while its non-biological nature helps ensure consistent analysis regardless of time or patient load. The second role of AI in imaging would be the reallocation of healthcare resources. 
The non-biological nature of AI ensures consistent, empirical analysis of patient data while consistent research has shown comparable, if not superior, diagnostic capabilities of AI to physicians in a variety of diseases.37 This holds true in the diagnosis of HCC where, in a preliminary study evaluating the use of a CNN for the interpretation of 228 ultrasound videos, the overall detection rate of the AI (89.8%; 95% CI 84.5% to 95.0%) was significantly higher compared with non-radiologist physicians (29.1%; 95% CI 21.2% to 37.0%, p<0.001) and radiologists (70.9%; 95% CI 63.0% to 78.8%, p<0.001).38 As such, the implementation of AI systems could help ease the workload of physicians by incorporating a constantly available tool to assist the diagnosis of HCC. Lastly, the optimisation of data analysis. The integration of information systems in healthcare has provided clinicians with an assortment of patient data at their disposal, enabling more reliable and informed diagnoses. However, this increase in volume also necessitates optimal integration to ensure maximum benefit. An example within HCC is a model developed by Kim et al which integrated patient history (such as hepatitis B/C and cirrhosis), patient characteristics (such as age and sex), antiviral treatments, imaging data and laboratory parameters to generate a risk score for developing HCC in patients with chronic hepatitis B.39

Principles of radiomics

Radiomics is an emerging field involving the extraction of quantitative features from radiological images and their subsequent analysis using AI models.40 The process of training a radiomics model has been extensively described in the literature41–45 and can be summarised in five steps: image acquisition and preprocessing, segmentation, feature extraction, model training and model validation.44 Image acquisition involves selecting imaging data and identifying potential sources of variability when training the AI model. Crucially, developing a protocol for image acquisition requires a balance of standardisation (to reduce noise and confounding) and variability (to ensure generalisability of the model in a clinical setting).46 Imaging data are then segmented into regions of interest delineating the tumour and its surrounding areas before being fed into the selected AI model for feature extraction. The models are then refined for accuracy, complexity and efficiency before being validated on a validation data set (usually derived from a subset of the original data). Lastly, the model is tested on a ‘blind’ test data set to determine its performance. Current radiomics models in HCC have been developed for ultrasound, CT and MRI. The following sections will explore some of the notable developments of AI models for each imaging technique.
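The separation of data into training, validation and blind test sets can be sketched as follows. This is a minimal illustration, not a protocol from any of the cited studies: the function name, the 15% fractions and the fixed seed are assumptions. The split is done on case identifiers so that all images from one case fall into a single partition.

```python
import random


def split_dataset(case_ids, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle case identifiers and split them into train, validation and test lists.

    The fractions and seed are illustrative defaults, not recommended values.
    """
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_test = round(len(ids) * test_fraction)
    n_val = round(len(ids) * val_fraction)
    test = ids[:n_test]                   # held back until final evaluation
    val = ids[n_test:n_test + n_val]      # used while refining the model
    train = ids[n_test + n_val:]          # used to fit the model
    return train, val, test


train_ids, val_ids, test_ids = split_dataset(range(100))
```

Splitting by case rather than by image is one way to reduce the chance that near-identical images of the same patient appear in both the training and the test partitions.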

Application in ultrasound

Liver ultrasonography has been a standard radiological modality for HCC surveillance due to its non-invasiveness and the absence of radiation or contrast exposure.47 However, the utility of ultrasound in detecting HCC is being increasingly debated due to its suboptimal performance, particularly in early HCC, and its operator-dependent variability. Ultrasound detection of HCC remains relatively reliable, with a meta-analysis of 32 studies (comprising 13 367 patients of varying HCC severity) reporting an overall sensitivity of 84% (95% CI 76% to 92%). However, the same study estimated the sensitivity for detecting early HCC at only 47% (95% CI 33% to 61%).48 This disparity is likely due to most patients with HCC presenting with confounding comorbidities such as cirrhosis, which produce a coarse pattern on ultrasound and preclude identification of small HCC nodules.49 Additional factors such as obesity, fatty-liver disease and fibrosis may similarly impair the quality of liver ultrasound, thereby obscuring smaller HCC nodules in early HCC.50 While newer applications (for example, contrast ultrasound) and diagnostic guidelines have helped limit the diagnostic variability of ultrasound in HCC, recent interest has focused on the implementation of AI models in interpreting ultrasound images. A CNN model developed by Tiyarattanachai et al highlighted the diagnostic potential of AI systems in liver ultrasound.38 There, a pretrained CNN model (RetinaNet51) was fed 25 557 images of various common ultrasound findings (eg, HCC, liver cysts and haemangiomas), followed by refinement using 228 ultrasound videos with difficult frames, to create a model specialising in the differentiation of ultrasound findings. The model was then tested against 175 videos containing 127 lesions and its performance was compared with that of physicians (both non-radiologists and radiologists) to assess its potential.
The AI system achieved an overall detection rate of 89.8% (114/127 lesions; 95% CI 84.5% to 95.0%) and a 100% detection rate for HCC (23/23 lesions; 95% CI 85.2% to 100%). Comparatively, overall detection rates for non-radiologist physicians and radiologists were 29.1% and 70.9%, respectively (both p<0.001), while detection rates for HCC were 39.1% (p<0.001) and 69.6% (p=0.016), respectively. This study demonstrated the feasibility of AI systems in detecting HCC through ultrasound, with detection rates significantly higher than those of both non-radiologist and radiologist clinicians. Current clinical guidelines suggest early surveillance for HCC using ultrasound and serum AFP.52 However, AFP remains negative in nearly two-thirds of patients at any stage of HCC53 and screening outcomes using ultrasound alone remain suboptimal.48 A DL model detailed by Zhang et al highlighted the potential of AI as a screening tool for AFP-negative HCCs.54 The CNN model (Xception55) was trained using a total of 305 B-mode ultrasound images of HCC and focal nodular hyperplasia, and model testing was done using 102 B-mode ultrasound images. HCC staging, lesion size, echogenicity and liver function were heterogeneous in both training and testing data sets. Diagnostic performance was then compared with four other available CNN models (MobileNet,56 Resnet50,57 Densenet121,58 InceptionV3).59 In total, the model displayed promising diagnostic capabilities with an overall area under the curve (AUC) of 93.68% (95% CI 88.6% to 98.8%), sensitivity of 96.08%, specificity of 76.92% and accuracy of 86.41%, outperforming the other CNN models and offering a promising non-invasive tool for HCC surveillance. Table 1 summarises the key studies evaluating the use of AI in the interpretation of liver ultrasound.
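As a worked illustration of how the intervals quoted above relate to the lesion counts, the sketch below computes a 95% confidence interval for a detection rate in two ways. The source does not state which interval method the study used, so both choices here are assumptions: a normal-approximation (Wald) interval for 114/127, and the exact (Clopper-Pearson) lower bound for the special case in which every lesion was detected (23/23). The function names are hypothetical.

```python
import math


def wald_interval(successes, total, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def exact_lower_bound_all_detected(total, alpha=0.05):
    """Clopper-Pearson lower bound when all `total` cases were detected.

    In this special case the bound reduces to (alpha / 2) ** (1 / total).
    """
    return (alpha / 2) ** (1 / total)


overall_low, overall_high = wald_interval(114, 127)
hcc_low = exact_lower_bound_all_detected(23)
print(f"Overall: {114 / 127:.1%} ({overall_low:.1%} to {overall_high:.1%})")
print(f"HCC lower bound: {hcc_low:.1%}")
```

Under these assumptions the computed bounds agree with the reported intervals to one decimal place, although this does not confirm that the original authors used these methods.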

Table 1
Overview of studies involving AI in HCC ultrasound as of January 2023

Application in CT

CT is a primary surveillance method for HCC, exhibiting utility in particular for detecting dysplastic lesions and early stage HCC.60 61 Current guidelines recommend a multiphasic CT with extracellular contrast agents as a first-line option for the diagnosis and staging of HCC.62 A meta-analysis by Lee et al evaluating the diagnostic capabilities of CT imaging for detecting HCC reported an overall sensitivity of 72% (95% CI 62% to 80%).63 Additionally, the study reported a sensitivity of 74% and specificity of 81% when CT imaging was used as the initial diagnostic tool for focal liver lesions detected during surveillance. These results highlight that, while acceptable, CT leaves significant room for optimisation as a diagnostic tool for HCC, and the incorporation of AI has shown promise in that regard. A DL model outlined by Wang et al reported promising performance in identifying patients with HCC from CT data, with an AUC of 0.887 (95% CI 0.855 to 0.919) for an internal data set and 0.883 (95% CI 0.855 to 0.911) for an external data set.64 In the paper, 647 individuals with HCC at various stages of disease and 6865 non-HCC individuals were analysed using plain and contrast-enhanced CT. Obtained images were then fed into two CNN models (NoduleNet and HCCNet) for training, and performance testing was done on an internal and an external data set. A subset of the test data set was also reviewed by three radiologists to obtain a comparison of performance. For the internal data set, the AI model achieved a sensitivity of 78.4% (95% CI 72.4% to 83.7%), specificity of 84.4% (95% CI 78.0% to 89.6%) and an overall accuracy of 81.0% (95% CI 76.8% to 84.8%). For the external data set, the model achieved a sensitivity of 89.4% (95% CI 85.0% to 92.8%), specificity of 74.0% (95% CI 68.5% to 78.9%) and an accuracy of 81.3% (95% CI 77.8% to 84.5%).
Comparatively, the AI model performed similarly to the radiologists in terms of accuracy (0.853 vs 0.818, p = 0.107), sensitivity (0.815 vs 0.753, p = 0.064) and specificity (0.902 vs 0.903, p = 0.981) for the internal data set, with similar findings for the external data set. However, the true utility of the model was displayed when it was used to assist the radiologists, as seen by a significant improvement in their diagnostic accuracy for the internal data set (0.873 vs 0.818, p = 0.026) and external data set (0.854 vs 0.793, p = 0.017). This study indicated the potential for AI systems to standardise the diagnosis and stratification of HCC when using CT imaging. Table 2 summarises the key studies evaluating the use of AI in the interpretation of liver CTs.
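The AUC values quoted for the CT model can be read as the probability that a randomly chosen HCC case receives a higher model score than a randomly chosen non-HCC case. The sketch below computes the AUC directly from that definition. The scores are made up for illustration and do not come from any cited study, and the pairwise loop is a simple choice that would be slow on large data sets.

```python
def auc(scores_positive, scores_negative):
    """Area under the ROC curve from its rank interpretation.

    Counts the fraction of (positive, negative) pairs in which the positive
    case has the higher score; tied pairs count as half.
    """
    wins = 0.0
    for pos in scores_positive:
        for neg in scores_negative:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Hypothetical model scores for three HCC and three non-HCC cases.
example_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

In this toy example eight of the nine pairs are ranked correctly, so the AUC is 8/9.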

Table 2
Overview of studies on the application of AI in HCC CTs as of January 2023

Application in MRI

Liver MRI is another first-line imaging tool for the diagnosis and classification of HCC.62 Advantages of MRI include its high-quality imaging of the entire liver, contrast enhancement that enables detection and classification of HCC, and a lack of ionising radiation that allows safer routine use.65–67 MRI has also shown utility in detecting small HCC nodules more consistently than CT. In a meta-analysis of 34 articles, MRI had an overall sensitivity of 88% (95% CI 83% to 92%) and specificity of 94% (95% CI 85% to 98%). However, MRI also had a significantly lower sensitivity for small lesions, with a sensitivity of 62% for lesions <2 cm and 95% for lesions >2 cm (p<0.02). These results highlight that, while MRI generally has higher detection rates, further optimisation may be possible, in particular for detecting early HCC. Current AI models have shown promise in differentiating common liver lesions such as HCC, haemangioma and metastatic tumours through feature extraction from MRI. In a study by Oyama et al, non-contrast-enhanced fat-suppressed 3D T1-weighted gradient echo MRIs of 150 hepatic tumours (50 HCCs, 50 haemangiomas and 50 metastatic tumours) were fed into an in-house ML model. There, the images were ranked based on their likely identity (ie, HCC, haemangioma or metastatic tumour). Overall, the ML model was able to differentiate HCCs from metastatic tumours with an accuracy of 92% (92/100 lesions), sensitivity of 100% (50/50 lesions), specificity of 84% (42/50 lesions) and an AUC of 95%. Similarly, the model differentiated HCCs from haemangiomas with an accuracy of 90% (90/100 lesions), sensitivity of 96% (48/50 lesions), specificity of 84% (42/50 lesions) and an AUC of 95%.
Despite the limited analysis, the results support the potential of AI systems in the interpretation of MRIs.68 Another study by Oestmann et al evaluating the use of CNNs in identifying HCCs on contrast-enhanced MRIs identified key factors affecting the diagnostic accuracy of ML models.69 In the study, 118 patients (73 HCC and 45 non-HCC) were evaluated using contrast-enhanced MRIs and the images were analysed by a CNN. Atypical lesions, defined as lesions not displaying typical MRI appearances based on the Liver Imaging Reporting and Data System criteria (ie, arterial hyperenhancement, washout and enhancing rim/pseudocapsule),70 were included to determine the correlation between lesion grading and CNN performance. The CNN demonstrated an overall accuracy of 87.3% and sensitivities/specificities for HCC and non-HCC lesions of 92.7%/82.0% and 82.0%/92.7%, respectively. CNN performance was also directly correlated with the lesion grading system, becoming less accurate the more atypical imaging features the lesion showed. This study highlighted a potential pitfall of ML systems, as the model's strict algorithm lacked the flexibility to consistently identify atypical lesions. Table 3 summarises the key studies evaluating the use of AI in the interpretation of liver MRIs.
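The accuracy, sensitivity and specificity figures quoted for the study by Oyama et al follow directly from the lesion counts given in the text. The sketch below recomputes them, treating HCC as the positive class; the function name is hypothetical, and the false-negative and false-positive counts are derived by subtraction from the quoted fractions (for example, 50 minus 48 missed HCCs) rather than stated in the source.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),              # detected positives / all positives
        "specificity": tn / (tn + fp),              # correct negatives / all negatives
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Counts derived from the quoted fractions; HCC is the positive class.
hcc_vs_metastasis = diagnostic_metrics(tp=50, fn=0, tn=42, fp=8)
hcc_vs_haemangioma = diagnostic_metrics(tp=48, fn=2, tn=42, fp=8)
```

Because sensitivity and specificity are computed on different subsets of lesions, a model can score highly on one while lagging on the other, which is why the studies in this section report them separately from accuracy.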

Table 3
Summary of studies on the application of AI in HCC MRIs as of January 2023

Other applications of AI in HCC diagnosis

Apart from directly aiding in the interpretation of radiological images to diagnose HCC, several other applications have emerged to assist in the surveillance and classification of HCC. One such example is the development of integrated ML models for risk-score prediction.39 A recent model developed by Kim et al aimed to predict the risk of HCC in Korean and Caucasian patients with hepatitis B virus (HBV) infection. The ML model incorporated ultrasound images and 10 baseline parameters to assess the risk of HCC. Listed in order of importance, these parameters were the presence of cirrhosis, age, platelet count, antiviral agent used, sex, serum alanine aminotransferase levels, baseline serum levels of HBV DNA, serum levels of albumin and bilirubin, and hepatitis B e-antigen (HBeAg) status. The results were prospectively validated against a Korean (n=5817) and a Caucasian (n=1640) cohort and assessed with a concordance index (c-index).71 The model achieved a c-index of 0.79 (95% CI 0.78 to 0.80) in the Korean cohort and 0.81 (95% CI 0.79 to 0.83) in the Caucasian cohort, highlighting the potential of AI models for integrating patient parameters and for the risk stratification of HCC.
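To make the c-index concrete, the sketch below computes it for a small set of made-up patients. It assumes Harrell's formulation, which is a common definition but is not confirmed by the source as the one used in the study; the function name and the example data are hypothetical. A pair of patients is comparable when the one with the shorter follow-up time actually developed HCC, and the pair is concordant when that patient also had the higher predicted risk.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's c-index for right-censored data (an assumed formulation).

    times: follow-up time per patient
    events: 1 if HCC was observed at that time, 0 if the patient was censored
    risk_scores: model output, higher meaning higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Patient i must have had the event before patient j's follow-up ended.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical data: four patients, the third is censored.
example_c = concordance_index(
    times=[2, 4, 6, 8],
    events=[1, 1, 0, 1],
    risk_scores=[0.9, 0.6, 0.7, 0.1],
)
```

A c-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported values of 0.79 and 0.81 indicate that the model ordered most comparable patient pairs correctly.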

Current challenges and future directions

The recent rise of AI models into mainstream popularity has placed increased focus on their applications in healthcare. However, despite constant improvements in hardware and software capabilities, several challenges remain unaddressed. The first is the standardisation of study design. Due to the relative infancy of ML and CNN models in healthcare, limited regulations exist governing their application. This holds true for their use in HCC diagnostics, where there are currently scarce data on their efficacy in actual patient settings. At the time of writing, there is no consensus on the feasibility of AI in large-scale HCC radiomics, nor are there standardised approaches on how results should be reported or interpreted. A recent meta-analysis on the quality assessment standards in systematic reviews of AI accuracy has also shown studies to be affected by bias.72 These findings included 243 out of 423 (57.5%) studies reporting a high or unclear risk of patient selection bias, 110 studies (26.0%) reporting a risk of bias in index test selection, 121 studies (28.6%) reporting a risk of bias in the reference standard and 157 studies (37.1%) reporting a risk of bias in study flow. The study highlights the need for AI-specific quality assessment tools and guidelines to facilitate the safe integration of AI tools into clinical practice. Several guidelines have been proposed, such as the Standard Protocol Items: Recommendations for Interventional Trials–Artificial Intelligence (SPIRIT-AI) and Consolidated Standards of Reporting Trials–Artificial Intelligence (CONSORT-AI) guidelines, which propose a framework for designing and reporting AI-related clinical trials to ensure robustness and accuracy of results.73 Additionally, while several studies exist evaluating the performance of AI systems, the vast majority of these have been retrospective in nature and prospective evidence in the field remains limited.

Another challenge is the quality assessment of data sets. Proper training and validation of ML models are crucial to ensure accuracy which necessitates quality data sets.74 Currently, most published studies on the applications of AI in HCC diagnostics have focused on improving the quality of their models through additions and optimisation on the model itself (such as optimising feature selection and increasing CNN layers). However, limited studies have been published focusing on the impact of data set quality on ML training and the ideal features of a good data set.75 76 Differences in diagnostic equipment and workflows have also made it difficult to accurately identify an optimal imaging workflow for training AI models. A systematic review by Fatania et al provides insight into how the optimal radiological features could be evaluated.77 In the paper, the intensity standardisation techniques of 12 studies were classified and evaluated to assess reliability based on the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) quality-assessment criteria.78 Despite the small sample size (n=12) and focus on gliomas, the paper by Fatania et al serves as a helpful framework for developing future evaluative studies for AI-related image acquisition methods in HCC. Another interesting approach has been the use of a supplementary AI model to automatically detect and exclude poor-quality data in a training set.79 The proposed model, based on a novel technique of untrainable data cleansing, identifies and removes a subset of the data that AI models are unable to correctly label (classify) during the AI training process. The feasibility of their model was evaluated using a publicly available data set of 624 paediatric X-rays, of which 200 images were flagged as ‘noisy’ by the system. From there, a trained radiologist was given a subset of the data (100 correct images and 100 ‘noisy’ images based on the AI’s classification) for interpretation. 
Results showed that the reader consensus between the radiologist’s label and the original label was significantly higher for the correct images (Cohen’s κ=0.65) compared with the noisy images (Cohen’s κ=0.05). These results indicate a high discordance in interpretation of the ‘noisy’ images that would make them suboptimal for a training data set. However, the feasibility of such a model remains unevaluated in HCC radiomics.
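Cohen's κ, used above to compare the radiologist's labels with the original labels, corrects the raw agreement rate for the agreement expected by chance. The sketch below shows the standard calculation on a small made-up example; the labels are hypothetical and unrelated to the cited data set.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical labels: the raters agree on three of four images.
example_kappa = cohens_kappa(
    ["abnormal", "abnormal", "normal", "normal"],
    ["abnormal", "normal", "normal", "normal"],
)
```

Here the raw agreement is 0.75 but the chance agreement is 0.5, giving κ=0.5. Values near 0, such as the 0.05 reported for the noisy images, indicate agreement little better than chance.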

Lastly, the availability of large data sets remains scarce. In addition to the quality of data, sufficient images are also required to optimise ML performance. Most published studies have relied solely on internal data sets for the training and validation of their models. Although the use of internal validation is acceptable, the development of new and existing public databases for external validation will be central to the future of the ML landscape. The benefit of a large, public data set is multifold. First, it improves the accuracy of feature extraction, improving the diagnostic performance of the AI model.80 81 Fostering a culture of cooperation would assist in the creation of databases of sufficient size and detail to adequately train new AI models.82 83 Second, it provides the opportunity for smaller hospitals, which may lack the experience or resources to create custom ML systems, to incorporate AI into their diagnostic frameworks. To that end, several sources have already made their data sets publicly available,81 84 85 marking a good first step towards an integrative data network. However, key concerns such as the cross-compatibility of data sets between hospital information systems, the safety and security of patient data, and the economic cost of developing the system must be addressed.

Conclusion

The recent surge of AI into the mainstream has raised discussion about its use in healthcare. To that end, many hospitals have begun evaluating its feasibility. In this paper, we discussed the current landscape of AI application in the radiological diagnosis of HCC and its future directions. Current research has been optimistic, with AI models frequently reported to match or outperform clinicians in the interpretation of ultrasound, CT and MRI. However, rather than replace the role of clinicians, AI systems should be seen as tools bolstering the confidence of clinicians (in particular inexperienced or non-radiology trained clinicians) in their diagnosis of HCC. Despite the positive findings, it is prudent to acknowledge that the implementation of AI into healthcare will come with new challenges such as financial feasibility, standardisation of use and security of data systems. All in all, the future of AI systems in diagnosing HCC looks bright, but large-scale prospective validation is still needed.