Introduction
For intensity-modulated radiation therapy (IMRT) or volumetric modulated arc therapy (VMAT), once the beam geometry is established and the treatment energy is chosen, the dose distribution is inversely optimised by adjusting the radiation field shaping devices, such as the multi-leaf collimator. While inverse optimisation streamlines the creation of advanced treatment plans, its trial-and-error nature can result in sub-optimal and inconsistent treatment plan quality. In some situations, the optimisation process produces plans with a high level of complexity. This complexity can approach the limit of the accuracy of the dose calculation model, the precision of the treatment delivery device, or both [1].
Patient-specific quality assurance (PSQA) is an essential clinical step to ensure that treatment plans can be delivered as intended and to verify the dose computation of the treatment planning system (TPS). PSQA protocols employ a physical measurement device whose reading is compared with the TPS-calculated dose. The gamma index, which combines the criteria of percentage dose difference (DD) and distance-to-agreement (DTA), is the most common method of evaluating the concordance of the measured and calculated dose [2]. The prevalent way of scoring PSQA is the gamma passing rate (GPR), i.e. the percentage of measurement points that meet the specified gamma index criterion. The American Association of Physicists in Medicine (AAPM) TG 218 report recommended a GPR of 95% as the tolerance limit under a 3%|2 mm gamma criterion evaluated globally [3]. The AAPM recommendations concern conventional fractionation schemes. For stereotactic or radiosurgery treatment, where the tumour sizes are significantly smaller and higher fraction doses are delivered, there are no clear recommendations for the gamma criterion [4, 5]. The AAPM report only suggests tightening the gamma criterion when verifying this kind of treatment [3]. For this reason, our institute uses a 2%|2 mm criterion evaluated in the local mode.
Advantages of VMAT relative to traditional IMRT include significantly faster and more efficient treatment delivery, though these advantages come at the expense of additional plan complexity [6]. Shortening the delivery time of each VMAT session frees time on the accelerator during the day and thus increases the number of patients treated daily. However, more patients also mean more PSQA verifications which, in their classic form, require access to the accelerator to gather the measurement data that are compared with the planned data. Freeing the accelerator time needed for PSQA measurements justifies the search for software-based QA protocols that could replace the traditional PSQA procedures. Such studies focus on identifying complexity metrics of the treatment and on constructing artificial intelligence or machine learning models that use planning and complexity data to forecast potential PSQA failures [7, 8].
While complexity metrics add to the understanding of the complexity of treatment plans, the current perception is that PSQA scores cannot be predicted from a single complexity metric [9]. Earlier studies focused on models incorporating planning and complexity metrics and demonstrated the viability of employing machine learning algorithms to predict PSQA outcomes [10-13]. However, each machine learning model depends on the characteristics and quality of the available data, and each PSQA prediction involves a combination of technologies, the choice of machine learning model, and the clinical protocols used for optimising VMAT treatment plans, all of which can vary across institutions. Studies to date that developed machine learning models tried to forecast the GPR results of PSQA in a quantitative form. In our opinion, qualitative information acquired at the planning stage is also a helpful tool to inform the dosimetrist whether the constructed plan meets the gamma criteria set according to the technique used (i.e. conventional or stereotactic fractionation), which, under our institutional protocols, are 3%|2 mm evaluated in the global mode and 2%|2 mm evaluated in the local mode.
Therefore, this work explored the interrelations between planning and complexity metrics and GPR results obtained from routinely realised VMAT treatments in our institution. Additionally, three multicomponent models were tested for further modelling GPR results in the qualitative form.
Materials and methods
The study is based on a retrospective analysis of anonymised data approved by the local Bioethics Committee at the Poznan University of Medical Sciences. All examinations were performed following the Committee guidelines and the Declaration of Helsinki [14]. The study includes original examinations conducted with the patients' written informed consent, in line with the standard institutional protocol. It is an unsponsored, single-institution study using a database collected from January to May 2022. All data were anonymised, and the examined patients cannot be identified. There were 802 treatment arcs extracted from 378 volumetric modulated arc therapy (VMAT) treatment plans. Forty-six plans contained three arcs, and 332 plans had two arcs. The plans were created and delivered for patients with cancer localised in the head and neck (HN; 192 arcs), thorax (THX; 191 arcs) and abdomen and pelvis (AP; 419 arcs) regions. Detailed locations are provided in the supplementary data (Tab. S1).
All plans were prepared for 6 MV photon beams and met our institutional clinical guidelines for dose distribution. The plans were based on conventional as well as stereotactic fractionation schemes. Three hundred and twenty-three plans (671 arcs) were delivered conventionally with a flattening filter (6X), and the remaining 55 plans (131 arcs) were delivered without a flattening filter (6X-FFF). The maximum planned dose rate (DR) was 600 MU/min for 6X and 1400 MU/min for 6X-FFF. The dose distribution calculations were performed on CT scans (Somatom Definition AS scanner; Siemens Medical Solution, Erlangen, Germany) using the analytical anisotropic algorithm (AAA) v.16.1.0 implemented in the Eclipse v.16.0 treatment planning system (Varian Medical Systems, Palo Alto, USA).
The plans were delivered on six TrueBeam accelerators (Varian Medical Systems, Palo Alto, USA), four of which were equipped with an aS1200 electronic portal imaging device (EPID) and two with an aS1000 EPID. Patient-specific quality assurance for every plan was performed using the gamma analysis method. The planned doses were compared with those measured by the EPIDs. In total, 521 arcs were measured by the aS1200 EPID and 281 arcs by the aS1000 EPID. For both EPIDs, the same portal dose image prediction (PDIP) algorithm v.16.1.0 was used.
Each arc was verified twice: in the global mode with a dose difference (DD) criterion of 3% and a distance-to-agreement (DTA) criterion of 2 mm, and in the local mode with DD = 2% and DTA = 2 mm. For both verifications, the threshold was 5%, normalised to the maximum planned dose. Based on the gamma passing rates (GPR) from both verifications, a three-level qualitative descriptor (QD) was established to score the result of verification (Tab. 1).
Table 1. Qualitative descriptor (QD) assigned from the gamma passing rates (GPR) for the specified gamma analysis criteria.
| Qualitative descriptor | GPR for 3% / 2 mm | GPR for 2% / 2 mm |
|---|---|---|
| Green | ≥ 95% | > 95% |
| Yellow | ≥ 95% | < 95% |
| Red | < 95% | < 95% |
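The scoring rule in Table 1 can be sketched as a small function. This is an illustrative sketch only, not the code used in the study: the function name and argument names are hypothetical, a local-mode GPR of exactly 95% is assumed to count as green, and any arc failing the global criterion is assumed to be red (Table 1 does not list the case of a failed global but passed local verification).

```python
def qualitative_descriptor(gpr_global_3_2: float,
                           gpr_local_2_2: float,
                           tolerance: float = 95.0) -> str:
    """Map the two gamma passing rates (in %) of one arc to a QD.

    gpr_global_3_2: GPR for 3% / 2 mm in the global mode.
    gpr_local_2_2:  GPR for 2% / 2 mm in the local mode.
    """
    # Assumption: failing the global criterion is enough for a red score.
    if gpr_global_3_2 < tolerance:
        return "red"
    # Global criterion met, stricter local criterion failed.
    if gpr_local_2_2 < tolerance:
        return "yellow"
    # Both criteria met (boundary value treated as a pass by assumption).
    return "green"


print(qualitative_descriptor(99.2, 97.5))  # green
print(qualitative_descriptor(98.0, 91.3))  # yellow
print(qualitative_descriptor(92.4, 80.1))  # red
```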
Figure 1 shows examples of the comparisons for which, as a result of gamma analysis, three different QDs were assigned, i.e. (a) green, (b) yellow and (c) red. In Figure 1, each comparison was performed between the predicted dose (from the treatment plan) and the delivered dose, acquired with the same type of portal imager, i.e. aS1200. Moreover, every comparison was performed for doses obtained from 6X arcs with a 600 MU/min dose rate, and the examples included patients with the same location of the treatment area (AP) and comparable planning target volume (PTV).
Figure 2 shows the relations between the GPRs obtained through gamma analyses based on two different criteria for the DD and DTA and realised in two different modes (global and local).
The study’s first phase analysed the interdependence between the selected metrics of the treatment plans, the selected plan complexity metrics, and the results of their dosimetry verification expressed as qualitative descriptors (QD). The Mann-Whitney test, the Kruskal-Wallis test with Dunn multiple pairwise comparisons and the Spearman test were used to check these relations at a 0.05 significance level.
The plan metrics included in the study were:
- Darc [Gy]: the part of the fraction dose delivered during the arc irradiation;
- PTV [L]: planning target volume in litres;
- energy (6X or 6X-FFF): energy, type of radiation and beamforming technology;
- area: the PTV location (HN, THX or AP).
The complexity metrics used in the study were:
- BA, BI and BM: beam aperture, intensity, and modulation, respectively [15];
- MU/Gy: monitor units per Gy [16];
- aMU/CP and sdMU/CP: the average number of monitor units in Gy per control point during the arc irradiation (aMU/CP) and the corresponding standard deviation (sdMU/CP) [17];
- aDR and sdDR: the average normalised dose rate during the arc irradiation (aDR) and the corresponding standard deviation (sdDR) [18];
- aGS and sdGS: the average normalised speed of the gantry movement during the arc irradiation (aGS) and the corresponding standard deviation (sdGS) [18];
- Join function (ϑ): empirically determined function representing the relationship between aDR and aGS.
All plan and complexity metrics listed above were extracted automatically from the plans' DICOM files by our script written in Python using the SciPy library [19].
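As an illustration of this kind of extraction, the sketch below derives the monitor units delivered between consecutive control points from values that a DICOM RT plan stores for each arc: the beam meterset and the cumulative meterset weights. It is a hedged sketch and not the authors' script. The function name is hypothetical, the inputs are assumed to have been read from the file already, and the use of the population standard deviation is an assumption. The exact definitions of aMU/CP and sdMU/CP in [17] may differ, for example in normalisation.

```python
from statistics import fmean, pstdev
from typing import Sequence, Tuple


def mu_per_control_point(beam_meterset: float,
                         cumulative_weights: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard deviation) of MU delivered per control-point interval.

    beam_meterset:      total monitor units of the arc.
    cumulative_weights: cumulative meterset weight at each control point,
                        increasing from 0 to the final cumulative weight.
    """
    final_weight = cumulative_weights[-1]
    # MU delivered between control points i and i+1.
    mu_segments = [
        beam_meterset * (after - before) / final_weight
        for before, after in zip(cumulative_weights, cumulative_weights[1:])
    ]
    # Assumption: population standard deviation over all intervals of the arc.
    return fmean(mu_segments), pstdev(mu_segments)


# Hypothetical arc: 400 MU spread unevenly over three intervals.
mean_mu, sd_mu = mu_per_control_point(400.0, [0.0, 0.25, 0.5, 1.0])
print(round(mean_mu, 2), round(sd_mu, 2))  # 133.33 47.14
```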
In contrast to the other complexity metrics listed above, which were first introduced by other authors [15–18], the join function (ϑ) is our own function, determined empirically by nonlinear estimation, that describes the relation between the dose rate and the gantry speed for volumetric modulated arc therapy.
The relations visualised in Figure 3 may be expressed by the formula:
The join function ranges from 0 to 2. For function values from 0 to 1, aGS is near the maximum available speed (~1), and aDR, which ranges from 0 to 1, plays the predominant role in the function. When aDR reaches 1, which corresponds to the maximum available planned dose rate, the proper dose delivery starts to be controlled by aGS, which decreases from 1 to 0, and as a result, aGS starts to play the predominant role in the function.
In the study’s second phase, predictive models of the qualitative descriptors of the dosimetry verifications were created and examined based on the treatment plan and complexity metrics. Two methods were chosen. The first was a probabilistic parametric classification technique called discriminant analysis (DA), and the second was a machine learning model, the random decision forest (RDF). DA is a popular statistical technique that classifies observations into nonoverlapping groups by determining a linear or quadratic equation, constructed from one or more continuous or categorical predictor variables, that predicts which group a case belongs to [20]. The RDF is a classifier that evolves from the decision tree model, a predictive model expressed as a recursive partition of the feature space into subspaces that constitute a basis for prediction. A random forest is an ensemble method that combines multiple decision trees through bagging. Bagging involves creating multiple subsets of the original dataset through random sampling (with replacement) and training a decision tree on each subset. The final prediction is an average or majority vote of the predictions from the individual trees. This reduces variance and thereby overcomes the overfitting problem of a single decision tree. The RDF enables many weak or weakly correlated classifiers to form a robust classifier [21].
All plan and complexity metrics explored in the study’s first phase were included to build DA, RDF, and hybrid models. The hybrid model assumed two steps of the prediction procedure — the first, where the DA model was used to predict red QD and the second, where the RDF model predicted green and yellow QD. The models were compared by accuracy and the number of correct classifications and misclassifications. The proper classification related to the different QD values for every model was studied, including the sensitivity and specificity of the models to forecast specified QD. The accuracy of the models and the sensitivity/specificity of the models to forecast specified QD were computed by the formulas [22]:
Accuracy = (TP+TN)/(TP+TN+FP+FN),
Sensitivity = TP/(TP+FN),
Specificity = TN/(TN+FP),
where TP, FP, TN, and FN are the true positive, false positive, true negative, and false negative observations, respectively. Both models were constructed and tested using XLSTAT software (Addinsoft, New York, USA). The training and validation groups were the same for all models and contained 642 and 160 treatment arcs, respectively (i.e. an 80%/20% split). Data were split using a stratified technique based on the distribution of the QDs of the GPRs to guarantee that the validation set was representative of the overall population of QDs (Fig. 4).
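For a three-level descriptor, the formulas above are applied to each QD in a one-vs-rest manner, while the overall accuracy is the share of correctly classified arcs. The sketch below illustrates this. The function name is hypothetical, and the example matrix is not copied from the supplementary tables: it is reconstructed here from the DA results reported in the Results section (Tab. 3 and the statement that 70 of 108 green QDs were classified as yellow), so it should be read as an illustration of the arithmetic only.

```python
LABELS = ("green", "yellow", "red")

# Rows: actual QD, columns: predicted QD (order as in LABELS).
# Reconstructed from the reported DA validation results; illustrative only.
DA_VALIDATION = [
    [38, 70, 0],   # actual green
    [1, 45, 1],    # actual yellow
    [0, 0, 5],     # actual red
]


def one_vs_rest_metrics(matrix, labels=LABELS):
    """Return overall accuracy and per-class sensitivity/specificity."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    per_class = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                  # class k predicted as another class
        fp = sum(row[k] for row in matrix) - tp   # another class predicted as k
        tn = total - tp - fn - fp
        per_class[label] = {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        }
    return correct / total, per_class


accuracy, per_class = one_vs_rest_metrics(DA_VALIDATION)
print(round(accuracy, 3))  # 0.55
for label in LABELS:
    values = per_class[label]
    print(label, round(values["sensitivity"], 3), round(values["specificity"], 3))
```

With this matrix the function gives an accuracy of 0.550 and, for example, a sensitivity of 0.352 and a specificity of 0.981 for the green QD.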
Results
Figure 5 shows the percentage of observations grouped by QDs (green, yellow, and red) and related to (a) the area of the irradiation, (b) detector type, and (c) energy used. The distribution of the QDs for the abdomen and pelvis (AP) area differed from that for the thorax (THX) or head and neck (HN) areas (Kruskal-Wallis, p < 0.001). Better QD distributions were observed for the newer EPID type (aS1200) than for the aS1000 type (Mann-Whitney, p < 0.001). Almost all QDs for 6X-FFF were classified as green. The distribution differed for 6X (Mann-Whitney, p < 0.001), where yellow and red QDs were also noted.
The 6X-FFF arcs were generally characterised by a high fraction dose per arc (Darc) and were used mainly in stereotactic treatment (75.5% of all 6X-FFF arcs). Because of the requirements of stereotactic treatment, these arcs are associated with small PTVs and, consequently, small beam apertures (BA) and a high number of monitor units per control point (aMU/CP). Examining the interdependence between plan and complexity metrics shows many statistically significant correlations. Nevertheless, it should be noted that many of them are small, fair, or moderate [23]. Almost perfect correlations were observed between PTV and BA (R = 0.897, p < 0.001), and between aMU/CP and the join function (ϑ) (R = 0.893, p < 0.001). The rest of the detailed results are presented in the supplementary data (Tab. S2).
Figure 6 shows the interdependence between aMU/CP and ϑ. The data are presented as two trend lines determined by the energy parameter (6X or 6X-FFF).
Analysis of the proportion of the QDs of dosimetry verification results related to the complexity and quantitative plan metrics values shows that BA, aMU/CP, MU/Gy, aDR, and ϑ effectively differentiate all three QDs. The PTV, aGS, sdGS and sdMU/CP effectively separate green from the yellow and red QDs and do not differentiate the yellow from red. The BI and BM allow separating green from yellow QDs. For the rest of the parameters, the QD differentiation was ineffective. Figure 7 shows the results of QD differentiation for selected parameters. Table 2 shows the p-values obtained from the Dunn multiple pairwise comparisons performed during the Kruskal-Wallis analysis of qualitative descriptor differentiation by plan and complexity metrics.
Table 2. p-values from the Dunn multiple pairwise comparisons of the qualitative descriptors for the plan and complexity metrics.
| Parameter | Green vs. Yellow | Green vs. Red | Yellow vs. Red |
|---|---|---|---|
| Darc | 0.655 | 0.966 | 0.899 |
| PTV | < 0.001 | < 0.001 | 0.537 |
| BA | < 0.001 | < 0.001 | 0.042 |
| BI | 0.011 | 0.084 | 0.474 |
| BM | < 0.001 | 0.271 | 0.233 |
| aMU/CP | < 0.001 | < 0.001 | 0.027 |
| sdMU/CP | < 0.001 | < 0.001 | 0.213 |
| MU/Gy | < 0.001 | < 0.001 | 0.033 |
| aDR | 0.003 | < 0.001 | 0.021 |
| sdDR | 0.058 | 0.141 | 0.477 |
| aGS | < 0.001 | < 0.001 | 0.484 |
| sdGS | < 0.001 | < 0.001 | 0.458 |
| ϑ | 0.001 | < 0.001 | 0.022 |
The RDF model was more accurate than the DA model (0.875 vs. 0.550). The relatively weak accuracy of the DA model was caused by wrong predictions of the green and yellow QDs. As many as 70 green QDs (of all 108 green QDs in the validation set) were classified by the DA model as yellow. This results in a weak sensitivity of the green QD prediction (0.352) and a weak specificity of the yellow QD prediction (0.381). While the RDF model predicted the green and yellow QDs better than the DA model, the DA model was better at predicting the red QD. The DA model correctly predicted all five red QDs, whereas the RDF model did so for only two, which strongly affected the sensitivity of prediction for these QDs (1.000 for DA vs. 0.400 for RDF). We therefore introduced a hybrid model in which, in the first phase, the DA model is used to predict red QDs, and then, in the second phase, the prediction of green and yellow QDs is based on the RDF model. The constructed hybrid model has the highest accuracy and the best average sensitivity and specificity values (Tab. 3). The confusion matrices obtained for the training and validation sets are presented in the supplementary data (Tables S3-S7).
Table 3. Performance of the DA, RDF and hybrid models. For each qualitative descriptor, values are given as sensitivity / specificity.
| | DA | RDF | Hybrid |
|---|---|---|---|
| General model statistics | | | |
| Accuracy | 0.550 | 0.875 | 0.894 |
| Correct classifications | 88 | 140 | 143 |
| Misclassifications | 72 | 20 | 17 |
| Sensitivity / specificity of qualitative descriptors | | | |
| Green | 0.352 / 0.981 | 0.944 / 0.808 | 0.944 / 0.808 |
| Yellow | 0.957 / 0.381 | 0.766 / 0.929 | 0.766 / 0.947 |
| Red | 1.000 / 0.994 | 0.400 / 0.987 | 1.000 / 0.994 |
| Averaged | 0.770 / 0.785 | 0.703 / 0.908 | 0.903 / 0.916 |
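The two-step logic of the hybrid model can be sketched as follows. This is a minimal illustration of the decision flow only, not the XLSTAT implementation used in the study. The two classifiers are passed in as plain callables, the function and type names are hypothetical, the toy stand-in models and their thresholds are arbitrary, and folding a stray second-step "red" into "yellow" is an assumption made so that only the first step can issue a red QD.

```python
from typing import Callable, Mapping

Features = Mapping[str, float]
Classifier = Callable[[Features], str]


def hybrid_predict(arc: Features, da_model: Classifier, rdf_model: Classifier) -> str:
    """Predict the QD of one arc with the two-step hybrid scheme."""
    # Step 1: the DA model decides whether the arc is red.
    if da_model(arc) == "red":
        return "red"
    # Step 2: the RDF model separates green from yellow.
    label = rdf_model(arc)
    # Assumption: a red label from step 2 is treated as yellow.
    return label if label in ("green", "yellow") else "yellow"


# Hypothetical stand-ins with arbitrary thresholds, for demonstration only.
def toy_da(arc: Features) -> str:
    return "red" if arc["MU/Gy"] > 500.0 else "green"


def toy_rdf(arc: Features) -> str:
    return "yellow" if arc["BA"] < 20.0 else "green"


print(hybrid_predict({"MU/Gy": 620.0, "BA": 12.0}, toy_da, toy_rdf))  # red
print(hybrid_predict({"MU/Gy": 310.0, "BA": 12.0}, toy_da, toy_rdf))  # yellow
print(hybrid_predict({"MU/Gy": 310.0, "BA": 45.0}, toy_da, toy_rdf))  # green
```

Passing the models as callables keeps the sketch independent of any particular statistics package; in practice each callable would wrap a trained DA or RDF model.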
Discussion
It is known that the quality of dose distributions in plans is frequently independent of planning complexity [24], and comparable dose distributions can be attained through treatment plans of varying levels of complexity because inverse optimisation can introduce unnecessary intricacy [25, 26]. For these reasons, numerous researchers have advocated integrating complexity metrics into the cost function used by optimisation algorithms [26–28]. In this study, we selected complexity metrics that are easy to extract from the TPS at the dose optimisation and calculation stage. By examining the correlations between the complexity metrics, plan metrics and the PSQA scores, we confirmed previous literature findings [29, 30] that many complexity metrics are correlated. Multiple metrics can account for different uncertainties and sources of plan complexity. As we have shown, the complexity metrics are also correlated with the plan metrics, e.g. the intercorrelations presented in Figure 6 between the join function, the average monitor units per control point and the energy/beamforming technology, which in our data is strictly related to the fractionation scheme (stereotactic/conventional) represented by Darc, the fraction dose delivered during the arc irradiation. Nevertheless, as we have shown, predicting PSQA results from a single predictor is not feasible. Therefore, in contrast to approaches that use these indices at the optimisation stage to reduce plan complexity, we used them together with plan metrics to construct a forecasting model that provides, at the planning stage, qualitative information on the later PSQA results.
While other works on forecasting models present quantitative models [10–13] that are intended to replace the PSQA procedures, our concept assumes the introduction of a support tool that provides qualitative information, during treatment plan preparation, about the deliverability of the plan on the treatment machine. Our study shows that the most effective forecasting of the QD of the GPR results was obtained for the hybrid model based on the DA and RDF models. When implemented commercially, such a solution would enable the effective use of information generated during the treatment planning process to create a plan that can be delivered on the therapeutic machine with the accuracy adopted in the institution. This solution should be pre-configured and dependent on institution-specific data. This means that the team developing the model should decide which DD|DTA criteria of gamma analysis will be used to generate the green, yellow, and red descriptors. Moreover, the data on which the model is trained should be gathered in the same institution for its specific dose development and PSQA methods. As shown, although we used one PSQA method (EPID dosimetry), the GPR results differed between EPID models. Therefore, the specific characteristics of the dosimetry tool used during the PSQA should also be included.
The presented study is of a pilot nature. Our findings provide the basis for further development of the model to increase its accuracy, which currently allows correct QD prediction in 89.4% of cases.
Conclusion
While we found many statistically significant interrelations between metrics describing the plan and its complexity, most were small, fair or moderate. Only the correlations between ϑ and aMU/CP and between BA and PTV were almost perfect (R = 0.893 and R = 0.897, respectively).
Analysis of the proportion of the QDs related to the values of the complexity and plan metrics shows that many of these features allow either the effective separation of all three descriptors (BA, aMU/CP, MU/Gy, aDR and ϑ) or the separation of one descriptor from the other two or from one other (PTV, aGS, sdGS, sdMU/CP, BI, BM).
The study shows that predicting GPR results based on a single predictor is problematic. However, multi-component forecasting models are feasible. Analysis of the efficacy of the DA, RDF and hybrid models shows that the hybrid model, which uses the DA method to predict the red QD and the RDF method to predict the green and yellow QDs, is the most accurate (0.894 compared to 0.875 for the RDF model and 0.550 for the DA model).
Data availability
The datasets analysed during the study are available from the corresponding author on request.
Author contributions
T.P.: the concept of the study, literature analysis, writing the manuscript, data analysis and models training and validation; A.R.: the concept of the study, SciPy coding, data export and analysis and writing the manuscript; P.K.: literature analysis and writing the manuscript, supervision of training and validation of the models; M.A.: literature analysis, collecting and export of complexity metrics data; P.S.: collecting and export of plan metrics data; B.B. and M.K-M.: collecting and analysis of the PSQA data; A.J.: the concept of the study, supervision of the data collecting and export, manuscript writing.
Conflict of interests
The authors declare no conflict of interest.
Funding
None declared.