Chapter 6 Conclusion

Most data are multivariate. This thesis makes important contributions to the interactive visualization of quantitative multivariate data. Data visualization can convey complexities in data, such as heteroskedasticity and irregular shapes of distributions, better than numerical summaries, which makes it an essential activity for exploring data. However, visualization becomes more challenging when there are more than two variables (dimensions). Dimension reduction is commonly used in the analysis of high-dimensional data. This thesis provides a new methodology using linear dimension reduction, built on dynamic animation of linear projections, with a specific focus on the radial tour. The radial tour changes the contribution of a selected variable to the projection. User control over the basis allows the analyst to better explore which variables contribute to the structure of a projection.

The Introduction lists the overarching question: Can the radial tour, with user-steering of the basis, help analysts’ understanding of the variable sensitivity to structure in the projection? This is divided into three research questions, each addressed in one of the three content chapters. Chapter 3 developed a package addressing RQ 1: How do we define and implement a user interface and interactions for radial tours to add and remove variables smoothly from 1- and 2D linear data projections? Chapter 4 discussed a user study to answer RQ 2: Does the use of the interactive radial tour improve analysts’ understanding of the relationship between variables and structure in 2D linear projections compared to existing approaches? Chapter 5 covered the novel analysis and corresponding package responding to RQ 3: Can radial tours be used in conjunction with local explanations to improve the interpretability of machine learning models?

6.1 Contributions

The contributions of this thesis can be split into scientific knowledge and software development.

6.1.1 Scientific knowledge

Chapter 3 clarifies the radial tour methodology, specifically using Rodrigues’ rotation formula (Rodrigues 1840) to solve the rotation matrix defined for 1- and 2D tours. This use of the rotation formula sets up a scaffolding to extend the manual tour to three dimensions with another rotation angle to span the manipulation space. This work also supports radial tours by illustrating use cases on high-energy physics data sets.
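For reference, the general form of Rodrigues’ rotation formula is restated here. This is a sketch in assumed notation, which may differ from the notation of Chapter 3: \(\mathbf{k} = (k_1, k_2, k_3)\) is a unit-length rotation axis, \(\theta\) is the rotation angle, \(\mathbf{I}\) is the \(3 \times 3\) identity, and \(\mathbf{K}\) is the cross-product matrix of \(\mathbf{k}\). It describes a rotation in three dimensions, such as one within the 3D manipulation space of a 2D tour.

\[
\mathbf{R} = \mathbf{I} + (\sin\theta)\,\mathbf{K} + (1 - \cos\theta)\,\mathbf{K}^{2},
\qquad
\mathbf{K} =
\begin{pmatrix}
0 & -k_3 & k_2 \\
k_3 & 0 & -k_1 \\
-k_2 & k_1 & 0
\end{pmatrix}
\]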

Chapter 4 discusses a user study to evaluate the efficacy of the radial tour as compared with PCA and the grand tour. It defines a task and accuracy measure for attributing the separation of two clusters to variables. The \(n=108\) crowdsourced user study compares the performance of these visuals across three experimental factors of simulated data. Mixed model regression finds considerable evidence for a sizable improvement in accuracy from the radial tour, which participants also subjectively prefer.

Chapter 5 introduces cheem analysis. This extends the interpretability of nonlinear models by exploring local explanations with the radial tour. The global view gives a full observation summary of data space, attribution space, and residual plot side-by-side as a coordinated view. From this, an analyst identifies a primary observation to explore its explanation in detail and optionally compares it against another observation. The primary observation’s normalized attribution becomes the starting basis for a radial tour. By varying the contributions of variables, an analyst tests the contribution’s sensitivity to the predictive power identified in the explanation. We provide usage and discussion from four contemporary datasets.

6.1.2 Software

The R package spinifex facilitates the creation of manual tours, which allow an analyst to steer the contributions of a variable. It creates a framework for the layered display of tours interoperable with the tourr package. This extension of the geometric display of tours will feel at home to ggplot2 users. After composition, tours can be animated and exported as interactive or fixed animations. Vignettes and an interactive application help users rapidly understand the concepts facilitated by this work. The impact of spinifex can be seen in two ways. My contributions to spinifex and tourr won the ACEMS Impact and Engagement Award, 2018. Furthermore, the package is available on CRAN with vignettes and version notes on its pkgdown site. It has been downloaded over 15,800 times from CRAN between 09 April 2019 and 27 February 2022.

Given a compatible model, the cheem package calculates the tree SHAP local explanations with treeshap. Functions perform the preprocessing to calculate the local explanations of all observations, along with several statistics that help describe the separability of data and attribution space. This processed object is then used to produce two novel visuals that are composed using spinifex and ggplot2 before being rendered interactively with plotly. An interactive graphical user interface is made with shiny, using these visuals to facilitate the outlined cheem analysis. Several preprocessed datasets are included, and analysts can upload their own data after processing. This package was recently published on CRAN and has a corresponding pkgdown site.

6.2 Limitations

Below, we discuss several limitations of this work in chapter order. The following section echoes this order and discusses possible directions to address these limitations.

Manual tours (and radial tours) are limited in several ways. For instance, the work currently lists rotation matrices only for 1- and 2D manual tours. Another limitation is that they change only one variable at a time. This can be cumbersome and time-consuming if many variables need to be zeroed or otherwise modified. The radial tour moves in three segments (total contribution, no contribution, and initial contribution). It may be more approachable to directly relate the fraction of the slider to the magnitude of the variable’s contribution.

Setting aside the manual tour, there are several possible extensions to the layered composition of tours in spinifex, such as extending the geometric display type. Because the manual tour is currently defined for \(d \in \{1, 2\}\) projections, the geometric display of tourr-made tours with \(d>2\) is under-supported. Most tour implementations are made with 2D monitors in mind. It would be valuable to have implementations for extended reality with a stereoscopic head-mounted display.

The user study evaluating the radial tour has several intrinsic limitations from the choice of visuals and levels of the experimental factors. It only considered discrete 2D PCA, the grand tour, and the radial tour for supervised cluster separation on mostly linear clusters in four and six dimensions. Trials were collected online due to the ongoing COVID-19 pandemic, which limited the audience and complexity of the task.

The cheem analysis was illustrated using the tree SHAP explanations on random forest models. The cheem package handles several other tree-based models, though the analysis should be generalized to a broader base of models and compatible explanations. We also focus on quantitative matrices. This could be generalized to accommodate text, image, or time-series data. There is also no comparison evaluating the value of cheem analysis. It would be interesting to measure the benefit of cheem over local explanations or other observation-level evaluations of a model.

6.3 Future work

This section proposes possible directions to address or extend from the limitations outlined above.

Chapter 3 included the scaffolding to extend manual tours beyond 2D. Namely, this would require the use of Rodrigues’ rotation formula to define another rotation angle. The 3D projection would initialize a 4D manipulation space and use three input angles to control the contribution in the first three components. Another display dimension may benefit the detection and understanding of the higher dimensional structure, while we would also expect longer interaction time and less intuitive input.

In addition to the manual tour controlling the contribution of a single variable, it may be insightful to change the contributions of several variables at once (effectively manipulating a linear combination of variables). Varying combinations of variables may increase manipulation speed or be prohibitively unintuitive to input. Alternatively, an automated approach may help clean up variables with small contributions. A sort of dimension “reduction tour” could append several manual tours that sequentially zero the contributions of variables contributing less than some threshold, which may expedite analysis, especially for approaching a feature’s intrinsic dimensionality. This approach may be better as an initialization step before render time. This would abstract away some of the busyness and monotony of dealing with many variables, allowing an analyst to focus on the variables that tangibly impact the projection.
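The selection step of such a “reduction tour” can be sketched as follows. This is a minimal illustration rather than part of spinifex: the function name `small_contribution_order`, the representation of a basis as a list of rows, and the use of the Euclidean norm of each row as the contribution magnitude are all assumptions made for this example.

```python
import math


def small_contribution_order(basis, threshold):
    """Order in which a hypothetical reduction tour would zero variables.

    `basis` is a list of p rows (one per variable), each holding the d
    coefficients of that variable in the projection. A variable's
    contribution is taken to be the Euclidean norm of its row (an
    assumption of this sketch). Variables that already contribute
    nothing are skipped because no tour is needed for them.
    """
    norms = [math.sqrt(sum(v * v for v in row)) for row in basis]
    candidates = [i for i, n in enumerate(norms) if 0.0 < n < threshold]
    # Smallest contribution first, so the least influential variables go first.
    return sorted(candidates, key=lambda i: norms[i])
```

Each manual tour redistributes contribution among the remaining variables, so in practice the norms would likely need to be recomputed after every zeroing step rather than once up front.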

Currently, the radial tour creates three segments (increase-decrease-increase of magnitude). An “ink tank” display for a manual tour may be more intuitive than the current approach. For this, the extent of contribution would match the progress of an animation slider. The first frame would contain zero variable contribution, and the last would have a total contribution. The starting frame could be the original contribution that would also be exaggerated or annotated for reference and navigation.
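The proposed mapping from slider progress to contribution magnitude can be sketched for the 1D case. This is an illustration under stated assumptions and not the spinifex implementation: the function name `ink_tank_frames` is hypothetical, the basis is a unit-length list of coefficients, and the other variables are rescaled proportionally so that each frame keeps unit length, which keeps every frame in the plane spanned by the original basis and the selected variable's axis.

```python
import math


def ink_tank_frames(basis, manip_var, n_frames=5):
    """Frames of a 1D basis where slider progress equals contribution.

    `basis` is a unit-length list of p coefficients and `manip_var` is the
    index of the selected variable. Frame k gives that variable a
    contribution of magnitude k / (n_frames - 1), from zero to full.
    """
    if n_frames < 2:
        raise ValueError("need at least two frames")
    rest_sq = sum(v * v for i, v in enumerate(basis) if i != manip_var)
    if rest_sq == 0.0:
        raise ValueError("the other variables need a non-zero contribution")
    # Keep the direction (sign) of the selected variable's original contribution.
    sign = -1.0 if basis[manip_var] < 0 else 1.0
    frames = []
    for k in range(n_frames):
        m = k / (n_frames - 1)  # slider progress, used as the magnitude
        # Rescale the other variables so the frame stays unit length.
        scale = math.sqrt((1.0 - m * m) / rest_sq)
        frame = [v * scale for v in basis]
        frame[manip_var] = sign * m
        frames.append(frame)
    return frames
```

The frame whose magnitude is closest to the original contribution could then be annotated as the reference position on the slider.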

This last point is adjacent to the pedagogical side of tours and linear projections more broadly. The user-steering of the manual tour provides a way to play with projections and test hypotheses, making it a good candidate as a learning tool. An interactive application built for teaching and exploring projection techniques would be a boon to scholars.

Extending the output dimensions of some tours is relatively easy to do. The display in spinifex has focused on \(d \in \{1, 2\}\). Additional functions could be added to facilitate non-axis-based displays such as parallel coordinates, Chernoff faces, glyph-based, or pixel-based visuals. For the reasons mentioned in Chapter 2, these displays are potentially best used with data with relatively few observations and many variables.

There are also several extensions to the display of 1- and 2D tours, including a tabular display of the numeric values of the basis, drawing lines around convex and alpha hulls, and high-density region displays with progressive density rugs or a combination of point and density contour display, where the bulk of the data is shown as density contours while the outermost observations are displayed as points (Hyndman 1996; O’Hara-Wild et al. 2022).

Experiencing tours as 3D scatterplots in extended reality with stereoscopically accurate head-tracking may be fruitful. Nelson et al. (1998) explore 2D tours in virtual reality. Other works view 3D scatterplot tours on 2D displays (Yang 1999, 2000). It would be interesting to see modern implementations using WebGL, Mozilla A-Frame, or Unity. One concern would be keeping hardware and software as generalized as possible.

There are ways that the user study evaluating the radial tour could be extended. The data could be changed to involve nonlinear shapes, outliers, or varying density and dimensionality. Besides changing the support of the experimental factors of the user study, it may be more interesting to evaluate across different tasks or compare other visualization techniques; scatterplot matrices, parallel coordinate plots, or ridge plots of principal component approximations may be realistic candidates. The performed user study crowdsourced participants with little exposure to linear projections. It would be interesting to compare the results with those from more experienced participants.

The outlined cheem analysis can be generally applied to models and local explanations. The package cheem currently calculates tree SHAP for tree-based models supported by treeshap (Komisarczyk et al. 2021), but this could be generalized to other models and local explanations. Those facilitated by DALEX::explain() seem to be an excellent direction to extend (Biecek 2018; Biecek and Burzykowski 2021). However, processing runtime is a looming obstacle that is exacerbated when moving away from computationally efficient explanations. Alternatively, other statistics may better show the structure identified in the attribution space from the explanations.

The cheem analysis focused primarily on tabular quantitative predictors. This analysis could be extended to image, text, or time series analysis. The global view may remain helpful in summary and identification, while observation-level exploration would likely need to change to fit the context of the data, be it saliency map, word-level contextual sentiment analysis, or another context.

A user study would help elucidate the benefit of cheem analysis over local explanations or observation-specific analysis of a model. Our analysis can highlight the unique attribution of a selected observation against peers and test the sensitivity of that attribution. A task identifying variable sensitivity to a prediction may be appropriate.