Conclusion

Authors

Rishika Mamidibathula (rm4318) and Kshamaa Suresh (ks4423)

This project successfully demonstrated how combining interactive data visualization with interpretable machine learning transforms complex multivariate health data into actionable insights about colorectal cancer risk factors. Through analysis of over one thousand real patient records, we employed both exploratory visualization techniques and statistical modeling to identify key patterns and validate findings across multiple methodological approaches.

Our analysis indicated that colorectal cancer risk is driven primarily by modifiable lifestyle factors rather than by isolated demographic or nutritional variables. Smoking behavior was the single strongest predictor across all analytical approaches: logistic regression gave it a standardized coefficient of approximately 3.5, and the decision tree placed lifestyle patterns at the root node. Because this result held across multiple methods, it suggests that smoking cessation programs could have substantial population-level impact on cancer prevention. Family history consistently ranked as the second most important factor, acting as a risk amplifier that doubles baseline risk across all lifestyle categories. Even so, our interactive visualizations showed that active individuals with a family history still have lower risk than sedentary individuals without one.
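To make the size of that coefficient concrete, the sketch below converts a logistic regression coefficient into an odds ratio and applies it to a baseline probability. Only the value 3.5 comes from the text above. The function names and the ten percent baseline are assumptions chosen for illustration, and the result reads as "per one standard deviation of the predictor" only if the coefficient was fit on a standardized variable.

```python
import math


def odds_ratio(beta: float) -> float:
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(beta)


def shift_probability(baseline_prob: float, beta: float, delta: float = 1.0) -> float:
    """Probability after moving one predictor by `delta` units,
    holding every other predictor fixed."""
    # Work on the log-odds scale, where logistic regression is linear.
    log_odds = math.log(baseline_prob / (1.0 - baseline_prob)) + beta * delta
    return 1.0 / (1.0 + math.exp(-log_odds))


# Approximate standardized coefficient for smoking reported in the text.
smoking_beta = 3.5

# Hypothetical baseline risk, used only to illustrate the arithmetic.
baseline = 0.10

print(f"odds ratio per unit increase: {odds_ratio(smoking_beta):.1f}")
print(f"risk after a one-unit increase: {shift_probability(baseline, smoking_beta):.2f}")
```

The odds ratio multiplies odds, not probabilities, so the change in predicted risk depends on the baseline that is plugged in.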

The parallel coordinates visualization proved particularly effective for uncovering multivariate patterns, allowing dynamic exploration of relationships that would otherwise require dozens of static charts to capture. The BMI-lifestyle heatmap revealed critical interaction effects: obesity combined with smoking produces the highest-risk profile, at fifty-three percent, while physical activity provides substantial protection even at higher body weights. These interactive tools turned passive observation into active discovery, enabling users to identify high-risk profiles and understand how individual variables interact within broader health contexts.

Surprisingly, dietary micronutrients showed minimal direct association with cancer risk in our statistical models, despite clear nutritional differences between risk groups in our visualizations. The at-risk group showed substantially lower intake of vitamin C and iron in exploratory plots, but these factors were not statistically significant in the regression analysis. One plausible explanation is that diet influences cancer risk primarily through indirect pathways, such as BMI and metabolic effects, rather than through direct mechanisms. This finding challenges common assumptions about diet and cancer and suggests that lifestyle modification and appropriate screening should be prioritized over dietary supplementation for cancer prevention.

The technical architecture we developed, combining static GitHub Pages hosting with cloud-hosted Shiny applications, offers a practical model for projects requiring both narrative documentation and computational interactivity. The risk prediction calculator extended our analytical framework beyond pattern recognition to personalized assessment, with the what-if simulator enabling users to explore how lifestyle modifications might alter their risk profile. This interactive capability shifts focus from static diagnosis to dynamic planning, empowering individuals to make informed preventive decisions.
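As a rough illustration of the what-if idea, the sketch below compares the predicted risk for a profile before and after a lifestyle change using a logistic model. It is not the project's Shiny implementation: the coefficient values, feature names, and function names are hypothetical placeholders, not the fitted model from this analysis.

```python
import math

# Hypothetical placeholder coefficients on the log-odds scale.
# These are NOT the fitted values from this project.
COEFFICIENTS = {
    "intercept": -2.0,
    "smoker": 1.5,
    "family_history": 0.7,
    "active": -0.8,
}


def predicted_risk(profile: dict) -> float:
    """Logistic-model probability for a profile of 0/1 indicator features."""
    z = COEFFICIENTS["intercept"]
    for feature, value in profile.items():
        z += COEFFICIENTS[feature] * float(value)
    return 1.0 / (1.0 + math.exp(-z))


def what_if(profile: dict, **changes) -> tuple:
    """Return (current risk, risk after applying the hypothetical changes)."""
    modified = {**profile, **changes}
    return predicted_risk(profile), predicted_risk(modified)


# Example: a sedentary smoker with a family history considers
# quitting smoking and becoming physically active.
current = {"smoker": 1, "family_history": 1, "active": 0}
before, after = what_if(current, smoker=0, active=1)
print(f"current risk: {before:.2f}, after changes: {after:.2f}")
```

Keeping the model as a plain function of a profile means the same scoring code can serve both a single prediction and a side-by-side comparison.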

Several limitations merit acknowledgment. The cross-sectional design cannot establish causality, and single-point dietary assessments may not capture long-term nutritional patterns. The logistic regression model, while interpretable, represents a simplified approach compared to more sophisticated machine learning methods that could capture non-linear relationships. Future work could incorporate longitudinal data to establish temporal relationships, integrate polygenic risk scores for enhanced prediction, and validate findings across diverse populations.

This project reinforced that effective health communication requires more than presenting accurate information; it demands interactive tools that enable personalized exploration and support decision-making. By combining thoughtful visualization design with interpretable machine learning, we created an analytical environment where users can ask questions, test hypotheses, and develop a personalized understanding of complex health relationships. The convergence of findings across multiple methodologies strengthens confidence that the identified risk factors reflect genuine associations, offering evidence-based guidance for both clinical practice and public health interventions.