Blog, 23rd February 2021

Increasing the utility of cancer data at Genomics England

Prabhu Arumugam writes about Genomics England’s involvement in DATA-CAN.

As a member of the UK Health Data Research Alliance, Genomics England was delighted to join forces with other key institutions as founding partners of DATA-CAN. With an ever-growing dataset of over 100,000 whole genome sequences linked to rich clinical data and with key focus areas in both cancer and rare diseases, Genomics England and DATA-CAN share clear synergies.

Genomics England was established in 2013 to deliver the flagship 100,000 Genomes Project, and its collaborative partnership with NHS England has paved the way for the establishment of the NHS Genomic Medicine Service. Through this service, and through continued collaboration with external partners across industry and academia, the Genomics England dataset continues to grow and has most recently expanded the modalities of data made available to approved researchers globally. The Genomics England dataset and its associated biobank of tumour, extracted DNA and additional multi-omics samples can be made available to DATA-CAN researchers, alongside key expertise across our clinical, scientific and bioinformatic capabilities.

The Genomics England cancer dataset includes both somatic and germline whole genome sequences for almost 17,000 participants. However, to explore and understand the true value of whole genome sequencing, coupling this data to wide-ranging and diverse clinical data is a priority. Whilst a broad research dataset was collected at sample submission for the 100,000 Genomes Project, Genomics England has endeavoured to continue longitudinal clinical data collection, providing an understanding of each participant's clinical course.

The National Cancer Registration and Analysis Service (NCRAS) dataset is a highly curated and focused cancer dataset. To deliver the accuracy and depth required, NCRAS has highly experienced data analysts decoding data and providing benchmark decisions on any data variation, which means that data cannot be delivered in real time. However, the Cancer Outcomes and Services Dataset (COSD) that is submitted for all registered cancers in England also includes full pathology and radiology reports.

Of particular focus is improving the timeliness and depth of the clinical cancer data available to researchers. Currently, reporting of standard of care molecular testing in pathology laboratories is relatively limited within the NCRAS dataset. Genomics England is working in collaboration with NCRAS to explore methods of improving delivery of these molecular testing results, alongside exploration of more granular data.

Genomics England is keen to minimise the time lag in data curation, to improve and diversify the depth of cancer data, and to explore the utility of making de-identified pathology and radiology reports available within the Genomics England Research Environment.

Pathology reports in England have historically been reporter-specific: each individual pathologist has their own style and method. There is, however, an increasing drive towards a standardised template for reporting biopsies and excision specimens. Natural language processing (NLP) is extensively used to interpret and analyse large amounts of natural language data, but the vast variation in reporting styles poses unique and interesting challenges for using NLP to mine data from pathology reports. Drawing on a group of clinical subject matter experts and experienced data scientists, Genomics England has focused on cancer-specific datasets, augmenting the NCRAS curated data by extracting research data points of interest from pathology reports. As a pilot, Genomics England released a colorectal-specific dataset that included size of tumour, excision margin and MSI status, in addition to data from NHS Digital and ONS to provide outcomes data including date last seen and date of death.
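To give a flavour of what pulling structured data points out of free-text reports can involve, here is a minimal sketch in Python. It is an illustration only and not Genomics England's method: it uses simple rule-based pattern matching as a stand-in for a fuller NLP approach, and the function name, field names and sample report wording are all hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns for three of the pilot fields named above.
# Real reports vary far more than these rules can handle.
_SIZE = re.compile(r"(?:tumou?r|lesion)[^.\n]*?(\d+(?:\.\d+)?)\s*(mm|cm)", re.I)
_MARGIN = re.compile(r"margin[^.\n]*?(\d+(?:\.\d+)?)\s*(mm|cm)", re.I)
_MSI = re.compile(r"\b(MSS|MSI[- ]?(?:high|low|H|L))\b", re.I)


def _to_mm(value: str, unit: str) -> float:
    """Convert a measurement to millimetres."""
    number = float(value)
    return number * 10 if unit.lower() == "cm" else number


def _normalise_msi(raw: str) -> str:
    """Map spelling variants onto MSS, MSI-H or MSI-L."""
    token = raw.upper()
    if token == "MSS":
        return "MSS"
    return "MSI-H" if token.endswith("H") else "MSI-L"


def extract_colorectal_fields(report: str) -> Dict[str, Optional[object]]:
    """Return the first match for each field, or None when nothing is found."""
    size = _SIZE.search(report)
    margin = _MARGIN.search(report)
    msi = _MSI.search(report)
    return {
        "tumour_size_mm": _to_mm(*size.groups()) if size else None,
        "excision_margin_mm": _to_mm(*margin.groups()) if margin else None,
        "msi_status": _normalise_msi(msi.group(1)) if msi else None,
    }


if __name__ == "__main__":
    # Invented example text, not taken from any real report.
    sample = (
        "Tumour measures 3.5 cm in maximum dimension. "
        "Closest excision margin 2 mm. Immunohistochemistry: MSI-high."
    )
    print(extract_colorectal_fields(sample))
```

Rules like these break down quickly when a pathologist phrases a finding differently or mentions a tumour and a margin in the same sentence, which is exactly the reporter-to-reporter variation described above and the reason clinical review of extracted values matters.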

Chemotherapy data has been a key focus for researchers seeking to understand treatment regimes, responses and complications. The Systemic Anti-Cancer Therapy (SACT) dataset from NCRAS provides dosage, treatment regime and response data. Genomics England has developed a strategy to provide structured chemotherapy data with only a one-month lag.

Whilst there is a drive to provide structure and timeliness to clinical data, pathologists and researchers will be keen to explore their own routes of experimental interest. Genomics England has created an academic network, known as the Genomics England Clinical Interpretation Partnership (GeCIP), and is using this network of researchers to understand both the need to access structured data and the need to access full reports. To aid this, Genomics England has developed a process to make a limited number of de-identified pathology and radiology reports available to researchers.

Cancer genomics is a rapidly evolving field, and the data needed to interpret and ultimately advance our understanding is growing. Genomics England is at the forefront of whole genome sequencing (WGS), delivering the only national standard of care WGS service in the world for the NHS. Utilising this knowledge and experience, we have also prioritised diversifying our clinical data provision and aiding researchers' diagnostic experiences. Alongside the wealth of clinical data currently available, we aim to add pathology and radiology imaging for all cancers, as well as primary care data, over the next 6-12 months.

Prabhu Arumugam is the Lead Liaison for Cancer Pathology and Clinical Data at Genomics England.