Last week a paper was published in Nature (open access) that had me giddily excited. It was the latest article from the ExAC team and it promised to make for very interesting reading. ExAC is the Exome Aggregation Consortium, and its dataset is the single largest collection of exomes anywhere in the world. The exome is the part of your DNA that actually codes for proteins, what many consider to be the useful bit. It makes up only about 1.5% of the total DNA in your cells, which would suggest that nearly 99% of your genome is useless, so-called ‘junk’ DNA. This view is rapidly crumbling, though, as we slowly learn more about genetic control and regulatory elements found throughout the non-coding parts of the genome.
Whilst the exome is not the be-all and end-all in genetics, then, it is certainly a good place to start, and a thorough understanding of it will bring immeasurable benefits to modern medicine. By now you may have decided that ExAC is just a database and therefore boring, but you would be wrong.
Although last week’s paper is the first official publication of the dataset, it was first written up as a preprint on bioRxiv back in 2014. This was also the point at which the whole database was made publicly available. Speaking as someone whose day job is to try to diagnose rare genetic disorders, I can’t overstate how useful a tool this has been in the diagnostic setting.
Back in the genetic stone age, six or so years ago, we simply didn’t have much data to work with. Comparatively few of our 22000 genes had been well characterised. Those causing common genetic diseases like cystic fibrosis and breast cancer were the best studied with many thousands of people sequenced. The vast majority of the genome, however, was an expansive wasteland, uncharted territory where only the brave or the foolhardy had trodden.
Diagnostically speaking, we simply didn’t have enough information to say with the necessary certainty what might be going on in a given patient, and so diagnoses were restricted to the usual suspects. Gradually, more genes would come online. With the advent of what we call Next Generation Sequencing technologies our rate of exploration has exploded. It is literally hundreds of times greater than it was even five years ago.
Even a little lab like the one in which I work now has the ability to generate sequence data on hundreds of genes in hundreds of patients in just a few weeks. This, though, has brought its own challenges. Now that we are bravely out prospecting in the genetic Wild West, what do we do when we discover something shiny in the dirt? How do we tell a golden nugget from a lump of pyrite? A disease-causing mutation from a benign polymorphism?
This is where scientific collaboration comes in. No one lab is going to be able to come up with the depth of data that we would need to gain a deep understanding of the array of genes across the exome. So scientists started setting up collections of data to help inform the community. The problem was that they were very niche. People interested in Parkinson’s Disease set up PD databases; those keen on neuropathies made a database for those. They were good as far as they went, but they were often unfunded and created by someone who just happened to know a bit of coding. They were often out of date or not particularly well curated.
To make things more complicated there’s the fact that we’re not clones. If you pick a random person in the street who you’re not related to and compare their exome with yours then there will be about twenty thousand differences. That’s normal. But what’s normal for someone of Nigerian heritage won’t necessarily be the same as what is normal for someone of Italian heritage. A Chinese person might have a variant that one third of their compatriots share but that would be vanishingly rare in me, a European.
This is important because when I find a variant in a patient I have to decide if it is pathogenic or not. There are various tools at my disposal to help me in this decision, but one of the main ones is knowing the frequency of that variant. If I am diagnosing a disease that only occurs in 1 in 50000 people and my variant has a frequency in the population of 1 in 50 then I can almost certainly rule it out; it’s just too common to be what I’m looking for. If I don’t know the ethnicity of the patient, which is frequently the case, then how am I supposed to know what ‘normal’ even is? And, to be blunt, it was only really those of European ancestry that were properly represented; there was a crippling lack of diversity in the genetic databases.
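To make that frequency argument concrete, here is a minimal Python sketch. It is not how any diagnostic lab, or ExAC itself, does the calculation. The function names are made up for illustration, and it assumes that the "1 in 50" figure is an allele frequency, that the variant would be fully penetrant, and that genotypes follow Hardy-Weinberg proportions. Under those assumptions it asks one question: if this variant really caused the disease, what fraction of people would be affected, and is that more than the disease's actual prevalence?

```python
def expected_affected_fraction(allele_freq, inheritance="dominant"):
    """Fraction of people who would be affected if the variant were causal.

    Assumptions (illustrative only): full penetrance and
    Hardy-Weinberg genotype proportions.
    """
    q = allele_freq
    if inheritance == "dominant":
        # Anyone carrying at least one copy is affected: 1 - (1 - q)^2
        return 1 - (1 - q) ** 2
    if inheritance == "recessive":
        # Only people carrying two copies are affected: q^2
        return q ** 2
    raise ValueError(f"unknown inheritance model: {inheritance!r}")


def too_common_to_be_causal(allele_freq, disease_prevalence, inheritance="dominant"):
    """True if the variant alone would produce more cases than the disease has."""
    return expected_affected_fraction(allele_freq, inheritance) > disease_prevalence


if __name__ == "__main__":
    # The example from the text: variant at 1 in 50, disease at 1 in 50000.
    print(too_common_to_be_causal(1 / 50, 1 / 50000))               # True
    print(too_common_to_be_causal(1 / 50, 1 / 50000, "recessive"))  # True
    # A hypothetical much rarer variant cannot be excluded this way.
    print(too_common_to_be_causal(1 / 1_000_000, 1 / 50000))        # False
```

Even under the more forgiving recessive model, a variant at 1 in 50 would put roughly 1 in 2500 people at risk, far more than the 1 in 50000 who actually have the disease. Real-world thresholds also have to allow for incomplete penetrance and for a disease having many different causal variants, which this sketch deliberately ignores.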
Then came the age of the exome projects.
In 2012 the 1000 Genomes Project came online, giving us access to a database 60-fold larger than all the previous genetic data of the last 25 years combined. Soon afterwards came the Exome Sequencing Project (ESP) from the National Heart, Lung and Blood Institute in the US, which was five times larger again. In 2014 ExAC blew both of these out of the water with the collected exomes of 60706 people, nearly ten times larger than the ESP.
For the first time we had decent sequencing depth on all of the world’s major populations. If I have a patient from South East Asia I can compare their sequence to that of their ethnic kin and make a more informed determination about whatever I might find buried in their genes.
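The population-matching idea can be sketched in a few lines of Python. Everything here is hypothetical: the population labels, the function name and the frequencies are placeholders invented for illustration, not values or an interface taken from ExAC. The sketch uses the frequency from the patient's own population when it is known and falls back to the highest frequency seen in any population when it is not, which is one possible choice among several.

```python
def frequency_for_patient(freqs_by_population, patient_population=None):
    """Pick the variant frequency most relevant to a given patient.

    freqs_by_population: dict mapping a population label to the
        variant's frequency in that population (hypothetical layout).
    patient_population: the patient's population label, or None
        when the ancestry is unknown.
    """
    if not freqs_by_population:
        raise ValueError("no frequency data for this variant")
    if patient_population in freqs_by_population:
        # Ancestry known and represented: compare like with like.
        return freqs_by_population[patient_population]
    # Ancestry unknown or not represented: fall back to the highest
    # frequency observed in any population.
    return max(freqs_by_population.values())


if __name__ == "__main__":
    # Made-up numbers: common in one population, very rare in another.
    variant = {"east_asian": 1 / 3, "european": 1 / 100_000}

    print(frequency_for_patient(variant, "european"))    # 1e-05
    print(frequency_for_patient(variant, "east_asian"))  # 0.3333333333333333
    print(frequency_for_patient(variant))                # 0.3333333333333333
```

The same variant looks rare or common depending on which population it is judged against, which is exactly why a database dominated by one ancestry can mislead.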
Things are not perfect. There are still populations that are not sufficiently represented, and Europeans still dominate, but we are definitely going in the right direction. My job has certainly been made easier. We are still scrabbling around the edges of the Broad 😉 genetic plain before us. ExAC represents the invention of our first telescope, with which we can properly probe the realm ahead and chart our first expedition past the frontier. But it won’t be long until we have better telescopes and faster horses. The never-ending genetic horizon is no longer a scary wilderness but an exciting adventure to be embarked upon.