Y-DNA Haplogroups: Navigation using graph database methods

David A Stumpf, MD, PhD
Dec 30, 2017

Updated: Nov 18, 2022

Y-DNA Haplogroups

I manage a number of Y-DNA surname projects at FTDNA. One of the challenges is determining whether men's single nucleotide polymorphism (SNP) results put them on the same haplogroup branch. That is easy when they fall in different major clades such as R, I, or E. But how do you know whether R-Z2 is on the same branch as R-Z25738? You can visually navigate the full haplotree, but if you've done much of this, you'll understand why a better solution is needed. FTDNA recently added a helpful feature: one of their reports now presents just the direct branches, each with its count of matches. But it is only a few levels deep.

Human males have passed down the Y chromosome from "Adam," their most recent common patrilineal ancestor. Y-DNA testing measures markers: mutations that occur periodically and are then inherited by descendants. These mutations, and their fidelity over many generations, enable us to build a tree whose branches are defined by the mutations. Groups of humans on different migration routes developed mutations that uniquely define them. This method enabled scientists to trace the migration pathways of humans out of Africa to every corner of the earth. Anthropologists have also identified the haplogroups of ancient humans buried thousands of years ago.

More recently, genealogists adopted Y-DNA methods to study their roots, creating a greatly expanded set of data. The haplotrees have exploded with numerous branches. The pace of identifying new branches has increased, and the dates of the newest branches have been pushed well into historic times. Managing these data requires tools that quickly and easily yield insights about the haplotrees genealogists encounter. In this blog, we will explore how a Neo4j graph database can be helpful. See a previous blog for some additional background and reading material on Neo4j.

Haplotree Data

One of the challenges of haplotrees is the paucity of data sets in a robust format. Haplotrees are, for the most part, constructed by linking individual SNPs in hierarchical order. Individual SNP data is downloadable from ISOGG via a Google spreadsheet with the present version at this link. The hierarchical data can be created by cutting, pasting and parsing web page displays; it's a messy process for programmers and a step we'll not discuss here. Suffice it to say, the trees are loaded into Neo4j.

Neo4j Cypher Haplotree Queries

The Neo4j database has a node for each SNP and a child relationship (edge) to each subsumed node. You can then write a Cypher query and view the resulting graph with the default Neo4j web viewer.

MATCH (n:SNPNode{SNP:'R-U106'})-[*0..2]-(m) RETURN n,m

This query finds the node for R-U106 and then navigates up to 2 steps through the graph in either the ascendancy or descendancy direction. Notice the directions of the arrows on the edges. Only 10 other nodes meet these conditions: a parent (R-L151), siblings (R-A8053, R-S1194, R-L51, and R-P312), children (R-Z2265 and R-A2150), and grandchildren (R-BY30097, R-S19589, and R-BY16457).
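To make the "within 2 steps, either direction" idea concrete outside of Neo4j, here is a minimal Python sketch of the same neighborhood search. It is not how Neo4j evaluates the query; it is a plain breadth-first search over a hand-typed fragment of the tree. The `EDGES` list and the function name `neighbors_within` are assumptions made for illustration, and the fragment includes only some of the relatives named above.

```python
from collections import deque

# Hand-typed (parent, child) pairs: a small fragment of the haplotree,
# assumed here for illustration only.
EDGES = [
    ("R-L151", "R-U106"), ("R-L151", "R-P312"), ("R-L151", "R-S1194"),
    ("R-L151", "R-A8053"), ("R-U106", "R-Z2265"), ("R-U106", "R-A2150"),
    ("R-Z2265", "R-BY30097"),
]

def neighbors_within(edges, start, max_hops):
    """Return {node: hop count} for every node within max_hops of start,
    ignoring edge direction (hop count 0 is the start node itself)."""
    adjacent = {}
    for parent, child in edges:
        adjacent.setdefault(parent, set()).add(child)
        adjacent.setdefault(child, set()).add(parent)
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for other in adjacent.get(node, ()):
            if other not in hops:
                hops[other] = hops[node] + 1
                queue.append(other)
    return hops

print(sorted(neighbors_within(EDGES, "R-U106", 2).items()))
```

In this fragment the parent and children sit at 1 hop, while siblings and the grandchild sit at 2 hops, which is why siblings appear in an undirected search at all.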

You can modify the query to select either descendant or ascendant nodes, which we'll extend out 3 steps:

MATCH (n:SNPNode{SNP:'R-U106'})-[*0..3]->(m) RETURN n,m

MATCH (n:SNPNode{SNP:'R-U106'})<-[*0..3]-(m) RETURN n,m
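The difference between the two directed queries can be sketched in Python as well. This is only an illustration under assumptions: `CHILDREN` is a hand-typed fragment of the tree, and the function names are made up for this example. Following child edges fans out into a subtree, while following them backwards gives a single line of ancestors.

```python
# Hand-typed fragment of the tree (parent -> children), assumed for illustration.
CHILDREN = {
    "R-L151": ["R-U106", "R-P312", "R-S1194", "R-A8053"],
    "R-U106": ["R-Z2265", "R-A2150"],
    "R-Z2265": ["R-BY30097"],
    "R-BY30097": ["R-Z381"],
}

def descendants(snp, max_depth):
    """List snp plus everything reachable by following at most max_depth child edges."""
    found = [snp]
    if max_depth > 0:
        for child in CHILDREN.get(snp, []):
            found.extend(descendants(child, max_depth - 1))
    return found

def ancestors(snp, max_depth):
    """List snp plus at most max_depth ancestors, nearest first."""
    parent_of = {c: p for p, kids in CHILDREN.items() for c in kids}
    found = [snp]
    while len(found) <= max_depth and found[-1] in parent_of:
        found.append(parent_of[found[-1]])
    return found

print(descendants("R-U106", 3))
print(ancestors("R-Z381", 3))
```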

Neo4j has a built-in function for the shortest path. A common dilemma is determining whether two SNPs are in the same branch. A query incorporating both SNPs is made for this:

MATCH p = shortestPath((a:SNPNode {SNP:'R-U106'})-[:SNPChild*..99]-(b:SNPNode {SNP:'R-Z11'})) unwind(nodes(p)) as pn RETURN pn.SNP as SNP

This returns this list, also viewed with the Neo4j web viewer:

SNP
R-U106
R-Z2265
R-BY30097
R-Z381
R-Z301
R-L48
R-Z9
R-Z30
R-Z27
R-Z345
R-Z2
R-Z7
R-Z8
R-Z338
R-Z11

Alternatively, we can use another Neo4j function, reduce, to create a string display:

MATCH p = shortestPath((a:SNPNode {SNP:'R-U106'})-[:SNPChild*..99]-(b:SNPNode {SNP:'R-Z11'})) RETURN reduce(s='',q in nodes(p)|s + q.SNP + " > ") as SNP_Path

This produces this output:

R-U106 > R-Z2265 > R-BY30097 > R-Z381 > R-Z301 > R-L48 > R-Z9 > R-Z30 > R-Z27 > R-Z345 > R-Z2 > R-Z7 > R-Z8 > R-Z338 > R-Z11 >

With a different second SNP (R-S25738), we can see a different path after R-Z2:

R-U106 > R-Z2265 > R-BY30097 > R-Z381 > R-Z301 > R-L48 > R-Z9 > R-Z30 > R-Z27 > R-Z345 > R-Z2 > R-S15510 > R-Y7378 > R-Y7404 > R-S8958 > R-S20654 > R-S25738 >
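In a tree there is exactly one path between any two nodes, so the shortest path can also be found by climbing from each SNP to the point where the two lineages meet. The sketch below is not Neo4j's shortestPath algorithm; it is a plain Python illustration that rebuilds a small tree from the two path strings shown above. The helper names (`parent_map`, `lineage`, `tree_path`) are assumptions for this example, and it assumes both SNPs belong to the same loaded tree.

```python
# The two paths printed above, reused as toy input data.
PATH_TO_Z11 = ("R-U106 > R-Z2265 > R-BY30097 > R-Z381 > R-Z301 > R-L48 > R-Z9 > "
               "R-Z30 > R-Z27 > R-Z345 > R-Z2 > R-Z7 > R-Z8 > R-Z338 > R-Z11")
PATH_TO_S25738 = "R-Z2 > R-S15510 > R-Y7378 > R-Y7404 > R-S8958 > R-S20654 > R-S25738"

def parent_map(*paths):
    """Build a child -> parent lookup from ' > ' separated path strings."""
    parents = {}
    for path in paths:
        names = path.split(" > ")
        for parent, child in zip(names, names[1:]):
            parents[child] = parent
    return parents

def lineage(parents, snp):
    """List snp followed by each ancestor, up to the root of the loaded data."""
    chain = [snp]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain

def tree_path(parents, a, b):
    """Return the unique path from a to b; assumes a and b share a root."""
    up_a, up_b = lineage(parents, a), lineage(parents, b)
    in_b = set(up_b)
    junction = next(s for s in up_a if s in in_b)  # first shared node going up from a
    # Climb from a to just below the junction, then descend from the junction to b.
    return up_a[:up_a.index(junction)] + up_b[up_b.index(junction)::-1]

PARENTS = parent_map(PATH_TO_Z11, PATH_TO_S25738)
print(" > ".join(tree_path(PARENTS, "R-U106", "R-Z11")))
print(" > ".join(tree_path(PARENTS, "R-U106", "R-S25738")))
```

Both printed paths are identical up to R-Z2 and diverge after it.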

This illustrates another common question: what is the most recent branch point in the haplotree shared by two individuals with SNPs on different branches? A simple query provides the answer:

match path=(n:SNPNode{SNP:'R-Z11'})<-[:SNPChild*..99]-(MRCA)-[:SNPChild*..99]->(m:SNPNode{SNP:'R-S25738'}) return MRCA.SNP

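The answer is R-Z2. The query returns only this one node because Cypher will not traverse the same relationship twice within a single pattern: for any ancestor above R-Z2, both downward paths would have to pass through the same edge into R-Z2, so those ancestors are excluded.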
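The same branch-point question can be answered without a database by walking up from both SNPs. The following is a minimal Python sketch, assuming a hand-typed child-to-parent lookup (`PARENT`) that covers only the nodes at and below R-Z345 from the paths above; the function names are made up for illustration. Unlike the Cypher query, it treats a node as its own ancestor, so if one SNP sits directly upstream of the other it returns that upstream SNP.

```python
# Child -> parent lookup typed in from the two paths shown earlier
# (only the part of the tree at and below R-Z345).
PARENT = {
    "R-Z2": "R-Z345",
    "R-Z7": "R-Z2", "R-Z8": "R-Z7", "R-Z338": "R-Z8", "R-Z11": "R-Z338",
    "R-S15510": "R-Z2", "R-Y7378": "R-S15510", "R-Y7404": "R-Y7378",
    "R-S8958": "R-Y7404", "R-S20654": "R-S8958", "R-S25738": "R-S20654",
}

def ancestors(snp):
    """Yield snp itself, then its parent, grandparent, and so on."""
    while snp is not None:
        yield snp
        snp = PARENT.get(snp)

def most_recent_common_branch(a, b):
    """Return the lowest node that lies on both lineages, or None if there is none."""
    seen = set(ancestors(a))
    for candidate in ancestors(b):  # nearest ancestors of b come first
        if candidate in seen:
            return candidate
    return None

print(most_recent_common_branch("R-Z11", "R-S25738"))
```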

Person - Haplogroup Relationships

Genealogists use SNPs to identify related persons. In a graph database we do this by adding Person nodes and then relationships (edges) between the persons and their haplogroup nodes. Then we can find all the persons with a specific, most recent known haplogroup. We have loaded data from the FF_Green surname project into Neo4j; the screenshot reflects the data as it stood when captured:

The ascendancy tree from R-Z25289 is shown at the top of this group.

We can find all those with the haplogroup R-Z25289 with this query and display the kit number:

MATCH (s:SNPNode{SNP:'R-Z25289'})-[:PersonHG]->(p:Person) return s,p

But this query ignores others who may be on the same branch but are placed further up it because more precise SNP testing has not been done. We can design a query to find them too, but we need to exclude the top node (R-M269) because it is shared by many persons outside this grouping.

MATCH (s:SNPNode{SNP:'R-Z25289'})<-[r:SNPChild*0..99]-(t:SNPNode)-[:PersonHG]-(p:Person) where t.SNP<>'R-M269' return s,p,t,r

The results pull in the additional kits from the project that are not R-M269. Also pulled was kit 344684, which was not in the FF_Green project but had a haplogroup in the ascendancy tree. Notice also that the one father relationship helped create the shortest path between R-Z25289 and R-L48.
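The logic of that query (collect the target haplogroup plus its ancestors, drop the overly broad top node, then keep persons attached to any remaining node) can be sketched in Python. Everything in the data below is an assumption for illustration: the lineage is heavily collapsed with intermediate branches omitted, it uses R-Z11 as the target because its lineage appears earlier in this post, and the kit labels are made up rather than real project kits.

```python
# Collapsed toy lineage (child -> parent); many intermediate branches omitted.
PARENT = {"R-U106": "R-M269", "R-L48": "R-U106", "R-Z2": "R-L48", "R-Z11": "R-Z2"}

# Made-up kit labels mapped to the haplogroup each person is attached to.
PERSON_HG = {
    "kit-A": "R-Z11",
    "kit-B": "R-Z2",
    "kit-C": "R-L48",
    "kit-D": "R-M269",  # only tested down to the top node
    "kit-E": "R-P312",  # a different branch entirely
}

def persons_on_branch(target, exclude=("R-M269",)):
    """Return (kit, haplogroup) pairs for persons at target or at any of its
    ancestors, skipping haplogroups listed in exclude."""
    on_branch = set()
    snp = target
    while snp is not None:
        if snp not in exclude:
            on_branch.add(snp)
        snp = PARENT.get(snp)
    return sorted((kit, hg) for kit, hg in PERSON_HG.items() if hg in on_branch)

print(persons_on_branch("R-Z11"))
```

Passing `exclude=()` shows why the exclusion matters: the person known only as R-M269 would then be swept into the result even though nothing ties that kit to this particular branch.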

We can query for all haplogroups and get the count for each:

MATCH (p:Person) where p.kit>' ' match (p)-[:PersonHG]-(s:SNPNode) return s.SNP,count(*) as Ct order by Ct desc,s.SNP
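For readers less familiar with Cypher aggregation, the count-and-sort step looks like this in plain Python. The kit labels and haplogroup assignments are made up for illustration, and the `kit > " "` test simply mirrors the query's filter, which skips persons without a kit number.

```python
from collections import Counter

# Made-up (kit, haplogroup) pairs; the empty kit mimics a Person node with no kit number.
PERSON_HG = [
    ("kit-A", "R-M269"), ("kit-B", "R-L48"), ("kit-C", "R-M269"),
    ("kit-D", "R-Z2"), ("", "R-Z2"), ("kit-E", "R-L48"), ("kit-F", "R-M269"),
]

def haplogroup_counts(person_hg):
    """Count tested persons per haplogroup, largest count first, ties ordered by SNP name."""
    counts = Counter(hg for kit, hg in person_hg if kit > " ")
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

for snp, ct in haplogroup_counts(PERSON_HG):
    print(snp, ct)
```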

Looking Ahead

The current state of my Neo4j database is quite useful, but there are obvious enhancements that could be made. Here is a list:

Put it online where others can access it. The Neo4j web API returns JSON documents that can serve this purpose.
Enhance maintenance of the database. Currently it is rebuilt with each update. While that takes less than 2 minutes, it will become more cumbersome as the database grows.
Add synonyms and associated SNPs at the same branch point. Pretty trivial technically; it's the data sources that are sub-optimal.
Enhance display options, including phylotrees, etc. This is easy to do in R, though that may not be the best option.
Optimize geographic relationships (see early test case below).
Reader suggestions in the comments ... please contribute!

Keep an eye on this blog for future posts under the tags Neo4j and haplogroups.

Geography Test Case

The Neo4j database also has places of birth, death and marriage. So we can also query the geographic location of the haplogroups.

MATCH (p:Person) where p.kit>' ' match (p)-[:father*0..99]->(a:Person)-[:BP|DP|UnP]-(l:Place) match (p)-[:PersonHG]-(s:SNPNode) return distinct s.SNP as HG,l.Display as Place,min(a.BD) as EarliestBD order by HG,EarliestBD

A portion of the output is shown here, which accurately reflects the migration of several branches of the family across the country: