In this post we’ll discover and visualise different clusters of cancer types by analysing the disease ontology as a knowledge graph. Specifically, we’ll set up Neo4j in a Docker container, import the ontology, and generate graph clusters and embeddings, before using dimension reduction to plot these clusters and derive some insights. Although we’re using `disease_ontology` as an example, the same steps can be used to explore any ontology or graph database.
In a graph database, rather than storing data as rows (like a spreadsheet or relational database), data is stored as nodes and relationships between nodes. For instance, in the figure below we see that melanoma and carcinoma are SubCategories Of cell type cancer tumour (shown by the SCO relationship). With this kind of data we can clearly see that melanoma and carcinoma are related, even though this is not explicitly stated in the data.
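As a toy illustration (the real data is imported from the ontology later, not created by hand), a structure like the one in the figure could be written in Cypher as:

```cypher
// Two cancer types and their shared parent concept
CREATE (m:Class {label: "melanoma"})
CREATE (c:Class {label: "carcinoma"})
CREATE (t:Class {label: "cell type cancer tumour"})
// SCO: both are SubCategories Of the parent node
CREATE (m)-[:SCO]->(t)
CREATE (c)-[:SCO]->(t)
```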
Ontologies are a formalised set of concepts and the relationships between those concepts. They’re much easier for computers to parse than free text and are therefore easier to extract meaning from. Ontologies are widely used in the biological sciences, and you can find an ontology you’re interested in at https://obofoundry.org/. Here we’re focusing on the disease ontology, which shows how different types of diseases relate to one another.
Neo4j is a tool for managing, querying and analysing graph databases. To make it easier to set up, we’ll use a Docker container.
docker run \
    -it --rm \
    --publish=7474:7474 --publish=7687:7687 \
    --env NEO4J_AUTH=neo4j/123456789 \
    --env NEO4J_PLUGINS='["graph-data-science","apoc","n10s"]' \
    neo4j:5.17.0
In the above command, the `--publish` flags set ports to let Python query the database directly and allow us to access it through a browser. The `NEO4J_PLUGINS` argument specifies which plugins to install. Unfortunately, the Windows Docker image doesn’t seem to be able to handle the plugin installation, so to follow along you’ll need to install Neo4j Desktop manually. Don’t worry though, the other steps should all still work for you.
While Neo4j is running, you can access your database by going to http://localhost:7474/ in your browser, or you can use the Python driver to connect as below. Note that we’re using the port we published with our Docker command above, and we’re authenticating with the username and password we also defined above.
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"
AUTH = ("neo4j", "123456789")
driver = GraphDatabase.driver(URI, auth=AUTH)
driver.verify_connectivity()
Once you have your Neo4j database set up, it’s time to get some data. The Neo4j plugin n10s is built to import and handle ontologies; you can use it to embed your data into an existing ontology or to explore the ontology itself. With the Cypher commands below we first set some configs to make the results cleaner, then we set up a uniqueness constraint, and finally we actually import the disease ontology.
CALL n10s.graphconfig.init({ handleVocabUris: "IGNORE" });
CREATE CONSTRAINT n10s_unique_uri FOR (r:Resource) REQUIRE r.uri IS UNIQUE;
CALL n10s.onto.import.fetch("http://purl.obolibrary.org/obo/doid.owl", "RDF/XML");
To see how this can be done with the Python driver, see the full code here: https://github.com/DAWells/do_onto/blob/main/import_ontology.py
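A minimal sketch of how those same Cypher statements could be sent through the Python driver (the function name `run_import` is my own for illustration; see the linked file for the actual code):

```python
# Cypher statements from above, to be run in order
IMPORT_STATEMENTS = [
    'CALL n10s.graphconfig.init({handleVocabUris: "IGNORE"})',
    "CREATE CONSTRAINT n10s_unique_uri FOR (r:Resource) REQUIRE r.uri IS UNIQUE",
    'CALL n10s.onto.import.fetch("http://purl.obolibrary.org/obo/doid.owl", "RDF/XML")',
]

def run_import(uri="bolt://localhost:7687", auth=("neo4j", "123456789")):
    """Send each import statement to the running Neo4j container."""
    from neo4j import GraphDatabase
    with GraphDatabase.driver(uri, auth=auth) as driver:
        for statement in IMPORT_STATEMENTS:
            driver.execute_query(statement)
```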
Now that we’ve imported the ontology, you can explore it by opening http://localhost:7474/ in your web browser. This lets you explore a little of your ontology manually, but we’re interested in the bigger picture, so let’s do some analysis. Specifically, we’ll do Louvain clustering and generate fast random projection embeddings.
Louvain clustering is a clustering algorithm for networks like this one. In brief, it identifies sets of nodes that are more connected to each other than they are to the wider set of nodes; such a set is then defined as a cluster. When applied to an ontology, it is a quick way to identify sets of related concepts. Fast random projection, on the other hand, produces an embedding for each node, i.e. a numeric vector where more similar nodes have more similar vectors. With these tools we can identify which diseases are similar and quantify that similarity.
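The phrase “more similar vectors” can be made concrete with cosine similarity. A toy sketch with made-up three-dimensional vectors (real FastRP embeddings are 128-dimensional, as we’ll see below):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = identical direction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: two related cancers and one unrelated disease
melanoma = [0.9, 0.1, 0.2]
carcinoma = [0.8, 0.2, 0.1]
influenza = [-0.5, 0.7, -0.3]

print(cosine_similarity(melanoma, carcinoma))  # close to 1: similar
print(cosine_similarity(melanoma, influenza))  # much lower: dissimilar
```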
To generate embeddings and clusters, we have to “project” the parts of our graph that we’re interested in. Because ontologies are typically very large, this subsetting is a simple way to speed up computation and avoid memory errors. In this example we’re only interested in cancers and not any other type of disease. We do this with the Cypher query below: we match the node with the label “cancer” and any node that is related to it by one or more SCO or SCO_RESTRICTION relationships. Because we want to include the relationships between cancer types, we have a second MATCH clause that returns the connected cancer nodes and their relationships.
MATCH (cancer:Class {label:"cancer"})<-[:SCO|SCO_RESTRICTION *1..]-(n:Class)
WITH n
MATCH (n)-[:SCO|SCO_RESTRICTION]->(m:Class)
WITH gds.graph.project(
"proj", n, m, {}, {undirectedRelationshipTypes: ['*']}
) AS g
RETURN g.graphName AS graph, g.nodeCount AS nodes, g.relationshipCount AS rels
Once we have the projection (which we’ve called “proj”), we can calculate the clusters and embeddings and write them back to the original graph. Finally, by querying the graph we can get the new embedding and cluster for each cancer type, which we can export to a CSV file.
CALL gds.fastRP.write(
'proj',
{embeddingDimension: 128, randomSeed: 42, writeProperty: 'embedding'}
) YIELD nodePropertiesWritten;
CALL gds.louvain.write(
"proj",
{writeProperty: "louvain"}
) YIELD communityCount
MATCH (cancer:Class {label:"cancer"})<-[:SCO|SCO_RESTRICTION *0..]-(n)
RETURN DISTINCT
n.label AS label,
n.embedding AS embedding,
n.louvain AS louvain
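The export step itself could be sketched with the Python driver like this (the function name `export_nodes` and CSV layout are my own; the real code is in the linked repository):

```python
import csv

def export_nodes(driver, path="export.csv"):
    """Run the export query and write one row per cancer node.

    `driver` is a connected neo4j GraphDatabase driver, as set up earlier.
    """
    query = """
    MATCH (cancer:Class {label: "cancer"})<-[:SCO|SCO_RESTRICTION *0..]-(n)
    RETURN DISTINCT n.label AS label, n.embedding AS embedding, n.louvain AS louvain
    """
    records, _, _ = driver.execute_query(query)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["label", "embedding", "louvain"])
        for record in records:
            writer.writerow([record["label"], record["embedding"], record["louvain"]])
```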
Let’s take a look at some of these clusters to see which kinds of cancers are grouped together. Once we’ve loaded the exported data into a pandas dataframe in Python, we can inspect individual clusters.
Cluster 2168 is a set of pancreatic cancers.
nodes[nodes.louvain == 2168]["label"].values
#array(['"islet cell tumor"',
# '"non-functioning pancreatic endocrine tumor"',
# '"pancreatic ACTH hormone producing tumor"',
# '"pancreatic somatostatinoma"',
# '"pancreatic vasoactive intestinal peptide producing tumor"',
# '"pancreatic gastrinoma"', '"pancreatic delta cell neoplasm"',
# '"pancreatic endocrine carcinoma"',
# '"pancreatic non-functioning delta cell tumor"'], dtype=object)
Cluster 174 is a bigger group of cancers but mostly carcinomas.
nodes[nodes.louvain == 174]["label"]
#array(['"head and neck cancer"', '"glottis carcinoma"',
# '"head and neck carcinoma"', '"squamous cell carcinoma"',
#...
# '"pancreatic squamous cell carcinoma"',
# '"pancreatic adenosquamous carcinoma"',
#...
# '"mixed epithelial/mesenchymal metaplastic breast carcinoma"',
# '"breast mucoepidermoid carcinoma"'], dtype=object)
These are sensible groupings, based on either organ or cancer type, and will be useful for visualisation. The embeddings, however, are still too high dimensional to be visualised meaningfully. Fortunately, TSNE is a very useful method for dimension reduction. Here, we use TSNE to reduce the embedding from 128 dimensions down to 2, while still keeping closely related nodes close together. We can confirm that this has worked by plotting these two dimensions as a scatter plot and colouring by the Louvain clusters. If these two methods agree, we should see nodes clustering by colour.
import ast

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.colors import Normalize
from sklearn.manifold import TSNE

nodes = pd.read_csv("export.csv")
nodes['louvain'] = pd.Categorical(nodes.louvain)
embedding = nodes.embedding.apply(lambda x: ast.literal_eval(x))
embedding = embedding.tolist()
embedding = pd.DataFrame(embedding)
tsne = TSNE()
X = tsne.fit_transform(embedding)
fig, axes = plt.subplots()
axes.scatter(
X[:,0],
X[:,1],
c = cm.tab20(Normalize()(nodes['louvain'].cat.codes))
)
plt.show()
Which is exactly what we see: similar types of cancer are grouped together and visible as clusters of a single colour. Note that some nodes of a single colour are very far apart; this is because we have to reuse some colours, as there are 29 clusters and only 20 colours. This gives us a great overview of the structure of our knowledge graph, but we can also add our own data.
Below we plot the frequency of each cancer type as node size and the mortality rate as the opacity (Bray et al 2024). I only had access to this data for a few of the cancer types, so I’ve only plotted those nodes. We can see that liver cancer doesn’t have an especially high incidence overall. However, incidence rates of liver cancer are much higher than those of other cancers within its cluster (shown in purple), like oropharynx, larynx, and nasopharynx cancers.
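A sketch of how such a plot can be drawn, with made-up coordinates, incidence and mortality values standing in for the real data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical TSNE coordinates and rates for three nodes (illustration only)
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.5, 0.5])
incidence = np.array([5.0, 20.0, 50.0])  # cases per 100,000, made up
mortality = np.array([0.2, 0.5, 0.9])    # proportion, made up

fig, ax = plt.subplots()
ax.scatter(
    x, y,
    s=incidence * 10,  # node size scales with incidence
    # opacity per node via RGBA: purple with alpha set to mortality
    c=[(0.5, 0.0, 0.5, m) for m in mortality],
)
fig.savefig("bubble.png")
```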
Here we’ve used the disease ontology to group different cancers into clusters, which gives us the context to compare these diseases. Hopefully this little project has shown you how to visually explore an ontology and add that information to your own data.
You can find the full code for this project at https://github.com/DAWells/do_onto.
Bray, F., Laversanne, M., Sung, H., Ferlay, J., Siegel, R. L., Soerjomataram, I., & Jemal, A. (2024). Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a cancer journal for clinicians, 74(3), 229–263.