
Analyze Scientific Publications with E-utilities and Python


To query an NCBI database effectively, you'll need to learn a few of the E-utilities, define your search fields, and select your search parameters, which control how results are returned to your browser or, in our case, to Python, which we'll use to query the databases.

The four most useful E-utilities

There are nine E-utilities available from NCBI, and they are all implemented as server-side fast CGI programs. This means you access them by building URLs that end in .fcgi and specify query parameters after a question mark, with parameters separated by ampersands (see the sketch after the list below). All of them, except EFetch, will give you either XML or JSON output.

  • ESearch generates a list of ID numbers that match your search query

The following E-utilities can be used with one or more ID numbers:

  • ESummary returns the journal, author list, grants, dates, references, and publication type
  • EFetch (XML only) returns all of what ESummary provides in addition to the abstract, the list of grants used in the research, the authors' institutions, and MeSH keywords
  • ELink provides a list of links to related citations based on a computed similarity score, as well as a link to the published item (your gateway to the full text of the article)
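
All four tools share the same URL anatomy, so it can be convenient to assemble the query strings with a small helper. Below is a minimal sketch; the build_url helper and the example parameter values are my own illustration, not part of the E-utilities themselves.

# Hypothetical helper: assemble <base>/<utility>.fcgi?key=value&key=value...
BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils'

def build_url(utility, **params):
    query = '&'.join(f'{key}={value}' for key, value in params.items())
    return f'{BASE}/{utility}.fcgi?{query}'

print(build_url('esearch', db='pubmed', term='myoglobin[mesh]', retmode='json', retmax=20))
print(build_url('esummary', db='pubmed', id='37047528', retmode='json'))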

NCBI hosts 38 databases across its servers, covering a wide variety of data that goes beyond literature citations. To get a complete list of current databases, you can use EInfo without search terms:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi

Each database varies in how it can be accessed and the information it returns. For our purposes, we'll focus on the pubmed and pmc databases, because these are where scientific literature is searched and retrieved.

The two most important things to learn about searching NCBI are search fields and outputs. The search fields are numerous and depend on the database. The outputs are more straightforward, and learning how to use them is important, especially for doing large searches.

Search fields

You won't be able to really harness the power of the E-utilities without knowing about the available search fields. You'll find a full list of these search fields on the NLM website along with a description of each, but for the most accurate list of search terms specific to a database, you'll need to parse your own XML list using this link:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed

with the db flag set to the database (we'll use pubmed for this article, but literature is also available through pmc).
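
If you'd rather inspect those fields programmatically, a small sketch like the one below can parse the EInfo XML; it assumes each searchable field appears as a <Field> element with <Name> and <Description> children, so check the raw XML if your output is organized differently.

import urllib.request
from bs4 import BeautifulSoup

# Fetch the EInfo XML for pubmed and list each search field with its description
einfo_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed'
einfo_xml = urllib.request.urlopen(einfo_url).read().decode('utf-8')
einfo_bs = BeautifulSoup(einfo_xml, features="xml")

for field in einfo_bs.find_all('Field'):
    print(field.find('Name').text, '-', field.find('Description').text)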

A list of search fields for querying PubMed MEDLINE records. (Source: https://www.nlm.nih.gov/bsd/mms/medlineelements.html)

One especially useful search field is Medical Subject Headings (MeSH).[3] Indexers, who are experts in the field, maintain the PubMed database and use MeSH terms to reflect the subject matter of journal articles as they are published. Each indexed publication is typically described by 10 to 12 carefully chosen MeSH terms. If no search field is specified, then queries will be executed against every search field available in the database queried.[4]

Query parameters

Each of the E-utilities accepts multiple query parameters through the URL, which you can use to control the type and amount of output returned from a query. This is where you can set the number of search results retrieved or the dates searched. Here is a list of the more important parameters:

Database parameter:

  • db should be set to the database you are interested in searching (pubmed or pmc for scientific literature)

Date parameters: You can get finer control over dates by using search fields, for example [pdat] for the publication date, but the date parameters provide a more convenient way to constrain results.

  • reldate the number of days to be searched relative to the current date; set reldate=1 for the most recent day
  • mindate and maxdate specify dates in the format YYYY/MM/DD, YYYY, or YYYY/MM (a query must contain both mindate and maxdate parameters)
  • datetype sets the type of date used when you query by date; options are ‘mdat’ (modification date), ‘pdat’ (publication date), and ‘edat’ (Entrez date)

Retrieval parameters:

  • rettype the type of information to return (for literature searches, use the default setting)
  • retmode format of the output (XML is the default, though all E-utilities except EFetch also support JSON)
  • retmax is the maximum number of records to return; the default is 20 and the maximum value is 10,000 (ten thousand)
  • retstart given the list of hits for a query, retstart specifies the index of the first record to return (useful when your search exceeds the 10,000 maximum; see the paging sketch after this list)
  • cmd this is only relevant to ELink and is used to specify whether to return IDs of similar articles or URLs to full texts
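
As a concrete illustration of retmax and retstart working together, here is a minimal paging sketch; the search term and dates are placeholders, and the short sleep keeps the request rate near the limit mentioned later in the article.

import json
import time
import urllib.request

# Illustrative query: page through the result set 100 IDs at a time
base_url = ('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
            '?db=pubmed&term=myoglobin[mesh]&datetype=pdat'
            '&mindate=2022&maxdate=2023&retmode=json&retmax=100')

all_ids = []
retstart = 0
while True:
    page = json.loads(urllib.request.urlopen(f'{base_url}&retstart={retstart}').read())
    ids = page['esearchresult']['idlist']
    all_ids.extend(ids)
    retstart += len(ids)
    if not ids or retstart >= int(page['esearchresult']['count']):
        break
    time.sleep(0.34)  # stay near 3 requests per second

print(f'{len(all_ids)} IDs collected')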

Once we know about the E-utilities, have chosen our search fields, and decided on our query parameters, we're ready to execute queries and store the results, even across multiple pages.

While you don't specifically need to use Python to work with the E-utilities, it does make it much easier to parse, store, and analyze the results of your queries. Here's how to get started on your data science project.

Let's say you want to search MeSH terms for the term "myoglobin" between 2022 and 2023. You'll set your retmax to 50 for now, but remember the maximum is 10,000 and you can query at a rate of 3 requests per second.

import urllib.request

# Build the ESearch URL: MeSH term "myoglobin", published 2022-2023, JSON output, up to 50 IDs
search_url = (f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/esearch.fcgi/'
              f'?db=pubmed'
              f'&term=myoglobin[mesh]'
              f'&mindate=2022'
              f'&maxdate=2023'
              f'&retmode=json'
              f'&retmax=50')

link_list = urllib.request.urlopen(search_url).read().decode('utf-8')
link_list

The output of the esearch query from above.

The results are returned as a list of IDs, which can be used in a subsequent search within the database you queried. Note that "count" shows there are 154 results for this query, which you could use if you wanted to get a total count of publications for a certain set of search terms. If you wanted to return IDs for all of the publications, you would set the retmax parameter to the count, or 154. Generally, I set this to a very high number so I can retrieve all of the results and store them.
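
For example, here is a small sketch, assuming the link_list JSON from the ESearch call above, that reads the reported count and then re-issues the query with retmax set to that count so every ID comes back:

import json
import urllib.request

# Read "count" from the first response, then request everything in one go
first_pass = json.loads(link_list)
total_count = int(first_pass['esearchresult']['count'])

full_url = ('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
            '?db=pubmed&term=myoglobin[mesh]&mindate=2022&maxdate=2023'
            f'&retmode=json&retmax={total_count}')
all_ids = json.loads(urllib.request.urlopen(full_url).read())['esearchresult']['idlist']
print(total_count, len(all_ids))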

Boolean searching is easy with PubMed; it only requires adding +OR+, +NOT+, or +AND+ to the URL between search terms. For example:

http://eutils.ncbi.nlm.nih.gov/entrez//eutils/esearch.fcgi/?db=pubmed&term=CEO[cois]+OR+CTO[cois]+OR+CSO[cois]&mindate=2022&maxdate=2023&retmax=10000

These search strings can be constructed using Python. In the next steps, we'll parse the results using Python's json package to get the IDs for each of the publications returned. The IDs can then be used to create a string, and this string of IDs can be used by the other E-utilities to return information about the publications.
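
For example, a minimal sketch of building such a Boolean ESearch URL in Python (the term list here is just an illustration):

import urllib.request

# Join the individual search terms with a Boolean OR
terms = ['CEO[cois]', 'CTO[cois]', 'CSO[cois]']
boolean_term = '+OR+'.join(terms)

search_url = ('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
              f'?db=pubmed&term={boolean_term}'
              '&mindate=2022&maxdate=2023&retmode=json&retmax=10000')
boolean_results = urllib.request.urlopen(search_url).read().decode('utf-8')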

Use ESummary to return information about publications

The purpose of ESummary is to return data that you might expect to see in a paper's citation (date of publication, page numbers, authors, etc.). Once you have a result in the form of a list of IDs from ESearch (from the step above), you can join this list into a long URL.

The limit for a URL is 2,048 characters, and each publication's ID is 8 characters long, so to be safe, you should split your list of IDs into batches of 250 if you have a list larger than 250 IDs. See my notebook at the bottom of the article for an example.
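
A minimal batching sketch, assuming full_id_list is a Python list of ID strings (a hypothetical name; the idlist returned by ESearch would work), with the batch size of 250 following the rule of thumb above:

# Split the IDs into batches of at most 250 before building each ESummary URL
batch_size = 250
batches = [full_id_list[i:i + batch_size] for i in range(0, len(full_id_list), batch_size)]

for batch in batches:
    ids_joined = ','.join(batch)
    # ...build the ESummary URL with id=ids_joined and request it, as shown below...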

The results from an ESummary query are returned in JSON format and can include a link to the paper's full text:

import json

# Parse the ESearch JSON and join the returned IDs into a single comma-separated string
result = json.loads( link_list )
id_list = ','.join( result['esearchresult']['idlist'] )

summary_url = f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/esummary.fcgi?db=pubmed&id={id_list}&retmode=json'

summary_list = urllib.request.urlopen(summary_url).read().decode('utf-8')

We can again use json to parse summary_list. When using the json package, you can browse the fields of each individual article by using summary['result'][id as string], as in the example below:

summary = json.loads( summary_list )
summary['result']['37047528']

We can create a dataframe to capture the ID of each article along with the name of the journal, the publication date, the title of the article, a URL for retrieving the full text, as well as the first and last author.

import re
import pandas as pd

uid = [ x for x in summary['result'] if x != 'uids' ]
journals = [ summary['result'][x]['fulljournalname'] for x in summary['result'] if x != 'uids' ]
titles = [ summary['result'][x]['title'] for x in summary['result'] if x != 'uids' ]
first_authors = [ summary['result'][x]['sortfirstauthor'] for x in summary['result'] if x != 'uids' ]
last_authors = [ summary['result'][x]['lastauthor'] for x in summary['result'] if x != 'uids' ]
links = [ summary['result'][x]['elocationid'] for x in summary['result'] if x != 'uids' ]
pubdates = [ summary['result'][x]['pubdate'] for x in summary['result'] if x != 'uids' ]

# Turn "doi: 10.xxxx/..." elocation strings into resolvable URLs
links = [ re.sub(r'doi:\s*', 'http://dx.doi.org/', x) for x in links ]
results_df = pd.DataFrame( {'ID':uid,'Journal':journals,'PublicationDate':pubdates,'Title':titles,'URL':links,'FirstAuthor':first_authors,'LastAuthor':last_authors} )

Below is a list of all the different fields that ESummary returns so you can build your own database:

'uid','pubdate','epubdate','source','authors','lastauthor','title',
'sorttitle','volume','issue','pages','lang','nlmuniqueid','issn',
'essn','pubtype','recordstatus','pubstatus','articleids','history',
'references','attributes','pmcrefcount','fulljournalname','elocationid',
'doctype','srccontriblist','booktitle','medium','edition',
'publisherlocation','publishername','srcdate','reportnumber',
'availablefromurl','locationlabel','doccontriblist','docdate',
'bookname','chapter','sortpubdate','sortfirstauthor','vernaculartitle'

Use EFetch when you want abstracts, keywords, and other details (XML output only)

We can use EFetch to return similar fields to ESummary, with the caveat that the result is returned in XML only. There are several interesting additional fields available through EFetch, including the abstract, author-selected keywords, the Medical Subject Headings (MeSH terms), grants that sponsored the research, conflict of interest statements, a list of chemicals used in the research, and a complete list of all the references cited by the paper. Here's how you would use BeautifulSoup to obtain some of these items:

from bs4 import BeautifulSoup
import lxml
import pandas as pd

abstract_url = f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/efetch.fcgi?db=pubmed&id={id_list}'
abstract_ = urllib.request.urlopen(abstract_url).read().decode('utf-8')
abstract_bs = BeautifulSoup(abstract_,features="xml")

articles_iterable = abstract_bs.find_all('PubmedArticle')

# Abstracts
abstract_texts = [ x.find('AbstractText').text if x.find('AbstractText') is not None else '' for x in articles_iterable ]

# Conflict of Interest statements
coi_texts = [ x.find('CoiStatement').text if x.find('CoiStatement') is not None else '' for x in articles_iterable ]

# MeSH terms
meshheadings_all = list()
for article in articles_iterable:
    heading_list = article.find('MeshHeadingList')
    if heading_list is not None:
        result = heading_list.find_all('MeshHeading')
        meshheadings_all.append( [ x.text for x in result ] )
    else:
        meshheadings_all.append( [] )

# ReferenceList
references_all = list()
for article in articles_iterable:
    if article.find('ReferenceList') is not None:
        result = article.find('ReferenceList').find_all('Citation')
        references_all.append( [ x.text for x in result ] )
    else:
        references_all.append( [] )

results_table = pd.DataFrame( {'COI':coi_texts, 'Abstract':abstract_texts, 'MeSH_Terms':meshheadings_all, 'References':references_all} )

Now we can use this table to search abstracts and conflict of interest statements, or make visuals that connect different fields of research using MeSH headings and reference lists. There are of course many other tags returned by EFetch that you could explore; here's how you can see all of them using BeautifulSoup:

efetch_url = f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/efetch.fcgi?db=pubmed&id={id_list}'
efetch_result = urllib.request.urlopen( efetch_url ).read().decode('utf-8')
efetch_bs = BeautifulSoup(efetch_result,features="xml")

tags = efetch_bs.find_all()

for tag in tags:
    print(tag)
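
Since printing every element can be overwhelming, one small variation (my own convenience, not an E-utilities feature) is to print each distinct tag name only once:

# Print each distinct tag name once for a quick overview of the XML structure
tag_names = sorted({tag.name for tag in efetch_bs.find_all()})
print(tag_names)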

Using ELink to retrieve similar publications and full-text links

You may want to find articles similar to those returned by your search query. These articles are grouped based on a similarity score using a probabilistic topic-based model.[5] To retrieve the similarity scores for a given ID, you pass cmd=neighbor_score in your call to ELink. Here's an example for one article:

import urllib.request
import json
import pandas as pd

id_ = '37055458'
elink_url = f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/elink.fcgi?db=pubmed&id={id_}&retmode=json&cmd=neighbor_score'
elinks = urllib.request.urlopen(elink_url).read().decode('utf-8')

elinks_json = json.loads( elinks )

ids_ = []
score_ = []
all_links = elinks_json['linksets'][0]['linksetdbs'][0]['links']
for link in all_links:
    ids_.append( link['id'] )
    score_.append( link['score'] )

pd.DataFrame( {'id':ids_,'score':score_} ).drop_duplicates(['id','score'])

The other function of ELink is to provide full-text links to an article based on its ID, which can be returned if you pass cmd=prlinks to ELink instead.

If you want to access only those full-text links that are free to the public, you'll want to use links that contain "pmc" (PubMed Central). Accessing articles behind a paywall may require a subscription through a university; before downloading a large corpus of full-text articles from behind a paywall, you should consult with your organization's librarians.

Here's a code snippet showing how you might retrieve the links for a publication:

id_ = '37055458'
elink_url = f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/elink.fcgi?db=pubmed&id={id_}&retmode=json&cmd=prlinks'
elinks = urllib.request.urlopen(elink_url).read().decode('utf-8')

elinks_json = json.loads( elinks )

[ x['url']['value'] for x in elinks_json['linksets'][0]['idurllist'][0]['objurls'] ]
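
If you only want the publicly accessible copies, a simple filter like the sketch below, which assumes the same elinks_json as above, keeps just the URLs containing "pmc":

# Keep only links that point to PubMed Central (free full text)
all_urls = [ x['url']['value'] for x in elinks_json['linksets'][0]['idurllist'][0]['objurls'] ]
pmc_urls = [ url for url in all_urls if 'pmc' in url.lower() ]
pmc_urls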

You may also retrieve links for multiple publications in a single call to ELink, as I show below:

id_list = '37055458,574140'
elink_url = f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/elink.fcgi?db=pubmed&id={id_list}&retmode=json&cmd=prlinks'
elinks = urllib.request.urlopen(elink_url).read().decode('utf-8')

elinks_json = json.loads( elinks )

elinks_json
urls_ = elinks_json['linksets'][0]['idurllist']
for url_ in urls_:
    for x in url_['objurls']:
        print( url_['id'], x['url']['value'] )

Occasionally, a scientific publication will be authored by someone who is a CEO, CSO, or CTO of a company. With PubMed, we have the ability to analyze the latest life science industry trends. Conflict of interest statements, which were introduced as a search field in PubMed in 2017,[6] give a lens into which author-provided keywords appear in publications where an industry executive is disclosed as an author; in other words, the keywords chosen by the authors to describe their findings. To perform this search, simply include CEO[cois]+OR+CSO[cois]+OR+CTO[cois] as a search term in your URL, retrieve all of the results returned, and extract the keywords from the resulting XML output for each publication. Each publication contains between 4 and 8 keywords. Once the corpus is generated, you can quantify keyword frequency per year within the corpus as the number of publications in a year specifying a keyword, divided by the number of publications for that year.

For example, if 10 publications list the keyword "cancer" and there are 1,000 publications that year, the keyword frequency would be 0.01. Using Seaborn's clustermap module with the keyword frequencies, you can generate a visualization where darker bands indicate a larger value of keyword frequency per year (I have dropped COVID-19 and SARS-COV-2 from the visualization as they were each represented at frequencies far greater than 0.05, predictably).

Clustermap of author-specified keyword frequencies for publications with a C-suite author listed, generated by the author using Seaborn's clustermap module.

From this visualization, several insights about the corpus of publications with C-suite authors listed become clear. First, one of the most distinct clusters (at the bottom) contains keywords that have been strongly represented in the corpus for the past five years: cancer, machine learning, biomarkers, and artificial intelligence, just to name a few. Clearly, industry is heavily active and publishing in these areas. A second cluster, near the middle of the figure, shows keywords that disappeared from the corpus after 2018, including physical activity, public health, children, mass spectrometry, and mhealth (mobile health). This is not to say that these areas are not being developed in industry, just that the publication activity has slowed. Looking at the bottom right of the figure, you can extract terms that have appeared more recently in the corpus, including liquid biopsy and precision medicine, which are indeed two very "hot" areas of medicine at the moment. By examining the publications further, you could extract the names of the companies and other information of interest. Below is the code I wrote to generate this visual:

import pandas as pd
import time
from bs4 import BeautifulSoup
import seaborn as sns
from matplotlib import pyplot as plt
import itertools
from collections import Counter
from numpy import array_split
from urllib.request import urlopen

class Searcher:
    # Any instance of Searcher will search for the terms and return the number of results on a per-year basis
    def __init__(self, start_, end_, term_, **kwargs):
        self.raw_ = input
        self.name_ = 'searcher'
        self.description_ = 'searcher'
        self.duration_ = end_ - start_
        self.start_ = start_
        self.end_ = end_
        self.term_ = term_
        self.search_results = list()
        self.count_by_year = list()
        self.options = list()

        # Parse keyword arguments

        if 'count' in kwargs and kwargs['count'] == 1:
            self.options = 'rettype=count'

        if 'retmax' in kwargs:
            self.options = f'retmax={kwargs["retmax"]}'

        if 'run' in kwargs and kwargs['run'] == 1:
            self.do_search()
            self.parse_results()

    def do_search(self):
        datestr_ = [self.start_ + x for x in range(self.duration_)]
        options = "".join(self.options)
        for year in datestr_:
            this_url = (f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/esearch.fcgi/'
                        f'?db=pubmed&term={self.term_}'
                        f'&mindate={year}&maxdate={year + 1}&{options}')
            print(this_url)
            self.search_results.append(
                urlopen(this_url).read().decode('utf-8'))
            time.sleep(.33)

    def parse_results(self):
        for result in self.search_results:
            xml_ = BeautifulSoup(result, features="xml")
            self.count_by_year.append(xml_.find('Count').text)
            self.ids = [id.text for id in xml_.find_all('Id')]

    def __repr__(self):
        return repr(f'Search PubMed from {self.start_} to {self.end_} with search terms {self.term_}')

    def __str__(self):
        return self.description_

# Create a list which will contain Searchers that retrieve results for each of the search queries
searchers = list()
searchers.append(Searcher(2022, 2023, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2021, 2022, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2020, 2021, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2019, 2020, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2018, 2019, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))

# Create a dictionary to store keywords for all articles from a particular year
keywords_dict = dict()

# Each searcher obtained results for a particular start and end year
# Iterate over searchers
for this_search in searchers:

    # Split the results from one search into batches for URL formatting
    chunk_size = 200
    batches = array_split(this_search.ids, len(this_search.ids) // chunk_size + 1)

    # Create a dict key for this searcher object based on the years of coverage
    this_dict_key = f'{this_search.start_}to{this_search.end_}'

    # Each value in the dictionary will be a list that gets appended with keywords for each article
    keywords_all = list()

    for this_batch in batches:
        ids_ = ','.join(this_batch)

        # Pull down the XML containing all of the results in a batch
        abstract_url = f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/efetch.fcgi?db=pubmed&id={ids_}'

        abstract_ = urlopen(abstract_url).read().decode('utf-8')
        abstract_bs = BeautifulSoup(abstract_, features="xml")
        articles_iterable = abstract_bs.find_all('PubmedArticle')

        # Iterate over all of the articles in the batch
        for article in articles_iterable:
            result = article.find_all('Keyword')
            if result is not None:
                keywords_all.append([x.text for x in result])
            else:
                keywords_all.append([])

        # Take a break between batches!
        time.sleep(1)

    # Once all of the keywords are assembled for a searcher, add them to the dictionary
    keywords_dict[this_dict_key] = keywords_all

    # Print the key once its keywords have been added
    print(this_dict_key)

# Limit to words that appeared roughly five times or more in any given year

mapping_ = {'2018to2019':2018,'2019to2020':2019,'2020to2021':2020,'2021to2022':2021,'2022to2023':2022}
global_word_list = list()

for key_,value_ in keywords_dict.items():
    Ntitles = len( value_ )
    flattened_list = list( itertools.chain(*value_) )

    flattened_list = [ x.lower() for x in flattened_list ]
    counter_ = Counter( flattened_list )
    words_this_year = [ ( item , frequency/Ntitles , mapping_[key_] ) for item, frequency in counter_.items() if frequency/Ntitles >= .005 ]
    global_word_list.extend(words_this_year)

# Plot results as clustermap

global_word_df = pd.DataFrame(global_word_list)
global_word_df.columns = ['word', 'frequency', 'year']
pivot_df = global_word_df.loc[:, ['word', 'year', 'frequency']].pivot(index="word", columns="year",
                                                                      values="frequency").fillna(0)

pivot_df.drop('covid-19', axis=0, inplace=True)
pivot_df.drop('sars-cov-2', axis=0, inplace=True)

sns.set(font_scale=0.7)
plt.figure(figsize=(22, 2))
res = sns.clustermap(pivot_df, col_cluster=False, yticklabels=True, cbar=True)

After reading this article, you should be able to go from crafting highly tailored search queries of the scientific literature all the way to generating data visualizations for closer scrutiny. While there are other, more complex ways to access and store articles using additional features of the various E-utilities, I have tried to present the most straightforward set of operations that should apply to most use cases for a data scientist interested in scientific publishing trends. By familiarizing yourself with the E-utilities as I have presented them here, you will go far toward understanding the trends and connections within the scientific literature. As mentioned, there are many items beyond publications that can be unlocked by mastering the E-utilities and how they operate within the larger universe of NCBI databases.
