
A Recommendation System For Academic Research (And Other Data Types)!


Photo by Shubham Dhage on Unsplash

Many of the projects people develop today begin with one crucial first step: active research. Investing in what other people have done and building on their work is essential to your project's ability to add value. Not only should you learn from the strong conclusions of other people's work, but you also need to determine what you shouldn't do in your project to ensure its success.

As I worked through my thesis, I began collecting many different types of research files. For example, I had collections of various academic publications I read through, as well as Excel sheets containing the results of different experiments. As I completed the research for my thesis, I wondered: Is there a way to create a recommendation system that can compare all of the research I have in my archive and help guide me in my next project?

In fact, there is!

Note: Not only would this work for a repository of all the research you may be collecting from various search engines, but it would also work for any directory you have containing various types of documents.

I developed this recommendation system with my team using Python 3.

There are numerous APIs that support this recommendation system, and researching what each specific API can do may be helpful for your own learning.

import string 
import csv
from io import StringIO
from pptx import Presentation
import docx2txt
import PyPDF2
import spacy
import pandas as pd
import numpy as np
import nltk
import re
import openpyxl
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.parsing.preprocessing import STOPWORDS as SW
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet
import networkx as nx
from networkx.algorithms.shortest_paths import weighted
import glob

The Hurdle

One big hurdle I had to overcome was the need for the recommendation system to compare different types of files. For example, I wanted to see if an Excel spreadsheet contains information that is similar or connected to the information within a PowerPoint presentation or an academic PDF journal. The trick to doing this was reading every file type into Python and transforming each object into a single string of words. This normalizes all of the data and allows for the calculation of a similarity metric.
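
To make the idea concrete, here is a minimal sketch of that pipeline on its own. The three strings below are made up, standing in for a parsed PDF, a spreadsheet, and a slide deck.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical strings standing in for a parsed PDF, a spreadsheet, and a slide deck
docs = ["generative adversarial network image synthesis",
        "stock price history quarterly earnings",
        "generative adversarial network training slides"]

vectors = TfidfVectorizer().fit_transform(docs)
# The first and third "files" share the most terms, so they score highest
print(cosine_similarity(vectors[0], vectors[2]))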

PDF Reading Class

The first class we'll look at for this project is the pdfReader class, which is able to format a PDF to be readable in Python. Of all of the file formats, I'd argue that PDFs are one of the most important, since many of the journal articles downloaded from research repositories such as Google Scholar are in PDF format.

class pdfReader:

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def PDF_one_pager(self) -> str:
        """A function which returns a one line string of the
        pdf.

        Returns:
        one_page_pdf (str): A one line string of the pdf.
        """
        content = ""
        p = open(self.file_path, "rb")
        pdf = PyPDF2.PdfReader(p)
        num_pages = len(pdf.pages)
        for i in range(0, num_pages):
            content += pdf.pages[i].extract_text() + "\n"
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        page_number_removal = r"\d{1,3} of \d{1,3}"
        page_number_removal_pattern = re.compile(page_number_removal, re.IGNORECASE)
        content = re.sub(page_number_removal_pattern, '', content)

        return content

    def pdf_reader(self) -> PyPDF2.PdfReader:
        """A function which reads .pdf formatted files
        and returns a python readable pdf.

        Returns:
        read_pdf: A python readable .pdf file.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfReader(opener)

        return read_pdf

    def pdf_info(self) -> dict:
        """A function which returns an information dictionary of a
        pdf.

        Returns:
        dict(pdf_info_dict): A dictionary containing the metadata
        of the article.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfReader(opener)
        pdf_info_dict = {}
        for key, value in read_pdf.metadata.items():
            pdf_info_dict[re.sub('/', "", key)] = value
        return pdf_info_dict

    def pdf_dictionary(self) -> dict:
        """A function which returns a dictionary of
        the article where the keys are the pages
        and the text within the pages are the
        values.

        Returns:
        dict(pdf_dict): A dictionary of pages and text.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfReader(opener)
        length = len(read_pdf.pages)
        pdf_dict = {}
        for i in range(length):
            page = read_pdf.pages[i]
            text = page.extract_text()
            pdf_dict[i] = text
        return pdf_dict
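
Here is a quick usage sketch of the class (the file path is hypothetical):

# Hypothetical path to a downloaded journal article
pdf = pdfReader('/content/drive/MyDrive/database/example_paper.pdf')
paper_text = pdf.PDF_one_pager()
print(paper_text[:200])  # first 200 characters of the flattened article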


Microsoft Excel Reader

Sometimes researchers will include Excel sheets of their results with their publications. Being able to read the column names, and even the values, can help with recommending results that are similar to what you are looking for. For example, what if you were researching information on the past performance of a certain stock? Perhaps you search for the name and symbol, which are annotated in a historical performance Excel sheet. This recommendation system would recommend the Excel sheet to you to assist with your research.


class xlsxReader:

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def xlsx_text(self) -> str:
        """A function which returns the string of an
        excel document.

        Returns:
        text(str): String of text of a document.
        """
        inputExcelFile = self.file_path
        text = str()
        wb = openpyxl.load_workbook(inputExcelFile)
        # This will save each excel sheet as a CSV file and collect its text
        for sn in wb.sheetnames:
            excelFile = pd.read_excel(inputExcelFile, engine='openpyxl', sheet_name=sn)
            excelFile.to_csv("ResultCsvFile.csv", index=None, header=True)

            with open("ResultCsvFile.csv", "r") as csvFile:
                lines = csvFile.read().split(",")  # "\r\n" if needed
                for val in lines:
                    if val != '':
                        text += val + ' '
                text = text.replace('\ufeff', '')
                text = text.replace('\n', ' ')
        return text

CSV File Reader

The csvReader class will allow CSV files to be included in your database and used in the system's recommendations.


class csvReader:

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def csv_text(self) -> str:
        """A function which returns the string of a
        csv document.

        Returns:
        text(str): String of text of a document.
        """
        text = str()
        with open(self.file_path, "r") as csvFile:
            lines = csvFile.read().split(",")  # "\r\n" if needed
            for val in lines:
                text += val + ' '
            text = text.replace('\ufeff', '')
            text = text.replace('\n', ' ')
        return text

Microsoft PowerPoint Reader

Here's a helpful class. Not many people think about how much valuable information is stored within the bodies of PowerPoint presentations. These presentations are by and large created to visualize key ideas and information for the audience. The following class will help relate any PowerPoints you have in your database to other bodies of information, in hopes of steering you towards connected pieces of work.

class pptReader:

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def ppt_text(self) -> str:
        """A function which returns the string of a
        Microsoft PowerPoint document.

        Returns:
        text(str): String of text of a document.
        """
        prs = Presentation(self.file_path)
        text = str()
        for slide in prs.slides:
            for shape in slide.shapes:
                if not shape.has_text_frame:
                    continue
                for paragraph in shape.text_frame.paragraphs:
                    for run in paragraph.runs:
                        text += ' ' + run.text

        return text

Microsoft Word Document Reader

The final class for this system is a Microsoft Word document reader. Word documents are another valuable source of information. Many people will write reports, documenting their findings and ideas, in Word document format.

class wordDocReader:

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def word_reader(self) -> str:
        """A function which returns the string of a
        Microsoft Word document.

        Returns:
        text(str): String of text of a document.
        """
        text = docx2txt.process(self.file_path)
        text = text.replace('\n', ' ')
        text = text.replace('\xa0', ' ')
        text = text.replace('\t', ' ')
        return text

That's a wrap for the classes used in today's project. Please note: there are tons of other file types you can use to enhance your recommendation system. A version of the code currently being developed will accept images and try to relate them to other documents within a database!

Preprocessing

Let's look at how to preprocess this data. This recommendation system was built for a repository of academic research, so the text needed to be broken down using preprocessing steps guided by Natural Language Processing (NLP).

The data processing class is simply called dataprocessor, and the first function within the class is a part-of-speech tagger.

class dataprocessor:
    def __init__(self):
        return

    @staticmethod
    def get_wordnet_pos(text: str) -> str:
        """Map POS tag to the first character lemmatize() accepts
        Inputs:
        text(str): A string of text

        Returns:
        tag_dict.get(tag, wordnet.NOUN) (str): The WordNet part-of-speech tag
        """
        tag = nltk.pos_tag([text])[0][1][0].upper()
        tag_dict = {"J": wordnet.ADJ,
                    "N": wordnet.NOUN,
                    "V": wordnet.VERB,
                    "R": wordnet.ADV}

        return tag_dict.get(tag, wordnet.NOUN)

This function tags the parts of speech in a word and will come in handy later in the project.
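
As a quick sanity check of what it returns (the single-word examples are chosen purely for illustration):

# Maps a single word to the POS character that lemmatize() accepts
print(dataprocessor.get_wordnet_pos('running'))   # 'v' (verb)
print(dataprocessor.get_wordnet_pos('research'))  # 'n' (noun)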

Second, there is a function that conducts the traditional NLP steps many of us have seen before. These steps are:

  1. Lowercase each word
  2. Remove the punctuation
  3. Remove digits (I only wanted to look at non-numeric information; this step can be taken out if desired.)
  4. Stopword removal
  5. Lemmatization. This is where the get_wordnet_pos() function comes in handy for including parts of speech!
@staticmethod
def preprocess(text: str):
    """A function that preprocesses text through the
    steps of Natural Language Processing (NLP).
    Inputs:
    text(str): A string of text

    Returns:
    text(str): A processed string of text
    """
    #lowercase
    text = text.lower()

    #punctuation removal
    text = "".join([i for i in text if i not in string.punctuation])

    #Digit removal (Just for ALL numeric numbers)
    text = [x for x in text.split(' ') if x.isnumeric() == False]

    #Stopword removal
    stopwords = nltk.corpus.stopwords.words('english')
    custom_stopwords = ['\n', '\n\n', '&', ' ', '.', '-', '$', '@']
    stopwords.extend(custom_stopwords)

    text = [i for i in text if i not in stopwords]
    text = ' '.join(word for word in text)

    #lemmatization
    lm = WordNetLemmatizer()
    text = [lm.lemmatize(word, dataprocessor.get_wordnet_pos(word)) for word in text.split(' ')]
    text = ' '.join(word for word in text)

    text = re.sub(' +', ' ', text)

    return text
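
Here is a quick sketch of what preprocess() does to a made-up sentence:

sample = 'The 3 experiments showed promising results!'
print(dataprocessor.preprocess(sample))
# -> something like 'experiment show promising result'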

Next, there is a function to read all of the files into the system.

@staticmethod
def data_reader(list_file_names):
    """A function that reads in the data from a directory of files.

    Inputs:
    list_file_names(list): List of the filepaths in a directory.

    Returns:
    text_list (list): A list where each value is a string of text
    for each file in the directory
    file_dict(dict): Dictionary where the keys are the filename and the values
    are the information found within each given file
    """

    text_list = []
    reader = dataprocessor()
    for file in list_file_names:
        temp = file.split('.')
        filetype = temp[-1]
        if filetype == "pdf":
            file_pdf = pdfReader(file)
            text = file_pdf.PDF_one_pager()

        elif filetype == "docx":
            word_doc_reader = wordDocReader(file)
            text = word_doc_reader.word_reader()

        elif filetype == "pptx" or filetype == 'ppt':
            ppt_reader = pptReader(file)
            text = ppt_reader.ppt_text()

        elif filetype == "csv":
            csv_reader = csvReader(file)
            text = csv_reader.csv_text()

        elif filetype == 'xlsx':
            xl_reader = xlsxReader(file)
            text = xl_reader.xlsx_text()
        else:
            print('File type {} not supported!'.format(filetype))
            continue

        text = reader.preprocess(text)
        text_list.append(text)
    file_dict = dict()
    for i, file in enumerate(list_file_names):
        file_dict[i] = (file, file.split('/')[-1])
    return text_list, file_dict

As this is the first version of this system, I want to emphasize that the code can be adapted to include many other file types!

The next function is called database_processor(), which is used to process all of the files within your given database. The input is a list of the files, each with its associated string of text (already processed). The strings of text are then vectorized using sklearn's TfidfVectorizer. What is that exactly? Basically, it transforms all of the text into feature vectors based on the frequency of each given word. We do this so we can look at how closely related documents are using similarity formulas relating to vector arithmetic.
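
If TF-IDF is new to you, here is a tiny, self-contained illustration of what the vectorizer produces (the two strings are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

demo_docs = ['machine learning model', 'machine learning network']
demo_vectorizer = TfidfVectorizer()
demo_vectors = demo_vectorizer.fit_transform(demo_docs)

print(demo_vectorizer.get_feature_names_out())  # ['learning' 'machine' 'model' 'network']
print(demo_vectors.shape)                       # (2, 4) -> one feature vector per document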

@staticmethod
def database_processor(file_dict, text_list: list):
    """A function that transforms the text of each file within the
    database into a vector.

    Inputs:
    file_dict(dict): Dictionary where the keys are the filename and the values
    are the information found within each given file
    text_list (list): A list where each value is a string of the text
    for each file in the directory

    Returns:
    list_dense(list): A list of the files' text as vectors.
    vectorizer: The vectorizer used to transform the strings of text
    file_vector_dict(dict): A dictionary where the file names are the keys
    and the vectors of each file's text are the values.
    """
    file_vector_dict = dict()
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(text_list)
    feature_names = vectorizer.get_feature_names_out()
    matrix = vectors.todense()
    list_dense = matrix.tolist()
    for i in range(len(list_dense)):
        file_vector_dict[file_dict[i][1]] = list_dense[i]

    return list_dense, vectorizer, file_vector_dict

The reason a vectorizer is created off of the database is that when a user gives a list of terms to search for in the database, those words can be vectorized based on their frequency in said database. This is the biggest weakness of the current system: as we increase the size of the database, the time and computation needed for calculating similarities will grow and slow down the system. One recommendation given during a quality control meeting was to use Reinforcement Learning for recommending different articles of information.

Next, we can use an input processor that turns any words provided into a vector. This is synonymous with typing a request into a search engine.

@staticmethod
def input_processor(text, TDIF_vectorizor):
    """A function which accepts a string of text and vectorizes the text using a
    TDIF vectorizor.

    Inputs:
    text(str): A string of text
    TDIF_vectorizor: A pretrained vectorizor

    Returns:
    words(list): A list of the input text in vectored form.
    """
    words = ''
    total_words = len(text.split(' '))
    for word in text.split(' '):
        words += (word + ' ') * total_words
        total_words -= 1

    words = [words[:-1]]
    words = TDIF_vectorizor.transform(words)
    words = words.todense()
    words = words.tolist()
    return words

Since all of the information within, and given to, the database will be vectors, we can use cosine similarity to compute the angle between the vectors. The closer the angle is to 0, the more similar the two vectors are.
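
A quick numeric check of that intuition (the vectors here are made up):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1, 0, 1]])
b = np.array([[1, 0, 1]])   # same direction -> angle of 0 -> similarity of 1
c = np.array([[0, 1, 0]])   # orthogonal -> angle of 90 degrees -> similarity of 0

print(cosine_similarity(a, b))  # [[1.]]
print(cosine_similarity(a, c))  # [[0.]]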

@staticmethod
def similarity_checker(vector_1, vector_2):
    """A function which accepts two vectors and computes their cosine similarity.

    Inputs:
    vector_1(int): A numerical vector
    vector_2(int): A numerical vector

    Returns:
    cosine_similarity([vector_1], vector_2) (int): Cosine similarity score
    """
    vectors = [vector_1, vector_2]
    for vec in vectors:
        if np.ndim(vec) == 1:
            vec = np.expand_dims(vec, axis=0)
    return cosine_similarity([vector_1], vector_2)

Once the capability of finding the similarity score between two vectors is in place, rankings can be created between the words being searched and the documents located within the database.

@staticmethod
def recommender(vector_file_list, query_vector, file_dict):
    """A function which accepts a list of vectors, query vectors, and a dictionary
    pertaining to the list of vectors with their original values and file names.

    Inputs:
    vector_file_list(list): A list of vectors
    query_vector(int): A numerical vector
    file_dict(dict): A dictionary of filenames and text relating to the list
    of vectors

    Returns:
    final_recommendation (list): A list of the final recommended files
    similarity_list[:len(final_recommendation)] (list): A list of the similarity
    scores of the final recommendations.
    """
    similarity_list = []
    score_dict = dict()
    for i, file_vector in enumerate(vector_file_list):
        x = dataprocessor.similarity_checker(file_vector, query_vector)
        score_dict[file_dict[i][1]] = (x[0][0])
        similarity_list.append(x)
    similarity_list = sorted(similarity_list, reverse=True)
    #Recommends the top half of the most similar files
    recommended = sorted(score_dict.items(),
                         key=lambda x: -x[1])[:int(np.round(.5*len(similarity_list)))]

    final_recommendation = []
    for i in range(len(recommended)):
        final_recommendation.append(recommended[i][0])
    #add in graph for more than 3 recommendations
    return final_recommendation, similarity_list[:len(final_recommendation)]

The vector file list is the list of vectors we created from the files earlier. The query vector is a vector of the words being searched. The file dictionary was created earlier and uses file names as the keys and the files' text as the values. Similarities are computed, and then a ranking is created, favoring the pieces of information most similar to the queried words to be recommended first. But what if there are more than 3 recommendations? Incorporating elements of Networks and Graph Theory will add an extra level of computational benefit to this system and create more confident recommendations.

Page Rank Theory

Let's take a quick detour and go over the theory of PageRank. Don't get me wrong, cosine similarity is a powerful computation for measuring the similarity between vectors, but incorporating PageRank into your recommendation algorithm allows for similarity comparisons across multiple vectors (the data within your database).

PageRank was first designed by Larry Page to rank websites and measure their importance [1]. The basic idea is that a website can be deemed "more important" if more websites are linked to it. Drawing from this concept, a node on a graph can be ranked as more important as the distance of its edges to other nodes decreases. The shorter the collective distance a node has compared to the other nodes in a graph, the more important said node is.

Today we'll use one variation of PageRank called eigenvector centrality. Eigenvector centrality is like PageRank in that it measures the connections between the nodes of a graph, assigning higher scores for stronger connections. The biggest difference? Eigenvector centrality accounts for the importance of the nodes connected to a given node when estimating how important that node is. This is synonymous with saying that a person who knows a lot of important people may be very important themselves through those strong relationships. All in all, the two algorithms are very close in the way they are implemented.
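
Here is a toy sketch of eigenvector centrality on a small graph; the file names and edge weights are made up to mimic similarity scores.

import networkx as nx

toy = nx.Graph()
# Hypothetical files, with edge weights standing in for similarity scores
toy.add_edge('paper_a.pdf', 'slides_b.pptx', weight=0.9)
toy.add_edge('paper_a.pdf', 'notes_c.docx', weight=0.7)
toy.add_edge('slides_b.pptx', 'notes_c.docx', weight=0.2)

scores = nx.eigenvector_centrality(toy, weight='weight')
# paper_a.pdf, with the strongest connections, ranks highest
print(sorted(scores.items(), key=lambda x: -x[1]))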

For this database, after the vectors are computed, they will be placed into a graph where their edge weight is determined by their similarity to the other vectors.

@staticmethod
def ranker(recommendation_val, file_vec_dict):
    """A function which accepts a list of recommendation values and a dictionary
    of the files within the database and their vectors.

    Inputs:
    recommendation_val(list): A list of recommendations found through cosine
    similarity
    file_vec_dict(dict): A dictionary of the filenames as keys and their
    text in vectors as the values.

    Returns:
    ec_recommended(list): A list of the recommendations ranked using the
    eigenvector centrality algorithm.
    """
    my_graph = nx.Graph()
    for i in range(len(recommendation_val)):
        file_1 = recommendation_val[i]
        for j in range(len(recommendation_val)):
            file_2 = recommendation_val[j]

            if i != j:
                #Calculate the similarity score between the two files (the edge weight)
                edge_dist = cosine_similarity([file_vec_dict[recommendation_val[i]]],
                                              [file_vec_dict[recommendation_val[j]]])[0][0]
                #add an edge from file 1 to file 2 with that weight
                my_graph.add_edge(file_1, file_2, weight=edge_dist)

    #Rank the graph with eigenvector centrality, using the similarity weights
    rec = nx.eigenvector_centrality(my_graph, weight='weight')
    #Ranks all of the recommended files by their centrality score
    ec_recommended = sorted(rec.items(), key=lambda x: -x[1])[:int(np.round(len(rec)))]

    return ec_recommended

Okay, now what? We have the recommendations created by using the cosine similarity between each data point in the database, and recommendations computed by the eigenvector centrality algorithm. Which recommendations should we output? Both!

@staticmethod
def weighted_final_rank(sim_list, ec_recommended, final_recommendation):
    """A function which accepts a list of similarity values found through
    cosine similarity, recommendations found through eigenvector centrality,
    and the final recommendations produced by cosine similarity.

    Inputs:
    sim_list(list): A list of all of the similarity values for the files
    within the database.
    ec_recommended(list): A list of the recommendations found using the
    eigenvector centrality algorithm.
    final_recommendation (list): A list of the final recommendations found
    by using cosine similarity.

    Returns:
    weighted_final_recommend(list): A list of the final recommendations for
    the files in the database.
    """
    final_dict = dict()

    for i in range(len(sim_list)):
        val = (.8 * sim_list[final_recommendation.index(ec_recommended[i][0])].squeeze()) + (.2 * ec_recommended[i][1])
        final_dict[ec_recommended[i][0]] = val

    weighted_final_recommend = sorted(final_dict.items(), key=lambda x: -x[1])[:int(np.round(len(final_dict)))]

    return weighted_final_recommend

The final function of this script weighs the different recommendations produced by cosine similarity and eigenvector centrality. Currently, 80% of the weight is given to the recommendations produced by cosine similarity, and 20% of the weight is given to the eigenvector centrality recommendations. The final recommendations are computed from these weights and aggregated together to produce recommendations that are representative of all of the similarity computations in the system. The weights can easily be changed by the developer to reflect which batch of recommendations they feel is more important.
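
As a quick illustration of the blend, with made-up numbers:

# Made-up scores for a single file
cosine_score = 0.75      # similarity between the query and the file
centrality_score = 0.50  # eigenvector centrality of the file in the graph

final_score = 0.8 * cosine_score + 0.2 * centrality_score
print(final_score)  # 0.7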

Let's do a quick example with this code. The documents within my database are all in the formats previously discussed and pertain to different areas of machine learning. More of the documents in the database are related to Generative Adversarial Networks (GANs), so I'd expect those to be recommended first when "Generative Adversarial Network" is the query term.

path = '/content/drive/MyDrive/database/'
db = [f for f in glob.glob(path + '*')]

research_documents, file_dictionary = dataprocessor.data_reader(db)
list_files, vectorizer, file_vec_dict = dataprocessor.database_processor(file_dictionary, research_documents)
query = 'Generative Adversarial Networks'
query = dataprocessor.preprocess(query)
query = dataprocessor.input_processor(query, vectorizer)
recommendation, sim_list = dataprocessor.recommender(list_files, query, file_dictionary)
ec_recommendation = dataprocessor.ranker(recommendation, file_vec_dict)
final_weighted_recommended = dataprocessor.weighted_final_rank(sim_list, ec_recommendation, recommendation)
print(final_weighted_recommended)

Running this block of code produces the following recommendations, along with the weight value for each recommendation.

[(‘GAN_presentation.pptx’, 0.3411272882084124), (‘Using GANs to Augment UAV Data_V2.docx’, 0.16293615818015078), (‘GANS_DAY_1.docx’, 0.12546058188955278), (‘ml_pdf.pdf’, 0.10864164490536887)]

Let's try one more. What if I query "Machine Learning"?

[(‘ml_pdf.pdf’, 0.31244922151487337), (‘GAN_presentation.pptx’, 0.18170070184645432), (‘GANS_DAY_1.docx’, 0.14825501243059303), (‘Using GANs to Augment UAV Data_V2.docx’, 0.1309153863914564)]

Aha! As expected, the first document recommended is an introductory brief on machine learning! I only used 7 documents for this example, and the more documents added, the more recommendations one will receive!

Today we looked at how you can create a recommendation system for the files you collect (especially if you are collecting research for a project). The main feature of this system is that it goes one step beyond computing the cosine similarity of vectors by adopting the eigenvector centrality algorithm for more concise and better recommendations. Try it out today, and I hope it helps you get a better understanding of how related the pieces of information you possess are.

If you enjoyed today's reading, please give me a follow and let me know if there is another topic you would like me to explore! If you do not have a Medium account, sign up through my link here (I receive a small commission when you do)! Additionally, add me on LinkedIn, or feel free to reach out! Thanks for reading!

Sources

  1. https://www.geeksforgeeks.org/page-rank-algorithm-implementation/
  2. Full Code: https://github.com/benmccloskey/Research_recommend_model/blob/main/app.py
