AI-Powered Information Extraction and Matchmaking


Developing an application for extracting key profile information from CVs and recommending jobs aligned with the profile

With the increasing capability of Large Language Models (LLMs), they have become increasingly popular for information extraction from business documents such as legal contracts, invoices, financial reports, and resumes, to name a few. The information extracted from multiple sources can then be used for matchmaking and recommendation systems.

Some applications of information extraction and matchmaking include:

  • Automatic request-for-quotation (RFQ) generation by extracting information from customers’ requests
  • Extracting key usage patterns from customer’s data to offer product recommendations
  • Extracting key information from tenders and matching it with company profiles to find potential bidders
  • Extracting key information from an organization’s invoices and sales documents to generate sale-purchase reports
  • Extracting key information from purchase orders to facilitate inventory and supply chain management
  • Matching individuals on dating or matrimonial platforms based on their profiles, etc.

During my AI consultancy experience in the FAIR EDIH project in Finland, I encountered several matchmaking use cases that can be implemented by employing LLMs for information extraction and subsequently providing recommendations aligned with the extracted information. Some of these use cases include:

  • Matching users’ preferences for purchasing cars and houses
  • Mapping student skills to career pathways in a learning management system
  • Matching EU regulations and compliances with tender proposals
  • Recommending experts for reviewing a research proposal
  • Providing peer recommendations to students to enhance the learning experience
  • Upskilling recommendations for an organization’s employees based on their profiles
  • Matching applicants to workplaces that fit their profiles, to name a few.

In this article, I’ll discuss a use case of extracting key information from a job seeker’s Curriculum Vitae (CV) or resume and recommending jobs from a job database that align with the job seeker’s profile. The method applies to both CVs and resumes; however, I’ll use only the term “CV” throughout the article. This use case is very useful for job search platforms that want to integrate AI into their existing systems. Such platforms maintain a job database and allow users to create profiles and/or upload their CVs. The same method can also be applied to help recruiters find potential candidates who match their job ads.

We’ll develop an application with a simple GUI that analyzes an uploaded CV to extract a profile comprising educational credentials, skills, and professional experience, and subsequently recommends the top matching jobs, along with an explanation for each selection.

It is important to note that this example use case can be extended to several other information extraction and matchmaking tasks.

This article will cover the following topics:

  1. Using LlamaParse and Pydantic models to extract structured information from documents with an LLM.
  2. Applying this information extraction method to CVs to extract educational credentials, skills, and professional experience.
  3. Scoring the extracted skills based on their strength (semantic score) in the CV.
  4. Creating a job vector database from a curated list of job ads.
  5. Retrieving the top matching jobs from the vector database based on their semantic similarity with the extracted profile.
  6. Generating the final job recommendations with an LLM, with an explanation for each recommendation.
  7. Developing a simple Streamlit application that allows the selection of multiple LLMs and embedding models (both OpenAI and open-source).

The complete code can be found in my GitHub repository with full instructions.

There are two main folders in the repository: i) the code in the folder OpenAI models uses OpenAI’s gpt-4o LLM and text-embedding-3-large embedding model, and ii) the code in the folder Multiple models offers the option to select OpenAI as well as open-source LLMs (e.g., gpt-4o, gpt-4o-mini, llama3:70b-instruct-q4_0, mistral:latest, llama3.3:latest) and embedding models (e.g., text-embedding-3-large, text-embedding-3-small, BAAI/bge-small-en-v1.5).

You will need an OpenAI API key to run the code in the OpenAI models folder. However, if you have a powerful PC with a CUDA-enabled GPU, you can test the code in the Multiple models folder with open-source models free of charge. You can run that code even without a CUDA-enabled GPU, but the processing will be very slow. Both codebases are flexible enough to add more LLMs and/or embedding models for experimentation. For the sake of simplicity, I’ll refer only to the code in OpenAI models in this article.
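As an illustration of how such a model switch can look, here is a minimal sketch using llama_index’s OpenAI and Ollama wrappers; the pick_llm helper and its routing rule are my own assumptions for demonstration, not code from the repository.

from llama_index.llms.openai import OpenAI
from llama_index.llms.ollama import Ollama

def pick_llm(name: str):
    # Hypothetical helper: route OpenAI model names to the OpenAI wrapper,
    # anything else to a locally served Ollama model.
    if name.startswith("gpt-"):
        return OpenAI(model=name, temperature=0.0)
    return Ollama(model=name, request_timeout=300.0)

# Example: select one of the open-source models listed above
llm = pick_llm("mistral:latest")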

The following figure shows the overall process.

The overall process of extracting key information from CVs and recommending matching jobs from a job database (image by author)

The following is a snapshot of the Streamlit application.

A snapshot of the Streamlit application (image by author)

Parsing with LlamaParse and Information Extraction & Validation with Pydantic Models

In a previous article, I demonstrated information extraction from unstructured documents using LLMs. There, I used the python-docx library to extract text from AI consultancy documents (MS Word) and sent the text of each document directly to an LLM for information extraction.

In another article, I demonstrated a better parsing method using LlamaParse for contextual, multimodal Retrieval-Augmented Generation (RAG). LlamaParse is a genAI-based document parsing platform that parses and cleans data, ensuring that it is of good quality and in the proper format before passing it to an LLM. Please see the abovementioned article to set up LlamaParse and get its free API key.

In this article, I’ll use LlamaParse to parse data from a CV. However, instead of directly extracting the required information from the parsed content with an LLM, I’ll use Pydantic models to enforce a specific schema for information extraction and to validate the extracted information against that schema. This process ensures that the output generated by the LLM conforms to the expected types and formats. Pydantic validation also helps to reduce LLM hallucinations.
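As a minimal, self-contained illustration of this validation idea (a toy Invoice schema invented for this paragraph, unrelated to the models used later), Pydantic coerces conforming output into the declared types and rejects non-conforming output:

from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class Invoice(BaseModel):
    vendor: Optional[str] = Field(None, description="Name of the vendor")
    total: Optional[float] = Field(None, description="Invoice total amount")

# A well-formed LLM response is coerced into the declared types ("42.5" -> 42.5)
print(Invoice.model_validate({"vendor": "Acme Oy", "total": "42.5"}))

# A malformed response fails validation instead of silently propagating
try:
    Invoice.model_validate({"vendor": "Acme Oy", "total": "not a number"})
except ValidationError as e:
    print(e)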

Pydantic offers a clean and concise way to define data models using Python classes. Before discussing Pydantic-guided information extraction from CVs, I’ll first start with an example to demonstrate this process for any document. I’ll use the same example document of AI consultancy for companies as used in the abovementioned article to extract key, structured information from an AI consultancy document. Here is the example document.

This is the AI consultancy of the company Sagittarius Tech on the date 2024-09-12. This was a regular session facilitated by the expert Klaus Muller. Sagittarius Tech, based in Finland, is a forward-thinking, well-established company specializing in renewable energy solutions. They have a strong technical foundation in renewable energy systems, particularly in solar and wind energy, but their application of AI technology is still in its infancy, resulting in a current AI maturity level that is considered low.
The company’s objectives are well articulated and focus on optimizing the efficiency of their energy distribution networks. Specifically, Sagittarius Tech aims to implement AI-driven predictive maintenance for their solar farms and wind turbines. Their current approach to maintenance is largely reactive, with inspections carried out at regular intervals or when a failure is detected. This method, while functional, is neither cost-effective nor efficient, as it often results in unexpected downtime and higher maintenance costs. By integrating AI into their maintenance operations, Sagittarius Tech hopes to predict and prevent equipment failures before they occur, thereby reducing downtime and extending the lifespan of their energy assets.
The idea of implementing predictive maintenance using AI is highly relevant and aligns with current industry trends. By predicting equipment failures before they occur, Sagittarius Tech can improve the reliability of their energy systems and offer more consistent service to their clients. The application of AI for this purpose is particularly advantageous, as it allows for the analysis of large datasets from sensors and monitoring equipment to identify patterns and anomalies that might indicate impending failures.
While the company’s immediate goals are clear, their long-term strategy for AI integration is still under consideration. However, they have identified their target market as large-scale renewable energy operators and utility companies. In terms of data requirements, Sagittarius Tech has access to extensive datasets generated by the sensors installed on their solar panels and wind turbines. This data, which includes temperature readings, vibration analysis, and energy output metrics, is crucial for training and validating AI models for predictive maintenance. The data is continuously updated as part of their ongoing operations, providing a rich source of information for AI-driven insights.
The company has demonstrated strong technical expertise in renewable energy systems and in managing the associated data. They have a growing interest in AI, particularly in the realm of predictive analytics, though their experience in this field is still developing. Sagittarius Tech is seeking technical assistance from FAIR Services to develop an AI proof-of-concept (POC) focused on predictive maintenance for their energy assets. During the consultation, it was noted that the company may benefit from targeted training in AI-based predictive maintenance techniques to further their capabilities.
The experts suggested that the challenge of implementing predictive maintenance could be approached through the use of machine learning models that are specifically designed to handle time-series data. Models such as LSTM (Long Short-Term Memory) networks, which are particularly effective in analyzing sequential data, can be applied to the sensor data collected by Sagittarius Tech. These models are capable of learning patterns over time and can provide early warnings of potential equipment failures. However, the experts noted that these models require a significant amount of data for training, so it may be helpful to start with a smaller pilot project before scaling up.
The experts further recommended exploring the integration of AI-driven predictive maintenance tools with the company’s existing monitoring systems. This integration can be achieved through the use of custom APIs and middleware, allowing the AI models to continuously analyze incoming data and provide real-time alerts to the maintenance team. Moreover, the experts emphasized the importance of a hybrid approach, combining AI predictions with human expertise to ensure that maintenance decisions are both data-driven and informed by practical experience.
Starting with pre-trained models for time-series analysis was recommended, with the option to fine-tune these models based on the specific characteristics of Sagittarius Tech’s equipment and operations. It was advised to avoid training models from scratch due to the computational complexity and resource requirements involved. Instead, a phased approach to AI integration was suggested, where the predictive maintenance system is progressively rolled out across different sites, allowing the models to be refined and validated in a controlled environment. This approach ensures that the AI system can be effectively integrated into the company’s operations without disrupting existing processes.

We have hundreds of such unstructured documents, and the aim is to extract the following key information: company name, country, consultation date, experts, consultation type, area domain, current solution, AI field, AI maturity level, technical expertise and capability, company type, aim, identified target market, data requirement assessment, FAIR’s services sought, and recommendations.

The following libraries need to be installed before running the given code.

pip install openai pydantic[email] llama_parse llama-index python-dotenv streamlit torch

The following code defines a Pydantic model to enforce a specific schema for data extraction, validate the LLM’s output, and convert some fields into an expected format.

import os
import json
import openai
from datetime import datetime, date
from typing import List, Optional
from pydantic import BaseModel, Field, field_validator
from llama_parse import LlamaParse
from llama_index.llms.openai import OpenAI
from dotenv import load_dotenv
from llama_index.core import SimpleDirectoryReader

load_dotenv()  # load the API keys from the .env file

class AIconsultation(BaseModel):
    company_name: Optional[str] = Field(None, description="The name of the company seeking AI advisory")
    country: Optional[str] = Field(None, description="The company's country")
    consultation_date: Optional[str] = Field(None, description="The date of consultation")
    experts: Optional[List[str]] = Field(None, description="The experts providing AI consultancy")
    consultation_type: Optional[str] = Field(None, description="Type of consultation: regular or pop-up")
    area_domain: Optional[str] = Field(None, description="The field of the company's operations (e.g., healthcare, logistics, etc.)")
    current_solution: Optional[str] = Field(None, description="A brief summary of the current solution (e.g., recommendation system, expert guidance system)")
    ai_field: Optional[List[str]] = Field(None, description="AI sub-fields in use or required (e.g., computer vision, generative AI)")
    ai_maturity_level: Optional[str] = Field(None, description="AI maturity level: low, moderate, high")
    technical_expertise_and_capability: Optional[str] = Field(None, description="Company's technical expertise: low, moderate, or high")
    company_type: Optional[str] = Field(None, description="Company type: startup or established company")
    aim: Optional[str] = Field(None, description="Main AI task the company is aiming for")
    identified_target_market: Optional[str] = Field(None, description="The targeted customers (e.g., healthcare professionals, construction firms)")
    data_requirement_assessment: Optional[str] = Field(None, description="Type of data required for AI integration with format/modality")
    fair_services_sought: Optional[str] = Field(None, description="Services expected from FAIR (e.g., technical advice, proof of concept)")
    recommendations: Optional[str] = Field(None, description="Key recommendations focusing on the most important suggested actions")

    @field_validator("consultation_date", mode="before")
    def validate_and_convert_date(cls, raw_date):
        if raw_date is None:
            return None
        if isinstance(raw_date, str):
            # List of acceptable date formats
            date_formats = ['%d-%m-%Y', '%Y-%m-%d', '%d/%m/%Y', '%m-%d-%Y']
            for fmt in date_formats:
                try:
                    # Attempt to parse the date string with the current format
                    parsed_date = datetime.strptime(raw_date, fmt).date()
                    # Return the date in MM-DD-YYYY format as a string
                    return parsed_date.strftime('%m-%d-%Y')
                except ValueError:
                    continue  # Try the next format
            # If none of the formats match, raise an error
            raise ValueError(
                f"Invalid date format for 'consultation_date'. Expected one of: {', '.join(date_formats)}."
            )
        if isinstance(raw_date, date):
            # Convert the date object to MM-DD-YYYY format
            return raw_date.strftime('%m-%d-%Y')

        raise ValueError(
            "Invalid type for 'consultation_date'. Must be a string or a date object."
        )

def extract_content(file_path):
    """Parse the document and extract its content as text."""
    # Initialize the LlamaParse parser
    parser = LlamaParse(
        result_type="markdown",
        parsing_instructions="Extract each section individually based on the document structure.",
        auto_mode=True,
        api_key=os.getenv("LLAMA_API_KEY"),
        verbose=True
    )
    file_extractor = {".pdf": parser}
    # Load the document
    documents = SimpleDirectoryReader(
        input_files=[file_path], file_extractor=file_extractor
    ).load_data()
    text_content = "\n".join([doc.text for doc in documents])
    return text_content

def extract_information(document_text, llm_model):
    """Extract structured information and validate it with the Pydantic schema."""
    openai.api_key = os.getenv("OPENAI_API_KEY")
    llm = OpenAI(model=llm_model, temperature=0.0)
    prompt = f"""
    You are an expert in analyzing consultation documents. Use the following JSON schema to extract relevant information:
    ```json
    {AIconsultation.schema_json(indent=2)}
    ```
    Extract the information from the following document and provide a structured JSON response strictly adhering to the schema above.
    Please remove any ```json ``` characters from the output. Do not make up any information. If a field cannot be extracted, mark it as `n/a`.
    Document:
    ----------------
    {document_text}
    ----------------
    """
    response = llm.complete(prompt)
    if not response or not response.text:
        raise ValueError("Failed to get a response from the LLM.")
    try:
        parsed_data = json.loads(response.text)  # Parse the response text into a Python dictionary
        return AIconsultation.model_validate(parsed_data)  # Validate the parsed data against the schema
    except Exception as e:
        raise ValueError(f"Validation failed: {e}")

if __name__ == "__main__":
    # Path to the document to analyze
    document_path = "Sagittarius.pdf"
    if not os.path.exists(document_path):
        raise FileNotFoundError(f"The file {document_path} does not exist.")
    try:
        print("Extracting content from the document...")
        document_content = extract_content(document_path)

        print("Parsing and extracting structured information...")
        consultation_info = extract_information(document_content, llm_model="gpt-4o")

        print("Extraction complete. Here is the structured information:")
        print(json.dumps(consultation_info.dict(), indent=2))
    except Exception as e:
        print(f"An error occurred: {e}")

The description of each field in the AIconsultation class is self-explanatory. The field validator function validate_and_convert_date checks the format of the extracted consultation_date field and, if required, converts it into the expected format (MM-DD-YYYY). The function extract_content() parses the given AI consultancy document using LlamaParse, and the function extract_information() extracts the required information from the document using the gpt-4o LLM, guided by the Pydantic model. The prompt in the extract_information function instructs the model to follow the Pydantic schema and output the response in JSON format.
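As a quick check of the validator (assuming the AIconsultation class defined above is in scope), a date in YYYY-MM-DD form is converted during model construction, while an unsupported format raises a validation error:

# Quick check of the consultation_date field validator
record = AIconsultation(company_name="Sagittarius Tech", consultation_date="2024-09-12")
print(record.consultation_date)  # -> 09-12-2024

# An unsupported date format is rejected instead of being passed through silently
try:
    AIconsultation(consultation_date="12 September 2024")
except Exception as e:
    print(e)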

LlamaParse splits the document into multiple sub-documents based on the overall context. As per the instructions given to the parser (see parsing_instructions in the extract_content() function), the parser creates multiple sections and assigns each section a heading. The parser’s output (the documents object in the extract_content() function) contains sub-document IDs, the metadata of each sub-document, and the text containing multiple sections with headings.
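To inspect this structure yourself, a small sketch along the following lines (a separately configured parser and the example file name are assumptions for illustration) prints each sub-document’s ID, metadata, and a text preview:

# Inspect the sub-documents produced by LlamaParse for the example document
parser = LlamaParse(result_type="markdown", api_key=os.getenv("LLAMA_API_KEY"))
documents = SimpleDirectoryReader(
    input_files=["Sagittarius.pdf"], file_extractor={".pdf": parser}
).load_data()

for doc in documents:
    print("id:", doc.doc_id)
    print("metadata:", doc.metadata)
    print("text preview:", doc.text[:120].replace("\n", " "), "...")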

LlamaParse’s output contained in the documents object in the extract_content() function (image by author)

I only select the text (text_content in the extract_content() function) for information extraction by the LLM. Here is the final output of the extract_content() function. The document has been split into multiple sections, with each section assigned a heading.

Finally, the extract_information() function extracts the required information (defined in the Pydantic model) from the parsed content in a nicely structured format. The consultation date was validated and converted into MM-DD-YYYY format. Note that in the prompt, we don’t have to specify what information we want to extract, as this has already been specified in the Pydantic model.

Extracting content from the document...
Started parsing the file under job_id 0761bfee-922a-49a8-9e92-da1877aeea1a
Parsing and extracting structured information...
Extraction complete. Here is the structured information:
{
  "company_name": "Sagittarius Tech",
  "country": "Finland",
  "consultation_date": "09-12-2024",
  "experts": [
    "Klaus Muller"
  ],
  "consultation_type": "Regular",
  "area_domain": "Renewable energy",
  "current_solution": "Reactive maintenance for solar farms and wind turbines",
  "ai_field": [
    "Predictive maintenance",
    "Machine learning",
    "Time-series analysis"
  ],
  "ai_maturity_level": "Low",
  "technical_expertise_and_capability": "High",
  "company_type": "Established company",
  "aim": "Implement AI-driven predictive maintenance for solar farms and wind turbines",
  "identified_target_market": "Large-scale renewable energy operators and utility companies",
  "data_requirement_assessment": "Extensive datasets from sensors on solar panels and wind turbines, including temperature readings, vibration analysis, and energy output metrics",
  "fair_services_sought": "Technical advice, proof of concept for predictive maintenance",
  "recommendations": "Use machine learning models like LSTM for time-series data, start with pre-trained models, integrate AI with existing systems using custom APIs, and adopt a phased approach to AI integration"
}

Parsing CV Content and Information Extraction & Validation Using Pydantic Models

After this demonstration of parsing documents with LlamaParse and information extraction via Pydantic models, let’s now discuss parsing and key information extraction from CVs using Pydantic models, and providing job recommendations aligned with the extracted profile. Here, the extracted educational credentials, skills, and past experience are considered sufficient information to provide matching job recommendations.

The GitHub code is structured into two .py files:

  1. CV_analyzer.py: defines the Pydantic models, configures the LLM and the embedding model, parses the CV’s data, extracts the required information from the CV, assigns scores to the extracted skills, and retrieves matching jobs from the job vector database.
  2. job_recommender.py: initializes a Streamlit application, calls the functions in CV_analyzer.py in a sequential manner, and displays the extracted information and job recommendations.

The overall workflow of the code is depicted in the following image.

Workflow of the job recommendation application: integration of CvAnalyzer and RAGStringQueryEngine for CV parsing, Pydantic-guided profile extraction with an LLM, skill scoring, and job recommendation with Streamlit output (image by author).

A few structures in this code have been adopted from this source with significant enhancements. Let’s discuss all the classes and functions in the code one by one.

The following code in CV_analyzer.py shows the definitions of the Pydantic models.

# Pydantic model for extracting education
class Education(BaseModel):
    institution: Optional[str] = Field(None, description="The name of the educational institution")
    degree: Optional[str] = Field(None, description="The degree or qualification earned")
    graduation_date: Optional[str] = Field(None, description="The graduation date (e.g., 'YYYY-MM')")
    details: Optional[List[str]] = Field(
        None, description="Additional details about the education (e.g., coursework, achievements)"
    )

    @field_validator('details', mode='before')
    def validate_details(cls, v):
        if isinstance(v, str) and v.lower() == 'n/a':
            return []
        elif not isinstance(v, list):
            return []
        return v

# Pydantic model for extracting experience
class Experience(BaseModel):
    company: Optional[str] = Field(None, description="The name of the company or organization")
    location: Optional[str] = Field(None, description="The location of the company or organization")
    role: Optional[str] = Field(None, description="The role or job title held by the candidate")
    start_date: Optional[str] = Field(None, description="The start date of the job (e.g., 'YYYY-MM')")
    end_date: Optional[str] = Field(None, description="The end date of the job or 'Present' if ongoing (e.g., 'MM-YYYY')")
    responsibilities: Optional[List[str]] = Field(
        None, description="A list of responsibilities and tasks handled during the job"
    )

    @field_validator('responsibilities', mode='before')
    def validate_responsibilities(cls, v):
        if isinstance(v, str) and v.lower() == 'n/a':
            return []
        elif not isinstance(v, list):
            return []
        return v

# Main Pydantic class encapsulating the Education and Experience classes with other information
class ApplicantProfile(BaseModel):
    name: Optional[str] = Field(None, description="The full name of the candidate")
    email: Optional[EmailStr] = Field(None, description="The email of the candidate")
    age: Optional[int] = Field(
        None,
        description="The age of the candidate."
    )
    skills: Optional[List[str]] = Field(
        None, description="A list of high-level skills possessed by the candidate."
    )
    experience: Optional[List[Experience]] = Field(
        None, description="A list of experiences detailing previous jobs, roles, and responsibilities"
    )
    education: Optional[List[Education]] = Field(
        None, description="A list of educational qualifications of the candidate, including degrees, institutions studied at, and start and end dates."
    )

    @root_validator(pre=True)
    def handle_invalid_values(cls, values):
        for key, value in values.items():
            if isinstance(value, str) and value.lower() in {'n/a', 'none', ''}:
                values[key] = None
        return values

The Education class in the Pydantic model defines the extraction of educational details, including the institution’s name, the degree or qualification earned, the graduation date, and additional details like coursework or achievements. The Experience class defines the extraction of professional experience details, including company name, location, role, start and end dates, and a list of responsibilities or tasks. The main class ApplicantProfile encapsulates the Education and Experience classes, together with other candidate-specific information like name, email, age, and skills. The field validators in each class convert invalid or irrelevant values (such as 'n/a' or 'none') or improperly formatted inputs into a consistent data format.
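A quick, illustrative check of these validators (with arbitrarily chosen values) shows how invalid entries are normalized:

# 'n/a' in a list-typed field is converted to an empty list by the field validator
edu = Education(institution="Aalto University", degree="MSc Computer Science",
                graduation_date="2020-06", details="n/a")
print(edu.details)  # -> []

# 'n/a', 'none', or '' in top-level fields become None via the root validator
profile = ApplicantProfile(name="n/a", skills=["Python", "Machine Learning"])
print(profile.name)    # -> None
print(profile.skills)  # -> ['Python', 'Machine Learning']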

After defining the Pydantic models, CV_analyzer.py uses a CvAnalyzer class with the following structure to perform the various tasks.

# Class for analyzing the CV contents
class CvAnalyzer:
    def __init__(self, file_path, llm_option, embedding_option):
        """
        Initializes the CvAnalyzer with the given resume file path and model options.

        Parameters:
        - file_path: Path to the resume file.
        - llm_option: Name of the LLM to use.
        - embedding_option: Name of the embedding model to use.
        """
        pass

    def _model_settings(self):
        """
        Configures the large language model and the embedding model based on the user-provided options.
        This ensures that the chosen models are properly initialized and ready for use.
        """
        pass

    def extract_profile_info(self) -> ApplicantProfile:
        """
        Extracts structured information from the resume and converts it into an ApplicantProfile object.
        This includes parsing education, skills, and experience using the selected LLM.
        """
        pass

    def _get_embedding(self, texts: List[str], model: str) -> torch.Tensor:
        """
        Generates embeddings for a list of text inputs using the specified embedding model.
        This function is called by the compute_skill_scores() function.

        Parameters:
        - texts: List of strings to embed.
        - model: Name of the embedding model.

        Returns:
        - Tensor of embeddings.
        """
        pass

    def compute_skill_scores(self, skills: list[str]) -> dict:
        """
        Computes semantic similarity scores between the skills and the resume content.

        Parameters:
        - skills: List of skills to evaluate.

        Returns:
        - A dictionary mapping each skill to its similarity score.
        """
        pass

    def _extract_resume_content(self) -> str:
        """
        Called by compute_skill_scores(), this function extracts and returns the raw textual content of the resume.
        """
        pass

    def _cosine_similarity(self, vec1: torch.Tensor, vec2: torch.Tensor) -> float:
        """
        Called by the compute_skill_scores() function, calculates the cosine similarity between two vectors.

        Parameters:
        - vec1: First vector.
        - vec2: Second vector.

        Returns:
        - Cosine similarity score as a float.
        """
        pass

    def create_or_load_job_index(self, json_file: str, index_folder: str = "job_index_storage"):
        """
        Creates a new job vector index from a JSON dataset or loads an existing index from storage.

        Parameters:
        - json_file: Path to the job dataset JSON file.
        - index_folder: Folder to save or load the vector index.

        Returns:
        - VectorStoreIndex object for querying jobs.
        """
        pass

    def query_jobs(self, education, skills, experience, index, top_k=3):
        """
        Queries the job vector index to find the top-k matching jobs based on the provided profile.

        Parameters:
        - education: List of educational qualifications.
        - skills: List of skills.
        - experience: List of work experiences.
        - index: Job vector database index.
        - top_k: Number of top matching jobs to retrieve (default: 3).

        Returns:
        - List of job matches.
        """
        pass

The extract_profile_info function first parses the given CV with LlamaParse and splits it into sections (as demonstrated in the example presented at the beginning of the article). It then sends the CV’s contents self._resume_content to the LLM along with the Pydantic schema and the information extraction instructions (see prompt). The response from the LLM (response) is validated against the Pydantic schema.

It’s worth mentioning that, instead of the original parsed content (documents) with metadata and other information, I extract the text data (self._resume_content) and send it to the LLM for information extraction. This prevents the LLM from becoming confused by the information scattered across different nodes, which could result in omitting some parts of the required information.

def extract_profile_info(self) -> ApplicantProfile:
    """
    Extracts candidate data from the resume.
    """
    print(f"Extracting CV data. LLM: {self.llm_option}")
    output_schema = ApplicantProfile.model_json_schema()
    parser = LlamaParse(
        result_type="markdown",
        parsing_instructions="Extract each section individually based on the document structure.",
        auto_mode=True,
        api_key=os.getenv("LLAMA_API_KEY"),
        verbose=True
    )
    file_extractor = {".pdf": parser}

    # Load the resume and parse it
    documents = SimpleDirectoryReader(
        input_files=[self.file_path], file_extractor=file_extractor
    ).load_data()

    # Split into sections
    self._resume_content = "\n".join([doc.text for doc in documents])
    prompt = f"""
    You are an expert in analyzing resumes. Use the following JSON schema to extract relevant information:
    ```json
    {output_schema}
    ```
    Extract the information from the following document and provide a structured JSON response strictly adhering to the schema above.
    Please remove any ```json ``` characters from the output. Do not make up any information. If a field cannot be extracted, mark it as `n/a`.
    Document:
    ----------------
    {self._resume_content}
    ----------------
    """
    try:
        response = self.llm.complete(prompt)
        if not response or not response.text:
            raise ValueError("Failed to get a response from the LLM.")

        parsed_data = json.loads(response.text)
        return ApplicantProfile.model_validate(parsed_data)
    except Exception as e:
        print(f"Error parsing response: {str(e)}")
        raise ValueError("Failed to extract insights. Please ensure the resume and query engine are properly configured.")

The compute_skill_scores function computes the embeddings of each extracted skill and of the CV contents. It then computes a cosine similarity score between the skill and CV embeddings. The more prominent a skill is in a CV, the higher the cosine similarity score it gets. The cosine similarity score for each skill is later normalized between 0 and 5 for display in a 5-star format.

def compute_skill_scores(self, skills: list[str]) -> dict:
    """
    Compute semantic weightage scores for each skill based on the resume content.

    Parameters:
    - skills (list of str): A list of skills to evaluate.

    Returns:
    - dict: A dictionary mapping each skill to a score
    """
    # Extract the resume content and compute its embedding
    resume_content = self._extract_resume_content()

    # Compute embeddings for all skills at once
    skill_embeddings = self._get_embedding(skills, model=self.embedding_model.model_name)

    # Compute raw similarity scores and semantic frequency for each skill
    raw_scores = {}
    for skill, skill_embedding in zip(skills, skill_embeddings):
        # Compute semantic similarity with the entire resume
        similarity = self._cosine_similarity(
            self._get_embedding([resume_content], model=self.embedding_model.model_name)[0],
            skill_embedding
        )
        raw_scores[skill] = similarity
    return raw_scores

def _extract_resume_content(self) -> str:
    """
    Returns the CV contents previously extracted.
    """
    if self._resume_content:
        return self._resume_content  # Use the pre-stored content
    else:
        raise ValueError("Resume content not available. Ensure `extract_profile_info` is called first.")

def _get_embedding(self, texts: List[str], model: str) -> torch.Tensor:
    """Computes embeddings based on the chosen embedding model.
    These could be CV embeddings, skill embeddings, or job embeddings."""
    from openai import OpenAI
    client = OpenAI(api_key=openai.api_key)
    response = client.embeddings.create(input=texts, model=model)
    embeddings = [torch.tensor(item.embedding) for item in response.data]
    return torch.stack(embeddings)

def _cosine_similarity(self, vec1: torch.Tensor, vec2: torch.Tensor) -> float:
    """
    Compute the cosine similarity between a skill and the CV content.
    """
    vec1, vec2 = vec1.to(self.device), vec2.to(self.device)
    return (torch.dot(vec1, vec2) / (torch.norm(vec1) * torch.norm(vec2))).item()
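The raw scores returned by compute_skill_scores are converted into star ratings later in the Streamlit code (see list_skills below); conceptually, the normalization works as in this small sketch with made-up scores:

# Illustrative normalization of raw cosine scores to a 5-star scale (made-up values)
raw_scores = {"Python": 0.62, "Machine Learning": 0.55, "Public Speaking": 0.31}
max_score = max(raw_scores.values())

for skill, score in raw_scores.items():
    normalized = (score / max_score) * 5           # the strongest skill gets 5 stars
    full_stars = int(normalized)                   # number of whole stars
    half_star = (normalized - full_stars) >= 0.40  # show a half star beyond 0.4
    print(f"{skill}: {normalized:.2f} -> {full_stars} full star(s)" + (" + half star" if half_star else ""))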

The function create_or_load_job_index creates a new job vector database or loads an index from an existing job vector database (see the job_index_storage folder in the code repository).

def create_or_load_job_index(self, json_file: str, index_folder: str = "job_index_storage"):
    """
    Create or load a vector database for jobs using LlamaIndex.
    """
    if not os.path.exists(index_folder):
        print(f"Creating a new job vector index with the {self.embedding_model.model_name} model...")
        with open(json_file, "r") as f:
            job_data = json.load(f)
        # Convert job descriptions to Document objects by serializing all fields dynamically
        documents = []
        for job in job_data["jobs"]:
            job_text = "\n".join([f"{key.capitalize()}: {value}" for key, value in job.items()])
            documents.append(Document(text=job_text))
        # Create the vector index directly from the documents
        index = VectorStoreIndex.from_documents(documents, embed_model=self.embedding_model)
        # Save the index to disk
        index.storage_context.persist(persist_dir=index_folder)
        return index
    else:
        print(f"Loading the existing job index from {index_folder}...")
        storage_context = StorageContext.from_defaults(persist_dir=index_folder)
        return load_index_from_storage(storage_context)

The job dataset, from which the vector database is created, is in the sample_jobs.json file in the code repository. I curated this sample dataset by scraping 50 job ads from different sources in JSON format. Here is how the job ads are stored in this file.

{
  "jobs": [
    {
      "id": "2253637",
      "title": "Director of Customer Success",
      "company": "HEI Schools",
      "description": "HEI Schools is seeking an experienced Director of Customer Success to lead our account management, customer success, and project delivery functions. Responsibilities include overseeing seamless product and service delivery, ensuring high quality and customer satisfaction, and supervising a team of three customer success professionals. The role requires regular international travel and reports directly to the CEO.",
      "image": "n/a",
      "location": "Helsinki, Finland",
      "employmentType": "Full-time, Permanent",
      "datePosted": "December 10, 2024",
      "salaryRange": "n/a",
      "jobProvider": "Jobly",
      "url": "https://www.jobly.fi/en/job/director-customer-success-2253637"
    },
    {
      "id": "2258919",
      "title": "Service Specialist",
      "company": "Stora Enso",
      "description": "We are seeking an active and service-oriented Service Specialist for our forest owner services in the Helsinki metropolitan area. Responsibilities include supporting timber sales and service sales, marketing and communication within your area of responsibility, forest consulting, promoting digital solutions in customer management and service offerings, and stakeholder collaboration in the metropolitan area.",
      "image": "n/a",
      "location": "Helsinki, Finland",
      "employmentType": "Permanent, Full-time",
      "datePosted": "December 10, 2024",
      "salaryRange": "n/a",
      "jobProvider": "Jobly",
      "url": "https://www.jobly.fi/en/job/palveluasiantuntija-2258919"
    }
    ...

  ],
  "index": 0,
  "jobCount": 50,
  "hasError": false,
  "errors": []
}
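A quick way to sanity-check the dataset (using the field names shown above) is to load it and list a few job titles:

import json

with open("sample_jobs.json", "r") as f:
    job_data = json.load(f)

print("jobCount:", job_data["jobCount"])
for job in job_data["jobs"][:5]:
    print(f'- {job["title"]} at {job["company"]} ({job["location"]})')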

The function query_jobs retrieves the top_k matching job ads from the job vector database, which are then sent to the LLM for the final recommendation.

def query_jobs(self, education, skills, experience, index, top_k=3):
    """
    Query the vector database for jobs matching the extracted profile.
    """
    print(f"Fetching job recommendations. (LLM: {self.llm.model}, embed_model: {self.embedding_option})")
    query = f"Education: {', '.join(education)}; Skills: {', '.join(skills)}; Experience: {', '.join(experience)}"
    # Use the retriever with the selected embedding model
    retriever = index.as_retriever(similarity_top_k=top_k)
    matches = retriever.retrieve(query)
    return matches
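Each retrieved match is a llama_index NodeWithScore; the following sketch (with a hypothetical profile, and assuming analyzer and job_index have been created as described) shows how the matches can be read:

# Illustrative call to query_jobs with a hypothetical profile
matches = analyzer.query_jobs(
    education=["MSc Computer Science"],
    skills=["Python", "Machine Learning"],
    experience=["Data Scientist"],
    index=job_index,
    top_k=3,
)
for match in matches:
    # Each match exposes a similarity score and the stored job text
    print(f"score={match.score:.3f}")
    print(match.node.get_content()[:150], "...")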

The CvAnalyzer class and its above-mentioned methods are initialized and called by job_recommender.py, which serves as the main application code. job_recommender.py uses the following custom query engine to provide the final job recommendations.

class RAGStringQueryEngine(BaseModel):
    """
    Custom Query Engine for Retrieval-Augmented Generation (fetching matching job recommendations).
    """
    retriever: BaseRetriever
    llm: OpenAI
    qa_prompt: PromptTemplate

    # Allow arbitrary types
    model_config = ConfigDict(arbitrary_types_allowed=True)

    def custom_query(self, candidate_details: str, retrieved_jobs: str):
        query_str = self.qa_prompt.format(
            query_str=candidate_details, context_str=retrieved_jobs
        )

        response = self.llm.complete(query_str)
        return str(response)

The main function in job_recommender.py works as follows:

def main():
    # Streamlit messages
    st.set_page_config(page_title="CV Analyzer & Job Recommender", page_icon="🔍")
    st.title("CV Analyzer & Job Recommender")
    st.write("Upload a CV to extract key information.")
    uploaded_file = st.file_uploader("Select Your CV (PDF)", type="pdf", help="Select a PDF file up to 5MB")
    # Define the LLM and the embedding model
    llm_option = "gpt-4o"
    embedding_option = "text-embedding-3-large"
    # The following code is triggered after pressing the 'Analyze' button
    if uploaded_file is not None:
        if st.button("Analyze"):
            with st.spinner("Parsing CV... This may take a moment."):
                try:
                    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
                        temp_file.write(uploaded_file.getvalue())
                        temp_file_path = temp_file.name
                    # Initialize CvAnalyzer with the chosen models
                    analyzer = CvAnalyzer(temp_file_path, llm_option, embedding_option)
                    print("Resume extractor initialized.")
                    # Extract insights from the resume
                    insights = analyzer.extract_profile_info()
                    print("Candidate data extracted.")
                    # Load or create the job vector index
                    job_index = analyzer.create_or_load_job_index(json_file="sample_jobs.json", index_folder="job_index_storage")
                    # Extract the education, skills, and experience fields from the insights object
                    education = [edu.degree for edu in insights.education] if insights.education else []
                    skills = insights.skills or []
                    experience = [exp.role for exp in insights.experience] if insights.experience else []
                    # Retrieve the top_k matching jobs
                    matching_jobs = analyzer.query_jobs(education, skills, experience, job_index)
                    # Combine the retrieved matching jobs
                    retrieved_context = "\n\n".join([match.node.get_content() for match in matching_jobs])
                    # Combine the profile information
                    candidate_details = f"Education: {', '.join(education)}; Skills: {', '.join(skills)}; Experience: {', '.join(experience)}"
                    # Initialize the LLM and the query engine
                    llm = OpenAI(model=llm_option, temperature=0.0)
                    rag_engine = RAGStringQueryEngine(
                        retriever=job_index.as_retriever(),
                        llm=analyzer.llm,
                        qa_prompt=PromptTemplate(template="""
                        You are an expert in analyzing resumes. Based on the following candidate details and job descriptions:
                        Candidate Details:
                        ---------------------
                        {query_str}
                        ---------------------
                        Job Descriptions:
                        ---------------------
                        {context_str}
                        ---------------------
                        Provide a concise list of the matching jobs. For each matching job, mention job-related details such as
                        company, brief job description, location, employment type, salary range, URL for each recommendation, and a brief explanation of why the job matches the candidate's profile.
                        Be critical in matching the profile with the jobs. Thoroughly analyze education, skills, and experience to match jobs.
                        Do not explain why the candidate's profile does not match the other jobs. Do not include any summary. Order the jobs based on their relevance.
                        Answer:
                        """
                        ),
                    )

                    # Send the profile details and the retrieved jobs to the LLM for the final recommendation
                    llm_response = rag_engine.custom_query(
                        candidate_details=candidate_details,
                        retrieved_jobs=retrieved_context
                    )
                    # Display the extracted information
                    st.subheader("Extracted Information")
                    st.write(f"**Name:** {insights.name}")
                    st.write(f"**Email:** {insights.email}")
                    st.write(f"**Age:** {insights.age}")
                    list_education(insights.education or [])
                    with st.spinner("Extracting skills..."):
                        list_skills(insights.skills or [], analyzer)
                    list_experience(insights.experience or [])
                    st.subheader("Top Matching Jobs with Explanation")
                    st.markdown(llm_response)
                    print("Done.")
                except Exception as e:
                    st.error(f"Failed to analyze the resume: {str(e)}")

The main function initializes the CvAnalyzer class with the selected models and calls the extract_profile_info function to extract the profile information. It then loads the job vector index and calls the query_jobs function to retrieve the jobs matching the extracted profile. Subsequently, it initializes the query engine (rag_engine) and sends the retrieved jobs (retrieved_context) and the profile information (candidate_details) to the LLM, with instructions on which elements to consider when generating the final job recommendations (see qa_prompt). The imports used by job_recommender.py are listed below; the application itself is launched with streamlit run job_recommender.py.

import torch
from transformers import AutoTokenizer, AutoModel
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.ollama import Ollama
from typing import Union
import streamlit as st
import tempfile
import random
import os
from CV_analyzer import CvAnalyzer
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.llms.openai import OpenAI
from llama_index.core.prompts import PromptTemplate
from pydantic import BaseModel, Field, ConfigDict


The following three functions display the educational credentials, skills, and experience. The function list_skills calls the compute_skill_scores function to compute the cosine similarity score for each skill and then converts each score into a 5-star rating.


def list_skills(skills: list[str], analyzer):
    """
    Display skills with their computed scores as large golden stars with full or partial coverage.
    """
    if not skills:
        st.warning("No skills found to display.")
        return
    st.subheader("Skills")
    # Custom CSS for large golden stars (the original style block was lost during
    # web extraction; this is a minimal reconstruction)
    st.markdown(
        """
        <style>
        .star-full { color: gold; font-size: 24px; }
        .star-half { color: gold; font-size: 24px; opacity: 0.5; }
        .star-empty { color: lightgray; font-size: 24px; }
        </style>
        """,
        unsafe_allow_html=True,
    )

    # Compute scores for all skills
    skill_scores = analyzer.compute_skill_scores(skills)
    # Display each skill with a star rating
    for skill in skills:
        score = skill_scores.get(skill, 0)  # Get the raw score
        max_score = max(skill_scores.values()) if skill_scores else 1  # Avoid division by zero
        # Normalize the score to a 5-star scale
        normalized_score = (score / max_score) * 5 if max_score > 0 else 0
        # Split into full stars and a partial star percentage
        full_stars = int(normalized_score)
        if (normalized_score - full_stars) >= 0.40:
            partial_star_percentage = 50
        else:
            partial_star_percentage = 0

        # Generate the star display (inline HTML reconstructed; the original markup
        # was stripped during web extraction)
        stars_html = ""
        for i in range(5):
            if i < full_stars:
                # Fully filled star
                stars_html += '<span class="star-full">&#9733;</span>'
            elif i == full_stars:
                # Partially filled star (shown when partial_star_percentage is 50)
                css_class = "star-half" if partial_star_percentage == 50 else "star-empty"
                stars_html += f'<span class="{css_class}">&#9733;</span>'
            else:
                # Empty star
                stars_html += '<span class="star-empty">&#9733;</span>'

        # Display the skill name and star rating
        st.markdown(f"**{skill}**: {stars_html}", unsafe_allow_html=True)

def list_education(education_list):
    """
    Display a list of educational qualifications.
    """
    if education_list:
        st.subheader("Education")
        for education in education_list:
            # Extract the fields of each education entry (degree) and display them
            institution = education.institution if education.institution else "Not found"
            degree = education.degree if education.degree else "Not found"
            year = education.graduation_date if education.graduation_date else "Not found"
            details = education.details if education.details else []
            formatted_details = ". ".join(details) if details else "No additional details provided."
            st.markdown(f"**{degree}**, {institution} ({year})")
            st.markdown(f"_Details_: {formatted_details}")

def list_experience(experience_list):
    """
    Display a single-level bulleted list of experiences.
    """
    if experience_list:
        st.subheader("Experience")
        for experience in experience_list:
            # Extract the fields of each experience entry and display them
            job_title = experience.role if experience.role else "Not found"
            company_name = experience.company if experience.company else "Not found"
            location = experience.location if experience.location else "Not found"
            start_date = experience.start_date if experience.start_date else "Not found"
            end_date = experience.end_date if experience.end_date else "Not found"
            responsibilities = experience.responsibilities if experience.responsibilities else ["Not found"]
            brief_responsibilities = ", ".join(responsibilities)
            st.markdown(
                f"- Worked as **{job_title}** from {start_date} to {end_date} in *{company_name}*, {location}, "
                f"where responsibilities included {brief_responsibilities}."
            )

See the following snapshots of the profile information extraction and job recommendations from the Streamlit app. The sample CV (Sample CV.pdf) can be found in the code repository.
