Predicting metadata for humanitarian datasets with LLMs part 2 — An alternative to fine-tuning


Source: GPT-4o

TL;DR

In the humanitarian response world there can be tens of thousands of tabular (CSV and Excel) datasets, many of which contain critical information for helping save lives. Data can be provided by hundreds of different organizations with different naming conventions, languages and data standards, so having information (metadata) about what each column in a table represents is essential for finding the right data and understanding how it fits together. Much of this metadata is set manually, which is time-consuming and error prone, so any automated method can have a real impact towards helping people. In this article we revisit a previous analysis, "Predicting Metadata of Humanitarian Datasets with GPT 3", to see how advances in the last 18 months open the way for more efficient and less time-consuming methods for setting metadata on tabular data.

Using metadata-tagged CSV and Excel datasets from the Humanitarian Data Exchange (HDX) we show that fine-tuning GPT-4o-mini works well for predicting Humanitarian Exchange Language (HXL) tags and attributes for the most common tags related to locations and dates. However, for less well-represented tags and attributes the technique can be limited due to poor-quality training data, where humans have made mistakes in manually labeling data or simply aren't using all possible HXL metadata combinations. It also has the limitation of not being able to adjust when the metadata standard changes, as the training data would not reflect those changes.

Given that more powerful LLMs are now available, we tested a technique to directly prompt GPT-4o or GPT-4o-mini rather than fine-tuning, providing the full HXL core schema definition in the system prompt now that larger context windows are available. This approach was shown to be more accurate than fine-tuning when using GPT-4o, able to support rarer HXL tags and attributes, and requiring no custom training data, making it easier to manage and deploy. It is however more expensive, though not if using GPT-4o-mini, albeit with a slight decrease in performance. Using this approach we provide a simple Python class in a GitHub Gist that can be used in data processing pipelines to automatically add HXL metadata tags and attributes to tabular datasets.

About 18 months ago I wrote a blog post Predicting Metadata of Humanitarian Datasets with GPT 3.

That’s right, with GPT 3, not even 3.5! 🙂

Even so, back then Large Language Model (LLM) fine-tuning produced great performance for predicting Humanitarian Exchange Language (HXL) metadata fields for tabular datasets on the amazing Humanitarian Data Exchange (HDX). In that study, the training data represented the distribution of HXL data on HDX and so was comprised of the most common tags relating to locations and dates. These are very important for linking different datasets together in location and time, an important factor in using data to optimize humanitarian response.

The LLM field has since advanced … a LOT.

So in this article we'll revisit the technique, expand it to cover less frequent HXL tags and attributes, and explore other options now available to us for situations where a complex, high-cardinality taxonomy needs to be applied to data. We will also explore the ability to predict less frequent HXL standard tags and attributes not currently represented in the human-labeled training data.

You can follow along with this analysis by opening these notebooks in Google Colab or running them locally:

Please refer to the README in the repo for installation instructions.

For this study, and with help from the HDX team, we'll use data extracted from the HDX platform using a crawler process they run to track the usage of HXL metadata tags and attributes on the platform. You can find great HXL resources on GitHub, but if you would like to follow along with this analysis I have also saved the source data on Google Drive, as the crawler will take days to process the hundreds of thousands of tabular datasets on HDX.

The data looks like this, with one row per HXL-tagged table column …

Example of data used in this study, with a row per tabular data column.

The HXL postcard is a really great overview of the most common HXL tags and attributes in the core schema. For our analysis, we'll apply the full standard as found on HDX, which provides a spreadsheet of supported tags and attributes …

Excerpt of the “Core HXL Schema” used for this study, as found on the Humanitarian Data Exchange

The generate-test-train-data.ipynb notebook provides all of the steps taken to create the test and training datasets, but here are some key points to note:

1. Removal of repeated HXL data from automated pipelines

In this study, I removed duplicate data created by automated pipelines that upload data to HDX, by using an MD5 hash of the column names in each tabular dataset (CSV and Excel files). For example, a CSV file of population statistics created by an organization is often very similar for each country-specific CSV or Excel file, so we only take one example. This has a balancing effect on the data, providing more variation of HXL tags and attributes by removing very similar repeated data.
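
A minimal sketch of this deduplication step might look like the following, assuming a dataframe with one row per column and hypothetical 'resource' and 'column_name' fields …

import hashlib

import pandas as pd

def drop_repeat_resources(df, resource_col="resource", column_col="column_name"):
    """Keep only one resource per unique set of column names, using an MD5 fingerprint."""
    # MD5 hash of the sorted column names for each resource
    fingerprints = (
        df.groupby(resource_col)[column_col]
        .apply(lambda cols: hashlib.md5("|".join(sorted(cols)).encode("utf-8")).hexdigest())
        .rename("fingerprint")
    )
    # Keep the first resource seen for each fingerprint, then filter the rows
    keep = fingerprints.reset_index().drop_duplicates("fingerprint")[resource_col]
    return df[df[resource_col].isin(keep)]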

2. Constraining data to valid HXL

About 50% of the HDX data with HXL tags uses a tag or attribute which is not specified in the HXL Core Schema, so this data is removed from the training and test sets.
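
A minimal sketch of this validation step, assuming the approved core hashtags and attributes have been loaded from the HXL schema spreadsheet into APPROVED_HXL_SCHEMA and that the dataframe has an illustrative 'hxl_tag' field …

def is_valid_hxl(hxl_string, approved):
    """Return True if the tag and all attributes are in the approved core schema."""
    tokens = hxl_string.split("+")
    tag, attributes = tokens[0], [f"+{t}" for t in tokens[1:]]
    return tag in approved and all(a in approved for a in attributes)

# Keep only rows whose HXL is fully specified in the core schema
df = df[df["hxl_tag"].apply(lambda h: is_valid_hxl(h, APPROVED_HXL_SCHEMA))]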

3. Data enrichment

As a (mostly!) human being, when deciding what HXL tags and attributes to use on a column, I take a peek at the data for that column and also the data as a whole in the table. For this analysis we do the same for the LLM fine-tuning and prompt data, adding in data excerpts for each column. A table description is also added using an LLM (GPT-3.5-Turbo) summary of the data to make them consistent, as summaries on HDX can vary in form, ranging from pages to a few words.
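
A rough sketch of this enrichment, assuming an OpenAI client and illustrative helper names, might look like this …

from openai import OpenAI

client = OpenAI()

def build_column_excerpt(df, column, num_examples=11):
    """Take a short excerpt of values from a column to include in the prompt."""
    return df[column].dropna().astype(str).head(num_examples).tolist()

def summarize_table(df, model="gpt-3.5-turbo"):
    """Ask the LLM for a short, consistent description of the whole table."""
    sample = df.head(20).to_csv(index=False)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You summarize tabular datasets in a few sentences."},
            {"role": "user", "content": f"Summarize this table:\n\n{sample}"},
        ],
        temperature=0.1,
    )
    return response.choices[0].message.content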

4. Carefully splitting data to create train/test sets

Many machine learning pipelines split data randomly to create training and test sets. However, for HDX data this could lead to columns and files from the same organization being in both train and test. I felt this was a bit too easy for testing predictions, and so instead split the data by organization to ensure organizations in the test set were not in the training data. Additionally, subsidiaries of the same parent organization, e.g. "ocha-iraq" and "ocha-libya", were not allowed to be in both the training and test sets, again to make the predictions more realistic. My aim was to test prediction on organizations as if their data had never been seen before.
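
A minimal sketch of an organization-aware split, assuming an 'org' field already normalized to the parent organization (so that e.g. "ocha-iraq" and "ocha-libya" share a group) …

from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["org"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

# No organization (or its subsidiaries) appears in both sets
assert set(train_df["org"]).isdisjoint(set(test_df["org"]))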

After all of the above, and down-sampling to save costs, we're left with 2,883 rows in the training set and 485 rows in the test set.

In my original article I opted for using a completion model, but with the release of GPT-4o-mini I instead generated prompts appropriate for fine-tuning a chat model (see here for more information about the available models).

Each prompt has the form …

{
    "messages": [
        {
            "role": "system",
            "content": ""
        },
        {
            "role": "user",
            "content": ""
        },
        {
            "role": "assistant",
            "content": ""
        }
    ]
}

Note: The above has been formatted for clarity, but JSONL would have everything on a single line per record.

Using the data excerpts, LLM-generated table descriptions and column names we collated, we can now generate prompts which look like this …

{
    "messages": [
        {
            "role": "system",
            "content": "You are an assistant that replies with HXL tags and attributes"
        },
        {
            "role": "user",
            "content": "What are the HXL tags and attributes for a column with these details?
                        resource_name='admin1-summaries-earthquake.csv';
                        dataset_description='The dataset contains earthquake data for various
                        administrative regions in Afghanistan, including country name, admin1 name,
                        latitude, longitude, aggregation type, indicator name, and indicator value.
                        The data includes maximum earthquake values recorded in different regions,
                        with corresponding latitude and longitude coordinates. The dataset provides
                        insights into the seismic activity in different administrative areas of
                        Afghanistan.';
                        column_name:'indicator';
                        examples: ['earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake']"
        },
        {
            "role": "assistant",
            "content": "#indicator+name"
        }
    ]
}
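
For reference, a minimal sketch of how such JSONL records might be written, assuming a dataframe of enriched columns with illustrative field names …

import json

def make_prompt_record(row):
    """Build one chat fine-tuning record from a single column's metadata."""
    user_content = (
        "What are the HXL tags and attributes for a column with these details? "
        f"resource_name='{row['resource_name']}'; "
        f"dataset_description='{row['description']}'; "
        f"column_name:'{row['column_name']}'; "
        f"examples: {row['examples']}"
    )
    return {
        "messages": [
            {"role": "system", "content": "You are an assistant that replies with HXL tags and attributes"},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": row["hxl_tag"]},
        ]
    }

with open("hxl_chat_prompts_train.jsonl", "w") as f:
    for _, row in train_df.iterrows():
        f.write(json.dumps(make_prompt_record(row)) + "\n")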

We now have test and training files in the right format for fine-tuning an OpenAI chat model, so let's tune our model …

# Imports used throughout this analysis
import json
import sys
import time

import pandas as pd
from openai import OpenAI
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

client = OpenAI()

def fine_tune_model(train_file, model_name="gpt-4o-mini"):
    """
    Fine-tune an OpenAI model using training data.

    Args:
        train_file (str): The file containing the prompts to use for fine-tuning.
        model_name (str): The name of the model to fine-tune. Default is "gpt-4o-mini".

    Returns:
        str: The ID of the fine-tuned model.
    """

    # Upload file to OpenAI for fine-tuning
    file = client.files.create(
        file=open(train_file, "rb"),
        purpose="fine-tune"
    )
    file_id = file.id
    print(f"Uploaded training file with ID: {file_id}")

    # Start the fine-tuning job
    ft = client.fine_tuning.jobs.create(
        training_file=file_id,
        model=model_name
    )
    ft_id = ft.id
    print(f"Fine-tuning job started with ID: {ft_id}")

    # Monitor the status of the fine-tuning job
    ft_result = client.fine_tuning.jobs.retrieve(ft_id)
    while ft_result.status != 'succeeded':
        print(f"Current status: {ft_result.status}")
        time.sleep(120)  # Wait for 120 seconds before checking again
        ft_result = client.fine_tuning.jobs.retrieve(ft_id)
        if 'failed' in ft_result.status.lower():
            sys.exit()

    print(f"Fine-tuning job {ft_id} succeeded!")

    # Retrieve the fine-tuned model
    fine_tuned_model = ft_result.fine_tuned_model
    print(f"Fine-tuned model: {fine_tuned_model}")

    return fine_tuned_model

model = fine_tune_model("hxl_chat_prompts_train.jsonl", model_name="gpt-4o-mini-2024-07-18")

In the above we're using the new GPT-4o-mini model, which OpenAI is currently offering for free to fine-tune …

“Now through September 23, GPT-4o mini is free to fine-tune up to a daily limit of 2M training tokens. Overages over 2M training tokens will be charged at $3.00/1M tokens. Starting September 24, fine-tuning training will cost $3.00/1M tokens. Check out the fine-tuning docs for more details on free access.”

Even at $3.00 per million tokens, the costs are quite low for this task, coming out at about $7 per fine-tuning run for just over 2 million tokens in the training file. Bearing in mind that fine-tuning should be a rare event for this particular task, once we have such a model it can be reused.
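
As a rough back-of-the-envelope check (the exact token count is approximate) …

# Rough fine-tuning cost estimate at $3.00 per 1M training tokens
training_tokens = 2_300_000  # just over 2 million tokens
cost = training_tokens / 1_000_000 * 3.00
print(f"Estimated fine-tuning cost: ${cost:.2f}")  # ~$6.90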

The fine-tuning produces the following output …

Uploaded training file with ID: file-XXXXXXXXXXXXXXX
Fine-tuning job started with ID: ftjob-XXXXXXXXXXXXXXX
Current status: validating_files
Current status: validating_files
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Fine-tuning job ftjob-XXXXXXXXXXXXXXX succeeded!
Fine-tuned model: ft:gpt-4o-mini-2024-07-18::XXXXXXX

It took about 45 minutes.

Now that we have a nice new shiny fine-tuned model for predicting HXL tags and attributes, we can use the test file to take it for a spin …

def make_chat_predictions(prompts, model, temperature=0.1, max_tokens=13):
    """
    Generate chat predictions based on given prompts using the OpenAI chat model.

    Args:
        prompts (list): A list of prompts, where each prompt is a dictionary containing a list of messages.
                        Each message in the list has a 'role' (either 'system', 'user', or 'assistant') and 'content'.
        model (str): The name or ID of the OpenAI chat model to use for predictions.
        temperature (float, optional): Controls the randomness of the predictions. Higher values (e.g., 0.5) make the
                                       output more random, while lower values (e.g., 0.1) make it more deterministic.
                                       Defaults to 0.1.
        max_tokens (int, optional): The maximum number of tokens in the predicted response. Defaults to 13.

    Returns:
        pandas.DataFrame: A DataFrame containing the results of the chat predictions. Each row in the DataFrame
                          corresponds to a prompt and includes the prompt messages, the actual message, and the
                          predicted message.

    """
    results = []
    for p in prompts:
        actual = p["messages"][-1]["content"]
        p["messages"] = p["messages"][0:2]
        completion = client.chat.completions.create(
            model=model,
            messages=p["messages"],
            temperature=temperature,
            max_tokens=max_tokens
        )
        predicted = completion.choices[0].message.content
        predicted = filter_for_schema(predicted)

        res = {
            "prompt": p["messages"],
            "actual": actual,
            "predicted": predicted
        }

        print(f"Predicted: {predicted}; Actual: {actual}")

        results.append(res)

    results = pd.DataFrame(results)

    return results

def filter_for_schema(text):
    """
    Filters the input text to extract approved HXL schema tokens.

    Args:
        text (str): The input text to be filtered.

    Returns:
        str: The filtered text containing only approved HXL schema tokens.
    """

    if " " in text:
        text = text.replace(" ", "")

    tokens_raw = text.split("+")
    tokens = [tokens_raw[0]]
    for t in tokens_raw[1:]:
        tokens.append(f"+{t}")

    filtered = []
    for t in tokens:
        if t in APPROVED_HXL_SCHEMA:
            if t not in filtered:
                filtered.append(t)
    filtered = "".join(filtered)

    if len(filtered) > 0 and filtered[0] != '#':
        filtered = ""

    return filtered

def output_prediction_metrics(results, prediction_field="predicted", actual_field="actual"):
    """
    Prints out model performance report for HXL tag prediction. Metrics are for
    just predicting tags, as well as predicting tags and attributes.

    Parameters
    ----------
    results : dataframe
        Dataframe of results
    prediction_field : str
        Field name of element with prediction. Handy for comparing raw and post-processed predictions.
    actual_field: str
        Field name of the actual result for comparison with prediction
    """
    y_test = []
    y_pred = []
    y_justtag_test = []
    y_justtag_pred = []
    for index, r in results.iterrows():
        if actual_field not in r and prediction_field not in r:
            print("Provided results do not contain expected values.")
            sys.exit()
        y_pred.append(r[prediction_field])
        y_test.append(r[actual_field])
        actual_tag = r[actual_field].split("+")[0]
        predicted_tag = r[prediction_field].split("+")[0]
        y_justtag_test.append(actual_tag)
        y_justtag_pred.append(predicted_tag)

    print(f"LLM results for {prediction_field}, {len(results)} predictions ...")
    print("\nJust HXL tags ...\n")
    print(f"Accuracy: {round(accuracy_score(y_justtag_test, y_justtag_pred),2)}")
    print(
        f"Precision: {round(precision_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"Recall: {round(recall_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"F1: {round(f1_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0),2)}"
    )

    print(f"\nTags and attributes with {prediction_field} ...\n")
    print(f"Accuracy: {round(accuracy_score(y_test, y_pred),2)}")
    print(
        f"Precision: {round(precision_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"Recall: {round(recall_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"F1: {round(f1_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )

    return

with open(TEST_FILE) as f:
    X_test = [json.loads(line) for line in f]

results = make_chat_predictions(X_test, model)

output_prediction_metrics(results)

print("Done")

Note in the above that all predictions are filtered for allowed tags and attributes as defined in the HXL standard.

This gives the following results …

LLM results for predicted, 458 predictions ...

Just HXL tags ...

Accuracy: 0.83
Precision: 0.85
Recall: 0.83
F1: 0.82

Tags and attributes with predicted ...

Accuracy: 0.61
Precision: 0.6
Recall: 0.61
F1: 0.57

‘Just HXL tags’ means predicting only the first part of the HXL, for example if the full HXL is #affected+infected+f, the model is scored on getting the #affected part correct. ‘Tags and attributes’ means predicting the full HXL string, i.e. ‘#affected+infected+f’, a much harder challenge due to all of the combinations possible.
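
For example, the tag-only comparison simply takes everything before the first '+' …

full_hxl = "#affected+infected+f"
tag_only = full_hxl.split("+")[0]  # "#affected", used for the 'Just HXL tags' metrics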

The performance isn't perfect, but it's not that bad, especially as we have balanced the dataset to reduce the number of location and date tags and attributes (i.e. made this study a bit more difficult). There are tens of thousands of humanitarian response tables without HXL metadata, so even the above performance would likely add value.

Let’s look into cases where predictions didn’t agree with human-labeled data …

The predictions were saved to a spreadsheet, and I manually went through most of the predictions that didn't agree with the labels. You can find this analysis here, summarized below …

What's interesting is that in some cases the LLM is actually correct, for example in adding additional HXL attributes which the human-labeled data doesn't include. There are also cases where the human-labeled HXL was perfectly reasonable, but the LLM predicted another tag or attribute that could also be interpreted as correct. For example a #region can also be an #adm1 in some countries, and whether something is an +id or +code is often difficult to decide; both can be appropriate.

Using the above categories, I created a new test set where the expected HXL tags were corrected. On re-running the prediction we get improved results …


Just HXL tags ...

Accuracy: 0.88
Precision: 0.88
Recall: 0.88
F1: 0.88

Tags and attributes with predicted ...

Accuracy: 0.66
Precision: 0.71
Recall: 0.66
F1: 0.66

The above shows that the human-labeled data itself can be incorrect. The HXL standard is designed excellently, but can be a challenge to memorize for developers and data scientists when setting HXL tags and attributes on data. There are some amazing tools already provided by the HXL team, but sometimes the HXL is still incorrect. This introduces a problem for the fine-tuning approach, which relies on this human-labeled data for training, especially for less well-represented tags and attributes that humans are not using very often. It also has the limitation of not being able to adjust when the metadata standard changes, as the training data would not reflect those changes.

Since the initial analysis 18 months ago, various LLM providers have advanced their models significantly. OpenAI of course released GPT-4o as their flagship product, which importantly has a context window of 128k tokens and is another data point suggesting the costs of foundation models are decreasing (see for example GPT-4-Turbo compared with GPT-4o here). Given these factors, I wondered …

If models are becoming more powerful and less expensive to use, could we avoid fine-tuning altogether and use them to predict HXL tags and attributes by prompting alone?

Not only could this mean less engineering work to clean data and fine-tune models, it would have a huge advantage in being able to include HXL tags and attributes which are not included in the human-labeled training data but are part of the HXL standard. This is one potentially huge advantage of powerful LLMs, being able to classify with zero- and few-shot prompting.

Models like GPT-4o are trained on web data, so I thought I'd first do a test using one of our prompts to see if it already knew everything there was to know about HXL tags …
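
A quick sanity check is just to send one of the test prompts with the minimal system prompt used for fine-tuning, something like the following sketch …

# Ask GPT-4o for HXL with no schema provided in the system prompt
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are an assistant that replies with HXL tags and attributes"},
        {"role": "user", "content": X_test[0]["messages"][1]["content"]},
    ],
    temperature=0.1,
)
print(completion.choices[0].message.content)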

What we see is that it seems to know about HXL syntax, but the answer is wrong (the correct answer is ‘#affected+infected’), and it has chosen tags and attributes that are not in the HXL standard. It's actually much like what we see with human-tagged HXL.

How about we provide the most important parts of the HXL standard in the system prompt?

def generate_hxl_standard_prompt(local_data_file):
    """
    Generate a standard prompt for predicting Humanitarian Markup Language (HXL) tags and attributes.

    Args:
        local_data_file (str): The path to the local data file containing core hashtags and attributes.

    Returns:
        str: The generated HXL standard prompt.

    """

    core_hashtags = pd.read_excel(local_data_file, sheet_name='Core hashtags')
    core_hashtags = core_hashtags.loc[core_hashtags["Release status"] == "Released"]
    core_hashtags = core_hashtags[["Hashtag", "Hashtag long description", "Sample HXL"]]

    core_attributes = pd.read_excel(local_data_file, sheet_name='Core attributes')
    core_attributes = core_attributes.loc[core_attributes["Status"] == "Released"]
    core_attributes = core_attributes[["Attribute", "Attribute long description", "Suggested hashtags (selected)"]]

    print(core_hashtags.shape)
    print(core_attributes.shape)

    core_hashtags = core_hashtags.to_dict(orient='records')
    core_attributes = core_attributes.to_dict(orient='records')

    hxl_prompt = f"""
You are an AI assistant that predicts Humanitarian Markup Language (HXL) tags and attributes for columns of data where the HXL standard is defined as follows:

CORE HASHTAGS:

{json.dumps(core_hashtags, indent=4)}

CORE ATTRIBUTES:

{json.dumps(core_attributes, indent=4)}

Key points:

- ALWAYS predict hash tags
- NEVER predict a tag which is not a valid core hashtag
- NEVER start with an attribute, you must always start with a core hashtag
- Always try to predict an attribute if possible
- Don't use attribute +code if the data examples are human readable names

You must return your result as a JSON record with the fields 'predicted' and 'reasoning', both of type string.

"""

    print(len(hxl_prompt.split(" ")))
    print(hxl_prompt)
    return hxl_prompt

This gives us a prompt like this …

You are an AI assistant that predicts Humanitarian Markup Language (HXL) tags and attributes for columns of data where the HXL standard is defined as follows:

CORE HASHTAGS:

[
{
"Hashtag": "#access",
"Hashtag long description": "Accessiblity and constraints on access to a market, distribution point, facility, etc.",
"Sample HXL": "#access +type"
},
{
"Hashtag": "#activity",
"Hashtag long description": "A programme, project, or other activity. This hashtag applies to all levels; use the attributes +activity, +project, or +programme to distinguish different hierarchical levels.",
"Sample HXL": "#activity +project"
},
{
"Hashtag": "#adm1",
"Hashtag long description": "Top-level subnational administrative area (e.g. a governorate in Syria).",
"Sample HXL": "#adm1 +code"
},
{
"Hashtag": "#adm2",
"Hashtag long description": "Second-level subnational administrative area (e.g. a subdivision in Bangladesh).",
"Sample HXL": "#adm2 +name"
},
{
"Hashtag": "#adm3",
"Hashtag long description": "Third-level subnational administrative area (e.g. a subdistrict in Afghanistan).",
"Sample HXL": "#adm3 +code"
},
{
"Hashtag": "#adm4",
"Hashtag long description": "Fourth-level subnational administrative area (e.g. a barangay in the Philippines).",
"Sample HXL": "#adm4 +name"
},
{
"Hashtag": "#adm5",
"Hashtag long description": "Fifth-level subnational administrative area (e.g. a ward of a city).",
"Sample HXL": "#adm5 +code"
},
{
"Hashtag": "#affected",
"Hashtag long description": "Number of people or households affected by an emergency. Subset of #population; superset of #inneed.",
"Sample HXL": "#affected +f +children"
},
{
"Hashtag": "#beneficiary",
"Hashtag long description": "General (non-numeric) information about a person or group meant to benefit from aid activities, e.g. "lactating women".",
"Sample HXL": "#beneficiary +name"
},
{
"Hashtag": "#capacity",
"Hashtag long description": "The response capacity of the entity being described (e.g. "25 beds").",
"Sample HXL": "#capacity +num"
},

... Truncated for brevity

},
{
"Hashtag": "#targeted",
"Hashtag long description": "Number of people or households targeted for humanitarian assistance. Subset of #inneed; superset of #reached.",
"Sample HXL": "#targeted +f +adult"
},
{
"Hashtag": "#value",
"Hashtag long description": "A monetary value, such as the price of goods in a market, a project budget, or the amount of cash transferred to beneficiaries. May be used together with #currency in financial or cash data.",
"Sample HXL": "#value +transfer"
}
]

CORE ATTRIBUTES:

[
{
"Attribute": "+abducted",
"Attribute long description": "Hashtag refers to people who have been abducted.",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached"
},
{
"Attribute": "+activity",
"Attribute long description": "The implementers classify this activity as an "activity" proper (may imply different hierarchical levels in different contexts).",
"Suggested hashtags (selected)": "#activity"
},
{
"Attribute": "+adolescents",
"Attribute long description": "Adolescents, loosely defined (precise age range varies); may overlap +children and +adult. You can optionally create custom attributes in addition to this to add precise age ranges, e.g. "+adolescents +age12_17".",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
{
"Attribute": "+adults",
"Attribute long description": "Adults, loosely defined (precise age range varies); may overlap +adolescents and +elderly. You can optionally create custom attributes in addition to this to add precise age ranges, e.g. "+adults +age18_64".",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
{
"Attribute": "+approved",
"Attribute long description": "Date or time when something was approved.",
"Suggested hashtags (selected)": "#date"
},
{
"Attribute": "+bounds",
"Attribute long description": "Boundary data (e.g. inline GeoJSON).",
"Suggested hashtags (selected)": "#geo"
},
{
"Attribute": "+budget",
"Attribute long description": "Used with #value to indicate that the amount is planned/approved/budgeted rather than actually spent.",
"Suggested hashtags (selected)": "#value"
},
{
"Attribute": "+canceled",
"Attribute long description": "Date or time when something (e.g. an #activity) was canceled.",
"Suggested hashtags (selected)": "#date"
},
{
"Attribute": "+children",
"Attribute long description": "The associated hashtag applies to non-adults, loosely defined (precise age range varies; may overlap +infants and +adolescents). You can optionally create custom attributes in addition to this to add precise age ranges, e.g. "+children +age3_11".",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
{
"Attribute": "+cluster",
"Attribute long description": "Identifies a sector as a formal IASC humanitarian cluster.",
"Suggested hashtags (selected)": "#sector"
},
{
"Attribute": "+code",
"Attribute long description": "A unique, machine-readable code.",
"Suggested hashtags (selected)": "#region, #country, #adm1, #adm2, #adm3, #adm4, #adm5, #loc, #beneficiary, #activity, #org, #sector, #subsector, #indicator, #output, #crisis, #cause, #impact, #severity, #service, #need, #currency, #item, #need, #service, #channel, #modality, #event, #group, #status"
},
{
"Attribute": "+converted",
"Attribute long description": "Date or time used for converting a monetary value to another currency.",
"Suggested hashtags (selected)": "#date"
},
{
"Attribute": "+coord",
"Attribute long description": "Geodetic coordinates (lat+lon together).",
"Suggested hashtags (selected)": "#geo"
},
{
"Attribute": "+dest",
"Attribute long description": "Place of destination (intended or actual).",
"Suggested hashtags (selected)": "#region, #country, #adm1, #adm2, #adm3, #adm4, #adm5, #loc"
},
{
"Attribute": "+displaced",
"Attribute long description": "Displaced people or households. Refers to all types of displacement: use +idps or +refugees to be more specific.",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
{
"Attribute": "+elderly",
"Attribute long description": "Elderly people, loosely defined (precise age range varies). May overlap +adults. You can optionally create custom attributes in addition to this to add precise age ranges, e.g. "+elderly +age65plus".",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},

... Truncated for brevity

{
"Attribute": "+url",
"Attribute long description": "The data consists of web links related to the main hashtag (e.g. for an #org, #service, #activity, #loc, etc).",
"Suggested hashtags (selected)": "#contact, #org, #activity, #service, #meta"
},
{
"Attribute": "+used",
"Attribute long description": "Refers to a #service, #item, etc. that affected people have actually consumed or otherwise taken advantage of.",
"Suggested hashtags (selected)": "#service, #item"
}
]

Key points:

- ALWAYS predict hash tags
- NEVER predict a tag which is not a valid core hashtag
- NEVER start with an attribute, you must always start with a core hashtag
- Always try to predict an attribute if possible

You must return your result as a JSON record with the fields 'predicted' and 'reasoning', both of type string.

It’s pretty long (the above has been truncated), but encapsulates the HXL standard.

Another advantage of the direct-prompting method is that we can also ask the LLM to provide its reasoning when predicting HXL. This can of course include hallucination, but I've always found it useful for refining prompts.

For the user prompt, we'll use the same information that we used for fine-tuning, including the data excerpt and LLM-generated table summary …

What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC Appeals Data for South Sudan8.csv'; 
dataset_description='The dataset contains information on various
appeals and events related to South Sudan,
including details such as the type of appeal,
status, sector, amount requested and funded,
start and end dates, as well as country-specific
information like country code, region, and average
household size. The data includes appeals for
different crises such as floods, population
movements, cholera outbreaks, and Ebola preparedness,
with details on beneficiaries and confirmation needs.
The dataset also includes metadata such as IDs,
names, and translation modules for countries and regions.';
column_name:'aid';
examples: ['18401', '17770', '17721', '16858', '15268', '15113', '14826', '14230', '12788', '9286', '8561']

Putting it all together, and prompting both GPT-4o-mini and GPT-4o for comparison …

def call_gpt(prompt, system_prompt, model, temperature, top_p, max_tokens):
    """
    Calls the GPT model to generate a response based on the given prompt and system prompt.

    Args:
        prompt (str): The user's input prompt.
        system_prompt (str): The system's input prompt.
        model (str): The name or ID of the GPT model to use.
        temperature (float): Controls the randomness of the generated output. Higher values (e.g., 0.8) make the output more random, while lower values (e.g., 0.2) make it more deterministic.
        top_p (float): Controls the diversity of the generated output. Higher values (e.g., 0.8) make the output more diverse, while lower values (e.g., 0.2) make it more focused.
        max_tokens (int): The maximum number of tokens to generate in the response.

    Returns:
        dict or None: The generated response as a dictionary object, or None if an error occurred during generation.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
        stream=False,
        response_format={"type": "json_object"}
    )

    result = response.choices[0].message.content
    result = result.replace("```json", "").replace("```", "")
    try:
        result = json.loads(result)
        result["predicted"] = result["predicted"].replace(" ", "")
    except:
        print(result)
        result = None
    return result

import ast

def make_prompt_predictions(prompts, model, temperature=0.1, top_p=0.1,
                            max_tokens=2000, debug=False, actual_field="actual"):
    """
    Generate predictions for a given set of prompts using the specified model.

    Args:
        prompts (pandas.DataFrame): A DataFrame containing the prompts to generate predictions for.
        model (str): The name of the model to use for prediction.
        temperature (float, optional): The temperature parameter for the model's sampling. Defaults to 0.1.
        top_p (float, optional): The top-p parameter for the model's sampling. Defaults to 0.1.
        max_tokens (int, optional): The maximum number of tokens to generate for each prompt. Defaults to 2000.
        debug (bool, optional): Whether to print debug information during prediction. Defaults to False.
        actual_field (str, optional): The name of the column in the prompts DataFrame that contains the actual values. Defaults to "actual".

    Returns:
        pandas.DataFrame: A DataFrame containing the results of the predictions, including the prompt, actual value, predicted value, and reasoning.

    """

    num_prompts = len(prompts)
    print(f"Number of prompts: {num_prompts}")

    results = []
    for index, p in prompts.iterrows():

        if index % 50 == 0:
            print(f"{index/num_prompts*100:.2f}% complete")

        prompt = p["prompt"]
        prompt = ast.literal_eval(prompt)
        prompt = prompt[1]["content"]
        actual = p[actual_field]

        result = call_gpt(prompt, hxl_prompt, model, temperature, top_p, max_tokens)

        if result is None:
            print(" !!!!! No LLM result")
            predicted = ""
            reasoning = ""
        else:
            predicted = result["predicted"]
            reasoning = result["reasoning"]

        if debug is True:
            print(f"Actual: {actual}; Predicted: {predicted}; Reasoning: {reasoning}")

        results.append({
            "prompt": prompt,
            "actual": actual,
            "predicted": predicted,
            "reasoning": reasoning
        })

    results = pd.DataFrame(results)

    print(f"\n\n===================== {model} Results =========================\n\n")
    output_prediction_metrics(results)
    print(f"\n\n=================================================================")

    results["match"] = results['predicted'] == results['actual']
    results.to_excel(f"{LOCAL_DATA_DIR}/hxl-metadata-prompting-only-prediction-{model}-results.xlsx", index=False)

    return results

for model in ["gpt-4o-mini", "gpt-4o"]:
    print(f"Model: {model}")
    results = make_prompt_predictions(X_test, model, temperature=0.1, top_p=0.1, max_tokens=2000)

We get …

===================== gpt-4o-mini Results =========================

LLM results for predicted, 458 predictions ...

Just HXL tags ...

Accuracy: 0.77
Precision: 0.83
Recall: 0.77
F1: 0.77

Tags and attributes with predicted ...

Accuracy: 0.53
Precision: 0.54
Recall: 0.53
F1: 0.5

===================== gpt-4o Results =========================

LLM results for predicted, 458 predictions ...

Just HXL tags ...

Accuracy: 0.86
Precision: 0.86
Recall: 0.86
F1: 0.85

Tags and attributes with predicted ...

Accuracy: 0.71
Precision: 0.7
Recall: 0.71
F1: 0.69

=================================================================

As a reminder, the fine-tuned model produced the following results …

Just HXL tags ...

Accuracy: 0.83
Precision: 0.85
Recall: 0.83
F1: 0.82

Tags and attributes with predicted ...

Accuracy: 0.61
Precision: 0.6
Recall: 0.61
F1: 0.57

How does prompting-only GPT-4o compare with GPT-4o-mini?

From the above, we see that GPT-4o-mini prompting-only predicts just tags with 77% accuracy, which is lower than GPT-4o-mini fine-tuning (83%) and GPT-4o prompting-only (86%). That said, the performance is still good and would improve HXL coverage even if used as-is.

How does prompting-only compare with the fine-tuned model?

GPT-4o prompting-only gave the best results of all models, with 86% accuracy on tags and 71% on tags and attributes. In fact, the performance could well be higher after a bit more analysis of the test data to correct incorrect human-labeled tags.

Let's take a closer look at the cases where GPT-4o got it wrong …

import pprint

df = pd.read_excel(f"{LOCAL_DATA_DIR}/hxl-metadata-prompting-only-prediction-gpt-4o-results.xlsx")

breaks = df[df["match"] == False]
print(breaks.shape)

for index, row in breaks.iterrows():
    print("\n======================================== ")
    pprint.pp(f"\nPrompt: {row['prompt']}")
    print()
    print(f"Actual", row["actual"])
    print(f"Predicted", row["predicted"])
    print()
    pprint.pp(f'Reasoning: \n{row["reasoning"]}')

('\n'
 'Prompt: What are the HXL tags and attributes for a column with these '
 'details? '
 "resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC "
 "Appeals Data for South Sudan8.csv'; dataset_description='The dataset "
 'contains information on various appeals and events related to South Sudan, '
 'including details such as the type of appeal, status, sector, amount '
 'requested and funded, start and end dates, as well as country-specific '
 'information like country code, region, and average household size. The data '
 'includes appeals for different crises such as floods, population movements, '
 'cholera outbreaks, and Ebola preparedness, with details on beneficiaries and '
 'confirmation needs. The dataset also includes metadata such as IDs, names, '
 "and translation modules for countries and regions.'; column_name:'dtype.id'; "
 "examples: ['12', '5', '1', '1', '12', '12', '1', '6', '1', '1', '7']")

Actual #cause+id
Predicted #meta+id

('Reasoning: \n'
 "The column 'dtype.id' contains numeric identifiers (e.g., '12', '5', '1') "
 'which are likely to be internal identifiers for data records. According to '
 'the HXL standard, the appropriate hashtag for internal identifiers is '
 "'#meta' with the attribute '+id'.")

========================================
('\n'
 'Prompt: What are the HXL tags and attributes for a column with these '
 'details? '
 "resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC "
 "Appeals Data for South Sudan8.csv'; dataset_description='The dataset "
 'contains information on various appeals and events related to South Sudan, '
 'including details such as the type of appeal, status, sector, amount '
 'requested and funded, start and end dates, as well as country-specific '
 'information like country code, region, and average household size. The data '
 'includes appeals for different crises such as floods, population movements, '
 'cholera outbreaks, and Ebola preparedness, with details on beneficiaries and '
 'confirmation needs. The dataset also includes metadata such as IDs, names, '
 "and translation modules for countries and regions.'; "
 "column_name:'dtype.name'; examples: ['Flood', 'Population Movement', "
 "'Epidemic', 'Epidemic', 'Flood', 'Flood', 'Epidemic', 'Complex Emergency', "
 "'Epidemic', 'Epidemic', 'Civil Unrest']")

Actual #cause+name
Predicted #event+type

('Reasoning: \n'
 "The examples provided in the column ('Flood', 'Population Movement', "
 "'Epidemic', 'Complex Emergency', 'Civil Unrest') describe different types of "
 'events or incidents within a crisis or emergency. According to the HXL '
 'standard, the appropriate hashtag for this type of data is #event, and the '
 'attribute +type is used to specify the type or category of the event.')

========================================
('\n'
 'Prompt: What are the HXL tags and attributes for a column with these '
 'details? '
 "resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC "
 "Appeals Data for South Sudan8.csv'; dataset_description='The dataset "
 'contains information on various appeals and events related to South Sudan, '
 'including details such as the type of appeal, status, sector, amount '
 'requested and funded, start and end dates, as well as country-specific '
 'information like country code, region, and average household size. The data '
 'includes appeals for different crises such as floods, population movements, '
 'cholera outbreaks, and Ebola preparedness, with details on beneficiaries and '
 'confirmation needs. The dataset also includes metadata such as IDs, names, '
 "and translation modules for countries and regions.'; "
 "column_name:'status_display'; examples: ['Active', 'Active', 'Closed', "
 "'Closed', 'Closed', 'Closed', 'Closed', 'Closed', 'Closed', 'Closed', "
 "'Closed']")

Actual #status+name
Predicted #status+code

('Reasoning: \n'
 "The column 'status_display' contains values such as 'Active' and 'Closed', "
 'which describe the status of appeals or events. The appropriate HXL hashtag '
 'for project or activity status is #status. Since the values are categorical '
 'and represent different statuses, the attribute +code is appropriate to '
 'indicate these status codes.')

========================================
('\n'
 'Prompt: What are the HXL tags and attributes for a column with these '
 'details? '
 "resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC "
 "Appeals Data for South Sudan8.csv'; dataset_description='The dataset "
 'contains information on various appeals and events related to South Sudan, '
 'including details such as the type of appeal, status, sector, amount '
 'requested and funded, start and end dates, as well as country-specific '
 'information like country code, region, and average household size. The data '
 'includes appeals for different crises such as floods, population movements, '
 'cholera outbreaks, and Ebola preparedness, with details on beneficiaries and '
 'confirmation needs. The dataset also includes metadata such as IDs, names, '
 "and translation modules for countries and regions.'; "
 "column_name:'region.id'; examples: ['0', '0', '0', '0', '0', '0', '0', '0', "
 "'0', '0', '0']")

Actual #adm1+code
Predicted #region+id

('Reasoning: \n'
 "The column 'region.id' contains numeric identifiers for regions, which "
 'aligns with the HXL tag #region and the attribute +id. The examples provided '
 'are all numeric, indicating that these are likely unique identifiers for '
 'regions.')

========================================

Notice how we now have a ‘Reasoning’ field indicating why the tags were chosen. This is helpful and is an important part of refining the prompt to improve performance.

From the sample above, we see some familiar scenarios that were found when analyzing the fine-tuned model's failed predictions …

  • +id and +code ambiguity
  • #region and #adm1 used interchangeably
  • #event versus more detailed tags like #cause

These seem to fall into the category where two tags are possible for a given column given their HXL definitions. But there are some real discrepancies which would need more investigation.

That said, using GPT-4o to predict HXL tags and attributes yields the best results, and I believe at an acceptable level given that a lot of data is missing HXL metadata altogether and many of the datasets which have it have incorrect tags and attributes.

Let’s see how costs compare with each technique and model …

import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """
    Returns the number of tokens in a text string using tiktoken.
    See: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

    Args:
        string (str): The text string to count the tokens for.
        encoding_name (str): The name of the encoding to use.

    Returns:
        num_tokens: The number of tokens in the text string.

    """
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

def calc_costs(data, model, method="prompting"):
    """
    Calculate token costs for a given dataset, method and model.
    Note: Only for inference costs, not fine-tuning

    Args:
        data (pandas.DataFrame): The data to get the tokens for.
        method (str, optional): The method to use. Defaults to "prompting".
        model (str): The model to use, eg "gpt-4o-mini"

    Returns:
        input_tokens: The number of input tokens.
        output_tokens: The number of output tokens.

    """
    # See https://openai.com/api/pricing/
    price = {
        "gpt-4o-mini": {
            "input": 0.150,
            "output": 0.600
        },
        "gpt-4o": {
            "input": 5.00,
            "output": 15.00
        }
    }
    input_tokens = 0
    output_tokens = 0
    for index, p in data.iterrows():
        prompt = p["prompt"]
        prompt = ast.literal_eval(prompt)
        input = prompt[1]["content"]
        # If prompting, we must include the system prompt
        if method == "prompting":
            input += " " + hxl_prompt
        output = p["Corrected actual"]
        input_tokens += num_tokens_from_string(str(input), "cl100k_base")
        output_tokens += num_tokens_from_string(str(output), "cl100k_base")

    input_cost = input_tokens / 1000000 * price[model]["input"]
    output_cost = output_tokens / 1000000 * price[model]["output"]

    print(f"\nFor {data.shape[0]} table columns where we predicted HXL tags ...")
    print(f"{method} prediction with model {model}, {input_tokens} input tokens = ${input_cost}")
    print(f"Fine-tuning prediction GPT-4o-mini {output_tokens} output tokens = ${output_cost}\n")

hxl_prompt = generate_hxl_standard_prompt(HXL_SCHEMA_LOCAL_FILE)
X_test2 = pd.read_excel(f"{LOCAL_DATA_DIR}/hxl-metadata-fine-tune-prediction-results-review.xlsx", sheet_name=0)

calc_costs(X_test2, method="fine-tuning", model="gpt-4o-mini")
calc_costs(X_test2, method="prompting", model="gpt-4o-mini")
calc_costs(X_test2, method="prompting", model="gpt-4o")

Which gives …

For 458 table columns where we predicted HXL tags ...
fine-tuning prediction with model gpt-4o-mini, 99738 input tokens = $0.014960699999999999
Fine-tuning prediction GPT-4o-mini 2001 output tokens = $0.0012006

For 458 table columns where we predicted HXL tags ...
prompting prediction with model gpt-4o-mini, 2688812 input tokens = $0.4033218
Fine-tuning prediction GPT-4o-mini 2001 output tokens = $0.0012006

For 458 table columns where we predicted HXL tags ...
prompting prediction with model gpt-4o, 2688812 input tokens = $13.44406
Fine-tuning prediction GPT-4o-mini 2001 output tokens = $0.030015000000000003

Note: the above is only for the inference cost; there will be a very small additional cost in generating table data summaries with GPT-3.5.

Given the test set, predicting HXL for 458 columns …

Fine-tuning:

As expected, inference costs for the fine-tuned GPT-4o-mini model (which cost about $7 to fine-tune) are very low, at about $0.02.

Prompting-only:

  • GPT-4o prompting-only is expensive, due to the HXL standard being passed into the system prompt each time, and comes out at $13.44.
  • GPT-4o-mini, albeit with reduced performance, is a more reasonable $0.40.

So ease of use comes at a price if using GPT-4o, but GPT-4o-mini is an attractive alternative.

Finally, it's worth noting that in many cases setting HXL tags won't need to be real-time, for example for a crawler process that corrects already-uploaded datasets. This would mean that the new OpenAI batch API could be used, reducing costs by 50%.
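
A minimal sketch of submitting the same predictions through the Batch API might look like this, where user_prompts and hxl_prompt are assumed to already exist (the request structure follows OpenAI's batch documentation) …

import json

# Write one request per column to a JSONL file
with open("hxl_batch_requests.jsonl", "w") as f:
    for i, user_prompt in enumerate(user_prompts):
        request = {
            "custom_id": f"hxl-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "system", "content": hxl_prompt},
                    {"role": "user", "content": user_prompt},
                ],
                "response_format": {"type": "json_object"},
            },
        }
        f.write(json.dumps(request) + "\n")

# Upload the file and start the batch job (results arrive within 24 hours)
batch_file = client.files.create(file=open("hxl_batch_requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id)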

Putting this all together, I created a GitHub gist hxl_utils.py. Check this out from GitHub and place the file in your current working directory.

Let's download a file to test it with …

import urllib.request

# See HDX for this file: https://data.humdata.org/dataset/sudan-acled-conflict-data
DATAFILE_URL = "https://data.humdata.org/dataset/5efad450-8b15-4867-b7b3-8a25b455eed8/resource/3352a0d8-2996-4e70-b618-3be58699be7f/download/sudan_hrp_civilian_targeting_events_and_fatalities_by_month-year_as-of-25jul2024.xlsx"
local_data_file = f"{LOCAL_DATA_DIR}/{DATAFILE_URL.split('/')[-1]}"

# Save data file locally
urllib.request.urlretrieve(DATAFILE_URL, local_data_file)

# Read it to get a dataframe
df = pd.read_excel(local_data_file, sheet_name=1)

And using this dataframe, let’s predict HXL tags …

from hxl_utils import HXLUtils

hxl_utils = HXLUtils(LOCAL_DATA_DIR, model="gpt-4o")
data = hxl_utils.add_hxl(df, "sudan_hrp_civilian_targeting_events_and_fatalities_by_month-year_as-of-25jul2024.xlsx")

print("\n\nAFTER: \n\n")
display(data)

And there we have it, some lovely HXL tags!

Let’s see how well GPT-4o-mini does …

hxl_utils = HXLUtils(LOCAL_DATA_DIR, model="gpt-4o-mini")
data = hxl_utils.add_hxl(df,"sudan_hrp_civilian_targeting_events_and_fatalities_by_month-year_as-of-25jul2024.xlsx")

Which gives …

Pretty good! GPT-4o gave "#affected+killed+num" for the last column, where GPT-4o-mini gave "#affected+num", but this could likely be resolved with some deft prompt engineering.
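
One hypothetical tweak would be to add an extra rule to the 'Key points' section of the system prompt, for example …

# Illustrative only: nudge the model towards the more specific attribute
hxl_prompt += "\n- For columns counting people killed, use #affected with the +killed attribute"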

Admittedly this wasn't a very difficult dataset, but it was able to accurately predict tags for events and fatalities, which are less frequent than locations and dates.

I think a big takeaway here is that the direct-prompting technique produces good results without the need for training. Yes, it's more expensive for inference, but perhaps not if a data scientist is required to curate incorrectly human-labeled fine-tuning data. This will depend on the organization and the metadata use case.

Here are some areas that might be considered in future work …

Improved test data

This analysis did a quick review of the test set to correct HXL tags which were incorrect in the data or had multiple possible values. More time could be spent on this; as always in machine learning, ground truth is key.

Prompt engineering and hyperparameter tuning

The above analysis uses very basic prompts with no real engineering or strategies applied; these could definitely be improved for better performance. With an evaluation set and a framework such as Promptflow, prompt variants could be tested. Additionally, we might add more context data, for example for deciding administrative levels, which can vary per country. Finally, we have used fixed hyperparameters for temperature and top_p, as well as completion token length. All of these could be tuned, leading to better performance.

Cost optimization

The prompting-only approach definitely seems to be a strong option and simplifies how an organization can automatically set HXL tags on their data using GPT-4o. There are of course cost implications with this model, it being more expensive, but predictions occur only on low-volume schema changes, not when the underlying data itself changes, and with new options for batch submission on OpenAI and ever-decreasing LLM costs, this technique seems viable for many organizations. GPT-4o-mini also performs well and is a fraction of the cost.

Application to other metadata standards

It would be interesting to apply this technique to other metadata and labeling standards; I'm sure many organizations are already using LLMs for this.

Please like this article if so inclined, and I'd be delighted if you followed me! You can find more articles here.
