My First Exploratory Data Analysis with ChatGPT
Structuring Work
Data Cleansing
Basic Analysis
Natural Language Processing
Wrapping Up
Discussion
Conclusion

Unleashing the power of ChatGPT: A deep dive into an exploratory data analysis and future opportunities

“An AI exploring an enormous world of knowledge. Digital art. Vivid colors.” (Author generated via DALL-E 2)

ChatGPT is a remarkable tool for working more efficiently, and that doesn’t stop with data analytics. In this article we’ll run through an example of an exploratory data analysis (EDA) run by ChatGPT. We’ll cover the various stages of an EDA, see some impressive outputs (Wordclouds!) and note where ChatGPT does well (and not so well). Finally, we’ll touch on the future of LLMs in analytics and how excited we are for it.

The dataset used for the analysis is a sample from Common Crawl, which is free to be accessed and analysed by anyone. The Common Crawl dataset is a huge collection of web crawl data, comprising billions of web pages from across the web. The dataset includes various web content types and is regularly updated. It serves as a major resource for training language models and made up 60% of the training data for ChatGPT. You can find the dataset sample curated by the author hosted on Kaggle here.

Throughout the post, content will be truncated, so feel free to follow along directly in the Google Colab used to run this analysis.

We’ve broken down the analysis into five sections:

  • 🤔 Overall Structure
  • 🧹 Data Cleansing
  • 💯 Basic Analysis
  • 🇬🇧 NLP Approaches
  • 🔑 Key Findings

With all that out of the way, let’s jump into the analysis!

Before starting analytics work, it’s good to formulate a plan. EDAs typically follow a standard structure to begin with, before going off on tangents as interesting insights are discovered during the analysis. One such template is this one, which details the core components: preview data, clean data, plot data. We could use this as a starting point, but let’s assume we know nothing and see what ChatGPT comes up with.
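That preview-clean-plot template can be sketched in a few lines of pandas. This is a minimal illustration, not the exact template linked above; the toy frame stands in for the real CSV:

```python
import pandas as pd

def run_basic_eda(df: pd.DataFrame) -> pd.DataFrame:
    """Preview the data, drop fully-empty rows, and return summary stats."""
    print(df.head())                    # preview data
    df = df.dropna(how="all")           # clean data: drop fully-empty rows
    return df.describe(include="all")   # summarise ahead of plotting

# Toy frame standing in for common_crawl_text.csv
toy = pd.DataFrame({"Content-Length": [120, None, 340],
                    "text": ["hello", None, "world"]})
summary = run_basic_eda(toy)
```

The point of having such a skeleton is that every EDA starts the same way, and the interesting deviations only come afterwards.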

Prompt:

Let's run an EDA on the Common Crawl text dataset together. The dataset is a scrape of many web pages on the internet. The columns are 'text', 'WARC-Type', 'WARC-Target-URI', 'WARC-Date', 'WARC-Record-ID', 'WARC-Refers-To', 'WARC-Block-Digest', 'WARC-Identified-Content-Language', 'Content-Type', and 'Content-Length'. The dataset is saved in a csv file called common_crawl_text.

You should give me an overall plan, then we can go through it step by step. In each step you should write some code, and I'll run it in the terminal then report the results back to you.

Response:

Sure, here's a high-level plan for the EDA on the Common Crawl text dataset:

  1. Load the dataset into a pandas DataFrame.
  2. Check for missing values in each column and handle them appropriately.
  3. Explore the distribution of the target variable 'WARC-Type'.
  4. Check the distribution of 'WARC-Identified-Content-Language'.
  5. Check the distribution of 'Content-Type'.
  6. Explore the 'text' column, including the distribution of text lengths and the most common words.
  7. Check for any correlations between the different columns.
  8. Summarize the findings and identify any interesting patterns or insights.

Let's start with step 1 and load the dataset into a pandas DataFrame.

ChatGPT has done a pretty good job here covering the standard steps. Notice that it doesn’t want us to deeply analyse all columns, but has a subset that it’s specifically interested in. This is sensible, but I feel it’s missing some valuable data points. For example, “WARC-Target-URI” is worth exploring, and the “WARC-Date” column will give insights into when the data was collected.

I prompted ChatGPT with this, and it happily included these in the plan. At the same time, it added in the creation of a Wordcloud, something I had in the back of my mind.

I like this kind of interaction, where it feels like a partnership, rather than just reading a book.

With each of the steps outlined in the structure, we can ask ChatGPT to expand on that step and give us the exact code to run. To start off, it suggests:

We can start by loading the Common Crawl text dataset and taking a look at its structure. Here's the code to load the dataset into a Pandas DataFrame:

import pandas as pd

df = pd.read_csv("common_crawl_text.csv")

Once you have loaded the dataset, you can check the structure of the DataFrame using the following code:

df.head()

This will show you the first five rows of the DataFrame. You can also use the df.info() method to get an overview of the columns and their data types, as well as the number of non-null values in each column.

After giving it the results, it’s able to share some basic information back:

Based on the provided data, it looks like we have a dataset of web archive records with different attributes such as the type of the record, the URI (Uniform Resource Identifier) of the record, the date it was recorded, the length of the content, and the content itself. The content is in different languages and formats, such as plain text, and the length of the content varies. The dataset contains 37832 records, and some of them have null values in the content length column.

We can go a bit further and ask it to give a detailed description of each column:

Prompt:

Can you give a detailed list of the columns and their expected meaning?

Response:

Sure! Here's a detailed list of the columns in the DataFrame and their expected meaning:

  1. WARC-Type: The type of WARC record. A WARC (Web ARChive) record is a format for storing web resources such as web pages and their associated metadata.
  2. WARC-Target-URI: The target URI (Uniform Resource Identifier) of the archived web page.
  3. WARC-Date: The date and time when the web page was archived.
  4. WARC-Record-ID: The unique identifier of the WARC record.
  5. WARC-Refers-To: The identifier of a related WARC record that this record refers to.
  6. WARC-Block-Digest: A cryptographic hash value computed over the contents of the WARC record.
  7. WARC-Identified-Content-Language: The language(s) identified in the content of the web page.
  8. Content-Type: The media type of the content, such as text/plain or text/html.
  9. Content-Length: The length of the content in bytes.
  10. text: The extracted text content of the web page.

Note that some of these columns are specific to the WARC format and may not be present in other kinds of data.

It’s done a reasonably good job of guessing what these columns mean. Some of these may not be correct (e.g. Content-Length), but it’s a helpful starting point. It may well know these specific names from its training data, but I’ve experimented with arbitrary column names and it can often guess what a column will contain from a combination of the name and the data returned.

Now, it noticed there were missing values, so let’s see how to deal with that:

Prompt:

Should we do something about the missing values?

Response:

Yes, it would be a good idea to handle the missing values in the Content-Length column. One option is to remove the rows which have missing values if they are not a significant portion of the data. Another option is to impute the missing values using some statistical method such as the mean, median, or mode. However, the best approach would depend on the nature of the data and the analysis being performed.

These are the ways I’d probably deal with it. In this case, I’m happy to just delete the entries since there aren’t too many.
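Both options can be sketched in a couple of lines. The toy column here is illustrative; only the column name comes from the dataset:

```python
import pandas as pd

# Toy column standing in for the real Content-Length data
df = pd.DataFrame({"Content-Length": [100.0, None, 250.0, None, 400.0]})

# Option 1: drop the rows that are missing Content-Length
dropped = df.dropna(subset=["Content-Length"])

# Option 2: impute the missing values with the median instead
imputed = df.fillna({"Content-Length": df["Content-Length"].median()})
```

Dropping is the safer default when only a handful of rows are affected, as here; imputing keeps the rows but quietly biases any length statistics towards the imputed value.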

Prompt:

Let's just remove them as there are only a few. How can I do that?

Response:

You can drop the rows containing missing values using the dropna() function in pandas. Here's an example:

import pandas as pd

# create a dataframe
df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})

# drop rows with missing values
df.dropna(inplace=True)
print(df)

Output:

     A    B
0  1.0  5.0
3  4.0  8.0

In this example, the dropna() function removes the second and third rows, which contain missing values. The inplace=True argument modifies the original DataFrame, so the resulting DataFrame only contains rows without any missing values.

Notice how it gives working code, and comments each line of the code. It also gives an explanation of what is going on at the end. This is great because we can learn what’s happening in the code, and if there’s anything we want to change, it’s a lot easier.

We can now start with the main parts of the analysis.

Steps 3–5 are roughly the same: plot some distributions of the data and see if there’s anything interesting there.

When prompted to start step 3:

Sure, let's explore the distribution of the target variable 'WARC-Type'. We can do this by creating a histogram or a bar chart of the variable to see the count or percentage of each value in the dataset. If you are using Python, you can use libraries like Matplotlib or Seaborn to create visualizations. Here is an example code snippet to create a bar chart using Matplotlib:

import matplotlib.pyplot as plt

# Count the number of occurrences of each value in the 'WARC-Type' column
value_counts = df['WARC-Type'].value_counts()

# Create a bar chart of the value counts
plt.bar(value_counts.index, value_counts.values)

# Set the title and axis labels
plt.title('Distribution of WARC-Type')
plt.xlabel('WARC-Type')
plt.ylabel('Count')

# Show the plot
plt.show()

This code will create a bar chart showing the count of each value in the 'WARC-Type' column. You can modify the code to show the percentage of each value instead of the count.

And that outputs a graph for us. In this case, it’s not very interesting as there’s actually just one “WARC-Type”, but it was a nice idea. The issue with images is that it’s hard to feed the results back to ChatGPT. I found that it’s better to suggest ChatGPT return tables of data to solve this, as well as additional plots where it makes sense.
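As a sketch of what that looks like in practice (toy data, not the real frame), a plain-text frequency table can be pasted straight back into the chat, unlike a rendered chart:

```python
import pandas as pd

# Toy frame standing in for the real dataset
df = pd.DataFrame({"WARC-Type": ["conversion"] * 5})

# A plain-text table is easy to copy back into the conversation
value_counts = df["WARC-Type"].value_counts()
table = value_counts.to_string()
print(table)
```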

After learning the column only has one value, ChatGPT is happy to move on to the next stage. When exploring the “WARC-Identified-Content-Language” column we start to see something a bit more complex. The results returned are:

eng 15903

rus 1551

...

hrv,nno,bel 1

zul,eng 1

From which it learns:

It looks like there are 1187 unique values for the 'WARC-Identified-Content-Language' column, and the most common language is English ('eng') followed by Russian ('rus'), German ('deu'), and Chinese ('zho'). Since there are so many unique values, it might be difficult to analyse the distribution of the variable. One approach to simplifying the analysis could be to group languages into broader categories, such as "English", "Romance languages", "Slavic languages", etc. depending on the research question.

It’s a great point about grouping, but it’s missed something here: each row can have multiple languages. When prompted about this (a couple of times), it comes up with an approach.

Here's updated code that splits the rows with multiple languages and weights them accordingly:

import pandas as pd

# Split the rows with multiple languages
df['WARC-Identified-Content-Language'] = df['WARC-Identified-Content-Language'].str.split(',')

# Create a new DataFrame to store the split rows
df_split = pd.DataFrame({
    'WARC-Identified-Content-Language': [lang for langs in df['WARC-Identified-Content-Language'].tolist() for lang in langs],
    'count': [1 / len(langs) for langs in df['WARC-Identified-Content-Language'].tolist() for lang in langs]
})

# Aggregate by language
df_agg = df_split.groupby('WARC-Identified-Content-Language')['count'].sum().reset_index()

# Plot the distribution of languages
plt.figure(figsize=(10, 6))
sns.barplot(data=df_agg, x='WARC-Identified-Content-Language', y='count',
            order=df_agg.sort_values('count', ascending=False)['WARC-Identified-Content-Language'])
plt.title('Distribution of WARC-Identified-Content-Language')
plt.xticks(rotation=90)
plt.show()

This code splits the rows with multiple languages into separate rows and weights them by the number of languages in each row. Then it aggregates the rows by language and plots the distribution of languages.

After combining this with some prompts about the language grouping, we’re able to get out a nice plot:

One thing I particularly like about this is that it gives labels to the chart. It can be a bit annoying to write out labels each time, but it does make a difference, so it’s nice to see that it’s smart enough to produce sensible ones here.
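The grouping itself can be sketched with a small mapping. The family assignments and all counts other than 'eng' and 'rus' are made up for illustration; they are not the exact ones used in the chat:

```python
import pandas as pd

# Hypothetical mapping from ISO codes to broader language groups
families = {"eng": "English", "rus": "Slavic", "deu": "Germanic", "spa": "Romance"}

# 'eng' and 'rus' counts are from the analysis; the rest are invented
counts = pd.Series({"eng": 15903, "rus": 1551, "deu": 1200, "spa": 900, "zho": 800})

# Map each code to its group (unknown codes fall into "Other"), then sum
by_family = counts.groupby(counts.index.map(lambda c: families.get(c, "Other"))).sum()
print(by_family.sort_values(ascending=False))
```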

It seems English is still the dominant language. But by how much?

Prompt:

It looks like it's mostly English language. Can you calculate what percentage of the agg data is English?

Response:

Sure, here's some code to calculate the percentage of the aggregated data that's English:

english_percentage = (agg_data.loc['eng', 'weight'] / agg_data['weight'].sum()) * 100
print(f"The percentage of the aggregated data that's English is: {english_percentage:.2f}%")

This will output the percentage of the aggregated data that's English, rounded to two decimal places.

(Author created image)

Now we can move on to step 6, which involves doing some exploration of the text column. This is going to be more analytics-heavy and will require Natural Language Processing (NLP) techniques. Let’s see how ChatGPT does.

Prompt:

Great. Let's move on to step 6:

"Explore the 'text' column, including the distribution of text lengths and the most common words."

Response:

Sure! To explore the 'text' column, we can start by calculating the length of each text and creating a histogram to visualise the distribution of text lengths. Here's the code to do that:

# Calculate the length of every text
df['text_length'] = df['text'].str.len()

# Create a histogram of text lengths
plt.hist(df['text_length'], bins=50)

This will create a histogram that shows the distribution of text lengths in the 'text' column. We can then look at the most common words in the 'text' column by first removing any HTML tags, punctuation, and stop words. Here's the code to do that:

import re
import nltk
from nltk.corpus import stopwords

# Remove HTML tags from text
df['text'] = df['text'].apply(lambda x: re.sub('<.*?>', '', str(x)))
# Remove punctuation from text
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
# Convert text to lowercase
df['text'] = df['text'].str.lower()
# Remove stop words from text
stop_words = set(stopwords.words('english'))
df['text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
# Tokenize text into individual words
df['text'] = df['text'].apply(nltk.word_tokenize)
# Flatten the lists of words into a single list
word_list = [word for sublist in df['text'] for word in sublist]

# Calculate frequency distribution of words
freq_dist = nltk.FreqDist(word_list)

# Print the 20 most common words
print(freq_dist.most_common(20))

This code will remove HTML tags and punctuation from the 'text' column, convert all text to lowercase, remove stop words, tokenize the text into individual words, and then calculate the frequency distribution of words. Finally, it will print the 20 most common words in the 'text' column.

This yields:

[('nan', 36847), ('0', 3467), ('1', 2809), ('keresőoptimalizálás', 2498), ('idősgondozás', 1852), ('2', 1740), ('3', 1559), ('szerződés', 1529), ('eltartási', 1527), ('brands', 1442), ('5', 1362), ('4', 1350), ('usd', 1245), ('us', 1205), ('2023', 1204), ('new', 1178), ('sex', 1165), ('2022', 1031), ('home', 919), ('cookies', 904)]

I wasn’t very happy with this, as we can see that a lot of the ‘top words’ are just numbers, or stopwords from another language (it only used English ones). I was able to prod it on this, and it helped clean things up by filtering out non-alphabetic tokens and including stopword lists for other languages. Now let’s visualise these in a Wordcloud.
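A minimal sketch of that cleanup step follows. The stopword sets here are illustrative stand-ins for nltk's per-language lists (stopwords.words("english"), stopwords.words("russian"), and so on), and the token list mimics the output above:

```python
# Illustrative stopword sets standing in for nltk's per-language lists
stop_words = {"the", "and", "is", "és", "egy", "и", "в"}

# A few tokens mimicking the raw top-words output
tokens = ["nan", "0", "1", "brands", "és", "keresőoptimalizálás", "2023", "home"]

# Keep only alphabetic tokens that are not stopwords in any loaded language
cleaned = [t for t in tokens if t.isalpha() and t not in stop_words]
print(cleaned)  # → ['nan', 'brands', 'keresőoptimalizálás', 'home']
```

Note that str.isalpha() handles accented characters correctly, which matters for a multilingual corpus like this one.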

Prompt:

Can you make it a wordcloud?

Response:

Sure! To create a wordcloud from the top 50 most frequent words, you can use the WordCloud library in Python as follows:

from wordcloud import WordCloud

import matplotlib.pyplot as plt

# Convert the top 50 most frequent words to a dictionary
freq_dict = dict(top_words)

# Generate the wordcloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(freq_dict)

# Display the wordcloud
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
This code will generate a wordcloud with the highest 50 most frequent words within the text column. You possibly can adjust the parameters of the WordCloud function to customize the looks of the wordcloud.

WARNING: contains explicit language. For those who are wondering, “keresőoptimalizálás” is Hungarian for “SEO” (author created image).

I’m not sure what to take away from this plot. It seems there are some words which could have useful content related to them (e.g. “information”, “blog”, “brands”), but there is also a lot of undesirable content. I shared the top words with ChatGPT and asked what it thought:

Prompt:

What can we say about this?

Response:

Based on the top words present in the 'text' column, it seems that the data may contain a lot of spam or irrelevant information, as many of the top words are related to online dating, reviews, loans, and escorts. Additionally, the presence of words like 'nan' suggests that there may be missing or incomplete data in the 'text' column. It will be necessary to further investigate the nature and quality of the data in the 'text' column to determine how useful it is for any downstream analysis.

That’s a fair conclusion to make, and indeed, companies that work with the Common Crawl dataset have stated that they invest considerable resources cleaning up the dataset to make it usable for training AI models.

This concludes the plan outlined by ChatGPT, so we’ll wrap up the analysis here and see what findings it has.

Prompt:

Great. Let's move on to step 8. Can you summarise the findings from all our analysis so far?

Response:

Sure, based on the analysis we've conducted so far, here are some key findings:

  • Language distribution: The most common language in the dataset is English, followed by Russian, German, Chinese, and Spanish.
  • Content-Type distribution: The majority of the content in the dataset is of the type "text/plain".
  • Text length: The text length in the dataset ranges from 0 to 8,542,524, with an average length of 20,566.
  • Common words: The most frequent words in the dataset are not particularly informative, as they include many stop words and other non-informative terms.
  • Correlations: There is a strong positive correlation between the length of the text and the content length. There is also a weak positive correlation between the number of languages detected in a text and the content length.
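As a quick illustration, the text-length correlation in the last finding can be checked with a one-liner. The numbers below are toy values; the real figure came from the full frame:

```python
import pandas as pd

# Toy values standing in for the real columns
df = pd.DataFrame({
    "text_length": [100, 2000, 35000, 500, 12000],
    "Content-Length": [180, 2500, 40000, 900, 15000],
})

# Pearson correlation between text length and content length
corr = df["text_length"].corr(df["Content-Length"])
print(f"Pearson correlation: {corr:.3f}")
```

A strong correlation here is expected, since Content-Length measures the record in bytes and text_length counts characters of the extracted text.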

If a data analyst came to me with this, I’d be pretty disappointed. As mentioned at the beginning, there are some columns it could have analysed in more detail, and it’s generally better to make some suggestions for next steps. However, we never gave it a goal for the analysis, nor told it to give us suggestions, so I’m not going to penalise it too much.

As you can see, the prompting used throughout was relatively simple. Given that, it’s impressive how well ChatGPT was able to understand what I wanted done and give useful answers. These prompts could definitely be improved by providing more context in each prompt and being stricter about what we want back. For example, each prompt could contain references to the exact task it’s focusing on, as well as additional text to have it do exactly what we want:

Don’t respond with superfluous text. Assume pandas, numpy and matplotlib have been imported in the standard way.

These could be maintained in your own set of prompt templates to speed up this kind of work, or done with a tool such as LangChain.
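For illustration, a hypothetical home-grown template might look like the following (plain Python string formatting; LangChain's PromptTemplate offers the same idea with extra tooling). The template text itself is an invented example:

```python
# Hypothetical reusable prompt template for EDA steps
EDA_PROMPT = (
    "We are running an EDA on {dataset}. Current step: {step}.\n"
    "Don't respond with superfluous text. Assume pandas, numpy and "
    "matplotlib have been imported in the standard way.\n"
    "{question}"
)

# Fill in the slots for one concrete step
prompt = EDA_PROMPT.format(
    dataset="common_crawl_text.csv",
    step="explore the 'text' column",
    question="Give me the code to plot a histogram of text lengths.",
)
print(prompt)
```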

We could also define our own overall template. I let ChatGPT come up with a plan, but it wasn’t perfect. We could define an overall structure for it to follow, and a standard way to e.g. analyse each variable. With templates, ChatGPT is less likely to miss insights in such an analysis.

While it was fun going back and forth with ChatGPT to feed data outputs to it, it quickly became tiring. ChatGPT is much more powerful when it can run the code directly itself. ChatGPT can be connected to a Python runtime by instead working with the Python API. In this case, the code could be run automatically, but to cut the human out of the loop we’ll need one more tool.

AutoGPT has been very popular in the last month as a power-up to ChatGPT which effectively provides a guide to ChatGPT agents that allows them to keep executing towards some goal. AutoGPT could replace me in this scenario, asking ChatGPT agents to design code, then executing it, feeding the results back to ChatGPT, proceeding until it has a detailed analysis. It can also interface with a memory database, which would allow it to execute much larger analyses.

With a tool like AutoGPT we can set a clear goal with requirements such as the level of detail of the analysis and the expected conclusion style. In this case, we can check in on the results less often and eventually should have to do little work to get a decent analysis out.

Finally, we should call out that ChatGPT is far from ‘perfect’, and even in this mock analysis I had to massage the prompts to get an answer that was close to what I wanted. It was a lot easier than I expected, but still worth noting. It created some code that had errors, though it managed to fix the errors each time it was told. At times it created code that I wouldn’t have wanted to run, and I had to suggest it follow a different path, but again, upon prompting it could come up with a fair solution.

In this article, we’ve seen how ChatGPT can be used to support the running of an Exploratory Data Analysis (EDA). We’ve seen that we’re able to get surprisingly good results working with the system, with little external help. We also noted that there are already tools which allow us to extend this idea, such as AutoGPT, which could make an even more powerful assistant.

As a data analyst, I’m already using ChatGPT to help with my analytics in some of the ways described above, though I rarely use it for an end-to-end analysis as detailed in this article. As more integrations are built out with tools like AutoGPT, and the friction to use them is reduced, I expect to be using it more and more, and am very much excited for it (as long as I’m not made obsolete 😉).
