Predict Player Churn, with Some Help From ChatGPT
  • Introduction
  • The Platform
  • The Dataset
  • Exploratory Data Analysis
  • Training a Classification Model
  • Improving the Model Performance
  • Creating New Features
  • Training a New (hopefully improved) Classification Model
  • Model Deployment in Production
  • Conclusions

These curves are also useful for determining what threshold to use in our final application. For instance, if we want to reduce the number of false positives, we can select a threshold at which the model achieves a higher precision, and then check what the corresponding recall would be.
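As a rough sketch of the same idea in code, the snippet below uses scikit-learn's precision_recall_curve to find the lowest threshold that reaches a desired precision. The inputs y_true (true churn labels) and y_prob (predicted churn probabilities) are assumptions for the example; the synthetic data at the end is only there to show the call.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Assumed inputs: y_true (0/1 churn labels) and y_prob (predicted churn
# probabilities) for a held-out validation set.
def threshold_for_precision(y_true, y_prob, min_precision=0.8):
    """Return the lowest threshold achieving at least `min_precision`,
    together with the precision and recall obtained at that threshold."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # precision/recall have one more element than thresholds, so drop the last pair
    for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
        if p >= min_precision:
            return t, p, r
    return None  # the requested precision is never reached

# Example with synthetic data, just to show the call:
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(0.35 * y_true + 0.65 * rng.random(1000), 0, 1)
print(threshold_for_precision(y_true, y_prob, min_precision=0.6))
```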

The importance of each feature for the best model obtained can also be viewed, which is probably one of the more interesting results. This is computed using permutation importance via AutoGluon. P-values are also shown to determine the reliability of the result:

Feature importance table. Image by author.

Perhaps unsurprisingly, the most important feature is EndType (showing what caused the level to end, such as a win or a loss), followed by MaxLevel (the highest level played by a user, with higher numbers indicating that a player is quite engaged and active in the game).

On the other hand, UsedMoves (the number of moves performed by a player) is practically useless, and StartMoves (the number of moves available to a player) could actually harm performance. This also makes sense, because the number of moves used and the number of moves available to a player are not highly informative by themselves; a comparison between them would probably be much more useful.
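For reference, a table like this can also be produced directly with AutoGluon's open-source Python API, which computes permutation importance together with p-values. A minimal sketch follows; the CSV file name, the train/test split, and the 'Churned' label column are assumptions, so adapt them to your own data:

```python
import pandas as pd
from autogluon.tabular import TabularPredictor

# Hypothetical file and label column names -- adjust to your own dataset.
df = pd.read_csv("game_data_levels.csv")
train_df, test_df = df.iloc[:80_000], df.iloc[80_000:]

# Train a set of models and keep the best one, optimizing for ROC AUC.
predictor = TabularPredictor(label="Churned", eval_metric="roc_auc").fit(train_df)

# Permutation importance on held-out data; the returned frame contains
# 'importance', 'stddev' and 'p_value' columns, as in the table above.
importance = predictor.feature_importance(test_df)
print(importance.sort_values("importance", ascending=False))
```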

We can also have a look at the estimated probabilities of each class (either 1 or 0 in this case), which are used to derive the predicted class (by default, the class with the highest probability is assigned as the predicted class):

Table with original values, Shapley values, and predicted values. Image by author.

Explainable AI is becoming ever more important for understanding model behaviour, which is why tools like Shapley values are increasing in popularity. These values represent the contribution of a feature to the probability of the predicted class. For instance, in the first row, we can see that a RollingLosses value of 36 decreases the probability of the predicted class (class 0, i.e. that the user will keep playing the game) for that player.

Conversely, this means that the probability of the other class (class 1, i.e. that a player churns) is increased. This makes sense, because higher values of RollingLosses indicate that the player has lost many levels in succession and is thus more likely to stop playing the game out of frustration. On the other hand, low values of RollingLosses generally increase the probability of the negative class (i.e. that a player will not stop playing).
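To give a flavour of how such per-row contributions can be computed with open-source tools, here is a minimal sketch using the shap library with a LightGBM classifier. The platform's own computation may differ, and X_train, y_train, and X_test are assumed to be already-prepared feature data:

```python
import lightgbm as lgb
import pandas as pd
import shap

# Assumed inputs: X_train, X_test (pandas DataFrames of features) and y_train (0/1 labels).
model = lgb.LGBMClassifier(n_estimators=200)
model.fit(X_train, y_train)

# TreeExplainer computes Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Depending on the shap version, shap_values is either a single array of
# contributions towards the positive (churn) class, or a list with one array per class.
contrib = shap_values[1] if isinstance(shap_values, list) else shap_values
print(pd.DataFrame(contrib, columns=X_test.columns).head())
```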

As mentioned, a number of models are trained and evaluated, after which the best one is chosen. It is interesting to see that the best model in this case is LightGBM, which is also one of the fastest:

Information on the models trained. Image by author.

At this point, we can try to improve the performance of the model. Perhaps one of the simplest ways is to select the 'Optimize for quality' option and see how far we can go. This option configures several parameters that are known to generally improve performance, at the expense of a potentially longer training time. The following results were obtained (which you can also view here):

Evaluation metrics when using the 'Optimize for quality' option. Image by author.

Again focusing on the ROC AUC metric, performance improved from 0.675 to 0.709. This is quite a nice increase for such a simple change, although still far from ideal. Is there something else that we can do to improve performance further?

As discussed earlier, we can do this using feature engineering. This involves creating new features from existing ones that are able to capture stronger patterns and are more highly correlated with the variable to be predicted.

In our case, the features in the dataset have a fairly narrow scope, because the values pertain to only one single record (i.e. the information on a level played by the user). Hence, it might be very useful to get a more global outlook by summarizing records over time. In this way, the model would have knowledge of the historical trends of a user.

For instance, we could determine how many extra moves were used by the player, thereby providing a measure of the difficulty experienced; if few extra moves were needed, then the level might have been too easy; on the other hand, a high number might mean that the level was too hard.

It would also be a good idea to check whether the user is immersed and engaged in playing the game, by checking the amount of time spent playing it over the last few days. If the player has not played the game much, it might mean that they are losing interest and could stop playing soon.

Useful features vary across different domains, so it is important to try to find any information pertaining to the task at hand. For instance, you could find and read research papers, case studies, and articles, or seek the advice of companies or professionals who have worked in the field and are thus experienced and well-versed in the most common features, their relationships with one another, any potential pitfalls, and which new features are most likely to be useful. These approaches help reduce trial-and-error and speed up the feature engineering process.

Given the recent advances in Large Language Models (LLMs) (for instance, you may have heard of ChatGPT…), and given that the process of feature engineering might be a bit daunting for inexperienced users, I was curious to see if LLMs could be at all useful in providing ideas on what features could be created. I did just that, with the following output:

ChatGPT’s answer when asked what new features could be created to predict player churn more accurately. The answer is actually quite useful. Image by author.

ChatGPT’s reply is actually quite good, and also points to a number of time-based features as discussed above. Of course, keep in mind that we will not be able to implement all of the suggested features if the required information is not available. Furthermore, it is well known that ChatGPT is prone to hallucination, and as such may not provide fully accurate answers.

We could get more relevant responses from ChatGPT, for example by specifying the features we are using or by employing prompt engineering, but this is beyond the scope of this article and is left as an exercise to the reader. Nevertheless, LLMs can be regarded as an initial step to get things going, although it is still highly recommended to seek more reliable information from papers, professionals, and so on.

On the Actable AI platform, new features can be created using the fairly well-known SQL programming language. For those less familiar with SQL, approaches such as using ChatGPT to automatically generate queries may prove useful. However, in my limited experimentation, the reliability of this approach can be somewhat inconsistent.

To ensure accurate computation of the intended output, it is advisable to manually inspect a subset of the results and verify that the desired output is being computed correctly. This can easily be done by checking the table that is displayed after the query is run in SQL Lab, Actable AI’s interface for writing and running SQL code.

Here’s the SQL code I used to generate the new columns, which should give you a head start if you would like to create other features:

SELECT
    *,
    SUM("PlayTime") OVER UserLevelWindow AS "time_spent_on_level",
    (a."Max_Level" - a."Min_Level") AS "levels_completed_in_last_7_days",
    COALESCE(CAST("total_wins_in_last_14_days" AS DECIMAL) / NULLIF("total_losses_in_last_14_days", 0), 0.0) AS "win_to_lose_ratio_in_last_14_days",
    COALESCE(SUM("UsedCoins") OVER User1DayWindow, 0) AS "UsedCoins_in_last_1_days",
    COALESCE(SUM("UsedCoins") OVER User7DayWindow, 0) AS "UsedCoins_in_last_7_days",
    COALESCE(SUM("UsedCoins") OVER User14DayWindow, 0) AS "UsedCoins_in_last_14_days",
    COALESCE(SUM("ExtraMoves") OVER User1DayWindow, 0) AS "ExtraMoves_in_last_1_days",
    COALESCE(SUM("ExtraMoves") OVER User7DayWindow, 0) AS "ExtraMoves_in_last_7_days",
    COALESCE(SUM("ExtraMoves") OVER User14DayWindow, 0) AS "ExtraMoves_in_last_14_days",
    AVG("RollingLosses") OVER User7DayWindow AS "RollingLosses_mean_last_7_days",
    AVG("MaxLevel") OVER PastWindow AS "MaxLevel_mean"
FROM (
    -- Inner query: intermediate per-user aggregates over the last 7/14 days
    -- and over all past records.
    SELECT
        *,
        MAX("Level") OVER User7DayWindow AS "Max_Level",
        MIN("Level") OVER User7DayWindow AS "Min_Level",
        SUM(CASE WHEN "EndType" = 'Lose' THEN 1 ELSE 0 END) OVER User14DayWindow AS "total_losses_in_last_14_days",
        SUM(CASE WHEN "EndType" = 'Win' THEN 1 ELSE 0 END) OVER User14DayWindow AS "total_wins_in_last_14_days",
        SUM("PlayTime") OVER User7DayWindow AS "PlayTime_cumul_7_days",
        SUM("RollingLosses") OVER User7DayWindow AS "RollingLosses_cumul_7_days",
        SUM("PlayTime") OVER UserPastWindow AS "PlayTime_cumul"
    FROM "game_data_levels"
    WINDOW
        User7DayWindow AS (
            PARTITION BY "UserID"
            ORDER BY "ServerTime"
            RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW
        ),
        User14DayWindow AS (
            PARTITION BY "UserID"
            ORDER BY "ServerTime"
            RANGE BETWEEN INTERVAL '14' DAY PRECEDING AND CURRENT ROW
        ),
        UserPastWindow AS (
            PARTITION BY "UserID"
            ORDER BY "ServerTime"
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        )
) AS a
-- Named windows for the outer query: every frame ends at the current row,
-- so only past records are used in each feature.
WINDOW
    UserLevelWindow AS (
        PARTITION BY "UserID", "Level"
        ORDER BY "ServerTime"
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ),
    PastWindow AS (
        ORDER BY "ServerTime"
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ),
    User1DayWindow AS (
        PARTITION BY "UserID"
        ORDER BY "ServerTime"
        RANGE BETWEEN INTERVAL '1' DAY PRECEDING AND CURRENT ROW
    ),
    User7DayWindow AS (
        PARTITION BY "UserID"
        ORDER BY "ServerTime"
        RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW
    ),
    User14DayWindow AS (
        PARTITION BY "UserID"
        ORDER BY "ServerTime"
        RANGE BETWEEN INTERVAL '14' DAY PRECEDING AND CURRENT ROW
    )
ORDER BY "ServerTime";

In this code, ‘windows’ are created to define the range of time to consider, such as the last day, last week, or last two weeks. The records falling within that range are then used in the feature computations, which are mainly intended to provide some historical context on the player’s journey in the game. The full list of features is as follows:

  • time_spent_on_level: the time spent by a user playing the level. Gives an indication of level difficulty.
  • levels_completed_in_last_7_days: the number of levels completed by a user in the last 7 days (1 week). Gives an indication of level difficulty, perseverance, and immersion in the game.
  • total_wins_in_last_14_days: the total number of times a user has won a level in the last 14 days (2 weeks).
  • total_losses_in_last_14_days: the total number of times a user has lost a level in the last 14 days (2 weeks).
  • win_to_lose_ratio_in_last_14_days: the ratio of the number of wins to the number of losses (total_wins_in_last_14_days / total_losses_in_last_14_days).
  • UsedCoins_in_last_1_days: the number of coins used within the previous day. Gives an indication of level difficulty and a player’s willingness to spend in-game currency.
  • UsedCoins_in_last_7_days: the number of coins used within the previous 7 days (1 week).
  • UsedCoins_in_last_14_days: the number of coins used within the previous 14 days (2 weeks).
  • ExtraMoves_in_last_1_days: the number of extra moves used by a user within the previous day. Gives an indication of level difficulty.
  • ExtraMoves_in_last_7_days: the number of extra moves used by a user within the previous 7 days (1 week).
  • ExtraMoves_in_last_14_days: the number of extra moves used by a user within the previous 14 days (2 weeks).
  • RollingLosses_mean_last_7_days: the average number of cumulative losses by a user over the last 7 days (1 week). Gives an indication of level difficulty.
  • MaxLevel_mean: the mean of the maximum level reached across all users.
  • Max_Level: the maximum level reached by a player in the last 7 days (1 week). Together with MaxLevel_mean, it gives an indication of a player’s progress relative to other players.
  • Min_Level: the minimum level played by a user in the last 7 days (1 week).
  • PlayTime_cumul_7_days: the total time played by a user in the last 7 days (1 week). Gives an indication of the player’s immersion in the game.
  • PlayTime_cumul: the total time played by a user (since the first available record).
  • RollingLosses_cumul_7_days: the total number of rolling losses over the last 7 days (1 week). Gives an indication of level difficulty.

It is important that only past records are used when computing the value of a new feature for a particular row. In other words, using future observations must be avoided, since the model will obviously not have access to any future values when deployed in production.
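One way to double-check both points (that a column is computed correctly, and that no future records leak in) is to recompute a feature independently on an exported copy of the data and compare it with the SQL output. Below is a rough pandas sketch for total_losses_in_last_14_days; the CSV file names are hypothetical, and the exact window-boundary behaviour of pandas may differ slightly from the SQL RANGE frame:

```python
import pandas as pd

# Hypothetical exports of the raw table and of the SQL query's output.
raw = pd.read_csv("game_data_levels.csv", parse_dates=["ServerTime"])
engineered = pd.read_csv("engineered_features.csv", parse_dates=["ServerTime"])

raw = raw.sort_values(["UserID", "ServerTime"])
raw["is_loss"] = (raw["EndType"] == "Lose").astype(int)

# Trailing 14-day sum of losses per user; only records up to and including the
# current row's timestamp are used, so no future information leaks in.
check = (
    raw.set_index("ServerTime")
       .groupby("UserID")["is_loss"]
       .rolling("14D").sum()
       .reset_index(name="total_losses_in_last_14_days_check")
)

merged = engineered.merge(check, on=["UserID", "ServerTime"])
print((merged["total_losses_in_last_14_days"]
       == merged["total_losses_in_last_14_days_check"]).mean())
```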

Once satisfied with the features created, we can save the table as a new dataset and train a new model that should (hopefully) attain better performance.

Time to see if the new columns are of any use. We can repeat the same steps as before, with the only difference being that we now use the new dataset containing the additional features. The same settings are used to enable a fair comparison with the original model, with the following results (which can also be viewed here):

Evaluation metrics using the new columns. Image by author.

The ROC AUC value of 0.918 is a large improvement over the original value of 0.675. It is even better than that of the model optimized for quality (0.709)! This demonstrates the importance of understanding your data and creating new features that are able to provide richer information.

It would now be interesting to see which of our new features were actually the most useful; again, we can check the feature importance table:

Feature importance table of the new model. Image by author.

It looks like the total number of losses in the last two weeks is quite important, which makes sense: the more often a player loses, the more likely they are to become frustrated and stop playing.

The average maximum level across all users also appears to be important, which again makes sense because it can be used to determine how far off a player is from the majority of other players: values much higher than the average indicate that a player is well immersed in the game, while values much lower than the average could indicate that the player is still not well motivated.

These are only a few simple features that we could have created. There are other features we could create that might improve performance further. I'll leave it as an exercise to the reader to see what other features could be created.

Training a model optimized for quality with the same time limit as before did not improve performance. However, this is perhaps understandable, because a greater number of features is being used, so more time might be needed for optimisation. As can be observed here, increasing the time limit to 6 hours indeed improves performance to 0.923 (in terms of the AUC):

Evaluation metric results when using the new features and optimizing for quality. Image by author.

It should also be noted that some metrics, such as the precision and recall, are still quite poor. However, this is because a classification threshold of 0.5 is assumed, which may not be optimal. Whilst the threshold can be modified by clicking on the curves, the AUC is threshold-independent and provides a more comprehensive picture of the performance. As mentioned earlier, the AUC is especially useful as the optimisation metric when training on imbalanced datasets, as is the case here.
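As a small illustration of why this matters, the sketch below (reusing the assumed y_true and y_prob from the earlier threshold-selection example) shows that the AUC is computed from the probabilities alone, whereas precision and recall change with the chosen cutoff:

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# y_true / y_prob assumed as in the earlier threshold-selection sketch.
print("AUC (threshold-free):", roc_auc_score(y_true, y_prob))

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_true, y_pred, zero_division=0):.3f}, "
          f"recall={recall_score(y_true, y_pred, zero_division=0):.3f}")
```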

The performance of the trained models in terms of the AUC can be summarised as follows:

┌────────────────────────────────────────────────────────┬───────────┐
│ Model                                                  │ AUC (ROC) │
├────────────────────────────────────────────────────────┼───────────┤
│ Original features                                      │   0.675   │
│ Original features + optim. for quality                 │   0.709   │
│ Engineered features                                    │   0.918   │
│ Engineered features + optim. for quality + longer time │   0.923   │
└────────────────────────────────────────────────────────┴───────────┘

It’s no use having a good model if we can’t actually apply it to new data. Machine learning platforms may offer the ability to generate predictions on future unseen data given a trained model. For example, the Actable AI platform provides an API that allows the model to be used on data outside of the platform, as well as the possibility of exporting the model or inserting raw values to get a quick prediction.
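What calling such an API typically looks like is sketched below with Python's requests library. The URL, authentication scheme, and payload format are purely hypothetical and will differ from Actable AI's actual API, so consult the platform's documentation for the real interface:

```python
import requests

API_URL = "https://example.com/api/v1/models/churn/predict"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"                                      # hypothetical credential

# One row of (engineered) feature values for which a prediction is requested.
rows = [{
    "EndType": "Lose",
    "MaxLevel": 120,
    "RollingLosses": 36,
    "total_losses_in_last_14_days": 9,
    "win_to_lose_ratio_in_last_14_days": 0.4,
}]

response = requests.post(
    API_URL,
    json={"data": rows},
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. the predicted class and its probability for each row
```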

However, it is crucial to periodically test the model on future data to determine whether it is still performing as expected. Indeed, it may be necessary to re-train the models with newer data. This is because the data characteristics (e.g. feature distributions) may change over time, thereby affecting the accuracy of the model.

For example, a company may introduce a new policy that affects customer behaviours (be it positively or negatively), but the model may be unable to take the new policy into account if it does not have access to any features reflecting the change. If there are such drastic changes but no features available that could inform the model, then it may be worth considering using two models: one trained and used on the older data, and another trained and used on the newer data. This would ensure that the models are specialised to operate on data with different characteristics, which may be hard to capture with a single model.
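A simple way to monitor for this kind of change is to periodically compare the distribution of each feature between the training data and recent production data, for instance with a two-sample Kolmogorov-Smirnov test. A minimal sketch, assuming train_df and recent_df are DataFrames with the same numeric feature columns:

```python
from scipy.stats import ks_2samp

def detect_drift(train_df, recent_df, features, alpha=0.01):
    """Return the features whose distribution differs significantly between
    the training data and the more recent data."""
    drifted = []
    for col in features:
        stat, p_value = ks_2samp(train_df[col].dropna(), recent_df[col].dropna())
        if p_value < alpha:  # distributions differ significantly
            drifted.append((col, round(stat, 3), p_value))
    return drifted

# Example call on a few of the engineered features:
print(detect_drift(train_df, recent_df,
                   ["RollingLosses", "PlayTime_cumul_7_days", "UsedCoins_in_last_7_days"]))
```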

In this article, a real-world dataset containing information on each level played by a user in a mobile app was used to train a classification model that can predict whether a player will stop playing the game within two weeks' time.

The entire processing pipeline was considered, from EDA to model training to feature engineering. Discussion on the interpretation of the results, and how we could improve upon them, was provided, going from an AUC of 0.675 to an AUC of 0.923 (where 1.0 is the maximal value).

The new features that were created are relatively simple, and there certainly exist many more features that could be considered. Furthermore, techniques such as feature normalisation and standardisation could also be considered. Some useful resources can be found here and here.

With regard to the Actable AI platform, I could of course be a bit biased, but I do think that it helps simplify some of the more tedious processes that need to be done by data scientists and machine learning experts, with the following desirable aspects:

  • The core ML library is open-source, so anyone with good programming knowledge can verify that it is safe to use. It can also be used directly by anyone who knows Python
  • For those who do not know Python or are not comfortable with coding, the GUI offers a way to use a number of analytics and visualisations with little fuss
  • It is not too difficult to start using the platform (it does not overwhelm the user with too much technical information that might dissuade less knowledgeable people from using it)
  • The free tier allows running analytics on publicly available datasets
  • A large number of tools are available (apart from the classification considered in this article)

That said, there are a few drawbacks, and several aspects could be improved, such as:

  • The free tier does not allow running ML models on private data
  • The user interface looks a bit dated
  • Some visualisations can be unclear and sometimes hard to interpret
  • The app can be slow to respond at times
  • No support for imbalanced data
  • Some knowledge of data science and machine learning is still needed to get the most out of the platform (although this is probably true of other platforms too)

In future articles, I'll consider using other platforms to determine their strengths and weaknesses, and thereby which use cases best fit each platform.

Until then, I hope this article was an interesting read! Please feel free to leave any feedback or questions that you may have!
