In this post, I'll present an alternative to popular techniques in market basket analysis that can help practitioners find high-value patterns rather than just the most frequent ones. We'll gain some intuition into different pattern mining problems and look at a real-world example. The full code can be found here. All images are created by the author.
I have already written a more introductory article about pattern mining; if you're not familiar with some of the concepts that come up here, feel free to check that one out first.
Briefly, pattern mining tries to find patterns in data (duh). Most of the time, this data comes in the form of (multi-)sets or sequences. In my last article, for instance, I looked at the sequence of actions that a user performs on a website. In that case, we would care about the ordering of the items.
In other cases, such as the one we'll discuss below, we don't care about the ordering of the items. We only list all the items that were in the transaction and how often they appeared.
So, for example, transaction 1 contained 🥪 3 times and 🍎 once. As we see, we lose information about the ordering of the items, but in many scenarios (such as the one we'll discuss below), there is no logical ordering of the items. This is similar to a bag of words in NLP.
Market Basket Analysis (MBA) is a data analysis technique commonly used in retail and marketing to uncover relationships between products that customers tend to purchase together. It aims to identify patterns in customers' shopping baskets or transactions by analyzing their purchasing behavior. The central idea is to understand the co-occurrence of items in shopping transactions, which helps businesses optimize their strategies for product placement, cross-selling, and targeted marketing campaigns.
Frequent Itemset Mining (FIM) is the task of finding frequent patterns in transaction databases. We can look at the frequency of a pattern (i.e. a set of items) by calculating its support. In other words, the support of a pattern X is the number of transactions T that contain X (and are in the database D). That is, we are simply counting how often the pattern X appears in the database.
In FIM, we then want to find all the itemsets that have a support greater than some threshold (often called minsup). If the support of an itemset is higher than minsup, it is considered frequent.
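To make this concrete, here is a minimal sketch (on a made-up toy database, not the dataset used later) of how the support of a pattern can be counted:

# sketch: computing the support of a pattern on a toy database
db = [
    {'a', 'b'},
    {'a', 'b'},
    {'a'},
    {'a', 'c'},
]

def support(pattern, db):
    # count the transactions that contain every item of the pattern
    return sum(1 for t in db if pattern <= t)

print(support({'a', 'b'}, db))  # 2 -> frequent if minsup <= 2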
Limitations
In FIM, we only look at the existence of an item in a transaction. That is, whether an item appears two times or 200 times doesn't matter; we simply represent it as a one. But we often have cases (such as MBA) where not only the existence of an item in a transaction is relevant but also how many times it appeared in the transaction.
Another problem is that frequency doesn't always imply relevance. In that sense, FIM assumes that all items in the transaction are equally important. However, it is reasonable to assume that somebody buying caviar might be more important for a business than someone buying bread, as caviar is potentially a high-ROI/high-profit item.
These limitations directly bring us to High Utility Itemset Mining (HUIM) and High Utility Quantitative Itemset Mining (HUQIM), which are generalizations of FIM that try to address some of the problems of regular FIM.
Our first generalization is that items can appear more than once in a transaction (i.e. we have a multiset instead of a simple set). As said before, in normal itemset mining, we transform the transaction into a set and only look at whether the item exists in the transaction or not. So, for example, the two transactions below would have the same representation.
t1 = ['a', 'a', 'a', 'a', 'a', 'b']  # repr. as {a, b} in FIM
t2 = ['a', 'b']  # repr. as {a, b} in FIM
Above, both of these transactions would be represented as {a, b} in regular FIM. We quickly see that, in some cases, we could miss important details. For example, if a and b were some items in a customer's shopping cart, it would matter a lot whether we have a (e.g. a loaf of bread) five times or just once. Therefore, we represent the transaction as a multiset in which we write down how many times each item appeared.
# multiset representation
t1_ms = {('a', 5), ('b', 1)}
t2_ms = {('a', 1), ('b', 1)}
This representation is also more efficient if items can appear a large number of times (e.g. 100 or 1,000 times). In that case, we need not write down all the a's or b's but simply how often they appear.
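In Python, collections.Counter gives us exactly this multiset view; a minimal sketch, reusing t1 and t2 from above:

from collections import Counter

# the multiset representation, computed directly from the item lists
print(Counter(t1))  # Counter({'a': 5, 'b': 1})
print(Counter(t2))  # Counter({'a': 1, 'b': 1})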
The generalization that both the quantitative and non-quantitative methods make is to assign every item in the transaction a utility (e.g. profit or time). Below, we have a table that assigns every possible item a unit profit.
We can then calculate the utility of a specific pattern such as {🥪, 🍎} by summing up the utility of those items in the transactions that contain them. In our example, we would have:
(3🥪 * $1 + 1🍎 * $2) +
(1🥪 * $1 + 2🍎 * $2) = $10
So, we get that this pattern has a utility of $10. With FIM, we had the task of finding frequent patterns. Now, we have to find patterns with high utility. This is mainly because we assume that frequency does not imply importance. In regular FIM, we might have missed rare (infrequent) patterns that provide a high utility (e.g. the diamond), which is not the case with HUIM.
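As a minimal sketch, we can recompute the $10 from above; the quantity database below only contains the two transactions of the running example that include the pattern (the full table is in the image above):

# unit profits from the table and the two example transactions containing {🥪, 🍎}
unit_profit = {'🥪': 1, '🍎': 2, '🦞': 10}
quantity_db = [
    {'🥪': 3, '🍎': 1},           # transaction 1
    {'🥪': 1, '🦞': 2, '🍎': 2},  # transaction 3
]

def pattern_utility(pattern, db, unit_profit):
    # sum the utilities of the pattern's items over all transactions
    # that contain the full pattern
    total = 0
    for t in db:
        if all(item in t for item in pattern):
            total += sum(t[item] * unit_profit[item] for item in pattern)
    return total

print(pattern_utility({'🥪', '🍎'}, quantity_db, unit_profit))  # 10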
We also need to define the notion of a transaction utility. This is simply the sum of the utilities of all the items in the transaction. For transaction 3 in our database, this would be
1🥪 * $1 + 2🦞 * $10 + 2🍎 * $2 = $25
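Continuing the sketch above, the transaction utility simply sums over all items of a single transaction:

def transaction_utility(t, unit_profit):
    # quantity times unit profit, summed over all items in the transaction
    return sum(q * unit_profit[item] for item, q in t.items())

print(transaction_utility(quantity_db[1], unit_profit))  # 25 (transaction 3)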
Note that solving this problem and finding all high-utility itemsets is harder than regular FIM. This is because the utility does not follow the Apriori property.
The Apriori Property
Let X and Y be two patterns occurring in a transaction database D. The Apriori property says that if X is a subset of Y, then the support of X must be at least as big as Y's.
This means that if a subset of Y is infrequent, Y itself must be infrequent, since it can only have a smaller (or equal) support. Let's say we have X = {a} and Y = {a,b}. If Y appears 4 times in our database, then X must appear at least 4 times, since X is a subset of Y. This makes sense: by adding an item, we make the pattern less general / more specific, which means it will fit fewer transactions. This property is used in most algorithms since it implies that if {a} is infrequent, all of its supersets are also infrequent, and we can eliminate them from the search space [3].
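Reusing the support function and toy database from the sketch further up, we can see this anti-monotonicity directly:

print(support({'a'}, db))       # 4
print(support({'a', 'b'}, db))  # 2 <= 4, as the apriori property guarantees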
This property does not hold when we are talking about utility. A superset Y of a pattern X can have more or less utility. If we take the example from above, {🥪} has a utility of $4. But this doesn't mean we cannot look at supersets of this pattern. For example, the superset we looked at, {🥪, 🍎}, has a higher utility of $10. At the same time, a superset of a pattern won't always have more utility, since it might be that this superset just doesn't appear very often in the DB.
Idea Behind HUIM
Since we can't use the Apriori property for HUIM directly, we have to come up with another upper bound for narrowing down the search space. One such bound is called Transaction-Weighted Utilization (TWU). To calculate it, we sum up the transaction utilities of the transactions that contain the pattern X of interest. Any superset Y of X cannot have a higher utility than the TWU. Let's make this clearer with an example. The TWU of {🥪, 🍎} is $30 ($5 from transaction 1 and $25 from transaction 3). When we look at a superset pattern Y such as {🥪, 🦞, 🍎}, we can see that there is no way it could have more utility, since all transactions that contain Y also contain X.
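As a sketch, the TWU can be computed with the helpers defined earlier (transaction_utility, quantity_db, and unit_profit), reproducing the $30 from the example:

def twu(pattern, db, unit_profit):
    # sum of the full transaction utilities over transactions containing the pattern
    return sum(
        transaction_utility(t, unit_profit)
        for t in db
        if all(item in t for item in pattern)
    )

print(twu({'🥪', '🍎'}, quantity_db, unit_profit))  # 30 = 5 + 25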
There are now various algorithms for solving HUIM. All of them receive a minimum utility and produce the patterns that have at least that utility as their output. In this case, I have used the EFIM algorithm since it is fast and memory-efficient.
For this article, I'll work with the Market Basket Analysis dataset from Kaggle (used with permission from the original dataset author).
Above, we can see the distribution of transaction values found in the data. There is a total of around 19,500 transactions, with an average transaction value of $526 and 26 distinct items per transaction. In total, there are around 4,000 unique items. We can also make an ABC analysis, where we put items into different buckets depending on their share of total revenue. We can see that around 500 of the 4,000 items make up around 70% of the revenue (A-items). We then have a long right tail of items (around 2,250) that make up only around 5% of the revenue (C-items).
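As an aside, such an ABC analysis can be sketched in a few lines of pandas. This assumes the line-item dataframe data with the Itemname and Value columns used in the preprocessing below, and the revenue-share cut-offs (70% / 95%) are just the ones described above:

import pandas as pd

# revenue per item, sorted from highest to lowest
revenue = data.groupby('Itemname')['Value'].sum().sort_values(ascending=False)
cum_share = revenue.cumsum() / revenue.sum()

# A-items: first 70% of revenue, B-items: next 25%, C-items: last 5%
abc = pd.cut(cum_share, bins=[0, 0.70, 0.95, 1.0], labels=['A', 'B', 'C'])
print(abc.value_counts())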
Preprocessing
The initial data is in a long format where each row is a line item within a bill. From the BillNo, we can see to which transaction the item belongs.
After some preprocessing, we get the data into the format required by PAMI, the Python library we're going to use for applying the EFIM algorithm.
data['item_id'] = pd.factorize(data.Itemname)[0].astype(str)  # map item names to ids
data["Value_Int"] = data["Value"].astype(int).astype(str)
data = data.loc[data.Value_Int != '0']  # exclude items w/o utility

transaction_db = data.groupby('BillNo').agg(
    items=('item_id', lambda x: ' '.join(list(x))),
    num_items=('item_id', 'count'),  # transaction length, needed for the filter below
    total_value=('Value', lambda x: int(x.sum())),
    values=('Value_Int', lambda x: ' '.join(list(x))),
)

# filter out long transactions, only use a subset of transactions
transaction_db = transaction_db.loc[transaction_db.num_items < 10].iloc[:1000]
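To bridge the gap to the next step, here is a sketch of how the input file read below could be written. It assumes PAMI's colon-separated input format for high-utility mining (items : transaction utility : per-item utilities); check the PAMI documentation for the exact requirements:

# write one line per transaction: "item ids : total utility : item utilities"
with open('tdb.csv', 'w') as f:
    for _, row in transaction_db.iterrows():
        f.write(f"{row['items']}:{row['total_value']}:{row['values']}\n")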
We can then apply the EFIM algorithm.
import PAMI.highUtilityPattern.basic.EFIM as efim

obj = efim.EFIM('tdb.csv', minUtil=1000, sep=' ')
obj.startMine()  # start the mining process
obj.save('out.txt')  # store the patterns in a file
results = obj.getPatternsAsDataFrame()  # get the discovered patterns as a dataframe
obj.printResults()
The algorithm then returns a list of patterns that meet this minimum utility criterion.
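From there, one might, for example, sort the result dataframe by utility to inspect the most valuable patterns first (a hypothetical snippet; it assumes the dataframe returned by PAMI has Patterns and Utility columns):

# quick inspection of the mined patterns, highest utility first
print(results.sort_values('Utility', ascending=False).head(10))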