
Group-by data augmentation for e-commerce datasets

Introducing a new training strategy for learning user preferences over multi-attribute offers, based on a multi-task approach

Making sense of assorted groups of offer attributes

In tinyclues' SaaS business context, we want to model a user's propensity to purchase a given offer in the presence of its multiple offer attributes. In our previous Medium blog post, we explained that a core functionality of the tinyclues platform is to provide marketers with a highly flexible ML tool for creating targeted marketing campaigns.

In particular, the definition of an "offer" can be very broad and may include different product categories, attributes, and transaction contexts. It can even work in a case where no product catalog exists at all.

To tackle this challenge, we considered several machine-learning approaches. In this blog post, we'll dive deep into a complete implementation of one of the previously discussed ML designs. Mainly, we'll talk about the offer aggregation strategy internal to the model. We'll also share some insights about the model architecture and discuss various results. So, let's start!

Main idea

Our guiding intuition is to represent an offer as a bag of the features composing it. What does this mean concretely? Take an offer defined as a SQL filter over a transactions (user-offer interactions) table. When we apply that filter, we get all corresponding offer rows (instances). Next, for a given set of offer features, we capture the distribution of each offer feature (independently of the others).

In e-commerce datasets, offer features are typically categorical and encoded as one-hot labels. Thus, the resulting bags will be weighted multi-hot features encoded as sparse vectors. Note that we may optionally modify the relative row weights, giving more importance to recent events. At inference time, a trained model receives those multi-hot bag-of-attributes features as input (together with other features) to generate a prediction for a specific offer.
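As a minimal numpy sketch (all names are our own, hypothetical), here is how the weighted distribution of one categorical offer feature could be computed over the rows matched by an offer filter:

```python
import numpy as np

def bag_of_feature(values, weights=None):
    """Weighted distribution of one categorical offer feature
    over the rows matched by the offer's SQL filter."""
    values = np.asarray(values)
    weights = np.ones(len(values)) if weights is None else np.asarray(weights, dtype=float)
    uniques, inv = np.unique(values, return_inverse=True)
    mass = np.bincount(inv, weights=weights)   # per-value weighted count
    mass = mass / mass.sum()                   # normalize to a distribution
    return dict(zip(uniques.tolist(), mass.tolist()))

# rows matching a hypothetical filter; the last (most recent) event weighs double
bag = bag_of_feature(["acme", "acme", "zen"], weights=[1.0, 1.0, 2.0])
```

Stacking one such distribution per offer feature gives exactly the weighted multi-hot sparse vectors described above.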

However, it turns out that this is not enough yet! We need to modify the training schema as well to get accurate predictions when using this offer aggregation (otherwise, the model performance may drop a lot). Now we'll explain how exactly to do that!

A dummy example of offer attributes in a transactional table || Bag representations of some offers generated from that table. (Offer definition attributes are shown in red. Here, "pid" is short for "product_id".)

Keeping the main ideas in mind, our core logic becomes quite straightforward: we want to mimic the inference pattern of feature transformation during model training as well. Basically, we should replace the original non-aggregated offer features with their bag-aggregated analogs for all training batches. In machine learning literature, this process is known as data augmentation, and we'll now take a closer look at it.

Result of a group-by (on the "key" column) over categorical features ("brand" and "size", which are label encoded) and dense features ("emb"), followed by aggregation

Mini-batch approximation of offer aggregation

Note that for inference, we can access offer bag statistics from the full (or sufficiently well-sampled) transaction data. Yet, for neural network training, having a mini-batch-based approximation is convenient for constructing augmentations on the fly.

Fortunately, for the training of our models, we already used a batch size that is large enough (~10k), and we didn't notice any major drawbacks of using the mini-batch stats in our experiments (even if, in theory, such an approximation adds more randomness and may not be accurate for rare offers).

Okay, how do we generate a mini-batch augmentation? Let's pick some random offer attributes (it can be one or many), say brand and size. The obtained tuple (brand, size) is called a key, and the component performing this choice is called KeyGenerator. Next, for any row i ∈ MiniBatch, we replace the value of an offer feature with the bag of values taken from all mini-batch rows sharing the same key. The resulting feature will be of variable length if it is categorical. Since this is similar to a standard SQL operation, we'll also baptize it GroupBy.

For a given mini-batch, we generate several (typically 5-10) group-by augmented batches.

# input:  batch (dict: feature_name -> tensor)
# params: offer_features, average_key_length, nb_augmentations
# output: [batch_a1, batch_a2, ..., batch_an] (augmented batches)

def KeyGenerator(batch, offer_features, average_key_length):
    ...  # generate a group-by key from randomly chosen offer attributes

def GroupBy(feature_tensor, group_by_key):  # in TF 2.*
    unique_values, unique_idx = tf.unique(group_by_key)  # positions of unique keys
    grouped_ragged = tf.ragged.stack_dynamic_partitions(
        feature_tensor, unique_idx, tf.size(unique_values))
    # broadcast back to batch size -> ragged tensor of shape bs x None
    return tf.gather(grouped_ragged, unique_idx)

output = []
for i in range(nb_augmentations):
    augmented_batch = batch.copy()
    group_key = KeyGenerator(batch, offer_features, average_key_length)  # shape bs x 1
    for feature in offer_features:  # other features are left untouched
        augmented_batch[f'{feature}_grp'] = GroupBy(batch[feature], group_key)
    output.append(augmented_batch)

So, what about the behavior of KeyGenerator? Many key exploration strategies are possible. For instance, it is reasonable to select, with a higher probability, the offer keys that are used more often by our clients on the platform. Yet, we went with a very simple and usage-agnostic approach: each offer feature is chosen randomly and independently of the others with some fixed probability proba = average_key_length / number_of_offer_features, which produces a binomial distribution. We typically set average_key_length ~ 2. Once the key features are chosen, we define a "key" column as the tuple of key feature values (or by hashing them), and then we perform the offer features group-by based on it. If an offer feature is a multi-hot feature, we apply MinHash to it (with random seeding for each batch) to get back to the one-hot case.
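A toy Python version of this binomial key selection might look as follows (function names and the use of Python's hash for the key column are our own illustration, not the production implementation):

```python
import numpy as np

def key_generator(batch, offer_features, average_key_length, rng):
    """Choose each offer feature independently with
    proba = average_key_length / number_of_offer_features,
    then hash the tuple of chosen feature values into one key per row."""
    proba = average_key_length / len(offer_features)
    chosen = [f for f in offer_features if rng.random() < proba]
    if not chosen:                      # make sure the key is never empty
        chosen = [rng.choice(offer_features)]
    batch_size = len(next(iter(batch.values())))
    return np.array([hash(tuple(batch[f][i] for f in chosen))
                     for i in range(batch_size)])

batch = {"brand": ["a", "a", "b"], "size": ["s", "s", "m"]}
# with 2 features and average_key_length=2, both features are always selected
keys = key_generator(batch, ["brand", "size"], average_key_length=2.0,
                     rng=np.random.default_rng(0))
```

Rows 0 and 1 share the (brand, size) values and therefore land in the same group; row 2 gets its own key.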

Offer mixture

The basic KeyGenerator above corresponds to the AND conjunction of several attributes in the offer SQL filter. To support the OR disjunction case (like pid="942" OR pid="661" in the example above), we need to perform an extra offer mixture step. There are many ways to achieve this, but we consider here a simple method that preserves the mini-batch logic. In particular, we split the main groups obtained above into smaller sub-groups, which we then collide randomly to create a new key column. We do it in such a way that collided groups contain from 2 to 6 "pure" sub-groups. This simulates sparse mixtures of various offers with random weights. In the process, we apply those collisions only to a random fraction (~50%) of the generated keys.
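A rough numpy sketch of this split-and-collide step (for brevity it collides all sub-groups rather than only a ~50% fraction, and all names are illustrative):

```python
import numpy as np

def mix_keys(keys, rng, n_splits=3, min_merge=2, max_merge=6):
    """Split each group into up to n_splits sub-groups, then randomly
    collide 2..6 sub-groups (possibly from different parents) into
    new merged keys, simulating OR-mixtures of offers."""
    keys = np.asarray(keys)
    # assign each row to a random sub-group of its original group
    sub = keys * n_splits + rng.integers(0, n_splits, size=len(keys))
    sub_ids = np.unique(sub)
    rng.shuffle(sub_ids)
    merged, new_key, i = {}, 0, 0
    while i < len(sub_ids):
        take = int(rng.integers(min_merge, max_merge + 1))  # 2..6 sub-groups per collided group
        for s in sub_ids[i:i + take]:
            merged[int(s)] = new_key
        new_key += 1
        i += take
    return np.asarray([merged[int(s)] for s in sub])

mixed = mix_keys(np.array([0] * 5 + [1] * 5 + [2] * 5), np.random.default_rng(0))
```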

Offer mixture illustration: original groups are randomly split into smaller sub-groups, which are then randomly merged to form new collided groups (colors mark the different groups)

Note that this is very similar to generating a random linear combination of sample representations in classical semi-supervised methods such as MixUp.


Now that we have figured out how to modify batches to represent any offer, we'll feed those batches into the original model. Of course, one must also make sure that the model handles ragged inputs correctly. The loss per step can be defined simply as the sum of the respective losses. In fact, we may think of this as a multi-task learning process where each choice of grouping keys corresponds to a different task. That's an important point since it opens the room for various multi-task optimization methods.

loss = 0
for i in range(nb_augmentations):
    output_i = Model(augm_batch_i)
    loss += LossFn(output_i, response)

Note that it's worth creating several augmentations from one batch of data. It allows averaging the gradients of the different multi-tasks at each learning step, making training more stable. It also speeds up the training itself: grouped-by batches can be processed in parallel by the model while user feature embeddings (or other model nodes) are shared across all augmentations.

Group-by with fused aggregation

In practice, we do the group-by aggregation in a more optimized way. We first apply an embedding layer to the offer feature inputs (just once per batch) and then perform the data augmentation with a vectorial aggregation (like the mean on the "emb" column in the example above) in one single operation on the offer embeddings. By doing so, we can share the offer embeddings computation across the different batch augmentations and thus be more memory efficient.

Of course, this trick may not work for more complex aggregations. Technically speaking, thanks to the on-the-fly group-by with fused aggregation, we could implement everything inside the model itself without modifying our training data loaders at all. So, let's now look at what happens to the model!
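The fused "embed once, then aggregate per group" logic can be illustrated with a small numpy stand-in for the TF unique/segment operations (a sketch, not the production code):

```python
import numpy as np

def grouped_mean(emb, key):
    """Group-by with fused mean aggregation: average the (already computed)
    embeddings within each key group, then broadcast the group mean back
    to every row (numpy stand-in for tf.unique + a segment mean)."""
    _, inv = np.unique(key, return_inverse=True)   # group index per row
    sums = np.zeros((inv.max() + 1, emb.shape[1]))
    np.add.at(sums, inv, emb)                      # per-group embedding sum
    counts = np.bincount(inv)[:, None]
    return (sums / counts)[inv]                    # shape: batch_size x emb_dim

emb = np.array([[1., 1.], [3., 3.], [10., 10.]])   # offer embeddings, once per batch
out = grouped_mean(emb, np.array([7, 7, 9]))       # rows 0 and 1 share a key
```

Because the embedding lookup happens once, only this cheap aggregation has to be repeated for each augmented batch.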

Elements of model design

Group-by data augmentation can be combined with many possible ML/DL models, and the most noticeable performance improvement comes from the group-by itself. However, choosing the right model is also very important. We thus want to share some principles we considered for our model architecture choice. From our experience, the latter is typically responsible for around 2%-6% of relative metric improvement compared to a simple two-towers model (in the notebooks, we provide some ablation tests).

Mean and variance extraction for internal feature importance

To capture the main offer semantics, it's natural to extract the mean μ of the embeddings within groups. Yet besides μ, we also extract the embeddings' intra-group variance σ. On many datasets, we observed clear benefits from using σ in our models, and the intuition behind it is the following: a higher variance indicates the potential presence of noisy components among the offer embeddings.

Therefore, thanks to σ, a model can reduce their relative importance (one may think of a signal-over-noise kind of formula), whereas with only μ, a model would have to learn to sum useless offer components up to zero (a neutral vector), which is much harder and makes the model less robust when generalizing to unseen offers.

For instance, to represent a given brand, a model would rather rely on the brand feature embedding (whose σ is zero) and would completely ignore the component of the size embeddings (whose σ is typically high, because there is a wide range of sizes within any brand).

Instead of taking an exact formula for the action of σ on μ, we decided to learn it. Hence, we introduced an additional sub-network MaskNet, acting on μ by pointwise multiplication: μ ⊙ MaskNet(σ).

We parametrize MaskNet as a sequence of non-linear dense layers (DenseNetwork, or DN) with some sparse connections suitable for multi-task learning.
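A toy illustration of the μ ⊙ MaskNet(σ) gating, with MaskNet reduced to a single hypothetical sigmoid layer (the real MaskNet is a deeper DN):

```python
import numpy as np

def mask_net(sigma, W, b):
    """Toy single-layer MaskNet: maps the intra-group std sigma to a (0, 1)
    gate per embedding coordinate (W and b stand for learned parameters)."""
    return 1.0 / (1.0 + np.exp(-(sigma @ W + b)))  # sigmoid

def gated_group_embedding(mu, sigma, W, b):
    # pointwise multiplication: coordinates with high sigma (noisy) get damped
    return mu * mask_net(sigma, W, b)

W, b = -5.0 * np.eye(2), np.zeros(2)               # illustrative weights
mu = np.ones((1, 2))
sigma = np.array([[0.0, 10.0]])                    # 2nd coordinate is noisy
out = gated_group_embedding(mu, sigma, W, b)
```

With these weights, the zero-variance coordinate passes through half-strength while the noisy one is almost fully suppressed, which is exactly the signal-over-noise behavior described above.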

Note that one can also extract other statistics from the group embeddings as long as they are consistent, i.e., they converge to the full-dataset statistics as the batch size grows. And, to follow a more modern idea, one could use a self-attention type of model as an aggregator instead of deterministically extracted statistics.

Compressed feature-wise interactions

Another important principle we identified in our work: each offer (and user) feature should possess its own single embedding space. This is important for sharing the maximum amount of information while reducing the number of model parameters.

However, this comes at the cost of a more complex interaction module. For instance, it can be harmful to go with a simple sum of different feature embeddings (like size and brand). Indeed, suppose the size feature provides a very strong interaction. In that case, the mixed embedding space will be completely driven by the size semantics and may be too constraining for the brand representation.

This reasoning naturally brings us to a feature-wise bi-linear interaction model, which captures all individual user-offer feature interactions and allows the coexistence of different embedding spaces. Basically, for each pair of a user feature f and an offer feature g, we extract a scalar interaction u_f · K_{f,g} · o_g, where the kernels K_{f,g} are learned matrices.

It's an adequate choice when there are few interactions. But it becomes impractical (due to the higher number of parameters and the expensive computation) when the number of offer and user features grows. To solve this issue, we suggest the following tradeoff: before computing the feature-wise interactions, we combine some offer features into a "meta"-feature.

Thus, this piece of the model looks (in einsum tensor notation) as follows: meta = einsum('bfe,bfm->bme', offer_embeddings, ℂ), where b stands for batch, f for offer features, m for meta offer features, and e for embedding dimensions. The feature compression matrix ℂ is (instance-wise) parametrized by a DN, and MaskNet now acts directly on the meta-features.
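A shape-level numpy sketch of the einsum compression and the subsequent feature-wise interactions (sizes and random weights are purely illustrative; in the model, ℂ comes from a DN):

```python
import numpy as np

# b = batch, f = offer features, g = user features,
# m = meta offer features, e = embedding dim (sizes are illustrative)
b, f, g, m, e = 4, 6, 3, 2, 8
rng = np.random.default_rng(0)
offer_emb = rng.normal(size=(b, f, e))  # one embedding space per offer feature
user_emb = rng.normal(size=(b, g, e))
C = rng.normal(size=(b, f, m))          # instance-wise compression matrix (a DN output in the model)

# compress f offer features into m meta features
meta = np.einsum('bfe,bfm->bme', offer_emb, C)
# one scalar interaction per (user feature, meta feature) pair
interactions = np.einsum('bge,bme->bgm', user_emb, meta)
```

The interaction tensor has g·m scalars per instance instead of g·f, which is the whole point of the compression when f grows.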

We perform a similar meta-compression on the user features as well.

We argue that this feature compression is a legitimate choice for groups of hierarchical features (features sharing a compatible semantic), and as the usual number of hierarchical groups within offer features is rather low, we keep the number of meta features rather low as well (≤ 5). In some situations, it can be useful to require a sparser structure from ℂ (for a stronger feature selection effect).

We finally apply a DN again (with a gelu or tanh non-linear activation) over the extracted meta-feature interactions to get the scalar output. For more details (technical implementations, model hyper-parameter selections, evaluations, and so on), we invite the reader to check out these notebooks, where we reproduce our models on the publicly available MovieLens and Rees datasets.

Training time

Thanks to the fused aggregation and a suitable choice of training params (we typically double the number of epochs for group-by training), we found the overall training time increase due to group-by augmentation rather acceptable (≤ 3x in the worst-case scenario), compared to the numerous combinations of offer attributes the model learns.

Results

We'll present several results of our model with group-by data augmentation on internally available datasets. Group-by models are now used in production on the vast majority of our datasets (≥ 100 of them), but due to a lack of space, we'll only report the most typical and illustrative cases here. We also provide the notebooks that allow reproducing these results on two public datasets.


Our cross-validation protocol consists of training the models on historical events (around one year of data) and evaluating the performance over the two weeks following the training period. Let's now define the metric we look at. For a given grouping key (say "brand"), we focus first on the corresponding most popular offers (i.e., queries like ("brand" = X)). For each such offer, we evaluate the model AUC on the offer binary classification task (1 if the event belongs to that offer and 0 otherwise).

Finally, we average the computed AUCs (with weight = offer frequency) to get the metric we'll follow, wAUC. We report it for a subset of the available offer keys. While the top offers are more appropriate for measuring the group-by effect (since all groups then contain many elements), for cold-start benchmarks we take less frequent offers whose number of occurrences is ~20-200 within the evaluation window.
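A minimal sketch of such a frequency-weighted AUC (a rank-based AUC that ignores score ties, with hypothetical helper names):

```python
import numpy as np

def auc(y_true, scores):
    """Rank-based AUC: probability that a positive outranks a negative
    (score ties ignored for brevity)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def weighted_auc(per_offer):
    """per_offer: list of (offer_frequency, y_true, scores) triples."""
    num = sum(w * auc(y, s) for w, y, s in per_offer)
    return num / sum(w for w, _, _ in per_offer)

y = np.array([0, 0, 1, 1])
good = np.array([0.1, 0.2, 0.8, 0.9])   # perfect ranking for this offer
bad = np.array([0.9, 0.8, 0.2, 0.1])    # inverted ranking
w = weighted_auc([(2.0, y, good), (1.0, y, bad)])
```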

Multi-task model vs. mono-task models

To challenge our multi-task approach, for each chosen offer attribute we train a mono-task model which is restricted to the corresponding offer feature. These specialized models are trained without any augmentation but share the same architecture (except for σ, which is zero in that case).

Here we report the models' wAUC performance for different offer attributes (taking offers with ≥ 200 occurrences):

Dataset A (classical retail with an atypical size feature) || Dataset B (classical retail with a strong geographical shop_id feature)
Dataset C (a typical travel-industry case) || Rees (public retail dataset)

Let's take a closer look at the results. We note several important observations:

  • First of all, as expected, mono-task models show strong results when evaluated on their corresponding task but typically underperform on the others.
  • A model doing the offer group-by only at inference, after purely classical training (without augmentations), shows a significant performance drop on most tasks.

We see that group-by trained models perform very well compared with the specialized models, and that's precisely the main effect we wanted to create! But beyond that, we'll show you below some other interesting results that came along the way.

Cold start scenario

When evaluating the wAUC of rare offers (with a number of occurrences between 20 and 200), we see that our multi-task group-by model starts to show even better results than the specialized mono-task models, which shows that group-by models can effectively leverage the knowledge of all available offer features in the cold-start scenario. Note also that the group-by model leverages it significantly better than the model trained without augmentations, which has access to all offer features as well.

Cold start results. Retail A || Retail B || public Rees retail dataset

Offer mixture experiments

To validate the importance of the offer mixture in group-by data augmentation, we generate artificial "mixed" topics by randomly sampling a fraction λ of rows from one offer and a fraction 1−λ from a second offer (this corresponds to an OR query between the two offers).

We vary λ and follow the AUC on the λ-mixed offer for two models: one trained with only pure groups and another with the extra group-mixture. What we note is that for the extreme values of λ, both models perform similarly. However, the collision-stimulated model shows significantly better results in the middle; and if the two offers are more "dissimilar," one should expect a bigger gap.

The plot below shows some typical behavior:

AUC on the mixture (with parameter lambda) of two offers for two models: one trained with key collisions (AND_OR) and the other without (only_AND) || Same for another choice of offers

Negative transfer

We see that, sometimes, training on some offer attributes may result in a performance drop on others (hence we speak about conflicting attributes, like shop_id and product_id on Dataset B). This phenomenon, called negative transfer, is well-known and actively studied in the multi-task world.

To mitigate it, a variety of techniques can be applied. Here is a list of several ideas for future work:

  1. A guided keys exploration (aka curriculum learning);
  2. An exploration that avoids taking the same or similar keys but rather tries to create complementary offer keys (based on their mutual information) for each batch;
  3. Focal loss (we applied it with success on datasets with an important disparity in offer feature interaction strengths);
  4. Gradient surgery for conflicting tasks and more multi-task-friendly model architectures.

To illustrate the idea of curriculum learning, we observe in the example below that during training, the "shop_id" attribute (which is easy to learn thanks to the "zipcode" user feature) reaches its top wAUC quickly, while it takes more epochs for the other, item-dependent offer attributes. This is a rather typical pattern, and therefore it's worth focusing more on the "hard" keys after the first epochs.

wAUC evolution over the training epochs

Feature representation quality

Another noticeable effect of group-by training is the improved quality of both user and offer embeddings. We observed a better performance of user embeddings transferred to other downstream tasks. Thus, group-by data augmentation can be used as an efficient pre-training strategy or as an auxiliary task for semi-supervised training.

Group-by trained offer embeddings (more precisely, offer meta features) used for (cos-)similarity search also showed more relevant results. Moreover, the overall model diversity (meaning the capacity to select different audiences for different offers while keeping the same prediction accuracy) has increased as well (especially for lower-entropy categories such as cat1, cat2, or brand).

Offer meta features CosSimilarity for the top-40 directors on the MovieLens dataset. Comparison of two models: group-by trained model || model trained without augmentation

The main intuition is as follows: group-by training forces a model to make sense of all offer embeddings individually and of their combinations, whereas without the data augmentation, the lower-entropy features serve mainly as compensation for the poorly sampled ones of higher entropy. Note that, to some extent, a similar effect is reached with a classical feature masking/dropping type of data augmentation (and the group-by can be successfully combined with the latter!).

Conclusion

To wrap it up, we've suggested a new data augmentation strategy called group-by data augmentation. It allows learning user preferences from all combinations of offer attributes in a multi-task fashion, producing high-capacity models. And that's with an acceptable level of negative transfer and a limited increase in training time! Beyond that, the trained models prove to be powerful for improving embedding quality and tackling the cold-start problem.

Last but not least, we believe that this data augmentation can find other applications for tabular datasets and is not limited to learning only offer-user interactions.

