Every time a member publishes a trip or books a ride, we compute a fraud score using our business rules. A few of those rules are written by experts; others leverage machine learning models. So for each event flowing through our Kafka cluster, we now have a corresponding fraud score event. Those fraud score events are then consumed by fraud-fighting services that may react to high scores by canceling a trip or a booking, resetting a password, asking for extra information, or even blocking a member.
When everything goes well, by the time a highly suspicious driver clicks the publish button, we score and react to that action before the user is redirected to the next page.
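To make the flow concrete, here is a minimal sketch of the scoring step. All names (`FraudScoreEvent`, `score_event`, the specific rules and thresholds) are hypothetical illustrations, not the actual production rules: each incoming platform event is turned into a fraud score event combining expert rules and a model score.

```python
# Hypothetical sketch of per-event fraud scoring; names and rules are
# illustrative, not the real production logic.
from dataclasses import dataclass, field

@dataclass
class FraudScoreEvent:
    event_id: str
    score: float                      # 0.0 (safe) .. 1.0 (highly suspicious)
    triggered_rules: list = field(default_factory=list)

def score_event(event: dict) -> FraudScoreEvent:
    triggered = []
    score = 0.0
    # Illustrative expert rule: brand-new accounts are riskier.
    if event.get("account_age_days", 0) < 1:
        triggered.append("new_account")
        score = max(score, 0.6)
    # Illustrative model-based rule: flag high model scores.
    model_score = event.get("model_score", 0.0)
    if model_score > 0.8:
        triggered.append("model_high_score")
    score = max(score, model_score)
    return FraudScoreEvent(event["event_id"], score, triggered)
```

Downstream fraud-fighting services would consume these score events and react once the score crosses their own thresholds.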
When designing a data pipeline for machine learning, it’s critical to get three things right:
- Features at serving time are correct
- Features at training time are correct
- Labels at training and evaluation time are correct
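The first two points amount to avoiding training/serving skew. A hedged sketch of a check one could run, assuming hypothetical feature dictionaries keyed by feature name: compare the features logged at serving time against a training-time recomputation and report any disagreement.

```python
# Illustrative training/serving skew check; the function name and the
# dict-of-features representation are assumptions for this sketch.
def feature_skew(served: dict, recomputed: dict, tol: float = 1e-6) -> dict:
    """Return the features whose serving and training values disagree."""
    mismatches = {}
    for name in served.keys() | recomputed.keys():
        a, b = served.get(name), recomputed.get(name)
        if a is None or b is None:
            mismatches[name] = (a, b)          # missing on one side
        elif isinstance(a, float) and isinstance(b, float):
            if abs(a - b) > tol:               # numeric tolerance for floats
                mismatches[name] = (a, b)
        elif a != b:
            mismatches[name] = (a, b)
    return mismatches
```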
Once those three things are implemented reliably and we’re confident they will stay that way as we add more features and more sources of labels, most of the work is done. We expected our first model, although quite simple, to be a significant step forward in terms of detection performance. So the challenge was to minimize the risk of error as much as possible by relying on the most basic infrastructure possible, with the smallest amount of feature engineering.
As mentioned in the previous article, scammers can quickly adapt and bypass rules, making them useless over time. Getting access to fresh and reliable features has to be prioritized in that setup, because the older the features are, the less likely they are to reflect the current fraud patterns on the platform.
Rule #29: The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.
With that in mind, we decided to store all of the serving features inside the fraud score event and to rely only on those features at training time. This decision led to significant quality improvements and a reduction in code complexity, compared to other pipelines we have in production that have to rely on backfill to build training sets. Relying only on production logs also removes the risk of data leakage.
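The idea can be sketched as follows. Assuming (hypothetically) that each fraud score event embeds the serving features under a `features` key, building a training set is just a projection of the event log joined with labels collected later, with no backfill step.

```python
# Sketch of building a training set straight from logged serving features;
# the event shape ("event_id", "features") and label encoding are assumptions.
def build_training_set(score_events: list, labels: dict) -> list:
    """score_events: logged fraud score events, each embedding its features.
    labels: event_id -> 0/1 label confirmed after the fact."""
    rows = []
    for ev in score_events:
        if ev["event_id"] in labels:
            # Features come verbatim from serving time, so training
            # sees exactly what the model saw in production.
            rows.append({**ev["features"], "label": labels[ev["event_id"]]})
    return rows
```

Because the features are copied verbatim from serving time, no future information can leak into them; unlabeled events are simply dropped.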