Every time a member publishes a trip or books a ride, we compute a fraud score using our business rules. A few of those rules are written by experts; others leverage machine learning models. So for each event flowing through our Kafka cluster, we now have a corresponding fraud score event. Those fraud score events are then consumed by fraud-fighting services that may react to high scores by canceling a trip or a booking, resetting a password, asking for extra information, or even blocking a member.
When everything goes well, by the time a highly suspicious driver clicks the publish button, we score and react to that action before the user is redirected to the next page.
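To make the flow concrete, here is a minimal sketch of the scoring step. All names (`FraudScoreEvent`, `score_event`, the specific rules and thresholds) are hypothetical illustrations, not the actual production rules: each incoming platform event is turned into a fraud score event combining expert rules and a model score.

```python
# Hypothetical sketch of per-event fraud scoring; names and rules are
# illustrative, not the real production logic.
from dataclasses import dataclass, field

@dataclass
class FraudScoreEvent:
    event_id: str
    score: float                      # 0.0 (safe) .. 1.0 (highly suspicious)
    triggered_rules: list = field(default_factory=list)

def score_event(event: dict) -> FraudScoreEvent:
    triggered = []
    score = 0.0
    # Illustrative expert rule: brand-new accounts are riskier.
    if event.get("account_age_days", 0) < 1:
        triggered.append("new_account")
        score = max(score, 0.6)
    # Illustrative model-based rule: flag high model scores.
    model_score = event.get("model_score", 0.0)
    if model_score > 0.8:
        triggered.append("model_high_score")
    score = max(score, model_score)
    return FraudScoreEvent(event["event_id"], score, triggered)
```

Downstream fraud-fighting services would consume these score events and react once the score crosses their own thresholds.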
When designing a data pipeline for machine learning, it’s critical to get three things right:
- Features at serving time are correct
- Features at training time are correct
- Labels at training and evaluation time are correct
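The first two points amount to avoiding training/serving skew. A hedged sketch of a check one could run, assuming hypothetical feature dictionaries keyed by feature name: compare the features logged at serving time against a training-time recomputation and report any disagreement.

```python
# Illustrative training/serving skew check; the function name and the
# dict-of-features representation are assumptions for this sketch.
def feature_skew(served: dict, recomputed: dict, tol: float = 1e-6) -> dict:
    """Return the features whose serving and training values disagree."""
    mismatches = {}
    for name in served.keys() | recomputed.keys():
        a, b = served.get(name), recomputed.get(name)
        if a is None or b is None:
            mismatches[name] = (a, b)          # missing on one side
        elif isinstance(a, float) and isinstance(b, float):
            if abs(a - b) > tol:               # numeric tolerance for floats
                mismatches[name] = (a, b)
        elif a != b:
            mismatches[name] = (a, b)
    return mismatches
```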
Once those three things are implemented reliably and we’re confident they will stay that way as we add more features and more sources of labels, most of the work is done. We expected our first model, although quite simple, to be a significant step forward in terms of detection performance. So the challenge was to minimize the risk of error as much as possible by relying on the most basic infrastructure possible, with the smallest amount of feature engineering.
As mentioned in the previous article, scammers can quickly adapt and bypass rules, making them useless over time. Getting access to fresh and reliable features has to be prioritized in that setup, because the older the features are, the less likely they are to reflect the current fraud patterns on the platform.
Rule #29: The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.
With that in mind, we decided to store all of the serving features inside the fraud score event and to rely only on those features at training time. This decision led to significant quality improvements and a reduction in code complexity, compared to other pipelines we have in production that have to rely on backfill to build training sets. Relying only on production logs also removes the risk of data leakage.
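The idea can be sketched as follows. Assuming (hypothetically) that each fraud score event embeds the serving features under a `features` key, building a training set is just a projection of the event log joined with labels collected later, with no backfill step.

```python
# Sketch of building a training set straight from logged serving features;
# the event shape ("event_id", "features") and label encoding are assumptions.
def build_training_set(score_events: list, labels: dict) -> list:
    """score_events: logged fraud score events, each embedding its features.
    labels: event_id -> 0/1 label confirmed after the fact."""
    rows = []
    for ev in score_events:
        if ev["event_id"] in labels:
            # Features come verbatim from serving time, so training
            # sees exactly what the model saw in production.
            rows.append({**ev["features"], "label": labels[ev["event_id"]]})
    return rows
```

Because the features are copied verbatim from serving time, no future information can leak into them; unlabeled events are simply dropped.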