Online Payments Risk Management
Part III. Tools and Methods
Detection: Figuring Out that Something Is Wrong
Detecting that “something” is happening in your system is one of the core activities of the RMP team. How do you set up effective detection mechanisms? By employing three activities: measuring your current performance, using inflow to predict future performance, and setting up mechanisms to detect outlier activity that needs to be investigated.
Measuring Performance
The biggest advantage in measuring your performance based on actual numbers is being able to know exactly what’s going on. The biggest disadvantage is that you have to wait a long time to get that number, basically making it impossible to respond to events in any reasonable amount of time; you could be out of business by the time you get your fully developed loss numbers. While measuring your rejection rate (the percent of stopped applications at every stage of your decision process [models, manual decisions]) is pretty straightforward, merely trying to understand which of your rejections is a mistake is going to take a while. As I noted before, card payment time-based cohorts (grouped by origination date) take 60–70 days to reach 95% development, which means that it takes a few months to see around 95% of the chargebacks you can expect to receive for a given cohort. In consumer lending, write-off time (the point in a debt’s lifecycle when you stop collecting and declare a debt lost and off your balance sheet) is based on your payment terms and the way regulation defines default — most probably a much longer time.
Measuring performance as well as understanding it requires that you understand the different steps in your value chain as well as have an overall view; optimization doesn’t only mean losses vs. rejections but can also include collection cost and revenue. Your recovery activity economics (the cost for challenging a chargeback, your ability to charge a late fee, etc.) may change your risk appetite and tolerance toward some people not paying you initially. Your current and estimated loss rates determine whether you can operate profitably, enter a market more aggressively, or need to make quick actions to limit a growing problem.
Measuring Offset Performance: Time-Based Cohorts
One of the basic yet important things to remember when measuring performance is understanding that the performance you see today is not a result of decisions you made yesterday, but rather a longer period of time ago. This is where time-based cohorts come in handy. Time-based cohorts mean that you present results, such as loss rates, based on when applications were received rather than when you learned of the loss, e.g., all credit card purchases attempted in a certain month. The correct or incorrect decisions you made on an application are a result of your detection capability and understanding at the time it was attempted, not the time at which you discovered a problem, since that could be anywhere from a few seconds to a few months after approval. If you received five chargebacks today, you cannot infer that you made five bad decisions yesterday. Maybe one chargeback is for a month-old purchase, and the other is for a week-old one. You must look at the originating cohort, the time frame in which these purchases were approved, to analyze whether you have a problem relative to other purchases in that cohort.
What Should You Measure?
Measure Your Defaults
A default happens when people don’t pay when they’re supposed to or charge back on a payment. It doesn’t mean that you won’t recover some or all of that debt later, but it does mean that you’ll have to work for it. Defaulted volume is sometimes referred to as gross loss; after all recovery steps, when written off, it is sometimes referred to as net loss.
When measuring consumer defaults, in addition to measuring totals (number of defaults, their volume, and percent of total payments volume) you should at least group by product (if you have more than one), by payment instrument (direct bank payment vs. Visa vs. Amex vs. other payment options), and by the way in which you found out about the default (chargeback, customer complaint, analyst flagging, etc.). Different products and default-discovery channels work differently, have different false-positive rates, draw a different mix of consumers, have different collection success rates down the line, and so on. You must segment to be able to track different populations effectively.
Another important dimension is industry segment. This is true for both consumer and merchant defaults and is driven by portfolio risk management. Macroeconomic changes leading to lower consumption, a credit shortage impacting high-amount, cash-flow-dependent segments, or a sales team too focused on risky segments can quickly turn your portfolio south without you noticing. Bad merchants bring in worse customers and tend to default more themselves.
When you measure your performance this way, you must again deal with the fact that defaults take a while to mature; while you care about gross loss, net loss is what you usually optimize for (while taking recovery cost/benefit into consideration). Suppose you’re looking at all purchases from last month. How do you know what their eventual loss rate is going to look like, whether you’re improving or worsening, and what drives those trends? Part of the answer is measuring and inferring based on inflow — the topic of our next section — but by comparing early loss evolution you can try to infer what future performance will look like.
Looking at loss evolution[1] at different points in time (5, 10, 20, 30, or so days after purchase until you hit your write-off point) and comparing how different cohorts are doing will allow you to identify problematic cohorts and assess what their future performance is going to look like.
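To make cohort-based measurement concrete, here is a minimal sketch in Python (pandas), not taken from any particular system: it assumes one row per purchase with purchase_date, amount, chargeback_date, and chargeback_amount columns (illustrative names), groups purchases by origination month, and computes the loss rate observed at a few fixed checkpoints after purchase.

```python
# Minimal sketch: loss evolution by origination cohort. Column names and
# checkpoints are illustrative assumptions, not a prescribed schema.
import pandas as pd

def loss_evolution(purchases: pd.DataFrame, checkpoints=(5, 10, 20, 30, 60)) -> pd.DataFrame:
    df = purchases.copy()
    df["cohort"] = df["purchase_date"].dt.to_period("M")                       # origination month
    df["days_to_cb"] = (df["chargeback_date"] - df["purchase_date"]).dt.days   # NaN if no chargeback

    rows = {}
    for cohort, grp in df.groupby("cohort"):
        volume = grp["amount"].sum()
        rows[cohort] = {
            f"loss_rate_day_{d}": grp.loc[grp["days_to_cb"] <= d, "chargeback_amount"].sum() / volume
            for d in checkpoints
        }
    # Each row is a cohort; comparing columns across rows shows whether a
    # young cohort is developing losses faster than its predecessors did.
    return pd.DataFrame.from_dict(rows, orient="index")
```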
Measure Your Recovery
Loss optimization can be driven by measuring defaults, but that needs to be complemented by other numbers. Specifically, for problems with ability to pay — credit issues — but also with intent problems, recovery and the way you manage it can make the difference between a losing and profitable operation. You must track recovery cost and success.
Your main KPIs should be recovery success rate (as simple as percent of net loss from gross loss), average number and cost of actions done on each default, and the success rate of every action: both in being able to reach the defaulted party and driving for full or partial payment of the debt. Different financial institutions may handle chargeback management differently, have easier or harder decision processes, and generally impose different success rates on your operations. Make it a priority to identify vendors that make it harder to recover. As in any other customer-facing activity, you should always perform A/B tests to improve performance. A/B testing here means designing experiments to test competing services against one another on similar populations, then choosing the ones producing the best results. Collection agencies work differently in different regions, have different modeling techniques and cost structures, and may prove to have highly varying efficiency. The same is relevant for external chargeback management; due to different unit economics they may end up optimizing against you.
Any relationship with a vendor can be plagued with conflicts of interest.
However, in payments it is undeniably evident with every purchase. Specifically regarding loss liability, the hot potato lands way too often in the retailer’s lap. If you use PayPal, for example, different products will have different protection policies, meaning that you may end up with vastly different loss levels, even with identical default levels.
Measure Your Rejections
Rejections should be analyzed not only because they are lost revenue — some consumers will try again until successful, some are rejected due to system errors and data malfunctions, some just type their information incorrectly. Rejections happen for many reasons, and their analysis helps discover product and flow issues.
Instrumentation. Proper instrumentation will provide you with enough data to conduct in-depth analysis and determine the reasons for a rejection. Rejections should be instrumented and logged in exactly the same way as approved applications: data content, time stamp, decisions, and so on. Rejections should also have a trace of all systems that recommended a rejection, not only the one that actually rejected them. This will allow the creation of what-if scenarios — simulating what additional volume of approved applications would enter the system if a detection system were altered.
Control Groups. If you apply a certain rejection to 100% of a certain population, you have no way of knowing whether a behavior shifted, was eliminated, or was misinterpreted by your system. The size of control group you can reasonably allow depends on your portfolio size, distribution, and system structure, but it is a tool you have to implement and use from the get-go or you will always have to hack something to compensate for not having one. Three to six months down the line from layering detection systems without a control group, you will have no clue why a customer was denied. Thus, you need to measure your rejections’ real performance and false positives against the control group to get a clear picture.
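A control group does not need elaborate infrastructure to start with. The sketch below is a simplification under assumed conventions, not a recommendation: it keys a deterministic holdout on the application ID and the name of the detection system, so the same application always falls in or out of a given system’s control group.

```python
# Minimal sketch: deterministic control-group assignment per detection system.
# The 2% holdout size is an arbitrary illustration; pick it based on portfolio
# size and how much loss you are willing to pay for measurement.
import hashlib

def in_control_group(application_id: str, system_name: str, holdout_pct: float = 0.02) -> bool:
    digest = hashlib.sha256(f"{system_name}:{application_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF          # stable, uniform value in [0, 1)
    return bucket < holdout_pct

# Usage: when in_control_group(app_id, "velocity_rules") is True, log the
# system's recommendation but let the application through, then track its
# performance separately to estimate false positives.
```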
Defining a Rejection. What’s a rejected customer in your flow? How should you calculate rejection rate? Should all customers be accepted on their first try? Is an application that you challenge with questions and accept the same as an application you accept up front? Measuring raw rejection rate (where every rejection counts) is advisable but requires that you know, in detail, the reasons for each rejection, or you’re at risk of trying to optimize on model rejections when most of your problems are, maybe, faulty merchant integrations. Measuring “cleaned” rejection rates (where, for example, a consumer that tried more than once and eventually got accepted is not considered to be rejected) makes sense when you don’t have that level of accuracy, but should be accompanied by an analysis of how repeated rejections and tries impact customer lifetime value (LTV).
Measure Ops Performance
Payments risk operations are your last line of defense on defaults, and the Recovery team is the one saving you from write-offs. You need to know how these teams perform as a group and as individuals to make sure that you are investing in the right tools and people. While defaults and recovery are covered in the previous sections, here I’m referring to measuring the efficiency and efficacy of the teams’ operation.
Although you should not staff your team with strictly customer-service-oriented personnel, Ops should still be measured for responsiveness, throughput, and accuracy of decisions or recovery attempts. Manual actions must be automated away using flow-management tools, and data collection actions should be done automatically — APIs have become very popular, and programmatic data access is very common. A word of clarification: I refer to customer-service professionals and where I see them fit in the RMP team. In determining that most customer-service-trained staff shouldn’t double as your manual decision team, I’m using the term “customer service” very broadly, referring to staff focused on interacting with and helping customers vs. analytically and/or technically inclined people. Of course, many talented people can do both, but promoting from within customer service will usually find professionals who are much more focused on the former rather than the latter. Their talents are highly needed in the support and recovery teams.
There are two main differences between a customer-care team and risk ops.
The first is that process quality can be measured and optimized on much more than customer interaction quality. You should be able to measure and improve the number of cases that are sent to Ops and their quality, the effectiveness of the decision process (number of clicks to decision, average number of data sources used per decision, number of opened tabs), and the analyst’s ability to decide (from average time per decision crossed with accuracy, to the probability of that agent asking for more information or deferring to a senior agent before making a decision).
The second difference is that tying individuals’ and teams’ performance to actual quantitative data, rather than just qualitative, is easier. It is relatively easy, and must be one of your core measuring activities, to see how much your Ops and Recovery activities are saving your company. For Consumer Risk Ops, for example, measure rejected volume, discounted by what they reject incorrectly or miss. For the Recovery team, measure the conversion of defaults to write-offs on a per-purchase basis. You’d be surprised to find out that the best recovery agents can be 10x better than the worst ones on a per-action basis.
I am delivering a two-sided message in this book: on one hand, use Ops extensively; on the other hand, use them in a specific manner. A huge part of using Ops as domain experts is delivering the right quality of cases, the complicated and ambiguous ones, rather than clear-cut cases that can be automatically sorted. The main indicator of case quality is review queue hit rate: how many of those purchases put in the review queue are actually bad, based on whether the team rejected them or not.
What’s “Normal” Performance?
How do you know if you’re doing a good job? What performance numbers do you need to aspire to? Defaults, rejections, and operations performance depend on your industry, type of payment, country, and loss tolerance, and they vary accordingly. Still, the following are some numbers that are important as reference.
Defaults
For most businesses operating in the US and using credit cards, 1% is the upper limit, with most hovering around 0.4% for purely online purchases. Lending businesses working with prime customers perform very close to their respective country’s credit card defaults; many Western countries are at around 6%. Subprime lending sees defaults in the high teens or low twenties. These are numbers for your whole portfolio. You may be very tolerant toward losses in new markets or some products — some digital merchants are fine with 5%–15% loss rates, and they refund payments upon first complaint. While these may be standard numbers, they are definitely not optimal. FraudSciences had less than 0.1% losses in high-risk payments, Klarna reports <1% losses while granting short-term credit, and PayPal reported not more than 0.26% of losses while covering more and more of its merchant base’s losses.
Rejections
As the flip side of losses, rejection numbers are both complicated to discern and not shared publicly. Businesses using credit cards in the US have rejection rates that reach a maximum of 5% and go as low as 0.5%. Lending businesses can go up to 80% rejection, depending on the credit status of their customer base. Roughly speaking, rejection rates of more than 10% are high for any standard online business using cards or bank transfers without providing any loss guarantee. Klarna reports less than 10% rejections as a lending business; FraudSciences gave a loss guarantee, took high-risk purchases, and had a 25% rejection rate.
Operations Performance
Review time and effectiveness differ significantly between industries and rely on automation, data, and tool quality. As you increase automation, review time actually becomes longer, since review staff get more complicated cases. Still, five-minute reviews are reasonable, with one-minute reviews for first-tier simple cases being the goal. False positives at this stage could go up to 20%.
Detecting that “Something” Is Happening
Detection efforts deal with two issues. The first is detecting phenomena that you expect to happen in your system: some types of fraud, abusive behavior, or maybe a positive trend. The other is being able to understand when an unexplained phenomenon, one that may or may not lead to losses and increased risk, is occurring in your system. While measuring performance in its early or late stages can point to an existing problem, detecting trends early can lead to those losses being prevented. Let’s look at a few ways to detect issues in advance.
Incoming Complaints
Closely monitoring suspicious cases flagged by consumers, merchants, and employees is one of the major data sources for your detection efforts. Consumers suddenly starting to complain about a merchant, higher levels of nonshipment complaints, or a merchant pointing at a trend they see on their website — all are possible leads for a trend you may have missed.
Inflow
One portfolio-level indicator for a change in your risk levels is your application inflow, and more accurately inflow composition. This means looking at the types of applications you’re getting and checking whether that has changed compared to historical trends, even before seeing a single default. Different consumers and merchants have different risk profiles, and understanding that you are seeing a different population than the one you were used to is a leading indicator for trouble (or, sometimes, very good news). There are many ways to attack portfolio segmentation for inflow measurement; however, some are more common than others.
If you are already deploying real-time models, measuring score or classification distribution shift in accepted and rejected applications is the first one, since your score is an indication of the risk level you assign to customers. As long as your model’s performance hasn’t deteriorated significantly, a shift in score distribution (for example, more consumers accepted at low scores than expected) can signal issues.
What’s a score distribution, and what does a shift mean? Your model assigns a score to each application. Normally, you’d expect these scores to show a roughly normal distribution, peaking around the threshold score. If you plot scores given by the model for a certain cohort and see that suddenly you have an over-representation of higher or lower scores than usual (to put it more plainly, the curve shifted and peaks at a different score, or has multiple peaks), this is an indication of a shift in your population.
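One common way to quantify such a shift is the Population Stability Index (PSI), which compares the share of applications falling into each score bucket in a baseline period versus a recent cohort. The sketch below assumes two arrays of scores; the bin count and the usual alert thresholds are conventions, not requirements.

```python
# Minimal sketch: PSI between a baseline score distribution and a recent cohort.
import numpy as np

def psi(baseline_scores, current_scores, bins: int = 10) -> float:
    edges = np.quantile(baseline_scores, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                       # catch out-of-range scores
    base_pct = np.histogram(baseline_scores, edges)[0] / len(baseline_scores)
    curr_pct = np.histogram(current_scores, edges)[0] / len(current_scores)
    base_pct = np.clip(base_pct, 1e-6, None)                    # avoid log(0) and division by zero
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

# A common rule of thumb: PSI below 0.1 is stable, 0.1-0.25 deserves a look,
# and above 0.25 usually means the population has genuinely shifted.
```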
In addition to scores, you should segment inflow based on (at least) product type, age group, amount (in case of purchases), and industry segment. If one of these subsegments becomes more prominent in its contribution to overall volume, it needs to be looked into. A higher than average purchase volume from 18- to 22-year-olds in the clothing segment could mean that summer sales are coming, but also that you have a new merchant drawing young shoppers with significantly lower prices — increasing their operational risk — or any number of other reasons.
Linking
Fraud only matters when done at scale. Fraudsters and most abusers aren’t just looking to get away with one free item or service; they are in the business of stealing valuables and reselling them. Being a fraudster requires investment: time, purchase of stolen identities or hacked accounts, setting up proxies, finding drop addresses, or more plainly the risk of getting caught. If they cannot effectively repeat their actions and steal a lot of value from you or your merchants, their return on investment is too low, and that alone will deter most of them. Therefore, limiting scale — limiting the ability to repeatedly exploit a weakness in your system in a similar fashion over time — is one of the things you have to pay close attention to.
Linking is mostly concerned with horizontal scale, the fraudsters’ tendency to find a loophole in your system, then repeat it as often as possible using multiple identities and customer accounts that are supposed to seem unrelated. Linking, therefore, is a mechanism used to connect customers and their activities together so you can detect when they come back, and especially when they come back while manipulating their details and trying to hide from your regular detection efforts and look like a completely new or unrelated user.
Implementing a linking mechanism can be very simple or highly complex. The simplest implementation requires nothing more than explicit matching between two purchases or accounts to deem them linked. Still, even when using simple linking heuristics you need to be able to filter: only linking using IP, even if explicit, will result in many false-positive links, such as multiple customers using one workplace’s network. You therefore need to combine a few assets as links.
Some Linking Terminology
An explicit link means that details in two purchases are identical, e.g., both applications come from the same IP or email.
A fuzzy link is any link that involves partial similarity. Two accounts using IPs 111.bbb.aaa.1 and 111.bbb.aaa.2 can be related since they share a class C subnet on an uncommon network.
As with any type of behavior, patterns in linking can point to specific behaviors, some riskier than others. Most commonly, if two purchases come from the same IP and email but not the same name and address, they are highly suspicious. Most of the time, this linking pattern points to a fraudster using multiple identities across purchases.
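The following sketch shows how little code the basic checks require; the purchase records are plain dictionaries and the field names are invented for illustration.

```python
# Minimal sketch: explicit links, a fuzzy IP link, and the "same IP and email,
# different identity" pattern described above.
def explicit_link(a: dict, b: dict, field: str) -> bool:
    return bool(a.get(field)) and a.get(field) == b.get(field)

def fuzzy_ip_link(a: dict, b: dict) -> bool:
    # Sharing the first three octets (a class C subnet) counts as a fuzzy link.
    return a["ip"].rsplit(".", 1)[0] == b["ip"].rsplit(".", 1)[0]

def suspicious_identity_pattern(a: dict, b: dict) -> bool:
    same_assets = explicit_link(a, b, "ip") and explicit_link(a, b, "email")
    same_identity = explicit_link(a, b, "name") and explicit_link(a, b, "address")
    return same_assets and not same_identity
```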
How Complex Should Your Linking Be?
Creating more sophisticated linking can be done by (1) collecting more data, or data that are harder to obfuscate (such as new types of cookies), (2) adding fuzzy matching, and (3) adding recursive matching, going beyond one level of matching on a smaller set of assets. There’s always more data to be collected and fuzzy matching to be done, but returns diminish quickly after the first few basic link types. That is why most companies offering linking as a service are struggling. Developing something simple in-house captures most of the benefit. Recursive matching refers to linking starting from a base entity A and returning not only the B1..i entities linked to it directly but also the C1..i entities linked to B-class entities, up to the Nth level. This is done because fraudsters, even the more-sophisticated ones, tend to reuse assets. That is also the reason why more data have quickly diminishing returns.
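Recursive matching is essentially a breadth-first expansion over shared assets. A minimal sketch, assuming you already maintain two indexes (account to assets, and asset to accounts), could look like this:

```python
# Minimal sketch: N-level recursive linking. `account_assets` maps an account
# ID to its assets (IPs, emails, device IDs, ...); `asset_index` maps an asset
# value back to the accounts seen with it. Building and filtering those indexes
# (e.g., dropping shared office IPs) is the real work and is not shown here.
from collections import deque

def linked_accounts(seed_account: str, account_assets: dict, asset_index: dict,
                    max_depth: int = 2) -> set:
    seen = {seed_account}
    queue = deque([(seed_account, 0)])
    while queue:
        account, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for asset in account_assets.get(account, ()):
            for neighbor in asset_index.get(asset, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, depth + 1))
    return seen - {seed_account}
```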
The biggest issue with linking isn’t algorithm complexity but implementation, especially for real-time linking on large data sets. Creating a fully fledged linking mechanism that works close to real time is a big investment — fuzzy linking is especially taxing on whatever data infrastructure you use — and thus hard to justify. As a result, finding the optimal functionality to deploy in real time is a complex exercise in ROI calculation.
Velocity
While measuring inflow allows you to see changes from baseline on a portfolio level, there is also a need for outlier detection in smaller batches or on a per-customer basis. By “velocity,” I mean models looking at out-of-the-ordinary repeat behavior or occurrence of “something.” Velocity models come in many shapes and forms — from univariate numerical models looking at purchasing velocity from consumers to clustering algorithms.
Some Velocity Terminology
Univariate velocity models are models counting an individual quantity, such as number of purchases, and responding to changes in that quantity. They are the simplest and most common, because they capture a large number of suspicious activities, either by individual customers or by a group.
A baseline is the normal level of a phenomenon as observed in your data. A baseline is hard to determine in limited data sets, since no history exists to determine standard levels of any type of activity. However, once you have a few million events, you can start to determine what’s common and uncommon for your customers: IPs, email domains, purchase patterns, and more.
Clustering systems, or algorithms, detect groups of cases that have similar values for a number of features or indicators. While univariate models require that you set a threshold or a baseline to compare to and therefore indicate what’s okay and what isn’t, clustering algorithms are unsupervised. This means that they don’t require past examples of good and bad. You get groups of cases, and it’s up to you to decide whether to investigate and act on them or not. Of course, with time, you can label certain clusters as belonging to a certain behavior. Clustering algorithms are both slower and significantly more complex than any other model discussed here and, therefore, only used by teams with vast data sets and advanced tools.
Baselining
Spikes in velocity should always be compared to a baseline, and that baseline should be selected wisely. Repeat purchasing behavior for a 65-year-old customer shopping for gardening equipment isn’t the same as for a 21-year-old college student shopping for clothes. It works the same way with merchants — separating seasonal sales segments from year-round ones will reduce false positives.
Iterative Analysis
Constant analysis of results and definition of velocity root causes — seeing why something popped on your screen and whether it’s a previously observed behavior — is a key activity to make the best of your velocity tool. Giving names to velocity patterns is exactly like defining any other customer behavior. With time, you’ll discover which peaks are clearly good or bad and which require additional attention. Coming back to the previous example, once you detect seasonality, you need to adjust your model to deal with and flag seasonality, and sometimes disregard it while searching for new and unexplained phenomena. Most seasonality is business as usual — a phenomenon to be detected but not to alert your team to or require any change in behavior from your operators.
Individual Merchant/Consumer Velocity Models
Individual models compare number and volume of purchases and sales against a chosen baseline (projections for new merchants, population purchasing behavior for consumers, etc.) to flag those that are more active than usual or whose activity drops (the latter is more relevant to merchants; churned consumers are interesting but not for our current purposes). While hyper-growth volumes aren’t necessarily an indication of malicious intent, they can easily drive losses even when the customer means well. A merchant growing in sales beyond their operational capacity may go bankrupt or just create a tidal wave of complaints; an overzealous consumer may try to grab products and run, or may simply be on a drunk purchasing binge. All need to be examined and sometimes contacted to alleviate the suspicion or ask for guarantees of continued activity.
General Outlier Detection Models
Many of the trends in your system will not rely on individuals and therefore will not be detected through linking or individual velocity models. General velocity tracks the appearance of individual assets (IP, email domain, zip code, etc.) and flags unusual spikes. This means specific IPs or IP ranges appearing beyond their baseline (an attack, or a promotion at a certain school or work place), new and rare domains spiking (a new free email provider used for fake identities), and others. Univariate velocity models capture most of the outlier activity that is not caught by inflow composition tracking, but depending on your system’s complexity, clustering systems may be relevant.
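A univariate velocity check can be as small as the sketch below, which flags an asset (an IP, an email domain, a zip code) when today’s count exceeds its own recent baseline. The window length, minimum count, and three-sigma threshold are assumptions to be tuned, not recommendations.

```python
# Minimal sketch: univariate velocity alert for a single asset.
import statistics

def velocity_alert(daily_counts: list, today_count: int,
                   sigmas: float = 3.0, min_count: int = 10) -> bool:
    if len(daily_counts) < 7:                      # not enough history for a baseline
        return False
    mean = statistics.mean(daily_counts)
    stdev = statistics.pstdev(daily_counts) or 1.0
    return today_count >= min_count and today_count > mean + sigmas * stdev
```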
[1] A way to describe how your time-based origination cohorts are doing over time. It plots portfolio default level for a certain cohort at different, but set, points in time, for comparison.
Chapter 8. Analysis: Understanding What’s Going On
The next step after detecting that something is going on is root cause analysis. Understanding the cause quickly and accurately allows you to solve the current issue as well as prepare for any future changes in the user’s behavior. As with anything, proper analysis starts with planning and is highly dependent on the architecture of your system and data availability.
Designing for Analysis
It is hard enough to understand what happens in your system as it is; lacking data, overlapping systems, and bad instrumentation just make it so much harder. When approaching system and data architecture, there are several things to remember that will make your work much easier.
Instrumentation and Data Retention
Many analysis attempts fail due to poor instrumentation and data retention. This usually stems from engineers optimizing for performance and storage size in production code and ends with missing data — since maintaining event-based historical data is not anywhere near the top of these engineers’ minds. The simplest example is canceled or rejected applications deleted after a very short while or saved in highly redacted form. When you try to model consumer behavior or merchant performance deterioration patterns based on cancellations and rejections in addition to purchases, you fail due to lack of data. The most advanced example is point-in-time analysis. When you build predictive systems, you always need to be able to look at purchases at the point of approval without post-decision information, but most systems overwrite the purchase object as states change rather than keep snapshots or use an event-based approach.
RMP requires either an event-based system keeping track of each change, or periodical snapshots from almost day one for further analysis. Either way, you must design your system with preparation for future instrumentation and as little data loss as possible.
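The event-based approach can start out very simply: append every state change as an event, and rebuild an application’s state as of any timestamp by replaying events up to that point. The sketch below only illustrates the idea, not a storage design, and the event shape is an assumption.

```python
# Minimal sketch: append-only events plus point-in-time reconstruction,
# instead of overwriting the purchase object in place.
from datetime import datetime
from typing import Optional

def record_event(event_log: list, application_id: str, changes: dict,
                 ts: Optional[datetime] = None) -> None:
    event_log.append({"application_id": application_id,
                      "ts": ts or datetime.utcnow(),
                      "changes": changes})

def state_as_of(event_log: list, application_id: str, as_of: datetime) -> dict:
    state = {}
    for event in sorted(event_log, key=lambda e: e["ts"]):
        if event["application_id"] == application_id and event["ts"] <= as_of:
            state.update(event["changes"])
    return state               # the application exactly as it looked at `as_of`
```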
Data Latency and Transformation
While portfolio analysis sometimes resembles and uses Finance and Business Analysis techniques, most RMP analysis done in order to identify and understand trends and their root cause requires data that are rawer and fresher than what most BI teams require. Therefore, an elaborate data warehouse with long and cumbersome ETL, 24+ hour latency, and low uptime is an insufficient tool. RMP requires a close to real-time database that is almost a copy of production data but provides some additional capabilities (aggregation, point in time, tuning, and indexing for analysis, etc.).
Most analytics teams use a separate database for analysis. That database is, in its ideal form, a repository that receives data feeds from production systems and additional services and stores them in a structure optimized for analysis — the data warehouse (DWH). The most common data schema optimized for analysis is called the star schema. Production databases are optimized for performance and have very different properties; some aren’t even relational databases. A process of Extracting, Transforming, and Loading (ETL) the data into the data warehouse is required to make one into the other. Since it’s a resource-heavy process requiring a lot of database access, it is usually run in bulk once a day, when the service is at lower demand. That is why most data warehouses are up to 24 hours behind production databases.
It is often preferable to train your team to use a replicated version of your production database rather than build a highly processed data warehouse. Accordingly, transformation should be basic and bare-bones, as well as thoroughly documented. The time your team spends learning these tools will be repaid more than ten-fold by not having to build elaborate ETL processes and maintain a unique infrastructure for a DWH.
Control Groups
Special attention should be given to instrumentation and control groups that enable decision funnel analytics. Often, your decision funnel consists of several systems working linearly (sometimes even in parallel) to make decisions on applications: rules, models, manual review, etc. Control groups must be implemented across all of your detection systems so you can constantly test your decisions and identify gaps. These control groups should persist through the whole lifecycle — otherwise, the model’s control group could be rejected by rules and vice versa. Make sure you design instrumentation and control groups so you can attribute a decision to a specific mechanism and optimize their performance.
Best Practices for Ongoing Analysis
Every system has its idiosyncrasies, and with a large-enough portfolio, most of your analysis will focus on identifying corner cases stemming from interactions between seemingly unrelated product features and operationally driven spikes in costs (mistakes in settlement file parsing are an example). Still, using a few best practices will help you get to a solution earlier and allow you to act faster and with better accuracy.
Automated Segmentation and Tagging
Preserving knowledge from previous investigations is key for iteratively understanding and fixing your loss problems. Having solved a problem once doesn’t guarantee it won’t appear again in a few months, by virtue of seasonality or a new big merchant with a clunky operation. Maintain your ability to detect previously solved problems by developing scripts that automatically tag them, and then incrementally add to them as you expand your knowledge. This script suite will serve as the first diagnostic tool you can use on a misbehaving portfolio to single out already known problems and help you focus on the unknowns.
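Such a tagging suite can be nothing more than a list of named predicates run over a defaulted cohort; the two example tags below are invented, and the point is the structure, not the specific patterns.

```python
# Minimal sketch: tagging known loss patterns before manual investigation.
KNOWN_PATTERNS = {
    "seasonal_apparel_spike": lambda p: p["segment"] == "clothing" and p["purchase_month"] in (11, 12),
    "high_value_gift_cards": lambda p: p["product_type"] == "gift_card" and p["amount"] > 500,
}

def tag_purchase(purchase: dict) -> list:
    return [name for name, predicate in KNOWN_PATTERNS.items() if predicate(purchase)]

def tag_cohort(purchases: list) -> dict:
    # Share of each known pattern in the cohort; whatever stays untagged is
    # the "unknown" portion left for root cause analysis.
    tags = [tag_purchase(p) for p in purchases]
    return {name: sum(name in t for t in tags) / len(purchases) for name in KNOWN_PATTERNS}
```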
Root Cause Analysis
Once you have a defined subpopulation that needs to be examined further, case-by-case review by domain experts is the next step. The goal is to find the reason behind the loss using careful review in light of all events that transpired after an application was approved. This is where you must combine funnel analytics (knowing which mechanism did or would have acted on this purchase), strong review tools, and investigative capabilities. Talk to your customer (although most of these inquiries provide only half-truths), and follow up on disputes. Many times, customer-care contacts will help you identify integration and process errors. Backtrack everything that happened to the application to understand those.
Analyzing trends in order to find actionable insights is a science (mixed with a little bit of art) requiring deep domain expertise in customer behavior and also detailed understanding of specific systems and processes. Support your team with strong documentation and knowledge sharing in easily searchable databases, and you will create a highly effective investigation process that will properly inform the actions you take.
Action: Dealing with Your Findings
Once you know what needs to be done, using the right tools is the next step. There are several ways to make a change: from ops procedures, through decision systems, and all the way to product changes.
Decision and loss-reduction mechanisms vary by their flexibility, time to market, and impact. Flexibility and time to market improve as you move farther in the application lifecycle — further away from real-time decisions — and impact is reduced accordingly. The lack of flexibility in the front end shouldn’t be taken for granted, and a new model’s time to market can and should be in the three-month range rather than the 9–18 months that are most common in large organizations. Still, models vs. rules vs. manual decisions have different advantages and disadvantages and should be used accordingly.
Manual Review
Manually reviewing an application is the core activity of every RMP team. Much like you wouldn’t hire a developer who can’t write code, you wouldn’t want a domain expert that cannot make a decision when reviewing an application. Manual review is not only about making accurate decisions, it is also about knowing when the information you have is insufficient, identifying patterns, and developing a taste for what a mistake looks and feels like. Manual review helps you keep track of your system’s pulse and is the basis for more detailed root cause analysis, the most important activity in the problem-solving cycle.
Despite the downsides of a manual operation, discussed previously, the manual review team is the one providing you with the most flexibility for enforcing short-term changes to your decisions without any product changes. Your manual review team must be equipped with a strong application review tool as well as effective and flexible rules governing their backlog. The more flexible the rules are, the better your ability is to feed Ops and make a short-term change.
Manual review is triggered and utilized when a trend is identified by your detection mechanisms. Usually, those come from customer disputes or your linking/velocity mechanisms identifying an activity that you wish to investigate or stop. You need the ability to change your backlog-controlling rules (often referred to as backend rules because they run after the purchase has completed and a real-time decision was made) and feed those cases to your review staff so that they can manually stop applications. The process is as ineffective and limited in scope as it is flexible, but it serves as a first response.
You should also provide review staff with force multipliers, allowing them to make batch decisions on applications as well as translate their insights into broadly applicable actions and rules in your systems. The first step after identifying a new behavior is sending it to review via backend rules. Once a pattern and a response to it have been established, the team should write a rule or a set of rules to automatically handle the new trend. With time, after they get to know this trend and develop a more subtle understanding of how to detect it (best measured by hit rate: every early attempt has a low hit rate, but improves over time), detection of this trend can move to real time, and to the model. One of the key obstacles in this process is variables, or indicators, used as building blocks in the rules they write. This is where the Variable Library plays a central role.
The Variable Library
There is a constant gap between the ability to identify, build, and utilize new indicators and data sources in the analysis environment or Ops’ sandbox vs. real-time decision mechanisms. Variables get developed separately in two or even three environments, by different teams and with different tools, creating a barrier that prevents knowledge from trickling in either direction.
Teaching review staff to look at applications without presenting them with the indicators you use in other systems limits their effectiveness, since inferring why a specific application is rejected, approved, or queued becomes close to impossible. The same happens the other way around: Ops discover a new indicator that can greatly improve decisions and are able to build it in their own silo. Making that indicator/variable available to all services will benefit all decision mechanisms, but without a central variable repository, code gets duplicated and often incorporates bugs. In order to get the positive, compounded effect from models making broad real-time decisions — then rules adding trend detection, then review staff making specific high-impact decisions on new behaviors — you need to provide all with the same data and indicators. Otherwise, you may see different teams solving the same problems using their own tools.
The variable library, or variable service, works to mitigate that. While not a detection or decision mechanism, it is a crucial part of the infrastructure that underlies them. It is a directory of computed variables that makes the same set of variables available to all of your tools: models, rules, manual review, and analysis. Starting early and making this service easy to access and extend solves a major issue in detection improvements as well as many of the bugs inherent to complex model deployment processes. A service-oriented architecture that allows engineers to add new variables on a weekly basis will let you expose front-end variables to your back-end decisions, helping staff make better decisions, as well as allowing Ops to add new variables they discover while manually reviewing cases and quickly constructing new rules to respond to evolving trends. Those will also trickle into the models in a quicker and smoother fashion.
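In its smallest form, a variable library is just a single registry of named computations that every tool calls the same way; a production version would be a service with versioning, caching, and access control. The sketch below, including its two example variables, is purely illustrative.

```python
# Minimal sketch: one place to define variables, shared by models, rules,
# manual review tooling, and offline analysis.
VARIABLES = {}

def variable(name: str):
    def register(fn):
        VARIABLES[name] = fn
        return fn
    return register

@variable("ip_country_mismatch")
def ip_country_mismatch(application: dict) -> bool:
    return application["ip_country"] != application["billing_country"]

@variable("purchases_last_24h")
def purchases_last_24h(application: dict) -> int:
    return application["customer_history"]["purchase_count_24h"]

def compute(name: str, application: dict):
    return VARIABLES[name](application)
```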
The Review GUI
A lot of thought is given to data processing and decisions made by agents, but not a lot is given to the tools used to reach those decisions and how they support or complicate the investigation process. I discussed this in a 2010 article:
A lot of times it creeps up on you: volume picks up and so you know you need someone to look at orders. If you’re running a small shop it’s most probably going to be you, but a lot of companies just hire one or two folks. These people use whatever tool you have to look at transactions — most times a customer service tool — and make up their technique as they go. With time, and sometimes with chargebacks coming in, you realize that your few analysts can’t review all transactions, so you turn to set up a few rules to make queue and transaction hold decisions. Since your analysts are not technology people you resort to hard coding some logic based on a product manager’s refinement of the analysts’ thoughts, again based on a few (or many) cases they’ve already seen. Not a long while passes, and you realize that the analysts are caught in a cat and mouse game where they try to create a rule to stop the latest attack that found its way to the chargeback report, and put a lot of strain on the engineers who maintain the rule-set. Even after coding some simple rule writing interface the situation isn’t better since the abundance of rules creates unpredictable results, especially if you allowed the rules to actually make automated decisions and place restrictions on transactions and accounts.
Staff is expected to review applications at an ever-growing pace on a collection of interfaces not optimized for their use, either home-grown or based on customer-care systems. There is little variety in off-the-shelf solutions, and the commitment required to build an effective tool (integrating queue management, flow management, a review console pulling all data sources into one place, and so on) is hard to maintain after the first version; the MVP covers 60%–70% of required functionality, and added improvements get constantly down-prioritized against those with higher impact. Manual review and decisions are absolutely required for quality RMP practices, and present-day statistics tell us that manual review is still very common. Since most Ops staff use at least two systems to review cases and make decisions, it’s obvious that the Ops team needs a strong tool to allow them to do so. This will require at least one engineer constantly tweaking and improving your tools; designing a proper dashboard and review interface that follow agent workflows while enforcing subtle changes for efficiency is essential.
Main Considerations in GUI Design
There are a few key matters to remember when designing or integrating a review interface. Achieving high efficiency and accuracy in review requires human-centric design, compensating for the human factor’s shortcomings. People tire, suffer from decision bias and fatigue, and have a hard time assimilating data and using slow systems. A well-designed GUI takes all that into consideration and provides a work environment that’s mostly constrained by case quality rather than inadequate tools; it can support your team and give them the best possible environment for high performance.
Decision Fatigue
Making a decision every few minutes for hours on end is tiring, even with planned breaks. Analysis of previous mistakes and detailed KPIs, while driving better performance, contribute to fear of mistakes and decision bias. All of these together cause decision fatigue, usually reflected in agents deferring to others by starting group discussions about cases, calling the customer, and so on. Your review tool must support flow management that escalates these cases to an experienced staffer who will make a quicker, more efficient decision; this way, you let your team defer a decision (which is sometimes unavoidable and needed) but still get the case worked quickly. Layering expertise and difficulty levels allows new employees to deal with the bulk of the work you’re trying to simplify and automate, while your senior employees make high-impact decisions.
System Responsiveness
Review staff spend a lot of their time waiting for pages to load. While your page load times and system responsiveness don’t need to be at consumer product levels, the number of clicks to decision and wait time between pages need to be reasonable. Three to five clicks per decision and up to ten seconds of wait are not ideal, but allow page switching without excessive memorization — one of the leading reasons for wrong decisions. The best option is a single page.
Data Assimilation
Review staff also suffer from context switching. There is only so much copying and pasting you can do between your main screen and whitepages or social network sites that you may be using for your review without making errors, and comparing details becomes a tedious job. Make sure that your interface prepopulates as much information as possible from external sources on a single page and that those data are organized in a way that complements your review method. Use color coding and imagery to highlight important details or ones that require more investigation.
The Rules Engine
The rules engine is practically where “it” all should happen. In basic or early implementations of RMP systems, a “rule” referred to a piece of hard-coded logic describing a specific behavior or trigger (More than three purchases today? IP country doesn’t match billing address?) and queueing applications for review (or rejecting them up front). A proper rules engine, however, is an interface (whether graphical or not) that allows nondevelopers to draw data from various sources (external, complete models, or your variable service) and compose statements in a syntax that allows sophisticated arguments — regular expressions, string manipulation, and some flow control commands such as IF statements and FOR loops.
Basic Functionality Requirements
The rules engine should have at least decision tree functionality — allowing you to segment an incoming population and set different reject and queue thresholds for different groups of applications — as well as the ability to quickly write and deploy simple rules that will respond to an evolving trend. While functionality should be the same across the application lifecycle, the rules engine should be able to connect to and provide different permission levels and controls for real-time and async rule sets, as both are important but mistakes will be significantly costlier in the front end.
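One way to picture this is rules expressed as data: each rule names a segment, a condition built from library variables, and an action, and the engine evaluates them in order. The sketch below reuses the `compute` function from the variable library sketch; the segments, thresholds, and actions are invented for illustration.

```python
# Minimal sketch: rules as data, evaluated in order, giving segmentation plus
# quick-to-deploy trend rules without touching engine code.
RULES = [
    {"name": "high_value_new_customer",
     "segment": lambda app, v: app["customer_age_days"] < 30,
     "condition": lambda app, v: app["amount"] > 1000 and v("ip_country_mismatch", app),
     "action": "queue_for_review"},
    {"name": "purchase_velocity_spike",
     "segment": lambda app, v: True,
     "condition": lambda app, v: v("purchases_last_24h", app) > 5,
     "action": "reject"},
]

def evaluate(application: dict, compute) -> str:
    for rule in RULES:
        if rule["segment"](application, compute) and rule["condition"](application, compute):
            return rule["action"]
    return "approve"
```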
Performance Simulation
In order to effectively deliver quick value through changes in rule-based decisions, you need to see a what-if simulation — showing the performance of that rule on its own as well as its incremental benefit to the overall rule set. You need to measure rule performance, identify ones that will not contribute to your optimization goal, and retire those that are not helpful anymore. Poor rule-set management results in code spaghetti (this is especially true for hard-coded rules that cannot be easily changed). Extensive performance simulation before you let a rule impact your application flow prevents suboptimization. It is possible for new rules to target a bad population that is only slightly incremental to the current rule set, but introduce a large set of false positives, thus reducing overall performance. Simulation and validation (checking for syntax errors and possibly logic errors) are vital, especially for short-lived rules.
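A what-if simulation can be run as a backtest over labeled historical applications: measure the candidate rule on its own, then measure only what it adds on top of the existing rule set. The sketch below assumes each historical application carries an is_bad outcome flag and that rules are callables returning True when they would fire; both are assumptions for illustration.

```python
# Minimal sketch: stand-alone and incremental performance of a candidate rule.
def simulate_rule(candidate, existing_rules, history: list) -> dict:
    caught = [app for app in history if candidate(app)]
    already_caught = [app for app in history if any(rule(app) for rule in existing_rules)]
    incremental = [app for app in caught if app not in already_caught]

    bad_total = sum(app["is_bad"] for app in history) or 1
    return {
        "hit_rate": sum(app["is_bad"] for app in caught) / (len(caught) or 1),
        "recall": sum(app["is_bad"] for app in caught) / bad_total,
        "incremental_bad_caught": sum(app["is_bad"] for app in incremental),
        "incremental_false_positives": sum(not app["is_bad"] for app in incremental),
    }
```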
Performance Monitoring
Funnel analytics should be mentioned once again. Being able to measure rules as a set and individually for performance tweaking and retirement is a key component in managing your decision funnel. Implementing and instrumenting the rules engine’s actions must be planned in advance, as well as the data infrastructure supporting both its real-time actions and reporting needs.
Automated Decision Models
For simplification purposes, the word modeling is used in this book to describe the construction of any type of automatic decision. Regression, classification, clustering, and other techniques aren’t discussed separately. For all purposes, modeling is a process that consumes a set of indicators, or “features,” and turns them into a score for a given action. The score tells you how “bad” the purchase is (the definition of bad can change), or how much it fits a specific profile you’re trying to detect. A threshold score is then determined for each score range; any action getting a score equal to or higher than the threshold will be let through, and those below it will be stopped or reviewed manually. As the threshold becomes lower, you can expect more approved actions and more losses. Finding the optimal threshold for your business is therefore an important decision.
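Threshold selection itself can be framed as a small optimization over a labeled validation set: sweep candidate cutoffs and pick the one with the best expected value. The sketch below assumes a higher score means lower risk, that approved good purchases earn a fixed margin, and that approved bad ones lose their full amount; all three are simplifying assumptions.

```python
# Minimal sketch: choosing a score threshold by expected value on labeled data.
def best_threshold(scored: list, margin: float = 0.05) -> float:
    candidates = sorted({p["score"] for p in scored})

    def expected_profit(threshold: float) -> float:
        approved = [p for p in scored if p["score"] >= threshold]
        return sum(-p["amount"] if p["is_bad"] else p["amount"] * margin for p in approved)

    return max(candidates, key=expected_profit)
```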
Building a model is a complicated, detail-oriented task. You will start by using statistical models roughly 6–12 months into your company’s life, depending on growth trajectory, and they will become a very important tool in your decision-making process, as they are the best for making automated decisions at scale. Using models depends not only on the number and total amount of purchases that go through your system, but also the diversity of your population and the number of bad purchases you see. If you acquire one type of customer through one channel and sell one type of product (say you have a t-shirt printing business selling wholesale to small brick-and-mortar businesses at festivals), you’ll need a smaller sample. As complexity grows, so do the requirements of your data. I’d like to touch upon several pitfalls and issues to remember when building models for RMP.
What Are You Predicting?
Training set construction and feature engineering are much more effective when the predicted class or performance flag is clearly defined. “Class” and “performance flag” are two names for the thing the model is trying to predict. Depending on the type of problem and type of algorithm, you could try to predict anything from the probability of default on a certain purchase to whether an individual IP belongs to a government agency. I use several classes, corresponding to the main archetypes of behavior (fraud, default, abuse, and technical errors), that are predicted separately and then combined into a single decision flow. With a small sample, splitting by different classes will almost guarantee model overfitting. When choosing a single flag (usually loss/not loss), other problems arise: although you may make the same decision for two very similar purchases, they could randomly get very different treatment from the Collections team, an effect that is almost impossible to separate. Starting from a flag indicating, per purchase, whether it has or hasn’t caused loss is a first step; then, separately predicting specific behaviors that are easy to identify directly is the way to go from a general performance flag to multiple ones.
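In code terms, the progression is from a single loss flag to a handful of behavior labels derived from post-decision outcomes. The class definitions below are invented examples of behaviors that are easy to identify directly, not a prescribed taxonomy.

```python
# Minimal sketch: a general loss flag plus behavior-specific labels per purchase.
def label_purchase(purchase: dict) -> dict:
    loss = purchase.get("written_off_amount", 0) > 0
    return {
        "loss": loss,
        "fraud": loss and purchase.get("chargeback_reason") == "unauthorized",
        "credit_default": loss and purchase.get("chargeback_reason") is None,
        "abuse": purchase.get("refund_count", 0) >= 3,   # e.g., serial refund requests
    }
```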
Which Algorithms Should You Use?
With the rise in popularity of data science, there sometimes seems to be pressure to use sophisticated techniques and algorithms regardless of their actual business impact. The fact is that, more often than not, the simplest tools are the ones with the best impact as well as the most easily interpreted results. Some highly touted algorithms are more accurate but require much more computing power, hence limiting scale or the amount of data you can use in real time; some are black boxes and thus harder to tune and improve, especially for smaller data sets. Layering regression models that target different behaviors and proper feature engineering that captures interaction will create a much stronger prediction system. Especially in RMP, a practice assuming the existence of an adversary, being able to interpret loss events and tune your system is critical. You must be able to tie a loss event to a root cause through all decisions taken on an application in your system.
Model Time to Market (TTM)
TTM measures the time it takes to launch a model, from initial analysis to 100% live performance. Short time to market matters because it means you quickly respond to changes in your customer population. However, most RMP teams aren’t properly set up for short TTM. Usually, the analytics team is separate from the engineering team and provides the latter with specs for developed models in a waterfall development process. As a result, models go through several reimplementation and compilation processes (first at analytics, then in engineering), causing bugs and delays and demonstrating the lack of a shared language and tools between analysts and engineers. It is not uncommon for new models to go through a few months of tuning while the analytics team detects bugs and the engineering team fixes them. The Variable Service mentioned earlier is one possible solution to the problem, since it serves the same variables/features in development and production. The Rules Engine provides the ability to write and deploy segmentation and flow logic easily, thus allowing code reuse instead of reimplementation from scratch with every version. You should build your RMP service to use these components wisely and reduce model TTM significantly.
TTM is so impactful that if you can guarantee TTM shorter than a month, you can relax the controls preventing over-fitting. Over-fitting is a situation that may occur for various reasons in which a model predicts a random phenomenon represented in the data instead of an actual relationship. When samples are too small, random events of even small magnitude seem much more important than they are in reality, therefore skewing the model’s performance. In plain English, the model “thinks” that this phenomenon is very common and therefore doesn’t “give enough attention” to other, more important ones, thus not learning how to predict them effectively. As time passes, even very well-built models’ performance degrades and decisions become less accurate as behaviors shift. That degradation slows down significantly when you start to identify archetypes of customer behaviors that don’t significantly change, but until then, performance degradation can be so steep that shorter model deployment cycles provide much greater value than elaborate and complex feature engineering and model tuning. Since your data sets are small, you’re guaranteed to constantly find new behaviors that just didn’t appear a month or two ago and that the model did not train on. If your model refreshes every month, it may over-represent behaviors that were observed in the previous month, but since those constantly change, that’s not a big problem. Of course, once you reach a standard set of features and behaviors you’re targeting, or a large enough data set, this stops being true. For many teams that I’ve seen, though, a standard set of features and behaviors is a stretch goal even after years of operation.
The Feedback Loop
You will get feedback from losses, but false positives and some false negatives will systematically not be detected. Most of your rejected customers will not try again, leaving you to think that they were rightfully rejected, and surprisingly, not all customers impacted by fraud will complain. As discussed previously regarding domain experts and manual feedback, you must sample applications and have domain experts manually review them. Without this crucial step, you will always be limited to effects that have significant representation in your existing data set, since these are the only ones the model will learn from. As a result, expanding your customer base will be difficult and require large-scale controlled experiments where you allow previously rejected customers through. An automated system cannot make “leaps of faith” or infer correctly whether a very small sample of a newly detected behavior is good or bad. As a result, if your system is automated and you want to expand into a new industry segment, a country, or maybe reject fewer applications, your best option is to randomly approve previously rejected applications and wait for losses to come in so you can learn from them. While this is possible, it is a slow process that usually requires high “tuition” costs, paid for in losses. Domain-expert-based control will allow you to reach conclusions faster and often more accurately, as they are expected to correctly generalize from small samples and come up with features that will allow accurate detection.
Product and Experience Modifications
RMP teams focus on real-time and post-approval detection and prevention of risk and loss. Loss can also be managed and reduced by changes to customer experience. Specific experiences can be used to handle heterogeneous groups of applications that contain both customers you’d like to reject and ones you’d like to approve, but that are indistinguishable given the information all customers regularly provide. That’s when you throw a question or additional step at them and judge by their response.
In-Flow Challenges
In-flow challenges are a “nicer” way of getting customers to jump through a few extra hoops before getting approved. Some are designed to respond to a specific attack vector, asking the fraudster to do something that most probably only the real person can do. Another option is challenges that put the customer in a specific mindset before applying for a loan or making a purchase, making them more aware of the commitment they are making.
An example of the former is KBA, knowledge-based authentication, used when signing up for package-tracking services online. Identity theft is common in the US, and consumers’ identities can be used to reroute packages to fraudster drop points or re-shippers, who are often innocent people working from home, unaware that they are aiding an act of fraud. KBA asks you a few questions, based on your credit report, that the average person will not be able to answer without extensive research: a past spouse, historical addresses, and so on. While definitely not foolproof or fraud-proof, it reduces the chances of simple identity fraud.
An example of the latter is an alert prompting the consumer to rethink an action before submitting it. While usually considered a conversion killer, when very specifically targeted it can pinpoint problematic customers. Several online retailers started using this kind of alert for impulse buyers who use their websites while drunk on weekends. Though a lot of them paid, this proved to be a highly remorseful crowd who often returned items, and some retailers chose to discourage them rather than deal with restocking and chargebacks.
User Experience Changes
A lot of losses can be prevented in advance by creating a more accommodating user experience that takes specific customer needs into consideration. This is an especially effective way of dealing with merchant-driven losses. Merchant risk management is different from most consumer risk efforts: it is a long-term process that deals with different risks, insolvency and operational issues being common among them. As a result, merchant risk requires a lot of interaction with, and information from, the merchant. Most of these operations still use printed, faxed, and scanned documents and are unsuccessful in getting merchants to cooperate freely and provide timely information. What they end up doing is placing limitations on merchants at even the smallest deviation from a generally acceptable baseline.
The most common risk-prevention mechanism for commercial credit is a reserve: holding a certain amount, often a few days’ worth of payments or a certain percentage of turnover, as a hedge against possible losses. Reserves are used by both offline and online service providers. Merchants are not fond of reserves; they severely impact cash flow, and in at least some cases lead to small-merchant insolvency. A more sophisticated alternative is merchant lifecycle management: prompting merchants to provide you more information at strategic points, such as after the first purchase, before the first payout, or when they start experiencing hypergrowth, when they are motivated to cooperate in order to get their business going. That way, instead of slapping on a one-size-fits-all reserve that never changes and strains all merchants at all times, you can respond to higher risk levels when warranted. Smart user experience design is required to provide this kind of smooth lifecycle management. When properly done, it is much more effective than reserves and creates goodwill with merchants as you help them grow their business.
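For concreteness, here is a minimal sketch of the rolling-reserve arithmetic described above. The rate and holding period are assumptions for illustration, not industry standards.

```python
# Hold back a percentage of each day's processed volume and release it after a
# fixed delay, so the amount held tracks recent turnover.
from collections import deque

RESERVE_RATE = 0.10   # hold 10% of each day's volume (illustrative)
HOLD_DAYS = 90        # release held funds after 90 days (illustrative)

def run_reserve(daily_volumes):
    """Yield (payout, total_held) per day for a stream of daily processing volumes."""
    held = deque()
    total_held = 0.0
    for day, volume in enumerate(daily_volumes):
        hold = volume * RESERVE_RATE
        held.append((day, hold))
        total_held += hold
        # Release holds that have aged past the holding period.
        while held and held[0][0] <= day - HOLD_DAYS:
            _, released = held.popleft()
            total_held -= released
        yield volume - hold, total_held

for payout, reserve in run_reserve([1000.0] * 5):
    print(round(payout, 2), round(reserve, 2))
```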
Proactive Risk Management
The last option I mentioned is proactively promoting customer safety by making customers change a password or add more defense mechanisms (such as additional secret codes). This should happen when you discover a real or potential breach in your system or a related service; it is the equivalent of a security patch in software. The evolution of credit card security codes demonstrates this: when fraudsters started collecting card details from the front of the card on in-store charge slips, issuers added a three-digit number to the back of the card. When that, too, started being collected by fraudsters, they added another secret: a code on your statement, 3D Secure.
Some websites proactively trigger mass password resets when they discover a breach; it happens every once in a while and has proven useful multiple times. In January 2013, Twitter did exactly that, resetting roughly 250,000 account passwords, possibly due to a third-party app being hacked.
When Things Go Wrong: Dispute Resolution
Many RMP teams focus on detection and prevention as close to real time as possible. That is a reasonable effort: the closer to in-flow you make a correct decision, the better the return on invested time, and dealing with losses after they occur costs more time and money than stopping them from happening. That focus, however, sometimes obscures the fact that (a) losses happen anyway and (b) there is much to be recovered by properly handling disputes when they happen. I’ve touched on this subject earlier in this book: a large chunk of losses are actually misunderstandings. When properly handled, some of those can be prevented from turning into chargebacks, and even those that do can be disputed and won. There are two major things to think about when designing and operating a dispute process for small and medium companies: experience design and back office efficiency.
User Experience Design
UX in dispute resolution encompasses all the emails, text messages, web pages, and phone call scripts a customer interacts with. These have a profound impact on whether you’ll be able to reduce losses as well as provide the customer with a brand-supporting experience that will bring them back. Your first and foremost goal is to establish credibility with the customer, so that they settle their dispute through you rather than through a third party and feel they’ve been treated fairly even if you have decided against them. Working with third parties (an issuing bank or a mediator) is a cumbersome and painful process for both them and you. If you establish credibility by allowing customers to submit a dispute and handling it fairly, communicating early and often and sharing progress whenever possible, you will be able to handle most disputes in-house and, if nothing else, reduce the process-related fees you’d incur from dealing with external parties.
The second design goal is reminding customers that they are actually good customers and that a fair settlement is in the best interest of all parties involved. It is never fun to be the victim of identity fraud, but a good number of “fraud” victims actually gave explicit or implied permission to a family member (a child or spouse) to use their card, or made a risky business decision that backfired. When reminded of this, or confronted with purchasing and usage behavior in a consumer’s case, most of them acknowledge that this was the case. Some consumers try to dodge payments by denying purchases they obviously made; the prospect of being denied future service causes many of them to square up rather than be cut off. Finally, quite a lot of customers do want to pay but run into temporary financial problems. Giving them the ability to negotiate the timing and amounts they will pay to settle their debt will allow more of them to pay you in full.
Back Office Efficiency
Dispute resolution is a manual process that yields little when handled incorrectly, so many companies do not spend energy on it. Businesses with tight margins must care about dispute resolution, and specifically about making it streamlined. Chargeback challenge, proving to the bank that a certain chargeback has no merit, is a process you must at least consider. Like any other operational process, it demands attention to detail; responses must match the chargeback reason code you received and include the correct evidence. If you follow the strict (and sometimes confusing) guidelines, you’ll improve your ability to recoup losses, and some analysis of the process plus simple automation will boost your recovery significantly. While chargeback management is a highly manual and detail-oriented business, the potential for higher recovery on your losses is a direct contribution to your bottom line.
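As a sketch of the kind of simple automation meant here, the snippet below maps each chargeback reason to the evidence it requires before a response is submitted. The reason codes and evidence names are hypothetical placeholders, not actual card network codes.

```python
# Bundle the evidence each (hypothetical) reason code requires, and fall back
# to manual review when the required documents aren't available.
EVIDENCE_BY_REASON = {
    "fraud_card_not_present": ["avs_match", "device_history", "delivery_confirmation"],
    "item_not_received":      ["tracking_number", "delivery_confirmation"],
    "duplicate_processing":   ["transaction_log", "single_settlement_record"],
}

def build_response(reason_code: str, available_docs: dict) -> dict:
    required = EVIDENCE_BY_REASON.get(reason_code)
    if required is None:
        return {"action": "manual_review", "reason": reason_code}
    missing = [doc for doc in required if doc not in available_docs]
    if missing:
        return {"action": "manual_review", "missing": missing}
    return {"action": "submit", "evidence": {doc: available_docs[doc] for doc in required}}
```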
Setting Up Your Team and Tools
Now that you have a good sense of what you’re trying to solve, and of how to measure, detect, and improve performance, how do you start getting it done? Should you build everything yourself? What’s already out there, and how much is it going to cost you? How do you make a buy vs. build decision, and if you don’t build, how do you make sure you’re not completely dependent on your providers?
Buy vs. Build
Buy vs. build depends on what is or isn’t core competence for your company, the stage you’re at, and the availability of data and resources.
The need for core competence is different for retailers vs. other participants in the payments ecosystem and is tied to business model and margins. If you act as a non-risk-taking agent for financial services providers or have zero cost of goods sold (like most virtual goods and gaming companies), you will be less sensitive to losses and have a higher margin to pay for tools and products. Lifecycle stage is another consideration: companies in hypergrowth should generally pay a third party and invest engineering and hustling efforts in whatever contributes directly to the top line. On the other hand, if your margins are tight, you’re operating a mature business, or risk is a core feature of your product (short-term lending falls in this category), you’ll want to keep at least some of the activity and capabilities in-house.
The availability of engineering resources is an obvious constraint, since you’ll want to invest effort in the work that contributes most to your growth. That is, as discussed previously, the main reason for the destructive cycle that keeps RMP product work constantly underfunded: the need to invest ahead of time is unclear, major loss events are the main driver to action, and time constraints force patchy solutions. When this starts hitting you, buying a third-party solution makes more sense. Data availability is slightly different: if you have large historical datasets and/or access to data no one else has (e.g., you’re working for Facebook, LinkedIn, PayPal, Amazon, etc.), you are better positioned to develop proprietary in-house RMP tools. Others who are bootstrapping their database or need standardized data (e.g., credit scores) will pay a lot of money to gain access to it.
Which Vendors Should You Look For?
No matter your engineering team size or data availability, there is always something you’ll need to buy. What’s available, and where should you look? The following is in no way an exhaustive or up-to-date list, but rather a few points to think about when you start shopping.
Should You Buy an Off-the-Shelf Risk System?
Unless you’re a huge and profitable payments company or retailer that needs a simplified, interactive tool for legions of operators with little technical training, stay away from detection platforms with fancy GUIs and built-in detection models. Outsourcing your risk decisions isn’t necessarily a bad idea, as discussed before, but these systems specifically have multiple downsides. First, they are expensive: integrating a system that costs six to seven figures is far from a good investment for most businesses. Second, such systems seldom integrate at multiple touch points; integration is limited to front-end detection, missing out on additional data from back-end decisions and disputes, and even when that functionality exists, integration time prohibits a full integration. The resulting lack of a full feedback loop for front-end models produces suboptimal decisions. Finally, since these companies do not provide any guarantee, your financial interests are not fully aligned, and you are left dealing with wrong decisions.
If RMP is not core to your business, seek a company that is easy to integrate with, delivers decisions rather than recommendations, and takes on your loss liability. That is the best buying ROI decision. The one obvious exception to this “only buy decisions” rule is when you’re breaking into a new market or segment and a provider has enough historical information to make the high cost worthwhile while you enter it. Credit scores in a new country you’re expanding into are an example: credit bureaus in many countries may charge several dollars per hit but will provide a lot of helpful data and scoring. That is worth paying for as you expand and, while expensive, is in no way at the same price level as a full suite of tools.
Detection Vendors and Social Data
Some companies sell detection services, identifying specific behaviors that are either hard to detect or require industry data to detect effectively. One of the most common is returning-user detection and “device fingerprinting”: telling you whether a user visiting your website has already visited with different details or has been flagged as bad on a different website. Others sell blacklists of stolen and compromised accounts and consumer details to validate against. Most of these are pretty expensive and make sense only if you’re price insensitive. Blacklists and device IDs can be built in-house with 60%–70% accuracy in a few weeks, and given the integration time and complexity required by most vendors, their main advantage is providing industry-wide monitoring based on their customer base. If you’re selling virtual goods and are under constant attack, you’re in their sweet spot and will see a lot of benefit even at this price level. If you’re selling candy and have kids using their parents’ identities, you won’t see as much benefit. As usual, understanding your problems is key to solving them.
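To give a sense of what the in-house version looks like, here is a minimal sketch of a crude device fingerprint and blacklist check, assuming you collect a few browser and network attributes at checkout. Vendors use far more signals and cross-merchant data; this only illustrates the rough idea.

```python
# Hash a handful of session attributes into a fingerprint and check it against
# fingerprints seen in confirmed-fraud sessions.
import hashlib

blacklisted_fingerprints = set()   # populated from confirmed-fraud sessions

def device_fingerprint(user_agent: str, accept_language: str,
                       screen_resolution: str, timezone_offset: int) -> str:
    raw = "|".join([user_agent, accept_language, screen_resolution, str(timezone_offset)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def seen_as_bad(user_agent: str, accept_language: str,
                screen_resolution: str, timezone_offset: int) -> bool:
    fp = device_fingerprint(user_agent, accept_language, screen_resolution, timezone_offset)
    return fp in blacklisted_fingerprints
```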
A few providers in Europe and the US offer identity validation and social network data enrichment — giving you additional information about an individual based on email, and sometimes name and address. Most of them are shipping and marketing companies that found a way to aggregate and resell the data they collect. While you should first make sure you’re protecting yourself from privacy policy violations when using these vendors, they provide interesting data allowing you to eliminate fake identities as well as learn more about your customers. The main problem, however, is coverage; different providers have different datasets that are often incomplete and only occasionally overlap, requiring you to integrate with all of them and spend effort on piecing the puzzle together in your database. Instead, use an aggregator of social and identity data that can give you slightly or heavily processed data that you can use in your decision process.
Must-Have Tools and Data Sources
There are tools and data sources I always use, because their ROI is high enough to justify them in almost any case. AVS and other address-to-card validation sources are a no-brainer and usually come as an add-on from your payment gateway. IP geolocation and network type are also extremely cheap compared to other data sources and will help you detect proxies and separate suspicious connections from safe ones rather easily. Email provider type, usually provided by the same companies, can help you flag free but otherwise unknown email domains as well as blacklisted ones. Google Maps’ (or other providers’) address-type and geolocation APIs are a helpful way to see where a package is being sent and to spot suspicious addresses.
Some incredibly cheap databases will tell you whether the phone number you got from the user is a mobile, VoIP, or fixed line. All of these are good, rather easy-to-integrate tools that you should look into when you start building internal capabilities, but they won’t provide final, liability-shifting decisions. Still, most of these sources can be replaced with internal data once your database is large enough.
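A minimal sketch of how these cheap signals might be combined into a crude flag count is below. All field names, values, and the domain list are assumptions for illustration, not a recommended scorecard or any vendor’s actual API.

```python
# Turn a handful of inexpensive lookups into flags you can count, review, or
# feed into a model as features.
FREE_EMAIL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com"}

def cheap_signal_flags(order: dict) -> list:
    flags = []
    if order.get("avs_result") != "full_match":
        flags.append("avs_mismatch")
    if order.get("ip_country") != order.get("billing_country"):
        flags.append("ip_country_mismatch")
    if order.get("ip_network_type") == "anonymous_proxy":
        flags.append("proxy")
    if order.get("email", "").split("@")[-1].lower() in FREE_EMAIL_DOMAINS:
        flags.append("free_email")
    if order.get("phone_line_type") == "voip":
        flags.append("voip_phone")
    return flags

print(cheap_signal_flags({
    "avs_result": "zip_only", "ip_country": "RO", "billing_country": "US",
    "ip_network_type": "anonymous_proxy", "email": "someone@gmail.com",
    "phone_line_type": "voip",
}))
```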
Don’t Forget Domain Expertise
Outsourcing decisions and using external data sources is often a good idea, but it doesn’t mean you should stop growing and nurturing internal domain expertise. Your internal team should be much more than an operator of a black-box rule system. Even when outsourcing parts of your process, you must have analytics and manual review in place to track your vendors’ performance, review selected samples by hand, and examine false positives in multiple segments. The operational and analytical parts of your RMP function may be smaller, but they shouldn’t disappear, or you’ll be at the mercy of your vendor. Make sure that even if you decide to completely hand off RMP, at least one person has it in their job description to check what kind of value for money you’re getting. It will save you a lot.
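A minimal sketch of that vendor oversight is below; the record structure is hypothetical. The idea is simply to track how often the vendor’s decision disagrees with developed outcomes and with your experts’ review of a sample, so the value-for-money question has numbers behind it.

```python
# Compare vendor decisions against developed outcomes and expert-reviewed samples.
def vendor_report(records: list) -> dict:
    """Each record (hypothetical fields):
       {"vendor_decision": "approve" | "reject",
        "outcome": "good" | "bad" | None,        # None = cohort not yet developed
        "expert_label": "good" | "bad" | None}   # None = not sampled for review"""
    developed = [r for r in records if r["outcome"] is not None]
    approved_bad = sum(1 for r in developed
                       if r["vendor_decision"] == "approve" and r["outcome"] == "bad")
    sampled = [r for r in records if r["expert_label"] is not None]
    rejected_good = sum(1 for r in sampled
                        if r["vendor_decision"] == "reject" and r["expert_label"] == "good")
    return {
        "approved_and_went_bad_rate": approved_bad / max(len(developed), 1),
        "rejected_but_expert_says_good_rate": rejected_good / max(len(sampled), 1),
    }
```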