Selection Bias in Costly Data Collection

Actuaries and underwriters are always looking for ways to evaluate the actual loss potential of the risks an insurance company is writing. This evolution in risk evaluation is a basic component of the competitive insurance marketplace. If I can develop a way to segment risks more accurately, then I have a competitive advantage over those who cannot see the marketplace as clearly. Data mining is being used to find the next rating variable that has not yet been explored. The data behind this increasingly complex underwriting is costly to develop and sometimes costly to collect. Several new developments in personal lines rating have shifted some of the economic burden of providing data to the insured.

Two such examples are the advent of pay-as-you-drive programs in personal auto and wind loss mitigation programs in property. In both cases, the insured is asked to shoulder the economic burden of providing information in exchange for potentially large savings in insurance premiums. In pay-as-you-drive programs, the insured is required to install a monitoring device in their car that transmits information on driving behavior to the insurance company. Wind loss mitigation programs are focused on recognizing the decreased loss potential associated with certain building features that increase resistance to wind damage (shutters, roof-to-wall connections, opening protection, etc.). The economic burden in pay-as-you-drive programs is the loss of personal privacy required of the insured. The cost in wind loss mitigation programs is more direct: the building inspections required to verify the building features can be quite costly.

These costs are important because they introduce another buying decision point in the underwriting process. “What type of car do you drive?” “How old is your home?” “How many stories?” These questions are expected in the insurance purchasing process and are virtually free to provide and collect. When the insured is asked to take on a large economic cost in order to provide information, an additional (and sometimes unexpected) buying decision occurs. Because of this additional buying decision, the process of collecting the additional information can be subject to selection bias. What is selection bias, you might ask? Two examples should make the term easy to explain.

Let’s say that I approached a group of graduating college students. Assume this group is a representative mixture of the population of graduating seniors (C-average students, art majors, engineers, etc.). If I propose to pay any member of the group $100 for answering one difficult math question (or art question, or engineering question), all members of the group would be expected to participate, as the reward is far greater than the cost (a minute of time). At the end of the experiment, I would pay those who answered the question correctly, but I would also have a very good idea of which members of the group were math majors (or art majors, or engineering majors). I clearly lost money in this proposition, but I gained information.

On the other hand, let’s assume that I approached the same group of students with a different proposal: I will pay any member of the group $400 for answering the one difficult math question, but entry into the contest costs $150. In this test, I will probably not get the entire group to participate, as individuals with no math background would choose not to pay $150 for what they view as a slim chance at $400. However, I may well get the same number of math majors to participate (although we aren’t the most risk-loving of souls).

The results of the two tests may well identify the same individuals as the math majors. However, in the second test, math majors will be overrepresented among the participants. If one assumed the results of the second test were an unbiased sample, one would draw an incorrect conclusion about the percentage of the overall population that are math majors.
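The two contests can be sketched as a small simulation. All of the numbers below (the share of math majors, the answer probabilities, the group size) are illustrative assumptions, not figures from the article:

```python
import random

random.seed(42)

# Hypothetical population: 1,000 graduating seniors, 10% of whom are
# math majors. Majors answer the hard question correctly 90% of the
# time; everyone else, 5% of the time. (Illustrative assumptions.)
N, P_MAJOR = 1000, 0.10
P_CORRECT = {True: 0.90, False: 0.05}

students = [random.random() < P_MAJOR for _ in range(N)]  # True = math major

# Test 1: free entry, $100 prize -- everyone plays.
free_entrants = students

# Test 2: $150 entry fee, $400 prize -- a rational student enters only
# if expected winnings exceed the fee (0.90 * 400 > 150 for majors,
# 0.05 * 400 < 150 for everyone else).
paid_entrants = [is_major for is_major in students
                 if P_CORRECT[is_major] * 400 > 150]

frac_free = sum(free_entrants) / len(free_entrants)
frac_paid = sum(paid_entrants) / len(paid_entrants)
print(f"Math majors among free entrants: {frac_free:.0%}")  # roughly 10%
print(f"Math majors among paid entrants: {frac_paid:.0%}")  # 100%
```

Both contests identify the math majors, but only the first yields an unbiased estimate of their share of the population.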

This same reasoning applies to costly data items. Individuals who understand that they are high-usage (or just bad) drivers will not allow more specific information about their driving habits to be sent to the insurance company. These individuals have no reasonable expectation that the cost they bear will be rewarded. Similarly, homeowners will likely not pay to have an inspection performed unless they are reasonably sure they can expect substantial insurance savings.

The response to this selection bias problem in these two instances is a study in contrasts. I recently saw one major auto carrier introduce a premium credit for its pay-as-you-drive program: every insured who agreed to the program received a discount. The economic cost to the policyholder was reduced, if not shifted entirely to the insurance company. As a result, this company can expect an unbiased sample of the driving habits of its insured population, and groups of high-risk and low-risk drivers will naturally segregate over time.

On the other hand, the Florida property market (where a large majority of the wind loss mitigation rating issues have arisen) is already a distressed market. Companies have been unwilling to bear the cost of the inspection process internally, as most of the market is still struggling with other major cost issues, and public programs have not provided large subsidies to insureds for these inspection costs. As a result, the economic costs continue to be borne by the insured, and a selection bias can be expected. The inspected population will skew toward heavily mitigated properties, while the non-inspected risks will contain a disproportionate share of non-mitigated properties. The pricing of the non-inspected properties would not naturally move to their expected cost; pricing deficiencies would first have to show themselves, and direct pricing action would have to be taken to adjust these risks to their appropriate rates.

An additional concern for this product is that a large majority of the pricing is done using catastrophe modeling. These models rely solely on the data they are given to estimate loss potential. In the situation above, the actual mitigation features on inspected properties would be used in modeling where available. However, where information is unknown, it is customary for the models to assume risk potential based on the average building stock characteristics of an area and era of construction (predominant building code requirements, etc.). Because self-selection makes the non-inspected group worse than the area average the model assumes, estimates generated in this manner would bias the non-inspected population’s loss potential low. This problem would only be resolved through direct pricing action or through re-calibration of the models for this effect.
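The direction of this bias can be shown with a stylized numbers example. All figures below (loss costs, the area mitigation share, inspection rates) are invented for illustration, not drawn from any model:

```python
# Stylized example of the modeling bias described above (all numbers
# are invented for illustration).
# Mitigated homes: expected wind loss 100; non-mitigated homes: 300.
# The area's building stock is 40% mitigated overall.
LOSS_MITIGATED, LOSS_NON = 100.0, 300.0
AREA_MITIGATED_SHARE = 0.40

# Self-selection: 90% of mitigated owners pay for an inspection, but
# only 10% of non-mitigated owners do, leaving a non-inspected pool
# skewed toward non-mitigated homes.
unins_mit = AREA_MITIGATED_SHARE * 0.10          # non-inspected, mitigated
unins_non = (1 - AREA_MITIGATED_SHARE) * 0.90    # non-inspected, non-mitigated

# True expected loss for the non-inspected group:
true_unins_loss = ((unins_mit * LOSS_MITIGATED + unins_non * LOSS_NON)
                   / (unins_mit + unins_non))

# A model that falls back on the area-average building stock instead
# assigns the non-inspected group the area-wide blend:
model_unins_loss = (AREA_MITIGATED_SHARE * LOSS_MITIGATED
                    + (1 - AREA_MITIGATED_SHARE) * LOSS_NON)

print(f"True expected loss, non-inspected group:  {true_unins_loss:.0f}")  # 286
print(f"Model-assumed loss, non-inspected group: {model_unins_loss:.0f}")  # 220
```

Because the good risks self-select into inspection, the non-inspected pool is worse than the area average the model assumes, so its modeled loss potential (220 here) understates the true figure (about 286).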

These data collection issues are not isolated to the insurance industry; they are present in many other disciplines. Similar concerns arise in other social sciences and in medical trials. As our world becomes increasingly data driven, these biases and systematic data collection gaps present additional hurdles to understanding issues clearly.

What are your thoughts on this phenomenon?  What steps could be taken to measure this bias and what potential corrections could be implemented?  What other data collection issues concern you or your organization?

Ryan Purdy, FCAS, MAAA, is a consulting actuary at Merlinos & Associates.