Newswise — The world's largest private-sector securities regulator, the National Association of Securities Dealers, has teamed with University of Massachusetts Amherst researchers to bring cutting-edge computer science to the world of securities fraud. By developing statistical models that assess data that most models can't manage, the scientists aim to help the NASD discover misconduct among brokers and concentrate regulatory attention on those who are most likely to misbehave.

Because brokers are more likely to break the rules when they work alongside others who are committing fraud, the researchers were given the task of developing statistical models that make use of this social aspect of rule-breaking. Such "relational" data is difficult for many models to handle, because they often assume that records are independent of one another.

David Jensen, computer science, likens the task to modeling medical diagnostics. When trying to predict the probability that an individual will catch a disease, information intrinsic to the individual—such as age or health history—can be critical. But clues can also be extracted from information about the person's social and professional network, such as where they've lived or worked, or with whom they've been in contact.

"Our methods are uniquely suited to analyze this kind of information," says Jensen. "They allow you to easily look at the characteristics of the surrounding network."

The work is part of an ongoing, joint project exploring fraud detection by UMass Amherst researchers and the NASD, and it was presented recently by doctoral student Jennifer Neville at the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

More than 600,000 brokers are engaged in securities transactions, making NASD examiners a valuable and finite resource. While these human examiners have the acuity to spot relational patterns that suggest a broker warrants further scrutiny, automating that sort of evaluation has proved difficult. But the relational probability trees (RPTs) developed by Neville and Jensen appear to make good use of this contextual information, and they also produce a ranking of risky brokers.

Using data from past years supplied by the NASD, Jensen, Neville and doctoral student Ozgur Simsek applied their algorithms to the networks of organizational relationships in the securities world. For example, brokers are linked to the firms they work for, customer complaints are linked to the brokers they reference, and branches are linked to their parent firms. By analyzing records of brokers in the context of other records in their "neighborhood," the algorithms were able to predict which brokers would commit violations with surprising accuracy, says Jensen.
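
The idea of reading a broker's record in the context of its "neighborhood" can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the broker, firm and complaint identifiers are invented, the function name `neighborhood_features` is hypothetical, and the three aggregates are just one plausible way to summarize linked records. It is not the researchers' algorithm and uses no NASD data.

```python
from collections import defaultdict

# Hypothetical toy records; none of these values come from NASD data.
broker_firm = {"b1": "f1", "b2": "f1", "b3": "f2"}        # broker id -> firm id
complaints = [("c1", "b1"), ("c2", "b1"), ("c3", "b3")]   # (complaint id, broker id)


def neighborhood_features(broker_firm, complaints):
    """Summarize each broker's linked records as one flat row of features."""
    # Count the complaints that reference each broker directly.
    own = defaultdict(int)
    for _, broker in complaints:
        own[broker] += 1

    # Group brokers by the firm they are linked to.
    by_firm = defaultdict(list)
    for broker, firm in broker_firm.items():
        by_firm[firm].append(broker)

    features = {}
    for broker, firm in broker_firm.items():
        coworkers = [b for b in by_firm[firm] if b != broker]
        features[broker] = {
            "own_complaints": own[broker],
            # Relational feature: complaints against others at the same firm.
            "coworker_complaints": sum(own[b] for b in coworkers),
            "coworker_count": len(coworkers),
        }
    return features


features = neighborhood_features(broker_firm, complaints)
```

In this toy data, broker `b2` has no complaints of its own but picks up a nonzero `coworker_complaints` value from `b1`, which is the kind of contextual signal a model that treats records as independent would not see.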

The researchers also compared the RPT models' ability to highlight risky brokers with the subjective ratings of brokers by NASD examiners and with the NASD's Higher-Risk Broker List. Not only did the RPTs identify many of the same brokers as the higher-risk list, they also identified novel cases not on the list.

"That it performs as well as live examiners is fascinating," says John Komoroske, vice president of the NASD. "With tweaking and time, models such as these could be a real help."

Jensen and his team have explored applying the models to other kinds of data as well. In a recent test of the RPTs' predictive power, the researchers constructed a tree to predict a movie's box office success. In this case, the data fragments fed into the tree included the movie and all of its associated actors, producers, directors and studios. Characteristics such as "not a documentary," or the number of successful movies an actor had already appeared in, were also embedded in the network. Based on the movie and its associated network, the RPT assigned a probability that the movie would make more than $2 million in its first weekend.
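
To make the shape of such a tree concrete, here is a small hand-written Python sketch. It is an illustration under stated assumptions, not the tree the researchers built: their tree was learned from data, whereas the splits, the `prior_hits` field, the function name `hit_probability` and every leaf probability below are invented for the example.

```python
# Hypothetical toy movies with their linked actor records; all values are invented.
drama = {
    "genre": "drama",
    "actors": [{"name": "A", "prior_hits": 3}, {"name": "B", "prior_hits": 0}],
}
documentary = {
    "genre": "documentary",
    "actors": [{"name": "C", "prior_hits": 5}],
}


def hit_probability(movie):
    """Walk a hand-written two-level tree and return an invented leaf probability.

    The return value stands in for the probability that the movie makes
    more than $2 million in its first weekend.
    """
    # First split: an attribute of the movie itself.
    if movie["genre"] == "documentary":
        return 0.05
    # Second split: an aggregate over the linked actor records
    # (how many actors have at least two earlier successful movies).
    experienced = sum(1 for actor in movie["actors"] if actor["prior_hits"] >= 2)
    if experienced >= 1:
        return 0.80
    return 0.30


p_drama = hit_probability(drama)
p_documentary = hit_probability(documentary)
```

The point of the sketch is the second split: the test is applied to a count computed over the movie's linked records rather than to a field of the movie itself.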

The accuracy with which the models predict a hit seems surprising, says Jensen, but he points out that the tree was constructed automatically from data for which box office receipts were known. Once trained on prior data, the model can then be applied to movies whose outcomes are not yet known. But Jensen is cautious about the role of such models in decision making.

"There are sometimes concerns that these algorithms will replace human analysts and decision makers," says Jensen. "But decades of work show that they work best when applied to mundane tasks—thus freeing the analyst for work that really requires expertise."

CITATIONS

ACM SIGKDD International Conference on Knowledge Discovery and Data Mining