BYU’s research in machine learning is starting to receive recognition on the world stage.
Paul Felt, a recent PhD graduate in computer science, has been developing new ways to help computers learn the subtleties of human language from human-labeled data. His paper “Making the Most of Crowdsourced Document Annotations: Confused Supervised LDA” received the best paper award at the SIGNLL Conference on Computational Natural Language Learning in Beijing. Crowdsourcing is the practice of obtaining labeled (or “annotated”) data from ordinary people online rather than from experts.
An international body of computer science researchers presented about 55 papers at the conference, which was sponsored by Google and Microsoft Research.
“It was definitely unexpected,” Felt said. “My other co-author [in attendance] had already left after I presented, and I was sitting in the back . . . when they called my name.”
Felt conducted his research and wrote his award-winning paper in collaboration with his faculty advisor, Dr. Eric Ringger, the director of the BYU Natural Language Processing Laboratory (currently on leave at Facebook); Dr. Kevin Seppi, who directs the BYU Applied Machine Learning Laboratory; and Dr. Jordan Boyd-Graber of the University of Colorado at Boulder. Felt’s research required collaboration among the labs, and the results were exciting for all those working on machine learning and natural language processing at BYU and Boulder.
“It’s an area we’ve been working on for a handful of years,” Ringger said. “Paul’s work is the culmination of a lot of thinking and experimentation, and it’s the first batch of really strong results.”
The goal of this research is to provide trusted examples for teaching computers human language. Since language is a human phenomenon, Felt and his co-authors gathered crowdsourced annotation data: labeled examples from a group of ordinary human annotators working online rather than from experts.
These annotators label examples that are used to teach machine learning algorithms. For example, annotators can be given the task to label the topic of a document, identify word or phrase types, or indicate the overall tone (or “sentiment”) of a written work. Because labels from different annotators for a specific example may be inconsistent, Felt and his co-authors devised algorithms and models to ascertain the “true” labels for the examples from as few labels from the crowd as possible. The methods automatically take advantage of inter-annotator agreement and disagreement as well as annotator self-consistency and, naturally, the content of the examples themselves. Past methods have not been as effective at zeroing in on the true labels.
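To make the idea of inferring “true” labels from inconsistent annotators concrete, here is a minimal Python sketch. It is not the Confused Supervised LDA model from Felt’s paper. It is a much simpler illustrative baseline that starts from a majority vote and then reweights each annotator by how often they agree with the current consensus. Unlike the paper’s approach, it ignores the content of the documents entirely. The function names, the annotator and document identifiers, and the example data are all hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(annotations):
    """Pick the most common label per item.

    annotations: list of (item_id, annotator_id, label) triples.
    Ties go to the alphabetically first label so results are deterministic.
    """
    votes = defaultdict(Counter)
    for item, _annotator, label in annotations:
        votes[item][label] += 1
    return {item: max(sorted(c), key=c.__getitem__) for item, c in votes.items()}


def weighted_vote(annotations, rounds=5):
    """Iteratively reweight annotators by agreement with the consensus.

    An illustrative baseline only (an assumption of this sketch),
    not the method described in the paper.
    """
    consensus = majority_vote(annotations)
    for _ in range(rounds):
        agree, total = Counter(), Counter()
        for item, annotator, label in annotations:
            total[annotator] += 1
            agree[annotator] += int(label == consensus[item])
        # Smoothed agreement rate, so no annotator gets weight 0 or 1.
        weight = {a: (agree[a] + 1) / (total[a] + 2) for a in total}

        scores = defaultdict(lambda: defaultdict(float))
        for item, annotator, label in annotations:
            scores[item][label] += weight[annotator]
        consensus = {
            item: max(sorted(s), key=s.__getitem__) for item, s in scores.items()
        }
    return consensus


# Hypothetical crowd labels: three annotators, three documents.
example = [
    ("doc1", "ann_a", "sports"), ("doc1", "ann_b", "sports"), ("doc1", "ann_c", "politics"),
    ("doc2", "ann_a", "politics"), ("doc2", "ann_b", "politics"), ("doc2", "ann_c", "sports"),
    ("doc3", "ann_a", "sports"), ("doc3", "ann_c", "politics"),
]

if __name__ == "__main__":
    print(majority_vote(example))
    print(weighted_vote(example))
```

In this toy data, the two annotators who labeled doc3 disagree, so a plain majority vote has to break the tie arbitrarily. The reweighted vote instead sides with the annotator who agreed with the crowd on the other documents, which is a small-scale version of exploiting inter-annotator agreement and disagreement.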
Labeled data inferred from crowdsourced data can likewise be used to learn other kinds of human language behavior to improve computer language tasks such as search engine relevance and ranking, product rating, machine translation, and speech recognition.
“If you’re trying to learn humans’ reaction to something, then you will always need a human in the loop,” Seppi said.
Dr. Ringger also stressed the importance of using humans to solve computer problems involving language.
“Think of the new algorithm as a truth finder, when it comes to truths that only humans know about,” Ringger said. “We’re leveraging a group of people to find linguistic truth.”
According to Felt, the main goal of this research is to generate trustworthy labels at little cost. Since almost every major tech company is currently using labeled data, this type of research is in very high demand.
“Anybody who is developing machine learning is a good candidate for using this,” Felt said. “In fact, there was a fair contingent from Google at the conference, and they expressed quite a bit of interest.”
Although Felt obtained his PhD in December 2015 and currently works at IBM Watson, BYU’s research in the areas of machine learning and natural language processing is far from over. Now, it is up to a new batch of computer science undergraduate and graduate students to build on the research that Felt pioneered.