General research project ideas
Personal Assistant “Online” Learning: Personal assistants are somewhat static: they are trained on large amounts of data, but they do not improve over time in how they interact with their users. The purpose of this research project is to help a personal assistant (building off of Kennington and Schlangen, 2016) start from a very minimal amount of data and give it the means to improve as it interacts.
Background: Good Java and Python programming helpful, machine learning knowledge also helpful.
Medical Disclosures and Form Filling: The purpose of this project is to map from natural human language (in this case, speech) to the regular forms that doctors currently have to fill in by hand. By learning how to extract relevant information from a conversation between a doctor and a patient, the goal of this research project would be a tool that fills in digital forms automatically. To get there, we need to be able to extract the relevant information from the speech signal.
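To make the slot-filling idea concrete, here is a toy sketch that scans a transcribed utterance for form fields using keyword patterns. A real system would use trained information-extraction models over recognized speech; the form fields, patterns, and example transcript below are all invented for illustration.

```python
# Toy slot-filler: map phrases in a transcript to (hypothetical) form fields.
import re

# Illustrative patterns only; real extraction would be learned from data.
FORM_PATTERNS = {
    "symptom":  re.compile(r"complain(?:s|ing)? of (\w+(?: \w+)?)"),
    "duration": re.compile(r"for (?:the past |about )?(\d+ (?:days?|weeks?))"),
}

def fill_form(transcript):
    """Return a dict of form fields found in the transcript."""
    form = {}
    for field, pattern in FORM_PATTERNS.items():
        match = pattern.search(transcript)
        if match:
            form[field] = match.group(1)
    return form

transcript = "The patient is complaining of chest pain for the past 3 days."
form = fill_form(transcript)
```

The interesting research questions begin where patterns like these fail: recognizing paraphrases, handling disfluent speech, and deciding which of several mentions belongs in which field.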
Pronunciation Drilling: Computers have been used to help with language learning in many different ways. One way that has not yet been fully realized, but could potentially be very helpful, is pronunciation drilling. For example, an individual whose native language is English but who wishes to learn Japanese might have a hard time pronouncing certain words. A tool that the user could talk to, which could give feedback on how well a word was pronounced and identify the parts of the word that were pronounced incorrectly, would be very helpful indeed. To this end, a speech recognizer such as CMU’s Sphinx4 could be used. This project would require building an intuitive GUI for the users.
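One possible core for such a tool: align the phoneme sequence a recognizer returns against the expected sequence and report which expected phonemes were mispronounced. The sketch below uses a standard Levenshtein alignment; the phoneme strings are illustrative stand-ins, not real Sphinx4 output.

```python
# Sketch: score pronunciation by aligning expected vs. recognized phonemes.

def align_phonemes(expected, actual):
    """Levenshtein alignment; returns (distance, mispronounced expected phonemes)."""
    n, m = len(expected), len(actual)
    # dp[i][j] = edit distance between expected[:i] and actual[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if expected[i - 1] == actual[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    # backtrace: collect expected phonemes that were substituted or deleted
    errors, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1]
                and expected[i - 1] == actual[j - 1]):
            i, j = i - 1, j - 1                        # correct phoneme
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            errors.append(expected[i - 1])             # substituted
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            errors.append(expected[i - 1])             # deleted
            i -= 1
        else:
            j -= 1                                     # inserted
    return dp[n][m], list(reversed(errors))

def pronunciation_score(expected, actual):
    dist, errors = align_phonemes(expected, actual)
    return 1.0 - dist / max(len(expected), 1), errors

# Japanese "arigatou": the learner substitutes L for R (a common difficulty)
expected = ["A", "R", "I", "G", "A", "T", "O", "U"]
actual   = ["A", "L", "I", "G", "A", "T", "O", "U"]
score, errors = pronunciation_score(expected, actual)
```

The GUI could then highlight the phonemes in `errors` and play back the target pronunciation for just those segments.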
Working with Robots: In the general research area of grounded language acquisition and dialogue, we are interested in conveying human intent to robots (or vehicles, etc.) using not just speech commands, but natural, interactive speech. We are working with Anki Cozmo robots and are planning on expanding to other kinds (in potential collaboration with Dr. Hoda Mehrpouyan of Boise State).
Multimodal Alignment using InproTK: In a system that attempts to make use of multiple information sources (e.g., speech, gesture, gaze, social cues), these need to be integrated at a semantic level in order to understand and make use of all of them. Before that can take place, however, the timing alignment needs to be addressed: different sensors, the services that read in their data, varying sampling rates, etc., make it difficult to determine whether a gesture that occurred 30 ms in the past coincides with speech that occurred 1000 ms in the past.
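The core of the timing problem can be illustrated in a few lines (the actual project would live as an InproTK module in Java): correct each event's timestamp by its sensor's processing latency, then test whether the corrected times fall within a tolerance window. The sensor names and latency values below are invented for illustration; in practice they would have to be measured.

```python
# Sketch: decide whether events from different sensors co-occur after
# correcting for (assumed, measured-offline) per-sensor latencies.

LATENCY_MS = {"gesture": 30, "asr": 1000}  # illustrative per-sensor delays

def corrected_time(sensor, timestamp_ms):
    """Shift a sensor timestamp back to the estimated real-world event time."""
    return timestamp_ms - LATENCY_MS.get(sensor, 0)

def co_occur(event_a, event_b, tolerance_ms=200):
    """True if the two events' corrected times fall within the tolerance window."""
    ta = corrected_time(event_a["sensor"], event_a["t_ms"])
    tb = corrected_time(event_b["sensor"], event_b["t_ms"])
    return abs(ta - tb) <= tolerance_ms

gesture = {"sensor": "gesture", "t_ms": 5030}  # arrived 30 ms after the event
speech  = {"sensor": "asr",     "t_ms": 6000}  # arrived 1000 ms after the event
# Both actually happened near t=5000 ms, so they should align.
aligned = co_occur(gesture, speech)
```

The hard parts the project would address are estimating those latencies (which can drift), handling events with duration rather than a single timestamp, and doing all of this incrementally as data streams in.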
Background: Excellent Java programming required. Experience in handling time series data helpful.
In-car Dialogue Systems: It has been shown that speaking on the phone (even hands free!) while driving a car is cognitively taxing. Quite literally, the only thing a driver can do safely while talking on the phone is drive in a straight line at a constant speed. This isn’t the case when the driver is speaking to another adult passenger, because that passenger can stop talking at any time, allowing the driver to focus all cognitive faculties on the primary task of driving. Worse than talking to people is talking to dialogue systems, but being able to accomplish tasks while driving using a speech-driven assistant is a potential way to save time and money for many. The purpose of this project is to learn strategies for systems to be “situationally aware” such that they respond to and aid the driver rather than hinder the driver.
Fusing Multiple ASR Sources: Since 2011, there have been impressive improvements in Automatic Speech Recognition (ASR), and some of the resulting systems are open source or freely available. Google, Nuance, IBM, and MS have their own ASR/Speech APIs that produce impressive results and cover large vocabularies in several languages. Open source ASR engines such as Sphinx (by CMU) and Kaldi have also made impressive improvements and have out-of-the-box models available.
Different companies and open source engines have different strengths and weaknesses. Google ASR, for example, produces good results (in well-represented languages like English and German) at the cost of a delay: even fast Internet connections can see results delayed by up to 1500 ms. Moreover, it is important to know the start and stop time of each word that is recognized, but Google ASR doesn’t provide this. Sphinx ASR, on the other hand, has fairly high error rates, but can produce very fast results and provide start and stop times for the words.
The goal of this project is to combine the strengths from various ASR sources to produce output that is accurate, fast, and produces timing information. This requires the integration of several ASR sources (potentially the more the better) as a module (or integration of modules) in the InproTK framework (written in Java).
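One simple fusion strategy (a starting point, not the whole project, and sketched in Python rather than the target Java/InproTK setting): keep the word sequence from the more accurate recognizer and borrow per-word timings from the faster recognizer by matching words in order. The hypotheses below are made up.

```python
# Sketch: fuse an accurate-but-untimed hypothesis with a noisy-but-timed one.

def fuse(accurate_words, timed_words):
    """accurate_words: list of str.
    timed_words: list of (word, start_ms, end_ms) from a recognizer with timings.
    Returns list of (word, (start_ms, end_ms) or None)."""
    fused, j = [], 0
    for word in accurate_words:
        timing = None
        # scan ahead in the timed hypothesis for the same word, preserving order
        for k in range(j, len(timed_words)):
            if timed_words[k][0] == word:
                timing = (timed_words[k][1], timed_words[k][2])
                j = k + 1
                break
        fused.append((word, timing))
    return fused

google_like = ["take", "the", "red", "cross"]         # accurate text, no times
sphinx_like = [("take", 0, 250), ("red", 400, 650),   # noisy text, has times
               ("boss", 700, 950)]
result = fuse(google_like, sphinx_like)
```

Words with no match get `None` here; a fuller system would interpolate their timings from neighboring matched words, and weigh more than two sources against each other.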
Hand Pointing Recognition and Tracking (using MS Kinect or HD Camera): I would like to develop a system that can detect and track deictic gestures and, more importantly, determine which object is being pointed at. This could be done, following Matuszek et al. (2014), with a standard web camera, or with a MS Kinect (with the latest server software), which can determine general pointing directions (treating the hand as a sort of claw). Developing this will be a first step in using deictic gestures to ground language learning to real-world objects in interactive settings.
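The geometric core of resolving a pointing gesture can be sketched simply: cast a ray through two tracked joints (here elbow and hand, in 2D) and pick the candidate object whose direction deviates least from that ray. The joint and object coordinates are invented; real skeleton data from a Kinect would be 3D and noisy.

```python
# Sketch: resolve a deictic gesture to the object nearest the pointing ray.
import math

def angle_to(ray_origin, ray_through, target):
    """Angle (radians) between the pointing ray and the ray to the target."""
    vx, vy = ray_through[0] - ray_origin[0], ray_through[1] - ray_origin[1]
    tx, ty = target[0] - ray_origin[0], target[1] - ray_origin[1]
    dot = vx * tx + vy * ty
    norm = math.hypot(vx, vy) * math.hypot(tx, ty)
    return math.acos(max(-1.0, min(1.0, dot / norm)))  # clamp for float safety

def pointed_object(elbow, hand, objects):
    """objects: dict name -> (x, y); returns the name with the smallest angle."""
    return min(objects, key=lambda name: angle_to(elbow, hand, objects[name]))

elbow, hand = (0.0, 0.0), (1.0, 1.0)  # pointing up-right at 45 degrees
objects = {"ball": (5.0, 5.2), "cup": (5.0, 0.0), "box": (0.0, 5.0)}
target = pointed_object(elbow, hand, objects)
```

For grounding language learning, the angles could instead be turned into a probability distribution over objects, which combines naturally with the distributions the WAC model (below) produces from words.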
Background: experience in Java and Python helpful.
Research involving the recent Words-as-Classifiers (WAC) model
The WAC model of reference resolution (Kennington & Schlangen, 2015; Schlangen et al., 2016) has shown promise in its so-far short life. At its core, the model learns a “fit” between words and visual aspects of objects. The mapping is done using a logistic regression classifier for each word; examples of a word paired with the objects it was used to refer to are given to that word’s classifier to learn the mapping. It has been used in reference resolution tasks ranging from simple geometric shapes to objects in real-world photographs. Learned classifiers have also been applied to language generation tasks (Zarrieß & Schlangen, 2016). It has also been applied in a demo using a simple robot that can manipulate objects (Hough & Schlangen, 2016).
The logistic regression classifiers learn to “fit” words and objects. For example, given enough examples of red objects, the classifier for red learns to return high probabilities when exposed to the features of red objects, and lower probabilities for objects that are not red. The more prototypical the red (as accrued in the training data), the higher the probability.
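The per-word classifier idea from the preceding paragraphs can be sketched in a few lines. Here a tiny hand-rolled gradient-descent trainer stands in for a real logistic-regression library, and raw RGB values stand in for real visual features; both choices are purely illustrative.

```python
# Sketch of WAC: one binary logistic-regression classifier per word,
# trained on visual features of objects the word did (or did not) refer to.
import math

def train_word_classifier(examples, epochs=500, lr=0.5):
    """examples: list of (feature_vector, label), label 1 = word applies.
    Returns a function mapping a feature vector to P(word fits object)."""
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - y                                   # gradient of log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    def classify(x):
        return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
    return classify

# Features are (r, g, b) in [0, 1]; positive examples are red objects.
red_examples = [((0.9, 0.1, 0.1), 1), ((0.8, 0.2, 0.1), 1),
                ((0.1, 0.8, 0.2), 0), ((0.2, 0.2, 0.9), 0)]
wac_red = train_word_classifier(red_examples)
```

After training, `wac_red` returns a high probability for strongly red feature vectors and a low one otherwise, mirroring the prototypicality behavior described above.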
Composition of Semantic Meaning using the WAC Model: See description here.
Learning Verb Meanings with the WAC Model: I conjecture that learning verb meanings happens in a similar way as for non-verb words like nouns, and that verbs can be modeled like other words in the WAC model; what would need to be added (or changed) are the features. Verbs denote movement, so features representing movement of objects, a change in state, or movement of the appendage that performed the verb could prove useful. In order to do this, we would need to either find existing data (e.g., the REX corpus) or create new data ourselves by recording interactions between humans, or between humans and machines (e.g., using Pentomino objects in a simple task of constructing puzzle objects). That data would (likely) need to be annotated, and we could use it to train the WAC verb classifiers.
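As a concrete (and entirely speculative) example of what movement features for a verb classifier might look like, the sketch below computes net displacement, total path length, and a did-it-move flag from an object's recorded positions over the span of an utterance. The trajectory is made up.

```python
# Sketch: candidate movement features for a WAC verb classifier.
import math

def movement_features(trajectory):
    """trajectory: list of (x, y) object positions sampled over the verb's span."""
    (x0, y0), (xn, yn) = trajectory[0], trajectory[-1]
    net = math.hypot(xn - x0, yn - y0)           # start-to-end displacement
    path = sum(math.hypot(x2 - x1, y2 - y1)      # total distance traveled
               for (x1, y1), (x2, y2) in zip(trajectory, trajectory[1:]))
    moved = 1.0 if path > 0.0 else 0.0           # did the object move at all
    return {"net_displacement": net, "path_length": path, "moved": moved}

# e.g., an object pushed 3 units to the right during "push the cross over"
features = movement_features([(0, 0), (1, 0), (2, 0), (3, 0)])
```

A classifier for "push" would then, with enough examples, learn to return high probabilities for feature vectors with substantial directed displacement, just as the "red" classifier does for red feature values.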
Grounded Semantics with the WAC Model and a Semantic Formalism: At the moment, the classifiers treat words as independent entities and are only combined somewhat ad-hoc by averaging the output of the classifiers as they are applied to visually present objects. Another possible way of combining these classifiers would be to use them as a lexical semantic representation and apply an implementation of a known semantic formalism (e.g., PLTAG, RMRS, Dependency Parsing, or TTR and DS). The formalisms could guide the composition and interpretation process, making the WAC model more useful in real-world applications.
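The current ad-hoc averaging composition that a semantic formalism would replace looks roughly like this: apply each word's classifier to every visible object, average the probabilities per object, and take the best-scoring object as the referent. The word classifiers here are stubbed with hand-set probability tables rather than trained models.

```python
# Sketch: the baseline ad-hoc composition -- average word-classifier
# probabilities over candidate objects and pick the argmax.

word_classifiers = {  # stand-ins for trained WAC classifiers
    "red":  lambda obj: {"red_ball": 0.9, "green_ball": 0.1, "red_box": 0.8}[obj],
    "ball": lambda obj: {"red_ball": 0.9, "green_ball": 0.8, "red_box": 0.2}[obj],
}

def resolve(expression, objects):
    """Average known word-classifier scores per object; return (best, scores)."""
    scores = {}
    for obj in objects:
        probs = [word_classifiers[w](obj) for w in expression
                 if w in word_classifiers]          # unknown words are skipped
        scores[obj] = sum(probs) / len(probs) if probs else 0.0
    return max(scores, key=scores.get), scores

best, scores = resolve(["the", "red", "ball"], ["red_ball", "green_ball", "red_box"])
```

Averaging treats every word as equally important and independent; a formalism like those listed above could instead let, say, a head noun restrict the candidate set before modifiers are applied.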
Quantification using the WAC Model: The WAC model produces a distribution over possible referents. A simple determiner (e.g., “the”) denotes the top element, whereas other quantifiers (e.g., “a” or “some” or “all”) denote more than just the top element. Determining how quantifier words map to a distribution is the main part of this research project.
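One naive starting point for such a mapping is sketched below: "the" selects the single best object, "a" selects any one sufficiently probable object, and "some"/"all" select every object above a threshold. The threshold and the quantifier semantics are deliberate oversimplifications; refining exactly this mapping is what the project is about.

```python
# Sketch: a naive mapping from quantifiers to selections over the
# WAC referent distribution. Thresholds are arbitrary placeholders.

def quantify(quantifier, dist, threshold=0.5):
    """dist: dict object -> probability that the description fits the object."""
    ranked = sorted(dist, key=dist.get, reverse=True)
    above = [obj for obj in ranked if dist[obj] >= threshold]
    if quantifier == "the":
        return ranked[:1]        # top element only
    if quantifier == "a":
        return above[:1]         # any one sufficiently probable object
    if quantifier in ("some", "all"):
        return above             # every sufficiently probable object
    return ranked                # unknown quantifier: no restriction

dist = {"obj1": 0.9, "obj2": 0.7, "obj3": 0.2}
```

Note that this sketch collapses "some" and "all", which is clearly wrong ("some" should exclude the case where the description fits everything); capturing such distinctions is part of the research question.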
Fast-mapping with WAC: When children first learn words, they require many examples of caregivers pointing at or otherwise denoting objects while uttering the corresponding referring expressions. Later, children are able to pick up new words quickly from only a single use. For example, a child who knows a number of words but sees a pine cone for the first time can ask what it is, and after hearing the reference “pine cone” only once, the child can remember it. This phenomenon is known as fast-mapping. The question for this research project is: can the WAC model be used for fast-mapping? Some preliminary work shows that it is very possible.