Colloquia 2020-2021

Thursday, April 15, 2021 - 11:00am - Virtual

Speaker: Rada Chirkova, Ph.D.

Title: "Temporal Enrichment and Querying of Ontology-Compliant Data"

Abstract: We consider the problem of answering temporal queries on RDF stores, in the presence of time-agnostic RDFS domain ontologies, of relational data sources that include temporal information, and of rules that map the domain information in the source into the target ontology. Our proposed solution consists of two rule-based domain-independent algorithms. The first algorithm materializes target RDF data via a version of data exchange that enriches the data and the ontology with temporal information from the sources. The second algorithm accepts as inputs temporal queries expressed in terms of the domain ontology, using SPARQL supplemented with time annotations. The algorithm translates the queries into the standard SPARQL form that respects the structure of the temporal RDF information while preserving the question semantics. We present the algorithms, report on their implementation and experimental results for two application domains, and discuss future work. This is joint work with Jing Ao, Zehui Cheng, and Phokion G. Kolaitis. 

Faculty Host: Dr. Saumya Debray


Thursday, March 11, 2021 - 11:00am - Virtual

Speaker: Lei Cao, Ph.D.

Title: "SAUL: Towards Effective Data Science"

Abstract: Many data scientists prefer high-level, end-to-end interfaces, like SQL databases, to make sense of data, since they abstract away low-level, time-consuming engineering details. However, except for SQL databases, few tools for data scientists today offer such high-level interfaces. The goal of my research is to bridge this gap by developing systems and algorithms that automatically address low-level performance and scaling bottlenecks at every step in the data science pipeline, while still making it easy to incorporate domain-specific requirements.

My talk will cover two systems we have built, an anomaly discovery system and a labeling system, that solve fundamental problems in both unsupervised and supervised machine learning. First, AutoAD, the self-tuning component of our anomaly discovery system, aims to free data scientists from manually determining which of the many unsupervised anomaly detection techniques is best suited to a given task and from tuning the parameters of each alternative method. This is particularly challenging in the unsupervised setting, where no labels are available for cross-validation. AutoAD solves this problem by using a fundamentally new strategy that unifies the merits of unsupervised anomaly detection and supervised classification. Second, our LANCET approach solves the labeling problem, a key bottleneck that limits the success of cutting-edge machine learning techniques in enterprise deployments. These techniques often require millions of labeled data objects to train a robust model. Because relying on humans to supply such a huge number of labels is rarely practical, automated methods for label generation are needed. Unfortunately, critical challenges in auto-labeling remain unsolved, including the following questions: (1) which objects to ask humans to label, (2) how to automatically propagate labels to other objects, and (3) when to stop labeling. LANCET addresses all three challenges in an integrated framework based on a solid theoretical foundation characterizing the properties that the labeled dataset must satisfy to train an effective prediction model.

Bio: Dr. Lei Cao is a Postdoctoral Associate at MIT CSAIL, working with Prof. Samuel Madden and Prof. Michael Stonebraker in the Data Systems group. Before that, he worked at the IBM T.J. Watson Research Center as a Research Staff Member in the AI, Blockchain, and Quantum Solutions group. His recent research focuses on developing end-to-end tools for data scientists to effectively make sense of data.

Faculty Host: Dr. John Hartman


Thursday, March 4, 2021 - 11:00am - Virtual

Speaker: Wenpeng Yin, Ph.D.

Title: "Universal Natural Language Processing with Limited Annotations"

Abstract: Research in Natural Language Processing (NLP) tries to endow machines with the human ability to understand natural languages. This mission has been broken down into a number of subtasks. We often focus on solving individual tasks by first collecting large-scale task-specific data, then developing a learning algorithm to fit the data. This research paradigm has considerably pushed the frontiers of NLP. However, it also means that we have to build new systems to handle new tasks, which is undesirable in the long run since (i) we do not really know all the tasks that we need to solve, and (ii) it discourages people from thinking about how to make systems truly understand natural languages rather than remember the patterns in the training data.

In this talk, I will share my progress towards the goal of universal NLP (i.e., building a single system to solve a wide range of NLP tasks) in three stages. First, I study why humans show vastly superior generalization to machines when classifying open-genre text into open-form labels. The first part of the talk will present a single and static system that unifies various text classification problems: new text labels keep coming to the system while no supporting examples are available. Secondly, I define a more realistic task, "incremental few-shot text classification", where the system needs to learn the new labels incrementally with k examples per label. Thirdly, I shift my focus from classification problems to more complex and distinct tasks (e.g., Question Answering, Coreference Resolution, Relation Extraction, etc.). This part will elaborate on how to optimize the generalization of a pre-trained entailment model with k task-specific examples so that a single entailment model can generalize to a variety of NLP tasks. Overall, universal NLP research pushes us to think more about the underlying universal reasoning among various problems, facilitating the use of indirect supervision to solve new tasks.

Bio: Dr. Wenpeng Yin is a research scientist at Salesforce Research, Palo Alto, California. He received a Ph.D. from the University of Munich, Germany, in 2017 under the supervision of Prof. Hinrich Schütze and then worked as a postdoc at UPenn with Prof. Dan Roth. Wenpeng has broad research interests in Natural Language Processing (NLP) and Machine Learning, with a recent focus on Universal & Trustworthy NLP. He has received multiple awards, including the WISE 2013 "Best Paper" award, the "Baidu Ph.D. Fellowship" in 2014 and 2015, the "Chinese Government Award for Outstanding Self-financed Ph.D. Students Abroad" in 2016, and the "Area Chair Favorites" paper award at COLING 2018. He was an invited Senior Area Chair for NAACL'21 and an Area Chair for NAACL'19, ACL'19, and ACL'21.

Faculty Host: Dr. Chicheng Zhang


Tuesday, March 2, 2021 - 11:00am - Virtual

Speaker: Tianyi Zhang, Ph.D. 

Title: "Rethinking Modern Programming Tools with Human-Centered Intelligence"

Abstract: As computation is woven into our everyday life, more people want or need to write code. But programming is hard, especially for novice programmers and computer end-users. Over the years, many intelligent tools have been invented to automate the programming workflow. However, recent studies have shown that the cost of automation often outweighs its benefit and, more importantly, that people may have trouble trusting automation.

In this talk, I will describe how we can overcome those limits by augmenting intelligent tools with human-centered interaction mechanisms. First, I will demonstrate how we can amplify humans’ learning capability with a bird’s-eye view of hundreds of code examples, so programmers can quickly navigate through the long tail of API usage patterns and contextualize those patterns with concrete usage scenarios. Then, I will describe how we can renovate existing program synthesizers with enriched feedback loops, so users can better clarify their intent to a synthesizer and validate the programs synthesized on their behalf. Finally, I will describe how we can support interpretability in program synthesis, so users can build a detailed mental model and provide strategic guidance in challenging tasks that a synthesizer cannot solve alone. I will conclude my talk with several future directions, e.g., building human-centered intelligent tools to help domain experts such as API designers and physician-scientists.

Bio: Tianyi Zhang is a postdoctoral fellow in computer science at Harvard University. At Harvard, he works with Elena Glassman to build interactive systems that augment human intelligence with data-driven insights and augment machine intelligence with human guidance, with a particular focus on improving programmers’ productivity and widening the demographics of people who can write programs. He obtained his PhD from UCLA CS, advised by Miryung Kim, and is a recipient of the UCLA Dissertation Year Fellowship.

Faculty Host: Dr. Kate Isaacs


Tuesday, February 23, 2021 - 11:00am - Virtual

Speaker: Prashant Pandey, Ph.D.

Title: "Data Systems at Scale: Scaling Up by Scaling Down and Out (to Disk)"

Abstract: The standard solution to scaling applications to massive data is scale-out, i.e., use more computers or RAM. This talk presents my work on complementary techniques: scaling down, i.e., shrinking data to fit in RAM, and scaling to disk, i.e., organizing data on disk so that the application can still run fast. I will describe new compact and I/O-efficient data structures and their applications in computational biology, stream processing, and storage.

In computational biology, I show how to shrink genomic and transcriptomic indexes by a factor of two while accelerating queries by an order of magnitude compared to the state-of-the-art tools. In stream processing, my work bridges the gap between the worlds of external memory and stream processing to perform scalable and precise real-time event-detection on massive streams. In file systems, my work improves file-system random-write performance by an order of magnitude without sacrificing sequential read/write performance.

Bio: Prashant Pandey is a Postdoctoral Research Fellow at Lawrence Berkeley Lab and the University of California, Berkeley, working with Prof. Kathy Yelick and Prof. Aydin Buluc. Prior to that, he spent one year as a postdoc at Carnegie Mellon University (CMU) working with Prof. Carl Kingsford. He obtained his Ph.D. in Computer Science from Stony Brook University in 2018, co-advised by Prof. Michael Bender and Prof. Rob Johnson.

His research interests lie at the intersection of systems and algorithms. He designs and builds tools backed by theoretically well-founded data structures for large-scale data management problems across computational biology, stream processing, and storage. He is also the main contributor and maintainer of multiple open-source software tools that are used by hundreds of users across academia and industry. During his Ph.D. he interned at Intel Labs and Google. While interning at Intel Labs, he worked on an encrypted FUSE file system using Intel SGX. At Google, he designed and implemented an extension to the ext4 file system for cryptographically ensuring file integrity. While at Google, he also worked on the core data structures of Spanner, Google’s geo-distributed big database.

Faculty Host: Dr. John Kececioglu


Thursday, February 18, 2021 - 11:00am - Virtual

Speaker: Zhuoyue Zhao, Ph.D. Candidate

Title: "Approximate Query Processing in Database Systems"

Abstract: Modern data analytics applications commonly need to issue complex analytical queries over large amounts of data. Such queries are costly to evaluate even in state-of-the-art OLAP systems. Approximate Query Processing (AQP) is a fast alternative that provides approximate answers with certain accuracy guarantees using random samples or data sketches. In this talk, I will introduce the use cases and challenges of AQP in analytical database systems, and review a few techniques we developed for drawing non-uniform and uniform random samples from complex join queries to support AQP on different analytical tasks. I will also give an overview of future research directions for applying AQP in other scenarios, such as data science applications and hybrid transaction and analytical processing.

Bio: Zhuoyue Zhao is currently a fifth-year PhD candidate at the University of Utah, advised by Prof. Feifei Li. He received a B.Eng. degree from Shanghai Jiao Tong University in 2016. His research interests are in large-scale data processing and management. In particular, he is interested in approximate query processing, query optimization, and hybrid transaction and analytical processing. He received the best paper award at SIGMOD'16 and a Google PhD Fellowship in 2019.

Faculty Host: Dr. Rick Snodgrass


Tuesday, February 16, 2021 - 11:00am - Virtual

Speaker: Anqi Liu, Ph.D.

Title: "Towards Trustworthy AI: Provably Robust Extrapolation for Decision Making"

Abstract: To create trustworthy AI systems, we must safeguard machine learning methods from catastrophic failures. For example, we must account for uncertainty and guarantee performance for safety-critical systems, such as those in autonomous driving and health care, before deploying them in the real world. A key challenge in such real-world applications is that the test cases are not well represented by the pre-collected training data. To properly leverage learning in such domains, we must go beyond the conventional learning paradigm of maximizing average prediction accuracy with generalization guarantees that rely on strong distributional relationships between training and test examples.

In this talk, I will describe a distributionally robust learning framework that offers accurate uncertainty quantification and rigorous guarantees under data distribution shift. This framework yields appropriately conservative yet still accurate predictions to guide real-world decision-making and is easily integrated with modern deep learning. I will showcase the practicality of this framework in applications to agile robotic control and computer vision. I will also survey other real-world applications that could benefit from this framework in future work.

Bio: Anqi (Angie) Liu is a postdoctoral scholar research associate in the Department of Computing and Mathematical Sciences at the California Institute of Technology. She obtained her Ph.D. from the Department of Computer Science at the University of Illinois at Chicago. She is interested in machine learning for safety-critical tasks and the societal impact of AI. She aims to design principled learning methods and collaborate with domain experts to build more reliable systems for the real world. She was selected as an EECS Rising Star at UC Berkeley in 2020. Her publications appear in prestigious machine learning conferences such as NeurIPS, ICML, ICLR, AAAI, and AISTATS.

Faculty Host: Dr. Jason Pacheco


Thursday, February 11, 2021 - 11:00am - Virtual

Speaker: Ziyu Yao, Ph.D. Candidate

Title: "Building Interactive Natural Language Interfaces"

Abstract: Constructing natural language interfaces (NLIs) that allow humans to acquire knowledge and complete tasks using natural language has been a long-term pursuit. This is challenging because human language can be very ambiguous and complex. Moreover, existing NLIs typically provide no means for human users to validate the system decisions; even if they could, most systems do not learn from user feedback to avoid similar mistakes in their future deployment. 

In this talk, I will introduce my research about building interactive NLIs, where an NLI is formulated as an intelligent agent that can interactively and proactively request human validation when it feels uncertain. I instantiate this idea in the task of semantic parsing (e.g., parsing natural language into a SQL query). In the first part of the talk, I will present a general interactive semantic parsing framework [EMNLP 2019], and describe an imitation learning algorithm (with theoretical analysis) for improving semantic parsers continually from user interaction [EMNLP 2020]. In the second part, I will further talk about a generalized problem of editing tree-structured data under user interaction, e.g., how to edit the Abstract Syntax Trees of computer programs based on user edit specifications [ICLR 2021]. Finally, I will conclude by outlining future work around interactive NLIs and human-centered NLP/AI in general.

Bio: Ziyu Yao is a Ph.D. candidate at the Ohio State University (OSU). Her research interests lie in Natural Language Processing, Artificial Intelligence, and their applications to advance other disciplines. In particular, she has been focusing on developing natural language interfaces (e.g., question answering systems) that can reliably assist humans in various domains (e.g., Software Engineering and Healthcare). She has built collaborations with researchers at Carnegie Mellon University, Facebook AI Research, Microsoft Research, Fujitsu Laboratories of America, University of Washington, and Tsinghua University, and has published extensively at top-tier conferences in NLP (EMNLP, ACL), AI/Machine Learning (ICLR, AAAI), and Data Mining (WWW). In 2020, she was awarded the Presidential Fellowship (the highest honor given by the OSU Graduate School) and selected for EECS Rising Stars by UC Berkeley.

Faculty Host: Dr. Kwang-Sung Jun


Tuesday, February 9, 2021 - 11:00am - Virtual

Speaker: Jonathan Kummerfeld, Ph.D.

Title: "You Are What You Train On: Creating Robust Natural Language Interfaces"

Abstract: Natural Language Interfaces like Siri and Alexa help people do things more efficiently, but they are brittle, unable to handle the full range of ways people naturally express themselves. Each of their actions is manually defined by developers, with limited ability to compose actions to make more sophisticated ones. The choice of action is made by a statistical model that is limited by the range of data seen in training. Despite steady progress in the accuracy of these systems, the true scope of remaining challenges has been obscured by the way researchers collect and prepare data.

In this talk, I will describe two of my projects that have revealed previously unknown limitations of natural language interfaces and ways to address them. First, I will show that systems for converting questions to SQL queries have limited generalizability beyond examples seen in training (ACL 2018). I propose a new model and a new way to split data into training and test sets that explore this challenge. Second, I will show that standard crowd-worker data collection processes miss the long and heavy tail of ways people speak (ACL 2017). I propose an outlier-based data collection workflow (NAACL 2019), and a complementary taboo list workflow (EMNLP 2020), that improve data diversity and reduce the cost of data cleaning. I will conclude by outlining a research agenda for fundamentally changing the capabilities of these systems. Today we use these systems to do simple tasks, e.g. “start a 5 minute timer”. My work will enable systems to do complex tasks as part of applications, e.g. “Plot population over the last 2000 years with a trend line only and a log scale on the y-axis”.

Bio: Jonathan K. Kummerfeld is a Postdoctoral Research Fellow in Computer Science and Engineering at the University of Michigan. He completed his Ph.D. at the University of California, Berkeley, advised by Prof. Dan Klein. Jonathan’s research has revealed new challenges in syntactic parsing, coreference resolution, and dialogue. He has proposed models and algorithms to address these challenges, improving the speed and accuracy of natural language processing systems. He has been on the program committee for 55 conferences and workshops, including Area Chair at ACL and Shared Task Coordinator for the DSTC workshops. He currently serves as a standing reviewer for the Computational Linguistics journal and the Transactions of the Association for Computational Linguistics journal.

Faculty Host: Dr. Kobus Barnard


Thursday, February 4, 2021 - 11:00am - Virtual

Speaker: Daniel Fried

Title: "Learning Grounded Pragmatic Communication"

Abstract: To generate language, natural language processing systems predict what to say. Why not also predict how listeners will respond? We show how grounded language systems benefit from pragmatics: explicitly reasoning about the actions and intents of the people they interact with. Our pragmatics-enabled models predict how human listeners will interpret text generated by the model, and reason counterfactually about why human speakers produced the text they did. We find that explicit reasoning about pragmatics improves state-of-the-art grounded NLP models across diverse tasks and domains, including collaborative dialogue and following directions to choose routes through real indoor environments.

Bio: Daniel Fried is a final-year computer science PhD student at UC Berkeley working on natural language processing and machine learning. His research focuses on language grounding: tying language to world contexts, for tasks like visual- and embodied-instruction following, text generation, and dialogue. Previously, he graduated with a BS from the University of Arizona and an MPhil from the University of Cambridge. His work has been supported by a Google PhD Fellowship, an NDSEG Fellowship, and a Churchill Scholarship.

Faculty Host: Dr. Mihai Surdeanu


Tuesday, February 2, 2021 - 11:00am - Virtual

Speaker: Yu Huang, Ph.D.

Title: "Objectively Measure Developers' Cognitive Activities: Code, Biases, and Brains"

Abstract: Understanding how developers carry out computer science activities can help to improve software engineering productivity and guide the use and development of supporting tools and environments. Previous studies have explored how programmers conduct computing activities, such as code comprehension, but they rely on traditional survey instruments, which may not be reliable in all contexts. Advances in medical imaging and eye-tracking have recently been applied to software engineering, paving the way for grounded neurobiological understanding of fundamental cognitive processes involved therein.

In this talk, I will present three of my studies spanning software engineering and cognitive science using multiple modalities (fMRI, fNIRS, and eye-tracking), and discuss the implications. First, I will introduce our examination of the relationship between data structure manipulations and spatial ability, as well as how we adapt medical imaging approaches to software engineering. Then, I will discuss our investigation of cognitive processes in higher-level, more semantically rich, and industry-related activities, including code writing and code review. This work is among the first to leverage various objective measures to provide a systematic solution for understanding user cognition in programming activities. This line of research involves novel approaches to understanding developers' behaviors, and shows potential for broad impact in CS pedagogy, technology transfer, and broadening participation. Lastly, I will discuss my ongoing and future research directions.

Bio: Yu Huang is a PhD candidate in the Department of Computer Science and Engineering at the University of Michigan, advised by Prof. Westley Weimer. Her research expertise lies at the intersection of software engineering and human factors. Her work spans software, hardware and embedded systems, medical imaging, open source software, and mobile sensing, collaborating with researchers from Psychology and Neuroscience. She is particularly interested in understanding and improving software activities and developers' behaviors. Yu Huang has received multiple grants to support both her research and her efforts to improve diversity in computer science. Her work has resulted in over 20 peer-reviewed publications, including an ACM SIGSOFT distinguished paper award, and has been covered in multiple media outlets, including the 2020 GitHub Octoverse Report.

Faculty Host: Dr. Joshua Levine