Qianyu He (何千羽) is currently a third-year PhD candidate in the School of Computer Science at Fudan University, Shanghai, China. Her previous work focused on the Creative Generation of Language Models.
Currently, her research interests center on enhancing the Instruction Following and Reasoning abilities of Large Language Models (LLMs).
(Download my résumé, last updated on 2024-04-30.)
Ph.D. in CS, 2021-2026 (expected)
Fudan University
B.S. in CS, 2017-2021
Fudan University
May 2024 Gave a talk at Alibaba Tongyi Lab, titled: “Complex Instruction Following Ability of Large Language Models”. Thanks for the invitation!
Apr. 2024 How to improve LLMs’ ability to follow Complex Instructions? Check out our new preprint From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models.
Dec. 2023 Gave a talk at Tencent AI Lab, titled: “Beyond Simple Words: Make Machine Communicate like Humans”. Thanks for the invitation!
Dec. 2023 Congratulations on our paper Enhancing Quantitative Reasoning Skills of Large Language Models through Dimension Perception being accepted to ICDE 2024!
Dec. 2023 Three papers have been accepted by AAAI 2024! The first paper is CELLO, a benchmark for evaluating LLMs’ ability to follow complex instructions systematically. The second paper is Xiezhi, a comprehensive, multi-disciplinary, auto-updating benchmark for domain knowledge evaluation.
Apr. 2023 We released CuteGPT, an open-source conversational language model developed by the Knowledge Works Research Laboratory at Fudan University. Our work has been integrated into FastChat 👏.
Mar. 2023 Our paper HAUSER was accepted to ACL 2023!
It is imperative for large language models (LLMs) to follow instructions with elaborate requirements (i.e., complex instruction following). Yet, it remains under-explored how to enhance the ability of LLMs to follow complex instructions with multiple constraints. To bridge the gap, we first study what training data is effective in enhancing the ability to follow complex constraints. We find that training LLMs on instructions containing multiple constraints enhances their understanding of complex instructions, especially those with lower complexity levels, and the improvement can even generalize to compositions of out-of-domain constraints. We further propose methods for obtaining and utilizing such effective training data. Finally, we conduct extensive experiments demonstrating the effectiveness of our methods in terms of overall performance, training efficiency, and generalization ability under four settings.
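To make the idea of a multi-constraint instruction concrete, here is a minimal Python sketch of how such an instruction could be composed from a base task plus constraint templates. The templates and helper names are invented for illustration; this is not the paper's actual data-construction pipeline.

```python
# Hypothetical sketch: compose a training instruction from a base task plus
# several constraint templates. Templates are invented for illustration only.
import random

base_task = "Summarize the following news article."

constraint_templates = [
    "Respond in no more than {n} sentences.",
    "Write the answer in {language}.",
    "Format the answer as a bulleted list.",
    "Do not mention any person's name.",
]

def compose_instruction(num_constraints: int = 3) -> str:
    # Sample a few constraints and append them to the base task.
    chosen = random.sample(constraint_templates, k=num_constraints)
    filled = [c.format(n=3, language="English") for c in chosen]
    return base_task + " " + " ".join(filled)

print(compose_instruction())
# e.g. "Summarize the following news article. Respond in no more than 3 sentences. ..."
```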
Large language models (LLMs) can understand human instructions, showing their potential for pragmatic applications beyond traditional NLP tasks. However, they still struggle with complex instructions, which can be either complex task descriptions that require multiple tasks and constraints, or complex inputs that contain long context, noise, heterogeneous information, and multi-turn format. Because of these features, LLMs often ignore semantic constraints from task descriptions, generate incorrect formats, violate length or sample-count constraints, and are unfaithful to the input text. Existing benchmarks are insufficient for assessing LLMs' ability to understand complex instructions, as they are closed-ended and simple. To bridge this gap, we propose CELLO, a benchmark for systematically evaluating LLMs' ability to follow complex instructions. We design eight features of complex instructions and construct a comprehensive evaluation dataset from real-world scenarios. We also establish four evaluation criteria and develop corresponding metrics, since current ones are inadequate, biased, or too strict and coarse-grained. Through extensive experiments, we compare the performance of representative Chinese-oriented and English-oriented models in following complex instructions.
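As a rough illustration of two of the failure modes mentioned above (incorrect output format, violated sample-count constraints), the sketch below runs simple rule-based checks on a model output. The checks and the example output are hypothetical and are not CELLO's actual evaluation metrics.

```python
# Hypothetical rule-based checks for format and count constraints.
import json

instruction = "List exactly 3 key points from the passage, as a JSON array of strings."
model_output = '["Point one", "Point two", "Point three"]'

def check_json_format(text: str) -> bool:
    # The output must parse as a JSON array of strings.
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, list) and all(isinstance(x, str) for x in parsed)

def check_count(text: str, expected: int) -> bool:
    # The array must contain exactly the requested number of items.
    try:
        return len(json.loads(text)) == expected
    except json.JSONDecodeError:
        return False

print(check_json_format(model_output), check_count(model_output, 3))  # True True
```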
Similes play an important role in creative writing such as story and dialogue generation. Proper evaluation metrics are like a beacon guiding the research of simile generation (SG). However, it remains under-explored what criteria should be considered, how to quantify each criterion into metrics, and whether the metrics are effective for comprehensive, efficient, and reliable SG evaluation. To address these issues, we establish HAUSER, a holistic and automatic evaluation system for the SG task, which consists of five criteria from three perspectives and automatic metrics for each criterion. Through extensive experiments, we verify that our metrics correlate significantly better with human ratings from each perspective than prior automatic metrics.
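As a toy illustration of how agreement between an automatic metric and human ratings can be measured, the sketch below computes a Spearman rank correlation over made-up scores. It is not HAUSER's actual metric implementation.

```python
# Minimal sketch: correlate an automatic metric with human ratings.
from scipy.stats import spearmanr

# Hypothetical scores for five generated similes (values are made up).
metric_scores = [0.82, 0.41, 0.67, 0.90, 0.35]
human_ratings = [4.5, 2.0, 3.5, 5.0, 1.5]

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```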
The ability to understand and generate similes is an imperative step toward realizing human-level AI. However, there is still a considerable gap between machine intelligence and human cognition in similes, since deep models based on statistical distributions tend to favor high-frequency similes. Hence, a large-scale symbolic knowledge base of similes is required, as it contributes to the modeling of diverse yet unpopular similes while facilitating additional evaluation and reasoning. To bridge the gap, we propose a novel framework for large-scale simile knowledge base construction, as well as two probabilistic metrics that enable an improved understanding of simile phenomena in natural language. Overall, we construct MAPS-KB, a million-scale probabilistic simile knowledge base covering 4.3 million triplets over 0.4 million terms from 70 GB of corpora. We conduct extensive experiments to justify the effectiveness and necessity of the methods in our framework. We also apply MAPS-KB to three downstream tasks and achieve state-of-the-art performance, further demonstrating its value.
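For illustration only, a probabilistic simile triplet could be represented roughly as below. The field names, example entries, and scores are hypothetical and do not reflect MAPS-KB's actual schema or metrics.

```python
# Hypothetical representation of probabilistic simile triplets.
from dataclasses import dataclass

@dataclass
class SimileTriplet:
    topic: str            # the thing being described, e.g. "runner"
    shared_property: str  # the property shared with the vehicle, e.g. "fast"
    vehicle: str          # what the topic is compared to, e.g. "lightning"
    score: float          # a probabilistic confidence score for this triplet

kb = [
    SimileTriplet("runner", "fast", "lightning", 0.93),
    SimileTriplet("room", "quiet", "library", 0.71),
]

# Look up vehicles associated with a property, ranked by score.
fast_vehicles = sorted(
    (t for t in kb if t.shared_property == "fast"),
    key=lambda t: t.score,
    reverse=True,
)
print([t.vehicle for t in fast_vehicles])  # ['lightning']
```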
Simile interpretation is a crucial task in natural language processing. Nowadays, pre-trained language models (PLMs) have achieved state-of-the-art performance on many tasks. However, it remains under-explored whether PLMs can interpret similes. In this paper, we investigate the ability of PLMs to interpret similes by designing a novel task named Simile Property Probing, i.e., letting PLMs infer the shared properties of similes. We construct simile property probing datasets from both general textual corpora and human-designed questions, containing 1,633 examples covering seven main categories. Our empirical study on the constructed datasets shows that PLMs can infer similes' shared properties while still underperforming humans. To bridge the gap with human performance, we further design a knowledge-enhanced training objective that incorporates simile knowledge into PLMs via knowledge embedding methods. Our method yields a gain of 8.58% on the probing task and 1.37% on the downstream task of sentiment classification.
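A minimal sketch of the probing idea, assuming a standard masked-language-model setup via the Hugging Face transformers fill-mask pipeline: the model is asked to fill in the shared property of a simile. The example sentence is invented and this is not the paper's actual probing dataset or evaluation code.

```python
# Illustrative sketch: probe a masked LM for the shared property of a simile.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

sentence = "The soldier was as brave as a lion, so the soldier was very [MASK]."
for prediction in fill_mask(sentence, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
# A model that captures the simile should rank property words such as "brave" highly.
```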