Edgar Meij

Dense Retrieval Adaptation using Target Domain Description (ICTIR 2023)

Edgar Meij — Sun, 23 Jul 2023 08:46:39 +0000

In information retrieval (IR), domain adaptation is the process of
adapting a retrieval model to a new domain whose data distribution
is different from the source domain. Existing methods in this area
focus on unsupervised domain adaptation where they have access
to the target document collection or supervised (often few-shot)
domain adaptation where they additionally have access to (limited)
labeled data in the target domain. There also exists research on
improving zero-shot performance of retrieval models with no adaptation. This paper introduces a new category of domain adaptation
in IR that is as-yet unexplored. Here, similar to the zero-shot setting, we assume the retrieval model does not have access to the
target document collection. In contrast, it does have access to a
brief textual description that explains the target domain. We define
a taxonomy of domain attributes in retrieval tasks to understand
different properties of a source domain that can be adapted to a
target domain. We introduce a novel automatic data construction
pipeline that produces a synthetic document collection, query set,
and pseudo relevance labels, given a textual domain description.
Extensive experiments on five diverse target domains show that
adapting dense retrieval models using the constructed synthetic
data leads to effective retrieval performance on the target domain.

The post Dense Retrieval Adaptation using Target Domain Description (ICTIR 2023) appeared first on Edgar Meij.

ECIR 23 Tutorial: Neuro-Symbolic Approaches for Information Retrieval

Edgar Meij — Wed, 08 Mar 2023 09:45:18 +0000

This tutorial will provide an overview of recent advances
on neuro-symbolic approaches for information retrieval. A decade ago,
knowledge graph and semantic annotation technology led to active research on how to best leverage symbolic knowledge. At the same time,
neural methods have proven to be versatile and highly effective.
From a neural network perspective, the same representation approach
can service document ranking or knowledge graph reasoning. End-to-end
training allows complex methods to be optimized for downstream tasks.
We are at the point where both the symbolic and the neural research
advances are coalescing into neuro-symbolic approaches. The underlying
research questions are how to best combine symbolic and neural approaches, what kind of symbolic/neural approaches are most suitable for
which use case, and how to best integrate both ideas to advance the state
of the art in information retrieval.


Entity Retrieval from Multilingual Knowledge Graphs (MRL 2022)

Edgar Meij — Thu, 08 Dec 2022 09:44:01 +0000

Knowledge Graphs (KGs) are structured databases that capture real-world entities and their relationships. The task of entity retrieval from a KG aims at retrieving a ranked list of entities relevant to a given user query. While English-only entity retrieval has attracted considerable attention, user queries, as well as the information contained in the KG, may be represented in multiple (and possibly distinct) languages. Furthermore, KG content may vary between languages due to different information sources and points of view. Recent advances in language representation have enabled natural ways of bridging gaps between languages. In this paper, we therefore propose to utilise language models (LMs) and diverse entity representations to enable truly multilingual entity retrieval. We propose two approaches: (i) an array of monolingual retrievers and (ii) a single multilingual retriever, trained using queries and documents in multiple languages. We show that while our approach is on par with the significantly more complex state-of-the-art method for the English task, it can be successfully applied to virtually any language with an LM. Furthermore, it allows languages to benefit from one another, yielding significantly better performance for both low- and high-resource languages.


Similarity-based Multi-Domain Dialogue State Tracking with Copy Mechanisms for Task-based Virtual Personal Assistants (WWW 2022)

Edgar Meij — Sat, 09 Apr 2022 08:43:28 +0000

Task-based Virtual Personal Assistants (VPAs) rely on multi-domain
Dialogue State Tracking (DST) models to monitor goals throughout
a conversation. Previously proposed models show promising results
on established benchmarks, but they have difficulty adapting to
unseen domains due to domain-specific parameters in their model
architectures. We propose a new Similarity-based Multi-domain Dialogue State Tracking model (SM-DST) that uses retrieval-inspired
and fine-grained contextual token-level similarity approaches to
efficiently and effectively track dialogue state. The key difference
with state-of-the-art DST models is that SM-DST has a single model
with shared parameters across domains and slots. Because we base
SM-DST on similarity, it allows the transfer of tracking information between semantically related domains as well as to unseen
domains without retraining. Furthermore, we leverage copy mechanisms that consider the system’s response and the dialogue state
from previous turn predictions, allowing it to more effectively track
dialogue state for complex conversations. We evaluate SM-DST
on three variants of the MultiWOZ DST benchmark datasets. The
results demonstrate that SM-DST significantly and consistently
outperforms state-of-the-art models across all datasets by absolute
5-18% and 3-25% in the few- and zero-shot settings, respectively.

Understanding Financial Information Seeking Behavior from User Interactions with Company Filings (WWW companion 2022)

Edgar Meij — Tue, 05 Apr 2022 08:42:39 +0000

Publicly-traded companies are required to regularly file financial
statements and disclosures. Analysts, investors, and regulators
leverage these filings to support decision making, with high financial and legal stakes. Despite their ubiquity in finance, little is
known about the information seeking behavior of users accessing
such filings. In this work, we present the first study of this behavior.
We analyze 14 years of logs of users accessing company filings
of more than 600K distinct companies on the U.S. Securities and
Exchange Commission’s (SEC) Electronic Data Gathering, Analysis,
and Retrieval (EDGAR) system, the primary resource for accessing
company filings. We provide an analysis of the information-seeking
behavior for this high-impact domain. We find that little behavioral history is available for the majority of users, while frequent
users have rich histories. Most sessions focus on filings belonging
to a small number of companies, and individual users are interested in a limited number of companies. Out of all sessions, 66%
contain filings from one or two companies and 50% of frequent
users are interested in six or fewer companies. Understanding user
interactions with EDGAR can suggest ways to enhance the user
journey in browsing filings, e.g., via filing recommendation. Our
work provides a stepping stone for the academic community to
tackle retrieval and recommendation tasks for the finance domain.


Recent papers

Edgar Meij — Sun, 02 Jan 2022 17:47:00 +0000

Please look at Google Scholar for an up-to-date list of my most recent papers, and be sure to check out https://techatbloomberg.com/ai as well!


Improving Dialogue State Tracking with Turn-based Loss Function and Sequential Data Augmentation (EMNLP 2021)

Edgar Meij — Mon, 08 Nov 2021 09:41:12 +0000

While state-of-the-art Dialogue State Tracking (DST) models show promising results, all of them rely on a traditional cross-entropy loss function during the training process, which may not be optimal for improving the joint goal accuracy. Although several recent approaches propose augmenting the training set by copying user utterances and replacing the real slot values with other possible or even similar values, they are not effective at improving the performance of existing DST models. To address these challenges, we propose a Turn-based Loss Function (TLF) that penalises the model more for inaccurately predicting a slot value in early turns than in later turns, in order to improve joint goal accuracy. We also propose a simple but effective Sequential Data Augmentation (SDA) algorithm to generate more complex user utterances and system responses to effectively train existing DST models. Experimental results on two standard DST benchmark collections demonstrate that our proposed TLF and SDA techniques significantly improve the effectiveness of the state-of-the-art DST model, with an approximately 7-8% relative reduction in error, and achieve a new state-of-the-art joint goal accuracy of 59.50 and 54.90 on MultiWOZ 2.1 and MultiWOZ 2.2, respectively.
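
The idea of weighting early-turn mistakes more heavily can be illustrated with a small sketch. This is not the loss from the paper: the linear decay in `turn_weight` and both function names are assumptions chosen only to show how a per-turn weight changes an otherwise standard cross-entropy term.

```python
import math


def turn_weight(turn_index: int, num_turns: int) -> float:
    """Hypothetical linear decay: turn 0 gets weight 1.0, the last turn gets 1/num_turns."""
    return (num_turns - turn_index) / num_turns


def turn_weighted_loss(correct_probs: list[float], eps: float = 1e-12) -> float:
    """Average turn-weighted negative log-likelihood over one dialogue.

    correct_probs[t] is the probability the model assigned to the correct
    slot value at turn t (an illustrative simplification to a single slot).
    """
    num_turns = len(correct_probs)
    total = 0.0
    for t, p in enumerate(correct_probs):
        # Standard cross-entropy term, scaled by the turn-dependent weight.
        total += turn_weight(t, num_turns) * -math.log(max(p, eps))
    return total / num_turns


# The same mistake (probability 0.1 on the correct value) made early vs. late.
early_error = turn_weighted_loss([0.1, 0.9, 0.9])
late_error = turn_weighted_loss([0.9, 0.9, 0.1])
print(early_error > late_error)
```

Under this weighting, an error in the first turn contributes more to the loss than the same error in the last turn, which matches the intuition that an early mistake propagates to the dialogue state of every later turn.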


News Article Retrieval in Context for Event-centric Narrative Creation

Edgar Meij — Sun, 11 Jul 2021 16:47:00 +0000

Writers such as journalists often use automatic tools to find relevant content to include in their narratives. In this paper, we focus on supporting writers in the news domain to develop event-centric narratives. Given an incomplete narrative that specifies a main event and a context, we aim to retrieve news articles that discuss relevant events that would enable the continuation of the narrative. We formally define this task and propose a retrieval dataset construction procedure that relies on existing news articles to simulate incomplete narratives and relevant articles. Experiments on two datasets derived from this procedure show that state-of-the-art lexical and semantic rankers are not sufficient for this task. We show that combining those with a ranker that ranks articles by reverse chronological order outperforms those rankers alone. We also perform an in-depth quantitative and qualitative analysis of the results that sheds light on the characteristics of this task.

See https://doi.org/10.1145/3471158.3472247 for more details.
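
To make the combination of a relevance ranker with a reverse-chronological ranker concrete, here is a minimal sketch. The abstract does not say how the rankers are combined, so reciprocal rank fusion is used here purely as an assumed stand-in; the function names, document ids, and dates are hypothetical.

```python
from datetime import date


def recency_ranking(pub_dates: dict[str, date]) -> list[str]:
    """Rank article ids in reverse chronological order (newest first)."""
    return sorted(pub_dates, key=lambda doc_id: pub_dates[doc_id], reverse=True)


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several rankings: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical output of a lexical or semantic ranker, best first.
relevance = ["a", "b", "c"]
# Hypothetical publication dates for the same articles.
pub_dates = {"a": date(2021, 1, 1), "b": date(2021, 3, 1), "c": date(2021, 2, 1)}

fused = reciprocal_rank_fusion([relevance, recency_ranking(pub_dates)])
print(fused)  # → ['b', 'a', 'c']
```

In this toy example the most recent article, `b`, overtakes the top relevance result because it ranks well in both lists.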


Contextualizing Trending Entities in News Stories

Edgar Meij — Mon, 01 Mar 2021 10:39:00 +0000

Trends are those keywords, phrases, or names that are mentioned most often on social media or in news in a particular timeframe. They are an effective way for human news readers to both discover and stay focused on the most relevant information of the day. In this work, we consider trends that correspond to an entity in a knowledge graph and introduce the new and as-yet unexplored task of identifying other entities that may help explain why an entity is trending. We refer to these retrieved entities as contextual entities. Some of them are more important than others in the context of the trending entity and we thus determine a ranking of entities according to how useful they are in contextualizing the trend. We propose two solutions for ranking contextual entities. The first one is fully unsupervised and based on Personalized PageRank, calculated over a trending entity-specific graph of other entities where the edges encode a notion of directional similarity based on embedded background knowledge. Our second method is based on learning to rank and combines the intuitions behind the unsupervised model with signals derived from hand-crafted features in a supervised setting. We compare our models on this novel task by using a new, purpose-built test collection created using crowdsourcing. Our methods improve over the strongest baseline in terms of Precision at 1 by 7% (unsupervised) and 13% (supervised). We find that the salience of a contextual entity and how coherent it is with respect to the news story are strong indicators of relevance in both unsupervised and supervised settings.

See https://doi.org/10.1145/3437963.3441765.
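
As a rough illustration of the unsupervised approach, the sketch below runs Personalized PageRank by power iteration over a tiny weighted, directed graph, with all restart mass placed on the trending entity. The graph, the edge weights, and the handling of nodes without outgoing edges are assumptions for illustration; the paper's graph construction from embedded background knowledge is not reproduced here.

```python
def personalized_pagerank(
    edges: dict[str, dict[str, float]],
    seed: str,
    damping: float = 0.85,
    iterations: int = 50,
) -> dict[str, float]:
    """Power iteration for PageRank with all restart mass on `seed`.

    edges[u][v] is the weight of the directed edge u -> v (here standing in
    for a directional similarity score).
    """
    nodes = set(edges) | {v for targets in edges.values() for v in targets} | {seed}
    rank = {n: (1.0 if n == seed else 0.0) for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for u in nodes:
            out = edges.get(u, {})
            total = sum(out.values())
            if total == 0:
                # Assumption: a node without outgoing edges returns its mass to the seed.
                nxt[seed] += damping * rank[u]
                continue
            for v, w in out.items():
                nxt[v] += damping * rank[u] * w / total
        # Restart step: the walker jumps back to the trending entity.
        nxt[seed] += 1.0 - damping
        rank = nxt
    return rank


# Hypothetical graph around a trending entity with two candidate contextual entities.
graph = {
    "trending": {"a": 0.9, "b": 0.1},
    "a": {"trending": 1.0},
    "b": {"trending": 1.0},
}
scores = personalized_pagerank(graph, seed="trending")
contextual = sorted((n for n in scores if n != "trending"), key=lambda n: -scores[n])
print(contextual)  # → ['a', 'b']
```

Candidates that the trending entity points to more strongly accumulate more probability mass, which gives the ranking of contextual entities.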


Report on the first workshop on bias in automatic knowledge graph construction at AKBC 2020

Edgar Meij — Tue, 08 Dec 2020 09:38:47 +0000

We report on the First Workshop on Bias in Automatic Knowledge Graph Construction (KG-BIAS), which was co-located with the Automated Knowledge Base Construction (AKBC) 2020 conference. Identifying and possibly remediating any sort of bias in knowledge graphs, or in the methods used to construct or query them, has clear implications for downstream systems accessing and using the information in such graphs. However, this topic remains relatively unstudied, so our main aim for organizing this workshop was to bring together a group of people from a variety of backgrounds with an interest in the topic, in order to arrive at a shared definition and roadmap for the future. Through a program that included two keynotes, an invited paper, three peer-reviewed full papers, and a plenary discussion, we have made initial inroads towards a common understanding and shared research agenda for this timely and important topic.
