Aaron Tay's Musings about librarianship

Musings About Librarianship Is Moving—Farewell Blogger, Hello Substack!

2025-06-10T08:25:00.006+08:00

Aaron Tay's Musingsabout librarianship - Substack (https://aarontay.substack.com/feed)

A Home Full of Memories—And a New Door Opening

When I first hit “Publish” on Musings About Librarianship on Blogger back in 2009, I was a wet behind the ears librarian nervously sharing half-baked thoughts. I never imagined that my tiny blogger corner would grow into a gathering place for thousands of curious colleagues around the world.

Sixteen years later—after countless coffee-fueled midnight drafts and many generous words of encouragement—my little Blogger house is bursting at the seams. So, with a mix of nostalgia and fluttering excitement, I’m carrying every box, every post, every hard-won insight to substack- https://aarontay.substack.com/.

Why I Had to Move

Many of you subscribed through follow.it. Lately, that service wedged noisy ads between our conversations and hid full-text emails behind a paywall defeating the purpose of email subscriptions! The moment I saw this, I knew it was time to leave.

Substack, by contrast has been offering

Clean, ad-free emails
A pleasant web reading experience—no pop-ups or paywalls
A modern interface for editing
An active community space for comments and threaded discussion (seems like many of you were already on substack)

What’s Changed (and What Hasn’t)

You still get every post whether by RSS, Email - 100% free
You still get the same old rambling pieces by me

What Changed

The entire archive—every crazy or inspire idea, every typo—now lives on Substack.
Future posts will only be posted on Substack.
The old Blogger site (both posts and pages) will stand as a quiet museum, a place to wander down memory lane.

What you need to Do

Email subscribers: Sit back—your address was imported automatically to stack. If it wasn't, subscribe here.
RSS followers: Update your feed to https://aarontay.substack.com/feed(the old feed will stop refreshing with new posts).
Update your bookmarks: https://aarontay.substack.com/

A Heartfelt Thank-You

Thank You. You’ve read plenty of rambling post that struggled to make a point, peppered with typos and misconceptions (I am always learning!) and still stuck with me!

Here’s to the next chapter—still free, still ad-free, still powered by curiosity and caffeinated late nights. I can’t wait to keep learning alongside you.

Hope to see you soon, and thank you for every click, comment, share and quiet read.

With gratitude

Aaron Tay

Musings About Librarianship

Drafted and edited with help of ChatGPT 4o.

Comparative review on Primo Research Assistant, Scopus AI, Web of Science Research Assistant and a explainer for AI search for librarians

2025-05-30T22:41:00.003+08:00

May was a busy month for me in terms of output.

[Article] Comparative review of Primo Research Assistant, Scopus AI and Web of Science Research Output

First, I had two pieces of work published in the Katina Magazine that I am quite proud of.

First, a comparative review of Primo Research Assistant, Scopus AI, and Web of Science Research Assistant—written by yours truly—was published.

These are three of the most commonly used academic search tools in the library world, so I was honoured to be invited to write the piece.

Summon Research Assistant was released just as I was finishing the article. From a quick glance, it appears functionally equivalent to Primo Research Assistant.

This was one of the most technically complex pieces I’ve written. The original draft was more than twice as long. Eventually, with the help of editors and copyeditors, we decided to split the article into two parts. The first part focuses solely on how these three systems generate direct answers using Retrieval-Augmented Generation (RAG).

A second part will be forthcoming, comparing other features such as finding seminal papers, topic maps, and top experts.

[Article] A "explainer" of AI search for librarians

Even after splitting the review into two, the writing was still too long and unwieldy because I kept digressing into technical explanations.

Eventually, the editors had the brilliant idea of pulling out all the technical concept detours and publishing them separately as an “explainer” piece titled A Librarian’s Guide to AI in Academic Search Tools.

To be honest, when I first saw the title, I thought it was a bit overreaching. Understanding “AI”—even within just the field of information retrieval—is no easy task. I’m not even sure I fully understand it myself.

I expanded the piece a little to smooth out the rough edges (remember, it started as essentially footnotes to the comparative review). It covers the following topics:

Constructing an Answer with Retrieval Augmented Generation (RAG)
Understanding (Vector) Embedding Search
Why the Use of Embeddings in Retrieval Reduces Interpretability
Embedding Search in Practice
Why Embedding Search Leads to Less Reproducible Results
Reranking with Embedding Search
Hybrid Search and Rerankers

In some ways, writing this piece stressed me out even more. Leaving aside imposter syndrome (I’m self-taught!), I struggled to strike the right balance between conciseness, technical accuracy, and accessibility for librarians with no background in information retrieval.

A “fun” game I played was pasting chunks of text into frontier LLMs like o3 and Gemini 2.5 Pro and asking them to critique my writing. They were merciless—and made it clear how much expertise is required to be both concise and technically accurate! In case you’re worried about hallucinations, I’ve found that top LLMs are generally solid on the fundamentals of information retrieval—likely because much of the content is open access and there's a wealth of teaching material available.

In the end, I came to a conclusion: when writing technical explanations, you can only really choose two out of three of the following:

a) Concise

b) Comprehensive and Accurate

c) Comprehensibility to laypersons

I chose to sacrifice (b) in favour of being concise and comprehensible—even to librarians without a background in AI or information retrieval.

I’m not sure I succeeded. Even now, I’m unhappy with the opening sentence and I think should have changed some of the examples and analogies I used in the piece. I also really itch to revise and add more sections.

Still, even if the piece isn’t 100% technically precise, I believe it points in approximately the right direction for any librarian who wants to understand these emerging tools.

[ADV] Want to Go Deeper into AI and Information Retrieval?

As I write this, I’ve just wrapped up my 1.5-hour “Master Class” – Understanding the Fundamentals of AI in Academic Search. In total, 127 librarians and researchers from around the world registered—thank you for your support!

But 1.5 hours barely scratches the surface.

That’s why I’ve teamed up with my colleague, Senior Librarian Bella Ratmelia, to offer a more comprehensive course titled AI-Powered Search in Libraries: A Crash Course on Understanding the Fundamentals for Library Professionals.

It will be conducted online as part of FSCI 2025 (FORCE11 Scholarly Communication Institute in partnership with UCLA Library) from July 22–24, 6-9 pm (Pacific Time), in three sessions of three hours each.

Bella and I co-designed the course using sound pedagogical practices and plenty of interactive activities to help you build your intuition for technical concepts like embeddings, LLMs, and RAG. The extra time will also allow us to experiment with AI academic search tools. Unlike my previous talks, we’ll also dedicate an entire session to testing methodology. No coding knowledge is required.

I’m excited to have more time to share and learn together. If you’re interested, please register here.

Scholarships are available for participants from the Global South, but applications must be submitted by June 20, 2025.

[Recording] Playing Devil’s Advocate on AI Search Engines

This month, I was invited to give several talks. Some weren’t recorded, and others were private sessions.

However, my keynote at the Librarian Futures Virtual Summit (hosted by Technology from Sage) was recorded and is available for public viewing.

In this talk, I took a different approach—playing devil’s advocate on the topic of AI search. This is a technique I’ve used before in my career, blogging about then new developmenets like citation based mapping services (e.g. Connected papers), Institutional Repositories, Web Scale Discovery Services (e.g. Summon), mobile related library services and more in the past.

I actually have content for a longer version of this talk, which I’ll likely revisit in a future blog post.

Conclusion

I’ve been exploring information retrieval and AI for almost three years now. This year, I decided to try sharing what I’ve learned in a more structured and deliberate way. I hope my work helps librarians better navigate and understand this fascinating—if sometimes bewildering—topic.

Text has been copyedited with the help of gpt4o.

Ai2 Paper Finder and Futurehouse PaperQA2: More transparent Deep Search for Scholars?

2025-05-12T01:48:00.002+08:00

As an academic librarian, I’m often asked: “Which AI search tool should I use?” With a flood of new tools powered by large language models (LLMs) entering the academic search space, answering that question is increasingly complex. Both startups and established vendors are rushing to offer “Deep Search” or “Deep Research” solutions, leading to a surge in requests to test these products. As a working academic librarian even one with a focus on AI search, I have limited time to invest in these new tools, and only ones that stand out in performance, novelty or price are likely to catch my attention.

Honestly, many are unstable or perform poorly, making them unready for widespread use. My current benchmark for performance is Undermind.ai. If a "Deep Search" product performs significantly worse than Undermind in retrieving relevant papers for my sample queries, I quickly lose interest. I know my sample tests aren’t foolproof, but I have to draw the line somewhere!

In today’s blog post, I’ll highlight two academic deep search/research tools that caught my attention: Ai2 Paper Finder and Futurehouse’s PaperQA2-based search. These two tools are currently free, and perform at roughly the same level as my current favourite Deep Search Academic tool - Undermind.ai.

More importantly, while Undermind offers a polished, user-friendly interface, its process feels like a black box, revealing little about how it retrieves results.

In contrast, Ai2 Paper Finder displays every step—query analysis, search workflows, and relevance judgments—making it easier for librarians to trust and explain results to researchers.

Similarly, Futurehouse’s Reasoning tab provides detailed insights into the papers found and the evidence considered, empowering users to understand the search process.

For academic librarians, transparent tools foster trust and enhance our ability to guide researchers in navigating and evaluating AI-driven search results effectively.

[ADV] Want to learn about the fundamentals of AI in academic search from me?

I rarely do this, but this is likely of interest to my readers.

I am offering a "Master Class" on May 29, 2025, from 4 PM to 5:30 PM (SGT), tailored for librarians navigating the new era of academic AI search. Prioritizing general principles and understanding over specific tools, this session will equip you with the foundational knowledge to excel.

As I write this post, there are now less than 10 seats left out of the 50 for my "Master Class." Can’t attend live? You get recordings! Register here.

1. Ai2 Paper Finder

Regular readers of this blog are familiar with Ai2, or the Allen Institute for AI, and in particular the work they do on their Semantic Scholar search engine. Arguably more significant is that they provide much of the dataset behind Semantic Scholar openly for use, and many—if not most—of the startups in the academic search space, such as Elicit.com and Undermind.ai, use this dataset as a base for their search engines.

More recently Ai2 have also been developing their own fully open LLMs such as OLMo 2 and now

Ai2 themselves have gotten into the act applying LLMs to search, and in March 2025, they launched Ai2 Scholar QA, their version of Deep Research that generates long-form reports. See my coverage of academic Deep Research tools here.

But Deep Research tools are extremely hard to evaluate, requiring not only careful reading of all citation statements to ensure they are faithful to the source but also subjective assessments of the coherence of the writing. As such, my current interest lies in evaluating results from Deep Search alone.

This generally involves examining the list of results generated by the Deep Search engine and checking whether most of the known gold-standard results are included.

I will discuss this further in a future post, but using articles from published systematic reviews as a gold standard often makes the "test" too easy for such Deep Search systems. They will inevitably find the systematic review and, through iterative citation searching, identify most of your gold-standard papers.

Indeed, Ai2 has launched a seperate deep search called Ai2 Paper Finder which recognises that

literature search is a multi-step process that involves learning and iterating as you go.

It creates an agent that simulates this process by

break[ing] down your query into relevant components, searches for papers, follows citations, evaluates for relevance, runs follow-up queries based on the results, and then presents not only the papers, but also short summaries of why the paper is relevant to your specific query.

If this sounds familiar, it’s exactly what Undermind does, and Ai2 even name-checks Undermind as

working in the same domain with a similar goal of finding good and exhaustive sets of papers on a topic.

Ai2 Paper Finder: Overview

So why do I spotlight Ai2 Paper Finder?

First, I believe it is intended to be free (as in free beer) for the foreseeable future. Second, they aim to be as "open as possible." Although they do not currently open-source their code, they

plan to release more of our(their) source code in the future.

Even without releasing their code, Ai2 Paper Finder is far more transparent about its processes.

Unlike Undermind.ai, which hides its process behind a generic placeholder screen, Ai2 Paper Finder acts like an open book. It displays every step—think of it as a librarian showing you exactly how they tracked down your sources—making it easier to trust and understand the results.

Query Analysis and Intent

For more context, the introductory blog post to the tool provides a fairly detailed explanation of the inner workings of Ai2 Paper Finder, and I’ll quote liberally from it in italics below.

First, the query analyzer goes to work and does two main things. It determines the query intent, which essentially has two modes:

Searching for a specific known paper
Searching for a set of papers on a topic

My observation : In my early testing of Ai2 Paper Finder, it erroneously assumed I was searching for a specific known paper too often, which caused failures and occasional crashes. After providing feedback to the Ai2 Paper Finder team, this issue seems mostly resolved.

I find it interesting that unlike most Deep Search or Deep Research tools like Undermind.ai, Gemini Deep Research, OpenAI Deep Research, it doesn't ask for clarification and goes straight off to search. This could help the system decide which mode to go into.

Second, the query analyzer checks if the query string includes metadata criteria like author, year, or journal. It can also recognize terms like "central," "popular," or "recent."

For example as seen below, it recognizes "Accounting Review" as a journal and the years of publication specified.

My observation : It doesn’t yet work for metadata like institutional affiliation.

Search Workflows and Sub-Flows

The analyzed query is then passed to the "query planner," which launches several predefined "workflows," including:

Specific paper search
Semantic search with potential metadata constraints
Pure-metadata queries
Queries involving an author name

Each of these sub-flows (that can be thought of as "sub-agents") return a set of paper-ids, and in sub-flows that include a semantic criteria, each paper-id is also associated with a list of evidence, and a ranking score reflecting its judged adherence to the semantic criteria.

Like any good system today, reranking is performed, and result sets are:

re-ranked according to a formula that combines the semantic relevance with metadata criteria, such as prioritizing more recent and more highly cited papers. This is influenced by query modifiers such as "recent", "early", "central" or "classic" that increase the emphasis on the metadata criteria over the semantic one.

The specific paper subflow

The specific paper sub-flow is designed to help you find the one specific paper you’re thinking of but can’t recall the title. It employs three strategies in parallel:

Searches the Semantic Scholar title-search API
Search for sentences containing the core search terms, focusing on those that also contain a citation and looking for what the majority of them cite (for example, in the query “the alphageometry paper”, it will search for “alphageometry” and see what is cited in its vicinity)
Asks the LLM directly for a paper and verifies its existence via the Semantic Scholar title-search API

Observation : As researchers, you’ve likely experienced wanting to cite a paper you vaguely remember but can’t find. This sub-flow seems designed to address this use case.

Initially, I even mistakenly thought the name Ai2 "Paper Finder" implied it was only for this purpose!

While the idea is good, I find it often fails for me. I suspect this is because what a paper actually says may differ from what it’s commonly cited for, which may differ from what you want to cite it for. For example, a paper might be cited for a certain conclusion, but you found a nuance in its research method interesting and want to cite it for that. Given the algorithm, you can see why Ai2 Paper Finder struggles here.

Semantic search sub-flow

The explanation in the blog post isn’t entirely clear when I read it, but here’s OpenAI's o3's interpretation based on the text.

It sound plausible to me and should be close enough I guess...

Iterative searching

The search process then enters another round, based on the most relevant papers so far. In this round, an LLM reformulates more queries based on the original query and the text of the found relevant papers, and sends them to the above-mentioned indices. Additionally, it does both forward and backward citation tracking based on the most relevant papers, which is again followed by LLM-based relevance judgment. The process continues for several rounds and stops when it either finds enough papers or scans too many candidates.

Observation: Nothing surprising here, but the fact that it performs both forward and backward citation tracking on the most relevant papers explains why using published systematic review papers as a gold standard is not a good test. The system will quickly find the systematic review, identify it as highly relevant, and citation searching will uncover the rest. (I’ve seen this in practice.)

How relevance is judged

Developing a reliable method for relevance judgment was challenging. While the system is still evolving, we found a "mini breakthrough" that significantly improved results and usability: ask an LLM to break the user's semantic criteria into multiple semantic sub-criteria, to be verified separately. For example, in the above query about dialog datasets, these sub-criteria would be “introducing an unscripted dialog dataset”, “English language”, “annotated speaker properties” and “relation between dialog and annotation”. Each of these titles also includes a brief elaboration. Then, the relevance-judging LLM is asked to first provide adherence to each of the individual sub-criteria, which are only then combined into a final score and final judgement.

Observation : This seems to involve asking an LLM to assess each candidate paper based on sub-criteria, somewhat similar to how LLMs are used for screening in systematic reviews, where they score title/abstracts based on PICO (Population, Intervention, Comparison, Outcome).

In the above example, the query is "Paper showing Google Scholar can be used for systematic review instead of using multiple databases" and three sub criteria used by the LLM to judge relevancy are

Google Scholar as a primary database - judged "perfectly relevant"
Systematic Review Methodology -judged "somewhat relevant"
Comparison to multiple databases - judged "perfectly relevant"

with overall judgement of "relevant"

This is somewhat similar to how LLMs are used for screening in systematic reviews and they are instructed to score title abstracts based on PICO.

Impact of citing papers

In the relevance judgment phase, an LLM goes over all the candidate papers and judges how well they match the semantic criteria in the user's query. For each candidate paper, we format it as markdown text that includes the paper's title, abstract, and the returned snippets from the search, where the snippets are ordered by the paper order, and include the section titles of the sections in which they appeared. If the returned snippets include snippets that cite this paper, these snippets are added as well.

Observation : There’s a “Relevant evidence from 4 citing papers” section, presumably affecting relevance?

Fast mode vs work harder mode

As you may imagine, the full process is effective but rather lengthy... For this reason, we also introduce a fast mode that does less work: it retrieves fewer papers in the initial stage based on the user's semantic criteria without additional reformulations and without the follow-up iterative procedure.

This fast mode is the mode that runs by default, so you don't wait two or three minutes for each response. Based on the results, you can then ask Paper Finder to "work harder" in which case it will invoke the more exhaustive mode described above. You can also invoke the exhaustive mode directly by asking Paper Finder for "an extensive set of paper about X" or something similar in the original query. This way you can get good and (relatively) fast answers to 80% of your queries, while getting higher quality and exhaustiveness for the 20% of queries that require the exhaustive mode.

Observation : Fast mode works well, but I've found asking it to “work harder” often doesn’t significantly improve results which is by design I suspect.

Limitations

As for the semantic queries, while we get top results on academic benchmarks such as LitSearch and Pasa, there is still a lot to do. In particular, we’ve already identified several areas which are particularly challenging: queries when the user does not know the right vocabulary, overly long and rambling queries where the user enters a very long, paragraph-length description of their intents, some queries that involve a combination of multiple semantic criteria where each of them appears in different part of the paper, and queries that search for things that are inherently hard to search for using an index (e.g. numeric ranges such as in "training techniques for models with more than 7b parameters", or negated semantic criteria as in "fairness papers that do not discuss race or gender")... Finally, the system is now strong but quite rigid, and while it is influenced by LLM decisions, the flows are predominantly shaped by the researchers and engineers in our team. This is powerful and effective but also limiting (as an almost trivial example, a query like "the bert paper and the roberta paper" is currently not handled well, and could be trivially supported by a more dynamic, LLM-controlled flow). Going forward, we'd like to see more and more decisions delegated to the LLM, supporting more dynamic and ad-hoc flows.

Observation : In The Differences between Deep Research, Deep Research, and Deep Research , the author classifies deep research tools in two dimensions.

a) Depth of search - Shallow vs Deep

b) Hand crafted vs Trained

Ai2 Paper Finder is clearly on the handcrafted side, with specifically defined sub-flows (rather than the LLM being trained via reinforcement learning), making it vulnerable to performing poorly on unexpected scenarios, as noted above (e.g., long, rambling queries).

Another area we recently started to explore is interactivity and multi-turn interactions. Real world search is not a one-shot process: once there are results, the searcher may like to refine the query. This refinement may refer to the returned results ("these are great but now focus on memory efficiency" or "the third and fourth are great, can you find more like these"), and we'd like the follow-up queries to take this into account.

Observation : Agreed

Overall assessment : I spent considerable time testing Ai2 Paper Finder against Undermind, and its performance is very close, though it can be less robust, with occasional unexpected failures (hopefully now corrected). I suspect Ai2 Paper Finder is still more sensitive to unexpected query inputs than Undermind.

UI-wise, Undermind.ai offers a polished experience, while Ai2 Paper Finder has room for improvement, though I’m pleased they adopted my suggestion to add more filters for criteria and relevance.

Criteria filters are particularly helpful for focusing on the most important aspects of your query that must match.

2. Futurehouse Platform search

Besides the commercial Undermind.ai, which dates back to early 2024, Futurehouse’s PaperQA2 was another early “agentic search” in the academic search space. An open-source project, you can read details in the preprint or this blog post.

More recently, they launched a platform offering “AI agents for scientific discovery.”

They offer three types of searches (leaving aside “Phoenix,” which is designed for chemistry-only tasks):

1. Crow: Based on the original PaperQA2, good for specific questions.

2. Owl: Used for precedent search, ideal for checking if something has been done before. This reminds me of Ai2 Paper Finder’s “specific paper” mode.

3. Falcon: Used for Deep Search, producing long reports with many sources, likely comparable to Ai2 Paper Finder and Undermind in standard topic searches.

The data source they are using is "38 million papers on PubMed, 500,000+ clinical trials" and Open Access papers.

Note that while, PaperQA2 is open source, these new agents are not. You can access them via an API or the free web interface.

So why do I spotlight Futurehouse?

It’s currently free, and the results are good, though I haven’t tested it as extensively as Ai2 Paper Finder.

How does it work?

My experience with “Owl” for finding specific papers I vaguely recall has been underwhelming, for the same reasons as Ai2 Paper Finder. So, I’ll focus on Falcon (Deep Search) and Crow (Concise Search).

Falcon Deep Search vs Crow Concise Search

Interestingly, when I tested the same query, Crow (Concise Search) took longer than Falcon (Deep Search), contrary to my expectations!

For the same query, Falcon (Deep Search) took 5 minutes, while Crow (Concise Search) took over double the time—12 minutes!

Despite spending less time, Falcon ran more queries (71 vs. 48). In terms of sources considered, Falcon evaluated 45, found 20 relevant, and referenced 14, while Crow considered 110, found 4 relevant, and referenced only 2!

I think Falcon isn’t necessarily “deeper” but produces a broader literature review than Crow. For example, Falcon’s output spanned 22,144 characters, while Crow’s output was only 3,247 characters, reflected in the “result tokens” used. Crow’s answer dives straight into studies estimating Google Scholar’s index size, while Falcon provides a longer, scene-setting report.

In this case, Falcon also performed better, finding 20 relevant papers vs. Crow’s 4. To be fair, Falcon’s broader literature review makes it easier to find relevant papers. Still, looking at the actual output Falcon clearly outperformed Crow in surfacing studies estimating Google Scholar’s index size, so I’ll focus on Falcon Deep Search for the rest of this post.

Falcon Deep Search interface

Falcon’s interface is the same as Crow’s and Owl’s.

Besides task details (which reveal the LLM models used), there are three tabs: “Results,” “Reasoning,” and “References.”

References Tab

The References tab lists the references used in the generated answers. Two things stood out: some references are tagged with labels like “Domain leading,” “Highest quality,” or “Peer reviewed.”

Under each paper, contexts (both used and unused) are listed (e.g., 1.1, 1.2, 1.3).

Mousing over them shows what appears to be a “reasoning trace”—typically a summary like “The excerpt from Author (year) discusses…” Each context (1.1, 1.2, etc.) may start similarly but generally differs.

Result Tab

Reasoning tab

The Reasoning tab is the most interesting, offering insight into the system’s steps, including papers found and “evidence found.”

Conclusion

As the landscape of AI-powered academic search tools continues to evolve, the proliferation of "Deep Search" and "Deep Research" products presents both opportunities and challenges for academic librarians and researchers. To me, Ai2 Paper Finder and Futurehouse’s PaperQA2-based search stand out in a crowded market by offering robust performance, transparency, and free access, with performance that rivals my favourite Undermind.ai.

In my view, while Undermind retains an edge in user experience and polish, the openness of Ai2 Paper Finder’s processes and Futurehouse’s detailed reasoning traces provide valuable insights into how these tools operate, fostering trust and enabling users to better understand their search workflows.

These tools demonstrate that effective "deep search" capabilities are becoming more accessible, and the push for transparency could become a key differentiator in this rapidly evolving market. Although both have areas for refinement, from Ai2 Paper Finder’s occasional sensitivity to query inputs and UI improvements, to the need for more extensive testing of Futurehouse's offerings, their current performance and open approach are commendable. Their emergence signals a healthy dynamism in the field, offering powerful, free alternatives that challenge established players and empower users with greater insight into the AI-driven discovery process. As these and other tools continue to mature, the quest for the ideal AI academic search companion – one that is effective, transparent, and user-friendly – remains an exciting one to watch.

Testing AI Academic Search Engines - What to find out and how to test (2)

2025-05-02T03:56:00.000+08:00

Following my recent talk for the Boston Library Consortium, many of you expressed a strong interest in learning how to test the new generation of AI-powered academic search tools. Specifically, evaluating systems using Retrieval-Augmented Generation (RAG) was the top request, surpassing interest in learning more about semantic search or LLMs alone.

This is a crucial topic, as these tools are rapidly entering our landscape. This post outlines my current thinking on practical ways librarians can evaluate and understand them through a series of questions with answers you might be able to obtain via reading the documentation, asking the vendor or light-weight testing of the system directly.

What Are We Evaluating? AI Academic Search with RAG

First, let's clarify the type of tool we're discussing (different types of tools may requre different approaches). I'm focusing on systems that:

1. Function as Search Engines: They use your query specifically to search an academic corpus (unlike general chatbots that might optionally search - as they have "search as a tool" functionality).

2. Generate Summaries with Citations: They use RAG to produce a synthesized text answer (a paragraph or more) based on retrieved documents, including citations linking back to those sources.

Examples include Elicit, Scite Assistant, Consensus, Scopus AI, and the research assistants for Primo and Web of Science.

These tools often present results with a generated summary alongside the list of source documents, though the exact layout varies

This is of course a simplified layout, there are still many UI decisions to be made. Given how latest studies still show RAG systems generate sentences that are not fully supported by valid citations, it is important for the system to carefully consider how the citations should be displayed in the generated answers and to make it easy to verify generated statements with citations. e.g. Are they inline numbers [1], hover-over highlights?

Below shows another possible layout from Scite Assistant with the references on the right panel.

Examples of what I consider AI academic search includes Elicit.com, Scite assistant*, Consensus, Scopus AI, Primo Research Assistant, Web of Science Research Assistant and many more (see list here)

But what should we test and how?

The RAG Challenge

RAG systems work in two main stages:

1. Retrieval: Finding relevant documents based on your query.
2. Generation: Using a Large Language Model (LLM) to synthesize an answer based only on the retrieved documents.

The special prompt that is built-in to the RAG system could uses a prompt like this

The quality of the final answer heavily depends on Stage 1: Retrieval. If the system fails to find the right information, the LLM has nothing good to work with, leading to weak, incomplete, or even misleading answers. Therefore, evaluating the retrieval component is critical.

A Practical Approach for Librarians

While academic researchers have rigorous (and time-consuming) methods for evaluating information retrieval (like TREC), librarians need practical approaches. We need to understand these tools well enough to guide users and make informed decisions, without needing weeks of formal testing.

The preliminary "framework" of questions presented here focuses specifically on questions designed to help librarians understand the functional performance and underlying mechanics of these RAG systems.

You can investigate the answers to these questions through checking the documentation and targeted testing. I've marked suggestions for hands-on testing with [TEST].

I tested this framework during a recent comparitive review of Primo Research Assistant, Web of Science Research Assistant and Scopus AI (to be published in Katina resource reviews).

Key Questions for Evaluating RAG Academic Search Engines

To structure our evaluation, we can draw inspiration from comprehensive frameworks designed for assessing academic search systems. For instance, Gusenbauer & Haddaway (2020) provide extensive criteria for evaluating the suitability of search tools, particularly for demanding tasks like systematic reviews.

The questions below adapt and simplify elements to ones relevant mostly to RAG systems only, focusing on practical insights librarians can gain through documentation review and targeted testing.

Also it's important to note that this practical testing framework deliberately excludes broader, though equally critical, evaluation areas that can be asked such as:

Ethical Considerations: Questions around copyright and ip implications of the training and use of LLMs

Environmental Impact: Assessing the computational resources and energy consumption associated with running these often complex AI models.

Let's break down the evaluation into understanding the retrieval process, the generation process, and the user interface for verification. I've marked suggestions for hands-on testing with [TEST].

Part 1: Understanding the Retrieval Component

What content is being searched (The Index)?

Why it matters: The RAG answer depends entirely on what's retrieved. Knowing the source index reveals potential coverage gaps.

Does it include only openly scholarly metadata from Semantic Scholar or OpenAlex (like Undermind.ai, Elicit.com) or does it use a proprietary index (like Scopus or Web of Science core collections)? Does it retrieve over just metadata only or over full-text (open access only, or includes some paywalled?)

Does it use the whole source or filters away any subsets? This could either be by choice (e.g. Semantic Scholar fuels both Undermind.ai and Elicit.com, yet they show different number of indexed items due to different ingestion criteria and updating strategies, while Scopus AI searches only over Scopus data from 2003 only) or optout by content owner ((like Elsevier in Primo Research Assistant)
Does it search only holding's licensed by the institution (e.g. Web of Science Core Collection owned) or the entire index regardless of holding (like Primo Research Assistant)?

How to check: Review documentation.
[TEST]: If unclear whether full text is used, design queries where the answer likely resides only in the full text and see if the system can answer accurately. Check and verify for known content opt-outs by purposely searching for such content.

How does the search actually work (Retrieval Mechanism)?

Why it matters: Most RAG systems encourage typing your input in natural query. While keyword search will probably be a part of the retrieval mechanism, it is likely to use other methods. This affects relevance performance, interpretability and reproducibility of search.

Does it use an LLM to translate your natural language query into a Boolean string (like Scopus AI, WoS RA, Primo RA)?
Does it use "semantic" search (like dense vector embeddings or learned sparse retrieval like Elicit)?
Does it do any form of two stage re-ranking? (e.g. Primo Research Assistant reranks the top 30 results with embedding search)
Or Is it a hybrid approach combining multiple methods?
See last blog post for more details

How to check: Check documentation though not all details are always mentioned. Still Vendors, particularly traditional providers often highlight if they use natural language processing or semantic search.
[TEST]: You might check the interface for clues (e.g., some tools show you the generated Boolean query if it uses LLM to generate boolean search strategies and you can check for example to see if Primo Research Assistant really creates boolean search strategies the way the documents describes it.). You can also check if there is any additional reranking by checking the default relevancy sort of the keyword search against the actual top results. If they are different, there is likely some type of additional reranking.

Are the search results consistent and explainable (Reproducibility & Interpretability)?

Why it matters:Interpretability of search results is important for certain use cases (e.g. Systematic reviews). This is mostly a matter of the retrieval mechanism. With regards to reproducbility, if the same query returns different source documents each time, the generated RAG answer will of course also vary!

LLM generated search strategies are interpretable but semantic search based methods tend to be black boxes
LLM-generated search strategies and some semantic search methods can be less deterministic than traditional keyword search.
LLMs might also be used in other parts of the retrieval mechanism such as for query expansion which can also lead to more randomness.

How to check: Understand the retrieval mechanism (see above).
[TEST]: Assuming the RAG system uses LLM to generate boolean. Run the exact same query multiple times (e.g., 5 times in quick succession). Look at the generated Boolean query in the interface (like Scopus AI, WoS Research Assistant), how much does the generated Boolean change? (I found WoS Research Assistant fairly consistent - perhaps having a different query 1 in 5 times, while Scopus AI varied more).
[TEST]:Compare the list of top N source documents retrieved each time (where N is the number used for generation, see below). Does the order or composition change significantly? (when tested 5 times in quick succession, you may want to repeat each batch of 5 a few times)

Does the natural language search understand search constraints (Metadata Parsing)?

Why it matters: Can you refine your search using natural language for common fields? For example, does "peer-reviewed articles on climate change from 2020-2024" correctly apply filters for publication type and publication date?
How to check: Documentation might list supported natural language commands. While both RAG search tools have limited support e.g. Primo Research Assistant supports year of publication and limited article types, Web of Science Research Assistant supports a dazzling array of metadata queries using natural language.
[TEST]: Verify claimed support of metadata parasing by try queries incorporating dates, author names, affiliations, publication types, citation counts, etc. (e.g. for Web of Science Research Assistant try "Papers by Patel from MIT on genomics since 2022," "Review articles on solar panels"). See if the results reflect these constraints accurately.

Does it handle non-English queries?

Why it matters: Can users query in languages other than English, even if the underlying corpus is primarily English? This is often possible with LLM-based query interpretation or multilingual embedding models.
How to check: Documentation might state language support.
[TEST]: Input a few queries in another language you understand (e.g., French, Spanish, Chinese) and see if the system retrieves relevant English-language documents.

Part 2: Understanding the Generation Component

What is the exact RAG method used

Why it matters: RAG is a general term these days and there are many different variants and techniques such as GraphRAG, RAGFusion etc that can lead to quite different results
How to check : In general, the main way is to check the documentation or ask the vendor.

How many retrieved results feed the summary (Top N)?

Why it matters: This indicates the breadth of information the LLM uses to generate the answer. Is it fixed (e.g., top 5 for Primo RA, top 8 for Scopus AI) or dynamic? Note that the final answer might not cite all N documents if some weren't deemed relevant by the LLM.
How to check: Documentation usually states this.
[TEST]: Run a search and look at the interface which will typically list all the retrieved references even those that are not used in the answer.

What LLM is used for generation?

Why it matters: Different LLMs have varying capabilities (though this is often a black box not stated explictly). Some tools might offer choice of models to use (like Scite Assistant).
How to check: Usually only knowable if stated in the documentation. Often, vendors don't disclose the specific model.

Can the generated answer be in non-English?

Why it matters: If a user queries in French, will the generated summary also be in French, or will it default to English? Systems vary (e.g., Scopus AI answers in English; Primo RA/WoS RA answer in the query language).
How to check: Documentation might specify.
[TEST]: Use a non-English query (as tested in Part 1) and observe the language of the generated summary.

Part 3: Evaluating the User Interface and Verification

How are citations displayed and linked?

Why it matters: Verification is key. How easily can a user connect and check a specific statement in the generated text back to the supporting evidence in the source document? Are citations inline numbers [1], hover-overs, linked phrases, or listed separately? Is it clear which part of the source supports the claim?
How to check: Examine the user interface.
[TEST]: Try to trace a few claims in the generated text back to the cited sources. How easy and accurate is this process? Can you quickly access the source abstract or full text?

Important Note: What We Haven't Covered Yet – Evaluating Output Quality

It's crucial to recognize that the questions and tests outlined in this post primarily help us understand how these AI academic search engines are built and function. We've focused on dissecting their components: the underlying index, the retrieval methods used, the generation process parameters, and the UI mechanisms for verification. This provides a foundational understanding of the system's mechanics and potential capabilities or limitations.

However, we have deliberately not yet addressed the direct evaluation of the quality of the final output – the generated RAG answer itself. Assessing critical aspects such as:

Accuracy and Faithfulness: Does the generated text correctly represent the information found in the cited sources? Are there factual errors or "hallucinations"?
Relevance: How well does the generated answer actually address the user's specific query?
Completeness and Conciseness: Does the summary capture the key information from the sources without being overly verbose or missing crucial points?
Usefulness: Is the generated summary genuinely helpful for the user's information need?

Evaluating these performance aspects requires different approaches, drawing from both long-established Information Retrieval (IR) evaluation techniques (like assessing the relevance of retrieved documents which heavily affects the RAG result) and newer methods specifically developed for evaluating RAG systems (such as measuring faithfulness, answer relevance, and guarding against hallucination).

Measuring this crucial aspect of RAG system performance will be the focus of the next part of this blog series.

Conclusion

Evaluating these AI academic search tools requires a blend of understanding their mechanics and practical testing. By asking the targeted questions detailed above and performing simple tests, librarians can gain a much deeper feel for a system's potential strengths, weaknesses, coverage gaps, and operational characteristics. This foundational knowledge is essential for guiding our users effectively and making informed decisions as we integrate these powerful, but still evolving, tools into our library services. Stay tuned for the next post where we will tackle the challenge of evaluating the quality of the answers these systems produce.

This blog post has been edited with the help of Gemini 2.5 Pro.

Technical note 1 : Interpretability and reproducibility of search

Based on your understanding of how retrieval works you should already have a sense how non-deterministic or misinterpretable the search is. Roughly you would expect keyword search to be most interpretable and reproducible, with Dense Embedding search being the least interpretable and reproducbilty.

See this blog post for more information, since retrieval systems may use a blend of such systems, things can be even less clear.

Technical note 2 : multi-lingual support

Even though your sources might be mostly in English, many RAG systems are capable of "understanding" your query even if it is inputted in any of the dozen non english languages.

Why do these new AI search systems generally work with non-english language? This is actually one of the benefits from moving away from just a keyword based search.

Firstly if it uses LLMs to generate Boolean search strategies, this clearly works even if you input your query into say Chinese as most modern LLMs are multi-lingual and are capable of "understanding" your query and creating an approproiate Boolean in English.

How about the more common dense embedding models? While not guaranteed (e.g. there are monolingual embeddings like MiniLM) a lot of the dense embeddings used are are built upon transformer models that are pretrained multilingually (e.g. BERT). They learn to map text from multiple languages into a shared vector space where similar concepts are represented close together, regardless of the language used.

You essentially just need one index to handle multiple languages. Compare this to lexical keyword based methods like BM25 where you need a seperate inverted index for each language!

The reproducibility and interpretability of academic Ai search engines like Primo Research Assistant, Web of Science Research Assistant, Scopus Ai and more

2025-04-14T22:04:00.004+08:00

I recently compared three academic AI search tools: Primo Research Assistant, Web of Science Research Assistant, and Scopus AI for a review article.

Why these three? Mainly because they are add-ons to extremely well-established academic search engines or databases:

Primo: Owned by Exlibris (a Clarivate company), Primo is one of the four major discovery systems used by universities, often serving as the default library search box. It is also, the only one of the three bundled in "free" with Primo.
Web of Science (WoS): Provided by Clarivate, WoS is the pioneer and oldest of the "Big Three" citation indexes.
Scopus: Developed by Elsevier, Scopus is a major competitor to Web of Science and is currently used in important global university rankings, such as the Times Higher Education (THE) World University Rankings and the QS World University Rankings.

(Note: Summon, the sister discovery service to Primo, also from Exlibris/Clarivate, launched its Summon Research Assistant last month. From all appearances, it seems identical to Primo Research Assistant, but I will not discuss it further here.)

I won't delve into a full comparison of how these three tools are similar and different from each other or from other academic AI search tools in this post. Instead, I want to focus on the reproducibility and interpretability of the search results they provide.

Reproducibility of the Search results vs Retrival Augmented Generation Answer

In general, Retrieval-Augmented Generation (RAG) answers use a Large Language Model (LLM) to synthesize information from top retrieved documents. Because LLMs are involved, the generated answer can vary slightly even if the underlying search consistently retrieves the exact same top items.

The non-deterministic nature of Transformer-based LLMs is well-known to anyone who uses ChatGPT, where the exact same input prompt can yield slightly different responses. Advanced users employing LLM APIs know you can reduce this randomness by adjusting settings like temperature, Top P, and Top K. However, even with settings aimed at maximum consistency (e.g., temperature=0, Top P=0, Top K=1), some small degree of variability often remains. I understand this is caused by factors like parallelization in computation, rounding errors in floating points and more

Of course, if the retrieved results themselves differ, the RAG-generated answer will almost certainly differ as well. As you will see unlike keyword search based retrieval techniques, non-keyword search based techniques often bring interpretability and reproducibility issues.

High level view of how retrieval works in Academic AI Search Engines

Let's consider how various academic AI search engines might use your natural language input to find relevant literature. The above image shows my best understanding of how Primo Research Assistant, Web of Science Research Assistant and Scopus Ai work based on the documentation and some tests.

The approach taken significantly impacts the reproducibility and interpretability of the results. This generally depends on the search mechanism(s) employed.

This can look very technical, if you having been looking into details of information retrieval but let me try to explain.

Technical Note: Gusenbauer & Haddaway (2020) distinguish and test both "Reproducibility of search results at different times" and "Reproducibility of search results at different locations." In this post, reproducibility refers to the first meaning – getting the same results when running the same search again, whether seconds apart or in different sessions.

1. Keyword-Based Methods (e.g., Boolean + TF-IDF/BM25 Ranking)

Description: This is the traditional method used by most conventional academic databases. It relies on matching keywords from your query within the documents. Most common is Boolean for matching items and relevancy ranking using TF-IDF/BM25

Interpretability: High. With strict Boolean logic (AND, OR, NOT), you can understand precisely why a document is included or excluded. Standard relevance ranking algorithms like TF-IDF or BM25, while technical, are somewhat intuitive – you can often grasp why some results rank higher by looking at the frequency and placement of matched terms (e.g. title matches worth more).

Reproducibility: Generally high. Running the exact same Boolean query usually yields the exact same set of results (assuming the underlying database index hasn't changed).

2. LLM-Generated Keyword/Boolean Search

Example of Web of Science Research Assistant using LLM to generate Boolean Search

Description: An LLM takes your natural language query and translates it into a structured Boolean or keyword search strategy, which is then run against a traditional database inverted index.

Interpretability: Remains high, assuming the system displays the generated Boolean query. You can see exactly which terms and logic were used to retrieve the results.

Reproducibility: Lower than pure keyword search. Because an LLM is involved in generating the query, the non-deterministic nature of LLMs means the generated search strategy itself might differ slightly even when given the exact same input query. This variation in the search strategy directly affects the reproducibility of the final result set.

3. Non-Keyword / Embedding-Based Methods (Knowned as Vector Search, Semantic Search, etc.)

Description: These methods generally convert both your query and the documents into numerical representations (vectors or embeddings) in a high-dimensional space. The system then finds documents whose vectors are "closest" or most similar to the query vector. Terminology varies: you might see "vector search", "(dense) embeddings", "neural search", "dense retrieval", or "semantic search." There are many technical variants (e.g., bi-encoders, cross-encoders, late-interaction/multi-vector approaches like ColBERT. Also learned sparse methods like SPLADE.)

Interpretability: Generally low. These methods often function as "black boxes." It's hard, if not impossible, to explain precisely why a specific document is retrieved or ranked higher than another, beyond pointing to a calculated similarity score between the query and the documents. Some advanced methods (like ColBERT or SPLADE) have the potential for more interpretability by showing which components contribute most to the relevance score, but they are not yet widely implemented or exposed in interfaces. (To my knowledge, Elicit is one academic AI tool using SPLADE, but its interface doesn't currently break down the score components for users).

Reproducibility: Often lower than keyword methods. Several factors contribute, but a major one is the common use of Approximate Nearest Neighbor (ANN) algorithms. ANN speeds up the computationally intensive process of finding similar vectors in massive datasets but often introduces some randomness, meaning results can vary slightly even for identical queries.

Let's now examine how our three example tools blend these methods, leading to differences in their interpretability and reproducibility.

Web of Science Research Assistant

This is perhaps the most straightforward of the three. Based on my testing and the available documentation:

1. It feeds your natural language query into an LLM (at the time of writing, likely a model like GPT-4o mini).

2. The LLM is prompted with instructions to generate a complex Boolean search query suitable for Web of Science.

For example, if you enter:

"impact of climate change on biodiversity"

The Assistant might generate a query like this:

(climate change OR global warming OR climate variation OR climatic changes OR climate variability OR climate crisis OR climate emergency OR greenhouse gases OR anthropogenic climate change OR carbon emissions) AND (biodiversity OR species diversity OR ecosystem diversity OR biological diversity OR ecological diversity) AND (impact OR effect OR influence OR consequence)

While we don't know the exact prompt used, observing examples suggests instructions along these lines:

1. Extract the main concepts from the user query.

2. For each concept, generate relevant synonyms and related terms, combining them with OR.

3. Combine the resulting concept blocks using AND.

Crucially, the Web of Science Research Assistant displays the generated Boolean query and provides a link to run it directly in the standard Web of Science interface.

Testing confirms that the top 8 results shown by the Research Assistant are exactly the same results, in the same order, as running the generated Boolean query directly in Web of Science using the default relevance sort (which likely uses BM25/TF-IDF).

Interpretability: Very high. You know exactly which Boolean search query was used. You can also understand why the top 8 results were retrieved and ranked as they are – they directly correspond to a standard WoS search using that generated query and default relevance ranking.

Reproducibility: Moderately high for an AI system. Since an LLM generates the Boolean query, some non-determinism is expected. In my tests, running the same natural language query multiple times resulted in a different generated Boolean search strategy roughly 1 out of 5 times. While this indicates variability, it's notably more consistent than some other AI search tools.

Primo Research Assistant

Boolean search generated by LLM for Primo Research Assistant

As I explained in a previous post, Primo Research Assistant functions similarly to the WoS tool in that it also uses an LLM to construct a Boolean search strategy. However, the style of the generated query and the subsequent processing differ.

Using the same example query:

"impact of climate change on biodiversity"

Primo RA might generate a query structured like this:

climate change biodiversity impact) OR (effects of climate change on ecosystems) OR (biodiversity loss due to climate change) OR (climate change species extinction) OR (impact of global warming on wildlife) OR (effects of climate change on ecosystems and species diversity) OR (how climate change impacts wildlife and biodiversity) OR (climate change consequences for biological diversity) OR (relationship between climate change and loss of biodiversity) OR (climate change threats to flora and fauna diversity) OR (impact of climate change on biodiversity)

Based on documentation and observation, the LLM appears to be instructed to:

1. Generate multiple query variant strings based on the original input.

2. Do it 10 times

3. Combine these 10 generated strings and the original query input using the OR operator.

So far, it seems analogous to WoS RA, just with a different query generation strategy. However, Primo RA adds another layer: it reranks the top 30 results retrieved by the generated Boolean query. This reranking step uses embedding models – a form of the less interpretable dense/vector retrieval discussed earlier.

PubMed also works similarly, though it runs a standard keyword search (BM25) first followed by reranking the top 500 using a machine learning-based reranking method - LambdaMART, making results somewhat less interpretable if you want to explain the order of results.

Because of this reranking step, the top results displayed by Primo Research Assistant will not necessarily be in the same order as those you'd get by simply running the generated Boolean query in the standard Primo interface.

Furthermore, the results might differ because Primo Research Assistant searches Primo's Central Discovery Index (CDI), with specific exclusions (e.g., content owners opting out) that might not apply identically to a user's standard Primo interface search configured by their institution.

Interpretability: Quite high, but slightly less than WoS Research Assistant. Like WoS RA, it displays the generated Boolean query, so you understand the initial retrieval logic. However, the final order of the top results is influenced by the embedding-based reranking step, which is harder to interpret. You can still understand why an item was retrieved at all (it matched the Boolean query), but explaining its exact rank among the top results is more difficult than with WoS RA.

Reproducibility: Similar to WoS Research Assistant in terms of the generated Boolean query. In my tests, the generated query also changed roughly 1 out of 5 times for the same input. The reproducibility of the final ranked list might be slightly lower due to potential non-determinism in the reranking step, although this is harder to isolate.

Scopus AI

Scopus AI employs the most complex retrieval mechanism of the three.

Like the others, it uses an LLM to generate a Boolean search strategy. For our example query:

"impact of climate change on biodiversity"

It might generate something like:

("climate change" OR "global warming" OR "climate crisis" OR "climatic change") AND ("biodiversity" OR "species diversity" OR "ecosystem diversity" OR "biological diversity") AND ("impact" OR "effect" OR "influence" OR "consequence") AND ("conservation" OR "preservation" OR "protection" OR "sustainability") AND ("habitat" OR "ecosystem" OR "environment" OR "biome") AND ("adaptation" OR "mitigation" OR "resilience" OR "response")

Interestingly, this query seems to incorporate more concepts (perhaps 4-6 distinct blocks) compared to the WoS RA example (3 blocks). The Scopus AI documentation mentions a "Copilot" that ensures "complex queries are broken down into their component parts," hinting at some initial query analysis or expansion (which add new aspects or concepts) before the LLM generates the final Boolean query.

Scopus AI does both "natural language search" and keyword Boolean search

The major difference, however, is that Scopus AI also performs a "natural language search" simultaneously with the Boolean search. The documentation describes this as a Vector Search. Unlike Primo RA's reranking step, Scopus AI uses vector search as part of the initial retrieval.

In the documentation the natural language search is described as a Vector Search. This is akin to the reranking step down by Primo Research Assistant, except this is done at the initial retrieval step.

In essence, Scopus AI uses a "hybrid search" system:

1. It runs the LLM-generated Boolean query against the Scopus index.

2. It runs the original natural language query (or a processed version) through an embedding-based vector search against the Scopus index.

3. It takes the top results from both searches, deduplicates them, and combines/reranks them (likely using a method like Reciprocal Rank Fusion) to produce the final list.

Interpretability: Clearly much lower than the other two. While the generated Boolean query component is displayed and interpretable, the simultaneous use of vector embedding search during initial retrieval makes the overall results significantly less interpretable. The vector search part acts as a black box; there's no straightforward way to explain why certain results were retrieved via that path or how they contributed to the final ranking, beyond abstract similarity scores

Reproducibility: Very low, based on my testing.

The generated Boolean search query itself is highly variable. It changed almost every other time (roughly 1 in 2 times) I ran the same natural language query. Potential reasons include randomness introduced by the initial "Copilot" query analysis step, or the LLM generating the Boolean query might be configured with parameters allowing for more variation (e.g., higher temperature).

The reproducibility of the overall search results (the final list) is also very low. This is likely due to the combination of the variable Boolean query and the inherent non-determinism of the vector search component (potentially using ANN algorithms). Even if the Boolean query happened to be the same between two runs, the vector search part could still return slightly different results, leading to a different final combined list.

Other Academic AI Search Systems & General Trends

My tests here are informal. Eventually, we will hopefully see formal evaluations of AI search tools, similar to Gusenbauer & Haddaway's (2020) study of 26 academic search systems ("Which academic search systems are suitable for systematic reviews or meta-analyses?"), but adapted for these new complexities. For now, let me speculate based on available information and trends:

Earlier tools like Scite.ai Assistant and the experimental CORE-GPT also use LLMs to generate search strategies. Scite.ai's generated queries sometimes resemble Primo RA's style (multiple OR'd phrases), while CORE-GPT might lean towards the WoS RA style (concept blocks with synonyms). Their specific reproducibility and interpretability would depend on their exact implementation (e.g., do they rerank? do they also use vector search?).

CORE-GPT: Combining Open Access research and large language models for credible, trustworthy question answering

Typical "AI Academic Search": It's often hard for us as librarians to know precisely what's happening under the hood, as vendor documentation varies in technical detail. However, based on informal discussions and industry trends, my guess is that many newer academic AI search engines, especially those from startups, are implementing hybrid search similar in principle to Scopus AI (running keyword/Boolean and vector searches in parallel, then combining results).

Beyond the use of LLMs to generate Boolean search strategies, LLMs can and have been used in other aspects of retrieval including being used outright to evaluate relevancy of the papers (e.g. Undermind.ai, Elicit Research Reports)

Incumbent Vendor Choices: Established vendors with massive, existing, highly optimized keyword-based indexes (like Exlibris/Clarivate with Primo/Summon and WoS) might initially prefer the "LLM-generates-Boolean-query" approach seen in WoS RA and Primo RA. The main advantage is likely cost and infrastructure. Building, maintaining, and constantly updating a parallel vector database for billions of items to support embedding search is significantly more resource-intensive than leveraging their existing infrastructure with an LLM front-end. This might explain why Exlibris/Clarivate have adopted this method for tools searching huge indexes like Primo's Central Discovery Index (containing 5 billion records - but of course not all records can be used in Primo Research Assistant).

Interpretability: Many emerging "AI academic search" tools that heavily rely on embedding-based vector search (especially in hybrid models) suffer from lower interpretability compared to traditional keyword search or even the LLM-generated Boolean approach (when the query is shown). Complex multi-stage retrieval pipelines used by some systems can make it even harder to understand precisely why you're seeing the results you get.

Reproducibility: While variable, the reproducibility of typical academic AI search engines is generally expected to be lower than that of traditional, deterministic keyword-based academic search. The degree of reproducibility varies significantly depending on the specific methods used (LLM settings, use of ANN in vector search, hybrid ranking strategies, etc.).

Conclusion: Beyond Reproducibility and Interpretability – The Challenge of Bias

The advent of modern AI in search presents multifaceted challenges extending beyond the interpretability and reproducibility issues discussed here. Perhaps the most profound, and certainly harder to investigate thoroughly, lies in detecting and measuring algorithmic bias.

As platforms increasingly replace explicit keyword control with AI-driven relevance judgments—often using neural networks for ranking or retrieval—they risk reflecting and potentially amplifying biases inherent in their underlying data and design.

Indeed, early studies suggest that neural rankers, particularly transformer-based models like BERT, can intensify the retrieval of certain content—for instance, potentially favouring male-associated sources—compared to traditional keyword systems such as BM25.

While known biases can sometimes be mitigated with effort, it is the unknown biases that pose a more significant concern. Detecting these hidden skews remains a formidable challenge, particularly in the current dynamic landscape. Is there a way to detect them? I certainly have no clue.

Currently, no single AI academic search tool or algorithm dominates (with the possible exception of Google Scholar). We are in a 'Wild West' phase where the information retrieval field lacks consensus on optimal ranking methods, leading to a variety of approaches. This situation is both a boon and a curse.

It's a curse because, without consensus, biases identified in one tool might be specific to its unique implementation, making systemic solutions difficult. Conversely, it's a boon because the diversity of tools means users aren't locked into a single potentially flawed system, and employing multiple tools might help mitigate the impact of any one tool's specific bias.

This inherent opacity presents a significant challenge. If the inner workings of these AI systems remain largely obscure, and unknown biases are difficult, if not impossible, to detect definitively, what practical approach can librarians and users take? Perhaps the most crucial skill we can currently cultivate and teach is fundamental skepticism.

On the other hand, navigating information systems with opaque algorithms and potential biases isn't entirely unprecedented. Librarians and researchers have long relied on tools like Google (which as far back as 2019 already incorporated BERT neural rankings for 10% of queries!) and Google Scholar, despite limited insight into their ranking mechanisms and awareness of potential commercial or systemic biases influencing results.

However, the context and claims of dedicated academic AI search tools arguably raise the stakes. These tools are explicitly marketed for scholarly discovery, potentially leading users to place greater trust in their outputs. Therefore, while the challenge of dealing with black-box systems isn't new, the need for active skepticism might be even more critical now. This translates into practical strategies: consistently questioning surprising results, habitually comparing outputs across different AI search tools (leveraging algorithmic diversity as a necessity), and remaining acutely aware that these systems are prone to biases, even if invisible.

Edit: Article was written by a human but edited with help of Grok3 and Gemini 2.5 pro