The Long and Short of It: Proportion-Based Relevance to Capture Document Semantics End-to-End

Dominant search methods today typically rely on keyword matching or vector-space similarity to estimate relevance between a query and documents. However, these techniques struggle when searching corpora with entire files, papers, or even books as the search query.

Some fun with Dall-E 3

Keyword-Based Retrieval

While keyword searches excel at short lookups, they fail to capture the semantics critical for long-form content. A document that discusses “cloud platforms” in depth can be completely missed by a query looking for expertise in “AWS”. Exact term matching frequently runs into vocabulary-mismatch issues in lengthy texts.
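
To make that vocabulary-mismatch failure concrete, here is a minimal sketch of exact-term matching (not the paper's method, just an illustration with an invented query and document):

```python
def term_overlap_score(query: str, document: str) -> int:
    """Count query terms that appear verbatim in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms)

document = "Our team designs and operates cloud platforms for enterprise workloads"
query = "AWS expertise"

# No shared terms: the document scores zero despite being relevant.
print(term_overlap_score(query, document))  # -> 0
```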

Vector Similarity Search

Modern vector embedding models like BERT compress meaning into hundreds of numerical dimensions and can estimate semantic similarity accurately. However, transformer architectures built on self-attention do not scale beyond 512–1024 tokens, because computation grows quadratically with sequence length.
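
For contrast, here is a hedged sketch of dense-vector retrieval with an off-the-shelf encoder. The sentence-transformers model named below is an illustrative assumption rather than anything the paper specifies, and the key point is that input beyond the encoder's maximum sequence length is simply truncated:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small illustrative encoder, 384-dim vectors
print(model.max_seq_length)  # e.g. 256 tokens; anything longer is truncated before encoding

query = "Which vendors have AWS expertise?"
long_document = "Our team designs and operates cloud platforms ..."  # imagine thousands of words

# Everything past max_seq_length is silently dropped.
query_vec, doc_vec = model.encode([query, long_document])
print(util.cos_sim(query_vec, doc_vec))  # nonzero semantic similarity despite no exact term overlap
```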

Without the ability to ingest documents in full, the resulting partial, “bag-of-words”-style embeddings lose the nuances of meaning spread across sections. Context gets lost in the abstraction.
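
A common workaround, sketched below under the same illustrative assumptions, is to chunk a long document, embed each chunk, and mean-pool the vectors. The pooled result is exactly the kind of partial embedding that averages away relationships between distant sections:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice of encoder

def embed_long_document(text: str, chunk_words: int = 200) -> np.ndarray:
    """Chunk the text, embed each chunk, and mean-pool into a single vector."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    chunk_vectors = model.encode(chunks)  # one embedding per chunk
    # Mean pooling flattens any relationship between distant sections.
    return chunk_vectors.mean(axis=0)
```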

The prohibitive compute cost also restricts fine-tuning on most real-world corpora, limiting accuracy. Unsupervised learning offers an alternative, but robust techniques are lacking.

In a recent paper, researchers address exactly these pitfalls by re-imagining relevance for ultra-long queries and documents. Their innovations unlock new potential for AI document search.

Dominant search paradigms today are ineffective for queries that run to thousands of words of input text. Key issues include:

  • Transformers like BERT have quadratic self-attention complexity, making them infeasible for sequences beyond 512–1024 tokens (a back-of-the-envelope sketch follows this list). Sparse-attention alternatives trade away accuracy.
  • Lexical models that match on exact term overlap cannot infer the semantic similarity critical for long-form text.
  • Lack of labelled training data for many domain collections necessitates…
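
As a back-of-the-envelope illustration of the first point, the sketch below counts attention-matrix entries as sequence length grows; the head and layer counts are BERT-base-like values chosen for illustration, not figures from the paper:

```python
def attention_entries(n_tokens: int, n_heads: int = 12, n_layers: int = 12) -> int:
    """Entries in the n x n attention matrices across all heads and layers."""
    return n_tokens ** 2 * n_heads * n_layers

for n in (512, 1024, 8192, 32768):
    print(f"{n:>6} tokens -> {attention_entries(n):,} attention entries")

# Going from 512 to 32,768 tokens multiplies this cost by (32768 / 512)**2 = 4096x.
```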
