Query

Query

Efficient Query Processing

Strategies

  • scaling out:
    • servers bring more maintainance, power costs
  • scaling up:
    • hardware is expensive
    • diminishing returns
  • algorithms

Top-k Retrieval

  • avoid scoring documents not appear in the top-k result list
  • prestore information for each term to avoid scoring
  • incorporate with compression for query processing
  • GEQ(x) operation to avoid decompression
Inefficient Evaluation of BM25

// TODO

The WAND Algorithm
  • use maximum contribution of query term to overestimate scores
  • do not evaluate document if the score is lower than the top of the heap
  • utilize GEQ(x) to skip over large parts of posting lists
  • precompute the similarity metric
Maximum Contribution
  • the maximum contribution of a term q as the largest score
    • depends on the similarity measure
    • precomputed
    • store single floating points
    • can be used to overestimate the score

Query Completion

Goal

  1. formulate search requests
  2. reduce number of keystrokes to enter
  3. guide to a good query
  4. cache results

Strategy

  1. Generate list of completions baesd on partial query
  2. refine completions as keystrokes increase

Target

  1. most popular queries
  2. items on websites
  3. past queries by the user

Type

  1. prefix match
  2. substring match
  3. multi-term match
  4. relaxed match

Prefix Completion

  • given a query prefix, retrieve the top-k completions
  • data from static query log
Trie + RMQ Based Index
  1. sort query log by lexicographical order, count term frequencies
  2. Insert all unique queries and frequencies into a trie
    • store array with frequencies corresponding to each query
  3. Insert queries into trie, find node in trie representing the subtree prefixed by P in O(|p|) time

  4. Range Maximum Queries

Advanced Query

Phrase Query

inverted index

Positional Inverted Index

  1. intersect doc ids of phrase terms
  2. intersect position lists for the docs containing all terms
  3. rank the result

string matching indexes

Suffix Arrays

  1. store all suffixes of the text
  2. sort in lexicographical order
  3. each occurrence of P is a prefix of a suffix

Advantages
  • Compressed Suffix Arrays (CSA) use space equivalent to the compressed size of the text. Example: 1GB text, 8GB Suffix Array, Compressed text (gzip) 200MB, CSA 150MB
  • parallel constructed and space-efficiently
  • used to index TBs of data
  • solve queries that are hard to solve using an inverted index

Evaluation

Automatic Evaluation

Test Collection

  • reusable test collections constructed for Information Retrival evaluation, e.g., for TREC competitions
    • corpus of documents
    • set of queries
    • relevance judgements (qrels), human judgement

Relevance Vector

  • binary vector indicating relevance for each ranked document

Measures

  • precision at k: compute precision using only ranks 1…k
  • average precision: tak-e average over precision at k for each k
  • Mean Average Precision (MAP): average precision across multiple queries

Utility Based Metrics

Rank-Biased Precision

  • list top-to-bottom with persistence probability P
  • always look at the first result
  • look at the second result with probability P
  • look at the third result with probability P2 ...
  • search engine gets paid based on how much relevant documents it provides until the user stops

Effectiveness

  • query logs and click logs

Machine Learning to Rank

Process

  • learn a rank model that can rank the list of k documents for an unknown query
  • use training data consisting of tuples which represent the query q, the documents Di, user u and Relevance Vector Ri
  • learn to combine features representing x = to predict Ri

Challenges

  • find the right features representing x =
  • define the objective that we want to optimize

Features

User Features

  • kind of documents users look for
  • kind of links users click on
  • time the users stay on a URL'

  • ...

Document Features

  • tf/idf features
  • topics
  • number of slashes in URL
  • length of URL

Query Features

  • number of queries terms
  • popularity
  • time-sensitiveness
  • BM25 score distribution

results matching ""

    No results matching ""