Information Retrieval Solutions Manual

November 8, 2017 | Author: Dovoza Mamba | Category: Time Complexity, Information Retrieval, Computer Programming, Physics & Mathematics, Mathematics
Share Embed Donate


Short Description

Some solutions to exercises of the book...

Description

Information Retrieval Solutions Manual Mpendulo Mamba | 105998416 | [email protected]

Chapter 1 Solutions for chapter 1 exercises

terms

postings lists

forecasts home

1 1 -> 2 -> 3 -> 4

in

2 -> 3

increase july

3 2 -> 3 -> 4

new

1 -> 4

rise

2 -> 4

sales

1 -> 2 -> 3 -> 4

top

1

Exercise 1.1 Draw the inverted index that would be built for the following document collection. (See Figure 1.3 for an example.) 

Doc 1: new home sales top forecasts



Doc 2: home sales rise in july



Doc 3: increase in home sales in july

Doc 4: july new home sales rise



My Answer:

Exercise 1.2 Draw the term-document incidence matrix for this document collection. Draw the inverted index representation for this collection +



Doc 1: breakthrough drug for schizophrenia



Doc 2: new schizophrenia drug



Doc 3: new approach for treatment of schizophrenia



Doc 4: new hopes for schizophrenia patients My Answer: terms/doc

Doc 1

Doc 2

Doc 3

Doc 4

approach

0

0

1

0

breakthrough

1

0

0

0

drug

1

1

0

0

for

1

0

1

1

hopes

0

0

0

1

new

0

1

1

1

of

0

0

1

0

Page 2

terms/doc

Doc 1

Doc 2

Doc 3

Doc 4

patients

0

0

0

1

schizophrenia

1

1

1

1

treatment

0

0

1

0

terms

postings lists

approach

3

breakthrough

1

drug for

1 -> 2 1 -> 3 -> 4

hopes new

4 2 -> 3 -> 4

of

3

patients

4

schizophrenia

1 -> 2 -> 3 -> 4

treatment

3

Exercise 1.3 For the document collection shown in Exercise 1.2, what are the returned results for these queries:

Page 3

a. schizophrenia AND drug Solution Here we use the term-document incidence matrix to perform a boolean retrieval for the given query For the terms schizophrenia and drug, we take the row (or vector) which indicate the document the term appears in, schizophrenia - 1 1 1 1 drug - 1 1 0 0 Doing a bitwise AND operation for each of the term vectors gives, 1 1 1 1 AND 1 1 0 0 = 1 1 0 0 The result vector 1 1 0 0 gives Doc 1 and Doc 2 as the documents in which the terms schizophrenia AND drug both are present. b. for AND NOT(drug OR approach) for AND NOT (drug OR approach) Term vectors for - 1 0 1 1 drug - 1 1 0 0 approach - 0 0 1 0 First we do a boolean bit wise OR for drug, approach, which gives 1 1 0 0 OR 0 0 1 0 = 1 1 1 0 The we do a NOT operation on 1 1 1 0 (i.e. on drug OR approach), which gives 0 0 0 1 Then we do an AND operation on 1 0 1 1 (i.e. for) AND 0 0 0 1 (i.e. NOT(drug OR Page 4

approach)), which gives 0 0 0 1 Thus the document that contains for AND NOT (drug OR approach) is Doc 4. These exercise illustrate the Boolean Retrieval model for search of query terms in given list of documents.

Exercise 1.4 For the queries below, can we still run through the intersection in time O(x+y), where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what can we achieve? a) Brutus AND NOT Caesar Solution: Time is O(x+y). Instead of collecting documents that occur in both postings lists, collect those that occur in the first one and not in the second. b) Brutus OR NOT Caesar Solution: Time is O(N) (where N is the total number of documents in the collection) assuming we need to return a complete list of all documents satisfying the query. This is because the length of the results list is only bounded by N, not by the length of the postings lists.

Page 5

Exercise 1.5 Extend the postings merge algorithm to arbitrary Boolean query formulas. What is its time complexity? For instance, consider: (Brutus OR Caesar) AND NOT (Antony OR Cleopatra) Can we always merge in linear time? Linear in what? Can we do better than this? My Answer:

+

For the query, it is assumed that the length of the posting list for each word is s1, s2, s3, s4, respectively. The time complexity on the left of the entire query is O (s1 + s2), and the time complexity on the right is O (s3 + s4). According to the discussion of the previous question, we find that the time complexity of the AND NOT operation is O (x + y), so the total time complexity is time complexity or O (s1 + s2 + s3 + s4, or linear. +

In conclusion, the time complexity of A AND B, A OR B, A AND NOT B is O (x + y, but the time complexity of A OR NOT B is O (N)

Exercise 1.6 We can use distributive laws for AND and OR to rewrite queries. a. Show how to rewrite the query in Exercise 1.5 into disjunctive normal form using the distributive laws. b. Would the resulting query be more or less efficiently evaluated than the original form of this query? c. Is this result true in general or does it depend on the words and the contents of the document collection? a. (Brutus OR Caesar) AND NOT (Antony OR Cleopatra) = (Brutus OR Caesar) AND NOT Antony AND NOT Cleopatra = (Brutus AND (NOT Antony) AND(NOT Cleopatra)) OR (Caesar AND (NOT Antony) AND(NOT Cleopatra)) Page 6

b. The resulting query would be more efficiently evaluated than the originalform of this query. drug 1 2 for 1 3 4 hopes 4 new 2 3 4 of 3 patients 4 schizophrenia 1 2 3 4 treatment 3 c. It depends on the words and the contents ofthe document collection.

Exercise 1.7 Recommend a query processing order for (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes) given the following postings list sizes: Term Postings size eyes 213312 kaleidoscope 87009 marmalade 107913 skies 271658 tangerine 46653 trees 316812 Solution: Using the conservative estimate of the length of the union of postings lists, the recommended order is: (kaleidoscope OR eyes) (300,321) AND (tangerine OR trees) (363,465) AND (marmalade OR skies) Page 7

(379,571) However, depending on the actual distribution of postings, (tangerine OR trees) may well be longer than (marmalade OR skies), because the two components of the former are more asymmetric. For example, the union of 11 and 9990 is expected to be longer than the union of 5000 and 5000 even though the conservative estimate predicts otherwise.

Exercise 1.8 If the query is: +

friends AND romans AND (NOT countrymen) how could we use the frequency of countrymen in evaluating the best query evaluation order? In particular, propose a way of handling negation in determining the order of query processing. My Answer: We need to record all the number of documents N, and how many documents contain the word, recorded as E, then when the decision to deal with the order we observe the size of $ N-E $ on it.

Exercise 1.9 For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal? Explain why it is, or give an example where it isn't. My Answer:

+

It is not necessarily the best order of processing. Suppose there are three words and the size of the list is s1 = 100, s2 = 105, s3 = 110. Assuming that the intersection size of s1 and s2 is 100, the intersection size of s1 and s3 is zero. If it is in accordance with s1, s2, s3 order to deal with, need 100 +105 +100 + 110 = 415. But if the order of s1, s3, s2 to deal with, only need 100 + 110 +0 +0 = 210

Page 8

Exercise 1.10 Write out a postings merge algorithm,for an My Answer

xx OR yy query. +

OR(p1,p2) answer 81 89>84 89=89 95>92 9596 97=97 99
View more...

Comments

Copyright ©2017 KUPDF Inc.
SUPPORT KUPDF