Inverted Index for Web Documents


THE UNIVERSITY OF MANCHESTER

Inverted Index for Web Documents
COMP38120 Lab 1 Report
Alexandru-Nicolae Onea
13/2/2015

1. Introduction

The inverted index is built using the Hadoop framework in Java and is run on six input files, each containing one Wikipedia article. Each input file is plain text laid out on two lines: the first line is the title of the article and the second line contains its body. The text in the files is extracted from the corresponding web pages as is, with no modification or transformation. The output of the Java program is a single inverted index built from all six input files, containing references to all the words in the files together with information about their occurrences, their document of origin, and so on. The inverted index is intended as input to a query algorithm that, given a search keyword, can quickly find and return the most relevant of the indexed articles.

2. Functionality

2.1. Basic functionality

The basic functionality of the program is to parse the input files and produce a basic inverted index in which every word is associated with the input files it belongs to. Each input file is handed to a Mapper, which splits it into lines and performs the map operation on each line independently. The results of the individual map operations are then fed to the Reducers and combined into larger data entities until a single output remains, containing all the relevant information, i.e. the inverted index.

[Figure: MapReduce data flow. The two lines of each input file are handled by one Mapper, with a Map() call per line emitting individual words; the words are distributed to Reducers, whose Reduce() calls merge them into word lists and finally into the single inverted index.]

As shown in the figure above, each input file is passed to exactly one Mapper, which performs the map operation on each line of that file. Note that each file contains exactly two lines: the first is the title and the second the body. Each map operation outputs several data entries, each associating a single word of the input file with the name of the file currently being processed (the origin of the word). Next, these single-word entries are allocated to Reducers. There is more than one Reducer, and the allocation is performed by the Hadoop framework, so we do not have to interfere with this process. Each Reducer groups and combines its input entries into larger lists of words, taking duplicates into account. These lists are then passed on to further Reducers until a single list, the inverted index, is output.
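To make this data flow concrete, the following is a minimal, self-contained sketch of the basic version described above: each word is emitted with the name of the file it came from, and the Reducer collapses duplicates into a sorted set of file names. The class names and the use of plain Text values are illustrative only; the full program in Section 4 uses a richer custom value type.

import java.io.IOException;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeSet;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SimpleInvertedIndex
{
    public static class SimpleMap extends Mapper<Object, Text, Text, Text>
    {
        private final Text word = new Text();
        private final Text fileName = new Text();

        @Override
        protected void setup(Context context)
        {
            // The origin of every word is the file of the current input split
            fileName.set(((FileSplit) context.getInputSplit()).getPath().getName());
        }

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException
        {
            // One map() call per line; emit (word, origin file) for every token
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens())
            {
                word.set(itr.nextToken());
                context.write(word, fileName);
            }
        }
    }

    public static class SimpleReduce extends Reducer<Text, Text, Text, Text>
    {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException
        {
            // Collapse duplicate file names into a single sorted list
            Set<String> files = new TreeSet<String>();
            for (Text v : values)
                files.add(v.toString());
            context.write(key, new Text(files.toString()));
        }
    }
}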

2.2. Improved functionality and features

2.2.1. Stemming

Stemming is the operation of reducing inflected and derived words to their root form; for example, "wires" becomes "wire" and "hanging" becomes "hang". The stemming operation is quite limited and highly dependent on the input language: one limitation of the program is that it can produce sensible results only for input written in English. Another limitation comes from the design of the stemmer itself, which cannot correctly handle every situation that occurs in the English language; for example, by the same rules "making" becomes "mak" and not "make".
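As an illustration, the fragment below stems a single token using the course-provided Stemmer class, with the same calls that appear in the listing in Section 4 (it assumes the Stemmer import from that listing and a surrounding method).

String word = "making";
Stemmer s = new Stemmer();
// A char[] word is added to the stemmer with its length, then stemmed
s.add(word.toCharArray(), word.length());
s.stem();
// With the simple suffix-stripping rules described above this is expected
// to print "mak" rather than "make"
System.out.println(s.toString());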

2.2.2. Stop word removal

Stop words are words considered irrelevant in a given context (which is why the stop word list depends on the context in which it is used). In this program, words like "and" and "the" are treated as stop words and are removed during the map operation.
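A sketch of this filtering step is shown below, using the StopAnalyser helper referenced in the listing in Section 4; line is an illustrative variable holding the text of the current input line.

StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens())
{
    String token = itr.nextToken();
    // Stop words such as "and" and "the" are simply skipped
    if (StopAnalyser.isStopWord(token))
        continue;
    // ... stem and index the remaining tokens
}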

2.2.3. In-Mapper aggregation

A particularity of the inputs for this lab is that the whole body of each Wikipedia article is stored on a single line. Since the map operation works on a single line, all the words of an article are processed by the same map operation. Words inevitably repeat within an article, so some aggregation can already be performed at this stage: repeated words are grouped into a single entity before leaving the Mapper, as sketched below.
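The sketch below shows the idea with a simplified Mapper that aggregates offsets in a HashMap and emits one record per distinct word from cleanup(); it assumes the usual Hadoop and java.util imports and uses a plain Text value instead of the custom DataFrame class of Section 4.

public static class AggregatingMap extends Mapper<Object, Text, Text, Text>
{
    // One entry per distinct word seen by this Mapper: word -> list of offsets
    private final HashMap<String, ArrayList<Integer>> uniqueMap =
            new HashMap<String, ArrayList<Integer>>();

    @Override
    public void map(Object key, Text value, Context context)
    {
        StringTokenizer itr = new StringTokenizer(value.toString());
        int pos = 0;
        while (itr.hasMoreTokens())
        {
            String token = itr.nextToken();
            pos++;
            ArrayList<Integer> offsets = uniqueMap.get(token);
            if (offsets == null)
            {
                offsets = new ArrayList<Integer>();
                uniqueMap.put(token, offsets);
            }
            // Repeated words share a single entry instead of being emitted twice
            offsets.add(pos);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException
    {
        // Only one record per distinct word leaves the Mapper
        for (String w : uniqueMap.keySet())
            context.write(new Text(w), new Text(uniqueMap.get(w).toString()));
    }
}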

2.2.4. Case folding

In English every sentence starts with a capital letter, but we want to keep capitals only for names and to lowercase every other word. Detecting the first word of a sentence is made difficult by the punctuation of the document. First, all punctuation except full stops is removed. Then a regular expression detects words at the beginning of sentences and these words are lowercased; any other capitalised word occurring in the middle of a sentence is treated as a name. The limitation of this method is that some sentences start with names, and some words are deliberately written with a capital letter in the middle of a sentence even though they are not names and should be lowercased, as illustrated below.
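The small demo below applies the lowercasing step to a made-up sentence, using a simplified version of the regular expression that appears in the listing in Section 4.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaseFoldDemo
{
    public static void main(String[] args)
    {
        String line = "Bart skips school. The plot follows Bart.";

        // A capital letter that follows a full stop starts a sentence: fold it
        Matcher m = Pattern.compile("\\. *[A-Z]").matcher(line);
        StringBuffer sb = new StringBuffer();
        while (m.find())
            m.appendReplacement(sb, m.group().toLowerCase());
        m.appendTail(sb);

        // Prints: "Bart skips school. the plot follows Bart."
        // "The" is folded, while the capitals of the name "Bart" are kept
        System.out.println(sb.toString());
    }
}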

2.2.5. Positional indexing

Each word is no longer passed to the Reducers with only the name of the file it originates from, but also with a list of the offsets at which it occurs in that file, so the word can easily be located in the document. The offset is taken to be the word's index on the current line (a combined sketch of this and the next feature follows the next subsection).

2.2.6. Document and term frequency

In addition to positional indexing, when a word is present in more than one document, a single data entity is produced containing the list of documents it occurs in and, for each document, the list of offsets. There is no need to pass on the number of occurrences explicitly, as both term and document frequency can be deduced by counting over these lists, as in the sketch below.
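The fragment below sketches the resulting posting for one term and how both frequencies are recovered by counting. The variable names and the plain Map representation are illustrative, not the DataFrame class of Section 4; the offsets are taken from the output sample in Section 5, and the usual java.util imports are assumed.

// document name -> list of offsets of the term in that document
Map<String, List<Integer>> posting = new HashMap<String, List<Integer>>();
posting.put("Bart_the_General.txt.gz", Arrays.asList(1, 10, 59));
posting.put("Bart_the_Fink.txt.gz", Arrays.asList(479, 274));

// Frequencies are not stored explicitly; they are derived by counting
int documentFrequency = posting.size();                            // 2 documents
int termFrequency = posting.get("Bart_the_General.txt.gz").size(); // 3 occurrences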


2.2.7. Flagging the title

Given the format of the input files, the title is easy to detect. Words contained in the title are flagged, as they are likely to be more relevant to a search query. This flag is passed along with the occurrence data between the Mappers and Reducers.
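The listing in Section 4 carries this flag as the IntWritable half of each posting pair. One simple way to set it, sketched below under the assumption that map() is called once per line and the title is always the first line, is a per-Mapper line counter; lineNumber is an illustrative field, not a variable from the listing.

private int lineNumber = 0;

public void map(Object key, Text value, Context context)
{
    // Line 0 is the article title, line 1 is the body
    int titleFlag = (lineNumber == 0) ? 1 : 0;
    lineNumber++;
    // ... tokenise the line and record (word, file, offset, titleFlag)
}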

3. Performance

In order to improve performance, the following have been implemented:

- The in-Mapper aggregation (Section 2.2.3) is done using a HashMap of words. If a newly read word is already contained in the HashMap, its entry is updated with the new occurrence instead of creating a new record.

- A custom data structure (the DataFrame class in the listing below) is passed between Mappers and Reducers. It is a nested structure of lists containing occurrence offsets and the names of the files, as sketched below.
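For reference, a plain-Java view of that nested structure is sketched below; the names are illustrative only (the actual implementation is the DataFrame class in the listing), the sample values are taken from the output in Section 5, and the usual java.util imports are assumed.

// file name -> (title flag, list of occurrence offsets)
Map<String, AbstractMap.SimpleEntry<Integer, List<Integer>>> frame =
        new HashMap<String, AbstractMap.SimpleEntry<Integer, List<Integer>>>();
frame.put("Bart_the_Mother.txt.gz",
        new AbstractMap.SimpleEntry<Integer, List<Integer>>(1, Arrays.asList(820, 1, 548)));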


4. Code listing

/**
 * Basic Inverted Index
 *
 * This Map Reduce program should build an Inverted Index from a set of files.
 * Each token (the key) in a given file should reference the file it was found
 * in.
 *
 * The output of the program should look like this:
 * sometoken [file001, file002, ... ]
 *
 * @author Kristian Epps
 */
package uk.ac.man.cs.comp38120.exercise;

import java.io.*;
import java.util.*;
import java.util.Map.Entry;
import java.util.regex.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.OptionBuilder;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

import uk.ac.man.cs.comp38120.io.array.ArrayListWritable;
import uk.ac.man.cs.comp38120.io.map.HashMapWritable;
import uk.ac.man.cs.comp38120.io.pair.PairOfWritables;
import uk.ac.man.cs.comp38120.util.XParser;
import uk.ac.man.cs.comp38120.ir.StopAnalyser;
import uk.ac.man.cs.comp38120.ir.Stemmer;

public class BasicInvertedIndex extends Configured implements Tool
{
    private static final Logger LOG = Logger
            .getLogger(BasicInvertedIndex.class);

    // Custom map-output value type, conceptually [((filename, flag), [offset])]:
    // it maps each file name to a (title flag, list of occurrence offsets) pair
    public static class DataFrame extends
            HashMapWritable<Text, PairOfWritables<IntWritable, ArrayListWritable<IntWritable>>>
    {
        private static final long serialVersionUID = 1L;

        public DataFrame() {}

        public static IntWritable max(IntWritable a, IntWritable b)
        {
            return (a.get() > b.get()) ? a : b;
        }

        // Record an occurrence of a word at the given offset in the given file
        public void addWordEntry(Text fileName, IntWritable offset)
        {
            if (this.containsKey(fileName))
            {
                this.get(fileName).getValue().add(offset);
            }
            else
            {
                PairOfWritables<IntWritable, ArrayListWritable<IntWritable>> p =
                        new PairOfWritables<IntWritable, ArrayListWritable<IntWritable>>();
                ArrayListWritable<IntWritable> list = new ArrayListWritable<IntWritable>();
                list.add(offset);
                p.set(new IntWritable(0), list);
                this.put(new Text(fileName), p);
            }
        }

        // Merge another DataFrame into this one, de-duplicating offsets and
        // keeping the highest flag seen for each file
        public void mergeFrame(DataFrame other)
        {
            Iterator<Entry<Text, PairOfWritables<IntWritable, ArrayListWritable<IntWritable>>>> it =
                    other.entrySet().iterator();
            while (it.hasNext())
            {
                Entry<Text, PairOfWritables<IntWritable, ArrayListWritable<IntWritable>>> pairs = it.next();

                Text fileName = new Text();
                fileName.set(pairs.getKey().toString());
                PairOfWritables<IntWritable, ArrayListWritable<IntWritable>> p = pairs.getValue();

                if (this.containsKey(fileName))
                {
                    PairOfWritables<IntWritable, ArrayListWritable<IntWritable>> p1 = this.get(fileName);
                    PairOfWritables<IntWritable, ArrayListWritable<IntWritable>> p3 =
                            new PairOfWritables<IntWritable, ArrayListWritable<IntWritable>>();

                    ArrayListWritable<IntWritable> l = p1.getValue();
                    l.addAll(p.getValue());

                    // Remove duplicate offsets
                    HashSet<IntWritable> hs = new HashSet<IntWritable>();
                    hs.addAll(l);
                    l.clear();
                    l.addAll(hs);

                    p3.set(max(p1.getKey(), p.getKey()), l);
                    this.remove(fileName);
                    this.put(fileName, p3);
                }
                else
                {
                    this.put(fileName, p);
                }
            }
        }
    }

    public static class Map extends Mapper<Object, Text, Text, DataFrame>
    {
        // INPUTFILE holds the name of the current file
        private final static Text INPUTFILE = new Text();

        // TOKEN should be set to the current token rather than creating a
        // new Text object for each one
        @SuppressWarnings("unused")
        private final static Text TOKEN = new Text();

        // The StopAnalyser class helps remove stop words
        @SuppressWarnings("unused")
        private StopAnalyser stopAnalyser = new StopAnalyser();

        // The stem method wraps the functionality of the Stemmer class,
        // which trims extra characters from English words
        // Please refer to the Stemmer class for more comments
        private String stem(String word)
        {
            Stemmer s = new Stemmer();

            // A char[] word is added to the stemmer with its length,
            // then stemmed
            s.add(word.toCharArray(), word.length());
            s.stem();

            // return the stemmed char[] word as a string
            return s.toString();
        }

        // In-mapper aggregation: one DataFrame per distinct word seen by this Mapper
        private static HashMap<Text, DataFrame> uniqueMap = new HashMap<Text, DataFrame>();

        // This method gets the name of the file the current Mapper is working on
        @Override
        public void setup(Context context)
        {
            String inputFilePath = ((FileSplit) context.getInputSplit()).getPath().toString();
            String[] pathComponents = inputFilePath.split("/");
            INPUTFILE.set(pathComponents[pathComponents.length - 1]);
        }

        // This Mapper reads in a line, converts it to a set of tokens and
        // aggregates each token with the name of the file it was found in
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException
        {
            String line = value.toString();

            line = line.replaceAll("'", "");           // remove single quotes (e.g., can't)
            line = line.replaceAll("[^a-zA-Z.]", " "); // replace the rest with a space

            // Case folding: lowercase the first letter of each sentence
            Matcher m = Pattern.compile("(\\.\\ *\\n*\\ *[A-Z])").matcher(line);
            StringBuffer sb = new StringBuffer();
            while (m.find())
            {
                m.appendReplacement(sb, m.group().toLowerCase());
            }
            m.appendTail(sb);
            line = sb.toString();

            // Remove some annoying punctuation
            line = line.replaceAll("[\\.]", " "); // replace the rest with a space

            StringTokenizer itr = new StringTokenizer(line);

            int pos = 0;
            while (itr.hasMoreTokens())
            {
                pos++;
                String str = itr.nextToken();

                // Drop stop words
                if (StopAnalyser.isStopWord(str))
                    continue;

                // Reduce the word to its root form
                str = stem(str);

                Text t = new Text();
                t.set(str);

                // In-mapper aggregation: one DataFrame per distinct word
                if (uniqueMap.containsKey(t))
                {
                    uniqueMap.get(t).addWordEntry(INPUTFILE, new IntWritable(pos));
                }
                else
                {
                    DataFrame df = new DataFrame();
                    df.addWordEntry(INPUTFILE, new IntWritable(pos));
                    uniqueMap.put(t, df);
                }
            }
        }

        // Emit the aggregated (word, DataFrame) pairs once the whole split is processed
        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException
        {
            Iterator<Entry<Text, DataFrame>> it = uniqueMap.entrySet().iterator();
            while (it.hasNext())
            {
                Entry<Text, DataFrame> pairs = it.next();
                // TOKEN.set(pairs.getKey());
                context.write(pairs.getKey(), pairs.getValue());
            }
        }
    }

    public static class Reduce extends Reducer<Text, DataFrame, Text, DataFrame>
    {
        // This Reduce job takes in a key and an iterable of DataFrames and
        // merges them into a single DataFrame, which is output with the key
        public void reduce(
                Text key,
                Iterable<DataFrame> values,
                Context context) throws IOException, InterruptedException
        {
            Iterator<DataFrame> itr = values.iterator();

            DataFrame df = new DataFrame();

            // Merge the partial postings produced by the Mappers
            while (itr.hasNext())
            {
                df.mergeFrame(itr.next());
            }

            context.write(key, df);
        }
    }

    // Lets create an object! :)
    public BasicInvertedIndex() {}

    // Variables to hold cmd line args
    private static final String INPUT = "input";
    private static final String OUTPUT = "output";
    private static final String NUM_REDUCERS = "numReducers";

@SuppressWarnings({ "static-access" }) public int run(String[] args) throws Exception 8

{ // Handle command line args Options options = new Options(); options.addOption(OptionBuilder.withArgName("path").hasArg() .withDescription("input path").create(INPUT)); options.addOption(OptionBuilder.withArgName("path").hasArg() .withDescription("output path").create(OUTPUT)); options.addOption(OptionBuilder.withArgName("num").hasArg() .withDescription("number of reducers").create(NUM_REDUCERS)); CommandLine cmdline = null; CommandLineParser parser = new XParser(true); try { cmdline = parser.parse(options, args); } catch (ParseException exp) { System.err.println("Error parsing command line: " + exp.getMessage()); System.err.println(cmdline); return -1; } // If we are missing the input or output flag, let the user know if (!cmdline.hasOption(INPUT) || !cmdline.hasOption(OUTPUT)) { System.out.println("args: " + Arrays.toString(args)); HelpFormatter formatter = new HelpFormatter(); formatter.setWidth(120); formatter.printHelp(this.getClass().getName(), options); ToolRunner.printGenericCommandUsage(System.out); return -1; } // Create a new Map Reduce Job Configuration conf = new Configuration(); Job job = new Job(conf); String inputPath = cmdline.getOptionValue(INPUT); String outputPath = cmdline.getOptionValue(OUTPUT); int reduceTasks = cmdline.hasOption(NUM_REDUCERS) ? Integer .parseInt(cmdline.getOptionValue(NUM_REDUCERS)) : 1; // Set the name of the Job and the class it is in job.setJobName("Basic Inverted Index"); job.setJarByClass(BasicInvertedIndex.class); job.setNumReduceTasks(reduceTasks); // Set the Mapper and Reducer class (no need for combiner here) job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); // Set the Output Classes job.setMapOutputKeyClass(Text.class); 9

job.setMapOutputValueClass(DataFrame.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(DataFrame.class); // Set the input and output file paths FileInputFormat.setInputPaths(job, new Path(inputPath)); FileOutputFormat.setOutputPath(job, new Path(outputPath)); // Time the job whilst it is running long startTime = System.currentTimeMillis(); job.waitForCompletion(true); LOG.info("Job Finished in " + (System.currentTimeMillis() startTime) / 1000.0 + " seconds"); // Returning 0 lets everyone know the job was successful return 0; } public static void main(String[] args) throws Exception { ToolRunner.run(new BasicInvertedIndex(), args); // // // // //

char[] p = {'n', 'o', 't', 'h', 'i', 'n', 'g'}; Stemmer s = new Stemmer(); s.add(p, 7); s.stem(); System.out.println(s.toString());

} }


5. Output samples

Bart {Bart_the_Mother.txt.gz=(1, [820, 1, 548, 408, 2067, 1914, 10, 2074, 674, 1774, 1226, 1234, 144, 574, 571, 2061, 1894, 1614, 1600, 1872, 1256, 2083, 669, 531, 53, 2091, 1757, 539, 2212, 610, 204, 1849, 1022, 1038, 1704, 631, 2114, 321, 1587, 1961, 450, 878, 508, 1343, 381, 1813, 109, 2149, 1561, 1206, 354, 1798]),
Bart_the_General.txt.gz=(1, [1, 10, 59, 69, 107, 131, 181, 207, 225, 243, 262, 277, 301, 304, 335, 435, 641, 758, 834, 879, 1001, 1262, 1284, 1418, 1506, 1512, 1519, 1528, 1550, 1560, 1601]),
Bart_the_Murderer.txt.gz=(1, [1, 136, 1096, 142, 10, 1365, 1225, 286, 2078, 2017, 2050, 2023, 1660, 264, 1790, 389, 2001, 2103, 1130, 161, 2008, 437, 523, 1871, 2084, 1265, 1995, 179, 207, 477, 1574, 619, 79, 1439, 1169, 1839, 457, 1293, 2124, 633, 1720, 582, 233, 707, 108, 1537, 250, 1206, 1084, 247, 602, 723, 1077]),
Bart_the_Genius.txt.gz=(1, [1, 883, 2035, 141, 544, 412, 1309, 677, 10, 1363, 742, 2045, 1772, 194, 461, 1958, 1005, 1712, 1174, 1353, 93, 390, 1420, 329, 373, 1135, 1067, 103, 308, 2007, 506, 2015, 1985, 1747, 1991, 2086, 355, 58, 486, 1277, 1998, 240]),
Bart_the_Lover.txt.gz=(1, [136, 1, 2308, 2071, 2314, 1160, 10, 950, 2391, 2117, 2321, 154, 1782, 2053, 2385, 83, 1661, 2330, 2028, 1284, 926, 369, 919, 111, 1333, 2410, 1196, 117, 185, 2357, 963, 2431, 724, 301, 2093]),
Bart_the_Fink.txt.gz=(1, [479, 274, 204, 1, 1975, 1917, 1849, 1699, 748, 10, 1841, 407, 1832, 1413, 1648, 1825, 149, 1822, 1551, 1665, 507, 232, 108, 167, 1530, 1876, 1321, 670, 1627, 528, 1682, 1935, 1451])}

