
[Solved] Course Information Retrieval Programming Assignment Indexer Part 1 Following Example Prese Q37200999

Course: Information retrieval

Programming Assignment (Indexer Part 1)

The following example presents what our text calls single-pass in-memory indexing. This indexer has been developed using the Python scripting language. Your assignment will be to use this code to gain an understanding of how to generate an inverted index.

This simple Python code will read through a directory of documents, tokenize each document, and add terms extracted from the files to an index. The program will generate metrics from the corpus and will generate two files: a document dictionary file and a terms dictionary file.

The terms dictionary file will contain each unique term discovered in the corpus in sorted order and will have a unique index number assigned to each term. The document dictionary will contain each document discovered in the corpus and will have a unique index number assigned to each document.
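The numbering scheme described above can be sketched in a few lines. This is only an illustration, not the assignment's code: the helper name number_items and the sample tokens and file names are hypothetical, and the sketch assumes Python 3 rather than the Python 2.x used by the course code.

```python
def number_items(items):
    """Return (item, index) pairs: unique items in sorted order, numbered from 1."""
    unique_sorted = sorted(set(items))
    return [(item, i) for i, item in enumerate(unique_sorted, start=1)]

# Hypothetical sample data standing in for tokens and file names from a corpus.
tokens = ["index", "term", "corpus", "term", "index"]
docs = ["doc2.txt", "doc1.txt"]

term_dictionary = number_items(tokens)
doc_dictionary = number_items(docs)

# Print each entry in the "name,number" layout used by the dictionary files.
for term, idx in term_dictionary:
    print("%s,%d" % (term, idx))
```

Both dictionaries share the same layout, so one helper can serve for terms and for documents.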

From our reading assignments we should recognize that a third file is required that will link the terms to the documents they were discovered in, using the index numbers. Generating this third file will be a future assignment.
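To make the idea of the linking structure concrete, here is a small in-memory sketch of how term index numbers could be mapped to document index numbers. It is an assumption about one possible shape for that structure, not the format required by the later assignment; the sample corpus and all variable names are hypothetical, and the code assumes Python 3.

```python
from collections import defaultdict

# Hypothetical tokenized documents (file name -> list of tokens).
corpus = {
    "doc1.txt": ["inverted", "index", "example"],
    "doc2.txt": ["index", "terms"],
}

# Number documents and terms in sorted order, starting at 1.
doc_ids = {name: i for i, name in enumerate(sorted(corpus), start=1)}
all_terms = sorted({t for toks in corpus.values() for t in toks})
term_ids = {term: i for i, term in enumerate(all_terms, start=1)}

# Postings: term index -> sorted list of document indexes containing the term.
postings = defaultdict(set)
for name, toks in corpus.items():
    for t in toks:
        postings[term_ids[t]].add(doc_ids[name])
postings = {tid: sorted(dids) for tid, dids in sorted(postings.items())}

for tid, dids in postings.items():
    print(tid, dids)
```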

We will be using a small corpus of files that contain article and author information from articles submitted to the journal "Communications of the ACM".

The corpus is in a zip file in the resources section of this unit, as is the example Python code (https://drive.google.com/open?id=1XovU8ZspaSp-3lq3Tp5jRKwzRjVVk04C).

You will either need to have the current version of Python 2.x installed on your computer, or you can use the University of the People virtual lab to complete the assignment, as the lab already has Python installed. You will need to modify the code to change the directory where the files are found to match your environment. Although you can download the Python file, the contents of the file are as follows:

# Example code in python programming language demonstrating some of the features of an inverted index.
#
# In this example, we scan a directory containing the corpus of files. (In this case the documents are reports on articles
# and authors submitted to the Journal "Communications of the Association for Computing Machinery")
#
# In this example we see each file being read, tokenized (each word or term is extracted) combined into a sorted
# list of unique terms.
#
# We also see the creation of a documents dictionary containing each document in sorted form with an index assigned to it.
# Each unique term is written out into a terms dictionary in sorted order with an index number assigned for each term.
# From our readings we know that to complete the inverted index all that we need to do is create a third file that will
# correlate each term with the list of documents that it was extracted from. We will do that in a later assignment.
#
# We can further develop this example by keeping a reference for each term of the documents that it came from and by
# developing a list of the documents thus creating the term and document dictionaries.
#
# As you work with this example, think about how you might enhance it to assign a unique index number to each term and to
# each document and how you might create a data structure that links the term index with the document index.

import sys,os,re
import time

# define global variables used as counters
tokens = 0
documents = 0
terms = 0
termindex = 0
docindex = 0

# initialize list variable
#
alltokens = []
alldocs = []

#
# Capture the start time of the routine so that we can determine the total running
# time required to process the corpus
#
t2 = time.localtime()

# set the name of the directory for the corpus
# (change this path to match your environment)
#
dirname = "c:usersdataicacm"

# For each document in the directory read the document into a string
#
all = [f for f in os.listdir(dirname)]
for f in all:
    documents+=1
    with open(dirname+'/'+f, 'r') as myfile:
        alldocs.append(f)
        data=myfile.read().replace('\n', '')
        for token in data.split():
            alltokens.append(token)
            tokens+=1

# Open for write a file for the document dictionary
#
documentfile = open(dirname+'/'+'documents.dat', 'w')
alldocs.sort()
for f in alldocs:
    docindex +=1
    documentfile.write(f+','+str(docindex)+os.linesep)
documentfile.close()

#
# Sort the tokens in the list
alltokens.sort()

#
# Define a list for the unique terms
g=[]

#
# Identify unique terms in the corpus
for i in alltokens:
    if i not in g:
        g.append(i)
        terms+=1

terms = len(g)

# Output Index to disk file. As part of this process we assign an 'index' number to each unique term.
#
indexfile = open(dirname+'/'+'index.dat', 'w')
for i in g:
    termindex +=1
    indexfile.write(i+','+str(termindex)+os.linesep)
indexfile.close()

# Print metrics on corpus
#
print 'Processing Start Time: %.2d:%.2d' % (t2.tm_hour, t2.tm_min)
print "Documents %i" % documents
print "Tokens %i" % tokens
print "Terms %i" % terms

t2 = time.localtime()
print 'Processing End Time: %.2d:%.2d' % (t2.tm_hour, t2.tm_min)

The areas where you must update the code are identified in bold type. You should modify these to work for your environment. If you are working in Linux or in the virtual computer lab, remember that backslashes in the path must be changed to forward slashes.
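One way to sidestep the slash question is to let Python build the paths. The sketch below uses os.path.join from the standard library, which inserts the separator appropriate to the platform it runs on. The directory name corpus_dir is a hypothetical placeholder, not the location used by the course, and the code assumes Python 3.

```python
import os

# Hypothetical corpus location; replace with the directory on your machine.
corpus_dir = os.path.join("data", "cacm")

# os.path.join inserts the separator that is correct for the current platform.
documents_path = os.path.join(corpus_dir, "documents.dat")
index_path = os.path.join(corpus_dir, "index.dat")

print(documents_path)
print(index_path)
```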

The requirements of this assignment include:

· You must modify and execute the indexer against the CACM corpus. Although this will not build a complete index, it will demonstrate key concepts such as

-Traversing a directory of documents

-Reading the document and extracting and tokenizing all of the text

-Computing counts of documents and terms

-Building a dictionary of unique terms that exist within the corpus

-Writing out to a disk file, a sorted term dictionary
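The first two concepts in the list above, traversing a directory and tokenizing each file, can be tried in isolation with the sketch below. It is a simplified illustration that assumes Python 3. The function name read_corpus and the sample files are hypothetical, and it splits on any whitespace rather than stripping newlines first as the course code does, so its token counts may differ from the course program's.

```python
import os
import tempfile

def read_corpus(dirname):
    """Return {file name: list of whitespace-separated tokens} for each file in dirname."""
    corpus = {}
    for name in sorted(os.listdir(dirname)):
        path = os.path.join(dirname, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        with open(path, "r") as handle:
            corpus[name] = handle.read().split()
    return corpus

# Build a tiny throwaway corpus so the sketch runs anywhere.
demo_dir = tempfile.mkdtemp()
samples = {"a.txt": "single pass in memory\nindexing", "b.txt": "in memory index"}
for name, text in samples.items():
    with open(os.path.join(demo_dir, name), "w") as handle:
        handle.write(text)

corpus = read_corpus(demo_dir)
print(corpus)
```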

As we will see in coming units, the ability to count terms and documents and to compile other metrics is vital to information retrieval, and this first assignment demonstrates some of those processes.

· Your terms dictionary and documents dictionary files must be stored on disk and uploaded as part of your completed assignment.

· Your indexer must tokenize the contents of each document and extract terms.

· Your indexer must report statistics on its processing and must print these statistics as the final output of the program.

o Number of documents processed

o Total number of terms parsed from all documents

o Total number of unique terms found and added to the index
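The difference between the second and third statistics is easy to miss: the second counts every token parsed, while the third counts each distinct term once. A minimal sketch, assuming Python 3 and a hypothetical in-memory corpus (the function name corpus_statistics is also made up for illustration):

```python
def corpus_statistics(corpus):
    """corpus maps file name -> list of tokens; returns (documents, tokens, unique terms)."""
    documents = len(corpus)
    tokens = sum(len(toks) for toks in corpus.values())
    terms = len({t for toks in corpus.values() for t in toks})
    return documents, tokens, terms

# Hypothetical tokenized documents.
corpus = {
    "a.txt": ["single", "pass", "in", "memory", "indexing"],
    "b.txt": ["in", "memory", "index"],
}

documents, tokens, terms = corpus_statistics(corpus)
print("Documents %i" % documents)
print("Tokens %i" % tokens)
print("Terms %i" % terms)
```

Here "in" and "memory" each appear in both files, so the token count is 8 while the term count is 6.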

When you have completed coding and testing your indexer program, you must execute your indexer against the corpus of documents in the cacm.zip file, which can be downloaded in the resources section of Unit 2.

Capture the statistics output from your program after running it against the corpus. Your statistics must include all of the statistics listed above. You can capture the statistics by copying and pasting the output of your program directly into a document which you can upload as part of your assignment, or you can manually record each statistic and include it in your posting of your assignment.

If you are unable to complete the programming or you have trouble getting your code to work, you can submit the work that you have completed to solicit the feedback of your peers. However, it is suggested that you use the course forum to post any difficulties you are having and seek the help of peers.

As you work with the indexer and corpus, make note of your observations and provide a summary of your observations when posting your assignment. Examples of observations might include content of the data, running time, efficiency of the program, and other observations.
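As one example of an efficiency observation, the sketch below times two ways of finding unique terms: testing membership in a growing list, as the example program does, and using a set. It assumes Python 3, the token data is synthetic rather than taken from the CACM corpus, and the function names are hypothetical. Actual timings will vary by machine.

```python
import time

def unique_with_list(tokens):
    """Mirror the list-based approach: each membership test scans the list."""
    seen = []
    for tok in tokens:
        if tok not in seen:
            seen.append(tok)
    return seen

def unique_with_set(tokens):
    """Same result for sorted input, but duplicates are removed with a hash set."""
    return sorted(set(tokens))

# Synthetic sorted token list (hypothetical data, not the CACM corpus).
tokens = sorted("t%05d" % (i % 500) for i in range(5000))

start = time.perf_counter()
a = unique_with_list(tokens)
list_seconds = time.perf_counter() - start

start = time.perf_counter()
b = unique_with_set(tokens)
set_seconds = time.perf_counter() - start

print("list: %.4fs  set: %.4fs  same result: %s" % (list_seconds, set_seconds, a == b))
```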

Peer Assessment Criteria

This assignment will have four elements for peer assessment. Keep in mind that as part of your assessment process you should review and respond to the assessment questions and provide substantive feedback. Your instructor will be monitoring the quality of the feedback that you provide, and a portion of your grade will be based upon the feedback that you provide to your peers.

Feedback can take the form of suggestions on how to improve the project, providing assistance to help fellow students complete their assignment, sharing best practices, tips, or resources that you have found useful, or explaining concepts to your peers.

The four elements required of the assignment include:

The indexer python code uploaded as part of the submission

The documents.dat and index.dat uploaded as part of the submission

The metrics produced when the indexer was executed

A description of the assignment and observations made while running the indexer against the corpus

Google Drive link to needed files: https://drive.google.com/open?id=1XovU8ZspaSp-3lq3Tp5jRKwzRjVVk04C

When submitting your assignment please make sure to include the following items:

1. A description of your assignment and your observations made while completing the assignment

2. Upload your modified python code.

3. Upload the generated index.dat and documents.dat files.

4. Include the metrics produced when your indexer was executed against the CACM corpus.
