Writing a simple Inverted Index in Python

Introduction

Nowadays is not uncommon that web applications include full text search features. There are already well known solutions working out-of-the-box that provide the needed functionalities, such as ElasticSearch or Apache Solr.

Having used ElasticSearch at work a couple of times I wondered how it achieved fast searches and what mechanism empowered that, so reading up a little on the topic, the Inverted Index appears as the cornerstone of full text search algorithms.

Thus, what better way to understand how something works than writing my own toy one?

What is an Inverted Index?

The Inverted Index is the data structure used to support full text search over a set of documents. It is constituted by a big table where there is one entry per word in all the documents processed, along with a list of the key pairs: document id, frequency of the term in the document.

How does it work?

Let's imagine we have two documents with different texts:

  1. The big sharks of Belgium drink beer.
  2. Belgium has great beer. They drink beer all the time.

In order to create the Inverted Index, each text is sliced into different units or terms. The rule is to use whitespace as the natural separator between words, although it can be changed.

Additionally, per each term there is a list of pairs (document id, occurrences), showing the document's ID where the term is found, and the number of times the term appears in the text.

Therefore, the Inverted Index after processing the previous two documents would be:

Term Appearances (DocId, Frequency)
The (1, 1)
big (1, 1)
sharks (1, 1)
of (1, 1)
Belgium (1, 1) (2, 1)
drink (1, 1)
beer (1, 1) (2, 2)
has (1, 1)
great (1, 1)
They (1, 1)
all (1, 1)
the (1, 1)
time (1, 1)

As seen, the term Belgium appears once in both documents, while the term beer appears once in the first and twice in the second one. Whenever a search is issued, the index will be looked up and the corresponding documents retrieved automatically.

This in turn makes processing the documents (indexing) and thus creating & updating the index a slow process, since each document needs to be parsed, sliced and analyzed. Conversely, once the index is created search becomes a really cheap operation since it only entails looking up an entry in a table.

As it happens with everything, this mechanism is not a silver bullet and it has it's quirks and drawbacks, being some of them:

  • Case: The words Belgium and belgium are indexed as two different terms in the table, while a user writing "belgium" in lowercase letters most likely would want to see all the occurrences regardless case.
  • Stopwords: keywords considered irrelevant and deprived on any additional sense, like prepositions, articles or conjunctions.
  • Stemming: Reducing derivations of a word to their common root (e.g: shark and sharks to shark).
  • Synonyms: Gathering terms that share a common meaning to a single one, in order to prevent an explosion in Inverted Indexe's size. Terms like car, automobile, plane and truck could be mapped into a single vehicle word in the index.

Example

The Inverted Index can be understood as a simple key/value dictionary where per each term we store a list of appearances of those terms in the documents and their frequency.

Thus, an Appearance class represents a single Appearance of a term in a document:

class Appearance:
    """
    Represents the appearance of a term in a given document, along with the
    frequency of appearances in the same one.
    """
    def __init__(self, docId, frequency):
        self.docId = docId
        self.frequency = frequency

        
    def __repr__(self):
        """
        String representation of the Appearance object
        """
        return str(self.__dict__)

The Database class is a fake in-memory DB used to persist the documents after they have been indexed.

class Database:
    """
    In memory database representing the already indexed documents.
    """
    def __init__(self):
        self.db = dict()


    def __repr__(self):
        """
        String representation of the Database object
        """
        return str(self.__dict__)

    
    def get(self, id):
        return self.db.get(id, None)

    
    def add(self, document):
        """
        Adds a document to the DB.
        """
        return self.db.update({document['id']: document})


    def remove(self, document):
        """
        Removes document from DB.
        """
        return self.db.pop(document['id'], None)

And finally, the InvertedIndex class.

class InvertedIndex:
    """
    Inverted Index class.
    """
    def __init__(self, db):
        self.index = dict()
        self.db = db


    def __repr__(self):
        """
        String representation of the Database object
        """
        return str(self.index)

        
    def index_document(self, document):
        """
        Process a given document, save it to the DB and update the index.
        """
        
        # Remove punctuation from the text.
        clean_text = re.sub(r'[^\w\s]','', document['text'])
        terms = clean_text.split(' ')
        appearances_dict = dict()

        # Dictionary with each term and the frequency it appears in the text.
        for term in terms:
            term_frequency = appearances_dict[term].frequency if term in appearances_dict else 0
            appearances_dict[term] = Appearance(document['id'], term_frequency + 1)
            
        # Update the inverted index
        update_dict = { key: [appearance]
                       if key not in self.index
                       else self.index[key] + [appearance]
                       for (key, appearance) in appearances_dict.items() }

        self.index.update(update_dict)

        # Add the document into the database
        self.db.add(document)

        return document

    
    def lookup_query(self, query):
        """
        Returns the dictionary of terms with their correspondent Appearances. 
        This is a very naive search since it will just split the terms and show
        the documents where they appear.
        """
        return { term: self.index[term] for term in query.split(' ') if term in self.index }

In order to test the execution of the index, I just create a couple documents and perform some searches.

def highlight_term(id, term, text):
    replaced_text = text.replace(term, "\033[1;32;40m {term} \033[0;0m".format(term=term))
    return "--- document {id}: {replaced}".format(id=id, replaced=replaced_text)


def main():
    db = Database()
    index = InvertedIndex(db)

    document1 = {
        'id': '1',
        'text': 'The big sharks of Belgium drink beer.'
    }

    document2 = {
        'id': '2',
        'text': 'Belgium has great beer. They drink beer all the time.'
    }

    index.index_document(document1)
    index.index_document(document2)
    
    
    search_term = raw_input("Enter term(s) to search: ")
    result = index.lookup_query(search_term)

    
    for term in result.keys():
        for appearance in result[term]:
            # Belgium: { docId: 1, frequency: 1}
            document = db.get(appearance.docId)
            print(highlight_term(appearance.docId, term, document['text']))
        print("-----------------------------")    
    
main()

Doing a couple searches we can see the result:

inverted1

and another one.

inverted2

I hope this served as a good introduction on how the Inverted Index works.

Have fun!

References

Array of extended objects in python using list comprehensions and lambda functions.

Problem

It's been a while since I don't write any posts, so I thought that even though the idea might be initially quite silly, it will help me to kickoff again the habit by writing about a small problem I encountered the other day.

I was developing a very simple microservice that would receive a GET request with two parameters, issue a SPARQL query to a Virtuoso store and then, transform the returned array of objects by extending each object with the same additional meta information per object. Say:

res = [{ 'title': 'Oh boy' }, { 'title': 'Oh girl'}]

And then add some additional metadata like { 'meta': { 'author': 'Myself'}}

Ending up with

res = [ {
        'title': 'Oh boy',
        'meta':  {
          'author': 'Myself'
          }
        },
        {
          'title': 'Oh girl',
          'meta': {
            'author': 'Myself'
          }
        }]

Solution

I wanted to do something self contained and as functional as possible, by using list comprehensions for example. Unfortunately, there is no method in python to update a dictionary and return the new dictionary updated. The regular way is like:

a = { 'b': 3 }
a.update({'c': 5}) # Dict updated, does not return anything
print(a) # {'c': 5, 'b': 3}

Ultimately I came up with a small solution:

result = [(lambda x, y=z.copy(): (y.update(x), y))({ 'meta': { 'author': 'Myself' } })[1] for z in res]

Tada! Combining list comprehensions, lambda functions and the built-in dictionary copy() function we can return a new array with a copy of each object already extended.

By using a lambda function that accepts a tuple we can specify that the first argument is passed as a parameter and the second one will be a copy of each element in the array (assuming it is an object). Then, the object is extended with the argument and the newly extended object is returned as the second element of the tuple.

We could even bake this into a function:

def map_extend(array=[], ext={}):
  return [(lambda x, y=z.copy(): (y.update(x), y))(ext)[1] for z in array]
>>> res
[{'title': 'Oh boy'}, {'title': 'Oh girl'}]
>>> ext = { 'meta': {'author': 'Hola'}}                                                     
>>> map_extend(res, ext)
[{'meta': {'author': 'Hola'}, 'title': 'Oh boy'}, {'meta': {'author': 'Hola'}, 'title': 'Oh girl'}]
>>> map_extend(res, {})                                                                     
[{'title': 'Oh boy'}, {'title': 'Oh girl'}]
>>> map_extend([], {})                                                                      
[]

Have fun!