Searching the IMDb Dataset

Implement a Binary Search Tree (BST) based index for indexing/searching actors or movies in the IMDb dataset with the following characteristics:

  • Your implementation should allow the user to quickly find information associated with an actor, or a movie, based on the actor's or movie's name (e.g., "Arnold Schwarzenegger"), or a prefix of a name (e.g. "Arnold Schwarz*").

  • The search should be case insensitive (e.g., "arnold schwarz*" should also work.

  • The search key is the short or simplified name of a movie or actor.

  • The information associated with the key should be of type MovieInfo defined below:

    class MovieInfo{
        public String shortName;   //short or simplified name, e.g., Tom Hanks. 
        public String fullName;    //full name, e.g., Hanks, Thomas III.
        public int ID;             //integer ID.
    
        public MovieInfo(int id, String shortName, String fullName) {
           this.ID = id;
           this.shortName = shortName;
           this.fullName = fullName;
        }
    }
    
  • The BST index should be implemented as a stand-alone class named BSTIndex, having at least the public methods listed below (you may add as many additional private helper methods as needed). You will need to define a Node class within the BSTIndex class that contains 4 fields: key (of String type), data (of MovieInfo type), and left & right references to the children nodes.

    • public BSTIndex() : a constructor to initialize the BST. The data element should be an object of type MovieInfo as described above.

    • public MovieInfo findExact(String key) : returns the data MovieInfo element where the shortName matches the key exactly (although possibly with different capitalization).

    • public MovieInfo findPrefix (String prefix) : returns the data element MovieInfo where the shortName starts with the prefix (and possibly has different capitalization).

    • public void insert(MovieInfo data) : insert the given data element into the proper place in the BST structure.

You can use the IndexTester class provided below to test your BSTIndex class. When executed, this class creates an empty BSTIndex object (using the constructor of the BSTIndex class), reads the data from an input movie or actor file, builds a MovieInfo object for each row, and inserts it into the BST index (using the insert() method of insert BSTIndex class). At this point, the BST index is built for all the movie or actor entries in the file. It will then ask for a search string from user and search for the MovieInfo object associated with the name, testing the search functionality of your BSTIndex class (using its findExact or findPrefix methods).

To test your code with the IndexTester class, you will need to provide it the name of a dataset file to index. You can use any of the following IMDB datasets to this end. Note the full data files (i.e., actors.txt and movies.txt) are relatively large. You will likely wish to use the smaller datasets (actors100K.txt and movies100K.txt) for debugging. All of the files are in the same format: one actor or movie per line, with fields: ID, shortName, fullName separated by a tab character ("\t"). Read the IndexTester to remind yourself how to read this data.

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class IndexTester {

	public static void main( String [ ] args ) throws FileNotFoundException
	{
		BSTIndex t = new BSTIndex( );
		Scanner scan = new Scanner(new FileInputStream(args[0]));
		long start = System.currentTimeMillis();
		int i=0;
		
		while(scan.hasNext()){
			String line = scan.nextLine();
			String [] fields = line.split("\t");
			int id = Integer.parseInt(fields[0].trim());
			String shortName = fields[1].trim();
			String fullName = fields[2].trim();
			MovieInfo info = new MovieInfo(id, shortName, fullName);

			t.insert(info);
			i++;
			if(i % 10000 == 0){
				System.out.println("Inserted " + i + " records.");
			}
		}
		long end = System.currentTimeMillis();
		System.out.println("Index building complete. Inserted " + i + 
		                   " records. Elapsed time = " + (end - start )/1000 + 
		                   " seconds.");

		Scanner input = new Scanner(System.in);
		
		System.out.println("Enter search string, end in a '*' for prefix search." + 
		                   " q to quit");
		while(input.hasNext()){
			String search = input.nextLine().trim();
			if(search.equals("q")) break;
			if(search.indexOf('*') > 0){
				//call prefix search. 
				MovieInfo item = t.findPrefix(search);
				if(item==null) System.out.println("Not Found");
				else System.out.println(item.ID + " " + item.shortName + 
				                        " " + item.fullName);
				
			}
			else{
				//call exact search, modify to return MovieInfo
				MovieInfo item = t.findExact(search); 
				if(item==null) System.out.println("Not Found");
				else System.out.println(item.ID + " " + item.shortName + 
				                        " " + item.fullName);
			}
		}
	}	
}