Finding Titles of Webpages

Most web pages you visit are built from text files that contain HyperText Markup Language (HTML). Below is an example of the HTML for a simple web page:

<html>
   <head>
      <title>Sample Title</title>
   </head>
   <body>
      <h1>This is a Heading</h1>
      The following word is <b>bold</b>
   </body>
</html>

Your browser knows how to display the various elements of the web page by looking for special tags enclosed in angle brackets. For example, if the HTML file for a web page has some text surrounded by <b> and </b>, the browser knows that text should be displayed in a boldface font. As another example, <title> and </title> mark the beginning and end of the title of the web page. Often this title text is shown at the top of the browser window, or in the browser tab for that page.
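
As a rough illustration of how a program might locate the title tags described above, the following sketch searches an HTML string for the text between <title> and </title>. The class name TitleExtractor and the method name extractTitle are hypothetical, and returning null for a missing or empty title is just one possible convention. A regular expression is also only one of several ways to do this; indexOf and substring would work as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption made for this sketch.
class TitleExtractor {
    // Matches <title>...</title>, ignoring case, allowing attributes on the
    // opening tag, and allowing line breaks inside the title text.
    private static final Pattern TITLE_PATTERN = Pattern.compile(
            "<title(?:\\s[^>]*)?>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or null when no non-empty title is found.
    static String extractTitle(String html) {
        Matcher matcher = TITLE_PATTERN.matcher(html);
        if (matcher.find()) {
            String title = matcher.group(1).trim();
            return title.isEmpty() ? null : title;
        }
        return null;
    }
}
```

Applied to the sample page shown earlier, extractTitle would return "Sample Title"; for a page with no title tags it would return null.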

Write a class named WebsiteTitleFinder that takes two command-line arguments: the full path to an input file and the full path to an output file. The input file should already exist and contain one or more website addresses (i.e., URLs), one per line, and nothing else. The output file should be created by WebsiteTitleFinder and should list the same URLs along with the titles of those pages, if they have them. If the page at a given URL has no title, "< no title present >" should appear where the title would have been. The formatting of the output file should mirror that of the example given below (e.g., the blank line after each entry, the presence of "Title: ", etc.).
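
One way the two arguments and the one-URL-per-line input might be handled is sketched below. The class name InputSketch and the method cleanUrls are hypothetical, and skipping blank lines (such as a trailing empty line at the end of the file) is an assumption that goes slightly beyond the stated input format.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, used only for this sketch.
class InputSketch {
    // Keeps the non-blank lines, trimmed, in their original order.
    static List<String> cleanUrls(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                urls.add(trimmed);
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // Expect exactly two arguments: the input path and the output path.
        if (args.length != 2) {
            System.err.println("Usage: java WebsiteTitleFinder <input file> <output file>");
            return;
        }
        Path inputPath = Paths.get(args[0]);
        Path outputPath = Paths.get(args[1]);

        // Read every line of the input file, then discard any blank ones.
        List<String> urls = cleanUrls(Files.readAllLines(inputPath));
        System.out.println("Read " + urls.size() + " URLs; output would go to " + outputPath);
    }
}
```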

For example, suppose we have the following input file:

filename: urls.txt

http://www.google.com
http://classes.emory.edu
http://www.oxford.emory.edu
http://www.epa.gov/ncer/rfa/2015/2015-p3.html

If the input file above were located on the OS X desktop of a user named pauloser, and we wanted the output file to appear in the same location but be named titles.txt, we would run the program from the command line with:
$ java WebsiteTitleFinder /Users/pauloser/Desktop/urls.txt /Users/pauloser/Desktop/titles.txt

This would then produce the following file on the desktop:

filename: titles.txt

http://www.google.com
Title: Google

http://classes.emory.edu
Title: < no title present >

http://www.oxford.emory.edu
Title: Home - Oxford College

http://www.epa.gov/ncer/rfa/2015/2015-p3.html
Title: Research Grants | US EPA
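
The entry layout in this example (the URL, a line beginning with "Title: ", then a blank line) could be produced by a small helper such as the one sketched here. The names OutputFormatSketch, NO_TITLE, and formatEntry are hypothetical, as is the choice to treat a null or blank title as missing. The sketch uses "\n" as the line terminator for simplicity; a real program might instead rely on println from PrintWriter.

```java
// Hypothetical class name, used only for this sketch.
class OutputFormatSketch {
    // Placeholder text required when a page has no title.
    static final String NO_TITLE = "< no title present >";

    // Builds one output entry: the URL, the "Title: " line, and a blank line.
    static String formatEntry(String url, String title) {
        boolean missing = (title == null || title.trim().isEmpty());
        String shown = missing ? NO_TITLE : title.trim();
        return url + "\n" + "Title: " + shown + "\n" + "\n";
    }
}
```

Concatenating the result of formatEntry for each URL, in input order, would give the full contents of the output file.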


You may find the following method useful for storing the HTML associated with a URL (like 'http://www.google.com') in a string:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

...

public static String getHtmlFromWeb(String urlString) {
    // StringBuilder avoids re-copying the whole string on every appended line
    StringBuilder html = new StringBuilder();
    try {
        URL url = new URL(urlString);
        // try-with-resources closes the reader even if reading fails partway
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                // readLine() drops the line terminator, so lines are joined directly
                html.append(inputLine);
            }
        } catch (IOException e) {
            System.out.println("There was a problem reading from this URL. " +
                               "No action taken.");
            e.printStackTrace();
        }
    } catch (MalformedURLException e) {
        System.out.println("The URL was malformed. " +
                           "No action taken.");
        e.printStackTrace();
    }

    return html.toString();
}