CSSE 230
Doublets

Objectives

Overview

In 1879, Lewis Carroll (author of Alice in Wonderland and Through the Looking-Glass, among other classics), documented a word puzzle he termed "Doublets," which you may have seen. He wrote:
D EAR V ANITY [F AIR], -Just a year ago last Christmas, two young ladies-smarting under that sorest scourge of feminine humanity, the having "nothing to do"-besought me to send them "some riddles." But riddles I had none at hand, and therefore set myself to devise some other form of verbal torture which should serve the same purpose. The result of my meditations was a new kind of Puzzle-new at least to me-which, now that it has been fairly tested by a year's experience and commended by many friends, I offer to you, as a newly-gathered nut, to be cracked by the omnivorous teeth which have already masticated so many of your Double Acrostics.
The rules of the Puzzle are simple enough. Two words are proposed, of the same length; and the Puzzle consists in linking these together by interposing other words, each of which shall differ from the next word in one letter only. That is to say, one letter may be changed in one of the given words, then one letter in the word so obtained, and so on, till we arrive at the other given word. The letters must not be interchanged among themselves, but each must keep to its own place. As an example, the word "head" may be changed into "tail" by interposing the words "heal, teal, tell, tall." I call the two given words "a Doublet," the interposed words "Links," and the entire series "a Chain," ...
It is perhaps, needless to state that it is de rigeuer that the links should be English words, such as might be used in good society. ...
I am told that there is an American game involving a similar principle. I have never seen it, and can only say of its inventors, "pereant qui ante nos nostra dixerunt!"
Your first task will be to build the underlying data structures that will allow you to find a chain of links yielding a solution to Carroll's doublets for any pair of words. Then you'll complete the search algorithm and do some experiments with it.

Incremental Development Path

Setup

Download and unzip this data file containing large dictionaries of words on your computer so that the Doublets project and the unzipped DoubletsData folder are siblings in your file system:

folder structure

If you aren't sure where your Doublets project is, look in Eclipse under File > Properties > Resource > Location: and go to the folder containing Doublets.

Note that you don't need to import this folder into Eclipse in any way, you just need to place it in this directory so that the Doublets code can find it.

Of course, since this data isn't imported into Eclipse and isn't part of your project, you should never commit/push these dictionaries (we will test using our own copies). So when you see the dictionaries appear as unstaged changes, either leave them there as unstaged or right-click > Ignore to have git ignore them.

Data

You can find several word list files (derived from SCOWL) in the file you downloaded. The number in the file name indicates the frequency percentile of the word usage. Thus, english.cleaned.10.txt is a shorter list, but its words are more common than those who appear (additionally) in english.cleaned.all.20.txt., or english.cleaned.all.35.txt. The lists are supposed to be strict subsets of one another. If you want a bigger data set to play with, we also put the full file here too.

Links

Your first task is to build a data structure that represents the allowable links for each word. For example, we should be able to efficiently determine that the candidate plays from heal include heel, hell, heat, heap, real, hear, deal, zeal, seal, peal, head, veal, and meal, while the candidates from glass are grass, glads, class, gloss.

Create a class called Links that contains such a structure. It should implement the LinksInterface so that it has the right methods needed by the unit tests. Its constructor should take a file name as a parameter and populate the links data with word information from the file, which should have one word per line. Include a member method, getCandidates(String word), that takes a string and returns a collection of the valid connections (i.e., candidate plays) from the given word or null if there are none.

The getCandidates(word) method in Link must execute in O(1) time. It gets called OFTEN when searching for doublets. In particular, it should probably just do an O(1) lookup into a data structure and return what it finds, or null if there are none; there is no time to do string calculations. Therefore, you will need to do all of the work to figure out the set of words that differ from the given word in the constructor. So what basic data structure/s to use for your main Links field?

Notes:
You'll probably use a java.util.Scanner to read in the words from the file. Remember to pass a file, not the filename, to its constructor, like this:
scanner = new Scanner(new File(filename));
Populating this structure may have a reasonably high time complexity. However, it should handle english.cleaned.all.35.txt in at most 30 seconds. If you're struggling to achieve this, think about how to improve the efficiency of the process. In particular, we suggest reading the file only once into some intermediate data structure, then populating the main data structure from there. You will likely find yourself traversing some "search space" to build the final structure: how might you decrease the size of the search space?

The Links structure for the tiny.dictionary.txt datafile looks like this:

Doubles search tree

Make sure your code passes the given tests in LinkTest.java.

Chains

Create a class Chain representing an immutable sequence of link words. Recall that immutable just means that after the data is initialized in the constructor, no method may change any fields. It should support adding a new word to the (end) of the chain, as well as retrieving the last word in the chain (i.e., to see if it is the desired word). Because the objects are immutable, adding a word should construct and return a new Chain.

Your class should be an Iterable<String>, so that you can easily look over the contents (in order) and ultimately print the solution. Include a length method as well.

When searching for a chain between the doublets, it will be important not to repeat any words (otherwise your solution will be unnecessarily long). Thus, to ensure your chain does not repeat words, you should include a boolean method contains to determine efficiently whether a given word appears in the chain. What data structure is best to use here? After you have thought about this and have a good idea, see here for a hint. 

Managing Chains

Starting from the first doublet, there are many possible links that could lead to a solution. For instance, in finding a valid chain for Carroll's doublet from head to tail, you may decide to use a link from head to heal, or you might decide instead to use a link from head to hear.

Which one should you choose? If you are playing cerebrally, you might choose the link that seems most promising. However, it could end up being a dead-end, requiring you to resort to some alternative link.

Computationally, this process requires storing the alternatives you intend perhaps to try later if your search should come up short. Of course, you'd also need a method to retrieve one of those alternatives for examination. You can visualize the search as a tree:

search tree

GLASS has 4 links to candidate words. GRASS has 2 links shown. Searching for the correct chain just means expanding this tree until we find the end word. This makes sense if we can expand words in parallel. But we can't do this; we can only call getCandidates() on one word at a time. So we need to store the other chains in some kind of data structure.
Say we wanted to call getCandidates() to examine all the words at depth = 1 first before looking at anything on level 2: this is called a breadth-first search. (Can you see why? We search broadly on 1 level before we go deep along a path.) When discussing binary trees, what kind of traversal did we call this? What kind of data structure did we use for the corresponding iterator?

But maybe we think a depth-first search would be quicker. Then we would want to call getCandidates() on the first chain at depth 1 (grass), then call getCandidates() on the first new candidate at depth 2 (crass), then getCandidates() on the first new candidate at depth 3, etc., all before expanding any of the other chains at depth 1! When discussing binary trees, what kind of traversal is this? What kind of data structure would we use for the corresponding iterator?

See the difference? In the first case, we expand GLADS before CRASS, but in the second, we expand CRASS before GLADS.

Here is another, more complete example.

Doubles search tree

Abstract Class

To take advantage of polymorphism, we have given you a ChainManager abstract class that supports two core methods: add, which simply adds a chain for future consideration, and next, which returns (and removes) a chain.

In addition to these, we may be interested in tracking the performance of the manager. We include two other methods, numNexts, which returns the number of times next has been called, and maxSize, which returns the largest number of chains the manager has stored at any point so far. We also include updateMax(size), which should be called when the number of chains has potentially increased.

Implementations

Create two implementations of ChainManager, one called QueueChainManager that treats chains in a FIFO fashion, and another called StackChainManager that treats chains in a LIFO fashion. The next method should return null when there are no more chains left to consider. These correspond to the two ways to expand the tree (breadth-first and depth-first) discussed above.
Note:
java.util.LinkedList IS-A java.util.Queue, so your QueueChainManager implementation will store its chains in a LinkedList, but may only use Queue methods.

Doublet Chain

With all of the pieces in place, you are now ready to find a chain of links between doublets. But how? A sketch of the algorithm is as follows.

To get the process started, the initial chain is quite simple: it should contain the start word. This is your current (first) candidate chain. Add it to the chain manager to initialize it.

The algorithm should repeatedly ask the chain manager for chains to expand. What should be done with a chain?

Once you've done that, repeat the process for the next chain the manager gives you. If the chain manager says there's no chain left, you know there is no chain between the doublets.

Implement a class called Doublets whose constructor takes a LinkInterface object. Add a member method findChain that takes the two doublet strings (starting and ending) and a ChainManager, returning a solution Chain or null if there is none. The findChain method should implement the algorithm sketched above.  

At this point, you should be able to pass the unit tests in DoubletTest.

Interactive Program

Use all of these pieces to create an interactive program (main() in Doublets) that starts by loading the Dict35 file and initializing the Links. It should repeatedly prompt the user for a doublet and provide the chain solution or an appropriate message if there is none. Report the solution length, the number of candidate chains examined (i.e. number of times next called), and the maximum size of the candidate pool (maximum size of the specific chain manager's stack, queue, ...). For example, using english.cleaned.all.35.txt,
Welcome to Doublets, a game of "verbal torture."
Enter starting word: flour 
Enter ending word: bread 
Enter chain manager (s: stack, q: queue, x: exit): s 
Solution: [flour, floor, flood, blood, bloom, gloom, groom, broom, brood, broad, bread]
Solution Length: 11
ChainsExamined: 16
Max Chain Manager Size: 6 
Enter starting word: wet
Enter ending word: dry
Enter chain manager (s: stack, q: queue, x: exit): q
Solution: [wet, set, sat, say, day, dry]
Solution Length: 6
ChainsExamined: 82651
Max Chain Manager Size: 847047
Enter starting word: oat
Enter ending word: rye
The word "oat" is not valid. Please try again.
Enter starting word: owner
Enter ending word: bribe
Enter chain manager (s: stack, q: queue, x: exit): s 
No doublet chain exists from owner to bribe.
Enter starting word: cold
Enter ending word: warm
Enter chain manager (s: stack, q: queue, x: exit): x 
Goodbye!

Note: Please don't treat the output above as a unit test: the numbers you get will vary depending on the order you generate links. The relative sizes of candidates, max size, etc are important, though: you should find that each manager has its own strengths and weaknesses.

Now for a More Efficient Manager!

You have likely found that using a stack to manage the search is akin to following a chain all the way to completion first, no matter how long it may be. Conversely, using a queue to manage the search is like trying (and remembering!) every possible link at every possible stage. In this way, a queue unfolds the search into large tree of possibilities that expands layer by layer until you find a solution or no possibilities remain.

The stack manager tends to find chains that are unnecessarily long. Meanwhile, the queue manager is guaranteed to find the shortest solution, but can take gobs of memory. Both of these managers somewhat unintelligently choose which candidate to explore next. Can we perhaps do better?

When new chains are added to a queue, they are necessarily at the end. Thus the queue manager returns the candidate with the shortest chain for examination. While this is an advantage when several chains have the same length, we have no way of distinguishing which may be better. For instance, which of the following two chains would you prefer to be working from for the doublet head/tail?

head-held-hell-hall-hail
head-heat-feat-fear-pear

If you've played the game before, you'd probably pick the first. Why? After five moves, the end of the first chain differs from the goal word by only a single character. By contrast, the end of the second chain differs from the goal word in all four characters. We have an intuitive sense that the first is better. Can we formalize this somehow?

Instead of returning the shortest chain for examination, we would like to return the chain with the shortest total estimated solution length. That is, when a particular incomplete chain is extended to yield a solution, what will the solution's length be? We already know how long the candidate chain is, but how can we estimate how much longer the chain must be to reach the solution? One optimistic estimate is to count the number of characters that differ between the chain's end and the goal word. Because you can only change one character at a time, the remaining chain must have at least this many words in it (though perhaps more).

The value of a chain (for a particular doublet) is thus given by

Value=Length+Differences,
where "Differences" means the the number of characters that differ between the goal word and the end of the chain. In this assignment, you'll create a new chain manager that efficiently returns the most promising chain for consideration in the search; that is, the one with the lowest value. Will this strategy improve our situation? We will find out!

Priority Queue Chain Manager

If you hadn't already guessed, one of the most efficient ways to keep a collection of things and quickly access the "best" is a priority queue. In this part, we will build up a PriorityQueueChainManager that extends the ChainManager abstract class. You can begin by creating a skeletal implementation.

Java's PriorityQueue uses the elements' so-called "natural ordering", which is revealed by the compareTo method. That means the items it stores must be Comparable. To this point, you've likely been storing items inside the stack and queue as Chains. However, these are not comparable. While it might make sense to compare chains by their length, our new manager will require that we compare candidate chains by their total estimated length.

One might argue that a chain is independent of a solution or even the doublets or the search strategy; it is just a sequence of words. Your prior managers knew nothing about the doublets, chain length (explicitly), or even the search they played a part in. Because we might use estimation methods other than the one outlined above, the estimated total length is something that should be the chain manager's responsibility for calculating. In addition, it is often true in real-world practice you are handed classes and objects and you cannot change them. Consider that to be the case here: the Chain class simply cannot be modified.

Constructor   Add a constructor to your PriorityQueueChainManager class that takes the ending word as a parameter (which should then be a member of your class).

Distance Estimate   Add a private method to your PriorityQueueChainManager class that takes a string and returns the number of characters that differ from the target string. It may take as a precondition that the strings are the same length.

Entry Class   Because chains are not directly comparable using their length, we must create a new composite type to store in the priority queue. This new type will use a sum of the chain's length and the "distance from target" of the chain's last word, as defined above, as the basis of its natural ordering.

Create an inner class of PriorityQueueChainManager called Entry. Its constructor should take a Chain as a parameter, calculate and store the total value of that chain using its length and your distance estimate, as outlined above.

The class should implement Comparable using the total value as the basis for comparison.

Use Your New Chain Manager

You should now be able to complete the implementation of your priority queue-based chain manager. A call to next should always return the chain that has the lowest value (length plus differences).

Augment your Doublets program so that it can now use the new search strategy as an option. That is, your program asks:

Enter chain manager (s: stack, q: queue, p: priority queue, x: exit): p

PS: once you finish this part, you will have implemented a cornerstone of artificial intelligence (AI): A* search! http://en.wikipedia.org/wiki/A*_search_algorithm

Provided Code

To enable you to focus on the non-routine parts of this program, we have provided you with a small amount of starting code and several test classes.

In ChainManager.java, we have given you a basic framework and some variable declarations that you may find helpful. Feel free to change anything except the signature of the methods, since we may write test code that calls these directly. You may wish to write additional methods to help with your computations.

There are unit tests for the Link, Chain, and Doublets classes, which are some of the first you'll need to write. PQPerformanceTest.java will check to see if your PriorityQueue-based solutions are better than the others. And of course you should test your complete interactive program.

Practical Notes

Recall that the Java option -Xmx allows one to specify the maximum heap size available to the program. You will likely need to increase this from Java's paltry default in order to do actual searches with the given links files. See here for a reminder about how to do this.

Turn-in Instructions

Turn-in your programming work by committing and pushing to your repository. (Of course, you should have committed many times by now). Be sure that your commit message contains the name(s) of the driver, as in "worked as pair, Mary drove".

Grading

We plan to use this list for assigning a grade:
Link unit tests 20
Chain unit tests 10
Doublets unit tests (for stack and queue chain managers) 22
PQPerformance unit tests (mostly on PriQ, 3 points are for saying no path when no path exists) 23
Program works interactively (-2 if crashes when start or end word isn’t in dictionary, -2 if crashes when start and end words have different lengths) 15
Efficiency. This is to make sure that getCandidates is O(1), it doesn't take more than 30 seconds to load Dict35, or reloads the dictionary each time it looks for a chain. 10
Style. Good layout, comments, variable names, good use of methods to break up the task, use of local variables instead of fields where appropriate, use meaningful constant names to avoid magic numbers. Up to -10 for egregious problems.
Total 100

Other thoughts

Optional

Because creating the links is a high-cost startup operation, you may choose to modify Link so that it is serializable. Then you may write the processed word list to disk and load the object directly on the fly.

Credits

This assignment was created by Rose CS/MA alumnus (2001) Jerod Weinman, who is now a professor at Grinnell College. Modifications and solution by Matt Boutell, Claude Anderson, and Nate Chenette. Helpful diagrams by Joe Hollingsworth. CSSE assistant Fred Zhang came up with the way of unit-testing the algorithm and wrote many of the unit tests. The excerpts of Lewis Carroll, published by Vanity Fair in 1879, are in the public domain.