This is a pair programming assignment. You may talk with others about the ideas in this assignment, but you must write all code with your assigned partner. If you need specific help with your code, please get help from the instructor or one of the student assistants. You may get help from other students during the debugging phase.
You should
java.util
package
Your task is to implement two different algorithms, which will be performed in sequence.
The first is the Markov Chain algorithm. (Follow the link for an in-depth description.) It takes as input some sequence of "words" and generates as output a constrained random sequence composed from some or all of the original "words". The new sequence has similar statistical properties to the original sequence.
The second algorithm is a Text Justification algorithm. (Follow the link for an in-depth description.) This algorithm should take as input some intermediate data representation from the Markov Chain algorithm and print that data to the console so that it is both left and right justified.
You are to write a Java application whose
main
method resides in a class named
Markov
. The provided Eclipse project already implements this
main
method, and calls a constructor (stub) in the class
Chain
. Your task is to complete the constructor and the other method stubs in the
Chain
class, creating other classes and methods as needed.
The application expects four command-line arguments (in the following order):
For example, the Unix command
java Markov Oz.txt 3 400 75
generates random text based on the contents of the file
Oz.txt
with Markov Chains of length 3. It generates text until the Markov Chain algorithm naturally terminates or until 400 words are generated, whichever comes first. It justifies the output so that each output line occupies exactly 75 characters (except the last line).
A "word" shall be defined as any sequence of non-whitespace characters, so you will not need to treat punctuation or capitalization in any special way when "parsing" the input. Thus the following will be considered four different words:
book
Book
"Book
book,
You are required to use the data structures specified in this document for pedagogical reasons. Later in the course you will have a chance to choose the data structures you think best fit the problem or implement your own, but at this point it is important to make sure you can figure out how to use those specified here.
Note that this section will not make much sense if you have not yet read the in-depth descriptions of the Markov Chain and Text Justification algorithms.
FixedLengthQueue
object as an instance variable that holds the actual words in the prefix. The FixedLengthQueue
class is included in the provided template.
java.util.HashMap
object. You must be sure that the keys (which should be Markov prefixes) have a sensible
public int hashCode()
method.
In sections 16.3-16.4, Horstmann discusses hash codes. The
hashCode()
example in that section should be useful to you. Each table-entry will have a key that is a prefix and an associated value that is a (or possibly has a, depending on what you want to do) MultiSet
containing the suffixes that go with the given prefix.
MultiSet
, a class that we provide. Each item in the collection represents one suffix that corresponds to the given prefix. This should allow fast insertion of new items and fast search for the
Kth
item when the random number generator says to use the
Kth
suffix. A trivial sample program that uses a
MultiSet
is included in
examples.html
.
java.util.LinkedList
class along with an array or
ArrayList
as described in
the Text Justification algorithm description.
If the number of output words is small compared to the text from which the output is constructed, we would expect that most of the program's time would be spent building the data structures, and not much time producing the output.
It is possible that the default amount of memory allocated by the Java interpreter will not be enough. To increase to 50 (or substitute your own number) megabytes, use the
-Xmx
option when you invoke the interpreter: "java -Xmx200m Markov last-of-the-mohicans.txt 3 300 60
". This
-Xmx200m
option will be used when we test your program. In Eclipse, on the
Arguments
tab of a Run Configuration (Run → Run...), use can add the
-Xmx200m
as a
VM Argument.
Before you begin testing your program, you should create your own small data file, designed to produce a relatively small data structure that still provides choices for the random sequence generator. The words in the file do not have to make sense. Once you are confident that your program is working, you will want to try it on larger files that provide interesting random text.
Some large texts on which you may test your program are included in the provided Eclipse template. Due to memory requirements, you may not be able to process the longer ones with prefix-lengths greater than two, but a prefix length of two gives fairly interesting results.
Please see these turn-in instructions for details on submitting and testing your code.
This assignment will be graded in two parts. The first part must generate appropriate Markov chained output, but need not be justified. Grading for the first part will treat any amount of white space in the output as a delimiter. The program should take the four arguments specified above, but may ignore the last argument, justification width.
These questions and answers will probably not make much sense if you have not yet read the Markov Chain and Text Justification algorithm descriptions. Variations on this assignment have been given before. These are questions that have come up before.
Questions | Answers |
---|---|
How can we choose a random suffix from a
MultiSet ?
|
Generate a random number,
k, between 0 and size(), then use the
findKth
method from the
MultiSet
class.
|
Can we assume that the length of the line being outputted is longer than the length of the prefix? | I presume you mean "the first prefix". Since that prefix contains only one word, the answer is yes. You may assume that no input word is longer than the line length. |
It is OK to not align the last line of output or is this mandatory? | It would look pretty ugly if you did! The last line should be left-justified, of course, but not right-justified. |
Can we assume that the command line arguments will be of the correct type and in the right order (the order given in the example)? | Yes. You do not need to do any input error checking. Getting correct output for correct input is quite enough for you to do in ten days. |
Also, will the text file name given at the command prompt include the ".txt"? | Yes. Whatever the full filename is, it should be included on the command line. |
Do you really mean that there should be two spaces after every period, question mark, etc? | No. Only if the period or whatever ends a word (defined for this assignment as a sequence of non-whitespace characters). |
How do I know whether a character is a whitespace character? |
One approach is to use the
Character.isWhiteSpace
method. It would be very nice to set up a StringTokenizer object that considers all whitespace characters as delimiters. This would be difficult if you wanted all Unicode whitespace characters; perhaps it is easier if you restrict your attention to ASCII whitespace characters, which you are allowed to do. Also see
examples.html
for an explanation of a
String 's
split
method.
|
Are we ever supposed to call the
hashCode()
method that we must write?
|
No. The
HashMap
methods will call it when needed.
|
What if I can't get the Markov algorithm working at all? Can I still do the justification part and get some credit? | Definitely. I suggest working on the two parts "in parallel". If you are working on the justification part, you can write a very simple main() program to test it. All that this program must do is (like Markov) feed the output routine a word at a time, so that the output routine can justify and print out the words whenever a line gets filled. So your final program should simply justify the first part of the original text, to demonstrate the justification part. |
What's the worst behavior my program can exhibit (this question came from me, not from a student)? | An infinite loop. You should make sure that your program always terminates and gives some kind of output to show what it is doing, even if that output is not correct. |
What is the point of Markov, besides the DataStructures practice? Most of our programs up to this point could have some use in the real world. I am just not sure what, if anything, Markov can be used for. |
The voice in this answer is Dr. Anderson's. A very good question. Let me give you an analogy. At one point, I started taking piano lessons. Something I had always planned to do "when life slows down a bit." (I finally decided that it never will, and that if I don't do it now, I'll never get around to it). Then life got REALLY busy and I gave it up, but I WILL get back to it someday! No, really :)What did my teacher have me spend most of my time doing? Scales and warm-up exercises. I could ask "Will I ever use this in the real world?" What concert pianist or even bar room pianist would ever play a scale or a warm-up exercise for an audience? My reaction could be to say that I shouldn't have to spend time practicing scales, etc. because they are not "practical." But fortunately I did not do that, and I can already see how practicing those boring pieces with no real-world application has made me more agile at playing in general, so that I can play the (quite simple, so far) "real" pieces better than if I had never practiced the scales. This course is a prerequisite for many other CSSE courses. Mostly, it its the CSSE equivalent of scales and warm-ups. Hopefully by the time you are done this course, you will be able to run up and down the scales at a rapid speed, so that later when you do applications (in Operating Systems and other courses, and presumably after graduation), you will be able to concentrate on the problems you are solving, because grinding out the code will now be easy. By the end of this course I want you to be able to implement and use data structures almost as effortlessly as a concert pianist plays an E-flat minor scale (notice that I did not say C major scale, that's analogous to CSSE 220; the pianist has to work quite a while to get the E-flat scale, but eventually it becomes easy). If that process allows you to do some interesting programs with immediate applications along the way, that's a bonus. But it's not the only point of the programming part of this course. |