Copyright (C) 2001, Particle Corp., Alex S.
To get a description of what's happening inside the program, and how the database is organized, look through the description.html file.
This document is to give you an overview of what each file does, how to go about modifying Prof.Phreak, how to make it work, etc., and mentions some of the problems.
There are currently several database files (you can have more if you like): one is mostly for direct keyword matches, another mostly for pattern matching, another mostly for commands, and still another for keywords that we want to exclude from mathematical expressions. This split is not strict though, and I had to move things around a bit to make it give intelligent responses.
Reason for several databases: The idea of the search is to find the maximum size match. Some of the more general rules like "what (.*)" were matching rather large sequences, and easily defeated rules like "what is the weather" when the user would say something like "what is the weather today". So the more direct patterns were put into the first database, and the general rule "what (.*)" is hit in the second database only if no direct rule like "what is the weather" is found in the first.
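The effect described above can be sketched in a few lines. This is only an illustration in Python, not the Perl code from prof.pl: it assumes that "match size" means the length of the text a pattern matched, and the function names best_match and tiered_match are made up for the example. See description.html for how the real search works.

```python
import re

def best_match(rules, text):
    """Return (pattern, matched_length) for the rule that matched the most text, or None."""
    best = None
    for pattern in rules:
        m = re.search(pattern, text, re.IGNORECASE)
        if m:
            size = m.end() - m.start()  # assumed definition of "match size"
            if best is None or size > best[1]:
                best = (pattern, size)
    return best

def tiered_match(databases, text):
    """Search the databases in order and stop at the first one that has any match."""
    for rules in databases:
        hit = best_match(rules, text)
        if hit is not None:
            return hit
    return None

specific = [r"what is the weather"]   # stands in for the direct-match database
general = [r"what (.*)"]              # stands in for the template database
text = "what is the weather today"

# One big database: the general rule swallows the whole input and wins.
single = best_match(specific + general, text)

# Two databases searched in order: the direct rule is found first.
tiered = tiered_match([specific, general], text)
```

With everything in one list the general rule wins, because it matches the entire input. Searching the direct rules first lets "what is the weather" answer the question, and the general rule still catches inputs like "what time is it" that no direct rule covers.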
File phreak_s.db has all the simple direct matches (and is searched first), and file phreak_t.db has all the big template matches. File phreak_math_exclude.db contains keywords which we want to exclude from math expressions (so that parts of those words are not confused for numbers). File phreak_c.db contains rules that have code/commands attached to be evaluated.
Note: the convention has not worked for some rules, so, the current database is a huge big mess.
The primary code is in prof.pl. That's the file used for testing the database. You just run the program, type in input, and it prints out output. Every time you give it input, it reloads all the databases. This lets you modify the database, and try out some rule without having to restart the program. prof.pl is the program used for development, and testing/modification of the database. It is *the* code.
It is not advisable to call this program from a CGI script, since that would create another process (the last thing you want inside a CGI script is to create new unnecessary processes). For this, there is a separate CGI script that includes all the code from prof.pl.
In order to provide a CGI interface, a MakeHTML file was created. That's the profphreak_ss.cgi file. You can use the MakeHTML processor to compile it into a fully functioning CGI script (you might have to modify it to make it look reasonable though). You can get MakeHTML at http://www.theparticle.com/
Note: ALL the prof.phreak code that's in the CGI is inside the prof.pl. The MakeHTML file is just a wrapper to give the code a nice web interface. (you can easily create your own wrapper, etc...)
One idea I've been thinking about is to have a Java Applet connect to some primitive CGI for interaction (as opposed to having an HTML interface). The Applet might be more user friendly.
You can now write little bits of code for each rule. The rule is now broken up into three parts: regular expressions##replies##code, where you can have many regular expressions separated by a single #, and many replies, also separated by a single #. The code is regular Perl code that can use the @vars array, which is the array of regular expression variables found. You can modify values of that array to insert your own computed things (e.g. dates, times, etc.), and the variable replacement done afterwards will replace the variables with your values from the code.
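To make the three-part layout concrete, here is a small Python sketch that splits one rule line into its parts. It is an illustration only, not how prof.pl reads the database. The sample rule line and the parse_rule name are invented for the example, and the sketch assumes the code part is optional and is kept as an opaque string of Perl.

```python
def parse_rule(line):
    """Split 'patterns##replies##code' into a dict; the code part may be missing."""
    # Split on the first two '##' only, so a '##' inside the code stays intact.
    parts = line.rstrip("\n").split("##", 2)
    patterns = parts[0].split("#")                      # alternatives separated by one '#'
    replies = parts[1].split("#") if len(parts) > 1 else []
    code = parts[2] if len(parts) > 2 else ""
    return {"patterns": patterns, "replies": replies, "code": code}

# Hypothetical rule line, made up for illustration (not taken from phreak_c.db).
line = "what time is it#what is the time##Let me check.#One moment.##$vars[0] = scalar localtime;"
rule = parse_rule(line)
```

Note that this naive split would break a regular expression or reply that contains a literal # character; the README does not say how the real database handles that case.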
You can take a look at the phreak_c.db for an example of this.
Note, you can also call functions from the main Prof.Phreak code (to convert numbers to strings, etc.). You can also call on other Perl modules/files to perform more complicated tasks. Because the code is only used for a single reply, the idea is that you should not devote huge resources just to make that one reply perfect (i.e. a line of code should be enough for most purposes). You can start other processes, access some SQL database, etc., whatever you want from these scriptlets.
Obviously, the strength of this program lies with its database. The bigger and more accurate the database, the better the replies it gives. So, reorganizing and modifying the database is the primary mode of modification.
(tip: keep logs of conversations, then read them, and whenever prof.phreak goofed at the answer, stick in something into the database to correct it; this iterative approach works quite well.)
Modifying the program itself to handle more specific databases. If someone wants to talk about art, talk about art. If someone wants to talk about philosophy, talk about philosophy, etc. You can also keep the 'topic' of the conversation as an indicator of which database to search first (so, if you last talked about the weather, and then you get a query like "how is it", you might reply "hot. I hope it gets cooler soon." or something like that.)
Store specific information about the conversation. If someone told you their name, try to put it into some replies. (e.g. "What do you think?" ... "Well, John, I think a lot about a lot of things. But I won't tell you what I think."). These things can be very convincing. Enable some rules to be fired only if you possess some information about the conversationalist.
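One possible shape for this idea, sketched in Python. Nothing like this exists in prof.pl as far as this README says: the memory dict, the "my name is" pattern, the {name} placeholder and both function names are assumptions made up for the illustration.

```python
import re

def remember_name(memory, text):
    """Store the speaker's name if the input looks like 'my name is X'."""
    m = re.search(r"\bmy name is (\w+)", text, re.IGNORECASE)
    if m:
        memory["name"] = m.group(1)

def personalize(memory, reply_with_name, plain_reply):
    """Use the personalized reply only if we actually know the name."""
    if "name" in memory:
        return reply_with_name.format(name=memory["name"])
    return plain_reply

memory = {}
with_name = "Well, {name}, I think a lot about a lot of things."
plain = "I think a lot about a lot of things."

before = personalize(memory, with_name, plain)   # nothing known yet
remember_name(memory, "Hi, my name is John")
after = personalize(memory, with_name, plain)    # name is now available
```

The same check ("do we know X about this person?") could gate whole rules, so that a rule is only considered when the information it needs has been collected.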
To store the conversation, and reply not just to the 'current' sentence, but to the whole conversation (concentrating on the last statement the most).
Rewriting the program in another language (Java might be fun...)
Speeding up the database search. It is reasonably fast as it is, but it is extremely wasteful of resources.
Using scriptlets, allow Prof.Phreak to perform useful operations on behalf of the user (take note of security concerns). Embed a search engine... for example: "Do you know where I can find information on distributed computing?"... Instead of replying "No, I don't know where you can find information on distributed computing." make Prof.Phreak go out to some search engine, do the query, and reply with a list of URLs (in a cleverly phrased fashion of course ;-). The possibilities for these kinds of things are unlimited... just think of what you'd want if you had a human interface to the Internet, and then write it ;-)
The major problem with Prof.Phreak is the database mess. This problem leads to another related problem: some rules in the database are unreachable. This means that no matter what the user inputs, the input will be matched by some other rule, and will never hit the nicer rule that was written for it.
This is a big problem. It unnecessarily increases the size of the database, and wastes all the work put into creating those unreachable rules in the first place. It also makes the program dumber, since your effective rules database is smaller than it really is.
One way to solve it is to write test suites that include every rule along with inputs that should trigger it. Then, after every database modification, run through the test suite, and make sure that all the rules in the database get called (and no rule obscures another rule). If one does, fix it so that it doesn't. You might have to get rid of the old rule, modify the new rule to be a bit more picky so that the old rule still gets called on the desired criteria, etc.
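Such a test suite could look roughly like the Python sketch below. It is not part of Prof.Phreak: pick_rule is a simplified stand-in for the real search (databases tried in order, largest match wins, with match size assumed to be the length of the matched text), and the rules and sample inputs are made up.

```python
import re

def pick_rule(databases, text):
    """Simplified stand-in for the search: first database with a hit, largest match wins."""
    for rules in databases:
        best, best_size = None, -1
        for pattern in rules:
            m = re.search(pattern, text, re.IGNORECASE)
            if m and m.end() - m.start() > best_size:
                best, best_size = pattern, m.end() - m.start()
        if best is not None:
            return best
    return None

def find_shadowed(databases, cases):
    """cases maps each rule to an input that should reach it.
    Returns (expected_rule, sample_input, rule_that_fired) for every failure."""
    problems = []
    for expected, sample in cases.items():
        actual = pick_rule(databases, sample)
        if actual != expected:
            problems.append((expected, sample, actual))
    return problems

# Both rules in one database, which is the messy situation described above.
databases = [[r"what (.*)", r"what is the weather"]]
cases = {
    r"what (.*)": "what do you mean",
    r"what is the weather": "what is the weather today",
}
shadowed = find_shadowed(databases, cases)

# Rules with no test case at all are worth listing too.
untested = [p for rules in databases for p in rules if p not in cases]
```

Here the suite reports that "what is the weather" is obscured by "what (.*)" for the input "what is the weather today". Moving the direct rule into an earlier database, or making the general rule pickier, makes the report come back empty.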
This program is being released under the GNU General Public License (GPL). Read the gpl.txt file for details.