I made a very simple corpus-driven chatbot that replies to you with the line of text from the corpus that comes after/is in response to the line of text most similar to what you just typed. here's a sample interaction where the database is built from the Cornell Movie Dialogs corpus... (my typing in green, bot response in blue; I typed the first turn and it alternates after that)
this is the Cornell Movie Dialogs corpus btw http://www.cs.cornell.edu/~cristian//Cornell_Movie-Dialogs_Corpus.html
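For anyone who wants to try the idea, here's a minimal sketch of the "reply with whatever came after the closest line" loop. It is not the code behind the bot above: the similarity measure (`difflib.SequenceMatcher` from the Python standard library), the helper names `build_pairs` and `reply`, and the four-line toy corpus are all assumptions made for illustration. It also treats the corpus as one flat list of lines, so a real dataset would need conversation boundaries handled separately.

```python
import difflib

def build_pairs(lines):
    """Pair each corpus line with the line that follows it."""
    return list(zip(lines, lines[1:]))

def reply(user_text, pairs):
    """Return the follow-up to the corpus line most similar to user_text."""
    def score(pair):
        prompt, _ = pair
        # Character-level similarity in [0, 1]; a stand-in for any similarity measure.
        return difflib.SequenceMatcher(None, user_text.lower(), prompt.lower()).ratio()
    _, best_response = max(pairs, key=score)
    return best_response

# Toy corpus (made up for this example).
corpus = [
    "Where are you going?",
    "To the store, we are out of coffee.",
    "Did you lock the door?",
    "Yes, twice.",
]
pairs = build_pairs(corpus)
print(reply("where are you going", pairs))
```

Scanning every pair on each turn is fine for a small corpus; a larger one would probably want some kind of index instead of a linear search.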
I have a similar thing based on IRC logs (though with a bunch of other features dumped in) & found that, because people in chat often respond out of order or carry on multiple conversations at once, there's a fairly high error rate. (In my case, I took the idea from a bot that did the same thing and used to hang out in freenode #d around 2006.)
Seems like using a movie corpus would minimize this, since nobody but Orson Welles has crosstalk dominate their scripts!
@enkiv2 @janellecshane I actually think the movie corpus is uniquely unsuited to "general chatbot" format, since every line is trying to move the action forward and references rich on-screen context. you kinda have to start playing improv games with this bot for it to feel like it "works." (the movie dialog thing isn't an inherent part of this—I have a lot of students who want to turn their chat logs or whatever into "bots" and wanted to have some pre-baked code to give them to get started)
I found that with IRC logs in particular there were more non sequiturs than I would like. But movies take place in an environment, so you'd get injected references to stuff that doesn't exist, etc.
How are you ranking similarity?
My implementation had a couple different policies for that, basically ranking word importance in different ways and producing a score based on shared words weighted by importance.
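A rough sketch of one such policy, for illustration only: score a candidate line by summing the importance of the words it shares with the query. The weighting used here (inverse document frequency plus one) is an assumption, one common way to rank word importance and not necessarily any of the policies described above. The names `tokenize`, `idf_weights`, and `score` are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))

def idf_weights(lines):
    """Weight each word by how rare it is across lines (rarer = more important)."""
    n = len(lines)
    df = Counter(word for line in lines for word in tokenize(line))
    return {word: math.log(n / count) + 1.0 for word, count in df.items()}

def score(query, candidate, weights, default=1.0):
    """Sum the importance weights of the words shared by query and candidate."""
    shared = tokenize(query) & tokenize(candidate)
    return sum(weights.get(word, default) for word in shared)

# Toy corpus (made up for this example).
lines = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the weather is nice today",
]
weights = idf_weights(lines)
best = max(lines, key=lambda line: score("is the weather ok", line, weights))
print(best)
```

Swapping in a different policy only means replacing `idf_weights` with another function that returns a word-to-weight mapping; the scoring step stays the same.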