I made a very simple corpus-driven chatbot that replies to you with the corpus line that immediately follows (i.e., is the response to) the line of text most similar to what you just typed. here's a sample interaction where the database is built from the Cornell Movie Dialogs corpus... (my typing in green, bot response in blue; I typed the first turn and it alternates after that)
this is the Cornell Movie Dialogs corpus btw http://www.cs.cornell.edu/~cristian//Cornell_Movie-Dialogs_Corpus.html
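The retrieve-the-next-line idea above can be sketched in a few lines of Python. This is a hypothetical minimal version, not @aparrish's actual code: it assumes the corpus has already been split into `(prompt, response)` line pairs, and it uses plain word-overlap (Jaccard) similarity just to show the mechanism.

```python
import re

def tokenize(line):
    """Lowercase and split a line into a set of word tokens (punctuation dropped)."""
    return set(re.findall(r"[a-z']+", line.lower()))

def reply(user_input, pairs):
    """Return the response paired with the corpus line most similar to user_input."""
    user_words = tokenize(user_input)

    def score(pair):
        # Jaccard similarity: shared words / total distinct words.
        prompt_words = tokenize(pair[0])
        union = user_words | prompt_words
        return len(user_words & prompt_words) / len(union) if union else 0.0

    best_prompt, best_response = max(pairs, key=score)
    return best_response

# Tiny stand-in corpus (real pairs would come from the movie dialogs).
pairs = [
    ("where are we going", "Somewhere you've never been."),
    ("i missed you", "I know. I'm sorry."),
    ("that's really cool", "You haven't seen anything yet."),
]
print(reply("where are you going?", pairs))
```

A real version built on the Cornell corpus would index many thousands of pairs, so you'd want a faster similarity lookup than this linear scan, but the core loop is just this.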
@aparrish THAT'S REALLY COOL
@aparrish started strong, developed some real attachment issues at the end there
@aparrish tfw two pals, on a road trip, talk about my ailing mom
@aparrish This looks like so much fun! Let me know if you post it somewhere.
@janellecshane I don't know if I can post the thing itself, since the corpus itself is under copyright! but I'm making a tutorial on how to do it yourself, which should be available some time in the next few weeks
I have a similar thing based off of irc logs (though with a bunch of other features dumped in) & found that, because in chat people often respond out of order or engage in multiple conversations at once, there's a fairly high error rate. (In my case, I took the idea from a bot that did the same that used to hang out in freenode #d around 2006.)
Seems like using a movie corpus would minimize this, since nobody but Orson Welles has crosstalk dominate their scripts!
@enkiv2 @janellecshane I actually think the movie corpus is uniquely unsuited to "general chatbot" format, since every line is trying to move the action forward and references rich on-screen context. you kinda have to start playing improv games with this bot for it to feel like it "works." (the movie dialog thing isn't an inherent part of this—I have a lot of students who want to turn their chat logs or whatever into "bots" and wanted to have some pre-baked code to give them to get started)
I found that with IRC logs in particular there were more non sequiturs than I would like. But movies take place in an environment, so you'd get injected references to stuff that doesn't exist, etc.
How are you ranking similarity?
My implementation had a couple different policies for that, basically ranking word importance in different ways and producing a score based on shared words weighted by importance.
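One way to read "shared words weighted by importance" is an inverse-document-frequency weighting, where rare words count for more than common ones. This sketch is my guess at that kind of policy, not the poster's actual implementation:

```python
import math
from collections import Counter

def idf_weights(prompts):
    """Weight each word by inverse document frequency over the corpus prompts:
    words that appear in fewer prompts get higher importance."""
    doc_freq = Counter()
    for prompt in prompts:
        doc_freq.update(set(prompt.lower().split()))
    n = len(prompts)
    return {word: math.log(n / df) for word, df in doc_freq.items()}

def weighted_overlap(user_input, prompt, weights):
    """Score a candidate prompt by summing the importance of shared words."""
    shared = set(user_input.lower().split()) & set(prompt.lower().split())
    return sum(weights.get(word, 0.0) for word in shared)

prompts = ["where are we going", "where is my mom", "i missed you so much"]
weights = idf_weights(prompts)
# "where" appears in two prompts, so it contributes less than the rarer "mom".
print(weighted_overlap("how is your mom", "where is my mom", weights))
```

Different weighting functions (raw counts, TF-IDF, stopword filtering) plug into the same shared-word scoring shape, which matches the "couple different policies" framing above.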
Wow, that got heavy in a hurry!
@aparrish Source code anywhere?