I made a very simple corpus-driven chatbot that replies to you with the line of text from the corpus that comes after/is in response to the line of text most similar to what you just typed. here's a sample interaction where the database is built from the Cornell Movie Dialogs corpus... (my typing in green, bot response in blue; I typed the first turn and it alternates after that)
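
A minimal sketch of the lookup described above, not the author's actual code: pair each corpus line with the line that follows it, find the stored line most similar to the input, and return its follow-up. The names `build_pairs` and `reply` are hypothetical, the four-line corpus is invented, and Jaccard word overlap is only a stand-in for whatever similarity measure is really used.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_pairs(lines):
    """Pair every line with the line that comes right after it."""
    return [(lines[i], lines[i + 1]) for i in range(len(lines) - 1)]


def reply(user_text, pairs):
    """Return the follow-up of the stored line most similar to user_text."""
    query = tokenize(user_text)
    _, best_response = max(pairs, key=lambda p: jaccard(query, tokenize(p[0])))
    return best_response


# Invented stand-in corpus; a real one would be loaded from dialog data.
corpus = [
    "Where are you going?",
    "To the station, the train leaves at nine.",
    "Do you want some coffee?",
    "Yes, black, no sugar.",
]
pairs = build_pairs(corpus)
print(reply("where are we going", pairs))
```

Pairing every line with its successor is a simplification: a dialog corpus normally marks conversation boundaries, and pairs should not cross them.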

@aparrish That escalated quickly O.o

Still, I can't believe how convincing the bot sounded for the first half of that conversation. What a great idea. Maybe it could be made even more realistic by hand-picking a… slightly less dramatic corpus.

@aparrish started strong, developed some real attachment issues at the end there

@aparrish tfw two pals, on a road trip, talk about my ailing mom

@aparrish This looks like so much fun! Let me know if you post it somewhere.

@janellecshane I don't know if I can post the thing itself, since the corpus itself is under copyright! but I'm making a tutorial on how to do it yourself, which should be available some time in the next few weeks

@aparrish @janellecshane
I have a similar thing based off of IRC logs (though with a bunch of other features dumped in) & found that, because in chat people often respond out of order or engage in multiple conversations at once, there's a fairly high error rate. (In my case, I took the idea from a bot that did the same thing and used to hang out in freenode #d around 2006.)

Seems like using a movie corpus would minimize this, since nobody but Orson Welles has crosstalk dominate their scripts!

@enkiv2 @janellecshane I actually think the movie corpus is uniquely unsuited to "general chatbot" format, since every line is trying to move the action forward and references rich on-screen context. you kinda have to start playing improv games with this bot for it to feel like it "works." (the movie dialog thing isn't an inherent part of this—I have a lot of students who want to turn their chat logs or whatever into "bots" and wanted to have some pre-baked code to give them to get started)

@aparrish @janellecshane
That's a good point.

I found that with IRC logs in particular there were more non sequiturs than I would like. But movies take place in an environment, so you'd get injected references to stuff that doesn't exist, etc.

How are you ranking similarity?

My implementation had a couple different policies for that, basically ranking word importance in different ways and producing a score based on shared words weighted by importance.
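
One possible importance policy, sketched under assumptions: the post doesn't say which weightings were used, so this picks inverse document frequency, where words that appear in fewer log lines count for more. The function names and the three-line log are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def idf_weights(lines):
    """Weight each word by how rare it is across lines (smoothed IDF)."""
    n = len(lines)
    doc_freq = Counter(word for line in lines for word in tokenize(line))
    return {word: math.log(n / count) + 1.0 for word, count in doc_freq.items()}


def weighted_overlap(a, b, weights, default=1.0):
    """Sum the importance weights of the words that a and b share."""
    shared = tokenize(a) & tokenize(b)
    return sum(weights.get(word, default) for word in shared)


# Invented stand-in for a chat log.
logs = [
    "did you see the game",
    "the compiler is broken again",
    "is the server down",
]
weights = idf_weights(logs)
print(weighted_overlap("is the compiler ok", logs[1], weights))
print(weighted_overlap("is the compiler ok", logs[2], weights))
```

With this weighting, sharing a rare word like "compiler" raises the score more than sharing "the", so the second log line outranks the third for that query.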

@enkiv2 @janellecshane i did similarity by averaging word vectors. yknow like i always do
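
A rough sketch of similarity by averaged word vectors, assuming cosine similarity as the comparison (the post doesn't say). The 3-dimensional vectors in `VECTORS` are made up for illustration; a real version would load pretrained embeddings instead.

```python
import math

# Hypothetical toy "embeddings", invented for illustration only.
VECTORS = {
    "hello": [0.9, 0.1, 0.0],
    "hi": [0.8, 0.2, 0.1],
    "there": [0.5, 0.5, 0.0],
    "train": [0.0, 0.1, 0.9],
    "station": [0.1, 0.0, 0.8],
}


def sentence_vector(text):
    """Average the vectors of the known words; None if no word is known."""
    vecs = [VECTORS[w] for w in text.lower().split() if w in VECTORS]
    if not vecs:
        return None
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(a, b):
    """Compare two sentences by the cosine of their averaged word vectors."""
    u, v = sentence_vector(a), sentence_vector(b)
    if u is None or v is None:
        return 0.0
    return cosine(u, v)


print(similarity("hello there", "hi there"))
print(similarity("hello there", "train station"))
```

Unlike shared-word scoring, this can rate "hello there" and "hi there" as close even though they share only one word, because the vectors for "hello" and "hi" point in similar directions.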
