the Worthless Writeup Library @ szy.lol

libkoishi

work started 2020-07, launched 2020-11

Markov chain-based text generator for a chatbot.


Back in my day, before GPT (and transformers and neural networks and AI as a buzzword and …), when we wanted to write a chatbot, the go-to way to generate almost-sensible text was Markov chains. Pretty much any chatroom that housed programmers ended up with one such bot, likely home-grown, responding to nonsense with nonsense. Or at least the ones I’ve been frequenting did. They are quite effective, simple to comprehend, and a fun exercise to implement. As such, when the head nonsense generator of the chatroom I frequent started playing up, I took the opportunity to write my own.

My understanding of Markov chains is probably imperfect, lacking the mathematical precision that makes their article require 107 <math> tags on Wikipedia, but it’s good enough to implement something fun. A Markov chain can be visualized as a set of states and paths between them. Each path has an associated probability, representing how likely it is that that state will follow the current one. Based on the chain, a realistic pattern of states can be generated by randomly walking between the states.

In my implementation, the states represent letters, though it’s a popular choice to use words instead. For example, starting with a t, there is a high chance that an h will be picked, then e, making the generated text start with the. However, to make the text slightly more lucid, the probabilities are not recorded based only on the last letter, but on the last four. This is an arbitrary choice.

Additionally, instead of building a fixed chain up front, my implementations “study” language directly from chat messages, keeping track of how often each specific pattern is seen.

The first version, retroactively called v1, stored the chain as a “hash” table of “rules” (four-character contexts), each holding a linked list of “continuations” (possible following characters). The “hash” in hashtable is a lie: the slots encode the first two characters of the context, and the latter two are used to disambiguate within a slot, which is an inefficient and ugly design. Here follows a snippet from the header.

struct continuation_t {
	struct continuation_t *next;
	char character;
	uint32_t probability;
};
typedef struct continuation_t continuation_t;

struct rule_t {
	struct rule_t *next;
	char suffix[2];
	uint64_t probtotal;
	continuation_t *cont;
};
typedef struct rule_t rule_t;

typedef struct markovdict_t {
	rule_t *hashtable[0x10000]; // 0x0 -> 0xFFFF
} markovdict_t;

markovdict_t* createdict(void);
void trainmarkov(markovdict_t *dict, const char* string);
void createstring(markovdict_t *dict, char* buf, size_t buflen);
void savedict(markovdict_t* dict, FILE* f);
markovdict_t* loaddict(FILE* f);

The main API consists of the trainmarkov() and createstring() functions, the first “training” the chain and the second generating text. This implementation was used for a couple of months, but it gradually stopped working due to being byte-based yet accepting Unicode. I thought (and still do) that a four-character context is enough to correctly reproduce valid UTF-8 sequences, but at some point the bot started getting randomly disconnected. It turned out it was sending illegal UTF-8 sequences, and being kicked by the WebSocket layer. As a band-aid fix, all bytes with the high bit set were filtered out, but this heavily lobotomized the model. To fix this, libkoishi v2 was written.

Version 2 ended up a bit more sophisticated. Not only did it change its unit from bytes to Unicode codepoints, it also got an actual hashmap, and at some point, to optimize memory usage, I made it store characters in memory with a variable-length encoding. This is further explained in the header comments (thankfully), parts of which follow.

struct ksh_continuations_t {
	struct ksh_continuations_t *next;
	char buf[KSH_BUFFER_PER_CONT];
	/* the buffer contains pairs of probs and chars, where probs are 4 bytes
	 * long (little endian), and chars are LEB128. this variable length encoding
	 * allows us not to waste so much ram on 0s on our mostly-ASCII corpus.
	 * the unused part is to be all 0s, so that a read of all 0s for a prob
	 * signals end of current buffer fill, and the need to try the next buffer.
	 */
};
typedef struct ksh_continuations_t ksh_continuations_t;

struct ksh_rule_t {
	struct ksh_rule_t *next;
	int64_t probtotal;
	ksh_continuations_t *cont;
	char buf[KSH_BUFFER_PER_RULE];
	/* this buffer works the same as the continuation buffer, with the
	 * exception that it begins with 4 LEB128-encoded chars, to mark
	 * the rule name.
	 */
};
typedef struct ksh_rule_t ksh_rule_t;

struct ksh_model_t {
	int mapsize; // hashmap[2^mapsize], preferably between 8 and 20
	ksh_rule_t **hashmap;
	int64_t (*rng)(void*, int64_t);
	void *rngdata;
	struct ksh__arena *arena;
};
typedef struct ksh_model_t ksh_model_t;

ksh_model_t *ksh_createmodel(int mapsize, int64_t (*rng)(void*, int64_t), uint32_t seed);

void ksh_trainmarkov(ksh_model_t *model, const char *str);
void ksh_createstring(ksh_model_t *model, char *buf, size_t bufsize);
void ksh_continuestring(ksh_model_t *model, char *buf, size_t bufsize, const char *from);

Overengineered? That’s normal for a for-fun project, or at least should be. It’s fun!

The model now additionally lets the user provide a custom random number generator (a default is available), which allowed me to make the “tests” reproducible, and internally uses an arena allocator, both to reduce memory overhead and to speed up allocation. This version is used to this day, as of writing.

The chatbot was asked for comment.

esperminakaharacter it wonder pad
neet look good damn
i do no put micross
:love:

She’d end up smarter with a better data source. Alas.