As just mentioned, a text corpus is a large body of text.

We will wait until later before exploring each Python construct systematically.

The corpus contains over 10,000 posts, anonymized by replacing usernames with generic names of the form "UserNNN", and manually edited to remove any other identifying information.

The corpus is organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom).

We'll use NLTK's support for conditional frequency distributions.

These are presented systematically in Section 2, where we also unpick the following code line by line.
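Separately from the NLTK code referred to here, the idea behind a conditional frequency distribution can be sketched with the standard library alone: it is a collection of frequency distributions, one per condition. The sketch below is an illustration under stated assumptions, not NLTK's implementation. The helper name `conditional_freq_dist` is hypothetical, and the toy pairs (chatroom age group as the condition, word as the sample) are invented rather than drawn from any corpus.

```python
from collections import Counter, defaultdict

# Toy (condition, sample) pairs: invented for illustration only.
# Here the condition is a chatroom age group and the sample is a word.
pairs = [
    ("teens", "lol"), ("teens", "hi"), ("teens", "lol"),
    ("20s", "hi"), ("20s", "hello"),
    ("adults", "hello"), ("adults", "hello"), ("adults", "hi"),
]

def conditional_freq_dist(pairs):
    """Map each condition to a Counter of the samples seen under it."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd

cfd = conditional_freq_dist(pairs)

# Print the samples for each condition, most frequent first.
for condition in sorted(cfd):
    print(condition, cfd[condition].most_common())
```

NLTK's own `nltk.ConditionalFreqDist` is likewise built from a sequence of (condition, sample) pairs, so the same kind of pair list carries over once real corpus data replaces the toy data.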

