IRE-2012 tip sheet for journalists
Muse: A tool for working with email archives
Download link. Send questions or comments to Sudheendra Hangal, hangal@cs.stanford.edu

BACKGROUND: Muse is a research tool from Stanford Computer Science for browsing large email archives. It was originally meant for people to browse their own long-term email archives. We have now started adapting it for journalists, archivists and researchers. Please send us feedback! You can also watch an hour-long talk about Muse.

DATA FORMATS: In all cases, try to obtain the archive in a digital email format; avoid PDF files. Common email formats are 1) mbox, the canonical and best open format for email, 2) .PST (Outlook) and .DBX (Outlook Express), which are proprietary Microsoft formats, and 3) Eudora. If possible, convert the messages to mbox format. Mbox files can be read by many email clients, including Thunderbird (a cousin of Firefox), Apple Mail and Eudora (see this page). To convert between formats, use an email converter like Mailstore Home (free for personal use) or Emailchemy (commercial, but inexpensive). Muse can also fetch email from one or more online email accounts (Gmail, Hotmail, corporate/university servers, etc.) for which you have the password. It will try to automatically fix bad formatting, merge identities (the same person with different accounts or name spellings), remove duplicates, etc.
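
CHECKING AN MBOX FILE YOURSELF: Before loading an archive into Muse, it can help to confirm that a converted mbox file is readable and to get a rough idea of what it contains. The sketch below uses only Python's standard mailbox module. It is not part of Muse and does not reflect how Muse works internally; the function name summarize_mbox and the choice to treat messages with the same Message-ID as duplicates are assumptions made for illustration.

```python
import mailbox
from collections import Counter
from email.utils import parseaddr


def summarize_mbox(path):
    """Count messages in an mbox file, skip repeats, and tally senders.

    Assumption for this sketch: two messages with the same Message-ID
    header are duplicates. Messages without a Message-ID are all kept.
    """
    box = mailbox.mbox(path, create=False)  # do not create the file if it is missing
    seen_ids = set()
    senders = Counter()
    total = 0
    try:
        for message in box:
            total += 1
            message_id = str(message.get("Message-ID") or "").strip()
            if message_id:
                if message_id in seen_ids:
                    continue  # already counted this message
                seen_ids.add(message_id)
            # parseaddr splits 'Ann <ann@example.com>' into name and address
            _name, address = parseaddr(str(message.get("From", "")))
            senders[address.lower()] += 1
    finally:
        box.close()
    return {"total": total, "unique": sum(senders.values()), "senders": senders}
```

For example, summarize_mbox("archive.mbox")["senders"].most_common(10) would list the ten most frequent sender addresses, where archive.mbox is a placeholder file name.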

USES OF MUSE
1) Automatically create summaries of messages per month (by identifying the top terms mentioned in email) so you can get a quick sense of the contents of the archive.
2) Automatically group people who “go together” (i.e. who are frequently recipients of the same message) and show you patterns of activity over time with each group. You can also manually edit the groups if you like.
3) Perform sentiment analysis to detect messages that are likely to reflect certain emotions such as congratulations, anger, conflict, milestones, family events, trips, etc. You can customize the categories to any terms you want.
4) View all attachments in one screen, or copy them out as files.
5) Browse messages quickly, turning facets on and off. You can quickly zoom into messages reflecting a certain sentiment, exchanged with a particular group, from a certain folder, etc.
6) Rapidly skim hundreds or thousands of messages without clicks or keypresses.
7) Investigate archives in complete privacy. You can run Muse on your own computer, so the data never has to leave your machine.
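
A ROUGH IDEA OF HOW USES 1 AND 2 WORK: To make the monthly summaries and the people groups more concrete, here is a minimal sketch of the general idea in Python. It is not Muse's actual algorithm, which this tip sheet does not describe. The function names, the tiny stopword list, and the use of plain word counts and pair counts are all assumptions chosen for illustration.

```python
import re
from collections import Counter, defaultdict
from email.utils import getaddresses, parsedate_to_datetime
from itertools import combinations

# Deliberately tiny stopword list; a real tool would use a much larger one.
STOPWORDS = {"the", "and", "for", "you", "that", "this", "with", "are", "was", "have"}


def top_terms_by_month(messages, n=3):
    """messages: iterable of (Date header, body text) pairs.

    Returns {"YYYY-MM": [most frequent terms]} using raw word counts.
    """
    counts = defaultdict(Counter)
    for date_header, text in messages:
        month = parsedate_to_datetime(date_header).strftime("%Y-%m")
        words = re.findall(r"[a-z]{3,}", text.lower())  # words of 3+ letters
        counts[month].update(w for w in words if w not in STOPWORDS)
    return {month: [term for term, _ in c.most_common(n)]
            for month, c in counts.items()}


def co_recipient_pairs(recipient_headers):
    """Count how often two addresses are recipients of the same message.

    recipient_headers: one string per message, e.g. its To and Cc
    headers joined with a comma.
    """
    pairs = Counter()
    for header in recipient_headers:
        addresses = sorted({addr.lower()
                            for _name, addr in getaddresses([header]) if addr})
        pairs.update(combinations(addresses, 2))  # every unordered pair once
    return pairs
```

Pairs with high counts are candidates for people who “go together”; building larger groups out of those pairs is a further step that this sketch leaves out.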

A BROWSER PLUGIN THAT CONNECTS TO ARCHIVES: Muse includes a browser extension for Chrome and Firefox that will automatically highlight significant terms on the current page that are also in the archive. You can click on a highlighted term to get to all the messages that contain the term. If you have other terms you would like the browser to highlight (e.g., a list of names on a web page, or in your Rolodex), simply email the text to yourself, put the message in a new folder, and include the folder when you run Muse.

RAPIDLY BROWSING COLLECTIONS OF WEBPAGES: Bookify (website, see demo video) is a browser bookmarklet for Firefox, Chrome, or Safari that lets you “scoop” up links on a web page just by selecting text, and then quickly skim the pages behind those links without opening additional tabs or clicking. For example, you can quickly scan all the news items on a page, the results of a patent search, or a list of Fortune 500 companies. See the website for examples. This is a new feature (independent of Muse), and we’d love feedback about sites on which it would be useful.

SWEET SPOT: Archives with about 50,000 messages. Muse crunches an archive of this size in less than an hour on a laptop. For larger archives, use a big machine with lots of memory.

UPCOMING: We are actively working on making Muse more robust and better documented. We can apply the underlying technology to other kinds of text documents as well, so let us know about datasets in which you have a particular interest.