Five Easy Steps to Build an Organized IBM Watson Corpus

March 22, 2016 Nick Miller

If you are working with IBM Watson, the supercomputer that defeated human contestants on the US TV show Jeopardy! in 2011, these tips may help you create a reliable corpus in the now cloud-based Watson Analytics environment. Whether you are using Watson to ask natural-language questions, visualize data patterns or find insights, your corpus is the cornerstone of your application.

1. Define the kind of interactions your users require.

Are your users looking for facts, procedures or opinions? The better you understand your users' information needs, the better Watson will respond to them. This knowledge may also help you choose the most suitable documents for your corpus, thus saving training time and improving precision.

2. Check document structure and language.

Once you have identified the sources that best match your type of interactions, check the structure of each document. A document can have amazing data, but if it is poorly written or its internal structure is not neatly organized (no headings or sections, convoluted paragraphs, slang), it will be hard to read not only for Watson, but for the user as well.
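A quick automated pass can flag documents that need attention before you read them in full. The sketch below is only an illustration, not a Watson feature: it assumes plain-text documents whose headings start with "#" and whose paragraphs are separated by blank lines, and the function name `structure_report` and the 150-word threshold are arbitrary choices.

```python
def structure_report(text, max_words=150):
    """Summarize the structure of one document.

    Assumptions (not from Watson): headings start with '#',
    and blocks are separated by blank lines.
    """
    # Split on blank lines and ignore empty blocks
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    headings = [b for b in blocks if b.startswith("#")]
    # A paragraph longer than max_words is a candidate for splitting
    long_paragraphs = [
        b for b in blocks
        if not b.startswith("#") and len(b.split()) > max_words
    ]
    return {
        "headings": len(headings),
        "paragraphs": len(blocks) - len(headings),
        "long_paragraphs": len(long_paragraphs),
    }
```

A document that reports zero headings or several long paragraphs is a good candidate for the cleanup described in the next step.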

3. Clean and reformat.

Yes, it is a lot of work, but it is totally worth it. Think about this step the same way you would think about having a neatly organized pantry or closet: you will find everything you need when you need it. It will save you time and effort. Eliminate noise, and add headers, tags, punctuation or metadata to improve the document structure. Remember: titles, paragraphs and punctuation have real use! If you don’t do this for the whole corpus, do it at least for a core group of critical data.
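Part of the noise removal can be scripted. The following is a minimal sketch using only Python's standard library; the function name `clean_document` is hypothetical, and the rules it applies (dropping leftover HTML tags, collapsing whitespace and repeated blank lines) are examples of common noise rather than an official Watson preprocessing step.

```python
import re

def clean_document(text: str) -> str:
    """Strip common noise from a plain-text document before ingestion."""
    # Replace leftover HTML tags with a space
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing whitespace on every line
    lines = [line.strip() for line in text.splitlines()]
    # Keep at most one blank line between blocks
    cleaned = []
    for line in lines:
        if line or (cleaned and cleaned[-1]):
            cleaned.append(line)
    return "\n".join(cleaned).strip()
```

Adding headers, tags or metadata still needs human judgment; a script like this only clears the way for that work.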

4. Train Watson on at least three model questions.

When training Watson, use three basic question formats (short, medium, long). This can greatly improve your precision and reduce training time too. E.g.:
How to make a cherry pie
How do I make a cherry pie?
How can I make a delicious, fast cherry pie?
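One way to keep the three formats together is to store them as a set tied to the same answer. This is only a sketch of how you might organize your own training data: the class `QuestionSet`, the field `answer_id` and the identifier "cherry-pie-recipe" are hypothetical and do not correspond to any Watson API.

```python
from dataclasses import dataclass

@dataclass
class QuestionSet:
    """Three phrasings of one question, all pointing at the same answer."""
    answer_id: str  # hypothetical identifier of the answer document
    short: str
    medium: str
    long: str

    def variants(self):
        return [self.short, self.medium, self.long]

def training_pairs(question_sets):
    """Flatten question sets into (question, answer_id) pairs."""
    return [(q, s.answer_id) for s in question_sets for q in s.variants()]

cherry_pie = QuestionSet(
    answer_id="cherry-pie-recipe",
    short="How to make a cherry pie",
    medium="How do I make a cherry pie?",
    long="How can I make a delicious, fast cherry pie?",
)
pairs = training_pairs([cherry_pie])
```

Here `pairs` holds three entries, one per phrasing, each mapped to the same answer.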

5. Perform a baseline test.

Before jumping into frantic training, perform a baseline test: measure retrieval precision without any training. Check the questions and answers and learn from your mistakes. Once you identify a pattern, you will be able to fix it promptly and, more importantly, avoid it in further ingestions. If your corpus is nicely curated, chances are you will get high recall and precision with little initial training.
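Precision and recall for a single test question can be computed from two lists: the documents the system returned and the documents you judged relevant. The sketch below uses the standard definitions; the function name `precision_recall` and the document identifiers in the usage example are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one question.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Four documents returned, two of them among the three relevant ones
precision, recall = precision_recall(
    ["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"]
)
```

In this example precision is 2 out of 4 and recall is 2 out of 3. Recording these numbers before any training gives you the baseline to compare against later.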

