Multi-label Classification

Apr 27, 2016


The report is divided into 4 parts:

In the first part, we briefly describe how we extract the articles, labels and document IDs for the topics of interest from the original .sgm files, using the Beautiful Soup package in Python.
The second part explains the data wrangling process, which is often overlooked by data scientists but is a vital step in the whole text mining pipeline.
The text analysis is documented extensively in part 3, where we focus on how feature engineering can benefit from exploratory data analysis (EDA).
Much attention is given to the final part, where we evaluate 3 classification algorithms using the Tf-idf vector as the primary feature.
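
To make the Tf-idf feature mentioned in the final part concrete, here is a minimal pure-Python sketch. It assumes the plain weighting tf × log(N / df), which may differ from the smoothed and normalized variant a library implementation would apply. The function name `tfidf_vectors` and the toy documents are illustrative only and are not taken from the report's corpus or code.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: tf = count / document length,
    idf = log(N / df), with no smoothing or normalization.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = {}
        for term, count in counts.items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[term])
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


# Toy, made-up documents used only for illustration.
docs = [
    "wheat prices rise".split(),
    "oil prices fall".split(),
    "wheat harvest grows".split(),
]
vectors = tfidf_vectors(docs)
```

In this toy example, a term that appears in only one document (such as "rise") receives a larger weight than a term shared by two documents (such as "prices"), which is the property that makes Tf-idf useful as a feature for a classifier.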

The report ends with a conclusion section where we discuss the most significant findings of our work and how the whole pipeline may integrate with other downstream components, such as a search engine.

The whole structure of our report can also be captured compactly by a product-centered principle.