Best way to classify labeled sentences from a set of documents
I have a classification problem and I need to figure out the best approach to solve it. I have a set of training documents in which some of the sentences and/or paragraphs are labeled with tags. Not all sentences/paragraphs are labeled, and a sentence or paragraph may have more than one tag/label. What I want is a model that, given a new document, suggests labels for each of the sentences/paragraphs in it. Ideally, it would only give me high-probability suggestions.
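To make the setup concrete, here is a rough sketch of the shape of the training data (the sentences and tag names are made up, not from my real documents):

    # Hypothetical shape of the training data: each document is a list of
    # (sentence, tags) pairs; tags may be empty (unlabeled) or hold several labels.
    training_documents = [
        [
            ("The borrower shall repay the principal quarterly.", ["repayment"]),
            ("Late payments accrue interest at 5% per annum.", ["repayment", "interest"]),
            ("The parties hereto agree as follows.", []),  # unlabeled
        ],
    ]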
If I use something like nltk's NaiveBayesClassifier, it gives poor results. I think this is because it does not take into account the "unlabeled" sentences from the training documents, which contain many of the same words and phrases as the labeled sentences. The documents are legal/financial in nature and are filled with legal/financial jargon, most of which should be discounted in the classification model.
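Roughly what I tried with nltk, heavily simplified and with placeholder sentences and tags (the real feature extraction is more involved):

    import nltk

    def bag_of_words(sentence):
        # crude bag-of-words featureset: token presence only
        return {tok: True for tok in sentence.lower().split()}

    # placeholder labeled sentences; a sentence with several tags would have to be
    # duplicated, one copy per tag, or handled by one binary classifier per tag
    labeled = [
        ("The borrower shall repay the principal in quarterly installments.", "repayment"),
        ("This agreement is governed by the laws of New York.", "governing_law"),
    ]

    classifier = nltk.NaiveBayesClassifier.train(
        [(bag_of_words(sent), tag) for sent, tag in labeled]
    )

    # prob_classify exposes per-tag probabilities, so low-confidence
    # suggestions could in principle be filtered out with a threshold
    dist = classifier.prob_classify(bag_of_words("Payments are due on the first of each month."))
    print(dist.max(), dist.prob(dist.max()))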
Is there a better classification algorithm than Naive Bayes, or is there some way to feed the unlabelled data into Naive Bayes in addition to the labelled data from the training set?
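By "feed the unlabelled data in" I mean something like the sketch below (scikit-learn, made-up data): fit the term weighting on all sentences, labeled and unlabeled, so the ubiquitous jargon is down-weighted, and then train Naive Bayes on just the labeled subset. I am not sure whether this is a sound approach or whether something better exists.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Placeholder data: in practice these would be the labeled and unlabeled
    # sentences extracted from the training documents.
    labeled_sentences = [
        "The borrower shall repay the principal in quarterly installments.",
        "This agreement is governed by the laws of New York.",
    ]
    labels = ["repayment", "governing_law"]
    unlabeled_sentences = [
        "The parties hereto agree as follows.",
        "Capitalized terms have the meanings set forth herein.",
    ]

    # Fit the vocabulary and IDF weights on *all* sentences, so boilerplate
    # jargon that appears everywhere gets a low weight automatically.
    vectorizer = TfidfVectorizer(stop_words="english")
    vectorizer.fit(labeled_sentences + unlabeled_sentences)

    # Train Naive Bayes only on the labeled subset.
    clf = MultinomialNB().fit(vectorizer.transform(labeled_sentences), labels)

    # Keep only high-probability suggestions for a new sentence.
    new = vectorizer.transform(["Payments are due on the first of each month."])
    probs = clf.predict_proba(new)[0]
    print([(tag, round(p, 2)) for tag, p in zip(clf.classes_, probs) if p >= 0.6])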