HTML Text Classification

HTML Text Classification returns the categories that match the content of a valid HTML page.

Initialization

To use the HTML classification functions, the Text Classification Package must first be initialized. To do this, create an instance of the dca::TextClassification module using dca::TextClassification::create().

Once the Text Classification module has been initialized, create a dca::HtmlTextClassifier object by calling dca::TextClassification::createHtmlClassifier().

The HtmlTextClassifier class classifies dca::HtmlText objects. To create an HtmlText object from an HTML text string, use dca::HtmlText::create(). Note that the HTML text must represent a complete HTML page; the analysis of partial HTML text is not supported.
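Because partial HTML is not supported, it can be useful to reject obvious fragments before building an HtmlText object. The following is a minimal sketch of such a pre-check. The helper name looksLikeCompleteHtml is hypothetical and not part of the dca interface, and the test it applies (an opening html tag followed by a closing one) is only a rough heuristic; this page does not state what the library itself considers a complete page.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper, NOT part of the dca interface.
// Returns true if `html` contains an opening "<html" tag followed by a
// closing "</html>" tag. The comparison is case-insensitive.
// This is a heuristic only; it does not validate the markup.
bool looksLikeCompleteHtml(const std::string& html) {
    // Build a lower-case copy so that <HTML> and <html> are treated alike.
    std::string lower(html.size(), '\0');
    std::transform(html.begin(), html.end(), lower.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });

    const std::string::size_type open = lower.find("<html");
    if (open == std::string::npos) {
        return false;  // no opening tag at all
    }

    // The closing tag must appear after the opening tag.
    return lower.find("</html>", open) != std::string::npos;
}
```

A caller might run this check on the page contents and skip the call to dca::HtmlText::create() when it returns false, reporting the input as incomplete instead.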

Classification

To classify an HtmlText object, use dca::HtmlTextClassifier::classify(). This function analyzes the HtmlText object and stores the classification results in the dca::TextClassificationResults object passed to it; the dca::FunctionResult it returns indicates whether the call succeeded. The results object is a container for individual dca::TextClassificationResult entries (one per matched category) and can be iterated over to obtain information on each matched category.
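After the matched categories have been read out of the results container, a caller may want only the strongest match. The sketch below shows one way to select it. It uses a stand-in struct (CategoryResult) rather than the real dca types, because this page does not state the types returned by id() and score(); it also assumes that a higher score means a stronger match, which this page does not confirm.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for one matched category. The field types are assumptions;
// the real dca::TextClassificationResult exposes id() and score().
struct CategoryResult {
    int id;
    double score;
};

// Returns the index of the entry with the highest score.
// For an empty container it returns results.size() (that is, 0),
// so callers should check for emptiness before using the index.
// Assumption: a higher score indicates a stronger match.
std::size_t indexOfTopScore(const std::vector<CategoryResult>& results) {
    if (results.empty()) {
        return results.size();
    }

    std::size_t best = 0;
    for (std::size_t i = 1; i < results.size(); ++i) {
        // Strict comparison keeps the first entry when scores tie.
        if (results[i].score > results[best].score) {
            best = i;
        }
    }
    return best;
}
```

In real code, the id and score of each entry would first be copied out of the dca::TextClassificationResults object into a container like this, or the same loop would be written directly against the results object.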

Example code

The following code demonstrates the classification of some HTML text.

// assume we have a valid DcaInstance (myDca) and License (myLicense)

// initialize the Text Classification module
dca::TextClassification myTextClassification = 
        dca::TextClassification::create( myDca, myLicense );

// create an HTML Text Classifier
dca::HtmlTextClassifier myHtmlTextClassifier = 
        myTextClassification.createHtmlClassifier();

// create an HTML text object
// assume we have the contents of a valid HTML page in std::string 
// myHtmlTextContents
dca::HtmlText myHtmlText = 
        dca::HtmlText::create( myDca, myHtmlTextContents );

// declare the classification results
dca::TextClassificationResults myTextClassificationResults;

// run the classification
dca::FunctionResult myResult = 
        myHtmlTextClassifier.classify( myHtmlText, myTextClassificationResults );

// if myResult evaluates to false, an error occurred
if( !myResult ) {
        cout << "Received error from Text Classification (Error code: " <<
                myResult.getReturnCode() << ", Description: " <<
                myResult.getDescription() << ")." << endl;
        return;
}

if( !myTextClassificationResults.isCategorized() ) {
        cout << "No categories found for given HTML data." << endl;
        return;
}

// we received results and simply want to print them out
const DCA_SIZE_TYPE count = myTextClassificationResults.size();

// iterate through all matched categories
for( DCA_INDEX_TYPE i = 0; i < count; ++i ) {
        const dca::TextClassificationResult myTCResult = 
                myTextClassificationResults[ i ];
                
        cout << "Got result #" << (i+1) << " category id:" <<
                myTCResult.id() << ", Score:" <<
                myTCResult.score() << endl;
}

Generated on 26 Sep 2016 for dca_interface by doxygen 1.6.1