dca_interface  6.3.4
HTML Text Classification

HTML Text Classification returns the categories matched by the content of a valid HTML page.

Initialization

To use the HTML classification functions, the Text Classification Package must first be initialized. To do this, create an instance of the dca::TextClassification module using dca::TextClassification::create().

Once the Text Classification module has been initialized, a dca::HtmlTextClassifier object must be created. Use dca::TextClassification::createHtmlClassifier() to create an HTML Text Classifier.

The HtmlTextClassifier class classifies dca::HtmlText objects. To create an HtmlText object from an HTML text string, use dca::HtmlText::create(). Note that the HTML text must represent a complete HTML page; the analysis of partial HTML text is not supported.
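Because partial HTML is not supported, it can help to reject obvious fragments before calling dca::HtmlText::create(). The following is a minimal sketch of such a pre-check. The helper looksLikeCompleteHtmlPage() is hypothetical and not part of dca_interface; it is only a cheap heuristic (an opening html tag followed by a closing one) and does not reproduce whatever validation the library itself performs.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (NOT part of dca_interface): returns true when the text
// contains an opening "<html" tag followed by a closing "</html>" tag,
// compared case-insensitively. This is a heuristic, not an HTML validator.
bool looksLikeCompleteHtmlPage( const std::string& htmlContent )
{
    // work on a lower-cased copy so that <HTML> and <html> are treated alike
    std::string lowered( htmlContent );
    std::transform( lowered.begin(), lowered.end(), lowered.begin(),
                    []( unsigned char c ) {
                        return static_cast<char>( std::tolower( c ) );
                    } );

    const std::string::size_type openPos = lowered.find( "<html" );
    if( openPos == std::string::npos ) {
        return false;
    }
    // the closing tag must appear after the opening tag
    return lowered.find( "</html>", openPos ) != std::string::npos;
}
```

A caller could skip classification, or log a warning, when this check fails for the string that would otherwise be passed to dca::HtmlText::create().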

Classification

To classify an HtmlText object, use dca::HtmlTextClassifier::classify(). This function analyzes the HtmlText object and returns the results of a classification in a dca::TextClassificationResults object. This object is a container for individual dca::TextClassificationResult results (one result per matched category), and can be iterated over to obtain information on each category matched.
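Since the results container holds one entry per matched category, a common follow-up step is to pick the best-scoring one. The sketch below shows one way to do that with a generic helper that relies only on size(), operator[] and score(). The names indexOfHighestScore and MockResult are hypothetical; MockResult merely stands in for dca::TextClassificationResult so that the sketch compiles without the library. It also assumes that the container can be indexed through a const reference, which the documentation above does not confirm.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (NOT part of dca_interface): returns the index of the
// entry with the highest score(), or results.size() if the container is empty.
// Assumes Results offers size() and operator[] on a const reference.
template <typename Results>
std::size_t indexOfHighestScore( const Results& results )
{
    const std::size_t count = results.size();
    std::size_t best = count; // "not found" marker while nothing was seen
    for( std::size_t i = 0; i < count; ++i ) {
        if( best == count || results[ i ].score() > results[ best ].score() ) {
            best = i;
        }
    }
    return best;
}

// Hypothetical stand-in for dca::TextClassificationResult, exposing the two
// accessors used on this page: id() and score().
struct MockResult {
    int categoryId;
    double categoryScore;
    int id() const { return categoryId; }
    double score() const { return categoryScore; }
};
```

With a std::vector<MockResult> holding scores 0.2, 0.9 and 0.5, the helper returns index 1. With the real library, the same template might be applied to a dca::TextClassificationResults object after checking isCategorized().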

Example code

The following code demonstrates the classification of some HTML text.

// assume we have a valid DcaInstance (myDca) and License (myLicense)
// initialize the Text Classification module
dca::TextClassification myTextClassification =
    dca::TextClassification::create( myDca, myLicense );
// create an HTML Text Classifier
dca::HtmlTextClassifier myHtmlTextClassifier =
    myTextClassification.createHtmlClassifier();
// create an HTML text object
// assume we have the contents of a valid HTML page in std::string
// myHtmlTextContents
dca::HtmlText myHtmlText =
    dca::HtmlText::create( myDca, myHtmlTextContents );
// declare the classification results
dca::TextClassificationResults myTextClassificationResults;
// run the classification and keep the returned function result
dca::FunctionResult myResult =
    myHtmlTextClassifier.classify( myHtmlText, myTextClassificationResults );
// if myResult evaluates to false, an error occurred
if( !myResult ) {
    cout << "Received error from Text Classification (Error code: " <<
        myResult.getReturnCode() << ", Description: " <<
        myResult.getDescription() << ")." << endl;
    return;
}
// the call succeeded, but the page may not match any category
if( !myTextClassificationResults.isCategorized() ) {
    cout << "No categories found for given HTML data." << endl;
    return;
}
// we received results and simply want to print them out
const DCA_SIZE_TYPE count = myTextClassificationResults.size();
// iterate through all matched categories
for( DCA_INDEX_TYPE i = 0; i < count; ++i ) {
    const dca::TextClassificationResult myTCResult =
        myTextClassificationResults[ i ];
    cout << "Got result #" << (i+1) << " category id:" <<
        myTCResult.id() << ", Score:" <<
        myTCResult.score() << endl;
}
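
The error branch of the example builds its message inline. Where the same message is needed in several places, it could be factored into a small helper, sketched below. The names formatClassificationError and MockFunctionResult are hypothetical; MockFunctionResult only imitates the two accessors documented for the function result, getReturnCode() and getDescription(), and uses int for the return code purely as an assumption.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (NOT part of dca_interface): formats the error text
// used in the example from any object offering getReturnCode() and
// getDescription().
template <typename Result>
std::string formatClassificationError( const Result& result )
{
    std::ostringstream message;
    message << "Received error from Text Classification (Error code: "
            << result.getReturnCode() << ", Description: "
            << result.getDescription() << ").";
    return message.str();
}

// Hypothetical stand-in for the library's function result type; the real
// return code type is DCA_RESULT_TYPE, assumed here to be printable.
struct MockFunctionResult {
    int code;
    std::string description;
    int getReturnCode() const { return code; }
    std::string getDescription() const { return description; }
};
```

Returning a string instead of writing to cout directly leaves the choice of output channel (console, log file, exception text) to the caller.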

Referenced classes, functions, and types

dca::TextClassification
    The HTML Text Classification module class.
    static TextClassification create(const DcaInstance &aDcaInstance, const License &aLicense)
        Initializes the TextClassification module.
    HtmlTextClassifier createHtmlClassifier() const
        Creates a HtmlTextClassifier that is used to classify HtmlText objects.

dca::HtmlTextClassifier
    HTML text classifier object for text classification.
    FunctionResult classify(const HtmlText &aText, TextClassificationResults &aTextResults) const
        The HTML Text Classification method. The method takes an initialized HtmlText object and returns the ...

dca::HtmlText
    Encapsulates an HTML text object.
    Definition: base_htmltext.h:24
    static HtmlText create(const DcaInstance &aDcaInstance, const std::string &htmlContent)
        Creates an HTML text object, used as an input parameter for text classification.

dca::TextClassificationResults
    Overall results of a text classification.
    bool isCategorized() const
        Returns whether there are any results for the text classification.
    DCA_SIZE_TYPE size() const
        Returns the number of results in the container.

dca::TextClassificationResult
    Single result of a text classification.
    DCA_CATEGORY_ID_TYPE id() const
        Returns the category id of the classification (if any).
    double score() const
        Returns the score of the classification (if any), range is from 0.0 to 1.0.

dca::FunctionResult
    Standard function result.
    Definition: base_classes.h:148
    DCA_RESULT_TYPE getReturnCode() const
        Returns the last error code (if any).
    std::string getDescription() const
        Returns the description for the error or warning.

Types
    size_t DCA_INDEX_TYPE
        Type for index access (used for arrays and collections).
        Definition: base_types.h:66
    size_t DCA_SIZE_TYPE
        Type for size (used for size of array and collections).
        Definition: base_types.h:72