URL Classification

URL Classification returns the categories assigned to a given URL object. If no categories are available, it reports whether the URL is unknown (not found in the database) or known but not categorized.

Initialization

To use the URL Classification functions, the URL Classification Package must first be initialized. To do this, create an instance of the dca::UrlClassification module using dca::UrlClassification::create().

Next, a connection to a URL database must be set up. Refer to Setting up a Database Connection for the steps required to do this.

Once a connection to a URL database has been established, an instance of a dca::UrlDbClassifier must be created. Use dca::UrlClassification::createDbClassifier(), passing the newly created database connection object as a parameter. If you wish to specify options for the classifier (for example, options for Embedded URLs or to enable the Feedback mechanism), you should additionally create and initialize a dca::UrlDbClassifierOptions object.

The UrlDbClassifier class classifies dca::Url objects. To create a Url object from a URL text string, use dca::Url::create().

Classification

To classify a URL, use dca::UrlDbClassifier::classify(). This function analyzes the URL using the database specified by the database connection and returns the results of a classification in a dca::UrlClassificationResults object. This object is a container for individual results (one result per matched category), and can be iterated over to obtain information on each category matched.

A complete list of supported URL categories can be found at https://exchange.xforce.ibmcloud.com/faq#info_for_url_report

URL Normalization

Before a URL is looked up in our URL database, it must be normalized in the same way as on our content analysis servers, where the URL database is created.

If you are using the SCA in a proxy behind a web browser, the browser generally performs this conversion.

You may use either

The following schemas are supported by URL classification:

If a URL does not include a schema, http is assumed.
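The default-schema rule can be illustrated with a small sketch. withDefaultScheme() is a hypothetical helper, not part of the dca interface, and it only shows the "assume http" step. It does not perform the full normalization and does not check whether the schema is one that URL classification supports.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, not part of the dca interface.
// Returns the URL unchanged if it already starts with "<schema>://",
// otherwise prepends "http://".
std::string withDefaultScheme(const std::string& url) {
    const std::string::size_type pos = url.find("://");
    bool hasScheme = (pos != std::string::npos && pos > 0);
    if (hasScheme) {
        // A schema name starts with a letter and continues with
        // letters, digits, '+', '-' or '.'.
        if (!std::isalpha(static_cast<unsigned char>(url[0]))) {
            hasScheme = false;
        }
        for (std::string::size_type i = 1; hasScheme && i < pos; ++i) {
            const unsigned char c = static_cast<unsigned char>(url[i]);
            if (!std::isalnum(c) && c != '+' && c != '-' && c != '.') {
                hasScheme = false;
            }
        }
    }
    return hasScheme ? url : "http://" + url;
}
```

Validating the text before "://" keeps a URL embedded in a query string, such as www.example.com/r?u=http://a.b, from being mistaken for a leading schema.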

Feedback mechanism

If a URL is not found in the database, the URL is classed as unknown. The function dca::UrlClassificationResults::isUnknownUrl() can be used to check this.

The Feedback option exists to help us improve the quality of our classifications. Unknown URLs are collected and uploaded to our servers at a given interval. Some statistics about matched categories and not-categorized classifier calls are also collected and submitted.

This information is uploaded during the dca::UpdateModule::performUpdate() call.

To enable the Feedback option for a UrlDbClassifier, the enable_Feedback option of dca::UrlDbClassifierOptions must be set to true before the UrlDbClassifier is created.

Note:
enable_Feedback is disabled by default.
The proxy settings (if any) used for upload will be taken from the dca::DbConnection associated with the dca::UrlDbClassifier.

Example code

See also:
Content and Engine Updates, and how to implement the required tasks
dca::UrlDbClassifierOptions

The following code demonstrates the classification of a URL.

// assume we have a valid DcaInstance (myDca), License (myLicense) and DbConnection (myDbConnection)

// initialize the URL Classification module
dca::UrlClassification myUrlClassification = 
        dca::UrlClassification::create( myDca, myLicense );

dca::UrlDbClassifierOptions creationOptions;
// enable Feedback mechanism
creationOptions.enable_Feedback = true; 
// enable detection of embedded URLs
creationOptions.enable_EmbeddedUrlDetection = true;

// create a UrlDbClassifier by using creationOptions
dca::UrlDbClassifier myUrlDbClassifier = 
        myUrlClassification.createDbClassifier( myDbConnection, creationOptions );

// create a URL object to classify
dca::Url myUrl = dca::Url::create( myDca, "www.ibm.com" );

// declare the classification results
dca::UrlClassificationResults myUrlClassificationResults;

// start URL Classification
dca::FunctionResult myResult = 
        myUrlDbClassifier.classify( myUrl, myUrlClassificationResults );

// if myResult evaluates to false, an error occurred
if( !myResult ) {
        cout << "Received an error from URL Classification (Error code:" <<
                myResult.getReturnCode() << ", Description: " <<
                myResult.getDescription() << ")." << endl;
        return;
}

if( myUrlClassificationResults.isUnknownUrl() ) {
        // if the given URL is unknown there are no resulting categories available
        cout << "Received no results, URL is unknown." << endl;
        return;
}

if( !myUrlClassificationResults.isCategorized() ) {
        // given URL (or host) is KNOWN, but does not have any categories assigned
        cout << "URL is known but has no assigned categories." << endl;
        return;
}

// we got results and want to print them out
PrintResults( myUrlClassificationResults );

Generated on 26 Sep 2016 for dca_interface by  doxygen 1.6.1