dca_interface  6.3.4
URL Classification

URL Classification returns the categories of a given URL object when the URL is found in the database, or reports that the URL is unknown (not in the database) or known but not categorized.

Initialization

To use the URL Classification functions, the URL Classification Package must first be initialized. To do this, create an instance of the dca::UrlClassification module using dca::UrlClassification::create().

Next, a connection to a URL database must be set up. Refer to Setting up a Database Connection for the steps required to do this.

Once a connection to a URL database has been established, an instance of dca::UrlDbClassifier must be created. Use dca::UrlClassification::createDbClassifier(), passing the newly created database connection object as a parameter. If you wish to specify options for the classifier (for example, options for Embedded URLs or to enable the Feedback mechanism), you should additionally create and initialize a dca::UrlDbClassifierOptions object and pass it as the second parameter.

The UrlDbClassifier class classifies dca::Url objects. To create a Url object from a URL text string, use dca::Url::create().

Classification

To classify a URL, use dca::UrlDbClassifier::classify(). This function analyzes the URL using the database specified by the database connection and returns the results of a classification in a dca::UrlClassificationResults object. This object is a container for individual results (one result per matched category), and can be iterated over to obtain information on each category matched.

A complete list of supported URL categories can be found at https://exchange.xforce.ibmcloud.com/faq#info_for_url_report

URL Normalization

Before a URL is looked up in our URL database, it has to be normalized in the same way as it is on our content analysis servers, where the URL database is created.

If you are using the SCA in a proxy behind a web browser, this conversion is generally done by the web browser.

You may use either

The following schemes are supported by URL classification:

  • http
  • https
  • ftp

If a URL does not include a scheme, http will be assumed.
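The scheme rules above can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: effectiveScheme() and hasSupportedScheme() are hypothetical helpers, not part of the dca interface, and the lower-casing of the scheme reflects general URL syntax rather than documented behaviour of the normalization done by the SDK.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <string>

// Hypothetical helper, not part of the dca interface: returns the scheme of
// urlText in lower case, or "http" when no well-formed scheme precedes "://".
std::string effectiveScheme(const std::string &urlText) {
    const std::string::size_type pos = urlText.find("://");
    if (pos == std::string::npos || pos == 0) {
        return "http";  // documented default when no scheme is given
    }
    std::string scheme = urlText.substr(0, pos);
    // A scheme starts with a letter and contains only letters, digits, '+',
    // '-' or '.'. Anything else means the "://" belongs to a later URL part.
    const bool wellFormed =
        std::isalpha(static_cast<unsigned char>(scheme[0])) &&
        std::all_of(scheme.begin(), scheme.end(), [](unsigned char c) {
            return std::isalnum(c) || c == '+' || c == '-' || c == '.';
        });
    if (!wellFormed) {
        return "http";
    }
    std::transform(scheme.begin(), scheme.end(), scheme.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return scheme;
}

// Hypothetical helper: true if the effective scheme is http, https or ftp.
bool hasSupportedScheme(const std::string &urlText) {
    static const std::array<const char *, 3> supported = {"http", "https", "ftp"};
    const std::string scheme = effectiveScheme(urlText);
    return std::any_of(supported.begin(), supported.end(),
                       [&scheme](const char *s) { return scheme == s; });
}
```

A caller could use hasSupportedScheme() as a cheap pre-check before creating a dca::Url from the text. Whether the SDK itself rejects other schemes, or how it reports them, is not covered here.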

Feedback mechanism

If a URL is not found in the database, the URL is classed as unknown. The function dca::UrlClassificationResults::isUnknownUrl() can be used to check this.
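A lookup therefore has three possible outcomes: unknown, known but not categorized, and categorized. The sketch below shows one way to fold the two flags into a single value. LookupOutcome, toOutcome() and describe() are hypothetical names introduced for illustration only. The two bool parameters stand in for the return values of dca::UrlClassificationResults::isUnknownUrl() and isCategorized().

```cpp
#include <string>

// The three outcomes a lookup can have, as described in this section.
enum class LookupOutcome { Unknown, KnownUncategorized, Categorized };

// Hypothetical helper: maps the two flags onto one outcome. The unknown
// check comes first because an unknown URL has no categories to inspect.
LookupOutcome toOutcome(bool isUnknownUrl, bool isCategorized) {
    if (isUnknownUrl) {
        return LookupOutcome::Unknown;
    }
    return isCategorized ? LookupOutcome::Categorized
                         : LookupOutcome::KnownUncategorized;
}

// Hypothetical helper: short human-readable text for each outcome.
std::string describe(LookupOutcome outcome) {
    switch (outcome) {
        case LookupOutcome::Unknown:
            return "URL is unknown";
        case LookupOutcome::KnownUncategorized:
            return "URL is known but has no assigned categories";
        case LookupOutcome::Categorized:
            return "URL matched one or more categories";
    }
    return "";
}
```

With a real results object, the call would look like toOutcome( results.isUnknownUrl(), results.isCategorized() ), made after classify() has returned without error.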

The Feedback option exists to help us improve the quality of our classifications. Unknown URLs are collected and uploaded to our servers at a given interval. Some statistics about matched categories and about classifier calls with a not-categorized result are also collected and submitted.

This information is uploaded during the dca::UpdateModule::performUpdate() call.

To enable the Feedback option for a UrlDbClassifier, the option enable_Feedback of the dca::UrlDbClassifierOptions must be set to true before creating a UrlDbClassifier.

Note
enable_Feedback is disabled by default.
The proxy settings (if any) used for upload will be taken from the dca::DbConnection associated with the dca::UrlDbClassifier.

See also
Content and Engine Updates, and how to implement the required tasks
dca::UrlDbClassifierOptions

Example code

The following code demonstrates the classification of a URL.

// assume we have a valid DcaInstance (myDca), License (myLicense) and DbConnection (myDbConnection)

// initialize the URL Classification module
dca::UrlClassification myUrlClassification =
    dca::UrlClassification::create( myDca, myLicense );

dca::UrlDbClassifierOptions creationOptions;
// enable Feedback mechanism
creationOptions.enable_Feedback = true;
// enable detection of embedded URLs
creationOptions.enable_EmbeddedUrlDetection = true;

// create a UrlDbClassifier by using creationOptions
dca::UrlDbClassifier myUrlDbClassifier =
    myUrlClassification.createDbClassifier( myDbConnection, creationOptions );

// create a URL object to classify (no scheme given, so http is assumed)
dca::Url myUrl = dca::Url::create( myDca, "www.ibm.com" );

// declare the classification results
dca::UrlClassificationResults myUrlClassificationResults;

// start URL Classification and keep the returned FunctionResult
dca::FunctionResult myResult =
    myUrlDbClassifier.classify( myUrl, myUrlClassificationResults );

// if myResult evaluates to false an error occurred.
if( !myResult ) {
    std::cout << "Received an error from URL Classification (Error code:"
              << myResult.getReturnCode() << ", Description: "
              << myResult.getDescription() << ")." << std::endl;
    return;
}

if( myUrlClassificationResults.isUnknownUrl() ) {
    // if the given URL is unknown there are no resulting categories available
    std::cout << "Received no results, URL is unknown." << std::endl;
    return;
}

if( !myUrlClassificationResults.isCategorized() ) {
    // given URL (or host) is KNOWN, but does not have any categories assigned
    std::cout << "URL is known but has no assigned categories." << std::endl;
    return;
}

// we got results and want to print them out
// (PrintResults is a user-supplied helper, not shown here)
PrintResults( myUrlClassificationResults );

Classes and functions used in the example

dca::UrlClassification
Main class for the URL classification.

static UrlClassification create(const DcaInstance &aDcaInstance, const License &aLicense)
Creates the URL classification module by using the given DcaInstance and License.

UrlDbClassifier createDbClassifier(const DbConnection &aDbConnection, const UrlDbClassifierOptions &options=UrlDbClassifierOptions()) const
Create a URL database classifier. The classifier is created by using the provided database connection...

dca::UrlDbClassifier
URL database classifier class.

FunctionResult classify(const Url &aUrl, UrlClassificationResults &urlResults) const
Performs the URL classification and returns the results.

dca::UrlClassificationResults
Results of a URL classification.

bool isCategorized() const
Returns whether or not the URL matched one or more categories.

bool isUnknownUrl() const
Returns whether a URL is known or unknown. A URL is unknown if it is not contained in the database.

dca::Url
Encapsulates a URL object.
Definition: base_url.h:44

static Url create(const DcaInstance &aDcaInstance, const std::string &urlString)
Standard Url creation function.

dca::FunctionResult
Standard function result.
Definition: base_classes.h:148

DCA_RESULT_TYPE getReturnCode() const
Returns the last error code (if any).

std::string getDescription() const
Returns the description for the error or warning.