Additional settings

Introduction

In addition to the options available at the API level, a number of settings can be adjusted by changing the values stored in initialization files.
The settings files global to an SCA module can be found under /init/dca_<module_name>/. The main settings file is named <module_name>_settings.txt.
The settings files specific to an SCA classifier or other object can be found under /init/dca_<module_name>/<object_name>. The main settings file is named <object_name>_settings.txt.
Each settings file has an associated user settings file in which each option contained in the main file can be overridden. This is the preferred method of adjusting the settings contained in the initialization files, as the main settings files may be overwritten as part of the update process. The user files are never overwritten as part of an update, and can therefore be safely modified. The user files have the same name as the main files, with the additional suffix _user before the file extension.
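The naming rule for user files can be illustrated with a short Python sketch. The helper name user_settings_path is hypothetical and not part of the SCA API; it only demonstrates where the _user suffix is placed.

```python
from pathlib import PurePosixPath


def user_settings_path(main_path):
    """Return the user settings file name that belongs to a main settings file."""
    path = PurePosixPath(main_path)
    # Insert "_user" between the file name and its extension.
    return str(path.with_name(path.stem + "_user" + path.suffix))


user_file = user_settings_path("/init/dca_manager/dca_manager_settings.txt")
```

For the manager settings file this yields /init/dca_manager/dca_manager_settings_user.txt.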
All settings files have the same format. The first line contains the version number of the settings. For a user file, this is always 6.00000000. The settings are arranged as key-value pairs. A key is defined inside square brackets (e.g. [ key ]). Keys are always case sensitive. The value or values for a key follow directly on the lines after the key. Lines which begin with '#' are treated as comments and ignored.
## VERS=6.00000001

[key1]
value1
value2

[key2]
value3
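A file in this format can be read with a few lines of code. The following Python sketch is only an illustration: it assumes that the version is carried in a line of the form "## VERS=<number>" as in the example, and that a user file replaces all values of any key it repeats. The function names parse_settings and merge_settings are hypothetical and not part of the SCA API.

```python
def parse_settings(text):
    """Parse settings text into (version, {key: [values]})."""
    version = None
    settings = {}
    current_key = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Assumption: the version is carried in a "## VERS=<number>" line.
            if version is None and "VERS=" in line:
                version = line.split("VERS=", 1)[1].strip()
            continue  # every other '#' line is a comment
        if line.startswith("[") and line.endswith("]"):
            # Keys are case sensitive; blanks inside the brackets are dropped.
            current_key = line[1:-1].strip()
            settings[current_key] = []
        elif current_key is not None:
            settings[current_key].append(line)
    return version, settings


def merge_settings(main, user):
    """Return the main settings with every key from the user settings overridden."""
    merged = dict(main)
    merged.update(user)
    return merged


example = """## VERS=6.00000001

[key1]
value1
value2

[key2]
value3
"""
version, main = parse_settings(example)
_, user = parse_settings("## VERS=6.00000000\n[key2]\nvalue4\n")
effective = merge_settings(main, user)
```

Here effective keeps both values of key1 from the main file, while key2 takes its single value from the user file.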
In the following sections the important options are detailed.

Logging

The main log directory for the SCA can be specified in dca::InitData when the SCA instance is first created. This directory is constant for the lifetime of this instance. A log file named dca_info.log is created in this directory. Modules, classifiers and other objects are all able to write to this log file. If no log directory is specified, then logging will be disabled.
All objects which support initialization (settings) files support a log level. Seven log levels are supported, numbered 0 to 6 and defined as follows:
  • CRITICAL = 0
  • ERROR = 1
  • WARNING = 2
  • NOTICE = 3
  • INFO = 4
  • EXTENSIVE = 5
  • DEBUG = 6
The default log level for each object is 3 (NOTICE). The log level can be changed in the user settings file by adding the key [init_object_log_priority] and the required log level under the key. Note that all user files are initially empty. To override a key-value from the main settings file, copy the key to the user file and update the value.
The changed settings take effect the next time an object is created or initialized, or when an update for the object takes place (e.g. a classifier update).
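The log levels and the override entry can be sketched in Python as follows. The enumeration mirrors the list above; the helper log_priority_override is hypothetical and merely produces the two lines that would be added to a user settings file.

```python
from enum import IntEnum


class LogLevel(IntEnum):
    CRITICAL = 0
    ERROR = 1
    WARNING = 2
    NOTICE = 3
    INFO = 4
    EXTENSIVE = 5
    DEBUG = 6


DEFAULT_LOG_LEVEL = LogLevel.NOTICE


def log_priority_override(level):
    """Return the key and value lines that set the object log level."""
    level = LogLevel(level)  # raises ValueError for values outside 0..6
    return "[init_object_log_priority]\n{}\n".format(int(level))


snippet = log_priority_override(LogLevel.DEBUG)
```

The resulting snippet consists of the key line [init_object_log_priority] followed by the value 6.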
The log file dca_info.log consists of five tab-separated columns:
<thread-id>   <date-time>   <level>   <component-name>   <log-message>
If an error occurs during a call to an API function, we recommend examining the log file for additional information on what could have caused the error.
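When examining the log file with a script, each line can be split into its five columns. The Python sketch below assumes only the tab-separated layout described above; the sample line is invented for illustration, and the real formatting of the individual columns may differ.

```python
from collections import namedtuple

LogRecord = namedtuple("LogRecord", "thread_id date_time level component message")


def parse_log_line(line):
    """Split one dca_info.log line into its five tab-separated columns."""
    # maxsplit=4 keeps any tab characters that occur inside the message.
    fields = line.rstrip("\r\n").split("\t", 4)
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated columns, got %d" % len(fields))
    return LogRecord(*fields)


# Hypothetical sample line, not taken from a real log file.
sample = "1234\t2016-09-26 10:15:00\tERROR\tdca_manager\tlicense request failed\n"
record = parse_log_line(sample)
```

Filtering on record.level or record.component then narrows the output down to the messages surrounding a failed API call.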
See section Manager Options for further options with respect to logging.

Logging Behavior of Internal Components

Additionally, the logging behavior of several internal components inside SCA objects can be changed. For this purpose, options are available whose names contain the string _log_level_. These options specify the SCA log level used for all info log messages of the corresponding internal component. The possible values are 1, 2, 3 and 4 (default is 4) with the following meaning:
  • 1 = All info messages of the internal component are written with the log level NOTICE
  • 2 = All info messages of the internal component are written with the log level INFO
  • 3 = All info messages of the internal component are written with the log level EXTENSIVE
  • 4 = All info messages of the internal component are written with the log level DEBUG
If you change the log level of an SCA object to DEBUG, the amount of log output can become very large, making it hard to find the relevant messages. By setting the log level for a single internal component, you can reduce the log output to the area you wish to focus on.
For example, to change the log level for all info messages of the internal embedded URL component of the dca::UrlDbClassifier object to log level NOTICE, simply set the option [urldatabase_log_level_embedded_urls] to 1.
For performance reasons, change the logging behavior of the internal components only for debugging purposes, and never in production systems!
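The effect of a _log_level_ option can be sketched in Python. The mapping follows the list above; the filter rule (a message is written when its level does not exceed the log level of the object) is an assumption made for this illustration.

```python
# Object log levels as listed in the Logging section.
NOTICE, INFO, EXTENSIVE, DEBUG = 3, 4, 5, 6

# Value of a *_log_level_* option -> level used for the component's info messages.
COMPONENT_INFO_LEVEL = {1: NOTICE, 2: INFO, 3: EXTENSIVE, 4: DEBUG}


def info_message_is_written(component_setting, object_log_level):
    """Assumption: a message is written if its level <= the object's log level."""
    return COMPONENT_INFO_LEVEL[component_setting] <= object_log_level


# With the default object log level NOTICE, only a component set to 1 is visible.
visible_with_1 = info_message_is_written(1, NOTICE)
visible_with_4 = info_message_is_written(4, NOTICE)
```

Under this assumption, setting one component to 1 makes its info messages appear at the default object log level, while all other components stay silent.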

Manager Options

The following options are available for the SCA Manager and can be found in the file /init/dca_manager/dca_manager_settings.txt.
[manager_dca_log_priority]
All logging of the SCA is handled by the SCA Manager. Every log message that is not discarded earlier eventually arrives at the SCA Manager, where its log level is additionally compared to the overall log priority (default is LOG_Debug = 6).
[manager_factory_tracking_flags]
Flag vector for the SCA factory tracking flags (by default all flags are disabled). The following flags are supported:
  • 1 Enable the internal factory tracking, that tracks the creation and destruction of internal objects
  • 2 Clear at initialization the dca_factory_tracking.log file
  • 4 Write the factory tracking information also on screen
  • 8 Enable the internal tracking of the reference counting for internal objects
The tracking information is written at deinitialization to the dca_factory_tracking.log file. This option is available for debugging purposes to track down possible memory leaks in the application.
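A flag vector is a sum of the individual flag values. The Python sketch below decodes such a value; the flag names are chosen for this illustration and are not identifiers from the SCA.

```python
from enum import IntFlag


class FactoryTracking(IntFlag):
    # Hypothetical names for the documented flag values.
    ENABLE_TRACKING = 1
    CLEAR_LOG_AT_INIT = 2
    WRITE_TO_SCREEN = 4
    TRACK_REF_COUNTING = 8


def describe(value):
    """Return the names of all flags that are set in value."""
    return [flag.name for flag in FactoryTracking if value & flag]


names = describe(1 | 2 | 8)  # tracking, cleared log file and reference counting
```

In this example the value 11 enables every flag except the screen output.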
[manager_log_max_size]
The maximum size of the log file dca_info.log in megabytes (default is 250 MB). The size of the current log file is checked against this value every 500 log messages. If this value is exceeded, the existing log file is renamed to dca_info.log.bak and a new dca_info.log file is created. Any existing dca_info.log.bak file will be overwritten.
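The rotation rule can be modelled in a few lines of Python. This is a sketch of the described behavior, not the actual implementation; in particular it assumes that one megabyte is counted as 1024 * 1024 bytes, and the function names are hypothetical.

```python
import os
import tempfile

CHECK_EVERY = 500  # the file size is only examined every 500 log messages


def needs_rotation(message_count, size_bytes, max_size_mb=250):
    """Return True if the size check is due and the limit is exceeded."""
    if message_count == 0 or message_count % CHECK_EVERY != 0:
        return False
    # Assumption: one megabyte is counted as 1024 * 1024 bytes.
    return size_bytes > max_size_mb * 1024 * 1024


def rotate(log_path):
    """Move log_path to log_path + '.bak' and start a new, empty log file."""
    # os.replace silently overwrites an existing .bak file.
    os.replace(log_path, log_path + ".bak")
    open(log_path, "w").close()


# Small demonstration in a temporary directory with a limit of 0 MB.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dca_info.log")
    with open(path, "w") as handle:
        handle.write("x" * 10)
    if needs_rotation(500, os.path.getsize(path), max_size_mb=0):
        rotate(path)
    backup_exists = os.path.exists(path + ".bak")
    new_size = os.path.getsize(path)
```

Because the size is only checked every 500 messages, the log file can temporarily grow beyond the configured limit.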
[manager_log_mode]
Flag vector for the log mode (default is 2). The following flags are supported:
  • 1 Log all messages on the screen
  • 2 Log all messages to the dca_info.log file
  • 4 Log each message to the corresponding module log file, which is named dca_module.dca
  • 512 Clear at initialization all module log files
  • 1024 Clear at initialization the dca_info.log file
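The value for [manager_log_mode] is obtained by combining the desired flags. The Python sketch below assumes that the flag vector is written to the settings file as a decimal number; the constant names are chosen for this illustration only.

```python
# Hypothetical names for the documented flag values.
LOG_TO_SCREEN = 1
LOG_TO_INFO_FILE = 2
LOG_TO_MODULE_FILES = 4
CLEAR_MODULE_FILES_AT_INIT = 512
CLEAR_INFO_FILE_AT_INIT = 1024

# Log to the screen and to dca_info.log, and clear dca_info.log at initialization.
mode = LOG_TO_SCREEN | LOG_TO_INFO_FILE | CLEAR_INFO_FILE_AT_INIT

# Assumption: the flag vector is stored as a decimal number under the key.
entry = "[manager_log_mode]\n%d\n" % mode
```

The three flags combine to the value 1027.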

License Options

The following options are available for licensing, and can be found in the file /init/dca_license/dca_license_settings.txt.
[license_server]
The optional license server, given as a string, that is used for license requests.
If this entry is missing, the default license server is used.
[license_server_hash]
This option is deprecated and will be ignored.
[license_error_handling_flags]
If this is set to 1 (default), a list of valid license servers is obtained, and a license request will be made on each server, until one server returns a positive result. If this is disabled (0), only one server will be contacted with a license request. If this single request fails, a license error will be returned immediately.
[license_http_timeout]
The timeout (in seconds) for a license request. Default is ten seconds.
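The two modes of [license_error_handling_flags] can be sketched in Python as follows. The function request_license and its send_request callback are hypothetical stand-ins for the real license request, and the server names are invented.

```python
def request_license(servers, send_request, error_handling_flags=1):
    """Return the first server that grants the license, or None on failure."""
    # With the flag disabled (0), only a single server is contacted.
    candidates = servers if error_handling_flags == 1 else servers[:1]
    for server in candidates:
        if send_request(server):
            return server
    return None


# Hypothetical server names; only the second one answers positively.
servers = ["a.example", "b.example"]


def grants(server):
    return server == "b.example"


with_failover = request_license(servers, grants, error_handling_flags=1)
without_failover = request_license(servers, grants, error_handling_flags=0)
```

With the default setting the request succeeds on the second server, whereas with the value 0 a license error results after the first server fails.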

Update Options

The following options are available for updates, and can be found in the file /init/dca_update/dca_update_settings.txt.
[update_download_server]
This optional setting specifies the download server used by the update module. This server is queried for all content and binary updates.
If this entry is missing, the default download server is used.
[update_upload_server]
This optional setting specifies the upload server used by the update module. All data uploads are directed to this server.
If this entry is missing, the default upload server is used.
[update_weblearn_upload]
This option enables or disables the upload of unknown URL files. If it is set to 1 (default), files containing lists of unknown URLs, collected during URL classification, will be uploaded to our servers for offline analysis. If it is set to 0, the collected files will not be uploaded, and will remain in the upload directory.
The Feedback mechanism for a URL Classifier can be enabled via dca::UrlDbClassifierOptions.
This option only enables or disables the upload of the unknown URL files. If the Feedback mechanism is enabled and the upload is disabled, the user is responsible for the removal of the upload files! The files are located in the directory /init/dca_update/upload/unknownurls.
[update_transfer_speed_limit]
This option sets the maximum transfer rate for update downloads. A value of 0 (default) indicates that the transfer rate should not be limited. A value > 0 specifies the maximum rate in Kbytes / second.
This option has no effect on the upload transfer rate.

Database Connection Options

The database connection object dca::DbConnection contains a data cache for efficient data exchange between a classifier and the database.
There are three different types of database connection:
  • Local databases such as URL, Mail, IPR or WAC
  • Custom databases
  • Remote URL database
Local database connections contain an additional frequent update component, where all downloaded updates are cached until the database merge process exchanges the database file with a newer one.
Local database connections support the following settings defined in the file /init/dca_db/db_connection/db_conn_settings.txt.

Frequent updates component settings

The frequent update component contains a data cache. The cache is made up of a fixed number of memory blocks, and each block contains a fixed number of entries. A block is briefly locked when a thread accesses or updates the data, so that the block remains consistent when accessed or updated by multiple threads. Multiple threads will not be locked out if they access different blocks.
The amount of memory the frequent update cache requires is related to the number of blocks and the number of entries per block. The default cache settings are set according to the estimated number of update records per day.
The cache settings can be modified to suit your memory environment by using the following options. In the option names, <type> should be replaced with one of the following identifiers, depending on which database cache you wish to modify:
  • url
  • mail
  • ipr
  • wac
db_connection_store_hash_bits_<type>
This value specifies the number of bits used to define a cache block lookup key, and hence relates directly to the number of blocks available in the cache. For a given value of N, 2^N blocks will be stored in the cache. Use a value between 2 and 24 bits depending on your memory requirements. The recommended and default value is 15 bits.
db_connection_store_block_size_<type>
This value specifies the number of entries per block. Different default values may be used here depending on the database type. The number of entries must be between 5 and 100.
If there is no more space available to add an entry in a block, a database merge process will be started, and the frequent update cache will be emptied.
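The relationship between the two options and the cache capacity can be worked through in Python. The capacity calculation follows directly from the description above; the block lookup by the lowest N bits of a hash value is an assumption made for this sketch, and both function names are hypothetical.

```python
def cache_capacity(hash_bits, block_size):
    """Return (number of blocks, total number of entries) for the given options."""
    if not 2 <= hash_bits <= 24:
        raise ValueError("hash bits must be between 2 and 24")
    if not 5 <= block_size <= 100:
        raise ValueError("block size must be between 5 and 100")
    blocks = 2 ** hash_bits
    return blocks, blocks * block_size


def block_index(key_hash, hash_bits):
    """Assumption: the lowest hash_bits bits of the hash select the block."""
    return key_hash & ((1 << hash_bits) - 1)


# Default of 15 bits with an example block size of 50 entries.
blocks, entries = cache_capacity(15, 50)
```

Each additional hash bit doubles the number of blocks and therefore roughly doubles the memory required by the cache.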
db_connection_frequent_upd_write_stats_<type>
If set to 1, statistics for the frequent update cache will be written at the given interval (see option db_connection_frequent_upd_write_stats_time_sec...). The default value is 0, so that no statistics will be written.
db_connection_frequent_upd_write_stats_time_sec_<type>
If option db_connection_frequent_upd_write_stats is set to 1, the statistics of the frequent update cache will be written to the log file at the given interval.
Note: The statistics will be written after every 5000 lookups. If you call the classify method less than 5000 times, no statistics will be written.

Local database cache component settings

The cache component of a local database connection is placed logically between the frequent update component and database access component, and caches the results of database lookups.
Some settings are very similar to the settings for the frequent update cache component.
db_connection_cache_hash_bits_<type>
db_connection_cache_block_size_<type>
db_connection_cache_write_stats_<type>
db_connection_cache_write_stats_time_sec_<type>
All of these settings are similar to the ones used in the frequent update cache component. Again, the amount of memory the local database cache requires is related to the number of blocks and the number of entries per block. The default sizes are different per database type, and relate to the size of the database file.
In contrast to the frequent update cache, if a block in the local database cache becomes full, earlier entries in the block will be overwritten on a least recently used basis.
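The replacement behavior of a full block can be illustrated with a small Python model. It is a simplified sketch, not the actual implementation: block selection by the lowest hash bits is an assumption, and the per-block locking used for multi-threaded access is left out.

```python
from collections import OrderedDict


class BlockLruCache:
    """Cache of 2^hash_bits blocks, each holding at most block_size entries."""

    def __init__(self, hash_bits, block_size):
        self.mask = (1 << hash_bits) - 1
        self.block_size = block_size
        self.blocks = [OrderedDict() for _ in range(1 << hash_bits)]

    def _block(self, key):
        # Assumption: the lowest hash bits select the block.
        return self.blocks[hash(key) & self.mask]

    def get(self, key):
        block = self._block(key)
        if key not in block:
            return None
        block.move_to_end(key)  # mark the entry as most recently used
        return block[key]

    def put(self, key, value):
        block = self._block(key)
        if key in block:
            block.move_to_end(key)
        elif len(block) >= self.block_size:
            block.popitem(last=False)  # drop the least recently used entry
        block[key] = value


cache = BlockLruCache(hash_bits=1, block_size=2)
cache.put(0, "a")
cache.put(2, "b")   # the keys 0, 2 and 4 all fall into the same block
cache.get(0)        # key 0 is now the most recently used entry
cache.put(4, "c")   # the block is full, so key 2 is overwritten
```

After these calls the block still contains the keys 0 and 4, while the lookup of key 2 misses.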
db_connection_cache_use_negative_cache_<type>
A negative cache is used to store entries that are currently NOT present in the database, and avoids repeated lookups for those entries in the database itself.
Disabling the negative cache is only recommended when using a mail database, since signatures calculated from emails will not be repeated very often, and there are too many to store in the cache.
db_connection_cache_ttl_<type>
This is the time in seconds for which a cache entry remains valid (Time To Live). After the given TTL period has expired, the entry will simply be overwritten by new entries that are added.

Custom and Remote Cache settings and other Local database connection settings

Each cache type (custom or remote) has its own set of global options which control cache behavior. The options are defined in the file /init/dca_db/db_connection/db_conn_settings.txt. All settings have the same prefix: db_connection_cache_<type>_, where type can be either custom or remote.
For example, the option name for the maximum number of entries in the local cache will be [db_connection_cache_local_max_num_entries].
A cache contains three separate sub-caches:
  • Read Cache - contains database lookup results
  • Negative Cache - contains unknown database entries
  • Update Cache - contains entries from update files which have not yet been merged with the database.
The following options are supported and can be modified using the user settings file.
[...auto_thinout]
If this is set to 1 (default), read and negative entries will be automatically removed from the cache if the cache becomes full.
[...max_num_entries]
The maximum number of entries the complete cache can contain. If this value is reached, an attempt is made to reduce the number of read and negative entries in the cache. If this does not release enough entries, a database merge process will be started to reduce the update cache. The value is limited to a maximum of 2000000. In the case of a local or remote cache, the minimum value is 1024. If a smaller value is given, the whole cache will be disabled. In the case of a custom cache, this value is adjustable via the API. The value from the init file is used only if the value in the API is 0.
[...sec_max_residence_time_negative]
The time in seconds before a negative cache entry becomes invalid. After this time has expired, the entry will be refreshed on the next database lookup.
[...sec_max_residence_time_read]
The time in seconds before a read cache entry becomes invalid. After this time has expired, the entry will be refreshed on the next database lookup.
[...store_mode]
The cache can be optimized for speed or memory. Mode 1 (default) optimizes the cache for speed, but uses more memory. Mode 0 optimizes the cache for memory, but lookups will take longer.
[...use_negative_cache]
Enables or disables the negative cache (default = 1).
[...use_read_cache]
Enables or disables the read cache (default = 1).
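The option naming scheme and the rules for [...max_num_entries] can be summarized in a short Python sketch. The function names are hypothetical, and clamping values above 2000000 to the limit is an assumption about how the limit is enforced.

```python
MAX_ENTRIES_LIMIT = 2000000
MIN_ENTRIES_LOCAL_REMOTE = 1024


def option_key(cache_type, option):
    """Build the full key from the common prefix db_connection_cache_<type>_."""
    return "[db_connection_cache_%s_%s]" % (cache_type, option)


def effective_max_entries(cache_type, init_value, api_value=0):
    """Return the effective maximum number of entries, or 0 if the cache is disabled."""
    # For a custom cache the API value wins unless it is 0.
    if cache_type == "custom" and api_value != 0:
        value = api_value
    else:
        value = init_value
    # Assumption: larger values are clamped to the documented limit.
    value = min(value, MAX_ENTRIES_LIMIT)
    if cache_type in ("local", "remote") and value < MIN_ENTRIES_LOCAL_REMOTE:
        return 0  # too small: the whole cache is disabled
    return value


key = option_key("remote", "max_num_entries")
```

For example, a remote cache configured with 500 entries is disabled, while a custom cache whose API value is 0 falls back to the value from the init file.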
Additionally for the remote database access the following options are supported.
[db_connection_access_remote_max_time_wait_for_server]
The maximum time (in milliseconds) spent iterating over the remote server list to find a valid server before giving up (default is 15 seconds).
[db_connection_access_remote_retry_loops]
The number of retries before a remote server is deemed unreachable and the next server is queried (default = 3).
[db_connection_access_remote_timeout]
The connection timeout (in milliseconds) for a remote database request (default is one second).
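The interaction of the server list, the retry loops and the overall wait time can be sketched in Python. This is an illustration under assumptions, not the actual implementation: the retry count is interpreted as the number of attempts per server, the send callback stands in for a single remote request that returns None on failure, and the server names are invented.

```python
import time


def query_remote(servers, send, retry_loops=3, max_wait_ms=15000,
                 clock=time.monotonic):
    """Try each server up to retry_loops times; give up after max_wait_ms."""
    deadline = clock() + max_wait_ms / 1000.0
    for server in servers:
        for _ in range(retry_loops):
            if clock() >= deadline:
                return None  # overall wait time is used up
            result = send(server)
            if result is not None:
                return result
        # All attempts failed: the server is deemed unreachable, try the next one.
    return None


attempts = []


def flaky_send(server):
    """Hypothetical request: only the second attempt on server s2 succeeds."""
    attempts.append(server)
    return "ok" if server == "s2" and attempts.count("s2") == 2 else None


answer = query_remote(["s1", "s2"], flaky_send)
```

In this run the first server is tried three times before the second server is queried and answers on its second attempt.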
The following options are supported for the Logging Behavior of Internal Components of a database connection:
[db_connection_log_level_access]
The logging behavior of the internal database access component.
[db_connection_log_level_cache]
The logging behavior of the internal database cache component.
[db_connection_log_level_dbdownloader]
The logging behavior of the internal database downloader component.
[db_connection_log_level_global]
The logging behavior of global functionality of the database connection.
[db_connection_log_level_merger]
The logging behavior of the internal database merger component.
[db_connection_log_level_notifier]
The logging behavior of the internal notifier component.
[db_connection_log_level_scheduler]
The logging behavior of the internal scheduler component.
[db_connection_log_level_servermgr]
The logging behavior of the internal server manager component.

URL DB Classifier Options

The following options are available for the URL Database classifier and can be found in the file /init/dca_urlclassification/uc_urldatabase/uc_db_settings.txt.
[urldatabase_local_enable_host_cache]
Enables the host cache in case of a local database connection (default = 1).
[urldatabase_local_host_cache_max_size]
The maximum number of entries allowed in the host cache in case of a local database connection (default = 10000).
[urldatabase_local_tld_check_mode]
Enables or disables the internal top level domain check module in case of a local database connection (default = 1).
[urldatabase_local_unknown_urls_interval]
The interval (in minutes) at which the collected unknown URLs are written to a file for uploading in case of a local database connection (default = 10).
This value is used if the Feedback mechanism feature is turned on, see dca::UrlDbClassifierOptions for information on how to enable Feedback mechanism.
[urldatabase_local_unknown_urls_max_entries]
The maximum number of unknown URLs that are collected before they are written to a file for uploading in case of a local database connection (default = 50000).
This value is used if the Feedback mechanism feature is turned on, see dca::UrlDbClassifierOptions for information on how to enable Feedback mechanism.
[urldatabase_remote_tld_check_mode]
Enables or disables the internal top level domain check module in case of a remote database connection (default = 1).
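The two unknown URL options, [urldatabase_local_unknown_urls_interval] and [urldatabase_local_unknown_urls_max_entries], both trigger the writing of the collected URLs. The Python sketch below models this with a hypothetical collector class; it assumes that whichever limit is reached first causes the write, and it returns the batch instead of writing a file.

```python
class UnknownUrlCollector:
    """Collects unknown URLs and releases them in batches."""

    def __init__(self, interval_minutes=10, max_entries=50000):
        self.interval_seconds = interval_minutes * 60
        self.max_entries = max_entries
        self.urls = []
        self.last_flush = 0.0

    def add(self, url, now):
        """Record url at time now (in seconds); return a batch if one is due."""
        self.urls.append(url)
        full = len(self.urls) >= self.max_entries
        expired = now - self.last_flush >= self.interval_seconds
        if full or expired:
            batch, self.urls = self.urls, []
            self.last_flush = now
            return batch
        return None


collector = UnknownUrlCollector(interval_minutes=10, max_entries=3)
first = collector.add("http://a.example", now=1)
second = collector.add("http://b.example", now=2)
third = collector.add("http://c.example", now=3)    # the entry limit is reached
fourth = collector.add("http://d.example", now=4)
fifth = collector.add("http://e.example", now=603)  # the interval has elapsed
```

The third call releases a batch because the entry limit is reached, and the fifth call releases one because ten minutes have passed since the previous batch.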
The following options are supported for the Logging Behavior of Internal Components of the URL DB Classifier:
[urldatabase_log_level_embedded_urls]
The logging behavior of the internal embedded URLs component.
[urldatabase_log_level_host_cache]
The logging behavior of the internal host cache component.
[urldatabase_log_level_tld_check]
The logging behavior of the internal top level domain check component.
[urldatabase_log_level_unknown_urls]
The logging behavior of the internal unknown URLs component.
[urldatabase_log_level_url_logic]
The logging behavior of the internal URL logic component.

Custom Database Options

The following options are available for Custom Databases, and can be found in the file /init/dca_customdb/customdb/cdb_settings.txt.
[customdb_log_level_persistent_training]
The logging behavior of the internal persistent training component of the Custom Database, see Logging Behavior of Internal Components.
[customdb_log_level_trainer]
The logging behavior of the internal trainer component of the Custom Database, see Logging Behavior of Internal Components.
[customdb_persistent_training_interval]
This is the default value for the interval (specified in minutes) at which the modifications made to the Custom Database are saved to an update file. This value can be overridden via the API, see dca::DbConnectionCustomData::updateFileWriteIntervalMinutes.

Generated on 26 Sep 2016 for dca_interface by  doxygen 1.6.1