Metrics Service Implementation

class d1_metrics.solrclient.SolrClient(base_url, core_name, select='/')[source]
doGet(params)[source]
getFieldValues(name, q='*:*', fq=None, maxvalues=-1, sort=True, **query_args)[source]

Retrieve the unique values for a field, along with their usage counts.

Parameters:
  • name (string) – Name of the field for which to retrieve values
  • q (string) – Query identifying the records from which values will be retrieved
  • fq (string) – Filter query restricting the operation of the query
  • maxvalues – Maximum number of values to retrieve. Default is -1, which causes retrieval of all values.
  • sort – Sort the result
Returns:
dict of {fieldname: [[value, count], … ], }
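
A minimal usage sketch; the base URL, core name, and field names below are illustrative assumptions, not values from the source:

    from d1_metrics.solrclient import SolrClient

    # Hypothetical Solr endpoint and core
    client = SolrClient("http://localhost:8983/solr", "event_core")

    # Ten most frequent values of a hypothetical "nodeId" field among
    # matching records, returned as [value, count] pairs
    values = client.getFieldValues("nodeId", q="event:read", maxvalues=10)
    for value, count in values["nodeId"]:
        print(value, count)
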
class d1_metrics.solrclient.SolrSearchResponseIterator(base_url, core_name, q, select='select', fq=None, fields='*', page_size=10000, max_records=None, sort=None, **query_args)[source]

Performs a search against a Solr index and acts as an iterator to retrieve all the values.

process_row(row)[source]

Override this method in derived classes to reformat the row response.
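
For example, a derived class might trim each row down to the fields of interest (a sketch; the core name and field names are assumptions):

    from d1_metrics.solrclient import SolrSearchResponseIterator

    class EventIterator(SolrSearchResponseIterator):
        def process_row(self, row):
            # Keep only two hypothetical fields from each Solr document
            return {"pid": row.get("pid"), "dateLogged": row.get("dateLogged")}

    # Paging through the full result set is handled by the iterator
    for event in EventIterator("http://localhost:8983/solr", "event_core",
                               "event:read", fields="pid,dateLogged",
                               page_size=1000):
        print(event)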

d1_metrics.solrclient.escapeSolrQueryTerm(term)[source]
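
No docstring is provided; the intended use is to make a raw value safe for embedding in a Solr query. A sketch, assuming the function backslash-escapes Solr's special query characters:

    from d1_metrics.solrclient import escapeSolrQueryTerm

    # Escape a raw identifier before building a query string
    pid = escapeSolrQueryTerm("urn:node:KNB")
    query = "nodeId:" + pid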

Solr Event Processor

Echo DataONE aggregated logs to disk

Requires Python 3.

This script reads records from the aggregated logs Solr index and writes each record to a log file on disk, one record per line. Each line is formatted as:

JSON_DATA

where:

JSON_DATA = the JSON representation of the record as retrieved from Solr
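
For example, an output line might look like the following (the field names and values are hypothetical; actual records contain whatever fields the Solr index returns):

    {"dateLogged": "2018-03-01T12:00:00.000Z", "event": "read", "nodeId": "urn:node:KNB"}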

Output log files are rotated based on size, with rotation triggered at 1 GB. At most 150 log files are kept, so the log directory should not exceed about 150 GB.
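
A minimal sketch of this rotation policy using the standard library; the file name is an assumption and the handler configuration mirrors the description above, not the exact source:

    import logging
    import logging.handlers

    handler = logging.handlers.RotatingFileHandler(
        "d1logagg.log",
        maxBytes=1073741824,  # rotate at 1 GB
        backupCount=150,      # keep at most 150 rotated files
    )
    logger = logging.getLogger("output")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)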

JSON loading benchmarks: http://artem.krylysov.com/blog/2015/09/29/benchmark-python-json-libraries/ Note that the performance differences under Python 3 are much reduced.

One particular challenge is that the dateLogged time in the log records has precision only to the second. This makes restarting the harvest difficult, since there may be multiple records within the same second.

The strategy employed here is to retrieve the last set of n records (100 or so) from the output log and ignore any newly harvested records that are already present in that set.

Each log record is on the order of 500-600 bytes; assume 1000 bytes per record. The last 100 or so records then occupy roughly the last 100 KB of the log file.
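
A self-contained sketch of this strategy; in the module itself, getLastLinesFromFile and logRecordInList play these roles, and full-record equality is an assumption here:

    import json

    def loadRecentRecords(fname, seek_back=100000, lines_to_return=100):
        # Read roughly the last seek_back bytes of the output log and
        # parse the trailing JSON records
        with open(fname, "rb") as src:
            src.seek(0, 2)  # jump to the end of the file
            src.seek(max(0, src.tell() - seek_back))
            lines = src.read().decode("utf-8", errors="replace").splitlines()
        entries = [line for line in lines if line.startswith("{")]
        return [json.loads(line) for line in entries[-lines_to_return:]]

    def isDuplicate(record, recent_records):
        # Skip any freshly harvested record already written before restart
        return record in recent_records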

class d1_logagg.eventprocessor.LogFormatter(fmt=None, datefmt=None, style='%')[source]
converter()

timestamp[, tz] -> tz’s local time from POSIX timestamp.

formatTime(record, datefmt=None)[source]

Return the creation time of the specified LogRecord as formatted text.

This method should be called from format() by a formatter which wants to make use of a formatted time. This method can be overridden in formatters to provide for any specific requirement, but the basic behaviour is as follows: if datefmt (a string) is specified, it is used with time.strftime() to format the creation time of the record. Otherwise, an ISO8601-like (or RFC 3339-like) format is used. The resulting string is returned.

This function uses a user-configurable function to convert the creation time to a tuple. By default, time.localtime() is used; to change this for a particular formatter instance, set the ‘converter’ attribute to a function with the same signature as time.localtime() or time.gmtime(). To change it for all formatters, for example if you want all logging times to be shown in GMT, set the ‘converter’ attribute in the Formatter class.
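
Following the note above, a formatter that reports all times in GMT can be obtained by setting the converter attribute (a sketch, not the module's own implementation):

    import logging
    import time

    class GMTFormatter(logging.Formatter):
        # Same pattern the stdlib uses: converter has the signature of
        # time.localtime() / time.gmtime()
        converter = time.gmtime

    formatter = GMTFormatter(fmt="%(asctime)s %(message)s",
                             datefmt="%Y-%m-%dT%H:%M:%SZ")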

class d1_logagg.eventprocessor.OutputLogFormatter(fmt=None, datefmt=None, style='%')[source]
converter()

timestamp[, tz] -> tz’s local time from POSIX timestamp.

formatTime(record, datefmt=None)[source]

Return the creation time of the specified LogRecord as formatted text.

This method should be called from format() by a formatter which wants to make use of a formatted time. This method can be overridden in formatters to provide for any specific requirement, but the basic behaviour is as follows: if datefmt (a string) is specified, it is used with time.strftime() to format the creation time of the record. Otherwise, an ISO8601-like (or RFC 3339-like) format is used. The resulting string is returned.

This function uses a user-configurable function to convert the creation time to a tuple. By default, time.localtime() is used; to change this for a particular formatter instance, set the ‘converter’ attribute to a function with the same signature as time.localtime() or time.gmtime(). To change it for all formatters, for example if you want all logging times to be shown in GMT, set the ‘converter’ attribute in the Formatter class.

class d1_logagg.eventprocessor.SolrSearchResponseIterator(select_url, q, fq=None, fields='*', page_size=10000, max_records=None, sort=None, **query_args)[source]

Performs a search against a Solr index and acts as an iterator to retrieve all the values.

process_row(row)[source]

Override this method in derived classes to reformat the row response.

d1_logagg.eventprocessor.escapeSolrQueryTerm(term)[source]
d1_logagg.eventprocessor.getLastLinesFromFile(fname, seek_back=100000, pattern='^{', lines_to_return=100)[source]

Returns the last lines matching pattern from the file fname

Args:
  fname: Name of the file to examine
  seek_back: Number of bytes to look backwards in the file
  pattern: Pattern that lines must match to be returned
  lines_to_return: Maximum number of lines to return
Returns:
  The last n log entries that match pattern
d1_logagg.eventprocessor.getOutputLogger(log_file, log_level=20)[source]

Logger used for emitting the Solr records as JSON blobs, one record per line.

A logger is used for this only to take advantage of the file rotation capability.

Parameters:
  • log_file – Name of the log file to write to
  • log_level – Logging level (default 20, i.e. logging.INFO)
Returns:
The configured logger

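A hypothetical usage sketch; the file name and record fields are assumptions:

    import json
    from d1_logagg.eventprocessor import getOutputLogger

    logger = getOutputLogger("d1logagg.log", log_level=20)
    record = {"dateLogged": "2018-03-01T12:00:00.000Z", "event": "read"}
    # One record per line, written through the rotating log handler
    logger.info(json.dumps(record))
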
d1_logagg.eventprocessor.getQuery(src_file='d1logagg.log', tstamp=None)[source]

Returns a query that would retrieve the last entry in the log file

Args:
  src_file: Name of the log file to examine
  tstamp: Timestamp of the last point for the query. Defaults to the value of utcnow if not set.
Returns:
  A Solr query string that returns at least the last record from the index, and the record data that was retrieved from the log
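
An illustrative sketch of the kind of query this could produce, assuming dateLogged is the indexed timestamp field (the exact query is defined in the source):

    tstamp = "2018-03-01T12:00:00Z"
    q = "dateLogged:[" + tstamp + " TO *]"
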
d1_logagg.eventprocessor.getRecords(log_file_name, core_name, base_url='http://localhost:8983/solr', test_only=False)[source]

Main method. Retrieves records from Solr and saves them to disk.

Args:
  log_file_name: Name of the destination log file
  core_name: Name of the Solr core to query
  base_url: Base URL of the Solr service
Returns:
  Nothing
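
A hypothetical invocation; the core name is an assumption, and test_only=True is presumed to suppress writing output:

    from d1_logagg.eventprocessor import getRecords

    getRecords("d1logagg.log", "event_core",
               base_url="http://localhost:8983/solr", test_only=True)
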
d1_logagg.eventprocessor.logRecordInList(record_list, record)[source]

Returns True if record is in record_list.

Args:
  record_list: List of records to search
  record: Record to look for
Returns:
  Boolean
d1_logagg.eventprocessor.main()[source]
d1_logagg.eventprocessor.setupLogger(level=30)[source]

Logger used for application logging.

Parameters:
  • level – Logging level (default 30, i.e. logging.WARNING)
Returns:
d1_logagg.eventprocessor.trimLogEntries(records, field)[source]

Shrink the list so that only entries matching the last record are returned.

Args:
  records: Records to test
  field: Field to evaluate
Returns:
  Records that match the field of the last record
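
Illustrative behavior, assuming records are dicts keyed by field name:

    from d1_logagg.eventprocessor import trimLogEntries

    records = [
        {"dateLogged": "2018-03-01T11:59:59Z"},
        {"dateLogged": "2018-03-01T12:00:00Z"},
        {"dateLogged": "2018-03-01T12:00:00Z"},
    ]
    trimLogEntries(records, "dateLogged")
    # Expected: the two trailing entries whose dateLogged matches the
    # last record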