vinegar.data_source.text_file

Data source backed by a text file.

This data source is designed to work with any text file, where there is a line for each system. The exact format of the file can be configured through the use of regular expressions.

This data source supports the find_system method, which makes it well suited for use as the root source that defines the list of existing systems.

Specifying the file format

This section only describes the options related to the file format. For a full list of supported options, please refer to Configuration options. For an example configuration, please refer to Configuration example.

The centerpiece of the file format configuration is a regular expression that defines the format of a single line in the file. This regular expression is specified through the regular_expression option. This regular expression must match the full line (the pattern is matched using fullmatch). Consequently, there is no need to use start of string or end of string anchors.
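The effect of full-line matching can be illustrated with Python's re module alone. This is a minimal sketch that assumes the data source behaves like re.fullmatch, as stated above; the pattern and the sample lines are made up for illustration.

```python
import re

# Hypothetical pattern for lines of the form "name;address".
pattern = re.compile(r"(?P<name>[^;]+);(?P<address>[^;]+)")

# The whole line has to match, so no ^ or $ anchors are needed.
full = pattern.fullmatch("host1;10.0.0.1")

# A trailing third column means that the line as a whole does not match,
# even though a prefix of the line would match.
partial = pattern.fullmatch("host1;10.0.0.1;extra")
prefix = pattern.match("host1;10.0.0.1;extra")
```

Here, full is a match object, partial is None, and prefix is a match object that only covers the first two columns.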

This regular expression has to define groups that represent the various pieces of data. Both regular groups (identified by their index) and named groups (using the (?P<name>...) syntax) can be used.
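The two ways of addressing a group can be tried out directly with the re module. The line format in this sketch is hypothetical and only serves to show the difference between an index and a name.

```python
import re

# Hypothetical line format: "<id> <role>".
match = re.fullmatch(r"(\S+) (?P<role>\S+)", "web01 frontend")

# A regular group is addressed by its index (an int) ...
by_index = match.group(1)
# ... while a named group is addressed by its name (a str).
by_name = match.group("role")
```

In a configuration, the index or name used here corresponds to the value of the source key described further below.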

Often, it is desirable to ignore certain lines (e.g. empty lines or lines representing comments). This can be achieved through the regular_expression_ignore option. If a line matches that expression, it is ignored entirely, without even a warning message being logged. Again, the regular expression has to match the full line.
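As an illustration, the following sketch applies an ignore expression that matches empty lines and lines starting with "#" to a few made-up lines. It only uses the re module and assumes full-line matching as described above.

```python
import re

# Matches either the empty string or a line starting with "#".
ignore = re.compile(r"|(?:#.*)")

lines = ["", "# a comment", "host1;10.0.0.1", " # indented comment"]
ignored = [line for line in lines if ignore.fullmatch(line)]
kept = [line for line in lines if not ignore.fullmatch(line)]
```

Please note that the indented comment is not ignored because the expression has to match the full line, including the leading space.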

For each line, the system ID and at least one associated piece of data have to be extracted. Both work through the same mechanism: a configuration that refers to one of the groups defined in the regular expression matching the line.

The configuration for extracting the system ID is specified through the system_id configuration option. The configuration for extracting pieces of data is specified through the variables option.

There are two differences between the two: First, the variables option is actually a dict where each key is the name of the corresponding key in the data tree and the value is the configuration for extracting that piece of data. Second, the configuration for the system ID must never result in a value of None being extracted.

The keys in the dict of the variables option can define a hierarchy. That hierarchy is specified by using the colon (:) in keys. Each key is split at these colons and the components are used as keys into nested instances of dict.
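The splitting of keys can be sketched in a few lines of Python. The helper set_nested is hypothetical and not part of this module; it only illustrates how a key containing colons maps to nested instances of dict.

```python
def set_nested(tree, key, value):
    """Store value in tree, treating colons in key as hierarchy separators."""
    *parents, leaf = key.split(":")
    node = tree
    for part in parents:
        # Create the intermediate dict if it does not exist yet.
        node = node.setdefault(part, {})
    node[leaf] = value


data = {}
set_nested(data, "net:hostname", "system1")
set_nested(data, "net:ipv4_addr", "192.168.0.1")
set_nested(data, "location", "rack 3")
```

After these calls, both net keys share a single nested dict, while location remains a top-level key.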

Each of the configurations for extracting a piece of data is itself a dict that has the following keys:

source (mandatory):: The name (as a str) or index (as an int) of the group in the regular expression that provides this piece of data.
transform (optional):: A list defining the transformations that shall be applied to the string extracted through the regular expression. This list is passed to vinegar.transform.get_transformation_chain. If this list is empty (the default), no transformations are applied and the string extracted by the regular expression is used as is.
transform_none_value (optional):: A bool defining whether a value of None should still be transformed. As most transformation functions do not support None values, the default is False. When setting this option to True, one has to ensure that only transformation functions that can handle a value of None are used. The value extracted from a line can be None if the corresponding capturing group in the regular expression is optional.
use_none_value (optional):: A bool defining whether a value of None (possibly as a result of the transformations) should still result in the corresponding key being added to the data tree. Usually, there is no point in adding a key without a value, so this option defaults to False. Please note that this option has no effect when specified in the configuration for the system_id. The system ID is mandatory, so a system ID of None is treated as an error. This should be avoided by ensuring that the group capturing the system ID is not optional.
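The interplay of these four keys can be sketched as follows. This is not the code used by the data source: the function extract is hypothetical, and plain Python callables stand in for the transformation chain that would normally be built by vinegar.transform.get_transformation_chain.

```python
import re


def extract(match, config):
    """Return (include, value) for one extraction configuration."""
    value = match.group(config["source"])
    # None is only passed to the transformations if explicitly requested.
    if value is not None or config.get("transform_none_value", False):
        for func in config.get("transform", []):
            value = func(value)
    # A None value only results in a key if use_none_value is set.
    include = value is not None or config.get("use_none_value", False)
    return include, value


# The "alias" group is optional, so it is None for this line.
match = re.fullmatch(r"(?P<name>[^,]+)(?:,(?P<alias>.+))?", "Host1")
name = extract(match, {"source": "name", "transform": [str.lower]})
alias = extract(match, {"source": "alias", "transform": [str.lower]})
alias_kept = extract(match, {"source": "alias", "use_none_value": True})
```

With the defaults, the missing alias is neither transformed nor added to the data tree. Only the third call, which sets use_none_value, reports that the key should be included.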

It might be that a file contains some lines that do not match the expected format (as specified by regular_expression), but are not lines that shall be ignored (as specified by regular_expression_ignore) either. The mismatch_action option defines how to deal with those lines. By default, a warning is logged when such a line is encountered. This can be changed to raising an exception by setting mismatch_action to error. Such lines can also be ignored completely (without logging a warning), by setting mismatch_action to ignore.
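The handling of a single line can be sketched like this. The function classify is hypothetical, and the order in which the two expressions are tried is an assumption made for this sketch; only the three option values and the use of ValueError are taken from this documentation.

```python
import logging
import re

logger = logging.getLogger(__name__)


def classify(line, expression, expression_ignore, mismatch_action="warn"):
    """Return the match for a data line or None if the line is skipped."""
    if expression_ignore is not None and expression_ignore.fullmatch(line):
        # Ignored lines are skipped silently.
        return None
    match = expression.fullmatch(line)
    if match is not None:
        return match
    if mismatch_action == "error":
        raise ValueError(f"Line does not match the expected format: {line!r}")
    if mismatch_action == "warn":
        logger.warning("Skipping line with unexpected format: %r", line)
    return None
```

A caller would pass the compiled regular_expression and regular_expression_ignore patterns and process a line only if the return value is not None.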

If there is more than one line specifying the same system ID, the behavior is controlled by the duplicate_system_id_action option. By default, a warning is logged and only the first line for the system ID is used (option value warn). This can be changed to raising an exception by setting the option to error. If the option is set to ignore, only the first line is used, but no warning message is logged.
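A minimal sketch of this behavior is shown below. The function collect is hypothetical and takes already extracted pairs of system ID and data; any action other than error and ignore is treated as the default, which logs a warning.

```python
import logging

logger = logging.getLogger(__name__)


def collect(entries, duplicate_system_id_action="warn"):
    """Build a dict of system ID to data, keeping the first entry per ID."""
    systems = {}
    for system_id, data in entries:
        if system_id not in systems:
            systems[system_id] = data
        elif duplicate_system_id_action == "error":
            raise ValueError(f"Duplicate system ID: {system_id}")
        elif duplicate_system_id_action != "ignore":
            # Default behavior: log a warning and keep the first line.
            logger.warning("Ignoring duplicate line for system ID %s", system_id)
    return systems
```

In all non-error cases, the data from the first line wins, so the result does not depend on whether a warning is logged.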

Configuration example

In order to get a better understanding of how the various configuration options work together, let us discuss the following example (in this example, we use YAML for describing the configuration):

# The cache is enabled by default, so we only specify it here for
# completeness.
cache_enabled: True
# The warn action is already the default, we only specify it here for
# completeness.
duplicate_system_id_action: warn
# This is the path to the text file.
file: /path/to/file.txt
# Enabling find_first_match has the effect that if multiple systems use the
# same value for a key (as defined in the variable dict), the first of the
# systems (the first line) is returned by the find_system method when
# looking for that specific key-value combination. If this option is not
# enabled, no system is returned if there is no unique match.
find_first_match: True
# The warn action is already the default, we only specify it here for
# completeness.
mismatch_action: warn
# This is the regular expression that matches the lines that we want to
# use. We specify the X flag first (?x) so that we can use the multi-line
# syntax, which makes the regular expression much more readable.
regular_expression: |
    (?x)
    # We expect a CSV file with three columns that are separated by
    # semicolons.
    # The first column specifies the MAC address.
    (?P<mac>[0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2}){5});
    # The second column specifies the IP address.
    (?P<ip>[0-9]{1,3}(?:\.[0-9]{1,3}){3});
    # The third column specifies the hostname and an optional list
    # of additional names.
    (?P<hostname>[^,]+)
    (,(?P<extra_names>.+))?
# We want to ignore empty lines and lines starting with a "#".
regular_expression_ignore: "|(?:#.*)"
# We build the system ID from the hostname by adding a domain name and
# ensuring that everything is in lower case.
system_id:
    source: hostname
    transform:
        - string.add_suffix: .mydomain.example.com
        - string.to_lower
# We define a couple of variables that will be available in the data tree
# for each system.
variables:
    'info:extra_names':
        source: extra_names
        transform:
            - string.to_lower
            # Please note that we could also write this shorter as
            # "- string.split: ','" because "sep" is the first argument
            # (after the value) and "maxsplit" defaults to -1.
            - string.split:
                sep: ","
                maxsplit: -1
    'net:fqdn':
        source: hostname
        transform:
            - string.add_suffix: .mydomain.example.com
            - string.to_lower
    'net:hostname':
        source: hostname
        transform:
            - string.to_lower
    'net:ipv4_addr':
        source: ip
        transform:
            - ipv4_address.normalize
    'net:mac_addr':
        source: mac
        transform:
            # The colon is the default delimiter, so we could also simply
            # write "- mac_address.normalize" without specifying any
            # options.
            - mac_address.normalize:
                delimiter: colon

Now, let us assume we have the following text file:

02:00:00:00:00:01;192.168.0.1;System1
02:00:00:00:00:02;192.168.0.2;system2,alias1,Alias2
02:00:00:00:00:0a;192.168.000.3;system3
02:00:00:00:00:0A;192.168.0.4;system4

Parsing this file with the configuration specified earlier would result in the following data for the systems (we list the data in YAML format and use the system IDs as the keys in the top-level dict):

system1.mydomain.example.com:
    net:
        fqdn: system1.mydomain.example.com
        hostname: system1
        ipv4_addr: 192.168.0.1
        mac_addr: '02:00:00:00:00:01'

system2.mydomain.example.com:
    info:
        extra_names:
            - alias1
            - alias2
    net:
        fqdn: system2.mydomain.example.com
        hostname: system2
        ipv4_addr: 192.168.0.2
        mac_addr: '02:00:00:00:00:02'

system3.mydomain.example.com:
    net:
        fqdn: system3.mydomain.example.com
        hostname: system3
        ipv4_addr: 192.168.0.3
        mac_addr: '02:00:00:00:00:0A'

system4.mydomain.example.com:
    net:
        fqdn: system4.mydomain.example.com
        hostname: system4
        ipv4_addr: 192.168.0.4
        mac_addr: '02:00:00:00:00:0A'

Thanks to the transformations, all names have been converted to lower case and IP and MAC addresses have been normalized.
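The group extraction step of this example can be reproduced with Python's re module alone. The sketch below uses the pattern from the example configuration (without the comments) and two sample lines with six-octet MAC addresses, as required by the pattern. The transformations are not applied, so the values appear exactly as written in the file.

```python
import re

# Pattern from the example configuration, using verbose mode via (?x).
pattern = re.compile(r"""(?x)
    (?P<mac>[0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2}){5});
    (?P<ip>[0-9]{1,3}(?:\.[0-9]{1,3}){3});
    (?P<hostname>[^,]+)
    (,(?P<extra_names>.+))?
""")

match = pattern.fullmatch("02:00:00:00:00:02;192.168.0.2;system2,alias1,Alias2")
groups = match.groupdict()

# The optional group is None when a line has no additional names.
no_extra = pattern.fullmatch("02:00:00:00:00:01;192.168.0.1;System1")
```

For the second line, the extra_names group is None. With the default of use_none_value, this is why no info key is created for such a system.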

With this data, it is possible to look up systems through find_system. For example, find_system('net:mac_addr', '02:00:00:00:00:0A') returns system3.mydomain.example.com. This works because the look-up is done on the final (transformed) data and the find_first_match configuration option has been enabled. If it had not been enabled, the result would be None because system4.mydomain.example.com has the same MAC address.
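The look-up semantics can be sketched independently of the data source. The functions lookup and find_system below are hypothetical stand-ins that operate on a plain dict of system ID to data tree; they only illustrate the effect of find_first_match and are not the code used by this module.

```python
_MISSING = object()


def lookup(data, key):
    """Resolve a colon-separated key in a nested dict."""
    node = data
    for part in key.split(":"):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node


def find_system(systems, lookup_key, lookup_value, find_first_match=False):
    """Return the matching system ID or None if there is no usable match."""
    matches = [
        system_id
        for system_id, data in systems.items()
        if lookup(data, lookup_key) == lookup_value
    ]
    if len(matches) == 1 or (matches and find_first_match):
        return matches[0]
    return None


# Both systems share the same normalized MAC address.
systems = {
    "system3.mydomain.example.com": {"net": {"mac_addr": "02:00:00:00:00:0A"}},
    "system4.mydomain.example.com": {"net": {"mac_addr": "02:00:00:00:00:0A"}},
}
first = find_system(systems, "net:mac_addr", "02:00:00:00:00:0A", find_first_match=True)
unique = find_system(systems, "net:mac_addr", "02:00:00:00:00:0A")
```

The dict keeps the order of the lines in the file, so the first match is the system that appears first in the file.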

Configuration options

This data source has several configuration options that can be used to control its behavior. This section only gives an overview of the available options. For a more detailed discussion about the options controlling the file format, please refer to Specifying the file format and Configuration example.

file (mandatory):: Path to the text file (as a str).
regular_expression (mandatory):: Regular expression (as a str) matching the data lines in the file. This regular expression must match the full line (the pattern is matched using fullmatch). Consequently, there is no need to use start of string or end of string anchors. The regular expression must define capturing groups that can then be referenced from the system_id and variables configuration. See Specifying the file format for details.
system_id (mandatory):: Configuration describing how the system ID is extracted from a line. This configuration refers to a capturing group of regular_expression through its source option. See Specifying the file format for details.
variables (mandatory):: Configuration describing how the various data items are extracted from a line. This configuration option expects a dict where each key-value pair refers to one data item, using the key as the key in the data tree generated for the system and the value as the configuration for that data item. See Specifying the file format for details.
cache_enabled (optional):: If True (the default), the contents of the text file are read once and cached until the file changes. File changes are detected through the time-stamp of the file. If False, the file is read and parsed every time find_system() or get_data() is called.
duplicate_system_id_action (optional):: If warn (the default), a warning message is logged when a line specifying the same system ID as an earlier line is encountered and the second line is ignored. If error, a ValueError is raised instead. If ignore, the second line is ignored without logging a warning.
find_first_match (optional):: If True and there are multiple matches in a call to find_system(), the first system ID (this is the ID of the first system in the file that matches the specified query) is returned. If False (the default), no system ID is returned if there are multiple matches, so a system ID is only returned if there is only one system matching the query.
mismatch_action (optional):: Controls the behavior when a line that matches neither regular_expression nor regular_expression_ignore is encountered. If warn (the default), a warning message is logged. If error, a ValueError is raised instead. If ignore, the line is ignored without logging a warning.
regular_expression_ignore (optional):: Regular expression (as a str) matching the lines in the file that shall be ignored. This regular expression must match the full line (the pattern is matched using fullmatch). Consequently, there is no need to use start of string or end of string anchors. If None (the default), no lines are ignored.
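The caching behavior described for cache_enabled can be approximated as follows. This is only a sketch of time-stamp based caching in general: the class CachedFile is hypothetical, and the use of the modification time in nanoseconds is an assumption that is not taken from this module.

```python
import os


class CachedFile:
    """Re-read a file only when its modification time changes."""

    def __init__(self, path):
        self._path = path
        self._mtime = None
        self._lines = None
        # Counts how often the file has actually been read.
        self.reads = 0

    def lines(self):
        mtime = os.stat(self._path).st_mtime_ns
        if self._lines is None or mtime != self._mtime:
            with open(self._path, encoding="utf-8") as file:
                self._lines = file.read().splitlines()
            self._mtime = mtime
            self.reads += 1
        return self._lines
```

Repeated calls to lines() only result in one read as long as the time-stamp of the file stays the same.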

class vinegar.data_source.text_file.TextFileSource(config: Mapping[Any, Any])

Data source that reads data from a text file.

For information about the configuration options supported by this data source, please refer to the module documentation.

find_system(lookup_key: str, lookup_value: Any) → str | None

Find a system given the specified key and value.

If no system can be found, the data source returns None.

Parameters:

lookup_key – key for which to look. The interpretation of the key is up to the data source. Some data sources might use a flat structure, while others might support hierarchical data-structures. In the latter case, the use of the colon (:) as a hierarchy separator in the key is encouraged, but not required.
lookup_value – value for which to look. The interpretation of the value is up to the data source.

Returns:

system identifier or None if no system could be identified using the specified key and value.

get_data(system_id: str, preceding_data: Mapping[Any, Any], preceding_data_version: str) → Tuple[Mapping[Any, Any], str]

Return data associated with the specified system.

If the data source does not have any information associated with the specified system ID, it should return an empty dictionary.

The return value of this method is in fact a tuple of the configuration data and a version string. The version string can be used by the calling code to decide whether the data has changed and thus caches have to be discarded. For example, the results of rendering a template might be cached and the cached version might be used as long as the version string returned by this method does not change. This means that implementations have to be careful to never return the same version string when the data for a system has changed. The vinegar.utils.version module provides utility functions for generating version strings in a way that makes accidental collisions unlikely.
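To illustrate the idea of a content-based version string, the following sketch derives one from a data tree with hashlib and json. It does not use the functions from vinegar.utils.version; the function version_for is hypothetical and assumes that the data can be serialized as JSON.

```python
import hashlib
import json


def version_for(data):
    """Derive a version string from the content of a data tree."""
    # Sorting the keys makes the result independent of the dict order.
    serialized = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


unchanged = version_for({"net": {"hostname": "system1"}})
changed = version_for({"net": {"hostname": "system2"}})
```

Because the string depends on the content only, it stays the same as long as the data stays the same, and a change to the data results in a different string except in the case of a hash collision.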

Please note that it is not the job of a data source to merge the preceding_data with the data provided by itself. The calling code takes care of this. Code wanting to use multiple data sources in a chain can use the get_composite_data_source function.

Implementations are encouraged to use caching to improve performance when this method is repeatedly called for the same systems.

Parameters:

system_id – ID of the system for which data is requested.
preceding_data – Data provided by the data source(s) that come earlier in the chain. This may be empty if there are no preceding data sources or if they did not provide any data for the system.
preceding_data_version – Version of the preceding_data. This is an arbitrary string (typically a hash) that can be used to detect when the data provided by the preceding sources has changed.

Returns:

tuple where the first element is the data associated with the specified system and the second element is a version string that changes whenever the returned data changes (for the same system).

vinegar.data_source.text_file.get_instance(config: Mapping[Any, Any]) → TextFileSource

Create a text file data source.

For information about the configuration options supported by that source, please refer to the module documentation.

Parameters:: config – configuration for the data source.
Returns:: text file data source using the specified configuration.