beancount.ingest

Code to help identify, extract, and file external downloads.

This package contains code to help you build importers and drive the process of identifying which importer to run on an externally downloaded file, extracting transactions from it, and filing the file away under a clean, rigidly named hierarchy for preservation.

beancount.ingest.cache

A file wrapper which acts as a cache for on-demand evaluation of conversions.

This object is used in lieu of a file in order to allow the various importers to reuse each others' conversion results. Converting file contents, e.g. PDF to text, can be expensive.
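
To make the caching idea concrete, here is a minimal sketch. `MemoSketch` and `expensive_converter` are hypothetical stand-ins, not the library's own classes; the sketch only assumes that results are keyed by converter function, so that a second importer asking for the same conversion does not pay for it again.

```python
class MemoSketch:
    """Hypothetical stand-in for a file wrapper that caches conversions."""

    def __init__(self, filename):
        self.name = filename
        self._cache = {}  # Maps converter function -> converted result.

    def convert(self, converter_func):
        """Run converter_func(filename) once and reuse the result afterwards."""
        try:
            return self._cache[converter_func]
        except KeyError:
            result = self._cache[converter_func] = converter_func(self.name)
            return result


calls = []


def expensive_converter(filename):
    """Pretend conversion (e.g. PDF to text); records each invocation."""
    calls.append(filename)
    return "text of {}".format(filename)


memo = MemoSketch("/tmp/statement.pdf")
first = memo.convert(expensive_converter)   # Runs the conversion.
second = memo.convert(expensive_converter)  # Served from the cache.
```

After both calls, `calls` holds a single entry: the converter ran only once.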

beancount.ingest.cache.contents(filename)

A converter that just reads the entire contents of a file.

Parameters:
  • filename – A string, the name of the file to read.

Returns:
  • A string, the decoded contents of the file.

Source code in beancount/ingest/cache.py
def contents(filename):
    """A converter that just reads the entire contents of a file.

    Args:
      filename: A string, the name of the file to read.
    Returns:
      A string, the decoded contents of the file.
    """
    # Attempt to detect the input encoding automatically, using chardet and a
    # decent amount of input.
    with open(filename, 'rb') as infile:
        rawdata = infile.read(HEAD_DETECT_MAX_BYTES)
    detected = chardet.detect(rawdata)
    encoding = detected['encoding']

    # Ignore encoding errors for reading the contents because input files
    # routinely break this assumption.
    errors = 'ignore'

    with open(filename, encoding=encoding, errors=errors) as file:
        return file.read()

beancount.ingest.cache.get_file(filename)

Create or reuse a globally registered instance of a FileMemo.

Note: the FileMemo objects are reused for the duration of the process. This is usually the intended behavior. Always create them by calling this function.

Parameters:
  • filename – A path string, the absolute name of the file whose memo to create.

Returns:
  • A FileMemo instance.

Source code in beancount/ingest/cache.py
def get_file(filename):
    """Create or reuse a globally registered instance of a FileMemo.

    Note: the FileMemo objects are reused for the duration of the process.
    This is usually the intended behavior. Always create them by calling this
    function.

    Args:
      filename: A path string, the absolute name of the file whose memo to create.
    Returns:
      A FileMemo instance.

    """
    assert path.isabs(filename), (
        "Path should be absolute in order to guarantee a single call.")
    return _CACHE[filename]

beancount.ingest.cache.head(num_bytes=8192)

A converter that just reads the first bytes of a file.

Parameters:
  • num_bytes – The number of bytes to read.

Returns:
  • A converter function.

Source code in beancount/ingest/cache.py
def head(num_bytes=8192):
    """A converter that just reads the first bytes of a file.

    Args:
      num_bytes: The number of bytes to read.
    Returns:
      A converter function.
    """
    def head_reader(filename):
        with open(filename, 'rb') as file:
            rawdata = file.read(num_bytes)
            detected = chardet.detect(rawdata)
            encoding = detected['encoding']
            return rawdata.decode(encoding)
    return head_reader
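
Note that head() is a factory: calling it returns the actual converter, which takes a filename. The sketch below illustrates that closure pattern using only the standard library. `head_utf8` is a hypothetical variant that assumes UTF-8 input instead of detecting the encoding with chardet as the listing above does.

```python
import os
import tempfile


def head_utf8(num_bytes=8192):
    """Hypothetical variant of head(): return a converter reading the first bytes.

    Assumption: the file is UTF-8; undecodable bytes are replaced rather than
    having the encoding detected with chardet.
    """
    def head_reader(filename):
        with open(filename, 'rb') as file:
            rawdata = file.read(num_bytes)
        return rawdata.decode('utf-8', errors='replace')
    return head_reader


# Write a small sample file and read only its first 5 bytes.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b'Date,Amount\n2020-01-01,10.00\n')
reader = head_utf8(5)
prefix = reader(tmp.name)
os.remove(tmp.name)
```

Here `prefix` holds only the leading `Date,` of the file, which is typically enough for an importer to recognize a header line.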

beancount.ingest.cache.mimetype(filename)

A converter that computes the MIME type of the file.

Returns:
  • The MIME type guessed for the file.

Source code in beancount/ingest/cache.py
def mimetype(filename):
    """A converter that computes the MIME type of the file.

    Returns:
      The MIME type guessed for the file.
    """
    return file_type.guess_file_type(filename)

beancount.ingest.extract

Extract script.

Read an import script and a list of downloaded filenames or directories of downloaded files, and for each of those files, extract transactions from it.

beancount.ingest.extract.add_arguments(parser)

Add arguments for the extract command.

Source code in beancount/ingest/extract.py
def add_arguments(parser):
    """Add arguments for the extract command."""

    parser.add_argument('-e', '-f', '--existing', '--previous', metavar='BEANCOUNT_FILE',
                        default=None,
                        help=('Beancount file or existing entries for de-duplication '
                              '(optional)'))

    parser.add_argument('-r', '--reverse', '--descending',
                        action='store_const', dest='ascending',
                        default=True, const=False,
                        help='Write out the entries in descending order')
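
The `--reverse` flag is stored under the inverted name `ascending` via `store_const`. The following standalone sketch reproduces the two arguments (help strings omitted) on a fresh parser to show what the parsed namespace looks like; the filename `ledger.beancount` is only an example value.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-e', '-f', '--existing', '--previous',
                    metavar='BEANCOUNT_FILE', default=None)
parser.add_argument('-r', '--reverse', '--descending',
                    action='store_const', dest='ascending',
                    default=True, const=False)

# Without flags, entries are written in ascending order and no ledger is loaded.
default_args = parser.parse_args([])

# '-r' flips 'ascending' to False; '--previous' is an alias that fills 'existing'.
reversed_args = parser.parse_args(['-r', '--previous', 'ledger.beancount'])
```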

beancount.ingest.extract.extract(importer_config, files_or_directories, output, entries=None, options_map=None, mindate=None, ascending=True, hooks=None)

Given an importer configuration, search for files that can be imported in the list of files or directories, run the signature checks on them, and if it succeeds, run the importer on the file.

A list of entries for an existing ledger can be provided in order to perform de-duplication and a minimum date can be provided to filter out old entries.

Parameters:
  • importer_config – A list of importer instances, the configuration.

  • files_or_directories – A list of strings, filenames or directories to be processed.

  • output – A file object, to be written to.

  • entries – A list of directives loaded from the existing file for the newly extracted entries to be merged in.

  • options_map – The options parsed from existing file.

  • mindate – Optional minimum date to output transactions for.

  • ascending – A boolean, true to print entries in ascending order, false if descending is desired.

  • hooks – An optional list of hook functions to apply to the list of extract (filename, entries) pairs, in order. If not specified, find_duplicate_entries() is used, automatically.

Source code in beancount/ingest/extract.py
def extract(importer_config,
            files_or_directories,
            output,
            entries=None,
            options_map=None,
            mindate=None,
            ascending=True,
            hooks=None):
    """Given an importer configuration, search for files that can be imported in the
    list of files or directories, run the signature checks on them, and if it
    succeeds, run the importer on the file.

    A list of entries for an existing ledger can be provided in order to perform
    de-duplication and a minimum date can be provided to filter out old entries.

    Args:
      importer_config: A list of importer instances, the configuration.
      files_or_directories: A list of strings, filenames or directories to be processed.
      output: A file object, to be written to.
      entries: A list of directives loaded from the existing file for the newly
        extracted entries to be merged in.
      options_map: The options parsed from existing file.
      mindate: Optional minimum date to output transactions for.
      ascending: A boolean, true to print entries in ascending order, false if
        descending is desired.
      hooks: An optional list of hook functions to apply to the list of extract
        (filename, entries) pairs, in order. If not specified, find_duplicate_entries()
        is used, automatically.
    """
    allow_none_for_tags_and_links = (
        options_map and options_map["allow_deprecated_none_for_tags_and_links"])

    # Run all the importers and gather their result sets.
    new_entries_list = []
    for filename, importers in identify.find_imports(importer_config,
                                                     files_or_directories):
        for importer in importers:
            # Import and process the file.
            try:
                new_entries = extract_from_file(
                    filename,
                    importer,
                    existing_entries=entries,
                    min_date=mindate,
                    allow_none_for_tags_and_links=allow_none_for_tags_and_links)
                new_entries_list.append((filename, new_entries))
            except Exception as exc:
                logging.exception("Importer %s.extract() raised an unexpected error: %s",
                                  importer.name(), exc)
                continue

    # Find potential duplicate entries in the result sets, either against the
    # list of existing ones, or against each other. A single call to this
    # function is made on purpose, so that the function be able to merge
    # entries.
    if hooks is None:
        hooks = [find_duplicate_entries]
    for hook_fn in hooks:
        new_entries_list = hook_fn(new_entries_list, entries)
    assert isinstance(new_entries_list, list)
    assert all(isinstance(new_entries, tuple) for new_entries in new_entries_list)
    assert all(isinstance(new_entries[0], str) for new_entries in new_entries_list)
    assert all(isinstance(new_entries[1], list) for new_entries in new_entries_list)

    # Print out the results.
    output.write(HEADER)
    for key, new_entries in new_entries_list:
        output.write(identify.SECTION.format(key))
        output.write('\n')
        if not ascending:
            new_entries.reverse()
        print_extracted_entries(new_entries, output)

beancount.ingest.extract.extract_from_file(filename, importer, existing_entries=None, min_date=None, allow_none_for_tags_and_links=False)

Import entries from file 'filename' using the given importer.

Also cross-check against a list of provided 'existing_entries' entries, de-duplicating and possibly auto-categorizing.

Parameters:
  • filename – The name of the file to import.

  • importer – An importer object that matched the file.

  • existing_entries – A list of existing entries parsed from a ledger, used to detect duplicates and automatically complete or categorize transactions.

  • min_date – A date before which entries should be ignored. This is useful when an account has a valid check/assert; we could just ignore whatever comes before, if desired.

  • allow_none_for_tags_and_links – A boolean, whether to allow plugins to generate Transaction objects with None as value for the 'tags' or 'links' attributes.

Returns:
  • A list of new imported entries.

Exceptions:
  • Exception – If there is an error in the importer's extract() method.

Source code in beancount/ingest/extract.py
def extract_from_file(filename, importer,
                      existing_entries=None,
                      min_date=None,
                      allow_none_for_tags_and_links=False):
    """Import entries from file 'filename' using the given importer.

    Also cross-check against a list of provided 'existing_entries' entries,
    de-duplicating and possibly auto-categorizing.

    Args:
      filename: The name of the file to import.
      importer: An importer object that matched the file.
      existing_entries: A list of existing entries parsed from a ledger, used to
        detect duplicates and automatically complete or categorize transactions.
      min_date: A date before which entries should be ignored. This is useful
        when an account has a valid check/assert; we could just ignore whatever
        comes before, if desired.
      allow_none_for_tags_and_links: A boolean, whether to allow plugins to
        generate Transaction objects with None as value for the 'tags' or 'links'
        attributes.
    Returns:
      A list of new imported entries.
    Raises:
      Exception: If there is an error in the importer's extract() method.
    """
    # Extract the entries.
    file = cache.get_file(filename)

    # Note: Let the exception through on purpose. This makes developing
    # importers much easier by rendering the details of the exceptions.
    #
    # Note: For legacy support, support calling without the existing entries.
    kwargs = {}
    if 'existing_entries' in inspect.signature(importer.extract).parameters:
        kwargs['existing_entries'] = existing_entries
    new_entries = importer.extract(file, **kwargs)
    if not new_entries:
        return []

    # Make sure the newly imported entries are sorted; don't trust the importer.
    new_entries.sort(key=data.entry_sortkey)

    # Ensure that the entries are typed correctly.
    for entry in new_entries:
        data.sanity_check_types(entry, allow_none_for_tags_and_links)

    # Filter out entries with dates before 'min_date'.
    if min_date:
        new_entries = list(itertools.dropwhile(lambda x: x.date < min_date,
                                               new_entries))

    return new_entries
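
The legacy-support step above inspects the importer's extract() signature and only passes `existing_entries` when the method accepts it. A self-contained sketch of that dispatch, with two hypothetical importer classes and a hypothetical helper `call_extract`:

```python
import inspect


class LegacyImporter:
    """Hypothetical importer whose extract() predates 'existing_entries'."""

    def extract(self, file):
        return ['legacy:{}'.format(file)]


class ModernImporter:
    """Hypothetical importer whose extract() accepts 'existing_entries'."""

    def extract(self, file, existing_entries=None):
        return ['modern:{}:{}'.format(file, len(existing_entries or []))]


def call_extract(importer, file, existing_entries):
    """Pass 'existing_entries' only if the importer's extract() declares it."""
    kwargs = {}
    if 'existing_entries' in inspect.signature(importer.extract).parameters:
        kwargs['existing_entries'] = existing_entries
    return importer.extract(file, **kwargs)


legacy_result = call_extract(LegacyImporter(), 'f.csv', ['old-entry'])
modern_result = call_extract(ModernImporter(), 'f.csv', ['old-entry'])
```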

beancount.ingest.extract.find_duplicate_entries(new_entries_list, existing_entries)

Flag potentially duplicate entries.

Parameters:
  • new_entries_list – A list of pairs of (key, lists of imported entries), one for each importer. The key identifies the filename and/or importer that yielded those new entries.

  • existing_entries – A list of previously existing entries from the target ledger.

Returns:
  • A list of lists of modified new entries (like new_entries_list), potentially with modified metadata to indicate those which are duplicated.

Source code in beancount/ingest/extract.py
def find_duplicate_entries(new_entries_list, existing_entries):
    """Flag potentially duplicate entries.

    Args:
      new_entries_list: A list of pairs of (key, lists of imported entries), one
        for each importer. The key identifies the filename and/or importer that
        yielded those new entries.
      existing_entries: A list of previously existing entries from the target
        ledger.
    Returns:
      A list of lists of modified new entries (like new_entries_list),
      potentially with modified metadata to indicate those which are duplicated.
    """
    mod_entries_list = []
    for key, new_entries in new_entries_list:
        # Find similar entries against the existing ledger only.
        duplicate_pairs = similar.find_similar_entries(new_entries, existing_entries)

        # Add a metadata marker to the extracted entries for duplicates.
        duplicate_set = set(id(entry) for entry, _ in duplicate_pairs)
        mod_entries = []
        for entry in new_entries:
            if id(entry) in duplicate_set:
                marked_meta = entry.meta.copy()
                marked_meta[DUPLICATE_META] = True
                entry = entry._replace(meta=marked_meta)
            mod_entries.append(entry)
        mod_entries_list.append((key, mod_entries))
    return mod_entries_list
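
The marking step relies on entries being named tuples: the metadata dict is copied and the entry is rebuilt with `_replace`, so the original entry is left untouched. The sketch below shows this with a hypothetical two-field `Entry` type. The value given to `DUPLICATE_META` here is an assumption for illustration, and `mark_duplicates` is a hypothetical helper.

```python
from collections import namedtuple

# Hypothetical stand-in for a directive; real entries have more fields.
Entry = namedtuple('Entry', 'meta narration')

# Assumed marker key, for illustration only.
DUPLICATE_META = '__duplicate__'


def mark_duplicates(new_entries, duplicates):
    """Return copies of new_entries, flagging those found in 'duplicates'."""
    duplicate_ids = {id(entry) for entry in duplicates}
    marked = []
    for entry in new_entries:
        if id(entry) in duplicate_ids:
            meta = entry.meta.copy()  # Copy so the original dict is not mutated.
            meta[DUPLICATE_META] = True
            entry = entry._replace(meta=meta)
        marked.append(entry)
    return marked


first = Entry({'lineno': 1}, 'Coffee')
second = Entry({'lineno': 2}, 'Rent')
marked = mark_duplicates([first, second], [second])
```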

beancount.ingest.extract.print_extracted_entries(entries, file)

Print a list of entries.

Parameters:
  • entries – A list of extracted entries.

  • file – A file object to write to.

Source code in beancount/ingest/extract.py
def print_extracted_entries(entries, file):
    """Print a list of entries.

    Args:
      entries: A list of extracted entries.
      file: A file object to write to.
    """
    # Shorthand for printing to the output file.
    # pylint: disable=invalid-name
    pr = lambda *args: print(*args, file=file)
    pr('')

    # Print out the entries.
    for entry in entries:
        # Check if this entry is a dup, and if so, comment it out.
        if DUPLICATE_META in entry.meta:
            meta = entry.meta.copy()
            meta.pop(DUPLICATE_META)
            entry = entry._replace(meta=meta)
            entry_string = textwrap.indent(printer.format_entry(entry), '; ')
        else:
            entry_string = printer.format_entry(entry)
        pr(entry_string)

    pr('')

beancount.ingest.extract.run(args, _, importers_list, files_or_directories, hooks=None)

Run the subcommand.

Source code in beancount/ingest/extract.py
def run(args, _, importers_list, files_or_directories, hooks=None):
    """Run the subcommand."""

    # Load the ledger, if one is specified.
    if args.existing:
        entries, _, options_map = loader.load_file(args.existing)
    else:
        entries, options_map = None, None

    extract(importers_list, files_or_directories, sys.stdout,
            entries=entries,
            options_map=options_map,
            mindate=None,
            ascending=args.ascending,
            hooks=hooks)
    return 0

beancount.ingest.file

Filing script.

Read an import script and a list of downloaded filenames or directories of downloaded files, and for each of those files, move the file under an account corresponding to the filing directory.

beancount.ingest.file.add_arguments(parser)

Add arguments for the file command.

Source code in beancount/ingest/file.py
def add_arguments(parser):
    """Add arguments for the file command."""

    parser.add_argument('-o', '--output', '--output-dir', '--destination',
                        dest='output_dir', action='store',
                        help="The root of the documents tree to move the files to.")

    parser.add_argument('-n', '--dry-run', action='store_true',
                        help=("Just print where the files would be moved; "
                              "don't actually move them."))

    parser.add_argument('--no-overwrite', dest='overwrite',
                        action='store_false', default=True,
                        help="Don't overwrite destination files with the same name.")

beancount.ingest.file.file(importer_config, files_or_directories, destination, dry_run=False, mkdirs=False, overwrite=False, idify=False, logfile=None)

File importable files under a destination directory.

Given an importer configuration object, search for files that can be imported under the given list of files or directories and move them under the given destination directory with the date computed by the module prepended to the filename. If the date cannot be extracted, use a reasonable default for the date (e.g. the last modified time of the file itself).

If 'mkdirs' is True, create the destination directories before moving the files.

Parameters:
  • importer_config – A list of importer instances that define the config.

  • files_or_directories – a list of files or directories to walk recursively and hunt for files to import.

  • destination – A string, the root destination directory where the files are to be filed. The files are organized there under a hierarchy mirroring that of the chart of accounts.

  • dry_run – A flag, if true, don't actually move the files.

  • mkdirs – A flag, if true, make all the intervening directories; otherwise, fail to move files to non-existing dirs.

  • overwrite – A flag, if true, overwrite an existing destination file.

  • idify – A flag, if true, remove whitespace and funky characters in the destination filename.

  • logfile – A file object to write log entries to, or None, in which case no log is written out.

Source code in beancount/ingest/file.py
def file(importer_config,
         files_or_directories,
         destination,
         dry_run=False,
         mkdirs=False,
         overwrite=False,
         idify=False,
         logfile=None):
    """File importable files under a destination directory.

    Given an importer configuration object, search for files that can be
    imported under the given list of files or directories and move them under
    the given destination directory with the date computed by the module
    prepended to the filename. If the date cannot be extracted, use a reasonable
    default for the date (e.g. the last modified time of the file itself).

    If 'mkdirs' is True, create the destination directories before moving the
    files.

    Args:
      importer_config: A list of importer instances that define the config.
      files_or_directories: a list of files or directories to walk recursively and
        hunt for files to import.
      destination: A string, the root destination directory where the files are
        to be filed. The files are organized there under a hierarchy mirroring
        that of the chart of accounts.
      dry_run: A flag, if true, don't actually move the files.
      mkdirs: A flag, if true, make all the intervening directories; otherwise,
        fail to move files to non-existing dirs.
      overwrite: A flag, if true, overwrite an existing destination file.
      idify: A flag, if true, remove whitespace and funky characters in the destination
        filename.
      logfile: A file object to write log entries to, or None, in which case no log is
        written out.
    """
    jobs = []
    has_errors = False
    for filename, importers in identify.find_imports(importer_config,
                                                     files_or_directories,
                                                     logfile):
        # If we're debugging, print out the match text.
        # This option is useful when we're building our importer configuration,
        # to figure out which patterns to create as unique signatures.
        if not importers:
            continue

        # Process a single file.
        new_fullname = file_one_file(filename, importers, destination, idify, logfile)
        if new_fullname is None:
            continue

        # Check if the destination directory exists.
        new_dirname = path.dirname(new_fullname)
        if not path.exists(new_dirname) and not mkdirs:
            logging.error("Destination directory '{}' does not exist.".format(new_dirname))
            has_errors = True
            continue

        # Check if the destination file already exists; we don't want to clobber
        # it by accident.
        if not overwrite and path.exists(new_fullname):
            logging.error("Destination file '{}' already exists.".format(new_fullname))
            has_errors = True
            continue

        jobs.append((filename, new_fullname))

    # Check if any two imported files would be colliding in their destination
    # name, before we move anything.
    destmap = collections.defaultdict(list)
    for src, dest in jobs:
        destmap[dest].append(src)
    for dest, sources in destmap.items():
        if len(sources) != 1:
            logging.error("Collision in destination filenames '{}': from {}.".format(
                dest, ", ".join(["'{}'".format(source) for source in sources])))
            has_errors = True

    # If there are any errors, just don't do anything at all. This is a nicer
    # behaviour than moving just *some* files.
    if dry_run or has_errors:
        return

    # Actually carry out the moving job.
    for old_filename, new_filename in jobs:
        move_xdev_file(old_filename, new_filename, mkdirs)

    return jobs
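
The collision check groups the planned moves by destination and flags any destination claimed by more than one source. A standalone sketch of that grouping follows; `find_collisions` is a hypothetical helper and the paths are made up.

```python
import collections


def find_collisions(jobs):
    """Return {destination: [sources]} for destinations claimed more than once."""
    destmap = collections.defaultdict(list)
    for src, dest in jobs:
        destmap[dest].append(src)
    return {dest: sources
            for dest, sources in destmap.items()
            if len(sources) != 1}


# Two downloads resolve to the same dated filename; the third is unique.
jobs = [('/dl/a.csv', '/docs/Assets/Bank/2020-01-01.stmt.csv'),
        ('/dl/b.csv', '/docs/Assets/Bank/2020-01-01.stmt.csv'),
        ('/dl/c.pdf', '/docs/Assets/Bank/2020-02-01.stmt.pdf')]
collisions = find_collisions(jobs)
```

In file(), any such collision sets `has_errors`, and nothing is moved at all.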

beancount.ingest.file.file_one_file(filename, importers, destination, idify=False, logfile=None)

Move a single filename using its matched importers.

Parameters:
  • filename – A string, the name of the downloaded file to be processed.

  • importers – A list of importer instances that handle this file.

  • destination – A string, the root destination directory where the files are to be filed. The files are organized there under a hierarchy mirroring that of the chart of accounts.

  • idify – A flag, if true, remove whitespace and funky characters in the destination filename.

  • logfile – A file object to write log entries to, or None, in which case no log is written out.

Returns:
  • The full new destination filename on success, and None if there was an error.

Source code in beancount/ingest/file.py
def file_one_file(filename, importers, destination, idify=False, logfile=None):
    """Move a single filename using its matched importers.

    Args:
      filename: A string, the name of the downloaded file to be processed.
      importers: A list of importer instances that handle this file.
      destination: A string, the root destination directory where the files are
        to be filed. The files are organized there under a hierarchy mirroring
        that of the chart of accounts.
      idify: A flag, if true, remove whitespace and funky characters in the destination
        filename.
      logfile: A file object to write log entries to, or None, in which case no log is
        written out.
    Returns:
      The full new destination filename on success, and None if there was an error.
    """
    # Create an object to cache all the conversions between the importers
    # and phases and what-not.
    file = cache.get_file(filename)

    # Get the account corresponding to the file.
    file_accounts = []
    for index, importer in enumerate(importers):
        try:
            account_ = importer.file_account(file)
        except Exception as exc:
            account_ = None
            logging.exception("Importer %s.file_account() raised an unexpected error: %s",
                              importer.name(), exc)
        if account_ is not None:
            file_accounts.append(account_)

    file_accounts_set = set(file_accounts)
    if not file_accounts_set:
        logging.error("No account provided by importers: {}".format(
            ", ".join(imp.name() for imp in importers)))
        return None

    if len(file_accounts_set) > 1:
        logging.warning("Ambiguous accounts from many importers: {}".format(
            ', '.join(file_accounts_set)))
        # Note: Don't exit; select the first matching importer's account.

    file_account = file_accounts.pop(0)

    # Given multiple importers, select the first one that was yielded to
    # obtain the date and process the filename.
    importer = importers[0]

    # Compute the date from the last modified time.
    mtime = path.getmtime(filename)
    mtime_date = datetime.datetime.fromtimestamp(mtime).date()

    # Try to get the file's date by calling a module support function. The
    # module may be able to extract the date from the filename, from the
    # contents of the file itself (e.g. scraping some text from the PDF
    # contents, or grabbing the last line of a CSV file).
    try:
        date = importer.file_date(file)
    except Exception as exc:
        logging.exception("Importer %s.file_date() raised an unexpected error: %s",
                          importer.name(), exc)
        date = None
    if date is None:
        # Fallback on the last modified time of the file.
        date = mtime_date
        date_source = 'mtime'
    else:
        date_source = 'contents'

    # Apply filename renaming, if implemented.
    # Otherwise clean up the filename.
    try:
        clean_filename = importer.file_name(file)

        # Warn the importer implementor if a name is returned and it's an
        # absolute filename.
        if clean_filename and (path.isabs(clean_filename) or os.sep in clean_filename):
            logging.error(("The importer '%s' file_name() method should return a relative "
                           "filename; the filename '%s' is absolute or contains path "
                           "separators"),
                          importer.name(), clean_filename)
    except Exception as exc:
        logging.exception("Importer %s.file_name() raised an unexpected error: %s",
                          importer.name(), exc)
        clean_filename = None
    if clean_filename is None:
        # If no filename has been provided, use the basename.
        clean_filename = path.basename(file.name)
    elif re.match(r'\d\d\d\d-\d\d-\d\d', clean_filename):
        logging.error("The importer '%s' file_name() method should not date the "
                      "returned filename. Implement file_date() instead.",
                      importer.name())

    # We need a simple filename; remove the directory part if there is one.
    clean_basename = path.basename(clean_filename)

    # Remove whitespace if requested.
    if idify:
        clean_basename = misc_utils.idify(clean_basename)

    # Prepend the date prefix.
    new_filename = '{0:%Y-%m-%d}.{1}'.format(date, clean_basename)

    # Prepend destination directory.
    new_fullname = path.normpath(path.join(destination,
                                           file_account.replace(account.sep, os.sep),
                                           new_filename))

    # Print the filename and which modules matched.
    if logfile is not None:
        logfile.write('Importer:    {}\n'.format(importer.name() if importer else '-'))
        logfile.write('Account:     {}\n'.format(file_account))
        logfile.write('Date:        {} (from {})\n'.format(date, date_source))
        logfile.write('Destination: {}\n'.format(new_fullname))
        logfile.write('\n')

    return new_fullname
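
The destination name is built from three pieces: the root directory, the account turned into a directory path, and the date-prefixed basename. The sketch below reproduces just that composition. `build_destination` is a hypothetical helper; it assumes `:` as the account separator and uses `posixpath` so that the result is the same on every platform.

```python
import datetime
import posixpath


def build_destination(destination, file_account, date, basename):
    """Compose '<root>/<account as dirs>/<YYYY-MM-DD>.<basename>'.

    Assumptions: ':' separates account components, and POSIX path
    separators are used regardless of the host platform.
    """
    new_filename = '{0:%Y-%m-%d}.{1}'.format(date, basename)
    return posixpath.normpath(posixpath.join(
        destination, file_account.replace(':', '/'), new_filename))


result = build_destination('/docs', 'Assets:US:Bank:Checking',
                           datetime.date(2020, 3, 14), 'statement.csv')
```

With these inputs, `result` is `/docs/Assets/US/Bank/Checking/2020-03-14.statement.csv`.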

beancount.ingest.file.move_xdev_file(src_filename, dst_filename, mkdirs=False)

Move a file, potentially across devices.

Parameters:
  • src_filename – A string, the name of the file to copy.

  • dst_filename – A string, where to copy the file.

  • mkdirs – A flag, true if we should create a non-existing destination directory.

Source code in beancount/ingest/file.py
def move_xdev_file(src_filename, dst_filename, mkdirs=False):
    """Move a file, potentially across devices.

    Args:
      src_filename: A string, the name of the file to copy.
      dst_filename: A string, where to copy the file.
      mkdirs: A flag, true if we should create a non-existing destination directory.
    """
    # Create missing directory if required.
    dst_dirname = path.dirname(dst_filename)
    if mkdirs:
        if not path.exists(dst_dirname):
            os.makedirs(dst_dirname)
    else:
        if not path.exists(dst_dirname):
            raise OSError("Destination directory '{}' does not exist.".format(dst_dirname))

    # Copy the file to its new name.
    shutil.copyfile(src_filename, dst_filename)

    # Remove the old file. Note that we copy and remove to support
    # cross-device moves, because it's sensible that the destination might
    # be on an encrypted device.
    os.remove(src_filename)
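
The copy-then-remove sequence can be exercised safely in a temporary directory. The sketch below is a simplified illustration: `copy_then_remove` is a hypothetical helper that always creates the destination directory, which corresponds to the `mkdirs=True` case above.

```python
import os
import shutil
import tempfile


def copy_then_remove(src_filename, dst_filename):
    """Move a file by copying it and then deleting the source.

    Unlike a plain rename, this also works when source and destination
    live on different devices.
    """
    os.makedirs(os.path.dirname(dst_filename), exist_ok=True)
    shutil.copyfile(src_filename, dst_filename)
    os.remove(src_filename)


with tempfile.TemporaryDirectory() as tmpdir:
    src = os.path.join(tmpdir, 'download.csv')
    dst = os.path.join(tmpdir, 'Assets', 'Bank', '2020-01-01.download.csv')
    with open(src, 'w') as outfile:
        outfile.write('Date,Amount\n')

    copy_then_remove(src, dst)

    src_exists = os.path.exists(src)
    with open(dst) as infile:
        moved_contents = infile.read()
```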

beancount.ingest.file.run(args, parser, importers_list, files_or_directories, hooks=None)

Run the subcommand.

Source code in beancount/ingest/file.py
def run(args, parser, importers_list, files_or_directories, hooks=None):
    """Run the subcommand."""

    # If the output directory is not specified, move the files at the root where
    # the import configuration file is located. (Providing this default seems
    # better than using a required option.)
    if args.output_dir is None:
        if hasattr(args, 'config'):
            args.output_dir = path.dirname(path.abspath(args.config))
        else:
            import __main__ # pylint: disable=import-outside-toplevel
            args.output_dir = path.dirname(path.abspath(__main__.__file__))

    # Make sure the output directory exists.
    if not path.exists(args.output_dir):
        parser.error('Output directory "{}" does not exist.'.format(args.output_dir))

    file(importers_list, files_or_directories, args.output_dir,
         dry_run=args.dry_run,
         mkdirs=True,
         overwrite=args.overwrite,
         idify=True,
         logfile=sys.stdout)
    return 0

beancount.ingest.identify

Identify script.

Read an import script and a list of downloaded filenames or directories of downloaded files, and for each of those files, identify which importer it should be associated with.

beancount.ingest.identify.add_arguments(parser)

Add arguments for the identify command.

Source code in beancount/ingest/identify.py
def add_arguments(parser):
    """Add arguments for the identify command."""

beancount.ingest.identify.find_imports(importer_config, files_or_directories, logfile=None)

Given an importer configuration, search for files that can be imported in the list of files or directories, run the signature checks on them, and yield (filename, importers) pairs, where 'importers' is a list of importers that matched the file.

Parameters:
  • importer_config – a list of importer instances that define the config.

  • files_or_directories – a list of files or directories to walk recursively and hunt for files to import.

  • logfile – A file object to write log entries to, or None, in which case no log is written out.

Yields: Pairs of the filename found and the list of importers matching this file.

Source code in beancount/ingest/identify.py
def find_imports(importer_config, files_or_directories, logfile=None):
    """Given an importer configuration, search for files that can be imported in the
    list of files or directories, run the signature checks on them and return a list
    of (filename, importers), where 'importers' is a list of importers that matched
    the file.

    Args:
      importer_config: a list of importer instances that define the config.
      files_or_directories: a list of files or directories to walk recursively and
                            hunt for files to import.
      logfile: A file object to write log entries to, or None, in which case no log is
        written out.
    Yields:
      Pairs of the filename found and the list of importers matching this
      file.
    """
    # Iterate over all files found; accumulate the entries by identification.
    for filename in file_utils.find_files(files_or_directories):
        if logfile is not None:
            logfile.write(SECTION.format(filename))
            logfile.write('\n')

        # Skip files that are simply too large.
        size = path.getsize(filename)
        if size > FILE_TOO_LARGE_THRESHOLD:
            logging.warning("File too large: '{}' ({} bytes); skipping.".format(
                filename, size))
            continue

        # For each of the sources the user has declared, identify which
        # match the text.
        file = cache.get_file(filename)
        matching_importers = []
        for importer in importer_config:
            try:
                matched = importer.identify(file)
                if matched:
                    matching_importers.append(importer)
            except Exception as exc:
                logging.exception("Importer %s.identify() raised an unexpected error: %s",
                                  importer.name(), exc)

        yield (filename, matching_importers)
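The matching loop above can be illustrated without beancount installed. The following is a minimal sketch, not the library's implementation: `SuffixImporter`, `BrokenImporter` and `match_importers` are hypothetical names, and a `SimpleNamespace` with a `name` attribute stands in for the object returned by `cache.get_file()`. It shows that an importer raising inside `identify()` is skipped rather than aborting the scan, and that each file is paired with the list of importers that matched it.

```python
from types import SimpleNamespace


class SuffixImporter:
    """Hypothetical importer that matches files by filename suffix."""

    def __init__(self, suffix):
        self.suffix = suffix

    def name(self):
        return "SuffixImporter({})".format(self.suffix)

    def identify(self, file):
        return file.name.endswith(self.suffix)


class BrokenImporter:
    """Hypothetical importer whose identify() always fails."""

    def name(self):
        return "BrokenImporter"

    def identify(self, file):
        raise RuntimeError("cannot read file")


def match_importers(importers, filenames):
    """Yield (filename, matching_importers) pairs, skipping importers that raise."""
    for filename in filenames:
        file = SimpleNamespace(name=filename)  # stand-in for cache.get_file()
        matching = []
        for importer in importers:
            try:
                if importer.identify(file):
                    matching.append(importer)
            except Exception:
                # A failing importer must not abort the whole scan.
                continue
        yield filename, matching


importers = [SuffixImporter(".csv"), SuffixImporter(".pdf"), BrokenImporter()]
results = {filename: [imp.name() for imp in matched]
           for filename, matched in match_importers(importers, ["a.csv", "b.txt"])}
```

A file that no importer recognizes still appears in the output, paired with an empty list.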

beancount.ingest.identify.identify(importers_list, files_or_directories)

Run the identification loop.

Parameters:
  • importers_list – A list of importer instances.

  • files_or_directories – A list of strings, files or directories.

Source code in beancount/ingest/identify.py
def identify(importers_list, files_or_directories):
    """Run the identification loop.

    Args:
      importers_list: A list of importer instances.
      files_or_directories: A list of strings, files or directories.
    """
    logfile = sys.stdout
    for filename, importers in find_imports(importers_list, files_or_directories,
                                            logfile=logfile):
        file = cache.get_file(filename)
        for importer in importers:
            logfile.write('Importer:    {}\n'.format(importer.name() if importer else '-'))
            logfile.write('Account:     {}\n'.format(importer.file_account(file)))
            logfile.write('\n')

beancount.ingest.identify.run(_, __, importers_list, files_or_directories, hooks=None)

Run the subcommand.

Source code in beancount/ingest/identify.py
def run(_, __, importers_list, files_or_directories, hooks=None):
    """Run the subcommand."""
    return identify(importers_list, files_or_directories)

beancount.ingest.importer

Importer protocol.

All importers must comply with this interface and implement at least some of its methods. A configuration consists of a simple list of such importer instances. The importer processes run through the importers, calling some of their methods in order to identify, extract and file the downloaded files.

Each of the methods accepts a cache.FileMemo object which has a 'name' attribute with the filename to process, but which also provides a place to cache conversions. Use its convert() method whenever possible to avoid carrying out the same conversion multiple times. See beancount.ingest.cache for more details.
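The benefit of routing conversions through a shared memo object can be sketched as follows. This is an illustration under assumptions, not the actual cache.FileMemo code: `FileMemoSketch` and `expensive_to_upper` are hypothetical, and the sketch assumes results are cached per converter function.

```python
class FileMemoSketch:
    """Hypothetical stand-in for cache.FileMemo: caches one result per converter."""

    def __init__(self, filename):
        self.name = filename
        self._cache = {}

    def convert(self, converter_func):
        # Run the converter only the first time it is requested for this file.
        if converter_func not in self._cache:
            self._cache[converter_func] = converter_func(self.name)
        return self._cache[converter_func]


calls = []


def expensive_to_upper(filename):
    """Pretend conversion; records each invocation so reuse is observable."""
    calls.append(filename)
    return filename.upper()


memo = FileMemoSketch("statement.pdf")
first = memo.convert(expensive_to_upper)
second = memo.convert(expensive_to_upper)
```

The second call returns the cached object, so two importers asking for the same conversion pay for it only once.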

Synopsis:

  • name(): Return a unique identifier for the importer instance.
  • identify(): Return true if the importer is able to process the file.
  • extract(): Extract directives from a file's contents and return a list of entries.
  • file_account(): Return an account name associated with the given file for this importer.
  • file_date(): Return a date associated with the downloaded file (e.g., the statement date).
  • file_name(): Return a cleaned up filename for storage (optional).

Just to be clear: Although this importer will not raise NotImplementedError exceptions (it returns default values for each method), you NEED to derive from it in order to do anything meaningful. Simply instantiating this importer will neither match any file nor provide any useful information. It just defines the protocol for all importers.
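To make the protocol concrete, here is a minimal sketch of an importer that fills in all six methods. Everything in it is hypothetical (the class name, the account name, the fixed date), and it is written as a plain class so that it runs without beancount; a real importer would derive from importer.ImporterProtocol and receive a cache.FileMemo rather than the `SimpleNamespace` stand-in used here.

```python
import datetime
from types import SimpleNamespace


class AcmeBankImporter:
    """Hypothetical importer; a real one would subclass importer.ImporterProtocol."""

    def name(self):
        return "AcmeBankImporter"

    def identify(self, file):
        lowered = file.name.lower()
        return lowered.endswith(".csv") and "acme" in lowered

    def extract(self, file, existing_entries=None):
        return []  # A real importer would build directives here.

    def file_account(self, file):
        return "Assets:US:Acme:Checking"

    def file_date(self, file):
        return datetime.date(2020, 1, 31)  # Placeholder statement date.

    def file_name(self, file):
        return "acme-checking.csv"  # Relative name, never an absolute path.


file = SimpleNamespace(name="/downloads/ACME_export.csv")
other = SimpleNamespace(name="/downloads/other.pdf")
importer = AcmeBankImporter()
```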

beancount.ingest.importer.ImporterProtocol

Interface that all source importers need to comply with.

beancount.ingest.importer.ImporterProtocol.__str__(self) special

Return a unique id/name for this importer.

Returns:
  • A string which uniquely identifies this importer.

Source code in beancount/ingest/importer.py
def name(self):
    """Return a unique id/name for this importer.

    Returns:
      A string which uniquely identifies this importer.
    """
    cls = self.__class__
    return '{}.{}'.format(cls.__module__, cls.__name__)

beancount.ingest.importer.ImporterProtocol.extract(self, file, existing_entries=None)

Extract transactions from a file.

If the importer would like to flag a returned transaction as a known duplicate, it may opt to set the special flag "__duplicate__" to True, and the transaction should be treated as a duplicate by the extraction code. This is a way to let the importer use particular information about previously imported transactions in order to flag them as duplicates. This is useful, for example, if an importer has a way to get a persistent unique id for each of the imported transactions. (See this discussion for context: https://groups.google.com/d/msg/beancount/0iV-ipBJb8g/-uk4wsH2AgAJ)

Parameters:
  • file – A cache.FileMemo instance.

  • existing_entries – An optional list of existing directives loaded from the ledger which is intended to contain the extracted entries. This is only provided if the user provides them via a flag in the extractor program.

Returns:
  • A list of new, imported directives (usually mostly Transactions) extracted from the file.

Source code in beancount/ingest/importer.py
def extract(self, file, existing_entries=None):
    """Extract transactions from a file.

    If the importer would like to flag a returned transaction as a known
    duplicate, it may opt to set the special flag "__duplicate__" to True,
    and the transaction should be treated as a duplicate by the extraction
    code. This is a way to let the importer use particular information about
    previously imported transactions in order to flag them as duplicates.
    For example, if an importer has a way to get a persistent unique id for
    each of the imported transactions. (See this discussion for context:
    https://groups.google.com/d/msg/beancount/0iV-ipBJb8g/-uk4wsH2AgAJ)

    Args:
      file: A cache.FileMemo instance.
      existing_entries: An optional list of existing directives loaded from
        the ledger which is intended to contain the extracted entries. This
        is only provided if the user provides them via a flag in the
        extractor program.
    Returns:
      A list of new, imported directives (usually mostly Transactions)
      extracted from the file.
    """
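One way an importer might use the "__duplicate__" flag is sketched below. This is an assumption-laden illustration: it assumes the flag is stored in each entry's metadata dict, and the `bank_id` metadata key, the `flag_known_duplicates` helper and the `SimpleNamespace` entries are all hypothetical stand-ins for real directives.

```python
from types import SimpleNamespace

DUPLICATE_FLAG = "__duplicate__"


def flag_known_duplicates(new_entries, seen_ids):
    """Mark entries whose 'bank_id' metadata is in seen_ids as duplicates."""
    for entry in new_entries:
        if entry.meta.get("bank_id") in seen_ids:
            entry.meta[DUPLICATE_FLAG] = True
    return new_entries


entries = [SimpleNamespace(meta={"bank_id": "tx-1"}),
           SimpleNamespace(meta={"bank_id": "tx-2"})]
flagged = flag_known_duplicates(entries, seen_ids={"tx-1"})
```

In a real importer, `seen_ids` would typically be collected from the `existing_entries` argument.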

beancount.ingest.importer.ImporterProtocol.file_account(self, file)

Return an account associated with the given file.

Note: If you don't implement this method you won't be able to move the files into their preservation hierarchy; the bean-file command won't work.

Also, normally the returned account is not a function of the input file--just of the importer--but it is provided anyhow.

Parameters:
  • file – A cache.FileMemo instance.

Returns:
  • The name of the account that corresponds to this importer.

Source code in beancount/ingest/importer.py
def file_account(self, file):
    """Return an account associated with the given file.

    Note: If you don't implement this method you won't be able to move the
    files into its preservation hierarchy; the bean-file command won't
    work.

    Also, normally the returned account is not a function of the input
    file--just of the importer--but it is provided anyhow.

    Args:
      file: A cache.FileMemo instance.
    Returns:
      The name of the account that corresponds to this importer.
    """

beancount.ingest.importer.ImporterProtocol.file_date(self, file)

Attempt to obtain a date that corresponds to the given file.

Parameters:
  • file – A cache.FileMemo instance.

Returns:
  • A date object, if successful, or None if a date could not be extracted. (If no date is returned, the file creation time is used. This is the default.)

Source code in beancount/ingest/importer.py
def file_date(self, file):
    """Attempt to obtain a date that corresponds to the given file.

    Args:
      file: A cache.FileMemo instance.
    Returns:
      A date object, if successful, or None if a date could not be extracted.
      (If no date is returned, the file creation time is used. This is the
      default.)
    """

beancount.ingest.importer.ImporterProtocol.file_name(self, file)

A filter that optionally renames a file before filing.

This is used to make tidy filenames for filed/stored document files. If you don't implement this and return None, the same filename is used. Note that if you return a filename, a simple, RELATIVE filename must be returned, not an absolute filename.

Parameters:
  • file – A cache.FileMemo instance.

Returns:
  • The tidied up, new filename to store it as.

Source code in beancount/ingest/importer.py
def file_name(self, file):
    """A filter that optionally renames a file before filing.

    This is used to make tidy filenames for filed/stored document files. If
    you don't implement this and return None, the same filename is used.
    Note that if you return a filename, a simple, RELATIVE filename must be
    returned, not an absolute filename.

    Args:
      file: A cache.FileMemo instance.
    Returns:
      The tidied up, new filename to store it as.
    """

beancount.ingest.importer.ImporterProtocol.identify(self, file)

Return true if this importer matches the given file.

Parameters:
  • file – A cache.FileMemo instance.

Returns:
  • A boolean, true if this importer can handle this file.

Source code in beancount/ingest/importer.py
def identify(self, file):
    """Return true if this importer matches the given file.

    Args:
      file: A cache.FileMemo instance.
    Returns:
      A boolean, true if this importer can handle this file.
    """

beancount.ingest.importer.ImporterProtocol.name(self)

Return a unique id/name for this importer.

Returns:
  • A string which uniquely identifies this importer.

Source code in beancount/ingest/importer.py
def name(self):
    """Return a unique id/name for this importer.

    Returns:
      A string which uniquely identifies this importer.
    """
    cls = self.__class__
    return '{}.{}'.format(cls.__module__, cls.__name__)

beancount.ingest.importers special

beancount.ingest.importers.config

Mixin to add support for configuring importers with multiple accounts.

This importer implements some simple common functionality to create importers which accept a large number of account names or regular expressions on the set of account names. This is inspired by functionality in the importers in the previous iteration of the ingest code, which used to be its own project.

beancount.ingest.importers.config.ConfigImporterMixin

A mixin class which supports configuration of account names.

Mix this into the implementation of an importer.ImporterProtocol.

beancount.ingest.importers.config.ConfigImporterMixin.__init__(self, config) special

Provide a list of accounts and regexps as configuration to the importer.

Parameters:
  • config – A dict of configuration accounts, that must match the values declared in the class' REQUIRED_CONFIG.

Source code in beancount/ingest/importers/config.py
def __init__(self, config):
    """Provide a list of accounts and regexps as configuration to the importer.

    Args:
      config: A dict of configuration accounts, that must match the values
        declared in the class' REQUIRED_CONFIG.
    """
    super().__init__()

    # Check that the required configuration values are present.
    assert isinstance(config, dict), "Configuration must be a dict type"
    if not self._verify_config(config):
        raise ValueError("Invalid config {}, requires {}".format(
            config, self.REQUIRED_CONFIG))
    self.config = config

beancount.ingest.importers.csv

CSV importer.

beancount.ingest.importers.csv.Col (Enum)

The set of interpretable columns.

beancount.ingest.importers.csv.Importer (IdentifyMixin, FilingMixin)

Importer for CSV files.

beancount.ingest.importers.csv.Importer.__init__(self, config, account, currency, regexps=None, skip_lines=0, last4_map=None, categorizer=None, institution=None, debug=False, csv_dialect='excel', dateutil_kwds=None, narration_sep='; ', encoding=None, invert_sign=False, **kwds) special

Constructor.

Parameters:
  • config – A dict of Col enum types to the names or indexes of the columns.

  • account – An account string, the account to post this to.

  • currency – A currency string, the currency of this account.

  • regexps – A list of regular expression strings.

  • skip_lines (int) – Skip first x (garbage) lines of file.

  • last4_map (Optional[Dict]) – A dict that maps last 4 digits of the card to a friendly string.

  • categorizer (Optional[Callable]) – A callable that attaches the other posting (usually expenses) to a transaction with only a single posting.

  • institution (Optional[str]) – An optional name of an institution to rename the files to.

  • debug (bool) – Whether or not to print debug information.

  • csv_dialect (Union[str, csv.Dialect]) – A csv dialect given either as string or as instance or subclass of csv.Dialect.

  • dateutil_kwds (Optional[Dict]) – An optional dict defining the dateutil parser kwargs.

  • narration_sep (str) – A string, a separator to use for splitting up the payee and narration fields of a source field.

  • encoding (Optional[str]) – An optional encoding for the file. Typically useful for files encoded in 'latin1' instead of 'utf-8' (the default).

  • invert_sign (Optional[bool]) – If true, invert the amount's sign unconditionally.

  • **kwds – Extra keyword arguments to provide to the base mixins.

Source code in beancount/ingest/importers/csv.py
def __init__(self, config, account, currency,
             regexps=None,
             skip_lines: int = 0,
             last4_map: Optional[Dict] = None,
             categorizer: Optional[Callable] = None,
             institution: Optional[str] = None,
             debug: bool = False,
             csv_dialect: Union[str, csv.Dialect] = 'excel',
             dateutil_kwds: Optional[Dict] = None,
             narration_sep: str = '; ',
             encoding: Optional[str] = None,
             invert_sign: Optional[bool] = False,
             **kwds):
    """Constructor.

    Args:
      config: A dict of Col enum types to the names or indexes of the columns.
      account: An account string, the account to post this to.
      currency: A currency string, the currency of this account.
      regexps: A list of regular expression strings.
      skip_lines: Skip first x (garbage) lines of file.
      last4_map: A dict that maps last 4 digits of the card to a friendly string.
      categorizer: A callable that attaches the other posting (usually expenses)
        to a transaction with only single posting.
      institution: An optional name of an institution to rename the files to.
      debug: Whether or not to print debug information
      csv_dialect: A `csv` dialect given either as string or as instance or
        subclass of `csv.Dialect`.
      dateutil_kwds: An optional dict defining the dateutil parser kwargs.
      narration_sep: A string, a separator to use for splitting up the payee and
        narration fields of a source field.
      encoding: An optional encoding for the file. Typically useful for files
        encoded in 'latin1' instead of 'utf-8' (the default).
      invert_sign: If true, invert the amount's sign unconditionally.
      **kwds: Extra keyword arguments to provide to the base mixins.
    """
    assert isinstance(config, dict), "Invalid type: {}".format(config)
    self.config = config

    self.currency = currency
    assert isinstance(skip_lines, int)
    self.skip_lines = skip_lines
    self.last4_map = last4_map or {}
    self.debug = debug
    self.dateutil_kwds = dateutil_kwds
    self.csv_dialect = csv_dialect
    self.narration_sep = narration_sep
    self.encoding = encoding
    self.invert_sign = invert_sign

    self.categorizer = categorizer

    # Prepare kwds for filing mixin.
    kwds['filing'] = account
    if institution:
        prefix = kwds.get('prefix', None)
        assert prefix is None
        kwds['prefix'] = institution

    # Prepare kwds for identifier mixin.
    if isinstance(regexps, str):
        regexps = [regexps]
    matchers = kwds.setdefault('matchers', [])
    matchers.append(('mime', 'text/csv'))
    if regexps:
        for regexp in regexps:
            matchers.append(('content', regexp))

    super().__init__(**kwds)
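The way the constructor turns `regexps` into identification matchers can be sketched in isolation. `build_matchers` is a hypothetical helper that mirrors the logic above; it is not part of the library.

```python
def build_matchers(regexps=None, matchers=None):
    """Return the matcher list: a CSV MIME check plus one content check per regexp."""
    matchers = list(matchers or [])
    if isinstance(regexps, str):
        regexps = [regexps]  # A single string is treated as a one-element list.
    matchers.append(("mime", "text/csv"))
    for regexp in regexps or []:
        matchers.append(("content", regexp))
    return matchers


single = build_matchers("Acme Bank")
several = build_matchers(["Acme", "Checking"], matchers=[("filename", r".*\.csv")])
```

Any matchers passed in by the caller are kept ahead of the ones derived from `regexps`.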

beancount.ingest.importers.csv.Importer.extract(self, file, existing_entries=None)

Extract transactions from a file.

If the importer would like to flag a returned transaction as a known duplicate, it may opt to set the special flag "__duplicate__" to True, and the transaction should be treated as a duplicate by the extraction code. This is a way to let the importer use particular information about previously imported transactions in order to flag them as duplicates. This is useful, for example, if an importer has a way to get a persistent unique id for each of the imported transactions. (See this discussion for context: https://groups.google.com/d/msg/beancount/0iV-ipBJb8g/-uk4wsH2AgAJ)

Parameters:
  • file – A cache.FileMemo instance.

  • existing_entries – An optional list of existing directives loaded from the ledger which is intended to contain the extracted entries. This is only provided if the user provides them via a flag in the extractor program.

Returns:
  • A list of new, imported directives (usually mostly Transactions) extracted from the file.

Source code in beancount/ingest/importers/csv.py
def extract(self, file, existing_entries=None):
    account = self.file_account(file)
    entries = []

    # Normalize the configuration to fetch by index.
    iconfig, has_header = normalize_config(
        self.config, file.head(), self.csv_dialect, self.skip_lines)

    reader = iter(csv.reader(open(file.name, encoding=self.encoding),
                             dialect=self.csv_dialect))

    # Skip garbage lines
    for _ in range(self.skip_lines):
        next(reader)

    # Skip header, if one was detected.
    if has_header:
        next(reader)

    def get(row, ftype):
        try:
            return row[iconfig[ftype]] if ftype in iconfig else None
        except IndexError:  # FIXME: this should not happen
            return None

    # Parse all the transactions.
    first_row = last_row = None
    for index, row in enumerate(reader, 1):
        if not row:
            continue
        if row[0].startswith('#'):
            continue

        # If debugging, print out the rows.
        if self.debug:
            print(row)

        if first_row is None:
            first_row = row
        last_row = row

        # Extract the data we need from the row, based on the configuration.
        date = get(row, Col.DATE)
        txn_date = get(row, Col.TXN_DATE)
        txn_time = get(row, Col.TXN_TIME)

        payee = get(row, Col.PAYEE)
        if payee:
            payee = payee.strip()

        fields = filter(None, [get(row, field)
                               for field in (Col.NARRATION1,
                                             Col.NARRATION2,
                                             Col.NARRATION3)])
        narration = self.narration_sep.join(
            field.strip() for field in fields).replace('\n', '; ')

        tag = get(row, Col.TAG)
        tags = {tag} if tag is not None else data.EMPTY_SET

        link = get(row, Col.REFERENCE_ID)
        links = {link} if link is not None else data.EMPTY_SET

        last4 = get(row, Col.LAST4)

        balance = get(row, Col.BALANCE)

        # Create a transaction
        meta = data.new_metadata(file.name, index)
        if txn_date is not None:
            meta['date'] = parse_date_liberally(txn_date,
                                                self.dateutil_kwds)
        if txn_time is not None:
            meta['time'] = str(dateutil.parser.parse(txn_time).time())
        if balance is not None:
            meta['balance'] = D(balance)
        if last4:
            last4_friendly = self.last4_map.get(last4.strip())
            meta['card'] = last4_friendly if last4_friendly else last4
        date = parse_date_liberally(date, self.dateutil_kwds)
        txn = data.Transaction(meta, date, self.FLAG, payee, narration,
                               tags, links, [])

        # Attach one posting to the transaction
        amount_debit, amount_credit = self.get_amounts(iconfig, row)

        # Skip empty transactions
        if amount_debit is None and amount_credit is None:
            continue

        for amount in [amount_debit, amount_credit]:
            if amount is None:
                continue
            if self.invert_sign:
                amount = -amount
            units = Amount(amount, self.currency)
            txn.postings.append(
                data.Posting(account, units, None, None, None, None))

        # Attach the other posting(s) to the transaction.
        if isinstance(self.categorizer, collections.abc.Callable):
            txn = self.categorizer(txn)

        # Add the transaction to the output list
        entries.append(txn)

    # Figure out if the file is in ascending or descending order.
    first_date = parse_date_liberally(get(first_row, Col.DATE),
                                      self.dateutil_kwds)
    last_date = parse_date_liberally(get(last_row, Col.DATE),
                                     self.dateutil_kwds)
    is_ascending = first_date < last_date

    # Reverse the list if the file is in descending order
    if not is_ascending:
        entries = list(reversed(entries))

    # Add a balance entry if possible
    if Col.BALANCE in iconfig and entries:
        entry = entries[-1]
        date = entry.date + datetime.timedelta(days=1)
        balance = entry.meta.get('balance', None)
        if balance is not None:
            meta = data.new_metadata(file.name, index)
            entries.append(
                data.Balance(meta, date,
                             account, Amount(balance, self.currency),
                             None, None))

    # Remove the 'balance' metadata.
    for entry in entries:
        entry.meta.pop('balance', None)

    return entries
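The final steps of the method (detecting file order, reversing if needed, and dating the trailing balance check one day after the last transaction) can be sketched on bare dates. `order_and_balance_date` is a hypothetical helper working on `datetime.date` values rather than real directives.

```python
import datetime


def order_and_balance_date(dates):
    """Return dates in ascending order plus the date for a trailing balance check."""
    ordered = list(dates)
    # Mirror the first-vs-last comparison: anything not strictly ascending is reversed.
    if ordered and not ordered[0] < ordered[-1]:
        ordered.reverse()
    balance_date = ordered[-1] + datetime.timedelta(days=1) if ordered else None
    return ordered, balance_date


descending = [datetime.date(2020, 1, 3),
              datetime.date(2020, 1, 2),
              datetime.date(2020, 1, 1)]
ordered, balance_date = order_and_balance_date(descending)
```

Note that only the first and last rows are compared, so a file whose rows are not monotonically ordered is not sorted by this step.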

beancount.ingest.importers.csv.Importer.file_date(self, file)

Get the maximum date from the file.

Source code in beancount/ingest/importers/csv.py
def file_date(self, file):
    "Get the maximum date from the file."
    iconfig, has_header = normalize_config(
        self.config, file.head(), self.csv_dialect, self.skip_lines)
    if Col.DATE in iconfig:
        reader = iter(csv.reader(open(file.name), dialect=self.csv_dialect))
        for _ in range(self.skip_lines):
            next(reader)
        if has_header:
            next(reader)
        max_date = None
        for row in reader:
            if not row:
                continue
            if row[0].startswith('#'):
                continue
            date_str = row[iconfig[Col.DATE]]
            date = parse_date_liberally(date_str, self.dateutil_kwds)
            if max_date is None or date > max_date:
                max_date = date
        return max_date

beancount.ingest.importers.csv.Importer.get_amounts(self, iconfig, row, allow_zero_amounts=False)

See function get_amounts() for details.

This method is present to allow clients to override it in order to deal with special cases, e.g., columns with currency symbols in them.

Source code in beancount/ingest/importers/csv.py
def get_amounts(self, iconfig, row, allow_zero_amounts=False):
    """See function get_amounts() for details.

    This method is present to allow clients to override it in order to deal
    with special cases, e.g., columns with currency symbols in them.
    """
    return get_amounts(iconfig, row, allow_zero_amounts)

beancount.ingest.importers.csv.get_amounts(iconfig, row, allow_zero_amounts=False)

Get the amount columns of a row.

Parameters:
  • iconfig – A dict of Col to row index.

  • row – A row array containing the values of the given row.

  • allow_zero_amounts – Is a transaction with amount D('0.00') okay? If not, return (None, None).

Returns:
  • A pair of (debit-amount, credit-amount), each of which is either an instance of Decimal, or None if not available.

Source code in beancount/ingest/importers/csv.py
def get_amounts(iconfig, row, allow_zero_amounts=False):
    """Get the amount columns of a row.

    Args:
      iconfig: A dict of Col to row index.
      row: A row array containing the values of the given row.
      allow_zero_amounts: Is a transaction with amount D('0.00') okay? If not,
        return (None, None).
    Returns:
      A pair of (debit-amount, credit-amount), both of which are either an
      instance of Decimal or None, or not available.
    """
    debit, credit = None, None
    if Col.AMOUNT in iconfig:
        credit = row[iconfig[Col.AMOUNT]]
    else:
        debit, credit = [row[iconfig[col]] if col in iconfig else None
                         for col in [Col.AMOUNT_DEBIT, Col.AMOUNT_CREDIT]]

    # If zero amounts aren't allowed, return null value.
    is_zero_amount = ((credit is not None and D(credit) == ZERO) and
                      (debit is not None and D(debit) == ZERO))
    if not allow_zero_amounts and is_zero_amount:
        return (None, None)

    return (-D(debit) if debit else None,
            D(credit) if credit else None)
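The sign convention is easiest to see on sample rows. The sketch below is a simplified stand-in, not the library function: it uses the standard library's `Decimal` in place of beancount's `D`, defines its own minimal `Col` enum, and leaves out the zero-amount check. `split_amounts` is a hypothetical name.

```python
from decimal import Decimal
from enum import Enum


class Col(Enum):
    """Stand-in for the importer's Col enum; only the amount columns are modeled."""
    AMOUNT = "amount"
    AMOUNT_DEBIT = "debit"
    AMOUNT_CREDIT = "credit"


def split_amounts(iconfig, row):
    """Return (debit, credit) as Decimals, negating the debit column."""
    if Col.AMOUNT in iconfig:
        debit, credit = None, row[iconfig[Col.AMOUNT]]
    else:
        debit = row[iconfig[Col.AMOUNT_DEBIT]] if Col.AMOUNT_DEBIT in iconfig else None
        credit = row[iconfig[Col.AMOUNT_CREDIT]] if Col.AMOUNT_CREDIT in iconfig else None
    # Empty strings count as "not available", just like None.
    return (-Decimal(debit) if debit else None,
            Decimal(credit) if credit else None)


two_columns = {Col.AMOUNT_DEBIT: 1, Col.AMOUNT_CREDIT: 2}
withdrawal = split_amounts(two_columns, ["2020-01-02", "12.50", ""])
deposit = split_amounts(two_columns, ["2020-01-03", "", "100.00"])
single = split_amounts({Col.AMOUNT: 1}, ["2020-01-04", "-7.25"])
```

A single AMOUNT column is returned in the credit slot with its sign unchanged; only a separate debit column is negated.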

beancount.ingest.importers.csv.normalize_config(config, head, dialect='excel', skip_lines=0)

Using the header line, convert the configuration field name lookups to int indexes.

Parameters:
  • config – A dict of Col types to string or indexes.

  • head – A string, some decent number of bytes of the head of the file.

  • dialect – A dialect definition to parse the header.

  • skip_lines (int) – Skip first x (garbage) lines of file.

Returns:
  • A pair of a dict of Col types to integer indexes of the fields, and a boolean, true if the file has a header.

Exceptions:
  • ValueError – If there is no header and the configuration does not consist entirely of integer indexes.

Source code in beancount/ingest/importers/csv.py
def normalize_config(config, head, dialect='excel', skip_lines: int = 0):
    """Using the header line, convert the configuration field name lookups to int indexes.

    Args:
      config: A dict of Col types to string or indexes.
      head: A string, some decent number of bytes of the head of the file.
      dialect: A dialect definition to parse the header
      skip_lines: Skip first x (garbage) lines of file.
    Returns:
      A pair of
        A dict of Col types to integer indexes of the fields, and
        a boolean, true if the file has a header.
    Raises:
      ValueError: If there is no header and the configuration does not consist
        entirely of integer indexes.
    """
    # Skip garbage lines before sniffing the header
    assert isinstance(skip_lines, int)
    assert skip_lines >= 0
    for _ in range(skip_lines):
        head = head[head.find('\n')+1:]

    has_header = csv.Sniffer().has_header(head)
    if has_header:
        header = next(csv.reader(io.StringIO(head), dialect=dialect))
        field_map = {field_name.strip(): index
                     for index, field_name in enumerate(header)}
        index_config = {}
        for field_type, field in config.items():
            if isinstance(field, str):
                field = field_map[field]
            index_config[field_type] = field
    else:
        if any(not isinstance(field, int)
               for field_type, field in config.items()):
            raise ValueError("CSV config without header has non-index fields: "
                             "{}".format(config))
        index_config = config
    return index_config, has_header
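The name-to-index resolution can be sketched on its own. This illustration assumes a header row is present (it skips the `csv.Sniffer` detection step) and uses plain strings as config keys where the real code uses `Col` members; `names_to_indexes` is a hypothetical helper.

```python
import csv
import io


def names_to_indexes(config, head, dialect="excel"):
    """Resolve string column names in config to integer indexes using the header row."""
    header = next(csv.reader(io.StringIO(head), dialect=dialect))
    # Header names are stripped, so stray spaces around a name do not matter.
    field_map = {name.strip(): index for index, name in enumerate(header)}
    return {key: field_map[value] if isinstance(value, str) else value
            for key, value in config.items()}


head = "Date, Description ,Amount\n2020-01-02,Coffee,-3.50\n"
index_config = names_to_indexes(
    {"date": "Date", "narration": "Description", "amount": 2}, head)
```

Integer entries pass through untouched, which is why a headerless file can still be configured as long as every field is given by index.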

beancount.ingest.importers.fileonly

A simplistic importer that can be used just to file away some download.

Sometimes you just want to save and accumulate data.

beancount.ingest.importers.fileonly.Importer (FilingMixin, IdentifyMixin)

An importer that supports only matching (identification) and filing.

beancount.ingest.importers.mixins special

beancount.ingest.importers.mixins.config

Base class that implements configuration and a filing account.

beancount.ingest.importers.mixins.config.ConfigMixin (ImporterProtocol)

beancount.ingest.importers.mixins.config.ConfigMixin.__init__(self, **kwds) special

Pull 'config' from kwds.

Source code in beancount/ingest/importers/mixins/config.py
def __init__(self, **kwds):
    """Pull 'config' from kwds."""

    config = kwds.pop('config', None)
    schema = self.REQUIRED_CONFIG
    if config or schema:
        assert config is not None
        assert schema is not None
        self.config = validate_config(config, schema, self)
    else:
        self.config = None

    super().__init__(**kwds)

beancount.ingest.importers.mixins.config.validate_config(config, schema, importer)

Check the configuration account provided by the user against the accounts required by the source importer.

Parameters:
  • config – A config dict of actual values on an importer.

  • schema – A dict of declarations of required values.

Exceptions:
  • ValueError – If the configuration is invalid.

Returns:
  • A validated configuration dict.

Source code in beancount/ingest/importers/mixins/config.py
def validate_config(config, schema, importer):
    """Check the configuration account provided by the user against the accounts
    required by the source importer.

    Args:
      config: A config dict of actual values on an importer.
      schema: A dict of declarations of required values.
    Raises:
      ValueError: If the configuration is invalid.
    Returns:
      A validated configuration dict.
    """
    provided_options = set(config)
    required_options = set(schema)

    for option in (required_options - provided_options):
        raise ValueError("Missing value from user configuration for importer {}: {}".format(
            importer.__class__.__name__, option))

    for option in (provided_options - required_options):
        raise ValueError("Unknown value in user configuration for importer {}: {}".format(
            importer.__class__.__name__, option))

    # FIXME: Validate types as well, including account type as a default.

    # FIXME: Here we could validate account names by looking them up from the
    # existing ledger.

    return config
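The check amounts to two set differences between the provided keys and the declared ones. The sketch below is a hypothetical variant, `check_options`, that reports all missing or unknown keys at once in sorted order, whereas the function above raises on the first one it encounters.

```python
def check_options(config, schema):
    """Raise ValueError unless config has exactly the keys declared in schema."""
    provided, required = set(config), set(schema)
    missing = sorted(required - provided)
    unknown = sorted(provided - required)
    if missing:
        raise ValueError("Missing configuration values: {}".format(missing))
    if unknown:
        raise ValueError("Unknown configuration values: {}".format(unknown))
    return config


schema = {"cash": "Cash account", "fees": "Fees account"}
valid = check_options({"cash": "Assets:Cash", "fees": "Expenses:Fees"}, schema)

try:
    check_options({"cash": "Assets:Cash"}, schema)
except ValueError as exc:
    error = str(exc)
```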

beancount.ingest.importers.mixins.filing

Base class that implements filing account.

It also sports an optional prefix to prepend to the renamed filename. Typically you can put the name of the institution there, so you get a renamed filename like this:

YYYY-MM-DD.institution.Original_File_Name.pdf

beancount.ingest.importers.mixins.filing.FilingMixin (ImporterProtocol)

beancount.ingest.importers.mixins.filing.FilingMixin.__init__(self, **kwds) special

Pull 'filing' and 'prefix' from kwds.

Parameters:
  • filing – The name of the account to file to.

  • prefix – The name of the institution prefix to insert.

Source code in beancount/ingest/importers/mixins/filing.py
def __init__(self, **kwds):
    """Pull 'filing' and 'prefix' from kwds.

    Args:
      filing: The name of the account to file to.
      prefix: The name of the institution prefix to insert.
    """

    self.filing_account = kwds.pop('filing', None)
    assert account.is_valid(self.filing_account)

    self.prefix = kwds.pop('prefix', None)

    super().__init__(**kwds)
beancount.ingest.importers.mixins.filing.FilingMixin.file_account(self, file)

Return an account associated with the given file.

Note: If you don't implement this method you won't be able to move the files into their preservation hierarchy; the bean-file command won't work.

Also, normally the returned account is not a function of the input file--just of the importer--but it is provided anyhow.

Parameters:
  • file – A cache.FileMemo instance.

Returns:
  • The name of the account that corresponds to this importer.

Source code in beancount/ingest/importers/mixins/filing.py
def file_account(self, file):
    return self.filing_account
beancount.ingest.importers.mixins.filing.FilingMixin.file_name(self, file)

Return the optional renamed account filename.

Source code in beancount/ingest/importers/mixins/filing.py
def file_name(self, file):
    """Return the optional renamed account filename."""
    supername = super().file_name(file)
    if not self.prefix:
        return supername
    else:
        return '.'.join(filter(None, [self.prefix,
                                      supername or path.basename(file.name)]))
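
The joining rule in file_name can be sketched on its own, outside the class. `prefixed_name` is a hypothetical free function that mirrors the logic above, and the paths and prefix are made-up sample values.

```python
from os import path


def prefixed_name(prefix, supername, original_path):
    """Mirror of the rule above: prefix + (renamed name, or the original basename)."""
    if not prefix:
        # Without a prefix, whatever the parent class returned is passed through.
        return supername
    return '.'.join(filter(None, [prefix,
                                  supername or path.basename(original_path)]))


# The parent class supplied a renamed filename.
with_rename = prefixed_name('mybank', 'statement.pdf', '/tmp/dl/export-123.pdf')

# The parent class returned None, so the original basename is used.
without_rename = prefixed_name('mybank', None, '/tmp/dl/export-123.pdf')

# No prefix configured: the parent's (absent) result is returned as-is.
no_prefix = prefixed_name(None, None, '/tmp/dl/export-123.pdf')
```

The date part of the final filename (the YYYY-MM-DD in the module description) is not produced here; this sketch covers only the prefix and basename.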
beancount.ingest.importers.mixins.filing.FilingMixin.name(self)

Include the filing account in the name.

Source code in beancount/ingest/importers/mixins/filing.py
def name(self):
    """Include the filing account in the name."""
    return '{}: "{}"'.format(super().name(), self.filing_account)

beancount.ingest.importers.mixins.identifier

Base class that implements identification using regular expressions.

beancount.ingest.importers.mixins.identifier.IdentifyMixin (ImporterProtocol)
beancount.ingest.importers.mixins.identifier.IdentifyMixin.__init__(self, **kwds) special

Pull 'matchers' and 'converter' from kwds.

Source code in beancount/ingest/importers/mixins/identifier.py
def __init__(self, **kwds):
    """Pull 'matchers' and 'converter' from kwds."""

    self.remap = collections.defaultdict(list)
    matchers = kwds.pop('matchers', [])
    cls_matchers = getattr(self, 'matchers', [])
    assert isinstance(matchers, list)
    assert isinstance(cls_matchers, list)
    for part, regexp in itertools.chain(matchers, cls_matchers):
        assert part in _PARTS, repr(part)
        assert isinstance(regexp, str), repr(regexp)
        self.remap[part].append(re.compile(regexp))

    # Converter is a fn(filename: Text) -> contents: Text.
    self.converter = kwds.pop('converter',
                              getattr(self, 'converter', None))

    super().__init__(**kwds)
beancount.ingest.importers.mixins.identifier.IdentifyMixin.identify(self, file)

Return true if this importer matches the given file.

Parameters:
  • file – A cache.FileMemo instance.

Returns:
  • A boolean, true if this importer can handle this file.

Source code in beancount/ingest/importers/mixins/identifier.py
def identify(self, file):
    return identify(self.remap, self.converter, file)
beancount.ingest.importers.mixins.identifier.identify(remap, converter, file)

Identify the contents of a file.

Parameters:
  • remap – A dict of 'part' to list-of-compiled-regexp objects, where each item is a specification to match against its part. The 'part' can be one of 'mime', 'filename' or 'content'.

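  • converter – A function mapping a filename to its text contents, used to match the 'content' part. If None, cache.contents is used.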

Returns:
  • A boolean, true if the file is not rejected by the constraints.

Source code in beancount/ingest/importers/mixins/identifier.py
def identify(remap, converter, file):
    """Identify the contents of a file.

    Args:
      remap: A dict of 'part' to list-of-compiled-regexp objects, where each item is
        a specification to match against its part. The 'part' can be one of 'mime',
        'filename' or 'content'.
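      converter: A function mapping a filename to its text contents, used to
        match the 'content' part. If None, cache.contents is used.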
    Returns:
      A boolean, true if the file is not rejected by the constraints.
    """
    if remap.get('mime', None):
        mimetype = file.convert(cache.mimetype)
        if not all(regexp.search(mimetype)
                   for regexp in remap['mime']):
            return False

    if remap.get('filename', None):
        if not all(regexp.search(file.name)
                   for regexp in remap['filename']):
            return False

    if remap.get('content', None):
        # If this is a text file, read the whole thing in memory.
        text = file.convert(converter or cache.contents)
        if not all(regexp.search(text)
                   for regexp in remap['content']):
            return False

    return True
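
The matching rule (every regexp registered for a part must be found in that part) can be sketched without Beancount. `FakeFile` is a hypothetical stand-in for cache.FileMemo that holds pre-computed values instead of converting lazily, and `matches` is a simplified re-statement of the function above rather than the library code.

```python
import collections
import re


class FakeFile:
    """Hypothetical stand-in for cache.FileMemo with canned conversion results."""

    def __init__(self, name, mimetype, text):
        self.name = name
        self.mimetype = mimetype
        self.text = text


def matches(remap, file):
    """Return True only if every regexp is found in its corresponding part."""
    parts = {'mime': file.mimetype, 'filename': file.name, 'content': file.text}
    return all(regexp.search(parts[part])
               for part, regexps in remap.items()
               for regexp in regexps)


# Build the part -> compiled-regexp mapping the same way __init__ does.
remap = collections.defaultdict(list)
for part, regexp in [('mime', 'text/csv'),
                     ('filename', r'statement.*\.csv$'),
                     ('content', 'Date,Description,Amount')]:
    remap[part].append(re.compile(regexp))

good = FakeFile('/dl/statement-2020.csv', 'text/csv', 'Date,Description,Amount\n')
bad = FakeFile('/dl/statement-2020.csv', 'text/csv', 'Symbol,Price\n')

good_match = matches(remap, good)  # all three constraints hold
bad_match = matches(remap, bad)    # the content regexp is not found
```

Because regexp.search is used, patterns match anywhere in the part unless they are anchored explicitly, as the filename pattern is here with `$`.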

beancount.ingest.importers.ofx

OFX file format importer for bank and credit card statements.

https://en.wikipedia.org/wiki/Open_Financial_Exchange

This importer will parse a single account in the OFX file. If the file contains many accounts, instantiate the importer multiple times, once per account. It makes more sense to do it this way so that you can define your importer configuration account by account.

Note that this importer is provided as an example and with no guarantees. It's not really super great. On the other hand, I've been using it for more than five years over multiple accounts, so it has been useful to me (it works, by some measure of "works"). If you need a more powerful or compliant OFX importer please consider either writing one or contributing changes. Also, this importer does its own very basic parsing; a better one would probably use (and depend on) the ofxparse module (see https://sites.google.com/site/ofxparse/).

beancount.ingest.importers.ofx.BalanceType (Enum)

Type of Balance directive to be inserted.

beancount.ingest.importers.ofx.Importer (ImporterProtocol)

An importer for Open Financial Exchange files.

beancount.ingest.importers.ofx.Importer.__init__(self, acctid_regexp, account, basename=None, balance_type=<BalanceType.DECLARED: 1>) special

Create a new importer posting to the given account.

Parameters:
  • account – An account string, the account onto which to post all the amounts parsed.

  • acctid_regexp – A regexp, to match against the <ACCTID> tag of the OFX file.

  • basename – An optional string, the name of the new files.

  • balance_type – An enum of type BalanceType.

Source code in beancount/ingest/importers/ofx.py
def __init__(self, acctid_regexp, account, basename=None,
             balance_type=BalanceType.DECLARED):
    """Create a new importer posting to the given account.

    Args:
      account: An account string, the account onto which to post all the
        amounts parsed.
      acctid_regexp: A regexp, to match against the <ACCTID> tag of the OFX file.
      basename: An optional string, the name of the new files.
      balance_type: An enum of type BalanceType.
    """
    self.acctid_regexp = acctid_regexp
    self.account = account
    self.basename = basename
    self.balance_type = balance_type
beancount.ingest.importers.ofx.Importer.extract(self, file, existing_entries=None)

Extract a list of partially complete transactions from the file.

Source code in beancount/ingest/importers/ofx.py
def extract(self, file, existing_entries=None):
    """Extract a list of partially complete transactions from the file."""
    soup = bs4.BeautifulSoup(file.contents(), 'lxml')
    return extract(soup, file.name, self.acctid_regexp, self.account, self.FLAG,
                   self.balance_type)
beancount.ingest.importers.ofx.Importer.file_account(self, _)

Return the account against which we post transactions.

Source code in beancount/ingest/importers/ofx.py
def file_account(self, _):
    """Return the account against which we post transactions."""
    return self.account
beancount.ingest.importers.ofx.Importer.file_date(self, file)

Return the date of the file, taken from the latest <LEDGERBAL> date found in it.

Source code in beancount/ingest/importers/ofx.py
def file_date(self, file):
    """Return the date of the file, the latest <LEDGERBAL> date found in it."""
    return find_max_date(file.contents())
beancount.ingest.importers.ofx.Importer.file_name(self, file)

Return the optional renamed account filename.

Source code in beancount/ingest/importers/ofx.py
def file_name(self, file):
    """Return the optional renamed account filename."""
    if self.basename:
        return self.basename + path.splitext(file.name)[1]
beancount.ingest.importers.ofx.Importer.identify(self, file)

Return true if this importer matches the given file.

Parameters:
  • file – A cache.FileMemo instance.

Returns:
  • A boolean, true if this importer can handle this file.

Source code in beancount/ingest/importers/ofx.py
def identify(self, file):
    # Match for a compatible MIME type.
    if file.mimetype() not in {'application/x-ofx',
                               'application/vnd.intu.qbo',
                               'application/vnd.intu.qfx'}:
        return False

    # Match the account id.
    return any(re.match(self.acctid_regexp, acctid)
               for acctid in find_acctids(file.contents()))
beancount.ingest.importers.ofx.Importer.name(self)

Include the filing account in the name.

Source code in beancount/ingest/importers/ofx.py
def name(self):
    """Include the filing account in the name."""
    return '{}: "{}"'.format(super().name(), self.file_account(None))

beancount.ingest.importers.ofx.build_transaction(stmttrn, flag, account, currency)

Build a single transaction.

Parameters:
  • stmttrn – A <STMTTRN> bs4.element.Tag.

  • flag – A single-character string.

  • account – An account string, the account to insert.

  • currency – A currency string.

Returns:
  • A Transaction instance.

Source code in beancount/ingest/importers/ofx.py
def build_transaction(stmttrn, flag, account, currency):
    """Build a single transaction.

    Args:
      stmttrn: A <STMTTRN> bs4.element.Tag.
      flag: A single-character string.
      account: An account string, the account to insert.
      currency: A currency string.
    Returns:
      A Transaction instance.
    """
    # Find the date.
    date = parse_ofx_time(find_child(stmttrn, 'dtposted')).date()

    # There's no distinct payee.
    payee = None

    # Construct a description that represents all the text content in the node.
    name = find_child(stmttrn, 'name', saxutils.unescape)
    memo = find_child(stmttrn, 'memo', saxutils.unescape)

    # Remove memos duplicated from the name.
    if memo == name:
        memo = None

    # Add the transaction type to the description, unless it's not useful.
    trntype = find_child(stmttrn, 'trntype', saxutils.unescape)
    if trntype in ('DEBIT', 'CREDIT'):
        trntype = None

    narration = ' / '.join(filter(None, [name, memo, trntype]))

    # Create a single posting for it; the user will have to manually categorize
    # the other side.
    number = find_child(stmttrn, 'trnamt', D)
    units = amount.Amount(number, currency)
    posting = data.Posting(account, units, None, None, None, None)

    # Build the transaction with a single leg.
    fileloc = data.new_metadata('<build_transaction>', 0)
    return data.Transaction(fileloc, date, flag, payee, narration,
                            data.EMPTY_SET, data.EMPTY_SET, [posting])
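
The narration rule in build_transaction can be isolated as a small function. `make_narration` is a hypothetical name used only for this sketch, and the sample values are invented.

```python
def make_narration(name, memo, trntype):
    """Join the non-empty text fields of a <STMTTRN>, as in the function above."""
    # Drop a memo that merely repeats the name.
    if memo == name:
        memo = None
    # Plain DEBIT/CREDIT types carry no useful information.
    if trntype in ('DEBIT', 'CREDIT'):
        trntype = None
    return ' / '.join(filter(None, [name, memo, trntype]))


duplicate_memo = make_narration('COFFEE SHOP', 'COFFEE SHOP', 'DEBIT')
all_fields = make_narration('ACME PAYROLL', 'SALARY', 'DIRECTDEP')
nothing_useful = make_narration(None, None, 'CREDIT')
```

When no field survives the filtering, the narration is an empty string rather than None.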

beancount.ingest.importers.ofx.extract(soup, filename, acctid_regexp, account, flag, balance_type)

Extract transactions from an OFX file.

Parameters:
  • soup – A BeautifulSoup root node.

  • filename – A string, the name of the file, recorded in the metadata of the generated entries.

  • acctid_regexp – A regular expression string matching the account we're interested in.

  • account – An account string onto which to post the amounts found in the file.

  • flag – A single-character string.

  • balance_type – An enum of type BalanceType.

Returns:
  • A sorted list of entries.

Source code in beancount/ingest/importers/ofx.py
def extract(soup, filename, acctid_regexp, account, flag, balance_type):
    """Extract transactions from an OFX file.

    Args:
      soup: A BeautifulSoup root node.
      acctid_regexp: A regular expression string matching the account we're interested in.
      account: An account string onto which to post the amounts found in the file.
      flag: A single-character string.
      balance_type: An enum of type BalanceType.
    Returns:
      A sorted list of entries.
    """
    new_entries = []
    counter = itertools.count()
    for acctid, currency, transactions, balance in find_statement_transactions(soup):
        if not re.match(acctid_regexp, acctid):
            continue

        # Create Transaction directives.
        stmt_entries = []
        for stmttrn in transactions:
            entry = build_transaction(stmttrn, flag, account, currency)
            entry = entry._replace(meta=data.new_metadata(filename, next(counter)))
            stmt_entries.append(entry)
        stmt_entries = data.sorted(stmt_entries)
        new_entries.extend(stmt_entries)

        # Create a Balance directive.
        if balance and balance_type is not BalanceType.NONE:
            date, number = balance
            if balance_type is BalanceType.LAST and stmt_entries:
                date = stmt_entries[-1].date

            # The Balance assertion occurs at the beginning of the date, so move
            # it to the following day.
            date += datetime.timedelta(days=1)

            meta = data.new_metadata(filename, next(counter))
            balance_entry = data.Balance(meta, date, account,
                                         amount.Amount(number, currency),
                                         None, None)
            new_entries.append(balance_entry)

    return data.sorted(new_entries)
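
The date arithmetic for the Balance directive can be sketched with the standard library alone. `balance_date` and its boolean `use_last` are hypothetical; `use_last` stands in for the `balance_type is BalanceType.LAST` test in the code above.

```python
import datetime


def balance_date(ledgerbal_date, last_txn_date, use_last):
    """Pick the date of the Balance directive, as in the extraction code above."""
    date = ledgerbal_date
    # With the LAST policy, use the date of the last transaction if there is one.
    if use_last and last_txn_date is not None:
        date = last_txn_date
    # A Balance assertion applies at the beginning of its date, so shift it
    # to the following day to include that day's transactions.
    return date + datetime.timedelta(days=1)


declared = balance_date(datetime.date(2014, 1, 31), datetime.date(2014, 1, 28),
                        use_last=False)
last = balance_date(datetime.date(2014, 1, 31), datetime.date(2014, 1, 28),
                    use_last=True)
```

In both cases the one-day shift is applied, so a balance declared as of January 31 is asserted on February 1.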

beancount.ingest.importers.ofx.find_acctids(contents)

Find the list of <ACCTID> tags.

Parameters:
  • contents – A string, the contents of the OFX file.

Returns:
  • A list of strings, the contents of the <ACCTID> tags.

Source code in beancount/ingest/importers/ofx.py
def find_acctids(contents):
    """Find the list of <ACCTID> tags.

    Args:
      contents: A string, the contents of the OFX file.
    Returns:
      A list of strings, the contents of the <ACCTID> tags.
    """
    # Match the account id. Don't bother parsing the entire thing as XML, just
    # match the tag for this purpose. This'll work fine enough.
    for match in re.finditer('<ACCTID>([^<]*)', contents):
        yield match.group(1)
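
A small sketch of the tag matching on an invented OFX fragment follows. The account numbers and the `'3797'` pattern are sample values, not taken from a real file.

```python
import re

# Invented OFX fragment with one bank account and one credit card account.
SAMPLE = (
    "<OFX><BANKMSGSRSV1><STMTTRNRS><STMTRS>"
    "<CURDEF>USD"
    "<BANKACCTFROM><BANKID>123<ACCTID>000111222<ACCTTYPE>CHECKING</BANKACCTFROM>"
    "</STMTRS></STMTTRNRS></BANKMSGSRSV1>"
    "<CREDITCARDMSGSRSV1><CCSTMTTRNRS><CCSTMTRS>"
    "<CCACCTFROM><ACCTID>379700001111222</CCACCTFROM>"
    "</CCSTMTRS></CCSTMTTRNRS></CREDITCARDMSGSRSV1></OFX>"
)

# Same regular expression as above: capture everything up to the next tag.
acctids = [match.group(1) for match in re.finditer('<ACCTID>([^<]*)', SAMPLE)]

# Importer.identify() then tests each id with re.match, which anchors at the
# start of the string, so the pattern acts as a prefix match.
matched = any(re.match('3797', acctid) for acctid in acctids)
```

Since the capture runs up to the next `<`, a value followed by a line break before the next tag would include that line break in the captured string.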

beancount.ingest.importers.ofx.find_child(node, name, conversion=None)

Find a child under the given node and return its value.

Parameters:
  • node – A <STMTTRN> bs4.element.Tag.

  • name – A string, the name of the child node.

  • conversion – A callable object used to convert the value to a new data type.

Returns:
  • The stripped text of the child, passed through conversion if one is given, or None if the child is not found.

Source code in beancount/ingest/importers/ofx.py
def find_child(node, name, conversion=None):
    """Find a child under the given node and return its value.

    Args:
      node: A <STMTTRN> bs4.element.Tag.
      name: A string, the name of the child node.
      conversion: A callable object used to convert the value to a new data type.
    Returns:
      A string, or None.
    """
    child = node.find(name)
    if not child:
        return None
    value = child.contents[0].strip()
    if conversion:
        value = conversion(value)
    return value

beancount.ingest.importers.ofx.find_currency(soup)

Find the first currency in the XML tree.

Parameters:
  • soup – A BeautifulSoup root node.

Returns:
  • A string, the first currency found in the file. Returns None if no currency is found.

Source code in beancount/ingest/importers/ofx.py
def find_currency(soup):
    """Find the first currency in the XML tree.

    Args:
      soup: A BeautifulSoup root node.
    Returns:
      A string, the first currency found in the file. Returns None if no currency
      is found.
    """
    for stmtrs in soup.find_all(re.compile('.*stmtrs$')):
        for currency_node in stmtrs.find_all('curdef'):
            currency = currency_node.contents[0]
            if currency is not None:
                return currency

beancount.ingest.importers.ofx.find_max_date(contents)

Extract the report date from the file.

Source code in beancount/ingest/importers/ofx.py
def find_max_date(contents):
    """Extract the report date from the file."""
    soup = bs4.BeautifulSoup(contents, 'lxml')
    dates = []
    for ledgerbal in soup.find_all('ledgerbal'):
        dtasof = ledgerbal.find('dtasof')
        dates.append(parse_ofx_time(dtasof.contents[0]).date())
    if dates:
        return max(dates)

beancount.ingest.importers.ofx.find_statement_transactions(soup)

Find the statement transaction sections in the file.

Parameters:
  • soup – A BeautifulSoup root node.

Yields: A tuple of an account id string, a currency string, a list of transaction nodes (<STMTTRN> BeautifulSoup tags), and a (date, balance amount) pair for the <LEDGERBAL>, or None if there is no <LEDGERBAL>.

Source code in beancount/ingest/importers/ofx.py
def find_statement_transactions(soup):
    """Find the statement transaction sections in the file.

    Args:
      soup: A BeautifulSoup root node.
    Yields:
      A tuple of
        An account id string,
        A currency string,
        A list of transaction nodes (<STMTTRN> BeautifulSoup tags), and
        A (date, balance amount) for the <LEDGERBAL>.
    """
    # Process STMTTRNRS and CCSTMTTRNRS tags.
    for stmtrs in soup.find_all(re.compile('.*stmtrs$')):
        # For each CURDEF tag.
        for currency_node in stmtrs.find_all('curdef'):
            currency = currency_node.contents[0].strip()

            # Extract ACCTID account information.
            acctid_node = stmtrs.find('acctid')
            if acctid_node:
                acctid = next(acctid_node.children).strip()
            else:
                acctid = ''

            # Get the LEDGERBAL node. There appears to be a single one for all
            # transaction lists.
            ledgerbal = stmtrs.find('ledgerbal')
            balance = None
            if ledgerbal:
                dtasof = find_child(ledgerbal, 'dtasof', parse_ofx_time).date()
                balamt = find_child(ledgerbal, 'balamt', D)
                balance = (dtasof, balamt)

            # Process transaction lists (regular or credit-card).
            for tranlist in stmtrs.find_all(re.compile('(|bank|cc)tranlist')):
                yield acctid, currency, tranlist.find_all('stmttrn'), balance

beancount.ingest.importers.ofx.parse_ofx_time(date_str)

Parse an OFX time string and return a datetime object.

Parameters:
  • date_str – A string, the date to be parsed.

Returns:
  • A datetime.datetime instance.

Source code in beancount/ingest/importers/ofx.py
def parse_ofx_time(date_str):
    """Parse an OFX time string and return a datetime object.

    Args:
      date_str: A string, the date to be parsed.
    Returns:
      A datetime.datetime instance.
    """
    if len(date_str) < 14:
        return datetime.datetime.strptime(date_str[:8], '%Y%m%d')
    else:
        return datetime.datetime.strptime(date_str[:14], '%Y%m%d%H%M%S')
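
The two branches can be exercised with the standard library alone. The function is re-declared here so the sketch is self-contained, and the input strings are invented samples of the date-only and date-time forms.

```python
import datetime


def parse_ofx_time(date_str):
    """Local copy of the function above."""
    if len(date_str) < 14:
        return datetime.datetime.strptime(date_str[:8], '%Y%m%d')
    else:
        return datetime.datetime.strptime(date_str[:14], '%Y%m%d%H%M%S')


# Date only: the time defaults to midnight.
date_only = parse_ofx_time('20140131')

# Date and time with a fractional-second and timezone suffix: only the first
# 14 characters are parsed, so the suffix is ignored.
with_time = parse_ofx_time('20140131083025.000[-5:EST]')
```

The result is a naive datetime; any timezone offset in the input is discarded rather than applied.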

beancount.ingest.regression

Support for implementing regression tests on sample files using nose.

NOTE: This itself is not a regression test. It's a library used to create regression tests for your importers. Use it like this in your own importer code:

def test():
    importer = Importer([], {
        'FILE': 'Assets:US:MyBank:Main',
    })
    yield from regression.compare_sample_files(importer, __file__)

WARNING: This is deprecated. Nose itself has been deprecated for a while and Beancount is now using only pytest. Ignore this and use beancount.ingest.regression_pytest instead.

beancount.ingest.regression.ImportFileTestCase (TestCase)

Base class for importer tests that compare output to an expected output text.

beancount.ingest.regression.ImportFileTestCase.test_expect_extract(self, filename, msg)

Extract entries from a test file and compare against expected output.

If an expected file (as <filename>.extract) is not present, it is generated from the current output for review and the test is skipped. To regenerate an expected file, remove it before running the tests.

Parameters:
  • filename – A string, the name of the file to import using self.importer.

Exceptions:
  • AssertionError – If the contents differ from the expected file.

Source code in beancount/ingest/regression.py
@test_utils.skipIfRaises(ToolNotInstalled)
def test_expect_extract(self, filename, msg):
    """Extract entries from a test file and compare against expected output.

    If an expected file (as <filename>.extract) is not present, we issue a
    warning. Missing expected files can be written out by removing them
    before running the tests.

    Args:
      filename: A string, the name of the file to import using self.importer.
    Raises:
      AssertionError: If the contents differ from the expected file.

    """
    # Import the file.
    entries = extract.extract_from_file(filename, self.importer, None, None)

    # Render the entries to a string.
    oss = io.StringIO()
    printer.print_entries(entries, file=oss)
    string = oss.getvalue()

    expect_filename = '{}.extract'.format(filename)
    if path.exists(expect_filename):
        expect_string = open(expect_filename, encoding='utf-8').read()
        self.assertEqual(expect_string.strip(), string.strip())
    else:
        # Write out the expected file for review.
        open(expect_filename, 'w', encoding='utf-8').write(string)
        self.skipTest("Expected file not present; generating '{}'".format(
            expect_filename))

beancount.ingest.regression.ImportFileTestCase.test_expect_file_date(self, filename, msg)

Compute the imported file date and compare to an expected output.

If an expected file (as <filename>.file_date) is not present or is empty, it is generated from the current output for review and the test is skipped. To regenerate an expected file, remove it before running the tests.

Parameters:
  • filename – A string, the name of the file to import using self.importer.

Exceptions:
  • AssertionError – If the contents differ from the expected file.

Source code in beancount/ingest/regression.py
@test_utils.skipIfRaises(ToolNotInstalled)
def test_expect_file_date(self, filename, msg):
    """Compute the imported file date and compare to an expected output.

    If an expected file (as <filename>.file_date) is not present, we issue a
    warning. Missing expected files can be written out by removing them
    before running the tests.

    Args:
      filename: A string, the name of the file to import using self.importer.
    Raises:
      AssertionError: If the contents differ from the expected file.
    """
    # Import the date.
    file = cache.get_file(filename)
    date = self.importer.file_date(file)
    if date is None:
        self.fail("No date produced from {}".format(file.name))

    expect_filename = '{}.file_date'.format(file.name)
    if path.exists(expect_filename) and path.getsize(expect_filename) > 0:
        expect_date_str = open(expect_filename, encoding='utf-8').read().strip()
        expect_date = datetime.datetime.strptime(expect_date_str, '%Y-%m-%d').date()
        self.assertEqual(expect_date, date)
    else:
        # Write out the expected file for review.
        with open(expect_filename, 'w', encoding='utf-8') as outfile:
            print(date.strftime('%Y-%m-%d'), file=outfile)
        self.skipTest("Expected file not present; generating '{}'".format(
            expect_filename))

beancount.ingest.regression.ImportFileTestCase.test_expect_file_name(self, filename, msg)

Compute the imported file name and compare to an expected output.

If an expected file (as <filename>.file_name) is not present or is empty, it is generated from the current output for review and the test is skipped. To regenerate an expected file, remove it before running the tests.

Parameters:
  • filename – A string, the name of the file to import using self.importer.

Exceptions:
  • AssertionError – If the contents differ from the expected file.

Source code in beancount/ingest/regression.py
@test_utils.skipIfRaises(ToolNotInstalled)
def test_expect_file_name(self, filename, msg):
    """Compute the imported file name and compare to an expected output.

    If an expected file (as <filename>.file_name) is not present, we issue a
    warning. Missing expected files can be written out by removing them
    before running the tests.

    Args:
      filename: A string, the name of the file to import using self.importer.
    Raises:
      AssertionError: If the contents differ from the expected file.
    """
    # Compute the file name.
    file = cache.get_file(filename)
    generated_basename = self.importer.file_name(file)
    if generated_basename is None:
        self.fail("No filename produced from {}".format(filename))

    # Check that we're getting a non-null relative simple filename.
    self.assertFalse(path.isabs(generated_basename), generated_basename)
    self.assertNotRegex(generated_basename, os.sep)

    expect_filename = '{}.file_name'.format(file.name)
    if path.exists(expect_filename) and path.getsize(expect_filename) > 0:
        expect_filename = open(expect_filename, encoding='utf-8').read().strip()
        self.assertEqual(expect_filename, generated_basename)
    else:
        # Write out the expected file for review.
        with open(expect_filename, 'w', encoding='utf-8') as file:
            print(generated_basename, file=file)
        self.skipTest("Expected file not present; generating '{}'".format(
            expect_filename))

beancount.ingest.regression.ImportFileTestCase.test_expect_identify(self, filename, msg)

Attempt to identify a file and expect results to be true.

Parameters:
  • filename – A string, the name of the file to import using self.importer.

Exceptions:
  • AssertionError – If the importer does not identify the file.

Source code in beancount/ingest/regression.py
@test_utils.skipIfRaises(ToolNotInstalled)
def test_expect_identify(self, filename, msg):
    """Attempt to identify a file and expect results to be true.

    Args:
      filename: A string, the name of the file to import using self.importer.
    Raises:
      AssertionError: If the contents differ from the expected file.
    """
    file = cache.get_file(filename)
    matched = self.importer.identify(file)
    self.assertTrue(matched)

beancount.ingest.regression.ToolNotInstalled (OSError)

An error to be used by converters when necessary software isn't there.

Raising this exception from your converter code when the tool is not installed will make the tests defined in this file skipped instead of failing. This will happen when you test your converters on different computers and/or platforms.

beancount.ingest.regression.compare_sample_files(importer, directory=None, ignore_cls=None)

Compare the sample files under a directory.

Parameters:
  • importer – An instance of an Importer.

  • directory – A string, the directory to scour for sample files or a filename in that directory. If a directory is not provided, the directory of the file from which the importer class is defined is used.

  • ignore_cls – An optional base class of the importer whose methods should not trigger the addition of a test. For example, if you are deriving from a base class which is already well-tested, you may not want to have a regression test case generated for those methods. This was used to ignore methods provided from a common backwards compatibility support class.

Yields: Generated tests as per nose's requirements (a callable and arguments for it).

Source code in beancount/ingest/regression.py
@deprecated("Use beancount.ingest.regression_pytest instead")
def compare_sample_files(importer, directory=None, ignore_cls=None):
    """Compare the sample files under a directory.

    Args:
      importer: An instance of an Importer.
      directory: A string, the directory to scour for sample files or a filename
          in that directory. If a directory is not provided, the directory of
          the file from which the importer class is defined is used.
      ignore_cls: An optional base class of the importer whose methods should
        not trigger the addition of a test. For example, if you are deriving
        from a base class which is already well-tested, you may not want to have
        a regression test case generated for those methods. This was used to
        ignore methods provided from a common backwards compatibility support
        class.
    Yields:
      Generated tests as per nose's requirements (a callable and arguments for
      it).
    """
    # If the directory is not specified, use the directory where the importer
    # class was defined.
    if not directory:
        directory = sys.modules[type(importer).__module__].__file__
    if path.isfile(directory):
        directory = path.dirname(directory)

    for filename in find_input_files(directory):
        # For each of the methods to be tested, check if there is an actual
        # implementation and if so, run a comparison with an expected file.
        for name in ['identify',
                     'extract',
                     'file_date',
                     'file_name']:
            # Check if the method has been overridden from the protocol
            # interface. If so, even if it's provided by concretely inherited
            # method, we want to require a test against that method.
            func = getattr(importer, name).__func__
            if (func is not getattr(ImporterProtocol, name) and
                (ignore_cls is None or (func is not getattr(ignore_cls, name, None)))):
                method = getattr(ImportFileTestCase(importer),
                                 'test_expect_{}'.format(name))
                yield (method, filename, name)
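
The override check relies on comparing the function behind a bound method with the function defined on the protocol class. Here is a sketch with hypothetical stand-in classes: `Protocol` plays the role of ImporterProtocol and `MyImporter` is an invented importer.

```python
class Protocol:
    """Hypothetical stand-in for ImporterProtocol with default methods."""

    def identify(self, file): ...
    def extract(self, file): ...
    def file_date(self, file): ...
    def file_name(self, file): ...


class MyImporter(Protocol):
    """Invented importer that overrides only two of the four methods."""

    def identify(self, file):
        return True

    def extract(self, file):
        return []


importer = MyImporter()

# A bound method's __func__ is the underlying function object; if it is the
# very same object as the protocol's function, the method was not overridden.
overridden = [name
              for name in ['identify', 'extract', 'file_date', 'file_name']
              if getattr(importer, name).__func__ is not getattr(Protocol, name)]
```

Only the overridden methods would get a generated test, which is why an importer that inherits file_date and file_name unchanged does not need expected files for them.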

beancount.ingest.regression.find_input_files(directory)

Find the input files in the module where the class is defined.

Parameters:
  • directory – A string, the path to a root directory to check for.

Yields: Strings, the absolute filenames of sample input and expected files.

Source code in beancount/ingest/regression.py
def find_input_files(directory):
    """Find the input files in the module where the class is defined.

    Args:
      directory: A string, the path to a root directory to check for.
    Yields:
      Strings, the absolute filenames of sample input and expected files.
    """
    for sroot, dirs, files in os.walk(directory):
        for filename in files:
            if re.match(r'.*\.(extract|file_date|file_name|py|pyc|DS_Store)$', filename):
                continue
            yield path.join(sroot, filename)
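
The filename filter can be tried on a plain list of names, without walking a directory. The names below are invented samples.

```python
import re

# Same pattern as above: expected-output files, Python files and .DS_Store
# are not treated as sample inputs.
SKIP = re.compile(r'.*\.(extract|file_date|file_name|py|pyc|DS_Store)$')

names = ['statement.ofx',
         'statement.ofx.extract',
         'statement.ofx.file_date',
         'importer_test.py',
         '.DS_Store',
         'export.csv']

inputs = [name for name in names if not SKIP.match(name)]
```

Note that the pattern does not list the `.file_account` suffix, so such files would be yielded as inputs by this function.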

beancount.ingest.regression_pytest

Support for implementing regression tests on sample files using pytest.

This module provides definitions for testing a custom importer against a set of existing downloaded files, running the various importer interface methods on it, and comparing the output to an expected text file. (Expected test files can be auto-generated using the --generate option). You use it like this:

from beancount.ingest import regression_pytest
...
import mymodule
...

# Create your importer instance used for testing.
importer = mymodule.Importer(...)

# Select a directory where your test files are to be located.
directory = ...

# Create a test case using the base in this class.
@regression_pytest.with_importer(importer)
@regression_pytest.with_testdir(directory)
class TestImporter(regression_pytest.ImporterTestBase):
    pass

Also, to add the --generate option to 'pytest', you must create a conftest.py somewhere in one of the roots above your importers with this module as a plugin:

pytest_plugins = "beancount.ingest.regression_pytest"

See beancount/example/ingest for a full working example.

How to invoke the tests:

Via pytest. First run your test with the --generate option to generate all the expected files. Then inspect them visually for correctness. Finally, check them in to preserve them. You should be able to regress against those correct outputs in the future. Use version control to your advantage to visualize the differences.
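
The generate-then-regress workflow can be sketched as follows. `compare_or_generate` is a hypothetical helper written for this illustration; it is not the module's own implementation, and the file name and dates are invented.

```python
import os
import tempfile


def compare_or_generate(actual, expect_path, generate):
    """Hypothetical helper: write the expected file when generating, else compare."""
    if generate:
        with open(expect_path, 'w', encoding='utf-8') as outfile:
            outfile.write(actual)
        return 'generated'
    with open(expect_path, encoding='utf-8') as infile:
        expected = infile.read()
    return 'match' if expected == actual else 'differ'


with tempfile.TemporaryDirectory() as tmpdir:
    expect_path = os.path.join(tmpdir, 'statement.ofx.file_date')

    # First run, in generate mode: the expected file is written for review.
    first = compare_or_generate('2014-01-31', expect_path, generate=True)

    # Later runs compare the importer's output against the stored file.
    second = compare_or_generate('2014-01-31', expect_path, generate=False)

    # A change in the importer's behavior shows up as a difference.
    third = compare_or_generate('2014-02-01', expect_path, generate=False)
```

Keeping the generated files under version control, as suggested above, makes such differences visible as ordinary diffs.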

beancount.ingest.regression_pytest.ImporterTestBase

beancount.ingest.regression_pytest.ImporterTestBase.test_extract(self, importer, file, pytestconfig)

Extract entries from a test file and compare against expected output.

Source code in beancount/ingest/regression_pytest.py
def test_extract(self, importer, file, pytestconfig):
    """Extract entries from a test file and compare against expected output."""
    entries = extract.extract_from_file(file.name, importer, None, None)
    oss = io.StringIO()
    printer.print_entries(entries, file=oss)
    string = oss.getvalue()
    compare_contents_or_generate(string, '{}.extract'.format(file.name),
                                 pytestconfig.getoption("generate", False))

beancount.ingest.regression_pytest.ImporterTestBase.test_file_account(self, importer, file, pytestconfig)

Compute the selected filing account and compare to an expected output.

Source code in beancount/ingest/regression_pytest.py
def test_file_account(self, importer, file, pytestconfig):
    """Compute the selected filing account and compare to an expected output."""
    account = importer.file_account(file) or ''
    compare_contents_or_generate(account, '{}.file_account'.format(file.name),
                                 pytestconfig.getoption("generate", False))

beancount.ingest.regression_pytest.ImporterTestBase.test_file_date(self, importer, file, pytestconfig)

Compute the imported file date and compare to an expected output.

Source code in beancount/ingest/regression_pytest.py
def test_file_date(self, importer, file, pytestconfig):
    """Compute the imported file date and compare to an expected output."""
    date = importer.file_date(file)
    string = date.isoformat() if date else ''
    compare_contents_or_generate(string, '{}.file_date'.format(file.name),
                                 pytestconfig.getoption("generate", False))

beancount.ingest.regression_pytest.ImporterTestBase.test_file_name(self, importer, file, pytestconfig)

Compute the imported file name and compare to an expected output.

Source code in beancount/ingest/regression_pytest.py
def test_file_name(self, importer, file, pytestconfig):
    """Compute the imported file name and compare to an expected output."""
    filename = importer.file_name(file) or ''
    compare_contents_or_generate(filename, '{}.file_name'.format(file.name),
                                 pytestconfig.getoption("generate", False))

beancount.ingest.regression_pytest.ImporterTestBase.test_identify(self, importer, file)

Attempt to identify a file and expect results to be true.

This method does not need to check against an existing expect file. It simply assumes that identify() returns True if your test is set up well (the importer should always identify the test file).

Source code in beancount/ingest/regression_pytest.py
def test_identify(self, importer, file):
    """Attempt to identify a file and expect results to be true.

    This method does not need to check against an existing expect file. It
    is just assumed it should return True if your test is set up well (the
    importer should always identify the test file).
    """
    assert importer.identify(file)

beancount.ingest.regression_pytest.compare_contents_or_generate(actual_string, expect_fn, generate)

Compare a string to the contents of an expect file.

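If generate is true, write the actual string to the expect file and skip the test; otherwise, assert that the expect file exists and that its contents match.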

Parameters:
  • actual_string – The actual string contents.

  • expect_fn – The filename whose contents to read and compare against.

  • generate – A boolean, true if we are to generate the tests.

Source code in beancount/ingest/regression_pytest.py
def compare_contents_or_generate(actual_string, expect_fn, generate):
    """Compare a string to the contents of an expect file.

    If generate is true, write the expect file and skip the test; otherwise,
    assert that the expect file exists and that its contents match.

    Args:
      actual_string: The actual string contents.
      expect_fn: The filename whose contents to read and compare against.
      generate: A boolean, true if we are to generate the tests.
    """
    if generate:
        with open(expect_fn, 'w', encoding='utf-8') as expect_file:
            expect_file.write(actual_string)
            if actual_string and not actual_string.endswith('\n'):
                expect_file.write('\n')
        pytest.skip("Generated '{}'".format(expect_fn))
    else:
        # Run the test on an existing expected file.
        assert path.exists(expect_fn), (
            "Expected file '{}' is missing. Generate it?".format(expect_fn))
        with open(expect_fn, encoding='utf-8') as infile:
            expect_string = infile.read()
        assert expect_string.strip() == actual_string.strip()
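
The generate-or-compare flow above can be sketched without pytest. This is a minimal illustration using only the standard library; the helper name compare_or_generate and the sample strings are hypothetical, and it returns a status string where the real function skips or asserts.

```python
import os
import tempfile


def compare_or_generate(actual, expect_path, generate):
    """Write the expected file when generating; otherwise compare against it."""
    if generate:
        with open(expect_path, "w", encoding="utf-8") as outfile:
            outfile.write(actual)
            # Keep generated files newline-terminated.
            if actual and not actual.endswith("\n"):
                outfile.write("\n")
        return "generated"
    with open(expect_path, encoding="utf-8") as infile:
        expected = infile.read()
    # Leading and trailing whitespace is ignored in the comparison.
    return "match" if expected.strip() == actual.strip() else "mismatch"


with tempfile.TemporaryDirectory() as tmpdir:
    expect_path = os.path.join(tmpdir, "sample.csv.extract")
    first = compare_or_generate('2024-01-01 * "Coffee"', expect_path, generate=True)
    second = compare_or_generate('2024-01-01 * "Coffee"\n\n', expect_path, generate=False)
    third = compare_or_generate('2024-01-02 * "Tea"', expect_path, generate=False)

print(first, second, third)
```

Because both sides are stripped before comparison, trailing blank lines in either the expected file or the actual output do not cause a failure.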

beancount.ingest.regression_pytest.find_input_files(directory)

Find the input files in the module where the class is defined.

Parameters:
  • directory – A string, the path to a root directory to check for.

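Yields: Strings, the filenames of sample input files found under the directory. Expected-output files (.extract, .file_date, .file_name, .file_account), Python files and .DS_Store files are skipped.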

Source code in beancount/ingest/regression_pytest.py
def find_input_files(directory):
    """Find the input files in the module where the class is defined.

    Args:
      directory: A string, the path to a root directory to check for.
    Yields:
      Strings, the absolute filenames of sample input and expected files.
    """
    for sroot, dirs, files in os.walk(directory):
        for filename in files:
            if re.match(r'.*\.(extract|file_date|file_name|file_account|py|pyc|DS_Store)$',
                        filename):
                continue
            yield path.join(sroot, filename)
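
To see which files the filter above keeps, the same regular expression can be applied to a list of names. This is a small sketch; the helper is_sample_input and the filenames are made up for illustration.

```python
import re

# Same pattern as in find_input_files(): expected-output files, Python
# sources/bytecode and macOS metadata files are not treated as inputs.
SKIP_PATTERN = r'.*\.(extract|file_date|file_name|file_account|py|pyc|DS_Store)$'


def is_sample_input(filename):
    """Return True if the filename would be yielded as a sample input."""
    return re.match(SKIP_PATTERN, filename) is None


names = [
    "statement.csv",
    "statement.csv.extract",
    "statement.csv.file_date",
    "importer_test.py",
    ".DS_Store",
    "statement.pdf",
]
inputs = [name for name in names if is_sample_input(name)]
print(inputs)
```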

beancount.ingest.regression_pytest.pytest_addoption(parser)

Add an option to generate the expected files for the tests.

Source code in beancount/ingest/regression_pytest.py
def pytest_addoption(parser):
    """Add an option to generate the expected files for the tests."""
    group = parser.getgroup("beancount")
    group.addoption("--generate", "--gen", action="store_true",
                    help="Don't test; rather, generate the expected files")

beancount.ingest.regression_pytest.with_importer(importer)

Parametrizing fixture that provides the importer to test.

Source code in beancount/ingest/regression_pytest.py
def with_importer(importer):
    """Parametrizing fixture that provides the importer to test."""
    return pytest.mark.parametrize("importer", [importer])

beancount.ingest.regression_pytest.with_testdir(directory)

Parametrizing fixture that provides files from a directory.

Source code in beancount/ingest/regression_pytest.py
def with_testdir(directory):
    """Parametrizing fixture that provides files from a directory."""
    return pytest.mark.parametrize(
        "file", [cache.get_file(fn) for fn in find_input_files(directory)])

beancount.ingest.scripts_utils

Common front-end to all ingestion tools.

beancount.ingest.scripts_utils.TestScriptsBase (TestTempdirMixin, TestCase)

beancount.ingest.scripts_utils.TestScriptsBase.setUp(self)

Hook method for setting up the test fixture before exercising it.

Source code in beancount/ingest/scripts_utils.py
def setUp(self):
    super().setUp()
    for filename, contents in self.FILES.items():
        absname = path.join(self.tempdir, filename)
        os.makedirs(path.dirname(absname), exist_ok=True)
        with open(absname, 'w') as file:
            file.write(contents)
        if filename.endswith('.py') or filename.endswith('.sh'):
            os.chmod(absname, stat.S_IRUSR|stat.S_IXUSR)
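
The fixture logic above writes a mapping of relative filenames to contents into a temporary directory and marks scripts as executable. A standalone sketch of the same idea follows; the function name materialize and the FILES contents are hypothetical, and the permission check assumes a POSIX filesystem.

```python
import os
import stat
import tempfile

# Hypothetical fixture: relative path -> file contents.
FILES = {
    "scripts/run.sh": "#!/bin/sh\necho ok\n",
    "data/sample.csv": "a,b\n1,2\n",
}


def materialize(files, root):
    """Write each file under root, creating parent directories as needed."""
    created = {}
    for relname, contents in files.items():
        absname = os.path.join(root, relname)
        os.makedirs(os.path.dirname(absname), exist_ok=True)
        with open(absname, "w") as outfile:
            outfile.write(contents)
        # Scripts become read + execute for the owner only.
        if relname.endswith((".py", ".sh")):
            os.chmod(absname, stat.S_IRUSR | stat.S_IXUSR)
        created[relname] = absname
    return created


with tempfile.TemporaryDirectory() as tmpdir:
    created = materialize(FILES, tmpdir)
    csv_exists = os.path.exists(created["data/sample.csv"])
    script_mode = stat.S_IMODE(os.stat(created["scripts/run.sh"]).st_mode)
```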

beancount.ingest.scripts_utils.create_legacy_arguments_parser(description, run_func)

Create an arguments parser for all the ingestion bean-tools.

Parameters:
  • description (str) – The program description string.

  • run_func – A callable function to run the particular command.

Returns:
  • An argument parser instance with the 'config' and 'downloads' positional arguments defined and 'command' defaulting to run_func.

Source code in beancount/ingest/scripts_utils.py
def create_legacy_arguments_parser(description: str, run_func: callable):
    """Create an arguments parser for all the ingestion bean-tools.

    Args:
      description: The program description string.
      run_func: A callable function to run the particular command.
    Returns:
      An argument parser instance with the 'config' and 'downloads' arguments.
    """
    parser = version.ArgumentParser(description=description)

    parser.add_argument('config', action='store', metavar='CONFIG_FILENAME',
                        help=('Importer configuration file. '
                              'This is a Python file with a data structure that '
                              'is specific to your accounts'))

    parser.add_argument('downloads', nargs='+', metavar='DIR-OR-FILE',
                        default=[],
                        help='Filenames or directories to search for files to import')

    parser.set_defaults(command=run_func)

    return parser
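
The shape of the legacy command line can be illustrated with plain argparse. This sketch substitutes argparse.ArgumentParser for version.ArgumentParser, and make_parser, fake_run and the argument values are hypothetical.

```python
import argparse


def make_parser(description, run_func):
    """Build a parser with one config file and one or more download paths."""
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument("config", metavar="CONFIG_FILENAME")
    parser.add_argument("downloads", nargs="+", metavar="DIR-OR-FILE")
    # The callable to run is carried on the parsed namespace as 'command'.
    parser.set_defaults(command=run_func)
    return parser


def fake_run(args):
    """Stand-in for a tool's run() function."""
    return "would process {} path(s)".format(len(args.downloads))


parser = make_parser("Hypothetical legacy tool", fake_run)
args = parser.parse_args(["my_config.py", "Downloads/", "extra.csv"])
result = args.command(args)
print(result)
```

The first positional argument is always the configuration file; every remaining argument is collected into the downloads list.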

beancount.ingest.scripts_utils.ingest(importers_list, detect_duplicates_func=None, hooks=None)

Driver function that calls all the ingestion tools.

Put a call to this function at the end of your importer configuration to make your import script; this should be its main function, like this:

    from beancount.ingest.scripts_utils import ingest
    my_importers = [ ... ]
    ingest(my_importers)

This more explicit way of invoking the ingestion is now the preferred way to invoke the various tools, and replaces calling the bean-identify, bean-extract, bean-file tools with a --config argument. When you call the import script itself (as a program), it will parse the arguments, expecting a subcommand ('identify', 'extract' or 'file') and corresponding subcommand-specific arguments.

Here you can override some importer values, such as installing a custom duplicate finding hook, and eventually more. Note that this newer invocation method is optional: if the configuration file does not call ingest(), a call is generated implicitly and ingestion functions as it used to. Future configurable customization of the ingestion process will be implemented by adding new arguments to this function; that is the motivation for this design.

Note that invocation by the three bean-* ingestion tools is still supported, and calling ingest() explicitly from your import configuration file will not break these tools either, if you invoke them on it; the values you provide to this function will be used by those tools.

Parameters:
  • importers_list – A list of importer instances. This is used as a chain-of-responsibility, called on each file.

  • detect_duplicates_func – (DEPRECATED) An optional function which accepts a list of lists of imported entries and a list of entries already existing in the user's ledger. See function find_duplicate_entries(), which is the default implementation for this. Use 'hooks' instead.

  • hooks – An optional list of hook functions to apply to the list of extracted (filename, entries) pairs, in order. This replaces 'detect_duplicates_func'.

Source code in beancount/ingest/scripts_utils.py
def ingest(importers_list, detect_duplicates_func=None, hooks=None):
    """Driver function that calls all the ingestion tools.

    Put a call to this function at the end of your importer configuration to
    make your import script; this should be its main function, like this:

      from beancount.ingest.scripts_utils import ingest
      my_importers = [ ... ]
      ingest(my_importers)

    This more explicit way of invoking the ingestion is now the preferred way to
    invoke the various tools, and replaces calling the bean-identify,
    bean-extract, bean-file tools with a --config argument. When you call the
    import script itself (as a program) it will parse the arguments, expecting
    a subcommand ('identify', 'extract' or 'file') and corresponding
    subcommand-specific arguments.

    Here you can override some importer values, such as installing a custom
    duplicate finding hook, and eventually more. Note that this newer invocation
    method is optional and if it is not present, a call to ingest() is generated
    implicitly, and it functions as it used to. Future configurable
    customization of the ingestion process will be implemented by adding new
    arguments to this function; that is the motivation for this design.

    Note that invocation by the three bean-* ingestion tools is still supported,
    and calling ingest() explicitly from your import configuration file will not
    break these tools either, if you invoke them on it; the values you provide
    to this function will be used by those tools.

    Args:
      importers_list: A list of importer instances. This is used as a
        chain-of-responsibility, called on each file.
      detect_duplicates_func: (DEPRECATED) An optional function which accepts a
        list of lists of imported entries and a list of entries already existing
        in the user's ledger. See function find_duplicate_entries(), which is
        the default implementation for this. Use 'hooks' instead.
      hooks: An optional list of hook functions to apply to the list of extracted
        (filename, entries) pairs, in order. This replaces
        'detect_duplicates_func'.
    """
    if detect_duplicates_func is not None:
        warnings.warn("Argument 'detect_duplicates_func' is deprecated.")
        # Fold it in hooks.
        if hooks is None:
            hooks = []
        hooks.insert(0, detect_duplicates_func)
        del detect_duplicates_func

    if ingest.args is not None:
        # The script has been called from one of the bean-* ingestion tools.
        # 'ingest.args' is only set when we're being invoked from one of the
        # bean-xxx tools (see below).

        # Mark this function as called, so that if it is called from an import
        # triggered by one of the ingestion tools, it won't be called again
        # afterwards.
        ingest.was_called = True

        # Use those args rather than to try to parse the command-line arguments
        # from a naked ingest() call as a script. {39c7af4f6af5}
        args, parser = ingest.args
    else:
        # The script is called directly. This is the main program of the import
        # script itself. This is the new invocation method.
        parser = version.ArgumentParser(description=DESCRIPTION)

        # Use required on subparsers.
        # FIXME: Remove this when we require version 3.7 or above.
        kwargs = {}
        if sys.version_info >= (3, 7):
            kwargs['required'] = True
        subparsers = parser.add_subparsers(dest='command', **kwargs)

        parser.add_argument('--downloads', '-d', metavar='DIR-OR-FILE',
                            action='append', default=[],
                            help='Filenames or directories to search for files to import')

        for cmdname, module in [('identify', identify),
                                ('extract', extract),
                                ('file', file)]:
            parser_cmd = subparsers.add_parser(cmdname, help=module.DESCRIPTION)
            parser_cmd.set_defaults(command=module.run)
            module.add_arguments(parser_cmd)

        args = parser.parse_args()

        if not args.downloads:
            args.downloads.append(os.getcwd())

        # Implement required ourselves.
        # FIXME: Remove this when we require version 3.7 or above.
        if not (sys.version_info >= (3, 7)):
            if not hasattr(args, 'command'):
                parser.error("Subcommand is required.")

    abs_downloads = list(map(path.abspath, args.downloads))
    args.command(args, parser, importers_list, abs_downloads, hooks=hooks)
    return 0
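
The direct-invocation branch above builds a parser with three subcommands and a repeatable --downloads option that falls back to the current directory. The following is a reduced sketch of that argument handling with plain argparse. It assumes Python 3.7 or later (for required=True), uses a hypothetical 'subcommand' destination instead of 'command', and omits the per-subcommand arguments.

```python
import argparse
import os


def build_parser():
    """Build a parser with a required subcommand and repeatable --downloads."""
    parser = argparse.ArgumentParser(description="Hypothetical import script")
    subparsers = parser.add_subparsers(dest="subcommand", required=True)
    parser.add_argument("--downloads", "-d", metavar="DIR-OR-FILE",
                        action="append", default=[])
    for name in ("identify", "extract", "file"):
        subparsers.add_parser(name)
    return parser


def parse(argv):
    """Parse argv and default the downloads list to the working directory."""
    args = build_parser().parse_args(argv)
    if not args.downloads:
        args.downloads.append(os.getcwd())
    return args


explicit = parse(["-d", "Downloads", "-d", "statement.csv", "extract"])
implicit = parse(["identify"])
print(explicit.subcommand, explicit.downloads)
print(implicit.subcommand, implicit.downloads)
```

In this sketch the --downloads option belongs to the top-level parser, so it is given before the subcommand name.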

beancount.ingest.scripts_utils.run_import_script_and_ingest(parser, argv=None, importers_attr_name='CONFIG')

Run the import script and optionally call ingest().

This path is only called when trampolined by one of the bean-* ingestion tools.

Parameters:
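  • parser – The parser instance, used to parse the arguments and to report errors.

  • argv – An optional list of argument strings to parse; when None, argparse falls back to the process's command-line arguments.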

  • importers_attr_name – The name of the special attribute in the module which defines the importers list.

Returns:
  • An execution return code.

Source code in beancount/ingest/scripts_utils.py
def run_import_script_and_ingest(parser, argv=None, importers_attr_name='CONFIG'):
    """Run the import script and optionally call ingest().

    This path is only called when trampolined by one of the bean-* ingestion
    tools.

    Args:
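      parser: The parser instance, used to parse the arguments and to report
        errors.
      argv: An optional list of argument strings to parse; when None, argparse
        falls back to the process's command-line arguments.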
      importers_attr_name: The name of the special attribute in the module which
        defines the importers list.
    Returns:
      An execution return code.
    """
    args = parser.parse_args(args=argv)

    # Check the existence of the config.
    if not path.exists(args.config) or path.isdir(args.config):
        parser.error("File does not exist: '{}'".format(args.config))

    # Check the existence of all specified files.
    for filename in args.downloads:
        if not path.exists(filename):
            parser.error("File does not exist: '{}'".format(filename))

    # Reset the state of ingest() being called (for unit tests, which use the
    # same runtime with run_with_args).
    ingest.was_called = False

    # Save the arguments parsed from the command-line as default for
    # {39c7af4f6af5}.
    ingest.args = args, parser

    # Evaluate the importer script/module.
    mod = runpy.run_path(args.config)

    # If the importer script has already called ingest() within itself, don't
    # call it again. We're done. This allows the user to insert an explicit call
    # to ingest() while still running the bean-* ingestion tools on the file.
    if ingest.was_called:
        return 0

    # ingest() hasn't been called by the script so we assume it isn't
    # present in it. So we now run the ingestion by ourselves here, without
    # specifying any of the newer optional arguments.
    importers_list = mod[importers_attr_name]
    return ingest(importers_list)
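
The fallback at the end of this function relies on runpy.run_path() returning the executed script's globals as a dictionary, from which the importers list is read by name. A minimal sketch of that mechanism follows; the configuration file contents are invented, and strings stand in for importer instances.

```python
import os
import runpy
import tempfile

# Hypothetical old-style configuration file that only defines CONFIG.
CONFIG_SOURCE = "CONFIG = ['importer-a', 'importer-b']\n"

with tempfile.TemporaryDirectory() as tmpdir:
    config_path = os.path.join(tmpdir, "my_config.py")
    with open(config_path, "w", encoding="utf-8") as outfile:
        outfile.write(CONFIG_SOURCE)
    # run_path() executes the file and returns its module globals as a dict.
    module_globals = runpy.run_path(config_path)

importers_attr_name = "CONFIG"
importers_list = module_globals[importers_attr_name]
print(importers_list)
```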

beancount.ingest.scripts_utils.trampoline_to_ingest(module)

Parse arguments for bean tool, import config script and ingest.

This function is called by the three bean-* tools to support the older import files, which only required a CONFIG object to be defined in them.

Parameters:
  • module – One of the identify, extract or file module objects.

Returns:
  • An execution return code.

Source code in beancount/ingest/scripts_utils.py
def trampoline_to_ingest(module):
    """Parse arguments for bean tool, import config script and ingest.

    This function is called by the three bean-* tools to support the older
    import files, which only required a CONFIG object to be defined in them.

    Args:
      module: One of the identify, extract or file module objects.
    Returns:
      An execution return code.
    """
    # Disable debugging logging which is turned on by default in chardet.
    logging.getLogger('chardet.charsetprober').setLevel(logging.INFO)
    logging.getLogger('chardet.universaldetector').setLevel(logging.INFO)

    parser = create_legacy_arguments_parser(module.DESCRIPTION, module.run)
    module.add_arguments(parser)
    return run_import_script_and_ingest(parser)

beancount.ingest.similar

Identify similar entries.

This can be used during import in order to identify and flag duplicate entries.

beancount.ingest.similar.SimilarityComparator

Similarity comparator of transactions.

This comparator needs to be able to handle Transaction instances which are incomplete on one side, which have slightly different dates, or potentially slightly different numbers.

beancount.ingest.similar.SimilarityComparator.__call__(self, entry1, entry2) special

Compare two entries, return true if they are deemed similar.

Parameters:
  • entry1 – A first Transaction directive.

  • entry2 – A second Transaction directive.

Returns:
  • A boolean.

Source code in beancount/ingest/similar.py
def __call__(self, entry1, entry2):
    """Compare two entries, return true if they are deemed similar.

    Args:
      entry1: A first Transaction directive.
      entry2: A second Transaction directive.
    Returns:
      A boolean.
    """
    # Check the date difference.
    if self.max_date_delta is not None:
        delta = ((entry1.date - entry2.date)
                 if entry1.date > entry2.date else
                 (entry2.date - entry1.date))
        if delta > self.max_date_delta:
            return False

    try:
        amounts1 = self.cache[id(entry1)]
    except KeyError:
        amounts1 = self.cache[id(entry1)] = amounts_map(entry1)
    try:
        amounts2 = self.cache[id(entry2)]
    except KeyError:
        amounts2 = self.cache[id(entry2)] = amounts_map(entry2)

    # Look for amounts on common accounts.
    common_keys = set(amounts1) & set(amounts2)
    for key in sorted(common_keys):
        # Compare the amounts.
        number1 = amounts1[key]
        number2 = amounts2[key]
        if number1 == ZERO and number2 == ZERO:
            break
        diff = abs((number1 / number2)
                   if number2 != ZERO
                   else (number2 / number1))
        if diff == ZERO:
            return False
        if diff < ONE:
            diff = ONE/diff
        if (diff - ONE) < self.EPSILON:
            break
    else:
        return False

    # Here, we have found at least one common account with a close
    # amount. Now, we require that the set of accounts are equal or that
    # one be a subset of the other.
    accounts1 = set(posting.account for posting in entry1.postings)
    accounts2 = set(posting.account for posting in entry2.postings)
    return accounts1.issubset(accounts2) or accounts2.issubset(accounts1)
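
The per-account amount test in the loop above reduces to a ratio check: the two numbers are considered close when their ratio, normalized to be at least one, exceeds one by less than EPSILON. Here is a standalone sketch of that check. The function name amounts_close is hypothetical, and the tolerance of 0.05 is an assumption, since the value of EPSILON is not shown in this section.

```python
from decimal import Decimal

ZERO = Decimal(0)
ONE = Decimal(1)


def amounts_close(number1, number2, epsilon=Decimal("0.05")):
    """Return True if the two numbers are within a relative tolerance."""
    if number1 == ZERO and number2 == ZERO:
        return True
    # Divide by whichever operand is non-zero to avoid a division by zero.
    ratio = abs(number1 / number2 if number2 != ZERO else number2 / number1)
    if ratio == ZERO:
        # Exactly one of the two numbers is zero: never close.
        return False
    if ratio < ONE:
        ratio = ONE / ratio
    return (ratio - ONE) < epsilon


close = amounts_close(Decimal("100.00"), Decimal("104.00"))
far = amounts_close(Decimal("100.00"), Decimal("110.00"))
one_zero = amounts_close(Decimal("0"), Decimal("5.00"))
both_zero = amounts_close(Decimal("0"), Decimal("0"))
print(close, far, one_zero, both_zero)
```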

beancount.ingest.similar.SimilarityComparator.__init__(self, max_date_delta=None) special

Construct a comparator of entries.

Parameters:
  • max_date_delta – A datetime.timedelta instance of the max tolerated distance between dates.

Source code in beancount/ingest/similar.py
def __init__(self, max_date_delta=None):
    """Construct a comparator of entries.
    Args:
      max_date_delta: A datetime.timedelta instance of the max tolerated
        distance between dates.
    """
    self.cache = {}
    self.max_date_delta = max_date_delta

beancount.ingest.similar.amounts_map(entry)

Compute a mapping of (account, currency) -> Decimal balances.

Parameters:
  • entry – A Transaction instance.

Returns:
  • A dict mapping (account, currency) pairs to Decimal balances.

Source code in beancount/ingest/similar.py
def amounts_map(entry):
    """Compute a mapping of (account, currency) -> Decimal balances.

    Args:
      entry: A Transaction instance.
    Returns:
      A dict mapping (account, currency) pairs to Decimal balances.
    """
    amounts = collections.defaultdict(D)
    for posting in entry.postings:
        # Skip interpolated postings.
        if posting.meta and interpolate.AUTOMATIC_META in posting.meta:
            continue
        currency = isinstance(posting.units, amount.Amount) and posting.units.currency
        if isinstance(currency, str):
            key = (posting.account, currency)
            amounts[key] += posting.units.number
    return amounts
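
The aggregation above keys balances by (account, currency) and sums the numbers of all postings sharing a key. A simplified sketch follows, using namedtuples as stand-ins for Beancount's posting and amount types. The names Units, Posting and sum_by_account_and_currency are hypothetical, and the check for interpolated postings is omitted.

```python
import collections
from decimal import Decimal

# Minimal stand-ins for the real posting and amount objects.
Units = collections.namedtuple("Units", "number currency")
Posting = collections.namedtuple("Posting", "account units")


def sum_by_account_and_currency(postings):
    """Sum posting numbers per (account, currency) key."""
    totals = collections.defaultdict(Decimal)
    for posting in postings:
        # Postings without units (e.g. amounts left to be filled in) are skipped.
        if posting.units is None:
            continue
        key = (posting.account, posting.units.currency)
        totals[key] += posting.units.number
    return dict(totals)


postings = [
    Posting("Assets:Cash", Units(Decimal("-5.00"), "USD")),
    Posting("Assets:Cash", Units(Decimal("-3.00"), "USD")),
    Posting("Expenses:Food", Units(Decimal("8.00"), "USD")),
    Posting("Expenses:Food", Units(Decimal("2.00"), "CAD")),
    Posting("Equity:Rounding", None),
]
totals = sum_by_account_and_currency(postings)
print(totals)
```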

beancount.ingest.similar.find_similar_entries(entries, source_entries, comparator=None, window_days=2)

Find which entries from a list are potential duplicates of a set.

Note: If there are multiple entries from 'source_entries' matching an entry in 'entries', only the first match is returned. Note that this function could in theory decide to merge some of the imported entries with each other.

Parameters:
  • entries – The list of entries to classify as duplicate or not.

  • source_entries – The list of entries against which to match. This is the previous, or existing set of entries to compare against. This may be None or empty.

  • comparator – A functor used to establish the similarity of two entries.

  • window_days – The number of days (inclusive) before or after to scan the entries to classify against.

Returns:
  • A list of pairs of entries (entry, source_entry) where entry is from 'entries' and is deemed to be a duplicate of source_entry, from 'source_entries'.

Source code in beancount/ingest/similar.py
def find_similar_entries(entries, source_entries, comparator=None, window_days=2):
    """Find which entries from a list are potential duplicates of a set.

    Note: If there are multiple entries from 'source_entries' matching an entry
    in 'entries', only the first match is returned. Note that this function
    could in theory decide to merge some of the imported entries with each
    other.

    Args:
      entries: The list of entries to classify as duplicate or not.
      source_entries: The list of entries against which to match. This is the
        previous, or existing set of entries to compare against. This may be None
        or empty.
      comparator: A functor used to establish the similarity of two entries.
      window_days: The number of days (inclusive) before or after to scan the
        entries to classify against.
    Returns:
      A list of pairs of entries (entry, source_entry) where entry is from
      'entries' and is deemed to be a duplicate of source_entry, from
      'source_entries'.
    """
    window_head = datetime.timedelta(days=window_days)
    window_tail = datetime.timedelta(days=window_days + 1)

    if comparator is None:
        comparator = SimilarityComparator()

    # For each of the new entries, look at existing entries at a nearby date.
    duplicates = []
    if source_entries is not None:
        for entry in data.filter_txns(entries):
            for source_entry in data.filter_txns(
                    data.iter_entry_dates(source_entries,
                                          entry.date - window_head,
                                          entry.date + window_tail)):
                if comparator(entry, source_entry):
                    duplicates.append((entry, source_entry))
                    break
    return duplicates
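
The two timedelta values above define the date window scanned around each new entry. The sketch below assumes that the end bound passed to data.iter_entry_dates() is exclusive, which would explain why the tail is one day longer than the head; that behavior is not shown in this section, and the helper names are hypothetical.

```python
import datetime


def window_bounds(date, window_days=2):
    """Return the (inclusive begin, exclusive end) dates of the scan window."""
    head = datetime.timedelta(days=window_days)
    tail = datetime.timedelta(days=window_days + 1)
    return date - head, date + tail


def in_window(candidate, date, window_days=2):
    """Return True if candidate falls within window_days of date, inclusive."""
    begin, end = window_bounds(date, window_days)
    return begin <= candidate < end


entry_date = datetime.date(2024, 1, 10)
begin, end = window_bounds(entry_date)
print(begin, end)
```

Under that assumption, an entry dated 2024-01-10 with the default window is compared against existing entries dated 2024-01-08 through 2024-01-12.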