Finds all the minimal functional dependencies represented in a data frame.
Arguments
- df
a data.frame, the relation to evaluate.
- accuracy
a numeric in (0, 1]: the accuracy threshold required in order to conclude a dependency.
- digits
a positive integer, indicating how many significant digits are to be used for numeric and complex variables. A value of NA results in no rounding. By default, this uses getOption("digits"), similarly to format. See the "Floating-point variables" section below for why this rounding is necessary for consistent results across different machines. See the note in print.default about digits >= 16.
- full_cache
a logical, indicating whether to store information about how sets of attributes group the relation records (stripped partitions). Otherwise, only the number of groups is stored. Storing the stripped partitions is expected to let the algorithm run more quickly, but might be inefficient for small data frames or small amounts of memory.
- store_cache
a logical, indicating whether to keep cached information to use when finding dependencies for other dependants. This allows the algorithm to run more quickly by not having to re-calculate information, but takes up more memory.
- skip_bijections
a logical, indicating whether to skip some dependency searches that are made redundant by discovered bijections between attributes. This can significantly speed up the search if df contains equivalent attributes early in column order, but results in undefined behaviour if accuracy < 1. See Details for more information.
- exclude
a character vector, containing names of attributes not to consider as members of determinant sets. If names are given that aren't present in df, the user is given a warning.
- exclude_class
a character vector, indicating classes of attributes not to consider as members of determinant sets. Attributes are excluded if they inherit from any given class.
- dependants
a character vector, containing names of all attributes for which to find minimal functional dependencies for which they are the dependant. By default, this is all of the attribute names. A smaller set of attribute names reduces the amount of searching required, so can reduce the computation time if only some potential dependencies are of interest.
- detset_limit
an integer, indicating the largest determinant set size that should be searched for. By default, this is large enough to allow all possible determinant sets. See Details for comments about the effect on the result, and on the computation time.
- progress
a logical, for whether to display progress to the user during dependency search in discover.
- progress_file
a scalar character or a connection. If progress is non-zero, determines where the progress is written to, in the same way as the file argument for cat.
Value
A functional_dependency
object, containing the discovered
dependencies. The column names of df
are stored in the attrs
attribute, in order, to serve as a default priority order for the
attributes during normalisation.
Details
Column names for df
must be unique.
The algorithm used for finding dependencies is DFD. This searches for determinant sets for each dependent attribute (dependant) by traversing the powerset of the other (non-excluded) attributes, in a manner equivalent to a depth-first search.
The implementation for DFD differs a little from the algorithm presented in the original paper:
Some attributes, or attribute types, can be designated, ahead of time, as not being candidate members for determinant sets. This reduces the number of candidate determinant sets to be searched, saving time by not searching for determinant sets that the user would remove later anyway.
Attributes that have a single unique value, i.e. are constant, get attributed a single empty determinant set. In the standard DFD algorithm, they would be assigned all the other non-excluded attributes as length-one determinant sets. Assigning them the empty set distinguishes them as constant, allowing for special treatment at normalisation and later steps.
As was done in the original Python library, there is an extra case in seed generation for when there are no discovered maximal non-dependencies. In this case, we take all of the single-attribute nodes, then filter out by minimal dependencies as usual. This is equivalent to taking the empty set as the single maximal non-dependency.
There are three results when checking whether a candidate node is minimal/maximal. TRUE indicates the node is minimal/maximal, as usual. FALSE has been split into FALSE and NA. NA indicates that we cannot yet determine whether the node is minimal/maximal. FALSE indicates that we have determined that it is not minimal/maximal, and we can set its category as such. This is done by checking whether any of its adjacent subsets/supersets are dependencies/non-dependencies, instead of waiting to exhaust the adjacent subsets/supersets to visit when picking the next node to visit.
We do not yet keep hashmaps to manage subset/superset relationships, as described in Section 3.5 of the original paper.
skip_bijections allows for additional optimisation for finding functional dependencies when there are pairwise-equivalent attributes.
Missing values (NA) are treated as a normal value, with NA = NA being true, and x = NA being false for any non-NA value of x.
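The NA comparison rule above can be sketched in base R. The helper below is purely illustrative, not part of the package's interface:

```r
# Sketch of the stated comparison rule: NA matches NA,
# and NA never matches a non-NA value. (Illustrative helper only.)
same_value <- function(a, b) {
  (is.na(a) & is.na(b)) | (!is.na(a) & !is.na(b) & a == b)
}
same_value(NA, NA)  # TRUE
same_value(1, NA)   # FALSE
same_value(1, 1)    # TRUE
```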
Floating-point variables
Numerical/complex values, i.e. floating-point values, represent difficulties for stating functional dependencies. A fundamental condition for stating functional dependencies is that we can compare two values for the same variable, and they are equivalent or not equivalent.
Usually, this is done by checking they're equal – this is the approach used
in discover
– but we can use any comparison that is an equivalence
relation.
However, checking floating-point values for equality is not simple. ==
is not appropriate, even when comparing non-calculated values we've read from
a file, because how a given number is converted into a float can vary by
computer architecture, meaning that two values can be considered equal on one
computer, and not equal on another. This can happen even if they're both
using 64-bit R, and even though all R platforms work with values conforming
to the same standard (see double
). For example,
\(8.54917750000000076227\) and \(8.54917749999999898591\) are converted into
different floating-point representations on x86, but the same representation
on ARM, resulting in inequality and equality respectively.
For this and other reasons, checking numerical/complex values for
(near-)equality in R is usually done with all.equal. This
determines values \(x\) and \(y\) to be equal if their absolute/relative
difference is within some tolerance value. However, we cannot use
this. Equivalence relations must be transitive: if we have values \(x\),
\(y\), and \(z\), and \(x\) is equivalent to both \(y\) and \(z\),
then \(y\) and \(z\) must also be equivalent. This tolerance-based
equivalence is not transitive: it is reasonably straightforward to set up
three values so that the outer values are far enough apart to be considered
non-equivalent, but the middle value is close enough to be considered
equivalent to both of them. Using this to determine functional dependencies,
therefore, could easily result in a large number of inconsistencies.
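This failure of transitivity is easy to reproduce with all.equal's default tolerance (about 1.5e-8, relative):

```r
a <- 1
b <- 1 + 1e-8
c <- 1 + 2e-8
# The middle value is within tolerance of both neighbours...
isTRUE(all.equal(a, b))  # TRUE
isTRUE(all.equal(b, c))  # TRUE
# ...but the outer pair is not, so tolerance-based "equality" is not transitive.
isTRUE(all.equal(a, c))  # FALSE
```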
This means we have no good option for comparing numerical/complex values as-is for equivalence, with consistent results across different machines, so we must treat them differently. We have three options:
- Round/truncate the values, before comparison, to some low degree of precision;
- Coerce the values to another class before passing them into discover;
- Read values as characters if reading data from a file.
discover
takes the first option, with a default number of significant
digits low enough to ensure consistency across different machines. However,
the user can also use any of these options when processing the data before
passing it to discover
. The third option, in particular, is
recommended if reading data from a file.
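As a sketch of the first option, using the example values from above: rounding with signif (or formatting with format) before comparison gives the same answer on any machine. The choice of 6 significant digits here is illustrative, not discover's default:

```r
x <- 8.54917750000000076227
y <- 8.54917749999999898591
# x == y is platform-dependent: the two literals parse to different
# doubles on x86 but the same double on ARM.
# After rounding to a low number of significant digits, the comparison
# is consistent everywhere:
signif(x, 6) == signif(y, 6)                    # TRUE
format(x, digits = 6) == format(y, digits = 6)  # TRUE
```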
Skipping bijections
Skipping bijections allows skipping redundant searches. For example, if the
search discovers that A -> B
and B -> A
, then only one of those
attributes is considered for the remainder of the search. Since the search
time increases exponentially with the number of attributes considered, this
can significantly speed up search times. At the moment, this is only done
for bijections between single attributes, such as A <-> B; if
A <-> {B, C}, nothing is skipped. Whether bijections are skipped doesn't
affect which functional dependencies are present in the output, but it might
affect their order.
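The single-attribute check involved can be sketched in base R: A <-> B holds exactly when the two columns induce the same partition of the rows. The helper below is an illustrative sketch, not the package's implementation:

```r
# Two attributes determine each other iff they group the rows
# identically. (Illustrative helper only.)
same_partition <- function(a, b) {
  identical(match(a, unique(a)), match(b, unique(b)))
}
same_partition(c("x", "y", "x"), c(10, 20, 10))  # TRUE:  A <-> B
same_partition(c("x", "y", "x"), c(10, 10, 20))  # FALSE
```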
Skipping bijections for approximate dependencies, i.e. when accuracy < 1
,
should be avoided: it can result in incorrect output, since an approximate
bijection doesn't imply equivalent approximate dependencies.
Limiting the determinant set size
Setting detset_limit smaller than the largest-possible value has
different behaviour for different search algorithms, but the result is always
that discover(x, 1, detset_limit = n) is equivalent to doing a full
search, fds <- discover(x, 1), then
filtering by determinant set size post-hoc, fds[lengths(detset(fds)) <= n].
For DFD, the naive way to implement it is by removing determinant sets larger than the limit from the search tree for possible functional dependencies for each dependant. However, this usually results in the search taking much more time than without a limit.
For example, suppose we search for determinant sets for a dependant that has
none (the dependant is the only key for df
, for example). Using DFD,
we begin with a single attribute, then add other attributes one-by-one, since
every set gives a non-dependency. When we reach a maximum-size set, we can
mark all subsets as also being non-dependencies.
With the default limit, there is only one maximum-size set, containing all of the available attributes. If there are \(n\) candidate attributes for determinants, the search finishes after visiting \(n\) sets.
With a smaller limit \(k\), there are \(\binom{n}{k}\) maximum-size sets to explore. Since a DFD search adds or removes one attribute at each step, the search must take at least \(k - 2 + 2\binom{n}{k}\) steps, which is larger than \(n\) for all non-trivial cases \(0 < k < n\).
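To illustrate these step counts with concrete numbers (a sketch, plugging values into the formulas above):

```r
n <- 10  # candidate determinant attributes
# No limit: one maximum-size set, so about n steps for a keyless dependant.
n                         # 10
# Limit k = 5: all choose(n, k) maximum-size sets must be visited.
k <- 5
choose(n, k)              # 252
k - 2 + 2 * choose(n, k)  # 507, far more than the 10 steps without a limit
```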
We therefore use a different approach: any determinant sets above the size limit are not allowed to be candidate seeds for new search paths, and any discovered dependencies with a size above the limit are discarded at the end of the entire DFD search. This means that nodes for determinant sets above the size limit are only visited in order to determine maximality of non-dependencies within the size limit. It turns out to be rare that this results in a significant speed-up, but it never results in the search having to visit more nodes than it would without a size limit, so the average search time is never made worse.
References
Abedjan Z., Schulze P., Naumann F. (2014) DFD: efficient functional dependency discovery. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (CIKM '14). New York, U.S.A., 949–958.
Examples
# simple example
discover(ChickWeight, 1)
#> 2 functional dependencies
#> 4 attributes: weight, Time, Chick, Diet
#> Time, Chick -> weight
#> Chick -> Diet
# example with spurious dependencies
discover(CO2, 1)
#> 5 functional dependencies
#> 5 attributes: Plant, Type, Treatment, conc, uptake
#> Treatment, conc, uptake -> Plant
#> Plant -> Type
#> conc, uptake -> Type
#> Plant -> Treatment
#> Plant, conc -> uptake
# exclude attributes that can't be determinants.
# in this case, the numeric attributes are now
# not determined by anything, because of repeat measurements
# with no variable to mark them as such.
discover(CO2, 1, exclude_class = "numeric")
#> 2 functional dependencies
#> 5 attributes: Plant, Type, Treatment, conc, uptake
#> Plant -> Type
#> Plant -> Treatment
# include only dependencies with dependants of interest.
discover(CO2, 1, dependants = c("Treatment", "uptake"))
#> 2 functional dependencies
#> 5 attributes: Plant, Type, Treatment, conc, uptake
#> Plant -> Treatment
#> Plant, conc -> uptake