Finds all the minimal functional dependencies represented in a data frame.
Arguments
- df
a data.frame, the relation to evaluate.
- accuracy
a numeric in (0, 1]: the accuracy threshold required in order to conclude a dependency.
- digits
a positive integer, indicating how many significant digits are to be used for numeric and complex variables. A value of NA results in no rounding. By default, this uses getOption("digits"), similarly to format. See the "Floating-point variables" section below for why this rounding is necessary for consistent results across different machines. See the note in print.default about digits >= 16.
- full_cache
a logical, indicating whether to store information about how sets of attributes group the relation records (stripped partitions). Otherwise, only the number of groups is stored. Storing the stripped partitions is expected to let the algorithm run more quickly, but might be inefficient for small data frames or small amounts of memory.
- store_cache
a logical, indicating whether to keep cached information to use when finding dependencies for other dependants. This allows the algorithm to run more quickly by not having to re-calculate information, but takes up more memory.
- skip_bijections
a logical, indicating whether to skip some dependency searches that are made redundant by discovered bijections between attributes. This can significantly speed up the search if df contains equivalent attributes early in column order, but results in undefined behaviour if accuracy < 1. See Details for more information.
- exclude
a character vector, containing names of attributes not to consider as members of determinant sets. If names are given that aren't present in df, the user is given a warning.
- exclude_class
a character vector, indicating classes of attributes not to consider as members of determinant sets. Attributes are excluded if they inherit from any given class.
- dependants
a character vector, containing names of all attributes for which to find minimal functional dependencies for which they are the dependant. By default, this is all of the attribute names. A smaller set of attribute names reduces the amount of searching required, so can reduce the computation time if only some potential dependencies are of interest.
- detset_limit
an integer, indicating the largest determinant set size that should be searched for. By default, this is large enough to allow all possible determinant sets. See Details for comments about the effect on the result, and on the computation time.
- progress
a logical, for whether to display progress to the user during dependency search in discover.
- progress_file
a scalar character or a connection. If progress is non-zero, determines where the progress is written to, in the same way as the file argument for cat.
Value
A functional_dependency
object, containing the discovered
dependencies. The column names of df
are stored in the attrs
attribute, in order, to serve as a default priority order for the
attributes during normalisation.
Details
Column names for df
must be unique.
The algorithm used for finding dependencies is DFD. This searches for determinant sets for each dependent attribute (dependant) by traversing the powerset of the other (non-excluded) attributes, in a manner equivalent to a depth-first search.
The implementation for DFD differs a little from the algorithm presented in the original paper:
Some attributes, or attribute types, can be designated, ahead of time, as not being candidate members for determinant sets. This reduces the number of candidate determinant sets to be searched, saving time by not searching for determinant sets that the user would remove later anyway.
Attributes that have a single unique value, i.e. are constant, get attributed a single empty determinant set. In the standard DFD algorithm, they would be assigned all the other non-excluded attributes as length-one determinant sets. Assigning them the empty set distinguishes them as constant, allowing for special treatment at normalisation and later steps.
As was done in the original Python library, there is an extra case in seed generation for when there are no discovered maximal non-dependencies. In this case, we take all of the single-attribute nodes, then filter out by minimal dependencies as usual. This is equivalent to taking the empty set as the single maximal non-dependency.
There are three results when checking whether a candidate node is minimal/maximal. TRUE indicates the node is minimal/maximal, as usual. FALSE has been split into FALSE and NA. NA indicates that we cannot yet determine whether the node is minimal/maximal. FALSE indicates that we have determined that it is not minimal/maximal, and we can set its category as such. This is done by checking whether any of its adjacent subsets/supersets are dependencies/non-dependencies, instead of waiting to exhaust the adjacent subsets/supersets to visit when picking the next node to visit.
We do not yet keep hashmaps to manage subset/superset relationships, as described in Section 3.5 of the original paper.
skip_bijections allows for additional optimisation for finding functional dependencies when there are pairwise-equivalent attributes.
Missing values (NA) are treated as a normal value, with NA = NA being true, and x = NA being false for any non-NA value of x.
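The NA comparison rule above can be sketched in base R. The helper below is purely illustrative, not part of the package's interface:

```r
# Sketch of the stated comparison rule: NA matches NA,
# and NA never matches a non-NA value. (Illustrative helper only.)
same_value <- function(a, b) {
  (is.na(a) & is.na(b)) | (!is.na(a) & !is.na(b) & a == b)
}
same_value(NA, NA)  # TRUE
same_value(1, NA)   # FALSE
same_value(1, 1)    # TRUE
```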
Floating-point variables
Numerical/complex values, i.e. floating-point values, represent difficulties for stating functional dependencies. A fundamental condition for stating functional dependencies is that we can compare two values for the same variable, and they are equivalent or not equivalent.
Usually, this is done by checking they're equal – this is the approach used
in discover
– but we can use any comparison that is an equivalence
relation.
However, checking floating-point values for equality is not simple. ==
is not appropriate, even when comparing non-calculated values we've read from
a file, because how a given number is converted into a float can vary by
computer architecture, meaning that two values can be considered equal on one
computer, and not equal on another. This can happen even if they're both
using 64-bit R, and even though all R platforms work with values conforming
to the same standard (see double
). For example,
\(8.54917750000000076227\) and \(8.54917749999999898591\) are converted into
different floating-point representations on x86, but the same representation
on ARM, resulting in inequality and equality respectively.
For this and other reasons, checking numerical/complex values for
(near-)equality in R is usually done with all.equal. This
determines values \(x\) and \(y\) to be equal if their absolute/relative
difference is within some tolerance value. However, we cannot use
this. Equivalence relations must be transitive: if we have values \(x\),
\(y\), and \(z\), and \(x\) is equivalent to both \(y\) and \(z\),
then \(y\) and \(z\) must also be equivalent. This tolerance-based
equivalence is not transitive: it is reasonably straightforward to set up
three values so that the outer values are far enough apart to be considered
non-equivalent, but the middle value is close enough to be considered
equivalent to both of them. Using this to determine functional dependencies,
therefore, could easily result in a large number of inconsistencies.
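This failure of transitivity is easy to reproduce with all.equal's default tolerance (about 1.5e-8, relative):

```r
a <- 1
b <- 1 + 1e-8
c <- 1 + 2e-8
# The middle value is within tolerance of both neighbours...
isTRUE(all.equal(a, b))  # TRUE
isTRUE(all.equal(b, c))  # TRUE
# ...but the outer pair is not, so tolerance-based "equality" is not transitive.
isTRUE(all.equal(a, c))  # FALSE
```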
This means we have no good option for comparing numerical/complex values as-is for equivalence, with consistent results across different machines, so we must treat them differently. We have three options:
- Round/truncate the values, before comparison, to some low degree of precision;
- Coerce the values to another class before passing them into discover;
- Read values as characters if reading data from a file.
discover
takes the first option, with a default number of significant
digits low enough to ensure consistency across different machines. However,
the user can also use any of these options when processing the data before
passing it to discover
. The third option, in particular, is
recommended if reading data from a file.
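As a sketch of the first option, using the example values from above: rounding with signif (or formatting with format) before comparison gives the same answer on any machine. The choice of 6 significant digits here is illustrative, not discover's default:

```r
x <- 8.54917750000000076227
y <- 8.54917749999999898591
# x == y is platform-dependent: the two literals parse to different
# doubles on x86 but the same double on ARM.
# After rounding to a low number of significant digits, the comparison
# is consistent everywhere:
signif(x, 6) == signif(y, 6)                    # TRUE
format(x, digits = 6) == format(y, digits = 6)  # TRUE
```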
Skipping bijections
Skipping bijections allows skipping redundant searches. For example, if the
search discovers that A -> B
and B -> A
, then only one of those
attributes is considered for the remainder of the search. Since the search
time increases exponentially with the number of attributes considered, this
can significantly speed up search times. At the moment, this is only done
for bijections between single attributes, such as A <-> B; if
A <-> {B, C}, nothing is skipped. Whether bijections are skipped doesn't
affect which functional dependencies are present in the output, but it might
affect their order.
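The single-attribute check involved can be sketched in base R: A <-> B holds exactly when the two columns induce the same partition of the rows. The helper below is an illustrative sketch, not the package's implementation:

```r
# Two attributes determine each other iff they group the rows
# identically. (Illustrative helper only.)
same_partition <- function(a, b) {
  identical(match(a, unique(a)), match(b, unique(b)))
}
same_partition(c("x", "y", "x"), c(10, 20, 10))  # TRUE:  A <-> B
same_partition(c("x", "y", "x"), c(10, 10, 20))  # FALSE
```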
Skipping bijections for approximate dependencies, i.e. when accuracy < 1
,
should be avoided: it can result in incorrect output, since an approximate
bijection doesn't imply equivalent approximate dependencies.
Limiting the determinant set size
Setting detset_limit smaller than the largest-possible value has
different behaviour for different search algorithms, but the result is always
that discover(x, 1, detset_limit = n) is equivalent to doing a full
search, fds <- discover(x, 1), then
filtering by determinant set size post-hoc, fds[lengths(detset(fds)) <= n].
For DFD, the naive way to implement it is by removing determinant sets larger than the limit from the search tree for possible functional dependencies for each dependant. However, this usually results in the search taking much more time than without a limit.
For example, suppose we search for determinant sets for a dependant that has
none (the dependant is the only key for df
, for example). Using DFD,
we begin with a single attribute, then add other attributes one-by-one, since
every set gives a non-dependency. When we reach a maximum-size set, we can
mark all subsets as also being non-dependencies.
With the default limit, there is only one maximum-size set, containing all of the available attributes. If there are \(n\) candidate attributes for determinants, the search finishes after visiting \(n\) sets.
With a smaller limit \(k\), there are \(\binom{n}{k}\) maximum-size sets to explore. Since a DFD search adds or removes one attribute at each step, the search must take at least \(k - 2 + 2\binom{n}{k}\) steps, which is larger than \(n\) for all non-trivial cases \(0 < k < n\).
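To illustrate these step counts with concrete numbers (a sketch, plugging values into the formulas above):

```r
n <- 10  # candidate determinant attributes
# No limit: one maximum-size set, so about n steps for a keyless dependant.
n                         # 10
# Limit k = 5: all choose(n, k) maximum-size sets must be visited.
k <- 5
choose(n, k)              # 252
k - 2 + 2 * choose(n, k)  # 507, far more than the 10 steps without a limit
```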
We therefore use a different approach: any determinant sets above the size limit are not allowed to be candidate seeds for new search paths, and any discovered dependencies with a size above the limit are discarded at the end of the entire DFD search. This means that nodes for determinant sets above the size limit are only visited in order to determine maximality of non-dependencies within the size limit. It turns out to be rare that this results in a significant speed-up, but it never results in the search having to visit more nodes than it would without a size limit, so the average search time is never made worse.
References
Abedjan Z., Schulze P., Naumann F. (2014) DFD: efficient functional dependency discovery. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (CIKM '14). New York, U.S.A., 949–958.
Examples
# simple example
discover(ChickWeight, 1)
#> 2 functional dependencies
#> 4 attributes: weight, Time, Chick, Diet
#> Time, Chick -> weight
#> Chick -> Diet
# example with spurious dependencies
discover(CO2, 1)
#> 5 functional dependencies
#> 5 attributes: Plant, Type, Treatment, conc, uptake
#> Treatment, conc, uptake -> Plant
#> Plant -> Type
#> conc, uptake -> Type
#> Plant -> Treatment
#> Plant, conc -> uptake
# exclude attributes that can't be determinants.
# in this case, the numeric attributes are now
# not determined by anything, because of repeat measurements
# with no variable to mark them as such.
discover(CO2, 1, exclude_class = "numeric")
#> 2 functional dependencies
#> 5 attributes: Plant, Type, Treatment, conc, uptake
#> Plant -> Type
#> Plant -> Treatment
# include only dependencies with dependants of interest.
discover(CO2, 1, dependants = c("Treatment", "uptake"))
#> 2 functional dependencies
#> 5 attributes: Plant, Type, Treatment, conc, uptake
#> Plant -> Treatment
#> Plant, conc -> uptake