ExpandNestedData.jl

ExpandNestedData.jl is a small package that consumes nested data structures, such as dictionaries of dictionaries or structs of structs, and produces a normalized, Tables.jl-compliant NamedTuple. It can be used with JSON3.jl, XMLDict.jl, and other packages that parse file formats structured as denormalized data. It operates similarly to pandas.json_normalize, but offers more flexibility through the table-level and per-column options described below.

Getting Started

Install

using Pkg
Pkg.add("ExpandNestedData")

Basic Usage

ExpandNestedData provides a single function, expand, to flatten nested data.

using ExpandNestedData
using JSON3
using DataFrames

message = JSON3.read("""
    {
        "a" : [
            {"b" : 1, "c" : 2},
            {"b" : 2},
            {"b" : [3, 4], "c" : 1},
            {"b" : []}
        ],
        "d" : 4
    }
    """
)

expand(message) |> DataFrame
5×3 DataFrame
 Row │ d      a_b      a_c
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     4  missing  missing
   2 │     4        3        1
   3 │     4        4        1
   4 │     4        2  missing
   5 │     4        1        2
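
Because the result is a plain NamedTuple of column vectors, it can be handed to any Tables.jl consumer, not only DataFrame. The sketch below does not call ExpandNestedData itself: it uses a hand-written NamedTuple as a stand-in for the value of expand(message) shown above, and relies only on functions from Tables.jl.

```julia
using Tables

# Hand-written stand-in mirroring the columns of the table shown above;
# in real use this would be the value returned by `expand(message)`.
tbl = (
    d   = [4, 4, 4, 4, 4],
    a_b = [missing, 3, 4, 2, 1],
    a_c = [missing, 1, 1, missing, 2],
)

Tables.istable(tbl)        # a NamedTuple of vectors satisfies the Tables.jl interface
Tables.columnnames(tbl)    # column names as a tuple of Symbols

# Convert the column-oriented table into a vector of row NamedTuples
row_list = Tables.rowtable(tbl)
second_row = row_list[2]
```

The same value can be passed to other sinks that accept Tables.jl sources, which is why the examples here simply pipe it into DataFrame.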

Configuring Options

While expand can produce a table out of the box, it is often useful to configure how it handles the normalization process. ExpandNestedData.jl offers two ways to do this: set options at the table level with expand's keyword arguments, or exercise finer control with per-column configurations.

Keyword Arguments

| Parameter | Description |
|-----------|-------------|
| default_value::Any | When a key exists in one branch but not in another, the value used to fill the gap. Default: missing |
| lazy_columns::Bool | If true, return columns as a custom lazy iterator instead of collecting them as materialized vectors. This option can speed things up if you only need to access a subset of rows once. It is usually better to materialize the columns, since getindex() on the lazy columns is expensive. Default: false |
| pool_arrays::Bool | When collecting vectors for columns, choose whether to use PooledArrays instead of Base.Vector. Default: false |
| column_names::Dict{Tuple, Symbol} | A mapping from key/fieldname paths to replacement column names |
| column_style::Symbol | Choose the returned column style from :nested or :flat. If :nested, column_names are ignored and a TypedTables.Table is returned in which the columns are nested in the same structure as the source data. Default: :flat |
| name_join_pattern::String | A pattern to put between the keys when joining the path into a column name. Default: "_" |

name_map = Dict([:a, :b] => :Column_B)
expand(message; default_value="no value", pool_arrays=true, column_names=name_map) |> DataFrame
5×3 DataFrame
 Row │ d      a_b       a_c
     │ Int64  Union…    Union…
─────┼───────────────────────────
   1 │     4  no value  no value
   2 │     4  3         1
   3 │     4  4         1
   4 │     4  2         no value
   5 │     4  1         2
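
To make the interaction between column_names and name_join_pattern concrete, here is a minimal sketch of how a key path could be turned into a column name. resolve_column_name is a hypothetical helper written for this illustration, not a function exported by ExpandNestedData.jl, and the package's internals may differ. It assumes tuple keys, following the documented Dict{Tuple, Symbol} type.

```julia
# Hypothetical helper for illustration; not part of the ExpandNestedData.jl API.
function resolve_column_name(path, column_names; name_join_pattern = "_")
    # Default name: join the keys along the path with the separator
    default_name = Symbol(join(path, name_join_pattern))
    # An explicit entry for this path, if present, overrides the default
    return get(column_names, Tuple(path), default_name)
end

# Assumed mapping, keyed by tuples as in the documented Dict{Tuple, Symbol} type
overrides = Dict{Tuple, Symbol}((:a, :b) => :Column_B)

resolve_column_name([:a, :b], overrides)                           # explicit override wins
resolve_column_name([:a, :c], overrides)                           # keys joined with "_"
resolve_column_name([:a, :c], overrides; name_join_pattern = ".")  # keys joined with "."
```

Paths without an entry in the mapping fall back to the joined name, so only the columns that need renaming have to be listed.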

Using ColumnDefinitions

Instead of setting the configurations for the whole dataset, you can use a Vector{ColumnDefinition} to control how each column is handled. Using ColumnDefinitions has the added benefit of letting you ignore certain fields from the input. ColumnDefinition takes a Vector or Tuple of keys that acts as the path to the values for the column. It also supports most of the same keyword arguments as the regular expand API, with the following exceptions:

  • column_names is column_name and accepts a single Symbol
  • No support for lazy_columns
  • column_style does not apply

column_defs = [
    ColumnDefinition([:d]; column_name = :ColumnD),
    ColumnDefinition([:a, :b]),
    ColumnDefinition([:e, :f]; column_name = :MissingColumn, default_value="Missing branch")
]

expand(message, column_defs) |> DataFrame
5×3 DataFrame
 Row │ MissingColumn   ColumnD  a_b
     │ String          Int64    Int64?
─────┼──────────────────────────────────
   1 │ Missing branch        4  missing
   2 │ Missing branch        4        3
   3 │ Missing branch        4        4
   4 │ Missing branch        4        2
   5 │ Missing branch        4        1
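
The MissingColumn above is filled entirely with its default_value because the path [:e, :f] does not exist anywhere in the input. A minimal sketch of that idea, following a key path and falling back to a default when a key is absent, might look like the following. value_at_path is a hypothetical helper written for this illustration, not part of ExpandNestedData.jl, and it ignores arrays for simplicity.

```julia
# Hypothetical helper for illustration; not part of the ExpandNestedData.jl API.
function value_at_path(data, path, default_value)
    node = data
    for key in path
        # Stop as soon as a key along the path is absent from this branch
        if !(node isa AbstractDict) || !haskey(node, key)
            return default_value
        end
        node = node[key]
    end
    return node
end

# Small nested dictionary standing in for one branch of parsed input
record = Dict(:d => 4, :a => Dict(:b => 3))

value_at_path(record, [:d], missing)               # path exists at the top level
value_at_path(record, [:a, :b], missing)           # path exists two levels down
value_at_path(record, [:e, :f], "Missing branch")  # path is absent, default is used
```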

ColumnStyles

In the examples above, we've used the :flat column style. However, we can also maintain the nesting hierarchy of the source data.

using TypedTables

tbl = expand(message; column_style = :nested)

Now, our table has its columns nested, so we can access a specific column using dot syntax.

tbl.a.b

Furthermore, rows(tbl) returns a nested NamedTuple for each row.

tbl |> rows |> first

API

ExpandNestedData.expand - Method
expand(data, column_defs=nothing; 
        default_value = missing, 
        lazy_columns::Bool = false,
        pool_arrays::Bool = false, 
        column_names::Dict = Dict{Tuple, Symbol}(),
        column_style::Symbol=:flat, 
        name_join_pattern = "_")

Expand a nested data structure into a Tables.jl-compliant table.

Args:

  • data::Any - The nested data to unpack
  • column_defs::Vector{ColumnDefinition} - A list of paths to follow in data, ignoring other branches. Optional. Default: nothing.

Kwargs:

  • default_value::Any - When a key exists in one branch but not in another, the value used to fill the gap. Default: missing.
  • lazy_columns::Bool - If true, return columns using a lazy iterator. If false, collect into regular vectors before returning. Default: false (collect).
  • pool_arrays::Bool - If true, use PooledArrays to collect the columns. Default: false.
  • column_names::Dict{Tuple, Symbol} - A lookup to replace column names in the final result with any other symbol
  • column_style::Symbol - Choose returned column style from :nested or :flat. If nested, column_names are ignored and a TypedTables.Table is returned in which the columns are nested in the same structure as the source data. Default: :flat
  • name_join_pattern::String - A pattern to put between the keys when joining the path into a column name. Default: "_".

Returns

::NamedTuple when column_style = :flat or TypedTables.Table when column_style = :nested.

ExpandNestedData.ColumnDefinitions.ColumnDefinition - Method
ColumnDefinition(field_path; column_name=nothing, flatten_arrays=false, default_value=missing, pool_arrays=false)

Args

  • field_path: Vector or Tuple of keys/fieldnames that constitute a path from the top of the data to the values to extract for the column

Keyword Args

  • column_name::Symbol: A name for the resulting column. If nothing, defaults to joining the keys in field_path with name_join_pattern (snake case with the default "_").
  • default_value: When the field_path keys do not exist on one or more branches, fill with this value. Default: missing
  • pool_arrays::Bool: When collecting values for this column, choose whether to use PooledArrays instead of Base.Vector. Default: false (use Vector)
  • name_join_pattern::String: The separator for joining field paths into column names. Default: "_"

Returns

::ColumnDefinition
