ExpandNestedData.jl
ExpandNestedData.jl is a small package that consumes nested data structures, such as dictionaries of dictionaries or structs of structs, and produces a normalized, Tables.jl-compliant NamedTuple. It can be used with JSON3.jl, XMLDict.jl, and other packages that parse file formats structured as denormalized data. It operates similarly to Pandas.json_normalize, but it is much more flexible.
Getting Started
Install
using Pkg
Pkg.add("ExpandNestedData")
Basic Usage
ExpandNestedData provides a single function, expand, to flatten out nested data.
using ExpandNestedData
using JSON3
using DataFrames
message = JSON3.read("""
{
"a" : [
{"b" : 1, "c" : 2},
{"b" : 2},
{"b" : [3, 4], "c" : 1},
{"b" : []}
],
"d" : 4
}
"""
)
expand(message) |> DataFrame
Row | d | a_b | a_c |
---|---|---|---|
 | Int64 | Int64? | Int64? |
1 | 4 | missing | missing |
2 | 4 | 3 | 1 |
3 | 4 | 4 | 1 |
4 | 4 | 2 | missing |
5 | 4 | 1 | 2 |
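Because expand also accepts plain dictionaries, the same idea can be sketched without a JSON parser. The snippet below is a minimal illustration, assuming Symbol-keyed Dicts are traversed the same way as the JSON3 object above; the data variable and its contents are made up for this example.

```julia
using ExpandNestedData
using Tables

# Hypothetical input: a dictionary holding an array of dictionaries.
data = Dict(
    :a => [Dict(:b => 1, :c => 2), Dict(:b => 2)],
    :d => 4,
)

tbl = expand(data)

# The result is Tables.jl-compliant, so the generic Tables.jl accessors apply.
println(Tables.istable(tbl))
println(Tables.columnnames(tbl))
println(Tables.getcolumn(tbl, :d))
```

Going through the Tables.jl accessors rather than a specific sink such as DataFrame keeps the example independent of any one table package.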
Configuring Options
While expand can produce a table out-of-the-box, it is often useful to configure how it handles the normalization process. ExpandNestedData.jl offers two ways to set these configurations. You can set them at the table level with expand's keyword arguments, or exercise finer control with per-column configurations.
Keyword Arguments
Parameter | Description |
---|---|
default_value::Any | The value used to fill a column when a key exists in one branch but not in another. Default: missing |
lazy_columns::Bool | If true, return columns as a custom lazy iterator instead of collecting them as materialized vectors. This option can speed things up if you only need to access a subset of rows once. It is usually better to materialize the columns since getindex() on the lazy columns is expensive. Default: false |
pool_arrays::Bool | When collecting vectors for columns, choose whether to use PooledArrays instead of Base.Vector. Default: false |
column_names::Dict{Tuple, Symbol} | A mapping from key/fieldname paths to replacement column names |
column_style::Symbol | Choose the returned column style from :nested or :flat. If nested, column_names are ignored and a TypedTables.Table is returned in which the columns are nested in the same structure as the source data. Default: :flat |
name_join_pattern::String | A pattern to put between the keys when joining the path into a column name. Default: "_" |
name_map = Dict([:a, :b] => :Column_B)
expand(message; default_value="no value", pool_arrays=true, column_names=name_map) |> DataFrame
Row | d | a_b | a_c |
---|---|---|---|
 | Int64 | Union… | Union… |
1 | 4 | no value | no value |
2 | 4 | 3 | 1 |
3 | 4 | 4 | 1 |
4 | 4 | 2 | no value |
5 | 4 | 1 | 2 |
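As a further sketch of the keyword arguments in the table above, the snippet below combines default_value with name_join_pattern. It is only an illustration, assuming a Symbol-keyed Dict as input; the data variable is hypothetical and not part of the package.

```julia
using ExpandNestedData
using Tables

# Hypothetical input: the second element of :a has no :c key.
data = Dict(
    :a => [Dict(:b => 1, :c => 2), Dict(:b => 2)],
    :d => 4,
)

# Fill the missing branch with 0 and join path segments with "." instead of "_".
tbl = expand(data; default_value = 0, name_join_pattern = ".")

# Names containing "." are not plain identifiers, so build them with Symbol(...).
println(Tables.columnnames(tbl))
println(Tables.getcolumn(tbl, Symbol("a.c")))
```

A separator such as "." can help when the source keys themselves contain underscores, at the cost of needing Symbol("a.c") rather than dot syntax to reach the column.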
Using ColumnDefinitions
Instead of setting the configurations for the whole dataset, you can use a Vector{ColumnDefinition} to control how each column is handled. Using ColumnDefinitions has the added benefit of allowing you to ignore certain fields from the input. ColumnDefinition takes a Vector or Tuple of keys that act as the path to the values for the column. It also supports most of the same keyword arguments as the regular expand API, with the following exceptions:
- column_names is column_name and accepts a single Symbol
- No support for lazy_columns
- column_style does not apply
column_defs = [
ColumnDefinition([:d]; column_name = :ColumnD),
ColumnDefinition([:a, :b]),
ColumnDefinition([:e, :f]; column_name = :MissingColumn, default_value="Missing branch")
]
expand(message, column_defs) |> DataFrame
Row | MissingColumn | ColumnD | a_b |
---|---|---|---|
 | String | Int64 | Int64? |
1 | Missing branch | 4 | missing |
2 | Missing branch | 4 | 3 |
3 | Missing branch | 4 | 4 |
4 | Missing branch | 4 | 2 |
5 | Missing branch | 4 | 1 |
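The field path can also be written as a Tuple, as noted above. The following sketch selects two columns from a small Symbol-keyed Dict and ignores the rest; the data variable and the column name :C are hypothetical and chosen only for illustration.

```julia
using ExpandNestedData
using Tables

# Hypothetical input with a branch (:a, :b) that the definitions below skip.
data = Dict(
    :a => [Dict(:b => 1, :c => 2), Dict(:b => 2)],
    :d => 4,
)

column_defs = [
    # Tuple path; rename the column and fill absent values with 0.
    ColumnDefinition((:a, :c); column_name = :C, default_value = 0),
    # Vector path with the default generated name.
    ColumnDefinition([:d]),
]

tbl = expand(data, column_defs)

# Only the requested paths should appear as columns.
println(Tables.columnnames(tbl))
```

Listing only the paths of interest is a simple way to keep wide, deeply nested payloads from producing columns that are never used.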
Column Styles
In the examples above, we've used the :flat column style. However, we can also maintain the nesting hierarchy of the source data by setting column_style to :nested.
using TypedTables
tbl = expand(message; column_style = :nested)
Now, our table has its columns nested, so we can access a specific column using dot syntax.
tbl.a.b
Furthermore, rows(tbl) returns a nested NamedTuple for each row:
tbl |> rows |> first
API
ExpandNestedData.expand - Method
expand(data, column_defs=nothing;
default_value = missing,
lazy_columns::Bool = false,
pool_arrays::Bool = false,
column_names::Dict = Dict{Tuple, Symbol}(),
column_style::Symbol=:flat,
name_join_pattern = "_")
Expand a nested data structure into a Tables.jl-compliant table.
Args:
- data::Any - The nested data to unpack
- column_defs::Vector{ColumnDefinition} - A list of paths to follow in data, ignoring other branches. Optional. Default: nothing.
Kwargs:
- default_value::Any - The value used to fill a column when a key exists in one branch but not in another. Default: missing.
- lazy_columns::Bool - If true, return columns using a lazy iterator. If false, collect into regular vectors before returning. Default: false (collect).
- pool_arrays::Bool - If true, use pooled arrays to collect the columns. Default: false.
- column_names::Dict{Tuple, Symbol} - A lookup to replace column names in the final result with any other symbol
- column_style::Symbol - Choose the returned column style from :nested or :flat. If nested, column_names are ignored and a TypedTables.Table is returned in which the columns are nested in the same structure as the source data. Default: :flat
- name_join_pattern::String - A pattern to put between the keys when joining the path into a column name. Default: "_".
Returns
::NamedTuple when column_style = :flat, or TypedTables.Table when column_style = :nested.
ExpandNestedData.ColumnDefinitions.ColumnDefinition - Method
ColumnDefinition(field_path; column_name=nothing, flatten_arrays=false, default_value=missing, pool_arrays=false)
Args
- field_path: Vector or Tuple of keys/fieldnames that constitute a path from the top of the data to the values to extract for the column
Keyword Args
- column_name::Symbol: A name for the resulting column. If nothing, defaults to joining the field_path with snake case format.
- default_value: When the field_path keys do not exist on one or more branches, fill with this value. Default: missing
- pool_arrays::Bool: When collecting values for this column, choose whether to use PooledArrays instead of Base.Vector. Default: false (use Vector)
- name_join_pattern::String: The separator for joining field paths into column names. Default: "_"
Returns
::ColumnDefinition