As tools for Python type annotations (or hints) have evolved, more complex data structures can be typed, improving maintainability and static analysis. Arrays and DataFrames, as complex containers, have only recently supported complete type annotations in Python. NumPy 1.22 introduced generic specification of arrays and dtypes. Building on NumPy’s foundation, StaticFrame 2.0 introduced complete type specification of DataFrames, employing NumPy primitives and variadic generics. This article demonstrates practical approaches to fully type-hinting arrays and DataFrames, and shows how the same annotations can improve code quality with both static analysis and runtime validation.
StaticFrame is an open-source DataFrame library of which I am an author.
Type hints (see PEP 484) improve code quality in a number of ways. Instead of using variable names or comments to communicate types, Python-object-based type annotations provide maintainable and expressive tools for type specification. These type annotations can be tested with type checkers such as mypy or pyright, quickly discovering potential bugs without executing code.
The same annotations can be used for runtime validation. While reliance on duck-typing over runtime validation is common in Python, runtime validation is more often needed with complex data structures such as arrays and DataFrames. For example, an interface expecting a DataFrame argument, if given a Series, might not need explicit validation, as usage of the wrong type will likely raise an exception. However, an interface expecting a 2D array of floats, if given an array of Booleans, might benefit from validation, as usage of the wrong type may not raise.
Many important typing utilities are only available with the most recent versions of Python. Fortunately, the typing-extensions package back-ports standard library utilities for older versions of Python. A related challenge is that type checkers can take time to implement full support for new features: many of the examples shown here require at least mypy 1.9.0.
Without type annotations, a Python function signature gives no indication of the expected types. For example, the function below might take and return any types:
def process0(v, q): ... # no type information
By adding type annotations, the signature informs readers of the expected types. With modern Python, user-defined and built-in classes can be used to specify types, with additional resources (such as Any, Iterator, cast(), and Annotated) found in the standard library typing module. For example, the interface below improves the one above by making expected types explicit:
def process0(v: int, q: bool) -> list[float]: ...
When used with a type checker like mypy, code that violates the specifications of the type annotations will raise an error during static analysis (shown as comments, below). For example, providing an integer when a Boolean is required is an error:
x = process0(v=5, q=20)
# tp.py: error: Argument "q" to "process0"
# has incompatible type "int"; expected "bool" [arg-type]
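For contrast, the short sketch below shows what happens when the same call is executed without a type checker or runtime validator. It reuses the process0 signature from above with an illustrative body; the point is only that plain Python stores annotations without enforcing them.

```python
def process0(v: int, q: bool) -> list[float]:
    # Illustrative body: scale each integer by a factor chosen by q.
    return [x * (0.5 if q else 0.25) for x in range(v)]


# q is an int rather than a bool, yet nothing raises at runtime:
# 20 is simply treated as a truthy value.
result = process0(v=5, q=20)

# The annotations are stored on the function object but never checked.
annotations = process0.__annotations__
```

The call succeeds silently, which is exactly the kind of mismatch that static analysis reports before execution.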
Static analysis can only validate statically defined types. The full range of runtime inputs and outputs is often more diverse, suggesting some form of runtime validation. The best of both worlds is possible by reusing type annotations for runtime validation. While there are libraries that do this (e.g., typeguard and beartype), StaticFrame offers CallGuard, a tool specialized for comprehensive array and DataFrame type-annotation validation.
A Python decorator is ideal for leveraging annotations for runtime validation. CallGuard offers two decorators: @CallGuard.check, which raises an informative Exception on error, and @CallGuard.warn, which issues a warning.
Further extending the process0 function above with @CallGuard.check, the same type annotations can be used to raise an Exception (shown again as comments) when runtime objects violate the requirements of the type annotations:
import static_frame as sf

@sf.CallGuard.check
def process0(v: int, q: bool) -> list[float]:
    return [x * (0.5 if q else 0.25) for x in range(v)]

z = process0(v=5, q=20)
# static_frame.core.type_clinic.ClinicError:
# In args of (v: int, q: bool) -> list[float]
# └── Expected bool, provided int invalid
While type annotations must be valid Python, they are irrelevant at runtime and can be wrong: it is possible to have correctly verified types that do not reflect runtime reality. As shown above, reusing type annotations for runtime checks ensures annotations are valid.
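To make the general mechanism concrete, here is a minimal sketch of how a decorator can read annotations and check arguments against them. The names check_simple and scale are hypothetical, the decorator handles only plain classes (not generics), and it is not how CallGuard is implemented.

```python
import functools
import inspect
from typing import get_origin, get_type_hints


def check_simple(func):
    """Hypothetical decorator: isinstance-check arguments against plain class annotations."""
    hints = get_type_hints(func)
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only unparameterized classes are checked; generic aliases are skipped.
            is_plain_class = get_origin(expected) is None and isinstance(expected, type)
            if is_plain_class and not isinstance(value, expected):
                raise TypeError(
                    f'{name}: expected {expected.__name__}, '
                    f'provided {type(value).__name__}'
                )
        return func(*args, **kwargs)

    return wrapper


@check_simple
def scale(v: int, q: bool) -> list[float]:
    return [x * (0.5 if q else 0.25) for x in range(v)]


valid = scale(2, True)

message = None
try:
    scale(v=5, q=20)  # an int where a bool is annotated
except TypeError as e:
    message = str(e)
```

A production validator must also handle generics, unions, and return values, which is where specialized tools earn their keep.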
Python classes that permit component type specification are “generic”. Component types are specified with positional “type variables”. A list of integers, for example, is annotated with list[int]; a dictionary of floats keyed by tuples of integers and strings is annotated dict[tuple[int, str], float].
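As a small illustration, the standard library can take such generic annotations apart at runtime. The alias names TIntList and TLookup below are chosen for this sketch only.

```python
from typing import get_args, get_origin

TIntList = list[int]
TLookup = dict[tuple[int, str], float]

# get_origin returns the unparameterized class of a generic alias.
list_origin = get_origin(TIntList)
dict_origin = get_origin(TLookup)

# get_args returns the component types in positional order.
key_type, value_type = get_args(TLookup)
key_components = get_args(key_type)
```

Runtime validators rely on this kind of introspection to walk nested annotations.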
With NumPy 1.20, ndarray and dtype become generic. The generic ndarray requires two arguments, a shape and a dtype. As the usage of the first argument is still under development, Any is commonly used. The second argument, dtype, is itself a generic that requires a type variable for a NumPy type such as np.int64. NumPy also offers more general generic types such as np.integer[Any].
For example, an array of Booleans is annotated np.ndarray[Any, np.dtype[np.bool_]]; an array of any type of integer is annotated np.ndarray[Any, np.dtype[np.integer[Any]]].
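The nesting of these annotations can be inspected at runtime, as in the sketch below. It assumes NumPy 1.22 or later, where ndarray and dtype can be subscripted outside of type stubs; the variable names are illustrative.

```python
from typing import Any, get_args, get_origin

import numpy as np

# Runtime subscription of ndarray and dtype is assumed (NumPy 1.22 or later).
TNDArrayBool = np.ndarray[Any, np.dtype[np.bool_]]
TNDArrayIntAny = np.ndarray[Any, np.dtype[np.integer[Any]]]

# The outer generic holds a shape argument and a dtype argument.
shape_arg, dtype_arg = get_args(TNDArrayBool)

# The dtype generic in turn holds a NumPy scalar type.
(scalar_type,) = get_args(dtype_arg)

array_origin = get_origin(TNDArrayBool)
dtype_origin = get_origin(dtype_arg)
```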
As generic annotations with component type specifications can become verbose, it is practical to store them as type aliases (here prefixed with “T”). The following code defines such aliases and then uses them in a function.
from typing import Any
import numpy as np

TNDArrayInt8 = np.ndarray[Any, np.dtype[np.int8]]
TNDArrayBool = np.ndarray[Any, np.dtype[np.bool_]]
TNDArrayFloat64 = np.ndarray[Any, np.dtype[np.float64]]

def process1(
    v: TNDArrayInt8,
    q: TNDArrayBool,
) -> TNDArrayFloat64:
    s: TNDArrayFloat64 = np.where(q, 0.5, 0.25)
    return v * s
As before, when used with mypy, code that violates the type annotations will raise an error during static analysis. For example, providing an integer array when a Boolean array is required is an error:
v1: TNDArrayInt8 = np.arange(20, dtype=np.int8)
x = process1(v1, v1)
# tp.py: error: Argument 2 to "process1" has incompatible type
# "ndarray[Any, dtype[floating[_64Bit]]]"; expected "ndarray[Any, dtype[bool_]]" [arg-type]
The interface requires 8-bit signed integers (np.int8); attempting to use a differently sized integer is also an error:
TNDArrayInt64 = np.ndarray[Any, np.dtype[np.int64]]
v2: TNDArrayInt64 = np.arange(20, dtype=np.int64)
q: TNDArrayBool = np.arange(20) % 3 == 0
x = process1(v2, q)
# tp.py: error: Argument 1 to "process1" has incompatible type
# "ndarray[Any, dtype[signedinteger[_64Bit]]]"; expected "ndarray[Any, dtype[signedinteger[_8Bit]]]" [arg-type]
While some interfaces might benefit from such narrow numeric type specifications, broader specification is possible with NumPy’s generic types such as np.integer[Any], np.signedinteger[Any], np.floating[Any], etc. For example, we can define a new function that accepts a signed integer of any size. Static analysis now passes with both TNDArrayInt8 and TNDArrayInt64 arrays.
TNDArrayIntAny = np.ndarray[Any, np.dtype[np.signedinteger[Any]]]

def process2(
    v: TNDArrayIntAny, # a more flexible interface
    q: TNDArrayBool,
) -> TNDArrayFloat64:
    s: TNDArrayFloat64 = np.where(q, 0.5, 0.25)
    return v * s

x = process2(v1, q) # no mypy error
x = process2(v2, q) # no mypy error
Just as shown above with elements, generically specified NumPy arrays can be validated at runtime if decorated with CallGuard.check:
@sf.CallGuard.check
def process3(v: TNDArrayIntAny, q: TNDArrayBool) -> TNDArrayFloat64:
    s: TNDArrayFloat64 = np.where(q, 0.5, 0.25)
    return v * s

x = process3(v1, q) # no error, same as mypy
x = process3(v2, q) # no error, same as mypy
v3: TNDArrayFloat64 = np.arange(20, dtype=np.float64) * 0.5
x = process3(v3, q) # error
# static_frame.core.type_clinic.ClinicError:
# In args of (v: ndarray[Any, dtype[signedinteger[Any]]],
# q: ndarray[Any, dtype[bool_]]) -> ndarray[Any, dtype[float64]]
# └── ndarray[Any, dtype[signedinteger[Any]]]
#     └── dtype[signedinteger[Any]]
#         └── Expected signedinteger, provided float64 invalid
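The dtype relationship that this error reports can be approximated by hand with NumPy's scalar type hierarchy. The helper is_signed_integer below is a hypothetical illustration, not part of NumPy or StaticFrame.

```python
import numpy as np

v_int8 = np.arange(20, dtype=np.int8)
v_int64 = np.arange(20, dtype=np.int64)
v_uint8 = np.arange(20, dtype=np.uint8)
v_float = np.arange(20, dtype=np.float64) * 0.5


def is_signed_integer(a: np.ndarray) -> bool:
    # np.signedinteger is the abstract parent of int8, int16, int32, and int64.
    return bool(np.issubdtype(a.dtype, np.signedinteger))


results = [is_signed_integer(a) for a in (v_int8, v_int64, v_uint8, v_float)]
```

Both signed integer arrays satisfy the check regardless of size, while the unsigned and float arrays do not, mirroring the behavior of np.signedinteger[Any] in the annotation.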
StaticFrame provides utilities to extend runtime validation beyond type checking. Using the typing module’s Annotated class (see PEP 593), we can extend the type specification with one or more StaticFrame Require objects. For example, to validate that an array has a 1D shape of (24,), we can replace TNDArrayIntAny with Annotated[TNDArrayIntAny, sf.Require.Shape(24)]. To validate that a float array has no NaNs, we can replace TNDArrayFloat64 with Annotated[TNDArrayFloat64, sf.Require.Apply(lambda a: ~np.isnan(a).any())].
Implementing a new function, we can require that all input and output arrays have the shape (24,). Calling this function with the previously created arrays raises an error:
from typing import Annotated

@sf.CallGuard.check
def process4(
    v: Annotated[TNDArrayIntAny, sf.Require.Shape(24)],
    q: Annotated[TNDArrayBool, sf.Require.Shape(24)],
) -> Annotated[TNDArrayFloat64, sf.Require.Shape(24)]:
    s: TNDArrayFloat64 = np.where(q, 0.5, 0.25)
    return v * s

x = process4(v1, q) # types pass, but Require.Shape fails
# static_frame.core.type_clinic.ClinicError:
# In args of (v: Annotated[ndarray[Any, dtype[int8]], Shape((24,))], q: Annotated[ndarray[Any, dtype[bool_]], Shape((24,))]) -> Annotated[ndarray[Any, dtype[float64]], Shape((24,))]
# └── Annotated[ndarray[Any, dtype[int8]], Shape((24,))]
#     └── Shape((24,))
#         └── Expected shape ((24,)), provided shape (20,)
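The way Annotated carries extra validators alongside a type can be sketched with the standard library alone. ShapeRule below is a hypothetical stand-in written for this example; it is not StaticFrame's Require.Shape and makes no claim about how that class works internally.

```python
from dataclasses import dataclass
from typing import Annotated, Any, get_args, get_type_hints

import numpy as np


@dataclass(frozen=True)
class ShapeRule:
    """Hypothetical shape requirement attached as Annotated metadata."""
    shape: tuple[int, ...]

    def validate(self, a: np.ndarray) -> bool:
        return a.shape == self.shape


TArray24 = Annotated[np.ndarray[Any, np.dtype[np.float64]], ShapeRule((24,))]


def total(v: TArray24) -> float:
    return float(v.sum())


# include_extras=True keeps the Annotated metadata instead of stripping it.
hints = get_type_hints(total, include_extras=True)
base_type, rule = get_args(hints['v'])

passes = rule.validate(np.zeros(24))
fails = rule.validate(np.zeros(20))
```

Static type checkers ignore the metadata and see only the base type, so a runtime validator is free to interpret it.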
Just like a dictionary, a DataFrame is a complex data structure composed of many component types: the index labels, column labels, and the column values are all distinct types.
A challenge of generically specifying a DataFrame is that a DataFrame has a variable number of columns, where each column might be a different type. The Python TypeVarTuple variadic generic specifier (see PEP 646), first released in Python 3.11, permits defining a variable number of column type variables.
With StaticFrame 2.0, Frame, Series, Index and related containers become generic. Support for variable column type definitions is provided by TypeVarTuple, back-ported with the implementation in typing-extensions for compatibility down to Python 3.9.
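A minimal sketch of a variadic generic follows. The class Table and the names TIndex and TColumns are hypothetical and unrelated to StaticFrame's own definitions; the example assumes Python 3.11 or later, or an installed typing-extensions package on older versions.

```python
from typing import Generic, TypeVar, get_args

try:
    from typing import TypeVarTuple, Unpack  # Python 3.11+
except ImportError:
    # Assumes typing-extensions is installed on older Pythons.
    from typing_extensions import TypeVarTuple, Unpack

TIndex = TypeVar('TIndex')
TColumns = TypeVarTuple('TColumns')


class Table(Generic[TIndex, Unpack[TColumns]]):
    """Hypothetical container: one index type, then any number of column types."""

    def __init__(self, index, *columns):
        self.index = index
        self.columns = columns


# The same generic class accepts different numbers of column types.
TTwoColumns = Table[str, int, float]
TThreeColumns = Table[str, int, int, int]

two_args = get_args(TTwoColumns)
three_args = get_args(TThreeColumns)
```

A single TypeVarTuple absorbs all trailing type arguments, which is what allows one class to describe tables of any width.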
A generic Frame requires two or more type variables: the type of the index, the type of the columns, and zero or more specifications of columnar value types specified with NumPy types. A generic Series requires two type variables: the type of the index and a NumPy type for the values. The Index is itself generic, also requiring a NumPy type as a type variable.
With generic specification, a Series of floats, indexed by dates, can be annotated with sf.Series[sf.IndexDate, np.float64]. A Frame with dates as index labels, strings as column labels, and column values of integers and floats can be annotated with sf.Frame[sf.IndexDate, sf.Index[np.str_], np.int64, np.float64].
Given a complex Frame, deriving the annotation might be difficult. StaticFrame offers the via_type_clinic interface to provide a complete generic specification for any component at runtime:
>>> v4 = sf.Frame.from_fields([range(5), np.arange(3, 8) * 0.5],
...     columns=('a', 'b'), index=sf.IndexDate.from_date_range('2021-12-30', '2022-01-03'))
>>> v4
<Frame>
<Index>         a       b         <<U1>
<IndexDate>
2021-12-30      0       1.5
2021-12-31      1       2.0
2022-01-01      2       2.5
2022-01-02      3       3.0
2022-01-03      4       3.5
<datetime64[D]> <int64> <float64>
>>> # get a string representation of the annotation
>>> v4.via_type_clinic
Frame[IndexDate, Index[str_], int64, float64]
As shown with arrays, storing annotations as type aliases permits reuse and more concise function signatures. Below, a new function is defined with generic Frame and Series arguments fully annotated. A cast is required as not all operations can statically resolve their return type.
from typing import cast

TFrameDateInts = sf.Frame[sf.IndexDate, sf.Index[np.str_], np.int64, np.int64]
TSeriesYMBool = sf.Series[sf.IndexYearMonth, np.bool_]
TSeriesDFloat = sf.Series[sf.IndexDate, np.float64]

def process5(v: TFrameDateInts, q: TSeriesYMBool) -> TSeriesDFloat:
    t = v.index.iter_label().apply(lambda l: q[l.astype('datetime64[M]')]) # type: ignore
    s = np.where(t, 0.5, 0.25)
    return cast(TSeriesDFloat, (v.via_T * s).mean(axis=1))
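It helps to remember that cast only informs the type checker and performs no conversion or validation at runtime. The short sketch below, using illustrative names, shows that the returned object is the very same object that was passed in.

```python
from typing import cast

value: object = [1.5, 2.5]

# cast tells the type checker to treat value as list[float];
# at runtime it returns its argument unchanged.
floats = cast(list[float], value)
same_object = floats is value

# Even an incorrect cast does not raise: nothing is checked at runtime.
not_really_int = cast(int, 'text')
```

This is one reason that pairing casts with runtime validation of the function's return annotation can be valuable.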
These more complex annotated interfaces can also be validated with mypy. Below, a Frame without the expected column value types is passed, causing mypy to error (shown as comments, below).
TFrameDateIntFloat = sf.Frame[sf.IndexDate, sf.Index[np.str_], np.int64, np.float64]
v5: TFrameDateIntFloat = sf.Frame.from_fields([range(5), np.arange(3, 8) * 0.5],
    columns=('a', 'b'), index=sf.IndexDate.from_date_range('2021-12-30', '2022-01-03'))

q: TSeriesYMBool = sf.Series([True, False],
    index=sf.IndexYearMonth.from_date_range('2021-12', '2022-01'))
x = process5(v5, q)
# tp.py: error: Argument 1 to "process5" has incompatible type
# "Frame[IndexDate, Index[str_], signedinteger[_64Bit], floating[_64Bit]]"; expected
# "Frame[IndexDate, Index[str_], signedinteger[_64Bit], signedinteger[_64Bit]]" [arg-type]
To use the same type hints for runtime validation, the sf.CallGuard.check decorator can be applied to process5. Below, a Frame of three integer columns is provided where a Frame of two columns is expected.
# a Frame of three columns of integers
TFrameDateIntIntInt = sf.Frame[sf.IndexDate, sf.Index[np.str_], np.int64, np.int64, np.int64]
v6: TFrameDateIntIntInt = sf.Frame.from_fields([range(5), range(3, 8), range(1, 6)],
    columns=('a', 'b', 'c'), index=sf.IndexDate.from_date_range('2021-12-30', '2022-01-03'))

x = process5(v6, q)
# static_frame.core.type_clinic.ClinicError:
# In args of (v: Frame[IndexDate, Index[str_], signedinteger[_64Bit], signedinteger[_64Bit]],
# q: Series[IndexYearMonth, bool_]) -> Series[IndexDate, float64]
# └── Frame[IndexDate, Index[str_], signedinteger[_64Bit], signedinteger[_64Bit]]
#     └── Expected Frame has 2 dtype, provided Frame has 3 dtype
It might not be practical to annotate every column of every Frame: it is common for interfaces to work with Frames that have a variable number of columns. TypeVarTuple supports this through the usage of *tuple[] expressions (introduced in Python 3.11, back-ported with the Unpack annotation). For example, the function above could be defined to take any number of integer columns with the annotation Frame[IndexDate, Index[np.str_], *tuple[np.int64, ...]], where *tuple[np.int64, ...] means zero or more integer columns.
The same implementation can be annotated with a far more general specification of columnar types. Below, the column values are annotated with np.number[Any] (permitting any type of numeric NumPy type) and a *tuple[] expression (permitting any number of columns): *tuple[np.number[Any], ...]. Now neither mypy nor CallGuard errors with either previously created Frame.
TFrameDateNums = sf.Frame[sf.IndexDate, sf.Index[np.str_], *tuple[np.number[Any], ...]]

@sf.CallGuard.check
def process6(v: TFrameDateNums, q: TSeriesYMBool) -> TSeriesDFloat:
    t = v.index.iter_label().apply(lambda l: q[l.astype('datetime64[M]')]) # type: ignore
    s = np.where(t, 0.5, 0.25)
    return cast(TSeriesDFloat, (v.via_T * s).mean(axis=1))

x = process6(v5, q) # a Frame with integer, float columns passes
x = process6(v6, q) # a Frame with three integer columns passes
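The difference between a fixed list of column types and a variadic specification can be sketched with plain NumPy dtypes. The helper columns_match is hypothetical and deliberately simplified; it only illustrates the matching logic and is not how CallGuard evaluates Frame annotations.

```python
import numpy as np


def columns_match(dtypes, expected, variadic=False):
    """Hypothetical helper: compare column dtypes with expected NumPy scalar types.

    With variadic=True, `expected` holds one type that every column must
    satisfy, analogous to a *tuple[T, ...] expression.
    """
    if variadic:
        (kind,) = expected
        return all(np.issubdtype(dt, kind) for dt in dtypes)
    # Fixed specification: the count must match, and each column its type.
    return len(dtypes) == len(expected) and all(
        np.issubdtype(dt, kind) for dt, kind in zip(dtypes, expected)
    )


int_float = [np.dtype(np.int64), np.dtype(np.float64)]
three_ints = [np.dtype(np.int64)] * 3

fixed_mixed = columns_match(int_float, (np.int64, np.int64))
fixed_three = columns_match(three_ints, (np.int64, np.int64))
any_numbers_mixed = columns_match(int_float, (np.number,), variadic=True)
any_numbers_three = columns_match(three_ints, (np.number,), variadic=True)
```

Both column layouts fail the fixed two-integer specification, for different reasons, yet both satisfy the variadic numeric one.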
As with NumPy arrays, Frame annotations can wrap Require specifications in Annotated generics, permitting the definition of additional run-time validations.
While StaticFrame might be the first DataFrame library to offer complete generic specification and a unified solution for both static type analysis and run-time type validation, other array and DataFrame libraries offer related utilities.
Neither the Tensor class in PyTorch (2.4.0) nor the Tensor class in TensorFlow (2.17.0) supports generic type or shape specification. While both libraries offer a TensorSpec object that can be used to perform run-time type and shape validation, static type checking with tools like mypy is not supported.
As of Pandas 2.2.2, neither the Pandas Series nor DataFrame supports generic type specifications. A number of third-party packages have offered partial solutions. The pandas-stubs library, for example, provides type annotations for the Pandas API, but does not make the Series or DataFrame classes generic. The Pandera library permits defining DataFrameSchema classes that can be used for run-time validation of Pandas DataFrames. For static analysis with mypy, Pandera offers alternative DataFrame and Series subclasses that permit generic specification with the same DataFrameSchema classes. This approach does not permit the expressive opportunities of using generic NumPy types or the unpack operator for supplying variadic generic expressions.
Python type annotations can make static analysis of types a valuable check of code quality, discovering errors before code is even executed. Up until recently, an interface might take an array or a DataFrame, but no specification of the types contained in those containers was possible. Now, complete specification of component types is possible in NumPy and StaticFrame, permitting more powerful static analysis of types.
Providing correct type annotations is an investment. Reusing those annotations for runtime checks provides the best of both worlds. StaticFrame’s CallGuard runtime type checker is specialized to correctly evaluate fully specified generic NumPy types, as well as all generic StaticFrame containers.