[FSTORE-2030] Add support for specifying lookback windows for PIT queries #583

Open
manu-sj wants to merge 2 commits into logicalclocks:main from manu-sj:FSTORE-2030

@manu-sj manu-sj commented May 21, 2026

Summary

  • Adds a user-guide section, "Lookback window for PIT joins", to feature_view/batch-data.md covering the two modes, the dict and dataclass call shapes, partition pruning behavior, and the one-sided lower-only form.
  • Cross-links from feature_view/training-data.md so users hitting create_training_data find the same reference.

JIRA

FSTORE-2030

Test plan

  • One-sentence-per-line convention respected.
  • Python code blocks valid Python (run through ruff via the workspace policy).
  • Reviewer to verify the page renders correctly in the mkdocs preview.

Companion PRs

  • Backend: logicalclocks/hopsworks-ee → branch FSTORE-2030
  • SDK: logicalclocks/hopsworks-api → branch FSTORE-2030
  • Integration tests: logicalclocks/loadtest → branch FSTORE-2030

…ries

https://hopsworks.atlassian.net/browse/FSTORE-2030

PIT joins generate predicates of the form `feature_fg.event_time <=
root_fg.event_time` to select the latest matching feature record.
Because this is a range join rather than an equality join, partition
pruning cannot eliminate older partitions of the joined feature group:
the latest valid value may live in any of them. As feature groups
grow with daily ingestion, every PIT query scans more historical
partitions, inflating IO, shuffle volume, and execution time.
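The scanning problem described above can be illustrated with a small in-memory sketch.
It is not the backend's implementation: the partition layout, the `pit_lookup` helper, and the sample rows are all hypothetical, and it ignores any upper-side pruning an engine may already do.
It only shows that a `<=` range predicate gives no constant lower cut-off, while a constant lower bound does.

```python
from datetime import date

# Hypothetical joined feature group: one list of rows per daily partition.
# Each row is (entity_id, event_time, value).
feature_partitions = {
    date(2026, 5, 1): [("u1", date(2026, 5, 1), 10.0)],
    date(2026, 5, 2): [("u2", date(2026, 5, 2), 20.0)],
    date(2026, 5, 3): [("u1", date(2026, 5, 3), 30.0)],
    date(2026, 5, 4): [("u2", date(2026, 5, 4), 40.0)],
}


def pit_lookup(entity_id, root_event_time, lower_bound=None):
    """Return (latest matching value, number of partitions scanned)."""
    best = None
    scanned = 0
    for partition_day, rows in feature_partitions.items():
        # Only a constant lower bound lets us skip a partition without reading it.
        if lower_bound is not None and partition_day < lower_bound:
            continue
        scanned += 1
        for row_entity, event_time, value in rows:
            # PIT predicate: feature_fg.event_time <= root_fg.event_time
            if row_entity == entity_id and event_time <= root_event_time:
                if best is None or event_time > best[0]:
                    best = (event_time, value)
    return (best[1] if best else None), scanned


# Without a bound, every partition is read to find the latest value.
unbounded = pit_lookup("u1", date(2026, 5, 4))
# With a constant lower bound, older partitions are skipped up front.
bounded = pit_lookup("u1", date(2026, 5, 4), lower_bound=date(2026, 5, 3))
print(unbounded, bounded)
```

The trade-off is visible in the sketch: a lower bound placed after the latest valid record would drop that record, so the window has to be chosen to cover the expected staleness of the features.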

[FSTORE-2030] adds an optional `lookback` parameter on
`FeatureView.get_batch_data`, `create_training_data`, and the split
variants. `Lookback(key=..., start=..., end=...)` declares a
constant-bound window that the backend ANDs onto the root FG and
each joined FG. `key="partition_key"` mode places the bound on the
partition column so flyingduck and Spark Catalyst can prune
partitions; `key="event_time"` mode emits the predicate on the
event_time column with engine-dependent pruning.
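As a rough illustration of the two modes, the sketch below renders a constant-bound predicate per feature group alias.
Everything in it is an assumption made for illustration: `LookbackSketch` is a stand-in rather than the SDK's `Lookback` class, the `ds` partition column name, the `DATE '...'` literal form, and the inclusive bounds are guesses, and the real SQL emitted by the backend may differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LookbackSketch:
    """Hypothetical stand-in for the PR's Lookback; not the SDK class."""

    key: str  # "partition_key" or "event_time"
    start: date
    end: date


def window_predicate(lookback, alias, partition_column="ds", event_time_column="event_time"):
    """Render a constant-bound window predicate for one feature group alias."""
    if lookback.key == "partition_key":
        column = partition_column  # bound on the partition column, so engines can prune
    elif lookback.key == "event_time":
        column = event_time_column  # bound on event_time; pruning depends on the engine
    else:
        raise ValueError(f"unsupported lookback key: {lookback.key!r}")
    start = lookback.start.isoformat()
    end = lookback.end.isoformat()
    return f"{alias}.{column} >= DATE '{start}' AND {alias}.{column} <= DATE '{end}'"


lookback = LookbackSketch(key="partition_key", start=date(2026, 5, 1), end=date(2026, 5, 7))

# The same window is ANDed onto the root FG and onto each joined FG.
predicates = {alias: window_predicate(lookback, alias) for alias in ("root_fg", "feature_fg")}
for alias, predicate in predicates.items():
    print(alias, "->", predicate)
```

Because both bounds are constants rather than references to `root_fg.event_time`, an engine can evaluate them against partition metadata before reading any data.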

The user guides document the new `lookback` parameter on the
batch-data and training-data flows, with worked examples for both
`partition_key` and `event_time` modes. The pages call out that
`start` is required and `end` is optional, and that omitting `end`
falls back to the existing upper-only auto-pruning derived from
`query.end_time`. The pages cross-link to each other so users on
either entry point see the same vocabulary.

Reviewed-by: OpenAI Codex (GPT-5 via codex-plugin-cc 1.0.4) <codex@openai.com>
Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@manu-sj manu-sj marked this pull request as ready for review May 23, 2026 23:55
https://hopsworks.atlassian.net/browse/FSTORE-2030

Updates the user-facing lookback documentation to match the
implementation that landed after PR #3046 removed the legacy
auto-partition-pruning machinery.

Drops the STRING-partition mention from batch-data.md — only DATE
partition columns are eligible for PARTITION_KEY mode now that the
skipFromSql elision mechanism is gone. Corrects the end-omitted
behavior from a fallback claim ("falls back to query end_time
auto-pruning") to the actual emit shape ("emits a one-sided lower
bound only"). Replaces a false per-FG-key API description (which
claimed (name, version) and (name, version, prefix) tuple keys plus
a non-existent "available triples" ambiguity error) with the two key
forms the SDK actually accepts: a bare string matching every version
of the named Feature Group, or a Feature Group instance matching the
exact (name, version). Clarifies that the lookback predicate applies
to the root and every joined Feature Group, not joined-only.
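The two key forms described above can be sketched as a simple matching rule.
This is a hypothetical model, not SDK code: `FeatureGroupSketch` and `key_matches` are illustrative names, and how the SDK resolves a Feature Group that is matched by both a string key and an instance key is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureGroupSketch:
    """Hypothetical stand-in for a Feature Group, identified by (name, version)."""

    name: str
    version: int


def key_matches(key, feature_group):
    """Return True if a per-FG lookback key selects the given feature group."""
    if isinstance(key, str):
        # A bare string matches every version of the named Feature Group.
        return key == feature_group.name
    # A Feature Group instance matches only the exact (name, version).
    return key == feature_group


feature_groups = [
    FeatureGroupSketch("transactions", 1),
    FeatureGroupSketch("transactions", 2),
    FeatureGroupSketch("profiles", 1),
]

by_name = [fg.version for fg in feature_groups if key_matches("transactions", fg)]
by_instance = [
    fg.version
    for fg in feature_groups
    if key_matches(FeatureGroupSketch("transactions", 2), fg)
]
print(by_name, by_instance)
```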

Adds a "Combining `lookback` with other filters" section explaining
how the predicate interacts with sub-query filters and outer filters,
including the mixed-Feature-Group outer-filter case where root pruning
is lost while joined FGs still prune via their own wrapped predicates.

training-data.md picks up the same root-plus-joined wording so the
two pages stay consistent.

Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>