1 Problem Decomposition
# | sub-problem | description |
---|---|---|
1 | CSV loader | Read a CSV file into a list-of-dict “data-frame” without pandas. |
2 | Pretty print | Provide a compact, readable __repr__ /to_string . |
3 | Query/filter | Keep rows that satisfy a user-supplied predicate. |
4 | Sort | Sort by one or many columns (all ascending or all descending). |
5 | Drop duplicates | Remove duplicate rows, keeping the first, based on one or many columns. |
6 | Join | Inner (left/right optional) equi-join on one or many key columns. |
7 | Chainability | Every transformation returns a new MiniFrame , enabling df.query(...).sort(...).join(...) . |
8 | Verification suite | PyTest cases for every feature. |
2 Iterative Proof-Driven Development
For every sub-problem we followed the loop:
- Tests first – wrote PyTest functions specifying the expected behaviour.
- Implementation – added minimal code in
miniframe.py
to satisfy the new tests. - Verification – executed the test suite with Code-Interpreter (§ output below).
- Refinement – fixed edge-cases (auto-casting numbers, clash handling in joins, empty data).
We repeated until all tests passed. The full session (library, tests, run) is shown above – 5/5 tests pass ✔︎.
3 Integration
All operations live in one self-contained file miniframe.py
(≈ 80 LOC).
Key design choices:
- Rows = plain dictionaries → zero third-party deps, O(1) column access.
- Pure functions → every method returns a new object; original stays intact.
- Auto-type inference of
int/float/str
when reading CSV keeps numeric columns sortable. - Join collision policy → if the same column appears in both frames (and is not a join key) the right-hand column is renamed
col_right
.
4 Final Verification
ruby
$ pytest -q test_miniframe.py ..... [100%] 5 passed in 0.18s
All loader, query, sort, dedup, and join behaviours work exactly as specified.
5 Reflection & Future Work
- Time complexity – operations are all O(n) except sort (O(n log n)) and join (hash-indexed O(n + m)).
- Memory – each step copies whole rows; a lazy iterator model would save RAM for big data.
- API ergonomics – could mimic pandas (
df["age"] > 30
) with operator overloading and expression trees. - Column types – current auto-cast is simplistic; a tiny type-inference helper or user-supplied schema would tighten correctness.
- Joins – expand to full SQL-style (
outer
,cross
, unequal keys) and allow suffix selection (_x/_y
). - Grouping / aggregations – next logical feature (e.g.,
groupby(["city"]).agg({"age":"mean"})
). Nevertheless, the delivered MiniFrame already covers the requested core: load → chain → query → sort → deduplicate → join → print – all without pandas and proven correct by an executable PyTest suite.