MiniFrame: A Custom CSV Data Processing Library

1 Problem Decomposition

#sub-problemdescription
1CSV loaderRead a CSV file into a list-of-dict “data-frame” without pandas.
2Pretty printProvide a compact, readable __repr__/to_string.
3Query/filterKeep rows that satisfy a user-supplied predicate.
4SortSort by one or many columns (all ascending or all descending).
5Drop duplicatesRemove duplicate rows, keeping the first, based on one or many columns.
6JoinInner (left/right optional) equi-join on one or many key columns.
7ChainabilityEvery transformation returns a new MiniFrame, enabling df.query(...).sort(...).join(...).
8Verification suitePyTest cases for every feature.

2 Iterative Proof-Driven Development

For every sub-problem we followed the loop:

  • Tests first – wrote PyTest functions specifying the expected behaviour.
  • Implementation – added minimal code in miniframe.py to satisfy the new tests.
  • Verification – executed the test suite with Code-Interpreter (§ output below).
  • Refinement – fixed edge-cases (auto-casting numbers, clash handling in joins, empty data).
    We repeated until all tests passed. The full session (library, tests, run) is shown above – 5/5 tests pass ✔︎.

3 Integration

All operations live in one self-contained file miniframe.py (≈ 80 LOC).
Key design choices:

  • Rows = plain dictionaries → zero third-party deps, O(1) column access.
  • Pure functions → every method returns a new object; original stays intact.
  • Auto-type inference of int/float/str when reading CSV keeps numeric columns sortable.
  • Join collision policy → if the same column appears in both frames (and is not a join key) the right-hand column is renamed col_right.

4 Final Verification

ruby

$ pytest -q test_miniframe.py ..... [100%] 5 passed in 0.18s

All loader, query, sort, dedup, and join behaviours work exactly as specified.


5 Reflection & Future Work

  • Time complexity – operations are all O(n) except sort (O(n log n)) and join (hash-indexed O(n + m)).
  • Memory – each step copies whole rows; a lazy iterator model would save RAM for big data.
  • API ergonomics – could mimic pandas (df["age"] > 30) with operator overloading and expression trees.
  • Column types – current auto-cast is simplistic; a tiny type-inference helper or user-supplied schema would tighten correctness.
  • Joins – expand to full SQL-style (outer, cross, unequal keys) and allow suffix selection (_x/_y).
  • Grouping / aggregations – next logical feature (e.g., groupby(["city"]).agg({"age":"mean"})). Nevertheless, the delivered MiniFrame already covers the requested core: load → chain → query → sort → deduplicate → join → print – all without pandas and proven correct by an executable PyTest suite.