A summary of technical work done during my internship at Ren, an AI EdTech startup building an AI-powered essay grading platform for schools.
- Engineered an end-to-end essay grading pipeline in Python, orchestrating GPT Vision to extract handwritten essay content into structured paragraphs and bounding boxes, then applying a dual-method grading strategy (image-based and text-extracted) that produces per-component rubric scores with justifications and a strengths/weaknesses/actionables summary
- Extended the AI grading output schema to add per-rubric component scoring with justifications, giving students structured breakdowns of their performance across each rubric criterion
- Designed a type-safe discriminated union interface for the grading adapter layer, enabling new grading strategies to be added without modifying the upstream worker - demonstrated by integrating text-extracted grading alongside the original image-based method with zero changes to the worker contract
- Replaced legacy image-per-page grading with a GPT Vision pre-processing step that extracts paragraph text and page-spanning boundaries upfront, eliminating repeated vision API calls per grading iteration to reduce inference cost
- Designed a question-specific context injection layer for the grading pipeline, generating structured per-question analysis (argument scope, judgment requirements, common misreadings) from each exam question before grading, so the LLM grader assesses answers against what each question is specifically testing rather than generic subject knowledge. Built the generator as a two-layer system (a generic base extended by subject-specific guide fields) so exam teams can add analysis parameters for new subjects without code changes
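The two-layer generator might look something like this sketch: a generic base carrying the common analysis fields, with subject-specific guide fields supplied as data rather than code. Field names and the prompt format are illustrative assumptions, not the real system.

```python
from dataclasses import dataclass, field

@dataclass
class QuestionAnalysis:
    # Generic base fields shared by every subject (names are hypothetical)
    argument_scope: str
    judgment_requirements: str
    common_misreadings: list[str]
    # Subject-specific guide fields arrive as data, so a new subject
    # just supplies new entries here rather than requiring code changes.
    subject_guides: dict[str, str] = field(default_factory=dict)

def build_grading_context(analysis: QuestionAnalysis) -> str:
    """Render the analysis into a context block for the LLM grader."""
    lines = [
        f"Argument scope: {analysis.argument_scope}",
        f"Judgment requirements: {analysis.judgment_requirements}",
        "Common misreadings: " + "; ".join(analysis.common_misreadings),
    ]
    lines += [f"{key}: {value}" for key, value in analysis.subject_guides.items()]
    return "\n".join(lines)
```

The design choice is that the base layer stays closed to modification while the guide dictionary stays open to extension by non-engineers.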
- Built essay grading benchmark tooling using embedding cosine similarity and an LLM judge against gold-standard teacher-annotated scripts, with automated quality gates to detect AI hallucination in grading outputs and prevent rubric regression across model iterations
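The embedding-similarity half of that benchmark reduces to a cosine comparison against the gold-standard script plus a threshold gate. A minimal sketch, with a purely illustrative threshold (the real gate value and embeddings are not shown in this summary):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

SIMILARITY_FLOOR = 0.85  # illustrative gate threshold, not the actual value

def passes_quality_gate(candidate_emb: list[float], gold_emb: list[float]) -> bool:
    """Flag outputs that drift too far from the teacher-annotated gold standard."""
    return cosine_similarity(candidate_emb, gold_emb) >= SIMILARITY_FLOOR
```

In practice the LLM-judge check would run alongside this, since cosine similarity alone cannot distinguish a fluent hallucination from a faithful grading justification.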
- Built a concurrent load testing framework for the grading API, instrumenting LLM token usage, latency distributions, and Docker memory footprints across parallel grading jobs, generating per-run analytics reports to establish cost-per-submission baselines
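The core of a concurrent load-test harness like this can be sketched with a semaphore-bounded `asyncio.gather` plus latency percentiles. Everything here is a simplified assumption of the real framework, which also instrumented token usage and Docker memory:

```python
import asyncio
import statistics
import time

async def timed_call(worker, payload):
    """Run one grading call and measure its wall-clock latency."""
    start = time.perf_counter()
    result = await worker(payload)
    return result, time.perf_counter() - start

async def run_load_test(worker, payloads, concurrency: int = 8) -> dict:
    """Fire payloads at the worker with bounded parallelism; report latencies."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(payload):
        async with sem:
            return await timed_call(worker, payload)

    results = await asyncio.gather(*(bounded(p) for p in payloads))
    latencies = sorted(latency for _, latency in results)
    return {
        "count": len(latencies),
        "p50": statistics.median(latencies),
        "p95": latencies[int(0.95 * (len(latencies) - 1))],
    }
```

Per-run reports built from these numbers (plus token counts) are what make a cost-per-submission baseline possible.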
- Designed LLM observability infrastructure aggregating token costs, latency distributions, and memory footprints across concurrent grading workers, to inform capacity planning and pricing strategy
- Built the end-to-end AI feedback summarization feature from Next.js/tRPC API through Python workers to GPT, automatically condensing ~20-30 teacher annotations from 30-page marked scripts into structured student summaries - giving students targeted takeaways without manually reviewing multi-page annotated scripts
- Extended the post-marking pipeline to auto-generate student-facing cover pages from existing component scores, delivering structured grading reports without additional LLM inference costs
- Extended the Next.js LaTeX renderer to support inline math notation and built a PDF export sanitisation utility, enabling mathematical content in student feedback to render correctly across browser and PDF output
- Designed a unit testing strategy for a Python FastAPI backend and authored 15+ test modules covering grading engine orchestration, worker concurrency, LLM inference, S3 storage, document handling, and notification services - using pytest-asyncio, factory-boy, and fakeredis to fully isolate all external dependencies, with 80% branch coverage enforced via GitHub Actions CI
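The isolation approach for the LLM inference dependency can be illustrated with `unittest.mock.AsyncMock` (the summary names factory-boy and fakeredis for other dependencies). The service function and its signature below are hypothetical stand-ins for the real grading code:

```python
import asyncio
from unittest.mock import AsyncMock

# Hypothetical grading call; the real orchestration is more involved.
async def grade_essay(llm_client, essay_text: str) -> dict:
    response = await llm_client.complete(prompt=f"Grade this essay: {essay_text}")
    return {"score": response["score"]}

def test_grade_essay_with_mocked_llm():
    # AsyncMock stands in for the OpenAI client, so no network call is made.
    llm = AsyncMock()
    llm.complete.return_value = {"score": 7}

    result = asyncio.run(grade_essay(llm, "sample essay"))

    assert result == {"score": 7}
    llm.complete.assert_awaited_once()
```

Under pytest-asyncio the test body would be an `async def` with `await` instead of `asyncio.run`, but the isolation pattern is the same.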
- Built a Playwright E2E test suite from scratch using the Page Object Model pattern, covering the full marking workflow end-to-end: class creation, student enrolment, assignment upload, AI pipeline completion, and graded status verification
- Identified a silent failure propagation risk in the grading pipeline where error states could pass through undetected, and eliminated it through test-driven development - writing failing invariant tests to define the contract, then modifying production code until the contract held
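The invariant-first shape of that fix can be sketched as follows: the test encodes the contract that no failed stage may pass through silently, and the pipeline is then written (or rewritten) to satisfy it. The result type and stage shape are illustrative, not the production schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StageResult:
    ok: bool
    error: Optional[str] = None

def run_pipeline(stages) -> str:
    """Fail fast: any stage error must surface, never propagate silently."""
    for stage in stages:
        result = stage()
        if not result.ok:
            raise RuntimeError(f"pipeline stage failed: {result.error}")
    return "graded"

def test_error_states_never_pass_silently():
    # This test fails against code that swallows errors, defining the contract.
    stages = [
        lambda: StageResult(ok=True),
        lambda: StageResult(ok=False, error="vision extraction failed"),
    ]
    try:
        run_pipeline(stages)
        assert False, "error state propagated silently"
    except RuntimeError as exc:
        assert "vision extraction failed" in str(exc)
```

Writing the failing test first pins down the contract before any production code changes, which is the TDD loop the bullet describes.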
- Configured GitHub Actions CI pipelines for both unit and E2E test suites, automating the full test run on every pull request
Built with: Python, FastAPI, OpenAI GPT API, Next.js, TypeScript, tRPC, Prisma, Playwright, pytest, Docker