Alumni Pathways: Next Generation Data Pipeline
Created Date | Mar 31, 2023 |
---|---|
Last Major Update | Aug 24, 2023 (in preparation for PI 6) |
Target PI | PI 3: Research (architecture), prioritization, planning, and begin building (not yet user-facing). PI 5 & on: see https://docs.google.com/document/d/1Z__f34WEBTn5_qGuqyeVmv2pHX9ddJG1-ylcEebLkC8/edit PI 6: see https://docs.google.com/document/d/1K11u825jxt2gmtA2JmgxhRAbiHasYkYekfNnDi6T_JA/edit?usp=sharing |
Target Release | Base functionality completed in PI 5 (based on expectations set with customers & stakeholders that these would be released by the end of Q3 ‘23). PI 6: Release targeted enhancements and improve quality to meet standards |
Jira Epics | Documents - CDOT - https://economicmodeling.atlassian.net/browse/AOD-740 Micro - Analyst RED - RAPTOR - |
Document Status | REVIEW |
Epic Owner | @Gavin Esser |
Stakeholder | @Kaleb Trotter @Hunter Burk @Chris Kellogg @Dave Wallace (Deactivated) @Matt McNair |
Engineering Team(s) Involved | Documents, Micro, Analyst RED / possibly: CDOT & RAPTOR |
PART 1
Customer/User Job-to-be-Done or Problem
Lightcast’s Education business unit is working aggressively to ramp up Alumni Pathways sales and retention while simultaneously increasing data update frequencies. However, the Alumni Pathways data pipeline was never intended to handle this combined load; it is significantly fragile because of the scrappy way it was originally built, which relies heavily on Microsoft Excel and manual processes.
We need to build a next-generation data pipeline and swap it out before this becomes a bottleneck and/or breaking point for delivering Alumni Pathways. This is foundational / critical-path work to:
enable more frequent data updates
achieve the scale we are targeting (active customers x update frequency)
support future features plus the new data structures and sources required for them
See this presentation for a visual summary of the vision for the new data pipeline versus its current state.
Value to Customers & Users
A next-generation data pipeline will benefit customers through:
Faster deliveries (shorter time between sale and when matched data is available in Alumni Pathways)
More frequent data updates (in Alumni Pathways)
Future software features that are dependent on this new pipeline
Future data sources that are dependent on this new pipeline
Value to Lightcast
A next-generation data pipeline will benefit Lightcast through:
The added value to customers (described above) supports sales acceleration, market penetration, and retention
Increased opportunity to add or integrate additional data sources
Reduced delivery risk (fewer delays or failures caused by pipeline fragility and/or bottlenecks)
Reduced work-in-process (due to faster pipeline)
Reduced reliance on manual processes (reduced quality risks, reduced labor costs per customer)
Target User Role/Client/Client Category
Buyers/users (External):
Institutional Research, Academic, Enrollment Marketing, President’s Office, and Advancement/Alumni Relations/Foundation teams across all Education segments
Lightcast Internal Teams
Customer Delivery Operations Team (“CDOT”) and Documents are the primary internal users
Delivery Mechanism
Both the current and proposed new data pipelines will deliver data to Lightcast applications and/or customers via API at a minimum. Adding delivery via Snowflake would be a separate, future consideration.
We are intentionally moving away from delivering data pipeline outputs as static files (Excel/CSV) as part of our strategy to pull users towards the dynamic software platforms, APIs, and potentially Snowflake instead.
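As a hedged illustration of API-first delivery, the sketch below shows how a downstream consumer might pull pipeline output from an API endpoint rather than receive a static Excel/CSV file. The endpoint URL, token handling, and response shape are placeholder assumptions for illustration, not the actual Alumni Pathways API.

```python
# Sketch: a downstream consumer pulling matched alumni data from an API
# instead of receiving a static Excel/CSV file. The URL, auth handling,
# and response fields are placeholder assumptions for illustration only.
import requests

API_URL = "https://api.example.com/alumni-pathways/v1/profiles"  # placeholder
API_TOKEN = "REPLACE_ME"  # placeholder; the real auth flow may differ


def fetch_profiles(institution_id: str, page: int = 1) -> list[dict]:
    """Fetch one page of matched profiles for an institution."""
    response = requests.get(
        API_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        params={"institution_id": institution_id, "page": page},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("profiles", [])
```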
Success Criteria & Metrics
Definition of Done
Alumni Pathways is powered by a new, next-generation data pipeline.
PI 3: Research (architecture), prioritization, planning, and begin building (not yet user-facing)
PI 4:
Docs (Hunter): Continue building with aggressive/stretch target of completing an end-to-end parallel process to what CDOT does today
Micro (Chris K.):
Add the new NSC fields to the old version of the API (a hedged schema sketch follows this list)
Finish out the work for schema v2
PI 5: see https://docs.google.com/document/d/1Z__f34WEBTn5_qGuqyeVmv2pHX9ddJG1-ylcEebLkC8/edit
PI 6: see https://docs.google.com/document/d/1K11u825jxt2gmtA2JmgxhRAbiHasYkYekfNnDi6T_JA/edit?usp=sharing
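As a rough, non-authoritative illustration of the Micro items above (new NSC fields added to the old API version while schema v2 is finished in parallel), the sketch below shows one way the two response shapes could coexist. The Pydantic models and every field name in them (profile_id, nsc_enrollment_status, etc.) are assumptions for illustration, not the actual API contract.

```python
# Hypothetical sketch: extend the existing (v1) Alumni Pathways API response
# with new NSC-derived fields while schema v2 is built out in parallel.
# All model and field names here are illustrative assumptions, not the real contract.
from typing import Optional
from pydantic import BaseModel


class AlumniProfileV1(BaseModel):
    """Existing v1 response shape, extended with optional NSC fields so that
    current clients keep working (new fields simply default to None)."""
    profile_id: str
    institution_id: str
    graduation_year: Optional[int] = None
    current_title: Optional[str] = None
    # New NSC-derived fields, optional so v1 stays backward compatible.
    nsc_enrollment_status: Optional[str] = None
    nsc_degree_title: Optional[str] = None


class NscRecord(BaseModel):
    """v2 groups NSC-derived attributes into their own nested object."""
    enrollment_status: Optional[str] = None
    degree_title: Optional[str] = None
    last_enrolled_term: Optional[str] = None


class AlumniProfileV2(BaseModel):
    """Draft v2 shape: same core identifiers, NSC data nested under one key."""
    profile_id: str
    institution_id: str
    graduation_year: Optional[int] = None
    current_title: Optional[str] = None
    nsc: Optional[NscRecord] = None
```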
North Star Metric
Alumni Pathways’ current North Star Metric is Filtered Reports per Month by Account (as a proxy for answering institutional users' questions and supporting their work). We believe this new data pipeline will move that metric by:
Getting customers access to matched data (across multiple reports) sooner
Providing customers with a reason to regularly return to the tool (because data is updating more frequently)
Our target improvements for this metric are:
>10% more Filtered Reports per Month by Account within the first two months after the first feature is released
The higher level of Filtered Reports per Month by Account is sustained (i.e. it does not drop back to previous levels over time)
Significant improvements to CDOT team workload and efficiency
Aspects that are out of scope (of this phase)
As currently described, this epic covers a minimum viable product (MVP) release of this new feature. Subsequent extensions and enhancements are not yet defined (or included).
Related Epics:
Each of these focuses on optimizing one step/component of the new data pipeline; as such, each potentially rolls up into this epic or represents parallel/supporting work for it.
Stretch target for PI 4: https://economicmodeling.atlassian.net/l/cp/5nFz0PhX
Deferred to at least PI 7 and to be re-evaluated:
PI 6
PART 2
Solution Description
Documentation
In/for PI 3:
Scope of work (when we were previously evaluating contracting this work out)
Process flow & improvement opportunities (at a high level)
This is the most detailed flow diagram of our whole process: https://miro.com/app/board/uXjVOUw1M6k=/?share_link_id=121476884821
This board shows which parts of our process need to be optimized: https://miro.com/app/board/uXjVOZGnXp4=/?share_link_id=675053311338
An all-important spreadsheet template (in this folder) that the team uses to do a lot of manual curation. This Confluence Article describes how a user actually uses the spreadsheet.
Early priority work items:
Load the Profile data using a pre-defined schema. Ever since the data was changed to JSON, it has loaded slowly. Based on testing in other pipelines, we know we can speed this up to near-Parquet speeds by defining a schema when we load the data (see the sketch after this list). If you need help testing the code, reach out to Skye; he knows how to run the project, which is a little different because of how AO works.
Change how we manage the encoding of the school's contact info file (see the encoding sketch below). Talk to Dave Wallace to get more details.
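For the Profile load item above, here is a minimal schema-on-read sketch, assuming a Spark-based job; the paths and field names are placeholders, and the real project runs somewhat differently because of how AO works.

```python
# Sketch: load Profile JSON with an explicit schema instead of letting Spark
# infer it. Schema inference forces an extra pass over the raw JSON before the
# real read, which is where most of the slowdown comes from; a pre-defined
# schema skips that pass and gets load times close to Parquet.
# Paths and field names below are placeholders, not the real Profile schema.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.appName("profile-load-sketch").getOrCreate()

profile_schema = StructType([
    StructField("profile_id", StringType(), nullable=False),
    StructField("name", StringType(), nullable=True),
    StructField("schools", ArrayType(StringType()), nullable=True),
])

# Explicit schema: no inference pass over the raw JSON.
profiles = (
    spark.read
    .schema(profile_schema)
    .json("s3://example-bucket/profiles/")  # placeholder path
)
profiles.printSchema()
```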
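And a hedged sketch of one way the contact-info encoding could be handled more deliberately: detect the incoming encoding and normalize the file to UTF-8 before it enters the pipeline. The file names and the use of charset_normalizer are assumptions for illustration; Dave Wallace has the details on how the file is actually produced and consumed.

```python
# Sketch: normalize the school contact info file to UTF-8 before downstream
# steps read it, instead of relying on whatever encoding the source happens
# to use. File names and detection library are illustrative assumptions.
from pathlib import Path

from charset_normalizer import from_path  # pip install charset-normalizer


def normalize_to_utf8(src: Path, dst: Path) -> str:
    """Detect the source encoding, decode, and rewrite the file as UTF-8."""
    best_guess = from_path(src).best()
    if best_guess is None:
        raise ValueError(f"Could not detect a usable encoding for {src}")
    dst.write_text(str(best_guess), encoding="utf-8")
    return best_guess.encoding


# Example usage (placeholder file names):
# encoding = normalize_to_utf8(Path("school_contact_info.csv"),
#                              Path("school_contact_info.utf8.csv"))
```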
In/for PI 3:
TBD
For PI 6:
Get information flowing through the pipeline to power customer data and deliverables (if not completed in PI 5)
Spend significant time closing the quality and tech-debt gap to improve results for customers. The goal is to reach a level of quality similar to the current process and to surpass it in later PIs. Important measures to be thinking about here are (a sketch of computing them follows this list):
Breadth: Number of profiles found and surfaced for the customer
Confidence/Quality: The level of confidence we have that each record contains correct information and is correctly matched to school records.
Depth: The amount of information we can provide about each individual alumnus/profile
Enhancing and adding to the data points available for profiles in the alumni database to unlock new features and capabilities.
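For the three measures above, here is a rough sketch of how they could be computed from a matched-profiles table, purely for illustration; the DataFrame, its column names, and the 0.9 confidence threshold are assumptions, not the pipeline's actual quality checks.

```python
# Sketch: compute breadth, confidence/quality, and depth measures for one
# customer's matched alumni profiles. The DataFrame columns are assumptions.
import pandas as pd


def quality_measures(matched: pd.DataFrame) -> dict:
    """matched is assumed to have one row per matched profile with columns:
    profile_id, match_confidence (0-1), plus one column per enrichment field."""
    enrichment_cols = [
        c for c in matched.columns if c not in ("profile_id", "match_confidence")
    ]
    return {
        # Breadth: number of profiles found and surfaced for the customer.
        "breadth": matched["profile_id"].nunique(),
        # Confidence/Quality: share of matches above an (assumed) threshold.
        "high_confidence_share": (matched["match_confidence"] >= 0.9).mean(),
        # Depth: average number of populated data points per profile.
        "avg_depth": matched[enrichment_cols].notna().sum(axis=1).mean(),
    }
```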
Early UX (wireframes or mockups)
N/A
Non-Functional Attributes & Usage Projections
Privacy / Security Implications
Customer-provided (and/or National Student Clearinghouse-provided) student records contain personally identifiable information and are protected by:
FERPA regulations
[If applicable]: National Student Clearinghouse partnership contract terms
Any data transmission, access, and usage on this project will require special safeguards, including additional background checks for teams interacting with National Student Clearinghouse data.
Localization Requirements
Alumni Pathways is USA-only. Wherever National Student Clearinghouse data is used, development must be restricted to the USA only (due to security requirements in our partnership agreement with the Clearinghouse).
Performance Characteristics
As part of this work, the data pipeline performance should be optimized relative to its current state.
Dependencies
Finish-to-finish dependencies on:
https://economicmodeling.atlassian.net/wiki/spaces/DPM/pages/2488304662
https://economicmodeling.atlassian.net/wiki/spaces/DPM/pages/2620784653
Legal and Ethical Considerations
Just answer yes or no.
High-Level Rollout Strategies
The new data pipeline will be put into use when ready, and customers will see the new and improved data in the software and other deliverables fed by this pipeline (a release approach similar to how new versions of Skills or other taxonomies are rolled out).
Assuming we can achieve our target improvements, we will work with Sales, Marketing, and Success to promote the improvements.
Risks
Ensuring appropriate safeguards and correct handling of protected information (PII & National Student Clearinghouse data)
Open Questions
What are you still looking to resolve?
Complete with Engineering Teams
Effort Size Estimate |
---|
Estimated Costs
Direct Financial Costs
Are there direct costs that this feature entails? Dataset acquisition, server purchasing, software licenses, etc.?
Small cost (~$75/person) of National Student Clearinghouse-required background checks
Team Effort
Each team involved should give a general t-shirt size estimate of their work involved. As the epic proceeds, they can add a link to the Jira epic/issue associated with their portion of this work.
Team | PI + Definition of Done | Effort Estimate (T-shirt sizes) | Jira Link | Notes |
---|---|---|---|---|
DOCUMENTS | Completed: PI 3: Research (architecture), prioritization, planning, and begin building (not yet user-facing) | | | |
DOCUMENTS | PI 4: Continue building with aggressive/stretch target of completing an end-to-end parallel process to what CDOT does today | | | |
CDOT | Completed: PI 3: Research (architecture), prioritization, planning, and begin building (not yet user-facing) | Medium (1-2 weeks) | | If stretch goal happens, then we will build NSC-match input revisions into our process |
CDOT | Stretch for PI 4: Test & confirm/finalize the inputs of new end-to-end parallel process | | | |
Micro | Completed: PI 3: Research (architecture), prioritization, planning, and begin building (not yet user-facing) | | | |
Micro | Planned for PI 4 (see Definition of Done above) | | | |
Micro | Stretch for PI 4: Test & confirm/finalize the outputs of new end-to-end parallel process | | | |
Analyst RED | Completed: PI 3: Research (architecture), prioritization, planning, and begin building (not yet user-facing) | | | |
Analyst RED | Stretch for PI 4: Test & confirm/finalize the outputs of new end-to-end parallel process | | | |
RAPTOR | Completed: PI 3: Research (architecture), prioritization, planning, and begin building (not yet user-facing) | | | |
RAPTOR | TBD - No work anticipated at this time, but there may be support / collaboration needs from CDOT and/or other teams | | | |