Alumni Pathways: Next Generation Data Pipeline
Created Date | Mar 31, 2023 |
---|---|
Last Major Update | Aug 24, 2023 (in preparation for PI 6) |
Target PI | PI 3: Research (architecture), prioritization, planning, and begin building (not yet user-facing). PI 5 & on: see https://docs.google.com/document/d/1Z__f34WEBTn5_qGuqyeVmv2pHX9ddJG1-ylcEebLkC8/edit PI 6: see https://docs.google.com/document/d/1K11u825jxt2gmtA2JmgxhRAbiHasYkYekfNnDi6T_JA/edit?usp=sharing |
Target Release | Base functionality completed in PI 5 (based on expectations set with customers & stakeholders that these would be released by the end of Q3 ‘23). PI 6: Release targeted enhancements and improve quality to meet standards |
Jira Epics | Documents - CDOT - https://economicmodeling.atlassian.net/browse/AOD-740 Micro - Analyst RED - RAPTOR - |
Document Status | REVIEW |
Epic Owner | @Gavin Esser |
Stakeholder | @Kaleb Trotter @Hunter Burk @Chris Kellogg @Dave Wallace (Deactivated) @Matt McNair |
Engineering Team(s) Involved | Documents, Micro, Analyst RED / possibly: CDOT & RAPTOR |
PART 1
Customer/User Job-to-be-Done or Problem
Lightcast’s Education business unit is working aggressively to ramp up Alumni Pathways sales and retention while simultaneously increasing data update frequencies. However, the Alumni Pathways data pipeline was never intended to handle this combined load; it is significantly fragile because of the scrappy way it was originally built, which relies heavily on Microsoft Excel and manual processes.
We need to build a next-generation data pipeline and swap it out before this becomes a bottleneck and/or breaking point for delivering Alumni Pathways. This is foundational / critical-path work to:
enable more frequent data updates
achieve the scale we are targeting (active customers x update frequency)
support future features plus the new data structures and sources required for them
See this presentation for a visual summary of the vision for the new data pipeline versus its current state.
Value to Customers & Users
A next-generation data pipeline will benefit customers through:
Faster deliveries (shorter time between sale and when matched data is available in Alumni Pathways)
More frequent data updates (in Alumni Pathways)
Future software features that are dependent on this new pipeline
Future data sources that are dependent on this new pipeline
Value to Lightcast
A next-generation data pipeline will benefit Lightcast through:
The added value to customers (described above) supports sales acceleration, market penetration, and retention
Increased opportunity to add or integrate additional data sources
Reduced delivery risk (fewer delays or failures caused by pipeline fragility and/or bottlenecks)
Reduced work-in-process (due to faster pipeline)
Reduced reliance on manual processes (reduced quality risks, reduced labor costs per customer)
Target User Role/Client/Client Category
Buyers/users (External):
Institutional Research, Academic, Enrollment Marketing, President’s Office, and Advancement/Alumni Relations/Foundation teams across all Education segments
Lightcast Internal Teams
Customer Delivery Operations Team (“CDOT”) and Documents are the primary internal users
Delivery Mechanism
Both the current and proposed new data pipelines will deliver data to Lightcast applications and/or customers via API at a minimum. Adding delivery via Snowflake would be a separate, future consideration.
We are intentionally moving away from delivering data pipeline outputs as static files (Excel/CSV) as part of our strategy to pull users towards the dynamic software platforms, APIs, and potentially Snowflake instead.
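As a hedged illustration of API-first delivery, the sketch below shows how a downstream consumer might pull pipeline output from an API endpoint rather than receive a static Excel/CSV file. The endpoint URL, token handling, and response shape are placeholder assumptions for illustration, not the actual Alumni Pathways API.

```python
# Sketch: a downstream consumer pulling matched alumni data from an API
# instead of receiving a static Excel/CSV file. The URL, auth handling,
# and response fields are placeholder assumptions for illustration only.
import requests

API_URL = "https://api.example.com/alumni-pathways/v1/profiles"  # placeholder
API_TOKEN = "REPLACE_ME"  # placeholder; the real auth flow may differ


def fetch_profiles(institution_id: str, page: int = 1) -> list[dict]:
    """Fetch one page of matched profiles for an institution."""
    response = requests.get(
        API_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        params={"institution_id": institution_id, "page": page},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("profiles", [])
```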
Success Criteria & Metrics
Definition of Done
Alumni Pathways is powered by a new, next-generation data pipeline.
PI 3: Research (architecture), prioritization, planning, and begin building (not yet user-facing)
PI 4:
Docs (Hunter): Continue building with aggressive/stretch target of completing an end-to-end parallel process to what CDOT does today
Micro (Chris K.):
Add the new NSC fields to the old version of the API (a hedged schema sketch follows this list)
Finish out the work for schema v2
PI 5: see https://docs.google.com/document/d/1Z__f34WEBTn5_qGuqyeVmv2pHX9ddJG1-ylcEebLkC8/edit
PI 6: see https://docs.google.com/document/d/1K11u825jxt2gmtA2JmgxhRAbiHasYkYekfNnDi6T_JA/edit?usp=sharing
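As a rough, non-authoritative illustration of the Micro items above (new NSC fields added to the old API version while schema v2 is finished in parallel), the sketch below shows one way the two response shapes could coexist. The Pydantic models and every field name in them (profile_id, nsc_enrollment_status, etc.) are assumptions for illustration, not the actual API contract.

```python
# Hypothetical sketch: extend the existing (v1) Alumni Pathways API response
# with new NSC-derived fields while schema v2 is built out in parallel.
# All model and field names here are illustrative assumptions, not the real contract.
from typing import Optional
from pydantic import BaseModel


class AlumniProfileV1(BaseModel):
    """Existing v1 response shape, extended with optional NSC fields so that
    current clients keep working (new fields simply default to None)."""
    profile_id: str
    institution_id: str
    graduation_year: Optional[int] = None
    current_title: Optional[str] = None
    # New NSC-derived fields, optional so v1 stays backward compatible.
    nsc_enrollment_status: Optional[str] = None
    nsc_degree_title: Optional[str] = None


class NscRecord(BaseModel):
    """v2 groups NSC-derived attributes into their own nested object."""
    enrollment_status: Optional[str] = None
    degree_title: Optional[str] = None
    last_enrolled_term: Optional[str] = None


class AlumniProfileV2(BaseModel):
    """Draft v2 shape: same core identifiers, NSC data nested under one key."""
    profile_id: str
    institution_id: str
    graduation_year: Optional[int] = None
    current_title: Optional[str] = None
    nsc: Optional[NscRecord] = None
```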
North Star Metric
Alumni Pathways’ current North Star Metric is Filtered Reports per Month by Account (as a proxy for answering institutional users' questions and supporting their work). We believe this new data pipeline will move that metric by:
Getting customers access to matched data (across multiple reports) sooner
Providing customers with a reason to regularly return to the tool (because data is updating more frequently)
Our target improvements for this metric are:
>10% more Filtered Reports per Month by Account within the first two months after the first feature is released
The higher level of Filtered Reports per Month by Account is sustained (i.e. it does not drop back to previous levels over time)
Significant improvements to CDOT team workload and efficiency
Aspects that are out of scope (of this phase)
As currently described, this epic covers a minimum viable product (MVP) release of this new feature. Subsequent extensions and enhancements are not yet defined (or included).
Related Epics:
Each of these focuses on optimizing one step/component of the new data pipeline; as such, each potentially rolls up into this epic or represents parallel/supporting work for it.
Stretch target for PI 4: https://economicmodeling.atlassian.net/l/cp/5nFz0PhX
Deferred to at least PI 7 and to be re-evaluated:
PI 6
PART 2
Solution Description
Documentation
In/for PI 3:
Scope of work (when we were previously evaluating contracting this work out)
Process flow & improvement opportunities (at a high level)
This is the most detailed flow diagram of our whole process: https://miro.com/app/board/uXjVOUw1M6k=/?share_link_id=121476884821
This board shows which parts of our process need to be optimized: https://miro.com/app/board/uXjVOZGnXp4=/?share_link_id=675053311338
An all-important spreadsheet template (in this folder) that the team uses to do a lot of manual curation. This Confluence Article describes how a user actually uses the spreadsheet.
Early priority work items:
Load the Profile data using a pre-defined schema. Ever since the data was changed to JSON, it has loaded slowly. Based on testing in other pipelines, we know we can speed this up to near-Parquet speeds by defining a schema when we load the data (see the sketch after this list). If you need help testing the code, reach out to Skye; he knows how to run the project, which is a little different because of how AO works.
Change how we manage the encoding of the school's contact info file (see the encoding sketch below). Talk to Dave Wallace to get more details.
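For the Profile load item above, here is a minimal schema-on-read sketch, assuming a Spark-based job; the paths and field names are placeholders, and the real project runs somewhat differently because of how AO works.

```python
# Sketch: load Profile JSON with an explicit schema instead of letting Spark
# infer it. Schema inference forces an extra pass over the raw JSON before the
# real read, which is where most of the slowdown comes from; a pre-defined
# schema skips that pass and gets load times close to Parquet.
# Paths and field names below are placeholders, not the real Profile schema.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.appName("profile-load-sketch").getOrCreate()

profile_schema = StructType([
    StructField("profile_id", StringType(), nullable=False),
    StructField("name", StringType(), nullable=True),
    StructField("schools", ArrayType(StringType()), nullable=True),
])

# Explicit schema: no inference pass over the raw JSON.
profiles = (
    spark.read
    .schema(profile_schema)
    .json("s3://example-bucket/profiles/")  # placeholder path
)
profiles.printSchema()
```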
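And a hedged sketch of one way the contact-info encoding could be handled more deliberately: detect the incoming encoding and normalize the file to UTF-8 before it enters the pipeline. The file names and the use of charset_normalizer are assumptions for illustration; Dave Wallace has the details on how the file is actually produced and consumed.

```python
# Sketch: normalize the school contact info file to UTF-8 before downstream
# steps read it, instead of relying on whatever encoding the source happens
# to use. File names and detection library are illustrative assumptions.
from pathlib import Path

from charset_normalizer import from_path  # pip install charset-normalizer


def normalize_to_utf8(src: Path, dst: Path) -> str:
    """Detect the source encoding, decode, and rewrite the file as UTF-8."""
    best_guess = from_path(src).best()
    if best_guess is None:
        raise ValueError(f"Could not detect a usable encoding for {src}")
    dst.write_text(str(best_guess), encoding="utf-8")
    return best_guess.encoding


# Example usage (placeholder file names):
# encoding = normalize_to_utf8(Path("school_contact_info.csv"),
#                              Path("school_contact_info.utf8.csv"))
```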
In/for PI 3:
TBD
For PI 6:
Get information flowing through the pipeline to power customer data and deliverables (if not completed in PI 5)
Spend significant time closing the quality and tech-debt gap to improve results for customers. The goal is to reach a level of quality similar to the current process and to surpass it in later PIs. Important measures to be thinking about here are (a sketch of computing them follows this list):
Breadth: Number of profiles found and surfaced for the customer
Confidence/Quality: The level of confidence we have that each record contains correct information and is correctly matched to school records.
Depth: The amount of information we can provide about each individual alumnus/profile
Enhancing and adding to the data points available for profiles in the alumni database to unlock new features and capabilities.
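For the three measures above, here is a rough sketch of how they could be computed from a matched-profiles table, purely for illustration; the DataFrame, its column names, and the 0.9 confidence threshold are assumptions, not the pipeline's actual quality checks.

```python
# Sketch: compute breadth, confidence/quality, and depth measures for one
# customer's matched alumni profiles. The DataFrame columns are assumptions.
import pandas as pd


def quality_measures(matched: pd.DataFrame) -> dict:
    """matched is assumed to have one row per matched profile with columns:
    profile_id, match_confidence (0-1), plus one column per enrichment field."""
    enrichment_cols = [
        c for c in matched.columns if c not in ("profile_id", "match_confidence")
    ]
    return {
        # Breadth: number of profiles found and surfaced for the customer.
        "breadth": matched["profile_id"].nunique(),
        # Confidence/Quality: share of matches above an (assumed) threshold.
        "high_confidence_share": (matched["match_confidence"] >= 0.9).mean(),
        # Depth: average number of populated data points per profile.
        "avg_depth": matched[enrichment_cols].notna().sum(axis=1).mean(),
    }
```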
Early UX (wireframes or mockups)
N/A
Non-Functional Attributes & Usage Projections
Privacy / Security Implications
Customer-provided (and/or National Student Clearinghouse-provided) student records contain personally identifiable information and are protected by:
FERPA regulations
[If applicable]: National Student Clearinghouse partnership contract terms
Any data transmission, access, and usage on this project will require special safeguards, including additional background checks for teams interacting with National Student Clearinghouse data.
Localization Requirements
Alumni Pathways is USA-only. Wherever National Student Clearinghouse data is used, development must be restricted to the USA only (due to security requirements in our partnership agreement with the Clearinghouse).
Performance Characteristics
As part of this work, the data pipeline performance should be optimized relative to its current state.
Dependencies
Finish-to-finish dependencies on:
https://economicmodeling.atlassian.net/wiki/spaces/DPM/pages/2488304662
https://economicmodeling.atlassian.net/wiki/spaces/DPM/pages/2620784653
Legal and Ethical Considerations
Just answer yes or no.
High-Level Rollout Strategies
The new data pipeline will be put into use when ready, and customers will see the new and improved data in the software and other deliverables fed by this pipeline (a release approach similar to how new versions of Skills or other taxonomies are rolled out).
Assuming we can achieve our target improvements, we will work with Sales, Marketing, and Success to promote the improvements.
Risks
Ensuring appropriate safeguards and correct handling of protected information (PII & National Student Clearinghouse data)
Open Questions
What are you still looking to resolve?
Complete with Engineering Teams
Effort Size Estimate |
---|
Estimated Costs
Direct Financial Costs
Are there direct costs that this feature entails? Dataset acquisition, server purchasing, software licenses, etc.?
Small cost (~$75/person) of National Student Clearinghouse-required background checks
Team Effort
Each team involved should give a general t-shirt size estimate of their work involved. As the epic proceeds, they can add a link to the Jira epic/issue associated with their portion of this work.
Team | PI + Definition of Done | Effort Estimate (T-shirt sizes) | Jira Link | Notes |
---|---|---|---|---|
DOCUMENTS | Completed: PI 3: Research (architecture), prioritization, planning, and begin building (not yet user-facing) | | | |
DOCUMENTS | PI 4: Continue building with aggressive/stretch target of completing an end-to-end parallel process to what CDOT does today | | | |
CDOT | Completed: PI 3: Research (architecture), prioritization, planning, and begin building (not yet user-facing) | Medium (1-2 weeks) | | If stretch goal happens, then we will build NSC-match input revisions into our process |
CDOT | Stretch for PI 4: Test & confirm/finalize the inputs of new end-to-end parallel process | | | |
Micro | Completed: PI 3: Research (architecture), prioritization, planning, and begin building (not yet user-facing) | | | |
Micro | Planned for PI 4 (see Definition of Done above) | | | |
Micro | Stretch for PI 4: Test & confirm/finalize the outputs of new end-to-end parallel process | | | |
Analyst RED | Completed: PI 3: Research (architecture), prioritization, planning, and begin building (not yet user-facing) | | | |
Analyst RED | Stretch for PI 4: Test & confirm/finalize the outputs of new end-to-end parallel process | | | |
RAPTOR | Completed: PI 3: Research (architecture), prioritization, planning, and begin building (not yet user-facing) | | | |
RAPTOR | TBD - No work anticipated at this time, but there may be support / collaboration needs from CDOT and/or other teams | | | |