ONET 2019 values from LOT Spocc classifier (US Postings)

 

Target PI

PI#6/7

Created Date

Oct 26, 2022

Target Release

PI#7

Jira Epic

https://economicmodeling.atlassian.net/browse/CE-334 https://economicmodeling.atlassian.net/browse/DT-2896 https://economicmodeling.atlassian.net/browse/TAX-1092

Document Status

Draft

Epic Owner

@Hal Bonella

Stakeholder

@Ben Bradley @Matt McNair @Jackson Schuur @Nathan Triepke @Tatiana Harrison

Engineering Team(s) Involved

Documents, C&E, Taxonomy, Data Quality

Customer/User Job-to-be-Done or Problem

The Scope of the user problem should be narrowed to the scope you are planning to solve in this phase of work. There may be other aspects you are aware of and plan to solve in the future. For now, put those in the Out of Scope section.

When tagging documents with ONET 2019 values, I want the values to be connected to the Specialized Occupation values from LOT, so I can have consistency in occupation tagging in our postings data.

 

JTBD (external): As a user of US Posting data in Lightcast products (Analyst, Snowflake, API, etc.), I expect the occupation tagging to be of high quality, especially for government taxonomies like O*NET 2019 and SOC. With the new release of the Lightcast Occupation Taxonomy, major shifts must be explained, especially if I have relied on ONET 2019 and SOC tagging for many of my previous reports.

JTBD (internal): As one of the internal “clients” of our US postings (i.e. Documents), I want to simplify our occupation tagging and reduce the number of occupation fields we have. I also aim to become independent from NOVA so we can reduce cost. Switching to an ONET 2019 classifier will allow us to 1) remove the onet and soc_emsi_2019 fields, simplifying how many fields are processed as part of our pipeline run, and 2) get closer to getting off of NOVA. In addition, by connecting ONET 2019 (and thus SOC 2021) to the LOT/Spocc classifier, we essentially have one classifier for occupations on US postings, making it easier to track and resolve occupation tagging issues.

 

Value to Customers & Users

In the JTBD framework, these are the “pains” and “gains” your solution will address. Other ways to think about it: What’s the rationale for doing this work? Why is it a high priority problem for your customers and how will our solution add value?

Customers and users of our software (Analyst, API, Snowflake, Careercoach, etc.) will be able to compare our LOT taxonomy with how we tag US postings with ONET 2019 and see more consistency, since the two will be connected in our enrichment process. This also means that fixing a LOT tagging issue will correct the corresponding ONET tagging issue right away, and vice versa.

Rationale to do work:

Primary value is to Lightcast (see below).

  • For clients, the value is consistent results whether they use ONET 2019, SOC, or LOT for their reports.

Value to Lightcast

Sometimes we do things for our own benefit. List those reasons here. 

This will have the following value for Lightcast

  • Removing a large NOVA/LENS dependency for US postings

    • Ensures the sunset of NOVA dependencies and drives down costs

    • Currently, ONET 2019 values are based on ONET 2010 values from NOVA/LENS.

    • Allows us to remove fields currently tagged on documents, making the pipeline process a bit easier.

  • One classifier for all our occupations on US jpa

    • Specifically, all occupation tagging can be traced back to the specialized occupation the posting is tagged with

  • More consistency in our data

  • Allows us to more easily fix occupation tagging issues

    • Currently, NOVA treats many tagging issues in their values as low priority unless they cause a major disruption for the majority of clients.

Monetary Value to Lightcast

This project will mainly reduce costs for Lightcast. Currently, keeping US postings connected to NOVA through NOVA-dependent fields costs the company a minimum of $17,000 per month. This project removes a major NOVA dependency for US jpa, bringing us closer to getting US jpa off of NOVA.

Target User Role/Client/Client Category

Who are we building this for?

  • Occupation is a field used across all BUs and products, hence all US JPA users are target users. That said, the main concern for them is communication about any data shifts.

  • Internally, the target user is the Documents team, to help them get off of NOVA and make pipeline runs for the US a bit easier.

Delivery Mechanism

How will users receive the value?

  • The onet_2019 field will be populated from the spocc classifier from C&E during the enrichment phase of the pipeline run

    • This will automatically update SOC_2021 values on documents as well

  • Since the values will populate an existing field (once the data quality has been approved), no changes are needed from Micro/API, Analyst, Snowflake, etc. The delivery mechanism for clients will be unchanged from their end.

    • Once the switch is made by Documents, all other teams will automatically get the updated data

    • We will need to message clients about the changes before the switch
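The enrichment step described above can be sketched roughly as follows. This is an illustrative sketch only: the SPOCC_TO_ONET crosswalk entries, the field names, and the assumption that SOC can be derived by dropping the O*NET ".XX" detail suffix are all hypothetical, not the confirmed implementation.

```python
# Hypothetical sketch of the enrichment step: derive onet_2019 (and soc_2021)
# from the LOT specialized occupation already tagged on a posting.
# SPOCC_TO_ONET and the ".XX"-suffix SOC derivation are illustrative assumptions.

SPOCC_TO_ONET = {
    "software_developer": "15-1252.00",  # example crosswalk entries (assumed)
    "registered_nurse": "29-1141.00",
}

def enrich(posting: dict) -> dict:
    """Populate onet_2019 and soc_2021 from the specialized occupation tag."""
    spocc = posting.get("specialized_occupation")
    onet = SPOCC_TO_ONET.get(spocc)  # None when no crosswalk entry exists
    posting["onet_2019"] = onet
    # SOC assumed here to be the O*NET code minus its ".XX" detail suffix
    posting["soc_2021"] = onet.split(".")[0] if onet else None
    return posting

doc = enrich({"id": "p1", "specialized_occupation": "registered_nurse"})
```

Because both fields derive from the single specialized-occupation tag, a fix to the LOT tag automatically propagates to ONET and SOC on the next enrichment run.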

Success Criteria & Metrics

How will you know you’ve completed the epic? How will you know if you’ve successfully addressed this problem? What usage goals do you have for these new features? How will you measure them?

  • onet_2019 values are consistent with the LOT Specialized Occupation values seen by clients, as defined by the crossover rules from the taxonomy

  • All shifts in posting counts for onet_2019 are defensible based on the LOT release documents for the US.

  • Recall on O*NET should not fall (Currently around 95%)

  • Accuracy on O*NET should be greater than the current accuracy (currently 81%; given that the current accuracy of LOT in the US is 84%, this should be achievable!)

  • Previous fixes that have been applied via patches should also be examined specifically as part of QA; regression on issues that clients have previously had fixed is very damaging.
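The recall and accuracy criteria above could be computed on a hand-labeled sample along these lines. This is a sketch under assumptions: "recall" is interpreted here as coverage (share of postings receiving any onet_2019 value), and the field names and gold-label format are hypothetical.

```python
# Hypothetical QA sketch: measure O*NET coverage ("recall") and accuracy
# against a gold-labeled sample. Field names are illustrative assumptions.

def recall(tagged):
    """Share of postings that received any onet_2019 value at all."""
    return sum(1 for p in tagged if p.get("onet_2019")) / len(tagged)

def accuracy(tagged, gold):
    """Share of postings whose onet_2019 matches the hand-labeled gold value."""
    hits = sum(1 for p in tagged if p["onet_2019"] == gold[p["id"]])
    return hits / len(tagged)

# Tiny illustrative sample with one untagged posting and one mistag
sample = [
    {"id": "a", "onet_2019": "15-1252.00"},
    {"id": "b", "onet_2019": "29-1141.00"},
    {"id": "c", "onet_2019": None},
]
gold = {"a": "15-1252.00", "b": "11-1011.00", "c": "43-4051.00"}
```

The same sample could be re-scored with the previously patched postings included, to catch the regression risk called out above.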

Aspects that are out of scope (of this phase)

What is explicitly not a part of this epic? List things that have been discussed but will not be included. Things you imagine in a phase 2, etc.

  • UK SOC based on LOT classifier - Done in PI#6

  • CA NOC based on LOT classifier - Done in PI#4

  • US ONET_2019 and SOC 2021 on Profiles

  • Global Postings National Taxonomies (ISCO and others)

  • LOT taxonomy being tagged on postings/profiles

    • for US JPA, should already be completed

    • for other data sets, this can be done in parallel but is out of scope of this phase of the LOT project

 

Solution Description

Early UX (wireframes or mockups)

<FigmaLink>

 

Non-Functional Attributes & Usage Projections

Consider performance characteristics, privacy/security implications, localization requirements, mobile requirements, accessibility requirements

 

Dependencies

Is there any work that must precede this? Feature work? Ops work? 

  • US LOT specialized occupation must have met the acceptable data quality criteria

    • ONET 2019 tagging will be based on specialized occupation

    • The initial analysis on ONET coding from LOT is here. More work will need to be done on this analysis and on the mapping, as we see a couple of items that are not correct: 1) a number of nulls in ONETs, and 2) a lot of generic ONETs not having coding (this is due to the more granular mapping from spec occ).

    • Due to the large change in LOT, it is expected that ONET will also change significantly.

Legal and Ethical Considerations

Just answer yes or no.

Have you thought through these considerations (e.g. data privacy) and raised any potential concerns with the Legal team?

High-Level Rollout Strategies

  • Initial rollout to [internal employees|sales demos|1-2 specific beta customers|all customers]

    • If specific beta customers, will it be for a specific survey launch date or report availability date 

  • How will this guide the rollout of individual stories in the epic?

  • The rollout strategy should be discussed with CS, Marketing, and Sales.

  • How long we would tolerate having a “partial rollout” -- rolled out to some customers but not all

 

  1. Tag new ONET 2019 values onto documents under the rc_onet_2019 fields. Do the same for all levels of soc_2021.

  2. Check and approve data quality of the rc_onet_2019 fields:

    • check against LOT specialized occ

    • check against current onet_2019 values

  3. Document any major changes that will occur for ONET and SOC tagging on US jpa for clients, CS, etc.

  4. Have values from rc_onet_2019 populate the onet_2019 field on US jpa.
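The staged rollout described above amounts to a shadow-field promotion, which could be sketched as below. The field names mirror the plan (rc_onet_2019, onet_2019); everything else, including when promotion is triggered, is an illustrative assumption pending the data-quality sign-off.

```python
# Hypothetical sketch of the staged rollout: write the classifier's values
# into an rc_ ("release candidate") shadow field, then promote once QA
# approves. Promotion logic here is an assumption, not the agreed mechanism.

def tag_rc(posting: dict, new_onet: str) -> dict:
    """Stage the new value in a shadow field that clients never see."""
    posting["rc_onet_2019"] = new_onet
    return posting

def promote(posting: dict) -> dict:
    """After data-quality sign-off, make the rc value the live value."""
    posting["onet_2019"] = posting.pop("rc_onet_2019")
    return posting

p = tag_rc({"id": "p1", "onet_2019": "15-1299.00"}, "15-1252.00")
p = promote(p)  # live field now carries the classifier's value
```

Keeping both fields side by side during QA is what makes the checks against LOT spec occ and the current onet_2019 values possible before any client-visible change.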

Risks

Focus on risks unique to this feature, not overall delivery/execution risks. 

  • If there are major changes/data shifts for ONET 2019 that are not explainable to clients, it could be a risk to Lightcast's reputation for data quality. Clients may lose trust in our product.

    • Occupation tagging is one of the top features used by job postings clients.

Open Questions

What are you still looking to resolve?

 


Complete with Engineering Teams

 

Effort Size Estimate

M

Estimated Costs

Direct Financial Costs

Are there direct costs that this feature entails? Dataset acquisition, server purchasing, software licenses, etc.?

 

Team Effort

Each team involved should give a general t-shirt size estimate of their work involved. As the epic proceeds, they can add a link to the Jira epic/issue associated with their portion of this work.

Team

Effort Estimate (T-shirt sizes)

Jira Link

Documents

Small

https://economicmodeling.atlassian.net/browse/DT-3616