Annotating Data
Annotations play an important role in data management. For example, annotations help:
- Identify datasets that contain personally identifiable information (PII).
- Data governance teams and systems apply the appropriate level of protection to datasets that contain PII.
See the Data Access and Onboarding Introduction for the most current information and an in-depth guide to this topic.
Note
- This section requires no action. We have already annotated the data for you, but please make sure to review this information. It provides background context about data annotations.
- You must annotate your data when sharing it outside of your team.
How to annotate
To annotate data, first determine whether it contains PII (see What is Personal Data for guidance). If your data:
- Does not contain PII, annotate it with { policy: { noPersonalData: true }}
- Contains PII, annotate the field containing personal data with the correct semantic type. If your data contains NARROW or STRICT fields, you must encrypt it. See the Padlock Documentation on encryption.
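The two cases above can be sketched as a small check over an Avro-style schema loaded as a Python dictionary. This is only an illustration under stated assumptions: the text confirms the policy.noPersonalData annotation, but the field-level key name semanticType, the placeholder value, and the function annotation_status are hypothetical. Consult the Personal Data Semantic Type Policy for the real names and values.

```python
# Assumed key name for the field-level annotation; not confirmed by this guide.
SEMANTIC_TYPE_KEY = "semanticType"


def annotation_status(schema: dict) -> str:
    """Classify a record schema by how it is annotated."""
    # Case 1: the whole dataset is declared free of personal data.
    if schema.get("policy", {}).get("noPersonalData") is True:
        return "no-personal-data"
    # Case 2: at least one field carries a semantic type annotation.
    for field in schema.get("fields", []):
        if SEMANTIC_TYPE_KEY in field.get("policy", {}):
            return "personal-data-annotated"
    # Neither annotation is present, so the dataset still needs attention.
    return "unannotated"


# Hypothetical schemas, used only to exercise the check.
no_pii = {
    "type": "record",
    "name": "Example",
    "policy": {"noPersonalData": True},
    "fields": [{"name": "count", "type": "long"}],
}
with_pii = {
    "type": "record",
    "name": "Example",
    "fields": [
        {
            "name": "userId",
            "type": "string",
            # Placeholder value; the real semantic type comes from the policy.
            "policy": {SEMANTIC_TYPE_KEY: "<semantic type>"},
        },
        {"name": "count", "type": "long"},
    ],
}
missing = {"type": "record", "name": "Example", "fields": [{"name": "x", "type": "long"}]}

print(annotation_status(no_pii))
print(annotation_status(with_pii))
print(annotation_status(missing))
```

A check like this does not decide whether data contains PII. It only reports which of the two annotations, if any, is already present.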
Viewing annotated data
Search for UserTrackCounts in the README.md of your repository. Follow the link that takes you to the UserTrackCounts Avro schema file, which should already be annotated.
Your dataset includes a userId field. The userId field is based on the user_id stored in the upstream dataset, di.golden.path.Stream.days.v1.parquet. In turn, the user_id is derived from an anonymized Spotify user ID. As the information above suggests, these user ID fields contain sensitive PII that requires annotation. See also the Personal Data Semantic Type Policy spreadsheet, which contains a comprehensive set of fields and their related annotation requirements.
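To review which fields of a schema carry annotations, you could parse the schema file and list each field's policy entry. The sketch below uses an inline stand-in for the UserTrackCounts schema. Only the record name and the userId field come from this guide; the trackCount field, the semanticType key, and its placeholder value are assumptions made for illustration. Avro permits extra attributes on fields as metadata, which is why a per-field policy attribute can live in the schema.

```python
import json

# Hypothetical stand-in for the UserTrackCounts .avsc file.
# Only the record name and `userId` are taken from this guide.
SCHEMA_TEXT = """
{
  "type": "record",
  "name": "UserTrackCounts",
  "fields": [
    {"name": "userId", "type": "string",
     "policy": {"semanticType": "<semantic type>"}},
    {"name": "trackCount", "type": "long"}
  ]
}
"""


def field_annotations(schema_text: str) -> dict:
    """Map each field name to its `policy` annotation, or None if it has none."""
    schema = json.loads(schema_text)
    return {f["name"]: f.get("policy") for f in schema.get("fields", [])}


annotations = field_annotations(SCHEMA_TEXT)
for name, policy in annotations.items():
    print(f"{name}: {policy if policy is not None else 'no annotation'}")
```

To inspect the real file, read its contents from the path linked in your README.md and pass the text to field_annotations.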