
Annotating Data


Annotations play an important role in data management. For example, annotations help:

  • Identify datasets that contain personally identifiable information (PII).

  • Data governance teams and systems apply the appropriate level of protection to datasets that contain PII.

See the Data Access and Onboarding Introduction for the most current information and an in-depth guide to this topic.

Note

  • This section requires no action, because we have already annotated the data for you. Please still review it: it provides background on data annotations.

  • You must annotate your data when sharing it outside of your team.

How to annotate

To annotate data, first determine whether it contains PII (see What is Personal Data for guidance). If your data:

  • Does not contain PII, annotate it with { policy: { noPersonalData: true }}

  • Contains PII, annotate the field containing personal data with the correct semantic type. If your data contains NARROW or STRICT fields, you must encrypt it. See the Padlock Documentation on encryption.
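The no-PII case above can be sketched in Python as a schema dictionary plus a small check. This is only an illustration under stated assumptions: the record name, the field names, the helper `declares_no_personal_data`, and the placement of the `policy` attribute at the top level of the record are all hypothetical. Only the `{ policy: { noPersonalData: true }}` annotation itself comes from the guidance above.

```python
import json

# Hypothetical Avro record for a dataset without personal data.
# Only the `policy` annotation comes from the guidance above; the record
# name, field names, and top-level placement are assumptions.
SCHEMA_WITHOUT_PII = {
    "type": "record",
    "name": "ExampleTrackCounts",
    "fields": [
        {"name": "trackId", "type": "string"},
        {"name": "count", "type": "long"},
    ],
    "policy": {"noPersonalData": True},
}


def declares_no_personal_data(schema: dict) -> bool:
    """Return True only if the schema explicitly sets noPersonalData to true."""
    policy = schema.get("policy")
    return isinstance(policy, dict) and policy.get("noPersonalData") is True


if __name__ == "__main__":
    # Print the schema as JSON, the form an Avro schema file would take.
    print(json.dumps(SCHEMA_WITHOUT_PII, indent=2))
    print(declares_no_personal_data(SCHEMA_WITHOUT_PII))
```

A check like this treats a missing annotation as "not declared" rather than "no PII", which matches the idea that a dataset is only exempt once it has been explicitly annotated.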

Viewing annotated data

Search for UserTrackCounts in the README.md of your repository. Follow the link to the UserTrackCounts Avro schema file, which should already be annotated.

Your dataset includes a userId field. The userId field is based on the user_id stored in the upstream dataset, di.golden.path.Stream.days.v1.parquet. In turn, the user_id is derived from an anonymized Spotify user ID. As described above, these user ID fields contain sensitive PII and therefore require annotations. See also the Personal Data Semantic Type Policy spreadsheet, which contains a comprehensive set of fields and their related annotation requirements.
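When reviewing a schema such as UserTrackCounts, it can help to list which fields carry an annotation. The sketch below is a rough illustration, not the actual schema: the field-level `policy` attribute, the `semanticType` key, the placeholder value, the `trackCount` field, and the helper `annotated_fields` are all assumptions. Refer to the real schema file and the Personal Data Semantic Type Policy spreadsheet for the correct attribute names and values.

```python
# Hypothetical shape of a record with an annotated userId field.
# The `semanticType` key and its value are placeholders, not real policy values.
SCHEMA_WITH_PII = {
    "type": "record",
    "name": "UserTrackCounts",
    "fields": [
        {
            "name": "userId",
            "type": "string",
            "policy": {"semanticType": "PLACEHOLDER_SEMANTIC_TYPE"},
        },
        {"name": "trackCount", "type": "long"},
    ],
}


def annotated_fields(schema: dict) -> dict:
    """Map each field name to its `policy` annotation, skipping unannotated fields."""
    return {
        field["name"]: field["policy"]
        for field in schema.get("fields", [])
        if isinstance(field, dict) and "policy" in field
    }


if __name__ == "__main__":
    for name, policy in annotated_fields(SCHEMA_WITH_PII).items():
        print(name, policy)
```

Fields that do not appear in the result have no annotation of their own, so they are the ones to check against the policy spreadsheet.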