Data Ingestion Workflow

1 Guiding Principles

  • Avoid ingesting duplicate data.
  • If anything is unclear at any point, ask the data provider. Do not assume anything.
  • Always verify the type of ingest with the data provider. Is the ingest an append (adding rows to an existing table), a replacement (replacing the entire table), or an update (deleting rows on a condition and then loading new data)?
  • Always verify the DDL.
  • The end goal is to take a flat data file and load it into the database.

2 Mount S3 buckets:

  • mount -o nolock -o mtype=hard 10.12.24.139:/adrf3-redshift11-data-ingest L:
  • mount -o nolock -o mtype=hard 10.12.24.139:/adrf3-redshift01-data-ingest R:

1. Purpose

Ensure that all data entering the ADRF is trustworthy, well-documented, secure, and aligned with organizational standards for quality, privacy, and usability.

This document also outlines the steps for ingesting data into the Redshift database. Ensure each step is followed carefully to maintain data integrity and achieve a successful data load.

2. Objective

  • Validate structure, format, and metadata of incoming data.
  • Enforce consistency across ingestion processes.
  • Classify ingested data by sensitivity level (e.g., PII, public).
  • Establish ownership and accountability for each data source.
  • Ensure traceability and auditability of all ingested data.

3. Roles and Responsibilities

  • Data Owner: Approves the source and defines ingestion requirements.
  • Data Steward: Ensures metadata completeness and monitors ingestion quality.
  • Data Engineer: Builds and maintains ingestion pipelines; applies validation and logging.

4. Ingestion Policies

  • All data must be registered in the PASS data catalog.
  • Schema validation is mandatory for all production data ingests.
  • Each ingest must be logged in the data ingest tracker, including:
    • Source and timestamp
    • File size, record count
    • Email of data provider
    • New table indication

5. Standards

5.1 File and Naming Conventions

  • File Format: Redshift can ingest most flat-file formats. However, it cannot directly ingest formats such as .dta, .sas7bdat, .xlsx, and .RDS. You will need to convert these files to CSV to load them properly.
  • Schema Naming Convention: ds_{state/agency}_{dataOwner/primaryAgency}. EX: ds_ar_dws = Arkansas Department of Workforce Services.
  • Table Naming Convention: Give the table an intuitive name, and always confirm the name with the data provider.
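
The format conversion described above can be scripted rather than done by hand. Below is a minimal sketch using pandas; the function name convert_to_csv and the pipe delimiter are illustrative, and .RDS files are not covered because they require R (or a package such as pyreadr):

```python
from pathlib import Path

import pandas as pd

# Readers for formats Redshift's COPY cannot ingest directly.
# .RDS is intentionally absent: it needs R (or the pyreadr package).
READERS = {
    ".dta": pd.read_stata,        # Stata
    ".sas7bdat": pd.read_sas,     # SAS
    ".xlsx": pd.read_excel,       # Excel; requires openpyxl
}

def convert_to_csv(src: str, delimiter: str = "|") -> Path:
    """Convert a supported file to a delimited CSV alongside the original."""
    path = Path(src)
    reader = READERS.get(path.suffix.lower())
    if reader is None:
        raise ValueError(f"No converter for {path.suffix!r}; ask the data provider")
    frame = reader(path)
    dest = path.with_suffix(".csv")
    frame.to_csv(dest, sep=delimiter, index=False)
    return dest
```

Writing with a pipe delimiter up front avoids a second pass later when the COPY command expects DELIMITER '|'.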

6. High-level Ingestion Process Flow

  1. Verify with Jon Mills that data ingestion can occur.
  2. Send Kiteworks link to data provider.
  3. Once data are received, download data from Kiteworks into ADRF.
  4. Inspect the data and validate the schema.
  5. Verify that it isn’t duplicate data.
  6. Ingest data.
  7. Catalog Registration (If data does not exist in catalog).
  8. Notify data provider of successful ingest.

1. Access and Permissions

Ensure you have the appropriate permissions to perform data ingestion:

  • The user performing the data ingestion must be a member of the ci_data_eng group on Redshift.
  • Alternatively, a user can log in as dbadmin. The password can be found at G:\CI_DataEng\data_engineering\admin\sa_passwords.

2. Preparation of Data Files

  • All data files should be downloaded from Kiteworks into the ADRF.
  • Place the downloaded files into the appropriate folder within G:\CI_DataEng. For example, place AR (Arkansas) data in the AR folder (G:\CI_DataEng\AR).
    • If the required folder does not exist, create it.
Note

EmEditor is required for this step. It is not installed on the data-mgmt server, so you must use a CI Associate account to open all data files.

3. Inspect Data Files

Open each file using EmEditor to inspect its contents and validate the data format.

Checklist for Inspection:

  • Delimiter: Identify the delimiter used in the data file.

    • If the delimiter is a double pipe (||), change it to a single pipe (|). Redshift's COPY command supports only single-character delimiters.
    • If you change the delimiter, quote all fields; otherwise, if the new delimiter appears inside a field value, the load will fail.

    Possible Delimiters:

    • |
    • ,
    • ^
    • (used in some cases for data migration from SQL Server01 to Redshift11)
  • Column Headers:

    • Check if column names are present in the data file. If they are, use the IGNOREHEADER 1 option during the load operation to ignore the first row.
  • Column Order:

    • Verify that the order of the columns in the data file matches the order of the columns in the target Redshift table. If they differ, specify the column list in the COPY command; the list must match the order of the source data (see the AWS documentation for COPY column mapping).
  • Date Formats:

    • Check how dates are formatted in the file. If they differ from Redshift's default (YYYY-MM-DD), specify the DATEFORMAT option in the COPY command.
  • General Inspection:

    • Look for any other anomalies or issues that might affect the data load process.
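
The double-pipe fix described in the checklist above can be scripted. Below is a minimal sketch (the function name repipe and the file paths are illustrative) that rewrites ||-delimited rows as single-pipe with every field quoted, so a stray | inside a value cannot shift columns; remember to pair quoted output with the REMOVEQUOTES option in the COPY command:

```python
import csv

def repipe(src_path: str, dest_path: str) -> int:
    """Rewrite a double-pipe (||) delimited file as single-pipe (|),
    quoting every field so embedded pipes cannot break the COPY.
    Returns the number of rows written."""
    rows = 0
    with open(src_path, newline="") as src, \
            open(dest_path, "w", newline="") as dest:
        writer = csv.writer(dest, delimiter="|", quoting=csv.QUOTE_ALL)
        for line in src:
            writer.writerow(line.rstrip("\r\n").split("||"))
            rows += 1
    return rows
```

For very large files this streams line by line, so memory use stays flat regardless of file size.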

4. Move Files to S3 Bucket

After inspecting and making any necessary adjustments to the data file, move it to the correct S3 bucket via the mounted drive. The S3 naming convention is: L:\DATA-INGEST\<state>\<agency_name>\<date_of_transfer>\. (The L: drive maps to redshift11; use the R: drive for redshift01.)

  • For example, for AR data, move it to L:\DATA-INGEST\ar\<agency_name>\<date_of_transfer>\.

5. Open DBeaver and Connect to Redshift

  • Open DBeaver and connect to the appropriate Redshift server:
    • For states: redshift11
    • For USDA: redshift01

6. Run the COPY Command to Load Data

Run the COPY command to load the data into Redshift. Ensure that all necessary options (such as delimiter, IGNOREHEADER, DATEFORMAT, etc.) are specified.

If ingesting data into an existing table, first ingest the data into a temp table, then verify that the new data do not duplicate any existing data. For example, check the counts by the time element: if ingesting UI wage data, do a count by year and quarter. If any periods overlap with data already in the table, confirm with the data provider before loading.

copy projects.ds_ar_dhe.table (a, b, c, d …) --order of columns in ingestion file
from 's3://adrf3-redshift11-data-ingest/ds_public_2/irs_form_990.txt' 
IAM_ROLE 'arn:aws-us-gov:iam::810527209037:role/ADRF_Redshift11_Admin_Role'
DELIMITER '|'
IGNOREHEADER 1
ACCEPTINVCHARS
EMPTYASNULL;  
  • Adjust the COPY command based on the file inspection details.
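
The staged-ingest duplicate check described above can be sketched in SQL. The schema, table, column names, and S3 path below are illustrative (a UI wage table keyed by year and quarter is assumed):

```sql
-- 1. Stage the new file in a temp table shaped like the target.
CREATE TEMP TABLE stage_ui_wages (LIKE projects.ds_ar_dws.ui_wages);

COPY stage_ui_wages
FROM 's3://adrf3-redshift11-data-ingest/ar/dws/2024-01-15/ui_wages.txt'
IAM_ROLE 'arn:aws-us-gov:iam::810527209037:role/ADRF_Redshift11_Admin_Role'
DELIMITER '|'
IGNOREHEADER 1;

-- 2. Any overlapping year/quarter pairs are potential duplicates;
--    confirm with the data provider before proceeding.
SELECT s.year, s.quarter, COUNT(*) AS incoming_rows
FROM stage_ui_wages s
WHERE EXISTS (
    SELECT 1 FROM projects.ds_ar_dws.ui_wages t
    WHERE t.year = s.year AND t.quarter = s.quarter
)
GROUP BY s.year, s.quarter;

-- 3. If the check returns no rows, append the staged data.
INSERT INTO projects.ds_ar_dws.ui_wages
SELECT * FROM stage_ui_wages;
```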

7. Handle Load Errors

If the load fails, check the error details by querying the stl_load_errors table:

SELECT starttime, filename, line_number, colname, err_reason, raw_line
FROM pg_catalog.stl_load_errors 
ORDER BY starttime DESC
LIMIT 10;

The top row will show the most recent error. Common errors include:

  • Column Too Large: Increase the VARCHAR or CHAR datatype size.
  • Invalid UTF-8 Characters: Replace the invalid characters in the data file or use the ACCEPTINVCHARS option in the COPY command.
  • Quoted Fields: If fields are double-quoted, use the REMOVEQUOTES option.
  • Missing Delimiter: If the delimiter cannot be found, double-quote the fields.

8. Validate Loaded Data

Once the data load is successful:

  • Run a row count of the table and compare it with the row count provided by the data provider.
  • Perform basic Exploratory Data Analysis (EDA) to ensure data integrity, such as checking counts, missing values, and basic statistics.

Example Commands for Validation

-- Check row count
SELECT COUNT(*) FROM table_name;

-- Perform basic data checks
SELECT column_name, COUNT(*) FROM table_name GROUP BY column_name;

-- Check for missing values in a key column
SELECT COUNT(*) FROM table_name WHERE column_name IS NULL;

9. Documentation and Clean-Up

  • Document any issues encountered and steps taken to resolve them.
  • Clean up any intermediate files or temporary data used during the ingestion process.

Ingestion Procedure

Role Names

  • Redshift01: arn:aws-us-gov:iam::810527209037:role/ADRF_Redshift01_Admin_Role
  • Redshift11: arn:aws-us-gov:iam::810527209037:role/ADRF_Redshift11_Admin_Role

Replace the IAM_ROLE value in the code below with the appropriate role, depending on which server the data are being ingested into.

Login Requirements

You need to be logged in as dbadmin01, dbadmin11, or ci_data_eng.

Data Conversion and Load Operations Documentation

COPY projects.ds_usda_iri.iriweek_all  
FROM 's3://adrf3-redshift01-data-ingest/ds_public_2/irs_form_990.txt' 
IAM_ROLE 'arn:aws-us-gov:iam::810527209037:role/ADRF_Redshift01_Admin_Role'
DELIMITER '#'
IGNOREHEADER 1
ACCEPTINVCHARS
EMPTYASNULL;

Specifying Column Order in COPY Command

COPY projects.ds_usda_iri.iriweek_all (a, b, c, d, …) --order of columns in ingestion file
FROM 's3://adrf3-redshift01-data-ingest/ds_public_2/irs_form_990.txt' 
IAM_ROLE ''
DELIMITER '|'
IGNOREHEADER 1
ACCEPTINVCHARS
EMPTYASNULL;
  • Refer to the sample ingest script: G:\CI_DataEng\data_engineering\sql\ingest_script.sql.

Creating and Permissioning Schemas

CREATE SCHEMA IF NOT EXISTS ds_test_schema AUTHORIZATION dbadmin11;
GRANT USAGE ON SCHEMA ds_test_schema TO GROUP db_p00100_ro;

Creating and Permissioning Tables

CREATE TABLE IF NOT EXISTS ds_test_schema.table (
  Col1 VARCHAR(64),
  Col2 INT,
  Col3 DATE,
  -- Additional columns here
);

ALTER TABLE ds_test_schema.table OWNER TO dbadmin11;
GRANT ALL ON TABLE ds_test_schema.table TO dbadmin11;
GRANT SELECT ON TABLE ds_test_schema.table TO db_p00100_ro;
  • Refer to G:\CI_DataEng\admin\Redshift Schema Generation.xlsx for more information.

Creating Local Users in Redshift

CREATE USER "user.name" WITH PASSWORD 'password';
CREATE USER "josh.edelmann" WITH PASSWORD 'Password1234!';
ALTER GROUP ci_read_group ADD USER "josh.edelmann";

Different Group Names

  • ar_dashboard_development
  • wi_dashboard_development
  • ci_careplots_group
  • ci_data_eng
  • ci_read_group

When creating users and groups, you need to be logged in as ci_data_eng, dbadmin11, or dbadmin01.