Handling date formats in big data

When working with datasets from multiple sources, it’s common to encounter varying date formats. Each source may provide dates in different formats, leading to challenges in data consistency and processing. Here are some examples of different date formats you might encounter:

02/25/2024 (MM/dd/yyyy)
03/04/2024 15:23:10 (dd/MM/yyyy HH:mm)
22-5-2023 (dd-M-yyyy)
4-21-2024 (M-dd-yyyy)
04-21-2024 09:40:11 p.m. (MM-dd-yyyy hh:mm a)
02/05/24 (MM/dd/yy)
Apr 05 2024 00:15:30 (MMM dd yyyy HH:mm:ss)
01/25/2024 5:30 a.m. EST (MM/dd/yyyy h:mm a z)

(Note that H for 24 hour, non-padded, HH for 24 hour, padded, h for 12 hour, non-padded, hh for 12 hour, padded)
H: 5 or 23
HH: 05 or 23
h: 7 or 11 used with am, pm
hh: 07 or 11 used with am, pm

(Note that M for non-padded month, MM for padded month, MMM for abbreviated month name, MMMM for full name month)
M: 1 (January), 11 (November)
MM: 01 (January), 11 (November)
MMM: Jan (January), Nov (November)
MMMM: November (November)

Imagine you have datasets from various data providers stored in your S3 bucket, each using different date formats. Your goal is to import these datasets into your data warehouse with date fields standardised to the timestamp type. This can be challenging due to the diversity in date formats.

There are several ways to handle this situation, and I’d like to introduce one effective method:

  1. Import Date Fields as Strings First, import the date fields as string types. This will allow all date formats to be captured in their original form.

  2. Identify Date Formats Next, determine the various date formats present in your dataset. You can achieve this by running a query that uses regular expressions to match different date patterns:
    SELECT
    CASE
       WHEN date RLIKE '\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2}' THEN 'MM/dd/yyyy HH:mm:ss'
       WHEN date RLIKE '\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2}' THEN 'MM-dd-yyyy HH:mm:ss'
       WHEN date RLIKE '\\d{1,2}-\\d{2}-\\d{4}' THEN 'M-dd-yyyy'
       -- Add more formats as needed
       ELSE 'Unknown format'
    END AS date_format
    FROM dataset_table;
    
  3. Update the Table with standardised Timestamps Once you have identified all the date formats, you can update your table to convert these string dates into a standardised timestamp format. Here’s how:
ALTER TABLE dataset_table
ADD COLUMN new_date TIMESTAMP;  

-- Example for MM/dd/yyyy HH:mm:ss
UPDATE dataset_table
SET new_date = TO_TIMESTAMP(date, 'MM/dd/yyyy HH:mm:ss')
WHERE date RLIKE '\\d{2}/\\d{2}/\\d{4}\\s\\d{2}:\\d{2}:\\d{2}';  

-- Example for MM-dd-yyyy hh:mm:ss a
UPDATE dataset_table
SET new_date = TO_TIMESTAMP(date, 'MM-dd-yyyy hh:mm:ss a')
WHERE date RLIKE '\\d{2}-\\d{2}-\\d{4}\\s\\d{2}:\\d{2}:\\d{2}\\s[APap]\\.[Mm]\\.';  

-- Example for MMM dd yyyy HH:mm:ss
UPDATE dataset_table
SET new_date = TO_TIMESTAMP(date, 'MMM dd yyyy HH:mm:ss')
WHERE date RLIKE '^[A-Za-z]{3}\\s\\d{2}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}$'  

-- Example for MM/dd/yyyy h:mm a z
UPDATE dataset_table
SET new_date = TO_TIMESTAMP(date, 'MM/dd/yyyy h:mm a z')
WHERE date RLIKE '\\d{2}/\\d{2}/\\d{4}\\s\\d{1}:\\d{2}\\s[APap]\\.[Mm]\\.\\sEST$'  

Repeat the above UPDATE statement for each identified date format.
Note that RLIKE stands for “regular expression like”, allowing you to match patterns using regular expressions.
Since SQL often treats \ as a special character, use \\ to represent a literal backslash.

Final Step: Clean Up

After updating the date fields and ensuring all dates are standardised:

ALTER TABLE dataset_table
DROP COLUMN date;

ALTER TABLE dataset_table
RENAME COLUMN new_date TO date;

Conclusion

Handling multiple date formats in datasets can be complex, but by importing date fields as strings, identifying the formats using regular expressions, and then updating the table with standardised timestamps, you can effectively manage and standardise your date fields. This approach ensures that your data warehouse maintains consistency and accuracy across different data sources.

Written on July 31, 2024