[ST1501] Data Engineering

Topic B: Relational Data Model

A set of concepts that describes the structure of a database and the associated retrieval and update transactions.

Properties of a proper form table:

  • The name of each Relation is unique

  • Each cell is single-valued

  • Attribute names within the relation are unique

  • All values of an attribute come from the same domain

  • Order of tuples/attributes does not matter

  • Each tuple (data point) in a relation is unique

The process usually involves several steps to transform an improper form relation (0NF) into first normal form (1NF), then second normal form (2NF), and finally third normal form (3NF). For most tables, 3NF is deemed sufficient for all the tables to be considered normalized.

Normalization

The process to convert an improper form of relation into several proper-form relations.

A table is said to be in improper form if it violates any one, or a combination, of the requirements stated above.

If a table is normalized, it should satisfy all the requirements above.

Process of Normalization

  1. From 0NF to 1NF

    • Remove all repeating groups so that every cell contains one and only one value (or NULL).

    • Identify the Primary Key; it can be simple or composite depending on the contextual data.

    • Display the relation in the Relational Heading Format.

    • At 1NF, there are NO REPEATING groups

  2. From 1NF to 2NF

    1. Identify the Anomalies:

      • INSERT Anomaly: One cannot insert a record into a table unless all parts of the Primary Key have a value in them. [violates Entity Integrity Rule]

      • UPDATE Anomaly: One must change a value to another value on several records instead of 1 location. This may give rise to data integrity problems as some records may be missed during the updates. [Data inconsistency due to data redundancy]

      • DELETE Anomaly: One inevitably deletes other information that is not supposed to be deleted when one wants to delete a particular record. [lose information when rows are deleted]

    2. Establish Functional Dependencies (FD):

      • Split the Primary Key into its partial parts and establish the determinants and dependents.

      • For every Functional Dependency, create a new relation with a meaningful name. Make the determinant the Primary Key of the newly created relation and the dependents its non-key attributes.

      • At 2NF, every non-key attribute is functionally dependent on the whole of its primary key.

  3. From 2NF to 3NF

    • Identify the functional dependencies between the non-key attributes (Transitive Dependency (TD))

    • For every TD, create a new relation with a meaningful name. Make the determinant the Primary Key for the newly created relation and the dependents as non-key attributes.

    • Retain the Primary Key as a Foreign Key in the original relation where Transitive Dependency is found.

    • At 3NF, there is no functional dependency among the non-key attributes.

The normalization process ends with more tables than it started with.

At 3NF, most tables will not exhibit any anomalies. They are good to be created in the database for holding data.
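As an illustrative sketch (the student/module tables and columns below are assumptions, not from the notes), a 3NF decomposition expressed as SQL tables might look like this, with the partial dependencies moved into Student and Module and the transitive dependency on the lecturer moved into Lecturer:

-- Lecturer attributes depend only on LecturerID (was a transitive dependency).
CREATE TABLE Lecturer (
    LecturerID   INT         PRIMARY KEY,
    LecturerName VARCHAR(50)
);

-- Module attributes depend only on ModuleCode (was a partial dependency).
CREATE TABLE Module (
    ModuleCode  CHAR(6)     PRIMARY KEY,
    ModuleTitle VARCHAR(50),
    LecturerID  INT,
    FOREIGN KEY (LecturerID) REFERENCES Lecturer(LecturerID)
);

-- Student attributes depend only on StudentID (was a partial dependency).
CREATE TABLE Student (
    StudentID   INT         PRIMARY KEY,
    StudentName VARCHAR(50)
);

-- Enrolment keeps only attributes that depend on the whole composite key.
CREATE TABLE Enrolment (
    StudentID  INT,
    ModuleCode CHAR(6),
    Grade      CHAR(2),
    PRIMARY KEY (StudentID, ModuleCode),
    FOREIGN KEY (StudentID)  REFERENCES Student(StudentID),
    FOREIGN KEY (ModuleCode) REFERENCES Module(ModuleCode)
);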

CHAR vs VARCHAR

CHAR has a fixed character count [the column always occupies the declared length; shorter values are padded with spaces]

VARCHAR has a variable character count [takes up storage based on the actual length of the value stored, plus a small length overhead]
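A minimal sketch of the difference (hypothetical table; assumes a typical single-byte character set):

-- CHAR(10) always occupies 10 characters; short values are space-padded.
-- VARCHAR(10) occupies only the length of the value actually stored.
CREATE TABLE CodeDemo (
    FixedCode CHAR(10),
    FreeText  VARCHAR(10)
);

INSERT INTO CodeDemo (FixedCode, FreeText) VALUES ('AB', 'AB');
-- FixedCode is stored as 'AB        ' (padded to 10 characters); FreeText is stored as 'AB'.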

Relational Heading Format

Syntax: <RelationName> (single-valued columns, single-valued columns, { multi-valued columns })

For example, a hypothetical Student (StudentID, StudentName, { ModuleCode, Grade }) shows the repeating module/grade group inside the braces.

Concepts

Entity type

  • A collection of entities that have common properties and are of interest to an organization.

  • Identified by a name(Singular noun) and a list of attributes.

Attributes

  • COMPOSITE Attributes

    • Attributes that comprise multiple components

    • e.g. Address, Name

  • SIMPLE Attributes

    • Attributes that comprise a single component.

    • e.g. Phone number

  • DERIVED Attributes

    • Attributes that are derived from another set of attributes.

    • e.g. Age (derived from Date of Birth)

Relationships

  • Association between 2 entities

  • Has a verb-based name

  • Multiple relationships can exist between the same 2 entity types when they are associated through 2 distinct relationships

Structural Constraints

The different structural constraints

Cardinality Constraints

Describes the number of entity instances that participate in a relationship

  • 1 : 1 Relationship

  • 1 : Many Relationship

  • Many : Many Relationship

Participation Constraints

Determines if the existence of an entity type depends on another entity type.

Weak Entity & Identification Dependency

A weak (child) entity is an entity type whose existence depends on another (parent/owner) entity type; it cannot be uniquely identified by its own attributes alone and borrows the parent's key (identification dependency).

Generalization Hierarchy

  • also known as "subsumption(IS-A) Relationship"

  • It is a collection of entity types that are arranged in a hierarchical structure

  • Attributes that are common to all entities are associated with the Supertype

  • Attributes that are unique to a particular subtype are associated with the subtypes (does not affect other subtypes)

  • Subtype entities possess general attributes from the Supertype as well as subtype-specific attributes.

Example of a "IS-A Relationship"

Disjointness Constraints

  • Disjoint

    • A super-type instance is either sub-type 1 or sub-type 2

    • It cannot simultaneously be a member of both

  • Non-disjoint

    • Omission of the disjointness constraint implies that a super-type instance can be both sub-type 1 and sub-type 2 at the same time.

Completeness Constraints

  • Complete

    • Every super-type instance is either sub-type 1 or sub-type 2

    • The sub-types together make up the whole super-type

  • Non-complete

    • There are super-type instances that belong to neither sub-type 1 nor sub-type 2

Many to Many Constraints

  • Cannot be directly implemented in a Relational Database

  • Violates the property that no cell of a relation can be multi-valued

Solution for Many to Many constraints:

  1. Identify the Many : Many Relationship

  2. Replace it with two 1 : Many identifying relationships (via an intermediate join table, as explained below)

Explanation:

For example, a many-to-many relationship exists between customers and products: customers can purchase various products, and products can be purchased by many customers.

Relational database systems usually don't allow you to implement a direct many-to-many relationship between two tables. Consider the example of keeping track of invoices: if there were many invoices with the same invoice number and one of your customers inquired about that invoice number, you wouldn't know which invoice they were referring to. This is one reason for assigning a unique value to each invoice.

To avoid this problem, you can break the many-to-many relationship into two one-to-many relationships by using a third table, called a join table. Each record in a join table includes a match field that contains the value of the primary keys of the two tables it joins. (In the join table, these match fields are foreign keys.) These foreign key fields are populated with data as records in the join table are created from either table it joins.
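A minimal SQL sketch of resolving the customer-product example with a join table (table and column names are assumptions for illustration):

CREATE TABLE Customer (
    CustomerID   INT         PRIMARY KEY,
    CustomerName VARCHAR(50)
);

CREATE TABLE Product (
    ProductID   INT         PRIMARY KEY,
    ProductName VARCHAR(50)
);

-- Join table: each row pairs one customer with one product they purchased.
-- Its two foreign keys replace the direct Many : Many relationship
-- with two 1 : Many relationships.
CREATE TABLE Purchase (
    CustomerID   INT,
    ProductID    INT,
    PurchaseDate DATE,
    PRIMARY KEY (CustomerID, ProductID, PurchaseDate),
    FOREIGN KEY (CustomerID) REFERENCES Customer(CustomerID),
    FOREIGN KEY (ProductID)  REFERENCES Product(ProductID)
);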

Topic C: SQL Queries

UPDATE All rows

UPDATE <tableName> SET colName = colValue [, colName02 = colValue02]

The [ ] indicates that the portion of the statement is optional

UPDATE Specific rows

UPDATE <tableName> SET colName = colValue 
WHERE <condition>
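A hypothetical example (the Employee table and columns are assumptions for illustration):

-- Give one specific employee a new salary.
UPDATE Employee 
SET Salary = 4200 
WHERE EmployeeID = 1001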

DELETE All rows

DELETE FROM <tableName>

DELETE Specific rows

DELETE FROM <tableName> 
WHERE <condition>

DELETE removes whole rows; the WHERE clause restricts which rows are removed (individual columns cannot be deleted with DELETE).
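A hypothetical example (assumed Employee table):

-- Remove all employees in the 'Temp' department.
DELETE FROM Employee 
WHERE Department = 'Temp'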

SELECT All rows

SELECT * FROM <tableName>

Used to retrieve all rows and columns of a table

SELECT Specific columns

SELECT colName1, colName2, colName3 FROM <tableName>

You can rename the columns in the output like this: SELECT colName1 "newName1", colName2 "newName2", ... FROM <tableName>

INSERT Data

INSERT INTO <tableName> (colName1, colName2, colName3) 
VALUES (value01, value02, value03)

Data stored in the table is typically not case-sensitive for comparisons (this depends on the DBMS collation); SQL keywords themselves are never case-sensitive.

To store data containing a quote ( ' ), double the quote (e.g. 'David''s')
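A hypothetical example (reusing the assumed Customer table from the earlier sketch); the doubled quote stores the string David's Bakery:

INSERT INTO Customer (CustomerID, CustomerName) 
VALUES (2001, 'David''s Bakery')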

Retrieving DISTINCT values from columns

SELECT DISTINCT <colName> FROM <tableName>

Changing column Labels/Heading using AS

SELECT <colName01> AS "newColName1" FROM <tableName>

If the new column label contains a space, enclose it in quotes (e.g. the column Country_Name can be relabelled as "Country Name"); an unquoted label cannot contain spaces, so use an underscore instead.

Sorting Results using ORDER BY

SELECT * FROM <tableName> 
ORDER BY <colName used to order> ASC

'ASC' is optional. If the order is not specified, it defaults to ascending; use DESC for descending order.

This command will display all the columns and values from the table in ascending order according to the values found in the column specified.

using WHERE clause (single search condition)

SELECT colName1, colName2, colName3 FROM <tableName> 
WHERE (condition)

using WHERE clause (Search condition using range: BETWEEN)

SELECT colName1, colName2, colName3 FROM <tableName> 
WHERE <colName> BETWEEN value1 AND value2

using WHERE clause (Search condition using range: NOT BETWEEN)

SELECT colName1, colName2, colName3 FROM <tableName> 
WHERE <colName> NOT BETWEEN value1 AND value2

using WHERE clause (Search condition using Range: OR)

SELECT colName1, colName2, colName3 FROM <tableName>
WHERE (condition1) OR (condition2)

using WHERE Clause (Search condition using Range: IN)

SELECT colName1, colName2, colName3 FROM <tableName>
WHERE <colName>
IN (list of values)
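Hypothetical examples of these range conditions (the Employee table, columns, and values are assumptions):

-- Employees earning between 3000 and 5000 inclusive.
SELECT EmployeeID, EmployeeName, Salary 
FROM Employee 
WHERE Salary BETWEEN 3000 AND 5000

-- Employees working in one of the listed departments.
SELECT EmployeeID, EmployeeName, Department 
FROM Employee 
WHERE Department IN ('Sales', 'Finance', 'IT')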

Topic E: Data Integration & ETL Process

Components of a BI Solution:

  1. Data Sources

  2. Data Preparation

  3. Data Warehouse

  4. Data Analytics

  5. Data Presentation

What is ETL?

ETL (Extract, Transform and Load) is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse.

ETL Process

The ETL process is critical to a successful data warehouse.

What is ETL - the tasks

  • Extract: The process of reading data from a variety of data sources.

  • Transform: The process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into the data warehouse.

  • Load: The process of uploading and updating data into the data warehouse.

Extract:

In this phase, the required data are identified and extracted from different sources. The extracted data are then held in temporary storage (Staging Area) until the "Transform" and "Load" phases can be executed.

Extract - Data Validation

During extraction, validation rules are applied to test whether the data have expected values essential to the data warehouse.

Data that fail validation are rejected and further processed to discover why they failed and to remediate them where possible.
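As an illustrative sketch (the StagingSales table and rules are assumptions, not from the notes), a simple validation rule might route failing rows to a reject table for investigation:

-- Copy rows that fail validation into a reject table with the same structure.
INSERT INTO StagingSalesRejected 
SELECT * FROM StagingSales 
WHERE CustomerID IS NULL OR Quantity <= 0

-- Keep only rows that pass validation in the staging area.
DELETE FROM StagingSales 
WHERE CustomerID IS NULL OR Quantity <= 0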

Data Extraction phase

Transform:

In this phase, data are fetched from the staging area, "cleansed" and transformed. Data cleansing or data scrubbing is the process of detecting and correcting (or removing) data anomalies to improve data quality.
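A minimal sketch of one cleansing transformation on a hypothetical staging table (names and the mapping rule are assumptions):

-- Trim stray spaces, standardise capitalisation and map inconsistent country codes.
UPDATE StagingSales 
SET CustomerName = UPPER(TRIM(CustomerName)), 
    Country = CASE WHEN Country IN ('SG', 'Spore') THEN 'Singapore' ELSE Country END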

Data Transformation phase

Examples of "Dirty Data"

Examples of Dirty Data

Load:

In this phase, data are loaded into the end target, usually a data warehouse or data mart.

This process varies widely depending on the requirements of the organization. Some data warehouses may add new data in a historical form at regular intervals - for example, hourly.

Some Data Warehouses may overwrite existing information with cumulative information.
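For example, an incremental (append-only) load from the staging area might look like this sketch (the FactSales and StagingSales tables are assumptions):

-- Append only the staging rows that are not already in the warehouse fact table.
INSERT INTO FactSales (SaleID, CustomerID, ProductID, Quantity, SaleDate) 
SELECT s.SaleID, s.CustomerID, s.ProductID, s.Quantity, s.SaleDate 
FROM StagingSales s 
WHERE NOT EXISTS (SELECT 1 FROM FactSales f WHERE f.SaleID = s.SaleID)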

Data Loading phase

Updating extracted data is frequently done on a daily, weekly, or monthly basis.

Data Quality

Proper data cleaning can make or break your project. Professional data scientists usually spend a very large portion of their time on this step.

Better Data beats Fancier Algorithms 💁‍♂️💁‍♂️

Remove Unwanted observations

The first step to data cleaning is removing unwanted observations from your dataset. This includes duplicate or irrelevant observations.

Duplicate observations most frequently arise during data collection, such as when you:

  • Combine datasets from multiple places

  • Scrape Data

  • Receive data from clients/other departments
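A minimal sketch of removing exact duplicates from a hypothetical table (the CREATE TABLE ... AS SELECT syntax varies slightly between DBMSs):

-- Rebuild the table keeping only one copy of each distinct row.
CREATE TABLE CustomerFeedbackClean AS 
SELECT DISTINCT * FROM CustomerFeedback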

Fix Structural Errors

Structural Errors are those that arise during measurement, data transfer, or other types of "poor housekeeping".

  • Check for typos

  • Check for inconsistent capitalizations

  • Check for mislabeled classes
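For instance, a sketch of merging two mislabeled class values in a hypothetical table (names and values are assumptions):

-- 'N/A' and 'Not Applicable' were recorded as two different classes; merge them.
UPDATE CustomerFeedbackClean 
SET Category = 'Not Applicable' 
WHERE UPPER(TRIM(Category)) IN ('N/A', 'NOT APPLICABLE', 'NA')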

Filtering Unwanted Outliers

Outliers can cause problems with certain types of models.

For example, linear regression models are less robust to outliers than decision tree models. If you have a legitimate reason to remove an outlier, it will help your model's performance.
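A minimal sketch of filtering out implausible outliers (the SensorReading table, column, and cut-offs are assumptions for illustration):

-- Remove temperature readings that are physically implausible and likely data-entry errors.
DELETE FROM SensorReading 
WHERE Temperature < -50 OR Temperature > 60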

When to remove outliers?

Outliers are innocent until proven guilty: you should never remove an outlier just because it is a large value. Remove it only when you have a good reason to, such as suspicious measurements that are unlikely to be real data.

Handling Missing Data

Missing Data is a deceptively tricky issue in applied machine learning.

You cannot simply ignore missing values in your dataset. You must handle them in some way for the very practical reason that most algorithms do not accept missing values.

The 2 common ways of dealing with missing data are dropping the observations and imputing (filling in) the missing values, and both are sub-optimal.

Dropping missing values is sub-optimal because when you drop observations, you drop information. The fact that the value was missing may be informative in itself, and in the real world you often need to make predictions on new data even if some of the features are missing. Imputing is also sub-optimal because the value was originally missing; any value you fill in only reinforces patterns already present in the other features.

Missing data is like a missing puzzle piece. If you drop it, that's like pretending the puzzle slot isn't there. If you impute it, that's like trying to squeeze in a piece from somewhere else in the puzzle.

In short, you should always tell your algorithm that a value was missing because missingness is informative.

How to Handle Missing Data?

The best way to handle missing data for categorical features is to simply label them as "Missing". This tells the algorithm that the value was missing. This also gets around the technical requirement for no missing values.
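A hedged sketch of this for a hypothetical table with a categorical column:

-- Flag missing categorical values explicitly instead of leaving them NULL.
UPDATE CustomerFeedbackClean 
SET Category = 'Missing' 
WHERE Category IS NULL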

Data Cleansing

Data Cleansing Cycle (Methodology)

First off, identify the set of data that is critical.

Focus on High priority data, and start small. Create and put into place specific validation rules at this point to standardize and cleanse the existing data as well as automate this process for the future.
