[ST1501] Data Engineering
Topic B: Relational Data Model
A set of concepts that describes the structure of a database and the associated retrieval and update transactions.
Properties of a proper form table:
The name of each relation is unique
Every cell is single-valued
Attribute names within a relation are unique
All values of an attribute come from the same domain
The order of tuples/attributes does not matter
Each tuple (data point) in a relation is unique
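For illustration, here is a relation declared in SQL that satisfies all the properties above (table and column names are made up for this sketch):
-- Hypothetical Student relation: every cell is single-valued,
-- attribute names are unique within the relation, each attribute
-- draws its values from one domain, and the Primary Key makes
-- every tuple unique.
CREATE TABLE Student (
    StudentID CHAR(8) PRIMARY KEY,   -- guarantees tuple uniqueness
    FullName  VARCHAR(60) NOT NULL,
    BirthDate DATE                   -- all values come from the DATE domain
);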
Normalization
A table is said to be improper if it violates any one, or a combination, of the requirements stated above.
A normalized table satisfies all of the requirements above.
Normalization usually involves several steps: transform the improper-form relation (0NF) into First Normal Form (1NF), then Second Normal Form (2NF), and finally Third Normal Form (3NF). For most tables, 3NF is deemed sufficient for them to be considered normalized.
Process of Normalization
From 0NF to 1NF
Remove all repeating groups, i.e. make every cell contain one and only one value (or NULL).
Identify the Primary Key; it can be simple or composite depending on the contextual data.
Display the relation in the Relational Heading Format.
At 1NF, there are NO REPEATING groups
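A minimal sketch of this step in SQL, assuming a hypothetical Order relation whose Items cell originally held a repeating group such as 'pen, ink, paper':
-- 1NF version: one item per row, no repeating groups.
-- The composite Primary Key (OrderID, Item) keeps every tuple unique.
CREATE TABLE OrderItem (
    OrderID INT,
    Item    VARCHAR(40),
    Qty     INT,
    PRIMARY KEY (OrderID, Item)   -- composite key chosen from the contextual data
);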
From 1NF to 2NF
Identify the Anomalies:
INSERT Anomaly: One cannot insert a record into a table unless all parts of the Primary Key have a value. [violates the Entity Integrity Rule]
UPDATE Anomaly: One must change a value on several records instead of in one location. Some records may be missed during the update, giving rise to data integrity problems. [data inconsistency due to data redundancy]
DELETE Anomaly: One inevitably deletes other information that should be kept when deleting a particular record. [information is lost when rows are deleted]
Establish Functional Dependencies (FD):
Split the Primary Key into its partial parts and establish the determinants and dependents.
For every Functional Dependency, create a new relation with a meaningful name. Make the determinant the Primary Key for the newly created relation and the dependents the non-key attributes.
At 2NF, every non-key attribute is functionally dependent on the whole of its primary key.
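For example (a hypothetical 1NF relation), suppose OrderItem(OrderID, Item, Qty, ItemPrice) has the partial dependency Item -> ItemPrice, where Item is only part of the composite key. The split could look like this:
-- New relation for the FD Item -> ItemPrice:
-- the determinant becomes the Primary Key, the dependent a non-key attribute.
CREATE TABLE Item (
    Item      VARCHAR(40) PRIMARY KEY,
    ItemPrice DECIMAL(8,2)
);

-- The original relation keeps only attributes that depend on the whole key;
-- Item stays behind as a Foreign Key to the new relation.
CREATE TABLE OrderItem (
    OrderID INT,
    Item    VARCHAR(40) REFERENCES Item(Item),
    Qty     INT,
    PRIMARY KEY (OrderID, Item)
);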
From 2NF to 3NF
Identify the Functional Dependencies between the non-key attributes (Transitive Dependency (TD)).
For every TD, create a new relation with a meaningful name. Make the determinant the Primary Key for the newly created relation and the dependents as non-key attributes.
Retain the Primary Key as a Foreign Key in the original relation where Transitive Dependency is found.
At 3NF, there is no functional dependency among the non-key attributes.
At 3NF, most tables will not exhibit any anomalies. They are good to be created in the database for holding data.
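As a sketch, take a hypothetical Employee(EmpID, EmpName, DeptID, DeptName) relation in 2NF with the transitive dependency EmpID -> DeptID -> DeptName between non-key attributes:
-- New relation for the TD DeptID -> DeptName.
CREATE TABLE Department (
    DeptID   INT PRIMARY KEY,
    DeptName VARCHAR(40)
);

-- DeptID is retained in the original relation as a Foreign Key.
CREATE TABLE Employee (
    EmpID   INT PRIMARY KEY,
    EmpName VARCHAR(60),
    DeptID  INT REFERENCES Department(DeptID)
);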
CHAR vs VARCHAR
CHAR has a fixed character count [always occupies the full declared length, roughly 1 byte per character in a single-byte character set]
VARCHAR has a variable character count [occupies only as many bytes as the actual string length requires]
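For instance (hypothetical column sizes; exact storage also depends on the DBMS and character set):
CREATE TABLE Customer (
    CountryCode CHAR(2),     -- always stored at the full 2-character length
    FullName    VARCHAR(60)  -- storage grows with the actual string length
);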

Relational Heading Format
e.g. RELATION_NAME(PrimaryKey, attribute2, attribute3, ...), with the Primary Key underlined
Concepts
Entity type
A collection of entities that have common properties and are of interest to an organization.
Identified by a name (singular noun) and a list of attributes.
Attributes
COMPOSITE Attributes
Attributes that comprise multiple components.
e.g. Address, Name
SIMPLE Attributes
Attributes that comprise a single component.
e.g. Phone number
DERIVED Attributes
Attributes that are derived from another set of attributes.
e.g. Age (which can be derived from Date of Birth)
Relationships
Association between 2 entities
Has a verb-based name
Multiple relationships can exist between the same 2 entity types when they are associated in 2 (or more) distinct ways.
Structural Constraints

Cardinality Constraints
1 : 1 relationship
1 : Many relationship
Many : Many relationship
Participation Constraints
Weak Entity & Identification Dependency

Generalization Hierarchy
also known as "subsumption (IS-A) relationship"
It is a collection of entity types that are arranged in a hierarchical structure
Attributes that are common to all entities are associated with the Supertype
Attributes that are unique to a particular subtype are associated with the subtypes (does not affect other subtypes)
Subtype entities possess general attributes from the Supertype as well as subtype-specific attributes.

Disjointness Constraints
Disjoint
An instance of the super-type is either sub-type 1 or sub-type 2
It cannot simultaneously be a member of both
Non-Disjoint
Omission of the disjointness constraint implies that an instance of the super-type can be both sub-type 1 and sub-type 2 at the same time.
Completeness Constraints
Complete
Every instance of the super-type is either sub-type 1 or sub-type 2
The sub-types together make up the whole super-type
Incomplete
Some instances of the super-type belong to neither sub-type 1 nor sub-type 2
Many to Many Constraints
Cannot be directly implemented in a Relational Database
Violates the property that no cell of a relation can be multi-valued
Solution for Many to Many constraints:
Identify the Many : Many relationship
Replace it with 2 1 : Many identifying relationships

Explanation:
For example, a many-to-many relationship exists between customers and products: customers can purchase various products, and products can be purchased by many customers.
Relational database systems usually don't allow you to implement a direct many-to-many relationship between two tables. Consider the example of keeping track of invoices. If there were many invoices with the same invoice number and one of your customers inquired about that invoice number, you wouldn't know which invoice they were referring to. This is one reason for assigning a unique value to each invoice.
To avoid this problem, you can break the many-to-many relationship into two one-to-many relationships by using a third table, called a join table. Each record in a join table includes a match field that contains the value of the primary keys of the two tables it joins. (In the join table, these match fields are foreign keys.) These foreign key fields are populated with data as records in the join table are created from either table it joins.
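A sketch of such a join table for the customer-product example above (all names are illustrative; this assumes each customer-product pair is recorded once):
CREATE TABLE Customer (
    CustomerID INT PRIMARY KEY,
    CustName   VARCHAR(60)
);

CREATE TABLE Product (
    ProductID INT PRIMARY KEY,
    ProdName  VARCHAR(60)
);

-- Join table: each record matches one customer with one product.
-- The match fields are Foreign Keys, and together they form the Primary Key,
-- turning one Many : Many relationship into two 1 : Many relationships.
CREATE TABLE Purchase (
    CustomerID INT REFERENCES Customer(CustomerID),
    ProductID  INT REFERENCES Product(ProductID),
    PRIMARY KEY (CustomerID, ProductID)
);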
Topic C: SQL Queries
UPDATE all rows
UPDATE <tableName> SET colName1 = colValue1 [, colName2 = colValue2]
UPDATE specific rows
UPDATE <tableName> SET colName = colValue WHERE <condition>
DELETE all rows
DELETE FROM <tableName>
DELETE specific rows
DELETE FROM <tableName> WHERE <condition>
(DELETE removes whole rows; a column list is not valid here)
SELECT all rows
SELECT * FROM <tableName>
SELECT Specific columns
SELECT colName1, colName2, colName3 FROM <tableName>
INSERT data
INSERT INTO <tableName> (colName1, colName2, colName3)
VALUES (value01, value02, value03)
Retrieving DISTINCT values from columns
SELECT DISTINCT <colName> FROM <tableName>
Changing column Labels/Heading using AS
SELECT <colName01> AS 'newColName1' FROM <tableName>
Sorting results using ORDER BY
SELECT * FROM <tableName>
ORDER BY <colName used to order> ASC
This command displays all the columns and values from the table in ascending order according to the values in the specified column (use DESC instead of ASC for descending order).
Using the WHERE clause (single search condition)
SELECT colName1, colName2, colName3 FROM <tableName>
WHERE <condition>
Using the WHERE clause (search condition using a range: BETWEEN)
SELECT colName1, colName2, colName3 FROM <tableName>
WHERE <colName> BETWEEN value1 AND value2
Using the WHERE clause (search condition using a range: NOT BETWEEN)
SELECT colName1, colName2, colName3 FROM <tableName>
WHERE <colName> NOT BETWEEN value1 AND value2
Using the WHERE clause (compound search condition: OR)
SELECT colName1, colName2, colName3 FROM <tableName>
WHERE (condition1) OR (condition2)
Using the WHERE clause (search condition using a list of values: IN)
SELECT colName1, colName2, colName3 FROM <tableName>
WHERE <colName> IN (list of values)
Topic E: Data Integration & ETL Process
Components of a BI Solution:
Data Sources
Data Preparation
Data Warehouse
Data Analytics
Data Presentation
What is ETL?

The ETL process is critical to a successful data warehouse.
What is ETL - the tasks
Extract: The process of reading data from a variety of data sources.
Transform: The process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into the data warehouse.
Load: The process of uploading and updating data into the data warehouse.
Extract:
In this phase, the required data are identified and extracted from different sources. The extracted data are then held in temporary storage (Staging Area) until the "Transform" and "Load" phases can be executed.
Extract - Data Validation
During extraction, validation rules are applied to test whether the data have expected values essential to the data warehouse.
Data that fail validation are rejected and further processed to discover why they failed, and are remediated where possible.
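As a sketch, assuming a staging table named stg_sales with these columns, failed rows could be diverted to a reject table for later investigation:
-- Copy rows that fail the validation rules to a reject table
-- instead of letting them reach the data warehouse.
INSERT INTO stg_sales_rejects
SELECT *
FROM stg_sales
WHERE amount <= 0          -- rule: amounts must be positive
   OR customer_id IS NULL; -- rule: every sale needs a customer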

Transform:
In this phase, data are fetched from the staging area, "cleansed" and transformed. Data cleansing or data scrubbing is the process of detecting and correcting (or removing) data anomalies to improve data quality.
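A minimal cleansing sketch in SQL, assuming a staging table stg_customer with name and country columns:
-- Strip stray whitespace and standardize casing in place.
UPDATE stg_customer
SET name    = TRIM(name),
    country = UPPER(TRIM(country));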

Examples of "Dirty Data"

Load:
In this phase, data are loaded into the end target, usually a data warehouse or data mart.
This process varies widely depending on the requirements of the organization. Some data warehouses may add new data in a historical form at regular intervals - for example, hourly.
Some Data Warehouses may overwrite existing information with cumulative information.
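For example, an hourly historical load might append the staged rows with a load timestamp (table and column names are assumptions):
-- Append this batch to the warehouse fact table, stamping the load time
-- so each hourly run is kept in historical form rather than overwritten.
INSERT INTO dw_sales (sale_id, customer_id, amount, loaded_at)
SELECT sale_id, customer_id, amount, CURRENT_TIMESTAMP
FROM stg_sales;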

Data Quality
Proper data cleaning can make or break your project. Professional data scientists usually spend a very large portion of their time on this step.
Remove Unwanted observations
The first step in data cleaning is removing unwanted observations from your dataset. This includes duplicate and irrelevant observations (see the dedup sketch after the list below).
Duplicate observations most frequently arise during data collection, such as when you:
Combine datasets from multiple places
Scrape Data
Receive data from clients/other departments
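One way to drop exact duplicates in SQL (hypothetical table names; CREATE TABLE ... AS syntax varies by DBMS):
-- Keep exactly one copy of each fully identical row.
CREATE TABLE observations_clean AS
SELECT DISTINCT *
FROM observations_raw;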
Fix Structural Errors
Check for typos
Check for inconsistent capitalization
Check for mislabeled classes (see the sketch after this list)
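For instance, capitalization variants and a typo in a class label could be collapsed into one label (table, column, and values are hypothetical):
-- Merge capitalization variants and the typo 'compositon'
-- into a single standardized class label.
UPDATE observations_clean
SET category = 'composition'
WHERE LOWER(TRIM(category)) IN ('composition', 'compositon');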
Filtering Unwanted Outliers
For example, linear regression models are less robust to outliers than decision tree models. Removing an outlier for a legitimate reason will generally help your model's performance.
When to remove outliers?
Outliers are innocent until proven guilty: never remove an outlier just because it is a large value. Remove one only when you have a legitimate reason, such as a suspicious measurement that is unlikely to be real data.
Handling Missing Data
You cannot simply ignore missing values in your dataset. You must handle them in some way for the very practical reason that most algorithms do not accept missing values.
2 common ways of dealing with missing data are dropping the observations and imputing (filling in) the values, and both have drawbacks:
Dropping missing values is sub-optimal because when you drop observations, you drop information. The fact that a value was missing may be informative in itself, and in the real world you often need to make predictions on new data even when some features are missing.
Imputing is also imperfect. Missing data is like a missing puzzle piece: if you drop it, that's like pretending the puzzle slot isn't there; if you impute it, that's like trying to squeeze in a piece from somewhere else in the puzzle.
How to Handle Missing Data?
The best way to handle missing data for categorical features is to simply label them as "Missing". This tells the algorithm that the value was missing. This also gets around the technical requirement for no missing values.
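That labelling step might look like this in SQL (hypothetical table and column):
-- Replace NULLs in a categorical column with an explicit 'Missing' label,
-- so the fact that the value was absent becomes usable information.
UPDATE observations_clean
SET category = 'Missing'
WHERE category IS NULL;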
Data Cleansing
Data Cleansing Cycle (Methodology)
First off, identify the set of data that is critical.
Focus on high-priority data, and start small. Create and put in place specific validation rules at this point to standardize and cleanse the existing data, and to automate this process for the future.