[ST1501] Data Engineering
Topic B: Relational Data Model
Set of concepts that describes the structure of a database and the associated Retrieval and Update transactions.
Properties of a proper form table:
The name of each Relation is unique
Each cell is single-valued
Unique attribute names in the relation
Values of each attribute come from the same domain
Order of tuples/attributes does not matter
Each tuple (data point) in a relation is unique
The process usually involves several steps to transform an improper form relation (0NF) into first normal form (1NF), then second normal form (2NF), and finally third normal form (3NF). For most tables, 3NF is sufficient for them to be considered normalized.
Normalization
A table is said to be in improper form if it violates any one, or a combination, of the requirements stated above.
If a table is normalized, it should satisfy all the requirements above.
Process of Normalization
From 0NF to 1NF
Remove all repeating groups so that every cell contains one and only one value (or a NULL value).
Identify the Primary Key; it can be simple or composite depending on the contextual data.
Display the relation in the Relational Heading Format.
At 1NF, there are NO REPEATING groups
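A minimal sketch of the result, using a hypothetical student/course example (all table and column names are assumptions, not from the notes): in 0NF one cell held a list of courses; in 1NF every cell holds one value and the Primary Key becomes composite.

```sql
-- Hypothetical 1NF relation: each course now sits in its own row,
-- so the Primary Key is the composite (StudentID, CourseCode).
CREATE TABLE StudentCourse (
    StudentID   INT,
    StudentName VARCHAR(50),
    CourseCode  CHAR(6),
    CourseTitle VARCHAR(50),
    PRIMARY KEY (StudentID, CourseCode)
);
```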
From 1NF to 2NF
Identify the Anomalies:
INSERT Anomaly: One cannot insert a record into a table unless all parts of the Primary Key have a value in them. [violates the Entity Integrity Rule]
UPDATE Anomaly: One must change a value on several records instead of in 1 location. This may give rise to data integrity problems as some records may be missed during the updates. [data inconsistency due to data redundancy]
DELETE Anomaly: One inevitably deletes other information that is not supposed to be deleted when one wants to delete a particular record. [information is lost when rows are deleted]
Establish Functional Dependencies (FD):
Split the composite Primary Key into its partial parts and establish the determinant and dependents of each dependency.
For every Functional Dependency, create a new relation with a meaningful name. Make the determinant the Primary Key of the newly created relation and the dependents its non-key attributes.
At 2NF, every non-key attribute is functionally dependent on the whole of its primary key.
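Continuing the same hypothetical example (names assumed): StudentName depends only on StudentID, and CourseTitle only on CourseCode, i.e. on parts of the composite key. Each Functional Dependency becomes its own relation, with the determinant as the Primary Key.

```sql
CREATE TABLE Student (
    StudentID   INT PRIMARY KEY,          -- determinant of StudentID -> StudentName
    StudentName VARCHAR(50)
);

CREATE TABLE Course (
    CourseCode  CHAR(6) PRIMARY KEY,      -- determinant of CourseCode -> CourseTitle
    CourseTitle VARCHAR(50)
);

CREATE TABLE Enrolment (                  -- the remaining composite key
    StudentID  INT,
    CourseCode CHAR(6),
    PRIMARY KEY (StudentID, CourseCode),
    FOREIGN KEY (StudentID)  REFERENCES Student(StudentID),
    FOREIGN KEY (CourseCode) REFERENCES Course(CourseCode)
);
```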
From 2NF to 3NF
Identify the Functional Dependencies between the non-key attributes (Transitive Dependencies (TD))
For every TD, create a new relation with a meaningful name. Make the determinant the Primary Key for the newly created relation and the dependents as non-key attributes.
Retain the Primary Key as a Foreign Key in the original relation where Transitive Dependency is found.
At 3NF, there is no functional dependency among the non-key attributes.
At 3NF, most tables will not exhibit any anomalies. They are good to be created in the database for holding data.
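As a sketch of this step, suppose the hypothetical Student relation also held TutorID and TutorName. TutorName depends on TutorID, a non-key attribute (a Transitive Dependency), so Tutor becomes its own relation and TutorID is retained in Student as a Foreign Key (all names assumed).

```sql
CREATE TABLE Tutor (
    TutorID   INT PRIMARY KEY,
    TutorName VARCHAR(50)
);

CREATE TABLE Student (                    -- revised Student: TutorName removed
    StudentID   INT PRIMARY KEY,
    StudentName VARCHAR(50),
    TutorID     INT,                      -- determinant retained as a Foreign Key
    FOREIGN KEY (TutorID) REFERENCES Tutor(TutorID)
);
```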
CHAR vs VARCHAR
CHAR(n) stores a fixed character count; shorter values are padded to the declared length [typically 1 byte per character in single-byte character sets]
VARCHAR(n) stores a variable character count [takes up an amount of bytes based on the actual character length of the value]
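A small illustration (column names assumed): CHAR(2) always stores exactly two characters, padding shorter values, while VARCHAR(50) stores only as many characters as the value actually contains.

```sql
CREATE TABLE Customer (
    CountryCode CHAR(2),        -- fixed length, e.g. 'SG', 'US'; shorter values are padded
    FullName    VARCHAR(50)     -- variable length, storage grows with the value
);
```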

Relational Heading Format
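Usually written as RelationName(Attribute1, Attribute2, ...) with the Primary Key underlined, e.g. STUDENT(StudentID, StudentName, TutorID) where StudentID is the Primary Key (names here are an assumed example).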
Concepts
Entity type
A collection of entities that have common properties and are of interest to an organization.
Identified by a name (singular noun) and a list of attributes.
Attributes
COMPOSITE Attributes
Attributes that comprise multiple components
e.g. Address, Name
SIMPLE Attributes
Attributes that comprise a single component
e.g. Phone number
DERIVED Attributes
Attributes whose values are derived from another attribute or set of attributes
e.g. Age (derived from Date of Birth)
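A minimal sketch of how these attribute types typically end up in a table (all names assumed): a COMPOSITE attribute such as Address is stored as its component columns, a SIMPLE attribute as a single column, and a DERIVED attribute such as Age is not stored at all but computed from DateOfBirth when needed.

```sql
CREATE TABLE Member (
    MemberID    INT PRIMARY KEY,
    PhoneNumber VARCHAR(15),     -- simple attribute
    Street      VARCHAR(60),     -- components of the composite attribute Address
    City        VARCHAR(30),
    PostalCode  CHAR(6),
    DateOfBirth DATE             -- Age is derived from this, not stored
);

-- Deriving Age at query time (exact date arithmetic varies by SQL dialect):
SELECT MemberID, (CURRENT_DATE - DateOfBirth) / 365 AS Age
FROM Member;
```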
Relationships
Association between 2 entities
Has a verb-based name
Multiple relationships can be created when the 2 entities are associated through 2 distinct relationships
Structural Constraints

Cardinality Constraints
1 : 1 Relationship

1 : Many Relationship

Many : Many Relationship

Participation Constraints
Weak Entity & Identification Dependency

Generalization Hierarchy
also known as "subsumption(IS-A) Relationship"
It is a collection of entity types that are arranged in a hierarchical structure
Attributes that are common to all entities are associated with the Supertype
Attributes that are unique to a particular subtype are associated with the subtypes (does not affect other subtypes)
Subtype entities possess general attributes from the Supertype as well as subtype-specific attributes.
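One common way to implement such a hierarchy in tables (a sketch with assumed names, not the only approach): the Supertype table holds the common attributes, and each Subtype table holds only its specific attributes, sharing the Supertype's Primary Key as both its Primary Key and a Foreign Key.

```sql
CREATE TABLE Employee (              -- Supertype: common attributes
    EmployeeID INT PRIMARY KEY,
    Name       VARCHAR(50),
    HireDate   DATE
);

CREATE TABLE Pilot (                 -- Subtype: subtype-specific attributes only
    EmployeeID INT PRIMARY KEY,
    LicenceNo  VARCHAR(20),
    FOREIGN KEY (EmployeeID) REFERENCES Employee(EmployeeID)
);

CREATE TABLE Engineer (              -- Subtype
    EmployeeID     INT PRIMARY KEY,
    Specialisation VARCHAR(30),
    FOREIGN KEY (EmployeeID) REFERENCES Employee(EmployeeID)
);
```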

Disjointness Constraints
Disjointed
Super-type is either sub-type 1 or sub-type 2
Super-type cannot simultaneously be a member of both
Non-Disjointed
Omission of the disjoint constraints implies that a super-type can be both sub-type 1 and sub-type 2 at the same time.
Completeness Constraints
Completed
All super-types are either sub-type 1 or sub-type 2
All the sub-types together make up the super-type
Non-Completed
Some instances of the super-type belong to neither sub-type 1 nor sub-type 2
Many to Many Constraints
Cannot be directly implemented in a Relational Database
Violates the property that no cell of a relation can be multi-valued
Solution for Many to Many constraints:
Identify the Many : Many relationship
Replace it with two 1 : Many identifying relationships

Explanation:
For example, a many-to-many relationship exists between customers and products: customers can purchase various products, and products can be purchased by many customers.
Relational database systems usually don't allow you to implement a direct many-to-many relationship between two tables. Consider the example of keeping track of invoices. If there were many invoices with the same invoice number and one of your customers inquired about that invoice number, you wouldn't know which invoice they were referring to. This is one reason for assigning a unique value to each invoice.
To avoid this problem, you can break the many-to-many relationship into two one-to-many relationships by using a third table, called a join table. Each record in a join table includes a match field that contains the value of the primary keys of the two tables it joins. (In the join table, these match fields are foreign keys.) These foreign key fields are populated with data as records in the join table are created from either table it joins.
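A sketch of the join-table approach described above (table and column names assumed): Customer and Product each have a 1 : Many relationship with CustomerProduct, whose composite Primary Key is made up of the two Foreign Keys.

```sql
CREATE TABLE Customer (
    CustomerID   INT PRIMARY KEY,
    CustomerName VARCHAR(50)
);

CREATE TABLE Product (
    ProductID   INT PRIMARY KEY,
    ProductName VARCHAR(50)
);

CREATE TABLE CustomerProduct (       -- join table resolving the Many : Many relationship
    CustomerID INT,
    ProductID  INT,
    PRIMARY KEY (CustomerID, ProductID),
    FOREIGN KEY (CustomerID) REFERENCES Customer(CustomerID),
    FOREIGN KEY (ProductID)  REFERENCES Product(ProductID)
);
```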
Topic C: SQL Queries
UPDATE All rows
UPDATE Specific rows
DELETE All rows
DELETE Specific rows
SELECT All rows
SELECT Specific columns
INSERT Data
Retrieving DISTINCT values from columns
Changing column Labels/Headings using AS
Sorting Results using ORDER BY: this command will display all the columns and values from the table in ascending order according to the values found in the column specified.
Filtering using WHERE clause (single search condition)
Filtering using WHERE clause (search condition using range: BETWEEN)
Filtering using WHERE clause (search condition using range: NOT BETWEEN)
Filtering using WHERE clause (search condition using OR)
Filtering using WHERE clause (search condition using IN)
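One example of each statement type listed above, written against a hypothetical Staff table (the table, column names and values are assumptions for illustration):

```sql
UPDATE Staff SET Salary = Salary * 1.05;                        -- UPDATE all rows
UPDATE Staff SET Salary = 4000 WHERE StaffID = 'S001';          -- UPDATE specific rows
DELETE FROM Staff;                                              -- DELETE all rows
DELETE FROM Staff WHERE Branch = 'East';                        -- DELETE specific rows
SELECT * FROM Staff;                                            -- SELECT all rows
SELECT Name, Salary FROM Staff;                                 -- SELECT specific columns
INSERT INTO Staff (StaffID, Name, Branch, Salary)               -- INSERT data
VALUES ('S002', 'Tan Ah Kow', 'West', 3500);
SELECT DISTINCT Branch FROM Staff;                              -- DISTINCT values from a column
SELECT Name AS "Staff Name" FROM Staff;                         -- changing a column label using AS
SELECT * FROM Staff ORDER BY Salary;                            -- sorting results (ascending)
SELECT * FROM Staff WHERE Branch = 'East';                      -- WHERE: single search condition
SELECT * FROM Staff WHERE Salary BETWEEN 3000 AND 4000;         -- WHERE: BETWEEN
SELECT * FROM Staff WHERE Salary NOT BETWEEN 3000 AND 4000;     -- WHERE: NOT BETWEEN
SELECT * FROM Staff WHERE Branch = 'East' OR Branch = 'West';   -- WHERE: OR
SELECT * FROM Staff WHERE Branch IN ('East', 'West', 'North');  -- WHERE: IN
```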
Topic E: Data Integration & ETL Process
Components of a BI Solution:
Data Sources
Data Preparation
Data Warehouse
Data Analytics
Data Presentation
What is ETL?

The ETL process is critical to a successful data warehouse.
ETL must ensure that the data loaded into the data warehouse is high-quality, accurate, relevant, useful, and accessible.
ETL is the most time-consuming phase in building a data warehouse as routines must be developed to select required fields from numerous sources of data.
What is ETL - the tasks
Extract: The process of reading data from a variety of data sources.
Transform: The process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into the data warehouse.
Load: The process of uploading and updating data into the data warehouse.
Extract:
In this phase, the required data are identified and extracted from different sources. The extracted data are then held in temporary storage (Staging Area) until the "Transform" and "Load" phases can be executed.
Extract - Data Validation
During extraction, validation rules are applied to test whether the data have expected values essential to the data warehouse.
Data that fail validation are rejected and further processed to discover why they failed and to remediate the problem if possible.
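A minimal sketch of extraction into a staging table with one validation rule (all table and column names assumed): rows that fail validation are copied to a reject table for investigation and removed from the staging data.

```sql
INSERT INTO stg_sales (order_id, customer_id, order_date, amount)
SELECT order_id, customer_id, order_date, amount
FROM   src_orders;                               -- extract from a source system into staging

INSERT INTO stg_sales_rejects                    -- rows failing the validation rule
SELECT * FROM stg_sales
WHERE  customer_id IS NULL OR amount <= 0;

DELETE FROM stg_sales
WHERE  customer_id IS NULL OR amount <= 0;
```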

Transform:
In this phase, data are fetched from the staging area, "cleansed" and transformed. Data cleansing or data scrubbing is the process of detecting and correcting (or removing) data anomalies to improve data quality.
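A sketch of simple cleansing rules applied in the staging area (names and placeholder values assumed): trimming stray spaces, standardising letter case, and treating placeholder strings as missing values.

```sql
UPDATE stg_sales
SET    customer_id = TRIM(customer_id),          -- remove leading/trailing spaces
       region      = UPPER(region);              -- standardise capitalisation

UPDATE stg_sales
SET    region = NULL                             -- placeholder strings become true NULLs
WHERE  region IN ('N.A.', 'UNKNOWN', '');
```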

Examples of "Dirty Data"

Load:
In this phase, data are loaded into the end target, usually a data warehouse or data mart.
This process varies widely depending on the requirements of the organization. Some data warehouses may add new data in a historical form at regular intervals - for example, hourly.
Some Data Warehouses may overwrite existing information with cumulative information.
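A sketch of an incremental load (names assumed): only rows not already in the warehouse table are appended, so history accumulates at each scheduled run.

```sql
INSERT INTO dw_sales (order_id, customer_id, order_date, amount)
SELECT s.order_id, s.customer_id, s.order_date, s.amount
FROM   stg_sales s
WHERE  NOT EXISTS (
           SELECT 1 FROM dw_sales d
           WHERE  d.order_id = s.order_id        -- skip rows already loaded
       );
```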

Data Quality
Proper data cleaning can make or break your project. Professional data scientists usually spend a very large portion of their time on this step.
Remove Unwanted observations
The first step to data cleaning is removing unwanted observations from your dataset. This includes duplicate or irrelevant observations.
Duplicate observations most frequently arise during data collection, such as when you:
Combine datasets from multiple places
Scrape Data
Receive data from clients/other departments
Irrelevant observations are those that don't actually fit the specific problem that you're trying to solve.
For example, if you were building a model for Single-Family homes only, you wouldn't want observations for apartments in there.
This is also a great time to review your charts during EDA (Exploratory Data Analysis). You can look at the distribution charts for categorical features to see if there are any classes that shouldn't be there.
Checking for irrelevant observations before engineering features can save you many headaches down the road.
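A sketch of both removals on a hypothetical listings table (names and values assumed): keep one row per duplicate group, and drop observations that are irrelevant to a Single-Family model.

```sql
DELETE FROM listings
WHERE  listing_id NOT IN (
           SELECT MIN(listing_id)                -- keep one row per duplicate group
           FROM   listings
           GROUP  BY address, sale_date, price
       );

DELETE FROM listings
WHERE  property_type <> 'Single-Family';         -- irrelevant observations
```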
Fix Structural Errors
Check for typos
Check for inconsistent capitalizations
Mislabeled Classes
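For example (values assumed), typos and inconsistent capitalisation can be collapsed into one canonical class label so they are not treated as separate classes:

```sql
UPDATE listings
SET    roof_type = 'Composition'                 -- canonical label
WHERE  LOWER(roof_type) IN ('composition', 'compositon');
```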
Filtering Unwanted Outliers
For example, linear regression models are less robust to outliers than decision tree models. If you have a legitimate reason to remove an outlier, it will help your model's performance.
When to remove outliers?
Outliers are innocent until proven guilty. You should never remove an outlier just because it is a "big number"; remove it only when you have a legitimate reason, such as suspicious measurements that are unlikely to be real data.
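A sketch of such a legitimate removal (column and threshold assumed): dropping only measurements that cannot be real data, rather than every large value.

```sql
DELETE FROM listings
WHERE  sqft <= 0 OR sqft > 50000;                -- suspicious measurements unlikely to be real
```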
Handling Missing Data
You cannot simply ignore missing values in your dataset. You must handle them in some way for the very practical reason that most algorithms do not accept missing values.
2 Common ways of dealing with missing data:
Dropping missing values is sub-optimal because when you drop observations, you drop information. The fact that the value was missing may be informative in itself. Plus, in the real world, you often need to make predictions on new data even if some of the features are missing.
Imputing missing values is sub-optimal because the value was originally missing but you filled it in, which always leads to a loss in information, no matter how sophisticated your imputation method is.
Missing data is like a missing puzzle piece. If you drop it, that's like pretending the puzzle slot isn't there. If you impute it, that's like trying to squeeze in a piece from somewhere else in the puzzle.
How to Handle Missing Data?
The best way to handle missing data for categorical features is to simply label them as "Missing". This tells the algorithm that the value was missing. This also gets around the technical requirement for no missing values.
For missing Numerical Data, you should flag and fill in the values.
Flag the observations with an indicator variable of missingness
Fill the original missing value with 0 just to meet the technical requirement of no missing values.
By using this technique of flagging and filling, you are essentially allowing the algorithm to estimate the optimal constant for missingness, instead of just filling it in with the mean.
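A sketch of both techniques (column names assumed): label missing categorical values as "Missing", and for a numerical column add an indicator of missingness, then fill the original value with 0.

```sql
SELECT listing_id,
       COALESCE(garage_type, 'Missing')              AS garage_type,       -- categorical: label as Missing
       CASE WHEN lot_size IS NULL THEN 1 ELSE 0 END  AS lot_size_missing,  -- indicator variable
       COALESCE(lot_size, 0)                         AS lot_size           -- fill with 0
FROM   listings;
```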
Data Cleansing
Data Cleansing Cycle (Methodology)
First off, identify the set of data that is critical.
Focus on High priority data, and start small. Create and put into place specific validation rules at this point to standardize and cleanse the existing data as well as automate this process for the future.
Go through the data you already have to see what is missing, what can be thrown out, and what gaps, if any, exist between them.
Identify a set of resources to handle and manually cleanse exceptions to your rules. The amount of manual intervention required is directly correlated with the level of data quality you find acceptable.
Begin to standardize and cleanse the flow of new data as it enters the system by creating scripts/workflows
These can be run in real time or in batch (daily, weekly, monthly). These routines can be applied to new data or to previously keyed-in data.
Manual intervention is especially important for records that cannot be automatically corrected.
It is important to identify the correct way of getting hold of the missing data.
Set up a periodic review so that you can monitor issues before they become a major problem.
Bring the whole process full circle. Revisit your plans from the first step and reevaluate. Can priorities be changed? Do the rules you implemented still fit into your overall business strategy?
Conduct periodic reviews to make sure that your data cleansing is running with smoothness and accuracy.
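As a sketch of what such a periodic review could run (table and column names assumed), a simple data-quality query can report missing values and duplicates on a schedule:

```sql
SELECT COUNT(*)                                             AS total_rows,
       SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS missing_customer_id,
       COUNT(*) - COUNT(DISTINCT order_id)                  AS duplicate_order_ids
FROM   stg_sales;
```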