Data 101 — Almost all you need to know — Part II
CAP Theorem, ACID, BASE, ETL, ELT, OLTP, OLAP and lots more — all in one go
This is the second part of a multi-part introduction to data. You can read the first part here for an introduction to data types, data processing options, the CAP Theorem, ACID, BASE, and more. This part introduces further concepts, so let’s get started.
Key concepts across Databases / Data Warehouses
ETL vs. ELT vs. EL
Data analytics usually involves three stages: ingesting and collecting the data, processing it, and exploring or visualizing the results.
The data processing stage occurs after the data has been ingested and collected. Data processing takes the data in its raw form, cleans it, and converts it into a more meaningful format. The processing mechanism can take the following approaches to retrieving the ingested data, transforming it, and making it available in the destination data source for further exploration:
- EL — EL stands for Extract and Load. This approach is used when the source and destination share the same schema and the data doesn’t need cleaning. For example, EL is useful for batch loads of historical data, scheduled periodic loads of log files, and so on.
- ELT — ELT stands for Extract, Load, and Transform. This approach is used when transformations can be performed directly in the destination. For example, it suits datasets where you are not yet sure what transformations will be needed to make the data usable, or where the transformations can be expressed easily as SQL.
- ETL — ETL stands for Extract, Transform, and Load. This approach is used when complex transformations (binary data conversions, or transformations not otherwise possible in the destination) must be performed, or when the transformation reduces the size of the data before it is loaded.
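The difference between the three approaches comes down to where the transform step sits relative to the load. A minimal sketch, using plain Python functions as stand-ins for real pipeline stages (the record fields here are invented for illustration):

```python
def extract():
    # Raw records as they arrive from the source system.
    return [{"name": " Alice ", "amount": "10"}, {"name": "Bob", "amount": "32"}]

def transform(rows):
    # Clean and convert: trim names, cast amounts to integers.
    return [{"name": r["name"].strip(), "amount": int(r["amount"])} for r in rows]

def load(rows, destination):
    destination.extend(rows)

# ETL: transform before loading, so only clean data reaches the destination.
etl_dest = []
load(transform(extract()), etl_dest)

# ELT: load the raw data first, then transform it inside the destination.
elt_dest = []
load(extract(), elt_dest)
elt_dest[:] = transform(elt_dest)

# EL would be load(extract(), dest) with no transform step at all.
assert etl_dest == elt_dest  # same end state, different ordering of steps
```

Both pipelines end in the same state here; in practice the choice is driven by where the transformation is cheapest and most expressive to run.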
Semi-structured data formats
If you are a data engineer, the following data formats are your bread and butter. For everybody else, here is a short introduction:
- Avro is a row-based format. Each record contains a header that describes the structure of the data in the record. This header is stored as JSON. The data is stored as binary information. An application uses the information in the header to parse the binary data and extract the fields it contains. Avro is a very good format for compressing data and minimizing storage and network bandwidth requirements.
- ORC (Optimized Row Columnar format) organizes data into columns rather than rows. It was developed by Hortonworks to optimize read and write operations in Apache Hive. An ORC file contains stripes of data; each stripe holds a subset of rows, with the values for each column stored together. A stripe contains an index into the rows in the stripe, the row data itself, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.
- Parquet is another columnar data format. A Parquet file contains row groups. Data for each column is stored together in the same row group. Each row group contains one or more chunks of data. A Parquet file includes metadata that describes the set of rows found in each chunk. An application can use this metadata to quickly locate the correct chunk for a given set of rows, and retrieve the data in the specified columns for these rows. Parquet specializes in storing and processing nested data types efficiently. It supports very efficient compression and encoding schemes.
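The row-based versus columnar distinction is easiest to see laid out side by side. An illustrative sketch in plain Python (the field names are invented; real Avro/ORC/Parquet files add headers, metadata, and binary encoding on top of this idea):

```python
rows = [
    {"id": 1, "country": "US", "amount": 10},
    {"id": 2, "country": "US", "amount": 20},
    {"id": 3, "country": "DE", "amount": 30},
]

# Row-based layout (Avro-style): complete records stored one after another.
row_layout = rows

# Columnar layout (ORC/Parquet-style): all values of a column stored together.
column_layout = {
    "id": [r["id"] for r in rows],
    "country": [r["country"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# An aggregate over one column touches only that column's values in the
# columnar layout, and runs of repeated values ("US", "US") compress well.
total = sum(column_layout["amount"])
print(total)  # 60
```

This is why columnar formats dominate analytics workloads (scan a few columns over many rows), while row-based formats suit record-at-a-time reads and writes.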
Index vs. Clustered Index
Index — An index on a column in a table works much like the index in a book. A book index contains a sorted set of references, with the pages on which each reference occurs. When you create an index in a database, you specify a column from the table, and the index contains a copy of this data in sorted order, with pointers to the corresponding rows in the table.
Using the index, a query that filters on the indexed column (say, CustomerId = “C1”) returns results faster because the database reads only the data for that CustomerId. Note that the index is a separate entity from the table itself. An index might consume additional storage space, and each time you insert, update, or delete data in a table, the indexes for that table must be maintained. This additional work can slow down insert, update, and delete operations, and incur additional processing charges.
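A quick way to see an index in action is Python’s built-in sqlite3 module. This is a hedged sketch; the table and column names (Orders, CustomerId) are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (OrderId INTEGER, CustomerId TEXT, Amount REAL)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [(1, "C1", 9.99), (2, "C2", 15.00), (3, "C1", 4.50)],
)

# The index is a sorted copy of the CustomerId values plus row pointers.
conn.execute("CREATE INDEX idx_orders_customer ON Orders (CustomerId)")

# The query planner can now seek directly to the 'C1' entries instead of
# scanning the whole table, as the query plan shows.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Orders WHERE CustomerId = 'C1'"
).fetchall()
print(plan[0][-1])  # e.g. "SEARCH Orders USING INDEX idx_orders_customer (CustomerId=?)"
```

Without the CREATE INDEX line, the same plan would report a full scan of Orders.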
Clustered Index — Unlike normal indexes, a clustered index physically reorganizes a table by the index key. This arrangement can improve the performance of queries still further, because the relational database management system doesn’t have to follow references from the index to find the corresponding data in the underlying table.
Views
A view is a virtual table based on the result set of a query. In the simplest case, you can think of a view as a window onto specified rows in an underlying table. This functionality is very useful for reporting, or for restricted authorization where you want to grant access to only specific data from a table. For example, you can create a view that contains only the P1 products from a products table.
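A sketch of such a view using sqlite3, with a hypothetical Products table (the ProductId values and names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (ProductId TEXT, Name TEXT, Price REAL)")
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?)",
    [("P1", "Widget", 2.5), ("P2", "Gadget", 7.0), ("P1", "Widget XL", 4.0)],
)

# The view stores no data of its own: its query runs against the underlying
# table each time the view is read.
conn.execute(
    "CREATE VIEW P1Products AS SELECT * FROM Products WHERE ProductId = 'P1'"
)
rows = conn.execute("SELECT Name FROM P1Products ORDER BY Name").fetchall()
print(rows)  # [('Widget',), ('Widget XL',)]
```

A user granted access to P1Products alone never sees the P2 rows, which is the restricted-authorization use case mentioned above.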
Non-relational / NoSQL Databases
I’ll start by saying there is no such thing as non-relational data. Relations between the data don’t just disappear — the only thing that changes is how you organize those relations. The term “non-relational data” has, however, somehow stood the test of time.
Instead of saving data in a highly normalized form, where related data is generally stored in separate tables, NoSQL databases store all the related data together in the same “table”. Depending on the type of NoSQL database, the notion of a table changes and follows a different convention. Types of NoSQL databases include:
- Key-Value Store — A key-value database is a type of non-relational database that uses a simple key-value method to store data. A key-value database stores data as a collection of key-value pairs in which a key serves as a unique identifier. Both keys and values can be anything, ranging from simple objects to complex compound objects. Key-value databases are highly partitionable and allow horizontal scaling at scales that other types of databases cannot achieve.
Write operations are usually restricted to inserts and deletes. If you need to update an item, you must retrieve the item, modify it in memory (in the application), and then write it back to the database, overwriting the original (effectively a delete and an insert).
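The retrieve/modify/write-back pattern described above can be sketched with a plain dict standing in for a key-value store (the key scheme and profile fields are made up for illustration):

```python
store = {}

# Insert: the value is opaque to the store; here it happens to be a profile dict.
store["user:42"] = {"name": "Alice", "visits": 1}

# "Update" = retrieve the item, modify it in application memory, then write
# it back whole, overwriting the original value.
profile = store["user:42"]       # retrieve
profile["visits"] += 1           # modify in memory
store["user:42"] = profile       # write back (effectively delete + insert)

# Delete
del store["user:42"]
```

Real key-value stores add persistence, partitioning, and concurrency control, but expose essentially this same narrow interface.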
- Document database — A document database stores each entity as a document, typically in JSON or a similar semi-structured format, with all the related data for that entity held together in a single document. The document model works well with use cases such as catalogs, user profiles, and content management systems where each document is unique and evolves over time.
- Graph database — Graph databases enable you to store entities, but the main focus is on the relationships that these entities have with each other. Graph databases are purpose-built to store and navigate relationships. Relationships are first-class citizens in graph databases, and most of the value of graph databases is derived from these relationships. For example, in an Employee graph it is quite easy to find all the employees who report to Sarah, including the indirect reports. Finding the same result in a non-graph database would require multiple queries. So if you find yourself having to write multiple queries to get the results you want, it might be worth your while to look at a graph database.
Graph databases use nodes to store data entities, and edges to store relationships between entities. An edge always has a start node, end node, type, and direction, and an edge can describe parent-child relationships, actions, ownership, and the like.
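The “all reports, including indirect ones” traversal can be sketched with an adjacency list standing in for the graph, where each manager-to-report edge would carry a type like “manages” in a real graph database. All the names here are illustrative:

```python
from collections import deque

reports_to = {
    "Sarah": ["Ben", "Dana"],
    "Ben": ["Chen"],
    "Dana": [],
    "Chen": [],
}

def all_reports(manager):
    """Direct and indirect reports, found by walking edges breadth-first."""
    found, queue = [], deque(reports_to.get(manager, []))
    while queue:
        person = queue.popleft()
        found.append(person)
        queue.extend(reports_to.get(person, []))
    return found

print(all_reports("Sarah"))  # ['Ben', 'Dana', 'Chen']
```

In a graph database this traversal is a single query over “manages” edges; in a relational schema it typically becomes a recursive query or a loop of joins.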
- Wide Column / Column Family Databases — Such databases use tables, rows, and columns, but unlike a relational database, the names and format of the columns can vary from row to row within the same table. You can think of a column family database as holding tabular data comprising rows and columns, where the columns are divided into groups known as column families. Each column family holds a set of columns that are logically related.
The most widely used column family database management system is Apache Cassandra.
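An illustrative sketch of the column-family layout in plain Python, with each row key mapping to column families and each family holding its own set of columns. The family and column names here are invented for the example:

```python
customers = {
    "row-1": {
        "identity": {"first": "Alice", "last": "Ng"},
        "contact":  {"email": "alice@example.com", "phone": "555-0101"},
    },
    "row-2": {
        # Unlike a relational table, rows need not share the same columns.
        "identity": {"first": "Bob"},
        "contact":  {"email": "bob@example.com"},
    },
}

# Reading one family for a row touches only that group of logically
# related columns, not the whole row.
print(customers["row-2"]["contact"]["email"])  # bob@example.com
```

Systems such as Cassandra store and replicate data along these family boundaries, which is what makes sparse, per-row column sets cheap.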