Data 101 — Almost all you need to know — Part I
CAP Theorem, ACID, BASE, ETL, ELT, OLTP, OLAP and lots more — all in one go
Data is omnipresent. The world runs on data. Without data, there is no machine learning, no artificial intelligence, no projection on election results. It is available in different structures, formats, shapes and sizes. Your ability to gain insights out of data will steer you in the direction of better, profitable decisions. This is going to be a multi-part introduction and I will start with data types and data processing options.
Data types and processing options
Data can be categorized into three types:
- Structured data — Data that is represented by rows and columns in a database which makes it easy to search, manage and organize. In structured data, entities can be grouped together to form relations. For example, tables in relational databases.
- Semi-structured data — Semi-structured data has some defining or consistent characteristics but doesn’t conform to a structure as rigid as structured data, so it can’t be represented purely by rows and columns. You can use metadata to make it more organizable. For example, JSON messages and e-mail messages have consistent characteristics that you can use to project structure onto them.
- Unstructured data — Unstructured data is data that cannot be contained in a row-column database and doesn’t have an associated data model. The lack of structure makes unstructured data more difficult to search, manage and analyse. For example, photos, videos and audio files.
You need to be able to take insights out of all three types of data. Because of the inherent structure available in structured and semi-structured data, it’s easier to analyse them. For analysis of unstructured data, you usually need to apply machine learning.
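To make the distinction concrete, here is a minimal sketch of projecting structure onto semi-structured data: each JSON message shares some keys but not all, and we flatten them into uniform rows the way a table would. The field names and sample records are invented for illustration.

```python
import json

# Semi-structured JSON messages: the records share some keys ("id", "name")
# but optional fields vary from record to record.
messages = [
    '{"id": 1, "name": "Alice", "email": "alice@example.com"}',
    '{"id": 2, "name": "Bob"}',  # no email field
]

# Project a tabular (structured) view by picking a fixed set of columns,
# filling gaps with None where a record lacks the field.
columns = ["id", "name", "email"]
rows = [tuple(json.loads(m).get(col) for col in columns) for m in messages]

print(rows)  # [(1, 'Alice', 'alice@example.com'), (2, 'Bob', None)]
```

Once the data is in rows and columns like this, the usual structured-data tooling (SQL queries, dataframes) applies.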
Data processing solutions often fall into one of two broad categories: analytical systems, and transaction processing systems. Analytical systems are also known as OLAP, whereas the transaction processing systems are also known as OLTP.
OLTP stands for Online Transaction Processing, a system that’s capable of performing transactions. A transaction is a unit of work that is either completed as a whole or not at all. Primary characteristics of OLTP include:
- Database design is highly normalized
- Data is stored in row format
- Used primarily for daily business operations
- Queries are usually simple and standardized
- Provides a single-dimensional view of business activities
- Source of the data is operational data
- Used for fast inserts and updates, real-time queries
- Generally has short-term data retention
- Data size is usually in GBs — TBs
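The characteristics above can be sketched with an in-memory SQLite database standing in for an OLTP store (the schema and data are invented for illustration): normalized tables, a small row-level transaction, and a fast point lookup by key.

```python
import sqlite3

# In-memory SQLite stands in for an OLTP system: normalized tables,
# row-oriented storage, simple standardized queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount REAL)""")

# A typical OLTP operation: a small transaction touching a few rows.
with conn:  # commits on success, rolls back on exception
    conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
    conn.execute("INSERT INTO orders VALUES (100, 1, 25.0)")

# A typical OLTP query: a point lookup by key, answered in real time.
row = conn.execute("SELECT name FROM customers WHERE id = ?", (1,)).fetchone()
print(row)  # ('Alice',)
```

Note how the design is normalized: customer details live in one table and orders reference them by key, rather than duplicating the customer name on every order.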
OLAP stands for Online Analytical Processing, a system that’s capable of performing analysis. Analytical systems are concerned with capturing raw data, and using it to generate insights. An organization can use these insights to make business decisions. Primary characteristics of OLAP include:
- Database design is highly denormalized
- Data is stored in columnar format
- Used primarily for decision support and planning
- Queries are complex including massive joins, aggregations etc.
- Provides a multi-dimensional view of business activities
- Requires ETL/ELT/EL processes to get the data into the system
- Used for reporting, visualization and batch loads
- Generally has long-term data retention
- Data size is usually in TBs — PBs
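For contrast with the OLTP case, here is a minimal analytical sketch, again using in-memory SQLite with an invented schema: a single denormalized fact table scanned and aggregated to produce a multi-dimensional summary.

```python
import sqlite3

# A single denormalized "sales" table stands in for an OLAP fact table:
# dimensions (region, product) are flattened into every row for fast scans.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EU", "widget", 100.0), ("EU", "gadget", 50.0),
     ("US", "widget", 200.0), ("US", "widget", 75.0)],
)
conn.commit()

# A typical OLAP query: aggregate across many rows for a summary view.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('EU', 150.0), ('US', 275.0)]
```

Real OLAP engines also store such tables column-by-column on disk, so a query like this only reads the `region` and `amount` columns instead of whole rows.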
You can also describe the data processing options depending on when the data is processed. You could process each data item as it arrives, or buffer the raw data and process it in groups. Processing data as it arrives is called streaming. Buffering and processing the data in groups is called batch processing.
In batch processing, newly arriving data elements are collected into a group. The whole group is then processed at a future time as a batch.
- Large volumes of data can be processed at a convenient time.
- It can be scheduled to run at a time when computers or systems might otherwise be idle, such as overnight, or during off-peak hours.
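A toy sketch of the batch pattern, with invented event fields: ingestion only buffers each item, and all the processing happens later in one pass over the group.

```python
# Arriving events are buffered, then processed together as one batch,
# e.g. on a nightly schedule.
buffer = []

def ingest(event):
    buffer.append(event)  # cheap: just collect, no processing yet

def run_batch(events):
    # Process the whole group in one pass (here: a simple total).
    return sum(e["amount"] for e in events)

for amount in (10, 20, 30):
    ingest({"amount": amount})

total = run_batch(buffer)  # in practice triggered by a scheduler, e.g. overnight
buffer.clear()
print(total)  # 60
```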
In stream processing, each new piece of data is processed as soon as it arrives, or with at most a small, configurable delay.
- No time delay between ingesting the data and getting the results.
- Data is processed as individual pieces rather than a batch at a time.
Stream processing typically occurs immediately, with latency on the order of seconds or milliseconds, whereas batch processing typically takes from a few minutes to a few hours depending on the size of the dataset.
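The streaming counterpart of the batch sketch above (same invented event shape): each event updates a running result the moment it arrives, so there is no wait for a group to fill up.

```python
# Each event is processed the moment it arrives, maintaining a running
# result instead of waiting for a full batch.
running_total = 0
results = []

def on_event(event):
    global running_total
    running_total += event["amount"]
    results.append(running_total)  # a result is available after every event

for amount in (10, 20, 30):
    on_event({"amount": amount})

print(results)  # [10, 30, 60]
```

The final answer (60) is the same as in the batch case; the difference is that intermediate results were available after every single event.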
ACID, BASE and CAP Theorem
CAP Theorem, ACID and BASE are used to characterize a database or, better put, they help you define what characteristics a database has, or the guarantees that it provides.
ACID are the properties of transactions in a database. A particular database either guarantees ACID transactions or it doesn’t:
- Atomicity — All operations in a transaction succeed or every operation is rolled back. A guarantee of atomicity prevents updates to the database occurring only partially, which can cause greater problems than rejecting the whole series outright.
- Consistency — On the completion of a transaction, the database is structurally sound. Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof.
- Isolation — Transactions do not contend with one another. Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially.
- Durability — The results of applying a transaction are permanent, even in the presence of failures. This usually means that completed transactions (or their effects) are recorded in non-volatile memory.
BASE has evolved as an ACID alternative in the last few years. It does NOT replace the need for ACID transactions; rather, it provides a less restrictive, more “optimistic”, more scalable alternative for the big / fast data world, or for domains where you don’t really need all the guarantees of ACID. BASE stands for Basically Available, Soft-state, Eventually consistent.
BASE provides a general guarantee that the database system will have high availability, however the data read may not always be the latest data, which is where the “Eventually consistent” part comes from.
- Basically Available — The database appears to work most of the time, or better put, basic reading and writing operations are available as much as possible.
- Soft-state — The state of the data does not have to be the same across all replicas of the data.
- Eventually consistent — Reads may not give you the latest value for a particular data point, but the system will exhibit consistency at some later point (eventually!).
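All three properties can be seen in a toy two-replica simulation (the replica names and replication log are invented for illustration): writes are acknowledged immediately (basically available), the replicas temporarily disagree (soft state), and replication eventually brings them back in line (eventual consistency).

```python
# Two replicas of one key. A write lands on replica A first; replica B
# keeps serving the old ("soft state") value until replication catches up.
replica_a = {"x": 1}
replica_b = {"x": 1}
pending = []  # replication log not yet applied to replica B

def write(key, value):
    replica_a[key] = value          # acknowledged immediately (available)
    pending.append((key, value))    # replicated later, asynchronously

def replicate():
    while pending:
        key, value = pending.pop(0)
        replica_b[key] = value

write("x", 2)
stale = replica_b["x"]   # 1: a read before replication misses the latest value
replicate()
fresh = replica_b["x"]   # 2: eventually consistent

print(stale, fresh)  # 1 2
```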
When we talk about data at scale, we usually mean distributed data. The CAP Theorem states that it is impossible for a distributed data store to simultaneously provide more than two of the following three guarantees:
- Consistency — All readers / clients have the same view of the data.
Note — This definition of consistency is different from the consistency definition in ACID. See the ACID section above for more information.
- Availability — The data store is always available for the clients to read and write.
- Partition tolerance — The data store works well despite any physical network partitions or an arbitrary number of messages being dropped between nodes in the network.
Since you can’t guarantee the absence of network partitions, your data store has to be able to tolerate them. During a partition, this leaves data store designers with a choice between consistency and availability, as they can’t have both. This is where the division between traditional SQL and NoSQL databases arises. Traditional SQL databases like Oracle, SQL Server, MySQL and PostgreSQL generally choose consistency over availability, whereas many NoSQL databases like Cassandra, Riak and Aerospike choose availability over consistency (though not all of them: MongoDB, for instance, favors consistency). Traditional SQL databases generally provide ACID transactions, whereas NoSQL databases typically provide BASE guarantees.
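The trade-off can be sketched as a tiny simulation (the `read` function and its modes are invented for illustration, not any database's API): during a partition, a consistency-first (CP) replica refuses the read because it cannot confirm it has the latest value, while an availability-first (AP) replica answers anyway, possibly with stale data.

```python
# Sketch of the consistency-vs-availability choice during a partition.
# A replica that cannot reach its peers either refuses the read (CP)
# or serves its possibly stale local copy (AP).
def read(local_value, partitioned, mode):
    if partitioned and mode == "CP":
        raise RuntimeError("unavailable: cannot confirm latest value")
    return local_value  # AP: answer anyway, possibly stale

assert read(42, partitioned=False, mode="CP") == 42  # healthy network: both fine
assert read(42, partitioned=True, mode="AP") == 42   # available, but maybe stale

try:
    read(42, partitioned=True, mode="CP")
    cp_available = True
except RuntimeError:
    cp_available = False  # the consistent system sacrificed availability

print(cp_available)  # False
```

On a healthy network both modes behave identically; the CAP choice only bites while the partition lasts.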
That’s about it for this part. In the next part I will provide a deeper introduction to NoSQL Databases and some other key concepts like EL/ETL/ELT etc. See you there!