As data scientists, one of the most important elements of our tech stack is the system we choose to connect to our working environment in order to manage our data. A Data Management System is responsible for managing and organizing large volumes of data throughout the organization lifecycle. It encompasses not only the software but also the policies, procedures, and governance frameworks that guide how data is collected, stored, processed, and used within our organization. Hence a Data Management System provides the foundation for effective upcoming data analysis and management.
We will start introducing the different data model and workload needs that Data Management Systems can fulfill (1). We will then focus on exploring the Database Management Systems: Relational (2), Non-Relational (3) and Data Lakes (4). We will switch to Data Warehouse Management Systems (5) to finally close with the Distributed Computing Systems (6) that support large scale data processing from the systems.
1. Data Model and Workload Constraints
A data management system can optimise two different types of workloads: a Database Management System is optimized for transactional workloads, while a Data Warehouse Management System is optimized for analytical workloads.
Database Management Systems (DBMS). A database is a software system that is designed to efficiently store, manage, and retrieve data for day-to-day operations, for quick reads and writes. DBMS technology is robust and essential for ensuring data consistency, reliability, and scalability. So as the DBMS is able to handle a high volume of concurrent transactions, then exploratory analysis should be limited to ad-hoc querying. There are three types of DBMS that differ in their data models and query languages. Relational Database Management Systems (RDBMS) are better suited for handling structured data while Non-Relational Database Management System (NoSQL databases) are better suited for handling unstructured (data does not fit into a predefined schema) or semi-structured (data schema may change frequently) data. Then, if we happen to use both structured and non-structured data, we should rather mix both and use Data Lakes.
Data Warehouse Management System (DWMS). Data warehouses structure is as flexible as Data Lakes. Though, while Data Lakes store raw data that can be processed later based on the needs of the organization, data warehouses instead require data to be processed and transformed before it is loaded into the warehouse. As a result, data warehouses are more optimized for advanced querying and reporting.
2. Relational Database Management Systems (RDBMS)
A RDBMS is better suited for handling transactional workloads based on structured data stored in a relational database. A relational database is a type of database that stores data in a series of tables, with each table representing a separate entity or concept. The tables are related to each other by one or more fields, which are used to establish relationships between the tables. On top of providing a range of features to store and retrieve data, a RDBMS also offers a range of security features, such as user authentication and access control, and a range of tools for maintaining data integrity, such as constraints, triggers, backup and recovery. SQL is the most specialised programming language designed to create, modify, and query relational databases. These are some of the most widely used RDBMS systems in the market:
RDBMS | SQLite | PostgreSQL | MySQL | Oracle Database | Microsoft SQL Server | Db2 |
---|---|---|---|---|---|---|
Free | Yes | Yes | Yes | Only free up to 12GB | Only free up to 10GB | Only free up to 100GB |
Open-source | Yes | Yes | Yes | No | No | No |
Developer | Richard Hipp | Berkeley | Acquired by Oracle Corp. | Oracle Corp. | Microsoft | IBM |
Main use cases | Most used, common on iOS & Android platforms given it is lightweight, great for small datasets | Efficient, great for large datasets, great for app container testing too | Popular with web dev languages like HTML, CSS & Javascript | Popular with large corporations given the large amount of features, tools & support available | Moslty used alongside other Microsoft products like Azure | Moslty used alongside other IBM products like Watson |
Big tech companies other than Microsoft and IBM, usually expect us to use their respective proprietary RDBMS: Amazon Relational Database Service (Amazon RDS) at Amazon, MyRocks at Facebook and Cloud SQL at Google.
One example of a company that uses a RDBMS is Airbnb. It is a global online marketplace for lodging and travel experiences tahta relies on RDBMS technology to facilitate its reservations and bookings of lodging accommodations.
3. Non-Relational Database Management Systems (NoSQL databases)
As stated early, a Non-Relational database is better suited for handling unstructured (data does not fit into a predefined schema) or semi-structured (data schema may change frequently) data. In other words, a Non-Relational database stores data is flexible way and do not enforce strict data integrity rules. It also often uses specialized query languages that are optimized for specific data models, such as document-based, graph-based or key-value-based data. These specialised query languages are categorised as Not only SQL (NoSQL). Non-relational databases shall be connected to a Distributed Computing System*. By distributing the data, the system is a more scalable, more available and secure. These are some of the most widely used NoSQL databases in the market:
NoSQLDB | MongoDB | Apache Cassandra | Neo4j |
---|---|---|---|
Free | Only the Community Edition | Yes | Only the Community Edition |
Open-source | Only the Community Edition | Yes | Only the Community Edition |
Developer | Dwight Merriman, Eliot Horowitz, and Kevin Ryan | Facebook, now maintained by Apache Software Foundation | Neo Technology |
Data structure | Document-based | Key-value-based | Graph-based |
Main use cases | An analytics company wants to store & analyse large volumes of social media data, sensor data, or machine logs including text, images, and videos. | A social media platform or an e-commerce site needs to handle high volumes of real-time rapidly changing user activity data. | A social network, a recommendation engine, or a fraud detection services is dealing with complex & personalised data with interconnected relationships. The company wants to represent this data in a more natural and expressive way to support a more accurate and relevant decision making model. |
Amazon Web Services (AWS) uses Amazon DynamoDB across various applications and services including e-commerce product catalog maintenance, real-time gaming leaderboards, IoT (Internet of Things) sensor data ingestion … Amazon DynamoDB addresses the requirements of scalable, low-latency, and high-performance applications.
4. Data Lakes
What if we have both structured and unstructured data to store in a centralized repository? Then we need a data lake. A data lake provides a way to store data in its native format, without the need for a predefined schema or organization, making it easier for data analysts and data scientists to access and analyze large volumes of data from various sources. When querying data from a data lake, the choice of database management system (DBMS) will depend on the type of data and the query requirements.
One company that effectively utilizes data lakes is the Netflix. The streaming platform uses a data lake to store and analyze a wide range of data types like user interactions, content metadata, viewing history, and user profiles. This data is leveraged to conduct A/B tests and improve their content recommendation algorithms and personalization.
5. Data Warehouse Management System (DWMS)
The purpose of a data warehouse is to provide a single source of truth for the organization, where data can be accessed and analyzed by business analysts, data scientists, and other users. Data warehouses are designed to support complex queries, data mining, long-term analysis, business intelligence (BI) reporting and strategic decision-making.
They involve aggregating and summarizing large amounts of data from multiple sources. Therefore data warehouses are large and centralized data repositories that allow field duplication to ease the connections between the data sources. They are organized in a schema based on the specific needs of the organization and the types of queries and analytics that will be performed. The primary schema designs used in data warehousing are:
Schema | Structure | Description |
---|---|---|
Star Schema | Structured | Data is organized into a central fact table and a set of dimension tables. The fact table contains transactional or event data, while dimension tables store descriptive attributes related to the facts. |
Snowflake Schema | Highly structured | Extension of the star schema, where dimension tables are further normalized into multiple related tables. This normalization reduces data redundancy at the cost of increased complexity. |
Galaxy Schema (Constellation Schema) | Structured | Combines multiple star schemas into a single data warehouse. Each star schema represents a subject area, and they are interconnected through shared dimension tables. |
Schema-less (No Schema) | Semi-structured or unstructured | Data is stored in its raw form without a predefined schema. This can be useful for handling semi-structured or unstructured data types like JSON, XML, or free-text documents |
Hybrid Schema | Mix of structured & semi-structured | Combines elements of structured data storage (usually traditional sources) and semi-structured data storage (usually newer sources). |
To optimize its advanced analytics performance, they require data to be processed and transformed before it is loaded into the warehouse. We use ETL tools to proceed that stand for Extract, Transformed, and Load. These are some of the most widely used DWMS in the market:
DWMS | Azure Synapse Analytics | Amazon Redshift | Google BigQuery | Db2 Warehouse | Oracle Autonomous Data Warehouse | Snowflake Data Cloud |
Free | No, pay-as-you-go | No, pay-as-you-go | Yes, up to 10 GB | No, subscription | No, subscription | No |
Open-source | No | No | No | No | No | No |
Developer | Microsoft | Amazon | IBM | Oracle Corp. | Snowflake Computing | |
Scalability | ++++ | +++ | ++++ | ++ | +++ | ++++ |
Main use cases | A retail company wants a seamless integration with Azure services and to gain real-time sales & inventory insights from POS & IoT devices. | An ad tech company wants to perform fast ad hoc queries on their large volumes of data. | A small and growing startup wants a fully managed architecture and to easily collaborate and share datasets or queries. | A financial services company with sensitive data is looking for strong data security & encryption. | A healthcare organization with patient records is also looking for both scalability and high security. | An e-commerce company wants to seriously reduce its warehousing costs during off-peak holiday season. |
Let’s continue with the example of Netflix in order to properly understand the distinction between data lakes and data warehouses. Netflix actually employs Amazon Redshift to extract data from the data lake, transform it into a structured format, and load it into the data warehouse for analytics and reporting. This structured data is used for tasks like business intelligence, reporting, and generating insights. The combination of a data lake and data warehouse allows Netflix to effectively manage both raw, unstructured data and structured, processed data.
6. Distributed Computing Systems
Distributed computing systems are used for larger scale data processing tasks such as big data analytics, machine learning, and real-time data processing. They partition and replicate the data to store it across different servers or nodes that are connected to each other over a network. As a result, by parallelizing the computation across multiple nodes, the database management systems can handle larger amounts of data and higher traffic volumes while allowing multiple users to access and update the data simultaneously, all in all while providing mechanisms for ensuring data consistency, integrity, and security.
A distributed computing environment can also be integrated to our computing environment. They are usually connected to a non-relational database (NoSQL database) but not exclusively. By distributing the data, the system is a more scalable, more available and secure. These are some of the most widely used distributed computing systems in the market:
Distributed Computing Systems | Apache Spark | Apache Hadoop MapReduce | Amazon EMR (Elastics MapReduce) | Microsoft Azure Batch |
---|---|---|---|---|
Free | Yes | Yes | No | No |
Open-source | Yes | Yes | No | No |
Developer | Berkeley | Doug Cutting and Mike Cafarella, inspired by Google MapReduce | Amazon.com | Microsoft |
Main use cases | Generally faster thanks to its in-memory processing model & more flexible as it supports multiple programming languages | More scalable as it processes data in batches & more efficient as it primarily relies on Java | Strong security features, integration with other AWS services, high scalability, user-friendly | Strong security features, integration with Azure ecosystem, helps you optimize resource usage and reduce costs |
If you happen to work for big tech companies other than Amazon and Microsft you will be asked to use their respective proprietary Distributed Computing Systems: Xgrid for Apple, Spectrum Computing at IBM and the Google Cloud Platform at Google.
A distributed computing system uses APIs to facilitate communication between nodes in a cluster. To learn more about the most popular librairies that we can use to call for computation while performance your data analysis via an API, check this post: Data Science Tech Stack Series: Languages, Librairies and Frameworks
Explore more
A data management system can be integrated to our working environment. To learn more about working environments, check this post: Data Science Tech Stack Series: Working Environments
Also, check my post that introduces the full stack: Baking Up The Ultimate Data Science Tech Stack