Data science requires proficiency in one or more programming languages, depending on the task at hand (1). These languages can also be connected together (2). To save time while programming, we can use libraries and frameworks: collections of pre-written code that we import into our own, giving us ready access to their functions and methods. We can call libraries to manipulate data (3), visualise data (4), and perform large-scale data processing (5). Frameworks are more complex than libraries and are mostly used for machine learning (6).
1. Programming Languages
Examples of popular programming languages include Python, R and SQL.
Python. Python is the go-to general-purpose programming language. It benefits from a large ecosystem of libraries and frameworks and focuses on developer productivity, which makes it well suited to building complex applications. It is also the most versatile of the three, handy for writing scripts beyond our data science scope. On the downside, building visualizations in Python tends to be more convoluted than in R, and the results are often less informative and less polished.
R. R is another general-purpose programming language and a favorite for statistical computing. It is the go-to choice for regression analysis, cluster analysis, and time series analysis, and it offers a wealth of specialized statistics and data science packages. As a result, R is widely used in academia and in certain sectors, such as finance and pharmaceuticals. Another plus: it is a good fit if you have limited programming experience, as it is easier to pick up at first. Don't get too confident, though: its advanced functionality has a steeper learning curve than Python's.
Structured Query Language (SQL). SQL is the specialized language for managing relational databases. It is not a general-purpose programming language: it is designed solely to create, modify, and query relational databases, which are collections of tables that store data in a structured way. SQL is typically used to extract data from these databases, join data from different tables, and perform aggregations and transformations. It is particularly well suited to handling large amounts of structured data, hence its wide use in industries such as finance, healthcare, and e-commerce.
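To make this concrete, here is a minimal sketch of those SQL operations (create, join, aggregate), run from Python against an in-memory SQLite database; the tables and their contents are hypothetical:

```python
import sqlite3

# Hypothetical in-memory relational database with two related tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         amount REAL,
                         FOREIGN KEY (customer_id) REFERENCES customers(id));
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 200.0);
""")

# Join the two tables and aggregate: total order amount per customer.
query = """
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total_spent DESC;
"""
for name, total in conn.execute(query):
    print(name, total)
```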
2. Connecting Programming Languages Together
Can we connect SQL with Python or R? Yes we can. Connecting them lets us combine the strengths of both languages to perform complex data analysis tasks efficiently and effectively. We can extract data from a large dataset stored in a SQL database and load it into Python or R data structures, where powerful data analysis and visualization libraries take over. Additionally, we may want to automate database tasks using Python, such as creating and updating tables, or performing data cleaning and transformation.
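As a minimal sketch of that workflow, assuming a hypothetical SQLite database file `sales.db` containing an `orders` table, pandas can pull a query result straight into a DataFrame:

```python
import sqlite3
import pandas as pd

# Hypothetical database file and table; swap in your own connection.
conn = sqlite3.connect("sales.db")

# Push the heavy filtering/aggregation to SQL, then analyze in pandas.
df = pd.read_sql_query(
    "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id",
    conn,
)
print(df.describe())  # pandas takes over for statistics and visualization
```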
3. Data Manipulation Libraries
Data manipulation libraries are used to transform, filter, and compute on data. Examples of popular open-source data manipulation libraries that we can download and use include:
| Python packages | R packages |
| --- | --- |
| Pandas: provides fast and flexible data structures for data manipulation and analysis. It is particularly useful for working with tabular data and time series data. | dplyr: provides functions for data manipulation, including filtering, arranging, and summarizing data. It is designed to work well with data frames and provides a more intuitive syntax than base R functions. |
| NumPy: provides fast and efficient operations on arrays and matrices, making it ideal for scientific computing and data analysis. | data.table: provides a fast and efficient way to work with large datasets. Syntax is similar to dplyr but is optimized for speed and memory usage. |
| SciPy: provides a wide range of algorithms for scientific computing, including optimization, signal processing, and linear algebra. This library builds on top of NumPy. | tidyr: provides functions for tidying data by reshaping it between wide and long formats and by filling in missing values. |
FYI, we can only use a library in Python or R, as SQL isn't a library language.
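To illustrate the Python column of this table, here is a minimal pandas sketch (the DataFrame contents are made up) showing the filtering, grouping, and summarizing these libraries are built for:

```python
import pandas as pd

# Made-up tabular data for illustration.
df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris", "Lyon"],
    "sales": [120, 80, 150, 60],
})

# Filter rows, then group and summarize -- the bread and butter of
# data manipulation libraries like pandas or dplyr.
big_sales = df[df["sales"] > 70]
summary = big_sales.groupby("city")["sales"].agg(["sum", "mean"])
print(summary)
```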
4. Data Visualisation Libraries
Data visualisation libraries are used to create graphical representations of data, making it easier to understand and interpret complex data sets. Examples of popular open-source data visualisation libraries that you can download and use include:
| Python packages | R packages |
| --- | --- |
| Matplotlib: provides a wide range of plotting functions for creating static visualizations like line charts, scatterplots, and histograms. | ggplot2: provides a flexible and powerful system for creating complex graphics with ease, along with a wide range of customization options. |
| Seaborn: provides a higher-level interface to Matplotlib. Lets you create complex visualizations with fewer lines of code and offers a wide range of built-in themes and color palettes. | lattice: provides functions for creating trellis plots, which are useful for visualizing multivariate data. Lets you create plots with multiple panels, each showing a different subset of the data. |
| plotly: provides interactive and web-based visualizations. Lets you create interactive plots, maps, and dashboards that can be shared online. | plotly: provides interactive and web-based visualizations. Lets you create interactive plots, maps, and dashboards that can be shared online. |
FYI, we can only use a library in Python or R, as SQL isn't a library language.
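To illustrate, here is a minimal Matplotlib sketch with made-up data, producing the kind of static line chart described above:

```python
import matplotlib.pyplot as plt

# Made-up data for illustration.
months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [10, 14, 9, 17]

fig, ax = plt.subplots()
ax.plot(months, revenue, marker="o")  # simple static line chart
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (k$)")
ax.set_title("Monthly revenue")
plt.show()
```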
5. Large-Scale Data Processing API Libraries
To perform larger-scale data processing tasks such as big data analytics, machine learning, and real-time data processing, we need a distributed computing system. By distributing the data, the system becomes more scalable, more available, and more secure. Distributed computing systems use both MPIs (Message Passing Interfaces) and APIs (Application Programming Interfaces) to facilitate communication between nodes in a cluster: MPI provides a low-level communication protocol for parallel programming on distributed systems, while APIs provide a high-level interface for developers to interact with the system. Below are some of the most popular distributed computing systems on the market, along with the libraries we can use to call them for computation, via an API, while performing our data analysis:
| Python packages | R packages |
| --- | --- |
| Apache Spark: PySpark, TensorFlowOnSpark, Arrow, Dask | Apache Spark: SparkR, sparklyr |
| Apache Hadoop: mrjob, Hadoop Streaming, Arrow, Dask | Apache Hadoop: Rhipe, rhdfs, rmr2 |
| Amazon Web Services: PySpark, Dask | Amazon Web Services: rhdfs, rmr2 |
| Azure Batch: Batch Shipyard, Azure SDK | Azure Batch: AzureSMR, BatchJobs, AzureStor |
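As a minimal sketch of calling one of these systems through its Python API, here is a PySpark example; it assumes a local PySpark installation and a hypothetical `events.csv` input file:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local session for illustration; in production this would point at a cluster.
spark = SparkSession.builder.appName("demo").getOrCreate()

# Hypothetical input file; Spark distributes the work across its workers.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
counts = df.groupBy("event_type").agg(F.count("*").alias("n"))
counts.show()

spark.stop()
```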
To learn more about distributed computing systems, check this post: Data Science Tech Stack Series: Data Management Systems
6. Machine Learning Frameworks
Frameworks are more complex than libraries. We call a library, but we don't call a framework: a framework inverts the control of the program, meaning it decides what tasks the developer's code performs and when. The sketch below illustrates the difference.
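Here is a minimal Python sketch of that inversion of control; the `run_training` mini-framework and `my_hook` callback are hypothetical, purely for illustration:

```python
# Library style: YOUR code is in control and calls the library.
import math
result = math.sqrt(2)  # you decide what runs and when

# Framework style (inversion of control): the framework is in control
# and calls YOUR code back at the moments it chooses.
def run_training(on_epoch_end):
    """Hypothetical mini-framework: it owns the loop."""
    for epoch in range(3):
        # ... framework does the training work here ...
        on_epoch_end(epoch)  # framework calls your hook

def my_hook(epoch):
    print(f"epoch {epoch} finished")

run_training(my_hook)  # you plug in code; the framework drives it
```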
Machine learning frameworks are used to build predictive models and algorithms, and to train, validate, test, and deploy them. Examples of popular open-source machine learning frameworks that you can download and use include:
| Python packages | R packages |
| --- | --- |
| Scikit-learn: provides a wide range of machine learning algorithms, including classification, regression, clustering, and dimensionality reduction. | caret: provides functions for training and testing various machine learning algorithms. Lets you tune hyperparameters and perform cross-validation to improve model accuracy. |
| TensorFlow: platform developed by Google for building and deploying machine learning models. Widely used for deep learning, deep neural network creation, and dataflow and differentiable programming across a range of tasks. | tfestimators: a high-level TensorFlow interface for R. |
| Keras: high-level neural networks API built on top of TensorFlow. Provides a more user-friendly, simplified interface for building and deploying deep learning models. | Keras: high-level neural networks API built on top of TensorFlow, available in R through the keras package. Provides a more user-friendly, simplified interface for building and deploying deep learning models. |
| PyTorch: developed by Facebook for building and deploying machine learning models. Widely used in research and production environments. | mlr3: provides a unified interface for working with various machine learning algorithms, plus tools for preprocessing data, feature selection, model evaluation, clustering, regression, classification, and survival analysis. |
FYI, we can only use a framework in Python or R, as SQL isn't a framework language.
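For instance, here is a minimal scikit-learn sketch of the train/test cycle these frameworks support, using one of scikit-learn's bundled toy datasets:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Bundled toy dataset, split into training and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train a classifier, then evaluate it on held-out data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```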
Explore more
Check my post that introduces the full stack: Baking Up The Ultimate Data Science Tech Stack