When doing Data Science, it is crucial to choose the right working environment. Our goal is to make our workflow as efficient as possible across the stages of data wrangling, coding, collaboration, and results interpretation. In this article, we will explore three types of environments with a progressively richer set of features: code editors (1), integrated development environments (2), and interactive computing environments (3). We will also introduce the most popular collaboration platforms and tools (4), and we will conclude with package distribution software (5), which lets us download and install in one go everything we need to do Data Science.
1. Code Editors
Code editors are software tools used for writing and editing code. They are lightweight and offer a simple and streamlined workflow for smaller-scale coding tasks. They typically offer basic features such as syntax highlighting, auto-indentation, code completion, code snippets, code folding and version control. Here are some of the best code editors available today that support Python, SQL and R:
Code Editors | Notepad++ | Sublime Text | Emacs |
---|---|---|---|
Free | Yes | No | Yes |
Open-source | Yes | No | Yes |
Developer | Don Ho | Jon Skinner | GNU Project (community-maintained) |
Primary users | Software Developers | Software Developers | Software Developers, Data Scientists, Researchers |
Main Feature | Lightweight and fast | Customizable | Highly extensible through Emacs Lisp |
Cross-platform | No | Yes | Yes |
Git integration* | No | Yes | Yes (built-in VC mode, or the Magit package) |
Debugger | No, you’ll need to use third-party plugins. | No, you’ll need to use third-party plugins. | Yes |
2. Integrated Development Environments (IDEs)
Integrated Development Environments are more comprehensive software tools than code editors, with additional functionality and integration with other tools. As a result, they are more heavyweight than code editors, but they improve productivity and help manage complexity in medium-scale development projects. IDEs typically include all the features of a code editor along with other tools such as build automation, code refactoring and project management. Most are cross-platform and integrate with version control systems like Git (cf. Section 4. Collaboration Platforms and Tools), which allows users to manage their code and projects and collaborate on them with others. Here are some of the best IDEs available today that support Python, SQL and R:
IDEs | Visual Studio | PyCharm | Spyder | RStudio |
---|---|---|---|---|
Free | Community edition only | Community Edition only | Yes | Yes |
Open-source | No (Visual Studio Code, its lighter sibling, is largely open-source) | Community Edition only | Yes | Yes |
Developer | Microsoft | JetBrains | Spyder project contributors (created by Pierre Raybaut, MIT-licensed) | Posit (formerly RStudio, Inc.) |
Primary users | Microsoft Software Developers | Python Software Developers | Python Data Scientists | Data Scientists & Statisticians |
Main Feature | Azure compatible | Specialised in Python | Scientific environment for Python | Strong statistician community |
If you work for a big tech company other than Microsoft, you may be steered toward its own tools: AWS Cloud9 at Amazon, Xcode at Apple and Google Cloud Shell at Google.
3. Interactive Computing Environments
Compared with the code editors and IDEs covered above, an Interactive Computing Environment is a more comprehensive working environment. As its name suggests, its primary focus is to let us work interactively with our data, code intuitively, and flexibly create and share documents that combine live code, narrative text, visualizations, and other multimedia elements. It is typically used for medium-scale data processing projects, including exploratory data analysis, data visualization, and rapid prototyping of machine learning models. Examples of popular web-based Interactive Computing Environments that support Python, SQL and R include:
ICEs | Jupyter Notebook | JupyterLab | Apache Zeppelin | MATLAB Online |
---|---|---|---|---|
Free | Yes | Yes | Yes | No |
Open-source | Yes | Yes | Yes | No |
Cloud-based | No | No | No | Yes |
Developer | Project Jupyter (co-founded by Fernando Pérez & Brian Granger) | Project Jupyter (co-founded by Fernando Pérez & Brian Granger) | Moon Soo Lee, Sungwook Yoon & Hyungtae Kim; now an Apache Software Foundation project | MathWorks |
Primary users | Software Developers & Data Scientists | Software Developers & Data Scientists | Data Scientists | Data Scientists, Engineers, Researchers in Maths, Physics, Finance & Biology |
Main Feature | Ease of use, versatility | Next generation of Jupyter Notebook, more powerful & more extensions | Specifically designed for data analysis, highly extensible & customizable | Includes Simulink to graphically model and simulate dynamic systems |
If you work for a big tech company, you may be asked to use its own Interactive Computing Environment: Azure Notebooks at Microsoft or Google Colaboratory at Google. Both build on Jupyter Notebook to offer a cloud-hosted version, and they can provide access to hardware accelerators such as GPUs and TPUs to speed up model training and computation.
The Interactive Computing Environments above are primarily designed for individual use and small-scale collaboration. They tend to have little or no built-in support for real-time collaboration and multi-user editing. This means that we can share our notebooks on the platform's servers or in the cloud to ask for comments and suggestions, but we cannot all edit the same notebook at once. Hence it is better to connect our environment to a version control and collaboration tool like GitHub.
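
To make the idea of a document that mixes narrative text and live code more concrete, here is a minimal sketch of what a Jupyter notebook file looks like underneath. It assumes the version 4.4 layout of the notebook JSON format and uses only Python's standard library; the helper name `build_notebook` and the sample title and code are illustrative, not part of Jupyter itself.

```python
import json
import tempfile
from pathlib import Path


def build_notebook(title: str, code: str) -> dict:
    """Return a minimal notebook dict with one text cell and one code cell."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,  # assumption: the 4.4 layout, which needs no cell ids
        "metadata": {},
        "cells": [
            # Narrative text is stored as a markdown cell.
            {"cell_type": "markdown", "metadata": {}, "source": f"# {title}"},
            # Live code is stored as a code cell; outputs are filled in when it runs.
            {
                "cell_type": "code",
                "metadata": {},
                "execution_count": None,
                "outputs": [],
                "source": code,
            },
        ],
    }


if __name__ == "__main__":
    notebook = build_notebook("Exploring sales data", "print(1 + 1)")
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.ipynb"  # hypothetical file name
        path.write_text(json.dumps(notebook, indent=1))
        print(f"Wrote {path.name} with {len(notebook['cells'])} cells")
```

Because a notebook is plain JSON text, it can be stored in a Git repository and shared on GitHub like any other source file.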
4. Collaboration Platforms and Tools
We will break down the collaboration platforms and tools into three groups: version control tools, collaboration and hosting tools, and Artificial Intelligence (AI) collaboration platforms.
Version control systems. The purpose of version control tools is to allow multiple people to work on the same codebase without overwriting each other's work, while precisely tracking the changes made to the code over time. The most popular version control software is Git. Git is a free, open-source distributed version control system that runs on your own machine rather than in a browser. Git is mostly used by software developers who need to collaborate on a programming project. With Git, developers can make changes to code on their local machine and then push those changes to a central repository, where they can be shared with other members of the development team.
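
The local-change-then-push cycle described above can be sketched in a few lines of Python that drive the `git` command line. This is only an illustration under stated assumptions: `git` must be installed, the "central" repository is just a bare repository in a temporary folder, and the names `central.git`, `alice`, `analysis.py`, `run_git` and `demo_push_workflow` are all hypothetical.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_git(*args: str, cwd: Path) -> str:
    """Run a git command inside `cwd` and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def demo_push_workflow(base: Path) -> str:
    """Commit locally, push to a 'central' repository, and return its log."""
    central = base / "central.git"  # hypothetical shared repository
    local = base / "alice"          # hypothetical developer's working copy

    central.mkdir()
    run_git("init", "--bare", cwd=central)
    run_git("clone", str(central), str(local), cwd=base)

    # Set the identity per repository so the sketch ignores global config.
    run_git("config", "user.name", "Alice", cwd=local)
    run_git("config", "user.email", "alice@example.com", cwd=local)

    # Make a change on the local machine and record it as a commit.
    (local / "analysis.py").write_text("print('hello data')\n")
    run_git("add", "analysis.py", cwd=local)
    run_git("commit", "-m", "Add first analysis script", cwd=local)

    # Share the commit with the team by pushing it to the central repository.
    run_git("push", "origin", "HEAD", cwd=local)
    return run_git("log", "--oneline", "--all", cwd=central)


if __name__ == "__main__":
    if shutil.which("git") is None:
        print("git is not installed; skipping the demo")
    else:
        with tempfile.TemporaryDirectory() as tmp:
            print(demo_push_workflow(Path(tmp)))
```

In day-to-day work you would type the same `git add`, `git commit` and `git push` commands directly in a terminal or trigger them from your IDE; the script only makes the sequence explicit.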
Collaboration and hosting systems. The purpose of collaboration tools is to facilitate communication, issue tracking, quality control and accountability among the stakeholders involved in the program being built. Built on top of Git, introduced earlier, GitHub provides a centralized hosting platform where developers store and share their Git repositories, making it even easier to create repositories, contribute to open source projects, and collaborate with other developers on code development and maintenance. You can create an account for free. As a data scientist, I also use GitHub to post my work and share it with a broader community. Posting my Jupyter notebooks, for instance, helps me build my professional profile and showcase my skills to potential employers or collaborators. Alternatives include GitLab and proprietary tools like AWS CodeCommit.
Artificial Intelligence (AI) Collaboration Platforms. These platforms enable teams of data scientists, engineers, and other stakeholders to work together more efficiently and effectively on complex AI projects. They provide a range of tools and services for building, customizing, training, deploying and experimenting with machine learning models. The platforms also support data management and collaboration, and they integrate with Jupyter Notebook and JupyterLab. Some examples of popular AI collaboration platforms include:
AI Platforms | Amazon SageMaker | Google Cloud AI Platform | IBM Watson Studio | Azure Machine Learning Studio | Databricks |
---|---|---|---|---|---|
Free | Only up to 250h | Only up to 120h | Only the Lite plan (25GB) | Only up to 4h per month | Only the Community Edition |
Open-source | No | No | No | No | No |
Cloud-based | Yes | Yes | Yes | Yes | Yes |
Developer | Amazon | Google | IBM | Microsoft | Databricks (founded by the creators of Apache Spark) |
5. Package Distribution Software
Package distribution software lets us download and install in one go everything we need: the programming language, key libraries and frameworks, version control systems, working environments, etc. It also makes it easier to manage these software packages and their dependencies, and to avoid conflicts between different versions of packages along our Data Science journey. Below are some of the most popular distributions for Python and R:
Python Distribution | R Distribution |
---|---|
Anaconda: popular open-source choice for data science and scientific computing | RStudio Desktop: popular open-source choice particularly for data science and statistical analysis |
Enthought Canopy: another popular distribution, developed by Enthought | Revolution R Open: open-source distribution preferred for its performance optimizations |
ActivePython: commercial enterprise distribution | Microsoft R Open (MRO): Microsoft's successor to Revolution R Open, also focused on performance optimizations |
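
Whichever distribution you install, it helps to check which package versions actually ended up in your environment, since version mismatches are a common source of conflicts. Here is a minimal sketch using only Python's standard `importlib.metadata` module; the function name `installed_versions` and the list of packages are illustrative choices, not part of any distribution's tooling.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Hypothetical shortlist of packages a data scientist might rely on.
    for name, version in installed_versions(["numpy", "pandas", "scikit-learn"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this inside each environment you maintain gives a quick way to compare them before sharing code with collaborators.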
Explore more
Check my post that introduces the full stack: Baking Up The Ultimate Data Science Tech Stack