Is it time to replace matplotlib and pandas?

Is it time to replace matplotlib and pandas?

Comment by Markus Schmitt, Data Revenue Is it time to replace Matplotlib and Pandas?

September 23, 2020

Author / Editor: Markus Schmitt / Nico Litzel

Programmers take great pleasure in comparing different software libraries. Tensor Flow vs Pytorch? Vim vs Emacs? Python vs R? Assuming a kind of global competition in which software competes by popularity, Pandas and Matplotlib represent the most established players in the Python data science space. They are among the oldest and most widely used libraries for data wrangling and visualization. However, within this competition there are also interesting competitors.

Companies on the subject

In the area of ​​data wrangling, libraries like Dask, Vaex, and Modin offer some advantages over Pandas, although they are even less mature. Likewise, Bokeh and Seaborn (among many others) compete with Matplotlib in the rich Python visualization landscape because they offer simpler APIs, nicer default settings, and modern features - such as interactivity.

In this article we will:

Some hot Python libraries highlight and discuss why they are of interest to data scientists and machine learning developers;

Compare them with Matplotlib and Pandas and

Conclude that "competition" is the wrong way to look at it. A butter knife is no better or worse than a bread knife - they simply do different jobs. Likewise, it usually depends on the task at hand which Python tool you should use. The tools often complement each other.

Data wrangling

Incumbent: Pandas

Why Pandas is popular: Simplicity, maturity and flexibility

Problems: Inefficient, not scalable, and often limited to a single thread.

Challengers: Dask, Modin, Vaex, and RAPIDS cuDF (detailed comparison)

Pandas get a lot of gripes. Even the original developer of Pandas has written scathing reviews about Pandas. Nevertheless, for many data scientists it is an indispensable tool that enables simple workflows, e.g. B: for:

Read data from various file formats, including JSON and CSV;

Loading data into dataframe objects;

Manipulating data by filtering, transforming, or aggregating;

Visualize data through seamless integration with Matplotlib.

However, because Pandas is built with such a strong focus on simplicity and flexibility, it performs poorly in other areas - particularly in terms of efficiency and scalability to large datasets.

How do other libraries solve these problems?

All the libraries that we examine as pandas challengers try to preserve the familiar pandas interface by implementing (fully or partially) the pandas API. At the same time, they solve efficiency problems through better use of more powerful hardware (multi-core processing, GPU processing or cluster computing) or clever algorithms (such as lazy evaluation and memory mapping).

Modin

Modin is the simplest tool of all and aims to provide a full replacement for pandas. If you already have Pandas code, you can make significant progress by adjusting a single line.

import pandas as pd

Let's just replace with:

import modin.pandas as pd

There are a few preparations that need to be made first. You first install Modin (via pip or similar) and then Dask or Ray, the backends on which Modin runs. Then everything should work directly.

Use Modin if your existing code relies heavily on Pandas and you're looking for the easiest way to optimize Pandas execution.

Don't use Modin if you're looking for more control over scalability, or if you value efficiency over ease of use.

Vaex

Similar to Modin, Vaex implements part of Panda's API and can often be used as an alternative. However, unlike Modin, Vaex is not aiming for full compatibility with pandas. Instead, Vaex focuses primarily on exploration and visualization. With Vaex, you can explore and visualize datasets on your regular laptop, even if they're larger than your computer's available memory. Vaex makes this possible through a combination of lazy evaluation (performing calculations only when the results are definitely needed) and memory mapping (treating files on hard disks as if they were loaded in RAM).

Use Vaex when you want to explore and visualize a large dataset on regular hardware.

Do not use Vaex if you want more sophisticated data manipulation or to scale to a cluster of servers.

Dask

Although Modin runs on Dask, Dask also provides a high-level, Pandas-like library called Dask.Dataframe. Similar to Modin, this library implements many of Pandas' methods, allowing it to completely replace Pandas in certain cases. Similar to Vaex, Dask uses Lazy Evaluation to extract additional efficiencies from existing hardware.

Unlike Modin, Dask does not aim for full compatibility with the Pandas API, instead moving away from Pandas whenever additional performance is required.

Dask also offers far more features than Vaex or Modin. While these libraries only offer their own DataFrame-like functionality (e.g. as a Pandas replacement), Dask can be used to scale any Python code and additionally offers a replacement for Numpy and Scikit-learn.

It's time matplotlib and pandas too

Use Dask when you want to scale large datasets and use compute clusters.

Don't use Dask unless you're willing to add some complexity to your projects.

RAPIDS cuDF

While the other libraries we looked at make more effective use of RAM and CPUs, RAPIDS is designed to run on NVIDIA GPUs. Similar to Dask, RAPIDS offers a set of tools with alternatives not only for pandas but also for scikit learn.

RAPIDS focuses on data science use cases and aims to enable developers to run complete data science pipelines on GPUs instead of switching data back and forth between GPUs and CPUs at different stages.

Use RAPIDS if you have access to NVIDIA GPUs and want to use them to accelerate your Python data science pipelines.

Don't use RAPIDS if you don't have GPUs available.

Plotting and visualization

Incumbent: Matplotlib

Why Matplotlib is popular: Maturity and flexibility

Issues: code is often bulky, default settings are ugly, and it doesn't lend itself well to interactive visualizations.

Challengers: Seaborn, Bokeh, Plotly, Datashader

Matplotlib is the go-to Python visualization library for many developers. It has a wide range of functionality to create almost any type of chart and integrates well with other libraries such as B. Pandas integrate.

However, the charts created by default are rather ugly and require more lines of code than newer visualization libraries. Also, Matplotlib was primarily designed to create static image files, such as PNGs. In order to make visualizations interactive or to embed them in web dashboards, one should look for alternatives.

How do other libraries solve these problems?

There are dozens of popular Python visualization libraries. You can get a good overview with Jake VanderPla's talk on the subject, we'll cover part of it here. These newer libraries offer nicer defaults, more concise code, better JavaScript integrations, and more efficiency.

Seaborn

Seaborn is usually the first matplotlib alternative people turn to. Seaborn is based on Matplotlib and offers nicer default settings and a simpler syntax for standard visualizations. The main disadvantage of Seaborn is that the options are a bit more limited. There are some settings and charts that you can create in Matplotlib but not in Seaborn.

However, Seaborn was developed to be used in combination with Matplotlib. So if you reach your limits, you can simply use Matplotlib code and fill in any gaps.

Use Seaborn when you want non-interactive plots with nicer tuned defaults and simpler code than matplotlib.

Do not use Seaborn if you need interactive charts or dashboards.

Bokeh

While Seaborn aims to simplify Matplotlib's existing functionality, Bokeh adds new features. Above all, Bokeh can create interactive visualizations. You define the plots in Python and Bokeh automatically creates the required JavaScript.

Bokeh is a good choice when you want to publish your plots as part of a web application or HTML and JavaScript based report.

Use bokeh when you want to generate interactive charts or dashboards.

Don't use bokeh if you enjoy writing your own JavaScript or just need static images.

Plotly

Plotly is similar to Bokeh in that it creates interactive charts and generates the necessary JavaScript from Python. Plotly is also behind Dash, a framework that allows you to write complete web applications directly in Python.

Use Plotly if you want to generate interactive charts or dashboards.

Don't use Plotly if you enjoy writing JavaScript yourself or just need static charts.

Datashader

Unlike the other libraries mentioned that focus on creating plots and charts, Datashader focuses on visualizing large datasets. For example, the chart below was generated from a huge dataset using Datashader to show population density in the United States.

Since Datashader is designed to work with large datasets, it focuses far more on performance and efficiency than the other libraries. However, it is less well suited to displaying simpler diagrams such as B. Scatterplots to generate.

Use Datashader when you want to create charts from really large data sets or with well over millions of points.

Don't use Datashader if you just want to create standard charts on smaller datasets.

Conclusion

Pandas and Matplotlib are often the first choice for data scientists and Python programmers. As powerful and mature as these libraries are, they still have some shortcomings, and there are good alternatives that address those shortcomings.

Which is best for you will depend on your specific problem, and the newer "challenger" libraries should be seen as a support to the established libraries rather than competition.

Article files and article links

Link: Data Revenue on the Web

(ID:46851117)