
Dataframes explained: The modern in-memory data science format

By admin

Nov 7, 2024



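import pandas as pd

# Each dictionary key becomes a column name; each list holds that column's values.
data = {
    "Title": ["Blade Runner", "2001: a space odyssey", "Alien"],
    "Year": [1982, 1968, 1979],
    "MPA Rating": ["R", "G", "R"]
}

# Build the dataframe: three rows, three columns.
df = pd.DataFrame(data)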
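
Once a dataframe like this exists, the usual next steps are to inspect it, filter it, and sort it. The sketch below rebuilds the same movie data so it can run on its own; the variable names r_rated and by_year are illustrative choices, not part of the original example.

```python
import pandas as pd

# Same movie data as above, rebuilt so this snippet is self-contained.
df = pd.DataFrame({
    "Title": ["Blade Runner", "2001: a space odyssey", "Alien"],
    "Year": [1982, 1968, 1979],
    "MPA Rating": ["R", "G", "R"],
})

# Every column carries its own data type.
print(df.dtypes)

# Filter rows with a boolean mask: keep only the R-rated films.
r_rated = df[df["MPA Rating"] == "R"]
print(r_rated)

# Sort rows by a column's values (ascending by default).
by_year = df.sort_values("Year")
print(by_year)
```

Both operations return new dataframes and leave df itself unchanged, which is the default behavior in Pandas.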

Applications that use dataframes

As I mentioned previously, nearly every data science library or framework supports a dataframe-like structure of some kind. The R language is generally credited with popularizing the dataframe concept, although the idea existed in other forms before R. Spark, one of the first broadly popular platforms for processing data at scale, has its own dataframe system. The Pandas data library for Python, and its speed-optimized cousin Polars, both offer dataframes. And the analytics database DuckDB combines the conveniences of dataframes with the power of a full-blown database system.

It’s worth noting that a given application may support dataframe data formats specific to that application. For instance, Pandas provides data types for sparse data structures in a dataframe. By contrast, Spark does not have an explicit sparse data type, so any sparse-format data needs an additional conversion step before it can be used in a Spark dataframe.
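
To make the Pandas side of that comparison concrete, here is a minimal sketch of a sparse column in a Pandas dataframe. The column name clicks and its values are made up for illustration; the example uses pd.SparseDtype and the .sparse accessor.

```python
import pandas as pd

# A mostly-zero column (hypothetical data): a typical candidate for sparse storage.
dense = pd.DataFrame({"clicks": [0, 0, 0, 5, 0, 0, 0, 0, 2, 0]})

# Convert to a sparse dtype. Values equal to fill_value (0) are not stored explicitly.
sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))
print(sparse.dtypes)

# Density is the fraction of values that are actually stored.
print(sparse.sparse.density)

# Convert back to an ordinary dense dataframe, the kind of step
# another tool may require before it can ingest the data.
roundtrip = sparse.sparse.to_dense()
print(roundtrip.dtypes)
```

The sparse version behaves like a regular dataframe for most operations, but the storage format is specific to Pandas, which is why moving such data into another system generally means converting it first.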

So, while some dataframe libraries are more popular than others, there’s no one definitive version of a dataframe. Dataframes are a concept implemented by many different applications. Each implementation is free to do things differently under the hood, and some implementations also differ in the details the end user sees.


