`pandas_path` - Path style access for pandas

Love pathlib.Path*? Love pandas? Wish it were easy to use pathlib methods on pandas Series?

This package is for you. Just one import adds a .path accessor to any pandas Series or Index so that you can use all of the methods on a Path object.

* If not, you should.

Quickstart

Install the latest pandas-path with pip:

pip install pandas-path

Import path from pandas_path, and then the .path accessor will be available on any Series or Index:

import pandas as pd

# this is all you need to register the accessor
from pandas_path import path

Now you can use all the pathlib methods using the .path accessor on any Series in pandas!

pd.Series([
    'cat/1.jpg',
    'cat/2.jpg',
    'dog/1.jpg',
    'dog/2.jpg',
]).path.parent

# 0    cat
# 1    cat
# 2    dog
# 3    dog
# dtype: object
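For intuition, the result above matches what you get by applying pathlib to each element and converting back to a string. The following is a plain-Python sketch without pandas, assuming POSIX-style separators; it illustrates the idea and is not the package's actual implementation.

```python
from pathlib import PurePosixPath

paths = ["cat/1.jpg", "cat/2.jpg", "dog/1.jpg", "dog/2.jpg"]

# Apply the same pathlib property to every element, then convert back to str
parents = [str(PurePosixPath(p).parent) for p in paths]

print(parents)
```

The accessor saves you from writing this loop (or an equivalent `Series.map`) for every pathlib attribute you need.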

Examples

Here's an example:

from pathlib import Path
import pandas as pd

# This is the only line you need to register `.path` as an accessor
# on any Series or Index in pandas.
from pandas_path import path

# we'll make an example series from the py files in this repo;
# note that every element here is just a string--no need to make Path objects yourself
file_paths = pd.Series(str(s) for s in Path().glob('**/*.py'))

# 0                   setup.py
# 1    pandas_path/accessor.py
# 2        pandas_path/test.py
# dtype: object

Use the .path accessor to get just the filename rather than the full path:

file_paths.path.name

# 0       setup.py
# 1    accessor.py
# 2        test.py
# dtype: object

Use the .path accessor to get just the parent folder of each file:

file_paths.path.parent

# 0              .
# 1    pandas_path
# 2    pandas_path
# dtype: object

Use calculated methods like exists to filter for what exists on the filesystem:

file_paths.loc[3] = 'fake_file.txt'

# 0                   setup.py
# 1    pandas_path/accessor.py
# 2        pandas_path/test.py
# 3              fake_file.txt
# dtype: object

file_paths.path.exists()

# 0     True
# 1     True
# 2     True
# 3    False
# dtype: bool

Use path methods like with_suffix to dynamically create new filenames:

file_paths.path.with_suffix('.png')

# 0                   setup.png
# 1    pandas_path/accessor.png
# 2        pandas_path/test.png
# 3               fake_file.png
# dtype: object

Use the / operator just as you would in pathlib, with the .path accessor on either side of the operator.

"different_root_folder" / file_paths.path

# 0                   different_root_folder/setup.py
# 1    different_root_folder/pandas_path/accessor.py
# 2        different_root_folder/pandas_path/test.py
# dtype: object

We'll even do element-wise operations with lists, arrays, or Series of the same length.

file_paths.path.parent.path / ["other_file1.txt", "other_file2.txt", "other_file3.txt"]

# 0                other_file1.txt
# 1    pandas_path/other_file2.txt
# 2    pandas_path/other_file3.txt
# dtype: object

Custom path accessors

Some libraries (such as cloudpathlib, which supports path operations for AWS S3, Azure Blobs, and Google Cloud Storage) implement the Path interface in other contexts. You can use pandas-path to register and use any class that implements the Path interface. For example:

import pandas as pd
from pandas_path import register_path_accessor
from cloudpathlib import S3Path

# creates an accessor ".s3" that creates s3 paths
register_path_accessor("s3", S3Path)

test = pd.Series(
    S3Path("s3://ladi/Images/FEMA_CAP/2020/70349").iterdir()
)

test.s3.bucket
#> 0      ladi
#> 1      ladi
#>        ... 
#> 577    ladi
#> 578    ladi
#> Length: 579, dtype: object

If you need to pass specific args or kwargs to the path instantiation, you can pass those at registration time. For example, S3Path can be passed an S3Client with explicit credentials.

import pandas as pd
from pandas_path import register_path_accessor
from cloudpathlib import S3Path, S3Client

# creates an accessor ".s3" that creates s3 paths using `S3Path(*, client=S3Client(...))`
register_path_accessor("s3", S3Path, client=S3Client(profile_name='other_aws_profile'))

test = pd.Series(
    S3Path("s3://ladi/Images/FEMA_CAP/2020/70349").iterdir()
)

test.s3.bucket
#> 0      ladi
#> 1      ladi
#>        ... 
#> 577    ladi
#> 578    ladi
#> Length: 579, dtype: object

Another example is working with Windows paths on a POSIX machine. You can explicitly register PureWindowsPath to do this on any operating system:

import pandas as pd
from pandas_path import register_path_accessor
from pathlib import PureWindowsPath

register_path_accessor("win", PureWindowsPath)

test = pd.Series([
    r"c:\test\f1.txt",
    r"c:\test2\f2.txt",
])

test.win.parent
#> 0     c:\test
#> 1    c:\test2
#> dtype: object
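This works on any operating system because PureWindowsPath only manipulates the path string and never touches the filesystem. A small standard-library sketch of the per-element behavior that the "win" accessor relies on:

```python
from pathlib import PureWindowsPath

# Pure paths parse Windows syntax regardless of the host OS
p = PureWindowsPath(r"c:\test\f1.txt")

parent = str(p.parent)  # backslash-separated parent folder
name = p.name           # final path component
drive = p.drive         # drive letter with colon

print(parent, name, drive)
```

Filesystem methods such as exists() are not available on pure paths, so only the path-manipulation properties and methods apply here.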

Limitations

While most operations work out of the box, operator chaining with / will not work as expected since we always return the series itself, not the accessor.

file_paths.path.parent.path / "subfolder" / "other_file1.txt"

# ----> 1 file_paths.path.parent.path / "subfolder" / "other_file1.txt"
# ...
# TypeError: unsupported operand type(s) for /: 'str' and 'str'

Instead, either use the .path accessor on the result or re-write without chaining:

(file_paths.path.parent.path / "subfolder").path / "other_file1.txt"

# 0                subfolder/other_file1.txt
# 1    pandas_path/subfolder/other_file1.txt
# 2    pandas_path/subfolder/other_file1.txt
# dtype: object

file_paths.path.parent.path / "subfolder/other_file1.txt"

# 0                subfolder/other_file1.txt
# 1    pandas_path/subfolder/other_file1.txt
# 2    pandas_path/subfolder/other_file1.txt
# dtype: object
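The reason chaining fails is that / is left-associative: the first join runs on the accessor and returns plain data, so the second join no longer has an accessor to handle it. The sketch below reproduces that mechanism with a hypothetical ToyAccessor class that wraps a list instead of a Series. It is not part of pandas-path, and the exact error message differs from the pandas one shown above.

```python
class ToyAccessor:
    """Hypothetical stand-in for an accessor; not part of pandas-path."""

    def __init__(self, values):
        self.values = list(values)

    def __truediv__(self, other):
        # Returns the plain joined data, not another ToyAccessor
        return [f"{v}/{other}" for v in self.values]


acc = ToyAccessor(["pandas_path"])

# (acc / "subfolder") is evaluated first and yields a plain list of str
first = acc / "subfolder"

try:
    first / "other_file1.txt"  # list / str: nothing left to handle the join
    chained_ok = True
except TypeError:
    chained_ok = False

# Re-wrapping the intermediate result restores the join behavior
rewrapped = ToyAccessor(first) / "other_file1.txt"

print(chained_ok, rewrapped)
```

Re-wrapping the intermediate result here plays the same role as calling .path on the result of the first join.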

A numpy array or pandas Series on the left-hand side of / will not work properly.

pd.Series(['a', 'b', 'c']) / pd.Series(['1', '2', '3']).path

## IMPROPERLY BROADCASTS :'(

# 0    0    a/1
# 1    a/2
# 2    a/3
# dtype: object
# 1    0    b/1
# 1    b/2
# 2    b/3
# dtype: object
# 2    0    c/1
# 1    c/2
# 2    c/3
# dtype: object
# dtype: object

Instead, use the path accessor on the left-hand side as well.

pd.Series(['a', 'b', 'c']).path / pd.Series(['1', '2', '3']).path

# 0    a/1
# 1    b/2
# 2    c/3
# dtype: object

Path object on the left-hand side of a join (Python < 3.8)

Due to a bug in Python versions before 3.8, a join with a Path object on the left-hand side never gets handed off to us.

Path("dir") / pd.Series(['a', 'b', 'c']).path

#  TypeError: expected str, bytes or os.PathLike object, not PathAccessor

The workaround is to use a str on the left-hand side:

str(Path("dir")) / pd.Series(['a', 'b', 'c']).path

# 0    dir/a
# 1    dir/b
# 2    dir/c
# dtype: object
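A str on the left works because str does not define the / operator, so Python falls back to the right-hand operand's reflected method, `__rtruediv__`. The sketch below shows that dispatch with a hypothetical ToyJoiner class. It is an illustration only and assumes nothing about how pandas-path is implemented internally.

```python
class ToyJoiner:
    """Hypothetical class; illustrates reflected division only."""

    def __init__(self, values):
        self.values = list(values)

    def __rtruediv__(self, left):
        # Called for `left / self` when `left` cannot handle the operator
        return [f"{left}/{v}" for v in self.values]


# str has no __truediv__, so Python hands the join to ToyJoiner.__rtruediv__
joined = "dir" / ToyJoiner(["a", "b", "c"])

print(joined)
```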

That's all folks, enjoy!

Developed and maintained by your friends at DrivenData! ml competitions | ai consulting

Some examples created with reprexlite v0.4.2 to ensure reproducibility.
