Fix File Names for Unix Environment #1322

Open
wants to merge 7 commits into
base: master

emrakyz
Contributor

@emrakyz emrakyz commented May 10, 2023

Luke mentions Unix naming conventions in his videos. Here is a script that makes all file names consistent with Unix conventions, in parallel, quickly, easily and safely.

Luke also asks: "What do you think about naming files with underscores instead of dashes?", worrying that using underscores seems like a "soydev" thing 😂. I give my opinion below. Actually, the justification is objective rather than a matter of opinion.

EDIT (2024-04-16): I have written the exact same program in pure C, without any external libraries or programs other than glibc and Linux. The C program is about 20 times faster, more minimal and safer, and it has more features. It can now handle individual files, it will never encounter race conditions, it accepts multiple arguments and it has a flag for dry running. It can easily be integrated into TUI file managers like LF. Check it out:

No Bullshit Instant Filename Sanitizer: Threaded & Recursive & Lightweight

This version directly uses the newer renameat2() and fstatat() functions on Linux, so it doesn't need lots of different calls and repeated checks. renameat2() also supports renaming things in place without replacing an existing target, so it's even faster and safer than mv.

What The Script Does

1. Check if the item is a directory. If so:

  • a) Remove non-English characters.
  • b) Replace spaces, dots, and dashes with underscores.
  • c) Remove consecutive underscores.
  • d) Convert the name to lowercase.
  • e) Remove any other special characters.
  • f) Strip leading and trailing underscores, so every file or directory name starts and ends with an alphanumeric character.

2. If the item is a file, apply the same transformations as for directories, but keep the file extension intact.

3. Check if the original name and the new name are different. If so, and if a file or directory with the new name already exists, create a unique name.

  • The script can use Dash and parallel processes, ensuring safety and performance with a subshell environment. Therefore it can rename more than 1,000,000 files and directories with extremely weird names in minutes. For normal human-sized hard drives, it's mostly instant. (I have tested bash built-in functions, tr, awk and sed. Nothing was faster than sed for this task; awk was very close but still slower.)

Examples of How Every File Should Look: this_is_an_example_directory_name OR this_is_an_example_video_file.mp4

Why "_" is Preferred Instead of a Space or a Dot or a Dash

In Unix environments, it is generally recommended to replace spaces in filenames with underscores (_), rather than dots (.) or dashes (-). This is because underscores are more commonly used and supported by Unix utilities and programming languages.

Dots (.) are typically used as a separator between a file's name and its extension, so using them to replace spaces can lead to confusion and errors. Dashes (-) are sometimes used in place of spaces, but they can be problematic because they are often used as a command-line option delimiter in Unix, which can lead to unexpected behavior.

  • Readability: Underscores make file and directory names more readable, as they clearly separate words and components in the name, whereas spaces can be easily overlooked, and dots can be mistaken for file extensions.

  • Compatibility: Some command line tools and scripts may not handle file names with spaces or dots properly without additional configuration or escaping. Underscores, on the other hand, do not require special handling and are generally better supported across various tools and environments.

  • URL encoding: When sharing file paths in URLs or web applications, spaces and dots may require URL encoding (e.g., replacing spaces with "%20" and dots with "%2E"), which can make the URLs less readable and more cumbersome to work with.

The Reason Behind Using a Subshell Environment

Subshells are used in the script to isolate the execution environment of each parallel process. This isolation ensures that the processes do not interfere with each other, as they have their own separate environments, including local variables and function definitions. This separation is particularly important when running multiple processes in parallel, as it reduces the risk of race conditions and other synchronization issues.

Using subshells in the script also simplifies the process of launching parallel processes. By executing the process_item function within a subshell, the script can easily leverage the -P flag of xargs to specify the maximum number of parallel processes to run. This results in improved performance and efficiency when processing a large number of files and directories.

The Benefit of Removing Non-English Characters

  • Compatibility: Non-English characters can cause compatibility issues with some tools, applications, or systems that are not properly configured to handle them. By removing these characters, you reduce the risk of encountering issues related to character encoding and ensure broader compatibility across different environments.

  • Consistency: Standardizing file and directory names by removing non-English characters can make it easier to organize, search, and manage your files. It helps maintain a consistent naming convention across your file system, which can be beneficial for both human users and automated processes.

  • Accessibility: Using only English characters in file and directory names can improve accessibility for users who may not be familiar with non-English characters or languages. This can be particularly important in multi-user or multi-language environments where not all users might be comfortable with non-English characters.

A Lot More Details

  • The first line checks whether the script is being run as the root user, so it doesn't let you edit system-related files that need extra permissions.

$(id -u) gets the user ID of the current user.
The script checks if the user ID is "0", which corresponds to the root user.
If the user is root, it will print an error message "This script should not be run as root" to standard error (>&2 is redirecting to standard error).
If the condition is true (the script is being run as root), it exits immediately with a status code of 1 (indicating an error).
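
The root check described above could look something like the following sketch. The function name check_not_root and its optional UID argument are hypothetical, added only so the check can be tried without actually being root; the script itself is described as doing the test inline on its first line.

```sh
# Hypothetical helper: refuse to continue when the user ID is 0 (root).
# An explicit UID may be passed as $1 for testing; otherwise `id -u` is used.
check_not_root() {
    uid="${1:-$(id -u)}"
    if [ "$uid" -eq 0 ]; then
        # >&2 sends the message to standard error.
        echo "This script should not be run as root" >&2
        return 1
    fi
    return 0
}
```

A script would then call `check_not_root || exit 1` before doing any work.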

  • find . -depth \( -path '*/.*/*' -o -path './.*' \) -prune -o \( -type f -o -type d \) -print0 | xargs -0 -P 0 -I {} dash -c '

find . starts a search in the current directory.
-depth ensures it processes a directory's contents before the directory itself.
The escaped parentheses group conditions or expressions together, allowing them to be treated as a single unit. This is useful for combining different conditions with logical operators.
-path '*/.*/*' This tests the path for a match against the given pattern.
This pattern matches any path where there's a directory that starts with a dot (.*)
The first * matches any sequence of characters (any directory or sub-directory).
/.*/ matches any directory that starts with a dot.
The last * matches any sequence of characters after that directory and any contents inside the dot directory.
In essence, it matches any file or directory that is inside a dot directory at any depth.
-o This is a logical OR. It ensures that if the condition before it fails, the command will test the condition after it.

-path './.*' This tests the path for a match against the given pattern.

./.* This pattern matches any path that starts from the current directory (./) and has a filename or directory name starting with a dot (.*) This is essentially targeting dot files or directories at the root of where you run the find command.
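-prune is meant to tell find not to descend into those directories (ignoring hidden directories). Because -depth is also given, find still walks into them; the hidden paths are skipped because they match the -path tests, so the part after -o is never evaluated for them.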

  • ( -type f -o -type d )

This is another grouped condition.
-type f Matches regular files.
-type d Matches directories.
The -o between them ensures that both regular files and directories are matched. So this entire grouped condition is true for both files and directories.
-print0 will print the results (full path of files and directories) separated by null characters. This is useful to handle filenames with special characters like spaces or newlines.

| xargs -0 pipes the find results to xargs, which will run a command on each result. The -0 tells xargs to expect null-separated input (to match -print0).
-P 0 tells xargs to run as many processes as possible in parallel.
-I {} specifies a placeholder {} that will be replaced by each input item.
dash -c invokes the dash shell to execute the command provided in the string that follows.
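
To see which paths this find expression hands to xargs, here is a small sketch that runs it on a throwaway tree. The directory and file names are made up for illustration, and mktemp -d is assumed to be available (it is common on Linux and BSD but not required by POSIX). The null separators are turned into newlines only so the result can be printed.

```sh
# Build a hypothetical test tree with one visible and one hidden directory.
demo=$(mktemp -d)
mkdir -p "$demo/Visible Dir" "$demo/.hidden"
touch "$demo/Visible Dir/A File.txt" "$demo/.hidden/secret.txt"

# Same expression as in the script; tr turns the NUL separators into newlines.
found=$(cd "$demo" && find . -depth \( -path '*/.*/*' -o -path './.*' \) -prune -o \
    \( -type f -o -type d \) -print0 | tr '\0' '\n')

printf '%s\n' "$found"
rm -rf "$demo"
```

The file inside "Visible Dir" is listed before the directory itself because of -depth, and nothing under .hidden shows up.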

  • base="${1##*/}" and path="${1%/*}"

First one extracts the base name (the filename or directory name without the path). For example, if $1 is ./mydir/myfile.txt, base will be myfile.txt

Second one extracts the path without the base name. Using the previous example, path will be ./mydir.
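
The two parameter expansions can be tried on their own. In this sketch the variable name item stands in for $1 and is only there for illustration.

```sh
item='./mydir/myfile.txt'

base="${item##*/}"   # remove the longest prefix that ends in "/"
path="${item%/*}"    # remove the shortest suffix that starts with "/"

printf '%s\n' "$base" "$path"
```

Both expansions are built into the shell, so no basename or dirname process is started.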

  • pattern="s/[^a-zA-Z0-9 ._-]//g; s/[ .-]/_/g; s/_+/_/g; s/^_+//; s/_+$//; s/[A-Z]/\L&/g"

This sets up a series of sed commands (regular expression replacements):
Remove all characters that are not alphanumeric, space, dot, underscore, or hyphen.
Replace spaces, dots, and hyphens with underscores.
Replace consecutive underscores with a single underscore.
Remove leading underscores.
Remove trailing underscores.
Convert all uppercase letters to lowercase.

  • [ -f "$1" ] && pattern="$pattern; s/_([^_]+)$/.\1/"

If the item is a file (-f checks for a regular file), it appends one more sed command to the pattern. This command turns the last underscore back into a dot, so the file extension stays after the base name, separated by a dot. For instance, "this_is an-example.file.txt" becomes "this_is_an_example_file.txt"

  • new_name="$(echo "$base" | sed -E "$pattern")"

Here, the base name of the file/directory is passed through sed to apply the transformations defined in the pattern.
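
The three steps above (base pattern, extra command for files, running sed) can be put together as a rough sketch. The sanitize function and its "file"/"dir" argument are hypothetical and not part of the script. The sketch assumes GNU sed, since \L in the replacement is a GNU extension, and it uses printf and LC_ALL=C instead of a plain echo so that backslashes and locale-dependent ranges don't change the result.

```sh
# Hypothetical wrapper: $1 is the base name, $2 is "file" or "dir".
sanitize() {
    pattern='s/[^a-zA-Z0-9 ._-]//g; s/[ .-]/_/g; s/_+/_/g; s/^_+//; s/_+$//; s/[A-Z]/\L&/g'
    # For files, turn the last underscore back into the extension dot.
    [ "$2" = file ] && pattern="$pattern"'; s/_([^_]+)$/.\1/'
    printf '%s\n' "$1" | LC_ALL=C sed -E "$pattern"
}

sanitize 'this_is an-example.file.txt' file
sanitize 'My Photos (2023)' dir
```

Note that with this pattern a file that has no extension, such as "my notes", would come out as "my.notes", because the last underscore-separated word is always treated as the extension.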

  • [ "$base" != "$new_name" ] && [ -e "$path/$new_name" ] && new_name="${$}_${new_name}"

If the original name is different from the new name and a file or directory with the new name already exists, it prepends the process ID "${$}" to the new name to avoid collisions. This is the most streamlined and fastest method. Numeric counters can run into race conditions and they are slower. Since the chance of ending up with the same file name is pretty low, it's good to go with the fastest and most minimal method.

  • [ "$base" != "$new_name" ] && mv "$1" "$path/$new_name"

If the original name is different from the new name, it renames (moves) the original file/directory to the new name.
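
The collision check and the rename from the last two steps can be sketched together. The rename_item helper and the demo file names are hypothetical; in the script this logic runs inline inside dash -c, where "${$}" is the PID of that dash process.

```sh
# Hypothetical helper: rename $1 to the new base name $2 in the same directory,
# prefixing the shell's PID when the target name is already taken.
rename_item() {
    src=$1
    new_name=$2
    base="${src##*/}"
    path="${src%/*}"
    [ "$base" = "$new_name" ] && return 0
    [ -e "$path/$new_name" ] && new_name="${$}_${new_name}"
    mv "$src" "$path/$new_name"
}

demo=$(mktemp -d)
touch "$demo/my_file.txt" "$demo/My File.txt"
rename_item "$demo/My File.txt" "my_file.txt"   # target exists, so the PID is prefixed
listing=$(ls "$demo")
printf '%s\n' "$listing"
rm -rf "$demo"
```

The existing my_file.txt is left untouched, and the renamed file ends up as something like 12345_my_file.txt, with the actual number depending on the running shell.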

  • ' _ {}

This closes the dash -c command string and provides initial parameters for each invocation of the command. The _ is a dummy argument corresponding to $0 (the command/script name itself) when using dash -c. The {} is replaced by each input item from xargs, representing the current file or directory being processed.
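
How the dummy _ and the item line up with $0 and $1 can be checked with a one-liner. This sketch uses sh instead of dash so that it runs on systems without dash installed; the file name is made up.

```sh
# "_" becomes $0 and the path becomes $1 inside the -c string.
out=$(sh -c 'printf "%s|%s\n" "$0" "$1"' _ './some file.txt')
printf '%s\n' "$out"
```

Passing the path as an argument, rather than pasting {} into the command string, keeps spaces and quotes in the name from being interpreted as shell syntax.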

### What The Script Does

**1.** Check if the item is a directory. If so;

- **a)** Remove non-English characters.
- **b)** Replace spaces, dots, and dashes with underscores.
- **c)** Remove consecutive underscores.
- **d)** Convert the name to lowercase.
- **e)** Remove any other special characters.
- **f)** If the resulting name is empty, set it to "untitled".
- **g)** Every file or directory should start and end with an alphanumeric character.

**2.** If the item is a file, apply the same transformations as for directories, but keep the file extension intact.

**3.** Check if the original name and the new name are different. If so, and if a file or directory with the new name already exists, create a unique name.

- The script can use Dash and parallel processes, ensuring safety and performance with a subshell environment. Therefore it can rename more than 100,000 files with extremely weird names in 30 seconds. (I have tested bash built-in functions, tr, awk and sed. Nothing was faster than sed for this task; awk was very close but still slower.)

**Examples of How Every File Should Look:** this_is_an_example_directory_name  **OR**  this_is_an_example_video_file.mp4


### A Lot More Details
- find . -depth -name '*' -print0: This find command searches for all files and directories recursively in the current directory (.). -depth ensures that the directory tree is traversed depth-first, and -name '*' matches all items. -print0 prints the results separated by a null character (useful for handling filenames with spaces or special characters).

- | xargs -0 -n1 -P10 -I{} sh -c '...': The find command output is piped (|) to xargs. The -0 option tells xargs to expect null-terminated items. -n1 processes one item at a time. -P10 runs 10 parallel processes. -I{} sets the placeholder for input items. sh -c '...' runs a shell script with the given commands for each input item.

- generate_unique_name() { ... }: This is a function that generates a unique name for a file or directory. It takes three arguments: the base name, the extension (if any), and the destination path. It increments a counter and appends it to the base name until a unique name is found, then returns the unique name.

- process_item() { ... }: This is the main function that processes a single file or directory path. It sanitizes the name and renames the item if needed.

- [ "$item_path" = "." ] && return: This line checks if the item path is the current directory (.). If it is, the function returns without doing anything.

- dir_name=$(dirname "$item_path"); base_name=$(basename "$item_path"): These commands extract the directory name and base name from the item path.

- if [ -d "$item_path" ]; then ... else ... fi: This conditional block checks if the item is a directory (-d) and processes it accordingly.

- new_name=$(echo "$base_name" | sed -E "s/[^a-zA-Z0-9 _.-]+//g; s/[ .-]+/_/g; s/_+/_/g; s/^_//; s/_$//; s/(.*)/\L\1/"): This line uses sed to sanitize the base name by removing unwanted characters, replacing spaces, periods, and hyphens with underscores, collapsing repeated underscores, trimming a leading or trailing underscore, and converting the name to lowercase. The -E flag enables extended regular expressions.

- [ -z "$new_name" ] && new_name="untitled": If the new name is empty, it is set to "untitled".

- file_ext="${base_name##*.}" base_name_no_ext="${base_name%.*}": For files, this line extracts the file extension and the base name without the extension.

- new_name="${new_base_name_no_ext}.${file_ext}": For files, this line constructs the new file name with the sanitized base name and the original file extension.

- if [ "$base_name" != "$new_name" ]; then ... fi: This conditional block checks if the original name and the new name are different.

- [ -e "${dir_name}/${new_name}" ] && new_name=$(generate_unique_name "${new_name%.*}" "${new_name##*.}" "$dir_name"): If the new name already exists, the generate_unique_name function is called to get a unique name.

- mv "$item_path" "${dir_name}/${new_name}" 2>/dev/null || true: This line moves (renames) the item to the new path with the sanitized name. If an error occurs, it is redirected to /dev/null (ignored) and the script continues executing due to the || true.

- process_item "{}": This line calls the process_item function with the input item path (represented by {}) as the argument.

- ' 2>/dev/null: This part of the script suppresses any error messages by redirecting the standard error output to /dev/null.
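
The counter-based generate_unique_name function is only summarized above, not shown. A minimal sketch of what such a function might look like, assuming the three arguments described (base name, extension, destination directory) and an underscore before the counter, which is a guess:

```sh
# Sketch only: append _1, _2, ... to the base name until the name is free.
generate_unique_name() {
    base=$1
    ext=$2
    dir=$3
    suffix=""
    [ -n "$ext" ] && suffix=".$ext"
    n=1
    while [ -e "$dir/${base}_${n}${suffix}" ]; do
        n=$((n + 1))
    done
    printf '%s\n' "${base}_${n}${suffix}"
}
```

Because the existence check and the later mv are separate steps, two parallel processes could pick the same counter value, which is the kind of race a counter approach is exposed to.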
@mutageneral

Why does it use dash explicitly? I do use dash, but you should probably just use sh so that it works on systems with other shells as /bin/sh.

Also, I'm getting this error:

$ ./fixnames sulinos-20201112-minimal.iso 
mv: unable to evaluate 'sulinos-20201112-minimal.iso/sulinos_20201112_minimal.iso': Not a directory
$ ./fixnames out.md
$ ls out.md sulinos-20201112-minimal.iso 
out.md  sulinos-20201112-minimal.iso

@emrakyz
Contributor Author

emrakyz commented Feb 27, 2024

Hi @mutageneral

It uses Dash explicitly because Luke's repo currently doesn't have any POSIX shell, and trying to run this script with /bin/sh (which is linked to Bash in this repo right now) can give users errors. Naming Dash shows that the script is POSIX, fast and simple, and also that you can't run it on Bash. On the other hand, you are completely right: you can change it to /bin/sh on your system if you use a POSIX shell as your /bin/sh.

For your problem:
This script can't rename a single file on its own. You need to give it a PATH that is a directory. That's why you see an error.

For example let's say I have a million files and directories under the root directory /

When you enter fixnames /, it renames ALL files and directories on your system extremely fast, excluding hidden files, config files and root-protected files for better safety and security. So you can't rename any files that are owned by root (system files), and you can't rename things under .local or .config, so that nothing breaks.

On the other hand, you may sometimes see some unrelated, harmless errors (though I tried to get rid of them as much as possible). Let's say your system is very fast and you have a Threadripper CPU with 192 threads. In that case files and directories are renamed so fast that there can be race conditions (directories renamed before the files inside them), so you need to run the script a second time to rename the files that couldn't be renamed the first time.

This script is so easy to use and fast so you don't need to worry about these. Just run it and your files will be renamed. Let me give you an example:

You can give it a full PATH or a relative one:
fixnames /home/user/pictures or fixnames pictures (pictures should be a directory since we first run find command on it).

It renames all files and directories no matter how deep they are located. Let me give you a practical before/after example with a video. It will fix anything and everything instantly, no matter how complex or how deep your files are. If you have millions of files and a very slow machine, you may need to wait a couple of minutes though:

fixnames.mp4

@narukeh
Contributor

narukeh commented Mar 4, 2024

@emrakyz have a look at this:

https://github.com/narukeh/dotfiles/blob/master/.local/bin/personal/ffn

maybe it will help/inspire you to make it better or add more features

@emrakyz
Contributor Author

emrakyz commented Mar 5, 2024

Hi! @narukeh

Thanks for sharing your script.

Can you explicitly state which features you recommend adding, so we can discuss them?

Meanwhile, let me tell you my initial aims:

I have 16TB of data that I have gathered over the years on an external hard drive. It had lots of nested directory structures and tons of different files. Existing solutions were not performant and efficient enough, so I created this tool.

At the same time, I went experimental and tried to create a tool with 'autistic' level simplicity, minimalism, safety and speed. This way, the script fits the philosophy behind this repository better, and it is also easier to explain, so people know exactly what it does.

On the other hand, I tried to use universally applied rules for Unix file naming and URL conventions (alphanumeric lower-cased English and underscores). Not just Unix by the way. This convention is best suited for Windows, Linux, BSD, Mac, Solaris and the Web. No application is problematic with lower-cased alphanumeric English characters with underscores. This does not require special handling on any machine.

So, the main aim of this tool is to have as few lines as possible; to use as few commands, pipes and checks as possible; to have no dependencies; and to get rid of user interactivity completely (because there is only one universal file naming convention) while being completely safe. This also helps increase consistency among all files within the system. So, a user can blindly invoke fixnames / without further thinking to rename all files and directories within a system completely safely and very fast. You don't need to think about config files, hidden files, root-protected files, extensions, directories and all.

Let's do a direct comparison and see the different aims, advantages, disadvantages of both scripts even though they pretty much do a similar job:

This script is extremely efficient and performant because it only applies a single shell built-in check and a single sed -E command, without any pipes, redirection or other commands. So it only invokes dash and sed, nothing else. Your script, however, is aimed at single-file handling. Even though it has a pseudo-feature to rename multiple files with multiple arguments or wildcards, it only works within one directory. Firstly, sequential file renaming with loops is not as efficient. Secondly, it uses pipes with different commands (multiple tr commands and sed), adding extra overhead. My script also utilizes xargs for parallelism. So it's not only efficient but also parallel.

This script aims to create compactness, simplicity and minimalism. This makes it easy to read and modify. The logic of the script is extremely easy to understand. On the other hand, your script is harder to understand. The reader needs to delve deep and think about it properly to understand what it does.

This script handles both files and directories (regardless of depth) and also handles extensions, without explicitly writing separate commands for them, only adding one parameter with a built-in shell expansion. So a single, static sed command is used for files, directories and extensions. Your script's logic revolves around processing individual filenames. It extracts the base name and extension, and performs transformations on that filename.

This script handles file name conflicts in a very simple, fast way. Your script doesn't handle this. If a file with the intended new name already exists, your script's mv command might silently overwrite it, leading to data loss.

This script is more scalable because adding a simple sed pattern is easy. On the other hand, I don't find adding a custom pattern via flags useful, because it is prone to errors and mostly unnecessary. Besides, a better tool such as bulkrename along with neovim is more appropriate for manual, custom file naming.

This script can easily be integrated similarly with another script. The core logic within dash -c could be extracted and used as a modular component within a larger shell pipeline due to its potential for stdin/stdout usage. Your script has a more specific and complicated use case.

Additionally, I don't find dry running necessary because the result is obvious: lower-cased alphanumeric English separated by underscores. That's the biggest advantage of my script. No one has to care. It's completely automated, consistent and safe.

The find command is one of the best-known ways to handle file names. It's POSIX. It can find and normalize all files with all kinds of names (even ones with newline characters). -print0 is a very good way to pass them on as arguments, since it separates the output filenames with a null character. This is essential for handling filenames with spaces, newlines, or unusual special characters. Sorting them prevents race conditions. -depth also helps prevent race conditions by renaming files before the directory that contains them. With find, you can reach any depth, you can exclude specific files and directories, and you can hand the results to a subshell to do things.

Even though I could add additional features to this, I would do so only without making it significantly bigger and more complex, and without breaking its parallelism. The worst thing that can happen is that an unwanted file gets renamed into something more readable by a machine. I can't think of any reason a user would require an unconventional file name. Maybe an exclusion criterion could be added, or a condition for renaming a single file could be added above the find command, but in my opinion it's completely unnecessary.

@emrakyz
Contributor Author

emrakyz commented Apr 16, 2024

@mutageneral
@narukeh

Hey, you can check the new version I have written in Pure C instead of using subshells, redirections, piping, external programs and all. It's much faster, more minimal, safer, and it has more features.

I have taken @narukeh's recommendations into account (better help page, dry running, individual file handling; multiple argument handling).

No Bullshit Instant Filename Sanitizer: Threaded & Recursive & Lightweight

This version directly uses the newer renameat2() and fstatat() functions on Linux, so it doesn't need lots of different calls and repeated checks. renameat2() also supports renaming things in place without replacing an existing target, so it's even faster and safer than mv.


3 participants