Dissecting the "find" utility (Linux, MacOS)

Published:  02/05/2022 15:30

Introduction

When searching for files and directories on a Linux (or MacOS) box, the most reliable way is to use a program called "find".

However very few people (myself included) are familiar with how extensive the program actually is and how to make it handle advanced searches.

Its base purpose is to search recursively starting from a given directory.

There's another utility called locate that also performs searches except it does so against a pre-built search index.

Because that index needs to be populated by a scheduled task, it's not very popular and is no longer installed by default on many Linux distributions.

In contrast, find is present in almost every distribution, even in small container images.

You can skip to the "Recipes" section in case you're not interested in theory and the mechanics behind the program itself.

Anatomy of find commands

The command structure itself is a little exotic, in that it's using what they call an expression to represent both what to search for and what to do with the result.

We could say the expression is made of a combination of tests and actions:

  • Tests help matching files and directories using characteristics like name, size, date of modification, etc. The default test is to select everything recursively.
  • Actions control what to do with matched items, the default implicit action is to print out the filename to standard output.

It gets a little bit complicated because tests and actions can be combined using operators, but more on that later.

To me, pretty much all of the useful find commands should look like this:

find [path] [options] [tests|operators] [actions|operators]

Where path is the starting directory for the search. With GNU find it can be omitted and defaults to the current directory (which is effectively "."); other implementations may require it. I tend to omit it in many examples.

Options

By default, find doesn't follow symbolic links and just lists them as files (they actually are files).

It's important to remember as you may be searching in a directory that has symlinked subdirectories and those won't get traversed.

When you want find to follow symbolic links, you can add the -L option. Unlike the rest of the expression, it has to come before the starting path (e.g. find -L /some/dir -type d).

When that option is present and a test for file type is used (-type d for instance, more on that later), symlinks are considered to be the type of the file they point to.

For instance, a symlink to a directory will be matched by -type d when the -L option is present. Otherwise it's only matched by -type l.
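To make that concrete, here's a minimal sketch that builds a throwaway directory (the names "real" and "link" are made up for the demo) and compares what -type sees with and without -L:

```shell
# Hypothetical layout: one real directory and a symlink pointing to it
demo=$(mktemp -d)
mkdir "$demo/real"
ln -s real "$demo/link"

# Without -L the symlink is only ever a link
as_link=$(find "$demo" -name link -type l)
as_dir=$(find "$demo" -name link -type d)
# With -L (placed before the starting path) it counts as a directory
as_dir_L=$(find -L "$demo" -name link -type d)

echo "type l, no -L: ${as_link:-<nothing>}"
echo "type d, no -L: ${as_dir:-<nothing>}"
echo "type d, -L:    ${as_dir_L:-<nothing>}"
rm -rf "$demo"
```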

On the other hand, find crosses mount points by default. You can tell it not to do that by adding the -mount option (also spelled -xdev).

Tests

What makes find powerful is its capability to find files and directories that match a certain combination of tests.

There are a lot of different tests. For instance, searching for a file by name involves using one of the "name" tests:

find -name "*.txt"

Let's explore the most useful and common tests:

  • -name, -iname — Search for a specific filename with -name being case sensitive, and -iname case insensitive;
  • -type — Only match items of that type, most common ones being f for regular file and d for directories; Can be combined using comma as the separator, for instance:
    • -type f,d — Matches common files and directories, but doesn't match links and special files like sockets, named pipes etc.
  • -path, -ipath — Match the full path of the file, usually with wildcards. Keep in mind that find reports paths relative to the starting point, which means paths found when searching from the current directory all start with "./", and found paths never end with a "/";
  • -regex, -iregex — Match the entire file path to the specified regex;
    • Uses emacs type regexes by default but that can be changed using the -regextype option.
  • -mmin — Search for items that were last modified a certain number of minutes ago; Like most numeric tests, modifiers are allowed and pretty much always used with this test:
    • + (e.g. -mmin +10) means "more than that number of minutes ago";
    • - means "less than that number of minutes ago";
    • Not providing a modifier searches for items modified exactly that number of minutes ago, which is usually not what we want.
  • -mtime — Same as above but in days (more precisely x*24h);
  • -readable, -writable, -executable — Only select items that the current user can read, write to, or execute (the execute permission has a different meaning on files and on directories);
  • -perm — Search for items with specific file modes/permissions;
    • Can search for an exact file mode in octal (e.g. 774 or 0774);
    • Accepts a prefix to loosen the match: -mode means "all of these permission bits are set" (extra bits are fine) and /mode means "at least one of these bits is set";
    • Example: -perm -060 will select anything with at least read/write permissions set for the group.
  • -size — Searches for items based on file size; As with the time tests, it's often used with modifiers: a + in front means more than that size, - means less than that size, and no modifier means matching this exact size. Sizes are rounded up to the chosen unit before comparing, so you can and should specify a unit (defaults to 512-byte blocks):
    • b — 512-byte blocks, this is the default for some reason;
    • c — Bytes;
    • k — Kibibytes (1024 bytes);
    • M — Mebibytes (1024 KiB);
    • G — Gibibytes (1024 MiB).
  • -user, -uid — Select items that belong to provided user name or user ID;
  • -group, -gid — Same as above for groups;
  • -maxdepth — Descend at most to the given level below the starting directory (level 0 is the starting directory itself). -maxdepth 1 limits find to the starting directory and its direct children, effectively removing recursion, which is sometimes useful for performance reasons;
  • -mindepth — Do not select any item at a directory level that is less than the number given as argument. Using -mindepth 1 is the most common use case and is meant to ignore the starting directory itself.
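A few of the tests above are easier to grasp on a tiny tree. The sketch below uses made-up file names in a temporary directory and tries -size, -perm and the depth limits:

```shell
demo=$(mktemp -d)
mkdir "$demo/sub"
# A 3 KiB file at the top level and an empty file one level down
dd if=/dev/zero of="$demo/big.bin" bs=1024 count=3 2>/dev/null
: > "$demo/sub/empty.txt"
chmod 660 "$demo/big.bin"
chmod 600 "$demo/sub/empty.txt"

# Files larger than 2 KiB
larger=$(find "$demo" -type f -size +2k)
# Files whose group has both the read and write bits set
group_rw=$(find "$demo" -type f -perm -060)
# Direct children only, without the starting directory itself
top_level=$(find "$demo" -mindepth 1 -maxdepth 1 | sort)

printf '%s\n' "larger: $larger" "group rw: $group_rw" "top level:" "$top_level"
rm -rf "$demo"
```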

Operators

When using find you may have noticed that combining multiple tests matches items that fulfill all of these tests.

For instance:

find -type f -size +2M -iname "*.jpg"

Would select items that are regular files, are larger than 2 MiB and are named anything.jpg, anything.JPG, etc.

There is an implied logical AND between all of the tests. That's the default, but tests can be combined in any number of other logical ways.

It's important to note that AND is always implied when operators are missing, even when using other operators in the expression.

The other operators are:

  • Parentheses — They group tests together logically;
  • -not — Logically inverts what's next (can also be written !);
  • -a — Logical "AND", implied when multiple tests are combined with no operator;
  • -o — Logical "OR".

As an example, let's say we want to look for items whose permissions are anything other than "owner only". For regular files, owner-only is octal permission 600 (read and write, no execute) and for directories 700 (the execute bit on a directory is what allows entering and traversing it, hence the 7 instead of 6):

find \( -type f -not -perm 0600 \) -o \( -type d -not -perm 0700 \)

Do note we have to escape the parentheses because they have a special meaning for the shell. This can quickly make find commands hard to read but is the best way to make sure the tests work as intended, especially since operators also affect actions as we'll see momentarily.

Also notice that each pair of parentheses in the example actually holds more than one test. As always, they get combined with an implied "AND" operator.
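Here's a small sketch of that grouped expression running against a throwaway tree. The file and directory names are invented for the demo; only the items whose mode isn't owner-only should come out:

```shell
demo=$(mktemp -d)          # mktemp -d creates the directory with mode 700
mkdir "$demo/private" "$demo/shared"
: > "$demo/secret.txt"
: > "$demo/public.txt"
chmod 700 "$demo/private"
chmod 755 "$demo/shared"
chmod 600 "$demo/secret.txt"
chmod 644 "$demo/public.txt"

# Note the spaces around each escaped parenthesis: they are separate arguments
loose=$(find "$demo" \( -type f -not -perm 0600 \) -o \( -type d -not -perm 0700 \) | sort)

echo "$loose"
rm -rf "$demo"
```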

As is the case in programming languages, if some part of an expression using logical "AND" evaluates to false, the rest of the "AND" arms aren't even evaluated since the whole expression will always be false (logical AND only evaluates to true if all of its parts are true).

The main consequence of that is observed on actions (we talk about them later on) which are also affected by operators and this can have unexpected side effects.

As such, if we explicitly add the default action (-print) to a find command and there's a test before it, a logical "AND" will be implied between the test and the action.

Considering the following:

find -size +1M -print

There's an implied "AND" between the size test and print that makes it so that when the size test evaluates to false, that file doesn't get printed, and that makes sense.

We could put the -print action (or any action) before tests:

find -print -name "*.txt"

The expression pretty much means "print the item name and test for filename matching *.txt" and that'll just print every single file and folder and whatnot found recursively starting from the current directory. The test is completely useless.
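A quick way to convince yourself is to run both orderings on a directory with two files (hypothetical names below) and compare the output:

```shell
demo=$(mktemp -d)
: > "$demo/notes.txt"
: > "$demo/photo.jpg"

# Test first, action last: only the .txt file is printed
action_last=$(find "$demo" -name "*.txt" -print | sort)
# Action first: everything is printed before the test is even looked at
action_first=$(find "$demo" -print -name "*.txt" | sort)

printf 'action last:\n%s\n\naction first:\n%s\n' "$action_last" "$action_first"
rm -rf "$demo"
```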

This is why actions always come at the end of find commands, and why I presented them that way in the Anatomy section. It's not mandatory, but any action placed before the end of the command increases the risk of unexpected behavior, and I'd advise against it unless you know what you're doing.

It becomes even more complicated when other logical operators and/or multiple actions (they're also combined by operators with default "AND" being implied) are used.

Even though -print is the default action, writing it explicitly behaves differently: it places the action behind an implied "AND" that doesn't exist when no action is mentioned.

Let's imagine we're in a directory that has two files named "somefile" and "someotherfile", these two commands will have different outputs:

# Will correctly list both somefile and someotherfile
find -name somefile -o -name someotherfile

# Will only list someotherfile
find -name somefile -o -name someotherfile -print

In effect, the second command behaves as if it was:

find -name somefile -o \( -name someotherfile -a -print \)

The reason for the weirdness is that an "AND" was implied between the last test and the action, and "AND" has precedence over "OR", so the second name test and the print action form a single logical block.

When the first arm of the "OR" evaluates to true (the item is named somefile), find doesn't even evaluate the second block, which contains the action, because a logical "OR" whose first arm is true is always true. As a result, somefile matches but never gets printed.

You may have to re-read this section a few times after reading the next one (about actions) to understand it. Otherwise, just keep in mind that, to avoid unexpected behavior, it's a good idea to always add explicit parentheses when an action is involved alongside a logical "OR" or "NOT".
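The following sketch recreates the two-file directory in a temporary location and runs the implicit, explicit and parenthesized variants side by side:

```shell
demo=$(mktemp -d)
: > "$demo/somefile"
: > "$demo/someotherfile"

# Implicit -print applies to the whole expression: both names are listed
implicit=$(find "$demo" -name somefile -o -name someotherfile | sort)
# Explicit -print binds to the second test only: somefile matches silently
explicit=$(find "$demo" -name somefile -o -name someotherfile -print | sort)
# Parentheses restore the intended grouping
grouped=$(find "$demo" \( -name somefile -o -name someotherfile \) -print | sort)

printf 'implicit:\n%s\n\nexplicit:\n%s\n\ngrouped:\n%s\n' "$implicit" "$explicit" "$grouped"
rm -rf "$demo"
```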

Actions

If you've used find before, you probably know it has an option called -exec that will run a command on all the matches. In fact, -exec is only one of many possible actions.

We'll review executing commands on search results in its own section as it's a common use case.

In case it wasn't clear already, multiple actions can be combined in a single find command, knowing that operators will apply and that may have unexpected side effects depending on how the action evaluates. For instance, -exec actions tend to evaluate to true if the program return code was 0, and false otherwise but not always. More on that later.

We've already mentioned the default action: -print — it just prints out the paths of found files to standard output which is very easy to pipe to other commands using xargs. Example listing:

find /usr/bin -name "zip*"

/usr/bin/zipinfo
/usr/bin/zip
/usr/bin/ziptool
/usr/bin/zipcmp
/usr/bin/zipgrep
/usr/bin/zipsplit
/usr/bin/zipcloak
/usr/bin/zipmerge
/usr/bin/zipnote
/usr/bin/core_perl/zipdetails

ls - Show more details

Sometimes you need a little more information to review what the found files are about, and that's what the action -ls is for. Example:

find /usr/bin -name "zip*" -ls

3157801    172 -rwxr-xr-x   2 root     root       174040 Feb 16 19:25 /usr/bin/zipinfo
3157955    228 -rwxr-xr-x   1 root     root       232408 Apr 24  2020 /usr/bin/zip
3163743     40 -rwxr-xr-x   1 root     root        39920 Sep  8  2021 /usr/bin/ziptool
3163741     28 -rwxr-xr-x   1 root     root        27144 Sep  8  2021 /usr/bin/zipcmp
3157958      4 -rwxr-xr-x   1 root     root         2953 Feb 16 19:25 /usr/bin/zipgrep
3157960    100 -rwxr-xr-x   1 root     root       101544 Apr 24  2020 /usr/bin/zipsplit
3157956    104 -rwxr-xr-x   1 root     root       105872 Apr 24  2020 /usr/bin/zipcloak
3163742     20 -rwxr-xr-x   1 root     root        18848 Sep  8  2021 /usr/bin/zipmerge
3157959     96 -rwxr-xr-x   1 root     root        97448 Apr 24  2020 /usr/bin/zipnote
3147293     60 -rwxr-xr-x   1 root     root        60065 Nov 13 21:22 /usr/bin/core_perl/zipdetails

When more specific info is desired, it's best to use -exec and another program like file, stat or ls.

quit - Terminate on first match

Another interesting action is -quit. It instructs find to exit immediately after the first match.

A useful action when you're looking for a single specific file in some source directory. For instance:

find /usr/bin -name "zip*" -print -quit

/usr/bin/zipinfo

We need to add -print explicitly before -quit: without it, find exits on the first match before anything gets printed, so we wouldn't get any output whether it found the file or not.
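As a small illustration, with three invented files that all match the pattern, the version with -print yields exactly one path (whichever find meets first) and the version without it yields nothing:

```shell
demo=$(mktemp -d)
: > "$demo/zip-a"
: > "$demo/zip-b"
: > "$demo/zip-c"

# Prints the first match, then stops searching
first_only=$(find "$demo" -name "zip*" -print -quit)
# Stops on the first match without printing anything
silent=$(find "$demo" -name "zip*" -quit)

echo "with -print:    $first_only"
echo "without -print: ${silent:-<nothing>}"
rm -rf "$demo"
```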

delete - Remove matches from filesystem

There are a lot of examples on the web of find commands that remove specific files (e.g. all files older than X), it's a very common use case that most users accomplish with the -exec action but there is a dedicated action for it.

It's got a few caveats though:

  • Won't remove non-empty directories;
  • Doesn't ask for any confirmation, just deletes all matches.

It goes without saying that this action is dangerous and you should always test the target find command with another innocuous/harmless action first to verify that it's matching the files you really want to delete.
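A cautious workflow could look like the sketch below: run the tests with -print first, check the list, then swap in -delete. The ".log" files are placeholders in a temporary directory:

```shell
demo=$(mktemp -d)
: > "$demo/keep.txt"
: > "$demo/a.log"
: > "$demo/b.log"

# Dry run: same tests, harmless action
would_delete=$(find "$demo" -type f -name "*.log" -print | sort)
echo "about to delete:"
echo "$would_delete"

# Once the list looks right, replace -print with -delete
find "$demo" -type f -name "*.log" -delete
remaining=$(find "$demo" -type f)

echo "left over: $remaining"
rm -rf "$demo"
```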

Running commands on the matched items

The way to run commands on found items most commonly seen online is through -exec. We've all seen something like:

find /tmp -type f -mmin +90 -exec rm "{}" \;

Where we're looking to delete files in /tmp that have not been modified in the past 90 minutes.

The syntax looks a little strange right after the -exec part, let's explore what all these characters are:

  • The string {} gets replaced with the current file path being processed, we like to put it under double quotes to prevent shell expansion but it's not mandatory — It's possible to use the placeholder multiple times if needed;
  • The trailing ; marks the end of the arguments of the program in the -exec action (rm here), so whatever follows belongs to find again; There's another delimiter we'll talk about later;
  • The \ right before the ; is needed because ; is a special character for shells (its purpose is to write multiple commands on one line) so we have to escape it or the shell will pick it up and the find command won't work at all.

Besides ";", the other possible delimiter is "+", a character that doesn't need to be escaped to work.

The "+" delimiter makes it so the command is called only once (it can be called more than once in practice but bear with us) and is given the whole list of matches as the placeholder "{}".

Any command that accepts multiple files as arguments could be a good candidate for using the "+" delimiter instead of "\;". There are however a few limitations:

  • Only one "{}" placeholder is allowed with "+" and it has to be right before the "+" delimiter, which is quite a limiting factor to what can be accomplished;
  • The action always evaluates to "true" when the "+" delimiter is used. It's important to know as it can influence behavior when logical operators are inserted between actions;
  • Any non-zero return value from the program inside the action using the "+" delimiter will result in a non-zero return value from find;
  • A minor point, but using "+" behaves similarly to xargs in that it'll invoke the command multiple times if the list of matches exceeds the maximum command line length allowed on the system.
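One way to see the difference between the two delimiters is to count how many times the command runs. In this sketch echo stands in for a real program, and each invocation produces one line:

```shell
demo=$(mktemp -d)
: > "$demo/a"
: > "$demo/b"
: > "$demo/c"

# With \; the command runs once per match: three invocations, three lines
per_file=$(find "$demo" -type f -exec echo run: "{}" \; | wc -l | tr -d ' ')
# With + the matches are batched: one invocation, one line
batched=$(find "$demo" -type f -exec echo run: "{}" + | wc -l | tr -d ' ')

echo "invocations with \\;: $per_file"
echo "invocations with +:  $batched"
rm -rf "$demo"
```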

Combining -exec-type actions (we'll see momentarily that there's more than one) makes it possible to produce powerful combinations in a single command, but one should always remember how operators work and that "AND" is implied when there aren't any.

For instance, examining the following command:

find -type f -exec grep "match" "{}" \; -exec cp "{}" copy/ \;

We want to copy any file that has the text "match" in it to a directory named "copy/". One way to go is to use grep, then cp.

The command will only copy the files that have the "match" string because grep will only yield an exit code of 0 in that case, which find evaluates as the value "true".

At this point we should understand what's going on: there's an implicit "AND" between the two -exec actions, so when the grep evaluates to true, the next action is performed. Otherwise, it's skipped.

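If you want to always run multiple -exec commands, a plain logical "OR" (-o) won't do, as it only runs the second command when the first one fails. GNU find has a comma operator (,) for this: both sides are always evaluated, whatever the first one returns.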
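Both behaviors can be sketched on two sample files, one containing the text and one not. I'm assuming GNU find here for the comma operator, and I've added -q to grep so it only reports through its exit code; all file names are made up:

```shell
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/copy"
echo "a match here" > "$demo/src/hit.txt"
echo "nothing" > "$demo/src/miss.txt"

# Implied AND: cp only runs for files where grep -q succeeded
find "$demo/src" -type f -exec grep -q "match" "{}" \; -exec cp "{}" "$demo/copy/" \;
copied=$(ls "$demo/copy")

# Comma operator (GNU find): basename runs for every file,
# whether grep succeeded or not
listed=$(find "$demo/src" -type f \( -exec grep -q "match" "{}" \; , -exec basename "{}" \; \) | sort)

printf 'copied:\n%s\n\nlisted:\n%s\n' "$copied" "$listed"
rm -rf "$demo"
```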

The case of execdir

The action -execdir behaves exactly like -exec except the command is run from the directory containing the matched file whereas regular -exec runs the command from the directory in which find is running.

It's very uncommon to see -execdir in examples though it's technically usually "safer" to use.
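To see the difference, the sketch below prints both the working directory and the value of "{}" for each action. It assumes a PATH without relative entries (otherwise -execdir refuses to run); the directory layout is invented:

```shell
# Resolve the temporary directory to its physical path so it compares cleanly
demo=$(cd "$(mktemp -d)" && pwd -P)
mkdir "$demo/sub"
: > "$demo/sub/file.txt"

# -exec: runs where find was started, {} is the path from the starting point
exec_dir=$(cd "$demo" && find . -name file.txt -exec pwd -P \;)
exec_arg=$(cd "$demo" && find . -name file.txt -exec echo "{}" \;)
# -execdir: runs inside the match's directory, {} is relative to it
execdir_dir=$(cd "$demo" && find . -name file.txt -execdir pwd -P \;)
execdir_arg=$(cd "$demo" && find . -name file.txt -execdir echo "{}" \;)

echo "-exec:    cwd=$exec_dir arg=$exec_arg"
echo "-execdir: cwd=$execdir_dir arg=$execdir_arg"
rm -rf "$demo"
```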

The whole safety concern is related to how -exec takes place some unit of time after matching a file. During that time, it's theoretically possible that the parent directory of said matched file or directory has changed (for instance, been replaced with a link) and -exec will run the command in that other directory with no verification.

If you feel like this is a fringe case, you're right. But it's better to be safe than risk one of these race conditions, especially when multiple users and programs are altering the file system at the same time.

On the other hand, -execdir might also be unsafe if there are relative paths in the PATH environment variable. For instance, some users add "." to their path so that they don't have to use "./program" in the current directory and can just run "program".

In that case, -execdir might run a program with the same name as the one mentioned in argument found in another directory somewhere where there's a match for the running find command.

However, find will refuse to run any -execdir action if there is a relative path in the PATH environment variable.

Conclusion

Because of the extra check for possible -execdir issues, it's the safer option and should be used instead of -exec.

However, the "risk" related to -exec is very low anyway. So you're pretty much fine either way.

ok, okdir

These two actions behave exactly like -exec and -execdir respectively except they ask for confirmation before running their associated command.

More precisely, when using the "\;" delimiter, these actions will result in a prompt for every single matched item.

It's a safer version of the exec actions that sometimes makes sense when using the console interactively.

A few words about globstar

Starting from Bash v4 (zsh has this by default AFAIK) it became possible to use ** in any path to mean full recursion of any directory and subdirectory.

For instance:

ls ./**/*.md

Will list .md files anywhere in subdirectories starting from the current directory, including ./something/somethingelse/README.md.

Without globstar, a double "*" has the same effect as a single "*" and is thus limited to a single directory level.

On Bash, enabling globstar requires adding the following to your ~/.bashrc file (then relogging or sourcing that file to apply the changes):

shopt -s globstar

You can see that it's working using ls ** in some directory, every single file anywhere in that directory tree should get listed.

We're talking about globstar because it sort of fills in for some of the use cases related to find (matching files recursively).

It can be easier to read a for loop such as:

for pathname in ./**/*.md; do
  echo "Found md file $pathname"
  # Do something with the file...
done

Rather than a big find command with several actions that might need to be combined with the right operators and delimiters.

However, you probably shouldn't rely on globstar to write the most portable scripts, not only because older versions of Bash do not support it, but because some shells plainly just don't.

For example, BusyBox doesn't support it and is often used in lightweight containers. BusyBox however does include find.
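When globstar isn't available, one possible way to keep the readability of a loop is to hand the matches to a small inline sh script. This is only a sketch with made-up file names; the matches arrive as "$@" and the lone "sh" argument fills $0:

```shell
demo=$(mktemp -d)
mkdir -p "$demo/docs/guide"
: > "$demo/index.md"
: > "$demo/docs/guide/intro.md"
: > "$demo/docs/notes.txt"

found=$(find "$demo" -type f -name "*.md" -exec sh -c '
  for pathname in "$@"; do
    echo "Found md file $pathname"
    # Do something with the file...
  done
' sh "{}" + | sort)

echo "$found"
rm -rf "$demo"
```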

A few recipes

Practice speaks more than manpages.

Delete files older than a year

We could add "-rf" to rm and remove the "-type f" test to also remove directories but that makes this command even more dangerous so proceed with caution.

You may want to try it first without the -execdir part to see that it matches the right files.

find <TARGET_DIR> -type f -mtime +365 -execdir rm {} \;

Find recently modified files in a directory

It's easy to find out recent modifications in a code repository of some sort. In other settings, you might need find with a time argument and the "-" modifier.

For example:

find <TARGET_DIR> -type f -mmin -60

Will find all files modified in the last 60 minutes.

To count in days, -mtime can be used instead of -mmin.

The special case of -mtime 0 means "in the last 24 hours".
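Here's a sketch of the "-" and "+" modifiers on two sample files, one of which is backdated with touch -t so that it's clearly older than an hour:

```shell
demo=$(mktemp -d)
: > "$demo/fresh.txt"
: > "$demo/stale.txt"
# Backdate the second file to 1 January 2000, 00:00
touch -t 200001010000 "$demo/stale.txt"

# Modified less than 60 minutes ago
recent=$(find "$demo" -type f -mmin -60)
# Modified more than 60 minutes ago
old=$(find "$demo" -type f -mmin +60)
# Modified within the last 24 hours
last_day=$(find "$demo" -type f -mtime 0)

echo "recent:   $recent"
echo "old:      $old"
echo "last day: $last_day"
rm -rf "$demo"
```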

Find all files of certain type but ignore some subdirectories

This one comes in handy in JavaScript projects where the directory "node_modules" can grow to become extremely large and only contains the project dependencies — There's usually no point in searching through it.

There are multiple ways to accomplish the same result but the intended one is through a rarely seen action called -prune that we didn't even bring up in the current article.

To help explain this one, we could list all the files and directories that are not under node_modules in the target directory using:

find <TARGET_DIR> -name node_modules -prune -o -print

However that doesn't get us far, we want to look for specific things but not into the node_modules directory.

If you've read the previous sections, you may remember strange things can happen with explicit operators (like -o) and multiple actions and tests.

In the end, we just have to group everything that comes after the -prune -o part with parentheses, which are special characters for the shell so we have to escape them using "\":

find <TARGET_DIR> -name node_modules -prune -o \( -name "*.md" -type f -print \)

Where we're looking to list all of the "*.md" files present anywhere in the project except in node_modules.
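To check that the pruning works, this sketch builds a miniature project (names are placeholders) with one .md file hidden inside node_modules, which should not show up in the results:

```shell
demo=$(mktemp -d)
mkdir -p "$demo/node_modules/pkg" "$demo/docs"
: > "$demo/readme.md"
: > "$demo/docs/guide.md"
: > "$demo/node_modules/pkg/readme.md"

pruned=$(find "$demo" -name node_modules -prune -o \( -name "*.md" -type f -print \) | sort)

echo "$pruned"
rm -rf "$demo"
```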

Add extensions to files that do not have any

We look for files that do not have the name pattern *.* and rename them:

find <TARGET_DIR> -type f ! -name "*.*" -exec mv "{}" "{}.txt" \;
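A quick trial run on two throwaway files shows the effect. Note that it relies on find replacing "{}" inside a longer argument ("{}.txt"), which GNU find does:

```shell
demo=$(mktemp -d)
: > "$demo/notes"
: > "$demo/report.pdf"

# Only "notes" has no dot in its name, so only it gets renamed
find "$demo" -type f ! -name "*.*" -exec mv "{}" "{}.txt" \;
renamed=$(ls "$demo")

echo "$renamed"
rm -rf "$demo"
```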

Find big files

There are utilities and sorting options on commands like ls or du to help you hunt for heavy storage consumers but find is one of the best options to quickly find the biggest files recursively inside some starting directory.

Example to find files bigger than 50 megabytes starting from current directory:

find -type f -size +50M -exec ls -lh "{}" \;

There's a -ls action that gives results similar to ls -l, except there's no way to make it show easy to read file sizes (in MB, etc.) so I'm using -exec. I'm not using -execdir here because ls would then only print ./filename without the directory it was found in.

Back up a directory but skip all the images

The use case could also be about copying a directory except for some specific files that may be in it, or backing up some important directory while leaving out big files that you don't need (.iso images, for instance).

find <TARGET_DIR> -type f -not -iname "*.jpg" -exec tar -rf archive.tar "{}" +

Where we append everything that isn't a jpg file to an archive called "archive.tar" (will add files to it if it already exists, even if files with the same name already exist in the archive).

NB: The specific command above won't archive empty directories or directories that only had .jpg files in them.

You could easily invert the command to only archive files of a certain type.

Find files and directories with 777 permissions

Helpful to make sure there aren't any "everything allowed" permissions inside of a directory.

Can be adapted to find "at least X permissions" since we're using a minus sign up front.

find <TARGET_DIR> -perm -777 -ls

We added the -ls action in this example to get a better listing that may help identify directories from files but you can remove it to get the default file listing.

Check if a specific file exists somewhere

We're just interested in knowing whether there is a file called "needle" somewhere in the target directory tree:

find <TARGET_DIR> -name needle -print -quit
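In a script, keep in mind that find exits with 0 even when nothing matches, so the exit code alone can't answer the question. A minimal sketch (with an invented directory tree) is to test whether the command printed anything:

```shell
demo=$(mktemp -d)
mkdir -p "$demo/a/b"
: > "$demo/a/b/needle"

# Non-empty output means at least one match was found
if [ -n "$(find "$demo" -name needle -print -quit)" ]; then
  needle_status="found"
else
  needle_status="missing"
fi

if [ -n "$(find "$demo" -name haystack -print -quit)" ]; then
  haystack_status="found"
else
  haystack_status="missing"
fi

echo "needle: $needle_status, haystack: $haystack_status"
rm -rf "$demo"
```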

Conclusion

We hope to have helped shed a little light on how find actually works and given a glimpse of the powerful combinations that are possible to both find filesystem items and perform actions on them.

We haven't touched on scripting, xargs and possible filename issues (Linux allows filenames with line feeds in them — Which could create issues with many scripts and find commands) but these are discussed in the manpage if need be.

In the end, we feel find makes much more sense when you finally understand why it comes with all of its strange syntax and shell escaping with all of the possible "\;" and escaped parentheses.

See you next time.
