Command Injection

Forcing commands to run

Paul Krzyzanowski

March 24, 2025

We looked at buffer overflow and printf format string attacks that enable the modification of memory contents to change the flow of control in the program and, in the case of buffer overflows, inject executable binary code (machine instructions). Other injection attacks enable you to modify inputs used by command processors, such as interpreted languages or databases. We will now look at some of these attacks.

Numeric overflow

Before delving into command injection, let us briefly examine the problem of integer overflow. This doesn’t relate to command injection but can lead to buffer overflow as well as other problems.

In most languages and all computer architectures, numbers occupy a fixed number of bytes. This limits their range of values.

An 8-bit integer can hold values from 0 to 255 or, if we’re using signed integers, from -128 to +127.
A 16-bit integer can hold values from 0 to 65,535 or -32768 to 32767.
A 32-bit integer can hold values from 0 to a bit over 4 billion or a signed integer from a little under -2 billion to a little over 2 billion.
64-bit integers, of course, hold much larger values: from 0 to about 18 quintillion for unsigned integers, or from about -9 quintillion to +9 quintillion for signed integers (a quintillion is 10¹⁸).

Some languages offer arbitrary precision libraries, but there’s a performance penalty for using them and they are not used for general-purpose arithmetic. Python is an exception: Python 2 had a separate arbitrary-precision long type, and Python 3 made its single int type arbitrary precision.

Sometimes, even if an integer doesn’t overflow, other problems can occur if an attacker can control its value to something the programmer didn’t anticipate. For example, you might try to allocate a buffer that’s terabytes in size.

Integer overflow

What happens if you have a 16-bit unsigned integer and add 1 to 65535? Most languages will not detect an error and simply perform a modulo operation.

For unsigned numbers, 65535+1 = 0

We have a possibly more unfortunate situation with signed numbers. What happens if we take a 16-bit integer and add 1 to 32767?

32767+1 = -32768

What should have become a bigger positive number has now become a big negative number.

And underflow

We can go in the opposite direction. If we take the most negative integer for our bit length and subtract 1, we get a large positive number.

-32768 – 1 = +32767

We used shorts, which are 16 bits long, as an example, but the same thing happens with any data size. A standard int in C is 32 bits, even on a 64-bit system. Adding 1 to the maximum value, which is a bit over 2 billion, gives us a number that’s a bit smaller than negative 2 billion.
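The wraparound described above can be reproduced with a short C sketch. The function names are ours, chosen for illustration. Unsigned arithmetic is defined by the C standard to wrap modulo 2^N. Overflowing a signed int is undefined behavior in C, so the signed version does its arithmetic on unsigned values and converts the result back; that final conversion gives the two's complement result on common compilers such as gcc and clang, but it is implementation-defined in older C standards.

```c
#include <stdint.h>

/* Unsigned arithmetic wraps modulo 2^16 once the sum is stored in 16 bits. */
uint16_t add_u16(uint16_t a, uint16_t b)
{
	return (uint16_t)(a + b);
}

/*
 * Emulate 16-bit two's complement wraparound for signed values.
 * The sum is computed on unsigned values to avoid undefined behavior;
 * the conversion back to int16_t is assumed to reinterpret the bits.
 */
int16_t wrapping_add_i16(int16_t a, int16_t b)
{
	return (int16_t)(uint16_t)((uint16_t)a + (uint16_t)b);
}
```

With these helpers, add_u16(65535, 1) yields 0, wrapping_add_i16(32767, 1) yields -32768, and wrapping_add_i16(-32768, -1) yields 32767, matching the three cases above.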

Overflows can also occur due to casting from an unsigned to a signed type.

	unsigned short n = 65535;
	short i = n;

Converting an unsigned 65,535 to a signed integer gives us a value of -1, not 65,535.

A most significant bit of 1 indicates a negative number in two’s complement arithmetic.
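The conversion also bites in the other direction. Here is a minimal sketch, with a hypothetical MAX_LEN limit and function names of our own, of a length check that a negative value slips past. When that length is later handed to a function that takes a size_t, such as memcpy or malloc, the negative number becomes an enormous unsigned one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LEN 1024	/* hypothetical buffer limit */

/* Flawed check: only the upper bound is tested, so -1 is accepted. */
bool length_ok_flawed(int len)
{
	return len < MAX_LEN;
}

/* What a function taking size_t would receive for that length. */
size_t as_size(int len)
{
	return (size_t)len;
}

/* Safer check: reject negative values before any unsigned conversion. */
bool length_ok(int len)
{
	return len >= 0 && len < MAX_LEN;
}
```

A length of -1 passes the flawed check, and as_size(-1) is SIZE_MAX, the largest value a size_t can hold.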

What are the problems?

The big problem with underflows and overflows is that you are not likely to detect an overflow or underflow. The program does not die and the processor does not generate an exception.

If you’re computing the length of a buffer and have an integer overflow, this can lead to a buffer overflow since the right amount of space may not have been allocated to the buffer. If you’re computing money, you might end up with bad math: a negative account can become positive, for example, or vice versa.

Here’s an example of an integer overflow that led to a buffer overflow in version 3.3 of OpenSSH:

nresp = packet_get_int();
if (nresp > 0) {
  response = xmalloc(nresp*sizeof(char*));
  for (i = 0; i < nresp; i++)
    response[i] = packet_get_string(NULL);
}

This was on a 32-bit system where the size of a pointer, and hence sizeof(char*), is 4 bytes.

If packet_get_int() returned a value of 1,073,741,824, the product 1073741824*4 is 4294967296, which cannot be stored in 4 bytes. The result wraps to 0, so the program allocates a zero-length buffer and the loop then writes pointers past its end.

In binary, 4294967296 = 1 0000 0000 0000 0000 0000 0000 0000 0000

That’s a 1 followed by 32 zeros … but we can only store 32 bits, so the most significant bit has nowhere to go.
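The arithmetic, and one common way to guard against it, can be sketched as follows. This is an illustration under our own names, not the fix that OpenSSH applied: size_32bit reproduces the product as a 32-bit system computes it, and alloc_array refuses any request whose byte count would not fit in a size_t.

```c
#include <stdint.h>
#include <stdlib.h>

/* The size computation on a 32-bit system: the product wraps modulo 2^32. */
uint32_t size_32bit(uint32_t count, uint32_t elem_size)
{
	return (uint32_t)(count * elem_size);
}

/*
 * Overflow-checked array allocation.
 * Returns NULL if count * elem_size would exceed SIZE_MAX.
 */
void *alloc_array(size_t count, size_t elem_size)
{
	if (elem_size != 0 && count > SIZE_MAX / elem_size)
		return NULL;
	return malloc(count * elem_size);
}
```

The check divides rather than multiplies, so the test itself cannot overflow.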

But we have 64-bit architectures

You’d think this wouldn’t be a problem with 64-bit architectures. 9 quintillion (9,223,372,036,854,775,808) is a huge number. However, remember the problems we can get into by making assumptions. If a user can set a field to some value, like in a network packet, overflows can still occur. Moreover, the default size of an int in C on Linux and macOS is still 32 bits.

Overflows are especially a problem when code specifically deals with smaller data types. Various Internet Protocol fields, for example, regularly use 8- and 16-bit fields.

The Global Positioning System (GPS) stores the week number in 10 bits, which rolls over every 19.7 years. Week 0 started on January 6, 1980. It rolled over on August 21, 1999 and again on April 6, 2019. Most software was updated for this rollover, but we can imagine a device that has not been updated to know of a new reference date for the week count and will compute a date that’s 19.7 years in the past.
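A minimal sketch of the rollover, and of one way a receiver might compensate, is shown below. The function names and the idea of a stored reference week (for example, the week the firmware was built) are assumptions made for illustration; real receivers may use other strategies.

```c
#include <stdint.h>

#define GPS_WEEK_MODULUS 1024u	/* the week field is 10 bits wide */

/* What is broadcast: the true week count truncated to 10 bits. */
uint16_t gps_broadcast_week(uint32_t full_week)
{
	return (uint16_t)(full_week % GPS_WEEK_MODULUS);
}

/*
 * Recover a full week count from the 10-bit field (week10 < 1024),
 * given a reference week known to be in the past. Picks the first
 * week at or after ref_week whose low 10 bits match week10.
 */
uint32_t gps_resolve_week(uint16_t week10, uint32_t ref_week)
{
	uint32_t full = (ref_week - ref_week % GPS_WEEK_MODULUS) + week10;

	if (full < ref_week)
		full += GPS_WEEK_MODULUS;
	return full;
}
```

Weeks 0, 1024, and 2048 all broadcast the same value, 0. A device whose reference week is stale resolves a new broadcast into the wrong 1024-week era, which is the 19.7-year error described above.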

Finally, these smaller integer sizes are still present in lots of legacy data structures and in code written by programmers who were concerned about wasting storage.

Python, Java, Go, Rust

Integer overflow is not an issue for Python’s built-in integers. Python 3 implements integers (int type) with arbitrary precision, meaning that they can grow to accommodate any number as long as your machine’s memory can handle it. (Python 2 had a fixed-size int type but automatically promoted results that overflowed to its arbitrary-precision long type.) This design choice eliminates the traditional issues associated with integer overflow in fixed-size integer types found in other programming languages.

While this feature of Python makes it very robust for mathematical computations that involve large numbers, it also means that Python’s arbitrary-precision integers may consume more memory than the fixed-size integers of other languages, which can be a consideration for performance-sensitive applications.

In Java, integer overflow can be an issue, similar to other programming languages that use fixed-size integer representations. Java provides several primitive integer types (byte, short, int, long) with fixed sizes: 8, 16, 32, and 64 bits respectively. When an operation causes the value to exceed the range of these types, overflow occurs, and the number wraps around to the minimum value of the type and continues from there, potentially leading to unexpected or incorrect results if not properly handled.

For example, for a 32-bit int, the maximum value is 2,147,483,647 (Integer.MAX_VALUE). If you add 1 to this value, it will overflow and wrap around to -2,147,483,648 (Integer.MIN_VALUE), which is likely not the intended result.

Integer overflow is also an issue in Go since it also uses fixed-size integer types. Go provides several integer types (int8, int16, int32, int64 and their unsigned counterparts uint8, uint16, uint32, uint64, along with architecture-dependent types like int, uint, and uintptr) that have fixed sizes. When the value assigned to such a type exceeds its capacity, it wraps around to the beginning of its range, which can lead to unexpected behavior if not properly managed.

In Rust, integer overflow behavior differs based on the build profile: in debug mode, Rust checks for integer overflow and causes your program to panic (terminate execution with an error) if overflow occurs. In release mode, Rust does not check for overflow, and if overflow occurs, it wraps around to the minimum or maximum value of the type.
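C offers no such check by default, but the programmer can test for overflow before performing the operation. The sketch below (the function name is ours) is similar in spirit to Java’s Math.addExact and Rust’s checked_add: it reports failure instead of wrapping. The comparisons are arranged so that the test itself never overflows.

```c
#include <limits.h>
#include <stdbool.h>

/* Stores a + b in *result and returns true only if the sum fits in an int. */
bool checked_add(int a, int b, int *result)
{
	if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
		return false;	/* the sum would overflow or underflow */
	*result = a + b;
	return true;
}
```

A caller must check the return value; ignoring it brings back the silent-failure problem.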

SMB Ghost: 2020

An integer overflow vulnerability led to a major exploit in 2020. It allowed an attacker access to a Windows system by connecting to it over the SMB protocol. SMB is the Server Message Block protocol, Microsoft’s remote file access protocol.

March 2020 was a particularly bad time for disclosing patches. Microsoft announced that they fixed 116 vulnerabilities that month, 25 of which were critical and could be used by an attacker to execute remote code and perform local privilege elevation.

This particular bug affected the data compression mechanism within the SMB message structure in Windows 10 implementations of SMB. Attackers could create a packet that would trigger an integer overflow or underflow that would allow them to write arbitrary data anywhere in the kernel.

The detailed steps of an attack are long, so we will just go over the basic weakness that was uncovered. Since attackers can create the message, they can control data within it. Two particular fields end up being useful.

Original_Compressed_Segment_Size tells the system the size of decompressed data.

Offset defines the size of an optional extra chunk of data that is not compressed.

The system allocates a buffer that is the size of the original size plus the offset. A simple attack that caused the program to crash just set the offset to 0xffffffff, which triggered an integer overflow.

In a more sophisticated attack, attackers used a huge value for the Original_Compressed_Segment_Size and a legitimate value for the offset. That also triggers an overflow, causing the system to allocate less memory than needed.

memcpy

Later in the code, a memcpy takes place. The attackers realized that all three parts were under their control:

The target of the copy, Alloc->UserBuffer, comes from the allocation header, but the allocation header can be overwritten when the user buffer overflows.
The source is the header data, which comes from the attacker.
The length is the offset and is also controlled by the attacker.

Since an attacker can set the destination and the contents, they could write any data anywhere in kernel memory and were able to then use other attacks for local privilege escalation by connecting to a local machine. Other attackers were able to trigger remote code execution.

Microsoft Exchange Year 2022 Bug

Here’s another, stranger, example of overflow. It’s the Microsoft Exchange Year 2022 bug.

Starting January 1, 2022, on-premises Microsoft Exchange servers were not able to deliver email because of a bug in their anti-spam engine.

The bug occurred because Microsoft was using a signed 32-bit integer to store the value of a date. This gave it a maximum value of a little over 2 billion (2,147,483,647). Unlike systems like Linux that count seconds from an epoch (Jan 1, 1970 0:00 UTC), they represented the date as a year-month-date encoded as a decimal value. Dates in 2022 have a value (2,201,010,001 or larger) that’s larger than the maximum value that fits in 32 bits, causing it to overflow to a negative value and the scanning engine to fail.
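The arithmetic is easy to check. The sketch below assumes a YYMMDDHHMM layout with a two-digit year, which is consistent with the value 2,201,010,001 quoted above; the function names are ours and this is not Microsoft’s code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Encode a timestamp as the decimal number YYMMDDHHMM (two-digit year). */
int64_t encode_yymmddhhmm(int yy, int mm, int dd, int hh, int min)
{
	return ((((int64_t)yy * 100 + mm) * 100 + dd) * 100 + hh) * 100 + min;
}

/* True if the value can be stored in a signed 32-bit integer. */
bool fits_in_int32(int64_t v)
{
	return v >= INT32_MIN && v <= INT32_MAX;
}
```

The last minute of 2021 encodes as 2,112,312,359, which fits. One minute past midnight on January 1, 2022 encodes as 2,201,010,001, which does not.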

Type confusion

Vulnerabilities can arise if an object is created as one type but later used as a different type. Accessing an unsigned integer as a signed integer is a simple example, but the confusion can also involve assumptions about the sizes of arrays, the members of unions, or the data types of pointers. The bug is most common in C and C++ but can also be found in languages like PHP and Perl.

The bug may not appear to be exploitable, but sometimes is. For example, on May 24, 2024, Google rolled out a fix to address the fourth zero-day exploit for May of 2024 (and eighth of the year) in its Chrome browser. This fixed a type confusion vulnerability that was exploited in the wild and allowed a remote attacker to execute arbitrary code via a specially-crafted HTML page.

See CWE-843: Access of Resource Using Incompatible Type (‘Type Confusion’).
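As a small illustration of how C code can guard against this, here is a sketch of a tagged union with an accessor that checks the tag before touching the data. The type and function names are hypothetical. Without the check, an integer supplied by an attacker could be treated as a pointer.

```c
#include <stdbool.h>
#include <stddef.h>

enum value_type { VAL_INT, VAL_STR };

struct value {
	enum value_type type;	/* records which union member is valid */
	union {
		long n;
		const char *s;
	} u;
};

/* Returns true and stores the string only if v really holds a string. */
bool value_get_str(const struct value *v, const char **out)
{
	if (v == NULL || v->type != VAL_STR)
		return false;	/* using u.n as a pointer would be type confusion */
	*out = v->u.s;
	return true;
}
```

The protection is only as good as the tag: if an attacker can corrupt the type field, the check no longer helps.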

SQL Injection (SQLi)

It is common practice to take user input and make it part of a database query. This is particularly popular with web services, which are often front ends for databases. For example, we might ask the user for a login name and password and then create a string that contains an SQL query (SQL is the Structured Query Language, the dominant way of interacting with relational databases):

sprintf(buf,
	"SELECT * from logininfo WHERE username = '%s' AND password = '%s';",
	uname, passwd);

Suppose that the user entered this for a password:

' OR 1=1 ; --

We end up creating this query string¹:

SELECT * from logininfo WHERE username = 'paul' AND password = '' OR 1=1 ; -- ';

The “--” after “1=1” is an SQL comment, telling it to ignore everything else on the line. In SQL, AND operations have precedence over OR, so the query is evaluated as (username = 'paul' AND password = '') OR 1=1. It checks for the user with an empty password (which the user probably does not have) or the condition 1=1, which is always true. In essence, the user’s “password” turned the query into one that ignores the user’s password and unconditionally validates the user.

Statements such as this can be even more destructive as the user can use semicolons to add multiple statements and perform operations such as dropping (deleting) tables or changing values in the database.

This attack can take place because the programmer blindly allowed user input to become part of the SQL command without validating that the user data does not change the quoting or tokenization of the query.
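To see the problem concretely, we can wrap the sprintf call from above in a function (named by us, and using snprintf to avoid a buffer overflow in the sketch itself) and feed it the malicious password. It is vulnerable by design: it is here only to show how the input changes the structure of the query.

```c
#include <stdio.h>

/* Vulnerable by design: user input is pasted directly into the SQL text. */
int build_login_query(char *buf, size_t size,
    const char *uname, const char *passwd)
{
	return snprintf(buf, size,
	    "SELECT * from logininfo WHERE username = '%s' AND password = '%s';",
	    uname, passwd);
}
```

Calling it with the user name paul and the password shown earlier produces exactly the query string above: the quote in the input closes the string literal early and the rest of the input is parsed as SQL.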

A programmer can avoid the problem by sanitizing the input. Input sanitization means validating the input to ensure that there is nothing dangerous in it before it is used. This may involve:

Disallowing certain characters or strings in the input. For example, reject any strings that contain quotes.
Allowing only certain characters or strings. For instance, we may accept only alphanumeric characters and a limited set of symbols from the user.
Escaping any characters that have special meaning. SQL, Linux shells, and many other programs often support the use of a backslash (\) that tells the interpreter not to treat the next character as a special character. Alternatively, spaces or special characters may be quoted.

Unfortunately, this can be difficult. SQL contains too many words and symbols that may be legitimate in other contexts (such as passwords). Escaping special characters, such as by prepending backslashes or by escaping single quotes with two quotes, can be error-prone because the escapes may differ among database vendors.
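Of these approaches, allowing only known-good characters is usually the easiest to get right for fields with a simple structure, such as user names. A minimal sketch follows; the permitted character set and the 32-character limit are assumptions chosen for illustration, not a standard.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define MAX_NAME_LEN 32	/* hypothetical limit */

/*
 * Accept only 1 to 32 characters drawn from letters, digits, '_', '-',
 * and '.', with a letter or digit first so the name cannot be mistaken
 * for a command-line option.
 */
bool is_valid_username(const char *s)
{
	size_t len = strlen(s);

	if (len == 0 || len > MAX_NAME_LEN)
		return false;
	if (!isalnum((unsigned char)s[0]))
		return false;
	for (size_t i = 1; i < len; i++) {
		unsigned char c = (unsigned char)s[i];

		if (!isalnum(c) && c != '_' && c != '-' && c != '.')
			return false;
	}
	return true;
}
```

This works for user names but not for passwords, where quotes and semicolons are legitimate input.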

The safest defense in SQL is to use parameterized queries, where user input never becomes part of the query but is listed as parameters. For example, we can write the previous query as:

uname = getResourceString("username");
passwd = getResourceString("password");
query = "SELECT * FROM logininfo WHERE username = @0 AND password = @1";
db.Execute(query, uname, passwd);

A related safe alternative is to use stored procedures. They have the same property that the query statement is not generated from user input and parameters are clearly identified.

While SQL injection is the most common code injection attack, databases are not the only target. Creating executable statements built with user input is common in interpreted languages, such as Shell, Perl, PHP, and Python. Before making user input part of any invocable command, the programmer must be fully aware of parsing rules for that command interpreter.

An example of a recent vulnerability was announced on June 27, 2024 in Fortra FileCatalyst Workflow, a file transfer application. In this attack, a user-supplied jobID is used in creating the WHERE clause of an SQL query. An anonymous remote attacker can send URLs with a JobID parameter of their choice.

Shell attacks

The various POSIX² shells (sh, csh, ksh, bash, tcsh, zsh) are commonly used as scripting tools for software installation, start-up scripts, and tying together workflow that involves processing data through multiple commands. A few aspects of how many of the shells work and the underlying program execution environment can create attack vectors.

system() and popen() functions

Both system and popen functions are part of the Standard C Library and are common functions that C programmers use to execute shell commands. The system function runs a shell command while the popen function also runs the shell command but allows the programmer to capture its output and/or send it input via the returned FILE pointer.

Here, we again have the danger of turning improperly validated data into a command. For example, a program might use a function such as this to send an email alert:

char command[BUFSIZE];
snprintf(command, BUFSIZE, "/usr/bin/mail -s \"system alert\" %s", user);
FILE *fp = popen(command, "w");

In this example, the programmer uses snprintf to create the complete command with the desired user name into a buffer. This incurs the possibility of an injection attack if the user name is not carefully validated. If the attacker had the option to set the user name, she could enter a string such as:

nobody; rm -fr /home/*

which will result in popen running the following command:

sh -c "/usr/bin/mail -s \"system alert\" nobody; rm -fr /home/*"

which is a sequence of commands, the latter of which deletes all user directories.
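One way to avoid this class of problem is to not involve the shell at all. The sketch below, with a function name of our own, uses fork and execv to run a program with an explicit argument vector. Each argument reaches the program as a single string, so a semicolon in the user name is just a character and not a command separator. It does not replicate popen’s pipe to the command’s input, and the user name should still be validated, since the target program may interpret it in its own way (as an option, for instance).

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run the program at path with the given argument vector, bypassing
 * the shell. Returns the child's exit status, or -1 on failure.
 */
int run_without_shell(const char *path, char *const argv[])
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		execv(path, argv);
		_exit(127);	/* reached only if execv failed */
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For the mail example, the argument vector would be { "mail", "-s", "system alert", user, NULL } with /usr/bin/mail as the path.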

Command injection example

A particularly insidious attack is one that targets security tools: the very software that is there to try to detect and prevent attacks. An example of a command injection exploit on such a service is the 2023 maximum-severity vulnerability in Fortinet’s security information and event management (SIEM) solution, which was patched in February of 2024 (see the bleepingcomputer article).

In this attack, a Python program formats a call to os.system that contains a user-controlled mount_point value. An attacker can define the mount point to contain a semicolon, which serves as a command separator, followed by whatever command they want executed. The Fortinet client will execute the command with root privileges. Remarkably, this is a similar attack to a related command injection vulnerability that was discovered six months earlier.

Other environment variables

The shell PATH environment variable controls how the shell searches for commands. For instance, suppose

PATH=/home/paul/bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/games

and the user runs the ls command. The shell will search through the PATH sequentially to find an executable file named ls:

/home/paul/bin/ls
/usr/local/bin/ls
/usr/sbin/ls
/usr/bin/ls
/sbin/ls
/bin/ls
/usr/local/games/ls

If an attacker can either change a user’s PATH environment variable or if one of the paths is publicly writable and appears before the “safe” system directories, then he can add a booby-trapped command in one of those directories. For example, if the user runs the ls command, the shell may pick up a booby-trapped version in the /usr/local/bin directory. Even if a user has trusted locations, such as /bin and /usr/bin foremost in the PATH, an intruder may place a misspelled version of a common command into another directory in the path. The safest remedy is to make sure there are no untrusted directories in PATH.
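A program can perform a basic sanity check on PATH before it runs other commands. The sketch below, under a name of our choosing, flags any entry that is empty or relative. An empty entry (two adjacent colons, or a leading or trailing colon) is treated as the current directory, so it is as risky as a literal dot. This catches only one kind of untrusted entry; checking whether each directory is writable by other users would require additional tests with stat.

```c
#include <stdbool.h>
#include <string.h>

/* Returns true if a PATH value contains an empty or relative entry. */
bool path_has_relative_entry(const char *path)
{
	const char *p = path;

	for (;;) {
		size_t len = strcspn(p, ":");	/* length of this entry */

		if (len == 0 || p[0] != '/')
			return true;
		if (p[len] == '\0')
			return false;	/* last entry checked */
		p += len + 1;		/* skip the entry and the colon */
	}
}
```

Programs that run with elevated privileges often go further and simply replace PATH with a fixed value such as /usr/bin:/bin before executing anything.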

Some shells allow a user to set an ENV or BASH_ENV variable that contains the name of a file that will be executed as a script whenever a non-interactive shell is started (when a shell script is run, for example). If an attacker can change this variable then arbitrary commands may be added to the start of every shell script.

Preloading shared libraries

In the distant past, programs used to be fully linked, meaning that all the code needed to run the program, aside from interactions with the operating system, was part of the executable program. Since so many programs use common libraries, such as the Standard C Library, they are not compiled into the code of an executable but instead are dynamically loaded when needed.

Similar to PATH, LD_LIBRARY_PATH is an environment variable used by the operating system’s program loader that contains a colon-separated list of directories where libraries should be searched. If an attacker can change a user’s LD_LIBRARY_PATH, common library functions can be overridden with custom versions. The LD_PRELOAD environment variable allows one to explicitly specify shared libraries that contain functions that override standard library functions.

LD_LIBRARY_PATH and LD_PRELOAD will not give an attacker root access but they can be used to change the behavior of a program or to log library interactions. For example, by overriding standard functions, one may change how a program generates encryption keys, uses random numbers, sets delays in games, reads input, and writes output.

As an example, let’s suppose we have a program that prints random numbers:

#include <time.h>
#include <stdio.h>
#include <stdlib.h>
int
main(int argc, char **argv)
{
	int i;

	srand(time(NULL));
	for (i=0; i < 10; i++)
		printf("%d\n", rand()%100);
	return 0;
}

We can compile this via:

$ gcc -o random random.c

When run, we may get output containing 10 random numbers:

$ ./random
9
57
13
1
83
86
45
63
51
5

Let us write a replacement rand function that always returns the same value. We’ll put it in a file called rand.c:

int rand() {
	return 42;
}

We compile it into a shared library named newrandom.so:

$ gcc -shared -fPIC rand.c -o newrandom.so

Now we set the LD_PRELOAD environment variable to this library and run the program:

$ export LD_PRELOAD=$PWD/newrandom.so
$ ./random
42
42
42
42
42
42
42
42
42
42

Note that our program now behaves differently, and we did not have to recompile it or run it differently.
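A program that launches other programs can protect them from this kind of interference by removing the loader variables from its environment before calling exec. A minimal sketch, with a function name of our own, is shown below. It covers only the two variables discussed here; the dynamic loader recognizes others as well.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>

/*
 * Remove dynamic-loader overrides from the environment so that child
 * programs started afterward do not inherit them.
 * Returns 0 on success, -1 on failure.
 */
int scrub_loader_env(void)
{
	static const char *const vars[] = { "LD_PRELOAD", "LD_LIBRARY_PATH" };

	for (size_t i = 0; i < sizeof vars / sizeof vars[0]; i++)
		if (unsetenv(vars[i]) != 0)
			return -1;
	return 0;
}
```

This does not help the calling program itself, since its own libraries were already loaded by the time any of its code runs.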

Windows DLL Sideloading

Shared libraries in Windows are implemented via a DLL mechanism: dynamic-link libraries.

In Linux the LD_PRELOAD mechanism lets you “inject” a shared library into a process at load time by simply setting an environment variable. In Windows there isn’t an exact equivalent built into the loader, but attackers have found a similar vector through DLL sideloading (often also called DLL hijacking).

While Linux’s LD_PRELOAD is a consciously designed feature for overriding library functions, Windows’ approach to DLL loading is inherently implicit. This implicit mechanism offers flexibility—for example, enabling legacy applications to load older libraries by simply copying the necessary DLL into the application’s directory. However, the same flexibility also creates an opportunity for exploitation through DLL sideloading. Understanding this dual-use nature is key for both leveraging compatibility fixes and implementing effective security measures.

How it works

When a Windows application loads a DLL without a fully qualified path, the operating system follows a well-defined search order. It begins in the directory containing the executable, then checks system directories like C:\Windows\System32, followed by other locations. This predictable order means that if a DLL with the expected name exists in the executable’s local directory, it will be loaded in preference to one found later in the search path.

Legitimate Use: Enforcing Legacy Library Versions

This same DLL search order can be exploited intentionally to solve compatibility issues. Consider a legacy application that was developed to work with an outdated version of a library. If the updated version of the DLL resides in a shared system location (for example, C:\Windows\System32) but the application requires an older version, a simple fix is to copy the outdated DLL into the same folder as the executable.

In doing so, the application loads the local copy—thus enforcing compatibility—while all other applications continue to use the updated, system-wide version. This method, sometimes referred to as DLL redirection, offers a straightforward solution for managing library (in)compatibilities without modifying system-wide files.

Dual-Use Nature: Compatibility and Exploitation

The same mechanism that enables legitimate legacy support also opens the door to potential security vulnerabilities:

Legacy Compatibility: By deliberately placing a legacy version of a DLL alongside an application, developers can ensure that the application runs with the expected functionality even when system-wide libraries have been updated. This controlled use of the search order allows legacy software to continue operating without complex rewrites.
Malicious Exploitation (DLL Sideloading/Hijacking): Conversely, an attacker can take advantage of the DLL search order by placing a malicious DLL, with the same name as a legitimate dependency, in a directory that is searched before the legitimate library is found. If an application does not use fully qualified paths to load its DLLs, it inadvertently loads the malicious DLL, allowing the attacker’s code to run with the privileges of that application.

Mitigation

To defend against DLL sideloading attacks, Windows developers and administrators can adopt several measures:

Use Fully Qualified Paths: Whenever possible, specify the full path when loading DLLs to bypass the default search order.
Safe DLL Search Mode: Modern Windows versions support “safe DLL search mode” (the SafeDllSearchMode setting), which reduces the risk by moving the current directory later in the default search order.
Digital Signing: Requiring DLLs to be signed can help ensure that only trusted libraries are loaded.
Manifest Files: By using application manifests to specify the exact versions and locations of DLL dependencies, developers can further limit the risk of an attacker’s DLL being loaded.

For more info on DLL and sideloading, see:

Welcome to the world of Dynamic Link Libraries, Bitdefender TechZone.

Input sanitization

The important lesson in writing code that uses any user input in forming commands is that of input sanitization. Input must be carefully validated to make sure it conforms to the requirements of the application that uses it and does not try to execute additional commands, escape to a shell, set malicious environment variables, or specify out-of-bounds directories or devices.

File descriptors

POSIX systems have a convention that programs expect to receive three open file descriptors when they start up:

file descriptor 0: standard input
file descriptor 1: standard output
file descriptor 2: standard error

Functions such as printf, scanf, puts, getc and others expect these file descriptors to be available for input and output. When a program opens a new file, the operating system searches through the file descriptor table and allocates the first available unused file descriptor. Typically this will be file descriptor 3. However, if any of the three standard file descriptors are closed, the operating system will use one of those as an available, unused file descriptor.

The vulnerability lies in the fact that we may have a program running with elevated privileges (e.g., setuid root) that modifies a file that is not accessible to regular users. If that program also happens to write to the user via, say, printf, there is an opportunity to corrupt that file. The attacker simply needs to close the standard output (file descriptor 1) and run the program. When the program opens its secret file, it will be given file descriptor 1 and will be able to do its read and write operations on the file. However, whenever the program prints a message to the user, the output will not be seen by the user because it is directed to what printf assumes is the standard output: file descriptor 1. The printf output will be written onto the secret file, thereby corrupting it.

The shell command (bash, sh, or ksh) for closing the standard output file is an obscure-looking >&-. For example:

./testfile >&-
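A common defense is for a privileged program to make sure that descriptors 0, 1, and 2 are all open before it opens anything else. The sketch below (the function name is ours) relies on the rule described above: open always returns the lowest unused descriptor. It keeps opening /dev/null until the descriptor it gets back is greater than 2.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/*
 * Make sure descriptors 0, 1, and 2 are open, pointing any that were
 * closed at /dev/null. Returns 0 on success, -1 on failure.
 */
int ensure_std_fds(void)
{
	for (;;) {
		int fd = open("/dev/null", O_RDWR);

		if (fd < 0)
			return -1;
		if (fd > 2) {
			close(fd);	/* 0, 1, and 2 were already in use */
			return 0;
		}
		/* fd is 0, 1, or 2: it had been closed, so leave it open */
	}
}
```

The call belongs at the very start of main, before any file is opened and before any output is produced.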

Comprehension Errors

The overwhelming majority of security problems are caused by bugs or misconfigurations. Both often stem from comprehension errors. These are mistakes created when someone – usually the programmer or administrator – does not understand the details and every nuance of what they are doing. Some examples include:

Not knowing all possible special characters that need escaping in SQL commands.
Not realizing that the standard input, output, or error file descriptors may be closed.
Not understanding how access control lists work or how to configure mandatory access control mechanisms such as type enforcement correctly.

If we consider the Windows CreateProcess function, we see it is defined as:

BOOL WINAPI CreateProcess(
  _In_opt_    LPCTSTR               lpApplicationName,
  _Inout_opt_ LPTSTR                lpCommandLine,
  _In_opt_    LPSECURITY_ATTRIBUTES lpProcessAttributes,
  _In_opt_    LPSECURITY_ATTRIBUTES lpThreadAttributes,
  _In_        BOOL                  bInheritHandles,
  _In_        DWORD                 dwCreationFlags,
  _In_opt_    LPVOID                lpEnvironment,
  _In_opt_    LPCTSTR               lpCurrentDirectory,
  _In_        LPSTARTUPINFO         lpStartupInfo,
  _Out_       LPPROCESS_INFORMATION lpProcessInformation);

We have to wonder whether a programmer who does not use this frequently will take the time to understand the ramifications of correctly setting process and thread security attributes, the current directory, environment, handle inheritance, and so on. There’s a good chance that the programmer will just look up an example on places such as github.com or stackoverflow.com and copy something that seems to work, unaware that there may be obscure side effects that compromise security.

As we will see in the following sections, comprehension errors also apply to the proper understanding of things as basic as various ways to express characters.

Path traversal and path equivalence vulnerabilities

Some applications, notably web servers, accept hierarchical filenames from a user but need to ensure that they restrict access only to files within a specific point in the directory tree. For example, a web server may need to ensure that no page requests go outside of /home/httpd/html.

An attacker may try to gain access by using paths that include .. (dot-dot), which is a link to the parent directory. For example, an attacker may try to download a password file by requesting

http://poopybrain.com/../../../etc/passwd

The hope is that the programmer did not implement parsing correctly and might try simply suffixing the user-requested path to a base directory:

"/home/httpd/html/" + "../../../etc/passwd"

to form

/home/httpd/html/../../../etc/passwd

which will retrieve the password file, /etc/passwd.

Being able to navigate out of an intended directory by manipulating the pathname and thus access unauthorized files is a path traversal vulnerability.

A programmer may anticipate this and check for dot-dot but has to realize that dot-dot directories can be anywhere in the path. This is also a valid pathname but one that should be rejected for trying to escape to the parent:

http://poopybrain.com/419/notes/../../416/../../../../etc/passwd

Moreover, the programmer cannot just search for .. because that can be a valid part of a filename. All three of these should be accepted:

http://poopybrain.com/419/notes/some..other..stuff/
http://poopybrain.com/419/notes/whatever../
http://poopybrain.com/419/notes/..more.stuff/

Also, extra slashes are perfectly fine in a filename, so this is acceptable:

http://poopybrain.com/419////notes///////..more.stuff/

The programmer should also track where the request is in the hierarchy. If dot-dot doesn’t escape above the base directory, it should most likely be accepted:

http://poopybrain.com/419/notes/../exams/

These are not insurmountable problems but they illustrate that a quick-and-dirty attempt at filename processing may be riddled with bugs.
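The rules above can be captured by walking the path one component at a time and tracking the depth below the base directory. The following is a minimal sketch with a function name of our own. It is purely lexical: it does not decode URL escapes and knows nothing about symbolic links, both of which a real server must also handle.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Returns true if a '/'-separated request path never climbs above the
 * directory it starts in. Empty components (extra slashes) and "."
 * are ignored; ".." moves up one level; anything else moves down.
 */
bool path_stays_inside(const char *path)
{
	int depth = 0;
	const char *p = path;

	while (*p != '\0') {
		size_t len = strcspn(p, "/");	/* length of this component */

		if (len == 2 && p[0] == '.' && p[1] == '.') {
			if (--depth < 0)
				return false;	/* escaped above the base */
		} else if (len > 0 && !(len == 1 && p[0] == '.')) {
			depth++;
		}
		p += len;
		if (*p == '/')
			p++;
	}
	return true;
}
```

Because only a component that is exactly two dots counts as dot-dot, names such as some..other..stuff and ..more.stuff are accepted, while the request that wanders through 419/notes before climbing to /etc/passwd is rejected.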

Path traversal vulnerabilities have been used to obtain unauthorized content that resides outside the allowable directory. In some cases, attackers have been able to use them as a stepping stone to remote code execution. For example, a 2024 analysis of open source AI software discovered a path traversal vulnerability because it included a user-configurable user_name parameter as part of the path. In this case, not only could an attacker change the parameter to download sensitive information but could also upload files to the /etc/cron.d directory, which would later get executed by the system.

The October 2024 list of vulnerabilities from Protect AI includes several additional pathname-related vulnerabilities.

Path Equivalence vulnerabilities

A related pathname parsing issue is a path equivalence vulnerability, which happens when the system fails to recognize that two different-looking paths point to the same file or directory, allowing security checks to be bypassed.

Note the difference: a path traversal vulnerability escapes permissible directories (e.g., ../../etc/passwd) while a path equivalence vulnerability uses an alternate representation of a path to bypass checks. Both may involve path manipulation but traversal escapes bounds while equivalence tricks validation logic.

For example, if access to /admin/config.php is not allowed, the attacker may be able to gain access by using /admin/../admin/config.php.

Internal dot path equivalence is path equivalence that specifically refers to the use of dot (.) or double dot (..) segments within a path – not necessarily to escape directories, but to create an equivalent path that evades string-based security checks.

Systems sometimes block access to specific file paths based on pattern matching, such as denying access to anything containing /private/. But if the system checks the raw path string before resolving it, an attacker may insert internal . or .. segments to bypass the check while still reaching the same file.
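As a rough sketch of the difference between checking the raw string and checking the resolved path, consider the /admin/config.php example. The deny list and both function names here are hypothetical, and the normalization is lexical only.

```python
import posixpath

BLOCKED = {"/admin/config.php"}  # hypothetical deny list


def naive_is_blocked(path):
    # Compares the raw request string, so equivalent spellings slip through.
    return path in BLOCKED


def is_blocked(path):
    # Resolve "." and ".." segments and repeated slashes first, then compare.
    return posixpath.normpath(path) in BLOCKED


for p in ("/admin/config.php",
          "/admin/../admin/config.php",
          "/admin/./config.php"):
    print(p, naive_is_blocked(p), is_blocked(p))
```

The naive check blocks only the first spelling; the normalized check blocks all three because they resolve to the same path.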

A 2024 exploit demonstration took advantage of the fact that Microsoft Windows supports two different styles of pathnames: DOS-style and NT-style. The DOS-style dates back to the earliest versions of MS-DOS and Windows and looks like:

C:\directory\subdirectory\file.txt

and the equivalent NT path is:

\??\C:\directory\subdirectory\file.txt

DOS pathnames only support the ANSI character set, while NT paths support Unicode, allowing the use of a larger set of characters. When presented with a DOS path, applications will call the RtlpDosPathNameToRelativeNtPathName function to convert it to an NT pathname, which applies various rules to perform the conversion. One exploitable rule was that the conversion would remove trailing dots in any names as well as trailing spaces in the last element.

Simply by adding a dot to the end of a file name, an attacker can make a file inaccessible to all user-level programs that use the standard API, and directories cannot be listed. This provides a simple way for attackers to conceal a file, since they can place a malicious file with a dot at the end alongside a non-malicious file with the same name without a dot. It also allows an attacker to conceal a malicious process in the Task Manager.

2025 Tomcat internal dot vulnerability

A vulnerability discovered in 2025 in Apache Tomcat, a popular web server, was found to have been actively exploited for the past eight years.

The vulnerability affects Tomcat’s servlet when it is configured with write permissions. The original code generated temporary filenames by replacing path separators (/) with internal dots (.), leading to improper security checks and allowing attackers to access sensitive information, corrupt server data, and execute arbitrary code.

Unicode parsing and character representations

For an example of a less obvious problem in parsing pathnames in a web server, let’s look at a bug discovered in 2000 in Microsoft’s IIS (Internet Information Services, Microsoft’s web server). IIS had proper pathname checking to ensure that attempts to get to a parent are blocked:

http://www.poopybrain.com/scripts/../../winnt/system32/cmd.exe

Once the pathname was validated, it was passed to a decode function that decoded any embedded Unicode characters and then processed the request.

The problem with this technique was that non-international characters (traditional ASCII) could also be written as Unicode characters. A “/” could also be written in a URL as its hexadecimal value, %2f (decimal 47). It could also be represented as the two-byte Unicode sequence %c0%af.

The reason for this stems from the way Unicode was designed to support compatibility with one-byte ASCII characters. This encoding is called UTF-8. If the first bit of a character is a 0, then we have a one-byte ASCII character (in the range 0..127). However, if the first bit is a 1, we have a multi-byte character. The number of leading 1s determines the number of bytes that the character takes up. If a character starts with 110, we have a two-byte Unicode character.

With a two-byte character, the UTF-8 standard defines a bit pattern of

110a bcde   10fg hijk

The values a-k above represent 11 bits that give us a value in the range 0..2047. The “/” character, 0x2f, is 47 in decimal and 0010 1111 in binary. This value represents offset 47 into the character table (called the codepoint in Unicode parlance). Hence, we can represent the “/” as the single byte 0x2f or as the two-byte Unicode sequence:

1100 0000   1010 1111

which is the hexadecimal sequence %c0%af. Technically, this is disallowed. The standard states that codepoints less than 128 must be represented as one byte, but permissive Unicode parsers accept the two-byte sequence anyway. We can construct an equivalent three-byte sequence as well.
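The bit packing described above can be checked with a short Python sketch. Both helper functions are hypothetical and written only for illustration; the lenient decoder deliberately omits the overlong-form check to mimic a permissive parser.

```python
def two_byte_utf8(codepoint):
    """Pack an 11-bit value into the 110abcde 10fghijk pattern."""
    if not 0 <= codepoint <= 0x7FF:
        raise ValueError("the two-byte form holds only 11 bits")
    first = 0b11000000 | (codepoint >> 6)         # 110 + top 5 bits
    second = 0b10000000 | (codepoint & 0b111111)  # 10 + low 6 bits
    return bytes([first, second])


def lenient_decode_two_byte(data):
    """Decode two bytes WITHOUT rejecting overlong forms (the permissive flaw)."""
    first, second = data
    return chr(((first & 0b11111) << 6) | (second & 0b111111))


overlong_slash = two_byte_utf8(0x2F)
print(overlong_slash.hex())                     # c0af
print(lenient_decode_two_byte(overlong_slash))  # /

try:
    overlong_slash.decode("utf-8")  # Python's built-in decoder is strict
except UnicodeDecodeError:
    print("strict decoder rejects the overlong form")
```

A strict decoder such as Python's refuses the sequence, which is the behavior the standard requires.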

Microsoft’s bug was that the pathname check did not treat %c0%af as equivalent to a / because that sequence should never have been used to represent the character. However, the Unicode parser was happy to translate it, and attackers were able to use this to access any file on a server running IIS. This bug also gave attackers the ability to invoke cmd.exe, the command interpreter, and execute any commands on the server.

After Microsoft fixed the multi-byte Unicode bug, another problem came up. The parsing of escaped characters took place in two different parts of the code, so if the resultant string looked like a Unicode hexadecimal sequence, it would be re-parsed.

As an example of this, let’s consider the backslash (`\`), which Microsoft treats as equivalent to a slash (/) in URLs since their native pathname separator is a backslash³.

The backslash can be written in a URL in a hexadecimal format as %5c:

The “%” character can be expressed as %25.
The “5” character can be expressed as %35.
The “c” character can be expressed as %63.

Hence, if the URL parser sees the string %%35c, it would expand the %35 to the character “5”, which would result in %5c, which would then be converted to a \. If the parser sees %25%35%63, it would expand each of the %nn components to get the string %5c, which would then be converted to a \. As a final example, if the parser comes across %255c, it will expand %25 to % to get the string %5c, which would then be converted to a \.
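The effect of decoding twice can be reproduced with a small Python sketch. The decode_once helper is hypothetical and stands in for one pass of a URL parser; it is not the IIS code.

```python
import re

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_once(s):
    """Expand each %nn escape in a single left-to-right pass."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)


for raw in ("%%35c", "%25%35%63", "%255c"):
    first = decode_once(raw)      # what a validator sees after one pass
    second = decode_once(first)   # what the second parsing stage produces
    print(raw, "->", first, "->", second)
```

Each of the three inputs becomes %5c after the first pass and a backslash after the second, so a check performed between the two passes never sees the separator.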

For more details on this attack, see the SANS report. This bug was fixed a long time ago, of course, but it illustrates some unanticipated aspects of pathname parsing.

It is not always trivial to determine what a pathname actually refers to in an application and validation can be error-prone. The operating system itself parses a pathname a component at a time, traversing the directory tree and checking access rights as it goes along. An application may attempt to recreate a similar action without actually traversing the file system but rather by just parsing the name and mapping it to a subtree of the file system namespace. This doesn’t always work.

TOCTTOU attacks

TOCTTOU stands for Time of Check to Time of Use. If we have code of the form:

if I am allowed to do something
	then do it

we may be exposing ourselves to a race condition. There is a window of time between the test and the action. If an attacker can change the condition after the check then the action may take place even if the check should have failed.

One example of this is the print spooling program, lpr. It runs as a setuid program with root privileges so that it can copy a file from a user’s directory into a privileged spool directory that serves as a queue of files for printing. Because it runs as root, it can open any file, regardless of permissions. To keep the user honest, it will check access permissions on the file that the user wants to print and then, only if the user has legitimate read access to the file, it will copy it over to the spool directory for printing. An attacker can create a link to a readable file and then run lpr in the background. At the same time, he can change the link to point to a file for which he does not have read access. If the timing is just perfect, the lpr program will check access rights before the file is re-linked but will then copy the file for which the user has no read access.
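The check-then-use pattern, and one way to narrow the window, can be sketched in Python as follows. This is an illustration on a POSIX system, not the lpr source: both function names are hypothetical, and the ownership test is a simplified stand-in for a full permission check. A real setuid program would more likely drop privileges temporarily before opening the file.

```python
import os
import stat


def racy_read(path):
    # Check: os.access tests the real user's permissions against the NAME.
    if not os.access(path, os.R_OK):
        raise PermissionError(path)
    # Window: the name can be re-linked to a different file right here.
    with open(path, "rb") as f:  # Use: may not be the file that was checked
        return f.read()


def read_after_open(path):
    # Open first, then examine the object actually held via its descriptor.
    # O_NOFOLLOW refuses to open a symbolic link in the final path component.
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    with os.fdopen(fd, "rb") as f:
        info = os.fstat(f.fileno())
        if not stat.S_ISREG(info.st_mode):
            raise PermissionError("not a regular file")
        if info.st_uid != os.getuid():  # simplified stand-in permission check
            raise PermissionError("file is not owned by the requesting user")
        return f.read()
```

Because the second version checks the open descriptor rather than the name, re-linking the name after the open has no effect on which file is examined and read.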

Another example of the TOCTTOU race condition is the set of temporary filename creation functions (tmpnam, tempnam, mktemp, GetTempFileName, etc.). These functions create a unique filename when they are called but there is no guarantee that an attacker doesn’t create a file with the same name before that filename is used. If the attacker creates and opens a file with the same name, she will have access to that file for as long as it is open, even if the user’s program changes access permissions for the file later on.

The best defense for the temporary file race condition on POSIX systems is to use the mkstemp function, which creates a file based on a template name and opens it as well, avoiding the race condition between checking the uniqueness of the name and opening the file.
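Python's tempfile.mkstemp follows the same idea as the POSIX function: it chooses a name and creates and opens the file in one step, returning an open descriptor. A minimal sketch, where the write_scratch helper and the prefix and suffix values are only illustrative:

```python
import os
import tempfile


def write_scratch(data):
    """Create a temporary file atomically and write data to it."""
    # mkstemp returns an already-open descriptor plus the chosen path, so
    # there is no gap between picking the name and creating the file.
    fd, path = tempfile.mkstemp(prefix="spool-", suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path  # the caller is responsible for removing the file


path = write_scratch(b"scratch data\n")
print(path)
os.remove(path)
```

The caller works with the descriptor that mkstemp hands back instead of reopening the file by name, which is what avoids the race.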

References

Christopher Hacking, Recognizing and Preventing Time-of-Check to Time-of-Use Vulnerabilities, iSEC Partners whitepaper, March 2015.
Jinpeng Wei and Calton Pu, TOCTTOU Vulnerabilities in UNIX-Style File Systems: An Anatomical Study, 4th USENIX Conference on File and Storage Technologies (FAST’05), San Francisco, CA, December 2005.
Here’s a walkthrough of a real command injection attack in 2024 on a Palo Alto firewall.

Note that sprintf is vulnerable to buffer overflow. We should use snprintf, which allows one to specify the maximum size of the buffer. ↩︎
Unix, Linux, macOS, FreeBSD, NetBSD, OpenBSD, Android, etc. ↩︎
The official Unicode names for the slash and backslash characters are solidus and reverse solidus, respectively. ↩︎