`profiling.sampling` — Statistical profiler¶

Added in version 3.15.

Source code: Lib/profiling/sampling/

The profiling.sampling module, named Tachyon, provides statistical profiling of Python programs through periodic stack sampling. Tachyon can run scripts directly or attach to any running Python process without requiring code changes or restarts. Because sampling occurs externally to the target process, overhead is virtually zero, making Tachyon suitable for both development and production environments.

What is statistical profiling?¶

Statistical profiling builds a picture of program behavior by periodically capturing snapshots of the call stack. Rather than instrumenting every function call and return as deterministic profilers do, Tachyon reads the call stack at regular intervals to record what code is currently running.

This approach rests on a simple principle: functions that consume significant CPU time will appear frequently in the collected samples. By gathering thousands of samples over a profiling session, Tachyon constructs an accurate statistical estimate of where time is spent. The more samples collected, the more precise this estimate becomes.

The following interactive visualization demonstrates how sampling profiling works. Press Play to watch a Python program execute, and observe how the profiler periodically captures snapshots of the call stack. Adjust the sample interval to see how sampling frequency affects the results.

How time is estimated¶

The time values shown in Tachyon’s output are estimates derived from sample counts, not direct measurements. Tachyon counts how many times each function appears in the collected samples, then multiplies by the sampling interval to estimate time.

For example, with a 10 kHz sampling rate over a 10-second profile, Tachyon collects approximately 100,000 samples. If a function appears in 5,000 samples (5% of total), Tachyon estimates it consumed 5% of the 10-second duration, or about 500 milliseconds. This is a statistical estimate, not a precise measurement.
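The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the estimate described above, not Tachyon's implementation; the function name `estimate_seconds` is hypothetical.

```python
def estimate_seconds(sample_count: int, sampling_rate_hz: float) -> float:
    """Estimated time = sample count x sampling interval (1 / rate)."""
    return sample_count / sampling_rate_hz


# Figures from the example: 10 kHz for 10 seconds, function seen in 5,000 samples.
sampling_rate_hz = 10_000
total_samples = sampling_rate_hz * 10
function_samples = 5_000

share = function_samples / total_samples
estimated = estimate_seconds(function_samples, sampling_rate_hz)
print(f"{share:.0%} of samples, about {estimated * 1000:.0f} ms")
```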

The accuracy of these estimates depends on sample count. With 100,000 samples, a function showing 5% has a margin of error of roughly ±0.15 percentage points (at 95% confidence, treating samples as independent). With only 1,000 samples, the same 5% measurement could actually represent anywhere from about 3.5% to 6.5% of real time.
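One way to reason about how sample count affects accuracy is the normal approximation to a binomial proportion. The sketch below assumes every sample is independent, which is an idealization: consecutive samples of a real program are correlated, so practical margins can be wider. The helper name `margin_of_error` is hypothetical.

```python
import math


def margin_of_error(share: float, samples: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a sampled share."""
    return z * math.sqrt(share * (1.0 - share) / samples)


for n in (1_000, 10_000, 100_000):
    moe = margin_of_error(0.05, n)
    print(f"{n:>7} samples: 5% +/- {moe * 100:.2f} percentage points")
```

Because the margin shrinks with the square root of the sample count, collecting ten times as many samples narrows it by only a factor of about three.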

This is why longer profiling durations and shorter sampling intervals produce more reliable results—they collect more samples. For most performance analysis, the default settings provide sufficient accuracy to identify bottlenecks and guide optimization efforts.

Because sampling is statistical, results will vary slightly between runs. A function showing 12% in one run might show 11% or 13% in the next. This is normal and expected. Focus on the overall pattern rather than exact percentages, and don’t worry about small variations between runs.

When to use a different approach¶

Statistical sampling is not ideal for every situation.

For very short scripts that complete in under one second, the profiler may not collect enough samples for reliable results. Use profiling.tracing instead, or run the script in a loop to extend profiling time.

When you need exact call counts, sampling cannot provide them. Sampling estimates frequency from snapshots, so if you need to know precisely how many times a function was called, use profiling.tracing.

When comparing two implementations where the difference might be only 1-2%, sampling noise can obscure real differences. Use timeit for micro-benchmarks or profiling.tracing for precise measurements.

The key difference from profiling.tracing is how measurement happens. A tracing profiler instruments your code, recording every function call and return. This provides exact call counts and precise timing but adds overhead to every function call. A sampling profiler, by contrast, observes the program from outside at fixed intervals without modifying its execution. Think of the difference like this: tracing is like having someone follow you and write down every step you take, while sampling is like taking photographs every second and inferring your path from those snapshots.

This external observation model is what makes sampling profiling practical for production use. The profiled program runs at full speed because there is no instrumentation code running inside it. By default, the target process is not stopped or paused during sampling (the --blocking option changes this): Tachyon reads the call stack directly from the process’s memory while it continues to run. You can attach to a live server, collect data, and detach without the application ever knowing it was observed. The trade-off is that very short-lived functions may be missed if they happen to complete between samples.
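The trade-off with short-lived functions can be seen in a small deterministic simulation. This is a toy model, not how Tachyon reads a process: the timeline, the function names, and the helper `sample_timeline` are all invented for illustration.

```python
from collections import Counter

# Hypothetical execution timeline: (function name, duration in microseconds),
# executed back to back. Total length is 10,000 microseconds.
TIMELINE = [
    ("parse", 300),
    ("compute", 6000),
    ("log", 50),
    ("compute", 3000),
    ("write", 650),
]


def sample_timeline(timeline, interval_us):
    """Count which function is running at each fixed sampling instant."""
    counts = Counter()
    total_us = sum(duration for _, duration in timeline)
    for instant in range(0, total_us, interval_us):
        elapsed = 0
        for name, duration in timeline:
            elapsed += duration
            if instant < elapsed:
                counts[name] += 1
                break
    return counts


fine = sample_timeline(TIMELINE, interval_us=100)     # one sample every 100 us
coarse = sample_timeline(TIMELINE, interval_us=1000)  # one sample every 1000 us

for label, counts in (("100 us", fine), ("1000 us", coarse)):
    total = sum(counts.values())
    shares = {name: f"{n / total:.0%}" for name, n in counts.items()}
    print(label, total, "samples", shares)
```

In this model both intervals attribute 90% of samples to `compute`, which matches its true share of the timeline, but the coarser interval never observes `log` or `write` at all.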

Statistical profiling excels at answering the question, “Where is my program spending time?” It reveals hotspots and bottlenecks in production code where deterministic profiling overhead would be unacceptable. For exact call counts and complete call graphs, use profiling.tracing instead.

Quick examples¶

Profile a script and see the results immediately:

python -m profiling.sampling run script.py

Profile a module with arguments:

python -m profiling.sampling run -m mypackage.module arg1 arg2

Generate an interactive flame graph:

python -m profiling.sampling run --flamegraph -o profile.html script.py

Attach to a running process by PID:

python -m profiling.sampling attach 12345

Print a single snapshot of a running process’s stack:

python -m profiling.sampling dump 12345

Use live mode for real-time monitoring (press q to quit):

python -m profiling.sampling run --live script.py

Profile for 60 seconds with a faster sampling rate:

python -m profiling.sampling run -d 60 -r 20khz script.py

Generate a line-by-line heatmap:

python -m profiling.sampling run --heatmap script.py

Enable opcode-level profiling to see which bytecode instructions are executing:

python -m profiling.sampling run --opcodes --flamegraph script.py

Commands¶

Tachyon operates through several subcommands. run and attach collect samples over time; dump captures a single snapshot; replay converts binary profiles to other formats.

The `run` command¶

The run command launches a Python script or module and profiles it from startup:

python -m profiling.sampling run script.py
python -m profiling.sampling run -m mypackage.module

When profiling a script, the profiler starts the target in a subprocess, waits for it to initialize, then begins collecting samples. The -m flag indicates that the target should be run as a module (equivalent to python -m). Arguments after the target are passed through to the profiled program:

python -m profiling.sampling run script.py --config settings.yaml

The `attach` command¶

The attach command connects to an already-running Python process by its process ID:

python -m profiling.sampling attach 12345

This command is particularly valuable for investigating performance issues in production systems. The target process requires no modification and need not be restarted. The profiler attaches, collects samples for the specified duration, then detaches and produces output.

python -m profiling.sampling attach --live 12345
python -m profiling.sampling attach --flamegraph -d 30 -o profile.html 12345

On most systems, attaching to another process requires appropriate permissions. See Platform requirements for details.

The `dump` command¶

The dump command prints a single snapshot of a running process’s Python stack and exits, similar to a traceback:

python -m profiling.sampling dump 12345

Unlike attach, dump does not run a sampling loop: it reads the stack once. This is useful for investigating hung or unresponsive processes, or for answering “what is this process doing right now?”.

The output mirrors a traceback (most recent call last) and annotates each thread with its current state (main thread, has GIL, on CPU, waiting for GIL, has exception, or idle):

Stack dump for PID 12345, thread 140735 (main thread, has GIL, on CPU; most recent call last):
  File "server.py", line 28, in serve
    await handle_request(req)
  File "handler.py", line 91, in handle_request
    result = expensive_call(req)

When the target’s source files are readable, dump prints the source line for each frame and highlights the executing expression.

Like attach, dump requires permission to read the target process’s memory. See Platform requirements.

The dump command supports the following options:

-a, --all-threads: Dump every thread in the target process. Without this flag only the main thread is shown.
--native: Include synthetic <native> frames marking transitions into C extensions or other non-Python code.
--no-gc: Hide the synthetic <GC> frames that mark active garbage collection.
--opcodes: Annotate each frame with the bytecode opcode the thread is currently executing (for example, opcode=CALL_KW). Useful for instruction-level investigation, including identifying specializations chosen by the adaptive interpreter.
--async-aware: Reconstruct stacks across await boundaries. dump walks the task graph and emits one section per task, with <task> markers separating coroutines awaiting each other.
--async-mode {running,all}: Controls which tasks are included when --async-aware is enabled. running shows only the task currently executing on each thread; all (the default for dump) also includes tasks suspended on a wait. attach’s default for this flag is running; dump defaults to all because a single snapshot is most useful when it shows the full task graph.
--blocking: Pause every thread in the target while reading its stack and resume them after. Guarantees a fully consistent snapshot at the cost of briefly stopping the target. Without it, dump reads memory while the target keeps running, which is faster but can occasionally produce a torn stack.

The `replay` command¶

The replay command converts binary profile files to other output formats:

python -m profiling.sampling replay profile.bin
python -m profiling.sampling replay --flamegraph -o profile.html profile.bin

This command is useful when you have captured profiling data in binary format and want to analyze it later or convert it to a visualization format. Binary profiles can be replayed multiple times to different formats without re-profiling.

# Convert binary to pstats (default, prints to stdout)
python -m profiling.sampling replay profile.bin

# Convert binary to flame graph
python -m profiling.sampling replay --flamegraph -o output.html profile.bin

# Convert binary to gecko format for Firefox Profiler
python -m profiling.sampling replay --gecko -o profile.json profile.bin

# Convert binary to heatmap
python -m profiling.sampling replay --heatmap -o my_heatmap profile.bin
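When one binary profile needs to be converted to several formats, the commands above can be driven from a short script. The sketch below only uses the flags and output names shown in these examples; the helper names `replay_commands` and `replay_all` are hypothetical, and `replay_all` assumes an interpreter that provides `profiling.sampling`.

```python
import subprocess
import sys

# Output flag -> output path, taken from the replay examples in this section.
TARGETS = {
    "--flamegraph": "output.html",
    "--gecko": "profile.json",
    "--heatmap": "my_heatmap",
}


def replay_commands(binary_profile, python=sys.executable):
    """Build one `replay` command line per output format."""
    return [
        [python, "-m", "profiling.sampling", "replay", flag, "-o", out, binary_profile]
        for flag, out in TARGETS.items()
    ]


def replay_all(binary_profile):
    """Run each conversion in turn, stopping at the first failure."""
    for command in replay_commands(binary_profile):
        subprocess.run(command, check=True)


# Print the commands without running them.
for command in replay_commands("profile.bin"):
    print(" ".join(command))
```

Calling `replay_all("profile.bin")` would run the three conversions one after another.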

Profiling in production¶

The sampling profiler is designed for production use. It imposes no measurable overhead on the target process because it reads memory externally rather than instrumenting code. The target application continues running at full speed and is unaware it is being profiled.

When profiling production systems, keep these guidelines in mind:

Start with shorter durations (10-30 seconds) to get quick results, which is usually sufficient to identify major hotspots, then extend if you need more statistical accuracy. By default, profiling runs until the target process exits or is interrupted, so pass --duration (-d) when attaching to a long-running service.

If possible, profile during representative load rather than peak traffic. Profiles collected during normal operation are easier to interpret than those collected during unusual spikes.

The profiler itself consumes some CPU on the machine where it runs (not on the target process). On the same machine, this is typically negligible. When profiling remote processes, network latency does not affect the target.

Results from production may differ from development due to different data sizes, concurrent load, or caching effects. This is expected and is often exactly what you want to capture.

Platform requirements¶

The profiler reads the target process’s memory to capture stack traces. This requires elevated permissions on most operating systems.

Linux

On Linux, the profiler uses ptrace or process_vm_readv to read the target process’s memory. This typically requires one of:

Running as root
Having the CAP_SYS_PTRACE capability
Adjusting the Yama ptrace scope: /proc/sys/kernel/yama/ptrace_scope

The default ptrace_scope of 1 restricts ptrace so that a process can attach only to its own descendants (for example, a child it launched). To allow attaching to any process owned by the same user, set it to 0:

echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
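Before changing the setting, it can help to check the current value. The following sketch reads the Yama file named above and maps it to the meanings given in the Linux kernel's Yama documentation. The helper names are hypothetical, and the script returns `None` on systems where the file does not exist.

```python
from pathlib import Path

YAMA_PATH = Path("/proc/sys/kernel/yama/ptrace_scope")

# Meanings of the Yama ptrace_scope values, per the Linux kernel documentation.
SCOPE_MEANINGS = {
    0: "classic: a process may attach to any process owned by the same user",
    1: "restricted: a process may attach only to its own descendants",
    2: "admin-only: attaching requires CAP_SYS_PTRACE",
    3: "no attach: ptrace attach is disabled",
}


def describe_ptrace_scope(raw: str) -> str:
    """Translate the raw file contents into a human-readable description."""
    value = int(raw.strip())
    return SCOPE_MEANINGS.get(value, f"unknown value {value}")


def read_ptrace_scope():
    """Return a description of the current setting, or None if unavailable."""
    try:
        return describe_ptrace_scope(YAMA_PATH.read_text())
    except OSError:
        return None  # not Linux, or Yama is not enabled


print(read_ptrace_scope())
```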

macOS

On macOS, the profiler uses task_for_pid() to access the target process. This requires one of:

Running as root
The profiler binary having the com.apple.security.cs.debugger entitlement
System Integrity Protection (SIP) being disabled (not recommended)

Windows

On Windows, the profiler requires administrative privileges or the SeDebugPrivilege privilege to read another process’s memory.

Note: On Windows, python -m profiling.sampling fails inside a virtual environment because the venv’s python.exe is just a launcher shim that re-executes the base interpreter as a child process. The shim itself isn’t a Python process and has no PyRuntime section to attach to. Instead, run it from the global Python installation.

Version compatibility¶

The profiler and target process must run the same Python minor version (for example, both Python 3.15). Attaching from Python 3.14 to a Python 3.15 process is not supported.

Additional restrictions apply to pre-release Python versions: if either the profiler or target is running a pre-release (alpha, beta, or release candidate), both must run the exact same version.

On free-threaded Python builds, the profiler cannot attach from a free-threaded build to a standard build, or vice versa.
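The version rules above can be expressed as a small predicate. This sketch encodes only the minor-version and pre-release rules as stated in this section, not the free-threaded restriction, and it is not the profiler's own check. `Version` is a hypothetical type that mirrors the fields of `sys.version_info`.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """Same fields as sys.version_info."""
    major: int
    minor: int
    micro: int
    releaselevel: str  # "alpha", "beta", "candidate", or "final"
    serial: int


def versions_compatible(profiler: Version, target: Version) -> bool:
    # Rule 1: both sides must share the same major.minor version.
    if (profiler.major, profiler.minor) != (target.major, target.minor):
        return False
    # Rule 2: if either side is a pre-release, the versions must match exactly.
    if profiler.releaselevel != "final" or target.releaselevel != "final":
        return profiler == target
    return True


print(versions_compatible(Version(3, 15, 0, "final", 0), Version(3, 15, 2, "final", 0)))
print(versions_compatible(Version(3, 14, 0, "final", 0), Version(3, 15, 0, "final", 0)))
```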

Sampling configuration¶

Before exploring the various output formats and visualization options, it is important to understand how to configure the sampling process itself. The profiler offers several options that control how frequently samples are collected, how long profiling runs, which threads are observed, and what additional context is captured in each sample.

The default configuration works well for most use cases:

Option	Default
`--sampling-rate` / `-r`	1 kHz
`--duration` / `-d`	Run to completion
`--all-threads` / `-a`	Main thread only
`--native`	No `<native>` frames (C code time attributed to caller)
`--no-gc`	`<GC>` frames included when garbage collection is active
`--mode`	Wall-clock mode (all samples recorded)
`--realtime-stats`	Disabled
`--subprocesses`	Disabled
`--blocking`	Disabled (non-blocking sampling)

Sampling rate and duration¶

The two most fundamental parameters are the sampling rate and duration. Together, these determine how many samples will be collected during a profiling session.

The --sampling-rate option (-r) sets how frequently samples are collected. The default is 1 kHz (1,000 samples per second):

python -m profiling.sampling run -r 20khz script.py

Higher rates capture more samples and provide finer-grained data at the cost of slightly higher profiler CPU usage. Lower rates reduce profiler overhead but may miss short-lived functions. For most applications, the default rate provides a good balance between accuracy and overhead.

The --duration option (-d) sets how long to profile in seconds. By default, profiling continues until the target process exits or is interrupted:

python -m profiling.sampling run -d 60 script.py

Specifying a duration is useful when attaching to long-running processes or when you want to limit profiling to a specific time window. When profiling a script, the default behavior of running to completion is usually what you want.

Thread selection¶

Python programs often use multiple threads, whether explicitly through the threading module or implicitly through libraries that manage thread pools.

By default, the profiler samples only the main thread. The --all-threads option (-a) enables sampling of all threads in the process:

python -m profiling.sampling run -a script.py

Multi-thread profiling reveals how work is distributed across threads and can identify threads that are blocked or starved. Each thread’s samples are combined in the output, with the ability to filter by thread in some formats. This option is particularly useful when investigating concurrency issues or when work is distributed across a thread pool.

Blocking mode¶

By default, Tachyon reads the target process’s memory without stopping it. This non-blocking approach is ideal for most profiling scenarios because it imposes virtually zero overhead on the target application: the profiled program runs at full speed and is unaware it is being observed.

However, non-blocking sampling can occasionally produce incomplete or inconsistent stack traces. This is most likely in applications with many generators or coroutines that rapidly switch between yield points, and in programs with very fast-changing call stacks where functions enter and exit between the start and end of a single stack read. In both cases, the reconstructed stack may mix frames from different execution states or describe a call chain that never actually existed.

For these cases, the --blocking option stops the target process during each sample:

python -m profiling.sampling run --blocking script.py
python -m profiling.sampling attach --blocking 12345

When blocking mode is enabled, the profiler suspends the target process, reads its stack, then resumes it. This guarantees that each captured stack represents a real, consistent snapshot of what the process was doing at that instant. The trade-off is that the target process runs slower because it is repeatedly paused.

Warning

Do not use very high sampling rates (large --sampling-rate values) with blocking mode. Suspending and resuming a process takes time, and if the interval between samples is too short, the target will spend more time stopped than running. For blocking mode, rates of 1 kHz (one sample per millisecond) or lower are recommended. Higher rates, such as 10 kHz (one sample every 100 microseconds), may cause noticeable slowdown in the target application.

Use blocking mode only when you observe inconsistent stacks in your profiles, particularly with generator-heavy or coroutine-heavy code. For most applications, the default non-blocking mode provides accurate results with zero impact on the target process.
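The cost of blocking mode at high sampling rates comes down to simple arithmetic: the fraction of time the target spends stopped is roughly the sampling rate multiplied by the time each suspend, read, and resume cycle takes. The sketch below uses a made-up cycle cost of 50 microseconds. Real costs depend on the platform and on stack depth, and the function name is hypothetical.

```python
def stopped_fraction(rate_hz: float, pause_cost_s: float) -> float:
    """Approximate fraction of wall-clock time the target spends suspended."""
    return min(1.0, rate_hz * pause_cost_s)


# Hypothetical cost of one suspend/read/resume cycle; not a measured value.
PAUSE_COST_S = 50e-6

for rate_hz in (1_000, 10_000, 20_000):
    fraction = stopped_fraction(rate_hz, PAUSE_COST_S)
    print(f"{rate_hz:>6} Hz: target stopped about {fraction:.0%} of the time")
```

Under this assumed cost, 1 kHz keeps the target stopped for about 5% of the time, while 20 kHz would leave it no time to run at all.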

Special frames¶

The profiler can inject artificial frames into the captured stacks to provide additional context about what the interpreter is doing at the moment each sample is taken. These synthetic frames help distinguish different types of execution that would otherwise be invisible.

The --native option adds <native> frames to indicate when Python has called into C code (extension modules, built-in functions, or the interpreter itself):

python -m profiling.sampling run --native script.py

These frames help distinguish time spent in Python code versus time spent in native libraries. Without this option, native code execution appears as time in the Python function that made the call. This is useful when optimizing code that makes heavy use of C extensions like NumPy or database drivers.

By default, the profiler includes <GC> frames when garbage collection is active. The --no-gc option suppresses these frames:

python -m profiling.sampling run --no-gc script.py

GC frames help identify programs where garbage collection consumes significant time, which may indicate memory allocation patterns worth optimizing. If you see substantial time in <GC> frames, consider investigating object allocation rates or using object pooling.

Opcode-aware profiling¶

The --opcodes option enables instruction-level profiling that captures which Python bytecode instructions are executing at each sample:

python -m profiling.sampling run --opcodes --flamegraph script.py

This feature provides visibility into Python’s bytecode execution, including adaptive specialization optimizations. When a generic instruction like LOAD_ATTR is specialized at runtime into a more efficient variant like LOAD_ATTR_INSTANCE_VALUE, the profiler shows both the specialized name and the base instruction.

Opcode information appears in several output formats:

Flame graphs: Hovering over a frame displays a tooltip with a bytecode instruction breakdown, showing which opcodes consumed time in that function
Heatmap: Expandable bytecode panels per source line show instruction breakdown with specialization percentages
Live mode: An opcode panel shows instruction-level statistics for the selected function, accessible via keyboard navigation
Gecko format: Opcode transitions are emitted as interval markers in the Firefox Profiler timeline

This level of detail is particularly useful for:

Understanding the performance impact of Python’s adaptive specialization
Identifying hot bytecode instructions that might benefit from optimization
Analyzing the effectiveness of different code patterns at the instruction level
Debugging performance issues that occur at the bytecode level

The --opcodes option is compatible with --live, --flamegraph, --heatmap, and