profiling.sampling — Statistical profiler¶
Added in version 3.15.
Source code: Lib/profiling/sampling/
The profiling.sampling module, named Tachyon, provides statistical
profiling of Python programs through periodic stack sampling. Tachyon can
run scripts directly or attach to any running Python process without requiring
code changes or restarts. Because sampling occurs externally to the target
process, overhead is virtually zero, making Tachyon suitable for both
development and production environments.
What is statistical profiling?¶
Statistical profiling builds a picture of program behavior by periodically capturing snapshots of the call stack. Rather than instrumenting every function call and return as deterministic profilers do, Tachyon reads the call stack at regular intervals to record what code is currently running.
This approach rests on a simple principle: functions that consume significant CPU time will appear frequently in the collected samples. By gathering thousands of samples over a profiling session, Tachyon constructs an accurate statistical estimate of where time is spent. The more samples collected, the more precise this estimate becomes.
How time is estimated¶
The time values shown in Tachyon’s output are estimates derived from sample counts, not direct measurements. Tachyon counts how many times each function appears in the collected samples, then multiplies by the sampling interval to estimate time.
For example, with a 10 kHz sampling rate over a 10-second profile, Tachyon collects approximately 100,000 samples. If a function appears in 5,000 samples (5% of total), Tachyon estimates it consumed 5% of the 10-second duration, or about 500 milliseconds. This is a statistical estimate, not a precise measurement.
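The arithmetic in this example can be sketched in a few lines of Python. The function name `estimate_seconds` and its parameters are illustrative only and are not part of Tachyon's API.

```python
def estimate_seconds(sample_count: int, sampling_rate_hz: float) -> float:
    """Estimate time from a sample count: each sample stands for one interval."""
    return sample_count / sampling_rate_hz


# 5,000 of the roughly 100,000 samples from a 10-second profile at 10 kHz.
total_samples = 10_000 * 10
share = 5_000 / total_samples
seconds = estimate_seconds(5_000, 10_000)
print(f"{share:.0%} of samples, about {seconds * 1000:.0f} ms")
```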
The accuracy of these estimates depends on sample count. With 100,000 samples, a function showing 5% has a margin of error of roughly ±0.5%. With only 1,000 samples, the same 5% measurement could actually represent anywhere from 3% to 7% of real time.
This is why longer profiling durations and shorter sampling intervals produce more reliable results—they collect more samples. For most performance analysis, the default settings provide sufficient accuracy to identify bottlenecks and guide optimization efforts.
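One way to see why more samples help is the textbook normal approximation for a proportion, sketched below. It treats every sample as independent, which is an assumption made here for illustration: consecutive samples from a real program are correlated, so this idealized figure is optimistic and practical margins are wider. The function name `margin_of_error` is hypothetical.

```python
import math


def margin_of_error(share: float, sample_count: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval for a sampled share.

    Assumes independent samples, which real profiles only approximate.
    """
    standard_error = math.sqrt(share * (1.0 - share) / sample_count)
    return z * standard_error


for n in (1_000, 10_000, 100_000):
    moe = margin_of_error(0.05, n)
    print(f"{n:>7} samples: 5% +/- {moe * 100:.2f} percentage points")
```

Under this model the margin shrinks with the square root of the sample count, so ten times as many samples narrows it by a factor of about 3.2.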
Because sampling is statistical, results will vary slightly between runs. A function showing 12% in one run might show 11% or 13% in the next. This is normal and expected. Focus on the overall pattern rather than exact percentages, and don’t worry about small variations between runs.
When to use a different approach¶
Statistical sampling is not ideal for every situation.
For very short scripts that complete in under one second, the profiler may not
collect enough samples for reliable results. Use profiling.tracing
instead, or run the script in a loop to extend profiling time.
When you need exact call counts, sampling cannot provide them. Sampling
estimates frequency from snapshots, so if you need to know precisely how many
times a function was called, use profiling.tracing.
When comparing two implementations where the difference might be only 1-2%,
sampling noise can obscure real differences. Use timeit for
micro-benchmarks or profiling.tracing for precise measurements.
The key difference from profiling.tracing is how measurement happens.
A tracing profiler instruments your code, recording every function call and
return. This provides exact call counts and precise timing but adds overhead
to every function call. A sampling profiler, by contrast, observes the program
from outside at fixed intervals without modifying its execution. Think of the
difference like this: tracing is like having someone follow you and write down
every step you take, while sampling is like taking photographs every second
and inferring your path from those snapshots.
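The contrast can be illustrated with a toy, in-process sketch. It is not how Tachyon works (Tachyon observes the target from outside the process); it only shows the two measurement styles side by side using `sys.setprofile` and `sys._current_frames`. All function names here are made up for the illustration.

```python
import collections
import sys
import threading
import time


def busy(n):
    """CPU-bound helper that both approaches should observe."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload():
    for _ in range(200):
        busy(20_000)


def trace_call_counts(func):
    """Tracing: a hook runs on every Python call, so counts are exact."""
    counts = collections.Counter()

    def hook(frame, event, arg):
        if event == "call":
            counts[frame.f_code.co_name] += 1

    sys.setprofile(hook)
    try:
        func()
    finally:
        sys.setprofile(None)
    return counts


def sample_stacks(func, interval=0.001):
    """Sampling: a helper thread periodically records the innermost frame."""
    samples = collections.Counter()
    target_id = threading.get_ident()
    done = threading.Event()

    def sampler():
        while not done.is_set():
            frame = sys._current_frames().get(target_id)
            if frame is not None:
                samples[frame.f_code.co_name] += 1
            time.sleep(interval)

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        func()
    finally:
        done.set()
        thread.join()
    return samples


exact = trace_call_counts(workload)
sampled = sample_stacks(workload)
print("exact calls to busy():", exact["busy"])
print("samples by function:", dict(sampled))
```

The tracing run reports exactly how many times `busy` was called. The sampling run cannot recover that number, but it shows that nearly all snapshots land in `busy`, which is the information needed to find a hotspot.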
This external observation model is what makes sampling profiling practical for production use. The profiled program runs at full speed because there is no instrumentation code running inside it, and the target process is never stopped or paused during sampling—Tachyon reads the call stack directly from the process’s memory while it continues to run. You can attach to a live server, collect data, and detach without the application ever knowing it was observed. The trade-off is that very short-lived functions may be missed if they happen to complete between samples.
Statistical profiling excels at answering the question, “Where is my program
spending time?” It reveals hotspots and bottlenecks in production code where
deterministic profiling overhead would be unacceptable. For exact call counts
and complete call graphs, use profiling.tracing instead.
Quick examples¶
Profile a script and see the results immediately:
python -m profiling.sampling run script.py
Profile a module with arguments:
python -m profiling.sampling run -m mypackage.module arg1 arg2
Generate an interactive flame graph:
python -m profiling.sampling run --flamegraph -o profile.html script.py
Attach to a running process by PID:
python -m profiling.sampling attach 12345
Print a single snapshot of a running process’s stack:
python -m profiling.sampling dump 12345
Use live mode for real-time monitoring (press q to quit):
python -m profiling.sampling run --live script.py
Profile for 60 seconds with a faster sampling rate:
python -m profiling.sampling run -d 60 -r 20khz script.py
Generate a line-by-line heatmap:
python -m profiling.sampling run --heatmap script.py
Enable opcode-level profiling to see which bytecode instructions are executing:
python -m profiling.sampling run --opcodes --flamegraph script.py
Commands¶
Tachyon operates through several subcommands. run and attach collect
samples over time; dump captures a single snapshot; replay converts
binary profiles to other formats.
The run command¶
The run command launches a Python script or module and profiles it from
startup:
python -m profiling.sampling run script.py
python -m profiling.sampling run -m mypackage.module
When profiling a script, the profiler starts the target in a subprocess, waits
for it to initialize, then begins collecting samples. The -m flag
indicates that the target should be run as a module (equivalent to
python -m). Arguments after the target are passed through to the
profiled program:
python -m profiling.sampling run script.py --config settings.yaml
The attach command¶
The attach command connects to an already-running Python process by its
process ID:
python -m profiling.sampling attach 12345
This command is particularly valuable for investigating performance issues in production systems. The target process requires no modification and need not be restarted. The profiler attaches, collects samples for the specified duration, then detaches and produces output.
python -m profiling.sampling attach --live 12345
python -m profiling.sampling attach --flamegraph -d 30 -o profile.html 12345
On most systems, attaching to another process requires appropriate permissions. See Platform requirements for platform-specific requirements.
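For scripted use, such as a maintenance tool that captures a flame graph on demand, the attach invocation can be wrapped with `subprocess`. This is a minimal sketch: the helper names are hypothetical, and only the flags shown in the examples above are used.

```python
import subprocess
import sys


def build_attach_command(pid: int, duration: int, output: str) -> list[str]:
    """Build the attach invocation shown above as an argument list."""
    return [
        sys.executable, "-m", "profiling.sampling", "attach",
        "--flamegraph", "-d", str(duration), "-o", output, str(pid),
    ]


def profile_process(pid: int, duration: int = 30,
                    output: str = "profile.html") -> int:
    """Run the profiler against pid and return its exit code (hypothetical)."""
    command = build_attach_command(pid, duration, output)
    completed = subprocess.run(command, check=False)
    return completed.returncode


print(" ".join(build_attach_command(12345, 30, "profile.html")[1:]))
```

Using `sys.executable` runs the profiler with the same interpreter as the calling script, which matters because the profiler and target must run compatible Python versions.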
The dump command¶
The dump command prints a single snapshot of a running process’s Python
stack and exits, similar to a traceback:
python -m profiling.sampling dump 12345
Unlike attach, dump does not run a sampling loop: it reads the
stack once. This is useful for investigating hung or unresponsive
processes, or for answering “what is this process doing right now?”.
The output mirrors a traceback (most recent call last) and annotates each thread with its current state (main thread, has GIL, on CPU, waiting for GIL, has exception, or idle):
Stack dump for PID 12345, thread 140735 (main thread, has GIL, on CPU; most recent call last):
  File "server.py", line 28, in serve
    await handle_request(req)
  File "handler.py", line 91, in handle_request
    result = expensive_call(req)
When the target’s source files are readable, dump prints the source
line for each frame and highlights the executing expression.
Like attach, dump requires permission to read the target process’s
memory. See Platform requirements.
The dump command supports the following options:
-a, --all-threads: Dump every thread in the target process. Without this flag only the main thread is shown.
--native: Include synthetic <native> frames marking transitions into C extensions or other non-Python code.
--no-gc: Hide the synthetic <GC> frames that mark active garbage collection.
--opcodes: Annotate each frame with the bytecode opcode the thread is currently executing (for example, opcode=CALL_KW). Useful for instruction-level investigation, including identifying specializations chosen by the adaptive interpreter.
--async-aware: Reconstruct stacks across await boundaries. dump walks the task graph and emits one section per task, with <task> markers separating coroutines awaiting each other.
--async-mode {running,all}: Controls which tasks are included when --async-aware is enabled. running shows only the task currently executing on each thread; all (the default for dump) also includes tasks suspended on a wait. attach's default for this flag is running; dump defaults to all because a single snapshot is most useful when it shows the full task graph.
--blocking: Pause every thread in the target while reading its stack and resume them after. Guarantees a fully consistent snapshot at the cost of briefly stopping the target. Without it, dump reads memory while the target keeps running, which is faster but can occasionally produce a torn stack.
The replay command¶
The replay command converts binary profile files to other output formats:
python -m profiling.sampling replay profile.bin
python -m profiling.sampling replay --flamegraph -o profile.html profile.bin
This command is useful when you have captured profiling data in binary format and want to analyze it later or convert it to a visualization format. Binary profiles can be replayed multiple times to different formats without re-profiling.
# Convert binary to pstats (default, prints to stdout)
python -m profiling.sampling replay profile.bin
# Convert binary to flame graph
python -m profiling.sampling replay --flamegraph -o output.html profile.bin
# Convert binary to gecko format for Firefox Profiler
python -m profiling.sampling replay --gecko -o profile.json profile.bin
# Convert binary to heatmap
python -m profiling.sampling replay --heatmap -o my_heatmap profile.bin
Profiling in production¶
The sampling profiler is designed for production use. It imposes no measurable overhead on the target process because it reads memory externally rather than instrumenting code. The target application continues running at full speed and is unaware it is being profiled.
When profiling production systems, keep these guidelines in mind:
Start with shorter durations (10-30 seconds) to get quick results, then extend if you need more statistical accuracy. A short window is usually sufficient to identify major hotspots. Because profiling otherwise runs until the target process exits or is interrupted, set --duration (-d) explicitly when attaching to a long-running service.
If possible, profile during representative load rather than peak traffic. Profiles collected during normal operation are easier to interpret than those collected during unusual spikes.
The profiler itself consumes some CPU on the machine where it runs (not on the target process). On the same machine, this is typically negligible. When profiling remote processes, network latency does not affect the target.
Results from production may differ from development due to different data sizes, concurrent load, or caching effects. This is expected and is often exactly what you want to capture.
Platform requirements¶
The profiler reads the target process’s memory to capture stack traces. This requires elevated permissions on most operating systems.
Linux
On Linux, the profiler uses ptrace or process_vm_readv to read the
target process’s memory. This typically requires one of:
Running as root
Having the CAP_SYS_PTRACE capability
Adjusting the Yama ptrace scope: /proc/sys/kernel/yama/ptrace_scope
The default ptrace_scope of 1 allows a process to attach only to its own descendants. To allow attaching to any process owned by the same user, set it to 0:
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
macOS
On macOS, the profiler uses task_for_pid() to access the target process.
This requires one of:
Running as root
The profiler binary having the com.apple.security.cs.debugger entitlement
System Integrity Protection (SIP) being disabled (not recommended)
Windows
On Windows, the profiler requires administrative privileges or the
SeDebugPrivilege privilege to read another process’s memory.
Note: On Windows, python -m profiling.sampling fails inside a virtual
environment because the venv’s python.exe is just a launcher shim that
re-executes the base interpreter as a child process. The shim itself isn’t
a Python process and has no PyRuntime section to attach to. Instead,
run it from the global Python installation.
Version compatibility¶
The profiler and target process must run the same Python minor version (for example, both Python 3.15). Attaching from Python 3.14 to a Python 3.15 process is not supported.
Additional restrictions apply to pre-release Python versions: if either the profiler or target is running a pre-release (alpha, beta, or release candidate), both must run the exact same version.
The profiler and target must also use the same build type: a free-threaded build cannot attach to a standard build, or vice versa.
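The three rules above can be summarized as a small check. This is a sketch of the rules as stated in this section, not the profiler's actual implementation; `InterpreterInfo` and `can_attach` are hypothetical names, with fields modeled on `sys.version_info`.

```python
from typing import NamedTuple


class InterpreterInfo(NamedTuple):
    """Hypothetical summary of one interpreter (profiler or target)."""
    major: int
    minor: int
    micro: int
    releaselevel: str  # "alpha", "beta", "candidate", or "final"
    serial: int
    free_threaded: bool

    @property
    def is_prerelease(self) -> bool:
        return self.releaselevel != "final"


def can_attach(profiler: InterpreterInfo, target: InterpreterInfo) -> bool:
    """Apply the compatibility rules described in this section."""
    # Rule 1: both sides must share the same minor version (for example 3.15).
    if (profiler.major, profiler.minor) != (target.major, target.minor):
        return False
    # Rule 2: if either side is a pre-release, the versions must match exactly.
    if profiler.is_prerelease or target.is_prerelease:
        if profiler[:5] != target[:5]:
            return False
    # Rule 3: free-threaded and standard builds cannot be mixed.
    return profiler.free_threaded == target.free_threaded


release = InterpreterInfo(3, 15, 0, "final", 0, False)
patch = InterpreterInfo(3, 15, 1, "final", 0, False)
print(can_attach(release, patch))
```

For the running interpreter, the first five fields can be taken from `sys.version_info`; on Python 3.13 and later, `sysconfig.get_config_var("Py_GIL_DISABLED")` indicates a free-threaded build.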
Sampling configuration¶
Before exploring the various output formats and visualization options, it is important to understand how to configure the sampling process itself. The profiler offers several options that control how frequently samples are collected, how long profiling runs, which threads are observed, and what additional context is captured in each sample.
The default configuration works well for most use cases:
| Option | Default |
|---|---|
| Default for --sampling-rate (-r) | 1 kHz |
| Default for --duration (-d) | Run to completion |
| Default for --all-threads (-a) | Main thread only |
| Default for --native | No <native> frames |
| Default for --no-gc | <GC> frames included |
| Default for | Wall-clock mode (all samples recorded) |
| Default for | Disabled |
| Default for | Disabled |
| Default for --blocking | Disabled (non-blocking sampling) |
Sampling rate and duration¶
The two most fundamental parameters are the sampling rate and duration. Together, these determine how many samples will be collected during a profiling session.
The --sampling-rate option (-r) sets how frequently samples
are collected. The default is 1 kHz (1,000 samples per second):
python -m profiling.sampling run -r 20khz script.py
Higher rates capture more samples and provide finer-grained data at the cost of slightly higher profiler CPU usage. Lower rates reduce profiler overhead but may miss short-lived functions. For most applications, the default rate provides a good balance between accuracy and overhead.
The --duration option (-d) sets how long to profile in seconds. By
default, profiling continues until the target process exits or is interrupted:
python -m profiling.sampling run -d 60 script.py
Specifying a duration is useful when attaching to long-running processes or when you want to limit profiling to a specific time window. When profiling a script, the default behavior of running to completion is usually what you want.
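As a rough planning aid, the nominal interval between samples and the nominal sample count follow directly from these two settings. The helper names below are illustrative, and the actual count in a session may differ from the nominal figure.

```python
def sampling_interval_us(rate_hz: float) -> float:
    """Nominal time between samples, in microseconds."""
    return 1_000_000 / rate_hz


def expected_samples(rate_hz: float, duration_seconds: float) -> int:
    """Nominal number of samples collected over the given duration."""
    return round(rate_hz * duration_seconds)


for rate_hz, duration in ((1_000, 60), (20_000, 60)):
    interval = sampling_interval_us(rate_hz)
    count = expected_samples(rate_hz, duration)
    print(f"{rate_hz} Hz for {duration} s: one sample every "
          f"{interval:.0f} us, about {count:,} samples")
```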
Thread selection¶
Python programs often use multiple threads, whether explicitly through the
threading module or implicitly through libraries that manage thread
pools.
By default, the profiler samples only the main thread. The --all-threads
option (-a) enables sampling of all threads in the process:
python -m profiling.sampling run -a script.py
Multi-thread profiling reveals how work is distributed across threads and can identify threads that are blocked or starved. Each thread’s samples are combined in the output, with the ability to filter by thread in some formats. This option is particularly useful when investigating concurrency issues or when work is distributed across a thread pool.
Blocking mode¶
By default, Tachyon reads the target process’s memory without stopping it. This non-blocking approach is ideal for most profiling scenarios because it imposes virtually zero overhead on the target application: the profiled program runs at full speed and is unaware it is being observed.
However, non-blocking sampling can occasionally produce incomplete or inconsistent stack traces. This is most likely in applications with many generators or coroutines that rapidly switch between yield points, or in programs with very fast-changing call stacks where functions enter and exit between the start and end of a single stack read. In these cases the reconstructed stack may mix frames from different execution states, or describe a call chain that never actually existed.
For these cases, the --blocking option stops the target process during
each sample:
python -m profiling.sampling run --blocking script.py
python -m profiling.sampling attach --blocking 12345
When blocking mode is enabled, the profiler suspends the target process, reads its stack, then resumes it. This guarantees that each captured stack represents a real, consistent snapshot of what the process was doing at that instant. The trade-off is that the target process runs slower because it is repeatedly paused.
Warning
Do not use very high sampling rates (large --sampling-rate values) with
blocking mode. Suspending and resuming a process takes time, and if the
interval between samples is too short, the target will spend more time
stopped than running. For blocking mode, rates of 1 kHz (one sample per
millisecond) or lower are recommended. Higher rates, such as 20 kHz, may
cause noticeable slowdown in the target application.
Use blocking mode only when you observe inconsistent stacks in your profiles, particularly with generator-heavy or coroutine-heavy code. For most applications, the default non-blocking mode provides accurate results with virtually no impact on the target process.
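The cost of blocking mode can be reasoned about with a simple model: if each sample keeps the target suspended for a fixed pause, the fraction of time spent stopped is the pause divided by the interval between samples. The sketch below uses that model. The 25-microsecond pause is a made-up figure for illustration, not a measured property of Tachyon, and the function names are hypothetical.

```python
def stopped_fraction(pause_us: float, interval_us: float) -> float:
    """Fraction of wall-clock time the target spends suspended."""
    return min(1.0, pause_us / interval_us)


def slowdown_factor(pause_us: float, interval_us: float) -> float:
    """How many times longer the target needs to do the same work."""
    running = 1.0 - stopped_fraction(pause_us, interval_us)
    return float("inf") if running == 0 else 1.0 / running


# ASSUMED_PAUSE_US is a hypothetical suspend + read + resume cost.
ASSUMED_PAUSE_US = 25.0
for rate_hz in (20_000, 10_000, 1_000):
    interval_us = 1_000_000 / rate_hz
    fraction = stopped_fraction(ASSUMED_PAUSE_US, interval_us)
    factor = slowdown_factor(ASSUMED_PAUSE_US, interval_us)
    print(f"{rate_hz:>6} Hz: stopped {fraction:.1%} of the time, "
          f"about {factor:.2f}x slower")
```

Whatever the real pause cost is on a given system, the model shows why the slowdown grows quickly as the sampling interval approaches the time needed to suspend and resume the process.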
Special frames¶
The profiler can inject artificial frames into the captured stacks to provide additional context about what the interpreter is doing at the moment each sample is taken. These synthetic frames help distinguish different types of execution that would otherwise be invisible.
The --native option adds <native> frames to indicate when Python has
called into C code (extension modules, built-in functions, or the interpreter
itself):
python -m profiling.sampling run --native script.py
These frames help distinguish time spent in Python code versus time spent in native libraries. Without this option, native code execution appears as time in the Python function that made the call. This is useful when optimizing code that makes heavy use of C extensions like NumPy or database drivers.
By default, the profiler includes <GC> frames when garbage collection is
active. The --no-gc option suppresses these frames:
python -m profiling.sampling run --no-gc script.py
GC frames help identify programs where garbage collection consumes significant
time, which may indicate memory allocation patterns worth optimizing. If you
see substantial time in <GC> frames, consider investigating object
allocation rates or using object pooling.
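If a profile shows substantial time in <GC> frames, one way to cross-check from inside the program is the standard `gc.callbacks` hook, which the collector invokes at the start and end of each collection. This is an in-process sketch that is separate from Tachyon; `GCTimer` and `make_cycles` are hypothetical names.

```python
import gc
import time


class GCTimer:
    """Accumulate time spent in garbage collection using gc.callbacks."""

    def __init__(self):
        self.collections = 0
        self.total_seconds = 0.0
        self._started = None

    def __call__(self, phase, info):
        # The collector calls this with phase "start" and then "stop".
        if phase == "start":
            self._started = time.perf_counter()
        elif phase == "stop" and self._started is not None:
            self.total_seconds += time.perf_counter() - self._started
            self.collections += 1
            self._started = None

    def __enter__(self):
        gc.callbacks.append(self)
        return self

    def __exit__(self, *exc_info):
        gc.callbacks.remove(self)


def make_cycles(count):
    """Create reference cycles so the collector has work to do."""
    for _ in range(count):
        node = []
        node.append(node)


with GCTimer() as timer:
    make_cycles(100_000)
    gc.collect()

print(f"{timer.collections} collections, "
      f"{timer.total_seconds * 1000:.2f} ms in GC")
```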
Opcode-aware profiling¶
The --opcodes option enables instruction-level profiling that captures
which Python bytecode instructions are executing at each sample:
python -m profiling.sampling run --opcodes --flamegraph script.py
This feature provides visibility into Python’s bytecode execution, including
adaptive specialization optimizations. When a generic instruction like
LOAD_ATTR is specialized at runtime into a more efficient variant like
LOAD_ATTR_INSTANCE_VALUE, the profiler shows both the specialized name
and the base instruction.
Opcode information appears in several output formats:
Flame graphs: Hovering over a frame displays a tooltip with a bytecode instruction breakdown, showing which opcodes consumed time in that function
Heatmap: Expandable bytecode panels per source line show instruction breakdown with specialization percentages
Live mode: An opcode panel shows instruction-level statistics for the selected function, accessible via keyboard navigation
Gecko format: Opcode transitions are emitted as interval markers in the Firefox Profiler timeline
This level of detail is particularly useful for:
Understanding the performance impact of Python’s adaptive specialization
Identifying hot bytecode instructions that might benefit from optimization
Analyzing the effectiveness of different code patterns at the instruction level
Debugging performance issues that occur at the bytecode level
The --opcodes option is compatible with --live, --flamegraph,
--heatmap, and