Merge tag 'perf-tools-for-v6.15-2025-03-27' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools

Pull perf tools updates from Namhyung Kim:
 "perf record:

   - Introduce latency profiling using scheduler information.

     Latency profiling shows the impact on wall-clock time rather than
     CPU time. By tracking context switches, it can weight samples and
     find which parts of the code contributed most to the execution
     latency.

     The value (period) of each sample is weighted by dividing it by
     the number of parallel executions at that moment. The parallelism
     is tracked in perf report using sched-switch records. This
     reduces the portion that ran in parallel and in turn increases
     the portion of serial execution.
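
     For example (illustrative numbers, not taken from the profile
     below): a sample with a period of 1,000,000 recorded while four
     tasks of the workload were running in parallel counts as
     1,000,000 / 4 = 250,000 toward latency, while a sample taken in
     a fully serial phase keeps its whole period.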

     For now, it's limited to profiling processes, IOW system-wide
     profiling is not supported. You can add the --latency option to
     enable this.

       $ perf record --latency -- make -C tools/perf

     I ran the above command for a perf build, which internally passes
     the -j option to make with the number of CPUs in the system.
     Normally it'd show something like below:

       $ perf report -F overhead,comm
       ...
       #
       # Overhead  Command
       # ........  ...............
       #
           78.97%  cc1
            6.54%  python3
            4.21%  shellcheck
            3.28%  ld
            1.80%  as
            1.37%  cc1plus
            0.80%  sh
            0.62%  clang
            0.56%  gcc
            0.44%  perl
            0.39%  make
             ...

     cc1 takes around 80% of the overhead as it's the actual compiler.
     However, it runs in parallel, so its contribution to latency may
     be less than that. Now perf report will show both overhead and
     latency (if --latency was given at record time) like below:

       $ perf report -s comm
       ...
       #
       # Overhead   Latency  Command
       # ........  ........  ...............
       #
           78.97%    48.66%  cc1
            6.54%    25.68%  python3
            4.21%     0.39%  shellcheck
            3.28%    13.70%  ld
            1.80%     2.56%  as
            1.37%     3.08%  cc1plus
            0.80%     0.98%  sh
            0.62%     0.61%  clang
            0.56%     0.33%  gcc
            0.44%     1.71%  perl
            0.39%     0.83%  make
             ...

     You can see the latency of cc1 goes down to around 50%, while
     python3 and ld contribute a lot more latency than their overhead
     would suggest. You can use the --latency option in perf report
     to get the same result ordered by latency.

       $ perf report --latency -s comm

  perf report:

   - As a side effect of the latency profiling work, there is a new
     output field 'latency' and a new sort key 'parallelism'. Below
     is a result from my system with 64 CPUs. The build was
     well-parallelized but contained some serial portions.

       $ perf report -s parallelism
       ...
       #
       # Overhead   Latency  Parallelism
       # ........  ........  ...........
       #
           16.95%     1.54%           62
           13.38%     1.24%           61
           12.50%    70.47%            1
           11.81%     1.06%           63
            7.59%     0.71%           60
            4.33%    12.20%            2
            3.41%     0.33%           59
            2.05%     0.18%           64
            1.75%     1.09%            9
            1.64%     1.85%            5
             ...

     Note how the fully serial portion (parallelism 1) accounts for
     only 12.50% of the overhead but 70.47% of the latency.

   - Support Fedora mini-debuginfo, which is an LZMA-compressed symbol
     table inside the ".gnu_debugdata" ELF section.
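
     If you want to look at such a section by hand, something like the
     following should work (a sketch; /usr/bin/ls is just an example
     binary that may carry the section on Fedora):

       $ objcopy -O binary --only-section=.gnu_debugdata \
               /usr/bin/ls mini.xz
       $ xz -d mini.xz             # produces 'mini', an ELF carrying
                                   # the stripped-out symbol table
       $ readelf -s mini | head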

  perf annotate:

   - Add --code-with-type option to enable data-type profiling with the
     usual annotate output.

     Instead of focusing on the data structure, it shows the code
     annotation together with the data type an instruction accesses
     when it refers to a memory location (and the target data type
     could be resolved). Currently it only works with --stdio.

       $ perf annotate --stdio --code-with-type
       ...
        Percent |      Source code & Disassembly of vmlinux for cpu/mem-loads,ldlat=30/pp (18 samples, percent: local period)
       ----------------------------------------------------------------------------------------------------------------------
                : 0                0xffffffff81050610 <__fdget>:
           0.00 :   ffffffff81050610:        callq   0xffffffff81c01b80 <__fentry__>           # data-type: (stack operation)
           0.00 :   ffffffff81050615:        pushq   %rbp              # data-type: (stack operation)
           0.00 :   ffffffff81050616:        movq    %rsp, %rbp
           0.00 :   ffffffff81050619:        pushq   %r15              # data-type: (stack operation)
           0.00 :   ffffffff8105061b:        pushq   %r14              # data-type: (stack operation)
           0.00 :   ffffffff8105061d:        pushq   %rbx              # data-type: (stack operation)
           0.00 :   ffffffff8105061e:        subq    $0x10, %rsp
           0.00 :   ffffffff81050622:        movl    %edi, %ebx
           0.00 :   ffffffff81050624:        movq    %gs:0x7efc4814(%rip), %rax  # 0x14e40 <current_task>              # data-type: struct task_struct* +0
           0.00 :   ffffffff8105062c:        movq    0x8d0(%rax), %r14         # data-type: struct task_struct +0x8d0 (files)
           0.00 :   ffffffff81050633:        movl    (%r14), %eax              # data-type: struct files_struct +0 (count.counter)
           0.00 :   ffffffff81050636:        cmpl    $0x1, %eax
           0.00 :   ffffffff81050639:        je      0xffffffff810506a9 <__fdget+0x99>
           0.00 :   ffffffff8105063b:        movq    0x20(%r14), %rcx          # data-type: struct files_struct +0x20 (fdt)
           0.00 :   ffffffff8105063f:        movl    (%rcx), %eax              # data-type: struct fdtable +0 (max_fds)
           0.00 :   ffffffff81050641:        cmpl    %ebx, %eax
           0.00 :   ffffffff81050643:        jbe     0xffffffff810506ef <__fdget+0xdf>
           0.00 :   ffffffff81050649:        movl    %ebx, %r15d
           5.56 :   ffffffff8105064c:        movq    0x8(%rcx), %rdx           # data-type: struct fdtable +0x8 (fd)
          ...

     The "# data-type:" part was added with this change. The first few
     entries are not very interesting. But later you can it accesses a
     couple of fields in the task_struct, files_struct and fdtable.

  perf trace:

   - Support syscall tracing for different ABIs. For example, it can
     trace system calls for 32-bit applications on a 64-bit kernel
     transparently.
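
     For instance ('hello32' being a hypothetical 32-bit test binary):

       $ perf trace -- ./hello32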

   - Add the --summary-mode=total option to show a global syscall
     summary. The default is 'thread', which shows a per-thread
     syscall summary.
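
     For example, to get one summary for the whole workload ('sleep 1'
     is just a placeholder workload):

       $ perf trace -s --summary-mode=total -- sleep 1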

  Python support:

   - Add more interfaces to the 'perf' module to parse events and to
     configure, enable or disable the event list properly, so that
     basic functionality can be implemented purely in Python. There
     is example code for these new interfaces in python/tracepoint.py.

   - Add mypy and pylint support to enable build-time checking. Fix
     some code based on the findings from these tools.

  Internals:

   - Introduce the io_dir__readdir() API to make directory traversal
     (usually of procfs or sysfs) efficient with a smaller memory
     footprint.

  JSON vendor events:

   - Add events and metrics for ARM Neoverse N3 and V3

   - Update events and metrics on various Intel CPUs

   - Add/update events for a number of SiFive processors"

* tag 'perf-tools-for-v6.15-2025-03-27' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (229 commits)
  perf bpf-filter: Fix a parsing error with comma
  perf report: Fix a memory leak for perf_env on AMD
  perf trace: Fix wrong size to bpf_map__update_elem call
  perf tools: annotate asm_pure_loop.S
  perf python: Fix setup.py mypy errors
  perf test: Address attr.py mypy error
  perf build: Add pylint build tests
  perf build: Add mypy build tests
  perf build: Rename TEST_LOGS to SHELL_TEST_LOGS
  tools/build: Don't pass test log files to linker
  perf bench sched pipe: fix enforced blocking reads in worker_thread
  perf tools: Fix is_compat_mode build break in ppc64
  perf build: filter all combinations of -flto for libperl
  perf vendor events arm64 AmpereOneX: Fix frontend_bound calculation
  perf vendor events arm64: AmpereOne/AmpereOneX: Mark LD_RETIRED impacted by errata
  perf trace: Fix evlist memory leak
  perf trace: Fix BTF memory leak
  perf trace: Make syscall table stable
  perf syscalltbl: Mask off ABI type for MIPS system calls
  perf build: Remove Makefile.syscalls
  ...