Expand section on profilers (perf and VTune) #381
I wonder whether the presented content is too detailed. But I would let other people comment on this.
You can always skip what you don't need, but the content is useful for people just looking at the slides as a reference. That said, @hageboeck had the same concern.
Concerning the level of detail, although it's a bit too much for perf, I would keep the slides on list/stat/record/report, as I find it nice to have one feature per slide. Maybe a couple of complex examples can be removed, but on the other hand, it's a nice reference and we do not need to go through all the details when we give the course.
\begin{minted}{shell-session}
$ perf
 usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]
 The most commonly used perf commands are:
   annotate   Read perf.data and display annotated code
   c2c        Shared Data C2C/HITM Analyzer.
   config     Get and set variables in a configuration file.
   diff       Read perf.data and display the differential profile
   evlist     List the event names in a perf.data file
   list       List all symbolic event types
   mem        Profile memory accesses
   record     Run a command and record its profile into perf.data
   report     Read perf.data and display the profile
   sched      Tool to trace/measure scheduler properties (latencies)
   script     Read perf.data and display trace output
   stat       Run command and gather performance counter statistics
   top        System profiling tool.
   version    display the version of perf binary
   probe      Define new dynamic tracepoints
   trace      strace inspired tool
 See 'perf help COMMAND' for more information on a specific command.
\end{minted}
\end{block}
}
\end{frame}
Is this useful? I think I would drop it.
I use a slide similar to this one to give a general overview of perf in my own presentations, mentioning that there are more commands than the ones I cover. If you don't want to go into details, this could be a useful slide for that. However, other than that, it's probably fine to drop. I did have to shorten the description of the commands to fit in the slide anyway, so this is not quite what you'd get by running perf without arguments.
On first thought I also found this too much. On second thought, yeah, why shouldn't we leave an overview here?
The point is that this slide would be systematically skipped when you present. So if it's a pure reference, then let's put it in a reference section at the very end. Otherwise, let's drop it.
"mentioning that there are more commands than the ones I cover"
Useful indeed, but then I would just mention that there are a lot of commands rather than list them.
It seems that most people don't think it's useful, so I will drop this slide.
\begin{frame}[fragile]
  \frametitle{Intel VTune Profiler}
  \centering
  \includegraphics[width=0.75\textwidth]{tools/vtune.png}
  \begin{itemize}
    \item Very powerful GUI-based profiler for Intel CPUs and GPUs
    \item Now free to use with
      \href{https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html}{Intel oneAPI Base Toolkit} or
      \href{https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html}{standalone}
    \item See the \href{https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/}
      {official online documentation} for more information
  \end{itemize}
\end{frame}
Does the picture bring something for people who do not know the tool? I would maybe replace it with a bullet highlighting the things it can do which perf cannot (if any), and another giving the downsides.
Since VTune is a graphical tool, I thought it would be nice to show what it looks like when you open it. You can use the picture to show the types of analyses that VTune is able to do instead of a bullet list, and just tell people when presenting about the extra features it has over perf. For detailed usage information, I'd point people to the online docs. One thing I'd mention while presenting is the Top-Down Microarchitecture Analysis, which is a very good method to find bottlenecks. While perf can also do it, it cannot show you detailed information for each symbol like VTune does, and the annotation of source code by VTune is also a lot easier to use than perf's.
We could also link a talk from Ahmad Yasin, who was behind the creation of the Top-Down Microarchitecture Analysis Method at Intel. It's a very nice talk.
I would even like to have more pictures. E.g. I love the microarchitecture analysis with the pipeline visualization. Or what a general hierarchical profile looks like. Or the pane showing contention between threads. Or even better, a live demonstration :)
I do not care about the pictures themselves. I care that, if there is a picture, it's understandable, that is, that we explain what appears in it. In this case, a LOT of explanation is missing, and I'm not sure we actually want to include it.
Are any changes needed? From my side this should be ready for merging.
I read through the example commands again and found them to be quite hard to understand for someone who is only using perf casually. Maybe you can reword or simplify a few of those.
$ # Sample CPU stack traces (via frame pointers), at 100 Hertz, for 10s:
$ perf record -F 100 -g -- sleep 10
Is the sleep 10 here the command to be profiled or a trick to profile something system-wide? Sorry for my limited knowledge.
Ah, good catch, I did intend to have -a to capture things system-wide, but the command as is records data only for the sleep command.
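For reference, a minimal sketch of the system-wide variant that adding -a would give. The helper name perf_systemwide_cmd is hypothetical (it is not part of perf), and it only prints the command instead of running it, since system-wide sampling usually needs elevated privileges:

```shell
#!/bin/sh
# Hypothetical helper (not part of perf): print a system-wide sampling
# command so it can be reviewed before being run by hand.
perf_systemwide_cmd() {
    freq=${1:-100}      # sampling frequency in Hz (-F)
    seconds=${2:-10}    # how long to keep recording
    # -a samples all CPUs, -g records call graphs; 'sleep' is only there
    # to bound the recording time, it is not the profiling target.
    echo "perf record -F $freq -a -g -- sleep $seconds"
}

perf_systemwide_cmd 100 10
```

With -a present, the trailing sleep only determines when the recording stops.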
$ # Sample stack traces for PID using DWARF to unwind stacks, for 10s:
$ perf record -p <PID> --call-graph=dwarf -- sleep 10
Here, it is even more surprising for me. The PID should give the process to profile. What does the sleep 10 do? Is there no flag to tell perf to count 10s? The current command line is surprising to me.
Yes, the sleep command is only used to give perf the start/stop timings (using sleep this way is very common with perf, as there's no other easy way to tell perf when to stop). The profiled process is actually the one given by <PID>.
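To make the two roles explicit, here is a small sketch that separates the target (the PID) from the timer (the sleep). The helper name perf_attach_cmd and the PID 1234 are placeholders for illustration only, and the command is printed rather than executed:

```shell
#!/bin/sh
# Hypothetical helper (not part of perf): build the command that attaches
# to an already running process for a fixed amount of time.
perf_attach_cmd() {
    pid=$1
    seconds=${2:-10}
    # Reject anything that is not a plain number, to avoid typos.
    case $pid in
        ''|*[!0-9]*)
            echo "usage: perf_attach_cmd PID [SECONDS]" >&2
            return 1
            ;;
    esac
    # -p selects the process to profile; 'sleep' only acts as a timer.
    echo "perf record -p $pid --call-graph=dwarf -- sleep $seconds"
}

# 1234 is a placeholder PID.
perf_attach_cmd 1234 10
```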
And here we suppose that people are at ease with frame pointers (previous line) and DWARF. That would require another set of slides by itself. I am more and more convinced that we should simplify drastically and give only one slide of examples, with one line each for list/stat/record/report.
I tend to agree with @sponce. Maybe I'm assuming too much prior knowledge that the average student doesn't/won't have. I guess in that case, showing just how to do the simplest case, which is to collect and view a report just using the default of cycles for the event, is good enough for the course, and we can point people to other sets of slides when more advanced material is needed.
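As a rough sketch of that simplest case: record with the defaults, then open the report. The helper name perf_basic_workflow and the program name ./myprogram are placeholders, and the commands are only printed so that nothing is run by accident:

```shell
#!/bin/sh
# Hypothetical helper (not part of perf): print the basic record/report
# workflow for a program, without running anything.
perf_basic_workflow() {
    # With no -e option, perf record samples its default event
    # (cycles, as mentioned in the discussion above).
    echo "perf record -- $*"
    # perf report reads perf.data from the current directory by default.
    echo "perf report"
}

# ./myprogram is a placeholder for the program to profile.
perf_basic_workflow ./myprogram
```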
But I'm sure HSF people would love to create a full course dedicated to perf. And I promise I would be one of your first students :-)
I've given a few talks here and there, so I have many slides on perf (not using LaTeX, though). I could think about converting the material I have into a course on performance analysis, and including other less known tools, like bpftrace, uftrace, bcc, etc. That said, perf itself is more than enough for a full course, as I doubt many people have used perf data, perf c2c, perf mem, and other less well known commands. Plus there is the post-processing and data visualization, which is also interesting (gprof2dot, flamegraph, d3js).
{ \scriptsize
\begin{block}{}
\begin{minted}{shell-session}
$ # Sample on-CPU functions for the specified command, at 100 Hertz:
What is an on-CPU function? Does this relate to heterogeneous computing? In the sense that you don't profile GPU functions?
I just tried that command and it counted cycles. So maybe:
- $ # Sample on-CPU functions for the specified command, at 100 Hertz:
+ $ # Sample cycles for the specified command, at 100 Hertz:
perf cannot take samples when the process is not running. That's why it's usually referred to as on-CPU sampling: samples are taken only when threads are scheduled on some CPU. However, you can also trace scheduling events to try to see what is going on when threads are off-CPU (i.e. being scheduled out, then back in). See https://www.brendangregg.com/offcpuanalysis.html for more information.
I am starting to wonder whether it's worth keeping examples that cannot be understood simply. The explanation you just gave is already far above the expected knowledge of the people attending the course. In order to explain that, you would need a whole set of slides starting with "thread scheduling", "sampling", etc.
$ # Sample stack traces for PID using DWARF to unwind stacks, for 10s:
$ perf record -p <PID> --call-graph=dwarf -- sleep 10

$ # Precise on-CPU user stack traces (no skid) using PEBS (Intel CPUs):
What is an on-CPU stack trace? And what is skid? And what's PEBS? :)
I am asking because a future presenter of these slides might not know this. Is all the information relevant?
Maybe we need a slide introducing some terms of art and defining the acronyms. Or a glossary.
I explained on-CPU above. As for skid: there is a margin of error in attributing samples to instructions, as a number of instructions are in flight in parallel on the CPU at any given time. This error is called the skid of the sampling (see more information here). PEBS stands for Precise Event Based Sampling, and is a feature of Intel CPUs that allows sampling with low or no skid. The rough equivalent on AMD CPUs is IBS, or Instruction-Based Sampling.
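In perf, the requested precision is expressed with the p event modifier, described in man perf-list: each extra p asks for less skid, up to three. A small sketch of how such an event name is put together follows. The helper name precise_event is hypothetical, and whether a given level is actually supported depends on the CPU:

```shell
#!/bin/sh
# Hypothetical helper (not part of perf): append 0 to 3 'p' modifiers
# to an event name, e.g. "cycles" at level 2 becomes "cycles:pp".
precise_event() {
    event=$1
    level=${2:-0}
    case $level in
        0|1|2|3) ;;
        *) echo "precise level must be 0, 1, 2 or 3" >&2; return 1 ;;
    esac
    suffix=""
    i=0
    while [ "$i" -lt "$level" ]; do
        suffix="${suffix}p"
        i=$((i + 1))
    done
    if [ -n "$suffix" ]; then
        echo "${event}:${suffix}"
    else
        echo "${event}"
    fi
}

# Print (not run) a record command asking for zero skid on cycles.
echo "perf record -e $(precise_event cycles 2) -- <command>"
```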
I am asking because a future presenter of these slides might not know this. Is all the information relevant?
I hope that someone presenting perf to others will read the manual pages and understand these examples ahead of time. I tried to give a general overview of how to do several different things with each of the most important commands, so of course I think that what I added is relevant information for people trying to use perf. Maybe this is all too complicated for a C++ course and we should really just point people to the actual documentation or other material instead. I'm starting to think that that will be easier.
Maybe this is all too complicated for a C++ course
Do we need a tools section in the expert part? That could be a solution.
Maybe a tools course, separate from a C++ course. VTune, perf, valgrind, can all be used for much more than just C++, so we can bundle this together with bash, coreutils, and some other command line tools that are used very often and make a new course.
$ # Sample CPU stack traces using Instruction-based sampling (AMD CPUs):
$ # (Note that you need to use system-wide sampling for IBS on AMD CPUs)
$ perf record -a -g -e cycles:pp -- <command>
Isn't -a system-wide sampling? Why do I need a <command> then? What is IBS?
IBS is explained above. The requirement to use system-wide sampling is a hardware requirement when using IBS on AMD CPUs. This is also explained in perf's documentation (see man perf-list). I added this example to show how to use event modifiers and to remind people that IBS requires system-wide sampling to work.
I could not reply directly to the comment asking for more VTune pictures, so I am replying here. Although I would like to, I unfortunately don't have much more time to invest in improving the slides. I really need to go back to work on Geant4 and XRootD now. In any case, I think the online documentation of VTune is really good already.
I would like @sponce and @hageboeck to comment on the complexity of the presented material. For my part, I am fine enough to merge. If I had to present this material, I would probably skip a third of the commands because my knowledge about them is insufficient.
I'm in general not at ease with this one. On one hand, it's already far too complex; on the other hand, a lot of explanation is missing for concepts that are used without being presented. I can see two ways out: adding more, but then splitting into a standard part and an expert one; or simplifying, keeping really only the core, as we did for gdb, in 4 slides total (the first 2, with the second one split, and one example slide).
Ok, I think it's better to go with the second route of simplifying things a bit and providing examples only for the more basic usage of perf, and breaking the first slide into two. I will update this pull request in the next few days when I find the time for it.
I've focused more on perf than VTune, but this is intended to close #43. I think the online documentation for VTune is good enough that we can just point students there. However, if you think the VTune section should be expanded further, let me know.