WEBVTT

00:00.000 --> 00:15.540
I'm Christoph Kluge, main developer of the HPC portal and I will show you the tool or

00:15.540 --> 00:23.500
my tool from the perspective of a user and of a manager or PI.

00:23.500 --> 00:29.780
The main entry point that some of you already know is the login page in which it's required

00:29.780 --> 00:41.380
to log in via SSO, either with the IDM login at the FAU or some other institution of your

00:43.060 --> 00:46.900
affiliation. So it should also work

00:49.780 --> 00:52.980
the other way around. Now I'm logged in.

00:52.980 --> 01:03.540
And the first view you have is the user view. When you log in for the first time, you will have

01:04.740 --> 01:10.580
the email subscription box on the right, for the first and only time, in this view.

01:11.540 --> 01:16.260
where you can then select the recommended NHR system announcements,

01:16.260 --> 01:22.420
the general NHR users newsletter, and whether you want to subscribe permanently even after your account

01:22.420 --> 01:29.220
or your project has reached its end of validity. I will just update these.

01:32.580 --> 01:38.180
You can change the subscriptions later in your profile which I will show in a short minute.

01:38.180 --> 01:47.700
The localization hangs a bit. Yeah, for your user view you have two tabs here. You have your accounts, which are

01:48.660 --> 01:56.660
linked to your user, and our user HPC Cafe already has one account in the project

01:56.660 --> 02:08.100
Coffee Quality Project. And if you have multiple accounts, they will be listed; we will see that later.

02:08.900 --> 02:16.340
You can see your account details, the general ID, your home path, which shell you use, and the state

02:16.340 --> 02:22.740
and your validity date.

02:22.740 --> 02:28.660
In this case, the 30th of January.

02:30.820 --> 02:39.300
This account already has a public SSH key uploaded. I will demonstrate that also in a second.

02:39.940 --> 02:46.340
It just looks like this if you've already done it. It's a fingerprint, the MD5 fingerprint

02:46.340 --> 02:53.300
of your key, and an alias for yourself to identify the key. These will be used in the cluster

02:53.300 --> 03:04.580
environment for login. This new button directly links your account to Cluster Cockpit.

03:05.540 --> 03:12.580
I will show that also later because I'm on the local page right now. But with a click here you

03:12.580 --> 03:18.660
will be forwarded to Cluster Cockpit, all automatically logged in to your respective account,

03:19.300 --> 03:23.220
and will be shown your jobs in the job list.

03:25.700 --> 03:33.140
This account also obviously has already some computation done in the last year and this year.

03:33.140 --> 03:41.220
So we have some GPU hours, we have some core hours on Fritz, and it's just a listing of the usage done

03:41.220 --> 03:50.900
by this account, or your account, which you can also plot, for usage from 2020 to 2025 at the moment.

03:53.380 --> 04:03.700
Again, this localization hangs a bit. But we have some Fritz hours used up here, and some

04:03.700 --> 04:16.660
Alex as well. Hard to see, though, because it's very small bars here; some people calculate in

04:16.660 --> 04:23.940
the 100,000s of hours, and on the second cluster they just have a few hundred for testing. You can't

04:23.940 --> 04:34.820
really see these, but you can click on the legend to toggle the data bars as needed.

04:38.260 --> 04:47.700
The second tab is the invitation tab, which is the main path to create accounts in this tool.

04:47.700 --> 04:55.700
In this case, our user HPC Cafe already has one invitation accepted, thus it is archived,

04:56.020 --> 05:06.100
and it was the self-invitation. So a manager can create his or her own account by inviting him or

05:06.100 --> 05:19.380
herself. State accepted, created yesterday. No more new invitations. As a manager or a PI,

05:19.380 --> 05:26.180
you will have a second tab called the management tab where you can view the projects in which you

05:26.180 --> 05:33.140
are either PI or manager. So in this case, we have a fully blown project already here,

05:33.140 --> 05:39.380
the coffee quality project in which our user is the principal investigator. We have not yet

05:40.100 --> 05:47.380
any managers, so it's his own project. PIs are always in charge of the coffee quality, as we know.

05:49.460 --> 05:57.700
Project resources can be viewed in this dropdown. Again, the usage project-wide this month,

05:57.700 --> 06:04.260
this year, and over all the years, and also the quota which is allocated to this project.

06:05.220 --> 06:13.860
You can as well trend this, plot this. It's the same plugin, the same tool, only that this is

06:13.860 --> 06:21.940
summed up over all the accounts of this project. So this is the total GPU hours or total core hours

06:21.940 --> 06:28.900
for this project. Again, I can plot out Alex to better see the small bars for Fritz.

06:33.540 --> 06:39.140
On the right side, we have the accounts for this project. So we have the account we just saw

06:39.140 --> 06:46.820
in the user tab. So it's the account of the PI himself. Again, we can view the usages and plot

06:46.820 --> 06:53.940
the usages of this account. We can edit the account, in which case we can change the validity

06:54.820 --> 07:03.620
and view the general information of this account. But you can also view the other accounts. So we

07:03.620 --> 07:13.220
have the demo user, IMS, which also did some computation, far less. So we are on the 100s.

07:13.220 --> 07:20.180
We can also plot this. We can see it, the sum of the last year. And we also have one account,

07:20.180 --> 07:27.220
which validity ended last month. So it is automatically moved to archived accounts.

07:27.220 --> 07:36.100
So it ended on New Year's Eve, but we can still see and plot the usage of this account. If you

07:36.100 --> 07:48.420
want to have some old data or just want to inform yourself. And yeah, we can also edit this.

07:50.660 --> 07:57.780
And when we do want to create an invitation as PI, I can show it here.

07:57.780 --> 08:06.980
The third tab is the invitation tab. We have three archived invitations: the self-invitation

08:06.980 --> 08:14.180
and the two invitations for the two active users or the one active user and the one

08:15.380 --> 08:23.860
archived user.

08:23.860 --> 08:32.900
And we have one pending invitation for a future user. And if in some case, the email isn't

08:32.900 --> 08:40.660
received by this user, you can resend the mail here. It is important to note that the email to

08:40.660 --> 08:49.540
send the invitation to is the one which is linked to the account because the matching of invitation

08:49.540 --> 09:00.260
to the user is done by matching the email. So in this case, we have a second project also,

09:00.260 --> 09:06.100
which is completely fresh and new. So it's just another PI, but we are manager in this case.

09:06.100 --> 09:11.220
So we can also see it, but we have no accounts yet. So nobody bothered to create accounts.

09:11.220 --> 09:20.100
And we will just do that. So we have a new invitation for the email address of HPC Cafe.

09:21.940 --> 09:27.700
It's just a random email in my dev database. So it won't actually send any email,

09:27.700 --> 09:40.180
but we will see the invitation in a minute. So for project and message, please join for KIG.

09:42.740 --> 09:48.980
For validity, we already set the validity of the account to be created. So it's end of

09:48.980 --> 09:57.860
January in this case. And then I can send this invitation. The error is only that it couldn't be sent.

09:58.500 --> 10:06.020
Well, that shouldn't appear in the live situation. Refresh invitations. And now we have one pending

10:06.020 --> 10:15.940
invitation for ourselves, which should now also appear in our user tab, with an invitation for

10:15.940 --> 10:25.700
the Kuchengeschmack project. Please join for KIG. I gladly will do so. Now we have a second

10:25.700 --> 10:35.220
archived invitation and a second account on the left side. The ID will automatically be generated

10:35.220 --> 10:39.780
and will be your user ID on the machines as well. So in this case, A100AC10.

10:39.780 --> 10:50.340
To access the clusters, we now have to upload an SSH key. So we will do "Add new SSH key".

10:50.900 --> 10:59.380
Alias is the KIG public key. As I said, it's just the alias for yourself.

10:59.380 --> 11:15.300
And let's have some demo key here. Submit. Let's see. Our key appears to be too short

11:15.300 --> 11:20.980
for our internal security checks. So let's remedy that.

11:20.980 --> 11:28.180
There's some cases in which a warning will pop up, either the key's too short or some

11:28.180 --> 11:34.180
regex will fail. And then it will tell you what the problem is.

11:37.780 --> 11:43.620
This key should be working. And now we just have an information box because the distribution of

11:43.620 --> 11:51.220
this key can take up to two hours until it is accessible through all the clusters.

11:52.740 --> 11:58.900
And now it's linked here. It will be distributed on the clusters. And if there ever is the

11:58.900 --> 12:04.900
necessity, you can either delete it or change the alias.
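As an aside, a key that passes the length check can be generated locally before uploading; a minimal sketch, assuming OpenSSH is installed (the file name and comment are hypothetical examples):

```shell
# Generate an ed25519 key (modern default, never too short for length checks);
# if RSA is required, use at least 4096 bits instead: ssh-keygen -t rsa -b 4096
ssh-keygen -t ed25519 -f "$HOME/.ssh/id_ed25519_hpc" -C "hpc-portal-key"

# The portal expects the PUBLIC part, i.e. the .pub file:
cat "$HOME/.ssh/id_ed25519_hpc.pub"
```

The private key (the file without `.pub`) stays on your machine and is never uploaded.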

12:04.900 --> 12:12.100
And back on the project management, we shall now have an account for the Kuchengeschmack project

12:14.260 --> 12:22.340
and an archived invitation. If you need to import or invite more than one person, you can

12:22.340 --> 12:27.300
set the switch here, invite multiple email addresses, and then you can just

12:27.300 --> 12:35.700
list multiple email addresses here.

12:37.700 --> 12:41.700
It will also check for some typos.

12:41.700 --> 12:48.100
Now, these two demo users are the users for our coffee project, but we wanted to

12:48.100 --> 12:56.100
invite them for cake as well. And we will stop that there.

12:56.100 --> 13:28.680
[unintelligible]

13:28.680 --> 13:34.480
I probably need to look into that, but there should have been two pending invitations now and

13:34.480 --> 13:42.560
not only one, but in principle it works.

13:42.560 --> 13:50.160
And this user also now sees the invitation as I have shown before.

13:50.160 --> 13:56.040
Last but not least, let's switch back to the live system now.

13:56.040 --> 14:03.240
On the main page is my user account here, my path, my key.

14:03.240 --> 14:07.160
And I have no usage data, but I want to go to cluster cockpit.

14:07.160 --> 14:14.360
So I just click here and I will be directly forwarded, logged in as my user account.

14:14.360 --> 14:16.080
But I didn't calculate anything.

14:16.080 --> 14:20.720
So I have no jobs all across the board.

14:20.720 --> 14:32.960
And this is the handover for showing Cluster Cockpit, if Jan is already here.

14:32.960 --> 14:35.360
So should I take over?

14:35.360 --> 14:37.120
That should be the general idea.

14:37.120 --> 14:48.280
Before switching, Christoph, can you go on the user profile first of all to show how

14:48.280 --> 14:52.120
to change subscriptions?

14:52.120 --> 14:58.400
And if you do not see an invitation, it's probably that the invitation was for the wrong

14:58.400 --> 15:00.200
email address.

15:00.200 --> 15:08.440
And if you go to your profile, you will see which email address is transferred by the

15:08.440 --> 15:11.080
SSO login.

15:11.080 --> 15:20.000
So for FAU people, that always will be something @fau.de, no legacy @physik.uni-erlangen.de

15:20.000 --> 15:24.080
or whatever.

15:24.080 --> 15:32.720
Invitations to any Gmail account will not work because at least no German university

15:32.720 --> 15:40.000
I'm aware of uses Gmail email addresses.

15:40.000 --> 15:46.600
But you find the correct one here.

15:46.600 --> 15:51.120
And also, as I said, when you ever want to change your email subscriptions, once you've

15:51.120 --> 15:54.240
done it in the user tab, it will be shown here.

15:54.240 --> 15:57.720
You can just switch.

15:57.720 --> 16:07.640
You set to default or just disable anything.

16:07.640 --> 16:10.040
I also have one or two questions.

16:10.040 --> 16:11.040
I'm not sure.

16:11.040 --> 16:14.040
Should I ask now or until the end?

16:14.040 --> 16:15.440
No, go ahead.

16:15.440 --> 16:16.440
Ask.

16:16.440 --> 16:17.440
Right away.

16:17.440 --> 16:18.440
So I'm from Coburg.

16:18.440 --> 16:19.440
I'm not from FAU.

16:19.440 --> 16:22.280
We are currently using the test account.

16:22.280 --> 16:26.120
I got a bit lost with the roles that exist.

16:26.120 --> 16:27.600
So you said something there.

16:27.600 --> 16:32.880
There are users, there are projects, and there are PI accounts.

16:32.880 --> 16:33.880
How do they relate?

16:33.880 --> 16:42.600
A user can be in multiple projects, a user can create a project or is this only a PI

16:42.600 --> 16:43.800
feature?

16:43.800 --> 16:50.320
So the users are the principal persons who log in.

16:50.320 --> 16:54.640
So without any role, after login, you have a user role.

16:54.640 --> 16:56.400
You have no accounts.

16:56.400 --> 16:59.560
You have no further tabs than home and user.

16:59.560 --> 17:00.560
Exactly.

17:00.560 --> 17:08.360
After application and review of the project, the admins will set up the project in a not

17:08.360 --> 17:17.200
shown admin view and declare the PI, which will then see a management tab and his or

17:17.200 --> 17:20.680
her project in this list.

17:20.680 --> 17:31.480
And after that, the PI can send out invitations for users.

17:31.480 --> 17:41.240
And it's planned that the PI also can declare the managers of the projects by him or herself,

17:41.240 --> 17:45.800
which is at the moment done via the admin view as well.

17:45.800 --> 17:52.400
So mail to the admin people and then they will set up the manager list.

17:52.400 --> 17:56.440
There's no button yet for PIs.

17:56.440 --> 18:00.080
I see.

18:00.080 --> 18:04.120
The other point was about SSH keys.

18:04.120 --> 18:09.200
So the SSH key, the public key is generated on the machine from where you log in.

18:09.200 --> 18:16.000
And the first place to log in is this, how is it called, the firewall server, so the

18:16.000 --> 18:20.440
central server from which you log in then to the machines of the cluster, of Alex or

18:20.440 --> 18:21.440
so on.

18:21.440 --> 18:23.360
At least this is how it's working for us.

18:23.360 --> 18:28.960
So we need at least two keys, one generated from the machine where we log in from remote

18:28.960 --> 18:35.120
to the firewall server or firewall access.

18:35.120 --> 18:42.240
And then we need a key generated on this firewall server to log into Alex or whatever

18:42.240 --> 18:43.240
where we want to work.

18:43.240 --> 18:44.240
Is this correct?

18:44.240 --> 18:49.240
That's one option.

18:49.240 --> 18:57.760
I will post a link in the chat to our FAQ on SSH.

18:57.760 --> 19:04.240
There are a couple of more ways to do it.

19:04.240 --> 19:20.360
So if you are a Linux user, you can define a ProxyJump and directly connect to the

19:20.360 --> 19:24.600
Alex front end through cshpc.

19:24.600 --> 19:29.520
And that's conveniently done by using a config file for SSH.

19:29.520 --> 19:41.600
Or, as you said, you create an additional SSH key on cshpc, or as a third option, you use

19:41.600 --> 19:47.920
an SSH agent.

19:47.920 --> 19:55.440
For security reasons, the preferred solution is really to use ProxyJump, because agent

19:55.440 --> 20:03.240
forwarding has issues and also you shouldn't put a private key on a shared system.

20:03.240 --> 20:06.800
So we really recommend setting up ProxyJump wherever possible.
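A minimal `~/.ssh/config` sketch for the ProxyJump setup described here; the host names are assumptions based on the systems mentioned, so take the authoritative ones from the SSH FAQ linked in the chat:

```
# ~/.ssh/config (illustrative; verify host names against the SSH FAQ)
Host cshpc
    HostName cshpc.rrze.fau.de
    User <your-hpc-account>

Host alex
    HostName alex.nhr.fau.de
    User <your-hpc-account>
    # Tunnel through the dialog server; the private key stays on your machine,
    # avoiding agent forwarding and keys on shared systems.
    ProxyJump cshpc
```

With this in place, `ssh alex` connects through the dialog server in one step.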

20:06.800 --> 20:08.580
And there was a HPC talk on this.

20:08.580 --> 20:13.220
So there should be a video and also slides on how to set it up.

20:13.220 --> 20:14.220
It's quite simple.

20:14.220 --> 20:18.840
We'll put the link to the video into the comments on this video.

20:18.840 --> 20:20.440
Okay, great.

20:20.440 --> 20:21.440
Thank you.

20:21.440 --> 20:22.440
All right.

20:22.440 --> 20:28.920
Let's go to Cluster Cockpit.

20:28.920 --> 20:30.800
So just an introduction.

20:30.800 --> 20:33.720
So what is Cluster Cockpit?

20:33.720 --> 20:41.920
It's a web-based service that gives you access to job-specific performance monitoring metrics.

20:41.920 --> 20:51.620
And those are also metrics that are measured using performance monitoring.

20:51.620 --> 20:57.520
So we have continuous system-wide performance monitoring on the clusters.

20:57.520 --> 21:02.320
And via this web-based interface, you as a user for your jobs can have access to this

21:02.320 --> 21:05.080
monitoring data.

21:05.080 --> 21:09.120
Cluster Cockpit also, because it has this job view, has a rudimentary job accounting

21:09.120 --> 21:10.120
functionality.

21:10.120 --> 21:17.860
Still, yeah, that's not the primary purpose.

21:17.860 --> 21:28.040
So we plan to have a separate solution, preferably in the HPC portal, if you need some job accounting

21:28.040 --> 21:29.040
thing.

21:29.040 --> 21:32.980
But still, you can use it for that purpose also.

21:32.980 --> 21:37.280
To get access, there are two options.

21:37.280 --> 21:45.340
There is a login mask, and you can authenticate with your IDM HPC account.

21:45.340 --> 21:51.120
This only works with local FAU, so you could say legacy accounts.

21:51.120 --> 21:59.520
Any account that was issued via the HPC portal can only access the service, and I will demonstrate

21:59.520 --> 22:03.820
this later on, from within the portal.

22:03.820 --> 22:08.560
So you log into the portal, and then you have this button Christoph showed.

22:08.560 --> 22:14.520
And then you get an authenticated session in Cluster Cockpit.

22:14.520 --> 22:20.280
So Cluster Cockpit is a web framework and a stack that was developed at our site.

22:20.280 --> 22:25.640
It's an open source project, so if you're interested, you could also install it on other

22:25.640 --> 22:26.640
systems.

22:26.640 --> 22:30.920
But it's still in early development, and we have now a third party funded project, so

22:30.920 --> 22:37.720
I hope that it will mature quicker than over the last years.

22:37.720 --> 22:41.840
So this is what I already mentioned.

22:41.840 --> 22:44.640
You can see then running and completed jobs.

22:44.640 --> 22:51.560
So as soon as you start a job, after a few minutes, I think it's around six minutes,

22:51.560 --> 22:55.440
you'll see also your running jobs.

22:55.440 --> 22:59.040
At the moment, there are only two roles.

22:59.040 --> 23:06.960
There is either standard user, and the standard user will only see their own jobs, or admin

23:06.960 --> 23:09.340
user, which sees all the jobs.

23:09.340 --> 23:16.440
So what is still missing, but what we plan to implement is that if you're a PI, that

23:16.440 --> 23:22.560
you can see the jobs of all the users that belong to your project, but that's not implemented

23:22.560 --> 23:24.560
yet.

23:24.560 --> 23:25.640
So what can be seen there?

23:25.640 --> 23:29.200
You will see it then also in a moment in the short demo.

23:29.200 --> 23:38.600
So you have a job view, and in the job view, you get some metadata about the job, what

23:38.600 --> 23:40.480
are the resources used.

23:40.480 --> 23:49.120
Then you get a so-called polar plot, which for the basic resources, flops, memory, bandwidth,

23:49.120 --> 23:58.160
and allocated memory, you get how much of that you use, and there is an average and

23:58.160 --> 24:03.240
a max marking there.

24:03.240 --> 24:09.440
So you can see at a single glance how your job overall utilizes the system.

24:09.440 --> 24:13.760
And furthermore, you also get this roofline plot.

24:13.760 --> 24:17.100
And in the roofline plot, you have the timeline.

24:17.100 --> 24:22.840
So every dot is one sample on every node.

24:22.840 --> 24:30.560
And the color is from start of job in blue to end of job in red.

24:30.560 --> 24:38.320
And so you can have some impression where in the capabilities of the system your job

24:38.320 --> 24:39.320
performs.

24:39.320 --> 24:44.280
And I come back to how to interpret the roofline plot in a moment.

24:44.280 --> 24:47.160
And then you have a list of the measured metrics.

24:47.160 --> 24:52.600
At the moment, unfortunately, you only get basic timeline

24:52.600 --> 24:54.960
plots of those metrics.

24:54.960 --> 24:59.960
The metrics are not the same on all systems, but on most systems, we try to keep this the

24:59.960 --> 25:01.280
same.

25:01.280 --> 25:08.160
And I will show you the data in the demo, then you can see better what is available.

25:08.160 --> 25:10.080
So what are the purposes?

25:10.080 --> 25:15.080
First is you get instant feedback about how your job performs.

25:15.080 --> 25:16.860
So what's the flop rate?

25:16.860 --> 25:23.240
How does the flop rate relate to the capability of the system?

25:23.240 --> 25:31.760
And what's the memory bandwidth, what's the network IO, and how much memory is allocated

25:31.760 --> 25:33.680
also.

25:33.680 --> 25:39.240
And this allows you, and that's maybe the most important purpose, to identify pathological

25:39.240 --> 25:40.240
jobs.

25:40.240 --> 25:41.960
So jobs where something goes wrong.

25:41.960 --> 25:45.680
So where you don't want to sit and wait until it's finished.

25:45.680 --> 25:50.720
So for example, that might be some typo in the job script or you made a mistake or something

25:50.720 --> 25:53.840
goes wrong in the simulation or something like that.

25:53.840 --> 25:59.120
But you look at it and say, when you see, OK, for example, the job sits there, does

25:59.120 --> 26:06.560
nothing or only one core is active or only a few cores per node active, where you say,

26:06.560 --> 26:15.040
OK, maybe I terminate the job and have a look what's going wrong.

26:15.040 --> 26:20.440
And then, of course, it also allows to classify your job performance and you get some statistics

26:20.440 --> 26:23.200
over the job performance.

26:23.200 --> 26:27.360
This is a topic which we do not really exploit yet.

26:27.360 --> 26:36.640
The data is there in principle, but in the future we want to add more automatism to this.

26:36.640 --> 26:39.620
So some words about the roofline plot.

26:39.620 --> 26:42.840
If you're not familiar with it, how to interpret it.

26:42.840 --> 26:48.040
The roofline plot gives you an impression of how your software performs within the capabilities

26:48.040 --> 26:50.000
of the system.

26:50.000 --> 26:57.280
And to first order, the bottleneck on a computing system is either the execution

26:57.280 --> 27:04.240
of work, which is P peak. For most applications in computational science,

27:04.240 --> 27:07.580
this is flops; floating point is usually what we do.

27:07.580 --> 27:16.840
If your useful work is not flops, then the way we do the roofline plot is not very useful

27:16.840 --> 27:20.320
because we only represent work with flops.

27:20.320 --> 27:26.320
Of course, your application may have some other useful work, which is not flops.

27:26.320 --> 27:29.360
Then you can just ignore it.

27:29.360 --> 27:31.040
Or it is the data path.

27:31.040 --> 27:34.200
So transfer over any data path.

27:34.200 --> 27:41.440
And the usual bottleneck on a computing system nowadays is the memory, access to main memory.

27:41.440 --> 27:43.160
And then there are two.

27:43.160 --> 27:45.360
Let's go through it.

27:45.360 --> 27:48.440
There are actually two roofs.

27:48.440 --> 27:53.600
You have the horizontal roof here.

27:53.600 --> 28:02.100
And this represents the regime where performance

28:02.100 --> 28:03.680
is limited by execution.

28:03.680 --> 28:13.800
And in the plot, down here on the x-axis, you have the intensity, which is how many

28:13.800 --> 28:15.560
flops you do per byte.

28:15.560 --> 28:17.760
And here you have performance.

28:17.760 --> 28:21.240
In our case here, we plot flop rate.

28:21.240 --> 28:25.520
And usually this is in log-log scale.

28:25.520 --> 28:31.160
So you have to be careful that small differences can be actually big differences because of

28:31.160 --> 28:33.560
the log-log scale.

28:33.560 --> 28:39.680
This roof is the limit when you are limited by data transfer.

28:39.680 --> 28:43.240
And of course, how this roof looks like that depends on the machine.

28:43.240 --> 28:46.240
So this is characterized by the machine.
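The two roofs together form the classic roofline model; written as a formula (using the conventional notation, which the talk itself does not spell out):

```latex
P = \min\left(P_{\mathrm{peak}},\; I \cdot b_S\right),
\qquad I = \frac{\text{flops performed}}{\text{bytes transferred}}
```

Here P_peak is the execution limit (the horizontal roof), b_S the memory bandwidth (the slanted roof), and the knee sits at the intensity I = P_peak / b_S.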

28:46.240 --> 28:48.880
Optimally your job would be located here.

28:48.880 --> 28:53.360
And of course, all your job points must be beyond the roof.

28:53.360 --> 28:55.000
Because this is really a speed-of-light limit.

28:55.000 --> 29:01.600
So you cannot get beyond the roof.

29:01.600 --> 29:07.520
Optimally you are at the knee because at the knee you make full use of all resources.

29:07.520 --> 29:12.600
You exploit the computational capabilities and you exploit the memory bandwidth.

29:12.600 --> 29:14.120
Of course, this rarely happens.

29:14.120 --> 29:15.120
And this is also not the purpose.

29:15.120 --> 29:18.520
I mean, at the end, it's a general purpose machine.

29:18.520 --> 29:24.820
And still in high performance computing, you want to be somewhere near the roof.

29:24.820 --> 29:33.540
That then shows that you are actually making good use of available resources.

29:33.540 --> 29:37.900
And you can of course set up the roofline model analytically.

29:37.900 --> 29:44.640
But we do it by measuring those values using hardware performance counter data.

29:44.640 --> 29:47.440
So you can measure the memory bandwidth.

29:47.440 --> 29:49.640
You can measure the flop rate.

29:49.640 --> 29:57.540
And the capabilities of the system are entered as parameters in the cluster cockpit framework.

29:57.540 --> 29:59.520
So now just some examples.

29:59.520 --> 30:03.100
How do you detect bad jobs?

30:03.100 --> 30:10.800
We have this reference line here, which indicates some normal or maximum usage.

30:10.800 --> 30:17.980
So here, for example, for load, that I think was on Emmy or Meggie.

30:17.980 --> 30:19.960
The load that you want is 20.

30:19.960 --> 30:28.480
20 would be the number of cores and using SMT threads.

30:28.480 --> 30:30.680
No that's actually wrong.

30:30.680 --> 30:33.760
Yeah, no, it's right.

30:33.760 --> 30:34.760
This is per node.

30:34.760 --> 30:37.880
So every line here is one node.

30:37.880 --> 30:44.240
And then we color the graphs when something goes wrong.

30:44.240 --> 30:52.000
For example, for load, usually you want to utilize every single core and thread.

30:52.000 --> 30:57.680
And in this case here, not only is the load much lower, so you're only using half of the

30:57.680 --> 31:03.480
machine, but also the load is different for different nodes.

31:03.480 --> 31:07.520
So obviously something is going wrong there.
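The check described here, load differing across nodes, can be sketched outside the web UI as well; a hypothetical example, assuming you have per-node load averages as "node load" pairs (this is not Cluster Cockpit's export format):

```shell
# Flag nodes whose load deviates more than 25% from the job-wide mean.
# Input data is made up for illustration: node04 is underloaded.
printf 'node01 38.9\nnode02 39.1\nnode03 38.8\nnode04 19.4\n' |
awk '{ sum += $2; n++; load[$1] = $2 }
     END {
       avg = sum / n
       for (node in load)
         if (load[node] < 0.75 * avg || load[node] > 1.25 * avg)
           printf "%s: load %.1f deviates from mean %.1f\n", node, load[node], avg
     }'
```

On the sample input, only node04 is reported, which is exactly the per-node imbalance the colored graphs highlight.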

31:07.520 --> 31:10.400
And then you can have a look at basic resource utilization.

31:10.400 --> 31:12.760
For example, this is a good job.

31:12.760 --> 31:17.040
You see that this job here, in this case here, is almost at the knee.

31:17.040 --> 31:24.960
And you see also here in the polar plot that it makes almost optimal usage of the memory

31:24.960 --> 31:29.280
bandwidth.

31:29.280 --> 31:32.400
And it allocates almost all of the memory.

31:32.400 --> 31:37.320
One thing here I didn't mention yet is there is a scalar roof and there is a SIMD roof.

31:37.320 --> 31:43.640
So we plot the roof using just scalar instructions, so not SIMD instructions.

31:43.640 --> 31:48.080
If you don't know what this is, it doesn't matter.

31:48.080 --> 31:51.800
There are two different ways to execute floating point.

31:51.800 --> 31:54.280
And the other way gives you more performance.

31:54.280 --> 31:56.400
Very simply speaking.

31:56.400 --> 32:02.680
So this is obviously a code that only makes use of scalar instructions.

32:02.680 --> 32:05.760
Maybe it's not a proof, but you can look it up in the metrics.

32:05.760 --> 32:12.280
And this is a job here that makes use of SIMD because it makes almost full use of the floating

32:12.280 --> 32:17.680
point capabilities of the system.

32:17.680 --> 32:28.520
Another useful view is the statistic table, which allows you to sort the different metrics.

32:28.520 --> 32:33.740
And you can also configure the different metrics according to min, average, or max.

32:33.740 --> 32:35.480
And most useful is average.

32:35.480 --> 32:45.800
And this gives you a simple way to classify groups of processes or nodes.

32:45.800 --> 32:50.720
So for example, you could see if there is a severe load imbalance, then you could see

32:50.720 --> 32:51.720
it here.

32:51.720 --> 32:58.360
Like for example, in this job here, there was one class of jobs that made almost 25

32:58.360 --> 33:04.960
to 30 gigaflops and others made one third less.

33:04.960 --> 33:11.440
And one thing I maybe need to explain is this flops_any metric.

33:11.440 --> 33:18.040
To encapsulate this, also for the roofline plot in a single metric, we scale double precision

33:18.040 --> 33:21.580
floating point rates to single precision.

33:21.580 --> 33:27.200
So with scalar, it doesn't matter, because the throughput is the same.

33:27.200 --> 33:33.820
But for SIMD, a system can do double the floating point rate for a single precision than it

33:33.820 --> 33:36.000
can do for double precision.

33:36.000 --> 33:42.880
And to still represent that with a single metric, we scale the double precision rate

33:42.880 --> 33:48.880
by the factor of 2 to get then to the single precision performance, to have this all in

33:48.880 --> 33:50.760
one metric.
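The scaling just described can be written as a formula (the metric name flops_any follows the convention used in the plots; treat the exact naming as an assumption):

```latex
\mathrm{flops\_any} = \mathrm{flops}_{\mathrm{SP}} + 2 \cdot \mathrm{flops}_{\mathrm{DP}}
```

So, for example, a job sustaining 1 Tflop/s of double precision SIMD is reported as 2 Tflop/s flops_any, directly comparable against the single precision peak.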

33:50.760 --> 33:55.920
OK, so still valid.

33:55.920 --> 34:00.400
Half a year later, we're still working on our plan to do a more intuitive job performance

34:00.400 --> 34:01.400
visualization.

34:01.400 --> 34:08.400
So for beginners, we are aware that it's maybe difficult to interpret the raw time series

34:08.400 --> 34:10.320
data.

34:10.320 --> 34:17.760
And we want to add a more accessible way to give you feedback how your job performs or

34:17.760 --> 34:21.120
if there is anything you need to do.

34:21.120 --> 34:25.560
And then we want to add an automatic job classification.

34:25.560 --> 34:28.980
And so: is it running well?

34:28.980 --> 34:30.920
Has it a certain issue?

34:30.920 --> 34:40.240
Examples could be load imbalance, low resource utilization, or also in a positive

34:40.240 --> 34:44.040
sense, throughput limited, and so on.

34:44.040 --> 34:46.800
And we also want to add automatic application tagging.

34:46.800 --> 34:50.800
And that would also be then available to you.

34:50.800 --> 34:56.200
Like for example, we want to automatically detect all GROMACS jobs or all OpenFOAM

34:56.200 --> 35:03.640
jobs, also for us to analyze them and to see how they perform or where there are differences

35:03.640 --> 35:06.880
within the same application.

35:06.880 --> 35:09.480
So this is kind of still in beta stage.

35:09.480 --> 35:11.400
But I mean, it works.

35:11.400 --> 35:13.200
We offer it to the users.

35:13.200 --> 35:18.380
If you encounter any problem, and there are a lot of things that require improvement,

35:18.380 --> 35:19.800
we know that.

35:19.800 --> 35:21.760
Feel free to open a ticket.

35:21.760 --> 35:31.880
And yeah, now I want to show this from within the user portal.

35:31.880 --> 35:39.880
So this is when I log into the user portal.

35:39.880 --> 35:46.920
Then in the user account down here, I have this "View jobs in ClusterCockpit" link.

35:46.920 --> 35:50.200
And when I press this, I get an authenticated session.

35:50.200 --> 35:57.320
What's a bit weird is, and Christoph maybe knows why: I have an IDM account which is

35:57.320 --> 36:01.240
UNRZ254.

36:01.240 --> 36:04.800
But here, when I see the jobs, I have some cryptic account.

36:04.800 --> 36:07.680
But I guess this is for technical reasons.

36:07.680 --> 36:14.320
So at the moment, I don't have any running jobs, and as you see, I only do tests, so

36:14.320 --> 36:16.000
I'm not a production user.

36:16.000 --> 36:19.040
So here you can see the different systems.

36:19.040 --> 36:25.140
Meanwhile, we included all systems in the monitoring.

36:25.140 --> 36:28.200
You can click then on the running jobs on Fritz

36:28.200 --> 36:32.200
or the total jobs you have on Fritz.

36:32.200 --> 36:36.720
When you go on My Jobs, you see all the jobs.

36:36.720 --> 36:40.360
And you also see some statistics,

36:40.360 --> 36:42.280
like the total jobs, short jobs,

36:42.280 --> 36:46.680
are jobs which run, I think, less than ten minutes or six

36:46.680 --> 36:48.680
minutes, I'm not completely sure.

36:48.680 --> 36:55.840
So short jobs are something that's also seen as something

36:55.840 --> 37:02.440
that should be improved, because starting

37:02.440 --> 37:06.320
and stopping a job adds some overhead for the scheduler,

37:06.320 --> 37:09.440
and having jobs that only last for a few minutes

37:09.440 --> 37:11.600
is just not very efficient.

37:11.600 --> 37:16.600
And also, it disturbs the scheduler when you constantly,

37:16.600 --> 37:20.560
especially when you specify in your job

37:20.560 --> 37:24.520
an expected wall time that's much bigger than what

37:24.520 --> 37:28.840
the job was consuming, then this will really

37:28.840 --> 37:30.560
disturb the scheduler a lot.

37:30.560 --> 37:32.840
So you should really then, if you

37:32.840 --> 37:36.200
have a lot of throughput jobs, which run just

37:36.200 --> 37:39.000
for a few minutes, then you should set up

37:39.000 --> 37:41.080
some internal scheduling.

37:41.080 --> 37:46.240
So get a job and then do scheduling within this job.
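
The "scheduling within one job" idea can be sketched like this; a minimal Python example, where `run_task` is just a placeholder for your real short-running work:

```python
# Sketch: instead of submitting hundreds of few-minute batch jobs,
# request one allocation and schedule the short tasks yourself
# inside it. A thread pool plays the role of the internal scheduler;
# `run_task` stands in for a real short computation or subprocess.
from concurrent.futures import ThreadPoolExecutor

def run_task(task_id):
    # placeholder for a short piece of work
    return task_id * task_id

def run_all(task_ids, workers=4):
    # keep at most `workers` tasks in flight at once
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, task_ids))

print(run_all(range(8)))  # -> [0, 1, 4, 9, 16, 25, 36, 49]
```

This way the batch scheduler sees one reasonably long job instead of many tiny ones.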

37:46.240 --> 37:51.600
You have here powerful sorting and filter options.

37:51.600 --> 37:55.480
You can configure what metrics you want to see here.

37:55.480 --> 37:58.040
And this is then also stored for you as a user.

37:58.040 --> 37:59.800
For example, you can have a look, OK,

37:59.800 --> 38:03.280
what did I run the last 48 hours?

38:03.280 --> 38:04.680
That was just one job.

38:04.680 --> 38:08.480
So let's take the last seven days.

38:08.480 --> 38:12.840
And when we go here, I think that was today.

38:12.840 --> 38:15.800
Yeah, you see that's in the middle of nowhere.

38:15.800 --> 38:18.200
And those are the metrics.

38:21.000 --> 38:22.840
You can tag a job.

38:22.840 --> 38:27.920
Yeah, this is a feature that's available to you.

38:27.920 --> 38:31.400
And also here, you can configure which metrics you want to see.

38:31.400 --> 38:34.160
And this is then valid for all the job views.

38:34.160 --> 38:40.200
As already mentioned, down here, you can see the job script.

38:40.200 --> 38:42.760
Oh, that didn't work.

38:42.760 --> 38:44.400
Ah, it was an interactive job.

38:44.400 --> 38:47.440
There was no job script.

38:47.440 --> 38:49.560
And you have the statistic table.

38:49.560 --> 38:53.520
Here, this is not very useful because when the job is archived,

38:53.520 --> 38:56.240
you only have access to the node information.

38:56.240 --> 39:00.240
Of course, this was just a single node.

39:00.240 --> 39:04.760
And in the future, especially for the shared jobs,

39:04.760 --> 39:08.560
when I go here on Alex, I've only a few jobs.

39:08.560 --> 39:12.080
You can see that here, you don't get exclusive nodes.

39:12.080 --> 39:17.360
I mean, you can, but usually the jobs are shared among users.

39:17.360 --> 39:20.480
I got here exclusive jobs, actually, all the way.

39:20.480 --> 39:32.880
But here, you may want to then have a look at the core view,

39:32.880 --> 39:34.080
which is not available here.

39:34.080 --> 39:45.280
Can you check whether you have the scrambling still enabled,

39:45.280 --> 39:46.160
the demo scrambling?

39:49.680 --> 39:53.120
No, because I'm not an admin user.

39:53.120 --> 39:54.160
I cannot set this.

39:55.360 --> 39:56.560
Just a regular user.

39:56.560 --> 40:01.680
So I hope that we will provide more features soon.

40:01.680 --> 40:04.800
So any questions or remarks?

40:04.800 --> 40:06.240
Yeah, maybe one question.

40:06.240 --> 40:07.760
I mean, these plots, they are great.

40:07.760 --> 40:13.520
But just in case, if you would like to have the numerical data

40:13.520 --> 40:18.160
on your job, so basically the measurements which

40:18.160 --> 40:21.520
are behind these plots, e.g. for including the data

40:21.520 --> 40:27.360
for publication, is it possible to download them as well?

40:27.360 --> 40:29.600
What do you mean by numerical jobs?

40:29.600 --> 40:30.800
Numerical data.

40:30.800 --> 40:33.360
So the numerical data, you mean the raw data?

40:33.360 --> 40:36.480
The raw data behind these jobs, the measurements

40:36.480 --> 40:38.640
that you did while the job was running.

40:38.640 --> 40:40.880
So you want to have access to the timelines

40:40.880 --> 40:44.480
or just to the aggregate data?

40:44.480 --> 40:46.320
Yeah, this data which is plotted there.

40:46.320 --> 40:48.160
I think that's the most important thing.

40:48.160 --> 40:52.240
Yeah, this data which is plotted there. I don't need it right now.

40:52.240 --> 40:55.280
I just asked if this is a possibility also in the API.

40:55.280 --> 40:55.760
Not yet.

40:55.760 --> 40:58.480
There was a similar request already last summer.

40:59.200 --> 41:01.920
And there we said: there is a REST API,

41:02.640 --> 41:04.960
but this is not yet exposed.

41:05.680 --> 41:10.960
So there is a REST API which allows you to query jobs

41:10.960 --> 41:14.160
and to also get the job metadata.

41:14.160 --> 41:19.200
There is not yet the option to get the metric data.

41:19.200 --> 41:22.480
But of course, this could be added for your jobs

41:22.480 --> 41:26.640
so that you could then query your job data, for example,

41:26.640 --> 41:27.200
using curl.
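
As a sketch of what such a query could look like: note that the endpoint URL, the token, and the header layout here are all assumptions, since the metric-data API is not exposed yet; we only build the request without sending it.

```python
# Sketch only: the job REST API exists, but metric data is not yet
# exposed. The base URL and API key below are hypothetical
# placeholders, not real endpoints.
import urllib.request

API_BASE = "https://example.invalid/api"  # placeholder URL
TOKEN = "my-api-key"                      # placeholder API key

def build_job_query(job_id):
    # Construct an authenticated GET request for one job's data.
    # We only construct it; nothing is sent over the network.
    return urllib.request.Request(
        f"{API_BASE}/jobs/{job_id}",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Accept": "application/json",
        },
    )

req = build_job_query(12345)
print(req.full_url)  # -> https://example.invalid/api/jobs/12345
```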

41:29.840 --> 41:30.640
That's great.

41:30.640 --> 41:35.040
I think there is a ticket for that, that we create an API.

41:35.040 --> 41:38.960
And how it would work is then that you,

41:38.960 --> 41:45.120
in here in the configuration, you could have something

41:45.120 --> 41:47.200
like create API key or something.

41:48.240 --> 41:54.560
And then you could authenticate the user with the API key.

41:57.600 --> 41:58.160
Great.

41:58.160 --> 41:59.600
So it's not particularly urgent.

41:59.600 --> 42:00.400
I don't need that now.

42:00.400 --> 42:01.680
I just asked myself.

42:01.680 --> 42:04.720
I think maybe there is a use case scenario

42:04.720 --> 42:08.320
where this makes sense and somebody wants to visualize

42:08.320 --> 42:09.600
the data himself.

42:09.600 --> 42:10.100
Sure.

42:10.100 --> 42:11.600
Thank you.

42:15.600 --> 42:18.000
Of course, everybody is encouraged to try it out.

42:18.000 --> 42:19.600
And yeah, sure.

42:19.600 --> 42:22.720
So there is a question.

42:22.720 --> 42:24.240
Just feel free.

42:24.240 --> 42:24.720
Yes.

42:24.720 --> 42:25.200
Hi.

42:25.200 --> 42:26.800
I'm Wolfgang Söder from Regensburg.

42:26.800 --> 42:28.800
So thanks for the nice talk.

42:28.800 --> 42:32.160
So I have a few questions.

42:33.440 --> 42:36.000
So first, maybe for the accounting,

42:36.000 --> 42:40.160
the total core hours which are displayed in

42:40.160 --> 42:45.120
ClusterCockpit, and also on the portal, are these actually the core hours

42:45.120 --> 42:48.960
which are accounted?

42:48.960 --> 42:51.840
Or are these the core hours during which

42:51.840 --> 42:54.080
the jobs were actually running?

42:54.080 --> 42:57.440
So for accounting, the valid data

42:57.440 --> 42:59.760
is only in the HPC portal.

42:59.760 --> 43:03.120
So the data you see here is computed

43:03.120 --> 43:08.880
because I have my own job metadata table within

43:08.880 --> 43:09.680
ClusterCockpit.

43:09.680 --> 43:13.280
And I extract this then from this table.

43:15.280 --> 43:20.960
And for me, this is just the time your job actually

43:20.960 --> 43:23.520
was running.

43:23.520 --> 43:28.400
So I don't know the difference between accounted and running.

43:28.400 --> 43:29.440
OK.

43:29.440 --> 43:30.880
So is there a difference?

43:30.880 --> 43:33.280
Is there a difference?

43:33.280 --> 43:36.880
I'm asking because if there's some discount,

43:36.880 --> 43:38.080
how is this handled?

43:38.080 --> 43:38.560
OK.

43:38.560 --> 43:40.240
Now I understand.

43:40.240 --> 43:42.160
I mean, this is just the time.

43:42.160 --> 43:46.800
So this has nothing to do with any accounting.

43:46.800 --> 43:47.360
OK.

43:47.360 --> 43:51.280
And for the portal, how is it handled there?

43:51.280 --> 43:54.720
I mean, if you get some discount, for example,

43:54.720 --> 43:57.040
I mean, how is this really?

43:57.040 --> 44:00.800
I guess that they subtract it maybe

44:00.800 --> 44:02.560
by hand. Or, Thomas, what do you say?

44:02.560 --> 44:07.360
So at the moment, you cannot see the discount yet.

44:07.360 --> 44:08.160
OK.

44:08.160 --> 44:14.080
What you see in the portal is the use time, not including

44:14.080 --> 44:15.600
a discount.

44:15.600 --> 44:20.560
OK, so it's actually what the jobs really used.

44:20.560 --> 44:25.280
The discount, then, the question is,

44:25.280 --> 44:26.840
how is it accounted?

44:26.840 --> 44:29.520
I mean, because of it.

44:29.520 --> 44:30.160
Yeah.

44:30.160 --> 44:33.280
Right now, you do not see the discount,

44:33.280 --> 44:38.560
and probably it will show up as an increase of your project

44:38.560 --> 44:39.680
quota.

44:39.680 --> 44:40.640
OK.

44:40.640 --> 44:44.600
So it's not that the compute time is reduced,

44:44.600 --> 44:47.640
but the awarded time is increased.

44:47.640 --> 44:48.160
OK.

44:48.160 --> 44:48.640
OK.

44:48.640 --> 44:51.200
Because that's good to know, because if I

44:51.200 --> 44:55.760
have to manage my project, and of course, I

44:55.760 --> 45:00.120
need to know how many core hours have actually been used.

45:00.120 --> 45:05.760
And if this somehow matches with how my project is progressing.

45:05.760 --> 45:08.720
So that's the background of the question.

45:08.720 --> 45:09.240
OK.

45:09.240 --> 45:14.120
So this is actually really the time which is taken by the.

45:14.120 --> 45:15.000
Yeah.

45:15.000 --> 45:15.760
OK.

45:15.760 --> 45:22.040
And the value in ClusterCockpit and the one in the HPC portal

45:22.040 --> 45:24.720
should be close to each other.

45:24.720 --> 45:27.920
If not, then there's some problem.

45:27.920 --> 45:28.440
OK.

45:28.440 --> 45:28.960
Yeah.

45:28.960 --> 45:30.600
It's close in my case, at least.

45:30.600 --> 45:31.120
Yeah.

45:31.120 --> 45:32.680
So that's fine.

45:32.680 --> 45:33.840
OK.

45:33.840 --> 45:36.280
Second question is also to the portal.

45:36.280 --> 45:42.960
I mean, if I'm adding a user, is there the possibility

45:42.960 --> 45:48.360
or is it possible that one user is a member of two projects?

45:48.360 --> 45:52.720
Or do I have to create a new user in case of that?

45:52.720 --> 45:54.880
So you have to distinguish.

45:54.880 --> 45:57.600
There is the user that logs in, in this case me.

45:57.600 --> 45:59.600
That's a unique user.

45:59.600 --> 46:02.520
But then the HPC accounts are per project

46:02.520 --> 46:06.760
for every new invitation, a new account with a new home,

46:06.760 --> 46:08.600
and everything is created.

46:08.600 --> 46:09.120
OK.

46:09.120 --> 46:14.320
So you cannot share one user with different projects?

46:14.320 --> 46:15.320
OK.

46:15.320 --> 46:17.040
That means if you have to share data,

46:17.040 --> 46:20.440
then you have to do some, how would you do that?

46:20.440 --> 46:25.400
If you were to share data for one user with different projects?

46:25.400 --> 46:26.240
Yeah, we are aware.

46:26.240 --> 46:30.520
I don't know, Thomas, is there any idea to have project?

46:30.520 --> 46:32.120
There is no simple solution here.

46:32.120 --> 46:34.000
Simple.

46:34.000 --> 46:36.000
And we are aware that this is not ideal,

46:36.000 --> 46:38.960
but you have to, in German, one would say,

46:38.960 --> 46:41.840
"einen Tod muss man sterben" — you have to die one death.

46:41.840 --> 46:44.800
So what you probably would have to do

46:44.800 --> 46:51.960
is to use UNIX permissions and file system ACLs

46:51.960 --> 46:55.560
to open your directory for the other account.

46:55.560 --> 46:56.520
OK.

46:56.520 --> 46:57.040
OK.

46:57.040 --> 47:00.280
And keep the files within the directories

47:00.280 --> 47:03.840
readable for everyone so that you only

47:03.840 --> 47:07.880
have to deal with access permissions on the first entry

47:07.880 --> 47:10.200
directory.

47:10.200 --> 47:13.680
And you block other people there,

47:13.680 --> 47:20.840
and therefore can use umask 0022 for the files below,

47:20.840 --> 47:24.760
because if you cannot enter the door,

47:24.760 --> 47:29.920
it doesn't matter if the file permissions on the files

47:29.920 --> 47:33.880
themselves are restrictive or not.

47:33.880 --> 47:34.380
OK.

47:34.380 --> 47:39.920
So that is not urgent right now, but OK, now it's clear to me.

47:39.920 --> 47:43.000
A quick and dirty solution if you do not

47:43.000 --> 47:47.040
have very highly sensitive data is

47:47.040 --> 47:53.080
to remove read permissions on your first directory

47:53.080 --> 47:59.160
for others, but keep it executable.

47:59.160 --> 48:03.520
Then people have to know how directories and files are

48:03.520 --> 48:08.520
named, and they can access it, but they cannot do an ls.

48:11.680 --> 48:19.440
And then you do not have to deal with ACLs and stuff like that.

48:19.440 --> 48:25.840
And so security by obscurity is quite safe in most cases.
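
The execute-but-not-read trick just described can be sketched in Python (the equivalent of `chmod 711` on your top-level directory); the temporary directory here just stands in for your real home or work directory:

```python
# Sketch: make the top-level directory traversable ("executable")
# but not listable for others. Anyone who knows the exact names
# below it can still open the files, but "ls" on the directory
# itself is denied for them.
import os
import stat
import tempfile

top = tempfile.mkdtemp()   # stands in for your top-level directory

# rwx for the owner, --x (traverse only) for group and others
os.chmod(top, 0o711)

mode = stat.S_IMODE(os.stat(top).st_mode)
print(oct(mode))  # -> 0o711
```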

48:28.640 --> 48:31.200
I mean, one other option, I don't know how feasible is it,

48:31.200 --> 48:36.000
but to mount the other home with FUSE,

48:36.000 --> 48:40.520
so with a userspace file system, via SSHFS.

48:40.520 --> 48:43.480
That won't work on the HPC systems.

48:43.480 --> 48:47.120
And if it would work, it would be terribly slow.

48:47.120 --> 48:48.600
OK.

48:48.600 --> 48:49.720
I mean, it's not very urgent.

48:49.720 --> 48:53.440
I just wanted to ask, because at Ulyf, for example,

48:53.440 --> 48:56.280
I mean, they have some different model now, which

48:56.280 --> 49:00.000
allows for, exactly for this, that one user can

49:00.000 --> 49:02.160
be part of different projects.

49:02.160 --> 49:06.360
And they changed that actually some time ago.

49:06.360 --> 49:10.160
Before that, they also had this model like you did,

49:10.160 --> 49:11.320
like you do now.

49:11.320 --> 49:12.000
OK.

49:12.000 --> 49:15.400
Also, maybe another question, if possible.

49:15.400 --> 49:18.360
So there is a vectorization ratio.

49:18.360 --> 49:22.200
I'm just looking at my jobs here in ClusterCockpit.

49:22.200 --> 49:23.400
What does this mean?

49:26.360 --> 49:28.400
Oh, no, no, no, no.

49:28.400 --> 49:29.960
Or is this?

49:29.960 --> 49:32.400
This is something directly measured.

49:32.400 --> 49:39.040
So this only relates to the floating point instructions.

49:39.040 --> 49:40.160
We need to document that.

49:40.160 --> 49:45.800
And it says, what percentage of your floating point instructions

49:45.800 --> 49:50.000
are using SIMD, essentially?

49:50.000 --> 49:54.400
In my case, the code was running completely

49:54.400 --> 49:56.640
scalar, without SIMD.

49:56.640 --> 49:59.840
OK, so you would like to have something close to 100% then,

49:59.840 --> 50:00.560
right?

50:00.560 --> 50:03.040
Yeah, well, that's, of course, the ideal case,

50:03.040 --> 50:08.200
that all floating point is done using SIMD instructions.
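
The ratio just described can be written down as a small sketch; the counter names and example values are made up, the real portal derives this from measured hardware events:

```python
# Sketch: the vectorization ratio relates SIMD floating point
# instructions to all floating point instructions. The counter
# values below are made-up examples, not real measurements.

def vectorization_ratio(scalar_flop_ins, simd_flop_ins):
    total = scalar_flop_ins + simd_flop_ins
    if total == 0:
        return 0.0        # no floating point instructions at all
    return simd_flop_ins / total

# Fully scalar code (like the job shown): ratio 0.0
print(vectorization_ratio(1_000_000, 0))  # -> 0.0
# Fully vectorized code, the ideal case: ratio 1.0
print(vectorization_ratio(0, 1_000_000))  # -> 1.0
```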

50:08.200 --> 50:08.680
OK.

50:08.680 --> 50:11.240
And what is the dashed line, actually, here?

50:11.240 --> 50:12.560
Yeah, the dashed line.

50:12.560 --> 50:18.080
So I specify different thresholds.

50:18.080 --> 50:23.640
And this is the normal threshold.

50:23.640 --> 50:27.440
So this is just a line of reference,

50:27.440 --> 50:31.280
also, that you can compare different jobs, that you

50:31.280 --> 50:33.240
don't have different scaling.

50:33.240 --> 50:35.680
Because here, when you would auto-scale,

50:35.680 --> 50:37.840
you couldn't see or you couldn't judge

50:37.840 --> 50:39.000
if this is good or bad.

50:39.000 --> 50:42.120
And therefore, we are drawing this reference line,

50:42.120 --> 50:44.120
which is a threshold

50:44.120 --> 50:46.520
I enter for each cluster by hand.

50:46.520 --> 50:49.080
I just configure for the cluster what

50:49.080 --> 50:51.640
is normal on the system, because this is, of course,

50:51.640 --> 50:53.040
cluster-specific.

50:53.040 --> 50:55.000
It's like a reasonable value.

50:55.000 --> 50:58.360
Yeah, it's a reasonable value, 60%.

50:58.360 --> 51:01.120
It's arbitrary at the end, yeah.

51:01.120 --> 51:04.520
And then, there is yellow.

51:04.520 --> 51:08.600
If your average job performance, if your average performance

51:08.600 --> 51:13.080
is below a threshold, I think it's called warning,

51:13.080 --> 51:21.040
and then red if you are below an even lower threshold.

51:21.040 --> 51:24.720
So there are different thresholds that are configured.

51:24.720 --> 51:30.120
And yeah, the line is the normal threshold.


51:31.480 --> 51:33.840
And the other, sorry, yeah, go ahead.

51:33.840 --> 51:38.000
The dashed line is what a typical job would show.

51:38.000 --> 51:38.500
OK.

51:38.500 --> 51:41.160
If you are above the dashed line, you are good.

51:41.160 --> 51:43.320
If you are below, you are bad.
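
The coloring logic just described can be sketched like this; the threshold values are illustrative (e.g. the 60% mentioned above), and the label names are assumptions, not the portal's actual ones:

```python
# Sketch of the threshold coloring: the dashed line is the "normal"
# per-cluster reference threshold; below a lower "warning" threshold
# the plot turns yellow, below an even lower "alert" threshold it
# turns red. All thresholds are configured per cluster by hand.

def classify(avg_value, normal, warning, alert):
    """Classify a job's average metric value against thresholds."""
    if avg_value >= normal:
        return "good"      # above the dashed reference line
    if avg_value >= warning:
        return "normal"
    if avg_value >= alert:
        return "warning"   # rendered yellow
    return "alert"         # rendered red

# e.g. a vectorization ratio against a 60% reference threshold
print(classify(0.75, normal=0.6, warning=0.4, alert=0.2))  # -> good
```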

51:43.320 --> 51:45.960
OK, that's good.

51:45.960 --> 51:48.920
So in my case, it looks good.

51:48.920 --> 51:51.080
But there are also more lines here.

51:51.080 --> 51:54.320
So if a yellow line, a blue.

51:54.320 --> 51:58.520
So here, this string here tells you what is shown.

51:58.520 --> 52:01.680
Like here, for example, for the single node job,

52:01.680 --> 52:06.440
some metrics are only available in a single granularity,

52:06.440 --> 52:07.840
like here for node.

52:07.840 --> 52:10.240
There is only a node value available.

52:10.240 --> 52:12.960
And then you see here, this line is node.

52:12.960 --> 52:16.640
If you would have a job with multiple nodes,

52:16.640 --> 52:21.440
then you would see here multiple lines, one for each node.

52:21.440 --> 52:26.040
Then here you see this is core.

52:26.040 --> 52:28.040
And there are others.

52:28.040 --> 52:33.040
No, this is, OK, no, this is core.

52:33.040 --> 52:37.200
Where every line then is a core, as you can see here.

52:37.200 --> 52:39.960
And I think you could switch here to node.

52:39.960 --> 52:42.680
And then it will convert here and only show the node value.

52:42.680 --> 52:47.280
So here you can switch between core and node.

52:47.280 --> 52:53.160
So some metrics are available per core, others per node.

52:53.160 --> 52:56.880
Wait, memory bandwidth, where is it?

52:56.880 --> 53:00.160
In this case, for whatever reason,

53:00.160 --> 53:02.160
it's not available for memory domain.

53:02.160 --> 53:07.640
I'm not so sure why, but it isn't.

53:07.640 --> 53:11.560
Maybe because this was already archived,

53:11.560 --> 53:14.200
because the job already finished.

53:17.800 --> 53:20.000
And these other dashed lines in these other plots,

53:20.000 --> 53:21.360
I mean, this is the same.

53:21.360 --> 53:23.840
So this is also some thresholds, which?

53:23.840 --> 53:24.680
Exactly.

53:24.680 --> 53:27.760
Like here, for example, is 72 for the load,

53:27.760 --> 53:29.920
because that's the number, of course,

53:29.920 --> 53:36.720
that should be the optimal load using all the physical cores.

53:36.720 --> 53:39.520
OK, so that's maximum here, actually, anyway.

53:39.520 --> 53:41.280
In that case, it's maximum.

53:41.280 --> 53:48.560
And for other metrics, it is some reasonable value.

53:48.560 --> 53:53.840
OK, and then you also see some IB receive packets.

53:53.840 --> 53:56.080
So that's, I think, for the network.

53:56.080 --> 53:57.960
That's for the network, exactly.

53:57.960 --> 54:04.040
OK, and what would one expect there?

54:04.040 --> 54:05.480
I mean, what is this?

54:05.480 --> 54:07.000
That is for network.

54:07.000 --> 54:12.280
This is hard, because there, I mean, for those metrics here,

54:12.280 --> 54:13.200
higher is better.

54:13.200 --> 54:15.680
You want to have a good resource utilization.

54:15.680 --> 54:20.080
For network, probably, ideally, you do not

54:20.080 --> 54:21.560
want to communicate, right?

54:21.560 --> 54:23.920
Any communication is overhead.

54:23.920 --> 54:31.520
So this is more for information.

54:31.520 --> 54:37.280
I mean, if your code has very low utilization

54:37.280 --> 54:41.000
of basic resources, like flop throughput and memory bandwidth,

54:41.000 --> 54:46.760
and has very high values here in communication,

54:46.760 --> 54:49.800
then obviously, that is also an indication

54:49.800 --> 54:51.240
that something goes wrong.

54:51.240 --> 54:52.720
One thing to look at, for example,

54:52.720 --> 54:55.960
is the number of received or sent packets per second.

54:55.960 --> 54:57.600
But that's where the cursor is right now,

54:57.600 --> 54:59.800
on the mouse pointer.

54:59.800 --> 55:02.240
If that is of the order of hundreds of thousands

55:02.240 --> 55:05.920
per second, that usually indicates a problem,

55:05.920 --> 55:08.960
because the MPI latency is of the order of one to two

55:08.960 --> 55:12.200
microseconds, that means you're firing out packets

55:12.200 --> 55:14.280
as fast as it can.

55:14.280 --> 55:16.560
And that's usually not a good sign, for example.

55:16.560 --> 55:18.680
But it's really hard to tell.

55:18.680 --> 55:22.240
So the unit is just packets per second.
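
The rule of thumb above can be put into numbers; a small sketch, where the latency figure is the rough one-to-two-microsecond value mentioned, not a measurement:

```python
# Sketch of the rule of thumb: with an MPI latency of roughly
# 1-2 microseconds, one sender can issue at most about 0.5-1
# million small messages per second. An observed packet rate close
# to that ceiling suggests the code is firing packets as fast as
# it can, i.e. it is latency-bound.

def max_packet_rate(latency_seconds):
    # upper bound on the small-message rate for one sender
    return 1.0 / latency_seconds

def looks_latency_bound(observed_rate, latency_seconds, fraction=0.5):
    # "hundreds of thousands per second" ~ a large fraction of the cap
    return observed_rate >= fraction * max_packet_rate(latency_seconds)

print(max_packet_rate(2e-6))               # -> 500000.0
print(looks_latency_bound(400_000, 2e-6))  # -> True
```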

55:22.240 --> 55:24.080
OK.

55:24.080 --> 55:26.560
And the other, for example, the left plot

55:26.560 --> 55:29.120
would be then bytes per second.

55:29.120 --> 55:31.800
Bytes per second, for example, the middle one

55:31.800 --> 55:33.160
in the current screen.

55:33.160 --> 55:34.440
OK.

55:34.440 --> 55:36.240
I see.

55:36.240 --> 55:39.920
And is some documentation available?

55:39.920 --> 55:41.480
Maybe I missed it?

55:41.480 --> 55:42.040
No.

55:42.040 --> 55:47.760
The metrics and the names are not yet documented.

55:47.760 --> 55:52.840
In most cases, it should be quite obvious.

55:52.840 --> 55:56.520
But in some cases, it's not.

55:56.520 --> 55:57.020
OK.

55:57.020 --> 55:59.000
Memory bandwidth is, I think, fine.

55:59.000 --> 56:00.800
OK.

56:00.800 --> 56:01.300
OK.

56:01.300 --> 56:02.600
But this was very helpful, actually.

56:02.600 --> 56:03.080
Yeah.

56:03.080 --> 56:03.680
Thanks a lot.

56:03.680 --> 56:04.960
OK.

56:04.960 --> 56:07.240
Actually, this discussion also answered one of the questions

56:07.240 --> 56:08.320
in the chat.

56:08.320 --> 56:11.400
Somebody asked whether it's possible to have

56:11.400 --> 56:14.240
like an overview of how much time a process spends

56:14.240 --> 56:16.640
communicating.

56:16.640 --> 56:19.480
We don't provide that, but we provide the InfiniBand metrics.

56:19.480 --> 56:22.840
So you could infer sort of something similar to that

56:22.840 --> 56:24.080
from the data.

56:24.080 --> 56:27.040
So with regard to when you do want

56:27.040 --> 56:28.360
to optimize your application, this

56:28.360 --> 56:32.280
is not a replacement for something like a trace

56:32.280 --> 56:33.680
tool or something.

56:33.680 --> 56:36.600
It's just to give you a rough overview

56:36.600 --> 56:40.800
how your job performs and if anything goes severely wrong.

56:40.800 --> 56:44.640
If you really want to profile your application in detail,

56:44.640 --> 56:51.520
you have to use some dedicated analysis or tracing tool.

56:51.520 --> 56:54.800
Another question in the chat about the HPC portal.

56:54.800 --> 56:58.720
Will this also be available for FAU resources like TinyFAT?

56:58.720 --> 57:02.080
Or will it remain limited to NHR?

57:02.080 --> 57:03.480
Yeah, Thomas has to answer.

57:03.480 --> 57:06.320
I think the plan is to port it.

57:06.320 --> 57:09.840
The plan is to port it also for Tier-3 users.

57:09.840 --> 57:12.160
Right now, this is not possible.


