buster-dev / buster /data /documents.csv

hbertrand

end to end working

05dabf4 over 3 years ago


	name,url,text
	AI tooling and methodology handbook,https://docs.mila.quebec/Handbook.html#ai-tooling-and-methodology-handbook,"AI tooling and methodology handbook
	This section seeks to provide researchers with insightful articles pertaining to
	aspects of methodology in their work.
	"
	What is a computer cluster?,https://docs.mila.quebec/Theory_cluster.html#what-is-a-computer-cluster,"What is a computer cluster?
	A computer cluster is a set
	of loosely or tightly connected computers that work together so that, in many
	respects, they can be viewed as a single system.
	"
	Parts of a computing cluster,https://docs.mila.quebec/Theory_cluster.html#parts-of-a-computing-cluster,"Parts of a computing cluster
	To provide high performance computation capabilities, clusters can
	combine hundreds to thousands of computers, called nodes, which are all
	inter-connected with a high-performance communication network. Most nodes are
	designed for high-performance computations, but clusters can also use
	specialized nodes to offer parallel file systems, databases, login nodes and
	even the cluster scheduling functionality as pictured in the image below.

	We will overview the different types of nodes which you can encounter on a
	typical cluster.
	"
	The login nodes,https://docs.mila.quebec/Theory_cluster.html#the-login-nodes,"The login nodes
	To execute computing processes on a cluster, you must first connect to a
	cluster and this is accomplished through a login node. These so-called
	login nodes are the entry point to most clusters.
	Another entry point to some clusters such as the Mila cluster is the JupyterHub
	web interface, but we’ll read about that later. For now let’s return to the
	subject of this section: login nodes. To connect to these, you would typically
	use a remote shell connection. The most usual tool to do so is SSH. You’ll hear
	and read a lot about this tool. Imagine it as a very long (and somewhat
	magical) extension cord which connects the computer you are using now, such as
	your laptop, to a remote computer’s terminal shell. You might already know what
	a terminal shell is if you ever used the command line.
	"
	The compute nodes,https://docs.mila.quebec/Theory_cluster.html#the-compute-nodes,"The compute nodes
	In the field of artificial intelligence, you will usually be on the hunt for
	GPUs. In most clusters, the compute nodes are the ones with GPU capacity.
	While there is a general tendency towards a homogeneous configuration
	for nodes, this is not always possible in the field of artificial intelligence
	as the hardware evolves rapidly and is being complemented by new hardware and so
	on. Hence, you will often read about computational node classes, some of which
	might have different GPU models or even no GPU at all. For the Mila cluster you
	will find this information in the Node profile description section. For
	now, keep in mind that you should be aware of which nodes your code is
	running on. More on that later.
	"
	The storage nodes,https://docs.mila.quebec/Theory_cluster.html#the-storage-nodes,"The storage nodes
	Some computers on a cluster function to only store and serve files. While the
	name of these computers might matter to some, as a user, you’ll only be
	concerned about the path to the data. More on that in the Processing data section.
	"
	Different nodes for different uses,https://docs.mila.quebec/Theory_cluster.html#different-nodes-for-different-uses,"Different nodes for different uses
	It is important to note here the difference in intended uses between the
	compute nodes and the login nodes. While the compute nodes are meant for heavy
	computation, the login nodes are not.
	The login nodes however are used by everyone who uses the cluster and care must
	be taken not to overburden these nodes. Consequently, only very short and light
	processes should be run on these, otherwise the cluster may become inaccessible.
	In other words, please refrain from executing long or compute-intensive
	processes on login nodes because it affects all other users. In some cases, you
	will also find that doing so might get you into trouble.
	"
	UNIX,https://docs.mila.quebec/Theory_cluster.html#unix,"UNIX
	Clusters typically run on GNU/Linux distributions. Hence a minimum
	knowledge of GNU/Linux and BASH is usually required to use them. See the
	following tutorial
	for a rough guide on getting started with Linux.
	"
	The workload manager,https://docs.mila.quebec/Theory_cluster.html#the-workload-manager,"The workload manager
	On a cluster, users don’t have direct access to the compute nodes but
	instead connect to a login node and add jobs to the workload manager
	queue. Whenever there are resources available to execute these jobs
	they will be allocated to a compute node and run, which can be
	immediately or after a wait of up to several days.
	A job is comprised of a number of steps that will run one after the
	other. This is done so that you can schedule a sequence of processes
	that can use the results of the previous steps without having to
	manually interact with the scheduler.
	Each step can have any number of tasks which are groups of processes
	that can be scheduled independently on the cluster but can run in
	parallel if there are resources available. The distinction between
	steps and tasks is that multiple tasks, if they are part of the same
	step, cannot depend on results of other tasks because there are no
	guarantees on the order in which they will be executed.
	Finally each process group is the basic unit that is scheduled in the
	cluster. It comprises a set of processes (or threads) that can run
	on a number of resources (CPU, GPU, RAM, …) and are scheduled
	together as a unit on one or more machines.
	Each of these concepts lends itself to a particular use. For multi-gpu
	training in AI workloads you would use one task per GPU for data
	parallelism or one process group if you are doing model
	parallelism. Hyperparameter optimisation can be done using a
	combination of tasks and steps but is probably better left to a
	framework outside of the scope of the workload manager.
	If this all seems complicated, you should know that all these things
	do not always need to be used. It is perfectly acceptable to submit
	jobs with a single step, a single task and a single process.
	The available resources on the cluster are not infinite and it is the
	workload manager’s job to allocate them. Whenever a job request comes
	in and there are not enough resources available to start it
	immediately, it will go in the queue.
	Once a job is in the queue, it will stay there until another job
	finishes and then the workload manager will try to use the newly freed
	resources with jobs from the queue. The exact order in which the jobs
	will start is not fixed, because it depends on the local policies
	which can take into account the user priority, the time since the job
	was requested, the amount of resources requested and possibly other
	things. There should be a tool that comes with the manager where you
	can see the status of your queued jobs and why they remain in the
	queue.
	The workload manager will divide the cluster into partitions according
	to the configuration set by the admins. A partition is a set of
	machines typically reserved for a particular purpose. An example might
	be CPU-only machines for preprocessing set up as a separate partition.
	It is possible for multiple partitions to share resources.
	There will always be at least one partition that is the default
	partition in which jobs without a specific request will go. Other
	partitions can be requested, but might be restricted to a group of
	users, depending on policy.
	Partitions are useful from a policy standpoint to ensure efficient use
	of the cluster resources and avoid using up too much of one resource
	type blocking use of another. They are also useful for heterogeneous
	clusters where different hardware is mixed in and not all software is
	compatible with all of it (for example x86 and POWER cpus).
	To ensure a fair share of the computing resources for all, the workload
	manager establishes limits on the amount of resources that a single
	user can use at once. These can be hard limits which prevent running
	jobs when you go over or soft limits which will let you run jobs, but
	only until some other job needs the resources.
	Admin policy will determine what those exact limits are for a
	particular cluster or user and whether they are hard or soft limits.
	The way soft limits are enforced is using preemption, which means that
	when another job with higher priority needs the resources that your
	job is using, your job will receive a signal that it needs to save its
	state and exit. It will be given a certain amount of time to do this
	(the grace period, which may be 0s) and then forcefully terminated if
	it is still running.
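As a minimal sketch of how a job might react to such a signal, the Python snippet below installs a handler and checks a flag between training steps. It assumes the preemption signal is SIGTERM, and save_checkpoint is a hypothetical placeholder for whatever state saving your job needs. The actual signal and grace period depend on the workload manager and the cluster configuration.

```python
import signal

# Flag set by the signal handler; the main loop polls it between steps.
stop_requested = False


def request_stop(signum, frame):
    # Only record the request here; the heavy work happens in the main loop.
    global stop_requested
    stop_requested = True


def save_checkpoint(step):
    # Hypothetical placeholder: a real job would write model state to disk.
    return {'step': step}


def train(max_steps):
    # Assumption: the scheduler signals preemption with SIGTERM.
    signal.signal(signal.SIGTERM, request_stop)
    checkpoint = None
    for step in range(max_steps):
        if stop_requested:
            # Save state and exit cleanly before the grace period ends.
            checkpoint = save_checkpoint(step)
            break
        # ... one training step would run here ...
    return checkpoint
```

Keeping the handler small and doing the saving in the main loop avoids interrupting a training step halfway through.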
	Depending on the workload manager in use and the cluster configuration
	a job that is preempted like this may be automatically rescheduled to
	have a chance to finish or it may be up to the job to reschedule
	itself.
	The other limit you can encounter is with a job that goes over its
	declared limits. When you schedule a job, you declare how many
	resources it will need (RAM, CPUs, GPUs, …). Some of those may have
	default values and not be explicitly defined. For certain types of
	devices, like GPUs, access to units over your job limit is made
	unavailable. For others, like RAM, usage is monitored and your job
	will be terminated if it goes too much over. This makes it important
	to ensure you estimate resource usage accurately.
	Mila as well as Digital Research Alliance of Canada use the workload
	manager Slurm to schedule and
	allocate resources on their infrastructure.
	Slurm client commands are available on the login nodes for you to submit
	jobs to the main controller and add your job to the queue. Jobs are of 2 types:
	batch jobs and interactive jobs.
	For practical examples of Slurm commands on the Mila cluster, see Running your code."
	Processing data,https://docs.mila.quebec/Theory_cluster.html#processing-data,"Processing data
	For processing large amounts of data common for deep learning, either
	for dataset preprocessing or training, several techniques exist. Each
	has typical uses and limitations.
	"
	Data parallelism,https://docs.mila.quebec/Theory_cluster.html#data-parallelism,"Data parallelism
	The first technique is called data parallelism (aka task
	parallelism in formal computer science). You simply run lots of
	processes each handling a portion of the data you want to
	process. This is by far the easiest technique to use and should be
	favored whenever possible. A common example of this is
	hyperparameter optimisation.
	For really small computations the time to set up multiple processes
	might be longer than the processing time and lead to waste. This can
	be addressed by bunching up some of the processes together by doing
	sequential processing of sub-partitions of the data.
	On cluster systems it is also inadvisable to launch thousands of
	jobs. Even if each job would run for a reasonable amount of time
	(several minutes at minimum), it is best to make larger groups
	until the number of jobs is in the low hundreds at most.
	Finally another thing to keep in mind is that the transfer bandwidth
	is limited between the filesystems (see Filesystem concerns)
	and the compute nodes and if you run too many jobs using too much data
	at once they may end up not being any faster because they will spend
	their time waiting for data to arrive.
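The grouping of small work items into fewer jobs, as described in this section, can be sketched in Python as follows. The function name group_into_jobs and the default limit of 200 jobs are illustrative assumptions, not a cluster rule.

```python
def group_into_jobs(items, max_jobs=200):
    # Split a list of work items into at most max_jobs contiguous groups.
    # Each group would be processed sequentially inside a single job.
    if max_jobs < 1:
        raise ValueError('max_jobs must be at least 1')
    n_jobs = min(max_jobs, len(items))
    groups = []
    start = 0
    for job_index in range(n_jobs):
        # Spread the remainder over the first groups so sizes differ by at most 1.
        extra = 1 if job_index < len(items) % n_jobs else 0
        size = len(items) // n_jobs + extra
        groups.append(items[start:start + size])
        start += size
    return groups
```

For example, 5000 small work items grouped with max_jobs=200 give 200 jobs of 25 items each.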
	"
	Model parallelism,https://docs.mila.quebec/Theory_cluster.html#model-parallelism,"Model parallelism
	The second technique is called model parallelism (which doesn’t
	have a single equivalent in formal computer science). It is used
	mostly when a single instance of a model will not fit in a computing
	resource (such as the GPU memory being too small for all the
	parameters).
	In this case, the model is split into its constituent parts, each
	processed independently and their intermediate results communicated
	with each other to arrive at a final result.
	This is generally harder but necessary to work with larger, more
	powerful models like GPT.
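The idea of splitting a model into parts can be illustrated with a toy Python sketch. Plain functions stand in for layers, and all stages run in one process here; in a real setup each stage would live on its own device and the intermediate value would be sent over the interconnect. The names run_stage and run_model_parallel are hypothetical.

```python
def run_stage(layers, value):
    # Apply the layers held by one device, in order.
    for layer in layers:
        value = layer(value)
    return value


def run_model_parallel(layers, n_stages, value):
    # Split the layer list into n_stages contiguous parts, one per device.
    # The output of each stage is the intermediate result that would be
    # communicated to the device holding the next stage.
    if not layers:
        return value
    per_stage = -(-len(layers) // n_stages)  # ceiling division
    for start in range(0, len(layers), per_stage):
        value = run_stage(layers[start:start + per_stage], value)
    return value
```

The split version computes the same result as running all layers in one place; only the placement of the work and the communication between stages differ.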
	"
	Communication concerns,https://docs.mila.quebec/Theory_cluster.html#communication-concerns,"Communication concerns
	The main difference between these two approaches is the need for
	communication between the multiple processes. Some common training
	methods, like stochastic gradient descent, sit somewhere between the
	two, because they require some communication, but not a lot. Most
	people classify it as data parallelism since it sits closer to that
	end.
	In general for data parallelism tasks or tasks that communicate
	infrequently it doesn’t make a lot of difference where the processes
	sit because the communication bandwidth and latency will not have a
	lot of impact on the time it takes to complete the job. The
	individual tasks can generally be scheduled independently.
	On the contrary for model parallelism you need to pay more attention
	to where your tasks are. In this case it is usually required to use
	the facilities of the workload manager to group the tasks so that they
	are on the same machine or machines that are closely linked to ensure
	optimal communication. The best allocation depends on the
	specific cluster architecture available and the technologies it
	supports (such as InfiniBand, RDMA, NVLink or others).
	"
	Filesystem concerns,https://docs.mila.quebec/Theory_cluster.html#filesystem-concerns,"Filesystem concerns
	When working on a cluster, you will generally encounter several
	different filesystems. Usually there will be names such as ‘home’,
	‘scratch’, ‘datasets’, ‘projects’, ‘tmp’.
	The reason for having different filesystems available instead of a
	single giant one is to provide for different use cases. For example,
	the ‘datasets’ filesystem would be optimized for fast reads but have
	slow write performance. This is because datasets are usually written
	once and then read very often for training.
	Different filesystems have different performance levels. For instance, backed
	up filesystems (such as $PROJECT in Digital Research Alliance of Canada
	clusters) provide more space and can handle large files but cannot sustain
	highly parallel accesses typically required for high speed model training.
	The set of filesystems provided by the cluster you are using should be
	detailed in the documentation for that cluster and the names can
	differ from those above. You should pay attention to their recommended
	use case in the documentation and use the appropriate filesystem for
	the appropriate job. There are cases where a job ran hundreds of times
	slower because it tried to use a filesystem that wasn’t a good fit for
	the job.
	One last thing to pay attention to is the data retention policy for
	the filesystems. This has two subpoints: how long is the data kept
	for, and are there backups.
	Some filesystems will have a limit on how long they keep their
	files. Typically the limit is some number of days (like 90 days) but
	can also be ‘as long as the job runs’ for some.
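To get a rough idea of which files might be affected by such a limit, a small Python sketch can list files that have not been modified for a given number of days. This assumes the purge is based on modification time, which may not match the policy of a given cluster; check its documentation for the actual criterion.

```python
import os
import time


def files_older_than(root, max_age_days, now=None):
    # List files under root whose last modification is older than max_age_days.
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 24 * 60 * 60
    stale = []
    for folder, _subfolders, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(folder, filename)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)
```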
	As for backups, some filesystems will not have a limit for data, but
	will also not have backups. For those it is important to maintain a
	copy of any crucial data somewhere else. The data will not be
	purposefully deleted, but the filesystem may fail and lose all or part
	of its data. If you have any data that is crucial for a paper or your
	thesis keep an additional copy of it somewhere else.
	"
	Software on the cluster,https://docs.mila.quebec/Theory_cluster.html#software-on-the-cluster,"Software on the cluster
	This section aims to raise awareness of problems one can encounter when trying
	to run software on different computers and how this is dealt with on typical
	computation clusters.
	The Mila cluster and the Digital Research Alliance of Canada clusters both
	provide various useful software and computing environments, which can be
	activated through the module system. Alternatively, you may build containers
	with your desired software and run them on compute nodes.
	Regarding Python development, we recommend using virtual environments to install
	Python packages in isolation.
	"
	Cluster software modules,https://docs.mila.quebec/Theory_cluster.html#cluster-software-modules,"Cluster software modules
	Modules are small files which modify your environment variables to point to
	specific versions of various software and libraries. For instance, a module
	might provide the python command to point to Python 3.7, another might
	activate CUDA version 11.0, another might provide the torch package, and so
	on.
	For more information, see The module command.
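What loading a module does to the environment can be sketched in Python as a simple path update. This is only an illustration of the principle, not how the module system is implemented, and the directory used in the usage note is a hypothetical path.

```python
import os


def prepend_to_path(environment, variable, directory):
    # Return a copy of the environment with directory placed first in variable.
    updated = dict(environment)
    current = updated.get(variable, '')
    updated[variable] = directory + os.pathsep + current if current else directory
    return updated
```

For instance, prepend_to_path(os.environ, 'PATH', '/opt/python/3.7/bin') returns an environment in which a python executable in that directory would be found first.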
	"
	Containers,https://docs.mila.quebec/Theory_cluster.html#containers,"Containers
	Containers are a special form of isolation of software and its dependencies. A
	container is essentially a lightweight virtual machine: it encapsulates a
	virtual file system for a full OS installation, as well as a separate network
	and execution environment.
	For example, you can create an Ubuntu container in which you install various
	packages using apt, modify settings as you would as a root user, and so on,
	but without interfering with your main installation. Once built, a container can
	be run on any compatible system.
	For more information, see Using containers on clusters.
	"
	Python Virtual environments,https://docs.mila.quebec/Theory_cluster.html#python-virtual-environments,"Python Virtual environments
	A virtual environment in Python is a local, isolated environment in which you
	can install or uninstall Python packages without interfering with the global
	environment (or other virtual environments). In order to use a virtual
	environment, you first have to activate it.
	For more information, see Virtual environments.
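As a minimal sketch, a virtual environment can be created with the venv module from the Python standard library. The directory name env is an arbitrary choice, and the temporary directory is only used to keep the example self-contained.

```python
import os
import tempfile
import venv


def create_environment(directory):
    # Create an isolated environment; with_pip=False keeps the example fast
    # and avoids depending on ensurepip being available.
    venv.EnvBuilder(with_pip=False).create(directory)
    return os.path.join(directory, 'pyvenv.cfg')


with tempfile.TemporaryDirectory() as workdir:
    config_path = create_environment(os.path.join(workdir, 'env'))
    created = os.path.exists(config_path)
```

On Linux, an environment created this way is activated from a shell with source <directory>/bin/activate.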
	"
	"Who, what, where is IDT",https://docs.mila.quebec/IDT.html#who-what-where-is-idt,"Who, what, where is IDT
	This section seeks to help Mila researchers understand the mission and role of
	the IDT team.
	"
	IDT’s mission,https://docs.mila.quebec/IDT.html#idt-s-mission,"IDT’s mission

	"
	The IDT team,https://docs.mila.quebec/IDT.html#the-idt-team,"The IDT team
	See https://mila.quebec/en/mila/team/?cat_id=143
	"
	Purpose of this documentation,https://docs.mila.quebec/Purpose.html#purpose-of-this-documentation,"Purpose of this documentation
	This documentation aims to cover the information required to run scientific
	and data-intensive computing tasks at Mila and the available resources for its
	members.
	It also aims to be an outlet for sharing know-how, tips and tricks and examples
	from the IDT team to the Mila researcher community.
	"
	Intended audience,https://docs.mila.quebec/Purpose.html#intended-audience,"Intended audience
	This documentation is mainly intended for Mila researchers having access to the
	Mila cluster. This access is determined by your researcher status. See
	Roles and authorizations for more information. The core of the
	information with this purpose can be found in the following section:
	Computing infrastructure and policies.
	However, we also aim to provide more general information which can be useful
	outside the scope of using the Mila cluster. For instance, more general theory
	on computational considerations and such. In this perspective, we hope the
	documentation can be of use for all Mila members.
	"
	Contributing,https://docs.mila.quebec/Purpose.html#contributing,"Contributing
	See the following file for contribution guidelines:
	# Contributing to the Mila Docs

	Thank you for your interest in making better documentation for all at Mila.

	Here are some guidelines to help bring your contributions to life.

	## What should be included in the Mila Docs

	* Mila cluster usage
	* Digital Research Alliance of Canada cluster usage
	* Job management tips / tricks
	* Research good practices
	* Software development good practices
	* Useful tools

	_NOTE_: Examples should aim to not consume much more than 1 GPU/hour and 2 CPU/hour

	## Issues / Pull Requests

	### Issues

	Issues can be used to report any error in the documentation, missing or unclear
	sections, broken tools or other suggestions to improve the overall
	documentation.

	### Pull Requests

	PRs are welcome and we value the contents of contributions over the appearance
	or functionality of the pull request. If you don't know how to write the proper
	markup in reStructuredText, simply provide the content you would like to add in
	the PR text form which supports markdown or with instructions to format the
	content. In the PR, reference the related issues like this:

	```
	Resolves: #123
	See also: #456, #789
	```

	If you would like to contribute directly in the code of the documentation, keep
	the line width to 80 characters or less. You can attempt to build the docs
	yourself to see if the formatting is right:

	```console
	python3 -m pip install -r docs/requirements.txt
	sphinx-build -b html docs/ docs/_build/
	```

	This will produce the html version of the documentation which you can navigate
	by opening the local file `docs/_build/index.html`.

	If you have any trouble building the docs, don't hesitate to open an issue to
	request help.

	Regarding the reStructuredText format, you can simply provide the content
	you would like to add in markdown or plain text format if more convenient
	for you and someone down the line should take responsibility to convert
	the format.

	## Sphinx / reStructuredText (reST)

	The markup language used for the Mila Docs is
	[reStructuredText](http://docutils.sourceforge.net/rst.html) and we follow the
	[Python’s Style Guide for documenting](https://docs.python.org/devguide/documenting.html#style-guide).

	Here are some reST syntax directives which are useful to know
	(more can be found in
	[Sphinx's reST Primer](https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html)):


	### Inline markup

	* one asterisk: `*text*` for emphasis (italics),
	* two asterisks: `**text**` for strong emphasis (boldface),
	* backquotes: ` ``text`` ` for `code samples`, and
	* external links: `` `Link text <http://target>`_ ``.

	### Lists

	```reST
	* this is
	* a list

	  * with a nested list
	  * and some subitems

	* and here the parent list continues
	```

	### Sections

	```reST
	#################
	This is a heading
	#################
	```

	There are no heading levels assigned to certain characters as the structure is
	determined from the succession of headings. However, the Python documentation
	suggests the following convention:

	* `#` with overline, for parts
	* `*` with overline, for chapters
	* `=`, for sections
	* `-`, for subsections
	* `^`, for subsubsections
	* `""`, for paragraphs

	### Note box

	```reST
	.. note:: This is a long
	   long long note
	```

	### Collapsible boxes

	This is a local extension, not part of Sphinx itself. It works like this:

	```reST
	.. container:: toggle

	    .. container:: header

	        Show/Hide Code

	    .. code-block:: <type>
	       ...
	```


	"
	Visual Studio Code,https://docs.mila.quebec/VSCode.html#visual-studio-code,"Visual Studio Code
	One editor of choice for many researchers is VSCode. One feature of VSCode is
	remote editing through SSH. This allows you to edit files on the cluster as if
	they were local. You can also debug your programs using VSCode’s debugger, open
	terminal sessions, etc.
	"
	Connecting to the cluster,https://docs.mila.quebec/VSCode.html#connecting-to-the-cluster,"Connecting to the cluster
	VSCode cannot be used to edit code on the login nodes, because it is a heavy
	enough process (a node process, plus the language server, linter, and
	possibly other plugins depending on your configured environment) that there is a
	risk of overloading the login nodes if too many researchers did it at the same
	time.
	Therefore, to use VSCode on the cluster, you first need to allocate a compute
	node, then connect to that node.
	The milatools package provides a command to make the operation easier. More
	info can be found here.
	"
	Activating an environment,https://docs.mila.quebec/VSCode.html#activating-an-environment,"Activating an environment
	Reference
	To activate a conda or pip environment, you can open the command palette with
	Ctrl+Shift+P and type “Python: Select interpreter”. This will prompt you for the
	path to the Python executable for your environment.

	Tip
	If you already have the environment activated in a terminal session, you can
	run the command which python to get the path for this environment. This
	path can be pasted into the interpreter selection prompt in VSCode to use
	that same environment.
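The same path can also be obtained from within Python itself, which may be handy when the environment is already active in a notebook or script. A minimal sketch using only the standard library:

```python
import sys


def interpreter_path():
    # Path of the Python executable running this code.
    return sys.executable


print(interpreter_path())
```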

	"
	Troubleshooting,https://docs.mila.quebec/VSCode.html#troubleshooting,"Troubleshooting
	"
	“Cannot reconnect”,https://docs.mila.quebec/VSCode.html#cannot-reconnect,"“Cannot reconnect”
	When connecting to multiple compute nodes (and/or from multiple computers), some
	instances may crash with that message because of conflicts in the lock files
	VSCode installs in ~/.vscode-server (which is shared on all compute nodes).
	To fix this issue, you can change this setting in your settings.json file:
	{ ""remote.SSH.lockfilesInTmp"": true }


	This will store the necessary lockfiles in /tmp on the compute nodes (which
	are local to the node).
	"
	Debugger timeouts,https://docs.mila.quebec/VSCode.html#debugger-timeouts,"Debugger timeouts
	Sometimes, slowness on the compute node or the networked filesystem might cause
	the VSCode debugger to time out when starting a remote debug process. As a quick
	fix, you can add this to your ~/.bashrc or ~/.profile or equivalent
	resource file for your preferred shell, to increase the timeout delay to 500
	seconds:
	export DEBUGPY_PROCESS_SPAWN_TIMEOUT=500


	"
	Computational resources outside of Mila,https://docs.mila.quebec/Extra_compute.html#computational-resources-outside-of-mila,"Computational resources outside of Mila
	This section seeks to provide insights and information on computational
	resources outside the Mila cluster itself.
	"
	Digital Research Alliance of Canada Clusters,https://docs.mila.quebec/Extra_compute.html#digital-research-alliance-of-canada-clusters,"Digital Research Alliance of Canada Clusters
	The clusters named Beluga, Cedar, Graham, Narval and Niagara are
	clusters provided by the Digital Research Alliance of Canada organisation (the Alliance). For Mila researchers, these
	clusters are to be used for larger experiments having many jobs, multi-node
	computation and/or multi-GPU jobs as well as long running jobs. If you use
	these resources for your research, please remember to acknowledge their use in
	your papers.

	Note
	Compute Canada ceased its operational responsibilities for supporting Canada’s
	national advanced research computing (ARC) platform on March 31, 2022. The services
	will be supported by the new Digital Research Alliance of Canada.
	https://ace-net.ca/compute-canada-operations-move-to-the-digital-research-alliance-of-canada-(the-alliance).html

	"
	Current allocation description,https://docs.mila.quebec/Extra_compute.html#current-allocation-description,"Current allocation description
	Clusters of the Alliance are shared with researchers across the country.
	Allocations are given by the Alliance to selected research groups to ensure
	a minimal amount of computational resources throughout the year.
	Depending on your affiliation, you will have access to different allocations. If
	you are a student at University of Montreal, you can have access to the
	rrg-bengioy-ad allocation described below. For students from other
	universities, you should ask your advisor to know which allocations you could
	have access to.
	From the Alliance’s documentation: An allocation is an amount of resources
	that a research group can target for use for a period of time, usually a year.
	To be clear, it is not a maximal amount of resources that can be used
	simultaneously, it is a weighting factor of the workload manager to balance
	jobs. For instance, even though we are allocated 400 GPU-years across all
	clusters, we can use more or less than 400 GPUs simultaneously depending on the
	history of usage from our group and other groups using the cluster at a given
	period of time. Please see the Alliance’s documentation for
	more information on how allocations and resource scheduling are configured for
	these installations.
	The table below provides information on the allocation for
	rrg-bengioy-ad for the period which spans from April 2022 to
	April 2023. Note that there are no special allocations for GPUs on
	Graham and therefore jobs with GPUs should be submitted with the
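	account def-bengioy.

	=======  ======  ==============  =========  ======  ====================  ==============
	Cluster  CPUs #  CPUs account    GPU model  GPUs #  SLURM type specifier  GPUs account
	=======  ======  ==============  =========  ======  ====================  ==============
	Beluga   238     rrg-bengioy-ad  V100-16G   77      v100                  rrg-bengioy-ad
	Cedar    34      rrg-bengioy-ad  V100-32G   138     v100l                 rrg-bengioy-ad
	Graham   34      rrg-bengioy-ad  various    –       –                     def-bengioy
	Narval   34      rrg-bengioy-ad  A100-40G   185     a100                  rrg-bengioy-ad
	=======  ======  ==============  =========  ======  ====================  ==============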



	"
	Account Creation,https://docs.mila.quebec/Extra_compute.html#account-creation,"Account Creation
	To access the Alliance clusters you have to first create an account at
	https://ccdb.computecanada.ca. Use a password with at least 8 characters, mixed
	case letters, digits and special characters. Later you will be asked to create
	another password with those rules, and it is convenient if the two
	passwords are the same.
	Then, you have to apply for a role at
	https://ccdb.computecanada.ca/me/add_role, which basically means telling the
	Alliance that you are part of the lab so they know which cluster you can have
	access to, and track your usage.
	You will be asked for the CCRI (See screenshot below). Please reach out to your
	sponsor to get the CCRI.

	You will need to wait for your sponsor to accept before being able to log in
	to the Alliance clusters.
	"
	Clusters,https://docs.mila.quebec/Extra_compute.html#clusters,"Clusters

	Beluga:(Mila doc)
	(Digital Research Alliance of Canada doc)
	For most students, Beluga is the best choice for both CPU and GPU jobs because
	of larger allocations on this cluster.

	Narval:(Mila doc)
	(Digital Research Alliance of Canada doc)
	Narval is the newest cluster, and contains the most powerful GPUs (A100). If your
	job can benefit from the A100’s features, such as TF32 floating-point math, Narval
	is the best choice.

	Cedar:(Mila doc)
	(Digital Research Alliance of Canada doc)
	Cedar is a good alternative to Beluga if you absolutely need to have an internet connection
	on the compute nodes.

	Graham:(Mila doc)
	(Digital Research Alliance of Canada doc)
	We do not have a GPU allocation on Graham anymore but it remains an alternative for CPU jobs.

	Niagara:(Mila doc)
	(Digital Research Alliance of Canada doc)
	Niagara is not recommended for most students. It is a CPU-only cluster with unusual
	configurations. Access is not automatic; it is opt-in and must be requested via
	CCDB manually. Compute resources in Niagara are not assigned to jobs on a per-CPU,
	but on a per-node basis.


	"
	Beluga,https://docs.mila.quebec/Extra_compute.html#beluga,"Beluga
	Beluga is a cluster located at ÉTS in Montreal. It
	uses SLURM to schedule jobs. Its full documentation can be found here, and its current status
	here.
	You can access Beluga via ssh:
	ssh <user>@beluga.computecanada.ca
	Where <user> is the username you created previously (see Account Creation).
	"
	Launching Jobs,https://docs.mila.quebec/Extra_compute.html#launching-jobs,"Launching Jobs
	Users must specify the resource allocation Group Name using the flag
	--account=rrg-bengioy-ad. To launch a CPU-only job:
	sbatch --time=1:0:0 --account=rrg-bengioy-ad job.sh

	Note
	The account name will differ based on your affiliation.

	To launch a GPU job:
	sbatch --time=1:0:0 --account=rrg-bengioy-ad --gres=gpu:1 job.sh
	And to get an interactive session, use the salloc command:
	salloc --time=1:0:0 --account=rrg-bengioy-ad --gres=gpu:1
	The full documentation for launching jobs on Beluga can be found here.
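The --time values above use SLURM's hours:minutes:seconds form, so 1:0:0 requests one hour. As a rough sketch of how such values are read, the hypothetical helper below (slurm_time_to_seconds is not a SLURM command) converts the minutes, minutes:seconds and hours:minutes:seconds forms to seconds. The day-based forms that SLURM also accepts are left out.

```shell
#!/bin/bash
# Convert a SLURM --time value to seconds.
# Handles: minutes, minutes:seconds, hours:minutes:seconds.
# Hypothetical helper for illustration; day-based forms are not handled.
slurm_time_to_seconds() {
    local spec=$1 h=0 m=0 s=0 rest
    case $spec in
        *:*:*)  h=${spec%%:*}; rest=${spec#*:}; m=${rest%%:*}; s=${rest#*:} ;;
        *:*)    m=${spec%%:*}; s=${spec#*:} ;;
        *)      m=$spec ;;
    esac
    # 10# forces base 10 so that values such as 08 are not read as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

slurm_time_to_seconds 1:0:0   # one hour, as in the sbatch examples above
```

Checking a value this way before submitting can help avoid asking for ten minutes when ten hours were intended.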
	"
	Beluga nodes description,https://docs.mila.quebec/Extra_compute.html#beluga-nodes-description,"Beluga nodes description
	Each GPU node consists of:

	40 CPU cores
	186 GB RAM
	4 GPU NVIDIA V100 (16GB)


	Tip
	Request at most 10 CPU cores and 32 GB of RAM per GPU you are
	requesting (as explained here);
	otherwise, your job will count for more than 1 allocation, and will take
	more time to get scheduled.
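To make the tip concrete, here is a small sketch that scales the request with the number of GPUs. beluga_gpu_flags is a hypothetical helper name, and the 10-core and 32 GB figures are simply the per-GPU maximum quoted in the tip.

```shell
#!/bin/bash
# Print sbatch resource flags for N GPUs on Beluga, using the per-GPU
# ceiling from the tip: 10 CPU cores and 32 GB of RAM per GPU.
# beluga_gpu_flags is a hypothetical helper name, not an Alliance tool.
beluga_gpu_flags() {
    local gpus=$1
    local cpus=$(( gpus * 10 ))
    local mem_gb=$(( gpus * 32 ))
    # The leading -- stops printf from reading the format as an option.
    printf -- '--gres=gpu:%d --cpus-per-task=%d --mem=%dG\n' $gpus $cpus $mem_gb
}

beluga_gpu_flags 2
```

These are upper bounds: asking for fewer cores or less memory than the ceiling is fine if your job does not need them.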

	"
	Beluga Storage,https://docs.mila.quebec/Extra_compute.html#beluga-storage,"Beluga Storage







	Storage | Path | Usage
	$HOME | /home/<user>/ | Code; specific libraries
	$HOME/projects | /project/rpp-bengioy | Compressed raw datasets
	$SCRATCH | /scratch/<user> | Processed datasets; experimental results; logs of experiments
	$SLURM_TMPDIR | - | Temporary job results





	They are roughly listed in order of increasing performance and optimized for
	different uses:

	The $HOME folder on NFS is appropriate for code and libraries which are
	small and read once. Do not write experimental results here!
	The $HOME/projects folder should only contain compressed raw datasets
	(processed datasets should go in $SCRATCH). We have a limit on the
	size and number of files in $HOME/projects, so do not put anything else
	there. If you add a new dataset there, make sure it is readable by every
	member of the group using chgrp -R rpp-bengioy <dataset>.
	The $SCRATCH space can be used for short term storage. It has good
	performance and large quotas, but is purged regularly (every file that has
	not been used in the last 3 months gets deleted, but you receive an email
	before this happens).
	$SLURM_TMPDIR points to the local disk of the node on which a job is
	running. It should be used to copy the data on the node at the beginning of
	the job and write intermediate checkpoints. This folder is cleared after each
	job.

	When an experiment is finished, results should be transferred back to Mila
	servers.
	More details on storage can be found here.
	"
	Modules,https://docs.mila.quebec/Extra_compute.html#modules,"Modules
	Many software packages, such as Python or MATLAB, are already compiled and available on
	Beluga through the module command and its subcommands. Its full
	documentation can be found here.






	module avail | Displays all the available modules
	module load <module> | Loads <module>
	module spider <module> | Shows specific details about <module>



	In particular, if you wish to use Python 3.6 you can simply do:
	module load python/3.6

	Tip
	If you wish to use Python on the cluster, we strongly encourage you to
	read Alliance Python Documentation, and in particular the Pytorch and/or Tensorflow pages.

	The cluster has many Python packages (or wheels) already compiled for
	the cluster. See here for the
	details. In particular, you can browse the packages by doing:
	avail_wheels <wheel>
	Such wheels can be installed using pip. Moreover, the most efficient way to use
	modules on the cluster is to build your environment inside your job.
	See the script example below.
	"
	Script Example,https://docs.mila.quebec/Extra_compute.html#script-example,"Script Example
	Here is an sbatch script that follows good practices on Beluga:
	#!/bin/bash
	#SBATCH --account=rrg-bengioy-ad         # Yoshua pays for your job
	#SBATCH --cpus-per-task=6                # Ask for 6 CPUs
	#SBATCH --gres=gpu:1                     # Ask for 1 GPU
	#SBATCH --mem=32G                        # Ask for 32 GB of RAM
	#SBATCH --time=3:00:00                   # The job will run for at most 3 hours
	#SBATCH -o /scratch/<user>/slurm-%j.out  # Write the log in $SCRATCH

	# 1. Create your environment locally
	module load python/3.6
	virtualenv --no-download $SLURM_TMPDIR/env
	source $SLURM_TMPDIR/env/bin/activate
	pip install --no-index torch torchvision

	# 2. Copy your dataset to the compute node
	# IMPORTANT: Your dataset must be compressed in one single file (zip, hdf5, ...)!!!
	cp $SCRATCH/<dataset.zip> $SLURM_TMPDIR

	# 3. Unzip your dataset if needed
	unzip $SLURM_TMPDIR/<dataset.zip> -d $SLURM_TMPDIR

	# 4. Launch your job, tell it to save the model in $SLURM_TMPDIR
	# and look for the dataset in $SLURM_TMPDIR
	python main.py --path $SLURM_TMPDIR --data_path $SLURM_TMPDIR

	# 5. Copy whatever you want to save to $SCRATCH
	cp $SLURM_TMPDIR/<to_save> $SCRATCH


	"
	Using CometML and Wandb,https://docs.mila.quebec/Extra_compute.html#using-cometml-and-wandb,"Using CometML and Wandb
	The compute nodes for Beluga don’t have access to the internet,
	but there is a special module that can be loaded in order to allow
	training scripts to access some specific servers, which includes
	the necessary servers for using CometML and Wandb (“Weights and Biases”).
	module load httpproxy
	More documentation about this can be found here.
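As an illustration, a job script might load the module before starting training. The sketch below writes such a script to wandb_job.sh; the file name, train.py and the one-hour limit are placeholders chosen for this example, and the account name is the one used in the Beluga examples.

```shell
#!/bin/bash
# Write a minimal Beluga job script that enables the proxy before training.
# wandb_job.sh and train.py are placeholder names for this sketch.
cat > wandb_job.sh <<'EOF'
#!/bin/bash
#SBATCH --account=rrg-bengioy-ad
#SBATCH --gres=gpu:1
#SBATCH --time=1:00:00

# Give the compute node access to the allowed servers (CometML, Wandb)
module load httpproxy

python train.py
EOF

# Submit it from a Beluga login node with: sbatch wandb_job.sh
```

Loading the module inside the job script, rather than on the login node, keeps the proxy setting attached to the job that needs it.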
	"
	Graham,https://docs.mila.quebec/Extra_compute.html#graham,"Graham
	Graham is a cluster located at the University of Waterloo. It uses SLURM to schedule
	jobs. Its full documentation can be found here, and its current status here.
	You can access Graham via ssh:
	ssh <user>@graham.computecanada.ca
	Where <user> is the username you created previously (see Account Creation).
	Since its structure is similar to Beluga, please look at the Beluga
	documentation, as well as relevant parts of the Digital Research Alliance of
	Canada Documentation.

	Note
	For GPU jobs the resource allocation Group Name is the same as Beluga, so you should use the flag --account=rrg-bengioy-ad for GPU jobs.

	"
	Cedar,https://docs.mila.quebec/Extra_compute.html#cedar,"Cedar
	Cedar is a cluster located at Simon Fraser University. It uses SLURM to schedule
	jobs. Its full documentation can be found here, and its current status here.
	You can access Cedar via ssh:
	ssh <user>@cedar.computecanada.ca
	Where <user> is the username you created previously (see Account Creation).
	Since its structure is similar to Beluga, please look at the Beluga
	documentation, as well as relevant parts of the Digital Research Alliance of
	Canada Documentation.

	Note
	We don’t have any CPU priority on Cedar. For CPU jobs you can
	use --account=def-bengioy, but it might take some time before
	they start.

	"
	Niagara,https://docs.mila.quebec/Extra_compute.html#niagara,"Niagara
	Niagara is a cluster located at the University of Toronto. It uses SLURM to schedule
	jobs. Its full documentation can be found here, and its current status here.
	You can access Niagara via ssh:
	ssh <user>@niagara.computecanada.ca
	Where <user> is the username you created previously (see Account Creation).
	Since its structure is similar to Beluga, please look at the Beluga
	documentation, as well as relevant parts of the Digital Research Alliance of
	Canada Documentation.
	"
	FAQ,https://docs.mila.quebec/Extra_compute.html#faq,"FAQ
	"
	What to do with ImportError: /lib64/libm.so.6: version GLIBC_2.23 not found?,https://docs.mila.quebec/Extra_compute.html#what-to-do-with-importerror-lib64-libm-so-6-version-glibc-2-23-not-found,"What to do with ImportError: /lib64/libm.so.6: version GLIBC_2.23 not found?
	The structure of the file system is different from that of a classical Linux
	system, so your code has trouble finding libraries. See how to install binary packages.
	"
	Disk quota exceeded error on /project file systems,https://docs.mila.quebec/Extra_compute.html#disk-quota-exceeded-error-on-project-file-systems,"Disk quota exceeded error on /project file systems
	You have files in /project with the wrong permissions. See how to change
	permissions.
	"
	Computing infrastructure and policies,https://docs.mila.quebec/Information.html#computing-infrastructure-and-policies,"Computing infrastructure and policies
	This section seeks to provide factual information and policies on the Mila cluster computing environments.
	"
	Roles and authorizations,https://docs.mila.quebec/Information.html#roles-and-authorizations,"Roles and authorizations
	There are two main researcher statuses at Mila:

	Core researchers
	Affiliated researchers

	This is determined by Mila policy. Core researchers have access to the Mila
	computing cluster. Check your supervisor’s Mila status to find out your own
	status.
	"
	Overview of available computing resources at Mila,https://docs.mila.quebec/Information.html#overview-of-available-computing-resources-at-mila,"Overview of available computing resources at Mila
	The Mila cluster is to be used for regular development and a relatively small
	number of jobs (< 5). It is a heterogeneous cluster. It uses
	SLURM to schedule jobs.
	"
	Mila cluster versus Digital Research Alliance of Canada clusters,https://docs.mila.quebec/Information.html#mila-cluster-versus-digital-research-alliance-of-canada-clusters,"Mila cluster versus Digital Research Alliance of Canada clusters
	There are a lot of commonalities between the Mila cluster and the clusters from
	Digital Research Alliance of Canada (the Alliance). At the time being, the
	Alliance clusters where we have a large allocation of resources are beluga,
	cedar, graham and narval. We also have comparable computational resources
	in the Mila cluster, with more to come.
	The main distinguishing factor is that we have more control over our own
	cluster than we have over the ones at the Alliance. Notably, also, the compute
	nodes in the Mila cluster all have unrestricted access to the Internet, which
	is not the case in general for the Alliance clusters (although cedar does
	allow it).
	At the time of this writing (June 2021), Mila students are advised to
	use a healthy mix of Mila and Alliance clusters. This is especially
	true in times when your favorite cluster is oversubscribed, because you can
	easily switch over to a different one if you are used to it.
	"
	Guarantees about one GPU as absolute minimum,https://docs.mila.quebec/Information.html#guarantees-about-one-gpu-as-absolute-minimum,"Guarantees about one GPU as absolute minimum
	There are certain guarantees that the Mila cluster tries to honor when it comes
	to giving at minimum one GPU per student, all the time, to be used in
	interactive mode. This is strictly better than “one GPU per student on average”
	because it’s a floor meaning that, at any time, you should be able to ask for
	your GPU, right now, and get it (although it might take a minute for the
	request to be processed by SLURM).
	Interactive sessions are possible on the Alliance clusters, and there are
	generally special rules that allow you to get resources more easily if you
	request them for a very short duration (for testing code before queueing long
	jobs). You do not get the same guarantee as on the Mila cluster, however.
	"
	Node profile description,https://docs.mila.quebec/Information.html#node-profile-description,"Node profile description
















	Name | GPU Model | GPU Mem (GB) | GPU # | CPUs | Sockets | Cores/Socket | Threads/Core | Memory (GB) | TmpDisk (TB) | Arch | Slurm Features (GPU Arch and Memory)

	GPU Compute Nodes
	cn-a[001-011] | RTX8000 | 48 | 8 | 40 | 2 | 20 | 1 | 384 | 3.6 | x86_64 | turing,48gb
	cn-b[001-005] | V100 | 32 | 8 | 40 | 2 | 20 | 1 | 384 | 3.6 | x86_64 | volta,nvlink,32gb
	cn-c[001-040] | RTX8000 | 48 | 8 | 64 | 2 | 32 | 1 | 384 | 3 | x86_64 | turing,48gb
	cn-g[001-026] | A100 | 80 | 4 | 64 | 2 | 32 | 1 | 1024 | 7 | x86_64 | ampere,nvlink,80gb

	DGX Systems
	cn-d[001-002] | A100 | 40 | 8 | 128 | 2 | 64 | 1 | 1024 | 14 | x86_64 | ampere,nvlink,40gb
	cn-d[003-004] | A100 | 80 | 8 | 128 | 2 | 64 | 1 | 2048 | 28 | x86_64 | ampere,nvlink,80gb
	cn-e[002-003] | V100 | 32 | 8 | 40 | 2 | 20 | 1 | 512 | 7 | x86_64 | volta,32gb

	CPU Compute Nodes
	cn-f[001-004] | - | - | - | 32 | 1 | 32 | 1 | 256 | 10 | x86_64 | rome
	cn-h[001-004] | - | - | - | 64 | 2 | 32 | 1 | 768 | 7 | x86_64 | milan

	Legacy GPU Compute Nodes
	kepler5 | V100 | 16 | 2 | 16 | 2 | 4 | 2 | 256 | 3.6 | x86_64 | volta,16gb

	TITAN RTX
	rtx[1,3-5,7] | titanrtx | 24 | 2 | 20 | 1 | 10 | 2 | 128 | 0.93 | x86_64 | turing,24gb
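In the node profile table, the CPUs value of each node is the product of its Sockets, Cores/Socket and Threads/Core values. The short sketch below checks this for two rows; logical_cpus is a hypothetical helper name used only for this illustration.

```shell
#!/bin/bash
# Logical CPUs = sockets x cores per socket x threads per core.
# logical_cpus is a hypothetical helper, not a cluster command.
logical_cpus() {
    echo $(( $1 * $2 * $3 ))
}

logical_cpus 2 20 1   # cn-a: 2 sockets, 20 cores/socket, 1 thread/core
logical_cpus 2 4 2    # kepler5: 2 sockets, 4 cores/socket, 2 threads/core
```

The two results, 40 and 16, match the CPUs values listed for cn-a and kepler5.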



	"
	Special nodes and outliers,https://docs.mila.quebec/Information.html#special-nodes-and-outliers,"Special nodes and outliers
	"
	DGX A100,https://docs.mila.quebec/Information.html#dgx-a100,"DGX A100
	DGX A100 nodes are NVIDIA appliances with 8 NVIDIA A100 Tensor Core GPUs. Each
	GPU has 40 GB of memory, for a total of 320 GB per appliance. The GPUs are
	interconnected via 6 NVSwitches, which allow 4.8 TB/s of bi-directional bandwidth.
	In order to run jobs on a DGX A100, add the flags below to your Slurm
	commands:
	--gres=gpu:a100:<number> --reservation=DGXA100


	"
	MIG,https://docs.mila.quebec/Information.html#mig,"MIG
	MIG (Multi-Instance GPU)
	is an NVIDIA technology allowing certain GPUs to be
	partitioned into multiple instances, each of which has a roughly proportional
	amount of compute resources, device memory and bandwidth to that memory.
	NVIDIA supports MIG on its A100 GPUs and allows slicing the A100 into up to 7
	instances. Although this can theoretically be done dynamically, the SLURM job
	scheduler does not support doing so in practice as it does not model
	reconfigurable resources very well. Therefore, the A100s must currently be
	statically partitioned into the required number of instances of every size
	expected to be used.
	The cn-g series of nodes includes A100-80GB GPUs. One third have been
	configured to offer regular (non-MIG mode) a100l GPUs. The other two-thirds
	have been configured in MIG mode, and offer the following profiles:









	Name | GPU Model | GPU Memory | GPU Compute | Cluster-wide #
	a100l.1g.10gb / a100l.1 | A100 | 10GB (1/8th) | 1/7th of full | 72
	a100l.2g.20gb / a100l.2 | A100 | 20GB (2/8th) | 2/7th of full | 108
	a100l.3g.40gb / a100l.3 | A100 | 40GB (4/8th) | 3/7th of full | 72



	These profiles can be requested using a SLURM flag such as --gres=gpu:a100l.1
	The partitioning may be revised as needs and SLURM capabilities evolve. Other
	MIG profiles exist and could be introduced.

	Warning
	MIG has a number of important limitations,
	most notably that a GPU in MIG mode does not support graphics APIs
	(OpenGL/Vulkan), nor P2P over NVLink and PCIe. We have therefore chosen to
	limit every MIG job to exactly one MIG slice and no more. Thus,
	--gres=gpu:a100l.3 will work (and request a size-3 slice of an
	a100l GPU) but --gres=gpu:a100l.1:3 (with :3 requesting
	three size-1 slices) will not.
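The one-slice rule can be checked before submitting. The sketch below is only an illustration: is_single_mig_slice is a hypothetical helper, not part of SLURM, and it recognizes only the three short profile names listed in this section.

```shell
#!/bin/bash
# Check a --gres value against the one-slice-per-job rule.
# is_single_mig_slice is a hypothetical helper, not part of SLURM.
is_single_mig_slice() {
    case $1 in
        gpu:a100l.[123])  return 0 ;;  # exactly one slice of size 1, 2 or 3
        *)                return 1 ;;  # anything else, e.g. gpu:a100l.1:3
    esac
}

for gres in gpu:a100l.3 gpu:a100l.1:3; do
    if is_single_mig_slice $gres; then
        echo $gres: accepted
    else
        echo $gres: rejected
    fi
done
```

The loop reports the first value as accepted and the second as rejected, mirroring the two cases in the warning.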

	"
	AMD,https://docs.mila.quebec/Information.html#amd,"AMD

	Warning
	As of August 20 2019 the GPUs had to be returned to AMD. Mila will get
	more samples. You can join the amd slack channels to get the latest
	information.

	Mila has a few nodes equipped with MI50 GPUs.
	srun --gres=gpu -c 8 --reservation=AMD --pty bash

	First-time setup of the AMD stack:
	conda create -n rocm python=3.6
	conda activate rocm

	pip install tensorflow-rocm
	pip install /wheels/pytorch/torch-1.1.0a0+d8b9d32-cp36-cp36m-linux_x86_64.whl
	"
	Data sharing policies,https://docs.mila.quebec/Information.html#data-sharing-policies,"Data sharing policies

	Note
	/network/scratch aims to support
	Access Control Lists (ACLs)
	to allow collaborative work on rapidly changing data, e.g. work-in-progress
	datasets, model checkpoints, etc.

	/network/projects aims to offer a collaborative
	space for long-term projects. Data that should be kept for longer than
	90 days can be stored in that location, but first a request must be made to Mila’s helpdesk to create the project
	directory.
	"
	Monitoring,https://docs.mila.quebec/Information.html#monitoring,"Monitoring
	Every compute node on the Mila cluster has a Netdata
	monitoring daemon allowing you to get a sense of the state of the node.
	This information is exposed in two ways:

	For every node, there is a web interface from Netdata itself at <node>.server.mila.quebec:19999.
	This is accessible only when using the Mila wifi or through SSH tunnelling.

	SSH tunnelling: on your local machine, run

	ssh -L 19999:<node>.server.mila.quebec:19999 -p 2222 login.server.mila.quebec
	or ssh -L 19999:<node>.server.mila.quebec:19999 mila if you have
	already set up your SSH Login,


	then open http://localhost:19999 in your browser.


	The Mila dashboard at dashboard.server.mila.quebec
	exposes aggregated statistics with the use of grafana.
	These are collected internally to an instance of prometheus.

	In both cases, those graphs are not editable by individual users,
	but they provide valuable insight into the state of the whole cluster
	or the individual nodes.
	One of the important uses is to collect data about the health
	of the Mila cluster and to sound the alarm if outages occur
	(e.g. if the nodes crash or if GPUs mysteriously become unavailable for SLURM).
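For the SSH tunnelling option, the command only differs by the node name. As a small sketch, the hypothetical helper below (netdata_tunnel_cmd is a name made up for this example) prints the tunnel command for a given node without running it.

```shell
#!/bin/bash
# Print the SSH tunnel command for the Netdata page of a compute node.
# netdata_tunnel_cmd is a hypothetical helper; it only prints the command.
netdata_tunnel_cmd() {
    local node=$1
    # Accept only simple node names such as cn-c001.
    if [[ ! $node =~ ^[a-z0-9-]+$ ]]; then
        echo invalid node name >&2
        return 1
    fi
    printf 'ssh -L 19999:%s.server.mila.quebec:19999 -p 2222 login.server.mila.quebec\n' $node
}

netdata_tunnel_cmd cn-c001
```

Once the printed command is running on your local machine, the Netdata page of that node is reachable at http://localhost:19999.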
	"
	Example with Netdata on cn-c001,https://docs.mila.quebec/Information.html#example-with-netdata-on-cn-c001,"Example with Netdata on cn-c001
	For example, if we have a job running on cn-c001, we can type
	cn-c001.server.mila.quebec:19999 in a browser address bar and the following
	page will appear.

	"
	Example watching the CPU/RAM/GPU usage,https://docs.mila.quebec/Information.html#example-watching-the-cpu-ram-gpu-usage,"Example watching the CPU/RAM/GPU usage
	Given that compute nodes are generally shared
	with other users who are also running jobs at the same time and
	consuming resources, this is not generally a good way to profile your code
	in fine details.
	However, it can still be a very useful source of information
	for getting an idea of whether the machine that you requested is being
	used in its full capacity.
	Given how expensive the GPUs are, it generally makes sense to try to
	make sure that these resources are always kept busy.


	CPU
	iowait (pink line): High values mean your model is waiting on IO a lot (disk or network).








	CPU RAM
	You can see how much CPU RAM is being used by your script in practice,
	considering the amount that you requested (e.g. `sbatch --mem=8G ...`).
	GPU usage is generally more important to monitor than CPU RAM.
	You should not cut it so close to the limit that your experiments randomly fail
	because they run out of RAM. However, you should not blindly request 32GB of RAM
	when you actually require only 8GB.
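One way to size the request is to start from the peak usage you observed and add a safety margin. The sketch below assumes a 25% margin, which is an arbitrary choice for illustration and not a Mila recommendation; mem_request is a hypothetical helper name.

```shell
#!/bin/bash
# Turn an observed peak RAM usage (in GB) into a --mem flag with headroom.
# The default 25% margin is an assumption made for this sketch.
mem_request() {
    local peak_gb=$1
    local margin_pct=${2:-25}
    # Integer arithmetic, rounded up to the next whole GB.
    local gb=$(( (peak_gb * (100 + margin_pct) + 99) / 100 ))
    printf -- '--mem=%dG\n' $gb
}

mem_request 8
```

A second argument overrides the margin, for example mem_request 8 50 for a more cautious request.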








	GPU
	Monitors the GPU usage using an nvidia-smi plugin for Netdata.
	Under the plugin interface, select the GPU number which was allocated to
	you. You can figure this out by running echo $SLURM_JOB_GPUS on the
	allocated node or, if you have the job ID, by running
	scontrol show -d job YOUR_JOB_ID | grep 'GRES' and checking IDX.
	You should make sure you use the GPUs to their fullest capacity.
	Select the biggest batch size if possible to increase GPU memory usage and
	the GPU computational load.
	Spawn multiple experiments if you can fit many on a single GPU.
	Running 10 independent MNIST experiments on a single GPU will probably take
	less than 10x the time to run a single one. This assumes that you have more
	experiments to run, because nothing is gained by gratuitously running experiments.
	You can request a less powerful GPU and leave the more powerful GPUs
	to other researchers who have experiments that can make best use of them.
	Sometimes you really just need a k80 and not a v100.
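To pick out the GPU index mentioned above from the scontrol output, a small parser can help. The sketch below assumes the relevant line contains a field shaped like GRES=gpu:1(IDX:0); the exact layout may differ between SLURM versions, the sample line is made up, and gpu_idx_from_line is a hypothetical helper name.

```shell
#!/bin/bash
# Extract the GPU index list from one line of 'scontrol show -d job' output.
# Assumes the line contains a field shaped like GRES=gpu:1(IDX:0).
gpu_idx_from_line() {
    local line=$1
    local re='IDX:([0-9,-]+)'
    if [[ $line =~ $re ]]; then
        echo ${BASH_REMATCH[1]}
    else
        return 1
    fi
}

# Made-up sample line, for illustration only.
gpu_idx_from_line 'Nodes=cn-c001 CPU_IDs=0-3 Mem=8192 GRES=gpu:1(IDX:0)'
```

The value it prints is the GPU number to select in the Netdata plugin interface.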








	Other users or jobs
	If the node seems unresponsive or slow,
	it may be useful to check what other tasks are
	running at the same time on that node.
	This should not be an issue in general,
	but in practice it is useful to be able to
	inspect this to diagnose certain problems.






	"
	Example with Mila dashboard,https://docs.mila.quebec/Information.html#example-with-mila-dashboard,"Example with Mila dashboard

	"
	Storage,https://docs.mila.quebec/Information.html#storage,"Storage










	Path | Performance | Usage | Quota (Space/Files) | Backup | Auto-cleanup
	/network/datasets/ | High | Curated raw datasets (read only) | - | - | -
	$HOME or /home/mila/<u>/<username>/ | Low | Personal user space; specific libraries, code, binaries | 100GB/1000K | Daily | no
	$SCRATCH or /network/scratch/<u>/<username>/ | High | Temporary job results; processed datasets; optimized for small files | no | no | 90 days
	$SLURM_TMPDIR | Highest | High speed disk for temporary job results | 4TB/- | no | at job end
	/network/projects/<groupname>/ | Fair | Shared space to facilitate collaboration between researchers; long-term project storage | 200GB/1000K | Daily | no
	$ARCHIVE or /network/archive/<u>/<username>/ | Low | Long-term personal storage | 500GB | no | no




	Note
	The $HOME file system is backed up once a day. For any file
	restoration request, file a request to Mila’s IT support with the path to the file or directory to
	restore, with the required date.


	Warning
	Currently there is no backup system for any other file systems of
	the Mila cluster. Storage local to personal computers, Google Drive and other
	related solutions should be used to back up important data.

	"
	$HOME,https://docs.mila.quebec/Information.html#home,"$HOME
	$HOME is appropriate for code and libraries which are small and read once,
	as well as experimental results that will be needed at a later time (e.g.
	the weights of a network referenced in a paper).
	Quotas are enabled on $HOME for both disk capacity (blocks) and number of
	files (inodes). The limits for blocks and inodes are respectively 100GiB and 1
	million per user. The command to check the quota usage from a login node is:
	beegfs-ctl --cfgFile=/etc/beegfs/home.d/beegfs-client.conf --getquota --uid $USER
	"
	$SCRATCH,https://docs.mila.quebec/Information.html#scratch,"$SCRATCH
	$SCRATCH can be used to store processed datasets, work-in-progress datasets
	or temporary job results. Its block size is optimized for small files, which
	minimizes the performance hit of working on extracted datasets.

	Note
	Auto-cleanup: this file system is cleaned on a weekly basis;
	files not used for more than 90 days are deleted.
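To get a feel for the 90-day rule, the sketch below computes how many days a file has left before it becomes eligible for cleanup. It assumes that last use means last access time, which this section does not state explicitly; days_until_purge is a hypothetical helper, not a cluster command.

```shell
#!/bin/bash
# Days left before a file becomes eligible for the 90-day cleanup.
# Arguments: last-use time and current time, both in seconds since the epoch.
# Assumes that last use means last access time.
days_until_purge() {
    local last_used=$1 now=$2
    local idle_days=$(( (now - last_used) / 86400 ))
    local left=$(( 90 - idle_days ))
    (( left < 0 )) && left=0
    echo $left
}

# A file last used 30 days ago has 60 days left.
days_until_purge 0 $(( 30 * 86400 ))
```

On a system with GNU coreutils, the two arguments could come from stat -c %X <file> and date +%s.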

	"
	$SLURM_TMPDIR,https://docs.mila.quebec/Information.html#slurm-tmpdir,"$SLURM_TMPDIR
	$SLURM_TMPDIR points to the local disk of the node on which a job is
	running. It should be used to copy the data on the node at the beginning of the
	job and write intermediate checkpoints. This folder is cleared after each job.
	"
	projects,https://docs.mila.quebec/Information.html#projects,"projects
	projects can be used for collaborative projects. It aims to ease the
	sharing of data between users working on a long-term project.
	Quotas are enabled on projects for both disk capacity (blocks) and number
	of files (inodes). The limits for blocks and inodes are respectively 200GiB and
	1 million per user and per group.

	Note
	It is possible to request higher quota limits if the project requires
	it. File a request to Mila’s IT support.

	"
	$ARCHIVE,https://docs.mila.quebec/Information.html#archive,"$ARCHIVE
	The purpose of $ARCHIVE is to store data other than datasets that has to be kept
	long-term (e.g. generated samples, logs, data relevant for paper submission).
	$ARCHIVE is only available on the login nodes. Because this file system
	is tuned for large files, it is recommended to archive your directories. For
	example, to archive the results of an experiment in
	$SCRATCH/my_experiment_results/, run the commands below from a login node:
	cd $SCRATCH
	tar cJf $ARCHIVE/my_experiment_results.tar.xz --xattrs my_experiment_results
	Disk capacity quotas are enabled on $ARCHIVE. The soft limit per user is
	500GB, the hard limit is 550GB. The grace time is 7 days. This means that one
	can use more than 500GB for 7 days before the file system enforces quota.
	However, it is not possible to use more than 550GB.
	The command to check the quota usage from a login node is df:
	df -h $ARCHIVE
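The soft and hard limits can be read as three zones. The sketch below classifies a usage figure in GB against the 500GB and 550GB limits quoted above; archive_quota_state is a hypothetical helper, and the wording of its messages is made up for this example.

```shell
#!/bin/bash
# Classify archive usage (in GB) against the limits quoted above:
# soft limit 500GB, hard limit 550GB.
# archive_quota_state is a hypothetical helper, not a cluster command.
archive_quota_state() {
    local used_gb=$1
    if (( used_gb <= 500 )); then
        echo ok
    elif (( used_gb <= 550 )); then
        echo over soft limit: 7-day grace period applies
    else
        echo over hard limit: not possible
    fi
}

archive_quota_state 520
```

A usage of 520GB falls between the two limits, so the grace period message is printed.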

	Note
	There is NO backup of this file system.

	"
	datasets,https://docs.mila.quebec/Information.html#datasets,"datasets
	datasets contains curated datasets for the benefit of the Mila community.
	To request the addition of a dataset or a preprocessed dataset you think could
	benefit the research of others, you can fill out this form. Datasets can also be browsed from the
	web: Mila Datasets
	Datasets in datasets/restricted are restricted and require an explicit
	request to gain access. Please submit a support ticket mentioning the dataset’s
	access group (e.g. scannet_users), your cluster username and the
	approval of the group owner. You can find the dataset’s access group by
	listing the content of /network/datasets/restricted with the ls command.
	Those datasets are mirrored to the Alliance clusters in
	~/projects/rrg-bengioy-ad/data/curated/ if they follow Digital Research
	Alliance of Canada’s good practices on data.
	To list the local datasets on an Alliance cluster, you can execute the
	following command:
	ssh [CLUSTER_LOGIN] -C ""projects/rrg-bengioy-ad/data/curated/list_datasets_cc.sh""
	"
	Data Transmission,https://docs.mila.quebec/Information.html#data-transmission,"Data Transmission
	Multiple methods can be used to transfer data to/from the cluster:

	rsync --bwlimit=10mb: this is the favored method, since the bandwidth can
	be limited to avoid impacting the usage of the cluster.
	Digital Research Alliance of Canada: Globus

	"
	Getting started,https://docs.mila.quebec/Getting_started.html#getting-started,"Getting started
	See User’s guide.
	"
	User’s guide,https://docs.mila.quebec/Userguide.html#user-s-guide,"User’s guide
	…or IDT’s list of opinionated howtos
	This section seeks to provide users of the Mila infrastructure with practical
	knowledge, tips and tricks and example commands.
	"
	Quick Start,https://docs.mila.quebec/Userguide.html#quick-start,"Quick Start
	Users first need login access to the cluster. It is
	recommended to install milatools, which will help set up the
	ssh configuration needed to securely and easily connect to the
	cluster.
	"
	mila code,https://docs.mila.quebec/Userguide.html#mila-code,"mila code
	milatools also makes it easy to run and debug code on the Mila cluster. Using
	the mila code command will allow you to use VSCode on the server. Simply run:
	mila code path/on/cluster


	The details of the command can be found on the github page of the package. Note that you need to
	first set up your ssh configuration using mila init before the mila code
	command can be used. The initialisation of the ssh configuration is explained
	here and on the github page of the package.
	"
	Logging in to the cluster,https://docs.mila.quebec/Userguide.html#logging-in-to-the-cluster,"Logging in to the cluster
	To access the Mila cluster, you will need a Mila account. Please contact
	Mila systems administrators if you don’t have one already. Our IT support service
	is available here: https://it-support.mila.quebec/
	You will also need to complete and return an IT Onboarding Training to get
	access to the cluster. Please refer to the Mila Intranet for more
	information:
	https://sites.google.com/mila.quebec/mila-intranet/it-infrastructure/it-onboarding-training
	IMPORTANT: Your access to the cluster is granted based on your status at
	Mila (for students, your status is the same as your main supervisor’s status),
	and on the duration of your stay, set during the creation of your account. The
	following have access to the cluster: current students of Core Professors,
	Core Professors, and staff.
	"
	SSH Login,https://docs.mila.quebec/Userguide.html#ssh-login,"SSH Login
	You can access the Mila cluster via ssh:
	# Generic login, will send you to one of the 4 login nodes to spread the load
	ssh <user>@login.server.mila.quebec -p 2222

	# To connect to a specific login node, X in [1, 2, 3, 4]
	ssh <user>@login-X.login.server.mila.quebec -p 2222
	Four login nodes are available and accessible behind a load balancer. At each
	connection, you will be redirected to the least loaded login-node.
	The ECDSA, RSA and ED25519 fingerprints for Mila’s login nodes are:
	SHA256:baEGIa311fhnxBWsIZJ/zYhq2WfCttwyHRKzAb8zlp8 (ECDSA)
	SHA256:Xr0/JqV/+5DNguPfiN5hb8rSG+nBAcfVCJoSyrR0W0o (RSA)
	SHA256:gfXZzaPiaYHcrPqzHvBi6v+BWRS/lXOS/zAjOKeoBJg (ED25519)



	Important
	Login nodes are merely entry points to the cluster. They give you access
	to the compute nodes and to the filesystem, but they are not meant to run
	anything heavy. Do not run compute-heavy programs on these nodes,
	because in doing so you could bring them down, impeding cluster access for
	everyone.
	This means no training or experiments, no compiling programs, no Python
	scripts, but also no zip of a large folder or anything that demands a
	sustained amount of computation.
	Rule of thumb: never run a program that takes more than a few seconds on
	a login node.

	Note
	In a similar vein, you should not run VSCode remote SSH instances directly
	on login nodes, because even though they are typically not very
	computationally expensive, when many people do it, they add up! See
	Visual Studio Code for specific instructions.


	"
	mila init,https://docs.mila.quebec/Userguide.html#mila-init,"mila init
	To make it easier to set up a productive environment, Mila publishes the
	milatools package, which defines a mila init command that will
	automatically perform some of the steps below for you. You can install it with
	pip and use it, provided your Python version is at least 3.8:
	$ pip install milatools
	$ mila init


	"
	SSH Config,https://docs.mila.quebec/Userguide.html#ssh-config,"SSH Config
	The login nodes support the following authentication mechanisms:
	publickey,keyboard-interactive. If you would like to set an entry in your
	.ssh/config file, please use the following recommendation:
	Host mila
	User YOUR-USERNAME
	Hostname login.server.mila.quebec
	PreferredAuthentications publickey,keyboard-interactive
	Port 2222
	ServerAliveInterval 120
	ServerAliveCountMax 5


	Then you can simply write ssh mila to connect to a login node. You will also
	be able to use mila with scp, rsync and other such programs.

	Tip
	You can run commands on the login node with ssh directly, for example
	ssh mila squeue -u '$USER' (remember to put single quotes around any
	$VARIABLE you want to evaluate on the remote side, otherwise it will be
	evaluated locally before ssh is even executed).

	"
	Passwordless login,https://docs.mila.quebec/Userguide.html#passwordless-login,"Passwordless login
	To save you some repetitive typing it is highly recommended to set up public
	key authentication, which means you won’t have to enter your password every time
	you connect to the cluster.
	# ON YOUR LOCAL MACHINE
	# You might already have done this in the past, but if you haven't:
	ssh-keygen # Press ENTER 3x

	# Copy your public key over to the cluster
	# You will need to enter your password
	ssh-copy-id mila


	"
	Connecting to compute nodes,https://docs.mila.quebec/Userguide.html#connecting-to-compute-nodes,"Connecting to compute nodes
	If (and only if) you have a job running on compute node “cnode”, you are
	allowed to SSH to it directly, if for some reason you need a second terminal.
	That session will be automatically ended when your job is relinquished.
	First, however, you need to have
	password-less ssh either with a key present in your home or with an
	ssh-agent. To generate a key pair on the login node:
	# ON A LOGIN NODE
	ssh-keygen # Press ENTER 3x
	cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
	chmod 600 ~/.ssh/authorized_keys
	chmod 700 ~/.ssh


	Then from the login node you can write ssh <node>. From your local
	machine, you can use ssh -J mila USERNAME@<node> (-J represents a “jump”
	through the login node, necessary because the compute nodes are behind a
	firewall).
	If you wish, you may also add the following wildcard rule in your .ssh/config:
	Host *.server.mila.quebec !*login.server.mila.quebec
	HostName %h
	User YOUR-USERNAME
	ProxyJump mila


	This will let you connect to a compute node with ssh <node>.server.mila.quebec.
	"
	Running your code,https://docs.mila.quebec/Userguide.html#running-your-code,"Running your code
	"
	SLURM commands guide,https://docs.mila.quebec/Userguide.html#slurm-commands-guide,"SLURM commands guide
	"
	Basic Usage,https://docs.mila.quebec/Userguide.html#basic-usage,"Basic Usage
	The SLURM documentation
	provides extensive information on the available commands to query the cluster
	status or submit jobs.
	Below are some basic examples of how to use SLURM.
	"
	Submitting jobs,https://docs.mila.quebec/Userguide.html#submitting-jobs,"Submitting jobs
	"
	Batch job,https://docs.mila.quebec/Userguide.html#batch-job,"Batch job
	In order to submit a batch job, you have to create a script containing the main
	command(s) you would like to execute on the allocated resources/nodes.
	#!/bin/bash
	#SBATCH --job-name=test
	#SBATCH --output=job_output.txt
	#SBATCH --error=job_error.txt
	#SBATCH --ntasks=1
	#SBATCH --time=10:00
	#SBATCH --mem=100Gb

	module load python/3.5
	python my_script.py


	Your job script is then submitted to SLURM with sbatch:
	sbatch job_script
	sbatch: Submitted batch job 4323674
	The working directory of the job will be the one from which you executed sbatch.

	Tip
	Slurm directives can be specified on the command line alongside sbatch or
	inside the job script with a line starting with #SBATCH.
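
As an illustration of both ways of passing directives, here is a minimal sketch that writes a small job script, checks it, and submits it only where sbatch is installed. The file location, the resource values and the echo payload are hypothetical; the sketch relies on the documented SLURM behaviour that options given on the command line take precedence over #SBATCH lines in the file.

```shell
# Write a minimal job script to a temporary file (the location is illustrative).
job_script=$(mktemp)
cat > $job_script <<'EOF'
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem=1G
echo running on $(hostname)
EOF

# Check the script for shell syntax errors before queueing it.
bash -n $job_script && syntax_ok=yes

# Count the directives that SLURM will read from the file.
n_directives=$(grep -c '^#SBATCH' $job_script)
printf 'directives found: %s\n' $n_directives

# Submit only where sbatch exists; --time given here overrides the value in the file.
if command -v sbatch >/dev/null 2>&1; then
    sbatch --time=5:00 $job_script
fi
```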

	"
	Interactive job,https://docs.mila.quebec/Userguide.html#interactive-job,"Interactive job
	Workload managers usually run batch jobs so that you do not have to watch
	their progression: the scheduler runs them as soon as resources are available.
	If you want to get access to a shell while leveraging cluster resources, you can
	submit an interactive job, where the main executable is a shell, with the
	srun or salloc commands:
	salloc
	This will start an interactive job on the first node available with the default
	resources set in SLURM (1 task/1 CPU). srun accepts the same arguments as
	sbatch with the exception that the environment is not passed.

	Tip
	To pass your current environment to an interactive job, add
	--preserve-env to srun.

	salloc, when given no further options, is mostly a wrapper around srun, but it
	gives more flexibility if, for example, you want to get an allocation on
	multiple nodes.
	"
	Job submission arguments,https://docs.mila.quebec/Userguide.html#job-submission-arguments,"Job submission arguments
	In order to accurately select the resources for your job, several arguments are
	available. The most important ones are:






	Argument                      Description
	-n, --ntasks=<number>         The number of tasks in your script, usually 1
	-c, --cpus-per-task=<ncpus>   The number of cores for each task
	-t, --time=<time>             Time requested for your job
	--mem=<size[units]>           Memory requested for all your tasks
	--gres=<list>                 Select generic resources such as GPUs for your job: --gres=gpu:GPU_MODEL



	Tip
	Request only the resources your job actually needs: this improves its
	scheduling, since smaller jobs are easier to place and tend to start sooner.

	"
	Checking job status,https://docs.mila.quebec/Userguide.html#checking-job-status,"Checking job status
	To display jobs currently in queue, use squeue and to get only your jobs type
	squeue -u $USER
	JOBID USER NAME ST START_TIME TIME NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON) COMMENT
	133 my_username myjob R 2019-03-28T18:33 0:50 1 2 N/A 7000M node1 (None) (null)

	Note
	The maximum number of jobs able to be submitted to the system per user is 1000 (MaxSubmitJobs=1000)
	at any given time from the given association. If this limit is reached, new submission requests
	will be denied until existing jobs in this association complete.
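
To see how close you are to this limit, the number of lines printed by squeue can be compared with it. This is a rough sketch rather than an official tool: it assumes that each line of squeue -r output counts as one job towards the limit, and it falls back to a hypothetical count when squeue is not installed so that it can also be tried off-cluster.

```shell
max_submit=1000    # MaxSubmitJobs value quoted in the note above

if command -v squeue >/dev/null 2>&1; then
    # -h suppresses the header and -r prints one line per job array element,
    # so that every output line is one job.
    n_jobs=$(squeue -u ${USER:-$(id -un)} -h -r | wc -l)
else
    n_jobs=2       # hypothetical count so the sketch also runs off-cluster
fi

remaining=$((max_submit - n_jobs))
printf 'jobs in queue: %d, submissions left: %d\n' $n_jobs $remaining
```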

	"
	Removing a job,https://docs.mila.quebec/Userguide.html#removing-a-job,"Removing a job
	To cancel your job simply use scancel
	scancel 4323674
	"
	Partitioning,https://docs.mila.quebec/Userguide.html#partitioning,"Partitioning
	Since we don’t have many GPUs on the cluster, resources must be shared as fairly
	as possible. The --partition=/-p flag of SLURM allows you to set the
	priority you need for a job. Each job assigned with a priority can preempt jobs
	with a lower priority: unkillable > main > long. Once preempted, your job is
	killed without notice and is automatically re-queued on the same partition until
	resources are available. (To leverage a different preemption mechanism, see the
	Handling preemption section.)








	Flag                            Max Resource Usage          Max Time      Note
	--partition=unkillable          6 CPUs, mem=32G, 1 GPU      2 days
	--partition=unkillable-cpu      2 CPUs, mem=16G             2 days        CPU-only jobs
	--partition=short-unkillable    24 CPUs, mem=128G, 4 GPUs   3 hours (!)   Large but short jobs
	--partition=main                8 CPUs, mem=48G, 2 GPUs     5 days
	--partition=main-cpu            8 CPUs, mem=64G             5 days        CPU-only jobs
	--partition=long                no limit of resources       7 days
	--partition=long-cpu            no limit of resources       7 days        CPU-only jobs




	Warning
	Historically, before the 2022 introduction of CPU-only nodes (e.g. the cn-f
	series), CPU jobs ran side-by-side with the GPU jobs on GPU nodes. To prevent
	them obstructing any GPU job, they were always lowest-priority and preemptible.
	This was implemented by automatically assigning them to one of the now-obsolete
	partitions cpu_jobs, cpu_jobs_low or cpu_jobs_low-grace.
	Do not use these partition names anymore. Prefer the *-cpu partition
	names defined above.
	For backwards-compatibility purposes, the legacy partition names are translated
	to their effective equivalent long-cpu, but they will eventually be removed
	entirely.


	Note
	As a convenience, should you request the unkillable, main or long
	partition for a CPU-only job, the partition will be translated to its -cpu
	equivalent automatically.

	For instance, to request an unkillable job with 1 GPU, 4 CPUs, 10G of RAM and
	12h of computation do:
	sbatch --gres=gpu:1 -c 4 --mem=10G -t 12:00:00 --partition=unkillable <job.sh>
	You can also make it an interactive job using salloc:
	salloc --gres=gpu:1 -c 4 --mem=10G -t 12:00:00 --partition=unkillable
	The Mila cluster has many different types of nodes/GPUs. To request a specific
	type of node/GPU, add a feature requirement to your job submission command
	with the flag --constraint=<name>. The full list of nodes in the Mila cluster
	can be found in the Node profile description.
	Example:
	To request a machine with 2 GPUs using NVLink, you can use
	sbatch -c 4 --gres=gpu:2 --constraint=nvlink






	Feature                    Particularities
	12GB/16GB/24GB/32GB/48GB   Request a specific amount of GPU memory
	volta/turing/ampere        Request a specific GPU architecture
	nvlink                     Machine with GPUs using the NVLink interconnect technology



	"
	Information on partitions/nodes,https://docs.mila.quebec/Userguide.html#information-on-partitions-nodes,"Information on partitions/nodes
	sinfo provides most of the
	information about available nodes and partitions/queues to submit jobs to.
	A partition is a group of nodes usually sharing similar features. On a
	partition, some job limits can be applied which will override those asked for a
	job (e.g. max time, max CPUs, etc.)
	To display available partitions, simply use
	sinfo
	PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
	batch up infinite 2 alloc node[1,3,5-9]
	batch up infinite 6 idle node[10-15]
	cpu up infinite 6 idle cpu_node[1-15]
	gpu up infinite 6 idle gpu_node[1-15]
	To display available nodes and their status, you can use
	sinfo -N -l
	NODELIST NODES PARTITION STATE CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON
	node[1,3,5-9] 2 batch allocated 2 246 16000 0 (null) (null)
	node[2,4] 2 batch drain 2 246 16000 0 (null) (null)
	node[10-15] 6 batch idle 2 246 16000 0 (null) (null)
	...
	And to get statistics on a job running or terminated, use sacct with some of
	the fields you want to display
	sacct --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,nnodes,ncpus,nodelist,workdir -u $USER
	User JobID JobName Partition State Timelimit Start End Elapsed NNodes NCPUS NodeList WorkDir
	--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- -------- ---------- --------------- --------------------
	my_usern+ 2398 run_extra+ batch RUNNING 130-05:00+ 2019-03-27T18:33:43 Unknown 1-01:07:54 1 16 node9 /home/mila/my_usern+
	my_usern+ 2399 run_extra+ batch RUNNING 130-05:00+ 2019-03-26T08:51:38 Unknown 2-10:49:59 1 16 node9 /home/mila/my_usern+
	"
	Information on partitions/nodes,https://docs.mila.quebec/Userguide.html#information-on-partitions-nodes,"Or to get the list of all your previous jobs, use the --start=YYYY-MM-DD flag. You can check sacct(1) for further information about additional time formats.
	sacct -u $USER --start=2019-01-01
	scontrol can be used to
	provide specific information on a job (currently running or recently terminated)
	scontrol show job 43123
	JobId=43123 JobName=python_script.py
	UserId=my_username(1500000111) GroupId=student(1500000000) MCS_label=N/A
	Priority=645895 Nice=0 Account=my_username QOS=normal
	JobState=RUNNING Reason=None Dependency=(null)
	Requeue=1 Restarts=3 BatchFlag=1 Reboot=0 ExitCode=0:0
	RunTime=2-10:41:57 TimeLimit=130-05:00:00 TimeMin=N/A
	SubmitTime=2019-03-26T08:47:17 EligibleTime=2019-03-26T08:49:18
	AccrueTime=2019-03-26T08:49:18
	StartTime=2019-03-26T08:51:38 EndTime=2019-08-03T13:51:38 Deadline=N/A
	PreemptTime=None SuspendTime=None SecsPreSuspend=0
	LastSchedEval=2019-03-26T08:49:18
	Partition=slurm_partition AllocNode:Sid=login-node-1:14586
	ReqNodeList=(null) ExcNodeList=(null)
	NodeList=node2
	BatchHost=node2
	NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
	TRES=cpu=16,mem=32000M,node=1,billing=3
	Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
	MinCPUsNode=16 MinMemoryNode=32000M MinTmpDiskNode=0
	Features=(null) DelayBoot=00:00:00
	OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
	WorkDir=/home/mila/my_username
	StdErr=/home/mila/my_username/slurm-43123.out
	StdIn=/dev/null
	StdOut=/home/mila/my_username/slurm-43123.out
	Power=
	Or more info on a node and its resources
	scontrol show node node9
	NodeName=node9 Arch=x86_64 CoresPerSocket=4
	CPUAlloc=16 CPUTot=16 CPULoad=1.38
	AvailableFeatures=(null)
	ActiveFeatures=(null)
	Gres=(null)
	NodeAddr=10.252.232.4 NodeHostName=mila20684000000 Port=0 Version=18.08
	OS=Linux 4.15.0-1036 #38-Ubuntu SMP Fri Dec 7 02:47:47 UTC 2018
	RealMemory=32000 AllocMem=32000 FreeMem=23262 Sockets=2 Boards=1
	State=ALLOCATED+CLOUD ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
	Partitions=slurm_partition
	BootTime=2019-03-26T08:50:01 SlurmdStartTime=2019-03-26T08:51:15
	CfgTRES=cpu=16,mem=32000M,billing=3
	AllocTRES=cpu=16,mem=32000M
	CapWatts=n/a
	CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
	ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
	"
	Useful Commands,https://docs.mila.quebec/Userguide.html#useful-commands,"Useful Commands

	salloc
	    Get an interactive job and give you a shell (ssh-like), CPU only
	salloc --gres=gpu:1 -c 2 --mem=12000
	    Get an interactive job with one GPU, 2 CPUs and 12000 MB RAM
	sbatch
	    Start a batch job (same options as salloc)
	sattach --pty <jobid>.0
	    Re-attach a dropped interactive job
	sinfo
	    Status of all nodes
	sinfo -Ogres:27,nodelist,features -tidle,mix,alloc
	    List GPU type and FEATURES that you can request
	savail
	    (Custom) List available GPUs
	scancel <jobid>
	    Cancel a job
	squeue
	    Summary status of all active jobs
	squeue -u $USER
	    Summary status of all YOUR active jobs
	squeue -j <jobid>
	    Summary status of a specific job
	squeue -Ojobid,name,username,partition,state,timeused,nodelist,gres,tres
	    Status of all jobs including requested resources (see the SLURM squeue doc for all output options)
	scontrol show job <jobid>
	    Detailed status of a running job
	sacct -j <job_id> -o NodeList
	    Get the node where a finished job ran
	sacct -u $USER -S <start_time> -E <stop_time>
	    Find info about old jobs
	sacct -oJobID,JobName,User,Partition,Node,State
	    List of current and recent jobs


	"
	Special GPU requirements,https://docs.mila.quebec/Userguide.html#special-gpu-requirements,"Special GPU requirements
	A specific GPU architecture, memory size or model can be requested through the
	--gres flag by using one of

	--gres=gpu:architecture:number
	--gres=gpu:memory:number
	--gres=gpu:model:number

	Example:
	To request 1 GPU with at least 16GB of memory use
	sbatch -c 4 --gres=gpu:16gb:1
	The full list of GPUs and their features can be accessed here.
	"
	Example script,https://docs.mila.quebec/Userguide.html#example-script,"Example script
	Here is a sbatch script that follows good practices on the Mila cluster:
	#!/bin/bash

	#SBATCH --partition=unkillable # Ask for unkillable job
	#SBATCH --cpus-per-task=2 # Ask for 2 CPUs
	#SBATCH --gres=gpu:1 # Ask for 1 GPU
	#SBATCH --mem=10G # Ask for 10 GB of RAM
	#SBATCH --time=3:00:00 # The job will run for 3 hours
	#SBATCH -o /network/scratch/<u>/<username>/slurm-%j.out # Write the log on scratch

	# 1. Load the required modules
	module --quiet load anaconda/3

	# 2. Load your environment
	conda activate ""<env_name>""

	# 3. Copy your dataset on the compute node
	cp /network/datasets/<dataset> $SLURM_TMPDIR

	# 4. Launch your job, tell it to save the model in $SLURM_TMPDIR
	# and look for the dataset into $SLURM_TMPDIR
	python main.py --path $SLURM_TMPDIR --data_path $SLURM_TMPDIR

	# 5. Copy whatever you want to save on $SCRATCH
	cp $SLURM_TMPDIR/<to_save> /network/scratch/<u>/<username>/
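
The staging pattern of steps 3 to 5 can be tried off-cluster. The following is a minimal sketch under stated assumptions: a temporary directory stands in for $SLURM_TMPDIR when that variable is unset, a second one stands in for the scratch directory, the file names are hypothetical, and tr plays the role of the actual computation.

```shell
# Use the job's local disk when running under SLURM, a temp dir otherwise.
work_dir=${SLURM_TMPDIR:-$(mktemp -d)}
# Hypothetical stand-in for the scratch directory used in the script.
dest_dir=$(mktemp -d)

# Stage in: put the input on the node-local directory.
printf 'sample data\n' > $work_dir/input.txt

# Compute: read from and write to the node-local directory only.
tr a-z A-Z < $work_dir/input.txt > $work_dir/result.txt

# Stage out: copy what must survive the job to permanent storage.
cp $work_dir/result.txt $dest_dir/
cat $dest_dir/result.txt
```

Keeping all reads and writes of the computation on the node-local directory is what limits traffic on the shared file systems; only the final copy touches permanent storage.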


	"
	Portability concerns and solutions,https://docs.mila.quebec/Userguide.html#portability-concerns-and-solutions,"Portability concerns and solutions
	When working on a software project, it is important to be aware of all the
	software and libraries the project relies on and to list them explicitly and
	under a version control system in such a way that they can easily be
	installed and made available on different systems. The upsides are significant:

	Easily install and run on the cluster
	Ease of collaboration
	Better reproducibility

	To achieve this, try to always keep in mind the following aspects:

	Versions: For each dependency, make sure you have some record of the
	specific version you are using during development. That way, in the future, you
	will be able to reproduce the original environment which you know to be
	compatible. Indeed, the more time passes, the more likely it is that newer
	versions of some dependency have breaking changes. The pip freeze command can create
	such a record for Python dependencies.
	Isolation: Ideally, each of your software projects should be isolated from
	the others. What this means is that updating the environment for project A
	should not update the environment for project B. That way, you can freely
	install and upgrade software and libraries for the former without worrying about
	breaking the latter (which you might not notice until weeks later, the next time
	you work on project B!) Isolation can be achieved using Python Virtual environments and Containers.
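
As a small illustration of the Versions point, the sketch below records the packages of the current Python environment with pip freeze. It is only a sketch: the temporary file is a hypothetical stand-in for a requirements.txt kept under version control, and the recording step is skipped when pip is not available.

```shell
req_file=$(mktemp)    # hypothetical location; usually requirements.txt in the repository

if python3 -m pip --version >/dev/null 2>&1; then
    # Record the exact version of every package installed in the current environment.
    python3 -m pip freeze > $req_file
    pip_available=yes
else
    pip_available=no
fi

# Later, in a fresh and isolated environment, the same versions can be restored with:
#   python3 -m pip install -r requirements.txt
printf 'pip available: %s, packages recorded: %s\n' $pip_available $(wc -l < $req_file)
```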

	"
	Managing your environments,https://docs.mila.quebec/Userguide.html#managing-your-environments,"Managing your environments
	"
	Virtual environments,https://docs.mila.quebec/Userguide.html#virtual-environments,"Virtual environments
	A virtual environment in Python is a local, isolated environment in which you
	can install or uninstall Python packages without interfering with the global
	environment (or other virtual environments). It usually lives in a directory
	(location varies depending on whether you use venv, conda or poetry). In order
	to use a virtual environment, you have to activate it. Activating an
	environment essentially sets environment variables in your shell so that:

	python points to the right Python version for that environment (different
	virtual environments can use different versions of Python!)
	python looks for packages in the virtual environment
	pip install installs packages into the virtual environment
	Any shell commands installed via pip install are made available

	To run experiments within a virtual environment, you can simply activate it
	in the script given to sbatch.
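
The effect of activation on the shell can be observed directly. This is a minimal sketch, assuming python3 with the standard venv module is installed; the environment name demo_env and its temporary location are hypothetical, and --without-pip is used only to keep the example fast.

```shell
env_dir=$(mktemp -d)/demo_env    # hypothetical throw-away environment

if python3 -m venv --without-pip $env_dir 2>/dev/null; then
    # Activation prepends the bin directory of the environment to PATH.
    . $env_dir/bin/activate
    active_python=$(command -v python)
    printf 'python now resolves to: %s\n' $active_python
    # Restore the previous PATH.
    deactivate
fi
```

In a job script given to sbatch, the same activation line would simply come before the command that launches the experiment.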
	"
	Pip/Virtualenv,https://docs.mila.quebec/Userguide.html#pip-virtualenv,"Pip/Virtualenv
	Pip is the preferred package manager for Python and each cluster provides
	several Python versions through the associated module which comes with pip. In
	order to install new packages, you will first have to create a personal space
	for them to be stored. The preferred solution (as it is the preferred solution
	on Digital Research Alliance of Canada clusters) is to use virtual
	environments.
	First, load the Python module you want to use:
	module load python/3.8
	Then, create a virtual environment in your home directory:
	python -m venv $HOME/<env>
	Where <env> is the name of your environment. Finally, activate the environment:
	source $HOME/<env>/bin/activate
	You can now install any Python package you wish using the pip command, e.g.
	pytorch:
	pip install torch torchvision
	Or Tensorflow:
	pip install tensorflow-gpu
	"
	Conda,https://docs.mila.quebec/Userguide.html#conda,"Conda
	Another solution for Python is to use miniconda or anaconda, which are also
	available through the module command (the use of Conda is not recommended for
	Digital Research Alliance of Canada clusters due to the availability of
	custom-built packages for pip):
	module load miniconda/3
	[=== Module miniconda/3 loaded ===]
	To enable conda environment functions, first use:
	To create an environment (see here
	for details) using a specific Python version, you may write:
	conda create -n <env> python=3.9
	Where <env> is the name of your environment. You can now activate it by doing:
	conda activate <env>
	You are now ready to install any Python package you want in this environment.
	For instance, to install PyTorch, you can find the Conda command of any version
	you want on pytorch’s website, e.g:
	conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
	If you make a lot of environments and install/uninstall a lot of packages, it
	can be good to periodically clean up Conda’s cache:
	conda clean --all
	"
	Using Modules,https://docs.mila.quebec/Userguide.html#using-modules,"Using Modules
	A lot of software, such as Python and Conda, is already compiled and available on
	the cluster through the module command and its sub-commands. In particular,
	if you wish to use Python 3.7 you can simply do:
	module load python/3.7
	"
	The module command,https://docs.mila.quebec/Userguide.html#the-module-command,"The module command
	For a list of available modules, simply use:
	module avail
	-------------------------------------------------------------------------------------------------------------- Global Aliases ---------------------------------------------------------------------------------------------------------------
	cuda/10.0 -> cudatoolkit/10.0 cuda/9.2 -> cudatoolkit/9.2 pytorch/1.4.1 -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.4.1 tensorflow/1.15 -> python/3.7/tensorflow/1.15
	cuda/10.1 -> cudatoolkit/10.1 mujoco-py -> python/3.7/mujoco-py/2.0 pytorch/1.5.0 -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.0 tensorflow/2.2 -> python/3.7/tensorflow/2.2
	cuda/10.2 -> cudatoolkit/10.2 mujoco-py/2.0 -> python/3.7/mujoco-py/2.0 pytorch/1.5.1 -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.1
	cuda/11.0 -> cudatoolkit/11.0 pytorch -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.1 tensorflow -> python/3.7/tensorflow/2.2
	cuda/9.0 -> cudatoolkit/9.0 pytorch/1.4.0 -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.4.0 tensorflow-cpu/1.15 -> python/3.7/tensorflow/1.15

	-------------------------------------------------------------------------------------------------- /cvmfs/config.mila.quebec/modules/Core ---------------------------------------------------------------------------------------------------
	Mila (S,L) anaconda/3 (D) go/1.13.5 miniconda/2 mujoco/1.50 python/2.7 python/3.6 python/3.8 singularity/3.0.3 singularity/3.2.1 singularity/3.5.3 (D)
	anaconda/2 go/1.12.4 go/1.14 (D) miniconda/3 (D) mujoco/2.0 (D) python/3.5 python/3.7 (D) singularity/2.6.1 singularity/3.1.1 singularity/3.4.2

	"
	The module command,https://docs.mila.quebec/Userguide.html#the-module-command,"------------------------------------------------------------------------------------------------ /cvmfs/config.mila.quebec/modules/Compiler ------------------------------------------------------------------------------------------------
	python/3.7/mujoco-py/2.0

	-------------------------------------------------------------------------------------------------- /cvmfs/config.mila.quebec/modules/Cuda ---------------------------------------------------------------------------------------------------
	cuda/10.0/cudnn/7.3 cuda/10.0/nccl/2.4 cuda/10.1/nccl/2.4 cuda/11.0/nccl/2.7 cuda/9.0/nccl/2.4 cudatoolkit/9.0 cudatoolkit/10.1 cudnn/7.6/cuda/10.0/tensorrt/7.0
	cuda/10.0/cudnn/7.5 cuda/10.1/cudnn/7.5 cuda/10.2/cudnn/7.6 cuda/9.0/cudnn/7.3 cuda/9.2/cudnn/7.6 cudatoolkit/9.2 cudatoolkit/10.2 cudnn/7.6/cuda/10.1/tensorrt/7.0
	cuda/10.0/cudnn/7.6 (D) cuda/10.1/cudnn/7.6 (D) cuda/10.2/nccl/2.7 cuda/9.0/cudnn/7.5 (D) cuda/9.2/nccl/2.4 cudatoolkit/10.0 cudatoolkit/11.0 (D) cudnn/7.6/cuda/9.0/tensorrt/7.0

	------------------------------------------------------------------------------------------------ /cvmfs/config.mila.quebec/modules/Pytorch --------------------------------------------------------------------------------------------------
	python/3.7/cuda/10.1/cudnn/7.6/pytorch/1.4.1 python/3.7/cuda/10.1/cudnn/7.6/pytorch/1.5.1 (D) python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.0
	python/3.7/cuda/10.1/cudnn/7.6/pytorch/1.5.0 python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.4.1 python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.1 (D)

	----------------------------------------------------------------------------------------------- /cvmfs/config.mila.quebec/modules/Tensorflow ------------------------------------------------------------------------------------------------
	python/3.7/tensorflow/1.15 python/3.7/tensorflow/2.0 python/3.7/tensorflow/2.2 (D)
	Modules can be loaded using the load command:
	module load <module>
	To search for a module or a software, use the command spider:
	module spider search_term
	E.g.: by default, python2 will refer to the os-shipped installation of python2.7 and python3 to python3.6.
	If you want to use python3.7 you can type:
	module load python/3.7
	"
	Available Software,https://docs.mila.quebec/Userguide.html#available-software,"Available Software
	Modules are divided in 5 main sections:






	Section              Description
	Core                 Base interpreter and software (Python, go, etc.)
	Compiler             Interpreter-dependent software (see the note below)
	Cuda                 Toolkits, cudnn and related libraries
	Pytorch/Tensorflow   Pytorch/TF built with a specific Cuda/Cudnn version for Mila’s GPUs (see the related paragraph)




	Note
	Modules which are nested (../../..) usually depend on other software/modules
	loaded alongside the main module. There is no need to load the dependent
	software: the complex naming scheme allows automatic detection of the dependent
	module(s).
	E.g.: loading cudnn/7.6/cuda/9.0/tensorrt/7.0 will load cudnn/7.6 and
	cuda/9.0 alongside.
	python/3.X is a particular dependency which can be served through
	python/3.X or anaconda/3 and is not automatically loaded, to let the
	user pick their favorite flavor.

	"
	Default package location,https://docs.mila.quebec/Userguide.html#default-package-location,"Default package location
	Python by default uses the user site package first and packages provided by
	module last to not interfere with your installation. If you want to skip
	packages installed in your site-packages folder (in your /home directory), you
	have to start Python with the -s flag.
	To check which package is loaded at import, you can print package.__file__
	to get the full path of the package.
	Example:
	module load pytorch/1.5.0
	python -c 'import torch;print(torch.__file__)'
	/home/mila/my_home/.local/lib/python3.7/site-packages/torch/__init__.py <== package from your own site-package
	Now with the -s flag:
	module load pytorch/1.5.0
	python -s -c 'import torch;print(torch.__file__)'
	/cvmfs/ai.mila.quebec/apps/x86_64/debian/pytorch/python3.7-cuda10.1-cudnn7.6-v1.5.0/lib/python3.7/site-packages/torch/__init__.py
	"
	On using containers,https://docs.mila.quebec/Userguide.html#on-using-containers,"On using containers
	Another option for creating portable code is to use containers on clusters.
	Containers are a popular approach to deploying applications by packaging a lot
	of the required dependencies together. The most popular tool for this is
	Docker, but Docker cannot be used on the Mila
	cluster (nor the other clusters from Digital Research Alliance of Canada).
	One popular mechanism for containerisation on a computational cluster is called
	Singularity.
	This is the recommended approach for running containers on the
	Mila cluster. See section Singularity for more details.
	"
	Singularity,https://docs.mila.quebec/Userguide.html#id7,"Singularity
	"
	Overview,https://docs.mila.quebec/Userguide.html#overview,"Overview
	"
	What is Singularity?,https://docs.mila.quebec/Userguide.html#what-is-singularity,"What is Singularity?
	Running Docker on SLURM is a security problem (e.g. running as root, being able
	to mount any directory). The alternative is to use Singularity, which is a
	popular solution in the world of HPC.
	There is a good level of compatibility between Docker and Singularity,
	and we can find many exaggerated claims about being able to convert containers
	from Docker to Singularity without any friction.
	Oftentimes, Docker images from DockerHub are 100% compatible with Singularity,
	and they can indeed be used without friction, but things get messy when
	we try to convert our own Docker build files to Singularity recipes.
	"
	Links to official documentation,https://docs.mila.quebec/Userguide.html#links-to-official-documentation,"Links to official documentation

	official Singularity user guide (this is the one you
	will use most often)
	official Singularity admin guide

	"
	Overview of the steps used in practice,https://docs.mila.quebec/Userguide.html#overview-of-the-steps-used-in-practice,"Overview of the steps used in practice
	Most often, the process to create and use a Singularity container is:

	on your Linux computer (at home or work)

	select a Docker image from DockerHub (e.g. pytorch/pytorch)
	make a recipe file for Singularity that starts with that DockerHub image
	build the recipe file, thus creating the image file (e.g. my-pytorch-image.sif)
	test your singularity container before sending it over to the cluster
	rsync -av my-pytorch-image.sif <login-node>:Documents/my-singularity-images


	on the login node for that cluster

	queue your jobs with sbatch ...
	(note that your jobs will copy over the my-pytorch-image.sif to $SLURM_TMPDIR
	and will then launch Singularity with that image)
	do something else while you wait for them to finish
	queue more jobs with the same my-pytorch-image.sif,
	reusing it many times over



	In the following sections you will find specific examples or tips to accomplish
	in practice the steps highlighted above.
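
As a rough preview of the cluster-side step, the lines below could form the body of a job script. This is only a sketch under stated assumptions: the image location is the destination of the rsync command above, the Python one-liner is a hypothetical payload, and a temporary directory replaces $SLURM_TMPDIR outside of a job. The --nv option of singularity exec makes the NVIDIA GPUs of the host visible inside the container.

```shell
image=my-pytorch-image.sif
image_dir=$HOME/Documents/my-singularity-images    # destination used in the rsync step
work_dir=${SLURM_TMPDIR:-$(mktemp -d)}             # temp dir when not inside a job

if command -v singularity >/dev/null 2>&1 && [ -f $image_dir/$image ]; then
    # Copy the image to node-local storage once per job, then run from there.
    cp $image_dir/$image $work_dir/
    # Hypothetical payload: print the PyTorch version found inside the container.
    singularity exec --nv $work_dir/$image python -c 'import torch; print(torch.__version__)'
    ran=yes
else
    ran=no    # singularity or the image is missing, e.g. when trying this off-cluster
fi
printf 'container run attempted: %s\n' $ran
```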
	"
	"Nope, not on MacOS",https://docs.mila.quebec/Userguide.html#nope-not-on-macos,"Nope, not on MacOS
	Singularity does not work on MacOS, as of the time of this writing in 2021.
	Docker does not actually run on MacOS, but there Docker silently installs a
	virtual machine running Linux, which makes it a pleasant experience,
	and the user does not need to care about the details of how Docker does it.
	Given its origins in HPC, Singularity does not provide that kind of seamless
	experience on MacOS, even though it’s technically possible to run it
	inside a Linux virtual machine on MacOS.
	"
	Where to build images,https://docs.mila.quebec/Userguide.html#where-to-build-images,"Where to build images
	Building Singularity images is a rather heavy task, which can take 20 minutes
	if you have a lot of steps in your recipe. This makes it a bad task to run on
	the login nodes of our clusters, especially if it needs to be run regularly.
	On the Mila cluster, we are lucky to have unrestricted internet access on the compute
	nodes, which means that anyone can request an interactive CPU node (no need for GPU)
	and build their images there without problem.

	Warning
	Do not build Singularity images from scratch every time you run a
	job in a large batch. This will be a colossal waste of GPU time as well as
	internet bandwidth. If you set up your workflow properly (e.g. using bind
	paths for your code and data), you can spend months reusing the same
	Singularity image my-pytorch-image.sif.

	"
	Building the containers,https://docs.mila.quebec/Userguide.html#building-the-containers,"Building the containers
	Building a container is like creating a new environment except that containers
	are much more powerful since they are self-contained systems. With
	singularity, there are two ways to build containers.
	The first one is to do it by yourself. It’s like when you get a new Linux laptop
	and don’t really know what you need: if you see that something is missing, you
	install it. Here you can get a vanilla container with Ubuntu, called a sandbox;
	you log in and install each package yourself. This procedure can take
	time but will allow you to understand how things work and what you need. This is
	recommended if you need to figure out how things will be compiled or if you want
	to install packages on the fly. We’ll refer to this procedure as singularity
	sandboxes.
	The second way is for when you know what you want: you write a list of
	everything you need, you send it to singularity and it will install everything
	for you. Those lists are called singularity recipes.
	"
	First way: Build and use a sandbox,https://docs.mila.quebec/Userguide.html#first-way-build-and-use-a-sandbox,"First way: Build and use a sandbox
	You might ask yourself: On which machine should I build a container?
	First of all, you need to choose where you’ll build your container. This
	operation requires memory and high cpu usage.

	Warning
	Do NOT build containers on any login nodes !


	(Recommended for beginner) If you need to use apt-get, you should build
	the container on your laptop with sudo privileges. You’ll only need to
	install singularity on your laptop. Windows/Mac users can look there and
	Ubuntu/Debian users can use directly:

	sudo apt-get install singularity-container


	If you can’t install singularity on your laptop and you don’t need
	apt-get, you can reserve a cpu node on the Mila cluster to build your
	container.

	In this case, in order to avoid too much I/O over the network, you should define
	the singularity cache locally:

	export SINGULARITY_CACHEDIR=$SLURM_TMPDIR


	If you can’t install singularity on your laptop and you want to use
	apt-get, you can use singularity-hub to build your containers and read
	Recipe_section.

	"
	Download containers from the web,https://docs.mila.quebec/Userguide.html#download-containers-from-the-web,"Download containers from the web
	Fortunately, you may not need to create containers from scratch, as many have
	already been built for the most common deep learning software. You can find most of
	them on dockerhub.
	Go on dockerhub and select the container you want to pull.
	For example, to get a PyTorch image with GPU support
	(replace runtime by devel if you need the full Cuda toolkit):
	singularity pull docker://pytorch/pytorch:1.0.1-cuda10.0-cudnn7-runtime
	Or the latest TensorFlow:
	singularity pull docker://tensorflow/tensorflow:latest-gpu-py3
	Currently the pulled image pytorch.simg or tensorflow.simg is read-only,
	meaning that you won’t be able to install anything on it. From now on, PyTorch
	will be taken as the example. If you use TensorFlow, simply replace every
	occurrence of pytorch by tensorflow.
	"
	How to add or install stuff in a container,https://docs.mila.quebec/Userguide.html#how-to-add-or-install-stuff-in-a-container,"How to add or install stuff in a container
	The first step is to transform your read-only container
	pytorch-1.0.1-cuda10.0-cudnn7-runtime.simg into a writable version that will
	allow you to add packages.

	Warning
	Depending on the version of singularity you are using, singularity
	will build a container with the extension .simg or .sif. If you’re using
	.sif files, replace every occurrence of .simg with .sif.


	Tip
	If you want to use apt-get, you have to prefix the following commands
	with sudo.

	This command will create a writable image in the folder pytorch.
	singularity build --sandbox pytorch pytorch-1.0.1-cuda10.0-cudnn7-runtime.simg
	Then use the following command to open a shell inside the container.
	singularity shell --writable -H $HOME:/home pytorch
	Once you are inside the container, you can use pip to install anything you need
	(or apt-get, if you built the container with sudo).

	Warning
	Singularity mounts your home folder, so if you install things into
	the $HOME of your container, they will be installed in your real
	$HOME!

	You should install your stuff in /usr/local instead.
	"
	Creating useful directories,https://docs.mila.quebec/Userguide.html#creating-useful-directories,"Creating useful directories
	One of the benefits of containers is that you’ll be able to use them across
	different clusters. However for each cluster the datasets and experiments
	folder location can be different. In order to be invariant to those locations,
	we will create some useful mount points inside the container:
	mkdir /dataset
	mkdir /tmp_log
	mkdir /final_log
	From now on, you won’t need to worry about where your code should look for
	your dataset: it will always be in /dataset, regardless of the cluster you
	are using.
	"
	Testing,https://docs.mila.quebec/Userguide.html#testing,"Testing
	If you have some code that you want to test before finalizing your container,
	you have two choices. You can either log into your container and run Python
	code inside it with:
	singularity shell --nv pytorch
	Or you can execute your command directly with:
	singularity exec --nv pytorch python YOUR_CODE.py

	Tip
	--nv allows the container to use GPUs. You don’t need this if you
	don’t plan to use a GPU.


	Warning
	Don’t forget to clear the cache of the packages you installed in
	the containers.

	"
	Creating a new image from the sandbox,https://docs.mila.quebec/Userguide.html#creating-a-new-image-from-the-sandbox,"Creating a new image from the sandbox
	Once everything you need is installed inside the container, you need to convert
	it back to a read-only singularity image with:
	singularity build pytorch_final.simg pytorch
	"
	Second way: Use recipes,https://docs.mila.quebec/Userguide.html#second-way-use-recipes,"Second way: Use recipes
	A singularity recipe is a file that specifies the software to install,
	environment variables, files to add, and container metadata. It is a starting
	point for designing any custom container. Instead of pulling a container and
	installing your packages manually, you can specify in this file the packages
	you want and then build your container from this file.
	Here is a toy example of a singularity recipe that installs a few packages:
	################# Header: Define the base system you want to use ################
	# Reference of the kind of base you want to use (e.g., docker, debootstrap, shub).
	Bootstrap: docker
	# Select the docker image you want to use (Here we choose tensorflow)
	From: tensorflow/tensorflow:latest-gpu-py3

	################# Section: Defining the system #################################
	# Commands in the %post section are executed within the container.
	%post
	echo ""Installing Tools with apt-get""
	apt-get update
	apt-get install -y cmake libcupti-dev libyaml-dev wget unzip
	apt-get clean
	echo ""Installing things with pip""
	pip install tqdm
	echo ""Creating mount points""
	mkdir /dataset
	mkdir /tmp_log
	mkdir /final_log


	# Environment variables that should be sourced at runtime.
	%environment
	# use bash as default shell
	SHELL=/bin/bash
	export SHELL


	A recipe file contains two parts: the header and the sections. In the
	header, you specify which base system you want to use; it can be any docker
	or singularity container. In the sections, you can list the things you want to
	install in the %post section, or list the environment variables you need
	to source at runtime in the %environment section. For a more detailed
	description, please look at the singularity documentation.
	To build a singularity container from a singularity recipe file, use:
	sudo singularity build <NAME_CONTAINER> <YOUR_RECIPE_FILES>

	Warning
	You always need to use sudo when you build a container from a
	recipe. As there is no access to sudo on the cluster, you need a personal
	computer or singularity hub to build a container.

	"
	Build recipe on singularity hub,https://docs.mila.quebec/Userguide.html#build-recipe-on-singularity-hub,"Build recipe on singularity hub
	Singularity hub allows users to build containers from recipes directly on
	singularity-hub’s cloud, meaning that you don’t need to build containers
	yourself. You need to register on singularity-hub and link your
	singularity-hub account to your GitHub account, then:

	1. Create a new GitHub repository.
	2. Add a collection on singularity-hub and select the GitHub repository you created.
	3. Clone the GitHub repository on your computer.
	$ git clone <url>
	4. Write the singularity recipe and save it as a file named Singularity.
	5. Add Singularity with git, then commit and push to the master branch.
	$ git add Singularity
	$ git commit
	$ git push origin master

	At this point, robots from singularity-hub will build the container for you. You
	will be able to download your container from the website or directly with:
	singularity pull shub://<github_username>/<repository_name>
	"
	"Example: Recipe with OpenAI gym, MuJoCo and Miniworld",https://docs.mila.quebec/Userguide.html#example-recipe-with-openai-gym-mujoco-and-miniworld,"Example: Recipe with OpenAI gym, MuJoCo and Miniworld
	Here is an example of how you can use a singularity recipe to install a complex
	environment such as OpenAI gym, MuJoCo and Miniworld on a PyTorch-based
	container. In order to use MuJoCo, you’ll need to copy the key stored on the
	Mila cluster in /ai/apps/mujoco/license/mjkey.txt to your current directory.
	# This is a singularity recipe that sets up a full Gym install with test dependencies
	Bootstrap: docker

	# Here we'll build our container upon the pytorch container
	From: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime

	# Now we'll copy the mjkey file located in the current directory inside the container's root
	# directory
	%files
	mjkey.txt

	# Then we put everything we need to install
	%post
	export PATH=$PATH:/opt/conda/bin
	apt -y update && \
	apt install -y keyboard-configuration && \
	apt install -y \
	python3-dev \
	python-pyglet \
	python3-opengl \
	libhdf5-dev \
	libjpeg-dev \
	libboost-all-dev \
	libsdl2-dev \
	libosmesa6-dev \
	patchelf \
	ffmpeg \
	xvfb \
	libhdf5-dev \
	openjdk-8-jdk \
	wget \
	git \
	unzip && \
	apt clean && \
	rm -rf /var/lib/apt/lists/*
	pip install h5py

	# Download Gym and MuJoCo
	mkdir /Gym && cd /Gym
	git clone https://github.com/openai/gym.git || true && \
	mkdir /Gym/.mujoco && cd /Gym/.mujoco
	wget https://www.roboti.us/download/mjpro150_linux.zip && \
	unzip mjpro150_linux.zip && \
	wget https://www.roboti.us/download/mujoco200_linux.zip && \
	unzip mujoco200_linux.zip && \
	mv mujoco200_linux mujoco200

	# Export global environment variables
	export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt
	export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mujoco150/
	export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
	export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym"
	"Example: Recipe with OpenAI gym, MuJoCo and Miniworld",https://docs.mila.quebec/Userguide.html#example-recipe-with-openai-gym-mujoco-and-miniworld,"/.mujoco/mujoco200/bin
	export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
	cp /mjkey.txt /Gym/.mujoco/mjkey.txt
	# Install Python dependencies
	wget https://raw.githubusercontent.com/openai/mujoco-py/master/requirements.txt
	pip install -r requirements.txt
	# Install Gym and MuJoCo
	cd /Gym/gym
	pip install -e '.[all]'
	# Change permission to use mujoco_py as non sudoer user
	chmod -R 777 /opt/conda/lib/python3.6/site-packages/mujoco_py/
	pip install --upgrade minerl

	# Export global environment variables
	%environment
	export SHELL=/bin/sh
	export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt
	export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mujoco150/
	export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
	export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin
	export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
	export PATH=/Gym/gym/.tox/py3/bin:$PATH

	%runscript
	exec /bin/sh ""$@""


	Here is the same recipe but written for TensorFlow:
	# This is a singularity recipe that sets up a full Gym install with test dependencies
	Bootstrap: docker

	# Here we'll build our container upon the tensorflow container
	From: tensorflow/tensorflow:latest-gpu-py3

	# Now we'll copy the mjkey file located in the current directory inside the container's root
	# directory
	%files
	mjkey.txt

	# Then we put everything we need to install
	%post
	apt -y update && \
	apt install -y keyboard-configuration && \
	apt install -y \
	python3-setuptools \
	python3-dev \
	python-pyglet \
	python3-opengl \
	libjpeg-dev \
	libboost-all-dev \
	libsdl2-dev \
	libosmesa6-dev \
	patchelf \
	ffmpeg \
	xvfb \
	wget \
	git \
	unzip && \
	apt clean && \
	rm -rf /var/lib/apt/lists/*

	# Download Gym and MuJoCo
	mkdir /Gym && cd /Gym
	git clone"
	"Example: Recipe with OpenAI gym, MuJoCo and Miniworld",https://docs.mila.quebec/Userguide.html#example-recipe-with-openai-gym-mujoco-and-miniworld," https://github.com/openai/gym.git || true && \
	mkdir /Gym/.mujoco && cd /Gym/.mujoco
	wget https://www.roboti.us/download/mjpro150_linux.zip && \
	unzip mjpro150_linux.zip && \
	wget https://www.roboti.us/download/mujoco200_linux.zip && \
	unzip mujoco200_linux.zip && \
	mv mujoco200_linux mujoco200

	# Export global environment variables
	export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt
	export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mujoco150/
	export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
	export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin
	export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
	cp /mjkey.txt /Gym/.mujoco/mjkey.txt

	# Install Python dependencies
	wget https://raw.githubusercontent.com/openai/mujoco-py/master/requirements.txt
	pip install -r requirements.txt
	# Install Gym and MuJoCo
	cd /Gym/gym
	pip install -e '.[all]'
	# Change permission to use mujoco_py as non sudoer user
	chmod -R 777 /usr/local/lib/python3.5/dist-packages/mujoco_py/

	# Then install miniworld
	cd /usr/local/
	git clone https://github.com/maximecb/gym-miniworld.git
	cd gym-miniworld
	pip install -e .

	# Export global environment variables
	%environment
	export SHELL=/bin/bash
	export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt
	export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mujoco150/
	export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
	export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin
	export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
	export PATH=/Gym/gym/.tox/py3/bin:$PATH

	%runscript
	exec /bin/bash ""$@""


	Keep in mind that those environment variables are sourced at runtime and not at
	build time. This is why you should also define them in the %post section,
	since they are required to install MuJoCo.
	"
	Using containers on clusters,https://docs.mila.quebec/Userguide.html#using-containers-on-clusters,"Using containers on clusters
	"
	How to use containers on clusters,https://docs.mila.quebec/Userguide.html#how-to-use-containers-on-clusters,"How to use containers on clusters
	On every cluster with Slurm, datasets and intermediate results should go in
	$SLURM_TMPDIR while the final experiment results should go in $SCRATCH.
	In order to use the container you built, you need to copy it to the cluster you
	want to use.

	Warning
	You should always store your container in $SCRATCH!

	Then reserve a node with srun/sbatch, copy the container and your dataset to the
	node given by SLURM (i.e. in $SLURM_TMPDIR) and execute the code
	<YOUR_CODE> within the container <YOUR_CONTAINER> with:
	singularity exec --nv -H $HOME:/home -B $SLURM_TMPDIR:/dataset/ -B $SLURM_TMPDIR:/tmp_log/ -B $SCRATCH:/final_log/ $SLURM_TMPDIR/<YOUR_CONTAINER> python <YOUR_CODE>
	Remember that /dataset, /tmp_log and /final_log were created in the
	previous section. Each time we use singularity, we explicitly
	tell it to mount $SLURM_TMPDIR on the cluster’s node in the folder
	/dataset inside the container with the option -B, so that each dataset
	downloaded by PyTorch in /dataset will be available in $SLURM_TMPDIR.
	This allows us to have code and scripts that are invariant to the cluster
	environment. The option -H specifies the container’s home. For
	example, if you have your code in $HOME/Project12345/Version35/, you can
	specify -H $HOME/Project12345/Version35:/home, so that the container will only
	have access to the code inside Version35.
	If you want to run multiple commands inside the container you can use:
	singularity exec --nv -H $HOME:/home -B $SLURM_TMPDIR:/dataset/ \
	-B $SLURM_TMPDIR:/tmp_log/ -B $SCRATCH:/final_log/ \
	$SLURM_TMPDIR/<YOUR_CONTAINER> bash -c 'pwd && ls && python <YOUR_CODE>'
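When the same command is launched from a Python driver script, it can help to assemble the argument list in one place. The sketch below is only an illustration: build_singularity_cmd is a hypothetical helper, and the environment values, image name and script name in the example are made up. It reproduces the -H and -B options of the command shown above.

```python
import os
import shlex


def build_singularity_cmd(container, code, env=None):
    '''Return the singularity exec argument list used in this section.

    container is the image file name inside $SLURM_TMPDIR and code is the
    script to run; both are placeholders supplied by the caller.
    '''
    env = os.environ if env is None else env
    tmpdir = env['SLURM_TMPDIR']
    scratch = env['SCRATCH']
    home = env['HOME']
    return [
        'singularity', 'exec', '--nv',
        '-H', f'{home}:/home',
        '-B', f'{tmpdir}:/dataset/',
        '-B', f'{tmpdir}:/tmp_log/',
        '-B', f'{scratch}:/final_log/',
        f'{tmpdir}/{container}',
        'python', code,
    ]


# Example values only; on a compute node these come from the job environment.
example_env = {'SLURM_TMPDIR': '/tmp/job', 'SCRATCH': '/scratch/me', 'HOME': '/home/me'}
cmd = build_singularity_cmd('pytorch_final.simg', 'train.py', example_env)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) inside a job, which avoids quoting problems when paths contain unusual characters.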
	"
	Example: Interactive case (srun/salloc),https://docs.mila.quebec/Userguide.html#example-interactive-case-srun-salloc,"Example: Interactive case (srun/salloc)
	Once you get an interactive session with SLURM, copy <YOUR_CONTAINER> and
	<YOUR_DATASET> to $SLURM_TMPDIR
	0. Get an interactive session
	srun --gres=gpu:1
	1. Copy your container on the compute node
	rsync -avz $SCRATCH/<YOUR_CONTAINER> $SLURM_TMPDIR
	2. Copy your dataset on the compute node
	rsync -avz $SCRATCH/<YOUR_DATASET> $SLURM_TMPDIR
	Then use singularity shell to get a shell inside the container
	3. Get a shell in your environment
	singularity shell --nv \
	-H $HOME:/home \
	-B $SLURM_TMPDIR:/dataset/ \
	-B $SLURM_TMPDIR:/tmp_log/ \
	-B $SCRATCH:/final_log/ \
	$SLURM_TMPDIR/<YOUR_CONTAINER>
	4. Execute your code
	python <YOUR_CODE>
	Alternatively, use singularity exec to execute <YOUR_CODE> directly.
	3. Execute your code
	singularity exec --nv \
	-H $HOME:/home \
	-B $SLURM_TMPDIR:/dataset/ \
	-B $SLURM_TMPDIR:/tmp_log/ \
	-B $SCRATCH:/final_log/ \
	$SLURM_TMPDIR/<YOUR_CONTAINER> \
	python <YOUR_CODE>
	You can also create the following alias to make your life easier.
	alias my_env='singularity exec --nv \
	-H $HOME:/home \
	-B $SLURM_TMPDIR:/dataset/ \
	-B $SLURM_TMPDIR:/tmp_log/ \
	-B $SCRATCH:/final_log/ \
	$SLURM_TMPDIR/<YOUR_CONTAINER>'
	This will allow you to run any code with:
	my_env python <YOUR_CODE>
	"
	Example: sbatch case,https://docs.mila.quebec/Userguide.html#example-sbatch-case,"Example: sbatch case
	You can also create a sbatch script:

	#!/bin/bash
	#SBATCH --cpus-per-task=6 # Ask for 6 CPUs
	#SBATCH --gres=gpu:1 # Ask for 1 GPU
	#SBATCH --mem=10G # Ask for 10 GB of RAM
	#SBATCH --time=0:10:00 # The job will run for 10 minutes

	# 1. Copy your container on the compute node
	rsync -avz $SCRATCH/<YOUR_CONTAINER> $SLURM_TMPDIR
	# 2. Copy your dataset on the compute node
	rsync -avz $SCRATCH/<YOUR_DATASET> $SLURM_TMPDIR
	# 3. Executing your code with singularity
	singularity exec --nv \
	-H $HOME:/home \
	-B $SLURM_TMPDIR:/dataset/ \
	-B $SLURM_TMPDIR:/tmp_log/ \
	-B $SCRATCH:/final_log/ \
	$SLURM_TMPDIR/<YOUR_CONTAINER> \
	python ""<YOUR_CODE>""
	# 4. Copy whatever you want to save on $SCRATCH
	rsync -avz $SLURM_TMPDIR/<to_save> $SCRATCH


	"
	Issue with PyBullet and OpenGL libraries,https://docs.mila.quebec/Userguide.html#issue-with-pybullet-and-opengl-libraries,"Issue with PyBullet and OpenGL libraries
	If you are running certain gym environments that require pyglet, you may
	encounter a problem when running your singularity instance with the Nvidia
	drivers using the --nv flag. This happens because the --nv flag also
	provides the OpenGL libraries:
	libGL.so.1 => /.singularity.d/libs/libGL.so.1
	libGLX.so.0 => /.singularity.d/libs/libGLX.so.0


	If you don’t experience those problems with pyglet, you probably don’t need
	to address this. Otherwise, you can resolve those problems by running
	apt-get install -y libosmesa6-dev mesa-utils mesa-utils-extra libgl1-mesa-glx,
	and then making sure that your LD_LIBRARY_PATH points to those libraries
	before the ones in /.singularity.d/libs.
	%environment
	# ...
	export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/mesa:$LD_LIBRARY_PATH


	"
	Mila cluster,https://docs.mila.quebec/Userguide.html#mila-cluster,"Mila cluster
	On the Mila cluster, $SCRATCH is not yet defined. You should store the
	experiment results you want to keep in /network/scratch/<u>/<username>/. In
	order to use the sbatch script above and to match the names used in other
	cluster environments, you can define $SCRATCH as
	/network/scratch/<u>/<username> with:
	echo ""export SCRATCH=/network/scratch/${USER:0:1}/$USER"" >> ~/.bashrc
	Then, you can follow the general procedure explained above.
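For Python code that cannot rely on $SCRATCH being set, the same path can be derived from the user name. This is a minimal sketch: mila_scratch is a hypothetical helper and alice is a made-up user name. It mirrors the ${USER:0:1}/$USER pattern of the command above.

```python
import getpass


def mila_scratch(user=None):
    # Mirrors /network/scratch/${USER:0:1}/$USER from the shell command above.
    user = getpass.getuser() if user is None else user
    if not user:
        raise ValueError('user name must not be empty')
    return f'/network/scratch/{user[0]}/{user}'


print(mila_scratch('alice'))  # alice is a made-up user name
```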
	"
	Digital Research Alliance of Canada,https://docs.mila.quebec/Userguide.html#digital-research-alliance-of-canada,"Digital Research Alliance of Canada
	Using singularity on Digital Research Alliance of Canada is similar except that
	you need to add Yoshua’s account name and load singularity. Here is an example
	of an sbatch script using singularity on a Compute Canada cluster:

	Warning
	You should use singularity/2.6 or singularity/3.4. There is a bug
	in singularity/3.2 which makes gpu unusable.

	#!/bin/bash
	#SBATCH --account=rpp-bengioy # Yoshua pays for your job
	#SBATCH --cpus-per-task=6 # Ask for 6 CPUs
	#SBATCH --gres=gpu:1 # Ask for 1 GPU
	#SBATCH --mem=32G # Ask for 32 GB of RAM
	#SBATCH --time=0:10:00 # The job will run for 10 minutes
	#SBATCH --output=""/scratch/<user>/slurm-%j.out"" # Modify the output of sbatch

	# 1. You have to load singularity
	module load singularity
	# 2. Then you copy the container to the local disk
	rsync -avz $SCRATCH/<YOUR_CONTAINER> $SLURM_TMPDIR
	# 3. Copy your dataset on the compute node
	rsync -avz $SCRATCH/<YOUR_DATASET> $SLURM_TMPDIR
	# 4. Executing your code with singularity
	singularity exec --nv \
	    -H $HOME:/home \
	    -B $SLURM_TMPDIR:/dataset/ \
	    -B $SLURM_TMPDIR:/tmp_log/ \
	    -B $SCRATCH:/final_log/ \
	    $SLURM_TMPDIR/<YOUR_CONTAINER> \
	    python ""<YOUR_CODE>""
	# 5. Copy whatever you want to save on $SCRATCH
	rsync -avz $SLURM_TMPDIR/<to_save> $SCRATCH


	"
	Sharing Data with ACLs,https://docs.mila.quebec/Userguide.html#sharing-data-with-acls,"Sharing Data with ACLs
	Regular permission bits are extremely blunt tools: they control access through
	only three sets of bits (owning user, owning group and all others). Therefore,
	access is either too narrow (0700 allows access only by oneself) or too wide
	(0770 gives all permissions to everyone in the same group, and 0777 to
	literally everyone).
	ACLs (Access Control Lists) are an extension of the permission bits that allows
	more fine-grained control of access to a file. They can be used to
	permit specific users access to files and folders even if conservative default
	permissions would have denied them such access.
	As an illustrative example, to use ACLs to allow $USER (oneself) to
	share with $USER2 (another person) a “playground” folder hierarchy in
	Mila’s scratch filesystem at a location

	/network/scratch/${USER:0:1}/$USER/X/Y/Z/...

	in a safe and secure fashion that allows both users to read, write, execute,
	search and delete each others’ files:


	1. Grant oneself permissions to access any future files/folders created
	by the other (or oneself)
	(-d renders this permission a “default” / inheritable one)

	setfacl -Rdm user:${USER}:rwx /network/scratch/${USER:0:1}/$USER/X/Y/Z/




	Note
	The importance of doing this seemingly-redundant step first is that files
	and folders are always owned by only one person, almost always their
	creator (the UID will be the creator’s, the GID typically as well). If that
	user is not yourself, you will not have access to those files unless the
	other person specifically gives them to you – or these files inherited a
	default ACL allowing you full access.
	This is the inherited, default ACL serving that purpose.


	2. Grant the other permission to access any future files/folders created
	by the other (or oneself)
	(-d renders this permission a “default” / inheritable one)

	setfacl"
	Sharing Data with ACLs,https://docs.mila.quebec/Userguide.html#sharing-data-with-acls," -Rdm user:${USER2}:rwx /network/scratch/${USER:0:1}/$USER/X/Y/Z/




	3. Grant the other permission to access any existing files/folders created
	by oneself.
	Such files and folders were created before the new default ACLs were added
	above and thus did not inherit them from their parent folder at the moment of
	their creation.

	setfacl -Rm user:${USER2}:rwx /network/scratch/${USER:0:1}/$USER/X/Y/Z/



	Note
	The purpose of granting permissions first for future files and then for
	existing files is to prevent a race condition whereby after the first
	setfacl command the other person could create files to which the
	second setfacl command does not apply.



	4. Grant the other permission to search through one’s hierarchy down to the
	shared location in question.


	Non-recursive (!!!!)
	You may also grant :rx in the unlikely event that others listing your folders
	on the path is not troublesome, or is even desirable.

	setfacl -m user:${USER2}:x /network/scratch/${USER:0:1}/$USER/X/Y/
	setfacl -m user:${USER2}:x /network/scratch/${USER:0:1}/$USER/X/
	setfacl -m user:${USER2}:x /network/scratch/${USER:0:1}/$USER/



	Note
	In order to access a file, all folders from the root (/) down to the
	parent folder in question must be searchable (+x) by the concerned user.
	This is already the case for all users for folders such as /,
	/network and /network/scratch, but users must explicitly grant access
	to some or all users either through base permissions or by adding ACLs, for
	at least /network/scratch/${USER:0:1}/$USER, $HOME and subfolders.
	To bluntly allow all users to search through a folder (think twice!),
	the following command can be used:
	chmod a+x /network/scratch/${USER:0:1}/$USER/




	Note
	For more information on setfacl and path resolution/access checking,
	consider the following documentation viewing commands:

	man setfacl
	man path_resolution
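
The order of the four steps matters, so it can be convenient to generate the commands rather than type them. The following is a sketch under stated assumptions: acl_share_commands is a hypothetical helper, alice and bob are made-up user names, and the function only builds the argument lists without running anything.

```python
from pathlib import PurePosixPath


def acl_share_commands(shared_dir, owner_root, owner, other):
    '''Return the setfacl commands of steps 1-4, in order, as argument lists.'''
    shared = PurePosixPath(shared_dir)
    root = PurePosixPath(owner_root)
    if root not in shared.parents:
        raise ValueError('shared_dir must be located below owner_root')
    cmds = [
        # 1. default (inheritable) ACL for oneself on future files
        ['setfacl', '-Rdm', f'user:{owner}:rwx', str(shared)],
        # 2. default (inheritable) ACL for the other user on future files
        ['setfacl', '-Rdm', f'user:{other}:rwx', str(shared)],
        # 3. ACL for the other user on files that already exist
        ['setfacl', '-Rm', f'user:{other}:rwx', str(shared)],
    ]
    # 4. non-recursive search (x) permission on every folder between the
    #    shared folder and the owner's scratch root, root included
    for parent in shared.parents:
        cmds.append(['setfacl', '-m', f'user:{other}:x', str(parent)])
        if parent == root:
            break
    return cmds


for cmd in acl_share_commands('/network/scratch/a/alice/X/Y/Z',
                              '/network/scratch/a/alice', 'alice', 'bob'):
    print(' '.join(cmd))
```

Printing the commands first makes it possible to review them before running each one, for example with subprocess.run(cmd, check=True).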

	"
	Viewing and Verifying ACLs,https://docs.mila.quebec/Userguide.html#viewing-and-verifying-acls,"Viewing and Verifying ACLs
	getfacl /path/to/folder/or/file
	# file: somedir/
	# owner: lisa
	# group: staff
	# flags: -s-
	user::rwx
	user:joe:rwx #effective:r-x
	group::rwx #effective:r-x
	group:cool:r-x
	mask::r-x
	other::r-x
	default:user::rwx
	default:user:joe:rwx #effective:r-x
	default:group::r-x
	default:mask::r-x
	default:other::---



	Note
	For more information on getfacl, consider the following documentation
	viewing command:

	man getfacl


	"
	Contributing datasets,https://docs.mila.quebec/Userguide.html#contributing-datasets,"Contributing datasets
	If a dataset could help the research of others at Mila, this form can be filled out to request its addition
	to /network/datasets.
	"
	Publicly share a Mila dataset,https://docs.mila.quebec/Userguide.html#publicly-share-a-mila-dataset,"Publicly share a Mila dataset
	Mila offers two ways to publicly share a Mila dataset:

	Academic Torrent
	Google Drive

	Note that these options are not mutually exclusive and both can be used.
	"
	Academic Torrent,https://docs.mila.quebec/Userguide.html#id10,"Academic Torrent
	Mila hosts/seeds some datasets created by the Mila community through Academic
	Torrent. The first step is to create an
	account and a torrent file.
	Then drop the dataset in /network/scratch/.transit_datasets and send the
	Academic Torrent URL to Mila’s helpdesk. If
	the dataset does not reside on the Mila cluster, only the Academic Torrent URL
	would be needed to proceed with the initial download. Then you can delete /
	stop sharing your copy.

	Note

	Avoid mentioning dataset in the name of the dataset
	Avoid capital letters and special characters (including spaces) in file and
	directory names. Spaces can be replaced by hyphens (-).
	Multiple archives can be provided to spread the data (e.g. dataset splits,
	raw data, extra data, …)
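
The naming rules above can be checked before a dataset is dropped in the transit folder. This is a rough sketch only: name_problems is a hypothetical helper, and the set of allowed characters (letters, digits, dot, underscore, hyphen) is an assumption, since the rules above do not list the special characters precisely.

```python
import re


def name_problems(name):
    '''Return a list of the naming rules that the given name breaks.'''
    problems = []
    if 'dataset' in name.lower():
        problems.append('mentions dataset')
    if name != name.lower():
        problems.append('contains capital letters')
    # Assumed allowed set: letters, digits, dot, underscore and hyphen.
    if re.search(r'[^A-Za-z0-9._-]', name):
        problems.append('contains spaces or special characters')
    return problems


print(name_problems('My Dataset'))
print(name_problems('mnist-digits'))
```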


	"
	Generate a .torrent file to be uploaded to Academic Torrent,https://docs.mila.quebec/Userguide.html#generate-a-torrent-file-to-be-uploaded-to-academic-torrent,"Generate a .torrent file to be uploaded to Academic Torrent
	The command line / Python utility torrentool can be used to create a
	DATASET_NAME.torrent file:
	# Install torrentool
	python3 -m pip install torrentool click
	# Change Directory to the location of the dataset to be hosted by Mila
	cd /network/scratch/.transit_datasets
	torrent create --tracker https://academictorrents.com/announce.php DATASET_NAME


	The resulting DATASET_NAME.torrent can then be used to register a new dataset
	on Academic Torrent.

	Warning

	The creation of a DATASET_NAME.torrent file requires the computation of
	checksums for the dataset content which can quickly become CPU-heavy. This
	process should not be executed on a login node


	"
	Download a dataset from Academic Torrent,https://docs.mila.quebec/Userguide.html#download-a-dataset-from-academic-torrent,"Download a dataset from Academic Torrent
	Academic Torrent provides a Python API to easily download a dataset
	from its registered list:
	# Install the Python API with:
	# python3 -m pip install academictorrents
	import academictorrents as at
	mnist_path = at.get(""323a0048d87ca79b68f12a6350a57776b6a3b7fb"", datastore=""~/scratch/.academictorrents-datastore"") # Download the mnist dataset



	Note
	Current needs have been evaluated to be for a download speed of about 10
	MB/s. This speed can be higher if more users also seed the dataset.

	"
	Google Drive,https://docs.mila.quebec/Userguide.html#id12,"Google Drive
	Only a member of the staff team can upload to Mila’s Google Drive,
	which requires first dropping the dataset in
	/network/scratch/.transit_datasets. Then, contact Mila’s helpdesk and provide the following information:

	the directory containing the archived dataset (zip is favored) in
	/network/scratch/.transit_datasets
	the name of the dataset
	a licence in .txt format. One of the Creative Commons licenses can be used. It is
	recommended to at least have the Attribution option. The No Derivatives
	option is discouraged unless the dataset should not be modified by others.
	the MD5 checksum of the archive
	the arXiv and GitHub URLs (those can be sent later if the article is still in
	the submission process)
	instructions stating whether the dataset needs to be unzipped, untarred or
	otherwise extracted before uploading to Google Drive
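
The MD5 checksum requested in the list above can be computed with the Python standard library. This is a minimal sketch: md5_of_file is a hypothetical helper name, and the chunk size is an arbitrary choice.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    '''Return the hexadecimal MD5 digest of the file at path.'''
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        # Read in chunks so that large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

For an archive named DATASET_NAME.zip (a placeholder), md5_of_file('/network/scratch/.transit_datasets/DATASET_NAME.zip') returns the value to send to the helpdesk. As with torrent creation, hashing a large archive is CPU-heavy and is better done on a compute node than on a login node.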


	Note

	Avoid mentioning dataset in the name of the dataset
	Avoid capital letters and special characters (including spaces) in file and
	directory names. Spaces can be replaced by hyphens (-).
	Multiple archives can be provided to spread the data (e.g. dataset splits,
	raw data, extra data, …)


	"
	Download a dataset from Mila’s Google Drive with gdown,https://docs.mila.quebec/Userguide.html#download-a-dataset-from-mila-s-google-drive-with-gdown,"Download a dataset from Mila’s Google Drive with gdown
	gdown is a simple utility to download data from Google Drive from the
	command line or in a Python script, and it requires no setup.

	Warning
	A limitation however is that it uses a shared client id, which can cause a
	quota block when too many users use it in the same day. It is described in
	a GitHub issue.

	"
	Download a dataset from Mila’s Google Drive with rclone,https://docs.mila.quebec/Userguide.html#download-a-dataset-from-mila-s-google-drive-with-rclone,"Download a dataset from Mila’s Google Drive with rclone
	Rclone is a command line program to manage files on
	cloud storage. For a Google Drive remote, it lets you specify your own
	client id instead of sharing one with other users, which avoids quota limits. Rclone
	describes the creation of a client id in its documentation. Once this is done, a
	remote for Mila’s Google Drive can be configured from the command line:
	rclone config create mila-gdrive drive client_id XXXXXXXXXXXX-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.apps.googleusercontent.com \
	client_secret XXXXXXXXXXXXX-XXXXXXXXXX \
	scope 'drive.readonly' \
	root_folder_id 1peJ6VF9wQ-LeETgcdGxu1e4fo28JbtUt \
	config_is_local false \
	config_refresh_token false


	The remote can then be used to download a dataset:
	rclone copy --progress mila-gdrive:DATASET_NAME/ ~/scratch/datasets/DATASET_NAME/


	Rclone is available from the conda channel conda-forge.
	"
	Digital Object Identifier (DOI),https://docs.mila.quebec/Userguide.html#digital-object-identifier-doi,"Digital Object Identifier (DOI)
	It is recommended to get a DOI to reference the dataset. A DOI is a permanent
	id/URL which prevents losing references of online scientific data.
	https://figshare.com can be used to create a DOI:

	Go to My Data
	Create an item by clicking Create new item
	Check Metadata record only at the top
	Fill the metadata fields

	Then reference the dataset using https://doi.org like this:
	https://doi.org/10.6084/m9.figshare.2066037
	"
	Data Transmission using Globus Connect Personal,https://docs.mila.quebec/Userguide.html#data-transmission-using-globus-connect-personal,"Data Transmission using Globus Connect Personal
	Mila doesn’t own a Globus license, but if the source or destination provides a
	Globus account, like Digital Research Alliance of Canada for example, it’s
	possible to set up Globus Connect Personal to create a personal endpoint on the
	Mila cluster by following the Globus guide to Install, Configure, and
	Uninstall Globus Connect Personal for Linux.
	This endpoint can then be used to transfer data to and from the Mila cluster.
	"
	JupyterHub,https://docs.mila.quebec/Userguide.html#jupyterhub,"JupyterHub
	JupyterHub is a platform connected to SLURM that starts a JupyterLab
	session as a batch job and then connects you to it once the allocation has
	been granted. It does not require any ssh tunnel or port redirection: the hub
	acts as a proxy server that will redirect you to a session as soon as it is
	available.
	It is currently available for Mila clusters and some Digital Research Alliance
	of Canada (Alliance) clusters.







	Cluster | Address | Login type
	Mila Local | https://jupyterhub.server.mila.quebec | Google Oauth
	Alliance | https://docs.alliancecan.ca/wiki/JupyterHub | DRAC login



	Warning
	Do not forget to close the JupyterLab session! Closing the window leaves
	the session and the SLURM job it is linked to running.
	To close it, use the hub menu and then Control Panel > Stop my server.


	Note
	For Mila Clusters:
	mila.quebec account credentials should be used to login and start a
	JupyterLab session.

	"
	Access Mila Storage in JupyterLab,https://docs.mila.quebec/Userguide.html#access-mila-storage-in-jupyterlab,"Access Mila Storage in JupyterLab
	Unfortunately, JupyterLab does not allow the navigation to parent directories of
	$HOME. This makes some file systems like /network/datasets or
	$SLURM_TMPDIR unavailable through their absolute path in the interface. It
	is however possible to create symbolic links to those resources. To do so, you
	can use the ln -s command:
	ln -s /network/datasets $HOME


	Note that $SLURM_TMPDIR is a directory that is dynamically created for each
	job so you would need to recreate the symbolic link every time you start a
	JupyterHub session:
	ln -sf $SLURM_TMPDIR $HOME
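	The same links can be created from a notebook cell or a job script. The sketch below is a rough Python counterpart of ln -sf; the helper name link_into_home is hypothetical, and the demo uses throwaway directories rather than real cluster paths.

```python
import os
import tempfile


def link_into_home(target, home):
    # Name the link after the last path component of the target,
    # e.g. /network/datasets -> <home>/datasets.
    link_path = os.path.join(home, os.path.basename(os.path.normpath(target)))
    # Drop a stale link first so the call can be repeated in every new job.
    if os.path.islink(link_path):
        os.unlink(link_path)
    os.symlink(target, link_path)
    return link_path


if __name__ == '__main__':
    # Demo in temporary directories; on the cluster you would pass
    # os.environ['SLURM_TMPDIR'] and os.path.expanduser('~') instead.
    with tempfile.TemporaryDirectory() as home, tempfile.TemporaryDirectory() as scratch:
        print(link_into_home(scratch, home))
```

	Calling the helper at the start of each session refreshes the link, which matters for $SLURM_TMPDIR since that directory changes with every job.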


	"
	Advanced SLURM usage and Multiple GPU jobs,https://docs.mila.quebec/Userguide.html#advanced-slurm-usage-and-multiple-gpu-jobs,"Advanced SLURM usage and Multiple GPU jobs
	"
	Handling preemption,https://docs.mila.quebec/Userguide.html#handling-preemption,"Handling preemption
	On the Mila cluster, jobs can preempt one another depending on their priority
	(unkillable>high>low) (see the Slurm documentation).
	The default preemption mechanism is to kill and requeue the job automatically
	without any notice. To allow a different preemption mechanism, every partition
	has been duplicated (i.e. the copy has the same characteristics as its counterpart)
	with a 120-second grace period before your job is killed, but without automatic
	requeuing: those partitions are identified by the suffix -grace
	(main-grace, long-grace, main-cpu-grace, long-cpu-grace).
	When using a partition with a grace period, a series of signals consisting of
	first SIGCONT and SIGTERM, then SIGKILL, will be sent to the SLURM
	job. It’s good practice to catch those signals using the Linux trap command
	to properly terminate a job and save what’s necessary to restart the job. On
	each cluster, you’ll be allowed a grace period before SLURM actually kills
	your job (SIGKILL).
	The easiest way to handle preemption is by trapping the SIGTERM signal:
	#!/bin/bash
	#SBATCH --ntasks=1
	#SBATCH ....

	exit_script() {
	    echo ""Preemption signal, saving myself""
	    trap - SIGTERM # clear the trap
	    # Optional: sends SIGTERM to child/sub processes
	    kill -- -$$
	}

	trap exit_script SIGTERM

	# The main script part
	python3 my_script



	Note

	Requeuing:
	The Slurm scheduler on the cluster does not allow a grace period before
	preempting a job while requeuing it automatically, therefore your job will
	be cancelled at the end of the grace period.
	To automatically requeue it, you can just add the sbatch command inside
	your exit_script function.
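
	On the Python side, the training process can react to the same signal. The following is a minimal sketch, assuming the Python process actually receives SIGTERM (for instance forwarded by the shell function above); PreemptionGuard, train and the save_checkpoint callback are hypothetical names, not part of any cluster tooling.

```python
import signal


class PreemptionGuard:
    # Turns SIGTERM into a flag that the training loop can poll.
    def __init__(self):
        self.preempted = False
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Keep the handler tiny: just record that the grace period started.
        self.preempted = True


def train(guard, n_steps, save_checkpoint):
    # save_checkpoint is a hypothetical callback taking the current step.
    for step in range(n_steps):
        if guard.preempted:
            save_checkpoint(step)
            return step
        # ... run one training step here ...
    return n_steps


if __name__ == '__main__':
    guard = PreemptionGuard()
    done = train(guard, n_steps=3, save_checkpoint=print)
    print('stopped at step', done)
```

	Polling a flag between steps keeps the checkpoint write outside the signal handler, so the state is saved at a well-defined point of the loop, within the grace period.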


	"
	Packing jobs,https://docs.mila.quebec/Userguide.html#packing-jobs,"Packing jobs
	"
	Sharing a GPU between processes,https://docs.mila.quebec/Userguide.html#sharing-a-gpu-between-processes,"Sharing a GPU between processes
	srun, when used in a batch job, is responsible for starting tasks on the
	allocated resources (see srun).
	SLURM batch script:
	#!/bin/bash
	#SBATCH --ntasks-per-node=2
	#SBATCH --output=myjob_output_wrapper.out
	#SBATCH --ntasks=2
	#SBATCH --gres=gpu:1
	#SBATCH --cpus-per-task=4
	#SBATCH --mem=18G
	srun -l --output=myjob_output_%t.out python script args


	This will run Python 2 times with the same arguments, each process with 4 CPUs.
	--output=myjob_output_%t.out will create 2 output files, appending the task
	id (%t) to the filename, plus 1 global log file for things happening outside
	the srun command.
	Knowing that, if you want to pass 2 different arguments to the Python program,
	you can use a multi-prog configuration file: srun -l --multi-prog silly.conf
	0 python script firstarg
	1 python script secondarg


	Or by specifying a range of tasks:
	0-1 python script %t


	%t is the task id, which your Python script will parse. Note the -l on the
	srun command: this will prepend each line of output with the task id (0:, 1:).
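	On the script side, parsing the task id could look like the sketch below. It assumes the id arrives as the first command-line argument (as with %t above) and falls back to the SLURM_PROCID environment variable that srun sets for each task; the helper names task_id and shard are hypothetical.

```python
import os
import sys


def task_id(argv, environ):
    # Prefer the explicit argument substituted for %t in the multi-prog file.
    if len(argv) > 1:
        return int(argv[1])
    # Otherwise fall back to the per-task variable exported by srun.
    return int(environ.get('SLURM_PROCID', 0))


def shard(items, task, n_tasks):
    # Give each task every n_tasks-th item, starting at its own id.
    return items[task::n_tasks]


if __name__ == '__main__':
    tid = task_id(sys.argv, os.environ)
    n_tasks = int(os.environ.get('SLURM_NTASKS', 1))
    print(tid, shard(list(range(10)), tid, n_tasks))
```

	With 2 tasks sharing the GPU, task 0 would process the even-indexed items and task 1 the odd-indexed ones.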
	"
	Sharing a node with multiple GPU 1process/GPU,https://docs.mila.quebec/Userguide.html#sharing-a-node-with-multiple-gpu-1process-gpu,"Sharing a node with multiple GPU 1process/GPU
	On Digital Research Alliance of Canada clusters, several nodes, especially nodes with
	large GPUs (P100), are reserved for jobs requesting the whole node; packing
	multiple processes into a single job therefore gives access to these faster GPUs.
	If you want different tasks to access different GPUs in a single allocation, you
	need to create an allocation requesting a whole node and use srun with a
	subset of those resources (1 GPU).
	Keep in mind that every resource not specified on the srun command will
	inherit the global allocation specification, so you need to split each resource
	into a subset (except --cpus-per-task, which is a per-task requirement).
	Each srun represents a job step (%s).
	Example for a GPU node with 24 cores, 4 GPUs and 128G of RAM,
	requesting 1 task per GPU:
	#!/bin/bash
	#SBATCH --nodes=1-1
	#SBATCH --ntasks-per-node=4
	#SBATCH --output=myjob_output_wrapper.out
	#SBATCH --gres=gpu:4
	#SBATCH --cpus-per-task=6
	# Each step runs a single program, so no --multi-prog configuration file is needed here
	srun --gres=gpu:1 -n1 --mem=30G -l --output=%j-step-%s.out --exclusive python script args1 &
	srun --gres=gpu:1 -n1 --mem=30G -l --output=%j-step-%s.out --exclusive python script args2 &
	srun --gres=gpu:1 -n1 --mem=30G -l --output=%j-step-%s.out --exclusive python script args3 &
	srun --gres=gpu:1 -n1 --mem=30G -l --output=%j-step-%s.out --exclusive python script args4 &
	wait


	This will create 4 output files:

	JOBID-step-0.out
	JOBID-step-1.out
	JOBID-step-2.out
	JOBID-step-3.out
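
	The per-step values in the script follow from dividing the node evenly. The sketch below only illustrates that arithmetic; split_node is a hypothetical helper, and the memory figure is an upper bound per step (the script requests 30G, which stays below it).

```python
def split_node(cores, gpus, mem_gb, tasks_per_gpu=1):
    # One job step per GPU, each step running tasks_per_gpu tasks.
    n_tasks = gpus * tasks_per_gpu
    return {
        'ntasks': n_tasks,
        # Value to pass to --cpus-per-task so that all cores are used.
        'cpus_per_task': cores // n_tasks,
        # Ceiling for the --mem value given to each srun step.
        'max_mem_per_step_gb': mem_gb // gpus,
    }


if __name__ == '__main__':
    # Node from the example: 24 cores, 4 GPUs, 128G of RAM.
    print(split_node(24, 4, 128))
```

	Raising tasks_per_gpu to 2 on the same node halves cpus_per_task to 3 while the per-step memory ceiling is unchanged.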

	"
	Sharing a node with multiple GPU & multiple processes/GPU,https://docs.mila.quebec/Userguide.html#sharing-a-node-with-multiple-gpu-multiple-processes-gpu,"Sharing a node with multiple GPU & multiple processes/GPU
	Combining both previous sections, we can create a script requesting a whole node
	with four GPUs, allocating 1 GPU per srun and sharing each GPU between multiple
	processes.
	Example, still with a node with 24 cores, 4 GPUs and 128G of RAM,
	requesting 2 tasks per GPU:
	#!/bin/bash
	#SBATCH --nodes=1-1
	#SBATCH --ntasks-per-node=8
	#SBATCH --output=myjob_output_wrapper.out
	#SBATCH --gres=gpu:4
	#SBATCH --cpus-per-task=3
	srun --gres=gpu:1 -n2 --mem=30G -l --output=%j-step-%s-task-%t.out --exclusive --multi-prog silly.conf &
	srun --gres=gpu:1 -n2 --mem=30G -l --output=%j-step-%s-task-%t.out --exclusive --multi-prog silly.conf &
	srun --gres=gpu:1 -n2 --mem=30G -l --output=%j-step-%s-task-%t.out --exclusive --multi-prog silly.conf &
	srun --gres=gpu:1 -n2 --mem=30G -l --output=%j-step-%s-task-%t.out --exclusive --multi-prog silly.conf &
	wait


	--exclusive is important: it makes each subsequent step (srun) bind to different CPUs.
	This will produce 8 output files, 2 for each step:

	JOBID-step-0-task-0.out
	JOBID-step-0-task-1.out
	JOBID-step-1-task-0.out
	JOBID-step-1-task-1.out
	JOBID-step-2-task-0.out
	JOBID-step-2-task-1.out
	JOBID-step-3-task-0.out
	JOBID-step-3-task-1.out

	Running nvidia-smi from silly.conf and parsing the output, we can see 4
	GPUs allocated and 2 tasks per GPU:
	cat JOBID-step-* | grep Tesla
	0: | 0 Tesla P100-PCIE... On | 00000000:04:00.0 Off | 0 |
	1: | 0 Tesla P100-PCIE... On | 00000000:04:00.0 Off | 0 |
	0: | 0 Tesla P100-PCIE... On | 00000000:83:00.0 Off | 0 |
	1: | 0 Tesla P100-PCIE... On | 00000000:83:00.0 Off | 0 |
	0: | 0 Tesla P100-PCIE... On | 00000000:82:00.0 Off | 0 |
	1: | 0 Tesla P100-PCIE... On | 00000000:82:00.0 Off | 0 |
	0: | 0 Tesla P100-PCIE... On | 00000000:03:00.0 Off | 0 |
	1: | 0 Tesla P100-PCIE... On | 00000000:03:00.0 Off | 0 |
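	The same check can be automated by grouping the lines by PCI bus id. The sketch below assumes the line layout shown above; the function name tasks_per_gpu and the regular expression are illustrative and only look for the bus id pattern.

```python
import re
from collections import Counter

# PCI bus id as printed by nvidia-smi, e.g. 00000000:04:00.0
BUS_ID = re.compile(r'[0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f]')


def tasks_per_gpu(lines):
    # Count how many task outputs mention each physical GPU.
    counts = Counter()
    for line in lines:
        match = BUS_ID.search(line)
        if match:
            counts[match.group(0)] += 1
    return dict(counts)


SAMPLE = '''
0: | 0 Tesla P100-PCIE... On | 00000000:04:00.0 Off | 0 |
1: | 0 Tesla P100-PCIE... On | 00000000:04:00.0 Off | 0 |
0: | 0 Tesla P100-PCIE... On | 00000000:83:00.0 Off | 0 |
1: | 0 Tesla P100-PCIE... On | 00000000:83:00.0 Off | 0 |
'''.strip().splitlines()

if __name__ == '__main__':
    print(tasks_per_gpu(SAMPLE))
```

	Four distinct bus ids with a count of 2 each would confirm the expected layout of 4 GPUs and 2 tasks per GPU.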
	"
	Multiple Nodes,https://docs.mila.quebec/Userguide.html#multiple-nodes,"Multiple Nodes
	"
	Data Parallel,https://docs.mila.quebec/Userguide.html#data-parallel,"Data Parallel

	Request 3 nodes with at least 4 GPUs each.
	#!/bin/bash

	# Number of nodes
	#SBATCH --nodes=3

	# Number of tasks: 3 (1 per node)
	#SBATCH --ntasks=3

	# Number of GPUs per node
	#SBATCH --gres=gpu:4
	#SBATCH --gpus-per-node=4

	# 4 CPUs per GPU, i.e. 16 CPUs per node
	#SBATCH --cpus-per-gpu=4

	# 16 GB per node (4 GB per GPU)
	#SBATCH --mem=16G

	# We need all nodes to be ready at the same time
	#SBATCH --wait-all-nodes=1

	# Total resources:
	# CPU: 16 * 3 = 48
	# RAM: 16 * 3 = 48 GB
	# GPU: 4 * 3 = 12

	# Set up our rendezvous point
	RDV_ADDR=$(hostname)
	WORLD_SIZE=$SLURM_JOB_NUM_NODES
	# -----

	srun -l torchrun \
	    --nproc_per_node=$SLURM_GPUS_PER_NODE \
	    --nnodes=$WORLD_SIZE \
	    --rdzv_id=$SLURM_JOB_ID \
	    --rdzv_backend=c10d \
	    --rdzv_endpoint=$RDV_ADDR \
	    training_script.py


	Below is an outline of what a multi-node PyTorch trainer could look like.
	import os

	import torch
	import torch.distributed as dist
	from torch.distributed.elastic.utils.data import ElasticDistributedSampler
	from torch.utils.data import DataLoader


	class Trainer:
	    def __init__(self):
	        self.global_rank = None
	        self.local_rank = None
	        self.chk_path = ...
	        self.model = ...

	    @property
	    def device_id(self):
	        return self.local_rank

	    def load_checkpoint(self, path):
	        self.chk_path = path
	        # ...

	    def should_checkpoint(self):
	        # Note: only one worker saves its weights
	        return self.global_rank == 0 and self.local_rank == 0

	    def save_checkpoint(self):
	        if self.chk_path is None:
	            return

	        # Save your states here
	        # Note: you should save the weights of self.model not ddp_model
	        # ...

	    def initialize(self):
	        self.global_rank = int(os.environ.get(""RANK"", -1))
	        self.local_rank = int(os.environ.get(""LOCAL_RANK"", -1))

	        assert self.global_rank >= 0, 'Global rank should be set (Only Rank 0 can save checkpoints)'
	        assert self.local_rank >= 0, 'Local rank should be set'

	        # Use ""nccl"" for GPU training, or ""gloo"" for CPU-only runs
	        dist.init_process_group(backend=""nccl"")

	    def sy"
	Data Parallel,https://docs.mila.quebec/Userguide.html#data-parallel,"nc_weights(self, resuming=False):
	        if resuming:
	            # When resuming, all workers need to load the same checkpoint
	            self.load_checkpoint(self.chk_path)

	            # Wait for everybody to finish loading the checkpoint
	            dist.barrier()
	            return

	        # Make sure all workers have the same initial weights
	        # This makes the leader save its weights
	        if self.should_checkpoint():
	            self.save_checkpoint()

	        # All workers wait for the leader to finish
	        dist.barrier()

	        # All followers load the leader's weights
	        if not self.should_checkpoint():
	            self.load_checkpoint(self.chk_path)

	        # Leader waits for the followers to load the weights
	        dist.barrier()

	    def dataloader(self, dataset, batch_size):
	        train_sampler = ElasticDistributedSampler(dataset)
	        train_loader = DataLoader(
	            dataset,
	            batch_size=batch_size,
	            num_workers=4,
	            pin_memory=True,
	            sampler=train_sampler,
	        )
	        return train_loader

	    def train_step(self, batch):
	        # Your batch processing step here
	        # ...
	        pass

	    def train(self, dataset, batch_size):
	        self.sync_weights()

	        ddp_model = torch.nn.parallel.DistributedDataParallel(
	            self.model,
	            device_ids=[self.device_id],
	            output_device=self.device_id
	        )

	        loader = self.dataloader(dataset, batch_size)

	        for epoch in range(100):
	            for batch in loader:
	                self.train_step(batch)

	            if self.should_checkpoint():
	                self.save_checkpoint()


	def main():
	    # path, dataset and batch_size are placeholders to define for your experiment
	    trainer = Trainer()
	    trainer.load_checkpoint(path)
	    trainer.initialize()

	    trainer.train(dataset, batch_size)



	Note
	To bypass the Python GIL (Global Interpreter Lock), PyTorch spawns one process for each GPU.
	In the example above, this means at least 12 processes are spawned, at least 4 on each node.
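	Each spawned process can check where it sits in that layout from the environment variables that torchrun exports to its workers (RANK, LOCAL_RANK, WORLD_SIZE, LOCAL_WORLD_SIZE). The sketch below is illustrative; the helper names are hypothetical and the demo values stand in for what one of the 12 workers might see.

```python
import os


def read_torchrun_env(environ):
    # Variables exported by torchrun to every worker process.
    keys = ('RANK', 'LOCAL_RANK', 'WORLD_SIZE', 'LOCAL_WORLD_SIZE')
    return {key.lower(): int(environ[key]) for key in keys}


def expected_world_size(nodes, gpus_per_node):
    # One process per GPU on every node.
    return nodes * gpus_per_node


if __name__ == '__main__':
    # Hypothetical values for one worker of the 3-node, 4-GPU example.
    fake_env = {'RANK': '5', 'LOCAL_RANK': '1', 'WORLD_SIZE': '12', 'LOCAL_WORLD_SIZE': '4'}
    info = read_torchrun_env(fake_env)
    assert info['world_size'] == expected_world_size(3, 4)
    print(info)
```

	Inside a real job, passing os.environ instead of the fake dictionary gives the values of the current worker.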
	"
	Frequently asked questions (FAQs),https://docs.mila.quebec/Userguide.html#frequently-asked-questions-faqs,"Frequently asked questions (FAQs)
	"
	Connection/SSH issues,https://docs.mila.quebec/Userguide.html#connection-ssh-issues,"Connection/SSH issues
	"
	I’m getting connection refused while trying to connect to a login node,https://docs.mila.quebec/Userguide.html#i-m-getting-connection-refused-while-trying-to-connect-to-a-login-node,"I’m getting connection refused while trying to connect to a login node
	Login nodes are protected against brute force attacks and might ban your IP if
	they detect too many connections/failures. You will be automatically unbanned
	after 1 hour. For any further problem, please submit a support ticket.
	"
	Shell issues,https://docs.mila.quebec/Userguide.html#shell-issues,"Shell issues
	"
	How do I change my shell ?,https://docs.mila.quebec/Userguide.html#how-do-i-change-my-shell,"How do I change my shell ?
	By default you will be assigned /bin/bash as a shell. If you would like to
	change to another one, please submit a support ticket.
	"
	SLURM issues,https://docs.mila.quebec/Userguide.html#slurm-issues,"SLURM issues
	"
	How can I get an interactive shell on the cluster ?,https://docs.mila.quebec/Userguide.html#how-can-i-get-an-interactive-shell-on-the-cluster,"How can I get an interactive shell on the cluster ?
	Use salloc [--slurm_options] without any executable at the end of the
	command. This will launch your default shell in an interactive session. Remember
	that an interactive session is bound to the login node where you start it, so you
	risk losing your job if the login node becomes unreachable.
	"
	How can I reset my cluster password ?,https://docs.mila.quebec/Userguide.html#how-can-i-reset-my-cluster-password,"How can I reset my cluster password ?
	To reset your password, please submit a support ticket.
	Warning: your cluster password is the same as your Google Workspace account password. So,
	after a reset, you must use the new password for all your Google services.
	"
	srun: error: –mem and –mem-per-cpu are mutually exclusive,https://docs.mila.quebec/Userguide.html#srun-error-mem-and-mem-per-cpu-are-mutually-exclusive,"srun: error: –mem and –mem-per-cpu are mutually exclusive
	You can safely ignore this error: salloc sets a default memory flag in case you
	don’t provide one.
	"
	How can I see where and if my jobs are running ?,https://docs.mila.quebec/Userguide.html#how-can-i-see-where-and-if-my-jobs-are-running,"How can I see where and if my jobs are running ?
	Use squeue -u YOUR_USERNAME to see the status and location of all your jobs.
	To get more info on a running job, try scontrol show job #JOBID
	"
	Unable to allocate resources: Invalid account or account/partition combination specified,https://docs.mila.quebec/Userguide.html#unable-to-allocate-resources-invalid-account-or-account-partition-combination-specified,"Unable to allocate resources: Invalid account or account/partition combination specified
	Chances are your account is not set up properly. You should submit a support ticket.
	"
	How do I cancel a job?,https://docs.mila.quebec/Userguide.html#how-do-i-cancel-a-job,"How do I cancel a job?

	To cancel a specific job, use scancel #JOBID
	To cancel all your jobs (running and pending), use scancel -u YOUR_USERNAME
	To cancel all your pending jobs only, use scancel -t PD

	"
	How can I access a node on which one of my jobs is running ?,https://docs.mila.quebec/Userguide.html#how-can-i-access-a-node-on-which-one-of-my-jobs-is-running,"How can I access a node on which one of my jobs is running ?
	You can ssh into a node on which you have a job running. Your ssh connection
	will be adopted by your job, i.e. if your job finishes, your ssh connection will
	be automatically terminated. In order to connect to a node, you need to have
	password-less ssh, either with a key present in your home or with an
	ssh-agent. You can generate a key on the login node like this:
	ssh-keygen  # press ENTER 3 times to accept the defaults
	cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
	chmod 600 ~/.ssh/authorized_keys
	chmod 700 ~/.ssh
	"
	I’m getting Permission denied (publickey) while trying to connect to a node,https://docs.mila.quebec/Userguide.html#i-m-getting-permission-denied-publickey-while-trying-to-connect-to-a-node,"I’m getting Permission denied (publickey) while trying to connect to a node
	See previous question
	"
	Where do I put my data during a job ?,https://docs.mila.quebec/Userguide.html#where-do-i-put-my-data-during-a-job,"Where do I put my data during a job ?
	Your /home as well as the datasets are on shared file systems. It is
	recommended to copy the data you need to $SLURM_TMPDIR to process it faster and
	leverage higher-speed local drives. If you run a low-priority job subject to
	preemption, it’s better to save any output you want to keep on the shared file
	systems, because $SLURM_TMPDIR is deleted at the end of each job.
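	A job can do this staging itself before training starts. The sketch below is one possible way, assuming the dataset is a directory that fits on the local drive; stage_dataset is a hypothetical helper and the demo uses temporary directories instead of real cluster paths.

```python
import os
import shutil
import tempfile


def stage_dataset(dataset_dir, tmpdir):
    # Copy the dataset under tmpdir, keeping its directory name.
    dest = os.path.join(tmpdir, os.path.basename(os.path.normpath(dataset_dir)))
    # Skip the copy if a previous step of the same job already staged it.
    if not os.path.exists(dest):
        shutil.copytree(dataset_dir, dest)
    return dest


if __name__ == '__main__':
    # On the cluster, tmpdir would be os.environ['SLURM_TMPDIR'].
    with tempfile.TemporaryDirectory() as shared, tempfile.TemporaryDirectory() as local:
        src = os.path.join(shared, 'toy_dataset')
        os.makedirs(src)
        with open(os.path.join(src, 'sample.txt'), 'w') as handle:
            handle.write('hello')
        print(stage_dataset(src, local))
```

	Outputs worth keeping, such as checkpoints, should still be written to a path on the shared file systems rather than under the staged directory.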
	"
	slurmstepd: error: Detected 1 oom-kill event(s) in step #####.batch cgroup,https://docs.mila.quebec/Userguide.html#slurmstepd-error-detected-1-oom-kill-event-s-in-step-batch-cgroup,"slurmstepd: error: Detected 1 oom-kill event(s) in step #####.batch cgroup
	You exceeded the amount of memory allocated to your job: either you did not
	request enough memory or you have a memory leak in your process. Try increasing
	the amount of memory requested with --mem= or --mem-per-cpu=.
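	To pick a sensible value, it can help to measure the peak memory of a short run first. The sketch below relies on the standard resource module and assumes Linux, where ru_maxrss is reported in kilobytes; peak_rss_mb is a hypothetical helper name.

```python
import resource


def peak_rss_mb(include_children=False):
    # Peak resident set size of this process, in megabytes (Linux units).
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if include_children:
        # Also consider the largest child process that was waited for (e.g. data loader workers).
        peak_kb = max(peak_kb, resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss)
    return peak_kb / 1024.0


if __name__ == '__main__':
    data = [0] * 1_000_000  # allocate something measurable
    print('peak memory (MB):', round(peak_rss_mb(), 1))
```

	Printing this value at the end of a run gives a lower bound for --mem; leaving some headroom above it is usually wise.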
	"
	fork: retry: Resource temporarily unavailable,https://docs.mila.quebec/Userguide.html#fork-retry-resource-temporarily-unavailable,"fork: retry: Resource temporarily unavailable
	You exceeded the limit of 2000 tasks/PIDs in your job. It probably means there
	is an issue with a sub-process spawning too many processes in your script. For
	any help with your software, please submit a support ticket.
	"
	PyTorch issues,https://docs.mila.quebec/Userguide.html#pytorch-issues,"PyTorch issues
	"
	"I randomly get INTERNAL ASSERT FAILED at ""../aten/src/ATen/MapAllocator.cpp"":263",https://docs.mila.quebec/Userguide.html#i-randomly-get-internal-assert-failed-at-aten-src-aten-mapallocator-cpp-263,"I randomly get INTERNAL ASSERT FAILED at ""../aten/src/ATen/MapAllocator.cpp"":263
	You are using PyTorch 1.10.x and hitting #67864,
	for which the solution is PR #72232,
	merged in PyTorch 1.11.x. For an immediate fix, consider the following compilable Gist:
	hack.cpp.
	Compile the patch to hack.so and then export LD_PRELOAD=/absolute/path/to/hack.so
	before executing the Python process that imports the broken PyTorch 1.10.
	For Hydra users who are using the submitit launcher plug-in, the env_set key cannot
	be used to set LD_PRELOAD in the environment as it does so too late at runtime. The
	dynamic loader reads LD_PRELOAD only once and very early during the startup of any
	process, before the variable can be set from inside the process. The hack must therefore
	be injected using the setup key in the Hydra YAML config file:
	hydra:
	  launcher:
	    setup:
	      - export LD_PRELOAD=/absolute/path/to/hack.so
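
	The timing constraint can be illustrated from Python: the variable only takes effect when it is already in the environment of a process at startup, so it has to be set for a child process rather than inside the running interpreter. The sketch below is a hedged illustration; run_with_preload is a hypothetical helper and the path is the placeholder used above.

```python
import os
import subprocess
import sys


def run_with_preload(cmd, preload_path):
    # Copy the current environment and add LD_PRELOAD for the child only.
    env = dict(os.environ)
    env['LD_PRELOAD'] = preload_path
    # The dynamic loader of the child reads the variable as the child starts.
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Child program that echoes the variable whose name is given as argument.
CHILD_CODE = 'import os, sys; sys.stdout.write(os.environ[sys.argv[1]])'

if __name__ == '__main__':
    result = run_with_preload(
        [sys.executable, '-c', CHILD_CODE, 'LD_PRELOAD'],
        '/absolute/path/to/hack.so',
    )
    print(result.stdout)
```

	Setting os.environ['LD_PRELOAD'] in the parent after it has started would not change which libraries the parent itself has loaded, which is why the launcher has to export the variable beforehand.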


	"
	Mila technical documentation,https://docs.mila.quebec/index.html#mila-technical-documentation,"Mila technical documentation
	Welcome to Mila’s technical documentation. If this is your first time here, we
	recommend you start by checking out the short quick start guide.

	Introduction

	Purpose of this documentation
	Intended audience


	Contributing



	How-tos and Guides

	User’s guide
	Quick Start
	Logging in to the cluster
	Running your code
	Portability concerns and solutions
	Singularity
	Sharing Data with ACLs
	Contributing datasets
	Data Transmission using Globus Connect Personal
	JupyterHub
	Advanced SLURM usage and Multiple GPU jobs
	Multiple Nodes
	Frequently asked questions (FAQs)


	AI tooling and methodology handbook



	Systems and services

	Computing infrastructure and policies
	Roles and authorizations
	Overview of available computing resources at Mila
	Node profile description
	Data sharing policies
	Monitoring
	Storage
	Data Transmission


	Computational resources outside of Mila
	Digital Research Alliance of Canada Clusters





	General theory

	What is a computer cluster?
	Parts of a computing cluster
	The login nodes
	The compute nodes
	The storage nodes
	Different nodes for different uses


	UNIX
	The workload manager
	Processing data
	Data parallelism
	Model parallelism
	Communication concerns
	Filesystem concerns


	Software on the cluster
	Cluster software modules
	Containers
	Python Virtual environments





	Extras

	Mila Datasets
	Audio and video resources at Mila
	Visual Studio Code
	Connecting to the cluster
	Activating an environment
	Troubleshooting


	Who, what, where is IDT
	IDT’s mission
	The IDT team





	Support
	To reach the Mila infrastructure support, please submit
	a support ticket.

	Contribution
	If you find any errors in the documentation, missing or unclear
	sections, or would simply like to contribute, please open an
	issue or make a pull request on the GitHub page.


	"
	Audio and video resources at Mila,https://docs.mila.quebec/Audio_video.html#audio-and-video-resources-at-mila,"Audio and video resources at Mila
	See the intranet section on
	audio and video
	for complete information on audio and video systems made available at Mila.
	"