| name,url,text | |
| AI tooling and methodology handbook,https://docs.mila.quebec/Handbook.html#ai-tooling-and-methodology-handbook,"AI tooling and methodology handbook | |
| This section seeks to provide researchers with insightful articles pertaining to | |
| aspects of methodology in their work. | |
| " | |
| What is a computer cluster?,https://docs.mila.quebec/Theory_cluster.html#what-is-a-computer-cluster,"What is a computer cluster? | |
| A computer cluster is a set | |
| of loosely or tightly connected computers that work together so that, in many | |
| respects, they can be viewed as a single system. | |
| " | |
| Parts of a computing cluster,https://docs.mila.quebec/Theory_cluster.html#parts-of-a-computing-cluster,"Parts of a computing cluster | |
| To provide high performance computation capabilities, clusters can | |
| combine hundreds to thousands of computers, called nodes, which are all | |
| inter-connected with a high-performance communication network. Most nodes are | |
| designed for high-performance computations, but clusters can also use | |
| specialized nodes to offer parallel file systems, databases, login nodes and | |
| even the cluster scheduling functionality as pictured in the image below. | |
| We will review the different types of nodes that you can encounter on a | |
| typical cluster. | |
| " | |
| The login nodes,https://docs.mila.quebec/Theory_cluster.html#the-login-nodes,"The login nodes | |
| To execute computing processes on a cluster, you must first connect to a | |
| cluster and this is accomplished through a login node. These so-called | |
| login nodes are the entry point to most clusters. | |
| Another entry point to some clusters such as the Mila cluster is the JupyterHub | |
| web interface, but we’ll cover that later. For now let’s return to the | |
| subject of this section: login nodes. To connect to these, you would typically | |
| use a remote shell connection. The most common tool to do so is SSH. You’ll hear | |
| and read a lot about this tool. Imagine it as a very long (and somewhat | |
| magical) extension cord which connects the computer you are using now, such as | |
| your laptop, to a remote computer’s terminal shell. You might already know what | |
| a terminal shell is if you ever used the command line. | |
| " | |
| The compute nodes,https://docs.mila.quebec/Theory_cluster.html#the-compute-nodes,"The compute nodes | |
| In the field of artificial intelligence, you will usually be on the hunt for | |
| GPUs. In most clusters, the compute nodes are the ones with GPU capacity. | |
| While nodes generally tend towards a homogeneous configuration, this is not | |
| always possible in the field of artificial intelligence, as the hardware | |
| evolves rapidly and is regularly complemented by newer hardware. Hence, you | |
| will often read about computational node classes, some of which might have | |
| different GPU models or even no GPU at all. For the Mila cluster you | |
| will find this information in the Node profile description section. For | |
| now, keep in mind that you should be aware of which nodes your code is | |
| running on. More on that later. | |
| " | |
| The storage nodes,https://docs.mila.quebec/Theory_cluster.html#the-storage-nodes,"The storage nodes | |
| Some computers on a cluster exist only to store and serve files. While the | |
| name of these computers might matter to some, as a user, you’ll only be | |
| concerned about the path to the data. More on that in the Processing data section. | |
| " | |
| Different nodes for different uses,https://docs.mila.quebec/Theory_cluster.html#different-nodes-for-different-uses,"Different nodes for different uses | |
| It is important to note here the difference in intended uses between the | |
| compute nodes and the login nodes. While the compute nodes are meant for heavy | |
| computation, the login nodes are not. | |
| The login nodes, however, are used by everyone who uses the cluster and care must | |
| be taken not to overburden these nodes. Consequently, only very short and light | |
| processes should be run on these; otherwise the cluster may become inaccessible. | |
| In other words, please refrain from executing long or compute-intensive | |
| processes on login nodes because it affects all other users. In some cases, you | |
| will also find that doing so might get you into trouble. | |
| " | |
| UNIX,https://docs.mila.quebec/Theory_cluster.html#unix,"UNIX | |
| Clusters typically run on GNU/Linux distributions. Hence a minimum | |
| knowledge of GNU/Linux and Bash is usually required to use them. See the | |
| following tutorial | |
| for a rough guide on getting started with Linux. | |
| " | |
| The workload manager,https://docs.mila.quebec/Theory_cluster.html#the-workload-manager,"The workload manager | |
| On a cluster, users don’t have direct access to the compute nodes but | |
| instead connect to a login node and add jobs to the workload manager | |
| queue. Whenever there are resources available to execute these jobs | |
| they will be allocated to a compute node and run, which can be | |
| immediately or after a wait of up to several days. | |
| A job is composed of a number of steps that will run one after the | |
| other. This is done so that you can schedule a sequence of processes | |
| that can use the results of the previous steps without having to | |
| manually interact with the scheduler. | |
| Each step can have any number of tasks which are groups of processes | |
| that can be scheduled independently on the cluster but can run in | |
| parallel if there are resources available. The distinction between | |
| steps and tasks is that multiple tasks, if they are part of the same | |
| step, cannot depend on results of other tasks because there are no | |
| guarantees on the order in which they will be executed. | |
| Finally, each process group is the basic unit that is scheduled in the | |
| cluster. It comprises a set of processes (or threads) that can run | |
| on a number of resources (CPU, GPU, RAM, …) and are scheduled | |
| together as a unit on one or more machines. | |
| Each of these concepts lends itself to a particular use. For multi-GPU | |
| training in AI workloads you would use one task per GPU for data | |
| parallelism or one process group if you are doing model | |
| parallelism. Hyperparameter optimisation can be done using a | |
| combination of tasks and steps but is probably better left to a | |
| framework outside of the scope of the workload manager. | |
| If this all seems complicated, you should know that these features | |
| do not always need to be used. It is perfectly acceptable to submit | |
| jobs with a single step, a single task and a single process. | |
| The available resources on the cluster are not infinite and it is the | |
| workload manager’s job to allocate them. Whenever a job request comes | |
| in and there are not enough resources available to start it | |
| immediately, it will go in the queue. | |
| Once a job is in the queue, it will stay there until another job | |
| finishes and then the workload manager will try to use the newly freed | |
| resources with jobs from the queue. The exact order in which the jobs | |
| will start is not fixed, because it depends on the local policies | |
| which can take into account the user priority, the time since the job | |
| was requested, the amount of resources requested and possibly other | |
| things. There should be a tool that comes with the manager where you | |
| can see the status of your queued jobs and why they remain in the | |
| queue. | |
| The workload manager will divide the cluster into partitions according | |
| to the configuration set by the admins. A partition is a set of | |
| machi" | |
| The workload manager,https://docs.mila.quebec/Theory_cluster.html#the-workload-manager,"nes typically reserved for a particular purpose. An example might | |
| be CPU-only machines for preprocessing set up as a separate partition. | |
| It is possible for multiple partitions to share resources. | |
| There will always be at least one partition that is the default | |
| partition in which jobs without a specific request will go. Other | |
| partitions can be requested, but might be restricted to a group of | |
| users, depending on policy. | |
| Partitions are useful from a policy standpoint to ensure efficient use | |
| of the cluster resources and avoid using up too much of one resource | |
| type blocking use of another. They are also useful for heterogeneous | |
| clusters where different hardware is mixed in and not all software is | |
| compatible with all of it (for example x86 and POWER CPUs). | |
| To ensure a fair share of the computing resources for all, the workload | |
| manager establishes limits on the amount of resources that a single | |
| user can use at once. These can be hard limits which prevent running | |
| jobs when you go over or soft limits which will let you run jobs, but | |
| only until some other job needs the resources. | |
| Admin policy will determine what those exact limits are for a | |
| particular cluster or user and whether they are hard or soft limits. | |
| The way soft limits are enforced is using preemption, which means that | |
| when another job with higher priority needs the resources that your | |
| job is using, your job will receive a signal that it needs to save its | |
| state and exit. It will be given a certain amount of time to do this | |
| (the grace period, which may be 0s) and then forcefully terminated if | |
| it is still running. | |
| Depending on the workload manager in use and the cluster configuration | |
| a job that is preempted like this may be automatically rescheduled to | |
| have a chance to finish or it may be up to the job to reschedule | |
| itself. | |
| The other limit you can encounter is with a job that goes over its | |
| declared limits. When you schedule a job, you declare how many | |
| resources it will need (RAM, CPUs, GPUs, …). Some of those may have | |
| default values and not be explicitly defined. For certain types of | |
| devices, like GPUs, access to units over your job limit is made | |
| unavailable. For others, like RAM, usage is monitored and your job | |
| will be terminated if it goes too far over. This makes it important | |
| to ensure you estimate resource usage accurately. | |
| Mila as well as the Digital Research Alliance of Canada use the workload | |
| manager Slurm to schedule and | |
| allocate resources on their infrastructure. | |
| Slurm client commands are available on the login nodes for you to submit | |
| jobs to the main controller and add your job to the queue. Jobs are of two types: | |
| batch jobs and interactive jobs. | |
| For practical examples of Slurm commands on the Mila cluster, see Running your code." | |
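The preemption behaviour described in the workload manager section above (a signal, a grace period, then forceful termination) can be prepared for in the job itself. The following is a minimal sketch, not the method used by any particular cluster. It assumes the warning signal is SIGTERM, which is common but depends on the workload manager and its configuration. The checkpoint layout, the `run` function and its `on_step` hook are hypothetical names introduced only for illustration.

```python
import json
import os
import signal

# Set by the signal handler; polled by the main loop.
stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only record the request; saving happens in the main loop."""
    global stop_requested
    stop_requested = True


# Assumption: preemption is announced with SIGTERM.
# signal.signal() must be called from the main thread.
signal.signal(signal.SIGTERM, request_stop)


def load_state(path):
    """Resume from a previous checkpoint if one exists, else start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_state(path, state):
    """Write to a temporary file first so a kill mid-write cannot corrupt the checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(path, num_steps, on_step=None):
    """Run until num_steps is reached; return False if stopped early by a signal."""
    global stop_requested
    stop_requested = False  # start each run with a clean flag
    state = load_state(path)
    while state["step"] < num_steps:
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        if on_step is not None:
            on_step(state["step"])  # hypothetical hook, e.g. to simulate preemption
        if stop_requested:
            save_state(path, state)
            return False
    save_state(path, state)
    return True
```

Because the grace period may be very short (possibly 0s, as noted above), a real job would usually also checkpoint periodically rather than rely only on the signal. Whether the job is then resubmitted automatically depends on the cluster configuration.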
| Processing data,https://docs.mila.quebec/Theory_cluster.html#processing-data,"Processing data | |
| Several techniques exist for processing the large amounts of data common | |
| in deep learning, whether for dataset preprocessing or training. Each | |
| has typical uses and limitations. | |
| " | |
| Data parallelism,https://docs.mila.quebec/Theory_cluster.html#data-parallelism,"Data parallelism | |
| The first technique is called data parallelism (aka task | |
| parallelism in formal computer science). You simply run lots of | |
| processes each handling a portion of the data you want to | |
| process. This is by far the easiest technique to use and should be | |
| favored whenever possible. A common example of this is | |
| hyperparameter optimisation. | |
| For really small computations the time to set up multiple processes | |
| might be longer than the processing time and lead to waste. This can | |
| be addressed by bunching up some of the processes together by doing | |
| sequential processing of sub-partitions of the data. | |
| For the cluster systems it is also inadvisable to launch thousands of | |
| jobs. Even if each job would run for a reasonable amount of time | |
| (several minutes at minimum), it would be best to make larger groups | |
| until the number of jobs is in the low hundreds at most. | |
| Finally, another thing to keep in mind is that the transfer bandwidth | |
| is limited between the filesystems (see Filesystem concerns) | |
| and the compute nodes. If you run too many jobs using too much data | |
| at once they may end up not being any faster because they will spend | |
| their time waiting for data to arrive. | |
| " | |
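The advice in the data parallelism section above, to bunch small work items together so that the number of jobs stays in the low hundreds, can be sketched as follows. This is only an illustration: the function name `group_into_jobs` and the `max_jobs` parameter are hypothetical, and the right upper bound depends on the cluster's policies.

```python
import math


def group_into_jobs(items, max_jobs):
    """Split items into at most max_jobs contiguous groups of near-equal size.

    Each group is meant to be handled sequentially by a single job.
    """
    if max_jobs < 1:
        raise ValueError("max_jobs must be at least 1")
    items = list(items)
    if not items:
        return []
    # Smallest group size that keeps the number of groups within max_jobs.
    group_size = math.ceil(len(items) / max_jobs)
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]


# Example: 5000 small work items bundled into at most 200 jobs.
groups = group_into_jobs(range(5000), max_jobs=200)
```

Each resulting group would then be passed to one job, which loops over its items one after the other instead of paying the job start-up cost for every item.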
| Model parallelism,https://docs.mila.quebec/Theory_cluster.html#model-parallelism,"Model parallelism | |
| The second technique is called model parallelism (which doesn’t | |
| have a single equivalent in formal computer science). It is used | |
| mostly when a single instance of a model will not fit in a computing | |
| resource (such as the GPU memory being too small for all the | |
| parameters). | |
| In this case, the model is split into its constituent parts, each | |
| processed independently and their intermediate results communicated | |
| with each other to arrive at a final result. | |
| This is generally harder but necessary to work with larger, more | |
| powerful models like GPT. | |
| " | |
| Communication concerns,https://docs.mila.quebec/Theory_cluster.html#communication-concerns,"Communication concerns | |
| The main difference between these two approaches is the need for | |
| communication between the multiple processes. Some common training | |
| methods, like stochastic gradient descent, sit somewhere between the | |
| two, because they require some communication, but not a lot. Most | |
| people classify them as data parallelism since they sit closer to that | |
| end. | |
| In general for data parallelism tasks or tasks that communicate | |
| infrequently it doesn’t make a lot of difference where the processes | |
| sit because the communication bandwidth and latency will not have a | |
| lot of impact on the time it takes to complete the job. The | |
| individual tasks can generally be scheduled independently. | |
| In contrast, for model parallelism you need to pay more attention | |
| to where your tasks are. In this case it is usually required to use | |
| the facilities of the workload manager to group the tasks so that they | |
| are on the same machine or machines that are closely linked to ensure | |
| optimal communication. The best allocation depends on the | |
| specific cluster architecture available and the technologies it | |
| supports (such as InfiniBand, | |
| RDMA, | |
| NVLink or others). | |
| " | |
| Filesystem concerns,https://docs.mila.quebec/Theory_cluster.html#filesystem-concerns,"Filesystem concerns | |
| When working on a cluster, you will generally encounter several | |
| different filesystems. Usually there will be names such as ‘home’, | |
| ‘scratch’, ‘datasets’, ‘projects’, ‘tmp’. | |
| The reason for having different filesystems available instead of a | |
| single giant one is to provide for different use cases. For example, | |
| the ‘datasets’ filesystem would be optimized for fast reads but have | |
| slow write performance. This is because datasets are usually written | |
| once and then read very often for training. | |
| Different filesystems have different performance levels. For instance, backed | |
| up filesystems (such as $PROJECT in Digital Research Alliance of Canada | |
| clusters) provide more space and can handle large files but cannot sustain | |
| highly parallel accesses typically required for high speed model training. | |
| The set of filesystems provided by the cluster you are using should be | |
| detailed in the documentation for that cluster and the names can | |
| differ from those above. You should pay attention to their recommended | |
| use case in the documentation and use the appropriate filesystem for | |
| the appropriate job. There are cases where a job ran hundreds of times | |
| slower because it tried to use a filesystem that wasn’t a good fit for | |
| the job. | |
| One last thing to pay attention to is the data retention policy for | |
| the filesystems. This has two subpoints: how long is the data kept | |
| for, and are there backups. | |
| Some filesystems will have a limit on how long they keep their | |
| files. Typically the limit is some number of days (like 90 days) but | |
| can also be ‘as long as the job runs’ for some. | |
| As for backups, some filesystems will not have a limit for data, but | |
| will also not have backups. For those it is important to maintain a | |
| copy of any crucial data somewhere else. The data will not be | |
| purposefully deleted, but the filesystem may fail and lose all or part | |
| of its data. If you have any data that is crucial for a paper or your | |
| thesis keep an additional copy of it somewhere else. | |
| " | |
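The recommendation above to keep an additional copy of crucial data can be followed with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the backup directory is assumed to live on a different filesystem from the source.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def backup_file(source, backup_dir):
    """Copy source into backup_dir and verify the copy by comparing checksums."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)  # copy2 also preserves timestamps where possible
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"backup of {source} does not match the original")
    return target
```

Comparing checksums after the copy guards against a silently truncated or corrupted transfer, which matters most for the files you cannot regenerate.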
| Software on the cluster,https://docs.mila.quebec/Theory_cluster.html#software-on-the-cluster,"Software on the cluster | |
| This section aims to raise awareness of problems one can encounter when trying | |
| to run software on different computers and how this is dealt with on typical | |
| computation clusters. | |
| The Mila cluster and the Digital Research Alliance of Canada clusters both | |
| provide various useful software and computing environments, which can be | |
| activated through the module system. Alternatively, you may build containers | |
| with your desired software and run them on compute nodes. | |
| Regarding Python development, we recommend using virtual environments to install | |
| Python packages in isolation. | |
| " | |
| Cluster software modules,https://docs.mila.quebec/Theory_cluster.html#cluster-software-modules,"Cluster software modules | |
| Modules are small files which modify your environment variables to point to | |
| specific versions of various software and libraries. For instance, a module | |
| might provide the python command to point to Python 3.7, another might | |
| activate CUDA version 11.0, another might provide the torch package, and so | |
| on. | |
| For more information, see The module command. | |
| " | |
| Containers,https://docs.mila.quebec/Theory_cluster.html#containers,"Containers | |
| Containers are a special form of isolation of software and its dependencies. A | |
| container is essentially a lightweight virtual machine: it encapsulates a | |
| virtual file system for a full OS installation, as well as a separate network | |
| and execution environment. | |
| For example, you can create an Ubuntu container in which you install various | |
| packages using apt, modify settings as you would as a root user, and so on, | |
| but without interfering with your main installation. Once built, a container can | |
| be run on any compatible system. | |
| For more information, see Using containers on clusters. | |
| " | |
| Python Virtual environments,https://docs.mila.quebec/Theory_cluster.html#python-virtual-environments,"Python Virtual environments | |
| A virtual environment in Python is a local, isolated environment in which you | |
| can install or uninstall Python packages without interfering with the global | |
| environment (or other virtual environments). In order to use a virtual | |
| environment, you first have to activate it. | |
| For more information, see Virtual environments. | |
| " | |
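As a rough illustration of the virtual environment concept above, the standard-library `venv` module can create an environment, and comparing `sys.prefix` with `sys.base_prefix` tells you whether the running interpreter belongs to one. The helper names below are hypothetical. The sketch passes `with_pip=False` only to stay fast and offline; in practice you would normally want pip installed in the environment.

```python
import sys
import venv
from pathlib import Path


def create_env(env_dir):
    """Create a virtual environment and return the path to its marker file."""
    # with_pip=False keeps this sketch quick; real environments usually include pip.
    venv.create(env_dir, with_pip=False)
    return Path(env_dir) / "pyvenv.cfg"


def in_virtual_env():
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix
```

Creating the environment does not activate it. Activation is typically done from the shell, for instance by sourcing the `activate` script located in the environment's `bin` directory on Linux.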
| "Who, what, where is IDT",https://docs.mila.quebec/IDT.html#who-what-where-is-idt,"Who, what, where is IDT | |
| This section seeks to help Mila researchers understand the mission and role of | |
| the IDT team. | |
| " | |
| IDT’s mission,https://docs.mila.quebec/IDT.html#idt-s-mission,"IDT’s mission | |
| " | |
| The IDT team,https://docs.mila.quebec/IDT.html#the-idt-team,"The IDT team | |
| See https://mila.quebec/en/mila/team/?cat_id=143 | |
| " | |
| Purpose of this documentation,https://docs.mila.quebec/Purpose.html#purpose-of-this-documentation,"Purpose of this documentation | |
| This documentation aims to cover the information required to run scientific | |
| and data-intensive computing tasks at Mila and the available resources for its | |
| members. | |
| It also aims to be an outlet for sharing know-how, tips and tricks and examples | |
| from the IDT team to the Mila researcher community. | |
| " | |
| Intended audience,https://docs.mila.quebec/Purpose.html#intended-audience,"Intended audience | |
| This documentation is mainly intended for Mila researchers having access to the | |
| Mila cluster. This access is determined by your researcher status. See | |
| Roles and authorizations for more information. The core of the | |
| information with this purpose can be found in the following section: | |
| Computing infrastructure and policies. | |
| However, we also aim to provide more general information which can be useful | |
| outside the scope of using the Mila cluster. For instance, more general theory | |
| on computational considerations and such. In this perspective, we hope the | |
| documentation can be of use for all of Mila members. | |
| " | |
| Contributing,https://docs.mila.quebec/Purpose.html#contributing,"Contributing | |
| See the following file for contribution guidelines : | |
| # Contributing to the Mila Docs | |
| Thank you for your interest in making better documentation for all at Mila. | |
| Here are some guidelines to help bring your contributions to life. | |
| ## What should be included in the Mila Docs | |
| * Mila cluster usage | |
| * Digital Research Alliance of Canada cluster usage | |
| * Job management tips / tricks | |
| * Research good practices | |
| * Software development good practices | |
| * Useful tools | |
| **_NOTE_**: Examples should aim to not consume much more than 1 GPU/hour and 2 CPU/hour | |
| ## Issues / Pull Requests | |
| ### Issues | |
| Issues can be used to report any error in the documentation, missing or unclear | |
| sections, broken tools or other suggestions to improve the overall | |
| documentation. | |
| ### Pull Requests | |
| PRs are welcome and we value the contents of contributions over the appearance | |
| or functionality of the pull request. If you don't know how to write the proper | |
| markup in reStructuredText, simply provide the content you would like to add in | |
| the PR text form which supports markdown or with instructions to format the | |
| content. In the PR, reference the related issues like this: | |
| ``` | |
| Resolves: #123 | |
| See also: #456, #789 | |
| ``` | |
| If you would like to contribute directly in the code of the documentation, keep | |
| the line width to 80 characters or less. You can attempt to build the docs | |
| yourself to see if the formatting is right: | |
| ```console | |
| python3 -m pip install -r docs/requirements.txt | |
| sphinx-build -b html docs/ docs/_build/ | |
| ``` | |
| This will produce the html version of the documentation which you can navigate | |
| by opening the local file `docs/_build/index.html`. | |
| If you have any trouble building the docs, don't hesitate to open an issue to | |
| request help. | |
| Regarding the restructured text format" | |
| Contributing,https://docs.mila.quebec/Purpose.html#contributing,", you can simply provide the content | |
| you would like to add in markdown or plain text format if more convenient | |
| for you and someone down the line should take responsibility to convert | |
| the format. | |
| ## Sphinx / reStructuredText (reST) | |
| The markup language used for the Mila Docs is | |
| [reStructuredText](http://docutils.sourceforge.net/rst.html) and we follow the | |
| [Python’s Style Guide for documenting](https://docs.python.org/devguide/documenting.html#style-guide). | |
| Here are some reST syntax directives which are useful to know | |
| (more can be found in | |
| [Sphinx's reST Primer](https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html)): | |
| ### Inline markup | |
| * one asterisk: `*text*` for *emphasis* (italics), | |
| * two asterisks: `**text**` for **strong emphasis** (boldface), | |
| * backquotes: ` ``text`` ` for `code samples`, and | |
| * external links: `` `Link text <http://target>`_ ``. | |
| ### Lists | |
| ```reST | |
| * this is | |
| * a list | |
| * with a nested list | |
| * and some subitems | |
| * and here the parent list continues | |
| ``` | |
| ### Sections | |
| ```reST | |
| ################# | |
| This is a heading | |
| ################# | |
| ``` | |
| There are no heading levels assigned to certain characters as the structure is | |
| determined from the succession of headings. However, the Python documentation | |
| suggests the following convention: | |
| * `#` with overline, for parts | |
| * `*` with overline, for chapters | |
| * `=`, for sections | |
| * `-`, for subsections | |
| * `^`, for subsubsections | |
| * `""`, for paragraphs | |
| ### Note box | |
| ```reST | |
| .. note:: This is a long | |
| long long note | |
| ``` | |
| ### Collapsible boxes | |
| This is a local extension, not part of Sphinx itself. It works like this: | |
| ```reST | |
| .. container:: toggle | |
| .. container:: header | |
| **Show/Hide Code** | |
| .. code-block:: <type> | |
| ... | |
| ``` | |
| " | |
| Visual Studio Code,https://docs.mila.quebec/VSCode.html#visual-studio-code,"Visual Studio Code | |
| One editor of choice for many researchers is VSCode. One feature of VSCode is | |
| remote editing through SSH. This allows you to edit files on the cluster as if | |
| they were local. You can also debug your programs using VSCode’s debugger, open | |
| terminal sessions, etc. | |
| " | |
| Connecting to the cluster,https://docs.mila.quebec/VSCode.html#connecting-to-the-cluster,"Connecting to the cluster | |
| VSCode cannot be used to edit code on the login nodes, because it is a heavy | |
| enough process (a node process, plus the language server, linter, and | |
| possibly other plugins depending on your configured environment) that there is a | |
| risk of overloading the login nodes if too many researchers did it at the same | |
| time. | |
| Therefore, to use VSCode on the cluster, you first need to allocate a compute | |
| node, then connect to that node. | |
| The milatools package provides a command to make the operation easier. More | |
| info can be found here. | |
| " | |
| Activating an environment,https://docs.mila.quebec/VSCode.html#activating-an-environment,"Activating an environment | |
| Reference | |
| To activate a conda or pip environment, you can open the command palette with | |
| Ctrl+Shift+P and type “Python: Select interpreter”. This will prompt you for the | |
| path to the Python executable for your environment. | |
| Tip | |
| If you already have the environment activated in a terminal session, you can | |
| run the command which python to get the path for this environment. This | |
| path can be pasted into the interpreter selection prompt in VSCode to use | |
| that same environment. | |
| " | |
| Troubleshooting,https://docs.mila.quebec/VSCode.html#troubleshooting,"Troubleshooting | |
| " | |
| “Cannot reconnect”,https://docs.mila.quebec/VSCode.html#cannot-reconnect,"“Cannot reconnect” | |
| When connecting to multiple compute nodes (and/or from multiple computers), some | |
| instances may crash with that message because of conflicts in the lock files | |
| VSCode installs in ~/.vscode-server (which is shared on all compute nodes). | |
| To fix this issue, you can change this setting in your settings.json file: | |
| { ""remote.SSH.lockfilesInTmp"": true } | |
| This will store the necessary lockfiles in /tmp on the compute nodes (which | |
| are local to the node). | |
| " | |
| Debugger timeouts,https://docs.mila.quebec/VSCode.html#debugger-timeouts,"Debugger timeouts | |
| Sometimes, slowness on the compute node or the networked filesystem might cause | |
| the VSCode debugger to time out when starting a remote debug process. As a quick | |
| fix, you can add this to your ~/.bashrc or ~/.profile or equivalent | |
| resource file for your preferred shell, to increase the timeout delay to 500 | |
| seconds: | |
| export DEBUGPY_PROCESS_SPAWN_TIMEOUT=500 | |
| " | |
| Computational resources outside of Mila,https://docs.mila.quebec/Extra_compute.html#computational-resources-outside-of-mila,"Computational resources outside of Mila | |
| This section seeks to provide insights and information on computational | |
| resources outside the Mila cluster itself. | |
| " | |
| Digital Research Alliance of Canada Clusters,https://docs.mila.quebec/Extra_compute.html#digital-research-alliance-of-canada-clusters,"Digital Research Alliance of Canada Clusters | |
| The clusters named Beluga, Cedar, Graham, Narval and Niagara are | |
| clusters provided by the Digital Research Alliance of Canada organisation (the Alliance). For Mila researchers, these | |
| clusters are to be used for larger experiments having many jobs, multi-node | |
| computation and/or multi-GPU jobs as well as long running jobs. If you use | |
| these resources for your research, please remember to acknowledge their use in | |
| your papers. | |
| Note | |
| Compute Canada ceased its operational responsibilities for supporting Canada’s | |
| national advanced research computing (ARC) platform on March 31, 2022. The services | |
| will be supported by the new Digital Research Alliance of Canada. | |
| https://ace-net.ca/compute-canada-operations-move-to-the-digital-research-alliance-of-canada-(the-alliance).html | |
| " | |
| Current allocation description,https://docs.mila.quebec/Extra_compute.html#current-allocation-description,"Current allocation description | |
| Clusters of the Alliance are shared with researchers across the country. | |
| Allocations are given by the Alliance to selected research groups to ensure | |
| a minimal amount of computational resources throughout the year. | |
| Depending on your affiliation, you will have access to different allocations. If | |
| you are a student at University of Montreal, you can have access to the | |
| rrg-bengioy-ad allocation described below. For students from other | |
| universities, you should ask your advisor which allocations you could | |
| have access to. | |
| From the Alliance’s documentation: An allocation is an amount of resources | |
| that a research group can target for use for a period of time, usually a year. | |
| To be clear, it is not a maximal amount of resources that can be used | |
| simultaneously, it is a weighting factor of the workload manager to balance | |
| jobs. For instance, even though we are allocated 400 GPU-years across all | |
| clusters, we can use more or less than 400 GPUs simultaneously depending on the | |
| history of usage from our group and other groups using the cluster at a given | |
| period of time. Please see the Alliance’s documentation for | |
| more information on how allocations and resource scheduling are configured for | |
| these installations. | |
| The table below provides information on the allocation for | |
| rrg-bengioy-ad for the period which spans from April 2022 to | |
| April 2023. Note that there are no special allocations for GPUs on | |
| Graham and therefore jobs with GPUs should be submitted with the | |
| account def-bengioy. | |
| Cluster; CPUs (#); CPU account; GPU model; GPUs (#); SLURM type specifier; GPU account | |
| Beluga; 238; rrg-bengioy-ad; V100-16G; 77; v100; rrg-bengioy-ad | |
| Cedar; 34; rrg-bengioy-ad; V100-32G; 138; v100l; rrg-bengioy-ad | |
| Graham; 34; rrg-bengioy-ad; various; –; –; def-bengioy | |
| Narval; 34; rrg-bengioy-ad; A100-40G; 185; a100; rrg-bengioy-ad | |
| " | |
| Account Creation,https://docs.mila.quebec/Extra_compute.html#account-creation,"Account Creation | |
| To access the Alliance clusters, you first have to create an account at | |
| https://ccdb.computecanada.ca. Use a password with at least 8 characters, | |
| mixing upper and lower case letters, digits and special characters. Later you will be | |
| asked to create another password following the same rules, and it is convenient to | |
| use the same password for both. | |
| Then, you have to apply for a role at | |
| https://ccdb.computecanada.ca/me/add_role. This tells the Alliance that you | |
| are part of the lab, so that they know which clusters you can access and can | |
| track your usage. | |
| You will be asked for the CCRI (see screenshot below). Please reach out to your | |
| sponsor to get the CCRI. | |
| You will need to wait for your sponsor to accept your request before you can | |
| log in to the Alliance clusters. | |
| " | |
| Clusters,https://docs.mila.quebec/Extra_compute.html#clusters,"Clusters | |
| Beluga:(Mila doc) | |
| (Digital Research Alliance of Canada doc) | |
| For most students, Beluga is the best choice for both CPU and GPU jobs because | |
| of larger allocations on this cluster. | |
| Narval:(Mila doc) | |
| (Digital Research Alliance of Canada doc) | |
| Narval is the newest cluster, and contains the most powerful GPUs (A100). If your | |
| job can benefit from the A100’s features, such as TF32 floating-point math, Narval | |
| is the best choice. | |
| Cedar:(Mila doc) | |
| (Digital Research Alliance of Canada doc) | |
| Cedar is a good alternative to Beluga if you absolutely need to have an internet connection | |
| on the compute nodes. | |
| Graham:(Mila doc) | |
| (Digital Research Alliance of Canada doc) | |
| We do not have a GPU allocation on Graham anymore but it remains an alternative for CPU jobs. | |
| Niagara:(Mila doc) | |
| (Digital Research Alliance of Canada doc) | |
| Niagara is not recommended for most students. It is a CPU-only cluster with unusual | |
| configurations. Access is not automatic; it is opt-in and must be requested manually | |
| via CCDB. Compute resources on Niagara are assigned to jobs on a per-node basis, | |
| not per CPU. | |
| " | |
| Beluga,https://docs.mila.quebec/Extra_compute.html#beluga,"Beluga | |
| Beluga is a cluster located at ÉTS in Montreal. It | |
| uses SLURM to schedule jobs. Its full documentation can be found here, and its current status | |
| here. | |
| You can access Beluga via ssh: | |
| ssh <user>@beluga.computecanada.ca | |
| Where <user> is the username you created previously (see Account Creation). | |
| " | |
| Launching Jobs,https://docs.mila.quebec/Extra_compute.html#launching-jobs,"Launching Jobs | |
| Users must specify the resource allocation Group Name using the flag | |
| --account=rrg-bengioy-ad. To launch a CPU-only job: | |
| sbatch --time=1:0:0 --account=rrg-bengioy-ad job.sh | |
| Note | |
| The account name will differ based on your affiliation. | |
| To launch a GPU job: | |
| sbatch --time=1:0:0 --account=rrg-bengioy-ad --gres=gpu:1 job.sh | |
| And to get an interactive session, use the salloc command: | |
| salloc --time=1:0:0 --account=rrg-bengioy-ad --gres=gpu:1 | |
| The full documentation for launching jobs on Beluga can be found here. | |
| " | |
| Beluga nodes description,https://docs.mila.quebec/Extra_compute.html#beluga-nodes-description,"Beluga nodes description | |
| Each GPU node consists of: | |
| 40 CPU cores | |
| 186 GB RAM | |
| 4 NVIDIA V100 GPUs (16 GB) | |
| Tip | |
| Request at most 10 CPU cores and 32 GB of RAM per GPU you are | |
| requesting (as explained here); | |
| otherwise, your job will count for more than one allocation and will take | |
| longer to get scheduled. | |
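| The guideline in the tip can be checked before submitting. The sketch below is a | |
| hypothetical helper (check_gpu_request is not part of SLURM). It only compares the | |
| numbers you plan to request with the ratio of 10 cores and 32 GB of RAM per GPU: | |

```shell
#!/bin/bash
# Hypothetical helper: check a planned request against the per-GPU guideline
# (at most 10 CPU cores and 32 GB of RAM per requested GPU).
# Arguments: number of CPU cores, RAM in GB, number of GPUs.
check_gpu_request() {
    local cpus=$1 mem_gb=$2 gpus=$3
    if (( gpus < 1 )); then
        echo 'need at least one GPU' >&2
        return 2
    fi
    if (( cpus > 10 * gpus || mem_gb > 32 * gpus )); then
        echo 'over the per-GPU guideline'
        return 1
    fi
    echo 'within the per-GPU guideline'
}

check_gpu_request 6 32 1            # prints: within the per-GPU guideline
check_gpu_request 12 32 1 || true   # prints: over the per-GPU guideline
```

| A request of 20 cores and 64 GB for 2 GPUs also passes, since the limits scale with the | |
| number of GPUs. | |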
| " | |
| Beluga Storage,https://docs.mila.quebec/Extra_compute.html#beluga-storage,"Beluga Storage | |
| Storage         Path                  Usage | |
| $HOME           /home/<user>/         Code; specific libraries | |
| $HOME/projects  /project/rpp-bengioy  Compressed raw datasets | |
| $SCRATCH        /scratch/<user>       Processed datasets; experimental results; logs of experiments | |
| $SLURM_TMPDIR   –                     Temporary job results | |
| They are roughly listed in order of increasing performance and optimized for | |
| different uses: | |
| The $HOME folder on NFS is appropriate for code and libraries which are | |
| small and read once. Do not write experimental results here! | |
| The $HOME/projects folder should only contain compressed raw datasets | |
| (processed datasets should go in $SCRATCH). We have a limit on the | |
| size and number of files in $HOME/projects, so do not put anything else | |
| there. If you add a new dataset there, make sure it is readable by every | |
| member of the group using chgrp -R rpp-bengioy <dataset>. | |
| The $SCRATCH space can be used for short term storage. It has good | |
| performance and large quotas, but is purged regularly (every file that has | |
| not been used in the last 3 months gets deleted, but you receive an email | |
| before this happens). | |
| $SLURM_TMPDIR points to the local disk of the node on which a job is | |
| running. It should be used to copy the data on the node at the beginning of | |
| the job and write intermediate checkpoints. This folder is cleared after each | |
| job. | |
| When an experiment is finished, results should be transferred back to Mila | |
| servers. | |
| More details on storage can be found here. | |
| " | |
| Modules,https://docs.mila.quebec/Extra_compute.html#modules,"Modules | |
| Many software packages, such as Python or MATLAB, are already compiled and available on | |
| Beluga through the module command and its subcommands. Its full | |
| documentation can be found here. | |
| module avail | |
| Displays all the available modules | |
| module load <module> | |
| Loads <module> | |
| module spider <module> | |
| Shows specific details about <module> | |
| In particular, if you wish to use Python 3.6 you can simply do: | |
| module load python/3.6 | |
| Tip | |
| If you wish to use Python on the cluster, we strongly encourage you to | |
| read Alliance Python Documentation, and in particular the Pytorch and/or Tensorflow pages. | |
| The cluster has many Python packages (or wheels) already compiled for | |
| the cluster. See here for the | |
| details. In particular, you can browse the packages by doing: | |
| avail_wheels <wheel> | |
| Such wheels can be installed using pip. Moreover, the most efficient way to use | |
| modules on the cluster is to build your environment inside your job. | |
| See the script example below. | |
| " | |
| Script Example,https://docs.mila.quebec/Extra_compute.html#script-example,"Script Example | |
| Here is an sbatch script that follows good practices on Beluga: | |
| #!/bin/bash | |
| #SBATCH --account=rrg-bengioy-ad # Yoshua pays for your job | |
| #SBATCH --cpus-per-task=6 # Ask for 6 CPUs | |
| #SBATCH --gres=gpu:1 # Ask for 1 GPU | |
| #SBATCH --mem=32G # Ask for 32 GB of RAM | |
| #SBATCH --time=3:00:00 # The job will run for at most 3 hours | |
| #SBATCH -o /scratch/<user>/slurm-%j.out # Write the log in $SCRATCH | |
| # 1. Create your environment locally | |
| module load python/3.6 | |
| virtualenv --no-download $SLURM_TMPDIR/env | |
| source $SLURM_TMPDIR/env/bin/activate | |
| pip install --no-index torch torchvision | |
| # 2. Copy your dataset on the compute node | |
| # IMPORTANT: Your dataset must be compressed in one single file (zip, hdf5, ...)!!! | |
| cp $SCRATCH/<dataset.zip> $SLURM_TMPDIR | |
| # 3. Unzip your dataset if needed | |
| unzip $SLURM_TMPDIR/<dataset.zip> -d $SLURM_TMPDIR | |
| # 4. Launch your job, tell it to save the model in $SLURM_TMPDIR | |
| # and look for the dataset into $SLURM_TMPDIR | |
| python main.py --path $SLURM_TMPDIR --data_path $SLURM_TMPDIR | |
| # 5. Copy whatever you want to save on $SCRATCH | |
| cp $SLURM_TMPDIR/<to_save> $SCRATCH | |
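| Step 2 of the script requires the dataset to be stored as one single file. As an | |
| illustration only, the sketch below packs a directory of small files into one tar | |
| archive, then copies and extracts it. Throwaway directories created with mktemp stand in | |
| for $SCRATCH and $SLURM_TMPDIR, and the dataset name my_dataset is made up: | |

```shell
#!/bin/bash
# Sketch only: throwaway directories stand in for the real cluster paths.
scratch_dir=$(mktemp -d)    # stand-in for $SCRATCH
node_dir=$(mktemp -d)       # stand-in for $SLURM_TMPDIR
trap 'rm -rf "$scratch_dir" "$node_dir"' EXIT

# Fake dataset made of several small files (hypothetical name my_dataset).
mkdir -p "$scratch_dir/my_dataset"
for i in 1 2 3; do
    echo "sample $i" > "$scratch_dir/my_dataset/file_$i.txt"
done

# Pack once, ahead of time, so that the job only has to copy a single file.
tar -czf "$scratch_dir/my_dataset.tar.gz" -C "$scratch_dir" my_dataset

# Inside the job: copy the single archive to the local disk, then extract it.
cp "$scratch_dir/my_dataset.tar.gz" "$node_dir/"
tar -xzf "$node_dir/my_dataset.tar.gz" -C "$node_dir"

# Count the extracted files (3 in this demo).
find "$node_dir/my_dataset" -type f | wc -l
```

| Copying one archive instead of many small files keeps the number of file operations on | |
| the shared file system low, which is the point of the IMPORTANT comment in the script. | |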
| " | |
| Using CometML and Wandb,https://docs.mila.quebec/Extra_compute.html#using-cometml-and-wandb,"Using CometML and Wandb | |
| The compute nodes for Beluga don’t have access to the internet, | |
| but there is a special module that can be loaded in order to allow | |
| training scripts to access some specific servers, which includes | |
| the necessary servers for using CometML and Wandb (“Weights and Biases”). | |
| module load httpproxy | |
| More documentation about this can be found here. | |
| " | |
| Graham,https://docs.mila.quebec/Extra_compute.html#graham,"Graham | |
| Graham is a cluster located at University of Waterloo. It uses SLURM to schedule | |
| jobs. Its full documentation can be found here, and its current status here. | |
| You can access Graham via ssh: | |
| ssh <user>@graham.computecanada.ca | |
| Where <user> is the username you created previously (see Account Creation). | |
| Since its structure is similar to Beluga, please look at the Beluga | |
| documentation, as well as relevant parts of the Digital Research Alliance of | |
| Canada Documentation. | |
| Note | |
| There is no special GPU allocation on Graham, so GPU jobs should be submitted with the flag --account=def-bengioy. For CPU jobs, the resource allocation Group Name is the same as on Beluga (--account=rrg-bengioy-ad). | |
| " | |
| Cedar,https://docs.mila.quebec/Extra_compute.html#cedar,"Cedar | |
| Cedar is a cluster located at Simon Fraser University. It uses SLURM to schedule | |
| jobs. Its full documentation can be found here, and its current status here. | |
| You can access Cedar via ssh: | |
| ssh <user>@cedar.computecanada.ca | |
| Where <user> is the username you created previously (see Account Creation). | |
| Since its structure is similar to Beluga, please look at the Beluga | |
| documentation, as well as relevant parts of the Digital Research Alliance of | |
| Canada Documentation. | |
| Note | |
| We don’t have any CPU priority on Cedar. For CPU jobs you can | |
| use --account=def-bengioy, but it might take some time before | |
| they start. | |
| " | |
| Niagara,https://docs.mila.quebec/Extra_compute.html#niagara,"Niagara | |
| Niagara is a cluster located at University of Toronto. It uses SLURM to schedule | |
| jobs. Its full documentation can be found here, and its current status here. | |
| You can access Niagara via ssh: | |
| ssh <user>@niagara.computecanada.ca | |
| Where <user> is the username you created previously (see Account Creation). | |
| Since its structure is similar to Beluga, please look at the Beluga | |
| documentation, as well as relevant parts of the Digital Research Alliance of | |
| Canada Documentation. | |
| " | |
| FAQ,https://docs.mila.quebec/Extra_compute.html#faq,"FAQ | |
| " | |
| What to do with ImportError: /lib64/libm.so.6: version GLIBC_2.23 not found?,https://docs.mila.quebec/Extra_compute.html#what-to-do-with-importerror-lib64-libm-so-6-version-glibc-2-23-not-found,"What to do with ImportError: /lib64/libm.so.6: version GLIBC_2.23 not found? | |
| The structure of the file system differs from that of a typical Linux system, so your | |
| code has trouble finding libraries. See how to install binary packages. | |
| " | |
| Disk quota exceeded error on /project file systems,https://docs.mila.quebec/Extra_compute.html#disk-quota-exceeded-error-on-project-file-systems,"Disk quota exceeded error on /project file systems | |
| You have files in /project with the wrong permissions. See how to change | |
| permissions. | |
| " | |
| Computing infrastructure and policies,https://docs.mila.quebec/Information.html#computing-infrastructure-and-policies,"Computing infrastructure and policies | |
| This section seeks to provide factual information and policies on the Mila cluster computing environments. | |
| " | |
| Roles and authorizations,https://docs.mila.quebec/Information.html#roles-and-authorizations,"Roles and authorizations | |
| There are two main types of researcher status at Mila: | |
| Core researchers | |
| Affiliated researchers | |
| This is determined by Mila policy. Core researchers have access to the Mila | |
| computing cluster. Check your supervisor’s Mila status to find out your own | |
| status. | |
| " | |
| Overview of available computing resources at Mila,https://docs.mila.quebec/Information.html#overview-of-available-computing-resources-at-mila,"Overview of available computing resources at Mila | |
| The Mila cluster is to be used for regular development and a relatively small | |
| number of jobs (< 5). It is a heterogeneous cluster. It uses | |
| SLURM to schedule jobs. | |
| " | |
| Mila cluster versus Digital Research Alliance of Canada clusters,https://docs.mila.quebec/Information.html#mila-cluster-versus-digital-research-alliance-of-canada-clusters,"Mila cluster versus Digital Research Alliance of Canada clusters | |
| There are a lot of commonalities between the Mila cluster and the clusters from | |
| Digital Research Alliance of Canada (the Alliance). At the time being, the | |
| Alliance clusters where we have a large allocation of resources are beluga, | |
| cedar, graham and narval. We also have comparable computational resources | |
| in the Mila cluster, with more to come. | |
| The main distinguishing factor is that we have more control over our own | |
| cluster than we have over the ones at the Alliance. Notably, also, the compute | |
| nodes in the Mila cluster all have unrestricted access to the Internet, which | |
| is not the case in general for the Alliance clusters (although cedar does | |
| allow it). | |
| At the time of this writing (June 2021), Mila students are advised to | |
| use a healthy mix of Mila and Alliance clusters. This is especially | |
| true in times when your favorite cluster is oversubscribed, because you can | |
| easily switch over to a different one if you are used to it. | |
| " | |
| Guarantees about one GPU as absolute minimum,https://docs.mila.quebec/Information.html#guarantees-about-one-gpu-as-absolute-minimum,"Guarantees about one GPU as absolute minimum | |
| There are certain guarantees that the Mila cluster tries to honor when it comes | |
| to giving at minimum one GPU per student, all the time, to be used in | |
| interactive mode. This is strictly better than “one GPU per student on average” | |
| because it’s a floor meaning that, at any time, you should be able to ask for | |
| your GPU, right now, and get it (although it might take a minute for the | |
| request to be processed by SLURM). | |
| Interactive sessions are possible on the Alliance clusters, and there are | |
| generally special rules that allow you to get resources more easily if you | |
| request them for a very short duration (for testing code before queueing long | |
| jobs). You do not get the same guarantee as on the Mila cluster, however. | |
| " | |
| Node profile description,https://docs.mila.quebec/Information.html#node-profile-description,"Node profile description | |
| Name           GPU Model  GPU Mem (GB)  GPU #  CPUs  Sockets  Cores/Socket  Threads/Core  Memory (GB)  TmpDisk (TB)  Arch    Slurm Features (GPU Arch and Memory) | |
| GPU Compute Nodes | |
| cn-a[001-011]  RTX8000    48            8      40    2        20            1             384          3.6           x86_64  turing,48gb | |
| cn-b[001-005]  V100       32            8      40    2        20            1             384          3.6           x86_64  volta,nvlink,32gb | |
| cn-c[001-040]  RTX8000    48            8      64    2        32            1             384          3             x86_64  turing,48gb | |
| cn-g[001-026]  A100       80            4      64    2        32            1             1024         7             x86_64  ampere,nvlink,80gb | |
| DGX Systems | |
| cn-d[001-002]  A100       40            8      128   2        64            1             1024         14            x86_64  ampere,nvlink,40gb | |
| cn-d[003-004]  A100       80            8      128   2        64            1             2048         28            x86_64  ampere,nvlink,80gb | |
| cn-e[002-003]  V100       32            8      40    2        20            1             512          7             x86_64  volta,32gb | |
| CPU Compute Nodes | |
| cn-f[001-004]  –          –             –      32    1        32            1             256          10            x86_64  rome | |
| cn-h[001-004]  –          –             –      64    2        32            1             768          7             x86_64  milan | |
| Legacy GPU Compute Nodes | |
| kepler5        V100       16            2      16    2        4             2             256          3.6           x86_64  volta,16gb | |
| TITAN RTX | |
| rtx[1,3-5,7]   titanrtx   24            2      20    1        10            2             128          0.93          x86_64  turing,24gb | |
| " | |
| Special nodes and outliers,https://docs.mila.quebec/Information.html#special-nodes-and-outliers,"Special nodes and outliers | |
| " | |
| DGX A100,https://docs.mila.quebec/Information.html#dgx-a100,"DGX A100 | |
| DGX A100 nodes are NVIDIA appliances with 8 NVIDIA A100 Tensor Core GPUs. Each | |
| GPU has 40 GB of memory, for a total of 320 GB per appliance. The GPUs are | |
| interconnected via 6 NVSwitches which allow 4.8 TB/s bi-directional bandwidth. | |
| In order to run jobs on a DGX A100, add the flags below to your Slurm | |
| commands: | |
| --gres=gpu:a100:<number> --reservation=DGXA100 | |
| " | |
| MIG,https://docs.mila.quebec/Information.html#mig,"MIG | |
| MIG (Multi-Instance GPU) | |
| is an NVIDIA technology allowing certain GPUs to be | |
| partitioned into multiple instances, each of which has a roughly proportional | |
| amount of compute resources, device memory and bandwidth to that memory. | |
| NVIDIA supports MIG on its A100 GPUs and allows slicing the A100 into up to 7 | |
| instances. Although this can theoretically be done dynamically, the SLURM job | |
| scheduler does not support doing so in practice as it does not model | |
| reconfigurable resources very well. Therefore, the A100s must currently be | |
| statically partitioned into the required number of instances of every size | |
| expected to be used. | |
| The cn-g series of nodes includes A100-80GB GPUs. One third have been | |
| configured to offer regular (non-MIG mode) a100l GPUs. The other two-thirds | |
| have been configured in MIG mode, and offer the following profiles: | |
| Name                    GPU Model  GPU Memory    GPU Compute    Cluster-wide # | |
| a100l.1g.10gb, a100l.1  A100       10GB (1/8th)  1/7th of full  72 | |
| a100l.2g.20gb, a100l.2  A100       20GB (2/8th)  2/7th of full  108 | |
| a100l.3g.40gb, a100l.3  A100       40GB (4/8th)  3/7th of full  72 | |
| These profiles can be requested using a SLURM flag such as --gres=gpu:a100l.1 | |
| The partitioning may be revised as needs and SLURM capabilities evolve. Other | |
| MIG profiles exist and could be introduced. | |
| Warning | |
| MIG has a number of important limitations, | |
| most notably that a GPU in MIG mode does not support graphics APIs | |
| (OpenGL/Vulkan), nor P2P over NVLink and PCIe. We have therefore chosen to | |
| limit every MIG job to exactly one MIG slice and no more. Thus, | |
| --gres=gpu:a100l.3 will work (and request a size-3 slice of an | |
| a100l GPU) but --gres=gpu:a100l.1:3 (with :3 requesting | |
| three size-1 slices) will not. | |
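| The one-slice rule in the warning can be checked before a job is submitted. The sketch | |
| below is a hypothetical pre-submission check (check_mig_gres is not a SLURM command). It | |
| assumes that an explicit count of 1 is acceptable and that both the long and the short | |
| profile names from the table may appear in the --gres value: | |

```shell
#!/bin/bash
# Hypothetical check for the one-MIG-slice-per-job rule.
# Argument: the value given to --gres, for example gpu:a100l.3
check_mig_gres() {
    local gres=$1
    # MIG profile (short or long name), optionally followed by :count
    local re='^gpu:a100l\.[0-9]+(g\.[0-9]+gb)?(:([0-9]+))?$'
    if [[ $gres =~ $re ]]; then
        # Assumption: a missing count means one slice.
        local count=${BASH_REMATCH[3]:-1}
        if (( 10#$count != 1 )); then
            echo 'rejected: only one MIG slice per job'
            return 1
        fi
    fi
    echo 'ok'
}

check_mig_gres gpu:a100l.3             # prints: ok
check_mig_gres gpu:a100l.1:3 || true   # prints: rejected: only one MIG slice per job
```

| Values that do not name a MIG profile, such as a request for full a100l GPUs, pass | |
| through unchanged because the rule only concerns MIG slices. | |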
| " | |
| AMD,https://docs.mila.quebec/Information.html#amd,"AMD | |
| Warning | |
| As of August 20 2019 the GPUs had to be returned to AMD. Mila will get | |
| more samples. You can join the amd slack channels to get the latest | |
| information. | |
| Mila has a few nodes equipped with MI50 GPUs. | |
| srun --gres=gpu -c 8 --reservation=AMD --pty bash | |
| First-time setup of the AMD stack: | |
| conda create -n rocm python=3.6 | |
| conda activate rocm | |
| pip install tensorflow-rocm | |
| pip install /wheels/pytorch/torch-1.1.0a0+d8b9d32-cp36-cp36m-linux_x86_64.whl | |
| " | |
| Data sharing policies,https://docs.mila.quebec/Information.html#data-sharing-policies,"Data sharing policies | |
| Note | |
| /network/scratch aims to support | |
| Access Control Lists (ACLs) | |
| to allow collaborative work on rapidly changing data, e.g. work-in-progress | |
| datasets, model checkpoints, etc. | |
| /network/projects aims to offer a collaborative | |
| space for long-term projects. Data that should be kept for longer than | |
| 90 days can be stored in that location, but a request to Mila’s helpdesk has to be made first to create the project | |
| directory. | |
| " | |
| Monitoring,https://docs.mila.quebec/Information.html#monitoring,"Monitoring | |
| Every compute node on the Mila cluster has a Netdata | |
| monitoring daemon allowing you to get a sense of the state of the node. | |
| This information is exposed in two ways: | |
| For every node, there is a web interface from Netdata itself at <node>.server.mila.quebec:19999. | |
| This is accessible only when using the Mila wifi or through SSH tunnelling. | |
| SSH tunnelling: on your local machine, run | |
| ssh -L 19999:<node>.server.mila.quebec:19999 -p 2222 login.server.mila.quebec | |
| or ssh -L 19999:<node>.server.mila.quebec:19999 mila if you have | |
| already set up your SSH Login, | |
| then open http://localhost:19999 in your browser. | |
| The Mila dashboard at dashboard.server.mila.quebec | |
| exposes aggregated statistics with the use of grafana. | |
| These are collected internally to an instance of prometheus. | |
| In both cases, those graphs are not editable by individual users, | |
| but they provide valuable insight into the state of the whole cluster | |
| or the individual nodes. | |
| One of the important uses is to collect data about the health | |
| of the Mila cluster and to sound the alarm if outages occur | |
| (e.g. if the nodes crash or if GPUs mysteriously become unavailable for SLURM). | |
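| The tunnelling command above only changes in the node name. The sketch below is a | |
| hypothetical helper (netdata_tunnel_cmd is a made-up name) that prints the command for | |
| a given node without running it; the optional second argument for the local port is an | |
| addition of this sketch: | |

```shell
#!/bin/bash
# Hypothetical helper: print the SSH tunnel command for a node's Netdata page.
# Dry run only: nothing is executed, the command is just printed.
# Arguments: node name, optional local port (defaults to 19999).
netdata_tunnel_cmd() {
    local node=$1 local_port=${2:-19999}
    printf 'ssh -L %s:%s.server.mila.quebec:19999 -p 2222 login.server.mila.quebec\n' \
        "$local_port" "$node"
}

netdata_tunnel_cmd cn-c001
# prints: ssh -L 19999:cn-c001.server.mila.quebec:19999 -p 2222 login.server.mila.quebec
```

| If a different local port is used, open http://localhost:<port> in your browser instead. | |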
| " | |
| Example with Netdata on cn-c001,https://docs.mila.quebec/Information.html#example-with-netdata-on-cn-c001,"Example with Netdata on cn-c001 | |
| For example, if we have a job running on cn-c001, we can type | |
| cn-c001.server.mila.quebec:19999 in a browser address bar and the following | |
| page will appear. | |
| " | |
| Example watching the CPU/RAM/GPU usage,https://docs.mila.quebec/Information.html#example-watching-the-cpu-ram-gpu-usage,"Example watching the CPU/RAM/GPU usage | |
| Given that compute nodes are generally shared | |
| with other users who are also running jobs at the same time and | |
| consuming resources, this is not generally a good way to profile your code | |
| in fine detail. | |
| However, it can still be a very useful source of information | |
| for getting an idea of whether the machine that you requested is being | |
| used at its full capacity. | |
| Given how expensive the GPUs are, it generally makes sense to try to | |
| make sure that this resource is always kept busy. | |
| CPU | |
| iowait (pink line): High values mean your model is waiting on IO a lot (disk or network). | |
| CPU RAM | |
| You can see how much CPU RAM is being used by your script in practice, | |
| considering the amount that you requested (e.g. `sbatch --mem=8G ...`). | |
| GPU usage is generally more important to monitor than CPU RAM. | |
| You should not cut it so close to the limit that your experiments randomly fail | |
| because they run out of RAM. However, you should not blindly request 32GB of RAM | |
| when you actually require only 8GB. | |
| GPU | |
| Monitors the GPU usage using an nvidia-smi plugin for Netdata. | |
| Under the plugin interface, select the GPU number which was allocated to | |
| you. You can figure this out by running echo $SLURM_JOB_GPUS on the | |
| allocated node or, if you have the job ID, | |
| scontrol show -d job YOUR_JOB_ID | grep 'GRES' and checking IDX | |
| You should make sure you use the GPUs to their fullest capacity. | |
| Select the biggest batch size if possible to increase GPU memory usage and | |
| the GPU computational load. | |
| Spawn multiple experiments if you can fit many on a single GPU. | |
| Running 10 independent MNIST experiments on a single GPU will probably take | |
| less than 10x the time to run a single one. This assumes that you have more | |
| experiments to run, because nothing is gained by gratuitously running experiments. | |
| You can request a less powerful GPU and leave the more powerful GPUs | |
| to other researchers who have experiments that can make best use of them. | |
| Sometimes you really just need a k80 and not a v100. | |
| Other users or jobs | |
| If the node seems unresponsive or slow, | |
| it may be useful to check what other tasks are | |
| running at the same time on that node. | |
| This should not be an issue in general, | |
| but in practice it is useful to be able to | |
| inspect this to diagnose certain problems. | |
| " | |
| Example with Mila dashboard,https://docs.mila.quebec/Information.html#example-with-mila-dashboard,"Example with Mila dashboard | |
| " | |
| Storage,https://docs.mila.quebec/Information.html#storage,"Storage | |
| Path                                          Performance  Quota (Space/Files)  Backup  Auto-cleanup  Usage | |
| /network/datasets/                            High         –                    –       –             Curated raw datasets (read only) | |
| $HOME or /home/mila/<u>/<username>/           Low          100GB/1000K          Daily   no            Personal user space; specific libraries, code, binaries | |
| $SCRATCH or /network/scratch/<u>/<username>/  High         no                   no      90 days       Temporary job results; processed datasets; optimized for small files | |
| $SLURM_TMPDIR                                 Highest      4TB/-                no      at job end    High speed disk for temporary job results | |
| /network/projects/<groupname>/                Fair         200GB/1000K          Daily   no            Shared space to facilitate collaboration between researchers; long-term project storage | |
| $ARCHIVE or /network/archive/<u>/<username>/  Low          500GB                no      no            Long-term personal storage | |
| Note | |
| The $HOME file system is backed up once a day. For any file | |
| restoration request, file a request to Mila’s IT support with the path to the file or directory to | |
| restore, with the required date. | |
| Warning | |
| Currently there is no backup system for any other file systems of | |
| the Mila cluster. Storage local to personal computers, Google Drive and other | |
| related solutions should be used to back up important data. | |
| " | |
| $HOME,https://docs.mila.quebec/Information.html#home,"$HOME | |
| $HOME is appropriate for code and libraries which are small and read once, | |
| as well as the experimental results that would be needed at a later time (e.g. | |
| the weights of a network referenced in a paper). | |
| Quotas are enabled on $HOME for both disk capacity (blocks) and number of | |
| files (inodes). The limits for blocks and inodes are respectively 100GiB and 1 | |
| million per user. The command to check the quota usage from a login node is: | |
| beegfs-ctl --cfgFile=/etc/beegfs/home.d/beegfs-client.conf --getquota --uid $USER | |
| " | |
| $SCRATCH,https://docs.mila.quebec/Information.html#scratch,"$SCRATCH | |
| $SCRATCH can be used to store processed datasets, work in progress datasets | |
| or temporary job results. Its block size is optimized for small files which | |
| minimizes the performance hit of working on extracted datasets. | |
| Note | |
| Auto-cleanup: this file system is cleaned on a weekly basis; | |
| files not used for more than 90 days will be deleted. | |
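| To get an idea of which files are at risk, you can list the ones that have not been | |
| accessed recently. The sketch below is an approximation: it assumes that not used means | |
| not accessed (atime), which may differ from the criterion applied by the cleanup on the | |
| cluster. list_stale_files is a made-up name and the demo uses a throwaway directory: | |

```shell
#!/bin/bash
# Sketch: list files whose last access is more than a given number of days old.
# Assumption: access time (atime) approximates the cleanup criterion.
# Arguments: directory, optional number of days (defaults to 90).
list_stale_files() {
    local dir=$1 days=${2:-90}
    find "$dir" -type f -atime +"$days" -print
}

# Demo on a throwaway directory (the date syntax is specific to GNU touch).
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
touch -d '100 days ago' "$demo_dir/old_checkpoint.pt"
touch "$demo_dir/fresh_checkpoint.pt"

list_stale_files "$demo_dir"   # lists only old_checkpoint.pt
```

| On the cluster, the same function could be pointed at $SCRATCH instead of the demo | |
| directory. | |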
| " | |
| $SLURM_TMPDIR,https://docs.mila.quebec/Information.html#slurm-tmpdir,"$SLURM_TMPDIR | |
| $SLURM_TMPDIR points to the local disk of the node on which a job is | |
| running. It should be used to copy the data on the node at the beginning of the | |
| job and write intermediate checkpoints. This folder is cleared after each job. | |
| " | |
| projects,https://docs.mila.quebec/Information.html#projects,"projects | |
| projects can be used for collaborative projects. It aims to ease the | |
| sharing of data between users working on a long-term project. | |
| Quotas are enabled on projects for both disk capacity (blocks) and number | |
| of files (inodes). The limits for blocks and inodes are respectively 200GiB and | |
| 1 million per user and per group. | |
| Note | |
| It is possible to request higher quota limits if the project requires | |
| it. File a request to Mila’s IT support. | |
| " | |
| $ARCHIVE,https://docs.mila.quebec/Information.html#archive,"$ARCHIVE | |
| The purpose of $ARCHIVE is to store data other than datasets that has to be kept | |
| long-term (e.g. generated samples, logs, data relevant for paper submission). | |
| $ARCHIVE is only available on the login nodes. Because this file system | |
| is tuned for large files, it is recommended to archive your directories. For | |
| example, to archive the results of an experiment in | |
| $SCRATCH/my_experiment_results/, run the commands below from a login node: | |
| cd $SCRATCH | |
| tar cJf $ARCHIVE/my_experiment_results.tar.xz --xattrs my_experiment_results | |
| Disk capacity quotas are enabled on $ARCHIVE. The soft limit per user is | |
| 500GB, the hard limit is 550GB. The grace time is 7 days. This means that one | |
| can use more than 500GB for 7 days before the file system enforces quota. | |
| However, it is not possible to use more than 550GB. | |
| The command to check the quota usage from a login node is df: | |
| df -h $ARCHIVE | |
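| The soft and hard limits described above can be summarised as a three-way | |
| classification. The sketch below is a hypothetical helper (archive_quota_status is a | |
| made-up name) that takes a usage figure in whole GB, for example one read from the df | |
| output, and reports where it falls: | |

```shell
#!/bin/bash
# Hypothetical helper: classify a usage figure (whole GB) against the
# $ARCHIVE limits quoted in the text: soft limit 500GB, hard limit 550GB.
archive_quota_status() {
    local used_gb=$1 soft_gb=500 hard_gb=550
    if (( used_gb >= hard_gb )); then
        echo 'hard limit reached'
    elif (( used_gb > soft_gb )); then
        echo 'over soft limit, 7-day grace period'
    else
        echo 'within quota'
    fi
}

archive_quota_status 480   # prints: within quota
archive_quota_status 520   # prints: over soft limit, 7-day grace period
archive_quota_status 550   # prints: hard limit reached
```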
| Note | |
| There is NO backup of this file system. | |
| " | |
| datasets,https://docs.mila.quebec/Information.html#datasets,"datasets | |
| datasets contains curated datasets for the benefit of the Mila community. | |
| To request the addition of a dataset or a preprocessed dataset you think could | |
| benefit the research of others, you can fill out this form. Datasets can also be browsed from the | |
| web: Mila Datasets | |
| Datasets in datasets/restricted are restricted and require an explicit | |
| request to gain access. Please submit a support ticket mentioning the dataset’s | |
| access group (e.g. scannet_users), your cluster username and the | |
| approval of the group owner. You can find the dataset’s access group by | |
| listing the content of /network/datasets/restricted with the ls command. | |
| Those datasets are mirrored to the Alliance clusters in | |
| ~/projects/rrg-bengioy-ad/data/curated/ if they follow Digital Research | |
| Alliance of Canada’s good practices on data. | |
| To list the local datasets on an Alliance cluster, you can execute the | |
| following command: | |
| ssh [CLUSTER_LOGIN] -C ""projects/rrg-bengioy-ad/data/curated/list_datasets_cc.sh"" | |
| " | |
| Data Transmission,https://docs.mila.quebec/Information.html#data-transmission,"Data Transmission | |
| Multiple methods can be used to transfer data to/from the cluster: | |
| rsync --bwlimit=10mb; this is the favored method since the bandwidth can | |
| be limited to prevent impacting the usage of the cluster: rsync | |
| Digital Research Alliance of Canada: Globus | |
| " | |
| Getting started,https://docs.mila.quebec/Getting_started.html#getting-started,"Getting started | |
| See User’s guide. | |
| " | |
| User’s guide,https://docs.mila.quebec/Userguide.html#user-s-guide,"User’s guide | |
| …or IDT’s list of opinionated howtos | |
| This section seeks to provide users of the Mila infrastructure with practical | |
| knowledge, tips and tricks and example commands. | |
| " | |
| Quick Start,https://docs.mila.quebec/Userguide.html#quick-start,"Quick Start | |
| Users first need login access to the cluster. It is | |
| recommended to install milatools which will help in the set up of the | |
| ssh configuration needed to securely and easily connect to the | |
| cluster. | |
| " | |
| mila code,https://docs.mila.quebec/Userguide.html#mila-code,"mila code | |
| milatools also makes it easy to run and debug code on the Mila cluster. Using | |
| the mila code command will allow you to use VSCode on the server. Simply run: | |
| mila code path/on/cluster | |
| The details of the command can be found on the GitHub page of the package. Note that you need to | |
| first set up your ssh configuration using mila init before the mila code | |
| command can be used. The initialisation of the ssh configuration is explained | |
| in the mila init section and on the GitHub page of the package. | |
| " | |
| Logging in to the cluster,https://docs.mila.quebec/Userguide.html#logging-in-to-the-cluster,"Logging in to the cluster | |
| To access the Mila cluster, you will need a Mila account. Please contact | |
| Mila systems administrators if you don’t have one already. Our IT support service | |
| is available here: https://it-support.mila.quebec/ | |
| You will also need to complete and return an IT Onboarding Training to get | |
| access to the cluster. Please refer to the Mila Intranet for more | |
| information: | |
| https://sites.google.com/mila.quebec/mila-intranet/it-infrastructure/it-onboarding-training | |
| IMPORTANT: Your access to the cluster is granted based on your status at | |
| Mila (for students, your status is the same as your main supervisor’s status), | |
| and on the duration of your stay, set during the creation of your account. The | |
| following have access to the cluster: current students of core professors, | |
| core professors and staff. | |
| " | |
| SSH Login,https://docs.mila.quebec/Userguide.html#ssh-login,"SSH Login | |
| You can access the Mila cluster via ssh: | |
| # Generic login, will send you to one of the 4 login nodes to spread the load | |
| ssh <user>@login.server.mila.quebec -p 2222 | |
| # To connect to a specific login node, X in [1, 2, 3, 4] | |
| ssh <user>@login-X.login.server.mila.quebec -p 2222 | |
| Four login nodes are available and accessible behind a load balancer. At each | |
| connection, you will be redirected to the least loaded login node. | |
| The ECDSA, RSA and ED25519 fingerprints for Mila’s login nodes are: | |
| SHA256:baEGIa311fhnxBWsIZJ/zYhq2WfCttwyHRKzAb8zlp8 (ECDSA) | |
| SHA256:Xr0/JqV/+5DNguPfiN5hb8rSG+nBAcfVCJoSyrR0W0o (RSA) | |
| SHA256:gfXZzaPiaYHcrPqzHvBi6v+BWRS/lXOS/zAjOKeoBJg (ED25519) | |
| Important | |
| Login nodes are merely entry points to the cluster. They give you access | |
| to the compute nodes and to the filesystem, but they are not meant to run | |
| anything heavy. Do not run compute-heavy programs on these nodes, | |
| because in doing so you could bring them down, impeding cluster access for | |
| everyone. | |
| This means no training or experiments, no compiling programs, no Python | |
| scripts, but also no zip of a large folder or anything that demands a | |
| sustained amount of computation. | |
| Rule of thumb: never run a program that takes more than a few seconds on | |
| a login node. | |
| Note | |
| In a similar vein, you should not run VSCode remote SSH instances directly | |
| on login nodes, because even though they are typically not very | |
| computationally expensive, when many people do it, they add up! See | |
| Visual Studio Code for specific instructions. | |
| " | |
| mila init,https://docs.mila.quebec/Userguide.html#mila-init,"mila init | |
| To make it easier to set up a productive environment, Mila publishes the | |
| milatools package, which defines a mila init command that | |
| automatically performs some of the steps below for you. You can install it with | |
| pip and use it, provided your Python version is at least 3.8: | |
| $ pip install milatools | |
| $ mila init | |
| " | |
| SSH Config,https://docs.mila.quebec/Userguide.html#ssh-config,"SSH Config | |
| The login nodes support the following authentication mechanisms: | |
| publickey,keyboard-interactive. If you would like to set an entry in your | |
| .ssh/config file, please use the following recommendation: | |
| Host mila | |
| User YOUR-USERNAME | |
| Hostname login.server.mila.quebec | |
| PreferredAuthentications publickey,keyboard-interactive | |
| Port 2222 | |
| ServerAliveInterval 120 | |
| ServerAliveCountMax 5 | |
| Then you can simply write ssh mila to connect to a login node. You will also | |
| be able to use mila with scp, rsync and other such programs. | |
| Tip | |
| You can run commands on the login node with ssh directly, for example | |
| ssh mila squeue -u '$USER' (remember to put single quotes around any | |
| $VARIABLE you want to evaluate on the remote side, otherwise it will be | |
| evaluated locally before ssh is even executed). | |
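| The quoting rule in the tip above can be tried without a cluster. The sketch | |
| below uses sh -c as a stand-in for the remote shell (an assumption made so the | |
| example runs locally) and DEMO_VAR as a hypothetical variable name. With ssh, | |
| the local shell treats the quotes in the same way. | |

```shell
# 'sh -c' stands in for the remote shell so this runs without a cluster.
# The same quoting rules apply to: ssh mila squeue -u '$USER'
DEMO_VAR=local_value   # hypothetical variable, value as seen by the local shell
export DEMO_VAR

# Double quotes: the local shell expands $DEMO_VAR before the command is sent.
expanded_locally=$(DEMO_VAR=remote_value sh -c "echo $DEMO_VAR")

# Single quotes: the literal text $DEMO_VAR is sent and expanded by the other shell.
expanded_remotely=$(DEMO_VAR=remote_value sh -c 'echo $DEMO_VAR')

echo "double quotes -> $expanded_locally"
echo "single quotes -> $expanded_remotely"
```

| The first line reports the local value and the second the value seen by the | |
| receiving shell, which is why '$USER' is single-quoted in the tip. | |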
| " | |
| Passwordless login,https://docs.mila.quebec/Userguide.html#passwordless-login,"Passwordless login | |
| To save you some repetitive typing it is highly recommended to set up public | |
| key authentication, which means you won’t have to enter your password every time | |
| you connect to the cluster. | |
| # ON YOUR LOCAL MACHINE | |
| # You might already have done this in the past, but if you haven't: | |
| ssh-keygen # Press ENTER 3x | |
| # Copy your public key over to the cluster | |
| # You will need to enter your password | |
| ssh-copy-id mila | |
| " | |
| Connecting to compute nodes,https://docs.mila.quebec/Userguide.html#connecting-to-compute-nodes,"Connecting to compute nodes | |
| If (and only if) you have a job running on compute node “cnode”, you are | |
| allowed to SSH to it directly, if for some reason you need a second terminal. | |
| That session will be automatically ended when your job is relinquished. | |
| First, however, you need to have | |
| password-less ssh either with a key present in your home or with an | |
| ssh-agent. To generate a key pair on the login node: | |
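| # ON A LOGIN NODE | |
| ssh-keygen # Press ENTER 3x | |
| # The file name depends on the key type; recent OpenSSH versions create id_ed25519.pub by default | |
| cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys | |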
| chmod 600 ~/.ssh/authorized_keys | |
| chmod 700 ~/.ssh | |
| Then from the login node you can write ssh <node>. From your local | |
| machine, you can use ssh -J mila USERNAME@<node> (-J represents a “jump” | |
| through the login node, necessary because the compute nodes are behind a | |
| firewall). | |
| If you wish, you may also add the following wildcard rule in your .ssh/config: | |
| Host *.server.mila.quebec !*login.server.mila.quebec | |
| HostName %h | |
| User YOUR-USERNAME | |
| ProxyJump mila | |
| This will let you connect to a compute node with ssh <node>.server.mila.quebec. | |
| " | |
| Running your code,https://docs.mila.quebec/Userguide.html#running-your-code,"Running your code | |
| " | |
| SLURM commands guide,https://docs.mila.quebec/Userguide.html#slurm-commands-guide,"SLURM commands guide | |
| " | |
| Basic Usage,https://docs.mila.quebec/Userguide.html#basic-usage,"Basic Usage | |
| The SLURM documentation | |
| provides extensive information on the available commands to query the cluster | |
| status or submit jobs. | |
| Below are some basic examples of how to use SLURM. | |
| " | |
| Submitting jobs,https://docs.mila.quebec/Userguide.html#submitting-jobs,"Submitting jobs | |
| " | |
| Batch job,https://docs.mila.quebec/Userguide.html#batch-job,"Batch job | |
| In order to submit a batch job, you have to create a script containing the main | |
| command(s) you would like to execute on the allocated resources/nodes. | |
| #!/bin/bash | |
| #SBATCH --job-name=test | |
| #SBATCH --output=job_output.txt | |
| #SBATCH --error=job_error.txt | |
| #SBATCH --ntasks=1 | |
| #SBATCH --time=10:00 | |
| #SBATCH --mem=100Gb | |
| | |
| module load python/3.5 | |
| python my_script.py | |
| Your job script is then submitted to SLURM with sbatch (ref.) | |
| sbatch job_script | |
| sbatch: Submitted batch job 4323674 | |
| The working directory of the job will be the one where you executed sbatch. | |
| Tip | |
| Slurm directives can be specified on the command line alongside sbatch or | |
| inside the job script with a line starting with #SBATCH. | |
| " | |
| Interactive job,https://docs.mila.quebec/Userguide.html#interactive-job,"Interactive job | |
| Workload managers usually run batch jobs so that you do not have to watch their | |
| progress: the scheduler runs them as soon as resources are available. If | |
| you want to get access to a shell while leveraging cluster resources, you can | |
| submit an interactive job where the main executable is a shell, using the | |
| srun or salloc commands: | |
| salloc | |
| This will start an interactive job on the first node available with the default | |
| resources set in SLURM (1 task/1 CPU). srun accepts the same arguments as | |
| sbatch with the exception that the environment is not passed. | |
| Tip | |
| To pass your current environment to an interactive job, add | |
| --preserve-env to srun. | |
| salloc can also be used and is mostly a wrapper around srun when given | |
| no further arguments, but it gives more flexibility if, for example, you want | |
| an allocation on multiple nodes. | |
| " | |
| Job submission arguments,https://docs.mila.quebec/Userguide.html#job-submission-arguments,"Job submission arguments | |
| In order to accurately select the resources for your job, several arguments are | |
| available. The most important ones are: | |
| Argument                      Description | |
| -n, --ntasks=<number>         The number of tasks in your script, usually 1 | |
| -c, --cpus-per-task=<ncpus>   The number of cores for each task | |
| -t, --time=<time>             Time requested for your job | |
| --mem=<size[units]>           Memory requested for all your tasks | |
| --gres=<list>                 Select generic resources such as GPUs for your job: --gres=gpu:GPU_MODEL | |
| Tip | |
| Always consider requesting the adequate amount of resources to improve the | |
| scheduling of your job (small jobs always run first). | |
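| The --time value accepts several layouts. According to the sbatch manual these | |
| are minutes, minutes:seconds, hours:minutes:seconds, days-hours, | |
| days-hours:minutes and days-hours:minutes:seconds. The helper below is a | |
| minimal sketch for checking what a given string means; slurm_time_to_seconds | |
| and strip_zeros are hypothetical names and are not SLURM commands. | |

```shell
# Strip leading zeros so that values such as 08 are not read as octal.
strip_zeros() {
    n=$1
    while [ "${#n}" -gt 1 ] && [ "${n#0}" != "$n" ]; do
        n=${n#0}
    done
    echo "$n"
}

# Convert a SLURM time specification to a number of seconds.
slurm_time_to_seconds() {
    spec=$1
    days=0; hours=0; minutes=0; seconds=0
    case $spec in
        *-*)
            # days-hours[:minutes[:seconds]]
            days=${spec%%-*}
            old_ifs=$IFS; IFS=:; set -- ${spec#*-}; IFS=$old_ifs
            hours=${1:-0}; minutes=${2:-0}; seconds=${3:-0}
            ;;
        *)
            old_ifs=$IFS; IFS=:; set -- $spec; IFS=$old_ifs
            case $# in
                1) minutes=$1 ;;                          # minutes
                2) minutes=$1; seconds=$2 ;;              # minutes:seconds
                3) hours=$1; minutes=$2; seconds=$3 ;;    # hours:minutes:seconds
                *) echo "unsupported time: $spec" >&2; return 1 ;;
            esac
            ;;
    esac
    days=$(strip_zeros "$days"); hours=$(strip_zeros "$hours")
    minutes=$(strip_zeros "$minutes"); seconds=$(strip_zeros "$seconds")
    echo $(( ((days * 24 + hours) * 60 + minutes) * 60 + seconds ))
}

slurm_time_to_seconds 10:00      # prints 600 (ten minutes)
slurm_time_to_seconds 12:00:00   # prints 43200 (twelve hours)
```

| Note that a two-field value such as 10:00 is read as minutes:seconds, not | |
| hours:minutes. | |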
| " | |
| Checking job status,https://docs.mila.quebec/Userguide.html#checking-job-status,"Checking job status | |
| To display jobs currently in the queue, use squeue. To list only your jobs, type: | |
| squeue -u $USER | |
| JOBID USER NAME ST START_TIME TIME NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON) COMMENT | |
| 133 my_username myjob R 2019-03-28T18:33 0:50 1 2 N/A 7000M node1 (None) (null) | |
| Note | |
| The maximum number of jobs able to be submitted to the system per user is 1000 (MaxSubmitJobs=1000) | |
| at any given time from the given association. If this limit is reached, new submission requests | |
| will be denied until existing jobs in this association complete. | |
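| To see how close you are to that limit, the squeue listing can be summarised | |
| per job state. The sketch below assumes the state (ST) is the fourth column, as | |
| in the sample output above; other squeue formats place it elsewhere. | |
| count_jobs_by_state is a hypothetical helper name and the second sample row is | |
| invented for illustration. | |

```shell
# Reads squeue-style output on stdin and prints one 'STATE COUNT' line per state.
count_jobs_by_state() {
    # NR > 1 skips the header line; column 4 holds the job state (assumption).
    awk 'NR > 1 { count[$4]++ } END { for (s in count) print s, count[s] }' | sort
}

# Sample input so the example runs without a cluster.
count_jobs_by_state <<'EOF'
JOBID USER NAME ST START_TIME TIME NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON) COMMENT
133 my_username myjob R 2019-03-28T18:33 0:50 1 2 N/A 7000M node1 (None) (null)
134 my_username myjob2 PD N/A 0:00 1 2 N/A 7000M (Priority) (null)
EOF
```

| On the cluster, the same helper could be fed directly with | |
| squeue -u $USER | count_jobs_by_state. | |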
| " | |
| Removing a job,https://docs.mila.quebec/Userguide.html#removing-a-job,"Removing a job | |
| To cancel your job simply use scancel | |
| scancel 4323674 | |
| " | |
| Partitioning,https://docs.mila.quebec/Userguide.html#partitioning,"Partitioning | |
| Since we don’t have many GPUs on the cluster, resources must be shared as fairly | |
| as possible. The --partition=/-p flag of SLURM allows you to set the | |
| priority you need for a job. Each job assigned with a priority can preempt jobs | |
| with a lower priority: unkillable > main > long. Once preempted, your job is | |
| killed without notice and is automatically re-queued on the same partition until | |
| resources are available. (To leverage a different preemption mechanism, see | |
| Handling preemption.) | |
| Flag                           Max Resource Usage          Max Time      Note | |
| --partition=unkillable         6 CPUs, mem=32G, 1 GPU      2 days | |
| --partition=unkillable-cpu     2 CPUs, mem=16G             2 days        CPU-only jobs | |
| --partition=short-unkillable   24 CPUs, mem=128G, 4 GPUs   3 hours (!)   Large but short jobs | |
| --partition=main               8 CPUs, mem=48G, 2 GPUs     5 days | |
| --partition=main-cpu           8 CPUs, mem=64G             5 days        CPU-only jobs | |
| --partition=long               no limit of resources       7 days | |
| --partition=long-cpu           no limit of resources       7 days        CPU-only jobs | |
| Warning | |
| Historically, before the 2022 introduction of CPU-only nodes (e.g. the cn-f | |
| series), CPU jobs ran side-by-side with the GPU jobs on GPU nodes. To prevent | |
| them obstructing any GPU job, they were always lowest-priority and preemptible. | |
| This was implemented by automatically assigning them to one of the now-obsolete | |
| partitions cpu_jobs, cpu_jobs_low or cpu_jobs_low-grace. | |
| Do not use these partition names anymore. Prefer the *-cpu partition | |
| names defined above. | |
| For backwards-compatibility purposes, the legacy partition names are translated | |
| to their effective equivalent long-cpu, but they will eventually be removed | |
| entirely. | |
| Note | |
| As a convenience, should you request the unkillable, main or long | |
| partition for a CPU-only job, the partition will be translated to its -cpu | |
| equivalent automatically. | |
| For instance, to request an unkillable job with 1 GPU, 4 CPUs, 10G of RAM and | |
| 12h of computation do: | |
| sbatch --gres=gpu:1 -c 4 --mem=10G -t 12:00:00 --partition=unkillable <job.sh> | |
| You can also make it an interactive job using salloc: | |
| salloc --gres=gpu:1 -c 4 --mem=10G -t 12:00:00 --partition=unkillable | |
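| Before submitting, a request can be compared against the per-partition limits | |
| listed in the table above. The sketch below hard-codes those limits, so it must | |
| be updated if the table changes, and it does not check the time limit. | |
| fits_partition is a hypothetical helper name, not a SLURM or milatools command. | |

```shell
# Usage: fits_partition <partition> <cpus> <mem_in_GB> <gpus>
# Returns 0 if the request is within the limits, 1 if not, 2 for an unknown partition.
fits_partition() {
    partition=$1 cpus=$2 mem_gb=$3 gpus=$4
    case $partition in
        unkillable)       max_cpus=6  max_mem=32  max_gpus=1 ;;
        unkillable-cpu)   max_cpus=2  max_mem=16  max_gpus=0 ;;
        short-unkillable) max_cpus=24 max_mem=128 max_gpus=4 ;;
        main)             max_cpus=8  max_mem=48  max_gpus=2 ;;
        main-cpu)         max_cpus=8  max_mem=64  max_gpus=0 ;;
        long|long-cpu)    return 0 ;;   # the table lists no resource limit
        *) echo "unknown partition: $partition" >&2; return 2 ;;
    esac
    [ "$cpus" -le "$max_cpus" ] && [ "$mem_gb" -le "$max_mem" ] && [ "$gpus" -le "$max_gpus" ]
}

# The request from the example above: 4 CPUs, 10G of RAM, 1 GPU on unkillable.
if fits_partition unkillable 4 10 1; then
    echo "request fits the unkillable partition"
else
    echo "request exceeds the unkillable limits; consider main or long"
fi
```

| The same request with 2 GPUs would be rejected for unkillable but accepted for | |
| main. | |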
| The Mila cluster has many different types of nodes/GPUs. To request a specific | |
| type of node/GPU, you can add specific feature requirements to your job | |
| submission command. | |
| To access those special nodes you need to request them explicitly by adding the | |
| flag --constraint=<name>. The full list of nodes in the Mila cluster can be | |
| found in Node profile description. | |
| Example: | |
| To request a machine with 2 GPUs using NVLink, you can use | |
| sbatch -c 4 --gres=gpu:2 --constraint=nvlink | |
| Feature                    Particularities | |
| 12GB/16GB/24GB/32GB/48GB   Request a specific amount of GPU memory | |
| volta/turing/ampere        Request a specific GPU architecture | |
| nvlink                     Machine with GPUs using the NVLink interconnect technology | |
| " | |
| Information on partitions/nodes,https://docs.mila.quebec/Userguide.html#information-on-partitions-nodes,"Information on partitions/nodes | |
| sinfo (ref.) provides most of the | |
| information about available nodes and partitions/queues to submit jobs to. | |
| A partition is a group of nodes usually sharing similar features. On a | |
| partition, some job limits can be applied which will override those requested for a | |
| job (e.g. max time, max CPUs, etc.) | |
| To display available partitions, simply use | |
| sinfo | |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST | |
| batch up infinite 2 alloc node[1,3,5-9] | |
| batch up infinite 6 idle node[10-15] | |
| cpu up infinite 6 idle cpu_node[1-15] | |
| gpu up infinite 6 idle gpu_node[1-15] | |
| To display available nodes and their status, you can use | |
| sinfo -N -l | |
| NODELIST NODES PARTITION STATE CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON | |
| node[1,3,5-9] 2 batch allocated 2 246 16000 0 (null) (null) | |
| node[2,4] 2 batch drain 2 246 16000 0 (null) (null) | |
| node[10-15] 6 batch idle 2 246 16000 0 (null) (null) | |
| ... | |
| And to get statistics on a job running or terminated, use sacct with some of | |
| the fields you want to display | |
| sacct --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,nnodes,ncpus,nodelist,workdir -u $USER | |
| User JobID JobName Partition State Timelimit Start End Elapsed NNodes NCPUS NodeList WorkDir | |
| --------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- -------- ---------- --------------- -------------------- | |
| my_usern+ 2398 run_extra+ batch RUNNING 130-05:00+ 2019-03-27T18:33:43 Unknown 1-01:07:54 1 16 node9 /home/mila/my_usern+ | |
| my_usern+ 2399 run_extra+ batch RUNNING 130-05:00+ 2019-03-26T08:51:38 Unknown 2-10:49:59 1 16 node9 /home/mila/my_usern+ | |
| Or to get the list of all your previous jobs, use the --start=YYYY-MM-DD flag. You can check sacct(1) for further information about additional time formats. | |
| sacct -u $USER --start=2019-01-01 | |
| scontrol (ref.) can be used to | |
| provide specific information on a job (currently running or recently terminated) | |
| scontrol show job 43123 | |
| JobId=43123 JobName=python_script.py | |
| UserId=my_username(1500000111) GroupId=student(1500000000) MCS_label=N/A | |
| Priority=645895 Nice=0 Account=my_username QOS=normal | |
| JobState=RUNNING Reason=None Dependency=(null) | |
| Requeue=1 Restarts=3 BatchFlag=1 Reboot=0 ExitCode=0:0 | |
| RunTime=2-10:41:57 TimeLimit=130-05:00:00 TimeMin=N/A | |
| SubmitTime=2019-03-26T08:47:17 EligibleTime=2019-03-26T08:49:18 | |
| AccrueTime=2019-03-26T08:49:18 | |
| StartTime=2019-03-26T08:51:38 EndTime=2019-08-03T13:51:38 Deadline=N/A | |
| PreemptTime=None SuspendTime=None SecsPreSuspend=0 | |
| LastSchedEval=2019-03-26T08:49:18 | |
| Partition=slurm_partition AllocNode:Sid=login-node-1:14586 | |
| ReqNodeList=(null) ExcNodeList=(null) | |
| NodeList=node2 | |
| BatchHost=node2 | |
| NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:* | |
| TRES=cpu=16,mem=32000M,node=1,billing=3 | |
| Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=* | |
| MinCPUsNode=16 MinMemoryNode=32000M MinTmpDiskNode=0 | |
| Features=(null) DelayBoot=00:00:00 | |
| OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) | |
| WorkDir=/home/mila/my_username | |
| StdErr=/home/mila/my_username/slurm-43123.out | |
| StdIn=/dev/null | |
| StdOut=/home/mila/my_username/slurm-43123.out | |
| Power= | |
| Or more info on a node and its resources | |
| scontrol show node node9 | |
| NodeName=node9 Arch=x86_64 CoresPerSocket=4 | |
| CPUAlloc=16 CPUTot=16 CPULoad=1.38 | |
| AvailableFeatures=(null) | |
| ActiveFeatures=(null) | |
| Gres=(null) | |
| NodeAddr=10.252.232.4 NodeHostName=mila20684000000 Port=0 Version=18.08 | |
| OS=Linux 4.15.0-1036 #38-Ubuntu SMP Fri Dec 7 02:47:47 UTC 2018 | |
| RealMemory=32000 AllocMem=32000 FreeMem=23262 Sockets=2 Boards=1 | |
| State=ALLOCATED+CLOUD ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A | |
| Partitions=slurm_partition | |
| BootTime=2019-03-26T08:50:01 SlurmdStartTime=2019-03-26T08:51:15 | |
| CfgTRES=cpu=16,mem=32000M,billing=3 | |
| AllocTRES=cpu=16,mem=32000M | |
| CapWatts=n/a | |
| CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 | |
| ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s | |
| " | |
| Useful Commands,https://docs.mila.quebec/Userguide.html#useful-commands,"Useful Commands | |
| salloc  # Get an interactive job and give you a shell (ssh-like), CPU only | |
| salloc --gres=gpu:1 -c 2 --mem=12000  # Get an interactive job with one GPU, 2 CPUs and 12000 MB RAM | |
| sbatch  # Start a batch job (same options as salloc) | |
| sattach --pty <jobid>.0  # Re-attach a dropped interactive job | |
| sinfo  # Status of all nodes | |
| sinfo -Ogres:27,nodelist,features -tidle,mix,alloc  # List GPU types and FEATURES that you can request | |
| savail  # (Custom) List available GPUs | |
| scancel <jobid>  # Cancel a job | |
| squeue  # Summary status of all active jobs | |
| squeue -u $USER  # Summary status of all YOUR active jobs | |
| squeue -j <jobid>  # Summary status of a specific job | |
| squeue -Ojobid,name,username,partition,state,timeused,nodelist,gres,tres  # Status of all jobs including requested resources (see the SLURM squeue doc for all output options) | |
| scontrol show job <jobid>  # Detailed status of a running job | |
| sacct -j <job_id> -o NodeList  # Get the node where a finished job ran | |
| sacct -u $USER -S <start_time> -E <stop_time>  # Find info about old jobs | |
| sacct -oJobID,JobName,User,Partition,Node,State  # List of current and recent jobs | |
| " | |
| Special GPU requirements,https://docs.mila.quebec/Userguide.html#special-gpu-requirements,"Special GPU requirements | |
| Specific GPU architecture and memory can be easily requested through the | |
| --gres flag by using either | |
| --gres=gpu:architecture:number | |
| --gres=gpu:memory:number | |
| --gres=gpu:model:number | |
| Example: | |
| To request 1 GPU with at least 16GB of memory use | |
| sbatch -c 4 --gres=gpu:16gb:1 | |
| The full list of GPUs and their features can be accessed here. | |
| " | |
| Example script,https://docs.mila.quebec/Userguide.html#example-script,"Example script | |
| Here is a sbatch script that follows good practices on the Mila cluster: | |
| #!/bin/bash | |
| | |
| #SBATCH --partition=unkillable # Ask for unkillable job | |
| #SBATCH --cpus-per-task=2 # Ask for 2 CPUs | |
| #SBATCH --gres=gpu:1 # Ask for 1 GPU | |
| #SBATCH --mem=10G # Ask for 10 GB of RAM | |
| #SBATCH --time=3:00:00 # The job will run for 3 hours | |
| #SBATCH -o /network/scratch/<u>/<username>/slurm-%j.out # Write the log on scratch | |
| | |
| # 1. Load the required modules | |
| module --quiet load anaconda/3 | |
| | |
| # 2. Load your environment | |
| conda activate ""<env_name>"" | |
| | |
| # 3. Copy your dataset on the compute node | |
| cp /network/datasets/<dataset> $SLURM_TMPDIR | |
| | |
| # 4. Launch your job, tell it to save the model in $SLURM_TMPDIR | |
| # and look for the dataset into $SLURM_TMPDIR | |
| python main.py --path $SLURM_TMPDIR --data_path $SLURM_TMPDIR | |
| | |
| # 5. Copy whatever you want to save on $SCRATCH | |
| cp $SLURM_TMPDIR/<to_save> /network/scratch/<u>/<username>/ | |
| " | |
| Portability concerns and solutions,https://docs.mila.quebec/Userguide.html#portability-concerns-and-solutions,"Portability concerns and solutions | |
| When working on a software project, it is important to be aware of all the | |
| software and libraries the project relies on and to list them explicitly and | |
| under a version control system in such a way that they can easily be | |
| installed and made available on different systems. The upsides are significant: | |
| Easily install and run on the cluster | |
| Ease of collaboration | |
| Better reproducibility | |
| To achieve this, try to always keep in mind the following aspects: | |
| Versions: For each dependency, make sure you have some record of the | |
| specific version you are using during development. That way, in the future, you | |
| will be able to reproduce the original environment which you know to be | |
| compatible. Indeed, the more time passes, the more likely it is that newer | |
| versions of some dependency have breaking changes. The pip freeze command can create | |
| such a record for Python dependencies. | |
| Isolation: Ideally, each of your software projects should be isolated from | |
| the others. What this means is that updating the environment for project A | |
| should not update the environment for project B. That way, you can freely | |
| install and upgrade software and libraries for the former without worrying about | |
| breaking the latter (which you might not notice until weeks later, the next time | |
| you work on project B!) Isolation can be achieved using Python Virtual environments and Containers. | |
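| As a minimal sketch of the Versions point above: pip freeze can write the exact | |
| versions to a file, and a small check can flag entries that are not pinned. | |
| The requirements.txt file name is a common convention, unpinned_requirements is | |
| a hypothetical helper name, and the sample file contents are invented for | |
| illustration. | |

```shell
# Record exact versions (run inside the project's environment):
#   pip freeze > requirements.txt
# Recreate the environment on another system:
#   pip install -r requirements.txt

# Print requirement lines that are not pinned to an exact version with '=='.
unpinned_requirements() {
    # Drop comments and blank lines first, then keep lines without '=='.
    grep -v -E '^[[:space:]]*(#|$)' "$1" | grep -v '==' || true
}

# Hypothetical requirements file so the example runs on its own.
req_file=$(mktemp)
cat > "$req_file" <<'EOF'
# pinned and unpinned entries
numpy==1.24.4
torch
requests>=2.0
EOF

unpinned_requirements "$req_file"   # lists torch and requests>=2.0
rm -f "$req_file"
```

| Keeping the resulting file under version control alongside the code gives each | |
| commit a record of the environment it was developed with. | |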
| " | |
| Managing your environments,https://docs.mila.quebec/Userguide.html#managing-your-environments,"Managing your environments | |
| " | |
| Virtual environments,https://docs.mila.quebec/Userguide.html#virtual-environments,"Virtual environments | |
| A virtual environment in Python is a local, isolated environment in which you | |
| can install or uninstall Python packages without interfering with the global | |
| environment (or other virtual environments). It usually lives in a directory | |
| (location varies depending on whether you use venv, conda or poetry). In order | |
| to use a virtual environment, you have to activate it. Activating an | |
| environment essentially sets environment variables in your shell so that: | |
| python points to the right Python version for that environment (different | |
| virtual environments can use different versions of Python!) | |
| python looks for packages in the virtual environment | |
| pip install installs packages into the virtual environment | |
| Any shell commands installed via pip install are made available | |
| To run experiments within a virtual environment, you can simply activate it | |
| in the script given to sbatch. | |
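| Since activation works through environment variables, a job script can check | |
| which environment is active before running. The sketch below relies on | |
| VIRTUAL_ENV, which the venv activate script sets, and CONDA_DEFAULT_ENV, which | |
| conda activate sets. active_env is a hypothetical helper name. | |

```shell
# Report which Python environment, if any, is active in the current shell.
active_env() {
    if [ -n "${VIRTUAL_ENV:-}" ]; then
        echo "venv: $VIRTUAL_ENV"
    elif [ -n "${CONDA_DEFAULT_ENV:-}" ]; then
        echo "conda: $CONDA_DEFAULT_ENV"
    else
        echo "none"
    fi
}

active_env
```

| Calling active_env right after the activation line of an sbatch script makes | |
| it visible in the job log which environment the experiment actually used. | |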
| " | |
| Pip/Virtualenv,https://docs.mila.quebec/Userguide.html#pip-virtualenv,"Pip/Virtualenv | |
| Pip is the preferred package manager for Python and each cluster provides | |
| several Python versions through the associated module which comes with pip. In | |
| order to install new packages, you will first have to create a personal space | |
| for them to be stored. The preferred solution (as it is the preferred solution | |
| on Digital Research Alliance of Canada clusters) is to use virtual | |
| environments. | |
| First, load the Python module you want to use: | |
| module load python/3.8 | |
| Then, create a virtual environment in your home directory: | |
| python -m venv $HOME/<env> | |
| Where <env> is the name of your environment. Finally, activate the environment: | |
| source $HOME/<env>/bin/activate | |
| You can now install any Python package you wish using the pip command, e.g. | |
| pytorch: | |
| pip install torch torchvision | |
| Or Tensorflow: | |
| pip install tensorflow | |
| " | |
| Conda,https://docs.mila.quebec/Userguide.html#conda,"Conda | |
| Another solution for Python is to use miniconda or anaconda, which are also available through the module | |
| command (the use of Conda is not recommended on Digital Research Alliance of | |
| Canada clusters due to the availability of custom-built packages for pip): | |
| module load miniconda/3 | |
| === Module miniconda/3 loaded === | |
| To enable conda environment functions, first use: | |
| To create an environment (see here | |
| for details) using a specific Python version, you may write: | |
| conda create -n <env> python=3.9 | |
| Where <env> is the name of your environment. You can now activate it by doing: | |
| conda activate <env> | |
| You are now ready to install any Python package you want in this environment. | |
| For instance, to install PyTorch, you can find the Conda command of any version | |
| you want on pytorch’s website, e.g: | |
| conda install pytorch torchvision cudatoolkit=10.0 -c pytorch | |
| If you make a lot of environments and install/uninstall a lot of packages, it | |
| can be good to periodically clean up Conda’s cache: | |
| conda clean --all | |
| " | |
| Using Modules,https://docs.mila.quebec/Userguide.html#using-modules,"Using Modules | |
| A lot of software, such as Python and Conda, is already compiled and available on | |
| the cluster through the module command and its sub-commands. In particular, | |
| if you wish to use Python 3.7 you can simply do: | |
| module load python/3.7 | |
| " | |
| The module command,https://docs.mila.quebec/Userguide.html#the-module-command,"The module command | |
| For a list of available modules, simply use: | |
| module avail | |
| -------------------------------------------------------------------------------------------------------------- Global Aliases --------------------------------------------------------------------------------------------------------------- | |
| cuda/10.0 -> cudatoolkit/10.0 cuda/9.2 -> cudatoolkit/9.2 pytorch/1.4.1 -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.4.1 tensorflow/1.15 -> python/3.7/tensorflow/1.15 | |
| cuda/10.1 -> cudatoolkit/10.1 mujoco-py -> python/3.7/mujoco-py/2.0 pytorch/1.5.0 -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.0 tensorflow/2.2 -> python/3.7/tensorflow/2.2 | |
| cuda/10.2 -> cudatoolkit/10.2 mujoco-py/2.0 -> python/3.7/mujoco-py/2.0 pytorch/1.5.1 -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.1 | |
| cuda/11.0 -> cudatoolkit/11.0 pytorch -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.1 tensorflow -> python/3.7/tensorflow/2.2 | |
| cuda/9.0 -> cudatoolkit/9.0 pytorch/1.4.0 -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.4.0 tensorflow-cpu/1.15 -> python/3.7/tensorflow/1.15 | |
| -------------------------------------------------------------------------------------------------- /cvmfs/config.mila.quebec/modules/Core --------------------------------------------------------------------------------------------------- | |
| Mila (S,L) anaconda/3 (D) go/1.13.5 miniconda/2 mujoco/1.50 python/2.7 python/3.6 python/3.8 singularity/3.0.3 singularity/3.2.1 singularity/3.5.3 (D) | |
| anaconda/2 go/1.12.4 go/1.14 (D) miniconda/3 (D) mujoco/2.0 (D) python/3.5 python/3.7 (D) singularity/2.6.1 singularity/3.1.1 singularity/3.4.2 | |
| ------------------------------------------------------------------------------------------------ /cvmfs/config.mila.quebec/modules/Compiler ------------------------------------------------------------------------------------------------- | |
| python/3.7/mujoco-py/2.0 | |
| -------------------------------------------------------------------------------------------------- /cvmfs/config.mila.quebec/modules/Cuda --------------------------------------------------------------------------------------------------- | |
| cuda/10.0/cudnn/7.3 cuda/10.0/nccl/2.4 cuda/10.1/nccl/2.4 cuda/11.0/nccl/2.7 cuda/9.0/nccl/2.4 cudatoolkit/9.0 cudatoolkit/10.1 cudnn/7.6/cuda/10.0/tensorrt/7.0 | |
| cuda/10.0/cudnn/7.5 cuda/10.1/cudnn/7.5 cuda/10.2/cudnn/7.6 cuda/9.0/cudnn/7.3 cuda/9.2/cudnn/7.6 cudatoolkit/9.2 cudatoolkit/10.2 cudnn/7.6/cuda/10.1/tensorrt/7.0 | |
| cuda/10.0/cudnn/7.6 (D) cuda/10.1/cudnn/7.6 (D) cuda/10.2/nccl/2.7 cuda/9.0/cudnn/7.5 (D) cuda/9.2/nccl/2.4 cudatoolkit/10.0 cudatoolkit/11.0 (D) cudnn/7.6/cuda/9.0/tensorrt/7.0 | |
| ------------------------------------------------------------------------------------------------ /cvmfs/config.mila.quebec/modules/Pytorch -------------------------------------------------------------------------------------------------- | |
| python/3.7/cuda/10.1/cudnn/7.6/pytorch/1.4.1 python/3.7/cuda/10.1/cudnn/7.6/pytorch/1.5.1 (D) python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.0 | |
| python/3.7/cuda/10.1/cudnn/7.6/pytorch/1.5.0 python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.4.1 python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.1 (D) | |
| ----------------------------------------------------------------------------------------------- /cvmfs/config.mila.quebec/modules/Tensorflow ------------------------------------------------------------------------------------------------ | |
| python/3.7/tensorflow/1.15 python/3.7/tensorflow/2.0 python/3.7/tensorflow/2.2 (D) | |
| Modules can be loaded using the load command: | |
| module load <module> | |
| To search for a module or a software, use the command spider: | |
| module spider search_term | |
| E.g.: by default, python2 will refer to the OS-shipped installation of python2.7 and python3 to python3.6. | |
| If you want to use python3.7 you can type: | |
| module load python/3.7 | |
| " | |
| Available Software,https://docs.mila.quebec/Userguide.html#available-software,"Available Software | |
| Modules are divided in 5 main sections: | |
| Section              Description | |
| Core                 Base interpreter and software (Python, go, etc.) | |
| Compiler             Interpreter-dependent software (see the note below) | |
| Cuda                 Toolkits, cudnn and related libraries | |
| Pytorch/Tensorflow   Pytorch/TF built with a specific Cuda/Cudnn version for Mila’s GPUs (see the related paragraph) | |
| Note | |
| Modules which are nested (../../..) usually depend on other software/modules | |
| loaded alongside the main module. There is no need to load the dependent software: | |
| the complex naming scheme allows automatic detection of the dependent | |
| module(s). | |
| E.g.: loading cudnn/7.6/cuda/9.0/tensorrt/7.0 will load cudnn/7.6 and | |
| cuda/9.0 alongside. | |
| python/3.X is a particular dependency which can be served through | |
| python/3.X or anaconda/3 and is not automatically loaded, to let the | |
| user pick their preferred flavor. | |
| " | |
| Default package location,https://docs.mila.quebec/Userguide.html#default-package-location,"Default package location | |
| By default, Python uses the packages in your user site-packages first and the | |
| packages provided by modules last, so that modules do not interfere with your | |
| own installation. If you want to skip the packages installed in your | |
| site-packages folder (in your /home directory), you | |
| have to start Python with the -s flag. | |
| To check which package is loaded at import, you can print package.__file__ | |
| to get the full path of the package. | |
| Example: | |
| module load pytorch/1.5.0 | |
| python -c 'import torch;print(torch.__file__)' | |
| /home/mila/my_home/.local/lib/python3.7/site-packages/torch/__init__.py <== package from your own site-package | |
| Now with the -s flag: | |
| module load pytorch/1.5.0 | |
| python -s -c 'import torch;print(torch.__file__)' | |
| /cvmfs/ai.mila.quebec/apps/x86_64/debian/pytorch/python3.7-cuda10.1-cudnn7.6-v1.5.0/lib/python3.7/site-packages/torch/__init__.py | |
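When checking several packages, the comparison can be scripted. The following is a minimal sketch, assuming absolute paths and the two prefixes that appear in the example output (the user's ~/.local/lib tree and /cvmfs). The function name package_origin is hypothetical.

```shell
# Hypothetical helper: classify a path printed by
#   python -c 'import torch; print(torch.__file__)'
# The prefixes below are assumptions taken from the example output.
package_origin() {
    case "$1" in
        "$HOME"/.local/lib/*) echo "user-site" ;;
        /cvmfs/*)             echo "module" ;;
        *)                    echo "other" ;;
    esac
}

# Possible usage once a module is loaded (not executed here):
#   package_origin "$(python -s -c 'import torch; print(torch.__file__)')"
```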
| " | |
| On using containers,https://docs.mila.quebec/Userguide.html#on-using-containers,"On using containers | |
| Another option for creating portable code is to use containers (see Using containers on clusters). | |
| Containers are a popular approach to deploying applications by packaging a lot | |
| of the required dependencies together. The most popular tool for this is | |
| Docker, but Docker cannot be used on the Mila | |
| cluster (nor the other clusters from Digital Research Alliance of Canada). | |
| One popular mechanism for containerisation on a computational cluster is called | |
| Singularity. | |
| This is the recommended approach for running containers on the | |
| Mila cluster. See section Singularity for more details. | |
| " | |
| Singularity,https://docs.mila.quebec/Userguide.html#id7,"Singularity | |
| " | |
| Overview,https://docs.mila.quebec/Userguide.html#overview,"Overview | |
| " | |
| What is Singularity?,https://docs.mila.quebec/Userguide.html#what-is-singularity,"What is Singularity? | |
| Running Docker on SLURM is a security problem (e.g. running as root, being able | |
| to mount any directory). The alternative is to use Singularity, which is a | |
| popular solution in the world of HPC. | |
| There is a good level of compatibility between Docker and Singularity, | |
| and we can find many exaggerated claims about being able to convert containers | |
| from Docker to Singularity without any friction. | |
| Oftentimes, Docker images from DockerHub are 100% compatible with Singularity, | |
| and they can indeed be used without friction, but things get messy when | |
| we try to convert our own Docker build files to Singularity recipes. | |
| " | |
| Links to official documentation,https://docs.mila.quebec/Userguide.html#links-to-official-documentation,"Links to official documentation | |
| official Singularity user guide (this is the one you | |
| will use most often) | |
| official Singularity admin guide | |
| " | |
| Overview of the steps used in practice,https://docs.mila.quebec/Userguide.html#overview-of-the-steps-used-in-practice,"Overview of the steps used in practice | |
| Most often, the process to create and use a Singularity container is: | |
| on your Linux computer (at home or work) | |
| select a Docker image from DockerHub (e.g. pytorch/pytorch) | |
| make a recipe file for Singularity that starts with that DockerHub image | |
| build the recipe file, thus creating the image file (e.g. my-pytorch-image.sif) | |
| test your singularity container before sending it over to the cluster | |
| rsync -av my-pytorch-image.sif <login-node>:Documents/my-singularity-images | |
| on the login node for that cluster | |
| queue your jobs with sbatch ... | |
| (note that your jobs will copy over the my-pytorch-image.sif to $SLURM_TMPDIR | |
| and will then launch Singularity with that image) | |
| do something else while you wait for them to finish | |
| queue more jobs with the same my-pytorch-image.sif, | |
| reusing it many times over | |
| In the following sections you will find specific examples or tips to accomplish | |
| in practice the steps highlighted above. | |
| " | |
| "Nope, not on MacOS",https://docs.mila.quebec/Userguide.html#nope-not-on-macos,"Nope, not on MacOS | |
| Singularity does not work on MacOS, as of the time of this writing in 2021. | |
| Docker does not actually run natively on MacOS either. Instead, Docker silently | |
| installs a virtual machine running Linux, which makes it a pleasant experience, | |
| and the user does not need to care about the details of how Docker does it. | |
| Given its origins in HPC, Singularity does not provide that kind of seamless | |
| experience on MacOS, even though it’s technically possible to run it | |
| inside a Linux virtual machine on MacOS. | |
| " | |
| Where to build images,https://docs.mila.quebec/Userguide.html#where-to-build-images,"Where to build images | |
| Building Singularity images is a rather heavy task, which can take 20 minutes | |
| if you have a lot of steps in your recipe. This makes it a bad task to run on | |
| the login nodes of our clusters, especially if it needs to be run regularly. | |
| On the Mila cluster, we are lucky to have unrestricted internet access on the compute | |
| nodes, which means that anyone can request an interactive CPU node (no need for GPU) | |
| and build their images there without problem. | |
| Warning | |
| Do not build Singularity images from scratch every time you run a | |
| job in a large batch. This will be a colossal waste of GPU time as well as | |
| internet bandwidth. If you set up your workflow properly (e.g. using bind | |
| paths for your code and data), you can spend months reusing the same | |
| Singularity image my-pytorch-image.sif. | |
| " | |
| Building the containers,https://docs.mila.quebec/Userguide.html#building-the-containers,"Building the containers | |
| Building a container is like creating a new environment except that containers | |
| are much more powerful since they are self-contained systems. With | |
| singularity, there are two ways to build containers. | |
| The first way is to build the container by hand. It is like getting a new Linux | |
| laptop without knowing exactly what you will need: whenever you notice that | |
| something is missing, you install it. Here you start from a vanilla Ubuntu | |
| container called a sandbox, log in, and install each package yourself. This | |
| procedure can take time but will help you understand how things work and what | |
| you need. It is recommended if you need to figure out how things will be | |
| compiled or if you want to install packages on the fly. We’ll refer to this | |
| procedure as singularity sandboxes. | |
| The second way is for when you already know what you want: you write a list of | |
| everything you need, hand it to singularity, and it installs everything | |
| for you. Those lists are called singularity recipes. | |
| " | |
| First way: Build and use a sandbox,https://docs.mila.quebec/Userguide.html#first-way-build-and-use-a-sandbox,"First way: Build and use a sandbox | |
| You might ask yourself: On which machine should I build a container? | |
| First of all, you need to choose where you’ll build your container. This | |
| operation requires memory and high cpu usage. | |
| Warning | |
| Do NOT build containers on any login nodes ! | |
| (Recommended for beginners) If you need to use apt-get, you should build | |
| the container on your laptop with sudo privileges. You’ll only need to | |
| install singularity on your laptop. Windows/Mac users can look there and | |
| Ubuntu/Debian users can use directly: | |
| sudo apt-get install singularity-container | |
| If you can’t install singularity on your laptop and you don’t need | |
| apt-get, you can reserve a cpu node on the Mila cluster to build your | |
| container. | |
| In this case, in order to avoid too much I/O over the network, you should define | |
| the singularity cache locally: | |
| export SINGULARITY_CACHEDIR=$SLURM_TMPDIR | |
| If you can’t install singularity on your laptop and you want to use | |
| apt-get, you can use singularity-hub to build your containers and read | |
| Recipe_section. | |
| " | |
| Download containers from the web,https://docs.mila.quebec/Userguide.html#download-containers-from-the-web,"Download containers from the web | |
| Hopefully, you may not need to create containers from scratch as many have been | |
| already built for the most common deep learning software. You can find most of | |
| them on dockerhub. | |
| Go on dockerhub and select the container you want to pull. | |
| For example, to get a PyTorch image with GPU support (version 1.0.1 here; | |
| replace runtime with devel if you need the full Cuda toolkit): | |
| singularity pull docker://pytorch/pytorch:1.0.1-cuda10.0-cudnn7-runtime | |
| Or the latest TensorFlow: | |
| singularity pull docker://tensorflow/tensorflow:latest-gpu-py3 | |
| Currently the pulled image pytorch.simg or tensorflow.simg is read-only, | |
| meaning that you won’t be able to install anything on it. From now on, PyTorch | |
| will be used as the example. If you use TensorFlow, simply replace every | |
| occurrence of pytorch with tensorflow. | |
| " | |
| How to add or install stuff in a container,https://docs.mila.quebec/Userguide.html#how-to-add-or-install-stuff-in-a-container,"How to add or install stuff in a container | |
| The first step is to transform your read-only container | |
| pytorch-1.0.1-cuda10.0-cudnn7-runtime.simg into a writable version that will | |
| allow you to add packages. | |
| Warning | |
| Depending on the version of singularity you are using, singularity | |
| will build a container with the extension .simg or .sif. If you’re using | |
| .sif files, replace every occurrence of .simg with .sif. | |
| Tip | |
| If you want to use apt-get you have to put sudo ahead of the | |
| following commands | |
| This command will create a writable image in the folder pytorch. | |
| singularity build --sandbox pytorch pytorch-1.0.1-cuda10.0-cudnn7-runtime.simg | |
| Then you’ll need the following command to log inside the container. | |
| singularity shell --writable -H $HOME:/home pytorch | |
| Once you get into the container, you can use pip and install anything you need | |
| (Or with apt-get if you built the container with sudo). | |
| Warning | |
| Singularity mounts your home folder, so if you install things into | |
| the $HOME of your container, they will be installed in your real | |
| $HOME! | |
| You should install your stuff in /usr/local instead. | |
| " | |
| Creating useful directories,https://docs.mila.quebec/Userguide.html#creating-useful-directories,"Creating useful directories | |
| One of the benefits of containers is that you’ll be able to use them across | |
| different clusters. However for each cluster the datasets and experiments | |
| folder location can be different. In order to be invariant to those locations, | |
| we will create some useful mount points inside the container: | |
| mkdir /dataset | |
| mkdir /tmp_log | |
| mkdir /final_log | |
| From now on, you won’t need to worry about specifying in your code | |
| where to pick up your dataset. Your dataset will always be in /dataset | |
| independently of the cluster you are using. | |
| " | |
| Testing,https://docs.mila.quebec/Userguide.html#testing,"Testing | |
| If you have some code that you want to test before finalizing your container, | |
| you have two choices. You can either log into your container and run Python | |
| code inside it with: | |
| singularity shell --nv pytorch | |
| Or you can execute your command directly with | |
| singularity exec --nv pytorch python YOUR_CODE.py | |
| Tip | |
| --nv allows the container to use gpus. You don’t need this if you | |
| don’t plan to use a gpu. | |
| Warning | |
| Don’t forget to clear the cache of the packages you installed in | |
| the containers. | |
| " | |
| Creating a new image from the sandbox,https://docs.mila.quebec/Userguide.html#creating-a-new-image-from-the-sandbox,"Creating a new image from the sandbox | |
| Once everything you need is installed inside the container, you need to convert | |
| it back to a read-only singularity image with: | |
| singularity build pytorch_final.simg pytorch | |
| " | |
| Second way: Use recipes,https://docs.mila.quebec/Userguide.html#second-way-use-recipes,"Second way: Use recipes | |
| A singularity recipe is a file including specifics about installation software, | |
| environment variables, files to add, and container metadata. It is a starting | |
| point for designing any custom container. Instead of pulling a container and | |
| installing your packages manually, you can specify in this file the packages | |
| you want and then build your container from this file. | |
| Here is a toy example of a singularity recipe installing some stuff: | |
| ################# Header: Define the base system you want to use ################ | |
| # Reference of the kind of base you want to use (e.g., docker, debootstrap, shub). | |
| Bootstrap: docker | |
| # Select the docker image you want to use (Here we choose tensorflow) | |
| From: tensorflow/tensorflow:latest-gpu-py3 | |
| ################# Section: Defining the system ################################# | |
| # Commands in the %post section are executed within the container. | |
| %post | |
| echo ""Installing Tools with apt-get"" | |
| apt-get update | |
| apt-get install -y cmake libcupti-dev libyaml-dev wget unzip | |
| apt-get clean | |
| echo ""Installing things with pip"" | |
| pip install tqdm | |
| echo ""Creating mount points"" | |
| mkdir /dataset | |
| mkdir /tmp_log | |
| mkdir /final_log | |
| # Environment variables that should be sourced at runtime. | |
| %environment | |
| # use bash as default shell | |
| SHELL=/bin/bash | |
| export SHELL | |
| A recipe file contains two parts: the header and sections. In the | |
| header you specify which base system you want to use; it can be any docker | |
| or singularity container. In the sections, you list the things you want to | |
| install in the %post section, and the environment variables that must be | |
| set at runtime in the %environment section. For a more detailed | |
| description, please look at the singularity documentation. | |
| In order to build a singularity container from a singularity recipe file, you | |
| should use: | |
| sudo singularity build <NAME_CONTAINER> <YOUR_RECIPE_FILES> | |
| Warning | |
| You always need to use sudo when you build a container from a | |
| recipe. As there is no access to sudo on the cluster, you need a personal | |
| computer or singularity hub to build a container. | |
| " | |
| Build recipe on singularity hub,https://docs.mila.quebec/Userguide.html#build-recipe-on-singularity-hub,"Build recipe on singularity hub | |
| Singularity hub allows users to build containers from recipes directly on | |
| singularity-hub’s cloud meaning that you don’t need to build containers by | |
| yourself. You need to register on singularity-hub and link your | |
| singularity-hub account to your GitHub account, then: | |
| Create a new github repository. | |
| Add a collection on singularity-hub and select the github repository you created. | |
| Clone the github repository on your computer. | |
| $ git clone <url> | |
| Write the singularity recipe and save it as a file named Singularity. | |
| Git add Singularity, commit and push on the master branch | |
| $ git add Singularity | |
| $ git commit | |
| $ git push origin master | |
| At this point, robots from singularity-hub will build the container for you. You | |
| will be able to download your container from the website or directly with: | |
| singularity pull shub://<github_username>/<repository_name> | |
| " | |
| "Example: Recipe with OpenAI gym, MuJoCo and Miniworld",https://docs.mila.quebec/Userguide.html#example-recipe-with-openai-gym-mujoco-and-miniworld,"Example: Recipe with OpenAI gym, MuJoCo and Miniworld | |
| Here is an example of how you can use a singularity recipe to install complex | |
| environments such as OpenAI gym, MuJoCo and Miniworld on a PyTorch-based | |
| container. In order to use MuJoCo, you’ll need to copy the key stored on the | |
| Mila cluster in /ai/apps/mujoco/license/mjkey.txt to your current directory. | |
| # This is a singularity recipe that sets up a full Gym install with test dependencies | |
| Bootstrap: docker | |
| # Here we'll build our container upon the pytorch container | |
| From: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime | |
| # Now we'll copy the mjkey file located in the current directory inside the container's root | |
| # directory | |
| %files | |
| mjkey.txt | |
| # Then we put everything we need to install | |
| %post | |
| export PATH=$PATH:/opt/conda/bin | |
| apt -y update && \ | |
| apt install -y keyboard-configuration && \ | |
| apt install -y \ | |
| python3-dev \ | |
| python-pyglet \ | |
| python3-opengl \ | |
| libhdf5-dev \ | |
| libjpeg-dev \ | |
| libboost-all-dev \ | |
| libsdl2-dev \ | |
| libosmesa6-dev \ | |
| patchelf \ | |
| ffmpeg \ | |
| xvfb \ | |
| libhdf5-dev \ | |
| openjdk-8-jdk \ | |
| wget \ | |
| git \ | |
| unzip && \ | |
| apt clean && \ | |
| rm -rf /var/lib/apt/lists/* | |
| pip install h5py | |
| # Download Gym and MuJoCo | |
| mkdir /Gym && cd /Gym | |
| git clone https://github.com/openai/gym.git || true && \ | |
| mkdir /Gym/.mujoco && cd /Gym/.mujoco | |
| wget https://www.roboti.us/download/mjpro150_linux.zip && \ | |
| unzip mjpro150_linux.zip && \ | |
| wget https://www.roboti.us/download/mujoco200_linux.zip && \ | |
| unzip mujoco200_linux.zip && \ | |
| mv mujoco200_linux mujoco200 | |
| # Export global environment variables | |
| export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt | |
| export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mujoco150/ | |
| export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin | |
| export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin | |
| export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin | |
| cp /mjkey.txt /Gym/.mujoco/mjkey.txt | |
| # Install Python dependencies | |
| wget https://raw.githubusercontent.com/openai/mujoco-py/master/requirements.txt | |
| pip install -r requirements.txt | |
| # Install Gym and MuJoCo | |
| cd /Gym/gym | |
| pip install -e '.[all]' | |
| # Change permission to use mujoco_py as non sudoer user | |
| chmod -R 777 /opt/conda/lib/python3.6/site-packages/mujoco_py/ | |
| pip install --upgrade minerl | |
| # Export global environment variables | |
| %environment | |
| export SHELL=/bin/sh | |
| export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt | |
| export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mujoco150/ | |
| export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin | |
| export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin | |
| export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin | |
| export PATH=/Gym/gym/.tox/py3/bin:$PATH | |
| %runscript | |
| exec /bin/sh ""$@"" | |
| Here is the same recipe but written for TensorFlow: | |
| # This is a singularity recipe that sets up a full Gym install with test dependencies | |
| Bootstrap: docker | |
| # Here we'll build our container upon the tensorflow container | |
| From: tensorflow/tensorflow:latest-gpu-py3 | |
| # Now we'll copy the mjkey file located in the current directory inside the container's root | |
| # directory | |
| %files | |
| mjkey.txt | |
| # Then we put everything we need to install | |
| %post | |
| apt -y update && \ | |
| apt install -y keyboard-configuration && \ | |
| apt install -y \ | |
| python3-setuptools \ | |
| python3-dev \ | |
| python-pyglet \ | |
| python3-opengl \ | |
| libjpeg-dev \ | |
| libboost-all-dev \ | |
| libsdl2-dev \ | |
| libosmesa6-dev \ | |
| patchelf \ | |
| ffmpeg \ | |
| xvfb \ | |
| wget \ | |
| git \ | |
| unzip && \ | |
| apt clean && \ | |
| rm -rf /var/lib/apt/lists/* | |
| # Download Gym and MuJoCo | |
| mkdir /Gym && cd /Gym | |
| git clone https://github.com/openai/gym.git || true && \ | |
| mkdir /Gym/.mujoco && cd /Gym/.mujoco | |
| wget https://www.roboti.us/download/mjpro150_linux.zip && \ | |
| unzip mjpro150_linux.zip && \ | |
| wget https://www.roboti.us/download/mujoco200_linux.zip && \ | |
| unzip mujoco200_linux.zip && \ | |
| mv mujoco200_linux mujoco200 | |
| # Export global environment variables | |
| export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt | |
| export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mujoco150/ | |
| export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin | |
| export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin | |
| export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin | |
| cp /mjkey.txt /Gym/.mujoco/mjkey.txt | |
| # Install Python dependencies | |
| wget https://raw.githubusercontent.com/openai/mujoco-py/master/requirements.txt | |
| pip install -r requirements.txt | |
| # Install Gym and MuJoCo | |
| cd /Gym/gym | |
| pip install -e '.[all]' | |
| # Change permission to use mujoco_py as non sudoer user | |
| chmod -R 777 /usr/local/lib/python3.5/dist-packages/mujoco_py/ | |
| # Then install miniworld | |
| cd /usr/local/ | |
| git clone https://github.com/maximecb/gym-miniworld.git | |
| cd gym-miniworld | |
| pip install -e . | |
| # Export global environment variables | |
| %environment | |
| export SHELL=/bin/bash | |
| export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt | |
| export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mujoco150/ | |
| export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin | |
| export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin | |
| export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin | |
| export PATH=/Gym/gym/.tox/py3/bin:$PATH | |
| %runscript | |
| exec /bin/bash ""$@"" | |
| Keep in mind that those environment variables are sourced at runtime and not at | |
| build time. This is why you should also define them in the %post section | |
| since they are required to install MuJoCo. | |
| " | |
| Using containers on clusters,https://docs.mila.quebec/Userguide.html#using-containers-on-clusters,"Using containers on clusters | |
| " | |
| How to use containers on clusters,https://docs.mila.quebec/Userguide.html#how-to-use-containers-on-clusters,"How to use containers on clusters | |
| On every cluster with Slurm, datasets and intermediate results should go in | |
| $SLURM_TMPDIR while the final experiment results should go in $SCRATCH. | |
| In order to use the container you built, you need to copy it to the cluster you | |
| want to use. | |
| Warning | |
| You should always store your container in $SCRATCH ! | |
| Then reserve a node with srun/sbatch, copy the container and your dataset to the | |
| node given by SLURM (i.e. in $SLURM_TMPDIR) and execute the code | |
| <YOUR_CODE> within the container <YOUR_CONTAINER> with: | |
| singularity exec --nv -H $HOME:/home -B $SLURM_TMPDIR:/dataset/ -B $SLURM_TMPDIR:/tmp_log/ -B $SCRATCH:/final_log/ $SLURM_TMPDIR/<YOUR_CONTAINER> python <YOUR_CODE> | |
| Remember that /dataset, /tmp_log and /final_log were created in the | |
| previous section. Now, each time we use singularity, we explicitly tell it | |
| with the option -B to mount $SLURM_TMPDIR on the cluster’s node onto the folder | |
| /dataset inside the container, such that each dataset | |
| downloaded by PyTorch in /dataset will be available in $SLURM_TMPDIR. | |
| This will allow us to have code and scripts that are invariant to the cluster | |
| environment. The option -H specifies what will be the container’s home. For | |
| example, if you have your code in $HOME/Project12345/Version35/ you can | |
| specify -H $HOME/Project12345/Version35:/home, thus the container will only | |
| have access to the code inside Version35. | |
| If you want to run multiple commands inside the container you can use: | |
| singularity exec --nv -H $HOME:/home -B $SLURM_TMPDIR:/dataset/ \ | |
| -B $SLURM_TMPDIR:/tmp_log/ -B $SCRATCH:/final_log/ \ | |
| $SLURM_TMPDIR/<YOUR_CONTAINER> bash -c 'pwd && ls && python <YOUR_CODE>' | |
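Because the bind mounts in these commands rely on $SLURM_TMPDIR and $SCRATCH, a job can fail in confusing ways when one of them is empty. The following is a minimal sketch of a pre-flight check that could be placed at the top of a job script. The function name check_job_env is hypothetical, and the list of variables is an assumption based on the commands shown here.

```shell
# Hypothetical pre-flight check: fail early when a variable used by the
# bind mounts is missing or empty.
check_job_env() {
    for name in SLURM_TMPDIR SCRATCH; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: \$$name is not set" >&2
            return 1
        fi
    done
}

# Possible usage at the top of a job script:
#   check_job_env || exit 1
```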
| " | |
| Example: Interactive case (srun/salloc),https://docs.mila.quebec/Userguide.html#example-interactive-case-srun-salloc,"Example: Interactive case (srun/salloc) | |
| Once you get an interactive session with SLURM, copy <YOUR_CONTAINER> and | |
| <YOUR_DATASET> to $SLURM_TMPDIR | |
| 0. Get an interactive session | |
| srun --gres=gpu:1 --pty bash | |
| 1. Copy your container on the compute node | |
| rsync -avz $SCRATCH/<YOUR_CONTAINER> $SLURM_TMPDIR | |
| 2. Copy your dataset on the compute node | |
| rsync -avz $SCRATCH/<YOUR_DATASET> $SLURM_TMPDIR | |
| Then use singularity shell to get a shell inside the container | |
| 3. Get a shell in your environment | |
| singularity shell --nv \ | |
| -H $HOME:/home \ | |
| -B $SLURM_TMPDIR:/dataset/ \ | |
| -B $SLURM_TMPDIR:/tmp_log/ \ | |
| -B $SCRATCH:/final_log/ \ | |
| $SLURM_TMPDIR/<YOUR_CONTAINER> | |
| 4. Execute your code | |
| python <YOUR_CODE> | |
| or use singularity exec to execute <YOUR_CODE>. | |
| 3. Execute your code | |
| singularity exec --nv \ | |
| -H $HOME:/home \ | |
| -B $SLURM_TMPDIR:/dataset/ \ | |
| -B $SLURM_TMPDIR:/tmp_log/ \ | |
| -B $SCRATCH:/final_log/ \ | |
| $SLURM_TMPDIR/<YOUR_CONTAINER> \ | |
| python <YOUR_CODE> | |
| You can also create the following alias to make your life easier. | |
| alias my_env='singularity exec --nv \ | |
| -H $HOME:/home \ | |
| -B $SLURM_TMPDIR:/dataset/ \ | |
| -B $SLURM_TMPDIR:/tmp_log/ \ | |
| -B $SCRATCH:/final_log/ \ | |
| $SLURM_TMPDIR/<YOUR_CONTAINER>' | |
| This will allow you to run any code with: | |
| my_env python <YOUR_CODE> | |
| " | |
| Example: sbatch case,https://docs.mila.quebec/Userguide.html#example-sbatch-case,"Example: sbatch case | |
| You can also create a sbatch script: | |
| #!/bin/bash | |
| #SBATCH --cpus-per-task=6 # Ask for 6 CPUs | |
| #SBATCH --gres=gpu:1 # Ask for 1 GPU | |
| #SBATCH --mem=10G # Ask for 10 GB of RAM | |
| #SBATCH --time=0:10:00 # The job will run for 10 minutes | |
| # 1. Copy your container on the compute node | |
| rsync -avz $SCRATCH/<YOUR_CONTAINER> $SLURM_TMPDIR | |
| # 2. Copy your dataset on the compute node | |
| rsync -avz $SCRATCH/<YOUR_DATASET> $SLURM_TMPDIR | |
| # 3. Executing your code with singularity | |
| singularity exec --nv \ | |
| -H $HOME:/home \ | |
| -B $SLURM_TMPDIR:/dataset/ \ | |
| -B $SLURM_TMPDIR:/tmp_log/ \ | |
| -B $SCRATCH:/final_log/ \ | |
| $SLURM_TMPDIR/<YOUR_CONTAINER> \ | |
| python ""<YOUR_CODE>"" | |
| # 4. Copy whatever you want to save on $SCRATCH | |
| rsync -avz $SLURM_TMPDIR/<to_save> $SCRATCH | |
| " | |
| Issue with PyBullet and OpenGL libraries,https://docs.mila.quebec/Userguide.html#issue-with-pybullet-and-opengl-libraries,"Issue with PyBullet and OpenGL libraries | |
| If you are running certain gym environments that require pyglet, you may | |
| encounter a problem when running your singularity instance with the Nvidia | |
| drivers using the --nv flag. This happens because the --nv flag also | |
| provides the OpenGL libraries: | |
| libGL.so.1 => /.singularity.d/libs/libGL.so.1 | |
| libGLX.so.0 => /.singularity.d/libs/libGLX.so.0 | |
| If you don’t experience those problems with pyglet, you probably don’t need | |
| to address this. Otherwise, you can resolve those problems by running apt-get install | |
| -y libosmesa6-dev mesa-utils mesa-utils-extra libgl1-mesa-glx, and then making | |
| sure that your LD_LIBRARY_PATH points to those libraries before the ones in | |
| /.singularity.d/libs. | |
| %environment | |
| # ... | |
| export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/mesa:$LD_LIBRARY_PATH | |
| " | |
| Mila cluster,https://docs.mila.quebec/Userguide.html#mila-cluster,"Mila cluster | |
| On the Mila cluster $SCRATCH is not yet defined, so you should store the | |
| experiment results you want to keep in /network/scratch/<u>/<username>/. In | |
| order to use the sbatch script above and to match other cluster environment’s | |
| names, you can define $SCRATCH as an environment variable pointing to | |
| /network/scratch/<u>/<username> with: | |
| echo ""export SCRATCH=/network/scratch/${USER:0:1}/$USER"" >> ~/.bashrc | |
| Then, you can follow the general procedure explained above. | |
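To make the <u>/<username> layout concrete, here is a small sketch that builds the same path for an arbitrary user name. It uses cut instead of the bash-only ${USER:0:1} expansion so that it also runs in a plain POSIX shell. The function name scratch_dir_for and the user name alice are hypothetical.

```shell
# Hypothetical helper: print the scratch directory for a user name,
# following the /network/scratch/<u>/<username> layout.
scratch_dir_for() {
    user=$1
    # First character of the user name (what ${USER:0:1} yields in bash).
    first=$(printf '%s' "$user" | cut -c1)
    printf '/network/scratch/%s/%s\n' "$first" "$user"
}

scratch_dir_for alice
```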
| " | |
| Digital Research Alliance of Canada,https://docs.mila.quebec/Userguide.html#digital-research-alliance-of-canada,"Digital Research Alliance of Canada | |
| Using singularity on Digital Research Alliance of Canada clusters is similar, except that | |
| you need to add Yoshua’s account name and load singularity. Here is an example | |
| of an sbatch script using singularity on a Digital Research Alliance of Canada cluster: | |
| Warning | |
| You should use singularity/2.6 or singularity/3.4. There is a bug | |
| in singularity/3.2 which makes gpu unusable. | |
| #!/bin/bash | |
| #SBATCH --account=rpp-bengioy # Yoshua pays for your job | |
| #SBATCH --cpus-per-task=6 # Ask for 6 CPUs | |
| #SBATCH --gres=gpu:1 # Ask for 1 GPU | |
| #SBATCH --mem=32G # Ask for 32 GB of RAM | |
| #SBATCH --time=0:10:00 # The job will run for 10 minutes | |
| #SBATCH --output=""/scratch/<user>/slurm-%j.out"" # Modify the output of sbatch | |
| # 1. You have to load singularity | |
| module load singularity | |
| # 2. Then you copy the container to the local disk | |
| rsync -avz $SCRATCH/<YOUR_CONTAINER> $SLURM_TMPDIR | |
| # 3. Copy your dataset on the compute node | |
| rsync -avz $SCRATCH/<YOUR_DATASET> $SLURM_TMPDIR | |
| # 4. Executing your code with singularity | |
| singularity exec --nv \ | |
| -H $HOME:/home \ | |
| -B $SLURM_TMPDIR:/dataset/ \ | |
| -B $SLURM_TMPDIR:/tmp_log/ \ | |
| -B $SCRATCH:/final_log/ \ | |
| $SLURM_TMPDIR/<YOUR_CONTAINER> \ | |
| python ""<YOUR_CODE>"" | |
| # 5. Copy whatever you want to save on $SCRATCH | |
| rsync -avz $SLURM_TMPDIR/<to_save> $SCRATCH | |
| " | |
| Sharing Data with ACLs,https://docs.mila.quebec/Userguide.html#sharing-data-with-acls,"Sharing Data with ACLs | |
| Regular permission bits are extremely blunt tools. They control access through | |
| only three sets of bits: owning user, owning group and all others. Therefore, | |
| access is either too narrow (0700 allows access only by oneself) or too wide | |
| (0770 gives all permissions to everyone in the same group, and 0777 to | |
| literally everyone). | |
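The bluntness of the plain permission bits can be seen on a throwaway folder. This is a minimal sketch, assuming a Linux system where mktemp -d is available. The helper name mode_string is hypothetical.

```shell
# Hypothetical helper: print the type and permission characters of a path.
mode_string() {
    ls -ld "$1" | cut -c1-10
}

demo=$(mktemp -d)
chmod 700 "$demo"; mode_string "$demo"   # only the owner has access
chmod 770 "$demo"; mode_string "$demo"   # the whole owning group has access too
rmdir "$demo"
```

There is no setting of these bits that grants access to exactly one other user, which is the gap ACLs fill.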
| ACLs (Access Control Lists) are an expansion of the permissions bits that allow | |
| more fine-grained, granular control of accesses to a file. They can be used to | |
| permit specific users access to files and folders even if conservative default | |
| permissions would have denied them such access. | |
| As an illustrative example, to use ACLs to allow $USER (oneself) to | |
| share with $USER2 (another person) a “playground” folder hierarchy in | |
| Mila’s scratch filesystem at a location | |
| /network/scratch/${USER:0:1}/$USER/X/Y/Z/... | |
| in a safe and secure fashion that allows both users to read, write, execute, | |
| search and delete each others’ files: | |
| 1. Grant oneself permissions to access any future files/folders created | |
| by the other (or oneself) | |
| (-d renders this permission a “default” / inheritable one) | |
| setfacl -Rdm user:${USER}:rwx /network/scratch/${USER:0:1}/$USER/X/Y/Z/ | |
| Note | |
| The importance of doing this seemingly-redundant step first is that files | |
| and folders are always owned by only one person, almost always their | |
| creator (the UID will be the creator’s, the GID typically as well). If that | |
| user is not yourself, you will not have access to those files unless the | |
| other person specifically gives them to you – or these files inherited a | |
| default ACL allowing you full access. | |
| This is the inherited, default ACL serving that purpose. | |
| 2. Grant the other permission to access any future files/folders created | |
| by the other (or oneself) | |
| (-d renders this permission a “default” / inheritable one) | |
| setfacl -Rdm user:${USER2}:rwx /network/scratch/${USER:0:1}/$USER/X/Y/Z/ | |
| 3. Grant the other permission to access any existing files/folders created | |
| by oneself. | |
| Such files and folders were created before the new default ACLs were added | |
| above and thus did not inherit them from their parent folder at the moment of | |
| their creation. | |
| setfacl -Rm user:${USER2}:rwx /network/scratch/${USER:0:1}/$USER/X/Y/Z/ | |
| Note | |
| The purpose of granting permissions first for future files and then for | |
| existing files is to prevent a race condition whereby after the first | |
| setfacl command the other person could create files to which the | |
| second setfacl command does not apply. | |
| 4. Grant the other permission to search through one’s hierarchy down to the | |
| shared location in question. | |
| Non-recursive (!!!!) | |
| You may also grant :rx in the unlikely event that others listing your folders | |
| on the path is not troublesome, or is even desirable. | |
| setfacl -m user:${USER2}:x /network/scratch/${USER:0:1}/$USER/X/Y/ | |
| setfacl -m user:${USER2}:x /network/scratch/${USER:0:1}/$USER/X/ | |
| setfacl -m user:${USER2}:x /network/scratch/${USER:0:1}/$USER/ | |
| Note | |
| In order to access a file, all folders from the root (/) down to the | |
| parent folder in question must be searchable (+x) by the concerned user. | |
| This is already the case for all users for folders such as /, | |
| /network and /network/scratch, but users must explicitly grant access | |
| to some or all users either through base permissions or by adding ACLs, for | |
| at least /network/scratch/${USER:0:1}/$USER, $HOME and subfolders. | |
| To bluntly allow all users to search through a folder (think twice!), | |
| the following command can be used: | |
| chmod a+x /network/scratch/${USER:0:1}/$USER/ | |
| Note | |
| For more information on setfacl and path resolution/access checking, | |
| consider the following documentation viewing commands: | |
| man setfacl | |
| man path_resolution | |
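The order of the four steps above matters, so it can help to generate the commands rather than type them. The Python sketch below only builds the command strings and does not run anything. The function name acl_share_commands and the example users alice and bob are assumptions made for illustration.

```python
from pathlib import PurePosixPath

def acl_share_commands(shared_dir, top_dir, owner, other):
    shared = PurePosixPath(shared_dir)
    top = PurePosixPath(top_dir)
    if top not in shared.parents:
        raise ValueError('top_dir must be a parent of shared_dir')
    cmds = [
        # Steps 1 and 2: default (inheritable) ACLs for future files.
        f'setfacl -Rdm user:{owner}:rwx {shared}/',
        f'setfacl -Rdm user:{other}:rwx {shared}/',
        # Step 3: ACLs for files that already exist.
        f'setfacl -Rm user:{other}:rwx {shared}/',
    ]
    # Step 4: non-recursive search permission on each parent, up to top_dir.
    for parent in shared.parents:
        cmds.append(f'setfacl -m user:{other}:x {parent}/')
        if parent == top:
            break
    return cmds

for cmd in acl_share_commands('/network/scratch/a/alice/X/Y/Z',
                              '/network/scratch/a/alice', 'alice', 'bob'):
    print(cmd)
```

Printing the commands first leaves room to review them before pasting them into a shell.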
| " | |
| Viewing and Verifying ACLs,https://docs.mila.quebec/Userguide.html#viewing-and-verifying-acls,"Viewing and Verifying ACLs | |
| getfacl /path/to/folder/or/file | |
| # file: somedir/ | |
| # owner: lisa | |
| # group: staff | |
| # flags: -s- | |
| user::rwx | |
| user:joe:rwx #effective:r-x | |
| group::rwx #effective:r-x | |
| group:cool:r-x | |
| mask::r-x | |
| other::r-x | |
| default:user::rwx | |
| default:user:joe:rwx #effective:r-x | |
| default:group::r-x | |
| default:mask::r-x | |
| default:other::--- | |
| Note | |
| For more information on getfacl, consider the following documentation viewing command: | |
| man getfacl | |
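The #effective annotations in the listing above come from combining an entry with the mask. As a rough sketch of that rule, the Python below parses the non-default entries of a getfacl listing and applies the mask to named users, the owning group and named groups. The function names are assumptions for this example, and the parser only handles the simple format shown here.

```python
def effective(perms, mask):
    # A permission is effective only if both the entry and the mask grant it.
    return ''.join(p if m != '-' else '-' for p, m in zip(perms, mask))

def effective_entries(getfacl_text):
    entries = []
    for line in getfacl_text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments and annotations
        if not line or line.startswith('default:'):
            continue
        tag, qualifier, perms = line.split(':')
        entries.append((tag, qualifier, perms))
    mask = next((p for t, _, p in entries if t == 'mask'), 'rwx')
    result = {}
    for tag, qualifier, perms in entries:
        # The mask limits named users, the owning group and named groups.
        limited = (tag == 'user' and qualifier) or tag == 'group'
        result[f'{tag}:{qualifier}'] = effective(perms, mask) if limited else perms
    return result

sample = '''# file: somedir/
user::rwx
user:joe:rwx
group::rwx
group:cool:r-x
mask::r-x
other::r-x
'''
print(effective_entries(sample))
```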
| " | |
| Contributing datasets,https://docs.mila.quebec/Userguide.html#contributing-datasets,"Contributing datasets | |
| If a dataset could help the research of others at Mila, this form can be filled to request its addition | |
| to /network/datasets. | |
| " | |
| Publicly share a Mila dataset,https://docs.mila.quebec/Userguide.html#publicly-share-a-mila-dataset,"Publicly share a Mila dataset | |
| Mila offers two ways to publicly share a Mila dataset: | |
| Academic Torrent | |
| Google Drive | |
| Note that these options are not mutually exclusive and both can be used. | |
| " | |
| Academic Torrent,https://docs.mila.quebec/Userguide.html#id10,"Academic Torrent | |
| Mila hosts/seeds some datasets created by the Mila community through Academic | |
| Torrent. The first step is to create an | |
| account and a torrent file. | |
| Then drop the dataset in /network/scratch/.transit_datasets and send the | |
| Academic Torrent URL to Mila’s helpdesk. If | |
| the dataset does not reside on the Mila cluster, only the Academic Torrent URL | |
| would be needed to proceed with the initial download. Then you can delete / | |
| stop sharing your copy. | |
| Note | |
| Avoid mentioning dataset in the name of the dataset | |
| Avoid capital letters and special characters (including spaces) in file and | |
| directory names. Spaces can be replaced by hyphens (-). | |
| Multiple archives can be provided to spread the data (e.g. dataset splits, | |
| raw data, extra data, …) | |
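The naming advice in the note above can be applied mechanically. The Python sketch below is one possible interpretation: it lower-cases the name, replaces whitespace with hyphens and drops any remaining special characters. The function name and the set of characters kept (letters, digits, dot, underscore, hyphen) are assumptions, not a Mila rule.

```python
import re

def normalize_dataset_name(name):
    # Lower-case and turn runs of whitespace into single hyphens.
    name = re.sub(r'\s+', '-', name.strip().lower())
    # Drop anything that is not a letter, digit, dot, underscore or hyphen.
    return re.sub(r'[^a-z0-9._-]', '', name)

print(normalize_dataset_name('My Data Set (v2).zip'))
```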
| " | |
| Generate a .torrent file to be uploaded to Academic Torrent,https://docs.mila.quebec/Userguide.html#generate-a-torrent-file-to-be-uploaded-to-academic-torrent,"Generate a .torrent file to be uploaded to Academic Torrent | |
| The command line / Python utility torrentool can be used to create a | |
| DATASET_NAME.torrent file: | |
| # Install torrentool | |
| python3 -m pip install torrentool click | |
| # Change Directory to the location of the dataset to be hosted by Mila | |
| cd /network/scratch/.transit_datasets | |
| torrent create --tracker https://academictorrents.com/announce.php DATASET_NAME | |
| The resulting DATASET_NAME.torrent can then be used to register a new dataset | |
| on Academic Torrent. | |
| Warning | |
| The creation of a DATASET_NAME.torrent file requires the computation of | |
| checksums for the dataset content which can quickly become CPU-heavy. This | |
| process should not be executed on a login node. | |
| " | |
| Download a dataset from Academic Torrent,https://docs.mila.quebec/Userguide.html#download-a-dataset-from-academic-torrent,"Download a dataset from Academic Torrent | |
| Academic Torrent provides a Python API to easily download a dataset | |
| from its registered list: | |
| # Install the Python API with: | |
| # python3 -m pip install academictorrents | |
| import academictorrents as at | |
| mnist_path = at.get(""323a0048d87ca79b68f12a6350a57776b6a3b7fb"", datastore=""~/scratch/.academictorrents-datastore"") # Download the mnist dataset | |
| Note | |
| Current needs have been evaluated at a download speed of about 10 | |
| MB/s. This speed can be higher if more users also seed the dataset. | |
| " | |
| Google Drive,https://docs.mila.quebec/Userguide.html#id12,"Google Drive | |
| Only a member of the staff team can upload to Mila’s Google Drive, | |
| which requires first dropping the dataset in | |
| /network/scratch/.transit_datasets. Then, contact Mila’s helpdesk and provide the following information: | |
| directory containing the archived dataset (zip is favored) in | |
| /network/scratch/.transit_datasets | |
| the name of the dataset | |
| a licence in .txt format. One of the Creative Commons licenses can be used. It is | |
| recommended to at least have the Attribution option. The No Derivatives | |
| option is discouraged unless the dataset should not be modified by others. | |
| MD5 checksum of the archive | |
| the arXiv and GitHub URLs (those can be sent later if the article is still in | |
| the submission process) | |
| instructions on whether the dataset needs to be unzipped, untarred or | |
| otherwise processed before uploading to Google Drive | |
| Note | |
| Avoid mentioning dataset in the name of the dataset | |
| Avoid capital letters and special characters (including spaces) in file and | |
| directory names. Spaces can be replaced by hyphens (-). | |
| Multiple archives can be provided to spread the data (e.g. dataset splits, | |
| raw data, extra data, …) | |
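The MD5 checksum requested in the list above can be computed with any standard tool. As an illustration, here is a small Python sketch that reads the archive in chunks so that large files do not have to fit in memory. The function name md5sum and the 1 MiB chunk size are choices made for this example.

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

As with torrent creation, hashing a large archive is better done on a compute node than on a login node.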
| " | |
| Download a dataset from Mila’s Google Drive with gdown,https://docs.mila.quebec/Userguide.html#download-a-dataset-from-mila-s-google-drive-with-gdown,"Download a dataset from Mila’s Google Drive with gdown | |
| gdown is a simple utility to | |
| download data from Google Drive from the command line shell or in a Python | |
| script. It requires no setup. | |
| Warning | |
| A limitation, however, is that it uses a shared client id, which can cause a | |
| quota block when too many users use it on the same day. This is described in | |
| a GitHub issue. | |
| " | |
| Download a dataset from Mila’s Google Drive with rclone,https://docs.mila.quebec/Userguide.html#download-a-dataset-from-mila-s-google-drive-with-rclone,"Download a dataset from Mila’s Google Drive with rclone | |
| Rclone is a command line program to manage files on | |
| cloud storage. For a Google Drive remote, it allows you to specify your own | |
| client id instead of sharing one with other users, which avoids quota limits. Rclone | |
| describes the creation of a client id in its documentation. Once this is done, a | |
| remote for Mila’s Google Drive can be configured from the command line: | |
| rclone config create mila-gdrive drive client_id XXXXXXXXXXXX-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.apps.googleusercontent.com \ | |
| client_secret XXXXXXXXXXXXX-XXXXXXXXXX \ | |
| scope 'drive.readonly' \ | |
| root_folder_id 1peJ6VF9wQ-LeETgcdGxu1e4fo28JbtUt \ | |
| config_is_local false \ | |
| config_refresh_token false | |
| The remote can then be used to download a dataset: | |
| rclone copy --progress mila-gdrive:DATASET_NAME/ ~/scratch/datasets/DATASET_NAME/ | |
| Rclone is available from the conda channel conda-forge. | |
| " | |
| Digital Object Identifier (DOI),https://docs.mila.quebec/Userguide.html#digital-object-identifier-doi,"Digital Object Identifier (DOI) | |
| It is recommended to get a DOI to reference the dataset. A DOI is a permanent | |
| id/URL which prevents losing references of online scientific data. | |
| https://figshare.com can be used to create a DOI: | |
| Go in My Data | |
| Create an item by clicking Create new item | |
| Check Metadata record only at the top | |
| Fill the metadata fields | |
| Then reference the dataset using https://doi.org like this: | |
| https://doi.org/10.6084/m9.figshare.2066037 | |
| " | |
| Data Transmission using Globus Connect Personal,https://docs.mila.quebec/Userguide.html#data-transmission-using-globus-connect-personal,"Data Transmission using Globus Connect Personal | |
| Mila doesn’t own a Globus license, but if the source or destination provides a | |
| Globus account, as the Digital Research Alliance of Canada does for example, it’s | |
| possible to set up Globus Connect Personal to create a personal endpoint on the | |
| Mila cluster by following the Globus guide to Install, Configure, and | |
| Uninstall Globus Connect Personal for Linux. | |
| This endpoint can then be used to transfer data to and from the Mila cluster. | |
| " | |
| JupyterHub,https://docs.mila.quebec/Userguide.html#jupyterhub,"JupyterHub | |
| JupyterHub is a platform connected to SLURM that starts a JupyterLab | |
| session as a batch job and then connects you to it once the allocation has been granted. | |
| It does not require any ssh tunnel or port redirection: the hub acts as a proxy | |
| server that will redirect you to a session as soon as it is available. | |
| It is currently available for Mila clusters and some Digital Research Alliance | |
| of Canada (Alliance) clusters. | |
| Cluster      Address                                       Login type | |
| Mila Local   https://jupyterhub.server.mila.quebec         Google OAuth | |
| Alliance     https://docs.alliancecan.ca/wiki/JupyterHub   DRAC login | |
| Warning | |
| Do not forget to close the JupyterLab session! Closing the window leaves | |
| the session and the SLURM job it is linked to running. | |
| To close it, use the hub menu and then Control Panel > Stop my server | |
| Note | |
| For Mila Clusters: | |
| mila.quebec account credentials should be used to login and start a | |
| JupyterLab session. | |
| " | |
| Access Mila Storage in JupyterLab,https://docs.mila.quebec/Userguide.html#access-mila-storage-in-jupyterlab,"Access Mila Storage in JupyterLab | |
| Unfortunately, JupyterLab does not allow the navigation to parent directories of | |
| $HOME. This makes some file systems like /network/datasets or | |
| $SLURM_TMPDIR unavailable through their absolute path in the interface. It | |
| is however possible to create symbolic links to those resources. To do so, you | |
| can use the ln -s command: | |
| ln -s /network/datasets $HOME | |
| Note that $SLURM_TMPDIR is a directory that is dynamically created for each | |
| job so you would need to recreate the symbolic link every time you start a | |
| JupyterHub session: | |
| ln -sf $SLURM_TMPDIR $HOME | |
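For those who prefer to refresh the link from a notebook cell, the Python sketch below does roughly what the ln -sf commands above do: it replaces a stale link of the same name and then points it at the new target. The function name link_into_home is an assumption made for this example.

```python
import os
from pathlib import Path

def link_into_home(target, home=None):
    # Create (or refresh) a symbolic link named after the target inside home.
    home = Path(home or os.environ['HOME'])
    link = home / Path(target).name
    if link.is_symlink():
        link.unlink()  # remove a stale link, e.g. the previous job's $SLURM_TMPDIR
    link.symlink_to(target)
    return link

# Typical use inside a job:
# link_into_home(os.environ['SLURM_TMPDIR'])
```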
| " | |
| Advanced SLURM usage and Multiple GPU jobs,https://docs.mila.quebec/Userguide.html#advanced-slurm-usage-and-multiple-gpu-jobs,"Advanced SLURM usage and Multiple GPU jobs | |
| " | |
| Handling preemption,https://docs.mila.quebec/Userguide.html#handling-preemption,"Handling preemption | |
| On the Mila cluster, jobs can preempt one another depending on their priority | |
| (unkillable > high > low) (see the Slurm documentation). | |
| The default preemption mechanism is to kill and re-queue the job automatically | |
| without any notice. To allow a different preemption mechanism, every partition | |
| has been duplicated (i.e. the duplicates have the same characteristics as their counterparts) | |
| with a 120-second grace period before your job is killed, but without automatic | |
| requeuing. Those partitions are identified by the suffix -grace | |
| (main-grace, long-grace, main-cpu-grace, long-cpu-grace). | |
| When using a partition with a grace period, a series of signals consisting of | |
| first SIGCONT and SIGTERM then SIGKILL will be sent to the SLURM | |
| job. It’s good practice to catch those signals using the Linux trap command | |
| to properly terminate a job and save what’s necessary to restart the job. On | |
| each cluster, you’ll be allowed a grace period before SLURM actually kills | |
| your job (SIGKILL). | |
| The easiest way to handle preemption is by trapping the SIGTERM signal: | |
| #SBATCH --ntasks=1 | |
| #SBATCH .... | |
| | |
| exit_script() { | |
|     echo 'Preemption signal, saving myself' | |
|     trap - SIGTERM # clear the trap | |
|     # Optional: sends SIGTERM to child/sub processes | |
|     kill -- -$$ | |
| } | |
| | |
| trap exit_script SIGTERM | |
| | |
| # The main script part | |
| python3 my_script | |
| Note | |
| Requeuing: | |
| The Slurm scheduler on the cluster does not allow a grace period before | |
| preempting a job while requeuing it automatically, therefore your job will | |
| be cancelled at the end of the grace period. | |
| To automatically requeue it, you can just add the sbatch command inside | |
| your exit_script function. | |
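The same idea can be applied inside the Python program itself, assuming SIGTERM actually reaches the Python process (for instance because the batch script forwards it with kill -- -$$ as above). The sketch below only records that the signal arrived and lets the training loop stop at a safe point. The names PreemptionGuard, train and save_checkpoint are hypothetical.

```python
import signal

class PreemptionGuard:
    # Records that SIGTERM arrived so the main loop can stop at a safe point.
    def __init__(self, signum=signal.SIGTERM):
        self.preempted = False
        signal.signal(signum, self._on_signal)

    def _on_signal(self, signum, frame):
        self.preempted = True

def train(guard, total_steps, save_checkpoint):
    for step in range(total_steps):
        # ... one unit of work would run here ...
        if guard.preempted:
            save_checkpoint(step)  # must finish within the grace period
            return step
    return total_steps
```

Keeping the handler minimal and doing the saving in the main loop avoids writing a checkpoint in the middle of an update.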
| " | |
| Packing jobs,https://docs.mila.quebec/Userguide.html#packing-jobs,"Packing jobs | |
| " | |
| Sharing a GPU between processes,https://docs.mila.quebec/Userguide.html#sharing-a-gpu-between-processes,"Sharing a GPU between processes | |
| srun, when used in a batch job, is responsible for starting tasks on the | |
| allocated resources (see srun). | |
| SLURM batch script: | |
| #SBATCH --ntasks-per-node=2 | |
| #SBATCH --output=myjob_output_wrapper.out | |
| #SBATCH --ntasks=2 | |
| #SBATCH --gres=gpu:1 | |
| #SBATCH --cpus-per-task=4 | |
| #SBATCH --mem=18G | |
| srun -l --output=myjob_output_%t.out python script args | |
| This will run Python 2 times with the same arguments, each process with 4 CPUs. | |
| --output=myjob_output_%t.out will create 2 output files, appending the task | |
| id (%t) to the filename, and 1 global log file for things happening outside | |
| the srun command. | |
| Knowing that, if you want to have 2 different arguments to the Python program, | |
| you can use a multi-prog configuration file: srun -l --multi-prog silly.conf | |
| 0 python script firstarg | |
| 1 python script secondarg | |
| Or by specifying a range of tasks | |
| 0-1 python script %t | |
| %t being the taskid that your Python script will parse. Note the -l on the | |
| srun command: this will prepend each line with the taskid (0:, 1:) | |
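Instead of passing %t on the command line, a script can also read its task id from the environment, since srun sets SLURM_PROCID for each task it starts. The Python sketch below picks one argument per task in that way. The function name task_argument and the fallback to task 0 outside SLURM are assumptions made for this example.

```python
import os

def task_argument(arguments, environ=None):
    # SLURM_PROCID is set by srun for each task (0, 1, ...).
    # Outside of SLURM the variable is absent, so default to task 0.
    environ = os.environ if environ is None else environ
    task_id = int(environ.get('SLURM_PROCID', 0))
    return task_id, arguments[task_id]

task_id, argument = task_argument(['firstarg', 'secondarg'])
print(f'task {task_id} uses {argument}')
```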
| " | |
| Sharing a node with multiple GPU 1process/GPU,https://docs.mila.quebec/Userguide.html#sharing-a-node-with-multiple-gpu-1process-gpu,"Sharing a node with multiple GPU 1process/GPU | |
| On Digital Research Alliance of Canada clusters, several nodes, especially nodes with | |
| large GPUs (P100), are reserved for jobs requesting the whole node; therefore, | |
| packing multiple processes in a single job lets you leverage faster GPUs. | |
| If you want different tasks to access different GPUs in a single allocation, you | |
| need to create an allocation requesting a whole node and use srun with a | |
| subset of those resources (1 GPU). | |
| Keep in mind that every resource not specified on the srun command will | |
| inherit the global allocation specification, so you need to split each resource | |
| into a subset (except --cpus-per-task, which is a per-task requirement). | |
| Each srun represents a job step (%s). | |
| Example for a GPU node with 24 cores, 4 GPUs and 128G of RAM | |
| Requesting 1 task per GPU | |
| #!/bin/bash | |
| #SBATCH --nodes=1-1 | |
| #SBATCH --ntasks-per-node=4 | |
| #SBATCH --output=myjob_output_wrapper.out | |
| #SBATCH --gres=gpu:4 | |
| #SBATCH --cpus-per-task=6 | |
| srun --gres=gpu:1 -n1 --mem=30G -l --output=%j-step-%s.out --exclusive python script args1 & | |
| srun --gres=gpu:1 -n1 --mem=30G -l --output=%j-step-%s.out --exclusive python script args2 & | |
| srun --gres=gpu:1 -n1 --mem=30G -l --output=%j-step-%s.out --exclusive python script args3 & | |
| srun --gres=gpu:1 -n1 --mem=30G -l --output=%j-step-%s.out --exclusive python script args4 & | |
| wait | |
| This will create 4 output files: | |
| JOBID-step-0.out | |
| JOBID-step-1.out | |
| JOBID-step-2.out | |
| JOBID-step-3.out | |
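The per-step values used in the script above follow from dividing the node between the job steps. The Python sketch below reproduces that arithmetic for the 24-core, 4-GPU, 128G example. The 8G held back from the node memory is an assumption chosen so that the result matches the 30G used in the script; it is not a documented value.

```python
def split_node(cores, gpus, mem_gb, tasks_per_gpu=1, reserved_mem_gb=8):
    # reserved_mem_gb is an assumed safety margin, not a documented value.
    steps = gpus  # one srun (job step) per GPU
    return {
        'steps': steps,
        'tasks_per_step': tasks_per_gpu,
        'cpus_per_task': cores // (gpus * tasks_per_gpu),
        'mem_per_step_gb': (mem_gb - reserved_mem_gb) // steps,
    }

print(split_node(24, 4, 128))                   # 1 task per GPU
print(split_node(24, 4, 128, tasks_per_gpu=2))  # 2 tasks per GPU
```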
| " | |
| Sharing a node with multiple GPU & multiple processes/GPU,https://docs.mila.quebec/Userguide.html#sharing-a-node-with-multiple-gpu-multiple-processes-gpu,"Sharing a node with multiple GPU & multiple processes/GPU | |
| Combining both previous sections, we can create a script requesting a whole node | |
| with four GPUs, allocating 1 GPU per srun and sharing each GPU between multiple | |
| processes. | |
| Example, still with a node with 24 cores, 4 GPUs and 128G of RAM | |
| Requesting 2 tasks per GPU | |
| #!/bin/bash | |
| #SBATCH --nodes=1-1 | |
| #SBATCH --ntasks-per-node=8 | |
| #SBATCH --output=myjob_output_wrapper.out | |
| #SBATCH --gres=gpu:4 | |
| #SBATCH --cpus-per-task=3 | |
| srun --gres=gpu:1 -n2 --mem=30G -l --output=%j-step-%s-task-%t.out --exclusive --multi-prog silly.conf & | |
| srun --gres=gpu:1 -n2 --mem=30G -l --output=%j-step-%s-task-%t.out --exclusive --multi-prog silly.conf & | |
| srun --gres=gpu:1 -n2 --mem=30G -l --output=%j-step-%s-task-%t.out --exclusive --multi-prog silly.conf & | |
| srun --gres=gpu:1 -n2 --mem=30G -l --output=%j-step-%s-task-%t.out --exclusive --multi-prog silly.conf & | |
| wait | |
| --exclusive is important so that each subsequent step/srun binds to different CPUs. | |
| This will produce 8 output files, 2 for each step: | |
| JOBID-step-0-task-0.out | |
| JOBID-step-0-task-1.out | |
| JOBID-step-1-task-0.out | |
| JOBID-step-1-task-1.out | |
| JOBID-step-2-task-0.out | |
| JOBID-step-2-task-1.out | |
| JOBID-step-3-task-0.out | |
| JOBID-step-3-task-1.out | |
| Running nvidia-smi from silly.conf and parsing the output, we can see 4 | |
| GPUs allocated and 2 tasks per GPU: | |
| cat JOBID-step-* | grep Tesla | |
| 0: | 0 Tesla P100-PCIE... On | 00000000:04:00.0 Off | 0 | | |
| 1: | 0 Tesla P100-PCIE... On | 00000000:04:00.0 Off | 0 | | |
| 0: | 0 Tesla P100-PCIE... On | 00000000:83:00.0 Off | 0 | | |
| 1: | 0 Tesla P100-PCIE... On | 00000000:83:00.0 Off | 0 | | |
| 0: | 0 Tesla P100-PCIE... On | 00000000:82:00.0 Off | 0 | | |
| 1: | 0 Tesla P100-PCIE... On | 00000000:82:00.0 Off | 0 | | |
| 0: | 0 Tesla P100-PCIE... On | 00000000:03:00.0 Off | 0 | | |
| 1: | 0 Tesla P100-PCIE... On | 00000000:03:00.0 Off | 0 | | |
| " | |
| Multiple Nodes,https://docs.mila.quebec/Userguide.html#multiple-nodes,"Multiple Nodes | |
| " | |
| Data Parallel,https://docs.mila.quebec/Userguide.html#data-parallel,"Data Parallel | |
| Request 3 nodes with at least 4 GPUs each. | |
| #!/bin/bash | |
| | |
| # Number of Nodes | |
| #SBATCH --nodes=3 | |
| | |
| # Number of tasks. 3 (1 per node) | |
| #SBATCH --ntasks=3 | |
| | |
| # Number of GPU per node | |
| #SBATCH --gres=gpu:4 | |
| #SBATCH --gpus-per-node=4 | |
| | |
| # 16 CPUs per node | |
| #SBATCH --cpus-per-gpu=4 | |
| | |
| # 16 GB per node (4 GB per GPU) | |
| #SBATCH --mem=16G | |
| | |
| # we need all nodes to be ready at the same time | |
| #SBATCH --wait-all-nodes=1 | |
| | |
| # Total resources: | |
| # CPU: 16 * 3 = 48 | |
| # RAM: 16 * 3 = 48 GB | |
| # GPU: 4 * 3 = 12 | |
| | |
| # Setup our rendez-vous point | |
| RDV_ADDR=$(hostname) | |
| WORLD_SIZE=$SLURM_JOB_NUM_NODES | |
| # ----- | |
| | |
| srun -l torchrun \ | |
|     --nproc_per_node=$SLURM_GPUS_PER_NODE \ | |
|     --nnodes=$WORLD_SIZE \ | |
|     --rdzv_id=$SLURM_JOB_ID \ | |
|     --rdzv_backend=c10d \ | |
|     --rdzv_endpoint=$RDV_ADDR \ | |
|     training_script.py | |
| Below is an outline of what a multi-node PyTorch trainer could look like. | |
| import os | |
| import torch | |
| import torch.distributed as dist | |
| from torch.utils.data import DataLoader | |
| # ElasticDistributedSampler must be defined or imported by your own code; | |
| # torch.utils.data.distributed.DistributedSampler is a standard alternative. | |
| class Trainer: | |
|     def __init__(self): | |
|         self.local_rank = None | |
|         self.global_rank = None | |
|         self.chk_path = ... | |
|         self.model = ... | |
|     @property | |
|     def device_id(self): | |
|         return self.local_rank | |
|     def load_checkpoint(self, path=None): | |
|         # Without an argument, reload from the previously set path | |
|         if path is not None: | |
|             self.chk_path = path | |
|         # ... | |
|     def should_checkpoint(self): | |
|         # Note: only one worker saves its weights | |
|         return self.global_rank == 0 and self.local_rank == 0 | |
|     def save_checkpoint(self): | |
|         if self.chk_path is None: | |
|             return | |
|         # Save your states here | |
|         # Note: you should save the weights of self.model not ddp_model | |
|         # ... | |
|     def initialize(self): | |
|         self.global_rank = int(os.environ.get('RANK', -1)) | |
|         self.local_rank = int(os.environ.get('LOCAL_RANK', -1)) | |
|         assert self.global_rank >= 0, 'Global rank should be set (Only Rank 0 can save checkpoints)' | |
|         assert self.local_rank >= 0, 'Local rank should be set' | |
|         # Choose one backend: 'nccl' for GPU training or 'gloo' for CPU training | |
|         dist.init_process_group(backend='nccl') | |
|     def sync_weights(self, resuming=False): | |
|         if resuming: | |
|             # in the case of resuming all workers need to load the same checkpoint | |
|             self.load_checkpoint() | |
|             # Wait for everybody to finish loading the checkpoint | |
|             dist.barrier() | |
|             return | |
|         # Make sure all workers have the same initial weights | |
|         # This makes the leader save its weights | |
|         if self.should_checkpoint(): | |
|             self.save_checkpoint() | |
|         # All workers wait for the leader to finish | |
|         dist.barrier() | |
|         # All followers load the leader's weights | |
|         if not self.should_checkpoint(): | |
|             self.load_checkpoint() | |
|         # Leader waits for the followers to load the weights | |
|         dist.barrier() | |
|     def dataloader(self, dataset, batch_size): | |
|         train_sampler = ElasticDistributedSampler(dataset) | |
|         train_loader = DataLoader( | |
|             dataset, | |
|             batch_size=batch_size, | |
|             num_workers=4, | |
|             pin_memory=True, | |
|             sampler=train_sampler, | |
|         ) | |
|         return train_loader | |
|     def train_step(self, batch): | |
|         # Your batch processing step here | |
|         # ... | |
|         pass | |
|     def train(self, dataset, batch_size): | |
|         self.sync_weights() | |
|         ddp_model = torch.nn.parallel.DistributedDataParallel( | |
|             self.model, | |
|             device_ids=[self.device_id], | |
|             output_device=self.device_id | |
|         ) | |
|         loader = self.dataloader(dataset, batch_size) | |
|         for epoch in range(100): | |
|             for batch in iter(loader): | |
|                 self.train_step(batch) | |
|                 if self.should_checkpoint(): | |
|                     self.save_checkpoint() | |
| def main(): | |
|     # path, dataset and batch_size are placeholders to be defined by the user | |
|     trainer = Trainer() | |
|     trainer.load_checkpoint(path) | |
|     trainer.initialize() | |
|     trainer.train(dataset, batch_size) | |
| Note | |
| To bypass the Python GIL (Global Interpreter Lock), PyTorch spawns one process for each GPU. | |
| In the example above, this means at least 12 processes are spawned, at least 4 on each node. | |
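To visualise the 12 processes mentioned in the note, the Python sketch below enumerates one process per GPU for 3 nodes with 4 GPUs each and marks the single process that would save checkpoints. It assumes that global ranks are assigned contiguously node by node, which is a common layout with identical nodes but remains an assumption here. The function name rank_layout is hypothetical.

```python
def rank_layout(nnodes, nproc_per_node):
    layout = []
    for node in range(nnodes):
        for local_rank in range(nproc_per_node):
            # Assumed contiguous numbering: node 0 holds ranks 0..3, node 1 holds 4..7, ...
            global_rank = node * nproc_per_node + local_rank
            saves = global_rank == 0 and local_rank == 0
            layout.append((node, local_rank, global_rank, saves))
    return layout

for node, local_rank, global_rank, saves in rank_layout(3, 4):
    print(f'node {node} local_rank {local_rank} rank {global_rank} saves={saves}')
```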
| " | |
| Frequently asked questions (FAQs),https://docs.mila.quebec/Userguide.html#frequently-asked-questions-faqs,"Frequently asked questions (FAQs) | |
| " | |
| Connection/SSH issues,https://docs.mila.quebec/Userguide.html#connection-ssh-issues,"Connection/SSH issues | |
| " | |
| I’m getting connection refused while trying to connect to a login node,https://docs.mila.quebec/Userguide.html#i-m-getting-connection-refused-while-trying-to-connect-to-a-login-node,"I’m getting connection refused while trying to connect to a login node | |
| Login nodes are protected against brute force attacks and might ban your IP if | |
| they detect too many connections/failures. You will be automatically unbanned | |
| after 1 hour. For any further problem, please submit a support ticket. | |
| " | |
| Shell issues,https://docs.mila.quebec/Userguide.html#shell-issues,"Shell issues | |
| " | |
| How do I change my shell ?,https://docs.mila.quebec/Userguide.html#how-do-i-change-my-shell,"How do I change my shell ? | |
| By default you will be assigned /bin/bash as a shell. If you would like to | |
| change to another one, please submit a support ticket. | |
| " | |
| SLURM issues,https://docs.mila.quebec/Userguide.html#slurm-issues,"SLURM issues | |
| " | |
| How can I get an interactive shell on the cluster ?,https://docs.mila.quebec/Userguide.html#how-can-i-get-an-interactive-shell-on-the-cluster,"How can I get an interactive shell on the cluster ? | |
| Use salloc [--slurm_options] without any executable at the end of the | |
| command; this will launch your default shell in an interactive session. Remember | |
| that an interactive session is bound to the login node where you start it, so you | |
| risk losing your job if the login node becomes unreachable. | |
| " | |
| How can I reset my cluster password ?,https://docs.mila.quebec/Userguide.html#how-can-i-reset-my-cluster-password,"How can I reset my cluster password ? | |
| To reset your password, please submit a support ticket. | |
| Warning: your cluster password is the same as your Google Workspace account password. So, | |
| after a reset, you must use the new password for all your Google services. | |
| " | |
| srun: error: --mem and --mem-per-cpu are mutually exclusive,https://docs.mila.quebec/Userguide.html#srun-error-mem-and-mem-per-cpu-are-mutually-exclusive,"srun: error: --mem and --mem-per-cpu are mutually exclusive | |
| You can safely ignore this: salloc has a default memory flag in case you | |
| don’t provide one. | |
| " | |
| How can I see where and if my jobs are running ?,https://docs.mila.quebec/Userguide.html#how-can-i-see-where-and-if-my-jobs-are-running,"How can I see where and if my jobs are running ? | |
| Use squeue -u YOUR_USERNAME to see the status and location of all your jobs. | |
| To get more info on a running job, try scontrol show job #JOBID | |
| " | |
| Unable to allocate resources: Invalid account or account/partition combination specified,https://docs.mila.quebec/Userguide.html#unable-to-allocate-resources-invalid-account-or-account-partition-combination-specified,"Unable to allocate resources: Invalid account or account/partition combination specified | |
| Chances are your account is not set up properly. You should submit a support ticket. | |
| " | |
| How do I cancel a job?,https://docs.mila.quebec/Userguide.html#how-do-i-cancel-a-job,"How do I cancel a job? | |
| To cancel a specific job, use scancel #JOBID | |
| To cancel all your jobs (running and pending), use scancel -u YOUR_USERNAME | |
| To cancel all your pending jobs only, use scancel -t PD | |
| " | |
| How can I access a node on which one of my jobs is running ?,https://docs.mila.quebec/Userguide.html#how-can-i-access-a-node-on-which-one-of-my-jobs-is-running,"How can I access a node on which one of my jobs is running ? | |
| You can ssh into a node on which you have a job running. Your ssh connection | |
| will be adopted by your job, i.e. if your job finishes, your ssh connection will | |
| be automatically terminated. In order to connect to a node, you need to have | |
| password-less ssh, either with a key present in your home or with an | |
| ssh-agent. You can generate a key on the login node like this: | |
| ssh-keygen  # press ENTER 3 times to accept the defaults | |
| cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys | |
| chmod 600 ~/.ssh/authorized_keys | |
| chmod 700 ~/.ssh | |
| " | |
| I’m getting Permission denied (publickey) while trying to connect to a node,https://docs.mila.quebec/Userguide.html#i-m-getting-permission-denied-publickey-while-trying-to-connect-to-a-node,"I’m getting Permission denied (publickey) while trying to connect to a node | |
| See previous question | |
| " | |
| Where do I put my data during a job ?,https://docs.mila.quebec/Userguide.html#where-do-i-put-my-data-during-a-job,"Where do I put my data during a job ? | |
| Your /home as well as the datasets are on shared file systems. It is | |
| recommended to copy them to $SLURM_TMPDIR to process them faster and | |
| leverage higher-speed local drives. If you run a low priority job subject to | |
| preemption, it’s better to save any output you want to keep on the shared file | |
| systems, because $SLURM_TMPDIR is deleted at the end of each job. | |
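As an illustration of the copy step described above, the Python sketch below stages a dataset folder into $SLURM_TMPDIR at the start of a job and returns the local path. The function name stage_to_tmpdir is an assumption, and the dataset path in the usage comment is only a placeholder.

```python
import os
import shutil
from pathlib import Path

def stage_to_tmpdir(source, environ=None):
    # Copy a folder from a shared file system to the job-local $SLURM_TMPDIR.
    environ = os.environ if environ is None else environ
    destination = Path(environ['SLURM_TMPDIR']) / Path(source).name
    if not destination.exists():
        shutil.copytree(source, destination)
    return destination

# Typical use inside a job (placeholder path):
# local_data = stage_to_tmpdir('/network/datasets/DATASET_NAME')
```

Results worth keeping should still be written back to a shared file system before the job ends.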
| " | |
| slurmstepd: error: Detected 1 oom-kill event(s) in step #####.batch cgroup,https://docs.mila.quebec/Userguide.html#slurmstepd-error-detected-1-oom-kill-event-s-in-step-batch-cgroup,"slurmstepd: error: Detected 1 oom-kill event(s) in step #####.batch cgroup | |
| You exceeded the amount of memory allocated to your job: either you did not | |
| request enough memory or you have a memory leak in your process. Try increasing | |
| the amount of memory requested with --mem= or --mem-per-cpu=. | |
| " | |
| fork: retry: Resource temporarily unavailable,https://docs.mila.quebec/Userguide.html#fork-retry-resource-temporarily-unavailable,"fork: retry: Resource temporarily unavailable | |
| You exceeded the limit of 2000 tasks/PIDs in your job. It probably means there | |
| is an issue with a sub-process spawning too many processes in your script. For | |
| any help with your software, please submit a support ticket. | |
| " | |
| PyTorch issues,https://docs.mila.quebec/Userguide.html#pytorch-issues,"PyTorch issues | |
| " | |
| "I randomly get INTERNAL ASSERT FAILED at ""../aten/src/ATen/MapAllocator.cpp"":263",https://docs.mila.quebec/Userguide.html#i-randomly-get-internal-assert-failed-at-aten-src-aten-mapallocator-cpp-263,"I randomly get INTERNAL ASSERT FAILED at ""../aten/src/ATen/MapAllocator.cpp"":263 | |
| You are using PyTorch 1.10.x and hitting #67864, | |
| for which the solution is PR #72232 | |
| merged in PyTorch 1.11.x. For an immediate fix, consider the following compilable Gist: | |
| hack.cpp. | |
| Compile the patch to hack.so and then export LD_PRELOAD=/absolute/path/to/hack.so | |
| before executing the Python process that imports a broken PyTorch 1.10. | |
| For Hydra users who are using the submitit launcher plug-in, the env_set key cannot | |
| be used to set LD_PRELOAD in the environment as it does so too late at runtime. The | |
| dynamic loader reads LD_PRELOAD only once and very early during the startup of any | |
| process, before the variable can be set from inside the process. The hack must therefore | |
| be injected using the setup key in the Hydra YAML config file: | |
| hydra: | |
|   launcher: | |
|     setup: | |
|       - export LD_PRELOAD=/absolute/path/to/hack.so | |
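The timing constraint described above also applies when launching the process from Python: the variable has to be in the child's environment before it starts. The sketch below shows one way to do that with subprocess. The helper names env_with_preload and run_python are assumptions made for this example, and the library path is the placeholder used above.

```python
import os
import subprocess
import sys

def env_with_preload(library, base=None):
    # Build the child's environment, keeping any library already preloaded.
    env = dict(os.environ if base is None else base)
    previous = env.get('LD_PRELOAD')
    env['LD_PRELOAD'] = f'{library}:{previous}' if previous else library
    return env

def run_python(script, library):
    # The variable is set before the child starts, so its dynamic loader sees it.
    return subprocess.run([sys.executable, script],
                          env=env_with_preload(library), check=True)

# run_python('train.py', '/absolute/path/to/hack.so')
```

Setting os.environ inside an already running interpreter would be too late for that interpreter, for the reason given above.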
| " | |
| Mila technical documentation,https://docs.mila.quebec/index.html#mila-technical-documentation,"Mila technical documentation | |
| Welcome to Mila’s technical documentation. If this is your first time here, we | |
| recommend you start by checking out the short quick start guide. | |
| Introduction | |
| Purpose of this documentation | |
| Intended audience | |
| Contributing | |
| How-tos and Guides | |
| User’s guide | |
| Quick Start | |
| Logging in to the cluster | |
| Running your code | |
| Portability concerns and solutions | |
| Singularity | |
| Sharing Data with ACLs | |
| Contributing datasets | |
| Data Transmission using Globus Connect Personal | |
| JupyterHub | |
| Advanced SLURM usage and Multiple GPU jobs | |
| Multiple Nodes | |
| Frequently asked questions (FAQs) | |
| AI tooling and methodology handbook | |
| Systems and services | |
| Computing infrastructure and policies | |
| Roles and authorizations | |
| Overview of available computing resources at Mila | |
| Node profile description | |
| Data sharing policies | |
| Monitoring | |
| Storage | |
| Data Transmission | |
| Computational resources outside of Mila | |
| Digital Research Alliance of Canada Clusters | |
| General theory | |
| What is a computer cluster? | |
| Parts of a computing cluster | |
| The login nodes | |
| The compute nodes | |
| The storage nodes | |
| Different nodes for different uses | |
| UNIX | |
| The workload manager | |
| Processing data | |
| Data parallelism | |
| Model parallelism | |
| Communication concerns | |
| Filesystem concerns | |
| Software on the cluster | |
| Cluster software modules | |
| Containers | |
| Python Virtual environments | |
| Extras | |
| Mila Datasets | |
| Audio and video resources at Mila | |
| Visual Studio Code | |
| Connecting to the cluster | |
| Activating an environment | |
| Troubleshooting | |
| Who, what, where is IDT | |
| IDT’s mission | |
| The IDT team | |
| Support | |
| To reach the Mila infrastructure support, please submit | |
| a support ticket. | |
| Contribution | |
| If you find any errors in the documentation, missing or unclear | |
| sections, or would simply like to contribute, please open an | |
| issue or make a pull request on the github page. | |
| " | |
| Audio and video resources at Mila,https://docs.mila.quebec/Audio_video.html#audio-and-video-resources-at-mila,"Audio and video resources at Mila | |
| See the intranet section on | |
| audio and video | |
| for complete information on audio and video systems made available at Mila. | |
| " | |