Update README.md (#2)
README.md CHANGED

# Elastic models

Elastic models are models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:

* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.

__Goals of elastic models:__

* Provide clear quality and latency benchmarks
* Provide the interface of HF libraries (transformers and diffusers) with a single line of code
* Provide models supported on a wide range of hardware, pre-compiled and requiring no JIT
* Provide the best models and service for self-hosting

> Note that the exact quality degradation varies from model to model. An S model, for instance, may show as little as 0.5% degradation.

## Inference

To run inference with our models, you just need to replace the `transformers` import with `elastic_models.transformers`:

```python
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

# Currently we require your HF token, as we use the original weights
# for part of the layers and for the model configuration
model_name = "mistralai/Mistral-7B-Instruct-v0.3"
hf_token = ''
hf_cache_dir = ''
device = torch.device("cuda")

# Create the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    cache_dir=hf_cache_dir,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa"
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference is as simple as with the transformers library
prompt = "Describe basics of DNNs quantization."
inputs = tokenizer(prompt, return_tensors="pt")
inputs.to(device)

# The generate call was elided in the diff; a standard
# transformers-style call is assumed here
generate_ids = model.generate(**inputs, max_new_tokens=300)
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

# Print the answer
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```
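
Since the model keeps the usual `transformers` generation interface, standard decoding options should carry over unchanged. A minimal sketch, reusing `model`, `tokenizer`, and `device` from the snippet above (the prompt and sampling values here are illustrative, not from this README):

```python
# Hypothetical follow-up prompt with explicit sampling settings,
# written exactly as for stock transformers.
chat = tokenizer("Explain int8 vs bf16 trade-offs.", return_tensors="pt").to(device)
sampled_ids = model.generate(
    **chat,
    max_new_tokens=200,
    do_sample=True,   # sample instead of greedy decoding
    temperature=0.7,  # soften the token distribution
    top_p=0.9,        # nucleus sampling cutoff
)
print(tokenizer.batch_decode(sampled_ids, skip_special_tokens=True)[0])
```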

### Installation

__GPUs__: H100, L40s

__OS__: Linux #TODO

__Python__: 3.10-3.12

To work with our models, install the packages:

```shell
pip install thestage
pip install elastic_models
```

Then go to app.thestage.ai, log in, and generate an API token on your profile page. Set the API token as follows:

```shell
thestage config set --api-token <YOUR_API_TOKEN>
```

Congrats, now you can use accelerated models!
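
A quick way to confirm the setup (a minimal smoke test; the import path comes from the inference example above):

```python
# If this import succeeds, elastic_models is installed and on the path.
from elastic_models.transformers import AutoModelForCausalLM
print(AutoModelForCausalLM.__name__)
```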

----

## Benchmarks

For quality evaluation we have used: #TODO link to github

| Benchmark   | S | M | L | XL | Original | W8A8, int8 |
|-------------|---|---|---|----|----------|------------|
| Winogrande  | 0 | 0 | 0 | 0  | 0        | 0          |

> __MMLU__: Evaluates general knowledge and problem solving across 57 subjects.

> __PIQA__: Evaluates physical commonsense reasoning.

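The evaluation tooling is still a #TODO above; purely as an illustration, benchmarks of this kind can be scored with EleutherAI's lm-evaluation-harness (an assumption, not necessarily the harness used for the table above):

```python
# Hypothetical quality evaluation via lm-evaluation-harness (pip install lm-eval).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-Instruct-v0.3,dtype=bfloat16",
    tasks=["mmlu", "piqa", "winogrande"],
)
print(results["results"])
```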
### Latency benchmarks

We have profiled the models in two scenarios; throughput is reported in tokens per second:

<table>
<tr><th> 100 input / 300 output; tok/s </th><th> 1000 input / 1000 output; tok/s </th></tr>
<tr><td>

| GPU/Model | S   | M | L | XL | Original | W8A8, int8 |
|-----------|-----|---|---|----|----------|------------|
| H100      | 189 | 0 | 0 | 0  | 48       | 0          |
| L40s      | 79  | 0 | 0 | 0  | 42       | 0          |

</td><td>

| GPU/Model | S   | M | L | XL | Original | W8A8, int8 |
|-----------|-----|---|---|----|----------|------------|
| H100      | 189 | 0 | 0 | 0  | 48       | 0          |
| L40s      | 79  | 0 | 0 | 0  | 42       | 0          |

</td></tr>
</table>
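
Numbers of this shape can be reproduced with a rough probe like the one below (a sketch only: the synthetic prompt, warmup, and forced output length are assumptions, and it reuses `model`, `tokenizer`, and `device` from the inference example):

```python
import time
import torch

def tokens_per_second(model, tokenizer, device, n_input=100, n_output=300):
    # Synthetic prompt of roughly n_input tokens.
    prompt_ids = torch.full(
        (1, n_input), tokenizer.eos_token_id, dtype=torch.long, device=device
    )
    attention_mask = torch.ones_like(prompt_ids)

    # Warmup so one-time costs do not skew the measurement.
    model.generate(prompt_ids, attention_mask=attention_mask, max_new_tokens=8)

    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(
        prompt_ids,
        attention_mask=attention_mask,
        max_new_tokens=n_output,
        min_new_tokens=n_output,  # force the full output length
        do_sample=False,
    )
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return (out.shape[1] - n_input) / elapsed

print(f"{tokens_per_second(model, tokenizer, device):.1f} tok/s")
```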

## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Elastic models Github__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
* __Contact email__: [email protected]
