## How to run some of the code in this repository
### 1. Make sure Docker is installed on your machine
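A quick way to confirm this step is to ask the Docker CLI for its version. The sketch below assumes the Compose v2 plugin (`docker compose`) is what the later steps rely on; `check_docker` is a hypothetical helper name, not something defined in this repository.

```bash
# Report whether the Docker CLI and the Compose v2 plugin are available.
# check_docker is a hypothetical helper used only for illustration.
check_docker() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker: not found"
    return 1
  fi
  echo "docker: $(docker --version)"
  if docker compose version >/dev/null 2>&1; then
    echo "compose: $(docker compose version)"
  else
    echo "compose: plugin not found"
    return 1
  fi
}

check_docker || echo "Install Docker (with the Compose plugin) before continuing."
```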
### 2. Clone the repository
### 3. `cd` into the repository
### 4. Run the following command to build the Docker image
```bash
# Build the image for the oc-prototype service defined in the Compose file
docker compose build oc-prototype
```
### 5. Run the following commands to start the container and open a shell in it
```bash
# Start the oc-prototype service in the background
docker compose up -d oc-prototype
# Open an interactive shell inside the running service container
docker compose exec oc-prototype /bin/bash
```
## Prototype TODOs
### Data
- [X] Process all misinfo claims and generate embeddings for a library namespace
- [X] Upsert claims into Pinecone
- [X] Upsert 300k into namespace
- [ ] Update claim format to be similar to: https://www.kaggle.com/datasets/shivkumarganesh/politifact-factcheck-data/data
### Functions
- [X] Upsert vector
- [X] Batch upsert
- [X] Query against metadata
- [ ] Generate working Dockerfile for project reproducibility
- [ ] Load data into a database
- [ ] Test precision/recall of embeddings
- [ ] Generate working version of climate demo
## Embedding pricing
Embeddings are billed at $0.0001 per 1,000 tokens ($0.10 per 1M tokens).

One token is approximately 0.75 words, so 1,000 tokens is about 750 words. That works out to roughly 4 characters per token, or about 4 KB of text per 1,000 tokens (assuming one byte per character).

On that basis, the embedding cost can be approximated from the size of the data:

Cost in $ = Size of data in kilobytes * 0.000025
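The rule of thumb above can be sketched as a small shell function. It takes a size in bytes and applies the two figures from these notes (about 4 bytes of text per token, $0.0001 per 1,000 tokens). The function name `estimate_embedding_cost` is hypothetical, and the result is only an approximation, since real token counts depend on the tokenizer and the text.

```bash
# Rough embedding cost estimate in dollars.
# Assumptions (from the notes above): ~4 bytes of text per token,
# and a price of $0.0001 per 1,000 tokens.
# estimate_embedding_cost is a hypothetical helper name.
estimate_embedding_cost() {
  awk -v bytes="$1" 'BEGIN {
    tokens = bytes / 4                # approximate token count
    cost = tokens / 1000 * 0.0001     # price per 1,000 tokens
    printf "%.6f\n", cost
  }'
}

estimate_embedding_cost 4000      # 4 KB of text
estimate_embedding_cost 1000000   # 1 MB of text
```

To estimate the cost for a file, pass its size in bytes, for example `estimate_embedding_cost "$(wc -c < claims.csv)"`, where `claims.csv` stands in for whatever data file is being embedded.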
Credentials for running Google Cloud queries: see `ostreacultura-credentials.json`
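If that file is a Google Cloud service-account key (an assumption; these notes do not say what it contains), Google client libraries can typically be pointed at it through the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. A minimal sketch, assuming the file sits in the current directory:

```bash
# Assumption: the credentials file is a service-account key located in
# the current directory. Adjust the path if it lives elsewhere.
export GOOGLE_APPLICATION_CREDENTIALS="$PWD/ostreacultura-credentials.json"
```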