## How to run some of the code in this repository

### 1. Make sure Docker is installed on your machine

### 2. Clone the repository

### 3. `cd` into the repository

### 4. Run the following command to build the Docker image

```bash
docker compose build
```

### 5. Run the following commands to start the container and open a shell inside it

```bash
docker compose up -d oc-prototype
docker exec -it oc-prototype /bin/bash
```

## Prototype TODOs

## Data

- [X] Process all misinformation claims and generate embeddings for a library namespace
- [X] Upsert claims into Pinecone
- [X] Upsert 300k into namespace
- [ ] Update claim format to be similar to: https://www.kaggle.com/datasets/shivkumarganesh/politifact-factcheck-data/data

## Functions

- [X] Upsert vector
- [X] Batch upsert
- [X] Query against metadata
- [ ] Generate working Dockerfile for project reproducibility
- [ ] Load data into a database
- [ ] Test precision/recall of embeddings
- [ ] Generate working version of climate demo

Embedding pricing:

- 1 token is approximately 0.75 words, so 1k tokens is about 750 words.
- Pricing is per 1k tokens: $0.0001 per 1k tokens, which is $0.10 per 1M tokens.
- At about 4 characters per token, 1k tokens correspond to roughly 4 KB of embedding text, so 4 KB of text costs about $0.0001.
- On that basis, the cost of an embedding job can be approximated as:

  Cost in $ = Size of data in kilobytes * 0.000025

Credentials for running Google Cloud queries: see ostreacultura-credentials.json
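
The embedding pricing rule of thumb above can be turned into a quick estimate. The following is a minimal sketch, not code from this repository: the function name `estimate_embedding_cost` and the constants are hypothetical, and it assumes plain text at roughly 1 byte per character, 1 KB = 1000 bytes, and the $0.0001 per 1k tokens price quoted above.

```python
# Rough embedding cost estimate based on the rule of thumb in this README.
# Assumptions (not verified against any provider's current pricing):
#   - $0.0001 per 1k tokens
#   - about 4 characters per token
#   - 1 character is about 1 byte, and 1 KB = 1000 bytes

PRICE_PER_1K_TOKENS = 0.0001  # dollars, as quoted in this README
CHARS_PER_TOKEN = 4           # approximate average


def estimate_embedding_cost(size_kb: float) -> float:
    """Return the approximate cost in dollars of embedding `size_kb` kilobytes of text."""
    tokens = size_kb * 1000 / CHARS_PER_TOKEN
    return tokens / 1000 * PRICE_PER_1K_TOKENS


if __name__ == "__main__":
    # Example: a 1 GB text corpus (1,000,000 KB)
    print(f"${estimate_embedding_cost(1_000_000):.2f}")
```

This reduces to the same formula as above (size in KB * 0.000025), so a 1 GB corpus comes out at about $25. Actual token counts depend on the tokenizer and the text, so treat the result as an order-of-magnitude figure.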