## How to run some of the code in this repository

### 1. Make sure Docker is installed on your machine
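
One way to confirm the prerequisites is to check that the required tools are on your `PATH`. This is a minimal sketch: `check_prereqs` is a hypothetical helper name introduced here, not something shipped in the repository, and including `git` is an assumption based on the clone step.

```bash
# Hypothetical helper: report whether each named tool is on PATH.
# Returns non-zero if at least one tool is missing.
check_prereqs() {
  local missing=0
  local tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Assumed prerequisites: docker (this step) and git (for the clone step).
check_prereqs docker git || echo "Install the missing tools before continuing."
```

If Docker is reported as present, `docker compose version` should also confirm that the Compose plugin used in the later steps is available.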

### 2. Clone the repository

### 3. Change into the repository directory

### 4. Build the Docker image

```bash
# Build the image for the oc-prototype service defined in the Compose file
docker compose build oc-prototype
```
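
### 5. Start the container and open a shell inside it

```bash
# Start the oc-prototype service in the background
docker compose up -d oc-prototype

# Open an interactive shell in the running container
# (assumes the container is named oc-prototype)
docker exec -it oc-prototype /bin/bash
```
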
## Prototype TODOs

### Data

- [X] Process all misinfo claims and generate embeddings for a library namespace
- [X] Upsert claims into Pinecone
- [X] Upsert 300k into namespace
- [ ] Update claim format to be similar to: https://www.kaggle.com/datasets/shivkumarganesh/politifact-factcheck-data/data

### Functions

- [X] Upsert vector
- [X] Batch upsert
- [X] Query against metadata

- [ ] Create a working Dockerfile for project reproducibility
- [ ] Load data into a database
- [ ] Test precision/recall of embeddings
- [ ] Build a working version of the climate demo

## Embedding pricing

Embeddings are billed at $0.0001 per 1,000 tokens ($0.10 per 1M tokens).

- 1 token is approximately 0.75 words, so 1,000 tokens is about 750 words.
- 1 token is also about 4 characters, so 1,000 tokens is about 4 KB of text.

At $0.0001 per 4 KB of text, the rate works out to $0.000025 per KB, so the cost of embedding a dataset can be approximated as:

`Cost in $ = Size of data in kilobytes * 0.000025`
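
The estimate above can be sketched as a small shell helper. This is an illustration only: `estimate_embedding_cost` is a hypothetical name, and it assumes roughly 4 characters (bytes) per token, 1 KB = 1,000 bytes, and the $0.0001 per 1,000 tokens rate quoted above.

```bash
# Hypothetical helper: estimate embedding cost in USD from a size in bytes.
# Assumes ~4 bytes per token and $0.0001 per 1,000 tokens,
# which is $0.000025 per kilobyte (1 KB = 1,000 bytes here).
estimate_embedding_cost() {
  local bytes="$1"
  awk -v b="$bytes" 'BEGIN { printf "%.6f\n", (b / 1000) * 0.000025 }'
}

# 4,000 bytes is about 4 KB of text, or roughly 1,000 tokens.
estimate_embedding_cost 4000
```

To estimate the cost for a file, pass its size in bytes, for example `estimate_embedding_cost "$(wc -c < claims.jsonl)"`, where `claims.jsonl` is a placeholder file name. Actual token counts depend on the tokenizer and the text, so treat the result as a rough figure.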

Credentials for running Google Cloud queries: see `ostreacultura-credentials.json`.