## How to run some of the code in this repository

### 1. Make sure Docker is installed on your machine
### 2. Clone the repository
### 3. `cd` into the repository
### 4. Run the following command to build the Docker image
```bash
docker compose build oc-prototype
```
### 5. Run the following commands to start the container and open a shell inside it
```bash
docker compose up -d oc-prototype
docker exec -it oc-prototype /bin/bash
```


## Prototype TODOs

## Data
- [X] Process all misinfo claims and generate embeddings for a library namespace
- [X] Upsert claims into Pinecone
- [X] Upsert 300k into namespace
- [ ] Update claim format to be similar to: https://www.kaggle.com/datasets/shivkumarganesh/politifact-factcheck-data/data

## Functions
- [X] Upsert vector 
- [X] Batch upsert 
- [X] Query against metadata

- [ ] Generate working Dockerfile for project reproducibility 
- [ ] Load data into a database 
- [ ] Test precision/recall of embeddings
- [ ] Generate working version of climate demo 


Embedding pricing:

- Embeddings cost $0.0001 per 1,000 tokens, which is the same as $0.10 per 1M tokens.
- 1 token is approximately 0.75 words, so 1,000 tokens is about 750 words.
- 1 token is also roughly 4 characters, so 1,000 tokens ($0.0001) covers about 4 KB of text.
- On that basis, the cost of embedding a dataset can be approximated as:

Cost in $ = Size of data in kilobytes * 0.000025
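
The rule of thumb above can be turned into a small estimator. This is a minimal sketch rather than code from this repository: the function names are hypothetical, the 4 characters per token figure is only an approximation, and 1 KB is assumed to mean 1,000 single-byte characters.

```python
# Rates come from the pricing notes above; names and structure are illustrative.
PRICE_PER_1K_TOKENS_USD = 0.0001  # equivalent to $0.10 per 1M tokens
CHARS_PER_TOKEN = 4               # rough approximation, varies by text and tokenizer
BYTES_PER_KB = 1000               # assumption: 1 KB is about 1,000 characters


def estimate_embedding_cost(size_kb: float) -> float:
    """Approximate embedding cost in USD for size_kb kilobytes of text."""
    tokens = size_kb * BYTES_PER_KB / CHARS_PER_TOKEN
    return tokens / 1000 * PRICE_PER_1K_TOKENS_USD


def estimate_cost_for_text(text: str) -> float:
    """Approximate embedding cost in USD for a string, based on its UTF-8 size."""
    size_kb = len(text.encode("utf-8")) / BYTES_PER_KB
    return estimate_embedding_cost(size_kb)
```

With these assumptions, 1 KB of text comes to about $0.000025, and 1 GB (1,000,000 KB) comes to roughly $25. For an exact figure, count tokens with the embedding model's own tokenizer instead of estimating from size.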

Credentials for running Google Cloud queries: see `ostreacultura-credentials.json`