Upload 24 files
Browse files- .gitattributes +2 -0
- data/Climate Misinformation claims.csv +81 -0
- data/Combined Misinformation Library.csv +197 -0
- data/Indicator_Development.csv +0 -0
- data/Indicator_Test.csv +0 -0
- data/Modified Misinformation Library.csv +97 -0
- data/climate_data/data/README.txt +46 -0
- data/expansive_claims_library_expanded_embed.csv +0 -0
- data/filtered_fact_check_latest_embed.csv +3 -0
- data/random_300k.csv +3 -0
- src/Embeddings.jl +186 -0
- src/Models.jl +182 -0
- src/OstreaCultura.jl +25 -0
- src/PyPineCone.jl +415 -0
- src/bash/update_fact_checks.sh +14 -0
- src/deprecated/Narrative.jl +242 -0
- src/deprecated/NarrativeClassification.jl +107 -0
- src/dev/Utils.jl +73 -0
- src/py_init.jl +14 -0
- src/python/DataLoader.py +344 -0
- src/python/MiniEncoder.py +10 -0
- src/python/__pycache__/DataLoader.cpython-310.pyc +0 -0
- src/python/__pycache__/DataLoader.cpython-312.pyc +0 -0
- src/python/update_fact_check_data.py +83 -0
- src/python/upload_library_hybrid-sparse.py +107 -0
.gitattributes
CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
|
33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
36 |
+
data/filtered_fact_check_latest_embed.csv filter=lfs diff=lfs merge=lfs -text
|
37 |
+
data/random_300k.csv filter=lfs diff=lfs merge=lfs -text
|
data/Climate Misinformation claims.csv
ADDED
@@ -0,0 +1,81 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
Topic,Narrative,Claim,Instances
|
2 |
+
Climate Change,Global warming is not happening,Ice isn't melting,Antarctica is gaining ice/not warming
|
3 |
+
Climate Change,Global warming is not happening,Ice isn't melting,Greenland is gaining ice/not melting
|
4 |
+
Climate Change,Global warming is not happening,Ice isn't melting,Arctic sea ice isn't vanishing
|
5 |
+
Climate Change,Global warming is not happening,Glaciers aren't vanishing,Glaciers aren't vanishing
|
6 |
+
Climate Change,Global warming is not happening,We're heading into an ice age/global cooling,We're heading into an ice age/global cooling
|
7 |
+
Climate Change,Global warming is not happening,Weather is cold/snowing,Weather is cold/snowing
|
8 |
+
Climate Change,Global warming is not happening,Climate hasn't warmed/changed over the last (few) decade(s),Climate hasn't warmed/changed over the last (few) decade(s)
|
9 |
+
Climate Change,Global warming is not happening,Oceans are cooling/not warming,Oceans are cooling/not warming
|
10 |
+
Climate Change,Global warming is not happening,Sea level rise is exaggerated/not accelerating,Sea level rise is exaggerated/not accelerating
|
11 |
+
Climate Change,Global warming is not happening,Extreme weather isn't increasing/has happened before/isn't linked to climate change,Extreme weather isn't increasing/has happened before/isn't linked to climate change
|
12 |
+
Climate Change,Global warming is not happening,They changed the name from 'global warming' to 'climate change',They changed the name from 'global warming' to 'climate change'
|
13 |
+
Climate Change,Climate change is not human caused,It's natural cycles/variation,It's the sun/cosmic rays/astronomical
|
14 |
+
Climate Change,Climate change is not human caused,It's natural cycles/variation,It's geological (includes volcanoes)
|
15 |
+
Climate Change,Climate change is not human caused,It's natural cycles/variation,It's the ocean/internal variability
|
16 |
+
Climate Change,Climate change is not human caused,It's natural cycles/variation,Climate has changed naturally/been warm in the past
|
17 |
+
Climate Change,Climate change is not human caused,It's natural cycles/variation,Human CO2 emissions are tiny compared to natural CO2 emission
|
18 |
+
Climate Change,Climate change is not human caused,It's natural cycles/variation,"It's non-greenhouse gas human climate forcings (aerosols, land use)"
|
19 |
+
Climate Change,Climate change is not human caused,There's no evidence for greenhouse effect/carbon dioxide driving climate change,Carbon dioxide is just a trace gas
|
20 |
+
Climate Change,Climate change is not human caused,There's no evidence for greenhouse effect/carbon dioxide driving climate change,Greenhouse effect is saturated/logarithmic
|
21 |
+
Climate Change,Climate change is not human caused,There's no evidence for greenhouse effect/carbon dioxide driving climate change,Carbon dioxide lags/not correlated with climate change
|
22 |
+
Climate Change,Climate change is not human caused,There's no evidence for greenhouse effect/carbon dioxide driving climate change,Water vapor is the most powerful greenhouse gas
|
23 |
+
Climate Change,Climate change is not human caused,There's no evidence for greenhouse effect/carbon dioxide driving climate change,There's no tropospheric hot spot
|
24 |
+
Climate Change,Climate change is not human caused,There's no evidence for greenhouse effect/carbon dioxide driving climate change,CO2 was higher in the past
|
25 |
+
Climate Change,Climate change is not human caused,CO2 is not rising/ocean pH is not falling,CO2 is not rising/ocean pH is not falling
|
26 |
+
Climate Change,Climate change is not human caused,Human CO2 emissions are miniscule/not raising atmospheric CO2,Human CO2 emissions are miniscule/not raising atmospheric CO2
|
27 |
+
Climate Change,Climate impacts/global warming is beneficial/not bad,Climate sensitivity is low/negative feedbacks reduce warming,Climate sensitivity is low/negative feedbacks reduce warming
|
28 |
+
Climate Change,Climate impacts/global warming is beneficial/not bad,Species/plants/reefs aren't showing climate impacts yet/are benefiting from climate change,Species/plants/reefs aren't showing climate impacts yet/are benefiting from climate change
|
29 |
+
Climate Change,Climate impacts/global warming is beneficial/not bad,Species/plants/reefs aren't showing climate impacts yet/are benefiting from climate change,Species can adapt to global warming
|
30 |
+
Climate Change,Climate impacts/global warming is beneficial/not bad,Species/plants/reefs aren't showing climate impacts yet/are benefiting from climate change,Polar bears are not in danger from climate change
|
31 |
+
Climate Change,Climate impacts/global warming is beneficial/not bad,Species/plants/reefs aren't showing climate impacts yet/are benefiting from climate change,Ocean acidification/coral impacts aren't serious
|
32 |
+
Climate Change,Climate impacts/global warming is beneficial/not bad,CO2 is beneficial/not a pollutant,CO2 is beneficial/not a pollutant
|
33 |
+
Climate Change,Climate impacts/global warming is beneficial/not bad,CO2 is beneficial/not a pollutant,CO2 is plant food
|
34 |
+
Climate Change,Climate impacts/global warming is beneficial/not bad,It's only a few degrees (or less),It's only a few degrees (or less)
|
35 |
+
Climate Change,Climate impacts/global warming is beneficial/not bad,Climate change does not contribute to human conflict/threaten national security,Climate change does not contribute to human conflict/threaten national security
|
36 |
+
Climate Change,Climate impacts/global warming is beneficial/not bad,Climate change doesn't negatively impact health,Climate change doesn't negatively impact health
|
37 |
+
Climate Change,Climate solutions won't work,Climate solutions won't work,Climate solutions won't work
|
38 |
+
Climate Change,Climate solutions won't work,Climate policies (mitigation or adaptation) are harmful,Climate policies (mitigation or adaptation) are harmful
|
39 |
+
Climate Change,Climate solutions won't work,Climate policies (mitigation or adaptation) are harmful,Climate policy will increase costs/harm economy/kill jobs
|
40 |
+
Climate Change,Climate solutions won't work,Climate policies (mitigation or adaptation) are harmful,Proposed action would weaken national security/national sovereignty/cause conflict
|
41 |
+
Climate Change,Climate solutions won't work,Climate policies (mitigation or adaptation) are harmful,Proposed action would actually harm the environment and species
|
42 |
+
Climate Change,Climate solutions won't work,Climate policies (mitigation or adaptation) are harmful,Future generations will be richer and better able to adapt
|
43 |
+
Climate Change,Climate solutions won't work,Climate policies (mitigation or adaptation) are harmful,Climate policy limits liberty/freedom/capitalism
|
44 |
+
Climate Change,Climate solutions won't work,Climate policies are ineffective/flawed,Climate policies are ineffective/flawed
|
45 |
+
Climate Change,Climate solutions won't work,Climate policies are ineffective/flawed,Clean energy/green jobs/businesses won't work
|
46 |
+
Climate Change,Climate solutions won't work,Climate policies are ineffective/flawed,Markets/private sector are economically more efficient than government policies
|
47 |
+
Climate Change,Climate solutions won't work,Climate policies are ineffective/flawed,Climate policy will make negligible difference to climate change
|
48 |
+
Climate Change,Climate solutions won't work,Climate policies are ineffective/flawed,A single country/region only contributes a small % of global emissions
|
49 |
+
Climate Change,Climate solutions won't work,Climate policies are ineffective/flawed,Better to adapt/geoengineer/increase resiliency
|
50 |
+
Climate Change,Climate solutions won't work,Climate policies are ineffective/flawed,Climate action is pointless because of China/India/other countries' emissions
|
51 |
+
Climate Change,Climate solutions won't work,Climate policies are ineffective/flawed,We should invest in technology/reduce poverty/disease first
|
52 |
+
Climate Change,Climate solutions won't work,It's too hard to solve,It's too hard to solve
|
53 |
+
Climate Change,Climate solutions won't work,It's too hard to solve,Climate policy is politically/legally/economically/technically too difficult
|
54 |
+
Climate Change,Climate solutions won't work,It's too hard to solve,Media/public support/acceptance is low/decreasing
|
55 |
+
Climate Change,Climate solutions won't work,Clean energy technology/biofuels won't work,Clean energy technology/biofuels won't work
|
56 |
+
Climate Change,Climate solutions won't work,Clean energy technology/biofuels won't work,Clean energy/biofuels are too expensive/unreliable/counterproductive/harmful
|
57 |
+
Climate Change,Climate solutions won't work,Clean energy technology/biofuels won't work,Carbon Capture & Sequestration (CCS) is unproven/expensive
|
58 |
+
Climate Change,Climate solutions won't work,"People need energy (e.g., from fossil fuels/nuclear)
|
59 |
+
","People need energy (e.g., from fossil fuels/nuclear)
|
60 |
+
"
|
61 |
+
Climate Change,Climate solutions won't work,"People need energy (e.g., from fossil fuels/nuclear)
|
62 |
+
",Fossil fuel reserves are plentiful
|
63 |
+
Climate Change,Climate solutions won't work,"People need energy (e.g., from fossil fuels/nuclear)
|
64 |
+
",Fossil fuels are cheap/good/safe for society/economy/environment
|
65 |
+
Climate Change,Climate solutions won't work,"People need energy (e.g., from fossil fuels/nuclear)
|
66 |
+
",Nuclear power is safe/good for society/economy/environment
|
67 |
+
Climate Change,Climate movement/science is unreliable,Climate movement/science is unreliable,Climate movement/science is unreliable
|
68 |
+
Climate Change,Climate movement/science is unreliable,"Climate-related science is uncertain/unsound/unreliable (data , methods & models)","Climate-related science is uncertain/unsound/unreliable (data , methods & models)"
|
69 |
+
Climate Change,Climate movement/science is unreliable,"Climate-related science is uncertain/unsound/unreliable (data , methods & models)",There's no scientific consensus on climate/the science isn't settled
|
70 |
+
Climate Change,Climate movement/science is unreliable,"Climate-related science is uncertain/unsound/unreliable (data , methods & models)",Proxy data is unreliable (includes hockey stick)
|
71 |
+
Climate Change,Climate movement/science is unreliable,"Climate-related science is uncertain/unsound/unreliable (data , methods & models)",Temperature record is unreliable
|
72 |
+
Climate Change,Climate movement/science is unreliable,"Climate-related science is uncertain/unsound/unreliable (data , methods & models)",Models are wrong/unreliable/uncertain
|
73 |
+
Climate Change,Climate movement/science is unreliable,Climate movement is alarmist/wrong/political/biased/hypocritical (people or groups),Climate movement is alarmist/wrong/political/biased/hypocritical (people or groups)
|
74 |
+
Climate Change,Climate movement/science is unreliable,Climate movement is alarmist/wrong/political/biased/hypocritical (people or groups),Climate movement is religion
|
75 |
+
Climate Change,Climate movement/science is unreliable,Climate movement is alarmist/wrong/political/biased/hypocritical (people or groups),Media (including bloggers) is alarmist/wrong/political/biased
|
76 |
+
Climate Change,Climate movement/science is unreliable,Climate movement is alarmist/wrong/political/biased/hypocritical (people or groups),Politicians/government/UN are alarmist/wrong/political/biased
|
77 |
+
Climate Change,Climate movement/science is unreliable,Climate movement is alarmist/wrong/political/biased/hypocritical (people or groups),Environmentalists are alarmist/wrong/political/biased
|
78 |
+
Climate Change,Climate movement/science is unreliable,Climate movement is alarmist/wrong/political/biased/hypocritical (people or groups),Scientists/academics are alarmist/wrong/political/biased
|
79 |
+
Climate Change,Climate movement/science is unreliable,Climate change (science or policy) is a conspiracy (deception),Climate change (science or policy) is a conspiracy (deception)
|
80 |
+
Climate Change,Climate movement/science is unreliable,Climate change (science or policy) is a conspiracy (deception),Climate policy/renewables is a hoax/scam/conspiracy/secretive
|
81 |
+
Climate Change,Climate movement/science is unreliable,Climate change (science or policy) is a conspiracy (deception),Climate science is a hoax/scam/conspiracy/secretive/money-motivated (includes climategate)
|
data/Combined Misinformation Library.csv
ADDED
@@ -0,0 +1,197 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
Model,Topic,Narrative,Claims,Counterclaims,Harm 1,Harm 2
|
2 |
+
Climate Change,Climate Change,Global warming is not happening,Antarctica is gaining ice/not warming,Antarctica is warming,,
|
3 |
+
Climate Change,Climate Change,Global warming is not happening,Greenland is gaining ice/not melting,Greenland is warming,,
|
4 |
+
Climate Change,Climate Change,Global warming is not happening,Arctic sea ice isn't vanishing,Arctic sea ice is vanishing,,
|
5 |
+
Climate Change,Climate Change,Global warming is not happening,Glaciers aren't vanishing,Glaciers are vanishing,,
|
6 |
+
Climate Change,Climate Change,Global warming is not happening,We're heading into an global cooling,We're heading into global warming,,
|
7 |
+
Climate Change,Climate Change,Global warming is not happening,It is cold so global warming isn't happening,It is cold but global warming is still happening,,
|
8 |
+
Climate Change,Climate Change,Global warming is not happening,Climate hasn't changed over the past few decades,Climate has changed ,,
|
9 |
+
Climate Change,Climate Change,Global warming is not happening,Oceans are not warming,Oceans are warming,,
|
10 |
+
Climate Change,Climate Change,Global warming is not happening,Sea level rise is exaggerated,Sea level rise is not exaggerated,,
|
11 |
+
Climate Change,Climate Change,Global warming is not happening,Sea level rise is exaggerated/not accelerating,Sea level rise is accelerating,,
|
12 |
+
Climate Change,Climate Change,Global warming is not happening,Extreme weather isn't increasing/has happened before/isn't linked to climate change,Extreme weather is linked to climate change,,
|
13 |
+
Climate Change,Climate Change,Global warming is not happening,Extreme weather isn't increasing,Extreme weather is increasing,,
|
14 |
+
Climate Change,Climate Change,Global warming is not happening,They changed the name from 'global warming' to 'climate change',They didn't change the name to climate change,,
|
15 |
+
Climate Change,Climate Change,Climate change is not human caused,Climate change is from cosmic rays,Climate change is not caused by cosmic rays,,
|
16 |
+
Climate Change,Climate Change,Climate change is not human caused,Climate change is from astronomical forces,Climate change is not caused by astronomical forces,,
|
17 |
+
Climate Change,Climate Change,Climate change is not human caused,Climate change is from volcanos,Climate change is not from volcanos,,
|
18 |
+
Climate Change,Climate Change,Climate change is not human caused,Climate change is caused by the oceans,Climate change is not caused by the oceans,,
|
19 |
+
Climate Change,Climate Change,Climate change is not human caused,Climate change is caused by natural cycles,Climate change is not caused by natural cycles,,
|
20 |
+
Climate Change,Climate Change,Climate change is not human caused,Climate change is normal or natural,Climate change is not normal or natural,,
|
21 |
+
Climate Change,Climate Change,Climate change is not human caused,Human CO2 emissions are tiny compared to natural CO2 emission,Human CO2 emissions are not tiny,,
|
22 |
+
Climate Change,Climate Change,Climate change is not human caused,"It's non-greenhouse gas human climate forcings (aerosols, land use)",,,
|
23 |
+
Climate Change,Climate Change,Climate change is not human caused,Carbon dioxide is just a trace gas,,,
|
24 |
+
Climate Change,Climate Change,Climate change is not human caused,Greenhouse effect is logarithmic,The greenhouse effect is not logarithmic,,
|
25 |
+
Climate Change,Climate Change,Climate change is not human caused,Greenhouse effect is saturated,The greenhouse effect is not saturated,,
|
26 |
+
Climate Change,Climate Change,Climate change is not human caused,Carbon dioxide lags climate change,Carbon dioxide does not lag climate change,,
|
27 |
+
Climate Change,Climate Change,Climate change is not human caused,Carbon dioxide is not correlated with climate change,Carbon dioxide is correlated with climate change,,
|
28 |
+
Climate Change,Climate Change,Climate change is not human caused,Water vapor is the most powerful greenhouse gas,Water vapor is not the most powerful greenhouse gas,,
|
29 |
+
Climate Change,Climate Change,Climate change is not human caused,There is no tropospheric hot spot,There is a tropospheric hot spot,,
|
30 |
+
Climate Change,Climate Change,Climate change is not human caused,CO2 was higher in the past,CO2 is higher today,,
|
31 |
+
Climate Change,Climate Change,Climate change is not human caused,CO2 is not rising,CO2 is not rising,,
|
32 |
+
Climate Change,Climate Change,Climate change is not human caused,Ocean pH is not falling,Ocean pH is falling,,
|
33 |
+
Climate Change,Climate Change,Climate change is not human caused,Human CO2 emissions are not raising atmospheric CO2,Human CO2 emissions are raising atmospheric CO2,,
|
34 |
+
Climate Change,Climate Change,Climate impacts/global warming is beneficial/not bad,Negative feedbacks reduce warming,Negative feedbacks do not reduce climate change,,
|
35 |
+
Climate Change,Climate Change,Climate impacts/global warming is beneficial/not bad,Life is not showing signs of climate change,Life is showing signs of climate change,,
|
36 |
+
Climate Change,Climate Change,Climate impacts/global warming is beneficial/not bad,Life is benefiting from climate change,Life is not benefiting from climate change,,
|
37 |
+
Climate Change,Climate Change,Climate impacts/global warming is beneficial/not bad,Species can adapt to climate change,species cannot adapt to climate change in time,,
|
38 |
+
Climate Change,Climate Change,Climate impacts/global warming is beneficial/not bad,Polar bears are not in danger from climate change,Polar bears are in danger from climate change,,
|
39 |
+
Climate Change,Climate Change,Climate impacts/global warming is beneficial/not bad,Ocean acidification is not serious,Ocean acidification is serious,,
|
40 |
+
Climate Change,Climate Change,Climate impacts/global warming is beneficial/not bad,Climate impact on coral isn't serious,Climate impact on coral is serious,,
|
41 |
+
Climate Change,Climate Change,Climate impacts/global warming is beneficial/not bad,CO2 is not a pollutant,,,
|
42 |
+
Climate Change,Climate Change,Climate impacts/global warming is beneficial/not bad,CO2 is beneficial to the environment,,,
|
43 |
+
Climate Change,Climate Change,Climate impacts/global warming is beneficial/not bad,CO2 is plant food,,,
|
44 |
+
Climate Change,Climate Change,Climate impacts/global warming is beneficial/not bad,Climate change is only a few degrees ,Climate change is a big temperature change,,
|
45 |
+
Climate Change,Climate Change,Climate impacts/global warming is beneficial/not bad,Climate change does not contribute to human conflict/threaten national security,,,
|
46 |
+
Climate Change,Climate Change,Climate impacts/global warming is beneficial/not bad,Climate change doesn't negatively impact health,Climate change does negatively impact health,,
|
47 |
+
Climate Change,Climate Change,Climate solutions won't work,Climate solutions won't work,Climate solutions will work,,
|
48 |
+
Climate Change,Climate Change,Climate solutions won't work,Climate policies are harmful,Climate policies are not harmful,,
|
49 |
+
Climate Change,Climate Change,Climate solutions won't work,Climate policy will reduce jobs,Climate policies will not reduce jobs,,
|
50 |
+
Climate Change,Climate Change,Climate solutions won't work,Climate policy will harm the economy,Climate policy will not harm the economy,,
|
51 |
+
Climate Change,Climate Change,Climate solutions won't work,Climate policy would weaken national security/national sovereignty/cause conflict,Climate policies would weaken national security ,,
|
52 |
+
Climate Change,Climate Change,Climate solutions won't work,Climate policies would cause international conflict,Climate policies would not cause international conflict,,
|
53 |
+
Climate Change,Climate Change,Climate solutions won't work,Climate policy would actually harm the environment,Climate policy would not harm the environment,,
|
54 |
+
Climate Change,Climate Change,Climate solutions won't work,Climate policy limits capitalism,Climate policy does not limit capitalism,,
|
55 |
+
Climate Change,Climate Change,Climate solutions won't work,Climate policy limits freedom,Climate policy does not limit freedom,,
|
56 |
+
Climate Change,Climate Change,Climate solutions won't work,Green jobs won't work,Green jobs will work,,
|
57 |
+
Climate Change,Climate Change,Climate solutions won't work,Green businesses won't work,Green businesses will work,,
|
58 |
+
Climate Change,Climate Change,Climate solutions won't work,Government policies are less efficient than market solutions,Government policies are not less efficient than market solutions,,
|
59 |
+
Climate Change,Climate Change,Climate solutions won't work,Climate policy will not make a big difference to climate change,Climate policy will make a difference to climate change,,
|
60 |
+
Climate Change,Climate Change,Climate solutions won't work,Most CO2 emissions come from a single country,Most CO2 emissions do not come from a single country,,
|
61 |
+
Climate Change,Climate Change,Climate solutions won't work,It is better to adapt to climate change than stop it ,It is not better to adapt to climate change than stop it ,,
|
62 |
+
Climate Change,Climate Change,Climate solutions won't work,It is better to geoengineer than stop climate change,It is not better to geoengineer than stop climate change,,
|
63 |
+
Climate Change,Climate Change,Climate solutions won't work,Climate policy is useless because of other countries' emissions ,Climate policy is not useless because of other countries' emissions ,,
|
64 |
+
Climate Change,Climate Change,Climate solutions won't work,We should invest in other public policy areas first,We should not invest in other public policies first ,,
|
65 |
+
Climate Change,Climate Change,Climate solutions won't work,Climate change is too hard to solve,Climate change is not too hard to solve,,
|
66 |
+
Climate Change,Climate Change,Climate solutions won't work,Public support for climate policy is low,Public support for climate policy is not low,,
|
67 |
+
Climate Change,Climate Change,Climate solutions won't work,Clean energy technology won't work,Clean energy technology will work,,
|
68 |
+
Climate Change,Climate Change,Climate solutions won't work,Biofuels won't work,Biofuels won't work,,
|
69 |
+
Climate Change,Climate Change,Climate solutions won't work,Clean energy is too expensive,Clean energy is too expensive,,
|
70 |
+
Climate Change,Climate Change,Climate solutions won't work,Clean energy is too unreliable,Clean energy is not too unreliable,,
|
71 |
+
Climate Change,Climate Change,Climate solutions won't work,Clean energy is harmful,Clean energy is not harmful,,
|
72 |
+
Climate Change,Climate Change,Climate solutions won't work,Carbon Capture and Sequestration won't work,Carbon Capture and Sequestration will work,,
|
73 |
+
Climate Change,Climate Change,Climate solutions won't work,"People need energy from fossil fuels
|
74 |
+
","People do not need energy from fossil fuels
|
75 |
+
",,
|
76 |
+
Climate Change,Climate Change,Climate solutions won't work,Fossil fuel reserves are plentiful,,,
|
77 |
+
Climate Change,Climate Change,Climate solutions won't work,Fossil Fuels are good for society,Fossil Fuels are not good for society,,
|
78 |
+
Climate Change,Climate Change,Climate solutions won't work,Fossil fuels are cheap,Fossil Fuels are not cheap,,
|
79 |
+
Climate Change,Climate Change,Climate solutions won't work,Fossil fuels are safe,Fossil fuels are not safe,,
|
80 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Climate science is unreliable,Climate science is reliable,Civil Discourse,Violent Extremism
|
81 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Climate science is uncertain,Climate science is not uncertain,Civil Discourse,Violent Extremism
|
82 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Climate science is unsound,Climate science is sound,Civil Discourse,Violent Extremism
|
83 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,There is no scientific consensus on climate change,There is scientific consensus on climate change,Civil Discourse,
|
84 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Proxy data on climate change is unreliable,Proxy data on climate change is reliable,Civil Discourse,
|
85 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Temperature record is unreliable,Temperature record is not unreliable,Civil Discourse,
|
86 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Climate models are wrong,Climate models are not wrong,Civil Discourse,
|
87 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Climate movement is alarmist,Climate movement is not alarmist,Civil Discourse,Violent Extremism
|
88 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Climate movement is political,Climate movement is not political,Civil Discourse,Violent Extremism
|
89 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Climate movement is a religion,Climate movement is not a religion,Civil Discourse,Violent Extremism
|
90 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Media about climate change is alarmist,Media about climate change is not alarmist,Civil Discourse,Violent Extremism
|
91 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Media about climate change is political,Media about climate change is not political,Civil Discourse,Violent Extremism
|
92 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,The UN is wrong on climate change,The UN is right on climate change,Civil Discourse,Violent Extremism
|
93 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,The UN is alarmist on climate change,The UN is not alarmist on climate change,Civil Discourse,Violent Extremism
|
94 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,The government is alarmist about climate change,The government is not alarmist about climate change,Civil Discourse,Violent Extremism
|
95 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Scientists are biased about climate change,Scientists are biased about climate change,Civil Discourse,Violent Extremism
|
96 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Scientists are alarmist about climate change,Scientists are not alarmist about climate change,Civil Discourse,Violent Extremism
|
97 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Climate change is not a conspiracy ,Climate change is not a conspiracy ,Civil Discourse,Violent Extremism
|
98 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Climate policies are a scam,Climate policies are not a scam,Civil Discourse,Violent Extremism
|
99 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Clean energy is a conspiracy,Clean energy is not a conspiracy,Civil Discourse,Violent Extremism
|
100 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Climate technology is a scam,Climate technology is not a scam,Civil Discourse,Violent Extremism
|
101 |
+
Climate Change,Climate Change,Climate movement/science is unreliable,Climate science is a conspiracy,Climate science is not a conspiracy,Civil Discourse,Violent Extremism
|
102 |
+
Anti-semitic,Anti-semitic,Jews are responsible for the death of jesus,Jews are responsible for the death of jesus,,,
|
103 |
+
Anti-semitic,Anti-semitic,Jews are trying to destroy Christianity,Jews are trying to destroy Christianity,,,
|
104 |
+
Anti-semitic,Anti-semitic,Jews conduct ritual murder,Jews conduct ritual murder,,,
|
105 |
+
Anti-semitic,Anti-semitic,Jews use (christian) blood in rituals,Jews use (christian) blood in rituals,,,
|
106 |
+
Anti-semitic,Anti-semitic,Jews are penny pinchers or usurius,Jews are penny pinchers or usurius,,,
|
107 |
+
Anti-semitic,Anti-semitic,Jews are loyal to israel,Jews are loyal to israel,,,
|
108 |
+
Anti-semitic,Anti-semitic,Jews control black politics ,Jews control black politics ,,,
|
109 |
+
Anti-semitic,Anti-semitic,Jews control communism ,Jews control communism ,,,
|
110 |
+
Anti-semitic,Anti-semitic,Jews control democrats,Jews control democrats,,,
|
111 |
+
Anti-semitic,Anti-semitic,Jews control LGBTQ politics,Jews control LGBTQ politics,,,
|
112 |
+
Anti-semitic,Anti-semitic,Jews control liberalism,Jews control liberalism,,,
|
113 |
+
Anti-semitic,Anti-semitic,Jews control the global financial system,Jews control the global financial system,,,
|
114 |
+
Anti-semitic,Anti-semitic,Jews control the UN,Jews control the UN,,,
|
115 |
+
Anti-semitic,Anti-semitic,Jews control the weather,Jews control the weather,,,
|
116 |
+
Anti-semitic,Anti-semitic,Jews control the West,Jews control the West,,,
|
117 |
+
Anti-semitic,Anti-semitic,Jews ran the slave trade,Jews ran the slave trade,,,
|
118 |
+
Anti-semitic,Anti-semitic,Jews run hollywood,Jews run hollywood,,,
|
119 |
+
Anti-semitic,Anti-semitic,Jews run the media,Jews run the media,,,
|
120 |
+
Anti-semitic,Anti-semitic,Antisemitism isn't real,Antisemitism isn't real,,,
|
121 |
+
Anti-semitic,Anti-semitic,Jews provoke antisemitism,Jews provoke antisemitism,,,
|
122 |
+
Anti-semitic,Anti-semitic,Jesus was not jewish,Jesus was not jewish,,,
|
123 |
+
Anti-semitic,Anti-semitic,"Jews are descended from Khazar, not Judea","Jews are descended from Khazar, not Judea",,,
|
124 |
+
Anti-semitic,Anti-semitic,Jews are behind global migration,Jews are behind global migration,,,
|
125 |
+
Anti-semitic,Anti-semitic,Holocaust did not happen,Holocaust did not happen,,,
|
126 |
+
Anti-semitic,Anti-semitic,Jewish life lost during the holocaust is over-estimated,Jewish life lost during the holocaust is over-estimated,,,
|
127 |
+
Anti-semitic,Anti-semitic,Jews are behind multiculturalism,Jews are behind multiculturalism,,,
|
128 |
+
Anti-semitic,Anti-semitic,Jews are making people gay,Jews are making people gay,,,
|
129 |
+
Black,Black,Black lives matter protests were insurrections,Black lives matter protests were insurrections,,,
|
130 |
+
Black,Black,Black lives matter protests were riots,Black lives matter protests were riots,,,
|
131 |
+
Black,Black,Black people are targeting white people in response to George Floyd,Black people are targeting white people in response to George Floyd,,,
|
132 |
+
Black,Black,BLM activists commit non-protest-related crimes,BLM activists commit non-protest-related crimes,,,
|
133 |
+
Black,Black,BLM did the J6 insurrection,BLM did the J6 insurrection,,,
|
134 |
+
Black,Black,BLM seeks to enslave white people,BLM seeks to enslave white people,,,
|
135 |
+
Black,Black,Schools are teaching Black Lives Matter politics,Schools are teaching Black Lives Matter politics,,,
|
136 |
+
Black,Black,African Americans abuse government systems,African Americans abuse government systems,,,
|
137 |
+
Black,Black,African Americans are abnormally violent,African Americans are abnormally violent,,,
|
138 |
+
Black,Black,African Americans are criminals,African Americans are criminals,,,
|
139 |
+
Black,Black,African Americans are dependent on welfare,African Americans are dependent on welfare,,,
|
140 |
+
Black,Black,African Americans are lazy,African Americans are lazy,,,
|
141 |
+
Black,Black,Black people are less intelligent than white people,Black people are less intelligent than white people,,,
|
142 |
+
Black,Black,Democrats push the adoption of critical race theory,Democrats push the adoption of critical race theory,,,
|
143 |
+
Black,Black,Public education promotes critical race theory,Public education promotes critical race theory,,,
|
144 |
+
Black,Black,Public schools teach children critical race theory,Public schools teach children critical race theory,,,
|
145 |
+
Black,Black,Implicit bias doesn't exist,Implicit bias doesn't exist,,,
|
146 |
+
Black,Black,Systemic racism doesn't exist,Systemic racism doesn't exist,,,
|
147 |
+
Black,Black,Most Black people are not descended from slaves,Most Black people are not descended from slaves,,,
|
148 |
+
Black,Black,Black reproduction is meant to eliminate white people,Black reproduction is meant to eliminate white people,,,
|
149 |
+
Black,Black,Companies will not hire whites because of Affirmative Action,Companies will not hire whites because of Affirmative Action,,,
|
150 |
+
Black,Black,Companies will not hire whites because of DEI,Companies will not hire whites because of DEI,,,
|
151 |
+
Immigration,Immigration,Immigrants are bringing diseases to the west,Immigrants are bringing diseases to the west,,,
|
152 |
+
Immigration,Immigration,Immigrants are unvaccinated,Immigrants are unvaccinated,,,
|
153 |
+
Immigration,Immigration,Immigrants are violent,Immigrants are violent,,,
|
154 |
+
Immigration,Immigration,Immigrants commit disproportionate crime,Immigrants commit disproportionate crime,,,
|
155 |
+
Immigration,Immigration,Immigrants poison the blood of the nation,Immigrants poison the blood of the nation,,,
|
156 |
+
Immigration,Immigration,Immigrants are being allowed in to vote in elections,Immigrants are being allowed in to vote in elections,,,
|
157 |
+
Immigration,Immigration,Immigrants stole the 2020 election,Immigrants stole the 2020 election,,,
|
158 |
+
Immigration,Immigration,Immigration is an invasion of western countries,Immigration is an invasion of western countries,,,
|
159 |
+
Immigration,Immigration,Immigration is engineered to replace white people,Immigration is engineered to replace white people,,,
|
160 |
+
Immigration,Immigration,immigrants are given free health care in the united states,immigrants are given free health care in the united states,,,
|
161 |
+
Immigration,Immigration,Immigration is a globalist/multiculturalist conspiracy,Immigration is a globalist/multiculturalist conspiracy,,,
|
162 |
+
Immigration,Immigration,Immigration is a process of deculturalizing the west,Immigration is a process of deculturalizing the west,,,
|
163 |
+
Immigration,Immigration,Immigration is a process of despiritualizing the west,Immigration is a process of despiritualizing the west,,,
|
164 |
+
Immigration,Immigration,immigration is reverse colonization,immigration is reverse colonization,,,
|
165 |
+
Immigration,Immigration,immigration leads to the decline of western civilization,immigration leads to the decline of western civilization,,,
|
166 |
+
Immigration,Immigration,immigration will eliminate the white race through racial mixing,immigration will eliminate the white race through racial mixing,,,
|
167 |
+
LGBTQ,LGBTQ,LGBTQ rights is a form of colonization by the west,LGBTQ rights is a form of colonization by the west,,,
|
168 |
+
LGBTQ,LGBTQ,There are only two genders people are born with,There are only two genders people are born with,,,
|
169 |
+
LGBTQ,LGBTQ,LGBTQ is a disease that can be cured,LGBTQ is a disease that can be cured,,,
|
170 |
+
LGBTQ,LGBTQ,LGBTQ status is a choice,LGBTQ status is a choice,,,
|
171 |
+
LGBTQ,LGBTQ,LGBTQ status is caused by parenting,LGBTQ status is caused by parenting,,,
|
172 |
+
LGBTQ,LGBTQ,LGBTQ is a form of moral degeneracy,LGBTQ is a form of moral degeneracy,,,
|
173 |
+
LGBTQ,LGBTQ,LGBTQ is pushing children to change their gender,LGBTQ is pushing children to change their gender,,,
|
174 |
+
LGBTQ,LGBTQ,LGBTQ people are threats to children,LGBTQ people are threats to children,,,
|
175 |
+
LGBTQ,LGBTQ,LGBTQ people groom children,LGBTQ people groom children,,,
|
176 |
+
LGBTQ,LGBTQ,LGBTQ people threaten the safety of women and children in bathrooms,LGBTQ people threaten the safety of women and children in bathrooms,,,
|
177 |
+
LGBTQ,LGBTQ,LGTBQ people use the LGBTQ identity as a cover for dangerous qualities (e.g. they are secretly a rapist),LGTBQ people use the LGBTQ identity as a cover for dangerous qualities (e.g. they are secretly a rapist),,,
|
178 |
+
LGBTQ,LGBTQ,Gays control the media,Gays control the media,,,
|
179 |
+
LGBTQ,LGBTQ,There is a secret gay agenda/cabal,There is a secret gay agenda/cabal,,,
|
180 |
+
LGBTQ,LGBTQ,Gender affirming care is unsafe,Gender affirming care is unsafe,,,
|
181 |
+
LGBTQ,LGBTQ,Gender-affirming health care is a form of child abuse or mutiliation,Gender-affirming health care is a form of child abuse or mutiliation,,,
|
182 |
+
LGBTQ,LGBTQ,Gender-affirming health care is a form of sterilization,Gender-affirming health care is a form of sterilization,,,
|
183 |
+
LGBTQ,LGBTQ,Most people who transition regret it and want to detransition,Most people who transition regret it and want to detransition,,,
|
184 |
+
LGBTQ,LGBTQ,Being transgender is new or represents a recent trend,Being transgender is new or represents a recent trend,,,
|
185 |
+
LGBTQ,LGBTQ,LGBTQ is part of a social contagion or rapid onset gender dysphoria,LGBTQ is part of a social contagion or rapid onset gender dysphoria,,,
|
186 |
+
LGBTQ,LGBTQ,"Gay marriage is a slippery slope to: pedophilia, bestiality, or polygamy","Gay marriage is a slippery slope to: pedophilia, bestiality, or polygamy",,,
|
187 |
+
LGBTQ,LGBTQ,LGBTQ is an ideological movement pushing gender ideology and transgenderism,LGBTQ is an ideological movement pushing gender ideology and transgenderism,,,
|
188 |
+
LGBTQ,LGBTQ,[Public figure] is secretly trans,[Public figure] is secretly trans,,,
|
189 |
+
LGBTQ,LGBTQ,LGBTQ people can be distinguished by physical features,LGBTQ people can be distinguished by physical features,,,
|
190 |
+
LGBTQ,LGBTQ,LGBTQ people are satanists,LGBTQ people are satanists,,,
|
191 |
+
LGBTQ,LGBTQ,LGBTQ people cannot provide stable homes,LGBTQ people cannot provide stable homes,,,
|
192 |
+
Reproductive health,Reproductive health,Abortion is black genocide,Abortion is black genocide,,,
|
193 |
+
Reproductive health,Reproductive health,Abortion is genocide,Abortion is genocide,,,
|
194 |
+
Reproductive health,Reproductive health,Abortion is white genocide,Abortion is white genocide,,,
|
195 |
+
Reproductive health,Reproductive health,Birth control is black genocide,Birth control is black genocide,,,
|
196 |
+
Reproductive health,Reproductive health,Birth control is genocide,Birth control is genocide,,,
|
197 |
+
Reproductive health,Reproductive health,Birth control is white genocide,Birth control is white genocide,,,
|
data/Indicator_Development.csv
ADDED
The diff for this file is too large to render.
See raw diff
|
|
data/Indicator_Test.csv
ADDED
The diff for this file is too large to render.
See raw diff
|
|
data/Modified Misinformation Library.csv
ADDED
@@ -0,0 +1,97 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
Target,Type,Misinformation Narrative,Random ID
|
2 |
+
Anti-semitic,Anti-Christian,Jews are responsible for the death of jesus,UdR1EJ
|
3 |
+
Anti-semitic,Anti-Christian,Jews are trying to destroy Christianity,iiQLW3
|
4 |
+
Anti-semitic,Blood Libel,Jews conduct ritual murder,bzvo8C
|
5 |
+
Anti-semitic,Blood Libel,Jews use (christian) blood in rituals,E8Gihk
|
6 |
+
Anti-semitic,Character Assasination,Jews are penny pinchers or usurius,XhKvwR
|
7 |
+
Anti-semitic,Conspiracy,Jews are loyal to israel,gPdJTy
|
8 |
+
Anti-semitic,Conspiracy,Jews control black politics ,jHIGen
|
9 |
+
Anti-semitic,Conspiracy,Jews control communism ,00oDoA
|
10 |
+
Anti-semitic,Conspiracy,Jews control democrats,oX8nUF
|
11 |
+
Anti-semitic,Conspiracy,Jews control LGBTQ politics,z26UiS
|
12 |
+
Anti-semitic,Conspiracy,Jews control liberalism,Gm048t
|
13 |
+
Anti-semitic,Conspiracy,Jews control the global financial system,XrfWiT
|
14 |
+
Anti-semitic,Conspiracy,Jews control the UN,omXiC8
|
15 |
+
Anti-semitic,Conspiracy,Jews control the weather,oUguN7
|
16 |
+
Anti-semitic,Conspiracy,Jews control the West,xnfgPu
|
17 |
+
Anti-semitic,Conspiracy,Jews ran the slave trade,GXx4f1
|
18 |
+
Anti-semitic,Conspiracy,Jews run hollywood,wFhPBW
|
19 |
+
Anti-semitic,Conspiracy,Jews run the media,URbKNx
|
20 |
+
Anti-semitic,Deny marginalization,Antisemitism isn't real,PYXuER
|
21 |
+
Anti-semitic,Deny marginalization,Jews provoke antisemitism,53X1lc
|
22 |
+
Anti-semitic,Ethnic identity,Jesus was not jewish,NMVCa4
|
23 |
+
Anti-semitic,Ethnic identity,"Jews are descended from Khazar, not Judea",BhDWro
|
24 |
+
Anti-semitic,Great replacement,Jews are behind global migration,dVMbzJ
|
25 |
+
Anti-semitic,Holocaust Denial,Holocaust did not happen,JaSIbY
|
26 |
+
Anti-semitic,Holocaust Denial,Jewish life lost during the holocaust is over-estimated,dhOlra
|
27 |
+
Anti-semitic,Western Chauvinism,Jews are behind multiculturalism,GayHrv
|
28 |
+
Anti-semitic,Western Chauvinism,Jews are making people gay,5SYQ2q
|
29 |
+
Black,BLM,Black lives matter protests were insurrections,whcn6U
|
30 |
+
Black,BLM,Black lives matter protests were riots,qjshDE
|
31 |
+
Black,BLM,Black people are targeting white people in response to George Floyd,JJzM7y
|
32 |
+
Black,BLM,BLM activists commit non-protest-related crimes,wCYHg7
|
33 |
+
Black,BLM,BLM did the J6 insurrection,GVHQah
|
34 |
+
Black,BLM,BLM seeks to enslave white people,5nnDbt
|
35 |
+
Black,BLM,Schools are teaching Black Lives Matter politics,f8v3rm
|
36 |
+
Black,Character Assasination,African Americans abuse government systems,LGCKdm
|
37 |
+
Black,Character Assasination,African Americans are abnormally violent,eVd1Eg
|
38 |
+
Black,Character Assasination,African Americans are criminals,SY77H4
|
39 |
+
Black,Character Assasination,African Americans are dependent on welfare,ySNZtE
|
40 |
+
Black,Character Assasination,African Americans are lazy,KJykwB
|
41 |
+
Black,Character Assasination,Black people are less intelligent than white people,UikLfc
|
42 |
+
Black,CRT,Democrats push the adoption of critical race theory,jyF0Yl
|
43 |
+
Black,CRT,Public education promotes critical race theory,YoWcaU
|
44 |
+
Black,CRT,Public schools teach children critical race theory,WiYklo
|
45 |
+
Black,Deny marginalization,Implicit bias doesn't exist,JkHnUH
|
46 |
+
Black,Deny marginalization,Systemic racism doesn't exist,GXMok3
|
47 |
+
Black,Ethnic Identity,Most Black people are not descended from slaves,B96bhS
|
48 |
+
Black,Great replacement,Black reproduction is meant to eliminate white people,lx3WsW
|
49 |
+
Black,Reverse marginalization,Companies will not hire whites because of Affirmative Action,hEZ6KU
|
50 |
+
Black,Reverse marginalization,Companies will not hire whites because of DEI,oOeF3U
|
51 |
+
Immigration,Character Assasination,Immigrants are bringing diseases to the west,G0mUU3
|
52 |
+
Immigration,Character Assasination,Immigrants are unvaccinated,EbtpqX
|
53 |
+
Immigration,Character Assasination,Immigrants are violent,XarKDi
|
54 |
+
Immigration,Character Assasination,Immigrants commit disproportionate crime,dTISzI
|
55 |
+
Immigration,Great replacement,Immigrants poison the blood of the nation,5Cokji
|
56 |
+
Immigration,Great replacement,Immigrants are being allowed in to vote in elections,zdgRli
|
57 |
+
Immigration,Great replacement,Immigrants stole the 2020 election,cEKPsz
|
58 |
+
Immigration,Great replacement,Immigration is an invasion of western countries,L0ZyUA
|
59 |
+
Immigration,Great replacement,Immigration is engineered to replace white people,KEpIQf
|
60 |
+
Immigration,Policies,immigrants are given free health care in the united states,GkNQFl
|
61 |
+
Immigration,Western Chauvinism,Immigration is a globalist/multiculturalist conspiracy,NJ53RR
|
62 |
+
Immigration,Western Chauvinism,Immigration is a process of deculturalizing the west,fKJrv0
|
63 |
+
Immigration,Western Chauvinism,Immigration is a process of despiritualizing the west,EwykD2
|
64 |
+
Immigration,Western Chauvinism,immigration is reverse colonization,iUu1dv
|
65 |
+
Immigration,Western Chauvinism,immigration leads to the decline of western civilization,v5RcgG
|
66 |
+
Immigration,Western Chauvinism,immigration will eliminate the white race through racial mixing,dlbkPD
|
67 |
+
LGBTQ,Anti-liberalism,LGBTQ rights is a form of colonization by the west,98l33O
|
68 |
+
LGBTQ,Anti-science,There are only two genders people are born with,B1RpCU
|
69 |
+
LGBTQ,Anti-science,LGBTQ is a disease that can be cured,e7r1ws
|
70 |
+
LGBTQ,Anti-science,LGBTQ status is a choice,i3TdA8
|
71 |
+
LGBTQ,Anti-science,LGBTQ status is caused by parenting,vdmbmW
|
72 |
+
LGBTQ,Character Assasination,LGBTQ is a form of moral degeneracy,AUxRMf
|
73 |
+
LGBTQ,Character Assasination,LGBTQ is pushing children to change their gender,gTE1iB
|
74 |
+
LGBTQ,Character Assasination,LGBTQ people are threats to children,7TqDW3
|
75 |
+
LGBTQ,Character Assasination,LGBTQ people groom children,RQ8N4o
|
76 |
+
LGBTQ,Character Assasination,LGBTQ people threaten the safety of women and children in bathrooms,6PIKk3
|
77 |
+
LGBTQ,Character Assasination,LGTBQ people use the LGBTQ identity as a cover for dangerous qualities (e.g. they are secretly a rapist),TXX0OT
|
78 |
+
LGBTQ,Conspiracy,Gays control the media,tCHyHj
|
79 |
+
LGBTQ,Conspiracy,There is a secret gay agenda/cabal,6zsNn0
|
80 |
+
LGBTQ,Gender affirming care,Gender affirming care is unsafe,ofwj9b
|
81 |
+
LGBTQ,Gender affirming care,Gender-affirming health care is a form of child abuse or mutiliation,v7fGCm
|
82 |
+
LGBTQ,Gender affirming care,Gender-affirming health care is a form of sterilization,Mo26Zl
|
83 |
+
LGBTQ,Gender affirming care,Most people who transition regret it and want to detransition,xpia1A
|
84 |
+
LGBTQ,Kids these days,Being transgender is new or represents a recent trend,BMyDKR
|
85 |
+
LGBTQ,Kids these days,LGBTQ is part of a social contagion or rapid onset gender dysphoria,KXMEUC
|
86 |
+
LGBTQ,Policies,"Gay marriage is a slippery slope to: pedophilia, bestiality, or polygamy",d753iK
|
87 |
+
LGBTQ,Policies,LGBTQ is an ideological movement pushing gender ideology and transgenderism,vubrAX
|
88 |
+
LGBTQ,Pseudo-science,[Public figure] is secretly trans,2axpUt
|
89 |
+
LGBTQ,Psuedo-science,LGBTQ people can be distinguished by physical features,R6Bv5Q
|
90 |
+
LGBTQ,Satanism',LGBTQ people are satanists,aVunFJ
|
91 |
+
LGBTQ,Western Chauvinism,LGBTQ people cannot provide stable homes,DWHuWO
|
92 |
+
Reproductive health,Abortion,Abortion is black genocide,tKzbS5
|
93 |
+
Reproductive health,Abortion,Abortion is genocide,9TycbG
|
94 |
+
Reproductive health,Abortion,Abortion is white genocide,WNhhGj
|
95 |
+
Reproductive health,Abortion,Birth control is black genocide,0FHlMA
|
96 |
+
Reproductive health,Abortion,Birth control is genocide,9sDtYl
|
97 |
+
Reproductive health,Abortion,Birth control is white genocide,a8GiIm
|
data/climate_data/data/README.txt
ADDED
@@ -0,0 +1,46 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
-----------------------------------------------------
|
2 |
+
Data used in Coan, Boussalis, Cook, and Nanko (2021)
|
3 |
+
-----------------------------------------------------
|
4 |
+
|
5 |
+
This directory includes two sub-directories that house the main
|
6 |
+
data used during training and in the analysis.
|
7 |
+
|
8 |
+
------------------
|
9 |
+
analysis directory
|
10 |
+
------------------
|
11 |
+
|
12 |
+
The analysis directory includes a single CSV file: cards_for_analysis.csv. The
|
13 |
+
file has the following fields:
|
14 |
+
|
15 |
+
domain: the domain for each organization or blog.
|
16 |
+
|
17 |
+
date: the date the article or blog post was written.
|
18 |
+
|
19 |
+
ctt_status: an indicator for whether the source is a conservative think tank
|
20 |
+
(CTTs). [CTT = True; Blog = False]
|
21 |
+
|
22 |
+
pid: unique paragraph identifier
|
23 |
+
|
24 |
+
claim: the estimated sub-claim based on the RoBERTa-Logistic ensemble described
|
25 |
+
in the paper. [The variable assumes the following format: superclaim_subclaim.
|
26 |
+
For example, 5_1 would represent super-claim 5 ("Climate movement/science is
|
27 |
+
unreliable"), sub-claim 1 ("Science is unreliable").]
|
28 |
+
|
29 |
+
------------------
|
30 |
+
training directory
|
31 |
+
------------------
|
32 |
+
|
33 |
+
The training directory includes 3 CSV files:
|
34 |
+
|
35 |
+
training.csv: annotations used for training
|
36 |
+
validation.csv: the held-out validation set used during training (noisy)
|
37 |
+
test.csv: the held-out test set used to assess final model performance
|
38 |
+
(noise free)
|
39 |
+
|
40 |
+
Each file has the following fields:
|
41 |
+
|
42 |
+
text: the paragraph text that is annotated
|
43 |
+
claim: the annotated sub-claim [The variable assumes the following format:
|
44 |
+
superclaim_subclaim. For example, 5_1 would represent super-claim 5
|
45 |
+
("Climate movement/science is unreliable"), sub-claim 1 ("Science is
|
46 |
+
unreliable").]
|
data/expansive_claims_library_expanded_embed.csv
ADDED
The diff for this file is too large to render.
See raw diff
|
|
data/filtered_fact_check_latest_embed.csv
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:0b868219291c167703e5eb45b95aceae6fa29779b7cb4d62ef977e2853516829
|
3 |
+
size 145825870
|
data/random_300k.csv
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:f9843f88f219a0e8d43296ed8e62033affdebf0540cde53c3e1a7c3bac755f8d
|
3 |
+
size 86368740
|
src/Embeddings.jl
ADDED
@@ -0,0 +1,186 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
## Embeddings
|
2 |
+
|
3 |
+
function string_to_float32_vector(str::String)::Vector{Float32}
    # Parse a serialized vector such as "Float32[1.0f0, 2.5f0]" (the string
    # form CSV round-tripping produces) back into a Vector{Float32}.
    #
    # Remove the type-name prefix, then strip only the surrounding brackets.
    # (The original stripped the character set ['F','l','o','a','t','3','2','[',']']
    # from both ends, which also consumed digits 2/3 that belong to the first
    # or last number — e.g. "Float32[0.123]" parsed as 0.1.)
    s = replace(strip(str), "Float32" => "")
    s = strip(s, ['[', ']'])

    # Julia prints Float32 literals with an 'f' exponent marker; convert it
    # to 'e' so parse(Float32, ...) accepts the element.
    s = replace(s, 'f' => 'e')

    # Guard the empty-vector case ("Float32[]") instead of erroring in parse.
    isempty(strip(s)) && return Float32[]

    # Split on commas and parse each element.
    return Float32[parse(Float32, strip(el)) for el in split(s, ",")]
end
|
16 |
+
|
17 |
+
function dfdat_to_matrix(df::DataFrame, col::Symbol)::Matrix{Float32}
    # Parse the serialized embedding of every row and stack them as the
    # columns of one Float32 matrix.
    # reduce(hcat, ...) avoids splatting one argument per row — the original
    # hcat(vs...) form is slow and can overflow the stack on large frames.
    return reduce(hcat, [string_to_float32_vector(row[col]) for row in eachrow(df)])
end
|
20 |
+
|
21 |
+
"""
|
22 |
+
## Any piece of text longer than 280 characters will be chunked into smaller pieces, and the embeddings will be averaged.
|
23 |
+
|
24 |
+
#Example:
|
25 |
+
text = repeat("This is a test. ", 100)
|
26 |
+
chunktext = create_chunked_text(text)
|
27 |
+
function create_chunked_text(text; chunk_size=280)
|
28 |
+
## Chunk the data
|
29 |
+
chunks = []
|
30 |
+
for chunk in 1:chunk_size:length(text)
|
31 |
+
push!(chunks, text[chunk:min(chunk+chunk_size-1, length(text))])
|
32 |
+
end
|
33 |
+
return chunks
|
34 |
+
end
|
35 |
+
"""
|
36 |
+
|
37 |
+
function create_chunked_text(text::String; chunk_size::Int=280)
    # Split `text` into consecutive pieces of at most `chunk_size` characters.
    # Uses nextind/lastindex so multi-byte (Unicode) characters are never cut.
    #
    # Fix: advance chunk_size - 1 characters past start_idx so each chunk
    # holds at most chunk_size characters. The original advanced chunk_size
    # times, producing chunks of chunk_size + 1 characters.
    chunks = String[]
    start_idx = 1
    while start_idx <= lastindex(text)
        end_idx = start_idx
        for _ in 1:(chunk_size - 1)
            nxt = nextind(text, end_idx)
            if nxt > lastindex(text)
                break
            end
            end_idx = nxt
        end
        push!(chunks, text[start_idx:end_idx])
        start_idx = nextind(text, end_idx)
    end
    return chunks
end
|
54 |
+
|
55 |
+
"""
|
56 |
+
## Embeddings of text
|
57 |
+
|
58 |
+
"""
|
59 |
+
function generate_embeddings(text::String)
    # Embed a single (already-chunked) piece of text via the MiniEncoder
    # Python model. On any failure the error is printed and a 384-dim zero
    # vector is returned, so batch embedding never aborts mid-run.
    local emb
    try
        emb = MiniEncoder.get_embeddings(text)
    catch err
        println("Error: ", err)
        emb = zeros(Float32, 384)
    end
    return emb
end
|
67 |
+
|
68 |
+
"""
|
69 |
+
# This is the core function - takes in a string of any length and returns the embeddings
|
70 |
+
|
71 |
+
text = repeat("This is a test. ", 100)
|
72 |
+
mini_embed(text)
|
73 |
+
|
74 |
+
# Test to embed truthseeker subsample
|
75 |
+
ts = CSV.read("data/truthseeker_subsample.csv", DataFrame)
|
76 |
+
ts_embed = mini_embed.(ts.statement) # can embed 3K in 25 seconds
|
77 |
+
ts.Embeddings = ts_embed
|
78 |
+
CSV.write("data/truthseeker_subsample_embed.csv", ts)
|
79 |
+
|
80 |
+
## embed fact check data
|
81 |
+
fc = CSV.read("data/fact_check_latest.csv", DataFrame)
|
82 |
+
# drop missing text
|
83 |
+
fc = fc[.!ismissing.(fc.text), :]
|
84 |
+
fc_embed = mini_embed.(fc.text) # 12 minutes
|
85 |
+
fc.Embeddings = fc_embed
|
86 |
+
CSV.write("data/fact_check_latest_embed.csv", fc)
|
87 |
+
|
88 |
+
narrs = CSV.read("data/expansive_claims_library_expanded.csv", DataFrame)
|
89 |
+
# drop missing text
|
90 |
+
narrs.text = narrs.ExpandedClaim
|
91 |
+
narrs = narrs[.!ismissing.(narrs.text), :]
|
92 |
+
narratives_embed = OC.mini_embed.(narrs.text) # seconds to run
|
93 |
+
narrs.Embeddings = narratives_embed
|
94 |
+
CSV.write("data/expansive_claims_library_expanded_embed.csv", narrs)
|
95 |
+
|
96 |
+
"""
|
97 |
+
function mini_embed(text::String)
    # Embed text of any length: split it into ≤280-character chunks, embed
    # each chunk, and average the chunk embeddings into one vector.
    pieces = create_chunked_text(text)
    vectors = [generate_embeddings(p) for p in pieces]
    return mean(vectors)
end
|
102 |
+
|
103 |
+
"""
|
104 |
+
# Get distance and classification
|
105 |
+
|
106 |
+
ts = CSV.read("data/truthseeker_subsample_embed.csv", DataFrame)
|
107 |
+
ts_embed = dfdat_to_matrix(ts, :Embeddings)
|
108 |
+
fc = CSV.read("data/fact_check_latest_embed.csv", DataFrame)
|
109 |
+
fc_embed = dfdat_to_matrix(fc, :Embeddings)
|
110 |
+
distances, classification = distances_and_classification(fc_embed, ts_embed[:, 1:5])
|
111 |
+
"""
|
112 |
+
function distances_and_classification(narrative_matrix, target_matrix)
    # Cosine distance of every target column against every narrative column.
    distances = pairwise(CosineDist(), target_matrix, narrative_matrix, dims=2)
    # Column-wise argmin, computed once (the original evaluated
    # argmin(distances, dims=2) twice).
    nearest = argmin(distances, dims=2)
    # Return the smallest distance per target and the CartesianIndex of the
    # matching narrative column.
    return distances[nearest][:, 1], nearest[:, 1]
end
|
117 |
+
|
118 |
+
"""
|
119 |
+
# Get the dot product of the two matrices
|
120 |
+
|
121 |
+
ind, scores = dotproduct_distances(fc_embed, ts_embed)
|
122 |
+
|
123 |
+
ts.scores = scores
|
124 |
+
|
125 |
+
# Group by target and get the max score
|
126 |
+
ts_grouped = combine(groupby(ts, :target), :scores => mean)
|
127 |
+
# show the matched text
|
128 |
+
ts.fc_text = fc.text[ind]
|
129 |
+
|
130 |
+
"""
|
131 |
+
function dotproduct_distances(narrative_matrix, target_matrix)
    # Similarity of every narrative (column) against every target (column):
    # entry (i, j) is the dot product of narrative i with target j.
    sims = transpose(narrative_matrix) * target_matrix
    # Best-matching narrative for each target column.
    best = argmax(sims, dims=1)[1, :]
    # Extract the row (narrative) index from each CartesianIndex, and the
    # winning similarity score for each target.
    rows = first.(Tuple.(best))
    return rows, sims[best]
end
|
138 |
+
|
139 |
+
function dotproduct_topk(narrative_matrix, target_vector, k)
    # Similarity of each narrative (column) to the single target vector.
    dprods = narrative_matrix' * target_vector
    # partialsortperm finds only the k largest entries — cheaper than the
    # original full sortperm when only the top-k matches are needed.
    # collect() materializes the index view so callers get a plain Vector.
    topk = collect(partialsortperm(dprods, 1:k, rev=true))
    return topk, dprods[topk]
end
|
146 |
+
|
147 |
+
"""
|
148 |
+
# Get the top k scores
|
149 |
+
|
150 |
+
using CSV, DataFrames
|
151 |
+
ts = CSV.read("data/truthseeker_subsample_embed.csv", DataFrame)
|
152 |
+
ts_embed = OC.dfdat_to_matrix(ts, :Embeddings)
|
153 |
+
fc = CSV.read("data/fact_check_latest_embed.csv", DataFrame)
|
154 |
+
fc_embed = OC.dfdat_to_matrix(fc, :Embeddings)
|
155 |
+
|
156 |
+
OC.fast_topk(fc_embed, fc, ts.statement[1], 5)
|
157 |
+
|
158 |
+
## How fast to get the top 5 scores for 3K statements?
|
159 |
+
@time [OC.fast_topk(fc_embed, fc, ts.statement[x], 5) for x in 1:3000] # 63 seconds
|
160 |
+
"""
|
161 |
+
"""
Embed `text` and return the `k` nearest narratives as a vector of Dicts with
keys "score", "text", "claimUrl", "policy", and "narrative".

NOTE(review): this function MUTATES `narratives` — when the frame has no
`claimReviewUrl` column, one is added and filled with "No URL".
"""
function fast_topk(narrative_matrix, narratives, text::String, k)
    # Embed the query text, then score it against every narrative column.
    target_vector = mini_embed(text)
    inds, scores = dotproduct_topk(narrative_matrix, target_vector, k)
    # Pull policy/narrative labels when present; assumes that a frame with a
    # :Policy column also has a :Narrative column — TODO confirm with callers.
    if hasproperty(narratives, :Policy)
        policy = narratives.Policy[inds]
        narrative = narratives.Narrative[inds]
    else
        policy = fill("No policy", k)
        narrative = fill("No narrative", k)
    end
    # Backfill a claimReviewUrl column so the Dict construction below can
    # index it unconditionally (side effect: mutates the input frame).
    if !hasproperty(narratives, :claimReviewUrl)
        narratives.claimReviewUrl = fill("No URL", size(narratives, 1))
    end
    # One result Dict per matched narrative, in descending-score order.
    vec_of_dicts = [Dict("score" => scores[i],
                         "text" => narratives.text[ind],
                         "claimUrl" => narratives.claimReviewUrl[ind],
                         "policy" => policy[i],
                         "narrative" => narrative[i]) for (i, ind) in enumerate(inds)]
    return vec_of_dicts
end
|
181 |
+
|
182 |
+
function load_fasttext_embeddings(file::String="data/fact_check_latest_embed.csv")
    # Load the fact-check table and materialize its Embeddings column as a
    # Float32 matrix (one matrix column per table row). Returns both the
    # matrix and the table so callers can map matches back to rows.
    table = CSV.read(file, DataFrame)
    return dfdat_to_matrix(table, :Embeddings), table
end
|
src/Models.jl
ADDED
@@ -0,0 +1,182 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
## Utility Functions
|
2 |
+
## Note: edit ~/.bigqueryrc to set global settings for bq command line tool
|
3 |
+
|
4 |
+
using CSV, DataFrames, JSON3
|
5 |
+
|
6 |
+
function read_json(file_path::String)
    # Read the whole file eagerly, then parse it.
    # (The original passed `open(file_path, "r")` straight into JSON3.read
    # and never closed the returned IOStream — a file-handle leak.)
    return JSON3.read(read(file_path, String))
end
|
10 |
+
|
11 |
+
"""
|
12 |
+
## ostreacultura_bq_auth()
|
13 |
+
- Activate the service account using the credentials file
|
14 |
+
"""
|
15 |
+
"""
## ostreacultura_bq_auth()
- Activate the service account using the credentials file
"""
function ostreacultura_bq_auth()
    # Guard clause: bail out early when the credentials file is absent.
    if !isfile("ostreacultura-credentials.json")
        println("Credentials file not found")
        return
    end
    run(`gcloud auth activate-service-account --key-file=ostreacultura-credentials.json`)
end
|
22 |
+
|
23 |
+
"""
|
24 |
+
## julia_to_bq_type(julia_type::DataType)
|
25 |
+
- Map Julia types to BigQuery types
|
26 |
+
|
27 |
+
Arguments:
|
28 |
+
- julia_type: The Julia data type to map
|
29 |
+
|
30 |
+
Returns:
|
31 |
+
- The corresponding BigQuery type as a string
|
32 |
+
"""
|
33 |
+
"""
## julia_to_bq_type(julia_type::DataType)
- Map a Julia type to its BigQuery column type.

Arguments:
- julia_type: the Julia data type to map

Returns:
- the corresponding BigQuery type as a String (defaults to "STRING")
"""
function julia_to_bq_type(julia_type::DataType)
    # Exact scalar matches first, then array element types, then a catch-all.
    scalar = get(Dict(String => "STRING", Int64 => "INTEGER", Float64 => "FLOAT"),
                 julia_type, nothing)
    scalar !== nothing && return scalar
    julia_type <: AbstractArray{Float64} && return "FLOAT64"
    julia_type <: AbstractArray{Int64} && return "INTEGER"
    return "STRING"
end
|
48 |
+
|
49 |
+
"""
## create_bq_schema(df::DataFrame)
- Create a BigQuery schema from a DataFrame

Arguments:
- df: The DataFrame to create the schema from

Returns:
- The schema as a JSON string in BigQuery format; array-valued columns become
  REPEATED FLOAT64 fields, everything else is NULLABLE.

Example:
df = DataFrame(text = ["Alice", "Bob"], embed = [rand(3), rand(3)])
create_bq_schema(df)
"""
function create_bq_schema(df::DataFrame)
    fields = map(names(df)) do colname
        coltype = eltype(df[!, colname])
        if coltype <: AbstractArray
            # Embedding-style columns: one REPEATED FLOAT64 field per column.
            Dict("name" => colname, "type" => "FLOAT64", "mode" => "REPEATED")
        else
            Dict("name" => colname, "type" => julia_to_bq_type(coltype), "mode" => "NULLABLE")
        end
    end
    return JSON3.write(fields)
end
|
74 |
+
|
75 |
+
"""
## dataframe_to_json(df::DataFrame, file_path::String)
- Convert a DataFrame to newline-delimited JSON (one object per row) and save
  it to a file, the format expected by `bq load --source_format=NEWLINE_DELIMITED_JSON`.

Arguments:
- df: The DataFrame to convert
- file_path: The path where the JSON file should be saved
"""
function dataframe_to_json(df::DataFrame, file_path::String)
    open(file_path, "w") do io
        for row in eachrow(df)
            # This file only loads JSON3 (`using CSV, DataFrames, JSON3`);
            # the original called the undefined `JSON.print`.
            JSON3.write(io, Dict(col => row[col] for col in names(df)))
            write(io, "\n")
        end
    end
end
|
91 |
+
|
92 |
+
"""
# Function to send a DataFrame to a BigQuery table
## send_to_bq_table(df::DataFrame, dataset_name::String, table_name::String)
- Send a DataFrame to a BigQuery table, which will append if the table already exists

Arguments:
- df: The DataFrame to upload
- dataset_name: The BigQuery dataset name
- table_name: The BigQuery table name

# Example usage
df = DataFrame(text = ["Alice", "Bob"], embed = [rand(3), rand(3)])
send_to_bq_table(df, "climate_truth", "embtest")
"""
function send_to_bq_table(df::DataFrame, dataset_name::String, table_name::String)
    # Temp files for the row payload and the derived schema.
    json_file_path = tempname() * ".json"
    schema_file_path = tempname() * ".json"
    open(io -> write(io, create_bq_schema(df)), schema_file_path, "w")

    # Write the rows as newline-delimited JSON.
    dataframe_to_json(df, json_file_path)

    # Load the JSON into BigQuery via the bq CLI, with the explicit schema.
    run(`bq load --source_format=NEWLINE_DELIMITED_JSON $dataset_name.$table_name $json_file_path $schema_file_path`)

    # Clean up the temp files after the upload.
    rm(json_file_path)
    rm(schema_file_path)
    return nothing
end
|
135 |
+
|
136 |
+
"""
## bq(query::String)
- Run a BigQuery query and return the result as a DataFrame

Example: bq("SELECT * FROM ostreacultura.climate_truth.training LIMIT 10")
"""
function bq(query::String)
    outfile = tempname()
    # Stream the CSV-formatted result into a temp file, then parse it.
    run(pipeline(`bq query --use_legacy_sql=false --format=csv $query`, outfile))
    return CSV.read(outfile, DataFrame)
end
|
147 |
+
|
148 |
+
|
149 |
+
"""
## Function to average embeddings over some group

Builds and *returns as a String* (it does not execute) a BigQuery SQL query
that element-wise averages array embeddings within each `group`. Run it with
`bq(...)` / `bq_csv(...)`.

example:
avg_embeddings("ostreacultura.climate_truth.embtest", "text", "embed")
"""
function avg_embeddings(table::String, group::String, embedname::String)
    # Inner query: concatenate all embedding arrays per group value.
    # Outer ARRAY subquery: average position-by-position (UNNEST WITH OFFSET),
    # preserving element order via ORDER BY pos.
    query = """
    SELECT
    $group,
    ARRAY(
    SELECT AVG(value)
    FROM UNNEST($embedname) AS value WITH OFFSET pos
    GROUP BY pos
    ORDER BY pos
    ) AS averaged_array
    FROM (
    SELECT $group, ARRAY_CONCAT_AGG($embedname) AS $embedname
    FROM $table
    GROUP BY $group
    )
    """
    return query
end
|
172 |
+
|
173 |
+
"""
## SAVE results of query to a CSV file

Runs `query` through the `bq` CLI and streams the CSV output straight to `path`.

Example:
bq_csv("SELECT * FROM ostreacultura.climate_truth.training LIMIT 10", "data/test.csv")
"""
function bq_csv(query::String, path::String)
    run(pipeline(`bq query --use_legacy_sql=false --format=csv $query`, path))
end
|
182 |
+
|
src/OstreaCultura.jl
ADDED
@@ -0,0 +1,25 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
## OSTREA
## Top-level package module: pulls in the Python bridge, embedding helpers,
## and the Pinecone I/O layer. Deprecated components stay commented out.
module OstreaCultura

@info "Loading OstreaCultura.jl"

using JSON3, Dates, Sqids, CSV, DataFrames, StatsBase, Distances, PyCall

# Alias so Julia's DataFrames.DataFrame and Pandas.DataFrame can coexist.
import Pandas.DataFrame as pdataframe

export MiniEncoder

## Load the FC Dataset
#const fc = CSV.read("data/fact_check_latest.csv", DataFrame)
#const fc_embed = OC.dfdat_to_matrix(fc, :Embeddings)

#export multi_embeddings, DataLoader, df_to_pd, pd_to_df, create_pinecone_context

#include("Narrative.jl")
#include("NarrativeClassification.jl")
include("py_init.jl")    # NOTE(review): presumably defines `DataLoader` used below — confirm
include("Embeddings.jl")
include("PyPineCone.jl")
#include("Models.jl")

end
|
src/PyPineCone.jl
ADDED
@@ -0,0 +1,415 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
### PineCone Embed and I/O Functions
|
2 |
+
|
3 |
+
"""
# This dataset matches the example data from DataLoader.py
import OstreaCultura as OC
hi = OC.example_data()
hi = OC.df_to_pd(hi)
OC.DataLoader.create_vectors_from_df(hi)
"""
function example_data()
    # Two toy rows: 4-dim embeddings plus an id and one metadata column.
    vectors = [[0.1, 0.2, 0.3, 0.4], [0.2, 0.3, 0.4, 0.5]]
    ids = ["vec1", "vec2"]
    genres = ["drama", "action"]
    return DataFrame(Embeddings = vectors, id = ids, genre = genres)
end
|
17 |
+
|
18 |
+
"""
Convert a pandas DataFrame (PyCall object) into a Julia DataFrame.

df= OC.DataLoader.pd.read_csv("data/Indicator_Test.csv")
df_julia = OC.pd_to_df(df)
"""
function pd_to_df(df_pd)
    out = DataFrame()
    # Copy each pandas column's underlying values into a Julia column,
    # preserving column order.
    for colname in df_pd.columns
        out[!, colname] = getproperty(df_pd, colname).values
    end
    return out
end
|
29 |
+
|
30 |
+
"""
Build a Pinecone client (Python object via PyCall) using the API key from the
PINECONE_API_KEY environment variable.

Available functions
pc.create_index - see below
pc.delete_index: pc.delete_index(index_name)
"""
function create_pinecone_context()
    pc = DataLoader.Pinecone(api_key=ENV["PINECONE_API_KEY"])
    return pc
end
|
39 |
+
|
40 |
+
"""
# Context for inference endpoints

Builds a Pinecone client for embedding/inference calls, keyed by the
PINECONE_API_KEY environment variable.
"""
function create_inf_pinecone_context()
    # Pass the key via the explicit `api_key` keyword for consistency with
    # `create_pinecone_context` (the original passed it positionally).
    pc = DataLoader.Pinecone(api_key=ENV["PINECONE_API_KEY"])
    return pc
end
|
47 |
+
|
48 |
+
"""
Create a Pinecone index through the Python DataLoader helper.

pc = create_pinecone_context()
create_index("new-index", 4, "cosine", "aws", "us-east-1")
"""
function create_index(name, dimension, metric, cloud, region)
    client = create_pinecone_context()
    return DataLoader.create_index(client, name, dimension, metric, cloud, region)
end
|
56 |
+
|
57 |
+
"""
## upsert_data(df, indexname, namespace; chunk_size=1000)
Upsert a DataFrame of embedded rows into a Pinecone index, `chunk_size` rows
at a time. `id` and `Embeddings` are required columns in the DataFrame.

import OstreaCultura as OC
df = OC.DataLoader.pd.read_csv("data/climate_test.csv")
out = OC.multi_embeddings("multilingual-e5-large", df, 96, "text")
OC.upsert_data(out, "test-index", "test-namespace")
"""
function upsert_data(df, indexname, namespace; chunk_size=1000)
    client = create_pinecone_context()
    target = client.Index(indexname)
    return DataLoader.chunk_df_and_upsert(target, df, namespace=namespace, chunk_size=chunk_size)
end
|
79 |
+
|
80 |
+
"""
## How to query data using an existing embedding
import OstreaCultura as OC; using DataFrames
mydf = DataFrame(id = ["vec1", "vec2"], text = ["drama", "action"])
mydf = OC.multi_embeddings(mydf)
vector = mydf.Embeddings[1]
OC.query_data("test-index", "test-namespace", vector, 5, true)

Returns the raw Pinecone query response converted to a Julia Dict.
"""
function query_data(indexname, namespace, vector, top_k, include_values)
    client = create_pinecone_context()
    target = client.Index(indexname)
    response = DataLoader.query_data(target, namespace, vector, top_k, include_values)
    return response.to_dict()
end
|
95 |
+
|
96 |
+
"""
## How to query data using an existing hybrid embedding

import OstreaCultura as OC; using DataFrames
dense = OC.embed_query("drama")
OC.query_data_with_sparse("oc-hybrid-library-index", "immigration", dense, OC.DataLoader.empty_sparse_vector(), 5, true, true)

Hybrid (dense + sparse) Pinecone query; returns the raw response as a Dict.
"""
function query_data_with_sparse(indexname, namespace, dense, sparse, top_k, include_values, include_metadata)
    client = create_pinecone_context()
    target = client.Index(indexname)
    response = DataLoader.query_data_with_sparse(target, namespace, dense, sparse, top_k=top_k, include_values=include_values, include_metadata=include_metadata)
    return response.to_dict()
end
|
113 |
+
|
114 |
+
"""
## Querying function for GGWP - using updated hybrid vector
import OstreaCultura as OC
res = OC.search("drama", "oc-hybrid-library-index", "expanded-fact-checks")

Embeds `claim` densely and runs a hybrid query with an empty sparse side.
"""
function search(claim, indexname, ocmodel; top_k=5, include_values=true, include_metadata=true)
    densevec = embed_query(claim)
    # Sparse side is an empty placeholder vector.
    sparsevec = DataLoader.empty_sparse_vector()
    return query_data_with_sparse(indexname, ocmodel, densevec, sparsevec, top_k, include_values, include_metadata)
end
|
127 |
+
|
128 |
+
# Thin wrapper around UnicodePlots.barplot with a default title.
# NOTE(review): `UnicodePlots` is not among the imports visible in
# OstreaCultura.jl (`using JSON3, Dates, Sqids, CSV, DataFrames, StatsBase,
# Distances, PyCall`) — confirm it is loaded elsewhere (e.g. py_init.jl or
# Embeddings.jl), otherwise this call throws an UndefVarError.
function unicodebarplot(x, y, title = "Query Matches")
    UnicodePlots.barplot(x, y, title=title)
end
|
131 |
+
|
132 |
+
# Render the matches of a search result (Dict with "matches" entries carrying
# "score" and "metadata"/"text") as a unicode bar plot of score per snippet.
function searchresult_to_unicodeplot(searchresult)
    scores = [m["score"] for m in searchresult["matches"]]
    text = [m["metadata"]["text"] for m in searchresult["matches"]]
    ## reduce the text to 41 characters. `first(x, 41)` truncates by character
    ## count; the original `x[1:41]` indexed Strings by *byte* and threw a
    ## StringIndexError on multi-byte UTF-8 text (this library handles
    ## multilingual claims, e.g. Spanish).
    text_to_show = [length(x) > 41 ? first(x, 41) * "..." : x for x in text]
    unicodebarplot(text_to_show, scores)
end
|
139 |
+
|
140 |
+
"""
## Search and plot the results

import OstreaCultura as OC
OC.searchplot("drama", "oc-hybrid-library-index", "immigration")
"""
function searchplot(claim, indexname, ocmodel; top_k=5, include_values=true, include_metadata=true)
    result = search(claim, indexname, ocmodel, top_k=top_k,
                    include_values=include_values, include_metadata=include_metadata)
    return searchresult_to_unicodeplot(result)
end
|
153 |
+
|
154 |
+
"""
## multi_embeddings(model, data, chunk_size, textcol)
Embed the `textcol` column of a pandas DataFrame in chunks of `chunk_size`
rows using a Pinecone inference model.

import OstreaCultura as OC
df = OC.DataLoader.pd.read_csv("data/climate_test.csv")
out = OC.multi_embeddings("multilingual-e5-large", df, 96, "text")
"""
function multi_embeddings(model, data, chunk_size, textcol)
    inference_client = create_inf_pinecone_context()
    return DataLoader.chunk_and_embed(inference_client, model, data, chunk_size, textcol)
end
|
168 |
+
|
169 |
+
"""
Embed a *Julia* DataFrame: converts to pandas first, then delegates to the
Python embedder. Keyword overrides: `model`, `chunk_size`, `textcol`.

using CSV, DataFrames
import OstreaCultura as OC
tdat = CSV.read("data/climate_test.csv", DataFrame)
OC.multi_embeddings(tdat)
"""
function multi_embeddings(data::DataFrames.DataFrame; kwargs...)
    pdframe = df_to_pd(data)
    # Defaults mirror the positional method.
    model = get(kwargs, :model, "multilingual-e5-large")
    chunk_size = get(kwargs, :chunk_size, 96)
    textcol = get(kwargs, :textcol, "text")
    inference_client = create_inf_pinecone_context()
    return DataLoader.chunk_and_embed(inference_client, model, pdframe, chunk_size, textcol)
end
|
183 |
+
|
184 |
+
"""
## Julia DataFrame to pandas DataFrame

`pdataframe` is the `Pandas.DataFrame` constructor aliased in OstreaCultura.jl.
"""
function df_to_pd(df::DataFrames.DataFrame)
    pdataframe(df)
end
|
190 |
+
|
191 |
+
# Embed a single query string and return its dense embedding vector.
# Extra kwargs are accepted for call-site compatibility but unused.
function embed_query(querytext; kwargs...)
    # Wrap the query in a one-row frame so the DataFrame embedder can run.
    embedded = multi_embeddings(DataFrame(id = "vec1", text = querytext))
    return embedded.Embeddings[1]
end
|
197 |
+
|
198 |
+
"""
## Query with a vector of embeddings
import OstreaCultura as OC
vector = rand(1024)
vecresults = OC.query_w_vector(vector, "test-index", "test-namespace")

Returns a DataFrame with one row per match (id, score, metadata fields, and —
when `include_values=true` — a `values` column holding the stored embedding).
Keywords: `top_k` (default 5), `include_values` (default true).
"""
function query_w_vector(vector, indexname, namespace; kwargs...)
    top_k = get(kwargs, :top_k, 5)
    include_values = get(kwargs, :include_values, true)
    pc = create_pinecone_context()
    index = pc.Index(indexname)
    # Raw Pinecone response as a Julia Dict of matches.
    queryresults = DataLoader.query_data(index, namespace, vector, top_k, include_values).to_dict()
    ##
    # Pull the embedding vectors aside first (missing placeholders otherwise),
    # because they cannot be spliced into the per-match DataFrame rows below.
    if include_values
        values_vector = [queryresults["matches"][i]["values"] for i in 1:length(queryresults["matches"])]
    else
        values_vector = [missing for i in 1:length(queryresults["matches"])]
    end
    # drop the "values" key from each dict so it doesn't get added to the DataFrame
    for i in 1:length(queryresults["matches"])
        delete!(queryresults["matches"][i], "values")
    end
    # One DataFrame row per match, concatenated in result order.
    out = DataFrame()
    for i in 1:length(queryresults["matches"])
        out = vcat(out, DataFrame(queryresults["matches"][i]))
    end
    # If desired update this function to add the embeddings to the DataFrame
    if include_values
        out[:, "values"] = values_vector
    end

    return out
end
|
233 |
+
|
234 |
+
"""
Parse a Pinecone fetch response (`.to_dict()` form) into a DataFrame.

One row per fetched vector id, with the stored metadata fields as columns and
an `id` column appended. Returns an empty DataFrame (with an @info message)
when nothing was fetched.

import OstreaCultura as OC
index = OC.create_pinecone_context().Index("test-index")
resultfetch = OC.DataLoader.fetch_data(index, ids, "test-namespace").to_dict()
OC.parse_fetched_results(resultfetch)
"""
function parse_fetched_results(resultfetch)
    if length(resultfetch["vectors"]) > 0
        ids = collect(keys(resultfetch["vectors"]))
        ## Grab the MetaData for each returned id (iteration order of `ids`)
        data = [resultfetch["vectors"][id]["metadata"] for id in ids]
        ## Create a DataFrame from the metadata. `cols=:union` tolerates rows
        ## whose metadata carries differing key sets (missing cells become
        ## `missing`); the original only fell back to :union in a try/catch
        ## after the plain vcat had already thrown.
        out = DataFrame()
        for meta in data
            out = vcat(out, DataFrame(meta), cols=:union)
        end
        out[!, :id] = ids
        return out
    else
        @info "No data found"
        return DataFrame()
    end
end
|
271 |
+
|
272 |
+
"""
Fetch stored vectors/metadata for `ids` from a Pinecone index, in chunks
(Pinecone caps the number of ids per fetch call), and return one combined
DataFrame (see `parse_fetched_results`).

import OstreaCultura as OC
ids = ["OSJeL7", "3TxWTNpPn"]
query_results_as_dataframe = OC.fetch_data(ids, "test-index", "test-namespace")
"""
function fetch_data(ids, indexname, namespace; chunk_size=900)
    pc = create_pinecone_context()
    index = pc.Index(indexname)
    result_out = DataFrame()
    for i in 1:ceil(Int, length(ids)/chunk_size)
        # Slice out the i-th chunk of ids (the last chunk may be short).
        chunk = ids[(i-1)*chunk_size+1:min(i*chunk_size, length(ids))]
        resultfetch = DataLoader.fetch_data(index, chunk, namespace).to_dict()
        result_out = vcat(result_out, parse_fetched_results(resultfetch))
    end
    return result_out
end
|
292 |
+
|
293 |
+
"""
## FINAL Query function - embeds, queries, and fetches data
import OstreaCultura as OC
OC.query("drama", "test-index", "test-namespace")

Keywords: `top_k` (default 5), `include_values` (default true).
"""
function query(querytext::String, indexname::String, namespace::String; kwargs...)
    top_k = get(kwargs, :top_k, 5)
    include_values = get(kwargs, :include_values, true)
    # Embed the text, find the nearest matches, then hydrate them with the
    # stored metadata and join the two frames on id.
    matches = query_w_vector(embed_query(querytext), indexname, namespace,
                             top_k=top_k, include_values=include_values)
    details = fetch_data(matches.id, indexname, namespace)
    return innerjoin(matches, details, on=:id)
end
|
312 |
+
|
313 |
+
# Keep only claim matches whose similarity to the claim beats their similarity
# to the counterclaim. NOTE: `rename!` mutates the caller's DataFrames.
function filter_claims_closer_to_counterclaims(claim_results, counterclaim_results)
    # Rename scores to avoid conflicts in the join.
    rename!(claim_results, :score => :claim_score)
    rename!(counterclaim_results, :score => :counterclaim_score)
    # Left join: keep every claim match even when no counterclaim match exists.
    joined = leftjoin(claim_results, counterclaim_results, on=:id)
    # Treat "no counterclaim match" as a counterclaim score of zero.
    joined.counterclaim_score = coalesce.(joined.counterclaim_score, 0.0)
    # Keep only rows where the claim score is greater than the counterclaim score.
    return joined[joined.claim_score .> joined.counterclaim_score, :]
end
|
325 |
+
|
326 |
+
"""
## Query with claims and counterclaims
import OstreaCultura as OC

hi = OC.query_claims("Climate change is a hoax", "Climate change is real",
                     "test-index", "test-namespace")

Keeps matches that are closer to `claim` than to `counterclaim` and whose
claim score exceeds `threshold` (default 0.8), then fetches and joins the
stored metadata. `top_k` (default 5000) bounds the initial queries.
"""
function query_claims(claim::String, counterclaim::String, indexname::String, namespace::String; kwargs...)
    threshold = get(kwargs, :threshold, 0.8)
    top_k = get(kwargs, :top_k, 5000) # top_k for the initial query
    # Embed both sides.
    claim_vector = embed_query(claim)
    counterclaim_vector = embed_query(counterclaim)
    # Run the two similarity queries.
    claim_results = query_w_vector(claim_vector, indexname, namespace, top_k=top_k, include_values=false)
    counterclaim_results = query_w_vector(counterclaim_vector, indexname, namespace, top_k=top_k, include_values=false)
    # Keep ids scoring higher on the claim than on the counterclaim.
    allscores = filter_claims_closer_to_counterclaims(claim_results, counterclaim_results)
    # Then apply the absolute threshold.
    allscores = allscores[allscores.claim_score .> threshold, :]
    if size(allscores)[1] == 0
        @info "No claims were above the threshold"
        return DataFrame()
    end
    # Hydrate the survivors with their stored metadata and join on id.
    resulting_data = fetch_data(allscores.id, indexname, namespace)
    return innerjoin(allscores, resulting_data, on=:id)
end
|
360 |
+
|
361 |
+
|
362 |
+
"""
## Classify a claim against the existing misinformation library
import OstreaCultura as OC

claim = "There is a lot of dispute about whether the Holocaust happened"
counterclaim = "The Holocaust is a well-documented historical event"
hi, counterscore = OC.classify_claim(claim, counterclaim, "ostreacultura-v1", "modified-misinfo-library")

Returns `(resulting_data, counterclaim_score)`: `resulting_data` holds the
library entries whose similarity to `claim` exceeds `threshold` (default 0.8),
with their metadata and a `scores` column; `counterclaim_score` is the best
similarity of the counterclaim (0.0 when `counterclaim == ""` or no match).
Keywords: `threshold`, `top_k` (default 10).
"""
function classify_claim(claim::String, counterclaim::String, indexname::String, namespace::String; kwargs...)
    threshold = get(kwargs, :threshold, 0.8)
    top_k = get(kwargs, :top_k, 10) # top_k for the initial query
    claim_vector = embed_query(claim)
    if counterclaim != ""
        counterclaim_vector = embed_query(counterclaim)
        counterclaim_results = query_w_vector(counterclaim_vector, indexname, namespace, top_k=top_k, include_values=false)
        # Guard against an empty result set (e.g. an empty namespace); the
        # original indexed `.score[1]` unconditionally.
        counterclaim_score = size(counterclaim_results, 1) > 0 ? counterclaim_results.score[1] : 0.0
    else
        counterclaim_score = 0.0
    end
    # Query the claim embedding and apply the threshold.
    claim_results = query_w_vector(claim_vector, indexname, namespace, top_k=top_k, include_values=false)
    claim_results = claim_results[claim_results.score .> threshold, :]
    ## Fetch the stored metadata for the surviving ids.
    resulting_data = fetch_data(claim_results.id, indexname, namespace)
    # BUGFIX: `fetch_data` returns rows keyed by id in *unspecified order*
    # (Pinecone fetch results come back as a dict), so the original
    # positional assignment `resulting_data.scores = claim_results.score`
    # could attach the wrong score to a row. Join on id instead.
    scores_by_id = rename(claim_results[:, [:id, :score]], :score => :scores)
    resulting_data = innerjoin(scores_by_id, resulting_data, on=:id)
    return resulting_data, counterclaim_score
end
|
409 |
+
|
410 |
+
# Fit a BM25 sparse encoder over a ~300k-document sample corpus; returns the
# (vector, bm25) pair produced by the Python-side encoder.
function generate_sparse_model()
    df = DataLoader.pd.read_csv("data/random_300k.csv")
    corpus = df["text"].tolist()
    # This function lives *inside* the OstreaCultura module, so the original
    # `OC.DataLoader.encode_documents` referenced an undefined name `OC`;
    # call the module-local DataLoader binding directly.
    vector, bm25 = DataLoader.encode_documents(corpus)
    return vector, bm25
end
|
src/bash/update_fact_checks.sh
ADDED
@@ -0,0 +1,14 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
#!/bin/bash

# Script to run periodic updates for the fact-check model

# Set the working directory (bail out if it is missing so the julia call
# below cannot run against the wrong directory).
cd /home/ubuntu/fact-check || exit 1

# Path to julia
# NOTE(review): the working dir is under /home/ubuntu but julia lives under
# /home/swojcik — confirm both paths are correct for the deployment host.
JULIA=/home/swojcik/.juliaup/bin/julia

# Run load_fact_check_json() from google_fact_check_api.jl to get the latest data.
# The include path must be a Julia *string literal*: the original used
# backticks, which construct a Cmd object in Julia and make include() throw.
$JULIA -e 'include("src/google_fact_check_api.jl"); load_fact_check_json()'

# Run the python script that goes and updates the fact-check model data
|
src/deprecated/Narrative.jl
ADDED
@@ -0,0 +1,242 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
## Structure of a Narrative
|
2 |
+
|
3 |
+
# Generate a short opaque id by Sqids-encoding two random small integers.
function randid()
    sqids_config = Sqids.configure() # Local configuration
    return Sqids.encode(sqids_config, [rand(1:100), rand(1:100)])
end
|
8 |
+
|
9 |
+
# Current time as milliseconds since the Unix epoch.
function timestamp()
    elapsed = now() - unix2datetime(0)
    return elapsed.value
end
|
12 |
+
|
13 |
+
"""
Inverse of `timestamp()`: convert milliseconds-since-epoch to a DateTime.

ts_to_time(timestamp()) == now()
"""
function ts_to_time(ts)
    seconds = ts / 1000
    return unix2datetime(seconds)
end
|
19 |
+
|
20 |
+
"""
Claim: something that supports a misinformation narrative

id: unique identifier for the claim
claim: text of the claim
counterclaim: text of the counterclaim
claimembedding: embedding of the claim (`nothing` until computed)
counterclaimembedding: embedding of the counterclaim (`nothing` until computed)
created_at: when the claim was created, in milliseconds since the Unix epoch (see `timestamp()`)
updated_at: when the claim was last updated, in milliseconds since the Unix epoch
source: source of the claim
keywords: keywords associated with the claim (may be `nothing`)
"""
mutable struct Claim
    id::String
    claim::String # claim text
    counterclaim::String # counterclaim text
    claimembedding::Union{Array{Float32, 1}, Nothing} # embedding of the claim
    counterclaimembedding::Union{Array{Float32, 1}, Nothing} # embedding of the counterclaim
    created_at::Int64 # date the claim was created
    updated_at::Int64 # date the claim was last updated
    source::String # source of the claim
    keywords::Union{Array{String, 1}, Nothing} # keywords associated with the claim
end
|
44 |
+
|
45 |
+
"""
createClaim(claim::String, counterclaim::String, source::String, keywords::Array{String, 1})

Create a new Claim object with the given claim, counterclaim, source, and keywords.
The claim and counterclaim embeddings are set to nothing by default;
`created_at` and `updated_at` are both stamped with the current time.

Example (NOTE: the required `keywords` argument was missing from the original example):
createClaim("Solar panels poison the soil and reduce crop yields",
    "There is no evidence that solar panels poison the soil or reduce crop yields",
    "Facebook post",
    String["solar", "agriculture"])
"""
function createClaim(claim::String, counterclaim::String, source::String, keywords::Array{String, 1})
    return Claim(randid(), claim, counterclaim, nothing, nothing, timestamp(), timestamp(), source, keywords)
end
|
59 |
+
|
60 |
+
|
61 |
+
"""
Narrative: a collection of claims that support a misinformation narrative

id: unique identifier for the narrative
title: descriptive title of the narrative
topic: broad type of narrative (e.g., anti-semitism)
target: target group/topic of the narrative
narrativesummary: base narrative text
claims: list of Claim objects

Example (NOTE: `claims` must be a Vector{Claim}; the original example passed
`nothing`, which does not convert to Vector{Claim}):
example_narrative = Narrative(
    randid(),
    "Jews killed Jesus",
    "Anti-semitism",
    "Jews",
    "Jews are responsible for the death of Jesus",
    Claim[])
"""
mutable struct Narrative
    id::String
    title::String # descriptive title (e.g., Jews killed Jesus)
    topic::String # broad type of narrative (e.g., anti-semitism)
    target::String # target group/topic of the narrative
    narrativesummary::String # base narrative text (e.g., Jews are responsible for the death of Jesus)
    claims::Vector{Claim} # list of Claim objects
end
|
88 |
+
|
89 |
+
"""
## TODO: When you have a lot of narratives, you can create a NarrativeSet
- If you apply a narrative set over a database, it will perform classification using all the narratives
  (classification logic is not implemented here yet — this is only the container)
"""
mutable struct NarrativeSet
    narratives::Vector{Narrative} # the member narratives, applied together
end
|
97 |
+
|
98 |
+
import Base: show
## Make the Narrative pretty to show - extends Base.show so a Narrative
## renders as its title, topic, target, summary, and one bullet per claim.
function show(io::IO, narrative::Narrative)
    println(io, "Narrative: $(narrative.title)")
    println(io, "Topic: $(narrative.topic)")
    println(io, "Target: $(narrative.target)")
    println(io, "Narrative Summary: $(narrative.narrativesummary)")
    println(io, "Claims:")
    for claim in narrative.claims
        println(io, " - $(claim.claim)")
    end
end
|
110 |
+
|
111 |
+
"""
add_claim!(narrative::Narrative, claim::Claim)

Add a claim to a narrative (appends in place to `narrative.claims`).

Example:
add_claim!(example_narrative, example_claim)
"""

function add_claim!(narrative::Narrative, claim::Claim)
    push!(narrative.claims, claim)
end
|
123 |
+
|
124 |
+
# Remove every claim whose id matches `claim_id`; rebinds `narrative.claims`
# to a freshly built vector.
function remove_claim!(narrative::Narrative, claim_id::String)
    remaining = [c for c in narrative.claims if c.id != claim_id]
    narrative.claims = remaining
end
|
127 |
+
|
128 |
+
# Flatten a Narrative into one DataFrame row per claim, with the narrative
# title repeated on every row.
function narrative_to_dataframe(narrative::Narrative)
    cs = narrative.claims
    column = f -> [f(c) for c in cs]
    return DataFrame(
        narrative_title = narrative.title,
        id = column(c -> c.id),
        claim = column(c -> c.claim),
        counterclaim = column(c -> c.counterclaim),
        claimembedding = column(c -> c.claimembedding),
        counterclaimembedding = column(c -> c.counterclaimembedding),
        created_at = column(c -> c.created_at),
        updated_at = column(c -> c.updated_at),
        source = column(c -> c.source),
        keywords = column(c -> c.keywords),
    )
end
|
141 |
+
|
142 |
+
"""
|
143 |
+
# Collapse a dataframe into a narrative
|
144 |
+
"""
|
145 |
+
# Rebuild a Narrative from a claim-per-row DataFrame (inverse of
# narrative_to_dataframe). Topic and target are left empty; a fresh id is
# generated via `randid` (defined elsewhere in this package).
function dataframe_to_narrative(df::DataFrame, narrative_title::String, narrative_summary::String)
    claims = [Claim(row.id, row.claim, row.counterclaim, row.claimembedding, row.counterclaimembedding, row.created_at, row.updated_at, row.source, row.keywords) for row in eachrow(df)]
    return Narrative(randid(), narrative_title, "", "", narrative_summary, claims)
end
|
149 |
+
|
150 |
+
# Drop claims whose text duplicates an earlier claim, keeping the first
# occurrence of each text (in-place; also returns the narrative for chaining).
function deduplicate_claims_in_narrative!(narrative::Narrative)
    ## check which claims are non-unique in the set
    claims = [claim.claim for claim in narrative.claims]
    # nonunique marks the 2nd and later occurrences of each repeated text
    is_duplicated = nonunique(DataFrame(claim=claims))
    # Get ID's of duplicated claims then remove them
    if length(claims[findall(is_duplicated)]) > 0
        for dupclaim in claims[findall(is_duplicated)]
            id_dup = [claim.id for claim in narrative.claims if claim.claim == dupclaim]
            # Remove all claims except the first one
            [remove_claim!(narrative, id) for id in id_dup[2:end]]
        end
    end
    return narrative
end
|
164 |
+
|
165 |
+
"""
|
166 |
+
## Embeddings to recover narratives
|
167 |
+
cand_embeddings = candidate_embeddings_from_narrative(narrative)
|
168 |
+
- Input: narrative
|
169 |
+
- Output: candidate embeddings - embeddings of text that match the regex defined in claims
|
170 |
+
|
171 |
+
"""
|
172 |
+
"""
    candidate_embeddings(candidates::DataFrame; kwargs...)::DataFrame

Embed the text column of `candidates` and attach the vectors as an
`"Embeddings"` column, returning the (mutated) DataFrame.

Keyword arguments:
- `model_id`: embedding model to use (default `"text-embedding-3-small"`).
- `textcol`: name of the text column (default `"text"`).

Throws an error if `textcol` is not a column of `candidates`.
"""
function candidate_embeddings(candidates::DataFrame; kwargs...)::DataFrame
    model_id = get(kwargs, :model_id, "text-embedding-3-small")
    textcol = get(kwargs, :textcol, "text")
    # Bug fix: `!textcol in names(candidates)` parses as `(!textcol) in ...`,
    # which applies `!` to a String and always throws a MethodError.
    # Parenthesize so the membership test is negated instead.
    if !(textcol in names(candidates))
        error("Text column not found in the dataframe, try specifying the text column using the textcol keyword argument")
    end
    ## Data Embeddings
    # NOTE(review): relies on `create_chunked_embeddings`; this file defines
    # `create_openai_chunked_embeddings` — confirm the callee exists elsewhere.
    cand_embeddings = create_chunked_embeddings(candidates[:, textcol]; model_id=model_id);
    ## Add vector of embeddings to dataset
    candidates[:, "Embeddings"] = [x for x in cand_embeddings]
    return candidates
end
|
185 |
+
## Embeddings
|
186 |
+
|
187 |
+
"""
|
188 |
+
df = CSV.read("data/random_300k.csv", DataFrame)
|
189 |
+
df = filter(:message => x -> occursin(Regex("climate"), x), df)
|
190 |
+
embeds = create_chunked_embeddings(df[:, "message"]; chunk_size=10)
|
191 |
+
|
192 |
+
"""
|
193 |
+
# Embed `texts` with the OpenAI embeddings API in batches of `chunk_size`,
# returning a flat vector with one embedding per input string.
# Reads the API key from ENV["OPENAI_API_KEY"].
function create_openai_chunked_embeddings(texts; model_id="text-embedding-3-small", chunk_size=1000)
    ## Chunk the data
    embeddings = []
    for chunk in 1:chunk_size:length(texts)
        embeddings_resp = create_embeddings(ENV["OPENAI_API_KEY"],
            texts[chunk:min(chunk+chunk_size-1, length(texts))]; model_id=model_id)
        push!(embeddings, [x["embedding"] for x in embeddings_resp.response["data"]])
    end
    # flatten the per-chunk lists into one vector
    return vcat(embeddings...)
end
|
203 |
+
|
204 |
+
"""
|
205 |
+
## Embeddings of narrative claims
|
206 |
+
- bang because it modifies the narrative object in place
|
207 |
+
include("src/ExampleNarrative.jl")
|
208 |
+
include("src/Narrative.jl")
|
209 |
+
climate_narrative = create_example_narrative();
|
210 |
+
generate_claim_embeddings_from_narrative!(climate_narrative)
|
211 |
+
|
212 |
+
"""
|
213 |
+
# Compute and store embeddings for every claim and counterclaim in-place.
# NOTE(review): calls `create_chunked_embeddings`, but this file defines
# `create_openai_chunked_embeddings` — confirm the helper resolves elsewhere.
function generate_openai_claim_embeddings_from_narrative!(narrative::Narrative)
    ## claim embeddings
    claim_embeddings = create_chunked_embeddings([x.claim for x in narrative.claims])
    [narrative.claims[i].claimembedding = claim_embeddings[i] for i in 1:length(narrative.claims)]
    ## counterclaim embeddings
    counterclaim_embeddings = create_chunked_embeddings([x.counterclaim for x in narrative.claims])
    [narrative.claims[i].counterclaimembedding = counterclaim_embeddings[i] for i in 1:length(narrative.claims)]
    return nothing
end
|
222 |
+
|
223 |
+
"""
|
224 |
+
## Embeddings of candidate data
|
225 |
+
cand_embeddings = candidate_embeddings_from_narrative(narrative)
|
226 |
+
- Input: narrative
|
227 |
+
- Output: candidate embeddings - embeddings of text that match the regex defined in claims
|
228 |
+
|
229 |
+
"""
|
230 |
+
"""
    candidate_openai_embeddings(candidates::DataFrame; kwargs...)::DataFrame

OpenAI variant of `candidate_embeddings`: embed the text column of
`candidates` and attach the vectors as an `"Embeddings"` column.

Keyword arguments:
- `model_id`: embedding model to use (default `"text-embedding-3-small"`).
- `textcol`: name of the text column (default `"text"`).

Throws an error if `textcol` is not a column of `candidates`.
"""
function candidate_openai_embeddings(candidates::DataFrame; kwargs...)::DataFrame
    model_id = get(kwargs, :model_id, "text-embedding-3-small")
    textcol = get(kwargs, :textcol, "text")
    # Bug fix: `!textcol in names(candidates)` parses as `(!textcol) in ...`,
    # which applies `!` to a String and always throws. Parenthesize the test.
    if !(textcol in names(candidates))
        error("Text column not found in the dataframe, try specifying the text column using the textcol keyword argument")
    end
    ## Data Embeddings
    cand_embeddings = create_chunked_embeddings(candidates[:, textcol]; model_id=model_id);
    ## Add vector of embeddings to dataset
    candidates[:, "Embeddings"] = [x for x in cand_embeddings]
    return candidates
end
|
src/deprecated/NarrativeClassification.jl
ADDED
@@ -0,0 +1,107 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
## Database retrieval based on keywords
|
2 |
+
## need to ] add [email protected]
|
3 |
+
|
4 |
+
|
5 |
+
"""
|
6 |
+
## Calculates distances and assigns tentative classification
|
7 |
+
"""
|
8 |
+
# Pairwise cosine distances between the target and narrative embedding
# matrices (one embedding per column, hence dims=2). Returns a tuple:
# (minimum distance per target column, CartesianIndex of the nearest claim).
function distances_and_classification(narrative_matrix, target_matrix)
    distances = pairwise(CosineDist(), target_matrix, narrative_matrix, dims=2)
    # get the index of the column with the smallest distance
    return distances[argmin(distances, dims=2)][:, 1], argmin(distances, dims=2)[:, 1]
end
|
13 |
+
|
14 |
+
"""
|
15 |
+
## Assignments of closest claim and counterclaim to the test data
|
16 |
+
"""
|
17 |
+
# For each target row, record the distance to — and the text of — the closest
# narrative claim (or counterclaim, per the `claim_counter_claim` kwarg).
# Mutates `target_embeddings` by adding "<kind>Dist" and "Closest<kind>"
# columns; returns nothing.
function assignments!(narrative_matrix, target_matrix, narrative_embeddings, target_embeddings; kwargs...)
    claim_counter_claim = get(kwargs, :claim_counter_claim, "claim")
    dists, narrative_assignment = distances_and_classification(narrative_matrix, target_matrix)
    target_embeddings[:, "$(claim_counter_claim)Dist"] = dists
    # x is a CartesianIndex; x[2] is the matched claim's column/row index
    target_embeddings[:, "Closest$(claim_counter_claim)"] = [narrative_embeddings[x[2], claim_counter_claim] for x in narrative_assignment[:, 1]]
    return nothing
end
|
24 |
+
|
25 |
+
"""
|
26 |
+
## Get distances and assign the closest claim to the test data
|
27 |
+
|
28 |
+
include("src/Narrative.jl")
|
29 |
+
include("src/NarrativeClassification.jl")
|
30 |
+
climate_narrative = create_example_narrative();
|
31 |
+
generate_claim_embeddings_from_narrative!(climate_narrative)
|
32 |
+
candidate_data = candidate_embeddings(climate_narrative)
|
33 |
+
get_distances!(climate_narrative, candidate_data)
|
34 |
+
"""
|
35 |
+
# Compute claim and counterclaim cosine distances for every row of
# `target_embeddings` (mutated in place via `assignments!`); returns nothing.
function get_distances!(narrative::Narrative, target_embeddings::DataFrame)
    ## Matrix of embeddings (hcat makes one column per claim)
    narrative_embeddings = narrative_to_dataframe(narrative)
    narrative_matrix = hcat([claim.claimembedding for claim in narrative.claims]...)
    counternarrative_matrix = hcat([claim.counterclaimembedding for claim in narrative.claims]...)
    target_matrix = hcat(target_embeddings[:, "Embeddings"]...)
    # Create a search function
    # Assign the closest claim to the test data
    assignments!(narrative_matrix, target_matrix, narrative_embeddings, target_embeddings, claim_counter_claim="claim")
    # Assign the closest counterclaim to the test data
    assignments!(counternarrative_matrix, target_matrix, narrative_embeddings, target_embeddings, claim_counter_claim="counterclaim")
    return nothing
end
|
48 |
+
|
49 |
+
# Gate logic: label a row positive ("OCLabel" = 1) only when it is BOTH within
# `threshold` cosine distance of some claim AND closer to that claim than to
# the nearest counterclaim. All other rows get 0. Mutates in place.
function apply_gate_logic!(target_embeddings; kwargs...)
    threshold = get(kwargs, :threshold, 0.2)
    # Find those closer to claim than counter claim
    closer_to_claim = findall(target_embeddings[:, "claimDist"] .< target_embeddings[:, "counterclaimDist"])
    # Meets the threshold
    meets_threshold = findall(target_embeddings[:, "claimDist"] .< threshold)
    # Meets the threshold and is closer to claim than counter claim
    target_embeddings[:, "OCLabel"] .= 0
    target_embeddings[intersect(meets_threshold, closer_to_claim), "OCLabel"] .= 1
    return nothing
end
|
60 |
+
|
61 |
+
"""
|
62 |
+
## Deploy the narrative model
|
63 |
+
- Input: narrative, threshold
|
64 |
+
|
65 |
+
include("src/Narrative.jl")
|
66 |
+
include("src/NarrativeClassification.jl")
|
67 |
+
include("src/ExampleNarrative.jl")
|
68 |
+
climate_narrative = create_example_narrative();
|
69 |
+
generate_claim_embeddings_from_narrative!(climate_narrative)
|
70 |
+
candidate_data = candidate_embeddings_from_narrative(climate_narrative)
|
71 |
+
get_distances!(climate_narrative, candidate_data)
|
72 |
+
apply_gate_logic!(candidate_data; threshold=0.2)
|
73 |
+
return_top_labels(candidate_data)
|
74 |
+
|
75 |
+
"""
|
76 |
+
# Return up to `top_labels` positively-labelled rows (OCLabel == 1), sorted by
# ascending claim distance so the best matches come first.
function return_top_labels(target_embeddings; kwargs...)
    top_labels = get(kwargs, :top_labels, 10)
    # Filter to "OCLabel" == 1
    out = target_embeddings[findall(target_embeddings[:, "OCLabel"] .== 1), :]
    # sort by claimDist
    sort!(out, :claimDist)
    # min() guards against fewer than top_labels positive rows
    return out[1:min(top_labels, nrow(out)), :]
end
|
84 |
+
|
85 |
+
# Return all rows gate-labelled positive (OCLabel == 1), unsorted.
function return_positive_candidates(target_embeddings)
    return target_embeddings[findall(target_embeddings[:, "OCLabel"] .== 1), :]
end
|
88 |
+
|
89 |
+
"""
|
90 |
+
## Deploy the narrative model
|
91 |
+
- Input: narrative, threshold
|
92 |
+
|
93 |
+
include("src/Narrative.jl")
|
94 |
+
include("src/NarrativeClassification.jl")
|
95 |
+
include("src/ExampleNarrative.jl")
|
96 |
+
climate_narrative = create_example_narrative();
|
97 |
+
deploy_narrative_model!(climate_narrative; threshold=0.2)
|
98 |
+
"""
|
99 |
+
# End-to-end (deprecated) pipeline: embed the narrative's claims, embed the
# candidate texts from `db`, compute distances, gate-label, and return the
# labelled candidate DataFrame.
# NOTE(review): calls `generate_claim_embeddings_from_narrative!` and
# `candidate_embeddings_from_narrative`, which are not defined under those
# names in this file — confirm they resolve elsewhere before reviving this.
function deploy_narrative_model!(narrative::Narrative; kwargs...)
    threshold = get(kwargs, :threshold, 0.2)
    db = get(kwargs, :db, "data/random_300k.csv")
    generate_claim_embeddings_from_narrative!(narrative)
    candidate_data = candidate_embeddings_from_narrative(narrative; db=db)
    get_distances!(narrative, candidate_data)
    apply_gate_logic!(candidate_data, threshold=threshold)
    return candidate_data
end
|
src/dev/Utils.jl
ADDED
@@ -0,0 +1,73 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
## Utility Functions
|
2 |
+
## Note: edit ~/.bigqueryrc to set global settings for bq command line tool
|
3 |
+
|
4 |
+
|
5 |
+
"""
|
6 |
+
## ostreacultura_bq_auth()
|
7 |
+
- Activate the service account using the credentials file
|
8 |
+
"""
|
9 |
+
# Activate the GCP service account from the local credentials file.
# Prints a message (rather than erroring) when the file is absent.
function ostreacultura_bq_auth()
    if isfile("ostreacultura-credentials.json")
        run(`gcloud auth activate-service-account --key-file=ostreacultura-credentials.json`)
    else
        println("Credentials file not found")
    end
end
|
16 |
+
|
17 |
+
"""
|
18 |
+
## bq(query::String)
|
19 |
+
- Run a BigQuery query and return the result as a DataFrame
|
20 |
+
|
21 |
+
Example: bq("SELECT * FROM ostreacultura.climate_truth.training LIMIT 10")
|
22 |
+
"""
|
23 |
+
# Run a BigQuery SQL query through the `bq` CLI and parse its CSV output
# (written to a temporary file) into a DataFrame.
function bq(query::String)
    tname = tempname()
    run(pipeline(`bq query --use_legacy_sql=false --format=csv $query`, tname))
    return CSV.read(tname, DataFrame)
end
|
28 |
+
|
29 |
+
"""
|
30 |
+
## bq_db(query::String, db::String)
|
31 |
+
- Run a BigQuery query and save to a database
|
32 |
+
|
33 |
+
Example:
|
34 |
+
bq_db("SELECT * FROM ostreacultura.climate_truth.training LIMIT 10", "data/test.csv")
|
35 |
+
"""
|
36 |
+
# Run a BigQuery query and stream the CSV result straight into the file `db`.
function bq_db(query::String, db::String)
    run(pipeline(`bq query --use_legacy_sql=false --format=csv $query`, db))
end
|
39 |
+
|
40 |
+
"""
|
41 |
+
one token is roughly 3/4 of a word
|
42 |
+
|
43 |
+
"""
|
44 |
+
"""
    token_estimate(allstrings::Vector{String})

Rough token count for a batch of strings: total whitespace-separated word
count scaled by 4/3 (one token ≈ 3/4 of a word).
"""
function token_estimate(allstrings::Vector{String})
    wordcount = sum([length(split(s)) for s in allstrings])
    return wordcount * 4 / 3
end
|
51 |
+
|
52 |
+
"""
    chunk_by_tokens(allstrings::Vector{String}, max_tokens::Int=8191)

Greedily split `allstrings` into consecutive chunks whose running
whitespace-word count stays below `max_tokens`.

Returns a vector of chunks (each a vector of strings). An empty input
returns an empty vector (previously it returned one empty chunk).

NOTE(review): the limit is compared against raw word counts, not the
4/3-scaled estimate from `token_estimate` — confirm that is intended.
"""
function chunk_by_tokens(allstrings::Vector{String}, max_tokens::Int=8191)
    # Guard: no strings means no chunks (avoids emitting a single empty chunk).
    isempty(allstrings) && return []
    ## Tokenize the strings (rough whitespace split)
    tokens = [split(x) for x in allstrings]
    ## Chunk the strings greedily; removed the previously-computed-but-unused
    ## total token estimate.
    chunks = []
    chunk = []
    chunk_tokens = 0
    for i in 1:length(allstrings)
        if chunk_tokens + length(tokens[i]) < max_tokens
            push!(chunk, allstrings[i])
            chunk_tokens += length(tokens[i])
        else
            push!(chunks, chunk)
            chunk = [allstrings[i]]
            chunk_tokens = length(tokens[i])
        end
    end
    push!(chunks, chunk)
    return chunks
end
|
src/py_init.jl
ADDED
@@ -0,0 +1,14 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
##
|
2 |
+
DataLoader = PyNULL()
|
3 |
+
MiniEncoder = PyNULL()
|
4 |
+
|
5 |
+
# PyCall module initializer: import the project's Python helper modules once
# at package load time and bind them into the pre-allocated PyNULL
# placeholders above (the standard PyCall pattern for precompiled packages).
function __init__()
    # Import DataLoader.py — make src/python visible to the Python interpreter
    pushfirst!(pyimport("sys")."path", "src/python");
    _DataLoader = pyimport("DataLoader")
    _MiniEncoder = pyimport("MiniEncoder")
    copy!(DataLoader, _DataLoader)
    copy!(MiniEncoder, _MiniEncoder)
end
|
13 |
+
|
14 |
+
|
src/python/DataLoader.py
ADDED
@@ -0,0 +1,344 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# pip install pinecone[grpc]
|
2 |
+
#from pinecone import Pinecone
|
3 |
+
from pinecone.grpc import PineconeGRPC as Pinecone
|
4 |
+
import os
|
5 |
+
import pandas as pd
|
6 |
+
import numpy as np
|
7 |
+
from pinecone import ServerlessSpec
|
8 |
+
from pinecone_text.sparse import BM25Encoder
|
9 |
+
|
10 |
+
## ID generation
|
11 |
+
from sqids import Sqids
|
12 |
+
sqids = Sqids()
|
13 |
+
#######
|
14 |
+
#import protobuf_module_pb2
|
15 |
+
#pc = Pinecone("5faec954-a6c5-4af5-a577-89dbd2e4e5b0")
|
16 |
+
|
17 |
+
##### EMBEDDINGS AND ENCODINGS
|
18 |
+
"""
|
19 |
+
## Embed in the inference API
|
20 |
+
df = pd.read_csv('data/Indicator_Test.csv')
|
21 |
+
pc = Pinecone("5faec954-a6c5-4af5-a577-89dbd2e4e5b0")
|
22 |
+
model = "multilingual-e5-large"
|
23 |
+
embeddings = bulk_embed(pc, model, df[1:96])
|
24 |
+
|
25 |
+
"""
|
26 |
+
def bulk_embed(pc, model, data, textcol='text'):
    """Embed one batch of texts with the Pinecone inference API.

    Args:
        pc: Pinecone client instance.
        model: embedding model name (e.g. "multilingual-e5-large").
        data: DataFrame (or mapping) whose ``textcol`` entries are embedded.
        textcol: column holding the passage text.

    Returns:
        The raw inference response (iterable of embedding records).
    """
    embeddings = pc.inference.embed(
        model,
        inputs=[x for x in data[textcol]],
        parameters={
            "input_type": "passage"  # passage-style embeddings (vs "query")
        }
    )
    return embeddings
|
35 |
+
|
36 |
+
|
37 |
+
def join_chunked_results(embeddings):
    """Flatten per-batch embedding responses into one list of dense vectors.

    Each element of *embeddings* is a response object whose ``.data`` holds
    records with a ``"values"`` entry (the dense vector). Order is preserved.
    """
    return [record["values"] for chunk in embeddings for record in chunk.data]
|
43 |
+
|
44 |
+
"""
|
45 |
+
## Chunk and embed in the inference API
|
46 |
+
df = pd.read_csv('data/climate_test.csv')
|
47 |
+
pc = Pinecone("5faec954-a6c5-4af5-a577-89dbd2e4e5b0")
|
48 |
+
model = "multilingual-e5-large"
|
49 |
+
embeddings = chunk_and_embed(pc, model, df)
|
50 |
+
## Upgrade this function to return a dataframe with the Embeddings as a new column
|
51 |
+
|
52 |
+
"""
|
53 |
+
def chunk_and_embed(pc, model, data, chunk_size=96, textcol='text'):
    """Embed ``data[textcol]`` in batches and attach the results in place.

    Adds an 'Embeddings' column and a sqids-derived 'id' column (ids depend
    only on row position), then returns the mutated DataFrame.
    chunk_size=96 matches the Pinecone inference API batch limit.
    """
    embeddings = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:min(i + chunk_size, len(data))]
        embeddings.append(bulk_embed(pc, model, chunk, textcol))
    chunked_embeddings = join_chunked_results(embeddings)
    data['Embeddings'] = chunked_embeddings
    # sqids is the module-level encoder instantiated at import time
    data['id'] = [sqids.encode([i, i+1, i+2]) for i in range(len(data))]
    return data
|
62 |
+
|
63 |
+
"""
|
64 |
+
## Query the embeddings
|
65 |
+
query = "What is the impact of climate change on the economy?"
|
66 |
+
embeddings = query_embed(pc, model, query)
|
67 |
+
"""
|
68 |
+
def query_embed(pc, model, query):
    """Embed a single query string; returns its dense vector (list of floats).

    Uses "query" input_type so the vector is comparable against passages
    embedded with input_type="passage" (see bulk_embed).
    """
    embeddings = pc.inference.embed(
        model,
        inputs=query,
        parameters={
            "input_type": "query"
        }
    )
    return embeddings[0]['values']
|
77 |
+
|
78 |
+
"""
|
79 |
+
### Sparse vector encoding
|
80 |
+
- write a function to embed
|
81 |
+
from pinecone_text.sparse import BM25Encoder
|
82 |
+
|
83 |
+
corpus = ["The quick brown fox jumps over the lazy dog",
|
84 |
+
"The lazy dog is brown",
|
85 |
+
"The fox is brown"]
|
86 |
+
|
87 |
+
# Initialize BM25 and fit the corpus.
|
88 |
+
bm25 = BM25Encoder()
|
89 |
+
#bm25.fit(corpus)
|
90 |
+
#bm25 = BM25Encoder.default()
|
91 |
+
doc_sparse_vector = bm25.encode_documents("The brown fox is quick")
|
92 |
+
|
93 |
+
vector, bm25 = encode_documents(corpus)
|
94 |
+
"""
|
95 |
+
def encode_documents(corpus):
    """Fit a BM25 encoder on *corpus* and sparse-encode that same corpus.

    Returns (sparse_vectors, fitted_encoder). Keep the returned encoder so
    later queries are encoded with the same fitted statistics.
    """
    bm25 = BM25Encoder()
    bm25.fit(corpus)
    doc_sparse_vector = bm25.encode_documents(corpus)
    return doc_sparse_vector, bm25
|
100 |
+
|
101 |
+
def encode_query(bm25, query):
    """Sparse-encode a query with a previously fitted BM25 encoder."""
    query_sparse_vector = bm25.encode_queries(query)
    return query_sparse_vector
|
104 |
+
|
105 |
+
"""
|
106 |
+
## Generate format of sparse-dense vectors
|
107 |
+
# Example usage
|
108 |
+
df = pd.read_csv('data/Indicator_Test.csv')
|
109 |
+
df = df.head(3)
|
110 |
+
newdf = create_sparse_embeds(df)
|
111 |
+
newdf['metadata'] = newdf.metadata.to_list()
|
112 |
+
|
113 |
+
"""
|
114 |
+
def create_sparse_embeds(pc, df, textcol='text', idcol='id', model="multilingual-e5-large"):
    """Build hybrid (sparse + dense) vector columns for *df*.

    Side effects: mutates *df* in place (chunk_and_embed adds 'Embeddings'
    and 'id'; 'Embeddings' is then renamed to 'values').

    Returns:
        (fitted BM25 encoder,
         DataFrame with columns [idcol, 'values', 'metadata', 'indices',
         'sparse_values']) — 'metadata' holds all remaining original columns
         as per-row dicts.
    """
    endocs, bm25 = encode_documents(df[textcol].to_list())
    chunk_and_embed(pc, model, df) # this is an in-place operation
    # rename Embeddings to values
    df.rename(columns={'Embeddings': 'values'}, inplace=True)
    df['sparse_values'] = [x['values'] for x in endocs]
    df['indices'] = [x['indices'] for x in endocs]
    df['metadata'] = df.drop(columns=[idcol, 'values', 'indices', 'sparse_values']).to_dict(orient='records')
    df = df[[idcol, 'values', 'metadata', 'indices', 'sparse_values']]
    return bm25, df
|
124 |
+
|
125 |
+
"""
|
126 |
+
## Generate format of sparse-dense vectors
|
127 |
+
# Example usage
|
128 |
+
data = {
|
129 |
+
'id': ['vec1', 'vec2'],
|
130 |
+
'values': [[0.1, 0.2, 0.3], [0.2, 0.3, 0.4]],
|
131 |
+
'metadata': [{'genre': 'drama', 'text': 'this'}, {'genre': 'action'}],
|
132 |
+
'sparse_indices': [[10, 45, 16], [12, 34, 56]],
|
133 |
+
'sparse_values': [[0.5, 0.5, 0.2], [0.3, 0.4, 0.1]]
|
134 |
+
}
|
135 |
+
|
136 |
+
df = pd.DataFrame(data)
|
137 |
+
sparse_dense_dicts = create_sparse_dense_dict(df)
|
138 |
+
vecs = create_sparse_dense_vectors_from_df(df)
|
139 |
+
index.upsert(vecs, namespace="example-namespace")
|
140 |
+
|
141 |
+
|
142 |
+
# Example usage
|
143 |
+
df = pd.read_csv('data/Indicator_Test.csv')
|
144 |
+
df = df.head(3)
|
145 |
+
newdf = create_sparse_embeds(df)
|
146 |
+
metadata = df[['text', 'label']].to_dict(orient='records')
|
147 |
+
newdf['metadata'] = metadata
|
148 |
+
vecs = create_sparse_dense_dict(newdf)
|
149 |
+
index.upsert(vecs, namespace="example-namespace")
|
150 |
+
|
151 |
+
"""
|
152 |
+
def create_sparse_dense_dict(df, id_col='id', values_col='values', metadata_col='metadata', sparse_indices_col='indices', sparse_values_col='sparse_values'):
    """Convert a DataFrame into the list-of-dicts layout Pinecone expects
    for hybrid (sparse-dense) upserts.

    Each row becomes ``{'id', 'values', 'metadata', 'sparse_values':
    {'indices', 'values'}}``; column names are configurable via the
    keyword arguments. Row order is preserved.
    """
    vectors = []
    for _, record in df.iterrows():
        sparse_part = {
            'indices': record[sparse_indices_col],
            'values': record[sparse_values_col],
        }
        vectors.append({
            'id': record[id_col],
            'values': record[values_col],
            'metadata': record[metadata_col],
            'sparse_values': sparse_part,
        })
    return vectors
|
168 |
+
|
169 |
+
|
170 |
+
############ UPSERTING DATA
|
171 |
+
|
172 |
+
def create_index(pc, name, dimension, metric, cloud, region):
    """Create a serverless Pinecone index (thin wrapper over pc.create_index).

    Args:
        pc: Pinecone client.
        name: index name.
        dimension: embedding dimensionality.
        metric: similarity metric ("cosine", "dotproduct", ...).
        cloud, region: serverless placement (e.g. "aws", "us-east-1").
    """
    pc.create_index(
        name=name,
        dimension=dimension,
        metric=metric,
        spec=ServerlessSpec(
            cloud=cloud,
            region=region
        )
    )
|
182 |
+
|
183 |
+
#pc.delete_index("example-index")
|
184 |
+
|
185 |
+
#index = pc.Index("test-index")
|
186 |
+
|
187 |
+
"""
|
188 |
+
## Create vectors from a DataFrame to be uploaded to Pinecone
|
189 |
+
import pandas as pd
|
190 |
+
|
191 |
+
# Create a sample DataFrame
|
192 |
+
data = {
|
193 |
+
'Embeddings': [
|
194 |
+
[0.1, 0.2, 0.3, 0.4],
|
195 |
+
[0.2, 0.3, 0.4, 0.5]
|
196 |
+
],
|
197 |
+
'id': ['vec1', 'vec2'],
|
198 |
+
'genre': ['drama', 'action']
|
199 |
+
}
|
200 |
+
df = pd.DataFrame(data)
|
201 |
+
|
202 |
+
vecs = create_vectors_from_df(df)
|
203 |
+
|
204 |
+
# Upload the vectors to Pinecone
|
205 |
+
index.upsert(
|
206 |
+
vectors=vecs,
|
207 |
+
namespace="example-namespace"
|
208 |
+
)
|
209 |
+
"""
|
210 |
+
def create_vectors_from_df(df):
    """Build Pinecone upsert tuples ``(id, embedding, metadata)`` from *df*.

    Every column other than 'id' and 'Embeddings' is folded into the
    per-row metadata dict. Row order is preserved.
    """
    out = []
    for _, row in df.iterrows():
        metadata = row.drop(['Embeddings', 'id']).to_dict()
        out.append((row['id'], row['Embeddings'], metadata))
    return out
|
215 |
+
|
216 |
+
def chunk_upload_vectors(index, vectors, namespace="example-namespace", chunk_size=1000):
    """Upsert *vectors* into *index* in consecutive batches of *chunk_size*.

    Pinecone caps request sizes, so the list is sliced and uploaded one
    batch at a time; an empty list results in no upsert calls.
    """
    total = len(vectors)
    start = 0
    while start < total:
        batch = vectors[start:min(start + chunk_size, total)]
        index.upsert(
            vectors=batch,
            namespace=namespace
        )
        start += chunk_size
|
223 |
+
|
224 |
+
"""
|
225 |
+
## Working Example 2
|
226 |
+
|
227 |
+
df = pd.read_csv('data/Indicator_Test.csv')
|
228 |
+
dfe = DataLoader.chunk_and_embed(pc, model, df)
|
229 |
+
# Keep only text, embeddings, id
|
230 |
+
dfmin = dfe[['text', 'Embeddings', 'id', 'label']]
|
231 |
+
DataLoader.chunk_df_and_upsert(index, dfmin, namespace="indicator-test-namespace", chunk_size=96)
|
232 |
+
|
233 |
+
"""
|
234 |
+
def chunk_df_and_upsert(index, df, namespace="new-namespace", chunk_size=1000):
    """Convert *df* rows to Pinecone vectors and upsert them in batches.

    Expects *df* to have 'id' and 'Embeddings' columns (see
    create_vectors_from_df); remaining columns become metadata.
    """
    vectors = create_vectors_from_df(df)
    chunk_upload_vectors(index, vectors, namespace, chunk_size)
|
237 |
+
|
238 |
+
#### QUERYING DATA
|
239 |
+
"""
|
240 |
+
namespace = "namespace"
|
241 |
+
vector = [0.1, 0.2, 0.3, 0.4]
|
242 |
+
top_k = 3
|
243 |
+
include_values = True
|
244 |
+
"""
|
245 |
+
def query_data(index, namespace, vector, top_k=3, include_values=True):
    """Dense-only nearest-neighbour query against *namespace*.

    *vector* must support ``.tolist()`` (e.g. a numpy array). Returns the
    raw Pinecone query response.
    """
    out = index.query(
        namespace=namespace,
        vector=vector.tolist(),
        top_k=top_k,
        include_values=include_values
    )
    return out
|
253 |
+
|
254 |
+
"""
|
255 |
+
Example:
|
256 |
+
|
257 |
+
"""
|
258 |
+
def query_data_with_sparse(index, namespace, vector, sparse_vector, top_k=5, include_values=True, include_metadata=True):
    """Hybrid (dense + sparse) query against *namespace*.

    Unlike query_data, *vector* is passed through as-is (a plain list), and
    *sparse_vector* is a dict with 'indices' and 'values'. Returns the raw
    Pinecone query response.
    """
    out = index.query(
        namespace=namespace,
        vector=vector,
        sparse_vector=sparse_vector,
        top_k=top_k,
        include_metadata=include_metadata,
        include_values=include_values
    )
    return out
|
268 |
+
|
269 |
+
# create sparse vector with zero weighting
|
270 |
+
def empty_sparse_vector():
    """Return a placeholder sparse vector with zero weight.

    Useful for running dense-only queries against a hybrid (sparse-dense)
    index, which still requires a sparse component to be supplied.
    """
    return dict(indices=[1], values=[0.0])
|
275 |
+
|
276 |
+
|
277 |
+
"""
|
278 |
+
pc = Pinecone("5faec954-a6c5-4af5-a577-89dbd2e4e5b0")
|
279 |
+
index = pc.Index("test-index")
|
280 |
+
namespace = "test-namespace"
|
281 |
+
vector = np.random.rand(1024)
|
282 |
+
top_k = 3
|
283 |
+
include_values = True
|
284 |
+
filter={
|
285 |
+
"label": {"$lt": 2}
|
286 |
+
}
|
287 |
+
query_data_with_filter(index, namespace, vector, top_k, include_values, filter)
|
288 |
+
"""
|
289 |
+
def query_data_with_filter(index, namespace, vector, top_k=3, include_values=True, filter=None):
    """Dense query with an optional metadata filter (Pinecone filter syntax,
    e.g. {"label": {"$lt": 2}}). *vector* must support ``.tolist()``."""
    out = index.query(
        namespace=namespace,
        vector=vector.tolist(),
        top_k=top_k,
        include_values=include_values,
        filter=filter
    )
    return out
|
298 |
+
|
299 |
+
"""
|
300 |
+
pc = Pinecone("5faec954-a6c5-4af5-a577-89dbd2e4e5b0")
|
301 |
+
ids = ["UkfgLgeYW9wo", "GkkzUYYOcooB"]
|
302 |
+
indexname = "ostreacultura-v1"
|
303 |
+
namespace = "cards-data"
|
304 |
+
index = pc.Index(indexname)
|
305 |
+
DL.fetch_data(index, ids, namespace)
|
306 |
+
|
307 |
+
"""
|
308 |
+
def fetch_data(index, ids, namespace):
    """Fetch vectors by id from *namespace* (wrapper over index.fetch)."""
    out = index.fetch(ids=ids, namespace=namespace)
    return out
|
311 |
+
|
312 |
+
|
313 |
+
def get_all_ids_from_namespace(index, namespace):
    """List vector ids in *namespace*.

    NOTE(review): index.list appears to return a paginated generator, not a
    flat list — confirm callers iterate it accordingly.
    """
    ids = index.list(namespace=namespace)
    return ids
|
316 |
+
|
317 |
+
"""
|
318 |
+
## Hybrid search weighting - Alpa is equal to the weight of the dense vector
|
319 |
+
dense = [0.1, 0.2, 0.3, 0.4]
|
320 |
+
sparse_vector={
|
321 |
+
'indices': [10, 45, 16],
|
322 |
+
'values': [0.5, 0.5, 0.2]
|
323 |
+
}
|
324 |
+
dense, sparse = hybrid_score_norm(dense, sparse, alpha=1.0)
|
325 |
+
"""
|
326 |
+
def hybrid_score_norm(dense, sparse, alpha: float):
    """Weight a dense/sparse vector pair for hybrid search.

    Implements the convex combination ``alpha * dense + (1 - alpha) * sparse``:
    dense values are scaled by ``alpha``, sparse values by ``1 - alpha``;
    sparse indices are passed through unchanged.

    Args:
        dense: dense vector as a list of floats.
        sparse: dict with ``indices`` and ``values`` keys.
        alpha: dense-vector weight; must lie in [0, 1].

    Returns:
        (weighted_dense, weighted_sparse) tuple.

    Raises:
        ValueError: if ``alpha`` is outside [0, 1].
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    sparse_weight = 1 - alpha
    weighted_sparse = {
        'indices': sparse['indices'],
        'values': [v * sparse_weight for v in sparse['values']],
    }
    weighted_dense = [v * alpha for v in dense]
    return weighted_dense, weighted_sparse
|
343 |
+
|
344 |
+
#############
|
src/python/MiniEncoder.py
ADDED
@@ -0,0 +1,10 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
## Mini Encoder
|
2 |
+
|
3 |
+
from sentence_transformers import SentenceTransformer
|
4 |
+
|
5 |
+
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
|
6 |
+
|
7 |
+
def get_embeddings(sentences):
    """Encode sentence(s) with the module-level all-MiniLM-L6-v2 model.

    Accepts a string or a list of strings; returns whatever
    SentenceTransformer.encode produces (presumably a numpy array of
    384-dim embeddings — confirm against the model card).
    """
    embeddings = model.encode(sentences)
    return embeddings
|
10 |
+
|
src/python/__pycache__/DataLoader.cpython-310.pyc
ADDED
Binary file (4.98 kB). View file
|
|
src/python/__pycache__/DataLoader.cpython-312.pyc
ADDED
Binary file (7.88 kB). View file
|
|
src/python/update_fact_check_data.py
ADDED
@@ -0,0 +1,83 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
## SCRIPT TO UPDATE THE FACT CHECK DATA
|
2 |
+
#######################################
|
3 |
+
from pinecone.grpc import PineconeGRPC as Pinecone
|
4 |
+
import os
|
5 |
+
import pandas as pd
|
6 |
+
import numpy as np
|
7 |
+
from pinecone import ServerlessSpec
|
8 |
+
from pinecone_text.sparse import BM25Encoder
|
9 |
+
import sys
|
10 |
+
sys.path.append('src/python')
|
11 |
+
import DataLoader
|
12 |
+
pc = Pinecone(api_key="5faec954-a6c5-4af5-a577-89dbd2e4e5b0", pool_threads=50) # <-- make sure to set this)
|
13 |
+
##############################
|
14 |
+
|
15 |
+
df = pd.read_csv('data/fact_check_latest.csv')
|
16 |
+
# Drop non-unique text values
|
17 |
+
df = df.drop_duplicates(subset=['text'])
|
18 |
+
# skip rows where text is NaN
|
19 |
+
df = df.dropna(subset=['text'])
|
20 |
+
## for 'claimReviewTitle' and 'claimReviewUrl' columns, fill NaN with empty string
|
21 |
+
df['claimReviewUrl'] = df['claimReviewUrl'].fillna('')
|
22 |
+
# now, check for NaN values in 'claimReviewUrl' column
|
23 |
+
## get top three rows
|
24 |
+
# get text and MessageID
|
25 |
+
bm25, newdf = DataLoader.create_sparse_embeds(pc, df)
|
26 |
+
#metadata = df[['text', 'category', 'claimReviewTitle', 'claimReviewUrl']].to_dict(orient='records')
|
27 |
+
metadata = df[['text', 'claimReviewUrl']].to_dict(orient='records')
|
28 |
+
newdf.loc[:, 'metadata'] = metadata
|
29 |
+
|
30 |
+
## Take a look at rows where sparse_values is an empty array
|
31 |
+
sparse_lengths = [len(x) for x in newdf['sparse_values']]
|
32 |
+
## Drop newdf rows where the sparse length is zero
|
33 |
+
newdf = newdf[np.array(sparse_lengths) != 0].reset_index(drop=True)
|
34 |
+
vecs = DataLoader.create_sparse_dense_dict(newdf)
|
35 |
+
index = pc.Index("oc-hybrid-library-index")
|
36 |
+
for i in range(0, len(vecs), 400):
|
37 |
+
end_index = min(i + 400, len(vecs))
|
38 |
+
index.upsert(vecs[i:end_index], namespace="expanded-fact-checks")
|
39 |
+
print(f"Upserted vectors")
|
40 |
+
|
41 |
+
#####################################
|
42 |
+
### Querying performance for TruthSeeker Subset
|
43 |
+
df = pd.read_csv('data/truthseeker_subsample.csv')
|
44 |
+
corpus = df['claim'].tolist()
|
45 |
+
|
46 |
+
"""
|
47 |
+
## Function query, return score, title, link
|
48 |
+
Example: get_score_title_link(corpus[0], pc, index)
|
49 |
+
"""
|
50 |
+
def get_score_title_link(querytext, pc, index):
    """Query the "expanded-fact-checks" namespace for the single best match.

    Returns a pandas Series (score, title, link) so results can be assigned
    to DataFrame columns directly via ``df['claim'].apply(...)``.

    NOTE(review): raises IndexError if the query returns zero matches —
    confirm the namespace is never empty before relying on this.
    """
    queryembed = DataLoader.query_embed(pc, "multilingual-e5-large", querytext)
    # dense-only query: hybrid index still needs a (zero-weight) sparse part
    empty_sparse = DataLoader.empty_sparse_vector()
    res = index.query(
        top_k=1,
        namespace="expanded-fact-checks",
        vector=queryembed,
        sparse_vector=empty_sparse,
        include_metadata=True
    )
    score = res['matches'][0]['score']
    title = res['matches'][0]['metadata']['text']
    link = res['matches'][0]['metadata']['claimReviewUrl']
    return pd.Series([score, title, link], index=['score', 'title', 'link'])
|
64 |
+
|
65 |
+
## Get score, title, link for each querytext in corpus
|
66 |
+
import time
|
67 |
+
from pinecone.grpc import PineconeGRPC
|
68 |
+
pc = PineconeGRPC(api_key="5faec954-a6c5-4af5-a577-89dbd2e4e5b0") # <-- make sure to set this)
|
69 |
+
index = pc.Index(
|
70 |
+
name="oc-hybrid-library-index",
|
71 |
+
pool_threads=50, # <-- make sure to set this
|
72 |
+
)
|
73 |
+
|
74 |
+
### TIMING
|
75 |
+
start_time = time.time()
|
76 |
+
|
77 |
+
df[['score', 'title', 'link']] = df['claim'].apply(get_score_title_link, args=(pc, index)) #send the claim column to be scored.
|
78 |
+
|
79 |
+
elapsed_time = time.time() - start_time
|
80 |
+
print(f"Time taken: {elapsed_time:.2f} seconds")
|
81 |
+
|
82 |
+
|
83 |
+
######## END TIMING
|
src/python/upload_library_hybrid-sparse.py
ADDED
@@ -0,0 +1,107 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
## Upload Telegram 300K to hybrid-sparse
from pinecone.grpc import PineconeGRPC as Pinecone
import os
import pandas as pd
import numpy as np
from pinecone import ServerlessSpec
from pinecone_text.sparse import BM25Encoder
import sys
# Make the repo-local helper module importable when running from the repo root.
sys.path.append('src/python')
import DataLoader

# SECURITY(review): the Pinecone API key was hard-coded here. A committed key
# is leaked and must be rotated; read it from the environment instead.
pc = Pinecone(os.environ["PINECONE_API_KEY"])
# Rebuild the index from scratch: drop the old one, then create a serverless
# index sized for multilingual-e5-large embeddings (1024 dims).
pc.delete_index("oc-hybrid-library-index")

pc.create_index(
    name="oc-hybrid-library-index",
    dimension=1024,
    metric="dotproduct",  # dotproduct metric is required for hybrid dense+sparse queries
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)
|
24 |
+
|
25 |
+
## Upsert Indicator Data
df = pd.read_csv('data/google_fact_checks2024-11-14.csv')
# Drop non-unique text values so each claim text is embedded/upserted once.
df = df.drop_duplicates(subset=['text'])

## get top three rows (debugging aid — uncomment for a quick smoke test)
#df = df.head(3)
# Build sparse (BM25) + dense embeddings for the 'text' column.
bm25, newdf = DataLoader.create_sparse_embeds(pc, df)
# NOTE(review): metadata is built from df while the vectors live in newdf —
# this assumes create_sparse_embeds keeps the rows aligned 1:1 with df; verify.
metadata = df[['text', 'category', 'claimReviewTitle', 'claimReviewUrl']].to_dict(orient='records')
newdf.loc[:, 'metadata'] = metadata
## Take a look at rows where sparse_values is an empty array.
sparse_lengths = [len(x) for x in newdf['sparse_values']]
## Drop newdf rows where sparse length is 0 (deliberately left disabled here,
## unlike the expansive-claims section below).
#newdf = newdf[pd.Series(sparse_lengths) != 0]

# Upsert everything into a single "fact-checks" namespace (the earlier
# per-category namespace scheme was dropped).
vecs = DataLoader.create_sparse_dense_dict(newdf)
index = pc.Index("oc-hybrid-library-index")
# Batches of 400 stay under Pinecone's per-request limits.
for i in range(0, len(vecs), 400):
    end_index = min(i + 400, len(vecs))
    index.upsert(vecs[i:end_index], namespace="fact-checks")
print("Upserted vectors")
|
56 |
+
|
57 |
+
|
58 |
+
################# Querying the index
df = pd.read_csv('data/google_fact_checks2024-11-14.csv')
corpus = df['text'].tolist()
vector, bm25 = DataLoader.encode_documents(corpus)
index = pc.Index("oc-hybrid-library-index")

querytext = "satanic"
queryembed = DataLoader.query_embed(pc, "multilingual-e5-large", querytext)
query_sparse_vector = bm25.encode_documents(querytext)  # NOTE(review): computed but unused below
# BUG FIX: empty_sparse_vector lives in DataLoader; the bare call raised
# NameError (it is called as DataLoader.empty_sparse_vector() elsewhere).
empty_sparse = DataLoader.empty_sparse_vector()

query_response = index.query(
    top_k=5,
    namespace="immigration",
    vector=queryembed,
    sparse_vector=empty_sparse,  # dense-only query: sparse component is empty
    include_metadata=True
)
# Bare expression: only useful when pasted into a REPL for inspection.
query_response
|
77 |
+
|
78 |
+
## UPLOAD Expansive LLM's
df = pd.read_csv('data/expansive_claims_library_expanded.csv')
df['text'] = df['ExpandedClaim']
## get top three rows (debugging aid — uncomment for a quick smoke test)
#df = df.head(3)
# Build sparse (BM25) + dense embeddings for the 'text' column.
bm25, newdf = DataLoader.create_sparse_embeds(pc, df)
# NOTE(review): metadata is built from df while the vectors live in newdf —
# this assumes create_sparse_embeds keeps the rows aligned 1:1 with df; verify.
metadata = df[['Narrative', 'Model', 'Policy']].to_dict(orient='records')
newdf.loc[:, 'metadata'] = metadata
## Take a look at rows where sparse_values is an empty array.
sparse_lengths = [len(x) for x in newdf['sparse_values']]
## Drop newdf rows where sparse length is 0.
# Use a positional numpy mask: the previous pd.Series mask aligned on newdf's
# index, which silently misfilters if newdf lacks a default RangeIndex.
newdf = newdf[np.array(sparse_lengths) != 0]

vecs = DataLoader.create_sparse_dense_dict(newdf)
index = pc.Index("oc-hybrid-library-index")
# Batches of 400 stay under Pinecone's per-request limits.
for i in range(0, len(vecs), 400):
    end_index = min(i + 400, len(vecs))
    index.upsert(vecs[i:end_index], namespace="narratives")
print("Upserted vectors")
|