Room Name Similarity Model

๊ฐ์‹ค๋ช… ํ…์ŠคํŠธ ์œ ์‚ฌ๋„ ์ธก์ •์„ ์œ„ํ•œ Siamese ๋„คํŠธ์›Œํฌ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

๋ชจ๋ธ ๊ฐœ์š”

์ด ๋ชจ๋ธ์€ ์ˆ™์†Œ ๊ฐ์‹ค๋ช… ๊ฐ„์˜ ์œ ์‚ฌ๋„๋ฅผ ์ธก์ •ํ•˜์—ฌ ๋™์ผํ•œ ๋ฌผ๋ฆฌ์  ๊ฐ์‹ค์„ ์‹๋ณ„ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. BERT ๊ธฐ๋ฐ˜ Siamese ๋„คํŠธ์›Œํฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•œ๊ตญ์–ด์™€ ์˜์–ด ํ…์ŠคํŠธ๋ฅผ ๋ชจ๋‘ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ชจ๋ธ ์ •๋ณด

  • ๋ชจ๋ธ๋ช…: name_similarity_model_0.2
  • ๊ธฐ๋ฐ˜ ๋ชจ๋ธ: klue/bert-base
  • ์ตœ๋Œ€ ์‹œํ€€์Šค ๊ธธ์ด: 64
  • ์–ดํœ˜ ํฌ๊ธฐ: 32,000
  • ์–ธ์–ด: ํ•œ๊ตญ์–ด, ์˜์–ด
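The tokenizer figures above can be verified directly against the public checkpoint; a minimal check, assuming only the Hugging Face transformers library and klue/bert-base:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('klue/bert-base')
print(tokenizer.vocab_size)        # expected: 32000
encoded = tokenizer("๋””๋Ÿญ์Šค ํŠธ์œˆ๋ฃธ (Deluxe Twin, City View)", max_length=64, truncation=True)
print(len(encoded['input_ids']))   # never exceeds 64 tokens per room name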

Usage

Using the Model in Python

import torch
from transformers import AutoTokenizer, AutoModel
import json

# ๋ชจ๋ธ ์ •๋ณด ๋กœ๋“œ
with open('name_similarity_model_0.2_model_info.json', 'r') as f:
    model_info = json.load(f)

with open('name_similarity_model_0.2_tokenizer_info.json', 'r') as f:
    tokenizer_info = json.load(f)

# ํ† ํฌ๋‚˜์ด์ € ๋กœ๋“œ
tokenizer = AutoTokenizer.from_pretrained(tokenizer_info['model_name'])

# ๋ชจ๋ธ ๋กœ๋“œ (PyTorch)
model = torch.load('name_similarity_model_0.2.pth', map_location='cpu')
model.eval()

# ONNX ๋ชจ๋ธ ์‚ฌ์šฉ (๋” ๋น ๋ฅธ ์ถ”๋ก )
import onnxruntime as ort
onnx_session = ort.InferenceSession('name_similarity_model_0.2.onnx')

def calculate_similarity(text1, text2):
    # ํ…์ŠคํŠธ ํ† ํฌ๋‚˜์ด์ง•
    inputs1 = tokenizer(text1, return_tensors='pt', max_length=64, padding=True, truncation=True)
    inputs2 = tokenizer(text2, return_tensors='pt', max_length=64, padding=True, truncation=True)
    
    # ์œ ์‚ฌ๋„ ๊ณ„์‚ฐ
    with torch.no_grad():
        similarity = model(inputs1, inputs2)
    
    return similarity.item()

# Example usage
text1 = "์Šคํƒ ๋‹ค๋“œ ๋”๋ธ”๋ฃธ"
text2 = "Standard Double Room"
similarity_score = calculate_similarity(text1, text2)
print(f"Similarity: {similarity_score:.4f}")

ONNX ๋ชจ๋ธ ์‚ฌ์šฉ (๊ถŒ์žฅ)

import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Create an ONNX Runtime session
session = ort.InferenceSession('name_similarity_model_0.2.onnx')

# ํ† ํฌ๋‚˜์ด์ € ๋กœ๋“œ
tokenizer = AutoTokenizer.from_pretrained('klue/bert-base')

def calculate_similarity_onnx(text1, text2):
    # ํ…์ŠคํŠธ ํ† ํฌ๋‚˜์ด์ง•
    inputs1 = tokenizer(text1, return_tensors='np', max_length=64, padding=True, truncation=True)
    inputs2 = tokenizer(text2, return_tensors='np', max_length=64, padding=True, truncation=True)
    
    # ONNX ๋ชจ๋ธ ์ถ”๋ก 
    input_feed = {
        'input_ids_1': inputs1['input_ids'].astype(np.int64),
        'attention_mask_1': inputs1['attention_mask'].astype(np.int64),
        'input_ids_2': inputs2['input_ids'].astype(np.int64),
        'attention_mask_2': inputs2['attention_mask'].astype(np.int64)
    }
    
    similarity = session.run(None, input_feed)[0]
    return similarity[0][0]

# Example usage
similarity_score = calculate_similarity_onnx("์Šคํƒ ๋‹ค๋“œ ๋”๋ธ”๋ฃธ", "Standard Double Room")
print(f"Similarity: {similarity_score:.4f}")

๋ชจ๋ธ ํŒŒ์ผ

  • name_similarity_model_0.2.pth: PyTorch model file
  • name_similarity_model_0.2.onnx: ONNX model file (optimized for inference)
  • name_similarity_model_0.2_model_info.json: model metadata
  • name_similarity_model_0.2_tokenizer_info.json: tokenizer information
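The schemas of the two JSON files are not documented here; the only field the usage code relies on is tokenizer_info['model_name']. A quick, assumption-free way to see what else they contain is to print their top-level keys:

import json

for path in ('name_similarity_model_0.2_model_info.json',
             'name_similarity_model_0.2_tokenizer_info.json'):
    with open(path, 'r') as f:
        info = json.load(f)
    # Only inspect the structure; no schema beyond 'model_name' is assumed.
    print(path, '->', sorted(info.keys()))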

Performance

  • ์ •ํ™•๋„: 85% ์ด์ƒ
  • F1 Score: 0.85 ์ด์ƒ
  • ์ฒ˜๋ฆฌ ์†๋„: 1000 ์Œ/์ดˆ ์ด์ƒ (ONNX ๋ชจ๋ธ ๊ธฐ์ค€)

ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ

์ด ๋ชจ๋ธ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฐ์ดํ„ฐ๋กœ ํ›ˆ๋ จ๋˜์—ˆ์Šต๋‹ˆ๋‹ค:

  • ๊ธ์ • ์Œ: ๊ฐ™์€ roomtype_id๋ฅผ ๊ฐ€์ง„ ๊ฐ์‹ค๋ช…๋“ค
  • ๋ถ€์ • ์Œ: ๊ฐ™์€ property_id์ด์ง€๋งŒ ๋‹ค๋ฅธ roomtype_id๋ฅผ ๊ฐ€์ง„ ๊ฐ์‹ค๋ช…๋“ค
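As an illustration only, the sketch below builds such pairs from a pandas DataFrame with property_id, roomtype_id, and room_name columns; the column names, the DataFrame, and the helper build_pairs are assumptions, since the actual training pipeline is not part of this release.

import itertools
import pandas as pd

def build_pairs(df: pd.DataFrame):
    """Illustrative pair construction; column names are assumed, not confirmed."""
    positives, negatives = [], []
    for _, prop in df.groupby('property_id'):
        # Positive pairs: different names listed under the same room type.
        for _, group in prop.groupby('roomtype_id'):
            names = group['room_name'].tolist()
            positives += [(a, b, 1) for a, b in itertools.combinations(names, 2)]
        # Negative pairs: names from different room types of the same property.
        for (_, g1), (_, g2) in itertools.combinations(prop.groupby('roomtype_id'), 2):
            negatives += [(a, b, 0) for a in g1['room_name'] for b in g2['room_name']]
    return positives + negatives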

๋ผ์ด์„ ์Šค

MIT License

Notes

์ด ๋ชจ๋ธ์€ Room Clusterer ํ”„๋กœ์ ํŠธ์˜ ์ผ๋ถ€๋กœ ๊ฐœ๋ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๋” ์ž์„ธํ•œ ์ •๋ณด๋Š” ํ”„๋กœ์ ํŠธ ์ €์žฅ์†Œ๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.
