JiwonDev

Information Retrieval #1: Overview

by JiwonDev

This post is an overview of what the subject of information retrieval covers.

The formulas may look intimidating, but the next post walks through them step by step, so don't worry about them yet; just make sure you get the concepts right.

 

IR, Information Retrieval (정보검색)

1. What is information retrieval? (How it differs from database search)

The process of finding (retrieval) material that is relevant to a user's information need in a large collection of information.

= Finding unstructured data (mostly natural-language text documents) that satisfies a requirement in a large collection of material.

[Figure: how DBMS, IR, and QA fit into an AI service]

In other words, the information in IR takes the form of unstructured documents (not tables), and there is no associated schema.

So queries are not written in a formal query language like SQL; you search in natural language.

 

The results are different too. IR finds documents that are highly relevant; it is not a process of finding the one correct answer.

For example, searching for the keyword "Seoul" returns not just plain text but also related search terms, images, and maps. (* A system that finds the answer itself is called QA (Question Answering), and a system that looks up data using a schema and SQL is called a DBMS.)

* Schema

The word means "outline"; in a database it refers to the constraints on the data and the relationships between data items.

Typical schema elements are attributes, entities (groups of attributes), and the relationships between entities.


2. Unstructured data

Unstructured data is data that is not clearly defined and not structured (= not in table form).

Most of what we see every day, such as text, audio, video, and images, is unstructured data.

As an aside, in the past unstructured data was mostly not used directly, and companies that processed it and sold the result (e.g. Oracle) made large profits. Today's industry wants the unstructured data itself, so demand has grown for search engines (Google, Yahoo) that make unstructured data easy to find.


3. Retrieval terminology

You don't need to memorize these terms; this is just a quick summary in case one of them comes up and you don't recognize it.

Basic principle of IR (Information Retrieval)

Document collection -> [ Indexer -> Inverted index -> Retriever ] <-> query, result documents

Indexing: the document collection is processed ahead of time so that it can be searched efficiently.

Retrieval (retriever): documents are ranked by query-document similarity, so each similarity has to be computed fast.

Some information is lost when the information need is turned into a query.
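
The pipeline above can be sketched in a few lines of Python. This is only a minimal illustration: the toy documents, the function names `build_inverted_index` and `retrieve`, and the scoring rule (count how many distinct query terms a document contains) are all assumptions made for the example, not part of any particular IR system.

```python
from collections import defaultdict

# Hypothetical toy collection; the IDs and texts are made up for illustration.
documents = {
    1: "seoul is the capital of korea",
    2: "busan is a port city in korea",
    3: "seoul has a large subway network",
}

def build_inverted_index(docs):
    """Indexer: map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

def retrieve(index, query):
    """Retriever: rank documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(query.split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

index = build_inverted_index(documents)
ranking = retrieve(index, "seoul korea")
```

Here `ranking` puts document 1 first because it is the only one containing both query terms. The index is built once, and each query only touches the posting sets of its own terms instead of scanning every document.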

 

How a similarity score is written

Similarity (Query, Document_1)

Sim(Q, D1) = 0.3
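
To make a score like Sim(Q, D1) concrete, here is a minimal sketch of one possible similarity function. Cosine similarity over raw term counts is an assumption chosen for illustration; the post does not say which measure produces the 0.3 above, and the sample query and document are made up.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(query, document):
    """Cosine of the angle between the term-count vectors of two texts."""
    q, d = Counter(query.split()), Counter(document.split())
    dot = sum(q[t] * d[t] for t in q)
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in d.values()))
    # Empty texts have no direction, so treat their similarity as 0.
    return dot / norm if norm else 0.0

sim = cosine_similarity("seoul area", "the area of seoul is large")
```

The result is always between 0 (no shared terms) and 1 (identical term proportions), which makes scores for different documents directly comparable.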

Retrieval model

1. Corpus: there must be a corpus to search.

2. Topic: there must be a topic to search that corpus for.

3. Relevance: a document that covers the topic is said to be relevant.

For example, if the question is about the land area of Seoul (the topic), documents about Seoul's geography and area are relevant, while documents about Seoul's population or employment rate are non-relevant.

4. Query: in IR, queries are written in natural language.

5. Model (retrieval model): Boolean, vector, probabilistic, and other models are used, each with its own characteristics.
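
As a rough illustration of the simplest of these models, the sketch below implements a Boolean AND query: a document either matches or it does not, and there is no ranking. The toy documents and the function name `boolean_and` are hypothetical.

```python
# Hypothetical toy collection for illustration only.
documents = {
    1: "seoul area and geography",
    2: "seoul population statistics",
    3: "busan area and geography",
}

def boolean_and(docs, query):
    """Boolean AND model: a document matches only if it contains every query term."""
    terms = set(query.split())
    return {doc_id for doc_id, text in docs.items() if terms <= set(text.split())}

matches = boolean_and(documents, "seoul area")
```

Only document 1 contains both terms, so it is the only match. Vector and probabilistic models differ in that they return a score per document rather than a yes/no answer.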

Applications

- Information filtering (recommender systems)

- Question answering (QA systems)

- Cross-language information retrieval

- Document classification

- Document clustering

 


4. What is a relevant document?

Which document is most relevant to a query?

In IR there is no such thing as a perfectly matching document. It is, after all, retrieval.

IR uses the query and the whole document collection to give each document a weighted score, and uses those scores to find the relevant documents.

The best-known approach uses TF (how often a term occurs in a document) and DF (the number of documents that contain a given term).


 

5. Ranking with TF-IDF

This is explained in more detail in the next post, so we'll only take a quick look here.

The terminology makes it sound harder than it is. Let's take it step by step.

If you simply filter the whole collection by keyword, it takes a long time and pulls in a lot of unneeded data.

So each document's relevance has to be scored (= weight) and the documents ranked by that score.

TF-IDF (Term Frequency-Inverse Document Frequency) is one such method: it uses term frequency and document frequency to give each term a weight that reflects how important it is. It is mainly used to compute document similarity, to rank results in a search system, and to measure the importance of a particular term. The unfamiliar notation can make it look hard, but the meaning is simple.

 

5-1 TF (Term Frequency)

The number of times a particular term occurs within a particular document.

TF(d, t): the number of occurrences of term t in document d
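
A minimal sketch of this definition, assuming documents are tokenized by splitting on whitespace (the sample text is document 3 from the worked example in section 5-4):

```python
from collections import Counter

def tf(d, t):
    """TF(d, t): number of times term t occurs among the whitespace tokens of d."""
    return Counter(d.split())[t]

count = tf("길고 노란 바나나 바나나", "바나나")
```

Counting whole tokens rather than substrings matters: a substring count would also match a term that happens to appear inside a longer word.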

 

5-2 DF (Document Frequency)

The number of documents in the whole collection in which a particular term appears.

DF(t): the number of documents in the collection that contain t
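
The same idea as a small sketch, again assuming whitespace tokenization. The function signature (passing the collection as a second argument) is a choice made for this example; the four texts are the documents from section 5-4.

```python
def df(t, docs):
    """DF(t): number of documents whose whitespace tokens include term t."""
    return sum(1 for doc in docs if t in doc.split())

docs = ["먹고 싶은 사과", "먹고 싶은 바나나", "길고 노란 바나나 바나나", "저는 과일이 좋아요"]
banana_df = df("바나나", docs)
```

Note that DF counts documents, not occurrences: 바나나 appears three times in total but in only two documents.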

 

5-3 IDF (Inverse Document Frequency)

A value inversely related to DF. The fewer documents a term appears in, the higher its weight should be, so calculations use IDF, the inverse of DF, rather than DF itself. It is not the plain reciprocal, though: a log (usually the natural log, ln) is applied so that the value does not grow too large as the number of documents (n) increases.

The +1 in the denominator keeps the denominator from being 0.
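
Written out in the form that the worked example in section 5-4 plugs numbers into (for instance ln(4/(1+1)) for a term found in one of four documents), the IDF and the resulting weight look as follows, where n is the total number of documents. The symbol names are chosen to match the surrounding text.

```latex
\mathrm{idf}(t) = \ln\!\left(\frac{n}{1 + \mathrm{df}(t)}\right),
\qquad
\mathrm{tfidf}(d, t) = \mathrm{tf}(d, t) \times \mathrm{idf}(t)
```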

Using the log also does more than shrink the values: it helps spread the weights more evenly.

For example, common words appear at least tens or hundreds of times more often than specialized, rarely used words.

TF-IDF is meant to give low importance to words that appear in every document, but with a plain reciprocal the gap becomes so extreme that words that are not especially rare still end up with relatively high weights.

Applying the log greatly narrows this gap.
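
A quick numeric check makes the difference visible. The collection size and the two document frequencies below are made-up numbers picked only so the arithmetic is easy to follow.

```python
from math import log

n = 1_000_000  # hypothetical collection size, chosen only for illustration

def inverse_df(df):
    """Plain reciprocal weighting: n / (df + 1), with no logarithm."""
    return n / (df + 1)

def idf(df):
    """Log-damped weighting: ln(n / (df + 1))."""
    return log(n / (df + 1))

rare_df, common_df = 9, 99_999  # assumed document frequencies of a rare and a common term

ratio_plain = inverse_df(rare_df) / inverse_df(common_df)
ratio_log = idf(rare_df) / idf(common_df)
```

With these assumed numbers the plain reciprocal makes the rare term 10,000 times heavier than the common one, while the log-damped version makes it about 5 times heavier.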

* How does applying the log change things?

Note that in practice, when idf(t) is 0 the weight it multiplies vanishes, so a constant is added at the end to keep the value away from 0, which brings the minimum to about 1 (e.g. ln(n / (df(t) + 1)) + 1).


5-4 Working it out

Suppose we have the following four text documents.

Document 1: 먹고 싶은 사과 ("an apple I want to eat")

Document 2: 먹고 싶은 바나나 ("a banana I want to eat")

Document 3: 길고 노란 바나나 바나나 ("long yellow banana banana")

Document 4: 저는 과일이 좋아요 ("I like fruit")

There are several ways to break text into terms, but here we keep it simple: split on spaces and record only how often each term occurs (word order inside a document is ignored, and the columns are just the terms in sorted order). Turning the four documents above into a document-term matrix (DTM) gives the following.

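| | 과일이 | 길고 | 노란 | 먹고 | 바나나 | 사과 | 싶은 | 저는 | 좋아요 |
|---|---|---|---|---|---|---|---|---|---|
| Document 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| Document 2 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| Document 3 | 0 | 1 | 1 | 0 | 2 | 0 | 0 | 0 | 0 |
| Document 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |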

 

Computing the IDF defined above for each term of this DTM (from the number of documents the term appears in) gives the following.

The expressions may look intimidating, but they are just the document counts plugged into the formula.

| Term | IDF |
|---|---|
| 과일이 | ln(4/(1+1)) = 0.693147 |
| 길고 | ln(4/(1+1)) = 0.693147 |
| 노란 | ln(4/(1+1)) = 0.693147 |
| 먹고 | ln(4/(2+1)) = 0.287682 |
| 바나나 | ln(4/(2+1)) = 0.287682 |
| 사과 | ln(4/(1+1)) = 0.693147 |
| 싶은 | ln(4/(2+1)) = 0.287682 |
| 저는 | ln(4/(1+1)) = 0.693147 |
| 좋아요 | ln(4/(1+1)) = 0.693147 |

Multiplying this value by TF (the number of times the term occurs in a particular document) gives the weight of each term in each document (TF-IDF).

The DTM above was written down as raw counts, with no formula applied, so it can be used directly as TF(d, t). Multiplying each entry by that term's IDF gives the weights below.

| | 과일이 | 길고 | 노란 | 먹고 | 바나나 | 사과 | 싶은 | 저는 | 좋아요 |
|---|---|---|---|---|---|---|---|---|---|
| Document 1 | 0 | 0 | 0 | 0.287682 | 0 | 0.693147 | 0.287682 | 0 | 0 |
| Document 2 | 0 | 0 | 0 | 0.287682 | 0.287682 | 0 | 0.287682 | 0 | 0 |
| Document 3 | 0 | 0.693147 | 0.693147 | 0 | 0.575364 | 0 | 0 | 0 | 0 |
| Document 4 | 0.693147 | 0 | 0 | 0 | 0 | 0 | 0 | 0.693147 | 0.693147 |

바나나 appears twice in document 3, so its weight there is higher than in document 2 (0.575364 versus 0.287682).

This is because TF-IDF treats a term that occurs often in a particular document as an important term within that document.

6. Implementing it yourself in Python

There are libraries that compute this in a few lines, but since the point is to learn, let's grind through it by hand.

import pandas as pd  # for the DataFrame output
from math import log  # for the IDF calculation

## 1. Load the data. A collection of docs like this is usually called a corpus.
docs = [
  '먹고 싶은 사과',           # "an apple I want to eat"
  '먹고 싶은 바나나',         # "a banana I want to eat"
  '길고 노란 바나나 바나나',  # "long yellow banana banana"
  '저는 과일이 좋아요'        # "I like fruit"
]
vocab = sorted(set(w for doc in docs for w in doc.split()))

N = len(docs)  # total number of documents

## 2. Define the TF, IDF, and TF-IDF functions.
def tf(t, d):
    # Count whole tokens; d.count(t) on the raw string would also match substrings.
    return d.split().count(t)

def idf(t):
    # Document frequency: number of documents that contain t as a whole token.
    df = sum(1 for doc in docs if t in doc.split())
    return log(N / (df + 1))

def tfidf(t, d):
    return tf(t, d) * idf(t)

## 3. Compute TF. Here this is the same as building the DTM (matrix).
result = []
for d in docs:  # one row per document
    result.append([tf(t, d) for t in vocab])

tf_ = pd.DataFrame(result, columns=vocab)

## 4. Compute the IDF value of each term.
result = [idf(t) for t in vocab]

idf_ = pd.DataFrame(result, index=vocab, columns=["IDF"])

## 5. Build the TF-IDF weight matrix.
result = []
for d in docs:
    result.append([tfidf(t, d) for t in vocab])

tfidf_ = pd.DataFrame(result, columns=vocab)
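
One way to turn these weights into an actual ranking, sketched under assumptions: score each document by summing the TF-IDF weights of the query terms it contains. The query 노란 바나나 ("yellow banana") and the function name `score` are made up for this illustration, and real systems usually use more elaborate scoring. The block uses only the standard library so it can be run on its own.

```python
from math import log

docs = [
    '먹고 싶은 사과',
    '먹고 싶은 바나나',
    '길고 노란 바나나 바나나',
    '저는 과일이 좋아요',
]
tokenized = [doc.split() for doc in docs]
N = len(docs)

def idf(t):
    """ln(N / (df + 1)), the same formula as in the worked example."""
    df = sum(1 for tokens in tokenized if t in tokens)
    return log(N / (df + 1))

def score(query, tokens):
    """Sum of the TF-IDF weights of the query terms in one document."""
    return sum(tokens.count(t) * idf(t) for t in query.split())

query = '노란 바나나'  # hypothetical query
# Pairs of (document number, score), highest score first.
ranking = sorted(
    ((i + 1, score(query, tokens)) for i, tokens in enumerate(tokenized)),
    key=lambda pair: -pair[1],
)
```

With this query, document 3 comes first (it contains 노란 once and 바나나 twice), followed by document 2; documents 1 and 4 score 0.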

7. Indexing synonyms and near-synonyms

How to build a thesaurus

How to split long sentences: by eojeol (space-delimited unit), by noun, by n-gram, and so on
How to index stored documents so they are easy to search
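
Two of the splitting units listed above can be sketched without any extra tooling. The function names are hypothetical, and noun-level splitting is left out because it needs a morphological analyzer.

```python
def eojeol_tokens(sentence):
    """Split on whitespace; in Korean each piece is an eojeol (space-delimited unit)."""
    return sentence.split()

def char_ngrams(text, n=2):
    """All contiguous character n-grams of the text with whitespace removed."""
    compact = "".join(text.split())
    return [compact[i:i + n] for i in range(len(compact) - n + 1)]

words = eojeol_tokens("길고 노란 바나나")
bigrams = char_ngrams("노란 바나나")
```

Character n-grams produce many more index terms than eojeol splitting, but they can still match when a word carries a different particle or ending.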

 

Quiz

1. List two problems related to the meaning of natural language that IR has to take into account.

Answer: the synonym problem and the polysemy problem

2. Express the degree of relevance using two terms, and then using four terms.

Answer: relevant, non-relevant

highly relevant, relevant, partially relevant, non-relevant

3. What are the two main modules that make up an IR system?

Answer: the indexing module and the retrieval (retriever) module

4. What is the term, in Korean and in English, for the input to an IR system that expresses the user's information need?

Answer: Query, 질의

5. Write the English word used in IR for '적합한'.

Answer: relevant

For reference: 색인 (indexing), 정보 (information), 검색 (retrieval), 요구 (need)

6. Write the full English name and the abbreviation for 정보검색.

Answer: IR, Information Retrieval

 
