Someone Who Just Does It First

Focus on now rather than later, and get perfect later rather than now 💪🏻

๐Ÿคš๐Ÿป ๋” ๋‚˜์€ ๊ฐœ๋ฐœ์ž ํ•˜๊ธฐ/๐ŸŽจ ์ฝ”๋“œ ๋ฆฌ๋””์ž์ธ ์ผ์ง€

[์ฝ”๋“œ ๋ฆฌ๋””์ž์ธ ์ผ์ง€] Ep.1: ๋‚ฏ์„  ์ž์‹ ๋‚ด ์ž์‹์œผ๋กœ ๋งŒ๋“ค๊ธฐ โ‘ 

JanginTech 2025. 8. 20. 15:31

์ฝ”๋“œ ๋ฆฌํŒฉํ† ๋ง.

๋“œ๋””์–ด.

์ •๋ฆฌ.

ํ•œ๋‹ค

!!!!!!!

 

 

๋ฆฌํŒฉํ† ๋ง ๊ณผ์ •์ด ๊ธธ๊ณ ๋„ ํ—˜๋‚œํ•  ๊ฑฐ ๊ฐ™์•„ ์•„์˜ˆ ์‹œ๋ฆฌ์ฆˆํ™”ํ•ด์„œ ์ •๋ฆฌํ•˜๋ฉด ๋‚˜์ค‘์— ์ฝ๋Š” ์žฌ๋ฏธ๊ฐ€ ์žˆ๊ฒ ๋‹ค ์‹ถ์–ด์„œ ์ œ๋ชฉ์„ ๋ถ™์—ฌ๋ดค๋‹ค.

๋‚จ๋“ค์ด ๋ณด๊ธฐ์— ์–ด๋–ค์ง„ ๋ชจ๋ฅด๊ฒ ์ง€๋งŒ ์ผ๋‹จ ๋‚œ ๋งˆ์Œ์— ๋“ ๋‹ค ๊ณ ๊ธ‰์ง„ ๋А๋‚Œ๋„ ๋‚˜๊ณ  ใ…Žใ…Ž๐Ÿ˜„


First, I'm going to fix up a single method.

Honestly, it's not a high-priority refactoring target,

but it's a method I slapped together(?) early on, so it still feels like someone else's child,,, I figured we should get acquainted haha

 

Problems

  • soup = BeautifulSoup(html_content, 'html.parser') is created over and over
  • Tag removal/unwrap runs as several separate loops
  • a[href] handling and table-marker insertion are tangled together, with string-replacement code like table_str = table_str.replace scattered in between, so it's hard to tell at a glance what the for loop actually does

 

Goals

  • Finish BeautifulSoup parsing in a single pass
  • Consolidate the unnecessary tag/attribute removal into one function (and just call that)
  • Finish link substitution in one loop (it's tedious otherwise)
  • Have one function handle table separation and body cleanup, and return both

 

๊ต์ฒด ์„ค๊ณ„

  • ์ •๊ทœ์‹ ์‚ฌ์ „ ์ปดํŒŒ์ผ: ๋ฃจํ”„๋‚ด ์žฌ์ปดํŒŒ์ผ ์ œ๊ฑฐ
  • ํƒœ๊ทธ/์†์„ฑ ์ œ๊ฑฐ ํ‘œ์ค€ํ™”: 250820 ํ˜„์žฌ ํฌ๊ฒŒ 2๊ฐ€์ง€๋กœ ๋‚˜๋‰˜์–ด์„œ ์ด๊ฑธ ๋ถ„๋ฆฌํ•˜๋ ค ํ•œ๋‹ค -> ๊ณตํ†ต(unwrap), FAQ ์ „์šฉ(unwrap) ๋ถ„๋ฆฌ
  • ํ…Œ์ด๋ธ” ๋ฃจํ”„ 1ํšŒ๋กœ:
    • <a>์—์„œ linked_seq ์ถ”์ถœ -> [linkedN]์œผ๋กœ ๊ต์ฒด
    • [tableN] ๋งˆ์ปค๋ฅผ ์•ž์— ๋ถ™์ด๊ณ  table.decompose()
  • ๋ฐ˜ํ™˜:
    • str: ํ…Œ์ด๋ธ”์ด ์ œ๊ฑฐ๋œ body
    • str: ๋งˆ์ปค+์›๋ณธํ…Œ์ด๋ธ” HTML
    • list: ๋งํฌ seq ๋ฆฌ์ŠคํŠธ

 

 

Implementation, step one: HTML cleanup and table/link extraction

Gather the scattered bits of functionality by concern and unify them into a single function.

 

BEFORE

def extract_and_clean_html_tables(html_string, is_faq=False):
    soup = BeautifulSoup(html_string, 'html.parser')

    # โ‘  remove unnecessary tags (loop after loop)
    for tag in soup.find_all(['span','strong','hr','a','font','img']):
        tag.unwrap()
    if is_faq:
        for tag in soup.find_all(['p','br','div']):
            tag.unwrap()

    # โ‘ก strip table attributes (yet another loop)
    for t in soup.find_all(['table','tr','td','tbody']):
        for attr in ['style','width','border','cellpadding','cellspacing']:
            t.attrs.pop(attr, None)

    tables_content = []
    linked_seq_list = []
    table_index = 1

    # โ‘ข ํ…Œ์ด๋ธ” ์ˆœํšŒ ์ค‘๊ฐ„์— a[href] ์ฒ˜๋ฆฌ + ๋งˆ์ปค ๋ฌธ์ž์—ด ์กฐ์ž‘ + ์ œ๊ฑฐ
    for table in soup.find_all('table'):
        for a_tag in table.find_all('a', href=True):
            # href์—์„œ /ํŠน์ •URL/<์ˆซ์ž> ์ถ”์ถœ
            parts = a_tag['href'].split('/ํŠน์ •URL/')
            if len(parts) > 1:
                m = re.search(r'\d+', parts[1])
                if m:
                    seq = m.group(0)
                    linked_seq_list.append(seq)
                    marker = f"[linked{len(linked_seq_list)}]"
                    # ๋ฌธ์ž์—ด ์น˜ํ™˜/๋Œ€์ฒด
                    table_str = str(table)
                    table_str = table_str.replace(a_tag['href'], marker)
                    # ...

        marker = f"[table{table_index}]"
        tables_content.append(marker + "\n" + str(table) + "\n")
        table.decompose()
        table_index += 1

    clean_html = str(soup)
    return clean_html, tables_content, linked_seq_list

 

 

Rounding up

1. Multiple BeautifulSoup(html, 'html.parser') calls

2. The unnecessary-tag removal loops (span/strong/font/hr/a/img, plus p/br/div for FAQ)

3. Removing style/width/border/cellpadding/cellspacing attributes inside <table>

4. Extracting /linkPageDetailPop/<number> from <a href> → substituting [linkedN]

5. Creating the [tableN] marker + table.decompose() (removing the table from the body)

6. Finally returning clean_html (the body), tables_content (marker + table HTML), linked_seq_list

 

[1] Loop consolidation: the tag-removal loop

Three kinds of removal in total:

1. Common removal: span, strong, font, hr, a, img

2. FAQ-only: p, br, div

3. Empty text-node removal (previously handled, or not, depending on the spot): this same logic was also running before this method was even called, repeating the exact same work several times over.. the fruits of mindless copy-paste

 

from bs4 import BeautifulSoup, NavigableString

# unwrap targets (the common/FAQ lists from above)
UNWRAP_COMMON = {"span", "strong", "font", "hr", "a", "img"}
UNWRAP_FAQ = {"p", "br", "div"}

def _unwrap_tags(soup: BeautifulSoup, is_faq: bool) -> None:
    targets = set(UNWRAP_COMMON) | (UNWRAP_FAQ if is_faq else set())
    # BeautifulSoup also accepts a set of tag names (source: chatgpt haha)
    for tag in soup.find_all(targets):
        tag.unwrap()
    # remove empty text nodes
    for el in list(soup.find_all(string=True)):
        if isinstance(el, NavigableString) and not el.strip():
            el.extract()


์‚ฌ์‹ค ์ด๊ฑด ์ง€ํ”ผํ‹ฐํ•œํ…Œ ์งœ๋‹ฌ๋ผํ•œ ์ฝ”๋“œ๋‹ค.

' ๊ณตํ†ต ํƒœ๊ทธ๋ฅผ ์ƒ์ˆ˜ํ™”ํ•ด์•ผ๊ฒ ๋‹ค'๊ณ  ์ƒ๊ฐํ–ˆ๋Š”๋ฐ ์ง€ํ”ผํ‹ฐ๋„ ๋งˆ์ฐฌ๊ฐ€์ง€์˜€๋‚˜ ๋ณด๋‹ค ใ…Žใ…Ž

๋‚˜๋Š” ๋ฆฌ์ŠคํŠธ๋กœ ํƒœ๊ทธ ๊ฒ€์ƒ‰ํ•˜๊ฒŒ ํ–ˆ์—ˆ๋Š”๋ฐ, ์ง€ํ”ผํ‹ฐ๋Š” setํƒ€์ž…์œผ๋กœ ๊ฒ€์ƒ‰์„ ํ•˜๊ฒŒ๋” ์งฐ๋‹ค.

๋ญ ์‚ฌ์‹ค.. ์–ด์ฐจํ”ผ ๋ฌธ์„œ์—์„œ ๋“ฑ์žฅํ•˜๋Š” ์ˆœ์„œ๋Œ€๋กœ ์ฒ˜๋ฆฌํ•  ๊ฑฐ๋‹ˆ๊นŒ list๋ƒ set๋ƒ๊ฐ€ ์—ฌ๊ธฐ์„œ ์ค‘์š”ํ•˜์ง„ ์•Š๋‹ค..ใ…Žใ…Ž

 

[2] Method: cleaning up tags inside tables

def _clean_table_attrs(soup: BeautifulSoup) -> None:
    for el in soup.find_all(["table","tr","td","tbody"]):
        for attr in ("style","width","border","cellpadding","cellspacing"):
            el.attrs.pop(attr, None)
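A quick check of this helper's behavior (the sample markup is made up): only the listed presentational attributes get dropped, and anything else, like class here, survives.

```python
from bs4 import BeautifulSoup

def _clean_table_attrs(soup: BeautifulSoup) -> None:
    # pop() with a default removes the attribute if present, silently skips otherwise
    for el in soup.find_all(["table", "tr", "td", "tbody"]):
        for attr in ("style", "width", "border", "cellpadding", "cellspacing"):
            el.attrs.pop(attr, None)

soup = BeautifulSoup(
    '<table border="1" style="width:90%"><tr><td width="50" class="c">x</td></tr></table>',
    "html.parser")
_clean_table_attrs(soup)
print(soup)  # -> <table><tr><td class="c">x</td></tr></table>
```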

 

[3] Method: link extraction and substitution

def _extract_link_seq_and_replace(table: Tag, linked_seq_list: list[str]) -> None:
    for a in table.find_all("a", href=True):
        m = RE_LINK_SEQ.search(a.get("href") or "")
        if not m:
            continue
        linked_seq_list.append(m.group(1))
        a.replace_with(f"[linked{len(linked_seq_list)}]")

 

์ด๊ฑด ๊ฐœ์ธ์ ์œผ๋กœ ๋ง˜์— ๋“œ๋Š” ๋ฉ”์„œ๋“œ๋‹ค.

 

BEFORE
    for table in soup.find_all('table'):
        for a_tag in table.find_all('a', href=True):
            # href์—์„œ /linkPageDetailPop/<์ˆซ์ž> ์ถ”์ถœ
            parts = a_tag['href'].split('/linkPageDetailPop/')
            if len(parts) > 1:
                m = re.search(r'\d+', parts[1])
                if m:
                    seq = m.group(0)
                    linked_seq_list.append(seq)
                    marker = f"[linked{len(linked_seq_list)}]"
..

AFTER

def _extract_link_seq_and_replace(table: Tag, linked_seq_list: list[str]) -> None:
    for a in table.find_all("a", href=True):
        m = RE_LINK_SEQ.search(a.get("href") or "")
        if not m:
            continue
        linked_seq_list.append(m.group(1))
        a.replace_with(f"[linked{len(linked_seq_list)}]")

 

I used a regex search instead of split.

No need to slice the string up first: re.search scans the entire href, so it grabs the numeric seq that follows that URL and appends + substitutes it without any trouble.

 

When I showed this code to GPT, here's what it said:

_extract_link_seq_and_replace(table, linked_seq_list)

Original location (matched from my screenshot)
  • In the screenshot: for a_tag in table.find_all('a', href=True): ...
    • parts = a_tag['href'].split('/url/')
    • re.search(r'\d+', parts[1])
    • string-based substitution like table_str = str(table); table_str.replace(...);
Effects
  • DOM-level substitution → stability↑ (prevents missed or broken string replacements)
  • Precompiling the regex → repeated cost↓ (RE_LINK_SEQ)

 

DOM-level substitution? Regex "precompiling"?

I should dig into those a bit more..? That's one for the next post..