Parsing an XML file into a pandas DataFrame - Python 3.8.x

2020-07-01 python xml xml-parsing

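I am working on parsing the JHU Amazon review dataset in a very specific way for a project.

I have a list of XML files, each containing thousands of Amazon product reviews in the format below: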

<review>
<unique_id>
B000AN11UA:disappointed:c._tina_"ctina401"
</unique_id>
<unique_id>
2179
</unique_id>
<asin>
B000AN11UA
</asin>
<product_name>
Incase Limited Edition iPod Case - "Fleur" Signature Series: Apparel
</product_name>
<product_type>
apparel
</product_type>
<product_type>
apparel
</product_type>
<helpful>
3 of 3
</helpful>
<rating>
1.0
</rating>
<title>
disappointed
</title>
<date>
March 10, 2006
</date>
<reviewer>
C. Tina "Ctina401"
</reviewer>
<reviewer_location>
Woonsocket
</reviewer_location>
<review_text>
I want to start by saying Fred Flare- shipped this product very fast!! And the transaction itself was very smooth. I do however, have extreme problems with the product itself. The product is not leather, its nylon, and it sort of looks cheap? The inside material is sued, but that's only the lining for the base of the wallet. Also, The wallet part is very hard to use. You cant really put too much in the wallet- The credit card slots are a little too snug, and there is no place for my I.D. The wallet included a small "note book" but it also doesn't fit in the wallet? I was very excited about this product, but now I feel duped. The pictures made the wallet seem like it was of higher quality, and that it was user friendly, but it's not. I do not recommend this product
</review_text>
</review>
<review>
<unique_id>
B000AN11UA:cute_but_way_disappointing:katherine_m._perkins
</unique_id>
<unique_id>
2180
</unique_id>
<asin>
B000AN11UA
</asin>
<product_name>
Incase Limited Edition iPod Case - "Fleur" Signature Series: Apparel
</product_name>
<product_type>
apparel
</product_type>
<product_type>
apparel
</product_type>
<helpful>
3 of 5
</helpful>
<rating>
1.0
</rating>
<title>
Cute But Way Disappointing
</title>
<date>
January 13, 2006
</date>
<reviewer>
Katherine M. Perkins
</reviewer>
<reviewer_location>
Pasadena, CA USA
</reviewer_location>
<review_text>
I have to say that I was disappointed when I opened up the package containing my iPod wallet. It's cute, but not $60ish cute. First of all, it's not leather, it's nylon. The lining is indeed suede, but the photos in the product listing are misleading. I'm keeping it because the hassle of shipping it back, etc. isn't worth it. It does the job, but it wasn't what I was expecting. I feel ripped off
</review_text>
</review>

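Ideally, I could parse each review (delimited by the <review> ... </review> tags) and extract the following information:

  • The unique ID (<unique_id>)
  • The rating (<rating>)
  • The text (<review_text>)

Across this whole file of thousands of reviews, I would like to keep three lists, IDs, Ratings, and Text, which I will eventually convert into a DataFrame.

I tried to accomplish this using xml.etree, inspired by this post, but it is failing: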

import pandas as pd, xml
from xml.etree import cElementTree as ET

ids = []
reviews = []
ratings = []

with open('amazon_review.txt') as file:
    txt = file.read()
    tree = ET.fromstring(txt)
    root = tree.getroot()
    """
    Here, something like:
    for id, review, rating in zip(root.findall(unique_id, review_text, rating)):
        ids.append(id)
        reviews.append(review)
        ratings.append(rating)
    """
    for tags in root.findall(".//unique_id"):
        print(tags.text)

But I get a ParseError:

Traceback (most recent call last):

  File "c:\users\wundermahn\appdata\local\programs\python\python38\lib\site-packages\IPython\core\interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-6-2c88d31ca8ac>", line 4, in <module>
    tree = ET.fromstring(txt)

  File "c:\users\wundermahn\appdata\local\programs\python\python38\lib\xml\etree\ElementTree.py", line 1320, in XML
    parser.feed(text)

  File "<string>", line unknown
ParseError: junk after document element: line 42, column 0
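
The message means the parser found more content after the first top-level element had already closed. In the sample above, the first <review> block spans 41 lines, so line 42 is where the second <review> begins. A minimal sketch that reproduces this on a made-up two-review string (the `snippet` string and the `<reviews>` wrapper name are illustrative assumptions, not part of the dataset):

```python
import xml.etree.ElementTree as ET

# Two sibling top-level elements, like consecutive <review> blocks in the file.
snippet = (
    "<review><rating>1.0</rating></review>\n"
    "<review><rating>2.0</rating></review>"
)

# An XML document may have only one root element, so this fails.
try:
    ET.fromstring(snippet)
    error_message = None
except ET.ParseError as exc:
    error_message = str(exc)

print(error_message)

# Wrapping the same text in a single (hypothetical) root makes it parse.
wrapped = ET.fromstring("<reviews>" + snippet + "</reviews>")
ratings = [review.findtext("rating") for review in wrapped.iter("review")]
print(ratings)
```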

I found this question, so I also tried the following:

with open('test_review.txt') as file:
    txt = file.read()
    tree = ET.fromstring(txt)
    for node in tree.iter('entry'):
        for elem in node:
            print(elem.tag, elem.text)

But that results in the same error.

How can I parse the sample file above to get the desired result across many reviews?

Answers
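
One possible approach, offered as a sketch rather than a definitive solution: wrap the file contents in a single root element before parsing, then read the three fields from each <review>. The helper names `field`, `parse_reviews`, and `to_dataframe`, the `<reviews>` wrapper, and the shortened `sample` string are all assumptions made for illustration. The sketch also assumes the file has no XML declaration and that each review is otherwise well-formed XML.

```python
import xml.etree.ElementTree as ET


def field(review, tag):
    """Return the stripped text of the first <tag> child, or None if absent."""
    value = review.findtext(tag)
    return value.strip() if value is not None else None


def parse_reviews(raw_text):
    """Return one dict per <review> holding its id, rating, and text."""
    # Assumption: adding a made-up <reviews> root yields one valid document.
    root = ET.fromstring("<reviews>" + raw_text + "</reviews>")
    records = []
    for review in root.iter("review"):
        rating = field(review, "rating")
        records.append({
            # findtext returns the first match, so the string-style id is
            # used rather than the numeric <unique_id> that follows it.
            "id": field(review, "unique_id"),
            "rating": float(rating) if rating else None,
            "text": field(review, "review_text"),
        })
    return records


def to_dataframe(records):
    """Build the DataFrame; pandas is imported here so parsing works without it."""
    import pandas as pd
    return pd.DataFrame(records, columns=["id", "rating", "text"])


# Shortened stand-in for the file contents, modeled on the sample in the question.
sample = """<review>
<unique_id>
B000AN11UA:disappointed:c._tina_"ctina401"
</unique_id>
<unique_id>
2179
</unique_id>
<rating>
1.0
</rating>
<review_text>
I do not recommend this product
</review_text>
</review>
<review>
<unique_id>
B000AN11UA:cute_but_way_disappointing:katherine_m._perkins
</unique_id>
<unique_id>
2180
</unique_id>
<rating>
1.0
</rating>
<review_text>
I feel ripped off
</review_text>
</review>
"""

records = parse_reviews(sample)

# The three lists asked for in the question.
ids = [r["id"] for r in records]
ratings = [r["rating"] for r in records]
texts = [r["text"] for r in records]
```

With the real file, the same idea would be `records = parse_reviews(open('amazon_review.txt').read())` followed by `df = to_dataframe(records)`. Two smaller points about the attempts in the question: `ET.fromstring` already returns the root `Element`, so there is no `getroot()` to call on it (that method belongs to `ElementTree` objects returned by `ET.parse`), and `tree.iter('entry')` yields nothing here because the sample contains no `<entry>` elements.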
