
I have a text file with the following format:

1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 

I need to convert this text to a DataFrame with the following format:

Id   Term    weight
1    frack   0.733
1    shale   0.700
10   space   0.645
10   station 0.327
10   nasa    0.258
4    celebr  0.262
4    bahar   0.345

How can I do it?

  • I can only think of regex helping here. – amanb Apr 22 at 19:13
  • Depending on how large/long your file is, you can loop through the file without pandas to format it properly first. – Quang Hoang Apr 22 at 19:20
  • It can be done with explode and split – WeNYoBen Apr 22 at 19:24
  • Also , When you read the text to pandas what is the format of the df ? – WeNYoBen Apr 22 at 19:25
  • The data is in text format. – Mary Apr 22 at 19:26

Here's an optimized way to parse the file with re, first taking the ID and then parsing the data tuples. This takes advantage of the fact that file objects are iterable. When you iterate over an open file, you get the individual lines as strings, from which you can extract the meaningful data elements.

import re
import pandas as pd

SEP_RE = re.compile(r":\s+")
DATA_RE = re.compile(r"(?P<term>[a-z]+)\s+(?P<weight>\d+\.\d+)", re.I)


def parse(filepath: str):
    def _parse(filepath):
        with open(filepath) as f:
            for line in f:
                id, rest = SEP_RE.split(line, maxsplit=1)
                for match in DATA_RE.finditer(rest):
                    yield [int(id), match["term"], float(match["weight"])]
    return list(_parse(filepath))

Example:

>>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
...                   columns=["Id", "Term", "weight"])
>>> 
>>> df
   Id     Term  weight
0   1    frack   0.733
1   1    shale   0.700
2  10    space   0.645
3  10  station   0.327
4  10     nasa   0.258
5   4   celebr   0.262
6   4    bahar   0.345

>>> df.dtypes
Id          int64
Term       object
weight    float64
dtype: object

Walkthrough

SEP_RE looks for an initial separator: a literal : followed by one or more spaces. It uses maxsplit=1 to stop once the first split is found. Granted, this assumes your entire dataset strictly and consistently follows the example format laid out in your question.

After that, DATA_RE.finditer() deals with each (term, weight) pair extracted from rest. The string rest itself will look like frack 0.733, shale 0.700,. .finditer() gives you multiple match objects, where you can use ["key"] notation to access the element from a given named capture group, such as (?P<term>[a-z]+).

An easy way to visualize this is to use an example line from your file as a string:

>>> line = "1: frack 0.733, shale 0.700,\n"
>>> SEP_RE.split(line, maxsplit=1)
['1', 'frack 0.733, shale 0.700,\n']

Now you have the initial ID and rest of the components, which you can unpack into two identifiers.

>>> id, rest = SEP_RE.split(line, maxsplit=1)
>>> it = DATA_RE.finditer(rest)
>>> match = next(it)
>>> match
<re.Match object; span=(0, 11), match='frack 0.733'>
>>> match["term"]
'frack'
>>> match["weight"]
'0.733'

The better way to visualize it is with pdb. Give it a try if you dare ;)

Disclaimer

This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.

For instance, it assumes that each Term can only contain uppercase or lowercase ASCII letters, nothing else. If you have other Unicode characters in your identifiers, you would want to look into other re character classes, such as \w.
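For example, a sketch of such a variant (DATA_RE_W is a hypothetical name; note that \w also matches digits and the underscore, so it is broader than just "letters"):

```python
import re

# Hypothetical variant of the answer's DATA_RE: \w matches Unicode word
# characters (letters, digits, underscore) rather than only a-z / A-Z.
DATA_RE_W = re.compile(r"(?P<term>\w+)\s+(?P<weight>\d+\.\d+)")

match = DATA_RE_W.search("münchen 0.512,")
print(match["term"], match["weight"])  # münchen 0.512
```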

  • Brilliant answer, I must say. – amanb Apr 22 at 19:42
  • @amanb Thank you! – Brad Solomon Apr 22 at 19:45

You can use the DataFrame constructor if you massage your input to the appropriate format. Here is one way:

import pandas as pd
from itertools import chain

text="""1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 """

df = pd.DataFrame(
    list(
        chain.from_iterable(
            map(lambda z: (y[0], *z.strip().split()), y[1].split(",")) for y in 
            map(lambda x: x.strip(" ,").split(":"), text.splitlines())
        )
    ), 
    columns=["Id", "Term", "weight"]
)

print(df)
#   Id     Term weight
#0   1    frack  0.733
#1   1    shale  0.700
#2  10    space  0.645
#3  10  station  0.327
#4  10     nasa  0.258
#5   4   celebr  0.262
#6   4    bahar  0.345

Explanation

I assume that you've read your file into the string text. The first thing you want to do is strip leading/trailing commas and whitespace before splitting on :

print(list(map(lambda x: x.strip(" ,").split(":"), text.splitlines())))
#[['1', ' frack 0.733, shale 0.700'], 
# ['10', ' space 0.645, station 0.327, nasa 0.258'], 
# ['4', ' celebr 0.262, bahar 0.345']]

The next step is to split on the comma to separate the values, and assign the Id to each set of values:

print(
    [
        list(map(lambda z: (y[0], *z.strip().split()), y[1].split(","))) for y in 
        map(lambda x: x.strip(" ,").split(":"), text.splitlines())
    ]
)
#[[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
# [('10', 'space', '0.645'),
#  ('10', 'station', '0.327'),
#  ('10', 'nasa', '0.258')],
# [('4', 'celebr', '0.262'), ('4', 'bahar', '0.345')]]

Finally, we use itertools.chain.from_iterable to flatten this output, which can then be passed straight to the DataFrame constructor.
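For instance, applied to a trimmed copy of the nested output shown above:

```python
from itertools import chain

# chain.from_iterable removes exactly one level of nesting, turning the
# list of per-line tuple lists into a single flat list of rows.
nested = [[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
          [('10', 'space', '0.645')]]
flat = list(chain.from_iterable(nested))
print(flat)
# [('1', 'frack', '0.733'), ('1', 'shale', '0.700'), ('10', 'space', '0.645')]
```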

Note: The * tuple unpacking is a Python 3 feature.
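For example, prepending an Id to a split row with a star inside a tuple literal (PEP 448, Python 3.5+):

```python
# Star-unpacking inside a tuple display prepends the Id to the split parts.
parts = "frack 0.733".split()
row = ("1", *parts)
print(row)  # ('1', 'frack', '0.733')
```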


Assuming your data (in a text file) looks as given:

df = pd.read_csv('untitled.txt', sep=': ', header=None, engine='python')
df.set_index(0, inplace=True)

# split the `,`
df = df[1].str.strip().str.split(',', expand=True)

#    0             1              2           3
#--  ------------  -------------  ----------  ---
# 1  frack 0.733   shale 0.700
#10  space 0.645   station 0.327  nasa 0.258
# 4  celebr 0.262  bahar 0.345

# stack and drop empty
df = df.stack()
df = df[~df.eq('')]

# split ' '
df = df.str.strip().str.split(' ', expand=True)

# edit to give final expected output:

# rename index and columns for reset_index
df.index.names = ['Id', 'to_drop']
df.columns = ['Term', 'weight']

# final df
final_df  = df.reset_index().drop('to_drop', axis=1)
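Putting the steps together, a self-contained run of the same pipeline, using io.StringIO in place of the file (engine='python' is passed because sep=': ' is a two-character separator):

```python
import io
import pandas as pd

text = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345"""

# engine='python' because sep=': ' is more than one character
df = pd.read_csv(io.StringIO(text), sep=': ', header=None, engine='python')
df.set_index(0, inplace=True)

df = df[1].str.strip().str.split(',', expand=True)  # split the `,`
df = df.stack()                                     # stack; missing cells are dropped
df = df[~df.eq('')]                                 # drop empty strings
df = df.str.strip().str.split(' ', expand=True)     # split ' '

df.index.names = ['Id', 'to_drop']
df.columns = ['Term', 'weight']
final_df = df.reset_index().drop('to_drop', axis=1)
print(final_df)
```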
  • How do you not get an error with sep=': ', which is a 2-character separator? – Rebin Apr 22 at 19:55
  • @Rebin add engine='python' – pault Apr 22 at 19:58
  • @pault weird, 'cause I already split by ' '. It yields correct data on my computer. – Quang Hoang Apr 22 at 20:02
  • I don't know how to add the python engine. What is the command? – Rebin Apr 22 at 20:02
  • @Rebin add it as a param to pd.read_csv - df = pd.read_csv(..., engine='python') – pault Apr 22 at 20:04

Just to put my two cents in: you could write yourself a parser and feed the result into pandas:

import pandas as pd
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

file = """
1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 
"""

grammar = Grammar(
    r"""
    expr    = (garbage / line)+

    line    = id colon pair*
    pair    = term ws weight sep? ws?
    garbage = ws+

    id      = ~"\d+"
    colon   = ws? ":" ws?
    sep     = ws? "," ws?

    term    = ~"[a-zA-Z]+"
    weight  = ~"\d+(?:\.\d+)?"

    ws      = ~"\s+"
    """
)

tree = grammar.parse(file)

class PandasVisitor(NodeVisitor):
    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_pair(self, node, visited_children):
        term, _, weight, *_ = visited_children
        return (term.text, weight.text)

    def visit_line(self, node, visited_children):
        id, _, pairs = visited_children
        return [(id.text, *pair) for pair in pairs]

    def visit_garbage(self, node, visited_children):
        return None

    def visit_expr(self, node, visited_children):
        return [item
                for lst in visited_children
                for sublst in lst if sublst
                for item in sublst]

pv = PandasVisitor()
out = pv.visit(tree)

df = pd.DataFrame(out, columns=["Id", "Term", "weight"])
print(df)

This yields

   Id     Term weight
0   1    frack  0.733
1   1    shale  0.700
2  10    space  0.645
3  10  station  0.327
4  10     nasa  0.258
5   4   celebr  0.262
6   4    bahar  0.345

Here, we are building a grammar with the possible information: either a line or whitespace. A line is built of an id (e.g. 1), followed by a colon (:), whitespace, and pairs of term and weight, optionally followed by a separator.

Afterwards, we need a NodeVisitor class to actually do something with the retrieved AST.


It is possible to do this entirely with pandas:

from io import StringIO

df = pd.read_csv(StringIO(u"""1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 """), sep=":", header=None)

#df:
    0                                          1
0   1                 frack 0.733, shale 0.700, 
1  10   space 0.645, station 0.327, nasa 0.258, 
2   4                 celebr 0.262, bahar 0.345 

Turn the column 1 into a list and then expand:

df[1] = df[1].str.split(",", expand=False)

dfs = []
for idx, rows in df.iterrows():
    dfslice = pd.DataFrame({"Id": [rows[0]]*len(rows[1]), "terms": rows[1]})
    dfs.append(dfslice)
newdf = pd.concat(dfs, ignore_index=True)

# this creates newdf:
   Id           terms
0   1     frack 0.733
1   1     shale 0.700
2   1                
3  10     space 0.645
4  10   station 0.327
5  10      nasa 0.258
6  10                
7   4    celebr 0.262
8   4    bahar 0.345 

Now we need to split the terms column on the space and drop the empty rows:

newdf["terms"] = newdf["terms"].str.strip()
newdf = newdf.join(newdf["terms"].str.split(" ", expand=True))
newdf.columns = ["Id", "terms", "Term", "Weights"]
newdf = newdf.drop("terms", axis=1).dropna()

Resulting newdf:

   Id     Term Weights
0   1    frack   0.733
1   1    shale   0.700
3  10    space   0.645
4  10  station   0.327
5  10     nasa   0.258
7   4   celebr   0.262
8   4    bahar   0.345

Here is another take on your question: create a list which will contain an [Id, Term, weight] list for every pair, and then produce the DataFrame.

import pandas as pd
file=r"give_your_path".replace('\\', '/')
my_list_of_lists=[]#creating an empty list which will contain lists of [Id Term  Weight]
with open(file,"r+") as f:
    for line in f.readlines():#looping every line
        my_id=[line.split(":")[0]]#storing the Id in order to use it in every term
        for term in [s.strip().split(" ") for s in line[line.find(":")+1:].split(",") if s.strip()]:#skip empty pieces; [:-1] would lose the last pair on lines without a trailing comma
            my_list_of_lists.append(my_id+term)
df=pd.DataFrame.from_records(my_list_of_lists)#turning the lists to dataframe
df.columns=["Id","Term","weight"]#giving columns their names

Could I assume that there is just 1 space before 'TERM'?

df=pd.DataFrame(columns=['ID','Term','Weight'])
with open('C:/random/d1','r') as readObject:
    for line in readObject:
        line=line.rstrip('\n')
        tempList1=line.split(':')
        tempList2=tempList1[1]
        tempList2=tempList2.strip(' ,')
        tempList2=tempList2.split(',')
        for item in tempList2:
            e=item.split()
            tempRow=[tempList1[0], e[0],e[1]]
            df.loc[len(df)]=tempRow
print(df)

Maybe this will make it easier to understand what happens. You only need to update the code to read from a file instead of using the variable.

import pandas as pd

txt = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345"""

data = []
for line in txt.splitlines():
    key, values = line.split(':')
    for elements in values.split(','):
        if elements:
            term, weight = elements.split()
            data.append({'Id': key, 'Term': term, 'Weight': weight})

df = pd.DataFrame(data)

DF:

   Id    Term  Weight
0   1    frack  0.733
1   1    shale  0.700
2  10    space  0.645
3  10  station  0.327
4  10     nasa  0.258
5   4   celebr  0.262
6   4    bahar  0.345

1) You can read the file row by row.

2) Then you can separate on ':' for your index and on ',' for the values.

with open('path/filename.txt','r') as filename:
   content = filename.readlines()

content = [x.split(':') for x in content]

This will give you the following result:

content =[
    ['1', ' frack 0.733, shale 0.700, \n'],
    ['10', ' space 0.645, station 0.327, nasa 0.258, \n'],
    ['4', ' celebr 0.262, bahar 0.345 ']]
  • 3
    Your result is not the result asked for in the question. – MyNameIsCaleb Apr 22 at 19:31
