
A problem of using Pyspark SQL


Here is the code:

from pyspark.sql import SQLContext
from pyspark.context import SparkContext
from pyspark.sql.types import *

sc = SparkContext()
sqlContext = SQLContext.getOrCreate(sc)

# Four nullable columns: three longs and one integer
schema = StructType([StructField('id', LongType(), True),
                     StructField('gid', LongType(), True),
                     StructField('pid', LongType(), True),
                     StructField('firstlogin', IntegerType(), True)])

row = ['2', '29', '29', '29']
df = sqlContext.createDataFrame(row, schema)  # <-- this line fails
df.show()

It reports the following error when run with ‘cat xxx.py | bin/pyspark’:

TypeError: StructType can not accept object '2' in type <class 'str'>

At first I thought it was because ‘2’ is a string, so I changed ‘row’ to ‘[2, 29, 29, 29]’. But the error merely changed to:

TypeError: StructType can not accept object 2 in type <class 'int'>

Then I searched on Google and found this article. It looked like I had forgotten to convert my Python ‘list’ into a Spark ‘RDD’.
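(For the record, that RDD route does work. Here is a minimal sketch, assuming the same ‘sc’, ‘sqlContext’, and ‘schema’ as above, with the values changed to ints so they match the declared Long/Integer types:)

rdd = sc.parallelize([[2, 29, 29, 29]])  # an RDD holding one row
df = sqlContext.createDataFrame(rdd, schema)
df.show()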
But in the end I found the real reason: ‘createDataFrame()’ expects a list of rows, so my single ‘row’ just needed an extra pair of ‘[]’ around it! Passed bare, Spark treats each element of ‘row’ as an entire row and checks it against the four-field StructType, which is exactly what both TypeErrors were complaining about.
The correct code is:

row = ['2', '29', '29', '29']
df = sqlContext.createDataFrame([row], schema)  # wrap the single row in a list
df.show()
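One thing to watch, though: the values in ‘row’ are still strings, while the schema declares LongType and IntegerType fields. Depending on the Spark version, ‘createDataFrame()’ verifies each row against the schema and may reject the strings as well. If it does, pass values that actually match the declared types:

row = [2, 29, 29, 29]  # ints, matching LongType/IntegerType
df = sqlContext.createDataFrame([row], schema)
df.show()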

