Import date column in Pandas to BigQuery
Imagine we have a small CSV file: name,enroll_time robin,2021-01-15 09:50:33 tony,2021-01-14 01:50:33 jaime,2021-01-13 00:50:33 tyrion,2021-2-15 13:22:17 bran,2022-3-16 14:00:01 Let’s try to load it...
A few notes for Pandas and BigQuery
1. Get the memory size of a Pandas DataFrame: df.memory_usage(deep=True).sum() 2. Upload a large Pandas DataFrame to a BigQuery table: if your DataFrame is too big, the upload operation will report...
A stupid mistake in the new deep learning experiment
After my old colleague JianMei prepared about 1TB of bird sound recordings (every mp3 file is converted to a spectrogram image and split into chunks, with each chunk 2.5...
An old bug about PyArrow
To save memory in my Pandas program, I changed the types of some columns from string to category, following the reference: df[["os_type", "cpu_type", "chip_brand"]] = df[["os_type", "cpu_type",...
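The saving from a string-to-category conversion can be checked with memory_usage(). This is a minimal sketch, assuming a made-up DataFrame: the column names come from the snippet above, while the values and the row count are invented for illustration.

```python
import pandas as pd

# Hypothetical data: a few distinct strings repeated many times.
n = 10_000
df = pd.DataFrame({
    "os_type": ["linux", "windows", "macos", "android"] * (n // 4),
    "cpu_type": ["x86_64", "arm64"] * (n // 2),
    "chip_brand": ["intel", "amd", "apple", "qualcomm", "other"] * (n // 5),
})

cols = ["os_type", "cpu_type", "chip_brand"]
before = df.memory_usage(deep=True).sum()

# astype with a dict converts only the listed columns and returns a new DataFrame.
df_cat = df.astype({col: "category" for col in cols})
after = df_cat.memory_usage(deep=True).sum()

print(f"string columns: {before} bytes, category columns: {after} bytes")
```

The gain depends on how often values repeat; a column where almost every string is unique may not shrink at all.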
How to gracefully end a PySpark application
This article recommends using “return” to jump out of a PySpark application. But after I followed that advice, it reported an error: File "test.py", line 333 return ^ SyntaxError: 'return' outside...
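The SyntaxError comes from Python itself: “return” is only legal inside a function. One way around it, shown here as a sketch and not necessarily the fix this post arrives at, is to wrap the job in a function so that an early “return” works. The names main and should_stop are hypothetical, and the Spark calls are left out to keep the example self-contained.

```python
# A bare "return" at module level is rejected at compile time.
try:
    compile("return", "test.py", "exec")
except SyntaxError as err:
    syntax_message = err.msg
    print(syntax_message)


def should_stop(rows):
    """Hypothetical early-exit condition: nothing to process."""
    return len(rows) == 0


def main(rows):
    # In a real job, the SparkSession would be created here.
    if should_stop(rows):
        # Legal now, because we are inside a function.
        return "skipped"
    print(f"processed {len(rows)} rows, total={sum(rows)}")
    return "done"


print(main([]))
print(main([1, 2, 3]))
```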
To put Back-Quote in a string of Bash
It’s very simple to print the word “hello” in Bash: echo "hello" But how do we print a word with back-quotes? echo "`hello`" # This reports an error because Bash tries to run 'hello' as a command bash:...
Some thoughts about cuDF and cuML
I just received an email from NVIDIA about RAPIDS. Although cuDF and cuML look fantastic for a data scientist, I am still doubtful about them. In our daily work, we usually process small...
Strange time output in a container of Kubernetes cluster
After running a workflow in Argo, I found that the output of the “date” command was totally wrong: # date Wed Mar 3 00:41:27 2021 # TZ='America/Los_Angeles' date Wed Mar 3 00:41:36 2021 #...
Change the schema of BigQuery tables
We can easily add a new column to a table in BigQuery: ALTER TABLE mydataset.mytable ADD COLUMN new_col STRING But when you want to delete or rename an existing column, there is no SQL statement for it....
Accelerate reading of NumPy array from files
During training, I need to read array data from a .npy file and take parts of it: import numpy as np data = np.load("sample1.npy") sound1 = data[start1: end1] sound2 = data[start2: end2] Since...
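When only a few slices of a large .npy file are needed, one common option is memory mapping, so that NumPy reads just the requested ranges instead of the whole file. This is a sketch of that general technique, not necessarily the approach this post ends up with. The file contents and slice bounds are made up.

```python
import os
import tempfile

import numpy as np

# Build a throwaway .npy file standing in for "sample1.npy".
path = os.path.join(tempfile.mkdtemp(), "sample1.npy")
np.save(path, np.arange(1_000_000, dtype=np.float32))

# mmap_mode="r" maps the file read-only; data is not loaded until sliced.
data = np.load(path, mmap_mode="r")

start1, end1 = 100, 200          # hypothetical slice bounds
start2, end2 = 500_000, 500_050

# np.array(...) copies just these ranges from disk into ordinary arrays.
sound1 = np.array(data[start1:end1])
sound2 = np.array(data[start2:end2])

print(sound1.shape, sound2.shape)
```

Whether this is faster in practice depends on the file size and on how much of the array each training step touches.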
Source code reading of LightGBM
Finally I got a few hours to look into the code of LightGBM. I used to have some questions about LightGBM, and now fortunately I can answer some of them myself. Even if some answers may be wrong, that...
An error about multiprocessing of Python
Our Python program reported errors when running on a new dataset: [77 rows x 4 columns]]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647",)'...
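The message in this error is the one raised by the standard struct module when a value does not fit in a signed 32-bit integer. The sketch below only reproduces that failure. Its link to multiprocessing is an assumption here: older Python versions packed the length of each pickled message this way, so an object larger than about 2 GiB sent between processes could trigger it.

```python
import struct

INT32_MAX = 2**31 - 1

# Fits in a signed 32-bit integer: packs into 4 bytes.
header = struct.pack("!i", INT32_MAX)
print(len(header))

# One past the limit: struct refuses to pack it.
try:
    struct.pack("!i", INT32_MAX + 1)
except struct.error as err:
    message = str(err)
    print(message)
```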
Be careful when you use “isin()” method in Pandas
import pandas as pd df_excl = pd.DataFrame({"id": ["12345"]}) df = pd.DataFrame({"id": ["12345", "67890"]}) result = df[~df.id.isin(df_excl[["id"]])] print(result) Guess what’s the result of the above...
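One detail worth checking in this snippet is the double bracket: df_excl[["id"]] is a DataFrame, while df_excl["id"] is a Series. Iterating over a DataFrame yields its column labels, not its values, and the assumption in this sketch is that isin() consumes a DataFrame argument in that way. The variable names mask_from_frame, mask_from_series and kept are introduced only for the comparison.

```python
import pandas as pd

df_excl = pd.DataFrame({"id": ["12345"]})
df = pd.DataFrame({"id": ["12345", "67890"]})

# Iterating over a DataFrame yields column labels, not cell values.
print(list(df_excl[["id"]]))

# Double brackets: a DataFrame is passed to isin().
mask_from_frame = df.id.isin(df_excl[["id"]])

# Single brackets: a Series of the actual ids is passed.
mask_from_series = df.id.isin(df_excl["id"])

kept = df[~mask_from_series]
print(mask_from_frame.tolist(), mask_from_series.tolist())
print(kept)
```

Comparing the two printed masks shows whether the exclusion list was actually applied.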
Debug CUDA error for PyTorch
After I changed my dataset for my code, the training failed: /tmp/pip-req-build-_tx3iysr/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:310: operator(): block: [0,0,0], thread: [59,0,0] Assertion...
Take care of the comma (in Python)
Think about the result of this snippet: def concat(a, b): return a + "_" + b left = "hello", right = "world" print(concat(left, right)) Should be “hello_world”, right? But the actual result is an...
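The trailing comma after "hello" is what changes the outcome: in Python, a trailing comma turns the right-hand side of an assignment into a one-element tuple. A short sketch that makes this visible (the try/except is added only so the script runs to the end):

```python
def concat(a, b):
    return a + "_" + b


left = "hello",   # trailing comma: this is the tuple ("hello",)
right = "world"

print(type(left).__name__, left)

try:
    print(concat(left, right))
except TypeError as err:
    # tuple + str is not defined, so the concatenation fails.
    print("TypeError:", err)

# Without the comma, the call behaves as expected.
print(concat("hello", right))
```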
Migrate Spark job to BigQuery
I have just finished migrating a Spark job to BigQuery, or more precisely, migrating Python code to SQL. It was tedious work, but it improved performance significantly: from 4 hours runtime of...
Trace memory error of CUDA program
The program, which uses CUDA to compute on the GPU, reported a memory error: terminate called after throwing an instance of 'std::runtime_error' what(): [CUDA] an illegal memory access was...
Be careful with random generate number
This is the program I have used for a month: import numpy as np np.random.seed(202105) rand = np.random.rand() # business logic code using 'rand' Then I added another np.random.rand() at the head of the...
Recover truncated table in BigQuery
If you accidentally truncate a table in BigQuery, you can follow this article to recover the data. Furthermore, I found out that the “bq cp project:dataset.table@-36000 project:dataset.table” method...
Upgrade GKE cluster
Normally, to upgrade a Google Kubernetes Engine cluster, we need to upgrade the master first, and then the node pools. For convenience, I just clicked the “UPGRADE AVAILABLE” button in the “Release...