Robin on Linux

Import date column in Pandas to BigQuery

January 14, 2021, 7:22 pm

Imagine we have a small CSV file: name,enroll_time robin,2021-01-15 09:50:33 tony,2021-01-14 01:50:33 jaime,2021-01-13 00:50:33 tyrion,2021-2-15 13:22:17 bran,2022-3-16 14:00:01 Let’s try to load it...

View Article
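A minimal sketch of loading the sample CSV above with Pandas and turning enroll_time into a real datetime column. The excerpt is cut off before it shows the loading step, so the format string and the use of Python's strptime are assumptions, not the article's method. Note that two rows use non-zero-padded months ("2021-2-15", "2022-3-16").

```python
import io
from datetime import datetime

import pandas as pd

# The sample CSV from the excerpt; some months are not zero-padded.
CSV_TEXT = """name,enroll_time
robin,2021-01-15 09:50:33
tony,2021-01-14 01:50:33
jaime,2021-01-13 00:50:33
tyrion,2021-2-15 13:22:17
bran,2022-3-16 14:00:01
"""

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed layout of enroll_time

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Python's strptime accepts a single-digit month for %m, so every row parses.
parsed = [datetime.strptime(text, TIME_FORMAT) for text in df["enroll_time"]]
df["enroll_time"] = pd.to_datetime(parsed)

print(df.dtypes)
print(df)
```

Parsing explicitly before any upload keeps the column type under your control instead of leaving it to be guessed from strings.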

A few notes for Pandas and BigQuery

January 21, 2021, 2:57 pm

1. Get the memory size of a Pandas DataFrame: df.memory_usage(deep=True).sum()
2. Upload a large Pandas DataFrame to a BigQuery table. If your DataFrame is too big, the upload operation will report...

View Article
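One possible way to combine the two notes above is to use the memory estimate to split a large DataFrame into smaller row chunks before uploading. This is only a sketch: the excerpt does not show the article's actual workaround, split_by_memory is a hypothetical helper, and the upload call itself is left as a comment.

```python
import math

import pandas as pd


def split_by_memory(df: pd.DataFrame, max_bytes: int) -> list:
    """Split df into row chunks of roughly max_bytes each (hypothetical helper)."""
    total_bytes = int(df.memory_usage(deep=True).sum())
    if len(df) == 0 or total_bytes <= max_bytes:
        return [df]
    n_chunks = math.ceil(total_bytes / max_bytes)
    rows_per_chunk = max(1, math.ceil(len(df) / n_chunks))
    return [df.iloc[i:i + rows_per_chunk] for i in range(0, len(df), rows_per_chunk)]


# Made-up data, only to exercise the helper.
df = pd.DataFrame({"id": range(10_000), "name": ["user"] * 10_000})
chunks = split_by_memory(df, max_bytes=100_000)

for chunk in chunks:
    # An upload call for each chunk would go here (not shown in the excerpt).
    print(len(chunk), "rows,", int(chunk.memory_usage(deep=True).sum()), "bytes")
```

The chunk sizes are estimates, because rows with long strings take more memory than the average row.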

A stupid mistake in the new deep learning experiment

January 28, 2021, 9:46 pm

After my old colleague JianMei prepared about 1 TB of bird sound recordings (every mp3 file is converted to a spectrogram image and split into chunks, each chunk 2.5...

View Article

An old bug about PyArrow

February 4, 2021, 4:59 pm

To save memory in my program that uses Pandas, I changed the types of some columns from string to category, following the reference: df[["os_type", "cpu_type", "chip_brand"]] = df[["os_type", "cpu_type",...

View Article
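A small sketch of the string-to-category conversion mentioned above, with a before/after memory comparison. The column names come from the excerpt; the values are made up, and astype with a dict is just one way to do the conversion, not necessarily the one the article uses. The PyArrow bug itself is not reproduced here.

```python
import pandas as pd

CATEGORY_COLUMNS = ["os_type", "cpu_type", "chip_brand"]

# Made-up values; only the column names come from the excerpt.
df = pd.DataFrame({
    "os_type": ["linux", "windows", "macos"] * 1000,
    "cpu_type": ["x86_64", "arm64", "x86_64"] * 1000,
    "chip_brand": ["intel", "amd", "apple"] * 1000,
})

bytes_before = int(df.memory_usage(deep=True).sum())

# Convert each listed column to the category dtype.
df = df.astype({column: "category" for column in CATEGORY_COLUMNS})

bytes_after = int(df.memory_usage(deep=True).sum())
print("before:", bytes_before, "bytes; after:", bytes_after, "bytes")
```

The saving is largest when a column holds few distinct values repeated many times, as in this toy data.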

How to gracefully end a PySpark application

February 11, 2021, 4:33 pm

This article recommends using “return” to jump out of a PySpark application. But when I followed that advice, Python reported an error: File "test.py", line 333 return ^ SyntaxError: 'return' outside...

View Article
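The SyntaxError above appears because "return" is only valid inside a function body. A common pattern, sketched here without any Spark dependency, is to move the job into a main() function so that an early return becomes legal. The main() body and the rows_to_process check are hypothetical; the excerpt does not show which fix the article settles on.

```python
def main() -> int:
    """Hypothetical job body; returns a process exit code."""
    rows_to_process = 0  # stand-in for a real check, e.g. an empty input
    if rows_to_process == 0:
        print("nothing to do, exiting early")
        return 0  # legal here because we are inside a function
    print("processing", rows_to_process, "rows")
    return 0


# A bare top-level 'return' does not even compile:
try:
    compile("return", "<example>", "exec")
    top_level_return_ok = True
except SyntaxError as error:
    top_level_return_ok = False
    print("SyntaxError:", error.msg)

exit_code = main()
print("exit code:", exit_code)
# In a real script the last line could be sys.exit(exit_code).
```

In a real PySpark job, any cleanup of the Spark session would go inside main() before each return.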

To put Back-Quote in a string of Bash

February 18, 2021, 2:52 pm

It’s very simple to print the word “hello” in Bash: echo "hello" But how do we print a word with back-quotes? echo "`hello`" # This reports an error because Bash tries to run 'hello' as a command bash:...

View Article

Some thoughts about cuDF and cuML

February 25, 2021, 9:45 pm

I just received an email from NVIDIA about their RAPIDS. Although cuDF and cuML look fantastic for a data scientist, I am still doubtful about them. In our daily work, we usually process small...

View Article

Strange time output in a container of Kubernetes cluster

March 4, 2021, 4:49 pm

After running a workflow in Argo, I found that the output of the “date” command was totally wrong: # date Wed Mar 3 00:41:27 2021 # TZ='America/Los_Angeles' date Wed Mar 3 00:41:36 2021 #...

View Article

Change the schema of BigQuery tables

March 10, 2021, 7:27 pm

We can easily add a new column to a table in BigQuery: ALTER TABLE mydataset.mytable ADD COLUMN new_col STRING But when you want to delete or rename an existing column, there is no SQL statement to do it....

View Article

Accelerate reading of NumPy array from files

March 18, 2021, 4:18 pm

During training, I need to read array data from a .npy file and take part of it: import numpy as np data = np.load("sample1.npy") sound1 = data[start1: end1] sound2 = data[start2: end2] Since...

View Article
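When only a few slices of a large .npy file are needed, one option is to memory-map the file with np.load(..., mmap_mode="r") so that only the requested ranges are read from disk. This is a sketch of that general technique; the excerpt is truncated, so it may not be the approach the article takes. The file contents and the start/end values below are made up.

```python
import os
import tempfile

import numpy as np

# Create a small stand-in for "sample1.npy" in a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample1.npy")
np.save(path, np.arange(1_000_000, dtype=np.float32))

start1, end1 = 1_000, 2_000
start2, end2 = 500_000, 501_000

# mmap_mode="r" maps the file instead of loading all of it into memory.
data = np.load(path, mmap_mode="r")

# np.array(...) copies just the selected slice into ordinary memory.
sound1 = np.array(data[start1:end1])
sound2 = np.array(data[start2:end2])

print(type(data).__name__, sound1.shape, sound2.shape)
```

Whether this is faster than a plain np.load depends on the file size and on how much of the array each training step touches.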

Source code reading of LightGBM

March 30, 2021, 8:54 pm

Finally I got a few hours to look into the code of LightGBM. I used to have some questions about LightGBM, and now fortunately I can answer some of them by myself. Even though some answers may be wrong, that...

View Article

An error about multiprocessing of Python

April 7, 2021, 5:56 pm

Our Python program reported errors when running on a new dataset: [77 rows x 4 columns]]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647",)'...

View Article
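The range in the error message above is exactly the range of a signed 32-bit integer, and the wording matches what Python's struct module raises when a value does not fit the "i" format. The sketch below reproduces only that limit in isolation. It is an assumption that the article's root cause is an object too large to pass between processes (as far as I know, older Python versions packed the size of each pickled message into such a field); the excerpt is cut off before it says so. fits_in_int32_header is a hypothetical helper.

```python
import struct

INT32_MAX = 2**31 - 1  # 2147483647, the upper bound shown in the error


def fits_in_int32_header(payload_size: int) -> bool:
    """Return True if payload_size can be packed as a signed 32-bit integer."""
    try:
        struct.pack("!i", payload_size)
        return True
    except struct.error as error:
        print("struct.error:", error)
        return False


print(fits_in_int32_header(INT32_MAX))      # largest value that still fits
print(fits_in_int32_header(INT32_MAX + 1))  # one byte past 2 GiB - 1
```

If that assumption holds, sending smaller pieces of data between processes would be one way to stay under the limit.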

Be careful when you use “isin()” method in Pandas

April 8, 2021, 9:17 pm

import pandas as pd df_excl = pd.DataFrame({"id": ["12345"]}) df = pd.DataFrame({"id": ["12345", "67890"]}) result = df[~df.id.isin(df_excl[["id"]])] print(result) Guess what’s the result of the above...

View Article
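A hint for the puzzle above: df_excl[["id"]] (double brackets) is a DataFrame, and iterating a DataFrame yields its column labels rather than its values, so isin() is likely comparing the ids against the label "id". The sketch below shows that iteration behaviour and a variant that passes a Series instead. The explanation is my reading of the snippet, since the excerpt is truncated before the article's answer.

```python
import pandas as pd

df_excl = pd.DataFrame({"id": ["12345"]})
df = pd.DataFrame({"id": ["12345", "67890"]})

# Iterating a DataFrame yields column labels, not cell values.
print(list(df_excl[["id"]]))  # ['id']

# Single brackets give a Series, so isin() sees the actual id values.
result = df[~df["id"].isin(df_excl["id"])]
print(result)
```

With the Series version, only the row whose id is absent from df_excl remains.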

Debug CUDA error for PyTorch

April 22, 2021, 6:17 pm

After I changed the dataset used by my code, the training failed: /tmp/pip-req-build-_tx3iysr/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:310: operator(): block: [0,0,0], thread: [59,0,0] Assertion...

View Article

Take care of the comma (in Python)

April 28, 2021, 9:06 pm

Think about the result of this snippet: def concat(a, b): return a + "_" + b left = "hello", right = "world" print(concat(left, right)) Should be “hello_world”, right? But the actual result is an...

View Article
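The snippet above can be run as follows to see what the stray comma does. In Python, a trailing comma after an expression in an assignment creates a one-element tuple, so left is no longer a string. The error handling and the variable error_message are additions for illustration.

```python
def concat(a, b):
    return a + "_" + b


left = "hello",   # the trailing comma makes this the tuple ("hello",)
right = "world"

print(type(left).__name__)  # tuple

try:
    concat(left, right)
    error_message = None
except TypeError as error:
    # Adding a tuple and a string is not allowed.
    error_message = str(error)
    print("TypeError:", error_message)

# Without the comma, the function behaves as expected.
print(concat("hello", right))  # hello_world
```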

Migrate Spark job to BigQuery

May 6, 2021, 5:45 pm

I have just finished migrating a Spark job to BigQuery, or more precisely, migrating Python code to SQL. It was tedious work, but it improved performance significantly: from 4 hours runtime of...

View Article

Trace memory error of CUDA program

May 13, 2021, 5:57 pm

A program that uses CUDA for GPU computing reported a memory error: terminate called after throwing an instance of 'std::runtime_error' what(): [CUDA] an illegal memory access was...

View Article

Be careful with random generate number

May 26, 2021, 10:50 pm

This is the program I have used for a month: import numpy as np np.random.seed(202105) rand = np.random.rand() # business logic code using 'rand' Then I added another np.random.rand() at the head of the...

View Article

Recover truncated table in BigQuery

June 2, 2021, 11:39 pm

If you accidentally truncate a table in BigQuery, you can follow this article to recover the data. Furthermore, I found out that the “bq cp project:dataset.table@-36000 project:dataset.table” method...

View Article

Upgrade GKE cluster

June 10, 2021, 5:15 pm

Normally, to upgrade a Google Kubernetes Engine cluster, we need to upgrade the master first, and then the node pools. For convenience, I just clicked the button “UPGRADE AVAILABLE” in the “Release...

View Article