Browsing all 236 articles
Browse latest View live

Import date column in Pandas to BigQuery

Imagine we have a small CSV file: name,enroll_time robin,2021-01-15 09:50:33 tony,2021-01-14 01:50:33 jaime,2021-01-13 00:50:33 tyrion,2021-2-15 13:22:17 bran,2022-3-16 14:00:01 Let’s try to load it...

A few notes for Pandas and BigQuery

1. Get the memory size of a Pandas DataFrame: df.memory_usage(deep=True).sum() 2. Upload a large Pandas DataFrame to a BigQuery table: if your DataFrame is too big, the upload operation will report...
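
The memory check from the first note can be tried on a small frame. This is a minimal sketch, assuming a made-up DataFrame (the column names and values are hypothetical). It compares the default count with `deep=True`, which also inspects the contents of object-dtype columns.

```python
import pandas as pd

# Hypothetical sample data, only for illustration.
df = pd.DataFrame({
    "name": ["robin", "tony", "jaime"] * 1000,
    "score": range(3000),
})

# Default (shallow) count: object-dtype columns are measured without
# looking inside the Python objects they reference.
shallow_bytes = df.memory_usage().sum()

# Deep count: also includes the memory held by the referenced objects.
deep_bytes = df.memory_usage(deep=True).sum()

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
```

For frames with many string columns, the deep figure is usually the more realistic one to compare against an upload limit.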

A stupid mistake in the new deep learning experiment

After my old colleague JianMei prepared about 1 TB of bird sound recordings (every mp3 file is converted to an image using a spectrogram and split into chunks, with each chunk 2.5...

An old bug about PyArrow

To save memory in my program using Pandas, I changed the types of some columns from string to category, following the reference. df[["os_type", "cpu_type", "chip_brand"]] = df[["os_type", "cpu_type",...

How to gracefully end a PySpark application

This article recommends using “return” to jump out of a PySpark application. But when I followed that advice, it reported an error: File "test.py", line 333 return ^ SyntaxError: 'return' outside...
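
The error above is plain Python behaviour and is not specific to PySpark: `return` is only legal inside a function. One common way around it, which may or may not be what the full post settles on, is to move the job body into a function. In this minimal sketch, `main` and `records` are hypothetical stand-ins and no Spark session is created.

```python
def main() -> int:
    """Hypothetical job body; a real job would build its Spark session here."""
    records = []  # stand-in for data the job would load
    if not records:
        # Inside a function, `return` is legal and ends the job early.
        print("nothing to process, exiting early")
        return 0
    print(f"processing {len(records)} records")
    return 0


# A bare `return` at module level is rejected when the code is compiled.
try:
    compile("return", "<module-level>", "exec")
    module_level_return_ok = True
except SyntaxError:
    module_level_return_ok = False

exit_code = main()
print(module_level_return_ok, exit_code)
```

In a script, the usual last line would be `sys.exit(main())` under an `if __name__ == "__main__":` guard, so the return value becomes the process exit code.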

To put Back-Quote in a string of Bash

It’s very simple to print the word “hello” in Bash: echo "hello" But how do you print a word with back-quotes? echo "`hello`" # It will report an error because Bash will try to run 'hello' as a command bash:...

Some thoughts about cuDF and cuML

I just received an email from NVIDIA about RAPIDS. Although cuDF and cuML look fantastic for a data scientist, I am still doubtful about them. In our daily work, we usually process small...

Strange time output in a container of Kubernetes cluster

After running a workflow in Argo, I found that the output of the “date” command was totally wrong: # date Wed Mar 3 00:41:27 2021 # TZ='America/Los_Angeles' date Wed Mar 3 00:41:36 2021 #...

Change the schema of BigQuery tables

We can easily add a new column to a table in BigQuery: ALTER TABLE mydataset.mytable ADD COLUMN new_col STRING But when you want to delete or rename an existing column, there is no SQL to implement it....

Accelerate reading of NumPy array from files

In the training process, I need to read array data from a .npy file and take part of it: import numpy as np data = np.load("sample1.npy") sound1 = data[start1: end1] sound2 = data[start2: end2] Since...
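
The excerpt stops before the post's own solution, so the following only illustrates one standard NumPy option for reading a slice without loading the whole file: memory mapping via `mmap_mode`. The file contents and the slice bounds are hypothetical.

```python
import os
import tempfile

import numpy as np

# Write a hypothetical sample file so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample1.npy")
np.save(path, np.arange(1_000_000, dtype=np.float32))

# mmap_mode="r" maps the file read-only instead of reading it all into memory.
data = np.load(path, mmap_mode="r")

start1, end1 = 1000, 2000  # hypothetical slice bounds
# np.array(...) copies just this slice from the memory map into RAM.
sound1 = np.array(data[start1:end1])

print(sound1.shape, sound1[0])
```

Whether this helps depends on how much of each file is actually used. If nearly the whole array is read anyway, a plain `np.load` may be just as fast.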

Source code reading of LightGBM

Finally I got a few hours to look into the code of LightGBM. I used to have some questions about LightGBM, and now fortunately I can answer some of them by myself. Even though some answers may be wrong, that...

An error about multiprocessing of Python

Our Python program reported errors when running on a new dataset: [77 rows x 4 columns]]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647",)'...
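
The range in that message is the range of a signed 32-bit integer, which is what the "i" format of the standard `struct` module packs. The sketch below reproduces the same kind of error in isolation. How the oversized number arises inside the post's multiprocessing program is not visible in the excerpt, so that part is left out.

```python
import struct

INT32_MAX = 2**31 - 1

# The largest value that fits the signed 32-bit "i" format packs fine.
packed = struct.pack("i", INT32_MAX)

# One past the limit raises struct.error.
try:
    struct.pack("i", INT32_MAX + 1)
    overflow_error = None
except struct.error as exc:
    overflow_error = str(exc)

print(len(packed), overflow_error)
```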

Be careful when you use “isin()” method in Pandas

import pandas as pd df_excl = pd.DataFrame({"id": ["12345"]}) df = pd.DataFrame({"id": ["12345", "67890"]}) result = df[~df.id.isin(df_excl[["id"]])] print(result) Guess what the result of the above...
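
The difference between passing a DataFrame and a Series to `isin()` can be checked side by side. This sketch reuses the two frames from the snippet above, and the names `with_frame` and `with_series` are introduced here for illustration. The comments describe my understanding of why the two calls differ, so it is worth verifying against the installed pandas version.

```python
import pandas as pd

df_excl = pd.DataFrame({"id": ["12345"]})
df = pd.DataFrame({"id": ["12345", "67890"]})

# Double brackets give a DataFrame; iterating a DataFrame yields its
# column labels, so the ids are compared against the label "id".
with_frame = df[~df.id.isin(df_excl[["id"]])]

# Single brackets give a Series, so the ids are compared against its values.
with_series = df[~df.id.isin(df_excl["id"])]

print(len(with_frame), len(with_series))
```

Passing `df_excl["id"]` (or `df_excl["id"].tolist()`) makes the intent explicit and avoids the silent no-op filter.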

Debug CUDA error for PyTorch

After I changed my dataset for my code, the training failed: /tmp/pip-req-build-_tx3iysr/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:310: operator(): block: [0,0,0], thread: [59,0,0] Assertion...

Take care of the comma (in Python)

Think about the result of this snippet: def concat(a, b): return a + "_" + b left = "hello", right = "world" print(concat(left, right)) Should be “hello_world”, right? But the actual result is an...
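
In the flattened snippet above, the comma after "hello" ends the first assignment, which is what turns `left` into a one-element tuple. A minimal sketch of the effect, with the try/except and the `fixed` variable added here only to make the failure visible without stopping the script:

```python
def concat(a, b):
    return a + "_" + b


left = "hello",   # trailing comma: this is the tuple ("hello",)
right = "world"

print(type(left).__name__)

try:
    result = concat(left, right)
except TypeError as exc:
    # tuple + str is not defined, so the concatenation fails.
    result = None
    print(f"TypeError: {exc}")

# Without the comma the call behaves as intended.
fixed = concat("hello", right)
print(fixed)
```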

Migrate Spark job to BigQuery

I have just finished migrating a Spark job to BigQuery, or more precisely, migrating Python code to SQL. It was tedious work, but it improved performance significantly: from 4 hours runtime of...

Trace memory error of CUDA program

The program, which uses CUDA for computing on the GPU, reported a memory error: terminate called after throwing an instance of 'std::runtime_error' what(): [CUDA] an illegal memory access was...

Be careful with random number generation

This is the program I have used for a month: import numpy as np np.random.seed(202105) rand = np.random.rand() # business logic code using 'rand' Then I added another np.random.rand() at the head of the...

Recover truncated table in BigQuery

If you accidentally truncate a table in BigQuery, you can try this article to recover the data. Furthermore, I found out that the “bq cp project:dataset.table@-36000 project:dataset.table” method...

Upgrade GKE cluster

Normally, to upgrade a cluster of Google Kubernetes Engine, we need to upgrade the master first, and then the node_pools. For convenience, I just clicked the button “UPGRADE AVAILABLE” in the “Release...
