CUDA Programming – String Matching

CUDA is worked on by NVIDIA to help application programmers to use the untapped potential of GPUs. As GPU have so many cores that are not used most of the time,Developers can take advantage of this situation and make their programs faster in the process.
Introduction to CUDA

Now , String Mathching
The Kernel Code :

__global__ void searchKeywordKernel(int *result, char *data, char *keyword,int datasize)
int i = blockIdx.x*threadIdx.x;
// Detect the first matching character

if (data[i] == keyword[0]) {
// Loop through next keyword character
for (int j=1; j<3; i++) {
if (data[i+j] != keyword[j])
// Store the first matching character to the result list
result[i] = 1;

Looking at the kernel code. By using the blockID , ThreadID we get a unique indexing number to match string.

We also specify the number of BLOCKS and THREADS
We do that by ,

searchKeywordKernel<<>>(dev_result, dev_data, dev_keyword,datasize);

So we get the number of blocks by Dividing the size with 512(As that is the optimal no. of threads in a block)

Ran with CUDA
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

1)nvcc -o string su_2.c

input.text a file it is reading from
and patter to be matched being Sherlock

String Matched at
4 52 34 54 56 65 23 54 453 545
Time Taken 33:534232 seconds

GPU and its powerful thread capabilities is real power that is yet to be exploited, CPU has only 4 cores whereas GPU can have multiple number of codes.


Matrix Multiplication

It always frightens us when we hear the word “Matrix” and the complex computations involved in Multiplying two Prime Numbers.

Lets take a look at the serial code , Brute force approach

Brute force

Brute Force Serial

  • Input: matrices A and B
  • Let C be a new matrix of the appropriate size
  • Pick a tile size T = Θ(√M)
  • For I from 1 to n in steps of T:
    • For J from 1 to p in steps of T:
      • For K from 1 to m in steps of T:
        • Multiply AI:I+T, K:K+T and BK:K+T, J:J+T into CI:I+T, J:J+T, that is:
        • For i from I to min(I + T, n):
          • For j from J to min(J + T, p):
            • Let sum = 0
            • For k from K to min(K + T, m):
              • Set sum ← sum + Aik × Bkj
            • Set CijCij + sum
  • Return C

Is the Algorithm for Brute force is optimal for smaller sizes

Divide and Conquer 
Divide and Conquer Serial

  • Inputs: matrices A of size n × m, B of size m × p.
  • Base case: if max(n, m, p) is below some threshold, use an unrolled version of t
  • If max(n, m, p) = n, split A horizontally:

Now Serial Divide and Conquer- D & C (Serial)


Now Strassen Matrix Multiplication Strassen Serial


Source : WIKIPEDIA for algo


SIMD – Single Instruction Multiple Data

In our Parallel Computing course, We were so eager to dive into Parallelize code but took a step back and looked at how we could improve already present serial code.

SIMD is an approach to do the same instruction on Multiple data points simultaneously. So more work is done with a single process instruction at given moment.


As you can see in the above picture , four scalar operation are converted to one SIMD operation.