Options to Load Data into TensorFlow

The first step before training a machine learning algorithm is to load the data. There is two commons way to load data:

1. Load data into memory: It is the simplest method. You load all your data into memory as a single array. You can write a Python code. This lines of code are unrelated to Tensorflow.

2. Tensorflow data pipeline. Tensorflow has built-in API that helps you to load the data, perform the operation and feed the machine learning algorithm easily. This method works very well especially when you have a large dataset. For instance, image records are known to be enormous and do not fit into memory. The data pipeline manages the memory by itself

What solution to use?

Load data in memory

If your dataset is not too big, i.e., less than 10 gigabytes, you can use the first method. The data can fit into the memory. You can use a famous library called Pandas to import CSV files. You will learn more about pandas in the next tutorial.

Load data with Tensorflow pipeline

The second method works best if you have a large dataset. For instance, if you have a dataset of 50 gigabytes, and your computer has only 16 gigabytes of memory then the machine will crash.

In this situation, you need to build a Tensorflow pipeline. The pipeline will load the data in batch, or small chunk. Each batch will be pushed to the pipeline and be ready for the training. Building a pipeline is an excellent solution because it allows you to use parallel computing. It means Tensorflow will train the model across multiple CPUs. It fosters the computation and permits for training powerful neural network.

You will see in the next tutorials on how to build a significant pipeline to feed your neural network.

In a nutshell, if you have a small dataset, you can load the data in memory with Pandas library.

If you have a large dataset and you want to make use of multiple CPUs, then you will be more comfortable to work with Tensorflow pipeline.

Create Tensorflow pipeline

In the example before, we manually add three values for X_1 and X_2. Now we will see how to load data to Tensorflow.

Step 1) Create the data

First of all, let's use numpy library to generate two random values.

import numpy as np

x_input = np.random.sample((1,2))

print(x_input)

[[0.8835775 0.23766977]]

Step 2: Create the placeholder

Like in the previous example, we create a placeholder with the name X. We need to specify the shape of the tensor explicitly. In case, we will load an array with only two values. We can write the shape as shape=[1,2]

# using a placeholder

x = tf.placeholder(tf.float32, shape=[1,2], name = 'X')

Step 3: Define the dataset method

next, we need to define the Dataset where we can populate the value of the placeholder x. We need to use the method tf.data.Dataset.from_tensor_slices

dataset = tf.data.Dataset.from_tensor_slices(x)

Step 4: Create the pipeline

In step four, we need to initialize the pipeline where the data will flow. We need to create an iterator with make_initializable_iterator. We name it iterator. Then we need to call this iterator to feed the next batch of data, get_next. We name this step get_next. Note that in our example, there is only one batch of data with only two values.

iterator = dataset.make_initializable_iterator()

get_next = iteraror.get_next()

Step 5: Execute the operation

The last step is similar to the previous example. We initiate a session, and we run the operation iterator. We feed the feed_dict with the value generated by numpy. These two value will populate the placeholder x. Then we run get_next to print the result.

with tf.Session() as sess:

    # feed the placeholder with data

    sess.run(iterator.initializer, feed_dict={ x: x_input })

    print(sess.run(get_next)) # output [ 0.52374458  0.71968478]

[0.8835775  0.23766978]

TensorFlow is the most famous deep learning library these recent years. A practitioner using TensorFlow can build any deep learning structure, like CNN, RNN or simple artificial neural network.

TensorFlow is mostly used by academics, startups, and large companies. Google uses TensorFlow in almost all Google daily products including Gmail, Photo and Google Search Engine.

Google Brain team's developed TensorFlow to fill the gap between researchers and products developers. In 2015, they made TensorFlow public; it is rapidly growing in popularity. Nowadays, TensorFlow is the deep learning library with the most repositories on GitHub.

Practitioners use Tensorflow because it is easy to deploy at scale. It is built to work in the cloud or on mobile devices like iOs and Android.

Tensorflow works in a session. Each session is defined by a graph with different computations. A simple example can be to multiply to number. In Tensorflow, three steps are required: