Connect a Private Cloud SQL to Cloud Data Fusion

Can’t connect to a private Cloud SQL instance from Cloud Data Fusion? Here are 3 reasons why and what you can do to fix it.

Cloud SQL failure from Data Fusion

Data engineers reading this probably know that Google Cloud SQL and Cloud Data Fusion are part of Google’s data and analytics offerings on Google Cloud Platform (GCP). So it should be simple to connect the two into a basic data pipeline, right? Yes, if you don’t mind using public interfaces; no, if both Data Fusion and Cloud SQL are meant to be internal services.

Here are some simple guidelines to ensure that the private versions of Cloud SQL and Data Fusion can work together.

Deploy Cloud SQL and Data Fusion in the same network

Shared VPC vs Standalone VPC: Google Documentation

The first consideration is that Cloud SQL and Data Fusion should reside in the same VPC network. Both services use Google’s private services access, which peers the Google-managed VPCs that run them with ours. This kind of peering does not support transitive connections, so two peers cannot connect to each other over our VPC. The situation gets worse if we are also peering two of our own VPCs together. So, keep them on the same VPC (hint: it can also be a Shared VPC).
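One way to check the "same network" requirement is to compare the network each instance reports. The sketch below is a minimal helper, assuming the two values come back either as a bare network name or as a full resource path; the instance names and region in the commented usage are placeholders, and the helper compares names only, so same-named networks in different projects would still need a manual look.

```shell
# Reduce a network reference to its bare name, for example
# "projects/host/global/networks/vpc-a" becomes "vpc-a".
network_name() {
  printf '%s\n' "${1##*/}"
}

# Succeed when both references end in the same network name.
same_network() {
  [ "$(network_name "$1")" = "$(network_name "$2")" ]
}

# Hypothetical usage (my-sql, my-cdf and us-central1 are placeholders):
#   sql_net=$(gcloud sql instances describe my-sql \
#     --format='value(settings.ipConfiguration.privateNetwork)')
#   cdf_net=$(gcloud beta data-fusion instances describe my-cdf \
#     --location=us-central1 --format='value(networkConfig.network)')
#   same_network "$sql_net" "$cdf_net" && echo "same network name"
```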

Utilize a proxy or TCP forwarding between Cloud SQL and Data Fusion

Diagram showing a proxy between Cloud SQL and Data Fusion
Adding a Proxy between Cloud SQL and Data Fusion

For two private access services to speak to each other, each has to see the connection as coming from within our VPC. This means placing a middleman between the services so that GCP treats the connections as originating in our VPC and not as the result of transitive peering. Here are the two methods that worked for us:

  • Deploy a Cloud SQL Proxy (either in Docker or non-Docker form)
  • Deploy a VM with iptables forwarding

Both solutions achieve the desired effect: because the VM acts as a proxy, connections from Data Fusion to Cloud SQL originate from a VM in our VPC. As an example, TCP forwarding with iptables might be configured like this:

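# Enable kernel IP forwarding so the FORWARD rule can take effect
sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A PREROUTING -p tcp --dport 3306 -j DNAT --to-destination <CloudSQL IP>
sudo iptables -A FORWARD -p tcp -d <CloudSQL IP> --dport 3306 -j ACCEPT
sudo iptables -t nat -A POSTROUTING -j MASQUERADE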

The above forwards traffic received by the VM on port 3306 to Cloud SQL on port 3306 and masquerades the traffic as coming from the IP of the VM.
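For the other option, the Cloud SQL Proxy, a comparable sketch is shown below. It assumes version 2 of the Cloud SQL Auth Proxy (binary name cloud-sql-proxy) running on the proxy VM; the instance connection name is a placeholder, and the helper only prints the command so it can be reviewed before running.

```shell
# Print (rather than run) a Cloud SQL Auth Proxy v2 command line.
# $1: instance connection name in the form PROJECT:REGION:INSTANCE (placeholder)
# $2: port to listen on (defaults to 3306)
proxy_command() {
  # --address 0.0.0.0 listens on all interfaces so Data Fusion can reach the VM;
  # --private-ip makes the proxy connect to the instance's private address.
  printf 'cloud-sql-proxy --address 0.0.0.0 --port %s --private-ip %s\n' "${2:-3306}" "$1"
}

proxy_command "my-project:us-central1:my-instance"
```

Listening on 0.0.0.0 exposes the port to anything the firewall lets through, so the firewall rules on the VM's network still decide who can connect.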

Ensure that firewall rules allow the Proxy to communicate

This is a step that many implementers miss. After setting up the proxy in the step above, the GCP firewall must be configured to allow traffic from the correct sources, on the ports used to communicate with the proxy instance, for this to work.

In our testing, we found that we could not change ports for some of the drivers in Data Fusion. Depending on the driver, you may need to allow port 3306 from the allocated internal IP range for Data Fusion, which is defined in the Private Service Connection tab of your VPC (or of the host VPC if you are using a Shared VPC). This range is commonly named cdf-<data fusion instance name>. If you are using a Data Fusion driver that supports changing ports, allow the port you selected instead.
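When a connection is still blocked, it can help to confirm that the source address seen on the proxy VM really falls inside the allocated Data Fusion range used in the firewall rule. The helper below is a small sketch for that check in plain shell arithmetic; the addresses in the commented usage are made-up examples, not values from any real deployment.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if IPv4 address $1 lies inside CIDR block $2.
ip_in_cidr() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")   # network part before the slash
  bits=${2#*/}                 # prefix length after the slash
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Hypothetical usage with example values:
#   ip_in_cidr 10.1.2.3 10.1.0.0/22 && echo "source is inside the allocated range"
```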

In short, create the following ingress and egress firewall rules:

  • Allow ingress traffic from your Cloud Data Fusion range to talk to the proxy VM
  • Allow egress traffic from your proxy VM to talk to Cloud SQL’s internal IP range

Note: don’t forget to add other rules depending on other use-cases.
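The two rules above could be expressed with gcloud roughly as follows. This is a sketch under assumptions: the rule names, network, network tag and addresses are placeholders, the proxy VM is assumed to carry a network tag, and with a Shared VPC the commands would need to target the host project. The helper prints the commands so they can be reviewed first.

```shell
# Print the gcloud commands for the ingress and egress rules.
# $1: VPC network name            $2: Data Fusion allocated range (CIDR)
# $3: network tag on the proxy VM $4: Cloud SQL private IP
# $5: port (defaults to 3306)
firewall_commands() {
  port=${5:-3306}
  # Ingress: Data Fusion range -> proxy VM
  printf 'gcloud compute firewall-rules create allow-cdf-to-proxy --network=%s --direction=INGRESS --action=ALLOW --rules=tcp:%s --source-ranges=%s --target-tags=%s\n' \
    "$1" "$port" "$2" "$3"
  # Egress: proxy VM -> Cloud SQL private IP
  printf 'gcloud compute firewall-rules create allow-proxy-to-sql --network=%s --direction=EGRESS --action=ALLOW --rules=tcp:%s --destination-ranges=%s/32 --target-tags=%s\n' \
    "$1" "$port" "$4" "$3"
}

# Example values only:
firewall_commands my-vpc 10.1.0.0/22 sql-proxy 10.2.0.5
```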

Conclusion

You can allow private Data Fusion instances to talk to private Cloud SQL instances. However, the answer just isn’t as straightforward as turning them on. Having a proxy in the middle also raises other considerations if the traffic volume is high. For example, do you need high availability? Ensure that the proxy is scaled according to the volume needed, and remember to install your Cloud Monitoring and Logging agents on the proxy so that you get a full end-to-end view of your Cloud SQL and Data Fusion pipeline!

Did this article help you? Reach out to us at marketing@matrixc.com or read up more of our blog posts at https://www.matrixc.com/blog/. We’d be happy to help you out on your GCP journey!
