Data Engineers reading this probably know that Cloud SQL and Cloud Data Fusion are part of Google's Data and Analytics offerings on Google Cloud Platform (GCP). So it should be simple to connect the two into a data pipeline, right? Yes, if you don't mind using public interfaces; no, if both Data Fusion and Cloud SQL are meant to be internal services.
Here are some simple guidelines to ensure that the private versions of Cloud SQL and Data Fusion can work together.
Deploy Cloud SQL and Data Fusion in the same network
The first requirement is that Cloud SQL and Data Fusion reside in the same VPC network. Both services use Google's Private Services Access model, which "peers" services running in Google-managed VPCs to ours. A nuance of this kind of peering is that it does not support transitive connections: two peers cannot reach each other across our VPC. The situation gets worse if we are also peering two of our own VPCs together. So, keep both services on the same VPC (hint: it can also be a Shared VPC).
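As a sanity check — assuming the `gcloud` CLI and a network named `my-vpc` (a placeholder) — you can list the peerings on your VPC to confirm that both services connect into the same network:

```shell
# List Private Services Access connections (Cloud SQL appears here
# via servicenetworking.googleapis.com)
gcloud services vpc-peerings list --network=my-vpc

# List all VPC network peerings on the network; the Data Fusion
# tenant-project peering should appear in this output
gcloud compute networks peerings list --network=my-vpc
```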
Utilize a proxy or TCP forwarding between Cloud SQL and Data Fusion
To allow two Private Services Access services to speak to each other, each must believe the connection originates from within our VPC. In other words, there needs to be an intermediary between the services so that GCP sees the connections as coming from our VPC rather than as transitive peering. Here are the two methods that worked for us:
- Deploy a Cloud SQL Proxy (either in docker or non-docker form)
- Deploy a VM with iptables forwarding
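For the first option, a minimal sketch of running the Cloud SQL Auth Proxy in Docker might look like the following. The image tag, project, region, and instance names are placeholders, and the v1 proxy flags shown here differ from the newer v2 proxy CLI — check the proxy documentation for your version:

```shell
# Run the (v1) Cloud SQL Auth Proxy, listening on all interfaces so the
# Data Fusion range can reach it, and connecting to Cloud SQL over its
# private IP
docker run -d --name cloud-sql-proxy \
  -p 0.0.0.0:3306:3306 \
  gcr.io/cloudsql-docker/gce-proxy:1.33.2 \
  /cloud_sql_proxy \
  -instances=my-project:my-region:my-instance=tcp:0.0.0.0:3306 \
  -ip_address_types=PRIVATE
```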
Both solutions create the desired effect: connections to Cloud SQL from Data Fusion originate from a VM in our VPC, because the VM acts as a proxy. For example, TCP forwarding can be configured with iptables as follows:

```shell
# Forward traffic received by the VM on port 3306 to Cloud SQL on port 3306
sudo iptables -t nat -A PREROUTING -p tcp --dport 3306 -j DNAT --to-destination <CloudSQL IP>
# Allow the forwarded traffic through the FORWARD chain
sudo iptables -A FORWARD -p tcp -d <CloudSQL IP> --dport 3306 -j ACCEPT
# Masquerade the traffic so it appears to come from the VM's own IP
sudo iptables -t nat -A POSTROUTING -j MASQUERADE
```
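Note that DNAT/FORWARD rules only take effect if the VM is allowed to forward packets at all. Assuming a Linux VM on GCE, that means enabling kernel IP forwarding (and, depending on your setup, the VM may also need to be created with `--can-ip-forward`) — a sketch:

```shell
# Enable kernel IP forwarding (required for the FORWARD chain to pass traffic)
sudo sysctl -w net.ipv4.ip_forward=1
# Persist the setting across reboots (filename is arbitrary)
echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-forward.conf
```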
Ensure that firewall rules allow the Proxy to communicate
This is a step we find many implementers miss. After setting up the proxy as described above, the GCP firewall needs to be configured to allow traffic from the correct sources, on the ports used to communicate with the proxy instance.
In our testing, we found that we could not change ports for some of the drivers in Data Fusion. So, depending on the driver, you may need to allow port 3306 from the allocated internal IP range for Data Fusion, which is defined in the Private Service Connection tab of your VPC (or of the VPC host project if you are using a Shared VPC). This range is commonly named cdf-<data fusion instance name>. If you are using a Data Fusion driver that supports changing ports, allow the selected port instead.
In short, create the following ingress and egress firewall rules:
- Allow ingress traffic from your Cloud Data Fusion range to talk to the proxy VM
- Allow egress traffic from your proxy VM to talk to Cloud SQL’s internal IP range
Note: don’t forget to add other rules depending on other use-cases.
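As a sketch, the two rules above might be created with `gcloud` like this. The rule names, `my-vpc` network, ranges, and the `sql-proxy` network tag are all placeholders — substitute your own values:

```shell
# Ingress: allow the Data Fusion allocated range to reach the proxy VM on 3306
gcloud compute firewall-rules create allow-cdf-to-proxy \
  --network=my-vpc --direction=INGRESS --action=ALLOW \
  --rules=tcp:3306 \
  --source-ranges=<Data Fusion allocated range> \
  --target-tags=sql-proxy

# Egress: allow the proxy VM to reach Cloud SQL's internal range on 3306
gcloud compute firewall-rules create allow-proxy-to-cloudsql \
  --network=my-vpc --direction=EGRESS --action=ALLOW \
  --rules=tcp:3306 \
  --destination-ranges=<Cloud SQL internal range> \
  --target-tags=sql-proxy
```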
You can allow private Data Fusion instances to talk to private Cloud SQL instances; it just isn't as straightforward as turning them on. Having a proxy in the middle also brings other considerations if the traffic volume is high. For example, do you need High Availability? Ensure that the proxy is scaled according to the expected volume, and remember to install the Cloud Monitoring and Logging agents on the proxy so that you get a full end-to-end view of your Cloud SQL and Data Fusion pipeline!