Does it make sense that a Hadoop Sqoop job would need to pull most of the same data every time it runs?


I’m a DBA supporting a database that holds a large volume of call center data. The team I actually support owns the database; they just import new data daily and use this DB mostly for archival purposes.

Another team, which I don’t support directly, is a big data team with access to this database via a Hadoop Sqoop process. Lately they’ve been complaining about how long their process takes to run. Their Sqoop job boils down to a SELECT statement that pulls every record on a table imported since 2019, which is millions of complex records. I’m considering a covering index that might help them (sketch below), but I’m also questioning why they need to transfer all of those records three times a day when the net change is only ~10k new records per day.
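For illustration, this is roughly the kind of covering index I have in mind. All names are hypothetical and I’ve written it in SQL Server syntax; the real schema would obviously differ:

    -- Hypothetical table/column names. A covering index keyed on the
    -- import-date column, INCLUDE-ing the columns their SELECT actually
    -- returns, would let the "imported since 2019" range scan be
    -- satisfied from the index alone instead of the base table.
    CREATE NONCLUSTERED INDEX IX_CallRecords_ImportDate
        ON dbo.CallRecords (ImportDate)
        INCLUDE (CallId, AgentId, CallStart, DurationSeconds, Disposition);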

Does it make sense that Hadoop needs a full transfer of all the data each time it runs? Maybe it does and I’m just ignorant of how Hadoop and big data work, or maybe they aren’t bothering to target or batch what they transfer because it’s easier to just grab everything.
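For what it’s worth, my understanding is that Sqoop has an incremental import mode (--incremental append or --incremental lastmodified, together with --check-column and --last-value) that targets exactly this situation. If I understand it correctly, each run would then reduce to something like the query below instead of the full "everything since 2019" extract. Again, the names and the literal date are placeholders; the actual cutoff would be whatever high-water mark Sqoop recorded on the previous run:

    -- Only rows newer than the previous run's high-water mark,
    -- i.e. roughly the ~10k new records per day, rather than
    -- every record since 2019. Names and the date are placeholders.
    SELECT CallId, AgentId, ImportDate, CallStart, DurationSeconds, Disposition
    FROM dbo.CallRecords
    WHERE ImportDate > '2024-06-01 00:00:00';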